
Ch 0 Introduction

§0.1 Overview of Information Theory and Coding

Overview

Information theory was founded by Shannon in 1948. The theory addresses the transmission of information over a channel (communication systems) or its recording in a channel (storage systems). The channel can be a wireless or wired channel (communications: copper telephone or fiber-optic cables) or a magnetic or optical disk (storage). Three aspects need to be considered: compression, error detection and correction, and cryptography. Information theory is based on probability theory. A communication or compression procedure includes:

[Block diagram] Sent messages → source → encoder (source coding = compression; channel coding = error detection and correction) → channel (binary symbols 0110101001110…) → decoder (channel decoding; source decoding = decompression) → receiver → received messages. Source coding is characterized by the source entropy; channel coding by the channel capacity.


Digital Communication and Storage Systems

A basic information processing system consists of the blocks shown in the diagram above. Channel: produces a received signal r which differs from the original signal c (the channel introduces noise, channel distortion, etc.). Thus, the decoder can only produce an estimate m' of the original message m.

Goal of processing: information conveyed through (or stored in) the channel must be reproduced at the destination as reliably as possible. At the same time, the system should allow the transmission of as much information as possible per unit time (communication systems) or per unit of storage (storage systems).

Information Source

The source message m consists of a time sequence of symbols emitted by the information source. The source can be a Continuous-time Source, if the message is continuous in time (e.g., a speech waveform), or a Discrete-time Source, if the message is discrete in time (e.g., data sequences from a computer). The symbols emitted by the source can be continuous in amplitude (e.g., a speech waveform) or discrete in amplitude (e.g., text with a finite symbol alphabet). This course is primarily concerned with discrete-time, discrete-amplitude (i.e., digital) sources, as practically all new communication and storage systems fall into this category.

Since information and coding theory build on probability theory, we review probability first.


§0.2 Review of Random Variables and Probability

Probability

Let us consider a single experiment, such as rolling a die, with a number of possible outcomes. The sample space S of the experiment is the set of all possible outcomes.

In the case of a die, S = {1, 2, 3, 4, 5, 6}, with each integer representing the number of dots on one of the six faces of the die. An Event A is a subset of S, e.g., A = {2, 4}. The Complement of A, denoted Ā, is the set of outcomes in S that are not in A.

Example 0.1: For S and A defined above, find Ā. Two events are said to be Mutually Exclusive if they have no sample points in common (for example, A = {2, 4} and B = {1, 3, 5}). The Union of two events is A ∪ B; the Intersection of two events is A ∩ B.

Associated with each event A contained in S is its Probability, denoted by P(A). This has the following properties: 0 ≤ P(A) ≤ 1 and P(S) = 1. For mutually exclusive events, A_i ∩ A_j = ∅ for i ≠ j, the probability of the union is

P(∪_i A_i) = Σ_i P(A_i).

Example 0.2: if A = {2, 4}, find P(A).


Joint Event and Joint Probability

Instead of dealing with a single experiment, let us perform two experiments and consider their outcomes. For example, the two experiments can be two separate tosses of a single die or a single toss of two dice. The sample space S consists of the 36 two-tuples (i, j), where i, j = 1, ..., 6. Each point in the sample space is assigned the probability 1/36.

Let us denote by A_i, i = 1, ..., n, the outcomes of the first experiment, and by B_j, j = 1, ..., m, the outcomes of the second experiment. Assuming that the outcomes B_j, j = 1, ..., m, are mutually exclusive and ∪_j B_j = S, it follows that

P(A_i) = Σ_{j=1}^{m} P(A_i, B_j).

If A_i, i = 1, ..., n, are mutually exclusive and ∪_i A_i = S, then

P(B_j) = Σ_{i=1}^{n} P(A_i, B_j).

In addition, Σ_i Σ_j P(A_i, B_j) = 1.

Conditional Probability

A joint event (A, B) occurs with the probability P(A, B), which can be expressed as

P(A, B) = P(A|B) P(B) = P(B|A) P(A),

where P(A|B) and P(B|A) are conditional probabilities.

Example 0.3: Let us assume that we toss a die. The events are A = {1, 2, 3} and B = {1, 3, 6}; find P(B|A).


A conditional probability is P(A|B) = P(A, B)/P(B). Let A and B be two events in a single experiment. If these are mutually exclusive (A ∩ B = ∅), then P(A|B) = 0. If B ⊂ A, then P(A|B) = 1.

The Bayes Theorem: P(A|B) = P(B|A) P(A) / P(B). If A_i, i = 1, ..., n, are mutually exclusive and ∪_{i=1}^{n} A_i = S, then

P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^{n} P(B|A_j) P(A_j).

Statistical Independence: Let P(A|B) be the probability of occurrence of A given that B has occurred. Suppose that the occurrence of A does not depend on the occurrence of B. Then P(A|B) = P(A), and therefore P(A, B) = P(A) P(B).

Example 0.4: Two successive experiments in tossing a die.

A = {2, 4, 6}, P(A) = 1/2: even-numbered sample points in the first toss.
B = {2, 4, 6}, P(B) = 1/2: even-numbered sample points in the second toss.

Determine the probability of the joint event "even-numbered outcome on the first toss (A)" and "even-numbered outcome on the second toss (B)", P(A, B).


Random Variables

Given a sample space S with elements s ∈ S, X(s) is a Random Variable defined on S. For example, X(s) = s for the die experiment.

Probability Mass Function (PMF):

p_X(x_i) = P(X = x_i), i = 1, ..., M, with Σ_{i=1}^{M} p_X(x_i) = 1,

and p_X(x) = P(X = x_i) for x = x_i, and 0 otherwise.

Definition: The Mean of the random variable X is E[X] = Σ_i x_i p_X(x_i).

Example 0.5: S = {1, 2, 3, 4, 5, 6}, X(s) = s; find E(X).

Useful Distributions

Let X be a discrete random variable that has two possible values, say X = 1 or X = 0, with probabilities p and 1 − p, respectively. This is the Bernoulli distribution, and the PMF can be represented as given in the figure. The mean of such a random variable is E[X] = p. The performance of a fixed number of independent trials, each with the same fixed probability of success, is known as a sequence of Bernoulli trials.

[Figures: PMF of X for Example 0.5, p_X(x) = 1/6 for x = 1, 2, ..., 6; and PMF of a Bernoulli random variable, p_X(0) = 1 − p, p_X(1) = p.]


Let X_i, i = 1, ..., n, be statistically independent and identically distributed random variables with a Bernoulli distribution, and let us define a new random variable

Y = Σ_{i=1}^{n} X_i.

This random variable takes values from 0 to n. The associated probabilities can be expressed as P(Y = 0) = (1 − p)^n and, more generally,

P(Y = k) = C(n, k) p^k (1 − p)^(n−k),  k = 0, 1, ..., n,

where

C(n, k) = n! / (k! (n − k)!)

is the binomial coefficient. This represents the probability of having k successes in n Bernoulli trials. This is the binomial distribution (see the www.mathworld.com website). The mean of a random variable with a binomial distribution is E[Y] = np.

Definitions:
1. The Mean of a function of the random variable X, g(X), is defined as E[g(X)] = Σ_i g(x_i) p_X(x_i).
2. The Variance of the random variable X is defined as Var(X) = E[(X − E[X])^2] = E[X^2] − (E[X])^2. Example: calculate the variance for the random variable defined in Example 0.5, whose mean is 21/6.
3. The Variance of a function of the random variable X, g(X), is defined as Var(g(X)) = E[(g(X) − E[g(X)])^2].
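As a quick numeric check, here is a small Python sketch (not part of the original notes) that verifies the mean and variance of the die in Example 0.5 and the binomial mean np; the values n = 10, p = 0.3 are arbitrary illustration choices.

```python
# A minimal sketch: mean and variance of the fair die, plus the binomial mean.
from math import comb

values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}

mean = sum(x * p for x, p in pmf.items())               # E[X] = 21/6 = 3.5
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # ≈ 2.9167

# Binomial: Y = X_1 + ... + X_n with X_i ~ Bernoulli(p)
n, p = 10, 0.3
binom_pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean_Y = sum(k * q for k, q in enumerate(binom_pmf))    # equals n*p = 3.0

print(mean, var, mean_Y)
```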


Ch 1 Discrete Source and Entropy

§1.1 Discrete Sources and Entropy

1.1.1 Source Alphabets and Entropy

Overview

Information theory is based on probability theory, as the term information carries with it a connotation of UNPREDICTABILITY (SURPRISE) in the transmitted signal.

The Information Source is defined by :

- The set of output symbols

- The probability rules which govern the emission of these symbols.

Finite-Discrete Source: finite number of unique symbols.

The symbol set is called the Source Alphabet.

Definition

A is a source alphabet with M possible symbols, A = {a_0, a_1, ..., a_{M−1}}. We can say that the emitted symbol is a random variable which takes values in A. The number of elements in a set is called its Cardinality, e.g., |A| = M.

The source output symbols can be denoted as s_0, s_1, ..., s_t, ..., where s_t ∈ A is the symbol emitted by the source at time t. Note that here t is an integer time index.

Stationary Source: the set of probabilities is not a function of time. It means that, at any given time moment, the probability that the source emits a_m is Pr(a_m) = p_m. Probability mass function: P_A = {p_0, p_1, ..., p_{M−1}}. Since the source emits only members of its alphabet, Σ_{m=0}^{M−1} p_m = 1.


Information Sources Classification

Stationary Versus Non-Stationary Source:

For a Stationary Source the set of probabilities is not a function of time, whereas for a

Non-stationary Source it is.

Synchronous Source Versus Asynchronous Source:

A Synchronous Source emits a new symbol at a fixed time interval, Ts, whereas for an

Asynchronous Source the interval between emitted symbols is not fixed.

The latter can be approximated as synchronous, by defining a null character when the

source does not emit at time t. We say the source emits a null character at time t.

Representation of the Source Symbols

The symbols emitted by the source must be represented somehow. In digital systems, the

binary representation is used.

Pop Quiz: How many bits are required to represent the symbols 1, 2, 3, 4? Or, in general, a set of n symbols 1, 2, 3, …, n?

Answer: 2 bits for 4 symbols; in general, ⌈log2(n)⌉ bits.

The symbols represented in this fashion are referred to as Source Data.

Distinction between Data and Information

For example: an information source has an alphabet with only 1 symbol. The representation of this symbol is data, but this data is not information, as it is completely uninformative. Since information carries the connotation of uncertainty, the information content of this source is zero.

Question: how can one measure the information content of a source?

Answer: by its entropy, defined below.


Entropy of a Source

Example: Pick a marble from a bag of 2 blue and 5 red marbles.

Probability of picking a red marble: p_red = 5/7.

Number of choices for each red picked: 1/p_red = 7/5 = 1.4.

Each transmitted symbol 1 is just one choice out of 1/p_1 many possible choices, and therefore symbol 1 contains log2(1/p_1) bits of information (since 1/p_1 = 2^{log2(1/p_1)}). Similarly, symbol k contains log2(1/p_k) bits of information.

The average information bits per symbol for our source is the Entropy; it is calculated by

H(A) = Σ_k p_k log2(1/p_k) = −Σ_k p_k log2(p_k).

Shannon gave this precise mathematical definition of the average amount of information conveyed per source symbol, used to measure the information content of a source.

Unit of measure (entropy): bits per symbol (when the logarithm is taken to base 2).

Range of entropy: 0 ≤ H(A) ≤ log2(M), where M is the cardinality of the source A; when p_m = 1/M for m = 1, ..., M (i.e., equal probabilities), H(A) takes the maximum.


Example 1.1: What is the entropy of a 4-ary source having symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}?

Example 1.2: If A = {0, 1} with probabilities P_A = {1 − p, p}, where 0 ≤ p ≤ 1, determine the range of H(A).

Example 1.3: For an M-ary source, what distribution of probabilities P(A) maximizes the information entropy H(A)?


The Information Efficiency of the Source is measured as the ratio of the entropy of the source to the (average) number of binary digits used to represent the source data.

Example 1.4: A 4-ary source A = {00, 01, 10, 11} has symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}. What is the efficiency of the source?
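As a quick check, here is a small Python sketch (not part of the original notes) for Examples 1.1 and 1.4: it computes H(A) for P_A = {0.5, 0.3, 0.15, 0.05} and the efficiency when each symbol is represented with 2 bits.

```python
# Entropy of the 4-ary source and the efficiency of its fixed-length representation.
from math import log2

P_A = [0.5, 0.3, 0.15, 0.05]

H = -sum(p * log2(p) for p in P_A)   # ≈ 1.6477 bits/symbol
bits_used = 2                        # fixed-length representation {00, 01, 10, 11}
efficiency = H / bits_used           # ≈ 0.824

print(H, efficiency)
```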

When the entropy of the source is lower than the (average) number of bits used to represent the source data, an efficient coding scheme can be used to encode the source information using, on average, fewer binary digits. This is called Data Compression, and the encoder used for it is called a Source Encoder.

1.1.2 Joint and Conditional Entropy

If we have two information sources A and B, and we want to make a compound symbol C with c_ij = (a_i, b_j), find H(C).


i) If A and B are statistically independent: H(C) = H(A) + H(B).

ii) If B depends on A: H(C) = H(A, B) = H(A) + H(B|A).

Example 1.5: We often use a parity bit for error detection. For a 4-ary information source A = {0, 1, 2, 3} with P_A = {0.25, 0.25, 0.25, 0.25}, and the parity generator with B = {0, 1} and

b_j = 0 if a = 0 or 1,  b_j = 1 if a = 2 or 3,  where j = 1, 2,

find H(A), H(B), and H(A, B).
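A small Python sketch (my own, not from the notes) for Example 1.5; it assumes the reconstructed parity rule b = 0 for a in {0, 1} and b = 1 for a in {2, 3}.

```python
# Since B is a deterministic function of A, H(A,B) = H(A).
from math import log2
from collections import defaultdict

P_A = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
parity = lambda a: 0 if a in (0, 1) else 1   # assumed parity rule

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

P_B = defaultdict(float)
P_AB = defaultdict(float)
for a, p in P_A.items():
    P_B[parity(a)] += p
    P_AB[(a, parity(a))] += p

print(H(P_A), H(P_B), H(P_AB))   # 2.0, 1.0, 2.0 bits
```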


1.1.3 Entropy of Symbol Blocks and the Chain Rule

We want to find H(A_0, A_1, ..., A_{n−1}), where A_t (t = 0, 1, ..., n − 1) is the symbol at time index t, drawn from alphabet A. The chain rule gives

H(A_0, A_1, ..., A_{n−1}) = H(A_0) + H(A_1|A_0) + ... + H(A_{n−1}|A_0, ..., A_{n−2}).

Example 1.5: Suppose a memoryless source with A = {0, 1} having equal probabilities emits a sequence of 6 symbols. Following the 6th symbol, suppose a 7th symbol is transmitted which is the sum modulo 2 of the six previous symbols (this is just the exclusive-or of the symbols emitted by A). What is the entropy of the 7-symbol sequence?


Example 1.6: For an information source having alphabet A with |A| symbols, what is the

range of entropies possible?

§1.2 Source Coding

1.2.1 Mapping Functions and Efficiency

For an inefficient information source, i.e. H(A) < log2(|A|), the communication system

can be made more cost effective through source coding.

[Diagram] Information source sequence s_0, s_1, …, with s_t ∈ A (source alphabet) → Source Encoder → code words s'_0, s'_1, …, with s'_t ∈ B (code alphabet).


In its simplest form, the encoder can be viewed as a mapping of the source alphabet A to

a code alphabet B, i.e., C: A→B. Since the encoded sequence must be decoded at the

receiver end, the mapping function C must be invertible.

Goal of coding: average information bits/symbol ~ average bits we use to represent a

symbol (i.e. code efficiency ~ 1).

Example 1.7: Let A be a 4-ary source with symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}, and let C be an encoder which maps the symbols in A into strings of binary bits, as below:

p_0 = 0.5,  C(a_0) = 0
p_1 = 0.3,  C(a_1) = 10
p_2 = 0.15, C(a_2) = 110
p_3 = 0.05, C(a_3) = 111

Determine the average number of transmitted binary digits per code word and the efficiency of the encoder.
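The following sketch (not from the notes) computes the two quantities asked for in Example 1.7.

```python
# Average code-word length and encoder efficiency H(A) / L_bar for Example 1.7.
from math import log2

probs = [0.5, 0.3, 0.15, 0.05]
codes = ["0", "10", "110", "111"]

H = -sum(p * log2(p) for p in probs)                   # ≈ 1.6477 bits
L_bar = sum(p * len(c) for p, c in zip(probs, codes))  # 1.7 binary digits per code word
print(L_bar, H / L_bar)                                # efficiency ≈ 0.969
```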

Example 1.8: Let C be an encoder grouping the symbols in A into ordered pairs <a_i, a_j>; the set of all possible pairs <a_i, a_j> is called the Cartesian product of the set A and is denoted A × A. Thus, the encoder is C: A × A → B, or C(<a_i, a_j>) = b. Now let A be the 4-ary memoryless source with symbol probabilities given in Example 1.7; determine the average number of transmitted binary digits per code word and the efficiency of the encoder. The code words are shown in the table following.


<ai,aj>  Pr<ai,aj>  bm        <ai,aj>  Pr<ai,aj>  bm
a0,a0    0.25       00        a2,a0    0.075      1101
a0,a1    0.15       100       a2,a1    0.045      0111
a0,a2    0.075      1100      a2,a2    0.0225     111110
a0,a3    0.025      11100     a2,a3    0.0075     1111110
a1,a0    0.15       101       a3,a0    0.025      11101
a1,a1    0.09       010       a3,a1    0.015      111101
a1,a2    0.045      0110      a3,a2    0.0075     11111110
a1,a3    0.015      111100    a3,a3    0.0025     11111111
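A companion sketch (not from the notes) for Example 1.8, using the code-word lengths from the table above and the memoryless pair probabilities Pr<a_i, a_j> = p_i p_j.

```python
# Average bits per pair (and per symbol) for the pair code of Example 1.8.
from math import log2

p = [0.5, 0.3, 0.15, 0.05]
code_len = {                      # lengths of b_m from the table, indexed by (i, j)
    (0, 0): 2, (0, 1): 3, (0, 2): 4, (0, 3): 5,
    (1, 0): 3, (1, 1): 3, (1, 2): 4, (1, 3): 6,
    (2, 0): 4, (2, 1): 4, (2, 2): 6, (2, 3): 7,
    (3, 0): 5, (3, 1): 6, (3, 2): 8, (3, 3): 8,
}

L_pair = sum(p[i] * p[j] * l for (i, j), l in code_len.items())   # ≈ 3.33 bits/pair
H = -sum(q * log2(q) for q in p)                                   # entropy per symbol
print(L_pair / 2, 2 * H / L_pair)   # ≈ 1.66 bits/symbol, efficiency ≈ 0.99
```

Grouping symbols into pairs should raise the efficiency above the single-symbol code of Example 1.7.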

1.2.2 Mutual Information

If we have source set A and code set B, what is the entropy relationship between them?

[Diagrams]
i) Each source symbol a ∈ A maps to a single code symbol b ∈ B.
ii) Two source symbols a_i and a_j map to the same code symbol b.
iii) One source symbol a_i maps to two code symbols b_i and b_j.

1.2.3 Data Compression

Why Data Compression ?

Whenever space is a concern, you would like to use data compression, for example when sending text files over a modem or the Internet. If the files are smaller, they reach the destination faster. All media, such as text, audio, graphics or video, have "redundancy".

Compression attempts to eliminate this redundancy.

Example of Redundancy: If the representation of a media captures content that is not

perceivable by humans, then removing such content will not affect the quality of the

content. For example, capturing audio frequencies outside the human hearing range can

be avoided without any harm to the audio’s quality.

[Diagram] Original message A → ENCODER → compressed message B → DECODER → decompressed message A'.

Lossless Compression:

Lossy Compression:


Lossless and lossy compression are terms that describe whether or not, in the

compression of the message, all original data can be recovered when decompression is

performed.

Lossless Compression

- Every single bit of data originally transmitted remains after decompression.

After decompression, all the information is completely restored.

- One can use lossless compression whenever space is a concern, but the

information must be the same.

In other words, when a file is compressed, it takes up less space, but when it is

decompressed, it still has the same information.

- The idea is to get rid of redundancy in the information.

- Standards: ZIP, GZIP, UNIX Compress, GIF

Lossy Compression

- Certain information is permanently eliminated from the original message,

especially redundant information.

- When the message is decompressed, only a part of the original information is still

there (although the user may not notice it).

- Lossy compression is generally used for video and sound, where a certain amount

of information loss will not be detected by most users.

- Standards: JPEG (still), MPEG (audio and video), MP3 (MPEG-1, Layer 3)

Lossless Compression

When we encode characters in computers, we assign each an 8-bit code based on

(extended) ASCII chart. (Extended) ASCII: fixed 8 bits per character

For example: for "hello there!", 12 characters × 8 bits = 96 bits are needed.

Question: Can one encode this message using fewer bits?

Answer: Yes. In general, in most files some characters appear more often than others. So it makes sense to assign shorter codes to characters that appear more often, and longer codes to characters that appear less often. This is exactly what C. Shannon and R. M. Fano were thinking when they created the first compression algorithm in 1950.


Kraft Inequality Theorem

Prefix Code (or Instantaneously Decodable Code): A code that has the property of

being self-punctuating. Punctuating means dividing a string of symbols into words. Thus,

a prefix code has punctuating built into the structure (rather than adding in using special

punctuating symbols). This is designed in a way that no code word is a prefix of any

other (longer) code word. It is also a data compression code.

To construct an instantaneously decodable code of minimum average length (for a source A, or a given random variable a with values drawn from the source alphabet), the code must satisfy the Kraft Inequality:

For an instantaneously decodable code B for a source A, the code lengths {l_i} must satisfy the inequality

Σ_i 2^(−l_i) ≤ 1.

Conversely, if the code word lengths satisfy this inequality, then there exists an instantaneously decodable code with these word lengths.
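A one-line numerical check of the Kraft inequality (a sketch, not from the notes), using the code lengths of Examples 1.7/1.9.

```python
# If the Kraft sum is at most 1, an instantaneously decodable code with these lengths exists.
lengths = [1, 2, 3, 3]
kraft_sum = sum(2 ** (-l) for l in lengths)
print(kraft_sum, kraft_sum <= 1)   # 1.0, True
```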

Shannon-Fano Theorem

The KRAFT INEQUALITY tells us when an instantaneously decodable code exists. But we are interested in finding the optimal code, i.e., the one that maximizes the efficiency, or minimizes the average code length, L̄ = Σ_i p_i l_i. The average code length of the code B for the source A (with a as a random variable with values drawn from the source alphabet with probabilities {p_i}) is minimized if the code lengths {l_i} are given by

l_i = log2(1/p_i) = −log2(p_i).

This quantity is called the Shannon Information (pointwise).

Example 1.9: Consider the following random variable a, with the optimal code lengths given by the Shannon information. Calculate the average code length.

a:    a0    a1    a2    a3
p_i:  1/2   1/4   1/8   1/8
l_i:  1     2     3     3

The average code length of the optimal code is L̄ = 1(1/2) + 2(1/4) + 3(1/8) + 3(1/8) = 1.75 bits.


Note that this is the same as the entropy of A, H(A).

Lower Bound on the Average Length

The observation about the relation between the entropy and the expected length of the optimal code can be generalized. Let B be an instantaneous code for the source A. Then the average code length is bounded by L̄ ≥ H(A).

Upper Bound on the Average Length

Let B be a code with optimal code lengths, i.e., l_i = ⌈log2(1/p_i)⌉. Then the average length is bounded by L̄ < H(A) + 1.

Why is the upper bound H(A)+1 and not H(A)? Because sometimes the Shannon

information gives us fractional lengths, and we have to round them up.

Example 1.10: Consider the following random variable a, with the optimal code lengths given by the Shannon information. Determine the average code length bounds.

a:    a0    a1    a2    a3    a4
p_i:  0.25  0.25  0.20  0.15  0.15
l_i:  2.0   2.0   2.3   2.7   2.7

The entropy of the source A is H(A) = 2.2855 bits.

The source coding theorem tells us: 2.2855 ≤ L̄ < 3.2855, where L̄ is the code length of the optimal code.

Example 1.11: For the source in Ex. 1.10, the following code tries to make the code words with lengths as close to the optimal code lengths as possible; find the average code length.

a:    a0   a1   a2   a3    a4
b:    00   10   11   010   011
l_i:  2    2    2    3     3

The average code length for this code is L̄ = 2(0.25 + 0.25 + 0.20) + 3(0.15 + 0.15) = 2.3 bits.

This is very close to the optimal code length of H(A)=2.2855.

Summary

i) The motivation for data compression is to reduce the space allocated for data (increasing the efficiency of the source). It is obtained by reducing the redundancy which exists in the data.

ii) Compression can be lossless or lossy. In the former case, all information is completely

restored after decompression, whereas in the latter case it is not (used in applications in

which the information loss will not be detected by most users).

iii) The optimal code, which ensures a maximum efficiency for the source, is

characterized by the lengths of the code words given by the Shannon information, . p2log i

iv) According to the source coding theorem, the average length of the optimal code is

bounded by entropy as

v) The coding schemes for data compression include Huffman, Lempel-Ziv, Arithmetic

coding.

§1.3 Huffman Coding

Remarks

Huffman coding is used in data communications, speech coding, video compression.

Each symbol is assigned a variable-length code that depends on its frequency (probability

of occurrence). The higher the frequency, the shorter the code word. It is a variable-

length code. The number of bits for each code word is an integer (requires an integer

number of coded bits to represent an integer number of source symbols). It is a Prefix

Code (instantaneously decodable).

Encoder – Tree Building Algorithm

Huffman code words are generated by building a Huffman tree:

Step 1 : List the source symbols in a column in descending order of probabilities.


Step 2: Begin by combining the two lowest-probability symbols. The combination of the two symbols forms a new compound symbol, or a branch in the tree. This step is repeated using the two lowest-probability symbols from the new set of symbols, and continues until all the original symbols have been combined into a single compound symbol.

Step 3: A tree is formed, with the top and bottom stems going from each compound symbol to the symbols which form it, labeled with 0 and 1, respectively (or the other way around). Code words are assigned by reading the labels of the tree stems from right to left, back to the original symbol.

Example 1.12: Let the alphabet of the source A be {a0, a1, a2, a3}, and the probabilities

of emitting these symbols be {0.50 0.30 0.15 0.05}. Draw the Huffman tree and find the

Huffman codes.

STEP 1 STEP 2 STEP 3

Probability Symbol

0.50 a0

0.30 a1

0.15 a2

0.05 a3

Symbol Code Words

a0

a1

a2

a3
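A compact Huffman-tree builder (a sketch using Python's heapq, not the notes' own procedure); for the probabilities of Example 1.12 it reproduces the code lengths 1, 2, 3, 3, although the particular 0/1 labels depend on arbitrary tie-breaking.

```python
# Build Huffman codes by repeatedly merging the two lowest-probability entries.
import heapq
from itertools import count

def huffman(probs):
    tick = count()                       # tie-breaker so heap tuples always compare
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two lowest-probability (compound) symbols
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

codes = huffman({"a0": 0.50, "a1": 0.30, "a2": 0.15, "a3": 0.05})
print(codes)   # code lengths 1, 2, 3, 3; exact bit patterns depend on tie-breaking
```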

Hardware implementation of encoding and decoding.


How are the Probabilities Known?

Counting symbols in input string:
- data must be given in advance; requires an extra pass on the input string.

Data source's distribution is known:
- data not necessarily known in advance, but we know its distribution. Reasonable care must be taken in estimating the probabilities, since large errors lead to serious loss in optimality. For example, a Huffman code designed for English text can have a serious loss in optimality when used for French.

More Remarks

For Huffman coding, the alphabet and its distribution must be known in advance. It

achieves entropy when occurrence probabilities are negative powers of 2 (optimal code).

The Huffman code is not unique (because of some arbitrary decisions in the tree construction). Given the Huffman tree, it is easy (and fast) to encode and decode. In general, the efficiency of Huffman coding relies on having a source alphabet A with a fairly large number of symbols. Compound symbols are obtained based on the original symbols (see, e.g., A × A). For a compound symbol formed with n symbols, the alphabet is A^n, and the set of probabilities of the compound symbols is denoted by P_{A^n}.

Question: How does one get P_{A^n}?

Answer: Easy for a memoryless source. Difficult for a source with memory!

§1.4 Lempel-Ziv (LZ) Coding

Remarks

LZ coding does not require the knowledge of the symbol probabilities beforehand. It is a

particular class of dictionary codes. They are compression codes that dynamically

construct their own coding and decoding tables by looking at the data stream itself.

In simple Huffman coding, the dependency between the symbols is ignored, while in LZ,

these dependencies are identified and exploited to perform better encoding. When all the

data is known (alphabet, probabilities, no dependencies), it’s best to use Huffman (LZ

will try to find dependencies which are not there…)

This is the compression algorithm used in most PCs. Because extra information is supplied to the receiver, these codes initially "expand" the data. The secret is that most of the code words represent strings of source symbols. In a long message, it is more economical to encode these strings (which can be of variable length) than it is to encode individual symbols.


Definitions related to the Structure of the Dictionary

Each entry in the dictionary has an address, m. Each entry is an ordered pair, <n, a_i>. The former (n) is a pointer to another location in the dictionary; it is also the transmitted code word. a_i is a symbol drawn from the source alphabet A. A fixed-length binary word of b bits is used to represent the transmitted code word. The number of entries will be less than or equal to 2^b. The total number of entries will exceed the number of symbols, M, in the source alphabet, so each transmitted code word contains more bits than it would take to represent the alphabet A.

Question: Why do we use LZ coding if the code word has more bits?

Answer: Because most of these code words represent STRINGS of source symbols rather than single symbols.

Encoder

A Linked-List Algorithm (simplified for illustration purposes) is used; it includes:

Step 1: Initialization

The algorithm is initialized by constructing the first M +1 (null symbol plus M source

symbols) entries in the dictionary, as follows.

Address (m)   Dictionary Entry <n, a_i>
0             <0, null>
1             <0, a_0>
2             <0, a_1>
…             …
M             <0, a_{M−1}>

Note: The 0-address entry in the dictionary is a null symbol. It is used to let the decoder know where the end of a string is; in a way, this entry is a punctuation mark. The pointers n in these first M + 1 entries are zero, meaning that they point to the null entry at address 0 at the beginning.

The initialization also sets the pointer variable to zero (n = 0) and the address pointer to M + 1 (m = M + 1). The address pointer points to the next "blank" location in the dictionary.

Iteratively executed:

Step 2: Fetch next source symbol.

Step 3:

If

the ordered pair <n, a> is already in the dictionary, then

n = dictionary address of entry <n, a>

Else

transmit n

create new dictionary entry <n, a> at dictionary address m

m = m+1

n = dictionary address of entry <0, a>

Step 4:

Return to Step 2.
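The linked-list encoder above can be sketched in Python as follows (my own rendering of Steps 1 to 4, not code from the notes); run on the sequence of Example 1.13 it reproduces the transmitted code words and the dictionary listed below.

```python
# LZ (linked-list) encoder: Step 1 initialization, then Steps 2-4 iterated.
def lz_encode(symbols, alphabet):
    dictionary = {(0, None): 0}            # address 0: the null symbol
    for i, a in enumerate(alphabet, start=1):
        dictionary[(0, a)] = i             # addresses 1..M: <0, a_i>
    n, m, transmitted = 0, len(alphabet) + 1, []

    for a in symbols:                      # Step 2: fetch next source symbol
        if (n, a) in dictionary:           # Step 3: the string extension is known
            n = dictionary[(n, a)]
        else:                              # string + symbol not seen yet
            transmitted.append(n)          # transmit pointer to the known string
            dictionary[(n, a)] = m         # new entry <n, a> at address m
            m += 1
            n = dictionary[(0, a)]         # restart from the root symbol a
    return transmitted, dictionary

sequence = "11000101100101110001111"       # Example 1.13
codes, d = lz_encode(sequence, ["0", "1"])
print(codes)   # expected: [2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6]
```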

Example 1.13: A binary information source emits the sequence of symbols 110 001 011

001 011 100 011 11 etc. Construct the encoding dictionary and determine the sequence of

transmitted code symbols.

Initialize:

Source symbol   Present n   Present m   Transmit   Next n   Dictionary entry
1
1
0
0


0 1 6 5

1 5 6 5 2 5,1

0 2 7 4

1 4 7 4 2 4,1

1 2 8 3

0 3 8 3 1 3,0

0 1 9 5

1 5 9 6

0 6 9 6 1 6,0

1 1 10 1 2 1,1

1 2 11 3

1 3 11 3 2 3,1

0 2 12 4

0 4 12 4 1 4,0

0 1 13 5

1 5 13 6

1 6 13 6 2 6,1

1 2 14 3

1 3 14 11

Thus, the encoder's dictionary is:

Dictionary address Dictionary entry

0 0, null

1 0, 0

2 0, 1

3

4

5

6 5, 1

7 4, 1


8 3, 0

9 6, 0

10 1, 1

11 3, 1

12 4, 0

13 6, 1

14 No entry yet

Decoder

The decoder at the receiver must also construct an identical dictionary for decoding.

Moreover, reception of any code word means that a new dictionary entry must be

constructed. Pointer n for this new dictionary entry is the same as the received code word.

Source symbol a for this entry is not yet known, since it is the root symbol of the next

string (which has not been transmitted by the encoder).

If the address of the next dictionary entry is m, we see that the decoder can only construct

a partial entry <n, ?>, since it must await the next received code word to find the root

symbol a for this entry. It can, however, fill in the missing symbol a in its previous

dictionary entry, at address m -1. It can also decode the source symbol string associated

with the received code word n.

Example 1.14: Decode the received code words transmitted in Example 1.13.

We know the received code words are 2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6.

Address (m) n (pointer) ai (symbol) Decoded bits

0

1

2

3

4


5

6

7

8

9

… … … …

§1.5 Arithmetic Coding

Remarks

Arithmetic coding assigns one (normally long) code word to the entire input stream. It reads the input stream symbol by symbol, appending more bits to the code word each time. The code word is a number obtained based on the symbol probabilities. The symbol probabilities need to be known. It encodes symbols using a non-integer number of bits (on average), which results in a very good efficiency of the encoder (it allows the encoder to approach the entropy lower bound). It is often used for data compression in image processing.

Encoder

Construct a code interval (rather than a code number), which uniquely describes a block

of successive source symbols. Any convenient b within this range is a suitable code word,

representing the entire block of symbols.

Algorithm:

, [ , )

0, 0, 1i ii i l h

j j

a A I S S

j L H

32

, use ai's Ii=[Sli,Shi) to update

Select a number b that fall in the final interval as the code word.

1

i

1

Next read

1

Until all a hav

+

e been encoded

.

i

i

l

h

j

jj

a

S

H L S

j

L

j

REPEAT

-

+ i

j

j

jH L

L

Example 1.15: For a 4-ary source A = {a_0, a_1, a_2, a_3} with P_A = {0.5, 0.3, 0.15, 0.05}, assign to each a_i ∈ A a fraction I_i of the real number interval [0, 1):

a_0: I_0 = [0, 0.5);  a_1: I_1 = [0.5, 0.8);  a_2: I_2 = [0.8, 0.95);  a_3: I_3 = [0.95, 1).

Encode the sequence a_1 a_0 a_0 a_3 a_2 with arithmetic coding.

j ai Lj Hj ∆ Lj+1 Hj+1

0

1

2 a0 0.5 0.65 0.15 0.5 0.575

3 a3 0.5 0.575 0.075 0.57125 0.575

4 a2 0.57125 0.575 0.00375 0.57425 0.5748125
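A direct transcription of the interval-update loop into Python (a sketch, not from the notes), using the intervals and the sequence of Example 1.15; it reproduces the rows of the table above.

```python
# Arithmetic encoding: shrink [L, H) according to each symbol's interval.
intervals = {"a0": (0.0, 0.5), "a1": (0.5, 0.8), "a2": (0.8, 0.95), "a3": (0.95, 1.0)}
sequence = ["a1", "a0", "a0", "a3", "a2"]

L, H = 0.0, 1.0
for a in sequence:
    S_l, S_h = intervals[a]
    delta = H - L
    L, H = L + delta * S_l, L + delta * S_h
    print(a, L, H)

# Final interval ≈ [0.57425, 0.5748125); any b inside it encodes the whole sequence.
```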


Decoder

In order to decode the message, the symbol order and probabilities must be passed to the

decoder. The decoding process is identical to the encoding. Given the code word (the

final number), at each iteration the corresponding sub-range is entered, decoding the

symbols representing the specific range.

Given b, the decoding procedure is:

Initialize L = 0, H = 1, ∆ = H − L.
REPEAT
    Find i such that S_li ≤ (b − L)/∆ < S_hi
    Output symbol a_i
    Use a_i's I_i = [S_li, S_hi) to update
        H = L + ∆ · S_hi
        L = L + ∆ · S_li
        ∆ = H − L
UNTIL the last symbol is decoded.

Example 1.16: For the source and encoder in Example 1.15, decode b = 0.57470703125.

L H ∆ Ii Next H Next L Next ∆ ai

0.5 0.65 0.15 I0 0.575 0.5 0.075 a0

0.5 0.575 0.075 I3 0.575 0.57125 0.00375 a3

0.57125 0.575 0.00375 I2 0.5748125 0.57425 0.0005625 a2
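A matching decoder sketch (not from the notes). It assumes the code word of Example 1.16 is b = 0.57470703125 and that the decoder knows the message is 5 symbols long.

```python
# Arithmetic decoding: locate b inside the current interval, emit the symbol, descend.
intervals = {"a0": (0.0, 0.5), "a1": (0.5, 0.8), "a2": (0.8, 0.95), "a3": (0.95, 1.0)}
b, L, H = 0.57470703125, 0.0, 1.0                 # assumed code word

for _ in range(5):                                # until the last symbol is decoded
    delta = H - L
    t = (b - L) / delta                           # position of b inside [L, H)
    a = next(s for s, (lo, hi) in intervals.items() if lo <= t < hi)
    print(a)                                      # decodes a1 a0 a0 a3 a2
    S_l, S_h = intervals[a]
    L, H = L + delta * S_l, L + delta * S_h       # enter the decoded sub-interval
```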

Practical Issues

Attention must be paid to the precision with which we calculate (b − L)/∆ and compare it with S_li and S_hi: round-off error in this calculation can lead to an erroneous answer. Numerical overflow can also occur (see the products ∆·S_li and ∆·S_hi). The limited precision of S_li and S_hi limits the size of the alphabet A. In practice it is important to transmit and decode the information "on the fly"; here, we must read in the entire block of source symbols before being able to compute the code word, and we must receive the entire code word b before we can begin decoding.

Comparison of the three coding schemes:

                    Huffman                              Arithmetic                   Lempel-Ziv
Intuition           Intuitive                            Not intuitive                Not intuitive
Code words          One code word for each symbol        One code word for all data   Code words for strings of source symbols
Entropy             Achieved if probabilities are        Very close                   Best results for long messages
                    negative powers of 2
Symbol dependency   Not used                             Not used                     Used for better compression
Data loss           None                                 None                         None
Alphabet            Known in advance                     Known in advance             Not known in advance
Probabilities       Known in advance                     Known in advance             Not known in advance


Ch 2 Channel and Channel Capacity

§2.1 Discrete Memoryless Channel Model

Communication Link

Definition

In most communication or storage systems, the signal is designed such that the output

symbols, y0,y1,...,yt , are statistically independent if the input symbols, c0,c1,...,ct , are

statistically independent. If the output set Y consists of discrete output symbols, and if the

property of statistical independence of the output sequence holds, the channel is called a

Discrete Memoryless Channel (DMC).

[Block diagram of the communication link: Information Source → Source Encoder → Channel Encoder → Modulator → (continuous-input, continuous-output) Channel → Demodulator → Channel Decoder → Source Decoder. The channel encoder emits c_0, c_1, ..., c_t from alphabet C with probabilities P_C; the composite discrete-input, discrete-output channel (modulator + waveform channel + demodulator) delivers y_0, y_1, ..., y_t from alphabet Y with probabilities P_Y to the channel decoder.]

Transition Probability Matrix

Mathematically, we can view the channel as a probabilistic function that transforms a sequence of (usually coded) input symbols, c, into a sequence of channel output symbols, y. Because of noise and other impairments of the communication system, the transformation is not a one-to-one mapping from the set of input symbols, C, to the set of output symbols, Y. Any particular c from C may have some probability, p_{y|c}, of being transformed to an output symbol y from Y; this probability is called a (Forward) Transition Probability.

For a DMC, let p_c be the probability that symbol c is transmitted; the probability that the received symbol is y is given in terms of the transition probabilities as

q_y = Σ_{c∈C} p_{y|c} p_c.

The probability distribution of the output set Y, denoted by Q_Y, may be easily calculated in matrix form as

[q_0; q_1; ...; q_{M_Y−1}] = [p_{y_0|c_0} p_{y_0|c_1} ... p_{y_0|c_{M_C−1}}; p_{y_1|c_0} p_{y_1|c_1} ... p_{y_1|c_{M_C−1}}; ...; p_{y_{M_Y−1}|c_0} ... p_{y_{M_Y−1}|c_{M_C−1}}] [p_0; p_1; ...; p_{M_C−1}]

or, more compactly, Q_Y = P_{Y|C} P_C. Here,

P_C: probability distribution of the input alphabet (an M_C × 1 column vector);
Q_Y: probability distribution of the output alphabet (an M_Y × 1 column vector);
P_{Y|C}: the M_Y × M_C matrix of forward transition probabilities, with entry (i, j) equal to p_{y_i|c_j}.

Remarks: The columns of PY|C sum to unity (no matter what symbol is sent, some

output symbol must result). Numerical values for the transition probability matrix are

determined by analysis of the noise and transmission impairment properties of the

channel, and the method of modulation/demodulation.

Hard Decision Decoding : MY = MC. Hard refers to the decision that the demodulator

makes; it is a firm decision on what symbol was transmitted.

Soft Decision Decoding : MY > MC. The final decision is left to the receiver decoder.

Example 2.1: C = {0, 1}, with equally probable symbols; Y = {y_0, y_1, y_2}. The transition probability matrix of the channel is

P_{Y|C} = [0.80  0.05
           0.15  0.15
           0.05  0.80].

Q_Y = ?
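A quick check of Q_Y = P_{Y|C} P_C for Example 2.1 (a sketch, not from the notes).

```python
# Output distribution of the soft-decision channel with equally probable inputs.
import numpy as np

P_YC = np.array([[0.80, 0.05],
                 [0.15, 0.15],
                 [0.05, 0.80]])
P_C = np.array([0.5, 0.5])

Q_Y = P_YC @ P_C
print(Q_Y)   # [0.425, 0.15, 0.425]
```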

Remarks: The sum of the elements on each column of the transition probability matrix is

1. This is an example of soft-decision decoding.

Example 2.1 (cont’d): Calculate the entropy of Y for the previous system. Compare this

with the entropy of source C.

(how can this happen?)

Remarks: We noticed the same thing when we discussed the source encoder

(encryption encoder). It is possible for the output entropy to be greater than the input

entropy, but the “additional” information carried in the output is not related to the

information from the source. The “extra” information in the output comes from the

presence of noise in the channel during transmission, and not from the source C.

This “extra” information carried in Y is truly “useless”. In fact, it is harmful because it

produces uncertainty about what symbols were transmitted.

Question: Can we solve this problem by using only systems which employ hard-decision

decoding?


Answer:

Example 2.2: C = {0, 1}, with equally probable symbols; Y = {0, 1}. The transition probability matrix of the channel is

P_{Y|C} = [0.98  0.05
           0.02  0.95].

Calculate the entropy of Y. Compare this with the entropy of source C.

Remarks: Y carries less information than was transmitted by the source.

Question: Where did it go ?

Answer: It was lost during the transmission process. The channel is information lossy !

So far, we have looked at two examples, in which the output entropy was either greater or

less than the input entropy. What we have not considered yet is what effect all this has on

the ability to “tell from observing Y what original information was transmitted.”

Do not forget that the purpose of the receiver is to recover the original transmitted

information !

What does the observation of Y tell us about the transmitted information sequence?

As we know, Mutual Information, I(C;Y) = H(C) − H(C|Y), is a measure of how much the uncertainty about a random variable c is reduced by observing a random variable y.

If Y tells us nothing about C (e.g., Y and C are independent, such as when somebody cuts the phone wire and no signal gets through), then I(C;Y) = 0. But if H(C|Y) = 0, then I(C;Y) = H(C): looking at Y, there is no uncertainty about C, i.e., Y contains sufficient information to tell what the transmitted sequence is. The conditional entropy H(C|Y) is a measure of how much information loss occurs in the channel!

Example 2.3: Calculate the mutual information for the system of Example 2.1.

Remark: The mutual information for this system is well below the entropy ( H(C)=1 )

of the source and so, this channel has a high level of information loss.

Example 2.4: Calculate the mutual information for the system of Example 2.2.

Remarks: This channel is quite lossy also. Although H(Y) was almost equal to H(C) in

Example 2.2, the mutual information is considerably less than H(C) . One cannot tell

how much information loss we are dealing with simply by comparing the input and

output entropies !
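A sketch (not from the notes) that computes I(C;Y) = H(Y) − H(Y|C) for the channels of Examples 2.1/2.3 and 2.2/2.4 with equally probable inputs.

```python
# Mutual information through the soft- and hard-decision channels.
import numpy as np

def mutual_information(P_YC, P_C):
    Q_Y = P_YC @ P_C
    H_Y = -np.sum(Q_Y * np.log2(Q_Y))
    # H(Y|C) = sum_c p_c H(Y|C=c); column c of P_YC is the distribution of Y given c
    H_Y_given_C = -np.sum(P_C * np.sum(P_YC * np.log2(P_YC), axis=0))
    return H_Y - H_Y_given_C

soft = np.array([[0.80, 0.05], [0.15, 0.15], [0.05, 0.80]])   # Example 2.1
hard = np.array([[0.98, 0.05], [0.02, 0.95]])                  # Example 2.2
P_C = np.array([0.5, 0.5])
print(mutual_information(soft, P_C))   # ≈ 0.576 bits, well below H(C) = 1
print(mutual_information(hard, P_C))   # ≈ 0.785 bits, also below H(C) = 1
```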

§2.2 Channel Capacity and Binary Symmetric Channel

Maximization of Mutual Information and Channel Capacity

Each time the transmitter sends a symbol, it is said to use the channel. The Channel

Capacity is the maximum average amount of information that can be sent per channel

use.

Question: Why is it not the same as the mutual information?

Answer: Because, for a fixed transition probability matrix, a change in the probability distribution of C, P_C, results in a different mutual information, I(C;Y). The maximum mutual information achieved for a given transition probability matrix is the Channel Capacity,

C_C = max over P_C of I(C;Y),

with units of bits per channel use.

An analytical closed-form solution to find CC is difficult to achieve for an arbitrary

channel. An efficient numerical algorithm for finding CC was derived in 1972, by Blahut

and Arimoto (see textbook).
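The textbook's Blahut-Arimoto routine is not reproduced in these notes; the following is a generic sketch of the iteration (variable names and structure are my own) that can be used to check the capacity values quoted in Example 2.5.

```python
# Blahut-Arimoto iteration for the capacity of a discrete memoryless channel.
import numpy as np

def blahut_arimoto(P_YC, tol=1e-9, max_iter=10_000):
    """P_YC: |Y| x |C| matrix whose columns are p(y|c). Returns (capacity, P_C, Q_Y)."""
    n_c = P_YC.shape[1]
    p = np.full(n_c, 1.0 / n_c)                  # start from a uniform input distribution
    for _ in range(max_iter):
        q = P_YC @ p                             # current output distribution
        # D_c = sum_y p(y|c) log2( p(y|c) / q(y) ), with the 0 log 0 = 0 convention
        ratio = np.where(P_YC > 0, P_YC / q[:, None], 1.0)
        D = np.sum(np.where(P_YC > 0, P_YC * np.log2(ratio), 0.0), axis=0)
        c = 2.0 ** D
        lower, upper = np.log2(p @ c), np.log2(c.max())
        p = p * c / (p @ c)                      # re-weight the input distribution
        if upper - lower < tol:
            break
    return lower, p, P_YC @ p

cap, p_opt, q_opt = blahut_arimoto(np.array([[0.98, 0.05], [0.02, 0.95]]))
print(cap, p_opt, q_opt)   # should be ≈ 0.786, ≈ [0.513, 0.487], ≈ [0.527, 0.473]; cf. Example 2.5a)
```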

Example 2.5: For each of the following transition probability matrices, find the channel capacity, the input and output probability distributions that achieve the channel capacity, and the mutual information for a uniform P_C.

a) P_{Y|C} = [0.98 0.05; 0.02 0.95]:  C_C = 0.78585,  P_C = [0.51289, 0.48711],  Q_Y = [0.52698, 0.47302]

b) P_{Y|C} = [0.80 0.10; 0.20 0.90]:  C_C = 0.39775,  P_C = [0.4824, 0.5176],  Q_Y = [0.4377, 0.5623]

c) P_{Y|C} = [0.80 0.05; 0.20 0.95]:  C_C = 0.48130,  P_C = [0.46761, 0.53239],  Q_Y = [0.4007, 0.5993]

d) P_{Y|C} = [0.80 0.30; 0.20 0.70]:  C_C = 0.191238,  P_C = [0.510, 0.490],  Q_Y = [0.555, 0.445]

e) P_{Y|C} = [0.80 0.05; 0.15 0.15; 0.05 0.80]:  C_C = 0.57566,  P_C = [0.5, 0.5],  Q_Y = [0.425, 0.150, 0.425]

Remarks: The channel capacity proves to be a sensitive function of the transition

probability matrix, PY|C , but a fairly weak function of PC. The last case is interesting, as

the uniform input distribution produces the maximum mutual information.

This is an example of Symmetric Channel. Note that the columns of symmetric

channel’s transition probability matrix are permutations of each other. Likewise, the top

and bottom rows are permutations of each other. The center row, which is not a

permutation of the other rows, corresponds to the output symbol y1, which, as we noticed

in Example 2.3, makes no contribution to the mutual information.

Symmetric Channels

Symmetric channels play an important role in communication systems, and many such systems attempt, by design, to achieve a symmetric channel function. The reason for the importance of the symmetric channel is that, when such a channel is possible, it frequently has greater channel capacity than a non-symmetric channel would have.

Example 2.6: The transition probability matrix is slightly changed compared to Example 2.5e), and the channel capacity decreases:

P_{Y|C} = [0.79 0.05; 0.16 0.15; 0.05 0.80]:  C_C = 0.571215,  P_C = [0.50095, 0.49905],  Q_Y = [0.4207, 0.1550, 0.4243]

Example 2.7:

P_{Y|C} = [0.950 0.024 0.024 0.002; 0.024 0.950 0.002 0.024; 0.024 0.002 0.950 0.024; 0.002 0.024 0.024 0.950]:  C_C = 1.653488,  P_C = [0.25, 0.25, 0.25, 0.25]

This is an example of using quadrature phase-shift keying (QPSK), which is a modulation

method that produces a symmetric channel. For QPSK, MC=MY=4.


Remarks:

i) The capacity for this channel is achieved when PC is uniformly distributed. This is

always the case for a symmetric channel.

ii) The columns of the transition probability matrix are permutations of each other, and so

are the rows.

iii) When the transition probability matrix is a square matrix, this permutation property of the columns and rows is a sufficient condition for a uniformly distributed input alphabet to achieve the maximum mutual information. Indeed, the permutation condition is what gives rise to the term "symmetric channel".

Binary Symmetric Channel (BSC)

A symmetric channel of considerable importance, both theoretically and practically, is the binary symmetric channel (BSC), for which

P_{Y|C} = [1 − p   p
           p       1 − p].

The parameter p is known as the Crossover Probability, and it is the probability that the demodulator/detector makes a hard-decision decoding error. The BSC is the model for essentially all binary-pulse transmission systems of practical importance.

Channel Capacity: for a uniform input probability distribution,

C_C = 1 + p log2(p) + (1 − p) log2(1 − p),

which is often written as

C_C = 1 − H(p),  with H(p) = −p log2(p) − (1 − p) log2(1 − p),

where the notation H(p) arises from the terms involving p.

Remarks:

The capacity is bounded by the range 0 ≤ C_C ≤ 1. The upper bound is achieved only if p = 0 or p = 1.

The case p = 0 is not surprising, as it corresponds to a channel which does not make errors (known as a "noiseless" channel).


The case p = 1 corresponds to a channel which always makes errors. If we know

that the channel output is always wrong, we can easily set things right by

decoding the opposite of what the channel output is.

The case p = 0.5 corresponds to a channel for which the output symbol is as

likely to be correct as it is to be incorrect. Under this condition, the information

loss in the channel is total, and the channel capacity is zero. The capacity of the

BSC is a concave-upward function, possessing a single minimum at p = 0.5.

Except for p = 0 and p = 1 cases, the capacity of the BSC is always less than the

source entropy. If we try to transmit information through the channel using the

maximum amount of information per symbol, some of this info will be lost, and

decoding errors at the receiver will result. However, if we add sufficient

redundancy to the transmitted data stream, it is possible to reduce the

probability of lost information to an arbitrarily low level.
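A small sketch (not from the notes) of the BSC capacity C_C = 1 − H(p) as a function of the crossover probability p.

```python
# BSC capacity: 1 bit/use at p = 0 or p = 1, zero at p = 0.5.
from math import log2

def bsc_capacity(p):
    if p in (0.0, 1.0):
        return 1.0                       # noiseless, or always-wrong (just invert the output)
    H = -p * log2(p) - (1 - p) * log2(1 - p)
    return 1.0 - H

for p in (0.0, 0.01, 0.1, 0.5, 0.9, 1.0):
    print(p, bsc_capacity(p))            # single minimum (zero capacity) at p = 0.5
```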

§2.3 Block Coding and Shannon’s 2nd Theorem

Equivocation

We have seen that there is a maximum amount of information per channel use that can be

supported by the channel. Any attempt to exceed this channel capacity will result in

information being lost during transmission. That is, I(C;Y) ≤ C_C < H(C), and so the equivocation H(C|Y) = H(C) − I(C;Y) > 0.

The conditional entropy H(C|Y) corresponds to our uncertainty about what the input of the channel was, given our observation of the channel output. It is a measure of the information loss during the transmission. For this reason, this conditional entropy is often called the Equivocation. The equivocation has the property that 0 ≤ H(C|Y) ≤ H(C), and it is given by

H(C|Y) = −Σ_{y∈Y} Σ_{c∈C} P(c, y) log2 P(c|y).

The equivocation is zero if and only if the transition probabilities p_{y|c} are either zero or one for all pairs (y ∈ Y, c ∈ C).

Entropy Rate

The entropy of a block of n symbols satisfies the inequality

H(C_0, C_1, ..., C_{n−1}) ≤ n H(C),

with equality if and only if C is a memoryless source. In transmitting a block of n symbols, we use the channel n times. Recall that channel capacity has units of bits per channel use and refers to an average amount of information per channel use. Since H(C_0, C_1, ..., C_{n−1}) is the average information contained in the n-symbol block, it follows that the average information per channel use would be H(C_0, C_1, ..., C_{n−1}) / n.

However, the average bits per channel use is achieved in the limit, as n goes to infinity, such that

R = lim_{n→∞} H(C_0, C_1, ..., C_{n−1}) / n ≤ H(C),

where R is called the Entropy Rate, with equality if and only if all symbols are statistically independent.

Suppose that they are not, and that in the transmission of the block we deliberately introduce redundant symbols. Then R < H(C). Taking this further, suppose that we introduce a sufficient number of redundant symbols in the block so that R < C_C.

Question: Is the transmission without information loss (i.e. zero equivocation) possible in

such case?

Answer: Remarkably enough, the answer to this question is “YES”!

What is the implication of doing so ?

It is possible to send information through the channel with arbitrarily low probability of

error.

The process of adding redundancy to a block of transmitted symbols is called Channel

Coding.


Question: Does there exist a channel code that will accomplish this purpose?

Answer: The answer to this question is given by the Shannon’s second theorem.

Shannon’s 2nd Theorem

Suppose R < C_C, where C_C is the capacity of a memoryless channel. Then, for any ε > 0, there exists a block code of length n and rate R whose probability of block decoding error p_e satisfies p_e ≤ ε when the code is used on this channel.

Shannon’s second theorem (also called Shannon’s main theorem) tells us that it is

possible to transmit information over a noisy channel with arbitrarily small probability of

error. The theorem says that if the entropy rate R in a block of n symbols is smaller

than the channel capacity, then we can make the probability of error arbitrarily

small.

What error are we speaking about?

Suppose we send a block of n bits in which k < n of these bits are statistically

independent “information” bits and n-k are redundant “parity” bits computed from the k

information bits, according to some coding rule. The entropy of the block will then be k bits, and the average information in bits per channel use will be R = k/n.

If this entropy rate is less than the channel capacity, Shannon’s main theorem says we can

make the probability of error in recovering our original k information bits arbitrarily

small. The channel will make errors within our block of n bits, but the redundancy built

into the block will be sufficient to correct these errors and recover the k bits of

information we transmitted.

Shannon’s theorem does not say that we can do this for just any block length n we

might want to choose! The theorem says there exists a block length n for which there is

a code of rate R. The required size of the block length n depends on the upper bound

we pick for our error probability. Actually, Shannon's theorem implies very strongly that the block length n is going to be very large if R is to approach C_C to within an arbitrarily small distance with an arbitrarily small probability of error.

The complexity and expense of an error-correcting channel code are believed to grow rapidly as R approaches the channel capacity and the probability of a block decoding error is made arbitrarily small. It is believed by many that beyond a particular rate, called the Cutoff Rate, R_0, it is prohibitively expensive to use the channel. In the case of the binary symmetric channel, this rate is given by

R_0 = −log2( 0.5 + √(p(1 − p)) ).

The belief that R0 is some kind of “sound barrier” for practical error correcting codes

comes from the fact that for certain kind of decoding methods, the complexity of the

decoder grows extremely rapidly as R exceeds R0.

§2.4 Markov Processes and Sources with Memory

Markov Process

Thus far, we have discussed memoryless sources and channels. We now turn our

attention to sources with memory. By this, we mean information sources, where the

successive symbols in a transmitted sequence are correlated with each other, i.e.,

the sources in a sense “remember” what symbols they have previously emitted, and the

probability of their next symbol depends on this history.

Sources with memory arise in a number of ways. First, natural languages, such as

English, have this property. For example, the letter “q” in English is almost always

followed by the letter “u”. Similarly, the letter “t” is followed by the letter “h”

approximately 37% of the time in English text. Many real-time signals, such as speech

waveform, are also heavily time correlated. Any time correlated signal is a source with

memory. Finally, we sometimes wish to deliberately introduce some correlation

(redundancy) in a source for purposes of block coding, as discussed in the previous

section.

Let A be the alphabet of a discrete source having MA symbols, and suppose this source

emits a time sequence of symbols (s0,s1,…,st,…) with each stA. If the conditional

probability p(st | st-1,…,s0) depends only on j previous symbols, so that

p(st | st-1,…,s0)=p(st | st-1,…,st-j),

then A is called a j-th order Markov process. The string of j symbols (s_{t−1}, ..., s_{t−j}) is called the state of the Markov process at time t. A j-th order Markov process, therefore, has N = M_A^j possible states.


Let us number these possible states from 0 to N − 1 and let π_n(t) represent the probability of being in state n at time t. The probability distribution of the system at time t can then be represented by the vector

Π_t = [π_0(t), π_1(t), ..., π_{N−1}(t)]^T.

For each state at time t, there are M_A possible next states at time t + 1, depending on which symbol is emitted next by the source. If we let p_{i|k} be the conditional probability of going to state i given that the present state is k, the state probability distribution at time t + 1 is governed by the transition probability matrix P_A = [p_{i|k}] (an N × N matrix whose columns each sum to one) and is given by

Π_{t+1} = P_A Π_t.

Example 2.8: Let A be a binary first-order Markov source with A = {0, 1}. This source has 2 states, labeled "0" and "1". Let the transition probabilities be

P_A = [0.3  0.4
       0.7  0.6].

What is the equation for the next probability state? Find the state probabilities at time t = 2, given that the probabilities at time t = 0 are π_0 = 1 and π_1 = 0.

The next-state equation for the state probabilities is Π_{t+1} = P_A Π_t.


Example 2.9: Let A be a second-order binary Markov source with

Pr(a = 0 | 0, 0) = 0.2    Pr(a = 1 | 0, 0) = 0.8
Pr(a = 0 | 0, 1) = 0.4    Pr(a = 1 | 0, 1) = 0.6
Pr(a = 0 | 1, 0) = 0.0    Pr(a = 1 | 1, 0) = 1.0
Pr(a = 0 | 1, 1) = 0.5    Pr(a = 1 | 1, 1) = 0.5

If all the states are equally probable at time t = 0, what are the state probabilities at t = 1?

Define the states S0 = 00, S1 = 01, S2 = 10, S3 = 11. The possible state transitions and their associated transition probabilities can be represented using a state diagram. For this problem, the state diagram is [state diagram].

The next state probability equation is Π_{t+1} = P_A Π_t, with P_A built from the conditional probabilities above.

Remarks: Every column of the transition probability matrix adds to one. Every properly

constructed transition probability matrix has this property.


Steady State Probability and the Entropy Rate

Starting from the equation for the state probabilities, it can be shown by induction that the state probabilities at time t are given by Π_t = (P_A)^t Π_0.

A Markov process is said to be Ergodic if we can get from the initial state to any other state in some number of steps and if, for large t, Π_t approaches a steady-state value that is independent of the initial probability distribution, Π_0. The steady-state value Π is reached when Π = P_A Π.

The Markov processes which model information sources are always ergodic. Example 2.10: Find the steady-state probability distribution for the source in Example

2.9.

In the steady state, the state probabilities become

π_0 = 0.2 π_0 + 0.4 π_1
π_1 = 0.5 π_3
π_2 = 0.8 π_0 + 0.6 π_1
π_3 = π_2 + 0.5 π_3.

It appears from this that we have four equations and four unknowns, so, solving for the

four probabilities is no problem. However, if we look closely, we will see that only three

of the equations above are linearly independent. To solve for the probabilities, we can use

any of three of the above equations and the constraint equation. This equation is a

consequence of the fact that the total probability must sum to unity;

it is certain that the system is in some state!

Dropping the first equation above and using the constraint, we have

π_1 = 0.5 π_3
π_2 = 0.8 π_0 + 0.6 π_1
π_3 = π_2 + 0.5 π_3
π_0 + π_1 + π_2 + π_3 = 1

which has the solution

π_0 = 1/9,  π_1 = π_2 = 2/9,  π_3 = 4/9.

This solution is independent of the initial probability distribution. The situation illustrated in the previous example, where only N − 1 of the equations resulting from the transition probability expression are linearly independent and we must use the "sum to unity" equation to obtain the solution, always occurs in the steady-state probability solution of an ergodic Markov process.
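The state-transition matrix of Example 2.9 is not written out in the notes; the sketch below builds it under the assumption that the states are ordered S0 = 00, S1 = 01, S2 = 10, S3 = 11, with the most recent symbol listed first in the conditioning pair. With that reading, iterating Π_{t+1} = P_A Π_t reproduces the steady state 1/9, 2/9, 2/9, 4/9 found in Example 2.10.

```python
# Next-state probabilities and steady state for the second-order Markov source.
import numpy as np

# Columns = current state, rows = next state (each column sums to 1).
P_A = np.array([[0.2, 0.4, 0.0, 0.0],    # -> S0 = 00
                [0.0, 0.0, 0.0, 0.5],    # -> S1 = 01
                [0.8, 0.6, 0.0, 0.0],    # -> S2 = 10
                [0.0, 0.0, 1.0, 0.5]])   # -> S3 = 11

pi = np.full(4, 0.25)                    # equally probable states at t = 0
print(P_A @ pi)                          # state probabilities at t = 1 (Example 2.9)

for _ in range(200):                     # iterate to the steady state
    pi = P_A @ pi
print(pi)                                # ≈ [1/9, 2/9, 2/9, 4/9] (Example 2.10)
```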

Entropy Rate of an Ergodic Markov Process

POP QUIZ: How do you define the entropy rate?

The entropy rate, R, is the average information per channel use (average info bits per

channel use),

R = lim_{t→∞} H(A_0, A_1, ..., A_{t−1}) / t ≤ H(A),

with equality if and only if all symbols are statistically independent.

For ergodic Markov sources, as t grows very large, the state probabilities converge to a steady-state value, π_n, for each of the N possible states (n = 0, ..., N−1). As t becomes large, the average information per symbol in the block of symbols will be determined by the probabilities of occurrence of the symbols in A, after the state probabilities converge to their steady-state values.

Suppose we are in state S_n at time t. The conditional entropy of A is

H(A|S_n) = −Σ_{a∈A} Pr(a|S_n) log2 Pr(a|S_n).

Since each possible symbol a leads to a single state, S_n can lead to M_A possible next states. The remaining N − M_A states cannot be reached from S_n, and for these states the transition probability p_{i|n} = 0. Therefore, the conditional entropy expression can be expressed in terms of the transition probabilities as

H(A|S_n) = −Σ_{i=0}^{N−1} p_{i|n} log2 p_{i|n}.

For large t, the probability of being in state S_n is given by its steady-state probability π_n. Therefore, the entropy rate of the system is

R = Σ_{n=0}^{N−1} π_n H(A|S_n).

This expression, in turn, is equivalent to

R = −Σ_{n=0}^{N−1} π_n Σ_{i=0}^{N−1} p_{i|n} log2 p_{i|n},

where the p_{i|n} are the entries in the transition probability matrix and the π_n are the steady-state probabilities.

Example 2.11: Find the entropy rate for the source in Example 2.9. Calculate the steady-

state probability of the source emitting a “0” and the steady-state probability of the source

emitting a “1”. Calculate the entropy of a memoryless source having these symbol

probabilities and compare the result with the entropy rate of the Markov source.

With the steady-state probabilities calculated in Example 2.10, by applying the formula

for the entropy rate of an ergodic Markov source, one gets

The steady state probabilities of emitting 0 and 1 are, respectively

The entropy of a memoryless source having this symbol distribution is

Thus, R<H(X) as expected.
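A sketch (not from the notes) of the entropy-rate calculation of Example 2.11, using the steady-state probabilities from Example 2.10 and the per-state emission probabilities of Example 2.9 (under the state ordering used in the previous sketch).

```python
# Entropy rate of the ergodic Markov source versus the memoryless entropy.
from math import log2

pi = [1/9, 2/9, 2/9, 4/9]                                  # steady state of S0..S3
emit = [(0.2, 0.8), (0.4, 0.6), (0.0, 1.0), (0.5, 0.5)]    # Pr(a=0), Pr(a=1) per state

def H(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

R = sum(w * H(d) for w, d in zip(pi, emit))     # entropy rate, ≈ 0.74 bits/symbol

p0 = sum(w * d[0] for w, d in zip(pi, emit))    # steady-state Pr(0) = 1/3
p1 = 1 - p0
print(R, p0, p1, H((p0, p1)))                   # memoryless H ≈ 0.92 bits > R, as expected
```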

Remarks:

i) In an earlier section, we discussed how introducing redundancy into a block of symbols can be used to reduce the entropy rate to a level below the channel capacity, and how this technique can be used for error correction at the receiver side, in order to achieve an arbitrarily small information bit error rate.

ii) In this section, we have seen that a Markov process also introduces redundancy

into the symbol block.

Question: Can this redundancy be introduced in such a way as to be useful for error correction?

Answer: YES! This is the principle underlying a class of error correcting codes known

as convolutional codes.

iii) In the previous lecture we examined the process of transmitting information C

through a channel, which produces a channel output Y. We have found out that a noisy

channel introduces information loss if the entropy rate exceeds the channel capacity.

iv) It is natural to wonder if there might be some (possibly complicated) form of data

processing which can be performed on Y to recover the lost information. Unfortunately,

the answer to this question is NO! Once the information has been lost, it is gone!

Data Processing Inequality

This states that additional processing of the channel output can at best result in no further loss of information, and may even result in additional information loss: if C → Y → Z, then I(C;Z) ≤ I(C;Y).

A very common example of this kind of information loss is the roundoff or truncation

error during digital signal processing in a computer or microprocessor. Another

examples is quantization in an analog to digital converter. Designers of these systems

need to have an awareness of the possible impact of such design decisions, as the word

length of the digital signal processor or the number of bits of quantization in analog to

digital converters, on the information content.

Y Z

Data Processing

53

§2.5 Constrained Channels

Channel Constraints

So far, we have considered only memoryless channels corrupted by noise, which are

modeled as discrete-input discrete-output memoryless channels. However, in many cases

we have channels which place constraints on the information sequence.

Sampler

Modulator

Bandlimited Channel

Demodulator

+

The coded information at is presented to the modulator, which transforms the symbol

sequence into continuous-valued waveform signals, designed to be compatible with the

physical channel (bandlimited channel). Examples of bandlimited channels are wireless

channels, telephone lines, TV cables, etc. During transmission, the information bearing

signal is distorted by the channel and corrupted with noise. The output of the

demodulator, which attempts to combat the distortion and minimize the effect of the

noise, is sampled and the detector attempts to reconstruct the original coded sequence, at.

The timing recovery is required; the performance of this block are crucial in recovering

the information. The theory and practice of performing these tasks consist the

modulation theory, which is treated in “Digital Communications” textbooks. In this

course, we are concern with the information theory aspects of this process. What

are these aspects?

Remarks:

i) When the system needs to recover the timing information, additional information

should be transmitted for that. As the maximum information rate is limited by the

Symbol Detector

Timing Recovery

s(t)

Noise at

Block Diagram of a Typical Communication System. yt

54

channel capacity, the information needed for timing recovery is included at the expense

of user information. This may require that the sequence of transmitted symbols be

constrained in such a way as to guarantee the presence of timing infomation embedded

within the transmitted coded sequence.

ii) Another aspect arises from the type and severity of channel distortions imposed by the

physical bandlimited channel. We can think of the physical channel as performing a kind

of data processing on the information bearing waveform presented to it by the modulator.

But data processing might result in information loss. A given channel can thus place its

own constraints on the allowable symbol sequence which can be “process” without

information loss.

iii) Modulation theory tells us that it is possible and desirable to model the

communication channel as a cascade of noise-free channel and an unconstrained noisy

channel (we have implicitly used such a model, except that we have not considered any

constraint on the input symbol sequence).

yt Constrained

Channel, ht

Decision Block +

at xt rt

Noise, nt

Linear and Time-Invariant (LTI) Channel

The LTI channel is specified by a set of parameters ht, which represent the channel

impulse response. The channel’s output sequence is related to the input sequence as

The decision block is presented with a noisy signal

The decision block takes these inputs and produces output symbols, yt, drawn from a

finite alphabet Y, with MY ≥ MA.

55

If MY =MA, yt is an estimate of the transmitted symbol at, and the decision block is

said to make a Hard-decision.

If MY > MA, the decision block is said to make a Soft-decision, and the final decision

on the transmitted symbol at is made by the decoder.

Example 2.12: Let A be a source with equiprobable symbols, A={-1,1}. The bandlimited

channel has the impulse response {h0=1 h1=0 h2=-1}. Calculate the steady-state entropy

of the constrained channel’s output and the entropy rate of the sequence xt.

State of the channel at time t : St = <at-1,at-2>.

The states are as follows:

(-1,-1) is state S0, (1,-1) is state S1,

(-1, 1) is state S2, (1, 1) is state S3.

The channel can be represented as a Markov process, with the state diagram given in the

sequel.

1 / 2 (0.5)

Note that all transition probabilities, shown in parentheses, are 0.5. The arrows are

labeled at / xt . One can easily show that X={-2, 0, 2}.

The state probability equation is then given by

-1 1

1-1

-1 / -2 (0.5)

1 / 2 (0.5)

-1 / -2 (0.5) 1 / 0

-1 / 0

-1 -1

1 1

-1 / 0 (0.5)

1 / 0 (0.5)

S0

S2

S1

S3

1

0.5 0 0.5 0

0.5 0 0.5 0

0 0.5 0 0.5

0 0.5 0 0.5

t t

56

57

from which we set up 4 equations and find the steady state probabilities, i.e.,

i=0.25, i=0,1,2,3.

The output symbol X's probabilities are:

The steady state entropy of the channel output is

The entropy rate is:

which equals the source entropy → channel is lossless.

Note that the entropy rate is not equal to the steady state entropy of the channel’s output

symbols. While the channel is lossless, the sequences it produces does not carry

sufficient information to permit clock recovery for arbitrary input sequences. For

example, a long input sequence of “-1”, “+1”, or a long sequence of alternating symbols,

“+1-1” or“-1+1”, all produce a long output of zeros at the output of the channel. Timing

recovery methods can fail in such situations.

Ch 3 Error Control Strategies

Error Control Strategies

Forward Error Correction (FEC)

Automatic Repeat Request (ARQ)

Forward Error Correction (FEC) In a one-way communication system: The transmission or recording is strictly in one direction, from transmitter to receiver. Error control strategy must be FEC; that is, they employ error-correcting codes that automatically correct errors detected at the receiver. For example: 1) digital storage systems, in which the information recorded can be replayed weeks or even months after it is recorded, and 2) deep-space communication systems. Most of the coded systems in use today employ some form of FEC, even if the channel is not strictly one-way! However, for a two-way system, the control strategies use error detection and retransmission that is called automatic repeat request (ARP).

§3.1 Automatic Repeat Request

Automatic Repeat Request (ARQ)

In most communication systems, the information can be sent in both directions, and the

transmitter also acts at a receiver (transceiver), and vice-versa. For example: data

networks, satellite communications, etc. Error control strategies for a two-way system

can include error detection and retransmission, called Automatic Repeat Request

(ARQ). In an ARQ system, when errors are detected at the receiver, a request is sent for

the transmitter to repeat the message, and repeat requests continue to be sent until the

message is correctly received. ARQ SYSTEMS

Stop-and-Wait ARQ

Selective ARQ Go-Back-N ARQ

Continuous ARQ

58

Types

Stop-and-Wait (SW) ARQ: The transmitter sends a block of information to the receiver

and waits for a positive (ACK) or negative (NAK) acknowledgment from the receiver. If

an ACK is received (no error detected), the transmitter sends the next block. If a NAK is

received (errors detected) , the transmitter resends the previous block. When the errors

are persistent, the same block may be retransmitted several times before it is correctly

received and acknowledged.

Continuous ARQ: The transmitter sends blocks of information to the receiver

continuously and receives acknowledgments continuously. When a NAK is received, the

transmitter begins a retransmission. It may back-up to the block and resend that block

plus the N-1 blocks that follow it. This is called Go-Back-N (GBN) ARQ. Alternatively,

the transmitter may simply resend only those blocks that are negatively acknowledged.

This is known as Selective Repeat (SR) ARQ.

Comparison

GBN Versus SR ARQ

SR ARQ is more efficient than GBN ARQ, but requires more logic and buffering.

Continuous Versus SW ARQ

Continuous ARQ is more efficient than SW ARQ, but it is more expensive to implement.

For example: In a satellite communication, where the transmission rate is high and the

round-trip delay is long, continuously ARQ is used. SW ARQ is used in systems where

the time taken to transmit a block is long compared to the time taken to receive an

acknowledgment. SW ARQ is used on half-duplex channels (only one way transmission

at a time), whereas continuous ARQ is designed for use on full-duplex channels

(simultaneous two-way transmission).

Performance Measure

Throughput Efficiency: is the average number of information (bits) successfully

accepted by the receiver per unit of time, over the total number of information digits that

could have been transmitted per unit of time.

59

Delay of a Scheme: The interval from the beginning of a transmission of a block to the

receipt of a positive acknowledgment for that block.

GBN Versus SR ARQ

Figure 1 From Lin and Costello, Error Control

ARQ Versus FEC

The major advantage of ARQ versus FEC is that error detection requires much simpler

decoding equipment than error correcting. Also, ARQ is adaptive in the sense that

information is retransmitted only when errors occurs. In contrast, when the channel error

is high, retransmissions must be sent too frequently, and the SYSTEM THROUGHPUT

is lowered by ARQ. In this situation, a HYBRID combination of FEC for the most

frequent error patterns along with error detection and retransmission for the less likely

error patterns is more efficient than ARQ alone (HYBRID ARQ).

60

§3.2 Forward Error Correction

Performance Measures – Error Probability

The performance of a coded communication system is in general measured by its

probability of decoding error (called the Error Probability) and its coding gain over the

uncoded system that transmit information at the same rate (with the same modulation

format).

There are two types of error probabilities, probability of word (or block) error and

probability of bit error. The probability of block error is defined as the probability that

a decoded word (or block) at the output of the decoder is in error. This error probability is

often called the Word-Error Rate (WER) or Block-error Rate (BLER). The

probability of bit-error rate, also called the Bit Error Rate (BER), is defined as the

probability that a decoded information bit at the output of the decoder is in error.

A coded communication system should be designed to keep these two error probabilities

as low as possible under certain system constraints, such as power, bandwidth and

decoding complexity.

The error probability of a coded communication system is commonly expressed in terms

of the ratio of energy-per information bit, Eb, to the one-sided power spectral density

(PSD) N0 of the channel noise.

Example 3.1: Consider a coded communication system using an (23, 12) binary Golay

code for error control. Each code word consists of 23 code digits, of which 12 are of

information. Therefore, there are 11 redundant bits, and the code rate is R=12/23=0.5217.

Suppose that BPSK modulation with coherent detection is used and the channel is

AWGN, with one-side PSD N0 . Let Eb / N0 at the input of the receiver be the signal-to-

noise ratio (SNR), which is usually expressed in dB.

61

The bit-error performance of the (23,12) Golay code with both hard- and soft-decision

decoding versus SNR is given, along with the performance of the uncoded system.

2

From Lin and Costello, Error Control

From the above figure, the coded system, with either hard- or soft-decision decoding,

provides a lower bit-error probability than the uncoded system for the same SNR, when

the SNR is above a certain threshold.

With hard-decision, this threshold is 3.7 dB.

For SNR=7dB, the BER of the uncoded system is 8x10-4, whereas the coded system

(hard-decision) achieves a BER of 2.9x10-5. This is a significant improvement in

performance.

For SNR=5dB this improvement in performance is small: 2.1x10-3 compared to 6.5x10-3.

However, with soft-decision decoding, the coded system achieves a BER of 7x10-5.

62

Performance Measures – Coding Gain

The other performance measure is the Coding Gain. Coding gain is defined as the

reduction in SNR required to achieve a specific error probability (BER or WER) for a

coded communication system compared to an uncoded system.

Example 3.1 (cont’d): Determine the coding gain for BER=10-5.

For a BER=10-5, the Golay-coded system with hard-decision decoding has a coding gain

of 2.15 dB over the uncoded system, whereas with soft-decision decoding, a coding gain

of more than 4 dB is achieved. This result shows that soft-decision decoding of the Golay

code achieves 1.85 dB additional coding gain compared to hard-decision decoding at a

BER of 10-5.

This additional coding gain is achieved at the expense of higher decoding complexity.

Coding gain is important in communication applications, where every dB of improved

performance results in savings in overall system cost.

Remarks:

At sufficient low SNR, the coding gain actually becomes negative. This threshold

phenomenon is common to all coding schemes. There always exists an SNR below which

the code loses its effectiveness and actually makes the situation worse. This SNR is

called the Coding Threshold. It is important to keep this threshold low and to maintain a

coded communication system operating at an SNR well above its coding threshold.

Another quantity that is sometimes used as a performance measure is the Asymptotic

Coding Gain (the coding gain for large SNR).

§3.3 Shannon’s Limit of Code Rate

Shannon’s Limit

63

In designing a coding system for error control, it is desired to minimize the SNR

required to achieve a specific error rate. This is equivalent to maximizing the coding

gain of the coded system compared to an uncoded system using the same modulation

format. A theoretical limit on the minimum SNR required for a coded system with

code rate R to achieve error-free communication (or an arbitrarily small error

probability) can be derived based on Shannon’s noisy coding theorem.

This theoretical limit, often called the Shannon Limit, simply says that for a coded

system with code rate R, error-free communication is achieved only if the SNR exceeds

this limit. As long as SNR exceeds this limit, Shannon’s theorem guarantees the existence

of a (perhaps very complex) coded system capable of achieving error-free

communication.

For transmission over a binary-input, continuous-output AWGN with BPSK signaling,

the Shannon’s limit, in terms of SNR as a function of the code rate does not have a close

form; however, it can be evaluated numerically.

0.188 dB

3

64

9.462 dB

Shannon’s limit

5.35 dB

Convolutional Code, R=1/2

4

From Lin and Costello, Error Control

From Fig. 3 (Shannon limit as a function of the code rate for BPSK signaling on a

continuous-output AWGN channel), one can see that the minimum required SNR to

achieve error free communication with a coded system with rate R=1/2, is 0.188 dB. The

Shannon limit can be used as a yardstick to measure the maximum achievable coding

gain for a coded system with a given rate R over an uncoded system with the same

modulation format. For example, to achieve BER=10-5, un uncoded BPSK system

requires an SNR of 9.65 dB. For a coded system with code rate R=1/2, the Shannon limit

is 0.188 dB. Therefore, the maximum potential coding gain for a coded system with code

rate R=1/2 is 9.462 dB.

For example (Fig. 4), a rate R=1/2 convolutional code with memory order 6, achieves

BER=10-5 with SNR=4.15 dB, and achieves a code gain of 5.35 dB compared to the

uncoded system. However, it is 3.962 dB away from the Shannon’s limit. This gap can be

reduced by using a more powerful code.

65

§3.4 Codes for Error Control

Basic Concepts in Error Control

There can be a hybrid of the two approaches, as well.

Codes for Error Control (FEC)

66

Types of Channnels

Compound Channels

Burst-Error Channels

Random Error Channels

Types of Channels

Random Error Channels: are memoryless channels; the noise affects each transmitted

symbol independently. Example: deep space and satellite channels, most line-of-sight

transmission.

Burst Error Channels: are channels with memory. Example: fading channels (the

channel is in a “bad state” when a deep fade occurs, which is caused by multipath

transmission) and magnetic recordings subject to dropouts caused by surface defects and

dust particles.

Compound Channels: both types of errors are encountered.

67

Ch 4 Error Detection and Correction

Source Encoder

ECC Encoder

Channel

DTC Encoder

At Transmitter

At Receiver

Source Decoder

ECC Decoder

DTC Decoder

Encoding and Decoding Procedure

§4.1 Error Detection and Correction Capacity Definition

A code can be characterized in terms of its amount of error detection capability and error

correction capability. The Error Detection Capability is the ability of the decoder to

tell if an error has been made in transmission. The Error Correction Capability is the

ability of the decoder to tell which bits are in error.

68

Binary Code, M={0,1}

Coded sequence, C

Channel Encoder

0 1( ,..., ), nc c n k

Assumptions:

- independent bits

- each message is equally probable, 2k equally likely messages, of k bits each

- r = n-k redundant bits

Thus, the Entropy Rate of the coded word is , this is also called the Code Rate. For every is the Hamming distance between the two code words. The Hamming Distance is defined as the number of bits which are different in the two code words. There is at least one pair of code words for which the distance is the least. This is called

the Minimum Hamming Distance of the code.

Example 4.1 (Repetition Code) : Given encoding rule:

G(0)→ 000

G(1) →111

i.e. only two valid code words. Find its code rate and Hamming weight.

Hamming Weight wH of a code word is defined as the number of “1” bits in the code

word (the Hamming distance between the code word and the zero code word).

• Message: block of k bits • Code Word: block of n bits

• Only 2k out of 2n are used as code words.

One To One Correspondence

0 1( ,..., )km mm c

Message Code word

G is the encoding rule

, , , ( , )i j H i jC i j d c c c c

69

Example 4.1 (cont’d) : For the received words in the 1st column of the Table below,

determine their source words.

Decision: based on the minimum Hamming distance between the received word and

the code words.

• The code corrects 1 error (dH=1), but does not simultaneously detect the 2 bit

error. Moreover, we can miscorrect the received word.

• The code detects up to two bits in error (3 bits in error lead to a code word;

dmin between the two code words is 3).

111

110

101

100

011

010

001

000

Error Flag Decoded Word

Received Word

Example 4.2 (Repetition Code) : Given coding rule,

G(0)→ 0000

G(1) →1111

find decoded words for the received words in the table on the next page.

n = 4, k = 1, r = 3, dmin = 4, R=1/4

• Correct 1 error (dH =1) and Detect 2 errors (dH=2)

• An error of 3 or 4 bits will be miscorrected.

70

Received

Word Received Decoded

Word Decoded Word

Word

0000 1000

0001 1001

0010 1010

0011 1011

0100 1100

0101 1101

0110 1110

0111 1111

Hamming Distance and Code Capability

1. Detect Up to t Errors IF AND ONLY IF

Example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. This code detects up to t = 2 errors.

2. Correct Up to t Errors IF AND ONLY IF

Example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. This code corrects t = 1 error.

3. Detect Up to td Errors and Correct Up to tc Errors IF AND ONLY IF

Example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. This code cannot simultaneously correct (tc = 1) and detect (td = 2) errors.

Number of Redundant Bits

The minimum Hamming Distance is related to the number of redundant bits, r

71

This gives us the lower limit on the number of the redundant bits for a certain minimum

Hamming distance (certain detection and correction capability), and it is called the

Singleton Bound.

For example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. dmin = r +1 See its error

detection and correction capabilities as previously discussed.

§4.2 Linear Block Codes

Definition

Linear Block Codes can be mathematically treated using the mathematics of vector

spaces.

Linear Block Codes

Binary (We deal here only with such codes)

Non-Binary

Reed-Solomon

Galois Field has two elements, i.e., A={0,1} or A=GF(2)

),,( A Exclusive Or And

0 1 1

1 0 0

1 0 +

1 0 1

0 0 0

1 0

(Digital Logic)

(Digital Logic)

72

( , , )nA

Scalar Multiplication

Vector Addition

Vector space An is a set with elements 0 1( ,..., ), with each n ia a a A a

The set of code words, C, is a subset of An. It is a subspace (2k elements); any subspace

is also a vector space.

If the sum of two code words is also a code word, such a code is called a Linear

Code).

Consequence : All-zero vector is a code word, 1 1(because )C 0 c c 0

Vector Space

Linear Independent : For code words, 0 1 ,..., kc c

if and only if , these are linear independent, and they are Basis Vectors.

0 ,..., k 1c c0 1... 0ka a

If they are linear independent and if and only if every can be uniquely written as Cc

0 0 1 1... k ka a c c c

then, the Dimension of a vector space is defined as the number of basis vectors it takes to

describe (span) it.

Generating Code Word

Question: how do we generate a code word ?

73

c mG

Linear Combination of the rows of the G matrix

They form a basis. The k rows must be linearly independent.

All the lines of G are code words!

For example:

Example 4.3: For linear block code n = 7, k = 4, r = 3, generated by

Find all the code words.

Code Word 1 x n

Message 1 x k

Generator Matrixk x n

0 1( , ..., ), nc c n k c

m 0 1( ,..., )km m

0

1

...

k

g

G

g

1 0 0 0 1 0 1

0 1 0 0 1 1 1

0 0 1 0 1 1 0

0 0 0 1 0 1 1

) () ( 32106543210 mmmmccccccc

74

Here, it is linear systematic block code, since

75

§4.2.1 Linear Systematic Block Codes

Definition

If the generating matrix can be written as :

]|[ kIPG

Parity-check matrix k x n-k

n x k Identity matrix k x k

Redundant Checking Part n-k digits

Message Information Part k digits

n bits

then, a linear block code generated by such a generator matrix is called Linear

Systematic Block Code. Its code words are in the form of

Example 4.3 (cont’d): n = 7, k = 4, r = 3

design the encoder.

1 0 0 0 1 0 1

0 1 0 0 1 1 1

0 0 1 0 1 1 0

0 0 0 1 0 1 1

) () ( 32106543210 mmmmcccccccc

0 0 2 3

1 0 1 2

2 1 2 3

3 0

4 1

5 2

6 3

+ +

+ + Parity Check Bits

+ +

Information Bits

c m m m

c m m m

c m m m

c m

c m

c m

c m

ENCODING CIRCUIT

(last k bits)

(first r bits)

76

the encoder can be designed as

§4.2.2 Hamming Weight and Distance

Hamming Distance of two code words is the number of

positions in which they differ. Hamming Weight of a code word is the

number of non-zero positions in it. It is clear

), 2c (:, 121 ccdc H)(: iHi cwc

In Example 4.3 : n = 7, k = 4, r = 3, determine the Hamming weight for

)1000001(1 c

)0010001(1 c

77

Minimum Hamming Distance

The Minimum Hamming Distance of a linear block code is equal to the Minimum

Hamming Weight of the non-zero code vectors.

In Example 4.3 : n = 7, k = 4, r = 3, dmin = wmin=3

§4.2.3 Error Detection and Correction Capacity

Rules

1min tdi) Detect Up to t Errors IF AND ONLY IF

ii) Correct Up to t Errors IF AND ONLY IF min 2 1d t iii) Detect Up to td Errors and Correct Up to tc Errors IF AND ONLY IF

min min2 1 and 1c cd t d t t d

In Example 4.3: n = 7, k = 4, r = 3

The minimum Hamming distance is 3, and, such, the number of errors which can be

detected is 2 and the number of errors which can corrected is equal to 1. The code does

not have the capability to simultaneously detect and correct errors. (see the relations

between dmin and the correction/detection capability of a code).

Error Vector

For received vectors v c + e

Error Vector

No Error Ex: An error at the first bit (0 0 0 0 0 0 0 )e e (1000000)

78

Parity Check Matrix

TGH = 0

G=Generator Matrix k x n

H=Parity Check Matrix

For Systematic Code in which ][ kkIPG , then

For a Code Word

In Example 4.3: n = 7, k = 4, r = 3, find its parity-check matrix.

From the generator matrix G in Example 4.3,

n-k x n

k x n-k

0: TT GHmHcc

0 1 2 0 1 2 3( )c c c m m m m TH 0

0 0 2 3

1 0 1 2

2 1 2 3

0

0

0

c m m m

c m m m

c m m m

0

0

T

T

Hc

GHm

Parity Check Equations

79

Syndrome Calculation and Error Detection

Syndrome is defined as:

In Example 4.3: n = 7, k = 4, r = 3, if

There is an error, but this error is undetectable!

The error vector introduces 3 error. But the minimum Hamming distance for this code is

3, and, such, a 3 error pattern can lead to another code word!

Note: When we say that the number of errors which can be detected is 2, we refer at all

error patterns with 2 bits in errors. However, the code is capable to detect patterns with

more than 2 errors, but not all !

Question: What is the number of error patterns which can be detected with this code?

Answer: The total number of error patterns is 2n-1 (the all-zero vector is not an error!).

However, 2k-1 of them lead to code words, which mean that they are not detectable. So,

the number of error patterns which are detectable is 2n-2k.

Ts = vH

1 x n n x n-k 1 x n-k

If =0 v=cIf 0 v c

80

Error Correction Capacity

Likelihood Test

Why and When the Minimum Hamming Distance is a Good Decoding Rule ?

Let 21,cc be two code words and v be the received word,

If 1c is the actual code word, then the number of errors is

If 2c is the actual code word, then the number of errors is

Which of these two code words is most likely based on v ?

The most likely code word is the one with the greatest probability of occuring with the

received word, i.e.,

This is called the Likelihood Ratio Test. or, equivalently,

1

1 2

2

ln ln 0p p

c

v,c v,c

c

Log-Likelihood Ratio Test.

81

The joint probabilities can be further written as

(i = 1,2) i i

p p pv, c v i|c c

For the BSC Channel (independent errors)

(i = 1,2) Pr( ) (1 )i

i

t nip t p p it v|c

where ),( ii cvdt is the number of errors that have occurred during the transmission of

code word ic . Since there is a specific error pattern for a received word, the binomial

coefficient does not appear in above.

IF

Condition 1: the code words have the same probability and

Condition 2: p < 0.5 (p is the crossover probability of the BSC channel)

By performing some calculations, one gets that:

82

§4.2.4 Decoding Linear Block Codes

Standard Array Decoder

The simplest, least clever, and often most expensive strategy for implementing error

correction is to simply look up c in a decoding table that contains all possible v . This is

called a standard-array decoder, and the lookup table is called the Standard Array. The

first word in the first column of the standard array is the zero code-word (it also means

zero error). If no error, the received words are the code words. These are given in the first

row of the standard array. For a linear block code (n, k), the first row contains 2k code

words, including the zero code-word. All 2n words are contained in the array. Each row

contains 2k words. So, the number of columns is 2k. The number of rows will then be

2n/2k =2n-k=2r. The standard array for a (7, 4) code can be seen in the table on next page.

When decoding with the standard array, we indentify the column of the array where the

received vector appears. The decoded vector is the vector in the first row of that column.

Each row is called Coset. In the first column we have all correctable error patterns.

These are called Coset Leaders. Decoding is correctly done if and only if the error

pattern caused by the channel is a coset leader (including the zero-vector). The

words on each column, except for the first element, which is a code word, are obtained by

adding the coset leader to the code word.

Question: How do we chose the coset leaders?

To minimize the probability of a decoding error, the error patterns that are more likely to

occur for a given channel should be chosen as coset leaders. For a BSC, an error pattern

of smaller weight is more probable than an error pattern of larger weight. Therefore,

when the standard array is formed, each coset leader should be chosen to be a vector of at

least weight from the remaining available vectors. Choosing coset leaders this way, each

coset leader will have the minimum weight in its coset. In a column, one gets the words

which are at minimum distance of the code word, which is the first element of the column.

A linear block code is capable to correct 2n-k error patterns (including zero error).

83

84

Syndrome Decoder

Standard array decoder becomes slow when the block code length is large. A more

efficient method is syndrome decoder. Syndrome Vector is defined as:

where v is the received vector, H is the parity-check matrix. The syndrome is

independent on the code word; It depends only on the error vector (for a specific code).

All the 2k n-tuples (n bit words) of a coset have the same syndrome.

Steps in the Syndrome Decoder

1. For the received word, the syndrome is calculated by

2. The coset leader is calculated.

3. The transmitted code word is obtained by

§§4.2.44.2.4: Decoding Linear Block Codes: Decoding Linear Block Codes

Syndrome Decoder

Ts = vH

1 x n n x n-k 1 x n-k

If =0 v=cIf 0 v c

85

T Ts v H e He

c v e

Example 4.4: Design the Syndrome decoder for Example 4.3 in which n = 7, k = 4, r

= 3

For the parity-check matrix in Example 4.3 and the single-bit error pattern:

86

§4.2.5 Hamming Codes

Definition

Hamming codes are important linear block codes, used for single-error controlling in

digital communications and data storage systems. For any integer , there exist a

Hamming Code with the following parameters:

3r

Code Length:

Number of information digits:

Number of parity check digits:

Error correction capability:

Systematic Hamming code has:

In Example 4.3: n = 7, k = 4, r = 3

2 1 7rn Code Length:

2 1 4rk r Number of information digits:

3r n k Number of parity check digits:

min1 ( =1)t dError correction capability:

Thus, the code given as example is a Hamming code.

Example 4.5: Construct the parity-check matrix for the (7, 4) systematic Hamming code.

Example 4.6: Write down the generator matrix for the Hamming code of Example

4.5.

87

Perfect Code

If we form the standard array for the Hamming code of length , the n-tuples of

weight 1 can be used as coset leaders. Recall that the number of cosets is !

That would be the zero vector and the n–tuples of weight 1. Such a code is called a

12 rn

rkn 22/2

Perfect Code. “PERFECT” does not mean “BEST”!

A Hamming code corrects only error patterns of single error and no others.

Some Theorems on The Relation Between the Parity Check

Matrix and the Weight of Code Words

Theorem 1: For each code word of weight d, there exist d columns of H, such that the

vector sum of these columns is equal to the zero vector.

The reciprocal is true.

Theorem 2: The minimum weight (distance) of a code is equal to the smallest number of

columns of H that sum to 0.

In Example 4.3: n = 7, k = 4, r = 3

The columns of H are non-zero and distinct. Thus, no two columns add to zero, and the

minimum distance of the code is at least 3. As H consists of all non-zero r-tuples as its

columns, the vector sum of any such two columns must be a column in H, and thus, there

are three columns whose sum is zero. Hence, the minimum Hamming distance is 3.

88

Shortened Hamming Codes

If we delete columns of H of a Hamming code, then the dimension of the new parity

check matrix, H’, becomes .. Using H’ we obtain a Shortened Hamming

Code, with the following parameters:

(2 1 )rr

Code Length:

Number of information digits:

Number of parity check digits:

Minimum Hamming Distance:

In Example 4.3: We shorten the code (7,4)

We delete from PT all the columns of even weight, such that no three columns add to zero

(since total weight must be odd). However, for the column of weight 3, there are 3

columns in Ir , such that the 4 columns’ sum is zero. We can thus conclude that the

minimum Hamming distance of the shortened code is exactly 4. This increases the error

correction and detection capability.

The shortened code is capable of correcting all error patterns of single error and detecting

all error patterns of double errors. By shortening the code, the error correction and

detection capability is increased.

89

Ch 5 Cyclic Codes

§5.1 Description of Cyclic Codes

Definition

Cyclic code is a class of linear block codes, which can be implemented with extremely

cost effective electronic circuits.

Cyclic Shift Property

A cyclic shift of is given by 0 1 2 1( ... n nc c c c uIn general, a cyclic shift of c can be written as

A Cyclic Code is a linear block code C, with code words such that for every , the vector given by the cyclic shift of is also a code word.

Example 5.1: Verify the (6,2) repetition code is a cyclic code. Since a cyclic shift of any of its code vectors results in a vector that is element of C. Check by yourself. Example 5.2: Verify the (5,2) linear block code defined by the generator matrix

Its code vectors are

is not a cyclic code.

Its code vectors are Gmc

) c

0 1 2 1( ... n nc c c c )ccCc

{(000000), (111111), (010101), (101010)}C

1 0 1 1 1

0 1 1 0 1

G

90

0 0 0 0 0

1 0 1 1 1

0 1 1 0 1

1 1 0 1 0

The cyclic shift of (10111) is (11011), which is not an element of C. Similarly, the cyclic

shift of (01101) is (10110), which is also not a code word.

Code (or Codeword) Polynomial

0 1 2 1( ... )n nc c c c c

Code Word

One to-one correspondence

2 10 1 2 1( ) ... n n

n nc X c c X c X c X

Code Polynomial of degree (highest exponent of X) n -1 or less.

Theorem: The non-zero code polynomial of minimum degree in a cyclic code is

unique, and is of order r.

Theorem 1: A binary code polynomial of degree n -1 or less is a code word if and only if

it is a multiple of . )(Xg

where are the k information digits to be encoded.

1

( ) ( ) ( )c X m X g X

An (n, k) cyclic code is completely specified by its non-zero code polynomial of

minimum degree, g(X), called the generator polynomial .

0 1 11( ) ( ... ) ( )k

km mc X X Xm g X

degree n -1 or less

degree k -1 or less

degree r

0 ,..., km m

91

Theorem 2: The generator polynomial, , of an (n, k) cyclic code is a factor of )(Xg

1nX .

Question: For any n and k, is there an (n, k) cyclic code?

Theorem 3: If is a polynomial of degree r = n - k and if it is a factor of

then generates an (n, k) cyclic code.

)(Xg

)(Xg

Remark: For n large, 1nX may have many factors of degree n - k. Some of these

polynomials generate good codes, whereas some generate bad codes.

Example 5.3: Determine the factor of X7+1 that can generate (7, 4) cyclic codes.

For a (7,4) code, r=n-k=7-4=3, the generator polynomial can be chosen either as

or

Systematic Cyclic Code For message: generate systematic cyclic code includes:

m X 0 , the steps to11

1( ) ...mm mX kk X

Step 1: Step 2: Step 3:

degree ≤ n - k -1

Proof: ( ) ( ) ( ) ( )n kX m X a X g X b X

degree ≤ n -1 degree = n - k

( ) ( ) ( ) ( )n kb X X m X a X g X

0 1 1 01 1

1... ...n k n k n

parity check bits message

k kb b mX Xm

Code word

b X X

92

Example 5.4: Find (7, 4) cyclic code, generated by when

3( ) 1 X X g X

i.e.

Step 1: Multiply the message m(X) by Xn – k.

Step 2: Obtain the remainder b(X) from dividing Xn – k m(X) by g(X).

Step 3: Combine b(X) and Xn – k m(X) to form the systematic code word.

§§55..22:: GGeenneerraattoorr aanndd PPaarriittyy--cchheecckk MMaattrriicceess

Generator Matrix

Let (n, k) be a cyclic code, with the Generator Polynomial

Then, a code polynomial can be written as

3( ) 1 X m X (1001)m

c

1001k bits o

( 011 )parity check bi f the mes ets sag

93

which is equivalent to the fact that span C. 1, ,...,( ) ( ) ( )kXg X X g Xg X

1

( )

( )

...

( )k

g X

Xg X

X g X

G

0 1 2

0 1 1

0 2 1

... 0 0 0 ... 0

0 ... 0 0 ... 0

0 0 ... 0 ... 0

...............................................

n k

n k n k

n k n k n k

g g g g

g g g g

g g g g

0

......................

0 0 0 ... 0 0 .............

with 1.

n k

n k

g

g g

k x n

Systematic Generator Matrix

In general, G is not in a systematic form. However, we can bring it in a systematic form

by performing row operations.

[ ]kG PIReminder: For a Systematic Code

Example 5.5: Determine the systematic generator matrix for (7, 4) cyclic code, generated

by 3( ) 1g X X X

1 1 0 1 0 0 0

0 1 1 0 1 0 0

0 0 1 1 0 1 0

0 0 0 1 1 0 1

G

R3+R1R4+R1+R2

1

1

1

1

1 1 0 0 0 0

0 1 1 0 0 0=

1 1 1 0 0 0

1 0 1 0 0 0

R1 R2 R3 R4

systematic form

94

The (7, 4) cyclic code, generated by when message is (1100) 3 ( ) 1g X X X

for other messages, the code see below:

The (7, 4) cyclic code in systematic form, generated by when

message is (0011)

3 ( ) 1g X X X

95

for other messages, the code see below:

Parity-check Matrix

We know: 1 ( ) (nX )g X h X

degree k degree r =n - k

Parity-check Polynomial

Let be a code word, 0 1 1( ... )c nc c c

( ) ( ) ( )c X a X g X

degree ≤ k-1

96

Thus, do not appear in , i.e., 11 ,,, nkk XXX nXXaXa )()(

the coefficients of must be equal to zero, then 1 1, ,...,k k nX X X

00, 1 -

ki n i ji

h c j n k

from which we can set up n-k equations.

Reciprocal of h(X) is defined as

It can be shown that this is a factor of 1nX , thus, it can generate an (n, n-k) cyclic code. The generator matrix of the (n, n-k) cyclic code is

1 2 0

1 1 0

2 1 0

... 0 0 0 ... 0

0 ... 0 0 ... 0

0 0 ... 0 ... 0

................................

k k k

k k

k

h h h h

h h h h

h h h h

H

0

0

.....................................

0 0 0 ... 0 0 ................

with 1.k

h

h h

As for a linear block code, any code word is orthogonal to every row of H,

( )T cH 0

H is a Parity Check Matrix of the cyclic code. h(X) is called the parity polynomial of

the code. A cyclic code is uniquely specified by h(X). Remark: The polynomial generates the dual code of C, (n, r). )( 1XhX k

Example 5.6: Find the dual code generator polynomial for (7, 4) cyclic code, generated

by 3( ) 1g X X X

4, 7 4 3k r n k

97

Generates

7 3 4r

§5.3 Encoder for Systematic Cyclic Codes

Find Remainder by Binary Polynomial Division

Recall the 3 steps to generate systematic cyclic codes are:

Step 1: Multiply the message m(X) by Xn – k.

Step 2: Obtain the remainder b(X) from dividing Xn – k m(X) by g(X).

Step 3: Combine b(X) and Xn – k m(X) to form the systematic codeword.

In the 2nd step, we assume that mk-1=1, the remainder can be found by considering the

calculation of

All the 3 steps can be accomplished with a division circuit of (n-k)-stage register with

feedback based on g(X). The mechanism of the division process have a simple

implementation for binary polynomials. We assume that the bits are transmitted serially

with the highest power of X being transmitted first.

We illustrate the mechanism using n = 7 and r = 3, i.e., the (7,4) code.

degree ≤ n-1 0 11( ) ...n k n k

knm mX m X X X

)1/( 11

11

XgXgXX r

rrn

98

Remainder after this cycle

2

1 1

1

g

S g

1

2

1

1

0

...

r

r

g

g

S

g

g

In the general case,

In the next division cycle we have divided by 341

52 XXgXg 11

22

3 XgXgX

2 2 2

2 1 1 1

1 0 1 0

0 1 0 1

1 0 0 1 1 0 0

g g g

S g g g

1

2

2 1

1

0

1 0 ... 0

0 1 ... 0

... ... ... ... ...

0 0 ... 1

0 0 ... 0

r

r

g

g

S S

g

g

Remainder after this cycle

1S

1S

( 1) ( 1)

r rI

In the general case,

The process continues 2 more times, for a total of k cycles (k=4 here),

and

The process for the terms m2 X5 is the same, except only k-1=3 cycles are involved . The

same is true for each successive term in Xrm(X), with one less shift in for each decrease in

the power of X.

3 2S S 4 3S S

99

For a general (n, k) code, we can represent the long-division process for the remainder

vector as

1

2

1 0

1

0

... , 1, 2,..., ,

r

r

t t k t

g

g

S S m t k S

g

g

0

Example 5.7: For k = 4 and r = 3, find the remainder vector.

Homework: Write S3 and S4 for the (7,4) code.

Encoder Circuit

After obtaining the remainder, run Step 3: Combine b(X) and Xn – k m(X) to form the

systematic code word.

In Example 5.7: For k = 4 and r = 3, design the encoder circuit.

100

3 3 3 23 2 1 0( ) ( )X m X X m X m X mX m

Codeword

101

For the general case, an endocder of n-k-stage shift register is

D D D

g0 g2 g1

Parity-check digits

3m

2m

0 3g m 1 3g m 2 3g m

0b1b 2b

1S

2S0 2 0 2 3g m g g m 0 3 1 2 1 2 3m g m g g m 2

1 3 2 2 2 3g m g m g m

2 2 3m g m

2 3g m

g

and so on …………………………………………………………………….

3( ) 1Homework: Find the encoding circuit for the (7, 4) code, generated by g X X X

Encoding a cyclic code can also be accomplish by using its parity polynomial,

As hk =1 (see formula in slide 2)

(1)

11 1( ) 1 ... 1k k

kh X h X h X X

1

0, 1 -

kn k j i n i ji

c h c j n k

This is known as a difference equation.

0 1

0 1 1 1

- parity check binary digits information binary digits ...

( ... ... )

k

n k n k n

n k km m

c c c c c

For a Systematic Code:

Given the k info bits, (1) is a rule for determining the n-k parity check digits, . 110 knccc

The encoder circuit using parity polynomial is

h0=1

102

The Encoding Operations can be described in the following steps:

Step1: Initially, Gate 1 is turned on and Gate 2 is turned off. The k information digits, , are shifted into the register and the communication channel simultaneously.

1k0 1 1( ) ... km X m m X m X

Step 2: As soon as the k information bits have entered the shift register, Gate 1 is turned off and Gate 2 is turned on. The first parity check digit,

is formed and appear at point P.

Step 3: The register is shifted once. The first parity-check digit is shifted into the channel and into the register. The second parity check digit, is formed and appear at point P. Step 4: Step 3 is repeated until n-k parity-check digits have been formed and shifted into

the channel. Then, Gate 1 is turned on and gate 2 is turned off. The next message will be

shifted into the register.

Remark 1: This is a k-stage shift register.

Remark 2:

If r > k, the k-stage encoding circuit is more economical.

Otherwise, the (n-k)-stage encoding circuit is preferable.

Homework: Find the encoding circuit for the (7, 4) code, generated

by , based on h(X). 3( ) 1g X X X

§5.4 Syndrome Computation and Error Correction

Definition of Syndrome

Cyclic Codes are Linear Block Codes. For a received word 0 1 1( ... )nv v v v

Ts vHSyndrome is defined as

We know that , so if , v is a codeword. TcH 0 Ts = vH 0

103

10 1 1( ) ... n

nv X v v X v X For Cyclic Codes: Received Polynomial

or ( ) ( ) ( ) ( )v X a X g X s X

degree ≤ r-1 degree r=n-k degree ≤ n-1

The r = n-k coefficients of S(X) form the Syndrome S. if and only if

is a code polynomial (a multiple of g(X)).

( )s X 0 ( )v X

Syndrome Computation Circuit

)(Xs is the remainder of the division v(X) / g(X). It can be computed with a division circuit, which is identical to the (n-k)-stage encoding circuit, except that the received polynomial is shifted into the register from the left end.

The received polynomial is shifted into the register with all stages initially set to zero. As soon as v(X) has been shifted into the register, the content in the register form the syndrome s(X). Properties of Syndrome Let s(X) be the syndrome of a received polynomial v(X). The remainder s(1)(X) resulting

from dividing Xs(X) by the generator polynomial g(X) is the syndrome of v(1)(X), which is a cyclic shift of v(X) (For proof: see the definition of the syndrome). The syndrome

s(1) (1)(X) can be obtained by shifting the register (syndrome) once, with s(X) as the initial content and with the input gate disabled. This is equivalent with dividing Xs(X) by g(X).

(X) of v

In general, the remainder s(i)(X) resulting from dividing Xis(X) by the generator

polynomial g(X) is the syndrome of v(i)(X), which is a cyclic shift of v(X). This

104

property is useful in decoding cyclic codes. The syndrome s(i)(X) of v(i)(X) can be obtained by shifting the register (syndrome) i times, with s(X) as the initial content and

with the input gate disabled. This is equivalent with dividing Xis(X) by g(X).

Example 5.8: Find the syndrome circuit for the (7,4) cyclic code generated by . Suppose that the received vector is .

Calculate the syndrome and compare it with the contents of the shift register after the 7th shift. Show the contains of the shift register with the input gate disabled and comment on the result.

The remainder of v(X) / g(X) is , and so, the syndrome is , or . For the content of the shift register, see the next table, which is related to the syndrome circuit.

3 (0010110)v( ) 1g X X X

(0010110)v2 1X

105

With the input gate disabled, the syndrome of is obtained by shifting the register once, the syndrome of is obtained if we shift the register twice, and so on.

(1) (0001011( )(2)( ) (1000101

)v X)v X

Let be the transmitted code polynomial, and let (c X be the error pattern. Then, the received polynomial is As and , then

)

( ) ( ) ( )m X g Xc X ( ) ( ) ( ) ( )v X a X g X s X The syndrome is computed based on the received vector, and the decoder has to estimate the error pattern e(X) based on the syndrome. However, the error pattern is not known at the decoder. The syndrome is equal to the remainder of dividing the error pattern by the generator polynomial.

Remark: One can notice that if and only if or (the error pattern is a codeword)

( )e X 0( )s X 0( ) ( )e X c X

For the latter, the error pattern is undetectable ! Remark: The error detection circuit is simply a syndrome circuit with an OR gate whose inputs are the syndrome digits. If the syndrome is non-zero, the output of the OR gate is 1, and the presence of errors has been detected. CYCLIC CODES ARE VERY EFFECTIVE FOR DETECTING ERRORS, RANDOM OR BURST ! Burst Error Patterns Definition: The Burst Length of an error polynomial e(X) is defined as the number of bits from the first error term in e(X) to the last error term, inclusive. Example: has the burst length b=7-3+1=5. 3( )e X X X 7

By definition, there can be only one burst in a block.

Example: has the burst length b=20-3+1=18, and not two bursts of length 5 and 2.

3 7 19 20( )e X X X X X Definition: An error pattern with errors confined to i high-order positions and l-i low-order positions is also regarded as a burst of length l. This is called an end-around burst. Example: is an end-around burst of length 7. ( 0101 111000000 0)e

106

CASE 1: Suppose that e(X) is a burst of length r = n-k or less.

( ) ( )je X X B X

degree ≤ n-k-1

Because degree{ ( )} degree{ ( )}B X g X ( ) is not a factor of ( )g X B XX

Also is not a factor of ( ), as ( ) divides 1nX g X g X

( ) ( ) is not divisible by ( )je X X B X g X or, equivalently, the syndrome caused by e(X) is not equal to zero. The (n, k) cyclic code is capable of detecting any error burst of length

CASE 2: Suppose that e(X) is a burst of length r +1 = n-k+1, and let it start from the ith

position. Thus, it ends at the (i+n-k)th position. Errors are confined to

with

1, , ...,i i i ne e e k

1i i n k e eThere are such bursts (the error bits in the first and last positions are 1, and only

the n-k+1-2 (i.e. n-k-1) positions can take any value, i.e., either 0 or 1). Among these,

only one cannot be detected (zero syndrome), i.e.,

12n k

( ) ( )ie X X g X

The fraction of undetectable bursts of length n – k +1 is

Error Detection Capability

CASE 3: Suppose that e(X) is a burst of length l > n – k + 1 or (r + 1). then there are

such bursts (the bits in the first and last positions are 1, and only the l -2 positions can

take any value, i.e., either 0 or 1). Among these, the undetectable ones (zero syndrome)

must be of the form,

degree=n-k

The number of such bursts is

The fraction of undetectable burst errors of length l is

degree l -1 ( ) 1

0 1 ( ) 1( ) ... l n kl n ka X a a X a X

0 ( ) 11, 1l n ka a

( ) ( ) ( )ie X X a X g X

107

Example 5.9: Analyze the error detection capacity of the (7, 4) cyclic code generated by

3( ) 1g X X X

The minimum Hamming distance for this code is 3, thus, the code can detect up to 2

random errors (see the relation between dmin and td.)

Also, it detects 112 error patterns

The code can detect any burst errors of length

It also detects many burst of length >3.

The fraction of undetectable error patterns with n-k+1=4 errors is .

The fraction of undetectable error patterns with more than 4 errors is

Cyclic Redundancy Check (CRC) Codes

CRC are error-detecting codes typically used in ARQ systems. CRC has no error

correction capability, but they can be used in combination with an error-correcting code.

The error control system is in the form of a concatenated code.

CRC

ENCODERError Correction

ENCODER Tx

Error Correction

DECODERCRC Syndrome

Checker Rx

§5.5 Decoding of Cyclic Codes

Decoding Steps

The decoding process consists of three steps, as for decoding of linear block codes. These

are:

i)

ii)

108

iii)

Syndrome Computation: The syndrome for cyclic codes can be computed with a

division circuit whose complexity is linearly proportional to the number of parity check

binary digits, i.e., n-k.

Error Corrections: The error-correction step is simply adding (mod-2) the error-pattern

to the received vector (exclusive-or gate).

The association of the syndrome with an error pattern can be completely specified by a

decoding table. This is a straightforward approach to the design of a decoding circuit is

via a combinational logic circuit that implements the table look-up procedure. However,

the limit to this approach is that the complexity tends to grow exponentially with the code

length and number of errors to be corrected.

Cyclic Codes have considerable algebraic properties, which allow a low complexity

structure of the encoder. The cyclic structure of a cyclic code allows us to decode a

received vector v(X) serially. The received digits are decoded one at a time, and each

digit is decoded with the same circuitry.

Decoding Circuit (Decoder)

2

109

Two Cases

As soon as the syndrome has been computed, the decoding circuit checks whether the

syndrome s(X) corresponds to a correctable error pattern ,

with an error at the higher position Xn -1, i.e., en-1=1.

10 1 1( ) ... n

ne X e e X e X

CASE I: If s(X) does not correspond to an error pattern with en-1=1, the received

polynomial and the syndrome register are cyclically shifted once simultaneously. We

obtain

and the syndrome register form , the syndrome of . (1) ( )v X(1) ( )s X

Now, the second digit, vn -2 becomes the first digit of v(1) (X).

The same decoding circuit checks whether s(1) (X) corresponds to an error at location Xn-1.

CASE II: If s(X) of v(X) does correspond to an error pattern with en-1=1, the first

received digit vn-1 is an erroneous digit, and it must be corrected. The correction is carried

out by the sum 1 1.n nv e This correction results in a modified received polynomial

2 11 0 1 2 1 1( ) ... ( )n n

n n nv X v v X v X v e X

The effect of en-1 on the syndrome is removed from the syndrome s(X). v1(X) and the

syndrome register are cyclically shifted once simultaneously. The polynomial which

results now is

Its syndrome, , is the remainder resulting from dividing

by the generator polynomial g(X).

(1) 11 1 1 0 2( ) ( ) ... n

n n nv X v e v X v X

(1)

1 (s X )

1 1( ) ( ) ( ) ( )n nX a X g X s X X

1[ ( ) ]nX s X X

Proof

1 (Error Correction)nX ( ) ( ) ( ) ( )v X a X g X s X

(Shift Once)Xv X

( ) ( ) ( ) ( )Xv X X Xa X g X Xs X n nX

110

Such that the remainder of is the remainder of

, which is , because

( ) : ( )nXv X X g X( ) | ( 1)ng X X 1[ ( ) ] : ( )nX s X X g X (1) ( )s X 1

Therefore, if 1 is added to the left end of the syndrome register while it is shifted, we

obtain . The decoding circuitry proceeds to decode vn-2. Whenever an error is

detected and corrected, its effect is removed from the syndrome.

(1)1 ( )s X

Remarks:

The decoding stops after n shifts (= total number of binary bits in a received

word).

If e(X) is a correctable error pattern, the contents of the syndrome register is zero

at the end of the decoding operation, and the received vector has been correctly

decoded. Otherwise, an uncorrectable error pattern has been detected.

This decoder applies in principle to any (n, k) cyclic code.

But whether it is practical depends entirely on its error-pattern detection circuit.

In some cases this is a simple circuit.

Design Decoder

Example 5.10: Design the decoder for the (7,4) cyclic code generated by

3( ) 1g X X X

min 3d

It is capable of correcting any single error over a block of 7 bits. There are 7 such error

patterns. These and the all-zero vector form all the coset leaders of the decoding table.

They form all correctable error patterns. Suppose that the received polynomial,

60 1 6( ) ...v X v v X v X

is shifted into the syndrome register from the left end.

Write syndrome and error patterns in Table 1 on next page.

111

1

We see that is the only error pattern with an error located at . When this

error pattern occurs, the syndrome in the syndrome register is (101), after the entire v(X)

has entered the syndrome register. The detection of this syndrome indicates that v6 is an

erroneous digit and must be corrected.

6

6( )e X X 6X

3

Suppose that the single error occurs at location , i.e., iX ( ) iie X X 0 6i ,

After the entire received polynomial has been shifted into the syndrome register, the

syndrome in the register will not be (101). However, another 6-i shifts, the contents in the

syndrome register will be (101) and the next received digit to come out of the register

will be the erroneous digit. Only the syndrome (101) needs to be detected.

We use a 3 input AND gate.

112

In the sequel, we give an example for the decoding process when the codeword

( ) is transmitted and

113

( ) is received. A single error occurs at location X2.

When the entire received polynomial has been shifted into the syndrome and buffer

registers, the syndrome register contains (001). We see that after 4 shifts, the content in

the syndrome register is (101) and the next digit to come out from the buffer is the

erroneous digit, v2.

(1001011)c 63 5( ) 1c X X X X (1011011)v2 3 5( ) 1v X X X X X 6

4 3

Ch 6 Convolutional Codes

§6.1 Description of Convolutional Codes

Compare with Linear Block Code

Convolutional codes are the second major form of error-correcting channel codes. They

differ from the linear block codes in both structural form and error correcting properties.

With linear block codes, the data stream is divided into a number of blocks of k binary

digits, each block is encoded into an n-bit code word. On the other hand, Convolutional

Codes convert the entire data stream into a single code word.

The code rate for the linear block codes can be ≥0.95, but they have limited error

correction capabilities. For convolutional codes, the code rate is usually below 0.9, but

they have more powerful error-correcting capabilities and good for very noisy channels

with high raw error probabilities. Puncturing is used to achieve higher code rates.

Encoding

The source data is broken into frames of k0 bits per frame. M +1 frames of source data are

coded into n0-bit code frame, where M is the Memory Depth of the shift register.

Convolutional codes are encoded using shift registers. As each new data frame is read,

the old data is shifted one frame to the right, and a new code word is calculated.

Characteristics of the Code: Code Rate , Constraint Length

For binary convolutional codes: k0=1

114

Example 6.1: For a , binary convolutional encoder below, determine

its code polynomials.

1/ 2R 3

, so k0=1 (binary), n0=2, M=2

For each 1-bit (k0) frame of the input message m(X), we obtain 2-bit (n0) code frame on

the output with one bit in c0(X) and one in c1(X). These are interleaved and sent as a two-

bit symbol sequence.

1/ 2R 3

We can associate two code polynomials,

0 0( ) ( ) ( )c X m X g X

such that 1 1( ) ( ) ( )c X m X g X

The vector corresponding to the output is

    C(X) = [c0(X)  c1(X)] = m(X) [g0(X)  g1(X)] = m(X) G(X)

For example, if the message is m(X) = 1 + X + X^3,

then c0(X) and c1(X) are obtained by multiplying m(X) by each generator polynomial over GF(2) (the products are worked out in the sketch after this example).

Let us assume that the highest power of X is the first symbol transmitted, and that we

first send c0, and then c1. Thus, the transmitted sequence is

    c0(0) c1(0)   c0(1) c1(1)   ...   c0(t) c1(t)

You can also input the message to the encoder directly to verify the result. The message

has 4 bits, i.e., ( 1 0 1 1), but the transmitted sequence contains 12 transmitted bits.

Therefore, the Code Rate is 4/12=1/3, not 1/2 !
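As a check, the sketch below carries out the two polynomial multiplications over GF(2) and interleaves the results. The generator polynomials are not legible in the encoder figure, so the sketch assumes the common choice g0(X) = 1 + X + X^2 and g1(X) = 1 + X^2 (octal 7 and 5), which is consistent with the punctured-code notation of §6.4 and with the sequences used in Example 6.5.

# Sketch of the rate-1/2 convolutional encoding of Example 6.1 (generators 7 and 5 octal assumed).

def poly_mul(a, b):
    """Multiply two binary polynomials over GF(2); lists are lowest power first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= ai & bj
    return out

g0 = [1, 1, 1]        # g0(X) = 1 + X + X^2   (assumed)
g1 = [1, 0, 1]        # g1(X) = 1 + X^2       (assumed)
m  = [1, 1, 0, 1]     # m(X) = 1 + X + X^3, i.e. the message (1 0 1 1) written highest power first

c0 = poly_mul(m, g0)  # [1, 0, 0, 0, 1, 1] -> c0(X) = 1 + X^4 + X^5
c1 = poly_mul(m, g1)  # [1, 1, 1, 0, 0, 1] -> c1(X) = 1 + X + X^2 + X^5

# Interleave, highest power first, c0 before c1 in each two-bit symbol.
symbols = [f"{c0[t]}{c1[t]}" for t in range(len(c0) - 1, -1, -1)]
print(" ".join(symbols))      # 11 10 00 01 01 11 -> 12 transmitted bits for 4 message bits

With these assumed generators the last two symbols are 01 and 11, exactly the flush symbols discussed under the Effective Code Rate below.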


Effective Code Rate

In Example 6.1 the rate came out as 1/3 rather than 1/2 because the encoder has M = 2 memory elements and has to "flush" its buffer to complete the code sequence. The last two code symbols in the transmitted code sequence, i.e., 01 and 11, correspond to emptying the encoder's shift register. The first 8 bits carry the 4 message bits at rate 1/2, but once the 4 flush bits are counted the Effective Code Rate is 4/12 = 1/3. This reduction in the code rate is known as the Fractional Rate Loss.

For a convolutional code with rate R (K bits of information, k0 = 1) and memory depth M, the Effective Code Rate is

    R_eff = K / [n0 (K + M)] = R · K / (K + M)

Convolutional codes are efficient when K >> M; then the effective code rate approaches the code rate R.
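A quick numerical check of the fractional rate loss, using the values of Example 6.1 (R = 1/2, M = 2):

# Effective code rate R_eff = R * K / (K + M) for a binary (k0 = 1) convolutional code.
R, M = 0.5, 2
for K in (4, 10, 100, 1000):
    print(K, R * K / (K + M))   # K = 4 gives 1/3; for K >> M the value approaches R = 1/2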

Memory Depth and Constraint Length

For a rate R convolutional code, the generator vector is defined as

    G(X) = [g0(X)  g1(X)  ...  g_{n0-1}(X)]

and the vector of the code polynomials as

    C(X) = m(X) G(X)

Convolutional codes are LINEAR, as the sum of two code polynomials is a code polynomial. There is a strong similarity with cyclic codes, and convolutional codes share some of their properties.

Memory Depth M: the number of message frames stored in the encoder's shift register.

Constraint Length ν = M + 1: the number of message frames that influence each code frame. Sometimes the constraint length is defined simply as M.

For a given code rate, if ν increases, a better error-rate performance is obtained, at the expense of increased decoder complexity.

§6.2 Structure Properties of Convolutional Codes

State Diagram

The convolutional encoder is a “state machine” (it is convenient to represent its operation

using a State Diagram). With M memory elements, it has 2^M states.

Example 6.2: Find the state diagram for the encoder in Example 6.1.

M = 2, so we associate the 2^M = 4 states with the contents of the shift register.

State Diagram: used to analyze the performance of a convolutional code. [Figure: state diagram with each branch labeled input/output.]
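The state diagram can also be tabulated directly. The sketch below lists, for each state and input bit, the next state and the output symbol; it uses the assumed (7, 5) generators and labels the state as (m_{t-2}, m_{t-1}), so that S0 = 00, S1 = 01, S2 = 10, S3 = 11.

# Sketch of the state-transition table for the assumed R = 1/2, M = 2 encoder (generators 7, 5 octal).

def step(state, bit):
    """state = (m_{t-2}, m_{t-1}); returns (next state, output symbol 'c0c1')."""
    m2, m1 = state
    c0 = bit ^ m1 ^ m2            # g0 = 1 + X + X^2 (assumed)
    c1 = bit ^ m2                 # g1 = 1 + X^2     (assumed)
    return (m1, bit), f"{c0}{c1}"

states = [(0, 0), (0, 1), (1, 0), (1, 1)]     # S0, S1, S2, S3
for i, s in enumerate(states):
    for bit in (0, 1):
        nxt, out = step(s, bit)
        print(f"S{i} --{bit}/{out}--> S{states.index(nxt)}")
# Prints, e.g., S0 --1/11--> S1 and S2 --0/11--> S0, matching the branch labels used later in Example 6.4.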

Trellis Diagram

The Trellis Diagram shows the states at successive time steps and is used to analyze the performance of a convolutional code. [Figure: trellis diagram over time t.]


Adversary Paths

The error-correcting property of a convolutional code is determined by the adversary

paths through the trellis. Adversary Paths: the paths that begin in the same state and

end in the same state, and have no state in common at any step between the initial and

final states.

Adversary Paths and Hamming Distance

For the following paths through the trellis,

    P0: S0 → S0 → S0 → S0 → S0
    P1: S0 → S1 → S2 → S0 → S0
    P2: S0 → S1 → S3 → S2 → S0
    P3: S0 → S0 → S1 → S2 → S0

P0, P1 are adversaries from time index 0 (they remerge in S0 at the third step),
P2, P3 are adversaries from time index 0 (they remerge in S2 at the third step),
P0, P3 are adversaries from time index 1,
P0, P2 are adversaries from time index 0 (they remerge only at the final step).

For each such pair, the Hamming distance between the corresponding code sequences can be read off the trellis.


Performance is based on the Hamming distance d_H(c_i, c_j) between the code sequences of the adversary paths in the trellis. As we can see even in this simple example, the number of adversary paths grows quickly, and we may wonder how to handle the combinatorics involved. The trellis-path analysis is simplified for linear codes: in that case, the Hamming distance between two code sequences in the trellis equals the Hamming distance between some code word and the all-zero code sequence.

Transfer function

This information can be found using the transfer function. We consider only the non-zero adversary paths which begin and end in state S0. We modify the state diagram by removing the self-loop at the S0 state and adding a new node, S0e, representing the termination of a non-zero adversary path.

Transfer Function Operators

Consider the transition S0 → S1, labeled 1 | 11: the code symbol 11 has weight 2 and the source symbol 1 has weight 1. Define the

    Code Symbol Weight Operator: D
    Source Symbol Weight Operator: N
    Time Index Operator: J

For this transition the branch operator is D^2 N J (the exponent of D or N is the number of "1" bits in the code or source symbol, and each branch contributes one factor of J).

Example 6.3: Write the transfer operators for each branch of the state diagram.


Results are:

We can solve for the transfer function over all possible paths starting at S0 and ending at S0e by writing a set of state equations for the transfer-function diagram, where X_0 and X_0e are the variables associated with the beginning and ending state S0, respectively. The transfer function T(J, N, D) is found by solving this set of equations for X_0e, with X_0 = 1, using linear algebra:

    T(J, N, D) = D^5 N J^3 / [1 - D N J (1 + J)]

To see the individual adversary paths, apply long division:

    T(J, N, D) = D^5 N J^3 + D^6 N^2 J^4 (1 + J) + ...

Proof: check that T(J, N, D) [1 - D N J (1 + J)] = D^5 N J^3.

The transfer function supplies us with all the information we need to completely characterize the structure and performance of the code. For example,


the term D^6 N^2 J^4 (1 + J) shows that there are exactly two paths of code weight 6, both produced by source sequences of weight 2; one is reached in 4 transitions, the other in 5. With this information, the two paths are found to be

    S0 → S1 → S3 → S2 → S0
    S0 → S1 → S2 → S1 → S2 → S0
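These low-weight terms can be confirmed by brute force: enumerate the non-zero paths that leave S0 and return to it without touching S0 in between, recording the code weight, source weight and number of transitions, i.e. the exponents of D, N and J. The sketch below does this for the assumed (7, 5) encoder.

# Sketch: enumerate non-zero adversary paths S0 -> ... -> S0 of the assumed (7,5) encoder.
from itertools import product

def step(state, bit):                         # state = (m_{t-2}, m_{t-1})
    m2, m1 = state
    return (m1, bit), (bit ^ m1 ^ m2) + (bit ^ m2)    # next state, weight of the 2-bit output

def adversary_terms(max_len=8):
    found = []
    for L in range(1, max_len + 1):
        for bits in product((0, 1), repeat=L):
            if bits[0] == 0:                  # the path must leave S0 on the first step
                continue
            state, d = (0, 0), 0
            ok = True
            for t, b in enumerate(bits):
                state, w = step(state, b)
                d += w
                if state == (0, 0) and t < L - 1:
                    ok = False                # returned to S0 too early: not a single adversary path
                    break
            if ok and state == (0, 0):
                found.append((d, sum(bits), L))
    return sorted(found)

for d, n, L in adversary_terms():
    print(f"D^{d} N^{n} J^{L}")
# First lines: D^5 N^1 J^3, D^6 N^2 J^4, D^6 N^2 J^5, ... confirming d_free = 5
# and the two weight-6 paths of source weight 2 identified above.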

§6.3 Decoding Methods

Viterbi Algorithm

Convolutional codes are employed when significant error correction capability is required.

In such cases, the decoding cannot be carried out with the syndrome method and shift-register circuits; a more powerful method is needed. Such a method was introduced by Viterbi (1967) and quickly became known as the Viterbi algorithm. The Viterbi

Algorithm is of major practical importance, and we will introduce it primarily by means

of examples.

We have seen that a convolutional code with constraint length ν = M + 1 has 2^M states in its trellis. One way to view the Viterbi decoder is to construct it as a network of simple, identical processors, with one processor for each state in the trellis.

For example, for ν = 3 (M = 2), it needs 2^2 = 4 states.

Example of node processor: It receives inputs from the node processors S0 and S2, and

supplies outputs for node processors S0 and S1.

[Figure: one section of the trellis showing states S0-S3 at times t, t+1, t+2.]

Each processor does the following: 1) It monitors the received code sequence, y(X), which can be written as y(X) = c(X) + e(X), and calculates a number (a likelihood


metric) that is related to the probability that the received sequence arises from a

transmitted sequence. The likelihood metric is the accumulated Hamming distance

between the received sequence and expected transmitted sequence. The larger the

distance, the less likely it is that this processor is decoding the true transmitted message.

2) Each processor must supply, as an output, its likelihood metric to each node processor

connected to its output side. 3) For each of its input paths, the node processor must calculate the Hamming distance between the received n0-bit code symbol y and the n0-bit code symbol it should have received if the transmitted path had just made that transition (the likelihood update). It adds the likelihood update to the likelihood supplied to it by the corresponding source node processor, and selects the path associated with the input-side processor having the smallest accumulated Hamming distance (the most likely path).

4) Based on which path is selected, the processor must decode the message associated

with the selected path and update a record (called Survivor Path Register) of all of the

decoded message bits associated with the selected path.

The Survivor Path Register Method of Viterbi Decoding

Example 6.4: Assume that we have the convolutional code discussed in Example 6.1.

At time t, assume that the processors have the following initial conditions:

Assume that the received code-word symbol at time t is y=11. Find the resulting

likelihoods and survivor path registers for each of the node processors at time t+1.

Node Processor Si    Likelihood Metric    Survivor Path Register
S0                   3                    000100xxxxx
S1                   3                    111001xxxxx
S2                   1                    101110xxxxx
S3                   2                    111011xxxxx
(i = 0, 1, 2, 3)


Write down the trellis diagram (see the example discussed earlier).

For node S0, with y = 11:
    from S0, the branch is 0/00: d(11, 00) = 2, candidate metric 3 + 2 = 5;
    from S2, the branch is 0/11: d(11, 11) = 0, candidate metric 1 + 0 = 1.

Thus, processor S0 selects the transition S2 → S0 as the most likely transition. The resulting register for S0 becomes 1011100xxxx and the new likelihood metric becomes 1.

Now let us look at node S1, again with y = 11:
    from S0, the branch is 1/11: d(11, 11) = 0, candidate metric 3 + 0 = 3;
    from S2, the branch is 1/00: d(11, 00) = 2, candidate metric 1 + 2 = 3.

The likelihoods are tied. The node processor has no statistical way to choose between the paths, so it resolves the dilemma by "tossing a coin". Let us say that the path coming from S2 "wins the toss". The survivor path register becomes 1011101xxxx and the new likelihood metric becomes 3.

The same procedure applies for S2 and S3 node processors.
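The add-compare-select step just carried out by hand is mechanical enough to script. The sketch below repeats it for all four node processors with the initial metrics of Example 6.4 and y = 11, again for the assumed (7, 5) encoder; ties are only flagged, since the notes resolve them by a coin toss.

# Sketch of one add-compare-select (ACS) step of Example 6.4 (assumed (7,5) encoder).

def branch(prev, bit):
    """Expected 2-bit output when leaving state prev = (m2, m1) with the given input bit."""
    m2, m1 = prev
    return (bit ^ m1 ^ m2, bit ^ m2)

states = [(0, 0), (0, 1), (1, 0), (1, 1)]                    # S0..S3
metric = {(0, 0): 3, (0, 1): 3, (1, 0): 1, (1, 1): 2}        # initial likelihood metrics at time t
y = (1, 1)                                                   # received symbol y = 11

for i, s in enumerate(states):
    # Predecessors of s = (b1, b0) are (0, b1) and (1, b1); the input bit of the transition is b0.
    cands = []
    for x in (0, 1):
        prev = (x, s[0])
        exp = branch(prev, s[1])
        dist = (exp[0] ^ y[0]) + (exp[1] ^ y[1])             # Hamming distance to y
        cands.append((metric[prev] + dist, prev))
    best = min(c[0] for c in cands)
    winners = [c for c in cands if c[0] == best]
    tag = "(tie)" if len(winners) > 1 else f"from S{states.index(winners[0][1])}"
    print(f"S{i}: new metric {best} {tag}")
# Expected: S0: new metric 1 from S2,  S1: new metric 3 (tie),
#           S2: new metric 3 from S3,  S3: new metric 3 from S3.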

Example 6.5: For the convolutional code discussed in Example 6.1, assume that it is known that the encoder's initial state is S0. Decode the received sequence 10 10 00 01 10 01.

Since we know the initial state, we initialize the likelihood of S0 to 0 and those of the other states to a large value (actually, any large numbers will do).


Above is the result of applying the Viterbi algorithm. The solid lines are the selected paths, the dashed lines are rejected paths; T marks a tied path and is shown above the branches. The accumulated Hamming distances are indicated below each node. The first two steps are easy since we know that S0 always wins (the other metrics are large). The results of steps 3-6 are:

t = 3:   S0: 000xxx    S1: 101xxx    S2: 010xxx    S3: 011xxx
t = 4:   S0: 0000xx    S1: 0001xx    S2: 0110xx    S3: 1011xx
t = 5:   S0: 01100x    S1: 00001x    S2: 10110x    S3: 10111x
t = 6:   S0: 101100    S1: 101101    S2: 101110    S3: 101111


After the 3rd step, we cannot yet decide on the correct decoding of even the 1st bit (the 4 path registers disagree on what this bit should be). By the 6th step, all 4 survivor registers agree on the first 4 decoded bits, 1011. Why? If you trace back from t = 6, all surviving paths join together at t = 4. Note, however, the tie: this result depends on how the tie is resolved!

After the algorithm has a chance to observe a sufficient number of received symbols, it is

able to use the accumulated information to pick the globally most likely transmitted

sequence.

Notice that the path selection for the first 4 steps through the trellis cannot be changed by

any further decisions the node processors may make. This is because all the node

processors now agree on the first four steps. Received: 10 10 00 01 10 01

Most Likely: 11 10 00 01 ?? ??
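For readers who want to reproduce the whole table, here is a compact hard-decision Viterbi decoder in survivor-register style. It again uses the assumed (7, 5) generators; ties are broken in favour of the higher-numbered predecessor, a choice that happens to reproduce the t = 6 registers above, while a different tie rule may change some of the intermediate entries, as the notes point out.

# Sketch of a hard-decision Viterbi decoder for the assumed (7,5), R = 1/2, M = 2 code.

BIG = 100                                     # "any large number will do" for the unknown start states

def branch(prev, bit):                        # prev = (m_{t-2}, m_{t-1})
    m2, m1 = prev
    return (bit ^ m1 ^ m2, bit ^ m2)          # expected 2-bit symbol for this transition

def viterbi(symbols):
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]                 # S0..S3
    metric = {s: (0 if s == (0, 0) else BIG) for s in states}
    path = {s: [] for s in states}                            # survivor path registers (decoded bits)
    for y in symbols:
        new_metric, new_path = {}, {}
        for s in states:
            best = None
            for x in (0, 1):                                  # two predecessors (x, s[0])
                prev = (x, s[0])
                exp = branch(prev, s[1])                      # the input bit of this transition is s[1]
                cand = metric[prev] + (exp[0] ^ y[0]) + (exp[1] ^ y[1])
                if best is None or cand <= best[0]:           # '<=': ties go to the x = 1 predecessor
                    best = (cand, prev)
            new_metric[s] = best[0]
            new_path[s] = path[best[1]] + [s[1]]
        metric, path = new_metric, new_path
    return metric, path

received = [(1, 0), (1, 0), (0, 0), (0, 1), (1, 0), (0, 1)]   # 10 10 00 01 10 01
metric, path = viterbi(received)
for i, s in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    print(f"S{i}: metric {metric[s]}, survivor bits {''.join(map(str, path[s]))}")
# With this tie rule: S0: 101100, S1: 101101, S2: 101110, S3: 101111 (metrics 4, 4, 1, 3),
# all agreeing on the first four decoded bits 1 0 1 1.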

In any practical implementation of the Viterbi algorithm, we must use a finite number of

bits for the survivor path register. This is called the Decoding Depth.

If we use too few bits, the performance of the algorithm will suffer because decoding decisions are forced when we run out of decision bits. In that case, the "most likely" bits are taken to be those that lead to the best likelihood metric. Most of the time this results in correct decoding, but sometimes it does not. Such an erroneous forced decision is called a Truncation Error.

How many bits of decoding depth are required to make the probability of

truncation error negligible?

Forney (1970) gave the answer to this question: about 5.8 times the number of bits in the encoder's shift register, i.e., a decoding depth of roughly 5.8 M bits.

Practical Implementation for Long Code Sequences with a Large Number of Errors

When the number of errors is large (this is why we use convolutional codes), the

arithmetic circuits can run out of bits for representing the likelihoods. We should notice

that all node decisions are relative decisions. A strategy for dealing with arithmetic

overflow is to occasionally subtract the value of the lowest likelihood from each node


processor’s likelihood. This leaves the relative likelihoods unchanged, while limiting the

range of the likelihood number each node processor must be able to express.
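A one-line sketch of the normalization (the metric values are hypothetical):

# Subtract the smallest metric from every node metric; the relative ordering is unchanged.
metrics = {"S0": 412, "S1": 409, "S2": 415, "S3": 409}        # hypothetical accumulated metrics
lowest = min(metrics.values())
metrics = {s: m - lowest for s, m in metrics.items()}
print(metrics)                                                # {'S0': 3, 'S1': 0, 'S2': 6, 'S3': 0}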

The Traceback Method of Viterbi Decoding

Each node manages a path survivor register, in which the node processor’s best

estimate is stored at each moment. This method is easy to understand, but it is not an efficient way of keeping track of the decoded message when a high-speed decoder is required.

The survivor path registers must be interconnected to permit parallel transfer. This

interconnection is very costly to implement. The Traceback Method is an alternative

way of keeping track of the decoded message sequence. This method is very popular, as

its implementation in integrated circuits is more cost effective. The method exploits a

priori information that the decoder has about the trellis structure of the code.

Basic Idea: For example, let M = 2 and consider the content of the shift register in state S2. The "1" which appears at moment t on the output was actually applied at moment t-2 on the input. We can use the content of the last delay element in the register to decode the message, but there is a delay of M clock cycles. Since we already have a delay of at least 5.8M

clock cycles to avoid the truncation errors in the Viterbi algorithm, this additional

decoding delay is a small price to pay for obtaining a lower cost hardware solution.


Instead of transferring the contents of the survivor registers, each node processor is assigned its own register in which a single bit is stored at each time step. This is the last bit of the

state picked by that node processor as survivor path (in the previous example this is

“1”). As we deal with binary codes, each node has two inputs (two path choices). The

bit that can be chosen is different for the two possible paths (see the trellis diagram).

This will always be true with the state-naming convention we are using.

Trellis Diagram

Only the surviving path decisions are shown at each time step. The solid line is the survivor path agreed on by all four node processors at the last time step shown in the figure below.


The entries into each node processor’s traceback (i.e., survivor path) register at each

trellis step are shown in the figure. The traceback process is also illustrated. It begins at

the far right side of the figure and proceeds backwards in time. Once the traceback is

completed, the decoded bit sequence is read from left to right. The path traces back to

state “00” (S0). Whatever else may have happened during the time prior to the start of

the figure, we know that the last 2 bits leading into state “00” must have been “0,0”, so,

the decoded message sequence corresponding to the solid line must be . The

last two message bits, corresponding to the final 2 steps through the trellis have not

been decoded yet (due to the extra decoding lag mentioned above).
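The bookkeeping can be sketched as follows: during the forward pass each state stores, at every step, a single bit identifying which of its two incoming paths survived (here taken as the oldest bit of the chosen predecessor state, i.e. the bit about to leave the register); the traceback then walks this table backwards and reads the message bits off the visited states. The encoder is again the assumed (7, 5) code, and the received sequence of Example 6.5 is reused.

# Sketch of traceback-style Viterbi bookkeeping for the assumed (7,5), M = 2 code.

def branch(prev, bit):                        # prev = (m_{t-2}, m_{t-1})
    m2, m1 = prev
    return (bit ^ m1 ^ m2, bit ^ m2)

def viterbi_traceback(symbols):
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    metric = {s: (0 if s == (0, 0) else 100) for s in states}
    decisions = []                            # per step and state: which predecessor bit survived
    for y in symbols:
        new_metric, step_bits = {}, {}
        for s in states:
            cands = []
            for x in (0, 1):                  # predecessor (x, s[0]); x is the bit leaving the register
                prev = (x, s[0])
                exp = branch(prev, s[1])
                cands.append((metric[prev] + (exp[0] ^ y[0]) + (exp[1] ^ y[1]), x))
            m, x = min(cands)                 # ties resolved toward x = 0 in this sketch
            new_metric[s], step_bits[s] = m, x
        metric = new_metric
        decisions.append(step_bits)
    # Traceback: start from the best final state and walk backwards through the stored bits.
    state = min(metric, key=metric.get)
    bits = []
    for step_bits in reversed(decisions):
        bits.append(state[1])                 # the newest bit of the state is the message bit of that step
        state = (step_bits[state], state[0])  # reconstruct the predecessor state
    return bits[::-1]

received = [(1, 0), (1, 0), (0, 0), (0, 1), (1, 0), (0, 1)]   # 10 10 00 01 10 01
print(viterbi_traceback(received))            # [1, 0, 1, 1, 1, 0]: the first four bits agree with the
                                              # survivor-register result; the last two are the not-yet-reliable
                                              # bits discussed above.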

§6.4 Approaches to Increase Code Rate

Code Rate of Convolutional Codes

Let {d0, d1, d2, ...} be the set of all possible Hamming distances between adversaries in the transfer function of the convolutional code, ordered such that d0 < d1 < d2 < ... The minimum distance d0 is called the minimum free distance, df. The performance of a convolutional code is determined by its minimum free distance. Convolutional codes provide very powerful error-correction capability, at the price of a low code rate.

For example:


Using Nonbinary Convolutional Codes

So far we have been looking only at convolutional codes with rate 1/n0 ( low R ). If the

source frame is increased to some k0 >1, we can achieve a rate k0/n0 convolutional code.

Example 6.6: Find the code rate and trellis diagram for the encoder with 2-bit source frames (k0 = 2) shown on the next page.

R = ; df = 3, it is a 4-ary code.


In Example 6.6, the number of inputs to each trellis node processor is equal to 4 (a disadvantage!). In general, a rate k0/n0 convolutional code requires each node processor to deal with 2^k0 input paths, so the complexity of the Viterbi decoder increases geometrically with k0. This is a severe problem, and non-binary convolutional codes are therefore not popular.

Using Punctured Convolutional Codes

An alternative way to increase code rate is puncturing. We start with a 1/n0

convolutional code, such as a 1/2 code rate. The transmitted code word corresponding to

the 1/2 code is (c0 c1). We delete one of the code bits in every second code symbol, so the code sequence becomes

    c0(0) c1(0)   c0(1) -   c0(2) c1(2)   c0(3) -   ...

where "-" marks a deleted bit. The deleted code bits are not transmitted. On average, 3 code bits are transmitted for every two message bits, which yields a rate 2/3 code. Deleting code bits is called Puncturing the code. The rate is increased at the

expense of reducing the minimum free distance.
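The following sketch applies this puncturing pattern to the output of the rate-1/2 encoder used throughout the chapter (the same assumed (7, 5) generators), confirming that 3 code bits are sent for every 2 message bits.

# Sketch: puncture c1 in every second code symbol of the assumed rate-1/2 (7,5) encoder.

def encode_symbol(state, bit):                # state = (m_{t-2}, m_{t-1})
    m2, m1 = state
    return (m1, bit), [bit ^ m1 ^ m2, bit ^ m2]      # next state, [c0, c1]

def punctured_encode(message):
    state, out = (0, 0), []
    for t, bit in enumerate(message):
        state, sym = encode_symbol(state, bit)
        if t % 2 == 1:
            sym = sym[:1]                     # delete c1 in every second symbol
        out.extend(sym)
    return out

msg = [1, 0, 1, 1, 0, 0]                      # 4 message bits plus 2 flush bits
code = punctured_encode(msg)
print(len(msg), len(code))                    # 6 input clocks -> 9 transmitted bits, i.e. rate 2/3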

However, it is not fair to compare df of a punctured code with df of the base code. Instead,

we should compare df with that of a non-binary code, with the same R and number of

elements in its encoder. Cain showed that there are punctured codes with the same df as


the best known non-binary codes of the same rate and memory depth. Punctured codes

with rates up to 9/10 are known. Punctured codes are still linear codes (but they are no longer shift invariant).

Example 6.7: Use the same encoder as in Example 6.1, but with c1 punctured in every second

code-word. Find its code rate and trellis diagram.

Code rate R=2/3 (from a base code with R=1/2), df = 3.

The state diagram of such a code requires 8 states rather than 4 (4 for the "time-even" trellis

and 4 for “time-odd” trellis states). The Viterbi algorithm requires only 4 node

processors (M=2).

The Puncturing Period: the number of message bits encoded before the pattern returns to the base code.

Example 6.8: Find the puncturing period of the punctured code (7, 5), 7.

Punctured codes are specified in a manner similar to the octal generator notation; here 7 and 5 are the two generator polynomials in octal. In each period, the first message bit is encoded using both generators, while the second message bit is encoded using only the generator polynomial 7, giving R = 2/3.


Here the puncturing period is equal to 2. The decoder requires 2^M = 4 (M = 2) node processors, and the state diagram contains 8 (= 4 x 2, i.e., 4 states times the puncturing period) states.

Example 6.9: Find the puncturing period of the punctured code (15, 17), 15, 17.

Here 15 and 17 are the generator polynomials in octal. In each period, the first message bit is encoded using both generators, the second message bit is encoded using only the generator polynomial 15, and the third message bit is encoded using only the generator polynomial 17; thus R = 3/4.

Here the puncturing period is equal to 3. The Viterbi decoder requires 2^M = 8 (M = 3) node processors, and the state diagram contains 24 (= 8 x 3, i.e., 8 states times the puncturing period) states.

The punctured codes presented here are punctured versions of known good rate-1/2 codes.

However, it is not always true that puncturing a good code (1/n0 rate) yields a good

punctured code. There is no known systematic procedure for generating good

punctured convolutional codes. Good codes are discovered by computer search.

Some examples of good punctured codes:
