Ch 0 Introduction
§0.1 Overview of Information Theory and Coding

Overview

Information theory was founded by Shannon in 1948. The theory addresses transmission (communication systems) and recording (storage systems) over or in a channel. The channel can be wireless or wired (communication: copper telephone lines or fiber optic cables), or magnetic or optical disks (storage). Three aspects need to be considered: Compression, Error Detection and Correction, and Cryptography. Information theory is based on probability theory. A communication or compression procedure includes:
[Block diagram: sent messages from the source pass through a source encoder (Source coding: compression, governed by the source entropy) and a channel encoder (Channel coding: error detection and correction, governed by the channel capacity), travel over the channel as a binary symbol stream (e.g., 0110101001110…), and are then channel-decoded and source-decoded (decompression) before reaching the receiver as received messages.]
Digital Communication and Storage Systems

A basic information processing system includes a channel, which produces a received signal r that differs from the original signal c (the channel introduces noise, channel distortion, etc.). Thus, the decoder can only produce an estimate m' of the original message m.

Goal of processing: Information conveyed through (or stored in) the channel must be reproduced at the destination as reliably as possible. At the same time, the system should allow the transmission of as much information as possible per unit time (communication system) or per unit of storage (storage system).

Information Source

The source message m consists of a time sequence of symbols emitted by the information source. The source can be:
- a Continuous-time Source, if the message is continuous in time, e.g., a speech waveform;
- a Discrete-time Source, if the message is discrete in time, e.g., data sequences from a computer.

The symbols emitted by the source can be:
- continuous in amplitude, e.g., a speech waveform;
- discrete in amplitude, e.g., text with a finite symbol alphabet.

This course is primarily concerned with discrete-time and discrete-amplitude (i.e., digital) sources, as practically all new communication or storage systems fall into this category.
Since information and coding theory depend on probability theory, we review probability first.
§0.2 Review of Random Variables and Probability
Probability

Let us consider a single experiment, such as the roll of a die, with a number of possible outcomes. The sample space S of the experiment consists of the set of all possible outcomes. In the case of a die, S = {1, 2, 3, 4, 5, 6}, with the integers representing the number of dots on the six faces of the die.

Event: any subset A of the sample space S.

Complement: the complement of an event A, denoted A', consists of all points of S not in A.

Example 0.1: For S and A defined above, find A'.

Two events are said to be Mutually Exclusive if they have no sample points in common. The Union of two events is A U B; the Intersection of two events is A ∩ B.
Associated with each event A contained in S is its Probability, denoted by P(A). This has the following properties:

0 <= P(A) <= 1, and P(S) = 1.

For mutually exclusive events, A_i ∩ A_j = ∅ for i ≠ j, the probability of the union is

P(U_i A_i) = Σ_i P(A_i).

Example 0.2: If A = {2, 4}, find P(A).
Joint Event and Joint Probability

Instead of dealing with a single experiment, let us perform two experiments and consider their outcomes. For example, the two experiments can be separate tosses of a single die, or a single toss of two dice. The sample space S consists of the 36 two-tuples (i, j), where i, j = 1, ..., 6. Each point in the sample space is assigned the probability 1/36.
Let A_i, i = 1, ..., n, denote the outcomes of the first experiment, and B_j, j = 1, ..., m, the outcomes of the second experiment. Assuming that the outcomes B_j, j = 1, ..., m, are mutually exclusive and U_j B_j = S, it follows that

P(A_i) = Σ_j P(A_i, B_j).

If A_i, i = 1, ..., n, are mutually exclusive and U_i A_i = S, then

P(B_j) = Σ_i P(A_i, B_j).
Conditional Probability

A joint event (A, B) occurs with the probability P(A, B), which can be expressed as

P(A, B) = P(A|B) P(B) = P(B|A) P(A),

where P(A|B) and P(B|A) are conditional probabilities.

Example 0.3: Let us assume that we toss a die. The events are A = {1, 2, 3} and B = {1, 3, 6}; find P(B|A).
A conditional probability is P(A|B) = P(A, B)/P(B). Let A and B be two events in a single experiment:
- If they are mutually exclusive (A ∩ B = ∅), then P(A|B) = 0.
- If B is a subset of A, then P(A|B) = 1.

The Bayes Theorem: If A_i, i = 1, ..., n, are mutually exclusive and U_i A_i = S, then

P(A_i|B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j).
Statistical Independence: Let P(A|B) be the probability of occurrence of A given that B has occurred. Suppose that the occurrence of A does not depend on the occurrence of B. Then P(A|B) = P(A), and hence P(A, B) = P(A) P(B).

Example 0.4: Two successive experiments in tossing a die:

A = {2, 4, 6} (even-numbered sample points in the first toss), P(A) = 1/2;
B = {2, 4, 6} (even-numbered sample points in the second toss), P(B) = 1/2.

Determine the probability P(A, B) of the joint event "even-numbered outcome on the first toss (A)" and "even-numbered outcome on the second toss (B)".
Random Variables

Given a sample space S with elements s in S, X(s) is a Random Variable.

Probability Mass Function (PMF):

p_X(x_i) = P(X = x_i), i = 1, ..., M, with Σ_{i=1..M} p_X(x_i) = 1, and p_X(x) = 0 otherwise.

Definition: The Mean of the random variable X is E(X) = Σ_i x_i p_X(x_i).

Example 0.5: S = {1, 2, 3, 4, 5, 6}, X(s) = s; find E(X).
Useful Distributions

Let X be a discrete random variable that has two possible values, say X = 1 or X = 0, with probabilities p and 1 − p, respectively. This is the Bernoulli distribution, and the PMF can be represented as given in the figure. The mean of such a random variable is E(X) = p. A single experiment whose outcome is success (with probability p) or failure (with probability 1 − p) is known as a Bernoulli trial.

[Figures: the PMF of the die of Example 0.5, p_X(x) = 1/6 for x = 1, ..., 6; and the Bernoulli PMF, with p_X(1) = p and p_X(0) = 1 − p.]
Let X_i, i = 1, ..., n, be statistically independent and identically distributed random variables with a Bernoulli distribution, and let us define a new random variable Y = Σ_{i=1..n} X_i. This random variable takes values from 0 to n. The associated probabilities can be expressed as

P(Y = 0) = (1 − p)^n.

More generally,

P(Y = k) = C(n, k) p^k (1 − p)^(n−k), k = 0, 1, ..., n,

where

C(n, k) = n! / (k! (n − k)!)

is the binomial coefficient. This represents the probability of having k successes in n Bernoulli trials. The probability mass function above defines the binomial distribution (see the www.mathworld.com website). The mean of a random variable with a binomial distribution is E(Y) = np.
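A short Python sketch can verify that the binomial PMF sums to one and has mean np (the values of n and p below are arbitrary illustration choices, not from the text):

```python
from math import comb

# Hypothetical parameters for illustration: n Bernoulli trials, success probability p.
n, p = 10, 0.3

# Binomial PMF: P(Y = k) = C(n, k) * p^k * (1-p)^(n-k)
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

total = sum(pmf)                              # PMF must sum to unity
mean = sum(k * q for k, q in enumerate(pmf))  # should equal n*p = 3.0
print(total, mean)
```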
Definitions:
1. The Mean of a function of the random variable X, g(X), is defined as E[g(X)] = Σ_i g(x_i) p_X(x_i).
2. The Variance of the random variable X is defined as Var(X) = E[(X − E(X))^2] = E(X^2) − (E(X))^2. Example: calculate the variance for the random variable defined in Example 0.5, whose mean is 21/6.
3. The Variance of a function of the random variable X, g(X), is defined as Var[g(X)] = E[(g(X) − E[g(X)])^2].
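The mean and variance definitions can be checked on the fair die of Example 0.5 (a Python sketch using exact fractions):

```python
from fractions import Fraction

# Fair die from Example 0.5: X(s) = s, s in {1,...,6}, each with probability 1/6.
outcomes = range(1, 7)
p = Fraction(1, 6)

mean = sum(x * p for x in outcomes)               # E(X) = 21/6 = 7/2
var = sum((x - mean) ** 2 * p for x in outcomes)  # E[(X - E(X))^2] = 35/12
print(mean, var)
```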
Ch 1 Discrete Source and Entropy
§1.1 Discrete Sources and Entropy
1.1.1 Source Alphabets and Entropy

Overview

Information theory is based on probability theory, as the term information carries with it a connotation of UNPREDICTABILITY (SURPRISE) in the transmitted signal.
The Information Source is defined by :
- The set of output symbols
- The probability rules which govern the emission of these symbols.
Finite-Discrete Source: finite number of unique symbols.
The symbol set is called the Source Alphabet.
Definition

A = {a_0, a_1, ..., a_{M−1}} is a source alphabet with M possible symbols. We can say that the emitted symbol is a random variable which takes values in A. The number of elements in a set is called its Cardinality, e.g., |A| = M.

The source output symbols can be denoted as s_t, s_t in A, where s_t is the symbol emitted by the source at time t. Note that here t is an integer time index.

Stationary Source: the set of probabilities is not a function of time. That is, at any given time moment, the probability that the source emits a_m is p_m = Pr(a_m).

Probability mass function: P_A = {p_0, p_1, ..., p_{M−1}}.

Since the source emits only members of its alphabet, Σ_{m=0..M−1} p_m = 1.
Information Sources Classification
Stationary Versus Non-Stationary Source:
For a Stationary Source the set of probabilities is not a function of time, whereas for a
Non-stationary Source it is.
Synchronous Source Versus Asynchronous Source:
A Synchronous Source emits a new symbol at a fixed time interval, Ts, whereas for an
Asynchronous Source the interval between emitted symbols is not fixed.
The latter can be approximated as synchronous, by defining a null character when the
source does not emit at time t. We say the source emits a null character at time t.
Representation of the Source Symbols
The symbols emitted by the source must be represented somehow. In digital systems, the
binary representation is used.
Pop Quiz: How many bits are required to represent the symbols 1, 2, 3, 4? Or, in general, a set of n symbols 1, 2, 3, …, n?

Answer: 2 bits; in general, ⌈log2(n)⌉ bits.
The symbols represented in this fashion are referred to as Source Data.
Distinction between Data and Information
For example, consider an information source whose alphabet has only one symbol. The representation of this symbol is data, but this data is not information, as it is completely uninformative. Since information carries the connotation of uncertainty, the information content of this source is zero.
Question: how can one measure the information content of a source?
Answer:
Entropy of a Source

Example: Pick a marble from a bag of 2 blue and 5 red marbles.

Probability of picking a red marble: p_red = 5/7.

Number of choices for each red pick: 1/p_red = 7/5 = 1.4.

Each transmitted symbol 1 is just one choice out of 1/p_1 many possible choices, and therefore symbol 1 contains log2(1/p_1) bits of information (since 1/p_1 = 2^(log2(1/p_1))). Similarly, symbol k contains log2(1/p_k) bits of information.

The average information in bits per symbol for our source is the Entropy; it is calculated by

H(A) = Σ_k p_k log2(1/p_k).

Shannon gave this precise mathematical definition of the average amount of information conveyed per source symbol, used to measure the information content of a source.
Unit of Measure (entropy): bits/symbol.

Range of entropy: 0 <= H(A) <= log2(M), where M is the cardinality of the source A; when p_m = 1/M, m = 1, ..., M (i.e., equal probabilities), H(A) takes the maximum.
Example 1.1: What is the entropy of a 4-ary source having symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}?

Example 1.2: If A = {0, 1} with probabilities P_A = {1 − p, p}, where 0 <= p <= 1, determine the range of H(A).

Example 1.3: For an M-ary source, what distribution of probabilities P(A) maximizes the information entropy H(A)?
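A small Python helper makes Examples 1.1 and 1.3 concrete (a sketch; the function name is ours, not from the notes):

```python
from math import log2

def entropy(probs):
    """H(A) = sum of p_m * log2(1/p_m), skipping zero-probability symbols."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Example 1.1: 4-ary source with P_A = {0.5, 0.3, 0.15, 0.05}
H = entropy([0.5, 0.3, 0.15, 0.05])
print(H)  # ≈ 1.6477 bits/symbol

# Example 1.3: equal probabilities maximize H(A) at log2(M)
print(entropy([0.25] * 4))  # 2.0 bits/symbol, the maximum for M = 4
```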
The Information Efficiency of the Source is measured as the ratio of the entropy of the source to the (average) number of binary digits used to represent the source data.

Example 1.4: For a 4-ary source A = {00, 01, 10, 11} with symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}, what is the efficiency of the source?

When the entropy of the source is lower than the (average) number of bits used to represent the source data, an efficient coding scheme can be used to encode the source information using, on average, fewer binary digits. This is called Data Compression, and the encoder used for it is called a Source Encoder.
1.1.2 Joint and Conditional Entropy
If we have two information sources A and B, and we want to make a compound symbol C with c_ij = (a_i, b_j), find H(C).
i) If A and B are statistically independent:
ii) If B depends on A:
Example 1.5: We often use a parity bit for error detection. For a 4-ary information source A = {0, 1, 2, 3} with P_A = {0.25, 0.25, 0.25, 0.25}, and the parity generator B = {0, 1} with

b_j = 0 if a = 0 or 1, and b_j = 1 if a = 2 or 3 (j = 1, 2),

find H(A), H(B), and H(A, B).
1.1.3 Entropy of Symbol Blocks and the Chain Rule

To find H(A_0, A_1, ..., A_{n−1}), where A_t (t = 0, 1, ..., n−1) is the symbol at time index t drawn from alphabet A.
Example 1.5: Suppose a memoryless source with }1,0{A having equal probabilities
emits a sequence of 6 symbols. Following the 6th symbol, suppose a 7th symbol is
transmitted which is the sum modulo 2 of the six previous symbols (this is just the
exclusive-or of the symbols emitted by A). What is the entropy of the 7-symbol
sequence?
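The answer can be verified by brute force (a Python sketch enumerating all equally likely 6-bit sequences; the 7th bit is determined by the first six, so it adds no information):

```python
from itertools import product
from math import log2

# All 2^6 equally likely 6-bit sequences; the 7th symbol is the modulo-2 sum (parity).
seqs = {bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=6)}

# 64 distinct 7-symbol sequences, each with probability 1/64.
H = sum((1 / len(seqs)) * log2(len(seqs)) for _ in seqs)
print(len(seqs), H)  # 64 sequences, entropy 6 bits
```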
Example 1.6: For an information source having alphabet A with |A| symbols, what is the
range of entropies possible?
§1.2 Source Coding
1.2.1 Mapping Functions and Efficiency
For an inefficient information source, i.e. H(A) < log2(|A|), the communication system
can be made more cost effective through source coding.
[Diagram: the information source sequence s0, s1, …, with s_t in A (the source alphabet), enters the Source Encoder, which outputs the code words s'0, s'1, …, with s'_t in B (the code alphabet).]
In its simplest form, the encoder can be viewed as a mapping of the source alphabet A to
a code alphabet B, i.e., C: A→B. Since the encoded sequence must be decoded at the
receiver end, the mapping function C must be invertible.
Goal of coding: average information bits/symbol ~ average bits we use to represent a
symbol (i.e. code efficiency ~ 1).
Example 1.7: Let A be a 4-ary source with symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}, and let C be an encoder which maps the symbols in A into strings of binary bits, as below:

p_0 = 0.5,  C(a_0) = 0
p_1 = 0.3,  C(a_1) = 10
p_2 = 0.15, C(a_2) = 110
p_3 = 0.05, C(a_3) = 111

Determine the average number of transmitted binary digits per code word and the efficiency of the encoder.
Example 1.8: Let C be an encoder grouping the symbols in A into ordered pairs <a_i, a_j>. The set of all possible pairs <a_i, a_j> is called the Cartesian product of set A with itself and is denoted A X A. Thus, the encoder is C: A X A → B, or C(<a_i, a_j>) = b. Now let A be a 4-ary memoryless source with the symbol probabilities given in Example 1.7; determine the average number of transmitted binary digits per code word and the efficiency of the encoder. The code words are shown in the table following.
< ai,aj >  Pr< ai,aj >  bm          < ai,aj >  Pr< ai,aj >  bm
a0,a0      0.25         00          a2,a0      0.075        1101
a0,a1      0.15         100         a2,a1      0.045        0111
a0,a2      0.075        1100        a2,a2      0.0225       111110
a0,a3      0.025        11100       a2,a3      0.0075       1111110
a1,a0      0.15         101         a3,a0      0.025        11101
a1,a1      0.09         010         a3,a1      0.015        111101
a1,a2      0.045        0110        a3,a2      0.0075       11111110
a1,a3      0.015        111100      a3,a3      0.0025       11111111
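The average-length computation of Example 1.8 can be checked with a short Python sketch (pair probabilities p_i * p_j for the memoryless source; code-word lengths read from the table above):

```python
# Symbol probabilities from Example 1.7.
p = {'a0': 0.5, 'a1': 0.3, 'a2': 0.15, 'a3': 0.05}
lengths = {  # bits per code word b_m for each pair <ai, aj>, from the table
    ('a0','a0'): 2, ('a0','a1'): 3, ('a0','a2'): 4, ('a0','a3'): 5,
    ('a1','a0'): 3, ('a1','a1'): 3, ('a1','a2'): 4, ('a1','a3'): 6,
    ('a2','a0'): 4, ('a2','a1'): 4, ('a2','a2'): 6, ('a2','a3'): 7,
    ('a3','a0'): 5, ('a3','a1'): 6, ('a3','a2'): 8, ('a3','a3'): 8,
}
# Memoryless source: Pr<ai, aj> = p_i * p_j
avg_per_pair = sum(p[i] * p[j] * l for (i, j), l in lengths.items())
avg_per_symbol = avg_per_pair / 2
print(avg_per_pair, avg_per_symbol)  # ≈ 3.3275 bits/pair, ≈ 1.664 bits/symbol
```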
1.2.2 Mutual Information

If we have the source set A and the code set B, what is the entropy relationship between them?

[Venn diagrams: cases relating the entropies of A and B; case iii) shows disjoint sets A and B, with symbols b_i, b_j in B and a_i in A.]
1.2.3 Data Compression

Why Data Compression?

Whenever space is a concern, you would like to use data compression; for example, when sending text files over a modem or the Internet. If the files are smaller, they will reach the destination faster. All media, such as text, audio, graphics, or video, contain "redundancy". Compression attempts to eliminate this redundancy.

Example of Redundancy: If the representation of a medium captures content that is not perceivable by humans, then removing such content will not affect the perceived quality. For example, capturing audio frequencies outside the human hearing range can be avoided without any harm to the audio's quality.
[Diagram: the original message A enters the ENCODER, producing the compressed message B; the DECODER produces the decompressed message A'.]
Lossless Compression:
Lossy Compression:
Lossless and lossy compression are terms that describe whether or not, in the
compression of the message, all original data can be recovered when decompression is
performed.
Lossless Compression
- Every single bit of data originally transmitted remains after decompression.
After decompression, all the information is completely restored.
- One can use lossless compression whenever space is a concern, but the
information must be the same.
In other words, when a file is compressed, it takes up less space, but when it is
decompressed, it still has the same information.
- The idea is to get rid of redundancy in the information.
- Standards: ZIP, GZIP, UNIX Compress, GIF
Lossy Compression
- Certain information is permanently eliminated from the original message,
especially redundant information.
- When the message is decompressed, only a part of the original information is still
there (although the user may not notice it).
- Lossy compression is generally used for video and sound, where a certain amount
of information loss will not be detected by most users.
- Standards: JPEG (still), MPEG (audio and video), MP3 (MPEG-1, Layer 3)
Lossless Compression

When we encode characters in computers, we assign each an 8-bit code based on the (extended) ASCII chart. (Extended) ASCII: fixed 8 bits per character.

For example, for "hello there!", 12 characters * 8 bits = 96 bits are needed.

Question: Can one encode this message using fewer bits?

Answer: Yes. In general, in most files, some characters appear more often than others. So it makes sense to assign shorter codes to characters that appear more often, and longer codes to characters that appear less often. This is exactly what C. Shannon and R.M. Fano were thinking when they created the first compression algorithm in 1950.
Kraft Inequality Theorem

Prefix Code (or Instantaneously Decodable Code): a code that has the property of being self-punctuating. Punctuating means dividing a string of symbols into words; a prefix code has punctuation built into its structure (rather than added using special punctuating symbols). It is designed so that no code word is a prefix of any other (longer) code word. It is also a data compression code.

To construct an instantaneously decodable code of minimum average length (for a source A, or a given random variable a with values drawn from the source alphabet), the code must satisfy the Kraft Inequality:

For an instantaneously decodable code B for a source A, the code lengths {l_i} must satisfy

Σ_i 2^(−l_i) <= 1.

Conversely, if the code word lengths satisfy this inequality, then there exists an instantaneously decodable code with these word lengths.
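The Kraft inequality is easy to check numerically (a Python sketch; the example lengths are those of the code {0, 10, 110, 111} from Example 1.7):

```python
def kraft_sum(lengths):
    """Kraft inequality: an instantaneous code with these lengths exists iff sum <= 1."""
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))        # 1.0 -> the lengths of {0, 10, 110, 111} are feasible
print(kraft_sum([1, 1, 2]) <= 1)      # False -> no prefix code can have these lengths
```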
Shannon-Fano Theorem

The KRAFT INEQUALITY tells us when an instantaneously decodable code exists. But we are interested in finding the optimal code, i.e., the one that maximizes the efficiency, or minimizes the average code length, L̄ = Σ_i p_i l_i. The average code length of the code B for the source A (with a as a random variable of values drawn from the source alphabet with probabilities {p_i}) is minimized if the code lengths {l_i} are given by

l_i = log2(1/p_i).

This quantity is called the Shannon Information (pointwise).
Example 1.9: Consider the following random variable a, with the optimal code lengths given by the Shannon information. Calculate the average code length.

a:   a0   a1   a2   a3
pi:  1/2  1/4  1/8  1/8   (i = 0, 1, 2, 3)
li:  1    2    3    3

The average code length of the optimal code is L̄ = Σ_i p_i l_i = 1/2 + 2/4 + 3/8 + 3/8 = 1.75 bits.
Note that this is the same as the entropy of A, H(A).
Lower Bound on the Average Length

The observation about the relation between the entropy and the expected length of the optimal code can be generalized. Let B be an instantaneous code for the source A. Then the average code length is bounded by L̄ >= H(A).

Upper Bound on the Average Length

Let B be a code with optimal code lengths, i.e., l_i = ⌈log2(1/p_i)⌉. Then the average length is bounded by L̄ < H(A) + 1.

Why is the upper bound H(A) + 1 and not H(A)? Because sometimes the Shannon information gives us fractional lengths, and we have to round them up.
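As a quick numerical check of these bounds, using the probabilities of Example 1.10 (a Python sketch):

```python
from math import ceil, log2

p = [0.25, 0.25, 0.20, 0.15, 0.15]  # source of Example 1.10

H = sum(q * log2(1 / q) for q in p)            # entropy, ≈ 2.2855 bits
lengths = [ceil(log2(1 / q)) for q in p]       # fractional Shannon lengths rounded up
L = sum(q * l for q, l in zip(p, lengths))     # average length, 2.5 bits

print(H, L)
assert H <= L < H + 1   # source coding bound: H(A) <= L̄ < H(A) + 1
```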
Example 1.10: Consider the following random variable a, with the optimal code lengths given by the Shannon information theorem. Determine the average code length bounds.

a:   a0    a1    a2    a3    a4
pi:  0.25  0.25  0.20  0.15  0.15   (i = 0, 1, ..., 4)
li:  2.0   2.0   2.3   2.7   2.7
The entropy of the source A is H(A) = 2.2855 bits. The source coding theorem tells us that 2.2855 <= L̄ < 3.2855, where L̄ is the average code length of the optimal code.
Example 1.11: For the source in Ex. 1.10, the following code tries to match the optimal code lengths as closely as possible; find the average code length.

a:   a0  a1  a2  a3   a4
b:   00  10  11  010  011
li:  2   2   2   3    3
The average code length for this code is L̄ = 2(0.25) + 2(0.25) + 2(0.20) + 3(0.15) + 3(0.15) = 2.3 bits. This is very close to the optimal code length of H(A) = 2.2855.
Summary

i) The motivation for data compression is to reduce the space allocated for data (an increase of source efficiency). It is achieved by reducing the redundancy which exists in the data.

ii) Compression can be lossless or lossy. In the former case, all information is completely restored after decompression, whereas in the latter case it is not (used in applications in which the information loss will not be detected by most users).

iii) The optimal code, which ensures maximum efficiency for the source, is characterized by code word lengths given by the Shannon information, log2(1/p_i).

iv) According to the source coding theorem, the average length of the optimal code is bounded by the entropy as H(A) <= L̄ < H(A) + 1.

v) The coding schemes for data compression include Huffman, Lempel-Ziv, and Arithmetic coding.
§1.3 Huffman Coding
Remarks
Huffman coding is used in data communications, speech coding, and video compression. Each symbol is assigned a variable-length code that depends on its frequency (probability of occurrence): the higher the frequency, the shorter the code word. It is a variable-length code. The number of bits for each code word is an integer (it requires an integer number of coded bits to represent an integer number of source symbols). It is a Prefix Code (instantaneously decodable).
Encoder – Tree Building Algorithm
Huffman code words are generated by building a Huffman tree:
Step 1 : List the source symbols in a column in descending order of probabilities.
Step 2 : Begin with the two lowest-probability symbols. Combining these two symbols forms a new compound symbol, or a branch in the tree. This step is repeated using the two lowest-probability symbols from the new set of symbols, and continues until all the original symbols have been combined into a single compound symbol.

Step 3 : A tree is formed, with the top and bottom stems going from each compound symbol to the symbols which form it, labeled with 0 and 1, respectively (or the other way around). Code words are assigned by reading the labels of the tree stems from right to left, back to the original symbol.
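Steps 1-3 can be sketched with a priority queue (a minimal Python illustration; since the 0/1 labeling of stems is arbitrary, the exact bit patterns may differ from a hand-built tree while the code lengths agree):

```python
import heapq
from itertools import count

def huffman(probs):
    """Build a Huffman code; probs maps symbol -> probability."""
    tick = count()  # tie-breaker so the heap never compares symbol structures
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two lowest-probability entries
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: branch with 0 / 1
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:
            codes[node] = prefix or '0'
        return codes
    return walk(heap[0][2], '')

# The source of Example 1.12; expected code lengths are 1, 2, 3, 3 bits.
print(huffman({'a0': 0.50, 'a1': 0.30, 'a2': 0.15, 'a3': 0.05}))
```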
Example 1.12: Let the alphabet of the source A be {a0, a1, a2, a3}, and the probabilities
of emitting these symbols be {0.50 0.30 0.15 0.05}. Draw the Huffman tree and find the
Huffman codes.
STEP 1 STEP 2 STEP 3
Probability Symbol
0.50 a0
0.30 a1
0.15 a2
0.05 a3
Symbol Code Words
a0
a1
a2
a3
Hardware implementation of encoding and decoding.
How are the Probabilities Known?

- Counting symbols in the input string: the data must be given in advance, and it requires an extra pass over the input string.
- The data source's distribution is known: the data is not necessarily known in advance, but we know its distribution. Reasonable care must be taken in estimating the probabilities, since large errors lead to a serious loss in optimality. For example, a Huffman code designed for English text can have a serious loss in optimality when used for French.
More Remarks

For Huffman coding, the alphabet and its distribution must be known in advance. It achieves the entropy when the occurrence probabilities are negative powers of 2 (optimal code). The Huffman code is not unique (because of some arbitrary decisions in the tree construction). Given the Huffman tree, it is easy (and fast) to encode and decode. In general, the efficiency of Huffman coding relies on having a source alphabet A with a fairly large number of symbols. Compound symbols are obtained from the original symbols (see, e.g., A X A). For compound symbols formed with n symbols, the alphabet is A^n, and the set of probabilities of the compound symbols is denoted by P_{A^n}.
Question: How does one get PAn?
Answer: Easy for a memoryless source. Difficult for a source with memory!
§1.4 Lempel-Ziv (LZ) Coding
Remarks
LZ coding does not require knowledge of the symbol probabilities beforehand. It is a particular class of dictionary code. Dictionary codes are compression codes that dynamically construct their own coding and decoding tables by looking at the data stream itself.

In simple Huffman coding, the dependency between the symbols is ignored, while in LZ these dependencies are identified and exploited to perform better encoding. When all the data is known (alphabet, probabilities, no dependencies), it is best to use Huffman (LZ will try to find dependencies which are not there…).

This is the compression algorithm used in most PCs. Because extra information is supplied to the receiver, these codes initially "expand". The secret is that most of the code words represent strings of source symbols. In a long message it is more economical to encode these strings (which can be of variable length) than it is to encode individual symbols.
Definitions related to the Structure of the Dictionary

Each entry in the dictionary has an address, m. Each entry is an ordered pair, <n, ai>. The former (n) is a pointer to another location in the dictionary, and it is also the transmitted code word. ai is a symbol drawn from the source alphabet A. A fixed-length binary word of b bits is used to represent the transmitted code word. The number of entries will be at most 2^b. The total number of entries will exceed the number of symbols, M, in the source alphabet, so each transmitted code word contains more bits than it would take to represent the alphabet A.

Question: Why do we use LZ coding if the code word has more bits?

Answer: Because most of these code words represent STRINGS of source symbols rather than single symbols.
Encoder

A Linked-List Algorithm (simplified for illustration purposes) is used; it includes:

Step 1: Initialization

The algorithm is initialized by constructing the first M + 1 (null symbol plus M source symbols) entries in the dictionary, as follows.
Address (m)   Dictionary Entry <n, ai>
0             <0, null>
1             <0, a0>
2             <0, a1>
…             …
m             <0, a(m−1)>
…             …
M             <0, a(M−1)>
Note: The 0-address entry in the dictionary is a null symbol. It is used to let the decoder know where the end of a string is; in a way, this entry is a punctuation mark. The pointers n in these first M + 1 entries are zero, meaning that they point to the null entry at address 0 at the beginning.

The initialization also sets the pointer variable to zero (n = 0) and the address pointer to M + 1 (m = M + 1). The address pointer points to the next "blank" location in the dictionary.
The following steps are then executed iteratively:

Step 2: Fetch the next source symbol a.

Step 3:
If the ordered pair <n, a> is already in the dictionary, then
    n = dictionary address of entry <n, a>
Else
    transmit n
    create new dictionary entry <n, a> at dictionary address m
    m = m + 1
    n = dictionary address of entry <0, a>

Step 4: Return to Step 2.
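Steps 1-4 can be sketched in Python (a simplified illustration; the final transmit flushes the last, possibly partial, string, which the worked trace below leaves pending):

```python
def lz_encode(symbols, alphabet):
    """Linked-list LZ encoder sketch following Steps 1-4 above."""
    # Step 1: initialize the dictionary with <0, null> plus <0, a> for each symbol
    dictionary = {(0, None): 0}
    for i, a in enumerate(alphabet, start=1):
        dictionary[(0, a)] = i
    n, m = 0, len(alphabet) + 1
    transmitted = []
    for a in symbols:                     # Step 2: fetch the next source symbol
        if (n, a) in dictionary:          # Step 3: extend the current string if possible
            n = dictionary[(n, a)]
        else:
            transmitted.append(n)         # emit the code word for the string matched so far
            dictionary[(n, a)] = m        # new entry <n, a> at address m
            m += 1
            n = dictionary[(0, a)]        # current symbol becomes the new string's root
    transmitted.append(n)                 # flush the final (possibly partial) string
    return transmitted, dictionary

# The source sequence of Example 1.13:
codes, _ = lz_encode("11000101100101110001111", "01")
print(codes)  # → [2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, 11]
```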
Example 1.13: A binary information source emits the sequence of symbols 110 001 011
001 011 100 011 11 etc. Construct the encoding dictionary and determine the sequence of
transmitted code symbols.
Initialize: n = 0, m = 3 (addresses 0, 1, 2 hold <0, null>, <0, 0>, <0, 1>).

Source symbol  Present n  Present m  Transmit  Next n  Dictionary entry
1              0          3          -         2       -
1              2          3          2         2       3: <2,1>
0              2          4          2         1       4: <2,0>
0              1          5          1         1       5: <1,0>
0              1          6          -         5       -
1              5          6          5         2       6: <5,1>
0              2          7          -         4       -
1              4          7          4         2       7: <4,1>
1              2          8          -         3       -
0              3          8          3         1       8: <3,0>
0              1          9          -         5       -
1              5          9          -         6       -
0              6          9          6         1       9: <6,0>
1              1          10         1         2       10: <1,1>
1              2          11         -         3       -
1              3          11         3         2       11: <3,1>
0              2          12         -         4       -
0              4          12         4         1       12: <4,0>
0              1          13         -         5       -
1              5          13         -         6       -
1              6          13         6         2       13: <6,1>
1              2          14         -         3       -
1              3          14         -         11      -
Thus, the encoder's dictionary is:

Dictionary address   Dictionary entry
0                    0, null
1                    0, 0
2                    0, 1
3                    2, 1
4                    2, 0
5                    1, 0
6                    5, 1
7                    4, 1
8                    3, 0
9                    6, 0
10                   1, 1
11                   3, 1
12                   4, 0
13                   6, 1
14                   No entry yet
Decoder
The decoder at the receiver must also construct an identical dictionary for decoding.
Moreover, reception of any code word means that a new dictionary entry must be
constructed. Pointer n for this new dictionary entry is the same as the received code word.
Source symbol a for this entry is not yet known, since it is the root symbol of the next
string (which has not been transmitted by the encoder).
If the address of the next dictionary entry is m, we see that the decoder can only construct
a partial entry <n, ?>, since it must await the next received code word to find the root
symbol a for this entry. It can, however, fill in the missing symbol a in its previous
dictionary entry, at address m -1. It can also decode the source symbol string associated
with the received code word n.
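This decoding procedure can be sketched in Python; the sketch also handles the corner case where a received code word points at the still-incomplete entry (that case does not occur in Example 1.13, but can in general):

```python
def lz_decode(codes, alphabet):
    """LZ decoder sketch: rebuilds the encoder's dictionary from the code stream."""
    entries = {0: (0, None)}
    for i, a in enumerate(alphabet, start=1):
        entries[i] = (0, a)

    def string_of(n):
        out = []                           # follow pointers back to the root
        while n != 0:
            n, a = entries[n]
            out.append(a)
        return out[::-1]                   # may end in None if the entry is partial

    decoded, m, prev = [], len(alphabet) + 1, None
    for n in codes:
        s = string_of(n)
        if prev is not None:
            # the root of this string is the missing symbol of the previous entry
            entries[prev] = (entries[prev][0], s[0])
            if s[-1] is None:              # code word pointed at the partial entry itself
                s[-1] = s[0]
        entries[m] = (n, None)             # new partial entry <n, ?>
        prev, m = m, m + 1
        decoded.extend(s)
    return ''.join(decoded)

# Decoding the code stream of Examples 1.13/1.14:
print(lz_decode([2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, 11], "01"))
# → '11000101100101110001111' (the original source sequence)
```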
Example 1.14: Decode the received code words transmitted in Example 1.13.
We know the received code words are 221 543 613 46
Address (m) n (pointer) ai (symbol) Decoded bits
0
1
2
3
4
5
6
7
8
9
… … … …
§1.5 Arithmetic Coding
Remarks

Arithmetic coding assigns one (normally long) code word to the entire input stream. It reads the input stream symbol by symbol, appending more bits to the code word each time. The code word is a number obtained from the symbol probabilities, which need to be known. It encodes symbols using a non-integer number of bits (on average), which results in a very good encoder efficiency (it allows the entropy lower bound to be approached). It is often used for data compression in image processing.
Encoder

Construct a code interval (rather than a code number) which uniquely describes a block of successive source symbols. Any convenient b within this range is a suitable code word, representing the entire block of symbols.

Algorithm:

Assign to each symbol ai in A an interval Ii = [Sli, Shi) of [0, 1).
Initialize j = 0, L0 = 0, H0 = 1.
REPEAT
    Read the next symbol ai; use its interval Ii = [Sli, Shi) to update:
    Δ = Hj − Lj
    Lj+1 = Lj + Δ · Sli
    Hj+1 = Lj + Δ · Shi
    j = j + 1
UNTIL all symbols have been encoded.
Select a number b that falls in the final interval [L, H) as the code word.
Example 1.15: For a 4-ary source A = {a0, a1, a2, a3} with P_A = {0.5, 0.3, 0.15, 0.05}, assign each ai in A a fraction Ii of the real number interval as

a0: I0 = [0, 0.5); a1: I1 = [0.5, 0.8); a2: I2 = [0.8, 0.95); a3: I3 = [0.95, 1).

Encode the sequence a1 a0 a0 a3 a2 with arithmetic coding.

j   ai   Lj        Hj      Δ        Lj+1      Hj+1
0   a1   0         1       1        0.5       0.8
1   a0   0.5       0.8     0.3      0.5       0.65
2   a0   0.5       0.65    0.15     0.5       0.575
3   a3   0.5       0.575   0.075    0.57125   0.575
4   a2   0.57125   0.575   0.00375  0.57425   0.5748125
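The interval-update loop can be sketched in Python (interval assignments from Example 1.15; floating-point is used for illustration, although practical implementations use integer arithmetic to limit round-off error):

```python
# Symbol intervals Ii = [Sli, Shi) from Example 1.15.
intervals = {'a0': (0.0, 0.5), 'a1': (0.5, 0.8), 'a2': (0.8, 0.95), 'a3': (0.95, 1.0)}

def arith_encode(sequence):
    L, H = 0.0, 1.0
    for a in sequence:
        Sl, Sh = intervals[a]
        delta = H - L
        L, H = L + delta * Sl, L + delta * Sh   # narrow the interval
    return L, H   # any b in [L, H) encodes the whole sequence

L, H = arith_encode(['a1', 'a0', 'a0', 'a3', 'a2'])
print(L, H)  # ≈ [0.57425, 0.5748125), the final interval of the table above
```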
Decoder
In order to decode the message, the symbol order and probabilities must be passed to the
decoder. The decoding process is identical to the encoding. Given the code word (the
final number), at each iteration the corresponding sub-range is entered, decoding the
symbols representing the specific range.
Given b, the decoding procedure is:

Initialize L = 0, H = 1, Δ = H − L.
REPEAT
    Find i such that (b − L)/Δ falls in Ii = [Sli, Shi)
    Output symbol ai
    H = L + Δ · Shi
    L = L + Δ · Sli
    Δ = H − L
UNTIL the last symbol is decoded.
Example 1.16: For the source and encoder in Example 1.15, decode b = 0.57470703125.

L         H       Δ        Ii   Next H      Next L    Next Δ     ai
0         1       1        I1   0.8         0.5       0.3        a1
0.5       0.8     0.3      I0   0.65        0.5       0.15       a0
0.5       0.65    0.15     I0   0.575       0.5       0.075      a0
0.5       0.575   0.075    I3   0.575       0.57125   0.00375    a3
0.57125   0.575   0.00375  I2   0.5748125   0.57425   0.0005625  a2
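The decoding loop can be sketched in Python (intervals from Example 1.15; here decoding stops after a known message length of 5 symbols — in practice an end-of-message symbol or a transmitted length terminates decoding):

```python
# Symbol intervals Ii = [Sli, Shi) from Example 1.15.
intervals = {'a0': (0.0, 0.5), 'a1': (0.5, 0.8), 'a2': (0.8, 0.95), 'a3': (0.95, 1.0)}

def arith_decode(b, n_symbols):
    L, H, out = 0.0, 1.0, []
    for _ in range(n_symbols):
        r = (b - L) / (H - L)                  # where b falls within [L, H)
        a = next(s for s, (Sl, Sh) in intervals.items() if Sl <= r < Sh)
        out.append(a)
        Sl, Sh = intervals[a]
        delta = H - L
        L, H = L + delta * Sl, L + delta * Sh  # same interval update as the encoder
    return out

print(arith_decode(0.57470703125, 5))  # → ['a1', 'a0', 'a0', 'a3', 'a2']
```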
Practical Issues

Attention must be paid to the precision with which we calculate (b − L)/Δ and compare it with Sli. Round-off error in this calculation can lead to an erroneous answer. Numerical overflow can occur (see the products Δ·Sli and Δ·Shi). The limited precision of Sli and Shi limits the size of the alphabet A. In practice it is important to transmit and decode the information "on the fly"; here, however, we must read in the entire block of source symbols before being able to compute the code word, and we must also receive the entire code word b before we can begin decoding.
Comparison of the three coding schemes:

                     Huffman                          Arithmetic                  Lempel-Ziv
Probabilities        Known in advance                 Known in advance            Not known in advance
Alphabet             Known in advance                 Known in advance            Not known in advance
Data Loss            None                             None                        None
Symbol Dependency    Not used                         Not used                    Used for better compression
Entropy              Achieved if probabilities        Very close                  Best results for long messages
                     are negative powers of 2
Code words           One code word for each symbol    One code word for all data  Code words for strings of source symbols
Intuition            Intuitive                        Not intuitive               Not intuitive
Ch 2 Channel and Channel Capacity
§2.1 Discrete Memoryless Channel Model
Communication Link
Definition
In most communication or storage systems, the signal is designed such that the output
symbols, y0,y1,...,yt , are statistically independent if the input symbols, c0,c1,...,ct , are
statistically independent. If the output set Y consists of discrete output symbols, and if the
property of statistical independence of the output sequence holds, the channel is called a
Discrete Memoryless Channel (DMC).
Transition Probability Matrix
Mathematically, we can view the channel as a probabilistic function that transforms a
sequence of (usually coded) input symbols, c, into a sequence of channel output symbols,
y. Because of noise and other impairments of the communication system, the
transformation is not a one-to-one mapping from the set of input symbols, C, to the set of
output symbols, Y. Any particular c from C may have some probability, p_{y|c}, of being
transformed to an output symbol y from Y; this probability is called a (Forward)
Transition Probability.

[Figure: communication link: information source, source encoder, channel encoder,
modulator, continuous-input continuous-output channel, demodulator, channel decoder,
source decoder. The modulator, physical channel, and demodulator form a composite
discrete-input discrete-output channel, with input sequence c0, c1, ..., ct drawn from
alphabet C (probabilities P_C) and output sequence y0, y1, ..., yt drawn from alphabet Y
(probabilities P_Y).]
For a DMC, let p_c be the probability that symbol c is transmitted. The probability that
the received symbol is y is then given in terms of the transition probabilities as

    q_y = Σ_{c∈C} p_{y|c} · p_c .

The probability distribution of the output set Y, denoted by Q_Y, may be easily calculated
in matrix form as

    [ q_{y_0}       ]   [ p_{y_0|c_0}        p_{y_0|c_1}        ...  p_{y_0|c_{M_C-1}}        ] [ p_{c_0}       ]
    [ q_{y_1}       ] = [ p_{y_1|c_0}        p_{y_1|c_1}        ...  p_{y_1|c_{M_C-1}}        ] [ p_{c_1}       ]
    [ ...           ]   [ ...                                                                 ] [ ...           ]
    [ q_{y_{M_Y-1}} ]   [ p_{y_{M_Y-1}|c_0}  p_{y_{M_Y-1}|c_1}  ...  p_{y_{M_Y-1}|c_{M_C-1}}  ] [ p_{c_{M_C-1}} ]

or, more compactly, Q_Y = P_{Y|C} · P_C. Here,

    P_C : probability distribution (column vector) of the input alphabet
    Q_Y : probability distribution (column vector) of the output alphabet
    P_{Y|C} : the M_Y × M_C transition probability matrix, with entries p_{y_i|c_k}.
Remarks: The columns of PY|C sum to unity (no matter what symbol is sent, some
output symbol must result). Numerical values for the transition probability matrix are
determined by analysis of the noise and transmission impairment properties of the
channel, and the method of modulation/demodulation.
Hard Decision Decoding : MY = MC. Hard refers to the decision that the demodulator
makes; it is a firm decision on what symbol was transmitted.
Soft Decision Decoding : MY > MC. The final decision is left to the receiver decoder.
Example 2.1: C={0,1} , with equally probable symbols; Y={y0, y1, y2}. The transition
probability matrix of the channel is
    P_{Y|C} = [ 0.80  0.05 ]
              [ 0.15  0.15 ]
              [ 0.05  0.80 ]

Q_Y = ?
Remarks: The sum of the elements on each column of the transition probability matrix is
1. This is an example of soft-decision decoding.
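The computation Q_Y = P_{Y|C} · P_C for this example can be checked in a few lines (a Python/NumPy sketch):

```python
import numpy as np

# Transition probability matrix of Example 2.1: column c holds P(y | c),
# so each column sums to 1.
P_YgC = np.array([[0.80, 0.05],
                  [0.15, 0.15],
                  [0.05, 0.80]])

P_C = np.array([0.5, 0.5])   # equally probable input symbols

Q_Y = P_YgC @ P_C            # output distribution Q_Y = P_{Y|C} P_C
print(Q_Y)                   # [0.425 0.15  0.425]
```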
Example 2.1 (cont’d): Calculate the entropy of Y for the previous system. Compare this
with the entropy of source C.
(how can this happen?)
Remarks: We noticed the same thing when we discussed the source encoder
(encryption encoder). It is possible for the output entropy to be greater than the input
entropy, but the “additional” information carried in the output is not related to the
information from the source. The “extra” information in the output comes from the
presence of noise in the channel during transmission, and not from the source C.
This “extra” information carried in Y is truly “useless”. In fact, it is harmful because it
produces uncertainty about what symbols were transmitted.
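The "output entropy exceeds input entropy" effect can be seen numerically; a minimal Python sketch using the Example 2.1 output distribution:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, with the convention 0*log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

P_C = [0.5, 0.5]              # equiprobable source C, so H(C) = 1 bit
Q_Y = [0.425, 0.15, 0.425]    # output distribution from Example 2.1

H_C = entropy_bits(P_C)
H_Y = entropy_bits(Q_Y)       # about 1.46 bits: larger than H(C)
print(H_C, H_Y)
```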
Question: Can we solve this problem by using only systems which employ hard-decision
decoding?
Answer:
Example 2.2: C={0,1} , with equally probable symbols; Y={0,1}. The transition
probability matrix of the channel is
Calculate the entropy of Y. Compare this with the entropy of source C.
    P_{Y|C} = [ 0.98  0.05 ]
              [ 0.02  0.95 ]
Remarks: Y carries less information than was transmitted by the source.
Question: Where did it go ?
Answer: It was lost during the transmission process. The channel is information lossy !
So far, we have looked at two examples, in which the output entropy was either greater or
less than the input entropy. What we have not considered yet is what effect all this has on
the ability to “tell from observing Y what original information was transmitted.”
Do not forget that the purpose of the receiver is to recover the original transmitted
information !
What does the observation of Y tell us about the transmitted information sequence?
As we know, Mutual information is a measure of how much the uncertainty of
generating a random variable c is reduced by observing a random variable y !
If Y tells us nothing about C (e.g., Y and C are independent, as when somebody cuts the
phone wire and no signal gets through), then I(C;Y) = 0. But if I(C;Y) = H(C), then
looking at Y there is no uncertainty on C; i.e., Y contains sufficient information to tell
what the transmitted sequence is. The conditional entropy H(C|Y) is a measure of how much
information loss occurs in the channel !
Example 2.3: Calculate the mutual information for the system of Example 2.1.
Remark: The mutual information for this system is well below the entropy ( H(C)=1 )
of the source and so, this channel has a high level of information loss.
Example 2.4: Calculate the mutual information for the system of Example 2.2.
Remarks: This channel is quite lossy also. Although H(Y) was almost equal to H(C) in
Example 2.2, the mutual information is considerably less than H(C) . One cannot tell
how much information loss we are dealing with simply by comparing the input and
output entropies !
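The quantity computed in Examples 2.3 and 2.4, I(C;Y) = H(Y) − H(Y|C), can be sketched in code (Python; `mutual_information` is an illustrative helper name, not from the notes):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, with 0*log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def mutual_information(P_YgC, P_C):
    """I(C;Y) = H(Y) - H(Y|C) for a DMC with column-stochastic P_YgC."""
    P_YgC = np.asarray(P_YgC, dtype=float)
    P_C = np.asarray(P_C, dtype=float)
    H_Y = entropy_bits(P_YgC @ P_C)
    H_YgC = sum(p * entropy_bits(P_YgC[:, c]) for c, p in enumerate(P_C))
    return H_Y - H_YgC

# Example 2.3: the soft-decision system of Example 2.1
I1 = mutual_information([[0.80, 0.05], [0.15, 0.15], [0.05, 0.80]], [0.5, 0.5])
# Example 2.4: the hard-decision system of Example 2.2
I2 = mutual_information([[0.98, 0.05], [0.02, 0.95]], [0.5, 0.5])
print(I1, I2)   # both well below H(C) = 1 bit
```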
§2.2 Channel Capacity and Binary Symmetric Channel
Maximization of Mutual Information and Channel Capacity
Each time the transmitter sends a symbol, it is said to use the channel. The Channel
Capacity is the maximum average amount of information that can be sent per channel
use.
Question: Why is it not the same as the mutual information ?
Answer: Because, for a fixed transition probability matrix, a change in the probability
distribution of C, P_C, results in a different mutual information, I(C;Y). The maximum
mutual information achieved for a given transition probability matrix is the Channel
Capacity:

    C_c = max over P_C of I(C;Y),

with units of bits per channel use.
An analytical closed-form solution to find CC is difficult to achieve for an arbitrary
channel. An efficient numerical algorithm for finding CC was derived in 1972, by Blahut
and Arimoto (see textbook).
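The Blahut-Arimoto iteration can be sketched as follows (a Python/NumPy sketch under the column-stochastic P_{Y|C} convention used in these notes; the stopping rule uses the standard lower/upper capacity bounds):

```python
import numpy as np

def blahut_arimoto(P_YgC, tol=1e-10, max_iter=10000):
    """Capacity (bits/channel use) of a DMC whose columns P_YgC[:, c] = P(y|c)."""
    P = np.asarray(P_YgC, dtype=float)
    My, Mc = P.shape
    p = np.full(Mc, 1.0 / Mc)              # start from a uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        q = P @ p                          # current output distribution
        # D[c] = exp( sum_y P(y|c) ln(P(y|c)/q_y) ), with 0*ln(0) taken as 0
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(P > 0, P * np.log(P / q[:, None]), 0.0)
        D = np.exp(terms.sum(axis=0))
        lower, upper = np.log(p @ D), np.log(D.max())   # nats; bracket the capacity
        p = p * D / (p @ D)                # Blahut-Arimoto update of the input law
        if upper - lower < tol:
            break
    return lower / np.log(2), p            # convert nats -> bits

# Case a) of Example 2.5:
C, p_opt = blahut_arimoto([[0.98, 0.05], [0.02, 0.95]])
print(C, p_opt)   # capacity near 0.78585, optimal input near (0.51289, 0.48711)
```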
Example 2.5: For the following transition probability matrix, find the channel capacity,
the input and output probability distributions that achieve the channel capacity, and
mutual information given a uniform Pc.
(Matrix rows are separated by semicolons.)

a) P_{Y|C} = [0.98 0.05; 0.02 0.95] :  C_c = 0.78585,   P_C = (0.51289, 0.48711),  Q_Y = (0.52698, 0.47302)

b) P_{Y|C} = [0.80 0.10; 0.20 0.90] :  C_c = 0.39775,   P_C = (0.4824, 0.5176),    Q_Y = (0.4377, 0.5623)

c) P_{Y|C} = [0.80 0.05; 0.20 0.95] :  C_c = 0.48130,   P_C = (0.46761, 0.53239),  Q_Y = (0.4007, 0.5993)

d) P_{Y|C} = [0.80 0.30; 0.20 0.70] :  C_c = 0.191238,  P_C = (0.510, 0.490),      Q_Y = (0.555, 0.445)

e) P_{Y|C} = [0.80 0.05; 0.15 0.15; 0.05 0.80] :  C_c = 0.57566,  P_C = (0.5, 0.5),  Q_Y = (0.425, 0.150, 0.425)
Remarks: The channel capacity proves to be a sensitive function of the transition
probability matrix, PY|C , but a fairly weak function of PC. The last case is interesting, as
the uniform input distribution produces the maximum mutual information.
This is an example of Symmetric Channel. Note that the columns of symmetric
channel’s transition probability matrix are permutations of each other. Likewise, the top
and bottom rows are permutations of each other. The center row, which is not a
permutation of the other rows, corresponds to the output symbol y1, which, as we noticed
in Example 2.3, makes no contribution to the mutual information.
Symmetric Channels
Symmetric channels play an important role in communication systems and many such
systems attempt, by design, to achieve a symmetric channel function. The reason for the
importance of the symmetric channel is that when such a channel is possible, it
frequently has greater channel capacity than a non-symmetric channel would have.
Example 2.6:

    P_{Y|C} = [0.79 0.05; 0.16 0.15; 0.05 0.80] :  C_c = 0.571215,
    P_C = (0.50095, 0.49905),  Q_Y = (0.4207, 0.1550, 0.4243)

The transition probability matrix is slightly changed compared to Example 2.5e), and the
channel capacity decreases.

Example 2.7:

    P_{Y|C} = [ 0.950  0.024  0.024  0.002 ]
              [ 0.024  0.950  0.002  0.024 ]   :  C_c = 1.653488,
              [ 0.024  0.002  0.950  0.024 ]      P_C = (0.25, 0.25, 0.25, 0.25)
              [ 0.002  0.024  0.024  0.950 ]
This is an example of using quadrature phase-shift keying (QPSK), which is a modulation
method that produces a symmetric channel. For QPSK, MC=MY=4.
Remarks:
i) The capacity for this channel is achieved when PC is uniformly distributed. This is
always the case for a symmetric channel.
ii) The columns of the transition probability matrix are permutations of each other, and so
are the rows.
iii) When the transition probability matrix is a square matrix, this permutation property of
columns and rows is a sufficient condition for a uniformly distributed input alphabet to
achieve the maximum mutual information. Indeed, the permutation condition is what gives
rise to the term "symmetric channel."
Binary Symmetric Channel (BSC)
A symmetric channel of considerable importance, both theoretically and practically, is the
binary symmetric channel (BSC), for which

    P_{Y|C} = [ 1-p    p  ]
              [  p    1-p ] .

The parameter p is known as the Crossover Probability; it is the probability that the
demodulator/detector makes a hard-decision decoding error. The BSC is the model for
essentially all binary-pulse transmission systems of practical importance.

Channel Capacity: for a uniform input probability distribution,

    C_c = 1 + p·log2(p) + (1-p)·log2(1-p),

which is often written as

    C_c = 1 - H(p),

where the notation H(p) arises from the terms involving p: H(p) = -p·log2(p) - (1-p)·log2(1-p).
Remarks:
The capacity is bounded by the range 0 ≤ C_c ≤ 1.
The upper bound is achieved only if p = 0 or p = 1.
The case p = 0 is not surprising, as it corresponds to a channel which does not
make errors (known as “noiseless” channel).
The case p = 1 corresponds to a channel which always makes errors. If we know
that the channel output is always wrong, we can easily set things right by
decoding the opposite of what the channel output is.
The case p = 0.5 corresponds to a channel for which the output symbol is as
likely to be correct as it is to be incorrect. Under this condition, the information
loss in the channel is total, and the channel capacity is zero. The capacity of the
BSC is a concave-upward function, possessing a single minimum at p = 0.5.
Except for p = 0 and p = 1 cases, the capacity of the BSC is always less than the
source entropy. If we try to transmit information through the channel using the
maximum amount of information per symbol, some of this info will be lost, and
decoding errors at the receiver will result. However, if we add sufficient
redundancy to the transmitted data stream, it is possible to reduce the
probability of lost information to an arbitrarily low level.
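The BSC capacity formula C_c = 1 − H(p) translates directly into code (a minimal Python sketch):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.0), bsc_capacity(0.5), bsc_capacity(1.0))
# 1.0 at p = 0 and p = 1 (deterministic channels), 0.0 at p = 0.5
```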
§2.3 Block Coding and Shannon’s 2nd Theorem
Equivocation
We have seen that there is a maximum amount of information per channel use that can be
supported by the channel. Any attempt to exceed this channel capacity will result in
information being lost during transmission. That is, I(C;Y) ≤ C_c < H(C), and so
H(C|Y) = H(C) - I(C;Y) > 0.
The conditional entropy H(C|Y) corresponds to our uncertainty about what the input of
the channel was, given our observation of the channel output. It is a measure of the
information loss during the transmission. For this reason, this conditional entropy is
often called the Equivocation. The equivocation has the property that 0 ≤ H(C|Y) ≤ H(C),
and it is given by

    H(C|Y) = - Σ_{y∈Y} Σ_{c∈C} p(c, y) · log2 p(c|y) .
The equivocation is zero if and only if the transition probabilities p_{y|c} are either zero or
one for all pairs (y ∈ Y, c ∈ C).
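The equivocation can be evaluated directly from the joint distribution (a Python/NumPy sketch, applied here to the Example 2.1 channel, for which H(C|Y) = H(C) − I(C;Y)):

```python
import numpy as np

def equivocation(P_YgC, P_C):
    """H(C|Y) = -sum_{c,y} p(c,y) log2 p(c|y) for a DMC (columns of P_YgC sum to 1)."""
    P_YgC = np.asarray(P_YgC, dtype=float)
    P_C = np.asarray(P_C, dtype=float)
    joint = P_YgC * P_C                # joint[y, c] = p(y|c) p(c)
    q = joint.sum(axis=1)              # output distribution q_y
    post = joint / q[:, None]          # posterior p(c|y)
    mask = joint > 0                   # skip zero-probability pairs (0 log 0 = 0)
    return float(-(joint[mask] * np.log2(post[mask])).sum())

# Example 2.1's channel with equiprobable inputs: H(C) = 1 bit.
HCgY = equivocation([[0.80, 0.05], [0.15, 0.15], [0.05, 0.80]], [0.5, 0.5])
print(HCgY)    # roughly 0.42 bits lost per channel use
```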
Entropy Rate
The entropy of a block of n symbols satisfies the inequality

    H(C_0, C_1, ..., C_{n-1}) ≤ n·H(C),

with equality if and only if C is a memoryless source. In transmitting a block of n
symbols, we use the channel n times. Recall that channel capacity has units of bits per
channel use, and refers to an average amount of information per channel use. Since
H(C_0, C_1, ..., C_{n-1}) is the average information contained in the n-symbol block, it
follows that the average information per channel use would be H(C_0, C_1, ..., C_{n-1}) / n.
However, the average bits per channel use is achieved in the limit, as n goes to infinity:

    R = lim_{n→∞} H(C_0, C_1, ..., C_{n-1}) / n ≤ H(C),

where R is called the Entropy Rate, with equality if and only if all symbols are
statistically independent.
Suppose that they are not, and in the transmission of the block, we deliberately introduce
redundant symbols. Then, R < H(C). Taking this further, suppose that we introduce a
sufficient number of redundant symbols in the block so that R falls below the channel
capacity, R < C_c.

Question: Is transmission without information loss (i.e. zero equivocation) possible in
such a case?
Answer: Remarkably enough, the answer to this question is “YES”!
What is the implication of doing so ?
It is possible to send information through the channel with arbitrarily low probability of
error.
The process of adding redundancy to a block of transmitted symbols is called Channel
Coding.
Question: Does there exist a channel code that will accomplish this purpose?
Answer: The answer to this question is given by Shannon's second theorem.
Shannon’s 2nd Theorem
Suppose R < Cc , where Cc is the capacity of a memoryless channel. Then, for any ε > 0,
there exists a block code of length n and rate R whose probability of block decoding error
pe satisfies pe ≤ ε when the code is used on this channel.
Shannon’s second theorem (also called Shannon’s main theorem) tells us that it is
possible to transmit information over a noisy channel with arbitrarily small probability of
error. The theorem says that if the entropy rate R in a block of n symbols is smaller
than the channel capacity, then we can make the probability of error arbitrarily
small.
What error are we speaking about?
Suppose we send a block of n bits in which k < n of these bits are statistically
independent “information” bits and n-k are redundant “parity” bits computed from the k
information bits, according to some coding rule. The entropy of the block will then be k
bits and the average information in bits per channel use will be R = k/n.
If this entropy rate is less than the channel capacity, Shannon’s main theorem says we can
make the probability of error in recovering our original k information bits arbitrarily
small. The channel will make errors within our block of n bits, but the redundancy built
into the block will be sufficient to correct these errors and recover the k bits of
information we transmitted.
Shannon’s theorem does not say that we can do this for just any block length n we
might want to choose! The theorem says there exists a block length n for which there is
a code of rate R. The required size of the block length n depends on the upper bound
we pick for our error probability. Actually, Shannon’s theorem implies very strongly
that the block length n is going to be very large if R is to approach CC to within an
arbitrarily small distance with an arbitrarily small probability of error.
The complexity and expense of an error-correcting channel code are believed to grow
rapidly as R approaches the channel capacity and the probability of a block decoding
error is made arbitrarily small. It is believed by many that beyond a particular rate, called
Cutoff Rate, R0, it is prohibitively expensive to use the channel. In the case of the binary
symmetric channel, this rate is given by

    R_0 = -log2[ 0.5·(1 + 2·sqrt(p·(1-p))) ] = 1 - log2(1 + 2·sqrt(p·(1-p))) .
The belief that R0 is some kind of “sound barrier” for practical error correcting codes
comes from the fact that for certain kind of decoding methods, the complexity of the
decoder grows extremely rapidly as R exceeds R0.
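A sketch of the cutoff-rate computation (Python; assuming the standard BSC form R0 = 1 − log2(1 + 2·sqrt(p(1−p)))):

```python
import math

def bsc_cutoff_rate(p):
    """Cutoff rate of a BSC with crossover probability p:
    R0 = 1 - log2(1 + 2*sqrt(p*(1-p)))."""
    return 1.0 - math.log2(1.0 + 2.0 * math.sqrt(p * (1.0 - p)))

print(bsc_cutoff_rate(0.0), bsc_cutoff_rate(0.5))   # 1.0 and 0.0
```

For 0 < p < 0.5 the cutoff rate lies strictly below the capacity 1 − H(p), which is consistent with its role as a practical "sound barrier" below capacity.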
§2.4 Markov Processes and Sources with Memory
Markov Process
Thus far, we have discussed memoryless sources and channels. We now turn our
attention to sources with memory. By this, we mean information sources, where the
successive symbols in a transmitted sequence are correlated with each other, i.e.,
the sources in a sense “remember” what symbols they have previously emitted, and the
probability of their next symbol depends on this history.
Sources with memory arise in a number of ways. First, natural languages, such as
English, have this property. For example, the letter “q” in English is almost always
followed by the letter “u”. Similarly, the letter “t” is followed by the letter “h”
approximately 37% of the time in English text. Many real-time signals, such as speech
waveform, are also heavily time correlated. Any time correlated signal is a source with
memory. Finally, we sometimes wish to deliberately introduce some correlation
(redundancy) in a source for purposes of block coding, as discussed in the previous
section.
Let A be the alphabet of a discrete source having MA symbols, and suppose this source
emits a time sequence of symbols (s0,s1,…,st,…) with each stA. If the conditional
probability p(st | st-1,…,s0) depends only on j previous symbols, so that
p(st | st-1,…,s0)=p(st | st-1,…,st-j),
then A is called a j-th order Markov process. The string of j symbols (s_{t-1}, ..., s_{t-j})
is called the state of the Markov process at time t. A j-th order Markov process, therefore,
has N = M_A^j possible states.
Let us number these possible states from 0 to N-1 and let π_n(t) represent the probability
of being in state n at time t. The probability distribution of the system at time t can
then be represented by the vector

    Π_t = [ π_0(t), π_1(t), ..., π_{N-1}(t) ]^T .
For each state at time t, there are MA possible next states at time t +1, depending on which
symbol is emitted next by the source.
If we let p_{i|k} be the conditional probability of going to state i given that the present
state is k, the state probability distribution at time t + 1 is governed by the transition
probability matrix

    P_A = [ p_{0|0}      p_{0|1}      ...  p_{0|N-1}   ]
          [ p_{1|0}      p_{1|1}      ...  p_{1|N-1}   ]
          [ ...          ...          ...  ...         ]
          [ p_{N-1|0}    p_{N-1|1}    ...  p_{N-1|N-1} ]

and is given by

    Π_{t+1} = P_A · Π_t .

Example 2.8: Let A be a binary first-order Markov source with A={0,1}. This source
has 2 states, labeled "0" and "1". Let the transition probabilities be

    P_A = [ 0.3  0.4 ]
          [ 0.7  0.6 ] .

What is the equation for the next probability state? Find the state probabilities at time
t = 2, given that the probabilities at time t = 0 are π_0 = 1 and π_1 = 0.
The next-state equation for the state probabilities is Π_{t+1} = P_A · Π_t.
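Example 2.8's next-state recursion Π_{t+1} = P_A Π_t can be iterated numerically (a Python/NumPy sketch):

```python
import numpy as np

# Transition probability matrix of Example 2.8 (columns sum to 1).
P_A = np.array([[0.3, 0.4],
                [0.7, 0.6]])

pi = np.array([1.0, 0.0])    # state probabilities at t = 0
for t in range(2):           # apply the next-state equation twice
    pi = P_A @ pi
print(pi)                    # [0.37 0.63] at t = 2
```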
Example 2.9: Let A be a second-order binary Markov source with

    Pr(a = 0 | 0,0) = 0.2    Pr(a = 1 | 0,0) = 0.8
    Pr(a = 0 | 0,1) = 0.4    Pr(a = 1 | 0,1) = 0.6
    Pr(a = 0 | 1,0) = 0.0    Pr(a = 1 | 1,0) = 1.0
    Pr(a = 0 | 1,1) = 0.5    Pr(a = 1 | 1,1) = 0.5

If all the states are equally probable at time t = 0, what are the state probabilities at t = 1?

Define the states S0 = 00, S1 = 01, S2 = 10, S3 = 11. The possible state transitions and
their associated transition probabilities can be represented using a state diagram.
[Figure: state diagram for this problem.]

The next state probability equation is

    Π_{t+1} = P_A · Π_t ,  with  P_A = [ 0.2  0.4  0.0  0.0 ]
                                       [ 0.0  0.0  0.0  0.5 ]
                                       [ 0.8  0.6  0.0  0.0 ]
                                       [ 0.0  0.0  1.0  0.5 ] .
Remarks: Every column of the transition probability matrix adds to one. Every properly
constructed transition probability matrix has this property.
Steady State Probability and the Entropy Rate
Starting from the equation for the state probabilities, it can be shown by induction that
the state probabilities at time t are given by Π_t = (P_A)^t · Π_0.
A Markov process is said to be Ergodic if we can get from the initial state to any other
state in some number of steps and if, for large t, Πt approaches a steady-state value that is
independent of the initial probability distribution, Π0. The steady-state value is reached
when the state distribution satisfies Π = P_A · Π.
The Markov processes which model information sources are always ergodic.

Example 2.10: Find the steady-state probability distribution for the source in Example
2.9.
In the steady state, the state probabilities satisfy Π = P_A · Π, i.e., π_n = Σ_k p_{n|k}·π_k
for n = 0, ..., N-1.
It appears from this that we have four equations and four unknowns, so, solving for the
four probabilities is no problem. However, if we look closely, we will see that only three
of the equations above are linearly independent. To solve for the probabilities, we can use
any of three of the above equations and the constraint equation. This equation is a
consequence of the fact that the total probability must sum to unity;
it is certain that the system is in some state!
Dropping the first equation above and using the constraint, we have

    π_1 = 0.5·π_3
    π_2 = 0.8·π_0 + 0.6·π_1
    π_3 = π_2 + 0.5·π_3
    π_0 + π_1 + π_2 + π_3 = 1,
which has the solution

    π_0 = 1/9,  π_1 = 2/9,  π_2 = 2/9,  π_3 = 4/9.

This solution is independent of the initial probability distribution. The situation
illustrated in the previous example, where only N - 1 of the equations resulting from the
transition probability expression are linearly independent and we must use the "sum to
unity" equation to obtain the solution, always occurs in the steady-state probability
solution of an ergodic Markov process.
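Example 2.10's steady state can be verified by solving the linear system (a Python/NumPy sketch; the state ordering S0=00, S1=01, S2=10, S3=11 and the resulting matrix are my reconstruction from the Example 2.9 probabilities):

```python
import numpy as np

# Transition matrix for Example 2.9: emitting symbol a moves state (x, y) to (a, x).
P = np.array([[0.2, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.5],
              [0.8, 0.6, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.5]])

# Steady state: solve (P - I) pi = 0, replacing one dependent equation
# with the "sum to unity" constraint.
A = np.vstack([(P - np.eye(4))[:-1],
               np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)    # [1/9, 2/9, 2/9, 4/9]
```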
Entropy Rate of an Ergodic Markov Process
POP QUIZ: How do you define the entropy rate?
The entropy rate, R, is the average information per channel use (average info bits per
channel use):

    R = lim_{t→∞} H(A_0, A_1, ..., A_{t-1}) / t ≤ H(A),

with equality if and only if all symbols are statistically independent.

For ergodic Markov sources, as t grows very large, the state probabilities converge
to a steady-state value, π_n, for each of the N possible states (n = 0, ..., N-1). As t becomes
large, the average information per symbol in the block of symbols will be determined by
the probabilities of occurrence of the symbols in A, after the state probabilities converge
to their steady-state values.
Suppose we are in state S_n at time t. The conditional entropy of A is

    H(A | S_n) = - Σ_{a∈A} p(a | S_n) · log2 p(a | S_n) .

Since each possible symbol a leads to a single state, S_n can lead to M_A possible next
states. The remaining N - M_A states cannot be reached from S_n, and for these states the
transition probability p_{i|n} = 0. Therefore, the conditional entropy expression can be
expressed in terms of the transition probabilities as

    H(A | S_n) = - Σ_{i=0}^{N-1} p_{i|n} · log2 p_{i|n}   (with 0·log2 0 = 0).

For large t, the probability of being in state S_n is given by its steady-state probability π_n.
Therefore, the entropy rate of the system is

    R = Σ_{n=0}^{N-1} π_n · H(A | S_n).

This expression, in turn, is equivalent to

    R = - Σ_{n=0}^{N-1} π_n · Σ_{i=0}^{N-1} p_{i|n} · log2 p_{i|n} ,

where p_{i|n} are the entries in the transition probability matrix and the π_n are the
steady-state probabilities.
Example 2.11: Find the entropy rate for the source in Example 2.9. Calculate the steady-
state probability of the source emitting a “0” and the steady-state probability of the source
emitting a “1”. Calculate the entropy of a memoryless source having these symbol
probabilities and compare the result with the entropy rate of the Markov source.
With the steady-state probabilities calculated in Example 2.10, applying the formula
for the entropy rate of an ergodic Markov source gives

    R = (1/9)·H(0.2) + (2/9)·H(0.4) + (2/9)·H(0.0) + (4/9)·H(0.5) ≈ 0.7404 bits/symbol,

where H(·) denotes the binary entropy function. The steady-state probabilities of emitting
0 and 1 are, respectively,

    P(0) = (1/9)(0.2) + (2/9)(0.4) + (2/9)(0.0) + (4/9)(0.5) = 1/3,   P(1) = 2/3.

The entropy of a memoryless source having this symbol distribution is

    H(X) = -(1/3)·log2(1/3) - (2/3)·log2(2/3) ≈ 0.9183 bits.

Thus, R < H(X), as expected.
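The entropy-rate formula R = Σ_n π_n H(A|S_n) for this example can be evaluated as follows (a Python/NumPy sketch; the per-state symbol probabilities are taken from Example 2.9):

```python
import numpy as np

# Steady-state probabilities from Example 2.10 (states S0=00, S1=01, S2=10, S3=11)
pi = np.array([1/9, 2/9, 2/9, 4/9])
# Pr(a = 0 | state) for each state, from Example 2.9
p0_given_state = np.array([0.2, 0.4, 0.0, 0.5])

def H2(p):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

R = sum(pi_n * H2(p) for pi_n, p in zip(pi, p0_given_state))
P0 = float(pi @ p0_given_state)      # steady-state probability of emitting a 0
H_memoryless = H2(P0)
print(R, P0, H_memoryless)           # R < H_memoryless, as the notes conclude
```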
Remarks:
i) In an earlier section, we discussed how introducing redundancy into a block of
symbols can be used to reduce the entropy rate to a level below the channel capacity and
how this technique can be used for error correction at the receive-side, in order to
achieve an arbitrarily small information bit error rate.
ii) In this section, we have seen that a Markov process also introduces redundancy
into the symbol block.
Question: Can this redundancy be introduced in such a way as to be useful for error
correction?
Answer: YES! This is the principle underlying a class of error correcting codes known
as convolutional codes.
iii) In the previous lecture we examined the process of transmitting information C
through a channel, which produces a channel output Y. We have found out that a noisy
channel introduces information loss if the entropy rate exceeds the channel capacity.
iv) It is natural to wonder if there might be some (possibly complicated) form of data
processing which can be performed on Y to recover the lost information. Unfortunately,
the answer to this question is NO! Once the information has been lost, it is gone!
Data Processing Inequality
This states that additional processing of the channel output (Y → Z) can at best result in
no further loss of information, I(C;Z) ≤ I(C;Y), and may even result in additional
information loss.
A very common example of this kind of information loss is the roundoff or truncation
error during digital signal processing in a computer or microprocessor. Another
examples is quantization in an analog to digital converter. Designers of these systems
need to have an awareness of the possible impact of such design decisions, as the word
length of the digital signal processor or the number of bits of quantization in analog to
digital converters, on the information content.
[Figure: Y → Data Processing → Z.]
§2.5 Constrained Channels
Channel Constraints
So far, we have considered only memoryless channels corrupted by noise, which are
modeled as discrete-input discrete-output memoryless channels. However, in many cases
we have channels which place constraints on the information sequence.
The coded information sequence a_t is presented to the modulator, which transforms the symbol
sequence into continuous-valued waveform signals, designed to be compatible with the
physical channel (bandlimited channel). Examples of bandlimited channels are wireless
channels, telephone lines, TV cables, etc. During transmission, the information bearing
signal is distorted by the channel and corrupted with noise. The output of the
demodulator, which attempts to combat the distortion and minimize the effect of the
noise, is sampled, and the detector attempts to reconstruct the original coded sequence, a_t.
Timing recovery is required; the performance of this block is crucial in recovering
the information. The theory and practice of performing these tasks constitute
modulation theory, which is treated in "Digital Communications" textbooks. In this
course, we are concerned with the information theory aspects of this process. What
are these aspects?
Remarks:
i) When the system needs to recover the timing information, additional information
should be transmitted for that. As the maximum information rate is limited by the
[Figure: Block Diagram of a Typical Communication System: modulated signal s(t),
additive noise, demodulator and sampler, symbol detector with timing recovery, output y_t.]
channel capacity, the information needed for timing recovery is included at the expense
of user information. This may require that the sequence of transmitted symbols be
constrained in such a way as to guarantee the presence of timing information embedded
within the transmitted coded sequence.
ii) Another aspect arises from the type and severity of channel distortions imposed by the
physical bandlimited channel. We can think of the physical channel as performing a kind
of data processing on the information bearing waveform presented to it by the modulator.
But data processing might result in information loss. A given channel can thus place its
own constraints on the allowable symbol sequences which can be "processed" without
information loss.
iii) Modulation theory tells us that it is possible and desirable to model the
communication channel as a cascade of a noise-free constrained channel and an
unconstrained noisy channel (we have implicitly used such a model, except that we have
not considered any constraint on the input symbol sequence).

[Figure: a_t → constrained channel, h_t → x_t → (+ noise, n_t) → r_t → decision block → y_t.]
Linear and Time-Invariant (LTI) Channel
The LTI channel is specified by a set of parameters ht, which represent the channel
impulse response. The channel's output sequence is related to the input sequence by the
convolution

    x_t = Σ_k h_k · a_{t-k} .

The decision block is presented with a noisy signal

    r_t = x_t + n_t .
The decision block takes these inputs and produces output symbols, yt, drawn from a
finite alphabet Y, with MY ≥ MA.
If MY =MA, yt is an estimate of the transmitted symbol at, and the decision block is
said to make a Hard-decision.
If MY > MA, the decision block is said to make a Soft-decision, and the final decision
on the transmitted symbol at is made by the decoder.
Example 2.12: Let A be a source with equiprobable symbols, A={-1,1}. The bandlimited
channel has the impulse response {h_0 = 1, h_1 = 0, h_2 = -1}. Calculate the steady-state entropy
of the constrained channel’s output and the entropy rate of the sequence xt.
State of the channel at time t : St = <at-1,at-2>.
The states are as follows:
(-1,-1) is state S0, (1,-1) is state S1,
(-1, 1) is state S2, (1, 1) is state S3.
The channel can be represented as a Markov process, with the state diagram given in the
sequel.

[Figure: state diagram with states S0, S1, S2, S3; arrows labeled a_t / x_t; all transition
probabilities, shown in parentheses, equal to 0.5.]

Note that all transition probabilities, shown in parentheses, are 0.5. The arrows are
labeled a_t / x_t. One can easily show that X = {-2, 0, 2}.

The state probability equation is then given by

    Π_{t+1} = [ 0.5  0    0.5  0   ]
              [ 0.5  0    0.5  0   ]  Π_t ,
              [ 0    0.5  0    0.5 ]
              [ 0    0.5  0    0.5 ]
from which we set up 4 equations and find the steady-state probabilities, i.e.,

    π_i = 0.25,  i = 0, 1, 2, 3.

The output symbol probabilities are:

    P(x = -2) = 0.25,  P(x = 0) = 0.5,  P(x = 2) = 0.25.

The steady-state entropy of the channel output is

    H(X) = -2·(0.25·log2 0.25) - 0.5·log2 0.5 = 1.5 bits.

The entropy rate is

    R = 1 bit/symbol,

which equals the source entropy → the channel is lossless.
Note that the entropy rate is not equal to the steady state entropy of the channel’s output
symbols. While the channel is lossless, the sequences it produces do not carry
sufficient information to permit clock recovery for arbitrary input sequences. For
example, a long input sequence of “-1”, “+1”, or a long sequence of alternating symbols,
“+1-1” or“-1+1”, all produce a long output of zeros at the output of the channel. Timing
recovery methods can fail in such situations.
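The output statistics of Example 2.12 can be obtained by enumerating the equally likely (a_t, a_{t-1}, a_{t-2}) triples (a Python sketch):

```python
import numpy as np
from itertools import product

# Example 2.12: x_t = a_t - a_{t-2} for impulse response {h0=1, h1=0, h2=-1}.
# In steady state all four states (a_{t-1}, a_{t-2}) are equally likely and the
# next symbol a_t is +/-1 with probability 0.5, so we can enumerate outcomes.
prob_x = {}
for a_t, a_t1, a_t2 in product([-1, 1], repeat=3):
    x = a_t - a_t2                  # h1 = 0, so a_{t-1} does not affect the output
    prob_x[x] = prob_x.get(x, 0.0) + 1 / 8
print(prob_x)                       # P(-2) = 0.25, P(0) = 0.5, P(2) = 0.25

H_X = -sum(p * np.log2(p) for p in prob_x.values())
print(H_X)                          # 1.5 bits: the steady-state output entropy
# The entropy rate, however, is H(A) = 1 bit/symbol, since the output is a
# lossless deterministic map of the i.i.d. +/-1 input sequence.
```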
Ch 3 Error Control Strategies
Error Control Strategies
Forward Error Correction (FEC)
Automatic Repeat Request (ARQ)
Forward Error Correction (FEC)

In a one-way communication system, the transmission or recording is strictly in one
direction, from transmitter to receiver. The error control strategy must be FEC; that is,
such systems employ error-correcting codes that automatically correct errors detected at
the receiver. Examples: 1) digital storage systems, in which the information recorded can
be replayed weeks or even months after it is recorded, and 2) deep-space communication
systems. Most of the coded systems in use today employ some form of FEC, even if the
channel is not strictly one-way! For a two-way system, however, the error control strategy
can use error detection and retransmission, which is called automatic repeat request (ARQ).
§3.1 Automatic Repeat Request
Automatic Repeat Request (ARQ)
In most communication systems, the information can be sent in both directions, and the
transmitter also acts at a receiver (transceiver), and vice-versa. For example: data
networks, satellite communications, etc. Error control strategies for a two-way system
can include error detection and retransmission, called Automatic Repeat Request
(ARQ). In an ARQ system, when errors are detected at the receiver, a request is sent for
the transmitter to repeat the message, and repeat requests continue to be sent until the
message is correctly received.

ARQ systems:
  - Stop-and-Wait ARQ
  - Continuous ARQ: Go-Back-N ARQ and Selective Repeat ARQ
Types
Stop-and-Wait (SW) ARQ: The transmitter sends a block of information to the receiver
and waits for a positive (ACK) or negative (NAK) acknowledgment from the receiver. If
an ACK is received (no error detected), the transmitter sends the next block. If a NAK is
received (errors detected) , the transmitter resends the previous block. When the errors
are persistent, the same block may be retransmitted several times before it is correctly
received and acknowledged.
Continuous ARQ: The transmitter sends blocks of information to the receiver
continuously and receives acknowledgments continuously. When a NAK is received, the
transmitter begins a retransmission. It may back up to the erroneous block and resend that
block plus the N-1 blocks that follow it. This is called Go-Back-N (GBN) ARQ. Alternatively,
the transmitter may simply resend only those blocks that are negatively acknowledged.
This is known as Selective Repeat (SR) ARQ.
Comparison
GBN Versus SR ARQ
SR ARQ is more efficient than GBN ARQ, but requires more logic and buffering.
Continuous Versus SW ARQ
Continuous ARQ is more efficient than SW ARQ, but it is more expensive to implement.
For example, in satellite communication, where the transmission rate is high and the
round-trip delay is long, continuous ARQ is used. SW ARQ is used in systems where
the time taken to transmit a block is long compared to the time taken to receive an
acknowledgment. SW ARQ is used on half-duplex channels (only one way transmission
at a time), whereas continuous ARQ is designed for use on full-duplex channels
(simultaneous two-way transmission).
Performance Measure
Throughput Efficiency: the ratio of the average number of information bits successfully
accepted by the receiver per unit of time to the total number of information digits that
could have been transmitted per unit of time.
Delay of a Scheme: The interval from the beginning of a transmission of a block to the
receipt of a positive acknowledgment for that block.
GBN Versus SR ARQ
[Figure 1: from Lin and Costello, Error Control.]
ARQ Versus FEC
The major advantage of ARQ over FEC is that error detection requires much simpler
decoding equipment than error correction. Also, ARQ is adaptive in the sense that
information is retransmitted only when errors occur. In contrast, when the channel error
rate is high, retransmissions must be sent too frequently, and the SYSTEM THROUGHPUT
is lowered by ARQ. In this situation, a HYBRID combination of FEC for the most
frequent error patterns along with error detection and retransmission for the less likely
error patterns is more efficient than ARQ alone (HYBRID ARQ).
§3.2 Forward Error Correction
Performance Measures – Error Probability
The performance of a coded communication system is in general measured by its
probability of decoding error (called the Error Probability) and its coding gain over the
uncoded system that transmits information at the same rate (with the same modulation
format).
There are two types of error probabilities, probability of word (or block) error and
probability of bit error. The probability of block error is defined as the probability that
a decoded word (or block) at the output of the decoder is in error. This error probability is
often called the Word-Error Rate (WER) or Block-Error Rate (BLER). The
probability of bit error, also called the Bit Error Rate (BER), is defined as the
probability that a decoded information bit at the output of the decoder is in error.
A coded communication system should be designed to keep these two error probabilities
as low as possible under certain system constraints, such as power, bandwidth and
decoding complexity.
The error probability of a coded communication system is commonly expressed in terms
of the ratio of the energy per information bit, Eb, to the one-sided power spectral density
(PSD) N0 of the channel noise.
Example 3.1: Consider a coded communication system using the (23, 12) binary Golay
code for error control. Each code word consists of 23 code digits, of which 12 are
information digits. Therefore, there are 11 redundant bits, and the code rate is R = 12/23 = 0.5217.
Suppose that BPSK modulation with coherent detection is used and the channel is
AWGN with one-sided PSD N0. Let Eb/N0 at the input of the receiver be the signal-to-
noise ratio (SNR), which is usually expressed in dB.
The bit-error performance of the (23,12) Golay code with both hard- and soft-decision
decoding versus SNR is given, along with the performance of the uncoded system.
[Figure 2: BER versus SNR for the (23,12) Golay code with hard- and soft-decision decoding, and for the uncoded system. From Lin and Costello, Error Control Coding.]
From the above figure, the coded system, with either hard- or soft-decision decoding,
provides a lower bit-error probability than the uncoded system for the same SNR, when
the SNR is above a certain threshold.
With hard-decision, this threshold is 3.7 dB.
For SNR = 7 dB, the BER of the uncoded system is 8×10^-4, whereas the coded system
(hard-decision) achieves a BER of 2.9×10^-5. This is a significant improvement in
performance.
For SNR = 5 dB this improvement is small: 2.1×10^-3 compared to 6.5×10^-3.
However, with soft-decision decoding, the coded system achieves a BER of 7×10^-5.
Performance Measures – Coding Gain
The other performance measure is the Coding Gain. Coding gain is defined as the
reduction in SNR required to achieve a specific error probability (BER or WER) for a
coded communication system compared to an uncoded system.
Example 3.1 (cont'd): Determine the coding gain for BER = 10^-5.
For a BER of 10^-5, the Golay-coded system with hard-decision decoding has a coding gain
of 2.15 dB over the uncoded system, whereas with soft-decision decoding a coding gain
of more than 4 dB is achieved. This result shows that soft-decision decoding of the Golay
code achieves 1.85 dB additional coding gain compared to hard-decision decoding at a
BER of 10^-5.
This additional coding gain is achieved at the expense of higher decoding complexity.
Coding gain is important in communication applications, where every dB of improved
performance results in savings in overall system cost.
Remarks:
At sufficiently low SNR, the coding gain actually becomes negative. This threshold
phenomenon is common to all coding schemes: there always exists an SNR below which
the code loses its effectiveness and actually makes the situation worse. This SNR is
called the Coding Threshold. It is important to keep this threshold low and to operate a
coded communication system at an SNR well above its coding threshold.
Another quantity that is sometimes used as a performance measure is the Asymptotic
Coding Gain (the coding gain for large SNR).
§3.3 Shannon’s Limit of Code Rate
Shannon’s Limit
In designing a coding system for error control, it is desired to minimize the SNR
required to achieve a specific error rate. This is equivalent to maximizing the coding
gain of the coded system compared to an uncoded system using the same modulation
format. A theoretical limit on the minimum SNR required for a coded system with
code rate R to achieve error-free communication (or an arbitrarily small error
probability) can be derived based on Shannon’s noisy coding theorem.
This theoretical limit, often called the Shannon Limit, simply says that for a coded
system with code rate R, error-free communication is achieved only if the SNR exceeds
this limit. As long as SNR exceeds this limit, Shannon’s theorem guarantees the existence
of a (perhaps very complex) coded system capable of achieving error-free
communication.
For transmission over a binary-input, continuous-output AWGN channel with BPSK signaling,
the Shannon limit, in terms of SNR as a function of the code rate, does not have a closed
form; however, it can be evaluated numerically.
[Figure 3: Shannon limit (in dB) versus code rate; for R = 1/2 the limit is 0.188 dB. Figure 4: BER of a rate R = 1/2 convolutional code versus SNR, showing a 5.35 dB coding gain over uncoded BPSK and a 9.462 dB maximum potential gain. From Lin and Costello, Error Control Coding.]
From Fig. 3 (Shannon limit as a function of the code rate for BPSK signaling on a
continuous-output AWGN channel), one can see that the minimum required SNR to
achieve error free communication with a coded system with rate R=1/2, is 0.188 dB. The
Shannon limit can be used as a yardstick to measure the maximum achievable coding
gain for a coded system with a given rate R over an uncoded system with the same
modulation format. For example, to achieve BER = 10^-5, an uncoded BPSK system
requires an SNR of 9.65 dB. For a coded system with code rate R = 1/2, the Shannon limit
is 0.188 dB. Therefore, the maximum potential coding gain for a coded system with code
rate R = 1/2 is 9.462 dB.
For example (Fig. 4), a rate R = 1/2 convolutional code with memory order 6 achieves
BER = 10^-5 at SNR = 4.15 dB, a coding gain of 5.35 dB compared to the
uncoded system. However, it is still 3.962 dB away from the Shannon limit. This gap can be
reduced by using a more powerful code.
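The 9.65 dB figure quoted above for uncoded BPSK can be checked numerically. A small Python sketch (illustrative; it assumes the standard AWGN result BER = Q(sqrt(2Eb/N0)) for coherent BPSK) bisects for the SNR at which BER = 10^-5:

```python
from math import erfc, sqrt

def ber_bpsk_uncoded(ebn0_db):
    # BER of uncoded coherent BPSK on AWGN:
    # Q(sqrt(2*Eb/N0)) = 0.5*erfc(sqrt(Eb/N0)).
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * erfc(sqrt(ebn0))

# Bisect for the SNR (in dB) giving BER = 1e-5; BER decreases with SNR.
lo, hi = 0.0, 20.0
for _ in range(60):
    mid = (lo + hi) / 2
    if ber_bpsk_uncoded(mid) > 1e-5:
        lo = mid
    else:
        hi = mid
print(round(lo, 2))  # ≈ 9.59 dB, close to the ~9.65 dB quoted above
```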
§3.4 Codes for Error Control
Basic Concepts in Error Control
Error control can be based on ARQ or on FEC; there can be a hybrid of the two approaches as well.
Codes for Error Control (FEC)
Types of Channels
Random Error Channels: are memoryless channels; the noise affects each transmitted
symbol independently. Example: deep space and satellite channels, most line-of-sight
transmission.
Burst Error Channels: are channels with memory. Example: fading channels (the
channel is in a “bad state” when a deep fade occurs, which is caused by multipath
transmission) and magnetic recordings subject to dropouts caused by surface defects and
dust particles.
Compound Channels: both types of errors are encountered.
Ch 4 Error Detection and Correction
Encoding and Decoding Procedure
[Block diagram: at the transmitter, Source Encoder → ECC Encoder → DTC Encoder → Channel; at the receiver, DTC Decoder → ECC Decoder → Source Decoder.]
§4.1 Error Detection and Correction Capability
Definition
A code can be characterized in terms of its amount of error detection capability and error
correction capability. The Error Detection Capability is the ability of the decoder to
tell if an error has been made in transmission. The Error Correction Capability is the
ability of the decoder to tell which bits are in error.
Binary Code, M = {0,1}: the channel encoder maps each k-bit message into a coded
sequence c = (c0, ..., c(n-1)) of n bits, n > k.
Assumptions:
- independent bits
- each message is equally probable: 2^k equally likely messages, of k bits each
- r = n-k redundant bits
Thus, the entropy rate of the code word is R = k/n; this is also called the Code Rate.
For every pair of code words ci, cj ∈ C, i ≠ j, dH(ci, cj) is the Hamming distance between
the two code words, defined as the number of bits in which the two code words differ.
There is at least one pair of code words for which the distance is the least; this is called
the Minimum Hamming Distance of the code, dmin.
Example 4.1 (Repetition Code): Given the encoding rule
G(0) → 000
G(1) → 111
i.e., only two valid code words, find its code rate and Hamming weight.
Hamming Weight wH of a code word is defined as the number of “1” bits in the code
word (the Hamming distance between the code word and the zero code word).
• Message: block of k bits
• Code Word: block of n bits
• Only 2^k out of the 2^n n-bit words are used as code words.
There is a one-to-one correspondence m = (m0, ..., m(k-1)) → c between messages and
code words, where G is the encoding rule.
Example 4.1 (cont’d) : For the received words in the 1st column of the Table below,
determine their source words.
Decision: based on the minimum Hamming distance between the received word and
the code words.
• The code corrects 1 error (dH = 1), but does not simultaneously detect a 2-bit
error; moreover, a 2-bit error pattern is miscorrected.
• The code detects up to two bits in error (3 bits in error lead to the other code word;
dmin between the two code words is 3).
Received Word | Decoded Word | Error Flag
000           | 0            | no
001           | 0            | yes
010           | 0            | yes
100           | 0            | yes
011           | 1            | yes
101           | 1            | yes
110           | 1            | yes
111           | 1            | no
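The minimum-Hamming-distance decision of Example 4.1 can be sketched in a few lines of Python (an illustrative implementation, not part of the notes):

```python
def hamming_distance(a, b):
    # Number of positions in which two words differ.
    return sum(x != y for x, y in zip(a, b))

CODEBOOK = {(0, 0, 0): 0, (1, 1, 1): 1}  # (3,1) repetition code

def decode(received):
    # Decide on the code word at minimum Hamming distance from the
    # received word; flag any word that is not itself a code word.
    best = min(CODEBOOK, key=lambda c: hamming_distance(c, received))
    error_flag = received != best
    return CODEBOOK[best], error_flag

print(decode((0, 1, 0)))  # → (0, True): single error corrected and flagged
print(decode((0, 1, 1)))  # → (1, True): a 2-bit error is miscorrected to 1
```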
Example 4.2 (Repetition Code): Given the encoding rule
G(0) → 0000
G(1) → 1111
find the decoded words for the received words in the table below.
n = 4, k = 1, r = 3, dmin = 4, R=1/4
• Correct 1 error (dH =1) and Detect 2 errors (dH=2)
• An error of 3 or 4 bits will be miscorrected.
Received Word | Decoded Word
0000          | 0
0001          | 0
0010          | 0
0100          | 0
1000          | 0
0011          | error detected
0101          | error detected
0110          | error detected
1001          | error detected
1010          | error detected
1100          | error detected
0111          | 1
1011          | 1
1101          | 1
1110          | 1
1111          | 1
Hamming Distance and Code Capability
1. Detect Up to t Errors IF AND ONLY IF dmin ≥ t + 1.
Example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. This code detects up to t = 2 errors.
2. Correct Up to t Errors IF AND ONLY IF dmin ≥ 2t + 1.
Example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. This code corrects t = 1 error.
3. Detect Up to td Errors and Correct Up to tc Errors (td ≥ tc) IF AND ONLY IF dmin ≥ 2tc + 1 and dmin ≥ tc + td + 1.
Example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3. This code cannot simultaneously correct (tc = 1) and detect (td = 2) errors.
Number of Redundant Bits
The minimum Hamming distance is related to the number of redundant bits, r, by
dmin ≤ r + 1.
This gives the lower limit on the number of redundant bits for a certain minimum
Hamming distance (a certain detection and correction capability), and it is called the
Singleton Bound.
For example: Repetition Code, n = 3, k = 1, r = 2, dmin = 3 = r + 1, which meets the
bound; see its error detection and correction capabilities as previously discussed.
§4.2 Linear Block Codes
Definition
Linear block codes can be mathematically treated using the mathematics of vector
spaces.
Linear Block Codes are Binary (we deal here only with such codes) or Non-Binary
(e.g., Reed-Solomon).
The Galois field GF(2) has two elements, i.e., A = {0,1} = GF(2), with operations
(A, +, ·), where + is exclusive-OR and · is AND (digital logic):

+ | 0 1        · | 0 1
--+-----       --+-----
0 | 0 1        0 | 0 0
1 | 1 0        1 | 0 1
The vector space (A^n, +, ·), with vector addition and scalar multiplication, is the set of
all n-tuples a = (a0, ..., a(n-1)) with each ai ∈ A.
The set of code words, C, is a subset of A^n. It is a subspace (2^k elements); any subspace
is also a vector space. If the sum of any two code words is also a code word, such a code
is called a Linear Code.
Consequence: the all-zero vector is a code word (because c1 + c1 = 0 ∈ C).
Linear Independence: code words c0, ..., c(k-1) are linearly independent if and only if
a0c0 + a1c1 + ... + a(k-1)c(k-1) = 0 implies a0 = a1 = ... = a(k-1) = 0;
such vectors are Basis Vectors. If they are linearly independent and every c ∈ C can be
uniquely written as c = a0c0 + a1c1 + ... + a(k-1)c(k-1), then the Dimension of the vector
space is defined as the number of basis vectors it takes to describe (span) it.
Generating Code Word
Question: how do we generate a code word?

c = mG

where c = (c0, ..., c(n-1)) is the 1 × n code word, m = (m0, ..., m(k-1)) is the 1 × k
message, and G is the k × n generator matrix with rows g0, g1, ..., g(k-1). The code word
is a linear combination of the rows of G; the k rows form a basis, so they must be linearly
independent. All the rows of G are code words!
Example 4.3: For the linear block code with n = 7, k = 4, r = 3, generated by

G =
1 0 0 0 1 0 1
0 1 0 0 1 1 1
0 0 1 0 1 1 0
0 0 0 1 0 1 1

find all the code words: (c0 c1 c2 c3 c4 c5 c6) = (m0 m1 m2 m3)G.
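A short Python sketch (illustrative) that enumerates all 2^4 = 16 code words of Example 4.3 from c = mG and confirms that the minimum distance equals the minimum nonzero weight, dmin = 3:

```python
from itertools import product

# Generator matrix of Example 4.3 (n = 7, k = 4).
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]

def encode(m):
    # c = mG over GF(2): XOR of the rows of G selected by the message bits.
    c = [0] * 7
    for bit, row in zip(m, G):
        if bit:
            c = [x ^ y for x, y in zip(c, row)]
    return c

codewords = [encode(m) for m in product([0, 1], repeat=4)]
# For a linear code, dmin equals the minimum weight of the nonzero code words.
dmin = min(sum(c) for c in codewords if any(c))
print(len(codewords), dmin)  # 16 codewords, dmin = 3
```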
§4.2.1 Linear Systematic Block Codes
Definition
If the generator matrix can be written as

G = [P | Ik]

where P is a k × (n-k) parity matrix and Ik is the k × k identity matrix, then the linear
block code generated by such a generator matrix is called a Linear Systematic Block
Code. Each n-bit code word consists of a redundant checking part (n-k digits) and the
message information part (k digits).
Example 4.3 (cont'd): n = 7, k = 4, r = 3; design the encoder.

(c0 c1 c2 c3 c4 c5 c6) = (m0 m1 m2 m3)G

Parity-check bits (first r bits):
c0 = m0 + m2 + m3
c1 = m0 + m1 + m2
c2 = m1 + m2 + m3
Information bits (last k bits):
c3 = m0, c4 = m1, c5 = m2, c6 = m3
ENCODING CIRCUIT: each parity-check bit is formed by exclusive-OR (+) gates acting
on the message bits.
§4.2.2 Hamming Weight and Distance
The Hamming Distance dH(c1, c2) of two code words is the number of positions in
which they differ. The Hamming Weight wH(ci) of a code word is the number of
non-zero positions in it. It is clear that dH(c1, c2) = wH(c1 + c2).
In Example 4.3 (n = 7, k = 4, r = 3), determine the Hamming weight for
c1 = (1000001)
c2 = (0010001)
Minimum Hamming Distance
The Minimum Hamming Distance of a linear block code is equal to the Minimum
Hamming Weight of the non-zero code vectors.
In Example 4.3 : n = 7, k = 4, r = 3, dmin = wmin=3
§4.2.3 Error Detection and Correction Capability
Rules
i) Detect Up to t Errors IF AND ONLY IF dmin ≥ t + 1.
ii) Correct Up to t Errors IF AND ONLY IF dmin ≥ 2t + 1.
iii) Detect Up to td Errors and Correct Up to tc Errors IF AND ONLY IF
dmin ≥ 2tc + 1 and dmin ≥ tc + td + 1.
In Example 4.3 (n = 7, k = 4, r = 3): the minimum Hamming distance is 3, and, as such,
the number of errors that can be detected is 2 and the number of errors that can be
corrected is 1. The code does not have the capability to simultaneously detect and correct
errors (see the relations between dmin and the correction/detection capability of a code).
Error Vector
For received vectors, v = c + e, where e is the Error Vector.
No error: e = (0000000). Example: an error at the first bit: e = (1000000).
Parity Check Matrix

GH^T = 0

where G is the k × n generator matrix and H is the (n-k) × n parity check matrix.
For a systematic code in which G = [P | Ik], with P of size k × (n-k),

H = [I(n-k) | P^T].

For a code word c = mG:

cH^T = mGH^T = 0.

In Example 4.3 (n = 7, k = 4, r = 3), find the parity-check matrix. From the generator
matrix G in Example 4.3, writing the code word as (c0 c1 c2 m0 m1 m2 m3), the condition
(c0 c1 c2 m0 m1 m2 m3)H^T = 0 gives the Parity Check Equations:
c0 + m0 + m2 + m3 = 0
c1 + m0 + m1 + m2 = 0
c2 + m1 + m2 + m3 = 0
Syndrome Calculation and Error Detection
The Syndrome is defined as

s = vH^T

where s is 1 × (n-k), v is 1 × n, and H^T is n × (n-k). If s = 0, then v is a code word
(v = c); if s ≠ 0, then v ≠ c.
In Example 4.3 (n = 7, k = 4, r = 3): if the error vector introduces 3 errors in a pattern
equal to a nonzero code word, there is an error, but this error is undetectable! The
minimum Hamming distance for this code is 3, and, as such, a 3-error pattern can lead to
another code word.
Note: When we say that the number of errors that can be detected is 2, we refer to all
error patterns with 2 bits in error. The code is capable of detecting many patterns with
more than 2 errors, but not all!
Question: What is the number of error patterns that can be detected with this code?
Answer: The total number of error patterns is 2^n - 1 (the all-zero vector is not an error!).
However, 2^k - 1 of them lead to code words, which means that they are not detectable.
So, the number of error patterns that are detectable is 2^n - 2^k.
Error Correction Capacity
Likelihood Test
Why and when is the minimum Hamming distance a good decoding rule?
Let c1, c2 be two code words and v be the received word.
If c1 is the actual code word, then the number of errors is t1 = dH(v, c1).
If c2 is the actual code word, then the number of errors is t2 = dH(v, c2).
Which of these two code words is most likely, based on v?
The most likely code word is the one with the greatest probability of occurring jointly
with the received word, i.e., decide c1 if p(v, c1) > p(v, c2). This is called the
Likelihood Ratio Test; equivalently,

ln p(v, c1) - ln p(v, c2) > 0

is the Log-Likelihood Ratio Test.
The joint probabilities can be further written as

p(v, ci) = p(v | ci) p(ci), i = 1, 2.

For the BSC channel (independent errors),

Pr(v | ci) = p^ti (1 - p)^(n - ti), i = 1, 2,

where ti = d(v, ci) is the number of errors that have occurred during the transmission of
code word ci. Since there is one specific error pattern for a given received word, the
binomial coefficient does not appear above.
IF
Condition 1: the code words have the same a priori probability, and
Condition 2: p < 0.5 (p is the crossover probability of the BSC channel),
then, by performing some calculations, one gets that the most likely code word is the one
at minimum Hamming distance from the received word: minimum-distance decoding is
maximum-likelihood decoding.
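The calculation can be made explicit. Under Condition 1 the equal prior probabilities p(c1) = p(c2) cancel, and substituting the BSC likelihood into the log-likelihood ratio test gives

```latex
\ln\frac{p(\mathbf{v},\mathbf{c}_1)}{p(\mathbf{v},\mathbf{c}_2)}
 = \ln\frac{p^{t_1}(1-p)^{\,n-t_1}}{p^{t_2}(1-p)^{\,n-t_2}}
 = (t_2 - t_1)\,\ln\frac{1-p}{p}.
```

By Condition 2, p < 0.5 implies ln((1-p)/p) > 0, so the ratio is positive exactly when t1 < t2: deciding for the code word at minimum Hamming distance from v is the maximum-likelihood decision.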
§4.2.4 Decoding Linear Block Codes
Standard Array Decoder
The simplest, least clever, and often most expensive strategy for implementing error
correction is to look up the received word v in a decoding table that contains all 2^n
possible received words. This is called a standard-array decoder, and the lookup table is
called the Standard Array. The first word in the first column of the standard array is the
zero code word (it also means zero error). If there is no error, the received word is a code
word; the code words are given in the first row of the standard array. For a linear block
code (n, k), the first row contains the 2^k code words, including the zero code word. All
2^n words are contained in the array. Each row contains 2^k words, so the number of
columns is 2^k. The number of rows is then 2^n / 2^k = 2^(n-k) = 2^r. The standard array
for a (7, 4) code can be seen in the table on the next page.
When decoding with the standard array, we identify the column of the array where the
received vector appears. The decoded vector is the vector in the first row of that column.
Each row is called a Coset. The first column contains all correctable error patterns;
these are called Coset Leaders. Decoding is done correctly if and only if the error
pattern caused by the channel is a coset leader (including the zero vector). The
words in each column, except for the first element, which is a code word, are obtained by
adding the coset leader to the code word.
Question: How do we choose the coset leaders?
To minimize the probability of a decoding error, the error patterns that are more likely to
occur for a given channel should be chosen as coset leaders. For a BSC, an error pattern
of smaller weight is more probable than an error pattern of larger weight. Therefore,
when the standard array is formed, each coset leader should be chosen to be a vector of
least weight among the remaining available vectors. Choosing coset leaders this way,
each coset leader has the minimum weight in its coset, and each column contains the
words at minimum distance from the code word that heads the column.
A linear block code is capable of correcting 2^(n-k) error patterns (including the zero error).
Syndrome Decoder
The standard-array decoder becomes slow when the block code length is large. A more
efficient method is the syndrome decoder. The Syndrome Vector is defined as

s = vH^T

where v is the received vector and H is the parity-check matrix. The syndrome is
independent of the code word; it depends only on the error vector (for a specific code):

s = vH^T = (c + e)H^T = eH^T.

All the 2^k n-tuples (n-bit words) of a coset have the same syndrome.
Steps in the Syndrome Decoder
1. For the received word, the syndrome is calculated by s = vH^T.
2. The coset leader e with that syndrome is found.
3. The transmitted code word is obtained by c = v + e.
Example 4.4: Design the syndrome decoder for Example 4.3, in which n = 7, k = 4, r = 3.
For the parity-check matrix in Example 4.3 and the single-bit error pattern:
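A Python sketch of the syndrome decoder (illustrative; it takes G = [I4 | P] as the matrix printed in Example 4.3, so H = [P^T | I3] and the code word carries the message bits first):

```python
# (7,4) code of Example 4.3: G = [I4 | P], hence H = [P^T | I3].
P = [[1, 0, 1],
     [1, 1, 1],
     [1, 1, 0],
     [0, 1, 1]]
H = [[P[j][i] for j in range(4)] + [1 if i == t else 0 for t in range(3)]
     for i in range(3)]

def syndrome(v):
    # s = vH^T over GF(2).
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)

# Syndrome table for all single-bit error patterns (the coset leaders).
table = {syndrome([1 if i == j else 0 for j in range(7)]): i for i in range(7)}

def correct(v):
    s = syndrome(v)
    if s == (0, 0, 0):
        return v                 # no detected error: v is a code word
    v = v[:]
    v[table[s]] ^= 1             # flip the single bit indicated by s
    return v

c = [1, 0, 0, 0, 1, 0, 1]        # a code word (row 1 of G)
r = c[:]; r[2] ^= 1              # introduce one error
print(correct(r) == c)  # True
```

Every single-bit error has a distinct nonzero syndrome (the corresponding column of H), which is why t = 1 error is always corrected.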
§4.2.5 Hamming Codes
Definition
Hamming codes are important linear block codes, used for single-error control in
digital communication and data storage systems. For any integer r ≥ 3, there exists a
Hamming Code with the following parameters:
Code length: n = 2^r - 1
Number of information digits: k = 2^r - 1 - r
Number of parity check digits: n - k = r
Error correction capability: t = 1 (dmin = 3)
The parity-check matrix of a systematic Hamming code has the form H = [Ir | Q], where
the columns of Q are the r-tuples of weight 2 or more.
In Example 4.3:
Code length: n = 2^r - 1 = 7
Number of information digits: k = 2^r - 1 - r = 4
Number of parity check digits: n - k = r = 3
Error correction capability: t = 1 (dmin = 3)
Thus, the code given as example is a Hamming code.
Example 4.5: Construct the parity-check matrix for the (7, 4) systematic Hamming code.
Example 4.6: Write down the generator matrix for the Hamming code of Example
4.5.
Perfect Code
If we form the standard array for the Hamming code of length n = 2^r - 1, the number of
cosets is 2^n / 2^k = 2^r. The zero vector and the n = 2^r - 1 n-tuples of weight 1 can be
used as the coset leaders (1 + n = 2^r coset leaders in all). Such a code is called a
Perfect Code. "PERFECT" does not mean "BEST"!
A Hamming code corrects only error patterns of single error and no others.
Some Theorems on The Relation Between the Parity Check
Matrix and the Weight of Code Words
Theorem 1: For each code word of weight d, there exist d columns of H such that the
vector sum of these columns is equal to the zero vector.
The converse is also true.
Theorem 2: The minimum weight (distance) of a code is equal to the smallest number of
columns of H that sum to 0.
In Example 4.3 (n = 7, k = 4, r = 3): the columns of H are non-zero and distinct. Thus,
no two columns add to zero, and the minimum distance of the code is at least 3. As H
consists of all non-zero r-tuples as its columns, the vector sum of any two columns must
be a column of H, and thus there are three columns whose sum is zero. Hence, the
minimum Hamming distance is exactly 3.
Shortened Hamming Codes
If we delete λ columns of H of a Hamming code, then the dimension of the new parity
check matrix, H', becomes r × (2^r - 1 - λ). Using H' we obtain a Shortened Hamming
Code, with the following parameters:
Code length: n = 2^r - 1 - λ
Number of information digits: k = 2^r - 1 - r - λ
Number of parity check digits: n - k = r
Minimum Hamming distance: dmin ≥ 3
In Example 4.3, we shorten the (7,4) code. We delete from P^T all the columns of even
weight, so that no three remaining columns add to zero (the total weight of any three
remaining columns is odd). However, for a column of weight 3, there are 3 columns in Ir
such that the 4 columns' sum is zero. We can thus conclude that the minimum Hamming
distance of the shortened code is exactly 4. The shortened code is capable of correcting
all error patterns of a single error and detecting all error patterns of double errors: by
shortening the code, the error correction and detection capability is increased.
Ch 5 Cyclic Codes
§5.1 Description of Cyclic Codes
Definition
Cyclic codes are a class of linear block codes that can be implemented with extremely
cost-effective electronic circuits.
Cyclic Shift Property
A cyclic shift of c = (c0 c1 c2 ... c(n-1)) is c(1) = (c(n-1) c0 c1 ... c(n-2)). In general,
the i-th cyclic shift of c can be written as c(i) = (c(n-i) ... c(n-1) c0 ... c(n-i-1)).
A Cyclic Code is a linear block code C such that for every code word c ∈ C, every
cyclic shift of c is also a code word.
Example 5.1: Verify that the (6,2) repetition code
C = {(000000), (111111), (010101), (101010)}
is a cyclic code. A cyclic shift of any of its code vectors results in a vector that is an
element of C; check by yourself.
Example 5.2: Verify that the (5,2) linear block code defined by the generator matrix

G =
1 0 1 1 1
0 1 1 0 1

is not a cyclic code. Its code vectors c = mG are

0 0 0 0 0
1 0 1 1 1
0 1 1 0 1
1 1 0 1 0

The cyclic shift of (10111) is (11011), which is not an element of C. Similarly, the cyclic
shift of (01101) is (10110), which is also not a code word.
Code (or Codeword) Polynomial
There is a one-to-one correspondence between the code word c = (c0 c1 c2 ... c(n-1))
and the Code Polynomial

c(X) = c0 + c1X + c2X^2 + ... + c(n-1)X^(n-1)

of degree (highest exponent of X) n - 1 or less.
Theorem: The non-zero code polynomial of minimum degree in a cyclic code is
unique, and its degree is r = n - k.
An (n, k) cyclic code is completely specified by its non-zero code polynomial of
minimum degree, g(X), called the generator polynomial (degree r).
Theorem 1: A binary polynomial of degree n - 1 or less is a code polynomial if and only
if it is a multiple of g(X):

c(X) = m(X)g(X) = (m0 + m1X + ... + m(k-1)X^(k-1)) g(X)

where m0, ..., m(k-1) are the k information digits to be encoded, m(X) has degree k - 1
or less, and c(X) has degree n - 1 or less.
Theorem 2: The generator polynomial, g(X), of an (n, k) cyclic code is a factor of X^n + 1.
Question: For any n and k, is there an (n, k) cyclic code?
Theorem 3: If g(X) is a polynomial of degree r = n - k and is a factor of X^n + 1, then
g(X) generates an (n, k) cyclic code.
Remark: For n large, X^n + 1 may have many factors of degree n - k. Some of these
polynomials generate good codes, whereas some generate bad codes.
Example 5.3: Determine the factors of X^7 + 1 that can generate (7, 4) cyclic codes.
Since X^7 + 1 = (1 + X)(1 + X + X^3)(1 + X^2 + X^3) and, for a (7,4) code,
r = n - k = 7 - 4 = 3, the generator polynomial can be chosen either as
g(X) = 1 + X + X^3 or g(X) = 1 + X^2 + X^3.
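The factorization of X^7 + 1 can be verified with a few lines of GF(2) polynomial arithmetic (illustrative Python; coefficients are listed lowest degree first):

```python
def polymul(a, b):
    # Multiply GF(2) polynomials given as coefficient lists,
    # lowest degree first; addition of coefficients is XOR.
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                out[i + j] ^= bj
    return out

g1 = [1, 1, 0, 1]   # 1 + X + X^3
g2 = [1, 0, 1, 1]   # 1 + X^2 + X^3
f  = [1, 1]         # 1 + X
print(polymul(polymul(f, g1), g2))  # [1, 0, 0, 0, 0, 0, 0, 1] = 1 + X^7
```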
Systematic Cyclic Code
For a message m(X) = m0 + m1X + ... + m(k-1)X^(k-1), the steps to generate the
systematic cyclic code are:
Step 1: Multiply the message m(X) by X^(n-k).
Step 2: Obtain the remainder b(X) (degree ≤ n - k - 1) from dividing X^(n-k)m(X) by g(X).
Step 3: Combine b(X) and X^(n-k)m(X) to form the code word b(X) + X^(n-k)m(X).
Proof: X^(n-k)m(X) = a(X)g(X) + b(X), where X^(n-k)m(X) has degree ≤ n - 1 and
g(X) has degree n - k. Then b(X) + X^(n-k)m(X) = a(X)g(X) is a multiple of g(X),
hence a code word:

(b0 b1 ... b(n-k-1) m0 m1 ... m(k-1))

with the parity check bits first and the message bits last.
Example 5.4: Find the (7, 4) cyclic code word generated by g(X) = 1 + X + X^3 when
m = (1001), i.e., m(X) = 1 + X^3.
Step 1: Multiply the message m(X) by X^(n-k) = X^3.
Step 2: Obtain the remainder b(X) = X + X^2 from dividing X^3 m(X) by g(X).
Step 3: Combine b(X) and X^3 m(X) to form the systematic code word.
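The three steps can be sketched in Python (illustrative; coefficients are listed lowest degree first):

```python
def polymod(dividend, divisor):
    # Remainder of GF(2) polynomial division (lowest degree first).
    r = dividend[:]
    for i in range(len(r) - 1, len(divisor) - 2, -1):
        if r[i]:
            for j, d in enumerate(divisor):
                r[i - len(divisor) + 1 + j] ^= d
    return r[: len(divisor) - 1]

def encode_systematic(m, g, n):
    r = n - len(m)             # number of parity bits
    shifted = [0] * r + m      # Step 1: X^(n-k) m(X)
    b = polymod(shifted, g)    # Step 2: parity bits b(X)
    return b + m               # Step 3: (b0..b_{r-1}, m0..m_{k-1})

g = [1, 1, 0, 1]               # g(X) = 1 + X + X^3
print(encode_systematic([1, 0, 0, 1], g, 7))  # [0, 1, 1, 1, 0, 0, 1]
```

The result (011 1001) is a multiple of g(X), as Theorem 1 requires.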
The resulting code word is

c = (011 1001)

i.e., the parity-check bits (011) followed by the k = 4 bits of the message (1001).

§5.2 Generator and Parity-check Matrices
Generator Matrix
Let C be an (n, k) cyclic code with Generator Polynomial
g(X) = g0 + g1X + ... + g(n-k)X^(n-k). Then, a code polynomial can be written as
c(X) = m(X)g(X) = m0 g(X) + m1 Xg(X) + ... + m(k-1) X^(k-1)g(X),

which is equivalent to the fact that g(X), Xg(X), ..., X^(k-1)g(X) span C. The k × n
generator matrix therefore has these shifted versions of g(X) as its rows:

G =
g0 g1 g2 ... g(n-k) 0      0  ... 0
0  g0 g1 ... ...    g(n-k) 0  ... 0
...................................
0  0  0  ... 0      g0 g1 ... g(n-k)

with g0 = g(n-k) = 1.
Systematic Generator Matrix
In general, G is not in a systematic form. However, we can bring it into a systematic
form by performing row operations.
Reminder: for a Systematic Code, G = [P | Ik].
Example 5.5: Determine the systematic generator matrix for the (7, 4) cyclic code
generated by g(X) = 1 + X + X^3.

G =
1 1 0 1 0 0 0
0 1 1 0 1 0 0
0 0 1 1 0 1 0
0 0 0 1 1 0 1

Applying the row operations R3 → R3 + R1 and R4 → R4 + R1 + R2 gives the
systematic form

G =
1 1 0 1 0 0 0
0 1 1 0 1 0 0
1 1 1 0 0 1 0
1 0 1 0 0 0 1
[Tables: code words of the (7, 4) cyclic code generated by g(X) = 1 + X + X^3, for the
message (1100) and all other messages, and the code words in systematic form for the
message (0011) and all other messages.]
Parity-check Matrix
We know that

X^n + 1 = g(X)h(X)

where g(X) has degree r = n - k and h(X) has degree k; h(X) is the Parity-check
Polynomial.
Let c = (c0 c1 ... c(n-1)) be a code word, so that c(X) = a(X)g(X) with a(X) of degree
≤ k - 1. Then

c(X)h(X) = a(X)g(X)h(X) = a(X)(X^n + 1) = a(X) + X^n a(X).

Since a(X) has degree ≤ k - 1, the powers X^k, X^(k+1), ..., X^(n-1) do not appear in
a(X) + X^n a(X); i.e., the coefficients of X^k, ..., X^(n-1) in c(X)h(X) must be equal to
zero:

Σ_{i=0}^{k} h_i c_{n-i-j} = 0,  1 ≤ j ≤ n - k,

from which we can set up n - k equations.
The Reciprocal of h(X) is defined as X^k h(X^(-1)). It can be shown that this is a factor
of X^n + 1; thus, it can generate an (n, n-k) cyclic code. The generator matrix of that
(n, n-k) cyclic code is

H =
h(k) h(k-1) ... h0   0    0  ... 0
0    h(k)   ... h1   h0   0  ... 0
..................................
0    0      ... 0    h(k) ... h0

with h0 = h(k) = 1.

As for a linear block code, any code word is orthogonal to every row of H:

cH^T = 0.

H is a Parity Check Matrix of the cyclic code, and h(X) is called the parity polynomial
of the code. A cyclic code is uniquely specified by h(X).
Remark: The polynomial X^k h(X^(-1)) generates the dual code of C, an (n, r) cyclic code.
Example 5.6: Find the dual code generator polynomial for the (7, 4) cyclic code
generated by g(X) = 1 + X + X^3.
Here k = 4 and r = n - k = 3. Dividing, h(X) = (X^7 + 1)/g(X) = 1 + X + X^2 + X^4,
so the reciprocal X^4 h(X^(-1)) = 1 + X^2 + X^3 + X^4 generates the (7, 3) dual code.
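The division (X^7 + 1)/g(X) can be checked in Python (illustrative; coefficients lowest degree first):

```python
def polydiv(dividend, divisor):
    # GF(2) polynomial long division; returns (quotient, remainder).
    r = dividend[:]
    q = [0] * (len(dividend) - len(divisor) + 1)
    for i in range(len(r) - 1, len(divisor) - 2, -1):
        if r[i]:
            q[i - len(divisor) + 1] = 1
            for j, d in enumerate(divisor):
                r[i - len(divisor) + 1 + j] ^= d
    return q, r[: len(divisor) - 1]

x7_plus_1 = [1, 0, 0, 0, 0, 0, 0, 1]
g = [1, 1, 0, 1]                     # g(X) = 1 + X + X^3
h, rem = polydiv(x7_plus_1, g)
print(h, rem)  # [1, 1, 1, 0, 1] (= 1 + X + X^2 + X^4), remainder [0, 0, 0]
```

Reversing h gives the reciprocal 1 + X^2 + X^3 + X^4, the dual code generator found above.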
§5.3 Encoder for Systematic Cyclic Codes
Find the Remainder by Binary Polynomial Division
Recall the 3 steps to generate systematic cyclic codes:
Step 1: Multiply the message m(X) by X^(n-k).
Step 2: Obtain the remainder b(X) from dividing X^(n-k)m(X) by g(X).
Step 3: Combine b(X) and X^(n-k)m(X) to form the systematic codeword.
In the 2nd step,

X^(n-k)m(X) = m0X^(n-k) + m1X^(n-k+1) + ... + m(k-1)X^(n-1)

has degree ≤ n - 1; assuming m(k-1) = 1, the remainder can be found by considering the
calculation of

X^(n-1) / (1 + g1X + ... + g(r-1)X^(r-1) + X^r).

All 3 steps can be accomplished with a division circuit: an (n-k)-stage shift register with
feedback based on g(X). The mechanism of the division process has a simple
implementation for binary polynomials. We assume that the bits are transmitted serially,
with the highest power of X being transmitted first.
We illustrate the mechanism using n = 7 and r = 3, i.e., the (7,4) code.
After the first division cycle, the register contains the remainder

S1 = m(k-1) (g(r-1), ..., g1, g0)^T.

In the next division cycle, the register content is shifted one position and, if the bit
shifted out is 1, g(X) is added modulo 2. In matrix form,

S2 = A S1 + m(k-2) (g(r-1), ..., g1, g0)^T,

where A is the r × r matrix whose first column is (g(r-1), ..., g1, g0)^T, whose
upper-right (r-1) × (r-1) block is the identity I(r-1), and whose last row is otherwise
zero. The process continues in the same way for a total of k cycles (k = 4 here), giving
S3 from S2 and S4 from S3. The process for the term m2X^5 is the same, except that
only k - 1 = 3 cycles are involved; the same is true for each successive term of
X^r m(X), with one less shift for each decrease in the power of X.
For a general (n, k) code, we can represent the long-division process for the remainder
vector as the recursion

S_t = A S_(t-1) + m(k-t) g,  t = 1, 2, ..., k,  S_0 = 0,

where g = (g(r-1), ..., g1, g0)^T and A is the r × r matrix with first column g, the
identity I(r-1) in its upper-right block, and zeros elsewhere in the last row.
Example 5.7: For k = 4 and r = 3, find the remainder vector.
Homework: Write S3 and S4 for the (7,4) code.
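The division register can be simulated directly (illustrative Python, specialized to g(X) = 1 + X + X^3, r = 3; the register state is the running remainder of X^3 m(X) divided by g(X)):

```python
def lfsr_remainder(msg_bits):
    # msg_bits are given lowest degree first (m0, m1, m2, m3) but are
    # shifted in highest power of X first, as in serial transmission.
    s = [0, 0, 0]                    # register stages (s0, s1, s2)
    for bit in reversed(msg_bits):   # m3, m2, m1, m0
        fb = s[2] ^ bit              # bit leaving the register + incoming bit
        s = [fb, s[0] ^ fb, s[1]]    # feedback into stages 0 and 1 (g0 = g1 = 1)
    return s

print(lfsr_remainder([1, 0, 0, 1]))  # [0, 1, 1]: b(X) = X + X^2
```

For m = (1001) the register ends in (0, 1, 1), the same parity bits b(X) = X + X^2 obtained by long division in Example 5.4.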
Encoder Circuit
After obtaining the remainder, perform Step 3: combine b(X) and X^(n-k)m(X) to form
the systematic code word.
In Example 5.7: For k = 4 and r = 3, design the encoder circuit.
X^3 m(X) = m0X^3 + m1X^4 + m2X^5 + m3X^6

[Encoding circuit: three D flip-flop stages with feedback taps g0, g1, g2; the message
bits m3, m2, m1, m0 are shifted in, and the parity-check digits b0, b1, b2 remain in the
register to complete the code word.]

For the general case, the encoder is an (n-k)-stage shift register with feedback
connections determined by the coefficients of g(X).
Homework: Find the encoding circuit for the (7, 4) code generated by g(X) = 1 + X + X^3.
Encoding a cyclic code can also be accomplished by using its parity polynomial

h(X) = 1 + h1X + ... + h(k-1)X^(k-1) + X^k.

Since h(k) = 1, the parity-check equations give

c(n-k-j) = Σ_{i=0}^{k-1} h_i c(n-i-j),  1 ≤ j ≤ n - k.   (1)

This is known as a difference equation.
For a Systematic Code,

c = (c0 c1 ... c(n-k-1) m0 m1 ... m(k-1)),

with the n - k parity check binary digits first and the k information binary digits last.
Given the k info bits, (1) is a rule for determining the n - k parity check digits
c0, c1, ..., c(n-k-1).
The encoder circuit using the parity polynomial is a k-stage shift register with feedback
taps h0 = 1, h1, ..., h(k-1).
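The difference equation (1) can be checked in Python for the (7, 4) code (illustrative; h(X) = 1 + X + X^2 + X^4 from Example 5.6, of which only h0..h3 enter the recursion since h4 = 1 is moved to the left-hand side):

```python
h = [1, 1, 1, 0]   # h0, h1, h2, h3 of h(X) = 1 + X + X^2 + X^4

def parity_from_h(m, n=7, k=4):
    # c = (c0 .. c_{n-k-1}, m0 .. m_{k-1}); fill the parity digits from
    # the top down using c_{n-k-j} = sum_i h_i c_{n-i-j} (mod 2).
    c = [0] * (n - k) + list(m)
    for j in range(1, n - k + 1):
        c[n - k - j] = sum(h[i] * c[n - i - j] for i in range(k)) % 2
    return c[: n - k]

print(parity_from_h([1, 0, 0, 1]))  # [0, 1, 1], matching the division method
```

The recursion reproduces the same parity bits as the g(X)-based division register, as expected since both describe the same code.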
The Encoding Operations can be described in the following steps:
Step1: Initially, Gate 1 is turned on and Gate 2 is turned off. The k information digits, , are shifted into the register and the communication channel simultaneously.
1k0 1 1( ) ... km X m m X m X
Step 2: As soon as the k information bits have entered the shift register, Gate 1 is turned off and Gate 2 is turned on. The first parity-check digit,

c_{n-k-1} = h_0 c_{n-1} + h_1 c_{n-2} + ... + h_{k-1} c_{n-k}

is formed and appears at point P.

Step 3: The register is shifted once. The first parity-check digit is shifted into the channel and into the register. The second parity-check digit, c_{n-k-2}, is formed and appears at point P.

Step 4: Step 3 is repeated until the n-k parity-check digits have been formed and shifted into the channel. Then, Gate 1 is turned on and Gate 2 is turned off. The next message will be shifted into the register.
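The steps above can be sketched directly from the difference equation. A minimal example, assuming the (7,4) code with g(X) = 1 + X + X^3, for which h(X) = (X^7 + 1)/g(X) = 1 + X + X^2 + X^4; the message is a hypothetical example:

```python
def encode_with_h(msg, h, n, k):
    """Systematic encoding via the parity polynomial:
    c[n-k-j] = sum_{i=0}^{k-1} h[i] * c[n-i-j]  (mod 2),  j = 1..n-k."""
    c = [0] * n
    c[n - k:] = msg                      # info bits occupy the k high-order positions
    for j in range(1, n - k + 1):
        c[n - k - j] = sum(h[i] * c[n - i - j] for i in range(k)) % 2
    return c

# (7,4) code: h(X) = 1 + X + X^2 + X^4, listed as coefficients h_0..h_4
h = [1, 1, 1, 0, 1]
msg = [1, 0, 0, 1]                       # m_0..m_3 (example message)
print(encode_with_h(msg, h, 7, 4))       # -> [0, 1, 1, 1, 0, 0, 1]
```

The result agrees with dividing X^3 m(X) by g(X): the first three digits are the parity checks, the last four are the message.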
Remark 1: This is a k-stage shift register.
Remark 2:
If r > k, the k-stage encoding circuit is more economical.
Otherwise, the (n-k)-stage encoding circuit is preferable.
Homework: Find the encoding circuit for the (7,4) code, generated by g(X) = 1 + X + X^3, based on h(X).
§5.4 Syndrome Computation and Error Correction
Definition of Syndrome
Cyclic codes are linear block codes. For a received word v = (v_0 v_1 ... v_{n-1}), the Syndrome is defined as

s = v H^T

We know that c H^T = 0, so if s = v H^T = 0, v is a codeword.

For cyclic codes, the received polynomial (degree ≤ n-1) is

v(X) = v_0 + v_1 X + ... + v_{n-1} X^{n-1}

or

v(X) = a(X) g(X) + s(X)

where g(X) has degree r = n-k and s(X) has degree ≤ r-1. The r = n-k coefficients of s(X) form the syndrome s. s(X) = 0 if and only if v(X) is a code polynomial (a multiple of g(X)).
Syndrome Computation Circuit
s(X) is the remainder of the division v(X) / g(X). It can be computed with a division circuit, which is identical to the (n-k)-stage encoding circuit, except that the received polynomial is shifted into the register from the left end.
The received polynomial is shifted into the register with all stages initially set to zero. As soon as v(X) has been shifted into the register, the contents of the register form the syndrome s(X).

Properties of the Syndrome

Let s(X) be the syndrome of a received polynomial v(X). The remainder s^(1)(X) resulting from dividing X s(X) by the generator polynomial g(X) is the syndrome of v^(1)(X), the cyclic shift of v(X) (for proof, see the definition of the syndrome). The syndrome s^(1)(X) of v^(1)(X) can be obtained by shifting the syndrome register once, with s(X) as the initial content and with the input gate disabled. This is equivalent to dividing X s(X) by g(X).
In general, the remainder s^(i)(X) resulting from dividing X^i s(X) by the generator polynomial g(X) is the syndrome of v^(i)(X), the i-th cyclic shift of v(X). This property is useful in decoding cyclic codes. The syndrome s^(i)(X) of v^(i)(X) can be obtained by shifting the syndrome register i times, with s(X) as the initial content and with the input gate disabled. This is equivalent to dividing X^i s(X) by g(X).
Example 5.8: Find the syndrome circuit for the (7,4) cyclic code generated by g(X) = 1 + X + X^3. Suppose that the received vector is v = (0010110). Calculate the syndrome and compare it with the contents of the shift register after the 7th shift. Show the contents of the shift register with the input gate disabled and comment on the result.

The remainder of v(X) / g(X) is 1 + X^2, and so the syndrome is s(X) = 1 + X^2, or s = (101). For the contents of the shift register, see the next table, which is related to the syndrome circuit.
With the input gate disabled, the syndrome of v^(1)(X) = (0001011) is obtained by shifting the register once, the syndrome of v^(2)(X) = (1000101) is obtained if we shift the register twice, and so on.
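The syndrome computation and the shift property of Example 5.8 can be verified with a short sketch that mirrors the division-circuit behaviour:

```python
def syndrome(v_bits, g):
    """Syndrome s(X) = v(X) mod g(X); bits listed v_0..v_{n-1},
    g an integer bitmask (bit i = coefficient of X^i)."""
    v = sum(b << i for i, b in enumerate(v_bits))
    dlen = g.bit_length()
    while v.bit_length() >= dlen:
        v ^= g << (v.bit_length() - dlen)
    return [(v >> i) & 1 for i in range(dlen - 1)]

g = 0b1011                        # g(X) = 1 + X + X^3
v = [0, 0, 1, 0, 1, 1, 0]         # received vector of Example 5.8
print(syndrome(v, g))             # -> [1, 0, 1], i.e. s(X) = 1 + X^2

# Shift property: the syndrome of the cyclic shift v^(1) = (0001011)
# equals the remainder of X s(X) / g(X)
v1 = [v[-1]] + v[:-1]
print(syndrome(v1, g))            # -> [1, 0, 0], i.e. s^(1)(X) = 1
```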
Let c(X) be the transmitted code polynomial, and let e(X) be the error pattern. The received polynomial is

v(X) = c(X) + e(X)

As c(X) = m(X) g(X) and v(X) = a(X) g(X) + s(X), the syndrome is equal to the remainder of dividing the error pattern by the generator polynomial. The syndrome is computed from the received vector, and the decoder has to estimate the error pattern e(X) based on the syndrome; the error pattern itself is not known at the decoder.
Remark: One can notice that s(X) = 0 if and only if e(X) = 0 or e(X) is a code polynomial (the error pattern is a codeword).
For the latter, the error pattern is undetectable!

Remark: The error-detection circuit is simply a syndrome circuit with an OR gate whose inputs are the syndrome digits. If the syndrome is non-zero, the output of the OR gate is 1, and the presence of errors has been detected.

CYCLIC CODES ARE VERY EFFECTIVE FOR DETECTING ERRORS, RANDOM OR BURST!

Burst Error Patterns

Definition: The Burst Length of an error polynomial e(X) is defined as the number of bits from the first error term in e(X) to the last error term, inclusive.

Example: e(X) = X^3 + X^7 has burst length b = 7 - 3 + 1 = 5.
By definition, there can be only one burst in a block.
Example: e(X) = X^3 + X^7 + X^19 + X^20 has burst length b = 20 - 3 + 1 = 18, and not two bursts of lengths 5 and 2.

Definition: An error pattern with errors confined to i high-order positions and l-i low-order positions is also regarded as a burst of length l. This is called an end-around burst.

Example: e = (0101 00000000 111) is an end-around burst of length 7.
CASE 1: Suppose that e(X) is a burst of length r = n-k or less:

e(X) = X^j B(X),   degree{B(X)} ≤ n-k-1

Because degree{B(X)} < degree{g(X)}, g(X) is not a factor of B(X). Also, X is not a factor of g(X), as g(X) divides X^n + 1. Therefore,

e(X) = X^j B(X) is not divisible by g(X)

or, equivalently, the syndrome caused by e(X) is not equal to zero. The (n,k) cyclic code is capable of detecting any error burst of length n-k or less.
CASE 2: Suppose that e(X) is a burst of length r+1 = n-k+1, starting at the i-th position. Thus, it ends at the (i+n-k)-th position. Errors are confined to e_i, e_{i+1}, ..., e_{i+n-k}, with e_i = e_{i+n-k} = 1. There are 2^{n-k-1} such bursts (the error bits in the first and last positions are 1, and only the n-k+1-2 (i.e., n-k-1) middle positions can take any value, 0 or 1). Among these, only one cannot be detected (zero syndrome), namely

e(X) = X^i g(X)

The fraction of undetectable bursts of length n-k+1 is 2^{-(n-k-1)}.
Error Detection Capability
CASE 3: Suppose that e(X) is a burst of length l > n-k+1 (= r+1). There are 2^{l-2} such bursts (the bits in the first and last positions are 1, and only the l-2 middle positions can take any value, 0 or 1). Among these, the undetectable ones (zero syndrome) must be of the form

e(X) = X^i a(X) g(X)

where a(X) = a_0 + a_1 X + ... + a_{l-(n-k)-1} X^{l-(n-k)-1}, with a_0 = a_{l-(n-k)-1} = 1, so that a(X) g(X) has degree l-1 (g(X) has degree n-k). The number of such bursts is 2^{l-(n-k)-2}, and the fraction of undetectable burst errors of length l is 2^{-(n-k)}.
Example 5.9: Analyze the error-detection capability of the (7,4) cyclic code generated by g(X) = 1 + X + X^3.

The minimum Hamming distance for this code is 3; thus, the code can detect up to 2 random errors (see the relation between d_min and t_d). Also, it detects 2^7 - 2^4 = 112 error patterns. The code can detect any burst errors of length ≤ 3, and it also detects many bursts of length > 3. The fraction of undetectable bursts of length n-k+1 = 4 is 2^{-2} = 1/4. The fraction of undetectable bursts of length greater than 4 is 2^{-3} = 1/8.
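The CASE 2 fraction can be confirmed by a small brute-force check for the (7,4) code with g(X) = 1 + X + X^3, enumerating all (non-wraparound) bursts of length n-k+1 = 4:

```python
from itertools import product

def poly_mod_bits(bits, g):
    """Remainder of e(X) / g(X) over GF(2); bits listed e_0..e_{n-1}."""
    v = sum(b << i for i, b in enumerate(bits))
    dlen = g.bit_length()
    while v.bit_length() >= dlen:
        v ^= g << (v.bit_length() - dlen)
    return v

n, g, burst_len = 7, 0b1011, 4                 # (7,4) code, bursts of length 4
total = undetected = 0
for start in range(n - burst_len + 1):         # non-wraparound bursts only
    for mid in product([0, 1], repeat=burst_len - 2):
        e = [0] * n
        e[start] = e[start + burst_len - 1] = 1  # first/last burst bits are 1
        e[start + 1:start + burst_len - 1] = mid
        total += 1
        if poly_mod_bits(e, g) == 0:           # zero syndrome: undetectable
            undetected += 1
frac = undetected / total
print(frac)                                    # -> 0.25, i.e. 2^-(n-k-1)
```

For each start position, only e(X) = X^i g(X) escapes detection, giving 4 undetectable bursts out of 16.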
Cyclic Redundancy Check (CRC) Codes
CRC codes are error-detecting codes typically used in ARQ systems. A CRC has no error-correction capability, but it can be used in combination with an error-correcting code. The error-control system is then in the form of a concatenated code:

Tx side: CRC Encoder -> Error-Correction Encoder
Rx side: Error-Correction Decoder -> CRC Syndrome Checker
§5.5 Decoding of Cyclic Codes
Decoding Steps
The decoding process consists of three steps, as for the decoding of linear block codes. These are:

i) syndrome computation;
ii) association of the syndrome with an error pattern;
iii) error correction.
Syndrome Computation: The syndrome for cyclic codes can be computed with a
division circuit whose complexity is linearly proportional to the number of parity check
binary digits, i.e., n-k.
Error Correction: The error-correction step simply adds (mod-2) the error pattern to the received vector (an exclusive-OR gate).

The association of the syndrome with an error pattern can be completely specified by a decoding table. A straightforward approach to the design of a decoding circuit is a combinational logic circuit that implements the table look-up procedure. However, the limitation of this approach is that the complexity tends to grow exponentially with the code length and the number of errors to be corrected.
Cyclic Codes have considerable algebraic properties, which allow a low complexity
structure of the encoder. The cyclic structure of a cyclic code allows us to decode a
received vector v(X) serially. The received digits are decoded one at a time, and each
digit is decoded with the same circuitry.
Decoding Circuit (Decoder)
Two Cases
As soon as the syndrome has been computed, the decoding circuit checks whether the syndrome s(X) corresponds to a correctable error pattern

e(X) = e_0 + e_1 X + ... + e_{n-1} X^{n-1}

with an error at the highest-order position X^{n-1}, i.e., e_{n-1} = 1.
CASE I: If s(X) does not correspond to an error pattern with e_{n-1} = 1, the received polynomial and the syndrome register are cyclically shifted once, simultaneously. We obtain v^(1)(X), and the syndrome register forms s^(1)(X), the syndrome of v^(1)(X). Now the second digit, v_{n-2}, becomes the first digit of v^(1)(X). The same decoding circuit checks whether s^(1)(X) corresponds to an error at location X^{n-1}.
CASE II: If s(X) of v(X) does correspond to an error pattern with e_{n-1} = 1, the first received digit v_{n-1} is an erroneous digit, and it must be corrected. The correction is carried out by the sum v_{n-1} + e_{n-1}. This results in a modified received polynomial

v_1(X) = v_0 + v_1 X + ... + v_{n-2} X^{n-2} + (v_{n-1} + e_{n-1}) X^{n-1}

The effect of e_{n-1} on the syndrome is removed from s(X). v_1(X) and the syndrome register are cyclically shifted once, simultaneously. The polynomial which results is

v_1^(1)(X) = (v_{n-1} + e_{n-1}) + v_0 X + ... + v_{n-2} X^{n-1}

Its syndrome, s_1^(1)(X), is the remainder resulting from dividing X [s(X) + X^{n-1}] by the generator polynomial g(X).
Proof

v_1(X) = v(X) + X^{n-1} = a(X) g(X) + s(X) + X^{n-1}    (error correction)

Shifting once:

X v_1(X) = X a(X) g(X) + X s(X) + X^n

Since g(X) divides X^n + 1 (so X^n leaves remainder 1 when divided by g(X)), the remainder of dividing X v_1(X) by g(X) equals the remainder of dividing X [s(X) + X^{n-1}] by g(X), which is s_1^(1)(X).
Therefore, if a 1 is added to the left end of the syndrome register while it is shifted, we obtain s_1^(1)(X). The decoding circuitry proceeds to decode v_{n-2}. Whenever an error is detected and corrected, its effect is removed from the syndrome.
Remarks:
The decoding stops after n shifts (= total number of binary bits in a received
word).
If e(X) is a correctable error pattern, the contents of the syndrome register are zero at the end of the decoding operation, and the received vector has been correctly decoded. Otherwise, an uncorrectable error pattern has been detected.
This decoder applies in principle to any (n, k) cyclic code.
But whether it is practical depends entirely on its error-pattern detection circuit.
In some cases this is a simple circuit.
Design Decoder
Example 5.10: Design the decoder for the (7,4) cyclic code generated by g(X) = 1 + X + X^3, which has d_min = 3.
It is capable of correcting any single error over a block of 7 bits. There are 7 such error
patterns. These and the all-zero vector form all the coset leaders of the decoding table.
They form all correctable error patterns. Suppose that the received polynomial,
60 1 6( ) ...v X v v X v X
is shifted into the syndrome register from the left end.
Write syndrome and error patterns in Table 1 on next page.
We see that e(X) = X^6 is the only error pattern with an error located at X^6. When this error pattern occurs, the syndrome in the syndrome register is (101) after the entire v(X) has entered the syndrome register. The detection of this syndrome indicates that v_6 is an erroneous digit and must be corrected.
Suppose that the single error occurs at location X^i, i.e., e(X) = X^i, 0 ≤ i ≤ 6. After the entire received polynomial has been shifted into the syndrome register, the syndrome in the register will not be (101). However, after another 6-i shifts, the contents of the syndrome register will be (101), and the next received digit to come out of the register will be the erroneous digit. Only the syndrome (101) needs to be detected.
We use a 3-input AND gate.
In the sequel, we give an example of the decoding process when the codeword c = (1001011), c(X) = 1 + X^3 + X^5 + X^6, is transmitted and v = (1011011), v(X) = 1 + X^2 + X^3 + X^5 + X^6, is received. A single error occurs at location X^2.

When the entire received polynomial has been shifted into the syndrome and buffer registers, the syndrome register contains (001). We see that after 4 more shifts, the content of the syndrome register is (101), and the next digit to come out of the buffer is the erroneous digit, v_2.
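The serial decoding procedure of this example can be sketched in software; the loop below mimics the decoder's shifting of the syndrome register until the pattern (101), the syndrome of an error at X^{n-1}, is detected (single-error correction only):

```python
def poly_mod(v, g):
    """Remainder of GF(2) polynomial division on integer bitmasks."""
    dlen = g.bit_length()
    while v.bit_length() >= dlen:
        v ^= g << (v.bit_length() - dlen)
    return v

def decode_single_error(v_bits, g, n):
    """Serial decoding for a single-error-correcting cyclic code:
    shift the syndrome until it matches the syndrome of an error at X^(n-1)."""
    v = sum(b << i for i, b in enumerate(v_bits))
    s = poly_mod(v, g)
    target = poly_mod(1 << (n - 1), g)      # syndrome of e(X) = X^(n-1): (101) here
    for shift in range(n):
        if s == target:                     # error located at position n-1-shift
            return [b ^ (1 if i == n - 1 - shift else 0)
                    for i, b in enumerate(v_bits)]
        s = poly_mod(s << 1, g)             # one more shift: X*s(X) mod g(X)
    return list(v_bits)                     # no correctable single error found

g = 0b1011                                  # g(X) = 1 + X + X^3
v = [1, 0, 1, 1, 0, 1, 1]                   # received word of the example
print(decode_single_error(v, g, 7))         # -> [1, 0, 0, 1, 0, 1, 1]
```

The match occurs after 4 extra shifts, so position 6 - 4 = 2 is flipped, recovering c = (1001011).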
Ch 6 Convolutional Codes
§6.1 Description of Convolutional Codes
Compare with Linear Block Code
Convolutional codes are the second major form of error-correcting channel codes. They
differ from the linear block codes in both structural form and error correcting properties.
With linear block codes, the data stream is divided into a number of blocks of k binary
digits, each block is encoded into an n-bit code word. On the other hand, Convolutional
Codes convert the entire data stream into a single code word.
The code rate of linear block codes can be ≥ 0.95, but they have limited error-correction capabilities. For convolutional codes, the code rate is usually below 0.9, but they have more powerful error-correcting capabilities and are good for very noisy channels with high raw error probabilities. Puncturing is used to achieve higher code rates.
Encoding
The source data is broken into frames of k0 bits per frame. M+1 frames of source data are coded into an n0-bit code frame, where M is the Memory Depth of the shift register. Convolutional codes are encoded using shift registers: as each new data frame is read in, the old data is shifted one frame to the right, and a new code word is calculated.
Characteristics of the Code: Code Rate R = k0/n0, Constraint Length ν = M + 1.

For binary convolutional codes: k0 = 1.
Example 6.1: For the R = 1/2, ν = 3 binary convolutional encoder below, determine its code polynomials.

Here k0 = 1 (binary), n0 = 2, M = 2.
For each 1-bit (k0) frame of the input message m(X), we obtain 2-bit (n0) code frame on
the output with one bit in c0(X) and one in c1(X). These are interleaved and sent as a two-
bit symbol sequence.
We can associate two code polynomials,

c_0(X) = m(X) g_0(X)
c_1(X) = m(X) g_1(X)

The vector corresponding to the output is

C(X) = [c_0(X) c_1(X)] = m(X) [g_0(X) g_1(X)] = m(X) G(X)

For example, if the message is m(X) = 1 + X + X^3.
Then, with the encoder's generator polynomials g_0(X) = 1 + X + X^2 and g_1(X) = 1 + X^2 (octal 7 and 5, as in the punctured-code examples later),

c_0(X) = 1 + X^4 + X^5,   c_1(X) = 1 + X + X^2 + X^5

Let us assume that the highest power of X is the first symbol transmitted, and that we first send c_0, and then c_1. Thus, the transmitted sequence is

c_0(0) c_1(0) c_0(1) c_1(1) ... c_0(t) c_1(t) = 11 10 00 01 01 11
You can also input the message to the encoder directly to verify the result. The message
has 4 bits, i.e., ( 1 0 1 1), but the transmitted sequence contains 12 transmitted bits.
Therefore, the Code Rate is 4/12=1/3, not 1/2 !
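The encoding can be reproduced by polynomial multiplication over GF(2). The generator polynomials g_0(X) = 1 + X + X^2 and g_1(X) = 1 + X^2 are an assumption taken from the octal pair (7,5) used in the punctured-code examples later; if the figure's encoder differs, change them accordingly:

```python
def conv_encode(msg_bits, g0, g1):
    """Rate-1/2 convolutional encoding as GF(2) polynomial multiplication;
    polynomials are integer bitmasks (bit i = coefficient of X^i)."""
    m = sum(b << i for i, b in enumerate(msg_bits))
    def mul(a, b):
        out = 0
        while b:
            if b & 1:
                out ^= a
            a, b = a << 1, b >> 1
        return out
    c0, c1 = mul(m, g0), mul(m, g1)
    deg = max(c0.bit_length(), c1.bit_length()) - 1
    # highest power first, c0 before c1 at each time step
    return [((c >> t) & 1) for t in range(deg, -1, -1) for c in (c0, c1)]

# m(X) = 1 + X + X^3; assumed generators g0 = 1+X+X^2 (7), g1 = 1+X^2 (5)
bits = conv_encode([1, 1, 0, 1], 0b111, 0b101)
print(bits)       # 12 code bits, grouped in pairs: 11 10 00 01 01 11
```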
Effective Code Rate
In Example 6.1 the code rate is 1/3; the explanation is that the encoder has M = 2 memory elements and has to "flush" its buffer to complete the code sequence. The last two code symbols in the transmitted sequence, 01 and 11, correspond to emptying the encoder's shift register. The first 8 bits correspond to the 4 message bits at rate 1/2, so the Effective Code Rate is 1/3. This reduction in the code rate is known as the Fractional Rate Loss.
For a convolutional code with rate R, K information bits, and memory depth M, the Effective Code Rate is

R_eff = R K / (K + M)

When K >> M, the effective code rate approaches the code rate.
Memory Depth and Constraint Length
For rate-R convolutional codes, the generator vector is defined as

G(X) = [g_0(X) g_1(X) ... g_{n0-1}(X)]

and the vector of the code polynomials as

C(X) = m(X) G(X)

Convolutional codes are LINEAR, as the sum of two code polynomials is a code polynomial. There is a strong similarity with cyclic codes, and convolutional codes have some properties of cyclic codes.

Memory Depth: M. Constraint Length: ν = M + 1 (sometimes it is defined as M).

For a given code rate, if ν increases, a better error-rate performance is obtained, at the expense of increased decoder complexity.
§6.2 Structure Properties of Convolutional Codes
State Diagram
The convolutional encoder is a "state machine" (it is convenient to represent its operation using a State Diagram). With M memory elements, it has 2^M states.
Example 6.2: Find the state diagram for the encoder in Example 6.1.
With M = 2, we associate the 2^M = 4 states with the contents of the shift register. The state diagram, with branches labeled input/output, is used to analyze the performance of a convolutional code.
Trellis Diagram
A trellis diagram uses the states at successive time steps to analyze the performance of a convolutional code.
Adversary Paths
The error-correcting property of a convolutional code is determined by the adversary
paths through the trellis. Adversary Paths: the paths that begin in the same state and
end in the same state, and have no state in common at any step between the initial and
final states.
Adversary Paths and Hamming Distance
For the following paths,

P_0: S0 -> S0 -> S0 -> S0 -> S0
P_1: S0 -> S1 -> S2 -> S0
P_2: S0 -> S1 -> S3 -> S2 -> S0
P_3: S0 -> S0 -> S1 -> S2 -> S0

P_0, P_1 are adversaries from time index 0;
P_2, P_3 are adversaries from time index 0;
P_0, P_3 are adversaries from time index 1;
P_0, P_2 are adversaries from time index 0.

Hamming Distance
Performance is based on the Hamming distance d_H(c_i, c_j) between the code sequences of the adversary paths in the trellis. As we can see in this simple example, the number of adversary paths grows quickly, and we may wonder how to handle the combinatorics involved. The trellis path analysis is simplified in the case of linear codes: there, the Hamming distance between two code sequences in the trellis is equivalent to the Hamming distance between some code word and the all-zero code sequence.
Transfer function
This information can be found using transfer function. We will show only the non-zero
adversary paths which begin and end in state S0. We modify the state diagram by
removing the self loop at the S0 state, and adding a new node S0, representing the
termination of the non-zero adversary path.
Transfer Function Operators

Consider the transition S0 -> S1, labeled 1 | 11: a source symbol of weight 1 produces a code symbol of weight 2.

Source Symbol Weight Operator: N
Code Symbol Weight Operator: D
Time Index Operator: J

For this transition, the operator is D^2 N J (the exponent of D or N is the number of "1" bits).
Example 6.3: Write the transfer operators for each branch of the state diagram.
Results are:
We can solve for the transfer function of all possible paths starting at S0 and ending at S0 by writing a set of state equations for the transfer-function diagram, with node variables X_0 and X_{0e} for the beginning and ending state S0, respectively. The transfer function T(J, N, D) is found by solving this set of equations for X_{0e}, with X_0 = 1, using linear algebra.
To see the individual adversary paths, apply long division:

T(J, N, D) = D^5 N J^3 / [1 - D N J (1 + J)]

T(J, N, D) = D^5 N J^3 + D^6 N^2 J^4 (1 + J) + ...

Proof: Check that T(J, N, D) [1 - D N J (1 + J)] = D^5 N J^3.
The transfer function supplies us with all the information we need to completely characterize the structure and performance of the code. For example, the term D^6 N^2 J^4 (1 + J) shows that there are exactly two paths of Hamming weight 6, and both paths involve source sequences of Hamming weight 2. One is reached in 4 transitions, the other in 5. With this information, the two paths are found as

S0 -> S1 -> S3 -> S2 -> S0
S0 -> S1 -> S2 -> S1 -> S2 -> S0
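Setting N = J = 1 in the transfer function above gives T(D) = D^5 / (1 - 2D); expanding this series numerically reproduces the path counts (one path of weight 5, two of weight 6, and so on), via the fixed-point recursion T = D^5 + 2 D T:

```python
from collections import defaultdict

max_deg = 9
T = defaultdict(int)
T[5] = 1                       # start from the leading term D^5
for _ in range(max_deg):       # iterate T <- D^5 + 2 D T, truncated at max_deg
    newT = defaultdict(int)
    newT[5] = 1
    for d, c in T.items():
        if d + 1 <= max_deg:
            newT[d + 1] += 2 * c
    T = newT
print(sorted(T.items()))       # -> [(5, 1), (6, 2), (7, 4), (8, 8), (9, 16)]
```

The coefficient 2 of D^6 matches the two weight-6 paths listed above.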
§6.3 Decoding Methods
Viterbi Algorithm
Convolutional codes are employed when significant error correction capability is required.
In such cases, the decoding cannot be carried out using syndrome method and shift
register circuits, but a more powerful method is needed. Such a method was introduced
by Viterbi (1965) and quickly became known as the Viterbi algorithm. The Viterbi
Algorithm is of major practical importance, and we will introduce it primarily by means
of examples.
We have seen that a convolutional code with constraint length ν = M + 1 has 2^M states in its trellis. One way to view the Viterbi decoder is to construct it as a network of simple, identical processors, with one processor for each state in the trellis.

For example, for ν = 3 (M = 2), it needs 2^2 = 4 states, and hence 4 node processors.
Example of node processor: It receives inputs from the node processors S0 and S2, and
supplies outputs for node processors S0 and S1.
Each processor does the following: 1) It monitors the received code sequence, y(X), which can be written as y(X) = c(X) + e(X), and calculates a number (likelihood
metric) that is related to the probability that the received sequence arises from a
transmitted sequence. The likelihood metric is the accumulated Hamming distance
between the received sequence and expected transmitted sequence. The larger the
distance, the less likely it is that this processor is decoding the true transmitted message.
2) Each processor must supply, as an output, its likelihood metric to each node processor
connected to its output side. 3) For each of its input paths, the node processor must
calculate the Hamming distance between the n-bit code symbol y and the n-bit code
symbol it should have received if the path of the transmitted message had just made a
transition (likelihood update). It adds the likelihood update to the likelihood supplied
to it by the source node processor. It selects the path associated with the input-side
processor having the smallest accumulated Hamming distance (the most likely path).
4) Based on which path is selected, the processor must decode the message associated
with the selected path and update a record (called Survivor Path Register) of all of the
decoded message bits associated with the selected path.
Survivor Path Register Method of Viterbi Decoding
Example 6.4: Assume that we have the convolutional code discussed as in Example 6.1.
At time t, assume that the processors have the following initial conditions:
Assume that the received code-word symbol at time t is y=11. Find the resulting
likelihoods and survivor path registers for each of the node processors at time t+1.
Node Processor S_i   Likelihood Metric Λ_i   Survivor Path Register
S0                   3                       000100xxxxx
S1                   3                       111001xxxxx
S2                   1                       101110xxxxx
S3                   2                       111011xxxxx
Write down Trellis Diagram (see example discussed earlier)
For node S0, with y = 11:
from S0 (branch 0/00): d(11, 00) = 2, so the candidate metric is 3 + 2 = 5;
from S2 (branch 0/11): d(11, 11) = 0, so the candidate metric is 1 + 0 = 1.

Thus, processor S0 selects the transition S2 -> S0 as the most likely transition. The resulting register for S0 becomes 1011100xxxx and the new likelihood becomes 1.

Now, let's look at node S1:
For node S1, with y = 11:
from S0 (branch 1/11): d(11, 11) = 0, so the candidate metric is 3 + 0 = 3;
from S2 (branch 1/00): d(11, 00) = 2, so the candidate metric is 1 + 2 = 3.

The likelihoods are tied. The node processor has no statistical way to choose between the paths. It resolves this dilemma by "tossing a coin". Let's say that the path from S2 "wins the toss". The survivor path register becomes 1011101xxxx and the new likelihood metric becomes 3.
The same procedure applies for S2 and S3 node processors.
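One update step of the node processors can be sketched as follows. The trellis is built for the assumed generators g0 = 1 + X + X^2 and g1 = 1 + X^2 (the (7,5) pair used later), and ties are broken by enumeration order rather than a coin toss:

```python
def viterbi_step(metrics, paths, trellis, y):
    """One Viterbi update: each state keeps the incoming transition with the
    smallest accumulated Hamming distance (ties broken by enumeration order)."""
    new_metrics, new_paths = {}, {}
    for (src, bit), (dst, out) in trellis.items():
        d = sum(a != b for a, b in zip(out, y))      # likelihood update
        cand = metrics[src] + d
        if dst not in new_metrics or cand < new_metrics[dst]:
            new_metrics[dst] = cand
            new_paths[dst] = paths[src] + [bit]
    return new_metrics, new_paths

# Trellis for the assumed encoder; state = last two input bits,
# transition (state, input) -> (next state, 2-bit output).
trellis = {}
for s in range(4):
    for b in (0, 1):
        m2, m1 = (s >> 1) & 1, s & 1           # shift-register contents
        out = (b ^ m1 ^ m2, b ^ m2)            # g0 = 1+X+X^2, g1 = 1+X^2
        trellis[(s, b)] = (((s << 1) | b) & 3, out)

metrics = {0: 3, 1: 3, 2: 1, 3: 2}             # initial likelihoods of Example 6.4
paths = {s: [] for s in range(4)}
metrics, paths = viterbi_step(metrics, paths, trellis, (1, 1))
print(metrics[0])    # -> 1: the S2 -> S0 transition wins, as computed above
```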
Example 6.5: For the convolutional code discussed in Example 6.1, assume that it is known that the encoder's initial state is S0. Decode the received sequence 10 10 00 01 10 01.

Since we know the initial state, we initialize the likelihood of S0 to zero and the other likelihoods to some large number (actually, any large numbers will do).
Above is the result of applying the Viterbi algorithm. The solid lines are the Selected Paths, the dashed lines are Rejected Paths; T = tied path, shown above the branches. The accumulated Hamming distances are indicated below each node. The first two steps are easy, since we know S0 always wins (the other metrics are large). The results of steps 3-6 are:
t = 3:   S0: 000xxx   S1: 101xxx   S2: 010xxx   S3: 011xxx
t = 4:   S0: 0000xx   S1: 0001xx   S2: 0110xx   S3: 1011xx
t = 5:   S0: 01100x   S1: 00001x   S2: 10110x   S3: 10111x
t = 6:   S0: 101100   S1: 101101   S2: 101110   S3: 101111
After the 3rd step, we cannot decide on the correct decoding of even the 1st bit (since the 4 path registers disagree on what this bit should be). By the 6th step, all 4 survivor registers agree on the first 4 decoded bits. Why? If you trace back from t = 6, all surviving paths join together at t = 4. However, note the tie! The result depends on how we break the tie!
After the algorithm has a chance to observe a sufficient number of received symbols, it is
able to use the sequence of information to pick the globally most likely transmitted
sequence.
Notice that the path selection for the first 4 steps through the trellis cannot be changed by
any further decisions the node processors may make. This is because all the node
processors now agree on the first four steps. Received: 10 10 00 01 10 01
Most Likely: 11 10 00 01 ?? ??
In any practical implementation of the Viterbi algorithm, we must use a finite number of
bits for the survivor path register. This is called the Decoding Depth.
If we use too few bits, the performance of the algorithm will be hurt by having to force the
decoding decisions when we run out of decision bits. In such case, the “most likely” bits
are those that lead to the best likelihood metric. Most of the time this will result in correct
decoding, but sometimes it will not. An erroneous decision is called Truncation Error.
How many bits of decoding depth are required to make the probability of
truncation error negligible?
Forney (1970) gave the answer to this question: about 5.8 times the number of bits in the encoder's shift register.
Practical Implementation for Long Code Sequence With Large
Number of Errors
When the number of errors is large (this is why we use convolutional codes), the arithmetic circuits can run out of bits for representing the likelihoods. We should notice that all node decisions are relative decisions. A strategy for dealing with arithmetic overflow is to occasionally subtract the value of the lowest likelihood from each node
processor’s likelihood. This leaves the relative likelihoods unchanged, while limiting the
range of the likelihood number each node processor must be able to express.
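The normalization strategy can be sketched in a few lines:

```python
def normalize(metrics):
    """Subtract the smallest likelihood from every node's metric; relative
    decisions are unchanged while the numeric range stays bounded."""
    m = min(metrics.values())
    return {s: v - m for s, v in metrics.items()}

print(normalize({0: 7, 1: 9, 2: 7, 3: 12}))   # -> {0: 0, 1: 2, 2: 0, 3: 5}
```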
The Traceback Method of Viterbi Decoding
Each node manages a path survivor register, in which the node processor’s best
estimate is stored at each moment. This method is easy to understand, but it is not effective for keeping track of the decoded message when a high-speed decoder is required.
The survivor path registers must be interconnected to permit parallel transfer. This
interconnection is very costly to implement. The Traceback Method is an alternative
way of keeping track of the decoded message sequence. This method is very popular, as
its implementation in integrated circuits is more cost effective. The method exploits a
priori information that the decoder has about the trellis structure of the code.
Basic Idea: For example, take M = 2 and consider the content of the register in state S2.
The "1" which appears at moment t on the output was actually applied at moment t-2 on the input. We can use the content of the last delay in the register to decode the message,
but there is a delay of M clock cycles. Since we have already a delay of at least 5.8M
clock cycles to avoid the truncation errors in the Viterbi algorithm, this additional
decoding delay is a small price to pay for obtaining a lower cost hardware solution.
Instead of transferring the contents of the survivor register, each node processor is
assigned a unique register in which we store a single bit. This is the last bit of the
state picked by that node processor as survivor path (in the previous example this is
“1”). As we deal with binary codes, each node has two inputs (two path choices). The
bit that can be chosen is different for the two possible paths (see the trellis diagram).
This will always be true with the state-naming convention we are using.
Trellis Diagram

Only the surviving path decisions are shown at each time step. The solid line is the survivor path agreed on by all four node processors at the last time step shown in the figure below.
The entries into each node processor’s traceback (i.e., survivor path) register at each
trellis step are shown in the figure. The traceback process is also illustrated. It begins at
the far right side of the figure and proceeds backwards in time. Once the traceback is
completed, the decoded bit sequence is read from left to right. The path traces back to
state “00” (S0). Whatever else may have happened during the time prior to the start of
the figure, we know that the last 2 bits leading into state “00” must have been “0,0”, so,
the decoded message sequence corresponding to the solid line must be . The
last two message bits, corresponding to the final 2 steps through the trellis have not
been decoded yet (due to the extra decoding lag mentioned above).
§6.4 Approaches to Increase Code Rate
Code Rate of Convolutional Codes
Let d_0, d_1, d_2, ... be the set of all possible Hamming distances between adversaries in the transfer function of the convolutional code, such that

d_0 ≤ d_1 ≤ d_2 ≤ ...

The minimum distance d_0 is called the minimum free distance, d_f. The performance of convolutional codes is determined by the minimum free distance. Convolutional codes provide very powerful error-correction capability, at the price of a low code rate.
Using Nonbinary Convolutional Codes
So far we have been looking only at convolutional codes with rate 1/n0 ( low R ). If the
source frame is increased to some k0 >1, we can achieve a rate k0/n0 convolutional code.
Example 6.6: Find the code rate and Trellis diagram for the 2-source frame encoder
shown on next page.
R = ; df = 3, it is a 4-ary code.
In Example 6.6, the number of inputs to each trellis node processor is equal to 4. (Disadvantage!) In general, a k0/n0 convolutional code requires each node processor to deal with 2^{k0} input paths, so the complexity of the Viterbi decoder increases geometrically with k0. This is a severe problem, and non-binary convolutional codes are unpopular.
Using Punctured Convolutional Codes
An alternative way to increase the code rate is puncturing. We start with a 1/n0 convolutional code, such as a 1/2-rate code, whose transmitted code word is (c0 c1). We delete one of the code bits in every second code symbol. Thus, the code sequence is

c0(0) c1(0)  c0(1)  c0(2) c1(2)  c0(3)  ...

The deleted code bits are not transmitted. On average, 3 code bits are transmitted for every two message bits, which yields a rate-2/3 code. Deleting code bits is called Puncturing the code. The rate is increased at the expense of reducing the minimum free distance.
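A sketch of the puncturing rule described above (delete c1 in every second code symbol; which symbols keep c1 is an assumption here):

```python
def puncture_rate_half(code_bits):
    """Puncture a rate-1/2 code to rate 2/3: keep c0 always, keep c1 only in
    every other code symbol. Input is the interleaved sequence c0(0) c1(0) ..."""
    out = []
    for t in range(0, len(code_bits), 2):
        out.append(code_bits[t])              # c0 is always transmitted
        if (t // 2) % 2 == 0:                 # c1 kept only in even-numbered symbols
            out.append(code_bits[t + 1])
    return out

base = [1,1, 1,0, 0,0, 0,1, 0,1, 1,1]         # 12 bits from the rate-1/2 encoder
print(puncture_rate_half(base))               # 9 bits: 3 code bits per 2 message bits
```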
However, it is not fair to compare d_f of a punctured code with d_f of the base code. Instead, we should compare d_f with that of a non-binary code with the same R and the same number of elements in its encoder. Cain showed that there are punctured codes with the same d_f as the best known non-binary codes of the same rate and memory depth. Punctured codes with rates up to 9/10 are known. Punctured codes are still linear codes (but no longer shift invariant).
Example 6.7: Use the same encoder as in Example 6.1, but with c1 punctured in every second code word. Find its code rate and trellis diagram.

Code rate R = 2/3 (from a base code with R = 1/2), d_f = 3.
The state diagram of such a code requires 8 states rather than 4 (4 for the "time-even" trellis states and 4 for the "time-odd" trellis states). The Viterbi algorithm requires only 4 node processors (M = 2).

The Puncturing Period is the number of bits encoded before returning to the base code.
Example 6.8: Find the puncturing period of the punctured code (7,5), 7.

Punctured codes are specified in a manner similar to the octal generator notation: in (7,5), 7, the base code has generators 7 and 5 (in octal), and the second message bit is encoded using only the generator polynomial 7, so R = 2/3.

Code sequence: c0(0) c1(0)  c0(1)  c0(2) c1(2)  c0(3)  ...
Here the puncturing period is equal to 2. It requires 2^M = 4 (M = 2) node processors, and the state diagram contains 8 (= 4 × 2, 4 times the puncturing period) states.
Example 6.9: Find the puncturing period of the punctured code (15,17), 15, 17.

The base code has generators 15 and 17 (in octal). The second message bit is encoded using only the generator polynomial 15, and the third message bit using only the generator polynomial 17, so R = 3/4.

Code sequence: c0(0) c1(0)  c0(1)  c1(2)  c0(3) c1(3)  ...

Here the puncturing period is equal to 3. The Viterbi decoder requires 2^M = 8 (M = 3) node processors, and the state diagram contains 24 (= 8 × 3, 8 times the puncturing period) states.
The punctured codes presented here are punctured versions of known good rate-1/2 codes. However, it is not always true that puncturing a good rate-1/n0 code yields a good punctured code. There is no known systematic procedure for generating good punctured convolutional codes; good codes are discovered by computer search.
Some examples of good punctured codes: