EXAMENSARBETE
2002:289 CIV
MASTER OF SCIENCE PROGRAMME
Department of Computer Science and Electrical Engineering
Division of Computer Engineering
2002:289 CIV • ISSN: 1402 - 1617 • ISRN: LTU - EX - - 02/289 - - SE
Power Consumption of Reed-Solomon Decoder Algorithms
STEFAN RYSTEDT
Power consumption of Reed-Solomon
decoder algorithms
Master’s Thesis in Computer Science
Stefan Rystedt
May 2001
Examiner:
Per Lindgren
Division of Computer Engineering
Department of Computer Science and
Electrical Engineering
Luleå University of Technology
Supervisor:
Martin Malmström
UN/D
Ericsson Microwave Systems AB
Mölndal
Abstract
The Core Unit ASIC Technology & System on Silicon at Ericsson Microwave Systems has a
project on STM-1/SDH communication over a microwave link called Mini Link. They want a
power-efficient Reed-Solomon decoder to correct bit errors. In this master's thesis, three
different algorithms for Reed-Solomon codes are implemented in hardware using VHDL and
then compared by power consumption. The Reed-Solomon code implemented is an RS(255,239)
that handles errors but not erasures.
The algorithms used are Berlekamp-Massey, Gröbner basis by Fitzpatrick and Welch-Berlekamp.
They are implemented in VHDL and first compared by size after synthesis, then by power
dissipation estimated for zero, four and eight errors per block with Watt Watcher from
Sequence. The Berlekamp-Massey implementation had the lowest power dissipation for all error
rates and the smallest size. However, no clear winner could be selected among the algorithms,
since the differences in size and power dissipation were so small.
Acknowledgements
I would like to thank my fiancée for being supportive when nothing seemed to go my way. I
also want to thank the people at Ericsson Microwave Systems AB in Mölndal, where this
master's thesis was written: especially Anders Persson for the Unix support; Martin Uddeborg,
Magnus Carlsson, Carl Gustafsson and Joachim Strömbergson for their pranks and incredibly
intelligent conversations that kept me alert at all times; and my supervisor Martin Malmström for
all support and material.
I would also like to thank Paul Rowley from Sequence for helping me with Watt Watcher, and
Patrick Fitzpatrick for explaining details in his algorithm. Last, I would like to thank my
supervisor at Luleå University of Technology (LTU) for his quick responses.
Stefan Rystedt
Table of contents
1 Introduction 1
2 Power Dissipation 3
3 Galois field 5
4 Reed-Solomon code 7
4.1 Encoding 7
4.2 Decoding Algorithms 8
4.2.1 Berlekamp-Massey & Fitzpatrick 8
4.2.2 Welch-Berlekamp 11
5 Implementation 15
5.1 Parallelism 16
5.2 Galois field operations 16
5.2.1 Multiplication 17
5.2.2 Inversion 18
5.2.3 Division 18
5.3 Berlekamp-Massey 18
5.3.1 Syndrome block 18
5.3.2 Key equation block 18
5.3.3 Error Locator and Error Corrector blocks 19
5.4 Fitzpatrick 19
5.5 Welch-Berlekamp 19
6 Verification 21
7 Comparisons 23
7.1 Size 23
7.2 Power estimations 24
8 Conclusions 27
9 Future works 29
10 Glossary 31
11 References 33
Appendix A Berlekamp-Massey key equation I
Appendix B Fitzpatrick key equation III
Appendix C Welch-Berlekamp key equation V
1 Introduction
The Core Unit ASIC Technology & System on Silicon at Ericsson Microwave Systems
has a project on STM-1/SDH1 communication over Mini Link2. In all communication of
digital data, bit errors are likely to occur in the transmission. This is also the case in
the communication over the Mini Link. Several error-control codes (ECCs) can be used
to find and correct errors in a digital data stream. The different ECCs have different
advantages; some are good at random errors and others at burst errors. One ECC that is
good at correcting burst errors is Reed-Solomon. It will be used together with Viterbi
coding (which is better at random errors) to find and correct the errors in the data stream
transmitted over the Mini Link.
Ericsson Microwave Systems already has an implementation of the Viterbi encoder/
decoder that they can use, and they have bought a Reed-Solomon encoder/decoder as an IP
(Intellectual Property) block. However, they would like to have their own implementation of the
Reed-Solomon encoder/decoder that fits their special needs. What they want is a
decoder that can correct up to eight symbol errors out of 255 symbols, where the symbol size is
eight bits. The ASIC with the implemented Reed-Solomon code will sit in the Mini Link
together with several other ASICs, and they want to keep the temperature down. Power
consumption is becoming a larger and larger problem as more and more is put on the
same chip; therefore it is increasingly important to have low power implementation in mind when
building hardware.
There are several different algorithms for Reed-Solomon decoding. In this master's
thesis, three of them have been implemented in hardware. They are compared by power
dissipation to see if there are any big advantages in choosing one over the others.
1 SDH is a transmission standard for digital telecommunication. STM-1 specifies a transfer rate of
155.52 Mb/s.
2 Mini Link is an Ericsson product for wireless transfer of digital data from one point to another using
microwaves.
2 Power Dissipation
The sources of power dissipation can be divided into two categories, dynamic and static.
The dynamic power dissipation consists of the power consumed to switch the signals. The
static is the power consumed to hold the signals.
According to Demmer [1], about 99% of the total power dissipation is dynamic. This
means that unless there is almost no activity in the design, the static power dissipation can
be disregarded. However, this might not hold for future technologies: with smaller
geometries the leakage in transistors increases, and hence so does the static power dissipation.
The dynamic power dissipation can be divided into a few categories. First, there is the
clock tree, which consumes power every time the clock signal switches. To reduce the
power dissipation of the clock tree, the supply voltage can be minimized; a lower supply
voltage results in slower gates, so the maximum clock frequency becomes lower.
Another approach is to make the design parallel and in that way lower the required clock
frequency. Clock gating could also be used to stop the clock for the parts of the
design that are not doing any work at present. Since both the clock frequency and the
supply voltage were already decided in the specification for the Reed-Solomon decoder, and
clock gating is quite complex, no power optimization was done for the clock tree.
The second block is the registers. They also consume power as the clock switches,
regardless of the input signals. However, they consume more power if the input switches.
To minimize the power dissipation from the registers the number of registers should be
minimized as much as possible. The switching on the input should also be kept as low as
possible.
Last, there is the logic between the registers. This logic only consumes power when a
signal in it switches. Signals that switch often should travel through as little
logic as possible, while signals that switch more seldom can take the longer way. The size
of the logic should also be kept at a minimum. One way to do that is to reduce the number
of complex operations by transforming mathematical expressions, e.g. rewriting
x^2 + ax + b as x(x + a) + b.
3 Galois field
Reed-Solomon codes are based on Galois fields, see Waltrich [2], also known as finite
fields. A Galois field consists of a finite number of elements and has the property that
arithmetic operations (+, -, *, /) on field elements always give a result in the field. The
notation GF(q), where q = p^m, is used to denote a Galois field. If p equals two, then m is
equal to the number of bits in each symbol. In the rest of this master's thesis, p is assumed
to be two, since the binary representation is well suited for hardware implementation. The
elements in the field are usually denoted α^0, α^1, …, α^(q-1). To construct a Galois field, the
degree of the field (i.e. the value of q) and a primitive polynomial must be decided. For
GF(256) there are 16 such polynomials. To calculate α^i you can use the formula

α^i = 2^i mod P(2)  (4.1)

where P(x) is the primitive polynomial. Note that in a Galois field α^(q-1) always equals
one. The primitive polynomials can be found by checking, for each polynomial of degree m,
that α^(q-1) = 1 and that no smaller positive power of α equals one.
To calculate α^(i+1) from the m-bit symbol α^i, shift the bits of α^i left one step and check if
bit m+1 is set. If it is, xor the result of the left shift with the
primitive polynomial P(2).
In table 1, the field GF(16) with the primitive polynomial P(x) = x^4 + x + 1, i.e.
α^4 = α + 1, is calculated. GF(16) has four bits per symbol; α^0, α^1, α^2, α^3 represent the
four bits, with α^0 the LSB (Least Significant Bit) and α^3 the MSB (Most Significant Bit),
and as shown in table 1 the rest of the symbols are built by combining these.
α^0  = 1
α^1  = α
α^2  = α^2
α^3  = α^3
α^4  = α + 1
α^5  = α^2 + α
α^6  = α^3 + α^2
α^7  = α^4 + α^3 = (α + 1) + (α^3) = α^3 + α + 1
α^8  = α^4 + α^2 + α = (α + 1) + (α^2 + α) = α^2 + 1
α^9  = α^3 + α
α^10 = α^4 + α^2 = (α + 1) + (α^2) = α^2 + α + 1
α^11 = α^3 + α^2 + α
α^12 = α^4 + α^3 + α^2 = (α + 1) + (α^3 + α^2) = α^3 + α^2 + α + 1
α^13 = α^4 + α^3 + α^2 + α = (α + 1) + (α^3 + α^2 + α) = α^3 + α^2 + 1
α^14 = α^4 + α^3 + α = (α + 1) + (α^3 + α) = α^3 + 1
α^15 = α^4 + α = (α + 1) + (α) = 1
Table 1 GF(16) with the primitive polynomial P(x) = x^4 + x + 1, i.e. α^4 = α + 1
Adding and subtracting in a Galois field are exactly the same operation, namely xor; e.g.
with the Galois field in table 1, α^5 + α^8 = α^5 - α^8 = (α^2 + α) + (α^2 + 1) = α + 1 = α^4.
Multiplication is simply an addition of the powers modulo the number of elements in the field
minus one:

α^a · α^b = α^((a+b) mod (q-1))  (4.2)

Division is subtraction of the powers modulo the number of elements minus one:

α^a / α^b = α^((a-b) mod (q-1))  (4.3)

Inversion is done by negating the exponent modulo the number of elements minus one:

(α^a)^(-1) = α^((-a) mod (q-1))  (4.4)
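The arithmetic above can be sketched in software. The following is a small illustrative Python model (not from the thesis, whose implementations are in VHDL and Java): it builds exponent/log tables for GF(16) with P(x) = x^4 + x + 1 using the shift-and-reduce rule above, then implements equations (4.2) to (4.4) by exponent arithmetic.

```python
# Build exponent/log tables for GF(16) with P(x) = x^4 + x + 1.
# Elements are 4-bit integers; bit k corresponds to alpha^k.
PRIM = 0b10011          # P(2) for x^4 + x + 1
Q = 16                  # field size

EXP = [0] * (Q - 1)     # EXP[i] = alpha^i
LOG = [0] * Q           # LOG[alpha^i] = i
x = 1
for i in range(Q - 1):
    EXP[i] = x
    LOG[x] = i
    x <<= 1             # multiply by alpha: shift left one step
    if x & 0x10:        # if bit m+1 is set, reduce by P(2)
        x ^= PRIM

def gf_add(a, b):       # addition and subtraction are both xor
    return a ^ b

def gf_mul(a, b):       # eq (4.2): add exponents mod q-1
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % (Q - 1)]

def gf_div(a, b):       # eq (4.3): subtract exponents mod q-1
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % (Q - 1)]

def gf_inv(a):          # eq (4.4): negate exponent mod q-1
    return EXP[(-LOG[a]) % (Q - 1)]
```

With table 1's values α^5 = 6 and α^8 = 5, gf_add(6, 5) returns 3 = α^4 and gf_mul(6, 5) returns 13 = α^13, matching the worked example above.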
4 Reed-Solomon code
The Reed-Solomon code is an error correcting code with which bit errors in transmitted
words can be found and corrected. It has become an important coding technique when
transmitting digital data due to its large burst error correction capability with minimal
added data bits. It is a block code, which means that the data is sent in blocks of symbols
where the number of symbols per block is in relation to the number of bits in each symbol.
The formula is

n = 2^b - 1  (5.1)

where n is the number of symbols per block and b the number of bits per symbol. There
are d data symbols, which are the actual data to send, and p parity symbols, which are
redundant data built from the data symbols. Reed-Solomon codes are often denoted
RS(n,d), so in an RS(15,11) there are 15 symbols per block, 11 data symbols and
15-11 = 4 parity symbols. As mentioned earlier, the number of bits per symbol decides the
Galois field, so for the RS(15,11) a GF(16) is used. To correct an error in one symbol,
two parity symbols are needed, so the total number of errors (t) that can be corrected is

t = p/2.  (5.2)

In the RS(15,11) code, up to two symbol errors can be corrected since t = 4/2 = 2. If some
symbols that are received are highly unreliable, they can be marked as erased. To correct
an erased symbol only one parity symbol is needed, since the location of the broken
symbol is known. Therefore, the total number of errors (t) and erasures (s) that can be
corrected is given by

p = 2t + s.  (5.3)

If 2t + s > p, one of two things can happen: either the decoder will detect that it cannot
recover the code word and flag this, or it will mis-decode and recover an incorrect word.
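As a quick sanity check of equations (5.1) and (5.2) for the code targeted in this thesis, the parameters can be computed directly (an illustrative Python fragment, not thesis code):

```python
# Parameters of the RS(255,239) code from eqs (5.1) and (5.2).
b = 8                 # bits per symbol
n = 2**b - 1          # symbols per block, eq (5.1)
d = 239               # data symbols
p = n - d             # parity symbols
t = p // 2            # correctable symbol errors, eq (5.2)
print(n, p, t)        # 255 16 8
```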
4.1 Encoding
Most Reed-Solomon codes are systematic codes; this means that the encoder just adds the
parity symbols without modifying the data symbols.
To generate a systematic code, a generator polynomial is needed. All valid code words are
exactly divisible by this polynomial. The general formula to calculate the polynomial is

G(x) = Π_{i=0}^{p-1} (x + α^(i+j))  (5.4)

where j is an integer and p is the number of parity symbols.
One easy way to calculate the parity symbols in hardware is to use a p-stage feedback shift
register, see figure 1.
Figure 1. Reed-Solomon encoder using a p-stage feedback shift register.
The constants b_1 to b_p are calculated from the coefficients of the generator polynomial,
G(x) = g_p x^p + g_{p-1} x^(p-1) + … + g_1 x + g_0, by the formula

b_i = g_i / g_0.  (5.5)

The data symbols are fed one by one to the encoder. The data is put on the output through
multiplexer B and into the registers through multiplexer A, so the data symbols are
immediately placed on the output without any alteration. When the encoder has received all data
symbols, multiplexer A sends zeros to the registers and multiplexer B sends the parity
symbols calculated by the registers. When p parity symbols have been sent, all registers are
cleared and the encoder is ready to encode the next block of data.
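The encoding procedure can be sketched in software. The following illustrative Python fragment builds the RS(15,11) generator over GF(16) per eq (5.6) below (roots α^1 … α^p) and computes the parity as a polynomial remainder; this is mathematically equivalent to the shift-register circuit of figure 1, not a model of the circuit itself.

```python
# Systematic RS(15,11) encoding over GF(16), P(x) = x^4 + x + 1.
# Parity = (data(x) * x^p) mod G(x); illustrative sketch only.
PRIM, Q, P_SYM = 0b10011, 16, 4
EXP, LOG, x = [0] * (Q - 1), [0] * Q, 1
for i in range(Q - 1):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x10:
        x ^= PRIM

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % (Q - 1)]

def poly_mul(a, b):  # coefficient lists, lowest degree first
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

# G(x) = prod_{i=1..p} (x + alpha^i); + and - coincide in GF(2^m)
G = [1]
for i in range(1, P_SYM + 1):
    G = poly_mul(G, [EXP[i], 1])

def encode(data):                    # data: 11 symbols, low order first
    rem = [0] * P_SYM + list(data)   # data(x) * x^p
    for k in range(len(rem) - 1, P_SYM - 1, -1):
        coef = rem[k]                # cancel the top coefficient
        if coef:
            for j in range(P_SYM + 1):
                rem[k - P_SYM + j] ^= gf_mul(coef, G[j])
    return rem[:P_SYM] + list(data)  # parity followed by data

def poly_eval(poly, xv):             # Horner evaluation in GF(16)
    r = 0
    for c in reversed(poly):
        r = gf_mul(r, xv) ^ c
    return r

cw = encode([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
# every valid code word has G's roots alpha^1..alpha^4 as roots
print(all(poly_eval(cw, EXP[i]) == 0 for i in range(1, 5)))  # True
```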
4.2 Decoding Algorithms
There are several different algorithms to decode Reed-Solomon codes. In this master's
thesis, three of them are implemented and then compared with regard to power
consumption. The three algorithms are Berlekamp-Massey, Johannesson et al [3],
Gröbner basis by Fitzpatrick [4], and Welch-Berlekamp, Chambers et al [5]. The decoder
for the Mini Link only has to handle errors and not erasures; therefore, the parts of the
algorithms that handle erasures have been left out.
4.2.1 Berlekamp-Massey & Fitzpatrick
The Berlekamp-Massey algorithm was chosen because it is a well-known algorithm and
has often been used in hardware implementations. The Fitzpatrick algorithm has the same
architecture as the Berlekamp-Massey, see figure 2, only the key equation to calculate the
Error Locator Polynomial differs. It was chosen since it is claimed to be more efficient to
implement in hardware than the Berlekamp-Massey key equation.
Figure 2 The design of the Berlekamp-Massey or Fitzpatrick algorithm: the received word
R(x) feeds a Syndrome block producing S(x), the Berlekamp-Massey or Fitzpatrick key
equation computes the Error Location Polynomial ELP(x), a Chien search finds the error
locations x^i, and Forney's algorithm computes the error magnitudes e_i x^i.
First, a Syndrome is calculated; this is used to calculate an error polynomial. The degree of
the error polynomial is the same as the number of errors. From the roots of the error
polynomial, the error locations can be calculated. Last, the Syndrome, the error locations
and the error polynomial are used to calculate the error magnitudes.
When using the feedback shift register encoder, the formula to calculate the generator
polynomial (eq 5.4) has to be modified. The new formula is

G(x) = Π_{i=1}^{p} (x + α^i).  (5.6)

Say that the sent word is V(x) and the received code word is R(x); then

R(x) = V(x) + E(x),  (5.7)

where the error polynomial E(x) = Σ e_i x^i, with e_i the magnitude of an error and x^i its
position.
Taking the Discrete Fourier Transform (DFT) gives, for the transform components
(denoted with an index j),

R_j = V_j + E_j  (5.8)

where R_j, V_j and E_j are the DFT components of R(x), V(x) and E(x) respectively.
It can be proved that

V_j = 0,  j = d, d+1, …, n-1  (5.9)

and hence from (eq 5.8)

R_j = E_j,  j = d, d+1, …, n-1  (5.10)

i.e. n-d consecutive components of the DFT of the error pattern can be computed at the
receiver. If the remaining d components could be found, the inverse Fourier Transform
(IDFT) could be used to obtain the error polynomial E(x).
If there are at most t errors, the cyclic complexity of E(x) is at most t. This means that
there is a polynomial Λ(x) of degree at most t that can generate the components of E(x)
from t consecutive components by the formula

E_j = Σ_{l=1}^{t} Λ_l E_{(j+l) mod n},  j = 0, 1, …, n-1.  (5.11)

When the n-d components E_d, …, E_{n-1} are known, an Error Locator Polynomial Λ(x)
with Λ_0 = 1 that generates E(x) through equation (5.11) can be found by solving a system
of t linear equations:

| E_{d+1}  E_{d+2}    …  E_{d+t}    |   | Λ_1 |   | E_d       |
| E_{d+2}  E_{d+3}    …  E_{d+t+1}  |   | Λ_2 |   | E_{d+1}   |
|   …        …        …    …        | · |  …  | = |   …       |
| E_{d+t}  E_{d+t+1}  …  E_{d+2t-1} |   | Λ_t |   | E_{d+t-1} |
  (5.12)

The Error Locator Polynomial has the form

Λ(x) = Π_{l=1}^{k} (x + α^(i_l))  (5.13)

where i_l, l = 1, 2, …, k ≤ t are the error locations of the at most t errors.
Since the IDFT of E(x) is quite complex to implement in hardware, the Forney algorithm
is used. The Forney algorithm uses the n-d components of E(x) and the Error Locator
Polynomial to calculate E(x).
4.2.1.1 Syndrome calculation
The Syndrome S(x), of degree p-1, is obtained by calculating the DFT components E_j for
j = d, d+1, …, n-1, i.e. by evaluating the received polynomial at the corresponding field
elements:

S_i = E_{d+i} = R(α^(d+i)),  i = 0, 1, …, p-1.  (5.14)
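The syndrome computation can be sketched as follows (illustrative Python, continuing the RS(15,11) example; note the root indexing here is chosen to match the generator of eq (5.6), whose roots are α^1 … α^p; the thesis's indexing j = d … n-1 corresponds to another, equivalent choice of generator roots):

```python
# Syndrome calculation for the RS(15,11) sketch over GF(16).
# With the generator of eq (5.6), the syndromes are
# S_i = R(alpha^(i+1)), i = 0..p-1.
PRIM, Q, P_SYM = 0b10011, 16, 4
EXP, LOG, x = [0] * (Q - 1), [0] * Q, 1
for i in range(Q - 1):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x10:
        x ^= PRIM

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % (Q - 1)]

def poly_eval(poly, xv):
    r = 0
    for c in reversed(poly):
        r = gf_mul(r, xv) ^ c
    return r

def syndromes(r):
    return [poly_eval(r, EXP[i + 1]) for i in range(P_SYM)]

# An all-zero word is a valid code word: all syndromes are zero.
print(syndromes([0] * 15))            # [0, 0, 0, 0]

# A single error e at position j gives S_i = e * alpha^((i+1)*j),
# so the syndromes become nonzero and expose the error.
err = [0] * 15
err[6] = 9                            # magnitude 9 at position 6
S = syndromes(err)
print(all(S[i] == gf_mul(9, EXP[(i + 1) * 6 % 15]) for i in range(4)))  # True
```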
4.2.1.2 Error Locator Polynomial
From the Syndrome, the Error Locator Polynomial Λ(x) can be found from equation
(5.12). The Berlekamp-Massey or the Fitzpatrick key equation is used to solve it. The
Berlekamp-Massey key equation is shown in appendix A and the Fitzpatrick key equation
in appendix B.
4.2.1.3 Error locations
To calculate the error locations, the roots of the Error Locator Polynomial Λ(x) have to be
found. This is done by Chien search: calculate the value of the polynomial for every
element in the Galois field and see if it equals zero. If Λ(α^i) = 0 for some
i = 0, 1, …, n-1, then α^i is a root and the location of the error is x^i.
If the number of roots does not match the degree of the polynomial, there are more errors
than can be corrected.
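The search can be sketched as follows (illustrative Python over GF(16); Λ(x) is built per eq (5.13) from two assumed error locations, i = 3 and i = 7):

```python
# Chien search sketch over GF(16): evaluate the Error Locator
# Polynomial at every field element and collect the roots.
PRIM, Q = 0b10011, 16
EXP, LOG, x = [0] * (Q - 1), [0] * Q, 1
for i in range(Q - 1):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x10:
        x ^= PRIM

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % (Q - 1)]

def poly_eval(poly, xv):
    r = 0
    for c in reversed(poly):
        r = gf_mul(r, xv) ^ c
    return r

# Per eq (5.13), errors at locations i = 3 and i = 7 give
# Lambda(x) = (x + alpha^3)(x + alpha^7), coefficients low order first.
lam = [gf_mul(EXP[3], EXP[7]), EXP[3] ^ EXP[7], 1]

def chien_search(poly):
    return [i for i in range(Q - 1) if poly_eval(poly, EXP[i]) == 0]

print(chien_search(lam))   # [3, 7]
```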
4.2.1.4 Error Magnitude
When the error locations are known, the error magnitudes can be calculated using the
Forney algorithm.
First, calculate the modified error locator polynomial Ω:

Ω(x) = S(x)Λ(x) mod x^p.  (5.15)

Then calculate the formal derivative Λ'(x). The error magnitude at the error location x^i is

e_i = Ω(α^i) / Λ'(α^i).  (5.16)

Last, subtract the error polynomial E(x) from the received word R(x) to get the sent word
V(x).
4.2.2 Welch-Berlekamp
Figure 3 The design of the Welch-Berlekamp algorithm
The Welch-Berlekamp algorithm has a different architecture than the two above, see figure
3. First, it has a recoding stage where new parity symbols are calculated; then a
key equation is solved; and last a Chien search finds the roots and, from them, the
error magnitudes.
First, some definitions have to be made.
Let U be the set {0, 1, …, n-1} and V any subset of U; then define

f(V;x) = Π_{i∈V} (x - α^i).  (5.17)

If V is the empty set then f(V;x) = 1.
We also define

f'(V;x) = f(V - {i}; x).  (5.18)

Here V - {i} denotes deletion of the member i from the set V.
Let D be the data symbol set {0, 1, …, d-1} and P the parity symbol set
P = U - D = {d, d+1, …, n-1}.
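Since f(P;x) and the constants derived from it depend only on the code, not on the received data, they can be precomputed (as is done later in the implementation chapter). An illustrative Python sketch for the RS(15,11) example:

```python
# Welch-Berlekamp precomputation sketch for RS(15,11) over GF(16):
# f(P;x) of eq (5.17) and the recoder constants f'(P;alpha^(i+d))
# of eq (5.18) depend only on the code, so they can be computed
# once and stored as constants.
PRIM, Q, N, D = 0b10011, 16, 15, 11
EXP, LOG, x = [0] * (Q - 1), [0] * Q, 1
for i in range(Q - 1):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x10:
        x ^= PRIM

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % (Q - 1)]

P_SET = list(range(D, N))              # parity index set {d..n-1}

def f_eval(v, xv):                     # f(V;x) = prod_{i in V} (x - alpha^i)
    r = 1
    for i in v:
        r = gf_mul(r, xv ^ EXP[i])     # subtraction is xor
    return r

# alpha^(i+d) is a root of f(P;x), so f(P; alpha^(i+d)) = 0 ...
print([f_eval(P_SET, EXP[i + D]) for i in range(N - D)])   # [0, 0, 0, 0]

# ... while the recoder constant skips that factor and is nonzero.
FP = [f_eval([j for j in P_SET if j != i + D], EXP[i + D])
      for i in range(N - D)]
print(all(c != 0 for c in FP))         # True
```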
4.2.2.1 Recoding
Say that the sent word is V(x) and the received code word is R(x). Recode the data
symbols D in R(x), i.e. calculate new parity symbols, to get a new code word RC(x).
Then calculate Y(x) by subtracting the parity symbols of the received word from those of
the recoded word and multiplying the result by a constant:

y_i = (rc_{i+d} - r_{i+d}) f'(P; α^(i+d)),  i = 0, 1, …, p-1.  (5.19)

Y(x) is then used when solving the key equation below. Notice that if there are no errors,
the parity symbols of the recoded word and the received word are the same,
rc_{i+d} - r_{i+d} = 0, and hence Y(x) will be the zero polynomial.
4.2.2.2 Key equation
The key equation to solve is

y_i Q(α^(i+d)) = P(α^(i+d)),  i = 0, 1, …, p-1,  (5.20)

where the error locations can be calculated from Q(x) and the error magnitudes from P(x)
and the derivative of Q(x). The key equation can be found in appendix C.
4.2.2.3 Calculating the error location.
The roots of the polynomial Q(x) have to be found. This is done with a Chien search, as in
the two algorithms above.
If α^i is a root of Q(x), then the error location is x^i.
If the number of roots does not match the degree of the polynomial, there are more errors
than can be corrected.
4.2.2.4 Calculating the error magnitude.
The error magnitude e_i at error location x^i is calculated by the equation

e_i = P(α^i) / (Q'(α^i) f(P; α^i)),  i ∈ D.  (5.21)

There are also equations to calculate the error magnitudes for the parity symbols, but they
are left out in this master's thesis.
5 Implementation
The Reed-Solomon code to be implemented in hardware is an RS(255,239) code over
GF(256), which means that the symbol length is 8 bits. The primitive polynomial for the
GF(256) is x^8 + x^4 + x^3 + x^2 + 1. The decoder should handle only errors, not
erasures. It should be able to fetch and process a new symbol every clock cycle.
Since the minimum data rate it should be able to handle is 155.52 Mbit/s, the clock
frequency has to be at least

155.52·10^6 · 255 / (239 · 8) ≈ 20.8 MHz.

The hardware implementation was made in Renoir, where the design blocks were created
and connected. All the blocks were then implemented as VHDL code using Emacs. All
three implementations have the same decoder interface, see figure 4.
Figure 4 The symbol for the decoders.
When DV_RS_DEC is set, the symbol on D_RS_DEC is read and the internal symbol
count for the block is increased by one. When LB_RS_DEC and DV_RS_DEC
are both set, the symbol on D_RS_DEC is considered the last symbol of the block. If the
number of symbols in the block is not equal to 255, the block is thrown away without any
report. If the number of symbols equals 255, the block is processed to find the error
locations and magnitudes, if any.
When DV_RS_DEC_O is set, the decoded symbol on D_RS_DEC_O is valid. If both
DV_RS_DEC_O and LB_RS_DEC_O are set, the last symbol of the block is valid on the
output.
If ERR_RS_DEC_O is set together with LB_RS_DEC_O, there are too many errors in the
block to be corrected.
All algorithms take more than 2×255 = 510 clock cycles but less than 3×255 = 765 clock
cycles to decode a block. Since the algorithms only calculate the error magnitudes of the
erroneous symbols, all data symbols received have to be stored in the decoder. The block
length is 255 symbols, which means that 3 blocks of data have to be stored; but only the
data symbols are of interest, so only 3×239 = 717 symbols have to be stored. This could
be done with simple registers, but since registers tend to consume much power, a dual-port
synchronous SRAM is used instead.
5.1 Parallelism
All the algorithms are described in a serial style, which is fine if implemented in
software. In hardware, on the other hand, parallelism is desirable. Therefore, an analysis of
each algorithm has to be done to see what can and should be done in parallel. Parallelism
increases the speed at which the calculation can be performed, i.e. reduces the number of
clock cycles it takes, but it can also increase the size of the implementation. Since the
decoder should be able to receive one symbol per clock cycle, no stage can take more than
255 clock cycles to complete. Hence, some parallelism has to be implemented.
Calculations of the type y = F(α^i) are never done in parallel. Taking the power of a Galois
field element is very complex, which would be needed if the degree of F is larger than 1.
Therefore the fact that ax^3 + bx^2 + cx + d is equal to x(x(xa + b) + c) + d is used. This
calculation is then easily done in four steps with only a single Galois field multiplication
and addition per step:
1. y = xa
2. y = x(y + b)
3. y = x(y + c)
4. y = y + d
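The four steps can be checked in software (illustrative Python, reusing the GF(16) tables for brevity; the thesis's hardware operates on GF(256)):

```python
# The four Horner steps above, checked in GF(16): evaluating
# a*x^3 + b*x^2 + c*x + d as x*(x*(x*a + b) + c) + d needs only
# one GF multiplication and one GF addition per step.
PRIM, Q = 0b10011, 16
EXP, LOG, xv = [0] * (Q - 1), [0] * Q, 1
for i in range(Q - 1):
    EXP[i], LOG[xv] = xv, i
    xv <<= 1
    if xv & 0x10:
        xv ^= PRIM

def gf_mul(u, v):
    return 0 if 0 in (u, v) else EXP[(LOG[u] + LOG[v]) % (Q - 1)]

def gf_pow(u, e):                 # direct powering, for comparison only
    return EXP[(LOG[u] * e) % (Q - 1)]

a, b, c, d, x = 3, 9, 5, 12, 6    # arbitrary GF(16) elements

# Step-by-step Horner evaluation (steps 1-4 above):
y = gf_mul(x, a)                  # 1. y = x*a
y = gf_mul(x, y ^ b)              # 2. y = x*(y + b)
y = gf_mul(x, y ^ c)              # 3. y = x*(y + c)
y = y ^ d                         # 4. y = y + d

# Direct evaluation of a*x^3 + b*x^2 + c*x + d for comparison:
direct = gf_mul(a, gf_pow(x, 3)) ^ gf_mul(b, gf_pow(x, 2)) ^ gf_mul(c, x) ^ d
print(y == direct)                # True
```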
Calculations of the type Y(x) = F(x)·α^i can be written

y_1 = f_1·α^i
y_2 = f_2·α^i
…
y_d = f_d·α^i

which is best implemented in parallel with 16 Galois field multiplications. For the different
key equations there are many different calculations made in parallel.
5.2 Galois field operations
The more complex operations in the Galois field (inversion, multiplication and division)
are implemented as functions in a VHDL package.
5.2.1 Multiplication
As mentioned before, multiplication in Galois Field is only addition of the powers. The
problem is that in hardware there is only the binary representation of the symbols. Since
the hardware can not say directly what the symbol representation is by looking at the bit
representation. There has to be a look up table with 255 entries to translate the bit
representation to the symbol representation. This would result in very large hardware since
there are many multiplications in the algorithms. What is known is that a0,a1
,…,a7
represents the eight bits. a0 is the LSB and a7
is the MSB and all the rest of the symbols
are made of adding this symbols together. By using that if a³b=c and a=d+e then
(d+e)b=d³b+e³b=c . The multiplication of two symbols can be divided into several
multiplications where one of the multiplicands is one of a0,a1
,…,a7. As example take
a³b , where a and b is element in the GF(256) used for the encoder. Say that a=a3 5, the
bit representation of a3 5 is “10011100” or a7
+a4+a3
+a2 so the multiplication can be
rewritten as a7³b+a4³b+a3³b+a2³b .
To calculate a i³b, i>0 , repeat the following i times; left shift all bits in b one step then if
the ninth bit in b is set take b modulus the primitive polynomial.
The VHDL code for the multiplication is

-- The primitive polynomial: x^8+x^4+x^3+x^2+1
constant PRIMITIVE_POLYNOMIAL : std_logic_vector(8 downto 0) := "100011101";

function GF_MULT (A, B : in std_logic_vector(7 downto 0))
  return std_logic_vector is
  variable TEMP   : std_logic_vector(8 downto 0);
  variable ANSWER : std_logic_vector(7 downto 0);
begin
  -- Bit 0 of A contributes B itself
  if A(0) = '1' then
    ANSWER := B;
  else
    ANSWER := (others => '0');
  end if;
  -- Each set bit I of A contributes B shifted left I steps, reduced
  MULT_LOOP : for I in 1 to 7 loop
    if A(I) = '1' then
      TEMP := '0' & B;
      SHIFT_LOOP : for J in 1 to I loop
        TEMP := TEMP(7 downto 0) & '0';
        if TEMP(8) = '1' then
          TEMP := TEMP xor PRIMITIVE_POLYNOMIAL;
        end if;
      end loop SHIFT_LOOP;
      ANSWER := TEMP(7 downto 0) xor ANSWER;
    end if;
  end loop MULT_LOOP;
  return ANSWER;
end GF_MULT;
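For reference, a bit-level software model of this shift-and-reduce multiplier (illustrative Python mirroring the VHDL loop structure, not thesis code) behaves as follows:

```python
# Software model of GF_MULT over GF(256), mirroring the VHDL:
# for each set bit i of a, shift b left i steps, reducing by the
# primitive polynomial whenever the ninth bit becomes set.
PRIM_POLY = 0b100011101            # x^8 + x^4 + x^3 + x^2 + 1

def gf_mult(a, b):
    answer = b if a & 1 else 0     # bit 0 of a contributes b itself
    for i in range(1, 8):
        if a & (1 << i):
            temp = b
            for _ in range(i):     # shift left i steps with reduction
                temp <<= 1
                if temp & 0x100:
                    temp ^= PRIM_POLY
            answer ^= temp
    return answer

# alpha (= 0x02) times 0x80 is one shift-and-reduce step:
print(hex(gf_mult(0x02, 0x80)))    # 0x1d, since 0x100 xor 0x11d = 0x1d
```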
5.2.2 Inversion
Inversion in a Galois field is very complex, and no easy solution was found. Therefore the
inversion is simply done with a truth table with 255 entries; all optimization is left to the
synthesis tool. This results in much bigger hardware than for the multiplication, but
fortunately inversion is quite rare in all the algorithms.
5.2.3 Division
The division is simply done by inverting one of the operands before the multiplication.
This of course results in even larger hardware than the inversion itself, but division is also
rare in the algorithms.
5.3 Berlekamp-Massey
5.3.1 Syndrome block
The 16 syndrome elements are calculated in parallel as in figure 5. When all symbols in a
block have been received, the calculation of S(x) is finished; the result is sent to the next
block and all the registers are cleared.
The same amount of calculation is done regardless of the error rate. Therefore the power
dissipation for this step does not depend on the number of errors in the block.
Figure 5 The parallel calculation of the syndrome: each element S_0, …, S_15 is
accumulated from the received symbols with one GF multiplication and one GF addition.
5.3.2 Key equation block
The calculation of the Error Locator Polynomial with the Berlekamp-Massey algorithm,
see appendix A, is divided into two steps that are repeated 16 times. One calculates Δ(k)
(Appendix A, step 3); this takes L clock cycles. The other calculates the rest in just one
clock cycle.
The number of times Δ(k) is not equal to zero depends on the number of errors. Since less
calculation is done when Δ(k) equals zero, fewer signals shift values and hence less power
is dissipated. Therefore, with low error rates the power dissipation for this block will be
lower than with high error rates.
5.3.3 Error Locator and Error Corrector blocks
Calculations of the type F(α^i), with degree deg, are done by a pipelined calculation as in
figure 6. In the Error Locator block, the Modified Polynomial and the derivative of the
Error Locator Polynomial, which are needed by the Error Corrector block, are calculated.
The calculation in the Error Corrector block is only performed when the Error Locator has
found an error, so the power dissipation for this block depends on the error rate. The
Chien search on the Error Locator Polynomial in the Error Corrector block takes 255 clock
cycles to test all possible roots, regardless of the number of errors. However, the degree of
the Error Locator Polynomial depends on the number of errors. With a low degree, most of
the GF multiplications in the pipelined calculation will be zero, so there will be less signal
switching and hence less power dissipation.
Figure 6 The pipelined calculation of F: the coefficients F_deg, F_{deg-1}, …, F_1, F_0
are combined with repeated GF multiplications by α^i and GF additions.
5.4 Fitzpatrick
The Syndrome, Chien search and Forney calculation are done exactly as in the
Berlekamp-Massey algorithm; only the key equation used to calculate the Error Locator
Polynomial differs. See appendix B for the Fitzpatrick algorithm.
The key equation calculation is divided into two steps repeated 16 times. The first
calculates a_j (Appendix B, step 3), and this takes deg(b_j) clock cycles. The other step
does the rest of the calculation and takes one clock cycle. The degree of b_j and the
number of times a_j is not zero depend on the number of errors. Few errors give a low
degree of b_j, and a_j is zero more often. Therefore, the power dissipation will be lower
with low error rates than with high.
5.5 Welch-Berlekamp
Since f(P;x), needed by the Error Locator, and f'(P;x), needed by the Recoder, do not
depend on the data received, they are precalculated and inserted in the VHDL code as
constants.
The recoding is done exactly as the encoding, except for a Galois field addition between
the old and the new parity symbols followed by a Galois field multiplication, see equation
5.19. The power dissipation for the Recoder does not depend on the error rate.
The key equation, see appendix C, is divided into 3 steps repeated 16 times. The first step
calculates d (Appendix C, step 2); it takes deg(Q(x)) clock cycles. The second step
calculates c (Appendix C, step 8); this takes deg(W(x)) clock cycles. The third calculates
the rest in one clock cycle. If d equals zero, there is no need for the second step. With
fewer errors, the degrees of Q(x) and W(x) are smaller, and hence so is the power
dissipation.
The Error Locator and Error Corrector blocks are very similar to those used by the other
two algorithms, except that there is no calculation of a modified Error Locator Polynomial.
6 Verification
To verify the hardware, a software implementation was made in Java. The Java
implementation was done in a generic way, so that the block length, the number of parity
symbols and the Galois field used were variable. This made it possible to verify that the
different stages of the algorithms worked, using different examples. Operations that
seemed complex to implement in hardware, like Galois field multiplication, were first
implemented in a software style and then as close to hardware as possible.
Constants needed for the algorithms to work were calculated by the software
implementation. It also produces data files that can be used in a test bench to verify the
correctness of the hardware implementation. The data files produced can be used either for
the whole design or for individual blocks in the design.
The hardware was verified by a test bench that produces encoded data with errors for the
decoder and then checks that the decoder corrects all blocks with 8 symbol errors or
fewer. All the implementations passed the test.
7 Comparisons
7.1 Size
The VHDL was synthesized to a TSMC technology using Ambit from Cadence. It was
synthesized with high effort for area optimization and the clock speed set to 100 MHz. The
sizes of the synthesized designs can be seen in figure 7; the size is in number of cells. Since
the size of a cell differs between vendors and it is the sizes relative to each other that are
interesting, no axes are given.
Figure 7 The size for the different block and different implementations
As can be seen in figure 7, the sizes of the implemented algorithms do not differ much.
Remember that only the Key Equation block differs between the Berlekamp-Massey and
the Fitzpatrick algorithm. This means that the other blocks will consume about the same
power in both algorithms. It can also be noted that the sizes of their key equation blocks
are quite small in comparison to the sizes of the common blocks. This indicates that the
power consumption will be much higher for the common blocks than for the key equation,
which implies that the algorithm choice for the Key Equation is of secondary importance.
The Syndrome/Recoder is the only block where the power dissipation does not depend
upon the error rate. Since this is the smallest block in the algorithms, the overall power
dissipation probably depends a lot on the error rate of the received symbols.
7.2 Power estimations
The power estimation was done with a tool from Sequence called Watt Watcher. This tool
takes the VHDL code and simulation data and estimates the power consumption. The
estimation would be more accurate if it were done on the synthesized circuit rather than on
the VHDL code. This could also have been done with Watt Watcher, but unfortunately
there was not enough time to do that. The Watt Watcher version used was brand new and
not thoroughly tested, since the design was too complex for the old version. This makes the
estimation even more uncertain. Since the power dissipation should depend on the error
rate, it was estimated for no, medium and high error rates. The simulation data is 50 blocks
of randomly generated data, with all blocks having the same number of errors. The errors
were inserted at random places with random magnitudes. The medium error estimation
contained four errors per block and the high error estimation eight errors per block.
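The stimulus described above can be sketched in a few lines of Java. This is an illustration of the data generation only, not the actual tool flow, and the fixed seed is an assumption added for reproducibility.

```java
import java.util.Random;

// Sketch of the simulation stimulus: blocks of random symbols, each with a
// fixed number of errors at distinct random positions with random non-zero
// magnitudes, as described in the text.
public class Stimulus {
    // Insert 'count' errors at distinct random positions.
    static void addErrors(int[] block, int count, Random rnd) {
        boolean[] used = new boolean[block.length];
        for (int e = 0; e < count; e++) {
            int pos;
            do { pos = rnd.nextInt(block.length); } while (used[pos]);
            used[pos] = true;
            block[pos] ^= 1 + rnd.nextInt(255);   // non-zero GF(2^8) magnitude
        }
    }

    // 50 blocks of random symbols (standing in for encoded blocks here),
    // all with the same number of errors.
    static int[][] makeBlocks(int blocks, int errorsPerBlock, long seed) {
        Random rnd = new Random(seed);
        int[][] data = new int[blocks][255];
        for (int[] block : data) {
            for (int i = 0; i < block.length; i++) block[i] = rnd.nextInt(256);
            addErrors(block, errorsPerBlock, rnd);
        }
        return data;
    }

    public static void main(String[] args) {
        int[] zero = new int[255];
        addErrors(zero, 8, new Random(42));
        int nonZero = 0;
        for (int v : zero) if (v != 0) nonZero++;
        System.out.println("errors inserted: " + nonZero); // prints 8
    }
}
```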
Figure 8 The average power dissipation of the different algorithms at different error rates
The Berlekamp-Massey algorithm has the lowest power dissipation of the three regardless
of the error rate and Welch-Berlekamp the highest, see figure 8. The Berlekamp-Massey
and Fitzpatrick algorithms have very similar power dissipation, as expected from the size
comparison. The differences in power dissipation between low and high error rates are not
as large as expected from the size comparison.
So how is the power dissipation divided among the building blocks of the algorithms?
[Figure 8 is a bar chart of the total power dissipation in W, from 0.000 to about 0.040, for 0, 4 and 8 errors per block, for the Berlekamp-Massey, Fitzpatrick and Welch-Berlekamp implementations.]
Figure 9 Power dissipation for the different blocks with no errors
Figure 10 Power dissipation for the different blocks with 8 errors
Most of the power dissipation comes from the Syndrome/Recoder block, which was
the only block that did not depend upon the error rate. This explains why the
differences between no and high error rates are quite small.
In the Syndrome/Recoder, 16 Galois field multiplications are done for every symbol in a
block, so 16 × 255 = 4080 multiplications are performed to process one block of data. The
Syndrome/Recoder may be the smallest block in size, but for every symbol it receives
almost every signal in it switches, hence the high power dissipation.
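The per-block workload can be modelled in a few lines of Java as a sanity check on the multiplication count. The field polynomial 0x11D and the syndrome roots α^1..α^16 are assumptions used only for illustration.

```java
// Model of the Syndrome/Recoder workload: each received symbol updates all
// 16 syndrome accumulators with one GF(2^8) multiplication each (Horner).
public class SyndromeCost {
    static int gfMul(int a, int b) {            // GF(2^8) multiply, poly 0x11D
        int p = 0;
        for (int i = 0; i < 8; i++) {
            if ((b & 1) != 0) p ^= a;
            a = (a << 1) ^ (((a >> 7) & 1) * 0x11D);
            b >>= 1;
        }
        return p;
    }

    // One Horner update per symbol: S_j = S_j * alpha^j + r, for j = 1..16.
    static int update(int[] s, int[] roots, int r) {
        for (int j = 0; j < s.length; j++)
            s[j] = gfMul(s[j], roots[j]) ^ r;
        return s.length;                        // multiplications performed
    }

    public static void main(String[] args) {
        int[] roots = new int[16];              // alpha^1 .. alpha^16
        int a = 1;
        for (int j = 0; j < 16; j++) { a = gfMul(a, 2); roots[j] = a; }
        int[] s = new int[16];
        int mults = 0;
        for (int i = 0; i < 255; i++)           // one block of 255 symbols
            mults += update(s, roots, i & 0xFF);
        System.out.println("multiplications per block: " + mults); // prints 4080
    }
}
```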
[Figures 9 and 10 are bar charts of the power dissipation in W, from 0.000 to about 0.016, of the Syndrome/Recoder, Key Equation, Error Locator and Error Corrector blocks for the Berlekamp-Massey, Fitzpatrick and Welch-Berlekamp implementations, with no errors and with 8 errors respectively.]
8 Conclusions
The first block in all the designs dissipates more than half of the total power when there
are no errors, and it is the block with the highest power dissipation even at high error rates.
Even if the estimation is uncertain, it can be seen that a large portion of the power
dissipation comes from the first block. This means that power optimization should be
concentrated on this block, by finding better ways to calculate the syndromes or some
other way to do the recoding.
The Key Equation and Error Locator blocks do very little work when there are no errors.
The average power dissipation of these two blocks is therefore approximately the same as
the power consumed by clock switching in their registers. By gating off the clock to these
two blocks when no work is done, they would consume almost no power at all. The gain
would be greatest for Welch-Berlekamp, since its key equation consumes much more
power than those of the other two designs.
The Berlekamp-Massey algorithm had the lowest power dissipation for all error rates and
the smallest size. The differences between the Fitzpatrick and Berlekamp-Massey
algorithms were so small that the Fitzpatrick algorithm could just as well be chosen.
Welch-Berlekamp had a slightly higher power dissipation than the other two, mainly
because there are more registers in its key equation and the recoding is a little more
complex than the syndrome calculation. With a more power-effective recoding algorithm
and clock gating it might become more power effective than the other two.
The conclusion is that none of the algorithms has any big advantage over the others when
it comes to power dissipation. The power dissipations are too close and the uncertainty too
high.
9 Future work
More time could be spent on power optimisation for the different algorithms. As
mentioned earlier, clock gating can be a way to reduce power dissipation. Unfortunately, it
was not possible to use clock gating in this work, but it would be interesting to see how
much would be gained by using it.
Today there are power optimisation programs on the market that, by looking at the VHDL
code, give suggestions on how to change the code to reduce the power dissipation. It would
be interesting to test such programs to see how much they can reduce the power
dissipation.
In this work the power estimation was done on the VHDL code. By doing it on the
synthesised circuit, the estimation should become much more accurate.
10 Glossary
ASIC Application Specific Integrated Circuit.
DFT Discrete Fourier Transform.
GTL Gate Transfer Level.
IDFT Inverse Discrete Fourier Transform.
IP Intellectual Property.
LSB Least Significant Bit.
MSB Most Significant Bit.
RTL Register Transfer Level.
SDH Synchronous Digital Hierarchy.
SRAM Static Random Access Memory.
STM-1 Synchronous Transport Module-1.
VHDL Very high speed integrated circuit Hardware Description Language.
11 References
[1] Demmer, Christian. Study in Optimization of Low Power ASIC Design on RT-Level.
Ericsson, Gothenburg, April 1999.
[2] Waltrich, Joe. Understanding Reed-Solomon Error Correction. Multimedia Systems
Design, December 1998, volume 2, number 12.
[3] Johannesson, Rolf & Zigangirov, Kamil Sh. Fundamentals of Convolutional Codes,
appendix A: Reed-Solomon codes. December 1996.
[4] Fitzpatrick, Patrick. Errors-and-Erasures Decoding of BCH Codes. Department of
Mathematics, University College Cork, Ireland, January 1998.
[5] Chambers, W. G. & Zien, Nader F. On the Welch-Berlekamp scheme for decoding
Reed-Solomon codes: a review. August 2, 1997.
Appendix A Berlekamp-Massey key equation
The variables k and L are integers, Δ^(k) is an element in the Galois field, and Λ^(k)(x)
and T(x) are polynomials over the Galois field. Λ_i^(k) denotes the coefficient of x^i in
Λ^(k)(x) and s_j denotes syndrome number j.
1. Initialize the variables k = 0, L = 0, Λ^(0)(x) = 1 and T(x) = x.
2. Set k = k + 1.
3. Compute the discrepancy Δ^(k) = Σ_{i=0}^{L} Λ_i^(k−1) · s_{k−1−i}.
4. If Δ^(k) = 0, then goto step 8.
5. Modify the polynomial: Λ^(k)(x) = Λ^(k−1)(x) − Δ^(k) · T(x).
6. If 2L ≥ k, then goto step 8.
7. Set L = k − L and T(x) = Λ^(k−1)(x) / Δ^(k).
8. Set T(x) = x · T(x).
9. If k < p, goto step 2, where p is the number of parity symbols.
10. The error locator polynomial Λ is Λ^(p).
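The steps above can be sketched in software. The Java sketch below assumes GF(2^8) with the reduction polynomial 0x11D, s[0..15] holding the syndromes S_1..S_16 and p = 16 parity symbols, matching the RS(255,239) code; these field details are assumptions. The self-test builds the syndromes of two artificial errors and checks that the resulting Λ vanishes at the inverse error locations.

```java
// Software sketch of the Berlekamp-Massey key equation steps above.
public class BerlekampMassey {
    static int gfMul(int a, int b) {            // GF(2^8) multiply, poly 0x11D
        int p = 0;
        for (int i = 0; i < 8; i++) {
            if ((b & 1) != 0) p ^= a;
            a = (a << 1) ^ (((a >> 7) & 1) * 0x11D);
            b >>= 1;
        }
        return p;
    }

    static int gfInv(int a) {                   // a^254 = a^-1 in GF(2^8)
        int r = 1;
        for (int i = 0; i < 254; i++) r = gfMul(r, a);
        return r;
    }

    // Returns the error locator polynomial Lambda (coefficient i of x^i).
    static int[] keyEquation(int[] s) {
        int p = s.length;                       // number of parity symbols
        int L = 0;
        int[] lambda = new int[p + 1];          // Lambda^(0)(x) = 1
        lambda[0] = 1;
        int[] t = new int[p + 1];               // T(x) = x
        t[1] = 1;
        for (int k = 1; k <= p; k++) {          // steps 2-9
            int delta = 0;                      // step 3: discrepancy
            for (int i = 0; i <= L; i++) delta ^= gfMul(lambda[i], s[k - 1 - i]);
            if (delta != 0) {                   // step 4
                int[] prev = lambda.clone();
                for (int i = 0; i < lambda.length; i++)
                    lambda[i] ^= gfMul(delta, t[i]);        // step 5
                if (2 * L < k) {                // step 6 not taken
                    L = k - L;                  // step 7
                    int inv = gfInv(delta);
                    for (int i = 0; i < t.length; i++) t[i] = gfMul(prev[i], inv);
                }
            }
            for (int i = t.length - 1; i > 0; i--) t[i] = t[i - 1]; // step 8
            t[0] = 0;
        }
        return lambda;                          // step 10
    }

    static int eval(int[] poly, int x) {
        int y = 0;
        for (int i = poly.length - 1; i >= 0; i--) y = gfMul(y, x) ^ poly[i];
        return y;
    }

    public static void main(String[] args) {
        // Two errors with locations X1, X2 and magnitudes e1, e2 (arbitrary).
        int x1 = 9, x2 = 171, e1 = 0x21, e2 = 0x7F;
        int[] s = new int[16];
        int p1 = x1, p2 = x2;
        for (int j = 0; j < 16; j++) {          // S_{j+1} = e1*X1^{j+1} + e2*X2^{j+1}
            s[j] = gfMul(e1, p1) ^ gfMul(e2, p2);
            p1 = gfMul(p1, x1);
            p2 = gfMul(p2, x2);
        }
        int[] lambda = keyEquation(s);
        // Lambda must vanish at the inverse error locations.
        System.out.println(eval(lambda, gfInv(x1)) == 0
                        && eval(lambda, gfInv(x2)) == 0);   // prints true
    }
}
```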
Appendix B Fitzpatrick key equation
i, j, d and k are integers, a_0 and a_1 are elements in the Galois field and b_0(x), b_1(x)
are polynomials over the Galois field. A prime denotes the Boolean complement (0′ = 1,
1′ = 0) and [F(x)]_i denotes the coefficient of x^i in the polynomial F(x). S(x) is the
syndrome polynomial.
1. Initialize the variables b_0(x) = 0, b_1(x) = 1, i = 1, j = 1, d = 1 and k = 0.
2. Set k = k + 1.
3. Set a_j = [b_j(x) · S(x)]_{k−1} for j = 0 and j = 1.
4. If a_i = 0, then goto step 11.
5. Set b_{i′}(x) = a_{i′} · b_i(x) − a_i · b_{i′}(x).
6. Set b_i(x) = x · b_i(x).
7. Set j = i′ and d = d − 1.
8. If d ≠ 0, goto step 13.
9. Set i = i′ and d = 1.
10. Goto step 13.
11. Set b_{i′}(x) = x · b_{i′}(x).
12. Set j = i and d = d + 1.
13. If k < p, goto step 2, where p is the number of parity symbols.
The error locator polynomial Λ is b_i.
Appendix C Welch-Berlekamp key equation
i and j are integers, d and c are elements in the Galois field and W(x), V(x), P(x), Q(x)
are polynomials over the Galois field. y_i denotes check value number i and α^(i−1) the
corresponding evaluation point.
1. Initialize i = 1, j = 0, W_0(x) = 0, V_0(x) = 1, P_0(x) = 0 and Q_0(x) = 1.
2. Calculate d = y_i · Q_{i−1}(α^(i−1)) + P_{i−1}(α^(i−1)).
3. If d ≠ 0, then goto step 8.
4. Set V_i(x) = x · V_{i−1}(x) + α^(i−1) · V_{i−1}(x) and
W_i(x) = x · W_{i−1}(x) + α^(i−1) · W_{i−1}(x).
5. Set P_i(x) = P_{i−1}(x) and Q_i(x) = Q_{i−1}(x).
6. Set j = j + 1.
7. Goto step 13.
8. Calculate c = (y_i · W_{i−1}(α^(i−1)) + V_{i−1}(α^(i−1))) / d.
9. Set V_i(x) = V_{i−1}(x) + c · P_{i−1}(x) and W_i(x) = W_{i−1}(x) + c · Q_{i−1}(x).
10. Set P_i(x) = x · P_{i−1}(x) + α^(i−1) · P_{i−1}(x) and
Q_i(x) = x · Q_{i−1}(x) + α^(i−1) · Q_{i−1}(x).
11. If j = 0, then interchange (swap) the values V with P and W with Q.
12. If j ≠ 0, set j = j − 1.
13. Set i = i + 1.
14. If i ≤ p, then goto step 2, where p is the number of parity symbols.
15. Return Q_i(x) and P_i(x).