
Technical Document 1166
October 1987

Performance of Data Compression Codes in
Channels with Errors

SAIC Comsystems Division


Approved for public release; distribution is unlimited.

The views and conclusions contained in this report are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Naval Ocean Systems Center or the U.S. Government.

NAVAL OCEAN SYSTEMS CENTER
San Diego, California 92152-5000

E. G. SCHWEIZER, CAPT, USN          R. M. HILLYER
Commander                           Technical Director

ADMINISTRATIVE INFORMATION

This report was prepared by SAIC Comsystems Division, under contract N66001-85-D-0029, for Code 83 of the Naval Ocean Systems Center.

Released under authority of
W. R. Dishong, Head
Submarine Broadcast Systems Division

REPORT DOCUMENTATION PAGE

1a. Report Security Classification: UNCLASSIFIED
3.  Distribution/Availability of Report: Approved for public release;
    distribution is unlimited.
5.  Monitoring Organization Report Number: NOSC TD 1166
6.  Performing Organization: SAIC Comsystems Division,
    2815 Camino Del Rio South, San Diego, CA 92106
7.  Monitoring Organization: Naval Ocean Systems Center,
    San Diego, CA 92152-5000
8.  Funding/Sponsoring Organization: Space and Naval Warfare Systems
    Command, Washington, DC 20363-5100
9.  Procurement Instrument Identification Number: N66001-85-D-0029
10. Source of Funding: Program Element 33131N, Project CM51, JCCM 5100
11. Title: Performance of Data Compression Codes in Channels with Errors
14. Date of Report: October 1987
15. Page Count: 91
18. Subject Terms: comma-free codes; block codes
19. Abstract: Huffman codes, comma-free codes, and block codes with
    shift indicators are important candidate message compression codes
    for improving the efficiency of communication systems. This study
    was undertaken to determine if these codes could be used to
    increase the throughput of the fixed very low frequency (FVLF)
    communication system. This application involves the use of
    compression codes in a channel with errors.

TABLE OF CONTENTS

EXECUTIVE SUMMARY ........................................... iv
    Introduction ............................................ iv
    Background .............................................. iv
    Results ................................................. iv
    Findings ................................................ vi
    Conclusions ............................................. vii
    Recommendations ......................................... vii
INTRODUCTION ................................................ 1
SCOPE ....................................................... 1
APPROACH .................................................... 2
DESCRIPTION OF DATA COMPRESSION CODES ....................... 3
    Generalized Baudot Codes ................................ 4
    Huffman Codes ........................................... 5
    Comma-free Codes ........................................ 14
DATA COMPRESSION ............................................ 21
    Generalized Baudot Codes ................................ 26
    Huffman Codes ........................................... 28
    Comma-free Codes ........................................ 32
        Surveying Comma-Free Codes .......................... 32
        Choice of a Comma-free Code for a 58-character Set .. 38
COMPRESSION CODE PERFORMANCE IN A CHANNEL WITH ERRORS ....... 42
    Generalized Baudot Codes ................................ 44
    Huffman Codes ........................................... 44
        Introduction ........................................ 44
        Description of Simulation Software .................. 44
        Simulation Results for a 95-character Set ........... 45
        Simulation Results for a 58-character Set ........... 47
    Comma-free Codes ........................................ 63
        Comma-free Codes Constructed in Two Steps Using
            Length One Words ................................ 63
        Comma-free Codes Constructed Using Other Than
            Length One Words ................................ 75
SUMMARY ..................................................... 77

LIST OF TABLES

E-1. Performance Comparison of Best Data Compression Codes ............ v
 1.  Error Recovery Test using a Five Character Alphabet .............. 12
 2.  Summary of Narrative Files (IBM PC MASS-11 Files) ................ 23
 3.  Character Probabilities of Occurrence for Four Narrative Files ... 24
 4.  Probabilities of Character Occurrence for any Character in
     15 Character Subsets of the Four Narrative Files ................. 27
 5.  Huffman Code Word Lengths for Four Narrative Files ............... 29
 6.  Average Number of Bits-per-Character for Huffman Codes and
     Different Training Files ......................................... 31
 7.  Partial Survey of the Distributions of Code Word Lengths for
     Codes Constructed using Prefixes and Suffixes of Lengths 1,1,2
     and 3 in the First Three Stages of Construction .................. 34
 8.  Partial Survey of the Distributions of Code Word Lengths for
     Codes Constructed using Prefixes and Suffixes of Lengths 1,1
     and 3 in the First Three Stages of Construction .................. 35
 9.  Partial Survey of the Distributions of Code Word Lengths for
     Codes Constructed using Prefixes and Suffixes of Lengths 1,1,2
     and 4 in the First Three Stages of Construction .................. 36
10.  Partial Survey of the Distributions of Code Word Lengths for
     Codes Constructed using Prefixes and Suffixes of Lengths 1,1
     and 4 in the First Three Stages of Construction .................. 37
11.  Huffman and Comma-free Code Word Lengths for Four Narrative
     Files ............................................................ 41
12.  Comparison of Bits-per-Character Values of Huffman and
     Comma-Free Codes ................................................. 43
13.  Huffman Code Words for Narrative File IV which Provided the
     Best Performance in a Channel with Errors ........................ 64
14.  Probabilities of Erroneous Comma Insertions or Deletions due
     to Bit Errors for the Suffix-Prefix Comma-Free Code
15.  Probabilities of Erroneous Comma Insertions due to Bit Errors
     for the Suffix-Suffix Comma-Free Code ............................ 71

LIST OF FIGURES

 1.  Two Examples of Huffman Coding ................................... 6
 2.  Huffman Codes Include Block Coding for Equal Probability
     Symbols (8 Symbol Example) ....................................... 8
 3.  Example Showing Data Compression of a Huffman Code Relative
     to a Block Code .................................................. 9
 4.  Examples of Error Propagation for Two Huffman Codes Providing
     the Same Compression ............................................. 10
 5.  Huffman Code Error Recovery Test using 5-Character Alphabet ...... 13
 6.  An Example of the Construction of a Comma-Free Code .............. 15
 7.  An Example of the Comma-Free Algorithm to Insert "Commas" in
     an Error-Free Channel ............................................ 16
 8.  An Example of the Comma-Free Algorithm to Insert "Commas" in
     a Channel with Errors ............................................ 18
 9.  Impact of Bit Errors on Comma Placement for Suffix-Prefix
     Comma-Free Code .................................................. 20
10.  Impact of Bit Errors on Comma Placement for Suffix-Suffix
     Comma-Free Code .................................................. 22
11.  Word Lengths for Selected Two- and Three-Step Suffix-Prefix
     Comma-Free Codes ................................................. 39
12.  Word Lengths for Selected Two- and Four-Step Suffix-Prefix
     Comma-Free Codes ................................................. 40
13.  Decoding Error Statistics Resulting from Randomly Induced Bit
     Errors for a Variety of Huffman Codes ............................ 46
14.  Average Number of Decoded Characters per Bit Error for Four
     Huffman Codes .................................................... 48
15.  Distribution of the Average Ratio of Input Symbol Errors to
     Bit Errors for a Large Number of Equally Efficient Huffman
     Codes ............................................................ 49
16.  Character Decoding Errors for Experiments Using Narrative I
     as a Training File ............................................... 51
17.  Character Decoding Errors for Experiments Using Narrative II
     as a Training File ............................................... 52
18.  Character Decoding Errors for Experiments Using Narrative III
     as a Training File ............................................... 53
19.  Character Decoding Errors for Experiments Using Narrative IV
     as a Training File ............................................... 54
20.  Distribution of Output Errors for Best 58-Character Huffman
     Code Using Narrative File I as Training File ..................... 55
21.  Distribution of Output Errors for Worst 58-Character Huffman
     Code Using Narrative File I as Training File ..................... 56
22.  Distribution of Output Errors for Best 58-Character Huffman
     Code Using Narrative File II as Training File .................... 57
23.  Distribution of Output Errors for Worst 58-Character Huffman
     Code Using Narrative File II as Training File .................... 58
24.  Distribution of Output Errors for Best 58-Character Huffman
     Code Using Narrative File III as Training File ................... 59
25.  Distribution of Output Errors for Worst 58-Character Huffman
     Code Using Narrative File III as Training File ................... 60
26.  Distribution of Output Errors for Best 58-Character Huffman
     Code Using Narrative File IV as Training File .................... 61
27.  Distribution of Output Errors for Worst 58-Character Huffman
     Code Using Narrative File IV as Training File .................... 62

EXECUTIVE SUMMARY

Introduction

Huffman codes, comma-free codes, and block codes with shift indicators are important candidate message compression codes for improving the efficiency of communication systems. Data compression codes have been used for communications in error-free channels. This study was undertaken to determine if these codes could be utilized to increase the throughput of the fixed very low frequency (FVLF) communication system. This application involves the use of compression codes in a channel with errors.

Background

The investigation of data compression codes was constrained to the investigation of information carrying bits and to compression based on the probabilities of occurrence of characters. The data compression capabilities of the candidate codes were investigated by estimating the average number of bits-per-character for the different codes; the performance of the codes in a channel with errors was investigated in terms of the average number of characters decoded in error per bit error and the average number of characters output from the decoder in error per bit error. Generally speaking, as the number of bits-per-character decreases (that is, as data compression increases), the number of characters decoded in error per bit error and the number of characters output from the decoder in error per bit error both increase.

Results

The performance of Huffman codes, suffix/prefix comma-free codes, and some variants of Baudot codes was obtained for the encoding of narrative files of an IBM PC for a 58-character set in lieu of processing of Navy messages (which were not available in an IBM PC compatible format). These results should be indicative of the results which could be obtained for the narrative portions of Navy messages using the 58-character (Baudot) set. Huffman code performance results in channels with errors were obtained through simulation on the IBM PC; results for the other codes were obtained analytically.

The number of degrees of freedom in the Huffman code construction process and the complexity of the impacts of bit errors on character synchronization precluded analytical treatment of Huffman code performance in a channel with errors. The severe problems uncovered by the simulation of Huffman codes in these channels led to the consideration of alternative data compression codes less sensitive to bit errors. The error mechanisms for these alternative codes are direct enough to allow analytical treatment.

Table E-1 summarizes the results of this investigation. The comma-free code statistics are for the construction leading to a comma-free code most nearly matching the Huffman code in compression. The Huffman code results are for the code constructed with the lowest number of decoded character errors per bit error. The summary results have been rounded to a single significant place to remove the small dependency of the values obtained on the particular narrative used as a basis for estimating character probabilities of occurrence.

TABLE E-1. PERFORMANCE COMPARISON OF BEST DATA COMPRESSION CODES

COMPRESSION      BITS PER    DECODE CHAR ERR   OUTPUT CHAR ERR
CODE             CHARACTER   PER BIT ERROR     PER BIT ERROR

1-shift Baudot   5.1         1.0               1.0
3-shift Baudot   4.5         1.0               1.1
Comma-free       4.0         1.5               1.5
Huffman          3.9         2.2               2.4 *

* A single bit error led to a maximum of 14 output characters in
  error for this code.

Findings

The main findings of this analysis were:

(1) The normal Baudot code uses a single shift key to reduce the number of bits required to transmit information from 6 to about 5.06. Generalizations of this construction can further reduce the average number of bits required to around 4.5 bits-per-character while maintaining a basic block structure.

(2) A suffix/prefix comma-free code can be constructed which provides nearly the same data compression as a Huffman code, provided that the probabilities of occurrence of the different characters decrease in a regular manner. For the character set and probabilities of occurrence of the characters of the set used in the Huffman simulation, the penalty for using a suffix/prefix comma-free code instead of a Huffman code varied from a low of .05 bit per character to a high of .18 bit per character for the four narrative files investigated.

(3) A single bit error can lead to very long sequences of decoding errors when Huffman codes are used. Sequences of output characters in error exceeding 90 characters in length were observed.

(4) It was found that operator interactive processing of the output narrative file could be used to correct about three-quarters of the Huffman decoder errors. Not all the errors could be detected by the operator, given only the output text with errors; some detected errors could not be corrected if multiple bit errors had occurred in the same code word.

(5) In a channel with errors, the performance of the suffix/prefix comma-free codes, which provided the best compression, only depends on whether the code is constructed using a suffix and a prefix or whether it is constructed using two suffixes or two prefixes. The codes in the two categories provide very similar performance in a channel with errors.

(6) A single bit error can lead to at most two character errors for the above prefix/suffix codes. This result follows from the fact that a single bit error can lead to at most one comma being inserted incorrectly by the suffix/prefix comma insertion algorithm for these codes.

(7) A large proportion of the compression gains achievable using Huffman and comma-free codes is provided by the coding of the frequently occurring blank by a short code word.

Conclusions

The following conclusions were drawn as a result of the investigation:

(1) Comma-free codes significantly outperform Huffman codes in an error channel. They provide nearly the same compression and have significantly fewer decoded or output character errors than Huffman codes.

(2) A generalized Baudot code offers modest compression gains (13%) with only one decoded or output character error per bit error.

(3) Comma-free codes probably could be designed with some error correction incorporated into the encoding of end-of-line characters to provide moderate compression gains (around 30%).

(4) More significant compression gains should be achievable by basing either Huffman or comma-free code word assignments to characters on the character and on the one or more characters immediately preceding it in the message.

Recommendations

We make the following recommendations.

(1) The analytical results obtained for the suffix-prefix and suffix-suffix comma-free codes in a channel with errors should be extended to more general comma-free codes.

(2) The most promising codes, i.e., the comma-free codes and the generalized Baudot codes, should be exercised on real Navy messages to verify that the findings reported herein apply to Navy messages.

(3) The most promising codes should be identified to encode characters using their conditional probabilities of occurrence (conditioned on receipt of the one or two previous characters). The techniques developed in this report can be used to identify the best codes for this application.

(4) Error correcting techniques, such as operator interaction or soft-decision logic, should be investigated for use with comma-free codes. Until this has been done, it is difficult to select among the available comma-free codes giving the same compression.

INTRODUCTION

It is desirable to increase the channel capacity of the submarine broadcast system. One technique which has been suggested for doing this is to more efficiently encode the narrative portions of messages through use of data compression codes. A data compression code assigns short binary code words to symbols with a high frequency of occurrence, and long code words to symbols with a low frequency of occurrence. Difficulties arise when data compression codes are used in channels with errors, because one bit error can lead to multiple character errors due to temporary loss of character synchronization.

This report investigates the behavior of Huffman, comma-free, and generalized Baudot codes for alphabets of 58 characters in channels with errors. Some preliminary results are provided on the feasibility of correcting errors in narrative portions of messages by using narrative context.

SCOPE

Data compression codes have been used in error-free channels, but to our knowledge they have not been used in channels with errors. Consequently, their performance in channels with errors has not been established. This paper represents an initial study of the behavior of data compression codes suitable for encoding the 58-character Baudot code used for Navy messages in a channel with errors.

Results are presented for an alphabet derived from the 95 character set of the IBM PC and for processing narrative files stored on its hard disk. These files were edited to use only capital letters, and certain seldom used symbols were deleted to obtain an alphabet the same size as that required for encoding the Baudot dictionary. The decision to use the reduced IBM PC character set and available document files was made so that results could be obtained without the development of a Navy message data base, which was not available in IBM PC compatible format when this analysis was undertaken.

The analysis is complicated by the fact that the error properties of both Huffman and comma-free codes depend on the specific choices of bits and code words, respectively, used to construct the codes for a particular application. This means that codes exist which provide the same data compression gains with differing error properties. The main thrust of this paper is to identify the Huffman codes and the comma-free codes giving the best performance in a channel with errors and compare their performance with that of generalized Baudot codes. This involves characterizing the relationships between character errors and bit errors for the codes.

An additional complication arises in the analysis of comma-free codes: the construction process used does not depend explicitly on the probabilities of occurrence of the characters to be encoded, and therefore a given code may not be well matched to the statistics of the character set. An approach is presented which allows the determination of the comma-free code, among those which can be constructed using the procedure developed by R. A. Scholtz, which gives the best data compression. This procedure was used to select the comma-free code which gives the best data compression for the 58-character set used for the Huffman simulations.

APPROACH

Huffman codes are known to provide the best data compression possible for variable length codes. This property is ensured by the code construction process itself, which is based directly on the probabilities of occurrence of the characters to be encoded.

The comma-free codes analyzed in this report are known as suffix/prefix codes and are constructed using a sequential procedure found by R. A. Scholtz. This procedure does not utilize probabilities of occurrence to guide the construction process. It was necessary for us to develop an approach to match the word lengths of available prefix/suffix codes to the character probabilities of occurrence to provide data compression comparable to that automatically provided by the Huffman codes.

Even after specifying the distribution of word lengths of Huffman or comma-free codes, there are degrees of freedom in the construction process. It was discovered that the error properties of the codes depended on the choices made in the construction process. This report has been structured to reveal these dependencies and to provide a technique for the selection of the compression codes providing the best performance in a channel with errors.

The insights provided by the investigation of Huffman codes and comma-free codes led to the identification of certain natural extensions of the presently used Baudot codes. A comparison of the performance of these codes with those of Huffman and comma-free codes provides a performance gauge against which the latter codes can be assessed.

The Huffman construction process has a great number of degrees of freedom. The impact of bit errors on character synchronization and character errors is very context-dependent; therefore, an analytical study of the dependency of error statistics on the Huffman construction process could not be performed. A simulation program was written and exercised for many different Huffman codes by altering the specific choices in the Huffman construction process for a fixed character set and fixed probabilities of occurrence for the characters in the character set. The best compression code found in this manner was then further exercised to provide baseline data compression and statistical error properties for Huffman codes.


Unlike the Huffman code, the number of degrees of freedom in the comma-free construction process depends on the number of sequential steps and not on the character set size. The performance of codes constructed in a few steps can be established analytically. We then found a very surprising thing: the comma-free code which best matches the compression performance of the Huffman codes for the narrative files processed involved only a simple two-step construction. For two-step constructions, the available degrees of freedom for comma-free codes lead to only two code sets with differing error statistics. For these codes, the impact of bit errors on both the algorithm which identifies code words (the comma insertion algorithm) and the character decoding process is characterized in terms of the bit within a code word in error.

The remainder of the report is broken into three major sections and a short summary section.

The first major section provides descriptions of the construction processes for generalized Baudot, Huffman, and comma-free codes. Examples of Huffman and comma-free codes are presented to illustrate the dependency of the performance of the codes in a channel with errors on the code construction process. This section provides background and motivation for the remaining sections of the report.

The second major section establishes the data compression which is to be expected for generalized Baudot codes, Huffman codes, and comma-free codes used to encode narrative files based only on estimated probabilities of occurrence of the characters in the narrative files. A symbol set consisting of 58 characters was utilized for this work to best simulate the 58 Baudot character set in use for Navy messages.

The third major section describes the performance of generalized Baudot, Huffman and comma-free codes in a channel with errors.

In the last section, the best codes found are discussed and recommendations submitted.

DESCRIPTION OF DATA COMPRESSION CODES

This section of the report contains three subsections. The first subsection describes a family of compression codes which have a structure very similar to that of the presently used Baudot code. We call these codes generalized Baudot codes. The second and third subsections describe Huffman and comma-free codes, respectively, with emphasis on the description of the construction processes for the codes and their impact on the performance of the codes in channels with errors.


Generalized Baudot Codes

The Navy Baudot code now being used can be viewed as consisting of two kinds of characters: information carrying characters and shift characters. Receipt of a shift character code word changes the decoding of the next code word--the receipt of the shift character by itself does not increase the information passed to the receiver.

The existing Baudot alphabet consists of 57 information characters and one shift character. If a simple block code was used, 6 bits would be required. However, if a 5 bit code is used instead, and one of the 32 code words is used as a shift character, 31 information characters can be transmitted using 5 bits and the remaining 26 information characters can be transmitted by using the 5 bit code word reserved for a shift character followed by a 5 bit code word. In effect, the remaining 26 information characters are transmitted using 10 bits.

The shift character can be implemented as either a one-character shift or as a toggle shift. We discuss codes using the shift character as a one-character shift. This is the case amenable to analysis in terms of character probabilities of occurrence.

A simple example suffices to indicate the compression provided by the use of shift characters. Suppose we wanted to encode six characters: a, b, c, d, e, and f, and that the probability of occurrence of a, b, or c was .75 and of the remaining characters .25. If the six characters were encoded with a block code, then 3 bits would be required per character. Suppose that "00" was used to transmit "a", "01" for "b", "10" for "c", and "11" as a shift character. Then "1100" could be used to transmit "d", "1101" to transmit "e", and "1110" to transmit "f". The average number of bits needed to transmit a character using this code is given by (.75)(2) + (.25)(4) = 2 + (.25)(2) = 2.5 bits-per-character.
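This arithmetic is easy to check mechanically. The following minimal sketch (Python; ours, not part of the original report, with the .75 probability split evenly over a, b, and c purely for concreteness) computes the average code word length of the example shift code:

    # Average bits-per-character for the example one-shift code:
    # "a", "b", "c" are sent as 2-bit words; "d", "e", "f" are sent as
    # the 2-bit shift word "11" followed by a 2-bit word, i.e., 4 bits.
    lengths = {"a": 2, "b": 2, "c": 2, "d": 4, "e": 4, "f": 4}
    probs = {"a": .25, "b": .25, "c": .25,              # together .75
             "d": .25 / 3, "e": .25 / 3, "f": .25 / 3}  # together .25
    print(sum(probs[ch] * lengths[ch] for ch in probs))  # 2.5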

More than one shift character could be used at each stage, and more than one shift could be used in succession, leading to a whole family of different Baudot-like codes, which we call generalized Baudot codes. For example, suppose again that we are building a code using blocks of two bits. We could reserve two of the 2-bit code words for shift characters. Then we would have two 2-bit code words, and eight 4-bit code words, available for encoding information characters. We could use some of the 4-bit code words as shift characters to generate 6-bit code words, and so on.

The data compression provided by any code is determined by the distribution of code word lengths in the code. A generalized Baudot code is specified by its basic code length and the number of characters used as shift characters for each multiple of the block length. The generalized Baudot codes of most interest for Navy messages use code lengths which are multiples of 3, 4, or 5.

Huffman Codes

Using only the probabilities of a set of characters being transmitted, Huffman provided an organized technique for constructing efficient codes. Huffman codes use the minimum number of bits on the average to transmit characters from the set. The procedure for constructing a Huffman code is illustrated in the following example [reference 1].

Suppose that we wish to code five characters: a, b, c, d, and e with the probabilities 0.125, 0.0625, 0.25, 0.0625, and 0.5, respectively. For this example, which is illustrated in figure 1, the Huffman procedure first involves three regroupings of the five characters.

Grouped characters are indicated by (b,d), (a,b,d), and (c,a,b,d) along the top of figure 1. At each stage in this first step, the two characters or groups of characters with the lowest probabilities are grouped. A group of characters is assigned the probability obtained by summing the probabilities of the characters in the group.

The Huffman code is constructed based on the characters which have been grouped at each stage by proceeding from right to left. Two of the many possible codes which can be assigned to the original character set are illustrated in figure 1.

We discuss the construction of Code A first. Step 1: assign "0" to the most likely character "e" and "1" to the character set (c,a,b,d). These bits are the first bit in the code words assigned to the characters. The character "e" is distinguished from the characters "c", "a", "b", and "d" by the fact that its code begins with "0" and their codes begin with "1". Step 2: no bit is assigned to "e", and a second bit is assigned to the remaining characters. This bit is chosen to distinguish "c" from "a", "b", and "d"--"0" is shown assigned to "c" and "1" assigned to the other characters. Step 3: no additional bits are assigned to "e" and "c", and additional bits are assigned to distinguish "a" from "b" and "d". Step 4: no bits are assigned to "e", "c", and "a", and bits are assigned to distinguish "b" and "d".

Code B, also shown in figure 1, differs from Code A in that at step 1, the character "e" is assigned "1" and the characters "c", "a", "b", and "d" begin with "0". The remaining steps are the same. Note that "0" and "1" can be assigned in either way at each step, leading to the construction of 16 different codes for the example shown in figure 1.
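The grouping procedure just described is easy to express in code. The sketch below (Python; ours, not the report's implementation, with an arbitrary tie-breaking convention) reproduces the code word lengths of figure 1, though not necessarily the exact bit assignments of Code A or Code B:

    import heapq

    def huffman_lengths(probs):
        """Return the code word lengths produced by the Huffman
        grouping procedure (character probabilities -> word lengths)."""
        # Heap entries: (probability, unique tie-breaker, {char: bits so far}).
        heap = [(p, i, {ch: 0}) for i, (ch, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p1, _, g1 = heapq.heappop(heap)   # the two least likely groups
            p2, _, g2 = heapq.heappop(heap)
            # Merging two groups adds one distinguishing bit to every
            # character in both groups.
            merged = {ch: n + 1 for ch, n in {**g1, **g2}.items()}
            count += 1
            heapq.heappush(heap, (p1 + p2, count, merged))
        return heap[0][2]

    lengths = huffman_lengths(
        {"a": .125, "b": .0625, "c": .25, "d": .0625, "e": .5})
    print(sorted(lengths.items()))
    # [('a', 3), ('b', 4), ('c', 2), ('d', 4), ('e', 1)] -- figure 1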

Figure 1. Two Examples of Huffman Coding

The example in figure 1 is very regular in that no reordering is necessary during the grouping of characters at the different stages of the construction process. It is worthwhile to note that the Huffman coding procedure can lead to block coding when all of the character probabilities are the same.

For example, consider the case of eight characters: a, b, c, d, e, f, g, and h, each having a probability of 0.125. Figure 2 illustrates a Huffman construction process leading to a block code for this case. Note that the characters are listed in the natural alphabet order. The first step leads to grouping g and h, the next step to grouping e and f, the next to grouping c and d, and the fourth step to grouping a and b. Each group is assigned a probability of 0.25. The next two steps lead to grouping e, f, g, and h, and to grouping a, b, c, and d. Each of these groups is assigned a probability of 0.5.

In general, the Huffman code construction process for characters with differing probabilities of occurrence leads to a code with some characters having code words of the same length and other characters having code words of differing lengths.

Figure 3 illustrates the data compression achievable from either Code A or Code B (as well as any of the other codes constructible by the Huffman process) described in figure 1. The gain is gauged by comparing the expected average number of bits to transmit a character for the Huffman code with a fixed length code. The average code word length (L) for the example Huffman code is given by:

    L = 0.125(3) + 0.0625(4) + 0.25(2) + 0.0625(4) + 0.5(1) = 1.875

The Huffman code has the smallest average code word length. However, it has variance (V):

    V = 0.125(3 - 1.875)^2 + 0.0625(4 - 1.875)^2 + 0.25(2 - 1.875)^2
        + 0.0625(4 - 1.875)^2 + 0.5(1 - 1.875)^2 = 1.109375

By comparison, block coding, which assigns codes of equal length to each symbol, would have produced an average length of 3 with zero variance.

The following examples show how one bit error in Huffman coding can cause errors in more than one character when decoding; extra characters may be introduced or some characters may be dropped. In each case, the first bit of the sequence was changed. The surprising dependency of character errors on the choices made in the Huffman code construction process motivated this study. This phenomenon is illustrated in figure 4.

Figure 2. Huffman Codes Include Block Coding for Equal Probability Symbols (8 Symbol Example)

Figure 3. Example Showing Data Compression of a Huffman Code Relative to a Block Code

Figure 4. Examples of Error Propagation for Two Huffman Codes Providing the Same Compression

Figure 4 shows the impact of introducing a single bit error into the code word assigned "a" for Code A and Code B. For Huffman codes, and other variable length block codes, the impact of an error depends on the characters following "a". In the example, "abcde" is being transmitted. The impact of the single bit error is enclosed by brackets, and an error count is shown to the right for each of the two Huffman codes.

For Code A, an error in the first bit of the code word for "a" leads to it being incorrectly decoded into the two characters "e" and "c"; i.e., one input character is decoded in error and two erroneous characters are output.

For Code B, an error in the first bit of the code word for "a" leads to the next three characters being decoded in error, for a total of 10 characters being output erroneously.

For Code A, the bit error does not lead to loss of character synchronization; for Code B, it does. In general, bit errors do lead to loss of synchronization for Huffman codes.

Some codes have been discovered which tend to lose character synchronization less often and for shorter periods of time than Huffman codes. These codes utilize an intermediate processing step to define code words (comma insertion) and are called comma-free codes. Comma-free codes are discussed in the next subsection.

The error propagation dependency on Huffman code choice illustrated by figure 4 was potentially so important that a preliminary simulation was conducted to preclude the possibility that the example was a fluke. The simulations were for the two Huffman codes associated with the example presented in figure 1.

Table 1 summarizes the results of introducing a bit error in the first bit of the first code word. This code word is the encoded first character of the five characters shown in the "input char(acter)" columns of the table; the impact of this bit error on the decoding process is shown by presenting the characters output from the decoder in the "output char(acter)" columns.

Two statistics summarize the experimental results presented in table 1: the number of input symbols decoded in error (2.77 weighted average) and the number of output symbols in error (2.81 weighted average). This second statistic shows on the average how long it takes to regain character synchronization after a bit error is introduced.

The same simulation was run for different Huffman codes obtained by changing the first, second, or third bit of each codeword. The results shown in figure 5 show that there is a very definite dependency of the error properties on the choices made in constructing a Huffman code.

TABLE 1. ERROR RECOVERY TEST USING A FIVE CHARACTER ALPHABET

INPUT   OUTPUT    INPUT   OUTPUT    INPUT   OUTPUT
CHAR    CHAR      CHAR    CHAR      CHAR    CHAR

abcde   eeebcde   bdeac   eebceac   dbcae   eecbcae
abced   eeebced   bdeca   eebceca   dbcea   eecbcea
abdce   eeebdce   beacd   eeaacd    dbeac   eecbeac
abdec   eeebdec   beadc   eeaadc    dbeca   eecbeca
abecd   eeebecd   becad   eeacad    dcabe   eeccabe
abedc   eeebedc   becda   eeacda    dcaeb   eeccaeb
acbde   eeecbde   bedac   eeadac    dcbae   eeccbae
acbed   eeecbed   bedca   eeadca    dcbea   eeccbea
acdbe   eeecdbe   cabde   eceebde   dceab   eecceab
acdeb   eeecdeb   cabed   eceebed   dceba   eecceba
acebd   eeecebd   cadbe   eceedbe   deabc   eeceabc
acedb   eeecedb   cadeb   eceedeb   deacb   eeceacb
adbce   eeedbce   caebd   eceeebd   debac   eecebac
adbec   eeedbec   caedb   eceeedb   debca   eecebca
adcbe   eeedcbe   cbade   ecebede   decab   eececab
adceb   eeedceb   cbaed   ecebeed   decba   eececba
adebc   eeedebc   cbdae   ecebcae   eabcd   ceebcd
adecb   eeedecb   cbdea   ecebcea   eabdc   ceebdc
aebcd   eeeebcd   cbead   eceaad    eacbd   ceecbd
aebdc   eeeebdc   cbeda   eceada    eacdb   ceecdb
aecbd   eeeecbd   cdabe   ececabe   eadbc   ceedbc
aecdb   eeeecdb   cdaeb   ececaeb   eadcb   ceedcb
aedbc   eeeedbc   cdbae   ececbae   ebacd   cebecd
aedcb   eeeedcb   cdbea   ececbea   ebadc   cebedc
bacde   eebecde   cdeab   ececeab   ebcad   cedad
baced   eebeced   cdeba   ececeba   ebcda   cedda
badce   eebedce   ceabd   ebebd     ebdac   cebcac
badec   eebedec   ceadb   ebedb     ebdca   cebcca
baecd   eebeecd   cebad   ebbed     ecabd   cceebd
baedc   eebeedc   cebda   ebbca     ecadb   cceedb
bcade   eedade    cedab   ebcab     ecbad   ccebed
bcaed   eedaed    cedba   ebcba     ecbda   ccebca
bcdae   eeddae    dabce   eecabce   ecdab   ccecab
bcdea   eeddea    dabec   eecabec   ecdba   ccecba
bcead   eedead    dacbe   eecacbe   edabc   cecabc
bceda   eededa    daceb   eecaceb   edacb   cecacb
bdace   eebcace   daebc   eecaebc   edbac   cecbac
bdaec   eebcaec   daecb   eecaecb   edbca   cecbca
bdcae   eebccae   dbace   eecbace   edcab   ceccab
bdcea   eebccea   dbaec   eecbaec   edcba   ceccba

The results presented in table 1 were derived using the following correspondence between characters and code words:

    a <-> 100       d <-> 1011
    b <-> 1010      e <-> 0
    c <-> 11
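Any row of table 1 can be checked with a few lines of code. The sketch below (Python; ours, not the report's simulation software) encodes an input sequence with the code words above, inverts the first bit, and decodes greedily:

    # Code words used for table 1 ("e" is the most probable character).
    CODE = {"a": "100", "b": "1010", "c": "11", "d": "1011", "e": "0"}
    INVERSE = {word: ch for ch, word in CODE.items()}

    def decode(bits):
        """Greedy prefix decoding; trailing bits that never complete
        a code word are simply dropped."""
        out, word = [], ""
        for b in bits:
            word += b
            if word in INVERSE:
                out.append(INVERSE[word])
                word = ""
        return "".join(out)

    def corrupt_and_decode(text):
        bits = "".join(CODE[ch] for ch in text)
        flipped = ("1" if bits[0] == "0" else "0") + bits[1:]  # first-bit error
        return decode(flipped)

    print(corrupt_and_decode("abcde"))   # eeebcde, as in table 1
    print(corrupt_and_decode("bdeac"))   # eebceac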

Figure 5. Huffman Code Error Recovery Test using 5-Character Alphabet

Comma-free Codes

Comma-free codes are binary codes so constructed that it is possible to identify individual code words prior to decoding the received bit stream. In this report, we restrict our attention to a particular family of comma-free codes, known as "suffix/prefix" codes, found by R. A. Scholtz [reference 2].

In order to illustrate the ideas involved in Scholtz's construction process, we choose a particularly simple example derived from a somewhat longer example presented in his paper. Figure 6 illustrates the Scholtz construction process for a code constrained to a maximum code length of five.

Scholtz constructs his code words sequentially. The sets of code words available to be assigned to characters are denoted by "C", "C'", and "C''" in the example. Starting with the set "C", consisting of the two code words "0" and "1", the code set "C'" is constructed by taking one of the two original code words and using it as a suffix an arbitrary number of times for the other code word; we chose to use "1" as a suffix and retain "0" as a code word in "C'". Any of the words in "C'" could be used as a suffix to create new code words and thus construct a new set of code words "C''". We chose to use the shortest code word "0" as a suffix to construct "C''". As a result, "C''" contains no code words of length one.

We have presented the code words in "C''" in rows according to code word length and by columns beginning with still available code words of "C'". Additional code words could be constructed by choosing, for example, "01" as a suffix, and excluding it as a code word in the new set "C'''" constructed from "C''".

Generally speaking, new code words can be constructed by either using suffixes or prefixes. The process can be carried out any number of times.

Figure 7 illustrates the process used to construct "commas" for the code illustrated in figure 6. Figure 7 shows the comma construction process for an error-free channel and figure 8 illustrates the impact of errors on the construction process.

Suppose that the characters to be transmitted have been assigned the code words shown in brackets in figure 8. The transmitted and received bit stream would consist simply of the bits enclosed in these brackets with no indication of where one code word ended and another began.

Figure 7 shows the three-step process used to insert "commas", i.e., to delineate the code words which were sent. The comma insertion process parallels the code construction process. It proceeds by first inserting commas between all the bits and then successively deleting commas according to rules based on the suffix choices.

Figure 6. An Example of the Construction of a Comma-Free Code

Figure 7. An Example of the Comma-Free Algorithm to Insert "Commas" in an Error-Free Channel

To aid the reader in following the process, we have maintained the bits in alignment from step to step in figure 7--the spaces introduced for this purpose are not interpreted by the decoder. Corresponding to choosing "1" as a suffix, commas are removed preceding "1"s in the second step of the comma insertion process. All the code words in "C'" are now isolated. Next, corresponding to choosing "0" as a suffix to construct "C''", the first comma is removed whenever ",0," occurs. After the deletion of these commas, all the code words have been isolated.
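The three-step comma insertion just described can be written directly as string operations. A minimal sketch (Python; ours, with an example word sequence chosen only for the demonstration):

    def insert_commas(bits):
        """Comma insertion for the code of figure 6, which was built
        by using "1" and then "0" as suffixes."""
        # Step 1: insert a comma between all bits (sentinel commas at
        # both ends so the first and last words are delimited too).
        s = "," + ",".join(bits) + ","
        # Step 2: "1" was chosen as a suffix, so remove every comma
        # that precedes a "1".
        s = s.replace(",1", "1")
        # Step 3: "0" was chosen as a suffix, so remove the first
        # comma of every ",0," pair, repeating until none remains.
        while ",0," in s:
            s = s.replace(",0,", "0,")
        return [word for word in s.split(",") if word]

    # Four code words of the form produced for the set C'' above:
    print(insert_commas("01" + "0110" + "011" + "0100"))
    # -> ['01', '0110', '011', '0100']: all words correctly isolated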

Figure 8 traces through the impact of three bit errors on the comma insertion process. The transmitted bit stream is the same as that presented in figure 7. The received bit stream, shown below the arrow labeled "errors", has bit errors in the third bit of the first word, the second bit of the fourth word, and the second bit of the next-to-last word. The comma deletion steps lead to the last bit stream. It is easy to see that the leftmost bit error would lead to a character decoding error, but not loss of character synchronization; the second bit error would lead to the previous character and the character with the bit error both being decoded incorrectly, i.e., to loss of character synchronization; the rightmost bit error also leads to the previous character and the character with the bit error being decoded incorrectly and loss of character synchronization.

Even with three bit errors introduced into three of eight characters, three characters were still correctly decoded in the above example. This example shows less impact of errors than previously shown by the Huffman code example.

Figure 8. An Example of the Comma-Free Algorithm to Insert "Commas" in a Channel with Errors

There are choices in the construction of comma-free codes that would lead to the same distribution of code word lengths, and hence to the same data compression. The behavior of the code in an error channel depends on these choices. This can be illustrated by considering two particularly simple codes that have the same distribution of word lengths. One code is constructed by first using "1" as a suffix and then "0" as a prefix. The second code is constructed by first using "1" as a suffix and then "0" as a suffix, namely the code illustrated for up to length five code words in figure 6. The first code will be referred to as the suffix-prefix code and the second code as the suffix-suffix code. (These codes turn out to be very important for practical applications; this will be discussed in a later section.)

The code words of either of the two codes have lengths 2 to m, m > 1. The code words are easy to describe mathematically:

(a) suffix-prefix code words are of the form k "0"s followed by h "1"s, with k > 0, h > 0, k + h <= m


(b) suffix-suffix code words are of the form a "0" followed by k "1"s followed by h "0"s, with k > 0, h >= 0, 1 + k + h <= m

It is particularly easy to describe the impact of bit errors on the comma insertion process of the suffix-prefix code. Figure 9 summarizes the impact of bit errors on the process as a function of where the bit error occurs in a code word bracketed by two other code words. Four cases are distinguished: first bit in error, either of the two transition bits from "0" to "1" in error, last bit in error, and a central "0" or "1" bit in error.

The comma insertion process can lead to code words which are longer than m bits and therefore not decodable. If the first bit is in error, and k2 > 1, then this occurs only if the first word has maximal length m; if the last bit is in error, and h2 > 1, then this occurs only if the last word has maximal length m. In general, if the first bit is in error and k2 > 1, a comma is inserted (incorrectly) one position to the left; if the last bit is in error and h2 > 1, a comma is inserted (incorrectly) one position to the right. If the first bit is in error and k2 = 1, then the comma between the first and second word is deleted, leading to a code word of length k1 + h1 + 1 + h2; the new word is a non-code word if this expression exceeds m. Likewise, if the last bit is in error and h2 = 1, the comma between the second and third word is deleted, leading to a code word of length k2 + 1 + k3 + h3, which can be a non-code word if this expression exceeds m.

Errors in a transition bit, namely the k2-th bit or the (k2+1)-th bit, with k2 > 1 and h2 > 1, do not impact the positions at which commas are inserted. The resulting erroneous word is always decoded as a single character.

Errors in the middle of a string of "0"s or in the middle of a string of "1"s lead to the insertion of a spare comma within the middle word. The middle word is always decoded as two characters for these cases.

In summary, this survey of the impact of bit errors on the suffix-prefix comma insertion and decoding process has shown that a single bit error can lead to at most one comma being inserted incorrectly. All the following possibilities occur: the comma between the first two words can be deleted or moved to the right, the comma between the second and third words can be deleted or moved to the left, or a new comma can be inserted splitting the middle word. However, it can happen that the commas are all inserted correctly and only a decoding error occurs. A single bit error leads to either one or two character errors so that, unlike Huffman codes, the impact of a single bit error on character decoding is strictly limited.
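This limit is easy to check by brute force. The sketch below (Python; ours, with an arbitrary example triple of code words) splits a corrupted stream at every "1"-to-"0" transition, which is where the comma insertion rules place the commas for the suffix-prefix code, and prints the words recovered after each possible single bit error:

    import re

    def split_suffix_prefix(bits):
        """Comma insertion for the suffix-prefix code: every code word
        is k "0"s followed by h "1"s, so commas fall exactly where a
        "0" follows a "1" (leading "1"s or trailing "0"s form broken,
        non-code words)."""
        return re.findall(r"0+1+|1+|0+", bits)

    words = ["0011", "0001", "01"]      # three suffix-prefix code words
    stream = "".join(words)
    for i in range(len(stream)):
        flipped = stream[:i] + ("1" if stream[i] == "0" else "0") + stream[i + 1:]
        print(i, split_suffix_prefix(flipped))
    # In every case at most one comma moves, and at most two of the
    # three transmitted characters are affected.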

Figure 9. Impact of Bit Errors on Comma Placement for Suffix-Prefix Comma-Free Code

A discussion of the impact of bit errors on the suffix-suffix code is somewhat more complicated than for the suffix-prefix code because the suffix-suffix code words have a more complicated structure.

Figure 10 illustrates the impact of errors for different bitpositions in the middle code word of three successive code words.

The impact of an error in the first bit of the middle word depends on the first word, namely on whether or not h1 > 0. If h1 > 0, then the middle word "steals a 0" from the first word; if h1 = 0, then the comma between the first and middle word is deleted and the second word is "added onto" the first word.

A bit error in the second bit position of the middle word leads to one or more "0"s being added to the first word, depending on whether or not k2 > 1.

Errors at other bit positions lead to similar behavior for the suffix-suffix code as described for the suffix-prefix code.

Figure 10. Impact of Bit Errors on Comma Placement for Suffix-Suffix Comma-Free Code

Our brief discussion of the suffix-suffix code indicates a clear performance difference between the suffix-prefix code and the suffix-suffix code in an error channel. Both codes share the property that a single bit error leads to at most two successive characters being decoded in error and the misplacement of at most one comma.

DATA COMPRESSION

This section contains four subsections. In the first subsection, the probabilities of occurrence of the characters appearing in four different narrative files are discussed. In the next three subsections, the generalized Baudot codes, Huffman codes, and comma-free codes providing the best data compression for character encoding based on these probabilities of occurrence are identified.

Introduction

Four narrative files resident on the hard disk of the IBM PC were used to investigate compression and error properties of data compression codes. The data compression possible using character encoding is determined by the probabilities of occurrence of the characters in the data being encoded. In this section, data compression results are obtained for encoding based on character probabilities of occurrence in the four narrative files.

Table 2 summarizes the general properties of the four narrative files used throughout this study. Table 2 characterizes the four narratives in terms of the number of lines of text and the number of bytes in them. The four narratives were all technical documents involving some equations.

Table 3 completes the description of the narratives relevant to their use for data compression investigations by presenting the probabilities of occurrence of the different characters for each of the four narrative files.


TABLE 2. SUMMARY OF NARRATIVE FILES (IBM PC MASS-11 FILES)

NARRATIVE                                       NUMBER OF   NUMBER OF
FILE       DESCRIPTION                          BYTES       LINES

I          INVENTION DISCLOSURE OF NESTED       31,744      612
           SPATIAL-TEMPORAL INTERFERER
           SUPPRESSOR

II         INVENTION DISCLOSURE OF SPATIAL      28,672      503
           COMBINER

III        MEMO RE: MEECN MTG OF                12,288      207
           19-20 NOV 1985

IV         ADAPTIVE ALGORITHM PERFORMANCE       13,312      290

TABLE 3. CHARACTER PROBABILITIES OF OCCURRENCE FOR FOUR NARRATIVE FILES

            NARRATIVE   NARRATIVE   NARRATIVE   NARRATIVE
CHARACTER   FILE I      FILE II     FILE III    FILE IV

blank       0.3085      0.3171      0.2851      0.4095
E           0.0855      0.0865      0.0882      0.0664
T           0.0636      0.0677      0.0585      0.0492
N           0.0543      0.053       0.0483      0.0407
O           0.0416      0.0518      0.0505      0.0420
I           0.0513      0.0511      0.0537      0.0426
A           0.0455      0.0450      0.0499      0.0434
R           0.0392      0.0421      0.0471      0.0438
S           0.0372      0.0391      0.0440      0.0367
H           0.0277      0.0263      0.0291      0.0162
C           0.0193      0.0232      0.0245      0.0216
L           0.0233      0.0218      0.0262      0.0226
D           0.0191      0.0206      0.0327      0.0165
U           0.0142      0.0185      0.0147      0.0165
P           0.0163      0.0170      0.0171      0.0164
M           0.0120      0.0162      0.0230      0.0150
F           0.0151      0.0141      0.0190      0.0169
G           0.0117      0.0121      0.0163      0.0109
B           0.0049      0.0101      0.0017      0.0067
V           0.0136      0.0099      0.0078      0.0077
W           0.0147      0.0082      0.0080      0.0030
            0.0073      0.0076      0.0075      0.0087
Y           0.0050      0.0058      0.0074      0.0026
            0.0062      0.0001      0.0036      0.0060
            0.0016      0.0039      0.0024      0.0024
            0.0016      0.0039      0.0024      0.0024
-           0.0047      0.0034      0.0044      0.0053
1           0.0078      0.0024      0.0015      0.0054
K           0.0040      0.0023      0.0014      0.0008
/           0.0003      0.0016      0.0003      0.0019
J           0.0016      0.0014      0.0011      0.0000
X           0.0018      0.0013      0.0017      0.0008
2           0.0059      0.0013      0.0016      0.0031
+           0.0031      0.0011      0.0000      0.0000
-           0.0019      0.0010      0.0002      0.0000
Z           0.0002      0.0009      0.0011      0.0004
Q           0.0014      0.0008      0.0012      0.0004
3           0.0015      0.0007      0.0010      0.0018
0           0.0009      0.0005      0.0014      0.0050
-           0.0000      0.0005      0.0000      0.0000
"           0.0003      0.0003      0.0003      0.0000
4           0.0000      0.0003      0.0002      0.0001
            0.0014      0.0002      0.0000      0.0000
            0.0006      0.0002      0.0008      0.0004
8           0.0001      0.0002      0.0005      0.0015
@           0.0000      0.0002      0.0000      0.0000
5           0.0001      0.0002      0.0005      0.0014
9           0.0000      0.0002      0.0007      0.0034
*           0.0021      0.0002      0.0000      0.0000
            0.0000      0.0002      0.0006      0.0000
6           0.0000      0.0001      0.0002      0.0015
>           0.0000      0.0001      0.0001      0.0000
7           0.0000      0.0001      0.0001      0.0001
            0.0000      0.0001      0.0002      0.0004
<           0.0000      0.0000      0.0002      0.0000
            0.0000      0.0000      0.0001      0.0000
#           0.0018      0.0000      0.0000      0.0000
%           0.0000      0.0000      0.0001      0.0000

NOTE: blank denotes the space character.

The character ordering is based on the probabilities of occurrence of the characters. The probabilities of occurrence of the characters are similar for the four narrative files. This similarity becomes clearer when cumulative probabilities of occurrence are examined for the characters partitioned into subsets of 15 characters. Table 4 presents the sums of the probabilities of occurrence for the characters in nominal 15 character subsets.

In the next subsection, we show how the probabilities in table 3 can be exploited through the design of codes with a block-like structure (generalized Baudot codes). In the following subsection, the average number of bits-per-character is calculated for the Huffman codes constructed using these probabilities of occurrence. In the third subsection, a technique is presented for matching as closely as possible the distribution of available word lengths for comma-free codes to those provided by a Huffman code.

Table 3 suggests that if we could match the code word lengths of the words assigned to the 10 to 15 characters with the highest probabilities of occurrence, we should achieve nearly the same compression for a comma-free code as for a Huffman code. This turned out to be the case and motivated the order chosen to present the material in this section.

Generalized Baudot Codes

The encoding of a 58-character set with a block code requires 6 bits. The standard Baudot code uses a shift character so that a structured code using 5 bits or 10 bits (in effect) can be utilized to code the 58-character set.

The shift symbol is not an information carrying character; i.e., the shift by itself transmits no information. The bits-per-character for the best single shift code is obtained as

    5 + 5 (probability of occurrence of any of the 27 least
           commonly occurring characters)

The average number of bits-per-character turns out to be 5.08, 5.06, 5.06, and 5.06 for narrative files I, II, III, and IV, respectively. Note, therefore, that the Baudot code represents a compression gain over block coding of 6/5 = 1.20. The compression gains for Huffman and comma-free codes should be computed relative to the Baudot code (not a block code). For this reason, the compression gains implied by the results obtained using these codes in this study are less than those generally quoted in the literature on these codes. In the literature, the block code is usually taken as the basis for compression calculations.

Consider a generalized Baudot code using more than one shift symbol. Suppose, in particular, that the shifts were used to produce a code with 15 words of length 4, 15 of length 8, 15 of length 12, and 13 of length 16.


TABLE 4. PROBABILITIES OF CHARACTER OCCURRENCE FOR ANY CHARACTER
         IN 15 CHARACTER SUBSETS OF THE FOUR NARRATIVE FILES

NARRATIVE   CHARACTERS   CHARACTERS   CHARACTERS   CHARACTERS
FILE        1-15         16-30        31-45        46-58

I           .849         .135         .0158        .0002
II          .882         .106         .011         .001
III         .879         .108         .012         .001
IV          .884         .102         .014         .000

One of the first 16 code words is a shift, i.e., leads to a different interpretation of the next code word; one of the 8-bit code words is reserved to lead to still another interpretation of the next code word; and one of the 12-bit code words is reserved to lead to still another interpretation of the next code word. In each case a shift only applies to the next code word. The number of bits-per-character for this particular code is given by:

    4 + 4 (probability of occurrence of characters 16-30)
      + 8 (probability of occurrence of characters 31-45)
      + 12 (probability of occurrence of characters 46-58)

where the characters have been successively numbered beginning with the most commonly occurring character and ending with the least commonly occurring character. The average number of bits-per-character required to transmit information using this code is 4.68, 4.52, 4.54, and 4.52 for narrative files I, II, III, and IV, respectively.
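These averages follow directly from table 4. A minimal check (Python; ours, using the rounded subset probabilities printed in table 4, so the first value comes out 4.67 rather than the report's 4.68):

    # Average bits-per-character of the 3-shift generalized Baudot code
    # (word lengths 4, 8, 12, and 16 for the four character subsets).
    table4 = {          # file: P(chars 1-15), P(16-30), P(31-45), P(46-58)
        "I":   (.849, .135, .0158, .0002),
        "II":  (.882, .106, .011,  .001),
        "III": (.879, .108, .012,  .001),
        "IV":  (.884, .102, .014,  .000),
    }
    for name, (p1, p2, p3, p4) in table4.items():
        avg = 4 * p1 + 8 * p2 + 12 * p3 + 16 * p4
        print(name, round(avg, 2))   # 4.67, 4.52, 4.54, 4.52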

Further generalizations of Baudot codes do not seem promising. For example, an attempt to match the distribution of word lengths for Huffman codes by a code built in terms of multiples of 3 leads to the following code word structure: 6 words of length 3, 12 words of length 6, 30 words of length 9, and 10 words of length 12. The average number of bits-per-character for this code was found to be 4.5 bits-per-character for narrative file I. It provides slightly greater compression with far greater complexity than the code based on multiples of 4.

Huffman Codes

The structure of a Huffman code, in the sense of its distribution of lengths of code words, is determined by the probabilities of occurrence of the 58 characters in the narrative file (provided some convention to treat equi-probable sets in the construction process is adopted). Table 5 summarizes the code words assigned by the particular computer implementation of the Huffman construction process that we used in our study. These code word lengths were obtained using the probabilities of occurrence of the characters presented in table 3 for the four narrative files. The order of the characters is the same in table 5 as that in table 3, and the characters are partitioned into sets of 15 characters to facilitate our discussion of table 5.

Recall that the probability of occurrence of one of the first 15 characters listed in table 5 exceeds .84 for all the narrative files. The word lengths assigned to the first 15 characters based on the probabilities of occurrence of the characters in the different narrative files never differ by more than one bit. The word lengths are nearly the same for the next 15 characters and tend to differ greatly only for the least


TABLE 5. HUFFMAN CODE WORD LENGTHS FOR FOUR NARRATIVE FILES

           WORD LENGTHS                       WORD LENGTHS
        FOR NARRATIVE FILE                 FOR NARRATIVE FILE
CHAR     I   II  III  IV        CHAR     I   II  III  IV

         2    2    2    1        J       9    9   10   22
E        3    3    4    4        X       9    9    9   11
T        4    4    4    4        2      10    8   10    9
N        4    4    4    5        +      10    8   18   15
O        4    5    4    5        =      10    9   13   21
I        4    4    4    5        Z      10   12   10   12
A        4    4    4    5        Q      10   10   10   12
R        5    5    4    5        3      10    9   10   10
S        5    5    5    5        0      11   10    9    8
H        5    5    5    6        -      11   25   19   27
C        5    6    5    6        "      11   12   11   18
L        6    5    5    6        4      12   15   12   13
D        6    6    5    6               12    9   19   27
U        6    6    6    6        .      12   11   10   11
P        6    6    6    6        8      12   14   11   10
M        6    6    6    6        @      12   18   16   24
F        6    6    6    6        5      12   12   11   10
G        6    6    6    7        9      13   22   10    9
B        7    8    6    8        *      12    9   17   26
V        7    6    7    7        ;      12   24   11   19
W        7    6    7    9        6      13   19   13   10
*        7    7    7    7        >      13   17   13   23
Y        7    8    7    9        7      14   20   13   14
         7    7    8    8        '      15   21   12   11
         8    7    9    9        <      16   25   12   20
         8    7    9    9               17   23   13   17
-        8    8    8    8        #      18   13   15   25
1        9    7    9    8        %      18   16   14   16
K        9    8    9   11        /       9   11   11    9

NOTE: " " denotes blank


probable characters. This means that use of any of the four narrative files as a training file should lead to similar compression results for encoding the narrative files.

Some of the differences between word lengths presented in table 5 could have been lessened by adopting a different convention for equi-probability character sets. In particular, a different convention should have been adopted for treating characters with zero probability to ensure that they would all be assigned code words of the same length or nearly the same length. This was discovered after the fact. For example, a better assignment of code words to zero probability of occurrence characters could have been achieved by assigning a very small probability of occurrence, say .0000001, to each of them.

The data compression performance can be summarized by the average number of bits-per-character required to transmit the different narrative files using the four Huffman codes associated with their differing probabilities of occurrence. We refer to the narrative file used to estimate character probabilities of occurrence as the training file.

The average number of bits-per-character required to encode the training file itself can be calculated directly as the sum, over characters, of the probability of occurrence of each character multiplied by the length of the code word assigned to it for the training file (the value obtained by summing the entries in the second columns of tables 1 through 4).

The average number of bits-per-character required to encode the remaining three narrative files using a Huffman code constructed from the training file is obtained by multiplying each character probability of occurrence in the narrative file under consideration by the length of the Huffman code word assigned to that character and summing the results.
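
Either calculation is a single weighted sum. A sketch (ours), assuming per-file probability tables and the training file's length table are both keyed by character:

    def avg_bits_per_char(file_probs, train_lengths):
        # Probability of each character in the file being encoded, times the
        # code word length assigned by the training file's Huffman code.
        return sum(p * train_lengths[ch] for ch, p in file_probs.items())

    # e.g., the table 6 entry for training file IV applied to file I:
    #   avg_bits_per_char(probs_file_I, huffman_lengths(probs_file_IV))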

Table 6 summarizes the results of the Huffman code average bits-per-character calculations. Note that using narrative files II and III as training files gave nearly the same results. The maximum difference between two entries of the tables occurred when narrative file IV was used as a training file for narrative file I; the difference was only .23 bits-per-character.

Most of the reduction in the average number of bits required to transmit a character shown in table 6 occurs because of the high probability of occurrence of a blank in the narrative files. This can be seen by calculating the average number of bits assigned to the non-blanks for the Huffman codes assigned to each of the narratives using it as a training file.

Let p(x) denote the probability of occurrence of a blank for a narrative x and let n(x) denote the length of the code word assigned to blanks for that narrative. Then the average number of bits per non-blank character b^ can be calculated from the average value


TABLE 6. AVERAGE NUMBER OF BITS-PER-CHARACTER FOR HUFFMAN CODES AND DIFFERENT TRAINING FILES

TRAINING   NARRATIVE   NARRATIVE   NARRATIVE   NARRATIVE
FILE       FILE I      FILE II     FILE III    FILE IV

I          4.04        4.05        4.24        3.77

II         4.12        3.95        4.19        3.73

III        4.12        3.96        4.15        3.72

IV         4.27        4.05        4.32        3.63


for all characters, b (values given by the diagonal entries in table 6), for manuscript x by using the formula:

b^ = (b - p(x)n(x))/(1 - p(x))

Using this formula, b^ values of 4.9, 4.9, 4.8, and 6.1 bits-per-character were obtained for narratives I, II, III, and IV, respectively.
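
As a check of the formula (our worked example, not the report's): for narrative file II, table 6 gives b = 3.95, table 5 assigns the blank a 2-bit code word, and the blank's probability of occurrence is 0.3171 (the value that reappears in table 14 later in this report), so

b^ = (3.95 - (0.3171)(2))/(1 - 0.3171) = 4.86,

which rounds to the 4.9 quoted above.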

Comma-free Codes

Surveying Comma-Free Codes

The construction of a Huffman code leads to a code providing the best data compression. The construction process described by R. A. Scholtz leads to a family of codes with word lengths depending on the choices of suffixes and prefixes used in the steps of the construction process. R. A. Scholtz does not discuss how to match the comma-free construction process to the probabilities of occurrence of the characters to be encoded to provide the best compression. We have found a solution to this problem.

In this section, we develop an approach to surveying the distributions of code word lengths that can be obtained by different choices of suffixes and prefixes. In the next subsection, we show how to choose a comma-free code that gives the best compression given a character set and the probabilities of occurrence of the characters in the set.

The method that we found allows us to survey codes in terms of the distributions of their code words and can be conducted without specifying the particular code word chosen at each step of the construction process, or whether the chosen word at each step is used as a suffix or a prefix. All that need be specified is the lengths of the words chosen for suffixes and prefixes.

The R. A. Scholtz comma-free code construction process is sequential. A natural way to survey the codes is to survey them inductively based on the construction steps. We proceed to make this idea precise.

Let C[k] denote the set of code words produced after the first k steps of the construction process. Let C[0] = { 0, 1 } be the starting point in the construction process.

We seek to describe the distribution of code words by length in the set C[k+1] given the distribution of code words by length in the set C[k] and the choice of a word of length s as either a suffix or prefix in the (k+1)-th step of the construction process of R. A. Scholtz.

Let n[k](j) denote the number of code words of length j in set C[k]. For example, suppose C[1] is constructed from C[0]


using either "0" or "1". In this case, n[0](1) = 2 and n[0](j) = 0 for j > 1, and n[1](j) = 1 for j > 0.

In the suffix/prefix construction process a word used as a suffix or prefix can no longer be used as a code word. It is convenient to describe n[k+1](j) in terms of n^[k](j) when a word of length s is chosen for constructing C[k+1] from C[k].

Let

n^[k](j) = n[k](j) - 1   if j = s

         = n[k](j)       if j ≠ s

Then

n[k+1](j) = n^[k](j) + n^[k](j-s) + ... + n^[k](j-ns)

with the convention that n^[k](j-ns) = 0 if j - ns < 1.

In words, to obtain the number of code words of length j in set C[k+1] one simply adds the number of code words in C[k] (excluding the single code word used as a suffix or prefix, namely, n^[k](j)) plus those which could be obtained by adding the suffix/prefix to available words of length j-s (namely, n^[k](j-s)) plus those which could be obtained by adding the suffix/prefix word twice to available words of length j-2s (namely, n^[k](j-2s)), etc.
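
The recurrence is easy to mechanize. The sketch below (ours) applies one construction step at a time, exactly as stated above; tabulating its output and applying the 58-code-word cutoff produces surveys of the kind shown in tables 7 through 10:

    def survey_step(n, s, max_len=12):
        """One suffix/prefix construction step: consume an available word of
        length s and return the new distribution of code word lengths,
        n[k+1](j) = n^[k](j) + n^[k](j-s) + n^[k](j-2s) + ..."""
        n_hat = dict(n)
        n_hat[s] = n_hat.get(s, 0) - 1          # the chosen word is used up
        return {j: sum(n_hat.get(j - m * s, 0) for m in range(j // s + 1))
                for j in range(1, max_len + 1)}

    dist = {1: 2}                 # C[0] = { 0, 1 }
    for s in (1, 1, 2, 3, 3):     # the construction surveyed in table 7
        dist = survey_step(dist, s)
    print(dist)                   # number of code words at each length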

The above procedure is ideal for compiling tabular summaries of distributions of code word lengths for available comma-free codes constructed using the suffix/prefix process. Tables 7 through 10 present word length distributions of some of the comma-free codes which can be constructed in this manner. With the exception of the first column, the numbers of code words in a code are only summarized up through the length of code word needed to allow the coding of 58 characters. There are an infinite number of code words of ever increasing lengths available for each of the codes.

The codes summarized in the tables all begin with C[0] = { 0, 1 }, C[1] constructed using a length 1 code word (by necessity), and C[2] constructed from C[1] using the other available length 1 code word (not a necessity).

The columns contain the number of code words of the length labeling the rows. The number enclosed in parentheses under the column labels is the length of the code word used to construct that code set from the code set with distribution of word lengths given by the previous column. For example, in table 7, the code set C[3] is obtained from the code set C[2] through use of a suffix or prefix word of length 2 (there happens to be only one code word of length two available in this example; it would be either "01" or "10" depending on the particular choices of prefixes or suffixes used in the construction of C[1] and C[2]).


TABLE 7. PARTIAL SURVEY OF THE DISTRIBUTIONS OF CODE WORD LENGTHS FOR CODES CONSTRUCTED USING PREFIXES AND SUFFIXES OF LENGTHS 1,1,2 AND 3 IN THE FIRST THREE STAGES OF CONSTRUCTION

                      COMMA-FREE CODE

WORD     C[0]   C[1]   C[2]   C[3]   C[4]   C[5]
LENGTH          (1)    (1)    (2)    (3)    (3)

  1       2      1
  2              1      1
  3              1      2      2      1
  4              1      3      3      3      3
  5              1      4      6      6      6
  6              1      5      8      9      9
  7              1      6     12     15     18
  8              1      7     15     21     27
  9              1      8     18     27
 10              1      9     24
 11              1     10
 12              1     11

NOTE: TABLE ONLY INCLUDES CODE WORD LENGTHS NECESSARY TO REACH 58 OR MORE CODE WORDS


TABLE 8. PARTIAL SURVEY OF THE DISTRIBUTIONS OF CODE WORD LENGTHS FOR CODES CONSTRUCTED USING PREFIXES AND SUFFIXES OF LENGTHS 1,1, AND 3 IN THE FIRST THREE STAGES OF CONSTRUCTION

                  COMMA-FREE CODE

WORD     C[0]   C[1]   C[2]   C[3]   C[4]
LENGTH          (1)    (1)    (3)    (3)

  1       2      1
  2              1      1      1      1
  3              1      2      1      0
  4              1      3      3      3
  5              1      4      5      6
  6              1      5      6      6
  7              1      6      9     12
  8              1      7     12     17
  9              1      8     14     20
 10              1      9     18
 11              1     10
 12              1     11

NOTE: TABLE ONLY INCLUDES CODE WORD LENGTHS NECESSARY TO REACH 58 OR MORE CODE WORDS


TABLE 9. PARTIAL SURVEY OF THE DISTRIBUTIONS OF CODE WORD LENGTHS FOR CODES CONSTRUCTED USING PREFIXES AND SUFFIXES OF LENGTHS 1,1,2 AND 4 IN THE FIRST THREE STAGES OF CONSTRUCTION

                      COMMA-FREE CODE

WORD     C[0]   C[1]   C[2]   C[3]   C[4]   C[5]
LENGTH          (1)    (1)    (2)    (4)    (4)

  1       2      1
  2              1      1
  3              1      2      2      2      2
  4              1      3      3      2      1
  5              1      4      6      6      6
  6              1      5      8      8      8
  7              1      6     12     14     14
  8              1      7     15     17     18
  9              1      8     18     24     30
 10              1      9     24
 11              1     10
 12              1     11

NOTE: TABLE ONLY INCLUDES CODE WORD LENGTHS NECESSARY TO REACH 58 OR MORE CODE WORDS


TABLE 10. PARTIAL SURVEY OF THE DISTRIBUTIONS OF CODE WORD LENGTHS FOR CODES CONSTRUCTED USING PREFIXES AND SUFFIXES OF LENGTHS 1,1, AND 4 IN THE FIRST THREE STAGES OF CONSTRUCTION

              COMMA-FREE CODE

WORD     C[0]   C[1]   C[2]   C[3]
LENGTH          (1)    (1)    (4)

  1       2      1
  2              1      1
  3              1      2      2
  4              1      3      2
  5              1      4      4
  6              1      5      6
  7              1      6      8
  8              1      7      9
  9              1      8     12
 10              1      9     15
 11              1     10     22
 12              1     11

NOTE: TABLE ONLY INCLUDES CODE WORD LENGTHS NECESSARY TO REACH 58 OR MORE CODE WORDS


Figures 11 and 12 illustrate the tailoring of the distribution of word lengths possible by using some of the two- or three-step constructions and some of the two- or four-step constructions, respectively. The figures are constructed to compare the code words available through selected comma-free constructions to encode a 58-character set. Each curve is labeled by the suffix/prefix word lengths leading to the plotted code set.

Choice of a Comma-free Code for a 58-character Set

It is possible to calculate the average number of bits-per-character for a given character set and the probabilities of occurrence of the characters in the set for each candidate comma-free code. The calculation would consist of first ordering the characters by probability of occurrence (say from highest to lowest) and assigning code words to the characters by ordering the available code words from shortest to longest. A sum over the character set of the probability of occurrence of a character multiplied by the character code word length then is the average number of bits-per-character for the code.
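
A sketch of this calculation (ours), assuming the candidate comma-free code is summarized by the multiset of its available word lengths:

    def comma_free_avg_bits(char_probs, word_lengths):
        """Average bits-per-character: the most probable characters receive
        the shortest available code words."""
        probs = sorted(char_probs.values(), reverse=True)
        lengths = sorted(word_lengths)[:len(probs)]
        return sum(p * n for p, n in zip(probs, lengths))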

It is possible to reduce the comma-free code candidates to a few obvious front-runners by using Huffman code words as a gauge for the candidates.

The probabilities of occurrence for the character set used in our 58-character simulations of the Huffman code have the property that the character probabilities fall off rapidly from the most used characters to the least used characters, as shown in table 3, which was presented earlier. In such a situation, if we could closely match the code word lengths provided by the Huffman code constructed for the given character probabilities of occurrence for the first 10 to 15 characters, we would expect very similar compression performance from that comma-free code and a Huffman code.

Table 11 shows how closely the simplest suffix/prefix candidate code word lengths match those provided by the Huffman code. The first four columns of word lengths repeat the information presented earlier in table 5 for Huffman codes, and a fifth column presents the word lengths for any of the suffix-prefix comma-free codes obtained by use of "0" and "1" as suffixes or prefixes in a two-step construction. The assignment of code words to characters is presented for the ordering provided by the probabilities of occurrence of characters in narrative file II. However, as can be seen from table 11, this assignment leads to excellent word length agreement through the first 40 characters, regardless of which narrative file is used as the training file.

Consider the first fifteen rows of table 11 for word lengths of the Huffman code for narrative file II and for the comma-free code. For the third character ("T"), the comma-free code is one

[Figures 11 and 12: code words available through selected comma-free constructions to encode a 58-character set; each curve is labeled by the suffix/prefix word lengths leading to the plotted code set.]

TABLE 11. HUFFMAN AND COMMA-FREE CODE WORD LENGTHS FOR FOUR NARRATIVE FILES

           WORD LENGTHS                             WORD LENGTHS
        FOR NARRATIVE FILE   COMMA-             FOR NARRATIVE FILE   COMMA-
CHAR     I   II  III  IV     FREE     CHAR      I   II  III  IV      FREE

         2    2    2    1     2        J        9    9   10   22      9
E        3    3    4    4     3        X        9    9    9   11      9
T        4    4    4    4     3        2       10    8   10    9      9
N        4    4    4    5     4        +       10    8   18   15      9
O        4    5    4    5     4        =       10    9   13   21      9
I        4    4    4    5     4        Z       10   12   10   12      9
A        4    4    4    5     5        Q       10   10   10   12     10
R        5    5    4    5     5        3       10    9   10   10     10
S        5    5    5    5     5        0       11   10    9    8     10
H        5    5    5    6     5        -       11   25   19   27     10
C        5    6    5    6     6        "       11   12   11   18     10
L        6    5    5    6     6        4       12   15   12   13     10
D        6    6    5    6     6                12    9   19   27     10
U        6    6    6    6     6        .       12   11   10   11     10
P        6    6    6    6     6        8       12   14   11   10     10
M        6    6    6    6     7        @       12   18   16   24     11
F        6    6    6    6     7        5       11   12   11   10     11
G        6    6    6    7     7        9       13   22   10    9     11
B        7    8    6    8     7        *       12    9   17   26     11
V        7    6    7    7     7        ;       12   24   11   19     11
W        7    6    7    9     7        6       13   19   13   10     11
*        7    7    7    7     8        >       13   17   13   23     11
Y        7    8    7    9     8        7       14   20   13   14     11
         7    7    8    8     8        '       15   21   12   11     11
         8    7    9    9     8        <       16   25   12   20     11
         8    7    9    9     8                17   23   13   17     12
-        8    8    8    8     8        #       18   13   15   25     12
1        9    7    9    8     8        %       18   16   14   16     12
K        9    8    9   11     9        /        9   11   11    9      9

NOTE: THE COMMA-FREE CODE HAS BEEN CHOSEN TO BEST MATCH THE WORD LENGTHS OF THE HUFFMAN CODE FOR NARRATIVE FILE II


bit shorter than the narrative file II word length, which leads to a decrease in the expected number of bits of .0677. Only two characters ("A" and "C") of the first 15 are one bit longer, which leads to increases in the expected number of bits of .0450 and .0232. The net difference (or penalty for using the comma-free code) between the average number of bits-per-character for the first 15 characters is only .0005 bits-per-character. For the remaining rows the penalty for using the comma-free code is at most .0162 bits for any character and only above .01 for three characters. The overall penalty for using the comma-free code rather than a Huffman code for compression (based on narrative file II) is only .045 (rounded down to three places) bits-per-character.

Table 12 presents a comparison between Huffman code bits-per-character values and comma-free code bits-per-character values for the code word to character assignments shown in table 11. Slightly lower average bits-per-character values are possible for comma-free codes encoding narrative files I, III, and IV than those shown in table 12. However, the performance differences between the Huffman codes and the comma-free codes are so small that we did not investigate other matchings of comma-free codes to Huffman code word lengths for these cases. The Huffman code average bits-per-character values are those obtained using the assignment of code words to a narrative file based on the probabilities of occurrence of its characters.

The next best comma-free code appears to be the code (1,1,3), for which similar calculations revealed a penalty of .114 (rounded down to three places) bits-per-character for using this comma-free code instead of the Huffman code for narrative file II.

It appears from this example that similar procedures would allow us to find a comma-free code giving nearly the same compression behavior as a Huffman code, provided that the probabilities of occurrence of the characters in the character set fall off in a reasonable manner from the highest probability of occurrence to the lowest.

COMPRESSION CODE PERFORMANCE IN A CHANNEL WITH ERRORS

In this subsection we estimate the performance of generalized Baudot codes, Huffman codes, and comma-free codes in a channel with errors. The generalized Baudot and comma-free codes could be studied analytically. Huffman codes were studied through use of simulations. Each code is evaluated by estimating the average number of characters decoded in error per bit error and the average number of characters output by the decoder in error per bit error.


TABLE 12. COMPARISON OF BITS-PER-CHARACTER VALUES OF HUFFMAN AND COMMA-FREE CODES

                   BITS-PER-CHARACTER

NARRATIVE
FILE        HUFFMAN CODE    COMMA-FREE CODE

I           4.04            4.10

II          3.95            4.00

III         4.15            4.26

IV          3.63            3.81


Generalized Baudot Codes

The performance of the generalized Baudot codes is simple to evaluate because a single bit error always leads to one character being decoded in error, whether it occurs in a code word of an information carrying character or a shift character. If the bit error occurs in the code word of an information character that character is decoded in error, and if it occurs in a shift character then that character and intervening shift characters through the first information character are output as error characters.

For the single-shift 5 bit based Baudot code, the statistics are:

1 character decoded in error per bit error

1.01 to 1.02 output characters in error per bit error, depending on the training file

For the three-shift 4 bit based Baudot code, the statistics are:

1 character decoded in error per bit error

1.13 to 1.17 output characters in error per bit error, depending on the training file

Huffman Codes

Introduction

Simulation results were obtained for the 95-character symbol set associated with the IBM PC. It was intended to develop software and plan further analysis based on these results and then to apply lessons learned to the processing of a Navy message data base. The Navy message data base was not available in a timely enough manner to allow the analysis to continue without interruption, so it was decided to emulate the 58-character set (Baudot) used in Navy communications by reducing the 95-character IBM PC set to 58 characters. This was accomplished by use of the all capital letter option of the operating system and by editing the documents being processed to be free of selected special symbols. This section contains three subsections: the first describes the simulation software; the second, the results obtained for a 95-character set; and the third, the results obtained for a 58-character set.

Description of Simulation Software

The original program reads a text file and counts the number of occurrences of each character; from this, a Huffman code is constructed using the construction process first described by Huffman in his original paper. This construction process has numerous degrees of freedom. In order to study the relationship


between the performance of a Huffman code in a channel with errors and the specific choices made in the Huffman construction process, a program was written that allowed the user to specify the probability that the character set with the highest probability of occurrence would be assigned a "1" at each stage of the construction process. (Even though it turned out that the probability of a "1" occurring was not a meaningful parameter, varying it allowed searches to be conducted for a particular Huffman code with better performance in a channel with errors than most codes which could be constructed.) The probabilities of occurrence for each character and the assigned Huffman code words for each character are written to files so that the particular code used to obtain a particular set of performance results could always be recovered if desired.

Four basic programs were written to exercise Huffman codes: (1) a program which would encode a message file using the Huffman code, (2) a program to introduce random bit errors into the encoded file (the probability of a bit error is user specified), (3) a program to decode the encoded bit stream, and (4) a program which compares the decoded bit stream with the original message and accumulates various error statistics.
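
A compressed sketch of the first three of these programs (ours; the report's software itself is not reproduced here). The decoder is the greedy prefix-code decoder that the error statistics presuppose:

    import random

    def encode(text, code):                      # program (1)
        return ''.join(code[ch] for ch in text)

    def corrupt(bits, p_err, rng=random):        # program (2)
        # Flip each bit independently with user-specified probability p_err.
        return ''.join(('1' if b == '0' else '0') if rng.random() < p_err
                       else b for b in bits)

    def decode(bits, code):                      # program (3)
        # Greedy prefix-code decoding; an undecodable tail is dropped.
        rev = {w: ch for ch, w in code.items()}
        out, word = [], ''
        for b in bits:
            word += b
            if word in rev:
                out.append(rev[word])
                word = ''
        return ''.join(out)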

An additional software program was developed to evaluate the feasibility of a user correcting character errors through message context. The program was interactive and allowed the user to select any character (of 20 displayed characters) of the message for possible reinitiation of the Huffman decoding process. The interactive program would retrieve the code word of the selected character and reinitiate the Huffman coding process by altering each bit in its code word. The new characters would be displayed and, after the trial decodings were completed, the user could choose the most acceptable string of characters and the software would implement the bit change in the original bit stream corresponding to the selected option. The software compares the selected character strings with the original message and records the number of errors (total) and number of corrected errors.

Simulation Results for a 95-character Set

Figure 13 summarizes the results of an extensive search for the Huffman code providing the best performance in a channel with errors. The code word lengths of all of the codes constructed are the same because the grouping of the characters, as discussed earlier, is the same for all of the codes. They differ only in the choices of "1" or "0" at each stage of the Huffman construction process. The experiments were run by varying the probability that the set of characters with highest probability of occurrence at each stage was assigned a "1". The vertical scale is the average number of characters decoded in error (input characters to the decoder) per bit error. Each Huffman code was exercised against each of the four narrative files described in the previous paragraph and the maximum and minimum average number


of characters decoded in error per bit error for these four narrative files plotted.

Figure 13 shows that the average number of decoded characters per bit error is definitely dependent on the Huffman code. The dependence on narrative file is small compared with the dependence upon the code used. Figure 14 shows this in another way, showing the dependence of the average number of decoded characters per bit error for the best performing code, the worst performing codes, and two selected average performing codes.

Figure 15 presents a distribution of all of the average number of decoded characters per bit error values obtained for the different experiments on the narrative files presented in figure 13. It is particularly noteworthy that the four results obtained for the code with the lowest value fall into the two lowest value bins of figure 15. Figures 13 and 15 clearly show that the code with the best error ratios was the best performing code in a channel with errors.

Some experiments were run using an operator-interactive program. These experiments were designed to determine the percentage of errors introduced into a text file through Huffman encoding, transmission in a noisy channel, and Huffman decoding that could be corrected through narrative context. In particular, it appears possible to use a standard spell check program as a basis for reinitializing Huffman decoding after changing a bit likely to be in error. The potential of such an algorithm could be assessed by using an operator-interactive program--with the operator choosing the decoding which provided text which made the most sense.

A 4271 character narrative file consisting of 88 lines and 4456 bytes was chosen to assess operator-interactive correcting of narrative character errors. Bit errors were introduced randomly at a rate of .005. This error rate would lead to an estimated 90 characters containing a bit error ((.005)(4271 characters)(4 bits/character)). These 90 character errors led to 362 character decoding errors. After the interactive session the operator was able to reduce the number of character decoding errors to 85 errors (that is, the number of character errors was reduced by 76 percent).

Simulation Results for a 58-character Set

A series of simulations was run to find the best performing Huffman code in a channel with errors. Trials were run using the probabilities of occurrence of the characters in each of the four narrative files. Each Huffman code was then exercised for the four narrative files, including the one from which the probabilities of occurrence of the characters were derived.

Simulations were run by randomly introducing bit errors at a rate of 3 per 1000 bits. Successive bit errors were independent


of one another. No attempt was made to model the impact of burst errors on the channel. It was felt that should burst errors pose a problem in the implementation of a particular code, it would always be possible to superimpose interleaving after data compression encoding and deinterleaving prior to data compression decoding.

Figures 16 through 19 present the average numbers of characters decoded in error per bit error for experiments run using narrative files I, II, III, and IV, respectively, as a training file. Each figure presents the results keyed to the narrative file for which the decoded character errors per bit error value was obtained.

The first thing to observe is that in all four figures, the poorest error performance results were obtained when processing narrative file IV (the most compressible). In contrast, the best error performance was obtained for narrative file III, which was the least compressible.

The second thing to notice is that the results obtained using narrative file IV as a training file gave the best performance in general in an error channel.

In order to more completely characterize the performance of Huffman codes in channels with errors, a worst and best case for each narrative file as a training file, from the viewpoint of average number of characters decoded in error per bit error, were selected for further analysis. Figures 20 through 27 present the distributions of lengths of successive output characters in error obtained for the trials given for the selected simulations. Note that any of the output character error sequences may involve more than one bit error. This is likely for very long sequences of output character errors and less likely for shorter sequences because the bit errors were randomly introduced at a rate of 3 in 1000. However, the likelihood of two bit error induced error sequences merging is very small and can be neglected; therefore, output character performance is summarized in terms of the average number of output character errors per bit error.

Some of the distributions had very long sequences of errors (a maximum of 104 for narrative I as a training file), so that the distributions are presented for character error sequences of lengths 1, 2, ..., 9 and those of length 10 or greater. The maximum length of an error sequence is included in each figure in the upper right hand corner.

There is a dramatic difference in the structure of the distributions for each of the narrative files used as a training file for the Huffman codes found to give the best and worst performance in a channel with errors. The best distributions have a preponderance of short length sequences (lengths one, two, and three) while the worst distributions tend to be relatively flat, with the occurrence of extremely long character error


sequences (35, 104, 65, and 91 for narratives I, II, III, and IV, respectively).

The best performing Huffman code found during the simulations had the distribution presented in figure 26 and occurred when narrative IV was used as a training file. This simulation only resulted in three output character error sequences longer than eight characters, one each of 9, 12, and 14 characters. The average length of an output sequence of character errors for this Huffman code was 2.4 output character errors.

It is worthwhile to compare the distribution shown for the best case in figure 26 with a distribution associated with the worst case shown in figure 21, which occurred for a code where narrative I was used as a training file. This distribution is nearly flat, extending beyond length 10 sequences. Indeed, there were 22 instances of output character error sequences of length 10 or more in the error simulation for this particular Huffman code.

Table 13 presents the Huffman code found through simulation using narrative file IV as a training file and providing the lowest average number of characters decoded in error per bit error (the code resulting in the distribution shown in figure 26).

Finally, we observe that for each narrative file as a training file and for each narrative file encoded and decoded in an error channel, there was a significant dependency of the average number of characters decoded in error per bit error on the particular Huffman code. There was usually at least a two-to-one difference in performance depending on the particular choices of "1"s and "0"s in the Huffman construction process. Recall that the number of bits-per-character depended only weakly on the narrative file used as a training file and not at all on the details of the Huffman code chosen. The results presented in this section show that both decode and output character error statistics are far more dependent on the construction process and narrative file statistics than are the compression results.

Comma-free Codes

In this section we first estimate the impact of errors on the performance of the different comma-free codes which can be constructed in two steps by choosing code words of length one. We then briefly discuss the impact of bit errors on the decoding errors for more general comma-free codes.

Comma-free Codes Constructed in Two Steps Using Length One Words

The error analysis discussed in this subsection is accomplished by obtaining the results for the two particular codes already discussed in the section describing comma-free codes. The codes were called the suffix-prefix code and the


TABLE 13. HUFFMAN CODE WORDS FOR NARRATIVE FILE IV WHICH PROVIDED THE BEST PERFORMANCE IN A CHANNEL WITH ERRORS

CHAR CODE WORD CHAR CODE WORD

01 1000101000101000 10000000011 A 11101111 o o10101 B 0001101

# 100010100000000100 C 11110% 100010700000000101 D 000011

100010,00000001 E 1100010000' F 10001100100000 G 111110

* 111111'01C01 H 10000+ 0010101000 1 1010

1111110 J 11111111010001011 K 001010101001000i L 000010

/ 111111 1 M 001001

00101011101 N 10011 001010110 0 10112 0010101111 P 0010113 1111111011 Q 10001010014 001010111000 R 000005 100010100011 S 000106 0010101110010 T 00117 10001010000001 U 000111

8 100010100001 V 00011003 001010,110011 W 0010100

100010101100 X 1111111001111111101000 Y 1000100

< 1000101000000000 z 1000101010

001010,001 1000101011011000101000001 ~1000101011


suffix-suffix code. After the error analysis has been carried out for these two codes, it is easy to argue that: (1) the analysis for the suffix-prefix code applies to all four codes using either "0" or "1" first and either a suffix or a prefix first, and (2) the analysis for the suffix-suffix code applies to the four codes using "0" or "1" first and either both suffixes or both prefixes.

The first step in the analysis of each comma-free code is to calculate four probabilities:

P(X1) = the probability that a bit error leads to the deletion of a comma between two code words, i.e., to two code words being merged

P(X2) = the probability that a bit error leads to the movement of a comma relative to its true position if an error had not occurred

P(X3) = the probability that a bit error leads to the addition of a comma relative to those if an error had not occurred

P(X4) = the probability that a bit error leads to no change in the placement of the commas

The second step of the error analysis allows us to determine the impact of the bit errors upon character decoding. In particular, note the following:

(1) if a bit error leads to comma deletion then two characters are incorrectly decoded as a single character or not decodable

(2) if a bit error leads to comma movement then two characters are incorrectly decoded into two characters

(3) if a bit error leads to the insertion of a comma (always within the code word with the bit error) then one character will be incorrectly decoded into two characters

(4) if there is no change in the commas then one character will be incorrectly decoded into a single character

It follows that the average number of input characters in error or the average number of output characters in error can be estimated directly from the probabilities P(Xi), i = 1, 2, 3, 4.

Each bit error leads to an error in some character given by its character probability. For each character the impact of a bit error, assumed equally likely in each bit of the code word, can be calculated depending on the structure of the code word. To make this precise we introduce the following definitions:


p(c) = the probability of occurrence of character c

n(c) = the number of bits in the code word for c

n(c1) = the number of bits in the code word for c leading to the deletion of a comma

n(c2) = the number of bits in the code word for c leading to comma movement

n(c3) = the number of bits in the code word for c leading to the addition of a comma

n(c4) = the number of bits in the code word for c leading to no change in the commas

Then the probabilities P(Xi), i = 1, 2, 3, and 4, are calculated for the suffix-prefix code using the normal conditional probability procedure, namely

P(Xi) = sum over all characters of p(c)n(ci)/n(c) for i = 1, 2, 3, and 4

The required calculations for a particular assignment of the suffix-prefix code words to a 58-character set are easy but tedious.
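
They can, however, be automated. The sketch below (ours) classifies every single-bit flip of a code word by re-parsing a padded window, using the fact that for this code a comma falls at each "1"-to-"0" transition in the bit stream; the four counts correspond to the n(ci) of table 14:

    def commas(bits):
        # A comma falls between positions i and i+1 at every "1" -> "0"
        # transition; for words of the form 0...01...1 these are exactly
        # the word boundaries.
        return {i + 1 for i in range(len(bits) - 1)
                if bits[i] == '1' and bits[i + 1] == '0'}

    def flip_fractions(word, pad="01"):
        """n(ci)/n(c) for one code word: flip each bit, re-parse a window
        padded with a neighboring code word on each side, and compare the
        comma positions with the error-free parse."""
        stream = pad + word + pad
        base = commas(stream)
        counts = {"delete": 0, "move": 0, "add": 0, "none": 0}
        for i in range(len(pad), len(pad) + len(word)):
            flipped = (stream[:i] + ('1' if stream[i] == '0' else '0')
                       + stream[i + 1:])
            after = commas(flipped)
            if len(after) < len(base):
                counts["delete"] += 1
            elif len(after) > len(base):
                counts["add"] += 1
            elif after != base:
                counts["move"] += 1
            else:
                counts["none"] += 1
        return {k: v / len(word) for k, v in counts.items()}

    # flip_fractions("0111") gives 0.25 in each category, matching table 14;
    # P(Xi) is then the sum of p(c) * flip_fractions(word(c)) over characters.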

Table 14 summarizes the calculations by character of the impact of bit errors for the probabilities of occurrence of characters in narrative file II. Similar results are expected for the remaining three narrative files. The first column contains the character probabilities of occurrence listed in descending order and the second column contains the comma-free code word associated with the character whose probability of occurrence is presented in the first column. Columns 3, 4, 5, and 6 present n(ci)/n(c), i = 1, 2, 3, and 4, respectively, for the character whose probability of occurrence is presented.

One observation should be made: the values of n(ci)/n(c) depend on the fine structure of the code words, and differ for words of the same length. The table has been constructed by assigning the code words with the most equal number of "0"s and "1"s to the highest probability character and the most unbalanced code words to the lowest probability characters. We illustrate this by discussing the summarized calculations for the words of length 8.

There are seven code words of length 8 available. The two most unbalanced code words, namely, 01111111 and 00000001, have different n(ci)/n(c) values from the other five. This occurs because for these two codes an error in the first and last bit, respectively, turns the word into all "1"s and all "0"s, respectively, which lead to the deletion of the comma between the first and second words and the second and third words, respectively. Also, these two words only have one transition bit


TABLE 14. PROBABILITIES OF ERRONEOUS COMMA INSERTIONS OR DELETIONS DUE TO BIT ERRORS FOR THE SUFFIX-PREFIX COMMA-FREE CODE

PROBABILITY
OF                           DELETE   COMMA    ADD      NO
OCCURRENCE   CODEWORD        COMMA    MOVES    COMMA    CHANGE

0.3171       01              1.00     0        0        0
0.0865       011             0.33     0.33     0        0.33
0.0677       001             0.33     0.33     0        0.33
0.0537       0011            0        0.50     0        0.50
0.0518       0111            0.25     0.25     0.25     0.25
0.0510       0001            0.25     0.25     0.25     0.25
0.0450       00111           0        0.40     0.20     0.40
0.0420       00011           0        0.40     0.20     0.40
0.0391       00001           0.20     0.20     0.40     0.20
0.0263       01111           0.20     0.20     0.40     0.20
0.0232       000111          0        0.33     0.33     0.33
0.0218       000011          0        0.33     0.33     0.33
0.0206       001111          0        0.33     0.33     0.33
0.0185       011111          0.17     0.17     0.50     0.17
0.0170       000001          0.17     0.17     0.50     0.17
0.0162       0000111         0        0.29     0.43     0.28
0.0141       0001111         0        0.29     0.43     0.28
0.0121       0011111         0        0.29     0.43     0.28
0.0101       0000011         0        0.29     0.43     0.28
0.0099       0111111         0.14     0.14     0.57     0.14
0.0082       0000001         0.14     0.14     0.57     0.14
0.0076       00001111        0        0.25     0.50     0.25
0.0058       00000111        0        0.25     0.50     0.25
0.0050       00011111        0        0.25     0.50     0.25
0.0039       00111111        0        0.25     0.50     0.25
0.0039       00000011        0        0.25     0.50     0.25
0.0034       01111111        0.13     0.13     0.63     0.13
0.0024       00000001        0.13     0.13     0.63     0.13
0.0023       000001111       0        0.22     0.56     0.22
0.0016       000011111       0        0.22     0.56     0.22
0.0014       000000111       0        0.22     0.56     0.22
0.0013       000111111       0        0.22     0.56     0.22
0.0013       000000011       0        0.22     0.56     0.22
0.0011       001111111       0        0.22     0.56     0.22
0.0010       011111111       0.11     0.11     0.67     0.11
0.0009       000000001       0.11     0.11     0.67     0.11
0.0008       0000011111      0        0.20     0.60     0.20
0.0007       0000111111      0        0.20     0.60     0.20
0.0005       0000001111      0        0.20     0.60     0.20
0.0005       0001111111      0        0.20     0.60     0.20
0.0003       0000000111      0        0.20     0.60     0.20
0.0003       0011111111      0        0.20     0.60     0.20
0.0002       0000000011      0        0.20     0.60     0.20
0.0002       0111111111      0.10     0.10     0.70     0.10
0.0002       0000000001      0.10     0.10     0.70     0.10
0.0002       00000011111     0        0.18     0.64     0.18
0.0002       00000111111     0        0.18     0.64     0.18
0.0002       00000001111     0        0.18     0.64     0.18
0.0002       00001111111     0        0.18     0.64     0.18
0.0002       00000000111     0        0.18     0.64     0.18
0.0001       00011111111     0        0.18     0.64     0.18
0.0001       00000000011     0        0.18     0.64     0.18
0.0001       00111111111     0        0.18     0.64     0.18
0.0001       01111111111     0.09     0.09     0.73     0.09
0.0000       00000000001     0.09     0.09     0.73     0.09
0.0000       000000111111    0        0.17     0.66     0.17
0.0000       000001111111    0        0.17     0.66     0.17
0.0000       000000011111    0        0.17     0.66     0.17


position which is not at the edge of the word; this transition bit can be in error without changing any commas, whereas in the other words either of the transition bits can be in error without changing any commas. Finally, these two code words have five interior positions within a string of "1"s or "0"s (the other words only have four); errors in these positions lead to an additional comma within the word with the bit error.

For the assignment of suffix-prefix codes to characters described above and summarized by table 14, the following statistics were obtained:

P(X1) = .42

P(X2) = .21

P(X3) = .16

P(X4) = .21

Note that about three-quarters of the contribution to P(X1) is that provided by the code word "01" assigned to the blank character with probability of occurrence 0.317124.

The average number of coded characters decoded in error per bit error is given by

(.42)(2) + (.21)(2) + (.16)(1) + (.21)(1) = 1.63

The average number of output characters which are incorrect per bit error is given by

(.42)(1) + (.21)(2) + (.16)(2) + (.21)(1) = 1.37.

These values are obtained by treating words too long to be decoded, because they exceed the longest word assigned one of the 58 characters, as being incorrectly decoded. (Such characters could be decoded into a 59-th character indicating an error has occurred.) To distinguish these cases from the cases when the erroneous words arising through misplacement of commas can be decoded into one of the 58 characters to which code words have been assigned would require more delicate arguments depending on the lengths and structure of the code words for the characters preceding and following the one in error.

The calculations presented clearly indicate that the performance of the suffix-prefix code in an error channel is considerably better than the performance of any Huffman code that we found, so that these more delicate calculations are not necessary. And, this improved performance was obtained by paying an insignificant penalty in compression (or equivalently, thruput).

A two step comma-free code construction using a one-bit prefix and a one-bit suffix, no matter what choices are made, yields a code that can be obtained from the one analyzed by interchanging the roles of "0" and "1" and/or taking mirror images of the code words. If this is done, the assignments of code words to characters corresponding to the assignment made before yield the same probabilities; hence all four such codes using a one-bit prefix and a one-bit suffix have the same error statistics as the suffix-prefix code analyzed above.

We turn now to estimating the impact of errors on the suffix-suffix code discussed earlier. Recall that the code words have the structure 0 1 ... 1 0 ... 0 with at least one "1" and always beginning with "0". This code differs from the suffix-prefix code in that an error in the first position of a code word can affect the decoding of the previous code word.

We treat the first and second bits separately in the next two paragraphs; the remaining bits are covered by the analysis summarized in table 15. As before, the assignment of code words to characters is used as the basis for the analysis.

We only carry out the calculations for the probabilities of occurrence of the characters in narrative file II. Similar results would be obtained for the probabilities of occurrence of the characters in the other three narrative files.

Table 15 summarizes the impact of errors from the third bit through the end of a word. These errors either lead to the addition of a comma or to no change in the commas, depending on whether or not the bit error changes a "1" to a "0" at the end of a string of "1"s or changes a "0" to a "1" at the beginning of a string of "0"s. We now turn to evaluating the impact of errors in the first two bits.

An error in the first bit leads to the movement of a comma or the deletion of a comma, depending on whether the previous code word ends in "0" (probability 0.51) or ends in "1" (probability 0.49). Therefore,

(1) the probability that a bit error in the first bit leads to the movement of a comma is given by:

(probability that a code word ends in "0")(sum over characters of p(c)/n(c)) = (0.51)(.31) = .16

(2) the probability that a bit error in the first bit leads to the deletion of a comma is given by:

(probability that a code word ends in "1")(sum over characters of p(c)/n(c)) = (0.49)(.31) = .15

An error in the second bit leads to the deletion of a comma or the addition of a comma, depending on whether or not the

TABLE 15. PROBABILITIES OF ERRONEOUS COMMA INSERTIONS DUE TO BIT ERRORS FOR THE SUFFIX-SUFFIX COMMA-FREE CODE

PROBABILITY
OF                           ADD      NO
OCCURRENCE   CODEWORD        COMMA    CHANGE

0.3171       01              0        0
0.0865       010             0        0.33
0.0677       011             0        0.33
0.0537       0100            0.25     0.25
0.0518       0110            0        0.50
0.0510       0111            0.25     0.25
0.0450       01000           0.40     0.20
0.0420       01100           0.20     0.40
0.0391       01110           0.20     0.40
0.0263       01111           0.40     0.20
0.0232       010000          0.50     0.17
0.0218       011000          0.33     0.33
0.0206       011100          0.33     0.33
0.0185       011110          0.33     0.33
0.0170       011111          0.50     0.17
0.0162       0100000         0.57     0.14
0.0141       0110000         0.43     0.28
0.0121       0111000         0.43     0.28
0.0101       0111100         0.43     0.28
0.0099       0111110         0.43     0.28
0.0082       0111111         0.57     0.14
0.0076       01000000        0.62     0.13
0.0058       01100000        0.50     0.25
0.0050       01110000        0.50     0.25
0.0039       01111000        0.50     0.25
0.0039       01111100        0.50     0.25
0.0034       01111110        0.50     0.25
0.0024       01111111        0.63     0.13
0.0023       010000000       0.67     0.11
0.0016       011000000       0.55     0.22
0.0014       011100000       0.55     0.22
0.0013       011110000       0.55     0.22
0.0013       011111000       0.55     0.22
0.0011       011111100       0.55     0.22
0.0010       011111110       0.55     0.22
0.0009       011111111       0.67     0.11
0.0008       0100000000      0.70     0.10
0.0007       0110000000      0.60     0.20
0.0005       0111000000      0.60     0.20
0.0005       0111100000      0.60     0.20
0.0003       0111110000      0.60     0.20
0.0003       0111111000      0.60     0.20
0.0002       0111111100      0.60     0.20
0.0002       0111111110      0.60     0.20
0.0002       0111111111      0.70     0.10
0.0002       01000000000     0.73     0.09
0.0002       01100000000     0.64     0.18
0.0002       01110000000     0.64     0.18
0.0002       01111000000     0.64     0.18
0.0002       01111100000     0.64     0.18
0.0001       01111110000     0.64     0.18
0.0001       01111111000     0.64     0.18
0.0001       01111111100     0.64     0.18
0.0001       01111111110     0.64     0.18
0.0000       01111111111     0.73     0.09
0.0000       010000000000    0.75     0.08
0.0000       011000000000    0.67     0.17
0.0000       011100000000    0.67     0.17


second bit in error was the only "1" in the code word. Therefore,

(1) the probability that a bit error in the second bit leads to the deletion of a comma is given by:

p(2)/2 + p(3)/3 + ... + p(12)/12

= (0.3171)(1/2) + (0.0865)(1/3) + (0.0537)(1/4) + (0.0450)(1/5) + (0.0232)(1/6) + (0.0162)(1/7) + (0.0076)(1/8) + (0.0023)(1/9) + (0.0008)(1/10) + (0.0002)(1/11) + (0.0000)(1/12)

= 0.197 [rounded to three places]

where p(2), p(3), ..., p(12) are the probabilities of occurrence of the characters which have been assigned a word of length two with a single "1", a word of length three with a single "1", ..., and a word of length twelve with a single "1", respectively (see table 15).

(2) the probability that a bit error in the second bit leads to the addition of a comma is given by:

(sum over characters of p(c)/n(c)) -

[p(2)/2 + p(3)/3 + ... + p(12)/12] =

.312 - .197 = .115

where p(2), p(3), ... , p(12) are as above.

The probability that a bit error leads to the addition of a comma through changing other than the first or second bit is .18, from summing the data presented in column 3 of table 15. The probability that a bit error leads to no change in the commas through changing other than the first or second bit is .20, from summing the data presented in column 4 of table 15.

For the assignment of suffix-suffix codes to characters described above and summarized by table 15, the following summary statistics follow from combining the estimates which have been obtained:

P(X1) = .35

P(X2) = .16


P(X3) = .29

P(X4) = .20

Note that about three-quarters of the contribution to P(X1) is that provided by the code word "01" assigned to the blank character with probability of occurrence 0.317124.

The average number of coded characters decoded in error per bit error is given by

(.35)(2) + (.16)(2) + (.29)(1) + (.20)(1) = 1.51

The average number of output characters which are incorrect per bit error is given by

(.35)(1) + (.16)(2) + (.29)(2) + (.20)(1) = 1.45.

The statistics developed for the particular suffix-suffix code apply to the other suffix-suffix code and to both prefix-prefix codes. Observe that if "0" is chosen as a suffix first and then "1", the code words have the structure

1 0 ... 0 1 ... 1 with at least one "0" and always starting with "1"

This code is obtained from the one analyzed by interchanging the roles of "0" and "1", so it will have the same statistics as the code analyzed, provided that the code words assigned to the characters are those obtained by interchanging "0"s and "1"s in the assignment made previously. Observe that if "1" is chosen as a prefix first, and then "0" as a prefix, the code words have the structure:

0 ... 01 ... 10 with at least one "1" and always ending with "0"

These code words are simply the mirror images of the words in the suffix-suffix code analyzed. Assign these code words to characters by taking mirror code words to those assigned for the suffix-suffix code. Then the first position analysis, which depended on the ending of the prior word, applies to the last bit of the prefix-prefix code word and the beginning of the next code word, ending in "0" or "1". The second bit analysis applies to the second to last bit. The analyses conducted before clearly apply to the remaining bits. It follows that the statistics will be the same for this prefix-prefix code.

The remaining prefix-prefix code is obtained from the one just discussed by interchanging "1"s and "0"s; therefore, it will also have the same statistics, provided (once again) that the assignment of code words to characters is obtained by interchanging "1"s and "0"s in the above assignment.


Comma-free Codes Constructed Using Other Than Length One Words

The results presented in the last section were for the two simplest kinds of comma-free codes. In this section, it is shown that more complicated phenomena can occur, leading to more than two character decoding errors as a result of a single bit error. The results indicate that as the number of steps in the comma-free code construction process increases without limit, the number of character decoding errors probably increases without limit. However, we have not succeeded in exhibiting this for a sequence of comma-free codes involving the use of longer and longer construction processes.

Some additional terminology is needed to facilitate the discussion of general comma-free codes. Let

k denote the kernel of the code under construction

p(i), i = 1, 2, ..., denote the prefixes used in the code under construction

s(j), j = 1, 2, ..., denote the suffixes used in the code under construction

The most general comma-free codes have not been discussed in this manuscript and have not been analyzed in this study. We impose the following additional conditions on the codes under discussion:

(1) k ="0" or "in

(2) both "0" and "I" are used as either prefixes or suffixes

(3) the length of the prefix or suffix used in k-thconstruction step is less than or equal to the length of theprefix or suffix used in the (k+l)-th construction step. (Theserestrictions may not be necessary for carrying out an erroranalysis similar to that presented, but they are convenient andprobably do not exclude any codes that are of interest for datacompression.)

The results presented in the last section can be generalized to an important family of comma-free codes, which we call exhaustive codes. A comma-free code is called exhaustive if, for each of the steps in the code construction process, the code word chosen as either a prefix or suffix is the shortest code word possible. For example, referring to tables 7, 8, 9, and 10, the codes (1,1), (1,1,2), (1,1,2,3), and (1,1,2,3,3) are exhaustive and the codes (1,1,3), (1,1,3,3), (1,1,2,4), and (1,1,4) are non-exhaustive codes.

For an exhaustive comma-free code, a single bit error can lead to at most two characters decoded in error. For non-exhaustive comma-free codes, it may happen that a single bit error leads to more than two characters decoded in error. To


establish this result, consider (1) an incoming sequence of bits as a sequence of kernels, prefixes, and suffixes, and (2) the comma-insertion algorithm (after the first step) as deleting commas between the kernels, prefixes, and suffixes. Now, let us discuss the potential impact of a single bit error occurring in a kernel or in a prefix or suffix of the code words. In particular, we wish to discuss how an error can impact the comma deletion process between two words which are error free.

Let us denote the word with a bit error by use of "~".

Consider the incoming sequence of binary bits parsed into code words as follows:

w(1)w(2)w~(3)w(4)w(5).

Under what conditions will the comma separating w(1) and w(2) or the comma between w(4) and w(5) be altered as a result of a bit error somewhere in the code word w~(3)? Each of these words is constructed from the kernel and prefixes and suffixes, as described above.

For the comma between w(1) and w(2) to be erased by the comma-insertion algorithm, the prefix or kernel beginning w(2) must be transformed into a suffix through a bit error in w(3). Since none of the bits in w(2) are in error, this can only happen if the addition of bits to the bits of w(2) has created a suffix used in the construction process; i.e., there exists a code word of shorter length in the code than some suffix in the code. This means the comma-free code is non-exhaustive.

For the comma between w(4) and w(5) to be erased by the comma-insertion algorithm, the suffix or kernel ending w(4) must be transformed into a prefix through a bit error in w(3). Since none of the bits in w(4) are in error, this can only happen if the addition of bits to the bits of w(4) has created a prefix used in the construction process; i.e., there exists a code word of shorter length in the code than some prefix in the code. This means the comma-free code is non-exhaustive.

It is clear that one could improve upon the results by examining the non-exhaustive codes to see if either of the above phenomena can occur for a particular selection of prefixes or suffixes. This is relatively straightforward for any family of codes constructed using mostly short length prefixes and suffixes and no more than six construction steps. This is because the analysis is carried out by examining only those code words less than the longest suffix or prefix used in the construction. We illustrate this by considering the suffix-prefix codes of the form (1,1,3).

For kernel "0", suffix "1", prefix "0", there would be two three-letter code words available, namely "001" and "011", to be selected as either a suffix or prefix. The only length two code words are "00" and "01". Note if "011" is chosen as a prefix, there is no one bit which can be combined with "00" or "01" from


the left to create it; if instead "001" is chosen as a prefix, a one-bit addition to "01" on the left could be combined to obtain it. However, under the same conditions, if "011" is chosen as a suffix, then a one-bit "addition" to "01" on the right leads to the chosen suffix; if instead "001" is chosen as a suffix, then again a one-bit "addition" to "00" on the right leads to the chosen suffix. Thus, for three of the four constructions considered, one bit error could lead to three characters decoded in error. Note that for these codes, there has to be a very special combination of words and very particular bit errors to lead to more than two character decoding errors as a result of a single bit error.

If we consider when a single bit error could lead to four character decoding errors, similar reasoning to that presented above would lead to the necessity that two code words of length 2 plus one or more bits would need to be a suffix or prefix; i.e., the code would need to involve a suffix or prefix of length 5 or more. Hence, for the four non-exhaustive codes discussed in this section, namely (1,1,3), (1,1,3,3), (1,1,2,4), and (1,1,4), a single bit error can never lead to four or more character decoding errors.

SUMMARY

The present code consists of a parity bit, 5 information bits, and a stop bit. Our discussion of generalized Baudot codes suggested that a code using 4 information bits could replace the 5 information bit code now being used. We suggest a parity bit, 4 information bits, and 1/2 bit for stops (i.e., a stop bit for every 8 information bits). The number of bits for the new code is given by:

4.5 + (4.5/4)(1.5) = 6.2 bits-per-character

The data compression provided by this code would be

7/6.2 = 1.13
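
Making the framing arithmetic explicit (our restatement, assuming one parity bit and one-half stop bit per four information bits, with 4.5 information bits per character from the three-shift code):

4.5 + (4.5/4)(1 + 0.5) = 6.19, or about 6.2 bits-per-character, and 7/6.2 = 1.13.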

The more complicated encoding and decoding associated with comma-free codes does promise some additional compression. However, a mechanism to allow receiver synchronization in the absence of stop bits needs to be identified, and the incorporation of error correction codes requires care. Note, one cannot just add parity bits operating on code words, because in their presence the comma insertion algorithm would break down. Error correction information must be carried by comma-free code words. We suggest that it be incorporated into the encoding of the end-of-line character. One would map the error correction information into a set of long code words which could provide correction for the line and end-of-line indication by its presence.

It is always possible to superimpose an error correction code on the serial bit sequence before transmission and then utilize it for error correction before beginning the comma-free decoding


process which begins with the comma-insertion algorithm.

If error detection and correction bits were kept to the same ratio of information bits to non-information bits as the code discussed above, we would need

4 + 1.5 = 5.5 bits-per-character

This would translate into a data compression ratio of

7/5.5 = 1.27

This compression ratio is probably the best that could be accomplished using comma-free codes and character encoding based on probabilities of occurrence of the characters.

More powerful encoding techniques would use conditional probabilities of occurrence of characters. The suggested approach would be to create a table of comma-free code words for characters depending on the previously transmitted one or two characters. The encoding process would be reinitialized with the beginning of each word. As a further aid to the identification of the beginnings of words, it might prove desirable to always encode spaces in the same way by reserving some particular code word for spaces. It is recommended that the use of conditional probabilities of character occurrence rather than probabilities of occurrence, and the appropriate error detection and correction coding for use with the compression code, be analyzed in a follow-on effort. Such an approach promises considerable additional compression over that shown for any of the cases investigated in this report.
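
A minimal sketch of the suggested encoder (entirely illustrative; the code tables, their contents, and the treatment of spaces are assumptions rather than the report's design):

    def encode_contextual(text, tables, default):
        """Encode with a comma-free code table chosen by the previously
        transmitted character; reinitialize at each word boundary and
        always encode spaces from the same (default) table."""
        out, prev = [], None
        for ch in text:
            table = default if (prev is None or ch == ' ') \
                    else tables.get(prev, default)
            out.append(table[ch])
            prev = None if ch == ' ' else ch   # reinitialize after a space
        return ''.join(out)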


LIST OF REFERENCES

1. Huffman, D., "A Method for the Construction of Minimum Redundancy Codes," Proceedings of the Institute of Radio Engineers, Vol. 40, pp. 1098-1101, September 1952.

2. Scholtz, R., "Codes with Synchronization Capability",IEEE Transactions on Information Theory, Vol. IT-12,No. 2, April 1966.

3. Scholtz, R., "Maximal and Variable Word-Length Comma-FreeCodes", IEEE Transactions on Information Theory,Vol. IT-15, No. 2, March 1969.
