
August 1983 Report No. STAN-CS-83-977

Word Hy-phen-a-tion by Com-put-er

Franklin Mark Liang

Department of Computer Science

Stanford University, Stanford, CA 94305


WORD HY-PHEN-A-TION BY COM-PUT-ER

Franklin Mark Liang
Department of Computer Science
Stanford University, Stanford, California 94305

Abstract

This thesis describes research leading to an improved word hyphenation algorithm for the TeX82 typesetting system. Hyphenation is viewed primarily as a data compression problem, where we are given a dictionary of words with allowable division points, and try to devise methods that take advantage of the large amount of redundancy present.

The new hyphenation algorithm is based on the idea of hyphenating and inhibiting patterns. These are simply strings of letters that, when they match in a word, give us information about hyphenation at some point in the pattern. For example, '-tion' and 'c-c' are good hyphenating patterns. An important feature of this method is that a suitable set of patterns can be extracted automatically from the dictionary.

In order to represent the set of patterns in a compact form that is also reasonably efficient for searching, the author has developed a new data structure called a packed trie. This data structure allows the very fast search times characteristic of indexed tries, but in many cases it entirely eliminates the wasted space for null links usually present in such tries. We demonstrate the versatility and practical advantages of this data structure by using a variant of it as the critical component of the program that generates the patterns from the dictionary.

The resulting hyphenation algorithm uses about 4500 patterns that compile into a packed trie occupying 25K bytes of storage. These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error. By comparison, the uncompressed dictionary occupies over 500K bytes.

This research was supported in part by the National Science Foundation under grants IST-82-01926 and MCS-83-00984, and by the System Development Foundation. 'TeX' is a trademark of the American Mathematical Society.


WORD HY-PHEN-A-TION BY COM-PUT-ER

A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

by

Franklin Mark Liang

June 1983


© Copyright 1983

by

Franklin Mark Liang


Acknowledgments

I am greatly indebted to my adviser, Donald Knuth, for creating the research environment that made this work possible. When I began work on the TeX project as a summer job, I would not have predicted that computer typesetting would become such an active area of computer science research. Prof. Knuth's foresight was to recognize that there were a number of fascinating problems in the field waiting to be explored, and his pioneering efforts have stimulated many others to think about these problems.

I am also grateful to the Stanford Computer Science Department for providing the facilities and the community that have formed the major part of my life for the past several years.

I thank my readers, Luis Trabb Pardo and John Gill, as well as Leo Guibas, who served on my orals committee on short notice.

In addition, thanks to David Fuchs and Tom Pressburger for helpful advice and encouragement.

Finally, this thesis is dedicated to my parents, for whom the experience of pursuing a graduate degree has been perhaps even more traumatic than it was for myself.


Table of contents

Introduction
    Examples
    TeX and hyphenation
    Time magazine algorithm
    Patterns
    Overview of thesis
The dictionary problem
    Data structures
    Superimposed coding
    Tries
    Packed tries
    Suffix compression
    Derived forms
    Spelling checkers
    Related work
Hyphenation
    Finite-state machines with output
    Minimization with don't cares
    Pattern matching
Pattern generation
    Heuristics
    Collecting pattern statistics
    Dynamic packed tries
    Experimental results
    Examples
History and Conclusion
Appendix
    The PATGEN program
    List of patterns
References


Chapter 1

Introduction

The work described in this thesis was inspired by the need for a word hyphenation routine as part of Don Knuth's TeX typesetting system [1]. This system was initially designed in order to typeset Prof. Knuth's seven-volume series of books, The Art of Computer Programming, when he became dissatisfied with the quality of computer typesetting done by his publisher. Since Prof. Knuth's books were to be a definitive treatise on computer science, he could not bear to see his scholarly work presented in an inferior manner, when the degradation was entirely due to the fact that the material had been typeset by a computer!

Since then, TeX (also known as Tau Epsilon Chi, a system for technical text) has gained wide popularity, and it is being adopted by the American Mathematical Society, the world's largest publisher of mathematical literature, for use in its journals. TeX is distinctive among systems for word processing and document preparation in its emphasis on the highest quality output, especially for technical material.

One necessary component of the system is a computer-based algorithm for hyphenating English words. This is part of the paragraph justification routine, and it is intended to eliminate the need for the user to specify word division points explicitly when they are necessary for good paragraph layout. Hyphenation occurs relatively infrequently in most book-format printing, but it becomes rather critical in narrow-column formats such as newspaper printing. Insufficient attention paid to this aspect of layout results in large expanses of unsightly white space, or (even worse) in words split at inappropriate points, e.g. new-spaper.

Hyphenation algorithms for existing typesetting systems are usually either rule-based or dictionary-based. Rule-based algorithms rely on a set of division rules, such as those given for English in the preface of Webster's Unabridged Dictionary [2]. These include recognition of common prefixes and suffixes, splitting between double consonants, and other more specialized rules. Some of the "rules" are not particularly amenable to computer implementation; e.g. "split between the elements of a compound word". Rule-based schemes are inevitably subject to error, and they rarely cover all possible cases. In addition, the task of finding a suitable set of rules in the first place can be a difficult and lengthy project.

Dictionary-based routines simply store an entire word list along with the allowable division points. The obvious disadvantage of this method is the excessive storage required, as well as the slowing down of the justification process when the hyphenation routine needs to access a part of the dictionary on secondary store.

Examples

To demonstrate the importance of hyphenation, consider Figure 1, which shows a paragraph set in three different ways by TeX. The first example uses TeX's normal paragraph justification parameters, but with the hyphenation routine turned off. Because the line width in this example is rather narrow, TeX is unable to find an acceptable way of justifying the paragraph, resulting in the phenomenon known as an "overfull box".

One way to fix this problem is to increase the "stretchability" of the spaces between words, as shown in the second example. (TeX users: This was done by increasing the stretch component of \spaceskip to .5em.) The right margin is now straight, as desired, but the overall spacing is somewhat loose.

In the third example, the hyphenation routine is turned on, and everything is beautiful.

[Figure 1 showed the same paragraph (the opening of the Frog King fairy tale, "In olden times when wishing still helped one, there lived a king whose daughters were all beautiful...") set three ways: with hyphenation off, producing an overfull box; with extra interword stretch, producing loose spacing; and with hyphenation on.]

Figure 1. A typical paragraph with and without hyphenation.


sel-fadjoint      as-so-ciate      as-so-ci-ate
Pit-tsburgh       prog-ress        pro-gress
clearin-ghouse    rec-ord          re-cord
fun-draising      a-rith-me-tic    ar-ith-met-ic
ho-meowners       eve-ning         even-ing
playw-right       pe-ri-od-ic      per-i-o-dic
algori-thm
walkth-rough      in-de-pen-dent   in-de-pend-ent
Re-agan           tri-bune         trib-une

Figure 2. Difficult hyphenations.

However, life is not always so simple. Figure 2 shows that hyphenation can be difficult. The first column shows erroneous hyphenations made by various typesetting systems (which shall remain nameless). The next group of examples are words that hyphenate differently depending on how they are used. This happens most commonly with words that can serve as both nouns and verbs. The last two examples show that different dictionaries do not always agree on hyphenation (in this case Webster's vs. American Heritage).

TeX and hyphenation

The original TeX hyphenation algorithm was designed by Prof. Knuth and the author in the summer of 1977. It is essentially a rule-based algorithm, with three main types of rules: (1) suffix removal, (2) prefix removal, and (3) vowel-consonant-consonant-vowel (vccv) breaking. The latter rule states that when the pattern vowel-consonant-consonant-vowel appears in a word, we can in most cases split between the consonants. There are also many special-case rules; for example, "break vowel-q" or "break after ck". Finally a small exception dictionary (about 300 words) is used to handle particularly objectionable errors made by the above rules, and to hyphenate certain common words (e.g. pro-gram) that are not split by the rules. The complete algorithm is described in Appendix H of the old TeX manual.
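To make the flavor of rule (3) concrete, here is a minimal Python sketch of the vccv rule alone. It is not the original algorithm: the suffix and prefix removal, the special-case rules, and the exception dictionary are all omitted, and 'y' is simply treated as a consonant.

    VOWELS = set("aeiou")

    def vccv_breaks(word):
        """Positions where the bare vccv rule would allow a hyphen."""
        w = word.lower()
        breaks = []
        for i in range(len(w) - 3):
            a, b, c, d = w[i:i+4]
            if a in VOWELS and b not in VOWELS and c not in VOWELS and d in VOWELS:
                breaks.append(i + 2)   # split between the two consonants
        return breaks

    def show(word):
        cuts = set(vccv_breaks(word))
        print("".join(ch + ("-" if i + 1 in cuts else "") for i, ch in enumerate(word)))

    show("pattern")     # pat-tern
    show("consonant")   # con-sonant

The real algorithm applies this rule only after suffix and prefix removal, and filters the result through the special cases and the exception dictionary.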

In practice, the above algorithm has served quite well. Although it does not find all possible division points in a word, it very rarely makes an error. Tests on a pocket dictionary word list indicate that about 40% of the allowable hyphen points are found, with 1% error (relative to the total number of hyphen points). The algorithm requires 4K 36-bit words of code, including the exception dictionary.


The goal of the present research was to develop a better hyphenation algorithm. By "better" we mean finding more hyphens, with little or no error, and using as little additional space as possible. Recall that one way to perform hyphenation is to simply store the entire dictionary. Thus we can view our task as a data compression problem. Since there is a good deal of redundancy in English, we can hope for substantial improvement over the straightforward representation.

Another goal was to automate the design of the algorithm as much as possible. The original TeX algorithm was developed mostly by hand, with a good deal of trial and error. Extending such a rule-based scheme to find the remaining hyphens seems very difficult, and such an effort must be repeated for each new language. Hand development can be a problem even for English, because pronunciation (and thus hyphenation) tends to change over time, and because different types of publication may call for different sets of admissible hyphens.

Time magazine algorithm

A number of approaches were considered, including methods that have been discussed in the literature or implemented in existing typesetting systems. One of the methods studied was the so-called Time magazine algorithm, which is table-based rather than rule-based.

The idea is to look at the four letters surrounding each possible breakpoint, namely two letters preceding and two letters following the given point. However we do not want to store a table of 26^4 = 456,976 entries representing all possible four-letter combinations. (In practice only about 15% of these four-letter combinations actually occur in English words, but it is not immediately obvious how to take advantage of this.)

Instead, the method uses three tables of size 26^2, corresponding to the two letters preceding, surrounding, and following a potential hyphen point. That is, if the letter pattern wx-yz occurs in a word, we look up three values corresponding to the letter pairs wx, xy, and yz, and use these values to determine if we can split the pattern.

What should the three tables contain? In the Time algorithm the table values were the probabilities that a hyphen could occur after, between, or before two given letters, respectively. The probability that the pattern wx-yz can be split is then estimated as the product of these three values (as if the probabilities were independent, which they aren't). Finally the estimated value is compared against a threshold to determine hyphenation. Figure 3 shows an example of hyphenation probabilities computed by this method.

[Figure 3 plotted the estimated hyphenation probability at each letter boundary of the word SUPERCALIFRAGILISTICEXPIALIDOCIOUS.]

Figure 3. Hyphenation probabilities.
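The mechanics of the estimate are easy to sketch in Python. In this toy version the three tables are dictionaries holding made-up probabilities (real tables would be trained from a hyphenated dictionary, with 26^2 entries each), and the threshold is arbitrary.

    THRESHOLD = 0.5

    # prob_after[wx], prob_between[xy], prob_before[yz] would be trained from
    # the dictionary in the real method; the values below are invented.
    prob_after   = {"su": 0.6}
    prob_between = {"up": 0.9}
    prob_before  = {"pe": 0.7}

    def can_split(word, k):
        """Estimate whether word may be hyphenated between positions k-1 and k,
        using the four-letter window wx-yz and treating the three table values
        as independent probabilities (which they are not)."""
        if k < 2 or k > len(word) - 2:
            return False
        wx, xy, yz = word[k-2:k], word[k-1:k+1], word[k:k+2]
        p = (prob_after.get(wx, 0.0) * prob_between.get(xy, 0.0)
             * prob_before.get(yz, 0.0))
        return p > THRESHOLD

    print(can_split("super", 2))   # 0.6 * 0.9 * 0.7 = 0.378, below threshold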

The advantage of this table-based approach is that the tables can be generated automatically from the dictionary. However, some experiments with the method yielded discouraging results. One estimate is 40% of the hyphens found, with 8% error. Thus a large exception dictionary would be required for good performance.

The reason for the limited performance of the above scheme is that just four letters of context surrounding the potential break point are not enough in many cases. In an extreme example, we might have to look as many as 10 letters ahead in order to determine hyphenation, e.g. dem-on-stra-tion vs. de-mon-stra-tive.

So a more powerful method is needed.

Patterns

A good deal of experimentation led the author to a more powerful method based on the idea of hyphenation patterns. These are simply strings of letters that, when they match in a word, tell us how to hyphenate at some point in the pattern. For example, the pattern 'tion' might tell us that we can hyphenate before the 't'. Or when the pattern 'cc' appears in a word, we can usually hyphenate between the c's. Here are some more examples of good hyphenating patterns:

.in-d  .in-s  .in-t  .un-d  b-s  -cia  con-s  con-t  e-ly  er-l  er-m  ex-  -ful  it-t  i-ty  -less  l-ly  -ment  n-co  -ness  n-f  n-l  n-si  n-v  om-m  -sion  s-ly  s-ness  ti-ca  x-p

(The character '.' matches the beginning or end of a word.)
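A minimal Python sketch of how hyphenating patterns can be matched against a word follows. The pattern set here is a tiny invented sample in the notation above; inhibiting patterns and levels, discussed below, are ignored.

    PATTERNS = [".in-d", ".in-t", "con-s", "n-co", "-tion", "c-c"]

    def find_hyphens(word):
        text = "." + word + "."            # '.' marks the word boundaries
        hyphens = set()
        for pat in PATTERNS:
            cut = pat.index("-")           # letters before the break point
            letters = pat.replace("-", "")
            start = 0
            while (pos := text.find(letters, start)) != -1:
                hyphens.add(pos + cut - 1) # position counted in word letters
                start = pos + 1
        return sorted(h for h in hyphens if 0 < h < len(word))

    print(find_hyphens("concoction"))      # [3, 6]: con-coc-tion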


Patterns have many advantages. They are a general form of "hyphenation rule" that can include prefix, suffix, and other rules as special cases. Patterns can even describe an exception dictionary, namely by using entire words as patterns. (Actually, patterns are often more concise than an exception dictionary because a single pattern can handle several variant forms of a word; e.g. pro-gram, pro-grams, and pro-grammed.)

More importantly, the pattern matching approach has proven very effective. An appropriate set of patterns captures very concisely the information needed to perform hyphenation. Yet the pattern rules are of simple enough form that they can be generated automatically from the dictionary.

When looking for good hyphenating patterns, we soon discover that almost all of them have some exceptions. Although -tion is a very "safe" pattern, it fails on the word cat-ion. Most other cases are less clear-cut; for example, the common pattern n-t can be hyphenated about 80 percent of the time. It definitely seems worthwhile to use such patterns, provided that we can deal with the exceptions in some manner.

After choosing a set of hyphenating patterns, we may end up with thousands of exceptions. These could be listed in an exception dictionary, but we soon notice there are many similarities among the exceptions. For example, in the original TeX algorithm we found that the vowel-consonant-consonant-vowel rule resulted in hundreds of errors of the form X-Yer or X-Yers, for certain consonant pairs XY, so we put in a new rule to prevent those errors.

Thus, there may be "rules" that can handle large classes of exceptions. To take advantage of this, patterns come to the rescue again; but this time they are inhibiting patterns, because they show where hyphens should not be placed. Some good examples of inhibiting patterns are: b=ly (don't break between b and ly), bs=, =cing, io=n, i=tin, =ls, nn=, ns=t, n=ted, =pt, ti=al, =tly, =ts, and tt=.

As it turns out, this approach is worth pursuing further. That is, after applying hyphenating and inhibiting patterns as discussed above, we might have another set of hyphenating patterns, then another set of inhibiting patterns, and so on. We can think of each level of patterns as being "exceptions to the exceptions" of the previous level. The current TeX82 algorithm uses five alternating levels of hyphenating and inhibiting patterns. The reasons for this will be explained in Chapter 4. A sketch of how such levels combine appears below.
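The interaction of alternating levels can be sketched in Python using the convention eventually adopted in TeX82: each pattern carries small numbers at its break points, odd values (levels 1, 3, 5) propose a hyphen and even values (2, 4) inhibit one, and the largest value applying at each position wins. The two patterns here are invented for the example, not taken from the real pattern set.

    # '1tio': a level-1 hyphen before 'tio'; 'a2t': a level-2 inhibitor
    # between 'a' and 't'. levels[j] is the value in the gap before letter j.
    PATTERNS = {"tio": [1, 0, 0, 0],
                "at":  [0, 2, 0]}

    def hyphen_points(word):
        values = [0] * (len(word) + 1)     # values[i]: level between w[i-1], w[i]
        for i in range(len(word)):
            for pat, levels in PATTERNS.items():
                if word.startswith(pat, i):
                    for j, v in enumerate(levels):
                        values[i + j] = max(values[i + j], v)
        # an odd final value means a hyphen is allowed at that position
        return [i for i in range(1, len(word)) if values[i] % 2 == 1]

    print(hyphen_points("lotion"))   # [2]: lo-tion
    print(hyphen_points("nation"))   # []: the inhibitor overrides the hyphen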

The idea of patterns is the basis of the new TeX hyphenation algorithm, and it was the inspiration for much of the intermediate investigation that will be described.


Overview of thesis

In developing the pattern scheme, two main questions arose: (1) How can we represent the set of hyphenation patterns in a compact form that is also reasonably efficient for searching? (2) Given a hyphenated word list, how can we generate a suitable set of patterns?

To solve these problems, the author has developed a new data structure called a packed trie. This data structure allows the very fast search times characteristic of indexed tries, but in many cases it entirely eliminates the wasted space for null links usually present in such tries.

We will demonstrate the versatility and practical advantages of this data structure by using it not only to represent the hyphenation patterns in the final algorithm, but also as the critical component of the program that generates the patterns from the dictionary. Packed tries have many other potential applications, including identifier lookup, spelling checking, and lexicographic sorting.

Chapter 2 considers the simpler problem of recognizing, rather than hyphenating, a set of words such as a dictionary, and uses this problem to motivate and explain the advantages of the packed trie data structure. We also point out the close relationship between tries and finite-state machines.

Chapter 3 discusses ways of applying these ideas to hyphenation. After considering various approaches, including minimization with don't cares, we return to the idea of patterns.

Chapter 4 discusses the heuristic method used to select patterns, introduces dynamic packed tries, and describes some experiments with the pattern generation program.

Chapter 5 gives a brief history, and mentions ideas for future research.

Finally, the appendix contains the WEB [3] listing of the portable pattern generation program PATGEN, as well as the set of patterns currently used by TeX82.

Note: The present chapter has been typeset by giving unusual instructions to TeX so that it hyphenates words much more often than usual; therefore the reader can see numerous examples of word breaks that were discovered by the new algorithm.


Chapter 2

The dictionary problem

In this chapter we consider the problem of recognizing a set of words over an alphabet. To be more precise, an alphabet is a set of characters or symbols, for example the letters A through Z, or the ASCII character set. A word is a sequence of characters from the alphabet. Given a set of words, our problem is to design a data structure that will allow us to determine efficiently whether or not some word is in the set.

In particular, we will use spelling checking as an example throughout this chapter. This is a topic of interest in its own right, but we discuss it here because the pattern matching techniques we propose will turn out to be very useful in our hyphenation algorithm.

Our problem is a special case of the general set recognition problem, because the elements of our set have the additional structure of being variable-length sequences of symbols from a finite alphabet. This naturally suggests methods based on a character-by-character examination of the key, rather than methods that operate on the entire key at once. Also, the redundancy present in natural languages such as English suggests additional opportunities for compression of the set representation.

We will be especially interested in space minimization. Most data structures for set representation, including the one we propose, are reasonably fast for searching. That is, a search for a key doesn't take much more time than is needed to examine the key itself. However, most of these algorithms assume that everything is "in core", that is, in the primary memory of the computer. In many situations, such as our spelling checking example, this is not feasible. Since secondary memory access times are typically much longer, it is worthwhile to try compressing the data structure as much as possible.

In addition to determining whether a given word is in the set, there are other operations we might wish to perform on the set representation. The most basic are insertion and deletion of words from the set. More complicated operations include performing the union of two sets, partitioning a set according to some criterion, determining which of several sets an element is a member of, or operations based on an ordering or other auxiliary information associated with the keys in the set. For the data structures we consider, we will pay some attention to methods for insertion and deletion, but we shall not discuss the more complicated operations.

We first survey some known methods for set representation, and then propose a new data structure called a "packed trie".

Data structures

Methods for set representation include the following: sequential lists, sorted lists, binary search trees, balanced trees, hashing, superimposed coding, bit vectors, and digital search trees (also known as tries). Good discussions of these data structures can be found in a number of texts, including Knuth [4], Standish [5], and AHU [6]. Below we make a few remarks about each of these representations.

A sequential list is the most straightforward representation. It requires both space and search time proportional to the number of characters in the dictionary.

A sorted list assumes an ordering on the keys, such as alphabetical order. Binary search allows the search time to be reduced to the logarithm of the size of the dictionary, but space is not reduced.

A binary search tree also allows search in logarithmic time. This can be thought of as a more flexible version of a sorted list that can be optimized in various ways. For example, if the probabilities of searching for different keys in the tree are known, then the tree can be adapted to improve the expected search time. Search trees can also handle insertions and deletions easily, although an unfavorable sequence of such operations may degrade the performance of the tree.

Balanced tree schemes (including AVL trees, 2-3 trees, and B-trees) correct the above-mentioned problem, so that insertions, deletions, and searches can all be performed in logarithmic time in the worst case. Variants of trees have other nice properties, too; they allow merging and splitting of sets, and priority queue operations. B-trees are well-suited to large applications, because they are designed to minimize the number of secondary memory accesses required to perform a search. However, space utilization is not improved by any of these tree schemes, and in fact it is usually increased because of the need for extra pointers.

Hashing is an essentially different approach to the problem. Here a suitable randomizing function is used to compute the location at which a key is stored. Hashing methods are very fast on the average, although the worst case is linear; fortunately this worst case almost never happens.

An interesting variant of hashing, called superimposed coding, was proposed by Bloom [7] (see also [4, §6.5], [8]); it does provide a reduction in space, although at the expense of allowing some error. Since this method is perhaps less well known, we give a description of it here.

Superimposed coding

The idea is as follows. We use a single large bit array, initialized to zeros, plus a suitable set of d different hash functions. To represent a word, we use the hash functions to compute d bit positions in the large array of bits, and set these bits to ones. We do this for each word in the set. Note that some bits may be set by more than one word.

To test if a word is in the set, we compute the d bit positions associated with the word as above, and check to see if they are all ones in the array. If any of them are zero, the word cannot be in the set, so we reject it. Otherwise if all of the bits are ones, we accept the word. However, some words not in the set might be erroneously accepted, if they happen to hash into bits that are all "covered" by words in the set.

It can be shown [7] that the above scheme makes the best use of space when the density of bits in the array, after all the words have been inserted, is approximately one-half. In this case the probability that a word not in the set is erroneously accepted is 2^-d. For example, if each word is hashed into 4 bit positions, the error probability is 1/16. The required size of the bit array is approximately n·d·lg e bits, where n is the number of items in the set and lg e ≈ 1.44.
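A sketch of the scheme in Python, with the d hash functions improvised by salting MD5 with the function's index; any family of independent hash functions would do.

    import hashlib

    class BloomSet:
        def __init__(self, m_bits, d):
            self.m, self.d = m_bits, d
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, word):
            for i in range(self.d):
                h = hashlib.md5(f"{i}:{word}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, word):
            for p in self._positions(word):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_contains(self, word):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(word))

    # Size the array at about 1.44*n*d bits, per the estimate above.
    words = ["cation", "program"]
    filt = BloomSet(m_bits=int(1.44 * len(words) * 4) + 1, d=4)
    for w in words:
        filt.add(w)
    print(filt.maybe_contains("cation"))  # True: members are never missed
    print(filt.maybe_contains("zebra"))   # normally False; a True here
                                          # would be a false positive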

In fact Bloom specifically discusses automatic hyphenation as an application for his scheme! The scenario is as follows. Suppose we have a relatively compact routine for hyphenation that works correctly for 90 percent of the words in a large dictionary, but is in error or fails to hyphenate the other 10 percent. We would then like some way to test if a word belongs to the 10 percent, but we do not have room to store all of these words in main memory. If we instead use the superimposed coding scheme to test for these words, the space required can be much reduced. For example, with d = 4 we only need about 6 bits per word. The penalty is that some words will be erroneously identified as being in the 10 percent. However, this is acceptable because usually the test word will be rejected, and we can then be sure that it is not one of the exceptions. (Either it is in the other 90 percent or it is not in the dictionary at all.) In the comparatively rare case that the word is accepted, we can go to secondary store to check explicitly if the word is one of the exceptions.

The above technique is actually used in some commercial hyphenation routines. For now, however, TeX will not have an external dictionary. Instead we will require that our hyphenation routine be essentially free of error (although it may not achieve complete hyphenation).


An extreme case of superimposed coding should also be mentioned, namely the bit-vector representation of a set. (Imagine that each word is associated with a single bit position, and one bit is allocated for each possible word.) This representation is often very convenient, because it allows set intersection and union to be performed by simple logical operations. But it also requires space proportional to the size of the universe of the set, which is impractical for words longer than three or four characters.

Tries

The final class of data structures we will consider are the digital search trees, first described by de la Briandais [9] and Fredkin [10]. Fredkin also introduced the term "trie" for this class of trees. (The term was derived from the word retrieval, although it is now pronounced "try".)

Tries are distinct from the other data structures discussed so far because they explicitly assume that the keys are sequences of values over some (finite) alphabet, rather than a single indivisible entity. Thus tries are particularly well-suited for handling variable-length keys. Also, when appropriately implemented, tries can provide compression of the set represented, because common prefixes of words are combined together; words with the same prefix follow the same search path in the trie.

A trie can be thought of as an m-ary tree, where m is the number of characters in the alphabet. A search is performed by examining the key one character at a time and using an m-way branch to follow the appropriate path in the trie, starting at the root.
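In outline (a Python sketch, assuming a 26-letter lowercase alphabet):

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    class Family:                          # one m-way branch point
        def __init__(self):
            self.links = [None] * len(ALPHABET)
            self.terminal = False          # does some word end here?

    def insert(root, word):
        node = root
        for ch in word:
            i = ALPHABET.index(ch)
            if node.links[i] is None:
                node.links[i] = Family()
            node = node.links[i]
        node.terminal = True

    def contains(root, word):
        node = root
        for ch in word:
            node = node.links[ALPHABET.index(ch)]
            if node is None:
                return False
        return node.terminal

    root = Family()
    for w in ["the", "this", "to"]:
        insert(root, w)
    print(contains(root, "this"), contains(root, "th"))   # True False

This is the indexed form discussed below: the array of 26 links makes search a single index operation per character, at an obvious cost in null links.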

We will use the set of 31 most common English words, shown below, to illustrate different ways of implementing a trie.

A      FOR    IN     THE
AND    FROM   IS     THIS
ARE    HAD    IT     TO
AS     HAVE   NOT    WAS
AT     HE     OF     WHICH
BE     HER    ON     WITH
BUT    HIS    OR     YOU
BY     I      THAT

Figure 4. The 31 most common English words.

[The diagram for Figure 5 is not reproduced in this transcript.]

Figure 5. Linked trie for the 31 most common English words.


Figure 5 shows a linked trie representing this set of words. In a linked trie, the m-way branch is performed using a sequential series of comparisons. Thus in Figure 5 each node represents a yes-no test against a particular character. There are two link fields indicating the next node to take depending on the outcome of the test. On a 'yes' answer, we also move to the next character of the key. The underlined characters are terminal nodes, indicated by an extra bit in the node. If the word ends when we are at a terminal node, then the word is in the set.

Note that we do not have to actually store the keys in the trie, because each node implicitly represents a prefix of a word, namely the sequence of characters leading to that node.
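A sketch of the node structure and search loop for a linked trie, under the same conventions as Figure 5 (uppercase words, a terminal bit per node):

    class Node:
        def __init__(self, ch):
            self.ch = ch                 # character tested at this node
            self.yes = None              # on a match: next family
            self.no = None               # on a mismatch: next test, same family
            self.terminal = False

    def contains(node, word):
        for k, ch in enumerate(word):
            while node is not None and node.ch != ch:
                node = node.no           # sequential search within the family
            if node is None:
                return False
            if k == len(word) - 1:
                return node.terminal
            node = node.yes              # advance to the next key character
        return False

    # A tiny fragment accepting HE and HIS:
    s = Node("S"); s.terminal = True
    i = Node("I"); i.yes = s
    e = Node("E"); e.terminal = True; e.no = i
    h = Node("H"); h.yes = e
    print(contains(h, "HIS"), contains(h, "HE"), contains(h, "HI"))  # True True False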

A linked trie is somewhat slow because of the sequential testing required for each character of the key. The number of comparisons per character can be as large as m, the size of the alphabet. In addition, the two link fields per node are somewhat wasteful of space. (Under certain circumstances, it is possible to eliminate one of these two links. We will explain this later.)

In an indexed trie, the m-way branch is performed using an array of size m. The elements of the array are pointers indicating the next family of the trie to go to when the given character is scanned, where a "family" corresponds to the group of nodes in a linked trie for testing a particular character of the key. When performing a search in an indexed trie, the appropriate pointer can be accessed by simply indexing from the base of the array. Thus search will be quite fast.

But indexed tries typically waste a lot of space, because most of the arrays have only a few "valid" pointers (for words in the trie), with the rest of the links being null. This is especially common near the bottom of the trie. Figure 6 shows an indexed trie for the set of 31 common words. This representation requires 26 × 32 = 832 array locations, compared to 59 nodes for the linked trie.

Various methods have been proposed to remedy the disadvantages of linked and indexed tries. Trabb Pardo [11] describes and analyzes the space requirements of some simple variants of binary tries. Knuth [4, ex. 6.3-20] analyzes a composite method where an indexed trie is used for the first few levels of the trie, switching to sequential search when only a few keys remain in a subtrie. Mehlhorn [12] suggests using a binary search tree to represent each family of a trie. This requires storage proportional to the number of "valid" links, as in a linked trie, but allows each character of the key to be processed in at most log m comparisons. Maly [13] has proposed a "compressed trie" that uses an implicit representation to eliminate links entirely. Each level of the trie is represented by a bit array, where the bits indicate whether or not some word in the set passes through the node corresponding to that bit. In addition each family contains a field indicating the number of nonzero bits in the array for all nodes to the left of the current family, so that we can find the desired family on the next level. The storage required for each family is thus reduced to m + log n bits, where n is the total number of keys. However, compressed tries cannot handle insertions and deletions easily, nor do they retain the speed of indexed tries.

[Figure 6, the 26 × 32 array of links, is too garbled in this transcript to reproduce.]

Figure 6. Indexed trie for the 31 most common English words.

Packed tries

Our idea is to use an indexed trie, but to save the space for null links by packing the different families of the trie into a single large array, so that links from one family may occupy space normally reserved for links of other families that happen to be null. An example of this is illustrated below.

[A small diagram illustrating two families sharing space in the packed array appeared here.]

(In the following, we will sometimes refer to families of the indexed trie as states, and to pointers as transitions. This is by analogy with the terminology for finite-state machines.)

When performing a search in the trie, we need a way to check if an indexed pointer actually corresponds to the current family, or if it belongs to some other family that just happens to be packed in the same location. This is done by additionally storing the character indexing a transition along with that transition. Thus a transition belongs to a state only if its character matches the character we are indexing on. This test always works if one additional requirement is satisfied, namely that different states may not be packed at the same base location.

The trie can be packed using a first-fit method. That is, we pack the states one at a time, putting each state into the lowest-indexed location in which it will fit (not overlapping any previously packed transitions, nor at an already occupied base location). On numerous examples based on typical word lists, this heuristic works extremely well. In fact, nearly all of the holes in the trie are often filled by transitions from other states.
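The packing loop itself is short. The following Python sketch packs a list of states (each a mapping from characters to successor states) into one array, storing the indexing character beside each link exactly as described above; the search step then checks that stored character.

    def code(ch):
        return ord(ch) - ord("a") + 1

    def pack(states):
        """First-fit pack; returns (array, bases). Cells are (char, next) or None."""
        array, bases, used = [], [None] * len(states), set()
        for s, trans in enumerate(states):
            base = 0
            while (base in used or
                   any(base + code(c) < len(array) and array[base + code(c)]
                       for c in trans)):
                base += 1                  # first fit: slide to the next base
            used.add(base)                 # distinct base locations required
            bases[s] = base
            top = base + max((code(c) for c in trans), default=0)
            array.extend([None] * (top + 1 - len(array)))
            for c, nxt in trans.items():
                array[base + code(c)] = (c, nxt)
        return array, bases

    def step(array, bases, state, ch):
        k = bases[state] + code(ch)
        cell = array[k] if k < len(array) else None
        return cell[1] if cell and cell[0] == ch else None  # char must match

    # Three toy families: the root branches on h/t; two chains each expect 'e'.
    states = [{"h": 1, "t": 2}, {"e": 1}, {"e": 2}]
    array, bases = pack(states)
    print(len(array), step(array, bases, 0, "h"), step(array, bases, 0, "x"))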

Figure 7 shows the result when the indexed trie of Figure 6 is packed into a single array using the first-fit method. (Actually we have used an additional compression technique called suffix compression before packing the trie; this will be explained in the next section.) The resulting trie fits into just 60 locations. Note that the packed trie is a single large array; the rows in the figure should be viewed as one long row.

[Figure 7, the packed array itself, is too garbled in this transcript to reproduce.]

Figure 7. Packed trie for the 31 most common English words.

As an example, here's what happens when we search for the word HAVE in the packed trie. We associate the values 1 through 26 with the letters A through Z. The root of the trie is packed at location 0, so we begin by looking at location 8, corresponding to the letter H. Since 'H30' is stored there, this is a valid transition, and we go to location 30. Indexing by the letter A, we look in location 31, which tells us to go to 29. Now indexing by V gets location 51, which points to 2. Finally indexing by E gets location 7, which is underlined, indicating that the word HAVE is indeed in the set.

Suffix compression

A big advantage of the trie data structure is that common prefixes of words are combined automatically into common paths in the trie. This provides a good deal of compression. To save more space, we can try to take advantage of common suffixes as well.


One way of doing this is to construct a trie in the usual manner, and then merge common subtries together, starting from the leaves and working upward. We call this process suffix compression.

For example, in the linked trie of Figure 5 the terminal nodes for the words HIS and THIS, both of which test for the letter S and have no successors, can be combined into a single node. That is, we can let their parent nodes both point to the same node; this does not change the set of words accepted by the trie. It turns out that we can then combine the parent nodes, since both of them test for I and go to the S node if successful, otherwise stop (no left successor). However, the grandparent nodes (which are actually siblings of the I nodes) cannot be combined even though they both test for E, because one of them goes to a terminal R node upon success, while the other has no right successor.

With a larger set of words, a great deal of merging can be possible. Clearly all leaf nodes (nodes with no successors) that test the same character can be combined together. This alone saves a number of nodes equal to the number of words in the dictionary, minus the number of words that are prefixes of other words, plus at most 26. In addition, as we might expect, longer suffixes such as -ly, -ing, or -tion can frequently be combined.

The suffix compression process may sound complicated, but actually it can be described by a simple recursive algorithm. For each node of the trie, we first compress each of its subtries, then determine if the node can be merged with some other node. In effect, we traverse the trie in depth-first order, checking each node to see if it is equivalent to any previously seen node. A hash table can be used to identify equivalent nodes, based on their (merged) transitions.
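The recursive pass is only a few lines in Python. This sketch represents a node as (character, terminal bit, children) and uses a dictionary as the hash table; equal ids mean the subtries have been merged.

    def compress(node, table, nodes):
        """Return a small integer id; equal ids denote identical subtries."""
        ch, term, children = node
        key = (ch, term, tuple(compress(c, table, nodes) for c in children))
        if key not in table:               # first occurrence of this subtrie
            table[key] = len(nodes)
            nodes.append(key)
        return table[key]

    # HIS and THIS share the suffix 'IS': the S leaves merge, then the I
    # nodes, then even the H nodes, since their subtries are identical.
    S = ("S", True, ())
    his  = ("H", False, (("I", False, (S,)),))
    this = ("T", False, (("H", False, (("I", False, (S,)),)),))
    table, nodes = {}, []
    compress(("", False, (his, this)), table, nodes)
    print(len(nodes))    # 5 distinct nodes remain out of the original 8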

The identification of nodes is somewhat easier using a binary tree representation of the trie, rather than an m-ary representation, because each node will then have just two link fields in addition to the character and output bit. Thus it will be convenient to use a linked trie when performing suffix compression. The linked representation is also more convenient for constructing the trie in the first place, because of the ease of performing insertions.

After applying suffix compression, the trie can be converted to an indexed trie and packed as described previously. (We should remark that performing suffix compression on a linked trie can yield some additional compression, because trie families can be partially merged. However, such compression is lost when the trie is converted to indexed form.)

The author has performed numerous experiments with the above ideas. The results for some representative word lists are shown in Table 1 below. The last three columns show the number of nodes in the linked, suffix-compressed, and packed tries, respectively. Each transition of the packed trie consists of a pointer, a character, and a bit indicating if this is an accepting transition.

word list     words     characters     linked    compressed    packed
pascal            35           145        125           104       120
murray         2,720        19,144      8,039         4,272     4,285
pocket        31,036       247,612     92,339        38,619    38,638
unabrd       235,545     2,250,805    759,045             —         —

Table 1. Suffix-compressed packed tries.

The algorithms for building a linked trie, suffix compression, and first-fit packing are used in TeX82 to preprocess the set of hyphenation patterns into a packed trie used by the hyphenation routine. A WEB description of these algorithms can be found in [14].

Derived forms

Most dictionaries do not list the most common derived forms of words, namely regular plurals of nouns and verbs (-s forms), participles and gerunds of verbs (-ed and -ing forms), and comparatives and superlatives of adjectives (-er and -est). This makes sense, because a user of the dictionary can easily determine when a word possesses one of these regular forms. However, if we use the word list from a typical dictionary for spelling checking, we will be faced with the problem of determining when a word is one of these derived forms.

Some spelling checkers deal with this problem by attempting to recognize affixes. This is done not only for the derived forms mentioned above but for other common variant forms as well, with the purpose of reducing the number of words that have to be stored in the dictionary. A set of logical rules is used to determine when certain prefixes and suffixes can be stripped from the word under consideration.

However, such rules can be quite complicated, and they inevitably make errors. The situation is not unlike that of finding rules for hyphenation, which should not be surprising, since affix recognition is an important part of any rule-based hyphenation algorithm. This problem has been studied in some detail in a series of papers by Resnikoff and Dolby [15].

Since affix recognition is difficult, it is preferable to base a spelling checker on a complete word list, including all derived forms. However, a lot of additional space will be required to store all of these forms, even though much of the added data is redundant. We might hope that some appropriate method could provide substantial compression of the expanded word list. It turns out that suffix-compressed tries handle this quite well. When derived forms were added to our pocket dictionary word list, it increased in size to 49,858 words and 404,046 characters, but the resulting packed trie only increased to 46,553 transitions (compare the pocket dictionary statistics in Table 1).

"Hyphenation programs also need to deal with the problem of derived forms.In our pattern-matching approach, we intend to extract the hyphenation rules au-tomatically from the dictionary. Thus it is again preferable for our word list toinclude all derived forms.

The creation of such an expanded word list required a good deal of work. The author had access to a computer-readable copy of Webster's Pocket Dictionary [16], including parts of speech and definitions. This made it feasible to identify nouns, verbs, etc., and to generate the appropriate derived forms mechanically. Unfortunately the resulting word lists required extensive editing to eliminate many never-used or somewhat nonsensical derived forms, e.g. 'informations'.

Spelling checkers

Computer-based word processing systems have recently come into widespread use. As a result there has been a surge of interest in programs for automatic spelling checking and correction. Here we will consider the dictionary representations used by some existing spelling checkers.

One of the earliest programs, designed for a large timesharing computer, was the DEC-10 SPELL program written by Ralph Gorin [17]. It uses a 12,000 word dictionary stored in main memory. A simple hash function assigns a unique 'bucket' to each word depending on its length and the first two characters. Words in the same bucket are listed sequentially. The number of words in each bucket is relatively small (typically 5 to 50 words), so this representation is fairly efficient for searching. In addition, the buckets provide convenient access to groups of similar words; this is useful when the program tries to correct spelling errors.

The dictionary used by SPELL does not contain derived forms. Instead some simple affix stripping rules are normally used; the author of the program notes that these are "error-prone".

Another spelling checker is described by James L. Peterson [18]. His program uses three separate dictionaries: (1) a small list of 258 common English words, (2) a dynamic 'cache' of about 1000 document-specific words, and (3) a large, comprehensive dictionary, stored on disk. The list of common words (which is static) is represented using a suffix-compressed linked trie. The dynamic cache is maintained using a hash table. Both of these dictionaries are kept in main memory for speed. The disk dictionary uses an in-core index, so that at most one disk access is required per search.

Robert Nix [19] describes a spelling checker based on the superimposed coding method. He reports that this method allows the dictionary from the SPELL program to be compressed to just 20 percent of its original size, while allowing a 0.1% chance of error.

A considerably different approach to spelling checking was taken by the TYPO program developed at Bell Labs [20]. This program uses digram and trigram frequencies to identify "improbable" words. After processing a document, the words are listed in order of decreasing improbability for the user to peruse. (Words appearing in a list of 2726 common technical words are not shown.) The authors report that this format is "psychologically rewarding", because many errors are found at the beginning, inducing the user to continue scanning the list until errors become rare.

In addition to the above, there have recently been a number of spelling checkers developed for the "personal computer" market. Because these programs run on small microprocessor-based systems, it is especially important to reduce the size of the dictionary. Standard techniques include hash coding (allowing some error), in-core caches of common words, and special codes for common prefixes and suffixes. One program first constructs a sorted list of all words in the document, and then compares this list with the dictionary in a single sequential pass. The dictionary can then be stored in a compact form suited for sequential scanning, where each word is represented by its difference from the previous word, as sketched below.
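The difference coding just mentioned is often called front coding; a minimal Python sketch, assuming the word list is sorted:

    def encode(sorted_words):
        prev, out = "", []
        for w in sorted_words:
            n = 0
            while n < min(len(prev), len(w)) and prev[n] == w[n]:
                n += 1
            out.append((n, w[n:]))         # (shared prefix length, new tail)
            prev = w
        return out

    def decode(encoded):
        prev, out = "", []
        for n, tail in encoded:
            prev = prev[:n] + tail
            out.append(prev)
        return out

    words = ["hyphen", "hyphenate", "hyphenation", "trie"]
    enc = encode(words)
    print(enc)                   # [(0, 'hyphen'), (6, 'ate'), (8, 'ion'), (0, 'trie')]
    print(decode(enc) == words)  # True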

Besides simply detecting when words are not in a dictionary, the design of a practical spelling checker involves a number of other issues. For example, many spelling checkers also try to perform spelling correction. This is usually done by searching the dictionary for words similar to the misspelled word. Errors and suggested replacements can be presented in an interactive fashion, allowing the user to see the context from the document and make the necessary changes. The contents of the dictionary are of course very important, and each user may want to modify the word list to match his or her own vocabulary. Finally, a plain spelling checker cannot detect problems such as incorrect word usage or mistakes in grammar; a more sophisticated program performing syntactic and perhaps semantic analysis of the text would be necessary.


Conclusion and related ideas

The dictionary problem is a fundamental problem of computer science, and it has many applications besides spelling checking. Most data structures for this problem consider the elements of the set as atomic entities, fitting into a single computer word. However, in many applications, particularly word processing, the keys are actually variable-length strings of characters. Most of the standard techniques are somewhat awkward when dealing with variable-length keys. Only the trie data structure is well-suited for this situation.

We have proposed a variant of tries that we call a packed trie. Search in a packed trie is performed by indexing, and it is therefore very fast. The first-fit packing technique usually produces a fairly compact representation as well.

We have not discussed how to perform dynamic insertions and deletions with a packed trie. In Chapter 4 we discuss a way to handle this problem, when no suffix compression is used, by repacking states when necessary.

The idea of suffix compression is not new. As mentioned, Peterson's spelling checker uses this idea also. But in fact, if we view our trie as a finite-state machine, suffix compression is equivalent to the well-known idea of state minimization. In our case the machine is acyclic, that is, it has no loops.

Suffix compression is also closely related to the common subexpression problem from compiler theory. In particular, it can be considered a special case of a problem called acyclic congruence closure, which has been studied by Downey, Sethi, and Tarjan [21]. They give a linear-time algorithm for suffix compression that does not use hashing, but it is somewhat complicated to implement and requires additional data structures.

The idea for the first-fit packing method was inspired by the paper "Storing a sparse table" by Tarjan and Yao [22]. The technique has been used for compressing parsing tables, as discussed by Zeigler [23] (see also [24]). However, our packed trie implementation differs somewhat from the applications discussed in the above references, because of our emphasis on space minimization. In particular, the idea of storing the character that indexes a transition, along with that transition, seems to be new. This has an advantage over other techniques for distinguishing states, such as the use of back pointers, because the character requires fewer bits.

The paper by Tarjan and Yao also contains an interesting theorem characterizing the performance of the first-fit packing method. They consider a modification suggested by Zeigler, where the states are first sorted into decreasing order based on the number of non-null transitions in each state. The idea is that small states, which can be packed more easily, will be saved for the end. They prove that if the distribution of transitions among states satisfies a "harmonic decay" condition, then essentially all of the holes in the first-fit packing will be filled.

More precisely, let n(l) be the total number of non-null transitions in states with more than l transitions, for l ≥ 0. If the harmonic decay property n(l) ≤ n/(l + 1) is satisfied, then the first-fit-decreasing packing satisfies 0 ≤ b(i) ≤ n for all i, where n = n(0) is the total number of transitions and b(i) is the base location at which the ith state is packed.

The above theorem does not take into account our additional restriction that no two states may be packed at the same base location. When the proof is modified to include this restriction, the bound goes up by a factor of two. However, in practice we seem to be able to do much better.

The main reason for the good performance of the first-fit packing scheme is the fact that there are usually enough single-transition states to fill in the holes created by larger states. It is not really necessary to sort the states by number of transitions; any packing order that distributes large and small states fairly evenly will work well. We have found it convenient simply to use the order obtained by traversing the linked trie.

Improvements on the algorithms discussed in this chapter are possible in certain cases. If we store a linked trie in a specific traversal order, we can eliminate one of the link fields. For example, if we list the nodes of the trie in preorder, the left successor of a node will always appear immediately after that node. An extra bit is used to indicate that a node has no left successor. Of course this technique works for other types of trees as well.

If the word list is already sorted, linked trie insertion can be performed with only a small portion of the trie in memory at any time, namely the portion along the current insertion path. This can be a great advantage if we are processing a large dictionary and cannot store the entire linked trie in memory.


Chapter 3

Hyphenation

Let us now try to apply the ideas of the previous chapter to the problem of hyphenation. TeX82 will use the pattern matching method described in Chapter 1, but we shall first discuss some related approaches that were considered.

Finite-state machines with output

We can modify our trie-based dictionary representation to perform hyphenation by changing the output of the trie (or finite-state machine) to a multiple-valued output indicating how the word can be hyphenated, instead of just a binary yes-no output indicating whether or not the word is in the dictionary. That is, instead of associating a single bit with each trie transition, we would have a larger "output" field indicating the hyphenation "action" to be taken on this transition. Thus on recognizing the word hy-phen-a-tion, the output would say "you can hyphenate this word after the second, sixth, or seventh letters".

To represent the hyphenation output, we could simply list the hyphen positions, or we could use a bit vector indicating the allowable hyphen points. Since there are only a few hundred different outputs and most of them occur many times, we can save some space by assigning each output a unique code and storing the actual hyphen positions in a separate table.

To conveniently handle the variable number of hyphen positions in outputs, we will use a linked representation that allows different outputs to share common portions of their output lists. This is implemented using a hash table containing pairs of the form (output, next), where output is a hyphenation position and next is a (possibly null) pointer to another entry in the table. To add a new output list to the table, we hash each of its outputs in turn, making each output point to the previous one. Interestingly, this process is quite similar to suffix compression.
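A Python sketch of these shared output lists, with integer hyphen positions as the outputs; a list is identified by the index of its last pair, so lists sharing an initial portion share table entries.

    table = {}       # (output, next) -> index of the pair
    entries = []     # index -> (output, next)

    def intern(output, nxt):
        key = (output, nxt)
        if key not in table:
            table[key] = len(entries)
            entries.append(key)
        return table[key]

    def add_output_list(hyphen_positions):
        ref = None                        # null pointer ends the list
        for pos in hyphen_positions:
            ref = intern(pos, ref)        # each output points to the previous
        return ref

    a = add_output_list([2, 6, 7])        # hy-phen-a-tion
    b = add_output_list([2, 6])           # reuses the pairs for 2 and then 6
    print(a, b, len(entries))             # 2 1 3: three pairs encode both lists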

The trie with hyphenation output can be suffix-compressed and packed in the same manner as discussed in Chapter 2. Because of the greater variety of outputs more of the subtries will be distinct, and there is somewhat less compression. From our pocket dictionary (with hyphens), for example, we obtained a packed trie occupying 51,699 locations.

We can improve things slightly by "pushing outputs forward". That is, we can output partial hyphenations as soon as possible, instead of waiting until the end of the word. This allows some additional suffix compression.

For example, upon scanning the letters hyph at the beginning of a word, we can already say "hyphenate after the second letter", because this is allowed for all words beginning with those letters. Note we could not say this after scanning just hyp, because of words like hyp-not-ic. Upon further scanning ena, we can say "hyphenate after the sixth letter".

When implementing this idea, we run into a small problem. There are quite a few words that are prefixes of other words, but hyphenate differently on the letters they have in common, e.g. ca-ret and care-tak-er, or as-pi-rin and as-pir-ing. To avoid losing hyphenation output, we could have a separate output whenever an end-of-word bit appears, but a simpler method is to append an end-of-word character to each word before inserting it into the trie. This increases the size of the linked trie considerably, but suffix compression merges most of these nodes together.

With the above modifications, the packed trie for the pocket dictionary was reduced to 44,128 transitions.

Although we have obtained substantial compression of the dictionary, the result is still too large for our purposes. The problem is that as long as we insist that only words in the dictionary be hyphenated, we cannot hope to reduce the space required to below that needed for spelling checking alone. So we must give up this restriction.

For example, we could eliminate the end-of-word bit. Then after pushing outputs forward, we can prune branches of the trie for which there is no further output. This would reduce the pocket dictionary trie to 35,429 transitions.

Minimization with don't cares

In this section we describe a more drastic approach to compression that takes advantage of situations where we "don't care" what the algorithm does.

As previously noted, most of the states in an indexed trie are quite sparse; that is, only a few of the characters have explicit transitions. Since the missing transitions are never accessed by words in our dictionary, we can allow them to be filled by arbitrary transitions.


This should not be confused with the overlapping of states that may occur in the trie-packing process. Instead, we mean that the added transitions will actually become part of the state.

There are two ways in which this might allow us to save more space in the minimization process. First, states no longer have to be identical in order to be merged; they only have to agree on those characters where both (or all) have explicit transitions. Second, the merging of non-equivalent states may allow further merging that was not previously possible, because some transitions have now become equivalent.

For example, consider again the trie of Figure 5. When discussing suffix compression, we noted that the terminal S nodes for the words HIS and THIS could be merged together, but that the parent chains, each containing transitions for A, E, and I, could not be completely merged. However, in minimization with don't cares these two states can be merged. Note that such a merge will require that the DV state below the first A be merged with the T below the second A; this can be done because those states have no overlapping transitions.

As another example, notice that if the word AN were added to our vocabulary, then the NRST chain succeeding the root A node could be merged with the NST chain below the initial I node. (Actually, it doesn't make much sense to do minimization with don't cares on a trie used to recognize words in a dictionary, but we will ignore that objection for the purposes of this example.)

Unfortunately, trie minimization with don't cares seems more complicated than the suffix-compression process of Chapter 2. The problem is that states can be merged in more than one way. That is, the collection of mergeable states no longer forms an equivalence relation, as in regular finite-state minimization. In fact, we can sometimes obtain additional compression by allowing the same state to appear more than once. Another complication is that don't care merges can introduce loops into our trie.

Thus it seems that finding the minimum size trie will be difficult. Pfleeger [25] has shown this problem to be NP-complete, by transformation from graph coloring; however, his construction requires the number of transitions per state to be unbounded. It may be possible to remove this requirement, but we have not proved this.

So in order to experiment with trie minimization with don't cares, we have made some simplifications. We start by performing suffix compression in the usual manner. We then go through the states in a bottom-up order, checking each to see if it can be merged with any previous state by taking advantage of don't cares. Note that such merges may require further merges among states already seen.
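The heart of that bottom-up pass is a test for whether two states may be merged. A sketch of the compatibility check, ignoring outputs and omitting the cascade of forced merges it would trigger:

    # Two states may merge when their explicit transitions agree wherever
    # both are defined; missing transitions are don't-cares. Outputs would
    # also have to agree, which this sketch does not model.

    def compatible(s, t, seen=None):
        """s, t: states as dicts mapping char -> successor state dict."""
        if seen is None:
            seen = set()
        key = (id(s), id(t))
        if key in seen:                 # already being checked (or a cycle)
            return True
        seen.add(key)
        for ch in s.keys() & t.keys():  # only shared characters constrain us
            if not compatible(s[ch], t[ch], seen):
                return False
        return True

    a = {"a": {}, "e": {}, "i": {"s": {}}}
    b = {"i": {"s": {}}, "o": {}}
    assert compatible(a, b)             # overlap only on 'i', and those agree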


We only try merges that actually save space, that is, where explicit transitions are merged. Otherwise, states with only a few transitions are very likely to be mergeable, but such merges may constrain us unnecessarily at a later stage of the minimization. In addition, we will not consider having multiple copies of states.

Even this simplified algorithm can be quite time consuming, so we did not try it on our pocket dictionary. On a list of 2726 technical words, don't care minimization reduced the number of states in the suffix-compressed, output-pruned trie from 1685 to just 283, while the number of transitions was reduced from 3627 to 2427. However, because the resulting states were larger, the first-fit packing performed rather poorly, producing a packed trie with 3408 transitions. So in this case don't care minimization yielded an additional compression of less than 10 percent.

Also, the behavior of the resulting hyphenation algorithm on words not in the dictionary became rather unpredictable. Once a word leaves the "known" paths of the packed trie, strange things might happen!

We can get even wilder effects by carrying the don't care assumption one step further, and eliminating the character field from the packed trie altogether (leaving just the output and trie link). Words in the dictionary will always index the correct transitions, but on other words we now have no way of telling when we have reached an invalid trie transition.

It turns out that the problem of state minimization with don't cares was studied in the 1960s by electrical engineers, who called it "minimization of incompletely specified sequential machines" (see e.g. [26]). However, typical instances of the problem involved machines with only a few states, rather than thousands as in our case, so it was often possible to find a minimized machine by hand. Also, the emphasis was on minimizing the number of states of the machine, rather than the number of state transitions.

In ordinary finite-state minimization, these are equivalent, but don't care minimization can actually introduce extra transitions, for example when states are duplicated. In the old days, finite-state machines were implemented using combinational logic, so the most important consideration was to reduce the number of states. In our trie representation, however, the space used is proportional to the number of transitions. Furthermore, finite-state machines are now often implemented using PLA's (programmed logic arrays), for which the number of transitions is also the best measure of space.

Pattern matching

Since trie minimization with don't cares still doesn't provide sufficient compression, and since it leads to unpredictable behavior on words not in the dictionary,


we need a different approach. It seems expensive to insist on complete hyphenation of the dictionary, so we will give up this requirement. We could allow some errors; or to be safer, we could allow some hyphens to be missed.

We now return to the pattern matching approach described in Chapter 1. Some further arguments as to why this method seems advantageous are given below. We should first reassure the reader that all the discussion so far has not been in vain, because a packed trie will be an ideal data structure for representing the patterns in the final hyphenation algorithm. Here the outputs will include the hyphenation level as well as the intercharacter position.

Hyphenating and inhibiting patterns allow considerable flexibility in the performance of the resulting algorithm. For example, we could allow a certain amount of error by using patterns that aren't always safe (but that presumably do find many correct hyphens).

We can also restrict ourselves to partial hyphenation in a natural way. That is, it turns out that a relatively small number of patterns will get a large fraction of the hyphens in the dictionary. The remaining hyphens become harder and harder to find, as we are left with mostly exceptional cases. Thus we can choose the most effective patterns first, taking more and more specialized patterns until we run out of space.

In addition, patterns perform quite well on words not in the dictionary, if those words follow "normal" pronunciation rules.

Patterns are "context-free"; that is, they can apply anywhere in a word. This seems to be an important advantage. In the trie-based approach discussed earlier in this chapter, a word is always scanned from beginning to end and each state of the trie 'remembers' the entire prefix of the word scanned so far, even if the letters scanned near the beginning no longer affect the hyphenation of the word. Suffix compression eliminates some of this unnecessary state information, by combining states that are identical with respect to future hyphenation. Minimization with don't cares takes this further, allowing 'similar' states to be combined as long as they behave identically on all characters that they have in common.

However, we have seen that it is difficult to guide the minimization with don't cares to achieve these reductions. Patterns embody such don't care situations naturally (if we can find a good way of selecting the patterns).

The context-free nature of patterns helps in another way, as explained below. Recall that we will use a packed trie to represent the patterns. To find all patterns that match in a given word, we perform a search starting at each letter of the word. Thus after completing a search starting from some letter position, we may have to


back up in the word to start the next search. By contrast, our original trie-based approach works with no backup.

Suppose we wanted to convert the pattern trie into a finite-state recognizer that works with no backup. This can be done in two stages. We first add "failure links" to each state that tell which state to go to if there is no explicit transition for the current character of the word. The failure state is the state in the trie that we would have reached, if we had started the search one letter later in the word.

Next, we can convert the failure-link machine into a true finite-state machine by filling in the missing transitions of each state with those of its failure state. (For more details of this process, see [27], [28].)
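A sketch of the first stage, building failure links breadth-first in the style of [27] and [28]; the trie is nested dicts, with 'fail' as a reserved key (an illustration, not the thesis's representation):

    # The failure state of a node is where the search would be if it had
    # started one letter later in the word; built level by level.

    from collections import deque

    def add_failure_links(root):
        root["fail"] = None
        queue = deque()
        for ch, child in root.items():
            if ch == "fail":
                continue
            child["fail"] = root            # depth-1 nodes fall back to root
            queue.append(child)
        while queue:
            node = queue.popleft()
            for ch, child in node.items():
                if ch == "fail":
                    continue
                f = node["fail"]
                while f is not None and ch not in f:
                    f = f["fail"]           # follow failures until ch exists
                child["fail"] = f[ch] if f is not None else root
                queue.append(child)

    trie = {"h": {"e": {}}, "e": {}}
    add_failure_links(trie)
    assert trie["h"]["e"]["fail"] is trie["e"]   # "he" falls back to "e"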

However, the above state merging will introduce a lot of additional transitions. Even using failure links requires one additional pointer per state. Thus by performing pattern matching with backup, we seem to save a good deal of space. And in practice, long backups rarely occur.

Finally, the idea of inhibiting patterns seems to be very useful. Such patterns extend the power of a finite-state machine, somewhat like adding the "not" operator to regular expressions.


Chapter 4

Pattern generation

We now discuss how to choose a suitable set of patterns for hyphenation. In order to decide which patterns are "good", we must first specify the desired properties of the resulting hyphenation algorithm.

We obviously want to maximize the number of hyphens found, minimize the error, and minimize the space required by our algorithm. For example, we could try to maximize some (say linear) function of the above three quantities, or we could hold one or two of the quantities constant and optimize the others.

For TeX82, we wanted a hyphenation algorithm meeting the following requirements. The algorithm should use only a moderate amount of space (20-30K bytes), including any exception dictionary; and it should find as many hyphens as possible, while making little or no error. This is similar to the specifications for the original TeX algorithm, except that we now hope to find substantially more hyphens.

Of course, the results will depend on the word list used. We decided to base the algorithm on our copy of Webster's Pocket Dictionary, mainly because this was the only word list we had that included all derived forms.

We also thought that a larger dictionary would contain many rare or specialized words that we might not want to worry about. In particular, we did not want such infrequent words to affect the choice of patterns, because we hoped to obtain a set of patterns embodying many of the "usual" rules for hyphenation.

In developing the TeX82 algorithm, however, the word list was tuned up considerably. A few thousand common words were weighted more heavily so that they would be more likely to be hyphenated. In fact, the current algorithm guarantees complete hyphenation of the 676 most common English words (according to [29]), as well as a short list of common technical words (e.g. al-go-rithm).

In addition, over 1000 "exception" words have been added to the dictionary, to ensure that they would not be incorrectly hyphenated. Most of these were found by testing the algorithm (based on the initial word list) against a larger dictionary obtained from a publisher, containing about 115,000 entries. This produced about


10,000 errors on words not in the pocket dictionary. Most of these were specialized technical terms that we decided not to worry about, but a few hundred were embarrassing enough that we decided to add them to the word list. These included compound words (camp-fire), proper names (Af-ghan-i-stan), and new words (bio-rhythm) that probably did not exist in 1966, when our pocket dictionary was originally put online.

After the word list was augmented, a new set of patterns was generated, and a new list of exceptions was found and added to the list. Fortunately this process seemed to converge after a few iterations.

Heuristics

The selection of patterns in an 'optimal' way seems very difficult. The problem is that several patterns may apply to a particular hyphen point, including both hyphenating and inhibiting patterns. Thus complicated interactions can arise if we try to determine, say, the minimum set of patterns finding a given number of hyphens. (The situation is somewhat analogous to a set cover problem.)

Instead, we will select patterns in a series of "passes" through the word list. In each pass we take into account only the effects of patterns chosen in previous passes. Thus we sidestep the problem of interactions mentioned above.

In addition, we will define a measure of pattern "efficiency" so that we can use a greedy approach in each pass, selecting the most efficient patterns.

Patterns will be selected one level at a time, starting with a level of hyphenating patterns. Patterns at each level will be selected in order of increasing pattern length.

Furthermore, patterns of a given length applying to different intercharacter positions (for example -tio and t-io) will be selected in separate passes through the dictionary. Thus the patterns of length n at a given level will be chosen in n + 1 passes through the dictionary.

At first we did not do this, but selected all patterns of a given length (at a given level) in a single pass, to save time. However, we found that this resulted in considerable duplication of effort, as many hyphens were covered by two or more patterns. By considering different intercharacter positions in separate passes, there is never any overlap among the patterns selected in a single pass.

In each pass, we collect statistics on all patterns appearing in the dictionary, counting the number of times we could hyphenate at a particular point in the pattern, and the number of times we could not.

For example, the pattern tio appears 1793 times in the pocket dictionary, and in 1773 cases we can hyphenate the word before the t, while in 20 cases we cannot. (We only count instances where the hyphen position occurs at least two letters from either edge of the word.)
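A sketch of one such counting pass, for patterns of a fixed length n and a fixed intercharacter position pos within the pattern (names and representation are illustrative):

    # Count, for every pattern of length n and a fixed position pos inside
    # it, how often the dictionary allows a hyphen there (good) and how
    # often it forbids one (bad). Words are given as e.g. "hy-phen-a-tion".

    from collections import defaultdict

    def count_pass(hyphenated_words, n, pos):
        counts = defaultdict(lambda: [0, 0])      # pattern -> [good, bad]
        for hw in hyphenated_words:
            letters = hw.replace("-", "")
            breaks, i = set(), 0
            for ch in hw:
                if ch == "-":
                    breaks.add(i)                 # hyphen allowed after letter i
                else:
                    i += 1
            for start in range(len(letters) - n + 1):
                cut = start + pos                 # hyphen position being judged
                if cut < 2 or cut > len(letters) - 2:
                    continue                      # two-letter margin at each edge
                pat = letters[start:start + n]
                counts[pat][0 if cut in breaks else 1] += 1
        return counts

    stats = count_pass(["hy-phen-a-tion", "con-cat-e-na-tion"], n=3, pos=0)
    # stats["tio"] == [2, 0]: both words allow a hyphen just before "tio".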

These counts are used to determine the efficiency rating of patterns. For example, if we are considering only "safe" patterns, that is, patterns that can always be hyphenated at a particular position, then a reasonable rating is simply the number of hyphens found. We could then decide to take, say, all patterns finding at least a given number of hyphens.

However, most of the patterns we use will make some error. How should these patterns be evaluated? In the worst case, errors can be handled by simply listing them in an exception dictionary. Assuming that one unit of space is required to represent each pattern as well as each exception, the "efficiency" of a pattern could be defined as eff = good / (1 + bad), where good is the number of hyphens correctly found and bad is the number of errors made.

(The space used by the final algorithm really depends on how much compression is produced by the packed trie used to represent the patterns, but since it is hard to predict the exact number of transitions required, we just use the number of patterns as an approximate measure of size.)

By using inhibiting patterns, however, we can often do better than listing the exceptions individually. The quantity bad in the above formula should then be devalued a bit depending on how effective patterns at the next level are. So a better formula might be

    eff = good / (1 + bad/bad_eff),

where bad_eff is the estimated efficiency of patterns at the next level (inhibiting errors at the current level).

Note that it may be difficult to determine the efficiency at the next level, when we are still deciding what patterns to take at the current level! We will use a pattern selection criterion of the form eff > thresh, but we cannot predict exactly how many patterns will be chosen and what their overall performance will be. The best we can do is use reasonable estimates based on previous runs of the pattern generation program. Some statistics from trial runs of this program are presented later in this chapter.
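The two ratings can be stated compactly in code; bad_eff is the estimate just described, taken from previous runs:

    # A safe pattern is rated by the hyphens it finds; an unsafe one pays
    # for each error, discounted when the next (inhibiting) level with
    # estimated efficiency bad_eff will clean most of the errors up.

    def efficiency(good, bad, bad_eff=None):
        if bad == 0:
            return float(good)                  # safe pattern: rating = good
        if bad_eff is None:
            return good / (1 + bad)             # each error costs one exception
        return good / (1 + bad / bad_eff)       # errors handled downstream

    # Selection takes every pattern with efficiency(...) >= thresh; the
    # equivalent linear form good*good_wt - bad*bad_wt > thresh appears below.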

Collecting pattern statistics

So the main task of the pattern generation process is to collect count statistics about patterns in the dictionary. Because of time and space limitations this becomes an interesting data structure exercise.


For short (length 2 and 3) patterns, we can simply use a table of size 26^2 or 26^3, respectively, to hold the counts during a pass through the dictionary. For longer patterns, this is impractical.

Here's the first approach we used for longer patterns. In a pass through the dictionary, every occurrence of a pattern is written out to a file, along with an indication of whether or not a hyphen was allowed at the position under consideration. The file of patterns is sorted to bring identical patterns together, and then a pass is made through the sorted list to compile the count statistics for each pattern.

This approach makes it feasible to collect statistics for longer length patterns, and was used to conduct our initial experiments with pattern generation. However it is still quite time and space consuming, especially when sorting the large lists of patterns. Note that an external sorting algorithm is usually necessary.

Since only a fraction of the possible patterns of a particular length actually occur in the dictionary, we could instead store them in a hash table or one of the other data structures discussed in Chapter 2. It turns out that a modification of our packed trie data structure is well-suited to this task. The advantages of the packed trie are very fast lookup, compactness, and graceful handling of variable length patterns.

Combined with some judicious "pruning" of the patterns that are considered, the memory requirements are much reduced, allowing the entire pattern selection process to be carried out "in core" on our PDP-10 computer.

By "pruning" patterns we mean the following. If a pattern contains a shorterpattern at the same level that has already been chosen, the longer pattern obviouslyneed not be considered, so we do not have to count its occurrences. Similarly, ifa pattern appears so few times in the dictionary thtt under the current selectioncriterion it can never be chosen, then we can mark the pattern as "hopeless" sothat any longer patterns at this level containing it need not be considered.

Pruning greatly reduces the number of patterns that must be considered, especially at longer lengths.

Dynamic packed tries

Unlike the static dictionary problem considered in Chapter 2, the set of patterns to be represented is not known in advance. In order to use a packed trie for storing the patterns being considered in a pass through the dictionary, we need some way to dynamically insert new patterns into the trie.

For any pattern, we start by performing a search in the packed trie as usual, following existing links until reaching a state where a new trie transition must be added. If we are lucky, the location needed by the new transition will still be empty in the packed trie; otherwise we will have to do some repacking.

Note that we will not be using suffix compression, because this complicates things considerably. We would need back pointers or reference counts to determine what nodes need to be unmerged, and we would need a hash table or other auxiliary information in order to remerge the newly added nodes. Furthermore, suffix merging does not produce a great deal of compression on the relatively short patterns we will be dealing with.

The simplest way of resolving the packing conflict caused by the addition of a new transition is to just repack the changed state (and update the link of its parent state). To maintain good space utilization, we should try to fit the modified state among the holes in the trie. This can be done by maintaining a dynamic list of unoccupied cells in the trie, and using a first-fit search.

However, repacking turns out to be rather expensive for large states that are unlikely to fit into the holes in the trie, unless the array is very sparse. We can avoid this by packing such states into the free space immediately to the right of the occupied locations. The size threshold for attempting a first-fit packing can be adjusted depending on the density of the array, how much time we are willing to spend on insertions, or how close we are to running out of room.
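A sketch of this insertion policy, with a sorted free list and a size threshold; parent-link updating and other bookkeeping are omitted:

    # Small states are first-fit packed into the holes; states larger than
    # `thresh` go straight to the free space past the occupied locations.
    # Each cell stores its character offset, marking which state owns it.

    def pack_state(array, free, state, thresh):
        """state: dict char-offset -> transition. Returns the base index."""
        offsets = sorted(state)
        base = None
        if len(state) <= thresh:
            for hole in free:                    # first-fit over the holes
                cand = hole - offsets[0]         # put first transition there
                if cand >= 0 and all(
                        cand + off >= len(array) or array[cand + off] is None
                        for off in offsets):
                    base = cand
                    break
        if base is None:
            base = len(array)                    # append past occupied area
        need = base + offsets[-1] + 1
        if need > len(array):
            array.extend([None] * (need - len(array)))
        for off in offsets:
            array[base + off] = (off, state[off])
            if base + off in free:
                free.remove(base + off)
        return base

    arr, holes = [], []
    pack_state(arr, holes, {1: "A-trans", 5: "E-trans"}, thresh=3)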

After adding the critical transition as discussed above, we may need to add some more trie nodes for the remaining characters of the new pattern. These new states contain just a single transition, so they should be easy to fit into the trie.

The pattern generation program uses a second packed trie to store the set of patterns selected so far. Recall that, before collecting statistics about the patterns in each word, we must first hyphenate the word according to the patterns chosen in previous passes. This is done not only to determine the current partial hyphenation, but also to identify pruned patterns that need not be considered. Once again, the advantages of the packed trie are compactness and very fast "hyphenation".

At the end of a pass, we need to add new patterns, including "hopeless" patterns, to the trie. Thus it will be convenient to use a dynamic packed trie here as well. At the end of a level, we probably want to delete hopeless patterns from the trie in order to recover their space, if we are going to generate more levels. This turns out to be relatively easy; we just remove the appropriate output and return any freed nodes to the available list.

Below we give some statistics that will give an idea of how well a dynamic packed trie performs. We took the current set of 4447 hyphenation patterns, randomized them, and then inserted them one-by-one into a dynamic packed trie.


(Note that in the situations described above, there will actually be many searches per insertion, so we can afford some extra effort when performing insertions.) The patterns occupy 7214 trie nodes, but the packed trie will use more locations, depending on the setting of the first-fit packing threshold. The columns of the table show, respectively, the maximum state size for which a first-fit packing is attempted, the number of states packed, the number of locations tried by the first-fit procedure (this dominates the running time), the number of states repacked, and the number of locations used in the final packed trie.

    thresh    pack    first-fit    unpack    trie_max

       ∞      6113     877,301      2781       9671
      13      6060     761,228      2728       9458
       9      6074     559,835      2742       9606
       7      6027     359,537      2695       9006
       5      5863     147,468      2531     10,366
       4      5746      93,181      2414     11,209
       3      5563      33,826      2231     13,296
       2      5242      10,885      1910     15,009
       1      4847        8956      1515     16,536
       0      4577        6073      1245     18,628

Table 2. Dynamic packed trie statistics.

Experimental results

We now give some results from trial runs of the pattern generation program, and explain how the current TeX82 patterns were generated. As mentioned earlier, the development of these patterns involved some augmentation of the word list. The results described here are based on the latest version of the dictionary.

At each level, the selection of patterns is controlled by three parameters called good_wt, bad_wt, and thresh. If a pattern can be hyphenated good times at a particular position, but makes bad errors, then it will be selected if

    good * good_wt - bad * bad_wt > thresh.

Note that the efficiency formula given earlier in this chapter can be converted into the above form.

We can first try using only safe patterns, that is, patterns that can always be hyphenated at a particular position. The table below shows the results when all safe patterns finding at least a given number of hyphens are chosen.


    parameters    patterns    hyphens    percent

    1   ∞   40        401     31,083      35.2%
    1   ∞   20       1024     45,310      51.3%
    1   ∞   10       2272     58,580      66.3%
    1   ∞    5       4603     70,014      79.2%
    1   ∞    3       7052     76,236      86.2%
    1   ∞    2     10,456     83,450      94.4%
    1   ∞    1     16,336     87,271      98.7%

Table 3. Safe hyphenating patterns.

Note that an infinite bad_wt ensures that only safe patterns are chosen. The table shows the number of patterns obtained, and the number and percentage of hyphens found.

We see that, roughly speaking, halving the threshold doubles the number of patterns, but only increases the percentage of hyphens by a constant amount. The last 20 percent or so of hyphens become quite expensive to find.

(In order to save computer time, we have only considered patterns of length 6 or less in obtaining the above statistics, so the figures do not quite represent all patterns above a given threshold. In particular, the patterns at threshold 1 do not find 100% of the hyphens, although even with indefinitely long patterns there would still be a few hyphens that would not be found, such as re-cord.)

The space required to represent patterns in the final algorithm is slightly more than one trie transition per pattern. Each transition occupies 4 bytes (1 byte each for character and output, plus 2 bytes for trie link). The output table requires an additional 3 bytes per entry (hyphenation position, value, and next output), but there are only a few hundred outputs. Thus to stay within the desired space limitations for TeX82, we can use at most about 5000 patterns.

We next try using two levels of patterns, to see if the idea of inhibiting patterns actually pays off. The results are shown below, where in each case the initial level of hyphenating patterns is followed by a level of inhibiting patterns that removes nearly all of the error. (For level 1 rows, the two numbers in the hyphens column are good hyphens found and errors made; for level 2 rows, they are good and bad hyphens inhibited.)

               hyphens             percent

    level 1    51,359     505     58.1%   0.6%
    level 2         0     463     58.1%   0.1%

    level 1    64,893    1694     73.5%   1.9%
    level 2         0    1531     73.5%   0.2%

    level 1    76,632    5254     86.7%   5.9%
    level 2         0    4826     86.7%   0.5%

Table 4. Two levels of patterns.

The last set of patterns achieves 86.7% hyphenation using 4696 patterns. By contrast, the 1 ∞ 3 patterns from the previous table achieve 86.2% with 7052 patterns. So inhibiting patterns do help. In addition, notice that we have only used "safe" inhibiting patterns above; this means that none of the good hyphens are lost. We can do better by using patterns that also inhibit some correct hyphens.

After a good deal of further experimentation, we decided to use five levels of patterns in the current TeX82 algorithm. The reason for this is as follows. In addition to finding a high percentage of hyphens, we also wanted a certain amount of guaranteed behavior. That is, we wanted to make essentially no errors on words in the dictionary, and also to ensure complete hyphenation of certain common words.

To accomplish this, we use a final level of safe hyphenating patterns, with the threshold set as low as feasible (in our case 4). If we then weight the list of important words by a factor of at least 4, the patterns obtained will hyphenate them completely (except when a word can be hyphenated in two different ways).

To guarantee no error, the level of inhibiting patterns immediately preceding the final level should have a threshold of 1, so that even patterns applying to a single word will be chosen. Note these do not need to be "safe" inhibiting patterns, since the final level will pick up all hyphens that should be found.

The problem is, if there are too many errors remaining before the last inhibiting level, we will need too many patterns to handle them. If we use three levels in all, then the initial level of hyphenating patterns can allow just a small amount of error.

However, we would like to take advantage of the high efficiency of hyphenating patterns that allow a greater percentage of error. So instead, we will use an initial level of hyphenating patterns with relatively high threshold and allowing considerable error, followed by a 'coarse' level of inhibiting patterns removing most of the initial error. The third level will consist of relatively safe hyphenating patterns with a somewhat lower threshold than the first level, and the last two levels will be as described above.

The above somewhat vague considerations do not specify the exact pattern selection parameters that should be used for each pass, especially the first three passes. These were only chosen after much trial and error, which would take too long to describe here. We do not have any theoretical justification for these parameters; they just seem to work well.

The table below shows the parameters used to generate the current set of TeX82 patterns, and the results obtained. For levels 2 and 4, the numbers in the "hyphens" column show the number of good and bad hyphens inhibited, respectively. The numbers in parentheses indicate the maximum length of patterns chosen at that level.

    level   parameters      patterns    hyphens           percent

      1     1  2  20  (4)      458      67,604  14,156    76.6%  16.0%
      2     2  1   8  (4)      509       7,407  11,942    68.2%   2.5%
      3     1  4   7  (5)      985      13,198     551    83.2%   3.1%
      4     3  2   1  (6)     1647       1,010   2,730    82.0%   0.0%
      5     1  ∞   4  (8)     1320       6,428       0    89.3%   0.0%

Table 5. Current TeX82 patterns.

A total of 4919 patterns (actually only 4447 because some patterns appear more than once) were obtained, compiling into a suffix-compressed packed trie occupying 5943 locations, with 181 outputs. As shown in the table, the resulting algorithm finds 89.3% of the hyphens in the dictionary. This improves on the one and two level examples discussed above. The patterns were generated in 109 passes through the dictionary, requiring about 1 hour of CPU time.

Examples

The complete list of hyphenation patterns currently used by TeX82 appears in the appendix. The digits appearing between the letters of a pattern indicate the hyphenation level, as discussed above.

Below we give some examples of the patterns in action. For each of the following words, we show the patterns that apply, the resulting hyphenation values, and the hyphenation obtained. Note that if more than one hyphenation value is specified for a given intercharacter position, then the higher value takes priority, in accordance with our level scheme. If the final value is odd, the position is an allowable hyphen point.
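That procedure is small enough to state as runnable code. The sketch below scans a plain pattern list instead of a packed trie, ignores the word-boundary patterns (those written with a leading or trailing dot in the appendix), and uses only the handful of patterns from the first example that follows:

    # Match every pattern at every starting position, keep the maximum
    # value in each intercharacter gap, and allow a break where the final
    # value is odd.

    def parse(pattern):
        """'put3er' -> ('puter', [0,0,0,3,0,0]): one value per gap."""
        letters, values = "", [0]
        for ch in pattern:
            if ch.isdigit():
                values[-1] = int(ch)
            else:
                letters += ch
                values.append(0)
        return letters, values

    def hyphenate(word, patterns):
        parsed = [parse(p) for p in patterns]
        vals = [0] * (len(word) + 1)
        for start in range(len(word)):
            for letters, pv in parsed:
                if word.startswith(letters, start):
                    for i, v in enumerate(pv):
                        vals[start + i] = max(vals[start + i], v)
        pieces, last = [], 0
        for i in range(2, len(word) - 1):   # keep two letters at each edge
            if vals[i] % 2:                 # odd final value = allowable break
                pieces.append(word[last:i])
                last = i
        pieces.append(word[last:])
        return "-".join(pieces)

    print(hyphenate("computer", ["4m1p", "pu2t", "5pute", "put3er"]))
    # prints: com-put-er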

    computer        4m1p pu2t 5pute put3er
                    co4m5pu2t3er                    com-put-er

    algorithm       l1g4 1go3 2ith 4hm
                    al1g4o3r2it4hm                  al-go-rithm

    hyphenation     hy3ph he2n hena4 hen5at 1na n2at 1tio 2io
                    hy3phe2n5a4t2ion                hy-phen-ation

    concatenation   o2n on1c 1ca 1na n2at 1tio 2io
                    co2n1cate1n2a1t2ion             con-cate-na-tion

    mathematics     math3 ath5em th2e 1ma at1ic 4cs
                    math5e1mat1i4cs                 math-e-mat-ics

    typesetting     type3 e1s2e 4t3t2 2t1in
                    type3s2e4t3t2ing                type-set-ting

    program         pr2 1gr
                    pr2o1gram                       pro-gram

    supercalifragilisticexpialidocious
                    u1pe r1c 1ca al1i ag1i gil4 il1i il4ist is1ti st2i s1tic
                    1exp x3p pi3a 2i1a i2al 2id 1do 1ci 2io 2us
                    su1per1cal1ifrag1il4is1t2ic1ex3p2i3al2i1do1c2io2us
                    su-per-cal-ifrag-ilis-tic-ex-pi-ali-do-cious

"Below, we show a few interesting patterns. The reader may like to try figuring

out what words they apply to. (The answers appear in the Appendix.)

    ain5o    ay5al
    ear5k
    e2mel
    hach4    h5elo
    if4fr
    l5ogo
    n3uin    nyp4
    o5a5les
    orew4
    5spai    4tarc
    4todo
    uir4m

And finally, the following patterns deserve mention:

    3tex    fon4t    high5


Chapter 5

History and Conclusion

The invention of the alphabet was one of the greatest advances in the history of civilization. However, the ancient Phoenicians probably did not anticipate the fact that, centuries later, the problem of word hyphenation would become a major headache for computer typesetters all over the world.

Most cultures have evolved a linear style of communication, whereby a train of thought is converted into a sequence of symbols, which are then laid out in neat rows on a page and shipped off to a laser printer.

The trouble was, as civilization progressed and words got longer and longer, it became occasionally necessary to split them across lines. At first hyphens were inserted at arbitrary places, but in order to avoid distracting breaks such as the-rapist, it was soon found preferable to divide words at syllable boundaries.

Modern practice is somewhat stricter, avoiding hyphenations that might cause the reader to pronounce a word incorrectly (e.g. considera-tion) or where a single letter is split from a component of a compound word (e.g. cardi-ovascular).

The first book on typesetting, Joseph Moxon's Mechanick Exercises (1683), mentions the need for hyphenation but does not give any rules for it. A few dictionaries had appeared by this time, but were usually just word lists. Eventually they began to show syllable divisions to aid in pronunciation, as well as hyphenation.

With the advent of computer typesetting, interest in the problem was renewed. Hyphenation is the 'H' of 'H & J' (hyphenation and justification), which are the basic functions provided by any typesetting system. The need for automatic hyphenation presented a new and challenging problem to early systems designers.

Probably the first work on this problem, as well as many other aspects of computer typesetting, was done in the early 1950s by a French group led by G. D. Bafour. They developed a hyphenation algorithm for French, which was later adapted to English [U.S. Patent 2,702,485 (1955)].

Their method is quite simple. Hyphenations are allowed anywhere in a word except among the following letter combinations: before two consonants, two vowels,


or x; between two vowels, consonant-h, e-r, or s-s; after two consonants where the first is not l, m, n, r, or s; or after c, j, q, v, consonant-w, mm, lr, nb, nf, nl, nm, nn, or nr.

We tested this method on our pocket dictionary, and it found nearly 70 percent of the hyphens, but also about an equal number of incorrect hyphens! Viewed in another way, about 65% of the erroneous hyphen positions are successfully inhibited, along with 30% of the correct hyphens. It turns out that a simple algorithm like this one works quite well in French; however, for English this is not the case.

Other early work on automatic hyphenation is described in the proceedings of various conferences on computer typesetting (e.g. [30]). A good summary appears in [31], from which the quotes in the following paragraphs were taken.

At the Los Angeles Times, a sophisticated logical routine was developed based on the grammatical rules given in Webster's, carefully refined and adapted for computer implementation. Words were analyzed into vowel and consonant patterns which were classified into one of four types, and rules governing each type applied. Prefix, suffix, and other special case rules were also used. The results were reportedly "85-95 percent accurate", while the hyphenation logic occupies "only 5,000 positions of the 20,000 positions of the computer's magnetic core memory, less space than would be required to store 500 8-letter words averaging two hyphens per word."

Perry Publications in Florida developed a dictionary look-up method, along with their own dictionary. An in-core table mapped each word, depending on its first two letters, into a particular block of words on tape. For speed, the dictionary was divided between four tape units, and "since the RCA 301 can search tape in both directions," each tape drive maintained a "homing position" at the middle of the tape, with the most frequently searched blocks placed closest to the homing positions.

In addition, they observed that many words could be hyphenated after the 3rd, 5th, or 7th letters. So they removed all such words from the dictionary (saving some space), and if a word was not found in the dictionary, it was hyphenated after the 3rd, 5th, or 7th letter.

A hybrid approach was developed at the Oklahoma Publishing Company. First some logical analysis was used to determine the number of syllables, and to check if certain suffix and special case rules could be applied. Next the probability of hyphenation at each position in the word was estimated using three probability tables, and the most probable breakpoints were identified. (This seems to be the origin of the Time magazine algorithm described in Chapter 1.) An exception


dictionary handles the remaining cases; however there was some difference of opinion as to the size of the dictionary required to obtain satisfactory results.

Many other projects to develop hyphenation algorithms have remained proprietary or were never published. For example, IBM alone worked on "over 35 approaches to the simple problem of grammatical word division and hyphenation".

By now, we might have hoped that an "industry standard" hyphenation algorithm would exist. Indeed Berg's survey of computerized typesetting [32] contains a description of what could be considered a "generic" rule-based hyphenation algorithm (he doesn't say where it comes from). However, we have seen that any logical routine must stop short of complete hyphenation, because of the generally illogical basis of English word division.

The trend in modern systems has been toward the hybrid approach, where a logical routine is supplemented by an extensive exception dictionary. Thus the in-core algorithm serves to reduce the size of the dictionary, as well as the frequency of accessing it, as much as possible.

A number of hyphenation algorithms have also appeared in the computer science literature. A very simple algorithm is described by Rich and Stone [33]. The two parts of the word must include a vowel, not counting a final e, ee or ed. The new line cannot begin with a vowel or double consonant. No break is made between the letter pairs sh, gh, ph, ch, th, wh, gr, pr, cr, tr, wr, br, fr, dr, vowel-r, vowel-n, or om. On our pocket dictionary, this method found about 70% of the hyphens with 45% error.

The algorithm used in the Bell Labs document compiler Roff is described by Wagner [34]. It uses suffix stripping, followed by digram analysis carried out in a back-to-front manner. In addition a more complicated scheme is described using four classes of digrams combined with an attempt to identify accented and nonaccented syllables, but this seemed to introduce too many errors. A version of the algorithm is described in [35]; interestingly, this reference uses the terms "hyphenating pattern" (referring to a Snobol string-matching pattern) as well as "inhibiting suffix".

Ocker [36], in a master's thesis, describes another algorithm based on the rules in Webster's dictionary. It includes recognition of prefixes, suffixes, and special letter combinations that help in determining accentuation, followed by an analysis of the "liquidity" of letter pairs to find the character pair corresponding to the greatest interruption of spoken sound.

Moitra et al. [37] use an exception table, prefixes, suffixes, and a probabilistic break-value table. In addition they extend the usual notion of affixes to any letter pattern that helps in hyphenation, including 'root words' (e.g. lint, pot) intended to handle compound words.

Patterns as paradigm

Our pattern matching approach to hyphenation is interesting for a number of reasons. It has proved to be very effective and also very appropriate for the problem. In addition, since the patterns are generated from the dictionary, it is easy to accommodate changes to the word list, as our hyphenation preferences change or as new words are added. More significantly, the pattern scheme can be readily applied to different languages, if we have a hyphenated word list for the language.

The effectiveness of pattern matching suggests that this paradigm may be useful in other applications as well. Indeed more general pattern matching systems and the related notions of production systems and augmented transition networks (ATN's) are often used in artificial intelligence applications, especially natural language processing. While AI programs try to understand sentences by analyzing word patterns, we try to hyphenate words by analyzing letter patterns.

One simple extension of patterns that we have not considered is the idea of character groups such as vowels and consonants, as used by nearly all other algorithmic approaches to hyphenation. This may seem like a serious omission, because a potentially useful meta-pattern like 'vowel-consonant-consonant-vowel' would then expand to 6 x 20 x 20 x 6 = 14,400 patterns. However, it turns out that a suffix-compressed trie will reduce this to just 6 + 20 + 20 + 6 = 52 trie nodes. So our methods can take some advantage of such "meta-patterns".

In addition, the use of inhibiting as well as hyphenating patterns seems quite powerful. These can be thought of as rules and exceptions, which is another common AI paradigm.

Concerning related work in AI, we must especially mention the Meta-DENDRAL program [38], which is designed to automatically infer rules for mass spectrometry. An example of such a rule is N-C-C-C -> N-C * C-C, which says that if the molecular substructure on the left side is present, then a bond fragmentation may occur as indicated on the right side. Meta-DENDRAL analyzes a set of mass-spectral data points and tries to infer a set of fragmentation rules that can correctly predict the spectra of new molecules. The inference process starts with some fairly general rules and then refines them as necessary, using the experimental data as positive or negative evidence for the correctness of a rule.


The fragmentation rules can in general be considerably more complicated than our simple pattern rules for hyphenation. The molecular "pattern" can be a tree-like or even cyclic structure, and there may be multiple fragmentations, possibly involving "migration" of a few atoms from one fragment to another. Furthermore, there are usually extra constraints on the form of rules, both to constrain the search and to make it more likely that meaningful or "interesting" rules will be generated. Still, there are some striking similarities between these ideas and our pattern-matching approach to hyphenation.

Packed tries

Finally, the idea of packed tries deserves further investigation. An indexed trie can be viewed as a finite-state machine, where state transitions are performed by address calculation based on the current state and input character. This is extremely fast on most computers.

However indexing usually incurs a substantial space penalty because of space reserved for pointers that are not used. Our packing technique, using the idea of storing the index character to distinguish transitions belonging to different states, combines the best features of both the linked and indexed representations, namely space and speed. We believe this is a fundamental idea.
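A sketch of that address-calculation lookup; the array-of-tuples layout and character coding are illustrative, not the thesis's exact format:

    # Each transition lives at trie[state + code(ch)] and belongs to the
    # current state only if it stores ch itself; a mismatch means the state
    # has no transition on ch. One array probe per input letter.

    def code(ch):
        return ord(ch) - ord('a') + 1          # 1..26; 0 reserved

    def accepts(trie, word):
        """trie: list of None or (ch, next_state, is_word) tuples."""
        state, is_word = 0, False
        for ch in word:
            slot = state + code(ch)
            if slot >= len(trie) or trie[slot] is None or trie[slot][0] != ch:
                return False                   # no such transition
            _, state, is_word = trie[slot]
        return is_word

    trie = [None] * 20
    trie[0 + code('h')] = ('h', 10, False)
    trie[10 + code('i')] = ('i', 0, True)
    assert accepts(trie, "hi") and not accepts(trie, "ha")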

There are various issues to be explored here. Some analysis of different packing methods would be interesting, especially for the handling of dynamic updates to a packed trie.

Our hyphenation trie extends a finite-state machine with its hyphenation "actions". It would be interesting to consider other applications that can be handled by extending the basic finite-state framework, while maintaining as much of its speed as possible.

Another possibly interesting question concerns the size of the character and pointer fields in trie transitions. In our hyphenation trie half of the space is occupied by the pointers, while in our spelling checking examples from one-half to three-fourths of the space is used for pointers, depending on the size of the dictionary. In the latter case it might be better to use a larger "character" size in the trie, in order to get a better balance between pointers and data.

When performing a search in a packed trie, following links will likely make us jump around in the trie in a somewhat random manner. This can be a disadvantage, both because of the need for large pointers, and also because of the lack of locality, which could degrade performance in a virtual memory environment. There are probably ways to improve on this. For example, Fredkin [10] proposes an interesting 'n-dimensional binary trie' idea for reducing pointer size.


We have presented packed tries as a solution to the set representation problem, with special emphasis on data compression. It would be interesting to compare our results with other compression techniques, such as Huffman coding. Also, perhaps one could estimate the amount of information present in a hyphenated word list, as a lower bound on the size of any hyphenation algorithm.

Finally, our view of finite-state machines has been based on the underlying assumption of a computer with random-access memory. Addressing by indexing seems to provide power not available in some other models of computation, such as pointer machines or comparison-based models. On the other hand, a VLSI or other hardware model (such as programmed logic arrays) can provide even greater power, eliminating the need for our perhaps contrived packing technique. But then other communication issues will be raised.

If all problems of hyphenation have not been solved,
at least some progress has been made since that night,
when according to legend, an RCA Marketing Manager
received a phone call from a disturbed customer.
His 301 had just hyphenated "God".

    — Paul E. Justus (1972)


TeX82 hyphenation patterns

[The complete multi-column list of the 4,447 distinct TeX82 hyphenation patterns (beginning .ach4, .ad4der, ...) appeared here; it is not legibly recoverable from this scan.]

J

klakEneaIk2nokoSrkoah4k3onkroSn4kla2k4aeka41MaykStklwIab3ic14abolacUHadeIa3dyIag4nIara3o31andIau4dllanSetIan4taIar4gIar3iIas4aIa6tan41atall41atlr41»TIa4r4a211bIbin4411c2Ice413cl21d12daId4eraId4eriIdl4ldSla13dr14dril«2aIe4bllaftSSlag.ElaggIa4natleaSatlc41an.3 lanefilana.llantIe3phIe4prleraSbIer4a31erg314eri14erolo«2laSacaBlaaq

y

31eaaEleaa.13araIer4er.leT4araIev4era31ay41eya211lEtr411g4lSgalgarS14gaaIgo3213h114agU2aaliarElt114»«114atoHSblSllclo114cor411ca411ct.141cn13icy131dalldSarSlldlIlt3ar1411111411filigataSUgh114gra311k414141

•Iln4blIla31U4ao141a4p141na1141naIln3aaUn31llnkSarUSog414iqIli4plilt12it.SlltlcalSlStlctIlr3arIlls41jIka313kalIka4t111141av121aISlea131ac

y

131eg131al131a4n131a4t11211211n4lSllna114elloqalSllEoat161ov21alEaatIa31ng14aodI>on42 1 1 B 2

31o.lobSalIo4cl4 lotSlogleISogoSlognIoa3arElongIon4113o3niiloodEPlop*.lopSl13opaIora4Io4ratoloSrlalorfiosEloa. <loaSatSlotophliEloiophfIoa4tIo4talonnSd21ont41or21plpaSb13phalfiphllpSlng13plt14pllEpr411r211a214ac12aa14.1a41tItiagItana5lltalUa4lt«ra«1U31Utlaa.

y

Page 55: Word Hy-phen-a-tion by Com-put«er - CiteSeer

78 T\jX82 HYPHENATION PATTERNS

Hif4lltrItu2Itur3iluBaIn3brIuch4Iu3ciIu3enlaMluBldIn4aaEluailSumn.61aanlaIu3oIuo3r41upIuaa4Im3t«lintlEren15ret4211*

11741ya41yblyBaelySno21ya4lfyaalaa2mab•a2caBaBchlMaa4clEagSiaSaagn2mahnaidS4maldna31igBtaSlln•al41inal4tyBaanlacanBliman31t4mapBiaorine.aaErizmar41y•ar3vmaSac*nas4eaaalt5matenath3na3tia4matiia4mlbBba4t5mSbll!>4b3ingnbl4r4»5c

./

4M.

2ned4 med.SoedlaaeSdie•SeBdy•e2gaelEonme!4tne2«aemloSlmenaen4amenBacnen4d«4 mononcn41msni4men«u63mentaen4t«neBonaSeraa2nesSmestiae4ta•et3aloeltemeSthia4atrSnetricBeStri*Eie3tryBe4T4mlf2ah6al.nl3a•id4a•Id4grig4Smllit•SiSlis•4111•In4aSBlndaSineea4inglBlnSgliaSinglyain4ta4inuBlOt4a2iania4er.BlaBlaia4tlB51atry4althu21i4mk4allillBuiaSry4aln«jv4a,/

J

a4ninnn4oIBO

4mocrSnocratizmo2dlBo4gOBola2aoiSaa4BOIC

aoSleatuo3meBonSetBon6g«aonl3aBon4iiaaon4iatao3nizaonol4ao3ny.ao2r4aora.aoa2moSsey .ao3spaoth3nSouf3mou*B O 2 T

4mlpnparaSnpaSrabaparSla3petaphaa4a2plmpi4aapSieto4plinnSpirmpBifnpo3riapoaSlt*a4poua•poTSap4trm2py4n3r4mla2B4ahmSal4atlaanalaSr45 multBUlti33moamun24 supau4n •4a*laa2nla2bn4abu '4aac.na4ca

J

nSactnagSer.nak4na41inaBlla4naltnaSmitn2annancl4nan4itnank4nar3c4narenar31nar41nBaran4aana«4cnasBtin2atna3talnatoSmizn2aunau3se3nautnav4o4nlb4ncarSn4ces.n3chanScheonSchiln3chisnclinnc4itncourBanlcrnlcun4dalnSdannldendSoat.ndl4bn5d2ifnlditr.3diinSducndu4rnd2we2ne.n3earne2bnob3uno2cBneck2nedne4gatnegSatlrBnegane41anelSlzne5aine4molnon4nene3neo

J

ne4pone2qnlerneraSbn4erarn2eren4erBiner4rlnei2ne«.4neip2neat4nea*Snetlcne4rnSerane4*n3fn4gabn3gelnge4n4en&gerenSgeringShan3gibnglinn5gitn4glangov4ng&thnigun4gumn2gy4nlh4nhs4nhab3nhe43n4iani3ar>ni4apni3lani4blni4dniSdini4ernl2flniEficatnBigrnik4nilsnl3miiniinBnine.nin4gni4oSnli.nis4tan2itr.4ith3nition3itornl3trnlj4nk2n5keron3ket

J

nk3J.nnlkl4nllnEmnne4iunet44nln2nne4nni3alnni4Tnob41no3bl«nSocl4n3o2d3noe4nognoge4noieBino514iBnologia3nomicnBoSmiano4mono3royno4nnon4agnonSin5oniz4nopEnop5oSlinorSabno4rary4noscnos4enoaStnoStalnou3nounnov3ol3no*13nlp4npl4npre4cnlqnlrnru42nl«2nsBabnsatl4ns4cn2sen4e3eanaldlnslg4n2«lna3nn4socns4penfispinsta5blnitnta4bnter3»nt21n5tibnti4or

J

nti2fn3tin«n4t31ngnti4pntrolSlint4»ntu3aenulanu4dnu5ennuMfen3uln3nu31tn4uanulmanBuml3nu4nn3uonu3trnlr2nl«4nyn4nyp44nzn3za4oaoad3oSaSleaoard3oa84eoastBaoatBiob3a3boSbarobe41olblo2binobSingo3brob3uloScooch4o3chetoclf3o4cllo4clamo4codoc3racocfiratliocre3BocritoctorBaoc3ulaoScureodBdedod31codi3oo2do4odor3odSuct.od5uctao4eloBengo3eroe4taooeY

o2floi51teoflt4to2gSaCrogBatlro4gatoolgeoSgen*oSgeoo4gero3gielolgiaog31to4glo6g21y3ognlzo4grooguBilogy2ogynolh2ohab5012olc3eaoi3deroiff4oig4oiBleto3ingointSeroSiaaolSaonolatBenoi3teroB]2oko3kenok51«ollao41anolam4ol2doldieol3aro31eaco31etol4fiol21o311ao311ceolBid.o3114foBlilol31ng06II0oBlia.ol31shoSliteoSlltlooBliT011146olSoglzolo4rolSplol2tol3ub

J

ol3ua«olSunoEluaO12T

o21yonSahoraaBlpmSatltoa2baon4blO2B«

oa3ena0Bi5eraao4aetoaBatryo3olao&3ic.oa31caoBaldomllnoSalniBoounend0B04g*o4aonoa3piocproSo2nonlaon4aco3nanonlc3oncll2ondonBdoo3nenonEeaton4guonllco3nlocnllfo6nluon3koyon4odlon3oayon3ionapl4or,«plrBaontu4onten4on3t41ontlfSonBuaonraBoo2oodBeoodSloo4koop31o3ordooatEo2paopeSdopler3opera4operag2oph

./

Page 56: Word Hy-phen-a-tion by Com-put«er - CiteSeer

TgXBJ HYPHENATION PATTERNS 79

eSphamoSpharepSingo3pitoEpono4poilolpropluopy5olqolraoSra.o4r3agorEaliXorSangaor«5»oSraalorSaloraSshorSeat.oraw4or4gu4oSrlaor3<C4o5rilorllnelrloor3ityo3riuor2«iorn2aoSroforSougorfipaSorrhor4aaor*Senorst4oi3thior3thyor4tyoEruaolryoa3aloa2co«4cao3scop4oscoploSscro.4i4eOSSitiTo«3itooa3ityosl4uo«41o2soos4paos4pooi2tio5statlos5tilo«5tito4tanotale4got3er.otSara

J

o4Ua4othothSealoth314otSlc.ctSicao3tlcaoStifo3tlsotoSaou2ou3bloachSlouSetou41ouncEaroun2dO U E T

or4enorer4na«-. arSaor4arto3vi§OTltl4o5v4olow3darow3alowSaatowliownSio4»ooylalpapa4capa4capac4tp4adSpaganp3agatp4aipain4p4alpan4apanSalpan4typa3nypalppa4puparaSblparSagaparSdl3paraparSalp4a4rlpar41apaCtapaStarSpathicpaSthypa4tricpav4Spay4plbpd44pe.3pe4a

J

pear41po2c2p2edSpedaSpedlpadla4ped4icp4aapee4dpak4pe41apeli4apa4nanp4encpen4thpaSonp4ara.paraSblp4eragp4ariparlfiatper4malparneSp4arnparSopar3tipeErupar ITpe2tpaStanpaStlx4pf4pg4ph.pharSiphe3noph4erph4ea.phlicSphiephSingEphisti3phizph213phobSphonaEphonlpho4r4phsph3tEpliulphypi3aplan4pi4ciapl4cyp41dpSldapi3deSpidl3plecpi3enpl4grappl31opl2np41n.

' J

plnd4p4ino3p j lopion4p3ithpiSthapi2ta2p3k2Ip212SplanplaaStpll3apllEer4pligpll4nplol4plu4«plnn4b<pl«2p3npo4cSpod.poSaapoSetSEpo4gpoin2EpolntpolyStpo4nipo4pIp4orpo4rylpospoolsp4otpo4taBpoun4plpppa5rap2pepipedp5pelp3penp3perp3petppoSsltapr2pray4aSprecipreScopre3emprefSacpre41apre3rp3resa3presapreStanpre3v5prl4aprin4t3pri4aprls3op3rocaprofEltpro31pros3a

J

prolt2pls2p2aaps4hp4aib2pltpt5»4bp2t«p2thptl3ajptu4rp4tvpub 3puo4puf4pulScpo4apu2npur4rEpuspu2tEputaputSsrpu3trput4tadput4tlnP3wqu2quaSr2qua.3quer3quat2rabra3blrach4er5aclrai5firaf4tr2alra41oran3etr2amirane5oran4gar4aniraSnorap3er3raphyrarScrare4rarSef4rarilr2aaratlon4rau4traSvairay3elraSiloribr4bab14 bagrbi2rbi4fr2binrSblnarbSlng.

J

rb4orlcr2carcan4 r>er3cha*rch4arr4cl4brc4itrcu>3r4dalrd21rdl4ardl4errdln4rdSlng2ra.ralalre3anraSarrSraavre4i*rEabratraeSollrecEonpare4cre2r2adraldare3disradSitre4facra2fareSfar.re3Hra4fyreg3isre5itrollireSlur4en4taren4tareloroSplnre4posirelpnrler4r4erlrero4reSrnr4ea.re4splresaSibres2treSstalreSstrre4tarre4t!4xre3trireu2reSutlrey2re4Talrar3alrSeySer.reSvorsreSyerCreS?ll

J

rer5ol«r»4wh

rlf

r4fy

rg2rg3arr3gatrSgiergl4nrgSlngrEglsrEgltrlglrgo4nr3g«

rh44rh.4rhilrl3ari*4brl4agr4ibribSaricEasr4ica4riclSrlcldri4clar41coridSarri3ancriSentrilerrlSetrlgEaft6rigiril3ixErinanrimSi3rimorU4par2inaSrina.rin4drin4orin4griloSrlphrlphSari2plripSlier4iqr31ir4ia.ri«4cr3ishrla4pri3ta3brSltad.ritEer.ritSersrit31cri2turltSurrivSel

J

ritSatrirSi

r3jr3katrk41ark41in

rllrl«4r21adr4Ugr41isrlElakrSlo4

rlar«a5cr2aarSaaaraEeraraSlngr4Blng.r4alorSait

r4«yr4narr3n*lr4nerrSnatrSnayrSnlcrlnls4rSnitrSnlTrno4r4aoar3n«robSlr2ocroScr •ro4aroliaroSfllrok2roEkerErole.roa>Sat«rom4irom4pron4alron4eroSn4iaron4talrooaSrootro3pelrop3icror3iroSroros5parros4aro4tharo4ty •ro4varovSolrox5

ripr4paa

/

r!pr«trpSar.r3patrp4h4rpSingr3po

rlr4rre4e

rr«4lr4reorre4ft

rrl4arrt4»rrou4rroa4rry§44ra2rlaanaBtl

r*4cr2tarSaacraa4crraEar.r»3aaraaSrSrlakrSaharialr4sl4braonSrlaprEavrtach4r4tagrStabrt«n4drtaSorltlrtBlbrtl4dr4tiorr3tlgrtllStrtiUlr4tilyr4tlatr4tlTr3trirtroph4rt4ahru3aru3e41ru3enru4glru3ianw3plru2arunkSrun4tyrSaacrutlEnrv4ervel41r3venryBer.

v/

Page 57: Word Hy-phen-a-tion by Com-put«er - CiteSeer

80 1£X8> HYPHENATION PATTERNS

rEreetrSreyr3ricT T 1 4 T

r3rorlw>7<cSryngarjr3t•a22* labStacktacSrltSaetStai•»lir4•a!4ataSlo•al4t3aanc•an4da• lap•aStaSiaStle•atSn•au4•aSTorfiia*4«Sb•can4tS• ca4p•cavS• 4ced4sceie4cee•ch2•4che3s4claEacin4decleS•tell•coM4icopy•conr5aalcn4eSd4te.•e4aseai4aeaSw•e2c3o3eect4i4ed•e4d4a• Sedl•»2g•eg3r5aei•ellaheeltEselrA s erne•e4aollenSat4senc•en4d

/

eSened»»i.5g•Benin4»entd4tentl•ep3a34il«r.• 4erl«er4o4fervotla4«•eSah•eiStEieSuaSlav»ey3en•aw4i6iez4«3f2«3g• 2h2§h.• MarEtherthlin•h3io3ahipahlvS«ho4•hSuld• ho.i3•hor4•hortS4ihwsilb•SlccSlide.EaideeEtidi•iEdii4«igna• il4e4til72»lin•2inaSeine.•3ingleioSalon•lonSa• 12r•irSaleie3altloEiiulliT

Saiz•kS4ske•3ket•kEineakEing• 112•31at•21e•llthS

J

2»1«•3maanal13•aanSaael4eSnenEaalth••olEd4•In4leo•o4ce•oft3•o41ab•ol3d2•o311cEeolYS»o«3a4on.iona4•on4g•4opSaophle•Sophls•Sophy•orEc•orSd4*OT•oSrl2ipaSepal•pa4n•pen4d2«5peo2iper•2pheSapher•phoS•pil4ipSing4«pio.4ply•4pon•por44spot•qual41air2ee•lea••aa3•2eSc•3eel•5aenge4iei.•S*et• lei»4«le••14er•s5ily• 4al••411•4sn••pend4••2t••urSa••Ew2a t.

y

>2tag•2tal•tan4iEstand•4ta4pSatat.•4tediternSl•Staro•ta2v•tewSa•3tha•t21•4tl.•StlatitleSstick•4tla•3tif•tSlngEstlr• ltlaSstock•tom3a£•tone•4 topSitore•t4r•4tradEstratu«4tray•4trid4atry4et3w»2ty

In•ulalsu4b3«u2g38U5ilsuit3«4ul•u2o•un3iEU2JI

•u2r4«T•w24iwo•4y4»ycSsyl•yn5o•y5ri*lta3ta.2tabtaSblsaStabollt4tacita5do4taMtaiSlota21taSlatalSen

J-

talSi4talktal41i«taSlogtaSmotan4datanta3taSpertaSpltar4a4 tare4 tareta3rlxta*4etaSiy4tatieta4turtaun4tav42tawtaz41e2tlb

liet4chtchSet4tld4ta.t.ead414te»tteco4Etect2tledteSdilteet«g4te5gerte5gi3tel.teli4Bteleta2ma2teo3at3tenan3 tone3 tend4teneeltontten4taglteote4pte5pat«r3cSter3dlteriter5ie»ter3isterlEzaSternitterSv4tos.4te«at3os».tethSa3teu3 tax4tey

•J

2tlt4tlg2th.than4th2e4 the*th3ei»theSattho31e

SthatthSle.thBica4thllSthink4thl

thEodaEthodlc4thoothorSltthoSris2th«ltlatl«abU4ato2tl2b4tickt4icot4icluEtidl3tlantlf2tlBfy2tigStigntillSinltiB4timptim5ul2tlint2ina3tine.3tiniltioti5octionSaa5tiqti3>a3ti«etia4ntl5eotie4pSti»ticati3tlti4altiTtiv4altiztl3zati3zon2tltSlatlan43tlo.3tled3tle».tSlet.

y

tSle4tl«t*o42tln2ltoto3btoScrat4todo2tofto2grtoSlcto2aatom4bto Sayton4alltoSnat4tono4 tonyto2rato3rie

torBlstoi2Etour4 touttoSvar4tlpltrttraSbtraScktracl4trac41ttrac4tatra»4tra5ventravSelStreSftr«4atremSlEtriatriSce*Etricla4trica2triatri4TtroSaitronSi4tronytro5phatro3eptro3rtruSitrue44tle2t4ectsh4t4cw4t3t2t4teitStottu4ltutulatu3artu4bltud24tue

4tnf4EtttSi

StoaU4nlt2tSup.SturaSturltnrSlaturBotuSry3tu«4tTt»44tlvat«i«44tTOlty4tya2tyltypaStySph4titx4e4mbu»c4uaEntnan4tuarSantuar2dnarSluarStulatuav4ub4au4belu3beru4baraulb41u4bSlhg

• u3bla.u3caucl4buc4itncla3u3cru3cuu4cyudSdud3erudEestudev4oldieud3ieduJ3ie«ud5i«uSditu4donud4slu4duu4enauens4uen4tauer4113ufau3flugh3en

J

mgSU2nl2nilBli

«14nullngulr4aulta4U1T3

«lT4ar.

afj4oknlltul*BbnBlitialch4EulchaulSder«14a«Uaanl4glul2inSllaulSlngulSlshvl41arul4114bnl411f4ulSaull4o4ul.ul«5aanlltl«ltra94«ltna31nulBululSruaEab\im4bi

ua4blyulaln4a31ngoaorSoua2punat4

n2naun4erulnlun4iau2ninnnSiahIU\13T

un3»4un4owunt3abun4ter.un4teeunu4unEyunSzu4orenSo*ulouulpeuperS*uSpla

Page 58: Word Hy-phen-a-tion by Com-put«er - CiteSeer

\ njX82 HYPHENATION PATTERNS81

opSlng. a3pl

np3papportSnptSlbaptu4olra4ura.n4ragu4ra» ,or4banrc4ariduraEatur4farur4fru3rif •urHflcurlinuSrlonlriturSisor2iarising.ur4nouro»4ur4penr4plartSarorStaaur3theurti4uMtlao3ro2uauEsadoSaan .aa4apo»c2uaScluaeSauSalau3aicui411nuilpuaSilutStar*ualtrn2«uosur4uta4bu3tat4uta.4utel4utenuten4i4ult21at!Sillu3tinaut3ingutlonSau4tia

' SuEtlxu4tllutSofutoSg

ntoGnatlcoStonu4touQtf4o3uuu4«Q1T2

uru3nx4a

. IraEra.2rla4bracSllrac3nrag4Ta4g«T.EliaTalEoT»llUraSaoTaEnixraSplvarSladSrat4T6.4radTegST3«1.

T « 1 3 H

ra41oT4elyTen3oaTEanuaT4erdETara.T4eralT3eranverSencv4eraarer3ieTerni4n3reraaTor3thT4e2«4rea.Tea4tave4tatet3orre4tyTlSallfiTlanErida.

- Srlded4r3idenErldaaSTldiT311

yiSgnvik42TilEvllitv3i31ixTlin4ri4naT2incTlnSd

J

4Tlngrio31T3io4rTllOUTi4pTlSroTil3itT13IO

Tl3au4Titirit3r4Tlty3T1T

ETCT0i4 .STO)C

To41aTSolaETOltSTOIT

TOBSI

TorEabTori 4To4ry .To4ta4T0ta«4TT4

T4ywSabl2wacwaSgarwagSowaitswSal.waa4war4twat4twaltawaSTerwlbweaSriareath3wed4nweet3weeSTwel41wlarweatSw3eTwhi4

vlSwil2wlllSinwln4dewln4gwir43wisa«ith3wixSw4kil4o«wl31nw4noIwo2woalsoSren

J

»5pTra4wri4wrlta4w3ihwt41wi4pawEi4t4wt

zlaxacSaz4agozao3z4apzaaSz3c2zlaie4cotoz2edzer41zeSro

x)hxhl2zhilSzhu4z3izlSaziEcziSdix41maxiEmlzz3oz4obx3pxpan4dxpectoSxpe3dzlt2z3tlZlaxu3azx4ySae3yar4y5atylbyi<=y2caycSery3chycMaycoa4ycoUyldyEeaylary4arfye»4ye4t

ysgi4y3h

yiiy31aylla5bly31o

J

y51ttyabolS

y»e4ynpaSyn3chrynSdynfigynBlcBynxylo4yoEdy4oSgyoa4yoEnaty4onay4oty4pedyperByp3iy3poy4pocyp2UyBpttyraEayrSlay3royr4rya4cy3«2«yt31cayi3io3yai(y4ioy»f4yslty»3taygar*.y3thi«yt3icylwzalz5a2bzar24zb2x*ze4nza4pzlerze3roxat42x11* • *i411

z41aExl4zalzozo4azoBolzta44zlz2

x4zy

<
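Although the scanned pattern list itself is unreadable, the convention the patterns encode can be illustrated compactly. The sketch below is a minimal Python rendering, assuming the standard interpretation of Liang-style patterns: a digit written between two letters assigns a level to that inter-letter position, the highest level seen at a position wins, odd levels allow a hyphen while even levels inhibit one, and '.' marks a word boundary. The nine-pattern set is the classic worked example for the word "hyphenation"; it is illustrative only and is not verified against the garbled list above.

import re

# Illustrative pattern set (levels between letters; odd = allow hyphen,
# even = inhibit; '.' = word boundary).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
            "1na", "n2at", "1tio", "2io", "o2n"]

def parse(pattern):
    """Split 'hy3ph' into letters 'hyph' and levels [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    i = 0
    for ch in pattern:
        if ch.isdigit():
            levels[i] = int(ch)   # level of the gap before letters[i]
        else:
            i += 1
    return letters, levels

def hyphenate(word, patterns=PATTERNS):
    """Insert '-' at every inter-letter position with an odd level."""
    word = word.lower()
    text = "." + word + "."                  # '.' marks the boundaries
    levels = [0] * (len(text) + 1)
    for pat in patterns:
        letters, lv = parse(pat)
        start = text.find(letters)
        while start != -1:                   # every occurrence counts
            for k, v in enumerate(lv):
                levels[start + k] = max(levels[start + k], v)
            start = text.find(letters, start + 1)
    out = []
    for i, ch in enumerate(word):
        if i > 0 and levels[i + 1] % 2 == 1:  # gap before word[i]
            out.append("-")
        out.append(ch)
    return "".join(out)

print(hyphenate("hyphenation"))   # prints: hy-phen-ation

Under these patterns the word accumulates the levels h0y3p0h0e2n5a4t2i0o2n, and the two odd positions give hy-phen-ation. A production hyphenator would also enforce minimum fragment lengths at each end of the word, which this sketch omits.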


Answers

moun-tain-ous vil-lain-ous

be-tray-al de-fray-al por-tray-al

hear-ken

ex-treme-ly su-preme-ly

tooth-aches

bach-e-lor ech-e-lon

riff-raff

anal-o-gous ho-mol-o-gous

gen-u-ine

any-place

co-a-lesce

fore-warn fore-word

de-spair

ant-arc-tic corn-starch

mast-odon

squirmed


References

[1] Knuth, Donald E. TeX and METAFONT, New Directions in Typesetting. Digital Press, 1979.

[2] Webster's Third New International Dictionary. G. & C. Merriam, 1961.

[3] Knuth, Donald E. The WEB System of Structured Documentation. Preprint, Stanford Computer Science Dept., September 1982.

[4] Knuth, Donald E. The Art of Computer Programming, Vol. 3, Sorting and Searching. Addison-Wesley, 1973.

[5] Standish, T. A. Data Structure Techniques. Addison-Wesley, 1980.

[6] Aho, A. V., Hopcroft, J. E., and Ullman, J. D. Data Structures and Algorithms. Addison-Wesley, 1982.

[7] Bloom, B. Space/time tradeoffs in hash coding with allowable errors. CACM 13, July 1970, 422-426.

[8] Carter, L., Floyd, R., Gill, J., Markowsky, G., and Wegman, M. Exact and approximate membership testers. Proc. 10th ACM SIGACT Symp., 1978, 59-65.

[9] de la Briandais, Rene. File searching using variable length keys. Proc. Western Joint Computer Conf. 15, 1959, 295-298.

[10] Fredkin, Edward. Trie memory. CACM 3, Sept. 1960, 490-499.

[11] Trabb Pardo, Luis. Set representation and set intersection. Ph.D. thesis, Stanford Computer Science Dept., December 1978.

[12] Mehlhorn, Kurt. Dynamic binary search. SIAM J. Computing 8, May 1979, 175-198.

[13] Maly, Kurt. Compressed tries. CACM 19, July 1976, 409-415.

[14] Knuth, Donald E. TeX82. Preprint, Stanford Computer Science Dept., September 1982.

[15] Resnikoff, H. L. and Dolby, J. L. The nature of affixing in written English. Mechanical Translation 8, 1965, 84-89. Part II, June 1966, 23-33.

[16] The Merriam-Webster Pocket Dictionary. G. & C. Merriam, 1974.

[17] Gorin, Ralph. SPELL.REG[UP,DOC] at SU-AI.

[18] Peterson, James L. Computer programs for detecting and correcting spelling errors. CACM 23, Dec. 1980, 676-687.


[19] Nix, Robert. Experience with a space-efficient way to store a dictionary. CACM 24, May 1981, 297-298.

[20] Morris, Robert and Cherry, Lorinda L. Computer detection of typographical errors. IEEE Trans. Prof. Comm. PC-18, March 1975, 54-64.

[21] Downey, P., Sethi, R., and Tarjan, R. Variations on the common subexpression problem. JACM 27, Oct. 1980, 758-771.

[22] Tarjan, R. E. and Yao, A. Storing a sparse table. CACM 22, Nov. 1979, 606-611.

[23] Zeigler, S. F. Smaller faster table driven parser. Unpublished manuscript, Madison Academic Computing Center, U. of Wisconsin, 1977.

[24] Aho, Alfred V. and Ullman, Jeffrey D. Principles of Compiler Design, sections 3.8 and 6.8. Addison-Wesley, 1977.

[25] Pfleeger, Charles P. State reduction in incompletely specified finite-state machines. IEEE Trans. Computers C-22, Dec. 1973, 1099-1102.

[26] Kohavi, Zvi. Switching and Finite Automata Theory, section 10-4. McGraw-Hill, 1970.

[27] Knuth, D. E., Morris, J. H., and Pratt, V. R. Fast pattern matching in strings. SIAM J. Computing 6, June 1977, 323-350.

[28] Aho, A. V. In R. V. Book (ed.), Formal Language Theory: Perspectives and Open Problems. Academic Press, 1980.

[29] Kucera, Henry and Francis, W. Nelson. Computational Analysis of Present-Day American English. Brown University Press, 1967.

[30] Research and Engineering Council of the Graphic Arts Industry. Proceedings of the 13th Annual Conference, 1963.

[31] Stevens, M. E. and Little, J. L. Automatic Typographic-Quality Typesetting Techniques: A State-of-the-Art Review. National Bureau of Standards, 1967.

[32] Berg, N. Edward. Electronic Composition, A Guide to the Revolution in Typesetting. Graphic Arts Technical Foundation, 1975.

[33] Rich, R. P. and Stone, A. G. Method for hyphenating at the end of a printed line. CACM 8, July 1965, 444-445.

[34] Wagner, M. R. The search for a simple hyphenation scheme. Bell Laboratories Technical Memorandum MM-71-1371-8.

[35] Gimpel, James F. Algorithms in Snobol 4. Wiley-Interscience, 1976.


[36] Ocker, Wolfgang A. A program to hyphenate English words. IEEE Trans. Prof. Comm. PC-18, June 1975, 78-84.

[37] Moitra, A., Mudur, S. P., and Narwekar, A. W. Design and analysis of a hyphenation procedure. Software Prac. Exper. 9, 1979, 325-337.

[38] Lindsay, R., Buchanan, B. G., Feigenbaum, E. A., and Lederberg, J. DENDRAL. McGraw-Hill, 1980.