
Extending TeX for Unicode

Richard J. Kinch
6994 Pebble Beach Ct
Lake Worth FL 33467 USA
Telephone (561) 966-8400
FAX (305) 644-6978
[email protected]
http://www.truetex.com

Abstract

TeX began its "childhood" with 7-bit-encoded fonts, and has entered adolescence with 8-bit encodings such as the Cork standard. Adulthood will require TeX to embrace 16-bit encoding standards such as Unicode. Omega has debuted as a well-designed extension of the TeX formatter to accommodate Unicode, but much new work remains to extend the fonts and DVI translation that make up the bulk of a complete TeX implementation. Far more than simply doubling the width of some variables, such an extension implies a massive reorganization of many components of TeX.

We describe the many areas of development needed to bring TeX fully into multi-byte encoding. To describe and illustrate these areas, we introduce the TrueTeX® Unicode edition, which implements many of the extensions using the Windows Graphics Device Interface and TrueType scalable font technology.

Integrating TeX and Unicode

You cannot use TeX for long without discovering that character encoding is a big, messy issue in every implementation. The promise of Unicode, a 16-bit character-encoding standard [15, 14], is to clean up the mess and simplify the issues.

While Omega [5, 13] has upgraded TeX-to-DVI translation to handle Unicode [3], the fonts and DVI-to-device translators are far too entrenched in narrow encodings to be easily upgraded. This paper will develop the concepts needed to create Unicode TeX fonts and DVI translators, and exhibit our progress in the TrueTeX Unicode edition.

A fully Unicode-capable TeX brings many substantial benefits:

• TeX will work smoothly with non-TeX fonts. While TeX already has a degree of access to 8-bit PostScript and TrueType fonts, there are many limitations that Unicode can eliminate.

• TeX will eliminate the last vestiges of its deep-seated bias for the English language and US versions of multilingual platforms like Windows. It will adapt freely and instantaneously to other languages, not just in the documents produced, but in its run-time messages and user interface. This flexibility is crucial to quality software, especially to a commercial product in an international marketplace.

• With access to Unicode fonts, the natural ability of TeX to process the large character sets of the Asian continent will be realized. Methods such as the Han unification will be accessible.

• TeX will install with fewer font and driver files. Many 8-bit fonts will fit into one 16-bit font, and in systems like Windows, which treat fonts as a system-wide resource, fewer fonts are an advantage. Only one application will be needed to translate from Unicode DVI to output device.

• TeX documents will convert to other portable forms (like PDF, OpenDoc, or HTML) and will work with Windows OLE, without tricks and without pain.

• Computer Modern and other TeX fonts will be usable in non-TeX Unicode applications. The 8-bit encoding problems have broken Computer Modern on every variety of Microsoft Windows.

• When 16-bit encodings overcome the resistance of the past (and we have every reason to hope that they will) TeX will play a continuing role in software of the future, instead of becoming an antique.

Claiming these promises involves some trouble along the way, but without 16 bits to use for encoding, we will never have a solid solution.

TUGboat, Volume 0 (1996), No. 0, Proceedings of the 1996 Annual Meeting, p. 1001


Let us survey in the rest of this paper what is needed to achieve these various aspects of integrating TeX and Unicode.

Omega and Unicode

The goal of our work has been to create a Unicode-capable DVI translator, and to reorganize the TeX fonts into a Unicode encoding. TeX itself (that is, the formatter) already has a Unicode successor, namely Omega.

The chief advance of Omega is that it generalizes the TeX formatter to handle wider encodings. What Omega is less concerned with is the DVI format (which has always provided for wider encodings, up to 32 bits), the encoding of the existing TeX fonts, and the translation of .dvi files for output devices. In fact, Omega has side-stepped DVI translation altogether with its extended xdvicopy translation, whereby Omega operates within the old environment of 8-bit TeX fonts and the old DVI translators. Since Unicode rendering is supported on Windows but not on other popular TeX platforms (UNIX, DOS, etc.), a devotion to Omega's portability requires that Omega use the old fonts and DVI translators. Lacking any compulsion to extend the DVI translators for Unicode, the Omega project has justifiably invested most of its effort into earlier stages of the typesetting process [4, page 426].

Our Unicode TeX fonts and Unicode DVI translator, while having a natural connection to Omega, are capable of connecting TeX82 to Unicode as well. Through the mechanism of virtual fonts [11], TeX can access Unicoded fonts while using its old 8-bit encodings itself.

What's the fuss?

Wishing for 16 bits of Unicode sounds like, hey presto, we just widen some integer types, double some constants, and type "make" somewhere very, very high in a directory tree. The task is far from this simple for several reasons:

One, TeX and TeXware are full of 256-member tables which enumerate all code points. These would have to grow to 65,536 members. While Haralambous and Plaice want us to agree that this is "impossible" for practical reasons [6], they assume that we are not going to re-implement the 8-bit-encoded software for sparse arrays. Applying sparse-array techniques to manage per-character data will avoid an impossible increase in execution time and/or memory, although it will require an initial extra effort to upgrade the software.
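To make the sparse-array idea concrete, here is a minimal sketch in Python. It is not taken from TeX, Omega, or TrueTeX; the class name `SparseCharTable` and the width values are hypothetical. It shows a per-character table that spans the full 16-bit code space but stores only the entries that differ from a default.

```python
class SparseCharTable:
    """Per-character table over a 16-bit code space.

    Only entries that differ from the default are stored, so an
    almost-empty table costs almost nothing (hypothetical sketch).
    """

    CODE_SPACE = 1 << 16  # 65,536 code points

    def __init__(self, default=0):
        self.default = default
        self._entries = {}  # code -> value, non-default entries only

    def _check(self, code):
        if not 0 <= code < self.CODE_SPACE:
            raise IndexError(f"code {code:#x} is outside the 16-bit range")

    def __setitem__(self, code, value):
        self._check(code)
        if value == self.default:
            self._entries.pop(code, None)  # storing the default frees the slot
        else:
            self._entries[code] = value

    def __getitem__(self, code):
        self._check(code)
        return self._entries.get(code, self.default)

    def __len__(self):
        return len(self._entries)  # number of stored (non-default) entries


# Hypothetical character widths, for illustration only.
widths = SparseCharTable(default=0)
widths[0x0041] = 750
widths[0x2215] = 500
```

A lookup for any unassigned code returns the default, so client code can keep indexing by character code exactly as it did with a dense 256-member array.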

Two, these tables have to be stored in files, and we need to carefully and deliberately extend the file formats to handle the extensions. Not only could we come up with a bad design that limits us unnecessarily in the future, but all the old TeXware has to be upgraded, and then we have to port the upgrades.

Three, we must rationalize the old ad-hoc character sets into a big union set. Just cataloging and managing this data is a large task: many tens of thousands of items, where we used to have only hundreds before. Some degree of database management tools must be applied to get the codes into a form which we can compile into software; it is not enough to just type in some array initializers here and there.

Encoding standards are necessarily incomplete or imprecise in some aspects, and none fit the TeX enterprise. While many of the Unicode math symbols were taken from TeX, many of the TeX characters are missing from Unicode. But Unicode is about the closest encoding to TeX math that we can expect from an unspecialized encoding, and with Unicode we gain a powerful connection to multilingual character sets.

Extending Computer Modern to Unicode

A "rational" encoding establishes a mapping of character names to unique integers, and this mapping does not vary from font to font. The Computer Modern fonts were not encoded rationally. For example, code 0x7b is overloaded with about 8 different characters, and character dotlessi appears in different codes in different fonts. Given an 8-bit limit on encoding, this was inevitable. But this makes for many troubles; moving up to a rational, 16-bit encoding is a clean solution.
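The two symptoms of an irrational encoding named above (one code carrying several characters, and one character living at several codes) can be detected mechanically. The following Python sketch assumes each encoding is a name-to-code dictionary. The four-entry excerpt is illustrative data chosen for this example, not a dump of the actual font tables, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative excerpt only: encoding name -> {character name: code}.
ENCODINGS = {
    "roman2":  {"dotlessi": 0x10, "endash": 0x7B},
    "roman0":  {"dotlessi": 0x10, "braceleft": 0x7B},
    "mathit2": {"dotlessi": 0x7B},
    "mathsy2": {"paragraph": 0x7B},
}


def overloaded_codes(encodings):
    """Return {code: names} for every code that carries more than one name."""
    by_code = defaultdict(set)
    for table in encodings.values():
        for name, code in table.items():
            by_code[code].add(name)
    return {code: names for code, names in by_code.items() if len(names) > 1}


def wandering_names(encodings):
    """Return {name: codes} for every name that appears at more than one code."""
    by_name = defaultdict(set)
    for table in encodings.values():
        for name, code in table.items():
            by_name[name].add(code)
    return {name: codes for name, codes in by_name.items() if len(codes) > 1}


def is_rational(encodings):
    """True when names and codes correspond one-to-one across all fonts."""
    return not overloaded_codes(encodings) and not wandering_names(encodings)
```

A set of encodings that passes `is_rational` can be merged into a single wide encoding without any per-font special cases.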

Computer Modern is also "incomplete" in the sense that if you made a table having on one axis the list of all the specific styles (Roman, Italic, typewriter, sans serif, etc.) and on the other axis all the characters in all the fonts (A–Z, punctuation, diacritics, math symbols, etc.), the table would have lots of holes when it came to what METAFONT source exists. Commercial text fonts have all of these holes filled, or at least the regions populated in the table are rectangular. In Computer Modern the regions are randomly shaped.

Furthermore, the character axis of this (very large) imaginary table is missing many characters considered important in non-TeX encodings. For example, ANSI characters like florin, perthousand, cent, currency, yen, brokenbar, etc., are not implemented in Computer Modern. Certain of these symbols can be unapologetically composed from existing Computer Modern symbols: a Roman multiply or divide would come from the math symbol font, trademark from the T and M of a smaller optical size, and so on. But many other characters will just have to be autographed anew in METAFONT (at least one related work is in progress [6]).

The job of extending Computer Modern to be a rational and complete set of fonts first requires that we reorganize the existing characters into a clean, 16-bit encoding. Then we are in a strong position to fill in the missing characters.

We not only want to give TeX access to 16-bit-encoded fonts, we also want the converse: non-TeX applications to have access to the Computer Modern fonts in TrueType form. This mandates adherence to the Unicode standard wherever possible, and an organized method to manage the non-Unicode characters in Computer Modern.

Here is a list of the components we consider essential to a Unicode rationalization of Computer Modern. In this list we take a different approach from Haralambous' Unicode Computer Modern project [7], which is aimed at producing virtual fonts which resolve to 8-bit .pk fonts from METAFONT. Our aim is a set of Unicoded TrueType fonts.

• A METAFONT-to-outline converter, a very difficult although not impossible task, as illustrated in MetaFog [9].

• A database system to treat the converted Computer Modern glyphs as atoms, for input to a TrueType font-builder.

• A database of character names which covers all the characters in the TeX fonts and in Unicode. We call the grand union TeX character set TeXunion, in the same way that we denote the Unicode characters as the Unicode character set. (We will use Small Caps to indicate a formal set.) TeXunion contains 1108 characters by the present inventory. Producing this database involves some work because there are no standards for TeX character names (that is, single-word alphanumeric names such as are used in PostScript encodings). The standard Unicode character descriptions are lengthy phrases instead of single words, making them unjoinable to the TeX names. For example, the Unicode standard provides the verbose entry for the code 0x00ab, "left-pointing double angle quotation mark". The PostScript names (footnote 1) in common use come from an assortment of sources, and they exhibit inconsistencies, conflicts, and ambiguities which frustrate computation of set projections and joins.

• A database of TeX encodings, which tells which characters appear in which TeX fonts. TeX uses a gumbo of no less than 28 (!) distinct encodings (Table 1). This number may come as a surprise, but has been hidden by the web of METAFONT source files. The sum of all TeX encodings constitutes a database of 3415 name-to-code pairs, each name being taken from the 1108 members of TeXunion. We designate this set of encodings (that is, a set of mappings of names to integers) as TeXencod.

  As if the miasmic fog of encoding conventions were not confused enough, small-caps fonts present still more encoding problems. They represent an axis of variation that is hardly defined in the usual set of font parameters. We must consider small-caps characters to be different from their corresponding parents. If this is not done, then there is no way to compose virtual small-caps fonts from their lowercase counterparts, because we would have no way of knowing which characters are to shrink (a jumbled set of letters and accented letters) and which are not (punctuation and all the rest). Thus, for each encoding used anywhere by a small-caps font, we must make a duplicate small-caps version (altering the lowercase character names to small-caps names) of the encoding in the list of all the encodings. Thus we have a csc2 for roman2, t1csc for t1, and so on.

  To produce these duplicate encodings, we need a rule to convert lowercase names (both letters a–z and diacritical letters) to small-caps lowercase names and back. We have simply been appending "sc" to the name (this works because there are no collisions with names that happen already to end in "sc"). For example, a small-caps letter "a" is "asc".
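The naming rule just described is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the TrueTeX tooling: the function names are hypothetical, the set of names that count as lowercase letters must be supplied by the caller, and the three-entry encoding excerpt is example data.

```python
SUFFIX = "sc"


def to_smallcaps(name):
    """Convert a lowercase character name to its small-caps name."""
    return name + SUFFIX


def from_smallcaps(name):
    """Recover the lowercase name from a small-caps name."""
    if not name.endswith(SUFFIX) or len(name) == len(SUFFIX):
        raise ValueError(f"{name!r} is not a small-caps name")
    return name[: -len(SUFFIX)]


def smallcaps_encoding(encoding, lowercase_names):
    """Duplicate an encoding, renaming only the lowercase letters.

    `encoding` maps names to codes; `lowercase_names` is the set of names
    that shrink to small caps (letters and accented letters). Punctuation
    and everything else keeps its name.
    """
    clashes = [name for name in encoding if name.endswith(SUFFIX)]
    if clashes:
        # The rule relies on no existing name already ending in "sc".
        raise ValueError(f"names already ending in {SUFFIX!r}: {clashes}")
    return {
        (to_smallcaps(name) if name in lowercase_names else name): code
        for name, code in encoding.items()
    }


# Illustrative excerpt of a T1-like encoding.
t1_excerpt = {"a": 0x61, "aacute": 0xE1, "period": 0x2E}
t1csc_excerpt = smallcaps_encoding(t1_excerpt, {"a", "aacute"})
```

The explicit clash check is the programmatic form of the remark that the rule only works while no ordinary name ends in "sc".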

  Adobe has been appending "small" to their names (this often causes character names to exceed a traditional limit of 15 characters in length), as in the MacExpert encoding [2]. This is done in an irregular manner by appending to the uppercase character names (for example, the Adobe small-caps for aacute is Aacutesmall, while ours is aacutesc); apparently someone mistook appearance for semantics. Furthermore, Adobe has a small-caps version of bare diacritics in their MacExpert encoding, although the diacritical character name is irregularly changed to an initial capital (for example, Adobe small-caps for acute is Acutesmall).

Footnote 1: There is an attempt at standardization in PostScript-style names from the Association for Font Information Interchange (AFII), but the standard is proprietary and on paper only. The names are serial numbers as opposed to abbreviated descriptions.

Table 1: TeX Single-Byte Encodings (TeXencod). Covers all Computer Modern, AMS math symbol, and Euler fonts. Each item maps a set of 128 or 256 character names to integers.

Name      Description
csc0      TeX caps and small caps (ligs = 0)
csc1      TeX caps and small caps (ligs = 1)
euex      AMS Euler Big Operators
eufb      AMS Euler Fraktur Bold †
eufm      AMS Euler Fraktur
eur       AMS Euler
eus       AMS Euler Script
lasy      LaTeX symbols
lcircle   LaTeX circles
line      LaTeX lines
logo      METAFONT logo
manfnt    TeXbook symbols font
mathex2   TeX math extension
mathit1   TeX math italic (ligs = 1)
mathit2   TeX math italic (ligs = 2)
mathsy1   TeX math symbols (ligs = 1)
mathsy2   TeX math symbols (ligs = 2)
msam      AMS symbol set A
msbm      AMS symbol set B
roman0    TeX Roman (ligs = 0)
roman1    TeX Roman (ligs = 1)
roman2    TeX Roman (ligs = 2)
t1        LaTeX NFSS T1 encoding
t1csc     T1 with small caps
texset0   TeX "texset" encoding (ligs = 0)
textit0   TeX text italic (ligs = 0)
textit2   TeX text italic (ligs = 2)
title2    TeX 1-inch capitals (ligs = 2)

† A superset of eufm with two extra chars

  The LaTeX T1 encoding, which was supposed to have been uniform for all DC fonts, also has an irrational aspect, in that the T1 encoding is overloaded when it is applied to both lowercase and small-caps fonts. Somewhere in the LaTeX macros is buried something tantamount to another small-caps encoding of T1, which indicates which codes are letters or diacritics. Another related limitation of T1 encoding is the lack of small-caps accents.

  A wealth of code positions does not exempt Unicode from pecksniffian absurdities. The Unicode committee will not provide encodings for a small-caps alphabet (small-caps being a matter of typography and not information content), although they provide encodings for some small-caps characters (which appear in older encoding standards subsumed by Unicode).

• A rationalization of the TeX character sets into their largest common subsets (Table 2). This represents the relations between the TeX encodings and the character subsets as organized in Knuth's METAFONT sources. The first item in each entry of Table 2 gives the TeX encoding as given in Table 1, known by the .mf source file used to generate the font; the remaining items are the common subsets generated via the METAFONT source files of the same names.

  The relation set forth in Table 2 is not refined for the distinctions regarding the ligature setting. Certain of Knuth's encodings appear overqualified, namely, mathex2, mathit{12}, and mathsy{12} do not vary with the ligs setting, although it is specified in the METAFONT driver file (footnote 2).

  These decompositions of the various TeX encodings may be considered close to the "greatest common" subsets, although we do not require a full decomposition here. To be completely decomposed, the {012}-numbered items on the right should be further decomposed into the unnumbered common set and the various numbered differential sets. The sets on the right column of Table 2 we will use below as the set known as TeXpages. We have not yet made the effort to elaborate the members of each TeXpages set, which is needed to compute the remaining work to complete the style axes of Computer Modern.

Table 2: TeXencod Decomposition into TeXpages. Set romanlsc is romanl with small-caps semantics; it does not actually appear in Computer Modern. Braced digits indicate factored suffixes.

TeXencod Member   Covering TeXpages Members
csc0              accent0 cscspu greeku punct romand romanp romanu romspu romsub0 romanlsc
csc1              accent12 comlig cscspu greeku punct romand romanp romanu romspu romsub1 romanlsc
mathex2           bigdel bigop bigacc
mathit{12}        romanu itall greeku greekl italms olddig romms
mathsy{12}        calu symbol
roman0            accent0 greeku punct romand romanl romanp romanu romspl romspu romsub0
roman1            accent12 comlig greeku punct romand romanl romanp romanu romspl romspu romsub1
roman2            accent12 comlig greeku punct romand romanl romanp romanu romlig romspl romspu
texset            punct romand romanl romanp romanu tset tsetsl
textit0           accent0 greeku itald itall italp italsp punct romanu romspu romsub0
textit2           accent12 comlig greeku itald italig itall italp italsp punct romanu romspu
msam              calu asymbols
msbm              calu bsymbols xbbold
...               (... and so on for the rest ...)

Footnote 2: Also, the comments at the top of romsub.mf are in error about what happens when ligs = 2. Apparently, no one has tried any other ligs setting!

• A database giving the mapping of TeX fonts to their encodings as known above (Table 3). Table 3 lists TeX font names and their encoding name; an N indicates a wildcard for any optical point size integer, excluding sizes of the same style already matched earlier in the table. If a new optical size for a font name is not in this table, the presumption should be that its encoding ought to be that of the lowest optical size in the table of the same name. The wild card "*" matches any suffix, such as variations on style or optical size, for names which do not match higher in the table. We designate the set {cmb10, ..., eusm*} as TeXfonts.
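The matching order implied by Table 3 (exact names first, then N for an optical size, then "*" for any suffix, always reading the table top to bottom) can be sketched in Python as follows. Only a handful of rows from Table 3 are included, the function names are hypothetical, and the sketch implements the first-match rule only; it does not attempt the "lowest optical size" presumption for sizes absent from the table.

```python
import re

# A few rows of Table 3, in table order (earlier rows win).
FONT_RULES = [
    ("cmmi5", "mathit1"),
    ("cmmiN", "mathit2"),
    ("cmr5", "roman1"),
    ("cmrN", "roman2"),
    ("cmttN", "roman0"),
    ("dccscN", "t1csc"),
    ("dc*", "t1"),
]


def rule_to_regex(pattern):
    """Translate a Table 3 pattern into a compiled regular expression.

    Font names are lowercase, so an uppercase N is safe to treat as the
    optical-size wildcard; "*" matches any suffix.
    """
    parts = []
    for ch in pattern:
        if ch == "N":
            parts.append(r"\d+")
        elif ch == "*":
            parts.append(r".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def encoding_for(font, rules=FONT_RULES):
    """Return the encoding of the first rule matching `font`, or None."""
    for pattern, encoding in rules:
        if rule_to_regex(pattern).fullmatch(font):
            return encoding
    return None
```

Because the rules are tried in order, cmr5 resolves to roman1 before the more general cmrN rule is consulted, which is the behavior the exclusion clause for N describes.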

Table 3: Mapping of TeXfonts to TeXencod. N indicates an optical point size; asterisk a suffix wildcard.

TeX Font    Encoding    TeX Font    Encoding
cmb10       roman2      cmbsy5      mathsy1
cmbsyN      mathsy2     cmbxN       roman2
cmbxsl10    roman2      cmbxti10    textit2
cmcscN      csc1        cmdunh10    roman2
cmexN       mathex2     cmff10      roman2
cmfi10      textit2     cmfib8      roman2
cminch      title2      cmitt10     textit0
cmmi5       mathit1     cmmiN       mathit2
cmmib5      mathit1     cmmibN      mathit2
cmmr10      mathit2     cmmb10      mathit2
cmr5        roman1      cmrN        roman2
cmslN       roman2      cmsltt10    roman0
cmssN       roman2      cmssbx10    roman2
cmssdc10    roman2      cmssiN      roman2
cmssq8      roman2      cmssqi8     roman2
cmsy5       mathsy1     cmsyN       mathsy2
cmtcsc10    csc0        cmtexN      texset0
cmtiN       textit2     cmttN       roman0
cmu10       textit2     cmvtt10     roman2
lasy*       lasy        lcircle*    lcircle
line*       line        logo*       logo
manfnt      manfnt      msam10      msam
msbm10      msbm        dccscN      t1csc
dctcscN     t1csc       dc*         t1
euex*       euex        eufb*       eufb
eufm*       eufm        eurb*       eur
eurm*       eur         eusb*       eus
eusm*       eus

• A fuzzy-matching operator which, when joining, selecting, and projecting the above databases, can resolve the redundancy, synonyms, and ambiguities in the character names and their composition. Here is an inventory of issues known to date:

  – bar (vertical bar) vs. brokenbar (vertical broken bar)
  – macron vs. overscore
  – minus vs. hyphen vs. endash vs. sfthyphen vs. dash
  – grave vs. quoteleft in code 0x60
  – space (0x20) vs. nbspace (0xa0) vs. visiblespace vs. spaceopenbox vs. spaceliteral
  – rubout in code 0x7f
  – ring vs. degree
  – dotaccent vs. periodcentered vs. middot vs. dotmath; Zdotaccent vs. Zdot, etc.
  – quotesingle vs. quoteright
  – slash vs. virgule
  – star vs. asterisk
  – oneoldstyle vs. one, etc.
  – diamondmath vs. diamond vs. lozenge
  – openbullet vs. degree
  – nabla vs. gradient
  – cwm vs. compoundwordmark (a T1 problem) vs. zeronobreakspace
  – perzero vs. zeroinferior vs. perthousandzero, for perthousand
  – slash vs. suppress vs. polishlslash, as in Lslash and Lsuppress
  – Ng vs. Eng (and ng vs. eng)
  – hungarumlaut vs. umlaut, as in Ohungarumlaut vs. Oumlaut
  – dbar vs. thorn
  – tilde vs. asciitilde
  – tcaron transmogrifies to tcomma, et al.
  – mu vs. mu1 vs. micro, code 0xb5
  – Dslash vs. Dmacron, code 0x0110; dslash vs. dmacron, code 0x0111
  – florin (not in Unicode) vs. fscript, code 0x0192
  – fraction vs. fraction1 vs. slashmath, code 0x2215
  – circleR vs. registered, circlecopyrt vs. copyright
  – arrowboth vs. arrowlongboth
  – aleph vs. alef vs. alephmath
  – Ifractur (eus, mathsy1, mathsy2) vs. Ifraktur (Unicode) vs. Rfractur (eus, mathsy1, mathsy2, and Unicode); the spelling should uniformly be "fraktur"
  – smile vs. smileface vs. invsmileface vs. Unicode 0x263A (unnamed)
  – Omega vs. ohm, Omegainv vs. mho
  – names not starting with letters: 0script (0x2134), 2bar (0x01bb)

  We represent these items in a text file having the following format: Each line of the file gives character name synonyms, one group of synonyms per line. Any of the names on one line are synonyms, and can be freely exchanged. For example, the line "visiblespace spaceliteral" means that the character names visiblespace and spaceliteral are completely equivalent names. (The former was used in the TeX DC fonts [10], while the latter was the PostScript name used in the Lucida Sans Unicode TrueType font of Windows NT.)

  A special case of "synonym" is the Unicode fall-back. This is a code number which is a "synonym" for TeXunion members not in Unicode, and is our assignment of the Unicode "private zone" codes for the misfits. For example, the line "ff 0xf001 0xfb01" (Microsoft fonts have an undocumented usage like this) means that the character ff (which is a ligature not to be found in Unicode) carries a recommended private-zone code assignment of 0xf001 or 0xfb01. One or more such recommended codes may appear, in order of preference. In resolving a private-zone conflict, a font-building program may take the recommended codes in order until a non-conflicting code is found. Only after the recommended codes are exhausted should the program make a random private-zone assignment. Codes may be given in decimal, octal (leading 0), or hex (leading 0x) formats. Programs using these tables take care to distinguish character codes (which contain only hexadecimal digits if starting with 0x, otherwise only octal digits if starting with a leading zero, otherwise only decimal digits) from names (anything else, including names which start with digits). Lucida Sans Unicode contains some names like "2500" (for code position 0x2500); if this presents a problem we might have to prefix a letter to these names.
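The two rules in this paragraph (how to tell a code from a name, and how to pick a private-zone code) can be sketched in Python as below. This is an illustration, not the TrueTeX code: `parse_token` and `choose_private_code` are hypothetical names, and drawing the random fall-back from the Unicode Private Use Area U+E000 through U+F8FF is an assumption of this sketch.

```python
import random
import re

HEX = re.compile(r"0[xX][0-9a-fA-F]+")  # leading 0x, then hex digits only
OCT = re.compile(r"0[0-7]*")            # leading 0, then octal digits only
DEC = re.compile(r"[1-9][0-9]*")        # decimal digits only

# Assumption: random assignments come from the Private Use Area.
PRIVATE_ZONE = range(0xE000, 0xF900)


def parse_token(token):
    """Classify a synonym-file token as ('code', int) or ('name', str)."""
    if HEX.fullmatch(token):
        return ("code", int(token, 16))
    if OCT.fullmatch(token):
        return ("code", int(token, 8))
    if DEC.fullmatch(token):
        return ("code", int(token, 10))
    return ("name", token)  # anything else, even if it starts with a digit


def choose_private_code(recommended, used, rng=random):
    """Take recommended codes in order; fall back to a random free code."""
    for code in recommended:
        if code not in used:
            return code
    free = [code for code in PRIVATE_ZONE if code not in used]
    return rng.choice(free)
```

Note that a token such as "2500" is classified as a decimal code under these rules, which is exactly the ambiguity with the Lucida Sans Unicode names mentioned above; names such as 0script and 2bar are safe because they contain non-digit characters.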

  A name may appear in more than one synonym group, although such groups do not join within the matching algorithm. The first name in any group is the "canonical" name. The canonical name is the name which should be output by programs which compute set operations on the encoding sets. This helps to achieve a "filtered" result which does not contain troublesome synonyms. For example, if the synonym file contained the lines:

    joseph jose yosef josephus 0xfb10
    joseph joe joey

  the names jose, yosef, and josephus would have fall-back code 0xfb10, the names joe and joey would have no fall-back code, and all the names above would invariably be transformed to joseph on output.
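As a sketch of how such a file might be read, the Python below builds two lookup tables from the joseph example: one from each name to its canonical name, and one from each name to its fall-back codes. The function name `load_synonyms` is hypothetical, and letting the first group win when a name appears in several groups is an assumption of this sketch, since the text only says that groups do not join.

```python
import re

CODE = re.compile(r"0[xX][0-9a-fA-F]+|0[0-7]*|[1-9][0-9]*")


def _to_int(token):
    """Convert a hex (0x...), octal (leading 0), or decimal token."""
    if token[:2].lower() == "0x":
        return int(token, 16)
    if token.startswith("0"):
        return int(token, 8)
    return int(token, 10)


def load_synonyms(text):
    """Parse synonym groups, one per line.

    Returns (canonical, fallback): name -> canonical name, and
    name -> list of recommended fall-back codes in order of preference.
    """
    canonical, fallback = {}, {}
    for line in text.splitlines():
        tokens = line.split()
        names = [t for t in tokens if not CODE.fullmatch(t)]
        codes = [_to_int(t) for t in tokens if CODE.fullmatch(t)]
        if not names:
            continue
        head = names[0]  # the first name in a group is the canonical one
        for name in names:
            canonical.setdefault(name, head)  # assumption: first group wins
            if codes:
                fallback.setdefault(name, codes)
    return canonical, fallback


SYNONYMS = """\
joseph jose yosef josephus 0xfb10
joseph joe joey
"""
canonical, fallback = load_synonyms(SYNONYMS)
```

With these tables, a filter that rewrites every name through `canonical` emits only joseph for any of the six names, while the two groups keep their separate fall-back codes.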

  The TrueTeX filter accessory program, joincode, performs a relational join on two font encoding sets, making a new encoding. It resolves the issues of a given synonym file according to the rules we have stated.
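The following Python sketch shows what a relational join of two encodings on character name looks like. It is not the joincode program, whose interface is not described here; `join_encodings`, the two small encodings, and the synonym map are all illustrative assumptions.

```python
def join_encodings(source, target, canonical=None):
    """Join two name -> code encodings on canonical character name.

    Returns a sorted list of (canonical_name, source_code, target_code)
    rows for every character present in both encodings.
    """
    canonical = canonical or {}

    def canon(name):
        return canonical.get(name, name)

    source_by_name = {canon(name): code for name, code in source.items()}
    rows = []
    for name, target_code in target.items():
        key = canon(name)
        if key in source_by_name:
            rows.append((key, source_by_name[key], target_code))
    return sorted(rows)


# Illustrative data: an 8-bit TeX-side excerpt and a Unicode-side excerpt.
tex_side = {"visiblespace": 0x20, "dotlessi": 0x10, "ff": 0x0B}
unicode_side = {"spaceliteral": 0x2423, "dotlessi": 0x0131}
synonyms = {"spaceliteral": "visiblespace"}

joined = join_encodings(tex_side, unicode_side, synonyms)
```

Each row pairs a code in one encoding with the code of the same character in the other, which is the information a virtual font or a re-encoding step needs; characters missing from either side (ff in this example) simply drop out of the join.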

• A database of non-TeX encodings, which tells which characters appear in various encoding standards such as ANSI or Unicode. This list presently constitutes a database of 3523 entries from a set of 1814 characters. Some of the common examples of commercial importance today are given in Tables 4 and 5. Having these sets allows us to export virtual fonts for any of the encodings represented, so we can virtualize non-Unicode, non-TeX fonts.

Table 4: Some Non-TeX Encodings.

Name               Description
ase                Adobe PostScript "AdobeStandardEncoding", the built-in encoding of many Type 1 fonts
belleek            Belleek [8] scheme for LaTeX T1 encoding on 8-bit TrueType
belleekc           Belleek with small-caps
latin1             Latin 1 (ISO 8859-1)
latin1ps           Adobe PostScript "ISOLatin1Encoding" (which is not ISO Latin-1!)
mac                Macintosh
macexp             Adobe PostScript "MacExpert" encoding (used by Acrobat in PDF [2]), containing ligatures, small caps, fractions, typesetting niceties
mre (footnote 3)   Adobe PostScript "MacintoshRomanEncoding" (used by Acrobat in PDF), a subset of "mac", which omits some math characters and the Apple trademark
pdfdoc             Adobe PostScript "PDFDocEncoding" (used by Acrobat in PDF), an ad-hoc encoding used in PDF outline entries, text annotations, and Info dictionary strings, consisting of a remapped set = {ase ∪ mre ∪ wae}
unicode            16-bit Unicode (Windows NT and 95, AT&T Plan 9)

Table 5: Some Encodings Used in Windows.

Name       Description
wae        Adobe "WinAnsiEncoding" (used in Acrobat in PDF), "winansi" with bullets in the .notdef positions, some semantic synonyms
winansi    Windows ANSI 8-bit (US/Western Europe code page) (Includes certain non-ANSI characters in 0x80–0x9f range of codes.)
winansiu   Windows ANSI Unicode (US/Western Europe code page) (Same characters as are present in the winansi encoding, except the non-ANSI characters are in their Unicode positions.)
winmultu   Windows Multi-Lingual Unicode (Windows 95/NT) (655 characters supporting all Latin alphabets, Greek, Cyrillic, OEM screen characters.)
winNNNN    Windows (for code pages numbered NNNN)

• A TrueType-font-builder that takes the converted outlines from various TeX fonts, organizes sets of them based on an output encoding, and builds binary TrueType fonts from the reorganized glyph data.

  A sub-tool for the font-builder incorporates a redundancy-elimination feature that allows you to specify a table listing which characters in a given TeX font may be taken from other TeX fonts without repeating a costly METAFONT glyph conversion. One example of such a redundancy is how DC fonts largely replicate the Computer Modern fonts; it would be a waste of effort to convert the glyphs twice. Another example is that many font variants are slanted versions of the upright face, and the geometric slant is easily applied to an already-converted glyph rather than slanting in METAFONT and repeating the glyph conversion. This technique is also used to compose accents and letters for "purely" accented characters (where the accent and letter do not overlap), since the MetaFog conversion is applied only to the accent part of such glyphs, allowing the redundant letter conversion to be done only once.
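The geometric slant mentioned above is a horizontal shear, and because a shear is an affine map it can be applied directly to the control points of an already-converted outline. The Python sketch below assumes an outline is a list of (x, y) control points; the function name and the slant value of 0.25 are hypothetical choices for illustration.

```python
def slant_outline(points, slant):
    """Shear outline control points horizontally: x' = x + slant * y.

    `points` is a list of (x, y) pairs; `slant` is the horizontal shift
    per unit of height. The baseline (y = 0) stays fixed.
    """
    return [(x + slant * y, y) for x, y in points]


# A hypothetical upright stem, 100 units wide and 400 units tall.
stem = [(0, 0), (100, 0), (100, 400), (0, 400)]
slanted_stem = slant_outline(stem, 0.25)
```

Points on the baseline do not move, while points at height 400 shift right by 100 units, so the upright stem becomes a parallelogram without any new conversion from METAFONT.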

  Another sub-tool builds these redundancy tables by comparing the encoding tables for sets of fonts against a target font. For example, a DC font combines punctuation and symbol characters spread across several Computer Modern fonts.
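A minimal sketch of such a comparison, in Python, is shown below. It is not the TrueTeX sub-tool: the function name, the choice to prefer earlier source fonts, and the small character sets are assumptions made for illustration.

```python
def build_redundancy_table(target, sources):
    """Find which target characters can be reused from source fonts.

    `target` is an iterable of character names wanted in the target font;
    `sources` maps source font name -> set of character names, in order
    of preference. Returns (reuse, missing): name -> source font, and the
    sorted list of names that no source font provides.
    """
    reuse, missing = {}, []
    for name in target:
        for font, names in sources.items():
            if name in names:
                reuse[name] = font  # first source that has the glyph wins
                break
        else:
            missing.append(name)
    return reuse, sorted(missing)


# Illustrative data: a few characters wanted in a DC font.
dc_target = ["A", "less", "sterling", "eth"]
cm_sources = {
    "cmr10": {"A", "B", "exclam"},
    "cmmi10": {"less", "greater"},
    "cmti10": {"A", "sterling"},
}
reuse, missing = build_redundancy_table(dc_target, cm_sources)
```

The `missing` list is the set of glyphs that still need a fresh conversion or a newly drawn character, while everything in `reuse` can be copied from an outline that has already been converted.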

Table 6: DC fonts in LaTeX (Rev. 1/95). The "funny" fonts (Fibonacci Roman, etc.) are omitted. This list is made by examining names in *.fd from the LaTeX distribution.

Font 5 6 7 8 9 10 12 17
dcb • • • • • • • •
dcbx • • • • • • •
dcbxsl • • • • • • •
dcbxti • • •
dccsc • • •
dcitt • • • • •
dcr • • • • • • • •
dcsl • • • • •
dcsltt • • • •
dcss • • • • •
dcssbx •
dcssdc •
dcssi • • • • •
dctcsc • • •
dcti • • • • • •
dctt • • • •
dcu • • • • • •

In our system, we actually produce textual versions of the binary fonts and convert them to Type 1 and TrueType formats with separate tools. This allows a general conversion to be optimized for the ultimate binary format. For example, Type 1 glyphs require knot-pivoting, followed by combing, to insert extrema tangent points. The hinting methods also differ between Type 1 and TrueType.

Once a binary version of a font is prepared, containing all the glyphs, a re-encoding tool (TrueTeX accessory program ttf_edit [17], which is a stack-oriented TrueType font encoding editor) must be applied to finish the font for real-world use. The re-encoding stage not only re-encodes, but can optionally adjust the metrics and kerning information. By making these aspects "afterthoughts" we can fine-tune fonts without going back into the detailed conversion process. The re-encoding stage can also upgrade any 8-bit-encoded TrueType font to an arbitrary Unicode encoding, which is important since many commercial font editors can only output 8-bit TrueType fonts.

• A notion of what TeX fonts we want to convert. If we consider the DC fonts a good target, we come up with quite a list (Table 6).

Rationalizing TeX fonts in Unicode sets

Let us consider the shuffling and dealing needed to reorganize Computer Modern into a Unicode encoding. With the luxury of thousands of code positions, we can undo the "scattering" of characters amongst the TeX fonts. For example, the math italic set (mathit{12}) contains the regular (not italic) lowercase Greek letters. Conversely, we are going to have to scatter a few TeX fonts that happened to combine dissimilar styles into one 7-bit font, such as the math symbol fonts (mathsy{12}) which contain calligraphic capitals.

In set-theoretic terms, the rationalization task involves the following steps:

• Begin with the union of all TeX characters, the set we have called TeXunion. Remember that this is the set of character names, not the glyphs themselves.

• Partition this union set into the largest subsets which do not cross encodings. This partitioning is a set of proper subsets of TeXunion; we call this set TeXpages. For example, all the uppercase letters A–Z make such a subset. A combination of all upper- and lowercase letters A–Z and a–z does not, because the small-caps fonts do not contain the lowercase letters. These subsets are equivalent to Knuth's Computer Modern METAFONT "program" files, because this was the highest level of source file nesting in which he did not make conditional the generation of characters.

• Encode the TeX character union set for a new, universal 16-bit encoding. That is, we invent a mapping of TeXunion members to unique 16-bit integer codes. Most of the members of TeXunion appear in Unicode and so have a natural encoding already determined. For the TeXunion members not in Unicode (which includes all the small-caps letters), we shall promulgate (by fiat) assignments to the Unicode private-zone codes. We designate this subset of TeXunion as TeXpzone; this subset finds its concrete representation in the private-zone codes expressed in the character synonym table. We have the relation

\[ (\mathit{Unicode} \cup \mathit{TeXpzone}) \supset \mathit{TeXunion}, \]

since Unicode contains many characters not used in TeX. We can compute a mapping

\[ \mathit{TeXunion} \mapsto (\mathit{Unicode} \cup \mathit{TeXpzone}) \]

by matching character names from the left to the names on the right; in this way we arrive at a Unicode code number for each TeX character. We designate this mapping (a 16-bit number for each member of TeXunion) as TeXeru ("TeX encodings rationalized to Unicode", rhymes with "kangaroo"). This mapping is the key result of rationalizing Computer Modern into Unicode.

• We can think again of a large table having, on one axis, the TeX fonts, on the other, the members of TeXpages, and bullets wherever Computer Modern implements METAFONT glyphs. This table will be sparsely and irregularly populated. The sparseness reflects the fact that the TeX fonts cover a wide range of characters, while the variations in style mostly are typographic distinctions on alphabets and punctuation; in other words, TeX provides more than a few symbol sets, in contrast to the simple ANSI/Symbol set distinction in 8-bit Windows fonts. The irregularity in this imaginary table results from the ad-hoc arrangement of TeXpages among TeXfonts. The sparseness is not a deficiency, but we ought to have some goal in mind for the rational extension of Computer Modern and the other TeX fonts to populate areas of this table for the sake of regularity. Realizing this goal would require drafting of new METAFONT code and translation to outlines. This is similar to how the DC font project extended Computer Modern to the rational T1 encoding.

• At this stage we are ready to determine a list of actual Computer Modern Unicode fonts which will cover TeXfonts. While Haralambous retained 8-bit PK fonts as the actual fonts for virtual Unicode fonts [7, 6], we will create a converse realization, namely Unicode TrueType fonts as the actual fonts for virtual CM, T1, or UT1 encodings.

Using the imaginary table just described, we can take the union of row subsets such that columns are not overlapped. If we want to maintain stylistic uniformity within individual fonts, we merge rows subject to a personal decision as to which rows "belong together" in a stylistic sense. On the other hand, if we want to minimize the number of actual fonts and don't mind different styles in a single font, we can make a "knapsack" optimization to pack the rows as tightly as possible. (Indeed, if we discard the Unicode conformity, we could put all of Computer Modern into a single Unicode font!) In any case, this collapsing of rows involves imprecise judgments to arrive at an optimized reduced set of fonts.

Implicit in this reduction is the factoring of wildcarded optical sizes that was introduced in TeXfonts; we call this reduced set of fonts TeXinUni, which will have a similar wildcarding to its parent TeXfonts, but fewer members in parallel with the reduction.

• To produce each real Unicode font (a member of TeXinUni), we assemble the glyphs and metrics from TeXfonts and install them via TeXeru into each code of the mapped-to TeXinUni member.

• We must finally produce Omega virtual fonts (that is, .xvp files) which will map 8-bit DVI codes from the old TeX fonts into TeXeru codes in TeXinUni members. For this we use the TrueTeX metric exporter to generate an .xvp file, and XVPtoXVF to convert this to an .xvf file; the .xfm file also produced contains the same information as the METAFONT .tfm and may be discarded if only TeX82 is to be used for formatting.
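The encoding step in the list above (matching TeXunion names against Unicode names and sending the leftovers to the private zone) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function build_texeru, the tiny name tables, and the policy of handing out private-zone codes consecutively from 0xE000 in sorted-name order are ours, and are not the published TeXeru assignments.

```python
PRIVATE_ZONE_START = 0xE000  # first code of the Unicode private-use area


def build_texeru(tex_union, unicode_names, first_private=PRIVATE_ZONE_START):
    """Assign a 16-bit code to every character name in tex_union.

    tex_union     -- iterable of TeX character names (the set TeXunion)
    unicode_names -- dict: character name -> Unicode code
    Names found in unicode_names keep their Unicode code; the others
    (the set TeXpzone) receive consecutive private-zone codes.
    """
    texeru = {}
    next_private = first_private
    for name in sorted(set(tex_union)):
        if name in unicode_names:
            texeru[name] = unicode_names[name]
        else:
            texeru[name] = next_private
            next_private += 1
    return texeru


# Hypothetical sample: "Asmall" stands in for a small-caps letter name
# that has no Unicode code of its own.
sample_union = ["A", "alpha", "Asmall"]
sample_unicode = {"A": 0x0041, "alpha": 0x03B1, "Omega": 0x03A9}
for name, code in build_texeru(sample_union, sample_unicode).items():
    print(f"{name:8s} {code:04X}")
```

A real registry would of course fix the private-zone codes by published agreement rather than by sort order; the sketch only shows the name-matching logic.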

Generating Unicode virtual fonts for non-TeX fonts

Let us consider a converse task: instead of converting single-byte-encoded Computer Modern fonts into Unicode fonts, let us assume we have a Unicode font in TrueType form, and want to make it usable with TeX or Omega. To use a font, TeX (and Omega) require a .tfm (or .xfm) metric file and a .vf (or .xvf) virtual font file. The virtual font is necessary only if a remapped encoding or composition is needed (usually the case).

To generate metric and virtual font files for Unicode fonts in Windows, TrueTeX provides a File + Export Metrics item which takes the user through several steps which illustrate the elements of such a translation:

• First, the user selects the font from the Windows standard font-selection dialog. For example, in Windows, standard fonts include Arial (a Helvetica clone), Courier, and Times New Roman (a Times Roman clone), together with their bold and/or italic variations. Windows will also install other TrueType fonts or (with Adobe Type Manager) Type 1 fonts. After the user selects a font, TrueTeX has a "font handle" with which it can access all the geometric information needed to calculate global and per-character metric quantities for the font.

• Second, the user must select names for the output .xvp file. Font names in Windows are verbose strings containing several words (such as "Times New Roman Regular (TrueType)"), while TeX insists on a single-word alphabetic name; therefore TrueTeX selects a TeX-style name for the font based on the TrueType file name (such as "times"). The metric output in this case would be a file times.xvp.

• Third, the user must specify the "input" and "output" encodings for the virtual font. The input encoding is the encoding of the actual font, typically ANSI or Unicode. The output encoding is the virtual encoding which the user desires to construct for TeX's point of view, and is typically a member of TeXencod. TrueTeX uses encodings in the form of .cod files (each line contains a hex code field followed by the character name field) or .afm (Adobe Font Metric [2]) files.⁴ To select an encoding, the user selects an item from a list which TrueTeX presents, each item giving a description of the encoding (for example, "TeX Roman with ligs = 2" corresponds to the roman2.cod file).

The user can also browse for encoding files instead of selecting from the canned list. The user can edit custom encoding files (which are just text files in afm format), and thereby gains complete flexibility of input versus output encoding in the virtual fonts, including automatic production of composites for missing input characters. The ttf_edit [17] program originates afm encoding tables from existing TrueType fonts, allowing maximal compatibility with randomly-encoded fonts.

• User input is now complete. TrueTeX begins analysis of the information provided by forming in-memory encoding tables from the encoding files (using sparse-array techniques to manage large, sparse code ranges). TrueTeX sorts and indexes the tables for fast content-addressability by either code or character name, assembles the global metrics in xvp terms for the font, and visits each input code in the font to build a table of per-character metrics and ligatures. xvp file building may now begin.

• For each output character name, TrueTeX determines if an exact match exists to an input name, and thus to an input code. If there is such an exact match, TrueTeX emits xvp commands which give the character's metrics and which re-map the TeX PUT/SET commands to the Unicode positions.

• For output characters which have no exact match to input characters, TrueTeX invokes the "composition engine." If the composition engine can compose or substitute a glyph for the character, it emits the xvp commands for the metrics and other actions. If the composition engine is "stumped", TrueTeX emits an xvp comment to note that the character is unencoded.

⁴ .afm files need not contain metrics; they can simply define an encoding for a dummy font, using only the C, CH, and N fields of the CharMetrics table. We thus maintain compatibility with other afm-reading software and avoid inventing yet another file format.
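The matching logic of the last two steps can be sketched as follows. This is a hedged illustration, not TrueTeX's implementation: the exact .cod syntax is assumed here to be whitespace-separated "hexcode name" lines, and the names parse_cod, plan_virtual_font and toy_compose are hypothetical. The sketch only decides, for each output code, whether to remap, compose, or give up; it does not emit real xvp commands.

```python
def parse_cod(text):
    """Parse .cod-style text: one hex code field and one name field per line."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            table[int(fields[0], 16)] = fields[1]
    return table


def plan_virtual_font(input_enc, output_enc, compose):
    """Decide how each output code is realized from the input font.

    input_enc, output_enc -- dict: code -> character name
    compose -- callable(name, name_to_input_code) returning a recipe or None
    Returns dict: output code -> ("remap", input code)
                              | ("compose", recipe)
                              | ("unencoded", None)
    """
    by_name = {name: code for code, name in input_enc.items()}
    plan = {}
    for out_code, name in sorted(output_enc.items()):
        if name in by_name:                      # exact name match
            plan[out_code] = ("remap", by_name[name])
        else:                                    # fall back to composition
            recipe = compose(name, by_name)
            plan[out_code] = ("compose", recipe) if recipe else ("unencoded", None)
    return plan


def toy_compose(name, available):
    """Hypothetical rule: '<letter>acute' is built from the letter and 'acute'."""
    if name.endswith("acute") and len(name) > len("acute"):
        letter = name[:-len("acute")]
        if letter in available and "acute" in available:
            return (available[letter], available["acute"])
    return None


input_enc = parse_cod("0041 A\n006F o\n00B4 acute\n")
output_enc = parse_cod("41 A\nF3 oacute\nBD onehalf\n")
for code, action in plan_virtual_font(input_enc, output_enc, toy_compose).items():
    print(f"{code:02X}", action)
```

Indexing the input table by name once, before the loop, mirrors the sorting and indexing for content-addressability described above.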

Note that virtual fonts can use more than one input font to produce a virtual output font. This would allow, for example, a text font, an expert font (such as might contain ligatures), and a symbol font (such as might contain math symbols) to contribute to a single TeX Unicode font. Another use for this technique would be the assembling of a Unicode font from the old bit-mapped PK fonts. TrueTeX supports all of the TeX and Omega-extended virtual font mechanisms, but does not yet directly support multiple input fonts for metric export (or bit-mapped fonts, for that matter), although an expert user can merge multiple virtual-property-list files to produce such a virtual font.

Composing missing characters

The Unicode and TeXunion sets overlap only in part (neither contains the other), but the virtual font mechanism allows users to create virtual TeXunion characters missing from a Unicode font with various composition or substitution methods. TrueTeX uses this technique to create completely populated virtual fonts when the underlying TrueType fonts are missing accented characters or ligatures.

To allow for easy upgrading, the "composition engine" in TrueTeX uses a user-modifiable script in a PostScript-like language to control the composition and substitution process. By changing the script, the user can add new composition methods or substitution rules, or adapt the methods to various typographic conventions. TrueTeX, using its own mini-PostScript interpreter, interprets this script at metric-export time, which means that users who speak PostScript and know a bit of font design can customize the composer. A good script yields a much better TeX virtual font, since commercial fonts are typically missing characters that TeX considers important, and the script can fill in most of the missing pieces.

When the composition script receives control from TrueTeX, all the encoding and metric information for the font and input and output encodings are defined as PostScript arrays and dictionaries.

Table 7: Composable Accent Characters.

Name           Position        Example

umlaut         top-center      ö
acute          top-center      ó
breve          top-center      ŏ
caron¹         top-center      ǒ
cedilla        bottom-center   o̧
circumflex     top-center      ô
comma          top-right       Ľ
dieresis       top-center      ö
dotaccent      top-center      ȯ
grave          top-center      ò
hungarumlaut   top-center      ő
macron         top-center      ō
ogonek¹        bottom-right
period         top-center      ȯ
ring²          top-center

¹ For D/d/L/l, changes to comma at top-right.
² Accent is not present in the Computer Modern fonts used in this portable document.

The standard composition script in TrueTeX implements the following techniques:

• Accent-plus-letter composition: if the name implies that the character is an accented letter, the script decomposes the name into the letter and accent components, and (when the letter and accent exist in the input font) uses the geometric information and typographic conventions to overlay the accent onto the letter in a virtual accented character, as shown in Table 7. Since the composer has detailed geometric information on the glyph shape, which is more elaborate than the bounding-box metrics TeX uses, it can do a careful job of placing accents.

• Ligature composition from sequence of letters: A ligature character (not to be confused with the ligature rules of the exported TeX metrics, a different topic) such as the T1 character "SS" will not usually exist in a TrueType font. Code positions for ligatures are not part of the Unicode standard,⁵ so even common ligatures are often not present. The composition script forms these by concatenating the component letters within the bounding box of the TeX character. This is applied to the ligatures: ff, fi, fl, ffi, ffl, IJ, ij, SS, Æ, æ, Œ, and œ.

• Remapping of certain names: The synonym table shows the problems of character names which are not standardized. An ad-hoc section of the composition script fixes up any ambiguities by recognizing ambiguous names and making the appropriate substitutions.

• Future extensions: Several more exotic composition methods are possible for an upgraded composition script. A novel idea is an "emergency fall-back" character generator: as a last resort for a missing character, the script could have a table of low-resolution renderings for all of TeXunion, consisting of (for example) an 8 × 16 dot-matrix or a plotter-style stick font; virtual font commands would render these with rules. In this method we could render any TeX document in a crude but accurate fashion without any fonts or specials at all!⁶

Another idea would use the ability of virtual fonts to call upon the TeX \special command. A Bézier curve special could draw and fill glyphs without the need for operating system support for fonts. This would not be efficient, and hinting would be missing on low-resolution devices, but it would place all the scalable font information in an XVF file.

⁵ Unicode does contain ligatures which are phonetic letters in certain languages, but this does not include typographic ligatures such as the f-ligatures.
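To make the accent-plus-letter technique concrete, here is a small Python sketch of the two operations involved: splitting a character name into letter and accent, and computing the shift that centers the accent above the letter. It is a simplification under stated assumptions: it uses only bounding boxes (the composer described above has more detailed glyph geometry), it assumes names follow the letter-then-accent pattern (as in "oacute"), and the function names and the gap parameter are hypothetical.

```python
# Accent names taken from Table 7.
ACCENTS = [
    "umlaut", "acute", "breve", "caron", "cedilla", "circumflex", "comma",
    "dieresis", "dotaccent", "grave", "hungarumlaut", "macron", "ogonek",
    "period", "ring",
]


def split_accented(name):
    """Split a name such as 'oacute' into ('o', 'acute'); None if no accent."""
    # Try longer accent names first so 'hungarumlaut' wins over 'umlaut'.
    for accent in sorted(ACCENTS, key=len, reverse=True):
        if name.endswith(accent) and len(name) > len(accent):
            return name[:-len(accent)], accent
    return None


def place_top_center(letter_bbox, accent_bbox, gap):
    """Return (dx, dy) moving the accent to sit centered above the letter.

    Boxes are (xmin, ymin, xmax, ymax) in font units; gap is the vertical
    clearance wanted between the letter's top and the accent's bottom.
    """
    lx0, _, lx1, ly1 = letter_bbox
    ax0, ay0, ax1, _ = accent_bbox
    dx = (lx0 + lx1) / 2 - (ax0 + ax1) / 2   # align horizontal centers
    dy = (ly1 + gap) - ay0                   # lift accent bottom to top + gap
    return dx, dy


print(split_accented("ohungarumlaut"))
print(place_top_center((0, 0, 500, 450), (100, 600, 300, 750), gap=50))
```

The other positions of Table 7 (bottom-center, top-right, bottom-right) would follow the same pattern with different reference points on the two boxes.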

Projecting ligature rules into TrueType fonts

The TrueType fonts in Windows do not supply any ligature rules such as are contained in Computer Modern. To export metrics containing the usual TeX ligature rules, TrueTeX considers the rules in Table 8 when exporting the global vpl (xvp) metrics, when the target ligatures exist in the font, or when the ligatures can be produced by the composition engine.
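The effect of the rules in Table 8 can be illustrated with a short Python sketch. It is a deliberately simplified model: real TeX ligature/kern programs have more operations than plain pair replacement, and the names LIGATURES, usable_rules and apply_ligatures are our own. The rule table is transcribed from Table 8, with "ff", "fi", and so on standing for the ligature glyph names.

```python
LIGATURES = {
    ("hyphen", "hyphen"): "endash",
    ("endash", "hyphen"): "emdash",
    ("comma", "comma"): "quotedblbase",
    ("less", "less"): "guillemotleft",
    ("greater", "greater"): "guillemotright",
    ("exclam", "quoteleft"): "exclamdown",
    ("question", "quoteleft"): "questiondown",
    ("f", "f"): "ff",
    ("f", "i"): "fi",
    ("f", "l"): "fl",
    ("ff", "i"): "ffi",
    ("ff", "l"): "ffl",
    ("quoteleft", "quoteleft"): "quotedblleft",
    ("quoteright", "quoteright"): "quotedblright",
}


def usable_rules(rules, available):
    """Keep only rules whose result exists in the font (or can be composed)."""
    return {pair: lig for pair, lig in rules.items() if lig in available}


def apply_ligatures(names, rules=LIGATURES):
    """Replace adjacent pairs of character names by their ligature, left to right."""
    out = []
    for name in names:
        out.append(name)
        # Keep merging the last two entries while a rule matches,
        # so that f + f + i becomes ff + i and then ffi.
        while len(out) >= 2 and (out[-2], out[-1]) in rules:
            out[-2:] = [rules[(out[-2], out[-1])]]
    return out


print(apply_ligatures(["hyphen", "hyphen", "hyphen"]))
print(apply_ligatures(list("office")))
```

Filtering with usable_rules before export corresponds to the condition that a rule is only written when its target ligature exists in the font or can be produced by the composition engine.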

Supporting metric export formats

TrueTeX supports both .vpl (TeX Virtual Property List) and .xvp (Omega Extended Virtual Property List) file formats when exporting font metrics. This is more than merely a variation in format; when exporting to .vpl format, the encodings are truncated to 8 bits, so that the composition process for missing characters will likely be more intensive. TeXware programs VPtoVF and XVPtoXVF translate the property list files to their respective .tfm, .xfm, .vf, and .xvf binary formats for use with TeX and Omega. TrueTeX launches the appropriate translator after the property-list export is complete.

Table 8: Ligature Rules Applied to Exported Fonts.

First        Second       Result           Description

Dashes
hyphen       hyphen       endash           -- to endash
endash       hyphen       emdash           endash- to emdash

Shortcuts to national symbols
comma        comma        quotedblbase     ,, to quotedblbase
less         less         guillemotleft    << to left guillemot
greater      greater      guillemotright   >> to right guillemot
exclam       quoteleft    exclamdown       !` to exclamdown
question     quoteleft    questiondown     ?` to questiondown

F-ligatures
f            f            ﬀ                ff to ﬀ
f            i            ﬁ                fi to ﬁ
f            l            ﬂ                fl to ﬂ
ﬀ            i            ﬃ                ﬀi to ﬃ
ﬀ            l            ﬄ                ﬀl to ﬄ

Paired single quotes to double quotes
quoteleft    quoteleft    quotedblleft     `` to quotedblleft
quoteright   quoteright   quotedblright    '' to quotedblright

⁶ This could solve the problem of rendering TeX documents in HTML browsers.

TrueTeX metric export also supports the older .pl (TeX Property List) metric file format, and the companion program PLtoTF, should it be needed for use with older TeX software. In this case the user can specify only an input font encoding, and the property list reflects this encoding as applied to the TrueType font selected, without virtual remapping.

While exporting .xvp files will connect the TrueTeX previewer to Windows' Unicode fonts, the TeX82 formatter requires .tfm metric files, not Omega .xfm files. If the output font has an 8-bit encoding, the resulting virtual font is nevertheless compatible with the original TeX formatter's 8-bit character codes, and will not require the Omega formatter. To create a .tfm file for such a font, a TrueTeX filter program xvptovpl truncates the virtual codes in the .xvp file and produces a truncated .vpl file, and via VPtoVF, a .tfm file for use with TeX. The .vf file created in this process is discarded, since it does not properly map the 8-bit characters to Unicode. The .xfm and .tfm files produced by this process will contain the same information; the .xfm format is needed only if Omega is to be used. TrueTeX uses the .xvf file to map the 8-bit TeX characters to Unicode positions.

Implementing sparse metric tables

In implementing programs which use metric data, we must take care to apply sparse-matrix techniques to avoid enormous memory demands from nearly-empty font-metric tables. Sixteen-bit encoded fonts are typically sparsely populated. For example, the Windows NT text fonts contain about 650 characters each; most codes are in the range 0–0x2ff, with some symbols in the 0x2000 vicinity and a few odd characters in the private zone at 0xf001 or 0xfb01. We would expect such a segmented locality in a typical font.

One technique is to use a segmented table with binary-search lookup; this is close to the method used in TrueType fonts. A hash table for the two-byte keys may be used instead of the binary search. Segmented tables will require the least storage, at the expense of a possible hashing performance problem in the event of degenerate tables. Since all Unicode-capable operating systems are advanced enough to support virtual memory, the performance risk does not justify the memory savings.
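For comparison, the segmented-table technique with binary-search lookup might look like the following Python sketch. The class name SegmentedTable and the sample widths are illustrative assumptions; this is not the TrueType cmap format itself, only the same general idea of sorted contiguous runs.

```python
from bisect import bisect_right


class SegmentedTable:
    """Sparse code -> metric table stored as sorted contiguous segments."""

    def __init__(self, segments):
        # segments: iterable of (first_code, [metric, metric, ...]),
        # each list covering consecutive codes starting at first_code.
        self.segments = sorted(segments, key=lambda seg: seg[0])
        self.starts = [first for first, _ in self.segments]

    def lookup(self, code):
        """Return the metric for code, or None if the code is unencoded."""
        i = bisect_right(self.starts, code) - 1   # last segment starting <= code
        if i < 0:
            return None
        first, values = self.segments[i]
        offset = code - first
        return values[offset] if offset < len(values) else None


# Made-up widths for two small runs of codes.
table = SegmentedTable([(0x2013, [500, 1000]), (0x20, [500, 278])])
print(table.lookup(0x21), table.lookup(0x2014), table.lookup(0x3000))
```

Storage is proportional to the number of encoded characters plus the number of segments, at the cost of a search on every lookup.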

TrueTeX uses a 2-level pointer technique: metrics for a 16-bit code table consist of a table of 256 pointers to metric tables with 256 entries each. In this way a typical font having perhaps 6 or 7 contiguous code populations has very little wasted space. The worst-case and best-case performance are both acceptable. The lookup time is accelerated by avoiding a null-pointer test by pointing unencoded pages to a dummy table of zero-width metrics.
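A minimal Python sketch of this 2-level scheme follows. It models each metric as a single width for brevity, and the class and method names are our own; the point is the shared dummy page, which lets lookup index twice without ever testing for a missing page.

```python
PAGE_SIZE = 256
EMPTY_PAGE = [0] * PAGE_SIZE   # shared dummy page of zero-width metrics


class TwoLevelMetrics:
    """16-bit code -> width table: 256 page pointers, 256 entries per page."""

    def __init__(self):
        # Every page initially points at the same dummy page.
        self.pages = [EMPTY_PAGE] * 256

    def set_width(self, code, width):
        hi, lo = code >> 8, code & 0xFF
        if self.pages[hi] is EMPTY_PAGE:
            # Allocate a real page only when the first code in it is set.
            self.pages[hi] = [0] * PAGE_SIZE
        self.pages[hi][lo] = width

    def width(self, code):
        # No null test: unencoded pages resolve to the zero-width dummy.
        return self.pages[code >> 8][code & 0xFF]

    def allocated_pages(self):
        return sum(1 for page in self.pages if page is not EMPTY_PAGE)


metrics = TwoLevelMetrics()
metrics.set_width(0x0041, 722)   # made-up widths
metrics.set_width(0xFB01, 556)
print(metrics.width(0x0041), metrics.width(0x1234), metrics.allocated_pages())
```

A font with codes clustered in six or seven pages allocates only those pages, while every lookup costs exactly two index operations.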

A time-bomb of a problem looms with Omega's .xfm format [12], which recklessly ignores any sparse-array issues. An .xfm file, like the older .tfm format, keeps an unpacked char_info array which spans the smallest to largest codes (that is, the interval [bc, ec]). Since the Unicode private-zone population typically extends through 0xfb00, practically all fonts will have a char_info of about 126 KB, almost all wasted space. This will result in a typical .xfm file being 130 KB, instead of about 4 KB that a simple sparse-array technique would provide. It is imperative that we upgrade the .xfm format and xfm-reading programs to better handle sparse encodings.⁷

⁷ The .vf and .xvf virtual font formats have always packed sparse per-character information, so they need no such attention.

Summary

Let us review the areas of development needed to extend all of TeX to Unicode:

• Extending the .tfm file format and its run-time forms to large, sparsely populated fonts, without sacrificing backward compatibility and without exploding file lengths. We have seen that the .xfm format can represent the information, although it needs improvement for sparsely populated fonts.

• Extending Computer Modern and other meta-fonts to fully populate the appropriate Unicode positions. A complete Unicode text font requires about 500 symbols. While it is unlikely that all the styles of Computer Modern will receive the attention to fill the tables completely, we can at least insert legible placeholders.

• Creating a formal database of TeX character names, joinable to the Unicode official names. Some standard is necessary for any development, and there is no reason to favor anyone's favorite names. What is crucial is that the registry be initiated now, so that TeX software authors have an early start on making interchangeable fonts and documents. While TeX users cannot themselves dictate the standard Unicode names, we can at least make a stand-in version for our own use (since none seem to exist at present), and if an acceptable set of Unicode names comes along, we can adopt it later.

• Creating a formal database of all 28 TeX font encodings and their 31 greatest common subsets, joinable to Unicode and other encoding standards. The TrueTeX tables are available to all for examination and use [16]. Once these are improved by public usage and scrutiny, they should be adopted as a formal standard for TeX.

• Identifying and assigning non-Unicode TeX character names to the Unicode private zone, thereby promoting inter-operability of Unicode TeX implementations. These should be concretely represented in a synonym table, which also needs to be published. There are many potential conflicts, and taking counsel from as many different users of TeX as possible is the only way to maximize the compatibility of the result. The Omega project has already started to stake out claims on the virgin Unicode real estate [4, Table 4, page 425] for Ω-Babel. There are no doubt synonyms and ambiguities outside our own experience; one can only hope there is sufficient room for all interested parties.

• Extending DVI translators to accommodate the extended .tfm, .vf, and .dvi formats, including sparse-array techniques for efficient run-time performance. The Omega project has issued this call to "DVI-ware developers" [4, page 426, Conclusion], although with surprising aplomb for the implications. We hereby respond with our implementation in TrueTeX, and invite others to build on our experience.

• Promulgating the ongoing TeXeru, the TeX-in-Unicode mapping, based on a seasoned registry. An ongoing authority for additions and corrections will be vital. This authority will be responsible for registering new TeX character names and avoiding Unicode conflicts.

• Changing the plain TeX and LaTeX macros to accommodate the 16-bit encoding extensions, while maintaining backward compatibility from a single source. This is a tall order, and one we have not touched.

• Extending \special handling for 16-bit character sets. Now we open up the carousing command of TeX to a whole new vista of revelry, with the gift of tongues. This is another item that we shall put off for now.

• Implementing TeX and DVI translation user interfaces in selectable languages. While Omega processes in Unicode, it talks to the user in the old 8-bit fashion. Perhaps it is a bit much to expect a Web2C TeX change file to incorporate wchar_t and other Unicode constructs of the C programming language. But DVI translators for Windows NT and other Unicode-capable platforms should have this designed in from the start. A properly designed application can be reimplemented for another language by any non-programmer who knows the application and can translate the messages; no programming or recompilation is required.

• Establishing provisions for orderly extension of the fonts and character sets, so that new TeX fonts, characters, and encodings may be incorporated into later versions. Having made the effort to retrofit METAFONT output for an altogether different way of encoding, we would hope that font designers recognize the shortcomings of the 7- and 8-bit encodings and keep the promises of Unicode in mind.

• Rationalizing the TeX fonts into orthogonal styles and weights (such as cmmb10 and cmmr10, for example). As TeX users we don't care about this, but if Computer Modern is to be accepted in non-TeX applications, the style axes will have to be fully varied and populated along the conventional ranges.

• Providing a means for creating virtual fonts for non-TeX Unicode fonts in TrueType or Type 1 format. Although this capability is available now only in the commercial TrueTeX Unicode edition, a new TeXware stand-alone tool could interpret TrueType font files (or whatever typeface technology is supporting Unicode rendering) and join the encoding and other information into an .xvp file.

References

[1] Adobe Systems Incorporated. Portable Document Format Reference Manual. Reading, Mass.: Addison-Wesley, 1993.

[2] Adobe Systems Incorporated. Adobe Font Metrics (AFM) File Format Specification, Version 4.1, October 1995. [Published in PDF and PostScript form at ftp://ftp.adobe.com.]

[3] Fairbairns, Robin. "Omega – Why Bother with Unicode?" TUGboat 16,3 (1995), pages 325–328.

[4] Haralambous, Yannis, John Plaice, and Johannes Braams. "Never Again Active Characters! Ω-Babel." TUGboat 16,4 (1995), pages 418–427.

[5] Haralambous, Yannis. "Ω, a TeX Extension Including Unicode and Featuring Lex-like Filtering Processes." Proceedings of the Eighth European TeX Conference, Gdańsk, Poland, 1994, pages 153–166. [There is an Omega Web page at http://www.ens.fr/omega. See also the Omega FTP site [12].]

[6] Haralambous, Yannis, and John Plaice. "Ω + Virtual METAFONT = Unicode + Typography (First Draft)." ftp://nef.ens.fr/pub/tex/yannis/omega/cernomeg.ps.gz, January 1996.

[7] Haralambous, Yannis. "The Unicode Computer Modern Project." [A document link on the Omega Web page [5].]

[8] Kinch, Richard. "Belleek: TeX Virtual T1-Encoded Fonts for Windows TrueType." Available on the author's Web site as a LaTeX document in belleek.zip. [The Belleek software is an implementation of T1-encoded fonts using TrueType scaling technology under Microsoft Windows. Belleek consists of TrueType fonts which render elements of METAFONT glyphs in scalable form, plus TeX virtual fonts which remap and compose T1 characters from these elements.]

[9] Kinch, Richard. "MetaFog: Converting METAFONT Shapes to Contours." TUGboat 16,3 (1995), pages 233–243.

[10] Knappen, Jörg. "The European Computer Modern Fonts." CTAN file: tex-archive/fonts/dc/mf/dcdoc.tex.

[11] Knuth, Donald. "Virtual Fonts: More Fun for Grand Wizards." TUGboat 11,1 (1990), pages 13–24.

[12] Plaice, John, and Yannis Haralambous. "Draft Documentation for the Ω System." ftp://nef.ens.fr/pub/tex/yannis/omega/first.tex, February 1995. [See also the Omega Web page [5].]

[13] Plaice, John. "Progress in the Ω Project." TUGboat 15,3 (1994), pages 320–324.

[14] The Unicode Consortium, Inc. http://www.unicode.org. [This site holds files listing codes and descriptive phrases. There are no sample character images, and there are no PostScript names. A monograph gives paper images [15].]

[15] Addison-Wesley Publishing Company. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volumes I and II.

[16] The encodings (consisting at present of 46 encoding sets containing over 9000 pairs, represented in .afm format) herein listed in Tables 1, 4, and 5, are available via the author's Web site.

[17] The programs ttf_edit and joincode for DOS, Windows, and Linux are available via the author's Web site.