solution for multilingual literature by xsl formatter · pdf filesolution for multilingual...

17
Solution for Multilingual Literature by XSL Formatter May 2003 Update: September 2003 Antenna House, Inc. Multilingual Typesetting Study Group Problems in making multilingual literature Let us first go over potential challenges in multilingual computer typesetting. Each of these items is already difficult enough on its own, and rapid progress in technology is mak- ing our mastery and utilization of such typesetting even harder. Japanese language typesetting is an established field but the multilingual one is the field for the future. The Multilin- gual Typesetting Study Group aims to review these issues to promptly find as much as possible practical solutions. (1) How to create the source data for typesetting? For computers to process character information, such data needs to be already prepared as coded information. From this perspective, the creation of multilingual data is far more diffi- cult than doing so monolingually in Japanese. 1. Selection of character encoding Character encoding (or character codes) has been standardized mainly as a region-specific local char- acter code set per country. But representation of data using a local character code set would not ena- ble the handling of documents with a mix of multi- ple languages. Any multilingual editing or, in particular, editing and typesetting of documents containing more than one language would inevita- bly require Unicode. (2) (1) This document is written as XML document conforming to Sim- ple Doc.dtd, then formatted by XSL Formatter V2.5 and converted to PDF. To what extent can Unicode support language di- versity? What is the latest status of Unicode stand- ardization and what products are available with Unicode capability? Since Unicode is evolving at a very fast pace, it is necessary to keep abreast with the latest news on Unicode and digest it accurately. What kinds of problems are found in Unicode? The types of character codes usable have been limi- ted in conventional JIS or ASCII encoded text files. In contrast, a variety of new character codes has been defined for texts by Unicode. For instance, there are the following codes 16 character codes from U+2000 to U+200F. What significance does these codes have for typesetting? How shall we best use them? 16 characters starting from U+2000 2000;N # EN QUAD 2001;N # EM QUAD 2002;N # EN SPACE 2003;N # EM SPACE 2004;N # THREE-PER-EM SPACE 2005;N # FOUR-PER-EM SPACE 2006;N # SIX-PER-EM SPACE 2007;N # FIGURE SPACE 2008;N # PUNCTUATION SPACE 2009;N # THIN SPACE 200A;N # HAIR SPACE 200B;N # ZERO WIDTH SPACE 200C;N # ZERO WIDTH NON-JOINER (2) There is a solution that uses ISO-2022 based encoding. - 1 -

Upload: truongdang

Post on 22-Mar-2018

234 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Solution for Multilingual Literature byXSL Formatter

May 2003Update: September 2003

Antenna House, Inc.Multilingual Typesetting Study Group

Problems in making multilingual literature

Let us first go over potential challenges in multilingualcomputer typesetting. Each of these items is already difficultenough on its own, and rapid progress in technology is mak-ing our mastery and utilization of such typesetting evenharder.

Japanese language typesetting is an established field butthe multilingual one is the field for the future. The Multilin-gual Typesetting Study Group aims to review these issues topromptly find as much as possible practical solutions. (1)

How to create the source data for typesetting?For computers to process character information, such data

needs to be already prepared as coded information. From thisperspective, the creation of multilingual data is far more diffi-cult than doing so monolingually in Japanese.

1. Selection of character encoding・ Character encoding (or character codes) has been

standardized mainly as a region-specific local char-acter code set per country. But representation ofdata using a local character code set would not ena-ble the handling of documents with a mix of multi-ple languages. Any multilingual editing or, inparticular, editing and typesetting of documentscontaining more than one language would inevita-bly require Unicode. (2)

(1) This document is written as XML document conforming to Sim-

ple Doc.dtd, then formatted by XSL Formatter V2.5 and converted

to PDF.

・ To what extent can Unicode support language di-versity? What is the latest status of Unicode stand-ardization and what products are available withUnicode capability? Since Unicode is evolving at avery fast pace, it is necessary to keep abreast withthe latest news on Unicode and digest it accurately.

・ What kinds of problems are found in Unicode?・ The types of character codes usable have been limi-

ted in conventional JIS or ASCII encoded text files.In contrast, a variety of new character codes hasbeen defined for texts by Unicode. For instance,there are the following codes 16 character codesfrom U+2000 to U+200F. What significance doesthese codes have for typesetting? How shall webest use them?

16 characters starting from U+20002000;N # EN QUAD2001;N # EM QUAD2002;N # EN SPACE2003;N # EM SPACE2004;N # THREE-PER-EM SPACE2005;N # FOUR-PER-EM SPACE2006;N # SIX-PER-EM SPACE2007;N # FIGURE SPACE2008;N # PUNCTUATION SPACE2009;N # THIN SPACE200A;N # HAIR SPACE200B;N # ZERO WIDTH SPACE200C;N # ZERO WIDTH NON-JOINER

(2) There is a solution that uses ISO-2022 based encoding.

- 1 -

Page 2: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

200D;N # ZERO WIDTH JOINER200E;N # LEFT-TO-RIGHT MARK200F;N # RIGHT-TO-LEFT MARK

2. Selection of the computer. How do we choose the hard-ware and OS?・ What type of environment do we choose: Macin-

tosh, Windows 2000/XP, Unix (Solaris, etc.), Li-nux, or JAVA?

・ In Windows, multilingual processing that includesAsian languages is made possible by the provisionof a multilingual processing layer (library) calledUniscribe. Internet Explorer and Microsoft Wordseems to use Uniscribe to allow the processing ofeven a wide range of Asian languages

・ It seems that Windows is the most advanced in mul-tilingual processing capability. How much process-ing capability can you obtain in JAVA for Asianlanguages?

3. How do we enter data into the computer?・ What types of software are available for data entry?・ Selection of keyboard: in what way should the key-

board be prepared? Keyboards have been standar-dized in each country; personal computers sold in aspecific country come with keyboards in its na-tional standard. (3)In particular, how can we enterdata in other languages using keyboards packagedwith PCs in Japan?

・ Do we need IME? What should be the selection cri-teria? As is generally known, romaji (roman charac-ter) input and kana-kanji conversion is the mainapproach for imput method of the Japanese lan-guage. But it seems too demanding for foreignerswho are not familiar with Japanese to enter kanjiby pronunciation using roman characters. In thesame token, it will be very difficult for Japanese toinput Chinese characters using pinyin, although itmust be the natural choice for Chinese.

(3) Last year I went to an Internet café in Stockholm and tried to

send an email. I remember being perplexed at not being able to fig-

ure out the input method for the ever-necessary @ mark when enter-

ing the email address because of the Swedish keyboard layout.

4. Method of representing data・ Should the data be application-dependent binary or

application-independent XML?・ Our study group assumes that XML could be the

best for achieving multilingual processing. On theother hand, it is true that XML poses a higher hur-dle for learners to clear. Tagging XML is not toodifficult but, generally speaking, people tend to beoverly intimidated by tags. How can we lower thehurdle for XML?

・ With XML, the data structure (Schema) has to bedesigned.

・ Instead of defining new data structures, can we useexisting DTD/Schema? For instance, is DocBook.dtd usable?

・ Is it possible to propose a new standard DTD/Schema definition? Strengthening of SimpleDoc.dtd is proposed.

5. 5. Selection of editing software・ Whether or not the familiar editing software are

usable has a great effect on the productivity ofdocument creation; determining the range of edit-ing software applicable will be very important.From this perspective, the availability of MicrosoftWord will greatly enhance the process. How far isMicrosoft Word usable as a multilingual editingsoftware?

・ There is a number of editing software claiming tobe multilingual. However, there is not much availa-ble in terms of an all-mighty software that is capa-ble of editing English, other Western languages,Japanese, Chinese, Korean, Arabic, Hebrew, andThai in a single version. If we have to switch edit-ing software by language, no document with multi-ple languages can be generated. In addition, changeof softwares from language to language would in-volve the learning of new operations and raise thequestion of data compatibility; the softwares bylanguage should be avoided. But what shall we doif there is no way out but to introduce of software?

・ When we want to edit with WYSIWYG or some-thing similar to it as the editing method, do we

- 2 -

Page 3: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

have any tool that enables such a method in multi-ple languages? Assuming we do, which software isthe best for that tool? XML support appears to bewell advanced in Office 2003 (Word 2003), but towhat extent is WYSIWYG editing of XML by Of-fice 2003 possible?

・ It is necessary to have tools to support Schema-driven data input and editing to create data withXML. Is there such a software?

・ If experts create a document, they can use a type ofXML editing software that displays the tags as theyare knowledgeable enough to test and learn. Isthere an XML editing tool that can do this multilin-gually? If so, which software is the best for that tool?

Method of typesetting1. Selection of typesetting software

・ What is the software selection for multilingualtypesetting currently available? What languagescan each of these software handle?

・ To what extent can we utilize QuarkXpress or In-Design in multilingual typesetting?

・ When we have frequent changes in layout, we willneed typesetting software that approximates WY-SIWYG. If that is the case, DTP software wouldgive us a better cost performance than XML. Inwhat situation shall we use XML?

・ There is no XML typesetting software availablethat allows frequent layout changes, WYSIWYGediting, and real time reflection of layout and edit-ing results to the XML source data.

・ XSL Formatter is the most superior multilingualtypesetting software available, but what are its limi-tations and in what situations is it most effective?

2. Fonts: fonts are essential in the visualization and displayof character images on screen, paper, or as PDF.・ What types of Unicode-compatible fonts are availa-

ble?

・ When a PDF file is created and then distributed orprinted, it is necessary to embed the outline offonts in PDF. Therefore, the fonts to be used in mul-tilingual typesetting have to allow outline-embed-ding. What fonts are available for multilingualtypesetting?

3. Layout specification by XSL-FO・ How much can we utilize XSL-FO? To what extent

can we specify complex layouts?・ Does it work when typesetting rules are different

from one language to another?・ Does it work when languages of different typeset-

ting rules coexist in a single text?・ Does it work when one language runs from right to

left and another runs from left to right in one docu-ment?

Printing and PDF creation methods1. How to print multilingual documents2. Distinctions between PDF for printing and PDF for the

Web and their specific applications

Others1. Preparation of Table of Content and back of the book In-

dex.2. Sorting order of index, sorting rules by language, and

sorting rules for mixed language documents

Preliminary Knowledge of Multi Language For-matting

Character and LanguageLanguage is expressed in characters. First, in order to proc-

ess a text string, computers must handle the characters whichdescribe the language. The following table contains the prin-cipal languages with their respective local character code sets.

ISO Language code ISO Language Type of letter code classified by area

ar Arabic Arabic ASMO 449, Latin/Arabic Alphabet

bg Bulgarian Cyrillic Latin/Cyrillic Alphabet

- 3 -

Page 4: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

ISO Language code ISO Language Type of letter code classified by area

km Cambodian Khmer (First registered from Unicode V3.0)

zh-CN Chinese(Simplified) Simplified Chinese GB2312, GB18030

zh-TW Chinese(Traditional) Traditinal Chinese BIG5

hr Croatian Latin Latin Alphabet No.2,10

cs Czech Latin Latin Alphabet No.2

da Danish Latin Latin Alphabet No.1,4,5,6,8,9

nl Dutch Latin Latin Alphabet No.1,5,9

en English Latin Latin Alphabet No.1..10

et Estonian Latin Latin Alphabet No.4,6,7,9

fi Finnish Latin Latin Alphabet No.4,6,7,9,10

fr French Latin Latin Alphabet No.9,10

de German Latin Latin Alphabet No.1..10 (Excluding 7)

el Greek Greek Latin/Greek Alphabet

he Hebrew Hebrew Latin/Hebrew Alphabet

hi Hindi Devanagari IS 13194 (ISCII), etc.

hu Hungarian Latin Latin Alphabet No.2,10

is Icelandic Latin Latin Alphabet No.1,6,9

id Indonesian Latin Latin Characters

it Italian Latin Latin Alphabet No.1,3,5,8,9,10

ja Japanese Latin, Lanji, Kana, Katakana JISX0201, JIS X0208, JIS X0212

kk Kazakh Cyrillic Extended Latin/Cyrillic Alphabet (Cyrillic Asean)

ko Korean Hangeul, Kanji KS C5601, KS X1001, Johab

lv Latvian Latin Latin Alphabet No.4,7

ms Malay Latin orArabic Latin Alphabet, Arabic Extended

lt Lithuanian Latin Latin Alphabet No.4,6,7

no Norwegian Latin Latin Alphabet No.1,4..9

fa Persian(Farsi) Arabic Extended Latin/Arabic Alphabet (Arabic Character 28+ Original 4Characters)

pl Polish Latin Latin Alphabet No.2,7,10

pt Portuguese Latin Latin Alphabet No.1,3,5,8,9

ro Romanian Latin Latin Alphabet No.10

ru Russian Cyrillic koi8-r, Latin/Cyrillic Alphabet 32 Chars (not compatible with Uk-rainian)

sr Serbian Cyrillic Latin/Cyrillic Alphabet (Serbian)

sk Slovak Latin Latin Alphabet No.2

sl Slovenian Latin Latin Alphabet No.2,4,6,10

es Spanish Latin Latin Alphabet No.1,5,8,9

sv Swedish Latin Latin Alphabet No.1,4,5,6,8,9

sw Swahili Latin

tl Tagalog/Takalog Latin

th Thai Thai TIS 620, Latin/Thai Alphabet

tr Turkish Latin Latin Alphabet No.5

- 4 -

Page 5: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

ISO Language code ISO Language Type of letter code classified by area

uk Ukrainian Cyrillic koi8-u, Latin/Cyrillic Alphabet 33 Chars

ur Urdu Arabic Extended

vi Vietnamese Latin Extended Latin Characters

xh Xhosa Latin

zu Zulu Latin

UnicodeAt present, Unicode provides the capacity to encode al-

most any character of languages around the world.History of Unicode

Oct. 1991 Unicode 1.0.0 issuedJul. 1996 Unicode 2.0.0 issuedSep. 1999 Unicode 3.0.0 issuedMar. 2002 Unicode 3.2.0 issuedApr. 2003 Unicode 4.0.0 issued

Unicode not only defines character code set, but also pro-vides writing direction and line breaking properties and otherspecifications. Unicode character database includes informa-tion that defines writing direction of each character. Unicodealso specifies various kinds of text processing standards suchas Line Breaking Properties, Unicode BIDI text processingalgorithm, and reference implimentations for developers tocreate an application program.

Internal character code of OS and applicationDuring the eighties to the ninties, the personal computer

OS was based on character codes of each local area. The ap-plication programs that run on the OS was restricted by localarea and had limitations with handling of character codes.For example, Japanese WindowsMe internally manipulatesJapanese characters encoded by Shift JIS (JIS X0201 and JISApplication software that runs on WindowsMe cannot easilyprocess special Latin letters such as A with diacresis:Ä, Owith diacresis:Ö, U with diacresis:Ü and so on. These co-des are assigned for half width Japanese katakana in JISX0201 and the character codes conflict with special Latin let-ters.

As for Windows2000/XP of Microsoft, the processing in-side OS is based on Unicode and multilingual processingfunctions are strengthened sufficiently. Windows2000 or XPshoud be selected for multilingual processing.

There are some which process the data inside with Uni-code and some which are processed with the local charactercode in the application software. When trying to process mul-tilingual, it is necessary to select the application softwarewhich processes Unicode inside. For example, XSL Format-ter and Micorosoft Word2000/XP are Unicode application,but FrameMaker is not a Unicode application.

FontWhen processing languages through computers, font tech-

nology is considered to be the next important infrastructure.In fact, without fonts, characters can be neither printed nordisplayed. The following table contains a list of fonts that areusually installed with Microsoft Windows 2000/XP, or canbe downloaded free of charge from the Internet. Amongthese fonts, Arial Unicode MS is the only font that covers allrange of Unicode.

Arial Unicode MS has drawbacks that it does not includeall of the characters of Unicode 3.2 yet, and its design ofglyph is somewhere poor in quality.

For languages such as English, Westeren European, Slav,Japanese, Chinese (simplified and traditional), Korean, Ara-bic, Hebrew and Thai, TrueType or OpenType (TrueTypeFormat) fonts with enough quality can be prepared free ofcharge. Of course, these fonts alone are insufficient for de-signers who illustrate high quality print materials. However,for the purpose of IOM manuals, these fonts are practical.

Standard installation of Windows 2000 does not necessa-rily mean that all fonts that are shown as supplied with Win-dows 2000 are installed. Angsana (Thai font) or Mangal(Hindi font) is not installed with the standard installation ofWindows 2000 or Windows XP. These languages are not in-stalled unless you select the Regional Options of the ControlPanel, go to the system language setting, and choose the lan-

- 5 -

Page 6: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

guage, e.g., Thai or Indic, to reset the system. (See next dia-gram)

Font family The principal character which it covers Procurement manner Sort

Arial Unicode MS All characters of Unicode V2 Office2000/XP etc. Sans-serif

Arial Latin,Greek,Cyrillic,Arabic,Hebrew 2000/XP Sans-serif

Courier New Latin,Greek,Cyrillic,Arabic,Hebrew 2000/XP Monospace

Lucida Console Latin,Greek,Cyrillic 2000/XP Monospace

Lucida Sans Unicode Latin,Greek,Cyrillic,Hebrew, symbol 2000/XP Sans-serif

Microsoft Sans Serif Latin,Greek,Cyrillic,Arabic,Hebrew, Thai 2000/XP Sans-serif

Tahoma Latin,Greek,Cyrillic,Arabic,Hebrew, Thai 2000/XP Sans-serif

Times New Roman Latin,Greek,Cyrillic 2000/XP Serif

Vernada Latin,Greek,Cyrillic 2000/XP Sans-serif

Arabic Transparent Arabic 2000/XP Sans-serif(Latin), Cursive(Arabic)

Traditional Arabic Arabic 2000/XP Sans-serif(Latin), Cursive(Arabic)

Sylfaen Latin, Greek, Cyrillic,Armenian, Georgian XP Serif

MS Hei Simplified Chinese IE5, Global IME5 Monospace(Latin), Sans-serif(Chinese)

MS Song Simplified Chinese IE5, Global IME5 Monospace(Latin), Serif(Chinese)

SimSun Simplified Chinese XP Monospace(Latin), Serif(Chinese)

MingLiU Traditional Chinese 2000/XP Monospace(Latin), Serif(Chinese)

PMingLiU Traditional Chinese Office2000 Serif

Mangal Devanagari 2000/XP

Palatino Linotype Greek Poliytonic 2000/XP Serif

Shruti Gujarati XP

Raavi Gurmukhi XP

David Hebrew 2000/XP Serif

David Transparent Hebrew 2000/XP Serif

Fixed Miriam Transparent Hebrew 2000/XP Monospace

Miriam Hebrew 2000/XP Sans-serif

Miriam Fixed Hebrew 2000/XP Monospace

Miriam Transparent Hebrew 2000/XP Sans-serif

Rod Hebrew 2000/XP Monospace

MS Gothic Japanese 2000/XP Monospace(Latin), Sans-serif(Japanese)

MS Mincho Japanese 2000/XP Monospace(Latin), Serif(Japanese)

Tunga Kannada XP

Batang Korean 2000/XP Serif

Gulim Che Korean IE5, Global IME5 Monospace(Latin), Sans-serif(Korean)

Estrangelo Edessa Syriac XP

Latha Tamil 2000/XP

Gautami Telugu XP

MV Boli Thaana XP

Angsana New Thai 2000/XP Serif

- 6 -

Page 7: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Font family The principal character which it covers Procurement manner Sort

Cordina New Thai 2000/XP Sans-serif

IrisUPC Thai 2000/XP Sans-serif

Setting of Regional Options

PDF technologyPDF technology is another promotional feature of a multi-

lingual formatting. It is a medium that converts paper into adigital version. Paper print can not be transmitted as fast as adigital version via the Internet to anywhere across the world.The multilingual documents could be circulated by electronicmedia such as CD-ROM, Internet by using PDF Technology.

An important aspect is that the embedding of font outlinedata into PDF becomes possible.

Consider formatting Arabic, Hebrew and Thai, outputtingto PDF, PDFs created without embeded outline of font haveno meaning and they may not be circulated across the globe.The embedding of font outline in PDF is subsitantial.

XML and XSL Technology

XML and XSL are built on the multi language infrastruc-ture of Windows, Unicode, Font and PDF. They are mostsuitable technology for the conversion of multi languagedocuments into PDF.

XMLXML is the optimum method of expressing contents of

multilingual documents.・ XML adopts UTF-8 and UTF-16 Unicode encoding

schema. Therefore, Unicode text can be processed with-out any character code conversion.

・ In XML, a document file can be divided into many par-tial files. Graphics are independent from main docu-ments and are linked to the main document as externalfiles. Using this mechanism, when creating a document,one can store the text portion of different languages intoseperate files. Graphics in all languages can be used ascommon files. Finally, all these parts are integrated to-gether to form a complete document.

・ Microsoft Word is the only WYSIWYG editing soft-ware that can handle almost all the languages in theworld. There are plain text editors, such as WindowsNotePad, UniPad, that can handle multi language docu-ments. Consequently, such text editors can be used toedit XML files since XML files are plain text file.

XSLXSL is a specification that is designed for formatting XML

on paged media. XSL has designed as compatible with globallanguages from the following aspects:

XSL defines a set of objects for formatting, such as page,header area, footer area, side bar or footnote area, footnotecontents, before float, side float, block or paragraph level,character level, inline level, a list or an itemized statement,table, link.

- 7 -

Page 8: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

By specifying the properties (attribute values) for each ob-ject, the layout or qualification of each object can be designa-ted.

Font specificationLet us set up the typesetting font by specifying the proper-

ties of the font family in the paragraph object. Even whendata has been created with correct character codes, charactersmay not be displayed or the character shape may be switchedunder the wrong specification; it is very important to prop-erly specify the font’s family name for multilingual typeset-ting.

The font family name is specified from the names that ap-pear on the Windows menu. Examples for FO are as follows:・ font-family="MS Mincho"・ font-family="MS Gothic"・ font-family="Arial"・ font-family="Times New Roman"It is also possible to specify the generic font family name.

There are five such font families available: serif, sans-serif,cursive, fantasy, and monospace. Once the value for the fontfamily property is specified using a generic font familyname, XSL Formatter takes up the font name actually instal-led in the Windows that is in operation when typesetting. Thematching list of generic font families by language and actualfont names can be set up using Formatter Options → Lan-guage-Fonts,i18n / Language : Japanese in the XSL Format-ter menu

If the font family name is specified in the list as in the fol-lowing, then application of the font from the left is priori-tized. By using this feature, we can specify at one go theEuropean and Japanese fonts when a document consists of amixture of both European and Japanese texts.

Formatting sample of font-family<fo:blockfont-family="Arial, MS Gothic, sans-serif">English is Arial. 日本語はゴシックになります。

</fo:block>The following is the formatted result.

English is Arial. 日本語はゴシックになります。

Formatting mixed documnet of Japanese and Chi-nese languages

As Japanese, Traditional Chinese and Simplified Chineseare assigned the same codes for characters with the sameshapes in Unicode, it is important to note that when a docu-ment has a mixture of these three languages it is better to ex-plicitly specify the font family rather than the generic fontfamily.

font-family setting for Japanese and Chinese<fo:block><fo:inline font-size="12pt" font-family="MS Mincho">Japanese:浅 与</fo:inline>、

<fo:inline font-size="12pt" font-family="SimSun">Simplified Chinese:浅 与</fo:inline>、

<fo:inline font-size="12pt" font-family="MingLiU">Traditional Chinese:浅 与</fo:inline></fo:block>

This is formatted as follows.

Japanese:浅 与、 Simplified Chinese:浅 与、 Traditional Chinese:浅 与

Multilingual mixtureAnother issue is how to align font baselines when there is

a mixture of languages in the text. We have fonts with thebaseline at the bottom of the character (e.g. Latin characters),fonts with the baseline at the top (hanging baseline; e.g. Indiccharacters), and fonts of which the lower edge becomes thebaseline (kanji or Chinese characters). In XSL, the baselinecan be adjusted when these characters are positioned horizon-tally. Refer to Internationalized Text Formatting in CSS andXSL (Steve Zilles) for further details. The implementation ofa baseline adjustment feature is not yet completed in XSLFormatter.

- 8 -

Page 9: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Internationalization feature of XSLIn XSL, the default value of the line or character progres-

sion direction, etc., is the horizontal writing mode of Euro-pean texts, but other progression directions can be freelyspecified.Writing-mode

In XSL, the direction in which characters are written andthe progression direction of sentence lines can be specifiedby the writing-mode. However, the writing mode can bespecified only in the areas that are generated from the fol-lowing FO. For example, as we cannot write from right toleft by specifying the writing-mode for fo:block, we haveto place fo:block into fo:block-container.・ fo:simple-page-master・ fo:region-body・ fo:region-before・ fo:region-after・ fo:region-start・ fo:region-end・ fo:table・ fo:block-container・ fo:inline-container

If specified as writing-mode=“tb-rl,” Japanese and Chinesevertical writing modes can be specified. Also, if specifiedas writing-mode=“rb-tb,” languages with characters writ-ten from right to left such as Arabic or Hebrew can bespecified. If the writing-mode is specified on the page, forexample, the progression direction of a column in a multi-column can be specified and the rows of cells can be writ-ten from right to left if specified in the table.

UnicodeBIDI and fo:bidi-overrideDefining writing direction of characters in mixed multilin-gual text string is a more complex task. Unicode definesUnicodeBIDI specification to solve multilingual charactermixing problems. These problems are encountered when adocument contains both characters that are described fromleft to right (such as Latin alphabets or Japanesecharacters), and from right to left (such as Arabic or He-brew alphabets). UnicodeBIDI is adapted as fo:bidi-over-ride in XSL. Details of UnicodeBIDI and fo:bidi-overridewill be explained later in the chapter of 'Using Arabic Lan-guage'.

XSL Formatter and Multilingual TypesettingHereafter, is a brief explanation of how XSL Formatter

deals with multilingual typesetting issues.Glyph Substitution

It is necessary to use different glyphs to display or printthe same character code when both vertical and horizontaltypesettings are required such as the case of Japanese orTraditional Chinese languages. Microsoft Windows auto-matically changes glyphs on display when vertical writingis specified, PDFs that are created by Arobat become thesame as Windows display. PDF Output Option of XSLFormatter sets vertical writing features of TrueType/Open-Type fonts by its program. Arabic requires change ofglyph of the same character at the start, middle and end po-sitions in a word. In Thai, it is necessary to display two ormore characters of Unicode on the upper and lower sidesof each character. Microsoft Windows automaticallychanges glyphs on display when Arabic or Thai charactersare specified, PDFs that are created by Arobat become thesame as Windows display. PDF Output Option performsthe glyph replacement process for Arabic or Thai language.

Location of line breakingThe method for determining line breaks is different depend-ing on the language; multilingual typesetting softwaremust implement the method for line-break algorithm bylanguage. For example, barring exceptions like kinsokucharacters (characters that cannot begin or end a line), Jap-anese and Chinese languages can have line breaks betweenany characters. Since English and other European lan-guages have inter-word spacing, line breaks can occur be-tween words, although end-of-line irregularities may bedecreased by hyphenation. In contrast, there is no inter-word spacing in the Thai language, but the location of aline break is fundamentally that of a word break. It is nec-essary to change the method for determining the line-breakalgorithm by language in multilingual typesetting software.

Treatment of punctuationJapanese typesetting rule requires quite complicated treat-ment for punctuations. Some of full-width punctuationsshall be trimmed in the middle of line, XSL Formatter im-pliments punctuation trimming rule for full-width charac-ters by itself. And there are many prohibiting rules to place

- 9 -

Page 10: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

punctuations at the end or at the start of line. XSL Format-ter impliments Line Breaking Properties of Unicode.

HyphenationXSL specifies properties to define ON/OFF status of hy-phenation processing rules. XSL Formatter includes algo-rithms originated from TeX project. By preparing ahyphenation pattern dictionary written in XML, hyphena-tion in different languages will be effective. (Refer to theparagraph in English on "Formatting examples by majorlanguages" for examples of hyphenation settings.)

In XSL, attributes such as 'country' or 'language' can bespecified in fo:block, fo:character, and fo:page-sequence(also possible to specify at one go country-language byxml:lang). Because of these attributes, we can use a hy-phenation dictionary for each language in each paragraph.Further, "Hyphenologist" by Computer Hyphenation Ltd.is available as a option for XSL Formatter V2.5 Mainte-nance Release 2 or later. XSL Formatter Hyphenation Op-tion provides you with the capability to hyphenate 40 ormore languages.

Justification and Word SpacingJustification method shall be changed by language. Al-though word spacing may change slightly in English, wordspacing should not change in Arabic. For this reason, justi-fication of Arabic text can be achieved by inserting aglyph called Kashida between characters to control theword length. XSL Formatter impliments the function.

Beautiful TypesettingBeautiful typesetting in justified English paragraphs can'tbe made by just adjusting word spacing. Instead, it can beachievd by adjustment of line breaking point, font-stretchand letter spacing. Since XSL Formatter V2.5 is unable toperform this task at this moment, upgrade is considered inorder to make high-quality publishing possible.

Using Thai language

Among the standard fonts in Windows 2000, both Tahomaand Microsoft Sans Serif support the range of Thai charac-ters. If you go to Regional Options (Windows 2000) or Re-gional and Language Options (Windows XP) and add ‘Thai,’the following Thai fonts are additionally installed.

・ Angsana New・ AngsanaUPC・ Browallia New・ BrowalliaUPC・ Cordia New・ CordiaUPC・ DilleniaUPC・ EucrosiaUPC・ FreesiaUPC・ IriUPC・ JasmineUPC・ KodchiangUPC・ Lily UPCInput Thai language and try formatting. Use SC Unipad, a

Unicode text editor. In Unipad, the codes for the Thai lan-guage can be inputted by referring to the corresponding Uni-code code chart. Angsana New (16pt) was specified for Thaifor the example.Angsana New font family, 16 point size isspecified to Thai language

นี่อะไรคะ〔Translation〕What is this ?

หนังสือพิมพภาษาไทยครับ〔Translation〕Thai language newspaper.

There is no inter-word spacing in Thai. However, the line-break location is basically a word break. For this reason,check the word break by using a dictionary to determine theline-break location. In XSL Formatter V2.5, a feature thatcan automatically start a new line with a word-break by usingWindow’s Uniscribe is added. The following example showsthe start of a new line by locating the break in the word“school”Word[School]

โรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียน

- 10 -

Page 11: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

โรงเรียนโรงเรียนโรงเรียนโรงเรียนโรงเรียนThe following example shows that the start of a new line

by locating the break in the word “school”is changed if thevowel in the word is miss-spelled.

โรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยนโรงเรึยน

The following shows the sample of the muxture of Japa-nese and Thai.

ศ、สの後のรはしばしば発音されません。

動詞の前にการ kaan やคความ khwaam を付けると、

動詞が名詞化されます。

Using Arabic language

Let us now use Arabic. Among standard fonts in Windows2000, the following five fonts support the range of Arabiccharacters:・ Arial・ Courier New・ Tahoma・ Microsoft Sans Serif・ Times New RomanNote that Andalus, Arabic Transparent, Simplified Arabic,

Simplified Arabic Fixed, and Traditional Arabic, which areadded in Regional Options in Windows 2000, cannot be usedas the embedding of fonts is prohibited.

First, we have an example of a document that includes theopening of the United Nations’ Universal Declaration of Hu-man Rights in only Arabic. Since Arabic characters run fromright to left and this property is defined by the Unicode Data-

base, the section in Arabic will be written from right to leftby simply starting to write in Arabic.

Sample of Arabic<fo:block font-family="Tahoma" language="ar">Arabic (Omitted)</fo:block>

This is then formatted as in the following. Since the pro-gression direction of the text that includes this paragraph isset up for left-to-right writing, Arabic lines end up as left-jus-tified. Also, the period is located at the right edge.

لحقوق العالمي اإلعالناإلنسانالديباجة

جميع في المتأصلة بالكرامة االعتراف كان لمّاالمتساوية وبحقوقهم البشرية األسرة أعضاءفي والسالم والعدل الحرية أساس هو الثابتة

.العالمقد وازدراؤها اإلنسان حقوق تناسي كان ولما

.اإلنساني الضمير آذت همجية أعمال إلى أفضياعالم انبثاق البشر عامة إليه يرنو ما غاية وكانمن ويتحرر والعقيدة القول بحرية الفرد فيه يتمتع

والفاقة الفزع .In XSL-FO, we can change the direction of writing in the

middle of the region by specifying the writing-mode. As thewriting-mode can only be set up for regions that generate areference area, the paragraph in Arabic is put into a block-container. If writing-mode= “rl-tb” is specified for this block-container, then the entire region becomes set up as writtenfrom right to left, therefore the paragraph begins from theright. The period is also located at the left edge.

Sample of Arabic written from right to left<fo:block-container writing-mode="rl-tb" font-family="Tahoma" language="ar"><fo:block> Arabic Arabic Arabic</fo:block></fo:block-container>

- 11 -

Page 12: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

It is formatted as follows.

لحقوق العالمي اإلعالناإلنسان

الديباجةجميع في المتأصلة بالكرامة االعتراف كان لمّا

المتساوية وبحقوقهم البشرية األسرة أعضاءفي والسالم والعدل الحرية أساس هو الثابتة.العالم

قد وازدراؤها اإلنسان حقوق تناسي كان ولماالضمير آذت همجية أعمال إلى أفضيا

البشر عامة إليه يرنو ما غاية وكان .اإلنسانيالقول بحرية الفرد فيه يتمتع عالم انبثاق

.والفاقة الفزع من ويتحرر والعقيدةThe following shows the mixture of Arabic and English.

باب ab means either father or a father, and ابbāb either door or a door.

How to specify the progression direction in Multilin-gual Mixture Document

BIDI (bi-directional) document consists of text strings thatcontain mixtures of multilingual characters that flow fromright to left like Arabic and Hebrew and those that are com-posed from left to right like Japanese and English.

When characters of different progression directions are nes-ted, ambiguity arises. In order to overcome this problem, Uni-code defines BIDI processing algorithm of the character.This is mainly consists of a implicit rule based on characterproperties for writing direction and explicit control characterssuch as embedding or override-control characters.

XSL specifies fo:bidi-override function to be used for con-trol BIDI problem. UnicodeBIDI and fo:bidi-override func-tions is properly implemented in XSL Formatter. Thefollowing example provides more details.

In this case, parentheses bind the Arabic text within a fo:block.

<fo:block>(ضصش) ضصش ENGLISH</fo:block>Parentheses are neutral characters, i.e. a character without

directional properties. Generally, a neutral character is influ-enced by the directionality of the surrounding characters. Forexample, a parenthesis inserted between 'Left-to-Right' and'Left-to-Right' characters will adopt the 'Left-to-Right' prop-

erty, and a parenthesis inserted into 'Right-to-Left' and 'Right-to- Left' characters inherits the 'Right-to-Left' property.However, when a parenthesis is inserted between two othercharacters of opposite directional properties, the directionalproperty of the higher or surrounding level, in this case, thewriting-mode of fo: block is adopted.

Therefore, the example of fo:block is displayed as follows:

شصض( شصض ) ENGLISHOne of the methods which provents this is by using the

Unicode directional control characters (RLM, RLE). (4)

Example using RLM<fo:block>(ضصش) ضصش &#x200F;ENGLISH</fo:block>

Example using RLE<fo:block>&#x202B;(ضصش) ضصش &#x202C;ENG-LISH</fo:block>

The above two are displayed as follows.

ENGLISH )شصض( شصضSame results can be achieved by using fo:bidi-override.

Conclusion

There seems to be no product among current typesettingsoftwares that can process almost all main languages of theworld by only one version or one edition. Our objective re-mains to improve XSL Formatter to the point where it canachieve highquality output available for publishing purposeof all global languages. Please teach us your informed ideasthat make it possible to achieve this target.

(4) FO example uses Unicode LRO (U+202D) to describe character

flow, in order of input from left to right. When applied to Arabic

characters, which normally flow from right to left, these characters

will be forced to flow from left to right and thus apprear to be flow-

ing from the wrong direction when output is displayed.

- 12 -

Page 13: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Formatting Examples in Major Languages

Japanese

海に沈む島

ツバルは今

今、南太平洋に浮かぶ小さな島ツバルが、危機にさらされている。地球の温暖化で、最初に海に沈む島と想像されている。1997 年京都で環境に関する会議が開かれ 2008 年から 2012 年の間に先進国全体の温室効果ガスの排気

量を、1990 年の排気量と比較して 5%以上減らすことを義務つけた。

温暖化防止対策

チェック 事項 チェック 事項

エアコンの使用を減らす ごみを減らす

テレビを付けっぱなしにしない 水を出しっぱなしにしない

できるだけ車を使わず歩く 紙を再利用する

今、南太平

洋に浮かぶ小

さな島ツバル

が、危機にさ

る。地球の温

暖化で、最初

に海に沈む島

と想像されて

いる。一九九

七年京都で環

境に関する会

議が開かれ二

〇〇八年から

二〇一二年の

間に先進国全

体の温室効果

ガスの排気量

を、一九九〇

年の排気量と

比較して五%

以上減らすこ

とを義務つけ

た。

Hebrew

בים הטובע האי

"טובל"ב קורה מהשטובל נראה ,הארץ כדור התחממות בעקבות .סכנה בפני עומד ,הפסיפיק בדרום אשר "טובל" הקטן האי ,אלה בימים

ובה ,הסביבה באיכות הקשורים בנושאים שעסקה ועידה בקיוטו נערכה 1997 בשנת .בים לטבוע ביותר הקרוב האי הואחמישה בלפחות המתקדמות במדינות חמצני -הדו הפחמן פליטת שיעור את להוריד יש 2008-2012 :השנים בין כי נקבע

).1990 בשנת חמצני -הדו הפחמן פליטת לשיעור בהשוואה( אחוזים

הארץ כדור התחממות את למנוע כדי

פריטבדיקהפריטבדיקהאשפה פחות לייצרבמזגנים השימוש את להפחית

דולקת הטלוויזיה את להשאיר לאהזמן כל

במים לחסוך

ופחות ,יותר ללכת להשתדלבמכונית להשתמש

נייר למחזר

- 13 -

Page 14: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Arabic

・ Arabic is written from right to left. As for a character, its glyph changes ac-cording to the location of character in the word: start, middle, end.

البحر في الغوص

االن؟ توفاليو في يحصل ماذا

ال�ذي االول البلد تص�بح سوف ت�وفاليو بان ال�معتقد من .ال�عالمية االنظار ن�حوها تتجه ال�تي الصغيرة الج�زر م�ن ت�وفاليو ت�عتبر االن،ثان�ي كمية تقل�يل اقرار ت�م المؤتمر ه�ذا وفي .ال�بيئة مشاكل ح�ول كيوتو م�دينة في مؤتم�ر عقد ت�م 1997 عام ف�ي .البحر ف�ي يغوص.1990 بعام مقارتنا ،2012 الى 2008 عام من الفترة خالل %5 من اكثر بنسبة الجو في الكاربون اوكسيد

العالم حرارة ارتفاع لمنع

الفقرةالفحصالفقرةالفحص.القمامة من التقليل.الهواء مكيف استخدام من التقليل

بالماء االقتصاد.مفتوح التلفزيون ترك عدمالس���يارة م�ن ب���دال الس�ير ع�لى االع�تماد

.االمكان بقدرالورق استخدام اعادة

Thai

・ Phonogramic Thai language is displayed with 42 consonants of vowel and 32 voicepitch marks.

เกาะที่กําลังจะจม

เกาะตูวาลู...

เกาะเล็กๆที่อยูทางใตของทะเลแปซิฟกกําลังอยูในภาวะอันตรายตามการคาดคะเนแลว เกาะตูวาลูจะเปนประเทศแรกที่จมหายไปในทะเลจากสภาวะโลกรอน(GlobalWarming)จากการประชุมระดับโลกในดานปญหาสิ่งแวดลอมที่เกียวโตเมื่อปค.ศ.1997 ที่ประชุมไดมีมติใหประเทศพัฒนาแลวทั้งหมดลดปริมาณการระบายสารคาบอนไดออกไซดออกสูบรรยากาศใหไดมากกวา 5% ในระหวางปค.ศ.2008 ถึง ค.ศ.2012 เมื่อเทียบกับปริมาณของสารดังกลาวที่ระบายออกในปค.ศ.1990

การหลีกเลี่ยงสภาวะโลกรอน (Global Warming)

เครื่องหมาย รายการ เครื่องหมาย รายการลดการใชเครื่องปรับอากาศ ลดปริมาณขยะไมเปดโทรทัศนทิ้งไวโดยไมจําเปน ไมเปดน้ําทิ้งไวพยายามเดินแทนการใชรถยนต นํากระดาษมารีไซเคิลใชใหม

- 14 -

Page 15: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Traditional Chinese

沈下大海的島嶼

現在的圖華路(Tuvalu)島

現在、浮在南太平洋上的小島圖華路濱臨于極大的危機。由于地球溫暖化的影響、它可能會成為第一個沈下大海的島嶼。1997年在日本京都召開的有關環境的會議上、就自2008年至2012年之間所有先進國家的溫室效應氣體的排氣量、做出了履行與1990年排氣量相比至少減少5%義務的規定。

溫暖化防止措施

檢查 事項 檢查 事項

少用空調 減少垃圾

不要將電視機開 不管 不要發生長流水現象

儘量步行不用汽車 紙張再利用

現在、浮

在南太平洋上

的小島圖華路

濱臨于極大的

危機。由于地

球溫暖化的影

響、它可能會

成為第一個沈

嶼。1997

年在日本京都

召開的有關環

境的會議上、

就自2008

年至2012

年之間所有先

進國家的溫室

效應氣體的排

氣量、做出了

履行與199

0年排氣量相

5%義務的規

定。

Simplified Chinese

沉下大海的岛屿

现在的图华路(Tuvalu)岛

现在、浮在南太平洋上的小岛图华路滨临于极大的危机。由于地球温暖化的影响、它可能会成为第一个沉下大海的岛屿。1997年在日本京都召开的有关环境的会议上、就自2008年至2012年之间所有先进国家的温室效应气体的排气量、做出了履行与1990年排气量相比至少减少5%义务的规定。

温暖化防止措施

检查 事项 检查 事项

少用空调 减少垃圾

不要将电视机开着不管 不要发生长流水现象

尽量步行不用汽车 纸张再利用

- 15 -

Page 16: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Korean

바다 속으로 가라앉는 섬

투발루는 지금

남태평양의 조그만 섬나라인 투발루는 지금 바다에 잠길 위기에 처해 있다. 지구 온난 현상으로 인해 최초로 바다 속으로 사라질 것으로 보인다. 1997 년 교토에서 환경에 관한 회의가 열렸고, 이 회의에서 2008 년에서 2012년 사이에 선진국 전체의 온실 효과를 일으키는 가스의 배기양을 1990 년의 배기양에 비해 5% 이상 감소시키는 것을 의무화 하였다.

온난 현상 방지 대책

체크 사항 체크 사항

에어콘 사용을 줄인다 쓰레기를 줄인다

텔레비를 오래 켜두지 않는다 물을 절약한다

가능한 한 자동차를 이용하지 않고걷는다

종이를 재활용한다

English (Quoted from "The Chicago Manual of Style")

13.2

This chapter will describe some of the common problems that arise insetting technical material and will suggest ways in which these prob-lems can be solved or circumvented. It is intended for authors unfami-liar with techniques of typesetting and for copyeditors not blessed witha mathematical background. For more on typesetting and printing ingeneral see chapter l9.

13.3The advent of sophisticated phototypesetting systems, including bothphotomechanical and CRT systems, has revolutionized the setting ofmathematical copy in recent years. Many expressions and arrangementsof expressions that formerly were impossible or very difficult to set arenow relatively easy to achieve. Not every manuscript involving mathe-matical expressions is composed by such an advanced system, however,and authors and editors should have some idea what to expect of the par-ticular typesetting system employed for the manuscript in hand.

13.4Typesetting systems can be thought of as existing on four levels of so-phistication in mathematical capabilities.

<fo:block

hyphenate="true"

language="en">

It enables

hyphenation

function.

- 16 -

Page 17: Solution for Multilingual Literature by XSL Formatter · PDF fileSolution for Multilingual Literature by XSL Formatter May 2003 ... Problems in making multilingual literature ... Arabic,

Reference

Unicodehttp://www.unicode.org/

UniPadwww.unipad.org

UnicodeFontshttp://www.alanwood.net/unicode/fonts.html

Internationalized Text Formatting in CSS and XSLhttp://homepage.mac.com/thgewecke/.Public/SZillesPaper.pdf

FOPhttp://xml.apache.org/fop/index.html

TeX hyphenation dictionaryhttp://www.ctan.org/tex-archive/language/hyphenation/?action=/tex-archive/language/

World Scripthttp://www.omniglot.com

World Languagehttp://www.alc.co.jp/mlng/

Universal Declaration of Human Rightshttp://www.unhchr.ch/udhr/

- 17 -