
Note: This document (ver 3.4) is for developers' use only. Copyright C-DAC, GIST 2011-2012. All rights reserved.

Localization Guidelines


Contents

Summary (for Project Managers)
1 Introduction
1.1 Outline
1.2 Purpose of this Document
1.3 Scope
2 Background
3 Standards
3.1 Text Localization standards
3.1.1 Inputting multilingual data
3.1.2 Rupee Symbol
3.1.3 Displaying multilingual data
3.1.4 Storing multilingual data
3.2 Localization standards
4 Common Locale Data Repository (CLDR)
4.1 APPROACH
4.1.1 ANALYSIS
5 Frequently Used Entries for Localization (FUEL)
6 Generic Guidelines
6.1 Categorical Classification of Guidelines
7 Migration to Unicode
7.1.1 Compiling Unicode Applications in Visual C++
8 Data Encoding and Byte Order Markers
8.1 Tips and Tricks for C/C++/VC++
8.2 Encodings in Web Pages
8.2.1 Setting and Manipulating Encodings
8.2.2 HTML pages
8.2.3 ASP.Net
8.2.4 XML pages
8.2.5 Java
9 Directionality
10 Cascading Style Sheets (CSS)
10.1 How to Use @font-face
10.2 CSS rendering Issues in Indic Script
11 SQL Server 2005 and International Data: Using Unicode
12 ISO 639 Language Codes
13 References


Summary (for Project Managers)

This document discusses guidelines that help developers localize their software products and services. It will help new software developers use existing technologies together with the relevant initiatives and standards, and understand the complications that arise across scripts and platforms. A further benefit is better integration and interoperability of products. The document starts with the definitions of Localization (L10N), Internationalization (I18N), and Globalization. It also highlights some national and international standards currently used in the Localization industry. The topics covered include inputting, storing and rendering Indian language data; migration of legacy data to Unicode; use of the Common Locale Data Repository (CLDR); character encoding for proper representation on various platforms; Unicode; culture- and locale-specific information for Indian languages; guidelines for developing localized software applications and services; encoding and byte order markers for differentiating and manipulating files; directionality issues with right-to-left scripts such as Urdu; and the problems of using Cascading Style Sheets (CSS) with Indian languages.

Authoring, Internationalization and Localization are discussed in detail, with XLIFF as an example. Frequently Used Entries for Localization (FUEL), an open source initiative for term consistency, is cited in the context of Indian languages. ISO 639 language codes are provided at the end of the document. Some of the guidelines are generic in nature and may or may not be applicable to mobile platforms.


1 Introduction

1.1 Outline

This document defines general guidelines to help developers localize their software products and services. It starts with the definitions of Localization (L10N), Internationalization (I18N), and Globalization. Defining Localization creates a common understanding of the term and of the dos and don'ts of building a localized application. The document also discusses some of the national and international standards frequently used in the Localization industry. It is divided into topics including Localization / Internationalization standards, User Interface (UI), directionality issues, Frequently Used Entries for Localization (FUEL), resource identifiers, Common Locale Data Repository (CLDR), character encoding, Unicode, culture and locales.

1.2 Purpose of this Document

This document is intended for:

Software Designers / Engineers: To understand and evaluate Localization areas related to product design.

Testing and QA Engineers: To define test plans and test cases keeping Localization and

Internationalization in mind.

1.3 Scope

The scope of this document is to make software developers and designers aware of Localization so that they build applications and products with it in mind. These guidelines do not describe how to internationalize a product. The code snippets and examples that appear are meant to illustrate a concept and are not intended as a guide to implementation. Some of the guidelines are generic in nature and may or may not be applicable to mobile platforms.


2 Background

India is a multilingual country with 22 official languages and 12 different scripts. The official languages of India are Hindi, Marathi, Sanskrit, Bangla, Assamese, Punjabi, Gujarati, Malayalam, Maithili, Santali, Bodo, Dogri, Tamil, Oriya, Telugu, Kannada, Urdu, Sindhi, Kashmiri, Manipuri, Konkani and Nepali. The various scripts are: Devanagari, Gurmukhi, Bengali, Assamese, Gujarati, Kannada, Oriya, Telugu, Malayalam, Tamil, Perso-Arabic, Meitei Mayek, Ol Chiki and Roman [16].

This document discusses various Localization standards along with the basic needs required to localize a piece of text i.e. input, storage and display. Standards formulating organizations are ISO (International Organization for Standardization), W3C (World Wide Web Consortium) and OASIS (Organization for the Advancement of Structured Information Standards); here we will focus on the translation and Localization standards developed by the last two organizations.

ISO/IEC Guide 2:2004 [17] provides the following general definition of a standard:

a document, established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context.

Various standards and specifications used in Localization are: Unicode, XML Localization Interchange File Format (XLIFF), Translation Memory eXchange (TMX), Term Base eXchange (TBX), Segmentation Rules eXchange (SRX), Indian Script Code for Information Interchange (ISCII)[18], Perso-Arabic Script Code for Information Interchange (PASCII) [19] etc.

Four terms are widely used in the Localization domain. Together they are abbreviated as GILT, which stands for Globalization, Internationalization, Localization and Translation [20].

There are many definitions of GILT, but for the purposes of this document the following definitions will be used.

Translation is not just about text conversion but also deals with conversion of spoken words from source to target language, without losing its meaning.

Anastasiou & Schäler (2010) define Translation [21] as follows: Translation is the text transfer from a source to a target language; text is everywhere, in laws, news, academic dissertations, user manuals, advertisements and so on; the list is endless. Besides, we often see that text is accompanied by icons, diagrams, and other visual effects. For example, in newspapers as well as in user manuals and advertisements we have many pictures, animations, logos and so on. We often see that these icons change, when they are transferred to a target language.


LISA defined Localization as follows[22]:

Localization refers to the actual adaptation of the product for a specific market. It includes translation, adaptation of graphics, adoption of local currencies, use of proper forms for dates, addresses, and phone numbers, and many other details, including physical structures of products in some cases.

Schäler (2007), with emphasis on digital content, defines Localization as:

"Localization is the linguistic and cultural adaptation of digital content to the requirements and locale of a foreign market, and the provision of services and technologies for the management of multilingualism across the digital global information flow."

Globalisation deals with devising the strategy for selling products or services in foreign markets.

Sikes (2009) defines Globalisation [23] as "expansion of marketing strategies to address regional requirements of all kinds", while Localization is "engineering a product to enable efficient adaptation to local requirements".

Internationalisation [24]:

Internationalisation is the design and development of a product, application or document that enables easy Localization for target audiences that differ in region, culture, or language.

In short, internationalisation reduces the engineering effort needed to support multiple languages and cultures, and it keeps the product independent of any single locale. A locale is a specific geographical, political, or cultural region. It is usually identified by a combination of language and country; for example, en_GB represents the locale for English as used in the United Kingdom (the country code for the United Kingdom is GB, not UK).
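As an illustration of how a locale identifier combines a language code with a country code, the following Python sketch splits an identifier into its two parts. The names Locale and parse_locale are hypothetical and chosen only for this illustration; a real application would normally rely on the locale support of its platform or of a library such as ICU.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    """A locale reduced to its two most common parts (illustrative only)."""
    language: str          # language code, e.g. "hi"
    region: Optional[str]  # country code, e.g. "IN", or None if absent


def parse_locale(identifier: str) -> Locale:
    """Split an identifier such as "hi_IN" or "hi-IN" into language and region.

    This sketch handles only the language[_REGION] shape; script and
    variant subtags are deliberately out of scope.
    """
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    return Locale(language, region)


hindi_india = parse_locale("hi_IN")   # language "hi", region "IN"
tamil = parse_locale("ta")            # language "ta", no region
```

Keeping the language and the region separate lets an application fall back from a specific locale (Hindi as used in India) to the bare language when no region-specific data is available.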


3 Standards

The standards have been divided into two categories:

a) Text Localization standards

b) Localization standards

3.1 Text Localization standards

Localising a piece of text involves three basic steps:

1. Inputting multilingual data (keyboards)

2. Displaying multilingual data (fonts)

3. Storing multilingual data (Unicode)

3.1.1 Inputting multilingual data

There are three types of inputting mechanism for Indian languages, viz.:

3.1.1.1 INSCRIPT/Enhanced INSCRIPT

INSCRIPT: The INSCRIPT layout uses the standard 101-key QWERTY keyboard. The Indian language characters are divided into vowels and consonants. The vowels are placed on the left-hand side of the keyboard layout and the consonants on the right-hand side. Because the basic character set of the Indian languages is common, the mapping of characters remains the same for all the left-to-right languages. Enhanced INSCRIPT has four keyboard layers to accommodate more characters. The INSCRIPT layout can be used by urban as well as non-urban communities.


The following figure represents the INSCRIPT Layout.

3.1.1.2 Rupee Symbol

Since the introduction of the new Rupee symbol, several ad hoc, non-standard implementations have appeared on the web. The Rupee symbol must be placed on the keyboard as indicated in the figure above, and its storage value must be U+20B9.
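The following Python sketch shows what storing the Rupee symbol as U+20B9 means in practice. It uses only the standard unicodedata module; the helper format_rupees is a hypothetical name used for illustration and is not part of any standard.

```python
import unicodedata

# The Rupee symbol is stored as the single code point U+20B9,
# not as a font-specific glyph mapped onto some other character.
RUPEE_SIGN = "\u20B9"


def format_rupees(amount: int) -> str:
    """Prefix a whole-rupee amount with the Rupee symbol (illustrative helper)."""
    return f"{RUPEE_SIGN}{amount}"


official_name = unicodedata.name(RUPEE_SIGN)   # name assigned by Unicode
utf8_bytes = RUPEE_SIGN.encode("utf-8")        # three bytes in UTF-8
utf16_bytes = RUPEE_SIGN.encode("utf-16-be")   # two bytes in UTF-16

print(official_name, utf8_bytes.hex(), utf16_bytes.hex())
```

A non-standard implementation typically reuses the code of an existing character and relies on a special font to draw the Rupee glyph, so the stored text is wrong as soon as another font is used. Storing U+20B9 avoids this.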


3.1.2 Displaying multilingual data

Displaying multilingual data requires fonts. A font is a set of well-defined shapes used to display symbols (letters, punctuation marks, special characters of the language). An 8-bit font can represent 256 glyphs by giving a unique index (called the glyph index) and a name to each glyph/shape.

Indian scripts are complex, and because of this complexity the stored form of the text differs from its displayed form. Hence, legacy font standards such as the Indian Script Font Code (ISFOC) [25] were developed to display ISCII text; they require special software to render text stored in ISCII. The right approach is to use language-specific fonts and converters so that text is rendered according to the rules of the language. As computing environments evolved from 8-bit to 16-bit, ISCII was superseded by Unicode and TrueType fonts by OpenType fonts.

3.1.2.1 Open Type Font

OpenType is a cross-platform font file format developed jointly by Adobe and Microsoft[16].

Microsoft describes OpenType fonts as follows:

All the information controlling the substitution and relative positioning of glyphs during glyph processing is contained within the font itself. This information is defined in OpenType Layout (OTL) features that are, in turn, associated with specific scripts and language systems. [9]

The OpenType font format has the following advantages:

multi-platform support

support for international character sets

smaller file sizes

support for advanced typographic control

3.1.2.2 Rendering / Rasteriser Engines

Displaying multilingual data also requires rendering / rasteriser engines such as Uniscribe, Pango, ICU and HarfBuzz.

Uniscribe: Uniscribe is a Microsoft shaping and rendering engine that ensures proper representation of multilingual content on the Windows platform.

Pango: Pango is a library for laying out and rendering of text, with an emphasis on internationalization. Pango can be used anywhere that text layout is needed, though most of the work on Pango so far has been done in the context of the GTK+ widget toolkit. Pango forms the core of text and font handling for GTK+-2.x.

ICU: ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

Harfbuzz: HarfBuzz is an OpenType text shaping engine.

For further details, please refer to the following URLs:

ICU: http://site.icu-project.org/

Pango: http://www.pango.org/

HarfBuzz: http://www.freedesktop.org/wiki/Software/HarfBuzz

3.1.2.3 Sakal Bharati font

A single font which contains all the Indic scripts has been developed by C-DAC Pune. This font has a consistent look and feel across the Indic scripts as well as English. The following picture shows how different scripts are rendered on the screen.


3.1.3 Storing multilingual data

Multilingual data can be stored in Unicode.

Unicode consortium defines Unicode as

Unicode is the universal character encoding, maintained by the Unicode consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols.[4]

It is a superset of the character sets of the world's languages and also includes punctuation, special characters (shapes), currency symbols, mathematical symbols etc. [5]. Unicode can represent far more than 65,000 characters: its code space runs from U+0000 to U+10FFFF (1,114,112 code points) and is organised into blocks (code charts) for individual scripts and symbol sets.
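The following Python sketch illustrates what storing multilingual data in Unicode looks like. It lists the code points of a short Devanagari word and encodes it in UTF-8 and UTF-16. The sample word and the helper name code_points are assumptions made for this illustration.

```python
# The Devanagari word "Bharat", written with escapes so that the
# example does not depend on the encoding of this file.
TEXT = "\u092D\u093E\u0930\u0924"


def code_points(text: str) -> list:
    """Return the Unicode code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The same four code points occupy a different number of bytes
# depending on the encoding form chosen for storage.
utf8_bytes = TEXT.encode("utf-8")       # 3 bytes per Devanagari character
utf16_bytes = TEXT.encode("utf-16-le")  # 2 bytes per Devanagari character

# Decoding with the matching encoding restores the original text.
round_trip = utf8_bytes.decode("utf-8")

print(code_points(TEXT), len(utf8_bytes), len(utf16_bytes))
```

The code points are the same in every encoding form; only the byte representation changes. This is why the encoding of a file or database column has to be known before stored text can be read back correctly.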

3.1.3.1 Normalization in Unicode

Unicode data requires normalization, because in many cases the same character can be entered in more than one way using the Unicode code charts.

The utmost care must be taken over this when developing Unicode applications such as internationalized domain names (IDN).
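A minimal Python sketch of the normalization problem follows. It uses the Devanagari letter QA as an assumed example (it is not taken from this document): the letter can be entered either as the single code point U+0958 or as the letter KA (U+0915) followed by the nukta sign (U+093C). The helper name same_text is hypothetical.

```python
import unicodedata

PRECOMPOSED = "\u0958"     # DEVANAGARI LETTER QA as one code point
SEQUENCE = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings look identical on screen but differ as stored data.
raw_equal = PRECOMPOSED == SEQUENCE               # False
normalized_equal = same_text(PRECOMPOSED, SEQUENCE)  # True

print(raw_equal, normalized_equal)
```

Without normalization, a search, a database key lookup or a domain name comparison would treat the two spellings as different strings even though a user cannot tell them apart.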


3.2 Localization standards

The various standards in today's Localization world are: XLIFF, Segmentation Rules eXchange (SRX), Translation Memory eXchange (TMX), Term-Base eXchange (TBX), Global Information Management Metrics eXchange-Volume (GMX-V), XML Text Memory (xml:tm).

1. XLIFF: An XML-based intermediate format used to store, carry and interchange localizable data. According to the XLIFF specification [13]: XLIFF is the XML Localization Interchange File Format designed by a group of software providers, Localization service providers, and Localization tools providers. It is intended to give any software provider a single interchange file format that can be understood by any Localization provider.

2. SRX: An XML vocabulary of rules for breaking text into translatable segments (smaller fragments). SRX is defined in two parts: the language rules, which specify the rules applicable for each language, and the map rules, which specify how the rules are applied for each language.

3. TMX: The standard for exchanging translation memory data between applications. It is divided into two parts: the translation unit and the segment of translation memory text.

4. TBX: TBX-Basic is a TBX-compliant terminology markup language for translation and Localization processes that permits a limited set of data categories. The purpose of TBX-Basic is to enhance the ability to exchange terminology resources between users.

5. GMX-V: Measures the workload of a given Localization job, not just by word and character count but also by counting exact and fuzzy matches, punctuation symbols etc. It can also be used to count the number of pages, screenshots etc.

6. xml:tm: A vendor-neutral open XML standard that allows text memory, including translations, to be embedded within XML documents using XML namespace syntax.


The standards described next follow the order of a Localization workflow, i.e. authoring, Internationalization and Localization, as shown in the following diagram:

Fig. Authoring, internationalization, and Localization standards. Authoring: DITA (OASIS). Internationalization: ITS (W3C). Localization: XLIFF (OASIS).

At the authoring stage, the standard that stands out is DITA (Darwin Information Typing Architecture), managed by OASIS. ITS (Internationalization Tag Set) is the internationalization standard by W3C, while XLIFF is a standard which carries Localization data and metadata under the umbrella of OASIS.

DITA is an XML-based specification which processes modular and extensible topic-based information types as specializations of existing types. It allows for the introduction of specific semantics for specific purposes. An example of DITA is shown below:

(1) DITA example

Title: Installing a hard drive

Short description: You open the box and insert the drive.

Index terms: hard drive, disk drive

Step 1: Unscrew the cover. Result: The drive bay is exposed.

Step 2: Insert the drive into the drive bay. If you feel resistance, try another angle.

The example is retrieved from the DITA OASIS online community portal: http://dita.xml.org/what-do-info-typed-dita-topic-examples-look, 10/06/12.

In the above example, the metadata tag stores the meta information for keywords, index terms and audience type. The steps and the step results follow. There is also a provision to link to other DITA files.

After the Authoring stage, the next stage is Internationalization. Internationalization reduces the engineering effort spent for various languages and cultures. Directionality (RTL or LTR), which is very important for internationalization, is supported by the ITS specification. The ITS specification consists of data categories, each of which is a set of elements and attributes. Perhaps the most important data category is Translate, as it expresses whether the content of an element or attribute should be translated or not. The values of this data category are yes (translatable) or no (not translatable). The selection of XML node(s) may be explicit (using the local approach) or implicit (using global rules). An ITS example using the local approach is given below:

(2) ITS example

An example article

John Doe

[email protected]

This is a short article.

Here there is no explicit metadata tag, but metadata is still available, such as title, author, person's name (first name and surname), affiliation, address, and contact e-mail. None of the author's information should be translated, as indicated by the data category setting translate="no".
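The local approach of the Translate data category can be sketched in Python as follows. The sample XML is an assumption: the element names article, title, author, firstname, surname and para are hypothetical, since the markup of example (2) is not reproduced here. Only the ITS namespace and the translate attribute with its yes/no values come from the ITS specification. The function name translatable_text is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"   # qualified name of its:translate

# Hypothetical document: the author block is marked as not translatable.
SAMPLE = """<article xmlns:its="http://www.w3.org/2005/11/its">
  <title>An example article</title>
  <author its:translate="no">
    <firstname>John</firstname>
    <surname>Doe</surname>
  </author>
  <para>This is a short article.</para>
</article>"""


def translatable_text(element, inherited=True):
    """Collect element text that should be sent for translation.

    An explicit its:translate value applies to the element and is
    inherited by its children; content is translatable by default.
    Text following a child element (tail text) is ignored for brevity.
    """
    flag = element.get(TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")
    found = []
    if translate and element.text and element.text.strip():
        found.append(element.text.strip())
    for child in element:
        found.extend(translatable_text(child, translate))
    return found


segments = translatable_text(ET.fromstring(SAMPLE))
print(segments)
```

The author's name is skipped because the no value set on the author element is inherited by the elements inside it, which is the behaviour the local approach relies on.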


A basic minimal XLIFF file (3) follows:

(3) Minimal XLIFF file with one translation unit (TU)

Thankyou

Danke

On the first line we have the XLIFF declaration and also the schemaLocation attribute of the XML schema-instance namespace. In the file element we have the name of the file (Greetings.txt), its source language (English (US)), and its data type (plain text). Then the translation unit element follows with its source and target text in the source language (SL) and target language (TL), respectively.

XLIFF also allows the storage of important data for (software) localisation; an example is the restype attribute. Among its values are checkbox, cursor, dialog, hscrollbar (horizontal scrollbar), icon, menubar (a toolbar containing one or more top-level menus), and window. An example of a dialog resource type follows:

(4) XLIFF TU of a dialog

About Dialog

In example (4), we see metadata about font, style, and coordinates. This is specific to the dialog resource type. When metadata is about the file and the localisation process in general, it is included in the header element. An example of the header element follows:

(5) Process metadata in header element
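To make the description of the minimal XLIFF file in example (3) concrete, the following Python sketch builds such a file with the standard xml.etree.ElementTree module. It is an illustration under stated assumptions, not the original example: it assumes XLIFF version 1.2 and its namespace, uses "1" as a hypothetical translation unit id, and omits the schemaLocation attribute. The function name build_minimal_xliff is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: XLIFF 1.2, whose namespace is given below.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)  # serialize it as the default namespace


def build_minimal_xliff(source_text: str, target_text: str) -> str:
    """Return an XLIFF document with a single translation unit."""
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    # The file element names the original file, its source language
    # and its data type, as described for example (3).
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": "Greetings.txt",
        "source-language": "en-US",
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    # Hypothetical id; each translation unit needs one.
    unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": "1"})
    ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = source_text
    ET.SubElement(unit, f"{{{XLIFF_NS}}}target").text = target_text
    return ET.tostring(xliff, encoding="unicode")


document = build_minimal_xliff("Thankyou", "Danke")
print(document)
```

The source and target elements sit side by side in one translation unit, which is what allows a Localization tool to present the original string and its translation together.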


4 Common Locale Data Repository (CLDR)

The CLDR provides key building blocks for software to support the world's languages. The data in the repository is used by companies for their software internationalization and localization: adapting software to the conventions of different languages for such tasks as formatting of dates, times, time zones, numbers, and currency values; sorting text; choosing languages or countries by name; and many others. CLDRs provide useful information about the locale and are therefore crucial from the perspective of localization. For the purpose of clarity, CLDRs may be reduced to two sub-types: Basic and Exhaustive.

The Basic CLDR comprises:

Calendars

Numeric formats

Date and Time formats

Currencies

These are used not only by operating systems to show the time and date and to perform conversions, but also for more extensive functions such as inserting the date, time, currency etc. in the locale of the country.

Advanced or Exhaustive CLDRs cover, in addition to the above, items such as:

Weight systems

Distance systems

Location

Modes of Address

Language Selection


Oral Pronunciation (tied to the Voice Browser)

A majority of CLDRs use LDML for mark-up.

4.1 APPROACH

Existing CLDR templates (for the major part from the Unicode site) were analysed from the point of view of

their compliance to Locale data. Experts were invited to give their comments on the tags within the CLDR

and whether the existing tags were sufficient. The results of the analysis of CLDR are given below.

4.1.1 ANALYSIS

This analysis of CLDRs is divided into two parts:

Basic CLDRs and Advanced CLDRs

4.1.1.1 BASIC CLDR

The repository contains information on time, date, year, currency and weight. As far as the basic CLDRs are concerned, it was noted that a majority were geared towards the Western model. The gaps in each category of locale data are analysed below:

YEAR

The year normally follows the Gregorian calendar. In fact even the CLDR for Hindi and other Indian languages complies with the Western norm, with the month names merely transliterated into the local language.

For true representation (as in the case of Chinese), it would be advisable to adopt the Indian luni-solar calendar and to calculate the year as Vikram Samwat for Gujarati, Dogri, Marathi and Konkani and, to a certain extent, for Sindhi. In the case of Gujarati, the Parsi calendar, in which each day of the month has a separate name, would require a very specific CLDR. The calendar for Urdu and Kashmiri, on the other hand, could conform to the Muslim calendar.

DATE

The same considerations apply to the date, whose notation in Indian languages varies and can be any of:

DD MM YYYY

DD MM YY

MM DD YYYY

MM DD YY

TIME

Time is expressed in hours, minutes and seconds. Traditional time calculation in India, especially in astrology, is however based on ghatikas and palas.

Even with the hour-minute-second notation, a 12-hour cycle is the norm; the 24-hour notation is maintained mainly by the Indian railways and airlines.
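The date orders and the 12-hour and 24-hour notations discussed above can be sketched in Python as follows. This is only an illustration using the standard datetime module; the names DATE_PATTERNS, format_date and format_time are hypothetical, and the space separator simply mirrors the way the notations are written above. A production system would take such patterns from locale data rather than hard-coding them.

```python
from datetime import datetime

# Map each date notation to the corresponding strftime pattern.
DATE_PATTERNS = {
    "DD MM YYYY": "%d %m %Y",
    "DD MM YY": "%d %m %y",
    "MM DD YYYY": "%m %d %Y",
    "MM DD YY": "%m %d %y",
}


def format_date(moment: datetime, notation: str) -> str:
    """Format a date in one of the notations listed in DATE_PATTERNS."""
    return moment.strftime(DATE_PATTERNS[notation])


def format_time(moment: datetime, use_24_hour: bool = False) -> str:
    """Format a time in 12-hour notation by default, or 24-hour on request."""
    if use_24_hour:
        return f"{moment.hour:02d}:{moment.minute:02d}:{moment.second:02d}"
    hour = moment.hour % 12 or 12            # 0 and 12 are both shown as 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{hour:02d}:{moment.minute:02d}:{moment.second:02d} {suffix}"


sample = datetime(2012, 6, 10, 15, 4, 5)
print(format_date(sample, "DD MM YYYY"), format_time(sample))
print(format_date(sample, "MM DD YY"), format_time(sample, use_24_hour=True))
```

The same instant yields visibly different strings under each notation, which is why the date order and the hour cycle have to be part of the locale data and cannot be assumed by the application.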

NUMBERS

The case of numbers is unique: numbers in Indian languages have their own grammar, and a Num2Word routine necessarily requires a deep study of the various ways of representing numbers in Indian languages. This has been an area of intensive study, and the results of the number notation as represented in a Num2Word routine are provided at the end of the chapter for Marathi, Konkani, Gujarati, Urdu, Sindhi, Dogri and Kashmiri.
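As a rough illustration of the first step of a Num2Word routine, the following Python sketch breaks a number into the units of the Indian numbering system (crore, lakh, thousand, hundred). The use of this numbering system, the English unit names and the function name decompose are assumptions made for this illustration and are not taken from the study mentioned above. A full routine would additionally need the language-specific words for each count and each unit.

```python
# Units of the Indian numbering system, largest first.
INDIAN_UNITS = [
    (10_000_000, "crore"),
    (100_000, "lakh"),
    (1_000, "thousand"),
    (100, "hundred"),
]


def decompose(number: int) -> list:
    """Split a non-negative integer into (count, unit) pairs.

    The final pair, with an empty unit name, carries the remainder
    below one hundred. For very large numbers the crore count itself
    may exceed 99 and would need to be decomposed again.
    """
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    parts = []
    for value, name in INDIAN_UNITS:
        count, number = divmod(number, value)
        if count:
            parts.append((count, name))
    if number or not parts:
        parts.append((number, ""))
    return parts


print(decompose(12345678))
```

Each pair can then be rendered with the words of the target language, which is where the grammar of numbers in each Indian language comes into play.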


CURRENCY

The normal currency units are Rupees (Rs) and Paise. In common parlance, however, there are further subdivisions into a quarter, a half and three quarters. Should these be represented, as is done in the US, where the word dime is acceptable?

WEIGHTS MEASURES AND DISTANCE

Although India has adopted the Metric System to be on par with the world, in practice the old weights continue to be used. Should these be included in the CLDR where weights and measures are analyzed? A similar problem arises in the case of distance, where the old foot and mile system is still partly used. Similarly, land is still measured in traditional measures. This is especially important for localizing land records, where each state has its own traditional terms for measurement, such as Guntha, Bigha etc.

The items above show the complexity of evolving a basic CLDR for Indian languages. The more advanced aspects, which are normally not adopted in the standard CLDR, concern socio-cultural matters and are analyzed next.

4.1.1.2 ADVANCED CLDR

This embraces socio-cultural aspects such as icons, images, symbols, myths, beliefs, geographical entities, customs and traditions, and festivals. Another aspect is name representation, where the traditional FIRST NAME, FATHER'S NAME and SURNAME are not pertinent, especially in the South, where such features are replaced by patronymics. Similarly, the notation of location, especially in addresses, is not easy to reduce to a single format. Different cultures change the order from ascending to descending, as is the case in Iran and, to a certain extent, in parts of Bhuj in Gujarat, where the country is placed first and the addressee last. Honorifics and terms of respect and address also form an important part of a CLDR which, to be totally exhaustive, has to embrace every aspect of the culture of the locale and render it transparent to the user.

Since exact formats in LDML for these different entities are still under development, they have not been studied in depth.

COMPLIANCY

The CLDRs that have been analyzed, studied and/or developed (as in the case of Dogri) are all compliant with the Western notational system. They have been implemented by both IBM and Microsoft in their operating systems and, within these limits, can be termed perfectly compliant.

However, the term compliancy implies a norm or standard and, going by the above discussion, the CLDRs do not reflect in any manner the cultural and ethnic thought processes of the languages under study. A closer look at this problem, which is vital especially in the area of e-governance localization, is a must.

In what follows, two sample CLDRs are presented, one for Dogri and the other for Urdu. Both are in the traditional framework, although the Urdu CLDR proposes, like the Arabic one, a dual system: Gregorian and Muslim. The results of the study undertaken on Num2Word conversion are also provided for all the languages under study, as a tabular representation of the basic numerals from one to the highest acceptable number.

Further details can be found at: http://cldr.unicode.org/


5 Frequently Used Entries for Localization (FUEL)

FUEL is an open source initiative to standardize terms for open source software programs. It aims to resolve the problem of inconsistent and non-standardized terminology in computer software translation across platforms, and works to provide a standard, consistent terminology for each language. Support for the following languages has been added to this initiative.

Languages with FUEL [31]:

Assamese (as)

Bengali (India) (bn_IN)

Bhojpuri (bho)

Chhattisgarhi (hne)

Gujarati (gu)

Hindi (hi)

Kazakh (kaz)

Magahi (mag)

Maithili (mai)

Malayalam (ml)

Marathi (mr)

Punjabi (pa)

Oriya (or)

Tamil (ta)

Telugu (te)

Urdu (ur)

Kannada (kn)


6 Generic Guidelines

i. Use Unicode as your character encoding to represent text.

ii. Use the latest version of Unicode and Unicode-compliant OpenType fonts during the design and deployment of Indian language applications (Refer: http://unicode.org).

iii. Use the Enhanced INSCRIPT keyboard for input.

    i. The drawback of phonetic / transliteration-based layouts is that they are useful only for users who know English.

    ii. The drawback of typewriter layouts is that they are preferred only by operators migrating from physical typewriters.

    iii. INSCRIPT is easy to learn and use, especially for non-English speakers.

    iv. Major operating systems such as MS-Windows and Linux support INSCRIPT by default.

    v. It also caters to the latest additions such as the Rupee symbol.

    vi. Refer: http://cdac.in/downloads

iv. Isolate all user interface elements from the program source code. Move all localisable resources to separate resource-only DLLs. Localisable resources include user interface elements such as strings, error messages, dialog boxes, menus, and embedded object resources. Resources which are not localisable should not be put into the resource-only DLLs. Refer: http://msdn.microsoft.com/en-us/library/w7x1y988.aspx

v. Use the same resource identifiers throughout the life of the project. Changing identifiers makes it difficult to update localised resources from one build to another. Refer: http://www.lingobit.com/solutions/bestpractices.html

vi. Allow plenty of room for the length of strings to expand in the user interface. In some languages, phrases can require 50-75 percent more space than they need in other languages. For example, dialog boxes may expand due to localization: a large dialog box that occupies the entire screen in low-resolution mode may have to be resized to an unusable size when localised. Refer: http://msdn.microsoft.com/en-us/library/w7x1y988.aspx

vii. UI controls such as buttons or drop-down lists should not be placed on top of other controls. Sizing and hotkey issues with hidden controls are usually found only through testing, which might not be done during localization. In such a layout the UI is not localisable, because the button size cannot be extended to the length required for the translation without rearranging the button positions. Rearranging button positions can be costly and makes the UI inconsistent among languages. Refer: http://msdn.microsoft.com/en-us/library/aa163857(v=office.10).aspx

viii. Use the decimal separator appropriate to the locale, since it varies from locale to locale.

ix. Avoid images and banners that contain text: for the Indian language version the text needs to be translated as well, and a language switch cannot change text inside images.

x. For desktop applications, the icon, icon text and title should be in the local language.

For applications requiring RTL support, it is recommended that the HTML attribute 'dir' be specified with RTL as its value, e.g.

'...' [in Persian].

That would yield this result:

xi. Application surface localization is useful where the end user only consumes information and processing / computing with Indian language data is not critical.

   i. Examples of surface localization are the printing of statements of accounts, bills and passbooks.


xii. Use clear, concise, and grammatically correct language. Ambiguous words, obtuse or highly

technical sentences, and grammatical mistakes increase translation time and costs.

xiii. When using abbreviations and acronyms, ensure that the abbreviations and acronyms have

meanings that are understood by most users. You should always define abbreviations and

acronyms that might not be obvious in all languages.

Refer: http://msdn.microsoft.com/en-us/library/aa163854(v=office.10).aspx

xiv. Avoid using images and icons that contain text in your application. They are expensive to

localise.

xv. Avoid strings that contain a preposition and a variable. They are difficult to localise because, in some languages, different prepositions are used in different contexts.

Example code      Better code
At %s             Time: %s
At %s             Date: %s
At %s             Location: %s

xvi. Automated translation tools can significantly cut down on localization vendors' costs, but they only work if standard phrases are being used. Many localization vendors are paid per word. Consider the amount of money that can be saved if one standard phrase can be translated easily, or automatically, into multiple languages. For example, the following messages could be standardized into one consistent message:

Message                                 Standardized version of message
Not enough memory                       There is not enough memory available.
There is not enough memory available    There is not enough memory available.
Insufficient Memory!                    There is not enough memory available.

xvii. Different languages often have different punctuation and spacing rules. Consider these differences when writing strings in code. For example, if a string containing a decimal point is constructed at run time, the localiser cannot change the point to a comma. For similar reasons, apply these considerations to numbers, dates, or any other information that might have different formats in other languages. Refer: http://msdn.microsoft.com/en-us/library/aa163854(v=office.10).aspx

xviii. Use of numerals / digits

   i. For applications involving number crunching, such as banking applications, billing applications and statistics, it is recommended that English numerals / digits be used, unless mandated otherwise by the agency commissioning the project. This is because most programming environments, spreadsheets and databases do not support computing with Indian digits, which are treated as characters rather than numerals.

   ii. For applications such as digital data preservation, Indian digits may be considered.


xix. Avoid using compounded variables

In the following example, to translate the preposition "on" correctly, you might have to ask the developer what the variables stand for.

Example String: %I:%M%p on %A, %B %d, %Y

Explanation from developer:

%A    Full weekday name
%B    Full month name
%d    Day of month as decimal number (01 - 31)
%I    Hour in 12-hour format (01 - 12)
%M    Minute as decimal number (00 - 59)
%p    Current locale's A.M./P.M. indicator for 12-hour clock
%Y    Year with century, as decimal number

xx. Use the Microsoft-specified resource file naming convention.


xxi. Use unique variable names

If the same variable name is used for different variables (for example, if the sequence of the variables is hard coded), the word order in the translated sentence might be wrong, because word order differs from language to language.

Example code                  Better code
Set created on %s at %s       Set created on %1 at %2
Backup of %s on %s at %s      Backup of %1 on %2 at %3
Printing %s of %s on %s       Printing %1 of %2 on %3

xxii. Do not hardcode strings or user interface resources.

xxiii. Allocate text buffers dynamically since text size may expand when translated.

xxiv. Avoid composite strings

The strings shown in the following table cannot be localized unless the localizer knows what type of object or item the variables stand for. Even then, localization might be difficult: the value of the variable might require a different syntax; the article "the" may have several forms in another language (in German: "der, die, das, dem, den, des", much as English has "a" or "an"); the adjectives might change according to the gender of the word; or other factors may apply. Using composite strings increases the chances of mistranslation. These localization problems can be eliminated by writing out each message as a separate string instead of using variables.

Examples of composite strings

Are you sure you want to delete the selected %s?
%s drive letter or drive path for %s.
Are you sure you want to delete %s's profile?
Cannot %s to Removable, CD-ROM or unknown types of drives.
A %s error has occurred %sing one of the %s sectors on this drive.

xxv. Unused strings and dialog boxes should be removed from the source, so localisers do not

waste time localising them.

xxvi. Avoid using controls within a sentence. You might want to place a UI control within a sentence.

For example, you might want to give users a drop-down menu to make a choice within a

sentence. This practice is not recommended, because to localise a sentence that includes UI

controls, the localiser often has to either change the position of the controls (if possible) or be

content with an improper sentence structure. Also, the UI controls are often drop-down

combo boxes that are comprised of multiple controls. Moving and aligning these can be error-

prone.

xxvii. Test localised applications on all language variants of Windows XP (except Oriya). If your application uses Unicode, as recommended, it should run fine with no modifications. If it uses a Windows code page, you will need to set the culture/locale to the appropriate value for your localised application and reboot before testing. Refer: http://msdn.microsoft.com/en-us/library/aa291552(v=vs.71).aspx


xxviii. Cultural and Policy Making Issues

i. Avoid slang expressions, colloquialisms, and obscure phrasing in all text. At best,

they are difficult to translate; at worst, they are offensive.

ii. Avoid maps that include controversial regional or national boundaries. They are

a notorious source of political offense.

iii. Avoid images in bitmaps and icons that are ethnocentric or offensive in other

cultures/locales. Refer: http://www.lingobit.com/solutions/bestpractices.html
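Guidelines xv, xxi and xxiv above can be illustrated with a minimal Python sketch. The catalogue, its keys and the Hindi strings are hypothetical and have not been reviewed by a translator; they only show why label-style strings, named placeholders and complete per-item messages are easier to localise than strings assembled from anonymous %s variables.

```python
# Hypothetical message catalogue; keys and translations are illustrative only.
CATALOGUE = {
    "en": {
        # Label style ("Time: %s") instead of a preposition plus variable ("At %s").
        "time_label": "Time: {value}",
        "set_created": "Set created on {date} at {time}",
        # One complete sentence per object type instead of "... the selected %s?"
        "confirm_delete_file": "Are you sure you want to delete the selected file?",
        "confirm_delete_folder": "Are you sure you want to delete the selected folder?",
    },
    "hi": {
        "time_label": "समय: {value}",
        # Named placeholders let the translator change the word order.
        "set_created": "{time} बजे {date} को सेट बनाया गया",
    },
}


def message(locale_code, key, **values):
    """Return the complete message for a locale with its named placeholders filled in."""
    return CATALOGUE[locale_code][key].format(**values)


print(message("en", "set_created", date="05-03-2012", time="14:07"))
print(message("hi", "set_created", date="05-03-2012", time="14:07"))
print(message("en", "confirm_delete_folder"))
```

Because each placeholder is identified by name rather than by position, the translated template can place the time before the date without any change to the calling code.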


6.1 Categorical Classification of Guidelines:

Sr. No.    Category    Sub Category    Guideline

1    Translatable Components

Textual Objects

Fixed: textual objects which should not get translated, e.g. User Name, Group Name, Password, system or host names, keyboard accelerators.

All fixed text should be kept separate from the resource file and should not be translated.

Message: translated information displayed to the user by the product.

All messages should be reviewed for slang terminology, message fragments, and other criteria that make messages difficult to translate.

All translatables should be separate from the source and placed in a locale-specific location.

If translated items do not exist for a locale, correct default values should be used.

Allow plenty of room for the length of strings to expand in the user interface. In some languages, phrases can require 50-75 percent more space than they need in other languages.

When variables are used in text strings, extra space (at least one line per variable) should be provided.

Controls should not be placed within sentences.

When possible, it is best to put text labels above UI controls such as edit boxes. This positioning allows for the greatest extension of the text field.

Avoid inline CSS with values.

Never use absolute positions.

UI controls such as buttons or drop-down lists should not be placed on top of other controls.

Text on a button should never be dynamically linked onto the button from a string variable but should be placed on the button itself as a property of the button.

Other: translated help files, documentation.

Help files and documentation should be available for different locales.

Non-Textual

Icons, Images, Colors, Sounds, etc.

1. These items should be culturally neutral as far as possible. Avoid culture-specific examples, depictions of body parts, gender-specific roles, religious references, political symbols, and text in graphics.

2. If these items are not culturally neutral, all non-textual items should reside in locale-specific directories and should be separate from the source product, i.e. the items should not be compiled or hardcoded into the product and should be configurable.

3. If these items are not culturally neutral and an item has not been translated for a specific locale, the correct default item should appear.

4. If these items are not culturally neutral and any textual messages appear in icons or images, the text should be easy to separate from the icon or image, and these messages must be translated separately from the icon.

5. If these items are not culturally neutral, the tools which created the items should be easily available in the other places where the localization of these items will take place.

2    Cultural Data

Time, Date and Calendar: system or user log files or text windows; display of time and date information in files; calendar functions; display of chronologically sorted data; display of user accesses to the system.

Time, date and calendar information should be displayed in a locale-specific format.

Numbers, Currency, and Metrics

Numbers should be displayed in the locale-specific format.

The currency symbol should be displayed in the locale-specific format.

Currency strings should be formatted according to locale conventions.

If the international three-letter currency code is used, the correct code for the locale should be used.

Collation: collation refers to the order in which characters are sorted.

If the sort order changes between locales, data must be sorted according to the rules of the locale.

Personal Names, Honorifics, etc.

The formats for personal names, honorifics and salutations should be configurable for each locale.

The formats for postal addresses and telephone numbers should be configurable for each locale.

Characters and Strings

The application should handle Unicode and convert correctly between Unicode and the native encoding of the platform, i.e. non-ASCII filenames and pathnames should be handled correctly.

Filesystem I/O

Can non-ASCII characters be saved to and loaded from the file system correctly, even when no byte order marker is specified?

3    Text (Writing System) Foundation

Transfer Encoding: wherever data is shared between applications, such as mail, Internet browsing, etc., or wherever data is displayed.

If the application transfers data through protocols or networks that strip the 8th bit, the application should encode, decode, and display the data correctly.

If the product includes a help browser or web browser:

1. The browser should load files or pages in different encodings.

2. The browser should be able to properly display files or pages with multibyte text.

3. Once a file is loaded, is it possible to display it in another encoding?

4. Bookmarks and menus should show filenames with multibyte text correctly.

Input

The application should correctly accept, parse and display non-ASCII input in all places where it is appropriate to enter non-ASCII text.

As non-ASCII text is being entered, the backspace key should correctly delete the non-ASCII text.

The product should have a mechanism to input the characters used for various languages.

Shortcut key combinations should be accessible on international keyboards.

Output: labels, menu items, text areas, canvases, HTML pages, etc.

All areas of the application should display all characters within the current locale.

The application should use the default fonts or, if explicit fonts are named, the fonts should be stored in an external resource file that can be configured for each locale.

Characters from the users' native languages should print correctly.

The application should work correctly on localized editions of the operating systems that the product supports.
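The "Numbers, Currency, and Metrics" guidelines in the table above, together with the advice on Indian digits in the Generic Guidelines, can be sketched in Python as follows. This is a minimal illustration, not a complete number formatter: the helper names are hypothetical, only non-negative integers are handled, and the grouping shown is the lakh/crore convention (last three digits, then groups of two).

```python
import unicodedata


def to_ascii_digits(text):
    """Replace any Unicode decimal digit (e.g. Devanagari) with its ASCII digit."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(ch if value is None else str(value))
    return "".join(out)


def to_devanagari_digits(text):
    """Replace ASCII digits with Devanagari digits (U+0966 to U+096F) for display."""
    return text.translate({ord("0") + i: 0x0966 + i for i in range(10)})


def group_indian(number):
    """Group a non-negative integer as 12,34,567."""
    digits = str(number)
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


stored = int(to_ascii_digits("१२३४५६७"))      # compute with plain integers
shown = to_devanagari_digits(group_indian(stored))
print(stored, group_indian(stored), shown)
```

Keeping the stored value as a plain integer and converting only at the input and display boundaries is one way to compute with English digits while still presenting Indian digits where the commissioning agency requires them.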


7 Migration to Unicode

According to Microsoft, creating a new program based on Unicode is fairly easy. Unicode has a few features that require special handling, but you can isolate these in your code. Converting an existing program that uses code-page encoding to one that uses Unicode or generic declarations is also straightforward. Here are the steps to follow:

1. Modify your code to use generic data types. Determine which variables declared as char or char* are text, and not pointers to buffers or binary byte arrays. Change these types to TCHAR and TCHAR*, as defined in the Win32 file WINDOWS.H, or to _TCHAR as defined in the Visual C++ file TCHAR.H. Replace instances of LPSTR and LPCH with LPTSTR and LPTCH. Make sure to check all local variables and return types. Using generic data types is a good transition strategy because you can compile both ANSI and Unicode versions of your program without sacrificing the readability of the code. Don't use generic data types, however, for data that will always be Unicode or always stays in a given code page. For example, one of the string parameters to MultiByteToWideChar and WideCharToMultiByte should always be a code page-based data type, and the other should always be a Unicode data type.

2. Modify your code to use generic function prototypes. For example, use the C run-time call _tcslen instead of strlen, and use the Win32 API SetWindowText instead of SetWindowTextA. This rule applies to all APIs and C functions that handle text arguments.

3. Surround any character or string literal with the TEXT macro. The TEXT macro conditionally places an "L" in front of a character literal or a string literal definition. Be careful with escape sequences. For example, the Win32 resource compiler interprets L\" as an escape sequence specifying a 16-bit Unicode double-quote character, not as the beginning of a Unicode string.

4. Create generic versions of your data structures. Type definitions for string or character fields in structures should resolve correctly based on the UNICODE compile-time flag. If you write your own string-handling and character-handling functions, or functions that take strings as parameters, create Unicode versions of them and define generic prototypes for them.

5. Change your build process. When you want to build a Unicode version of your application, both the Win32 compile-time flag -DUNICODE and the C run-time compile-time flag -D_UNICODE must be defined.

6. Adjust pointer arithmetic. Subtracting char* values yields an answer in terms of bytes; subtracting wchar_t* values yields an answer in terms of 16-bit chunks. When determining the number of bytes (for example, when allocating memory for a string), multiply the length of the string in symbols by sizeof(TCHAR). When determining the number of characters from the number of bytes, divide by sizeof(TCHAR). You can also create macros for these two operations, if you prefer. C makes sure that the ++ and -- operators increment and decrement by the size of the data type. Or even better, use Win32 APIs CharNext and CharPrev.

7. Add code to support special Unicode characters. These include Unicode characters in the compatibility zone, characters in the Private Use Area, combining characters, and characters with directionality. Other special characters include the noncharacter U+FFFF, which can be used as a placeholder, and the byte-order mark U+FEFF together with its byte-swapped form U+FFFE, which can serve as flags that indicate a file is stored in Unicode. The byte-order mark is used to indicate whether a text stream is little-endian or big-endian. In plaintext, the line separator U+2028 marks an unconditional end of line. Inserting a paragraph separator, U+2029, between paragraphs makes it easier to lay out text at different line widths.

8. Debug your port by enabling your compiler's type-checking. Do this with and without the UNICODE flag defined. Some warnings that you might be able to ignore in the code page-based world will cause problems with Unicode. If your original code compiles cleanly with type-checking turned on, it will be easier to port. The warnings will help you make sure that you are not passing the wrong data type to code that expects wide-character data types. Use the Win32 National Language Support API (NLS API) or equivalent C run-time calls to get character typing and sorting information. Don't try to write your own logic for handling locale-specific type checking, because your application will end up carrying very large tables.
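The arithmetic in step 6 can be illustrated outside C with a small Python sketch. It assumes a Unicode build in which TCHAR is a 16-bit unit (so sizeof(TCHAR) is 2); the helper name is hypothetical. It also shows why a count of 16-bit units is not always a count of characters: a character outside the Basic Multilingual Plane occupies two units.

```python
SIZEOF_TCHAR = 2  # assumption: Unicode build, 16-bit code units


def utf16_units(text):
    """Number of 16-bit code units needed to store the text as UTF-16."""
    return len(text.encode("utf-16-le")) // SIZEOF_TCHAR


for sample in ("Hello", "नमस्ते", "\U0001F600"):
    units = utf16_units(sample)
    print(
        repr(sample),
        "code points:", len(sample),
        "UTF-16 units:", units,
        "bytes to allocate:", units * SIZEOF_TCHAR,
    )
```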

7.1.1 Compiling Unicode Applications in Visual C++

By using the generic data types and function prototypes, you have the liberty of creating a non-Unicode application or compiling your software as Unicode. To compile an application as Unicode in Visual C/C++, go to Project/Settings/C/C++ /General, and include UNICODE and _UNICODE in Preprocessor Definitions. The UNICODE flag is the preprocessor definition for all Win32 APIs and data types, and _UNICODE is the preprocessor definition for C run-time functions.


8 Data Encoding and Byte Order Markers

Consider using locale-based routines and further internationalization.

For Windows 95, 98 and ME, consider using the Microsoft MSLU (Microsoft Layer for Unicode)

Consider string comparison and sorting (Unicode Collation Algorithm)

Consider Unicode Normalization

Consider Character Folding
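As a hedged illustration of the normalization and folding items above, the following Python sketch builds a comparison key with the standard unicodedata module. The function name is hypothetical, and NFC followed by case folding is a simplification of full Unicode caseless matching; locale-aware sorting would additionally need a collation library.

```python
import unicodedata


def comparison_key(text):
    """NFC-normalise, then case-fold, so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


# DEVANAGARI LETTER QA as one code point, and as KA followed by NUKTA.
precomposed = "\u0958"
decomposed = "\u0915\u093c"

print(precomposed == decomposed)                                   # raw strings differ
print(comparison_key(precomposed) == comparison_key(decomposed))   # keys match
print(comparison_key("Straße") == comparison_key("STRASSE"))       # folding
```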

Unicode BOM Encoding Values

Encoding Form                          BOM Encoding
UTF-8                                  EF BB BF
UTF-16 (big-endian)                    FE FF
UTF-16 (little-endian)                 FF FE
UTF-16BE, UTF-32BE (big-endian)        No BOM!
UTF-16LE, UTF-32LE (little-endian)     No BOM!
UTF-32 (big-endian)                    00 00 FE FF
UTF-32 (little-endian)                 FF FE 00 00
SCSU (compression)                     0E FE FF

The Byte Order Marker (BOM) is Unicode character U+FEFF. (It can also represent a Zero Width No-break Space.) The code point U+FFFE is illegal in Unicode, and should never appear in a Unicode character stream. Therefore the BOM can be used as the first character of a file (or more generally a string) as an indicator of endian-ness. With UTF-16, if the first 16-bit unit is read as the value FEFF, the text has the same endian-ness as the machine reading it. If it is read as FFFE, the endian-ness is reversed and all 16-bit words should be byte-swapped as they are read in. In the same way, the BOM indicates the endian-ness of text encoded with UTF-32.

Note, however, that not all files start with a BOM. The Unicode Standard says that UTF-16 or UTF-32 text that does not begin with a BOM, and whose byte order is not specified by a higher-level protocol, must be interpreted in big-endian form.

The character U+FEFF also serves as an encoding signature for the Unicode Encoding Forms. The table shows the encoding of U+FEFF in each of the Unicode encoding forms. Note that by definition, text labeled as UTF-16BE, UTF-32BE, UTF-32LE or UTF-16LE should not have a BOM. The endian-ness is indicated in the label.

For text that is compressed with the SCSU (Standard Compression Scheme for Unicode) algorithm, there is also a recommended signature.
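The BOM values in the table can be checked programmatically. The sketch below uses the BOM constants from Python's standard codecs module; the function names and the UTF-8 fallback for files without a BOM are assumptions made for illustration, and the SCSU signature is not handled.

```python
import codecs

# Longest signatures first: the UTF-32 little-endian BOM (FF FE 00 00)
# begins with the UTF-16 little-endian BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length) or (None, 0) when no BOM is present."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the BOM if there is one, else an assumed default."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)


print(sniff_bom(b"\xef\xbb\xbfabc"))
print(decode_with_bom(b"\xff\xfe\x15\x09"))  # UTF-16 little-endian U+0915
```

A UTF-16 little-endian file whose first character after the BOM is U+0000 would be mistaken for UTF-32 by this check, so a label supplied by a higher-level protocol should take precedence when one is available.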

Constant and Global Variables

ANSI            Wide             TCHAR
EOF             WEOF             _TEOF
_environ        _wenviron        _tenviron
_pgmptr         _wpgmptr         _tpgmptr

Data Types

ANSI                      Wide                         TCHAR
char                      wchar_t                      _TCHAR
_finddata_t               _wfinddata_t                 _tfinddata_t
__finddata64_t            __wfinddata64_t              _tfinddata64_t
_finddatai64_t            _wfinddatai64_t              _tfinddatai64_t
int                       wint_t                       _TINT
signed char               wchar_t                      _TSCHAR
unsigned char             wchar_t                      _TUCHAR
char                      wchar_t                      _TXCHAR
(none)                    L                            _T or _TEXT
LPSTR (char *)            LPWSTR (wchar_t *)           LPTSTR (_TCHAR *)
LPCSTR (const char *)     LPCWSTR (const wchar_t *)    LPCTSTR (const _TCHAR *)
LPOLESTR (For OLE)        LPWSTR                       LPTSTR

For further details regarding data types and functions, please refer to: http://www.i18nguy.com/unicode/c-unicode.html

Most string operations for Unicode can be coded with the same logic used for handling the Windows character set. The difference is that the basic unit of operation is a 16-bit quantity instead of an 8-bit one. The header files provide a number of type definitions that make it easy to create sources that can be compiled for Unicode or the Windows character set.

For 8-bit (ANSI) and double-byte characters:

typedef char CHAR;       // 8-bit character
typedef char *LPSTR;     // pointer to 8-bit string

For Unicode (wide) characters:

typedef unsigned short WCHAR;    // 16-bit character
typedef WCHAR *LPWSTR;           // pointer to 16-bit string

The figure below shows the method by which the Win32 header files define three sets of types:

One set of generic type definitions (TCHAR, LPTSTR), which depend on the state of the _UNICODE manifest constant.

Two sets of explicit type definitions (one set for those that are based on code pages or ANSI and one set for Unicode).


With generic declarations, it is possible to maintain a single set of source files and compile them for either Unicode or ANSI support.

Figure 4: WCHAR, a new data type (source Microsoft/MSDN).

8.1 Tips and Tricks for C/C++/VC++

These are some small tips that can be used during coding:

(1) Using a CString in a string copy function

wcscpy(dest, (const wchar_t*)(LPCTSTR)str);   // dest: wchar_t* buffer, str: CString (Unicode build)

(2) Converting an integer to a string

char buffer[20];
int i = 3445;
_itoa(i, buffer, 10);
printf("String of integer %d (radix 10): %s\n", i, buffer);

(3) Finding the size of a file

FILE *fpRead;
long lFileSize;
if ((fpRead = _tfopen(_T("c:\\test.txt"), _T("rb"))) == NULL) {
    MessageBox(_T("Cannot Open"), NULL, MB_OK);
} else {
    fseek(fpRead, 0, SEEK_END);    // move to the end of the file
    lFileSize = ftell(fpRead);     // offset at the end is the size in bytes
    fclose(fpRead);
}


8.2 Encodings in Web Pages

Generally speaking, there are four different ways of setting the character set or the encoding of a Web page.

Code pages: With this approach, you can select from the list of supported code pages to create your Web content. The downside of this approach is that you are limited to languages that are included in the selected character set, making true multilingual Web content impossible. This limits you to a single-script Web page.

Number entities: Number entities can be used to represent a few symbols that are outside the currently selected code page or encoding. Let's say, for example, you have decided to create a Web page using the previous approach with the Latin ISO charset 8859-1. Now you also want to display some Greek characters in a mathematical equation; Greek characters, however, are not part of the Latin code page. Take, for instance, the Greek character Φ, which has the Unicode code point U+03A6. Using the decimal number entity of this code point (934) preceded by &#, the markup will be as follows: This is my text with a Greek Phi: &#934;. The output would be: This is my text with a Greek Phi: Φ. Unfortunately, this approach makes it impossible to compose large amounts of text and makes editing your Web content very hard.

UTF-16: Unlike Win32 applications, where UTF-16 is by far the best approach, for Web content UTF-16 can be used safely only on Windows NT networks that have full Unicode support. Therefore, this is not a suggested encoding for Internet sites where the capabilities of the client Web browser as well as the network Unicode support are not known.

UTF-8: This Unicode encoding is the best and safest approach for multilingual Web pages. It allows you to encode the whole repertoire of Unicode characters. Also, all versions of Internet Explorer 4 and later as well as Netscape 4 and later support this encoding, which is not restricted to network or wire capabilities. The UTF-8 encoding allows you to create multilingual Web content without having to change the encoding based on the target language.
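The difference between the number-entity approach and UTF-8 can be sketched with Python's standard library. The function name is hypothetical; the "xmlcharrefreplace" error handler and html.unescape are standard features.

```python
import html


def to_latin1_html(text):
    """Encode for an ISO 8859-1 page, writing unsupported characters as &#NNN;."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


sentence = "This is my text with a Greek Phi: \u03a6."

as_latin1 = to_latin1_html(sentence)
print(as_latin1)                                     # Phi becomes a number entity
print(html.unescape(as_latin1.decode("iso-8859-1"))) # what the browser displays
print(sentence.encode("utf-8"))                      # UTF-8 stores Phi directly
```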


Figure: Example of a multilingual Web page encoded in UTF-8 (Source: Microsoft).

8.2.1 Setting and Manipulating Encodings

Since Web content is currently based on Windows or other encoding schemes, you'll need to know how to set and manipulate encodings. The following describes how to do this for HTML pages, Active Server Pages (ASP), and XML pages.

8.2.2 HTML pages

Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document, taking into account the user's preferences if no meta element is specified. To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after the head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames. Here is how it works:
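The lookup order described above (HTTP content type first, then the meta element, then the user's preference) can be sketched in Python. This is an illustration of the order only, not of any browser's actual implementation; the function name, the regular expressions and the "windows-1252" default are assumptions.

```python
import re

_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([\w.:-]+)', re.IGNORECASE)
# \x27 is the single quote; the value may be quoted either way or not at all.
_META_CHARSET = re.compile(
    rb'<meta[^>]+?charset\s*=\s*["\x27]?([\w.:-]+)', re.IGNORECASE
)


def pick_charset(content_type, html_bytes, user_default="windows-1252"):
    """Choose a character set: HTTP header, then meta element, then default."""
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    match = _META_CHARSET.search(html_bytes[:1024])  # meta should be near the top
    if match:
        return match.group(1).decode("ascii").lower()
    return user_default


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
print(pick_charset("text/html; charset=ISO-8859-1", page))  # header wins
print(pick_charset("text/html", page))                      # falls back to meta
print(pick_charset(None, b"<html><body>plain</body></html>"))
```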

8.2.3 ASP.Net

1. Explicitly set the CurrentUICulture and CurrentCulture properties in your application. Do not rely on defaults.

2. Note that ASP.NET applications are managed applications and therefore can use the same classes as other managed applications for retrieving, displaying, and manipulating information based on culture.

3. Be aware that you can specify the following three types of encodings in ASP.NET: requestEncoding specifies the encoding received from the client's browser. responseEncoding specifies the encoding to send to the client browser; in most situations, this encoding should be the same as that specified for requestEncoding. fileEncoding specifies the default encoding for .aspx, .asmx, and .asax file parsing.

4. Specify the values for the requestEncoding, responseEncoding, fileEncoding, culture, and uiCulture attributes in the following three places in an ASP.NET application:

In the globalization section of a Web.config file. This file is external to the ASP.NET application. For more information, see the <globalization> element.

In a page directive. Note that, when an application is in a page, the file has already been read. Therefore, it is too late to specify fileEncoding and requestEncoding. Only uiCulture, Culture, and responseEncoding can be specified in a page directive.

Programmatically in application code. This setting can vary per request. As with a page directive, by the time the application's code is reached it is too late to specify fileEncoding and requestEncoding. Only uiCulture, Culture, and responseEncoding can be specified in application code.

5. Note that the uiCulture value can be set to the browser accept language.


8.2.4 XML pages

All XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 (the default encoding) and UTF-16. The Microsoft XML Parser (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UTF-16 character encoding.

The encoding declaration identifies which encoding is used to represent the characters in the document. Although XML parsers can determine automatically if a document uses the UTF-8 or UTF-16 Unicode encoding, this declaration should be used in documents that support other encodings.

For example, the following is the encoding declaration for a document that uses the ISO 8859-1 encoding (Latin 1):

<?xml version="1.0" encoding="ISO-8859-1"?>
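As a small check of this behaviour, the sketch below parses one document that declares ISO 8859-1 and one undeclared UTF-8 document with Python's standard xml.etree.ElementTree module. The element name and sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

hindi_word = "हिन्दी"

# ISO 8859-1 bytes need the encoding declaration to be read correctly.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>Jos\u00e9</name>'
).encode("iso-8859-1")

# UTF-8 is the default, so no declaration is required.
utf8_doc = ("<name>" + hindi_word + "</name>").encode("utf-8")

print(ET.fromstring(latin1_doc).text)
print(ET.fromstring(utf8_doc).text)
```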

8.2.5 Java

Application programming interfaces can be used to access localized messages, that is, translated versions of original text. These messages need to be accessed by applications at runtime to ensure that the correct locale-specific messages are used. Special APIs must be used to retrieve this text. For example, there are two distinct APIs for message handling on Solaris: the catgets() family, which is used with .msg files, and the gettext() family, which is used with .po files. In Java, resource bundles can be used. These are basically Java classes, that is, .java files. The API getString() is used to retrieve localized text from the resource bundles.
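The lookup idea behind resource bundles can be sketched in Python: a key is searched in the most specific bundle first, then in progressively more general ones, and finally in a base bundle. The bundle contents, the function names and the use of plain dictionaries are assumptions for illustration; this is not the Java ResourceBundle API itself.

```python
# Hypothetical bundles keyed by locale name; "" is the base (default) bundle.
BUNDLES = {
    "": {"greeting": "Welcome", "exit": "Exit"},
    "hi": {"greeting": "स्वागत है"},
    "hi_IN": {},
}


def candidates(locale_code):
    """Yield bundle names from most to least specific, e.g. hi_IN, hi, ''."""
    parts = locale_code.split("_") if locale_code else []
    while parts:
        yield "_".join(parts)
        parts.pop()
    yield ""


def get_string(locale_code, key):
    """Return the first translation found along the fallback chain."""
    for name in candidates(locale_code):
        bundle = BUNDLES.get(name, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)


print(get_string("hi_IN", "greeting"))  # found in the "hi" bundle
print(get_string("hi_IN", "exit"))      # falls back to the base bundle
```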


9 Directionality

Using the bidirectional algorithm, one can switch between right-to-left and left-to-right scripts. Three languages in India, viz. Urdu, Sindhi and Kashmiri, are written from right to left. The direction can be set using the dir attribute: dir = LTR | RTL

Other useful terminology which can also be used for the same purposes is listed below:

LRE: A short name for the Unicode character U+202A LEFT-TO-RIGHT EMBEDDING. This invisible control character is used to begin a range of text with an embedded base direction of left-to-right.

LRO: A short name for the Unicode character U+202D LEFT-TO-RIGHT OVERRIDE. This invisible control character is used to begin a range of text that ignores the Unicode bidirectional algorithm and arranges characters from left to right.

PDF: A short name for the Unicode character U+202C POP DIRECTIONAL FORMATTING. This invisible control character is used to signal the end of a range of text that was started with one of the RLE, LRE, RLO or LRO characters.

RLE: A short name for the Unicode character U+202B RIGHT-TO-LEFT EMBEDDING. This invisible control character is used to begin a range of text with an embedded base direction of right-to-left.

RLO: A short name for the Unicode character U+202E RIGHT-TO-LEFT OVERRIDE. This invisible control character is used to begin a range of text that ignores the Unicode bidirectional algorithm and arranges characters from right to left.
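The control characters listed above can be inspected and applied with Python's standard unicodedata module. The sketch below is illustrative only: the helper names are hypothetical, and guessing the direction from the first strongly directional character is a simplification of the full bidirectional algorithm.

```python
import unicodedata

LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"


def embed_rtl(text):
    """Wrap text in RLE ... PDF so it is laid out with a right-to-left base direction."""
    return RLE + text + PDF


def base_direction(text):
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"


for control in (LRE, RLE, PDF, LRO, RLO):
    print("U+%04X" % ord(control), unicodedata.name(control))

print(base_direction("اردو"), base_direction("हिन्दी"))
```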


10 Cascading Style Sheets (CSS)

10.1 How to Use @font-face

With increasing browser capabilities, attractive typography on the web is no longer a challenge. The @font-face rule makes a custom font available to a page. It requires only a few lines of CSS, after which the font family is declared like any other font on the web.

The following code snippet is adapted from: http://boldperspective.com/2011/how-to-use-css-font-face/

    /* Use the web font for body text, with fallbacks if it fails to load. */
    body { font-family: web-font, fallback-fonts; }
    strong { font-family: web-font-bold; }
    em { font-family: web-font-italic; }

    /* Regular face */
    @font-face {
      font-family: 'web-font';
      src: url('web-font.eot?') format('eot'),
           url('web-font.woff') format('woff'),
           url('web-font.ttf') format('truetype'),
           url('web-font.svg') format('svg');
      font-weight: normal;
      font-style: normal;
    }

    /* Bold face: loads the bold font files */
    @font-face {
      font-family: 'web-font-bold';
      src: url('web-font-bold.eot?') format('eot'),
           url('web-font-bold.woff') format('woff'),
           url('web-font-bold.ttf') format('truetype'),
           url('web-font-bold.svg') format('svg');
      font-weight: bold;
      font-style: normal;
    }

    /* Italic face: loads the italic font files */
    @font-face {
      font-family: 'web-font-italic';
      src: url('web-font-italic.eot?') format('eot'),
           url('web-font-italic.woff') format('woff'),
           url('web-font-italic.ttf') format('truetype'),
           url('web-font-italic.svg') format('svg');
      font-weight: normal;
      font-style: italic;
    }


10.2 CSS rendering Issues in Indic Script

Many rendering issues arise when CSS is applied to Indic scripts. A few of them are listed below:

1. Drop caps or capitalization of the first letter. The figure below shows syllable formation in the Devanagari script. How should a drop cap be applied in this scenario?

2. When the underline style is applied to words in Indic scripts such as Devanagari, the matras are not displayed properly.

3. When strikethrough is applied to text in two or more Indic scripts, the rendering becomes inappropriate.

4. Bullets in Indic scripts are not supported.

5. Vertical alignment of characters is also a big challenge in Indic scripts.
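The drop-cap question in item 1 comes down to the fact that the visual "first letter" of a Devanagari word is usually a whole syllable made of several code points. The sketch below is a deliberately simplified illustration of that point and not a complete segmentation algorithm: it ignores ZWJ/ZWNJ and other special cases, and the function name first_akshara() is hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant), which joins consonants


def first_akshara(word):
    """Return the first orthographic syllable of a Devanagari word (simplified)."""
    if not word:
        return ""
    end = 1
    while end < len(word):
        ch = word[end]
        if unicodedata.category(ch) in ("Mn", "Mc"):
            end += 1   # combining mark: matra, virama, nukta, anusvara, ...
        elif word[end - 1] == VIRAMA:
            end += 1   # consonant joined to the previous one by a virama
        else:
            break      # a new syllable starts here
    return word[:end]


for sample in ("भारत", "क्षमा", "स्त्री"):
    syllable = first_akshara(sample)
    print(sample, "->", syllable, "({} code points)".format(len(syllable)))
```

A drop cap styled on the first code point alone would split the conjunct or detach the matra, which is why the styling unit needs to be the syllable.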


11 SQL Server 2005 and International Data: Using Unicode

1. Use the nchar, nvarchar and ntext data types to store Indic/Unicode data.

2. Precede every Unicode string constant with the prefix N (an uppercase N; the prefix is case-sensitive), e.g.

SELECT * FROM TeluguDictionary WHERE (Telugu LIKE N'%%')

INSERT INTO TeluguDictionary VALUES ('akkadi', N'')

3. Note that in Indian languages several words have multiple correct spellings and alternate representation forms, so the Unicode data requires normalization before it is stored or compared.

4. Also note that Indian-language (IL) numerals are not mapped to English numerals, so an MS-SQL query that uses IL numerals gives a different result from the same query written with English numerals, e.g.

SELECT * FROM trains_table WHERE train_no = '5312';

SELECT * FROM trains_table WHERE train_no = '';

For correct results, map the IL numerals to English numerals before running the query.
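Points 3 and 4 above can be handled in application code before the query reaches the database. The following is a minimal sketch under stated assumptions: it uses Python's unicodedata module for normalization and digit mapping, an in-memory SQLite table as a stand-in for SQL Server, Devanagari digits as the sample Indian-language numerals, and hypothetical helper names (normalize_text, to_ascii_digits, find_train).

```python
import sqlite3
import unicodedata


def normalize_text(text):
    """Bring canonically equivalent spellings to one form (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )


# Stand-in table (assumption: SQLite instead of SQL Server, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trains_table (train_no TEXT)")
conn.execute("INSERT INTO trains_table VALUES (?)", ("5312",))


def find_train(train_no):
    """Look up a train number exactly as given."""
    query = "SELECT train_no FROM trains_table WHERE train_no = ?"
    return conn.execute(query, (train_no,)).fetchall()


devanagari_number = "५३१२"  # 5312 written with Devanagari digits
print(find_train(devanagari_number))                   # no match: []
print(find_train(to_ascii_digits(devanagari_number)))  # match: [('5312',)]

# Two encodings of the same Devanagari letter (QA): precomposed vs. KA + NUKTA.
print(normalize_text("\u0958") == normalize_text("\u0915\u093c"))  # True
```

Applying the same normalization both when data is stored and when it is searched is what makes the comparison reliable; normalizing only one side does not help.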


12 ISO 639 Language Codes

The SIL website offers a combined view of the language code tables of ISO 639 parts 1, 2 and 3. There, the code elements of Part 1, 2 or 3 can be selected individually, or the full set can be shown sorted by name. Viewing by name makes it possible to browse for any name associated with a specific identifier, including an inverted form of a name (e.g., the code element [aaq] "Eastern Abnaki" is also found under "Abnaki, Eastern"). The elements can also be ordered by scope of denotation or by type of language. For each code element, further documentation on what it denotes is available; in the case of a macrolanguage, this includes a listing of its individual member languages.

A tabular representation of ISO 639 language code is given below:

639-3   639-2/639-5   639-1   Language Name
asm     asm           as      Assamese
ben     ben           bn      Bengali
brx     -             -       Bodo (India)
dgo     -             -       Dogri (individual language)
guj     guj           gu      Gujarati
hin     hin           hi      Hindi
kan     kan           kn      Kannada
kas     kas           ks      Kashmiri
kok     kok           -       Konkani (macrolanguage)
mai     mai           -       Maithili
mal     mal           ml      Malayalam
mni     mni           -       Manipuri
mar     mar           mr      Marathi
nep     nep           ne      Nepali (macrolanguage)
ori     ori           or      Oriya (macrolanguage)
pan     pan           pa      Panjabi
sat     sat           -       Santali
san     san           sa      Sanskrit
snd     snd           sd      Sindhi
tam     tam           ta      Tamil
tel     tel           te      Telugu
urd     urd           ur      Urdu

For further details, please refer to: http://sil.org/iso639-3/codes.asp

Four-letter script codes (ISO 15924) are available at the following URL: http://unicode.org/iso15924/iso15924-codes.html
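In an application, the table above is typically held as a small lookup structure. The sketch below is one possible arrangement, not a prescribed one: the dictionary name ISO_639_3_TO_1 and the function preferred_code() are hypothetical, and the rule of preferring the two-letter code when one exists follows the usual practice for language tags.

```python
# ISO 639-3 code -> ISO 639-1 code (None where the table lists no two-letter code).
ISO_639_3_TO_1 = {
    "asm": "as", "ben": "bn", "brx": None, "dgo": None, "guj": "gu",
    "hin": "hi", "kan": "kn", "kas": "ks", "kok": None, "mai": None,
    "mal": "ml", "mni": None, "mar": "mr", "nep": "ne", "ori": "or",
    "pan": "pa", "sat": None, "san": "sa", "snd": "sd", "tam": "ta",
    "tel": "te", "urd": "ur",
}


def preferred_code(code3):
    """Return the two-letter code if the table has one, else the three-letter code."""
    code3 = code3.lower()
    if code3 not in ISO_639_3_TO_1:
        raise KeyError("unknown ISO 639-3 code: " + code3)
    return ISO_639_3_TO_1[code3] or code3


print(preferred_code("hin"))  # hi
print(preferred_code("brx"))  # brx
```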


13 References

1. Sun Globalisation: http://developers.sun.com/dev/gadc/i18ntesting/checklists/detcheck/detcheck.html [Accessed: 15 April 2012]

2. Microsoft: http://msdn.microsoft.com [Accessed: 12 April 2012]

3. Localisation: http://www.wordesign.com/Localization/index.htm#_Toc469910877 [Accessed: 12 April 2012]

4. Unicode: http://unicode.org [Accessed: 20 April 2012]

5. CDAC: http://cdac.in/downloads [Accessed: 20 April 2012]

6. Lingobit Best Practices: http://www.lingobit.com/solutions/bestpractices.html [Accessed: 30 April 2012]

7. Localisation: http://www.wordesign.com/Localization/index.htm#_Toc469910877 [Accessed: 30 April 2012]

8. I18nGuy Localisation: http://www.i18nguy.com/unicode/c-unicode.html [Accessed: 25 April 2012]

9. ISO 639 code for languages: http://sil.org/iso639-3/codes.asp [Accessed: 30 June 2012]

10. Four-letter script codes: http://unicode.org/iso15924/iso15924-codes.html [Accessed: 30 June 2012]

11. CLDR: http://cldr.unicode.org/ [Accessed: 30 June 2012]

12. Language Tag and Text Direction: http://www.w3.org/TR/html4/struct/dirlang.html [Accessed: 30 June 2012]

13. IBM Globalisation Guidelines: http://www-01.ibm.com/software/globalization/ [Accessed: 30 June 2012]

14. OASIS: https://www.oasis-open.org/ [Accessed: 20 June 2012]

15. ITS: http://www.w3.org/TR/2007/REC-its-20070403/ [Accessed: 28 June 2012]

16. Indic Scripts and Unicode: http://www.Unicode.org/standard/WhatIsUnicode.html [Accessed: 21 May 2012]

17. ISO/IEC Guide 2:2004, Standardization and related activities - General vocabulary.

18. TDIL: tdil.mit.gov.in [Accessed: 10 April 2012]

19. PASCII: http://parc.cdac.in/PASCii.htm [Accessed: 29 May 2012]

20. Esselink, B. (2000). A Practical Guide to Localisation. Amsterdam: Benjamins.

21. Anastasiou & Schäler (2010). Localisation, Internationalisation, and Globalisation Against Digital Divide and Information Poverty.

22. LISA Internationalisation definition: http://www.lisa.org/Internationalization.58.0.html [Accessed: 17 October 2010]

23. Sikes, R. (2009). Localisation: the global pyramid capstone. Multilingual, April, pp. 3-5.

24. Internationalisation: http://www.w3.org/International/questions/qa-i18n [Accessed: 25 May 2012]

25. ISFOC: http://pune.cdac.in/html/gist/standard/isfoc.aspx [Accessed: 28 May 2012]

26. ICU: http://site.icu-project.org/ [Accessed: 29 May 2012]

27. PANGO: http://www.pango.org/ [Accessed: 29 May 2012]

28. ICU: http://site.icu-project.org/ [Accessed: 29 May 2012]

29. HarfBuzz: http://www.freedesktop.org/wiki/Software/HarfBuzz [Accessed: 29 May 2012]

30. FUEL: https://fedorahosted.org/fuel/wiki/fuel-hindi [Accessed: 20 July 2012]

31. Font face: http://boldperspective.com/2011/how-to-use-css-font-face/ [Accessed: 25 July 2012]
