Note: This document (ver 3.4) is for developers’ use only. Copyright C-DAC, GIST 2011-2012. All rights reserved 1 Localization Guidelines

Upload: hanmant-rachmale

Post on 19-Oct-2015

    Localization Guidelines

    Contents

    Summary (for Project Managers)
    1 Introduction
    1.1 Outline
    1.2 Purpose of this Document
    1.3 Scope
    2 Background
    3 Standards
    3.1 Text Localization standards
    3.1.1 Inputting multilingual data
    3.1.2 Rupee Symbol
    3.1.3 Displaying multilingual data
    3.1.4 Storing multilingual data
    3.2 Localization standards
    4 Common Locale Data Repository (CLDR)
    4.1 APPROACH
    4.1.1 ANALYSIS
    5 Frequently Used Entries for Localization (FUEL)
    6 Generic Guidelines
    6.1 Categorical Classification of Guidelines
    7 Migration to Unicode
    7.1.1 Compiling Unicode Applications in Visual C++
    8 Data Encoding and Byte Order Markers
    8.1 Tips and Tricks for C/C++/VC++
    8.2 Encodings in Web Pages
    8.2.1 Setting and Manipulating Encodings
    8.2.2 HTML pages
    8.2.3 ASP.Net
    8.2.4 XML pages
    8.2.5 Java
    9 Directionality
    10 Cascading Style Sheets (CSS)
    10.1 How to Use @font-face
    10.2 CSS rendering Issues in Indic Script
    11 SQL Server 2005 and International Data: Using Unicode
    12 ISO 639 Language Codes
    13 References

    Summary (for Project Managers)

    This document discusses guidelines to help developers localize their software products and services. It will benefit new software developers by showing how to use existing technologies together with various initiatives and standards, and by helping them understand the complications involved with various scripts and platforms. A further benefit is better integration and interoperability of products. The document starts with definitions of Localization (L10N), Internationalization (I18N), and Globalization. It also highlights some national and international standards currently used in the Localization industry. The topics covered include: inputting, storing and rendering Indian language data; migrating legacy data to Unicode; using the Common Locale Data Repository (CLDR); character encodings for proper representation on various platforms; Unicode; culture- and locale-specific information for Indian languages; guidelines for developing localized software applications and services; encodings and byte order markers for distinguishing and handling different files; directionality issues with right-to-left scripts such as Urdu; and the problems of using Cascading Style Sheets (CSS) with Indian languages.

    Authoring, Internationalization and Localization are discussed in detail, with XLIFF as an example. Frequently Used Entries for Localization (FUEL), an open source initiative for term consistency, is cited in the context of Indian languages. ISO 639 language codes are provided at the end of the document. Some of the guidelines are generic in nature and may or may not be applicable to mobile platforms.

    1 Introduction

    1.1 Outline

    This document defines general guidelines to help developers localize their software products and services. It starts with definitions of Localization (L10N), Internationalization (I18N), and Globalization. Defining Localization creates a common understanding of the term and introduces the dos and don'ts of creating a localized application. The document also discusses some of the national and international standards frequently used in the Localization industry. It is divided into topics including Localization / Internationalization standards, User Interface (UI), directionality issues, Frequently Used Entries for Localization (FUEL), resource identifiers, the Common Locale Data Repository (CLDR), character encoding, Unicode, culture and locales.

    1.2 Purpose of this Document

    This document is intended for:

    Software Designers / Engineers: To understand and evaluate Localization areas related to product design.

    Testing and QA Engineers: To define test plans and test cases keeping Localization and Internationalization in mind.

    1.3 Scope

    The scope of this document is to create awareness among software developers and designers so that they create applications and products with Localization in mind. These guidelines do not describe how to internationalize a product. The coding snippets and examples included are used to illustrate concepts and are not intended as a guide to implementation. Some of the guidelines are generic in nature and may or may not be applicable to mobile platforms.

    2 Background

    India is a multilingual country with 22 official languages and 12 different scripts. The official languages of India are Hindi, Marathi, Sanskrit, Bangla, Assamese, Punjabi, Gujarati, Malayalam, Maithili, Santali, Bodo, Dogri, Tamil, Oriya, Telugu, Kannada, Urdu, Sindhi, Kashmiri, Manipuri, Konkani and Nepali. The various scripts are: Devanagari, Gurmukhi, Bengali, Assamese, Gujarati, Kannada, Oriya, Telugu, Malayalam, Tamil, Perso-Arabic, Meitei Mayek, Ol Chiki and Roman [16].

    This document discusses various Localization standards along with the three basic needs for localizing a piece of text: input, storage and display. Standards bodies include ISO (International Organization for Standardization), W3C (World Wide Web Consortium) and OASIS (Organization for the Advancement of Structured Information Standards); here we focus on the translation and Localization standards developed by the last two organizations.

    ISO/IEC Guide 2:2004 [17] provides the following general definition of a standard:

    a document, established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context.

    Various standards and specifications used in Localization are: Unicode, XML Localization Interchange File Format (XLIFF), Translation Memory eXchange (TMX), Term Base eXchange (TBX), Segmentation Rules eXchange (SRX), Indian Script Code for Information Interchange (ISCII)[18], Perso-Arabic Script Code for Information Interchange (PASCII) [19] etc.

    Four terms are widely used in the Localization domain, collectively abbreviated as GILT: Globalization, Internationalization, Localization and Translation [20].

    There are many definitions of these terms, but for the purposes of this document the following definitions will be used.

    Translation is not just about text conversion; it also deals with the conversion of spoken words from a source language to a target language without losing their meaning.

    Anastasiou & Schäler (2010) define Translation [21] as follows: Translation is the text transfer from a source to a target language; text is everywhere, in laws, news, academic dissertations, user manuals, advertisements and so on; the list is endless. Besides, we often see that text is accompanied by icons, diagrams, and other visual effects. For example, in newspapers as well as in user manuals and advertisements we have many pictures, animations, logos and so on. We often see that these icons change, when they are transferred to a target language.

    LISA defined Localization as follows[22]:

    Localization refers to the actual adaptation of the product for a specific market. It includes translation, adaptation of graphics, adoption of local currencies, use of proper forms for dates, addresses, and phone numbers, and many other details, including physical structures of products in some cases.

    Schäler (2007), with emphasis on digital content, defines Localization as:

    "Localization is the linguistic and cultural adaptation of digital content to the requirements and locale of a foreign market, and the provision of services and technologies for the management of multilingualism across the digital global information flow."

    Globalisation deals with the strategy for selling products or services in foreign markets.

    Sikes (2009) defines Globalisation [23] as

    Expansion of marketing strategies to address regional requirements of all kinds while Localization is Engineering a product to enable efficient adaptation to local requirements

    Internationalisation [24]:

    Internationalisation is the design and development of a product, application or document that enables easy Localization for target audiences that differ in region, culture, or language.

    In short, internationalisation reduces the engineering effort spent on supporting various languages and cultures. It also makes the software locale independent. A locale is a specific geographical, political, or cultural region. It is usually identified by a combination of language and country; for example, en_GB represents the locale for English as used in the United Kingdom.
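
    As a minimal sketch of how such an identifier can be split into its language and country parts, the Python fragment below may help. The function name parse_locale is hypothetical, and the sketch assumes the common convention of an ISO 639 language code followed by an ISO 3166 country code (in which the United Kingdom is GB); script and variant subtags are deliberately ignored.

```python
def parse_locale(identifier):
    """Split a locale identifier such as 'en_GB' or 'hi-IN' into (language, region).

    Illustrative only: both '_' and '-' are accepted as separators, and any
    subtags after the second one are ignored.
    """
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    return language, region


if __name__ == "__main__":
    for sample in ("en_GB", "hi-IN", "mr"):
        print(sample, "->", parse_locale(sample))
```

    A real application would normally rely on the locale facilities of its platform or on a library such as ICU rather than on hand-written parsing.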

    3 Standards

    The standards have been divided into two categories:

    a) Text Localization standards

    b) Localization standards

    3.1 Text Localization standards

    To localise a piece of text, the following three basic steps are needed:

    1. Inputting multilingual data (keyboards)

    2. Displaying multilingual data (fonts)

    3. Storing multilingual data (Unicode)

    3.1.1 Inputting multilingual data

    There are three types of inputting mechanism for Indian languages, viz.:

    3.1.1.1 INSCRIPT/Enhanced INSCRIPT

    INSCRIPT: The INSCRIPT layout uses the standard 101-key QWERTY keyboard. The Indian language characters are divided into vowels and consonants. The vowels are placed on the left-hand side of the keyboard layout and the consonants on the right-hand side. The character mapping is common to all the left-to-right Indian languages, because their basic character set is common. Enhanced INSCRIPT has four keyboard layers to accommodate more characters. The INSCRIPT layout can be used by urban as well as non-urban communities.

    The following figure represents the INSCRIPT Layout.

    3.1.1.2 Rupee Symbol

    Since the introduction of the new Rupee symbol, several hacks and non-standard implementations have appeared on the web. The Rupee symbol must be placed on the keyboard as indicated in the figure above, and it must be stored as the code point U+20B9.
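
    The storage requirement can be checked with a few lines of Python. This is only a sketch: the constant RUPEE and the helper format_rupees are hypothetical names, and the amount layout (symbol, space, two decimals) is an assumption made for the illustration, not a rule from this document.

```python
import unicodedata

RUPEE = "\u20b9"  # the standard code point required above


def format_rupees(amount):
    """Prefix an amount with the standard Rupee sign (layout is an assumption)."""
    return f"{RUPEE} {amount:.2f}"


if __name__ == "__main__":
    print(unicodedata.name(RUPEE))     # official Unicode character name
    print(RUPEE.encode("utf-8"))       # bytes written to a UTF-8 file
    print(format_rupees(1250.5))
```

    Text that shows a Rupee glyph but stores some other code point will not survive a font change, which is why the stored value matters more than the appearance.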

    3.1.2 Displaying multilingual data

    Displaying multilingual data requires fonts. A font is a set of well-defined shapes for displaying symbols (letters, punctuation marks and special characters of the language). An 8-bit font can represent 256 glyphs by giving a unique index (called the glyph index) and a name to each glyph/shape.

    Indian scripts are complex in nature, and because of this complexity the way text in these scripts is stored differs from the way it is displayed. Hence, many legacy standards were developed to display ISCII text, e.g. Indian Script Font Code (ISFOC) [25], which requires special software to render text stored in ISCII. The right approach is to use language-specific fonts and converters so that text is rendered according to the rules of the language. With the evolution of the computing environment from 8-bit to 16-bit, ISCII was superseded by Unicode, and TrueType fonts by OpenType fonts.

    3.1.2.1 Open Type Font

    OpenType is a cross-platform font file format developed jointly by Adobe and Microsoft[16].

    Microsoft defines the OpenType font as follows:

    OpenType. All the information controlling the substitution and relative positioning of glyphs during glyph processing is contained within the font itself. This information is defined in OpenType Layout (OTL) features that are, in turn, associated with specific scripts and language systems. [9]

    The OpenType font format has the following advantages:

    multi-platform support

    support for international character sets

    smaller file sizes

    support for advanced typographic control

    3.1.2.2 Rendering / Rasteriser Engines

    Displaying multilingual data also requires rendering / rasteriser engines such as Uniscribe, Pango, ICU and HarfBuzz.

    Uniscribe: Uniscribe is a Microsoft shaping and rendering engine that ensures proper representation of multilingual content on the Windows platform.

    Pango: Pango is a library for laying out and rendering of text, with an emphasis on internationalization. Pango can be used anywhere that text layout is needed, though most of the work on Pango so far has been done in the context of the GTK+ widget toolkit. Pango forms the core of text and font handling for GTK+-2.x.

    ICU: ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

    Harfbuzz: HarfBuzz is an OpenType text shaping engine.

    For further details, please refer to the following URLs:

    ICU: http://site.icu-project.org/

    PANGO: http://www.pango.org/

    Harfbuzz: http://www.freedesktop.org/wiki/Software/HarfBuzz

    3.1.2.3 Sakal Bharati font

    A single font containing all the Indic scripts has been developed by C-DAC Pune. This font has a consistent look and feel across the various Indic scripts as well as English. The following picture shows how different scripts are rendered on the screen.

    3.1.3 Storing multilingual data

    Multilingual data can be stored in Unicode.

    The Unicode Consortium defines Unicode as follows:

    Unicode is the universal character encoding, maintained by the Unicode consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols.[4]

    Unicode covers the characters of the world's writing systems, including punctuation, special characters (shapes), currency symbols, mathematical symbols etc. [5]. Its code space spans 1,114,112 code points (U+0000 to U+10FFFF), of which 65,536 lie in the Basic Multilingual Plane. The characters are organised into blocks (code charts), generally grouped by script.
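
    How much storage a character needs depends on its code point and on the encoding form chosen. The following Python sketch illustrates this; the helper name describe and the sample characters are chosen only for the illustration.

```python
def describe(ch):
    """Report the code point of one character and its size in two encoding forms."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),  # little-endian, no byte order mark
    }


if __name__ == "__main__":
    samples = {
        "Latin capital A": "A",
        "Devanagari letter KA": "\u0915",
        "Kaithi letter A (beyond U+FFFF)": "\U00011083",
    }
    for label, ch in samples.items():
        print(label, describe(ch))
```

    Characters of the major Indian scripts take three bytes in UTF-8 and two in UTF-16, while code points above U+FFFF take four bytes in both, so code that assumes one character per 16-bit unit can break on them.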

    3.1.3.1 Normalization in Unicode

    Unicode data requires normalization. There are many cases where a character can be entered in more than one way using the Unicode code chart.

    eg.

    Utmost care must be taken while developing Unicode applications in which such variants matter, for example internationalized domain names (IDN).
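
    A small Python sketch of the problem, using the standard unicodedata module: the Devanagari letter QA can be entered either as the single code point U+0958 or as KA (U+0915) followed by NUKTA (U+093C). The choice of this pair and the helper name same_text are for illustration only.

```python
import unicodedata

precomposed = "\u0958"        # DEVANAGARI LETTER QA as one code point
sequence = "\u0915\u093c"     # DEVANAGARI LETTER KA followed by SIGN NUKTA


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(precomposed == sequence)           # raw comparison of code points
    print(same_text(precomposed, sequence))  # comparison after normalization
```

    The two inputs look identical on screen but differ as stored data, so searching, sorting and name lookups should compare normalized text.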

    3.2 Localization standards

    The various standards in today's Localization world are: XLIFF, Segmentation Rules eXchange (SRX), Translation Memory eXchange (TMX), Term-Base eXchange (TBX), Global Information Management Metrics eXchange-Volume (GMX-V), XML Text Memory (xml:tm).

    XLIFF: An XML-based intermediate format used to store, carry and interchange localizable data. According to the XLIFF specification [13]: XLIFF is the XML Localization Interchange File Format designed by a group of software providers, Localization service providers, and Localization tools providers. It is intended to give any software provider a single interchange file format that can be understood by any Localization provider.

    SRX: An XML vocabulary of rules for breaking text into translatable segments (smaller fragments). SRX is defined in two parts: a specification of the rules applicable to each language, and a specification of how the rules are applied for each language.

    TMX: The translation memory data exchange standard between applications. It is divided into two parts: Translation Unit and Segment of translation memory text.

    TBX: TBX-Basic is a TBX-compliant terminology markup language for translation and Localization processes that permits a limited set of data categories. The purpose of TBX-Basic is to enhance the ability to exchange terminology resources between users.

    GMX-V: Measures the workload for a given Localization job, not just by word and character count, but also by counting exact and fuzzy matches, punctuation symbols etc. It can also be used to count the number of pages, screenshots etc.

    xml:tm: A vendor-neutral open XML standard that allows text memory, including translations, to be embedded within XML documents using XML namespace syntax.
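
    The idea behind rule-based segmentation, as used by SRX, can be sketched in Python as follows. This is not an SRX parser: the rule list, the function name segment and the "first matching rule decides" policy are assumptions made for the illustration, with each rule pairing a pattern for the text before a candidate break with a pattern for the text after it.

```python
import re

# Each rule: (is_break, pattern before the position, pattern after the position).
# Hypothetical rules for English; exceptions are listed before the general rule.
RULES = [
    (False, r"\b(?:Dr|Mr|Mrs|etc)\.", r"\s"),   # no break after common abbreviations
    (True, r"[.?!]", r"\s+"),                   # break after sentence-final punctuation
]


def segment(text, rules=RULES):
    """Split text into segments; the first rule matching at a position decides."""
    compiled = [(brk, re.compile(before + r"\Z"), re.compile(after))
                for brk, before, after in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # later rules are not consulted for this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    print(segment("Dr. Rao arrived. He was late."))
```

    Keeping the exceptions ahead of the general rule is what prevents a break after the abbreviation; each language needs its own rule set.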

    The standards described next follow the order of a Localization workflow, i.e. authoring, Internationalization and Localization, as shown in the following diagram:

    Authoring: DITA (OASIS)
    Internationalization: ITS (W3C)
    Localization: XLIFF (OASIS)

    Fig. Authoring, internationalization, and Localization standards

    At the authoring stage, the standard that stands out is DITA (Darwin Information Typing Architecture), managed by OASIS. ITS (Internationalization Tag Set) is the internationalization standard by W3C, while XLIFF is a standard under the umbrella of OASIS which carries Localization data and metadata.

    DITA is an XML-based specification which processes modular and extensible topic-based information types as specializations of existing types. It allows for the introduction of specific semantics for specific purposes. An example of DITA is shown below:

    (1) DITA example

    Installing a hard drive
    You open the box and insert the drive.
    hard drive
    disk drive
    Unscrew the cover.
    The drive bay is exposed.
    Insert the drive into the drive bay.
    If you feel resistance, try another angle.

    The example is retrieved from the DITA OASIS online community portal: http://dita.xml.org/what-do-info-typed-dita-topic-examples-look, 10/06/12.

    In the above example, the metadata tag stores the meta information for keywords, index terms and audience type. The steps and their results are given after it. There is also a provision to link to other DITA files.

    After the Authoring stage, the next stage is Internationalization. Internationalization reduces the engineering effort spent on supporting various languages and cultures. Directionality (RTL or LTR), which is very important for internationalization, is supported by the ITS specification. The ITS specification consists of data categories, each of which is a set of elements and attributes. Perhaps the most important data category is Translate, as it expresses whether the content of an element or attribute should be translated or not. The values of this data category are yes (translatable) and no (not translatable). The selection of XML node(s) may be explicit (using the local approach) or implicit (using global rules). An ITS example using the local approach is given below:

    (2) ITS example

    An example article
    John Doe
    [email protected]
    This is a short article.

    Here there is no explicit metadata tag, but metadata is still available, such as title, author, person's name (first name and surname), affiliation, address, and contact e-mail. All of the author's information should not be translated, as indicated by the data category translate=no.
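
    The local approach can be sketched in Python with the standard xml.etree.ElementTree module. The sample document below is an assumption: its element names (article, title, author, name, para) are hypothetical and it is not the markup of example (2), which is not reproduced here. Only the ITS namespace and the its:translate attribute come from the ITS specification.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"   # qualified name of the its:translate attribute

# Hypothetical sample document; only the its:translate attribute is ITS markup.
SAMPLE = """<article xmlns:its="http://www.w3.org/2005/11/its">
  <title>An example article</title>
  <author its:translate="no"><name>John Doe</name></author>
  <para>This is a short article.</para>
</article>"""


def translatable_text(elem, inherited=True):
    """Collect text that may be translated; its:translate is inherited by children."""
    flag = elem.get(TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")
    found = []
    if translate and elem.text and elem.text.strip():
        found.append(elem.text.strip())
    for child in elem:
        found.extend(translatable_text(child, translate))
        # Text following a child element belongs to the current element.
        if translate and child.tail and child.tail.strip():
            found.append(child.tail.strip())
    return found


if __name__ == "__main__":
    print(translatable_text(ET.fromstring(SAMPLE)))
```

    The author's name is skipped because its parent carries translate="no"; global rules, which select nodes without touching the content, are outside the scope of this sketch.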

    A basic minimal XLIFF file (3) follows:

    (3) Minimal XLIFF file with one translation unit (TU)

    Thankyou
    Danke

    On the first line we have the XLIFF declaration and also the schemaLocation attribute of the XML schema-instance namespace. In the file element we have the name of the file (Greetings.txt), its source language (English (US)), and its data type (plain text). Then the translation unit element follows with its source and target text in the source language (SL) and target language (TL), respectively.

    XLIFF also allows the storage of important data for (software) localisation; an example is the restype attribute. Among its values are checkbox, cursor, dialog, hscrollbar (horizontal scrollbar), icon, menubar (a toolbar containing one or more top-level menus), and window. An example of a dialog resource type follows:

    (4) XLIFF TU of a dialog

    About Dialog

    In example (4), we see metadata about font, style, and coordinates. This is specific to the dialog resource type. When metadata is about the file and the localisation process in general, it is included in the header element. An example of the header element follows:

    (5) Process metadata in header element
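
    Returning to the minimal file of example (3): the Python sketch below rebuilds a comparable file from the description above and reads its translation unit back. The markup is a reconstruction under assumptions: XLIFF version 1.2 is assumed, the schemaLocation attribute is omitted, and the id value and the function name read_units are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}  # XLIFF 1.2 namespace (assumed version)

# Reconstructed from the prose description; not the original listing.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="Greetings.txt" source-language="en-US" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Thankyou</source>
        <target>Danke</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def read_units(xml_text):
    """Return (file name, unit id, source text, target text) for every translation unit."""
    root = ET.fromstring(xml_text)
    units = []
    for file_elem in root.findall("x:file", NS):
        for unit in file_elem.iterfind(".//x:trans-unit", NS):
            units.append((
                file_elem.get("original"),
                unit.get("id"),
                unit.findtext("x:source", namespaces=NS),
                unit.findtext("x:target", namespaces=NS),
            ))
    return units


if __name__ == "__main__":
    for row in read_units(SAMPLE):
        print(row)
```

    A localization tool would add further attributes, such as a target language on the file element and a restype on resource units, as described in the text.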

    4 Common Locale Data Repository (CLDR)

    The CLDR provides key building blocks for software to support the world's languages. The data in the repository is used by companies for their software internationalization and localization: adapting software to the conventions of different languages for such tasks as formatting of dates, times, time zones, numbers, and currency values; sorting text; choosing languages or countries by name; and many others. CLDRs provide useful information about the locale and are therefore crucial from the perspective of localization. For the purpose of clarity, CLDRs may be divided into two types: Basic and Exhaustive.

    The Basic CLDR comprises:

    Calendars

    Numeric formats

    Date and Time formats

    Currencies

    These are used by operating systems not only to show the time and date and to perform conversions, but also for more extensive functions such as inserting the date, time, currency etc. in the locale of the country.

    Advanced or Exhaustive CLDRs cover, in addition to the above, items such as:

    Weight systems

    Distance systems

    Location

    Modes of Address

    Language Selection

    Oral Pronunciation (tied to the Voice Browser)

    A majority of CLDRs use LDML (Locale Data Markup Language) for mark-up.

    4.1 APPROACH

    Existing CLDR templates (for the most part from the Unicode site) were analysed for their compliance with locale data. Experts were invited to comment on the tags within the CLDR and on whether the existing tags were sufficient. The results of the analysis are given below.

    4.1.1 ANALYSIS

    This analysis of CLDRs is divided into two parts: Basic CLDRs and Advanced CLDRs.

    4.1.1.1 BASIC CLDR

    The repository contains information on Time, Date, Year, Currency and Weight. As far as the basic CLDRs are concerned, it was noted that a majority were geared towards the Western model. The gaps in each category of locale data are analysed below:

    YEAR

    The year format normally followed is that of the Gregorian calendar. In fact, even the CLDR for Hindi and other Indian languages complies with the Western norm, with the month names merely transliterated into the local language.

    It would be advisable for true representation (as in the case of Chinese) to adopt the Indian luni-solar calendar and have the year calculated as Vikram Samwat for Gujarati, Dogri, Marathi and Konkani, and to a certain extent for Sindhi. In the case of Gujarati, the Parsi calendar, in which each day of the month has a separate name, would require a very specific CLDR. The calendar for Urdu and Kashmiri, on the other hand, could conform to the Muslim calendar.

    DATE

    The same considerations apply to Date, where the notation in Indian languages varies and can be:

    DD MM YYYY

    DD MM YY

    MM DD YYYY

    MM DD YY
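
    The four notations listed above can be produced from one date value, as the following Python sketch suggests. The function name format_date and the space separator are assumptions taken from the way the patterns are written here; real locale data would also specify separators and digit shapes.

```python
from datetime import date

PATTERNS = ("DD MM YYYY", "DD MM YY", "MM DD YYYY", "MM DD YY")


def format_date(d, pattern):
    """Render a date in one of the four notations listed above."""
    if pattern not in PATTERNS:
        raise ValueError(f"unsupported pattern: {pattern}")
    fields = {
        "DD": f"{d.day:02d}",
        "MM": f"{d.month:02d}",
        "YYYY": f"{d.year:04d}",
        "YY": f"{d.year % 100:02d}",
    }
    return " ".join(fields[token] for token in pattern.split())


if __name__ == "__main__":
    sample = date(2012, 6, 10)
    for pattern in PATTERNS:
        print(pattern, "->", format_date(sample, pattern))
```

    Because the same digits can mean 10 June or 6 October depending on the pattern, the notation has to come from the locale data and not from the programmer's own habit.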

    TIME

    Time is expressed in hours, minutes and seconds. Traditional time calculation in India, however, especially in astrology, is based on ghatikas and palas.

    Even with the hour-minute-second notation, a 12-hour cycle is the norm; only the Indian railways and airlines maintain a 24-hour notation.
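
    A short Python sketch of the two notations follows. The function name format_time is hypothetical, and the English AM/PM markers are placeholders only: a localized application would take the day-period words from its locale data.

```python
def format_time(hour, minute, second, twelve_hour=True):
    """Format a time of day in 12-hour (default) or 24-hour notation."""
    if not (0 <= hour < 24 and 0 <= minute < 60 and 0 <= second < 60):
        raise ValueError("time component out of range")
    if not twelve_hour:
        return f"{hour:02d}:{minute:02d}:{second:02d}"
    suffix = "AM" if hour < 12 else "PM"   # placeholder day-period markers
    display_hour = hour % 12 or 12         # 0 and 12 are both shown as 12
    return f"{display_hour:02d}:{minute:02d}:{second:02d} {suffix}"


if __name__ == "__main__":
    print(format_time(15, 30, 0))                     # everyday 12-hour style
    print(format_time(15, 30, 0, twelve_hour=False))  # railway or airline style
```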

    NUMBERS

    The case of Numbers is unique, since numbers in Indian languages have their own grammar, and a Num2Word routine necessarily implies a deep study of the various ways of representing numbers in Indian languages. This has been one of the areas of intensive study, and the results of the number notation as represented in a Num2Word routine are provided at the end of the chapter for Marathi, Konkani, Gujarati, Urdu, Sindhi, Dogri and Kashmiri.
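
    To give an idea of what a Num2Word routine involves, the Python sketch below spells out numbers using the Indian grouping of thousand, lakh and crore. It is not the routine referred to above: English words stand in for the target-language words, the function names are hypothetical, and a real routine needs per-language word tables and grammar rules.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def below_thousand(n):
    """Words for 1..999."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(below_hundred(rest))
    return " ".join(parts)


def num2word_indian(n):
    """Spell out 0 <= n < 10**9 with crore, lakh and thousand groups."""
    if not 0 <= n < 10**9:
        raise ValueError("supported range is 0 to 999,999,999")
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, n = divmod(n, 1_000)
    parts = []
    if crore:
        parts.append(below_thousand(crore) + " crore")
    if lakh:
        parts.append(below_hundred(lakh) + " lakh")
    if thousand:
        parts.append(below_hundred(thousand) + " thousand")
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)


if __name__ == "__main__":
    print(num2word_indian(1234567))
```

    The grouping logic carries over between languages; the word tables do not, since several Indian languages have a distinct word for every number up to one hundred.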

    CURRENCY

    The normal currency units are Rupees (Rs) and Paise. However, in common parlance there are further subdivisions into a quarter, a half and three quarters. Should these be represented, as is done in the US where the word dime is acceptable?

    WEIGHTS MEASURES AND DISTANCE

    Although India has adopted the Metric System to be on par with the world, in practice the old weights continue to be used. Should these be included in the CLDR where Weights and Measures are covered? A similar problem arises in the case of distance, where the old foot and mile system is still partly used. Similarly, land is still measured in traditional units. This is especially important for localizing land records, where each state has its own traditional terms for units of measurement such as Guntha, Bigha etc.

    The above items show the complexity of evolving a basic CLDR for Indian languages. The more advanced aspects of the CLDR, which are normally not adopted in the standard CLDR, concern socio-cultural aspects; these are analysed next.

    4.1.1.2 ADVANCED CLDR

    This embraces socio-cultural aspects such as icons, images, symbols, myths, beliefs, geographical entities, custom and tradition, and festivals. Another aspect is that of name representation, where the traditional FIRST NAME, FATHER'S NAME and SURNAME are not pertinent, especially in the South where such features are replaced by patronymics. Similarly, the notation of location, especially in the case of addresses, is not easy to reduce to a single format. Different cultures tend to change the order from ascending to descending, as is the case in Iran and to a certain extent in parts of Bhuj in Gujarat, where the country is placed first and the addressee is placed at the end. Honorifics and terms of respect and address also form an important part of CLDR which, to be totally exhaustive, has to embrace each and every aspect of the culture of the locale and render it transparent to the user.

    Since exact formats in LDML for these different entities are still under development, these have not been

    studied in depth.

    COMPLIANCY

    The different CLDRs that have been analyzed, studied and/or developed (as in the case of Dogri) all follow the Western notational system. They have been implemented by both IBM and Microsoft in their operating systems, and within these limits they can be termed perfectly compliant.

    However, the term compliancy implies a norm or standard, and going by the above discussion, the CLDRs do not reflect in any manner the cultural and ethnic thought processes of the languages under study. A closer look at this problem, which is a vital one especially in the area of e-governance localization, is a must.

    In what follows, two sample CLDRs, one for Dogri and the other for Urdu, are presented. Both are in the traditional framework, although the Urdu CLDR, like the Arabic one, proposes a dual system: Gregorian and Muslim. The results of the study undertaken on Num2Word conversion are also provided for all the languages under study, as a tabular representation of the basic numerals from one to the highest acceptable number.

    Further details can be found at: http://cldr.unicode.org/


    5 Frequently Used Entries for Localization (FUEL)

    FUEL is an open source initiative to standardize terms for open source software programs. It aims to resolve the problem of inconsistent and non-standardized terminology in computer software translation across various platforms. It also works to provide a standard and consistent terminology for a language. Support for the following Indian languages has been added in this initiative.

    Languages with FUEL [31]:

    Assamese (as)

    Bengali (India) (bn_IN)

    Bhojpuri (bho)

    Chhattisgarhi (hne)

    Gujarati (gu)

    Hindi (hi)

    Kazakh (kaz)

    Magahi (mag)

    Maithili (mai)

    Malayalam (ml)

    Marathi (mr)

    Punjabi (pa)

    Oriya (or)

    Tamil (ta)

    Telugu (te)

    Urdu (ur)

    Kannada (kn)


    6 Generic Guidelines

    i. Use Unicode as your character encoding to represent text.

    ii. It is recommended to use the latest version of Unicode and Unicode-compliant OpenType fonts during design and deployment of Indian language applications (Refer: http://unicode.org).

    iii. It is recommended to use the Enhanced INSCRIPT keyboard for input:

        i. The drawback of phonetic / transliteration-based layouts is that they are useful only for users who know English.

        ii. The drawback of typewriter layouts is that they are preferred only by operators migrating from physical typewriters.

        iii. INSCRIPT is easy to learn and use, especially for non-English speakers.

        iv. Major operating systems such as MS-Windows and Linux support INSCRIPT by default.

        v. It also caters to the latest additions such as the Rupee symbol.

        vi. Refer: http://cdac.in/downloads

    iv. Isolate all user interface elements from the program source code. Move all localisable resources to separate resource-only DLLs. Localisable resources include user interface elements such as strings, error messages, dialog boxes, menus, and embedded object resources. Resources which are not localisable should not be put into the resource-only DLLs.

    Refer: http://msdn.microsoft.com/en-us/library/w7x1y988.aspx

    v. Use the same resource identifiers throughout the life of the project. Changing identifiers

    makes it difficult to update localised resources from one build to another. Refer:

    http://www.lingobit.com/solutions/bestpractices.html

    vi. Allow plenty of room for the length of strings to expand in the user interface. In some languages, phrases can require 50-75 percent more space than they need in other languages. For example, dialog boxes may expand when localised: a large dialog box that occupies the entire screen in low-resolution mode may need to be resized to an unusable size.

    Refer: http://msdn.microsoft.com/en-us/library/w7x1y988.aspx

    vii. UI controls such as buttons or drop-down lists should not be placed on top of other controls. Sizing and hotkey issues with hidden controls are usually found only through testing, which might not be done during localization. In this case, the UI is not localisable because the button size cannot be extended to the length required for the translation without rearranging the button positions. Rearranging button positions can be costly and makes the UI inconsistent among languages. Refer: http://msdn.microsoft.com/en-us/library/aa163857(v=office.10).aspx

    viii. Use the decimal separator appropriate to the locale; it varies from locale to locale.

    ix. Avoid images and banners that contain text: for an Indian language version they need to be translated as well, and a language switch cannot handle text inside images.

    x. For desktop applications the icon, icon text and title should be in the local language.

    For applications requiring RTL support, it is recommended that the HTML 'dir' attribute be specified with 'rtl' as its value on the elements containing right-to-left text (for example, text in Persian).

    xi. Surface localization of an application is useful where the end user only consumes information, and processing / computing with Indian language data is not critical.

        i. An example of surface localization is the printing of statements of accounts, bills or passbooks.


    xii. Use clear, concise, and grammatically correct language. Ambiguous words, obtuse or highly

    technical sentences, and grammatical mistakes increase translation time and costs.

    xiii. When using abbreviations and acronyms, ensure that the abbreviations and acronyms have

    meanings that are understood by most users. You should always define abbreviations and

    acronyms that might not be obvious in all languages.

    Refer: http://msdn.microsoft.com/en-us/library/aa163854(v=office.10).aspx

    xiv. Avoid using images and icons that contain text in your application. They are expensive to

    localise.

    xv. Avoid strings that contain a preposition and a variable. They are difficult to localise because, in some languages, different prepositions are used in different contexts.

    Refer: http://msdn.microsoft.com/en-us/library/aa163852(v=office.10).aspx

    Example code        Better code

    At %s               Time: %s

    At %s               Date: %s

    At %s               Location: %s

    xvi. Automated translation tools can significantly cut down on localization vendors' costs. But automatic translation tools only work if standard phrases are being used. Many localization vendors are paid per word. Consider the amount of money that can be saved if one standard phrase can be easily, or automatically, translated into multiple languages. For example, the following messages could be standardized into one consistent message:

    Message                                  Standardized version of message

    Not enough memory                        There is not enough memory available.

    There is not enough memory available     There is not enough memory available.

    Insufficient Memory!                     There is not enough memory available.

    Refer: http://msdn.microsoft.com/en-us/library/aa163854(v=office.10).aspx

    xvii. Different languages often have different punctuation and spacing rules. Consider these differences when writing strings in code. For example, if a string containing a decimal point is constructed at run time, the localiser cannot change the point to a comma. For similar reasons, apply these considerations to numbers, dates, or any other information that might have different formats in other languages. Refer: http://msdn.microsoft.com/en-us/library/aa163854(v=office.10).aspx
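As an illustration of not hard-coding the decimal point, the following C sketch takes the separator from the current C locale through the standard localeconv function. The function name format_hundredths is hypothetical and the example covers only non-negative amounts.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Formats an amount given in hundredths (1234 -> "12.34" in the "C" locale)
   using the locale's decimal separator instead of a hard-coded point.
   Returns the snprintf result: the length needed, or a negative value on error. */
int format_hundredths(char *out, size_t out_size, long hundredths)
{
    const struct lconv *conv = localeconv();

    assert(out != NULL && hundredths >= 0);
    return snprintf(out, out_size, "%ld%s%02ld",
                    hundredths / 100, conv->decimal_point, hundredths % 100);
}
```

A program starts in the "C" locale, where the separator is a point; calling setlocale(LC_NUMERIC, "") beforehand makes localeconv reflect the user's environment.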

    xviii. Use of numerals / digits

        i. For applications involving number crunching, such as banking applications, billing applications and statistics, it is recommended that English digits (0-9) be used, unless mandated otherwise by the agency commissioning the project. This is because most programming environments, spreadsheets and databases do not support computing with Indian digits, which are treated as characters rather than numerals.

        ii. For applications such as digital data preservation, Indian digits may be considered.
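Where Indian digits do have to be accepted as input, they can be mapped to numeric values before any computation. The following C sketch does this for Devanagari digits only (U+0966 to U+096F); other scripts have their own digit ranges. The function name parse_digits is hypothetical, and the sketch deliberately omits overflow checking.

```c
#include <assert.h>
#include <stddef.h>

#define DEVANAGARI_DIGIT_ZERO 0x0966  /* U+0966 DEVANAGARI DIGIT ZERO */
#define DEVANAGARI_DIGIT_NINE 0x096F  /* U+096F DEVANAGARI DIGIT NINE */

/* Converts a string of ASCII and/or Devanagari digits to a long.
   Returns -1 if the string is empty or contains any other character.
   No overflow check is performed in this sketch. */
long parse_digits(const wchar_t *text)
{
    long value = 0;
    size_t i;

    assert(text != NULL);
    if (text[0] == L'\0')
        return -1;

    for (i = 0; text[i] != L'\0'; i++) {
        wchar_t c = text[i];
        int digit;
        if (c >= L'0' && c <= L'9')
            digit = (int)(c - L'0');
        else if (c >= DEVANAGARI_DIGIT_ZERO && c <= DEVANAGARI_DIGIT_NINE)
            digit = (int)(c - DEVANAGARI_DIGIT_ZERO);
        else
            return -1;
        value = value * 10 + digit;
    }
    return value;
}
```

Storing the parsed value as a number and converting back to Indian digits only for display keeps arithmetic, sorting and database queries working as usual.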


    xix. Avoid using compounded variables

    In the following example, to translate the preposition "on" correctly, you might have to ask the developer what the variables stand for.

    Refer: http://msdn.microsoft.com/en-us/library/aa163852(v=office.10).aspx

    Example string: %I:%M%p on %A, %B %d, %Y

    Explanation from developer:

    %A    Full weekday name

    %B    Full month name

    %d    Day of month as decimal number (01 - 31)

    %I    Hour in 12-hour format (01 - 12)

    %M    Minute as decimal number (00 - 59)

    %p    Current locale's A.M./P.M. indicator for 12-hour clock

    %Y    Year with century, as decimal number

    xx. Use the Microsoft-specified resource file naming convention.

    xxi. Use unique variable names

    If the same variable name is used for different variables, so that the sequence of the variables is hard-coded, the word order in the translated sentence might be wrong, because word order differs from language to language. Numbered variables let the translator reorder them.

    Example code                    Better code

    Set created on %s at %s         Set created on %1 at %2

    Backup of %s on %s at %s        Backup of %1 on %2 at %3

    Printing %s of %s on %s         Printing %1 of %2 on %3
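Standard C printf has no numbered placeholders (the Win32 FormatMessage function and POSIX printf with %1$s provide them), so the idea can be sketched with a small substitution routine. The following C function is a hypothetical illustration, not an API from any of the referenced libraries; it supports %1 to %9 only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands %1..%9 in `pattern` with args[0]..args[8]; every other character
   is copied unchanged. Returns 0 on success, -1 if `out` is too small or a
   placeholder refers to an argument that was not supplied. */
int format_positional(char *out, size_t out_size, const char *pattern,
                      const char *const args[], size_t arg_count)
{
    size_t pos = 0;

    assert(out != NULL && pattern != NULL);
    while (*pattern != '\0') {
        if (pattern[0] == '%' && pattern[1] >= '1' && pattern[1] <= '9') {
            size_t index = (size_t)(pattern[1] - '1');
            size_t len;
            if (index >= arg_count)
                return -1;
            len = strlen(args[index]);
            if (pos + len >= out_size)   /* keep room for the terminator */
                return -1;
            memcpy(out + pos, args[index], len);
            pos += len;
            pattern += 2;
        } else {
            if (pos + 1 >= out_size)
                return -1;
            out[pos++] = *pattern++;
        }
    }
    if (pos >= out_size)
        return -1;
    out[pos] = '\0';
    return 0;
}
```

Because each placeholder names its argument, a translated pattern such as "%2: set created (%1)" can change the word order without any change to the calling code.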

    xxii. Do not hardcode strings or user interface resources.

    xxiii. Allocate text buffers dynamically since text size may expand when translated.
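One way to follow this guideline in C is to measure the formatted text first and then allocate exactly that much. The sketch below assumes a C99 snprintf, which returns the required length when given a NULL buffer of size 0; the function name format_labelled_value is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Builds "<label>: <value>" in a heap buffer sized to fit the actual text,
   so a longer translated label cannot overflow a fixed-size array.
   Returns NULL on failure; the caller frees the result. */
char *format_labelled_value(const char *label, const char *value)
{
    int needed;
    char *text;

    assert(label != NULL && value != NULL);
    needed = snprintf(NULL, 0, "%s: %s", label, value);  /* measure only */
    if (needed < 0)
        return NULL;
    text = malloc((size_t)needed + 1);
    if (text == NULL)
        return NULL;
    snprintf(text, (size_t)needed + 1, "%s: %s", label, value);
    assert(strlen(text) == (size_t)needed);
    return text;
}
```

The same call works unchanged whether the label is a short English word or a much longer translated phrase.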

    xxiv. Avoid composite strings

    The strings shown in the following table cannot be localized unless the localizer knows what type of object or item the variables stand for. Even then, localization might be difficult, because the value of the variable might require a different syntax; the article "the" has variations in other languages (in German: "der, die, das, dem, den, des", much as English has "a" or "an"); the adjectives might change according to the gender of the word; or other factors may apply. Using composite strings increases the chances of mistranslation. These localization problems can be eliminated by writing out each message as a separate string instead of using variables.

    Refer: http://msdn.microsoft.com/en-us/library/aa163852(v=office.10).aspx

    Examples of composite strings

    Are you sure you want to delete the selected %s?

    %s drive letter or drive path for %s.

    Are you sure you want to delete %s's profile?

    Cannot %s to Removable, CD-ROM or unknown types of drives.

    A %s error has occurred %sing one of the %s sectors on this drive.

    xxv. Unused strings and dialog boxes should be removed from the source, so localisers do not

    waste time localising them.

    xxvi. Avoid using controls within a sentence. You might want to place a UI control within a sentence; for example, you might want to give users a drop-down menu to make a choice within a sentence. This practice is not recommended, because to localise a sentence that includes UI controls, the localiser often has to either change the position of the controls (if possible) or be content with an improper sentence structure. Also, the UI controls are often drop-down combo boxes that are composed of multiple controls. Moving and aligning these can be error-prone.

    xxvii. Test localised applications on all language variants of Windows XP (except Oriya). If your application uses Unicode, as recommended, it should run fine with no modifications. If it uses a Windows code page, you will need to set the culture/locale to the appropriate value for your localised application and reboot before testing. Refer: http://msdn.microsoft.com/en-us/library/aa291552(v=vs.71).aspx


    xxviii. Cultural and Policy Making Issues

    i. Avoid slang expressions, colloquialisms, and obscure phrasing in all text. At best,

    they are difficult to translate; at worst, they are offensive.

    ii. Avoid maps that include controversial regional or national boundaries. They are

    a notorious source of political offense.

    iii. Avoid images in bitmaps and icons that are ethnocentric or offensive in other

    cultures/locales. Refer: http://www.lingobit.com/solutions/bestpractices.html


    6.1 Categorical Classification of Guidelines:

    Sr. No. / Category / Sub Category / Guideline

    1. Translatable Components

    Sub Category: Textual Objects

    Fixed - textual objects which should not get translated, e.g. User Name, Group Name, Password, system or host names, keyboard accelerators.

    Guideline: All fixed text should be kept separate from the resource file and not translated.

    Message - translated information displayed to the user by the product.

    Guidelines:

    All messages should be reviewed for slang terminology, message fragments, and other criteria that make messages difficult to translate.

    All translatables should be separate from the source and placed in a locale-specific location.

    If translated items do not exist for a locale, correct default values should be used.

    Allow plenty of room for the length of strings to expand in the user interface. In some languages, phrases can require 50-75 percent more space than they need in other languages.

    When variables are used for the text strings, extra space (at least one line per variable) should be provided.

    Controls should not be placed within sentences.

    When possible, it is best to put text labels above UI controls such as edit boxes. This positioning allows for the greatest extension of the text field.

    Avoid inline CSS with values.

    Never use absolute positions.

    UI controls such as buttons or drop-down lists should not be placed on top of other controls.

    Text on a button should never be dynamically linked onto the button from a string variable but should be placed on the button itself as a property of the button.

    Other - translated help files, documentation.

    Guideline: Help files and documentation should be available for the different locales.

    Sub Category: Non-Textual

    Icons, Images, Colors, Sounds, etc.

    Guidelines:

    1. These items should be as culturally neutral as possible. Avoid culture-specific examples, showing body parts, gender-specific roles, religious references, political symbols, and text in graphics.

    2. If these items are not culturally neutral, then all non-textual items should reside in locale-specific directories and should be separate from the source product, i.e. the item should not be compiled or hardcoded into the product and should be configurable.

    3. If these items are not culturally neutral and an item has not been translated for a specific locale, the correct default item should appear.

    4. If these items are not culturally neutral and any textual messages appear in icons or images, it should be easy to separate the text from the icon or image, and these messages must be translated separately from the icon.

    5. If these items are not culturally neutral, the tools which created the items should be easily available in the other places where the localization of these items will take place.

    2. Cultural Data

    Sub Category: Time, Date and Calendar - system or user log files or text windows - display of time and date information in files - calendar functions - display of chronologically-sorted data - display of user accesses to the system

    Guideline: Time, date and calendar should be displayed in a locale-specific format.

    Sub Category: Numbers, Currency, and Metrics

    Guidelines:

    Numbers should be displayed in the locale-specific format.

    The currency symbol should be displayed in the locale-specific format.

    Currency strings should be formatted according to locale conventions.

    If the international three-letter currency code is used, the correct one for the locale should be used.

    Sub Category: Collation - collation refers to the order in which characters are sorted.

    Guideline: If the sort order changes between locales, data must be sorted according to the rules of the locale.

    Sub Category: Personal Names, Honorifics, etc.

    Guidelines:

    The formats for personal names, honorifics and salutations should be configurable for each locale.

    The formats for postal addresses and telephone numbers should be configurable for each locale.

    Sub Category: Characters and Strings

    Guideline: The application should handle Unicode and convert correctly between Unicode and the native encoding of the platform, i.e. non-ASCII filenames and pathnames should be handled correctly.

    Sub Category: Filesystem I/O

    Guideline: Can non-ASCII characters be saved to and loaded from the file system correctly, even without a proper byte order marker?

    3. Text (Writing System) Foundation

    Sub Category: Transfer Encoding - wherever data is shared between applications, such as mail, Internet browsing, etc., or wherever data is displayed.

    Guidelines:

    If the application transfers data through protocols or networks that strip the 8th bit, the application should encode, decode, and display the data correctly.

    If the product includes a help browser or web browser:

    1. The browser should load files or pages in different encodings.

    2. The browser should be able to properly display files or pages with multibyte text.

    3. Once a file is loaded, is it possible to display it in another encoding?

    4. Bookmarks and menus should show filenames with multibyte text correctly.

    Sub Category: Input

    Guidelines:

    The application should correctly accept, parse and display non-ASCII input in all places where it is appropriate to enter non-ASCII text.

    As non-ASCII text is being entered, the backspace key should correctly delete the non-ASCII text.

    The product should have a mechanism to input the characters used for various languages.

    Shortcut key combinations should be accessible on international keyboards.

    Sub Category: Output - labels, menu items, text areas, canvases, HTML pages, etc.

    Guidelines:

    All areas of the application should display all characters within the current locale.

    Your application should use the default fonts, or if explicit fonts are named, the fonts should be stored in an external resource file that can be configured for each locale.

    Characters from the users' native languages should print correctly.

    The application should work correctly on localized editions of the operating systems that the product supports.


    7 Migration to Unicode

    According to Microsoft, creating a new program based on Unicode is fairly easy. Unicode has a few features that require special handling, but you can isolate these in your code. Converting an existing program that uses code-page encoding to one that uses Unicode or generic declarations is also straightforward. Here are the steps to follow:

    1. Modify your code to use generic data types. Determine which variables declared as char or char* are text, and not pointers to buffers or binary byte arrays. Change these types to TCHAR and TCHAR*, as defined in the Win32 file WINDOWS.H, or to _TCHAR as defined in the Visual C++ file TCHAR.H. Replace instances of LPSTR and LPCH with LPTSTR and LPTCH. Make sure to check all local variables and return types. Using generic data types is a good transition strategy because you can compile both ANSI and Unicode versions of your program without sacrificing the readability of the code. Don't use generic data types, however, for data that will always be Unicode or always stays in a given code page. For example, one of the string parameters to MultiByteToWideChar and WideCharToMultiByte should always be a code page-based data type, and the other should always be a Unicode data type.

    2. Modify your code to use generic function prototypes. For example, use the C run-time call _tcslen instead of strlen, and use the Win32 API SetWindowText instead of SetWindowTextA. This rule applies to all APIs and C functions that handle text arguments.

    3. Surround any character or string literal with the TEXT macro. The TEXT macro conditionally places an "L" in front of a character literal or a string literal definition. Be careful with escape sequences. For example, the Win32 resource compiler interprets L\" as an escape sequence specifying a 16-bit Unicode double-quote character, not as the beginning of a Unicode string.

    4. Create generic versions of your data structures. Type definitions for string or character fields in structures should resolve correctly based on the UNICODE compile-time flag. If you write your own string-handling and character-handling functions, or functions that take strings as parameters, create Unicode versions of them and define generic prototypes for them.

    5. Change your build process. When you want to build a Unicode version of your application, both the Win32 compile-time flag -DUNICODE and the C run-time compile-time flag -D_UNICODE must be defined.

    6. Adjust pointer arithmetic. Subtracting char* values yields an answer in terms of bytes; subtracting wchar_t* values yields an answer in terms of 16-bit chunks. When determining the number of bytes (for example, when allocating memory for a string), multiply the length of the string in symbols by sizeof(TCHAR). When determining the number of characters from the number of bytes, divide by sizeof(TCHAR). You can also create macros for these two operations, if you prefer. C makes sure that the ++ and -- operators increment and decrement by the size of the data type. Or, even better, use the Win32 APIs CharNext and CharPrev.

    7. Add code to support special Unicode characters. These include Unicode characters in the compatibility zone, characters in the Private Use Area, combining characters, and characters with directionality. Other special characters include the noncharacter U+FFFF, which can be used internally as a placeholder, and the byte-order mark U+FEFF, which can serve as a flag indicating that a file is stored in Unicode and appears as the noncharacter U+FFFE when read with the wrong byte order. The byte-order mark is used to indicate whether a text stream is little-endian or big-endian. In plaintext, the line separator U+2028 marks an unconditional end of line. Inserting a paragraph separator, U+2029, between paragraphs makes it easier to lay out text at different line widths.

    8. Debug your port by enabling your compiler's type-checking. Do this with and without the UNICODE flag defined. Some warnings that you might be able to ignore in the code page-based world will cause problems with Unicode. If your original code compiles cleanly with type-checking turned on, it will be easier to port. The warnings will help you make sure that you are not passing the wrong data type to code that expects wide-character data types. Use the Win32 National Language Support API (NLS API) or equivalent C run-time calls to get character typing and sorting information. Don't try to write your own logic for handling locale-specific type checking; your application will end up carrying very large tables!
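The pointer arithmetic in step 6 can be illustrated with portable C, using wchar_t as a stand-in for TCHAR in a Unicode build (an assumption made so the example compiles outside Windows; wchar_t is 16 bits wide on Windows and commonly 32 bits elsewhere). The function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Number of bytes needed to store `text` including its terminator.
   wcslen counts wchar_t units, not bytes, hence the multiplication. */
size_t wide_string_bytes(const wchar_t *text)
{
    assert(text != NULL);
    return (wcslen(text) + 1) * sizeof(wchar_t);
}

/* Number of wchar_t units between two pointers into the same string.
   Pointer subtraction already yields units, not bytes. */
size_t wide_units_between(const wchar_t *start, const wchar_t *end)
{
    assert(start != NULL && end >= start);
    return (size_t)(end - start);
}

/* Returns a heap copy of `text`, or NULL on allocation failure. */
wchar_t *wide_duplicate(const wchar_t *text)
{
    size_t bytes = wide_string_bytes(text);
    wchar_t *copy = malloc(bytes);

    if (copy != NULL)
        wmemcpy(copy, text, bytes / sizeof(wchar_t));
    return copy;
}
```

In generic-text code the same pattern would be written with TCHAR, _tcslen and sizeof(TCHAR), so that it compiles for both ANSI and Unicode builds.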

    7.1.1 Compiling Unicode Applications in Visual C++

    By using the generic data types and function prototypes, you have the liberty of creating a non-Unicode application or compiling your software as Unicode. To compile an application as Unicode in Visual C/C++, go to Project/Settings/C/C++ /General, and include UNICODE and _UNICODE in Preprocessor Definitions. The UNICODE flag is the preprocessor definition for all Win32 APIs and data types, and _UNICODE is the preprocessor definition for C run-time functions.


    8 Data Encoding and Byte Order Markers

    Consider using locale-based routines and further internationalization.

    For Windows 95, 98 and ME, consider using the Microsoft MSLU (Microsoft Layer for Unicode)

    Consider string compares and sorting, Unicode Collation Algorithm

    Consider Unicode Normalization

    Consider Character Folding

    Unicode BOM Encoding Values

    Encoding Form                              BOM Encoding

    UTF-8                                      EF BB BF

    UTF-16 (big-endian)                        FE FF

    UTF-16 (little-endian)                     FF FE

    UTF-16BE, UTF-32BE (big-endian)            No BOM!

    UTF-16LE, UTF-32LE (little-endian)         No BOM!

    UTF-32 (big-endian)                        00 00 FE FF

    UTF-32 (little-endian)                     FF FE 00 00

    SCSU (compression)                         0E FE FF

    The Byte Order Marker (BOM) is Unicode character U+FEFF. (It can also represent a Zero Width No-break Space.) The code point U+FFFE is a noncharacter in Unicode, and should never appear in a Unicode character stream. Therefore the BOM can be used as the first character of a file (or more generally a string), as an indicator of endian-ness. With UTF-16, if the first 16-bit unit is read as 0xFEFF then the text has the same endian-ness as the machine reading it. If it is read as 0xFFFE, then the endian-ness is reversed and all 16-bit words should be byte-swapped as they are read in. In the same way, the BOM indicates the endian-ness of text encoded with UTF-32.

    Note, however, that not all files start with a BOM. In fact, the Unicode Standard says that UTF-16 or UTF-32 text that does not begin with a BOM must be interpreted in big-endian form, unless a higher-level protocol specifies otherwise.

    The character U+FEFF also serves as an encoding signature for the Unicode Encoding Forms. The table shows the encoding of U+FEFF in each of the Unicode encoding forms. Note that by definition, text labeled as UTF-16BE, UTF-32BE, UTF-32LE or UTF-16LE should not have a BOM. The endian-ness is indicated in the label.

    For text that is compressed with the SCSU (Standard Compression Scheme for Unicode) algorithm, there is also a recommended signature.
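The BOM values in the table can be checked against the first bytes of a file. The following C sketch shows one way to do so; the type BomSignature and the function detect_bom are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;        /* encoding form signalled by the BOM */
    unsigned char bytes[4];  /* BOM byte sequence */
    size_t length;           /* number of significant bytes */
} BomSignature;

/* Longer signatures come first: FF FE 00 00 (UTF-32 little-endian) starts
   with the same two bytes as FF FE (UTF-16 little-endian). */
static const BomSignature SIGNATURES[] = {
    { "UTF-32 (big-endian)",    { 0x00, 0x00, 0xFE, 0xFF }, 4 },
    { "UTF-32 (little-endian)", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
    { "UTF-8",                  { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
    { "SCSU",                   { 0x0E, 0xFE, 0xFF, 0x00 }, 3 },
    { "UTF-16 (big-endian)",    { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
    { "UTF-16 (little-endian)", { 0xFF, 0xFE, 0x00, 0x00 }, 2 }
};

/* Returns the name of the encoding whose BOM starts `data`, or NULL when
   no known BOM is present. */
const char *detect_bom(const unsigned char *data, size_t size)
{
    size_t i;

    assert(data != NULL || size == 0);
    for (i = 0; i < sizeof SIGNATURES / sizeof SIGNATURES[0]; i++) {
        const BomSignature *sig = &SIGNATURES[i];
        if (size >= sig->length && memcmp(data, sig->bytes, sig->length) == 0)
            return sig->name;
    }
    return NULL;
}
```

One known ambiguity remains: UTF-16 little-endian text whose first character after the BOM is U+0000 begins with the same four bytes as the UTF-32 little-endian BOM, so a result of NULL or UTF-32 should be treated as a hint rather than proof.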

    Constant and Global Variables

    ANSI Wide TCHAR

    EOF WEOF _TEOF

    _environ _wenviron _tenviron

    _pgmptr _wpgmptr _tpgmptr

    Data Types

    ANSI Wide TCHAR

    char wchar_t _TCHAR

    _finddata_t _wfinddata_t _tfinddata_t

    __finddata64_t __wfinddata64_t _tfinddata64_t

    _finddatai64_t _wfinddatai64_t _tfinddatai64_t

    int wint_t _TINT

    signed char wchar_t _TSCHAR

    unsigned char wchar_t _TUCHAR

    char wchar_t _TXCHAR

    L _T or _TEXT

    LPSTR (char *)    LPWSTR (wchar_t *)    LPTSTR (_TCHAR *)

    LPCSTR (const char *)    LPCWSTR (const wchar_t *)    LPCTSTR (const _TCHAR *)

    LPOLESTR (For OLE)    LPWSTR    LPTSTR

    For further details regarding data types and functions please refer: http://www.i18nguy.com/unicode/c-unicode.html

    Most string operations for Unicode can be coded with the same logic used for handling the Windows character set. The difference is that the basic unit of operation is a 16-bit quantity instead of an 8-bit one. The header files provide a number of type definitions that make it easy to create sources that can be compiled for Unicode or the Windows character set.

    For 8-bit (ANSI) and double-byte characters:

    typedef char CHAR;             // 8-bit character

    typedef char *LPSTR;           // pointer to 8-bit string

    For Unicode (wide) characters:

    typedef unsigned short WCHAR;  // 16-bit character

    typedef WCHAR *LPWSTR;         // pointer to 16-bit string

    The figure below shows the method by which the Win32 header files define three sets of types:

    One set of generic type definitions (TCHAR, LPTSTR), which depend on the state of the _UNICODE manifest constant.

    Two sets of explicit type definitions (one set for those that are based on code pages or ANSI and one set for Unicode).

    With generic declarations, it is possible to maintain a single set of source files and compile them for either Unicode or ANSI support.

    Figure 4: WCHAR, a new data type (source Microsoft/MSDN).

    8.1 Tips and Tricks for C/C++/VC++

    The following small tips are useful during coding:

    (1) Using CString in a string copy function

    // dest is a wchar_t buffer large enough for the text; Cstring is a CString.
    // The cast yields const wchar_t* only in Unicode builds (_UNICODE defined).
    wcscpy(dest, (LPCTSTR)Cstring);

    (2) Converting an integer to a string

    char buffer[20];
    int i = 3445;
    _itoa(i, buffer, 10);   // radix 10
    printf("String of integer %d (radix 10): %s\n", i, buffer);

    (3) Finding the size of a file

    FILE *fpRead;
    long lFileSize;
    if ((fpRead = _tfopen(_T("c:\\test.txt"), _T("rb"))) == NULL)
        MessageBox(_T("Cannot Open"), NULL, MB_OK);
    else {
        fseek(fpRead, 0, SEEK_END);   // move to the end of the file
        lFileSize = ftell(fpRead);    // offset at the end = size in bytes
        fclose(fpRead);
    }

    8.2 Encodings in Web Pages

    Generally speaking, there are four different ways of setting the character set or the encoding of a Web page:

    1. Code pages (charsets). With this approach, you can select from the list of supported code pages to create your Web content. The downside is that you are limited to the languages included in the selected character set, which makes true multilingual Web content impossible and limits you to a single-script Web page.

    2. Number entities. Number entities can be used to represent a few symbols outside the currently selected code page or encoding. Suppose, for example, that you have created a Web page using the previous approach with the Latin ISO charset 8859-1, and you now want to display some Greek characters in a mathematical equation; Greek characters, however, are not part of the Latin code page. Take, for instance, the Greek character Φ, which has the Unicode code point U+03A6. Using the decimal number entity of this code point, preceded by &# and followed by a semicolon, the source is written as: This is my text with a Greek Phi: &#934; and the output would be: This is my text with a Greek Phi: Φ. Unfortunately, this approach makes it impossible to compose large amounts of text and makes editing your Web content very hard.

    3. UTF-16. Unlike Win32 applications, where UTF-16 is by far the best approach, for Web content UTF-16 can be used safely only on Windows NT networks that have full Unicode support. Therefore, it is not a suggested encoding for Internet sites, where the capabilities of the client Web browser as well as the Unicode support of the network are not known.

    4. UTF-8. This Unicode encoding is the best and safest approach for multilingual Web pages. It allows you to encode the whole repertoire of Unicode characters. Also, all versions of Internet Explorer 4 and later as well as Netscape 4 and later support this encoding, which is not restricted by network or wire capabilities. The UTF-8 encoding allows you to create multilingual Web content without having to change the encoding based on the target language.
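    The difference between a single code page, number entities and UTF-8 can be sketched as follows. This is a minimal illustration in Python, not a production tool; the helper names to_numeric_entity and fits are hypothetical. It builds the decimal number entity for the Greek Phi, shows that the raw character cannot be stored in ISO 8859-1 while the entity can, and shows that UTF-8 accepts mixed-script text directly.

```python
import html


def to_numeric_entity(ch):
    """Return the decimal number entity for a single character."""
    return "&#{};".format(ord(ch))


def fits(text, encoding):
    """Return True if text can be stored in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


PHI = "\u03A6"  # GREEK CAPITAL LETTER PHI
SOURCE = "This is my text with a Greek Phi: " + to_numeric_entity(PHI)
# Mixed Latin, Greek and Devanagari text in one string.
MULTILINGUAL = "Phi \u03A6, Hindi \u0939\u093f\u0928\u094d\u0926\u0940"

if __name__ == "__main__":
    print(SOURCE)                            # page source with the entity
    print(html.unescape(SOURCE))             # what a browser would display
    print(fits(PHI, "iso-8859-1"))           # raw Phi in a Latin code page
    print(fits(SOURCE, "iso-8859-1"))        # entity form in a Latin code page
    print(fits(MULTILINGUAL, "utf-8"))       # mixed scripts in UTF-8
```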

    Figure: Example of a multilingual Web page encoded in UTF-8 (Source: Microsoft).

    8.2.1 Setting and Manipulating Encodings

    Since Web content is currently based on Windows or other encoding schemes, you'll need to know how to set and manipulate encodings. The following describes how to do this for HTML pages, Active Server Pages (ASP), and XML pages.

    8.2.2 HTML pages

    Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document, and falls back on the user's preferences if no meta element is specified.

    To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after the opening head tag, so that all browsers can read the meta element before the rest of the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames. Here is how it works:
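    As a minimal sketch of the order of precedence described above (HTTP content type first, then the meta element, then the user's preference), consider the following Python code. It is a simplification for illustration and not Internet Explorer's actual algorithm; the function names and the default value windows-1252 are assumptions.

```python
import re

# Finds charset=... inside a meta element near the start of the page.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'")
    return None


def charset_from_meta(html_bytes):
    """Return the charset named in a meta element, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


def pick_charset(content_type, html_bytes, user_default="windows-1252"):
    """Apply the precedence: HTTP header, then meta element, then default."""
    return (charset_from_header(content_type)
            or charset_from_meta(html_bytes)
            or user_default)


PAGE = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head><body></body></html>')

if __name__ == "__main__":
    print(pick_charset("text/html; charset=ISO-8859-1", PAGE))
    print(pick_charset("text/html", PAGE))
    print(pick_charset(None, b"<html></html>"))
```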

    8.2.3 ASP.NET

    1. Explicitly set the CurrentUICulture and CurrentCulture properties in your application. Do not rely on defaults.

    2. Note that ASP.NET applications are managed applications and therefore can use the same classes as other managed applications for retrieving, displaying, and manipulating information based on culture.

    3. Be aware that you can specify the following three types of encodings in ASP.NET:

       requestEncoding specifies the encoding received from the client's browser.

       responseEncoding specifies the encoding to send to the client browser. In most situations, this encoding should be the same as that specified for requestEncoding.

       fileEncoding specifies the default encoding for .aspx, .asmx, and .asax file parsing.

    4. Specify the values for the requestEncoding, responseEncoding, fileEncoding, culture, and uiCulture attributes in the following three places in an ASP.NET application:

       In the globalization section of a Web.config file. This file is external to the ASP.NET application. For more information, see the <globalization> element.

       In a page directive. Note that by the time a page directive is processed, the file has already been read. Therefore, it is too late to specify fileEncoding and requestEncoding. Only uiCulture, Culture, and responseEncoding can be specified in a page directive.

       Programmatically in application code. This setting can vary per request. As with a page directive, by the time the application's code is reached it is too late to specify fileEncoding and requestEncoding. Only uiCulture, Culture, and responseEncoding can be specified in application code.

    5. Note that the uiCulture value can be set to the browser accept language.
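    What "setting uiCulture to the browser accept language" amounts to can be sketched independently of ASP.NET. The following Python code is a hedged illustration only: it does not use any ASP.NET API, and the function names, the supported-culture list and the en-US fallback are assumptions. It ranks the tags in an Accept-Language header by their q value and picks the first one the application supports.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality <= 0:
            continue  # q=0 means "not acceptable"
        ranked.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranked)]


def choose_ui_culture(header, supported, fallback="en-US"):
    """Pick the first supported culture, matching exactly or by language."""
    by_lower = {name.lower(): name for name in supported}
    for tag in parse_accept_language(header):
        lower = tag.lower()
        if lower in by_lower:
            return by_lower[lower]
        primary = lower.split("-")[0]
        for key, name in by_lower.items():
            if key.split("-")[0] == primary:
                return name
    return fallback


if __name__ == "__main__":
    cultures = ["en-US", "hi-IN", "ta-IN"]
    print(choose_ui_culture("hi-IN,hi;q=0.9,en;q=0.5", cultures))
    print(choose_ui_culture("ta;q=0.8, fr;q=0.9", cultures))
    print(choose_ui_culture("fr", cultures))
```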

    8.2.4 XML pages

    All XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 (the default encoding) and UTF-16. The Microsoft XML Parser (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UTF-16 character encoding.

    The encoding declaration identifies which encoding is used to represent the characters in the document. Although XML parsers can determine automatically if a document uses the UTF-8 or UTF-16 Unicode encoding, this declaration should be used in documents that support other encodings.

    For example, the following is the encoding declaration for a document that uses the ISO 8859-1 encoding (Latin 1):

    <?xml version="1.0" encoding="ISO-8859-1"?>
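    The effect of the encoding declaration can be checked with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree module (not MSXML, which this section describes); the helper name parse_text is hypothetical. The same ISO 8859-1 bytes parse correctly when the declaration is present and are rejected when it is missing, because the parser then assumes UTF-8.

```python
import xml.etree.ElementTree as ET


def parse_text(data):
    """Parse XML bytes; return the root element's text, or None on error."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# "cafe" with e-acute, stored as ISO 8859-1 bytes with a declaration.
DECLARED = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<name>caf\u00e9</name>').encode("iso-8859-1")
# The same bytes without a declaration: not valid UTF-8.
UNDECLARED = "<name>caf\u00e9</name>".encode("iso-8859-1")
# UTF-8 needs no declaration because it is the default encoding.
DEFAULT_UTF8 = "<name>caf\u00e9</name>".encode("utf-8")

if __name__ == "__main__":
    for label, data in (("declared", DECLARED),
                        ("undeclared", UNDECLARED),
                        ("utf-8", DEFAULT_UTF8)):
        print(label, repr(parse_text(data)))
```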

    8.2.5 Java

    Application programming interfaces can be used to access localized messages, that is, translated versions of the original text. These messages need to be accessed by applications at runtime to ensure that the correct locale-specific messages are used. Special APIs must be used to retrieve this text. For example, there are two distinct APIs for message handling on Solaris: the catgets() family, which is used with .msg files, and the gettext() family, which is used with .po files. In Java, resource bundles can be used. These are basically Java classes, that is, .java files. The getString() API is used to retrieve localized text from the resource bundles.
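    The idea behind locale-specific message lookup can be sketched without any particular API. The following Python code is an illustration only and is not the Java ResourceBundle class; the MESSAGES table, its keys and the function names are hypothetical. It looks a key up in the most specific bundle first and falls back to more general ones, ending with the untranslated base bundle.

```python
# Hypothetical message bundles keyed by locale name; "" is the base bundle.
MESSAGES = {
    "": {"greeting": "Hello", "farewell": "Goodbye"},
    "hi": {"greeting": "\u0928\u092e\u0938\u094d\u0924\u0947"},
    "hi_IN": {},
}


def candidate_locales(locale):
    """Return bundle names from most to least specific, ending with base."""
    parts = locale.split("_")
    names = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return names + [""]


def get_string(key, locale):
    """Return the message for key in the most specific bundle that has it."""
    for name in candidate_locales(locale):
        bundle = MESSAGES.get(name, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)


if __name__ == "__main__":
    print(candidate_locales("hi_IN"))
    print(get_string("greeting", "hi_IN"))   # found in the "hi" bundle
    print(get_string("farewell", "hi_IN"))   # falls back to the base bundle
```

    Keeping translated text in such bundles, rather than in the program code, is what allows a new language to be added without recompiling the application logic.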

    9 Directionality

    Using the Unicode bidirectional algorithm, one can switch between right-to-left and left-to-right scripts. Three languages in India, viz. Urdu, Sindhi and Kashmiri, are written in the right-to-left direction. The direction can be set using the dir attribute:

    dir = LTR | RTL

    Other useful terminology for the same purpose is listed below:

    LRE: A short name for the Unicode character U+202A LEFT-TO-RIGHT EMBEDDING. This invisible control character is used to begin a range of text with an embedded base direction of left-to-right.

    LRO: A short name for the Unicode character U+202D LEFT-TO-RIGHT OVERRIDE. This invisible control character is used to begin a range of text that ignores the Unicode bidirectional algorithm and arranges characters from left to right.

    PDF: A short name for the Unicode character U+202C POP DIRECTIONAL FORMATTING. This invisible control character is used to signal the end of a range of text that was started with one of the RLE, LRE, RLO or LRO characters.

    RLE: A short name for the Unicode character U+202B RIGHT-TO-LEFT EMBEDDING. This invisible control character is used to begin a range of text with an embedded base direction of right-to-left.

    RLO: A short name for the Unicode character U+202E RIGHT-TO-LEFT OVERRIDE. This invisible control character is used to begin a range of text that ignores the Unicode bidirectional algorithm and arranges characters from right to left.
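    The control characters listed above can be handled as ordinary string constants. The following Python sketch is a minimal illustration; the function names embed and base_direction are hypothetical, and base_direction is only a "first strong character" heuristic, not a full implementation of the Unicode bidirectional algorithm. It wraps a run of text in an embedding and guesses the value to use for the dir attribute.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE


def embed(text, direction):
    """Wrap text in an embedding of the given direction, closed by PDF."""
    opener = RLE if direction == "rtl" else LRE
    return opener + text + PDF


def base_direction(text):
    """Guess 'ltr' or 'rtl' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"  # assumed default when no strong character is found


if __name__ == "__main__":
    urdu = "\u0627\u0631\u062f\u0648"  # the word "Urdu" in Arabic script
    print(base_direction(urdu), base_direction("Hindi"))
    print([unicodedata.name(ch) for ch in embed("abc", "rtl") if ch in (RLE, PDF)])
```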

    10 Cascading Style Sheets (CSS)

    10.1 How to Use @font-face

    With increasing browser capabilities, good typography on the Web is no longer a challenge, and @font-face can be used to achieve it. Using @font-face is simple: it requires a few lines of CSS, after which the font family is declared like any other font on the Web.

    The following code snippet is taken from: http://boldperspective.com/2011/how-to-use-css-font-face/

    body { font-family: web-font, fallback-fonts; }
    strong { font-family: web-font-bold; }
    em { font-family: web-font-italic; }

    /* Regular face */
    @font-face {
      font-family: 'web-font';
      src: url('web-font.eot?') format('eot'),
           url('web-font.woff') format('woff'),
           url('web-font.ttf') format('truetype'),
           url('web-font.svg') format('svg');
      font-weight: normal;
      font-style: normal;
    }

    /* Bold face: the family name and the font files must match */
    @font-face {
      font-family: 'web-font-bold';
      src: url('web-font-bold.eot?') format('eot'),
           url('web-font-bold.woff') format('woff'),
           url('web-font-bold.ttf') format('truetype'),
           url('web-font-bold.svg') format('svg');
      font-weight: bold;
      font-style: normal;
    }

    /* Italic face */
    @font-face {
      font-family: 'web-font-italic';
      src: url('web-font-italic.eot?') format('eot'),
           url('web-font-italic.woff') format('woff'),
           url('web-font-italic.ttf') format('truetype'),
           url('web-font-italic.svg') format('svg');
      font-weight: normal;
      font-style: italic;
    }

    10.2 CSS rendering Issues in Indic Script

    There are many rendering issues associated with the application of CSS to an Indic script. A few of them are listed below:

    1. Drop cap or capitalization of the first letter. The figure below shows syllable formation in Devanagari script. How should one apply a drop cap in this scenario?

    2. If the underline style is applied to words in Indic scripts such as Devanagari, the matras are not displayed properly.

    3. If strikethrough is applied to text in two or more Indic scripts, the rendering becomes inappropriate.

    4. Bullets in Indic scripts are not supported.

    5. Vertical alignment of characters is also a big challenge in Indic scripts.
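    The drop cap problem in item 1 comes from the fact that the "first letter" of a Devanagari word is a syllable made of several code points. The following Python sketch illustrates this under stated assumptions: first_syllable is a hypothetical helper that uses a simplified rule (keep combining marks, and keep the consonant that follows a virama); it is an approximation, not full Unicode text segmentation and not what any browser implements.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant), joins consonants


def first_syllable(text):
    """Return an approximation of the first Devanagari syllable of text."""
    if not text:
        return ""
    end = 1
    while end < len(text):
        ch = text[end]
        if unicodedata.category(ch) in ("Mn", "Mc"):
            end += 1          # matra, virama or other combining mark
        elif text[end - 1] == VIRAMA:
            end += 1          # consonant joined to the previous one
        else:
            break
    return text[:end]


if __name__ == "__main__":
    word = "\u0915\u094d\u0937\u093f\u0924\u093f"  # KA, virama, SSA, I, TA, I
    syllable = first_syllable(word)
    # A drop cap that styles only word[0] would split this syllable.
    print(len(word[0]), len(syllable))
    print([unicodedata.name(ch) for ch in syllable])
```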

    11 SQL Server 2005 and International Data: Using Unicode

    1. Use the nchar, nvarchar and ntext data types to store Indic/Unicode data.

    2. Precede every Unicode string constant with the prefix N (capital N; the prefix is case sensitive),

    e.g.

    SELECT * FROM TeluguDictionary WHERE (Telugu like N'%%')

    INSERT INTO TeluguDictionary VALUES ('akkadi',N'')

    3. Note that in Indian languages several words have multiple correct spellings and alternate representation forms, so the Unicode data requires normalization.

    e.g.

    4. Also note that Indian language (IL) numerals are not mapped to English numerals, so an MS-SQL query that uses IL numerals will give a different result from the same query that uses English numerals,

    e.g. select * from trains_table where train_no = '5312';

    select * from trains_table where train_no =' ;

    So for correct results, you must map the IL numerals to English numerals before running the query.
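    Points 3 and 4 can be sketched as a small pre-processing step applied before a value reaches the database. The following Python code is an illustration under stated assumptions: it uses an in-memory SQLite database in place of SQL Server, the row 'Demo Express' and all function names are hypothetical, and NFC is chosen as the normalization form only as an example. It normalizes text to one canonical form and maps Indic decimal digits to English (ASCII) digits.

```python
import sqlite3
import unicodedata


def normalize_text(text):
    """Bring alternate canonical spellings to one form (NFC assumed here)."""
    return unicodedata.normalize("NFC", text)


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its English digit."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


def demo_connection():
    """Create a throwaway table with one train stored in English digits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trains_table (train_no TEXT, name TEXT)")
    conn.execute("INSERT INTO trains_table VALUES ('5312', 'Demo Express')")
    return conn


def find_train(conn, train_no, map_digits=True):
    """Look a train up, optionally mapping IL digits to English first."""
    key = to_ascii_digits(train_no) if map_digits else train_no
    row = conn.execute(
        "SELECT name FROM trains_table WHERE train_no = ?", (key,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = demo_connection()
    devanagari_5312 = "\u096b\u0969\u0967\u0968"
    print(find_train(conn, devanagari_5312, map_digits=False))  # no match
    print(find_train(conn, devanagari_5312))                    # match
    # Precomposed QA and KA + NUKTA are canonically equivalent.
    print(normalize_text("\u0958") == normalize_text("\u0915\u093c"))
```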

    12 ISO 639 Language Codes

    The SIL website offers a combined view of the language code tables of ISO 639 parts 1, 2, and 3. There, the code elements of Part 1, 2, or 3 can be selected individually, or the full set of code elements can be sorted by name. Viewing by name makes it possible to browse for any name associated with a specific identifier, including an inverted form of a name (e.g., the code element id=[aaq] "Eastern Abnaki" will also be found under "Abnaki, Eastern"). The elements may also be ordered by scope of denotation or type of language. For each code element, further documentation on what it denotes is available; in the case of a macrolanguage, this includes a listing of its individual member languages.

    A tabular representation of the ISO 639 language codes is given below. A dash indicates that no code is listed in that part.

    639-3   639-2/639-5   639-1   Language Name
    asm     asm           as      Assamese
    ben     ben           bn      Bengali
    brx     -             -       Bodo (India)
    dgo     -             -       Dogri (individual language)
    guj     guj           gu      Gujarati
    hin     hin           hi      Hindi
    kan     kan           kn      Kannada
    kas     kas           ks      Kashmiri
    kok     kok           -       Konkani (macrolanguage)
    mai     mai           -       Maithili
    mal     mal           ml      Malayalam
    mni     mni           -       Manipuri
    mar     mar           mr      Marathi
    nep     nep           ne      Nepali (macrolanguage)
    ori     ori           or      Oriya (macrolanguage)
    pan     pan           pa      Panjabi
    sat     sat           -       Santali
    san     san           sa      Sanskrit
    snd     snd           sd      Sindhi
    tam     tam           ta      Tamil
    tel     tel           te      Telugu
    urd     urd           ur      Urdu

    For further details, please refer to: http://sil.org/iso639-3/codes.asp

    The four-letter script codes are available at the following URL: http://unicode.org/iso15924/iso15924-codes.html
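    The table above can be used directly as a lookup structure. The following Python sketch is an illustration only; the dictionary and function names are hypothetical, and the data is copied from the table. The preferred_tag helper follows the common practice (used by language tags such as the HTML lang attribute) of choosing the two-letter code when one exists and the three-letter code otherwise.

```python
# 639-3 code -> (639-1 code or None, language name), copied from the table.
ISO_639 = {
    "asm": ("as", "Assamese"),
    "ben": ("bn", "Bengali"),
    "brx": (None, "Bodo (India)"),
    "dgo": (None, "Dogri (individual language)"),
    "guj": ("gu", "Gujarati"),
    "hin": ("hi", "Hindi"),
    "kan": ("kn", "Kannada"),
    "kas": ("ks", "Kashmiri"),
    "kok": (None, "Konkani (macrolanguage)"),
    "mai": (None, "Maithili"),
    "mal": ("ml", "Malayalam"),
    "mni": (None, "Manipuri"),
    "mar": ("mr", "Marathi"),
    "nep": ("ne", "Nepali (macrolanguage)"),
    "ori": ("or", "Oriya (macrolanguage)"),
    "pan": ("pa", "Panjabi"),
    "sat": (None, "Santali"),
    "san": ("sa", "Sanskrit"),
    "snd": ("sd", "Sindhi"),
    "tam": ("ta", "Tamil"),
    "tel": ("te", "Telugu"),
    "urd": ("ur", "Urdu"),
}


def language_name(code3):
    """Return the language name for a 639-3 code in the table."""
    return ISO_639[code3][1]


def preferred_tag(code3):
    """Return the two-letter code if one exists, else the 639-3 code."""
    code1 = ISO_639[code3][0]
    return code1 if code1 else code3


if __name__ == "__main__":
    for code in ("hin", "brx", "mai", "urd"):
        print(code, preferred_tag(code), language_name(code))
```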

    13 References

    1. SUN Globalisation: http://developers.sun.com/dev/gadc/i18ntesting/checklists/detcheck/detcheck.html [Accessed: 15 April 2012]

    2. Microsoft: http://msdn.microsoft.com [Accessed: 12 April 2012]

    3. Localisation: http://www.wordesign.com/Localization/index.htm#_Toc469910877 [Accessed: 12 April 2012]

    4. Unicode: http://unicode.org [Accessed: 20 April 2012]

    5. CDAC: http://cdac.in/downloads [Accessed: 20 April 2012]

    6. Lingobit Best Practices: http://www.lingobit.com/solutions/bestpractices.html [Accessed: 30 April 2012]

    7. Localisation: http://www.wordesign.com/Localization/index.htm#_Toc469910877 [Accessed: 30 April 2012]

    8. I18nGuy Localisation: http://www.i18nguy.com/unicode/c-unicode.html [Accessed: 25 April 2012]

    9. ISO 639 code for languages: http://sil.org/iso639-3/codes.asp [Accessed: 30 June 2012]

    10. 4 Letter Script Code: http://unicode.org/iso15924/iso15924-codes.html [Accessed: 30 June 2012]

    11. CLDR: http://cldr.unicode.org/ [Accessed: 30 June 2012]

    12. Language Tag & Text Direction: http://www.w3.org/TR/html4/struct/dirlang.html [Accessed: 30 June 2012]

    13. IBM Globalisation Guidelines: http://www-01.ibm.com/software/globalization/ [Accessed: 30 June 2012]

    14. OASIS: https://www.oasis-open.org/ [Accessed: 20 June 2012]

    15. ITS: http://www.w3.org/TR/2007/REC-its-20070403/ [Accessed: 28 June 2012]

    16. Indic Scripts and Unicode: http://www.Unicode.org/standard/WhatIsUnicode.html [Accessed: 21 May 2012]

    17. ISO/IEC Guide 2:2004, Standardization and related activities - General vocabulary.

    18. TDIL: tdil.mit.gov.in [Accessed: 10 April 2012]

    19. PASCII: http://parc.cdac.in/PASCii.htm [Accessed: 29 May 2012]

    20. Esselink, B. (2000). A Practical Guide to Localisation, Amsterdam: Benjamins.

    21. Anastasiou & Schäler (2010). Localisation, Internationalisation, and Globalisation Against Digital Divide and Information Poverty.

    22. LISA Internationalisation definition: http://www.lisa.org/Internationalization.58.0.html [Accessed: 17 October 2010]

    23. Sikes, R. (2009). Localisation: the global pyramid capstone, Multilingual, April, pp. 3-5.

    24. Internationalisation: http://www.w3.org/International/questions/qa-i18n [Accessed: 25 May 2012]

    25. ISFOC: http://pune.cdac.in/html/gist/standard/isfoc.aspx [Accessed: 28 May 2012]

    26. ICU: http://site.icu-project.org/ [Accessed: 29 May 2012]

    27. PANGO: http://www.pango.org/ [Accessed: 29 May 2012]

    28. ICU: http://site.icu-project.org/ [Accessed: 29 May 2012]

    29. Harfbuzz: http://www.freedesktop.org/wiki/Software/HarfBuzz [Accessed: 29 May 2012]

    30. FUEL: https://fedorahosted.org/fuel/wiki/fuel-hindi [Accessed: 20 July 2012]

    31. Font face: http://boldperspective.com/2011/how-to-use-css-font-face/ [Accessed: 25 July 2012]