Building Blocks for Accessing Multilingual Data: CLDR
Steven R. Loomis, IBM GFTT

Upload: steven-r-loomis

Post on 22-Jan-2018


Access available handouts at ala.15.ala.org/sessions/handouts.

About Me

• Senior Software Engineer, IBM Global Foundations Technology Team
• IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode
• Member of CLDR-TC, lead of ULI-TC


Agenda

• About CLDR
• Focus Areas:
  • Language Identification
  • Transliteration
  • Searching and Sorting
  • Keyboards/Entry
• Q&A


What is CLDR?

• Common Locale Data Repository
• Language and region-specific data
• Covers hundreds of language/region pairs
• Open data (like Unicode itself), XML/JSON format
• Community input, carefully curated


Who is CLDR?

• CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium

• Active participation by industry, academia, open source projects, national standards bodies, and individuals


Who uses CLDR?

• Apple, Google, IBM, Microsoft…
• Wikimedia Foundation, jQuery, …
• Java, Node.js, PHP, …
• Many users via the ICU C/C++/Java library


Locale Data

• Data required for respecting the linguistic, cultural, geopolitical requirements of specific users

• Example: "What day is it?"


XML / JSON

• XML: “es-US”
  • <month type="6">Junio</month>
• JSON: “es-US”
  • { … "6": "Junio", …}
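
To illustrate how the same month name might be read from either format, here is a minimal Python sketch. It assumes fragments reduced to just the value shown on the slide. Real CLDR files nest this data inside further structure, which the slide elides and which is not reproduced here. The function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: fragments cut down to the single value shown on the slide.
# The surrounding structure of real CLDR files is elided there and here.
XML_FRAGMENT = '<month type="6">Junio</month>'
JSON_FRAGMENT = '{"6": "Junio"}'


def month_from_xml(fragment, month_number):
    """Return the month name if the element's type attribute matches."""
    element = ET.fromstring(fragment)
    if element.tag == "month" and element.get("type") == str(month_number):
        return element.text
    return None


def month_from_json(fragment, month_number):
    """Return the month name stored under the month number, if present."""
    return json.loads(fragment).get(str(month_number))


if __name__ == "__main__":
    print(month_from_xml(XML_FRAGMENT, 6))
    print(month_from_json(JSON_FRAGMENT, 6))
```

Both lookups key on the month number as a string, because that is how it appears in the XML attribute and in the JSON key.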


CLDR Coverage

• Coverage vs. number of languages


CLDR site and SurveyTool (DEMO)

• DEMO:
  • http://unicode.org/cldr
  • http://st.unicode.org/cldr-apps


Locale Identifiers — BCP47

• Example: sr-Latn-RS
  • sr : ISO 639 "Serbian"
  • Latn : ISO 15924 "Latin Script" (vs Cyrillic)
  • RS : ISO 3166 / UN M.49 "Serbia"
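
The breakdown of sr-Latn-RS can be sketched in a few lines of Python. This is a deliberately simplified parser that assumes only the language, script, and region subtags are present. Full BCP47 tags can also carry variants, extensions, and private-use subtags, which this sketch rejects. The name `parse_tag` is hypothetical.

```python
import re

# Simplified assumption: language[-script][-region] only.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"               # ISO 639 language code
    r"(?:-(?P<script>[A-Za-z]{4}))?"              # ISO 15924 script code
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"    # ISO 3166 or UN M.49 region
)


def parse_tag(tag):
    """Split a simple locale identifier into its subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match["script"]
    region = match["region"]
    return {
        "language": match["language"].lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


if __name__ == "__main__":
    print(parse_tag("sr-Latn-RS"))
```

The sketch also normalizes case to the conventional form (lowercase language, title-case script, uppercase region), so `SR-latn-rs` yields the same result as `sr-Latn-RS`.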


Language/Territory/Script info

Facts:
• “The Cyrillic Script can be used to write Mongolian, Russian, Serbian…”
• “Italian is spoken in Italy, San Marino, Switzerland…”


Language Identification: Exemplars

English (Latin)

a b c d e f g h i j k l m n o p q r s t u v w x y z

Serbian (Latin)

a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž

Serbian (Cyrillic)

а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш

Russian (Cyrillic)

а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
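
One way exemplar sets could narrow down a language is sketched below in Python: a text is a candidate for a language only if every letter it uses appears in that language's exemplar list. The lists are copied from this slide. The function names and the all-letters-must-match rule are assumptions for illustration, not a description of how CLDR consumers actually implement identification.

```python
import unicodedata

# Exemplar lists as shown on the slide (digraphs such as "dž" contribute
# their individual characters).
EXEMPLARS = {
    "English (Latin)": "a b c d e f g h i j k l m n o p q r s t u v w x y z",
    "Serbian (Latin)": "a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž",
    "Serbian (Cyrillic)": "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш",
    "Russian (Cyrillic)": "а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я",
}


def exemplar_set(listing):
    """Turn a space-separated exemplar listing into a set of characters."""
    return set(unicodedata.normalize("NFC", listing).replace(" ", ""))


def candidate_languages(text):
    """Return the languages whose exemplars cover every letter in text."""
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = {ch for ch in normalized if ch.isalpha()}
    if not letters:
        return []
    return [
        name
        for name, listing in EXEMPLARS.items()
        if letters <= exemplar_set(listing)
    ]


if __name__ == "__main__":
    print(candidate_languages("quiz"))
```

Short inputs are often ambiguous: a word that uses only letters shared by Serbian and Russian Cyrillic returns both candidates, so exemplars narrow the choice rather than settle it.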


Transliteration

• Existing data for rule sets.
• ALA-LC format could be included.
• Rule-based engine.


Transliteration Rule Example: Greek

• <tRule>Σ ↔ S ;</tRule>
• <tRule>τ ↔ t ;</tRule>
• <tRule>Τ ↔ T ;</tRule>
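
As a rough illustration of what a bidirectional rule means, the Python sketch below compiles the three rules above into a forward and a reverse lookup table. It assumes every rule maps exactly one character to one character. The real rule syntax is considerably richer, and this is not the ICU engine. The helper names are hypothetical.

```python
ARROW = "\u2194"  # the bidirectional arrow used in the rules

# The three rules from the slide, written with escapes so the Greek
# letters cannot be confused with their Latin lookalikes.
RULES = [
    "\u03a3 \u2194 S ;",  # GREEK CAPITAL LETTER SIGMA <-> S
    "\u03c4 \u2194 t ;",  # GREEK SMALL LETTER TAU     <-> t
    "\u03a4 \u2194 T ;",  # GREEK CAPITAL LETTER TAU   <-> T
]


def compile_rules(rules):
    """Build forward and reverse tables from 'X <-> Y ;' rule strings."""
    forward, reverse = {}, {}
    for rule in rules:
        left, right = rule.rstrip().rstrip(";").split(ARROW)
        left, right = left.strip(), right.strip()
        forward[left] = right
        reverse[right] = left
    return forward, reverse


def apply_rules(text, table):
    """Replace each character found in the table; pass others through."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    to_latin, to_greek = compile_rules(RULES)
    print(apply_rules("\u03a3\u03c4", to_latin))
```

Because each rule is bidirectional, a single rule list yields both the Greek-to-Latin and the Latin-to-Greek table.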


Demo: ICU transliterator demo

• http://demo.icu-project.org/icu-bin/translit


Searching and Sorting

• The Unicode Collation Algorithm (UCA) provides the base
• CLDR “tailors”: English vs. Danish vs. French
• German: Mueller = Müller = MUELLER
• Multiple stages and options:
  • blackbird vs black-bird vs BlackBird
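
The idea of comparing in stages can be sketched in Python as a simplified two-level key. This is not the UCA and does not use CLDR data. The expansion table is an assumption chosen to reproduce the Mueller = Müller example, and the function names are hypothetical.

```python
import unicodedata

# Assumption: German expansions picked to match the slide's example.
# A real tailoring would come from CLDR collation data.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def primary_key(text, expansions=None):
    """Key that ignores case and punctuation (a rough first stage)."""
    expansions = expansions or {}
    folded = unicodedata.normalize("NFC", text).casefold()
    parts = []
    for ch in folded:
        if ch in expansions:
            parts.append(expansions[ch])   # tailored expansion, e.g. ü -> ue
        elif ch.isalnum():
            parts.append(ch)               # keep letters and digits only
    return "".join(parts)


def sort_key(text, expansions=None):
    """Compare by the primary key first, then by the raw text as a tie-break."""
    return (primary_key(text, expansions), text)


if __name__ == "__main__":
    names = ["Mueller", "Müller", "MUELLER"]
    print({primary_key(n, GERMAN_EXPANSIONS) for n in names})
```

With the first stage alone, blackbird, black-bird, and BlackBird compare as equal, which suits searching. The second stage keeps them distinct, which gives sorting a stable order.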


Demo: Collator

• http://demo.icu-project.org/icu-bin/collation.html


Keyboards / Entry

• Standardized identifier for keyboard tables

• Allows comparison between keyboard providers


Demo: MARC processor

CLDR data

Script: Armn (Armenian)
Exemplar text matches hy “Armenian”
Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i”
Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus

Uses: CLDR, ICU4J, MARC4J


Thank You / Q&A

[email protected] • @srl295 (Twitter, GitHub, Freenode) • ibm.biz/srloomis
