Building Blocks for Accessing Multilingual Data: CLDR
Steven R. Loomis, IBM GFTT
Access available handouts at ala.15.ala.org/sessions/handouts.
About Me
• Senior Software Engineer, IBM Global Foundations Technology Team
• IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode
• Member of CLDR-TC, lead of ULI-TC
Agenda
• About CLDR
• Focus Areas:
  • Language Identification
  • Transliteration
  • Searching and Sorting
  • Keyboards/Entry
• Q&A
What is CLDR?
• Common Locale Data Repository
• Language and region-specific data
• Covers hundreds of language/region pairs
• Open data (like Unicode itself), XML/JSON format
• Community input, carefully curated
Who is CLDR?
• CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium
• Active participation by industry, academic, open source projects, national standards bodies, individuals
Who uses CLDR?
• Apple, Google, IBM, Microsoft…
• Wikimedia Foundation, jQuery, …
• Java, Node.js, PHP, …
• Many users via ICU C/C++/Java library
Locale Data
• Data required for respecting the linguistic, cultural, geopolitical requirements of specific users
• Example: "What day is it?"
XML / JSON
• XML: “es-US”
  • <month type="6">Junio</month>
• JSON: “es-US”
  • { … "6": "Junio", …}
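To make the two formats concrete, here is a minimal sketch of reading the same month name from each one. The two fragments are hand-written for illustration, not copied from the CLDR distribution; they assume the usual nesting of Gregorian wide format month names, and the helper names `month_from_xml` and `month_from_json` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written LDML fragment (illustrative; the real es-US file holds far more data).
LDML_FRAGMENT = """
<ldml>
  <dates><calendars><calendar type="gregorian">
    <months><monthContext type="format"><monthWidth type="wide">
      <month type="6">Junio</month>
    </monthWidth></monthContext></months>
  </calendar></calendars></dates>
</ldml>
"""

# Hand-written JSON fragment with the equivalent nesting (also illustrative).
JSON_FRAGMENT = """
{"main": {"es-US": {"dates": {"calendars": {"gregorian": {
  "months": {"format": {"wide": {"6": "Junio"}}}
}}}}}}
"""


def month_from_xml(ldml_text, month):
    """Return the wide format month name from an LDML fragment, or None."""
    root = ET.fromstring(ldml_text)
    path = (
        ".//calendar[@type='gregorian']/months"
        "/monthContext[@type='format']/monthWidth[@type='wide']"
        f"/month[@type='{month}']"
    )
    node = root.find(path)
    return node.text if node is not None else None


def month_from_json(json_text, locale, month):
    """Return the wide format month name from a JSON fragment, or None."""
    data = json.loads(json_text)
    gregorian = data["main"][locale]["dates"]["calendars"]["gregorian"]
    return gregorian["months"]["format"]["wide"].get(str(month))


if __name__ == "__main__":
    print(month_from_xml(LDML_FRAGMENT, 6))
    print(month_from_json(JSON_FRAGMENT, "es-US", 6))
```

In practice most applications would not parse these files directly but would get the data through a library such as ICU.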
CLDR Coverage
• Coverage vs. number of languages
CLDR site and SurveyTool (DEMO)
• DEMO:
  • http://unicode.org/cldr
  • http://st.unicode.org/cldr-apps
Locale Identifiers — BCP47
• Example: sr-Latn-RS
  • sr : ISO 639 "Serbian"
  • Latn : ISO 15924 "Latin Script" (vs Cyrillic)
  • RS : ISO 3166 / UN M.49 "Serbia"
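The breakdown of sr-Latn-RS can be sketched as a small parser. This is a deliberately simplified illustration that only handles the language, script, and region subtags; full BCP47 tags can also carry variants, extensions, and private-use subtags, which this sketch ignores. The names `LocaleId` and `parse_simple_tag` are hypothetical.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_simple_tag(tag):
    """Split a tag such as 'sr-Latn-RS' into language, script, and region.

    Simplification: subtags after the region (variants, extensions) are ignored.
    """
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None:
            script = part.title()   # ISO 15924 codes are written like 'Latn'
        elif is_region and region is None:
            region = part.upper()   # ISO 3166 alpha-2 or UN M.49 numeric
        else:
            break                   # anything else is outside this sketch
    return LocaleId(language, script, region)


if __name__ == "__main__":
    print(parse_simple_tag("sr-Latn-RS"))
```

Production code would normally rely on an existing locale implementation rather than hand-rolled parsing.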
Language/Territory/Script info
Facts:
• “The Cyrillic Script can be used to write Mongolian, Russian, Serbian…”
• “Italian is spoken in Italy, San Marino, Switzerland…”
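Facts of this kind are naturally stored as lookup tables that can be queried in either direction. The sketch below holds only the examples from the slide, written with their standard codes; the lists are intentionally partial (the slide itself trails off with "…"), and the table and function names are hypothetical.

```python
from collections import defaultdict

# Partial data taken from the examples above; real CLDR tables are much larger.
SCRIPT_TO_LANGUAGES = {"Cyrl": ["mn", "ru", "sr"]}   # Mongolian, Russian, Serbian
LANGUAGE_TO_REGIONS = {"it": ["IT", "SM", "CH"]}     # Italy, San Marino, Switzerland


def invert(mapping):
    """Turn a key -> [values] table into a value -> [keys] table."""
    inverted = defaultdict(list)
    for key, values in mapping.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)


if __name__ == "__main__":
    print(invert(SCRIPT_TO_LANGUAGES))   # which scripts can write each language
    print(invert(LANGUAGE_TO_REGIONS))   # which languages are spoken in each region
```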
Language Identification: Exemplars
English (Latin)
a b c d e f g h i j k l m n o p q r s t u v w x y z
Serbian (Latin)
a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž
Serbian (Cyrillic)
а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш
Russian (Cyrillic)
а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
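One way exemplar sets could support language identification is to keep only the candidates whose exemplars cover every letter in a piece of text. The sketch below uses the four sets listed above and treats digraphs such as "lj" as their individual letters; it is an illustrative filter under those assumptions, not the method used by any particular tool, and `candidate_locales` is a hypothetical name.

```python
import unicodedata

# Exemplar listings copied from the slide above.
EXEMPLARS = {
    "en": "a b c d e f g h i j k l m n o p q r s t u v w x y z",
    "sr-Latn": "a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž",
    "sr-Cyrl": "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш",
    "ru": "а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я",
}


def exemplar_chars(listing):
    """Return the set of single characters in a space-separated exemplar listing."""
    normalized = unicodedata.normalize("NFC", listing)
    return set(normalized.replace(" ", ""))


def candidate_locales(text):
    """Return the locales whose exemplar set contains every letter of the text."""
    normalized = unicodedata.normalize("NFC", text.lower())
    letters = {ch for ch in normalized if ch.isalpha()}
    if not letters:
        return []
    return [
        locale
        for locale, listing in EXEMPLARS.items()
        if letters <= exemplar_chars(listing)
    ]


if __name__ == "__main__":
    for sample in ["queue", "đak", "ђак", "щи"]:
        print(sample, candidate_locales(sample))
```

Short or unmarked text can match several locales at once, so a filter like this narrows the candidates rather than giving a single answer.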
Transliteration
• Existing data for rule sets.
• ALA-LC format could be included.
• Rule-based engine.
Transliteration Rule Example: Greek
• <tRule>Σ ↔ S ;</tRule>
• <tRule>τ ↔ t ;</tRule>
• <tRule>Τ ↔ T ;</tRule>
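The "↔" in these rules means each mapping applies in both directions. A toy interpreter for exactly this one-character form might look like the sketch below; it is not the ICU rule engine, which also supports context, multi-character rules, and filters. The function names are hypothetical.

```python
# The three rules from the slide, without the <tRule> wrapper.
RULES = ["Σ ↔ S ;", "τ ↔ t ;", "Τ ↔ T ;"]


def compile_rules(rules):
    """Build forward (Greek to Latin) and reverse (Latin to Greek) tables."""
    forward = {}
    reverse = {}
    for rule in rules:
        body = rule.strip().rstrip(";")
        source, target = (side.strip() for side in body.split("↔"))
        forward[source] = target
        reverse[target] = source
    return forward, reverse


def transliterate(text, table):
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    forward, reverse = compile_rules(RULES)
    print(transliterate("Στ", forward))
    print(transliterate("St", reverse))
```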
Demo: ICU transliterator demo
• http://demo.icu-project.org/icu-bin/translit
Searching and Sorting
• The Unicode Collation Algorithm (UCA) provides the base ordering
• CLDR “tailors” it per language: English vs. Danish vs. French
• German: Mueller = Müller = MUELLER
• Multiple stages and options:
  • blackbird vs black-bird vs BlackBird
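The matching behaviour in these examples can be approximated with a comparison key that ignores case, accents, and punctuation, plus an optional German-style expansion of umlauts. The sketch below is a rough stand-in built on Python's standard library, not an implementation of the UCA or of CLDR tailorings; `primary_key`, `UMLAUT_EXPANSIONS`, and the `expand_umlauts` flag are hypothetical names.

```python
import unicodedata

# Assumed expansion table for the German example (ü treated as "ue", and so on).
UMLAUT_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def primary_key(text, expand_umlauts=False):
    """Return a key that ignores case, accents, and punctuation."""
    folded = unicodedata.normalize("NFC", text).casefold()
    out = []
    for ch in folded:
        if expand_umlauts and ch in UMLAUT_EXPANSIONS:
            out.append(UMLAUT_EXPANSIONS[ch])
            continue
        # Decompose the character and keep only its letters and digits,
        # which drops combining accents and punctuation such as hyphens.
        for piece in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(piece) and piece.isalnum():
                out.append(piece)
    return "".join(out)


if __name__ == "__main__":
    names = ["MUELLER", "Müller", "Mueller"]
    print({name: primary_key(name, expand_umlauts=True) for name in names})
    words = ["blackbird", "black-bird", "BlackBird"]
    print({word: primary_key(word) for word in words})
```

A real collator compares strings at several levels (base letters, then accents, then case), so differences ignored here would still be used as tie-breakers when sorting.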
Demo: Collator
• http://demo.icu-project.org/icu-bin/collation.html
Keyboards / Entry
• Standardized identifier for keyboard tables
• Allows comparison between keyboard providers
Demo: MARC processor
CLDR data
Script: Armn (Armenian)
Exemplar text matches hy “Armenian”
Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i”
Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus
uses: CLDR, ICU4J, MARC4J
Thank You / Q&A
• [email protected]
• @srl295 (Twitter, GitHub, Freenode)
• ibm.biz/srloomis