© 2016 IBM Corporation
Put ICU to Work!
November 1, 2016
Steven R. Loomis <[email protected]>
San José, California, USA @srl295
Yoshito Umaoka <[email protected]>
Littleton, Massachusetts, USA
© 2014 IBM Corporation
can’t I just use Unicode?
1,400 pages + Annexes + additional standards
More than 110,000 characters
Significant update about once a year
80+ character properties, many multi-valued
Unicode covers the world…
“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”
Photo: NASA Earth Observatory
source: http://www.unicode.org/standard/WhatIsUnicode.html
ICU brings you home
Requirements vary widely across languages & countries
Sorting
Text searching
Bidirectional text processing and complex text layout
Date/time/number/currency formatting
Codepage conversion
…many more
Photo Credits: james.thompson,Per Ola Wiberg ~ Powi, Tom Ravenscroft
ICU comes home
• 1999 – IBM Classes for Unicode becomes International Components for Unicode
• 2016 – not just for Unicode, but from Unicode
  • Similar open-source license
  • Hosting transferred from IBM to Unicode (ICU-TC)
  • Build machines accessible to entire ICU-TC
  • 1-click CLA process – anyone can contribute!
  • http://blog.unicode.org/2016/05/icu-joins-unicode-consortium.html
ICU's laundry list
• Unicode text handling
• Charset conversions (200+)
• Charset detection
• Collation & Searching
• Locales from CLDR (640+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• Breaks: word, line, …
• Formatting
• Date & time
• Durations
• Messages
• Numbers & currencies
• Plurals
• Transforms
• Normalization
• Casing
• Transliterations
• …
Benefits of ICU
• Mature, widely used (all IBM brands and operating systems), up-to-date set of C/C++ and Java libraries
• Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
• Team continues to work on improving and monitoring performance
• Very portable – identical results on all platforms/programming languages
  • C/C++ (ICU4C): 30+ platforms/compilers
  • Java (ICU4J): Oracle and IBM JRE
• Wrappers: D/C#/PHP/Python/…
• Customizable & modular
• Open source (since 1999) – but non-restrictive
• Contributions from many parties (IBM, Google, Apple, Microsoft, ...)
Where do I get ICU? – http://icu-project.org/
• Main site: http://icu-project.org/
  • Downloads
  • API references
  • Mailing list
  • Bug tracking
• http://userguide.icu-project.org
  • User's guide with examples
ICU Userguide
• General to specific
API reference: http://icu-project.org/apiref (C and J)
Mailing Lists
• Mailing lists: http://site.icu-project.org/contacts
  • icu-support – technical support and discussion
  • icu-design – API proposals by ICU team
  • icu-announce – announcements
  • icu-bugrfe / icu-commits – track ICU bugs and commits
Bug Tracker - http://bugs.icu-project.org/trac
New Ticket
(for a security-related bug, the form suggests hiding the ticket)
Time to jump into ICU itself...
Where we are going: Application Data
150,000 Ceuta and Melilla
38,087,800 Algeria
15,439,400 Ecuador
Task: Display a list of world regions, with their population figures.
ICU for C: First Look
#include "unicode/…"
  All ICU headers are in unicode/
UErrorCode status = U_ZERO_ERROR;
  The error code is a fill-in parameter – it must be initialized.
u_init(&status);
  Leaves status as success if ICU data loads OK.
if ( U_SUCCESS(status) ) …
  TRUE if no error.
ICU first look (C)
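#include <stdio.h>            // puts()
#include <unicode/uclean.h>   // u_init()
#include <unicode/utypes.h>   // UErrorCode, U_FAILURE(), u_errorName()

int main(…) {
  UErrorCode status = U_ZERO_ERROR;   // must be initialized before use
  u_init(&status);                    // loads and checks ICU data
  if (U_FAILURE(status)) {
    puts(u_errorName(status));        // print the symbolic error name
    return 1;
  }
  // ICU is OK…
  // hereafter: ASSERT_OK
}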
Hello, World! (C)
#include <iostream>
#include <unicode/uclean.h>
#include <unicode/ustream.h>   // operator<< for UnicodeString

UErrorCode status = U_ZERO_ERROR;
u_init(&status);
ASSERT_OK(status);
UnicodeString msg("Hello, ");
msg.append("World");
msg.append(0x2603);            // U+2603 SNOWMAN
std::cout << msg << std::endl;

Hello, World☃
Hello, Welt! (C)
ULocaleDisplayNames *names;
names = uldn_open(NULL, ULDN_DIALECT_NAMES, &status);
UChar result[256]; // UTF-16
int32_t len = uldn_regionDisplayName(names, "001", result, 256, &status);
uldn_close(names); // clean up!
ASSERT_OK(status);

UnicodeString msg("Hello, ");
msg.append(result, len);
msg.append(0x2603); // snowman
std::cout << msg << std::endl;

Hello, World☃
Hello, Welt☃
Hello, Mundo☃
Hello, 世界☃
Hello, Welt! (J)
Locale locale = Locale.getDefault();
String world = LocaleDisplayNames
    .getInstance(ULocale.forLocale(locale))
    .regionDisplayName("001");
System.out.println("Hello, " + world + "\u2603");

Hello, World☃
Hello, Welt☃
Hello, Mundo☃
Hello, 世界☃
World hello? (C)
MessageFormat (1/2)
Word order differs between languages, so you can't just concatenate strings:
whom + " 's " + what + " is on the " + where
Result:
• English: My Aunt’s pen is on the table.
• Spanish: La pluma de mi tía está en la tabla.
MessageFormat (2/2)
• Pattern syntax:
  • English: {whom}’s {what} is on the {where}.
• Translated pattern:
  • Spanish: {what} de {whom} está en la {where}.
• Result:
  • English: My Aunt’s pen is on the table.
  • Spanish: La pluma de mi tía está en la tabla.
• Note: Use caution; a better solution might be:
  • “Location: table Object: pen Owner: Aunt”
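The reordering idea can be sketched with the JDK's java.text.MessageFormat, which needs no ICU on the classpath. This is only an approximation of the slide: the JDK class accepts numbered placeholders ({0}, {1}, …) rather than the named arguments shown here, so the numbering below is an assumption, as is the class name WordOrderDemo. The translated strings are the ones from the slide.

```java
import java.text.MessageFormat;

// Hypothetical demo class: one pattern per language, same argument order.
final class WordOrderDemo {
    // {0} = whom, {1} = what, {2} = where. '' is an escaped apostrophe.
    static final String ENGLISH = "{0}''s {1} is on the {2}.";
    static final String SPANISH = "{1} de {0} est\u00e1 en la {2}.";

    // The caller always passes arguments in the same order;
    // the pattern decides where each one lands in the sentence.
    static String describe(String pattern, String whom, String what, String where) {
        return MessageFormat.format(pattern, whom, what, where);
    }

    public static void main(String[] args) {
        System.out.println(describe(ENGLISH, "My Aunt", "pen", "table"));
        System.out.println(describe(SPANISH, "mi t\u00eda", "La pluma", "tabla"));
    }
}
```

Because the pattern is data, it can live in a resource bundle and be handed to translators without touching the calling code.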
world messages: setup
Locale defaultLocaleId = Locale::getDefault(); // “en_US”
int32_t territoryCount; // == number of territories
LocalPointer<LocaleDisplayNames> ldn(
    LocaleDisplayNames::createInstance(defaultLocaleId, ULDN_DIALECT_NAMES));
UnicodeString myLanguage;
ldn->localeDisplayName(defaultLocaleId, myLanguage); // “American English”
world messages: format
UnicodeString welcome = resourceBundle.getStringEx("welcome", status);
UnicodeString names[] = { "myLanguage", "today", "territoryCount" };
Formattable args[] = { defaultLocaleName, Calendar::getNow(), territoryCount };
MessageFormat fmt(welcome, defaultLocaleId, status);
fmt.format(names, args, 3, result, status);

Argument name    Type            Value
myLanguage       UnicodeString   “American English”
today            UDate           <right now>
territoryCount   int32_t         257

(English, pattern) My language is {myLanguage} and today is {today, date}. We have {territoryCount, number} territories.
(English, result) My language is American English and today is Apr 15, 2014. We have 257 territories.
(Spanish, pattern) Mi lengua es {myLanguage}, y hoy es {today, date}. Tenemos {territoryCount, number} regiones.
(Spanish, result) Mi lengua es español (Estados Unidos), y hoy es abr. 15, 2014. Tenemos 257 regiones.
world messages: J
String pattern; // loaded from resource bundle
MessageFormat fmt = new MessageFormat(pattern, defaultLocaleID);
Map<String, Object> msgargs = new HashMap<String, Object>();
msgargs.put("territoryCount", territoryCount);
msgargs.put("myLanguage", defaultLocaleName);
msgargs.put("today", System.currentTimeMillis());
System.out.println(fmt.format(msgargs, new StringBuffer(), null));
message formats (English U.S. examples)
number                      date                                   time
(default)  123,456.789      (default)  Oct 22, 2013                12:34:56 PM
integer    123,457          short      10/22/13                    12:34 PM
currency   $123,456.79      medium     Oct 22, 2013                12:34:56 PM
percent    82%              long       October 22, 2013            12:34:56 PM PST
                            full       Tuesday, October 22, 2013   12:34:56 PM Pacific Standard Time
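The number column of this table can be reproduced with the JDK's java.text.NumberFormat factories, as a rough stand-in for the ICU styles named on the slide. This is a sketch under assumptions: the class name NumberStylesDemo and its helper method names are made up for illustration, and the JDK factories are used here instead of ICU's MessageFormat styles. The date and time columns are left out because their exact output depends on the locale data version.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical demo class: the four number styles from the table, for en_US.
final class NumberStylesDemo {
    static String plain(double v) {
        return NumberFormat.getInstance(Locale.US).format(v);
    }

    static String integer(double v) {
        // Rounds to a whole number and keeps the grouping separators.
        return NumberFormat.getIntegerInstance(Locale.US).format(v);
    }

    static String currency(double v) {
        return NumberFormat.getCurrencyInstance(Locale.US).format(v);
    }

    static String percent(double v) {
        // Expects a fraction: 0.82 means 82 percent.
        return NumberFormat.getPercentInstance(Locale.US).format(v);
    }

    public static void main(String[] args) {
        System.out.println(plain(123456.789));
        System.out.println(integer(123456.789));
        System.out.println(currency(123456.789));
        System.out.println(percent(0.82));
    }
}
```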
resource bundles
English: greeting = “A hello to …”
Spanish: greeting = “¡Hola a …!”
…
ResourceBundle("myApp", locale, …).getString("greeting") …
locale = “en”
locale = “es”
Locale IDs
[Diagram: the parts of a locale ID – Language, Script, Territory]
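As a small illustration of those three parts, the JDK's java.util.Locale can split a BCP 47 language tag into language, script and territory. The tags used below (zh-Hant-TW, en-US) and the class name LocaleIdDemo are illustrative assumptions, not taken from the slide.

```java
import java.util.Locale;

// Hypothetical demo class: split a language tag into its three main parts.
final class LocaleIdDemo {
    // Returns { language, script, territory }; a missing part is "".
    static String[] parts(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return new String[] {
            locale.getLanguage(), locale.getScript(), locale.getCountry()
        };
    }

    public static void main(String[] args) {
        for (String tag : new String[] { "zh-Hant-TW", "en-US" }) {
            String[] p = parts(tag);
            System.out.println(tag + " -> language=" + p[0]
                + " script=" + p[1] + " territory=" + p[2]);
        }
    }
}
```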
authoring resource bundles
root.txt
root {
  hello { "Hello!" }
}

en.txt
en {
  // empty…
}

• “root” contains English text
• “en” fully inherits from root.
• “root” is sent out for translation…
authoring resource bundles
root.txt
root {
  hello { "Hello!" }
}

es.txt
es {
  hello { "¡Hola!" }
}

en.txt
en {
  // empty…
}

• “es” returned from translation with Spanish content
building resource bundles
$ bldicures --from ./txtfiles --dest ./out --name myapp
using resource bundles: C
u_setDataDirectory("out"); // (see userguide)
ResourceBundle resourceBundle("myapp", locale, status);
UnicodeString thing = resourceBundle.getStringEx("hello", status);
std::cout << thing << std::endl;
using ICU4C resource bundles in J
(Most apps will use Java ListResourceBundle)
Locale locale = Locale.getDefault();
UResourceBundle bundle = UResourceBundle.getBundleInstance(
    Sample30_ResHello.class.getPackage().getName().replace('.', '/') + "/data/reshello",
    locale,
    Sample30_ResHello.class.getClassLoader());
System.out.println(bundle.getString("hello"));
population: who’s counting?
root { // root.txt
  info { "The territory of {territory} has {population} persons." }
}
The territory of Caribbean Netherlands has 20,000 persons.
The territory of Brazil has 201,010,000 persons.
The territory of Bouvet Island has 1 persons.
The territory of Unknown Region has 0 persons.
Bouvet de Lozier
population: the problem of plurals
“{count} territor(ies).”
if (territoryCount == 1) {
  message = message1; // “one territory…”
} else {
  message = messageN; // “{territoryCount} territories”
}
plurals are language specific
Language               # cases
Japanese, Hungarian    1
English, Italian       2
Polish, Romanian       3
Arabic, Welsh          6 (zero, one, two, few, many, other)
population: using plurals
root {
  {population, plural,
    =0{{territory} has a population of zero.}
    one{Only one person lives on {territory}!}
    other{The territory of {territory} has # persons.}}
}
The territory of Caribbean Netherlands has 20,000 persons.
The territory of Brazil has 201,010,000 persons.
Only one person lives on Bouvet Island!
Unknown Region has a population of zero.
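For comparison, the same four outputs can be approximated without ICU by using the JDK's older choice syntax in java.text.MessageFormat. This is a substitute technique, not the ICU plural format shown on the slide: choice works on hard-coded numeric ranges, so it cannot express per-language categories such as few or many, which is the gap the plural syntax fills. The class name PopulationChoiceDemo and the argument numbering are assumptions for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class: numeric ranges instead of plural categories.
final class PopulationChoiceDemo {
    // {0} = population, {1} = territory.
    // 0# covers values below 1, 1# covers exactly 1, 1< covers values above 1.
    static final String PATTERN =
        "{0,choice,"
        + "0#{1} has a population of zero."
        + "|1#Only one person lives on {1}!"
        + "|1<The territory of {1} has {0,number,integer} persons.}";

    static String describe(String territory, long population) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        return fmt.format(new Object[] { population, territory });
    }

    public static void main(String[] args) {
        System.out.println(describe("Caribbean Netherlands", 20000));
        System.out.println(describe("Brazil", 201010000));
        System.out.println(describe("Bouvet Island", 1));
        System.out.println(describe("Unknown Region", 0));
    }
}
```

The ranges above only happen to match English; a Polish or Arabic translator would need the plural categories, which this syntax cannot provide.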
population: sorting it all out
150,000 Ceuta and Melilla
38,087,800 Algeria
15,439,400 Ecuador
binary comparison inadequate
order varies by language (Danish ‘aa…’ follows ‘z…’)
need multiple-level collation
collation
Uses:
• comparing
• sorting
• searching
Options:
• case sensitive?
• ignore punctuation?
• UPPERCASE first?
• which variant collator?
• which locale?
• custom tailorings?
• time vs. memory tradeoff?
population: collation helper
// for use with std::sort
class CollatorLessThan :
    public std::binary_function<const TerritoryEntry*, const TerritoryEntry*, bool> {
public:
  CollatorLessThan(Collator &coll, UErrorCode &status)
      : fColl(coll), fStatus(status) {}
  inline bool operator()(const TerritoryEntry* a, const TerritoryEntry* b) {
    return (fColl.compare(a->getTerritoryName(),
                          b->getTerritoryName(), fStatus) == UCOL_LESS);
  }
private:
  Collator &fColl;
  UErrorCode &fStatus;
};
population: applying collation
LocalPointer<Collator>
coll(Collator::createInstance(locale, status));
std::sort(territoryList,
territoryList + territoryCount,
CollatorLessThan(*coll, status));
The territory of Afghanistan has 31,108,100 persons in it.
The territory of Åland Islands has 26,200 persons in it.
The territory of Albania has 3,011,410 persons in it.
The territory of Algeria has 38,087,800 persons in it.
…
population: collation (J)
Collator collator = Collator.getInstance(locale);
territoryList = PopulationData.getTerritoryEntries(locale,
    new TreeSet<TerritoryEntry>(new Comparator<TerritoryEntry>() {
      public int compare(TerritoryEntry o1, TerritoryEntry o2) {
        return collator.compare(o1.territoryName(), o2.territoryName());
      }
    }));
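To see why binary comparison is inadequate, here is a minimal sketch that sorts four of the territory names both ways. It uses the JDK's java.text.Collator so that it runs without ICU; the ICU class com.ibm.icu.text.Collator is used the same way in the slide above. The class name TerritorySortDemo, its method names, and the choice of canonical decomposition are assumptions for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: locale-aware sort versus code-unit (binary) sort.
final class TerritorySortDemo {
    static List<String> sortFor(Locale locale, List<String> names) {
        Collator collator = Collator.getInstance(locale);
        // Decompose accented letters so that A-with-ring sorts next to plain A.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator); // Collator is a Comparator
        return sorted;
    }

    static List<String> sortBinary(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted); // String.compareTo: raw UTF-16 code units
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "Algeria", "\u00c5land Islands", "Albania", "Afghanistan");
        System.out.println("collated: " + sortFor(Locale.ENGLISH, names));
        System.out.println("binary:   " + sortBinary(names));
    }
}
```

With the collator, the Åland Islands entry is placed among the other names starting with A, while the binary sort pushes it to the end of the list.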
population: other locales
English
The territory of Afghanistan has 22 262 500 persons.
The territory of Albania has 8 221 650 persons.
The territory of Azerbaijan has 9 590 160 persons.
The territory of Аландские о-ва has 26 200 persons in it.
The territory of Албания has 3 011 410 persons in it.

Japanese
アイスランドには、315,281人います。
アイルランドには、4,775,980人います。
アゼルバイジャンには、9,590,160人います。
アセンション島には、940人います。
アフガニスタンには、31,108,100人います。
アメリカ合衆国には、316,439,000人います。

Spanish
En la región de “Afganistán” hay 31.108.100 personas.
En la región de “Albania” hay 3.011.410 personas.
En la región de “Alemania” hay 81.147.300 personas.
En la región de “Andorra” hay 85.293 personas.
En la región de “Angola” hay 18.565.300 personas.
ICU API Stability Policy
API Status
December 3, 2015
• Internal: Used by the ICU implementation, or Technology Preview.
• Draft: New API, reviewed and approved by the ICU project team. The API might still be changed.
• Stable: For public use; the API signature won't be changed in future releases.
• Deprecated: Previously Stable, but no longer recommended. The API might be removed after a few releases.
Backward Compatibility
Source code compatible
• A consumer program should compile successfully without changes.
• Rare exceptions, documented in the readme.
Serialization compatible (ICU4J)
• A newer ICU version should be able to deserialize object data serialized by an older ICU version.
Packaging and Dependency Questions
ICU Customization ( Problems and Solutions )
• ICU is too big
  • Data
    • Details in the userguide: http://userguide.icu-project.org/icudata
    • Data Customizer tool: http://apps.icu-project.org/datacustom/
  • Code size (ICU4C)
    • http://userguide.icu-project.org/packaging
    • Example: #define UCONFIG_NO_LEGACY_CONVERSION 1
    • (May not reduce data size)
ICU data changes from time to time...
Unicode: 4.1 5.0 5.1 5.2 6.0 6.2 6.3
CLDR: 1.6 1.7 1.8 1.9 2.0 21 22 23 24 25 26
ICU 4.4:
• Polish Date Format: "08-04-2014"
• isLetter(' ') : false (undefined)
ICU 53:
• Polish Date Format: "8 kwi 2014"
• isLetter(' ') : true
(date format data from CLDR)
(isLetter data from Unicode)
Compatibility Concerns
• Unicode stability
  • character type, upper/lower case, normalization, text direction, sorting order...
  • policy [http://www.unicode.org/policies/stability_policy.html]
  • but Unicode is still growing
• Locale data
  • cultural data can be updated based on community voting
  • cultural format results are not suited for serializing data, application protocols and storage
Protecting your application from change-induced problems
• “08-04-2014” may not parse as “8 kwi 2014”
  • DON'T
    • send localized data across the network between programs (the other side may parse/format differently)
    • store localized data on disk (a later app version may parse/format differently)
  • DO send and store a non-localized format:
    • Binary: 0x12345678
    • “Neutral” – ISO 8601 – “2014-04-08”
• A character may not be a letter (isLetter()) in one Unicode version, but may later be defined as one.
  • Could cause difficulties if used to validate account names, …
  • DO: Think carefully about where Unicode properties are used.
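The “neutral format” advice can be sketched with the JDK's java.time API: store or transmit the ISO 8601 form and apply a localized format only when displaying. The class name NeutralDateDemo and its method names are assumptions for illustration; the localized line printed by main has no expected output on purpose, since it depends on the locale data version in use.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical demo class: ISO 8601 for storage, localized text for display only.
final class NeutralDateDemo {
    // Locale-independent form, safe to store or send to another program.
    static String toWire(LocalDate date) {
        return DateTimeFormatter.ISO_LOCAL_DATE.format(date);
    }

    static LocalDate fromWire(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Display form: may change between locale data versions, never parse it back.
    static String toDisplay(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.MEDIUM)
            .withLocale(locale)
            .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2014, 4, 8);
        String wire = toWire(date);
        System.out.println("stored:    " + wire);
        System.out.println("restored:  " + fromWire(wire));
        System.out.println("displayed: " + toDisplay(date, Locale.forLanguageTag("pl")));
    }
}
```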
Thanks / Q & A
Info/Support/Download: http://icu-project.org/
ICU4J and JDK classes – 1 – When to use ICU4J vs. JDK
JDK class | ICU class | ICU benefits | Suggestion
java.lang.Character | com.ibm.icu.lang.UCharacter | Latest Unicode standard; more character properties support | JDK OK; ICU as-needed
java.math.BigDecimal | com.ibm.icu.math.BigDecimal | For backward compatibility only | ICU not recommended in new code
java.text.Bidi | com.ibm.icu.text.Bidi | Latest Unicode bidi algorithm (UAX#9) | JDK OK; ICU as-needed
java.text.BreakIterator | com.ibm.icu.text.BreakIterator | Latest Unicode standard (UAX#29); dictionary-based word break (Thai, Lao, Chinese/Japanese) | JDK OK; ICU as-needed
java.text.Collator, java.text.RuleBasedCollator | com.ibm.icu.text.Collator, com.ibm.icu.text.RuleBasedCollator | Unicode collation algorithm (UTS#10); faster comparison | ICU recommended

• ICU has functionality beyond JDK – not shown on this chart. See userguide.
• Where there is overlap, in some cases JDK may be used instead of ICU.
ICU4J and JDK classes - 2
JDK class | ICU class | ICU benefits | Suggestion
java.text.DateFormat, java.text.SimpleDateFormat | com.ibm.icu.text.DateFormat, com.ibm.icu.text.SimpleDateFormat | Abstract (skeleton) pattern (e.g. year-month only format); patterns for additional calendar types; more field format types (e.g. narrow weekday, standalone month); capitalization control; slower service object creation, format & parse than JDK | JDK OK; ICU as-needed
java.text.MessageFormat | com.ibm.icu.text.MessageFormat | Plural formatting; gender formatting (social applications); named arguments (“{filename}” vs “{4}”); auto apostrophe mode | JDK OK; ICU as-needed
java.text.NumberFormat, java.text.DecimalFormat | com.ibm.icu.text.NumberFormat, com.ibm.icu.text.DecimalFormat, + RuleBasedNumberFormat | More styles (e.g. scientific, currency spell out); parse currency; algorithmic numbering systems; slower service object creation, format & parse than JDK | JDK OK; ICU as-needed
ICU4J and JDK classes - 3
JDK class | ICU class | ICU benefits | Suggestion
java.util.Calendar, java.util.GregorianCalendar | com.ibm.icu.util.Calendar, com.ibm.icu.util.GregorianCalendar, + other Calendar implementation classes | More calendar types (e.g. Chinese lunar, Coptic, Hindu); finer control around time zone transition; slower service object creation than JDK; no Java 8 java.time package integration yet | JDK OK; ICU as-needed
java.util.Locale | com.ibm.icu.util.ULocale | Full BCP 47 language tag syntax support also on Java 6 or older runtime | JDK 7 or newer: preferred. JDK 6 or older: OK, but ICU on as-needed basis.
java.util.ResourceBundle | com.ibm.icu.util.UResourceBundle | ICU style bundle support; bundle name with script (e.g. zh_Hant); no JDK ResourceBundle.Control support (no controls for customizing lookup ordering, fallbacks and custom bundle names) | JDK recommended; ICU not recommended unless you want to share the same resource with ICU4C
java.util.TimeZone, java.util.SimpleTimeZone | com.ibm.icu.util.TimeZone, com.ibm.icu.util.SimpleTimeZone, + other TimeZone implementation classes | Iteration for time zone transitions; custom time zone built from multiple different rules | JDK OK; ICU as-needed. JDK java.time.zone classes on Java 8 support zone transitions.

• ICU SPI provider can provide ICU functionality through JDK API in Java 6+
• Don't just change the import statements to use ICU.