Unicode in
2008Q3
Mark Davis, Vladimir Weinstein, Andy Heninger
Standard SW Globalization
Data Handling
Date, Time, Number Formatting
Collation
Locales/Languages
Timezones & Calendars,…
General Internationalization
Using character properties instead of hard-coded lists
Separation of code from localizable data (≈resource bundles)
Avoiding string concatenation, dealing with truncation …
2
Where was the problem? (pause)
[Diagram: View, Upload, Server, Server, Data, Index, DB, dump]
3
More places than you might think
1. Ensure Client App is Unicode
Windows, don’t use ANSI
2. Prevent Encoding Mismatches
charset before web form params
3. Allow full Unicode identifiers
File names,…
4. Ensure Uniform Segmentation
Word ≠ [0-9a-zA-Z]+
5. Watch for hidden assumptions
Cp1252 corrupting bytes
6. Title requirement of 3+ chars is OK for English, but not Chinese (狗)
[Diagram: the same pipeline (View, Upload, Server, Server, Data, Index, DB, dump), annotated with check marks and the markers ❶ to ❺ for the problems listed above]
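Points 4 and 6 can be illustrated with a short Python sketch. It is an illustration only, not code from the deck: the sample strings, the function names and the 3-character rule check are assumptions made for the example.

```python
import re

# The hard-coded ASCII list that point 4 warns about.
ASCII_WORD = re.compile(r"[0-9a-zA-Z]+")
# For str patterns, \w matches Unicode letters, digits and underscore.
UNICODE_WORD = re.compile(r"\w+")


def ascii_words(text: str) -> list:
    """Split using the ASCII-only definition of a word."""
    return ASCII_WORD.findall(text)


def unicode_words(text: str) -> list:
    """Split using Unicode character properties."""
    return UNICODE_WORD.findall(text)


def title_long_enough(title: str, minimum: int = 3) -> bool:
    """Hypothetical length rule like the one in point 6."""
    return len(title) >= minimum


sample = "na\u00efve caf\u00e9"          # "naïve café", precomposed letters
print(ascii_words(sample))               # the accented letters split the words
print(unicode_words(sample))             # both words survive intact
print(title_long_enough("\u72d7"))       # a one-character Chinese title is rejected
```

Note that `\w+` only fixes the letter class. It does not segment text written without spaces, such as Chinese or Thai.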
4
Just a few extra challenges…
Massive amounts of data
Much web cruft to deal with
Very short release cycles
Many product × language/locale pairs (next slide)
5
Locale × Product Versions
http://googleblog.blogspot.com/2008/07/hitting-40-languages.html
6
Translation
Professional Vendors, Contractors, Volunteers
7
Translation Strategies
Normal Translation Memory
Multiple, very short release cycles
Weeks, not months
Product Alternatives for new features
A. Delay release until completely translated
B. Disable new features until translated
C. Accept some English strings in new features
8
Int’l Strategy: Unicode Zone
[Diagram: Converters, Non-Unicode, Unicode, Unicode Zone, Validation]
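One reading of the diagram is that non-Unicode data is converted, and all data validated, at the edge of the zone, so that code inside it only ever sees clean Unicode. A minimal Python sketch under that assumption follows. The function name `enter_unicode_zone` and the choice to reject unassigned code points and noncharacters are hypothetical.

```python
import unicodedata


def enter_unicode_zone(raw: bytes, charset: str) -> str:
    """Convert bytes to text at the boundary and validate the result.

    Raises UnicodeDecodeError (a ValueError) for bytes that are ill-formed
    in the given charset, LookupError for an unknown charset name, and
    ValueError for unassigned code points or noncharacters.
    """
    text = raw.decode(charset, errors="strict")   # the "converter" step
    for ch in text:                               # the "validation" step
        # Cn = unassigned or noncharacter, Cs = surrogate.
        if unicodedata.category(ch) in ("Cn", "Cs"):
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return text


print(enter_unicode_zone("b\u00fccher".encode("latin-1"), "latin-1"))
```

What counts as unassigned depends on the Unicode version of the character database in use, which is one reason to keep that data current.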
9
Both forms of Unicode
UTF-8: C++, Python
Mixture of char*, STL string, new robust class
UTF-8 is particularly good storage for the web (more later)
UTF-16: Java, Windows, JavaScript, Mac
Libraries / Data
ICU, Joda Time, Internal libraries
Unicode Character Database, Unicode Locales (CLDR)
TZDB, ISO 4217 (currencies) – time sensitive
Update to new versions (e.g. Unicode 5.1) asap
10
Stable Identifiers
Unicode identifiers
Language/Locale, Script, Region, Currency, Timezone
based on BCP47, ISO 4217, TZDB
Required: unique, stable
CS = Czechoslovakia? Serbia & Montenegro?
Serbia = CS? = RS?
London is in UK? GB?
Google Valid:
Canonical: US, iw
Noncanonical: SU, he (deprecated / not preferred)
Google Disallowed*:
Private Use: XA
Unassigned: BB
Ill-Formed: B1
Variants: i-tao, en-SCOUSE
11
User’s Locale / Language
Needed to improve quality
Locale = Language + (possibly) other info
Known if user is Signed In
Heuristics where not Signed In.
IP Address
Accept-Language
Country from Accept-Language
Domain,…
12
Normalizing Languages/Locales
Based on Unicode locale data (CLDR)
zh, und-CN, und-Hans,… ≃ zh
zh-TW, zh-Hant,… ≃ zh-TW
en, und-Latn, und-US,… ≃ en
en-GB, en-Latn-GB,… ≃ en-GB
he-IL, iw-IL, he-Hebr, he,… ≃ iw
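The equivalences above can be sketched as a table lookup in Python. The table below is a hand-copied stand-in for real CLDR data and holds only the examples from this slide. The function names and the case-fixing rules are assumptions for illustration.

```python
from typing import Optional

# Stand-in for CLDR-derived data: only the examples listed on the slide.
CANONICAL = {
    "zh": "zh", "und-CN": "zh", "und-Hans": "zh",
    "zh-TW": "zh-TW", "zh-Hant": "zh-TW",
    "en": "en", "und-Latn": "en", "und-US": "en",
    "en-GB": "en-GB", "en-Latn-GB": "en-GB",
    "he-IL": "iw", "iw-IL": "iw", "he-Hebr": "iw", "he": "iw",
}


def canonical_case(tag: str) -> str:
    """Apply the usual casing: language lower, script Title, region UPPER."""
    parts = tag.strip().replace("_", "-").split("-")
    out = [parts[0].lower()]
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            out.append(part.title())      # script, e.g. Hant
        elif len(part) == 2 and part.isalpha():
            out.append(part.upper())      # region, e.g. TW
        else:
            out.append(part.lower())
    return "-".join(out)


def normalize_locale(tag: str) -> Optional[str]:
    """Return the normalized form, or None if the tag is not in the table."""
    return CANONICAL.get(canonical_case(tag))


print(normalize_locale("zh_hant"))   # same class as zh-TW
print(normalize_locale("HE-il"))     # same class as iw
```

A production version would derive the table from CLDR rather than listing tags by hand.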
13
Matching Languages/Locales
• Input: User’s requested languages, our supported languages
• Output: “best” supported language
• Need better match than truncation
• A “distance” metric on normalized languages
– Language, then script, then country
– Plus special information: hr vs bs, no vs nn, ro vs mo, tl vs fil
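A minimal Python sketch of such a distance metric follows. The weights, the `Locale` type and the way the special pairs are scored are all assumptions chosen for illustration. The slide gives only the ordering (language, then script, then country) and the list of special pairs.

```python
from typing import NamedTuple, Sequence


class Locale(NamedTuple):
    language: str
    script: str
    region: str


# Hypothetical weights: a language mismatch costs most, then script, then region.
LANGUAGE_COST, SCRIPT_COST, REGION_COST = 100, 10, 1
# Hypothetical reduced cost for the special pairs named on the slide.
CLOSE_LANGUAGE_COST = 20
CLOSE_LANGUAGES = {
    frozenset(("hr", "bs")), frozenset(("no", "nn")),
    frozenset(("ro", "mo")), frozenset(("tl", "fil")),
}


def distance(a: Locale, b: Locale) -> int:
    """Smaller is closer; 0 means identical."""
    d = 0
    if a.language != b.language:
        close = frozenset((a.language, b.language)) in CLOSE_LANGUAGES
        d += CLOSE_LANGUAGE_COST if close else LANGUAGE_COST
    if a.script != b.script:
        d += SCRIPT_COST
    if a.region != b.region:
        d += REGION_COST
    return d


def best_match(requested: Sequence[Locale], supported: Sequence[Locale]) -> Locale:
    """Pick the supported locale nearest to any request; earlier requests win ties."""
    candidates = [
        (distance(r, s), rank, s)
        for rank, r in enumerate(requested)
        for s in supported
    ]
    return min(candidates)[2]


SUPPORTED = [
    Locale("en", "Latn", "US"), Locale("zh", "Hans", "CN"),
    Locale("zh", "Hant", "TW"), Locale("hr", "Latn", "HR"),
]
# Truncating zh-Hant-HK to "zh" would suggest Simplified Chinese;
# the distance metric prefers the Traditional Chinese locale instead.
print(best_match([Locale("zh", "Hant", "HK")], SUPPORTED))
```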
14
Web Cruft
• Problems
– Bad input: charset, language,…
– Inaccurate detection
– Difficulties in segmentation / morphology
• These are non-trivial
– Pages with conversion errors or unassigned (non-existent) characters: ≈4%
– Multiply that by billions and billions of pages…
15
You didn’t know there was going to be a test…
• How many pages are on the web?
• What’s the most frequent character? Script? (next slides) …
16
Most Web Data
17
Data in Different Scripts
18
Bad Source
• Original page has corrupted data
• Doubly-encoded UTF-8
• Random illegal control codes, unassigned chars
• Forms input data of unknown/wrong encoding
• Mixtures of different charsets, from
– Random pasting in non-Unicode enabled tools
– Page composition (eg server-side includes), mixing charsets
– Indic font encodings
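Doubly-encoded UTF-8 is one of the few cases above that can sometimes be undone mechanically. The Python sketch below assumes the second encoding pass treated the UTF-8 bytes as ISO-8859-1 (Latin-1). In practice windows-1252 is also common, and that case is not handled here. The function names are hypothetical.

```python
from typing import Optional


def double_encode(text: str) -> bytes:
    """Reproduce the bug: UTF-8 bytes misread as Latin-1, then encoded again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair_double_utf8(data: bytes) -> Optional[bytes]:
    """Return singly-encoded UTF-8, or None if data does not look doubly encoded."""
    try:
        once = data.decode("utf-8").encode("latin-1")  # undo the outer layer
        once.decode("utf-8")                           # inner layer must verify too
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None
    return once


bad = double_encode("b\u00fccher")
print(bad)                                      # the doubly-encoded bytes
print(repair_double_utf8(bad).decode("utf-8"))  # the original text again
```

Pure ASCII input passes through unchanged, since it is identical under both readings.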
19
Bad Server
• Server mis-identifies the type or encoding of the page in the HTTP protocol.
– Example: JPEGs served up as text
– Server overrides page with wrong charset
• If you don’t do special detection, you get random junk
– Interpreting a JPEG as windows-1252: not altogether productive…
20
Charset Tagging Trends
http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
21
Encoding Detection
• Pages are so often untagged & mis-tagged:
– Both at HTTP and HTML levels
– And what happens if they differ?
• We have to heuristically detect the “real” character encoding
• Need to do better than the browser
– In the browser, the user can adjust a bad guess
• UTF-8 source is the safest, but still must be verified
Bad codes: {charset}, _charset, En, TW, …
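The points about verifying UTF-8 and distrusting labels can be sketched in Python as below. This is a deliberately naive policy, not the detector described in the talk: the function names, the order in which labels are tried and the windows-1252 fallback are all assumptions.

```python
import codecs
from typing import Optional


def is_well_formed_utf8(data: bytes) -> bool:
    """UTF-8 must be verified, not assumed."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def usable(label: Optional[str], data: bytes) -> bool:
    """A label is usable if it names a known text codec that decodes the data."""
    if not label:
        return False
    try:
        codecs.lookup(label.strip())
        data.decode(label.strip(), errors="strict")
    except (LookupError, ValueError):   # unknown label, or bytes do not fit
        return False
    return True


def pick_charset(data: bytes, http_charset: Optional[str],
                 html_charset: Optional[str],
                 fallback: str = "windows-1252") -> str:
    """Hypothetical policy: verified UTF-8 first, then HTTP, then HTML, then fallback."""
    if is_well_formed_utf8(data):
        return "utf-8"
    for label in (http_charset, html_charset):
        if usable(label, data):
            return codecs.lookup(label.strip()).name
    return fallback


print(pick_charset("b\u00fccher".encode("utf-8"), "{charset}", None))
print(pick_charset("b\u00fccher".encode("latin-1"), "_charset", "ISO-8859-1"))
```

A real detector would also weigh byte statistics and language evidence. This sketch shows only the verification and label-checking steps.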
22
Attacks!
• Cross-site scripting (XSS)
– Don’t treat ill-formed UTF-8 as space (or syntax)
• <p id=abc�onMouseOver=evilDoers()…
– Don’t swallow valid characters after ill-formed
• …q="�>onMouseOver=…
– Don’t allow UTF-7, UTF-16 as output encodings
• Browsers often mis-detect, and allow XSS.
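The first two rules can be checked against a decoder directly. The Python sketch below uses the standard `errors="replace"` handler, which substitutes U+FFFD for each ill-formed sequence and then resumes decoding at the next byte. The function name and the sample byte strings are assumptions modelled on the examples above.

```python
REPLACEMENT = "\ufffd"


def decode_untrusted(data: bytes) -> str:
    """Decode UTF-8, marking ill-formed sequences visibly instead of dropping them."""
    return data.decode("utf-8", errors="replace")


# A stray 0xFF byte must not become whitespace that separates attributes.
attr = decode_untrusted(b"<p id=abc\xffonMouseOver=evilDoers()")
print(REPLACEMENT in attr, " " not in attr.split("=", 1)[1])

# A truncated lead byte (0xC2) must not swallow the '>' that follows it.
query = decode_untrusted(b'q="\xc2>onMouseOver=')
print('"' + REPLACEMENT + ">" in query)
```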
23
Spamming/Spoofing
• IDNA Spoofing: “paypal.com”
• Spamming: need to detect equivalences
– http://spamsource.cn
– http://spamsource．cn (fullwidth dot)
– http://bücher.de
– http://xn--bcher-kva.de
– http://b%C3%BCcher.de
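The equivalences listed above can be collapsed to one form with Python's standard library. This is a sketch only: the function name `canonical_host` is hypothetical, and the built-in `idna` codec implements the older IDNA 2003 rules. It also does not address look-alike (confusable) characters such as those used in the “paypal.com” spoof.

```python
import unicodedata
from urllib.parse import unquote


def canonical_host(host: str) -> str:
    """Map equivalent spellings of a host name to a single ASCII form."""
    host = unquote(host)                        # %C3%BC -> ü (UTF-8 percent-escapes)
    host = unicodedata.normalize("NFKC", host)  # fullwidth dot -> '.'
    host = host.lower()
    return host.encode("idna").decode("ascii")  # non-ASCII labels -> xn-- form


for spelling in ("b\u00fccher.de", "xn--bcher-kva.de", "b%C3%BCcher.de"):
    print(canonical_host(spelling))             # all three print the same host
print(canonical_host("spamsource\uff0ecn"))     # fullwidth dot variant
```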
24
Language Detection
• Pages are so often untagged & mis-tagged:
– Both at HTTP and HTML levels
• So, we have to heuristically determine the “real” language
– Unfortunately, detecting language is more complicated than encoding
• Mixtures of languages on same page
• Need to detect short strings, out of context, without encoding
• Needs to happen after entity expansion: &#xxx; → Y
– Fortunately, misdetecting language is way less problematic than encoding
Bad codes: en-securid, English, xl, Chinese, zs, us, eses, en-us., "en-us ", es-es-ts, undefined, espa�ol, utf-8
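The entity-expansion step can be shown with Python's standard `html` module. The sketch assumes detection runs on the expanded text; the function name and the sample strings are made up for the example.

```python
import html


def expand_entities(markup_text: str) -> str:
    """Turn character references into the characters a detector should see."""
    return html.unescape(markup_text)


dog = "\u72d7"                               # the Chinese character used earlier
numeric_reference = f"&#x{ord(dog):X};"      # its hexadecimal character reference
print(expand_entities(numeric_reference) == dog)
print(expand_entities("caf&eacute;"))        # named references expand too
```

Before expansion the text is pure ASCII, so a detector looking at the raw markup would see no non-Latin characters at all.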
25
Non-English Languages
26
Language Tags & Detection
27
If Lang Tags Normalized…
28
Tagged vs Detected
29
Bad HTML
• It's easy to parse valid HTML correctly
• But invalid HTML is not uncommon
– We need to be as good at doing bad HTML as the browsers are
– That is, what the user sees in IE or Firefox is what needs to be indexed
• Illegal characters (controls) sneak in as character entities: 
30
Segmentation Challenges
• Indexing & query: breaking text into words
– ユニコードとは何か→ ユニコード · とは · 何か
• Problems if wrong:
– Source segmented as: |AB|C|
– User searches for “BC”: not found
– Can segment/query multiple ways
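The |AB|C| failure, and one way of segmenting "multiple ways", can be sketched in Python. Overlapping bigrams are used here as an assumed alternative segmentation. The slide does not say which technique was actually used, and the function names are hypothetical.

```python
def word_index(segments: list) -> set:
    """Index exactly the segments the segmenter produced."""
    return set(segments)


def bigram_index(text: str) -> set:
    """Index every overlapping pair of characters as a second segmentation."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


source = "ABC"
segments = ["AB", "C"]                 # the segmenter's (wrong) guess

print("BC" in word_index(segments))    # the query misses the document
print("BC" in bigram_index(source))    # the second segmentation finds it
```

The trade-off is index size and precision: indexing every bigram matches more queries, including some that cross real word boundaries.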
31
Thai Segmentation
• คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข
– Before segmentation (2007-03): 10 hits
– After segmentation: 300,000+ hits!
• Spaces in query still make a difference
– คอมพิวเตอร์จะเกี่ยวข้องกับเรื่องของตัวเลข acts as a complete phrase, equals:
– “คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข”
32
Morphology Challenges
• Varies by language
• Stopwords, phrases: a, the,…
• Diacriticals: sasa → saša, sasha
• Decompounding: Abiball → abiball OR abi ball
• “Forms” of a word: go → gone, went, …
• Synonyms: car shopping → auto shopping
• …
33
Correcting User Typing
• Users may be on keyboard without accents, or expect transliteration
– Types “Sasha” or “Sasa” or “Саша” for “Saša”
• Misspellings
34
Character folding
• Avoid spurious input differences
– “ﬁnancial” (ﬁ ligature, PDF)
• Normalize with:
– NFC + subset of NFKC + UCA + others
• Suppress display
– “➠”
#  Original Term  Index Term
1  ➠              omit
2  SHY            omit
3  ىلص            صلى
4  ₩              ₩
5  Can’t          can't
6  ﬁ              fi
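A few of the folding rules in the table can be sketched in Python. This simplifies the recipe on the slide: it applies full NFKC rather than "NFC + subset of NFKC + UCA + others", and covers only rows 2, 5 and 6. The function name and the punctuation table are assumptions.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"
# Hypothetical punctuation folding: curly apostrophe to straight apostrophe.
PUNCTUATION = {"\u2019": "'"}


def fold(term: str) -> str:
    """Fold an original term to an index term (simplified)."""
    term = unicodedata.normalize("NFKC", term)   # e.g. the fi ligature -> "fi"
    term = term.replace(SOFT_HYPHEN, "")         # SHY is omitted
    for original, replacement in PUNCTUATION.items():
        term = term.replace(original, replacement)
    return term.casefold()                       # case differences are spurious too


print(fold("\ufb01nancial"))        # ligature from a PDF
print(fold("Can\u2019t"))           # curly apostrophe and capital letter
print(fold("co\u00adoperate"))      # invisible soft hyphen
```

Applying the same function to both documents and queries is what makes the two sides meet in the index.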
35
SW Globalization at
Mark Davis
Q&A
37
In Action
• Indexing stores canonicalized originals
– … Fishing … ro◌̂les →
– … fishing … rôles
• Query expanded to variants
– fish → fish|fishing
– rôle → role|rôle|roles|rôles
• Expansions may be language-dependent
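The two halves above, canonicalizing at index time and expanding at query time, can be sketched in Python. The inflection table is a hypothetical stand-in for real, language-dependent morphology data, and stripping accents through NFD is an assumed way of generating the unaccented variants.

```python
import unicodedata

# Hypothetical inflection data; real expansions would be language-dependent.
INFLECTIONS = {"fish": ["fishing"], "r\u00f4le": ["r\u00f4les"]}


def canonicalize(term: str) -> str:
    """Index-time form: composed (NFC) and lowercased."""
    return unicodedata.normalize("NFC", term).lower()


def strip_accents(term: str) -> str:
    """Drop combining marks after decomposing."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def expand(query: str) -> list:
    """Query-time variants: inflected forms, each with and without accents."""
    base = canonicalize(query)
    variants = set()
    for form in [base] + INFLECTIONS.get(base, []):
        variants.add(form)
        variants.add(strip_accents(form))
    return sorted(variants)


print(canonicalize("Fishing"))
print("|".join(expand("fish")))
print("|".join(expand("ro\u0302le")))   # decomposed input, same expansion
```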
38
Freeform Parsing
39