Unicode in

2008Q3

Mark Davis, Vladimir Weinstein, Andy Heninger

Page 1: Unicode in

Unicode in

2008Q3

Mark Davis, Vladimir Weinstein, Andy Heninger

Page 2: Unicode in

Standard SW Globalization

Data Handling

Date, Time, Number Formatting

Collation

Locales/Languages

Timezones & Calendars,…

General Internationalization

Using character properties instead of hard-coded lists

Separation of code from localizable data (≈resource bundles)

Avoiding string concatenation, dealing with truncation …


Page 3: Unicode in

Where was the problem? (pause)

[Diagram: View, Upload, Server, Server, Data, Index, DB, dump]

Page 4: Unicode in

More places than you might think

1. Ensure Client App is Unicode

Windows, don’t use ANSI

2. Prevent Encoding Mismatches

charset before web form params

3. Allow full Unicode identifiers

File names,…

4. Ensure Uniform Segmentation

Word ≠ [0-9a-zA-Z]+

5. Watch for hidden assumptions

Cp1252 corrupting bytes

6. Title requirement 3+ chars ok for English, but not Chinese ( 狗 )
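Points 4 and 6 can be illustrated with a short Python sketch. The sample phrase is an assumption chosen for illustration; only 狗 comes from the slide. Note that `\w+` is still not real word segmentation for languages written without spaces; it only shows why an ASCII-only pattern breaks.

```python
import re

# The hard-coded pattern from the slide, and a Unicode-aware alternative.
ASCII_WORD = re.compile(r"[0-9a-zA-Z]+")
UNICODE_WORD = re.compile(r"\w+")  # for str patterns, \w covers Unicode letters

text = "na\u00efve caf\u00e9"  # "naïve café" (illustrative input, precomposed)

ascii_words = ASCII_WORD.findall(text)      # splits at every non-ASCII letter
unicode_words = UNICODE_WORD.findall(text)  # keeps each word whole

# A "3+ characters" title rule rejects a valid one-character Chinese title.
title = "\u72d7"  # 狗
title_length = len(title)
```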

[Diagram: View, Upload, Server, Server, Data, Index, DB, dump, annotated with markers ❶, ❷, ❹, ❺]

Page 5: Unicode in

Just a few extra challenges…

Massive amounts of data

Much web cruft to deal with

Very short release cycles

Many product × language/locale pairs (next slide)


Page 6: Unicode in

Locale × Product Versions

http://googleblog.blogspot.com/2008/07/hitting-40-languages.html


Page 7: Unicode in

Translation

Professional Vendors, Contractors, Volunteers


Page 8: Unicode in

Translation Strategies

Normal Translation Memory

Multiple, very short release cycles

Weeks, not months

Product Alternatives for new features

A. Delay release until completely translated

B. Disable new features until translated

C. Accept some English strings in new features


Page 9: Unicode in

Int’l Strategy: Unicode Zone

Converters

Non-Unicode

Unicode

Unicode Zone

Validation
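A minimal sketch of a gate into the Unicode zone, assuming Python and a hypothetical function name `enter_unicode_zone`. It converts from the declared charset strictly, then validates the result. The exact validation rules (which controls are allowed, how unassigned code points are treated) are assumptions, not the rules used in the talk.

```python
import unicodedata

class RejectedInput(ValueError):
    """Raised when input cannot safely enter the Unicode zone."""

# Assumption: only these control characters are acceptable in text.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def enter_unicode_zone(raw: bytes, declared_charset: str) -> str:
    """Convert non-Unicode bytes to a str, then validate every code point."""
    try:
        text = raw.decode(declared_charset, errors="strict")
    except (UnicodeDecodeError, LookupError) as exc:
        raise RejectedInput(f"cannot convert from {declared_charset!r}") from exc
    for ch in text:
        category = unicodedata.category(ch)
        # Cn = unassigned (per this Python's Unicode data), Cs = surrogate.
        if category in ("Cn", "Cs"):
            raise RejectedInput(f"disallowed code point U+{ord(ch):04X}")
        if category == "Cc" and ch not in ALLOWED_CONTROLS:
            raise RejectedInput(f"control character U+{ord(ch):04X}")
    return text
```

Everything past this gate can then assume well-formed Unicode text; anything that fails is rejected or routed to encoding detection.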


Page 10: Unicode in

Both forms of Unicode

UTF-8: C++, Python

Mixture of char*, STL string, new robust class

UTF-8 is particularly good storage for the web (more later)

UTF-16: Java, Windows, JavaScript, Mac

Libraries / Data

ICU, Joda Time, Internal libraries

Unicode Character Database, Unicode Locales (CLDR)

TZDB, ISO 4217 (currencies) – time sensitive

Update to new versions (e.g. Unicode 5.1) as soon as possible

Page 11: Unicode in

Stable Identifiers

Unicode identifiers

Language/Locale, Script, Region, Currency, Timezone

based on BCP47, ISO 4217, TZDB

Required: unique, stable

CS = Czechoslovakia? Serbia & Montenegro?

Serbia = CS? = RS?

London is in UK? GB?

Google Valid:

Canonical: US, iw

Noncanonical: SU, he (deprecated / not preferred)

Google Disallowed*:

Private Use: XA

Unassigned: BB

Ill-Formed: B1

Variants: i-tao, en-SCOUSE

Page 12: Unicode in

User’s Locale / Language

Needed to improve quality

Locale = Language + (possibly) other info

Known if user is Signed In

Heuristics where not Signed In.

IP Address

Accept-Language

Country from Accept-Language

Domain,…
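As an illustration of the Accept-Language heuristic, here is a small Python sketch that orders the tags in a header by their q-values. The function name is hypothetical, and the parsing is simplified; a production parser would follow the HTTP specification more closely.

```python
from typing import List

def parse_accept_language(header: str) -> List[str]:
    """Return language tags from an Accept-Language header, best first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparsable weight as "not wanted"
        if quality > 0:
            # Sort by descending quality, then by original position.
            entries.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(entries)]
```

The country can then be read off a tag such as `en-gb`, which is the third heuristic on the slide.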


Page 13: Unicode in

Normalizing Languages/Locales

Based on Unicode locale data (CLDR)

zh, und-CN, und-Hans,… ≃ zh

zh-TW, zh-Hant,… ≃ zh-TW

en, und-Latn, und-US,… ≃ en

en-GB, en-Latn-GB,… ≃ en-GB

he-IL, iw-IL, he-Hebr, he,… ≃ iw
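The equivalences above can be pictured as a lookup, sketched here in Python. This toy table holds only the examples from the slide; a real implementation would presumably derive the mapping from CLDR data rather than hard-code it, and the function name is hypothetical.

```python
# Only the example equivalences listed on the slide; not a complete table.
_EXAMPLES = {
    "zh": "zh", "und-CN": "zh", "und-Hans": "zh",
    "zh-TW": "zh-TW", "zh-Hant": "zh-TW",
    "en": "en", "und-Latn": "en", "und-US": "en",
    "en-GB": "en-GB", "en-Latn-GB": "en-GB",
    "he-IL": "iw", "iw-IL": "iw", "he-Hebr": "iw", "he": "iw",
}
# Language tags compare case-insensitively, so key the table on lowercase.
_NORMALIZED = {tag.lower(): target for tag, target in _EXAMPLES.items()}

def normalize_locale(tag: str) -> str:
    """Map a tag to its normalized form; unknown tags pass through unchanged."""
    key = tag.replace("_", "-").lower()
    return _NORMALIZED.get(key, tag)
```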


Page 14: Unicode in

Matching Languages/Locales

• Input: User’s requested languages, our supported languages

• Output: “best” supported language

• Need better match than truncation

• A “distance” metric on normalized languages

– Language, then script, then country

– Plus special information: hr vs bs, no vs nn, ro vs mo, tl vs fil
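A minimal Python sketch of such a distance metric. The penalty values and all names are assumptions made for illustration; the slide gives only the ordering (language, then script, then country) and the special pairs.

```python
from typing import List, NamedTuple

class Locale(NamedTuple):
    language: str
    script: str
    region: str

# Hypothetical weights: language matters most, then script, then region.
LANGUAGE_PENALTY = 100
RELATED_LANGUAGE_PENALTY = 20
SCRIPT_PENALTY = 10
REGION_PENALTY = 1

# The "special information" pairs from the slide, treated as close languages.
RELATED_LANGUAGES = {
    frozenset(pair)
    for pair in [("hr", "bs"), ("no", "nn"), ("ro", "mo"), ("tl", "fil")]
}

def distance(wanted: Locale, offered: Locale) -> int:
    """Smaller is better; 0 means an exact match."""
    total = 0
    if wanted.language != offered.language:
        related = frozenset((wanted.language, offered.language)) in RELATED_LANGUAGES
        total += RELATED_LANGUAGE_PENALTY if related else LANGUAGE_PENALTY
    if wanted.script != offered.script:
        total += SCRIPT_PENALTY
    if wanted.region != offered.region:
        total += REGION_PENALTY
    return total

def best_match(requested: List[Locale], supported: List[Locale]) -> Locale:
    """Pick the supported locale closest to any requested locale."""
    return min(supported, key=lambda offered: min(distance(w, offered) for w in requested))
```

Truncation would never map a request for nn to a supported no, whereas the distance metric does. This sketch ignores the user's preference order among requested languages, which a fuller version would weigh in.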


Page 15: Unicode in

Web Cruft

• Problems

– Bad input: charset, language,…

– Inaccurate detection

– Difficulties in segmentation / morphology

• These are non-trivial

– Pages with conversion errors or unassigned (non-existent) characters: ≈4%

– Multiply that by billions and billions of pages…


Page 16: Unicode in

You didn’t know there was going to be a test…

• How many pages are on the web?

• What’s the most frequent character? Script? (next slides) …


Page 17: Unicode in

Most Web Data


Page 18: Unicode in

Data in Different Scripts


Page 19: Unicode in

Bad Source

• Original page has corrupted data

• Doubly-encoded UTF-8

• Random illegal control codes, unassigned chars

• Forms input data of unknown/wrong encoding

• Mixtures of different charsets, from

– Random pasting in non-Unicode enabled tools

– Page composition (eg server-side includes), mixing charsets

– Indic font encodings
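Doubly-encoded UTF-8 typically arises when UTF-8 bytes are misread as a single-byte charset and then encoded as UTF-8 again. A Python sketch of one possible repair follows, assuming the intermediate charset was windows-1252 (an assumption; other charsets occur too). The function name is hypothetical.

```python
def repair_double_utf8(text: str) -> str:
    """Undo one round of 'UTF-8 read as windows-1252', if that is plausible."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8: leave it alone.
        return text
```

This is a heuristic: legitimate text that happens to look like mojibake would also be "repaired", so real systems would combine it with other evidence.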


Page 20: Unicode in

Bad Server

• Server mis-identifies the type or encoding of the page in the HTTP protocol.

– Example: JPEGs served up as text

– Server overrides page with wrong charset

• If you don’t do special detection, you get random junk

– Interpreting a JPEG as windows-1252: not altogether productive…

Page 21: Unicode in

Charset Tagging Trends

http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

Page 22: Unicode in

Encoding Detection

• Pages are so often untagged & mis-tagged:

– Both at HTTP and HTML levels

– And what happens if they differ?

• We have to heuristically detect the “real” character encoding

• Need to do better than the browser

– In the browser, the user can adjust a bad guess

• UTF-8 source is the safest, but still must be verified

Bad codes

{charset}_charsetEnTW…

Page 23: Unicode in

Attacks!

• Cross-site scripting (XSS)

– Don’t treat ill-formed UTF-8 as space (or syntax)

• <p id=abc�onMouseOver=evilDoers()…

– Don’t swallow valid characters after ill-formed

• …q="�>onMouseOver=…

– Don’t allow UTF-7, UTF-16 as output encodings

• Browsers often mis-detect, and allow XSS.
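The first two rules can be demonstrated with Python's built-in UTF-8 decoder, used here only as an example of a decoder that behaves this way. With `errors="replace"`, each ill-formed sequence becomes U+FFFD, never a space, and the valid bytes that follow are kept. The byte strings are modeled on the slide's examples; the function name is hypothetical.

```python
def decode_untrusted(raw: bytes) -> str:
    """Decode UTF-8, marking ill-formed sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

# A stray continuation byte must not turn into whitespace between attributes.
attribute_case = decode_untrusted(b"<p id=abc\x80onMouseOver=evilDoers()")

# A truncated lead byte must not swallow the '>' that follows it.
swallow_case = decode_untrusted(b'q="\xc2>onMouseOver=')
```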


Page 24: Unicode in

Spamming/Spoofing

• IDNA Spoofing: “paypal.com”

• Spamming: need to detect equivalences

– http://spamsource.cn

– http://spamsource．cn (fullwidth dot)

– http://bücher.de

– http://xn--bcher-kva.de

– http://b%C3%BCcher.de
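A sketch of reducing these host variants to one comparable form, using only the Python standard library. The function name is hypothetical, and the built-in `idna` codec implements the older IDNA 2003 rules, so this illustrates the idea rather than a complete solution.

```python
import unicodedata
from urllib.parse import unquote

def canonical_host(host: str) -> str:
    """Reduce equivalent spellings of a host name to one ASCII form."""
    host = unquote(host)                         # b%C3%BCcher.de -> bücher.de
    host = unicodedata.normalize("NFKC", host)   # fullwidth dot -> "."
    host = host.lower()
    return host.encode("idna").decode("ascii")   # bücher.de -> xn--bcher-kva.de
```

With this, the three bücher.de spellings compare equal, as do the two spamsource.cn spellings.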


Page 25: Unicode in

Language Detection

• Pages are so often untagged & mis-tagged:

– Both at HTTP and HTML levels

• So, we have to heuristically determine the “real” language

– Unfortunately, detecting language is more complicated than encoding

• Mixtures of languages on same page

• Need to detect short strings, out of context, without encoding

• Needs to happen after entity expansion: &#xxx; → Y

– Fortunately, misdetecting language is way less problematic than encoding

Bad codes

en-securid, English, xl, Chinese, zs, us, es, es, en-us., "en-us ", es-es-ts, undefined, espa�ol, utf-8

Page 26: Unicode in

Non-English Languages

Page 27: Unicode in

Language Tags & Detection


Page 28: Unicode in

If Lang Tags Normalized…


Page 29: Unicode in

Tagged vs Detected

Page 30: Unicode in

Bad HTML

• It's easy to parse valid HTML correctly

• But invalid HTML is not uncommon

– We need to be as good at doing bad HTML as the browsers are

– That is, what the user sees in IE or Firefox is what needs to be indexed

• Illegal characters (controls) sneak in as character entities: &#x1E;


Page 31: Unicode in

Segmentation Challenges

• Indexing & query: breaking text into words

– ユニコードとは何か→ ユニコード · とは · 何か

• Problems if wrong:

– Source segmented as: |AB|C|

– User searches for “BC”: not found

– Can segment/query multiple ways
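One way to read "segment/query multiple ways" is to index overlapping character bigrams alongside the segmenter's words. That reading is an assumption; the slide does not say which alternatives were used. A Python sketch with a hypothetical `index_terms` helper:

```python
from typing import List, Set

def index_terms(text: str, segments: List[str]) -> Set[str]:
    """Index the segmenter's words plus every overlapping character bigram."""
    bigrams = {text[i:i + 2] for i in range(len(text) - 1)}
    return set(segments) | bigrams

# The slide's example: the source "ABC" was segmented as |AB|C|.
words_only = {"AB", "C"}
with_bigrams = index_terms("ABC", ["AB", "C"])
```

A query for "BC" misses the words-only index but hits the bigram-augmented one, at the cost of a larger index.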


Page 32: Unicode in

Thai Segmentation

• คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข

– Before segmentation (2007-03): 10 hits

– After segmentation: 300,000+ hits!

• Spaces in query still make a difference

– คอมพิวเตอร์จะเกี่ยวข้องกับเรื่องของตัวเลข

acts as a complete phrase, equals:

– “คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข”

Page 33: Unicode in

Morphology Challenges

• Varies by language

• Stopwords, phrases: a, the,…

• Diacriticals: sasa → saša, sasha

• Decompounding: Abiball → abiball OR abi ball

• “Forms” of a word: go → gone, went, …

• Synonyms: car shopping → auto shopping

• …
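The diacriticals case can be sketched in Python by keying the index on an accent-stripped form, so that a query for sasa also reaches saša. This is one simple approach assumed for illustration; it does not cover transliterations such as sasha, which need language-specific data.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Toy index keyed on the folded form; values are the original spellings seen.
index = {}
for word in ["sa\u0161a", "sasa"]:  # "saša", "sasa"
    index.setdefault(strip_diacritics(word), set()).add(word)

hits = index[strip_diacritics("sasa")]
```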


Page 34: Unicode in

Correcting User Typing

• Users may be on keyboard without accents, or expect transliteration

– Types “Sasha” or “Sasa” or “Саша” for “Saša”

• Misspellings


Page 35: Unicode in

Character folding

• Avoid spurious input differences

– “ﬁnancial” (ﬁ ligature, from PDF)

• Normalize with:

– NFC + subset of NFKC + UCA + others

• Suppress display

– “➠”

Original Term → Index Term

1. ➠ → omit

2. SHY → omit

3. ىلص → صلى

4. ₩ → ₩

5. Can’t → can't

6. ﬁ → fi
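A rough Python sketch of this kind of folding. It applies full NFKC for simplicity, whereas the slide calls for NFC plus only a subset of NFKC, UCA and other rules, so treat it as an approximation. The function name and the specific extra rules are assumptions.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"       # SHY: invisible, should not affect matching
CURLY_APOSTROPHE = "\u2019"  # as in "Can’t"

def fold_for_index(term: str) -> str:
    """Fold spurious differences so equivalent inputs share one index term."""
    text = unicodedata.normalize("NFKC", term)  # e.g. the fi ligature -> "fi"
    text = text.replace(SOFT_HYPHEN, "")        # NFKC leaves SHY in place
    text = text.replace(CURLY_APOSTROPHE, "'")
    return text.casefold()
```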


Page 36: Unicode in

SW Globalization

at

Mark Davis

Page 37: Unicode in

Q&A


Page 38: Unicode in

In Action

• Indexing stores canonicalized originals

– … Fishing … ro◌̂lles →

– … fishing … rôles

• Query expanded to variants

– fish → fish|fishing

– rôle → role|rôle|roles|rôles

• Expansions may be language-dependent


Page 39: Unicode in

Freeform Parsing
