Collation in ICU

Collation in ICU Mark Davis Chief SW Globalization Architect IBM Globalization Center of Competency

Post on 10-Feb-2016

TRANSCRIPT

Page 1: Collation in ICU

Collation in ICU

Mark Davis
Chief SW Globalization Architect

IBM Globalization Center of Competency

Page 2: Collation in ICU

San Jose, California, 04/22/23, 22nd International Unicode Conference

Collation = Sorting Order

How hard can it be?
  A < B < C < …
Complications
  Languages are complex and varied
  Unicode is a big set of characters
  Performance is crucial

Page 3: Collation in ICU

Varies By:

Language
  Swedish: z < ö
  German: ö < z

Usage
  German dictionary: of < öf
  German telephone book: öf < of

Customizations
  A < a
  a < A
Versioning
  Fixes
  New Gov. Stds
  New Characters

Page 4: Collation in ICU

Strength Levels
1. Base characters: a < b
2. Accents: as < às < at
   ignored if there is an L1 character difference
3. Case: ao < Ao < aò
   ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
   ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order
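The first three levels can be sketched with a toy key, assuming (for illustration only) that the base letter is whatever remains after canonical decomposition, that the accent level is the list of combining marks, and that lowercase sorts before uppercase. `toy_sort_key` is a hypothetical helper, not an ICU function.

```python
import unicodedata

def toy_sort_key(text):
    """Return a (base, accent, case) key; later levels only break earlier ties."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())      # level 1: base character
        secondary.append(marks)           # level 2: accents
        tertiary.append(base.isupper())   # level 3: case (lowercase first)
    return (primary, secondary, tertiary)

accent_example = sorted(["at", "às", "as"], key=toy_sort_key)
case_example = sorted(["aò", "Ao", "ao"], key=toy_sort_key)
```

Because Python compares tuples element by element, an accent difference is consulted only when all base letters tie, and a case difference only when accents tie as well.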

Page 5: Collation in ICU

Context Sensitivity
Contractions
  H < Z, but CZ < CH
Expansions
  OE < Œ < OF
Both
  カー < カイ
  キー > キイ

Page 6: Collation in ICU

Canonical Equivalence

Å ≡ Å ≡ A + º

x + . + ^ ≡ x + ^ + .

ự ≡ u + ’
  ≡ ư + .
  ≡ ụ + ’
  ≡ u + . + ’
  ≡ u + ’ + .
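Canonical equivalence can be checked by comparing normalized forms. The sketch below uses Python's standard `unicodedata` module; the helper name `canonically_equivalent` is chosen here for illustration.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A-ring, the Angstrom sign, and A + combining ring above
ring_forms = ["\u00C5", "\u212B", "A\u030A"]
# u with horn and dot below, spelled five different ways
horn_dot_forms = ["\u1EF1", "\u01B0\u0323", "\u1EE5\u031B",
                  "u\u0323\u031B", "u\u031B\u0323"]

all_ring_equal = all(canonically_equivalent(ring_forms[0], f) for f in ring_forms)
all_horn_equal = all(canonically_equivalent(horn_dot_forms[0], f) for f in horn_dot_forms)
```

Marks with different combining classes (dot below, circumflex) may be reordered freely; two marks of the same class, such as circumflex and acute, may not.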

Page 7: Collation in ICU

Oddities
Normal accents
  cote < coté < côte < côté
  • first accent difference determines order
French accents
  cote < côte < coté < côté
  • last accent difference determines order
Logical Order Exception (Thai, Lao)
  เ ก sorts like ก เ
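The two accent orderings can be reproduced with a toy key that compares the accent level either front to back or back to front. This is only a sketch; `accent_key` and its `backwards` flag are hypothetical names.

```python
import unicodedata

def accent_key(word, backwards=False):
    """Key of (base letters, accents); French-style ordering reverses the accents."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        primary.append(decomposed[0])
        secondary.append(decomposed[1:])
    if backwards:
        secondary.reverse()  # the last accent difference is now compared first
    return (primary, secondary)

words = ["côté", "coté", "côte", "cote"]
normal_order = sorted(words, key=accent_key)
french_order = sorted(words, key=lambda w: accent_key(w, backwards=True))
```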

Page 8: Collation in ICU

Merging Database Fields

F1 = LastName, F2 = FirstName

Sequential: F1, then F2
  diSilva, John
  diSilva, Fred
  di Silva, John
  di Silva, Fred
  dísilva, John
  dísilva, Fred

Weak 1st: F1 (L1), F2
  diSilva, John
  dísilva, John
  di Silva, John
  di Silva, Fred
  diSilva, Fred
  dísilva, Fred

Merged: L1, L2, L3
  diSilva, John
  di Silva, John
  dísilva, John
  diSilva, Fred
  di Silva, Fred
  dísilva, Fred

Page 9: Collation in ICU

Customizations

Parameters that change collation behavior

  Choice of language (locale)
  Runtime choices

Examples to follow

Page 10: Collation in ICU

Parametric Customizations

Strength
  Base
  Base+Accent
  Base+Accent+Case
  &c.
Case
  A < a, or a < A
Punctuation
  di Silva < diSilva, or diSilva < di Silva

Page 11: Collation in ICU

Punctuation (Alternates)

Base Character
  di silva
  di Silva
  Di silva
  Di Silva
  Dickens
  disilva
  diSilva
  Disilva
  DiSilva

Ignorable
  Dickens
  di silva
  disilva
  di Silva
  diSilva
  Di silva
  Disilva
  Di Silva
  DiSilva
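Both orderings above can be reproduced with two toy keys. In this sketch (hypothetical helpers, not ICU's implementation) a space is either a base character that sorts before every letter, or is dropped from the letter and case levels and only breaks ties on a fourth level.

```python
def key_punct_as_base(s):
    """Space is a base character; case breaks ties."""
    return ([ch.lower() for ch in s], [ch.isupper() for ch in s])

def key_punct_ignorable(s):
    """Space is ignored on the letter and case levels, then breaks ties last."""
    letters = [ch for ch in s if ch != " "]
    primary = [ch.lower() for ch in letters]
    tertiary = [ch.isupper() for ch in letters]
    quaternary = [ch != " " for ch in s]  # a space sorts before any letter
    return (primary, tertiary, quaternary)

names = ["DiSilva", "Dickens", "di Silva", "disilva", "Di silva",
         "diSilva", "Di Silva", "Disilva", "di silva"]
as_base = sorted(names, key=key_punct_as_base)
as_ignorable = sorted(names, key=key_punct_ignorable)
```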

Page 12: Collation in ICU

Extended Customizations

User-defined
  “&” ≡ “ampersand”
Merging tailorings
  Iranian + French
Script Order
  b < ב < β < б
  β < b < б < ב
Numbers
  A-10 < A-2
  A-2 < A-10
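The difference between the two number orderings can be sketched in a few lines. `numeric_key` is a hypothetical helper that compares runs of digits by numeric value instead of character by character.

```python
import re

def numeric_key(s):
    """Split into text and digit runs; digit runs compare as integers."""
    parts = re.split(r"(\d+)", s)
    # With one capturing group, odd-indexed parts are always the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

labels = ["A-2", "A-10"]
character_order = sorted(labels)                 # '1' < '2', so A-10 comes first
numeric_order = sorted(labels, key=numeric_key)  # 2 < 10, so A-2 comes first
```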

Page 13: Collation in ICU

Collation also used for:
Searching
  ignore case, accent options
Selection
  Return all records where
  • Jones ≤ name < Smith
Graphemes
  What a user considers a “character”
Regular expressions (Level 3)
  • See UTR #18, UTR #29

Page 14: Collation in ICU

UCA
UTS #10: Unicode Collation Algorithm
  Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  Default ordering: all Unicode code points
  Provides for tailoring to given languages
  Also see: The Unicode Standard, §5.17: Sorting and Searching

Aligned with ISO 14651

Page 15: Collation in ICU

APIs
String Compare
Sort Keys
String Search
Special-Purposes
  Sortkeys that bracket “Smith”
  • X <= Smith* < Y
  Merged sortkeys

Page 16: Collation in ICU

Sort Keys

Transform string into series of bytes which will binary-compare

a: 06 C3 01 20 01 02 00

A: 06 C3 01 20 01 08 00

á: 06 C3 01 20 32 01 02 02 00

ab: 06 C3 06 D7 01 20 20 01 02 02 00

b: 06 D7 01 20 01 02 00

Level 1 Level 2 Level 3
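The byte strings above can be compared directly. The sketch below copies the five keys from this slide and sorts the strings by plain byte comparison; nothing here calls ICU, and the keys themselves are the slide's data.

```python
# Sort keys copied verbatim from the slide, as hex.
SLIDE_KEYS = {
    "a":  "06 C3 01 20 01 02 00",
    "A":  "06 C3 01 20 01 08 00",
    "á":  "06 C3 01 20 32 01 02 02 00",
    "ab": "06 C3 06 D7 01 20 20 01 02 02 00",
    "b":  "06 D7 01 20 01 02 00",
}

keys = {text: bytes.fromhex(hex_key) for text, hex_key in SLIDE_KEYS.items()}

# Ordinary byte-wise comparison of the keys yields the collation order.
ordered = sorted(keys, key=keys.get)
```

The resulting order is a, A, á, ab, b: case is decided in the last level, the accent in the middle level, and the extra letter in the first.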

Page 17: Collation in ICU

String Compare vs. Sort Keys
Same results in either case
SC faster for single comparisons
  average 5 to 10 times!
SK faster for multiple comparisons
  index once
  binary compare many times
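The trade-off can be illustrated with a toy two-level collation (letters, then case). Both functions below are hypothetical sketches: `compare` walks two strings and stops at the first letter difference, while `sort_key` builds a whole key that can be stored and reused.

```python
from functools import cmp_to_key

def sort_key(s):
    """Build the full key once; later comparisons are plain list comparisons."""
    return ([c.lower() for c in s], [c.isupper() for c in s])

def compare(a, b):
    """Incremental comparison: returns as soon as a letter difference is found."""
    case_diff = 0
    for ca, cb in zip(a, b):
        la, lb = ca.lower(), cb.lower()
        if la != lb:
            return -1 if la < lb else 1
        if case_diff == 0 and ca.isupper() != cb.isupper():
            case_diff = 1 if ca.isupper() else -1  # remember first case difference
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    return case_diff

names = ["disilva", "DiSilva", "Dickens", "di", "Di", "b", "A", "a"]
by_keys = sorted(names, key=sort_key)
by_compare = sorted(names, key=cmp_to_key(compare))
```

Both paths give the same order; which is cheaper depends on how many times each string takes part in a comparison.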

Page 18: Collation in ICU

String Search
Naïve Approach
  key matches in target at <x, y> iff target.substring(x, y) ≡ key
Boundary Complications
  Ignorables: “a” matches in “(a)”?
  • at <0,2> & <1, 2> & <0,3> & <1,3>?
  Contractions: “c” matches in “churo”?
  Normalization: “å” matches in “a¸˚”?
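The ignorables complication is easy to reproduce. This sketch applies the naïve definition with a toy key that drops ASCII punctuation and whitespace; `primary_key` and `naive_matches` are hypothetical names.

```python
import string

def primary_key(s):
    """Toy key: lowercase letters only, punctuation and whitespace ignored."""
    return [ch.lower() for ch in s
            if ch not in string.punctuation and not ch.isspace()]

def naive_matches(key, target):
    """Every <x, y> whose substring collates as equal to the key."""
    wanted = primary_key(key)
    return [(x, y)
            for x in range(len(target))
            for y in range(x + 1, len(target) + 1)
            if primary_key(target[x:y]) == wanted]

matches = naive_matches("a", "(a)")
```

All four candidate spans are reported, so a real search API has to choose a policy for where a match begins and ends.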

Page 19: Collation in ICU

WARNING 1: Basics
Not aligned with character set or repertoire
  Latin-1: Swedish and German sorting differ
Not code point (binary) order
  Binary: Z < a < v < w
  English: Z > a
  Swedish: v ≡ w
Not a property of strings
  With same database
  • Swedish user: view/select
  • German user: view/select

Page 20: Collation in ICU

WARNING 2: Operations

Order not preserved under concatenation / substringing

x < y ↛ xz < yz
x < y ↛ zx < zy
xz < yz ↛ x < y
zx < zy ↛ x < y
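A contraction is enough to break the first implication. The sketch assumes a toy tailoring in which "ch" is one collation element that sorts after every plain "c"; `primary_key` is a hypothetical helper.

```python
def primary_key(s):
    """Toy tailoring: 'ch' is a single element sorting after every plain 'c'."""
    key, i = [], 0
    while i < len(s):
        if s.startswith("ch", i):
            key.append(("c", 1))   # contraction: just after 'c'
            i += 2
        else:
            key.append((s[i], 0))
            i += 1
    return key

x, y, z = "c", "cd", "h"
x_before_y = primary_key(x) < primary_key(y)            # c < cd
xz_before_yz = primary_key(x + z) < primary_key(y + z)  # is ch < cdh?
```

Appending "h" turns "c" into the contraction "ch", which now sorts after "cdh" even though "c" sorted before "cd".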

Page 21: Collation in ICU

WARNING 3: Dependence
Collation is a relation over strings
Sort keys embody part of that relation
Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.
  C < CH < D
  May move binary value for D

Page 22: Collation in ICU

WARNING 4: Stability
Stable Sort
  Records with equal comparison come out in original order
  Property of algorithm, not comparison
Semi-Stable Comparison
  x ≠ y → x ≢ y
  Property of comparison, not algorithm
  Degrades performance
  Doesn’t do what people think (or really want)!
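The distinction can be seen with Python's built-in sort, which is stable. In this sketch the records and the case-insensitive key are made up for illustration; the second sort adds a code point tie-breaker to mimic a semi-stable comparison.

```python
# (name, original position) records; the three "smith" variants tie at base level.
records = [("smith", 3), ("Smith", 1), ("SMITH", 2), ("jones", 4)]

# Stable sort: ties keep their input order, whatever the comparison is.
stable = sorted(records, key=lambda r: r[0].lower())

# Semi-stable comparison: unequal strings never tie, so the tie-breaker
# (raw code point order) decides, not the input order.
semi_stable = sorted(records, key=lambda r: (r[0].lower(), r[0]))
```

The semi-stable result reorders the ties by code point, which is usually not the "original order" that people asking for stability have in mind.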

Page 23: Collation in ICU

Implementation Details

Many possible implementations
ICU as example here.

Page 24: Collation in ICU

What is ICU?
Internationalization libraries for C, C++, Java*
Open source – non-viral
Sponsored by IBM
Unicode standard compliant
  full supplementary support
Cross-platform; extensible and customizable
High performance and thread-safe
  Multiple locales in same thread – simultaneously
http://oss.software.ibm.com/icu/

* Sun’s Java licenses an earlier ICU version; ICU4J updates it.

Page 25: Collation in ICU

ICU Features

Unicode text handling

Character set conversions (700+)

Collation & Searching

Locales (170+)

Resource Bundles

Calendar & Time zones

Complex-text layout engine

Breaks: character, word, line, & sentence

Formatting
  Date & time
  Messages
  Numbers & currencies

Transforms
  Normalization
  Casing
  Transliterations

Page 26: Collation in ICU

Java
Sun licensed and includes an early version of ICU collation in Java
Latest ICU Java version:
  Dramatically faster
  Much lower in memory consumption
  Halved sortkey length
  Many additional features

Page 27: Collation in ICU

ICU/Java Collation Architecture
L1-3, contractions, expansions, …
Locale tailorings
Fully rule-based specification
Arbitrary runtime user customizations
  & ‘?’ = ‘question mark’
  & ‘$’ = ‘dollar sign’
  & z < ‘george’

Page 28: Collation in ICU

ICU Collation I

Full UCA compliance
Full supplementary character support
Solid performance
Small sort-keys
Small Memory Footprint

Page 29: Collation in ICU

ICU Collation II

Parametric control
Tailorable to any language
Multiple Versions simultaneously

Page 30: Collation in ICU

Memory Requirements

Flat-file (memory mapped)
  speeds initialization
  reduces memory footprint
  (next slide)
Delta Tailoring
  Single copy of UCA (≈80K)
  Small delta files per locale

Page 31: Collation in ICU

Memory Mappable

Old: separate allocations

New: offsets within mem-map

Page 32: Collation in ICU

Delta Tailoring

[Diagram: lookup of “a”: FR tailoring (found / not found) → UCA (found / not found) → code (synthesized)]
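The fallback chain in the diagram can be sketched as a three-step lookup. All tables, weights, and the formula for synthesized weights below are hypothetical placeholders, not ICU data.

```python
# Hypothetical data: a large shared root table and a small per-locale delta.
UCA_ROOT = {"a": 0x0861, "b": 0x0875}
FR_DELTA = {"œ": 0x0868}  # made-up weight, for illustration only

def lookup_primary(ch, tailoring, root):
    """Try the locale delta, then the shared root table, then synthesize."""
    if ch in tailoring:
        return ("tailoring", tailoring[ch])
    if ch in root:
        return ("root", root[ch])
    # Assumption: unlisted characters get a weight derived from the code point.
    return ("synthesized", 0x10000 + ord(ch))

from_delta = lookup_primary("œ", FR_DELTA, UCA_ROOT)
from_root = lookup_primary("a", FR_DELTA, UCA_ROOT)
synthesized = lookup_primary("\u4e00", FR_DELTA, UCA_ROOT)
```

Only the delta table is specific to a locale, which is why a single copy of the root data can serve every tailoring.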

Page 33: Collation in ICU

Sort Key Compression
Common weights are 1-byte
  Primary, secondary, tertiary, quaternary
Sequences are compressed
UTF-16 Values for “Märk Davis” (22 bytes)
  004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
Sort Key (L3, ignorable punctuation - 19 bytes)
  2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00
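The byte counts on this slide can be checked directly. The sketch assumes the sort key reads as 19 separate bytes, with 01 separating levels and 00 terminating the key, as in the earlier sort key examples.

```python
text = "M\u00e4rk Davis"

# UTF-16 code units including the terminating 0000, two bytes each.
utf16 = (text + "\0").encode("utf-16-be")

# Sort key from the slide, written as individual bytes (assumed split).
sort_key = bytes.fromhex(
    "2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00")

# Drop the 00 terminator and split on the 01 level separator.
levels = sort_key[:-1].split(b"\x01")
level_sizes = [len(level) for level in levels]
```

The first level holds one byte for each of the nine letters (the space is ignorable), while the second and third levels are shorter than nine bytes because runs of common weights are compressed.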

Page 34: Collation in ICU

Simultaneous Multiple Versions

Programs can link against different versions of ICU, simultaneously!
Preserves exact binary order over time.

App

Page 35: Collation in ICU

Performance: Coding
Avoided unnecessary function calls.
  Example: strlen too expensive!
Avoided excess object creation
  Reduce, Reuse, Recycle
Fast-pathed common cases
Used stack memory buffers
  (with expansion if necessary)
Made inner loops as tight as possible

Page 36: Collation in ICU

Performance: Algorithmic

Checks for identical prefixes
Tolerant of most unnormalized text
  invokes normalization rarely
Compressed sort keys
Incremental length/normalization
FCD format

Page 37: Collation in ICU

Fast C or D (FCD)

Accepts all NFD, most NFC, without normalization

X                     FCD   NFC   NFD
A-ring                 Y     Y
Angstrom               Y
A + ring               Y           Y
A + grave              Y           Y
A-ring + grave         Y
A + cedilla + ring     Y           Y
A + ring + cedilla
A-ring + cedilla             Y
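An FCD check needs only canonical combining classes. The sketch below is one way to write it with Python's `unicodedata`, not ICU's code: a string passes if, between adjacent characters, the trailing combining class of the first decomposition does not exceed a non-zero leading class of the next.

```python
import unicodedata

def is_fcd(text):
    """True if decomposing each character in place yields canonically ordered text."""
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        if lead != 0 and prev_trail > lead:
            return False  # marks would need reordering: not FCD
        prev_trail = unicodedata.combining(decomposed[-1])
    return True

RING, GRAVE, CEDILLA = "\u030A", "\u0300", "\u0327"
results = {
    "A-ring": is_fcd("\u00C5"),
    "Angstrom": is_fcd("\u212B"),
    "A + ring": is_fcd("A" + RING),
    "A + grave": is_fcd("A" + GRAVE),
    "A-ring + grave": is_fcd("\u00C5" + GRAVE),
    "A + cedilla + ring": is_fcd("A" + CEDILLA + RING),
    "A + ring + cedilla": is_fcd("A" + RING + CEDILLA),
    "A-ring + cedilla": is_fcd("\u00C5" + CEDILLA),
}
```

Only the last two inputs fail, because the ring (class 230) precedes the cedilla (class 202); text like that is what forces a real normalization pass.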

Page 38: Collation in ICU

Perf: ICU vs. Windows, glibc
Function: Full UCA!
String comparison: comparable
  ≈ 20% worse to 400% better
Sort keys: much shorter
  ≈ half as long
Warning: speed comparisons are approximate!
  Depends on data, parameters, features, CPU

Page 39: Collation in ICU

Perf: ICU vs. Java
Function: Full UCA!
String comparison: faster
  ≈ 2-3 times better
Sort keys: shorter
  ≈ half as long
Also available: JNI version
Warning: speed comparisons are approximate!
  Depends on data, parameters, features, CPU

Page 40: Collation in ICU

More Information
ICU
  http://oss.software.ibm.com/icu/
Design Document
  http://oss.software.ibm.com/cvs/icu/icuhtml/design/collation/
Latest Version of these slides
  http://www.macchiato.com

Page 41: Collation in ICU

Q & A

Page 42: Collation in ICU

Backup Slides

Not used in the presentation, except in response to questions

Page 43: Collation in ICU

WARNING 5: Math. Relation
S = {Unicode Strings}
Reflexive
  ∀a ∊ S: a ≤ a
Antisymmetric
  ∀a, b ∊ S: a ≤ b & b ≤ a → a = b
Transitive
  ∀a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
Total
  ∀a, b ∊ S: a ≤ b ∨ b ≤ a

Page 44: Collation in ICU

Identical Prefixes

Sorting / Searching Databases
  Many comparisons to “close” strings
Check initial prefixes with binary compare
Drop into collation loop at first difference
Complication…

Page 45: Collation in ICU

Initial Prefix Complication

Need to back up if in “bad” position:

Type                     Example
Contraction (Spanish)    c h
Normalization            a °
Surrogate Pair           <L> <T>
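The prefix check and the back-up step might look like the sketch below. It is a toy: the set of unsafe characters is hypothetical (combining marks plus "h" as the tail of the Spanish "ch" contraction), and surrogate pairs do not arise because Python strings are sequences of code points.

```python
import unicodedata

# Hypothetical: characters that can complete a contraction with what precedes them.
CONTRACTION_TAILS = {"h"}

def is_unsafe(ch):
    """A comparison must not start at a combining mark or a contraction tail."""
    return unicodedata.combining(ch) != 0 or ch in CONTRACTION_TAILS

def safe_common_prefix(a, b):
    """Length of the identical prefix that can be skipped before collating."""
    n, limit = 0, min(len(a), len(b))
    while n < limit and a[n] == b[n]:   # plain binary comparison
        n += 1
    # Back up while either string would resume at an unsafe character.
    while n > 0 and ((n < len(a) and is_unsafe(a[n])) or
                     (n < len(b) and is_unsafe(b[n]))):
        n -= 1
    return n
```

For example, "mucho" and "mucal" share "muc", but the "h" that follows in "mucho" belongs with the preceding "c", so only "mu" can be skipped.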

Page 46: Collation in ICU

Fractional UCA
Fractional weights for compression
Gaps for tailoring, future UCA additions
Only stores differences in tailoring file
Reduces memory footprint

            UCA                        Frac. UCA
            a     æ     ɒ     b        a     æ      ɒ      b
primary     0861  0865  0871  0875     17    18 60  18 66  19
secondary   20    20    20    20       03    03     03     03
tertiary    02    02    02    02       03    03     03     03
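Variable-length weights still compare correctly as byte strings, and the gaps leave room for insertions. The sketch uses the fractional primaries from the table above; the weight given to the extra element "NEW" is a made-up value that falls in the gap between æ and ɒ.

```python
# Fractional primary weights from the table, as byte strings.
frac_primary = {
    "a": bytes([0x17]),
    "æ": bytes([0x18, 0x60]),
    "ɒ": bytes([0x18, 0x66]),
    "b": bytes([0x19]),
}
ordered = sorted(frac_primary, key=frac_primary.get)

# Hypothetical tailoring: a new element placed in the gap after æ.
tailored = dict(frac_primary)
tailored["NEW"] = bytes([0x18, 0x63])
ordered_tailored = sorted(tailored, key=tailored.get)
```

No existing weight has to change to make room, which is what lets a tailoring store only its differences.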

Page 47: Collation in ICU

Exceptional Values

Normal weight storage
  P P P P P P P P P P P P P P P P S S S S S S S S C C T T T T T T
  P: 16b   S: 8b   C: 1b + 1b   T: 6b

Special Weight Storage
  F F F F T T T T d d d d d d d d d d d d d d d d d d d d d d d d
  F: 4b   T: 4b Tag   d: 24 bit data
  NOT_FOUND, EXPANSION, CONTRACTION, THAI, …
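The two 32-bit layouts can be unpacked with shifts and masks. This sketch follows the bit widths shown above; the function names and the dictionary field names are illustrative, not ICU identifiers.

```python
def pack_normal(p, s, c, t):
    """Pack 16-bit P, 8-bit S, 2-bit C and 6-bit T fields into one 32-bit value."""
    return (p & 0xFFFF) << 16 | (s & 0xFF) << 8 | (c & 0x3) << 6 | (t & 0x3F)

def unpack_normal(ce):
    """Split a normal element into its P, S, C and T fields."""
    return {"P": (ce >> 16) & 0xFFFF, "S": (ce >> 8) & 0xFF,
            "C": (ce >> 6) & 0x3, "T": ce & 0x3F}

def unpack_special(ce):
    """Split a special element into its 4-bit F field, 4-bit tag and 24-bit data."""
    return {"F": (ce >> 28) & 0xF, "tag": (ce >> 24) & 0xF,
            "data": ce & 0xFFFFFF}

packed = pack_normal(0x0861, 0x20, 0, 0x02)
fields = unpack_normal(packed)
special = unpack_special(0xF1ABCDEF)
```

Because both layouts fit in one 32-bit word, a single table lookup returns either a complete set of weights or a tag that says how to find them.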