txt - theory and practice of text encoding

TXT

Theory and practice of text encoding

Michiel van Oosterhout

July 2014

Topics

Introduction

Definitions

Character sets

Unicode

Text encoding in C#

Unicode normalization

Summary


INTRODUCTION


Text processing: a domain model


Bytes

• Files

• Database

• Socket

• Stream

Text data

• string

• char

• Reader

• Writer

Text

• Visual

• Spoken

Program

DEFINITIONS


★ Grapheme

Grapheme is the smallest written unit

– Letter from an alphabet: ⟨p⟩ or ⟨ß⟩ or ⟨â⟩ or ⟨π⟩

– Numerical digit: ⟨4⟩ or ⟨٤⟩ or ⟨Ⅲ⟩

– Punctuation: ⟨?⟩ or ⟨;⟩

– Ligatures: ⟨ﬁ⟩ or ⟨&⟩

– Symbols: ⟨€⟩ or ⟨∞⟩

– CJK characters and ideographs: ⟨平⟩ or ⟨不⟩


★ Diacritic

A mark that is added to a letter

A diacritical mark can take many forms, for example:

– Circumflex accent: ◌̂ as in ⟨â⟩

– Slash: ◌̸ as in ⟨ø⟩

– Cedilla: ◌̧ as in ⟨ç⟩


★ Glyph

Visual representation of a grapheme or diacritic

Implemented strictly in the rendering of text


★ Control character

Used to embed instructions in text, for example:

– ␀ NULL

– ␄ end of transmission

– ␉ horizontal tab


★ Abstract character

A unit of information used for the organization, control, or representation of textual data

Graphemes

Diacritical marks

Control characters

Miscellaneous characters…

Named

– LATIN CAPITAL LETTER Z

– HORIZONTAL ELLIPSIS

– EURO SIGN

Properties

– Is capital, is digit, is punctuation, …
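In .NET, some of these character properties can be queried through System.Char and System.Globalization.CharUnicodeInfo. The snippet below is a minimal sketch written as a C# script (top-level statements). The variable names are illustrative and not part of the original slides.

```csharp
using System;
using System.Globalization;

// Three abstract characters, written as escapes to keep the source ASCII-only.
char capitalZ = 'Z';         // LATIN CAPITAL LETTER Z
char arabicFour = '\u0664';  // ARABIC-INDIC DIGIT FOUR
char ellipsis = '\u2026';    // HORIZONTAL ELLIPSIS

// Property checks exposed by System.Char.
bool isCapital = char.IsUpper(capitalZ);
bool isDigit = char.IsDigit(arabicFour);
bool isPunctuation = char.IsPunctuation(ellipsis);

// The general category of EURO SIGN (U+20AC).
UnicodeCategory euroCategory = CharUnicodeInfo.GetUnicodeCategory('\u20AC');

Console.WriteLine($"U+{(int)capitalZ:X4} is capital: {isCapital}");
Console.WriteLine($"U+{(int)arabicFour:X4} is digit: {isDigit}");
Console.WriteLine($"U+{(int)ellipsis:X4} is punctuation: {isPunctuation}");
Console.WriteLine($"U+20AC category: {euroCategory}");
```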


CHARACTER SETS


★ Character set

A set of abstract characters

Each character has a corresponding numerical (uint) value, for example:

– A maps to 65

– B maps to 66

– a maps to 97

Also called:

– Code page

– Code table

– Character map


★ Code points

A code point is an entry in the character table

Use hexadecimal notation for the numerical value

5A = Z = LATIN CAPITAL LETTER Z
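The mapping between a character and its numerical value can be checked directly in C#, because a char converts to an int. This is a small sketch in script form, and the variable names are illustrative.

```csharp
using System;

// A char converts implicitly to its numeric value.
int codePoint = 'Z';

// Hexadecimal notation, as used for code points.
string hex = codePoint.ToString("X2");        // "5A"
string unicodeStyle = $"U+{codePoint:X4}";    // "U+005A"

// And back from the number to the character.
char back = (char)0x5A;

Console.WriteLine($"{hex} = {back} = {unicodeStyle}");
```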


Some (Western European) character sets

ASCII (1963)

– 128 code points (7-bit ★ code space)

– No accented letters

Windows-1252 (1985)

– 256 code points (8-bit code space)

– More symbols and accented letters

– Default in US/Western Europe Windows installations!

– Maps EURO SIGN to code point 80

– Incorrectly called ANSI

ISO-8859-1 (Latin 1, 1987) and ISO-8859-15 (Latin 9, 1999)

– 256 code points (8-bit code space)

– Designed for Western European languages

– Latin 9 added EURO SIGN at A4


UNICODE


Character encoding scheme

A method of representing abstract characters in bytes

– A byte is defined as an octet (8 bits)

Many character sets have a simple encoding scheme

– Numerical value (code point) == byte value

Character sets with more than 256 code points require a different encoding scheme

– CJK sets have several thousand code points

– Unicode has more than 110,000 code points

– Needs multiple bytes per code point
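The difference between a simple encoding scheme and a multi-byte one can be seen by encoding the same character twice. The sketch below uses ISO-8859-1 as the single-byte example because it is available in .NET without extra setup. The variable names are illustrative.

```csharp
using System;
using System.Text;

// LATIN SMALL LETTER A WITH CIRCUMFLEX, code point E2 in Latin 1 and in Unicode.
string text = "\u00E2";

// Simple scheme: the byte value equals the code point.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
byte[] latin1Bytes = latin1.GetBytes(text);

// UTF-8 needs more than one byte for the same code point.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

Console.WriteLine(BitConverter.ToString(latin1Bytes)); // E2
Console.WriteLine(BitConverter.ToString(utf8Bytes));   // C3-A2
```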


Unicode

Standard for handling text in computing (since 1991)

Encodings

Character repertoire

– 21-bit code space

Rules


Unicode 1.0 encoding scheme

16-bit code space

UCS-2

– 1 ★ code unit per code point

– 2 bytes per code unit

• Big endian (most significant byte of each byte pair at the lowest address/beginning)

– Can only encode U+0000 – U+FFFF

– EURO SIGN (U+20AC): [0x20 0xAC]


Unicode 2.0 (1996)

21-bit code space

– Adds 16 new ★ planes

• 2 of them reserved for private use

– Unicode 1.0 16-bit code space is now called ★ Basic Multilingual Plane (BMP)


UCS-4 encoding scheme

Also called UTF-32

1 code unit per code point

4 bytes per code unit

– Choice of big endian (UTF-32BE) or little endian (UTF-32LE)

EURO SIGN (U+20AC): [0x00 0x00 0x20 0xAC] (big endian)
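The effect of the byte order choice can be sketched with System.Text.UTF32Encoding, which takes the endianness as a constructor argument. The variable names below are illustrative and not part of the original slides.

```csharp
using System;
using System.Text;

// One encoder per byte order, without a byte order mark.
var utf32BE = new UTF32Encoding(bigEndian: true, byteOrderMark: false);
var utf32LE = new UTF32Encoding(bigEndian: false, byteOrderMark: false);

// EURO SIGN (U+20AC) is one code unit of 4 bytes in both cases.
byte[] bigEndianBytes = utf32BE.GetBytes("\u20AC");
byte[] littleEndianBytes = utf32LE.GetBytes("\u20AC");

Console.WriteLine(BitConverter.ToString(bigEndianBytes));    // 00-00-20-AC
Console.WriteLine(BitConverter.ToString(littleEndianBytes)); // AC-20-00-00
```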


UTF-16

1 or 2 code units per code point

– 2 code units are called a 'surrogate pair'

2 bytes per code unit

– Choice of big endian (UTF-16BE) or little endian (UTF-16LE)

Can encode U+0000 – U+10FFFF (17 planes)

EURO SIGN (U+20AC): [0x20 0xAC] (big endian)

GLOBE WITH MERIDIANS (U+1F310):

[0xD8 0x3C] [0xDF 0x10] (big endian)

– A code point above U+FFFF requires 2 code units
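How a code point above U+FFFF is split into a surrogate pair can be worked out by hand. The sketch below does the arithmetic for GLOBE WITH MERIDIANS and compares it with what the framework produces. The variable names are illustrative.

```csharp
using System;
using System.Text;

int globe = 0x1F310;            // GLOBE WITH MERIDIANS
int offset = globe - 0x10000;   // 20-bit value left after removing the BMP range

// High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
char high = (char)(0xD800 + (offset >> 10));
char low = (char)(0xDC00 + (offset & 0x3FF));

// The framework performs the same conversion.
string fromFramework = char.ConvertFromUtf32(globe);
byte[] bigEndianBytes = Encoding.BigEndianUnicode.GetBytes(fromFramework);

Console.WriteLine($"{(int)high:X4} {(int)low:X4}");        // D83C DF10
Console.WriteLine(BitConverter.ToString(bigEndianBytes));   // D8-3C-DF-10
```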


UTF-8 encoding scheme

The ‘standard’ encoding scheme for the internet

Compatible with ASCII

1 – 4 code units per code point

1 byte per code unit

EURO SIGN (U+20AC): [0xE2] [0x82] [0xAC]

Code point range (hexadecimal) and byte pattern:

– 0–7F: 0.......

– 80–7FF: 110..... 10......

– 800–FFFF: 1110.... 10...... 10......

– 10000–1FFFFF: 11110... 10...... 10...... 10...... (Unicode only uses up to 10FFFF)
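The bit patterns above can be applied by hand. The following is a minimal sketch, not a replacement for Encoding.UTF8. EncodeUtf8 is a hypothetical helper introduced here for illustration only.

```csharp
using System;
using System.Text;

// Hypothetical helper: encodes a single code point following the UTF-8 bit patterns.
byte[] EncodeUtf8(int codePoint)
{
    // Reject values outside the Unicode code space and lone surrogates.
    if (codePoint < 0 || codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    if (codePoint <= 0x7F)   // 0.......
        return new[] { (byte)codePoint };

    if (codePoint <= 0x7FF)  // 110..... 10......
        return new[] {
            (byte)(0xC0 | (codePoint >> 6)),
            (byte)(0x80 | (codePoint & 0x3F)) };

    if (codePoint <= 0xFFFF) // 1110.... 10...... 10......
        return new[] {
            (byte)(0xE0 | (codePoint >> 12)),
            (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
            (byte)(0x80 | (codePoint & 0x3F)) };

    // 11110... 10...... 10...... 10......
    return new[] {
        (byte)(0xF0 | (codePoint >> 18)),
        (byte)(0x80 | ((codePoint >> 12) & 0x3F)),
        (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
        (byte)(0x80 | (codePoint & 0x3F)) };
}

byte[] euroBytes = EncodeUtf8(0x20AC);  // EURO SIGN
Console.WriteLine(BitConverter.ToString(euroBytes)); // E2-82-AC
```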


Byte Order Mark

ZERO WIDTH NO-BREAK SPACE (U+FEFF)

Only used as the first character of an unmarked plaintext file

Indicator of endianness of UTF-16/32 encoding scheme

– Because these encoding schemes use 2 or 4 bytes per code unit

Signature for UTF-8 encoding scheme

– Does not indicate endianness, but rather the fact that the file is UTF-8 encoded

– Can cause problems (e.g. Linux/Unix scripts that must start with #!)

It is always preferable to 'tag' encoded text with the encoding scheme

– E.g. "this text is UTF-16BE (big endian)"
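In .NET the byte order mark of an encoding is available through Encoding.GetPreamble. The sketch below prints the marks for three encodings and shows one way to get a UTF-8 encoder that omits the signature. The variable names are illustrative.

```csharp
using System;
using System.Text;

// U+FEFF encoded in each scheme. The byte order reveals the endianness.
string utf8Bom = BitConverter.ToString(Encoding.UTF8.GetPreamble());
string utf16LeBom = BitConverter.ToString(Encoding.Unicode.GetPreamble());
string utf16BeBom = BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble());

Console.WriteLine(utf8Bom);    // EF-BB-BF
Console.WriteLine(utf16LeBom); // FF-FE
Console.WriteLine(utf16BeBom); // FE-FF

// A UTF-8 encoder without the signature, e.g. for scripts that must start with #!
var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
int preambleLength = utf8NoBom.GetPreamble().Length;
Console.WriteLine(preambleLength); // 0
```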


TEXT ENCODING IN C#


Encoding in C#

// Unicode literal (16-bit)

var euro = "\u20AC";

// Unicode literal (32-bit)

var picto = "\U0001F310";

// Encode text to bytes

var bytes = Encoding.UTF8.GetBytes(euro);

> [0xE2, 0x82, 0xAC]

// Get a specific encoding scheme

var windows1252 = Encoding.GetEncoding("Windows-1252");

var bytes = windows1252.GetBytes(euro);

> [0x80]


Decoding in C#

// Decode bytes

var bytes = new byte[] { 0xE2, 0x82, 0xAC };

var euro = Encoding.UTF8.GetString(bytes);

> €

// Decoding error: UTF-8 bytes decoded as Windows-1252

var windows1252 = Encoding.GetEncoding("Windows-1252");

var euro = windows1252.GetString(bytes);

> â‚¬

// To understand why, let's check the 'Windows-1252 bytes' for â

windows1252.GetBytes("â");

> [0xE2]


Replacement fallback strategy in C#

// ASCII encoder will replace unknown characters with a '?'

var str = "€";

var bytes = Encoding.ASCII.GetBytes(str);

Encoding.ASCII.GetString(bytes);

> ?

// UTF-8 decoder will replace with REPLACEMENT CHARACTER U+FFFD

var str = "€";

var windows1252 = Encoding.GetEncoding("Windows-1252");

var bytes = windows1252.GetBytes(str);

Encoding.UTF8.GetString(bytes);

> �


Code units in C#

// Strings are actually UTF-16!

var picto = "\U0001F310";

picto.Length;

> 2

// UTF-16 uses 2 code units to encode code points > 0xFFFF

// Using a char type (notice single quotes)

var picto = '\U0001F310';

Compiler error CS1012: Too many characters in character literal

// char type can only contain a single UTF-16 code unit,

// not a surrogate pair
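Because string.Length counts UTF-16 code units, counting code points takes an extra step. The sketch below walks a string and treats each surrogate pair as one code point. The variable names are illustrative and not part of the original slides.

```csharp
using System;

// a, EURO SIGN, GLOBE WITH MERIDIANS
string text = "a\u20AC\U0001F310";

int codePoints = 0;
for (int i = 0; i < text.Length; i++)
{
    // ConvertToUtf32 combines a surrogate pair at position i into one code point.
    int value = char.ConvertToUtf32(text, i);

    // Skip the low surrogate so that it is not counted a second time.
    if (char.IsSurrogatePair(text, i)) i++;

    Console.WriteLine($"U+{value:X4}");
    codePoints++;
}

Console.WriteLine($"{text.Length} code units, {codePoints} code points"); // 4 code units, 3 code points
```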


Code units in C#

// String with 2 abstract characters

// LATIN SMALL LETTER E and COMBINING ACUTE ACCENT

var str = "\u0065\u0301";

str.Length;

> 2

str[0];

> e

// Using System.Globalization.StringInfo

var enumerator = StringInfo.GetTextElementEnumerator(str);

enumerator.MoveNext();

enumerator.Current;

> é


UNICODE NORMALIZATION


Normalization: ★ canonical equivalence

Multiple (sequences of) code points representing the same character

LATIN SMALL LETTER A WITH CIRCUMFLEX (U+00E2)

– A single grapheme: ⟨â⟩

– A ★ 'primary composite'

– For compatibility with other character sets

– Similar to E2 in Windows-1252 or ISO-8859-1 (Latin 1)

LATIN SMALL LETTER A (U+0061) + COMBINING CIRCUMFLEX ACCENT (U+0302)

– A single grapheme: ⟨a⟩, combined with a diacritic (not a grapheme): ◌̂


Canonical equivalence in C# - 1/2

// Single character

var bytes = new byte[] { 0xC3, 0xA2 };

var str1 = Encoding.UTF8.GetString(bytes);

> â

str1.Length;

> 1

// a (0x61) with combining diacritical mark (0xCC, 0x82)

bytes = new byte[] { 0x61, 0xCC, 0x82 };

var str2 = Encoding.UTF8.GetString(bytes);

str2;

> â

str2.Length;

> 2


Canonical equivalence in C# - 2/2

str1.Length;

> 1

// Decompose into separate code points

str1.Normalize(NormalizationForm.FormD).Length;

> 2

str2.Length;

> 2

// Compose into 'primary composite' (the single character)

str2.Normalize(NormalizationForm.FormC).Length;

> 1
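A practical consequence of canonical equivalence is that two strings can look identical and still compare as different. The sketch below, with illustrative variable names, compares the composed and decomposed forms before and after normalization.

```csharp
using System;
using System.Text;

string composed = "\u00E2";     // LATIN SMALL LETTER A WITH CIRCUMFLEX
string decomposed = "a\u0302";  // LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT

// An ordinal comparison looks at code units, so the two forms differ.
bool ordinalEqual = string.Equals(composed, decomposed, StringComparison.Ordinal);

// Normalizing both sides to the same form first makes them comparable.
bool normalizedEqual = string.Equals(
    composed.Normalize(NormalizationForm.FormC),
    decomposed.Normalize(NormalizationForm.FormC),
    StringComparison.Ordinal);

Console.WriteLine(ordinalEqual);    // False
Console.WriteLine(normalizedEqual); // True
```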


Normalization: ★ compatibility

Multiple (sequences of) code points representing compatible characters

– Different appearance, same meaning

LATIN SMALL LIGATURE FF (U+FB00)

– A single grapheme: ⟨ﬀ⟩

LATIN SMALL LETTER F (U+0066) + LATIN SMALL LETTER F (U+0066)

– Two graphemes: ⟨f⟩⟨f⟩


Compatibility in C#

// The ﬀ ligature

var bytes = new byte[] { 0xEF, 0xAC, 0x80 };

var str1 = Encoding.UTF8.GetString(bytes);

str1.Length;

> 1

var str2 = str1.Normalize(NormalizationForm.FormKD);

str2.Length;

> 2

str2;

> ff

// ﬀ was decomposed into two characters: ff


SUMMARY


Further reading

https://en.wikipedia.org/wiki/Unicode (Unicode on Wikipedia)

http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum)

http://www.objc.io/issue-9/unicode.html (NSString and Unicode)

http://www.unicode.org/glossary/ (Glossary of Unicode terms)

http://www.unicode.org/faq/ (Unicode FAQ)

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt (UTF-8 history)

http://msdn.microsoft.com/en-us/library/ms404377(v=vs.110).aspx (Character Encoding in the .NET Framework)

http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/ (All about Unicode)

http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode (On the Goodness of Unicode, Tim Bray)
