TXT
Theory and practice of text encoding
Michiel van Oosterhout
July 2014
Topics
Introduction
Definitions
Character sets
Unicode
Text encoding in C#
Unicode normalization
Summary
2
INTRODUCTION
3
Text processing: a domain model
4
Bytes
• Files
• Database
• Socket
• Stream
Text data
• string
• char
• Reader
• Writer
Text
• Visual
• Spoken
Program
5
DEFINITIONS
6
★ Grapheme
A grapheme is the smallest unit of a writing system
– Letter from an alphabet: ⟨p⟩ or ⟨ß⟩ or ⟨â⟩ or ⟨π⟩
– Numerical digit: ⟨4⟩ or ⟨٤⟩ or ⟨Ⅲ⟩
– Punctuation: ⟨?⟩ or ⟨;⟩
– Ligatures: ⟨ﬁ⟩ or ⟨&⟩
– Symbols: ⟨€⟩ or ⟨∞⟩
– CJK characters and ideographs: ⟨平⟩ or ⟨不⟩
7
★ Diacritic
A mark that is added to a letter
Diacritical marks can take many forms, for example:
– Circumflex accent: ◌̂ as in ⟨â⟩
– Slash: ◌̸ as in ⟨ø⟩
– Cedilla: ◌̧ as in ⟨ç⟩
8
★ Glyph
Visual representation of a grapheme or diacritic
Implemented strictly in the rendering of text
9
★ Control character
Used to embed instructions in text, for example:
– ␀ NULL
– ␄ end of transmission
– ␉ horizontal tab
10
★ Abstract character
A unit of information used for the organization, control, or representation of textual data
Graphemes
Diacritical marks
Control characters
Miscellaneous characters…
Named
– LATIN CAPITAL LETTER Z
– HORIZONTAL ELLIPSIS
– EURO SIGN
Properties
– Is capital, is digit, is punctuation, …
11
CHARACTER SETS
12
★ Character set
A set of abstract characters
Each character has a corresponding numerical (uint) value, for example:
– A maps to 65
– B maps to 66
– a maps to 97
Also called:
– Code page
– Code table
– Character map
13
★ Code points
A code point is an entry in the character table
Use hexadecimal notation for the numerical value
5A = Z = LATIN CAPITAL LETTER Z
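The notation above can be reproduced in a few lines. This is a minimal sketch, assuming a C# script context (top-level statements, C# 9 or later); the variable names are illustrative only.

```csharp
using System;

// A char converts implicitly to its numeric value.
int codePoint = 'Z';

// Hexadecimal is the usual notation for code points.
string hex = codePoint.ToString("X2");
Console.WriteLine(hex + " = " + (char)codePoint);   // prints "5A = Z"

// The reverse direction: from a numeric value back to text.
string text = char.ConvertFromUtf32(0x5A);
Console.WriteLine(text);                            // prints "Z"
```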
14
Some (Western European) character sets
ASCII (1963)
– 128 code points (7-bit ★ code space)
– No accented letters
Windows-1252 (1985)
– 256 code points (8-bit code space)
– More symbols and accented letters
– Default in US/Western Europe Windows installations!
– Maps EURO SIGN to code point 80
– Incorrectly called ANSI
ISO-8859-1 (Latin 1, 1987) and ISO-8859-15 (Latin 9, 1999)
– 256 code points (8-bit code space)
– Designed for Western European languages
– Latin 9 added EURO SIGN at A4
15
UNICODE
16
Character encoding scheme
A method of representing abstract characters in bytes
– A byte is defined as an octet (8 bits)
Many character sets have a simple encoding scheme
– Numerical value (code point) == byte value
Character sets with more than 256 code points require a different encoding scheme
– CJK sets have several thousand code points
– Unicode has more than 110,000 assigned code points
– Needs multiple bytes per code point
17
Unicode
Standard for handling text in computing (since 1991)
Encodings
Character repertoire
– 21-bit code space
Rules
18
Unicode 1.0 encoding scheme
16-bit code space
UCS-2
– 1 ★ code unit per code point
– 2 bytes per code unit
• Big endian (most significant byte of each byte pair at the lowest address/beginning)
– Can only encode U+0000 – U+FFFF
– EURO SIGN (U+20AC): [0x20 0xAC]
20
Unicode 2.0 (1996)
21-bit code space
– Adds 16 new ★ planes
• 2 of them (planes 15 and 16) are reserved for private use
– Unicode 1.0 16-bit code space is now called ★ Basic Multilingual Plane (BMP)
21
UCS-4 encoding scheme
Also called UTF-32
1 code unit per code point
4 bytes per code unit
– Choice of big endian (UTF-32BE) or little endian (UTF-32LE)
EURO SIGN (U+20AC): [0x00 0x00 0x20 0xAC] (big endian)
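The UTF-32 byte layout can be checked with the framework's own encoder. A minimal sketch, assuming a C# script context (top-level statements); the variable names are illustrative.

```csharp
using System;
using System.Text;

// UTF32Encoding(bigEndian, byteOrderMark): big endian, no BOM.
var utf32be = new UTF32Encoding(true, false);
byte[] euroBE = utf32be.GetBytes("\u20AC");
Console.WriteLine(BitConverter.ToString(euroBE));   // prints "00-00-20-AC"

// Encoding.UTF32 is the little-endian variant:
// same code unit, bytes in reverse order.
byte[] euroLE = Encoding.UTF32.GetBytes("\u20AC");
Console.WriteLine(BitConverter.ToString(euroLE));   // prints "AC-20-00-00"
```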
22
UTF-16
1 or 2 code units per code point
– 2 code units are called a 'surrogate pair'
2 bytes per code unit
– Choice of big endian (UTF-16BE) or little endian (UTF-16LE)
Can encode U+0000 – U+10FFFF (17 planes)
EURO SIGN (U+20AC): [0x20 0xAC] (big endian)
GLOBE WITH MERIDIANS (U+1F310):
[0xD8 0x3C] [0xDF 0x10] (big endian)
• 1 code point > U+FFFF requires 2 code units
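How a code point above U+FFFF is split into a surrogate pair can be worked out by hand. The sketch below assumes a C# script context (top-level statements) and cross-checks the arithmetic against char.ConvertFromUtf32; the variable names are illustrative.

```csharp
using System;
using System.Text;

int codePoint = 0x1F310;                        // GLOBE WITH MERIDIANS

// Subtract 0x10000, then split the remaining 20 bits into two halves.
int offset = codePoint - 0x10000;
char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
Console.WriteLine(((int)high).ToString("X4") + " " + ((int)low).ToString("X4"));
// prints "D83C DF10"

// Cross-check against the framework.
string s = char.ConvertFromUtf32(codePoint);
bool matches = s.Length == 2 && s[0] == high && s[1] == low;
Console.WriteLine(matches);                     // prints "True"

// The same two code units, serialized in both byte orders.
string be = BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s));
string le = BitConverter.ToString(Encoding.Unicode.GetBytes(s));
Console.WriteLine(be);                          // prints "D8-3C-DF-10"
Console.WriteLine(le);                          // prints "3C-D8-10-DF"
```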
23
UTF-8 encoding scheme
The ‘standard’ encoding scheme for the internet
Compatible with ASCII
1 – 4 code units per code point
1 byte per code unit
EURO SIGN (U+20AC): [0xE2] [0x82] [0xAC]
0 – 7F: 0.......
80 – 7FF: 110..... 10......
800 – FFFF: 1110.... 10...... 10......
10000 – 1FFFFF: 11110... 10...... 10...... 10......
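The bit patterns above translate directly into shifts and masks. The following is a sketch only, assuming a C# script context (top-level statements): EncodeUtf8 is a hypothetical helper written for illustration, and it does not reject surrogates or values above U+10FFFF as a real encoder must.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: encode a single code point by hand.
static byte[] EncodeUtf8(int cp)
{
    if (cp < 0x80)       // 0.......
        return new[] { (byte)cp };
    if (cp < 0x800)      // 110..... 10......
        return new[] {
            (byte)(0xC0 | (cp >> 6)),
            (byte)(0x80 | (cp & 0x3F)) };
    if (cp < 0x10000)    // 1110.... 10...... 10......
        return new[] {
            (byte)(0xE0 | (cp >> 12)),
            (byte)(0x80 | ((cp >> 6) & 0x3F)),
            (byte)(0x80 | (cp & 0x3F)) };
    // 11110... 10...... 10...... 10......
    return new[] {
        (byte)(0xF0 | (cp >> 18)),
        (byte)(0x80 | ((cp >> 12) & 0x3F)),
        (byte)(0x80 | ((cp >> 6) & 0x3F)),
        (byte)(0x80 | (cp & 0x3F)) };
}

string euro = BitConverter.ToString(EncodeUtf8(0x20AC));
string globe = BitConverter.ToString(EncodeUtf8(0x1F310));
Console.WriteLine(euro);    // prints "E2-82-AC"
Console.WriteLine(globe);   // prints "F0-9F-8C-90"

// Cross-check against the framework encoder.
bool same = EncodeUtf8(0x1F310)
    .SequenceEqual(Encoding.UTF8.GetBytes(char.ConvertFromUtf32(0x1F310)));
Console.WriteLine(same);    // prints "True"
```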
24
Byte Order Mark
ZERO WIDTH NO-BREAK SPACE U+FEFF
Only used as the first character of an unmarked plaintext file
Indicator of endianness of UTF-16/32 encoding scheme
– Because these encoding schemes use 2 or 4 bytes per code unit
Signature for UTF-8 encoding scheme
– Does not indicate endianness, but rather the fact that the file is UTF-8 encoded
– Can cause problems (e.g. Linux/Unix scripts that must start with #!)
It is always preferable to 'tag' encoded text with the encoding scheme
– E.g. "this text is UTF-16BE (big endian)"
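In .NET, the byte order mark of an encoding is exposed as its "preamble". A minimal sketch, assuming a C# script context (top-level statements), that prints the BOM bytes for three of the built-in encodings; the variable names are illustrative.

```csharp
using System;
using System.Text;

// U+FEFF serialized in each encoding scheme.
string utf8Bom = BitConverter.ToString(Encoding.UTF8.GetPreamble());
string utf16BeBom = BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble());
string utf16LeBom = BitConverter.ToString(Encoding.Unicode.GetPreamble());

Console.WriteLine(utf8Bom);      // prints "EF-BB-BF"
Console.WriteLine(utf16BeBom);   // prints "FE-FF"
Console.WriteLine(utf16LeBom);   // prints "FF-FE"

// A UTF-8 encoding constructed without the signature has an empty preamble.
int noBomLength = new UTF8Encoding(false).GetPreamble().Length;
Console.WriteLine(noBomLength);  // prints 0
```

Writing UTF-8 without the signature is one way to avoid the #! problem mentioned above.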
25
TEXT ENCODING IN C#
26
Encoding in C#
// Unicode literal (16-bit)
var euro = "\u20AC";
// Unicode literal (32-bit)
var picto = "\U0001F310";
// Encode text to bytes
var bytes = Encoding.UTF8.GetBytes(euro);
> [0xE2, 0x82, 0xAC]
// Get a specific encoding scheme
var windows1252 = Encoding.GetEncoding("Windows-1252");
var bytes = windows1252.GetBytes(euro);
> [0x80]
27
Decoding in C#
// Decode bytes
var bytes = new byte[] { 0xE2, 0x82, 0xAC };
var euro = Encoding.UTF8.GetString(bytes);
> €
// Decoding error
var windows1252 = Encoding.GetEncoding("Windows-1252");
var euro = windows1252.GetString(bytes);
> â‚¬
// To understand why, let's check the 'Windows-1252 bytes' for â
windows1252.GetBytes("â");
> [0xE2]
28
Replacement fallback strategy in C#
// ASCII encoder will replace unknown characters with a '?'
var str = "€";
var bytes = Encoding.ASCII.GetBytes(str);
Encoding.ASCII.GetString(bytes);
> ?
// UTF-8 decoder will replace with REPLACEMENT CHARACTER U+FFFD
var str = "€";
var windows1252 = Encoding.GetEncoding("Windows-1252");
var bytes = windows1252.GetBytes(str);
Encoding.UTF8.GetString(bytes);
> �
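Replacement is the default, but silently losing data is not always wanted. As a sketch under the assumption of a C# script context (top-level statements), the same two cases can be made to throw instead by asking for exception fallbacks; the variable names are illustrative.

```csharp
using System;
using System.Text;

// ASCII encoder that throws instead of substituting '?'.
var strictAscii = Encoding.GetEncoding(
    "us-ascii",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);

bool encodeFailed = false;
try
{
    strictAscii.GetBytes("\u20AC");
}
catch (EncoderFallbackException)
{
    encodeFailed = true;   // EURO SIGN has no ASCII representation
}

// UTF-8 decoder that throws on invalid bytes (second constructor argument).
var strictUtf8 = new UTF8Encoding(false, true);

bool decodeFailed = false;
try
{
    strictUtf8.GetString(new byte[] { 0x80 });
}
catch (DecoderFallbackException)
{
    decodeFailed = true;   // a lone 0x80 is not valid UTF-8
}

Console.WriteLine(encodeFailed + " " + decodeFailed);   // prints "True True"
```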
29
Code units in C#
// Strings are actually UTF-16!
var picto = "\U0001F310";
picto.Length;
> 2
// UTF-16 uses 2 code units to encode code points > 0xFFFF
// Using a char type (notice single quotes)
var picto = '\U0001F310';
Compiler error CS1012: Too many characters in character literal
// char type can only contain a single UTF-16 code unit,
// not a surrogate pair
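Because Length counts code units, counting code points takes an explicit loop. A minimal sketch, assuming a C# script context (top-level statements) and well-formed input (no unpaired surrogates); the variable names are illustrative.

```csharp
using System;
using System.Collections.Generic;

var text = "a\U0001F310";            // 2 code points, 3 code units
var codePoints = new List<int>();

for (int i = 0; i < text.Length; i++)
{
    // Reads a whole surrogate pair when one starts at index i.
    codePoints.Add(char.ConvertToUtf32(text, i));

    // Skip the low surrogate that was consumed with the high one.
    if (char.IsHighSurrogate(text[i]))
        i++;
}

Console.WriteLine(text.Length);        // prints 3
Console.WriteLine(codePoints.Count);   // prints 2
```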
30
Code units in C#
// String with 2 abstract characters
// LATIN SMALL LETTER E and COMBINING ACUTE ACCENT
var str = "\u0065\u0301";
str.Length;
> 2
str[0];
> e
// Using System.Globalization.StringInfo
var enumerator = StringInfo.GetTextElementEnumerator(str);
enumerator.MoveNext();
enumerator.Current;
> é
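The difference between code units and text elements can also be shown as two counts. A minimal sketch, assuming a C# script context (top-level statements); the variable names are illustrative.

```csharp
using System;
using System.Globalization;

var str = "\u0065\u0301";   // e + COMBINING ACUTE ACCENT

int codeUnits = str.Length;
int textElements = new StringInfo(str).LengthInTextElements;
Console.WriteLine(codeUnits);      // prints 2
Console.WriteLine(textElements);   // prints 1

// Walk the text elements; the base letter and its combining mark stay together.
var enumerator = StringInfo.GetTextElementEnumerator(str);
int count = 0;
string first = null;
while (enumerator.MoveNext())
{
    if (count == 0)
        first = enumerator.GetTextElement();
    count++;
}
Console.WriteLine(first.Length);   // prints 2
```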
31
UNICODE NORMALIZATION
32
Normalization: ★ canonical equivalence
Multiple (sequences of) code points representing the same character
LATIN SMALL LETTER A WITH CIRCUMFLEX (U+00E2)
– A single grapheme: ⟨â⟩
– A ★ 'primary composite'
– For compatibility with other character sets
– Similar to E2 in Windows-1252 or ISO-8859-1 (Latin 1)
LATIN SMALL LETTER A (U+0061) + COMBINING CIRCUMFLEX ACCENT (U+0302)
– A single grapheme: ⟨a⟩, combined with a diacritic (not a grapheme): ◌̂
33
Canonical equivalence in C# - 1/2
// Single character
var bytes = new byte[] { 0xC3, 0xA2 };
var str1 = Encoding.UTF8.GetString(bytes);
> â
str1.Length;
> 1
// a (0x61) with combining diacritical mark (0xCC, 0x82)
bytes = new byte[] { 0x61, 0xCC, 0x82 };
var str2 = Encoding.UTF8.GetString(bytes);
str2;
> â
str2.Length;
> 2
34
Canonical equivalence in C# - 2/2
str1.Length;
> 1
// Decompose into separate code points
str1.Normalize(NormalizationForm.FormD).Length;
> 2
str2.Length;
> 2
// Compose into 'primary composite' (the single character)
str2.Normalize(NormalizationForm.FormC).Length;
> 1
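The practical consequence is that canonically equivalent strings do not compare equal until they are normalized to the same form. A minimal sketch, assuming a C# script context (top-level statements); the variable names are illustrative.

```csharp
using System;
using System.Text;

var composed = "\u00E2";      // LATIN SMALL LETTER A WITH CIRCUMFLEX
var decomposed = "a\u0302";   // a + COMBINING CIRCUMFLEX ACCENT

// == on strings is an ordinal comparison of UTF-16 code units.
bool equalRaw = composed == decomposed;

// Normalize both sides to the same form (FormC here) before comparing.
bool equalNormalized =
    composed.Normalize(NormalizationForm.FormC) ==
    decomposed.Normalize(NormalizationForm.FormC);

Console.WriteLine(equalRaw);          // prints "False"
Console.WriteLine(equalNormalized);   // prints "True"
```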
35
Normalization: ★ compatibility
Multiple (sequences of) code points representing compatible characters
– Different appearance, same meaning
LATIN SMALL LIGATURE FF (U+FB00)
– A single grapheme: ⟨ﬀ⟩
LATIN SMALL LETTER F (U+0066) + LATIN SMALL LETTER F (U+0066)
– Two graphemes: ⟨f⟩⟨f⟩
36
Compatibility in C# - 2/2
// The ﬀ ligature (U+FB00)
var bytes = new byte[] { 0xEF, 0xAC, 0x80 };
var str1 = Encoding.UTF8.GetString(bytes);
str1.Length;
> 1
var str2 = str1.Normalize(NormalizationForm.FormKD);
str2.Length;
> 2
str2;
> ff
// ﬀ was decomposed into two characters: ff
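Canonical and compatibility normalization treat the ligature differently. A minimal sketch, assuming a C# script context (top-level statements), comparing FormC with FormKC on the same input; the variable names are illustrative.

```csharp
using System;
using System.Text;

var ligature = "\uFB00";   // LATIN SMALL LIGATURE FF

// Canonical composition leaves the ligature alone:
// it has no canonical equivalent, only a compatibility one.
string formC = ligature.Normalize(NormalizationForm.FormC);

// Compatibility composition replaces it with two plain letters.
string formKC = ligature.Normalize(NormalizationForm.FormKC);

Console.WriteLine(formC == ligature);   // prints "True"
Console.WriteLine(formKC);              // prints "ff"
Console.WriteLine(formKC.Length);       // prints 2
```

Compatibility forms are useful for searching and matching, but they lose the distinction the original text made, so they are usually not applied to stored text.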
37
SUMMARY
38
Further reading
https://en.wikipedia.org/wiki/Unicode (Unicode on Wikipedia)
http://www.joelonsoftware.com/articles/Unicode.html (The Absolute Minimum)
http://www.objc.io/issue-9/unicode.html (NSString and Unicode)
http://www.unicode.org/glossary/ (Glossary of Unicode terms)
http://www.unicode.org/faq/ (Unicode FAQ)
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt (UTF-8 history)
http://msdn.microsoft.com/en-us/library/ms404377(v=vs.110).aspx (Character Encoding in the .NET Framework)
http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/ (All about Unicode)
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode (On the Goodness of Unicode, Tim Bray)
39