unicode for rails - happygiraffe.nethappygiraffe.net/blog/files/unicode_for_rails.pdf · what is...

Unicode for Rails

Dominic Mitchell

Introduction to Unicode

What Is Unicode?

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

http://www.unicode.org/standard/WhatIsUnicode.html



More often...

Characters

A, A, A == U+0041

a, a, a == U+0061

Ā == U+0100

ﬃ == U+FB03

☺ == U+263A

☂, ☃, ✄, ☠, …
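The mapping between characters and code points can be tried directly. This is a minimal sketch, assuming Ruby 1.9 or later with UTF-8 source files (newer than the Ruby these slides target); `u_plus` is a hypothetical helper, not part of any library.

```ruby
# Hypothetical helper: format a character in the U+XXXX notation used above.
def u_plus(char)
  format("U+%04X", char.ord)
end

# A character maps to one code point, whatever font draws it.
puts u_plus("A")
puts u_plus("a")
puts u_plus("Ā")

# And a code point maps back to a character.
puts [0x263A].pack("U")
```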

ASCII

“In the beginning...”

First published in 1963; revised into its familiar form in 1967.

7-bit.

http://en.wikipedia.org/wiki/ASCII


ISO-8859

Mid ‘80s

8-bit

ASCII superset

16 different, related standards

ISO-8859-1 (aka Latin-1) is most common

http://en.wikipedia.org/wiki/ISO/IEC_8859


Windows-1252

Like ISO-8859-1, but with extra characters

e.g. smart quotes, em dash

The bane of your life

http://en.wikipedia.org/wiki/Windows-1252


Unicode

21-bit

Pretty much all characters in use, in the same character set.

http://www.unicode.org/


But There’s More!

Unicode also specifies:

Character Properties

Encodings

Algorithms

Sounds Complex?

It is.

The real world is complex.

But you can get by on a fairly minimal subset...

Encodings

How do you turn characters into octets?

It’s simple for ASCII & ISO-8859.

Unicode has three different schemes.

UTF-32

4 octets (32 bits) per character.

Very inefficient.

Not used much.

code point      character         UTF-32 code value(s)   glyph
122 (7A)        small Z (Latin)   00 00 00 7A            z
27700 (6C34)    water (Chinese)   00 00 6C 34            水

http://en.wikipedia.org/wiki/UTF-32


UTF-16

2 octets (16 bits) per character (mostly).

Common on Windows & Java.

Somewhat wasteful for mostly Western text.


code point      character         UTF-16 code value(s)   glyph
122 (7A)        small Z (Latin)   00 7A                  z
27700 (6C34)    water (Chinese)   6C 34                  水



UTF-8

Multi-Byte, but ASCII compatible.

Very common in Internet protocols.

Reliably recognisable.


code point      character         UTF-8 code value(s)    glyph
122 (7A)        small Z (Latin)   7A                     z
27700 (6C34)    water (Chinese)   E6 B0 B4               水
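The octets in the three tables can be reproduced with a short script. This is a sketch, assuming Ruby 1.9 or later, where `String#encode` is available; `octets` is a hypothetical helper, and the big-endian variants are chosen to match the byte order shown on the slides.

```ruby
# Hypothetical helper: the hex octets of a string in the named encoding.
def octets(str, encoding)
  str.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

["z", "水"].each do |char|
  %w[UTF-32BE UTF-16BE UTF-8].each do |enc|
    puts "#{char} #{enc}: #{octets(char, enc)}"
  end
end
```

The "(mostly)" on the UTF-16 slide refers to code points above U+FFFF, which take two 16-bit units (a surrogate pair) rather than one.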



Which Encoding?

By default, pick UTF-8.

Choose UTF-16 when

Lots of non-Western text.

Interfacing with other UTF-16 systems.

Accents

Some are built-in (e.g. é)

But you can build your own with “combining characters” (e.g. ĵ)

U+006A LATIN SMALL LETTER J

U+0302 COMBINING CIRCUMFLEX ACCENT

Normalisation

How can I spot é if There’s More Than One Way To Do It?

“normalize” all strings before use

Four forms of normalisation

NFC, NFD, NFKC, NFKD

But only NFC matters, says the W3C

http://www.w3.org/TR/charmod-norm/
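To see why normalisation matters, compare a precomposed character with its combining-character spelling. This is a sketch, assuming Ruby 2.2 or later, where `String#unicode_normalize` is built in (it did not exist in the Ruby these slides describe); `same_text?` is a hypothetical helper.

```ruby
# Hypothetical helper: compare two strings after normalising both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed   = "\u00E9"   # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

puts composed == decomposed            # the raw strings differ
puts same_text?(composed, decomposed)  # but they match once normalised

# j + COMBINING CIRCUMFLEX ACCENT composes to a single code point under NFC.
puts "j\u0302".unicode_normalize(:nfc).codepoints.length
```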


Why bother?

It’s more work now...

But it opens everything up in the future!

The rest of the world is heading this way

Unicode in Rails

Where to start?

Examine a typical request.

Where to begin thinking about Unicode?

Model?

Controller?

HTTP Headers?

URI?

Domain name!

International Domain Names

Punycode

iñtërnâtiônàlizætiøn.net

xn--itrntinliztin-vdb0a5exd8ewcye.net

IDN

http://i%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n.net/


URIs

Called IRIs when used with Unicode

Must use percent-encoded UTF-8

Not %uXXXX (IE only)
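Percent-encoded UTF-8 means each octet of a character's UTF-8 form is written as `%XX`. The sketch below uses `URI.encode_www_form_component` from Ruby's standard library (Ruby 1.9.2 or later is assumed); note that it implements form encoding, so a space would become `+`. `percent_encode` is a hypothetical wrapper name.

```ruby
require "uri"

# Hypothetical wrapper: percent-encode the UTF-8 octets of non-ASCII characters.
def percent_encode(text)
  URI.encode_www_form_component(text)
end

encoded = percent_encode("iñtërnâtiônàlizætiøn")
puts encoded

# Decoding gives back the original UTF-8 text.
puts URI.decode_www_form_component(encoded)
```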

Browsers

HTML Forms as input

Uses page charset unless...

Form has @accept-charset

Patchy support...

Finally

We get to your code, in the controller

i.e. Ruby

Ruby & Unicode

Bad reputation

Somewhat deserved:

Ruby understands bytes

Not characters

But!

There’s a magic flag!

-K kcode Specifies KANJI (Japanese) encoding.

-Ku turns on UTF-8 mode

$KCODE = "UTF8"

$KCODE =~ /^u/i

Sets encoding in Tk

Allows CGI::unescapeHTML to output UTF-8

SOAP libs use it here and there

Big user is the regex engine

/./u matches a UTF-8 char

pack / unpack

The only other place Ruby understands UTF-8.

[0x100, 0x64, 0x61, 0x6d].pack("U*")
=> "Ādam"

"Ādam".unpack("U*")
=> [256, 100, 97, 109]

Unicode affects...

Any character processing. In String:

[] []= =~ <=> == capitalize casecmp center chomp chop count delete downcase dump each eql? gsub index insert length ljust lstrip replace reverse rindex rjust rstrip scan slice split squeeze strip sub succ swapcase tr upcase upto

And regexes

jcode.rb

Core library

Enhances String

"Ādam".length => 5

"Ādam".jlength => 4

Not very complete
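For comparison, later Rubies make the byte/character distinction explicit on String itself. This is a sketch, assuming Ruby 1.9 or later with UTF-8 source files; `octets_and_chars` is a hypothetical helper.

```ruby
# Hypothetical helper: report [octets, characters] for a string.
def octets_and_chars(str)
  [str.bytesize, str.length]
end

octets, chars = octets_and_chars("Ādam")
puts "#{octets} octets, #{chars} characters"
```

Here `bytesize` gives the count that `length` reports above, and `length` gives the count that `jlength` reports.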

iconv

Another core library

Converts between character encodings

conv = Iconv.new("UTF-8", "WINDOWS-1252")
conv.iconv "\223foo\224"
=> "“foo”"
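Iconv was removed from Ruby's standard library in 2.0, so the same conversion is sketched here with `String#encode`, assuming Ruby 1.9 or later. `windows_1252_to_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper: treat raw octets as Windows-1252, then transcode to UTF-8.
def windows_1252_to_utf8(octets)
  octets.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# "\x93" and "\x94" are the Windows-1252 smart quotes (octal \223 and \224).
puts windows_1252_to_utf8("\x93foo\x94")
```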

Many Alternatives

icu4r, unicode, utf8proc, character-encodings

But they’re less relevant as of Rails 1.2

ActiveSupport::MultiByte

In Rails 1.2 (see RC1 blog post)

Adds .chars method to all strings

"Ādam".chars.length => 4

Optional C extension for speed

Controllers

Ensure all parameters are Unicode

Use a filter in ApplicationController

e.g. convert from Windows-1252 to UTF-8.

require 'iconv'

class ApplicationController < ActionController::Base
  # Converts Windows-1252 input to UTF-8.
  @@conv = Iconv.new "UTF-8", "WINDOWS-1252"

  before_filter :fix_windows_1252

  # Re-encode the parameters only when they are not already UTF-8.
  # NB: is_utf8 is not defined on the slide; it is assumed to be a
  # helper defined elsewhere.
  def fix_windows_1252
    unless is_utf8(request.parameters.to_s)
      fix_windows_1252_in_hash request.parameters
    end
  end

  # Walk the (possibly nested) parameter hash, converting every value.
  def fix_windows_1252_in_hash(h)
    h.each do |k, v|
      if v.is_a?(Hash)
        fix_windows_1252_in_hash(v)
      elsif v.is_a?(Array)
        v.map! { |item| @@conv.iconv(item) }
      else
        h[k] = @@conv.iconv(v)
      end
    end
  end
end
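The filter relies on an `is_utf8` check that the slides do not show. One possible stand-in is sketched below, assuming Ruby 1.9 or later; it is not the author's implementation. It leans on the fact that UTF-8 is reliably recognisable: arbitrary Windows-1252 text with high-bit characters is very unlikely to form valid UTF-8 sequences, though pure ASCII passes either way.

```ruby
# Hypothetical stand-in for the is_utf8 helper: true when the octets
# form well-formed UTF-8.
def is_utf8(str)
  str.dup.force_encoding("UTF-8").valid_encoding?
end

puts is_utf8("Ādam")         # well-formed UTF-8
puts is_utf8("\x93foo\x94")  # Windows-1252 smart quotes are not
```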

Models

As with Controllers, most issues are in Ruby itself

But keep an eye out for String processing

Validation

validates_format_of

validates_length_of

Databases

Most of the model

Need to ensure we can get out what we put in

MySQL

ALTER DATABASE `dev` CHARACTER SET UTF8;

SET NAMES 'UTF8';

encoding: UTF8 in database.yml


PostgreSQL

CREATE DATABASE foo ENCODING = 'UTF-8';

SET client_encoding = 'UTF-8';

encoding: UTF-8 in database.yml

SELECT name, setting FROM pg_settings WHERE name LIKE 'lc_%';

Controllers (again)

Have to tell HTTP what character encoding you are sending

Content-Type: text/html; charset=UTF-8

NB: Content-Length is bytes, not characters

class ApplicationController < ActionController::Base
  after_filter :fix_charset

  # Default to UTF-8 HTML, and add a charset to any text type lacking one.
  def fix_charset
    headers["Content-Type"] ||= "text/html; charset=UTF-8"

    if headers["Content-Type"].include?('text/') &&
       !headers["Content-Type"].include?('charset')
      headers["Content-Type"] += "; charset=UTF-8"
    end
  end
end
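The header logic in that filter can be tried outside Rails. The sketch below restates it as a plain function; `with_charset` is a hypothetical name, and the behaviour is meant to mirror the filter rather than replace it.

```ruby
# Hypothetical pure-Ruby version of the filter's logic.
def with_charset(content_type)
  content_type ||= "text/html; charset=UTF-8"
  if content_type.include?("text/") && !content_type.include?("charset")
    content_type += "; charset=UTF-8"
  end
  content_type
end

puts with_charset(nil)
puts with_charset("text/plain")
puts with_charset("image/png")
```

Non-text types and headers that already name a charset pass through unchanged.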

View

Specify encoding in <meta> as well

In case page is saved

Watch out for helpers that go near Strings

e.g. excerpt, highlight, truncate
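The risk with such helpers is that cutting a string by octets can split a multi-byte character. This is a sketch of the difference, assuming Ruby 1.9.3 or later; `truncate_octets` and `truncate_chars` are hypothetical helpers, not the Rails ones.

```ruby
# Hypothetical helper: keep the first n octets, regardless of character boundaries.
def truncate_octets(str, n)
  str.byteslice(0, n)
end

# Hypothetical helper: keep the first n characters.
def truncate_chars(str, n)
  str[0, n]
end

# One octet is only half of "Ā", so the result is malformed UTF-8.
puts truncate_octets("Ādam", 1).valid_encoding?

# One character keeps the text well-formed.
puts truncate_chars("Ādam", 1)
```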

View

link_to

Safe! If using UTF-8 everywhere

View

JavaScript

Should be Unicode safe in all browsers

Apache

Good tip for .htaccess

AddDefaultCharset UTF-8

Conclusion

Unicode is hard

Ruby doesn’t have much support

Rails is better

Use UTF-8 everywhere

Test, test, test
