Unicode for Rails
happygiraffe.net/blog/files/unicode_for_rails.pdf
What Is Unicode?
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
http://www.unicode.org/standard/WhatIsUnicode.html
ASCII
“In the beginning...”
First published in 1963; revised in 1967.
7-bit.
ISO-8859
Mid ‘80s
8-bit
ASCII superset
16 different, related standards
ISO-8859-1 (aka Latin-1) is most common
Windows-1252
Like ISO-8859-1, but with extra characters
e.g. smart quotes, em dash
The bane of your life
Unicode
21-bit
Pretty much all characters in use, in the same character set.
Encodings
How do you turn characters into octets?
It’s simple for ASCII & ISO-8859.
Unicode has three different schemes.
UTF-32
4 octets (32 bits) per character.
Very inefficient.
Not used much.
code point     character          UTF-32 code value(s)   glyph
122 (7A)       small Z (Latin)    00 00 00 7A            z
27700 (6C34)   water (Chinese)    00 00 6C 34            水
UTF-16
2 octets (16 bits) per character (mostly); characters outside the Basic Multilingual Plane take 4 octets (a surrogate pair).
Common on Windows & Java.
Somewhat wasteful for mostly Western text.
code point     character          UTF-16 code value(s)   glyph
122 (7A)       small Z (Latin)    00 7A                  z
27700 (6C34)   water (Chinese)    6C 34                  水
UTF-8
Multi-Byte, but ASCII compatible.
Very common in Internet protocols.
Reliably recognisable.
code point     character          UTF-8 code value(s)    glyph
122 (7A)       small Z (Latin)    7A                     z
27700 (6C34)   water (Chinese)    E6 B0 B4               水
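The three tables above can be reproduced in Ruby. This is a minimal sketch that assumes Ruby 1.9 or later, where String#encode is built in (the Ruby of these slides would need iconv instead). The helper name hex_octets is made up for illustration. The BE variants are used so that no byte order mark is emitted, matching the big-endian values in the tables.

```ruby
# Hypothetical helper: the octets of +str+ in +encoding+, as upper-case hex.
def hex_octets(str, encoding)
  str.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

small_z = "\u007A" # z
water   = "\u6C34" # 水

%w[UTF-32BE UTF-16BE UTF-8].each do |enc|
  puts "#{enc}: z = #{hex_octets(small_z, enc)}, water = #{hex_octets(water, enc)}"
end
```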
Which Encoding?
By default, pick UTF-8.
Choose UTF-16 when
Lots of non-Western text.
Interfacing with other UTF-16 systems.
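The size trade-off behind this advice can be checked directly. A rough sketch, assuming Ruby 1.9 or later; the two sample strings are arbitrary illustrations, not taken from the slides.

```ruby
western = "Unicode for Rails" # 17 ASCII characters
chinese = "\u6C34" * 17       # 17 copies of U+6C34 (水)

[western, chinese].each do |text|
  utf8  = text.encode("UTF-8").bytesize
  utf16 = text.encode("UTF-16BE").bytesize
  puts "#{text.length} characters: UTF-8 = #{utf8} octets, UTF-16 = #{utf16} octets"
end
```

ASCII text doubles in size under UTF-16, while the Chinese sample shrinks by a third. Real pages mix both (markup is ASCII), which is one reason UTF-8 is a sensible default.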
Accents
Some are built-in (e.g. é)
But you can build your own with “combining characters” (e.g. ĵ)
U+006A LATIN SMALL LETTER J
U+0302 COMBINING CIRCUMFLEX ACCENT
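The two code points above can be assembled with pack("U*"). A small sketch, assuming Ruby 1.9 or later, where String#length counts code points rather than octets:

```ruby
# Build j + combining circumflex from the two code points on the slide.
combined = [0x006A, 0x0302].pack("U*")

puts combined         # renders as a single glyph
p combined.codepoints # [106, 770]
p combined.length     # 2: counted in code points, not glyphs
p combined.bytesize   # 3: one octet for j, two for the accent
```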
Normalisation
How can I spot é if There’s More Than One Way To Do It?
“normalize” all strings before use
Four forms of normalisation
NFC, NFD, NFKC, NFKD
But only NFC matters, “says W3C”
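To illustrate why normalising matters, here is a sketch that compares the two spellings of é. It assumes Ruby 2.2 or later, which ships String#unicode_normalize; that method postdates these slides and is not what they had available.

```ruby
precomposed = "\u00E9"  # é as one code point
decomposed  = "e\u0301" # e followed by COMBINING ACUTE ACCENT

puts precomposed == decomposed # false: different code points

nfc_a = precomposed.unicode_normalize(:nfc)
nfc_b = decomposed.unicode_normalize(:nfc)
puts nfc_a == nfc_b            # true once both are in NFC
```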
Why bother?
It’s more work now...
But it opens everything up in the future!
The rest of the world is heading this way
International Domain Names
Punycode
iñtërnâtiônàlizætiøn.net
xn--itrntinliztin-vdb0a5exd8ewcye.net
IDN
But!
There’s a magic flag!
-K kcode  Specifies KANJI (Japanese) encoding.
-Ku turns on UTF-8 mode
$KCODE = "UTF8"
$KCODE =~ /^u/i
Sets encoding in Tk
Allows CGI::unescapeHTML to output UTF-8
SOAP libs use it here and there
Big user is the regex engine
/./u matches a UTF-8 char
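A quick way to see the /u flag at work is to count matches against octets. This sketch assumes Ruby 1.9 or later for the \u escape and String#bytesize; the /./u pattern itself is the one from the slide.

```ruby
name = "\u0100dam"      # "Ādam"

chars = name.scan(/./u) # one element per UTF-8 character
puts chars.length       # 4
puts name.bytesize      # 5, because Ā takes two octets in UTF-8
```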
pack / unpack
The only other place Ruby understands UTF-8.
[0x100, 0x64, 0x61, 0x6d].pack("U*")
=> "Ādam"
"Ādam".unpack("U*")
=> [256, 100, 97, 109]
Unicode affects...
Any character processing. In String:
[] []= =~ <=> == capitalize casecmp center chomp chop count delete downcase dump each eql? gsub index insert length ljust lstrip replace reverse rindex rjust rstrip scan slice split squeeze strip sub succ swapcase tr upcase upto
And regexes
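As an illustration of what goes wrong when one of these methods works on octets instead of characters, the sketch below reverses a string both ways. It assumes Ruby 1.9 or later; the octet-wise version imitates what a byte-oriented String does.

```ruby
name = "\u0100dam" # "Ādam"

# Character-aware reverse.
by_chars = name.reverse

# Octet-wise reverse: the two octets of Ā end up swapped.
by_octets = name.bytes.reverse.pack("C*").force_encoding("UTF-8")

puts by_chars                  # madĀ
puts by_octets.valid_encoding? # false
```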
iconv
Another core library
Converts between character encodings
conv = Iconv.new("UTF-8", "WINDOWS-1252")
conv.iconv "\223foo\224"
=> "“foo”"
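Iconv was removed from Ruby's standard library in Ruby 2.0. On Ruby 1.9 or later the same conversion can be sketched with the built-in String#encode, shown here as an alternative rather than as what the slides used:

```ruby
# 0x93 and 0x94 are the Windows-1252 curly double quotes
# (written "\223" and "\224" in octal above).
raw = "\x93foo\x94"

# encode(destination, source) reads the octets as Windows-1252.
utf8 = raw.encode("UTF-8", "Windows-1252")

puts utf8               # “foo”
p utf8.codepoints.first # 8220, i.e. U+201C
```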
Many Alternatives
icu4r, unicode, utf8proc, character-encodings
But they’re less relevant as of Rails 1.2
ActiveSupport::MultiByte
In Rails 1.2 (see RC1 blog post)
Adds .chars method to all strings
"Ādam".chars.length
=> 4
Optional C extension for speed
Controllers
Ensure all parameters are Unicode
Use a filter in ApplicationController
e.g. convert from Windows-1252 to UTF-8.
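require 'iconv'

class ApplicationController < ActionController::Base
  # Converts Windows-1252 octets to UTF-8.
  @@conv = Iconv.new "UTF-8", "WINDOWS-1252"

  before_filter :fix_windows_1252

  private

  # Re-encode the parameters only when they are not already valid UTF-8.
  def fix_windows_1252
    fix_windows_1252_in_hash request.parameters \
      unless is_utf8(request.parameters.to_s)
  end

  def fix_windows_1252_in_hash(h)
    h.each do |k, v|
      if v.is_a?(Hash)
        fix_windows_1252_in_hash(v)
      elsif v.is_a?(Array)
        v.map! { |item| @@conv.iconv(item) }
      elsif v.is_a?(String)
        # Leave non-String values (e.g. uploaded files) untouched.
        h[k] = @@conv.iconv(v)
      end
    end
  end

  # Not shown on the original slide; this is one assumed implementation.
  # unpack("U*") raises ArgumentError on malformed UTF-8.
  def is_utf8(str)
    str.unpack("U*")
    true
  rescue ArgumentError
    false
  end
end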
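The same idea can be tried outside Rails. The following is a sketch only, assuming Ruby 1.9 or later and String#encode in place of iconv; to_utf8 and the sample params hash are hypothetical names for illustration, not part of Rails. Unlike the filter above, it checks each string on its own and returns a copy instead of modifying the hash in place.

```ruby
# Hypothetical helper: returns a copy of +value+ in which every String is
# valid UTF-8. Strings that are not already valid UTF-8 are assumed to be
# Windows-1252, as in the filter above.
def to_utf8(value)
  case value
  when Hash
    value.each_with_object({}) { |(k, v), out| out[k] = to_utf8(v) }
  when Array
    value.map { |item| to_utf8(item) }
  when String
    as_utf8 = value.dup.force_encoding("UTF-8")
    if as_utf8.valid_encoding?
      as_utf8
    else
      # Octets with no Windows-1252 meaning become replacement characters.
      value.encode("UTF-8", "Windows-1252", invalid: :replace, undef: :replace)
    end
  else
    value # numbers, nil, etc. pass through unchanged
  end
end

params = { "title" => "\x93foo\x94", "tags" => ["caf\xC3\xA9"], "page" => 1 }
p to_utf8(params)
```

Checking validity first matters: running valid UTF-8 through a Windows-1252 conversion would double-encode it.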
Models
As with Controllers, most issues are in Ruby itself
But keep an eye out for String processing
MySQL
ALTER DATABASE dev CHARACTER SET utf8;
SET NAMES 'utf8';
encoding: utf8 in database.yml
PostgreSQL
CREATE DATABASE foo ENCODING = 'UTF-8';
SET client_encoding = 'UTF-8';
encoding: UTF-8 in database.yml
SELECT name, setting FROM pg_settings WHERE name LIKE 'lc_%';
Controllers (again)
Have to tell HTTP what character encoding you are sending
Content-Type: text/html; charset=UTF-8
NB: Content-Length is bytes, not characters
class ApplicationController < ActionController::Base
  after_filter :fix_charset

  def fix_charset
    headers["Content-Type"] ||= "text/html; charset=UTF-8"

    if headers["Content-Type"].include?('text/') &&
       !headers["Content-Type"].include?('charset')
      headers["Content-Type"] += "; charset=UTF-8"
    end
  end
end
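The note that Content-Length counts bytes rather than characters is easy to check. A minimal sketch, assuming Ruby 1.9 or later for String#bytesize; the body string is an invented example.

```ruby
body = "<p>\u0100dam</p>" # "<p>Ādam</p>"

puts body.length   # 11 characters
puts body.bytesize # 12 octets: this is the value Content-Length needs
```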
View
Specify encoding in <meta> as well
In case page is saved
Watch out for helpers that go near Strings
e.g. excerpt, highlight, truncate
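The risk with such helpers is that a cut lands in the middle of a multi-byte character. The sketch below shows the effect in plain Ruby 1.9 or later; it is not the implementation of any Rails helper.

```ruby
text = "\u0100dam" # "Ādam"

# Cutting after one octet splits Ā (two octets in UTF-8) in half.
cut_by_octets = text.byteslice(0, 1)
puts cut_by_octets.valid_encoding? # false

# Cutting after one character keeps the string valid.
cut_by_chars = text[0, 1]
puts cut_by_chars.valid_encoding?  # true
```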
Conclusion
Unicode is hard
Ruby doesn’t have much support
Rails is better
Use UTF-8 everywhere
Test, test, test