![Page 1: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/1.jpg)
Unicode for Small Children (and Children at Heart)
Feihong Hsu
Chicago Python Users Group
March 8, 2007
![Page 2: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/2.jpg)
Welcome to the Wonderful World of Unicorns!
A Magical Guide to the World's Most Beloved Mythological Equine
![Page 3: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/3.jpg)
Welcome to the Useful World of Unicode!
A Practical Guide to the World's Most Popular International Text Standard
![Page 4: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/4.jpg)
Top 3 reasons that unicorns are great
● Friendly and wise
● Healing power
● Bane of evil
![Page 5: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/5.jpg)
Top 3 reasons that Unicode is important
● Comprehensive language coverage
● Multiple languages in a single document
● Standardized
![Page 6: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/6.jpg)
The difference between Horses and Unicorns

| | Horses | Unicorns |
|---|---|---|
| Habitat | Grasslands | Enchanted forests |
| Diet | Apples, oats, grass, barley, etc. | Love, spirit of wonder |
| Abilities | Galloping, eating, pooping | Sentience, telepathy, laser vision (unconfirmed) |
![Page 7: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/7.jpg)
Difference between ISO 8859 and Unicode

| | ISO 8859 | Unicode |
|---|---|---|
| # supported languages | Some | A lot |
| # supported characters | 256 | 100,000+ |
| # bytes for each character | 1 | 1-4 |
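
The "1-4 bytes" row refers to UTF-8, where the byte count depends on the character. Here is a minimal sketch in Python 3 (not the Python 2 used elsewhere in this deck); the sample characters, including the unicorn emoji, are my own picks for illustration.

```python
# Python 3 sketch: how many bytes one character needs in each encoding.
samples = ["A", "\u00e9", "\u5ba0", "\U0001f984"]  # A, e-acute, a Chinese character, unicorn emoji

def utf8_length(char):
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(char.encode("utf-8"))

utf8_lengths = [utf8_length(c) for c in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# ISO 8859-1 (Latin-1) always uses one byte, but only covers 256 characters.
latin1_length = len("\u00e9".encode("iso-8859-1"))
try:
    "\u5ba0".encode("iso-8859-1")
    chinese_fits_latin1 = True
except UnicodeEncodeError:
    chinese_fits_latin1 = False
print(latin1_length, chinese_fits_latin1)  # 1 False
```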
![Page 8: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/8.jpg)
So what, exactly, is Unicode?
Unicode is a standard that assigns a unique number to each character in
every human language
Ok, not every language, see next slide
![Page 9: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/9.jpg)
What is Unicode not?
● Doesn't address how the characters are rendered (that's up to font makers)
● Doesn't deal with imaginary languages like Klingon and Elvish
● Doesn't deal with ancient languages
● Doesn't deal with obscure languages that no one uses
![Page 10: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/10.jpg)
How does Hollywood “create” unicorns?
● CGI
● Horse with horn glued to forehead
● Two dudes in a costume
![Page 11: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/11.jpg)
How does a programmer create Unicode documents?
● Technically, you can't make a Unicode document
● Usually you pick an official encoding (UTF-8, UTF-16, etc)
● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)
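
To make "pick an encoding" concrete, here is a small Python 3 sketch (an assumption on my part; the deck itself uses Python 2). It encodes the same four Chinese characters three ways and checks that each byte sequence decodes back to identical text. The variable names are illustrative only.

```python
text = "\u4f60\u597d\u4e16\u754c"  # "Hello World" in Chinese

# The same text becomes different bytes under different encodings.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16", "gb2312")}
for name, data in encoded.items():
    print(name, len(data), data)

# Every encoding decodes back to the same four code points.
round_trips = [data.decode(name) == text for name, data in encoded.items()]
print(all(round_trips))  # True
```

The document on disk is always bytes in some encoding; "Unicode" only describes the text you get after decoding.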
![Page 12: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/12.jpg)
Python and Unicorn
Working together to combat evil!
![Page 13: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/13.jpg)
Python and Unicode
Working together to create international applications!
![Page 14: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/14.jpg)
Unicode-related functions
● unichr()
● ord()
● unicode.encode()
● str.decode()
![Page 15: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/15.jpg)
Examples of usage

    >>> s = unichr(23456)
    >>> print s
    宠
    >>> ord(s)
    23456
    >>> s.encode('utf-8')
    '\xe5\xae\xa0'
    >>> s.encode('gb2312')
    '\xb3\xe8'
    >>> print _
    ³è
    >>> '\xe5\xae\xa0'.decode('utf-8')
    u'\u5ba0'
    >>> print _
    宠
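
For readers on Python 3, the same session can be sketched as follows. This is a translation, not part of the original talk: in Python 3, unichr() is spelled chr(), text is str, and encoded data is bytes.

```python
s = chr(23456)                 # Python 3 spelling of unichr(23456)
code_point = ord(s)            # back to the number

utf8_bytes = s.encode("utf-8")   # text -> bytes
gb_bytes = s.encode("gb2312")    # same character, different encoding

decoded = b"\xe5\xae\xa0".decode("utf-8")  # bytes -> text

print(code_point, utf8_bytes, gb_bytes, decoded == s)
```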
![Page 16: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/16.jpg)
unicode and str: two different types!
● They have exactly the same API
● But they don't have the same repr()
● And they don't have the same type()
● Use isinstance() to tell them apart
![Page 17: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/17.jpg)
unicode and str example

    >>> u = unicode()
    >>> type(u)
    <type 'unicode'>
    >>> print repr(u)
    u''
    >>> isinstance(u, str)
    False
    >>> s = str()
    >>> type(s)
    <type 'str'>
    >>> print repr(s)
    ''
    >>> isinstance(s, unicode)
    False
![Page 18: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/18.jpg)
Two ways to write a Unicode file
● Use the file object returned by codecs.open()
● Use a regular file object along with unicode.encode()
![Page 19: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/19.jpg)
Example using codecs.open()

    >>> import codecs
    >>> s = u'\u4f60\u597d\u4e16\u754c'
    >>> fout = codecs.open('document.txt', 'w', 'utf-8')
    >>> fout.write(s)
    >>> fout.close()
    >>> open('document.txt').read().decode('utf-8')
    u'\u4f60\u597d\u4e16\u754c'
![Page 20: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/20.jpg)
Example using unicode.encode()

    >>> s = u'\u4f60\u597d\u4e16\u754c'
    >>> fout = open('document.txt', 'w')
    >>> fout.write(s.encode('utf-8'))
    >>> fout.close()
    >>> open('document.txt').read().decode('utf-8')
    u'\u4f60\u597d\u4e16\u754c'
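
Both approaches carry over to Python 3, where the built-in open() accepts an encoding argument. The sketch below is my own Python 3 adaptation and writes into a temporary directory (an assumption made so that it does not touch real files).

```python
import os
import tempfile

text = "\u4f60\u597d\u4e16\u754c"

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "document.txt")

    # Text mode with an explicit encoding plays the role of codecs.open().
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(text)

    # Binary mode returns raw bytes, which we decode ourselves.
    with open(path, "rb") as fin:
        raw = fin.read()

round_tripped = raw.decode("utf-8")
print(len(raw), round_tripped == text)  # 12 True
```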
![Page 21: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/21.jpg)
Two ways to read Unicode files
● Use the file object returned by codecs.open()
● Use a regular file object along with str.decode()
● Watch out for the BOM!
![Page 22: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/22.jpg)
What is Byte Order Mark?
● Called BOM for short
● In UTF-16 docs, indicates little-endian or big-endian
● Often appears in UTF-8 docs to distinguish them from ASCII docs
● Use read(1) for UTF-8 documents with BOM
![Page 23: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/23.jpg)
Example of reading from a UTF-8 file with BOM

    >>> import codecs
    >>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
    >>> fin.read(1)
    u'\ufeff'
    >>> fin.read()
    u'\u4f60\u597d\u4e16\u754c'
    >>> fin.close()
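
An alternative to read(1), not covered in the talk, is the standard 'utf-8-sig' codec, which drops a leading BOM if one is present. A minimal Python 3 sketch, with in-memory bytes standing in for the file:

```python
import codecs

text = "\u4f60\u597d\u4e16\u754c"
data = codecs.BOM_UTF8 + text.encode("utf-8")  # what a BOM-prefixed file contains

with_bom = data.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # the BOM is stripped if present

# The same codec is harmless on data that has no BOM.
no_bom_input = text.encode("utf-8").decode("utf-8-sig")

print(repr(with_bom[0]), without_bom == text, no_bom_input == text)
```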
![Page 24: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/24.jpg)
Reading and writing XML
● ElementTree handles everything implicitly
● It even eats the BOM without complaining
● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8)
● cElementTree works great too!
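
A quick way to check the BOM and declaration claims is to feed ElementTree some UTF-8 bytes that start with a BOM and have no XML declaration. This sketch assumes Python 3 and the standard-library xml.etree.ElementTree; the element name is made up for the example.

```python
import codecs
import xml.etree.ElementTree as ET

# UTF-8 bytes with a BOM and no XML declaration.
body = "<greeting>\u4f60\u597d\u4e16\u754c</greeting>"
xml_bytes = codecs.BOM_UTF8 + body.encode("utf-8")

root = ET.fromstring(xml_bytes)
print(root.tag, root.text)

# Asking for an encoding when serializing returns bytes rather than text.
serialized = ET.tostring(root, encoding="utf-8")
```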
![Page 25: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/25.jpg)
File system directory listing
● On Windows, os.listdir('.') won't show you int'l characters
● You need to use os.listdir(u'.') to see the Unicode files
● os.getcwd() doesn't show int'l characters
● Use os.getcwdu() instead
![Page 26: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/26.jpg)
String interpolation
● Str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode)
● Unicode template strings need to be interpolated with unicode objects
![Page 27: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/27.jpg)
String interpolation example

    >>> 'Hello %s' % u'\u98db\u9d3b'
    u'Hello \u98db\u9d3b'
    >>> u'Hello %s' % u'\u98db\u9d3b'
    u'Hello \u98db\u9d3b'
    >>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
    'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
    >>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
    Traceback (most recent call last):
      File "<pyshell#36>", line 1, in ?
        u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
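
The fix for the failing case is to decode the byte string before interpolating it. The sketch below shows that in Python 3 (my adaptation, not from the talk). One difference to be aware of: Python 3 does not raise when bytes are interpolated into text; it inserts their repr instead.

```python
name_bytes = b"\xe9\xa3\x9b\xe9\xb4\xbb"  # the UTF-8 bytes from the example above

# Decode first, then interpolate: both operands are text.
greeting = "Hello %s" % name_bytes.decode("utf-8")

# Without decoding, Python 3 quietly formats the bytes object's repr.
surprising = "Hello %s" % name_bytes

print(greeting)
print(surprising)
```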
![Page 28: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/28.jpg)
Putting Unicode in your Python source code
● Put “# -*- coding: utf-8 -*-” at top of your file
● Idle automatically detects non-ASCII characters and prompts to edit your file
● Not generally recommended
![Page 29: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/29.jpg)
Regular expressions
● The \w special character doesn't usually match non-ASCII characters
● To match non-ASCII characters, use re.UNICODE flag
● Remember that punctuation in different languages uses different characters
![Page 30: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/30.jpg)
Regular expression example

    >>> import re
    >>> s = u'ABC\u4f60\u597d\u4e16\u754c'
    >>> m = re.match(r"\w+", s)
    >>> m.group()
    u'ABC'
    >>> m = re.match(r"\w+", s, re.UNICODE)
    >>> m.group()
    u'ABC\u4f60\u597d\u4e16\u754c'
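
In Python 3 the default is reversed: \w in a str pattern already matches non-ASCII word characters, and re.ASCII restores the ASCII-only behavior. A short sketch, assuming Python 3:

```python
import re

s = "ABC\u4f60\u597d\u4e16\u754c"

# Python 3 default for str patterns: \w is Unicode-aware.
unicode_match = re.match(r"\w+", s).group()

# re.ASCII opts back into ASCII-only matching.
ascii_match = re.match(r"\w+", s, re.ASCII).group()

print(unicode_match == s, ascii_match)  # True ABC
```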
![Page 31: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/31.jpg)
Considerations for web pages
● Don't make pages or folders with int'l characters (Firefox doesn't handle int'l URLs well)
● Make sure you use the <meta> tag when generating web pages
● You can display Unicode even in ASCII-encoded pages (use character entities)
![Page 32: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/32.jpg)
Web page with <meta> tag

    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
      </head>
      <body>
        <h1>你好世界</h1>
      </body>
    </html>
![Page 33: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/33.jpg)
Web page with character entities

    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html;charset=ascii">
      </head>
      <body>
        <h1>&#20320;&#22909;&#19990;&#30028;</h1>
      </body>
    </html>

Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')
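
Here is the conversion recipe applied to the "Hello World" string, as a Python 3 sketch. The html.unescape() round trip is my addition, included only to show that the entities carry the same text.

```python
import html

s = "\u4f60\u597d\u4e16\u754c"

# Characters outside ASCII become numeric character references.
entities = s.encode("ascii", "xmlcharrefreplace")
print(entities)  # b'&#20320;&#22909;&#19990;&#30028;'

# A browser (or html.unescape) turns the references back into text.
restored = html.unescape(entities.decode("ascii"))
```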
![Page 34: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/34.jpg)
Processing documents of unknown encoding
● Use the chardet module
● chardet.detect() function:
  – accepts a string
  – returns a dictionary with two keys: 'encoding' and 'confidence'
● Also try BeautifulSoup for web pages
![Page 35: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/35.jpg)
Encoding detection example

    >>> import chardet, urllib2
    >>> html = urllib2.urlopen('http://chol.co.kr').read()
    >>> result = chardet.detect(html)
    >>> result
    {'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
    >>> print html.decode(result['encoding'])
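
When chardet is not installed, a much cruder fallback is to try a short list of likely encodings in order. To be clear, this is not what chardet does (chardet uses statistical detection); it is a standard-library sketch in Python 3, and the function name and candidate list are hypothetical.

```python
def decode_with_fallbacks(data, candidates=("utf-8", "euc-kr", "latin-1")):
    """Try each candidate encoding in turn and return (text, encoding).

    Order matters: latin-1 accepts any byte sequence, so it must come
    last and only serves as a last resort.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# Sample input: a Korean word encoded as EUC-KR (not valid UTF-8).
korean = "\ud55c\uad6d\uc5b4".encode("euc-kr")
text, guessed = decode_with_fallbacks(korean)
print(guessed)  # euc-kr
```

A wrong guess can still decode without error, so treat the result as a hint rather than a fact.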
![Page 36: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/36.jpg)
Tools that play nice with Unicode
● IDLE (raw_input() accepts Unicode)
● Notepad++ (can autodetect UTF-8 files with BOM)
● jEdit
![Page 37: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/37.jpg)
Libraries that play nice with Unicode
● Tkinter
● wxPython
● Mako
● BeautifulSoup
● feedparser
● ElementTree
● lxml
![Page 38: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/38.jpg)
Libraries that don't play nice with Unicode
● cStringIO (StringIO.write() doesn't accept Unicode strings)
● buzhug
● Various ID3 libraries
● ?
![Page 39: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/39.jpg)
Databases
● SQLite has no problem with Unicode
● SQLAlchemy with SQLite is fine too
● Other databases - ?
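
The SQLite claim is easy to try with the standard-library sqlite3 module. This is a minimal Python 3 sketch using an in-memory database; the table and column names are made up for the example.

```python
import sqlite3

text = "\u4f60\u597d\u4e16\u754c"

conn = sqlite3.connect(":memory:")  # throwaway database, nothing written to disk
conn.execute("CREATE TABLE greetings (message TEXT)")

# Pass the text as a parameter rather than formatting it into the SQL.
conn.execute("INSERT INTO greetings (message) VALUES (?)", (text,))

stored = conn.execute("SELECT message FROM greetings").fetchone()[0]
conn.close()

print(stored == text)  # True
```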
![Page 40: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/40.jpg)
Platform-specific issues
● Windows DOS prompt has no love for Unicode
● MacOS X IDLE can't handle Unicode
● MacOS X terminal doesn't like Unicode, likes UTF-8
● Recommendation: Use PyCrust?
![Page 41: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/41.jpg)
Demos
● Filesystem demo
● Mako template engine demo
● chardet demo
● pysqlite demo
● wxPython demo
![Page 42: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/42.jpg)
Questions?
有问题吗?
![Page 43: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/43.jpg)
Unicode for Small Children (and Children at Heart)
Thanks to Chris McAvoy for the conversation at PyCon that inspired this talk.
![Page 44: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/44.jpg)
Welcome to the Wonderful World of Unicorns!
Completely drawn on my tablet PC using the free Ink Art program. Unfortunately, Ink Art doesn't come with good coloring tools so I just left it colorless.
![Page 45: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/45.jpg)
![Page 46: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/46.jpg)
![Page 47: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/47.jpg)
Top 3 reasons that Unicode is important
The Unicode Standard is maintained by the Unicode Consortium, an organization based in California.
![Page 48: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/48.jpg)
The difference between Horses and Unicorns
I really wasn't sure about including the laser vision ability. I honestly thought it was an urban myth. But when a friend of my cousin's sister's friend said that she saw it in person, I finally relented.
![Page 49: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/49.jpg)
Difference between ISO 8859 and Unicode
Somebody noted that ISO 8859 can actually support more than 256 characters through its various extensions, so this is an oversimplification.
![Page 50: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/50.jpg)
So what, exactly, is Unicode?
The “unique number” for each character is called a code point in Unicode terminology.
![Page 51: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/51.jpg)
What is Unicode not?
Although there are many languages that Unicode doesn't directly support, there are extensions to Unicode that are designed to handle these cases.
![Page 52: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/52.jpg)
How does Hollywood “create” unicorns?
It helps if the two dudes are very high. And if they have circus experience. And if neither of them has a trick leg.
![Page 53: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/53.jpg)
How does a programmer create Unicode documents?
In the vast majority of cases, I think UTF-8 is more than adequate. If in doubt, just go with that encoding.
![Page 54: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/54.jpg)
Python and Unicorn
I think this is a case of the graphic actually undermining the point I'm trying to make. This is my attempt to render a dynamic, exciting action scene of a pitched battle between orc, unicorn and python. They are fighting for the fate of the damsel in distress because she is, like, oh so fine (well, at least when she's got her makeup on, which she doesn't in this picture). Unfortunately, the unicorn looks like it's about to be stabbed in the ass, and the python seems more interested in biting a chunk out of the damsel than in saving her.
![Page 55: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/55.jpg)
Python and Unicode
The only time I actually visited the Unicode Consortium's web site was to get a copy of the Unicode logo.
![Page 56: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/56.jpg)
Unicode-related functions
Thanks to Ian Bicking for pointing out that it should be unicode.encode(), not str.encode().
![Page 57: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/57.jpg)
Examples of usage
The PDF version of this presentation doesn't render the Chinese character properly. But if you copy and paste in a Unicode-aware editor, you'll probably be able to see it. I admit it is pretty rare to put a Chinese character in Courier New font.
![Page 58: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/58.jpg)
unicode and str: two different types!
Thanks to Atul Varma for making some comments that led me to adding this slide (and the next one).
![Page 59: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/59.jpg)
![Page 60: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/60.jpg)
![Page 61: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/61.jpg)
![Page 62: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/62.jpg)
![Page 63: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/63.jpg)
![Page 64: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/64.jpg)
What is Byte Order Mark?
The actual value of the BOM is 0xfeff. If you try to print it in the Python interpreter, you won't see anything.
![Page 65: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/65.jpg)
![Page 66: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/66.jpg)
Reading and writing XML
The lxml module is similarly awesome.
![Page 67: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/67.jpg)
File system directory listing
The behavior under Mac OS X is somewhat different. I don't know about Linux.
![Page 68: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/68.jpg)
String interpolation
Template engines have these sorts of issues as well. In particular, if you want to render a unicode string in Mako or Myghty, you need to pass unicode strings into the template.
![Page 69: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/69.jpg)
![Page 70: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/70.jpg)
Putting Unicode in your Python source code
I don't recommend putting Unicode strings in your source code because people who don't have Unicode-aware editors will just see annoying gibberish.
![Page 71: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/71.jpg)
Regular expressions
Punctuation characters in English: . ? !
Compare with punctuation characters in Chinese: 。 ? !
Although they look only slightly different, they have different code points in Unicode.
![Page 72: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/72.jpg)
![Page 73: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/73.jpg)
Considerations for web pages
As Atul Varma pointed out, Firefox mangles the URL but does so in a standard way. However, it still ends up not finding the page. IE can actually find and display pages with Unicode names. This is probably the only thing IE does better than Firefox.
![Page 74: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/74.jpg)
Web page with <meta> tag
The text is Chinese for “Hello World”.
![Page 75: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/75.jpg)
Web page with character entities
Thanks to Ian Bicking for pointing out a shorter conversion recipe. For the record, the original one is:
''.join('&#%d;' % ord(c) for c in s)
![Page 76: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/76.jpg)
![Page 77: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/77.jpg)
Encoding detection example
You can also try BeautifulSoup for web pages. Example:

    content = urllib2.urlopen(url).read()
    soup = BeautifulSoup(content)
    encoding = soup.originalEncoding
![Page 78: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/78.jpg)
Tools that play nice with Unicode
Note that only IDLE on Windows has this feature.
![Page 79: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/79.jpg)
![Page 80: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/80.jpg)
![Page 81: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/81.jpg)
![Page 82: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/82.jpg)
Platform-specific issues
I checked and it turns out that PyCrust chokes on int'l characters sent through raw_input(), even on Windows. So I formally withdraw my recommendation of PyCrust.
![Page 83: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/83.jpg)
![Page 84: Unicode for Small Children (and Children at Heart)](https://reader033.vdocuments.site/reader033/viewer/2022052522/554fa070b4c905ad218b49db/html5/thumbnails/84.jpg)
Questions?
Thanks to the experts in the audience who provided hard-hitting answers to the tough questions. And, of course, thanks to everyone who attended my first talk at ChiPy. I hope there will be more.