Unicode
Day 12 - 9/22/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
22-Sept-2014, NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction. http://www.tulane.edu/~howard/CompCultEN/
The quiz was the review.
Review of Lists
Open Spyder
6. Non-English characters: one code to rule them all
Did you know …
>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']
Introduction
So your program is humming along, and it hits the string 'cañón' and chokes. For instance, it may try to find out the length of cañón:
>>> S = 'cañón'                  # in Python 2, a byte string
>>> len(S)
>>> from re import findall
>>> findall(r'\w{5}',S)
>>> T = findall(r'.{5}',S)       # five bytes, not five characters
>>> T
['ca\xc3\xb1\xc3']
>>> U = ''.join(T)
>>> print U
>>> findall(r'.{7}',S)           # seven bytes cover the whole word
['ca\xc3\xb1\xc3\xb3n']
>>> T = findall(r'.{7}',S)
>>> U = ''.join(T)
>>> print U
cañón
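The session above is Python 2, where 'cañón' is stored as UTF-8 bytes. As a rough sketch of the same experiment, assuming Python 3 (which the slides do not use), the character count and the byte count can be compared side by side. The variable names B, char_count, byte_count, word_match, byte_match and round_trip are illustrative only.

```python
# Python 3 sketch; the session above is Python 2.
from re import findall

S = 'ca\u00f1\u00f3n'            # 'cañón' as a str: a sequence of code points
B = S.encode('utf-8')            # bytes: the UTF-8 encoding of S

char_count = len(S)              # counts characters
byte_count = len(B)              # counts bytes; ñ and ó take two bytes each

# On a str, \w matches Unicode letters, so five word characters are found.
word_match = findall(r'\w{5}', S)

# On bytes, '.' matches a single byte, so five bytes stop halfway through ó.
byte_match = findall(rb'.{5}', B)

# Decoding all the bytes restores the original string.
round_trip = B.decode('utf-8')

print(char_count, byte_count, byte_match)
```

The five-byte match corresponds to the ['ca\xc3\xb1\xc3'] result in the Python 2 session: the pattern counts bytes, not letters.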
6.1. English characters and ASCII
Computers were originally designed to use the English alphabet, and in particular an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee”; see ASCII in Wikipedia.
ASCII is ultimately based on telegraph codes and represents the digits 0-9, the English letters a-z and A-Z, the English punctuation symbols plus a blank space, along with control codes that originated with Teletype machines, some of which are now obsolete.
ASCII characters
(rows: first hex digit of the code; columns: second hex digit; – marks a control code; SP is the space)
    0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0   –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –
1   –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –
2   SP !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6   `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  –
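The table can be regenerated from the codes themselves. The following is a minimal sketch, assuming Python 3; the helper name ascii_row is hypothetical and not part of the course materials. Control codes are shown as '--' and the space as 'SP'.

```python
def ascii_row(high):
    """Return the 16 cells for codes 0x<high>0 through 0x<high>F."""
    cells = []
    for low in range(16):
        code = high * 16 + low
        if code < 32 or code == 127:
            cells.append('--')        # control code, no printable glyph
        elif code == 32:
            cells.append('SP')        # the space character
        else:
            cells.append(chr(code))   # printable ASCII character
    return cells

# Header: the second hex digit of each code, 0 through F.
header = '   ' + ' '.join('{:<2X}'.format(low) for low in range(16))
print(header)

# One line per first hex digit, 0 through 7.
for high in range(8):
    row = ' '.join('{:<2}'.format(cell) for cell in ascii_row(high))
    print('{:X}  '.format(high) + row)
```

Reading a code off the table is then a matter of joining the row and column digits: 'A' sits in row 4, column 1, so its code is hexadecimal 41, or decimal 65.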
So now you know …
>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']
>>> ord(' ')
>>> ord('!')
>>> ord('~')
>>> chr(32)
>>> chr(33)
>>> chr(126)
>>> chr(127)
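The connection between the sort order and the ASCII table can be made explicit. As a small sketch, assuming Python 3 (the names codes, by_default and by_code are illustrative), each character is paired with its code, and sorting by ord() is compared with the default sort:

```python
unsorted = 'a*@A6'

# Pair each character with its ASCII code.
codes = [(ch, ord(ch)) for ch in unsorted]

# sorted() on a string compares characters by their codes, so sorting
# explicitly by ord() gives the same order.
by_default = sorted(unsorted)
by_code = sorted(unsorted, key=ord)

print(codes)
print(by_default)
print(by_default == by_code)
```

Punctuation and digits come before uppercase letters, and uppercase before lowercase, because that is the order of their rows in the ASCII table.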
Background
6.2. Unicode and UTF-8
6.2.1. Character encoding in Python
Next time
7. NLTK and Internet corpora, but I am going to fold this chapter into §1 & §2, so the chapter numbering will change.