Python 3

March 15, 2011

NLTK

import nltk
nltk.download()

1. Look at the lists of available texts

import nltk
from nltk.book import *

texts()

2. Check out what the text1 (Moby Dick) object looks like

print text1[0:50]

Looks like a list of word tokens.

3. Get a list of the most frequent word TOKENS

fd=FreqDist(text1)
print fd.keys()[0:10]

FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html

Give it a list of word tokens.

The keys are automatically sorted, most frequent first. Print the first 10 keys.
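The slide code is written for Python 2 (print statement) and for the NLTK of that time, where fd.keys() came back sorted by frequency. As a minimal sketch of the same idea without NLTK, the standard library's collections.Counter can stand in for FreqDist. The token list is a made-up sample, not Moby Dick.

```python
from collections import Counter

# Hypothetical sample tokens standing in for text1.
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship", ","]

# Count how often each token occurs.
fd = Counter(tokens)

# most_common(n) returns (token, count) pairs, highest count first.
top_tokens = [token for token, count in fd.most_common(2)]
print(top_tokens)
```

Recent NLTK releases appear to build FreqDist on top of Counter, so fd.most_common(10) is the likely replacement for fd.keys()[0:10] there.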

4. Now get a concordance of the third most common word

text1.concordance("and")

concordance is a method defined for an NLTK text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance

concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.
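To make "concordance" concrete, here is a rough sketch of the idea: show every occurrence of a word together with its neighbors. The helper simple_concordance and its context parameter are hypothetical and are not NLTK's implementation. NLTK measures the window in characters (width=79), while this sketch counts tokens and ignores case for simplicity.

```python
def simple_concordance(tokens, word, context=3, lines=25):
    """Return up to `lines` strings showing `word` with `context` tokens on each side."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            # Slice out the neighbors, clamping at the start of the list.
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            rows.append(f"{left} [{token}] {right}".strip())
            if len(rows) == lines:
                break
    return rows


# Hypothetical sample tokens standing in for text1.
sample = ["the", "ship", "and", "the", "whale", "and", "the", "sea"]
rows = simple_concordance(sample, "and", context=2)
for row in rows:
    print(row)
```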

String Operations

5. What if you don't want punctuation in your list? First, a simple way to fix it:

mobyDick=[x.replace(",","") for x in text1]
mobyDick=[x.replace(";","") for x in mobyDick]
mobyDick=[x.replace(".","") for x in mobyDick]
mobyDick=[x.replace("'","") for x in mobyDick]
mobyDick=[x.replace("-","") for x in mobyDick]
mobyDick=[x for x in mobyDick if len(x)>1]

fd=FreqDist(mobyDick)
print fd.keys()[0:10]

Make a new list of tokens. Call it mobyDick.

For each token x in the original list, copy the token into the new list, except replace each , with nothing. The next lines do the same for ; . ' and -

Then, finally, just keep the tokens longer than one character (this drops what was originally “.” and is now empty).

Make a new FreqDist with the new list of tokens, call it fd.

Print it like before.
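The chained replace steps can be tried on a small list without loading NLTK. This is a sketch in Python 3 with made-up tokens. The loop over punctuation marks is a compact rewrite of the five separate lines, not the slide's exact code.

```python
# Hypothetical sample tokens; text1 from nltk.book is not needed here.
tokens = ["whale", ",", "a", "sea-green", ";", "ship's", ".", "I", "and"]

cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    # str.replace returns a new string with every occurrence of mark removed.
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keep tokens longer than one character: this drops "" but also "a" and "I".
moby_tokens = [x for x in cleaned if len(x) > 1]
print(moby_tokens)
```

Note the side effect of len(x)>1: genuine one-letter words are filtered out along with the emptied punctuation tokens.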

Regular Expressions

6. Now the more complicated, but less typing way:

import nltk
from nltk.book import *
import re

punctuation = re.compile("[,.; '-]")
punctuationRemoved=[punctuation.sub("",x) for x in text1]

fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
print fd.keys()[0:10]

Import the regular expression module.

Compile a regular expression. The RegEx will match any of the characters inside the brackets.

Call the “sub” function associated with the RegEx named punctuation. Replace anything that matches the RegEx with nothing.

As before, do this to each token in the text1 list. Call this new list punctuationRemoved.

Get a FreqDist of all tokens with length >1.

Print the top 10 word tokens as usual.

Regular Expressions are Really Powerful and Useful!
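The same cleanup with a character class can be sketched in Python 3 on a made-up token list (again an assumption, standing in for text1). It should give the same result as removing each punctuation mark with its own replace call.

```python
import re

# Hypothetical sample tokens standing in for text1.
tokens = ["whale", ",", "a", "sea-green", ";", "ship's", ".", "I", "and"]

# Inside [...] each character stands for itself; "-" is literal because it comes last.
punctuation = re.compile("[,.; '-]")

# sub("", x) replaces every match in x with the empty string.
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Keep only tokens longer than one character.
kept = [x for x in punctuation_removed if len(x) > 1]
print(kept)
```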

Quick Diversion

7. What if you wanted to see the least common word tokens?

print fd.keys()[-10:]

Print the tokens from position -10 to the end.
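Negative indices count from the end of a list, so a slice starting at -10 gives the last ten items. A small Python 3 sketch with collections.Counter in place of FreqDist (an assumption, using made-up tokens) shows the same trick with the last two:

```python
from collections import Counter

# Hypothetical sample tokens.
tokens = ["the", "the", "the", "whale", "whale", "sea"]
fd = Counter(tokens)

# Tokens ordered from most to least frequent.
ordered = [token for token, count in fd.most_common()]

# A negative start index counts from the end of the list.
least_common = ordered[-2:]
print(least_common)
```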

8. And what if you wanted to see the frequencies with the words?

print [(k, fd[k]) for k in fd.keys()[0:10]]

For each of the first 10 keys “k” in the FreqDist, print it and look up its value (fd[k]).
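The pairing of each key with its count can be sketched the same way in Python 3, with collections.Counter standing in for FreqDist (an assumption) and a made-up token list:

```python
from collections import Counter

# Hypothetical sample tokens.
fd = Counter(["the", "the", "the", "whale", "whale", "sea"])

# The two most frequent keys, most frequent first.
top_keys = [token for token, _ in fd.most_common(2)]

# Pair each key with its count by looking it up, as on the slide.
pairs = [(k, fd[k]) for k in top_keys]
print(pairs)
```

With Counter the lookup is not strictly needed, since most_common already returns (token, count) pairs.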

Back to Regular Expressions

9. Another simple example

import re

myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."

colorsRegEx=re.compile("blue|red|green")
print colorsRegEx.sub("color",myString)

Looks similar to the RegEx that matched punctuation before.

This RegEx matches the substring “blue” or the substring “red” or the substring “green”.

Here, substitute anything that matches the RegEx with the string “color”.
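A Python 3 sketch of the color substitution, on a shortened version of the example string. The second half goes slightly beyond the slide: because the RegEx matches substrings, "red" is also found inside longer words, and the \b word-boundary variant is one possible way to avoid that.

```python
import re

my_string = "I have red shoes and blue pants and a green shirt."

# "|" means "or": match blue, or red, or green.
colors = re.compile("blue|red|green")
replaced = colors.sub("color", my_string)
print(replaced)

# Substring matching also hits "red" inside "bored".
surprise = colors.sub("color", "I am bored")
print(surprise)

# \b marks a word boundary, so only whole words match.
whole_words = re.compile(r"\b(?:blue|red|green)\b")
untouched = whole_words.sub("color", "I am bored")
print(untouched)
```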

10. A more interesting example

What if we wanted to identify all of the phone numbers in the string?

import re

phoneNumbersRegEx=re.compile(r'\d{11}')
print phoneNumbersRegEx.findall(myString)

Note that \d is a digit, and {11} matches 11 digits in a row.

findall will return a list of all substrings of myString that match the RegEx.

This is a start. Output: ['18005551234']

Also will need to know: “?” will match 0 or 1 repetitions of the previous element.

Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
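Both building blocks can be tried in isolation. This Python 3 sketch uses a shortened version of the example string. The "colou?r" pattern is an illustration added here, not from the slides.

```python
import re

my_string = ("My phone number is 8005551234 and my cell number is "
             "1-800-123-4567. You could also call me at 18005551234.")

# \d{11} needs eleven digits in a row, so only the last number qualifies.
eleven_digits = re.compile(r"\d{11}")
found = eleven_digits.findall(my_string)
print(found)

# "?" makes the preceding element optional: zero or one "u".
optional_u = re.compile(r"colou?r")
spellings = optional_u.findall("color colour colouur")
print(spellings)
```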

import re

phoneNumbersRegEx=re.compile(r'1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
print phoneNumbersRegEx.findall(myString)

Answer is here, but let’s derive it together.
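One way to read the pattern is piece by piece. The Python 3 sketch below spells out the same RegEx with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The piece-by-piece comments are an interpretation, not the derivation given in the lecture.

```python
import re

my_string = (
    "My phone number is 8005551234 and my friend's phone number is "
    "(800)-565-7568 and my cell number is 1-800-123-4567. "
    "You could also call me at 18005551234 if you'd like."
)

phone_numbers = re.compile(r"""
    1?        # optional leading 1
    -?        # optional dash
    \(?       # optional opening parenthesis
    \d{3}     # three digits (area code)
    \)?       # optional closing parenthesis
    -?        # optional dash
    \d{3}     # next three digits
    -?        # optional dash
    \d{4}     # last four digits
""", re.VERBOSE)

matches = phone_numbers.findall(my_string)
print(matches)
```

Because every separator is optional on its own, the pattern is permissive: it would also accept unbalanced input such as "(800-565-7568".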
