
Examples taken from: nltk.sourceforge.net/tutorial/introduction/index.html

Natural Language Toolkit


Overview

• The NLTK is a set of Python modules for carrying out many common natural language tasks.
• Access it at nltk.sourceforge.net
• There are versions for Windows, OS X, Unix, and Linux. Detailed instructions are on the Installation tab.
• In addition to the toolkit you will need two other modules: tkinter and Numeric. We haven’t been able to get Numeric to install smoothly with Python 2.4 under Windows, only with 2.3.
• You also want the contrib and data packages.
• Pay attention to what INSTALL.TXT in the data package says about the NLTK_CORPORA path.
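A quick sanity check before importing the toolkit can save some confusion. The sketch below reports whether NLTK_CORPORA is set and points at an existing directory. It assumes NLTK_CORPORA is an ordinary environment variable holding a directory path; the helper name corpora_path_status is hypothetical, and INSTALL.TXT in the data package remains the authority on what the value should be.

```python
import os

def corpora_path_status(environ=None):
    """Return a short message describing the NLTK_CORPORA setting.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Allow a custom mapping to be passed in so the check is easy to try out.
    environ = os.environ if environ is None else environ
    path = environ.get("NLTK_CORPORA")
    if not path:
        return "NLTK_CORPORA is not set"
    if not os.path.isdir(path):
        return "NLTK_CORPORA is set but is not a directory: " + path
    return "NLTK_CORPORA points at: " + path

print(corpora_path_status())
```

If the message says the variable is not set, revisit the installation instructions before trying the corpus examples.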

http://nltk.sourceforge.net/


Accessing NLTK

• Standard Python import command:

>>> from nltk.corpus import gutenberg
>>> gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt',
 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Or:

>>> import nltk.corpus
>>> nltk.corpus.gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt',
 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
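The two import forms differ only in the name you type afterwards; both reach the same object. The sketch below illustrates this with the standard library's os.path as a stand-in for nltk.corpus, an assumption made so that it runs without NLTK installed.

```python
# Form 1: bind the name directly, so it can be used unqualified.
from os import path

# Form 2: import the package path and use the fully qualified name.
import os.path

# Both names refer to the very same module object.
same_object = path is os.path
print(same_object)
```

The first form saves typing; the second keeps it obvious where each name comes from, which helps once several NLTK modules are in use.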


Modules

• The NLTK modules include:

– token: classes for representing and processing individual elements of text, such as words and sentences
– probability: classes for representing and processing probabilistic information
– tree: classes for representing and processing hierarchical information over text
– cfg: classes for representing and processing context-free grammars
– fsa: finite state automata
– tagger: tagging each word with a part-of-speech, a sense, etc.
– parser: building trees over text (includes chart, chunk and probabilistic parsers)
– classifier: classify text into categories (includes feature, featureSelection, maxent, naivebayes)
– draw: visualize NLP structures and processes
– corpus: access (tagged) corpus data

• We will cover some of these explicitly as we reach topics.


One Simple Example

IDLE 1.0.3
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT='Hello world. This is a test file.')
>>> print text_token
<Hello world. This is a test file.>
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> print text_token
<[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]>
>>> print text_token['TEXT']
Hello world. This is a test file.
>>> print text_token['WORDS']
[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]
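Note that the tokens <world.> and <file.> keep their trailing periods: splitting on whitespace alone does not separate punctuation from words. The sketch below reproduces that behaviour with plain Python's str.split(). It is an approximation of what the session above shows, not NLTK's own implementation.

```python
text = 'Hello world. This is a test file.'

# str.split() with no arguments splits on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
words = text.split()
print(words)
# ['Hello', 'world.', 'This', 'is', 'a', 'test', 'file.']
```

A tokenizer that treats punctuation as separate tokens would be needed to get 'world' and '.' as distinct items.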


LAB

• Detailed documentation and tutorials are under the Documentation tab at the SourceForge site.

• Work through the “gentle introduction” and “elementary language processing” tutorials on the NLTK site:

nltk.sourceforge.net/tutorial/introduction/index.html

