Examples taken from: nltk.sourceforge.net/tutorial/introduction/index.html
Natural Language Toolkit
Overview
• The NLTK is a set of Python modules that carry out many common natural language processing tasks.
• Access it at nltk.sourceforge.net
• There are versions for Windows, OS X, Unix, and Linux. Detailed instructions are on the Installation tab.
• In addition to the toolkit you will need two other modules: tkinter and Numeric. We haven’t been able to get Numeric to install smoothly with Python 2.4 under Windows, only with 2.3.
• You also want the contrib and data packages.
• Pay attention to what INSTALL.TXT in the data package says about the NLTK_CORPORA path.
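One way to check the corpus path before importing the toolkit is to read it from the environment. This is only a sketch: it assumes NLTK_CORPORA is an ordinary environment variable naming a directory, and the helper name corpora_dir is hypothetical, not part of NLTK. INSTALL.TXT remains the authority on how the path is actually used.

```python
import os

def corpora_dir(env=None):
    """Return the NLTK_CORPORA directory if it is set and exists, else None.

    Hypothetical helper for illustration; not an NLTK function.
    """
    env = os.environ if env is None else env
    path = env.get("NLTK_CORPORA")
    if path and os.path.isdir(path):
        return path
    return None

found = corpora_dir()
print(found if found else "NLTK_CORPORA is unset or does not name a directory")
```

If the message reports the variable as unset, set it as described in INSTALL.TXT and start a new Python session.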
Accessing NLTK
• Standard Python import command
>>> from nltk.corpus import gutenberg
>>> gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt',
 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
• Or
>>> import nltk.corpus
>>> nltk.corpus.gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt',
 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
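The two import forms give access to the same object; they differ only in which name is bound in your session. The sketch below shows this with a standard library package (os.path) rather than nltk.corpus, so that it runs without the toolkit or its data installed. Using os.path as the stand-in is an assumption made purely for illustration.

```python
# Form 1: bind one name from the package directly.
from os import path

# Form 2: import the package and reach the same object by its dotted name.
import os.path

# Both names refer to the very same module object.
same_object = path is os.path
print(same_object)
```

The first form is shorter to type in an interactive session; the second keeps the full dotted name visible, which helps when several packages define similar names.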
Modules
• The NLTK modules include:
– token: classes for representing and processing individual elements of text, such as words and sentences
– probability: classes for representing and processing probabilistic information
– tree: classes for representing and processing hierarchical information over text
– cfg: classes for representing and processing context-free grammars
– fsa: finite state automata
– tagger: tagging each word with a part-of-speech, a sense, etc.
– parser: building trees over text (includes chart, chunk and probabilistic parsers)
– classifier: classify text into categories (includes feature, featureSelection, maxent, naivebayes)
– draw: visualize NLP structures and processes
– corpus: access (tagged) corpus data
• We will cover some of these explicitly as we reach topics.
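To give a feel for the kind of "probabilistic information" the probability module deals with, here is a minimal sketch of a word frequency distribution. It deliberately uses only the Python 3 standard library (collections.Counter) and is not NLTK's own API; the sample sentence and variable names are made up for illustration.

```python
from collections import Counter

# A made-up sample text, split into words on whitespace.
words = "the cat sat on the mat".split()

# Count how often each word occurs.
counts = Counter(words)

# Turn raw counts into relative frequencies that sum to 1.
total = sum(counts.values())
rel_freq = {word: count / total for word, count in counts.items()}

print(counts.most_common(1))
print(rel_freq["the"])
```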
One Simple Example
IDLE 1.0.3
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT='Hello world. This is a test file.')
>>> print text_token
<Hello world. This is a test file.>
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> print text_token
<[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]>
>>> print text_token['TEXT']
Hello world. This is a test file.
>>> print text_token['WORDS']
[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]
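The session above splits the text wherever whitespace occurs, which is why the punctuation stays attached to "world." and "file.". The same idea can be sketched in plain Python 3 without the toolkit. Here a dictionary stands in for the Token object; that substitution, and the use of str.split, are assumptions made for illustration and not how NLTK implements WhitespaceTokenizer.

```python
# A plain dict stands in for the Token object (illustrative assumption).
text_token = {"TEXT": "Hello world. This is a test file."}

# str.split() with no arguments splits on runs of whitespace,
# so punctuation stays attached to the neighbouring word.
text_token["WORDS"] = text_token["TEXT"].split()

print(text_token["TEXT"])
print(text_token["WORDS"])
# ['Hello', 'world.', 'This', 'is', 'a', 'test', 'file.']
```

As in the NLTK session, the original text is kept under one key and the word list is added under another, so both remain available afterwards.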
LAB
• Detailed documentation and tutorials are under the Documentation tab at the SourceForge site.
• Work through the “gentle introduction” and “elementary language processing” tutorials on the NLTK site:
nltk.sourceforge.net/tutorial/introduction/index.html