NLTK: The Good, the Bad, and the Awesome

Jacob Perkins

• Python Text Processing with NLTK 2.0 Cookbook

• streamhacker.com

• weotta.com

• text-processing.com

• @japerk

The Good

• Makes NLProc easier and more accessible

• Python (great learning language)

• Lots of documentation (and 2 books!)

• Designed for training custom models

• Includes many training corpora

• Many algorithms to experiment with

The Bad

• NLProc is hard

• Few out-of-the-box solutions (see Pattern)

• Not designed for big data (see Mahout)

• Doesn’t have the latest algorithms (see scikit-learn)

• No online or active learning algorithms

More Bad

• Doesn’t play nice with pip or easy_install

• Python (Java has StanfordNLP, OpenNLP, GATE, Mahout)

• Models can use a lot of memory (& disk if pickled)

The Awesome

• Great for education and research

• Lots of users & active community

• Extensible interfaces

• Training algorithms span human languages

More Awesome

• Trained models can be very fast

• Well known algorithms can be very accurate

• NLTK-Trainer (train models with 0 code)

• Corpus bootstrapping

Some Numbers

• 3 Classification Algorithms

• 9 Part-of-Speech Tagging Algorithms

• Stemming Algorithms for 15 Languages

• 5 Word Tokenization Algorithms

• Sentence Tokenizers for 16 Languages

• 60 included corpora

Text-Processing.com

• NLTK Demos & APIs

• Sentiment Analysis

• Part-of-Speech Tagging & Chunking / NER

• Stemming

• Tokenization

Memory Usage (chart: text-processing.com)

CPU Usage (chart: text-processing.com)

NLTK-Trainer

• https://github.com/japerk/nltk-trainer

• 3 Training Command Scripts

‣ train_classifier.py

‣ train_tagger.py

‣ train_chunker.py

• Easy to tweak training parameters

• Duck-Typed corpus reading

Training Classifiers

• train_classifier.py movie_reviews --instances paras

• train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2

• train_classifier.py movie_reviews --instances paras --classifier MEGAM

• train_classifier.py movie_reviews --instances paras --cross-fold 10

• Pickled models are saved in ~/nltk_data/classifiers/
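
To make the `--ngrams 1 --ngrams 2` options concrete, here is a minimal sketch of bag-of-n-grams presence features built from unigrams and bigrams. It is plain Python rather than nltk-trainer's own code; the helper names `ngrams` and `bag_of_ngrams` are hypothetical, and the assumption is only that each n-gram becomes a True-valued feature.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, sizes=(1, 2)):
    """Map each n-gram of the requested sizes to True (a presence feature).

    Illustrative only: this mirrors the idea of combining unigram and
    bigram features, not the exact feature extraction in nltk-trainer.
    """
    feats = {}
    for n in sizes:
        for gram in ngrams(tokens, n):
            feats[gram] = True
    return feats


features = bag_of_ngrams("not a good movie".split())
# Bigrams such as ("not", "a") keep some word order that unigrams lose.
```

A feature dict like this is the shape of input that NLTK classifiers train on, paired with a label such as "pos" or "neg".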

Training Taggers

• train_tagger.py treebank

• train_tagger.py treebank --sequential ubt --brill

• train_tagger.py treebank --sequential '' --classifier NaiveBayes

• train_tagger.py mac_morpho --simplify_tags

• Pickled models are saved in ~/nltk_data/taggers/
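
The idea behind a sequential backoff chain can be sketched in plain Python, assuming `ubt` names a unigram, bigram, trigram chain in which each tagger defers to the previous one when it has not seen the context. The classes below are hypothetical stand-ins rather than NLTK's tagger classes, and the chain stops at bigrams to stay short.

```python
from collections import Counter, defaultdict


class DefaultTagger:
    """End of the chain: always answers with one fixed tag."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word, prev_tag):
        return self.tag


class UnigramTagger:
    """Most frequent training tag per word; defers to backoff if unseen."""
    def __init__(self, tagged_sents, backoff):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word, prev_tag):
        if word in self.table:
            return self.table[word]
        return self.backoff.tag_word(word, prev_tag)


class BigramTagger:
    """Most frequent tag per (previous tag, word); defers if unseen."""
    def __init__(self, tagged_sents, backoff):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            prev = None  # None marks the start of a sentence
            for word, tag in sent:
                counts[(prev, word)][tag] += 1
                prev = tag
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word, prev_tag):
        key = (prev_tag, word)
        if key in self.table:
            return self.table[key]
        return self.backoff.tag_word(word, prev_tag)


def tag_sentence(tagger, words):
    """Tag left to right, feeding each predicted tag to the next step."""
    tagged, prev = [], None
    for word in words:
        tag = tagger.tag_word(word, prev)
        tagged.append((word, tag))
        prev = tag
    return tagged


# Tiny hand-made training set, for illustration only.
train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")],
    [("a", "DT"), ("director", "NN"), ("will", "MD"), ("join", "VB")],
]
tagger = BigramTagger(
    train, backoff=UnigramTagger(train, backoff=DefaultTagger("NN"))
)
```

With this chain, `tag_sentence(tagger, ["will", "the", "Vinken"])` falls back to the unigram table for "will" and "the" (neither was seen in that context) and to the default tag for the unknown word "Vinken".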

Training Chunkers

• train_chunker.py treebank_chunk

• train_chunker.py treebank_chunk --classifier NaiveBayes

• train_chunker.py conll2000 --fileids train.txt

• Pickled models are saved in ~/nltk_data/chunkers/

Corpus Bootstrapping

• Guessing & correcting is easier than starting from scratch

• Use an existing model for initial guesses

• emoticons

‣ :) = “pos”

‣ :( = “neg”

• ratings

‣ 5 stars = “pos”

‣ 1 star = “neg”
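
The emoticon and rating heuristics above could be sketched roughly as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, and the choice to leave mixed or neutral cases unlabeled for manual correction is an assumption.

```python
def label_from_emoticon(text):
    """Guess a sentiment label from emoticons; None if absent or mixed."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None


def label_from_rating(stars):
    """Use only the extreme ratings as labels; skip everything between."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None


def bootstrap(texts):
    """Split texts into guessed (text, label) pairs and leftovers.

    The guessed pairs form an initial training corpus to correct by hand;
    the leftovers still need a label from a person or an existing model.
    """
    labeled, unlabeled = [], []
    for text in texts:
        label = label_from_emoticon(text)
        if label is None:
            unlabeled.append(text)
        else:
            labeled.append((text, label))
    return labeled, unlabeled


labeled, unlabeled = bootstrap([
    "great tacos :)",
    "slow service :(",
    "it was fine",
])
```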

Portuguese Phrase Extraction & Classification

• similar to condensr.com

• Brazilian Portuguese

• aspect classification is easy with training corpus

• need chunked corpus for phrase extraction

• use mac_morpho & nltk-trainer to train initial tagger

• part-of-speech tag annotation is time consuming

• simplified tags are much easier

• bracketed phrases w/out pos tags

treebank_chunk

[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.

Just Brackets

[ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
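
Going from the tagged treebank_chunk format to the brackets-only form is a small transformation. The sketch below assumes whitespace-separated tokens of the form word/TAG with standalone `[` and `]` markers; `strip_tags` is a hypothetical helper, not part of NLTK.

```python
def strip_tags(chunked):
    """Drop the /TAG suffix from each token, keeping chunk brackets."""
    out = []
    for token in chunked.split():
        if token in ("[", "]"):
            out.append(token)
        else:
            # Split on the last slash so words containing "/" survive;
            # a token with no slash at all is kept unchanged.
            word = token.rpartition("/")[0] or token
            out.append(word)
    return " ".join(out)


sample = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
brackets_only = strip_tags(sample)
```

Annotating in the other direction, brackets first and tags later, is the cheaper order when a trained tagger can supply the tags.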

NLP at Weotta

• Parsing & information extraction

• Text cleaning & normalization (more parsing)

• Text & keyword classification

• De-duplication

• Search indexing / IR

• Sentiment analysis

• Human integration
