nltk: the good, the bad, and the awesome
Presented by Jacob Perkins at the first meeting of the Bay Area NLP group: http://www.meetup.com/Bay-Area-NLP/events/16522295/
NLTK: The Good, the Bad, and the Awesome
Jacob Perkins
• Python Text Processing with NLTK 2.0 Cookbook
• streamhacker.com
• weotta.com
• text-processing.com
• @japerk
The Good
• Makes NLProc easier and more accessible
• Python (great learning language)
• Lots of documentation (and 2 books!)
• Designed for training custom models
• Includes many training corpora
• Many algorithms to experiment with
The Bad
• NLProc is hard
• Few out-of-the-box solutions (see Pattern)
• Not designed for big data (see Mahout)
• Doesn’t have the latest algorithms (see scikit-learn)
• No online or active learning algorithms
More Bad
• Doesn’t play nice with pip or easy_install
• Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)
• Models can use a lot of memory (& disk if pickled)
The Awesome
• Great for education and research
• Lots of users & active community
• Extensible interfaces
• Training algorithms span human languages
More Awesome
• Trained models can be very fast
• Well known algorithms can be very accurate
• NLTK-Trainer (train models with 0 code)
• Corpus bootstrapping
Some Numbers
• 3 Classification Algorithms
• 9 Part-of-Speech Tagging Algorithms
• Stemming Algorithms for 15 Languages
• 5 Word Tokenization Algorithms
• Sentence Tokenizers for 16 Languages
• 60 included corpora
Text-Processing.com
• NLTK Demos & APIs
• Sentiment Analysis
• Part-of-Speech Tagging & Chunking / NER
• Stemming
• Tokenization
Memory Usage (figure: text-processing.com)
CPU Usage (figure: text-processing.com)
NLTK-Trainer
• https://github.com/japerk/nltk-trainer
• 3 Training Command Scripts
‣ train_classifier.py
‣ train_tagger.py
‣ train_chunker.py
• Easy to tweak training parameters
• Duck-Typed corpus reading
Training Classifiers
• train_classifier.py movie_reviews --instances paras
• train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
• train_classifier.py movie_reviews --instances paras --classifier MEGAM
• train_classifier.py movie_reviews --instances paras --cross-fold 10
• Pickled models are saved in ~/nltk_data/classifiers/
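The `--ngrams 1 --ngrams 2` options above suggest training on both single words and word pairs. NLTK classifiers train on (feature dict, label) pairs, so a rough sketch of that kind of feature extraction might look like the following. The function name `bag_of_ngrams` is made up for illustration and is not taken from nltk-trainer; the real script may build its features differently.

```python
def bag_of_ngrams(words, ngram_sizes=(1, 2)):
    """Return a {feature: True} dict for every n-gram of the requested sizes.

    Hypothetical helper for illustration only, not nltk-trainer's own code.
    """
    features = {}
    for n in ngram_sizes:
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            # Unigrams are stored as plain strings, longer n-grams as tuples.
            features[gram[0] if n == 1 else gram] = True
    return features


feats = bag_of_ngrams("a truly great movie".split())
```

A dict like `feats`, paired with a label such as "pos", is the shape of training instance that NLTK's classifier `train` methods accept.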
Training Taggers
• train_tagger.py treebank
• train_tagger.py treebank --sequential ubt --brill
• train_tagger.py treebank --sequential '' --classifier NaiveBayes
• train_tagger.py mac_morpho --simplify_tags
• Pickled models are saved in ~/nltk_data/taggers/
Training Chunkers
• train_chunker.py treebank_chunk
• train_chunker.py treebank_chunk --classifier NaiveBayes
• train_chunker.py conll2000 --fileids train.txt
• Pickled models are saved in ~/nltk_data/chunkers/
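Since all three scripts save pickled models under ~/nltk_data/, loading one back could be sketched with the standard `pickle` module as below. The helper name `load_model` and the filename in the usage note are assumptions; the actual filenames depend on the corpus and algorithm used for training.

```python
import os
import pickle


def load_model(kind, filename, base_dir=None):
    """Load a pickled model from <base_dir>/<kind>/<filename>.

    kind is one of "classifiers", "taggers", or "chunkers".
    Hypothetical helper; base_dir defaults to ~/nltk_data.
    """
    base_dir = base_dir or os.path.expanduser("~/nltk_data")
    path = os.path.join(base_dir, kind, filename)
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `load_model("taggers", "my_tagger.pickle")` (hypothetical filename) would return the tagger object, provided NLTK is importable so the pickled classes can be resolved. Only unpickle files you created or trust.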
Corpus Bootstrapping
• Guess & correct is easier than starting from scratch
• Use an existing model for initial guesses
• emoticons
‣ :) = “pos”
‣ :( = “neg”
• ratings
‣ 5 stars = “pos”
‣ 1 star = “neg”
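The emoticon and rating heuristics above can be sketched as simple labeling functions that produce a first, noisy training corpus to correct by hand. The function names are illustrative, and skipping ambiguous texts (both emoticons, or a middling rating) is an assumption rather than something stated on the slide.

```python
def label_from_emoticon(text):
    """Return "pos" for :) and "neg" for :( ; None if absent or mixed."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None


def label_from_rating(stars):
    """Return "pos" for 5 stars, "neg" for 1 star, None otherwise."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None


def bootstrap(texts):
    """Keep only the texts that received an unambiguous emoticon label."""
    labeled = []
    for text in texts:
        label = label_from_emoticon(text)
        if label is not None:
            labeled.append((text, label))
    return labeled


corpus = bootstrap(["loved it :)", "so bad :(", "it was ok", "hmm :) :("])
```

Here `corpus` holds only the two unambiguous texts with their guessed labels; the rest are left for manual annotation.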
Portuguese Phrase Extraction & Classification
• similar to condensr.com
• Brazilian Portuguese
• aspect classification is easy with training corpus
• need chunked corpus for phrase extraction
• use mac_morpho & nltk-trainer to train initial tagger
• part-of-speech tag annotation is time consuming
• simplified tags are much easier
• bracketed phrases w/out pos tags
treebank_chunk
[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.
Just Brackets
[ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
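The relationship between the two formats can be shown with a small sketch that drops the part-of-speech tags from a treebank_chunk-style string and keeps only words and brackets. The function name `strip_pos_tags` is made up for illustration, and the sketch assumes tokens are whitespace-separated with the tag after the last "/".

```python
def strip_pos_tags(chunked):
    """Turn "[ the/DT board/NN ]" style text into "[ the board ]"."""
    tokens = []
    for token in chunked.split():
        if token in ("[", "]"):
            # Chunk brackets carry no tag and are kept as-is.
            tokens.append(token)
        else:
            # Split on the last "/" so a "/" inside the word survives.
            tokens.append(token.rsplit("/", 1)[0])
    return " ".join(tokens)


tagged = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
brackets_only = strip_pos_tags(tagged)
```

Going the other way, from brackets to a tagged and chunked corpus, is where an initial tagger comes in.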
NLP at Weotta
• Parsing & information extraction
• Text cleaning & normalization (more parsing)
• Text & keyword classification
• De-duplication
• Search indexing / IR
• Sentiment analysis
• Human integration