Text Mining Lab
Adrian and Shawndra
December 4, 2012 (version 1)
TRANSCRIPT
Outline
1. Download and Install Python
2. Download and Install NLTK
3. Download and Unzip Project Files
4. Simple Naïve Bayes Classifier
5. Demo: collecting tweets --> Evaluation
6. Other things you can do …
Download and Install Python
http://www.python.org/getit/ (latest version 2.7.3)
http://pypi.python.org/pypi/setuptools (install the setup tools for 2.7)
Download and Install NLTK
Install PyYAML: http://pyyaml.org/wiki/PyYAML
Install NUMPY: http://numpy.scipy.org/
Install NLTK: http://pypi.python.org/pypi/nltk
Install MatPlotLib: http://matplotlib.org/
Test Installation
Run python
At the prompt type
>> import nltk
>> import matplotlib
Downloading Models
>> nltk.download()
Open GUI downloader
Select “Models” tab and download:
maxent_ne_chunker
maxent_treebank_pos_tagger
hmm_treebank_pos_tagger
Select “Corpora” tab and download:
stopwords
Alternatively, select the “Collections” tab, click “all”, and click the button to download everything
Getting Started
Unzip project directory (lab1.zip)
Change to the lab1 directory
Open command window in the “lab1” directory
Windows 7 and later – Hold SHIFT; right-click in directory, select “Open command window here”
Unix/Mac – Open terminal; cd PATH/TO/lab1
Type “python” and then <enter> in terminal
>> import text_processing as tp
>> import nltk
Note: text_processing comes from your lab1 folder
Note: You must work from your lab1 directory
Simple NB Sentiment Classifier
Read in tweets
CALL
>> paths = ['neg_examples.txt', 'pos_examples.txt']
>> documentClasses = ['neg', 'pos']
>> tweetSet = [tp.loadTweetText(p) for p in paths]
SAMPLE OUTPUT
>> len(tweetSet[0]), len(tweetSet[1])
(20000, 40000)
>> tweetSet[1][50]
"@davidarchie hey david !me and my bestfriend are forming a band .could you give us any advice please? it's means a lot for us :)"
Read in tweets (Code)
Reads in a file, treating each line as a tweet and lower-casing the text.
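A minimal sketch of what such a loader might look like (the actual lab1 code may differ; the snake_case name here is our own):

```python
# Sketch of a tweet loader: one tweet per line, lower-cased,
# with blank lines skipped.
def load_tweet_text(path):
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]
```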
Tokenize
CALL
>> tokenSet = [tp.tokenizeTweets(tweets) for tweets in tweetSet]
SAMPLE OUTPUT
>> len(tokenSet[1][50])
31
>> tokenSet[1][50]
['@', 'davidarchie', 'hey', 'david', '!', 'me', 'and', 'my', 'bestfriend', 'are', 'forming', 'a', 'band', '.', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', '?', 'it', "'", 's', 'means', 'a', 'lot', 'for', 'us', ':)']
Tokenize (Code)
For each tweet, splits the text on whitespace and splits off punctuation into separate tokens
(nltk.WordPunctTokenizer)
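nltk.WordPunctTokenizer is essentially a regular expression; a minimal stand-in (our own sketch, not NLTK's code):

```python
import re

# Alternating runs of word characters and runs of punctuation
# become separate tokens, which is why ":)" survives as one token
# in the sample output above.
def word_punct_tokenize(text):
    return re.findall(r"\w+|[^\w\s]+", text)
```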
Filter out Non-English
CALL
>> englishSet = [tp.filterOnlyEnglish(tokens)
for tokens in tokenSet]
SAMPLE OUTPUT
>> len(englishSet[1][50])
22
>> englishSet[1][50]
['hey', 'david', 'me', 'and', 'my', 'are', 'forming', 'a', 'band', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', 'it', 'means', 'a', 'lot', 'for', 'us']
Filter out Non-English (Code)
Reads in a dictionary file of English words ("wordsEn.txt") and keeps only tokens that appear in that dictionary
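In sketch form (here the dictionary is an inline set rather than the wordsEn.txt file, and the name is ours):

```python
# Keep only tokens that appear in a given set of English words;
# anything else (usernames, punctuation, misspellings) is dropped.
def filter_only_english(tokens, english_words):
    return [t for t in tokens if t in english_words]
```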
Filter out Stopwords
CALL
>> noStopSet = [tp.removeStopwords(tokens,
[':)', ':(']) for tokens in englishSet]
SAMPLE OUTPUT
>> len(noStopSet[1][50])
12
>> noStopSet[1][50]
['hey', 'david', 'forming', 'band', 'could', 'give', 'us', 'advice', 'please', 'means', 'lot', 'us']
Filter out Stopwords (Code)
Loads a stop-word list and removes any tokens on it. Additional words can be treated as stop words via the
"addtlStopwords" argument
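A minimal version of this step (the tiny stop-word list here is illustrative; the lab loads a much larger list, and the names are ours):

```python
# Tiny sample stop-word list for illustration only.
STOPWORDS = {"a", "and", "are", "my", "me", "you", "it", "for"}

# Remove stop words, optionally extended with caller-supplied words
# (mirroring the addtlStopwords idea described above).
def remove_stopwords(tokens, addtl_stopwords=()):
    stop = STOPWORDS | set(addtl_stopwords)
    return [t for t in tokens if t not in stop]
```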
Stem
CALL
>> stemmedSet = [tp.stemTokens(tokens) for tokens in noStopSet]
SAMPLE OUTPUT
>> len(stemmedSet[1][50])
12
>> stemmedSet[1][50]
['hey', 'david', 'form', 'band', 'could', 'give', 'us', 'advic', 'pleas', 'mean', 'lot', 'us']
Stem (Code)
Loads a Porter stemmer implementation to remove suffixes from tokens. http://nltk.org/api/nltk.stem.html for more
information on NLTK's stemmers.
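The lab uses NLTK's real Porter stemmer; as a rough illustration of suffix stripping (far cruder than Porter's actual rules, and purely our own toy):

```python
# Toy suffix-stripper: drop a common suffix if the remaining stem
# would still be reasonably long. Porter's algorithm applies many
# ordered rules with measure conditions; this is only the idea.
def crude_stem(token):
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```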
Make Bags of Words
CALL
>> bagsOfWords = [tp.makeBagOfWords(
tokens, documentClass=docClass)
for docClass, tokens in zip(documentClasses,
stemmedSet)]
SAMPLE OUTPUT
>> bagsOfWords[1][50][0].items()
[('us', 2), ('advic', 1), ('band', 1), ('could', 1), ('david', 1), ('form', 1), ('give', 1), ('hey', 1), ('lot', 1), ('mean', 1), ('pleas', 1)]
Make Bags of Words (Code)
For each tweet, constructs a bag of words (FreqDist) that counts the number of times each token occurs. Setting the bigrams
argument to True will also include bigrams in the bags.
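In sketch form, using collections.Counter in place of NLTK's FreqDist (bigram support omitted; names are ours):

```python
from collections import Counter

# Pair each tweet's token counts with its class label, matching the
# (bag, class) structure seen in the sample output above.
def make_bag_of_words(token_lists, document_class):
    return [(Counter(tokens), document_class) for tokens in token_lists]
```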
Make Train and Test
CALL
>> trainSet, testSet = tp.makeTrainAndTest(
reduce(lambda x, y: x + y, bagsOfWords),
cutoff=0.9)
SAMPLE OUTPUT
>> len(trainSet), len(testSet)
(50697, 5633)
Make Train and Test (Code)
Given all of your examples, randomly selects a proportion cutoff of the examples for training and the remaining
1 - cutoff for testing.
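A minimal version of such a split (our own sketch; the lab's tp.makeTrainAndTest may differ, e.g. in how it shuffles):

```python
import random

# Shuffle the examples, then cut at proportion `cutoff`:
# the first part trains, the rest tests.
def make_train_and_test(examples, cutoff=0.9, seed=None):
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    split = int(len(shuffled) * cutoff)
    return shuffled[:split], shuffled[split:]
```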
Train Classifier
CALL
>> import nltk.classify.util
>> nbClassifier = tp.trainNBClassifier(trainSet,
testSet)
SAMPLE OUTPUT
>> nbClassifier.show_most_informative_features(n=20)
…..
Train Classifier (Code)
Trains a Naive Bayes classifier on the input training set, prints its accuracy on the test set, and prints
the most discriminating tokens.
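NLTK's NaiveBayesClassifier handles the details; to see the idea, here is a toy multinomial Naive Bayes with Laplace smoothing (our own sketch, which differs in detail from both the lab code and NLTK's implementation):

```python
import math
from collections import Counter, defaultdict

# Train on (bag-of-words, label) pairs; return a classify function
# that picks the label with the highest log-probability.
def train_nb(examples):
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for bag, label in examples:
        class_counts[label] += 1
        word_counts[label].update(bag)
        vocab.update(bag)
    total = sum(class_counts.values())

    def classify(bag):
        best, best_lp = None, float("-inf")
        for label in class_counts:
            lp = math.log(class_counts[label] / total)  # prior
            n = sum(word_counts[label].values())
            for word, count in bag.items():
                # Laplace (add-one) smoothing avoids zero probabilities.
                p = (word_counts[label][word] + 1) / (n + len(vocab))
                lp += count * math.log(p)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

    return classify
```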
Twitter Collection Demo
Directions for Collection Demo
Now, try out twitter_kw_stream.py to collect more tweets over a couple of different “classes”.
Some possible tokens (with high volume)
apple google cat dog pizza “ice cream”
Open a new terminal window in the same directory
For each keyword KW you search for, type:
python twitter_kw_stream.py --keywords=KW
Wait a minute or so (until you have retrieved about 100 tweets)
Accuracy given training data size
Assuming keywords searched for were:
apple google cat dog pizza “ice cream”
In the terminal already running the Python interpreter:
>> paths = ['apple.txt', 'google.txt', 'cat.txt', 'dog.txt',
'pizza.txt', 'ice cream.txt']
>> addtlStopwords = ['apple', 'google', 'cat', 'dog', 'pizza', 'ice', 'cream']
>> cutoffs, uniAccs, biAccs = tp.plotAccuracy(paths,
addtlStopwords=addtlStopwords)
If matplotlib is installed correctly, this should display the accuracy of the NB classifier while varying
the amount of training data, with and without bigrams. The plot is saved to “nbPlot.png”.
Other things you can do
Get Document Similarity
CALL
>> docs = tp.loadDocuments(paths)
>> sims = tp.getDocumentSimilarities(paths,
[p.replace('.txt', '') for p in paths])
SAMPLE OUTPUT
>> sims[('apple', 'dog')]
0.30735795122824466
>> sims[('apple', 'google')]
0.44204540065105324
Get Document Similarity (Code)
Calculates the cosine similarity for each pair of bags of words: the dot product of the two frequency vectors
after normalizing each to a unit vector.
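The calculation described above can be written directly (a sketch in our own names; the lab's tp code may differ in detail):

```python
import math

# Cosine similarity between two bags of words: dot product of the
# count vectors divided by the product of their lengths, i.e. the
# dot product after normalizing each vector to unit length.
def cosine_similarity(bag_a, bag_b):
    dot = sum(count * bag_b.get(word, 0) for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```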
Calculate TF-IDF
CALL
>> tfIdfs = tp.getTfIdfs(docs)
SAMPLE OUTPUT
>> for path, tfIdf in zip(paths, tfIdfs):
… print 'Top 10 TF-IDF for %s: %s' %(path,
'\n'.join([str(t) for t in tfIdf]))
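A sketch of the kind of computation a helper like tp.getTfIdfs presumably performs (the lab's exact weighting scheme may differ; the names here are ours):

```python
import math
from collections import Counter

# TF-IDF over a list of token lists (one "document" per keyword file):
# tf is the raw count in the document, idf is log(N / document
# frequency), so terms appearing in every document score zero.
def tf_idf(documents):
    n = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    scores = []
    for tokens in documents:
        tf = Counter(tokens)
        scores.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return scores
```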
Part-of-Speech Tag
CALL
>> posSet = [[tp.partOfSpeechTag(ts) for ts in
classTokens[:100]]
for classTokens in tokenSet]
SAMPLE OUTPUT
>> posSet[1][50]
[('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')]
Part-of-Speech Tag (Code)
Very simple: as long as you have a list of tokens, you can just call nltk.pos_tag(tokens) to tag them with parts of speech.
Find Named-Entities
CALL
>> neSet = [[tp.getNamedEntityTree(ts) for ts in
classTokens[:100]]
for classTokens in tokenSet]
SAMPLE OUTPUT
>> neSet[1][50]
Tree('S', [('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')])
Find Named-Entities (Code)
Similarly simple: just call two NLTK functions. Note, however, that the performance of the POS tagger and NE
chunker is quite poor on Twitter messages.