Text Mining Lab
Adrian and Shawndra
December 4, 2012 (version 1)
TRANSCRIPT
Outline
1. Download and Install Python
2. Download and Install NLTK
3. Download and Unzip Project Files
4. Simple Naïve Bayes Classifier
5. Demo: collecting tweets --> Evaluation
6. Other things you can do …
Download and Install Python
http://www.python.org/getit/ (latest version 2.7.3)
http://pypi.python.org/pypi/setuptools (install the setup tools for 2.7)
Download and Install NLTK
Install PyYAML: http://pyyaml.org/wiki/PyYAML
Install NUMPY: http://numpy.scipy.org/
Install NLTK: http://pypi.python.org/pypi/nltk
Install MatPlotLib: http://matplotlib.org/
Test Installation
Run python
At the prompt type
>> import nltk
>> import matplotlib
Downloading Models
>> nltk.download()
Open GUI downloader
Select “Models” tab and download:
maxent_ne_chunker
maxent_treebank_pos_tagger
hmm_treebank_pos_tagger
Select “Corpora” tab and download:
stopwords
Alternatively, select the “Collections” tab, click “all”, and click the button to download everything
Getting Started
Unzip project directory (lab1.zip)
Change to the lab1 directory
Open command window in the “lab1” directory
Windows 7 and later – Hold SHIFT; right-click in directory, select “Open command window here”
Unix/Mac – Open terminal; cd PATH/TO/lab1
Type “python” and then <enter> in terminal
>> import text_processing as tp
>> import nltk
Note: text_processing comes from your lab1 folder
Note: You must work from your lab1 directory
Simple NB Sentiment Classifier
Read in tweets
CALL
>> paths = ['neg_examples.txt', 'pos_examples.txt']
>> documentClasses = ['neg', 'pos']
>> tweetSet = [tp.loadTweetText(p) for p in paths]
SAMPLE OUTPUT
>> len(tweetSet[0]), len(tweetSet[1])
(20000, 40000)
>> tweetSet[1][50]
"@davidarchie hey david !me and my bestfriend are forming a band .could you give us any advice please? it's means a lot for us :)"
Read in tweets (Code)
Reads in a file, treating each line as a tweet and lower-casing the text.
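A minimal sketch of what such a loader might look like (the actual lab1 code may differ; the snake_case name here is our own):

```python
# Sketch of a tweet loader: one tweet per line, lower-cased,
# with blank lines skipped.
def load_tweet_text(path):
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]
```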
Tokenize
CALL
>> tokenSet = [tp.tokenizeTweets(tweets) for tweets in tweetSet]
SAMPLE OUTPUT
>> len(tokenSet[1][50])
31
>> tokenSet[1][50]
['@', 'davidarchie', 'hey', 'david', '!', 'me', 'and', 'my', 'bestfriend', 'are', 'forming', 'a', 'band', '.', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', '?', 'it', "'", 's', 'means', 'a', 'lot', 'for', 'us', ':)']
Tokenize (Code)
For each tweet, splits the text on whitespace and splits off punctuation into separate tokens
(nltk.WordPunctTokenizer)
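nltk.WordPunctTokenizer is essentially a regular expression; a minimal stand-in (our own sketch, not NLTK's code):

```python
import re

# Alternating runs of word characters and runs of punctuation
# become separate tokens, which is why ":)" survives as one token
# in the sample output above.
def word_punct_tokenize(text):
    return re.findall(r"\w+|[^\w\s]+", text)
```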
Filter out Non-English
CALL
>> englishSet = [tp.filterOnlyEnglish(tokens)
for tokens in tokenSet]
SAMPLE OUTPUT
>> len(englishSet[1][50])
22
>> englishSet[1][50]
['hey', 'david', 'me', 'and', 'my', 'are', 'forming', 'a', 'band', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', 'it', 'means', 'a', 'lot', 'for', 'us']
Filter out Non-English (Code)
Reads in a dictionary file of English words ("wordsEn.txt") and keeps only tokens that appear in that dictionary
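In sketch form (here the dictionary is an inline set rather than the wordsEn.txt file, and the name is ours):

```python
# Keep only tokens that appear in a given set of English words;
# anything else (usernames, punctuation, misspellings) is dropped.
def filter_only_english(tokens, english_words):
    return [t for t in tokens if t in english_words]
```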
Filter out Stopwords
CALL
>> noStopSet = [tp.removeStopwords(tokens,
[':)', ':(']) for tokens in englishSet]
SAMPLE OUTPUT
>> len(noStopSet[1][50])
12
>> noStopSet[1][50]
['hey', 'david', 'forming', 'band', 'could', 'give', 'us', 'advice', 'please', 'means', 'lot', 'us']
Filter out Stopwords (Code)
Loads a stop-word list and removes any tokens on it. Additional words can be treated as stop words via the
"addtlStopwords" argument
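A minimal version of this step (the tiny stop-word list here is illustrative; the lab loads a much larger list, and the names are ours):

```python
# Tiny sample stop-word list for illustration only.
STOPWORDS = {"a", "and", "are", "my", "me", "you", "it", "for"}

# Remove stop words, optionally extended with caller-supplied words
# (mirroring the addtlStopwords idea described above).
def remove_stopwords(tokens, addtl_stopwords=()):
    stop = STOPWORDS | set(addtl_stopwords)
    return [t for t in tokens if t not in stop]
```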
Stem
CALL
>> stemmedSet = [tp.stemTokens(tokens) for tokens in noStopSet]
SAMPLE OUTPUT
>> len(stemmedSet[1][50])
12
>> stemmedSet[1][50]
['hey', 'david', 'form', 'band', 'could', 'give', 'us', 'advic', 'pleas', 'mean', 'lot', 'us']
Stem (Code)
Loads a Porter stemmer implementation to remove suffixes from tokens. http://nltk.org/api/nltk.stem.html for more
information on NLTK's stemmers.
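The lab uses NLTK's real Porter stemmer; as a rough illustration of suffix stripping (far cruder than Porter's actual rules, and purely our own toy):

```python
# Toy suffix-stripper: drop a common suffix if the remaining stem
# would still be reasonably long. Porter's algorithm applies many
# ordered rules with measure conditions; this is only the idea.
def crude_stem(token):
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```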
Make Bags of Words
CALL
>> bagsOfWords = [tp.makeBagOfWords(
tokens, documentClass=docClass)
for docClass, tokens in zip(documentClasses,
stemmedSet)]
SAMPLE OUTPUT
>> bagsOfWords[1][50][0].items()
[('us', 2), ('advic', 1), ('band', 1), ('could', 1), ('david', 1), ('form', 1), ('give', 1), ('hey', 1), ('lot', 1), ('mean', 1), ('pleas', 1)]
Make Bags of Words (Code)
For each tweet, constructs a bag of words (FreqDist) that counts the number of times each token occurs. Setting the bigrams
argument to True will also include bigrams in the bags.
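In sketch form, using collections.Counter in place of NLTK's FreqDist (bigram support omitted; names are ours):

```python
from collections import Counter

# Pair each tweet's token counts with its class label, matching the
# (bag, class) structure seen in the sample output above.
def make_bag_of_words(token_lists, document_class):
    return [(Counter(tokens), document_class) for tokens in token_lists]
```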
Make Train and Test
CALL
>> trainSet, testSet = tp.makeTrainAndTest(
reduce(lambda x, y: x + y, bagsOfWords),
cutoff=0.9)
SAMPLE OUTPUT
>> len(trainSet), len(testSet)
(50697, 5633)
Make Train and Test (Code)
Given all of your examples, randomly selects a proportion cutoff of the examples for training and the remaining
1 - cutoff for testing.
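A minimal version of such a split (our own sketch; the lab's tp.makeTrainAndTest may differ, e.g. in how it shuffles):

```python
import random

# Shuffle the examples, then cut at proportion `cutoff`:
# the first part trains, the rest tests.
def make_train_and_test(examples, cutoff=0.9, seed=None):
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    split = int(len(shuffled) * cutoff)
    return shuffled[:split], shuffled[split:]
```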
Train Classifier
CALL
>> import nltk.classify.util
>> nbClassifier = tp.trainNBClassifier(trainSet,
testSet)
SAMPLE OUTPUT
>> nbClassifier.show_most_informative_features(n=20)
…..
Train Classifier (Code)
Trains a Naive Bayes classifier on the input training set, prints its accuracy on the test set, and prints
the most discriminating tokens.
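NLTK's NaiveBayesClassifier handles the details; to see the idea, here is a toy multinomial Naive Bayes with Laplace smoothing (our own sketch, which differs in detail from both the lab code and NLTK's implementation):

```python
import math
from collections import Counter, defaultdict

# Train on (bag-of-words, label) pairs; return a classify function
# that picks the label with the highest log-probability.
def train_nb(examples):
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for bag, label in examples:
        class_counts[label] += 1
        word_counts[label].update(bag)
        vocab.update(bag)
    total = sum(class_counts.values())

    def classify(bag):
        best, best_lp = None, float("-inf")
        for label in class_counts:
            lp = math.log(class_counts[label] / total)  # prior
            n = sum(word_counts[label].values())
            for word, count in bag.items():
                # Laplace (add-one) smoothing avoids zero probabilities.
                p = (word_counts[label][word] + 1) / (n + len(vocab))
                lp += count * math.log(p)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

    return classify
```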
Twitter Collection Demo
Directions for Collection Demo
Now, try out twitter_kw_stream.py to collect more tweets over a couple of different “classes”.
Some possible tokens (with high volume)
apple google cat dog pizza “ice cream”
Open a new terminal window in the same directory
For each keyword KW you search for, type:
python twitter_kw_stream.py --keywords=KW
Wait a minute or so (until you have retrieved about 100 tweets)
Accuracy given training data size
Assuming keywords searched for were:
apple google cat dog pizza “ice cream”
In the terminal already running the Python interpreter:
>> paths = ['apple.txt', 'google.txt', 'cat.txt', 'dog.txt',
'pizza.txt', 'ice cream.txt']
>> addtlStopwords = ['apple', 'google', 'cat', 'dog', 'pizza', 'ice', 'cream']
>> cutoffs, uniAccs, biAccs = tp.plotAccuracy(paths,
addtlStopwords=addtlStopwords)
If matplotlib is installed correctly, this should display the accuracy of the NB classifier while varying
the amount of training data, with and without bigrams. The plot is saved to “nbPlot.png”.
Other things you can do
Get Document Similarity
CALL
>> docs = tp.loadDocuments(paths)
>> sims = tp.getDocumentSimilarities(paths,
[p.replace('.txt', '') for p in paths])
SAMPLE OUTPUT
>> sims[('apple', 'dog')]
0.30735795122824466
>> sims[('apple', 'google')]
0.44204540065105324
Get Document Similarity (Code)
Calculates the cosine similarity for each pair of bags of words: the dot product of the two frequency vectors
after normalizing each to a unit vector.
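The calculation described above can be written directly (a sketch in our own names; the lab's tp code may differ in detail):

```python
import math

# Cosine similarity between two bags of words: dot product of the
# count vectors divided by the product of their lengths, i.e. the
# dot product after normalizing each vector to unit length.
def cosine_similarity(bag_a, bag_b):
    dot = sum(count * bag_b.get(word, 0) for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```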
Calculate TF-IDF
CALL
>> tfIdfs = tp.getTfIdfs(docs)
SAMPLE OUTPUT
>> for path, tfIdf in zip(paths, tfIdfs):
… print 'Top 10 TF-IDF for %s: %s' %(path,
'\n'.join([str(t) for t in tfIdf]))
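A sketch of the kind of computation a helper like tp.getTfIdfs presumably performs (the lab's exact weighting scheme may differ; the names here are ours):

```python
import math
from collections import Counter

# TF-IDF over a list of token lists (one "document" per keyword file):
# tf is the raw count in the document, idf is log(N / document
# frequency), so terms appearing in every document score zero.
def tf_idf(documents):
    n = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    scores = []
    for tokens in documents:
        tf = Counter(tokens)
        scores.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return scores
```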
Part-of-Speech Tag
CALL
>> posSet = [[tp.partOfSpeechTag(ts) for ts in
classTokens[:100]]
for classTokens in tokenSet]
SAMPLE OUTPUT
>> posSet[1][50]
[('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')]
Part-of-Speech Tag (Code)
Very simple: as long as you have a list of tokens, you can just call nltk.pos_tag(tokens) to tag them with parts of speech.
Find Named-Entities
CALL
>> neSet = [[tp.getNamedEntityTree(ts) for ts in
classTokens[:100]]
for classTokens in tokenSet]
SAMPLE OUTPUT
>> neSet[1][50]
Tree('S', [('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')])
Find Named-Entities (Code)
Similarly simple: just call two NLTK functions. Note, however, that the performance of the POS tagger and NE
chunker is quite poor on Twitter messages.