KiwiPyCon 2014 - NLP with Python tutorial
TRANSCRIPT
Understanding human language with Python
Alyona Medelyan
Who am I?
Alyona Medelyan
▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the state-of-the-art keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development
aka @zelandiya
Pre-tutorial survey results
[Charts: programming and Python experience of attendees, from beginners to experts]
85% no experience with NLP, general interest
Agenda
State of NLP
Recap on fiction vs reality: Are we there yet?
NLP Complexities
Why is understanding language so complex?
NLP using Python
Learning the basics, applying them, expanding into further topics
Other NLP areas
And what’s coming next
State of NLP
Fiction versus Reality
He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly
humorous personality.” - Wikipedia
Android Auto: “hands-free operation through voice commands
will be emphasized to ensure safe driving”
“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker
Wiki)
Word Lens: “augmented reality translation”
The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to
understand and execute vocal natural language commands (From Memory Alpha Wiki)
Let’s try out Google
It doesn’t always work… (the person searched for “Steve Jobs”)
“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”
Siri doesn’t seem to be as “available”
NLP Complexities
Why is understanding language so complex?
Last week's GDP figures, which were 0.8% for the March quarter (average forecast was 0.4%) and included a revision of the December quarter figures from 0.2% to 0.5%... That takes away the rationale for the OCR to remain at stimulatory levels. It is currently at 2.5%.
Also, in fighting inflation, Dr. Bollard has one rather tricky ally - the exchange rate, which hit a record 85USc last week in N.Z. Running at that level, the currency is keeping imported inflation at low levels.
Sentence detection complexities
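Why the passage above is hard to split can be shown with a small sketch. It is written for Python 3 with the standard library only (unlike the Python 2 slides), and the abbreviation list is a hand-picked assumption, not part of the tutorial.

```python
import re

TEXT = ("Also, in fighting inflation, Dr. Bollard has one rather tricky ally - "
        "the exchange rate, which hit a record 85USc last week in N.Z. "
        "Running at that level, the currency is keeping imported inflation "
        "at low levels.")

def naive_split(text):
    # Split after every full stop that is followed by whitespace.
    return [s for s in re.split(r'(?<=\.)\s+', text) if s]

# Hypothetical abbreviation list; a real system needs a far larger list
# or a trained model.
ABBREVIATIONS = {'Dr.', 'Mr.', 'Mrs.', 'e.g.', 'i.e.'}

def split_with_abbreviations(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # End a sentence at a full stop unless the token is a known abbreviation.
        if token.endswith('.') and token not in ABBREVIATIONS:
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

naive = naive_split(TEXT)
better = split_with_abbreviations(TEXT)
print(len(naive), 'sentences (naive):', naive[0])
print(len(better), 'sentences (with abbreviations)')
```

The naive version breaks after "Dr.", while the second one keeps "Dr. Bollard" together. Note that "N.Z." really does end a sentence here, so simply adding it to the abbreviation list would hide a true boundary.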
Word segmentation complexities
▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。
  [roughly: "The vast majority of developing countries unanimously support this goal and have put forward their own detailed expectations."]
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
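For text written without spaces, one classic baseline is greedy "maximum matching" against a word list. The sketch below (Python 3, standard library only) applies it to the first clause of the Chinese example; both toy dictionaries are assumptions made up for illustration.

```python
def max_match(text, dictionary, max_len=5):
    """Greedy left-to-right segmentation: always take the longest known word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if piece in dictionary or size == 1:
                words.append(piece)
                i += size
                break
    return words

clause = '广大发展中国家一致支持这个目标'
# Hypothetical toy dictionaries.
full_dictionary = {'广大', '发展', '中国', '发展中国家', '一致', '支持', '这个', '目标'}
small_dictionary = full_dictionary - {'发展中国家'}

good = max_match(clause, full_dictionary)
bad = max_match(clause, small_dictionary)
print(' / '.join(good))
print(' / '.join(bad))
```

With the full word list the clause contains 发展中国家 ("developing countries"); without that entry the same characters come out as 发展 / 中国 / 家 ("develop / China / home"), which shows how much the result depends on the dictionary.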
Disambiguation complexities
Flying planes can be dangerous
Sentiment complexities
from: http://www.sentic.net/tutorial/
NLP using Python
Learning the basics, applying them, expanding into further topics
import sys
import pocketsphinx

if __name__ == "__main__":
    hmdir = "/usr/share/pocketsphinx/model/hmm/wsj1"
    lmdir = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.lm.DMP"
    dictd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic"
    wavfile = sys.argv[1]
    speechRec = pocketsphinx.Decoder(hmm = hmdir, lm = lmdir, dict = dictd)
    wavFile = file(wavfile,'rb')
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()
    print result
Speech recognition with Python
Using CMU Sphinx
http://www.confusedcoders.com/random/speech-recognition-in-python-with-cmu-pocketsphinx
What can we do with text?
[Diagram: text documents as input; possible outputs are sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, … A second version of the slide marks the practical part of this tutorial.]
Introducing NLTK – Python platform for NLP
Setting up
Clone or Download ZIP: https://github.com/zelandiya/KiwiPyCon-NLP-tutorial
Working with corpora in NLTK
>>> from nltk.corpus import movie_reviews
>>> print len(movie_reviews.fileids())
>>> print movie_reviews.categories()
>>> print movie_reviews.fileids('neg')[:10]
>>> print movie_reviews.fileids('pos')[:10]
>>> print movie_reviews.words('pos/cv000_29590.txt')
>>> print movie_reviews.raw('pos/cv000_29590.txt')
>>> print movie_reviews.sents('pos/cv000_29590.txt')
NLTK Corpus – basic functionality
Getting to know text: Word frequencies
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

words = movie_reviews.words('pos/cv000_29590.txt')
freqs = FreqDist(words)

# with NLTK 2.x, items() is sorted by decreasing frequency
print 'Most frequent words in review', freqs.items()[:20]

for category in movie_reviews.categories():
    print 'Category', category
    all_words = movie_reviews.words(categories=category)
    all_words_by_frequency = FreqDist(all_words)
    print all_words_by_frequency.items()[:20]
Output of “frequent words”
Most frequent words in review [('the', 46), (',', 43), ("'", 25), ('.', 23), ('and', 21), ...
Category neg
[(',', 35269), ('the', 35058), ('.', 32162), ('a', 17910), ...
Category pos
[(',', 42448), ('the', 41471), ('.', 33714), ('a', 20196), ...
How to get to the core words?
even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance
i think that from hell has a pretty solid acting, especially with the dreamy depp turning in a strong performance as he usually does
* “from hell” is the title of the movie, so stopword removal alone will not be sufficient to process this example correctly
Remove Stopwords!
Stopword removal with NLTK
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords

stop = stopwords.words('english')

words = movie_reviews.words('pos/cv000_29590.txt')
no_stops = [word for word in words if word not in stop]
NLTK Stopwords: Before & After
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']
['films', 'adapted', 'comic', 'books', 'plenty', 'success', ',', 'whether', "'", 're', 'superheroes', '(', 'batman', ',']
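The “from hell” footnote earlier points at a limit of plain stopword removal. One possible workaround, sketched here in Python 3 with the standard library only, is to protect known multi-word phrases before filtering. The stopword list and the phrase list are tiny, hypothetical stand-ins and not NLTK's.

```python
# Hypothetical stopword and phrase lists, for illustration only.
STOPWORDS = {'even', 'the', 'in', 'from', 'is', 'with', 'a'}
PROTECTED_PHRASES = [('from', 'hell')]

def remove_stopwords(words, stopwords, protected=()):
    kept, i = [], 0
    while i < len(words):
        # Check whether a protected phrase starts at position i.
        match = next((p for p in protected
                      if tuple(words[i:i + len(p)]) == p), None)
        if match:
            kept.append(' '.join(match))
            i += len(match)
        else:
            if words[i] not in stopwords:
                kept.append(words[i])
            i += 1
    return kept

review = ('even the acting in from hell is solid , with the dreamy depp '
          'turning in a typically strong performance').split()

plain = remove_stopwords(review, STOPWORDS)
with_titles = remove_stopwords(review, STOPWORDS, PROTECTED_PHRASES)
print(plain)
print(with_titles)
```

Plain filtering leaves only "hell", whereas the protected version keeps "from hell" as one unit. Where such a phrase list would come from (a movie title database, for instance) is left open here.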
Part of speech tagging & filtering
import nltk
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

words = movie_reviews.words('pos/cv000_29590.txt')
pos = nltk.pos_tag(words)

# keep only singular nouns (NN) and adjectives (JJ)
filtered_words = [x[0] for x in pos if x[1] in ('NN', 'JJ')]

print FreqDist(filtered_words).items()[:20]
POS tagging & filtering results
[('films', 'NNS'), ('adapted', 'VBD'), ('from', 'IN'), ('comic', 'JJ'), ('books', 'NNS'), ('have', 'VBP'), ('had', 'VBN'), ('plenty', 'NN'), ('of', 'IN'), ('success', 'NN')
[('t', 9), ('comic', 5), ('film', 5), ('hell', 5), ('book', 3), ('campbell', 3), ('don', 3), ('ripper', 3), ('abberline', 2), ('accent', 2), ('depp', 2), ('end', 2),
From Single to Multi-Word Phrases
NEJM usually has the highest impact factor of the journals of clinical medicine.
highest, highest impact, highest impact factor
ignore stopwords
Option 1. Ngrams
Option 2. Chunking / POS patterns
from http://www.nltk.org/book/ch07.html#chap-chunk
Ngram extraction with NLTK
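from nltk.util import ngrams

# acceptable() and has_no_boundaries() are helper functions not shown on the slide
my_ngrams = []
for n in range(2, 5):
    for gram in ngrams(words, n):
        if acceptable(gram[0]) \
           and acceptable(gram[-1]) \
           and has_no_boundaries(gram):
            phrase = ' '.join(gram)
            my_ngrams.append(phrase)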
[("' s", 11), ("' t", 10), (', but', 6), ("don '", 5), ("don ' t", 5), ('from hell', 5)
[('comic book', 2), ('jack the ripper', 2), ('moore and campbell', 2), ('say moore', 2),
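The n-gram slide relies on two helpers that are not shown. A minimal guess at what they might do, written in Python 3 with the standard library only: `acceptable`, `has_no_boundaries`, the local `ngrams` function and the stopword list are all assumptions for illustration, not the tutorial's actual code.

```python
# Hypothetical stopword list; the slides use NLTK's English stopwords.
STOP = {'the', 'a', 'of', 'and', 'in', 'is', 'with', 'from', 'has', 'by'}

def ngrams(words, n):
    """Yield every run of n consecutive words as a tuple."""
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def acceptable(word):
    """A phrase may start or end with this word: alphabetic, not a stopword."""
    return word.isalpha() and word not in STOP

def has_no_boundaries(gram):
    """True if the n-gram does not cross punctuation."""
    return all(word.isalpha() for word in gram)

def extract_phrases(words, min_n=2, max_n=4):
    phrases = []
    for n in range(min_n, max_n + 1):
        for gram in ngrams(words, n):
            if acceptable(gram[0]) and acceptable(gram[-1]) \
                    and has_no_boundaries(gram):
                phrases.append(' '.join(gram))
    return phrases

words = 'jack the ripper , the comic book by moore and campbell'.split()
phrases = extract_phrases(words)
print(phrases)
```

Stopwords are allowed inside a phrase ("jack the ripper") but not at its edges, and nothing is allowed to span the comma.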
Corpus statistics: TFxIDF
TFxIDF with Gensim
from nltk.corpus import movie_reviews
from gensim import corpora, models

texts = []
for fileid in movie_reviews.fileids():
    words = movie_reviews.words(fileid)
    texts.append(words)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    id = dictionary.token2id.get(word)
    print word, id, tfidf.idfs[id]
TFxIDF with Gensim: Results
film 124 0.190174003903
movie 207 0.364013496254
comedy 653 1.98564470702
violence 1382 3.2108967825
jolie 9418 6.96578428466
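The IDF values above appear consistent with idf = log2(N / df), where N is the number of documents and df the number of documents containing the word. For example, 6.9658 is what a word found in 16 of the corpus's 2000 reviews would get. A minimal TFxIDF sketch under that assumption, in Python 3 with the standard library only and a made-up toy corpus:

```python
import math
from collections import Counter

def inverse_document_frequency(doc_freq, total_docs):
    # log2(N / df): rare words get a high weight, ubiquitous words a low one.
    return math.log2(total_docs / doc_freq)

def tfidf_scores(doc, corpus):
    """Score each distinct word of doc (which must be part of corpus)."""
    total_docs = len(corpus)
    scores = {}
    for word, count in Counter(doc).items():
        doc_freq = sum(1 for other in corpus if word in other)
        tf = count / len(doc)
        scores[word] = tf * inverse_document_frequency(doc_freq, total_docs)
    return scores

# Toy corpus, made up for illustration.
corpus = [
    ['film', 'about', 'violence'],
    ['film', 'comedy', 'film'],
    ['film', 'about', 'comedy'],
    ['movie', 'film'],
]
scores = tfidf_scores(corpus[0], corpus)
jolie_idf = inverse_document_frequency(16, 2000)
print(scores)
print(jolie_idf)
```

A word that occurs in every document ("film" in the toy corpus) scores zero, which is why very common words drop out of the keyword ranking.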
NLP using Python
Learning the basics, applying them, expanding into further topics
How a keyword extraction algorithm works
Document -> Candidates -> Properties -> Scoring -> Keywords

Candidates:
  Slide window
  Break at stopwords & punctuation
  Normalize
  Map to vocabulary (optional)
  Disambiguate (optional)

Properties. Calculate:
  Frequency of occurrences
  Position in the document
  Phrase length
  Similarity to other candidates
  Prominence in this particular text
  Part of speech pattern
  Is it a popular keyword?

Scoring:
  Heuristic formula that combines the most powerful properties
  OR
  Supervised machine learning that learns the importance of properties from manually assigned keywords
Candidates extraction in Python
def get_candidates(words, stop):
    filtered_words = [word for word in words if word not in stop and word[0].isalpha()]
    text_ngrams = get_ngrams(words, stop)
    return filtered_words + text_ngrams
Candidate scoring in Python
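import operator
from nltk.probability import FreqDist

def score_candidates(candidates, dictionary, tfidf):
    scores = {}
    freqs = FreqDist(candidates)
    for word in set(candidates):
        # term frequency, normalized by the number of distinct candidates
        tf = float(freqs[word]) / len(freqs)
        id = dictionary.token2id.get(word)
        # 0 is a valid token id, so compare with None instead of testing truthiness
        if id is not None:
            idf = tfidf.idfs[id]
        else:
            idf = 0
        scores[word] = tf*idf
    return sorted(scores.iteritems(), key=operator.itemgetter(1), reverse = True)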
Test keywords extractor
bellboy
jennifer beals
four rooms
beals
rooms
tarantino
madonna
antonio banderas
valeria golino
…four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy , and he ruins every joke in the film …
Analysis of the results
• Remove sub-phrases in favour of higher ranked ones
• Score adjectives & adverbs higher using part-of-speech tagging
• Add stemming
• …
neg/cv480_21195.txt fight club, club, fight, se7en and the game, inter - office, inter - office politics, tyler, office politics, politics, woven, inter, befuddled
neg/cv235_10704.txt babysitter, goal of the babysitter, thug, boyfriend, goal, fails, fantasizes, dream sequences, silverstone, dream
neg/cv248_15672.txt vampires, vampire, rude, suggestion, regressive movie
neg/cv136_12384.txt lost in space, robinson, robinsons, story changes, cartoony
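The first suggested improvement, removing sub-phrases in favour of higher ranked ones, could look roughly like this. It is a sketch in Python 3 with the standard library only; the function names are hypothetical and the input is the ranked list shown for neg/cv480_21195.txt.

```python
def contains_phrase(longer, shorter):
    """True if the word sequence `shorter` occurs inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))

def remove_subphrases(ranked_keywords):
    """Drop a keyword if a higher ranked keyword already contains it."""
    kept = []
    for keyword in ranked_keywords:
        words = keyword.split()
        if not any(contains_phrase(k.split(), words) for k in kept):
            kept.append(keyword)
    return kept

ranked = ['fight club', 'club', 'fight', 'se7en and the game',
          'inter - office', 'inter - office politics', 'tyler',
          'office politics', 'politics', 'woven', 'inter', 'befuddled']
filtered = remove_subphrases(ranked)
print(filtered)
```

Because only higher ranked phrases can suppress lower ranked ones, "inter - office" survives next to "inter - office politics"; a stricter rule would need to look in both directions.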
Getting insights from text!
Which actors, directors, movie plots and film qualities make a successful movie?
1. Apply candidate extraction on each review (to initialize TFxIDF scorer)
2. Extract common keywords from positive and negative reviews
Insights – Step 1
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist
from basics_applied import keyword_extractor

candidate_extractor = keyword_extractor.CandidateExtractor()

texts = []
texts_ids = {}
count = 0
for fileid in movie_reviews.fileids():
    words = candidate_extractor.run(movie_reviews.words(fileid))
    texts.append(words)
    texts_ids[fileid] = count
    count += 1
Insights – Step 2
# candidate_scorer is created from the texts of step 1 (not shown on the slide)
for category in movie_reviews.categories():
    print 'Category', category
    category_keywords = []
    for fileid in movie_reviews.fileids(categories=category):
        count = texts_ids[fileid]
        candidates = texts[count]
        keywords = candidate_scorer.run(candidates)[:20]
        for keyword in keywords:
            category_keywords.append(keyword[0])
            # multi-word keywords are added a second time, so they count double
            if ' ' in keyword[0]:
                category_keywords.append(keyword[0])
    cat_keywords_by_frequency = FreqDist(category_keywords)
    print cat_keywords_by_frequency.items()[:50]
Our insights
Negative               Positive
van damme 16           star wars 26
zeta - jones 16        disney 23
smith 15               war 23
batman 14              de niro 22
de palma 14            jackie 21
eddie murphy 14        alien 20
killer 14              jackie chan 20
tommy lee jones 14     private ryan 20
wild west 14           truman show 20
mars 13                ben stiller 18
murphy 13              cameron 18
ship 13                science fiction 18
space 13               cameron diaz 16
brothers 12            fiction 16
de bont 12             jack 16
...                    ...
NLP using Python
Learning the basics, applying them, expanding into further topics
Text Categorization
textanddatamining.blogspot.co.nz/2011/07/svm-classification-intuitive.html
Entertainment
Politics
TVNZ: “Obama and Hangover star trade insults in interview”
Categorization vs Keyword Extraction
[Diagram: tasks arranged along two dimensions.
Source of terminology: vocabulary, document, any.
Number of topics: very few, main topics only, domain-relevant, all possible.
Tasks shown: text categorization, term assignment, keyword assignment, tagging, keyword extraction, terminology extraction, topic modeling, full-text indexing.]
Text Classification with Python
import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

def document_features(doc_words):
    # one boolean feature per frequent word: does the document contain it?
    doc_words = set(doc_words)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in doc_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[1000:], featuresets[:1000]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))
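What a Naive Bayes classifier does with such "contains(word)" features can be sketched in a few lines. This is a simplified version in Python 3 with the standard library only; it uses add-one smoothing and made-up training data, and it is not NLTK's implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs, vocabulary):
    """Estimate P(label) and P(word present | label) with add-one smoothing."""
    label_counts = Counter(label for _, label in labelled_docs)
    present = defaultdict(Counter)  # label -> word -> docs containing the word
    for words, label in labelled_docs:
        for word in set(words) & vocabulary:
            present[label][word] += 1
    model = {}
    for label, n_docs in label_counts.items():
        prior = n_docs / len(labelled_docs)
        likelihood = {w: (present[label][w] + 1) / (n_docs + 2)
                      for w in vocabulary}
        model[label] = (prior, likelihood)
    return model

def classify(model, words, vocabulary):
    words = set(words)
    best_label, best_score = None, float('-inf')
    for label, (prior, likelihood) in model.items():
        # Sum log-probabilities to avoid numeric underflow.
        score = math.log(prior)
        for w in vocabulary:
            p = likelihood[w]
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, for illustration only.
training = [
    ('a solid and strong performance'.split(), 'pos'),
    ('dreamy and entertaining film'.split(), 'pos'),
    ('bad acting and no plot'.split(), 'neg'),
    ('the plot makes no sense'.split(), 'neg'),
]
vocabulary = {'solid', 'strong', 'entertaining', 'bad', 'plot', 'sense'}
model = train_naive_bayes(training, vocabulary)
prediction = classify(model, 'strong and entertaining'.split(), vocabulary)
print(prediction)
```

Each feature is treated as independent given the label, which is the "naive" part; the absence of a word counts as evidence too.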
Classify new reviews using NLTK
from nltk import word_tokenize

# from http://www.imdb.com/title/tt2209764/reviews?ref_=tt_urv
transcendence = ['../data/transcendence_1star.txt',
                 '../data/transcendence_5star.txt',
                 '../data/transcendence_8star.txt',
                 '../data/transcendence_great.txt']

# this time train on all labelled reviews
classifier = nltk.NaiveBayesClassifier.train(featuresets)

for review in transcendence:
    f = open(review)
    raw = f.read()
    document = word_tokenize(raw)
    features = document_features(document)
    print review, classifier.classify(features)
Sentiment analysis with TextBlob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
print blob.sentiment
# Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)

blob = TextBlob("I love this library")
print blob.sentiment
# Sentiment(polarity=0.5, subjectivity=0.6)
Sentiment Categorization with TextBlob
for review in transcendence:
    f = open(review)
    raw = f.read()
    blob = TextBlob(raw)
    sentiment = blob.sentiment
    if sentiment.polarity > 0.20:
        print review, 'pos', round(sentiment.polarity, 3), round(sentiment.subjectivity, 3)
    else:
        print review, 'neg', round(sentiment.polarity, 3), round(sentiment.subjectivity, 3)
../data/transcendence_1star.txt neg 0.017 0.502
../data/transcendence_5star.txt neg 0.087 0.51
../data/transcendence_8star.txt pos 0.257 0.494
../data/transcendence_great.txt pos 0.304 0.528
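The threshold rule above can be illustrated with a toy lexicon-based scorer. This is a rough sketch in Python 3 with the standard library only; the mini-lexicon and both functions are assumptions made up for illustration and do not reproduce how TextBlob computes polarity.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0].
LEXICON = {'love': 0.5, 'great': 0.8, 'good': 0.7,
           'boring': -1.0, 'bad': -0.7, 'terrible': -1.0}

def polarity(text):
    """Average polarity of the words found in the lexicon (0.0 if none)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def categorize(text, threshold=0.20):
    # Same decision rule as on the slide: only scores above the threshold are 'pos'.
    return 'pos' if polarity(text) > threshold else 'neg'

for sample in ['I love this library',
               'good idea but boring and bad film',
               'no opinion here']:
    print(sample, '->', categorize(sample), round(polarity(sample), 3))
```

With a threshold of 0.20, neutral or mildly positive text lands in the 'neg' bucket, which matches the 1-star and 5-star reviews above being labelled 'neg' despite slightly positive polarity.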
Sentiment analysis: Aspects
http://www.sentic.net/tutorial/
Topic modeling
http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
Insights through Topic Modeling with GenSim
import basics_applied.keyword_extractor
from nltk.corpus import movie_reviews
from gensim import corpora, models

candidate_extractor = basics_applied.keyword_extractor.CandidateExtractor()

for category in movie_reviews.categories():
    texts = []
    for fileid in movie_reviews.fileids(category):
        words = movie_reviews.words(fileid)
        # list.append() returns None, so its result is not assigned
        texts.append(candidate_extractor.run(words, 2))

    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=10, no_above=0.1, keep_n=10000)
    corpus = [dictionary.doc2bow(text) for text in texts]

    print 'Category', category
    print 'LDA'
    lda = models.ldamodel.LdaModel(corpus, id2word=dictionary)

    print 'HDP'
    model = models.hdpmodel.HdpModel(corpus, id2word=dictionary)
Insights
Negative
topic 0: acting ability + battle scenes + pretty much + mission to mars + natasha henstridge + live action + ve never + freddie prinze jr
topic 1: bad acting + naked gun + lead role + close - ups + antonio banderas + johnny depp + nothing else + kind of movie + wild wild west
topic 2: salma hayek + woody allen + pulp fiction + next time + make sense + make a movie + target audience + opening sequence
topic 3: subject matter + horror movie + first one + anyone else + throughout the movie + granger movie + end credits + never seen
topic 4: million dollars + ll see + deep impact + de palma + watching the film + granger movie gauge + didn ' t like + makes no sense

Positive
topic 0: martin scorsese + soap opera + fbi agent + old man + first thing + doesn ' t make + entertaining film + first - time + doesn ' t know
topic 1: stanley kubrick + matt dillon + film i ' ve + time period + film like + last two + computer animation + men and women + whole film
topic 2: action film + good and evil + star trek + usual suspects + soon becomes + written and directed + time period + new york + first movie
topic 3: julianne moore + feature film + tom cruise + doesn ' t want + real people + much better + action sequences + see the movie
topic 4: re looking + soap opera + austin powers + edward norton + entertaining film + well enough + old - fashioned + animated feature
LDA: Practical application
Sweaty Horse Blanket: Processing the Natural Language of Beer
by Ben Fields
1. Keyword extraction
2. TFxIDF scoring
3. LDA
Other NLP areas
What’s coming next?
From Strings to Concepts
Precc is a new compiler-compiler that is much more versatile than yacc.
[Diagram labels: most likely, less likely, unlikely; "tool" marked ✓]
From Concepts to Facts
Applying the Semantic Web technology
▪ Show all politicians, their birth date and gender, mentioned in the document collection and in which documents they appear
Name               Birth date    Gender
Al Gore            31-03-1948    male
Al Green           01-09-1947    male
Alan Hunt          09-10-1927    male
Alberto Fujimori   28-07-1938    male
Barack Obama       04-08-1961    male
Benazir Bhutto     21-06-1953    female
…
Semantic SPARQL query
select distinct ?name ?birth ?gender where { graph <http://some.url/> …
Parsing
/m/0d3k14
/m/044sb
/m/0d3k14
… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …
Sentiment: 0% Positive, 30% Neutral, 70% Negative
Freebase
What’s next?
Vs.
Conclusions: Understanding human language with Python
State of NLP
Recap on fiction vs reality: Are we there yet?
NLP Complexities
Why is understanding language so complex?
NLP using Python
NLTK, Gensim & TextBlob
Other NLP areas
And what’s coming next
Try also:
Pattern: clips.ua.ac.be/pages/pattern
scikit-learn: scikit-learn.org/stable/
PyNLPl: github.com/proycon/pynlpl