We love NLTK
NLTK + Data Matching? Yep!
[('We', 'PRP'), ('<3', 'VBP'), ('NLTK', 'NNP')]
Dhiana Deva | Gabriel Fonseca
Data Matching @ UFRJ
“NLTK” == “Natural Language ToolKit”
+ Python library for NLP
+ Created in 2001 at the University of Pennsylvania
+ Very extensive
+ Many examples
+ Built-in support for 84 datasets (today!)
+ Great documentation
+ Open source ;)
+ Active community
Lots of modules!
+ corpus: standardized interfaces to corpora and lexicons
+ tokenize: tokenizers!
+ stem: stemmers!
+ collocations: t-test, chi-squared, point-wise mutual information
+ classify: decision tree, maximum entropy, naive Bayes
+ cluster: EM, k-means
+ chunk: regular expression, n-gram, named-entity
+ metrics: distances, precision, recall, agreement coefficients
+ probability: frequency distributions, smoothed probability distributions
+ parse: chart, feature-based, unification, probabilistic, dependency
+ tag: part-of-speech tagging, n-gram, backoff, Brill, HMM, TnT
+ ...
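A few of these modules in action. This is a minimal sketch, not taken from the original slides: the sample sentence and variable names are made up, and it assumes NLTK is installed. It sticks to pieces that should need no downloaded corpora or models (`wordpunct_tokenize`, `PorterStemmer`, `FreqDist`, `edit_distance`, `jaccard_distance`, `ngrams`).

```python
from nltk.metrics.distance import edit_distance, jaccard_distance
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

# Made-up sample text, for illustration only.
text = "Songs uploaded by labels, artists and users: lots of duplicated songs!"

# tokenize: regex-based tokenizer, then keep lowercase alphabetic tokens
tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

# stem: reduce every token to its Porter stem
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# probability: count how often each stem occurs
freq = FreqDist(stems)

# metrics: two distances between a clean and a misspelled artist name
lev = edit_distance("queen", "qween")
jac = jaccard_distance(set(ngrams("queen", 2)), set(ngrams("qween", 2)))

print(freq.most_common(3))
print(lev, jac)
```

The Jaccard distance here is computed over sets of character bigrams, which is one common way to make it tolerant of small typos.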
Can I haz Data Matching?
☑ Accuracy score
☑ Precision score
☑ Recall score
☑ F-measure score
☐ Reduction ratio
☑ Stop-words (11 languages)
★ Punkt sentence tokenizer
★ Punkt word tokenizer
☑ N-gram (words and chars)
☑ Tf-idf
☑ Levenshtein distance
☑ Damerau-Levenshtein distance
☑ Binary distance... Durr!
★ Krippendorff's distance
★ Masi distance
☑ Jaccard distance
☐ Jaro distance
☐ Jaro-Winkler distance
☐ Monge-Elkan distance
☐ Soundex
☐ Phonex
☐ NYSIIS
☐ ONCA
☐ Double-Metaphone
☐ Fuzzy Soundex
☑ Decision tree
☑ SVM
☑ Naive Bayes
★ MaxEnt
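Reduction ratio is one of the unchecked boxes above, and it is simple enough to compute by hand next to precision, recall and F-measure. The sketch below is an illustration under assumptions, not the authors' code: `pair`, `evaluate` and the toy record IDs are hypothetical, and the reduction ratio follows the usual deduplication definition, 1 minus the share of all record pairs that survive blocking.

```python
def pair(a, b):
    """Order-independent key for a pair of record IDs."""
    return frozenset((a, b))


def evaluate(true_matches, predicted_matches, candidate_pairs, n_records):
    """Return (precision, recall, f_measure, reduction_ratio) for one run."""
    true_positives = len(true_matches & predicted_matches)
    precision = true_positives / len(predicted_matches) if predicted_matches else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # All unordered pairs within a single dataset of n_records records.
    total_pairs = n_records * (n_records - 1) // 2
    reduction_ratio = 1 - len(candidate_pairs) / total_pairs if total_pairs else 0.0
    return precision, recall, f_measure, reduction_ratio


# Toy data: 5 records, of which (1, 2) and (3, 4) are real duplicates.
true_matches = {pair(1, 2), pair(3, 4)}
# Pairs that a hypothetical blocking step kept for comparison.
candidate_pairs = {pair(1, 2), pair(3, 4), pair(1, 3), pair(4, 5)}
# Pairs that a hypothetical matcher finally declared duplicates.
predicted_matches = {pair(1, 2), pair(1, 3)}

precision, recall, f_measure, reduction_ratio = evaluate(
    true_matches, predicted_matches, candidate_pairs, n_records=5
)
print(precision, recall, f_measure, reduction_ratio)
```

With 5 records there are 10 possible pairs, so keeping 4 candidates means 60% of the comparisons are skipped.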
Fun fun fun!
+ Sentiment analysis
+ Spelling correction
+ Spam detection
+ Topic modeling
+ Recommender systems
+ Data deduplication
Why not song matching?!
+ Grooveshark: online music streaming service
+ Songs uploaded by record labels, independent artists and users
+ Lots of duplicates!
+ Tinysong: Grooveshark's open RESTful API
+ Our goal: No repeated songs! (remixes and live versions are okay!)
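One way to keep remixes and live versions out of the duplicate pile is to split each title into a base title plus a set of version markers, and only call two titles the same when both parts agree. This is a rough sketch under assumptions, not the approach from the talk: the marker list, `version_key` and `same_recording` are all hypothetical.

```python
import re

# Hypothetical marker list: words assumed to signal a distinct version.
VERSION_MARKERS = ("remix", "live")
_MARKER_RE = re.compile(r"\b(" + "|".join(VERSION_MARKERS) + r")\b", re.IGNORECASE)


def version_key(title):
    """Split a title into (normalized base title, set of version markers)."""
    markers = frozenset(m.lower() for m in _MARKER_RE.findall(title))
    # Drop the markers, then collapse anything that is not a-z or 0-9.
    base = _MARKER_RE.sub(" ", title)
    base = re.sub(r"[^a-z0-9]+", " ", base.lower()).strip()
    return base, markers


def same_recording(title_a, title_b):
    """True when both the base title and the version markers agree."""
    return version_key(title_a) == version_key(title_b)


print(same_recording("Bohemian Rhapsody", "bohemian rhapsody!"))
print(same_recording("Bohemian Rhapsody", "Bohemian Rhapsody (Live)"))
```

The sketch is deliberately naive: a title such as "Live and Let Die" would be treated as a live version, and the normalization drops non-ASCII letters, so a real rule set would need more care.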
Bohemian Rhapsody by Qween-?!
{
"Url": "http:\/\/tinysong.com\/PBCJ",
"SongID": 33834073,
"SongName": "Bohemian Rhapsody",
"ArtistID": 2324,
"ArtistName": "Queen",
"AlbumID": 1071492,
"AlbumName": "Greatest Hits"
},
...
{
"Url": "http:\/\/tinysong.com\/CYxG",
"SongID": 28835215,
"SongName": "Bohemian Rhapsody",
"ArtistID": 1731732,
"ArtistName": "Qween -",
"AlbumID": 2364353,
"AlbumName": "A Night at the Opera"
}
...
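The two records above differ only in a misspelled artist name, which is the kind of near-match an edit distance can catch. The sketch below is an assumption-laden illustration, not the authors' pipeline: it uses a hand-written Levenshtein distance so it runs without extra libraries, and `normalize`, `similarity`, `is_duplicate` and the 0.75 threshold are hypothetical choices. Only the `SongName` and `ArtistName` fields from the records are used.

```python
import re


def normalize(text):
    """Lowercase and drop everything except a-z, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


THRESHOLD = 0.75  # hypothetical cut-off, would need tuning on real data


def is_duplicate(x, y, threshold=THRESHOLD):
    """Flag two records when both song and artist names are close enough."""
    return (similarity(x["SongName"], y["SongName"]) >= threshold
            and similarity(x["ArtistName"], y["ArtistName"]) >= threshold)


first = {"SongName": "Bohemian Rhapsody", "ArtistName": "Queen"}
second = {"SongName": "Bohemian Rhapsody", "ArtistName": "Qween -"}

print(similarity(first["ArtistName"], second["ArtistName"]))
print(is_duplicate(first, second))
```

After cleanup, "Qween -" becomes "qween", one substitution away from "queen", so the pair clears the threshold on both fields.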
Next steps
+ Other textual data
+ Machine learning
+ Acoustic features
  + Loudness
  + BPM
  + Liveness
+ Acoustic fingerprinting for supervised learning
  Yes, songs have fingerprints too!
Our “sentiment”
+ Quick and easy!
+ Exteeeeeeeeeeeeeeeeensive!
+ Docs & community!
+ Internationalization
- Time performance
- Memory usage
- No online or active learning
Want more?!
+ jellyfish: Jaro-Winkler, Hamming, Soundex, Metaphone, NYSIIS, …
+ nltk-trainer: Command-line NLTK classifiers!
+ scikit-learn: More machine learning! Memory efficient!
+ pattern: Web mining. Out-of-the-box!
+ gensim: Topic modeling. Out-of-the-box!
References
http://www.nltk.org/
http://www.nltk.org/book/
http://streamhacker.com/
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
http://developers.grooveshark.com/tuts/tinysong
https://github.com/sunlightlabs/jellyfish
https://github.com/japerk/nltk-trainer
http://scikit-learn.org/stable/
http://www.clips.ua.ac.be/pattern
http://radimrehurek.com/gensim/
Thanks! ;)