Topic Modeling of Twitter Followers - Paris Machine Learning Meetup - Alex Perrier
TRANSCRIPT
TOPIC MODELING APPLIED TO TWITTER TIMELINES
Alexis Perrier
Data & Software, Berklee College of Music, Boston
Data Science contributor
@alexip
@BerkleeOnline
@ODSC
Part I: Topic Modeling
Nature and applications
Algorithms and libraries
Part II: Project: Twitter followers
Methods
Problems
Visualization
A quick, high-level view of a large set of documents
Unsupervised technique
1 document ⇔ several topics
1 topic ⇔ a set of words
The proportion of topics varies from one document to another
SEMANTIC ANALYSIS OF DOCUMENT COLLECTIONS
Various corpora
Literature
Newspapers
Official documents
Online content
Social networks, forums, ...
Coupled with external variables
Evolution over time
Authors, speakers
MAIN ALGORITHMS
Vector-space approach
Latent Semantic Analysis (LSA)
Probabilistic, Bayesian approach
Latent Dirichlet Allocation (LDA)
Structural Topic Modeling (STM), pLSA, hLDA, ...
Neural network approach
convnets, ...
LATENT SEMANTIC ANALYSIS - LSA
TF-IDF: relative word frequency => vectorization
Document / word-frequency matrix
Dimensionality reduction
Singular Value Decomposition (SVD)
aka Latent Semantic Indexing (LSI)
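The TF-IDF weighting that LSA starts from can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and the basic tf × log(N / df) weighting; the function name `tfidf_matrix` and the three toy documents are made up for the example. The SVD step is not shown: it would then be applied to the resulting matrix, for instance with `numpy.linalg.svd`.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[d][j] is the TF-IDF weight of vocab[j] in docs[d]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero on an empty document
        # Relative term frequency times inverse document frequency.
        matrix.append([(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix

# Toy documents, invented for illustration only.
docs = ["python data science", "python startup app", "casino bonus slots"]
vocab, X = tfidf_matrix(docs)
```

A word that appears in a single document ("casino") gets a higher weight than one shared by several documents ("python"), which is what makes the matrix useful before dimensionality reduction.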
LATENT DIRICHLET ALLOCATION
A topic is a list of word probabilities over a given vocabulary.
LDA: the distribution of topics follows a Dirichlet distribution.
K: number of topics
α: number of topics per document
β: number of words per topic
Details: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Bayesian inference, Gibbs sampling, Chinese restaurant process
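The generative story behind LDA can be sketched with the standard library alone. This is an illustrative sketch, not an inference algorithm: it only draws a synthetic document from the model. The helper names, the toy vocabulary and the parameter values are assumptions made for the example. α and β are the Dirichlet concentration parameters: a smaller α tends to give fewer dominant topics per document, a smaller β fewer dominant words per topic.

```python
import random

def sample_dirichlet(concentrations, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]

def sample_topics(vocab_size, n_topics, beta, rng):
    """One word distribution per topic, each drawn from a symmetric Dirichlet(beta)."""
    return [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Draw the document's topic proportions, then one topic and one word per position."""
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]   # pick a topic
        words.append(rng.choices(vocab, weights=topics[z])[0])  # pick a word from it
    return theta, words

rng = random.Random(0)
vocab = ["python", "docker", "startup", "casino", "bonus", "love"]  # toy vocabulary
topics = sample_topics(len(vocab), n_topics=3, beta=0.5, rng=rng)
theta, words = generate_document(topics, vocab, alpha=0.5, n_words=20, rng=rng)
```

Fitting LDA means going the other way: recovering `topics` and each document's `theta` from the observed words, which is where Bayesian inference and Gibbs sampling come in.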
LIBRARIES
Python libraries
Gensim - Topic Modelling for Humans
LDA Python library
R packages
a. lsa package
b. lda package
c. topicmodels package
d. stm package
Java libraries: S-Space Package, MALLET
C/C++ libraries: lda-c, hlda, ctm-c, hdp
THE PROJECT
3 articles
Topic Modeling of twitter followers
Segmentation of Twitter Timelines via Topic Modeling
NLP Analysis of the 2015 presidential candidate debates
BUILDING THE CORPUS
1. Get the timelines of the 700 followers of @alexip (Twython)
One document corresponds to one timeline
2. Vectorize the document
bag-of-words
Timelines in English: lang = 'en' + langid
NLTK: tokenize, stopwords, stemming, POS
3. TF-IDF
Create a dictionary of words
Vectorize the documents with TF-IDF
Gensim, NLTK, Scikit, ....
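The corpus-building steps above can be sketched in plain Python. This is a simplified stand-in for the NLTK and Gensim pipeline, assuming a regex tokenizer and a tiny hand-written stopword list; it does no stemming, POS tagging or language detection. The helper names (`tokenize`, `build_dictionary`, `doc2bow`) and the two sample timelines are invented for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a much longer one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}

def tokenize(text):
    """Lowercase, drop URLs, keep alphabetic tokens longer than two characters."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [t for t in re.findall(r"[a-z]+", text)
            if len(t) > 2 and t not in STOPWORDS]

def build_dictionary(tokenized_docs):
    """Map every distinct token to an integer id."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def doc2bow(tokens, dictionary):
    """Bag-of-words: sorted list of (token id, count) pairs."""
    counts = Counter(tokens)
    return sorted((dictionary[t], c) for t, c in counts.items() if t in dictionary)

# One document per timeline: all of a user's tweets joined into one string.
timelines = {
    "user_a": "Loving the new Python release! Python and data science talk at #pydata http://t.co/x",
    "user_b": "Looking for a startup team to build an app",
}
tokenized = {user: tokenize(text) for user, text in timelines.items()}
dictionary = build_dictionary(tokenized.values())
corpus = {user: doc2bow(tokens, dictionary) for user, tokens in tokenized.items()}
```

The bag-of-words vectors in `corpus` are what a TF-IDF model and then LDA would take as input.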
2) APPLYING LDA
Clearly better
u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote + 0.022*free + 0.020*development + 0.019*looking + 0.017*startup + 0.017*build',
u'0.033*socialmedia + 0.022*python + 0.015*collaborative + 0.014*economy + 0.010*apple + 0.007*conda + 0.007*pydata + 0.007*talk + 0.007*check + 0.006*anaconda',
u'0.053*week + 0.041*followers + 0.033*community + 0.030*insight + 0.010*follow + 0.007*world + 0.007*stats + 0.007*sharing + 0.006*unfollowers + 0.006*blog',
u'0.014*thx + 0.010*event + 0.008*app + 0.007*travel + 0.006*social + 0.006*check + 0.006*marketing + 0.005*follow + 0.005*also + 0.005*time',
u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming + 0.012*tipoftheday + 0.010*security + 0.009*javascript + 0.009*manager + 0.009*containers',
u'0.089*love + 0.035*john + 0.026*update + 0.022*heart + 0.015*peace + 0.014*beautiful + 0.012*beauty + 0.010*life + 0.010*shanti + 0.009*stories',
u'0.033*geek + 0.009*architecture + 0.007*code + 0.007*products + 0.007*parts + 0.007*charts + 0.007*software + 0.006*cryptrader + 0.006*moombo + 0.006*book',
u'0.049*stories + 0.046*network + 0.044*virginia + 0.044*entrepreneur + 0.039*etmchat + 0.025*etmooc + 0.021*etm + 0.015*join + 0.014*deis + 0.010*today',
u'0.056*slots + 0.053*bonus + 0.052*fsiug + 0.039*casino + 0.031*slot + 0.024*online + 0.014*free + 0.013*hootchat + 0.010*win + 0.009*bonuses',
u'0.056*video + 0.043*add + 0.042*message + 0.032*blog + 0.027*posts + 0.027*media + 0.025*training + 0.017*check + 0.013*gotta + 0.010*insider'
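Topic strings in this `weight*word + weight*word` form are easier to inspect once split into (word, weight) pairs. A small sketch, assuming the format shown above; `parse_topic` is a hypothetical helper, and the quote stripping is there because some Gensim versions print the words in double quotes.

```python
def parse_topic(topic_str):
    """Split a 'weight*word + weight*word' string into (word, weight) pairs."""
    pairs = []
    for term in topic_str.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# First five terms of one of the topics listed above.
topic = "0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming"
parsed = parse_topic(topic)
top_words = [word for word, _ in parsed[:3]]
```

Listing the top few words per topic this way is a quick first pass at answering what each topic is about.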
What are the topics?
How many topics?
BACK TO THE CORPUS
Cleaning the documents
Complete the stopword list by hand
Identify anomalies: robots, retweets, hashtags, ...
Keep only the timelines that have tweeted recently.
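Part of this cleaning pass can be sketched as follows. The tweet structure (dicts with `text` and `created_at`), the 30-day activity cutoff and the choice to drop hashtags and mentions entirely are all assumptions made for the illustration, not the settings used in the project; robot detection is not covered.

```python
import re
from datetime import datetime, timedelta

def clean_tweet(text):
    """Remove URLs, @mentions and #hashtags, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    return " ".join(text.split())

def clean_timelines(timelines, now, max_age_days=30):
    """Keep recently active users; drop retweets; join the rest into one document."""
    cutoff = now - timedelta(days=max_age_days)
    cleaned = {}
    for user, tweets in timelines.items():
        if not tweets or max(t["created_at"] for t in tweets) < cutoff:
            continue  # no recent activity: drop the whole timeline
        texts = [clean_tweet(t["text"]) for t in tweets
                 if not t["text"].startswith("RT @")]
        texts = [t for t in texts if t]
        if texts:
            cleaned[user] = " ".join(texts)
    return cleaned

# Hypothetical sample data.
now = datetime(2016, 1, 15)
timelines = {
    "active": [
        {"text": "RT @bob: hello", "created_at": datetime(2016, 1, 10)},
        {"text": "Great #python talk by @alice http://t.co/abc", "created_at": datetime(2016, 1, 12)},
    ],
    "dormant": [{"text": "old news", "created_at": datetime(2014, 5, 1)}],
}
cleaned = clean_timelines(timelines, now)
```

Stripping only the `#` sign instead of the whole hashtag is an equally reasonable choice when hashtags carry topical meaning.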
245 timelines
Visualization - LDAvis
3) STRUCTURAL TOPIC MODELING
NLP: tokenization, stemming, stop-words, ...
Naming the topics: several groups of words per topic (exclusivity, frequency)
Optimal number of topics: grid search + scoring
Influence of external variables
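Ranking words by both frequency and exclusivity can be sketched with a FREX-style score: a weighted harmonic mean of each word's rank by probability within the topic and its rank by exclusivity (its share of that word's total probability across topics). This is a simplified illustration; the stm package's own computation may differ in detail, and the function names and the toy topic-word matrix are invented for the example.

```python
def ecdf(values):
    """Map each value to the fraction of values less than or equal to it."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]

def frex_scores(topic_word, k, weight=0.5):
    """FREX-style score of every word for topic k (higher is better)."""
    probs = topic_word[k]
    n_words = len(probs)
    # Exclusivity: how much of the word's probability mass belongs to topic k.
    exclusivity = [probs[w] / sum(t[w] for t in topic_word) for w in range(n_words)]
    f_rank = ecdf(probs)
    e_rank = ecdf(exclusivity)
    return [1.0 / (weight / e_rank[w] + (1 - weight) / f_rank[w])
            for w in range(n_words)]

def best_word(topic_word, vocab, k):
    scores = frex_scores(topic_word, k)
    return vocab[max(range(len(vocab)), key=lambda w: scores[w])]

# Toy topic-word probabilities: "the" is frequent in every topic, so not exclusive.
vocab = ["the", "docker", "casino", "love", "app"]
topic_word = [
    [0.40, 0.40, 0.02, 0.03, 0.15],
    [0.40, 0.02, 0.40, 0.03, 0.15],
    [0.40, 0.03, 0.02, 0.40, 0.15],
]
labels = [best_word(topic_word, vocab, k) for k in range(len(topic_word))]
```

By raw probability "the" ties for first place in every topic; the exclusivity term is what pushes the topic-specific word to the top.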
STM: PRESIDENTIAL DEBATES
US primaries
6 debates: 2 Democratic, 4 Republican
1 document = one speaker during one debate
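Building one document per speaker per debate amounts to a group-by over the transcript. A minimal sketch, assuming the transcript is available as a list of utterances; the field names, debate ids and placeholder speakers are hypothetical. The party is carried along as metadata, the kind of external variable STM can take into account.

```python
from collections import defaultdict

def build_documents(utterances):
    """Group utterances into one document per (debate, speaker) pair."""
    texts = defaultdict(list)
    party = {}
    for u in utterances:
        key = (u["debate"], u["speaker"])
        texts[key].append(u["text"])
        party[key] = u["party"]
    return {key: {"party": party[key], "text": " ".join(parts)}
            for key, parts in texts.items()}

# Placeholder transcript rows, not real debate content.
utterances = [
    {"debate": "dem-1", "party": "D", "speaker": "Speaker A", "text": "first answer"},
    {"debate": "dem-1", "party": "D", "speaker": "Speaker B", "text": "a rebuttal"},
    {"debate": "dem-1", "party": "D", "speaker": "Speaker A", "text": "second answer"},
    {"debate": "rep-1", "party": "R", "speaker": "Speaker C", "text": "opening statement"},
]
documents = build_documents(utterances)
```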
Visualization - stmBrowser
THANK YOU
@alexip
Slides: alexperrier.github.io
Code & Data & Viz:
- https://github.com/alexperrier/datatalks/tree/master/twitter
- https://github.com/alexperrier/datatalks/tree/master/debates
- http://nbviewer.jupyter.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis_V2.ipynb
- http://alexperrier.github.io/stm-visualization/index.html
Ref:
- topic modeling: http://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- lda: http://ai.stanford.edu/~ang/papers/nips01-lda.pdf
- pyLDAvis: https://github.com/bmabey/pyLDAvis
- stm: http://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf
- stm R: http://structuraltopicmodel.com/
- stmBrowser: https://github.com/mroberts/stmBrowser