ods slack exploration
OpenDataScience Slack
Officially started 12.03.2015
● Started as platform for open data science communication
● The biggest data science community in the world
As of today, 25.07.2017
● 1.3M messages, 5400 users (2000 weekly active)
Most active channels:
#deep_learning, #theory_and_practice, #visualization, #_general,
#_meetings, #_jobs, #big_data, #python, #r, #datasets, #nlp, #edu_courses
Node size - PageRank
Clustering - walktrap
Channels 2017
Node size - PageRank
Clustering - walktrap
Data - messages in top-40 channels
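As a rough illustration of how node sizes could be derived, the sketch below runs a power-iteration PageRank on a small weighted channel graph. The edge list, the meaning of an edge weight (here, users shared between two channels) and the damping factor of 0.85 are assumptions, not taken from the slides. Walktrap clustering is not shown.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted graph.

    edges: dict mapping (node_a, node_b) -> positive weight.
    Returns a dict node -> score; scores sum to 1.
    """
    # Build a symmetric weighted adjacency map.
    adj = defaultdict(dict)
    for (a, b), w in edges.items():
        adj[a][b] = adj[a].get(b, 0.0) + w
        adj[b][a] = adj[b].get(a, 0.0) + w

    nodes = sorted(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives the same "teleport" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            total = sum(adj[node].values())
            # Distribute this node's score to neighbours by edge weight.
            for neighbour, w in adj[node].items():
                new_rank[neighbour] += damping * rank[node] * w / total
        rank = new_rank
    return rank


# Hypothetical toy graph: weight = number of users active in both channels.
channel_edges = {
    ("#deep_learning", "#theory_and_practice"): 30,
    ("#deep_learning", "#nlp"): 20,
    ("#theory_and_practice", "#nlp"): 10,
    ("#nlp", "#datasets"): 5,
}
ranks = pagerank(channel_edges)
```

The resulting scores can be scaled to node radii when drawing the graph.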
Most active users in all threads
Node size - weighted degree
Clustering - modularity
Data - threads in all channels
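One possible way to obtain the weighted degree used for node size is sketched below. It assumes (this is not stated in the slides) that two users are connected when they post in the same thread and that the edge weight is the number of threads they share. The user names are made up.

```python
from collections import Counter
from itertools import combinations


def weighted_degree(threads):
    """threads: list of lists of user names that posted in each thread."""
    # Edge weight = number of threads in which both users took part.
    edge_weights = Counter()
    for participants in threads:
        for a, b in combinations(sorted(set(participants)), 2):
            edge_weights[(a, b)] += 1

    # Weighted degree = sum of weights of all edges touching the user.
    degree = Counter()
    for (a, b), w in edge_weights.items():
        degree[a] += w
        degree[b] += w
    return degree


# Hypothetical toy data.
threads = [
    ["alice", "bob", "carol"],
    ["alice", "bob"],
    ["bob", "dave"],
]
degree = weighted_degree(threads)
```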
Most active users in _random_flood channel
Node size - weighted degree
Clustering - modularity
Data - threads in _random_flood
Most active users in career channel
Node size - weighted degree
Clustering - modularity
Data - threads in career channel
Detection of curious users
● Curious user - user who asks for help
● NLP techniques: preprocessing, regular expressions
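A minimal sketch of the preprocessing plus regular-expression idea is given below. The patterns are hypothetical English examples chosen for illustration; the actual expressions used in the project are not given in the slides.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")

# Hypothetical help-request patterns: a trailing question mark
# or a few typical "asking for help" phrases.
HELP_PATTERN = re.compile(
    r"\?\s*$|\b(?:how (?:do|can) i|can (?:anyone|someone)|please help|need help)\b"
)


def preprocess(text):
    """Drop URLs, collapse whitespace, lowercase."""
    text = URL.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def count_help_requests(messages):
    """messages: iterable of (user, text). Returns help requests per user."""
    counts = Counter()
    for user, text in messages:
        if HELP_PATTERN.search(preprocess(text)):
            counts[user] += 1
    return counts


# Hypothetical toy data.
messages = [
    ("alice", "How do I tune learning rate in XGBoost?"),
    ("bob", "Here is a good tutorial: https://example.com/guide"),
    ("alice", "Can anyone recommend a dataset for NER"),
    ("carol", "Thanks, that worked!"),
]
help_counts = count_help_requests(messages)
```

Users with the highest counts would then be the candidates for "curious" users.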
Expert detection
● Experts - users with the highest number of specific reactions under their messages in threads
Troll detection
● Trolls - users with the highest number of specific reactions under their messages
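Both expert and troll detection come down to counting selected reactions per message author. The sketch below shows that counting. The reaction names in each set and the message layout are hypothetical, since the slides do not list which reactions were used.

```python
from collections import Counter

# Hypothetical reaction sets; the real ones are not given in the slides.
EXPERT_REACTIONS = {"+1", "thanks"}
TROLL_REACTIONS = {"troll", "facepalm"}


def rank_users(messages, reactions_of_interest, top_n=3):
    """Sum the chosen reactions over each author's messages."""
    totals = Counter()
    for msg in messages:
        for name, count in msg["reactions"].items():
            if name in reactions_of_interest:
                totals[msg["user"]] += count
    return totals.most_common(top_n)


# Hypothetical toy data: reactions map reaction name -> number of users.
messages = [
    {"user": "alice", "reactions": {"+1": 5, "troll": 1}},
    {"user": "bob", "reactions": {"troll": 4, "facepalm": 2}},
    {"user": "alice", "reactions": {"thanks": 2}},
    {"user": "carol", "reactions": {"+1": 1}},
]
experts = rank_users(messages, EXPERT_REACTIONS)
trolls = rank_users(messages, TROLL_REACTIONS)
```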
Model Info
Problem Info
Problem:
Predict time of response in thread
Data:
12500 threads, time period 2016 - 2017
Tool:
Regression models
Data Stats
Features:
● Text of main message
● Day and hour of main message
● Length of main message
● Channel
● Mentioned users
● Links in text
● Historical activity
Target variable:
● Waiting time for response
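Some of the listed features and the target can be sketched as follows. The message fields (`ts` as an epoch-seconds string, `<@U...>` user mentions) are an assumption about the export layout, and the text and historical-activity features are left out.

```python
import re
from datetime import datetime, timezone

MENTION = re.compile(r"<@([A-Z0-9]+)>")  # assumed mention markup, e.g. <@U123ABC>
LINK = re.compile(r"https?://\S+")


def extract_features(message):
    """Simple per-message features for the main (first) message of a thread."""
    ts = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
    text = message["text"]
    return {
        "channel": message["channel"],
        "weekday": ts.weekday(),  # 0 = Monday
        "hour": ts.hour,          # hour in UTC
        "length": len(text),
        "n_mentions": len(MENTION.findall(text)),
        "n_links": len(LINK.findall(text)),
    }


def waiting_minutes(main_ts, first_reply_ts):
    """Target variable: minutes between the main message and the first reply."""
    return (float(first_reply_ts) - float(main_ts)) / 60.0


# Hypothetical toy message.
message = {
    "channel": "#deep_learning",
    "ts": "1500000000.000000",
    "text": "<@U123ABC> any idea why loss is NaN? see https://example.com/log",
}
features = extract_features(message)
target = waiting_minutes("1500000000.000000", "1500007200.000000")
```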
Applied Approaches
Approaches:
● Lasso regression (scikit-learn) MAE = 149 min
● XGBoost regression MAE = 140 min
● LightGBM regression MAE = 119 min
Best results: LightGBM
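The models are compared by MAE (mean absolute error), the average of |actual - predicted| over all threads, here measured in minutes. A small worked example with made-up numbers:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical response times in minutes (not project data).
actual_minutes = [10, 240, 35, 600]
predicted_minutes = [30, 180, 50, 500]

# Absolute errors: 20, 60, 15, 100 -> mean = 195 / 4
mae = mean_absolute_error(actual_minutes, predicted_minutes)
```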
Plot for real and predicted response time (in minutes) for deep_learning channel:
Further Work
Future improvements:
● New features (for example, use number of active users in channel and number of threads before new thread)
● Also use answers in channel
● Reduce dimensionality
● Take into account the topic of thread
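The first proposed feature pair (number of active users in the channel and number of threads before a new thread) could be computed roughly as below. The 60-minute window, the field names and the `is_thread_start` flag are all hypothetical choices made for this sketch.

```python
def activity_before(new_thread, history, window_minutes=60):
    """Channel activity in the window just before a new thread starts.

    history: list of dicts with channel, user, ts (epoch seconds)
    and is_thread_start (bool). All field names are assumptions.
    """
    start = new_thread["ts"] - window_minutes * 60
    recent = [
        m for m in history
        if m["channel"] == new_thread["channel"]
        and start <= m["ts"] < new_thread["ts"]
    ]
    return {
        "threads_before": sum(1 for m in recent if m["is_thread_start"]),
        "active_users": len({m["user"] for m in recent}),
    }


# Hypothetical toy data.
history = [
    {"channel": "#nlp", "user": "dave", "ts": 100, "is_thread_start": True},
    {"channel": "#nlp", "user": "alice", "ts": 1000, "is_thread_start": True},
    {"channel": "#nlp", "user": "bob", "ts": 2000, "is_thread_start": False},
    {"channel": "#nlp", "user": "alice", "ts": 3000, "is_thread_start": True},
    {"channel": "#python", "user": "carol", "ts": 3500, "is_thread_start": True},
]
new_thread = {"channel": "#nlp", "user": "erin", "ts": 4000}
activity = activity_before(new_thread, history)
```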
Users Classification
Problem:
Classify users by messages
Data:
users - 1771 (select top 100)
12 channels, time period 2016 - 2017
messages - 120 997
User Classification Tools
Linear models - accuracy 19.85%
● LogisticRegression - 19.85% (CountVectorizer, OneVsRestClassifier)
● LinearSVC - 15.63%
LSTM - accuracy 16.56%
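For readers unfamiliar with the CountVectorizer step, the sketch below is a rough pure-Python stand-in for the bag-of-words counting it performs. It does not call scikit-learn; the tokenization rule (lowercased words of two or more characters) and the example texts are assumptions for illustration. The resulting count vectors are what a one-vs-rest linear classifier would be trained on.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of 2+ characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(texts):
    """Map every distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(texts, vocabulary):
    """Turn each text into a list of token counts, one column per token."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy messages.
texts = ["GPU training is slow", "training on GPU and GPU memory"]
vocabulary = fit_vocabulary(texts)
matrix = transform(texts, vocabulary)
```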
Channels Classification
Problem: Classify channels by messages
Data:
messages - 120 997
time period 2016 - 2017
12 channels:
#career, #big_data, #kaggle_crackers, #lang_python, #lang_r, #theory_and_practice,
#nlp, #welcome, #bayesian, #_meetings, #datasets, #deep_learning
Channel Classification Tools
Preprocessing - pymorphy2 lemmatization, exclude: stop_words/url/emoji
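The preprocessing step can be sketched as below. The stop-word list is a tiny hypothetical sample, emoji are assumed to appear as `:shortcode:` text, and the lemmatizer is a pluggable function that defaults to doing nothing so the block runs without pymorphy2 installed.

```python
import re

URL = re.compile(r"<?https?://[^\s>]+>?")
EMOJI = re.compile(r":[a-z0-9_+\-]+:")  # assumed emoji shortcodes, e.g. :smile:
TOKEN = re.compile(r"\w+")
STOP_WORDS = {"the", "a", "is", "to", "and"}  # hypothetical tiny list


def preprocess(text, lemmatize=lambda token: token):
    """Lowercase, drop URLs and emoji, drop stop words, lemmatize the rest."""
    text = URL.sub(" ", text.lower())
    text = EMOJI.sub(" ", text)
    tokens = TOKEN.findall(text)
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess(
    "The model is overfitting :sob: see <https://example.com/plot> and help"
)

# With pymorphy2 installed, a lemmatizer could be plugged in like this:
#   import pymorphy2
#   morph = pymorphy2.MorphAnalyzer()
#   preprocess(text, lemmatize=lambda t: morph.parse(t)[0].normal_form)
```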
Linear models - accuracy 55.33%
● LogisticRegression - 55.33% (CountVectorizer, OneVsRestClassifier)
● LinearSVC - 51.52%
CNN (with fasttext embeddings) - accuracy 51.67%
LSTM (with fasttext embeddings) - accuracy 55.42%
Ensemble - accuracy 58.17%
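The slides do not say how the ensemble combines the models. One common option, shown here purely as an assumption, is soft voting: average the class probabilities of the individual models and pick the channel with the highest mean. All probabilities below are made up.

```python
def soft_vote(model_probs):
    """model_probs: one list per model; each list holds, per message,
    a dict channel -> predicted probability.
    Returns the channel with the highest average probability per message."""
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    predictions = []
    for i in range(n_samples):
        labels = model_probs[0][i].keys()
        averaged = {
            label: sum(model[i][label] for model in model_probs) / n_models
            for label in labels
        }
        predictions.append(max(averaged, key=averaged.get))
    return predictions


# Hypothetical probabilities for two messages from three models.
logreg = [{"#nlp": 0.6, "#career": 0.4}, {"#nlp": 0.3, "#career": 0.7}]
cnn = [{"#nlp": 0.4, "#career": 0.6}, {"#nlp": 0.2, "#career": 0.8}]
lstm = [{"#nlp": 0.7, "#career": 0.3}, {"#nlp": 0.6, "#career": 0.4}]

predicted_channels = soft_vote([logreg, cnn, lstm])
```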
Our Team
Volodymyr Medentsiy
Vadym Korshunov
Ganna Kaplun
Andrii Skliar
Yana Mosiichuk
Kateryna Bobrovnyk
Vitalii Radchenko
Thanks!
https://github.com/udsclub/ucu_sentiment/tree/master/projects/p02