Character-based Neural Embeddings for Tweet Clustering
Svitlana Vakulenko Lyndon Nixon Mihai Lupu
Vienna University of Economics and Business
TU Wien (Vienna University of Technology)
MODUL Technology
The 5th International Workshop on Natural Language Processing for Social Media (SocialNLP)
In conjunction with EACL 2017
April 3, 2017 in Valencia, Spain
Vakulenko et al. (Wirtschaftsuniversitat Wien) SocialNLP@EACL2017 Valencia, Spain 1 / 18
#WEDNTREADWORDS
#CMABRIGDEUINERVTISYEFECT
Character-based Neural Embeddings
Language modeling [Sutskever et al., 2011] [Kim et al., 2016]
Natural Language Generation [Goyal et al., 2016]
Word spelling correction [Sakaguchi et al., 2017]
Part-of-speech tagging [dos Santos and Zadrozny, 2014]
Information extraction [Qi et al., 2014]
Text classification [Zhang et al., 2015]
Tweet2Vec: bi-GRU RNN [Dhingra et al., 2016]
[Figure. Picture credits: Tobias Fink]
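To make the bi-GRU idea concrete, here is a minimal sketch of a character-level bidirectional GRU encoder in plain Python. It is not the Tweet2Vec implementation: the weights are random and untrained, the alphabet, the toy dimensions and all names (`GRU`, `BiGRUEncoder`, `encode`) are assumptions made for illustration, and the linear combination of the two final hidden states is one plausible reading of the design in [Dhingra et al., 2016].

```python
import math
import random

# Assumed toy alphabet; any other character shares one "unknown" slot.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .:#@"
UNKNOWN = len(ALPHABET)
INPUT_DIM = len(ALPHABET) + 1
HIDDEN_DIM = 8   # toy size, chosen for speed
EMBED_DIM = 6    # toy size; the results table reports 500 dimensions for Tweet2Vec


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def one_hot(ch):
    vec = [0.0] * INPUT_DIM
    idx = ALPHABET.find(ch)
    vec[idx if idx >= 0 else UNKNOWN] = 1.0
    return vec


class GRU:
    """A single GRU layer with update (z), reset (r) and candidate (h) weights."""

    def __init__(self, rng, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        self.W = {g: rand_matrix(rng, hidden_dim, input_dim) for g in "zrh"}
        self.U = {g: rand_matrix(rng, hidden_dim, hidden_dim) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], gated))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, inputs):
        h = [0.0] * self.hidden_dim
        for x in inputs:
            h = self.step(x, h)
        return h  # final hidden state


class BiGRUEncoder:
    """Reads the characters left-to-right and right-to-left, then mixes both states."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.fwd = GRU(rng, INPUT_DIM, HIDDEN_DIM)
        self.bwd = GRU(rng, INPUT_DIM, HIDDEN_DIM)
        self.W_f = rand_matrix(rng, EMBED_DIM, HIDDEN_DIM)
        self.W_b = rand_matrix(rng, EMBED_DIM, HIDDEN_DIM)

    def encode(self, text):
        chars = [one_hot(c) for c in text.lower()]
        h_f = self.fwd.run(chars)
        h_b = self.bwd.run(chars[::-1])
        return [a + b for a, b in zip(matvec(self.W_f, h_f), matvec(self.W_b, h_b))]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


encoder = BiGRUEncoder()
a = encoder.encode("mt . gox goes dark")
b = encoder.encode("mtgox goes offline")
similarity = cosine(a, b)  # only meaningful once the weights are trained
```

With untrained weights the similarity value carries no meaning; the sketch only shows how a tweet of any length is reduced to a fixed-size vector and why the result depends on character order.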
Breaking News Detection from Twitter Stream
EU projects: SocialSensor, REVEAL, Pheme, InVID ...
[Figure. Picture credits: AppAdvice, Mind The Gap Public Relations]
SNOW Data Challenge [Papadopoulos et al., 2014]
Dataset: 1M tweets over 24h related to major events (Syria, terror, Ukraine, bitcoin), annotated with 59 reference topics, e.g.:
25-02-14 18:00
Nigeria children killed in attack on school
Nigeria,children,killed,attack,school,Boko Haram
438372486808629250,438373272439123968,438373225320697856
Winner: aggressive filtering and hierarchical clustering [Ifrim et al., 2014]
Precision: 0.56 Recall: 0.36 F-Measure: 0.4
Results
Interval  Tweets  Model      Dimensions  #Clusters  Homogeneity  Completeness  V-Measure
18:00     10,344  Tweet2Vec  500         3026       0.9958       0.9453        0.9699
                  TweetTerm  433         66-79      0.9277       1             0.9625
22:00     14,471  Tweet2Vec  500         5292       1            0.9601        0.9796
                  TweetTerm  589         93-118     0.9385       0.9969        0.9668
23:15     8,231   Tweet2Vec  500         3986       1            0.98          0.9899
                  TweetTerm  565         67-142     0.8062       0.9978        0.8918
01:00     5,123   Tweet2Vec  500         2242       1            0.8877        0.9405
                  TweetTerm  721         71-111     0.8104       1             0.8953
01:30     4,589   Tweet2Vec  500         2091       1            0.8762        0.934
                  TweetTerm  635         64-78      0.8024       1             0.8903
Table: Results of clustering evaluation on the English-language dataset
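The three scores in the table can be related with a short sketch. It assumes the standard entropy-based definitions of homogeneity and completeness, with V-Measure as their harmonic mean; the slides do not spell out the formulas, but the reported numbers are consistent with this reading. The function names and the four-tweet toy labelling are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(homogeneity, completeness):
    """Harmonic mean of homogeneity and completeness."""
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def clustering_scores(classes, clusters):
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    # Homogeneity: each cluster contains members of a single reference class.
    homogeneity = 1.0 if h_classes == 0 else 1.0 - conditional_entropy(classes, clusters) / h_classes
    # Completeness: all members of a reference class end up in the same cluster.
    completeness = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(clusters, classes) / h_clusters
    return homogeneity, completeness, v_measure(homogeneity, completeness)


# Check against the first row of the table (18:00, Tweet2Vec).
row_v = round(v_measure(0.9958, 0.9453), 4)

# Toy case: two topics, every tweet in its own cluster.
hom, com, v = clustering_scores(["a", "a", "b", "b"], [0, 1, 2, 3])
```

In the toy case homogeneity is perfect while completeness drops, which is one way to read the Tweet2Vec rows: thousands of small, pure clusters give homogeneity close to 1 at some cost in completeness.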
TweetTerm: results sample
obama : michelle and i were saddened to hear of the passing of harold ramis ...
touching tribute to ghostbusters star harold ramis from comic artist
on the joyful comedy of harold ramis

major tokyo-based bitcoin exchange mt . gox goes dark
"bitcoin exchange giant mt . gox goes dark - popular science"

obesity rate for young children plummets 43 % in a decade
the national obesity rate for young children dropped 43 % over the past decade

diplomatic pressure is unlikely to reverse uganda's cruel anti-gay law
provisions of arizona proposed anti-gay law
even mitt romney wants arizona's governor to veto the state's anti-gay bill
icymi : arizona pizzeria response to state anti-gay bill

amazing debate nic ! well done !
well done 4 -0
well done ! i find running so difficult . feel proud !
well done him :-)
well done nicola my money is on you you done it well tonight ??
Table: Similarity patterns in tweets discovered by TweetTerm.
TweetTerm results
Similarity patterns: word-level N-grams
Word-based approach is inflexible: mt.gox vs. mtgox
Especially evident for social media posts with diverse vocabulary
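The mt.gox vs. mtgox point can be illustrated with a small sketch that compares word-level overlap with character n-gram overlap. Character trigrams with Jaccard similarity stand in here for a character-based representation; this is not the neural model from the talk, and the helper names are hypothetical.

```python
def word_tokens(text):
    """Whitespace tokens, as a word-based model would see them."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """All overlapping character n-grams, spaces included."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


spelling_a = "mt . gox"   # tokenised form seen in the sample tweets
spelling_b = "mtgox"

word_overlap = jaccard(word_tokens(spelling_a), word_tokens(spelling_b))
char_overlap = jaccard(char_ngrams(spelling_a), char_ngrams(spelling_b))
```

The two spellings share no word token, so the word-level score is zero, while the trigram "gox" is common to both and gives a non-zero character-level score.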
Tweet2Vec: results sample
video : bitcoin : mtgox exchange goes offline - bitcoin , a virtual currency ...
the slow-motion collapse of mt . gox is bitcoin's first financial crisis ...
Disastro bitcoin : mt . gox cessa ogni attivite ... : mt . gox , il più grande cambiavalute bitco ...
Correct

california couple finds time capsules worth $10 million
californian couple finds $10 million worth of gold coins in tin can
Correct

ukraine puts off vote on new government despite eu pleas for quick action - washington post ...
ukraine truce shattered , death toll hits 67 - kiev (reuters) - ukraine suffered its bloodiest day ...
ukraine fighting leaves at least 18 dead as kiev barricades burn - clashes in ukraine ...
Partial

are you going to come on his network and get poor ratings too ?
are you sold on the waffle taco ?
Incorrect

the chromecast app flood has started by
the importance of emotion in design by
Incorrect
Tweet2Vec results
sensitive to the order of symbols
also picks up syntactic patterns instead of semantic ones
would benefit from stop-word removal, e.g. an analogue to the IDF weighting scheme
black-box document and similarity representation
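For reference, the classic word-level IDF that the stop-word point alludes to can be sketched as follows. This is ordinary term weighting, not a neural analogue (which the slides leave open); the function name is hypothetical and the four tweets are taken from the "Incorrect" sample clusters.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """idf(term) = log(N / df(term)), with df counted once per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


tweets = [
    "are you going to come on his network and get poor ratings too ?",
    "are you sold on the waffle taco ?",
    "the chromecast app flood has started by",
    "the importance of emotion in design by",
]

idf = inverse_document_frequency(tweets)
```

Frequent function words such as "the" and "by" receive lower weights than content words such as "chromecast", which is the signal that would discourage grouping tweets only because they share a sentence frame.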
Directions for Future Work
1 Syntax
  - filter out syntactic patterns
  - eliminate stop-words
  - e.g. develop an analogue for the IDF weighting scheme for neural networks, i.e. an aggregation step
2 Semantics
  - explore semantic similarity patterns, e.g. paraphrases and synonyms
  - leverage pre-trained word-embeddings? e.g. Word2Vec, GloVe
  - combine word-based semantics with the character-based similarity
  - e.g. "construct a representation by concatenating a word and a character embedding" [Hashimoto et al., 2016]
3 Dataset
  - extend experiments to a larger multi-lingual dataset of tweets
4 Baseline
  - compare with word-based neural network model
Questions!
Bibliography I
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., and Cohen, W. W. (2016). Tweet2vec: Character-based distributed representations for social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany.
dos Santos, C. N. and Zadrozny, B. (2014). Learning character-level representations for part-of-speech tagging. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, 21-26 June, 2014, Beijing, China, pages 1818–1826.
Bibliography II
Goyal, R., Dymetman, M., and Gaussier, E. (2016). Natural language generation through character-based RNNs with finite-state prior knowledge. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 1083–1092.
Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2016). A joint many-task model: Growing a neural network for multiple NLP tasks. CoRR, abs/1611.01587.
Ifrim, G., Shi, B., and Brigadir, I. (2014). Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering. In Papadopoulos, S., Corney, D., and Aiello, L. M., editors, Proceedings of the SNOW 2014 Data Challenge co-located with 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 33–40.
Bibliography III
Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. (2016). Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2741–2749.
Papadopoulos, S., Corney, D., and Aiello, L. M. (2014). SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. In Papadopoulos, S., Corney, D., and Aiello, L. M., editors, Proceedings of the SNOW 2014 Data Challenge co-located with 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 1–8.
Qi, Y., Das, S. G., Collobert, R., and Weston, J. (2014). Deep learning for character-based information extraction. In Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Amsterdam, The Netherlands, April 13-16, 2014. Proceedings, pages 668–674.
Bibliography IV
Sakaguchi, K., Duh, K., Post, M., and Durme, B. V. (2017). Robsut wrod reocginiton via semi-character recurrent neural network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3281–3287.
Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1017–1024.
Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.