Sentiment Analysis of Film-related Messages on Social Media
Christopher Burdorf, NBCUniversal
“The big gamblers are not in Vegas, they are in Hollywood”
Animation Director
Sentiment Analysis of Social Media
Sentiment analysis: the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.
- Facebook public messages – DataSift
- Twitter tweets – public API (only 1%), Twitter Gnip Firehose (100%)
Stanford CoreNLP - https://github.com/stanfordnlp/CoreNLP
- Natural language processing system that uses deep learning techniques to analyze sentiment.
Deep Learning
- CoreNLP uses an RNTN (Recursive Neural Tensor Network).
- RNTNs use compositional vector representations for phrases of variable length and syntactic type.
- These representations are used as features to classify each word and phrase within a sentence.
- Overall sentiment is computed from the vector values of the words and phrases the model has been trained to recognize.
- Sentiment ranges from 0 (very negative) to 4 (very positive).
RNTN Model
- RNTNs represent a phrase through word vectors and a parse tree, and then compute vectors for higher nodes in the tree using the same tensor-based composition function.
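The composition step described above can be sketched in a few lines of plain Python. This is a toy illustration of the published RNTN formulation (parent = tanh of a tensor term plus a linear term over the concatenated child vectors), not CoreNLP's implementation; the function names `compose` and `encode`, the nested-tuple tree shape, and all numeric values are assumptions made for the example.

```python
import math

def compose(left, right, V, W):
    """Combine two child vectors (length d) into one parent vector (length d)."""
    x = left + right                      # concatenate the children: length 2d
    n = len(x)
    parent = []
    for k in range(len(left)):
        # Tensor term: x^T V[k] x, one 2d-by-2d slice per output dimension
        bilinear = sum(x[i] * V[k][i][j] * x[j] for i in range(n) for j in range(n))
        # Linear term: row k of W times x
        linear = sum(W[k][j] * x[j] for j in range(n))
        parent.append(math.tanh(bilinear + linear))
    return parent

def encode(tree, vectors, V, W):
    """Bottom-up pass over a binary parse tree given as nested 2-tuples of words."""
    if isinstance(tree, str):
        return vectors[tree]              # leaf: look up the word vector
    left, right = tree
    return compose(encode(left, vectors, V, W),
                   encode(right, vectors, V, W), V, W)

# Toy 1-dimensional example with made-up parameters
vectors = {"Romantically": [0.8], "Involved": [0.6]}
V = [[[0.1, 0.0], [0.0, 0.1]]]            # shape d x 2d x 2d
W = [[0.5, 0.5]]                          # shape d x 2d
phrase_vector = encode(("Romantically", "Involved"), vectors, V, W)
```

The same `V` and `W` are reused at every node, which is what lets the model handle phrases of any length. In the full model, each node's vector would then be passed to a classifier that outputs one of the five sentiment classes.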
Film Sentiment: FSOG
- Save messages from the DataSift Facebook public stream that reference Fifty Shades of Grey (FSOG), and store them in HBase.
- Stored 130,000 Facebook messages over a two-week period surrounding the film's opening (opening date Feb 13).
- Stored 300MB of Facebook message JSON data.
- Run sentiment analysis on the messages with different training models, using Scala parallel collections.
Example
- No model: Sentiment = 1, “Tonight we're feeling Romantically Involved #fiftyShades”
- (4 (3 (2 Tonight) (3 (3 we're) (3 feeling))) (4 (4 Romantically) (4 Involved)))
- With model: Sentiment = 4, “Tonight we're feeling Romantically Involved”
- Can match phrases as well (e.g. “can't wait”).
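The bracketed tree in the example gives a sentiment label (0 to 4) to every word and phrase, in the style of labelled trees used for sentiment training data. As a rough sketch of how to read it, the Python below walks such a string and lists each phrase with its label. The function name `labelled_phrases` is made up for this illustration and is not part of CoreNLP.

```python
import re

def labelled_phrases(tree_text):
    """Return (label, phrase) pairs for every node, children before parents."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree_text)
    pos = 0
    found = []

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "each node starts with an opening bracket"
        pos += 1
        label = int(tokens[pos])          # sentiment label for this node
        pos += 1
        words = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                words.extend(node())      # nested phrase
            else:
                words.append(tokens[pos]) # a word at a leaf
                pos += 1
        pos += 1                          # consume the closing bracket
        found.append((label, " ".join(words)))
        return words

    node()
    return found

tree = "(4 (3 (2 Tonight) (3 (3 we're) (3 feeling))) (4 (4 Romantically) (4 Involved)))"
for label, phrase in labelled_phrases(tree):
    print(label, phrase)
```

The last pair printed is the root, which carries the label for the whole sentence. The inner pairs show how shorter phrases such as "Romantically Involved" get labels of their own.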
Facebook message counts: FSOG
Training Models: FSOG Median
Statistical Sampling
- Manual assignment of sentiments to a statistically significant sample of messages
- 95% confidence level, 7% margin of error
- Compare the manual results with the training model results
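The slides do not say how many messages were labelled by hand. As a worked illustration, the standard sample-size formula for a proportion (with a finite population correction) shows roughly what a 95% confidence level and 7% margin of error imply for 130,000 messages. The sketch assumes simple random sampling and the most conservative proportion, p = 0.5. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence=0.95, margin=0.07, p=0.5):
    """Messages to label by hand for the given confidence level and margin of error."""
    # Two-sided z-score, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an unlimited population
    n0 = z * z * p * (1 - p) / margin ** 2
    # Finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(130_000))  # 196
```

Under these assumptions, about 196 hand-labelled messages would be enough. The population size barely matters at this scale, because the correction term is tiny when the population is far larger than the sample.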
Sampling Results
Performance Issues
- Spam: 80% of tweets are spam; about 10% of Facebook messages are spam.
- Spam filtering: phrase matching vs. H2O Deep Learning.
- Training performance: training on the full plus movie critic set took 8 hours; working with the Stanford NLP group to multithread training reduced this to 1 hour.
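The phrase-matching approach to spam filtering can be as simple as the sketch below. The phrase list is invented for illustration, since the slides do not list the phrases actually used. The function names are also hypothetical.

```python
# Hypothetical spam phrases; the real list is not given in the slides
SPAM_PHRASES = ("free download", "watch online free", "click here")

def is_spam(message, phrases=SPAM_PHRASES):
    """True if the message contains any known spam phrase (case-insensitive)."""
    # Lower-case and collapse whitespace so "Click   HERE" still matches
    text = " ".join(message.lower().split())
    return any(phrase in text for phrase in phrases)

def drop_spam(messages, phrases=SPAM_PHRASES):
    """Keep only the messages that do not match a spam phrase."""
    return [m for m in messages if not is_spam(m, phrases)]

kept = drop_spam([
    "Can't wait for Fifty Shades!",
    "Click   HERE to watch online free",
])
```

A fixed phrase list is cheap and transparent, but it misses spam that rewords itself. That gap is presumably why a learned classifier was also evaluated.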
Performance Improvements
- Sentiment lookup performance: analyzing 130k messages took 6 hours.
- Switching to a distributed database (Cassandra) and implementing concurrent lookups with Akka Actors resulted in a 7x speedup on 16 cores.
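The idea behind the concurrent lookups is to issue many lookups at once instead of one after another. The project did this with Akka Actors in Scala. The sketch below shows the same pattern with a Python thread pool as an analogy only. The `lookup` argument is a hypothetical stand-in for whatever function queries the database for one message.

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_all(messages, lookup, workers=16):
    """Run `lookup` over all messages concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(lookup, messages))

# Stand-in lookup for demonstration: a real one would query the database
def fake_lookup(message):
    return len(message)

scores = lookup_all(["can't wait", "meh", "loved it"], fake_lookup, workers=4)
```

Threads suit this kind of work when each lookup spends most of its time waiting on the database. The speedup then depends on how many requests the store can serve in parallel.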
Other Languages
- The Twitter Firehose is 40% English; other languages (e.g. Spanish) are seeing prominent usage as well.
- 77% of Twitter's 284 million MAUs (Monthly Active Users) are located outside the USA.
- 82% of Facebook's 890 million DAUs (Daily Active Users) are located outside the USA and Canada.