Post on 28-May-2020
Integrated real-time social media
sentiment analysis service using a big
data analytic ecosystem
By Danielle C. Aring
Under the Direction of
Dr. Sun Sunnie Chung
Department of Electrical Engineering and Computer Science
Cleveland State University
Big Data Analytics Methodology: Sentiment Analysis
▶ Big data analytics is the process of examining massive data sets to discover
▶ Hidden information
▶ Hidden correlations
▶ One such methodology is sentiment analysis (opinion mining) of user-generated text and messages, which is very difficult
▶ These messages are
▶ Human expressions in natural language
▶ Unstructured
Sentiment Analysis (Opinion Mining): Data Sources
▶ Opinion Mining applied per Phrase, Sentence, Paragraph in:
▶ Movie Reviews: IMDB Movie Review Site
▶ Product Reviews: Amazon Product Review Sites
▶ Social media dialog: Public posts from Social Network sites:
Overview of Sentiment Analysis: Approaches
▶ Uses Natural Language Processing techniques and text analysis techniques to extract and quantify subjective information in a text span
▶ Two main types of opinions:
▶ Regular: sentiment expressed about a specific target entity
Ex: "The touch screen is really cool."
▶ Comparative: sentiment expressed about more than one entity
Ex: "iPhone is better than Blackberry."
Sentiment Analysis Approaches: Defining the Structure of an Opinion
● An opinion is a quadruple
● Opinion mining is difficult!
● How to estimate sentiment: a correct model must be chosen
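The four components of the quadruple are not listed in this transcript. A minimal sketch, assuming the slide follows Bing Liu's common formulation (target, sentiment, holder, time); the class name, field names, and example values are illustrative assumptions, not taken from the slides:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """One opinion as a quadruple (assumed fields, after Liu's formulation)."""
    target: str     # entity or aspect the opinion is about
    sentiment: str  # e.g. "positive", "negative", or "neutral"
    holder: str     # who expressed the opinion
    time: str       # when the opinion was expressed


# Hypothetical instance built from the "touch screen" example sentence.
example = Opinion(
    target="touch screen",
    sentiment="positive",
    holder="hypothetical_user",
    time="2016-11-05",
)
```

Storing opinions in a fixed structure like this is what makes them countable and comparable later, which is the step the scoring models depend on.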
Natural Language Processing (NLP) Techniques
Adopt Natural language Processing (NLP) Techniques
▶ Stemming
▶ Lemmatisation
▶ POS Tagging
▶ N-gram analysis
▶ Stop words removal
▶ Chunking
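Two of the techniques listed above, stop words removal and n-gram analysis, can be sketched in plain Python. This is only an illustration: the stop-word list is a tiny hypothetical one and the function names are assumptions, not part of the system described in these slides.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stop_words(tokenize("The touch screen is really cool."))
bigrams = ngrams(tokens, 2)
```

Here `tokens` keeps only the content words of the example sentence, and `bigrams` pairs each word with its successor.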
Problems in Natural Language Processing
○ Ignores the syntax and semantics of words
■ n-grams, Phrases
■ Synonymy, Polysemy, Grammar, Context
○ Loses Word Order in a Sentence
New Approach in NLP: Word2vec - Continuous Skip-Gram Model
○ The word2vec Skip-Gram model is a Neural Network architecture that
learns semantically meaningful vector representations for words.
○ Given a target word w_i, the skip-gram model is trained to predict the
surrounding context words w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}, for example within a
phrase window of size n = 5.
○ After training, the weights of the embedding layer, learned through
backpropagation as an indirect result of the prediction task, capture word
semantics and become the word vectors.
■ Words with similar meanings are mapped to nearby positions in a
high-dimensional vector space (for example, 300 dimensions).
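The training data for such a model is a set of (target, context) word pairs. A minimal sketch of how those pairs could be generated, assuming a window of two words on each side of the target (the n = 5 case); the function name is hypothetical and the neural network itself is not shown:

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs with up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target is never its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "touch", "screen", "is", "really", "cool"]
pairs = skipgram_pairs(sentence, window=2)
screen_context = [c for t, c in pairs if t == "screen"]
```

For the target "screen", the context words are "the", "touch", "is", and "really". Words near the edges of the sentence simply get fewer pairs.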
Word2vec: Continuous Skip-Gram Model
Sentiment Analysis Types
▶ Classifying word polarity (e.g. positive, negative, neutral) in a document
▶ Beyond Polarity
▶ Advanced Sentiment (Pang 2004)
▶ Lexical Approach (Taboada et al. 2011)
▶ Minimum cut extraction (Pang and Lee 2004)
▶ Topic-based (O'Connor et al. 2010)
▶ Aspect (feature) based opinion mining using a semi-supervised
approach (Mukherjee and Liu 2012)
▶ Wordnet probability model (Tomas et al. 2013)
▶ Paragraph Embedding with Vectors (Dai et al. 2015)
Research Goals
▶ Investigate whether a stream-processing big data social media sentiment
service with analytics can offer the following compared to batch mode
counterparts:
▶ Scalability (enormous volume)
▶ Efficient near Real-time data processing and
▶ Data Analytics (Sentiment Analysis) with Accuracy
System Architecture: Social Media Data Stream Sentiment Analysis Service (SMDSSAS)
SMDSSAS System Architecture
Layer 1: Data Extraction
▶ Created a Spark Configuration
▶ Created a Spark context
▶ Created a Spark Streaming Context
▶ Spark DStream to filter for our messages
Layer 2: Data Stream Layer
Apache Spark
▶ Developed in 2010 from a Berkeley Research project
▶ For distributed big data processing built on top of Hadoop
▶ Why Spark? It overcomes the limitations of MapReduce for multi-stage applications
▶ Uses an in-memory abstraction, Resilient Distributed Datasets (RDDs)
▶ RDDs are partitioned across the cluster and operated on in parallel
▶ RDDs are persisted in memory
▶ RDDs are reused in other operations across multiple map-reduce stages
Spark Streaming
▶ Internally, Spark Streaming:
▶ Receives live input data streams
▶ Divides the streams into batches
▶ Processes the batches with the Spark engine to generate a final stream of results, also in batches
▶ Each batch is a resilient distributed dataset (RDD), processed using RDD operations (map, reduce)
Spark DStreams
▶ Processed results are pushed out in batches as discretized streams (DStreams)
▶ DStream: an abstraction of a continuous stream of data, represented as a sequence of RDDs
▶ Map-reduce is performed on each batch
▶ An operation applied to a DStream is carried out on each of its underlying RDDs
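The micro-batch idea can be illustrated without Spark. The sketch below is plain Python, not the Spark API: it groups a stream into batches by message count (Spark Streaming batches by time interval, so this is a simplifying assumption) and runs a map step and a reduce step on each batch. All names and the sample messages are hypothetical.

```python
from collections import Counter
from functools import reduce


def micro_batches(stream, batch_size):
    """Group an iterable of messages into consecutive fixed-size batches."""
    batch = []
    for message in stream:
        batch.append(message)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter batch
        yield batch


def count_keywords(batch, keywords):
    """Map each message to its keyword hits, then reduce by summing the counts."""
    mapped = (Counter(k for k in keywords if k in msg.lower()) for msg in batch)
    return reduce(lambda a, b: a + b, mapped, Counter())


stream = [
    "Trump rally today",
    "Clinton speaks",
    "weather is nice",
    "Trump and Clinton debate",
    "Clinton policy",
]
keywords = ["trump", "clinton"]
results = [dict(count_keywords(b, keywords)) for b in micro_batches(stream, 2)]
```

Each element of `results` is the output for one batch, which mirrors how the same operation is applied to every RDD in a DStream.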
Layer 3: Data pre-processing and Transformation
Layer 3: Data Preprocessing/Transformation
Applying the NLP techniques:
▶ Phase 1: During Spark Streaming (Data Storage Layer)
▶ Preprocess messages in the Tweet stream to remove characters sensitive to the Hive scanner
▶ Ex: "\t", "\", "\n", and "[\\p{C}]" (control characters)
▶ Phase 2: During Real-Time Streaming for Sentiment Analysis
▶ Function to pre-process messages in the Tweet stream to remove:
▶ Twitter '@', "#", image and website URLs
▶ The following punctuation: [. , ! “ ‘]
▶ Numbers 0-9
▶ The following non-alphanumeric characters: $%&^*() + ~
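A minimal sketch of the two cleaning phases in Python. Several choices here are assumptions rather than the system's actual code: the function names are hypothetical, control characters are replaced by a space instead of being deleted outright so that words do not run together, only the '@' and '#' symbols are stripped (not the whole mention or hashtag), and Unicode categories beginning with "C" stand in for the `\p{C}` class, which Python's `re` module does not support.

```python
import re
import unicodedata


def clean_for_storage(text):
    """Phase 1: drop backslashes and control characters (tab, newline, ...)."""
    text = text.replace("\\", "")
    chars = [
        " " if unicodedata.category(ch).startswith("C") else ch
        for ch in text
    ]
    return " ".join("".join(chars).split())


def clean_for_sentiment(text):
    """Phase 2: strip URLs, '@'/'#', listed punctuation, digits, and symbols."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # website and image URLs
    text = re.sub(r"[@#]", " ", text)                   # Twitter markers
    text = re.sub(r"[.,!\"'“”‘’]", " ", text)           # listed punctuation
    text = re.sub(r"[0-9]", " ", text)                  # numbers
    text = re.sub(r"[$%&^*()+~]", " ", text)            # listed symbols
    return " ".join(text.split())                       # collapse whitespace


stored = clean_for_storage("a\tb\\c\nd")
cleaned = clean_for_sentiment("@user Trump wins!! 100% #election http://t.co/abc")
```

URLs are removed first so that later steps do not break them into stray fragments.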
Layer 4: Feature Extraction
Layer 5: Prediction
Prediction Layer: Sentiment Analysis Methodology
Motivation of Experiment 1
▶ Develop a base sentiment model that is successful in event prediction
▶ Using data from Nov. 5th, 2016, before the 2016 presidential election
▶ Quantify the level of positive/negative sentiment
▶ Apply refinements to correlate user sentiment with topic
▶ Why?
▶ So that the system performs accurate event predictions.
Input Data Source
▶ Collected 3 datasets from Spark Streaming API using filtered DStream for Hillary
Clinton and Donald Trump and their political policies over 3 months.
▶ Preprocessed/stored in NoSQL Hive System:
▶ Pre-election October 23rd 2016
▶ Pre-election November 5th 2016
▶ Post-Election Pre-Inauguration January 1st 2017
Naive Sentiment Model
Our Approaches: Quantifying Polarity of Sentiment in User-Generated Tweets
Quantifying Polarity of Sentiment: Polarity Score Function
▶ Donald Trump: where document di is the entire Twitter dataset and m is an individual tweet in di
▶ Hillary Clinton:
Results
System Platform
▶ Components of our system:
1. Oracle VM VirtualBox version 5.1.20
2. CDH VM version 5.8
3. Scala IDE version 4.4.1
4. Apache Spark version 2.0.1
Shifting Sentiment Over the 2016 Presidential Election Cycle
[Figure panels: Hillary, Trump]
Sentiment Model Refinements
Deterministic Topic Sentiment Model
▶ Given the presumption that topic and sentiment can be jointly inferred:
▶ Counted Instances of positive and negative sentiment in the context of user
provided topic word(s)
▶ Likelihood estimated as relative frequencies
▶ Tweets categorized by subjectivity and polarity (OpinionFinder Lexicon)
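A minimal sketch of the counting idea above: among tweets that contain a user-provided topic word, count positive and negative tweets and report relative frequencies. The mini-lexicon is a hypothetical stand-in for the OpinionFinder lexicon, subjectivity strength is ignored, and the majority-vote labelling rule is an assumption, so this is not the author's exact scoring function.

```python
# Hypothetical mini-lexicon standing in for the OpinionFinder lexicon.
LEXICON = {
    "great": "positive", "love": "positive", "win": "positive",
    "bad": "negative", "corrupt": "negative", "lies": "negative",
}


def topic_sentiment(tweets, topic_words):
    """Relative frequency of each polarity among tweets mentioning a topic."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for tweet in tweets:
        tokens = tweet.lower().split()
        if not any(t in topic_words for t in tokens):
            continue  # only score tweets that mention a topic word
        pos = sum(LEXICON.get(t) == "positive" for t in tokens)
        neg = sum(LEXICON.get(t) == "negative" for t in tokens)
        if pos > neg:
            counts["positive"] += 1
        elif neg > pos:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


tweets = [
    "love the new tax plan",
    "tax plan is bad",
    "great weather today",
    "the tax debate",
    "immigration policy is a win",
]
topic_scores = topic_sentiment(tweets, {"tax", "immigration"})
```

The third tweet is skipped because it mentions no topic word, so the frequencies are computed over the four remaining tweets.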
Deterministic Approach
[Flowchart: if a Twitter message contains a topic (topic words list), the subjectivity lexicon is used to determine subjectivity (strong or weak) and polarity (positive, negative, or neutral), which feed the topic sentiment scoring function]
Probabilistic Model
▶ Per-word log-based scoring function
▶ Beyond frequency based measure
▶ Used a modified log-likelihood
▶ Models probability of positive and negative tweets per user provided topic
word
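The exact modified log-likelihood is not reproduced in this transcript. As a rough illustration of a per-word log-based score, the sketch below takes the log of the ratio of positive to negative tweet counts for each topic word. The add-one smoothing, the function name, and the counts are all assumptions made for the example.

```python
import math


def log_sentiment_ratio(pos_count, neg_count, smoothing=1.0):
    """Log of the smoothed positive-to-negative count ratio for one topic word."""
    return math.log((pos_count + smoothing) / (neg_count + smoothing))


# Hypothetical (positive, negative) tweet counts per topic word.
topic_counts = {"tax": (30, 10), "immigration": (5, 20), "jobs": (8, 8)}
log_scores = {
    word: log_sentiment_ratio(pos, neg)
    for word, (pos, neg) in topic_counts.items()
}
```

A score above zero means positive tweets dominate for that topic word, a score below zero means negative tweets dominate, and equal counts give exactly zero. The smoothing term keeps the score defined when one of the counts is zero.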
Probabilistic Model
Results
Deterministic Model, Positive Polarity Measure of Sentiment, Donald Trump vs. Hillary Clinton
Candidate   Deterministic Model   Base Model
Trump       0.60                  0.26
Hillary     0.06                  0.016
Contribution
• Development of a real-time streaming framework for multiphase sentiment
analysis: Social Media Data Stream Sentiment Analysis Service (SMDSSAS).
• Development of three sentiment models
1. Polarity Score Function
2. Deterministic Topic Model – Counts instances of positive and negative sentiment in
the context of user-provided topic word(s).
3. Probabilistic Model – Identifies instances of positive and negative sentiment by the
log of the ratio of sentiment counts per topic-correlated tweet.
Conclusions
▶ Successful event prediction on the pre-election data stream
▶ Candidate Donald Trump predicted as the winner given a positive polarity of 0.60 vs. 0.06 for Clinton
▶ Improvements in accuracy compared to the existing literature
▶ Topic Sentiment Analysis accuracy: 70-79%
▶ Real-time sentiment analysis on the previous presidential election: 59%
▶ Deterministic Model accuracy: 81%
▶ Probabilistic Model accuracy: 74%
▶ Our sentiment model design is the first in the existing literature to combine
▶ Topic Sentiment Analysis models and
▶ Real-time streaming sentiment analysis
▶ System performed scalable sentiment analysis
References
▶ Apache Flume. 2017. Accessed at: https://flume.apache.org/
▶ Dai, Andrew, Olah, Christopher, and Le, Quoc. "Document Embedding with Paragraph Vectors". (Google) arXiv:1507.07998v1, 2015.
▶ Apache Spark 2.1.0. "Evaluation Metrics - RDD-based API". 2017. Accessed at: https://spark.apache.org/docs/latest/ml-features.html
▶ Apache Spark 2.1.0. "Evaluation Metrics - RDD-based API". 2017. Accessed at: https://spark.apache.org/docs/2.1.0/mllib-linear-methods.html#linear-support-vector-machines-svms
▶ Apache Spark 2.1.0. "Evaluation Metrics - RDD-based API". 2017. Accessed at: https://spark.apache.org/docs/2.1.0/ml-clustering.html
▶ Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., and Zaharia, M. "Spark SQL: Relational data processing in Spark". In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '15), 2015.
▶ Bifet, A., Maniu, S., Qian, J., Tian, G., He, C., and Fan, W. "Streamdm: Advanced data mining in spark streaming". In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pp. 1608–1611. 2015.
▶ Blanas, S., Patel, J., Ercegovac, V., and Rao, J. "A comparison of join algorithms for log processing in MapReduce". In Proceedings of the 2010 ACM SIGMOD, pages 975–986, 2010.
▶ Blei, D., Ng, A., and Jordan, M. "Latent Dirichlet allocation". Journal of Machine Learning Research, 3:993–1022, 2003.
▶ Borthakur, D. "Petabyte Scale Data at Facebook". Accessed 04/24/2017. http://www.infoq.com/presentations/Data-Facebook, 2012.
▶ Cloudera. Accessed at: https://www.cloudera.com/products/enterprise-data-hub.html?src=GoogleAdWords&gclid=Cj0KEQjwioHIBRCes6nP56Ti1IsBEiQAxxb5G-by2R6GGduAVi-dVs087kNR89c-4AyUnj-cNf9OMrEaAvAX8P8HAQ
▶ Dean, J., and Ghemawat, S. "MapReduce: simplified data processing on large clusters". In OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (Berkeley, CA, USA, 2004), USENIX Association, pp. 10–10.
▶ Deng, L., Gao, J., and Vuppalapati, C. "Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing". In Proc. IEEE 1st Int. Conf. Big Data Comput. Service Appl. (BigDataService), pp. 256-266. 2015.
▶ "CDH Overview". Cloudera. Cloudera Inc, 2017. Web. 14 Apr. 2017.
▶ Ewe, Lars. "What's the Best Way to Manage Big Data for Healthcare: Batch vs. Stream Processing?". Ask Eva Blog. Evariant Inc, 10 Dec. 2015. Web. 11 Apr. 2017.
▶ "Introduction to Big Data With Apache Spark". KDnuggets Home. KDnuggets, 2017. Web. 11 Apr. 2017.
▶ Cheng, K.M., Otto, Lau, Raymond. "Big Data Stream Analytics for Near Real-Time Sentiment Analysis". Journal of Computer and Communications, 3:189-195. 2015.
▶ Gundecha, P., Ranganath, S., Feng, Z., and Liu, H. "A Tool for Collecting Provenance Data in Social Media". In Proceedings of the 19th ACM SIGKDD Demonstration, 30, 61. 2013.
▶ Hsu, C.-W., Chang, C.-C., and Lin, C.-J. "A Practical Guide to Support Vector Classification". Tech. Rep. Taipei. 2003.
▶ Hu, D. "Latent dirichlet allocation for text, images, and music". University of California, San Diego. 2009.
▶ Hu, M., and Liu, B. "Mining and summarizing customer reviews". KDD'04, 2004.
▶ Jindal, N., and Liu, B. "Mining comparative sentences and relations". Proceedings of AAAI, 2006.
▶ Katal, A., Wazid, M., and Goudar, R.H. "Big data: issues, challenges, tools and good practices". In: Sixth International Conference on Contemporary Computing (IC3), pp. 404–409. doi:10.1109/IC3. 2013.
▶ Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., Patel, J.M., Ramasamy, K., and Taneja, S. "Twitter Heron: Stream processing at scale". In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 239–250. ACM, 2015.
▶ Lekha R. Nair and Dr. Sujala D. Shetty. "Streaming Twitter Data Analysis Using Spark For Effective Job Search". Journal of Theoretical and Applied Information Technology, Vol. 80, No. 2. 2005 – 2015.
▶ Liu, B. "Sentiment Analysis and Subjectivity". Invited Chapter for the Handbook of Natural Language Processing, Second Edition. 2010.
▶ MapR. 2014. Accessed at: https://mapr.com/
▶ Mishne, Gilad and Maarten de Rijke. "A study of blog search". In Proceedings of ECIR. 2006.
▶ Liu, B., Hu, M., and Cheng, J. "Opinion Observer: Analyzing and comparing opinions on the web". Proceedings of International Conference on World Wide Web (WWW). 2005.
▶ Mukherjee, Arjun and Bing Liu. "Aspect Extraction through Semi-Supervised Modeling". In Proceedings of 50th Annual Meeting of Association for Computational Linguistics (ACL-2012). 2012.
▶ O'Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series". In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010). 2010.
▶ "Overview of Cloudera and the Cloudera Documentation Set". Cloudera. Cloudera Inc, 2017. Web. 14 Apr. 2017.
▶ Pang, B., and Lee, L. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts". In Proceedings of the Association for Computational Linguistics (pp. 271–278). 2004.
▶ Pang, Bo and Lillian Lee. "Opinion Mining and Sentiment Analysis". Foundations and Trends in Information Retrieval series. Now publishers. 2008.
▶ Rennie, J., Shih, L., Teevan, J., and Karger, D. "Tackling the Poor Assumptions of Naive Bayes Text Classifiers". Proc. of ICML. 2003.
▶ Sagiroglu, S. and Sinanc, D. "Big data: A review". In Collaboration Technologies and Systems (CTS), 2013 International Conference on, pp. 42–47. 2013.
▶ Silva, Y.N. "NoSQL". Arizona State University. Accessed at: http://cis.csuohio.edu/~sschung/cis612/LectureNotes_NoSQL_1.pdf
▶ Srivastava, D., and Bhambhu, L. "Data Classification Using Support Vector Machine". Journal of Theoretical and Applied Information Technology, from www.jait.org. 2005-2009.
▶ Stanford. "Scoring, term weighting and the vector space model". Cambridge University Press, p. 109-133. 2009. Accessed at: https://nlp.stanford.edu/IR-book/pdf/06vect.pdf
▶ Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. "Lexicon-based methods for sentiment analysis". Computational Linguistics, 37(2): p. 267-307. 2011.
▶ Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. "Hive - a warehousing solution over a Map-Reduce framework". In VLDB (2009).
▶ Thusoo, A., Borthakur, D., Murthy, R., Shao, Z., Jain, N., Liu, H., Anthony, S., and Sarma, J. S. "Data warehousing and analytics infrastructure at Facebook". In SIGMOD, 2010.
▶ Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. "Hive - A petabyte scale data warehouse using Hadoop". In Proceedings of the International Conference on Data Engineering. 996–1005. 2010.
▶ Tilve, A., and Jain, S. "A Survey on Machine Learning Techniques For Text Classification". International Journal of Engineering Sciences and Research. 6(2), p. 513-520. 2017.
▶ Yi, J., Nasukawa, T., Bunescu, R., and Niblack, W. "Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques". In Proceedings of IEEE International Conference on Data Mining (ICDM). 2003.
▶ Wang, H., Can, D., Kazemzadeh, A., Bar, F., and Narayanan, S. "A System for Real-Time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle". In ACL System Demonstrations, pages 115–120. 2012.
▶ Wang, S., Zhiyuan, C., and Liu, B. "Mining Aspect-Specific Opinion using a Holistic Lifelong Topic Model". In: Proceedings of the 25th International Conference on World Wide Web. ACM, pp. 167-176. 2016.
▶ Zhai, Zhongwu, Bing Liu, Hua Xu, and Peifa Jia. "Clustering Product Features for Opinion Mining". In Proceedings of ACM International Conference on Web Search and Data Mining (WSDM-2011). 2011.