Post on 28-May-2020

Integrated real-time social media sentiment analysis service using a big data analytic ecosystem

By Danielle C. Aring

Under the Direction of

Dr. Sun Sunnie Chung

Department of Electrical Engineering and Computer Science

Cleveland State University

Big Data Analytics Methodology: Sentiment Analysis

▶ Big Data analytics is defined as the set of processes for examining massive data sets to discover:

  ▶ Hidden information

  ▶ Hidden correlations

▶ One such methodology is sentiment analysis (opinion mining) of user-generated text and messages, which is very difficult

▶ Such messages:

  ▶ Contain human expressions in natural languages

  ▶ Are unstructured

Sentiment Analysis (Opinion Mining): Data Sources

▶ Opinion Mining applied per Phrase, Sentence, Paragraph in:

▶ Movie Reviews: IMDB Movie Review Site

▶ Product Reviews: Amazon Product Review Sites

▶ Social media dialog: public posts from social network sites:

  ▶ Twitter

  ▶ Facebook

Overview of Sentiment Analysis: Approaches

▶ Uses natural language processing techniques and text analysis techniques to extract and quantify subjective information in a text span

▶ Two main types of opinions:

  ▶ Regular: sentiment expressed about a specific target entity

    Ex: "The touch screen is really cool."

  ▶ Comparative: sentiment expressed about more than one entity

    Ex: "iPhone is better than Blackberry."

Sentiment Analysis Approaches: Defining the Structure of an Opinion

● An opinion is a quadruple

● Opinion mining is difficult!

● How to estimate sentiment: a correct model must be chosen

Natural Language Processing (NLP) Techniques

Adopt Natural language Processing (NLP) Techniques

▶ Stemming

▶ Lemmatisation

▶ POS Tagging

▶ N-gram analysis

▶ Stop words removal

▶ Chunking
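
Two of the techniques listed above, stop-word removal and n-gram analysis, can be sketched in a few lines. This is a minimal illustration only: the stop-word list and the helper names `remove_stop_words` and `ngrams` are assumptions made for the example and are not taken from the presented system.

```python
# Toy stop-word list (an assumption for illustration; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the toy stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The touch screen is really cool".split()
content_words = remove_stop_words(tokens)
bigrams = ngrams(content_words, 2)
print(content_words)
print(bigrams)
```

Stemming, lemmatisation, POS tagging, and chunking usually rely on a trained model or a linguistic resource, so they are left out of this sketch.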

Problems in Natural Language Processing

○ Ignores the syntax and semantics of words

■ n-grams, Phrases

■ Synonymy, Polysemy, Grammar, Context

○ Loses Word Order in a Sentence

New Approach in NLP: Word2vec - Continuous Skip-Gram Model

○ The word2vec Skip-Gram model is a neural network architecture that learns semantically meaningful vector representations for words.

○ Given a target word w_i, the skip-gram model is trained to predict the surrounding context words w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2} within a phrase window of size n = 5, for example.

○ During training, backpropagation adjusts the weights of the embedding layer. As an indirect result of the prediction task, these weights come to capture word semantics, and after training they become the word vectors.

■ Words with similar meanings are mapped to nearby positions in a high-dimensional vector space (for example, 300 dimensions).
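
The training data for the prediction task described above consists of (target, context) word pairs. The sketch below shows how such pairs could be generated for a window of size n = 5, meaning the target plus two words on each side. The function name `skipgram_pairs` is hypothetical, and the neural network training itself is not shown.

```python
def skipgram_pairs(tokens, window=5):
    """Build (target, context) pairs; `window` counts the target plus its neighbours."""
    half = window // 2  # number of context words on each side of the target
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - half)
        hi = min(len(tokens), i + half + 1)
        for j in range(lo, hi):
            if j != i:  # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "iphone is better than blackberry".split()
pairs = skipgram_pairs(sentence, window=5)
print([context for target, context in pairs if target == "better"])
```

Words near the start or end of the sentence simply get fewer context words, which is one common way to handle the boundaries.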

Word2vec: Continuous Skip-Gram Model

Sentiment Analysis Types

▶ Classifying word polarity (e.g. positive, negative, or neutral) in a document

▶ Beyond Polarity

▶ Advanced Sentiment (Pang 2004)

▶ Lexical Approach (Taboada 2011)

▶ Minimum cut extraction (Pang and Lee 2004)

▶ Topic-based (O'Connor et al. 2010)

▶ Aspect (feature) based opinion mining using a semi-supervised approach (Mukherjee and Liu 2012)

▶ Wordnet probability model (Tomas et al. 2013)

▶ Paragraph Embedding with Vectors (Dai et al. 2015)

Research Goals

▶ Investigate whether a stream-processing big data social media sentiment service with analytics can offer the following compared to batch-mode counterparts:

  ▶ Scalability (enormous volume)

  ▶ Efficient near-real-time data processing

  ▶ Data analytics (sentiment analysis) with accuracy

System Architecture: Social Media Data Stream Sentiment Analysis Service (SMDSSAS)

SMDSSAS System Architecture

Layer 1: Data Extraction

▶ Created a Spark Configuration

▶ Created a Spark context

▶ Created a Spark Streaming Context

▶ Spark DStream to filter for our messages

Layer 2: Data Stream Layer

Apache Spark

▶ Started in 2009 as a UC Berkeley research project and open-sourced in 2010

▶ For distributed big data processing, commonly deployed on top of Hadoop

▶ Why Spark? It overcomes the limitations of MapReduce for multi-stage applications

▶ Uses an in-memory abstraction, Resilient Distributed Datasets (RDDs)

▶ RDDs are partitioned across the cluster and operated on in parallel

▶ RDDs can be persisted in memory

▶ Persisted RDDs are reused in other operations across multiple map-reduce stages

Spark Streaming

▶ Internally, Spark Streaming:

  ▶ Receives live input data streams

  ▶ Divides the streams into batches

  ▶ Processes the batches with the Spark engine to generate a final stream of results, also in batches

▶ Each batch is a resilient distributed dataset (RDD) and is processed using RDD operations (map, reduce)

Spark DStreams

▶ Processed results are pushed out in batches as discretized streams (DStreams)

▶ DStream: an abstraction of a continuous stream of data, represented as a sequence of RDDs

▶ Map-reduce is performed on each batch

▶ An operation applied to a DStream is carried out on each of its underlying RDDs
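
The micro-batch idea behind DStreams can be illustrated without Spark. The sketch below is plain Python and does not use the Spark API: it splits a stream into batches and runs the same map-reduce style word count on every batch. The names `discretize` and `word_counts` are hypothetical, and batching by item count is a simplifying assumption, since Spark Streaming batches by time interval.

```python
from collections import Counter
from itertools import islice

def discretize(stream, batch_size):
    """Split an iterable stream into consecutive lists of at most batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def word_counts(batch):
    """Map each message to its words, then reduce to per-word counts for this batch."""
    counts = Counter()
    for message in batch:
        counts.update(message.lower().split())
    return counts

messages = ["Vote today", "vote early", "Polls open", "polls close today"]
# The same operation is applied to every batch, as a DStream operation is applied to every RDD.
results = [word_counts(batch) for batch in discretize(messages, batch_size=2)]
for batch_result in results:
    print(dict(batch_result))
```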

Layer 3: Data pre-processing and Transformation

Applying the NLP techniques:

▶ Phase 1: During Spark Streaming (Data Storage Layer)

  ▶ Preprocess messages in the Tweet stream to remove characters sensitive to the Hive scanner

    ▶ Ex: "\t" , "\", "\n", and "[\\p{C}" (control characters)

▶ Phase 2: During Real-Time Streaming for Sentiment Analysis

  ▶ Function to pre-process messages in the Tweet stream to remove:

    ▶ Twitter ‘@’ , "#", image and website URLs

    ▶ The following punctuation: [. ,! “ ‘]

    ▶ Numbers 0-9

    ▶ The following non-alphanumeric characters: $%&^*() + ~
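
The Phase 2 clean-up described above could look roughly like the following sketch. It is an illustration, not the system's actual function: the name `clean_tweet` and both regular expressions are assumptions. In particular, it strips only the "@" and "#" symbols while keeping the word that follows, and it also removes closing curly quotes, which the list above does not mention.

```python
import re

# URLs are matched first so their punctuation does not survive as fragments.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# '@', '#', the listed punctuation (plus closing curly quotes, an assumption),
# digits 0-9, and the listed non-alphanumeric characters.
REMOVE_CHARS_RE = re.compile("[@#.,!\"'“”‘’0-9$%&^*()+~]")

def clean_tweet(text):
    """Remove URLs and the listed characters, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = REMOVE_CHARS_RE.sub("", text)
    return " ".join(text.split())

raw = "@user Check this out!! https://t.co/abc123 #Election2016 100% cool"
cleaned = clean_tweet(raw)
print(cleaned)
```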

Layer 4: Feature Extraction

Layer 5: Prediction

Prediction Layer: Sentiment Analysis Methodology

Motivation of Experiment 1

▶ Develop a base sentiment model that is successful in event prediction

  ▶ Using pre-election data from the 2016 presidential election, collected on Nov. 5th, 2016

▶ Quantify the level of positive and negative sentiment

▶ Apply refinements to correlate user sentiment with topic

▶ Why?

  ▶ So that the system can perform accurate event predictions.

Input Data Source

▶ Collected 3 datasets over 3 months from the Spark Streaming API, using a filtered DStream for Hillary Clinton, Donald Trump, and their political policies.

▶ Preprocessed and stored in the NoSQL Hive system:

  ▶ Pre-election: October 23rd, 2016

  ▶ Pre-election: November 5th, 2016

  ▶ Post-election, pre-inauguration: January 1st, 2017

Naive Sentiment Model

Our Approaches: Quantifying Polarity of Sentiment in User-Generated Tweets

Quantifying Polarity of Sentiment: Polarity Score Function

▶ Donald Trump: where document d_i is the entire Twitter dataset and m is an individual tweet in d_i

▶ Hillary Clinton:

Results

System Platform

▶ Components of our system:

1. Oracle VM VirtualBox version 5.1.20

2. CDH VM version 5.8

3. Scala IDE version 4.4.1

4. Apache Spark version 2.0.1

Shifting Sentiment Over the 2016 Presidential Election Cycle (figure panels: Hillary, Trump)

Sentiment Model Refinements

Deterministic Topic Sentiment Model

▶ Given the presumption that topic and sentiment can be jointly inferred:

▶ Counted instances of positive and negative sentiment in the context of user-provided topic word(s)

▶ Likelihood estimated as relative frequencies

▶ Tweets categorized by subjectivity and polarity (OpinionFinder Lexicon)
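
The counting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the four-word `LEXICON` is a toy stand-in for the OpinionFinder subjectivity lexicon, the function name `topic_sentiment` is hypothetical, and subjectivity strength (strong or weak) is ignored for brevity.

```python
# Toy polarity lexicon (an assumption; the slides use the OpinionFinder lexicon).
LEXICON = {"great": "positive", "win": "positive",
           "bad": "negative", "corrupt": "negative"}

def topic_sentiment(tweets, topic_words):
    """Relative frequency of positive/negative words in tweets that mention a topic word."""
    topics = {w.lower() for w in topic_words}
    pos = neg = 0
    for tweet in tweets:
        tokens = tweet.lower().split()
        if not topics.intersection(tokens):
            continue  # only tweets containing a topic word are counted
        for tok in tokens:
            polarity = LEXICON.get(tok)
            if polarity == "positive":
                pos += 1
            elif polarity == "negative":
                neg += 1
    total = pos + neg
    if total == 0:
        return {"positive": 0.0, "negative": 0.0}
    return {"positive": pos / total, "negative": neg / total}

tweets = ["economy will be great", "economy is bad",
          "great win on economy", "weather is bad"]
scores = topic_sentiment(tweets, ["economy"])
print(scores)
```

The last tweet is skipped because it does not mention the topic word, so its negative word does not affect the estimate.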

Deterministic Approach

[Flow diagram. Labels: Topic Words List; If Twitter Message Contains Topic; Subjectivity Lexicon; Strong or Weak Subjectivity?; Positive, Negative, or Neutral Polarity?; Topic Sentiment Scoring Function]

Probabilistic Model

▶ Per-word log-based scoring function

▶ Beyond frequency based measure

▶ Used a modified log-likelihood

▶ Models probability of positive and negative tweets per user-provided topic word
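
A log-based score of this kind can be sketched as the log of the ratio between positive and negative counts for a topic. The exact modified log-likelihood used in the system is not given in this transcript, so the formula below, including the add-one smoothing and the name `log_ratio_score`, is an assumption made purely for illustration.

```python
import math

def log_ratio_score(pos_count, neg_count, smoothing=1.0):
    """Log of the smoothed positive-to-negative count ratio for one topic word.

    A value above 0 leans positive, below 0 leans negative. The smoothing
    term is an assumption; it avoids division by zero and log(0).
    """
    return math.log((pos_count + smoothing) / (neg_count + smoothing))

# Hypothetical counts of positive and negative tweets for two topic words.
topic_counts = {"economy": (3, 1), "taxes": (1, 3)}
topic_scores = {topic: log_ratio_score(p, n) for topic, (p, n) in topic_counts.items()}
print(topic_scores)
```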

Probabilistic Model

Results

Deterministic Model, Positive Polarity Measure of Sentiment, Donald Trump vs. Hillary Clinton

Candidate   Deterministic Model   Base Model
Trump       0.60                  0.26
Hillary     0.06                  0.016

Contribution

• Development of a real-time streaming framework for multiphase sentiment analysis: Social Media Data Stream Sentiment Analysis Service (SMDSSAS).

• Development of three sentiment models:

1. Polarity Score Function

2. Deterministic Topic Model: counts instances of positive and negative sentiment in the context of user-provided topic word(s).

3. Probabilistic Model: identifies instances of positive and negative sentiment by the log of the ratio of sentiment counts per topic-correlated tweet.

Conclusions

▶ Successful event prediction on the pre-election data stream.

  Candidate Donald Trump was predicted as the winner, with a positive polarity of 0.6 versus 0.06 for Clinton

▶ Improvements in accuracy compared to the existing literature

  Existing literature, topic sentiment analysis accuracy: 70-79%

  Existing literature, real-time sentiment analysis on the previous presidential election: 59%

  Our deterministic model accuracy: 81%

  Our probabilistic model accuracy: 74%

▶ Our sentiment model design is the first in the existing literature to combine:

  ▶ Topic sentiment analysis models and

  ▶ Real-time streaming sentiment analysis

▶ System performed Scalable Sentiment Analysis

References

▶ Apache Flume. 2017. Accessed at: https://flume.apache.org/
▶ Dai, A., Olah, C., and Le, Q. "Document Embedding with Paragraph Vectors". (Google) arXiv:1507.07998v1, 2015.
▶ Apache Spark 2.1.0. "Evaluation Metrics - RDD-based API". 2017. Accessed at: https://spark.apache.org/docs/latest/ml-features.html
▶ Apache Spark 2.1.0. "Evaluation Metrics - RDD-based API". 2017. Accessed at: https://spark.apache.org/docs/2.1.0/mllib-linear-methods.html#linear-support-vector-machines-svms
▶ Apache Spark 2.1.0. "Evaluation Metrics - RDD-based API". 2017. Accessed at: https://spark.apache.org/docs/2.1.0/ml-clustering.html
▶ Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., and Zaharia, M. "Spark SQL: Relational data processing in Spark". In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '15). 2015.
▶ Bifet, A., Maniu, S., Qian, J., Tian, G., He, C., and Fan, W. "StreamDM: Advanced data mining in Spark Streaming". In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pp. 1608–1611. 2015.
▶ Blanas, S., Patel, J., Ercegovac, V., and Rao, J. "A comparison of join algorithms for log processing in MapReduce". In Proceedings of the 2010 ACM SIGMOD, pp. 975–986. 2010.
▶ Blei, D., Ng, A., and Jordan, M. "Latent Dirichlet allocation". Journal of Machine Learning Research, 3:993–1022. 2003.
▶ Borthakur, D. "Petabyte Scale Data at Facebook". 2012. Accessed 04/24/2017 at: http://www.infoq.com/presentations/Data-Facebook
▶ Cloudera. Accessed at: https://www.cloudera.com/products/enterprise-data-hub.html?src=GoogleAdWords&gclid=Cj0KEQjwioHIBRCes6nP56Ti1IsBEiQAxxb5G-by2R6GGduAVi-dVs087kNR89c-4AyUnj-cNf9OMrEaAvAX8P8HAQ
▶ Dean, J., and Ghemawat, S. "MapReduce: Simplified data processing on large clusters". In OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (Berkeley, CA, USA, 2004), USENIX Association, pp. 10–10.
▶ Deng, L., Gao, J., and Vuppalapati, C. "Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing". In Proc. IEEE 1st Int. Conf. Big Data Comput. Service Appl. (BigDataService), pp. 256–266. 2015.
▶ "CDH Overview". Cloudera. Cloudera Inc, 2017. Web. 14 Apr. 2017.
▶ Ewe, Lars. "What's the Best Way to Manage Big Data for Healthcare: Batch vs. Stream Processing?". Ask Eva Blog. Evariant Inc, 10 Dec. 2015. Web. 11 Apr. 2017.
▶ "Introduction to Big Data With Apache Spark". KDnuggets Home. KDnuggets, 2017. Web. 11 Apr. 2017.
▶ Cheng, K.M., Otto, Lau, Raymond. "Big Data Stream Analytics for Near Real-Time Sentiment Analysis". Journal of Computer and Communications, 3:189–195. 2015.

▶ Gundecha, P., Ranganath, S., Feng, Z., and Liu, H. "A Tool for Collecting Provenance Data in Social Media". In Proceedings of the 19th ACM SIGKDD Demonstration, 30, 61. 2013.
▶ Hsu, C.-W., Chang, C.-C., and Lin, C.-J. "A Practical Guide to Support Vector Classification". Tech. Rep. Taipei. 2003.
▶ Hu, D. "Latent Dirichlet allocation for text, images, and music". University of California, San Diego. 2009.
▶ Hu, M., and Liu, B. "Mining and summarizing customer reviews". KDD'04. 2004.
▶ Jindal, N., and Liu, B. "Mining comparative sentences and relations". Proceedings of AAAI. 2006.
▶ Katal, A., Wazid, M., and Goudar, R.H. "Big data: issues, challenges, tools and good practices". In Sixth International Conference on Contemporary Computing (IC3), pp. 404–409. doi:10.1109/IC3. 2013.
▶ Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., Patel, J.M., Ramasamy, K., and Taneja, S. "Twitter Heron: Stream processing at scale". In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 239–250. ACM, 2015.
▶ Nair, Lekha R., and Shetty, Sujala D. "Streaming Twitter Data Analysis Using Spark For Effective Job Search". Journal of Theoretical and Applied Information Technology, Vol. 80, No. 2. 2005–2015.
▶ Liu, B. "Sentiment Analysis and Subjectivity". Invited chapter for the Handbook of Natural Language Processing, Second Edition. 2010.
▶ MapR. 2014. Accessed at: https://mapr.com/
▶ Mishne, Gilad, and Maarten de Rijke. "A study of blog search". In Proceedings of ECIR. 2006.
▶ Liu, B., Hu, M., and Cheng, J. "Opinion Observer: Analyzing and comparing opinions on the web". Proceedings of the International Conference on World Wide Web (WWW). 2005.
▶ Mukherjee, Arjun, and Bing Liu. "Aspect Extraction through Semi-Supervised Modeling". In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012). 2012.
▶ O'Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series". In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010). 2010.
▶ "Overview of Cloudera and the Cloudera Documentation Set". Cloudera. Cloudera Inc, 2017. Web. 14 Apr. 2017.
▶ Pang, B., and Lee, L. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts". In Proceedings of the Association for Computational Linguistics, pp. 271–278. 2004.
▶ Pang, Bo, and Lillian Lee. "Opinion Mining and Sentiment Analysis". Foundations and Trends in Information Retrieval series. Now Publishers. 2008.
▶ Rennie, J., Shih, L., Teevan, J., and Karger, D. "Tackling the Poor Assumptions of Naive Bayes Text Classifiers". Proc. of ICML. 2003.

▶ Sagiroglu, S., and Sinanc, D. "Big data: A review". In Collaboration Technologies and Systems (CTS), 2013 International Conference on, pp. 42–47. 2013.
▶ Silva, Y.N. "NoSQL". Arizona State University. Accessed at: http://cis.csuohio.edu/~sschung/cis612/LectureNotes_NoSQL_1.pdf
▶ Srivastava, D., and Bhambhu, L. "Data Classification Using Support Vector Machine". Journal of Theoretical and Applied Information Technology, from www.jait.org. 2005–2009.
▶ Stanford. "Scoring, term weighting and the vector space model". Cambridge University Press, pp. 109–133. 2009. Accessed at: https://nlp.stanford.edu/IR-book/pdf/06vect.pdf
▶ Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. "Lexicon-based methods for sentiment analysis". Computational Linguistics, 37(2), pp. 267–307. 2011.
▶ Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. "Hive - a warehousing solution over a Map-Reduce framework". In VLDB. 2009.
▶ Thusoo, A., Borthakur, D., Murthy, R., Shao, Z., Jain, N., Liu, H., Anthony, S., and Sarma, J. S. "Data warehousing and analytics infrastructure at Facebook". In SIGMOD. 2010.
▶ Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. "Hive - A petabyte scale data warehouse using Hadoop". In Proceedings of the International Conference on Data Engineering, pp. 996–1005. 2010.
▶ Tilve, A., and Jain, S. "A Survey on Machine Learning Techniques For Text Classification". International Journal of Engineering Sciences and Research, 6(2), pp. 513–520. 2017.
▶ Yi, J., Nasukawa, T., Bunescu, R., and Niblack, W. "Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques". In Proceedings of the IEEE International Conference on Data Mining (ICDM). 2003.
▶ Wang, H., Can, D., Kazemzadeh, A., Bar, F., and Narayanan, S. "A System for Real-Time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle". In ACL System Demonstrations, pp. 115–120. 2012.
▶ Wang, S., Zhiyuan, C., and Liu, B. "Mining Aspect-Specific Opinion using a Holistic Lifelong Topic Model". In Proceedings of the 25th International Conference on World Wide Web. ACM, pp. 167–176. 2016.
▶ Zhai, Zhongwu, Bing Liu, Hua Xu, and Peifa Jia. "Clustering Product Features for Opinion Mining". In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM-2011). 2011.
