Sales Prediction Technique Using R Programming
Analyzing Twitter Data
Abstract:
The Chevrolet Camaro is an American automobile manufactured by Chevrolet. The sixth-generation car was launched on February 27, 2016 in the United States and has yet to be launched in India. According to a recent report by SpeedLux, the 2016 Chevrolet Camaro is listed as the 4th most searched vehicle on Google this past year. Compared with last year's line of Camaros from GM, this year's model is shorter, narrower, lower and lighter. Car Scoops believes consumers will like these sportier physical changes.
It has been quite some time since this car hit the road with its enhanced specifications and new design. The car has been the talk of the town since its release and has received a lot of feedback from users via different social media sources. As I mentioned in my previous deliverable, I chose to work with Twitter reviews and feedback for detecting the sentiment of the posts, as this medium is widely used to share opinions. R has been used for the sentiment analysis.
Introduction:
The core idea of this paper is to detect and understand how the audience responded to this sixth-generation car and to perform sentiment analysis on the data captured from tweets. As previously mentioned, sentiment analysis is the process of determining the emotional tone behind a series of words, used to gain an understanding of the opinions and emotions expressed within an online platform. Sentiment analysis is used for social media monitoring, tracking of product reviews, analyzing survey responses, and in business analytics. The ability to extract insights from social data is a practice that is being widely adopted by organizations across the world, and machine learning makes sentiment analysis more convenient. I chose R for the sentiment analysis because it has the sentimentr and RTextTools packages, as well as the more general text mining (tm) package, which come in handy for detailed analysis. Text analysis in R is well established, and the tm package plays a central role here: it is a framework for text mining applications within R that handles text cleaning (stemming, stop-word removal, etc.) and transforms texts into a document-term matrix (DTM).
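As a small, self-contained illustration of that tm workflow (the two sentences are made up for the example; the tm package must be installed):

```r
library(tm)

# Two toy documents standing in for tweets
docs <- c("The Camaro engine is great", "Great engine, great drive")
corp <- Corpus(VectorSource(docs))

# Basic cleaning, then build the document-term matrix
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
dtm <- DocumentTermMatrix(corp)

inspect(dtm)  # rows = documents, columns = terms, cells = term counts
```

Each row of the resulting matrix is one document and each column one term, which is the structured form the later analysis steps operate on.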
Data Analysis:
Before describing the steps involved in the analysis, below are the important packages which are essential in
the data analysis process:
twitteR : Provides an interface to the Twitter web API
ROAuth: Provides an interface to the OAuth 1.0 specification allowing users to authenticate via OAuth to the
server of their choice.
plyr : Provides tools for Splitting, Applying and Combining Data. A set of tools that solves a common set of
problems: you need to break a big problem down into manageable pieces, operate on each piece and then put
all the pieces back together.
stringr: Simple, consistent wrappers for common string operations. A consistent, simple and easy-to-use set of
wrappers around the underlying 'stringi' package.
ggplot2 : Create Elegant Data Visualizations Using the Grammar of Graphics. A system for 'declaratively'
creating graphics, based on ``The Grammar of Graphics''. You provide the data, tell 'ggplot2' how to map
variables to aesthetics, what graphical primitives to use, and it takes care of the details.
httr: Tools for working with URLs and HTTP, organized by HTTP verbs (GET(), POST(), etc.). Configuration
functions make it easy to control additional request components (authenticate(), add_headers() and so on).
wordcloud: Plot a cloud of words, optionally comparing the frequencies of words across documents.
sentimentr: Calculate text polarity sentiment at the sentence level and optionally aggregate by rows or
grouping variable(s).
SnowballC: An R interface to the C libstemmer library that implements Porter's word stemming algorithm for
collapsing words to a common root to aid comparison of vocabulary.
tm: The tm package offers functionality for managing text documents, abstracts the process of document
manipulation, and eases the handling of heterogeneous text formats in R. The package has integrated database
back-end support to minimize memory demands. Advanced metadata management is implemented for
collections of text documents to ease the use of large, metadata-enriched document sets. tm provides easy
access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, and stopword
deletion. Further, a generic filter architecture is available to filter documents by certain criteria or perform
full-text search. The package supports exporting document collections to term-document matrices.
tmap: Thematic maps are geographical maps in which spatial data distributions are visualized. This package
offers a flexible, layer-based, easy-to-use approach to creating thematic maps such as choropleths and
bubble maps.
RColorBrewer: Provides color schemes for maps (and other graphics)
1. Building word cloud using R
A word cloud is a text mining method that highlights the most frequently used keywords in a body of text.
It is also referred to as a text cloud or tag cloud. Building a word cloud is a powerful method for text mining
that adds simplicity and clarity: word clouds are easy to understand, easy to share, impactful, and more
visually engaging than tabular data. The size of each word in the picture indicates the frequency of its
occurrence in the entire text. For word cloud formation, we follow the steps below:
Step 1: Setting up the working Directory in RStudio
setwd("C:/Users/nagar/Desktop/Extras/R")
Step 2: Installing and loading the necessary packages.
Below is the list of packages required for building the word cloud. The functionality of each of these packages has
been explained under Necessary packages.
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(sentiment)
Step 3: Create a corpus from the collection of text files.
mydata <- read.csv("C:/Users/nagar/Desktop/Extras/R/camaro.csv")
mycorpus <- Corpus(VectorSource(mydata$text))
After reading the data from the CSV file into the variable mydata using read.csv, a corpus is created. The text
is loaded using the Corpus() function from the text mining (tm) package. A corpus is a collection of documents;
here, each tweet in the text column becomes one document.
Step 4: Create structured data from the text file
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removeWords, stopwords(kind = "en"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, stemDocument)
pal <- brewer.pal(8, "Dark2")
The tm_map() function is used to remove unnecessary white space, convert the text to lower case, and remove
common stopwords like "the" and "we". The information value of stopwords is near zero because they are so
common in a language, so removing them is useful before further analysis. For stopwords, the supported
languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese,
russian, spanish and swedish. Language names are case sensitive.
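The English stopword list can be inspected directly (assuming the tm package is loaded):

```r
library(tm)

# First few entries of tm's built-in English stopword list
head(stopwords(kind = "en"))
```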
Here, I have also removed numbers and punctuation with the removeNumbers and removePunctuation
transformations. Another important preprocessing step is stemming, which reduces words to their root form. In
other words, this process removes suffixes from words to simplify them and recover a common origin. For
example, a stemming process reduces the words "moving" and "moved" to the root word "move".
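This can be checked directly with the SnowballC stemmer that stemDocument() uses under the hood:

```r
library(SnowballC)

# Porter-style stemming collapses inflected forms to a common root
wordStem(c("moving", "moved", "moves"), language = "english")
# [1] "move" "move" "move"
```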
Step 5: Making the word cloud using the structured form of the data.
wordcloud(mycorpus, min.freq = 3, max.words = Inf,
random.order = FALSE, colors = pal)
Arguments of the word cloud generator function :
words : the words to be plotted
freq : their frequencies
min.freq : words with frequency below min.freq will not be plotted
max.words : maximum number of words to be plotted
random.order : plot words in random order; if FALSE, they are plotted in decreasing frequency
rot.per : proportion of words with 90-degree rotation (vertical text)
colors : color words from least to most frequent; use, for example, colors = "black" for a single color
Word Cloud on Camaro tweets:
2. Classify Emotion and publish graph:
Classification of emotion in R is achieved by using the classify_emotion function from the sentiment
package. This function analyzes text and classifies it into different types of emotion: anger, disgust, fear, joy,
sadness, and surprise. For this, I am using a naive Bayes classifier trained on Carlo Strapparava and Alessandro
Valitutti's emotions lexicon.
Below is the detailed description of the steps involved in Emotion Classification:
Step 1: Setting up the working Directory in RStudio
setwd("C:/Users/nagar/Desktop/Extras/R")
Step 2: Installing and loading the necessary packages.
Below is the list of packages required for Emotion classification. The functionality of each of these packages has
been explained under Necessary packages.
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)
Step 3: Prepare the text for sentiment analysis.
Before starting the sentiment analysis, we need to keep our data file ready in the working directory. As an
alternative, using R, we can also connect directly to Twitter and use OAuth to authorize our credentials before
proceeding with the analysis; the prerequisite for this approach is a Twitter account, a developer account, and
all the authentication keys needed to establish the connection. Here, I have downloaded the CSV data file and
saved it in the working directory mentioned in the step above. The data file is then loaded into a variable for
further processing.
data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")
Step 4: Perform Sentiment Analysis.
class_emo = classify_emotion(data, algorithm = "bayes", prior = 1.0)
Among the three parameters specified in the above function, the first is the data being classified, and the second
is the algorithm to use, which in our case is Bayes: a string indicating whether to use the naive Bayes algorithm
or a simple voter algorithm. The third parameter is a numeric specifying the prior probability to use for the
naive Bayes classifier.
emotion = class_emo[,7]
classify_emotion returns an object of class data.frame with seven columns (one score for each of anger, disgust,
fear, joy, sadness and surprise, plus a BEST_FIT column) and one row for each document. The syntax above
selects column 7, the best-fitting emotion.
emotion[is.na(emotion)] = "unknown"
Lastly, NA values are substituted with "unknown" in this step.
Step 5: Create and Sort the data frame.
sent_df = data.frame(text=data, emotion=emotion, stringsAsFactors =
FALSE)
The function data.frame() creates data frames: tightly coupled collections of variables which share many of the
properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The
first parameter is our data and the second is the emotion vector computed in the steps above. stringsAsFactors =
FALSE means that character columns are kept as character vectors rather than being converted to factors. The
data frame is stored in the sent_df variable, which is then sorted by the command below.
sent_df = within(sent_df, emotion <- factor(emotion, levels =
names(sort(table(emotion), decreasing = TRUE))))
Here, we use the factor function to create a factor. The only required argument to factor is a vector of values,
which will be returned as a vector of factor values. To change the order in which the levels are displayed from
their default sorted order, the levels= argument can be given a vector of all the possible values of the variable
in the order you desire. The sorted data is stored back into the variable sent_df.
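The reordering trick can be seen on a small made-up vector:

```r
# Reorder factor levels by decreasing frequency, as done for the
# emotion column above
x <- c("joy", "anger", "joy", "fear", "joy", "anger")
f <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
levels(f)
# [1] "joy"   "anger" "fear"
```

Because ggplot draws bars in level order, this makes the most frequent category appear first in the plot.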
Step 6: Plotting the Statistics.
ggplot(sent_df, aes(x=emotion)) + geom_bar(aes(y=..count..,
fill=emotion)) + scale_fill_brewer(palette = "Dark2") + labs(x="emotion
categories", y="number of tweets", title="classification based on emotion")
One of the most important functions in R is ggplot(), which enables us to create a very wide range of useful
plots. The aes() function takes a list of name-value pairs mapping variables to aesthetics; in the current case the
variable is emotion. The bar geom is used to produce 1d area plots: bar charts for categorical x and histograms
for continuous x, as shown below. The color selection can be changed with one of the scale functions, such as
scale_fill_brewer(). The labs() call sets the axis labels and the plot title.
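The same geom_bar pattern in a minimal, self-contained form (made-up data; the ggplot2 and RColorBrewer packages must be installed):

```r
library(ggplot2)

# Toy emotion labels standing in for the classified tweets
df <- data.frame(emotion = c("joy", "joy", "anger", "surprise", "joy"))

p <- ggplot(df, aes(x = emotion)) +
  geom_bar(aes(fill = emotion)) +         # bar height = count per category
  scale_fill_brewer(palette = "Dark2") +  # same palette as the report
  labs(x = "emotion categories", y = "number of tweets")
p
```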
Below is the Emotion classification plot for the Camaro tweets:
3. Classify Polarity and publish graph:
A fundamental task in sentiment analysis is polarity detection: the classification of the polarity of a given
text, whether the opinion expressed is positive, negative or neutral. This approach uses a supervised learning
algorithm to build a classifier that will detect polarity of textual data and classify it as either positive or negative. It
uses an opinionated dataset to train the classifier, data processing techniques to pre-process the textual data and
simple rules for categorizing text as positive or negative.
I am using a naive Bayes classifier to classify the sentences as positive or negative. As the
name suggests, this works by implementing the naive Bayes algorithm: it estimates whether a sentence is
positive or negative by examining how many of its words fall in each category and relating this to the
probabilities of those words appearing in positive and negative sentences.
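That word-count intuition can be sketched in a few lines of base R. This is a toy illustration with a made-up training set, not the Janyce Wiebe subjectivity lexicon that classify_polarity is actually trained on:

```r
# Hand-labelled training words from positive and negative sentences
train_pos <- c("great", "car", "love", "the", "power", "great", "drive")
train_neg <- c("bad", "seat", "hate", "the", "noise", "bad")

# Laplace-smoothed log-likelihood of a sentence's words under one class
score <- function(words, bag) {
  vocab <- unique(c(train_pos, train_neg, words))
  counts <- table(factor(bag, levels = vocab))
  sum(log((counts[words] + 1) / (length(bag) + length(vocab))))
}

sentence <- c("love", "the", "great", "drive")
if (score(sentence, train_pos) > score(sentence, train_neg)) "positive" else "negative"
# [1] "positive"
```

The sentence scores higher under the positive class because its words occur more often in the positive training bag; the real classifier applies the same idea with a much larger lexicon.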
The steps involved in polarity classification are similar to those for emotion classification, except that the
classify_polarity function is used to classify text as positive or negative, and the data frame creation and
sorting steps use polarity in this case. Below are the steps involved in polarity classification.
Step 1: Setting up the working Directory in RStudio
setwd("C:/Users/nagar/Desktop/Extras/R")
Step 2: Installing and loading the necessary packages.
Below is the list of packages required for polarity classification. The functionality of each of these packages has
been explained under Necessary packages.
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)
Step 3: Prepare the text for sentiment analysis.
Before starting the sentiment analysis, we need to keep our data file ready in the working directory. As an
alternative, using R, we can also connect directly to Twitter and use OAuth to authorize our credentials before
proceeding with the analysis; the prerequisite for this approach is a Twitter account, a developer account, and
all the authentication keys needed to establish the connection. Here, I have downloaded the CSV data file and
saved it in the working directory mentioned in the step above. The data file is then loaded into a variable for
further processing.
data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")
Step 4: Perform Sentiment Analysis.
class_pol = classify_polarity(data, algorithm = "bayes")
In contrast to the classification of emotions, the classify_polarity function allows us to classify some text as positive
or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s
subjectivity lexicon; or by a simple voter algorithm.
polarity = class_pol[,4]
Polarity best fit is set in the above syntax: classify_polarity returns a data.frame with four columns (POS, NEG,
the POS/NEG ratio, and BEST_FIT) and one row per document, and column 4 holds the best-fitting polarity.
Step 5: Create and Sort the data frame.
sent_df = data.frame(text=data, polarity=polarity, stringsAsFactors
= FALSE)
The function data.frame() creates data frames: tightly coupled collections of variables which share many of the
properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The
first parameter is our data and the second is the polarity vector computed in the steps above. stringsAsFactors =
FALSE means that character columns are kept as character vectors rather than being converted to factors. The
data frame is stored in the sent_df variable, which is then sorted by the command below.
sent_df1 = within(sent_df, polarity <- factor(polarity, levels =
names(sort(table(polarity), decreasing = TRUE))))
Here, we use the factor function to create a factor. The only required argument to factor is a vector of values,
which will be returned as a vector of factor values. To change the order in which the levels are displayed from
their default sorted order, the levels= argument can be given a vector of all the possible values of the variable
in the order you desire. The sorted data is stored into the variable sent_df1.
Step 6: Plotting the Statistics.
ggplot(sent_df1, aes(x=polarity)) + geom_bar(aes(y=..count..,
fill=polarity)) + scale_fill_brewer(palette = "Dark2") +
labs(x="polarity categories", y="number of tweets", title="classification
based on polarity")
One of the most important functions in R is ggplot(), which enables us to create a very wide range of useful
plots. The aes() function takes a list of name-value pairs mapping variables to aesthetics; in the current case the
variable is polarity. The bar geom is used to produce 1d area plots: bar charts for categorical x and histograms
for continuous x, as shown below. The color selection can be changed with one of the scale functions, such as
scale_fill_brewer(). The labs() call sets the axis labels and the plot title.
Polarity graph of Camaro tweets:
Conclusions:
The insight from the word cloud built during the data analysis of the Chevrolet Camaro is that the talk is most
frequently about the engine, seat, speed, power, drive and a few other aspects captured in the word cloud. A
visual representation such as a word cloud tends to have an impact and generates interest amongst the audience.
For further analysis, it may stimulate more questions than it answers, but that is a good entry point to discussion.
Based on the emotion classification, we see that the joy category stands out by a big margin, followed by anger,
surprise and sadness in relatively smaller proportions. Joy adds to the positive outlook for the car in the market;
further analysis on this data can be carried out using machine learning techniques. Based on the polarity
classification, the highest proportion falls under the neutral category. It has been shown that specific classifiers
such as max entropy and SVMs can benefit from the introduction of a neutral class and improve the overall
accuracy of the classification. The other approach in the current case is estimating a probability distribution over
all categories. Since the data is clearly clustered into neutral, negative and positive language, it makes sense to
filter the neutral language out and focus on the polarity between positive and negative sentiments. Open source
software tools deploy machine learning, statistics, and natural language processing techniques to automate
sentiment analysis on large collections of data similar to the Camaro review data.
As I mentioned in my previous deliverables, this was indeed a great learning process from an R programming
perspective. I am glad that I could pull together all the important aspects of sentiment analysis from several
different areas into one unified program. I believe sentiment analysis is an evolving field with a variety of
applications. Although sentiment analysis tasks are challenging due to their natural language processing
origins, much progress has been made over the last few years due to the high demand for it. Sentiment analysis
within microblogging has shown that Twitter can be seen as a valid online indicator of political sentiment:
tweets' political sentiment demonstrates close correspondence to parties' and politicians' political positions,
indicating that the content of Twitter messages plausibly reflects the offline political landscape.
References:
Sentiment Analysis and Opinion Mining by Bing Liu
(http://www.cs.uic.edu/~liub/FBS/SentimentAnalysisandOpinionMining.html)
Sentiment Analysis by Professor Dan Jurafsky (https://web.stanford.edu/class/cs124/lec/sentiment.pdf)
S. Blair-Goldensohn, Hannan, McDonald, Neylon, Reis and Reynar (2008), "Building a Sentiment
Summarizer for Local Service Reviews" (http://www.ryanmcd.com/papers/local_service_summ.pdf)
S. Asur et al., "Predicting the Future With Social Media", arXiv:1003.5699.
R. Sharda et al., "Forecasting Box-Office Receipts of Motion Pictures Using Neural Networks", CiteSeerX 2002.
http://www.businessinsider.com/apple-and-samsung-just-revealed-their-exact-us-sales-figuresfor-the-first-ever-time2012-8
https://www.coursera.org/learn/r-programming
http://www.bigdatanews.com/profiles/blogs/learn-everything-about-sentiment-analysis-using-r
Koweika A., Gupta A., Sondhi K. (2013), "Sentiment analysis for social media", International Journal of
Advanced Research in Computer Science and Software Engineering.
Younggue B., Hongchul L. (2012), "Sentiment analysis of Twitter audience: Measuring the positive or
negative influence", Journal of the American Society for Information Science and Technology.
http://stackoverflow.com/questions/10233087/sentiment-analysis-using-r
https://rpubs.com/cen0te/naivebayes-sentimentpolarity
https://www.youtube.com/watch?v=oXZThwEF4r0