Improving Public Transport Services
using Sentiment Analysis of Twitter data
*Shilpa Singh1 and Astha Pareek2
1Research Scholar, Dept. of CS & IT, The IIS University, Jaipur, INDIA
2Sr. Asst. Professor, Dept. of CS & IT, The IIS University, Jaipur, INDIA
[email protected], [email protected]
Abstract
Public transport services play an important role in
maintaining the sustainability of urban transportation
systems, especially in cities where buying and maintaining
a car is expensive. One of the biggest advantages of
public transport is that it may help reduce city traffic
jams and pollution. However, people hold mixed opinions
about the services provided by public transport such as
cabs, metros, airlines, buses, and railways: some are
satisfied and some are not with aspects like price, driver,
safety, and cleanliness. The prime target is therefore to
find customers' opinions on these services and to improve
the services according to the feedback provided.
In this paper, we aim to measure customer opinion on
services provided by public transport through sentiment
analysis. Data was taken from Twitter using a scraper
written in the Python
*Shilpa Singh
Journal of Information and Computational Science
Volume 10 Issue 1 - 2020
ISSN: 1548-7741
www.joics.org234
language. A total of 1928 tweets were collected in five
sections: cabs, metros, airlines, buses, and railways.
These tweets were then processed to determine whether they
express a neutral, positive, or negative opinion. This
opinion can be analyzed to determine the factors that deter
people from using public transport, as well as the factors
that encourage its use. By improving the facilities and
services based on the results, public transport may be
improved and people may switch to it, which would reduce
traffic jams and pollution. The accuracy of the
classification was measured with Naïve Bayes and
Logistic Regression.
Keywords: Sentiment analysis (SA), Machine learning,
NLTK, Naïve Bayes, Logistic Regression, Python, Twitter.
1. Introduction
The rapid growth of internet use and user-generated
content has driven the rise of opinion sites such as
Twitter.com and Quora.com. These social networking and
micro-blogging sites have become the largest platforms for
sharing users' personal feelings and preferences. These
opinions can be used to analyze users' sentiments,
feelings, and assessments of products, as well as their
opinions about public transport or any other topic.
Sentiment analysis of public transport is useful for
devising measures that reduce the number of private
vehicles on the road and increase the use of public
transportation. This may reduce pollution as well as
congestion. Some studies show that greater use of public
transport can help control pollution and reduce traffic
jams [19]. People are unwilling to use public transport
because they are dissatisfied with aspects of the service
such as cleanliness, time, facilities, safety and security,
and food. If these services are improved, people may use
public transport.
Data was taken from Twitter, as it is a rich source of
user opinion in the form of tweets. The data was collected
in five parts, one per type of transport. After fetching,
each tweet was classified as having positive or negative
polarity according to the service it concerns. Finally,
accuracy was calculated using Naïve Bayes and
Logistic Regression.
The rest of the paper is organized as follows. Section 2
reviews the literature on sentiment analysis in public
transport. Section 3 describes our approach for finding the
sentiment polarity (positive or negative) of tweets fetched
from the Twitter API using the NLTK tool, followed by
measuring accuracy with machine learning techniques in
Python. Section 4 presents the results, comparing the
accuracy of two classifiers, Naïve Bayes and Logistic
Regression. Finally, the conclusion and our recommendations
for future research are given in Section 5.
2. Literature Review
Various studies have been carried out on sentiment
analysis and public transport. The authors of [9] analyzed
the sentiments and opinions on Twitter about the Odd-Even traffic
scheme. For this analysis, the authors used the NLTK tool
and various open-source Python libraries and APIs.
Other authors [20],[17] studied how the increase in private
vehicles causes congestion. The reluctance to use public
transport is based on several factors, including travel
time, cost, safety and security, and the comfort and
convenience of public transport users.
An analysis was carried out on the two most popular online
transportation services in Indonesia [10]. The authors
collected 126,405 tweets using the NLTK tool and calculated
a Net Sentiment Score, which correlates with customer
satisfaction, from the classification results. The
experiment shows that Grab's customer satisfaction is
higher than GO-JEK's. The study also shows that customers
tend to mention the company's Twitter account when
reporting bad experiences and to omit the company's
account in positive comments.
3. Proposed Work
Our study consists of four steps, explained below. The
first step is the collection of data, which is then
pre-processed. After pre-processing, the polarity of each
tweet is determined as positive, negative, or neutral.
Finally, accuracy is measured using classifiers. Figure 1
shows these steps.
Figure 1. Classification of Process
3.1. Data Collection
To fetch data we used Tweepy, a client for the Twitter
Application Programming Interface (API). It can be
installed with the pip command pip install tweepy, as
described by Gupta et al., 2017 [6]. To fetch tweets from
the Twitter API, an app must be registered through a
Twitter account. The app was created with the following
steps [4]:
1. Open https://apps.twitter.com/ and click the
‘Create New App’ button.
2. Fill in the details requested.
3. When the app is created, the page reloads automatically.
4. Open the ‘Keys and Access Tokens’ tab.
5. Copy the ‘Consumer Key’, ‘Consumer Secret’, ‘Access
Token’, and ‘Access Token Secret’.
Figure 1 stages: Data Collection, Pre-Processing of Data, Sentiment Analysis Process, Classification.
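Once the four credentials have been copied, connecting through Tweepy might look like the sketch below. It is a minimal illustration, not the authors' scraper: the environment variable names and the helpers load_credentials and connect are hypothetical, and the code assumes the tweepy package is installed and exposes tweepy.OAuthHandler, set_access_token, and tweepy.API as it did in the Tweepy releases available when this study was carried out.

```python
import os

# Hypothetical environment variable names holding the four values
# copied from the 'Keys and Access Tokens' tab.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a tuple, failing early if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)


def connect(credentials):
    """Build an authenticated Tweepy API client (requires: pip install tweepy)."""
    import tweepy  # imported lazily so the helper above works without it

    consumer_key, consumer_secret, access_token, access_token_secret = credentials
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)
```

Reading the keys from the environment rather than hard-coding them keeps the secrets out of the source file, and connect is only called once the credentials are known to be complete.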
This method is slow, as it collects tweets every time the
program is started, but it yields a good quantity of data
with 32 or more attributes, which help in analyzing the
fetched raw data [4]. These attributes include text, name,
location, date and time of post, retweeted_status, profile
information, etc., as shown in Figure 2.
Figure 2. Raw Data from Twitter
A total of 1928 tweets were collected in five sections
corresponding to five topics (as shown in Table 1):
Railway, Aeroplane, Bus, Cab, and Metro [17][20].
Table 1. Dataset topic and number of raw tweets
Dataset topic    Number of raw tweets
Railway          470
Aeroplane        425
Bus              403
Cab              310
Metro            320
Out of the 1928 raw tweets, the relevant tweets were
extracted from the five datasets and manually collected
into a single table. A total of 448 relevant tweets were
extracted for use in the rest of the experiment. The counts
of relevant tweets, irrelevant tweets, and retweets in each
dataset are shown in Table 2:
Table 2. Classification of Data as Relevant, Irrelevant
or Retweets
             Railway  Aeroplane  Bus  Cab  Metro  Total
Relevant         200         75   60   50     63    448
Irrelevant        37        134   87   51     37    346
Retweets         233        216  256  209    220   1134
Total            470        425  403  310    320   1928
The next step is to collect only the required attributes,
such as tweet, location, and retweet_status, from the 32 or
more attributes [9][20]. Figure 3 shows only the required
attributes of the relevant data from the five datasets, for
use in further experiments.
Figure 3. Extracted Attributes from Raw Data taken
from all the Datasets.
Figure 3 shows that the required fields were extracted
from all the datasets and collected into a single table.
The first field is the tweet, which is the user's opinion
about a particular transport service. The second field is
the type of transport (railway, aeroplane, cab, etc.),
manually labeled according to the tweets taken from the
dataset [17]. Location is the location of the user who
tweeted. The last field is the number of retweets,
extracted directly from the raw dataset [9].
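The reduction from a raw tweet to the four fields described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: it assumes each raw tweet is a dictionary laid out like a Twitter API v1.1 tweet object (text, user.location, retweet_count, retweeted_status), and the function names extract_fields and is_retweet are hypothetical. The transport type is passed in by hand because the paper labels it manually.

```python
def is_retweet(raw_tweet):
    """A v1.1 tweet object carries 'retweeted_status' only when it is a retweet."""
    return "retweeted_status" in raw_tweet


def extract_fields(raw_tweet, transport_type):
    """Keep only the four columns used in the experiment."""
    user = raw_tweet.get("user") or {}
    return {
        "tweet": raw_tweet.get("text", ""),
        "transport_type": transport_type,  # labeled manually in the paper
        "location": user.get("location") or "",
        "retweets": raw_tweet.get("retweet_count", 0),
    }


# Hypothetical raw tweet with a few of the 32+ attributes.
sample = {
    "text": "Train late again",
    "user": {"name": "someone", "location": "Patna"},
    "retweet_count": 3,
    "created_at": "Mon Jan 06 10:00:00 +0000 2020",
}
row = extract_fields(sample, "railway")
```

Separating retweets first, then extracting fields only from the tweets judged relevant, mirrors the order of Table 2 followed by Figure 3.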
3.2. Pre-Processing of Data
The next step is to pre-process the data in the tweet
column so that it can then be used for finding the polarity
of each tweet as positive, negative, or neutral.
Pre-processing of data is a vital part of text mining and
sentiment analysis. The pre-processing phase is divided
into four stages, as shown in Figure 4 and described
below [17][10].
Data Cleansing: This step consists of case folding and
noise removal. Noise, in this case, is any character
other than a letter (numbers, symbols, and punctuation).
Tokenization: The process of splitting a sequence of words
in the document into individual words.
Stop Word Removal: The removal of words that appear often
but carry no specific meaning and are not considered
important for opinion classification.
Word Normalization: The process of converting a
non-standard word to its standard form, called stemming;
for example, removing suffixes such as ing, ed, and ly
from the word.
Figure 4. The Preprocessing Stages: Data Cleansing, Tokenization, Stop Word Removal, Word Normalization
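The four stages can be sketched in a few lines of standard-library Python. This is a simplified stand-in for the NLTK-based processing used in the paper, not a reproduction of it: the stop word list is a tiny hypothetical sample, and the suffix stripping is a crude approximation of stemming that only handles the ing, ed, and ly endings mentioned above.

```python
import re

# Tiny illustrative stop word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "is", "a", "an", "and", "are", "was", "by", "to", "of", "in"}


def clean(text):
    """Data cleansing: case folding, then replace every non-letter with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Tokenization: split the cleaned text into individual words."""
    return text.split()


def remove_stop_words(tokens):
    """Stop word removal: drop frequent words that carry no opinion."""
    return [t for t in tokens if t not in STOP_WORDS]


def normalize(token):
    """Word normalization: strip ing/ed/ly if at least three letters remain."""
    for suffix in ("ing", "ed", "ly"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Run the four stages in order and return the normalized tokens."""
    tokens = remove_stop_words(tokenize(clean(tweet)))
    return [normalize(t) for t in tokens]


print(preprocess("The train was delayed by 3 hours!!"))
# prints ['train', 'delay', 'hours']
```

Keeping each stage as its own function makes it easy to swap one of them, for example replacing normalize with a proper stemmer, without touching the others.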
3.3. Sentiment Analysis Process
The data is now classified as positive, negative, or
neutral with the help of SentiWordNet [11]. This is a
lexicon-based technique for finding the polarity, or
opinion, of each tweet. SentiWordNet assigns three
numerical sentiment scores, Obj(s), Pos(s), and Neg(s), to
each synset of WordNet. These scores describe how
objective, positive, and negative the sense contained in
the synset is. Each of the three scores ranges from 0.0 to
1.0, and their sum is 1.0 for each synset [12]. Each entry
contains the part-of-speech category of the displayed
entry. The lemma (word) is presented in the form
lemma#sense-number, where the first sense corresponds to
the most frequent one, and different words have different
polarities. Using this technique, we analyzed the
sentiment of each tweet in our database [1].
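The paper does not spell out how the three SentiWordNet scores become a single word polarity, so the sketch below shows one plausible rule, stated as an assumption: a word counts as positive when Pos(s) exceeds Neg(s), negative when Neg(s) exceeds Pos(s), and neutral otherwise. The function names are hypothetical; the only fact taken from the text is that the three scores sum to 1.0, which lets Obj(s) be derived from the other two.

```python
def objective_score(pos, neg):
    """Derive Obj(s) from the constraint Pos(s) + Neg(s) + Obj(s) = 1.0."""
    if not (0.0 <= pos <= 1.0 and 0.0 <= neg <= 1.0 and pos + neg <= 1.0):
        raise ValueError("scores must lie in [0, 1] and sum to at most 1")
    return 1.0 - pos - neg


def polarity_value(pos, neg):
    """Assumed rule: the larger of Pos(s) and Neg(s) decides the sign; ties give 0."""
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0
```

Other mappings are possible, for example thresholding on the difference Pos(s) - Neg(s); the choice affects how many words end up neutral.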
The algorithm used for calculating the polarity of each
tweet:
1. Assign a polarity value (PV) to each word, where
$PV \in \{-1, 0, +1\}$.
2. Sum the positive-polarity words:
$PV_p = \sum_{PV = +1} PV$.
3. Sum the magnitudes of the negative-polarity words:
$PV_n = \sum_{PV = -1} |PV|$.
4. Compute the total score of the tweet as $T = PV_p - PV_n$.
5. Return $T$ as the polarity score of that tweet.
6. If $T < 0$ the tweet is negative; if $T > 0$ it is positive.
7. If $T = 0$ the tweet is neutral.
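The scoring steps above can be sketched as follows. The lexicon here is a toy dictionary with hypothetical entries standing in for word polarities derived from SentiWordNet, and the function names are illustrative. The sketch counts negative words by magnitude so that T = PVp - PVn can fall below zero, and it treats T = 0 as neutral rather than positive.

```python
# Toy lexicon standing in for SentiWordNet-derived polarities (hypothetical values).
LEXICON = {
    "good": 1, "clean": 1, "comfortable": 1,
    "late": -1, "dirty": -1, "rude": -1,
}


def polarity_score(tokens, lexicon=LEXICON):
    """Return T = PVp - PVn for one tokenized tweet."""
    values = [lexicon.get(token, 0) for token in tokens]  # step 1: PV per word
    pv_p = sum(v for v in values if v == 1)               # step 2: positive words
    pv_n = sum(abs(v) for v in values if v == -1)         # step 3: negative words
    return pv_p - pv_n                                    # steps 4 and 5


def label(score):
    """Steps 6 and 7: map the score to a polarity class."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


score = polarity_score(["train", "late", "dirty", "good"])
print(score, label(score))
# prints -1 negative
```

Words missing from the lexicon contribute 0, so a tweet made up only of unknown words is labeled neutral.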
3.4. Classification
After finding the polarity of each tweet, the next step is
to analyze the results and examine the relationships among
the outcomes. First, we observe the services of all the
public transport types together, since the services are
the main factor behind not using public transport. Figure 5
shows that facilities are the main issue: 248 of the 448
relevant tweets (about 55 percent) concern facility
services, which suggests that people across India are not
satisfied with the facilities provided by public
transport. After facilities, time is the biggest issue. If
the facility and time services of all public transport are
improved, people in India may make more use of public
transportation [18].
Figure 5. Classification of Data Based on Services
The next important thing to observe from the experiment
is the relationship between the type of transport
and the services that received negative remarks. Railways
have the highest number of negative tweets, so railway
services have the most room for improvement. After the
railway, bus and metro services need to improve, followed
by cab and aeroplane, as shown in Figure 6.
Figure 6. Negative tweets of Public Transport Based on
Services
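A tally like the one behind Figure 6 can be produced with a counter over labeled tweets. The sketch below is illustrative only: the record layout (transport, service, polarity keys) and the sample records are hypothetical and are not the paper's data.

```python
from collections import Counter


def negatives_by_transport(records):
    """Count negative tweets for each type of transport."""
    return Counter(r["transport"] for r in records if r["polarity"] == "negative")


def negatives_by_service(records):
    """Count negative tweets for each (transport, service) pair."""
    return Counter(
        (r["transport"], r["service"]) for r in records if r["polarity"] == "negative"
    )


# Hypothetical labeled records, one per relevant tweet.
records = [
    {"transport": "railway", "service": "time", "polarity": "negative"},
    {"transport": "railway", "service": "facility", "polarity": "negative"},
    {"transport": "railway", "service": "facility", "polarity": "positive"},
    {"transport": "bus", "service": "facility", "polarity": "negative"},
    {"transport": "cab", "service": "price", "polarity": "neutral"},
]
by_transport = negatives_by_transport(records)
by_service = negatives_by_service(records)
```

Calling by_transport.most_common() then lists the transport types from most to fewest negative tweets, which is the ordering the paragraph above describes.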
Last is the classification of a few locations based on
tweets, as shown in Figure 7. Patna has the highest number
of tweets about transport services. The number of tweets
from a particular location may differ from day to
day [9].
Figure 7. Location of User
Finally, accuracy was calculated using Naïve Bayes and
Logistic Regression [7]. To do so, we imported the modules
from NLTK shown in Figure 8.
Figure 8. Modules Imported for using Classifiers
The accuracy was then measured for the above
experiment [8]. Naïve Bayes gives an accuracy of 61
percent and the Logistic Regression classifier gives an
accuracy of 63 percent, as shown in Figure 9.
Figure 9. Accuracy Percent of Naïve Bayes and
Logistic Regression.
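To make the accuracy figure concrete, the sketch below trains a small multinomial Naïve Bayes classifier with add-one smoothing and reports the fraction of held-out tweets it labels correctly. It is a standard-library illustration of the technique, not the NLTK classifiers used in the experiment, and the training and test tweets are hypothetical; it will not reproduce the 61 percent reported above.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns (label_counts, word_counts, vocab)."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(model, tokens) == label for tokens, label in samples)
    return correct / len(samples)


# Hypothetical pre-processed tweets with their polarity labels.
train = [
    (["late", "dirty"], "negative"),
    (["rude", "late"], "negative"),
    (["clean", "good"], "positive"),
    (["good", "comfortable"], "positive"),
]
test = [(["late"], "negative"), (["good", "clean"], "positive")]
model = train_naive_bayes(train)
result = accuracy(model, test)
```

Accuracy here is simply correct predictions divided by the number of test tweets, so a value of 0.61 would correspond to the 61 percent quoted for Naïve Bayes.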
4. Result
We carried out the above experiment on 1928 tweets and
observed that NLTK is a good tool for classifying the
sentiment toward public transport services as positive,
negative, or neutral.
After finding the polarity of each tweet, accuracy was
calculated using Naïve Bayes and Logistic Regression. Of
these two classifiers, Logistic Regression gives the
higher accuracy of 63.00 percent, while Naïve Bayes gives
61.00 percent. We also observed that there are more
negative tweets than positive tweets. Most of the negative
tweets contain words like security, time, delays, road
traffic, congestion, crowd, and food, so the focus can be
placed on these areas to improve public transport services.
5. Conclusion and Future Work
Sentiment analysis is a vast area in which work can be
done across disciplines such as natural language processing,
A.I., machine learning, and various other opinion mining
approaches. This experiment applied sentiment analysis to
public transport services, focusing on the problems faced
by users. From the experiment, it can be concluded that
improving the services offered by different modes of
transport may lead to greater use of public transport.
For this experiment, data was taken from Twitter; one
could also take data from online course websites such as
Coursera, forums, or Facebook. The NLTK tool was used, but
the same analysis can be performed with other tools such as
RapidMiner, Weka, and Hadoop. Also, in place of the Naïve
Bayes and Logistic Regression techniques, other techniques
can be applied, such as MultinomialNB, BernoulliNB, SGD,
and LinearSVC.
6. References
[1] A. Agarwal, V. Sharma, G. Sikka and R. Dhir, “Opinion
Mining of News Headlines using SentiWordNet”,
IEEE. 16, (2016), 978-1-5090-0669-4.
[2] H. Balaji, V. Govindasamy and V. Akila, “Social
Opinion Mining and Concise Rendition”, ICACCCT.
(2016), 978-1-4673-9545-8.
[3] Biswa Ranjan Samal, A. K. Behera and M. Panda,
“Performance Analysis of Supervised Machine
Learning Techniques for Sentiment Analysis”, IEEE.
(2017), 978-1-5090-4929-5.
[4] M. B. Myneni, L. V. N. Prasad and J. S. Devi, “A
Framework For Sementic Level Social Sentiment
Analysis Model”, JATIT. vol. 96, no. 16, (2017), pp.
1992-8645.
[5] A. Goel, J. Gautam and S. Kumar, “Real Time
Sentiment Analysis of Tweets Using Naive Bayes”,
IEEE. vol. 16, (2016), 978-1-5090-3257-0.
[6] B. Gupta, M. Negi, K. Vishwakarma, G. Rawat and P.
Badhani, “Study of Twitter Sentiment Analysis using
Machine Learning Algorithms on Python”,
International Journal of Computer Applications. vol.
165, no. 9, (2017), pp. 0975 – 8887.
[7] D. K. Madhuri, “A Machine Learning based
Framework for Sentiment Classification: Indian
Railways Case Study”, IJITEE. vol. 8, (2019), pp.
2278-3075.
[8] A. X. Annier, V. Mohan and S. H. Venu, “Sentiment
Analysis Applied to Airline Feedback to Boost
Customers Endearment”, IJAPS. vol. 2, (2016), pp.
51-58.
[9] S. K. Sharma and X. Hoque, “Sentiment Analysis for
Odd-Even Scheme in Delhi”, Indian Journal of Science
and Technology. vol. 11, no. 24, (2018), pp. 0974-6846.
[10] S. Anastasia and B. Indra, “Twitter Sentiment Analysis
of Online Transportation Service Providers”, IEEE. 16,
(2016), 978-1-5090-4629-4.
[11] N. Medagoda, S. Shanmuganathan and J. Whalley,
“Sentiment Lexicon Construction Using SentiWordNet
3.0”, IEEE. 15, (2015), 978-1-4673-7679-2.
[12] P. H. Rahmath and T. Ahmad, “Sentiment analysis
Techniques – A Comparative Study”, International
Journal of Computational Engineering & Management.
vol. 17, (2014), pp. 2230-7893.
[13] F. J. Ramírez, G. Alor, J. L. Sánchez, B. A. Olivares
and L. Rodríguez, “A Brief Review on the Use of
Sentiment Analysis Approaches in Social Networks”,
Springer International Publishing. 1007, (2018),
978-3-319-69341-5_24.
[14] J. M. Salanova, M. Estrada, G. Aifadopoulou and E.
Mitsakis, “A review of the modeling of taxi services”,
ELSEVIER. vol. 20, (2011), pp. 150-161.
[15] R. Shukla, A. Chandra and H. Jain, “OLA VS UBER:
The Battle of Dominance”, IOSR-JBM. (2017), pp. 73-78.
[16] D. Terrana, A. Augello and G. Pilato, “Automatic
Unsupervised Polarity Detection on a Twitter Data
Stream”, IEEE. vol. 14, (2014), 978-1-4799-4003-5.
[17] V. Effendy, A. Novantirani and M. K. Sabariah,
“Sentiment Analysis on Twitter about the Use of City
Public Transportation Using Support Vector Machine
Method”, IJOICT. vol. 2, no. 1, (2016), pp. 57-66.
[18] S. Gitto and P. Mancuso, “Improving airport services
using sentiment analysis of the websites”, Elsevier. vol.
22, (2017), pp. 132-136.
[19] T. Hoang, P. H. Cher, P. K. Prasetyo and E. Lim,
“Crowdsensing and Analyzing Micro-Event Tweets for
Public Transportation Insights”, Research Collection
School Of Information Systems. 2, (2017).
[20] T. Bosznay, “Mind-map the Gap – Sentiment Analysis
of Public Transport”, Amadeus Software. (2017), 1264.