
Improving Public Transport Services using Sentiment Analysis of Twitter Data

*Shilpa Singh1 and Astha Pareek2
1Research Scholar, Dept. of CS & IT, The IIS University, Jaipur, INDIA
2Sr. Asst. Professor, Dept. of CS & IT, The IIS University, Jaipur, INDIA
[email protected], [email protected]

Abstract

Public transport services play an important role in maintaining the sustainability of urban transportation systems, especially in cities where buying and maintaining a car is expensive. One of the biggest advantages of using public transport is that it may help reduce city traffic jams and pollution. However, people have mixed opinions about the services provided by public transport such as cabs, metros, airlines, buses, and railways: some are satisfied and some are not with aspects like price, driver, safety, cleanliness, etc. So, the prime target is to find the opinion of customers on these services and to improve the quality of these services according to the feedback provided.

In this paper, we aim to measure customer opinion on services provided by public transport through sentiment analysis. Data is taken from Twitter using a scraper written in the Python language. Around 1928 tweets were collected in five categories: cabs, metros, airlines, buses, and railways. These tweets were then processed to determine whether they express a neutral, positive, or negative opinion. This opinion can be analyzed to determine the factors that deter people from using public transport, as well as the factors that encourage its use. By improving the facilities and services based on the results, public transport may be improved and people may switch to it, which would result in reduced traffic jams and pollution. The accuracy of this experiment has been calculated with Naïve Bayes and Logistic Regression.

*Shilpa Singh
Journal of Information and Computational Science
Volume 10 Issue 1 - 2020
ISSN: 1548-7741
www.joics.org

Keywords: Sentiment analysis (SA), Machine learning, NLTK, Naïve Bayes, Logistic Regression, Python, Twitter.

1. Introduction

The rapid growth of internet use and user-generated content has led to the rise of opinion sites such as Twitter.com, Quora.com, etc. These social networking and micro-blogging sites have become the largest platforms for sharing users' personal feelings and social preferences. These opinions can be used to analyze users' sentiments, feelings, and assessments of products, as well as their opinions about public transport or any other topic.

Sentiment analysis of public transport is very beneficial for devising methods to reduce the number of private vehicles on the road and to increase the use of public transportation. This may lead to reduced pollution as well as reduced congestion. Some studies show that greater use of public transport can lead to pollution control and fewer traffic jams [19]. The unwillingness to use public transport stems from unsatisfactory services such as cleanliness, time, facility, safety and security, food, etc. So if these services are improved, people may use public transport.

Data has been taken from Twitter, as it is a rich source of data and of users' opinions in the form of tweets. This data is collected into five parts according to the mode of transportation. After fetching, the data is classified as having positive or negative polarity according to the service it concerns. Finally, accuracy has been calculated using Naïve Bayes and Logistic Regression.


The rest of the paper is organized as follows. Section 2 gives an overview of the literature on sentiment analysis in public transport. Section 3 describes our approach for finding the sentiment polarity (positive or negative) of tweets fetched from the Twitter API using the NLTK tool, followed by computing the accuracy using machine learning techniques in the Python language. Section 4 describes the results, in which the accuracy of two classifiers, Naïve Bayes and Logistic Regression, is compared. Finally, our recommendations for future research opportunities, along with the conclusion, are reported in Section 5.

2. Literature Review

Various studies have been done on sentiment analysis and public transport. The authors of [9] analyzed the sentiments and opinions on Twitter about the Odd-Even traffic scheme. For this analysis they used the NLTK tool and various open-source libraries and APIs of the Python language. Some authors [17], [20] studied how the increase in private vehicles causes congestion. The reluctance to use public transport is based on several factors, among others travel time, cost, safety and security, and the comfort and convenience of the users of public transport.

An analysis has been done on the two most popular online transportation services in Indonesia [10]. The authors collected 126,405 tweets using the NLTK tool. From the classification results they calculated a Net Sentiment Score, which correlates with customer satisfaction. The experiment shows that Grab's customer satisfaction is higher than GO-JEK's. The study also shows that customers tend to mention the companies' Twitter accounts when describing bad experiences but not when posting positive comments.

3. Proposed Work

Our study consists of four steps, explained below. The first step is the collection of data, which is then pre-processed. After pre-processing, the polarity of each tweet is determined as positive, negative, or neutral. Finally, the accuracy of the classification is measured using classifiers. Figure 1 shows the stages of the process.

Figure 1. Classification of Process (Data Collection, Pre-Processing of Data, Sentiment Analysis Process, Classification)

3.1. Data Collection

To fetch data we used Tweepy, a client for the Twitter Application Programming Interface (API). It can be installed with the pip command: pip install tweepy, as described by Gupta et al. [6]. To fetch tweets from the Twitter API, an app needs to be registered through a Twitter account. The app was created with the following steps [4]:

1. Open https://apps.twitter.com/ and click the button 'Create New App'.

2. Fill in the details asked.

3. When the app is created, the page will be loaded automatically.

4. Open the ‘Keys and Access Tokens’ tab.

5. Copy ‘Consumer Key’, ‘Consumer Secret’, ‘Access Token’ and ‘Access Token Secret’.
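The four credentials copied in the last step are what a Tweepy client needs to authenticate. The following is a minimal sketch of how the collection could look, assuming Tweepy 4.x and credentials that are allowed to call the standard search endpoint. The topic list, the "India" search term, and the function names build_query and fetch_tweets are illustrative assumptions, not the scraper used in this study.

```python
"""Sketch: collect raw tweets per transport topic with Tweepy (assumes 4.x)."""

# The five topics named in the paper; the exact search terms are hypothetical.
TOPICS = ["railway", "aeroplane", "bus", "cab", "metro"]


def build_query(topic, exclude_retweets=False):
    """Build a search query string for one topic (illustrative only)."""
    query = f"{topic} India"
    if exclude_retweets:
        # Standard Twitter search operator that drops retweets server-side.
        query += " -filter:retweets"
    return query


def fetch_tweets(topic, consumer_key, consumer_secret,
                 access_token, access_token_secret, limit=100):
    """Return up to `limit` raw tweet dicts for `topic`."""
    import tweepy  # imported lazily so the helpers above work without Tweepy

    auth = tweepy.OAuth1UserHandler(
        consumer_key, consumer_secret, access_token, access_token_secret
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(
        api.search_tweets, q=build_query(topic), tweet_mode="extended"
    )
    # Each Status object keeps its full JSON payload in `_json`,
    # which carries all raw attributes of the tweet.
    return [status._json for status in cursor.items(limit)]
```

Calling fetch_tweets once per entry in TOPICS would give one raw dataset per mode of transport, mirroring the five datasets described in this section.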


This method is slow, as it collects tweets every time the program is started, but it yields a good quantity of data with 32 or more attributes, which help in the analysis of the fetched raw data [4]. These attributes include text, name, location, date and time of post, retweeted_status, profile information, etc., as shown in Figure 2.

Figure 2. Raw Data from Twitter

A total of 1928 tweets were collected in five datasets, one per topic (as shown in Table 1): Railway, Aeroplane, Bus, Cab, and Metro [17][20].

Table 1. Dataset topic and number of raw tweets

Dataset topic    Number of raw tweets
Railway          470
Aeroplane        425
Bus              403
Cab              310
Metro            320

Out of the 1928 raw tweets, the relevant tweets were extracted from the different datasets and manually collected into a single table. A total of 448 relevant tweets were extracted for use in the further experiment. The counts of relevant tweets, irrelevant tweets, and retweets in each dataset are shown in Table 2.

Table 2. Classification of Data as Relevant, Irrelevant or Retweets

              Railway  Aeroplane  Bus  Cab  Metro  Total
Relevant          200         75   60   50     63    448
Irrelevant         37        134   87   51     37    346
Retweets          233        216  256  209    220   1134
Total             470        425  403  310    320   1928

The next step is to keep only the required attributes, such as tweet, location, and retweet_status, out of the 32 or more attributes [9][20]. Figure 3 shows only the required attributes of the relevant data from the five datasets for further experiments.

Figure 3. Extracted Attributes from Raw Data taken from all the Datasets.

Figure 3 shows that the required fields have been extracted from all the datasets and collected into one single table. The first field is the tweet, which is the user's opinion about a particular transport service. The second field is the type of transport (railway, aeroplane, cab, etc.), manually labeled according to the tweets taken from the dataset [17]. Location is the location of the user who tweeted. The last field is the number of retweets, extracted directly from the raw dataset [9].
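The field extraction described above can be sketched as follows. This is an illustration under stated assumptions: the raw tweets are taken to be dictionaries in the Twitter v1.1 JSON layout (keys text or full_text, user.location, retweet_count, retweeted_status), the sample tweets are invented, and the helper names is_retweet and extract_fields are hypothetical. Relevance was judged manually in this study and is not automated here.

```python
def is_retweet(raw):
    """A raw v1.1 tweet is a retweet when it carries `retweeted_status`."""
    return "retweeted_status" in raw


def extract_fields(raw, transport_type):
    """Keep only the four fields described above from one raw tweet dict."""
    return {
        "tweet": raw.get("full_text") or raw.get("text", ""),
        "transport_type": transport_type,  # manually labeled in the paper
        "location": raw.get("user", {}).get("location", ""),
        "retweet_count": raw.get("retweet_count", 0),
    }


# Hypothetical raw tweets, trimmed to the keys used here.
raw_tweets = [
    {"text": "Train was two hours late again",
     "user": {"location": "Patna"}, "retweet_count": 3},
    {"text": "RT @someone: Train was two hours late again",
     "user": {"location": "Jaipur"}, "retweet_count": 3,
     "retweeted_status": {"text": "Train was two hours late again"}},
]

# Drop retweets, then reduce each remaining tweet to the required fields.
rows = [extract_fields(t, "railway") for t in raw_tweets if not is_retweet(t)]
print(rows)
```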

3.2. Pre-Processing of Data

The next step is to pre-process the data in the tweet column so that it can be used to find the polarity of each tweet as positive, negative, or neutral. Pre-processing of data is a vital part of text mining and sentiment analysis. The pre-processing phase is divided into four parts, as shown in Figure 4. The stages of pre-processing are as follows [17][10].

Data Cleansing: This step consists of case folding and noise removal. Noise, in this case, is any character other than a letter (numbers, symbols, and punctuation).

Tokenization: The process of splitting a sequence of words in the document into individual word tokens.

Stopword Removal: The process of removing stop words, that is, words that appear often but carry no specific meaning and are not considered important for opinion classification.

Word Normalization: The process of converting a non-standard word to its standard form, called stemming. It includes removing suffixes such as ing, ed, and ly from the word.

Figure 4. Preprocessing Stages (Data Cleansing, Tokenization, Stop Word Removal, Word Normalization)
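The four pre-processing stages can be sketched in plain Python as shown below. This is only an illustration: the stop-word set is a tiny hypothetical list and the suffix stripper is a naive stand-in for a real stemmer, whereas the study itself relied on the NLTK tool. The function names are assumptions introduced for this example.

```python
import re

# Tiny hypothetical stop-word list; NLTK ships a much fuller English list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "was", "for", "on"}
SUFFIXES = ("ing", "ed", "ly")


def clean(text):
    """Data cleansing: case folding, then replace non-letters with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Tokenization: split the cleaned text into single-word tokens."""
    return text.split()


def remove_stop_words(tokens):
    """Stop-word removal: drop frequent words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def normalize(token):
    """Naive suffix stripping (not a full stemmer such as Porter's)."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Run the four stages in order and return the normalized tokens."""
    tokens = remove_stop_words(tokenize(clean(tweet)))
    return [normalize(t) for t in tokens]


print(preprocess("The bus was delayed badly, waiting 45 minutes!"))
# ['bus', 'delay', 'bad', 'wait', 'minutes']
```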

3.3. Sentiment Analysis Process

The data is now classified as positive, negative, or neutral with the help of SentiWordNet [11]. This is a lexicon-based technique for finding the polarity, or opinion, of each tweet. SentiWordNet assigns three numerical sentiment scores, Obj(s), Pos(s), and Neg(s), to each synset of WordNet. These scores describe how objective, positive, and negative the sense contained in the synset is. Each of the three scores ranges from 0.0 to 1.0, and their sum is 1.0 for each synset [12]. Each entry contains the part-of-speech category of the displayed entry. The lemma (word) is presented in the form lemma#sense-number, where the first sense corresponds to the most frequent one, and different words have different polarities. Using the above technique, we analyzed the sentiment of each tweet in our database [1].

The algorithm used for calculating the polarity of each tweet:

1. Assign a polarity value (PV) to each word, where $PV \in \{-1, 0, +1\}$.

2. Find the sum over all positive polarity words, $PV_p = \sum_{PV = +1} PV$.

3. Find the sum of the magnitudes over all negative polarity words, $PV_n = \sum_{PV = -1} |PV|$.

4. Find the total score of a tweet as $T = PV_p - PV_n$.

5. Return $T$ as the polarity score of that tweet.

6. If $T < 0$ the tweet is negative; if $T > 0$ the tweet is positive.

7. If $T = 0$ the tweet is neutral.
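The scoring steps above can be sketched as follows. Two points are assumptions made for illustration and are not stated in the paper: the word scores in LEXICON are invented values in the style of SentiWordNet, and a word's polarity value is taken as the sign of its positive score minus its negative score. PVn is treated as the count of negative words so that the total score T can fall below zero.

```python
# Hypothetical (positive, negative) scores per word, in the style of
# SentiWordNet; real scores would come from the lexicon itself.
LEXICON = {
    "clean": (0.625, 0.0),
    "comfortable": (0.75, 0.0),
    "late": (0.0, 0.5),
    "dirty": (0.0, 0.75),
    "crowded": (0.125, 0.375),
}


def polarity_value(word):
    """Map a word to PV in {-1, 0, +1} by comparing its two scores."""
    pos, neg = LEXICON.get(word, (0.0, 0.0))
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0


def tweet_score(tokens):
    """T = PVp - PVn, with PVp and PVn counting positive and negative words."""
    values = [polarity_value(t) for t in tokens]
    pv_p = sum(1 for v in values if v == 1)
    pv_n = sum(1 for v in values if v == -1)
    return pv_p - pv_n


def label(score):
    """Turn the total score T into a polarity label."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


for tokens in (["train", "late", "dirty"],
               ["metro", "clean", "comfortable"],
               ["bus", "clean", "crowded"]):
    print(tokens, tweet_score(tokens), label(tweet_score(tokens)))
```

In the third sample one positive and one negative word cancel out, so the tweet is labeled neutral.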

3.4. Classification

After finding the polarity of each tweet, the next step is to analyze the experiment and classify the relationships among the outcomes. First, we observe the services of all the public transport modes together, as service quality is the main factor in not using public transport. Figure 5 shows that facility service is the main cause of avoidance: 248 of the 448 relevant tweets (about 55 percent) fall under facility service, so we can say that people across India are not satisfied with the facility services provided by transportation. After facility, time is the biggest issue in public transportation. If the facility and time services of all public transport are improved, the people of India may use public transportation [18].

Figure 5. Classification of Data Based on Services
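A per-service tally of this kind can be sketched with a keyword lookup and a counter. The keyword map below is hypothetical, since the rules used to assign tweets to services are not listed in this paper, and the sample tweets are invented. The last line reproduces the share reported above from the counts 248 and 448.

```python
from collections import Counter

# Hypothetical keyword-to-service map; the paper does not list its rules.
SERVICE_KEYWORDS = {
    "facility": {"seat", "toilet", "wifi", "charging"},
    "time": {"late", "delay", "delayed", "waiting"},
    "cleanliness": {"dirty", "clean", "garbage"},
    "safety": {"unsafe", "theft", "security"},
}


def service_of(tokens):
    """Return the first service whose keywords appear in the tweet tokens."""
    for service, keywords in SERVICE_KEYWORDS.items():
        if keywords & set(tokens):
            return service
    return "other"


def service_shares(tokenized_tweets):
    """Count tweets per service and express each count as a percentage."""
    counts = Counter(service_of(tokens) for tokens in tokenized_tweets)
    total = sum(counts.values())
    return {service: round(100 * n / total, 1) for service, n in counts.items()}


sample = [["no", "seat", "available"], ["train", "delayed"],
          ["broken", "charging", "point"], ["dirty", "coach"]]
print(service_shares(sample))

# Share of facility tweets reported in this section: 248 of 448.
print(round(100 * 248 / 448, 1))  # 55.4
```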

The next important thing to observe from the experiment is the relationship between the type of transport and the services that received negative remarks. It was observed that railways have the highest number of negative tweets, so railway services need the most improvement. After the railway, bus and metro services need to improve, followed by cab and aeroplane, as shown in Figure 6.

Figure 6. Negative tweets of Public Transport Based on Services

Last is the classification of a few locations based on tweets, as shown in Figure 7. It has been observed that Patna has the highest number of tweets about transport services. The number of tweets from a particular location may differ from day to day [9].

Figure 7. Location of User

Finally, accuracy was calculated using Naïve Bayes and Logistic Regression [7]. To do so, we imported the following modules from NLTK, as shown in Figure 8.

Figure 8. Modules Imported for using Classifiers

The accuracy observed for the above experiment [8] is 61 percent for Naïve Bayes and 63 percent for the Logistic Regression classifier, as shown in Figure 9.

Figure 9. Accuracy Percent of Naïve Bayes and Logistic Regression
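To illustrate how such an accuracy figure is obtained, the sketch below trains a small multinomial Naïve Bayes classifier with Laplace smoothing and measures the fraction of held-out tweets it labels correctly. It is a pure-Python stand-in and not the NLTK modules used in this study; the toy data is hypothetical, so its accuracy is not comparable to the percentages reported above. A Logistic Regression classifier would be scored against the same held-out set in the same way.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Pick the label with the highest log posterior (Laplace smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the label
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of held-out samples whose predicted label is correct."""
    correct = sum(predict(model, tokens) == label for tokens, label in samples)
    return correct / len(samples)


# Hypothetical toy data; the labeled tweets of this study are not reproduced.
train = [(["train", "late"], "negative"), (["dirty", "coach"], "negative"),
         (["clean", "metro"], "positive"), (["comfortable", "seat"], "positive")]
held_out = [(["bus", "late"], "negative"), (["clean", "seat"], "positive")]

model = train_naive_bayes(train)
print(accuracy(model, held_out))
```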


4. Result

We carried out the above experiment on 1928 tweets and observed that NLTK is a good tool for determining the sentiment toward public transport services as positive, negative, or neutral.

After finding the polarity of each tweet, accuracy was calculated using Naïve Bayes and Logistic Regression. Of these two classifiers, Logistic Regression gives the higher accuracy of 63.00 percent, while Naïve Bayes gives 61.00 percent. It was also observed that the percentage of negative tweets is higher than that of positive tweets. Most of the negative tweets contain words like security, time, delays, road traffic, congestion, crowd, food, etc. So, the focus can be placed on these areas to improve public transport services.

5. Conclusion and Future Work

Sentiment analysis is a vast area where work can be done in different disciplines like natural language processing, A.I., machine learning, and various other opinion mining approaches. This experiment applied sentiment analysis to public transport services, with a focus on the problems faced by users. From the above experiment, it can be concluded that improving the services of the different modes of transportation may lead to more use of public transport.

For this experiment data was taken from Twitter; one could instead take data from online course websites like Coursera, from forums, or from Facebook. The NLTK tool was used, but the same analysis can be done with other tools like RapidMiner, Weka, Hadoop, etc. Also, in place of the Naïve Bayes and Logistic Regression techniques, other techniques can be applied, like MultinomialNB, BernoulliNB, SGD, LinearSVC, etc.

6. References

[1] A. Agarwal, V. Sharma, G. Sikka and R. Dhir, “Opinion Mining of News Headlines using SentiWordNet”, IEEE. 16, (2016), 978-1-5090-0669-4.

[2] H. Balaji, V. Govindasamy and V. Akila, “Social Opinion Mining and Concise Rendition”, ICACCCT. (2016), 978-1-4673-9545-8.

[3] Biswa Ranjan Samal, A. K. Behera and M. Panda, “Performance Analysis of Supervised Machine Learning Techniques for Sentiment Analysis”, IEEE. (2017), 978-1-5090-4929-5.

[4] M. B. Myneni, L. V. N. Prasad and J. S. Devi, “A Framework For Sementic Level Social Sentiment Analysis Model”, JATIT. vol. 96, no. 16, (2017), pp. 1992-8645.

[5] A. Goel, J. Gautam and S. Kumar, “Real Time Sentiment Analysis of Tweets Using Naive Bayes”, IEEE. vol. 16, (2016), 978-1-5090-3257-0.

[6] B. Gupta, M. Negi, K. Vishwakarma, G. Rawat and P. Badhani, “Study of Twitter Sentiment Analysis using Machine Learning Algorithms on Python”, International Journal of Computer Applications. vol. 165, no. 9, (2017), pp. 0975-8887.

[7] D. K. Madhuri, “A Machine Learning based Framework for Sentiment Classification: Indian Railways Case Study”, IJITEE. vol. 8, (2019), pp. 2278-3075.

[8] A. X. Annier, V. Mohan and S. H. Venu, “Sentiment Analysis Applied to Airline Feedback to Boost Customers Endearment”, IJAPS. vol. 2, (2016), pp. 51-58.

[9] S. K. Sharma and X. Hoque, “Sentiment Analysis for Odd-Even Scheme in Delhi”, Indian Journal of Science and Technology. vol. 11, no. 24, (2018), pp. 0974-6846.

[10] S. Anastasia and B. Indra, “Twitter Sentiment Analysis of Online Transportation Service Providers”, IEEE. 16, (2016), 978-1-5090-4629-4.

[11] N. Medagoda, S. Shanmuganathan and J. Whalley, “Sentiment Lexicon Construction Using SentiWordNet 3.0”, IEEE. 15, (2015), 978-1-4673-7679-2.

[12] P. H. Rahmath and T. Ahmad, “Sentiment analysis Techniques – A Comparative Study”, International Journal of Computational Engineering & Management. vol. 17, (2014), pp. 2230-7893.

[13] F. J. Ramírez, G. Alor, J. L. Sánchez, B. A. Olivares and L. Rodríguez, “A Brief Review on the Use of Sentiment Analysis Approaches in Social Networks”, Springer International Publishing. 1007, (2018), 978-3-319-69341-5_24.

[14] J. M. Salanova, M. Estrada, G. Aifadopoulou and E. Mitsakis, “A review of the modeling of taxi services”, ELSEVIER. vol. 20, (2011), pp. 150-161.

[15] R. Shukla, A. Chandra and H. Jain, “OLA VS UBER: The Battle of Dominance”, IOSR-JBM. (2017), pp. 73-78.

[16] D. Terrana, A. Augello and G. Pilato, “Automatic Unsupervised Polarity Detection on a Twitter Data Stream”, IEEE. vol. 14, (2014), 978-1-4799-4003-5.

[17] V. Effendy, A. Novantirani and M. K. Sabariah, “Sentiment Analysis on Twitter about the Use of City Public Transportation Using Support Vector Machine Method”, IJOICT. vol. 2, no. 1, (2016), pp. 57-66.

[18] S. Gitto and P. Mancuso, “Improving airport services using sentiment analysis of the websites”, Elsevier. vol. 22, (2017), pp. 132-136.

[19] T. Hoang, P. H. Cher, P. K. Prasetyo and E. Lim, “Crowdsensing and Analyzing Micro-Event Tweets for Public Transportation Insights”, Research Collection School Of Information Systems. 2, (2017).

[20] T. Bosznay, “Mind-map the Gap – Sentiment Analysis of Public Transport”, Amadeus Software. (2017), 1264.


