twitter analysis - is there a correlation between swedish tweets and the swedish stock...


Page 1: Twitter Analysis - Is there a correlation between Swedish tweets and the Swedish stock ...kth.diva-portal.org/smash/get/diva2:1109320/FULLTEXT01.pdf · 2017-06-13 · theory holds

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2017

Twitter Analysis - Is there a correlation between Swedish tweets and the Swedish stock market?

Twitter Analys - Existerar det någon korrelation mellan svenska tweets och den svenska aktiemarknaden?

DAVID SARKHOSH

ALLAN NOURI

KTH SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Twitter Analysis - Do Twitter Sentiments Correlate to Changes of Swedish Stock Prices?

DAVID SARKHOSH AND ALLAN NOURI OTMANFARHA

Bachelor in Computer Science

Date: June 5, 2017

Supervisor: Alexander Kozlov

Examiner: Örjan Ekeberg

Swedish title: Twitter Analys - Existerar det någon korrelation mellan svenska tweets och den svenska aktiepriser?

School of Computer Science and Communication


Abstract

Stock market prediction is a problem that has undergone extensive research with many approaches and methods, such as mathematical models and machine learning methods. Another interesting approach is sentiment analysis, an approach that takes public opinion into account when predicting stock market prices. This method, combined with some machine learning techniques, has proven effective when it comes to predicting stock prices.

This study determines whether this method is usable on demographics where information on public opinion does not come in abundance, in this instance the demographic of people who speak Swedish. The public sentiment is gathered by analyzing public opinion found on Twitter, and hourly stock prices for three companies were gathered. This data was combined and linear regression was performed to see whether a correlation exists between these data sets.

The results showed that there does not appear to be a linear relation between public sentiment and changes in stock prices. The mean squared error indicates that the data points deviate too much from the regression line for the regression line to be usable as an accurate model.

The limited amount of data on public sentiment led to the conclusion that the Swedish Twitter flow is not usable as a source for extracting reliable information on public sentiment to be analyzed by any model.


Summary (Sammanfattning)

Predicting the stock market is a current research problem. Machine learning, mathematical methods and opinion measurement are methods for measuring and predicting changes in stock prices. Opinion measurement is a method that has proven effective for predicting the stock market.

This study investigates whether there is a correlation between public opinion about a company and changes in the company's stock price in demographically smaller populations, in this case Sweden and Swedish speakers. The public's attitude towards three different companies was measured by analyzing the public's opinions about each company on the social media platform Twitter. The stock price changes and the opinion data were collected hour by hour, and linear regression was used to investigate whether there is any relationship between the stock price and public opinion.

The results were that there does not appear to be any linear relationship between public opinion and changes in stock prices. The mean squared error showed that the data points deviate too much from the regression line for opinion measurement to be used as a reliable model.

The limited amount of opinion data led to the conclusion that the Swedish Twitter flow is not usable for predicting stock price changes from public opinion.


Contents

1 Introduction
    1.1 Research Question
    1.2 Scope

2 Background
    2.1 Stock Market Prediction Models
        2.1.1 Random Walk theory
        2.1.2 Efficient market theory
        2.1.3 Predicting Stock Prices using Twitter Mood
    2.2 Data Mining
    2.3 Opinion mining and sentiment analysis
        2.3.1 Sentiment Lexicon
        2.3.2 Approaches to sentiment analysis
    2.4 Correlation
    2.5 Predictive Analytics
        2.5.1 Linear regression
        2.5.2 Machine Learning techniques

3 Methods
    3.1 Collecting social media data
        3.1.1 Collecting Twitter data
    3.2 Storing data
        3.2.1 PostgreSQL
        3.2.2 pgAdmin
        3.2.3 Database Schema
    3.3 Companies
        3.3.1 Fingerprint Cards
        3.3.2 Volvo
        3.3.3 H&M
    3.4 Collecting Stock Data
    3.5 Sentiment Analysis
        3.5.1 Translating from Swedish to English
        3.5.2 Translation Method
        3.5.3 Sentiment Analysis Method
    3.6 Data Classification
        3.6.1 Classes
        3.6.2 The purpose of these classifications
    3.7 Representation of Data
        3.7.1 Representation of stock and sentiment data
        3.7.2 Representation of Tweet Data
        3.7.3 Storing stock and sentiment data
        3.7.4 Representation of relationship between stock data and sentiment data
    3.8 Analyzing Data
        3.8.1 Null hypothesis/Model

4 Results
    4.1 Regression Analysis
        4.1.1 Presentation of results

5 Discussion
    5.1 Interpreting the Results
    5.2 Limitations
        5.2.1 Limitations of the quantity and quality of Tweets
        5.2.2 Limitations of the Sentiment Analysis Method
        5.2.3 Limitations of the Regression Model
    5.3 Further Research

6 Conclusions

A Scatter Plots


Chapter 1

Introduction

Predicting stock market volatility is a very complex problem that has been under investigation for many years. The potential gains from an effective algorithm for stock market prediction should come as no surprise. Many algorithms have been proposed for this task. Earlier methods included ideas from statistical inference theory and other mathematical models [1] [2]. Since the early 1990s, algorithms that use so-called artificial neural networks, a concept originally developed in the field of neuroscience, have been implemented [3].

The increase in the number of people using the World Wide Web, and the subsequent increase in usage of social media, has led to the availability of a large pool of opinions circulating on blogs, microblogs and social networks. It makes sense that companies strive for a good public image, as it is likely to increase their sales and profit. Therefore it should come as no surprise that there exists a positive correlation between public opinion on a company and the stock index of said company [4] [5].

The methods of gathering public opinion have been researched in a field called opinion mining (or sentiment analysis), a branch of natural language processing. Opinion mining is a field where little research had been done until the early 2000s. Since then, thousands of papers have been written on the topic with interesting results, some with applications in stock market forecasting [4] [6] [7] [8].

While most studies conducted on the correlation between public opinion and stock indexes have shown a positive correlation between the two, most of them have revolved around stock index data gathered from countries with large populations, such as Germany [4], the United States of America [5] [7], China [9], Saudi Arabia [10] and India [11]. Few studies have been made on smaller countries (in terms of population), such as Sweden. It is not clear whether these results carry over to a smaller demographic, as there will not be as much data available to analyze public opinion.

1.1 Research Question

Can Tweet sentiments from the demographic of people who speak Swedish be used to find a correlation between public sentiment and the stock prices in the Swedish stock market OMXS?

1.2 Scope

The scope of this report is limited to three different companies active on the Swedish stock market (OMXS), with data collected during four working days. The sentiment analysis used to collect data is limited to analysis of Twitter data. The analysis of the correlation between the stock and Twitter data is limited to the linear regression technique.


Chapter 2

Background

This chapter contains four parts. The first part describes stock market prediction models, and the second covers data mining and data collection techniques. The third part covers the theories for performing sentiment analysis to analyze the opinions and sentiments of texts. The last part covers predictive analytics and statistical techniques.

2.1 Stock Market Prediction Models

2.1.1 Random Walk theory

The random walk theory states that stock price changes are independent and have the same distribution [12]. Assuming the random walk theory holds for a stock market, it is impossible to predict trends and future stock price changes using the past movement of the stock price [12]. This model, however, seems to contradict results from studies which have shown that the stock market is predictable, to a reasonable degree, based on previous movements [7] [13].

2.1.2 Efficient market theory

The efficient market theory states that market share prices reflect every piece of available information. This means that an investor cannot gain higher returns than the market average in the long run. This theory is inconsistent with market anomalies that have been observed empirically [4].


2.1.3 Predicting Stock Prices using Twitter Mood

There are several studies on using Twitter mood to predict stock prices [4] [7] [8]. One recent approach using Twitter mood to predict fluctuations of specific company stocks was a bachelor thesis from 2015. The thesis studied how three organizations' stocks correlated with the public opinion of the companies on the social networking platform Twitter. The thesis focused on the US stock market and on US-based companies. Its authors successfully implemented a company-specific model that was able to predict stock price movements [8].

2.2 Data Mining

Data mining can, roughly speaking, be defined as a process of discovering various models, summaries, and derived values from a given collection of data [14]. The experimental procedure for data mining can be described as follows:

• Formulate a problem statement and a hypothesis

• Collect data

• Preprocess data, i.e. detect outliers, scale variables appropriately, etc.

• Estimate model.

• Interpret data and draw conclusions.

2.3 Opinion mining and sentiment analysis

Opinion mining and sentiment analysis is the field that analyses people's emotions, sentiments and attitudes towards entities (in this study, towards companies) in written language. Today, information about companies is available online on a large scale. Because the volume of this content is often very large, automated sentiment analysis systems are needed [6].


2.3.1 Sentiment Lexicon

Sentiment words are words that are commonly used to express positive or negative thoughts. A list of such words, divided into positive and negative sentiment words, is called a sentiment lexicon [6].

2.3.2 Approaches to sentiment analysis

Recent approaches to sentiment analysis of text are presented in this section, together with their strengths and weaknesses.

Keywords

The keyword approach is the most naive approach to sentiment analysis. The idea is to look for affect words in a text that represent either positive or negative sentiments, for example good and extraordinary (which express positive sentiments) and bad and ugly (which express negative sentiments) [15].

Lexical Affinity

Lexical affinity identifies words and assigns each of them a value based on whether the word is considered to express positive or negative sentiments. This approach classifies a text by rating the sentiments of the individual words in the text. The method has a drawback: negated sentences may not be evaluated correctly [15].
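The drawback with negated sentences can be illustrated with a minimal Python sketch of word-level scoring. The lexicon `AFFINITY` and its values are invented for this illustration and are not taken from [15] or from any real sentiment lexicon.

```python
# Hypothetical word-level affinity values; a real lexicon would be far larger.
AFFINITY = {"good": 1, "great": 2, "bad": -1, "terrible": -2}


def affinity_score(text: str) -> int:
    """Sum the affinity values of the individual words in a text."""
    words = text.lower().split()
    return sum(AFFINITY.get(word, 0) for word in words)


# The word "not" carries no value, so the negation is ignored and both
# sentences receive the same positive score.
plain = affinity_score("the product is good")
negated = affinity_score("the product is not good")
print(plain, negated)
```

Handling such cases would require looking at more than one word at a time, which is what the more complex methods mentioned in this section aim to do.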

Statistical methods and other approaches

There exist statistical methods, for example Bayesian inference, which classify texts using statistical models and machine learning techniques. Statistical approaches are mostly used for sentiment analysis of larger texts. There are also more complex methods that detect more complicated sentences and expressions [15].


2.4 Correlation

In this section, the theoretical terms needed for the statistical analysis are introduced. When observing statistical data, an interesting aspect to measure is the relationship between paired data. The most common measures of this relationship are the covariance and the correlation coefficient. The covariance measures the joint variability of two variables X and Y, and the correlation coefficient measures the strength of the linear relationship between them.

Covariance

Mathematical definition: The covariance C(X, Y) between X and Y is

$$C(X, Y) = E[(X - \mu_x)(Y - \mu_y)] \tag{2.1}$$

Correlation coefficient

The correlation coefficient for X and Y is defined as follows, where D(X) and D(Y) denote the standard deviations of X and Y:

$$\rho(X, Y) = \frac{C(X, Y)}{D(X)\,D(Y)} \tag{2.2}$$

Covariance for a set of data

The covariance between the x- and y-values in a set of data $(x_1, y_1), \ldots, (x_n, y_n)$ is

$$c_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{2.3}$$

Correlation coefficient for a set of data

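The correlation coefficient for a set of data is

$$r = \frac{c_{xy}}{s_x s_y} \tag{2.4}$$

where $c_{xy}$ is the covariance from (2.3) and $s_x$ and $s_y$ are the sample standard deviations of the x- and y-values.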


If the data vary so that large values of $x_i$ are paired with large values of $y_i$, and small values with small values, then the product $(x_i - \bar{x})(y_i - \bar{y})$ is often positive. The covariance and the correlation coefficient are then also positive. The correlation coefficient r always lies in the interval $-1 \le r \le 1$. If C(X, Y) = 0, which holds in particular when X and Y are independent, then X and Y are regarded as uncorrelated [16].
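As an illustration of (2.3) and (2.4), the following Python sketch computes the sample covariance and the correlation coefficient for a small data set. The function names and the data are made up for this example; they are not part of the thesis.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance c_xy with the 1/(n - 1) normalisation of (2.3)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation_coefficient(xs, ys):
    """Sample correlation r = c_xy / (s_x * s_y), as in (2.4)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # sample standard deviation of x
    s_y = math.sqrt(sample_covariance(ys, ys))  # sample standard deviation of y
    return sample_covariance(xs, ys) / (s_x * s_y)


# Made-up paired data in which large x-values go with large y-values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(sample_covariance(xs, ys), correlation_coefficient(xs, ys))
```

Because the made-up y-values are an exact linear function of the x-values, the correlation coefficient comes out at the upper end of the interval; reversing the order of the y-values gives the lower end.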

2.5 Predictive Analytics

Predictive analytics is the field concerned with predicting future outcomes and trends. It can be performed using several approaches, two of which are machine learning and linear regression techniques [17].

2.5.1 Linear regression

Linear regression is an approach for analyzing the relationship between a scalar dependent variable and one or more independent variables. The goal of this technique is to obtain a linear combination of the form:

$$y(x_1, x_2, \ldots, x_n) = b_0 + \sum_{i=1}^{n} b_i x_i \tag{2.5}$$

where y is the response variable (output variable), $b_0, b_1, \ldots, b_n$ are unknown coefficients, and $x_1, x_2, \ldots, x_n$ are measured inputs [18]. The unknown coefficients $b_i$ can be determined using a least-squares approximation [16].
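For the special case of a single input (n = 1), the least-squares estimates have a closed form. The Python sketch below shows this case together with the mean squared error of the fitted line. It is only an illustration: the function names and the data are invented, and it is not the procedure or tooling used in the thesis.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


def mean_squared_error(xs, ys, b0, b1):
    """Average squared vertical distance between the data and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: hourly sentiment score (x) and stock price change (y).
sentiment = [0.0, 1.0, 2.0, 3.0]
price_change = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(sentiment, price_change)
print(b0, b1, mean_squared_error(sentiment, price_change, b0, b1))
```

A large mean squared error relative to the spread of the y-values indicates that the data points lie far from the fitted line.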

2.5.2 Machine Learning techniques

Machine learning techniques are useful tools for predictive analytics and are increasing in popularity due to their good performance when handling large-scale datasets [17]. In general, machine learning techniques use a set of training data to induce an estimated model, which in turn can be used to predict output from future inputs. Machine learning generally involves three components: a generator of inputs, which produces a vector of random inputs x; a system that returns an output y for a given vector from the generator; and a learning machine that estimates a function mapping input x to output y′ from observed samples of x mapped to y by the system [14]. Some of the more common techniques and concepts of machine learning are listed below.

Classification

Classification is defined as follows: “This is a learning function that classifies a data item into one of several predefined classes” [14]. The purpose of classification is to accurately predict which class a given input belongs to, even if the output is not known.

Decision trees and Decision rules

Decision trees and decision rules are data mining methodologies. They are powerful solutions to classification problems. The goal of this kind of learning is to classify values: the input is some data X, and the output is the same data X together with a label or classification of the data.


Chapter 3

Methods

In this chapter, the methodology used in this thesis is presented. The methods for collecting social media data, the selection of database service, the selection of companies, the sentiment analysis method, the data classifications, and the representation and analysis of the data are explained.

3.1 Collecting social media data

In order to gather data on public opinion of a company, we have chosen to look for user-generated input on social media.

3.1.1 Collecting Twitter data

The following subsections present the specifics behind the process of gathering tweets from Twitter. In short, tweets have been collected by running a Java program that automatically fetches tweets and extracts the data relevant for our analysis.

Twitter REST API for extracting data

There are several software APIs available for mining Twitter data. Data can be extracted from Twitter through the REST API provided by Twitter. The Twitter REST API uses OAuth to ensure that only authorized users can use the API to extract Twitter data [19].


Twitter4J

Twitter4J is an unofficial, open source Java library that integrates the Twitter REST API into Java applications in order to collect Twitter data. It comes with Javadoc documentation for using the API in a Java application, and works with Java 5 and later. Twitter4J is free for commercial and non-commercial usage [20]. Twitter4J contains functions to extract the relevant information about the tweets needed.

3.2 Storing data

The following subsections present how the mined Twitter data has been stored. A SQL database was set up and all Twitter data has been stored in accordance with a fairly straightforward schema.

3.2.1 PostgreSQL

PostgreSQL is an open source object-relational database. It runs on all of the major operating systems. In this project, the database was set up and run on macOS. The database has full support for joins, views, keys, etc. and includes most SQL data types such as Boolean, Integer, Numeric, Varchar, Char, etc. [21]. PostgreSQL contains all the practical functionality of an SQL database needed to store the data and to run queries that gather the information relevant for the Twitter analysis.

3.2.2 pgAdmin

A relevant tool for setting up the server is pgAdmin. It is an open source platform for administering and developing PostgreSQL databases. Its features include writing SQL queries against the database as well as developing complex databases [22].

3.2.3 Database Schema

The database was developed simply for storing all of the tweets, together with the information about each tweet needed for further analysis. Below is a table showing all the information stored in the database for each tweet.


Database Schema

Entry            Data Type      Description
company_name     Varchar(35)    The name of the company
created_at       Varchar(35)    The timestamp of the tweet
favorite_count   Integer        The number of favorites/likes
retweet_count    Integer        The number of retweets
text             Varchar(300)   The written tweet
sentiment1       Integer        The sentiment value according to Class 1 (see section 3.6)
sentiment2       Integer        The sentiment value according to Class 2 (see section 3.6)
user_id          Varchar(60)    The id of the author of the tweet
followers        Integer        The number of followers of the author of the tweet

Table 3.1: Database Schema
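The schema in Table 3.1 could be declared roughly as follows. This is only a sketch: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the PostgreSQL server used in the thesis, and the table name `tweets` and the sample row are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tweets (          -- hypothetical table name
        company_name   VARCHAR(35),
        created_at     VARCHAR(35),
        favorite_count INTEGER,
        retweet_count  INTEGER,
        text           VARCHAR(300),
        sentiment1     INTEGER,    -- Class 1 value
        sentiment2     INTEGER,    -- Class 2 value
        user_id        VARCHAR(60),
        followers      INTEGER
    )
    """
)

# Invented example row, inserted with a parameterised query.
row = ("Volvo", "2017-04-03 10:15:00", 2, 1, "good truck", 1, 1, "12345", 250)
conn.execute("INSERT INTO tweets VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", row)

stored = conn.execute(
    "SELECT company_name, sentiment1, followers FROM tweets"
).fetchall()
print(stored)
```

Storing the follower count next to each tweet makes it possible to apply a follower threshold later with a simple query condition.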

3.3 Companies

The data collection consists of stock data from three different companies and tweets about them. Information about each company can be found in the respective subsection. The three companies were selected from three different industries so that the sentiment scores are independent of one another. The companies are relatively big in Sweden, and therefore there should be more tweets about these companies than about smaller companies.

3.3.1 Fingerprint Cards

Fingerprint Cards is a biometrics company that develops fingerprint sensors, which provide devices with authentication and identification by human touch [23]. Fingerprint Cards' primary industry is the biotechnology industry.

3.3.2 Volvo

Volvo is a company that primarily produces vehicles such as cars and trucks. Volvo is divided into two separate companies, Volvo Group and Volvo Cars, which are separate entities but cooperate when it comes to research and development [24]. Volvo's primary industry is the vehicle and motor industry.

3.3.3 H&M

Hennes & Mauritz (H&M) is a retailing company that sells clothes. The H&M group has over 4300 stores in over 50 countries [25]. H&M's primary industry is the fashion and textile industry.

3.4 Collecting Stock Data

Financial data was collected through web scraping. The hourly stock value for each of the three companies was extracted manually from the website of the Swedish bank Avanza. All three companies, together with other companies listed on the Nasdaq OMX Swedish stock market, are listed on the website of Avanza [26].

3.5 Sentiment Analysis

The tweet texts were collected one by one through the Twitter API, and the sentiment analysis and the data classification were performed on each tweet by a Java program. The information needed about each tweet was then stored in the database. This section contains three subsections, explaining how the sentiment analysis was performed.

3.5.1 Translating from Swedish to English

Due to the lack of freely available sentiment analysis tools for Swedish texts, the tweets are translated from Swedish to English using Google Translate. A study on this topic, in which Google Translate was used to perform the translations, has shown that machine translation preserves the sentiments in the general case, and that the sentiment was preserved in most cases when translating from Swedish into English [27]. For this reason, the tweets are translated into English before the sentiment of each tweet is evaluated.


3.5.2 Translation Method

Google Translate was used to translate the tweets from Swedish to English. Google Translate is a web-based translation service provided by Google [28]. The translation was automated by sending an HTTP request to Google's REST API from a Java program to get the translated text for each tweet. Each text was translated in its entirety directly with the Google Translate service. Google Translate was selected because it is a widely used and reliable translation service.

3.5.3 Sentiment Analysis Method

The sentiments were evaluated using a sentiment analysis dictionary containing positive and negative English words. The dictionary contained all the positive words in one text file and all the negative words in another text file [29]. The sentiment analysis was performed by looking up every individual word: if it was in the dictionary as a positive word, the score was 1; if it was in the dictionary as a negative word, the score was -1; and if the word was not in the dictionary at all, the score was 0. The sentiment value of a text is the sum of the scores of all of its words. This method for determining sentiment values from tweets was also used in a bachelor thesis from 2015 [8]. It is a naive approach to sentiment analysis, and was selected because it was used in a previous bachelor thesis and provides simple sentiment scores for short texts based on the presence of positive and negative words. The listing below provides an example of how the sentiment analysis works. The text has been translated from Swedish to English; in the analysis, it is converted to lower case and all non-alphanumeric characters are removed.

SandyPandyHeat: Good arguement but if they have a better, more durable sensor for smart cards, they can not ramp up fast

Listing 3.1: Example of a positive tweet in English

The sentiment analysis was performed while the tweets were collected, by translating the texts with Google Translate as described in section 3.5.1. The sentiment value for each tweet was then stored in the database in the same entry as the corresponding text. In this example, the words "good", "better", "durable", "smart" and "fast" were found in the list of positive words, while none of the words in the tweet were found in the list of negative words. Since each positive word yields a value of 1, the sentiment score of this tweet was 5.
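The word-lookup scoring described in this section can be sketched in Python as follows. The thesis implementation was written in Java and is not reproduced here; the two small word sets below are stand-ins for the word lists in [29], the function name is hypothetical, and replacing punctuation by spaces (rather than deleting it) is a choice made for this sketch.

```python
import re

# Tiny stand-ins for the positive and negative word lists [29]; the real
# lists are much longer and are stored in two separate text files.
POSITIVE_WORDS = {"good", "better", "durable", "smart", "fast"}
NEGATIVE_WORDS = {"bad", "worse", "slow"}


def sentiment_score(text: str) -> int:
    """Add 1 per positive word, subtract 1 per negative word, 0 otherwise."""
    # Lower-case the text and replace non-alphanumeric characters by spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    score = 0
    for word in cleaned.split():
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score


tweet = ("Good arguement but if they have a better, more durable sensor "
         "for smart cards, they can not ramp up fast")
print(sentiment_score(tweet))
```

With these stand-in word lists, the example tweet from Listing 3.1 receives the score 5, in line with the walkthrough above.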

3.6 Data Classification

In order to analyze the Twitter data, the data is gathered and presented in three different ways, so that the three approaches can be compared. These approaches regulate how the sentiment values are interpreted from a statistical point of view. The sentiment values being classified in this section are the values determined by the sentiment analysis method in section 3.5.3.

3.6.1 Classes

The tweets and their sentiment values are classified into three different classes, each of which is presented below.

Class 1: Each sentiment value of each tweet determined by the method in section 3.5.3 is included in the data analysis. The sentiment values are not changed or classified in any way.

Class 2: If the sentiment value of the tweet is 1, it is labelled as positive. If the value is -1, the tweet is labelled as negative. If the sentiment value is greater than 1, the tweet is labelled as very positive and its sentiment value is set to 2. If the sentiment value is less than -1, the tweet is labelled as very negative and its sentiment value is set to -2 [8]. Class 1 and Class 2 were derived from the bachelor thesis mentioned in section 3.5.3.

Class 3: All properties of Class 2 apply here, with one additional constraint: tweets are only included in the calculations if the author of the tweet has at least 200 followers.
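The three classes can be expressed compactly in code. The following Python sketch is an illustration only: the function names and the example tweets are invented, and each tweet is represented as a pair of raw sentiment score and follower count.

```python
def class2_value(raw_score: int) -> int:
    """Clamp a raw (Class 1) sentiment score to the range -2..2."""
    if raw_score > 1:
        return 2    # very positive
    if raw_score < -1:
        return -2   # very negative
    return raw_score  # -1 (negative), 0, or 1 (positive)


def class3_values(tweets, min_followers=200):
    """Class 2 values of the tweets whose author has enough followers."""
    return [
        class2_value(score)
        for score, followers in tweets
        if followers >= min_followers
    ]


# Invented (raw score, follower count) pairs.
tweets = [(5, 1000), (-3, 150), (1, 200), (0, 50)]
print([class2_value(score) for score, _ in tweets])
print(class3_values(tweets))
```

In this example the raw scores 5 and -3 are reduced to 2 and -2 under Class 2, and only the first and third tweets remain under Class 3.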


3.6.2 The purpose of these classifications

Class 1 is intended to be an approach where the sentiment value of a tweet depends directly on the sentiment analysis. The purpose of Class 2 is to limit the impact that individual tweets with a very positive score (greater than 2) or a very negative score (less than −2) have on the overall score during a specific time period. Class 3 serves the purpose of trying to avoid possible fake users and users with few or no followers.

3.7 Representation of Data

This section describes how the data stored in the database is represented and evaluated.

3.7.1 Representation of stock and sentiment data

In order to represent stock data, stock prices were gathered on a per-hour basis during opening hours (from 09:00 to 17:00), as described in section 3.4. Furthermore, sentiment values were calculated by aggregating the sentiment scores on a per-hour basis. The stock data from a certain hour was mapped to the sentiment score of the foregoing hour, except for the opening hour (09:00), where the stock data was mapped to the aggregation of sentiment scores from all hours starting at closing time on the day before (17:00) up until the hour before opening (08:00). The sentiment and stock values were calculated on a per-hour basis because the number of tweets was not sufficiently high for shorter time intervals.
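The mapping between a stock hour and the sentiment hours it is paired with can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, hours are identified by their starting time, and "the day before" is taken to be the previous calendar day, so weekends and holidays would need extra handling.

```python
from datetime import datetime, timedelta
from typing import List

OPEN_HOUR = 9    # opening hour (09:00)
CLOSE_HOUR = 17  # closing time (17:00)


def sentiment_hours_for(stock_hour: datetime) -> List[datetime]:
    """Return the hours whose sentiment is mapped to the given stock hour."""
    if stock_hour.hour == OPEN_HOUR:
        # Opening hour: aggregate from 17:00 the day before up to 08:00.
        current = (stock_hour - timedelta(days=1)).replace(hour=CLOSE_HOUR)
        hours = []
        while current < stock_hour:
            hours.append(current)
            current += timedelta(hours=1)
        return hours
    # Any other hour: only the foregoing hour is used.
    return [stock_hour - timedelta(hours=1)]
```

For an opening hour the function returns sixteen hourly slots (17:00 through 08:00), whose sentiment scores would then be aggregated into one value.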

3.7.2 Representation of Tweet Data

Tweet data will be collected and presented separately for each of the three classifications presented in section 3.6. For each class, the sum and the average of the sentiment values of all tweets about a specific company during each hour will be presented in the data analysis.
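The hourly sum and average could be computed along these lines. This is a sketch only: the function name and the input format (pairs of hour and sentiment value for one company and one class) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_hour(scored_tweets):
    """Group (hour, sentiment value) pairs and return sum and average per hour."""
    buckets = defaultdict(list)
    for hour, value in scored_tweets:
        buckets[hour].append(value)
    return {
        hour: {"sum": sum(values), "avg": mean(values)}
        for hour, values in buckets.items()
    }


# Invented sample: two tweets at hour 9, one tweet at hour 10.
print(aggregate_by_hour([(9, 1), (9, 2), (10, -1)]))
```

Hours without any tweets do not appear in the result, so the caller has to decide whether such hours count as neutral or are left out of the analysis.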

3.7.3 Storing stock and sentiment data

Stock and sentiment data gathered during the data collection will be stored in Google Sheets for further processing. Data will be collected by querying the server to obtain the tweet sentiment data and by web scraping the stock data.

3.7.4 Representation of relationship between stock data and sentiment data

The relation between the sentiment values gathered from collected tweets and the stock market closing prices is visualized with a scatter plot of data points, with sentiment scores as the independent variable on the x-axis and the stock data as the dependent variable on the y-axis.

3.8 Analyzing Data

This section describes how the gathered data is analyzed in order to draw conclusions about possible relationships between the datasets.

3.8.1 Null hypothesis/Model

In order to draw conclusions about the supposed relationship between public sentiment and stock prices, a null hypothesis is stated that stock prices are linearly dependent on public sentiment. More precisely, that stock prices are directly proportional to the public sentiment values found when analyzing the Twitter feed.

Linear Regression

In order to determine whether or not stock prices are linearly dependent on public sentiment, the method of least squares is employed to find a linear best-fit function (a regression curve). The slope of the regression curve would indicate if there exists a correlation between the dependent and independent variables, and if so, whether the correlation is positive or negative. The mean squared error (MSE) is then evaluated and used as an indicator of how well the linear regression curves fit the data points.
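The least-squares fit described above can be illustrated in plain Python. This is a sketch under assumptions rather than the calculation actually used: the function name is hypothetical, the MSE is assumed to be the mean of the squared residuals (dividing by n), and the sample data is invented.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, with correlation and MSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # slope of the regression line
    intercept = mean_y - slope * mean_x  # y-intercept of the regression line
    rho = sxy / sqrt(sxx * syy)          # sample correlation coefficient
    # Assumption: MSE is the average squared residual.
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
    return slope, intercept, rho, mse


# Invented data: hourly sentiment scores and changes in stock price.
scores = [-2, 0, 1, 3, 5]
changes = [-0.4, 0.1, 0.0, 0.3, 0.2]
print(fit_line(scores, changes))
```

The function requires at least two distinct sentiment scores and some variation in the price changes, since it otherwise divides by zero.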

Calculations and Visualization

The necessary calculations to produce the linear regression curve, along with the standard deviation between data points and regression curve, have all been carried out in MATLAB [30]. Blom (2004) [16] provides the mathematical definitions of the parameter estimates for the linear regression curve (slope and y-intercept), as well as the definitions of measures in descriptive statistics (expected value, standard deviation, correlation coefficient, MSE, etc.) that were used to produce the regression curves and evaluate them.


Chapter 4

Results

In this section, the results gathered from the data collection will be presented and analyzed.

4.1 Regression Analysis

This section will contain the results from the regression analysis performed on the sentiment values and the change of the stock price.

4.1.1 Presentation of results

In the scatter plots in this section, the x-axis shows the sentiment values of the tweets from the hour before the stock value was collected, calculated according to the classifications of the tweets (see section 3.6). The y-axis shows the corresponding change of the stock price during the same period of time. Some of the scatter plots are presented in this section, while the others are found in Appendix A.

Class 1 - Sum This part presents data from Class 1 (see section 3.6 Data Classification), where the sum of the sentiment values of all tweets was evaluated for each hour. The sentiment score and the stock price for Fingerprint Cards are shown in fig. 4.1. There is no apparent correlation between the two variables. One observation that may be interesting, however, is that the highest rise of the stock price (1.01) coincides with a very high sentiment score (10). Moreover, the biggest decline of the stock price (−0.83) coincides with a fairly low sentiment score (−2).


Figure 4.1: Scatter plot of sentiment score and stock prices for FING B with a linear regression curve. Class 1 data with summed sentiment scores.

The sentiment score and the stock price for Volvo are shown in fig. 4.2. There is no apparent correlation between the two variables. One interesting observation, however, is that there are no extreme sentiment values for Volvo: the highest sentiment value during one hour is 5 and the lowest is -2.


Figure 4.2: Scatter plot of sentiment score and stock prices for VOLV B with a linear regression curve. Class 1 data with summed sentiment scores.

The sentiment score and the stock price for H&M are shown in fig. 4.3. There is no apparent correlation between the two variables. One observation that may be interesting is that during the hour where the highest rise of the stock value occurs (1.7), there is also a fairly high sentiment value (5).


Figure 4.3: Scatter plot of sentiment score and stock prices for HM B with a linear regression curve. Class 1 data with summed sentiment scores.

Class 2 - Sum The sentiment score and the stock price for Fingerprint Cards are shown in fig. 4.4. There is no apparent correlation between the two variables. One observation is that the hour with the highest rise of the stock value (1.01) also has the highest sentiment score (52). However, when the lowest sentiment score occurs (−48), the stock rises by 0.46.


Figure 4.4: Scatter plot of sentiment score and stock prices for FING B with a linear regression curve. Class 2 data with summed sentiment scores.

The sentiment score and the stock price for Volvo are shown in fig. 4.5. There is no apparent correlation between the stock price and the sentiment score. One interesting observation is that the hour with the highest rise of the stock value (1.1) also has the highest sentiment score (10).


Figure 4.5: Scatter plot of sentiment score and stock prices for VOLV B with a linear regression curve. Class 2 data with summed sentiment scores.

The sentiment score and the stock price for H&M are shown in fig. 4.6. There is no apparent correlation between the stock price and the sentiment score, as can be seen in the scatter plot. The lowest sentiment value (−4) coincides with a negative change of the stock price (−0.7). The highest rise of the stock price (1.7) coincides with a high sentiment value (5). However, the biggest decline of the stock price (−1.5) coincides with a positive sentiment score (4).


Figure 4.6: Scatter plot of sentiment score and stock prices for HM B with a linear regression curve. Class 2 data with summed sentiment scores.

Class 3 - Sum This part presents the sentiment score and the stock price for Fingerprint Cards B using Class 3 with summed scores (see section 3.6 Data Classification). As can be seen in the scatter plot (fig. 4.7), Fingerprint Cards B has the same edge cases as in fig. 4.4, and there is no apparent correlation between the two variables. The additional constraint of only including tweets from authors with at least 200 followers did not make a significant change in this case.


Figure 4.7: Scatter plot of sentiment score and stock prices for FING B with a linear regression curve. Class 3 data with summed sentiment scores.

Regarding Volvo, there is no apparent correlation between the two variables in this case, as can be seen in fig. 4.8. The edge cases are similar to, but not the same as, the ones in fig. 4.5. The highest rise of the stock price (1.1) coincides with the highest sentiment value (5). There are no other edge cases of interest.


Figure 4.8: Scatter plot of sentiment score and stock prices for VOLV B with a linear regression curve. Class 3 data with summed sentiment scores.

For H&M, there is no apparent correlation between the two variables. As can be seen in the scatter plot in fig. 4.9, the lowest sentiment score coincides with a decrease of the stock price, and the biggest decrease of the stock price coincides with a negative sentiment score. There are no other interesting edge cases.


Figure 4.9: Scatter plot of sentiment score and stock prices for HM B with a linear regression curve. Class 3 data with summed sentiment scores.

Class 1 - Average The sentiment scores and the stock prices for Class 1, using the average approach explained in section 3.7.2, are presented here. As can be seen in fig. 4.10, there are no interesting edge cases for Fingerprint Cards other than that the highest rise in stock price coincides with a positive sentiment value and the biggest decrease of the stock price coincides with a negative sentiment value.


Figure 4.10: Scatter plot of sentiment score and stock prices for FING B with a linear regression curve. Class 1 data with averaged sentiment scores.

For Volvo, there is no apparent correlation between the sentiment value and the change of the stock price, as can be seen in fig. 4.11. The highest rise of the stock price coincides with a positive sentiment score. Other than that, there are no edge cases of interest.


Figure 4.11: Scatter plot of sentiment score and stock prices for VOLV B with a linear regression curve. Class 1 data with averaged sentiment scores.

For H&M, there is no apparent correlation between the two variables, as seen in fig. 4.12. There are no interesting edge cases.

Figure 4.12: Scatter plot of sentiment score and stock prices for HM B with a linear regression curve. Class 1 data with averaged sentiment scores.


Class 2 and Class 3 - Average The sentiment scores and the stock prices for Class 2 and Class 3, using the average of the sentiment scores, are presented in Appendix A. There was no apparent correlation in those scatter plots.

The following tables present regression data and other descriptive statistics on a per-company basis. The Class column states the aggregation method (sum or average) and the class for which the data has been gathered. α is the slope and β is the y-intercept of the best-fit linear regression curve. ρ is the sample correlation between the change in stock prices (dependent variable) and the sentiment score (independent variable). MSE is the mean squared error, a measure of the error of the regression curve.

Class    α          β          ρ          MSE
Sum 1    0.0286     −0.0168    0.2908     0.1318
Sum 2    0.0078     0.0155     0.2534     0.1347
Sum 3    0.0086     0.0246     0.2092     0.1377
Avg 1    0.0346     0.0217     0.0418     0.1437
Avg 2    0.1008     0.0135     0.1109     0.1422
Avg 3    −0.0123    0.0255     −0.0148    0.1439

Table 4.1: Regression parameters gathered from data on Fingerprint Cards (FING B)

Class    α          β          ρ          MSE
Sum 1    −0.0372    0.1248     −0.1430    0.1584
Sum 2    0.0328     0.0832     0.2875     0.1484
Sum 3    0.0298     0.0884     0.1678     0.1571
Avg 1    −0.0842    0.1152     −0.0981    0.1601
Avg 2    −0.0863    0.1153     −0.1006    0.1600
Avg 3    −0.0807    0.1178     −0.1401    0.1585

Table 4.2: Regression parameters gathered from data on Volvo (VOLV B)


Class    α          β          ρ          MSE
Sum 1    0.0185     −0.0268    0.1675     0.4962
Sum 2    0.0169     −0.0261    0.1397     0.5007
Sum 3    0.0534     −0.0312    0.1599     0.4975
Avg 1    0.0316     −0.0030    0.0276     0.5102
Avg 2    0.0317     −0.0025    0.0258     0.5102
Avg 3    0.0977     −0.0134    0.0794     0.5073

Table 4.3: Regression parameters gathered from data on Hennes & Mauritz (HM B)

One result from Tables 4.1, 4.2 and 4.3, which is also apparent from the scatter plots, is that the magnitudes of the slopes of the linear best-fit regression curves are fairly small. Also, the correlation coefficients for each company are fairly small, indicating that there is no clear linear relation between changes in stock prices and sentiment scores. The mean squared errors were larger for the H&M regression lines than for the other lines (compare the MSE column in Table 4.3 to Tables 4.1 and 4.2). For each company, the errors are of roughly the same magnitude regardless of which class is selected.


Chapter 5

Discussion

In this section, the results will be analyzed and interpreted, and conclusions will be drawn. The limitations of the methods of this report will be discussed together with self-criticism and suggestions for further research.

5.1 Interpreting the Results

As seen in the Results section, the scatter plots did not show any significant correlation between the sentiment values and the change of the stock price. This is an indicator that there does not exist a linear relationship between sentiment scores and changes in stock prices. However, there may exist another kind of relationship between the two variables.

The values which can be seen in the scatter plots in section 4.1 vary a lot. However, in some cases, the most positive sentiment values occurred together with the highest rise of the stock price, and the most negative sentiment values sometimes occurred at the same time as the biggest decline of the stock price. This result cannot be interpreted as a trend, because this behavior was not consistent for all of the companies and for all of the sentiment value classes presented in section 3.6.

The figures can tell us something about a potential linear relationship, in the unlikely event that it exists. The slopes of the regression lines in e.g. Figures 4.1 and 4.10 are quite small, indicating that changes in stock prices are hardly affected at all by increases in sentiment scores. In those particular cases, Table 4.1 indicates that the change in stock price amounts to only 2.86% and 3.46%, respectively, of the change in sentiment score (slopes of 0.0286 and 0.0346). In some cases, it would seem that changes in stock price are affected negatively by increased sentiment scores, such as in the case of Volvo (see Table 4.2).

The results are roughly the same regardless of which class is used. One trend that is discernible between classes is that the absolute value of the correlation coefficients is somewhat smaller when averaging hourly sentiment scores than when summing them. Correspondingly, the mean squared error estimates are somewhat larger when averaging hourly sentiment scores as opposed to summing them.

5.2 Limitations

In this section the limitations of the Tweets, the limitations of the Sentiment Analysis method and the limitations of the Regression model will be discussed.

5.2.1 Limitations of the quantity and quality of Tweets

There are some limitations of this bachelor thesis which will be discussed here. First of all, the selected time period consisted of a working week where the stock market was closed on the Monday. Hence, tweet data was gathered during four days, which is 96 hours in total. For Fingerprint Cards there were 586 tweets during these 96 hours, while there were 310 for Volvo and 457 for H&M. This resulted in an average of around 6 tweets per hour for Fingerprint Cards, 3 tweets per hour for Volvo and under 5 tweets per hour for H&M. The low number of Swedish tweets available for each hour may explain the results: with so few tweets for each company during this time period, the results may be affected to some extent by single users rather than by the population at large.
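The hourly averages quoted above follow directly from the tweet counts and the 96-hour collection period. The short calculation below reproduces them; the variable names are chosen for illustration only.

```python
HOURS = 96  # four days of data collection

# Tweet counts per company, as stated in the text above.
tweet_counts = {"Fingerprint Cards": 586, "Volvo": 310, "H&M": 457}

rates = {company: count / HOURS for company, count in tweet_counts.items()}
for company, rate in rates.items():
    print(f"{company}: {rate:.2f} tweets per hour")
```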

In addition to the fact that there were relatively few tweets available to analyze, it appeared that some of them were automatically generated. Upon manually inspecting some tweets, we found that some users had been posting the same tweet over and over again, usually for the purpose of promoting a brand or a product. This affected the hourly sentiment scores, and our results might have been different had no such spam been collected. These tweets are not desired because advertisement, in general, does not reflect public opinion.

In short, these points tell us that the Swedish Twitter flow might not be suitable for sentiment analysis. There may not be enough tweets published to use as a sample to extract reliable information about public sentiment on companies and their products. Also, some tweets are not usable, as they are computer-generated spam.
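One simple way to reduce the influence of repeated promotional tweets is to keep only the first copy of each text per user. The sketch below illustrates that idea; it was not part of this study, the function name and input format are hypothetical, and it only catches exact repeats after normalizing case and whitespace.

```python
def drop_repeated_tweets(tweets):
    """Keep the first occurrence of each (user, normalized text) pair."""
    seen = set()
    kept = []
    for user, text in tweets:
        # Normalize case and whitespace so trivial variations still match.
        key = (user, " ".join(text.lower().split()))
        if key not in seen:
            seen.add(key)
            kept.append((user, text))
    return kept


# Invented sample: the second tweet repeats the first one.
sample = [
    ("shop_bot", "Buy the new phone now!"),
    ("shop_bot", "buy the  new phone NOW!"),
    ("anna", "Buy the new phone now!"),
]
print(len(drop_repeated_tweets(sample)))
```

The same text posted by a different user is kept, since several people sharing a message is not necessarily spam.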

5.2.2 Limitations of the Sentiment Analysis Method

The sentiment analysis method used in this report is a naive approach. Several factors could have improved the sentiment analysis, one of which is including emojis. Moreover, using a more complex sentiment analysis method, which takes more factors into account than just evaluating the tweets on positive and negative words, could have improved the sentiment analysis. Furthermore, the importance of every tweet was not taken into account when evaluating the sentiment values for each hour. For example, a higher like count or a higher retweet count increases the importance of the tweet, because these factors indicate that more users recognize the original message [4].
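As an illustration of how tweet importance could enter the hourly score, the sketch below weights each sentiment value by its engagement. The weighting formula (1 + likes + retweets) is purely an assumption made for this example; it is not taken from [4] and was not used in this study.

```python
def weighted_hourly_score(tweets):
    """Sum sentiment values, each weighted by 1 + likes + retweets.

    tweets: iterable of (sentiment, likes, retweets) tuples for one hour.
    The weight is a hypothetical choice; other weightings are possible.
    """
    return sum(
        sentiment * (1 + likes + retweets)
        for sentiment, likes, retweets in tweets
    )


# Invented sample: one popular positive tweet and one unnoticed negative tweet.
print(weighted_hourly_score([(2, 10, 4), (-1, 0, 0)]))
```

A tweet without likes or retweets keeps a weight of 1, so the unweighted sum is recovered when no tweet has any engagement.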

Most models for public sentiment include the notion of mood states or other more sophisticated public sentiment indicators such as GPOMS, Facebook Gross National Happiness Index, et cetera [4] [7] [5]. These models probably take more parameters into consideration, thus giving a more accurate representation of public sentiment, which can be used to proceed with more reliable tests of relationships between public sentiment and changes in stock prices.

5.2.3 Limitations of the Regression Model

It seems naive to assume that there would exist a linear relation between aggregated sentiment scores and changes in stock prices. Had there existed such a relation, it would be apparent and more readily exploited by stock traders, and it would have been suggested by previous research. If there exists another type of relation between sentiment scores and changes in stock prices, then the linear regression model is not suited to picking that relationship up anyway.

For that reason, the linear regression model might not have been a good choice of model to determine whether public sentiment has a positive effect on stock prices.

5.3 Further Research

For further research, more parameters should be included to be able to find some sort of correlation. Collecting tweet data for a longer time period is suggested, in order to be able to analyze it and to find a correlation, if one exists. When observing the scatter plots for Class 2 sum, for example for Fingerprint Cards, FING B (fig. 4.4) and for Volvo, VOLV B (fig. 4.5), we could see that for quite high sentiment scores (4 and higher for Volvo and 10 and higher for Fingerprint Cards) the stock price seemed to increase in the majority of the cases. In that case, an extremely high sentiment score could be a property used in a predictive model to predict the rise of the stock price. It could be used as a part of a Decision Tree model (introduced in section 2.5.2), predicting the rise of a stock price for very high sentiment values. However, due to the fact that the high sentiment scores occurred during the same hour as the stock rise and that this behavior was not found for Hennes & Mauritz (fig. 4.6), it is not clear whether it is a property that can be used in a general prediction model to predict a stock rise or not. Furthermore, it cannot be excluded that this behavior is a coincidence. Further research on this behavior is recommended.

As discussed above, computer-generated spam and the exclusion of emojis for sentiment analysis purposes may have had an adverse effect on the results of this study. Therefore, spam filtering and the inclusion of emojis are suggested as topics for further studies.
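The observation about high sentiment scores could be checked on a larger dataset by measuring how often the stock price rose in hours whose sentiment score reached a given threshold. The sketch below shows one way to do that; the function name is hypothetical and the data points are invented, not taken from the figures.

```python
def rise_rate_above(points, threshold):
    """Share of hours with score >= threshold in which the price rose.

    points: iterable of (sentiment score, change in stock price) pairs.
    Returns None when no hour reaches the threshold.
    """
    selected = [change for score, change in points if score >= threshold]
    if not selected:
        return None
    return sum(1 for change in selected if change > 0) / len(selected)


# Invented data points for illustration only.
points = [(12, 0.5), (10, 0.2), (11, -0.1), (3, -0.4), (-2, 0.3)]
print(rise_rate_above(points, 10))
```

A rate close to 0.5 over many hours would suggest that the pattern seen in the scatter plots is a coincidence, while a consistently higher rate would support using the threshold as one split in a decision tree.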


Chapter 6

Conclusions

Since the number of Swedish tweets for each company was not sufficiently high, the conclusion from this report was that there was no direct correlation between the sentiments on Twitter and the change of stock price for the three companies. Various factors may have affected the results. The results could be explained by the fact that there may not be enough tweets to be able to find a correlation between the tweet sentiments and the change of stock prices of a company. Other factors that may have affected the results are the sentiment analysis method, the limited data sample size, and the regression model. Using a more advanced sentiment analysis tool could have improved the results. Other studies that use more sophisticated models, which take more variables into account, have shown results that indicate a positive correlation between public sentiment and stock prices [4] [7] [8] [11] [13].


Bibliography

[1] Eugene Fama, 1965, The Behavior of Stock-Market Prices, Journal of Business, Vol. 38 (Issue 1, 1965), pp. 34-105

[2] Burton G. Malkiel, 1973, The Efficient Market Hypothesis and Its Critics, The Journal of Economic Perspectives, Vol. 17 (Winter, 2013) pp. 59-82

[3] Duc Truong Pham and Xing Liu, 1993, Neural Networks for Identification, Prediction and Control, London: Springer London, ISBN: 1-4471-3246-7

[4] Michael Nofer, 2015, The Value of Social Media for Predicting Stock Returns. Preconditions, Instruments and Performance Analysis, Springer Fachmedien Wiesbaden, ISBN: 978-3-09507-9

[5] Juan Piñero-Chousa, Marcos Vizcaíno-Gonzalez and Ada María Pérez-Pico, 2016, Influence of Social Media over the Stock Market, Psychology & Marketing, Vol. 34 (Issue 1, 2017) pp. 101-108

[6] Liu Bing, 2012, Sentiment Analysis and Opinion Mining, San Rafael: Morgan & Claypool Publishers, ISBN: 1-60845-884-9

[7] Johan Bollen, Huina Mao and Xiao-Jun Zeng, 2010, Twitter mood predicts the stock market, Journal of Computational Science, Vol. 2 (Issue 1, 2011) pp. 1-8

[8] Oscar Alsing and Oktay Bahceci, 2015, Stock Market Prediction using Social Media Analysis, Bachelor's Thesis, Kungliga Tekniska Högskolan, Stockholm, Sweden, Available at: http://kth.diva-portal.org/smash/get/diva2:811087/FULLTEXT01.pdf [accessed 14.02.2017]


[9] Chen Chen, 2014, Exploiting Social Media for Stock Market Prediction with Factorization Machine, IEEE/WIC/ACM International Joint Conferences on Web Intelligence, Vol. 2 (Aug., 2014) pp. 142-149

[10] Hamed Al-Rubaiee, Renxi Qiu and Dayou Li, 2015, Analysis of the relationship between Saudi twitter posts and the Saudi stock market, 2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (Dec., 2015) pp. 660-665

[11] Aditya Bhardwaj et al., 2015, Sentiment Analysis for Indian Stock Market Prediction Using Sensex and Nifty, Procedia Computer Science Vol. 70 (2015) pp. 85-91

[12] Girija V. Attergi et al., 2015, Stock market prediction: A big data approach, TENCON 2015 - 2015 IEEE Region 10 Conference (Nov. 2015) pp. 1-5

[13] Nuno Oliveira, Paulo Cortez, Nelson Areal, 2017, The impact of microblogging data for stock market prediction: Using Twitter to predict returns, volatility, trading volume and survey sentiment indices, Expert Systems With Applications Vol. 73 (May, 2017) pp. 125-144

[14] Mehmed Kantardzic, 2011, Data Mining: Concepts, Models, Methods, Algorithms, Hoboken, NJ, USA: John Wiley & Sons, Inc, ISBN: 9780470890455

[15] Erik Cambria et al., 2013, New Avenues in Opinion Mining and Sentiment Analysis, IEEE Intelligent Systems Vol. 28 (Issue 2, 2013) pp. 15-21

[16] Gunnar Blom, 2004, Sannolikhetsteori och statistikteori med tillämpningar, Studentlitteratur AB, ISBN: 9789144024424

[17] Ali Bou Nassif et al., 2016, Guest editorial: special issue on predictive analytics using machine learning, Neural Computing and Applications Vol. 27 (Issue 8, 2016) pp. 2153-2155

[18] George A. F. Seber and Alan J. Lee, 2012, Linear Regression Analysis, 2nd edition, Hoboken: Wiley, ISBN: 0-471-41540-5

[19] Twitter, 2017, REST APIs, URL: https://dev.twitter.com/rest/public [accessed: 06.03.2017]


[20] Twitter4J, 2007, Introduction, URL: http://twitter4j.org/en/ [accessed: 08.03.2017]

[21] PostgreSQL, 2017, About, URL: https://www.postgresql.org/about/ [accessed: 21.03.2017]

[22] pgAdmin, 2017, pgAdmin, PostgreSQL Tools, URL: https://www.pgadmin.org/ [accessed: 21.03.2017]

[23] Fingerprint Cards, 2017, About Fingerprint Cards, URL: https://corporate.fingerprints.com/en/about-fpc/ [accessed: 25.03.2017]

[24] Volvo, 2017, Volvo Cars and the Volvo Group, URL: http://www.volvocars.com/intl/about/our-company/our-company-at-a-glance [accessed: 25.03.2017]

[25] HM, 2017, Market Overview, URL: http://about.hm.com/en/about-us/markets-and-expansion/market-overview.html [accessed: 25.03.2017]

[26] Avanza, 2017, Aktiefiltreraren, URL: https://www.avanza.se/aktier/lista.html [accessed: 26.03.2017]

[27] Mona Dadoun and Daniel Olsson, 2016, Sentiment Classification Techniques Applied to Swedish Tweets Investigating the Effects of translation on Sentiments from Swedish into English, Bachelor's Thesis, Kungliga Tekniska Högskolan, Stockholm, Sweden, Available at: http://kth.diva-portal.org/smash/get/diva2:926472/FULLTEXT01.pdf [accessed 31.03.2017]

[28] Google, 2017, Google Translate, URL: https://translate.google.com/ [accessed: 25.03.2017]

[29] Liu Bing, Hu Minqing and Cheng Junsheng, 2005, Opinion observer: analyzing and comparing opinions on the Web, Proceedings of the 14th international conference on world wide web (May, 2005) pp. 342-351

[30] Mathworks, 2017, MATLAB, URL: https://www.mathworks.com/products/matlab.html [accessed: 04.04.2017]


Appendix A

Scatter Plots

Figure A.1: Scatter plot of sentiment score and stock prices for FING B with a linear regression curve. Class 2 data with averaged sentiment scores.


Figure A.2: Scatter plot of sentiment score and stock prices for VOLV B with a linear regression curve. Class 2 data with averaged sentiment scores.

Figure A.3: Scatter plot of sentiment score and stock prices for HM B with a linear regression curve. Class 2 data with averaged sentiment scores.


Figure A.4: Scatter plot of sentiment score and stock prices for FING B with a linear regression curve. Class 3 data with averaged sentiment scores.

Figure A.5: Scatter plot of sentiment score and stock prices for VOLV B with a linear regression curve. Class 3 data with averaged sentiment scores.


Figure A.6: Scatter plot of sentiment score and stock prices for HM B with a linear regression curve. Class 3 data with averaged sentiment scores.
