
MAKING PREDICTIONS IN HIGHLY VOLATILE CRYPTOCURRENCY MARKETS USING WEB SCRAPING

Guus van Heijningen
Student ID: 01605108

Promotor: Prof. Dr. Dries Benoit

Commissioners: Louis-Philippe Kerkhove, Ekaterina Loginova

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the

degree of Master of Science in Statistical Data Analysis.

Academic year: 2017 - 2018


CONTENTS

Contents

Abstract

1 Introduction

2 Literature Review
2.1 Predicting the Stock Market
2.2 Predicting the Cryptocurrency Market
2.3 Feature Extraction
2.4 Predictive Modeling in Financial Markets
2.5 Research Questions

3 Methodology
3.1 Data
3.2 Methods
3.2.1 Data Extraction by Web Scraping
3.2.2 Subject Identification
3.2.3 NLP Feature Extraction
3.2.4 Predictive Modeling

4 Results
4.1 Descriptive Statistics
4.2 Feature Selection and Algorithm Performance
4.3 Prediction Performance of the Data Sources
4.4 Predicting Different Cryptocurrencies

5 Conclusion and Discussion
5.1 Conclusion
5.2 Discussion

References

Appendix A Model Development Code
A.1 Pseudo-code of the subject extraction algorithm

Appendix B Descriptive Statistics
B.1 Financial data
B.2 Google Trends data
B.3 Sentiment data

Appendix C Results
C.1 Parameter grid search using time series cross validation
C.2 Feature selection


ABSTRACT

In this dissertation, different data sources are scraped from the web to construct a variety of features related to the cryptocurrency market. The data sources consist of financial data from CryptoCompare, search queries from Google Trends and textual data from the popular discussion platform Reddit. NLP methods are implemented for feature engineering on the textual data, in order to extract cryptocurrency subjects and sentiments from it. Cryptocurrency closing prices are converted into directional returns, which serve as the binary outcome variable in the classification exercises. The gathered variables are included individually and in combination to evaluate the prediction performance. The best prediction performance is observed when all data sources are included and the Random Forest algorithm is used to train the prediction model. A prediction accuracy of 63.0%, together with an AUC score of 61.4%, is observed for this optimal model. Further investigation has led to the conclusion that directional returns on popular cryptocurrencies are more predictable than those on less popular cryptocurrencies.


CHAPTER 1

INTRODUCTION

Since the introduction of the first cryptocurrency, named Bitcoin (Nakamoto, 2008), an entire market for cryptocurrencies has emerged within a decade. Especially over the last two years, the market has seen an enormous increase in popularity. Together with this increase in popularity, more money has flowed into the cryptocurrency market and both the supply of and demand for different cryptocurrencies have increased substantially. There are currently around 1,500 different cryptocurrencies available and the entire market capitalization is valued at more than 0.5 trillion USD. Of this market cap, more than 50% is accounted for by Bitcoin and Ethereum alone, two of the most popular coins. Cryptocurrency markets trade 24 hours per day, 7 days per week on a wide variety of online exchange platforms.

Cryptocurrencies are notoriously volatile, making them substantially harder to predict and hedge against than traditional means of investment (assuming that cryptocurrencies can be viewed as a means of investing capital). They are also known to depend on a network effect, and therefore the sentiment about cryptocurrencies is crucial for their sustainability. For novice currencies in particular, this network effect seems crucial to the probability of surviving their introduction to the market (Gandal and Halaburda, 2016). Interesting debates have emerged concerning the valuation of cryptocurrencies and their added value to global society.

The internet is known as an ideal facilitator for these debates through platforms like social media and specialized internet fora. These platforms are usually very accessible and can therefore contain a widespread collection of opinions. The technical novelty, the volatility in monetary value and the low entry level of cryptocurrencies seem to have caused an increase in internet communities that focus on these cryptocurrencies and everything related to them. This results in growing volumes of content, in the form of news articles, blogs, videos and forum posts (Matta et al., 2015). The rich volumes of data and the crucial role this data seems to play create a great opportunity to investigate the predictive value of textual data from internet platforms on the price developments of cryptocurrencies.


The extensive discussions that are being held by people from all over the world, plus the fact that cryptocurrencies are traded every single day, provide a vast stream of available data. This dissertation aims to exploit this data in order to gather additional empirical evidence on the cryptocurrency market. The study combines web scraping, text mining and predictive modeling to make predictions on directional cryptocurrency returns. The recency of the boom in cryptocurrencies implies that research in this domain has remained rather limited in both quantity and scope. This dissertation hopes to add to the literature by using a specific technique for feature engineering, by using different prediction algorithms, and by investigating less traditional cryptocurrencies. Textual documents are scraped from Reddit, one of the biggest internet forums worldwide. This data will serve as the core source from which to extract explanatory features. These features, together with some other control variables that are scraped from CryptoCompare and Google Trends, are used for the development of predictive machine learning models. The directional returns of the cryptocurrencies serve as the outcome variable on which the different models are trained and evaluated.

The dissertation is divided into multiple sections. The next section will review existing literature to identify the gaps and shape different research questions. The second section will go deeper into the methodology used in this study by describing the data and methods. The third section will describe the observed results. Thereafter the conclusion based on the obtained evidence will be provided and the last section will contain the discussion.


CHAPTER 2

LITERATURE REVIEW

2.1 Predicting the Stock Market

Before cryptocurrencies were traded, comparable studies were conducted on other financial markets. Predictive modeling using textual data sources has already been applied with mixed success in traditional stock trading. Features extracted from different textual data sources were used as explanatory variables in the predictive modeling frameworks, and the data sources ranged from breaking news articles to Twitter data in order to predict movements in stock prices. A selection of these articles will be discussed, in order to identify some takeaways from the traditional financial markets for the more novel cryptocurrency market.

Texts from financial breaking news articles were analyzed using different textual representations to discover their predictive value on the S&P 500 index (Schumaker and Chen, 2009). The study investigates the ability to predict discrete stock price values and directional returns using Support Vector Machines. To optimize these predictions, different textual quantification techniques are used, to see which technique is most valuable for the predictions. Results were evaluated over three performance metrics and showed a significant ability to predict stock prices in the short term using these breaking news articles.

Besides the results on breaking news articles, it is observed that public mood from Twitter data, measured using two different feature extraction methodologies, is Granger-causative of the Dow Jones Industrial Average (Bollen et al., 2011; Mittal and Goel, 2012). The aim was to monitor the public mood using tweets, as the authors argue that the public mood could be a proxy for general market sentiment. A change in the market sentiment would be associated with a change in the value of the market index, as a change in market sentiment would initiate buy or sell orders on the stock market. The Granger causality analysis shows that changes in the public mood match shifts in the DJIA values that occur 3 to 4 days later. To be able to incorporate non-linear dependencies, a Self-Organizing Fuzzy Neural Network (SOFNN) is trained to predict DJIA values as well. The prediction performance improves significantly, which supports their prior expectation of a non-linear relation between the public mood and DJIA values.

In order to train the SOFNN, one of the studies uses k-fold sequential cross validation to incorporate the time series dependency of the data (Mittal and Goel, 2012). This methodology results in stronger evidence that the correlation between public mood and DJIA values exists over the entire range of data. Additionally, their implementation of a portfolio management strategy based on the DJIA prediction values results in a substantial profit.

The discussed empirical evidence has strengthened the idea that features from different textual data sources can deliver decent prediction performance in financial markets. The evidence also suggests that non-linear models might be better at capturing the relationship between textual sentiment features and monetary market prices. Taking these findings to the cryptocurrency markets can help in finding a robust methodology to predict directional cryptocurrency returns. The proposed sequential cross validation methodology seems to be a robust way of dealing with time series data in a machine learning setting. Therefore, the methodology is incorporated into this study to deal correctly with the sequential nature of the gathered data.

2.2 Predicting the Cryptocurrency Market

As the popularity of cryptocurrencies, and in particular Bitcoin, increased over the years, more studies shifted their focus from the stock market towards the cryptocurrency market. The revolutionary character of cryptocurrencies makes them a challenging and interesting topic for research. When we compare the cryptocurrency market to the stock market, we can point out an immediate difference between the two. Although households are indirectly the biggest group of stock owners, they mostly own their stocks via mutual funds, pension funds and insurance policy holdings (Grout et al., 2009). These institutional investors do not regularly express their opinion on certain stocks, and when they do, it is most likely not on a centralized platform that is open to everyone. Cryptocurrencies, on the other hand, are predominantly owned directly by household investors (Leinz, 2018). Both cryptocurrencies and social internet platforms have evolved from the internet era, and cryptocurrency stakeholders seem to use the internet as their main platform for discussion about everything related to cryptocurrencies, while the stock market is additionally covered by more traditional media channels like newspapers and paid news services. These differences might create deviations in the observed relationships in the two financial markets. It is interesting to investigate whether the previously discussed dependencies in the stock market are also present in the cryptocurrency market.

Articles by Kaminski (2014) and Matta et al. (2015) come close, in terms of methodologies and research questions, to the research that has been done earlier for the stock market. The studies use Twitter data to analyze relationships between Bitcoin market indicators and Twitter posts containing emotional signals. The studies find significant correlations between emotional tweets and the closing price, trading volume and intra-day price spread of Bitcoin. However, a dynamic Granger causality analysis does not confirm a causal effect of emotional tweets on Bitcoin market values. These results might suggest that the relationships between Twitter sentiment and prices that were observed within the stock market do not necessarily occur within the Bitcoin market.

As preliminary research did not directly find empirical evidence that sentiments from Twitter data have predictive value for Bitcoin prices, the impact of different social media channels on Bitcoin performance has been compared extensively (Mai et al., 2015). To make the comparison, Twitter data as well as data from a social media platform called Bitcointalk.org, an internet forum solely focusing on Bitcoin, is obtained. The authors extract similar features from both sources to use in their statistical models. The results confirm that the sentiments in the forum messages are more telling indicators of future Bitcoin returns than the Twitter messages. The Granger causality test indicates that forum sentiment of the previous day Granger-causes changes in future Bitcoin prices, whereas there is no Granger causality from Twitter sentiment to daily Bitcoin returns. These findings support the usage of forum post data over Twitter data for prediction modeling. Therefore, this study uses forum data to further investigate whether data from another forum and over another time period yields similar results.

Additionally, the relationship of Google search queries with Bitcoin trading volumes has been investigated to identify the impact of search frequencies on cryptocurrency markets (Matta et al., 2015). The results reveal that the frequency of Bitcoin search queries might be a good explanatory variable in predicting Bitcoin trading volumes, as a significant Granger causality between the two is observed. Earlier research that investigated a similar relationship, but on Bitcoin prices instead of trading volumes, argues that as the price of the digital currency is solely driven by the investors' faith in its perpetual growth, investor sentiment becomes a crucial variable of its value (Kristoufek, 2013). To represent this investor sentiment, the study also uses search queries from Google Trends as a proxy of attention and interest in the currency. The results show that a significant bidirectional relationship exists, which suggests that search queries influence the price, but also that prices influence the search queries. The study adds as a comment that this feedback relationship is related to the creation of financial bubbles, as an increase in price leads to an increase in interest, which in its turn leads to another increase in price. The same relation holds for a price decrease, which represents the burst of a bubble. To control for these bubble-and-burst movements, it might be valuable to add search query frequencies to the prediction models.

Altogether, multiple studies have tried to investigate similar research questions in the cryptocurrency markets as have been examined earlier in more traditional financial markets. The assumption that different relationships might be detected in the cryptocurrency market seems to be confirmed, as sentiments from Twitter data do not yield the same results as in the stock market. Sentiments from forum data, on the other hand, do seem to have predictive value for future cryptocurrency prices. Besides that, search queries provided by Google Trends appear to play an important role in the movements of prices and volumes in cryptocurrency markets. To optimize the ability to predict future price movements, both data sources, Google Trends and Reddit, are added to the study to extract explanatory features from.

2.3 Feature Extraction

A major task in almost all of the discussed literature is the extraction of features from textual data. Textual data is known to be unstructured and hard to interpret without having an idea about the context. This particular challenge has evolved into the field of Natural Language Processing (NLP). The research field has developed multiple methods to turn unstructured text data into a more structured format that is easier to digest by statistical models. Related NLP topics like sentiment analysis and topic modeling are regularly discussed in news articles and research papers (Socher, 2018).

In research related to this study, sentiment analysis is implemented in different ways. Most of the earlier discussed articles aim to find empirical evidence on the relationship between financial metrics and the extracted sentiment features. One of the proposed methods is to convert the textual data into different sentiment groups that should represent the current mood of the public. Mood scores related to sentiments like happy, calm, alert and kind are obtained over the investigated time periods to use as predictors for stock movements (Bollen et al., 2011; Mittal and Goel, 2012). In a similar research setting it has also been proposed to calculate numeric vectors, called word embeddings, to represent the textual documents in a standardized way. These vectors are in that case used as the predictors of the outcome variables (Schumaker and Chen, 2009). Both of these methodologies result in multiple different estimators, which could allow a wider spectrum of sentiments to be captured. The disadvantage of multiple estimators in time series is that it might result in an increase in the amount of noise when the lagged values of these estimators are added to the parameter space. This threat is especially present when some of the estimators are redundant in explaining the outcome variable. Preliminary results on the mood scores exposed exactly this problem, when only one of the multiple mood scores was found to be of predictive value for stock market movements.

A more uni-dimensional way of representing the sentiment is by assigning a score between -1 and 1 to a particular text document, based on its content. Scores higher and lower than zero represent positive and negative text documents, respectively. The magnitude of the score represents how far the text document is perceived to lean to one of the extremes. Therewith, documents with a score around zero are considered to be neutral in sentiment. Different methodologies have been developed to perform such sentiment scoring. Some studies develop their own sentiment scoring algorithm (Mai et al., 2015; Kaminski, 2014), where others use a predefined rule-based algorithm (Matta et al., 2015), or a specialized piece of software (Laskowski and Kim, 2016). The choice among the available algorithms might have an impact on the ability to capture the true sentiment in a textual document.

A research paper by Ribeiro et al. (2016) investigates the wide variety of sentiment scoring methodologies that have been developed and are available as open-source projects. One of the methodologies that scores best is VADER, which seems very well able to classify sentences into the three sentiment buckets positive, negative and neutral. Besides, this rule-based algorithm is able to label more than 80% of the sentences served to it, which is more than most of the investigated algorithms achieve. This is a useful property, especially when the textual data is not abundantly available through time. The VADER algorithm is readily available as a Python module. Altogether, this made VADER the most promising sentiment scoring algorithm to use in this study. Results from the usage of this particular algorithm can add empirical evidence to the existing literature, as it does not seem to have been used in earlier research to score comments related to cryptocurrencies.

Another challenge regarding the textual data is to extract one or more subjects from a sentence and match these with the list of investigated cryptocurrencies. The social discussion platforms that are targeted in previous studies are all focused on one particular cryptocurrency, which does not require extracting these subjects from the comments that are posted to the platform. However, studies that used Twitter data for their research were obliged to extract only the relevant tweets from the social media network. The advantage of Twitter is that it uses hashtags as a method to label the messages that are spread over the network. This property helps in the identification of text messages related to cryptocurrencies within the immense quantity of tweets that are shared each day. As this property is not present on the targeted internet forum, a methodology has to be implemented to extract the cryptocurrency subjects from the analyzed text documents.

There does not seem to be a readily available methodology that is able to extract one or multiple subjects from a sentence and match these with a list of provided names. Therefore, this study develops its own rule-based methodology to execute this task. The aim of this methodology is to extract and match these subjects as accurately as possible, in order to minimize the loss in data quality and therewith research validity.

2.4 Predictive Modeling in Financial Markets

Over the last decade, a rise in the investigation of automated textual analysis has uncovered a variety of relationships between publicly generated content and movements on the financial markets. Several papers gathered evidence of a sequential effect of such content on the volume and price of an exchange-traded asset or index (Kristoufek, 2013; Mai et al., 2015; Matta et al., 2015). Empirical evidence from these research papers has laid out a strong foundation for the development of prediction models that exploit these effects. Certain studies already tried to exploit these relationships by training a prediction model for the stock market using these variables (Schumaker and Chen, 2009; Bollen et al., 2011; Mittal and Goel, 2012). However, such prediction models do not seem to be amply available in the existing literature focusing on the cryptocurrency market. This study aims to fill this gap by comparing multiple prediction models that are trained to predict directional returns on cryptocurrency prices.

To capture the relationships between the scraped features and the cryptocurrency prices as accurately as possible, it is important which type of prediction algorithm is used. Prediction algorithms use different mathematical optimization techniques to approximate the effect of their features on the outcome variable. Where some algorithms are known to be good at capturing linear effects, others are found to be better at capturing non-linear effects. As data shapes and relationships differ per research subject, there is no one-model-fits-all solution. To find out which algorithm fits the prediction problem in this study best, it helps to review earlier studies with comparable prediction problems.

The earlier discussed study by Schumaker and Chen (2009) uses Support Vector Machine (SVM) models for its prediction algorithms. The article does not state which kernel it uses for its SVM models. However, it is assumed that it uses a linear kernel, as this corresponds with the kernel originally used when the algorithm was developed (Boser et al., 1992). SVM allows for non-linear kernels like the polynomial or radial basis function kernel. When non-linear dependencies are expected, these kernels tend to perform better in prediction compared to the linear kernel. The usage of SVM models resulted in significant prediction performance on directional returns (58.0%, p-value < 0.05). However, the model is trained using word embeddings as predictors, which is different from the features used in this study. When a similar model was developed over a different time range, using mood scores instead as its estimators, comparable accuracy results (60%) were obtained (Mittal and Goel, 2012). SVM algorithms with a linear kernel seem sufficiently able to predict movements in financial markets, even when different features are extracted from the textual documents.

To investigate the possibility of non-linear dependencies between sentiments from text documents and financial market movements, several studies investigate the use of non-linear models to develop their prediction models. When a Self-Organizing Fuzzy Neural Network model was used, a significant directional prediction accuracy (86.7%, p < 0.05) was observed (Bollen et al., 2011). Additionally, when the same algorithm was used over a different time period, similar prediction results (75.56%) were obtained (Mittal and Goel, 2012). These findings support the assumption that sentiment scores maintain a non-linear relationship with financial market prices, especially when the economic significance of a directional prediction accuracy of around 80% is considered.

As discussed, this study aims to fill a gap in the available literature by developing prediction models for the cryptocurrency market. Preliminary research points out that non-linear dependencies between sentiment features and financial market movements are likely to be found. However, linear prediction models have also proven able to predict these market movements significantly. As these findings were obtained from different financial markets, other dependencies could be present in the data that is investigated in this study. Therefore, both linear and non-linear prediction models will be evaluated to compare their prediction performance.


2.5 Research Questions

From the earlier identified gaps in the literature, multiple research questions can be formulated. This study aims to answer these research questions and therewith add relevant empirical evidence to the literature. The research questions that are proposed are:

1. Which data preprocessing and prediction algorithm provides the best performance in predicting directional returns on cryptocurrencies?

2. What type of data source is best able to predict directional price movements on the cryptocurrency market?

3. Is there a difference in prediction performance between established and novel cryptocurrencies?

The first research question mainly focuses on the perceived dependencies of the explanatory time series variables on cryptocurrency price movements. As discussed, preliminary research has identified differences in the predictability of market movements using linear and non-linear algorithms. The time series nature of the data plus the usage of multiple variables makes it harder to identify the true nature of the dependencies by reviewing the descriptive statistics. To deal with these challenges, the prediction algorithm is used both for feature selection and for prediction modeling. Feature selection is conducted first to decrease the number of variables. Due to the inclusion of the lagged variables, a larger set of independent variables is obtained. The predictive value of each of these lags is likely to vary, as was observed in preliminary research. It is perceived to be beneficial to remove the redundant independent variables, in order to decrease the amount of noise in the data. The feature selection implementation will therewith select the most explanatory variables plus their optimal lag, based on their prediction performance. After the feature selection, the same algorithm will be used to train a prediction model and test its ability to predict directional cryptocurrency returns. By comparing different models that are based on different algorithms, more evidence can be found to support the assumptions about which relationships might be present in the cryptocurrency market. To test the research questions, the prediction performance of the different models will be compared to evaluate their ability to capture the assumed dependencies.

For the second research question, the focus is mainly put on the different data sources that will be used to make the cryptocurrency predictions. As was found in preliminary research, the different sources of available data about cryptocurrencies carry different informational value for predicting cryptocurrency returns. Part of this study is to enrich the available empirical evidence on different types of data sources and their ability to carry forward-looking information. In order to investigate the informational value of these different data sources, multiple models will be tested that are specifically developed to predict directional cryptocurrency returns. These models will differ in the data sources that are used to construct the available features within the model. To compare the value of the different data sources, these models will be compared on their ability to predict cryptocurrency returns and their variability in doing so. The data sources that are considered in this study are financial cryptocurrency metrics, Google Trends data and sentiment features extracted from forum data that is scraped from Reddit. To identify their individual influence on the predictability of directional cryptocurrency price movements, their explanatory variables will be put in and out of the prediction models one by one. Additionally, combinations of the different data sources will be tested to identify their joint prediction performance.

To investigate the third research question, a differentiation is made between the investigated cryptocurrencies. The differentiation is based on the market capitalization of these cryptocurrencies. As cryptocurrencies are regularly regarded as investment vehicles, next to serving a purpose as a currency, market capitalization is considered to be a quantification of the popularity of a particular cryptocurrency. As discussed earlier, it is possible and quite likely that the markets for the more popular coins are more efficient than those for the less popular coins. One argument for this hypothesis is that coins are better covered by the media and on internet platforms as they increase in popularity. When the developments of a cryptocurrency are better covered, the market becomes more transparent and the value of the cryptocurrency will better represent the view of the market participants. The implication of this development is that it becomes less likely that newly obtained information is not yet reflected in the market price of the cryptocurrency. New information would be directly priced into the more popular coins, which makes it harder to predict their directional return on a daily basis. It is therefore expected that the prediction performance on less popular coins will be better than on popular coins. This hypothesis is in line with the well-known Efficient Market Hypothesis (Malkiel and Fama, 1970) that was developed earlier for the more traditional financial markets. To test the hypothesis, models based on the data of different cryptocurrencies are compared, in order to test the difference in prediction performance based on market capitalization.


CHAPTER 3

METHODOLOGY

The central focus in answering the research questions is the development of different models that are able to predict directional cryptocurrency returns as accurately as possible. The performance of the resulting prediction models will be used as the empirical evidence on the stated hypotheses. The different prediction models should be able to shed light on the relationships that exist around cryptocurrency prices. In order to investigate the proposed research questions, a thorough analysis is conducted on the available data using multiple different methods, which will be discussed in this section.

3.1 Data

As discussed, multiple data sources will be tapped to get a variety of data that potentially helps in predicting cryptocurrency price directions. The data that will be used to develop the prediction models consists of financial data, Google Trends data and data from a popular online discussion platform named Reddit. The data spans a full year and this time range is selected to contain a variety of market movements. A representation of the full data architecture is presented in figure 3.1.

Financial data is fetched from an API supplied by cryptocompare.com, a website that monitors the market as it is connected to multiple cryptocurrency exchanges. The data that is obtained from this API consists of the daily opening and closing prices, the daily high and low prices and the trading volume of more than 1,500 different cryptocurrencies. Closing prices will be used as dependent variables by calculating daily returns and directional returns. The remaining financial data will be used to construct independent variables by using their lagged values.


Figure 3.1: Data architecture of the full study


Google Trends keeps track of the frequency with which terms are searched on Google's search engine. As observed by Kristoufek (2013), search queries are associated with changes in cryptocurrency prices and vice versa. Google Trends data will be obtained by using a Python module called Pytrends that is designed to request Trends data from the Google Trends website. The data will be obtained for the full research period and the search frequency data is provided in a normalized way, between 0 and 100. Lagged differences of the normalized search queries will be constructed to serve as explanatory variables within the prediction models.

The largest data source in this study will be the data obtained from the popular online discussion platform Reddit. This website is a very large platform where a wide range of topics can be discussed under so-called subreddits. Each subreddit is aimed at a specific topic and mainly hosts discussions related to this topic. The platform uses a tree-like structure for its threads and the comments that are placed under these threads. For this study the CryptoCurrency subreddit, which can be found at https://www.reddit.com/r/CryptoCurrency/, is used to obtain the textual data. The data is obtained through an API developed by a third party called Pushshift.io. This API was preferred over Reddit's own API, as the API by Reddit did not allow gathering data over a specific time range. Pushshift did add this feature to its API, which gave it a critical advantage over Reddit's own API. However, as Pushshift developed the API as a third party, there is a risk that the data quality is affected by the fact that Pushshift does not own the original data as it is provided to Reddit. After fetching the threads and comments data from Pushshift, a total of more than 2 million comments is obtained over the full year period that is studied. Besides the body of the comments, additional data about the comments and threads is supplied. Variables that are gathered are thread titles, the voting scores on comments and threads, comment indexes, post indexes and comment parent indexes. Using the comment parent index, the comment tree structure of a thread can be reconstructed by matching the parent index with the comment and post indexes.

3.2 Methods

All the data in this study is scraped from the internet, where feature extraction from various sources will shape the dependent and independent variables for further investigation. The time series nature of the data requires identifying the optimal lag during the prediction analysis. The prediction analysis is conducted using logistic regression and Random Forest classification, to predict whether future prices will go up or down for the studied cryptocurrencies. Prediction models are trained on directional accuracy using a time-series-robust cross-validation method called k-fold sequential cross validation. The methods for these parts of the study will be discussed to provide a deeper understanding of the analyses that are conducted.
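As an illustration of how such a sequential cross-validation loop can be set up, the sketch below uses scikit-learn's TimeSeriesSplit, which generates forward-chaining folds in which every validation fold lies after its training window. The feature matrix and target are synthetic stand-ins, and the sketch is only an illustration of the scheme, not a transcript of the study's implementation.

```python
# Minimal sketch of k-fold sequential (forward-chaining) cross validation.
# X and y are synthetic stand-ins for the real features and directional target.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(365, 8))            # one year of daily feature vectors
y = (rng.random(365) > 0.5).astype(int)  # 1 = price went up, 0 = down or flat

def sequential_cv_accuracy(model, X, y, n_splits=5):
    """Fit on each expanding training window, score on the fold that follows it."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))

for name, model in {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=1),
}.items():
    print(name, sequential_cv_accuracy(model, X, y))
```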

3.2.1 Data Extraction by Web Scraping

Part of the study is to extract different data sources from the internet to use for further analysis. As discussed, the data that is extracted can be subdivided into three types: financial data, Google Trends data and textual data. Before web scraping is started, a time range is specified to extract the data for, and all data extraction happens based on this time range. Part of the study is to investigate if there is a difference in predictability between more and less popular cryptocurrencies. Therefore, data will be extracted for multiple different coins. Less popular coins are assumed to be less covered on Reddit, and if data is not sufficiently available for a certain coin, it will be hard to construct reliable prediction models based on this data. To minimize this risk, it is decided to extract data for the 50 largest cryptocurrencies, based on market capitalization. If data is still not sufficiently available for one of these cryptocurrencies, it might be excluded from the analysis as well. Another observed data quality issue is the re-branding of a cryptocurrency during the studied time range. There are cases where the developers of a cryptocurrency decide to re-brand their currency, which changes the name of the currency. This is a problem as the Google Trends and Reddit data will not match the name of the currency for the full period. Therefore it is decided to exclude these coins from the analysis as well.

To extract the data in a replicable way, extraction pipelines are constructed that only require a date range to be provided as a parameter. The first part of this pipeline is the extraction of a list of the 50 largest cryptocurrencies based on market capitalization. The API of cryptocompare.com provides a wide range of financial metrics for all the available cryptocurrencies. This API also has the ability to extract historical data on a daily basis, which is required. What this API does not provide is the current market capitalization of the cryptocurrencies, which means it is not directly possible to select the 50 largest coins. To overcome this issue, market capitalization data is extracted from https://coinmarketcap.com/ as their API does provide this possibility. This list of the 50 largest coins is further used to extract financial data and data from Google Trends, and for subject extraction on the Reddit comments and threads.
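The exact request used to build this top-50 list is not given in the text. The sketch below only illustrates the idea, under the assumption that the public CoinMarketCap ticker endpoint as it existed during the studied period is used; that endpoint has since been retired and the current CoinMarketCap API requires an API key and different URLs.

```python
# Illustrative sketch of building the top-50 list by market capitalization.
# The endpoint is an assumption: it reflects the public CoinMarketCap v1
# ticker API from the studied period, which is no longer available.
import requests

def top_coins_by_market_cap(n=50):
    response = requests.get("https://api.coinmarketcap.com/v1/ticker/",
                            params={"limit": n}, timeout=30)
    response.raise_for_status()
    # Keep the name and ticker symbol of each coin; both are reused later
    # for the Google Trends queries and the Reddit subject extraction.
    return [(coin["name"], coin["symbol"]) for coin in response.json()]
```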

Financial data is obtained by requesting daily historical data over the specified time range, for the selected coins, using the CryptoCompare API. This data will serve both as dependent and independent variables in further analysis. The sequential nature of the data will be used to create lagged variables, and the returns on the closing prices of the cryptocurrencies will serve as predictors, together with trading volumes and price spreads. The prediction models will be trained using directional returns that are based on the closing prices. An upward movement in the closing price forms the positive class, while no movement or a downward movement is considered the negative class. This method creates a classification problem, which can be evaluated alongside preliminary research where a similar methodology was used. By transforming it into a classification problem, it becomes easier to evaluate the significance of the prediction performance. The evaluation benchmark can simply be chance, where you flip a coin and let the result decide whether to bet on an upward or downward return for the next day.
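A minimal sketch of this step is given below: daily OHLCV data is fetched for one coin and converted into the binary directional target plus a few lagged predictors. The endpoint and field names follow the publicly documented CryptoCompare histoday call and should be read as assumptions rather than a literal transcript of the study's pipeline.

```python
# Sketch of the financial-data step: fetch daily OHLCV from the CryptoCompare
# histoday endpoint and derive the directional target and lagged predictors.
# Endpoint and field names are assumptions based on the public documentation.
import pandas as pd
import requests

def daily_history(symbol, days=365, currency="USD"):
    response = requests.get(
        "https://min-api.cryptocompare.com/data/histoday",
        params={"fsym": symbol, "tsym": currency, "limit": days},
        timeout=30,
    )
    response.raise_for_status()
    frame = pd.DataFrame(response.json()["Data"])
    frame["date"] = pd.to_datetime(frame["time"], unit="s")
    return frame.set_index("date")

btc = daily_history("BTC")
btc["return"] = btc["close"].pct_change()
btc["direction"] = (btc["return"] > 0).astype(int)   # positive class: price went up
btc["spread"] = btc["high"] - btc["low"]

# Lagged predictors, so that day t is predicted from information up to day t-1, t-2, ...
for lag in (1, 2, 3):
    for col in ("return", "volumeto", "spread"):
        btc[f"{col}_lag{lag}"] = btc[col].shift(lag)
```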

Google Trends data is obtained in a similar fashion as the financial data. The list of the 50 largest coins is used to gather normalized search frequencies via the Pytrends module in Python (Hogue and DeWilde, 2014). As Google Trends does not provide an official API to request this data, this open-source Python module was developed to scrape the required data from the Google Trends website. The function requires a date range and a string to be specified, and returns the normalized queries for the specified string over the date range. This normalized interest is provided on a daily granularity with values ranging between 0 and 100. The date range is similar to the one specified for the other APIs, covering a full year. Again, the sequential nature of the change in search queries for particular cryptocurrencies is used to create independent variables. As the goal is to create prediction models, historical values will be used in these models to possibly capture a change in price as a reaction to a change in the measured search frequencies. This relationship relies on the assumption that the search frequencies from the Google search engine are a proxy for the interest of investors in the analyzed cryptocurrencies. An increase in the search frequency would be associated with an increase in investor attention that can drive the price up, as demand for the currency increases.
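A minimal sketch of this step with the Pytrends module is shown below; the keyword and date range are illustrative, and the real pipeline loops over all coin names from the top-50 list.

```python
# Sketch of the Google Trends step using the Pytrends module. Note that for
# long timeframes Google Trends may return coarser-than-daily granularity,
# in which case the year has to be requested in shorter chunks.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["Bitcoin"], timeframe="2017-04-01 2018-03-31")
trends = pytrends.interest_over_time()        # normalized values between 0 and 100

# Lagged differences of the normalized search interest serve as predictors.
trends["bitcoin_diff"] = trends["Bitcoin"].diff()
for lag in (1, 2, 3):
    trends[f"bitcoin_diff_lag{lag}"] = trends["bitcoin_diff"].shift(lag)
```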

The most extensive data is gathered from Reddit. This data source provides a substantial amount of data about threads and their corresponding comments. The challenge of this data is to extract meaningful features that could represent the market sentiment of cryptocurrency investors. Extracting the data itself can also be described as quite a challenge, as it is such a vast amount of data. The Pushshift API does not provide an endpoint where all the required data can be extracted at once, which creates an additional challenge. The method used here is that first all the threads submitted within the specified date range are requested using the submissions endpoint. This provides the basis to further extract all the comments that are placed under these threads. Secondly, a filter is applied to remove all the threads that did not receive any comments. From the resulting dataset, the IDs of these threads are used to gather all the comment IDs associated with each thread. Lastly, all the individual comments are fetched from another endpoint of the API. The tree structure of threads and comments that is obtained in this way is later used to extract the cryptocurrency subject of a specific comment. This way of extracting the data is implemented to minimize the amount of missing data, as each comment is guaranteed to be part of one of the obtained threads. In case missing data is observed, it might be necessary to implement a data imputation technique in order to maintain the sequential nature of the data. Gathering the textual data using a robust method paves the way for further feature extraction, which will form a core part of the explanatory variables. As there are strict request limits on the API, fetching this substantial amount of data is quite time-intensive.
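The three extraction steps can be sketched as follows. The endpoint paths and parameters reflect the Pushshift documentation from the studied period and should be treated as assumptions, since the service's interface and availability have changed since then.

```python
# Sketch of the Reddit extraction via the Pushshift API, following the three
# steps described above. Endpoints are assumptions based on the Pushshift
# documentation from the studied period.
import time
import requests

BASE = "https://api.pushshift.io/reddit"

def get_json(url, params=None):
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    time.sleep(1)                       # stay within the API request limits
    return response.json()["data"]

def fetch_threads(after, before, subreddit="CryptoCurrency", size=500):
    """Step 1: submissions in the date range (epoch seconds); the real
    pipeline pages through the full range in batches."""
    return get_json(f"{BASE}/search/submission/",
                    {"subreddit": subreddit, "after": after,
                     "before": before, "size": size})

def fetch_comments(thread):
    """Steps 2-3: skip empty threads, then fetch their comments by ID."""
    if thread.get("num_comments", 0) == 0:
        return []
    comment_ids = get_json(f"{BASE}/submission/comment_ids/{thread['id']}")
    if not comment_ids:
        return []
    return get_json(f"{BASE}/search/comment/", {"ids": ",".join(comment_ids)})
```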

3.2.2 Subject Identification

The CryptoCurrency subreddit is open for discussion of any possible cryptocurrency. Therefore, the data that is obtained is not yet labeled with the cryptocurrency it discusses when it is retrieved. Finding the correct subject of each comment is a crucial task for the analysis. Subject labeling of the comments and threads is necessary in order to assign the later extracted sentiment features to a certain cryptocurrency. When the comments are not labeled correctly, the quality of the data cannot be guaranteed and interpretations can mismatch the true relationships. The aim will be to maintain data quality while preserving a sufficient amount of data for further analysis.

The eventual subject extraction pipeline is constructed in the following way. From the list of the 50 largest coins, all the cryptocurrency names and ticker symbols are obtained. Some cryptocurrencies are usually referred to by one of the words within their longer name; for instance, ICON Project is generally discussed as ICON. Therefore, the names that consist of multiple words are split and put into a separate list. Within this list, words that also exist in other cryptocurrency names, like the words token, coin or Bitcoin, create duplicates and are deleted from the list as they cannot be used as unique identifiers. This results in a list of words that can be used as unique identifiers for coin names that consist of multiple words. To continue, all the available comments and post titles are tokenized into words and tagged using a part-of-speech tagger. Both the tokenization and the part-of-speech tagging are done using the Natural Language Toolkit (NLTK) module in Python (Bird, 2017). The resulting tags are used to filter out the nouns, which are then matched with the cryptocurrency names and ticker symbols. This results in a list of subjects for each observation, associated with the comment or post title that is analyzed. Some observations contain none of the cryptocurrency names from the list, while others contain several.
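A minimal sketch of this matching step is given below, using the NLTK tokenizer and part-of-speech tagger; the coin dictionary is a small illustrative subset of the real top-50 list.

```python
# Sketch of the noun-based subject matching with NLTK. Requires the NLTK data
# packages: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger").
import nltk

coins = {"Bitcoin": "BTC", "Ethereum": "ETH", "Ripple": "XRP", "ICON": "ICX"}
identifiers = {name.lower(): name for name in coins}                      # names
identifiers.update({sym.lower(): name for name, sym in coins.items()})    # tickers

def extract_subjects(text):
    """Return the coins whose name or ticker symbol appears as a noun in the text."""
    tokens = nltk.word_tokenize(text)
    nouns = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    return {identifiers[n.lower()] for n in nouns if n.lower() in identifiers}

extract_subjects("ETH looks stronger than Ripple today")   # {'Ethereum', 'Ripple'}
```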


When the subjects are extracted for each case, the comment tree structure that is present on Reddit is used to further assign subjects to comments for which no particular subject was identified. The tree structure originates from the initial thread that is placed on the CryptoCurrency subreddit. The title of the thread is already a good indication of the subject of the comments that will follow. The comments that are placed directly under the thread form the first branches, and these branches continue as comments are placed under this first layer of comments. This creates the tree structure where each comment has a so-called parent, which identifies under which comment the concerning comment is placed. Therewith it is possible to move from the bottom of a branch all the way up to the start of the thread. The subjects that are discussed higher up the hierarchy of a thread are likely to be the point of discussion for the comments that follow underneath. This assumption is used to assign subjects to comments for which no subject was obtained from the comment itself: the algorithm looks further up the hierarchy to assign the subject of parent comments to the concerning comment. This increases the ability to identify subjects for all the retrieved observations. A disadvantage of this method is that it is possible to mis-classify comments by assuming they discuss the same subject as their parent comment, while this is not the case. It should be kept in mind that this threat stays present when this automated identification pipeline is implemented, although it is assumed that the risk is minimized due to the large volume of comments. When the comments are aggregated on a daily level, the classified comments are likely to be dominated by correctly classified comments, which diminishes the effect of the mis-classified comments.
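This fallback step can be sketched as a simple walk up the parent chain, as below; the field names used here (id, parent_id, subjects, title_subjects) are assumptions about how the fetched records are stored, not the study's actual data model.

```python
# Sketch of inheriting a subject from higher up the comment tree. The field
# names are assumptions about how the fetched Reddit records are stored.
def inherit_subjects(comment, comments_by_id, thread):
    """Walk up the parent chain until a comment (or the thread title) has subjects."""
    current = comment
    while True:
        if current["subjects"]:                      # subjects found in this text
            return current["subjects"]
        parent = comments_by_id.get(current["parent_id"])
        if parent is None:                           # top of the branch reached
            return thread["title_subjects"]
        current = parent
```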

3.2.3 NLP Feature Extraction

The data that is obtained from Reddit contains a great amount of information. This information is captured within the semantics of the comments and thread titles. The NLP research field has evolved to create automated processes for extracting information from textual representations of language. Different methodologies arose from the field to transform unstructured textual representations into structured output that can further be used for analysis. Sentiment analysis is implemented within this study to extract structured features from the unstructured textual data.

The sentiment analysis is used to extract a sentiment score from the fetched comments. This sentiment score ranges between -1 and 1, where -1 represents the most negative comments and 1 represents the most positive. To obtain these sentiment scores, the rule-based Python module named VADER is implemented (Gilbert, 2014). This algorithm is run over the 2 million comments to obtain a score for every comment individually, which results in 2 million sentiment scores for the entire dataset. The distribution of identified sentiments is presented in figure 3.2. A large share of the comments is assigned a neutral score, around 700,000 in total, which is close to 33% of all the comments. The remaining comments are separated into negative and positive comments based on their assigned score. It can be observed that comments are assigned positive scores more often than negative ones: the resulting number of positive comments is 903,237, compared to 436,859 negative comments.
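A minimal sketch of this scoring step is shown below, using one common way of calling the VADER module; the compound score is the single value between -1 and 1 that is used here.

```python
# Sketch of the sentiment step with VADER; the compound score is the single
# value between -1 and 1 assigned to each comment.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_score(text):
    return analyzer.polarity_scores(text)["compound"]

sentiment_score("Ethereum is looking great this week!")   # > 0, positive
sentiment_score("Another exchange got hacked, terrible")  # < 0, negative
```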

Figure 3.2: Frequencies of observed sentiment scores; the y-axis is cut off at 100,000 for a better overview

When the sentiment scores are obtained, the data is run through the subject identification pipeline that was discussed earlier. The observations for which no cryptocurrency subject could be identified are removed from the dataset. For the cases in which multiple subjects were identified, the obtained sentiment score is assigned to all of the identified subjects. The assumption is thus made that when there are multiple subjects identified in a comment, the calculated sentiment score refers to all of these cryptocurrencies. The threat of this assumption is that some cryptocurrencies might be assigned a sentiment score that does not fully relate to the context of the particular comment. A method to reduce this threat could be to first tokenize the comments into sentences and consider each sentence as an individual case. This would increase the dataset substantially, as comments frequently consist of multiple sentences. Due to the computational costs of this increase, it was not possible to implement this method in this study.

The subject identification algorithm is able to extract 783,458 subjects from the fetched comments. The distribution of the ten most extracted subjects is presented in figure 3.3. The distribution shows that Bitcoin is by a substantial margin the most frequently discussed cryptocurrency on the platform, which is in line with its expected popularity. The high rank of Binance Coin is somewhat unexpected, as it is ranked 16th in market capitalization in the extracted list of cryptocurrencies. After further investigation it is discovered that Binance is also the name of the largest cryptocurrency exchange in the market (CryptoCoinCharts, 2018). A flaw of the NLP methods that are used is that they cannot identify whether the cryptocurrency or the exchange is discussed when Binance is mentioned in a comment. This might increase the gap between the observed and the true number of relevant comments. When the gap gets substantially large, as seems to be the case here, the quality of the data cannot be guaranteed. Therefore, Binance Coin is not used for further analysis.

Additionally, it is observed that only a small number of coins are discussed in high numbers. This reduced coverage increases the threat of incomplete sequential data. It is therefore decided to investigate only Bitcoin, Ethereum, IOTA, NEO, Litecoin and Ripple, and to check whether they are adequately covered over the full time range. Some cryptocurrencies have been subject to re-branding, or might have been introduced later than the start of the investigated time range. These cryptocurrencies are not eligible for further analysis and are dropped. When the distribution of comments is analyzed for this selection, only Bitcoin, Ethereum, Litecoin and Ripple appear adequately covered over the full time range. For the last two of these cryptocurrencies there are some days at the start of the time range without any comments, but as these periods alternate with days that do have sentiment data, they are included for further analysis.

Figure 3.3: Distribution of extracted subjects

The resulting dataset consists of the comments for which the algorithm was able to extract at least one subject, together with their sentiment scores and some other meta variables. The next challenge is to aggregate this data into a time series format, so it can be merged with the financial and Google Trends data. With the aggregation of the individual comments there is an inevitable loss in the variability of the data. A way to decrease this loss would be to take a finer granularity than a day, for instance hourly or per 15 minutes. However, the financial and Google Trends data did not allow this directly, as their most frequent availability was daily. When the sentiment data is aggregated to daily observations, the resulting features are the total number of comments, the number of positive comments and the number of negative comments. The counts of positive and negative comments are also combined into a ratio, representing the relative amount of positive over negative comments, as illustrated in the sketch below. These variables form the sentiment features extracted from the textual data scraped from Reddit.
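A minimal pandas sketch of this daily aggregation step is given below; the column names (date, subject, sentiment) are assumptions for the example rather than the exact names used in this study.

import numpy as np
import pandas as pd

def daily_sentiment_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-comment sentiment scores into daily features per subject."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["positive"] = (df["sentiment"] > 0).astype(int)
    df["negative"] = (df["sentiment"] < 0).astype(int)
    daily = (df.groupby(["subject", pd.Grouper(key="date", freq="D")])
               .agg(total_comments=("sentiment", "size"),
                    total_positive=("positive", "sum"),
                    total_negative=("negative", "sum"))
               .reset_index())
    # Ratio of positive over negative comments; avoid division by zero.
    daily["pos_neg_ratio"] = daily["total_positive"] / daily["total_negative"].replace(0, np.nan)
    return daily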

3.2.4 Predictive Modeling

The starting point of the prediction modeling is to construct the basetable, which is obtained by combining the gathered data sources. These data sources are merged on the observation date that is present in each of them, which yields a basetable with a total of 366 rows per cryptocurrency. Most of the variables are transformed into first differences in order to ensure stationarity. Stationarity is important for the prediction models, because it allows relationships detected in the training set to also be present in the test set. When variables are non-stationary, their values in the training set may not correspond to the values observed in the test set. The models are therefore more robust for out-of-sample predictions when first differences are used instead of the original features. Before a model can be fitted to the data, missing data points have to be imputed and the data has to be standardized. Imputing the missing data points is required in order to maintain the time sequence within the data.
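A minimal sketch of this basetable construction, assuming one pandas DataFrame per data source with a shared date column, is shown below; the column names and the choice of which variables to difference are placeholders, not the exact implementation of this study.

import pandas as pd

def build_basetable(financial, trends, sentiment, diff_cols):
    """Merge the data sources on the observation date and take first differences."""
    base = (financial.merge(trends, on="date", how="left")
                     .merge(sentiment, on="date", how="left")
                     .sort_values("date")
                     .reset_index(drop=True))
    base[diff_cols] = base[diff_cols].diff()   # first differences for stationarity
    return base.iloc[1:]                       # the first row is lost by differencing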

When the basetable is imputed and standardized, the different prediction models can be constructed. Multiple prediction pipelines are set up in order to gather the necessary evidence on the proposed research questions. To answer the first research question, concerning the preprocessing and prediction algorithm, a Lasso and a Random Forest classification algorithm are tested on the data that is available for Bitcoin. Within this pipeline, the different data sources are separated in order to simultaneously investigate their individual predictive value. Once the best available data sources, the optimal feature selection and the prediction algorithm are identified, a second prediction pipeline is implemented on the different cryptocurrencies. This pipeline is constructed in a similar fashion to the previous one, although the focus is on the different cryptocurrencies instead of the different data sources.

Within both prediction pipelines, the data is split into a training and a test set. The training set is used first for feature selection and hyperparameter tuning. In order to find the best features, the training set is split into multiple folds using time series cross validation. This type of cross validation has been discussed earlier and is known to yield more robust prediction results for time series data. Aggregating the prediction performance obtained in the different folds provides the optimal hyperparameter, which directly determines the number of features that are selected. The prediction performance is evaluated using the area under the ROC curve (AUC). The AUC metric is known to deal well with imbalanced class data, which might occur when the data is split into smaller folds during cross validation.

The Lasso regularization is added to the Logistic regression algorithm in order to allow for parameter shrinkage. Finding the optimal value of the penalization parameter results in the selection of particular features, as increasing the penalization parameter shrinks the effect sizes of the variables towards zero. The selected features can be evaluated on their importance using the resulting effect sizes: as all the data is standardized, the magnitude of an effect size represents its importance for the prediction of the outcome variable. With the Random Forest algorithm, the feature selection is conducted using the feature importances obtained from the optimal prediction model. The number of variables is reduced by setting a threshold on the minimally required feature importance, calculated as the mean of all obtained feature importances. The feature importances in the Random Forest algorithm are calculated as the mean decrease in impurity, defined as the total decrease in node impurity, averaged over all the trees constructed by the algorithm. Node impurity reflects how mixed the classes are within a node, so a large decrease indicates that a split on the chosen feature separates the classes well when constructing a classification tree. A more extensive explanation of the Random Forest algorithm is presented further in this section.
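As an illustration of the two feature-selection routes, a minimal scikit-learn sketch is given below; X_train, y_train and feature_names are placeholders for the standardized training basetable, and the parameter values are assumptions rather than the tuned values of this study.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def lasso_selected_features(X_train, y_train, feature_names, C=1.0):
    """Keep the features whose L1-penalized coefficients are not shrunk to zero."""
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    lasso.fit(X_train, y_train)
    mask = lasso.coef_.ravel() != 0
    return [name for name, keep in zip(feature_names, mask) if keep]

def rf_selected_features(X_train, y_train, feature_names, min_samples_leaf=26):
    """Keep the features whose impurity-based importance is at least the mean importance."""
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=min_samples_leaf, random_state=0)
    rf.fit(X_train, y_train)
    mask = rf.feature_importances_ >= rf.feature_importances_.mean()
    return [name for name, keep in zip(feature_names, mask) if keep]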

When the variables are selected with one of the algorithms, the basetable is restricted to contain only these variables for the final prediction evaluation. The same algorithm that was used for feature selection is then fitted again on the downsized training set. This time no parameter tuning is required, as the optimal hyperparameter identified during the time series cross validation for feature selection is reused. With the obtained training model, the prediction performance is evaluated on the isolated test set. The test set is identical for all of the different models, which allows the prediction performance to be compared on the same set of data. The test set contains the very last part of the obtained data range. The described evaluation methodology is represented in figure 3.4. To evaluate the prediction performance, the accuracy and the AUC score are used as scoring metrics. The accuracy score is added to the evaluation in order to be able to compare the results with earlier studies, which only used accuracy as an evaluation metric.

Figure 3.4: Representation of the cross validation method used in the prediction model pipelines

To answer part of the first research question, two different classification algorithms are implemented. The difference between these algorithms lies in their approach to fitting a model on the observed data points. To learn more about the possible dependencies within the cryptocurrency market, a linear and a non-linear algorithm are used to fit the models. To model linear dependencies, Logistic regression is used within the Lasso algorithm, and for modeling non-linear dependencies the Random Forest algorithm is used. A visual representation of both algorithms in a hypothetical two-dimensional space is presented in figure 3.5 (James et al., 2013). The first row exposes the weakness of a tree-based model when the true underlying dependency is linear, while the second row shows the opposite in case of a non-linear dependency. Although the true relationship in the gathered data is most likely not as straightforward as in this example, it gives an intuitive explanation of why one of the two models might perform better than the other.

As mentioned earlier, the Lasso algorithm for classification is represented by the Logistic regression function with the addition of a penalization function called the L1 penalty (Friedman et al., 2001). The penalization function introduces shrinkage of the effect size parameters as the hyperparameter λ increases. This penalization is implemented into the Logistic regression function in order to prevent over-fitting. Without the regularization term, the hyperplane is fit onto all the included variables. As it is likely that there is a substantial amount of noise in the included variables, the obtained effect sizes then do not correspond to the true regression hyperplane. To approximate the true hyperplane, a range of λ values can be tested on their prediction performance. By shifting the λ parameter, the optimization moves away from the unbiased estimator and allows some bias in return for a reduction in variance. This shift regularly leads to an increase in prediction performance, until the increase in bias outweighs the decrease in variance. Cross validation is used to find the optimal point in this bias-variance trade-off: the λ value that yields the best prediction performance on average over the different folds is taken as the optimal value. Since the Lasso algorithm is a variant of Logistic regression, it is linear in nature. As one of the goals of the study is to investigate the type of dependencies that exist within the cryptocurrency market, this algorithm is used to test whether these dependencies can be represented by a linear model.
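A minimal sketch of this λ search with time series cross validation and AUC scoring is shown below; note that scikit-learn parameterizes the penalty strength as C = 1/λ, and the λ grid, fold count and variable names are assumptions for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lambdas = np.logspace(-2, 2, 20)                     # candidate penalization strengths
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty="l1", solver="liblinear"))
lasso_grid = GridSearchCV(pipe,
                          param_grid={"logisticregression__C": 1.0 / lambdas},
                          cv=TimeSeriesSplit(n_splits=10),
                          scoring="roc_auc")
# lasso_grid.fit(X_train, y_train)   # X_train/y_train: the Bitcoin training basetable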

Figure 3.5: Linear versus tree-based algorithms on two different underlying relationships

The Random Forest classification algorithm is implemented in order to test for non-linear dependencies. Random Forest classification is known for capturing non-linear dependencies, as it cuts the parameter space into buckets by making splits in the data (Friedman et al., 2001). By creating these splits, the algorithm implicitly introduces interactions with previous splits. This can be visualized well in a two-dimensional parameter space, as in the second column of figure 3.5. The splits in the data create buckets in which the most frequently occurring class determines the class that is assigned to future observations falling into that bucket. The splits are conducted in a greedy fashion, meaning that at each point the split with the largest gain in classification accuracy is picked. These splits continue until a threshold is reached. The Random Forest algorithm takes a random subsample of the independent variables at each split. By adding randomness to the splits, the final tree that is obtained can be different after each run. All of the trees should have a low bias, as each split in a tree is the best split available at that point. The variance of these different trees is reduced by aggregating all the obtained trees into one final model, while the low bias is preserved. In that way the Random Forest algorithm tries to optimize the bias-variance trade-off.

There are multiple hyperparameters that can be tuned within the Random Forest classification algorithm. The hyperparameter that is used for tuning here is the minimum number of samples required at a leaf when a split is considered. Increasing this minimum tends to decrease over-fitting. A small value allows the classification tree to make splits where only a small portion of the observations is assigned to one of the classes, which can cause over-fitting, as it creates very specific splits based on the data at hand. When these buckets are used to classify out-of-sample data, the splits may turn out to be too specific, which decreases prediction performance on this new portion of data. If the minimum number of required samples at a leaf gets too high, the splits can become too general, which could cause under-fitting. As discussed, time series cross validation is used to optimize this trade-off.
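A minimal sketch of this tuning step is given below; the candidate grid for the minimum leaf size and the number of trees are assumptions, chosen only to illustrate a search around the optimum reported later.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rf_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid={"min_samples_leaf": [2, 5, 10, 20, 26, 30, 40]},
    cv=TimeSeriesSplit(n_splits=10),
    scoring="roc_auc",
)
# rf_grid.fit(X_train, y_train)   # X_train/y_train: the training portion of the basetable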


CHAPTER 4

RESULTS

The methodology that is extensively discussed in the previous chapter is implemented on the variety of scraped data sources. The results that follow from these implementations serve as empirical evidence on the proposed research questions. The variables extracted from the data sources are first investigated in order to apply the necessary feature engineering. Thereafter, the prediction modeling is conducted to obtain the final results.

4.1 Descriptive Statistics

The descriptive statistics are first reviewed per external data source using the data obtained for Bitcoin. Identifying whether movements in a data source are followed by movements in the price might already reveal dependencies that are useful for prediction. Besides that, the development of the data over time is examined to determine which variables need differencing in order to obtain stationary data series.

The outcome variable is represented by the directional return on the particular cryptocurrency. To obtain this variable, daily returns are calculated from the closing values. When the return is larger than 0, the observation is classified as an upward movement; if the return is zero or lower, it is considered a downward movement. The distribution of the outcome variable is quite balanced over the full time range: 209 (57%) of the directional returns are upward and 156 (43%) are downward. Lagged values of the outcome variable could also contain explanatory value due to the financial bubble effect that was discussed earlier. To incorporate this potential effect, lagged returns are included in the independent variables.
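A minimal sketch of constructing this outcome variable and the lagged returns from daily closing prices is shown below; the column names (date, close) and the number of lags are assumptions for the example.

import pandas as pd

def build_outcome(prices: pd.DataFrame, n_lags: int = 7) -> pd.DataFrame:
    """Derive the binary directional return and lagged returns from closing prices."""
    out = prices.sort_values("date").copy()
    out["return"] = out["close"].pct_change()
    out["direction"] = (out["return"] > 0).astype(int)   # 1 = upward, 0 = zero or downward
    for lag in range(1, n_lags + 1):
        out[f"return_lag{lag}"] = out["return"].shift(lag)
    return out.dropna()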

Increases in trading volume seem to occur simultaneously with large increases and

decreases in the price of Bitcoin (appendix B.1a). This observation is even more obvi-

ous when the trading volumes are observed together with the daily price spreads. Fac-

tors that influence the trading volume might therewith also be associated with price

changes. The trading volumes and price spreads are quite constant in the largest part


of the data range. However, after November 2017 there is an increase observed in

the volatility of trading volumes and price spreads. This increase in volatility makes

it extra challenging to develop a model that is able to predict directional returns. The

model should be trained to deal with both volatile and less volatile periods. The up-

side of this challenge is that the collected evidence from this study is more likely to

represent the real world cryptocurrency market, in which a lot of volatility is observed.

A correlation appears to be present between Google search queries and Bitcoin trad-

ing volumes (appendix B.2). This relationship has been observed and tested in earlier

studies as well. Most of the peaks in volume coincide with the peaks in the normalized

values of search frequencies. This does not give the impression that there is a suc-

cessive dependency between search queries and trading volumes. When the search

queries are compared to the closing prices, it seems that peaks in search queries

also coincide with changes in the price of Bitcoin (appendix B.2c). The peaks in price

spreads are in many cases observed together with peaks in search frequencies. In-

creases in search queries for Bitcoin do not seem to tell much about future changes

in Bitcoin prices or trading volumes. There seems to be a correlation, but only on the day itself and not on subsequent days. Additionally, the direction of the change in return cannot be determined directly from the frequency of search queries.

Overall the Google Trends data seems quite stable, except for some large peaks in the normalized search frequencies. These peaks are substantial compared to the surrounding observations; they are quite abrupt and mostly last for just a single day. As discussed earlier, Google Trends data is assumed to represent the interest of potential investors in the particular cryptocurrency, and it is not very likely that true public interest experiences such extreme shifts in such short periods. To smooth out the volatility to some extent, a moving average can be included instead of the original trend. Implementing a moving average does not introduce a threat of data leakage during time series cross validation, as only historic data is used to calculate it. When a 14-day moving average is compared to the original trend, the moving average indeed shows a more desirable pattern (appendix B.2a): it moves more in step with the price of Bitcoin and is less volatile over the full data range. Therefore, a 14-day moving average of the Google search queries is implemented to represent the trends data.
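For illustration, a minimal sketch of this smoothing step is given below; each smoothed value is the mean of the current and the previous 13 daily observations, so no future information enters the feature.

import pandas as pd

def smooth_trends(interest: pd.Series, window: int = 14) -> pd.Series:
    """14-day trailing moving average of the daily Google Trends interest series."""
    return interest.rolling(window=window, min_periods=1).mean()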

When the sentiment data is analyzed in figure 4.1, there is immediately a striking observation in the data fetched from the Pushshift API: there is a gap in the sequence of the sentiment features. After further investigation, it appears that this gap is present for all the different coins extracted from the data. The only explanation can therefore be that the data for these days is missing in the Pushshift database. This confirms the earlier mentioned threat of relying on a third-party data provider. The gap spans 13 consecutive days in the middle of July 2017. In order to maintain the sequential nature of the data, the missing values are imputed using linear interpolation. This method is preferred over other imputation methods as it does not introduce any undesired volatility over the imputed time range, which was for example observed when K-nearest neighbor imputation was tested on the data in this study. Interpolation uses the two available values adjacent to the missing sequence to calculate a linear function that fills in the values over the missing time range. The small period that requires imputation is not expected to have a substantial impact on the full analysis, as the missing data resides in a relatively nonvolatile part of the data range and the chosen imputation method is rather safe for such a period.
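A minimal sketch of this imputation step with pandas is shown below, assuming a date-indexed daily sentiment series; the reindexing makes the missing days explicit before a straight line is drawn between the two adjacent observed values.

import pandas as pd

def impute_gap(series: pd.Series) -> pd.Series:
    """Linearly interpolate missing days in a date-indexed daily series."""
    full_index = pd.date_range(series.index.min(), series.index.max(), freq="D")
    return series.reindex(full_index).interpolate(method="linear")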

Figure 4.1: The total number and the sum of positive and negative comments over time for Bitcoin

The sentiment data is compared with the closing prices and trading volumes of Bitcoin, as well as with the Google Trends search frequencies (appendix B.3). All of the sentiment variables are smoothed using a 14-day moving average in order to reduce the high amount of volatility that is observed. The sentiment data seems to move quite in line with the closing price of Bitcoin: when the drop in value starts, the number of comments stagnates and drops not much later. It is not clear from the figure whether the sentiment data has predictive value for the closing prices. The same goes for the relationship between Google search queries and the number of gathered comments; the series seem to coincide to some extent, but not over the full data range. Furthermore, the ratio between positive and negative comments shows a shift over time. During the first half year the ratio moves around 2.6, and after a shift in the middle of the year it starts to move around a value of 1.9. This


indicates that at some point the growth of negative comments outpaced the growth

of positive comments. It might be interesting to capture a shift in public sentiment to

see if this would affect future returns.

The series extracted from the different data sources do not directly show lagged dependencies with Bitcoin prices. Increases in search frequencies and trading volumes seem to be equally present for large price increases as for decreases. Potentially, in combination with the sentiment data it might be possible to detect the direction of the price movement. This potential relationship describes an interaction effect between sentiment and search frequencies or trading volumes. The Random Forest algorithm is known to be able to capture such interaction effects. If this relationship exists, the Random Forest classification model is expected to outperform the Lasso classification model.

Table 4.1 shows the final variables that are included in the basetable, together with their correlations. A delta sign in front of a variable indicates that its first difference is used, after reviewing the data distribution over time. The correlations show that there can be quite some correlation between independent variables, especially between the differenced sentiment variables, the trading volume variable and the Google search interest variable. The directional returns are derived from the return variable, but the correlations of the independent variables with this return variable are not that substantial. Further results will indicate the predictive power of these variables as they are used to train the prediction models.

As discussed earlier, previous studies have found Granger causality between financial market returns and similar lagged independent variables, with lags ranging from 1 to 4 days. Therefore, the maximum lag in the basetable is set to 7 days, creating 7 lagged values of each variable at each observation. With a total of 8 different independent variables, this results in 56 features per observation. Time series models regularly use predictors that are measured on the same day as the outcome variable, but it is not likely that the data gathered in this study would be available on the day the prediction is required. Adding variables with no time lag as features would therefore be a deviation from reality. To ensure that the obtained prediction models are in line with a real-world environment, all variables start with a lag of 1 day, so the prediction models can only rely on historic data.
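A minimal sketch of creating these lagged features is shown below; the date and direction column names are placeholders, and the helper assumes the basetable is already sorted by date.

import pandas as pd

def add_lags(base: pd.DataFrame, columns, max_lag: int = 7) -> pd.DataFrame:
    """Expand the daily variables into lag-1 to lag-max_lag features."""
    lagged = {f"{col}_lag{lag}": base[col].shift(lag)
              for col in columns for lag in range(1, max_lag + 1)}
    out = pd.concat([base[["date", "direction"]], pd.DataFrame(lagged)], axis=1)
    return out.dropna()   # the first max_lag rows lack complete lag information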


                    return        Δ volume      Δ spread      Δ interest    Δ total comm. Δ total pos.  Δ total neg.
Δ volume            -0.08 (0.11)
Δ spread            -0.07 (0.21)   0.81 (0.00)
Δ interest           0.05 (0.38)   0.70 (0.00)   0.50 (0.00)
Δ total comments    -0.05 (0.35)   0.55 (0.00)   0.32 (0.00)   0.63 (0.00)
Δ total positive    -0.01 (0.78)   0.52 (0.00)   0.29 (0.00)   0.62 (0.00)   0.99 (0.00)
Δ total negative    -0.09 (0.09)   0.58 (0.00)   0.35 (0.00)   0.65 (0.00)   0.95 (0.00)   0.90 (0.00)
pos/neg ratio        0.05 (0.32)  -0.05 (0.34)  -0.02 (0.75)  -0.04 (0.49)  -0.03 (0.53)  -0.01 (0.89)  -0.09 (0.07)

Table 4.1: Correlation table of the independent variables, p-values between brackets. Variables are separated into financial (top), Google Trends (middle) and sentiment (bottom) data

4.2 Feature Selection and Algorithm Performance

The first results are obtained on the identification of the selected features and the

following prediction performance on the data set containing only Bitcoin data. These

results are obtained for both implemented algorithms. The first results will indicate

what type of dependencies might exist in the data that is gathered from the different

data sources. The prediction performance will serve as empirical evidence on these

dependencies. The main question here is whether the underlying relationships are linear or non-linear in form. The Lasso algorithm aims to capture the linear dependencies, while the Random Forest algorithm should capture the non-linear dependencies. Besides detecting underlying dependencies, part of the prediction model

pipeline is constructed to apply feature selection on the lagged variables. This is con-

ducted as an attempt to eliminate the noisy features from the prediction model and

to identify which lag would suit best for each variable. Feature selection results from

both algorithms will additionally show which features are considered to be important

for the explanation of directional returns on Bitcoin. When features are observed to be

important in both algorithms, it is more likely that they contain a strong relationship

with the outcome variable.

The dataset is split into a train and a test set, where the first 60% of the data is used as training data and the remaining 40% as test data. This split proportion is chosen so that the test set contains both a bullish and a bearish period, as can be seen in figure 4.2. This is not entirely in line with standard practice, in which the test set should be chosen before its characteristics are observed, but in this case a test set covering periods of different volatility is desired in order to test the model's actual ability to capture these periods. The training and the test split contain a class balance similar to that of the full dataset. The training set is further used for parameter tuning using k-fold sequential cross validation; the number of folds is set to 10 in all cross validation applications in this study. The test set contains out-of-sample data, as these data points are not used for training the prediction model. Therefore, performance results on the test set are used as the evidence on the ability to make predictions in highly volatile cryptocurrency markets.
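A minimal sketch of this chronological split is given below; the 60% proportion follows the text, and the function assumes the basetable is sorted by date.

def chronological_split(base, train_frac=0.6):
    """Return (train, test) where the test set is the most recent 40% of the rows."""
    cutoff = int(len(base) * train_frac)
    return base.iloc[:cutoff], base.iloc[cutoff:]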

Figure 4.2: Split of the data in training (red) and test (blue) sets

The optimal penalization parameter λ identified with grid search cross validation for the Lasso algorithm is equal to 1.0. As a result, the Lasso algorithm selected 17 features, which means that it eliminated 39 of them. The optimal parameter for the Random Forest algorithm was observed at a minimum of 26 samples per leaf; it eliminated a little more than half of the features, resulting in the selection of 25 features. The grid search results of both algorithms are depicted in appendix C.1. The parameter tuning does not seem to have a substantial effect on the prediction performance, as the change in AUC is rather flat. However, the effect of over- and under-fitting is clearly visible when the training performance is compared to the test performance: the prediction performance on the test set increases as the amount of over-fitting is reduced, but eventually decreases as under-fitting comes into play.

The feature selection results of both algorithms show that they select less than half of the available variables. Appendix C.2 shows that there is some overlap in the 10 most important features when comparing both algorithms. The total sum of negative comments with a lag of 3 days is selected by both algorithms as the most important feature. Both algorithms include a selection from each available data source, so it is likely that all sources supply some explanatory value for the outcome variable. Further analysis will investigate whether there is any difference in prediction performance when the data sources are used individually.

When the final prediction results are compared on the test set, it is clear that the Random Forest algorithm is better able to predict directional returns of Bitcoin. Random Forest has a prediction accuracy of 63.0% and an AUC score of 61.4%, compared to 52.1% accuracy and 52.5% AUC for the Lasso model. Both the prediction accuracy and the AUC are thus higher with the Random Forest model, which indicates that non-linear models might have an advantage over linear models when trying to capture underlying dependencies in the cryptocurrency market. Altogether, the prediction performance is not that strong, as accuracy and AUC remain between 50% and 65%.

Earlier studies have found directional prediction accuracies of 76% and 87% in the stock market (Bollen et al., 2011; Mittal and Goel, 2012). Both studies also observed that non-linear models delivered a better prediction performance than linear prediction models. The difference in prediction performance might expose the difficulty of making predictions in a market as volatile as the cryptocurrency market. However, the studies concerning the stock market only evaluate their models over periods of 40 and 20 days respectively, which seems a rather small period to capture different volatility regimes. The studies do not report the AUC of their predictions, which otherwise could have helped in a better evaluation of their prediction performance on the smaller data range.

4.3 Prediction Performance of the Data Sources

The basetable is split into the different data sources by filtering out the irrelevant variables. All possible data source combinations are tested with both algorithms, resulting in 14 prediction models for Bitcoin. The prediction models in the previous section, which included all data sources, showed some advantage of the Random Forest algorithm over the Lasso algorithm. The results on the different data sources are presented in figure 4.3. When investigating the results, some volatility in the prediction results is observed when the Lasso classification algorithm is used. The optimal performance for this algorithm, in both accuracy and AUC, is observed when only Trends data is used. The prediction performance of financial data and of financial plus Google Trends data is still above 55%, but as observed in the previous section, when all sources are combined the performance moves further down. The Lasso prediction models containing sentiment data tend to perform the worst overall. This might indicate that mainly the dependencies with the sentiment data are not well captured by a linear model.

Figure 4.3: Prediction performance using both algorithms on the different data sources: financial data from CryptoCompare (fin), Google Trends (trend) and sentiment data from Reddit (sent)

The results for the Random Forest algorithm show a more consistent increase in prediction performance as more data sources are added to the prediction model. The performance of the models in which sentiment data is involved increases by 5% or more, in both accuracy and AUC. This indicates that the sentiment data might be better captured by a non-linear model than by a linear model. The results on the financial and Google Trends data sources, on the other hand, show a slight decrease in prediction performance compared to the Lasso models. These data sources might therefore be better represented by a linear optimization algorithm. The best prediction performance with the Random Forest algorithm is obtained when all data sources are included. The addition of the financial and Google Trends data sources still adds explanatory value to the model, even though they perform worse individually. This supports the idea that each of the data sources carries important information, but that the value is mainly observed when these data sources are combined into one prediction model. Apparently, the ability of the Random Forest classification model to create interactions between variables amplifies the ability of the sentiment data to predict directional returns. These dependencies are harder to detect when the data is initially explored, but the prediction performance suggests that such relationships might actually exist.

Altogether the best performance for Bitcoin is obtained with the Random Forest clas-

sification algorithm when all the gathered data sources are included. It will be inter-

esting to see if this algorithm is also able to deliver similar results on other cryptocur-

rencies.

4.4 Predicting Different Cryptocurrencies

We have seen that Bitcoin is by far the most discussed cryptocurrency on the Reddit platform. It is identified more than twice as often as Ethereum, which comes second, while Ripple and Litecoin are each discussed less than half as often as Ethereum. In terms of market capitalization Bitcoin leads the way in a similar fashion: its market capitalization is almost 4 times that of Ethereum, 10 times that of Ripple and 33 times that of Litecoin. Litecoin is not even in the top 5 of cryptocurrencies with the largest market capitalizations (CoinMarketCap, 2018). A possible reason that Litecoin receives more coverage on Reddit than cryptocurrencies with a larger market cap is that Litecoin is one of the earliest Bitcoin spin-offs, which likely earned it substantial popularity.

With Bitcoin clearly as the most popular cryptocurrency in terms of coverage as well

as market capitalization, it will be interesting to see what the prediction performance

on less popular coins will be. The prediction performance of the different cryptocur-

rencies is evaluated over all the available data sources, using the Random Forest

classification algorithm.

The results of these prediction models are depicted in figure 4.4. The prediction performance on Bitcoin clearly exceeds that of the other cryptocurrencies. Only Ripple returned a prediction performance higher than 50% for both accuracy and AUC, but its accuracy is still 10% lower than the one obtained for Bitcoin. The prediction performance on Litecoin and Ethereum is just over 45%, which is substantially lower than the performance on the other coins. These results suggest that predictions on more popular cryptocurrencies, like Bitcoin, provide better results than predictions on less popular coins. The idea that it could be easier to predict directional returns on less popular coins, as their market might be less efficient due to the lower amount of coverage, is not confirmed by the results. Potentially, the lower amount of coverage creates a situation where the available data does not capture the broader opinion of the market.

Figure 4.4: Prediction performance on the different cryptocurrencies using the Random Forest classification algorithm


CHAPTER 5

CONCLUSION AND DISCUSSION

5.1 Conclusion

The results in the previous chapter were gathered with the aim of providing empirical evidence on the proposed research questions. The research questions were derived from the existing literature by identifying gaps in the covered matter. The conclusions on the investigated research questions are presented in this section.

The first research question was concerned with finding the best algorithm to select the most important features and thereafter predict the directional returns. Both algorithms selected less than half of the available features of the full dataset. The selected features came from all the different data sources, which did not indicate a preference for any of the scraped data sources. The final prediction results show that the Random Forest classification model delivers a better performance than the Lasso classification model, with an accuracy of 63.0% and an AUC of 61.4%. This result suggests that the underlying dependencies are best represented by a non-linear model when all the gathered data sources are used.

The initial results on the full dataset do not provide any insights on the individual

value of the different data sources. To investigate this question, multiple models are

trained with different combinations of the data sources as input. It is observed that,

with the Random Forest algorithm, the prediction performance increases as more data

sources are added to the model. Especially when the sentiment data is added, an increase in prediction performance is observed. This indicates that all the data

sources add some incremental value to the prediction model, with sentiment data as

the most valuable data source. This is quite a striking result, considering the earlier

studies that investigated the possible effects of public sentiment from web sources

on cryptocurrency prices.

When the financial and Google Trends data are investigated individually, Lasso returns a better prediction performance than Random Forest. This indicates that these data sources might be better represented by a linear relationship with the outcome variable, whereas the sentiment data might be better represented by a non-linear relationship. The overall optimal performance is still obtained when all the different data sources are included in the non-linear Random Forest prediction model. This suggests that capturing the interactions between the variables of the different data sources increases the ability to predict directional returns.

To answer the final research question, multiple cryptocurrencies were compared on their prediction performance. The reasoning behind this research question is that the popularity of certain cryptocurrencies might affect the ability to predict their directional returns. The difference in popularity is immediately observed when the frequency at which a cryptocurrency is discussed is visualized: Bitcoin is clearly the most covered cryptocurrency on the CryptoCurrency subreddit, and in market capitalization Bitcoin is also substantially larger than the other cryptocurrencies. The prediction performance on Bitcoin was substantially better than on the other cryptocurrencies. Prediction accuracy and AUC stayed below 50% for both Ethereum and Litecoin. Only Ripple got somewhat closer to the performance on Bitcoin, with both metrics around 55%. These results suggest that popular coins are more predictable than less popular coins. The larger amount of coverage of a popular coin increases the amount of data that is available, and it might be that this amount of available data is crucial for the development of an accurate prediction model. The idea that the market for popular coins is more efficient, and therefore harder to predict, seems to be invalidated by the obtained results.

Sentiment scores calculated from Reddit comments have noticeably more predictive power than the other independent variables in the model. This observation indicates that using more sources to gather such online sentiment could be an interesting avenue for further research. Moreover, it also indicates that the cryptocurrency market is, like many financial markets, driven by human emotion, perhaps even more so than others.

5.2 Discussion

The empirical evidence presented in this study contributes to a better understanding of a rather new financial market. The prediction results were not able to match earlier results in the more traditional stock market, but the growing coverage in the academic literature continuously uncovers new relationships in the cryptocurrency market. This section presents some useful takeaways for future research on topics related to this study.


A notable difference between the cryptocurrency market and the stock market seems to be the underlying value of their assets. Stocks are represented by a fundamental value: a proportion of the current value of the concerning company, plus the future value it is expected to generate. Once the trust in the generation of future value is lost, perhaps because of a bankruptcy, there still might be some residual value in the assets the company holds. The stocks will represent this residual value, and as long as there is collateral they will not lose their entire value. Additionally, the value represented by the expected future profitability of the company depends for a large part on the market in which the company operates. These markets differ in volatility, but are in general quite stable and predictable. The intuitions of investors, and therewith their sentiments on the particular company and the market it operates in, dictate this part of the valuation exercise. These intuitions might be the reason that earlier studies have observed some influence of public sentiment on the future value of stocks.

A cryptocurrency, on the other hand, does not have this fundamental value. If the underlying technology that represents the cryptocurrency ceases to exist, there is nothing one can do with the cryptocurrency. There is no residual collateral that could be sold to other parties to redeem some of the value the cryptocurrency once had and, as there are no assets, there are also no vehicles that can generate future value. The only value a cryptocurrency seems to hold is the perceived demand for the particular coin. Earlier studies have observed that cryptocurrencies seem to depend on network effects (Gandal and Halaburda, 2016) and are sensitive to the creation of financial bubbles through the reinforcement of changes in public interest (Mai et al., 2015). These findings support the suggestion that sentiment plays a major role in the valuation of cryptocurrencies. Nevertheless, the results in this study do not directly reveal this importance, as the prediction performance is not noticeably better than the performance observed in studies concerning the stock market. A reason for this might be the unprecedented volatility that is present in cryptocurrency markets. Sentiment, and therewith cryptocurrency value, seems rather dependent on short-term alterations, while sentiment shifts in the stock market are less abrupt and exist within a more stable environment. To find additional empirical evidence for this reasoning, it will be interesting for future research to investigate the effect of sentiment in both markets, and to capture the short-term sentiment alterations in the cryptocurrency market in order to establish their value.

The time series nature of the data allows for a variety of feature engineering implementations to optimize the fit of the independent variables with the outcome variable. Moving averages were already implemented in this study to decrease the volatility,


but there are definitely more options available. For instance, the observed shift in the

ratio of positive and negative comments is most likely not very well captured by the

current methodology. The usage of a certain smoothing function might help in the

detection of slow shifts in the public sentiment. Due to time constraints, it was not

possible to incorporate such functions in the feature engineering section. It seems

likely that with more extensive feature engineering, the prediction performance can

be increased substantially. It would be interesting to see if future research is able to

obtain better prediction results when more emphasis is placed on the feature engi-

neering part of the study.

The prediction model that delivered the highest performance contained variables from

all gathered data sources, which included a total of 8 different variables. However,

there was a substantial amount of correlation observed between some of these vari-

ables. Most of the sentiment variables showed quite some mutual correlation, with

correlation values ranging between 0.9 and 1. The observed correlation could make

some of these sentiment variables redundant, as the variation of the outcome vari-

able can potentially be explained by just one of the correlated variables. Therefore, it

could be beneficial in further research to put extra effort into the extraction of a wider

variety of sentiment features to represent the effect of public sentiment on the value

of cryptocurrencies. A wide variety of uncorrelated variables can potentially uncover

additional interactive relationships with the returns on cryptocurrencies, which might

increase prediction performance.

The text data that was scraped from Reddit showed some great potential in the devel-

opment of additional features. Data on the users who posted the comments and data on the number of up and down votes a comment received were already collected.

This data can be transformed into additional features to represent a wider range of

characteristics that are connected to the textual data. Additionally, it can be inter-

esting to conduct topic mining on the scraped comments to monitor the attention on

certain related topics over time and identify potential relationships with the value of

a particular cryptocurrency.

During the investigation of data coverage over the full time range, it was observed

that certain cryptocurrencies were not sufficiently covered to include them in the

analysis. The less popular cryptocurrencies that were included also showed some

small gaps in their data, which most likely affects the ability to predict directional

returns. It would be interesting for further research to identify additional data sources

to increase the data richness on the less popular coins. This could reduce the identi-

fied data quality issue of missing data for these particular coins. With sufficient data

available, a thorough comparison can be made while sustaining the intended research

validity.


BIBLIOGRAPHY

Bird, S. (2017). nltk: Natural language toolkit for natural language processing. [On-

line; accessed 2018-08-05].

Bollen, J., Mao, H., and Zeng, X. (2011). Twitter mood predicts the stock market.

Journal of computational science, 2(1):1–8.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal

margin classifiers. In Proceedings of the fifth annual workshop on Computational

learning theory, pages 144–152. ACM.

CoinMarketCap (2018). Top 100 cryptocurrencies by market capitalization. https://coinmarketcap.com/. [Online; accessed 2018-08-17].

CryptoCoinCharts (2018). Cryptocurrency exchanges / markets list. https://cryptocoincharts.info/markets/info. [Online; accessed 2018-08-12].

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning, volume 1. Springer Series in Statistics. Springer, New York, NY, USA.

Gandal, N. and Halaburda, H. (2016). Can we predict the winner in a market with

network effects? competition in cryptocurrency market. Games, 7(3):16.

Gilbert, C. H. E. (2014). Vader: A parsimonious rule-based model for sentiment anal-

ysis of social media text. In Eighth International Conference on Weblogs and Social

Media (ICWSM-14).

Grout, P., Megginson, W., and Zalewska, A. (2009). One half-billion shareholders and

counting-determinants of individual share ownership around the world.

Hogue, J. and DeWilde, B. (2014). pytrends: Unofficial api for google trends. [Online;

accessed 2018-08-05].

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical

learning, volume 112. Springer.

Kaminski, J. (2014). Nowcasting the bitcoin market with twitter signals. arXiv preprint

arXiv:1406.7577.


Kristoufek, L. (2013). Bitcoin meets google trends and wikipedia: Quantifying the

relationship between phenomena of the internet era. Scientific reports, 3:3415.

Laskowski, M. and Kim, H. (2016). Rapid prototyping of a text mining application for

cryptocurrency market intelligence.

Leinz, K. (2018). A look at who owns bitcoin (young men), and why (lack of trust).

Bloomberg.

Mai, F., Bai, Q., Shan, Z., Wang, X., and Chiang, R. (2015). From bitcoin to big coin:

The impacts of social media on bitcoin performance. SSRN Electronic Journal.

Malkiel, B. G. and Fama, E. F. (1970). Efficient capital markets: A review of theory and

empirical work. The journal of Finance, 25(2):383–417.

Matta, M., Lunesu, I., and Marchesi, M. (2015). Is bitcoin’s market predictable? analy-

sis of web search and social media. In International Joint Conference on Knowledge

Discovery, Knowledge Engineering, and Knowledge Management, pages 155–172.

Springer.

Mittal, A. and Goel, A. (2012). Stock prediction using twitter sentiment analysis. Stanford University, CS229 project report. http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf.

Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system.

Ribeiro, F. N., Araújo, M., Gonçalves, P., Gonçalves, M. A., and Benevenuto, F. (2016).

Sentibench-a benchmark comparison of state-of-the-practice sentiment analysis

methods. EPJ Data Science, 5(1):1–29.

Schumaker, R. P. and Chen, H. (2009). Textual analysis of stock market prediction

using breaking financial news: The azfin text system. ACM Transactions on Informa-

tion Systems (TOIS), 27(2):12.

Socher, R. (2018). Ai’s next great challenge: Understanding the nuances of language.

Harvard Business Review.


APPENDIX A

MODEL DEVELOPMENT CODING

A.1 Pseudo-code of the subject extraction algorithm

Algorithm 1: Subject extraction from a Reddit thread tree

Data: dataset of fetched Reddit threads and their comments
Result: cryptocurrency subject(s) of each comment

initialization: match all comments against the fetched cryptocurrency list
for comment in dataset do
    if subject match in concerning comment then
        return matching subject
    end
    while subject not identified and a parent comment exists do
        move to parent comment
        if subject match in parent comment then
            return matching subject
        end
    end
    if no subject match identified in full comment tree then
        if subject match in thread title then
            return matching subject extracted from thread title
        else
            return nothing
        end
    end
end


APPENDIX B

DESCRIPTIVE STATISTICS

B.1 Financial data

Figure B.1: Financial data for Bitcoin over the full time range (panels a-c)


B.2 Google Trends data

Figure B.2: Google Trends data with financial data for Bitcoin over the full time range (panels a-c)


B.3 Sentiment data

Figure B.3: Sentiment data and other data sources for Bitcoin over the full time range (panels a-d)


APPENDIX C

RESULTS

C.1 Parameter grid search using time series cross validation

Figure C.1: Hyperparameter tuning using time series CV for both algorithms (panels a and b)

C.2 Feature selection

Figure C.2: Top 10 feature importances resulting from the implemented algorithms (panels a and b)