UPTEC F 17005

Degree project, 30 credits, February 2017

Multiple time-series forecasting on mobile network data using an RNN-RBM model

Arvid Bäärnhielm

Faculty of Science and Technology, UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

Multiple time-series forecasting on mobile network data using an RNN-RBM model

Arvid Bäärnhielm

The purpose of this project is to evaluate the performance of a forecasting model based on a multivariate dataset consisting of time series of traffic characteristic performance data from a mobile network. The forecasting is made using machine learning with a deep neural network. The first part of the project involves the adaption of the model design to fit the dataset and is followed by a number of simulations where the aim is to tune the parameters of the model to give the best performance. The simulations show that with well-tuned parameters, the neural network performs better than the baseline model, even when using only a univariate dataset. If a multivariate dataset is used, the neural network outperforms the baseline model even when the dataset is small.

ISSN: 1401-5757, UPTEC F 17005. Examiner: Tomas Nyberg. Subject reader: Justin Pearson. Supervisor: Tor Kvernvik.

Popular science summary

As technological development has made it possible to produce ever faster computers, while more and more data is collected and stored, the exciting research field of Machine Learning has been able to emerge. Machine Learning is part of the larger research field of artificial intelligence, and the goal is to use large amounts of data and carefully tuned algorithms to create advanced models that can find patterns in the collected data. With such large amounts of data, the human brain is no longer advanced enough to see these patterns.

This project investigates the possibility of assembling a model of Machine Learning algorithms to analyse statistical data collected from individual cells in a mobile network. The data is collected in the form of time series, where values are accumulated and stored at regular time intervals over a long period, and where the data is collected from several cells and for several different statistical measurements. The goal is to investigate whether the model can then be trained to predict measurements for future time intervals in the time series, to begin with one time interval into the future, by letting the model analyse the historical data.

The training, or optimisation, is done by feeding a large amount of data into the model, where both the historical data and the future value are known. A number of parameters in the model are then fine-tuned using an optimisation function so that the model reproduces the known future value as accurately as possible, for a large number of different values simultaneously. The adjustment is made stepwise and automatically by testing the accuracy after each adjustment. The model's ability to accurately estimate future values is then tested against a roughly equally large amount of separate data where both the historical data and the future value are known, but where only the historical data is fed into the model. The accuracy is compared with that of another, simpler model to obtain a measure of the quality of the model.

The model has been trained and tested on a few different sets of data. The purpose is partly to investigate how the accuracy is affected by the amount of historical data, but primarily to investigate whether there are relationships between geographically distributed cells and between different types of statistical measurements. The data has therefore been extended successively from a single statistical measurement in a single cell, to finally containing several different types of statistical measurements from a number of different cells.

The conclusion from the simulations that have been carried out is that the model shows great potential for producing accurate forecasts of future measurements. At the same time, there are several suggestions for possible improvements to the model's design that could increase the accuracy further. There are therefore good reasons to carry out more in-depth tests to further investigate the model's potential.

Acknowledgements

I would like to express my sincere appreciation and deepest thanks to my supervisor Tor

Kvernvik and my second supervisors Tony Larsson and Johan Haraldsson at Ericsson for

all the support and engagement through this master's thesis. I am forever grateful for the

opportunity to come to Ericsson and to be able to work in such an interesting field as

Machine Learning. I would also like to extend my greatest gratitude to my subject reader

Justin Pearson at the Department of Information Technology at Uppsala University. Your

help and encouragement during times of struggle has been essential for my work. Finally,

I would like to thank my wife and family. Your love and support is my foundation in life.

Contents

Abstract
Acknowledgements

1 Introduction
1.1 Time series
1.2 Time series analysis and forecasting models
1.3 Motivation

2 Theory
2.1 Mobile Network infrastructure
2.2 Holt-Winter
2.3 Recurrent Neural Network-Restricted Boltzmann Machine (RNN-RBM)
2.3.1 Recurrent Neural Network (RNN)
2.3.2 Restricted Boltzmann Machine (RBM)
2.3.3 RNN and RBM combined as RNN-RBM
2.3.4 Binary versus real-valued data

3 Methodology
3.1 Software
3.2 Dataset
3.2.1 Dataset 1
3.2.2 Dataset 2
3.3 Parameter tuning
3.4 Evaluation method
3.4.1 Residual sum of squares - RSS
3.4.2 Area under the curve - AUC
3.5 Code

4 Results
4.1 Binary-valued input data
4.2 Real-valued input data
4.2.1 Baseline forecasts
4.2.2 Single cell and single counter
4.2.3 Multiple cells and single counter
4.2.4 Single cell and multiple counters
4.2.5 Multiple cells and multiple counters

5 Analysis
5.1 Input data
5.2 Hidden units
5.3 Batch size

6 Conclusion
6.1 Summary of thesis achievements
6.2 Future Work

Bibliography

A Software
A.1 Python
A.2 Theano
A.3 PyCharm
A.4 Git

B RSS and AUC tables

List of Tables

B.1 Table showing all results from the simulations using a single cell and a single counter as input data.

B.2 Table showing all results from the simulations using multiple cells and a single counter as input data.

B.3 Table showing all results from the simulations using a single cell and multiple counters as input data.

B.4 Table showing all results from the simulations using multiple cells and multiple counters as input data.

List of Figures

2.1 A historical mobile network to the left, with a single basestation covering a large number of users, and a heterogeneous mobile network to the right, consisting of a large number of small cells connected to each basestation. Image source: [Hal]

2.2 An illustration of the load at cells in housing and business areas during morning commuting. The top row shows activity in green, mostly at the housing areas, while the middle row shows activity mostly along the roads, and the bottom row shows activity mainly at the business areas.

2.3 An illustration of the load at cells along a highway when users pass by the cells. The load rises at Cell A first, followed by Cell B, and Cell C.

2.4 An illustration of the load during one week. The top graph shows a smooth curve, with cycles that are easy to spot, while the bottom graph shows a rougher curve, where the cycles are less obvious.

2.5 An RNN model unfolded in time. The bottom layer is the input, the top layer is the output and the middle layer is the hidden state, dependent on the input and the previous hidden state. Image source: [LBH15]

2.6 A graphical description of an RBM with the visible layer at the bottom and the hidden layer at the top. Image source: [Deea]

2.7 A graphical illustration of t-step Gibbs sampling. Note that the last hidden step in the figure should be labeled as $h^{(t-1)}$ and not $h^{(t)}$, to follow the pattern. Image source: [Deea]

2.8 A graphical illustration of the RNN-RBM model. The bottom layer is the RNN implementation and the top two layers are the RBM implementation. Image source: [Deeb] [BLBV12]

3.1 The confusion matrix presents a visualization of how well a model manages to classify a set of values. The correctly classified values are shown on the diagonal from the top left corner to the bottom right corner, while the incorrectly classified values are presented in the other fields. Image source: [dS]

3.2 The ROC space presents, graphically, the performance of a classification model. The dots represent the rate of true positives, TPR, versus the rate of false positives, FPR. A value close to the upper left corner, with high TPR and low FPR, indicates a good classifier.

3.3 The results from different choices of threshold for the classification model create a curve. The area under the curve, AUC, as well as the shape of the curve, gives an indication of the performance of the model.

4.1 The left and the middle columns show the forecasts made using the RNN-RBM model with 100 and 1000 hidden units in the RNN layer, respectively. The right column shows the forecasts made using the Holt-Winter model. The rows correspond to dividing the input data into different numbers of percentiles, with 5 on top, 20 in the middle and 100 at the bottom. The blue line is the real data and the red line is the forecasted data.

4.2 One-step forecast made using the Holt-Winter model (red) on unchanged data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.

4.3 One-step forecast made using the Holt-Winter model (red) on modified data and the corresponding RSS value. The real data is shown in blue. [...]

4.4 One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

4.5 One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and all history as input. The real data is shown in [...]

4.6 One-step forecast made using the RNN-RBM model (red) with multiple cells, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. [...]

4.7 One-step forecast made using the RNN-RBM model (red) with multiple cells, a single counter and all history as input. The real data is shown in [...]

4.8 One-step forecast made using the RNN-RBM model (red) with a single cell, multiple counters and 7 days of history as input. The real data is shown in [...]

4.9 One-step forecast made using the RNN-RBM model (red) with a single cell, multiple counters and all history as input. The real data is shown in [...]

4.10 One-step forecast made using the RNN-RBM model (red) with multiple cells, multiple counters and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. [...]

4.11 One-step forecast made using the RNN-RBM model (red) with multiple cells, multiple counters and all history as input. The real data is shown in [...]

5.1 Figure of the forecast giving the overall best AUC value. The performance counter is the number of packets transferred and is shown in blue with the forecast in red.

5.2 Two figures showing how the RSS value and the AUC value depend on the input data.

5.3 Two figures showing how the RSS value and the AUC value depend on the number of hidden units in the RBM layer.

5.4 Two figures showing how the RSS value and the AUC value depend on the number of hidden units in the RNN layer.

5.5 Two figures showing how the RSS value and the AUC value depend on the batch size.

Chapter 1

Introduction

This project aims to evaluate the performance of a multivariate forecasting model on a

multivariate dataset consisting of time series data. The performance of the evaluated

model is compared to the performance of a baseline model. In this chapter, a brief

explanation of time series is given in Section 1.1, followed by a deeper explanation of time

series analysis and the state-of-the-art models that are used, in Section 1.2. The chapter ends

with Section 1.3, which gives a motivation of the usefulness of the project.

1.1 Time series

A time series is a sequence of numerical data points listed in time order. Most commonly

the data points are distributed evenly in time, with equal spacing between successive

data points over the entire time series. Many kinds of data can be gathered into time

series, as long as there is a time dependence in the data. Many times there is a desire

to be able to predict and forecast the future behaviour of the time series. Weather

forecasts are probably the most familiar type of time series forecasting, where multiple

time series of historical data from spatially distributed data sources, and a number of

different characteristics, such as temperature, pressure, humidity, etc, are combined into

a large dataset. These added dimensions create a multivariate time series, in contrast to

a univariate time series, which only consists of one time series. The different data sources

used in this project are described further in Section 3.2.
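The difference between a univariate and a multivariate time series can be made concrete with a small array sketch. Everything below is synthetic and purely illustrative: the sizes, the 15-minute sampling interval and the variable names are assumptions, not the datasets of Section 3.2.

```python
import numpy as np

# Hypothetical sizes: 7 days of 15-minute samples, 3 cells, 2 counters per cell.
steps_per_day = 96
n_days, n_cells, n_counters = 7, 3, 2

rng = np.random.default_rng(0)
t = np.arange(n_days * steps_per_day)
daily_cycle = np.sin(2 * np.pi * t / steps_per_day)  # synthetic daily pattern

# Univariate: one value per time step.
univariate = daily_cycle + 0.1 * rng.normal(size=t.size)

# Multivariate: one value per time step, cell and counter.
noise = 0.1 * rng.normal(size=(t.size, n_cells, n_counters))
multivariate = daily_cycle[:, None, None] + noise

# Flattened view with one column per (cell, counter) pair,
# a common input layout for a multivariate forecasting model.
flat = multivariate.reshape(t.size, -1)

print(univariate.shape, multivariate.shape, flat.shape)
```

The univariate series has one dimension (time), while the multivariate series adds one axis per extra source of variation, here cell and counter.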

1.2 Time series analysis and forecasting models

Time series analysis and forecasting have been used in many different areas and fields for

a long time, and the interest and the possible applications have grown over time. Apart

from the already mentioned weather forecasting, time series analysis is used to predict

traffic and congestion [MYWW15], [GPP+09], [TN07], to predict data traffic and load in

mobile networks [SPSM12], [WGLP10], [Bru00], [YMJ11], to predict movements of people

[JLP15], [DZ07], and to classify the time series (e.g., labeling sentences as grammatically

correct or incorrect), which is a task that is related to prediction and where there is a

mutual benefit of combining the tasks [HS03], to name a few applications. A number of

different algorithms and models have been designed and used to predict and forecast time

series, a few of these are described below.

The authors of [Bru00] and [TN07] suggest the use of Holt-Winter's algorithm, an exponential

smoothing algorithm where the impact of historical data on the forecasted data

decays exponentially. The algorithm builds on the premise that time series can be de-

composed into three components: baseline, linear trend, and seasonal trend, where all

components are presumed to evolve over time. The algorithm is able to perform multi-step

prediction and is fairly simple in its design. In [GPP+09] the authors propose

the use of a multiplicative seasonal Autoregressive Integrated Moving Average (ARIMA)

model, also called Box-Jenkins. ARIMA is a generalized version of the Autoregressive

Moving Average (ARMA) model and both these models are used for prediction, but also

to better understand time series by simplifying the behavior of the series. The authors of

[WGLP10] use an improved variation of the Support Vector Machine (SVM) algorithm:

Least Squares Support Vector Machine (LS-SVM), that is shown to be more efficient

and accurate than the ARIMA algorithm. In [DZ07] the authors propose yet another

algorithm, called Scale-free Highly-clustered Echo State Network (SHESN), that is a variation of the Echo

State Network (ESN) algorithm. The SHESN algorithm makes use of clustering to further

increase the performance of the forecasting. The clustering is implemented using a naturally

evolving dynamic state reservoir, unlike the ESN algorithm, which uses a completely

random dynamic state reservoir. The authors of [YMJ11] suggest the Prior knowledge

based Clustered Complex ESN (PCCESN) as an even better algorithm than SHESN, also

using a naturally evolving dynamic state reservoir to implement clustering, but with a

more adaptive implementation. The authors of [JLP15] suggest the use of the Cluster-

Aided Mobility Projector (CAMP) algorithm, that also uses clustering to increase the

performance of the forecasting algorithm. Unlike the algorithms above, this algorithm is used to predict

trajectories, and the clustering implementation makes the algorithm perform very well even

with very short previous trajectories. In both [RCSR07] and [GSM+15] the authors use

the K-means clustering algorithm to find spatial patterns in time series data. This can be

useful in a forecasting algorithm since it could replace the large amount of cell data with a

much smaller number of cluster data and greatly increase the efficiency of the algorithm.

The authors of [MYWW15] suggest the use of a Recurrent Neural Network combined

with a Restricted Boltzmann Machine (RNN-RBM) to predict and forecast time series.

When predicting traffic congestion inside a city the accuracy could reach as high as 88%.

This was accomplished by using data from spatially distributed time series of traffic speed

on a number of roads in the city where the speeds were collected from GPS data from

a great number of taxis. The ability to find dependencies both spatially and temporally

distinguishes the RNN-RBM model from the other algorithms and models mentioned.

The different models all have their advantages and disadvantages and are better suited

for different types of tasks and problems, using different datasets. In this project the

possible benefits of using the multivariate properties of the RNN-RBM model, designed

by [BLBV12], will be evaluated. The model has been chosen for evaluation based on the

promising results described in [MYWW15], which addresses a problem similar to the one in

this project. For comparison, the univariate Holt-Winter model will be used as a baseline,

mainly due to its simple design and ability to capture seasonality in the data. Both these

models are further described in Chapter 2.

1.3 Motivation

For Ericsson, time series analysis and forecasting is a useful tool in many different fields

and for many different datasets. One important area where time series analysis and

forecasting can have great impact is when trying to forecast the mobile traffic in the

mobile network. A good forecast of the mobile traffic could help in a number of ways

to improve the performance of the network. A few possible use-cases where forecasts of

mobile traffic could be helpful are listed below.

• Load balancing

By knowing beforehand when there is about to be high load in the network, measures

can be taken to prevent a situation of overload. Data can be preallocated, users can

be rerouted to nearby cells, among other measures. This will help improve the

Quality of Service (QoS) for all users.

• Energy savings

For most cells in the network, the data traffic will at times be very low, e.g. during

night time. In some cases, especially where the cells are close enough so that their

respective covered areas will overlap, there is not always a need for all components,

or even all cells, to be active. If cells could be partially or fully deactivated when

they are not needed, energy could be saved while also extending the life time of the

components. However, to avoid switching the components on and off repeatedly,

it is important to have knowledge about the amount of data traffic for a sufficient

amount of time into the future.

• Anomaly detection

The ability to detect anomalies in the data traffic can, among other things, help

detect components that are close to breaking down, and help reduce downtime by

making it possible to replace the component before it breaks down. Since an anomaly

is simply a divergence from the expected value, a good forecast will help improve

the anomaly detection.

The granularity of the forecasts, as well as how far into the future the forecasts are reliable,

will affect the impact of the forecasts as well as which applications can benefit from

the forecasts. For some applications a granularity of several hours could be enough, while

for others the granularity has to be as short as milliseconds. The different demands in

granularity will limit the model in different ways. A forecast on millisecond level will of

course demand that the model is fast enough to make the forecasts in time. A less granular

forecast will also in itself reach further into the future, but for very short timespans a

forecast longer into the future will be difficult to produce.

Chapter 2

Theory

In this chapter the infrastructure of the mobile network and the dependencies between

different parts of the network are explained in Section 2.1, followed by an explanation of the

two models used in this project. The baseline model, the Holt-Winter model, is described

in Section 2.2, and the RNN-RBM model is described in Section 2.3.

2.1 Mobile Network infrastructure

The infrastructure of the mobile network consists of a large number of connected base

stations that transfer the data between the users and the network. Historically, each basestation single-handedly served a large number of users. Today, however, each basestation is connected to a large number of smaller cells that each connect to a number of users [Hal].

Figure 2.1 shows a graphical representation of the connections between the basestation

and the cells, and how the network has changed from consisting only of basestations

to including more and more cells.

The cells can have different designs and properties depending on where they are located

and their purpose. The area covered can be large or small and overlaps in covered areas

Figure 2.1: A historical mobile network to the left, with a single basestation covering a large number of users, and a heterogeneous mobile network to the right, consisting of a large number of small cells connected to each basestation. Image source: [Hal]

are common. This helps with creating redundancy in the network so that if a cell or

basestation breaks down, a nearby cell or basestation can still give coverage to the area.

Each cell collects statistical data of how many users are connected and the amount of

data traffic they generate. When users move around, they will eventually move outside

the range of the cell they are connected to and connect to another cell that is closer.

This will cause the number of users connected to each cell to change over time. This

movement is not random, but follows certain patterns, e.g. from housing areas to business

areas in the morning and back in the evening or along a highway or a railway during

commuting hours. This can be graphically seen in Figure 2.2, where the movement from

housing areas to business areas is demonstrated by the colouring of the basestations. In

the top row, the active basestations are mainly at the housing areas and nearby roads.

In the middle row it can be seen that the active basestations are along the roads and

approaching the business areas. In the bottom row the active basestations are mainly at

the business areas. In Figure 2.3 an illustration of the load at the cells along a highway

can be seen. The car moves from top to bottom, passing the three cells and causing the

activity to increase and decrease at different times. Both these figures are, however, very

Figure 2.2: An illustration of the load at cells in housing and business areas during morning commuting. The top row shows activity in green, mostly at the housing areas, while the middle row shows activity mostly along the roads, and the bottom row shows activity mainly at the business areas.

simplified and only give a simple explanation of the spatial dependencies in the mobile

network.

In addition to the spatial dependence, there is also a strong temporal dependence in the

traffic in the network as can be seen in Figure 2.4. The daily cycle starts with low traffic

at nighttime, followed by a quick rise in the morning to a plateau of high traffic during the

day and finally a slow decline in the evening and back to the low traffic at night. The

weekly cycle will show higher loads during workdays and lower load during weekends.

The graph on top shows a smooth curve, where cycles are easy to see (it is even possible

to spot a decrease in activity during lunch hours), while the graph at the bottom is an

example where the cycles are less obvious.

Figure 2.3: An illustration of the load at cells along a highway when users pass by the cells. The load rises at Cell A first, followed by Cell B, and Cell C.

Figure 2.4: An illustration of the load during one week. The top graph shows a smooth curve, with cycles that are easy to spot, while the bottom graph shows a rougher curve, where the cycles are less obvious.

2.2 Holt-Winter

The additive Holt-Winter seasonal model with exponential smoothing is a model used for

time series forecasting [HA] [BD06] [Bru00]. The model builds on the premise that the

time series data can be decomposed into three components:

1. Baseline:

$$a_t = \alpha (y_t - c_{t-m}) + (1 - \alpha)(a_{t-1} + b_{t-1}) \tag{2.1}$$

2. Linear trend ("slope"):

$$b_t = \beta (a_t - a_{t-1}) + (1 - \beta) b_{t-1} \tag{2.2}$$

3. Seasonal trend:

$$c_t = \gamma (y_t - a_{t-1} - b_{t-1}) + (1 - \gamma) c_{t-m} \tag{2.3}$$

where $y_t$ is the true value at time $t$, $m$ is the seasonality of the time series, and $\alpha$, $\beta$, and $\gamma$ are adaption parameters of the model with values ranging from 0 to 1. The outputs $a_t$, $b_t$, and $c_t$ correspond to the baseline, the linear trend, and the seasonal trend, respectively. Both the seasonal parameter $m$ and the adaption parameters $\alpha$, $\beta$ and $\gamma$ can be calculated in a number of ways. For this project the seasonality is hard-coded to 24 hours, which translates to 96 time steps, and the adaption parameters are calculated automatically by the code referenced in Section 3.5. The sum of the three components $a_t$, $b_t$ and $c_{t+1-m}$ creates the forecasted value for time $t+1$ as

$$\hat{y}_{t+1} = a_t + b_t + c_{t+1-m} \tag{2.4}$$

where $\hat{y}_t$ is the forecasted value for time $t$. The initial values for the three components are calculated using


1.

$$a_{m-1} = \frac{1}{m} \sum_{i=0}^{m} y_i \tag{2.5}$$

2.

$$b_{m-1} = \frac{1}{m^2} \sum_{i=0}^{m} (y_{i+m} - y_i) \tag{2.6}$$

3.

$$c_i = y_i - a_0 \quad \text{for } 0 \le i < m \tag{2.7}$$

giving the model an update period from t = m to the end of the time series.
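The recursions in equations 2.1 to 2.4 can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the thesis (which is referenced in Section 3.5). The function name `holt_winters_additive` is hypothetical and the adaption parameters are passed in by hand rather than fitted. The initialisation assumes that the sums in equations 2.5 and 2.6 run over the first $m$ samples and that the seasonal terms are measured against the initial baseline.

```python
def holt_winters_additive(y, m, alpha, beta, gamma):
    """One-step-ahead forecasts following equations 2.1 to 2.4.

    Returns a dict that maps t + 1 to the forecast made at time t.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initial values (assumption: averages over the first m samples).
    a = sum(y[:m]) / m
    b = sum(y[i + m] - y[i] for i in range(m)) / m ** 2
    # c[t % m] holds c_{t-m} just before the update at time t.
    c = [y[i] - a for i in range(m)]

    forecasts = {}
    for t in range(m, len(y)):
        a_prev, b_prev = a, b
        c_old = c[t % m]                                               # c_{t-m}
        a = alpha * (y[t] - c_old) + (1 - alpha) * (a_prev + b_prev)   # (2.1)
        b = beta * (a - a_prev) + (1 - beta) * b_prev                  # (2.2)
        c[t % m] = (gamma * (y[t] - a_prev - b_prev)
                    + (1 - gamma) * c_old)                             # (2.3)
        forecasts[t + 1] = a + b + c[(t + 1) % m]                      # (2.4)
    return forecasts


# Usage on a synthetic, perfectly periodic series with season length 4.
pattern = [1.0, 3.0, 2.0, 0.0]
y = pattern * 5
forecasts = holt_winters_additive(y, m=4, alpha=0.3, beta=0.1, gamma=0.2)
```

On a perfectly periodic series without trend, the baseline stays at the seasonal mean, the slope stays at zero, and the forecasts reproduce the pattern regardless of the adaption parameters, which makes this a convenient sanity check.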

2.3 Recurrent Neural Network-Restricted Boltzmann Machine (RNN-RBM)

The Recurrent Neural Network-Restricted Boltzmann Machine (RNN-RBM) model is

different from many other models in that it uses multivariate dependencies. The model

is a combination of the RNN model and the RBM model and to better understand the

combined RNN-RBM model, an explanation of each of these models is given in Section

2.3.1 and 2.3.2, respectively, followed by an explanation of how the models combine into

the RNN-RBM model in Section 2.3.3.

2.3.1 Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are models designed to make use of sequential information [Bri]. Time series data are in many cases temporally dependent on previous data, and traditional feed-forward neural networks handle this poorly since they assume that all inputs (and outputs) are independent of each other. An RNN makes use of the temporal dependence by using previous computations as input to each new computation.

Figure 2.5: An RNN model unfolded in time. The bottom layer is the input, the top layer is the output and the middle layer is the hidden state, dependent on the input and the previous hidden state. Image source: [LBH15]

Figure 2.5 shows a typical RNN model unfolded in time. The bottom layer is the input layer, where $x_t$ is the input at time $t$. The middle layer is the hidden layer, where $s_t$ is the hidden state at time $t$. $s_t$ is calculated based on the input $x_t$ and the previous hidden state $s_{t-1}$ as

$$s_t = f(U x_t + W s_{t-1}) \tag{2.8}$$

where the function $f$ is usually a nonlinearity such as tanh or ReLU (rectified linear unit). The initial value of $s$, used to calculate the first hidden state, is usually set to all zeroes. The top layer is the output layer, where $o_t$ is the output at time $t$. $o_t$ is usually calculated using the softmax function as

$$o_t = \mathrm{softmax}(V s_t). \tag{2.9}$$

The parameters $U$, $V$ and $W$ are updated during training to achieve a desirable output. When used for forecasting of time series, the output should represent the input at time $t+1$, and the output from input $x_t$ is therefore often labeled as $o_{t+1}$.
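The forward pass in equations 2.8 and 2.9 can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions, not the Theano implementation used in the project: the function names, the layer sizes and the random weights are hypothetical, $f$ is taken to be tanh, and no training is performed.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def rnn_forward(xs, U, W, V):
    """Apply equations 2.8 and 2.9 to a sequence of input vectors."""
    s = np.zeros(W.shape[0])  # initial hidden state: all zeroes
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)      # (2.8) with f = tanh
        states.append(s)
        outputs.append(softmax(V @ s))  # (2.9)
    return np.array(states), np.array(outputs)


# Illustrative sizes and random, untrained weights.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_steps = 3, 5, 4, 6
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
xs = rng.normal(size=(n_steps, n_in))

states, outputs = rnn_forward(xs, U, W, V)
```

Each hidden state depends on the current input and on the previous hidden state, so information from earlier time steps is carried forward through `s`. Each row of `outputs` is a probability vector because of the softmax.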


Figure 2.6: A graphical description of an RBM with the visible layer at the bottom and the hidden layer at the top. Image source: [Deea]

2.3.2 Restricted Boltzmann Machine (RBM)

Restricted Boltzmann Machines (RBMs) are energy-based models that have been used as generative models of many different types of data, including high-dimensional temporal sequences such as video, motion capture data or speech [Hin10]. The model includes one hidden layer $\mathbf{h} = (h_1, h_2, \ldots, h_{n_k})^T$ and one visible layer $\mathbf{v} = (v_1, v_2, \ldots, v_{n_v})^T$, where $n_v$ and $n_k$ are the number of visible units and hidden units respectively. A graphical description

is shown in Figure 2.6. All units in each layer are connected to all units in the other, but

no connections exist between units in the same layer. Each unit can take the value 1 or

0 where 1 corresponds to the unit being activated.

The energy of the model is a scalar value associated to each configuration of the variables of interest and can be calculated by

\[ E(v,h) = -a^T v - b^T h - h^T W v \tag{2.10} \]

where $a$ and $b$ are bias vectors connected to $v$ and $h$ respectively and $W$ is a weight matrix between the layers. Each pair of a visible and a hidden vector can be assigned a probability by

\[ p(v,h) = \frac{1}{Z} e^{-E(v,h)} \tag{2.11} \]

where $Z$ is the partition function, the sum of $e^{-E(v,h)}$ over all possible pairs of visible and hidden vectors,

\[ Z = \sum_{v,h} e^{-E(v,h)} \tag{2.12} \]

and is used to normalize equation 2.11. The conditional probability that a hidden unit hj

is activated given the visible vector v, and the conditional probability that a visible unit

vi is activated given the hidden vector h can be calculated by

\[ P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{i,j}\Big) \tag{2.13} \]

\[ P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j h_j w_{i,j}\Big) \tag{2.14} \]

respectively, where $\sigma(x)$ is the logistic sigmoid function $1/(1 + e^{-x})$. Since no connections exist between units in the same layer, the conditional probabilities factorize as

\[ P(h \mid v) = \prod_j P(h_j \mid v) \tag{2.15} \]

\[ P(v \mid h) = \prod_i P(v_i \mid h) \tag{2.16} \]

and to find the parameters $\theta = (W, a, b)$ in equation 2.10, the RBM is required to maximize the probability of the training set $V$ by

\[ \arg\max_\theta \prod_{v \in V} P(v) \tag{2.17} \]

which is equivalent to maximizing the log-likelihood of $P(v)$. This is commonly done by using gradient ascent on the log-likelihood as

\[ \theta = \theta + \eta \frac{\partial \ln P(v)}{\partial \theta} \tag{2.18} \]


Figure 2.7: A graphical illustration of t-step Gibbs sampling. Note that the last hidden step in the figure should be labeled as h(t−1) and not h(t), to follow the pattern. Image source: [Deea]

where η is the learning rate and the partial derivative is calculated as

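\[ \frac{\partial \ln P(v)}{\partial \theta} = -\left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_{P(h \mid v)} + \left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_{P(v,h)} \tag{2.19} \]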

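where $\langle \cdot \rangle_P$ denotes the expectation value with respect to probability distribution $P$. Evaluating the second expectation in equation 2.19 exactly is computationally expensive, since it sums over all possible pairs of $v$ and $h$, and a better way is to use the Contrastive Divergence (CD) approach [Hin02] as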

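\[
\begin{aligned}
\frac{\partial \ln P(v)}{\partial w_{i,j}} &\approx P(h_j = 1 \mid v^{(0)})\, v_i^{(0)} - P(h_j = 1 \mid v^{(t)})\, v_i^{(t)} \\
\frac{\partial \ln P(v)}{\partial a_i} &\approx v_i^{(0)} - v_i^{(t)} \\
\frac{\partial \ln P(v)}{\partial b_j} &\approx P(h_j = 1 \mid v^{(0)}) - P(h_j = 1 \mid v^{(t)})
\end{aligned}
\tag{2.20}
\]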

where $v^{(t)}$ is the result from t-step Gibbs sampling. One step of Gibbs sampling is performed using equations 2.13 and 2.14 in turn, setting each $h_j$ and $v_i$ equal to 1 randomly based on the calculated probabilities. By repeating this process $t$ times, $v^{(t)}$ can be obtained, which can be seen graphically in Figure 2.7.
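The Gibbs sampling and the CD approximation described above can be sketched in NumPy as follows. This is an illustrative sketch, not the code used in the project: the function name `cd_gradients`, the layer sizes, the learning rate and the storage of W with shape (visible, hidden), so that `W[i, j]` plays the role of $w_{i,j}$ in equations 2.13 and 2.14, are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_gradients(v0, W, a, b, steps, rng):
    """Approximate the log-likelihood gradients with Contrastive Divergence.

    W has shape (n_visible, n_hidden); a and b are the visible and hidden biases.
    """
    v = v0.copy()
    for _ in range(steps):
        p_h = sigmoid(b + v @ W)                          # equation 2.13
        h = (rng.random(p_h.shape) < p_h).astype(float)   # sample hidden units
        p_v = sigmoid(a + W @ h)                          # equation 2.14
        v = (rng.random(p_v.shape) < p_v).astype(float)   # sample visible units
    p_h0 = sigmoid(b + v0 @ W)   # hidden probabilities given the data
    p_ht = sigmoid(b + v @ W)    # hidden probabilities given the sample
    grad_W = np.outer(v0, p_h0) - np.outer(v, p_ht)
    grad_a = v0 - v
    grad_b = p_h0 - p_ht
    return grad_W, grad_a, grad_b

# Illustrative sizes and values (not taken from the thesis).
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
a = np.zeros(n_visible)
b = np.zeros(n_hidden)
v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)
eta = 0.1   # learning rate

grad_W, grad_a, grad_b = cd_gradients(v0, W, a, b, steps=1, rng=rng)
W += eta * grad_W   # parameter update as in equation 2.18
a += eta * grad_a
b += eta * grad_b
```

With `steps=1` this corresponds to the commonly used single-step variant; a larger value runs a longer Gibbs chain before the gradients are estimated.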


Figure 2.8: A graphical illustration of the RNN-RBM model. The bottom layer is the RNN implementation and the top two layers are the RBM implementation. Image source: [Deeb] [BLBV12]

2.3.3 RNN and RBM combined as RNN-RBM

Nicolas Boulanger-Lewandowski, together with Yoshua Bengio and Pascal Vincent at

Université de Montréal, combined the RNN and RBM models into an RNN-RBM model,

as a generalization of the Recurrent Temporal RBM (RTRBM)[BLBV12]. The purpose

was to further utilize the forecasting capability of the two models and to create a model

that allowed more freedom to describe the temporal dependencies involved. The model

extends the RNN model by adding an RBM at each time step. The output layer of the RNN, as described in Figure 2.5, is no longer a direct representation of the visible units to be forecast, but instead provides the basis for the parameters of the RBM model. This can be seen graphically in Figure 2.8. The bottom layer constitutes the RNN model and the top two layers constitute the RBM model. The model consists of nine parameters: $W$, $b_v$ and $b_h$ as part of the RBM model, $W_{uu}$, $W_{vu}$, $u^{(0)}$ and $b_u$ as part of the RNN model and $W_{uh}$ and $W_{uv}$ to connect them.

The initial values for the matrices $W$, $W_{uu}$, $W_{vu}$, $W_{uh}$ and $W_{uv}$ can be set to small random normalized values and the initial values for the bias vectors $b_v$, $b_h$, $b_u$ and $u^{(0)}$ can be set to zero. The dimensions of the parameters are given by the number of units in the visible layer, $n_v$, the number of hidden units in the RBM layer, $n_h$, and the number of hidden units in the RNN layer, $n_{hr}$. The number of hidden units in the RBM layer and the RNN layer will be set by evaluating a number of combinations of parameters.

The bias vectors for the RBM model, $b_v^{(t)}$ and $b_h^{(t)}$, are updated through the hidden units of the RNN layer, $u^{(t-1)}$, as

\[ b_v^{(t)} = b_v + W_{uv} u^{(t-1)} \tag{2.21} \]

\[ b_h^{(t)} = b_h + W_{uh} u^{(t-1)} \tag{2.22} \]

where $b_v$ and $b_h$ are the initial bias vectors for the visible units and the hidden units in the RBM layer. The vector $u^{(t)}$ represents the hidden units for the RNN layer at time $t$ and is calculated as

\[ u^{(t)} = f(b_u + W_{uu} u^{(t-1)} + W_{vu} v^{(t)}) \tag{2.23} \]

where $f$ is an activation function and $b_u$ is the initial bias vector for the hidden units in the RNN layer. [BLBV12] suggests the logistic sigmoid function $\sigma$ as activation function, while [Deeb] suggests using the tanh function. Each training iteration of the model is based on the following scheme:

1. Generate the hidden units $u^{(t)}$ for the RNN model using equation 2.23 on the set of visible units.

2. Update the bias vectors $b_v^{(t)}$ and $b_h^{(t)}$ using equations 2.21 and 2.22 respectively for $u^{(t-1)}$ and perform n-step Gibbs sampling to obtain a representation of the visible units $v^{(t)*}$.

3. Calculate the log-likelihood gradient using the CD approach described in equation 2.20 with respect to $W$, $b_v^{(t)}$ and $b_h^{(t)}$.

4. Propagate the gradient with respect to $b_v^{(t)}$ and $b_h^{(t)}$ backwards in time to obtain the gradients with respect to $W_{uu}$, $W_{vu}$, $W_{uv}$, $W_{uh}$, $b_v$, $b_h$ and $b_u$.

The forecasted value $v^{(t+1)}$ is then obtained by first constructing the bias vectors $b_v^{(t+1)}$ and $b_h^{(t+1)}$, using equations 2.21 and 2.22, and then performing repeated steps of Gibbs sampling, with $v^{(t+1)}$ initiated as zero, until convergence.
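The forecasting step can be sketched in NumPy as below, for the binary version of the model. This is a hedged illustration rather than the implementation used in the project: the function name `forecast_next`, the matrix shapes, the use of tanh for f and the fixed number of Gibbs steps (instead of a convergence check) are assumptions, and the parameters are random, untrained values, so the output only demonstrates the data flow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forecast_next(history, params, gibbs_steps, rng):
    """Forecast v(t+1) from the binary history vectors v(1), ..., v(t)."""
    W, bv, bh, Wuu, Wvu, bu, u0, Wuh, Wuv = params
    u = u0
    for v in history:
        u = np.tanh(bu + Wuu @ u + Wvu @ v)    # equation 2.23 with f = tanh
    bv_next = bv + Wuv @ u                     # equation 2.21 for time t+1
    bh_next = bh + Wuh @ u                     # equation 2.22 for time t+1
    v = np.zeros_like(bv)                      # v(t+1) initiated as zero
    for _ in range(gibbs_steps):               # fixed step count (assumption)
        p_h = sigmoid(bh_next + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(bv_next + W @ h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v

# Illustrative sizes and untrained random parameters (not taken from the thesis).
rng = np.random.default_rng(0)
n_v, n_h, n_hr = 8, 5, 6
params = (
    rng.normal(scale=0.1, size=(n_v, n_h)),     # W, stored as (visible, hidden)
    np.zeros(n_v),                              # bv
    np.zeros(n_h),                              # bh
    rng.normal(scale=0.1, size=(n_hr, n_hr)),   # Wuu
    rng.normal(scale=0.1, size=(n_hr, n_v)),    # Wvu
    np.zeros(n_hr),                             # bu
    np.zeros(n_hr),                             # u(0)
    rng.normal(scale=0.1, size=(n_h, n_hr)),    # Wuh
    rng.normal(scale=0.1, size=(n_v, n_hr)),    # Wuv
)
history = rng.integers(0, 2, size=(10, n_v)).astype(float)
v_next = forecast_next(history, params, gibbs_steps=25, rng=rng)
```

The RNN state u summarizes the history, and only the two time-dependent bias vectors carry that summary into the RBM that produces the forecast.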

2.3.4 Binary versus real-valued data

The model developed in [BLBV12] was designed to predict and generate MIDI sequences by learning both the temporal dependencies and the conditional distribution of chords. This 2-dimensional dependence can be adapted and applied to spatially distributed data and in theory be extended to handle dependencies in n dimensions. The MIDI sequences are represented by a binary vector of length 88, where each index corresponds to a note in the MIDI spectrum. A one represents an active note, whereas a zero represents an inactive note. The problem to be solved is therefore a binary problem, where a binary RBM design is used. The problem in this project is, in contrast, a real-valued problem. To be able to apply the proposed model to this problem, either the data has to be transformed to fit the model, or the model has to be adapted to fit the data.

Chapter 3

Methodology

This chapter aims to give an explanation on how the project has been performed and the

tools that have been used. The chapter starts with Section 3.1, where the software that

has been used is described, and is followed by Section 3.2, which describes the datasets that

have been used for all simulations and evaluations. Section 3.3 gives a brief explanation of

the training process of the model, followed by a description of the evaluation method in

Section 3.4. The chapter ends with Section 3.5, explaining the source for the code used

in the project.

3.1 Software

A significant part of the project has been dedicated to programming the model and

running simulations of the forecasts. Programming a Machine Learning model can be

made in a lot of different programming languages, where MATLAB, R and Python are

some of the most popular ones. At Ericsson research, Python is a widely used language

and since Python is also the language that the base model is programmed in, the choice

of language was obvious. To aid in the advanced Machine Learning a numerical computation

library called Theano[TARA+16] was used. Python, Theano and other software that have

been used in the project are described in Appendix A.

3.2 Dataset

Two different datasets from two different capital cities have been used for testing of the

model during the project. During the first weeks a dataset consisting of counters of

a number of different traffic characteristic performance data from a number of cells in a

major capital city was used, Dataset 1. It was used primarily to evaluate the basic functionality

of the RNN-RBM model and the baseline model. In the later part of the project a

dataset consisting of counters of a number of different traffic characteristic performance

data from a number of cells in a small capital city was used, Dataset 2. This dataset

was used for the final testing and evaluation of the performance of the model depending

on the different modifications that have been made to the model. Since the datasets

consist of a number of different cells and a number of different data counters, there is

a 3-dimensional dependence for the forecast model: temporal, spatial and between data

counters. However, the dimensionality has been reduced to two dimensions by folding the

data counters and the spatiality into one dimension, due to time limitations on extending

the model to handle more than two dimensions. This causes a loss of information

about dependencies between similar data counters or data counters in the same cell, since

all dependencies between different time series are treated as equal.
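The folding of cells and counters into one dimension can be illustrated with a short NumPy sketch. The array layout (time, cell, counter), the random placeholder data and the helper `column` are assumptions made for illustration; the sizes mirror one day of data with 10 cells and 5 counters.

```python
import numpy as np

n_steps, n_cells, n_counters = 96, 10, 5   # one day, 10 cells, 5 counters

rng = np.random.default_rng(0)
data = rng.random((n_steps, n_cells, n_counters))   # placeholder values

# Fold the cell and counter axes into a single axis of parallel time series.
folded = data.reshape(n_steps, n_cells * n_counters)

def column(cell, counter):
    """Index of a (cell, counter) pair in the folded array."""
    return cell * n_counters + counter
```

After folding, the model sees 50 parallel time series and no longer knows which of them share a cell or a counter type, which is the loss of information described above.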

3.2.1 Dataset 1

Dataset 1 consists of data from more than 7000 cells from the central parts of a major

capital city, including all types of city districts. The data is aggregated over 15-minute

intervals, giving 96 data points per day, for a period of 30 days. There are four different

data counters included in the set covering

• the amount of data traffic

• the number of data requests

• the number of calls

• the number of SMS sent

during each time interval. The data is aggregated over all users and normalized, to protect

the privacy of all users, while still keeping the relative values consistent.

3.2.2 Dataset 2

Dataset 2 consists of data from 10 cells in the semi-central parts of a small capital city,

including mainly business areas. The 10 cells have been selected as the cells with the

largest amount of data transferred through the cells so that irregularities in the data

could be kept as low as possible. The data is aggregated over 15-minute intervals, giving

96 data points per day, for a period of 122 days. There are five different data counters

included in the set covering

• the amount of downloaded data

• the amount of uploaded data

• the number of packets sent

• the number of calls

• the number of SMS sent

during each time interval. The data is aggregated over all users but not normalized.

The data covers a period of 122 days. However, a number of days in the middle of the

interval had a varying amount of missing data. Some days even had to be disregarded

due to too much missing data, while others could be repaired by filling out the missing

data with interpolated data. Only days with limited missing data were repaired, to not

risk interfering with the performance of the model. This led to the data being split into

two separate groups where the first group consisted of 47 days and was used for training,

while the second group consisted of 54 days and was used for testing. In between these

were 21 days that were disregarded due to too much missing data.

The data in Dataset 2 is transformed by taking the logarithm of each data point in an

attempt to make the data set more linear. This can only be done if no values are zero,

which is guaranteed by changing all zeroes in the input data to a very small number, larger

than zero. The dataset is then normalized to mean = 0 and standard deviation = 1. The

normalization parameters are obtained from the training data and are individual

for each time series. The parameters are then used as the basis when normalizing the

test data, meaning that the mean and standard deviation will differ slightly from 0 and

1 respectively for the test data. The parameters are also used when de-normalizing the

forecasted data, to ensure that all data is on the same scale.
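The preprocessing described above can be sketched as follows. The replacement value `EPS`, the function names and the random placeholder data are assumptions for illustration; the thesis only states that zeroes are replaced by a very small positive number.

```python
import numpy as np

EPS = 1e-6   # hypothetical small positive replacement for zero values

def log_transform(x, eps=EPS):
    """Replace zeroes by a small positive number and take the logarithm."""
    return np.log(np.where(x == 0, eps, x))

def fit_normalizer(train):
    """Mean and standard deviation per time series (column) of the training data."""
    return train.mean(axis=0), train.std(axis=0)

def normalize(x, mean, std):
    return (x - mean) / std

def denormalize(z, mean, std):
    """Undo the normalization and the logarithm."""
    return np.exp(z * std + mean)

# Placeholder data: 4 days of 96 time steps for 3 time series.
rng = np.random.default_rng(0)
raw = rng.integers(0, 1000, size=(96 * 4, 3)).astype(float)
raw[5, 0] = 0.0                                   # make sure a zero is present
train, test = raw[: 96 * 3], raw[96 * 3:]

train_log = log_transform(train)
mean, std = fit_normalizer(train_log)             # parameters from training data only
train_norm = normalize(train_log, mean, std)
test_norm = normalize(log_transform(test), mean, std)
restored = denormalize(test_norm, mean, std)
```

Because `mean` and `std` come from the training data alone, `test_norm` will in general not have exactly zero mean and unit standard deviation.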

To ensure the data is anonymized, all data has been modified and no true values are

presented in the report. All relative differences are, however, kept intact.

3.3 Parameter tuning

The performance of the model is dependent on the tuning of a few parameters: the number of hidden units in the RBM layer, the number of hidden units in the RNN layer and the batch size during training. The performance is also affected by the amount of historical data used in the forecasting process. For each additional parameter value that should be evaluated, a complete set of simulations together with all sets of values for the other parameters is needed. This means that the number of simulations needed for evaluation will scale very fast with the number of parameter values. The total number of simulations will scale as

\[ S = \prod_{i=1}^{n} p_i \tag{3.1} \]

where $S$ is the number of simulations, $p_i$ is the number of values for parameter $i$, and $n$ is the total number of different parameters. Because of this, a few initial simulations have been made to try to pinpoint the range including the optimal parameter values. In the next step the simulations have been made in a more systematic way, where a few different values in the assumed optimal range have been evaluated. Three different numbers of hidden units in the RBM layer and three different numbers of hidden units in the RNN layer have been evaluated in this way.
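As a small worked example of equation 3.1, the sketch below counts and enumerates the combinations for a grid of three parameters. The values are the ones reported for the real-valued tests in Section 4.2; the dictionary keys are illustrative names, not identifiers from the project code.

```python
from itertools import product
from math import prod

# Parameter values as reported in Section 4.2; the key names are illustrative.
grid = {
    "rbm_hidden": [100, 300, 1000],
    "rnn_hidden": [1000, 2000, 5000],
    "batch_days": [1, 2],
}

# Equation 3.1: the product of the number of values per parameter.
n_simulations = prod(len(values) for values in grid.values())

# Every combination that would need its own simulation.
combinations = list(product(*grid.values()))
```

Here `n_simulations` is 3 · 3 · 2 = 18, and adding a single extra value for the RBM layer would raise it to 24, which shows how quickly the number of simulations grows.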

When training the RNN-RBM model, the input data needs to be separated into smaller

batches, both to reduce the runtime and to improve the performance of the training. The batches

cover all cells and all counters, but are limited in the time dimension. One batch at a time

is inserted into the model for training, until all batches have been trained. Between each

batch the error is calculated and the parameters are updated. When all batches have been

trained, the training is repeated for all batches for a number of cycles, or epochs, until the

error has converged. The size of the batches will impact the performance of the model; a

too small size has the risk of not capturing all dependencies, while too large batches have

the risk of making the model too hard, or even impossible, to train. The best choice of

batch size has been briefly evaluated during the project by testing both 1 day and 2 days

of data per batch. Some testing has also been made to artificially connect all batches, so

that the risk of not learning all temporal dependencies will be minimized.

During testing, the model needs some initial historical data to be able to make a forecast

of the next value. The amount of historical data used has been chosen in two different

ways. First, the initial data has been set to a fixed amount of data for each single forecast,

in a so-called “Sliding window”. This means that the accuracy of the forecast should not

depend on which time step is being forecast. The other approach is to use all

available historical data in the test set as initial data to the forecast. This means that

the initial data will be larger for forecasts at time steps in the end of the test set than in

the beginning of the test set. The accuracy of the forecast should therefore have a chance

of being better in the end of the test set. This approach will hereafter be referred to as

“Full history”.

3.4 Evaluation method

The forecasts made by the RNN-RBM model have been made as one-step rolling forecasts. A one-step rolling forecast is a combination of multiple forecasts, where each forecast is for a single time-step beyond the input data. Between forecasts, the input data is extended with the known data for one additional time-step, so that the next forecast is made one time-step after the previous one, hence the “rolling”. The input data for each individual forecasted value has been set, for the first set of tests, to a fixed amount of data and, for the second set of tests, to all available test data prior to the value being predicted. This is also the way the Holt-Winter model operates on the input data, with the restriction that a minimum of one season of data is needed for the Holt-Winter model. Each complete forecast has then been evaluated and compared with the corresponding baseline forecast using two different performance values: RSS, which should be as low as possible, and AUC, which should be as high as possible (with 1 as an upper limit), as explained below.
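The one-step rolling forecast, with either a “Sliding window” or “Full history” as input, can be sketched as below. The function `rolling_forecast` and its arguments are hypothetical, and `np.mean` is used only as a placeholder forecaster; it stands in for neither the RNN-RBM model nor the Holt-Winter model.

```python
import numpy as np

def rolling_forecast(series, forecast_fn, min_history, window=None):
    """One-step rolling forecast over a 1-D series.

    window=None uses all data before the forecasted step ("Full history");
    an integer uses only that many preceding steps ("Sliding window").
    """
    forecasts = []
    for t in range(min_history, len(series)):
        start = 0 if window is None else max(0, t - window)
        forecasts.append(forecast_fn(series[start:t]))   # forecast for step t
    return np.array(forecasts)

# Toy series and placeholder forecaster (the mean of the supplied history).
series = np.arange(10, dtype=float)
sliding = rolling_forecast(series, np.mean, min_history=3, window=2)
full = rolling_forecast(series, np.mean, min_history=3)
```

In both modes the history is advanced by one known data point between consecutive forecasts; the modes differ only in how far back the history reaches.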


3.4.1 Residual sum of squares - RSS

The RSS value is used to compare continuous values and is calculated by taking the square of the difference between the forecasted value and the real value for each data point and summing over all data points, as given by

\[ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3.2} \]

where $y_i$ is the real value to be forecasted and $\hat{y}_i$ is the forecasted value. The RSS value gives a comparable value of how close the forecasted data is to the real data; the lower the RSS value, the better. The value is dependent on the range of the data, since the equation only takes the square of each difference. This makes it impossible to use as a comparable value for forecasts of different sources of data. However, using the RSS value is a simple way of comparing forecasts of the same data source.
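Equation 3.2 and its dependence on the range of the data can be illustrated with a short sketch. The function name `rss` and the numbers are made up for illustration.

```python
import numpy as np

def rss(actual, forecast):
    """Residual sum of squares, equation 3.2."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sum((actual - forecast) ** 2))

actual = np.array([1.0, 2.0, 3.0])
forecast = np.array([1.5, 2.0, 2.0])

value = rss(actual, forecast)              # 0.25 + 0 + 1 = 1.25
scaled = rss(10 * actual, 10 * forecast)   # same forecast quality, larger range
```

Scaling both series by a factor of 10 multiplies the RSS by 100, which is why RSS values from data sources with different ranges cannot be compared directly.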

3.4.2 Area under the curve - AUC

The AUC value is used to compare the performance of classified data. When evaluating

the performance of a classification model it is common to use the confusion matrix. The

confusion matrix for a binary classification, e.g. whether a cell in the network is overloaded

or not, can be seen in Figure 3.1. The confusion matrix can be extended to multiple

classifications, but in this project only the binary case is relevant.

On the diagonal from the top left corner to the bottom right corner are all values that were

correctly classified. In all other boxes are the values that were misclassified. In the binary

case the correct classifications are labeled as True Positive, TP , where a positive value was

correctly classified as a positive value, and True Negative, TN , where a negative value was

correctly classified as a negative value. The incorrect classifications are similarly labeled

as False Positive, FP , where a negative value was incorrectly classified as a positive value,


Figure 3.1: The confusion matrix presents a visualization of how well a model manages to classify a set of values. The correctly classified values are shown on the diagonal from the top left corner to the bottom right corner, while the incorrectly classified values are presented in the other fields. Image source: [dS]

and False Negative, FN , where a positive value was incorrectly classified as a negative

value. From these values a number of additional evaluation values can be calculated.

Among them are the True Positive Rate, TPR, as

\[ \mathrm{TPR} = \frac{TP}{P} \tag{3.3} \]

where P is the number of positive values in the dataset, and False Positive Rate, FPR,

as

\[ \mathrm{FPR} = \frac{FP}{N} \tag{3.4} \]

where $N$ is the number of negative values in the dataset. When plotting these values with TPR on the y-axis and FPR on the x-axis, a curve, called the ROC curve (Receiver Operating Characteristic), can be constructed by connecting the points (0, 0), (FPR, TPR) and (1, 1), as can be seen in Figure 3.2. The area under this curve is then called the AUC value (Area Under the Curve) and is an indication of how well the model is able to classify the values. An AUC value below 0.5 means that the model performs worse than random and can be improved simply by reversing the predictions (as the bottom curve in Figure 3.2). An AUC value of 0.5 means the model performs on par with a random guess and indicates that it is a bad classifier. The closer the AUC value is to 1, the closer the model is to a perfect classifier. Graphically this is when the (FPR, TPR) point is located in the upper left corner.

Figure 3.2: The ROC space presents, graphically, the performance of a classification model. The dots represent the rate of true positives, TPR, versus the rate of false positives, FPR. A value close to the upper left corner, with high TPR and low FPR, is a good classifier.

When making a classification, a threshold has to be set to separate the predicted negative values from the positive values. This threshold can be varied to maximize the performance of the model. If the model predicts too many false positives, the threshold could be increased, and conversely, if the model predicts too few true positives, the threshold could be decreased. The performance of different thresholds can be captured in an extended ROC plot, where all (FPR, TPR)-pairs are plotted, as can be seen in Figure 3.3.


Figure 3.3: The results from different choices of threshold for the classification model create a curve. The area under the curve, AUC, as well as the shape of the curve, gives an indication of the performance of the model.

By connecting all points, a smoother ROC curve can be obtained that shows how the performance is dependent on the thresholds. The closer the curve is to the edges and the top left corner, the better the model is at making correct classifications. The AUC value connected to this plot better represents the performance of the model than when only considering one (FPR, TPR)-pair. The choice of threshold is very important for the performance of the model; however, the smoother the curve is, the less sensitive the model is to the choice of threshold. The bottom curve in Figure 3.3 shows the results of a model that is a bad classifier. The curve crosses the random curve several times and is very sensitive to the choice of threshold. The shape of the curve, as well as the AUC value of 0.52, indicates that the predictions might be more or less random. The upper curve, however, shows the results of a model that is able to classify the data fairly well, with an AUC value of 0.7, which is well above the random value of 0.5. The shape is also smooth, which indicates that it is not very sensitive to the choice of threshold.
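The quantities in this section can be computed with a few lines of NumPy. The sketch below is illustrative only: the function names, the labels and the scores are made up, and a prediction is taken to be positive when its score is at or above the threshold.

```python
import numpy as np

def rates(labels, scores, threshold):
    """Return (FPR, TPR) for one threshold, equations 3.4 and 3.3."""
    predicted = scores >= threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    return fp / np.sum(~labels), tp / np.sum(labels)

def single_point_auc(fpr, tpr):
    """Area under the curve through (0, 0), (FPR, TPR) and (1, 1)."""
    return (1.0 + tpr - fpr) / 2.0

def roc_auc(labels, scores):
    """Area under the ROC curve obtained by sweeping all thresholds."""
    thresholds = np.unique(scores)[::-1]      # from strictest to most permissive
    points = [(0.0, 0.0)]
    points += [rates(labels, scores, th) for th in thresholds]
    points.append((1.0, 1.0))
    fpr, tpr = (np.array(axis) for axis in zip(*points))
    # Trapezoidal rule over consecutive ROC points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up example: True marks a positive value, scores are the model output.
labels = np.array([True, True, False, True, False, False])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.1])

fpr, tpr = rates(labels, scores, threshold=0.65)
auc_single = single_point_auc(fpr, tpr)   # one (FPR, TPR)-pair
auc_full = roc_auc(labels, scores)        # all thresholds
```

For this example the single threshold gives an AUC of 2/3, while the full threshold sweep gives 8/9, illustrating why the AUC of the extended ROC plot is the more representative value.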


3.5 Code

The execution of this project and all simulations have been made possible by using some code developed by other groups. All baseline simulations have been conducted using available code, free for download at [Que]. The code has not been modified in any way. For the simulations using the RNN-RBM model, another set of code has been used. The code is free and available for download at [BL] and explained further in [BLBV12]. This code has been modified to fit the dataset. The modified version, together with all other code, is available to Ericsson in the internal Git system.

Chapter 4

Results

The forecasts have been made using two different approaches and the results of each approach have been compared to the forecast made by the baseline model, Holt-Winter. The results of the first approach, where the input data is transformed to fit the original design of the model, are presented in Section 4.1. The results of the second approach, where instead the model is adapted to fit the input data, are presented in Section 4.2. All comparisons are made for the same cell, randomly chosen from the set, called Cell A, and for the same datatype, download, also randomly chosen.

4.1 Binary-valued input data

The RNN-RBM model that is used in this project is designed to work with binary data

representing a MIDI file. The model will take a binary vector with 88 values as input,

where each point in the vector corresponds to a certain note in a MIDI sequence, called

“piano-roll”¹. There are dependencies between the notes, where certain combinations of

notes, e.g. combinations that form chords, have high probability to occur, while other

combinations have very low probability to occur. There are also temporal dependencies,

¹ https://en.wikipedia.org/wiki/Piano_roll#In_digital_audio_workstations

where the probability that a combination of notes will follow another combination of notes

will vary depending on the combinations.

The early testing of the model on the mobile network data is made by converting the real-valued download data of Cell A into binary vectors. The data is first divided into $n_p$ different percentile values using a function in the Python library NumPy, where the output value from the function for each data point is the index of the percentile that the data point belongs to. The index is then used as the base for creating a vector with length $n_p + 1$, where the first point corresponds to a missing value in the input data and the following points correspond to the different percentiles. All points in the vector are zero except for the one with the index corresponding to the percentile value. A model trained in this way cannot tell whether a forecasted value is “close” to the correct value, only whether the value is correct or not.
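One possible way to perform this conversion is sketched below. The thesis does not name the NumPy function that was used, so the combination of `np.percentile` and `np.digitize`, the function name `to_one_hot_percentiles` and the use of NaN to mark missing values are all assumptions.

```python
import numpy as np

def to_one_hot_percentiles(values, n_percentiles):
    """Encode each value as a one-hot vector of length n_percentiles + 1.

    Index 0 marks a missing value (NaN); indices 1..n_percentiles mark the
    percentile bin that the value falls into.
    """
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    # Inner bin edges at evenly spaced percentiles of the observed data.
    qs = np.linspace(0, 100, n_percentiles + 1)[1:-1]
    edges = np.percentile(values[~missing], qs)
    bins = np.digitize(np.nan_to_num(values), edges) + 1   # 1..n_percentiles
    bins[missing] = 0                                      # reserved index
    one_hot = np.zeros((len(values), n_percentiles + 1))
    one_hot[np.arange(len(values)), bins] = 1.0
    return one_hot

# Made-up example with one missing value and 5 percentile bins.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan]
one_hot = to_one_hot_percentiles(values, n_percentiles=5)
```

Neighbouring bins are encoded as unrelated positions in the vector, which is why a forecast in an adjacent bin counts as just as wrong as one far away.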

The number of percentiles can be chosen freely. A low number of percentiles removes more of the information in the input data while at the same time making the output data easier to forecast, in the sense that there are fewer options to choose from. Conversely, for a higher number of percentiles, more information is preserved in the input data, while at the same time it is harder to forecast the correct value due to a higher number of options to choose from. No matter how many percentiles are used, some information is bound to be lost².

Three different numbers of percentiles have been chosen for evaluation: 5, 20, and 100,

to show how the performance of the model varies with the number of percentiles. For

each of these a forecast made by the Holt-Winter model has also been produced. For

the RNN-RBM model the number of hidden units in the RBM layer has been set to 200,

and the number of hidden units in the RNN layer has been set to 100 and 1000 in two

different simulations. The one-step rolling forecast for two consecutive days using the

² Of course, even continuous data is discrete when digitized, which in practice gives an upper limit to the number of percentiles.

Figure 4.1: The left and the middle columns show the forecasts made using the RNN-RBM model with 100 and 1000 hidden units in the RNN layer, respectively. The right column shows the forecasts made using the Holt-Winter model. The rows correspond to dividing the input data into different numbers of percentiles with 5 on top, 20 in the middle and 100 at the bottom. The blue line is the real data and the red line is the forecasted data.

RNN-RBM model with the two different set-ups can be seen next to the Holt-Winter

forecast in Figure 4.1.

As can be seen in Figure 4.1, the forecasts using the RNN-RBM model are mostly random,

independent of how many percentiles the data is divided into or how many hidden units

in the RNN layer the model is using. The Holt-Winter model notably outperforms

the RNN-RBM model, with an RSS value about an order of magnitude lower.


4.2 Real-valued input data

In the following and main part of the project the input data is not transformed and

instead the RNN-RBM model is adapted to fit the input data. In this way no information

in the input data is lost. The initial testing is made on one single cell with one single

datatype as described in Section 4.2.2, giving only one time series as input data. The tests

then proceed to first handle multiple cells and a single counter, as described in Section

4.2.3, then a single cell and multiple counters, as described in Section 4.2.4. Lastly the

combination of both multiple cells and multiple counters is tested, as described in Section

4.2.5. Before these sections a short description of the baseline forecasts is given in Section

4.2.1.

In all tests, the number of hidden units in both the RBM layer and the RNN layer has

been varied between simulations to evaluate what numbers produced the best results. For

the RBM layer the different numbers of hidden units have been 100, 300 and 1000 units.

For the RNN layer the values have been 1000, 2000 and 5000 units. The batch size during

training has also been varied, to evaluate if the results were affected, between 1 and 2

days, corresponding to 96 and 192 time steps. The initial data during testing has been set

to both a “Sliding window” of 14 days and to “Full history”, using all available historical

data.

For each of the tests the results have been evaluated using the RSS value and the AUC

value, as described in Section 3.4, and presented in a set of plots. The limits of the y-axis

have been fixed to ease the comparisons. As a result, some of the forecast plots reach above

the limit. Each plot consists of a complete one-step rolling forecast at the bottom with

a zoom in over two days in the upper left corner and an ROC curve in the upper right

corner. The initial 14 days and the last 1 day have been cut away from the forecast, the

first 14 days to let the forecasts converge and the last day due to noisy data, giving a

total of 39 forecasted days.

Figure 4.2: One-step forecast made using the Holt-Winter model (red) on unchanged data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.

To be able to calculate the ROC curve and the corresponding AUC value for this problem,

it first has to be converted into a classification problem. In these tests, the classification

has been defined as the ability to predict when the amplitude of the input data is in the

top 10 % of all input data.

4.2.1 Baseline forecasts

Two baseline forecasts, using the Holt-Winter model, have been created for comparison.

Figure 4.2 shows a forecast made with the input data unchanged. However, this procedure

produces negative values, which have manually been set equal to zero since the real data is

never negative. Figure 4.3 shows a forecast made with input data that has been transformed

by taking the logarithm of each datapoint and then normalized over the entire

dataset. This produces a forecast that has no negative values.

When comparing the two forecasts, the RSS value is slightly better in the first (1.75

versus 1.76), which would indicate that the first forecast is closer to the real data than

the second. However, when looking at the AUC value in the ROC plot, the second is


Figure 4.3: One-step forecast made using the Holt-Winter model (red) on modified data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.

slightly better (0.774 versus 0.770), which would indicate that the second forecast is able

to classify the top 10 % amplitudes slightly better than the first. The differences, however,

are very small in both cases and the forecasts can more or less be regarded as identical.

4.2.2 Single cell and single counter

When making forecasts using only one single cell and one single counter the RNN-RBM

model operates on data with the same dimensionality as the Holt-Winter model. Since

one of the strengths of the RNN-RBM model lies in its ability to handle multivariate

input data, the results from these forecasts will not in themselves be enough to determine which

of the models is the most accurate.

All results from the simulations, with both limited amount of input data (7 days prior to

the forecasted datapoint) as well as unlimited amount of input data (all days prior to the

forecasted datapoint), can be seen in Table B.1 in Appendix B. The results from two of

the simulations producing the best results are shown in Figure 4.4 and Figure 4.5. Both

these simulations used 100 hidden units in the RBM layer, 2000 hidden units in the RNN


Figure 4.4: One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

layer and a batch size of 2 days of data per batch. The real data is shown in blue and the

forecasted data is shown in red. Two of the days are zoomed in to give a better view of

the individual datapoints.

As can be seen from the figures, and further in Table B.1, the RNN-RBM model performs
better than the baseline model regarding the AUC value, both for a limited and for an
unlimited amount of input data. Regarding the RSS value, however, the RNN-RBM model
performs better than the baseline model only when using an unlimited amount of input
data.

4.2.3 Multiple cells and single counter

When making forecasts using multiple cells but only a single counter, spatial dependence
should arise and help produce better forecasts. The input data is now two-dimensional,
in contrast to the one-dimensional data that is used in the baseline forecast.
The spatial dependence is not equally strong between all pairs of cells, but depends on


Figure 4.5: One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

the geographical location of each cell. Cells that are geographically close, or that are
located in similar types of areas, are expected to have stronger dependencies than other
cells. This means that the more cells that are included in the set, the more likely it is
that strong dependencies occur that would affect the forecasts. Ten cells is a rather
small number to ensure that these types of dependencies occur, but it might
give an indication of the influence.




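All results from the simulations, with both a limited amount of input data (7 days prior to
the forecasted datapoint) and an unlimited amount of input data (all days prior to the
forecasted datapoint), can be seen in Table B.2 in Appendix B. The results from two of
the simulations producing the best results are shown in Figure 4.6 and Figure 4.7. Both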

these simulations used 100 hidden units in the RBM layer, 5000 hidden units in the RNN

layer and a batch size of 2 days of data per batch. The real data is shown in blue and the

forecasted data is shown in red. Two of the days are zoomed in to give a better view of

the individual datapoints.


Figure 4.6: One-step forecast made using the RNN-RBM model (red) with multiple cells, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

Figure 4.7: One-step forecast made using the RNN-RBM model (red) with multiple cells, a single counter and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.


As can be seen from the figures, and further in Table B.2, the RNN-RBM model using a

limited amount of input data performs only slightly better than the baseline model both

regarding the RSS value and the AUC value. When using an unlimited amount of input

data, however, the RNN-RBM model performs better than the baseline model regarding

both the RSS value and the AUC value.

4.2.4 Single cell and multiple counters

When, instead of using multiple cells and a single counter, multiple counters in a single

cell are used as input, different types of dependencies should occur. The data is still

two-dimensional, but the second dimension is not spatial, but instead the set of counters.

It is reasonable to assume that not all dependencies between different pairs of counters

are equally strong. The number of calls might have a strong dependence on the number

of users connected to the cell, but might not have an equally strong dependence on the

number of SMS sent. Nor do the dependencies have to be equal in both directions, as they
might be for the spatial dependence. Even though the number of calls might depend
on the number of users, the opposite might not be true. The five counters included
in Dataset 2 might not be enough to ensure that such dependencies occur, but they might
give an indication of the influence.




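All results from the simulations, with both a limited amount of input data (7 days prior to
the forecasted datapoint) and an unlimited amount of input data (all days prior to the
forecasted datapoint), can be seen in Table B.3 in Appendix B. The results from two of
the simulations producing the best results are shown in Figure 4.8 and Figure 4.9. For the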

forecast using a limited amount of input data, the number of hidden units in the RBM

layer was set to 100, the number of hidden units in the RNN layer was set to 1000 and

the number of days per batch during training was set to 2 days. For the forecast using

an unlimited amount of input data, the number of hidden units in the RBM layer was set

to 100, the number of hidden units in the RNN layer was set to 5000 and the batch size


Figure 4.8: One-step forecast made using the RNN-RBM model (red) with a single cell, multiple counters and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

was set to 2 days. The real data is shown in blue and the forecasted data is shown in red.

Two of the days are zoomed in to give a better view of the individual datapoints.

As can be seen from the figures, and further in Table B.3 in Appendix B, the RNN-RBM
model using a limited amount of input data performs only slightly better than the baseline
model regarding the RSS value, but clearly better regarding the AUC value. When
using an unlimited amount of data as input to the RNN-RBM model, the performance
improves and is notably better than that of the baseline model regarding both the RSS
value and the AUC value.

4.2.5 Multiple cells and multiple counters

In the last test, all available input data, from all cells and all counters, is used. This

makes the problem 3-dimensional, with time, cells and counters each corresponding to

one dimension. However, as described in Section 3.2, the cells and counters are folded

into one joint dimension, reducing the dataset to 2 dimensions. This test will not only


Figure 4.9: One-step forecast made using the RNN-RBM model (red) with a single cell, multiple counters and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

show the combined effect of the spatial dependence on the same type of
counter and the dependence on other counters in the same cell, but also dependencies on
other types of counters in other cells. As described in earlier sections, however,
the small number of cells and counters might prove insufficient to actually improve the
performance of the model, but it might give an indication of the influence.
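The folding of cells and counters into one joint dimension can be sketched as below. The exact ordering used in Section 3.2 is not repeated here, so the layout is an assumption: the data is indexed as `data[t][cell][counter]`, and the counters of each cell are placed next to each other in the joint dimension. The function names `fold` and `unfold` are hypothetical.

```python
def fold(data):
    """Flatten data[t][cell][counter] into rows[t][cell * n_counters + counter]."""
    return [[value for cell in step for value in cell] for step in data]


def unfold(rows, n_cells, n_counters):
    """Inverse of fold: restore the separate cell and counter dimensions."""
    return [
        [row[c * n_counters:(c + 1) * n_counters] for c in range(n_cells)]
        for row in rows
    ]


# Toy example: 2 time steps, 3 cells, 2 counters per cell.
data = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
rows = fold(data)
print(rows)  # one row of 3 * 2 = 6 values per time step
```

With this layout the model sees one vector per time step, and a forecast for a specific cell and counter is read back from position `cell * n_counters + counter`.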

All results from the simulations, with both limited amount of input data (7 days prior

to the forecasted datapoint) as well as unlimited amount of input data (all days prior to

the forecasted datapoint), can be seen in Table B.4 in Appendix B. The results from two

of the simulations producing the best results are shown in Figure 4.10 and Figure 4.11.

Both these simulations used 100 hidden units in the RBM layer, 5000 hidden units in the

RNN layer and a batch size of 2 days of data per batch. The real data is shown in blue

and the forecasted data is shown in red. Two of the days are zoomed in to give a better

view of the individual datapoints.


Figure 4.10: One-step forecast made using the RNN-RBM model (red) with multiple cells, multiple counters and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

Figure 4.11: One-step forecast made using the RNN-RBM model (red) with multiple cells, multiple counters and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.


As can be seen from the figures, and further in Table B.4 in Appendix B, the RNN-RBM

model performs notably better than the baseline model regarding both the RSS value

and the AUC value both for a limited and an unlimited amount of input data.

Chapter 5

Analysis

The combined results from the simulations show that the RNN-RBM model can be better
at producing one-step forecasts than the Holt-Winter model. The model can be modified
to perform better than the baseline model in all test cases, regardless of the choice of
single/multiple cells or single/multiple counters, and almost regardless of the chosen amount
of input data. The results are, however, highly dependent on several parameters. Given
a good choice of these parameters the RNN-RBM model can outperform the baseline
model, while a bad choice of parameters will have the opposite effect. Three
parameters have been identified as important to the performance and have been evaluated:
the number of hidden units in the RBM layer, the number of hidden units in the RNN layer,
and the batch size during training. The results also vary between the cells and counters,
and while this is not presented in this report, a hint of the potential can be seen in
Figure 5.1, where a forecast of the number of packets transferred through a cell is shown.
As can be seen, the AUC value reaches as high as 0.918. The RSS value is not presented
as it cannot be compared to any of the other RSS values, as explained in Section 3.4.


Figure 5.1: Figure of the forecast giving the overall best AUC value. The performance counter is the number of packets transferred and is shown in blue with the forecast in red.

5.1 Input data

For all test sets, the simulations have been made both with a limited amount of input

data, where the 7 days of data prior to the data that is to be forecasted have been used,

and an “unlimited” amount of input data, meaning that all available data in the dataset

prior to the data that is to be forecasted has been used as input to the model. The results

show that the amount of input data has a great impact on the performance of the model.

The more input data used, the better the performance of the model. The improvement

was most evident in the case where only a single cell and a single counter were used, and
when using a low number of hidden units in the RBM layer and a high number of hidden
units in the RNN layer, but the performance improved in most of the simulations. The
results also show that there is a dependence between the forecast performance and the

number of cells and number of counters used. This dependence is not as unambiguous,

however, which may be explained by the low number of cells and counters used in the

tests.
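The difference between the limited and the "unlimited" input can be made concrete with a small sketch. The number of samples per day depends on the sampling interval of the counters and is left as a parameter here; the function name `history_window` is hypothetical and the code is an illustration rather than the implementation used in the simulations.

```python
def history_window(series, t, samples_per_day, days=None):
    """Return the datapoints before index `t` that are used as input.

    days=7 gives the limited case (the 7 days prior to the forecasted datapoint);
    days=None gives the "unlimited" case (all data prior to the forecasted datapoint).
    """
    if days is None:
        start = 0
    else:
        start = max(0, t - days * samples_per_day)
    return series[start:t]


# Toy series with 4 samples per day over 10 days (values are just indices).
series = list(range(40))
limited = history_window(series, t=36, samples_per_day=4, days=7)
unlimited = history_window(series, t=36, samples_per_day=4)
print(len(limited), len(unlimited))
```

In the "unlimited" case the window keeps growing with `t`, which is why the dataset itself sets the upper limit in these tests.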


Figure 5.2: Two figures showing how the RSS value and the AUC value are dependent on the input data.


The results for the RSS value and the AUC value can be seen in Figure 5.2, where the

best results from each of the test sets are presented. It can be seen from both plots that an

unlimited amount of input data raises the performance, but the change gets less significant

when using multiple counters and even more so when using multiple cells. One reason for

this might be that the size of the input data grows when extending the dataset, and that

could make it harder for the model to learn the dependencies from the longer history of

data. It can also be seen that the RSS value has a clearer dependence on the number
of cells and counters than the AUC value. The AUC value actually reaches its top result
using only a single cell and a single counter as input and an unlimited amount of input
data. This might in part be explained by the fact that the training process of the model
only seeks to optimize the fit to the complete data and not the classification. A
training process that optimizes for the classification might show different results
regarding the AUC value and its dependence on the number of cells and counters, but
might at the same time worsen the performance regarding the RSS value.

In general, the results consistently show that the more input data in the time dimension
that is used, the better the model performs. The results also indicate, although not as
consistently as for the time dimension, that the larger the number of cells and counters
that are used, the better the model performs. It should also be pointed out that in an online
implementation of the model, the input data will be truly unlimited, covering the entire
history of data, and not “unlimited”, as in these tests, where the dataset itself imposes a
limit on the data.

5.2 Hidden units

The simulations show that to get good results, the number of hidden units in the RBM
layer should be set to a low number, on the order of 10^2. In all simulations, the lowest
tested value, 100, was found to give the best results, as can be seen in Figure 5.3. A


Figure 5.3: Two figures showing how the RSS value and the AUC value depend on the number of hidden units in the RBM layer.

few simulations were made with a lower number of hidden units in the RBM layer, but
not enough to draw any conclusions, and they were therefore not included in the report. The
possibility that a lower number of hidden units could further improve the performance of
the model can, however, not be excluded. The difference in both the RSS value and the AUC
value was most significant when using only a single cell and a single counter, and became
less significant when adding multiple counters and even less so when adding multiple cells.

The simulations also show that to get good results, the number of hidden units in the
RNN layer should be set to a higher number, on the order of 10^3 to 10^4. The simulations
have been made using 1000, 2000 and 5000 hidden units respectively, in an attempt to find
the best choice. A few simulations were even made using 10000 units, but not enough


Figure 5.4: Two figures showing how the RSS value and the AUC value depend on the number of hidden units in the RNN layer.

to draw any conclusions, and they were therefore not included in the report. Even though the
results indicate that increasing the number of hidden units improves the performance of
the model, the results are not unambiguous, as can be seen in Figure 5.4. In most cases
the best results were obtained using 5000 hidden units, but in some cases the best results
were obtained using 2000 or even 1000 hidden units, and the top results were often close.
Further tests could be made to narrow down the best choice.

The relation between the number of hidden units in the two layers was expected to be the

opposite, according to [BLBV12], where the number of hidden units in the RBM layer is

suggested to be “several hundred”, while the number of hidden units in the RNN layer is

suggested to be “typically smaller”. The reason for this discrepancy has not been determined in this project.


5.3 Batch size

The simulations have been made using two different batch sizes, one day and two days

respectively. The simulations show, quite consistently, that two days are better than one,

as can be seen in Figure 5.5. Some simulations were also made using larger numbers of
days, but not enough to draw any conclusions, and they were therefore not included in the
report. The general conclusion from the results is still that larger batch sizes improve
the performance. Too large a batch size will, however, prevent the model from converging
during training, because the gradient descent in the training process becomes too complex.

One theory of why two days are better than one, and why more days might be even better,
is that the model might be unable to learn dependencies between points in time that are
further apart than the batch size. A batch size of one day might therefore prevent the
model from learning the daily cycle. In a similar way, a batch size of less than one week
might prevent the model from learning the weekly cycle. A few simulations were made where
the model was modified in an attempt to transfer information between batches, but not
enough to draw any conclusions, and they were therefore not included in the report.
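The theory can be illustrated with a small sketch of how a series is split into batches of whole days. The sketch assumes consecutive, non-overlapping batches and a fixed number of samples per day; both are assumptions made for illustration, and the function names are hypothetical. The point is that two datapoints can only appear in the same batch if the distance between them is shorter than the batch length.

```python
def day_batches(series, samples_per_day, days_per_batch):
    """Split a series into consecutive, non-overlapping batches of whole days."""
    size = samples_per_day * days_per_batch
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def longest_lag(batch):
    """Largest distance, in samples, between two datapoints in the same batch."""
    return len(batch) - 1


SAMPLES_PER_DAY = 4          # toy value; the real value depends on the sampling interval
series = list(range(16))     # four days of toy data

one_day = day_batches(series, SAMPLES_PER_DAY, 1)
two_days = day_batches(series, SAMPLES_PER_DAY, 2)

# With one day per batch, no batch contains two datapoints a full day apart.
print(longest_lag(one_day[0]) >= SAMPLES_PER_DAY)
# With two days per batch, a lag of one full day fits inside a batch.
print(longest_lag(two_days[0]) >= SAMPLES_PER_DAY)
```

By the same reasoning, a weekly lag would only fit inside a batch of more than seven days under these assumptions.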

Further testing is needed to be able to determine the optimal number of days to use for

the batch size. However, if the model were modified to operate online, the batch size might
depend on the update interval anyway, and the optimal batch size might then be
irrelevant.


Figure 5.5: Two figures showing how the RSS value and the AUC value depend on the batch size.

Chapter 6

Conclusion

This chapter will give a summary of the achievements in Section 6.1 and possible ways to

use these results as a basis for further work in Section 6.2.

6.1 Summary of thesis achievements

A series of simulations has been performed to evaluate the performance of an RNN-RBM
model compared to a Holt-Winter model. Simulations have been made both on binary
transformed data, where each real-valued datapoint in the dataset has been transformed into a
binary vector representation based on a percentile separation of the original dataset, and
on real-valued data, where each datapoint has been transformed by taking the logarithm
and then normalizing over the entire dataset. The simulations made using the binary
vector representation showed no promising results, and all forecasts only produced noise.
The simulations using the real-valued data showed very promising results and demonstrate
that the RNN-RBM model can outperform the Holt-Winter model in making forecasts.

A number of parameters have been evaluated to find the setup that produces the best
forecasts. These parameters are: the amount of input data to the model, regarding
the number of cells and the number of counters as well as the amount of historical data;
the number of hidden units in the RBM layer and the RNN layer respectively; and
the size of the batches during training.

The results clearly show a dependence between the amount of input data and the
performance of the model. In general, an increased amount of historical data improved the
forecasts, while increasing the number of cells or counters in some cases did not improve
the results. One likely explanation for this is that not all cells and counters have
notable dependencies and that the number of cells and counters was too low to include such
dependencies. A larger test set might give clearer results in this respect.

The results also show a clear dependence between the number of hidden units in the
RBM layer and the performance of the model. The best forecasts were produced using
100 hidden units and the results also indicate that an even lower number could further
improve the performance. The number of hidden units in the RNN layer did not show an
equally clear dependence on the performance of the model. The best results were in most
cases produced using 5000 hidden units, but in some cases 2000, or even 1000, hidden
units produced better results. In general, however, increasing the number of hidden
units improves the performance of the model.

Finally, the results also show a clear dependence between the batch size and the
performance of the model. Changing the batch size from one to two days improved the
performance of the model in most simulations, but the improvement was most evident
when the other parameters were chosen badly. A bad prediction could, in most cases, be
improved by increasing the batch size, while a good prediction did not improve
as much, or at all, when changing the batch size. In no case was the performance
reduced more than marginally, and the conclusion is that two days are better than one. The
results also indicate that increasing the batch size further could additionally improve the
performance of the model.


6.2 Future Work

Even though the results show that the RNN-RBM model can outperform the Holt-Winter
model in producing forecasts, a few theories on how to further improve the model have
been developed during the course of the project. Due to the time limit of the project,
the theories have to be left as just theories, but they could act as a base for future work.

• Extend the dataset

Since not all cells and counters are thought to have equally strong dependencies, an
extended dataset could help reveal these dependencies. A number of cells along a
railway, or inside a mall, could be a good set of cells. Including the number of users
connected to each cell could also be a good addition to the set of counters.

• Relation between dataset size and number of hidden units

If the model is to be scaled to operate on large datasets that comprise all the
cells in an entire city, good knowledge about how the size of the dataset affects
the parameters in the model will be important. It would be a dreadful task to run
simulations on thousands of cells and hundreds of counters just to evaluate the best
choice of parameters.

• Online training

At the moment the model is designed to operate on separate training and test sets,
but if it is to be used in a real application, a constant need for retraining will not
work. Instead, the model needs to be modified so that once it is trained, it will
only need to train on new data, either as soon as it appears, or by collecting it into
batches for training only at specific times.

• Evaluation methods

In this project, the RSS value, which is closely related to the root mean square (RMS)
value, has been used along with the AUC value, which is obtained from the confusion
matrix. There might, however, exist even better evaluation methods and
performance values that could be used.

• Change the classification limit

The classification part of this project has been evaluated by setting the “high load”
limit to the top 10th percentile of the data in the dataset. This limit is very naïve
and can be changed to further evaluate the performance of the model as a classification
model. It could be interesting to see how the performance depends on which percentile is
chosen. It could also be interesting to evaluate other classifications, such as two
consecutive high loads, limits that vary with the time of day and week, sudden changes,
or cases where the difference between the forecasted value and the true value is too high.

• Improve the cost function

At the moment the model is designed to create a good fit for the real data. This

is a reasonable choice, but there might be even better choices. One example would

be to use some kind of classification as the basis for the cost function, or some kind

of smoothed representation of the real data. It might even be that all of these, or

other, can be used simultaneously to further improve the forecasts, or for different

types of forecasts.

• RBM design

The RBM implementation used in the model might not be the best choice. There

are a few different approaches suggested in different papers and articles, that could

be evaluated. It could also be investigated whether stacked RBMs would improve the

performance of the model.

• RNN design

The RNN implementation used in the model is fairly straightforward. It might
be interesting to see if some other design or neural network could be
implemented and possibly improve the performance. This might, however, be a
difficult task.


• Multi-step forecasting

Some tests have been made in producing multi-step forecasts, by changing the number
of steps in the 'symbolic loop for sequence generation' in the code. There was,
however, not enough time to evaluate the results from these tests. It might also
be desirable to be able to dynamically set the time span to be forecasted. The
multi-step forecasting could also be incorporated in the cost function.

• Alternative models

There are a number of alternative Deep Learning models that could be interesting
to evaluate further. Among these are the LSTM model and WaveNet, which have
both shown promising results in Machine Learning tasks.

Bibliography

[BD06] Peter J Brockwell and Richard A Davis. Introduction to time series and forecasting. Springer Science & Business Media, 2006.

[BL] Nicolas Boulanger-Lewandowski. RNN-RBM deep learning tutorial. http://deeplearning.net/tutorial/code/rnnrbm.py. Accessed: 2016-05-15.

[BLBV12] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

[Bri] Denny Britz. Recurrent neural networks tutorial part 1 introduction to RNNs. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/.

[Bru00] Jake D Brutlag. Aberrant behavior detection in time series for network monitoring. In LISA, volume 14, pages 139–146, 2000.

[Deea] Tutorial for RBM by deeplearning.net. http://deeplearning.net/tutorial/rbm.html.

[Deeb] Tutorial for RNN-RBM by deeplearning.net. http://deeplearning.net/tutorial/rnnrbm.html.

[dS] Cesar de Souza. Discriminatory power analysis by ROC curves.


[DZ07] Zhidong Deng and Yi Zhang. Collective behavior of a small-world recurrent neural system with scale-free distribution. IEEE Transactions on Neural Networks, 18(5):1364–1375, 2007.

[GPP+09] Jia Guo, Yu Peng, Xiyuan Peng, Qiang Chen, Jiang Yu, and Yufeng Dai. Traffic forecasting for mobile networks with multiplicative seasonal ARIMA models. In Electronic Measurement & Instruments, 2009. ICEMI’09. 9th International Conference on, pages 3–377. IEEE, 2009.

[GSM+15] Sebastian Grauwin, Stanislav Sobolevsky, Simon Moritz, Istvan Godor, and Carlo Ratti. Towards a comparative science of cities: Using mobile traffic records in New York, London, and Hong Kong. In Computational approaches for urban environments, pages 363–387. Springer, 2015.

[HA] Rob J Hyndman and George Athanasopoulos. Holt-Winter seasonal method.

[Hal] David A. Hall. What to expect with 5G.

[Hin02] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.

[Hin10] Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.

[HS03] Michael Husken and Peter Stagge. Recurrent neural networks for time series classification. Neurocomputing, 50:223–235, 2003.

[JLP15] Jaeseong Jeong, Mathieu Leconte, and Alexandre Proutiere. Cluster-aided mobility predictions. arXiv preprint arXiv:1507.03292, 2015.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.


[MYWW15] Xiaolei Ma, Haiyang Yu, Yunpeng Wang, and Yinhai Wang. Large-scale transportation network congestion evolution prediction using deep learning theory. PloS one, 10(3):e0119044, 2015.

[Pet] Tim Peters. The Zen of Python, 2004. http://www.python.org/dev/peps/pep-0020. Accessed: 2016-05-10.

[Que] Andre Queiroz. Holt-winters algorthm to forecasting. https://gist.github.com/andrequeiroz/5888967. Accessed: 2016-05-12.

[RCSR07] Jonathan Reades, Francesco Calabrese, Andres Sevtsuk, and Carlo Ratti. Cellular census: Explorations in urban data collection. IEEE Pervasive Computing, 6(3):30–38, 2007.

[SPSM12] Saulius Samulevicius, Torben Bach Pedersen, Troels Bundgaard Sorensen, and Gilbert Micallef. Energy savings in mobile broadband network based on load predictions: Opportunities and potentials. In Vehicular Technology Conference (VTC Spring), 2012 IEEE 75th, pages 1–5. IEEE, 2012.

[TARA+16] The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frederic Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.

[TN07] Denis Tikunov and Toshikazu Nishimura. Traffic prediction for mobile network using Holt-Winter’s exponential smoothing. In Software, Telecommunications and Computer Networks, 2007. SoftCOM 2007. 15th International Conference on, pages 1–5. IEEE, 2007.

[WGLP10] Shaojun Wang, Jia Guo, Qi Liu, and Xiyuan Peng. On-line traffic forecasting of mobile communication system. In Pervasive Computing Signal Processing and Applications (PCSPA), 2010 First International Conference on, pages 97–100. IEEE, 2010.

[YMJ11] Peng Yu, Lei Miao, and Guo Jia. Clustered complex echo state networks for traffic forecasting with prior knowledge. In Instrumentation and Measurement Technology Conference (I2MTC), 2011 IEEE, pages 1–5. IEEE, 2011.

Appendix A

Software

A.1 Python

Python is a programming language that is widely used among programmers and in the

industry. It combines great versatility with simplicity and can be used for writing clear

programs on both small and large scale. It is a high-level, general-purpose, interpreted,

dynamic programming language.

The creator of the language is a fan of the humour group Monty Python, which not

only gave inspiration to the name, but also gave a kind of spirit to the language as well.

An important goal of the language is that it should be fun to use. A lot of example
code also makes use of common Monty Python expressions, such as spam and eggs as
placeholders instead of the traditional foo and bar. The core philosophy is summarized
in the document The Zen of Python [Pet].

Even though Python can be used in and by itself, a lot of the functionality comes from
importing libraries. The standard library of Python is commonly cited as one of Python’s
greatest strengths and there are more than 86 000 other libraries to be found in the official
repository of third-party software for Python, the Python Package Index¹.

Python is frequently used in scientific computing due to its extensive mathematics libraries,
which have proven useful for tasks involving image processing, text processing, numerical
data processing and natural language processing, among others. It is also commonly used
for tasks involving artificial intelligence and machine learning, and a number of different
libraries have been developed specifically for these kinds of tasks.

A.2 Theano

Theano is a library for Python, developed by a machine learning group at the Université
de Montréal [TARA+16]. It is open source and freely available for download through

the Python Package Index. The development of Theano relies on a wide community of

developers and users worldwide.

The Theano library allows the user to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It does so by defining the expressions symbolically and compiling them in a highly optimized fashion. The expressions are stored in graphs of variables and operations, which are optimized at compilation time. Theano can handle symbolic differentiation of complex expressions automatically, ignore variables that are not required to compute the final output, reuse partial results, and apply numerical stability optimizations to minimize errors due to hardware approximations. Theano can run on both CPUs and GPUs (the latter by using CUDA), so it can take advantage of the GPU’s ability to parallelize large computations and hence cut the computation time significantly, provided that the computation can be parallelized. This makes Theano well suited to computations on large matrices and to deep network models, since many of the operations can be parallelized.
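As a rough illustration of what it means to store an expression as a graph of variables and operations and to differentiate it symbolically, the following pure-Python sketch builds a very small expression graph. The classes Node, Const, Var, Add and Mul and the helper wrap are hypothetical names introduced only for this illustration. They are not part of Theano's API, and Theano additionally optimizes and compiles the graph, which this sketch does not attempt.

```python
# Toy expression graph. Illustration only: this is NOT Theano's API.


class Node:
    """Base class for nodes in a symbolic expression graph."""

    def __add__(self, other):
        return Add(self, wrap(other))

    def __mul__(self, other):
        return Mul(self, wrap(other))

    # Addition and multiplication are commutative, so the reflected
    # operators (e.g. 3 * x) can reuse the same implementations.
    __radd__ = __add__
    __rmul__ = __mul__


class Const(Node):
    """A fixed numeric value."""

    def __init__(self, value):
        self.value = value

    def eval(self, env):
        return self.value

    def grad(self, wrt):
        return Const(0.0)


class Var(Node):
    """A named symbolic variable whose value is supplied at evaluation."""

    def __init__(self, name):
        self.name = name

    def eval(self, env):
        return env[self.name]

    def grad(self, wrt):
        return Const(1.0 if self is wrt else 0.0)


class Add(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

    def grad(self, wrt):
        # Sum rule: (a + b)' = a' + b'
        return Add(self.a.grad(wrt), self.b.grad(wrt))


class Mul(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

    def grad(self, wrt):
        # Product rule: (a * b)' = a' * b + a * b'
        return Add(Mul(self.a.grad(wrt), self.b), Mul(self.a, self.b.grad(wrt)))


def wrap(value):
    """Turn plain numbers into Const nodes so they can join the graph."""
    return value if isinstance(value, Node) else Const(float(value))


x = Var("x")
y = Var("y")                  # defined, but not part of the expression below
expr = x * x + 3 * x          # builds a graph, nothing is computed yet
dexpr = expr.grad(x)          # a new graph representing the derivative

value = expr.eval({"x": 2.0})   # only x is needed, y is never looked up
slope = dexpr.eval({"x": 2.0})
```

Here the expression is only evaluated when eval is called, and the derivative is itself a graph that can be evaluated for any value of x. Since y does not appear in expr, no value for it has to be supplied.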

¹ https://pypi.python.org/pypi


A.3 PyCharm

The integrated development environment (IDE) used for the programming is PyCharm, developed by JetBrains. It can be downloaded and installed for free from JetBrains’ website². PyCharm is available for Windows, Mac OS and Linux, which makes switching between platforms easier. JetBrains has also developed IDEs for various other programming languages, with a similar interface for all of them, which makes switching between languages easier.

A.4 Git

Git is a Version Control System (VCS), created in 2005 by Linus Torvalds, also known as the creator of the operating system Linux, and is widely used by developers all around the world.

As with every VCS, the core functionality is to create a history of all changes that have been made to the files. This ensures that mistakes can be reverted and lost files restored to earlier versions. The VCS works best when changes are not made directly to the master files. Instead, copies of the files are created, or forked, into branches, leaving the master branch untouched. All changes are then made and committed to the branch, and only when the code has reached a working stage is it merged with the master branch. New branches can in turn be forked from each new branch. In this way it is a lot easier to revert between working versions of the code. Multiple branches can be forked at the same time, making parallel work a lot easier.
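The branch workflow described above can be sketched as a short Python script that drives the git command-line client in a temporary repository. The helper names git, read, write and demo_branch_workflow, the file name model.py and the branch name tune-units are assumptions made for this illustration. The script also assumes that a git executable is available on the PATH, and it returns None if that is not the case.

```python
import os
import shutil
import subprocess
import tempfile


def git(repo, *args):
    """Run one git command inside `repo` and return its standard output."""
    # A throwaway identity is passed so that commits work without global config.
    cmd = ["git", "-c", "user.name=Example", "-c", "user.email=example@example.com"]
    result = subprocess.run(cmd + list(args), cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def read(path):
    with open(path) as handle:
        return handle.read()


def write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


def demo_branch_workflow():
    """Develop on a branch and merge it into master once the change works."""
    if shutil.which("git") is None:
        return None  # no git client available, nothing to demonstrate
    repo = tempfile.mkdtemp()
    path = os.path.join(repo, "model.py")
    try:
        git(repo, "init")
        write(path, "units = 100\n")
        git(repo, "add", "model.py")
        git(repo, "commit", "-m", "Initial version")
        git(repo, "branch", "-M", "master")        # make sure the main branch is named master

        git(repo, "checkout", "-b", "tune-units")  # create and switch to a new branch
        write(path, "units = 300\n")
        git(repo, "commit", "-am", "Try 300 units")

        git(repo, "checkout", "master")            # master still holds the old version
        before_merge = read(path)
        git(repo, "merge", "tune-units")           # bring the finished change into master
        after_merge = read(path)
        return before_merge, after_merge
    finally:
        shutil.rmtree(repo, ignore_errors=True)


result = demo_branch_workflow()
```

The master branch keeps the original file content until the merge, which is what makes it safe to experiment on the separate branch.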

Since Git is also a Distributed Version Control System (DVCS), it is well suited for collaborative work. Each developer can fork the code to local branches and work in parallel. Branches can be merged, and commits can be shared and evaluated across the team before merging them into the master branch. This further ensures that updates are finished and bug-free before being implemented in the master branch.

² https://www.jetbrains.com/pycharm/

Git is developed for use in terminal mode, but a number of user interfaces and desktop clients have been developed to simplify the use of Git. There also exist a number of hosting services that allow developers to save their project on a server instead of only locally, to further minimize the risk of losing data. Git is integrated in PyCharm so that commits, forks, merges and comparisons can be made directly in PyCharm instead of through the terminal or a desktop client.

Appendix B

RSS and AUC tables

Table B.1: Table showing all results from the simulations using a single cell and a single counter as input data.

                                   Limited input   Unlimited input
RBM units  RNN units  Days/batch     RSS     AUC       RSS     AUC
100        1000       1             6.59   0.750      5.37   0.824
100        1000       2             2.97   0.794      2.07   0.879
100        2000       1             6.76   0.750      6.08   0.819
100        2000       2             2.44   0.801      1.60   0.874
100        5000       1             6.99   0.750      5.37   0.805
100        5000       2             2.49   0.786      1.65   0.866
300        1000       1            20.56   0.720     32.10   0.784
300        1000       2            10.89   0.725     13.52   0.809
300        2000       1            24.51   0.727     17.24   0.786
300        2000       2            10.24   0.728     82.94   0.809
300        5000       1            14.37   0.730     12.23   0.795
300        5000       2             4.95   0.756      3.66   0.832
1000       1000       1            32.01   0.726     34.16   0.751
1000       1000       2            80.65   0.706     96.84   0.765
1000       2000       1            23.67   0.733     43.09   0.773
1000       2000       2            45.51   0.701     70.15   0.753
1000       5000       1            31.21   0.717     36.03   0.752
1000       5000       2            26.15   0.729     67.21   0.779


Table B.2: Table showing all results from the simulations using multiple cells and a single counter as input data.

                                   Limited input   Unlimited input
RBM units  RNN units  Days/batch     RSS     AUC       RSS     AUC
100        1000       1             2.01   0.778      1.81   0.791
100        1000       2             1.78   0.778      1.73   0.781
100        2000       1             2.01   0.773      1.86   0.796
100        2000       2             1.77   0.775      1.66   0.800
100        5000       1             2.08   0.766      1.80   0.797
100        5000       2             1.74   0.777      1.60   0.803
300        1000       1             2.65   0.747      2.56   0.759
300        1000       2             2.24   0.748      2.11   0.772
300        2000       1             2.52   0.730      2.32   0.764
300        2000       2             2.31   0.747      2.16   0.759
300        5000       1             2.68   0.746      2.53   0.752
300        5000       2             2.18   0.732      1.95   0.772
1000       1000       1             4.55   0.729      4.01   0.735
1000       1000       2             3.97   0.720      3.28   0.740
1000       2000       1             4.81   0.715      4.53   0.742
1000       2000       2             3.59   0.717      3.62   0.759
1000       5000       1             4.55   0.732      4.16   0.718
1000       5000       2             3.87   0.721      3.52   0.717

Table B.3: Table showing all results from the simulations using a single cell and multiple counters as input data.

                                   Limited input   Unlimited input
RBM units  RNN units  Days/batch     RSS     AUC       RSS     AUC
100        1000       1             2.90   0.771      2.89   0.808
100        1000       2             1.74   0.798      1.72   0.822
100        2000       1             2.87   0.775      2.12   0.809
100        2000       2             1.97   0.780      1.52   0.829
100        5000       1             2.41   0.778      2.11   0.818
100        5000       2             1.72   0.792      1.39   0.835
300        1000       1             5.62   0.739      4.56   0.785
300        1000       2             4.81   0.776      4.38   0.786
300        2000       1             5.35   0.742      5.80   0.797
300        2000       2             3.94   0.755      3.85   0.795
300        5000       1             4.97   0.757      4.11   0.791
300        5000       2             3.48   0.768      3.39   0.787
1000       1000       1            12.41   0.721      9.56   0.767
1000       1000       2            13.06   0.735      9.70   0.747
1000       2000       1            10.61   0.716     14.45   0.756
1000       2000       2            10.89   0.724      8.03   0.756
1000       5000       1            11.47   0.742     10.38   0.757
1000       5000       2             8.36   0.742      8.57   0.758


Table B.4: Table showing all results from the simulations using multiple cells and multiple counters as input data.

                                   Limited input   Unlimited input
RBM units  RNN units  Days/batch     RSS     AUC       RSS     AUC
100        1000       1             1.43   0.803      1.42   0.803
100        1000       2             1.49   0.804      1.39   0.823
100        2000       1             1.44   0.800      1.38   0.817
100        2000       2             1.45   0.813      1.36   0.827
100        5000       1             1.43   0.798      1.40   0.816
100        5000       2             1.39   0.820      1.37   0.829
300        1000       1             1.74   0.776      1.58   0.799
300        1000       2             1.71   0.786      1.64   0.790
300        2000       1             1.73   0.777      1.68   0.796
300        2000       2             1.70   0.791      1.60   0.806
300        5000       1             1.80   0.783      1.80   0.774
300        5000       2             1.71   0.795      1.59   0.802
1000       1000       1             2.97   0.750      2.75   0.759
1000       1000       2             2.38   0.778      2.39   0.759
1000       2000       1             3.03   0.761      3.34   0.752
1000       2000       2             2.38   0.752      2.37   0.774
1000       5000       1             3.90   0.749      3.68   0.746
1000       5000       2             2.36   0.767      2.21   0.777
