
Time Series Prediction Algorithms

Literature Review

Kumara M.P.T.R., Fernando W.M.S., Perera J.M.C.U., Philips C.H.C

Department of Computer Science and Engineering,

University of Moratuwa,

Sri Lanka


Contents

List of Figures and Tables

1. Introduction

2. Algorithms for time series analysis

2.1 Autoregressive Integrated Moving Average (ARIMA) Model

2.1.1 Autoregressive Model in Time Series Prediction

2.1.2 Moving Average Model in Time Series Prediction

2.2 Artificial Neural Networks (ANN)

2.2.1 Linear Time Series Forecasting

2.2.2 Nonlinear Time Series Forecasting

2.2.3 Using Unsupervised Learning Networks for Forecasting

2.3 Support Vector Machines (SVM)

2.3.1 Decision boundary

2.3.2 Perceptron Algorithm

2.3.3 Properties of SVM

2.3.4 Using Linear Kernel in Time Series Prediction

2.3.5 Using RBF Kernel in Time Series Prediction

2.3.6 Using Least Square (LS) SVM method in Time Series Prediction

2.4 Genetic Algorithms for Time Series Prediction

3. Evaluation of machine learning models

4. Conclusion

References


List of Figures and Tables

Figure 1. Classification of Time Series Prediction methods
Figure 2. Linear Neuron for time series prediction
Figure 3. Nonlinear Autoregressive with Exogenous Inputs (NARX) Model
Figure 4(a). Nonlinear autoregressive moving average with exogenous inputs Network
Figure 4(b). Fully Recurrent Network
Figure 5. Conceptual view and the operation of the Elman Network
Figure 6. Approximation using Global and Local Models
Figure 7. SOM with 2 Neurons in the input layer and 3 Neurons in the output layer
Figure 8. Schematic presentation of an RSOM unit
Figure 9. Decision Boundaries
Figure 10. Decision boundary and margin
Figure 11. Transforming non-linearly separable data into a rich feature space
Figure 12. The genetic algorithm illustrated for digit strings representing 8-queens states
Figure 13. The 8-queens states corresponding to the first two parents in Figure 12(c)

Table 1. Chromosomes of the first generation and their fitness values
Table 2. New generation of chromosomes after crossover and mutation operations


1. Introduction

A time series is a sequence of observed values of some entity measured at different points in time. Usually measurements are taken at regular intervals to ensure uniformity. A daily report of rainfall, the market value of a product, and the population of a country over time are examples of time series.

Time series data are important for two reasons. They are used to identify the behavior of an entity in the past, and they can be used to draw predictions about its behavior in the future. Although both applications are very common in practice, the latter has gained much popularity due to the importance of predicting the future in many scenarios. Making correct guesses returns valuable information regarding how an entity will behave in the future under different environmental conditions, and predicting future trends is used in many industrial applications. Time series prediction has a long history that goes back to the work of Yule in 1927 [26]. In the present context, time series methods are heavily exploited in economics, bio-medicine, meteorology, electricity consumption and countless engineering applications. Time series data are used by people who wish to analyze past data and make useful predictions based on that data: engineers, statisticians, economists, marketing strategists, researchers and practitioners in various other disciplines.

Time series analysis and forecasting have become a critical tool in the modern industrial environment, as well as a highly active field of study in academia. Time series analysis is the key to making decisions for controlling the processes which generate a particular time series. It provides the means of identifying critical parameters and their effect on the time series, which in turn allows us to change the future outcome by carefully controlling these parameters. There is enough evidence to show that time series analysis has become a universal, cross-domain subject and hence of high importance. As a result, time series analysis and prediction have become a hot topic in disciplines like operational research and computer science.

With the advancement of data collection, huge amounts of data have been gathered, making it impossible to process them manually. This is where time series analysis has to be automated and take advantage of modern computing mechanisms. In general, data mining is the sub-discipline concerned with analyzing and retrieving useful information from raw data, but in the context of time series data, time series analysis and forecasting has emerged as a separate field of study due to its broad scope and wide range of applications.

Time series modeling exists in two classes, namely parametric and nonparametric [19]. A comprehensive classification of various time series prediction methods is available in [19] and a summary is given in Figure 1. Many early time series prediction methods were parametric approaches where the goal was to fit a statistical model to time series data. However, with the identification of severe limitations in simple parametric approaches, people became more interested in nonparametric approaches in which no structural assumptions are made about the underlying structure of the process [26].


Figure 1. Classification of Time Series Prediction methods [19]

In time series forecasting (TSF), prediction accuracy is measured through the difference between the predicted value $\hat{x}_t$ and the actual value $x_t$, where the subscript t denotes the value at time t (error of prediction at time t):

$e_t = x_t - \hat{x}_t$

The following statistical terms are calculated over the entire timeline, where $l$ is the number of forecasts and $\bar{x}$ is the mean of the time series:

Sum Squared Error: $SSE = \sum_{t=1}^{l} e_t^2$

Root Mean Squared Error: $RMSE = \sqrt{SSE / l}$

Normalized Mean Square Error: $NMSE = \dfrac{\sum_{t=1}^{l} e_t^2}{\sum_{t=1}^{l} (x_t - \bar{x})^2}$
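To make these measures concrete, the following minimal Python sketch (an illustrative helper of our own, not taken from the cited literature) computes the three statistics for a set of forecasts.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Compute SSE, RMSE and NMSE for a set of l forecasts."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    e = actual - predicted                               # prediction errors e_t
    sse = np.sum(e ** 2)                                 # Sum Squared Error
    rmse = np.sqrt(sse / len(e))                         # Root Mean Squared Error
    nmse = sse / np.sum((actual - actual.mean()) ** 2)   # Normalized MSE
    return sse, rmse, nmse

# toy usage
sse, rmse, nmse = forecast_errors([1.0, 2.0, 3.5], [1.1, 1.8, 3.9])
```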

Artificial intelligence has become a strong candidate for time series prediction in situations where manual data analysis is not applicable and relationships among different entities are not obvious. Machine learning is the branch of artificial intelligence in which different algorithms are developed to allow computers to learn; it is often described as the domain of computational intelligence that develops algorithms which automatically improve themselves with experience. Various machine learning algorithms have been developed to perform time series prediction. In this review, the objective is to conduct a survey of recent literature on different machine learning algorithms that are used for time series prediction.


2. Algorithms for time series analysis

2.1 Autoregressive Integrated Moving Average (ARIMA) Model

Time series analysis and forecasting have their roots in classical statistics. In the current context these methods are rarely used alone, as computational intelligence has become the trend; however, almost all the intelligent algorithms take advantage of these statistical methodologies, so establishing a general idea about them is essential for understanding modern ways of time series prediction.

ARIMA is a general statistical model which is widely used in the field of time series analysis. The general ARIMA model is denoted as ARIMA(p,d,q), where p, d and q are non-negative integers: p refers to the autoregressive part, d to the integrated (differencing) part and q to the moving average part. Based on these parameter values, several child models can be derived from the ARIMA model. For example, if d = 0, the ARIMA(p,0,q) model is also referred to as ARMA(p,q); if both d and q are zero, ARIMA(p,0,0) is referred to as the AR model.

Given a time series of data $X_t$, where t is an integer index and the $X_t$ are real numbers, the ARMA(p', q) model is given by

$\left(1 - \sum_{i=1}^{p'} \varphi_i L^i\right) X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$

where L is the lag operator, the $\varphi_i$ are the parameters of the autoregressive part of the model, the $\theta_i$ are the parameters of the moving average part, and the $\varepsilon_t$ are the error terms. The most general ARIMA(p,d,q) model can be defined as

$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$

Basically, the complete ARIMA model is a combination of the more specific AR (autoregressive), integrated and moving average models.

2.1.1 Autoregressive Model in Time Series Prediction

The autoregressive model is a special case of the ARIMA model; we can think of the AR model as ARIMA(p,0,0). If the AR model has a degree of p, then it can be written as

$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t$

where the $\varphi_i$ are the model coefficients and $\varepsilon_t$ is a white-noise error term.


To find the coefficients of the AR model we can use the least squares method or the Yule-Walker method. We have implemented the AR model using the above-mentioned algorithms.
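The AR coefficients can, for instance, be estimated with ordinary least squares on a lagged design matrix. The sketch below (function names are our own illustrative choices, not part of any cited implementation) shows one way this might be done with NumPy.

```python
import numpy as np

def fit_ar_least_squares(x, p):
    """Fit AR(p) coefficients by ordinary least squares.

    Builds the lagged design matrix [X_{t-1} ... X_{t-p}, 1] and solves
    for the coefficients that best predict X_t. A sketch only; production
    code would normally rely on a library routine.
    """
    x = np.asarray(x, dtype=float)
    rows = [np.r_[x[t - p:t][::-1], 1.0] for t in range(p, len(x))]
    X = np.array(rows)                 # design matrix of lagged values
    y = x[p:]                          # targets X_t
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs[:-1], coeffs[-1]     # (phi_1..phi_p, intercept c)

def predict_next(x, phi, c):
    """One-step-ahead forecast from the last p observations."""
    p = len(phi)
    return c + float(np.dot(phi, np.asarray(x[-p:][::-1])))
```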

2.1.2 Moving Average Model in Time Series Prediction

The moving average model is also a special case of the general ARIMA model; we can think of a moving average model of degree q as the ARIMA(0,0,q) model. In the autoregressive model we basically build a linear forecasting model from past values and do not explicitly account for noise present in the given time series. The moving average model, in contrast, takes this noise into account when forecasting the data.

A moving average model of degree q is given by

$X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$

where $\mu$ is the expectation of $X_t$, the $\theta_i$ are the model parameters and the $\varepsilon_t$ are white-noise error terms. We can consider the moving average model as a finite impulse response filter applied to the white-noise terms.
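In practice the full ARIMA(p,d,q) family is usually fitted with standard library routines rather than by hand. A minimal sketch using statsmodels (assuming the library is installed; the synthetic series and the chosen order are illustrative) might look as follows.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# synthetic example series: a noisy trend
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.1, 1.0, size=200))

# fit an ARIMA(2,1,1) model: p=2 AR lags, d=1 difference, q=1 MA term
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()

# forecast the next 5 points of the series
print(fitted.forecast(steps=5))
```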

2.2 Artificial Neural Networks (ANN)

Neural networks are one of the most flexible and adaptive learning systems and are extensively used in a variety of real-world problems. As defined in Haykin [7],

“A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network from its environment through a learning process
2. Inter-neuron connection strengths, known as synaptic weights, are used to store the acquired knowledge”

The real power of neural networks comes from their ability to imitate some of the brain’s creative processes [11]. Artificial neural networks are one of the dominant applications in intelligent time series prediction [4]. The attempts to use neural networks for time series forecasting go back to the 1970s, but at that time, with the limited knowledge of neural networks, the results from those experiments were not very promising. With the formulation of the back-propagation algorithm in 1986, research interest in neural networks grew rapidly and they were successfully adapted to the problem of time series forecasting. Some of those results have shown that neural networks are capable of outperforming statistical forecasting methods such as regression analysis and Box-Jenkins forecasting [17].


2.2.1 Linear Time Series Forecasting

In the preceding sections we discussed the linear statistical models that have been widely used to forecast linear time series, including the general form of the Auto Regressive Integrated Moving Average (ARIMA) model, which is given by the following equation:

$X_t = \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$

where the $\varphi_i$ and $\theta_j$ are the parameters of the model calculated using LSE, the $X_{t-i}$ give the past observations and $\varepsilon_t$ denotes the error.

In neural networks there is an equivalent linear model called the linear neuron that can be used for linear time series forecasting. The linear neuron output can take continuous values, making it capable of approximating linear functions. For the learning process of this model we need to provide target outputs for the corresponding inputs, which means this is a supervised learning model. The learning algorithm makes use of the delta rule, which has a similar meaning to the least square error (LSE) criterion in statistics. The following diagram illustrates the linear neuron model for p lags of the variable z [11].

Figure 2. Linear Neuron for time series prediction [11]

Here the weights are estimated using the delta rule. Because of the similarity between LSE and the delta rule, this model is likely to exhibit the same results as ARIMA based models. The main disadvantage associated with the ARIMA family of models is that they are linear models; because of that, they are not applicable to situations where there is a nonlinear relationship in the data. For these scenarios, neural network based models can be easily extended to achieve very good results [11].
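As an illustration of the delta rule mentioned above, the following sketch trains a linear neuron on p lags of a series (learning rate, epoch count and function names are illustrative assumptions, not taken from the cited sources).

```python
import numpy as np

def train_linear_neuron(series, p, lr=0.01, epochs=100):
    """Train a linear neuron on p lags of a series using the delta rule.

    Weights are updated after each pattern: w <- w + lr * error * input.
    Illustrative sketch; equivalent in spirit to least-squares AR fitting.
    """
    x = np.asarray(series, dtype=float)
    w = np.zeros(p + 1)                         # p lag weights + bias
    for _ in range(epochs):
        for t in range(p, len(x)):
            inp = np.r_[x[t - p:t][::-1], 1.0]  # [X_{t-1}, ..., X_{t-p}, 1]
            y_hat = np.dot(w, inp)              # linear neuron output
            error = x[t] - y_hat                # delta rule error term
            w += lr * error * inp               # weight update
    return w
```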

2.2.2 Nonlinear Time Series Forecasting

The ability of neural networks to capture nonlinear relationships in data makes them an ideal candidate for nonlinear time series prediction. Neural networks that are built to capture temporal information are known as dynamically driven recurrent networks. Most of these networks are based on supervised learning, and the multilayer perceptron (MLP) serves as the basis. The general architecture of a recurrent network is given in the following diagram.

Figure 3: Nonlinear Autoregressive with Exogenous Inputs (NARX) Model [7]

The network given in the above diagram is formally known as the Nonlinear Autoregressive with Exogenous Inputs (NARX) model, and its dynamic behavior is given by the following equation [7]:

$y(n+1) = F\big(y(n), \ldots, y(n-q+1),\; u(n), \ldots, u(n-q+1)\big)$

In this diagram the global feedback is applied from the output layer to the input layer, but this global feedback can take different forms. For example, feedback can be applied from the hidden layer to the input layer, or, when there is more than one hidden layer, feedback can be applied from one hidden layer to the preceding layer [7]. Using these different types of global feedback we can identify another two generic types of recurrent networks (the nonlinear autoregressive moving average with exogenous inputs network and the fully recurrent network), which are illustrated in the following figure.

Figure 4(a): Nonlinear autoregressive moving average with exogenous inputs Network [11]

Figure 4(b): Fully Recurrent Network [11]

Using these generic models, various other network models have been developed. The Elman and Jordan network models are the most widely used recurrent network models for practical applications, including time series prediction. In the following section we will look at the Elman network model in detail.

2.2.2.1 Elman Network

The Elman network is a recurrent network in which the hidden layer output is fed back as an input to the hidden layer at the next time step [11]. The conceptual idea behind the Elman network and the structure of the network are given in the following diagram.


Figure 5: Conceptual view and the operation of Elman Network [11]

2.2.2.2 Learning Algorithm

Training of the Elman network is very similar to training the multilayer perceptron, except that the Elman training algorithm requires the update of the recurrent weights [25].

Let the input vector be $x_n$ and the target vector be $t_n$; here $x_n$ denotes the combination of both the input and the context unit values. Take the set of training data D as $\{(x_n, t_n)\}$, n = 1, 2, ..., N.

1. The algorithm starts with the randomization of the weight vector $w$.
2. At the start of each epoch k, store the current values of the weight vector in $w(k)$, the weight vector at the beginning of the epoch; n denotes the repetition counter.
   a. For n = 1, 2, ..., N
      i. For training example $(x_n, t_n)$ use error back-propagation [7] to calculate the partial derivatives $\partial E_n / \partial w_i$.
      ii. Update the weights using the equation $w(k+1) = w(k) - \eta \, \partial E_n / \partial w_i$, where $\eta$ denotes the learning rate.
      iii. Copy the hidden node values to the context units.
      iv. Increment the repetition counter (n).
   b. Check the epoch counter k for termination. If the termination condition is not met, increment k and go to step 2.
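A minimal sketch of this training scheme is given below: back-propagation on the feed-forward part with the hidden activations copied to the context units after every pattern. Layer sizes, the learning rate and the class name are illustrative assumptions, not taken from [25].

```python
import numpy as np

class ElmanNetwork:
    """Minimal Elman recurrent network for one-step time series prediction."""

    def __init__(self, n_in, n_hidden, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(0, 0.1, (n_hidden, n_in))       # input -> hidden
        self.W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))   # context -> hidden
        self.W_hy = rng.normal(0, 0.1, n_hidden)               # hidden -> output
        self.context = np.zeros(n_hidden)
        self.lr = lr

    def forward(self, x):
        self.h = np.tanh(self.W_xh @ x + self.W_ch @ self.context)
        return float(self.W_hy @ self.h)

    def train_pattern(self, x, target):
        y = self.forward(x)
        err = target - y
        # gradients of the squared error w.r.t. each weight matrix
        dh = err * self.W_hy * (1.0 - self.h ** 2)
        self.W_hy += self.lr * err * self.h
        self.W_xh += self.lr * np.outer(dh, x)
        self.W_ch += self.lr * np.outer(dh, self.context)
        self.context = self.h.copy()   # copy hidden values to context units
        return err ** 2
```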


2.2.3 Using Unsupervised Learning Networks for Forecasting

Most of the research on time series prediction has been based on supervised learning algorithms such as the Multilayer Perceptron (MLP) and Radial Basis Function networks, because the time series problem can conveniently be expressed as a supervised learning problem [8]. The Self Organizing Map (SOM) can be identified as the most widely used unsupervised learning algorithm in neural networks. Despite originally being developed to transform higher dimensional data into a one or two dimensional map, and being widely used for clustering and visualization problems, self organizing maps can be applied to the time series prediction problem and have shown promising results. SOM based models fall into the category of local prediction models. The difference between global and local models is that in a global model approach only one model is used for the whole process, while in a local model approach the time series data is divided into smaller subsets which are modeled using local models. Local models are useful in non-stationary time series prediction, and they give the user a better understanding of the underlying process generating the time series [8, 10].

Figure 6: Approximation using Global and Local Models

In the following sections we will look at the theory behind the SOM briefly and how it can be extended so that it can be applicable to time series prediction.

2.2.3.1 Self Organizing Maps in a Nutshell

A self organizing map consists of only two layers, the input and the output layer. Neurons in the output layer usually have the formation of a one or two dimensional lattice [11]. A set of synaptic weights connects the neurons in the input layer with the neurons in the output layer. These neurons have to be fully mesh connected, i.e. each neuron in the input layer has connection weights to all the neurons in the output layer.


Figure 7: SOM with 2 Neurons in the input layer and 3 Neurons in the output layer

2.2.3.2 SOM Learning Algorithm

SOM learning algorithm consists of competitive, cooperative and adaptive processes [7].

Let the input vector selected randomly from the input space be $x$ and the weight vectors be $w_j$, where j = 1, 2, ..., l (the number of output neurons in the network). After initialization of the weight vectors to random values, the competitive process is started. In the SOM, for a given input vector there is only one winning neuron in the output layer. In the competitive process the Euclidean distance between the input vector and each weight vector associated with the output neurons is calculated, and the output neuron with the minimum distance is selected as the winner, i.e.

$i(x) = \arg\min_j \lVert x - w_j \rVert$

Then the algorithm moves to the cooperative process. In this process the topological neighborhood centered around the winning neuron i is considered, and the neighboring strength of the topological neighbors is calculated, usually using the Gaussian function [7]

$h_{j,i}(n) = \exp\left(-\dfrac{d_{j,i}^2}{2\sigma^2(n)}\right)$

where $d_{j,i}$ is the lateral distance between the winner i and the neuron j in the neighborhood, and $\sigma(n)$ is the width of the neighborhood, which also decays with each iteration. In the adaptive stage, the weights of the winning neuron and its topological neighbors are adjusted using the function [7]

$w_j(n+1) = w_j(n) + \eta(n)\, h_{j,i(x)}(n)\, \big(x - w_j(n)\big)$

where $\eta(n)$ is the learning rate parameter, a decaying function of the iteration number.
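A compact sketch of these three steps for a one-dimensional output lattice is given below; the decay schedules and parameter values are illustrative choices, not prescriptions from [7].

```python
import numpy as np

def train_som(data, n_units, n_iter=1000, lr0=0.5, sigma0=None, seed=0):
    """Train a 1-D SOM: competitive, cooperative and adaptive steps."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weights = rng.normal(0, 0.1, (n_units, data.shape[1]))
    sigma0 = sigma0 or n_units / 2.0
    positions = np.arange(n_units)                       # lattice coordinates

    for n in range(n_iter):
        frac = n / n_iter
        lr = lr0 * np.exp(-frac)                         # decaying learning rate
        sigma = sigma0 * np.exp(-frac)                   # decaying neighborhood width
        x = data[rng.integers(len(data))]                # random input vector

        # competitive: winner = unit with minimum Euclidean distance
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))

        # cooperative: Gaussian neighborhood strength around the winner
        h = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))

        # adaptive: move weights of winner and neighbors toward the input
        weights += lr * h[:, None] * (x - weights)
    return weights
```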


2.2.3.3 Recurrent SOM and Time Series Prediction

The problem with directly applying the self organizing map to the time series prediction problem is that the SOM has no method of keeping track of the time dimension. Therefore it is necessary to add a memory to the SOM model so that it can keep track of temporal contextual information [9]. The Recurrent Self Organizing Map (RSOM) is one way of storing temporal contextual information.

2.2.3.4 Estimation of local models and learning algorithm

At the first stage of the process the time series data is divided into two sets: a training data set and a testing data set. Windowing techniques are used to create the input vectors to the RSOM. Usually 4-fold cross validation is used for the selection of an appropriate local model; the best model selected from cross validation is then trained again using the entire training data set, and this model can be used to predict the next values of the time series [9]. The main difference in RSOM is that it uses leaky integrator output values for selecting the best matching unit, whereas in the normal SOM the outputs are set to zero after selecting the winning neuron and adjusting the weights. The leaky difference vector $y_i(n)$ in unit i is calculated as

$y_i(n) = (1 - \alpha)\, y_i(n-1) + \alpha\, \big(x(n) - w_i(n)\big)$

where $\alpha$ is the leaky coefficient.

Figure 8: Schematic presentation of RSOM unit [9]

In RSOM the best matching unit, or winning neuron, is selected using the following formula:

$b(x) = \arg\min_i \lVert y_i(n) \rVert$

Then the algorithm moves into the adaptive process as in the SOM algorithm. The weights of the winning neuron (best matching unit) and its topological neighborhood are adjusted as follows:

$w_i(n+1) = w_i(n) + \gamma(n)\, h_{ib}(n)\, y_i(n)$


After completion of the update for the selected random point in the data set, the values of the $y_i$'s are reset to zero, another random point is selected from the input space, and the same procedure is repeated until the map is formed.
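The sketch below shows a single RSOM update step under the assumptions above (one-dimensional lattice, Gaussian neighborhood); the function name and parameter values are illustrative.

```python
import numpy as np

def rsom_step(x, weights, y, alpha=0.3, lr=0.05, sigma=1.0):
    """One RSOM update: leaky difference vectors, BMU selection, weight update.

    `weights` has shape (n_units, dim) and `y` holds the leaky difference
    vectors from the previous step.
    """
    positions = np.arange(len(weights))
    y = (1.0 - alpha) * y + alpha * (x - weights)        # leaky integration
    bmu = np.argmin(np.linalg.norm(y, axis=1))           # best matching unit
    h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
    weights = weights + lr * h[:, None] * y              # adaptive update
    return weights, y, bmu
```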

2.3 Support Vector Machines (SVM)

SVM is related to statistical learning theory. It became popular because of its success in digit recognition, and is now regarded as an important example of “kernel methods”, one of the key areas in machine learning. Support vector machines are built on the basics of linear learning machines, and the basic problem of a linear learning machine is deciding the decision boundary.

2.3.1 Decision boundary

Consider a two-class, linearly separable classification problem. The line used to separate the classes and take decisions, analogous to the line obtained by regression in normal statistical analysis, is basically what we call a decision boundary. If we do not use any principled method to find that boundary, the many boundaries we might draw for a certain two-class problem are not all equally justifiable. The gap between the boundary and the two classes on either side is called the margin.

Figure 9: Decision Boundaries[30]

The decision boundary should be as far as possible from both classes: the more we maximize the margin m, the more justifiable the decision boundary.


Figure 10: Decision boundary and margin[30]

A decision boundary is defined as shown in the above example. If we take the set of data points as $\{X_1, X_2, X_3, \ldots, X_i\}$ and let $Y_i \in \{-1, 1\}$ be the class of $X_i$, note that $Y_i (W^T X_i + b) \geq 1$. Since $m = 2 / \lVert W \rVert$, maximizing m means minimizing $\lVert W \rVert$. The task then becomes a constrained quadratic optimization problem, a major field in artificial intelligence:

● Minimize $\frac{1}{2} \lVert W \rVert^2$

● Subject to $Y_i (W^T X_i + b) \geq 1$

Solving this is out of the scope of this review. The perceptron algorithm can be used for solving the linearly separable classification problem.

2.3.2 Perceptron Algorithm

Update rule (ignoring the threshold):

if $Y_i (W_k^T X_i + b) \leq 0$ then

$W_{k+1} = W_k + Y_i X_i$

$k = k + 1$

Repeated application of the above update over the training data is needed to implement the perceptron algorithm.
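The update rule translates directly into code; the following is a minimal sketch (function name and epoch limit are our own illustrative choices).

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Perceptron update rule: on a mistake, W <- W + y_i * x_i, b <- b + y_i.

    X has shape (n_samples, n_features); y contains labels in {-1, +1}.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified point
                w += yi * xi                     # update rule
                b += yi
                mistakes += 1
        if mistakes == 0:                        # converged on separable data
            break
    return w, b
```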


$f(x) = \langle w, x \rangle + b, \qquad h(x) = \mathrm{sign}(f(x))$

Solving the above, it can be found that the solution is a linear combination of the training points,

$W = \sum_i \alpha_i Y_i X_i, \qquad \alpha_i \geq 0$

This solution leads to the duality of the SVM. However, there are limitations with linear learning machines. Linear classifiers

● cannot deal with non-linearly separable data
● cannot deal with noisy data

Moreover, the above formulation deals only with vectorial data. Nevertheless, it can be used as a basic introduction to SVMs, and these results carry over to SVMs because the linear learning machine is a part of the SVM.

2.3.3 Properties of SVM

The main properties of SVM are as follows.

● Duality
● Kernels
● Margin
● Convexity
● Sparseness

Duality

Duality is the main feature of the SVM. SVMs can be represented in dual fashion, considering them as linear learning machines as above [27]:

$f(x) = \langle W, X \rangle + b = \sum_i \alpha_i Y_i \langle X_i, X \rangle + b$

Here the data appear only within dot products. To discuss the other properties we have to move to nonlinear classifiers.

Kernels

Now we have to move to nonlinear classifiers. The main problem is analyzing nonlinear classifiers; since linear classifiers are easy to analyze, we can keep using them at an atomic level.


• One solution is to create a set of linear classifiers, i.e. a neural network (as discussed in the neural network section above, problems of local minima, too many parameters and training the network with heuristics make this solution very complicated).

• Another solution is to map data into a more rich feature space including non-linear features and then use a linear classifier.

Here we will dig into the second option of mapping data into a linearly separable feature space.

Figure 11 : Transforming non-linear separable data into a rich feature space

$X \rightarrow \phi(X)$. What is shown in the above figure is transforming the data set into a feature space. But there are problems with feature spaces:

● Problem of expressing complex functions - the solution is to work in high dimensional feature spaces
● Computational problem - working with very large vectors
● Generalization theory problem - the curse of dimensionality

Basically, the mapping function $\phi$ used above is what gives rise to the kernel. But to be called a kernel it should give some kind of solution to the above-mentioned problems; if not, it is literally not a kernel. Kernels will

● solve the computational problem of working with many dimensions
● make it possible to use infinite dimensions efficiently in time and space
● provide both practical and conceptual advantages

After the transition, the dual representation mentioned under duality becomes

$f(x) = \sum_i \alpha_i Y_i \langle \phi(X_i), \phi(x) \rangle + b$

Note that the dimensionality of the feature space is not necessarily important. Converting $X_1, X_2$ to $\phi(X_1), \phi(X_2)$,

$K(X_1, X_2) = \langle \phi(X_1), \phi(X_2) \rangle$

Examples of kernels:

● Linear Kernel
● Polynomial Kernel
● Radial Basis Function (RBF) Kernel
● Neural Network Kernel


Convexity - The optimization problem is a convex quadratic problem with no local minima, which means it is solvable in polynomial time.

Sparseness and Margin - Margin and sparseness are very important in deciding the decision boundary, as explained above.

2.3.4 Using Linear Kernel in Time Series Prediction

The linear kernel is $K(X_1, X_2) = X_1 \cdot X_2$.

According to the above description of linearly separable classification, we can predict a time series as follows [27][28][29]:

$X_t = \sum_{i=1}^{k} W_i X_{t-i} + b$

This is very straightforward and is the simplest kernel to be used.

2.3.5 Using RBF Kernel in Time Series Prediction

The radial basis kernel takes the form $K(X_1, X_2) = \exp\left(-\lVert X_1 - X_2 \rVert^2 / 2\sigma^2\right)$.

Let us assume the time series is generated by a function f such that $X_t = f(X_{t-1}, \ldots, X_{t-k})$. If we plot the (k+1)-dimensional vectors $(X_i, X_{i+1}, \ldots, X_{i+k})$ constructed from the time series $X_1, \ldots, X_N$, the resulting plot is a part of the graph of f, so the function f can be estimated from the time series. Moreover, if we assume that the function is linear and the data is generated by $X_t = f(X_{t-1}, \ldots, X_{t-k}) + \varepsilon_t$, where $\varepsilon_t$ is Gaussian noise, it can be shown that most of the data lies in an ellipsoid defined by the mean of the time series and the variance of $\varepsilon_t$. This means that information about a window of the time series can be found from other windows of the time series that are similar in terms of the Euclidean distance, which makes the RBF kernel promising for time series analysis. Besides these kernels, there are other methods of time series prediction.
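As one possible illustration of this windowing idea (assuming scikit-learn is available; the windowing helper, synthetic series and hyperparameters are our own), an RBF-kernel support vector regressor can be trained on sliding windows of the series:

```python
import numpy as np
from sklearn.svm import SVR

def make_windows(series, k):
    """Turn a series into (k-lag window, next value) training pairs."""
    X = np.array([series[i:i + k] for i in range(len(series) - k)])
    y = np.array(series[k:])
    return X, y

# synthetic example series
t = np.arange(300)
series = np.sin(0.1 * t) + 0.05 * np.random.default_rng(0).normal(size=300)

X, y = make_windows(series, k=10)
model = SVR(kernel="rbf", C=10.0, gamma=0.5).fit(X[:-20], y[:-20])
print(model.predict(X[-20:]))   # one-step-ahead predictions on held-out windows
```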

2.3.6 Using Least Square (LS) SVM method in Time Series Prediction

An LS-SVM formulation uses equality constraints and a sum-squared error (SSE) cost function instead of the quadratic program used in the linearly separable case. Consider a training data set of N points $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^n$ and response $y_i \in \mathbb{R}$. The optimization problem is as follows [27][28][29]:

Minimize $J(w, e) = \frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^{N} e_i^2$

Subject to $y_i = w^T \phi(x_i) + b + e_i, \quad \forall i = 1, 2, 3, \ldots, N$

where $\phi(\cdot)$ is the non-linear mapping to a higher dimensional space and $\gamma$ is the regularization parameter.


The same RBF kernel, $K(x, y) = \exp\left(-\lVert x - y \rVert^2 / 2\sigma^2\right)$ where $\sigma$ is the tuning parameter, satisfies Mercer's condition. As in the linearly separable case, the primal space model of the optimization problem is given by

$y = w^T \phi(x) + b$

Here we have to move to the dual space. The Lagrangian for the problem is given by

$L(w, b, e, \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left( w^T \phi(x_i) + b + e_i - y_i \right)$

where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$ and the $\alpha_i$ are the Lagrange multipliers [27][28][29]. From here we can obtain the linear system of equations [27][28][29]

$\begin{bmatrix} 0 & 1^T \\ 1 & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$

Here $y = [y_1, y_2, \ldots, y_N]^T$, $1 = [1, 1, \ldots, 1]^T$, and $\Omega$ with $\Omega_{i,j} = K(x_i, x_j)$, $i, j \in \mathbb{N}^+$, is the kernel matrix. The LS-SVM decision function is given by

$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b$

where $\alpha$ and b are the solutions of the above linear system. The main benefit of the LS-SVM technique is that it transforms the traditional quadratic programming problem into a simultaneous linear system, which ensures simplicity in computation, fast convergence and high precision [27][29].
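A minimal sketch of solving this linear system with NumPy is given below, assuming an RBF kernel; the function names and default parameters are illustrative, not taken from [27]-[29].

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM linear system for (b, alpha) as described above."""
    N = len(y)
    Omega = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                      # 1^T row
    A[1:, 0] = 1.0                      # 1 column
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.r_[0.0, y]
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # bias b, multipliers alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Decision function y(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```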

2.4 Genetic Algorithms for Time Series Prediction

Genetic or evolutionary algorithms are a branch of artificial intelligence influenced by the Darwinian theory of evolution. Some authors introduce genetic algorithms as an offshoot of stochastic beam search [15]. As in biological evolution, the algorithm exploits the concepts of inheritance, selection, crossover and mutation: the nature of the offspring is decided by the fitness of its parents. In other words, genetic algorithms mimic the core of natural evolution to produce better solutions to problems. Because of the generality of the accepted set of problems, time series analysis and prediction can also be done with genetic algorithms. The fact that natural evolution has proved to be a successful and robust mechanism that produces strong offspring has introduced a promising computational model for problem solving: the idea of strong offspring is ported to the problem solving domain, which implies that the solutions get stronger as they evolve, eventually producing an acceptably good answer.

The heart of genetic algorithms lies in a heuristic function called the fitness function, where the term fitness is adopted from evolutionary biology. Fitness measures how good a certain genotype is at leaving offspring in the next generation relative to how good other genotypes are at it [13]. Mitchell, Melanie [23] states that a genetic algorithm is a procedure that searches for an optimal solution in a search space, a collection of candidate solutions. The following pseudocode explains the general idea of genetic algorithms [12].

1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem).
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population (e.g., from [23], f(x) could be the number of ones in the bit string x, as in the worked example below).
3. [New population] Create a new population by repeating the following steps until the new population is complete. The following are called genetic operators.
   1. [Selection] Select two parent chromosomes from the population according to their fitness (the better the fitness, the higher the chance of being selected).
   2. [Crossover] Cross over the parents to form new offspring (children) with crossover probability Pc. If no crossover is performed, the offspring is an exact copy of the parents. The locus of crossover is decided randomly. For example, the strings 10000100 and 11111111 could be crossed over after the third locus in each to produce the two offspring 10011111 and 11100100 [23].
   3. [Mutation] Mutate the new offspring at each locus (position in the chromosome) with mutation probability Pm. For example, the string 00000100 might be mutated in its second position to yield 01000100. Mutation can occur at each bit position in a string with some probability, usually very small (e.g., 0.001) [23].
   4. [Accepting] Place the new offspring in the new population.
4. [Replace] Use the newly generated population for a further run of the algorithm.
5. [Test] If the end condition is satisfied, stop and return the best solution in the current population.
6. [Loop] Go to step 2.

The internals and working principles of genetic algorithms are described in the following section. Genetic representation of the problem is the first step in genetic modeling. The standard way of encoding a problem genetically is representing it as a linear binary string of fixed length; other arrangements like variable-length strings and n-ary character strings are also possible depending on the nature of the problem domain. An example problem encoding is shown below [15]. The famous 8-queens problem can be used to demonstrate the behavior of genetic algorithms; the following is an instance of the 8-queens problem taken from [15], where chess board squares are numbered from the lower left hand corner. We will refrain from an elaborate analysis of genetic algorithms because it is out of the scope of this literature review, and will rather focus on the application of genetic algorithms in time series prediction.

Figure 12: The genetic algorithm illustrated for digit strings representing 8-queens states. The initial population in (a) is ranked by the fitness function in (b), resulting in pairs for mating in (c). They produce offspring in (d), which are subject to mutation in (e) [15]


Figure 13: The 8-queens states corresponding to the first two parents in Figure 12(c) and the first offspring in Figure 12(d). The shaded columns are lost in the crossover step and the unshaded columns are retained. [15]

[23] presents some typical parameter values and terminology used with genetic algorithms in practice. Each iteration in the process is called a generation; typical numbers of generations vary from 50 to 500 or even more. The entire process is called a run, and at the end of a run there are fitter chromosomes in the population that are more suitable for survival (closer to the optimization goal). A more detailed example that elaborates genetic algorithms is as follows. Assume the following parameter values.

l = 8 (chromosome length)
f(x) = number of ones in chromosome x
n = 4 (population size)
Pc = 0.7
Pm = 0.001

Chromosome label Chromosome string Fitness

A 00000110 2

B 11101110 6

C 00100000 1

D 00110100 3

Table 1. Chromosomes of the first generation and their fitness values

Fitness-proportionate selection (viability selection) is used as the selection method, where the expected frequency with which an individual reproduces is equal to its fitness divided by the average fitness of the population. According to this, the fitness-proportionate selection value for A would be $2 / \frac{2+6+1+3}{4} = 2/3$. In general it can be expressed as $f_i / \bar{f}$ for the i-th chromosome, where $\bar{f}$ is the average fitness of the population.


Roulette-wheel sampling is considered a simple method for implementing fitness-proportionate selection. The basic idea is to give each individual a slice of a circular roulette wheel with area proportional to the chromosome's fitness. The roulette wheel is spun and a marker comes to rest on one slice; the chromosome corresponding to that slice is selected. The roulette wheel is spun n times (n = 4 in the above example). If the roulette wheel is spun a sufficiently large number of times, the average results will be close to the expected values. After a pair of parent chromosomes is selected, they cross over with probability Pc to produce new offspring (or simply replicate the parents if they do not cross over). Assume that parents B and D from Table 1 cross over after the first bit location to produce the offspring E = 10110100 and F = 01101110. Then each child is subjected to mutation at each bit position with probability Pm. A possible outcome of this evolution procedure is given in Table 2.

Chromosome label Chromosome string Fitness

E1 10110000 3

F 01101110 5

C 00100000 1

B1 01101110 5

Table 2. New generation of chromosomes after crossover and mutation operations
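To tie the worked example together, the sketch below is a minimal Python rendering of the loop above, using the same parameter values (l = 8, n = 4, Pc = 0.7, Pm = 0.001) and the number-of-ones fitness; all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(chrom):
    """f(x) = number of ones in the chromosome."""
    return int(np.sum(chrom))

def roulette_select(pop, fits):
    """Fitness-proportionate (roulette-wheel) selection of one parent."""
    probs = np.asarray(fits, dtype=float) / np.sum(fits)
    return pop[rng.choice(len(pop), p=probs)]

def evolve(pop, pc=0.7, pm=0.001, generations=50):
    """Run the GA loop sketched above on a population of bit strings."""
    for _ in range(generations):
        fits = [fitness(c) for c in pop]
        new_pop = []
        while len(new_pop) < len(pop):
            p1, p2 = roulette_select(pop, fits), roulette_select(pop, fits)
            if rng.random() < pc:                       # single-point crossover
                cut = rng.integers(1, len(p1))
                c1 = np.r_[p1[:cut], p2[cut:]]
                c2 = np.r_[p2[:cut], p1[cut:]]
            else:                                        # copy parents unchanged
                c1, c2 = p1.copy(), p2.copy()
            for child in (c1, c2):                       # bitwise mutation
                flips = rng.random(len(child)) < pm
                child[flips] ^= 1
                new_pop.append(child)
        pop = new_pop[:len(pop)]
    return pop

# initial population of n = 4 chromosomes of length l = 8
population = [rng.integers(0, 2, 8) for _ in range(4)]
final = evolve(population)
print([("".join(map(str, c)), fitness(c)) for c in final])
```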

With a basic introduction to genetic algorithms and their behavior, it is now discussed how the genetic approach is utilized in time series prediction. Most of the studies under this topic were based on parameter optimization models influenced by conventional models like Holt-Winters and ARIMA [19]. But a tremendous effort has been put in by researchers over the years, and they have shown that genetic and evolutionary algorithms provide a way to tackle the time series forecasting problem. An impressive study on time series forecasting was conducted by Anufriev, M. et al. [21]. The entire premise of the idea depends on the ability to formulate prediction problems as search problems; for example, we can think of it as selecting the best set of parameters that optimizes a certain utility value. In other words, we are trying to find an optimal model that detects patterns in the data. Their genetic model was developed to offer an explanation for human behavior in a Learning-to-Forecast experiment [22]. The model was developed to form price expectations; however, the idea is valid for a general time series prediction scenario as well. Our discussion on the application of genetic algorithms to time series forecasting is based on the work done by Anufriev, M. et al. [21]. This genetic model was developed to predict human behavior with respect to changing market prices, so the model acts as an economic agent. The forecasting method tries to optimize a chosen heuristic function in order to learn the market patterns. The following quotation describes how genetic algorithms fit into the stated problem and their role as an optimization procedure. [21] describes the genetic approach as follows.

“The basic idea is that we should focus on a population of arguments which compete only in terms of their respective function value. This competition is modeled in an evolutionary fashion: mutation operators allow for a blind-search experimentation, but the probability that a particular candidate will survive over time is relative to its functional value. As a result, the target function may be as general as necessary, while the arguments can be of any kind, including real numbers, integers, probabilities or binary variables. The only constraint is that each argument must fall into a predefined bounded interval [a_n, b_n]”


The term functional value in the above quote represents the value of the fitness function discussed in the introduction. In Norman Packard's genetic model (1990) for stock market prediction, the problem is modeled in the following way [23]; Packard's Xerox stock price prediction example will be discussed to elaborate the underlying ideas of applying genetic algorithms to time series forecasting. A physical system or a formal dynamical system can be denoted as a set of pairs {(X_1, y_1), (X_2, y_2), ..., (X_N, y_N)}, where each X_i is a feature vector of the form X_i = (x_1^i, x_2^i, ..., x_n^i), and let C be a set of conditions on the components of such feature vectors; C is called a condition set.

A condition set specifies a particular subset of the data points. Packard's approach was to develop a genetic algorithm that searches for condition sets that are good predictors. For example, if the genetic algorithm finds a condition set C such that all the days satisfying the set were followed by days on which the Xerox stock price increased to ~$30, then we are confident to predict that, if those conditions are satisfied today, Xerox stock will go up. The fitness of a condition set C_i is calculated by running all the data points (X_k, y_k) through it to check the outcome (true or false) for each training data point, and then collecting all y values from the data points whose X satisfies C_i. If all these y values are close to a particular value within a predetermined range of tolerance, then C_i is an acceptable condition set as a predictor for y. The above example explains how time series prediction can be carried out using a genetic approach. In summary, the time series problem is converted into a typical search problem and then solved with a genetic algorithm.

3. Evaluation of machine learning models

Many time series prediction techniques have been discussed throughout this review. After a detailed analysis of the internals of each method, we are in a position to draw comparisons among these time series prediction models and discuss the strengths and weaknesses of each.

ARIMA is a simple, historic statistical model for time series prediction. Its popularity is due to its simplicity and the accuracy shown in early applications. The main advantage of the parametric ARIMA models is the ease of implementation: compared to nonparametric models, the fundamental ARIMA models are fairly trivial. Regardless of this simplicity, ARIMA produces impressive results for many data sets. The main drawback of the ARIMA model is that it fails to detect complex time series patterns that cannot be captured by simple parametric models. Since many modern time series relationships are quite complicated, with large feature spaces, the popularity of pure ARIMA models has decreased. As time series data grew in complexity, parametric (stochastic) methods began to fail more often and, as a result, nonparametric models were introduced. Since nonparametric time series prediction methods are the current trend, the main focus of this survey is on nonparametric techniques; predominant methods like self organizing maps, support vector machines and genetic algorithms are discussed in this review.

The Recurrent Self Organizing Map is a variant of neural network based learning that is specially capable of detecting nonlinear patterns in time series data. Being a local method, the recurrent SOM requires a limited amount of computation compared to global methods. The unsupervised learning approach makes it possible to build models from data with only limited prior knowledge, which means that the designer has more freedom in selecting training data for the model. Regardless of this advantage of unsupervised learning, supervised recurrent MLP models remain the most suitable for processing nonlinear data; their only significant disadvantage is computational complexity, but they are frequently used in practice because of their impressive accuracy on nonlinear time series data. It is also noteworthy that autoregressive models outperform recurrent SOMs in the presence of linear time series data. In summary, recurrent SOMs are most suitable when the time series is nonlinear.

Support Vector Machines, another well known algorithm, have gained a lot of popularity in recent times, especially as large margin classifiers. The flexibility of SVMs is the main reason behind their wide usage: compatibility with different linear and nonlinear kernels makes the SVM effective against many types of time series. If the time series is nonlinear, we can simply choose a different SVM kernel suited to complex, nonlinear data. This eliminates the undesired property, present in both ARIMA and recurrent SOMs, of under-performing on certain data models. However, there are known limitations of SVMs. The main drawback is their tendency to overfit when used on high dimensional feature spaces. The effect of overfitting should be carefully observed and eliminated by appropriately selecting data, a process that can be cumbersome to do manually.

Finally, we discuss the genetic and evolutionary model of time series prediction. [31] lists pros and cons of genetic algorithms in detail. Genetic algorithms are a very powerful computational model: if a problem can be encoded into chromosomes, it is readily solvable by a genetic algorithm, and the method can also be used when there are multiple solutions. One huge plus point for genetic algorithms with respect to other methods is their independence of the error surface. This allows genetic algorithms to be used for multidimensional, non-differentiable, non-continuous, and even nonparametric problems, so genetic algorithms solve a larger range of problems than other methods. As we have already seen, the genetic algorithm approach does not require mathematical versatility, so the model is very comprehensive and natural. Apart from these promising advantages, genetic algorithms have some rough edges. Poorly chosen fitness functions result in bad chromosome blocks, because crossover takes place not only among the strong parents; this problem is imminent in certain optimization problems. Furthermore, genetic algorithms do not guarantee finding the global optimum, a scenario that is common in problem domains where there are many objects in a population. Also, genetic algorithms do not guarantee constant optimization response times like many other optimization algorithms do; this is a severe drawback that limits their use in real world applications. Applications of genetic algorithms in real time environments are limited due to their random solutions and convergence, in which we cannot pick an optimum even though the entire population is improving as a whole. So despite the powerful computational model, there are some blocking factors that limit the use of genetic algorithms.


4. Conclusion

In this literature review we have presented, discussed and compared a number of widely used techniques for time series modeling and forecasting (prediction). We dove into the internal details and implementation of each technique, which helped us observe how each method works. Among these methods we encountered pure statistical models like ARIMA and machine learning based intelligent methods like Recurrent Self Organizing Maps, Support Vector Machines and Genetic Algorithms. Different methods adopt different approaches and have their own strengths and weaknesses. Parametric statistical models are suited to modeling simple time series relationships, whereas more advanced and versatile algorithms are needed to detect complicated patterns in time series data; therefore more focus was given to the latter.

Different methods fit different situations, so understanding and interpreting the problem is required before choosing a technique to predict data. In particular, understanding the data prior to algorithm selection is a major rule of thumb in data mining and related fields. The performance of these algorithms can be improved in various ways: pruning decision spaces, cleaning and preprocessing time series data, improved coding standards, and distributing computation among different nodes are a few general improvements. The conclusion to arrive at is that no single technique is perfect; there are situations where one method fits a problem better than another, and this may differ for another problem. So to select a proper technique to analyze and forecast time series data, one should have a sound and comprehensive understanding of the problem domain.

Time series forecasting is a very old problem with many historic milestones, and we can reasonably expect these techniques to keep improving as the complexity of data increases. Researchers are continuously working on better approaches for time series prediction due to the significance of this problem in fields such as economics, stock prediction, medicine, biology, meteorology, human sciences and many other cutting edge research areas. The importance and impact of better time series prediction models is obvious, as making decisions about the future is becoming increasingly important and critical in many competitive organizational structures.


References

[1] Gianluca Bontempi, "Computational intelligence methods for time series prediction", eBISS 2012. [online] Available: http://www.ulb.ac.be/di/map/gbonte/ftp/time_ser.pdf
[2] Nesreen K. Ahmed, Amir F. Atiya, Neamat El Gayar, Hisham El-shishiny, "An Empirical Comparison of Machine Learning Models for Time Series Forecasting"
[3] Wen-Chuan Wang, Kwok-Wing Chau, Chun-Tian Cheng, Lin Qiu, "A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series", Journal of Hydrology, Volume 374, Issues 3-4, 15 August 2009, pp. 294-306
[4] Bjoern Krollner, Bruce Vanstone, and Gavin Finnie, "Financial time series forecasting with machine learning techniques: A survey", European Symposium on Artificial Neural Networks: Computational and Machine Learning, Bruges, Belgium, April 2010
[5] Sapankevych N.I., Sankar Ravi, "Time Series Prediction Using Support Vector Machines: A Survey", IEEE Computational Intelligence Magazine, Volume 4, Issue 2, May 2009, pp. 24-38
[6] Li Shu-rong and Ge Yu-lei, "Crude Oil Price Prediction Based on a Dynamic Correcting Support Vector Regression Machine", Abstract and Applied Analysis, vol. 2013, Article ID 528678, 7 pages, 2013. doi:10.1155/2013/528678
[7] Haykin S., Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999
[8] Barreto G., "Time Series Prediction with the Self-Organizing Map: A Review", Perspectives of Neural-Symbolic Integration, Studies in Computational Intelligence Volume 77, 2007, pp. 135-158
[9] Timo Koskela, Markus Varsta, Jukka Heikkonen, Kimmo Kaski, "Recurrent SOM with Local Linear Models in Time Series Prediction", 6th European Symposium on Artificial Neural Networks, 1998
[10] Angelovič P., "Time Series Prediction Using RSOM and Local Models", IIT Student Research Conference, 2005
[11] Sandhya Samarasinghe, Neural Networks for Applied Sciences and Engineering: From Fundamentals to Complex Pattern Recognition, Auerbach Publications, 1st edition, 2006
[12] Introduction to Genetic Algorithms. [online] Available: http://www.obitko.com/tutorials/genetic-algorithms
[13] What about fitness?, Evolution 101. [online] Available: http://evolution.berkeley.edu/evosite/evo101/IIIE2Fitness.shtml
[14] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997, Chapter 9
[15] Russell, Stuart, Artificial Intelligence: A Modern Approach, 2/E, Pearson Education India, 2003, pp. 116-119
[16] Paulo Cortez, Miguel Rocha, Jose Neves, "Genetic and Evolutionary Algorithms for Time Series Forecasting", Engineering of Intelligent Systems, 14th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, IEA/AIE 2001, Budapest, Hungary, June 4-7, 2001, Proceedings, pp. 393-402
[17] Palit A.K., Popovic D., Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications, Springer, 2005 edition (July 1, 2005)
[18] Khadka, M., Popp, B., George, K. M., & Park, N. (2010), "A New Approach for Time Series Forecasting based on Genetic Algorithm", In CAINE, pp. 226-231
[19] Chen, Gemai, Bovas Abraham, and Greg W. Bennett, "Parametric and Nonparametric Modelling of Time Series - An Empirical Study", Environmetrics 8.1 (1997): 63-74
[20] Adriana Agapie, Alexandru Agapie, "Forecasting the Economic Cycles Based on an Extension of the Holt-Winters Model. A Genetic Algorithms Approach", In Proceedings of the IEEE Computational Intelligence and Financial Forecasting Engineering, pp. 96-99, 1997
[21] Anufriev, Mikhail, Cars Hommes, and Tomasz Makarewicz, "Learning-To-Forecast with Genetic Algorithms", Working Paper, May 2012
[22] Heemeijer P., Hommes C., Sonnemans J., Tuinstra J. (2009), "Price stability and volatility in markets with positive and negative expectations feedback: An experimental investigation", Journal of Economic Dynamics and Control 33(5): 1052-1072
[23] Mitchell, Melanie, An Introduction to Genetic Algorithms, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, Fifth printing, 1999, ISBN 0-262-13316-4 (HB)
[24] Mike Martin, Mike Mackey, Leon Glass, The Mackey-Glass Equation. [online] Available: http://math.jccc.edu:8180/webMathematica/JSP/mmartin/mackeyglass.jsp
[25] John Salatas, Implementation of Elman Recurrent Neural Network in WEKA. [online] Available: http://jsalatas.ictpro.gr/implementation-of-elman-recurrent-neural-network-in-weka/
[26] Ron Meir, "Nonparametric Time Series Prediction Through Adaptive Model Selection", Machine Learning, Kluwer Academic Publishers, 39, 5-34, 2000. [online] Available: http://webee.technion.ac.il/Sites/People/rmeir/Publications/MeirTimeSeries00.pdf
[27] J.A.K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers", Neural Processing Letters, vol. 9, no. 3, pp. 293-300, June 1999
[28] J.A.K. Suykens and J. Vandewalle, "Recurrent least squares support vector machines", IEEE Trans. Circuits and Systems-I, vol. 47, no. 7, pp. 1109-1114, July 2000
[29] Yugang Fan, Ping Li and Zhihuan Song, "Dynamic least squares support vector machine", Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA), June 21-23, 2006, Dalian, China, pp. 4886-4889
[30] Martin Law, A Simple Introduction to Support Vector Machines. [online] Available: http://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf
[31] Intelligent Control Techniques in Mechatronics - Genetic algorithm. [online] Available: http://www.ro.feri.uni-mb.si/predmeti/int_reg/Predavanja/Eng/3.Genetic algorithm/_18.html