
Rare Events Forecasting Using a Residual-Feedback GMDH Neural Network

Simon Fong, Zhou Nannan
Department of Computer and Information Science
University of Macau, Macau SAR
[email protected]

Raymond K. Wong
National ICT Australia and University of New South Wales
NSW 2052 Sydney, Australia
[email protected]

Xin-She Yang
School of Science and Technology
Middlesex University, London NW4 4BT, UK
[email protected]

Abstract—The prediction of rare events is a pressing scientific problem. Events such as extreme meteorological conditions may aggravate human morbidity and mortality. Yet, their prediction is inherently difficult as, by definition, these events are characterised by low occurrence, high sampling variation, and uncertainty. For example, earthquakes have a high magnitude variation and are irregular. In the past, many attempts have been made to predict rare events using linear time series forecasting algorithms, but these algorithms have failed to capture the surprise events. This study proposes a novel strategy that extends existing GMDH or polynomial neural network techniques. The new strategy, called residual-feedback, retains and reuses past prediction errors as part of the multivariate sample data that provides relevant multivariate inputs to the GMDH or polynomial neural networks. This is important because the strength of GMDH, like any neural network, is in predicting outcomes from multivariate data, and it is very noise-tolerant. GMDH is a well-known ensemble type of prediction method that is capable of modelling highly non-linear relations. It achieves optimal accuracy by testing all possible structures of polynomial forecasting models. The performance results of the GMDH alone, and the extended GMDH with residual-feedback, are compared for two case studies, namely global earthquake prediction and precipitation forecast by ground ozone information. The results show that GMDH with residual-feedback always yields the lowest error.

Keywords—Time Series Forecasting, GMDH, Earthquake Prediction, Ground Ozone, Neural Network, Data Pre-processing

I. INTRODUCTION

Accurate prediction of rare high-impact events is a difficult yet popular research problem. The difficulty is due to the large sampling uncertainty caused by the infrequent occurrence of the events, and the popularity is due to the potential mitigation of disasters if the problem is solved. In the past decades, the most popular type of forecasting technique has been based on the regression of a future prediction from past samples. This forecasting technique is inherited from the concept of time series decomposition, in which a time series is believed to be modelled individually according to four components: trend, seasonality, cycle, and white noise. Since the 1970s, many such techniques have become prevalent, including AR, MA, ARIMA, SARIMA, ARIMAX, etc. In this study, these techniques are generally referred to as linear forecasting techniques (LFT). Although all of these techniques produce satisfactory results for forecasting the mean and acceptable levels of variance, they have several limitations. First, because these algorithms are based on estimating the moving average values over a time series, the forecast is usually in the form of a mean value. In some special cases, such as predicting disasters, the rare events rather than the average values or normal happenings are of interest. In this study, we are concerned with the extraordinary cases that are rare even in large historical datasets. The second shortcoming is the problem of choosing the optimal parameters for calibrating LFTs. A casual choice of these parameters will lead to an extremely poor fit of the forecast model to the actual data. Third, LFTs may work very well for a time series that exhibits strong seasonality and/or long-term trends. However, for other time series, they may result in over- or under-forecasting, especially if the white noise component dominates the characteristics of the time series. The forecasting model easily becomes unstable when the time series exhibits a large degree of randomness and fluctuation; the "average-oriented" forecasting model then fails badly to predict the peak values.

An alternative approach to LFTs is a neural network approach that has completely different underlying mechanisms from the moving averages of the LFTs. The strength of a neural network model lies in its ability to capture non-linear relationships and the structural characteristics of a time series that are fuzzy, noisy and chaotic in nature. By training a neural network with a suitable architecture of neurons and layers from historical datasets, the irregular patterns of the time series can be "memorised" by adjusting the weights of the neurons, and the model is thus able to produce a prediction of a future value or values that exhibit a similar underlying pattern as the time series it is mirroring. Nonetheless, the problem of finding the optimal formation of neurons and layers for the best prediction result still exists. The selection range is very wide, and imprecise configurations will lead to very different results. To solve this problem, an alternative breed of forecasting techniques, called the Group Method of Data Handling (GMDH), was proposed in 1971 [1].

GMDH has the advantage of generating an ideally fitted model by incrementally evolving the structure of the prediction model from simple to complex, whilst fine-tuning the underlying neural network that captures the non-linear and varying behaviour of the time series. Quite often the best possible neural network model can be generated with just enough complexity to accurately model the time series. Over- or under-fitting of the model is less likely to occur with this optimal layout of the neural network.


Furthermore, the prediction model is in a better position to forecast future events (not merely the projected average/mean) than the LFTs. Dejan Sarka [2] has established that the usefulness of a forecasting model depends not only on the overall fitting error between the mean values and the smoothed average line, but also on how faithfully the fitted curve models the pattern of the time series.

One challenge in applying GMDH to time series forecasting is the transformation of the univariate format of the time series to a multivariate format necessary to reap the maximum potential of a neural network reasoning model. Appropriate parallel inputs to the neurons of the neural network will complement the learning process, if the multiple inputs are related and relevant to the target of prediction. This study proposes a general approach that first decomposes the statistics and the structural components of the time series, and then uses them as GMDH inputs. Two case studies of earthquake prediction and flood forecast are used as typical scenarios of time series forecasting where only extraordinary events are of interest (e.g., massive earthquakes and floods). The patterns of activities do not have an obvious seasonality, and they have a large level of fluctuation and randomness. In this study, the paired inputs of residual feedback and lagged time series, along with the original time series, are found to be the most relevant and effective ingredients in GMDH forecasting, leading to the lowest fitting error and the least complex model.

Our paper is structured as follows: Section 2 briefly presents the background theory of GMDH, its algorithms, and its operation. Scenarios in which feedback from previous model training enhances the accuracy of on-going model training, together with our proposed method, are described in Section 3. Section 4 presents a case study of disaster prediction, and a comparison of our proposed method with existing GMDH methods. The results of the experiments are analysed and discussed in detail in Section 5. Section 6 concludes.

II. THEORETICAL FRAMEWORK

Since the Group Method of Data Handling (GMDH) [3] was developed by Ivakhnenko in 1968, it has been referred to as a type of predictive modelling technique that is designed to attain maximum accuracy with marginal model complexity. As its name suggests, GMDH determines the coefficients and parameters of a model directly from data sampling through an iterative and inductive sorting process. In a nutshell, GMDH is a controlled optimisation process between the inputs and the outputs that gradually generates more complicated models and retains those candidate models that have relatively low forecasting errors. Thus, the GMDH algorithm aims at solving the following optimisation problem by sequential testing of increasingly complex candidate models, according to a given external error criterion:

$\bar{\zeta} = \arg\min_{\zeta \subseteq Z} ER(\zeta)$ ,  (1)

where $ER(\zeta)$ is an external error criterion of the model $\zeta$, which is chosen from the set of candidate models $Z$. The candidate models that serve as building blocks for GMDH are defined as components of some nonlinear multi-parametric equation, such that $y = h(x_1, \ldots, x_n)$ is a non-linear functional relation $h()$ that approximates the $n$ observation variables, $x_1, \ldots, x_n$, in an input variable vector $\bar{x}$, to map to the observed outcome of $y$. To derive the coefficients, a commonly used polynomial reference function to represent the candidate models in GMDH is given in the form of

$y = a_0 + \sum_i a_i x_i + \sum_i \sum_j a_{i,j} x_i x_j + \sum_i \sum_j \sum_k a_{i,j,k} x_i x_j x_k + \cdots$  (2)

This functional series is known as the Kolmogorov-Gabor polynomial, and by its theorem it can represent any function $y = h(\bar{x})$ with its coefficient vector $\bar{a}$. The coefficient vector is exactly what GMDH aims at obtaining to formulate a linear or non-linear regression polynomial. Given the appropriate coefficients (calculated by GMDH), and substituting the values of the observation variables $\bar{x}$, a forecast result $y$ can be estimated from the polynomial.
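To make Eq. (2) concrete, the following Python sketch (our illustration, not code from the paper) evaluates the Kolmogorov-Gabor polynomial truncated at second order; the function name and the quadratic truncation are our own assumptions.

```python
import numpy as np

def kolmogorov_gabor_quadratic(x, a0, a_lin, a_quad):
    """Evaluate a second-order truncation of the Kolmogorov-Gabor polynomial.

    x      : 1-D array of the n input variables x_1..x_n
    a0     : scalar bias coefficient
    a_lin  : length-n array of linear coefficients a_i
    a_quad : n-by-n array of pairwise coefficients a_{i,j}
    """
    x = np.asarray(x, dtype=float)
    linear = np.dot(a_lin, x)          # sum_i a_i * x_i
    quadratic = x @ a_quad @ x         # sum_i sum_j a_{i,j} * x_i * x_j
    return a0 + linear + quadratic

# Example with two inputs and arbitrary (illustrative) coefficients.
y = kolmogorov_gabor_quadratic([1.5, -0.3],
                               a0=0.1,
                               a_lin=np.array([0.4, 0.2]),
                               a_quad=np.array([[0.05, 0.01],
                                                [0.01, 0.02]]))
```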

The two most popular learning algorithms that enable the evolution process in GMDH are called Combinatorial GMDH and Polynomial Neural Networks. The former algorithm is simply a combinatorial algorithm that scales up the power and the length of the polynomial function; it is generally called COMBI [4] in the literature. The other algorithm is also known as the Multilayer Iterative GMDH-type Neural Network, or MIA for short [5]. It is usually implemented with a feed-forward neural network with multi-layered bi-input neurons that evolve the complexity of the coefficients according to some neuron selection criteria.

A. COMBI Algorithm

The COMBI algorithm applies a combinatorial search of model components via variable selection. Typically, the least squares method is used recursively, from the simplest to the most complex model structure, as a data fitting method to obtain the coefficients for the polynomial function, such as Equation (2). As shown in Figure 1, the prerequisite for the algorithm to work in GMDH is the division of the dataset into a training portion and a testing portion. Several options are possible for partitioning the data, such as division by ratio (e.g., 70% training, 30% testing), or n-fold cross-validation (sometimes called rotation estimation), etc. In the case of time series forecasting, a slightly different method is used in which the subgroups of the original data are not randomly chosen, to preserve the sequential characteristics of an ordered list; instead, a learning window of size n is defined. Out of n, the training set and the testing set are defined by a ratio arbitrarily controlled by the user.

Each row contains an instance of an observation from the sample, composed of m attributes, and therefore m columns in the dataset. The outcomes, y, are already known in the observation samples, so they can be used in the training set to fit a regression-type formula that will find the coefficients in the polynomial function. In the testing set, they are used to evaluate the goodness-of-fit by identifying the errors in each attempt at model estimation. An example of such a data structure is shown in Figure 1, where we have nt records for training and n − nt records for testing; both constants n and nt can be set by the user.
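A minimal sketch of this sequential split, assuming a plain NumPy representation of the learning window; the function name, the window parameters, and the roughly 70%/30% example split are illustrative choices, not the paper's implementation.

```python
import numpy as np

def learning_window_split(Y, X, n, nt, start=0):
    """Cut one learning window of size n out of an ordered sample and split it
    sequentially (not randomly) into nt training rows and n - nt testing rows.

    Y : 1-D array of observed outcomes y_j
    X : 2-D array of shape (len(Y), m) holding the m attributes per observation
    """
    Y, X = np.asarray(Y), np.asarray(X)
    window_Y, window_X = Y[start:start + n], X[start:start + n]
    return (window_Y[:nt], window_X[:nt]), (window_Y[nt:], window_X[nt:])

# Illustrative use: a window of 24 rows, split 17/7, starting at the first row.
Y = np.random.rand(100)
X = np.random.rand(100, 3)
(train_y, train_x), (test_y, test_x) = learning_window_split(Y, X, n=24, nt=17)
```

Rolling the window forward (increasing `start`) reproduces the moving-window behaviour described above.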

The GMDH framework under which the COMBI algorithm operates is multi-layered. A candidate model is generated at each layer, moving from simple to complex.


When all of the candidate models at all of the layers are built, the one that has the lowest error is selected as an output model. If a rolling learning window strategy is used, this process is repeated for all of the sample subgroups. Again, the ultimate output model is the winner of all of the generated candidates among all trials. At the first layer, the $m$ candidate models, which are the simplest, each contain two coefficients $a_0$ and $a_1$, in the following form:

$\hat{y}_j = a_0 + a_1 x_i$ ,  (3)

where $i = 1, 2, \ldots, m$, $m$ is the maximum number of $x$ variables in $\bar{x}$, and $\hat{y}_j$ is the $j$th predicted outcome by this simple linear regression. The least squares method is used for solving the coefficients $a_0$ and $a_1$ for each of the $m$ models, by regarding the model as a system of Gauss's normal equations, which takes the following format:

$\begin{pmatrix} n_t & \sum_{j=1}^{n_t} x_{i,j} \\ \sum_{j=1}^{n_t} x_{i,j} & \sum_{j=1}^{n_t} x_{i,j}^2 \end{pmatrix} \times \begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{n_t} y_j \\ \sum_{j=1}^{n_t} x_{i,j} y_j \end{pmatrix}$ .  (4)

Figure 1. Data structure of the GMDH process.

As the value of $y_j$ is already known from the training samples, Equation (4) estimates the two coefficient values by a combinatorial search. A full combinatorial search for the coefficients that produce the least error is very time consuming. The polynomial terms should be limited to less than two dozen in most cases. When all of the possible polynomial models (with their coefficient values) have been calculated, their predicted outcomes are checked against the observed outcomes in the testing set. The model $\zeta$ that has the lowest regularity criterion, $\rho(\zeta)$, is retained. The regularity criterion that is used as a fitness function is defined as

$\rho(\zeta) = \frac{\sum_{j=n_t+1}^{n} (y_j - \hat{y}_j)^2}{n - n_t}$ .  (5)

$\rho(\zeta)$ is the average error in terms of the squared difference between the predicted and observed outcomes of model $\zeta$. From the initial layer, the model that contains the variables that yield the lowest error is permitted to scale up the polynomial series and generates the candidate models in the subsequent layer. The polynomial in the second layer takes the form of

$\hat{y} = a_0 + a_1 x_i + a_2 x_k$  (6)

$\hat{y} = a_0 + a_1 x_i + a_2 x_k + \cdots + a_m x_l$ ,  (6a)

where $i$ and $k = 1, 2, \ldots, m$. Again, the candidate models at the second layer are checked for compliance with the regularity criterion. The winner is then allowed to generate further candidate models in the third layer. The action repeats through the upper layers until the regularity criterion no longer decreases in value. The generalised format of the candidate models at the $m$th layer is given in Equation (6a), where $i, k, l = 1, 2, \ldots, m$. The final output model is then chosen from the competition among all of the fittest models from each layer. The selection criterion is the forecasting variance criterion $\delta$, i.e.,

$\delta(\zeta) = n \times \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{n} (y_j - \bar{y}_j)^2}$ ,  (7)

where $n$ is the number of samples in the learning window. When the predicted outcome $\hat{y}_j$ of the model $\zeta$ is applied to the $j$th observation, the mean value of the model outcomes, $\bar{y}_j$, is obtained.
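A small sketch of the two selection criteria as reconstructed above (Eqs. (5) and (7)); reading $\bar{y}_j$ as the mean of the model outcomes is our interpretation of the garbled original, so the denominator below should be treated as an assumption.

```python
import numpy as np

def regularity_criterion(y_test, y_pred_test):
    # rho(zeta), Eq. (5): mean squared difference between predicted and
    # observed outcomes over the n - nt testing rows.
    y_test, y_pred_test = np.asarray(y_test, float), np.asarray(y_pred_test, float)
    return np.sum((y_test - y_pred_test) ** 2) / len(y_test)

def variance_criterion(y, y_pred):
    # delta(zeta), Eq. (7): squared errors over the whole learning window,
    # scaled by n and normalised by the spread of the observations around the
    # mean of the model outcomes (one reading of the original definition).
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    n = len(y)
    return n * np.sum((y - y_pred) ** 2) / np.sum((y - y_pred.mean()) ** 2)
```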

B. MIA Algorithm

Note that the Kolmogorov-Gabor polynomial, which is inherited from a Volterra functional series [5], is unbounded and its length can potentially be extended to infinity. According to the recommendation in [5], Equation (2) can be approximated by a set of partial quadratic polynomials that consist only of a pair of variables, in the form

$\hat{y} = F(x_p, x_q) = a_0 + a_1 x_p + a_2 x_q + a_3 x_p x_q + a_4 x_p^2 + a_5 x_q^2$ .  (8)

The coefficients are estimated as in the COMBI algorithm, with the least squares method and regression techniques used to minimise the difference between the predicted outcomes and the actual outcomes. At the first layer, the information from the two independent variables contained in the quadratic function F(xp, xq) is used to compute the candidate models in the form of a neuron that has two inputs. Out of the available nt training samples, all possible enumerations of xp and xq are tested to build the regression polynomial that best fits the data, as in Equation (8). New layers are expanded with the number of neurons doubling at each new layer, growing as 1, 2, 4, 8, 16, etc. An exhaustive combinatorial search is applied to each neuron in the neural network, and the one that produces the most accurate output is chosen. Therefore, the fittest neurons in each layer are selected for input into the next layer, and different combinations of pairs are sent. As in a genetic algorithm, the evolution continues as the number of layers grows until no further improvement is found between the final two layers. The fittest neuron from the final layer is deemed to be the final solution. The evolution is finite (assured to converge) and the polynomial represented by the winning neuron is proclaimed optimally complex. This is possible because the testing data reveal over-fitting when overcomplicated models start to show instability, and thus their predictive power diminishes.
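The partial quadratic neuron of Eq. (8) can be fitted by ordinary least squares; the sketch below is a generic NumPy illustration of that fit, not the authors' implementation.

```python
import numpy as np

def fit_partial_quadratic(xp, xq, y):
    """Least-squares fit of the bi-input neuron of Eq. (8):
    y ~ a0 + a1*xp + a2*xq + a3*xp*xq + a4*xp^2 + a5*xq^2.
    Returns the six coefficients and a predictor function."""
    xp, xq, y = (np.asarray(v, dtype=float) for v in (xp, xq, y))
    A = np.column_stack([np.ones_like(xp), xp, xq, xp * xq, xp ** 2, xq ** 2])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(xp_new, xq_new):
        xp_new, xq_new = np.asarray(xp_new, float), np.asarray(xq_new, float)
        B = np.column_stack([np.ones_like(xp_new), xp_new, xq_new,
                             xp_new * xq_new, xp_new ** 2, xq_new ** 2])
        return B @ coeffs

    return coeffs, predict
```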

III. EXTENDED GMDH WITH ADDITIONAL INPUTS

In addition to using subgroups of the original time series data, as shown in Figure 1, to supply input variables for the formation of polynomials, this study proposes a novel approach that includes extra relevant input variables.


This has to be done with great care, because including redundant variables reduces accuracy, and long input vectors incur substantially large amounts of extra computation cost/time. Salient characteristics or statistically significant information should be built into predictive models to enhance their accuracy. However, only those that are relevant to the prediction target should be included. They can be identified using correlation analysis [6].

One common extra input argument in GMDH modelling is the inclusion of a lagged time series. A lagged data series is taken from the original time series, but the whole series is shifted behind by one or more steps in a time scale, like a projected shadow: $L(t_i) = x(t_i + \Delta t)$, where $L$ is the lagged (transformed) data, $x(t_i)$ is the variable data point at time $t_i$, and $\Delta t$ is the increment in time. The lagged data series offers supplementary information about the longitudinal pattern of the ordered series. The ideal degree of lag should be calculated from the cross- and auto-correlation functions that help reduce the input dimension and improve the modelling procedure [7]. Other researchers second this principle and extend the selection of input arguments by using clustering [8] and rank correlation [9].
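As an illustration of the lag transform and a simple autocorrelation-based choice of lag (our sketch; the paper relies on cross- and auto-correlation analysis [7] rather than this exact routine):

```python
import numpy as np

def lagged_series(x, lag=1):
    """Return the series shifted by `lag` steps so that row t of the output
    holds x[t - lag]; the first `lag` entries are undefined (NaN)."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    out[lag:] = x[:-lag]
    return out

def autocorrelation(x, lag):
    """Plain sample autocorrelation at the given lag, used here only to judge
    how informative a candidate lag is."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# Pick, from a small candidate set, the lag with the strongest autocorrelation.
x = np.sin(np.linspace(0, 20, 200)) + 0.1 * np.random.randn(200)
best_lag = max(range(1, 13), key=lambda k: abs(autocorrelation(x, k)))
```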

In our proposed method, relevant information in addition to lagged data is used as input arguments for the GMDH. This includes, but is not limited to, statistical information, training errors from past model training, and other transformations of the original sets. For neural networks, other researchers have used similar inter-related data metrics; inputs to neural networks are usually scalar features extracted from the data. Some researchers have advised [10] that the selection of these features should enable effective characterisation of the closed-loop operation of the network.

A. Residual Feedback GMDH

Out of many possibilities that incorporate additional input arguments into a GMDH model, residual feedback (RF) and one-step lag (Lag-1) are used in our case study. Other possibilities have been extensively evaluated in our experiments and we found that RF and Lag-1 produce the best results for the particular datasets being tested. In fact, RF and Lag-1 are often used as extra inputs in predictive regression-type neural networks for time series forecasting [11]. Lanitis [12] established a statistical appearance model for qualitative responses using residual information that improves the accuracy of image classification. Rossen [13] confirmed that the nonlinear transformation of lagged time series values and residuals systematically improves the average forecasting performance of simple autoregressive models. Other similar forecasting applications using residual information have shown higher accuracies in areas such as abnormality prediction in EEGs [14], and gold price prediction [15].

The rationale behind using RF is that the use of residuals directly compensates for the nonlinearity. Residuals are defined as the absolute deviations of the predicted values from the actual observation values. Usually the overall magnitude of the errors is greater in the testing dataset than in the training dataset. However, the distributions of the two types of errors should behave similarly. This implies that we can detect any abnormality by observing any notable difference between the residual distributions in the testing and training errors. The abnormality may be an indication of too few samples in either or both of the testing and training datasets, hence the high irregularity reflected in the residuals, or simply bias in the data distribution. Furthermore, the occurrence of over-training or over-fitting may become evident through a significant increase in the magnitude of the residuals from the testing dataset; this could be due to poor generalisation of the training model. Based on these phenomena, the problems in non-linear prediction models can be compensated for by using residual feedback to adjust the fault severity. Mathematically, the compensation for non-linearity faults using residual feedback has already been proven in [11] using a Lyapunov function.

Figure 2 shows an example of a neural network trained with three types of input arguments: values from the original time series, values from the lagged data series, and residual information. A fixed window of length nt is assumed to hold the training dataset, and a predicted output of the value point at one step ahead is generated by the neural network. The residuals are retained from previous training cycles, which correspond to the full original time series. The architecture of the neural network can be arbitrarily chosen. In this case, it is a feed-forward network with one input layer, one output layer, and multiple hidden layers, whose neurons are fully interconnected. This sample layout can be considered a highly non-linear generalisation of a linear regression model. The expansion of the polynomial regression equations is possible, and the equations are represented by the neural network, using the GMDH-MIA methodology. In this way, the distributions of these three influential sources for fitting the trending pattern are used in formulating the resultant polynomial equation for regressing an output.

Figure 2. An example of an RF-type of neural network.
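A hypothetical sketch of how the three input sources (original value, one-step lag, and retained residual) could be stacked into multivariate rows with the value one step ahead as the target; the function name and row layout are our assumptions rather than the paper's data format.

```python
import numpy as np

def build_rf_inputs(series, residuals, lag=1):
    """Stack the three input sources used by the residual-feedback network:
    the current value, its lagged value, and the residual retained from the
    previous training round. Rows with an undefined lag are dropped; the
    target is the series value one step ahead."""
    series = np.asarray(series, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    X, y = [], []
    for t in range(lag, len(series) - 1):
        X.append([series[t], series[t - lag], residuals[t]])
        y.append(series[t + 1])          # one-step-ahead prediction target
    return np.array(X), np.array(y)
```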

B. Our Proposed Methodology

The forecasting methodology that we propose includes three phases: conversion of the original time series from univariate format to multivariate format; feature selection; and GMDH modelling using either COMBI or MIA. The first two phases are interrelated, as feature selection nominates only the relevant variables to form a multivariate series.


In our experiments, it is implemented using a correlation-based selection test. For instance, a positive Pearson value of 0.91 is recorded from a correlation test between the variables Magnitude and Residual. It shows that they are strongly correlated; hence, inputting Residual information into the training model helps to improve the prediction of Magnitude. In Figure 3, a contour plot of the three dimensions, Year, Magnitude and Residual, of the earthquake dataset (which will be described later) reveals a few noticeable observations about their relations. First, the residual level zero lies at about magnitude 7.927 for almost every year until the turn of the millennium. This means the model produces the most accurate predictions for occurrences of earthquakes of magnitude 7.927. The region below the zero-level line suggests that the model is over-forecasting earthquakes of relatively small magnitudes. The upper region shows the number of large earthquakes that are being under-forecast. These are data-induced high residuals (forecast errors). This phenomenon of under-forecasting earthquakes has emerged since 1986. This could be explained by the unprecedented number of strong earthquakes that have occurred in recent years – the model, which was trained with a long series of historical data, fails to forecast this increase.

Figure 3. Contour diagram of the variables, Year, Magnitude & Residual.
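The correlation-based selection test described above can be sketched as follows; the 0.5 threshold and the dictionary interface are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def select_by_pearson(candidates, target, threshold=0.5):
    """Keep only candidate input columns whose absolute Pearson correlation
    with the prediction target exceeds the threshold.

    candidates : dict mapping a column name to a 1-D array
    target     : 1-D array (e.g., earthquake magnitude)
    """
    target = np.asarray(target, dtype=float)
    selected = {}
    for name, col in candidates.items():
        col = np.asarray(col, dtype=float)
        r = np.corrcoef(col, target)[0, 1]   # Pearson correlation coefficient
        if abs(r) > threshold:
            selected[name] = r
    return selected
```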

While the feature selection process filters off irrelevant variables, the data preparation process assembles the relevant variables alongside the original time series. The extra variables and their temporally ordered values are aggregated into several columns, chronologically indexed by a time scale.

Except for the first round, the residuals are inherited from each previous session of GMDH training. They are in turn fed back into the data preparation process that produces a multivariate dataset for the subsequent training sessions. For the initial round of training, as there are not yet any residuals available from the GMDH model, the early residuals are obtained from one of the autoregressive algorithms, e.g., ARIMA. As ARIMA is extensively used and widely accepted in time series analysis, it is a useful boot-up technique for generating reliable residuals.
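A minimal sketch of this boot-up step using the statsmodels ARIMA implementation; the (1, 1, 1) order is an arbitrary illustrative choice, whereas the paper lets an automated solver (Oracle Crystal Ball) pick the smoothing model instead.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def bootstrap_residuals(series, order=(1, 1, 1)):
    # Fit an ARIMA model and return its in-sample residuals, which play the
    # role of R1 before GMDH training has produced any residuals of its own.
    series = np.asarray(series, dtype=float)
    fit = ARIMA(series, order=order).fit()
    return fit.resid
```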

The final process, the GMDH modelling, constructs the GMDH network. From the aggregated multivariate dataset, all combinations of the system input variables X = [(x1,1, x1,2, ..., x1,nt), (x2,1, x2,2, ..., x2,nt), …, (xj,1, xj,2, ..., xj,nt)], where j is the number of variables and nt is the length of the learning window, are enumerated. The data from the lagged time series and the residuals are then input into the first layer of the GMDH neural network. Using the least squares regression method, the coefficients at each neuron node are derived. The fittest nodes, with the least errors among the outputs of this layer, are chosen as inputs into the second layer; all combinations of distinct pairs are transferred up, forming a new layer. This action repeats for each training cycle until the current layer has no node that performs better than the previous one. Then, the best neuron at the previous layer is selected as the final solution – the optimally complex polynomial that offers the highest accuracy. The workflow of this methodology is shown in Figure 4.

Figure 4. The Operational Flow of the Residual-Feedback GMDH Method.
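For orientation, here is a deliberately simplified sketch of the layer-by-layer growth described above (our approximation of GMDH-MIA, not the authors' code); the `width` of surviving neurons per layer and the stopping rule are simplifications.

```python
import numpy as np
from itertools import combinations

def _fit_quadratic_pair(xp, xq, y):
    # Least-squares fit of y ~ a0 + a1*xp + a2*xq + a3*xp*xq + a4*xp^2 + a5*xq^2 (Eq. 8).
    A = np.column_stack([np.ones_like(xp), xp, xq, xp * xq, xp ** 2, xq ** 2])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda u, v: np.column_stack(
        [np.ones_like(u), u, v, u * v, u ** 2, v ** 2]) @ coeffs

def gmdh_mia(X_train, y_train, X_test, y_test, width=4, max_layers=5):
    """Greedy layer-by-layer growth: fit a quadratic neuron for every pair of
    surviving inputs, keep the `width` neurons with the lowest testing error,
    and stop once the best testing error no longer improves."""
    X_train, X_test = np.asarray(X_train, float), np.asarray(X_test, float)
    y_train, y_test = np.asarray(y_train, float), np.asarray(y_test, float)
    cols_tr = [X_train[:, j] for j in range(X_train.shape[1])]
    cols_te = [X_test[:, j] for j in range(X_test.shape[1])]
    best_err, best_pred = np.inf, None
    for _ in range(max_layers):
        scored = []
        for i, j in combinations(range(len(cols_tr)), 2):
            f = _fit_quadratic_pair(cols_tr[i], cols_tr[j], y_train)
            out_tr, out_te = f(cols_tr[i], cols_tr[j]), f(cols_te[i], cols_te[j])
            err = np.mean((y_test - out_te) ** 2)   # regularity criterion (Eq. 5)
            scored.append((err, out_tr, out_te))
        if not scored:
            break
        scored.sort(key=lambda s: s[0])
        if scored[0][0] >= best_err:
            break                                   # no further improvement: stop
        best_err, best_pred = scored[0][0], scored[0][2]
        survivors = scored[:width]
        cols_tr = [s[1] for s in survivors]         # neuron outputs feed the next layer
        cols_te = [s[2] for s in survivors]
    return best_err, best_pred
```

The returned predictions correspond to the fittest neuron of the last layer that still improved the testing error.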

IV. EXPERIMENT

The primary objective of the empirical experiment is to fit the best possible forecasting model, using the extended GMDH method, for two popular natural disaster scenarios: an upcoming large-scale global earthquake, and precipitation modelled from past ground ozone data. The secondary intent is to evaluate the performance of the GMDH model with different settings and to verify that applying residual feedback improves the GMDH model.

The performance criteria adopted in this experiment are similar to other time series forecasting experiments, where the following error measurements are used as accuracy indicators: MAE, RMSE, NMAE, NRMSE, MAPE and RMSPE. These errors are commonly used in forecasting, and therefore their definitions are not reproduced here. In general, smaller error values mean higher accuracy in the forecasting model in question. In addition to measuring the accuracy of the output of the model, it is important to assess the quality of the forecasting model. The criterion metric, or criterion index (CI), for selecting the so-called best model is computed using the error bias criterion. The reasoning behind the use of bias criteria is that, when using the best model, the errors in estimating the model parameters should be minimally different between performances over the training dataset and the testing dataset. The basic form of the error bias criterion is defined as


$CI = ER_X = \left| SR_{X|Train} - SR_{X|Test} \right| = \left| SR_{X|X(1..n_t)} - SR_{X|X(n_t+1..N)} \right|$  (9)

where $SR_{X|Train}$ and $SR_{X|Test}$ are the summarised errors over the whole dataset $N$, which is divided into a training dataset and a testing dataset such that $X = Train \cup Test$. The model under evaluation carries the same structure, with coefficients and components estimated on the two datasets, training and testing, accordingly:

$SR_{X|Train} = \left| y_X - h_{Train}(\bar{x}) \right|^2 = SR_{Test|Train} + RSE_{Train}$  (10)

$SR_{X|Test} = \left| y_X - h_{Test}(\bar{x}) \right|^2 = SR_{Train|Test} + RSE_{Test}$ ,  (11)

where the residual sums of squares errors over the training dataset and testing dataset are $RSE_{Train}$ and $RSE_{Test}$, respectively.
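A compact sketch of the accuracy indicators and the error bias criterion; the normalisation used for NMAE/NRMSE is an assumption on our part, since the paper does not define it.

```python
import numpy as np

def error_measures(y, y_hat):
    # Accuracy indicators used in the experiment. NMAE/NRMSE are normalised by
    # the range of the observed values here, which is one common convention;
    # the paper does not spell out its normalisation. MAPE/RMSPE assume no
    # zero-valued observations.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    e = y - y_hat
    mae, rmse = np.mean(np.abs(e)), np.sqrt(np.mean(e ** 2))
    span = y.max() - y.min()
    return {"MAE": mae, "RMSE": rmse,
            "NMAE": mae / span, "NRMSE": rmse / span,
            "MAPE": 100 * np.mean(np.abs(e / y)),
            "RMSPE": 100 * np.sqrt(np.mean((e / y) ** 2))}

def criterion_index(sr_train, sr_test):
    # Error bias criterion of Eq. (9): absolute difference between the
    # summarised errors on the training and testing partitions.
    return abs(sr_train - sr_test)
```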

In the first case study, the forecasting model aims at predicting the occurrence of very large earthquakes in coming years. More specifically, the question is, 'When will the next mega earthquake hit us?' The data source is the website of the United States Geological Survey's (USGS) Earthquake Hazards Program. The earthquake records from 1 January 1973 to March 2012 were downloaded in raw text format, with each row representing an earthquake event somewhere in the world. The total number of records for this period is 663,852, with a mean magnitude of 3.8674, a minimum of 0.1, a maximum of 5.3, and a standard deviation of 0.9898. From the annual earthquake records, only the maximum is chosen from the dataset; the rest are eliminated. The forecasting model that we aim to build is focused on predicting the maximum quakes. Therefore, the desired prediction is the earthquake of maximum magnitude per year, in future years. This is done by first training the forecasting model with the history of earthquakes of maximum magnitude per year and then testing the model by speculating when a mega earthquake (e.g., magnitude 10) will strike. The occurrence of a mega earthquake is expected, as the time series has shown a steadily increasing trend over the years. For each GMDH method, a forecast period of seven years is defined. From the projection across the seven years (on the x-axis) we cross-check the corresponding magnitudes (on the y-axis) and note any significantly and unprecedentedly strong earthquake and its year of occurrence. In other words, the model predicts the magnitude values for seven years ahead, and we pick the soonest year that has a strong-magnitude earthquake. We choose 10 as the level of a very devastating earthquake. Because of the rarity of the events, a very long training sample is desired, as much longer verification periods are needed to obtain reliable verification statistics. For extremely rare events (e.g., a magnitude 9 earthquake only strikes once a century), even historical records are too short to allow a meaningful verification of the forecasts. The performance benchmark here, therefore, is the accuracy of the fitting model on the full set of training data.
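A hypothetical pandas sketch of the preprocessing that keeps only the maximum magnitude per year; the column names ("time", "mag") are assumptions about the raw USGS export, not the paper's exact field names.

```python
import pandas as pd

def annual_max_magnitude(csv_path):
    events = pd.read_csv(csv_path, parse_dates=["time"])
    events["year"] = events["time"].dt.year
    return events.groupby("year")["mag"].max()   # one value per year: the largest quake

# yearly_max = annual_max_magnitude("usgs_events_1973_2012.csv")
```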

According to the workflow shown in Figure 4, the univariate time series first has to be converted to multivariate data. The original time series is modelled by ARIMA and its variants. The idea is to use the best possible smoothing-type (or moving-average-type) forecasting algorithm to generate the first batch of residuals, prior to GMDH modelling. A copy of lagged data is also generated from the original time series by taking a simple lag transformation. Choosing different parameters gives very different output results in ARIMA algorithms. An optimised solver, powered by the commercial software Oracle Crystal Ball (Release 11, 32-bit), is used for this purpose. The software automatically tests all of the possibilities and retains only the parameter values that yield the optimal results. In this study, Crystal Ball is used to identify the best smoothing-type algorithm to produce the initial residuals to be used in subsequent GMDH modelling. The univariate data are then expanded to multivariate data with the inclusion of the Lag-1 data and the Residual data.

In GMDH training, only the residuals obtained from the ARIMA algorithm are included initially. After one round of GMDH training, another set of residuals produced by GMDH is available. We now have two available sets of residuals, called R1 (those obtained from ARIMA) and R2 (those obtained from GMDH), respectively. Although it is technically possible to continuously repeat the training process and generate further generations of residuals by repeating the GMDH, our experiment found by trial-and-error that the generation of residuals from GMDH subsequent to the first round attains no further improvement. Therefore, in this experiment, only R1 and R2 are used. Consequently, we have the following different forms of the datasets available, inherited from the original time series plus additional input arguments that are derived along the way. Table 1 shows the various combinations of datasets to be evaluated under different forms of GMDH algorithms. 10-fold cross-validation is used to ensure the model is sufficiently tested. The acronyms are defined as follows.

U-C: Univariate time series data, to be computed by the GMDH COMBI algorithm (C is for COMBI).
U-N: Univariate time series data, to be computed by the GMDH MIA algorithm (N is for Neural Network).
M-C-R1: Multivariate time series data, with R1 added in, to be computed by the GMDH COMBI algorithm.
M-N-R1: Multivariate time series data, with R1 added in, to be computed by the GMDH MIA algorithm.
M-C-R2: Multivariate time series data, with R2 added in, to be computed by the GMDH COMBI algorithm.
M-N-R2: Multivariate time series data, with R2 added in, to be computed by the GMDH MIA algorithm.
M-C-R1R2: Multivariate time series data, with both R1 and R2 added in, to be computed by the GMDH COMBI algorithm.
M-N-R1R2: Multivariate time series data, with both R1 and R2 added in, to be computed by the GMDH MIA algorithm.

The comparisons of the forecast performance measures and model quality for the different combinations of dataset arguments are shown in Table 1. The forecast results of the different GMDH methods are shown in Table 2. The results are given in the dual forms of residual errors and the predicted year of the next strong earthquake. The performances in terms of MAPE of the different GMDH methods are presented graphically in Figure 5. The output diagrams of model fitting and forecast results by the different GMDH methods are shown in Figures 8 to 15, in the form of line charts. Results shown in Figures 8 to 11 are generated by the GMDH COMBI method, and those in Figures 12 to 15 are computed by the GMDH MIA method.

The second case study is the forecast of precipitation given a history of ground ozone data. The dataset is from [16], which originally studied the binary classification of ozone days versus normal days from chemical samples collected in past ground ozone data. The prediction target is changed to precipitation in our study because the heavy rainfalls observed in the observation period represent rare events. There are two reasons for the choice of this dataset. First, it stress-tests our extended model, as the data have a large number of numeric attributes (73) and a numeric forecast target, but a relatively small volume of records (2,536) available for training the model. This represents a difficult multivariate forecasting problem even for multiple regression. As an extra challenge, the data distribution is skewed, biased, and stochastic. Second, the data contain only a handful of rare events, outliers that fall beyond two standard deviations from the mean. Fitting a forecast model on such data is extremely difficult, let alone accurately predicting their future values.

V. RESULTS AND DISCUSSION

The purpose of the case studies is to evaluate the performance of the different GMDH methods, including those using residual feedback. The performances are thoroughly assessed using the error measures of the fitting curve, the complexity of the trained model, and the residual errors from the results.

TABLE I. PERFORMANCE COMPARISON OF DIFFERENT GMDH METHODS IN ERROR MEASURES AND MODEL QUALITY: EARTHQUAKE DATA

As can be observed in Table 1, all of the error measures show a consistent drop when the GMDH method is strengthened with R1, R2 or both. The order of performance improvement is R1+R2 > R2 > R1 > Nil. In general, the Neural-type (MIA) of GMDH is better than the Combinatorial-type (COMBI) of GMDH. This trend, shown in Figure 6, reinforces the belief that MIA is more suitable for training a forecasting model with residual feedback inputs. The criterion value shows a similar trend too, suggesting that not only is MIA better than COMBI, but the quality of the forecasting model improves as good residuals are included. The model complexity shows that in the MIA method the neural network needs six neurons in all situations, and eight if R1 and R2 are both used. In the COMBI method the resultant polynomial only needs three coefficients, except in the simplest case when no residual feedback is used.

TABLE II. OUTPUT COMPARISON OF DIFFERENT GMDH METHODS IN RESIDUAL ERRORS AND PREDICTION RESULTS FOR EARTHQUAKE DATA

In Table 2, the mean residual errors are relatively low: –0.2367 at the worst and 0.1733 at the best. The variation of the residual errors in terms of standard deviation is also low, ranging from 0.11 to 0.1998. This shows that the trained model is fairly stable and the output results, as reflected by the residuals, have a high accuracy. Like the models, their performance improves with R1 and R2 feedbacks. The results suggest that, given the wide variety of possible ways to configure GMDH, it is best to choose the one that produces the lowest error. In this case, the best forecast method in Table 2 is M-N-R1R2, which predicts that an earthquake of magnitude 12.5266 will occur after four years. This is merely a forecast and the actual outcome will depend on many complex and unknown factors.

Figure 5. Performance comparison of different GMDH methods in MAPE.

Figure 6 depicts the magnitude time series, and the residuals R1 and R2 from the case of M-N-R1R2. The curve of R1 follows very closely the magnitude time series, which means the Double Exponential Smoothing (DES) algorithm selected by Oracle Crystal Ball managed to come up with an almost perfect match for fitting the forecast curve prior to GMDH. DES performed at MAPE=2.93%, RMSE=0.3, and MAE=0.2. The curve of R2, which is the residuals produced by GMDH after one round of model training, displays a relatively flat (very low fluctuation) line. This line seems to be gradually declining in terms of absolute errors; in fact, near the end of the samples the value is nearly zero. This implies that the GMDH model is very well trained, and could hardly be further improved. In the subsequent, final, round of training, the final residual or error amounts to 0.0148.

Figure 6. Comparison of magnitude time series, R1 and R2.


Further examination of the relation between the magnitudes (in the original time series) and the residuals R1 and R2 reveals that the lines of magnitudes, R1 and R2 lie together on the same scale (Figure 7). One can see that the initial fitting model by DES under-forecast the actual values in the early part of the time period. It was not until the 9.1 magnitude earthquake of December 2004, which occurred in the Indian Ocean near Sumatra, that the trend shifted. Since that time, this fitting curve has over-estimated the subsequent time series values. The subsequent round of model training by GMDH rectifies the errors by compensating with the R1 residuals, which are a type of a priori knowledge. The compensating effect is evident by the R2 stage, which shows a gradual fall in the error scale.

Figure 7. Comparison of magnitude time series, R1 and R2 in log scale.

These forecast results, which use different GMDH methods to achieve better curve fitting, demonstrate several prominent phenomena. The fitting curve based on the univariate time series without any residual feedback approximates very poorly the actual values (Figures 9 and 13). When the first residual, R1, is deployed, the trend is properly modelled; the curve value approximation is better using the MIA method than the COMBI method as shown in Figure 9 and Figure 13, respectively. This justifies the use of residual feedback in the GMDH training, even though the initial residuals are derived from a separate type of forecasting method. After the first round of GMDH training, a new set of residuals, R2, are obtained and used as composite feedback to the subsequent round of model training. As the results show, the fitting of the curve generated by the model is significantly improved when the second residual R2 is used as feedback in the GMDH training. Again, the MIA method in Figure 14 achieves a more accurate curve approximation than the COMBI method in Figure 10. In the final round of training, both R1 and R2 are used together, with the aim of compensating for errors in different areas. In Figures 11 and 15, the differences between the fitted curve and the actual one are very minimal; they closely follow the actual values, and both are able to establish a future trend in the forecast period.

Residual feedback indeed improves a GMDH model. The results from all of the GMDH methods outperform DES, which is the best algorithm chosen by the solver. At a MAPE of 2.93%, DES achieved its best possible performance. The MAPEs produced by the GMDH methods in our experiment range from 2.878% down to 2.115% (cf. Table 1). The environment and parameter settings for the experiment on the second case, the precipitation forecast, remain the same. Due to the space limit, only the final results are shown as follows.

TABLE III. PERFORMANCE COMPARISON OF DIFFERENT GMDH METHODS IN ERROR MEASURES AND MODEL QUALITY: PRECIPITATION DATA

Similar to the earthquake case, augmenting residuals indeed improves the accuracy of the forecast model, especially the CI, the indicator of model quality. However, in contrast to the earthquake dataset, where only the highest magnitude of earthquake per year was used and the rest were trimmed, this dataset is more complex: it retains the full set of precipitation values (of all magnitudes), and the dataset is inherently high dimensional because it has 17 original features, plus up to 3 extra features – Lag, R1 and R2. The results shown in Figures 18 and 19 indicate that when residual information is fed into the GMDH, rare events can be revealed (those red-circled). Assume that the events exceeding a level of 4 are of interest and those below it are common. With the aid of one set of residuals, R2, the model started to pick up a few rare events, as shown in Figure 17. When both R1 and R2 are used, the model is capable of recognising more rare events, as in Figure 18. The corresponding residuals resulting from the model (M-N-R1R2) of Figure 18 are shown in Figure 19. One can see that most of the forecast values fall within the upper and lower limits of the second standard deviation. Occasionally it still under-forecasts the rare events, as shown by the large negative residuals. By comparison, COMBI falls short of accurately modelling the rare events, as shown in Figure 16, despite R1 and R2 being used. This may be due to the inherent nature of polynomial equations, which are comparatively less suited to modelling high-dimensional data.

VI. CONCLUSION

Accurate forecasting of rare events is a challenging problem that has many important practical applications. Due to the nature of rare events, appropriate datasets for the construction of statistical and/or machine learning models are often very limited and incomplete. Regression-based linear forecasting models do not produce reliable results due to insufficient samples of rare event occurrences, and too much noise in the data. The neural network approach is a potential candidate, but its performance is sensitive to the choices of many configuration parameters. Other existing rare event modelling techniques such as extreme value theory only work on univariate data. These approaches are not easily extended to multivariate data.

In this study, the GMDH method was extended, allowing the complexity of a neural network to be incrementally scaled up until it reached an optimal structure. The end result was a forecasting model, in the form of polynomial regression equations or a neural network, that is optimally complex and accurate and reflects the choices of the user. The extension was a feedback mechanism that reuses the residual errors taken from each previous round of model training; the feedback signals compensate for the errors made in the previous round of model training.


Our experiment showed that GMDH with residual feedback indeed improves the accuracy of the forecasting model, by means of better curve fitting with relatively low errors. Our model outperformed the best forecasting algorithm candidate selected by the Oracle Crystal Ball predictor, which automatically chooses the model with the lowest error rate. We showed that the neural network type of GMDH yields better results than the combinatorial type of GMDH when using residual feedback as self-correcting signals in model training.

REFERENCES

[1] Ivakhnenko, A.G., (1971) ‘Polynomial theory of complex systems’, IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-1, No. 4, pp. 364-378.

[2] Sarka, D., (2009) ‘The ART (and ARIMA) of forecasting’, Technical Presentation, Solid Quality Mentors, http://blogs.solidq.com/dsarka/home.aspx, Last accessed on April 23.

[3] Ivakhnenko, A.G., (1970) ‘Heuristic self-organization in problems of engineering cybernetics’, Automatica Vol. 6, pp. 207–219.

[4] Ivakhnenko, A.G. and Zholnarskiy, A.A., (1992) ‘Estimating the coefficients of polynomials in parametric GMDH algorithms by the improved instrumental variables method’, Journal of Automation and Information Sciences c/c of Avtomatika, Vol. 25, No. 3, pp. 25-32.

[5] Sarychev, A.P., (1984) ‘A multilayer nonlinear harmonic GMDH algorithm for self-organization of predicting models’, Soviet Automatic Control c/c of Avtomatika, Vol. 17, No. 4, pp. 90-95.

[6] Duffy, J.J. and Franklin, M.A. (1975) 'A learning identification algorithm and its application to an environmental system', IEEE Transactions on Systems, Man and Cybernetics, Vol. SMC-5, No. 2, pp. 226-240.

[7] Kozubovskiy, S.F., (1986) ‘Determination of the optimal set of lagging arguments for a difference predicting model by correlation analysis’, Soviet Journal of Automation and Information Sciences c/c of Avtomatika, Vol. 19, No. 2, pp. 77-79.

[8] Karnazes, P.A. and Bonnell, R.D. (1982) 'Systems identification techniques using the group method of data handling', in Proceedings of the 6th Symposium on Identification and System Parameter Estimation, Vol. 1, Washington DC, Oxford, International Federation of Automatic Control, Pergamon, pp. 713-718.

[9] Kendall, M. and Gibbons, J.D. (1990) Rank Correlation Methods, Edward Arnold, London.

[10] Loquasto, F. and Seborg, D.E. (2003) ‘Monitoring model predictive control systems using pattern classification and neural networks’, Journal of Industrial and Engineering Chemistry Research, Vol. 42, Oct., pp. 4689-4701.

[11] Narasimhana, S., Vachhanib, P. and Rengaswamya, R. (2008) ‘New nonlinear residual feedback observer for fault diagnosis in nonlinear systems’, Automatica, Vol. 44, Issue 9, September 2008, pp. 2222–2229.

[12] Lanitis, A. (2003) ‘Building statistical appearance models using residual information’, in Proceedings of 9th Panhellenic Conference in Informatics, Thessaloniki, Nov., pp. 547-559.

[13] Rossen, A. (2011) ‘On the predictive content of nonlinear transformations of lagged autoregression residuals and time series observations’, HWWI Research Papers, No. 113, Hamburg Institute of International Economics, pp. 1-24.

[14] Schetinin, V. (2003) ‘A learning algorithm for evolving cascade neural networks’, Journal of Neural Processing Letters, Vol. 17, Issue 1, pp. 21-31.

[15] Varahrami, V. (2011) ‘Recognition of good prediction of gold price between MLFF and GMDH neural network’, Journal of Economics and International Finance Vol. 3, No. 4, April 2011, pp. 204-210.

[16] Zhang, K. and Fan, W. (2006) 'Forecasting skewed biased stochastic ozone days: analyses and solutions', in Proceedings of the IEEE International Conference on Data Mining (ICDM 2006), pp. 753-764.

Figure 8. COMBI on earthquake time series with no feedback.

Figure 9. MIA on earthquake time series with no feedback.

Figure 10. COMBI on earthquake time series with R1 feedback.

Figure 11. MIA on earthquake time series with R1 feedback.


Figure 12. COMBI on earthquake time series with R2 feedback.

Figure 13. MIA on earthquake time series with R2 feedback.

Figure 14. COMBI on earthquake time series with R1 and R2 feedback.

Figure 15. MIA on earthquake time series with R1 and R2 feedback.

Figure 16. COMBI on precipitation time series with R1 and R2 feedback.

Figure 17. MIA on precipitation time series with R2 feedback.

Figure 18. MIA on precipitation time series with R1 and R2 feedback.

Figure 19. Residuals of precipitation time series with R1 and R2 feedback.
