
A Recursive Local Polynomial Approximation Method

using Dirichlet Clouds and Radial Basis Functions

Arta A. Jamshidi and Warren B. Powell

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

e-mail: {arta,powell}@princeton.edu

Abstract

We present a recursive function approximation technique that does not require the storage of the arriving data stream. Our work is motivated by algorithms in stochastic optimization that must approximate functions in a recursive setting, such as stochastic approximation algorithms. This combination of features is essential for nonlinear modeling of large data sets, where storing the data becomes prohibitively expensive, and in settings where our knowledge about a given query point increases as new information arrives. The algorithm presented here provides locally adaptive parametric models (such as linear models). The local models are updated using recursive least squares, and only a statistical representation of the local approximations is stored. The resulting scheme is fast and memory efficient without compromising accuracy in comparison to standard and several advanced techniques used for functional data analysis in the literature. We motivate the algorithm using synthetic data and illustrate it on several real data sets.

1 Introduction

There are three major classes of function approximation methods: look-up tables, parametric models (linear or nonlinear), and nonparametric models. Parametric regression techniques (such as linear regression [34]) assume that the underlying structure of the data is known a priori and lies in the span of the regressor functions. Because of its simplicity, this approach is commonly used for regression. Nonparametric models [14, 36, 55] do not assume a specific structure underlying the data. Nonparametric methods use the raw data to build local approximations of the function, producing a flexible but data-intensive representation. Although nonparametric models are generally data hungry, the resulting approximation may be more accurate [19, 26, 53] and is less sensitive to structural errors arising from a parametric model. Most nonparametric models require keeping track of all observed data points, which makes function evaluations increasingly expensive as the algorithm progresses, a serious problem in stochastic search. Regression based on least squares, Lasso regression, ridge regression, and least absolute deviation all share the underlying structure of ordinary least squares, with slightly modified objective/penalty terms. In this work we formulate a problem that balances parametric and nonparametric models to benefit from the advantages of both modeling strategies. Semi-parametric models are discussed in [51].

Our work is motivated by the need to approximate functions within stochastic search algorithms where new observations arrive recursively. The most classical representation of the problem is given by

$$\min_x \mathbb{E}\, F(x, \nu),$$

where $\mathbb{E}$ denotes the expectation operator, $x$ is a deterministic parameter and $\nu$ is a random variable. Other applications include approximate dynamic programming, where we need to approximate expectations of value functions. Since we are unable to compute the expectation directly, we depend on the use of Monte Carlo samples. We are interested in the class of algorithms which replaces $\mathbb{E}\, F(x, \nu)$ with an approximation $F^n(x)$ that can be quickly optimized. As we obtain new information from each iteration, we need a fast and flexible method for updating the approximation $F^n(x)$. We require an approximation method that is more flexible than classical parametric models offer, but with a fast, compact representation to minimize computational overhead.

Bayesian techniques for function approximation or regression are computationally intensive and require storage of all the data points [19]. Updating the model requires revisiting all the data points in the history. These limitations make such methods impractical when the model must be updated as a stream of new information arrives and in situations requiring fast algorithms with limited storage. In addition, these methods have multiple tunable parameters. There is a rich literature on nonparametric Bayesian methods for regression [24]. In spirit, our work is closest to [52], which proposes a model that mixes over both the covariates and the response, with the response drawn from a multinomial logistic model; this work handles various response shapes. Dirichlet process mixtures of generalized linear models (DP-GLM), proposed in [24], generalize this idea to various response types. Other statistical techniques widely used for regression include regression trees [4], where the data are divided into a fixed, tree-based partitioning and a regression model is fit to the data in each leaf of the tree, and Gaussian processes, which assume the observations arise from a Gaussian process model with known covariance function. The latter does not handle variations in the variance of the response unless a Dirichlet process mixture of GPs is assumed [46].

Locally Weighted Scatterplot Smoothing (LOWESS) [11, 12, 13] does not require a global function to represent the underlying structure. This procedure builds a model from segments of data chosen by nearest neighbors: at each point it fits a low-degree polynomial by weighted least squares to the segment of data closest to the point of interest, assigning more weight to points nearer the query point. This method is computationally intensive and keeps track of all the data points. In addition, it does not incorporate an updating mechanism as new data points arrive.

Locally linear modeling [15, 16] is a weighted locally linear regression method that has advantages over standard kernel smoothing regression methods such as [56] and [38]. This technique builds linear models around each observation and keeps track of all the data points. In addition, it does not provide a global mathematical formula for the regression function. It performs better than spline methods [15].

The multilayer perceptron and the associated back-propagation algorithm [57, 50], and radial basis functions [5, 42, 41], have received considerable attention for function approximation. Back-propagation networks require a large amount of training data and training time and as a result are quite slow. One of the main attractions of radial basis functions (RBFs) is that the resulting optimization problem can be broken efficiently into linear and nonlinear subproblems. An automatic function approximation technique using RBFs that does not need ad-hoc parameter tuning is proposed in [29]; the multivariate extension of this technique is proposed in [31]. For a comprehensive treatment of various growing RBF techniques, see [29] and references therein. The issue of selecting the number of basis functions with growing and pruning algorithms from a Bayesian perspective is described in [27]. In [1], a hierarchical fully Bayesian model for radial basis functions is proposed. Normalized RBFs, which perform well when few training data are available, are presented in [32].

We seek a general purpose approximation method that can approximate a wide range of functions using a compact representation which can be updated recursively with fast function evaluations. Toward this goal, we propose a novel recursive and fast approximation scheme for modeling nonlinear data streams that does not require storage of the data history. The covariates may be continuous or categorical, but we assume that the response function is continuous, although not necessarily differentiable. The new method is robust in the presence of homoscedastic and heteroscedastic noise. Our method automatically determines the model order for a given data set and produces an analytical formulation describing the underlying function. Unlike similar algorithms in the literature, our method has one tunable parameter. Our proposed scheme introduces a cover over the input space to define local regions. The scheme stores only a statistical representation of the data in each local region and locally approximates the data with a low-order polynomial such as a linear model. The local model parameters are quickly updated using recursive least squares. A nonlinear weighting system is associated with each local model and determines the contribution of that local model to the overall model output. The combined effect of the local approximations and their associated weights produces a nonlinear function approximation tool. On the data sets considered in this study, our new algorithm provides superior computational efficiency, both in terms of computational time and memory requirements, compared to existing nonparametric regression methods, without loss of accuracy. Test results on various synthesized and real multivariate data sets are reported and compared with nonparametric regression methods in the literature. In this work we attempt to bridge several fields in statistics, approximation theory, and numerical analysis concerned with function approximation.

The organization of this paper is as follows. Section 2 provides an overview of radial basis functions and motivates the use of normalized radial basis functions with polynomially modulated response terms. Section 3 introduces the various aspects of the proposed algorithm, including the notion of a cloud cover for the input space, the model building procedure and the updating rules; this section also provides the stopping criterion for the algorithm and the procedure for measuring goodness of fit. Section 4 establishes the convergence properties of the proposed algorithm in finite time and asymptotically. Section 5 demonstrates the performance and robustness of the algorithm using various synthetic and real data sets with different input dimensions and noise levels, and compares the new method with available algorithms in the literature. Section 6 provides concluding remarks and discusses future work.

2 Normalized Radial Basis Functions

Radial basis functions are powerful tools for function approximation [5, 7, 43]. Over the years RBFs have been used successfully to solve a wide range of function approximation problems (see, e.g., [3]). RBFs have also been used for global optimization (see, e.g., [48, 49, 23, 28]). An RBF expansion is a linear summation of special nonlinear basis functions. In general, an RBF is a mapping $f : \mathbb{R}^n \rightarrow \mathbb{R}$ represented by

$$f(x) = \sum_{i=1}^{N_c} \alpha_i \, \phi(\|x - c_i\|_{W_i}), \qquad (1)$$

where $x$ is an input pattern, $\phi$ is the radial basis function centered at location $c_i$, and $\alpha_i$ denotes the weight of the $i$th RBF. $N_c$ denotes the total number of RBFs. The term $W_i$ denotes the parameters of the weighted inner product

$$\|x\|_W = \sqrt{x^T W x}.$$

The dimensions of the input $n$ and output $m$ are specified by the dimensions of the input-output pairs. Universal approximation properties of RBFs are established in [39, 40].
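To make the notation concrete, here is a small illustrative sketch (Python/NumPy; the function names and the identity weight matrices are ours, not from the paper) that evaluates the expansion in Equation (1) with a Gaussian kernel and the weighted norm defined above.

```python
import numpy as np

def weighted_norm(v, W):
    """||v||_W = sqrt(v^T W v) for a (diagonal or full) weight matrix W."""
    return np.sqrt(v @ W @ v)

def rbf_expansion(x, alphas, centers, W_list, phi=lambda r: np.exp(-r**2)):
    """Evaluate f(x) = sum_i alpha_i * phi(||x - c_i||_{W_i})  (Equation (1))."""
    return sum(a * phi(weighted_norm(x - c, W))
               for a, c, W in zip(alphas, centers, W_list))

# Example: two Gaussian RBFs in R^2 with identity weighting.
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [1.0, -0.5]
Ws = [np.eye(2), np.eye(2)]
print(rbf_expansion(np.array([0.5, 0.5]), alphas, centers, Ws))
```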

Normalized RBFs of the following form were proposed by Moody and Darken [35]:

$$f(x) = \frac{\sum_{i=1}^{N_c} \alpha_i \, \phi(\|x - c_i\|_{W_i})}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}. \qquad (2)$$

Normalized RBFs appear to have advantages over regular RBFs, especially in the domain of pattern classification; see, e.g., [6], where it is reported that using normalized RBFs reduces the model order and improves the robustness of generalization. It has also been reported that normalized RBFs require less data when training models of dynamical systems [32].

Polynomial modulation of the normalized RBF expansion leads to the following form:

$$f(x) = \frac{\sum_{i=1}^{N_c} p_i(x)\, \phi(\|x - c_i\|_{W_i})}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}, \qquad (3)$$

where $p_i(x)$ is assumed to be a low-order polynomial such as a linear function ($p_i(x) = \alpha_i x + \beta_i$).

This leads us to the idea of normalized radial basis functions with polynomial response terms. In what follows we provide a fast algorithm that recursively updates the response terms and their associated weights as new data points arrive. During the training phase the unknown parameters of the model must be determined, including the number of RBFs in the expansion.

3 On-line Learning Scheme

The goal of our proposed algorithm is to find a mapping $f$ from $x \in \mathbb{R}^n$, a vector in $n$-dimensional space, to $\mathbb{R}$ such that $y_k = f(x_k)$ as the input-output pairs $(x_k, y_k)$ become available. We expect a total of $K$ arriving data points, where $k \in \{1, \ldots, K\}$, with $X = \{x_k\}_{k=1}^K$ and $Y = \{y_k\}_{k=1}^K$ representing the observed domain and range values. We assume that data arrive as a stream. Throughout the process we seek the parameters of the model in Equation (3) that yield an accurate model of the data. The observed data might be noisy, and we would like a model with good generalization ability. In this regression framework, the problem is to minimize the cost function

$$E(\psi) = \frac{1}{2} \sum_{k=1}^{K} \| f(x_k, \psi) - y_k \|^2,$$

given the available data points. Here $\psi$ contains all the parameters, $\alpha_i, \beta_i, c_i, W_i, N_c$, that define $f(x)$ in Equation (3). In this work the Euclidean inner product is used for the metric $\|\cdot\|$, and we use the Gaussian kernel

$$\phi(r) = \exp(-r^2).$$

Note that other kernels could be used; for a comprehensive review of RBF kernels used in the literature, and of novel skew and compactly supported expansions, see [30]. In what follows we describe an algorithm in this framework that does not require storage of the data stream and locally approximates the underlying functional behavior of the data. The response model parameters are updated quickly using recursive least squares, and the weight associated with each linear response is updated via statistical techniques. Note that the model order is also determined within the algorithm.

3.1 A Data Driven Space Cover

The algorithm proposed here works by forming a cover for the domain of the underlying function $f$ upon arrival of new data points $x_k$. We define a cover

$$C = \{U_i : i \in \Delta\},$$

where $U_i$ denotes an indexed family of sets, indexed by the elements of the set $\Delta$. $C$ is a cover of $X$ if $X \subseteq \bigcup_{i \in \Delta} U_i$. On the arrival of the first data point, $x_1$, a cover element or cloud $U_1$ is formed, centered on this point, with a distance threshold $D_T$ defining a ball around $x_1$. When the system receives a new data point $x_2$ with associated value $y_2$, if the point is within distance $D_T$ of $x_1$, then $x_2$ is assigned to the same cloud; otherwise a new cloud is created. The same procedure is repeated upon arrival of each data point $x_k$: a decision is made to assign the new point to a new or an existing cloud. If an existing cloud is chosen, the mean and the variance of that cloud are updated; otherwise, a new cloud is created with the new point as its centroid. When an arriving point is assigned to an existing cloud, the response of that cloud is updated using recursive least squares.

To determine the assignment of a newly arriving data point $x_k$, the distance between $x_k$ and the centroid of each cloud, $c_i$ with $i \in \{1, \ldots, N_c\}$ ($N_c$ is the number of clouds, or elements of the cover), is calculated as

$$D_i = \|x_k - c_i\|. \qquad (4)$$

Let

$$I = \arg\min_i D_i.$$

Compare $D_I$ with the distance threshold $D_T$: if $D_I < D_T$, update the ball indexed by $U_I$; otherwise spawn a new cloud $U_{N_c+1}$ as described above. When a new element is added to the cover, the number of local balls is also updated to $N_c = N_c + 1$.

The algorithm retains a statistical representation of the data over a cover that is specified by the data stream and a distance threshold. The distance threshold is the only parameter in this algorithm that requires tuning; its value depends on the variation of the response values over the domain. The data stored in the model are the number of points in each cloud, $K_i$, the centroid of each cloud, $c_i$, and the variance of the data in each dimension of each cloud, $W_i$. In addition, the response coefficients of each cloud are stored. Note that the model can be evaluated at any point of its domain after adapting it to the new information. As the geometry of the data becomes more complex, more clouds are needed in the cover to capture this behavior.
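As a sketch of the cloud-assignment step described above (illustrative code of ours, not the authors' implementation): a new point either joins the nearest existing cloud, if it lies within $D_T$, or signals that a new cloud should be spawned.

```python
import numpy as np

def assign_to_cloud(x_new, centroids, d_threshold):
    """Return the index of the cloud x_new joins, or None if a new cloud is needed."""
    if not centroids:
        return None
    dists = [np.linalg.norm(x_new - c) for c in centroids]   # Equation (4)
    i_min = int(np.argmin(dists))
    return i_min if dists[i_min] < d_threshold else None

# Example: two existing clouds; the first point joins cloud 0, the second spawns a new cloud.
centroids = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
print(assign_to_cloud(np.array([0.4, 0.1]), centroids, d_threshold=1.0))  # -> 0
print(assign_to_cloud(np.array([9.0, 9.0]), centroids, d_threshold=1.0))  # -> None (spawn)
```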


3.2 Recursive Update of the Model

Two parts of the model are updated during training: the response of the local model associated with the local ball indexed $I$, and its associated weight.

Recursive update for responses. We solve a local least squares problem of the form $\min_{\theta_I} \| X_I^{k_I} \theta_I - Y_I^{k_I} \|_2^2$, where $X_I^{k_I}$ is a matrix of the form

$$X_I^{k_I} = \begin{pmatrix} 1 & \cdots & x_1 \\ \vdots & & \vdots \\ 1 & \cdots & x_{k_I} \end{pmatrix},$$

and the vectors $x_1, \ldots, x_{k_I}$ are domain values. The vector $Y_I^{k_I} = [y_1, \ldots, y_{k_I}]$ contains all the associated range values. The vector $\theta_I = [\beta_I, \alpha_1^I, \ldots, \alpha_n^I]$ contains all the parameters of the linear response. The direct solution of this system is $\theta_I = (X_I^{k_I\,T} X_I^{k_I})^{-1} X_I^{k_I\,T} Y_I^{k_I}$. After receiving $n + 1$ data points in a local neighborhood (we refer to this state of the model as the no-knowledge state), the matrix $X_I^{k_I}$ is formed with $k_I = n + 1$. The recursive least squares update [10] is then used to compute the linear response model parameters. For $k_I = n + 1$, let $P_I^{k_I} = (X_I^{k_I\,T} X_I^{k_I})^{-1}$. When a new data point $(x_k, y_k)$ is assigned to this local neighborhood, let $a^T = [x_k, 1]$. The recursion formula is then

$$P_I^{k_I+1} = P_I^{k_I} - \frac{P_I^{k_I} a a^T P_I^{k_I}}{1 + a^T P_I^{k_I} a}, \qquad (5)$$

$$\theta_I^{k_I+1} = \theta_I^{k_I} + P_I^{k_I+1} a \left( y_k - a^T \theta_I^{k_I} \right), \qquad (6)$$

where $k_I$ is the recursion index of the cloud indexed $I$. These equations are easily modified to include a discount factor which puts more weight on recent observations.
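A minimal sketch of the rank-one recursive least squares step in Equations (5)-(6), using the same augmentation $a^T = [x_k, 1]$; the initialization from the first $n + 1$ points mirrors the no-knowledge state described above (the code is ours, for illustration).

```python
import numpy as np

def rls_update(P, theta, x_k, y_k):
    """One recursive least squares step (Equations (5)-(6)) for a linear response."""
    a = np.append(x_k, 1.0)                               # a^T = [x_k, 1]
    Pa = P @ a
    P_new = P - np.outer(Pa, Pa) / (1.0 + a @ Pa)         # Equation (5)
    theta_new = theta + P_new @ a * (y_k - a @ theta)     # Equation (6)
    return P_new, theta_new

# Initialize from the first n + 1 points of a cloud (n = 1 here) by direct least
# squares, then stream in further points.
X0 = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])       # rows [x_k, 1]
y0 = np.array([1.0, 3.0, 5.0])                            # y = 2x + 1
P = np.linalg.inv(X0.T @ X0)
theta = P @ X0.T @ y0
P, theta = rls_update(P, theta, np.array([3.0]), 7.0)
print(theta)                                              # approx [2., 1.]
```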

Recursive update for the weights. When a new data point $(x_k, y_k)$ is assigned to the local neighborhood indexed $I$, the weight $W_I$ associated with the element $U_I$ of the cover $C$ is also updated. For this purpose the number of data points assigned to this local ball is updated as $k_I = k_I + 1$. The centers of the kernels are updated via the recursion

$$c_I^{k_I+1} = c_I^{k_I} + \frac{x_k - c_I^{k_I}}{k_I}. \qquad (7)$$

The width or scale of the model in each dimension of cloud $I$ is updated using the Welford formula [33]. Initialize $Q_1 = x_1$ and $S_1 = 0$. For each additional data point $x_k$ assigned to this cloud, use the recurrence formulas

$$Q_{k_I} = Q_{k_I-1} + (x_k - Q_{k_I-1})/k_I, \qquad (8)$$

$$S_{k_I} = S_{k_I-1} + (x_k - Q_{k_I-1})(x_k - Q_{k_I}). \qquad (9)$$

The $k_I$th estimate of the variance is $W_I^2 = \frac{1}{k_I - 1} S_{k_I}$.
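The centroid and Welford recursions of Equations (7)-(9) can be applied per dimension as points stream into a cloud; the following sketch (ours, for illustration) shows one way to do so.

```python
import numpy as np

def update_cloud_stats(c, Q, S, k, x_new):
    """One streaming update of the centroid (Eq. 7) and Welford statistics (Eqs. 8-9)."""
    k += 1
    c = c + (x_new - c) / k                               # Equation (7): running centroid
    Q_prev = Q
    Q = Q + (x_new - Q) / k                               # Equation (8)
    S = S + (x_new - Q_prev) * (x_new - Q)                # Equation (9)
    var = S / (k - 1) if k > 1 else np.zeros_like(S)      # W_I^2, per dimension
    return c, Q, S, k, var

# Example: stream four 2-D points into one cloud.
pts = [np.array(p) for p in ([0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0])]
c = pts[0].copy(); Q = pts[0].copy(); S = np.zeros(2); k = 1
for x in pts[1:]:
    c, Q, S, k, var = update_cloud_stats(c, Q, S, k, x)
print(c, var)   # centroid per dimension and the unbiased variance estimate
```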

3.3 Model Evaluation

This section describes how to compute the approximation $F^k(x)$ at a query point $x$. As shown in Equation (3), the output of the proposed model is the weighted average of the linear responses over the local balls $U_i$ associated with the cover $C$. In this study a Gaussian kernel, $\phi(r) = \exp(-r^2)$, determines the weight of each response function. The widths $W_i$ and the centroids $c_i$ of the kernels, $i \in \{1, \ldots, N_c\}$, are computed recursively as described in Section 3.2. The model output is formulated as

$$f(x) = \frac{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})\,(\alpha_i x + \beta_i)}{\sum_{i=1}^{N_c} \phi(\|x - c_i\|_{W_i})}, \qquad (10)$$

where $N_c$ is the total number of clouds. Note that $\alpha_i x + \beta_i$ is the least squares solution of the response over the $i$th local ball.

The weighted inner product is calculated from the recursively updated standard deviation of the data in each dimension. For cases where the standard deviation of the data along a specific dimension is zero, we introduce a penalty term to avoid singularity by writing $W_i = W_i + P$, where $P$ is a specified constant. Note that in this study the weight matrices are diagonal. The procedure is summarized in the DC-RBF algorithm below.

Algorithm DC-RBF

Input: D_T
Initialize: N_c = 0, k = 1, ∆ = ∅
while a new data point x_k arrives do
    if k = 1 then
        c_1 = x_1, N_c = 1, k_1 = 1, ∆ = {1}, form cloud U_1
    else
        compute D_i = ‖x_k − c_i‖ according to Equation (4) for all i ∈ ∆
        compute I = argmin_i D_i
        if D_I < D_T then
            update cloud U_I: k_I = k_I + 1
            if k_I ≥ n + 1 then
                update the response parameters θ_I using Equations (5) and (6)
                update the cloud centroid c_I using Equation (7)
                update the scale of the cloud W_I using Equations (8) and (9)
            else
                store x_k assigned to cloud I
            end if
        else
            spawn new cloud U_{N_c+1}: N_c = N_c + 1, k_{N_c} = 1, ∆ = {1, ..., N_c}
        end if
    end if
    k = k + 1
end while
evaluate the model as needed using Equation (10)
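For illustration, a sketch of the evaluation step in Equation (10) used at the end of the algorithm, assuming diagonal weight matrices built from the per-dimension statistics; the data structure and the small `eps` guard against an empty denominator are our own choices, not the paper's.

```python
import numpy as np

def dc_rbf_predict(x, clouds, eps=1e-12):
    """Evaluate Equation (10): a normalized-RBF weighted average of local linear models.

    Each cloud is a dict with keys:
      'c'     : centroid (n,)
      'w'     : diagonal of W_i, one weight per dimension (n,)
      'theta' : linear response [alpha_1..alpha_n, beta]
    """
    num, den = 0.0, eps
    for cl in clouds:
        d = x - cl['c']
        r2 = np.sum(cl['w'] * d * d)                          # ||x - c_i||_{W_i}^2
        phi = np.exp(-r2)                                     # Gaussian kernel
        response = cl['theta'][:-1] @ x + cl['theta'][-1]     # alpha_i x + beta_i
        num += phi * response
        den += phi
    return num / den

# Example with two clouds in R^1: blend of the two local lines.
clouds = [
    {'c': np.array([0.0]), 'w': np.array([1.0]), 'theta': np.array([1.0, 0.0])},   # y ~ x
    {'c': np.array([4.0]), 'w': np.array([1.0]), 'theta': np.array([-1.0, 8.0])},  # y ~ 8 - x
]
print(dc_rbf_predict(np.array([2.0]), clouds))
```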

3.4 Goodness of Fit and Stopping Criteria

In data modeling, one major goal is to fit the data in such a way that the model neither over- nor under-fits data that were not used for training but are generated by the same process as the training data. One general approach to fitting such a smooth model to observed data is known as regularization. Regularization describes the process of fitting a smooth function through the data set using a modified optimization problem that penalizes variation [54]. A standard technique to achieve regularization is cross-validation [21, 55]. Such methods involve partitioning the data into training, validation, and testing subsets; for details see, e.g., [26]. To determine the tunable parameter of the model proposed in this work, i.e., the distance threshold $D_T$, we use $t$-fold cross-validation [18]. In this scheme, the original data set is randomly partitioned into $t$ subsamples. One of the $t$ subsamples is chosen as the validation data for testing the model, and the remaining subsamples are used as training data. This process is repeated $t$ times, with each of the $t$ subsamples used exactly once as the validation data. The $t$ results are averaged to provide an overall estimate of the error. The advantage of this method over repeated random subsampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. We have chosen to use five-fold cross-validation. In the first iteration, we mark the first fold as the test set and use the other four folds (the training set) to build a model [25]. We record the accuracy of the model on the test data set. This procedure is repeated for all five folds, and the mean squared error (MSE) on the test data sets is computed each time using

$$\text{MSE} = \frac{1}{L} \sum_{l=1}^{L} \left( f(x_l) - y_l \right)^2, \qquad (11)$$

where $L$ is the size of the test data set. Over-fitting is avoided by tuning the parameters to minimize the MSE calculated through cross-validation, as opposed to minimizing the MSE on points within the training data set [26].

To increase the number of estimates, one could run the above t-fold cross validationmultiple times. For this purpose the data needs to be reshuffled each round.
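As a sketch of how the distance threshold might be selected by $t$-fold cross-validation (the scaffolding below is ours; the toy stand-in model uses a polynomial degree as its hyperparameter only so that the example runs end to end, whereas in the paper the hyperparameter is $D_T$ and the model is DC-RBF):

```python
import numpy as np

def t_fold_cv_mse(X, y, train_fn, predict_fn, param, t=5, seed=0):
    """Estimate out-of-sample MSE of a model built by train_fn with a given
    hyperparameter, via t-fold cross-validation (Section 3.4)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, t)
    errs = []
    for i in range(t):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(t) if j != i])
        model = train_fn(X[train], y[train], param)
        pred = predict_fn(model, X[test])
        errs.append(np.mean((pred - y[test]) ** 2))           # Equation (11)
    return float(np.mean(errs))

# Toy stand-in model: a global polynomial fit whose degree plays the role of the
# hyperparameter; with DC-RBF, train_fn/predict_fn would wrap the streaming
# training loop (distance threshold D_T) and the evaluation in Equation (10).
train_fn = lambda X, y, deg: np.polyfit(X.ravel(), y, deg)
predict_fn = lambda m, X: np.polyval(m, X.ravel())
X = np.linspace(0, 10, 120).reshape(-1, 1)
y = np.sin(np.pi * X.ravel() / 4) + 0.1 * np.random.randn(120)
best = min([1, 3, 5, 7], key=lambda d: t_fold_cv_mse(X, y, train_fn, predict_fn, d))
print("selected hyperparameter:", best)
```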

To determine the predictive ability of the models we use several measures of prediction accuracy. The $R^2$ value is calculated as

$$R^2 = 1 - \frac{\sum_{l=1}^{L} (y_l - f(x_l))^2}{\sum_{l=1}^{L} (y_l - \bar{y})^2},$$

where $\bar{y} = \frac{1}{L} \sum_{l=1}^{L} y_l$ is the mean of the $y_l$. The numerator measures the deviation of the model output from the observed values, while the denominator measures the variance of the observations about their mean. This measure indicates how well the model performs relative to using the mean value as the predictor. An $R^2$ value of 1 indicates an exactly fitted model, while a value of 0 indicates a model that adds no predictive power. We also use the mean absolute error (MAE) to record the performance of the model, defined as

$$\text{MAE} = \frac{1}{L} \sum_{l=1}^{L} |f(x_l) - y_l|. \qquad (12)$$
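A short sketch (ours) computing the three reported measures, MSE (Equation (11)), MAE (Equation (12)) and $R^2$, from predictions on a test set:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE (Eq. 11), MAE (Eq. 12) and R^2 for a test set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = np.mean((y_pred - y_true) ** 2)
    mae = np.mean(np.abs(y_pred - y_true))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```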

In the worst case, if the distance threshold is set too high, the proposed method will fit asingle hyperplane to the entire data set.

4 The Convergence Properties of the DC-RBF Algorithm

This section establishes properties of the DC-RBF algorithm. The theorems describe its finite-time and asymptotic behavior. Note that the proofs are carried out for the algorithm in steady state, where there are enough clouds to cover the desired domain of the function and the clouds have stabilized. Theorems are provided for both homoscedastic and heteroscedastic noise.

Assume we have

$$y_k = \theta x_k + \epsilon_k, \qquad (13)$$

for $k \in \{1, \ldots, K\}$. The vector $\theta$ is the parameter to be estimated. The set $D_K = \{(x_k, y_k)\}$, with $k \in \{1, \ldots, K\}$, denotes the collection of sequentially observed data points. The error variables $\epsilon_k$ are random.

Remark 1 (Gauss-Markov theorem [17]). The Gauss-Markov theorem states that in a linear regression model in which the errors have expectation zero, are uncorrelated, and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients, i.e., the one with the lowest possible variance, is the ordinary least squares estimator. The errors need not be normal, nor independent and identically distributed (only uncorrelated and homoscedastic). That is, $E(\epsilon_k) = 0$, $\mathrm{var}(\epsilon_k) = \sigma^2 < \infty$ and $\mathrm{cov}(\epsilon_{k_1}, \epsilon_{k_2}) = 0$ for $k_1 \neq k_2$ and $k_1, k_2 \in \{1, \ldots, K\}$. The proof follows from a contradiction on the mean and variance of an alternative linear estimator of the coefficient $\theta$, for example one obtained by adding another term to the least squares solution $(X^T X)^{-1} X^T$. It can be shown that such an estimator either fails to be unbiased or produces a variance greater than that of the least squares solution, i.e., $\sigma^2 (X^T X)^{-1}$.

Note that in our work we use the recursive least squares solution on the local clouds as new data points arrive in each local region.

Lemma 1. For a given $D_T$, if the arriving data points sample an $n$-dimensional ambient space, the DC-RBF algorithm will produce null output for any cloud of the cover that does not accumulate $M_o = n + 1$ data points ($n$ is the dimension of the ambient space). Therefore $D_T > 0.5\,\|N_{M_o}(x_{k_1}) - x_{k_2}\|$ for $x_{k_1}, x_{k_2} \in X$, where $N_{M_o}(x)$ denotes the $M_o$th nearest neighbor of the point $x$.

Proof. By the construction of the model in the DC-RBF algorithm, $n + 1$ data points are needed to construct a hyperplane from $\mathbb{R}^n$ to $\mathbb{R}$. Therefore, if $D_T$ is chosen to be sufficiently small, the model may produce no output.

Proposition 1. Given $D_T = \mathrm{diam}(X)$, and assuming the underlying data structure is a hyperplane (a line in dimension 1) with homoscedastic additive noise, the DC-RBF algorithm is unbiased asymptotically and in finite time larger than $n + 1$. The diameter of the domain $X$ is defined as

$$\mathrm{diam}(X) = \max_{1 \le k_1, k_2 \le K} \|x_{k_1} - x_{k_2}\|.$$

Proof. Under the hypothesis $H_{k_I}$ that cloud $I$ has a linear structure, the rank-one update of the recursive least squares solution at each iteration, given by Equations (5) and (6), provides the best linear unbiased estimator of the coefficient $\theta$. Note that in this argument the local iteration counter $k_I$ (the index of the $k$th data point arriving at cloud $I$) is the same as the global counter, $k_I = k$. We offer a proof by induction. At $k = n + 1$ the claim holds, since the algorithm stores the first $n + 1$ data points to initialize the matrix $P$ and the least squares estimate is computed directly. Assume the claim holds for $k = m \in \mathbb{N}$; we show that at iteration $k = m + 1$ the rank-one update computed via recursive least squares is also unbiased. This follows from the derivation of recursive least squares via the Sherman-Morrison formula,

$$(A + uv^T)^{-1} = A^{-1} - \frac{(A^{-1} u)(v^T A^{-1})}{1 + v^T A^{-1} u}.$$

Let $G^m = X^{m\,T} X^m$ and $a^{m+1\,T} = [x^{m+1}, 1]$. Then $\theta^{m+1} = (G^{m+1})^{-1} X^{m+1\,T} Y^{m+1} = (G^{m+1})^{-1}\left(X^{m\,T} Y^m + a^{m+1} y^{m+1}\right)$. Notice that $X^{m\,T} Y^m = G^m (G^m)^{-1} X^{m\,T} Y^m = G^m \theta^m = G^{m+1} \theta^m - a^{m+1} a^{m+1\,T} \theta^m$. Therefore $\theta^{m+1} = \theta^m - (G^{m+1})^{-1} a^{m+1}\left(a^{m+1\,T} \theta^m - y^{m+1}\right)$. Let $P^{m+1} = (G^{m+1})^{-1} = (G^m + a^{m+1} a^{m+1\,T})^{-1}$. Therefore,

$$\theta^{m+1} = \theta^m + P^{m+1} a^{m+1} \left( y^{m+1} - a^{m+1\,T} \theta^m \right).$$

Hence, at each iteration the estimate is unbiased. Asymptotically, when the number of data points becomes very large, the same induction applies. Therefore, as the number of local iterations $k \to \infty$, the recursion $\theta^{k+1} = \theta^k + P^{k+1} a \left( y^k - a^T \theta^k \right)$ yields an unbiased estimator of $\theta$, by Remark 1.
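As an illustration of this equivalence (not part of the paper), the following sketch checks numerically that the Sherman-Morrison based recursion reproduces the direct least squares solution on simulated linear data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pts = 200
x = rng.uniform(-5, 5, size=n_pts)
A = np.column_stack([x, np.ones(n_pts)])        # rows a_k^T = [x_k, 1]
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, n_pts)   # linear truth with additive noise

# Initialize from the first 2 points (n + 1 for n = 1), then update recursively.
P = np.linalg.inv(A[:2].T @ A[:2])
theta = P @ A[:2].T @ y[:2]
for k in range(2, n_pts):
    a = A[k]
    Pa = P @ a
    P = P - np.outer(Pa, Pa) / (1.0 + a @ Pa)           # Sherman-Morrison step
    theta = theta + P @ a * (y[k] - a @ theta)          # recursive LS update

theta_direct, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(theta, theta_direct))                 # True (up to round-off)
```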

Lemma 2. Given $D_T = \mathrm{diam}(X)$, and assuming the underlying data structure is a hyperplane with heteroscedastic or correlated additive noise, the DC-RBF algorithm is unbiased asymptotically and in finite time larger than $n + 1$.

Proof. The proof follows that of Proposition 1, incorporating the notion of generalized least squares. Generalized least squares assumes that the conditional variance of $Y$ given $X$ is a known or estimated matrix $\Omega$. This matrix is used as the weight in computing the minimum squared Mahalanobis length, hence $\beta = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} Y$. $\Omega$ rescales the input data to make it uncorrelated, and then the Gauss-Markov theorem is applied. For the case where the variance of the noise changes, the weight matrix $\Omega$ is diagonal and this becomes a weighted least squares problem, with the weight associated with each point $x_k \in X$ equal to $\frac{1}{\sigma_k}$.

We denote the bias generated by the DC-RBF algorithm at each point of the domain of the function $f(x)$ by $B(x) = (f(x) - E[y|x])^2$. The variance at each point of the domain is $V(x) = \frac{1}{K}(f(x) - \bar{f}(x))^2$, where $\bar{f}(x)$ is the average of $f(x)$.

Proposition 2. If the underlying nonlinear functional structure $g(x)$ is Lipschitz continuous with Lipschitz constant $L_c$, and the observed data points are drawn from $y_k = g(x_k) + \epsilon_k$, then

$$\sum_{x \in X} B(x) < Q(L_c, \omega_1, \omega_2) = L_c \frac{(\omega_1 - \omega_2)^2}{2} + O.$$

Here $D_T = \mathrm{diam}(X)$ is the chosen distance threshold, $\omega_1$ and $\omega_2$ are the points where the linear estimate intersects the underlying true structure $g(x)$, and $O$ denotes a residual term.

Proof. We assume that the underlying function $g(x)$ is Lipschitz continuous, i.e., there are restrictions on the variations of the underlying structure. This implies that for every pair of points on the graph of the function, the absolute value of the slope of the line connecting them is no greater than a fixed real number (the Lipschitz constant $L_c(g)$). The slope of the secant line passing through $x_{k_1}, x_{k_2} \in X$ equals the difference quotient

$$\frac{g(x_{k_1}) - g(x_{k_2})}{\|x_{k_1} - x_{k_2}\|} < L_c$$

when $x_{k_1} \neq x_{k_2}$. Without loss of generality we only consider a portion of the data whose structure has an ascending slope followed by a descending one; this can be achieved by an appropriate choice of $D_T$. To show this result we take $\phi(\cdot)$ to be uniform weights. First we show that the least squares solution for this structure intersects $g$ at two points, $\omega_1$ and $\omega_2$ (see Figure 1(a)). The proof follows from a contradiction argument. We denote the residuals of the least squares fit line at iteration $k$, $f^k(x) = a^k x$, with respect to the true function $g(x)$, as $e^k(x)$. Note that the residuals at each point are discrete samples of the mapping

$$e : x \in [\omega_l, \omega_r] \rightarrow e^k(x) \in \mathbb{R}.$$


(a) First order model. (b) Second order model.

Figure 1: Visual demonstration of the regression lines intersecting the underlying nonlinear function $g(x)$. The figure shows the geometry for the case where the model consists of a single line, followed by the case where the model consists of two lines.

We assume that the data used for training are given by $X_I \in U_I$ (the index $I$ denotes a specific cloud; in this case a single cloud spans the whole data diameter, $I = 1$). Note that the global iteration $k$ is the same as the local index $k_I$. The following analysis is carried out for a given iteration $k$. The data are contained in the closed interval $x \in B_I = [\omega_l, \omega_r] \subseteq U_I$. In addition, we assume that both $g$ and $f$ intersect at the origin of the coordinate system, i.e., $e(\omega_1) = 0$, and that $e(x) = 0$ for $x \notin B_I$.

Let $a^+$ denote the optimized parameters of the linear estimator $f$, i.e.,

$$a^+ = \arg\min_a \sum_{x_l \in B_I} (a x_l - e(x_l))^2.$$

We would like to show that there exists $x \in [\omega_l, \omega_r]$ such that

$$a^+ x - e(x) = 0.$$

This indicates that the line $a^+ x$ intersects $g(x)$ at another point. If $a^+ x - e(x) \neq 0$ for all $x \in [\omega_l, \omega_r]$, then, without loss of generality, one could assume $a^+ x - e(x) > 0$ for all $x \in [\omega_l, \omega_r]$.

Thus there must be a smallest distance between the optimal hyperplane $a^+ x$ and the residual curve $e(x)$, defined as

$$\alpha^+ = \min_{x \in [\omega_l, \omega_r]} \; a^+ x - e(x).$$

Say this minimum occurs at the point $x^+$. Note that $x^+$ may not actually correspond to a sampled data point $x_k$.

Now consider another hyperplane $a^* x$, obtained by scaling the optimal hyperplane so that it produces another point of intersection with $e(x)$, i.e.,

$$a^* x - e(x) = 0$$

for some $x \neq 0$, $x \in B_I$. By assumption we have $a^* x = \kappa a^+ x$ with $0 < \kappa < 1$, so

$$0 < e(x) \le a^* x < a^+ x$$

for all $x \in (\omega_l, \omega_r)$. Hence $a^* x - e(x) < a^+ x - e(x)$ and

$$\sum_{x_l \in [\omega_l, \omega_r]} (a^* x_l - e(x_l))^2 < \sum_{x_l \in [\omega_l, \omega_r]} (a^+ x_l - e(x_l))^2,$$

which is a contradiction, as we assumed $a^+$ was the optimal solution. Thus the parameters $a^+$ could not have been obtained by least squares minimization.

Using the geometry produced by the intersecting line and the nonlinear structure, we can contain the residuals within three triangles generated by the lines passing through the intersection points $\omega_1$ and $\omega_2$ with slope $L_c$ (see Figure 1(a)). We conclude that

$$\sum_{x_l \in B_I} |f(x_l) - g(x_l)| < L_c \left[ \frac{(\omega_1 - \omega_2)^2}{2} + (\omega_l - \omega_1)^2 + (\omega_r - \omega_2)^2 \right].$$

Note that $B_I$ is an interval containing the region of interest, with $\mathrm{diam}(B_I) > |\omega_1 - \omega_2|$, where $\omega_r$ and $\omega_l$ are the right and left endpoints of $B_I$, respectively, and $f^k$ is the line that is the least squares solution. We present this proof in a single dimension; the higher-dimensional argument follows directly by considering a ball $B_I$ whose diameter covers the hyperplane intersecting $g(x)$ in a set of co-dimension 2 at $\Omega_1, \ldots, \Omega_n$. The argument follows by considering the volume contained between the hyperplane of co-dimension 1 and the polyhedron (a tetrahedron in dimension 2) that contains the nonlinear structure and whose faces all have slope $L_c$.

Note that in one dimension we have the notion of a line and a tangent line, and the model output is a weighted sum of piecewise linear approximations to the data. In higher dimensions we have the analogous notion of a hyperplane and a tangent plane used to form a weighted piecewise linear approximation to the underlying manifold.

Definition 1 (Partition of unity [37]). A partition of unity of a topological space $X$ is a set of continuous functions $\{\rho_i\}_{i \in \Delta}$ from $X$ to the unit interval $[0, 1]$ such that for every point $x \in X$ there is a neighborhood of $x$ where all but a finite number of the functions are 0, and the sum of all the function values at $x$ is 1, i.e., $\sum_{i \in \Delta} \rho_i(x) = 1$.

The notion of a partition of unity allows the extension of a local construction to the whole space. We identify our partition of unity as follows. Given any open cover $U_i$, $i \in \Delta$ ($\Delta$ is an index set), of a space, there exists a partition $\rho_i$, $i \in \Delta$, such that $\mathrm{supp}\,\rho_i \subseteq U_i$ ($\mathrm{supp}\,\rho_i$ denotes the support of the function $\rho_i$). Such a partition is said to be subordinate to the open cover $U_i$. Thus we choose to have the supports indexed by the open cover.

If the functions are required to be compactly supported, then given any open cover $U_i$, $i \in \Delta$, of a space, there exists a partition $\rho_j$, $j \in \Lambda$, indexed over a possibly distinct index set $\Lambda$, such that each $\rho_j$ has compact support and, for each $j \in \Lambda$, $\mathrm{supp}\,\rho_j \subseteq U_i$ for some $i \in \Delta$.

Theorem 1. Assume the underlying data structure $g(x)$ is nonlinear. With distance threshold parameter $D_{T_1} = \mathrm{diam}(X)$, the DC-RBF algorithm produces an upper bound on the bias denoted by $Q_{D_{T_1}}$ ($Q$ is defined in Proposition 2). However, if $D_{T_2} = \mathrm{diam}(X)/N$ ($N \in \mathbb{N}$ a natural number) with $N > 1$, with corresponding upper bound on the bias $Q_{D_{T_2}}$, then $Q_{D_{T_2}} < Q_{D_{T_1}}$.

Proof. As new data points arrive, more of the true underlying geometry reveals itself. We begin by posing the null hypothesis that the underlying structure is described by a line. As the number of observations increases, we test this hypothesis against the alternative that the underlying structure is described by a series of lines over regions. At this stage the DC-RBF algorithm, given $D_T = \mathrm{diam}(X)$, attempts to model the secant line passing through the curvature (see Figure 1(a)). By the secant line we mean the regression line passing through the data points associated with the cloud indexed $I$, where at this stage $I = 1$. Note that the linear solution produces bias, with an upper bound given in Proposition 2.

As in Proposition 2, we assume the regression line intersects the underlying nonlinear structure at points $\omega_1$ and $\omega_2$, with upper bound $Q$ on the bias. The domain of interest is $B_I = [\omega_l, \omega_r]$. As $D_T$ shrinks (without loss of generality we take $D_T = \mathrm{diam}(X)/2$), the space is broken into smaller sets (see Figure 1(b)). For a one-dimensional space, this choice of $D_T$ splits the space into two parts, hence $\Delta = \{1, 2\}$. The normalized RBFs, $\phi(r)$, with $\sum_i \frac{\phi(\|x - c_i\|_W)}{\sum_j \phi(\|x - c_j\|_W)} = 1$, form a partition of unity as in Definition 1. This allows each local solution to be extended to a larger domain, producing a smooth transition from one local model to its neighbors. With this setup, we assume that the two regression lines $l_1$ and $l_2$ intersect at the point $\omega_M$. According to Proposition 2, with a uniform choice of the kernel $\phi(\cdot)$, we obtain the following bounds on the bias for $l_1$ and $l_2$, denoted $Q_{l_1}$ and $Q_{l_2}$ respectively:

$$Q_{l_1} = L_c \left[ \frac{(\omega_{l_1 1} - \omega_{l_1 2})^2}{2} + (\omega_l - \omega_{l_1 1})^2 + (\omega_M - \omega_{l_1 2})^2 \right]$$

and

$$Q_{l_2} = L_c \left[ \frac{(\omega_{l_2 1} - \omega_{l_2 2})^2}{2} + (\omega_M - \omega_{l_2 1})^2 + (\omega_r - \omega_{l_2 2})^2 \right].$$

One can readily verify that $Q_{l_1} + Q_{l_2} < Q$. This results in a reduction of the upper bound on the bias, hence $Q_{D_{T_2}} < Q_{D_{T_1}}$, at the cost of having more clouds in the cover (more parameters in the model) and a corresponding increase in the variance of the model, $V_{D_{T_1}} < V_{D_{T_2}}$.

The argument provided here is for one dimension but, with arguments similar to those in Proposition 2, it can be generalized to higher dimensions.

Variable subset selection and shrinkage are methods that introduce bias and try toreduce the variance of the estimate. These methods trade a little bias for a larger reductionin variance. In practice we choose DT via cross validation among other available methodsthat are developed for this purpose.

Theorem 2. Asymptotically, the DC-RBF Algorithm is unbiased for heteroscedastic noise.

Proof. This follows from Theorem 1, taking into account the limits $D_T \to 0$ and $K \to \infty$. Let

$$f_i(x) = \frac{\sum_i p_i(x)\, \phi(\|x - c_i\|_{W_i})}{\sum_i \phi(\|x - c_i\|_{W_i})}.$$

As $D_T \to 0$, $i \to \infty$, and therefore

$$\lim_{i \to \infty} f_i(x) = \frac{\int p_i(x)\, \phi(\|x - c_i\|_{W_i})\, dx}{\int \phi(\|x - c_i\|_{W_i})\, dx}.$$

Since $D_T \to 0$ leads to $W_i \to 0$, we have $\lim_{W \to 0} \phi(\|x - c_i\|_W) = \delta(x - c_i)$, where $\delta(\cdot)$ is the Dirac delta function. We then have $\lim_{i \to \infty} \int \phi(\|x - c_i\|)\, dx = 1$. Note that $K \to \infty$ guarantees that every local ball $B(x, \xi)$ around a given point $x$ of the domain, with radius $\xi > 0$, contains at least $n + 1$ data points, as required by Lemma 1; thus $\lim_{i \to \infty} \int p_i(x)\, \phi(\|x - c_i\|_{W_i})\, dx = p_i(c_i)$, where $x = c_i$.

Let the true underlying nonlinear function $g \in C^{\infty}$ be given by the Taylor expansion

$$g(x, a) = \sum_{n=0}^{\infty} \frac{g^{(n)}(a)}{n!} (x - a)^n,$$

where $g(x, a)$ denotes the expansion of the function $g(x)$ around the point $a$ and $g^{(n)}(a)$ denotes the $n$th derivative of $g$ at $a$. If the underlying function satisfies $g(x) \in C^1$, we consider a first-order linear approximation to $g(x)$ at each point, i.e., $g(x) \approx f(x) = g^{(1)}(a)\,(x - a)$. In other words, at each point $x$ the first-order approximation is a hyperplane. In the DC-RBF algorithm, if $p_i(x)$ is chosen to be a linear function, $p_i(x)$ is the tangent line at each point $x$. This holds because the limit of the secant line passing through the points $x_{k_1}$ and $x_{k_2}$ as $x_{k_1} \to x_{k_2}$ is the tangent line at $x_{k_2}$: when $x_{k_1} \to x_{k_2}$, $\|x_{k_1} - x_{k_2}\| \to 0$, and in the limit this secant line becomes the tangent at the meeting point. As a result the difference quotient $\frac{f(x_{k_1}) - f(x_{k_2})}{\|x_{k_1} - x_{k_2}\|}$ approaches the slope of the tangent line of $g(x)$ at $x_{k_2}$, i.e.,

$$\lim_{\|x_{k_1} - x_{k_2}\| \to 0} \frac{f(x_{k_1}) - f(x_{k_2})}{\|x_{k_1} - x_{k_2}\|} = g^{(1)}(x_{k_2}).$$

This holds as $K \to \infty$. Up to first order, within a ball $B(x, \xi)$ with $\xi > 0$, the function is assumed to be linear. By Remark 1, and considering that the variation of the noise within $B(x, \xi)$ is negligible, the solution is BLUE. Since this holds at every point of the domain $X$, the DC-RBF algorithm is asymptotically unbiased.

5 The Empirical Results

Here we demonstrate the performance of the algorithm on a variety of synthesized and real data sets. The data sets have distinct features in terms of input dimension, complexity of the response function, and noise content. Note that the algorithm provides a functional representation of the underlying data. The first three synthetic examples concentrate on the performance of the DC-RBF algorithm and a discussion of its behavior, followed by two real examples with comparisons to other statistical techniques.

5.1 Synthesized Data Sets

In this section, the performance of the DC-RBF algorithm on the newsvendor problem is shown. In the newsvendor problem, a newsvendor is trying to determine how many units of an item, $x$, to stock. The stocking cost and selling price of the item are $c$ and $p$, respectively. One can assume an arbitrary distribution for the demand $D$. The expected profit is given by

$$F(x) = \mathbb{E}\left[p \min(x, D)\right] - cx.$$

This problem poses a challenge for on-line data analysis due to the special behavior of the function around the optimal solution, which is highly dependent on the characteristics of the data set. A data set with oscillatory behavior and varying noise of the form $y = \sin(\pi x / 4) + 0.1\,x\,N(0, 1)$ is also used in this section. Here the challenge is to discover the underlying nonlinear function while the observations are corrupted with varying levels of noise. The last synthesized data set provides an opportunity to approximate a two-dimensional surface from noisy observations of a saddle-shaped surface given by $z = x^2 - y^2 + N(0, \sqrt{5})$.
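The three synthetic data sets described above can be simulated along the following lines (the sampling ranges and seed below are our own illustrative choices; the paper's exact experimental setups are given with each experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def newsvendor_profit_samples(n, c=50, p=60, d_low=50, d_high=60):
    """Noisy newsvendor profit observations: p*min(x, D) - c*x with random demand D."""
    x = rng.integers(20, 81, size=n)                       # stock levels
    d = rng.integers(d_low, d_high + 1, size=n)            # demand draws
    return x, p * np.minimum(x, d) - c * x

def oscillatory_samples(n):
    """y = sin(pi*x/4) + 0.1*x*N(0,1): heteroscedastic noise growing with x."""
    x = rng.uniform(0, 16, size=n)
    return x, np.sin(np.pi * x / 4) + 0.1 * x * rng.normal(size=n)

def saddle_samples(n):
    """z = x^2 - y^2 + N(0, sqrt(5)) over [-3, 3] x [-3, 3]."""
    xy = rng.uniform(-3, 3, size=(n, 2))
    z = xy[:, 0] ** 2 - xy[:, 1] ** 2 + rng.normal(0, np.sqrt(5), size=n)
    return xy, z

x, profit = newsvendor_profit_samples(24)
print(x[:5], profit[:5])
```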

One-dimensional newsvendor data set. Figure 2 describes the experimental setup and the model output for two different distributions of $D$, showing the training data, testing data and model output for each case. The specification of the algorithm is identical for both problems, and the number and position of the training and testing data are the same in both experiments. We employ our function approximation technique to find the optimal number of units to stock. The MAE for the first experiment is 107.39 with a model that has three clouds. The second experiment results in a model with three clouds and an MAE of 426.63. We observe that the profit-maximizing stocking level is well identified in both experiments. Note that the algorithm learns the underlying functional behavior of the given data set within the scope determined by the radius of the local balls. The model generalizes well within local regions.

The underlying function forms a shape that is not in the span of a polynomial of degree $p$, in particular for problems where $D$ has low variance (making this a more challenging problem). Therefore parametric methods do not perform well, especially since the basis functions of a parametric model must be known a priori. For the noisy data set, interpolation methods such as cubic splines (in contrast to regression schemes) do not perform well and often over-fit the data. Interpolation techniques also require more data points to produce an accurate response; see, e.g., [55]. We observe that, in this experiment, our method outperforms kernel smoothing regression on both the noisy and noise-free data sets. The poor performance of kernel smoothing here is due to its local averaging property, which tends to under-fit the function around local extreme points; consequently, the predicted value of the maximum is worse than the true value. In addition, this technique requires storage of the history of data points and does not provide an analytical form for the underlying function.

We have tested our algorithm on various other one-dimensional noisy data sets and observed similar conclusions. We therefore report only on the newsvendor data set, which plays a key role in resource management and produces an interesting case for data analysis due to the shape of the function around the optimum. Our method has the lowest, or a very comparable, mean squared error compared to the techniques described in this section, and captures the local structure of the signal well.

An oscillatory data set with varying additive noise. To demonstrate the performance of the algorithm on signals with varying additive noise, we tested the method on a synthesized data set produced by $y = \sin(\pi x/4) + 0.1\,x\,N(0, 1)$. Figure 3(a) shows the noisy and noise-free data sets. The challenge is to recover the underlying function $\sin(\cdot)$ from the noisy input signal with large variation in the noise variance. Figure 3(b) shows the testing data set and the model output. One can observe that the proposed algorithm performs well in recovering the underlying function with only five clouds in the function expansion. The final MSE is 0.8 and the MAE is 0.62. The distance threshold is 3.

Two-dimensional saddle data set. A data set generated from $z = x^2 - y^2 + N(0, \sqrt{5})$, which produces a noisy saddle shape, is used in this study. Figure 4 shows the data points used for training and testing. Figure 5 shows four major snapshots of the model building process. Each panel in this figure shows the starting point of adding a new cloud, when that cloud has received three data points. The first data point that activates the cloud is also plotted in each panel. At first there are not enough points to form a model. Then, as new data points arrive, the first cloud is formed, as shown in Figure 5(a).

[Figure 2 appears here: two panels plotting average profit versus units stocked (training data, testing data, and model output); panel (a) D ∼ (50, 60), panel (b) D ∼ (30, 70).]

Figure 2: Newsvendor data sets generated using $p \min(x, D) - cx$. The training and testing data sets consist of 24 and 36 data points, respectively. To generate the data, $c = 50$, $p = 60$, and $D$ is a random uniform integer between 50 and 60 for panel (a) and between 30 and 70 for panel (b). Inventory stock levels $x$ were sampled from a random uniform integer distribution between 20 and 80. The MSE and MAE of the model output on the test data set, in the original scale of the data, are 33847 and 107.39 for panel (a) and 453470 and 426.63 for panel (b), respectively. The model distance threshold is 15. There are three clouds in each model.

Depending on the spatial location of the arriving points, the first cloud might be updated or a second cloud might be formed. The same process repeats until the fourth cloud of the model is formed, as shown in panel (d) of Figure 5. Finally, Figure 6 shows the final model after presenting 125 data points, plotted together with the 100 testing points.

We observe that the accuracy of the model is similar to LOESS. However, unlike LOESS, our method does not require storage of the data points and has superior speed. In addition, DC-RBF provides an analytical formulation for the underlying structure. Our test results on other synthesized surfaces lead to the same conclusion. For further elaboration on the LOESS method and its comparison to DC-RBF, see Section 6.

5.2 Real Benchmark Data Sets and Comparison Results

In this section we show the performance of the proposed method on two real data sets and compare the results to related methods in the literature in terms of accuracy, speed, batch vs. on-line operation, and the requirement of storing previous data points. We select two data sets with continuous response values. The selected data sets represent regression challenges involving noise heteroscedasticity and moderate dimensionality. We compare our results to the techniques that we find closest to the spirit of our work.

The benchmark algorithms are as follows:

• Dirichlet Process Mixtures of Generalized Linear Models (DP-GLM): A Bayesian nonparametric method that finds a global model of the joint input-response distribution through a mixture of local generalized linear models [24].

• Ordinary Least Squares (OLS): A parametric method that is widely used for data fitting problems and often provides a reasonable fit to the data. In this study $[1, X_1, \ldots, X_n]^T$ is chosen as the set of basis functions ($X_i$, $i \in \{1, \ldots, n\}$, denotes the $i$th coordinate).

[Figure 3 appears here: panel (a) the actual data (noise-free and noisy); panel (b) the training data, testing data, and model output.]

Figure 3: Panel (a) shows a data set of 200 points in the interval $[0, 16]$ generated by $y = \sin(\pi x/4) + 0.1\,x\,N(0, 1)$; both the noise-free and noisy data are shown. Panel (b) shows the training and testing data sets, each with 100 points, and the model output. In this graph the data are re-scaled using the standard deviation of the whole data set.


• CART: This is a nonparametric tree regression method [4]. This method is availablein the MATLAB function classregtree.

• Bayesian CART: A tree regression model with a prior over tree size [8], implementedin R with the tgp package.

• Bayesian Treed Linear Model: A tree regression model with a prior over tree sizeand a linear model in each of the leaves [9], implemented in R with the tgp package.

• Gaussian Processes (GP): A nonparametric method for continuous inputs andresponses [47]. This algorithm is available in MATLAB with program name gpml.

• Treed Gaussian Processes: A tree regression model with a prior over tree size and a GP on each leaf node [22]; this is implemented in R with the tgp package.

DP-GLM [24] is closest in spirit to our work, although algorithmically the two are quite distinct. We therefore emphasize this comparison and use the results reported in [24] for the performance of the other competing methods in the literature. DP-GLM was designed for batch applications and is much slower and more data intensive than the DC-RBF method proposed in this paper.

Cosmic Microwave Background (CMB) [2]. This data set maps positive integers $x = 1, 2, \ldots, 899$, called "multipole moments," to the power spectrum $C_l$. It consists of 899 observations, shown in Figure 7. This data set is highly nonlinear and heteroscedastic, which makes it an interesting benchmark choice. The underlying function relates continuous domain values to their corresponding continuous responses.

Figure 8 shows the test data set and the model output. The distance threshold parameter value, 95, is determined using 5-fold cross-validation. The output is a continuous plot of a parsimonious model with only seven clouds.

[Figure 4 appears here: 3-D scatter plot of the noisy saddle data over (x, y, z).]

Figure 4: Saddle data set generated by $z = x^2 - y^2 + N(0, \sqrt{5})$. There are 125 training and 100 testing data points over the domain $(x, y) \in [-3, 3] \times [-3, 3]$. This nonlinear and noisy example provides an interesting problem for demonstrating the performance of our proposed algorithm.

[Figure 5 appears here: four panels over (x, y, z), (a) first cloud, (b) second cloud, (c) third cloud, (d) fourth cloud.]

Figure 5: The model behavior at the starting point of adding a new cloud to the model.The final model consists of four clouds.


[Figure 6 appears here: the final model surface shown with the testing points over (x, y, z).]

Figure 6: The final model output after receiving 125 training points, shown together with the 100 testing points. The final model, with four clouds, has an MSE of 4.49 and an MAE of 1.7.

[Figure 7 appears here: scatter plot of the CMB data set (y versus x).]

Figure 7: This plot shows the CMB data set that consists of 899 data points.

The MSE and MAE for the testing data are 0.7810 and 0.4546, respectively. The result shows that the method handles the heteroscedasticity in this data set quite well. The comparison between the DC-RBF method and the benchmark methods described above is summarized in Table 1. In terms of error rates our proposed algorithm slightly under-performs DP-GLM on the smaller data sets, although with a dramatic reduction in CPU time; however, it is quite competitive on larger data sets. Note that even though our proposed algorithm is in the same algorithmic class as DP-GLM, it uses a given data point only once and does not require storage of the data points. In contrast, DP-GLM is a batch algorithm that relies on exhaustive computation to compute a model and requires the full history to rebuild the model once a new data point arrives. The new method has one adjustable hyper-parameter, while DP-GLM has multiple hyper-parameters to tune. The DC-RBF method is orders of magnitude faster, since DP-GLM requires Markov chain Monte Carlo methods to reoptimize the clustering as each data point is added, whereas DC-RBF updates instantly.

[Figure 8 appears here: training data, testing data, and model output for the CMB data set.]

Figure 8: The training and testing data and the model output for the CMB data set. The training data set consists of 500 points and the remaining 499 are used for testing. Note that the range of the signal is re-scaled by the standard deviation of the full input data.

Concrete Compressive Strength (CCS) [58]. The CCS data set provides a mapping from an eight-dimensional input space to a continuous output. This data set has low noise and variance. The eight continuous input dimensions are the amounts of cement, blast furnace slag, fly ash, water, super-plasticizer, coarse aggregate and fine aggregate, all measured in kg per m³, and the age of the mixture in days. The response is the compressive strength of the resulting concrete. There are 1,030 observations in this data set. The challenge in working with this data set is to find a continuous mapping that represents the input-output relationship well in this multi-dimensional data.

As with the CMB experiment, we summarize the comparison results for this data set in Table 2. We observe that as the number of observations increases, the performance of the new method improves, due to the nature of our design, which requires a certain number of data points to activate a cloud. From Table 2 we see that in terms of error our method remains competitive with DP-GLM, which is computationally much more demanding.

6 Concluding Remarks

We propose a fast, recursive function approximation method which avoids both the need to store the complete history of data (typical of nonparametric methods) and the specification of hyperparameters for Bayesian priors.

                         Mean Absolute Error              Mean Square Error
Training Set Size     30    50    100   250   500      30    50    100   250   500
DP-GLM               0.58  0.51  0.49  0.48  0.45     1.00  0.94  0.91  0.94  0.83
Bayesian CART        0.66  0.64  0.54  0.50  0.47     1.04  1.01  0.93  0.94  0.84
Bayesian TLM         0.64  0.52  0.49  0.48  0.46     1.10  0.95  0.93  0.95  0.85
Gaussian Process     0.55  0.53  0.50  0.51  0.47     1.06  0.97  0.93  0.96  0.85
Treed GP             0.52  0.49  0.48  0.48  0.46     1.03  0.95  0.95  0.96  0.89
Linear Regression    0.66  0.65  0.63  0.65  0.63     1.08  1.04  1.01  1.04  0.96
CART                 0.62  0.60  0.60  0.56  0.56     1.45  1.34  1.43  1.29  1.41
DC-RBF               0.61  0.53  0.49  0.47  0.46     1.40  1.13  0.98  0.91  0.85

Table 1: This table summarizes the performance of various regression techniques on the CMB data set. The MSEs and MAEs on the testing data are reported for different training set sizes. The presented results are the means of the performances of the various methods over different permutations of the data points selected for the training and testing data sets. For DC-RBF, the hyper-parameter DT is kept fixed for a given data set size; the hyper-parameter could be identified using, for example, 5-fold cross validation or another suitable technique. The hyper-parameters of the other techniques are calculated by sweeping over a set of candidate parameters spanning 3 to 5 orders of magnitude and choosing the parameter that does best on the training set; the numbers for these methods are reported from [24].

                         Mean Absolute Error              Mean Square Error
Training Set Size     30    50    100   250   500      30    50    100   250   500
DP-GLM               0.54  0.50  0.45  0.42  0.40     0.47  0.41  0.33  0.28  0.27
Bayesian CART        0.78  0.72  0.63  0.55  0.54     0.95  0.80  0.61  0.49  0.46
Bayesian TLM         1.08  0.95  0.60  0.35  1.10     7.85  9.56  4.28  0.26  1,232
Gaussian Process     0.53  0.52  0.38  0.31  0.26     0.49  0.45  0.26  0.18  0.14
Treed GP             0.73  0.40  0.47  0.28  0.22     1.40  0.30  3.40  0.20  0.11
Linear Regression    0.61  0.56  0.51  0.50  0.50     0.66  0.50  0.43  0.41  0.40
CART                 0.72  0.62  0.52  0.43  0.34     0.87  0.65  0.46  0.33  0.23
DC-RBF               0.57  0.54  0.51  0.47  0.39     0.54  0.50  0.44  0.40  0.29

Table 2: The performance table for the CCS data. The results are reported in terms of Mean Square Error and Mean Absolute Error for various competing methods, including DP-GLM. The experimental setup is the same as described for the CMB data set.

The method assigns locally parametric models (such as linear polynomials) to regions of the covariate space that are created dynamically from the data, without the need for a prespecified classification scheme. A weighting scheme based on normalized RBFs is associated with each locally linear approximation. Each local model is updated recursively as each new observation arrives; the observation is used to update the cloud representation and is then discarded.
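
To make the recursive update concrete, the sketch below (in Python with NumPy) shows one way such a scheme can be organized: each cloud keeps a center, a visit count, and a locally linear model maintained by recursive least squares; a new cloud is created when an observation falls farther than DT from every existing center; and predictions blend the local models with normalized Gaussian RBF weights. The class and method names (Cloud, DCRBF, observe), the running-mean center update, and the kernel width tied to DT are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np


class Cloud:
    """One local linear model y ~ [1, x]' theta, refined by recursive least squares (RLS)."""

    def __init__(self, x, y, dim, lam=1e3):
        self.center = np.array(x, dtype=float)  # running mean of the inputs assigned here
        self.count = 1
        d = dim + 1                             # intercept plus linear terms
        self.theta = np.zeros(d)
        self.P = lam * np.eye(d)                # RLS covariance; large lam = weak prior
        self._rls(x, y)

    def _rls(self, x, y):
        phi = np.concatenate(([1.0], np.asarray(x, dtype=float)))
        Pphi = self.P @ phi
        gain = Pphi / (1.0 + phi @ Pphi)
        self.theta += gain * (y - phi @ self.theta)
        self.P -= np.outer(gain, Pphi)

    def update(self, x, y):
        self.count += 1
        self.center += (np.asarray(x, dtype=float) - self.center) / self.count
        self._rls(x, y)

    def predict(self, x):
        return np.concatenate(([1.0], np.asarray(x, dtype=float))) @ self.theta


class DCRBF:
    """Illustrative DC-RBF-style model: a new cloud is created whenever a point falls
    farther than DT from every existing center; predictions blend the local linear
    models with normalized Gaussian RBF weights."""

    def __init__(self, dim, DT, width=None):
        self.dim, self.DT = dim, DT
        self.width = width if width is not None else DT  # kernel width tied to DT (assumption)
        self.clouds = []

    def observe(self, x, y):
        """Process one observation and discard it; no data history is stored."""
        if not self.clouds:
            self.clouds.append(Cloud(x, y, self.dim))
            return
        dists = [np.linalg.norm(np.asarray(x, dtype=float) - c.center) for c in self.clouds]
        j = int(np.argmin(dists))
        if dists[j] > self.DT:
            self.clouds.append(Cloud(x, y, self.dim))  # a new region of the covariate space
        else:
            self.clouds[j].update(x, y)                # refine the nearest local model

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        w = np.array([np.exp(-np.sum((x - c.center) ** 2) / (2.0 * self.width ** 2))
                      for c in self.clouds])
        s = w.sum()
        w = w / s if s > 0 else np.full(len(w), 1.0 / len(w))
        return float(sum(wi * c.predict(x) for wi, c in zip(w, self.clouds)))
```

Under this sketch, each call to observe(x, y) touches an observation exactly once, updates (or creates) a single cloud, and then discards the data, which is the storage behavior described above.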

The new method is robust in the presence of homoscedastic and heteroscedastic noise. Our method automatically determines the model order for a given data set (one of the most challenging tasks in nonlinear function approximation). Unlike similar algorithms in the literature, our method has only one tunable parameter and is asymptotically unbiased.

Table 3 provides a comparison of the different statistical methods considered in this paper. This table brings together features such as speed, complexity of implementation, storage requirement, recursivity, the type of noise that can be handled, and the number of tunable parameters.

More specifically, in contrast to DP-GLM, DC-RBF offers orders-of-magnitude faster updates, requires fewer tunable parameters, is far simpler to implement, and can be updated recursively without the need to store the data history.

The power of DC-RBF lies in its adaptability to approximate complex surfaces, its compact model representation, its parsimony in model specification, its recursivity, and its speed. Parametric methods will always suffer from the need to tune the choice of basis functions. Nonparametric methods require the storage of the entire history, which complicates function computations in stochastic optimization settings (our motivating application). DP-GLM, which fits local polynomial approximations to clusters, is extremely slow and cannot be adapted to a recursive setting. Gaussian process models can work very well, but only in particular problem classes with continuous covariates, and they also require more tunable parameters. Gaussian process models, as well as techniques such as splines, also introduce the risk of overfitting.

Our experience with Bayesian methods (in particular, the development of DP-GLM, of which the second author was a developer) is that they are quite sensitive to the tuning of the hyperparameters that make up the priors. This introduces an undesirable degree of human intervention that is likely to limit their usefulness in black-box stochastic optimization algorithms.

DC-RBF appears to satisfy most of our needs for approximating continuous functions in the recursive setting of stochastic optimization. It starts with a linear, parametric model and adds clouds only as the range of covariates expands. It can adapt to quite general surfaces, and requires only that we retain history in the form of a relatively compact set of clouds. It does not require the specification of basis functions (we believe the local linear models should be kept very simple), and it offers fast, recursive updating.

Performance in terms of solution quality is always going to be an evaluation limited to a specific set of experimental tests. We used the newsvendor setting to demonstrate the ability of DC-RBF to adapt to both a sharp, nondifferentiable function (which is difficult to approximate using low-order polynomials) and a smooth function. The method works well, theoretically and experimentally, in the presence of heteroscedastic noise. Finally, it appears to be competitive with much more complex algorithms such as DP-GLM.

DC-RBF does introduce the need to tune a distance threshold parameter DT. This is not a minor requirement, as the choice of DT captures the overall behavior of the surface. We suspect the choice of DT will generally become apparent within any specific problem class. We also doubt that this method would be effective for high-dimensional applications (say, more than 10 or 20 covariates).
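
As a rough illustration of how DT might be identified within a specific problem class, the sketch below performs a k-fold cross-validation sweep over a grid of candidate thresholds. It reuses the illustrative DCRBF interface from the sketch above; the grid, the MSE criterion, and the function name select_DT are assumptions, not a prescribed procedure.

```python
import numpy as np


def select_DT(X, y, candidates, n_folds=5, seed=0):
    """Choose the distance threshold DT by k-fold cross validation on mean squared error.
    Assumes a DCRBF(dim, DT) estimator exposing observe(x, y) and predict(x), as in the
    sketch above."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = {}
    for DT in candidates:
        fold_mse = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = DCRBF(dim=X.shape[1], DT=DT)
            for i in train:
                model.observe(X[i], y[i])      # one pass over the training fold
            fold_mse.append(np.mean([(model.predict(X[i]) - y[i]) ** 2 for i in test]))
        scores[DT] = float(np.mean(fold_mse))
    return min(scores, key=scores.get), scores


# Example sweep over a log-scale grid of candidate thresholds (values are illustrative):
# best_DT, cv_scores = select_DT(X, y, candidates=np.logspace(-1, 1, 5))
```

In the experiments reported above, DT was held fixed for a given data set size, so a sweep of this kind would only be needed when moving to a new problem class.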

We believe that there are opportunities to build on this basic idea to overcome the need to specify DT. A natural extension is to specify a dictionary of possible values of DT (possibly on a log scale), which can be viewed as models at different levels of granularity. We might then estimate a family of approximations, one for each DT, which can then be combined using a hierarchical weighting scheme such as that proposed in [20].
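
The sketch below is one way to picture this extension: it maintains one DC-RBF-style model per candidate DT and blends their predictions with weights derived from each member's running (prequential) squared error. This inverse-error weighting is only a crude stand-in for the hierarchical weighting scheme of [20], and DTEnsemble, like the DCRBF class above, is purely illustrative.

```python
import numpy as np


class DTEnsemble:
    """Maintain one DC-RBF-style model per candidate DT and blend their predictions.
    The inverse running squared-error weighting below is a crude stand-in for the
    hierarchical weighting scheme of [20]; DCRBF is the illustrative class sketched above."""

    def __init__(self, dim, DT_grid):
        self.models = {DT: DCRBF(dim, DT) for DT in DT_grid}
        self.sq_err = {DT: 1e-6 for DT in DT_grid}   # running sums of squared prediction error
        self.n = 0

    def observe(self, x, y):
        # Score each member on the new point before it sees it (prequential error),
        # then pass the observation to every member and discard it.
        self.n += 1
        for DT, m in self.models.items():
            if self.n > 1:
                self.sq_err[DT] += (m.predict(x) - y) ** 2
            m.observe(x, y)

    def predict(self, x):
        w = np.array([1.0 / self.sq_err[DT] for DT in self.models])
        w /= w.sum()
        return float(sum(wi * m.predict(x) for wi, m in zip(w, self.models.values())))
```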

We have shown numerical results on synthetic and real data which demonstrate the success of the algorithm for multi-dimensional function approximation. While a model generated by DP-GLM remains biased toward the early observations (a form of modeling that may not serve well in applications, such as approximate dynamic programming (ADP), that continually update existing information), DC-RBF can adapt very quickly to new information. This type of approximation tool is especially useful for value function approximation in ADP, where there is a stream of exogenous information contributing to the system dynamics. We intend to use this approximation technique for value function approximation in the context of approximate dynamic programming [44] and optimal learning [45] for energy resource allocation under uncertainty.

Algorithm            Spd   Comp   Stor   Rec   Noise   TP
DP-GLM               V     C      A      N     HE      F
Bayesian CART        V     C      A      N     HO      F
Bayesian TLM         F     I      A      N     HO      F
Gaussian Process     V     C      A      N     HO      F
Treed GP             V     C      A      N     HO      F
Linear Regression    U     S      N      Y     HO      N
CART                 F     I      A      N     HO      F
Kernel Regression    F     I      A      N     HO      F
Locally Linear       F     I      A      Y     HO      F
LOWESS               F     I      A      N     HO      F
DC-RBF               U     S      SR     Y     HE      O

Table 3: Benchmark table comparing various statistical function approximation techniques in terms of overall Speed of model updating and evaluation (Ultra fast, Fast, Very slow), Complexity of implementation (Complex, Intermediate, Simple), Storage requirement (All the history, Statistical Representation, None), Recursivity (Yes, No), the type of Noise they can handle (HEteroscedastic, HOmoscedastic), and the number of Tunable Parameters (None, One, Few).

References

[1] C. Andrieu, N. Freitas, and A. Doucet. Robust Full Bayesian Learning for Radial Basis Networks. Neural Computation, 13:2359–2407, 2001.

[2] L. Bennett, M. Halpern, G. Hinshaw, N. Jarosik, A. Kogut, M. Limon, S. S. Meyer, L. Page, D. N. Spergel, G. S. Tucker, E. Wollack, E. L. Wright, C. Barnes, M. R. Greason, R. S. Hill, E. Komatsu, M. R. Nolta, N. Odegard, H. V. Peiris, L. Verde, and J. L. Weiland. First-Year Wilkinson Microwave Anisotropy Probe (WMAP) Observations: Preliminary Maps and Basic Results. The Astrophysical Journal Supplement Series, 148(1):1–27, 2003.

[3] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, U.K., 1995.

[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall/CRC, New York, NY, 1984.

[5] D. S. Broomhead and D. Lowe. Multivariable Functional Interpolation and Adaptive Networks. Complex Systems, 2:321–355, 1988.


[6] G. Bugmann. Normalized Gaussian Radial Basis Function Networks. Neurocomputing, 20(1-3):97–110, 1998.

[7] M. D. Buhmann. Radial Basis Functions. Cambridge University Press, 2003.

[8] H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART Model Search. Journal of the American Statistical Association, 93(443):935–948, 1998.

[9] H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian Treed Models. Machine Learning, 48(1):299–320, 2002.

[10] E. K. P. Chong. An Introduction to Optimization. Wiley, 2008.

[11] W. S. Cleveland. Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

[12] W. S. Cleveland. LOWESS: A Program for Smoothing Scatterplots by Robust Locally Weighted Regression. The American Statistician, 35(1):54, 1981.

[13] W. S. Cleveland and S. J. Devlin. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.

[14] R. L. Eubank. Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York, 1988.

[15] J. Fan. Design-adaptive Nonparametric Regression. Journal of the American Statistical Association, 87(420):998–1004, December 1992.

[16] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, London, 1996.

[17] C. F. Gauss. Theoria Combinationis Observationum Erroribus Minimis Obnoxiae. Gottingae, 1825.

[18] S. Geisser. Predictive Inference: An Introduction. Chapman and Hall/CRC, 1st edition, June 1993.

[19] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, July 2003.

[20] A. George, W. B. Powell, and S. Kulkarni. Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management. Journal of Machine Learning Research, 9:2079–2111, 2008.

[21] F. Girosi, M. Jones, and T. Poggio. Regularization Theory and Neural Network Architectures. Neural Computation, 7:219–269, 1995.

[22] R. B. Gramacy and H. K. H. Lee. Bayesian Treed Gaussian Process Models with an Application to Computer Modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.

[23] H.-M. Gutmann. A Radial Basis Function Method for Global Optimization. Journal of Global Optimization, 19:201–227, 2001.


[24] L. A. Hannah, D. M. Blei, and W. B. Powell. Dirichlet Process Mixtures of Generalized Linear Models. Journal of Machine Learning Research, 12:1923–1953, 2011.

[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.

[26] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.

[27] C. C. Holmes and B. K. Mallick. Bayesian Radial Basis Functions of Variable Dimension. Neural Computation, 10(5):1217–1233, 1998.

[28] T. Ishikawa and M. Matsunami. An Optimization Method Based on Radial Basis Function. IEEE Transactions on Magnetics, 33(2):1868–1871, March 1997.

[29] A. A. Jamshidi and M. J. Kirby. Towards a Black Box Algorithm for Nonlinear Function Approximation over High-Dimensional Domains. SIAM Journal on Scientific Computing, 29(3):941–963, May 2007.

[30] A. A. Jamshidi and M. J. Kirby. Skew-Radial Basis Function Expansions for Empirical Modeling. SIAM Journal on Scientific Computing, 31(6):4715–4743, 2010.

[31] A. A. Jamshidi and M. J. Kirby. Modeling Multivariate Time Series on Manifolds with Skew Radial Basis Functions. Neural Computation, 23(1):97–123, 2011.

[32] R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian. Function Approximation and Time Series Prediction with Neural Networks. IJCNN International Joint Conference on Neural Networks, 1:649–665, 1990.

[33] D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley Professional, 3rd edition, November 1997.

[34] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis. Wiley-Interscience, 3rd edition, April 2001.

[35] J. Moody and C. Darken. Fast Learning in Networks of Locally-tuned Processing Units. Neural Computation, 1:281–294, 1989.

[36] H. G. Muller. Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting. Journal of the American Statistical Association, 82:231–238, 1988.

[37] J. Munkres. Topology. Prentice Hall, 2nd edition, January 2000.

[38] E. A. Nadaraya. On Estimating Regression. Theory of Probability and Its Applications, 9:141–142, 1964.

[39] J. Park and I. W. Sandberg. Universal Approximation Using Radial-Basis-Function Networks. Neural Computation, 3:246–257, 1991.

[40] J. Park and I. W. Sandberg. Approximation and Radial-Basis-Function Networks. Neural Computation, 5:305–316, 1993.

[41] T. Poggio and F. Girosi. Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks. Science, 247:978–982, 1990.


[42] M. J. D. Powell. Radial Basis Functions for Multivariable Interpolation: A Review. In J. C. Mason and M. G. Cox, editors, Algorithms for Approximation, pages 143–167. Clarendon Press, Oxford, 1987.

[43] M. J. D. Powell. The Theory of Radial Basis Functions in 1990. In W. Light, editor, Advances in Numerical Analysis, volume II: Wavelets, Subdivision Algorithms, and Radial Basis Functions, pages 105–210. Oxford University Press, 1992.

[44] W. B. Powell. Approximate Dynamic Programming. Wiley, 2011.

[45] W. B. Powell and I. O. Ryzhov. Optimal Learning. Wiley, 2012.

[46] C. E. Rasmussen and Z. Ghahramani. Infinite Mixtures of Gaussian Process Experts. In Advances in Neural Information Processing Systems 14, pages 881–888. MIT Press, 2001.

[47] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[48] R. G. Regis and C. A. Shoemaker. A Stochastic Radial Basis Function Method for the Global Optimization of Expensive Functions. INFORMS Journal on Computing, 19(4):497–509, 2007.

[49] R. G. Regis and C. A. Shoemaker. Improved Strategies for Radial Basis Function Methods for Global Optimization. Journal of Global Optimization, 37(1):113–135, 2007.

[50] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, pages 318–362. 1986.

[51] D. Ruppert, M. P. Wand, and R. J. Carroll. Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1st edition, July 2003.

[52] B. Shahbaba and R. M. Neal. Nonlinear Models Using Dirichlet Process Mixtures. Journal of Machine Learning Research, 10:1829–1850, 2009.

[53] J. S. Simonoff. Smoothing Methods in Statistics. Springer, June 1996.

[54] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. John Wiley & Sons, New York, 1977.

[55] G. Wahba. Spline Bases, Regularization, and Generalized Cross Validation for Solving Approximation Problems with Large Quantities of Data. In W. Cheney, editor, Approximation Theory III, pages 905–912. Academic Press, 1980.

[56] G. S. Watson. Smooth Regression Analysis. Sankhya, Ser. A, 26:359–372, 1964.

[57] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. dissertation, Division of Applied Mathematics, Harvard University, Boston, MA, August 1974.

[58] I. C. Yeh. Modeling of Strength of High-performance Concrete Using Artificial Neural Networks. Cement and Concrete Research, 28(12):1797–1808, 1998.
