Gated Ensemble Learning Method for Demand-Side Electricity Load Forecasting

Eric M. Burger, Scott J. Moura

Energy, Control, and Applications Lab, University of California, Berkeley

∗Corresponding Author: Eric M. Burger, Email: [email protected], Phone: 317-385-1768

∗∗Affiliation Address: Energy, Control, and Applications Lab (eCAL), 611 Davis Hall, Department of Civil and Environmental Engineering, University of California, Berkeley, Berkeley, CA 94720, USA.

Abstract

The forecasting of building electricity demand is certain to play a vital role in the future power grid. Given the deployment of intermittent renewable energy sources and the ever increasing consumption of electricity, the generation of accurate building-level electricity demand forecasts will be valuable to both grid operators and building energy management systems. The literature is rich with forecasting models for individual buildings. However, an ongoing challenge is the development of a broadly applicable method for demand forecasting across geographic locations, seasons, and use-types. This paper addresses the need for a generalizable approach to electricity demand forecasting through the formulation of an ensemble learning method that performs model validation and selection in real time using a gating function. By learning from electricity demand data streams, the method requires little knowledge of energy end-use, making it well suited for real deployments. While the ensemble method is capable of incorporating complex forecasters, such as Artificial Neural Networks or Seasonal Autoregressive Integrated Moving Average models, this work will focus on employing simpler models, such as Ordinary Least Squares and k-Nearest Neighbors. By applying our method to 32 building electricity demand data sets (8 commercial and 24 residential), we generate electricity demand forecasts with a mean absolute percent error of 7.5% and 55.8% for commercial and residential buildings, respectively.

Keywords: Smart Grid, Buildings, Electricity Forecasting, Ensemble Learning

1. INTRODUCTION

Commercial and residential buildings account for 74.1% of U.S. electricity consumption, more than either the transportation sector or the industrial sector (0.2% and 25.7%, respectively) [1]. Maintaining a continuous and instantaneous balance between generation and load is a fundamental requirement of the electric power system [2]. To reliably match supply with demand, the forecasting of grid-level electricity loads has long been a central part of the planning and management of electrical utilities [3]. The accuracy of these forecasts has a strong impact on the reliability and cost of power system operations. Trends, such as vehicle electrification and distributed renewable generation, are expected to pose new challenges for grid operators and may undermine the accuracy of load forecasts.

To improve the accuracy of electricity demand forecasts and aid in the management of power systems, recent attention has been placed on short-term building-level electricity demand forecasting using a wide range of models [4][5]. The ability to accurately and adaptively forecast demand-side loads will play a critical role in maintaining grid stability and enabling renewables integration. Additionally, many novel optimal control schemes, under research umbrellas such as demand response and microgrid management, require short-term building electricity demand forecasts to aid in decision making [6].

The supply-side and load-side time series forecasting of electricity demand has been a topic of research for many decades. The literature is filled with a variety of well-cited modelling approaches, each differing in algorithmic complexity, estimation procedure, and computational cost. Of particular note are the variants of Artificial Neural Networks (ANN) [3][4][5][7][8][9][10], Support Vector Regression (SVR) [11][12][13][14] and Autoregressive Integrated Moving Average (ARIMA) models [3][12][13][15][16][17][18]. Lesser but nonetheless noteworthy attention has been given to approaches such as Multiple Linear Regression [3][11][19], Fuzzy Logic [3][20], Decision Trees [4], and k-Nearest Neighbors (k-NN).

These studies provide a broad catalog of use-cases and demonstrate the performance of certain forecasting algorithms when applied to specific building types. In particular, [3][4][10][17] provide a survey of electricity forecasting methods and a high-level comparison of techniques. [8] provides a detailed description of ANNs and their application to load forecasting, including data pre-processing and ANN architectures. [5] details the development of a seasonal ANN approach and the advantage over a Seasonal ARIMA (SARIMA) model when applied to 6 building datasets. [18] focuses on the introduction of motion sensor data to improve the accuracy of an ARIMA model. In [9][11][15][18][20], the authors perform an in-depth analysis of the power demand patterns of a particular building in order to customize a forecasting model.

In papers with experimental results, the authors have generally applied their electricity demand forecasting technique to only a small number of datasets. Consequently, the literature is rich with forecasting algorithms customized for individual buildings. This leads us to the following question: Is it possible to design a single minimally-customized forecasting algorithm that is widely applicable across a diversity of building types, enabling scalability? We pursue this question by proposing a novel ensemble learning method for electricity demand forecasting.

Specifically, due to unique building characteristics, occupancy patterns, and individual energy use behaviors, we argue that no single model structure is capable of accurately forecasting electricity demand across all commercial and residential buildings.

For example, some forecasting models may produce accurate predictions under certain observable or unobservable conditions, such as a seasonal trend, a morning routine, or an extended absence. Other models may be ideal for buildings with energy use behaviors that are stable over long periods of time. For buildings with frequent changes in occupancy patterns, models that are trained over a moving horizon may yield the highest accuracy. In short, this work will develop an ensemble learning method that trains and validates multiple forecasting models before applying a gating method to select a single model to perform electricity demand forecasting.

In this way, the ensemble method is able to learn from real-time data and to produce short-term electricity demand forecasts that are automatically tailored to a particular building and instance in time. In addition to forecast accuracy, this paper will place an emphasis on method adaptability and ease of use. While we have implemented certain forecasting models, the method is intended to allow the models to be interchangeable.

To demonstrate the use of our ensemble method to produce short-term forecasts, this paper includes 3 experimental studies: Single Model Studies, Multiple Model Study, and Residential Study. For each of these studies, we will make the following assumptions with respect to the availability of building electricity demand data:

A1. We have access to hourly historical building electricity demand at the meter.

A2. We have access to hourly historical weather data near the building location.

A3. We do not have access to submetered electricity demand data or building operations data, such as occupancy measurements or mechanical system schedules.

The limited access to input data with which to produce forecasts is representative of the challenge faced by grid operators. Accordingly, this paper will demonstrate the potential of our ensemble method to non-invasively forecast total electricity demand using data-driven methods. Additionally, unlike in [9][11][15][18][20], where the authors perform an in-depth analysis of the power demand patterns in order to customize a model to a particular building, this paper will focus on developing a forecasting approach that is generally applicable to all buildings without customization.

This paper is organized into five sections: Regression Models, Single Model Studies, Ensemble Method, Multiple Model Study, and Residential Study. Section II. Regression Models briefly presents background theory for 5 regression models that will be employed in this paper. In Section III. Single Model Studies, we apply the forecasting models to 8 commercial/university building electricity demand datasets using batch and moving horizon training approaches. Section IV. Ensemble Method presents our method for training and validating multiple models and for selecting the optimal model using a gating method. Section V. Multiple Model Study applies our ensemble learning method to 8 commercial/university building electricity demand datasets and quantifies and qualifies the advantage over a single model approach. Finally, in Section VI. Residential Study, we apply our ensemble learning method to 24 residential building electricity demand datasets and summarize the results. Key conclusions and future research directions are summarized in Section VII.

2. Regression Models

In this paper, we will consider one parametric regression model, Ordinary (Linear) Least Squares with ℓ2 Regularization (Ridge), and four nonparametric models, Support Vector Regression with Radial Basis Function (SVR), Decision Tree Regression (DTree), k-Nearest Neighbors with uniform weights and binary tree data structure (k-NN), and Multilayer Perceptron (MLP), a popular type of feedforward Artificial Neural Network (ANN). In this section, we will briefly describe the structure of each regression model.

2.1. Ordinary Least Squares with `2 Regularization

Ordinary Least Squares with ℓ2 Regularization (Ridge) fits a linear model with coefficients w ∈ R^n to minimize the residual sum of squared errors between the observed and predicted responses while imposing a penalty on the size of coefficients according to their ℓ2-norm. The linear model of a system with univariate output is given by

$$y = w_0 x_0 + w_1 x_1 + \dots + w_n x_n = \sum_k w_k x_k = w^T x \tag{1}$$

with variables x ∈ R^n, the model input, y ∈ R, the predicted response, n, the number of inputs or features in x, and k = 1, . . . , n.

The linear model is trained on a set of inputs and observed responses by optimizing the function

$$\underset{w}{\text{minimize}} \;\; \sum_i \| w^T x_i - y_i \|_2^2 + \lambda \| w \|_2^2 \tag{2}$$

with variables x_i ∈ R^n, the model input for the i-th data point, y_i ∈ R, the i-th observed response, w ∈ R^n, the weighting coefficients, and i = 1, . . . , N, where N is the number of data samples and n is the number of features in x_i. Lastly, λ is a weighting term for the regularization penalty.

For a system with a multivariate output y ∈ R^m, we will treat the outputs as uncorrelated and define a set of coefficients w_j ∈ R^n for each predicted response y_j ∈ R for j = 1, . . . , m. Thus, the multivariate linear model is given by

$$y_j = w_j^T x, \quad \forall j = 1, \dots, m \tag{3}$$

The weights of the multivariate model are determined by optimizing the function:

$$\underset{w}{\text{minimize}} \;\; \sum_i \sum_j \| w_j^T x_i - y_{i,j} \|_2^2 + \sum_j \lambda \| w_j \|_2^2 \tag{4}$$

with variables x_i ∈ R^n, the model input, y_i ∈ R^m, the observed multivariate response, w_j ∈ R^n, the weighting coefficients of the j-th response, i = 1, . . . , N, and j = 1, . . . , m, where N is the number of data samples, n is the number of features in x_i, and m is the number of observations in y_i.
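As a concrete illustration of this multivariate Ridge formulation, the sketch below fits an independent set of coefficients per output using scikit-learn (listed among the paper's software packages). The 24-lag input window, 6-hour output horizon, synthetic data, and the value of λ (alpha) are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative dimensions: 24 lagged demand values in, 6 hourly predictions out.
n_lags, horizon = 24, 6

rng = np.random.default_rng(0)
demand = rng.uniform(50, 150, size=2000)  # stand-in for an hourly demand series (kW)

# Build (X, Y) pairs: x_i = previous 24 hours, y_i = next 6 hours.
X = np.array([demand[t - n_lags:t] for t in range(n_lags, len(demand) - horizon)])
Y = np.array([demand[t:t + horizon] for t in range(n_lags, len(demand) - horizon)])

# Ridge minimizes sum_i ||w^T x_i - y_i||^2 + alpha * ||w||^2 independently for each
# output column, i.e. the uncorrelated multivariate treatment of Eqs. (3)-(4).
model = Ridge(alpha=1.0)
model.fit(X, Y)

y_hat = model.predict(X[-1:])  # 6-hour forecast from the latest 24-hour window
print(y_hat.shape)             # (1, 6)
```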

2.2. Support Vector Regression

In non-linear Support Vector Regression (SVR), the model input x is transformed into a higher dimensional feature space using a mapping function φ. Then, a linear model is constructed in this feature space, as given by

$$y = f(x) = w^T \phi(x) + b \tag{5}$$

where variable b ∈ R is a bias term, w ∈ R^n the linear coefficients, and φ a non-linear mapping function. Using an ε-insensitive loss function, the support vector regression model can be trained by optimizing

$$\begin{aligned}
\underset{w, b, \zeta, \zeta^*}{\text{minimize}} \quad & \frac{1}{2} \| w \|_2^2 + C \sum_i (\zeta_i + \zeta_i^*) \\
\text{subject to} \quad & y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i \\
& w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^* \\
& \zeta_i, \zeta_i^* \ge 0
\end{aligned} \tag{6}$$


where constant ε ∈ R denotes the radius of the ε-insensitive region (i.e. the region in which the value of the loss function is 0) and constant C > 0 denotes the trade-off between the empirical risk (i.e. deviations beyond ε) and the regularization term (i.e. the flatness of the model). The positive slack variables ζ_i ∈ R and ζ_i* ∈ R denote the magnitude of the deviation from the ε-insensitive region.

By applying Lagrangian multipliers and Karush-Kuhn-Tucker conditions, the primal problem can be reformulated into a dual form that is easier to solve.

$$\begin{aligned}
\underset{\alpha, \alpha^*}{\text{minimize}} \quad & \frac{1}{2} \sum_i \sum_j (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) + \varepsilon \sum_i (\alpha_i + \alpha_i^*) - \sum_i y_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_i (\alpha_i - \alpha_i^*) = 0 \\
& 0 \le \alpha_i, \alpha_i^* \le C
\end{aligned} \tag{7}$$

with variables α_i, α_i* ∈ [0, C], the Lagrangian multipliers, and K, the kernel function given by the inner product of the mapping functions, K(x_i, x_j) = φ(x_i)^T φ(x_j). In this paper, we will employ the (Gaussian) Radial Basis Function kernel, given by

$$K(x_i, x) = \exp\left( -\frac{\| x_i - x \|_2^2}{2 \sigma^2} \right) \tag{8}$$

where variable σ ∈ R is a free parameter. Using the Lagrangian multipliers, the support vector regression model with univariate output can be rewritten as

$$y = f(x) = \sum_i (\alpha_i - \alpha_i^*) K(x_i, x) + b \tag{9}$$

For a system with a multivariate output y ∈ R^m, we will treat the outputs as uncorrelated and define m support vector regression models (i.e. one for each output). Using m models to independently predict each of the m outputs is simple to implement, but may be less accurate than a single model capable of simultaneously predicting all m outputs. We refer the reader to [21][22] for a more detailed description of the support vector regression algorithm.
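A minimal sketch of this one-model-per-output arrangement, using scikit-learn's RBF-kernel SVR, is shown below; the C, epsilon, and gamma values and the toy data are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 24))   # e.g. 24 lagged demand values per sample
Y = rng.normal(size=(500, 6))    # e.g. 6-hour-ahead demand targets

# MultiOutputRegressor fits one independent epsilon-SVR (RBF kernel) per output,
# mirroring the "m uncorrelated models" treatment described above.
svr = MultiOutputRegressor(SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale"))
svr.fit(X, Y)

print(svr.predict(X[:1]).shape)  # (1, 6)
```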

2.3. Decision Tree Regression

In Decision Tree Regression (DTree), the feature space is recursively sub-divided or partitioned into smaller regions or leaves (i.e. terminal nodes). Once this splitting is complete, a simple regression model is fit to the data samples that have been grouped into each leaf. The objective is to form a tree data structure with which a new observation can be assigned to a leaf using nested boolean logic (i.e. moving right or left down each branch according to a threshold value). Once the correct leaf has been identified, the observation can be mapped to a continuous target value using a simple model.

Expressed mathematically, we can represent the data at node a by B. For each potential division θ = (k, t_a) composed of feature k and threshold value t_a ∈ R, we may partition the data into two subsets or branches, B_L(θ) and B_R(θ), given by

$$\begin{aligned}
B_L(\theta) &= (x, y) \;\; \text{if } x_k \le t_a \\
B_R(\theta) &= (x, y) \;\; \text{if } x_k > t_a
\end{aligned} \tag{10}$$

The accuracy of a decision tree regression model can be represented by the impurity of each branch. In this paper, we will define the impurity function H as the mean squared error between each response and the mean response in a branch. Thus, for node a, H is given by

$$\begin{aligned}
\bar{y}_a &= \frac{1}{N_a} \sum_{i=1}^{N_a} y_{a,i} \\
H(X_a) &= \frac{1}{N_a} \sum_{i=1}^{N_a} (y_{a,i} - \bar{y}_a)^2
\end{aligned} \tag{11}$$

with variable N_a ∈ [N_min, N], the number of data points in branch a, N_min, the minimum number of data points in a branch, X_a, the set of all data points x ∈ R^n in branch a, and y_a, the set of all observed responses y ∈ R in branch a.

The decision tree is trained or grown by recursively selecting the parameters that minimize the impurity of the tree and of each branch. In other words, we begin with a node a containing all data points. Then, we partition the data according to the optimization function

$$\theta^* = \underset{\theta}{\operatorname{argmin}} \;\; \frac{N_L}{N_a} H(B_L(\theta)) + \frac{N_R}{N_a} H(B_R(\theta)) \tag{12}$$

with variables N_L, N_R ∈ [N_min, N_a], the number of data points in the left and right branches, respectively. The optimization function is recursively applied to each new branch until the maximum tree depth is reached, one of the resulting branches would contain less than N_min data points, or there is no split that will decrease the impurity of the branch by more than some threshold δ ∈ R.

For each leaf a, we will define the regression model as the mean of the contained univariate observations.

$$y = \frac{1}{N_a} \sum_{i=1}^{N_a} y_{a,i} \tag{13}$$

For a system with multivariate output y ∈ R^m, each leaf in the tree will store observations of length m rather than 1. Therefore, the impurity function H can be redefined as the mean of the mean squared error between each j-th response and the mean j-th response in a branch for j = 1, . . . , m.

$$\begin{aligned}
\bar{y}_{a,j} &= \frac{1}{N_a} \sum_{i=1}^{N_a} y_{a,i,j} \quad \forall j = 1, \dots, m \\
H(X_a) &= \frac{1}{m N_a} \sum_{j=1}^{m} \sum_{i=1}^{N_a} (y_{a,i,j} - \bar{y}_{a,j})^2
\end{aligned} \tag{14}$$

And the multivariate regression model can be defined as

$$y_j = \frac{1}{N_a} \sum_{i=1}^{N_a} y_{a,i,j} \quad \forall j = 1, \dots, m \tag{15}$$

We refer the reader to [23][4] for a more detailed description of the decision tree regression algorithm.
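For illustration, the sketch below grows a multi-output regression tree with scikit-learn, whose default squared-error splitting criterion matches the mean-squared-error impurity of Eq. (11); the depth and minimum-samples settings and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 24))   # lagged demand features
Y = rng.normal(size=(500, 6))    # 6-hour-ahead targets

# max_depth and min_samples_leaf bound the recursive splitting described above
# (the maximum tree depth and the N_min data points per branch).
tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=10)
tree.fit(X, Y)                   # multi-output leaves store length-6 mean responses

print(tree.predict(X[:1]))       # leaf mean, shape (1, 6)
```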

2.4. k-Nearest Neighbors Regression

In k-Nearest Neighbors Regression (k-NN), an input x ∈ R^n is mapped to a continuous output value according to the weighted mean of the k nearest data points or neighbors, as defined by the Euclidean distance. In this paper, we will use uniform weights. In other words, each point in a neighborhood a contributes uniformly and thus the predicted univariate response y ∈ R is the mean of the k-nearest neighbors.

$$y = \frac{1}{k} \sum_{i=1}^{k} y_{a,i} \tag{16}$$

with variable y_a, the set of k observed responses y ∈ R in neighborhood a. For a system with multivariate output y ∈ R^m, the model is defined as the mean of each observation j over the k-nearest neighbors.

$$y_j = \frac{1}{k} \sum_{i=1}^{k} y_{a,i,j} \quad \forall j = 1, \dots, m \tag{17}$$

Given a new input x, it is possible to determine the neighborhood by computing the Euclidean distance (i.e. ℓ2-norm of the difference) between the new input x and every data point in the training data set x_i for i = 1, . . . , N and then ordering the distances to identify the nearest neighbors. However, this brute-force search is computationally inefficient for large datasets.

To improve the efficiency of the neighborhood identification, the training data points are partitioned into a tree data structure. A commonly used approach for organizing points in a multi-dimensional space is the ball tree data structure, a binary tree in which every node defines a D-dimensional hypersphere or ball. At each node, data points are assigned to the left or right balls according to their distance from the ball's center. At each terminal node or leaf, the data points are enumerated inside the ball.

We refer the reader to [24] for a description of ball tree construction algorithms.
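As a sketch of this configuration, the snippet below uses scikit-learn's k-NN regressor with uniform weights and a ball tree index, per the description above; the choice of k and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 24))   # lagged demand features
Y = rng.normal(size=(500, 6))    # 6-hour-ahead targets

# Uniform weights -> the prediction is the plain mean of the k nearest targets
# (Eqs. 16-17); algorithm="ball_tree" builds the binary ball tree used for lookup.
knn = KNeighborsRegressor(n_neighbors=10, weights="uniform", algorithm="ball_tree")
knn.fit(X, Y)

print(knn.predict(X[:1]).shape)  # (1, 6)
```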

2.5. Multilayer Perceptron Artificial Neural Network

Artificial Neural Networks (ANN) are a class of statistical learning algorithms inspired by the neurophysiology of the human brain. These numerical models are composed of interconnected “neurons” which use stimulation thresholds to predict how a system will respond to inputs.

The most popular feedforward (i.e. no feedback) neural network is the multilayer perceptron (MLP). Figure 1 illustrates an example of a 3 layer perceptron network consisting of a 3 neuron input layer, 4 neuron hidden layer, and 2 neuron output layer. The structure of a neuron in the hidden layer is presented in Figure 2, where variables x1, x2, x3 ∈ R are inputs to the neuron and w1, w2, w3 ∈ R are synaptic weights. Variable y ∈ {0, 1} is the output and z1, z2 ∈ R are the synaptic weights of the next neurons in the network. The weighted sum of the inputs is the excitation level of the neuron

$$v = \sum_{i=1}^{n} w_i x_i - h \tag{18}$$


Figure 1: Artificial Neural Network (example 3-layer perceptron with input, hidden, and output layers)

Figure 2: Structure of a Neuron (inputs x1, x2, x3; synaptic weights w1, w2, w3; output y; downstream weights z1, z2)

where variable v ∈ R is the excitation, h ∈ R is a threshold, and n is the number of inputs to the neuron. Next, we want to define the output or activation function f of the neuron such that if v ≥ 0 then y = 1, otherwise y = 0. The simplest activation function is the hard limiter,

$$y = f(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases} \tag{19}$$

However, the hard limiter function cannot be practically implemented because it is not differentiable. Thus, artificial neural network algorithms employ differentiable activation functions that have horizontal asymptotes at both 0 and 1 (i.e. lim_{v→∞} f(v) = 1 and lim_{v→−∞} f(v) = 0). An example is the sigmoid function,

$$y = f(v) = \frac{1}{1 + e^{-v}} \tag{20}$$

For the neurons in the output layer, it is common to use a linear activation function,

$$y = f(v) = v \tag{21}$$

The structure of the multilayer perceptron artificial neural network makes it capable of both univariate and multivariate predictions. Networks are trained using a backpropagation algorithm which adjusts the weights and thresholds of each neuron to minimize the error between the observations and the network outputs. In this paper, we will employ a 3 layer feedforward ANN with a 30 neuron sigmoid hidden layer and 6 neuron linear output layer. The size of the linear input layer will vary. The ANN will be trained using gradient descent backpropagation for 200 epochs.

We refer the reader to [25] for a further discussion of artificial neural networks and backpropagation training algorithms.
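The paper trains its networks with PyBrain; purely as an illustrative stand-in, the sketch below builds an analogous network (30 logistic-sigmoid hidden neurons, a linear 6-output layer, gradient-descent training for 200 epochs) with scikit-learn's MLPRegressor. All hyperparameter values shown are assumptions for the sketch, not the authors' exact configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 55))   # e.g. demand + time (DT) input, 55 features
Y = rng.normal(size=(500, 6))    # 6-hour-ahead targets

# One 30-neuron hidden layer with logistic (sigmoid) activation; the regressor's
# output layer is linear. Plain SGD for 200 iterations approximates the
# "gradient descent backpropagation for 200 epochs" setup described above.
mlp = MLPRegressor(hidden_layer_sizes=(30,), activation="logistic",
                   solver="sgd", learning_rate_init=0.01, max_iter=200)
mlp.fit(X, Y)

print(mlp.predict(X[:1]).shape)  # (1, 6)
```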

2.6. Software Packages

This work employs the PyBrain Artificial Neural Network library [26] and the Sci-Kit Learn Ordinary Least Squares, Support Vector Regression, Decision Tree, and k-Nearest Neighbors libraries [27]. All plots are generated with Matplotlib [28].

3. SINGLE MODEL STUDIES

To motivate the advantage of our ensemble approach, we will begin by considering single model approaches to electricity demand forecasting. In this section, we will apply the regression models above to 8 building datasets containing 2 years of metered hourly electricity demand (kW). This time-series data has been provided by the facilities team at the University of California, Berkeley and will be used as the observation data for each forecasting model. Submetered electricity demand data and building operations data, such as occupancy measurements and mechanical system schedules, are not available.

The 8 buildings are located on the University of California, Berkeley campus and have been selected to represent a heterogeneous population. The buildings A, B, and D are occupied by the physics, civil engineering, and environmental engineering departments, respectively, and are primarily comprised of faculty offices and research laboratories. Building C is a university library and building E houses faculty and departmental offices for multiple humanities departments. Buildings F and G are university administrative buildings and building H is comprised mainly of lecture halls and classrooms. Table 1 presents the square footage as well as basic statistics regarding the electricity demand of each building.

Building               A     B     C     D     E     F     G     H
Size (10^3 sq.ft.)    97   140    67   142   306   111   153   140
Mean (kW)            333    33    96   109   113   114    23    73
SD (kW)               44     5    26     8    36    33     2    30
Min (kW)             190    20    48    61    60    69     5    29
Max (kW)             602    69   221   150   271   236    73   195

Table 1: Building Electricity Demand Statistics

The models will be used to generate short-term multivariate electricity demand forecasts, specifically 6 consecutive hourly electricity demand predictions (i.e. y ∈ R^6). The accuracy of each forecast ŷ will be measured by the root mean squared error (RMSE). To allow for comparison between buildings, the performance of forecasting models will be measured by the mean absolute percent error (MAPE).

$$\text{Forecast } i \text{ RMSE} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} (y_{i,j} - \hat{y}_{i,j})^2} \tag{22}$$

$$\text{Model MAPE} = \frac{100\%}{mN} \sum_{i=1}^{N} \sum_{j=1}^{m} \left| \frac{y_{i,j} - \hat{y}_{i,j}}{y_{i,j}} \right| \tag{23}$$

with variables y_i ∈ R^m, the i-th observation, and ŷ_i ∈ R^m, the i-th prediction, where m represents the number of outputs in the prediction (m = 6) and N, the number of predictions.
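A small sketch of these two error metrics, written directly from Eqs. (22)-(23); the array shapes and example values are assumptions for illustration.

```python
import numpy as np

def forecast_rmse(y_true, y_pred):
    """RMSE of a single m-step forecast, Eq. (22)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def model_mape(Y_true, Y_pred):
    """MAPE over N forecasts of m steps each, Eq. (23), in percent."""
    return 100.0 * np.mean(np.abs((Y_true - Y_pred) / Y_true))

# Example with N = 2 forecasts of m = 6 hourly values (kW).
Y_true = np.array([[100, 102, 98, 97, 105, 110],
                   [111, 108, 104, 99, 95, 96]], dtype=float)
Y_pred = np.array([[ 98, 101, 99, 99, 103, 108],
                   [114, 110, 101, 97, 96, 94]], dtype=float)

print(forecast_rmse(Y_true[0], Y_pred[0]))  # RMSE of the first forecast
print(model_mape(Y_true, Y_pred))           # overall MAPE (%)
```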

The regression models will employ 4 different input types: electricity demand (D), time (T), electricity demand and time (DT), and electricity demand, time, and exogenous weather data (DTE). The electricity demand input type (D) consists of the 24 hourly records that precede the desired forecast (x ∈ R^24). The time input type (T) is the current weekday and hour represented as a sparse binary vector (x ∈ {0, 1}^31). The demand and time input type (DT) combines the demand and time inputs (x ∈ R^55). The demand, time, and exogenous weather data input type (DTE) is the demand and time input with current outdoor air temperature (°C) and relative humidity (%RH) data retrieved from a local weather station (x ∈ R^57) [29]. The output of each forecasting model is a prediction of the hourly electricity demand over the following 6 hours (y ∈ R^6).

Throughout this paper, we will use the term “regression model” to refer to the model structure and algorithm used to perform regression, “input type” to refer to the subset of features used by each model, and “forecasting model” to refer to the pairing of regression model and input type.
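The sketch below assembles the four input types from an hourly demand series and a timestamp, following the dimensions stated above (24 lagged values; a 7-weekday + 24-hour sparse binary time vector of length 31; their concatenations of length 55 and 57). The helper name and the pandas-based indexing are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import pandas as pd

def build_input(demand, temp, rh, t, input_type="DT"):
    """Build the feature vector for a forecast issued at hourly timestamp t.

    demand, temp, rh: pandas Series indexed by hourly timestamps.
    Returns x with 24 (D), 31 (T), 55 (DT), or 57 (DTE) features.
    """
    lags = demand.loc[:t].iloc[-24:].to_numpy()      # preceding 24 hourly records (D)

    time_vec = np.zeros(31)                          # sparse binary weekday/hour vector (T)
    time_vec[t.dayofweek] = 1.0                      # positions 0-6: weekday
    time_vec[7 + t.hour] = 1.0                       # positions 7-30: hour of day

    if input_type == "D":
        return lags
    if input_type == "T":
        return time_vec
    if input_type == "DT":
        return np.concatenate([lags, time_vec])      # 55 features
    if input_type == "DTE":
        exo = np.array([temp.loc[t], rh.loc[t]])     # outdoor temp (C), RH (%)
        return np.concatenate([lags, time_vec, exo]) # 57 features
    raise ValueError(input_type)
```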

3.1. Batch Study

For each of the 8 buildings, we train the forecasting models with demand data from that building. Electricity demand data from one building is not used to fit the models of another building. We consider 5 regression models (Ridge, SVR, DTree, k-NN, and ANN) and 4 input types (D, T, DT, and DTE) for a total of 20 forecasting models per building. The forecasting models for each building are trained in a batch manner (i.e. trained once on a large dataset) using 18 months of hourly input data from January 1st, 2012, to July 1st, 2013 (i.e. 13,128 training data points). For each model, the training dataset depends on the input type (D, T, DT, and DTE) but may include hourly electricity demand records for the respective building (D), time records represented as a sparse binary vector (T), and hourly outdoor air temperature and relative humidity records (E).

For validation, the trained models are used to generate a 6 hour electricity demand forecast (y ∈ R^6) for each building for every hour from July 1st, 2013, to January 1st, 2014 (i.e. 4,416 testing data points). The results of the batch regression study are presented in Figure 3. In the figure, each data point represents the MAPE of one forecasting model (i.e. 8 buildings with 20 models each for a total of 160 models).

Examples of the electricity demand forecasts produced by the Building E Ridge regression model with demand (D) input are presented in Figure 4. Note that each blue line represents a multivariate forecast ŷ and that the figure plots ŷ starting at but excluding the most recent power demand record.


Figure 3: Batch Study Results (MAPE (%) for Buildings A–H; markers distinguish regression model (Ridge, SVR, DTree, k-NN, ANN) and input type (D, T, DT, DTE))

Figure 4: Building E Ridge Forecast Sample (actual demand and Ridge with D forecasts, power demand in kW, 8/6/13–8/8/13)

By comparing the results in Figure 3 for each building, we can immediately distinguish forecasters that consistently perform poorly (e.g. T input) from forecasters that perform well (e.g. ANN, Ridge, and k-NN with DT input). We also observe that certain forecasting models perform inconsistently across the different buildings. For example, SVR with D input performs well in Building A but relatively poorly in Buildings E and F. Furthermore, there is dispersion among the results, particularly in Buildings E, F, and H. This dispersion represents a challenge for building level applications. To produce the best results using a batch approach, an engineer must perform model selection for every deployment. Just because a certain regression model and input type has performed well for one building does not guarantee it will do the same for another building.

As shown, an ANN model outperforms the Ridge, SVR, DTree, and k-NN models in 7 out of 8 buildings. However, there is no input type that performs best in each building. In Building B, we observe that the addition of the exogenous weather input decreases the performance of the ANN. A possible explanation is that the ANN found a link between the weather input and the electricity demand in the training data. However, this link may not have continued to appear in the test data, resulting in a loss in performance.

Additionally, it could be argued that the ANN model does not outperform the much simpler Ridge and k-NN models to an extent that warrants the additional complexity, particularly in Buildings D and G. If further tuned to a particular building and input type and trained over a higher number of epochs, it is possible that the ANNs' performances could be further improved. However, this increase in forecaster accuracy would come at a high computational cost. Instead, this work will focus on developing a computationally efficient ensemble method for producing electricity demand forecasts by learning from data streams and adapting to individual building energy use patterns.

3.2. Moving Horizon Study

In the batch regression study presented above, each forecasting model was trained once on 18 months of data and then used to generate predictions up to 6 months after the last training data point. For buildings with very consistent electricity demand patterns, using such a large training set will help to prevent overfitting and to produce the best possible results. For buildings with inconsistent or changing electricity demand patterns, the absence of the most recent data from the training set may limit the performance of the forecasting model.


Figure 5: Moving Horizon Study Results (MAPE (%) for Buildings A–H; Ridge, SVR, DTree, and kNN with D, T, DT, and DTE inputs; the dotted line marks the minimum batch MAPE for each building)

In this section, we will consider training the models over a moving horizon. Specifically, at each hour between July 1st, 2013, and January 1st, 2014, we will train the forecasting models for each building on the 3 months of data that precede that time step (i.e. 2,016 training data points). Then, we will produce a 6 hour forecast (y ∈ R^6) and record the errors. In this way, the start and end times of the training set will move relative to the current time step and we can hope to better capture recent electricity demand patterns. It should be noted that because we have significantly reduced the training set size, we have introduced the potential for overfitting the models, a point that will be addressed by our ensemble method. We will consider 4 regression models (Ridge, SVR, DTree, and k-NN) and 4 input types (D, T, DT, and DTE) for a total of 16 forecasting models per building. The computational cost of training an ANN makes the model unsuitable for a moving horizon approach.
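A minimal sketch of this moving horizon procedure is shown below: at each hourly step, the model is refit on the trailing window of samples and then issues a 6-hour forecast. The window length in samples, the feature/target alignment, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def moving_horizon_forecasts(X, Y, window=2016, start=2016):
    """Refit on the trailing `window` samples at each step and forecast 6 hours ahead.

    X: (T, n_features) feature matrix aligned so that row t forecasts hours t+1..t+6.
    Y: (T, 6) matrix of observed 6-hour-ahead targets.
    """
    preds = []
    for t in range(start, X.shape[0]):
        model = Ridge(alpha=1.0)                     # any forecaster could be swapped in
        model.fit(X[t - window:t], Y[t - window:t])  # trailing ~3 months of hourly samples
        preds.append(model.predict(X[t:t + 1])[0])   # 6-hour forecast issued at step t
    return np.array(preds)

# Example with synthetic data standing in for one building's feature/target pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(2200, 55))
Y = rng.normal(size=(2200, 6))
print(moving_horizon_forecasts(X, Y).shape)          # (184, 6)
```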

The results of the moving horizon regression study are presented in Figure 5. In the figure, each data point represents the MAPE of one forecasting model (i.e. 8 buildings with 16 models each for a total of 128 models). The horizontal dotted line marks, for each building, the highest performing forecasting model (lowest MAPE) from the batch regression study. In other words, for Buildings B, C, D, E, and H, the dotted line indicates the MAPE of the ANN model with DT input. For Buildings A and F, the line marks the MAPE of the ANN model with DTE input. For Building G, Ridge with DT produced the lowest MAPE in the batch study.

While none of the models trained on a moving horizon show a significant improvement in accuracy over the best batch model, the results for each individual building are generally less dispersed in the moving horizon cases than in the batch cases.

Because we have generated 16 forecasts for every hour between July 1st, 2013, and January 1st, 2014, we are able to compare the performance of the forecasting models at each time step and across all 8 buildings. This analysis will aid in determining which models to include in the ensemble method. Figure 6 shows the fraction of time steps that a specific regression model produced the most accurate (lowest RMSE) electricity demand prediction (regardless of input type). Here, we see that k-Nearest Neighbors models produce the best forecast with the highest frequency of any of the regression models considered. Because we plan to incorporate several forecasters in the ensemble method, we are also interested in identifying models that consistently rank among the top. Figure 7 shows the fraction of time steps that a specific regression model produced a forecast that was among the best four predictions. Here, we see that Ridge models most consistently produced such a prediction. Repeating this analysis for input types (not displayed), we find that every input scores between 20% and 30% for both the top and top four predictions, suggesting no clear relative advantage.

Figure 6: Best Prediction (kNN: 35.7%, DTree: 26.7%, Ridge: 19.8%, SVR: 17.7%)

Figure 7: Best Four Predictions (Ridge: 33.1%, kNN: 25.3%, DTree: 22.2%, SVR: 19.4%)

Figure 8: Worst Prediction (D: 40.0%, T: 29.0%, DTE: 15.6%, DT: 15.4%)

Figure 9: Worst Four Predictions (D: 32.5%, T: 28.4%, DTE: 26.1%, DT: 13.0%)

Recognizing that a specific forecasting model may have the highest accuracy in one time step and the lowest accuracy in the next, we are also interested in which regression models and inputs show the poorest performance (highest RMSE). For the model that generates the worst forecast at each time step (not displayed), Decision Tree Regression ranks poorest (44%) followed by Ridge (31%). For the worst four predictions, Decision Tree Regression ranks poorest again (42%) followed again by Ridge (26%). Figures 8 and 9 show the fraction of time steps that a specific input type produced the least accurate electricity demand prediction. Not surprisingly, the demand only input (D) ranks worst in both cases. For the buildings studied, exogenous weather inputs do not appear to significantly improve the forecast accuracy. Buildings located in less temperate climates could be expected to show greater correlation between temperature and electricity demand.

3.3. Single Model Study Conclusions

Based on the results from the batch and moving horizon studies, we assert that no single forecasting model (regression model and input type) will produce the best results across every building. To produce good results using a single model approach, an engineer must perform model and feature selection for each deployment, increasing the cost of energy management applications that require building-level electricity demand forecasts.

However, the results also show that there is a subset of forecasting models that perform well for each building. Furthermore, at each time step, one forecasting model produces a prediction that is more accurate than any other prediction. If we train a set of regression models and determine which model in the set is most likely to produce the best prediction at a given time step, we can formulate an ensemble method that is able to perform model and feature selection by learning from past electricity demand. This is the core motivation of the gated ensemble learning method presented in the next section.

4. ENSEMBLE METHOD

4.1. Background

Given the many unpredictable behaviors of occupants and the unique physical and mechanical characteristics of every building, a single model approach to electricity demand forecasting may perform very well in one case and very poorly in another. Without being able to observe the causes of electricity demand changes (through extensive sub-metering and/or occupancy sensing), it is difficult to justify why a model does or does not perform well. Furthermore, the incorporation of exogenous signals like regional weather conditions may improve a model's accuracy but such benefits cannot be guaranteed. Only through observation and experimentation can the best regression model and input type be identified for a particular building.

The assertion that the best forecasting model can be identified through data driven experimentation underlies the ensemble method presented in this paper. To build upon existing literature and to improve the portability of electricity demand forecasters, we have developed a method that tests multiple models before selecting one that is best suited for a particular building and instance in time.

Our multiple model regression method falls under a category of ensemble learning methods commonly referred to as a “bucket of models” [30][31]. It is important to recognize that unlike other ensemble methods (which average, stack, or otherwise combine the outputs from multiple models), the bucket of models approach selects a single model from the ensemble set. Consequently, a bucket of models approach can perform no better than the best model in the set. Therefore, to produce accurate forecasts, it is important that the individual models perform well, though perhaps only in certain conditions. Additionally, our ensemble method must be able to identify and avoid models that are likely to perform poorly for a particular building. This capability will alleviate the need for prior knowledge of a building's energy use and, in practice, allow for model and feature selection to be performed in real time.

4.2. Method

Our gated ensemble learning method can be divided into 4 steps: Training, Validation, Gating, and Testing. In the training step, each model is trained on a subset of historic data, with the most recent electricity demand data points reserved for validation, as shown in Figure 10. The size of the training subset may vary by application and by training approach. For models trained in a batch manner, a large data set (e.g. >12 months) is required. Additionally, the training step is either performed only once or periodically (e.g. every 6 months) rather than at every time step.

For models trained in a moving horizon manner, a smaller dataset is required (e.g. 2 to 12 months) and the training step is performed periodically (e.g. daily) or at every time step. Again, the length of the dataset can be customized to the application. For example, because the use patterns of university buildings change according to an academic calendar, a training set size of 2 or 3 months may be ideal. For an office building with very consistent energy use patterns, a training set size of 6 months or more may produce the best results. Stated more explicitly, small training sets are more capable of capturing recent energy use patterns but carry the risk of allowing the regression model to overfit the data. By contrast, a large dataset will capture consistent energy use patterns but may miss recent or short-term changes, thereby appearing to underfit the data. In this paper, we will favor smaller training sets (3 months) for moving horizon models and larger training sets (18 months) for batch models.

In the validation step, the most recent historic data is used to generate predictions with each forecasting model, as shown in Figure 10. These forecasts are compared with the electricity demand data (that was not used for model training) to determine each model's performance. Again, the size of the data set used for validation may vary by application, but in our implementation, the length of the validation set is equal to twice the desired forecast length (i.e. for a 6 hour forecast, the previous 12 hourly electricity demand data points are reserved for validation).

Figure 10: Graphical Representation of Gated Ensemble Regression Method with 3 Models and 3 Hour Forecast Horizon (power demand (kW) vs. time steps from present; the historical window is split into train and validate segments preceding the test forecast)

The measure of a model's performance or the criteria for identifying the “best” model will depend on the application. In some cases, we may want to define “best” as the forecast that will produce the highest pay-off or incur the least risk. In other cases, we may want the forecast with the smallest positive or negative error. In this paper, we will focus on producing the most accurate forecast as defined by the lowest root mean squared error (RMSE).

In the gating step, a method is applied to select a single model from the bucket of models according to its relative performance during the validation step. In other words, the gating method is responsible for choosing which model in the bucket of models will be used in the test step to generate the electricity demand prediction. The objective of the gating method is to select the best model based on present and/or past information from the validation step. In this paper, we will implement and compare 2 alternative gating methods. The first method is cross-validation selection (CV). Put simply, the CV gate will select the forecasting model that performed best (produced the lowest RMSE) during the validation step. The CV gate can be considered a greedy approach because it has no memory of how its past decisions impacted the accuracy of the test prediction. Instead, the model selection is based entirely on the current performances in the validation step.

The second gating method uses a single recursively trained linear regression model (SR) to predict the performance of each forecaster in the bucket of models. Given n forecasters, the SR model input x ∈ R^{n+1} is the forecasting model's performance during the validation step (i.e. x_{n+1} = Past RMSE of forecaster j) and a sparse binary vector representing each forecaster (i.e. for forecaster j, x_j = 1 and x_i = 0 for i = 1, . . . , j − 1, j + 1, . . . , n). The SR model output y ∈ R is the forecaster's predicted performance during the test step (i.e. y = Predicted RMSE of forecaster j). In other words, the linear model includes a parameter for each of the regression models and a parameter corresponding to the validation RMSE. Thus, given a forecasting model and its performance during the validation step, we will train the linear model to predict the performance during the test step. In this way, the SR gate learns how well a model's validation performance does or does not indicate the performance during the test step. This capability should help to avoid models that perform poorly and favor models that perform well. The model with the best predicted performance (i.e. lowest y) is selected by the SR gate for use in the test step.

Figure 11: Gated Ensemble Study Results (top: MAPE (%) for Buildings A–H under the Oracle, Cross Validation, and Single Regressor gates, with the minimum batch and moving horizon MAPEs as reference lines; middle: regressor utilization (%) for Ridge and k-NN trained in batch (BA) and moving horizon (MH) manners; bottom: input utilization (%) for D, T, DT, and DTE)
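As a sketch of the SR gate described above, the snippet below builds the (n+1)-dimensional input (one-hot forecaster indicator plus validation RMSE) and updates a linear model online after each observed test-step outcome. SGDRegressor's partial_fit is used here purely as a stand-in for whatever recursive estimator the authors used; that choice, the seeding step, and the helper names are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

n_forecasters = 16
sr_model = SGDRegressor(learning_rate="constant", eta0=0.01)

def sr_features(j, val_rmse):
    """One-hot indicator for forecaster j plus its validation RMSE (length n+1)."""
    x = np.zeros(n_forecasters + 1)
    x[j] = 1.0
    x[-1] = val_rmse
    return x

def sr_select(val_rmses):
    """Pick the forecaster with the lowest predicted test-step RMSE."""
    X = np.array([sr_features(j, r) for j, r in enumerate(val_rmses)])
    return int(np.argmin(sr_model.predict(X)))

def sr_update(j, val_rmse, test_rmse):
    """After the actual test-step RMSE is observed, refine the gate online."""
    sr_model.partial_fit(sr_features(j, val_rmse).reshape(1, -1), [test_rmse])

# The gate must be seeded with a few (validation RMSE, test RMSE) pairs before
# predictions can be made.
for j in range(n_forecasters):
    sr_update(j, val_rmse=5.0, test_rmse=5.0)

print(sr_select(val_rmses=np.linspace(4.0, 8.0, n_forecasters)))
```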

Finally, in the testing step, the model selected in the gating step is used to generate an electricity demand forecast. Our multiple model method can be summarized by the following steps:

1. Train each model on a subset of the historical data.

2. Validate each model using historic data that immediately precedes the current time step.

3. Apply a gating method to select a model according to its performance during validation.

4. Use the selected model to generate a prediction.
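The four steps map directly onto a short loop. The sketch below runs the ensemble with the CV gate (select the forecaster with the lowest validation RMSE); the model list, window lengths, and data preparation are illustrative assumptions, and the SR gate sketched earlier could be substituted at the gating line.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

def cv_gated_forecast(X, Y, t, train_window=2016, val_len=12):
    """Train/validate/gate/test at hourly step t with a bucket of models.

    X: (T, n_features) inputs; Y: (T, 6) observed 6-hour-ahead targets.
    The last `val_len` samples before t are held out for validation.
    """
    bucket = [Ridge(alpha=1.0),
              KNeighborsRegressor(n_neighbors=10, weights="uniform")]

    train_idx = slice(t - train_window, t - val_len)   # 1. train
    val_idx = slice(t - val_len, t)                    # 2. validate
    val_rmse = []
    for model in bucket:
        model.fit(X[train_idx], Y[train_idx])
        err = model.predict(X[val_idx]) - Y[val_idx]
        val_rmse.append(np.sqrt(np.mean(err ** 2)))

    best = int(np.argmin(val_rmse))                    # 3. gate (CV: lowest val RMSE)
    return bucket[best].predict(X[t:t + 1])[0]         # 4. test: 6-hour forecast

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(2200, 55)), rng.normal(size=(2200, 6))
print(cv_gated_forecast(X, Y, t=2100).shape)           # (6,)
```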

5. MULTIPLE MODEL STUDY

To test our approach, we have implemented the gated ensemble learning method using two regression models [Ordinary Least Squares with ℓ2 Regularization (Ridge) and k-Nearest Neighbors (k-NN)] and four input types [electricity demand (D), time (T), electricity demand and time (DT), and electricity demand, time, and exogenous weather data (DTE)]. Despite the poor performance of the demand only and time only inputs in the batch and moving horizon studies, we have included them to observe if the gating methods choose to avoid the inputs. Therefore, the bucket of models will include the best (k-NN with DT) and the worst (Ridge with T) forecasting models from the single model studies. We have elected to exclude the ANN models from the ensemble due to their computational complexity.


The regression models are trained in both batch (BA) and moving horizon (MH) manners, for a total of 16 forecasting models per building. Next, we generated a 6 hour electricity demand forecast (y ∈ R^6) for every hour between July 1st, 2013, and January 1st, 2014 using the Training, Validation, Gating, and Testing steps described above. The batch models are trained once on 18 months of data and the moving horizon models are trained at every time step on 3 months of data.

The results, employing both of the gating methods described, are presented in Figure 11. For testing purposes, we have also implemented an Oracle gate, which simply selects the best prediction for each time step regardless of the performances of the models in the validation step. The results from the Oracle gate represent the theoretical optimal of the ensemble approach.

In Figure 11, the top subplot presents the overall performance of each gate as measured by the MAPE of the selected predictions. The dotted lines represent the best performance (lowest MAPE) of any single model from the batch study and the dashed lines, from the moving horizon study. The results show that our ensemble method is able to perform comparably to the best forecaster from the batch and moving horizon studies without any prior knowledge of the energy end-use or the relative performance of the contained forecasting models. The Oracle gate is able to outperform the ANN models from the batch study, however, the cross-validation gate (CV) and single regression (SR) gate are not. Among the gating methods studied, there is no clear winner. The SR gate shows a small advantage in buildings B and G, but the worst performance in building C. In each of the buildings, the CV gate performs comparably to the best model from the moving horizon study. The performance of the Oracle gate, particularly in buildings C, E, and H, suggests there is still potential for a gating method (not presented here) to further improve our ensemble approach.

The second subplot shows the percent utilization of each regression model by the Oracle, CV, and SR gating methods, regardless of input type. In other words, the plot shows the percentage of time steps that a particular regression model and training manner was selected by the respective gating method. Because the Oracle gate represents the optimal decision given the forecasting models in the set, we would like to see comparable utilization percentages between the Oracle and the other gates. Based on the subplot, this is generally the case with respect to regression model selection.

Figure 12: Building A Oracle Forecast Sample (top: actual demand (kW) and the Oracle-selected forecasts, 8/6/13–8/8/13; bottom: the regression model (Ridge, k-NN), training manner (batch, MH), and input type (D, T, DT, DTE) of the selected forecaster at each time step)

Figure 13: Optimal Prediction Generation by a Forecaster over Consecutive Time Steps When Applied to Commercial Dataset (count, log scale, vs. number of consecutive time steps, 1–16)

The bottom subplot shows the percent utilization of each input type by the Oracle, CV, and SR gating methods. Here, we observe larger differences between the behaviors of the gates. For Buildings A, E, and F, the SR gate favors the electricity demand and time (DT) inputs much more than the Oracle gate. For Buildings E and D, the SR gate underutilizes the electricity demand only (D) input compared to the CV and Oracle gates. For Building C, the CV and Oracle gates avoid the time only (T) input but the SR gate does not. It should be noted that neither of these subplots quantifies the impact of the percent utilizations on the MAPE of the gates. Nonetheless, the percent utilization metric provides useful insight into the decision making of each gate.


In general, we see very similar utilization rates between the Oracle and CV gates. This does not mean that both gates select the same forecaster at each time step, but does suggest that the CV gate is capable of identifying forecasters that work well for a certain building. In Buildings E and F, we see that all 3 gates favor the k-NN models over the Ridge models. Additionally, in each of the buildings, the inclusion of exogenous weather data does not appear to significantly improve the accuracy of the predictions, as suggested by the Oracle's DTE utilization rate. This could mean that electricity demand is not strongly correlated with weather or that correlation only appears under certain conditions (e.g. an unusually cold or warm day).

Figure 12 presents a sample of the predictions for Building A. The top subplot shows the actual electricity demand data from 8/6/13 to 8/9/13 and a series of multivariate forecasts selected by the Oracle gate (i.e. the most accurate forecast at each time step). The bottom subplot details which forecasting model generated the selected prediction. The marker color denotes the regression model and the marker shape, the input type. A filled marker indicates that the model was trained in a batch manner and a half-filled marker, a moving horizon manner.

Since the Oracle gate chooses the most accurate forecast at each time step, Figure 12 supports the notion that a certain forecaster may generate the best prediction several time steps in a row. Figure 13 shows, for all 8 buildings, the number of times a forecaster produced the best prediction over multiple consecutive time steps and the lengths of such sequences. For the validation step to properly inform the gating method, we would like to observe high frequencies of large repeated model sequences. Using fewer models would, of course, increase the probability of such sequences at the loss of potential performance.

6. RESIDENTIAL STUDY

To further test our ensemble approach, we have repeated the single model batch study, single model moving horizon study, and multiple model ensemble study using 24 residential electricity demand datasets. This time-series demand data has been downloaded by the individual customers from the electric utility's website and provided to our research group. Each dataset is from a home in northern California but has been anonymized such that we do not know its exact location. Accordingly, for this study, we have excluded the forecasting models that use weather data as inputs.

Figure 14: Residential Study Results (MAPE (%) of the batch and moving horizon Ridge and k-NN models with D, T, and DT inputs, and of the Oracle, CV, and SR ensemble gates)

We utilize two regression models [Ordinary Least Squares with ℓ2 Regularization (Ridge) and k-Nearest Neighbors (k-NN)] and three input types [electricity demand (D), time (T), and demand and time (DT)]. The regression models are trained in both batch (BA) and moving horizon (MH) manners, for a total of 12 forecasting models. In each of the studies, we generate a 6 hour electricity demand forecast (y ∈ R^6) for every hour between October 1st, 2014, and January 1st, 2015. The batch models are trained once on 9 months of data and the moving horizon models are trained at every time step on 3 months of data.
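For concreteness, a bucket of 12 forecasting models of this form can be assembled with scikit-learn estimators, as sketched below. The hyperparameters shown (regularization weight, number of neighbors) are illustrative assumptions rather than the values used in our experiments.

```python
from itertools import product
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

REGRESSORS = {"Ridge": lambda: Ridge(alpha=1.0),
              "k-NN": lambda: KNeighborsRegressor(n_neighbors=5)}
INPUT_TYPES = ("D", "T", "DT")      # demand, time, demand and time
TRAINING = ("BA", "MH")             # batch or moving horizon

# 2 regressors x 3 input types x 2 training manners = 12 forecasting models.
# Both estimators support multi-output targets, so each model can map its
# input features directly to a 6 hour forecast (y in R^6).
bucket = {(name, inputs, manner): make()
          for (name, make), inputs, manner
          in product(REGRESSORS.items(), INPUT_TYPES, TRAINING)}
```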

The results from the single model batch, single model moving horizon, and multiple model ensemble studies are presented in Figure 14. In the figure, each bar represents the mean absolute percent error (MAPE) of the respective predictions across all 24 residential building datasets. Moving from left to right, the first 6 bars represent the results from the single model batch study and the next 6 bars, the single model moving horizon study. Here, we observe that the MAPEs are much larger for the residential predictions than for the commercial/university predictions. This can be explained by the fact that the residential electricity demands are much smaller and are composed of fewer loads than the commercial/university demands. Therefore, any change to the residential demand produces a larger percent change and any prediction error will correspond to a larger percent error. In the single model studies, the highest performing forecaster is the k-Nearest Neighbors regression model with demand input (D) trained in a moving horizon manner (MAPE: 63.5%).
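For reference, and assuming the conventional definition of the metric over the N forecasted values, the error reported in Figure 14 is

\[ \mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \]

so the small denominators typical of residential loads translate modest absolute errors into large percent errors, consistent with the gap between the residential and commercial results.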

In Figure 14, the Oracle, CV, and SR bars show the results produced by our ensemble learning method. As previously stated, the results from the Oracle gate (MAPE: 37.1%) represent the optimal potential given the forecasters in the bucket of models. The CV gate represents the results (MAPE: 61.5%) using cross-validation to select which forecaster to utilize. Here, we observe that the CV gate performs comparably to the best forecaster in the single model studies. The SR gate represents the results (MAPE: 55.8%) using a single recursively trained linear regression model to select a forecaster by predicting the performance of each model. These results suggest that the SR gate outperforms the CV gate as well as any of the single model approaches. In other words, the SR gate is able to learn from the past performance of each model and to make better decisions about which model to select at each time step.
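A minimal sketch of the selection logic behind a gate of this kind is given below, using scikit-learn's SGDRegressor as a stand-in for the recursively trained linear model. The feature construction, initialization, and update schedule are illustrative assumptions and do not reproduce our exact SR formulation.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

class PerformancePredictingGate:
    """Select a forecaster by predicting each model's validation error
    with an incrementally (recursively) trained linear regression."""

    def __init__(self, model_keys):
        self.error_models = {key: SGDRegressor() for key in model_keys}

    def select(self, features):
        """Return the key of the forecaster with the lowest predicted error."""
        x = np.asarray(features).reshape(1, -1)
        predicted = {}
        for key, reg in self.error_models.items():
            # Before the first update, the regressor has no coefficients;
            # treat every forecaster as equally (un)attractive.
            predicted[key] = reg.predict(x)[0] if hasattr(reg, "coef_") else 0.0
        return min(predicted, key=predicted.get)

    def update(self, features, observed_errors):
        """Recursively update each error model with the errors (e.g. RMSEs)
        observed once the actual demand becomes available."""
        x = np.asarray(features).reshape(1, -1)
        for key, error in observed_errors.items():
            self.error_models[key].partial_fit(x, [error])
```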

The Oracle gate's percent utilization of each forecaster in the bucket of models is shown in Figure 15. These results show that, across the 24 residential electricity demand datasets, the Oracle gate shows a slight preference for the k-Nearest Neighbors models. This does not indicate that the k-NN models perform significantly better (lower RMSE) than the Ridge models, only that the k-NN models produce the best prediction with a higher frequency.

Lastly, Figure 16 shows, for all 24 residential buildings, the number of times a forecaster produced the best prediction over multiple consecutive time steps and the lengths of such sequences. Compared to the commercial/university buildings (Figure 13), we observe longer sequences of optimal prediction generation when applying the ensemble method to the residential buildings. It should be noted that for the residential studies we have not incorporated weather data and are thus using 4 fewer forecasting models than in the commercial/university studies. Nonetheless, the Oracle's forecaster utilization rates (ranging from 5.0% to 14.6%) and the presence of long optimal prediction generation sequences support the underlying assertion of our ensemble learning method: by selecting between multiple models at each time step, we can obtain better results than a single model approach.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a gated ensemble learning method for short-term electricity demand forecasting. The contribution of this method is to allow for the incorporation of multiple forecasting models trained in both batch and moving horizon manners.

Figure 15: Oracle Forecaster Utilization. [Bar chart of forecaster utilization (%) for the Ridge and k-NN models with D, T, and DT inputs, trained in batch and moving horizon manners.]

Figure 16: Optimal Prediction Generation by a Forecaster over Consecutive Time Steps When Applied to the Residential Datasets. [Count of sequences, on a logarithmic scale, versus sequence length in time steps.]

Stated more explicitly, rather than choosing a single approach, an engineer can utilize multiple models with the intention of improving the reliability of the forecaster to produce useful results. This makes the method well suited for real-world applications. At deployment, the moving horizon models can be utilized until sufficient data is available to train the batch models. This adaptability also makes the method suitable for control applications. Rather than assuming that demand behaviors are time invariant, the method will work to recognize both long- and short-term electricity demand patterns.
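As a simple illustration of this deployment strategy, and assuming the 9 month batch training window used in the residential study, the switch from moving horizon to batch models might be expressed as:

```python
BATCH_WINDOW_HOURS = 9 * 30 * 24  # approx. 9 months of hourly data (assumed)

def usable_training_manners(hours_of_data_available):
    """Moving horizon (MH) models are usable immediately; batch (BA) models
    become usable once the full batch training window has been observed."""
    manners = ["MH"]
    if hours_of_data_available >= BATCH_WINDOW_HOURS:
        manners.append("BA")
    return manners
```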

The relative performance of the Oracle gate suggests that there is potential for continued development of the gating functions. For instance, none of the gating functions attempt to identify repeated model selection sequences. Given the optimal model in the previous few time steps, a gating method could determine the probability that a model will be optimal in the next time step. Also, in our implementation, the validation step uses the RMSE of 6 multivariate forecasts to measure the performance of each model. It may be more effective to use fewer forecasts or even univariate predictions in order to identify the optimal model at the given time step. Finally, the addition of a feature selection procedure may help to reduce the dimensionality of the regression models or even eliminate certain input types from consideration.
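As one illustration of the first idea, the selection history could be mined for the probability that the currently optimal forecaster remains optimal at the next time step. The sketch below is a hypothetical extension, not part of the presented method.

```python
from collections import Counter

def persistence_probability(oracle_selections):
    """For each forecaster, estimate P(optimal at t+1 | optimal at t)
    from the Oracle's (or validation step's) selection history."""
    stays, totals = Counter(), Counter()
    for current, following in zip(oracle_selections, oracle_selections[1:]):
        totals[current] += 1
        if following == current:
            stays[current] += 1
    return {key: stays[key] / totals[key] for key in totals}
```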

By applying our gated ensemble learning method to 32 unique building electricity demand data sets (8 commercial/university and 24 residential), we demonstrate that the incorporation of multiple models can yield better results than a single model approach. While the development of the gating methods is ongoing, the ability of each gate to perform model validation and selection in real time greatly improves the method's adaptability and ease of use. Utilizing this data-driven approach, we empirically show that the ensemble method can aid in producing accurate multivariate electricity demand forecasts for building-level applications.
