
Towards plug&play Smart Thermostats inspired by Reinforcement Learning

Charalampos Marantos, School of ECE, National Technical University of Athens, Greece, [email protected]
Christos P. Lamprakos, School of ECE, National Technical University of Athens, Greece, [email protected]
Vasileios Tsoutsouras, School of ECE, National Technical University of Athens, Greece, [email protected]
Kostas Siozios, Department of Physics, Aristotle University of Thessaloniki, Greece, [email protected]
Dimitrios Soudris, School of ECE, National Technical University of Athens, Greece, [email protected]

ABSTRACT

Buildings are immensely energy-demanding, and this fact is aggravated by the expectation that energy consumption will increase even further in the future. In order to mitigate this problem, a low-cost, flexible and high-quality Decision-Making Mechanism for supporting the tasks of a Smart Thermostat is proposed. Energy efficiency and thermal comfort are the two primary quantities regarding the control performance of a building's HVAC system. Apart from demonstrating a conflicting relationship, they depend not only on the building's dynamics, but also on the surrounding climate and weather, thus rendering the problem of finding a long-term control scheme hard and of stochastic nature. The introduced mechanism is inspired by Reinforcement Learning techniques and aims at both satisfying occupants' thermal comfort and limiting energy consumption. In contrast to existing methods, this approach focuses on a plug&play solution that does not require detailed building models and is applicable to a wide variety of buildings, as it learns the dynamics from information gathered from the environment. The proposed control mechanisms were evaluated via a well-known building simulation framework and implemented on ARM-based, low-cost embedded devices.

KEYWORDS

HVAC control, Intelligent agents, Energy efficiency, Learning systems, Decision making, Embedded software

ACM Reference Format:
Charalampos Marantos, Christos P. Lamprakos, Vasileios Tsoutsouras, Kostas Siozios, and Dimitrios Soudris. 2018. Towards plug&play Smart Thermostats inspired by Reinforcement Learning. In INTelligent Embedded Systems Architectures and Applications (INTESA), October 4, 2018, Turin, Italy. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3285017.3285024

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
INTESA, October 4, 2018, Turin, Italy
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6598-7/18/10 . . . $15.00
https://doi.org/10.1145/3285017.3285024

1 INTRODUCTION

Intelligent computing systems are gradually reshaping the world as we know it, in an effort to optimize every aspect of contemporary activities. Unprecedented monitoring and calculation abilities are at the disposal of system designers, who in turn need to devise novel applications with societal impact. A particularly important field is rural development, since buildings are immensely energy-demanding, consuming around 40% of the European Union's total energy [16]. Taking into account that the total consumption increases by 1% per year [15], a balancing mechanism is required, abiding by the concept of energy consumption minimization.

There is a plethora of techniques that aim to optimize the financial and ecological cost of buildings. Innovative design methods, new materials and appliances are used during the construction of new buildings, including insulation improvements and more energy-efficient HVAC systems. While new green buildings can be optimized for energy savings, also maximizing the usage of renewable sources, an efficient solution for old buildings is crucial, as the existing infrastructure cannot be replaced in a cost-effective manner.

An energy optimization technique that applies to all buildings, regardless of their age, is the fine-tuning and control of their heating, ventilation, and air conditioning (HVAC) systems. An online HVAC control system is frequently referred to as a Smart Thermostat: a computerized embedded platform that applies advanced control methods to the HVAC system. Smart Thermostats promise to achieve energy reduction and better thermal conditions by properly configuring the HVAC system in real time, based on environmental parameters, the building's state and the occupants' preferences. Recent reports estimate that the global smart thermostat market is expected to generate a revenue of $1.3 billion by 2019¹.

Embedding intelligence in dynamic HVAC configuration has attracted the interest of many researchers over the years, resulting in numerous design approaches. This work focuses on a plug&play solution that is applicable to a wide variety of buildings, aiming at a rapid prototyping solution (low design time). The core of the decision-making logic of the proposed Smart Thermostat is inspired by Reinforcement Learning, augmented with supervised learning techniques, in order to effectively adapt to the parameters and dynamics of the controlled environment.

¹ According to a recent report by Sandler Research.


A further contribution of this work is an optimal cost formulation, creating an efficient normalization of the two competing efficiency metrics (energy and thermal comfort), so that they are equally taken into account without any prior knowledge. Experimental results, using popular simulation software (EnergyPlus), highlight the effectiveness of this work.

The rest of the manuscript is organized as follows: Section 2 provides an overview of relevant approaches found in the literature, whereas Section 3 provides the technical background that is considered necessary for the reader to have a clear view of the aspects discussed afterwards. The proposed framework, as well as its components, is discussed in detail in Section 4, while Section 5 presents the experimental setup. The efficiency of our proposed solution is discussed and quantified under various metrics in Section 6. Finally, Section 7 concludes the manuscript.

2 RELATED WORK

In the context of dynamic HVAC control, two major techniques, i.e. on-line decision-making and Model Predictive Control (MPC), dominate the literature, each one characterized by a number of pros and cons. On-line algorithms usually require lower design time, while MPC methods are usually more robust and efficient, especially in cases where the control was designed along with the system. Nevertheless, on-line methods are more reactive to real-time conditions, whereas the accuracy of MPC techniques is affected by the precision of the weather forecasting and building dynamics models.

MPC control techniques have been successfully applied in a wide range of similar non-linear applications [22, 24], including HVAC system control [1]. In most cases the design of the controller requires extensive analysis of the system, which leads to high-dimensional mathematical problems [14] requiring high computational power. As a result, a great amount of design and customization time (detailed experimental and mathematical analysis) is needed for every different type of building. In general, MPC methods cannot support real-time applications, but are better suited for controlling components of HVAC systems that have been modeled at design time and are not affected by the building's behavior.

Fuzzy rules [2, 9, 30] alleviate the necessity of a detailed mathematical model through a fuzzy approximation scheme. The controller follows a (usually predefined) action plan according to the information received from the environment. Sometimes genetic algorithms are employed to support the fuzzy controller [3, 19].

Supervised machine learning techniques, such as Artificial Neural Networks (ANNs) [7, 21], have recently been gaining a lot of attention, due to the fact that they do not require a detailed study of the underlying dynamics of the building. Instead, they can be trained on historical data and learn the behavior of the building's physics. Although these techniques are in alignment with the model-free controller idea, they have a number of limitations. Machine learning models usually need a long time to be trained and calibrated and are difficult to implement in practice, especially as a lightweight plug&play solution, while fuzzy rules create fuzzy classes of some parameters and, as a result, are not able to learn the building's behavior in enough detail to react to real-time dynamics. Therefore a "pre-training" stage is performed, based either on historical data of the target building or on building modeling tools (e.g. EnergyPlus, Modelica).

Reinforcement Learning (RL) promises to provide a solution by continuously learning through the results of different inputs to the system. This is achieved by matching each action to a reward that accrues from the evaluation of the produced output. RL is gaining attention nowadays, and a growing use in the field of embedded systems is observed [28]. Several state-of-the-art approaches use RL for HVAC control [5, 11, 33]. Usual criticism of RL concerns instability during the initial period of operation, as well as prolonged learning periods [1].

As far as the use case of Smart Thermostats is concerned, several works focus only on energy consumption minimization, regardless of thermal comfort. Some take into account the energy market, trying to satisfy a desired threshold set by users, either by controlling the HVAC system [35] or by escaping from the limits of available thermostat choices through a purchase-bidding strategy for the building [25]. The first approach [35] uses simulations to develop a linear regression model that is related only to temperature difference, while the second one [25] assumes a full model for estimating energy consumption, based on modeled building parameters and a computationally demanding Monte Carlo approach. On the other hand, a large number of proposed solutions attempt to serve occupants' preferences according to their manual modifications of temperature set-points. These approaches attempt to build a schedule and provide energy savings by avoiding unnecessary adjustments, normalizing fluctuations and turning off the HVAC when the zone is not occupied. In order to achieve this, some works ask the user to identify the comfort zone manually [10], [6].

Our proposed method envisions a controller that takes into account both energy and thermal comfort, solving a multi-objective optimization problem. Similarly, [4] proposes a control method that combines energy savings with a comfortable lifestyle and provides a solution to the whole Smart Home task-scheduling problem, using a detailed model of the building and a predefined thermal comfort model. A low-cost and flexible solution to the Smart Thermostat problem is devised in [12], coupling Neural Networks (NN) with Fuzzy control. However, the solution employs a NN that is pre-trained off-line using a thorough design space exploration. The results highlighted that a machine learning technique can be very efficient, leading to near-optimal results with low computational complexity.

Regarding RL, an examination of its application to Smart Thermostats has been introduced in [5]. In this work, energy cost corresponds to a reward of −1 when the HVAC is on, but no actual energy costs are integrated. Additionally, the controller tries to achieve a temperature predefined by the occupants. Another approach formulates a reward function that focuses only on minimizing energy cost, taking into account a desired range for the temperature [33]. However, this range is occasionally violated, and more realistic thermal comfort values are not considered. Finally, [11] is the only work on RL that incorporates both energy and thermal comfort in the construction of the reward function. However, the comfort exceeds the acceptable limits for numerous periods, while the technique relies on some prior information, such as the maximum energy consumption.

3 THEORETICAL BACKGROUND

3.1 Reinforcement Learning - NFQ Algorithm

Given the unknown parameters of the system, a deterministic approach to decision making is limited by approximations


regarding the system states that can be reached at run-time. Similarly, when the parameter space is vast, defining deterministic transitions from one state to another can prove infeasible. Such design requirements gave birth to the Reinforcement Learning (RL) approach, which is constantly gaining attention.

A Reinforcement Learning problem consists of a set S of states, a set A of actions, and a function r : S × A → R, called the reward function. At each instance of the problem, an action a_i ∈ A (we assume that A is finite) has to be chosen, which will lead from s_i ∈ S to a new state s_{i+1}. The tuple (s_i, a_i, s_{i+1}) is called a transition, and a real value r_i is assigned to each transition. The agent's goal is a series of transitions t_1, t_2, ..., t_n that maximizes the value R (called the return). Since maximizing a reward is equivalent to minimizing a cost, with the said cost defined as the negative of the reward, we will refer to c as the cost function for the rest of this manuscript; a real value c_i is assigned to each transition. The mathematical formulation of this problem is given by Eq. 1; the minimization of the total cost, instead of the maximization of the total reward, is the differentiator compared to the conventional Reinforcement Learning formulation. γ ∈ [0, 1) is a discounting factor that controls the importance of future rewards and ensures convergence of the sum in Eq. 1 when n → ∞.

Minimize R = Σ_{i=0}^{n} γ^i · c_i    (1)

Although state-values suffice to converge to an optimal solution, it is useful to define action-values. Given a state s and an action a, the action-value of the pair (s, a) is defined as Q and is calculated according to Eq. 2, where R stands for the return associated with first taking action a in state s. Consequently, estimating Q plays a key role in the overall system effectiveness, as it quantifies the efficiency of the possible alternative selections that can be performed by the decision-making mechanism.

Q(s, a) = E[R | (s, a)]    (2)

In this paper, RL is employed via Neural Fitted Q-iteration (NFQ), which has proven successful in real-world applications [26][27]. NFQ uses a multilayer perceptron (MLP) in order to approximate the Q function. The agent acts ϵ-greedily on each state encountered, based on its current approximation of the Q function. All of the resulting transitions are stored in a growing batch of data on which the MLP is trained, and the estimation of Q(s, a) is renewed.
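To make the acting step concrete, the following minimal Python sketch illustrates ϵ-greedy action selection over an MLP-based Q approximation in the cost-minimization setting used here. It is illustrative rather than the authors' implementation; the feature layout (state concatenated with an action code) and the scikit-learn-style predict() interface of q_model are assumptions.

```python
import numpy as np

# Set-point actions: maintain, increase by 1 degree C, decrease by 1 degree C.
ACTIONS = (0.0, +1.0, -1.0)

def select_action(q_model, state, epsilon, rng=None):
    """epsilon-greedy selection: explore with probability epsilon, otherwise pick
    the action whose predicted expected cost (Q value) is minimal."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return float(rng.choice(ACTIONS))                 # explore
    features = np.array([np.append(state, a) for a in ACTIONS])
    q_values = q_model.predict(features)                  # MLP approximation of Q(s, a)
    return ACTIONS[int(np.argmin(q_values))]              # exploit (costs are minimized)
```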

3.2 Thermal Comfort

Thermal comfort quantifies the satisfaction of people within a thermal environment. It cannot be measured directly and can therefore only be estimated using a number of parameters. A popular index for estimating occupants' thermal comfort is the Predicted Mean Vote (PMV). It was developed by Fanger [17] and produces values on a seven-point scale ([−3, 3]). The sign of the PMV denotes feeling colder or warmer than the ideal.

4 PROPOSED CONTROL ALGORITHM

An overview of the proposed controller framework is schematically presented in Figure 1. Each set-point generation cycle starts with the retrieval of the current system state.

[Figure 1: Proposed controller's framework. Indoor, weather and energy sensors provide the state; thermal comfort and energy are normalized via unsupervised dynamic scaling to calculate the cost; the Reinforcement Learning controller (NFQ, with ϵ-greedy exploration, an MLP for Q estimation and the Transitions Book) or the failsafe controller issues the temperature set-point from the action space.]

For a feasible approximation of the Q-function, the system state has to fulfill the Markovian property, i.e. it has to contain all information relevant to the Q function. Additionally, in the proposed design the state vector must be able to effectively capture both energy consumption and thermal comfort. Consequently, in this work the system state is summarized by a vector of outdoor temperature, solar radiation, indoor humidity and indoor temperature.
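As a concrete illustration of this state vector, the sketch below defines a minimal container for the four quantities; the field names and units are illustrative assumptions rather than the exact representation used in the implementation.

```python
from dataclasses import dataclass

@dataclass
class ThermostatState:
    """System state observed at each control step (names and units are illustrative)."""
    outdoor_temp: float      # outdoor temperature, degrees C
    solar_radiation: float   # solar radiation, e.g. W/m^2
    indoor_humidity: float   # indoor relative humidity, %
    indoor_temp: float       # indoor temperature, degrees C

    def as_vector(self):
        return [self.outdoor_temp, self.solar_radiation,
                self.indoor_humidity, self.indoor_temp]
```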

The term action refers to a set-point designation for the thermostat. The state-action space has to be kept minimal, in order for the agent to learn as fast as possible. The actions in this case are:

• Maintain indoor temperature
• Increase indoor temperature by 1°C
• Decrease indoor temperature by 1°C

The agent's knowledge is the history of all encountered states, taken actions and received costs. We will refer to this history as the Transitions Book (TB). As NFQ implies, the TB is a batch of data in the form of concatenated tuples (s_i, a_i, c_i), one for each transition (s_i, a_i, s_{i+1}). The cost function (Eq. 3) of the proposed decision-making mechanism is computed with respect to the energy consumption E of each transition, as well as the thermal comfort value in the form of the PMV. Moreover, a trade-off factor tr (0 ≤ tr ≤ 1) is introduced in the cost calculation to allow the user to designate their preference with respect to the importance of energy and comfort.

c(s, a, s′) = tr · E_std + (1 − tr) · |PMV|_std   (non-terminal transition)
c(s, a, s′) = terminal_cost                       (otherwise)    (3)

More precisely, E_std and |PMV|_std are normalized values. In the extreme cases of tr = 0 or tr = 1, the resulting control is purely comfort-driven or purely energy-efficiency-driven, respectively. Regarding the PMV, absolute values are used, since our goal is to minimize the distance from the ideal conditions (PMV = 0).
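A minimal sketch of the cost calculation of Eq. 3 is given below, assuming the energy and |PMV| values have already been normalized by the dynamic scaling described next; the terminal_cost default is a placeholder, since its exact value is not reported in the text.

```python
def transition_cost(energy_norm, pmv_abs_norm, tr, terminal, terminal_cost=1.0):
    """Cost of a transition (Eq. 3): trade-off-weighted sum of the normalized
    energy consumption and the normalized |PMV| of the transition."""
    if terminal:
        return terminal_cost          # placeholder value, assumed for illustration
    return tr * energy_norm + (1.0 - tr) * pmv_abs_norm
```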

A practical obstacle in the calculation of the cost function is the fact that the value ranges of the two involved quantities may differ significantly. This difference would inevitably affect the cost calculation, and thus a mitigation strategy is mandatory.

Adhering to our plug&play design concept, we refrain from using existing knowledge in order to scale the energy and thermal comfort values. Instead, we adopt an unsupervised dynamic scaling [8] technique, where the running average and


standard deviation are dynamically calculated for both energy consumption and PMV. During run-time, as new data are accumulated, the scaling parameters are re-computed, ensuring that the scaling is up-to-date. The principle of the employed scaling technique is shown in Eq. 4 for an arbitrary function F. The variable µ_k represents the current (k-th time-step) mean value of F, while δ_k is its standard deviation. Eqs. 5 and 6 indicate how these values are iteratively updated at successive steps.

F_norm = (F − µ_k) / δ_k    (4)

µ_k = µ_{k−1} + (F − µ_{k−1}) / k    (5)

δ_k = sqrt( s_k / (k − 1) ),   s_k = s_{k−1} + (F − µ_{k−1}) · (F − µ_k)    (6)
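The following sketch shows how the running statistics of Eqs. 4-6 can be maintained online for an arbitrary quantity F (energy consumption or |PMV|); it is a straightforward Welford-style implementation, with the initialization behavior chosen purely for illustration.

```python
class OnlineScaler:
    """Unsupervised dynamic scaling (Eqs. 4-6): incrementally updated mean and
    standard deviation used to normalize a stream of values."""

    def __init__(self):
        self.k = 0        # number of samples seen so far
        self.mu = 0.0     # running mean (Eq. 5)
        self.s = 0.0      # running sum of squared deviations (Eq. 6)

    def update(self, f):
        self.k += 1
        mu_prev = self.mu
        self.mu = mu_prev + (f - mu_prev) / self.k         # Eq. 5
        self.s = self.s + (f - mu_prev) * (f - self.mu)    # Eq. 6, s_k update

    def normalize(self, f):
        if self.k < 2:
            return 0.0                                     # assumption: no scaling until enough data
        delta = (self.s / (self.k - 1)) ** 0.5             # Eq. 6, standard deviation
        return (f - self.mu) / delta if delta > 0 else 0.0  # Eq. 4
```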

A critical aspect of effective RL is to determine the range within which the system operates in an acceptable way. This is more straightforward in other RL applications, such as autonomous driving, where the automobile should stay within the limits of the road at all times [27]. We implement a similarly strict zone of accepted system function by limiting the Smart Thermostat within a specified thermal comfort (PMV) range. In other words, exceeding this limit is an unacceptable action by the agent. This zone of function is set to |PMV| ≤ 0.75, the acceptable limit according to the EN 15251 European standard. Consequently, an action is deemed terminal (corresponding to a terminal cost) if:

• the action led the PMV out of the zone of function, or
• the PMV was already out of the zone of function, and the action increased its absolute value.

When terminal costs surface, the MLP is retrained and the next set-point is generated by the following failsafe controller:

    if PMV > 0 then
        decrease indoor temperature by 1°C
    else
        increase indoor temperature by 1°C
    end if

This choice compensates for the agent's failure and is consistent with the general form of the agent's actions.
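The terminal condition and the failsafe action can be expressed compactly as in the following sketch, where the |PMV| ≤ 0.75 zone of function is taken from the text; everything else is illustrative.

```python
PMV_LIMIT = 0.75  # zone of function |PMV| <= 0.75 (EN 15251 acceptable limit)

def is_terminal(pmv_before, pmv_after):
    """An action is terminal if it pushed the PMV out of the zone of function, or
    if the PMV was already outside it and the action increased its absolute value."""
    left_zone = abs(pmv_before) <= PMV_LIMIT < abs(pmv_after)
    worsened = abs(pmv_before) > PMV_LIMIT and abs(pmv_after) > abs(pmv_before)
    return left_zone or worsened

def failsafe_action(pmv):
    """Failsafe controller: step the set-point against the comfort deviation."""
    return -1.0 if pmv > 0 else +1.0   # too warm -> decrease by 1 degree C, else increase
```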

Given a state and the available actions, the agent has to produce a set-point which will minimize the expected return (the cumulative future costs over the time horizon determined by γ in Eq. 1). Due to the incremental nature of the thermostat's actions, we desire set-points optimized with a long-term horizon in mind. Consequently, for this case study γ was set equal to 0.98 to maximize performance, corresponding to an effective horizon of roughly 1/(1 − γ) = 50 control steps.

As stated in Section 3, the Q function summarizes the expected benefit of a future action, and this function is approximated by an MLP. The MLP weights are updated at the start of each day, or whenever the agent has received a terminal cost. The data set used for training the MLP is extracted from the TB in the form of (s, a) tuples. Denoting by Q_k the output of the current NN, the training targets, as defined by the NFQ framework, are given in Eq. 7.

target = c(s, a, s′) + γ · min_a Q_k(s′, a)    (7)
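A sketch of how the NFQ training targets of Eq. 7 can be assembled from the Transitions Book follows; the tuple layout of the stored transitions and the handling of terminal transitions (target equal to the terminal cost) are assumptions of this illustration.

```python
import numpy as np

def build_training_set(transitions_book, q_model, gamma=0.98, actions=(0.0, 1.0, -1.0)):
    """Create (input, target) pairs per Eq. 7 for retraining the MLP.
    Each stored transition is assumed to be (state, action, cost, next_state, terminal)."""
    inputs, targets = [], []
    for state, action, cost, next_state, terminal in transitions_book:
        if terminal:
            target = cost                                  # assumed: no future cost after a terminal transition
        else:
            next_features = np.array([np.append(next_state, a) for a in actions])
            target = cost + gamma * float(q_model.predict(next_features).min())  # Eq. 7
        inputs.append(np.append(state, action))
        targets.append(target)
    return np.array(inputs), np.array(targets)
```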

An important consideration stemming from the utilized approximations is the possibility of the controller being trapped in a sub-optimal solution, because the controller's selected actions are based on transitions that were examined in the past. Consequently, a dilemma at each time-step is whether the controller will exploit its current knowledge or seek a possibly better solution. This is the exploration/exploitation dilemma, since the controller cannot know the optimality of a certain action in a certain state if this action is never picked. In this work, we approach this dilemma via ϵ-greedy action selection, with ϵ being self-regulated as described in [31]. The "positive outcomes" are counted and then used to regulate ϵ. In our case, such positive outcomes are determined by the validity of the MLP's prediction. This validity is represented by the Temporal Difference (TD) error, defined in Eq. 8.

TD = c(s, a, s′) + γ · min_a Q(s′, a) − Q(s, a)    (8)

A decision of the agent is deemed positive if |TD| < 0.15. On the one hand, this bound is tight enough to represent an accurate approximation. On the other hand, we observed that it is elastic enough to allow for iterative learning at early stages, where the training error is initially very high. The association of the TD error with the agent's exploration mechanism was inspired by [18].
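The TD error of Eq. 8 and a simple form of the self-regulated exploration can be sketched as below; the adjustment step and bounds for ϵ are illustrative assumptions, since the paper follows the self-regulation scheme of [31] rather than this exact rule.

```python
def td_error(cost, q_next_min, q_current, gamma=0.98):
    """Temporal-Difference error of the last decision (Eq. 8)."""
    return cost + gamma * q_next_min - q_current

def regulate_epsilon(epsilon, td, threshold=0.15, step=0.01, eps_min=0.05, eps_max=1.0):
    """Count a decision as 'positive' when |TD| < threshold and reduce exploration;
    otherwise increase it. Step size and bounds are illustrative, not from the paper."""
    if abs(td) < threshold:
        return max(eps_min, epsilon - step)
    return min(eps_max, epsilon + step)
```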

The set-point generation process is summarized as follows: the last taken action is evaluated; the values used for scaling the energy consumption and thermal comfort are updated, and so is the agent's knowledge (in the form of the TB); the TD error is calculated, thus regulating the mechanism's exploration. If the last action was terminal, the MLP is retrained and the failsafe controller takes action. Otherwise, the next set-point is chosen by ϵ-greedily exploiting the current approximation of the Q function.

The control ensemble, illustrated in Figure 1, is repeated with period T until the end of the schedule. The definition of this period is important, as it affects the granularity of control as well as the computational requirements of the Smart Thermostat, thus leading to a trade-off. In the context of this work, T was set equal to 10 minutes, so that the agent collects a greater amount of experience every day, which in turn results in faster learning. In addition, this time-step guarantees that if the agent makes a sub-optimal set-point designation (which is expected to happen, especially in the early stages of learning), it will not affect the occupants for too long.

5 EXPERIMENTAL SETUP

The effectiveness of the proposed control logic was evaluated using a well-known simulation and testing testbed provided by [20, 29]. Figure 2 illustrates an overview of the testbed, which has been used in a variety of works [12, 23]. The building dynamics and input sensor data for the controller are produced by the EnergyPlus suite [13]. The controller gathers this data and calculates the set-point through MATLAB. Data exchange is facilitated through the BCVTB (Building Controls Virtual TestBed) [34]. The employed building model corresponds to an actual building located in Crete, Greece³. The utilized weather data correspond to publicly available information collected in 2010.

The smart thermostat demonstrated in this paper targets a single, randomly occupied thermal zone of the building, active from 6:00 to 21:00. It is also assumed that, during the daily schedule, the zone is occupied at all times by at least one person.

³ Building models were part of the PEBBLE FP7 EU project (grant agreement 248537).


[Figure 2: Simulation testbed of the proposed controller. EnergyPlus (building model and weather file) exchanges building dynamics and set-points with the proposed controller in Matlab through the Building Controls Virtual TestBed (BCVTB).]

To evaluate the thermostat's performance, the resulting energy consumption and thermal comfort are compared with a wide, reasonable array of rule-based control (RBC) set-points. This is a typical function found in all cooling/heating devices for setting a "static" temperature set-point. For the sake of completeness, we provide the performance results achieved with alternative RBCs as a reference, in order to highlight the improvement achieved by the proposed solution. Most manual thermostats tend to operate with a single heating set-point in winter and a respective cooling set-point in summer [32]. Similarly, other smart thermostats also produce set-points in a range deemed reasonable in terms of thermal comfort (in this case, from 20°C to 27°C). Fluctuations in these set-points do exist, but the result of these fluctuations would still be a trajectory varying between the RBC set-points. By including as many of them as possible, a meaningful assessment against typical user or smart control is ensured.

6 EXPERIMENTAL EVALUATION

The first experiment, summarized in Figure 3, evaluates the proposed controller's efficiency against typical RBC values for a typical summer day, with respect to the two basic metrics: energy consumption and thermal comfort. The controller actually verges on the ideal comfort level for tr = 0 and leads to lower consumption for tr = 1. Additional results concerning two three-month periods, one in winter (January to March) and one in summer (June to August), are summarized in Table 1. The proposed controller achieves up to 59.2% mean energy savings (for tr = 1) and up to 41.8% comfort savings (for tr = 0) on average. Regarding the learning performance, it is shown that the worst-case scenario requires on average only 303/90 = 3.37 training sessions per day.


[Figure 3: Daily performance of the proposed controller against RBCs. (a) Comfort comparison (PMV over simulation hours) when optimizing PMV (tr = 0); (b) power consumption comparison (kW) when optimizing energy (tr = 1), against RBC set-points from 20°C to 27°C.]


[Figure 4: Daily mean MLP TD-error over the first 90 days.]


The mean TD error (Eq. 8) for each of the first 90 days is plotted in Figure 4. These results confirm previous evidence that the controller's efficiency improves over time, as the machine learning part of the controller leads to lower error values. The majority of these values are below 0.15, which has been defined in Section 4 as the threshold for considering the model successful.

Learning Performance
  Trade-off              | Retrain sessions (winter/summer)
  0   (Optimize Comfort) | 180 / 112
  0.5 (Optimize both)    | 207 / 157
  1   (Optimize Energy)  | 303 / 223

Control Performance
  Trade-off              | Mean energy savings vs RBCs (winter/summer) | Mean comfort savings vs RBCs (winter/summer)
  0   (Optimize Comfort) | -7.8% / 10.8%                               | 11.9% / 41.8%
  0.5 (Optimize both)    | 28.4% / 32.4%                               | -3.9% / 27.4%
  1   (Optimize Energy)  | 59.2% / 48.3%                               | -23.3% / 5.8%

Table 1: Evaluation of the thermostat's learning and control performance.


Expressing the optimization objective as a weighted sum enables the designer to designate a preference with respect to the importance of each objective. To support this functionality, the critical component of our design is its dynamic scaling part, which normalizes the values of the objectives so that the weighting factors dominate the calculated cost. In the experiment illustrated in Figure 5 we study this ability for different values of the trade-off factor tr. We observe that, depending on its value, the system emphasizes one of the two metrics. For example, for tr = 0 the controller leads to the best comfort values, while for tr = 1 it minimizes the energy consumption. Figure 5 also highlights that setting tr = 0.5 actually leads to results where both the energy cost and the occupants' thermal comfort are of equal importance. It is also apparent that the thermostat's performance improves over time.

Last, it is important to quantify the ability of the proposed framework to support online execution on small form-factor, resource-constrained embedded devices. Towards this direction, the proposed control ensemble was evaluated on two well-known, single-core embedded devices: a BeagleBoard-xM (ARM37x Cortex-A8 @ 1 GHz) and a Raspberry Pi Zero (ARMv6 BCM2835 @ 1 GHz). We focus our analysis on the average execution latency for the training and prediction of the utilized MLP Neural Network, since these are the most computationally demanding tasks of the proposed controller.


[Figure 5: Efficiency of dynamic scaling in satisfying the trade-off: variation of daily mean comfort (|PMV| sum) and energy (kWh) scores for tr = 0, 0.5 and 1.]

The average training latency for a batch of 4000 transitions (around 1.5 months of operation) on the two devices was 48.94 and 64.32 seconds, respectively. The corresponding prediction latency was measured at 0.0013 and 0.0025 seconds. These results emphasize the feasibility of implementing the proposed controller on an embedded platform and are attributed to the simple nature of the utilized Neural Network.

7 CONCLUSIONS

This paper has introduced a model-free, plug&play approach to the problem of an HVAC thermostat's set-point scheduling. Through reinforcement learning, the controller adapts and improves its performance over time. The user can set the preferred balance between energy savings and thermal comfort. The proposed controller is lightweight and can be implemented on low-cost embedded devices. The solution is demonstrated using a well-known simulation and testing framework and is implemented on ARM-based microprocessors.

ACKNOWLEDGMENTS

Work in this paper has been partially funded by the European Union's Horizon 2020 research and innovation programme, under project SDK4ED, grant agreement No 780572.

REFERENCES

[1] Abdul Afram and Farrokh Janabi-Sharifi. 2014. Theory and applications of HVAC control systems – A review of model predictive control (MPC). Building and Environment 72 (2014), 343–355.
[2] Rafael Alcalá, Jorge Casillas, Oscar Cordón, Antonio González, and Francisco Herrera. 2005. A genetic rule weighting and selection process for fuzzy control of heating, ventilating and air conditioning systems. Engineering Applications of Artificial Intelligence 18, 3 (2005), 279–296.
[3] Plamen P Angelov and Richard A Buswell. 2003. Automatic generation of fuzzy rule-based models from data by genetic algorithms. Information Sciences 150, 1-2 (2003), 17–31.
[4] Amjad Anvari-Moghaddam, Hassan Monsef, and Ashkan Rahimi-Kian. 2015. Optimal smart home energy management considering energy saving and a comfortable lifestyle. IEEE Transactions on Smart Grid 6, 1 (2015), 324–332.
[5] Enda Barrett and Stephen Linder. 2015. Autonomous HVAC control, a reinforcement learning approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 3–19.
[6] Sahand Behboodi, David P Chassin, Ned Djilali, and Curran Crawford. 2018. Transactive control of fast-acting demand response based on thermostatic loads in real-time retail electricity markets. Applied Energy 210 (2018), 1310–1320.
[7] Abdullatif E Ben-Nakhi and Mohamed A Mahmoud. 2002. Energy conservation in buildings through efficient A/C control using neural networks. Applied Energy 73, 1 (2002), 5–23.
[8] Danushka Bollegala. 2017. Dynamic feature scaling for online learning of binary classifiers. Knowledge-Based Systems 129 (2017), 97–105.
[9] Francesco Calvino, Maria La Gennusa, Gianfranco Rizzo, and Gianluca Scaccianoce. 2004. The control of indoor thermal comfort conditions: introducing a fuzzy adaptive controller. Energy and Buildings 36, 2 (2004), 97–102.
[10] David P Chassin, Jakob Stoustrup, Panajotis Agathoklis, and Nedjib Djilali. 2015. A new thermostat for real-time price demand response: Cost, comfort and energy impacts of discrete-time control without deadband. Applied Energy 155 (2015), 816–825.
[11] Konstantinos Dalamagkidis, Denia Kolokotsa, Konstantinos Kalaitzakis, and George S Stavrakakis. 2007. Reinforcement learning for energy conservation and comfort in buildings. Building and Environment 42, 7 (2007), 2686–2698.
[12] Panayiotis Danassis, Kostas Siozios, Christos Korkas, Dimitrios Soudris, and Elias Kosmatopoulos. 2017. A low-complexity control mechanism targeting smart thermostats. Energy and Buildings 139 (2017), 340–350.
[13] U.S. Department of Energy. 2015. EnergyPlus Energy Simulation Software. http://apps1.eere.energy.gov/buildings/energyplus/.
[14] Anastasios I Dounis and Christos Caraiscos. 2009. Advanced control systems engineering for energy and comfort management in a building environment – A review. Renewable and Sustainable Energy Reviews 13, 6-7 (2009), 1246–1261.
[15] E.U. Commission. 2008. European energy and transport – Trends to 2030 (Update 2007). http://aei.pitt.edu/46140/
[16] Eurostat. 2005. Energy balance sheets. Data 2002–2003, Luxembourg.
[17] Poul O Fanger et al. 1970. Thermal comfort. Analysis and applications in environmental engineering. (1970).
[18] Clement Gehring and Doina Precup. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 1037–1044.
[19] D Kolokotsa, GS Stavrakakis, K Kalaitzakis, and D Agoris. 2002. Genetic algorithms optimized fuzzy controller for the indoor environmental management in buildings implemented using PLC and local operating networks. Engineering Applications of Artificial Intelligence 15, 5 (2002), 417–428.
[20] GD Kontes, GI Giannakis, Elias B Kosmatopoulos, and DV Rovas. 2012. Adaptive-fine tuning of building energy management systems using co-simulation. In Control Applications (CCA), 2012 IEEE International Conference on. IEEE, 1664–1669.
[21] Rajesh Kumar, R.K. Aggarwal, and J.D. Sharma. 2013. Energy analysis of a building using artificial neural network: A review. Energy and Buildings 65 (2013), 352–358. https://doi.org/10.1016/j.enbuild.2013.06.007
[22] L Magni, Giuseppe De Nicolao, Lorenza Magnani, and Riccardo Scattolini. 2001. A stabilizing model-based predictive control algorithm for nonlinear systems. Automatica 37, 9 (2001), 1351–1362.
[23] Charalampos Marantos, Kostas Siozios, and Dimitrios Soudris. 2017. A Flexible Decision-Making Mechanism Targeting Smart Thermostats. IEEE Embedded Systems Letters 9, 4 (2017), 105–108.
[24] David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. 2000. Constrained model predictive control: Stability and optimality. Automatica 36, 6 (2000), 789–814.
[25] Daniele Menniti, Ferdinando Costanzo, Nadia Scordino, and Nicola Sorrentino. 2009. Purchase-bidding strategies of an energy coalition with demand-response capabilities. IEEE Transactions on Power Systems 24, 3 (2009), 1241–1255.
[26] Martin Riedmiller. 2005. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning. Springer, 317–328.
[27] Martin Riedmiller, Mike Montemerlo, and Hendrik Dahlkamp. 2007. Learning to drive a real car in 20 minutes. In Frontiers in the Convergence of Bioscience and Information Technologies, 2007. FBIT 2007. IEEE, 645–650.
[28] Luigi Rucco, Andrea Bonarini, Carlo Brandolese, and William Fornaciari. 2013. A bird's eye view on reinforcement learning approaches for power management in WSNs. In Wireless and Mobile Networking Conference (WMNC), 2013 6th Joint IFIP. IEEE, 1–8.
[29] Carina Sagerschnig, Dimitrios Gyalistras, Axel Seerig, Samuel Prívara, Jiří Cigler, and Zdenek Vana. 2011. Co-simulation for building controller development: The case study of a modern office building. In Proc. CISBAT. 14–16.
[30] Jagdev Singh, Nirmal Singh, and JK Sharma. 2006. Fuzzy modeling and control of HVAC systems – A review. (2006).
[31] Teck-Hou Teng, Ah-Hwee Tan, and Yuan-Sin Tan. 2012. Self-regulating action exploration in reinforcement learning. Procedia Computer Science 13 (2012), 18–30.
[32] Bryan Urban and Kurt Roth. 2014. A Data-Driven Framework For Comparing Residential Thermostat Energy Performance.
[33] Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for building HVAC control. In Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE. IEEE, 1–6.
[34] Michael Wetter. 2011. Co-simulation of building energy and control systems with the Building Controls Virtual Test Bed. Journal of Building Performance Simulation 4, 3 (2011), 185–203.
[35] Ji Hoon Yoon, Ross Baldick, and Atila Novoselac. 2014. Dynamic demand response controller based on real-time retail price for residential buildings. IEEE Transactions on Smart Grid 5, 1 (2014), 121–129.