

Energy and Uncertainty: Models and Algorithms for Complex Energy Systems

Warren B. Powell

Copyright © 2014, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

The problem of controlling energy systems (generation, transmission, storage, investment) introduces a number of optimization problems that need to be solved in the presence of different types of uncertainty. I highlight several of these applications, using a simple energy storage problem as a case application. Using this setting, I describe a modeling framework that is based on five fundamental dimensions and that is more natural than the standard canonical form widely used in the reinforcement learning community. The framework focuses on finding the best policy, where I identify four fundamental classes of policies consisting of policy function approximations (PFAs), cost function approximations (CFAs), policies based on value function approximations (VFAs), and look-ahead policies. This organization unifies a number of competing strategies under a common umbrella.

The challenge of creating an efficient, sustainable energy system requires solving design and control problems in the presence of different sources of uncertainty.

There is the familiar array of decisions: discrete actions, continuous controls, and vector-valued (and possibly integer) decisions. The tools for these problems are drawn from computer science, engineering, applied math, and operations research.

The problems of energy planning, however, introduce a fresh dimension in terms of the range of different types of uncertainty: familiar Gaussian noise, heavy-tailed distributions, bursts, rare events, temporal uncertainty, lagged information processes, and model uncertainty. Randomness arises in supplies and demands, in costs and prices, and in the underlying dynamics (the efficiency of gas turbines, and evaporation rates from reservoirs). A by-product of this variability is the need to use different types of objective functions. We can minimize the expected costs, but there is growing interest in drawing on different types of risk measures as we try to capture our attitudes toward uncertainty.

The design and control of energy systems is proving to be a rich playground for specialists in the science of making decisions under uncertainty (also known as stochastic optimization). The purpose of this article is to identify some of the modeling and algorithmic challenges that arise in energy systems, and to demonstrate a way of modeling them that represents a slight departure from the classical modeling frameworks used in computer science or operations research.



The proper handling of information is arguably one of the most subtle challenges in the correct modeling of complex systems. We have well-developed notation for modeling physical processes: the operation of generators, the flow of electricity, the storage of power, and the logistics of people, equipment, and fuels. Deterministic optimization models follow a small number of standard modeling conventions, as evidenced by a substantial body of papers with models that are clear, correct, and implementable.

By contrast, the communities that attempt to combine decisions and uncertainty are fragmented into a jungle of subcommunities with names such as reinforcement learning (Sutton and Barto 1998), stochastic search (Spall 2003), approximate/adaptive dynamic programming (Werbos 1974, Powell 2011, Bertsekas 2012), stochastic programming (Birge and Louveaux 1997), Markov decision processes (Puterman 2005), decision trees, optimal control (Stengel 1994), and simulation-optimization (Swisher, Jacobson, and Yucesan 2003, Fu 2008). Notation for the modeling of information can be nonstandard even within a community.

Ignoring for the moment our inability to agree on the right way to write down these problems, we then face the challenge of designing algorithms to deal with the different types of uncertainty. Reliable algorithms exist only for an astonishingly narrow set of problems. The sorry state of affairs is difficult to discern from the ocean of research “proving” that algorithms work and experimental work where we are all heroes in our own papers. We focus on the challenge of finding good policies, and identify four fundamental classes of policies that help to integrate what have often been presented as competing algorithmic strategies.

Sample Problems in Energy Systems

The process of delivering electricity requires a system comprised of a series of components, spanning generation, transmission, storage, and systems for managing loads (demands). Since electricity is a critical but difficult-to-store commodity, utilities use complex contracts to protect both consumers and themselves from the risk of high electricity prices. All of these processes interact in a complex way that benefits from the tools of stochastic optimization.

Energy Storage

The problem of storing energy now to be used later arises in such a wide range of settings that it has become a foundational problem, as ubiquitous in energy systems as inventory theory is in operations management. Storage is particularly important in the context of the electric power grid, because electricity is extremely difficult to store, resulting in a variety of strategies to compensate for different types of variability (generator failures, transmission failures, and unexpected increases in temperatures on a hot day). There is tremendous interest in direct methods of storing energy from the grid, spanning pumped hydro (pumping water uphill), grid-level battery storage, compressed air (pumping air under high pressure into underground cavities), flywheels, and vehicle-to-grid.

Some settings include battery arbitrage, bidding, pairing storage with wind or solar, and hydroelectric storage.

Battery arbitrage: Real-time spot prices of electricity on the grid typically average around $50 per megawatt-hour, but they may range down into the 30s (and can even go negative!), and they routinely spike above $200 per megawatt-hour, occasionally exceeding $500 per megawatt-hour. These prices vary by node (they are known as locational marginal prices, or LMPs), and they may be updated every 5 to 15 minutes. Battery arbitrage involves drawing energy from the grid when prices are low, and selling energy back when prices are high. The problem is to determine these upper and lower limits.
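To make the threshold logic concrete, here is a minimal sketch (not from the article) that backtests a fixed pair of charge/discharge price limits against a synthetic LMP series. The price model, battery parameters, and threshold values are all illustrative assumptions, and numpy is assumed to be available.

```python
import numpy as np

def simulate_arbitrage(prices, buy_below=30.0, sell_above=70.0,
                       capacity=1.0, rate=0.25, efficiency=0.9):
    """Backtest a fixed threshold policy: charge when the LMP is below
    buy_below, discharge when it is above sell_above (illustrative values)."""
    energy, revenue = 0.0, 0.0
    for p in prices:                              # one price per interval
        if p < buy_below and energy < capacity:
            q = min(rate, capacity - energy)      # MWh bought this step
            energy += q * efficiency              # round-trip losses applied on charge
            revenue -= q * p
        elif p > sell_above and energy > 0.0:
            q = min(rate, energy)                 # MWh sold this step
            energy -= q
            revenue += q * p
    return revenue

# Synthetic LMPs: roughly $50/MWh on average with occasional spikes (made up).
rng = np.random.default_rng(0)
lmps = 50.0 + 10.0 * rng.standard_normal(2000)
spikes = rng.random(2000) < 0.02
lmps[spikes] += rng.exponential(300.0, size=spikes.sum())
print(simulate_arbitrage(lmps))
```

Tuning the two limits, rather than fixing them, is exactly the kind of parameter search discussed later in the article.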

Bidding energy storage: If your battery is large enough, you are required to bid your energy into the electricity market. Electricity markets are managed primarily by independent system operators (ISOs) such as PJM Interconnection, which manages the grid that covers a triangle from the Chicago area to Virginia up through New Jersey, and all the states in between (13 in total). The rules differ between ISOs, but a battery operator might have to quote lower (charge) and upper (discharge) prices either an hour ahead or even a day ahead. The optimization problem is to choose these prices to maximize net revenues while managing the energy level in the battery.

Pairing storage with wind or solar: Wind and solar are both very popular as a completely green source of energy, but they suffer from our inability to control them (a property known as dispatchability in the electricity community). The sun shines only during the day, and both wind and solar suffer from intermittency, creating a need for backup sources. Storage offers a way of smoothing both the intermittency as well as daily cycles.

Hydroelectric storage: Some regions are served by networks of water reservoirs, which introduces the problem of optimizing the flows between reservoirs, a problem that has long attracted the attention of the stochastic optimization communities.

Storage problems have to be solved in the presence of different types of uncertainty: uncertain generation from wind turbines and solar panels, stochastic (and heavy-tailed) real-time prices, infrequent failures of generators or transmission lines, as well as forecasts of temperatures and loads.


Commodity Contracts

Utilities need to sign contracts, typically 3 to 5 years into the future (but sometimes longer), to supply a customer with energy at a guaranteed price. The utility then needs to purchase hedges to protect against potential spikes in prices. While spikes are generally of short duration, a lengthy heat wave in Texas in 2012 cost a utility tens of millions of dollars. Contracts can be purchased over different horizons. For this problem, we might introduce the notation $a_{tt'}$, giving us the amount we wish to purchase at time $t$ for delivery during time period $t'$. Forecasts $f_{tt'}$ (for example, of prices or demands) and decisions $a_{tt'}$ (for example, purchasing forward contracts) are known as lagged information and decision processes.

Negotiating these advance contracts requires that we deal with different sources of uncertainty: the long-term growth in demand, improvements in energy efficiency, structural changes in commodity prices such as gas and oil, as well as the usual spikes in prices shown in figure 1.

Demand Response

Demand response, in the terminology of the energy community (sometimes called demand side management or demand side response), refers to the ability to reduce demand using various incentives. This might be through a real-time price or, more often, a previously arranged contract where customers are paid in return for agreeing to reduce their load on demand. These contracts may offer consumers a lower price in return for their agreement to reduce their load on demand (possibly under direct control of the grid operator or a load-serving entity).

Demand response providers have to deal with a number of issues. First, customers often want some advance notice (two hours is typical) so they can take actions to minimize the impact of the reduction. Commercial buildings might want to air condition a lobby so the cool air can be used later. A manufacturing enterprise might want to make extra product, which is then held in inventory. Both of these examples represent a form of advance storage.


Figure 1. Electricity Spot Prices on the PJM Grid.


In other cases, demand is pushed off to later in the day (again, a different form of storage).

Second, customers may not always respond in a predictable way. Increasing the cost of charging an electric vehicle may have a bigger response on a warm and sunny day, when customers are more willing to walk, than on a cold and rainy day. It may be necessary to learn a function that changes depending on conditions.

The challenge we face with demand response is that we are often dealing with people, not equipment. Not only do we not know how people (including organizations) may respond to load curtailment incentives, the response may depend on local conditions that change over time. Well known is the problem of fatigue, where responses diminish as repeated demands for curtailment are made. These behaviors introduce the dimension of model uncertainty (figure 2).

Unit Commitment

The unit commitment problem involves planning the power generating units (nuclear, steam plants run by coal or gas, and fast-ramping gas turbines) that are to be turned on or off, and the level at which they should be operated. Most grid operators follow PJM (a transmission organization), which runs a day-ahead market to determine the steam generators that should be used tomorrow, and an hour-ahead market that plans which gas turbines should be online. There is also a real-time, economic dispatch problem that can ramp a generator up and down within limits, without turning it off or on.

Unit commitment problems are hard largely because they might involve up to a hundred thousand integer variables. Fortunately, such problems can be solved using commercial solvers such as Cplex or Gurobi. However, these solvers can only be used if we view the future deterministically. ISOs have learned how to insert reserve capacity to handle the uncertainty in demand forecasts and the infrequent generator failure, but face a much more difficult challenge in designing a schedule that can handle the much higher level of uncertainty as generation from wind and solar increases.

As an example, it is not hard to recognize that if we are going to count on a significant contribution from wind or solar, we are going to have to arrange enough backup reserve to provide power when the wind drops or clouds come in. But this is not enough, especially when we are drawing energy from wind, which can exceed the forecast.

Figure 2. Uncertainty in Demand Response Models. (The plot shows electricity demand as a function of price, with a family of possible demand response curves.)


If this happens, we also need to be able to ramp the generators down. If we cannot do this (which might happen if gas turbines are running at their minimum), then we cannot use the wind energy, forcing us to replace free energy, which generates no carbon dioxide (CO2), with more expensive energy from natural gas, which has a substantial CO2 footprint.

A significant part of unit commitment involves protecting against generation and network failures. Figure 3 shows the primary transformers in New York City, which fail from time to time. It is important to plan generation capacity that will meet demands even in the event of these infrequent, but not rare, network failures.

Replacement of Transformers

Transformers represent a critical technology in a grid. A large number of transformers were installed in the 1950s and 1960s, and are still operating today. As a result, utilities do not actually know the true failure-rate curve. Overestimating the failure-rate curve (which is common if the utilities believed the curves provided by manufacturers were accurate) increases costs by replacing transformers unnecessarily. However, underestimating the failure-rate curve exposes the utility to the risk of a burst of failures. It can take up to 12 months to order and install a replacement, and utilities would be exposed to considerable financial risk if they had to order, say, 20 transformers in a single year.

Information, Forecasting, and Uncertainty

The aforementioned examples help to illustrate a range of different types of uncertainty. Before highlighting the different types of uncertainty, it is useful to make a few remarks about the nature of information and uncertainty.

First, all of the applications above involve some sort of manageable resources (generators, energy in storage, commodities, spare transformers, and even customers). Exogenous information can enter in the form of energy from the sun, uncertain temperatures, uncertainty in the response of customers to curtailment decisions, infrequent generator failures, and heavy-tailed spot prices.

Second, we often have access to some sort of forecast.


Figure 3. Primary Feeders in New York, with Potential Failures. (The diagram shows a substation, its transformers, and the 27kV primary feeders.)


We might let $f_{tt'}$ be the forecast of solar energy during time period $t'$, using information available up to time $t$. If $E_{t'}$ is the actual energy generated at time $t'$, it is reasonable to expect that the forecast is unbiased, which is to say

$$f_{tt'} = \mathbb{E}_t[E_{t'}] \qquad (1)$$

where $\mathbb{E}_t[\ldots]$ represents the expected value of a quantity made at time $t$. It is common to think of the energy $E_{t'}$ as random if we are at time $t < t'$, but the real uncertainty is not in $E_{t'}$, but rather the difference between $E_{t'}$ and its forecast, which we might write as

$$E_{t'} = f_{tt'} + \varepsilon_{t'} \qquad (2)$$

Figure 4 illustrates a forecast of wind (bold solid line) produced by a meteorological model called WRF (weather research and forecasting). Also shown are the actual observations (dashed line), along with a number of sample realizations of the random variable $\varepsilon_{t'}$ drawn from a stochastic model estimated specifically to capture the difference between actual and forecasted wind. Note that in this plot, the variation in the forecasted wind is much larger than the error in the forecast. It is important to distinguish predictable variability from unpredictable variability.
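As a sketch of how equation (2) can be used to generate scenarios, the hypothetical snippet below adds simulated forecast errors to a fixed forecast. The AR(1) error model and its parameters are assumptions for illustration and are not the model behind figure 4.

```python
import numpy as np

def sample_paths(forecast, n_paths=10, rho=0.8, sigma=0.15, seed=1):
    """Generate realizations E_t' = f_tt' + eps_t' (equation 2), where the
    errors follow an illustrative AR(1) process with autocorrelation rho."""
    rng = np.random.default_rng(seed)
    T = len(forecast)
    paths = np.empty((n_paths, T))
    for i in range(n_paths):
        eps = 0.0
        for t in range(T):
            eps = rho * eps + rng.normal(0.0, sigma * max(forecast[t], 1e-6))
            paths[i, t] = max(forecast[t] + eps, 0.0)   # wind energy cannot be negative
    return paths

forecast = np.array([5.0, 8.0, 12.0, 15.0, 11.0, 7.0])   # MWh per period, made up
print(sample_paths(forecast).round(1))
```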

This application illustrates a different type of uncertainty. In the graph, the meteorological model seems to do a good job of forecasting sudden increases in wind speed, but in addition to amplitude errors (errors in the amount of wind), there also appear to be temporal errors, representing errors in predicting the timing of changes in the wind speed.

This discussion highlights the different types of uncertainty that arise in different applications in energy systems:

Gaussian noise, which is used to model the familiar bell-shaped curve of the normal distribution.

Heavy-tailed distributions, as arise in models of real-time electricity prices. In one study, electricity prices were well predicted by a Cauchy distribution with infinite variance.

Binomial random variables with small probabilities, which would be used to model the event that a generator fails.



Figure 4. Actual and Forecasted Wind, Along with Sample Paths Representing Other Potential Outcomes.


Poisson distributions, which would be used to model the number of failures over a large network.

Temporal errors, which capture errors in the timing of an event.

Model uncertainty, where we may not know the structure of a model or, more often, we do not know the parameters of a model.

Unobservable processes. We will never know the elasticity of demand in response to electricity prices, or the true load on the network, which is influenced by activities that are outside of the range of our grid (but still connected). Instead, we will have to represent our understanding of these parameters through a probability distribution.

We also have to distinguish between problems where we have a reasonable estimate of the probability distribution of the uncertainty, and problems where we do not. If we do not know the probability distribution, we may assume that we are working in a setting where we can observe the random events. The second case is one way that model-free applications arise, although this is a term that is used differently in different communities.

Case Application: Analysis of a Solar Storage System

There is growing interest in pairing solar arrays with battery storage on the grid, where the battery can be used to smooth the variations in the solar energy (which can be quite spiky due to clouds). The battery can be used for multiple purposes. A major use is frequency regulation (smoothing out the high-frequency deviations from the standard 60-Hertz signal for electricity), for which there is an established market that pays about $30 per hour, per megawatt of power (a megawatt is the rate at which power moves over the network). Another use is known as battery arbitrage to exploit price volatility. Battery arbitrage arises because the price of electricity (known as a locational marginal price, or LMP) changes every 5 minutes (on the PJM grid; this frequency of update varies from one ISO to another). LMPs typically range from $30 per megawatt-hour (but can go lower) up to $1000 per megawatt-hour, averaging about $50 per megawatt-hour. A battery can be used to store electricity when prices are low, and sell it back when prices are high.

Managing energy storage requires a policy that can react to price spikes, while thinking about the timing of when solar energy will be generated, and when it is needed. A sample configuration is illustrated in figure 5. Forecasts are often available for both generation and loads (which tend to depend heavily on temperature). We have to decide whether our solar array should meet current demands of a building, sell to the grid, or store in a battery.

We are going to use this setting to illustrate some principles of modeling and the design of control policies.

The Challenge of Modeling

It is very common in the reinforcement learning community to represent stochastic dynamic problems as consisting of a state space S, action space A, reward function r(s, a), a discount factor γ, and a one-step transition matrix P with element p(s′ | s, a).



Figure 5. Illustration of Storage Device Balancing Energy from Wind and Grid to Meet a Time Varying Load.


Implicit in the representation is that the problem is stationary. Such a representation would be called model based in computer science because we assume we know the transition matrix P (known in this community as the model). It is widely recognized that the transition matrix is not known for many applications. Such models are known as model free, and it is assumed that if we are in a state $s^n$ and take action $a^n$, it is possible to observe from an external source the next state $s^{n+1}$ and an associated reward $r(s^n, a^n, s^{n+1})$. This line of thinking overlooks the many applications where the dynamics are known, but the transition matrix is not computable.

Such problems are often modeled as dynamic programs by writing Bellman's optimality equation, given by

$$V(s) = \min_{a \in A}\left( r(s,a) + \gamma \sum_{s' \in S} p(s'\,|\,s,a)\, V(s') \right) \qquad (3)$$

Of course, Bellman's optimality equation is not a model at all, but rather a characterization of an optimal policy. The problem is that it is also well known to be computationally intractable for all but the smallest problems. The problem is most commonly described as the curse of dimensionality, which refers to the explosion of the state space as additional dimensions are added. In reality, for many problems there may be up to three “curses” of dimensionality, which we discuss below.

As problems become more complex, we have to put more attention into how we model these problems. There is a tendency to quickly put problems into a canonical framework and then focus on algorithms. However, the standard form can prevent us from addressing important dimensions of the problem.

Next, I offer a simple framework that divides modeling into five components: states, actions, exogenous information, the transition function, and the objective function.

The State of the System

Perhaps one of the most important quantities in a dynamic model is a state variable, but try to find a formal definition. Classic texts such as Bellman (1957) and Puterman (2005) avoid a definition with statements such as “there is a state of the system.” Wikipedia offers, “State commonly refers to either the present condition of a system or entity,” which is true but hopelessly vague, or, “A state variable is one of the set of variables that are used to describe the mathematical ‘state’ of a dynamical system” (using the word you are trying to define in the definition is a red flag that you do not have a definition). My own definition (from Powell [2011]) is that a state variable is the “minimally dimensioned function of history that is necessary and sufficient to calculate the decision function, cost/reward function, and (if available) the transition function” — in other words, all the information you need to model the system from time t onward, and only the information that is needed.

Even this definition is so general that a student modeling a complex system might overlook important elements. It helps to describe three increasing sets of variables: (1) the physical (or resource) state $R_t$, which might be the energy in a battery, the state of a generator, or the status of a component such as a transformer that might fail; (2) the information state $I_t$, which (in addition to including $R_t$) might include the price of electricity, the current temperature, the wind speed, or the price of natural gas; and (3) the knowledge (or belief) state $K_t$, which, in addition to known information captured in $I_t$, captures distributional information about any unknown parameters.

A common mistake is to assume that the state of the system is given by the physical state, which is often (but not always) the only controllable dimension(s). In some applications, we need to include information that was first revealed in the past, leading some authors to describe these as history-dependent (that is, non-Markovian) systems. For example, we might need to know the wind speed over the past three hours to predict the wind speed in the next hour. Just because we first learned the wind speed three hours ago does not make it any different (as a piece of information) from the wind speed that we just learned now. But if we do not need the wind speed four hours ago, then it is not part of the (information) state.

The most subtle part of a state variable is the knowledge (belief) state, which is how we describe unobservable parameters. We might learn about how the market responds to a price signal, but only by varying the price and learning (imperfectly) the price response coefficient. In this setting, we may wish to try different prices to learn the price response more efficiently, but this would involve a classic exploration/exploitation trade-off familiar to the bandit community.
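To illustrate what a knowledge state might contain, here is a hypothetical sketch that maintains a normal belief about an unknown price-response coefficient and updates it after each observation. The linear response model, the prior, and the noise variance are all assumed for illustration.

```python
import numpy as np

class ResponseBelief:
    """Knowledge state K_t: a normal belief (mean, variance) about the slope beta
    in an assumed linear response model  load_reduction = beta * price + noise."""
    def __init__(self, mean=1.0, var=4.0, noise_var=1.0):
        self.mean, self.var, self.noise_var = mean, var, noise_var

    def update(self, price, observed_reduction):
        # Conjugate normal update for a single noisy linear observation.
        precision = 1.0 / self.var + price**2 / self.noise_var
        new_var = 1.0 / precision
        new_mean = new_var * (self.mean / self.var
                              + price * observed_reduction / self.noise_var)
        self.mean, self.var = new_mean, new_var

belief = ResponseBelief()
for price, reduction in [(2.0, 2.4), (3.0, 3.1), (1.0, 0.8)]:   # made-up observations
    belief.update(price, reduction)
print(round(belief.mean, 2), round(belief.var, 3))
```

Choosing which price to try next, given this belief, is the exploration/exploitation question mentioned above.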

To illustrate using our solar-storage case application, let $R_t$ be the energy stored in the battery, and let $E_t$ be the energy generated from the solar array over $t-1$ to $t$. Let $p_t$ be the price of electricity on the grid (known as a locational marginal price or LMP) at time $t$, which depends on the temperature $Q_t$ at time $t$ (LMPs tend to be both higher, and a lot more volatile, at very low and very high temperatures). The flow of energy into and out of the battery is governed by bids $b_t = (g_{t-1,t}, h_{t-1,t})$, made at time $t-1$, to be implemented between $t$ and $t+1$. The battery charges (buys from the grid) if $p_t < g_{t-1,t}$, and discharges (sells back to the grid) if $p_t > h_{t-1,t}$. According to market rules, the bids $b_t = (g_{t-1,t}, h_{t-1,t})$ were set at an earlier time (for simplicity, assume they were set at $t-1$), which means that at time $t$, $b_{t-1}$ is part of our (information) state. Finally, we recognize that temperature can be forecasted with some degree of precision.


Let $f_{tt'}$ be the forecast of the temperature $Q_{t'}$ at time $t'$ given what we know at time $t$ (note that $f_{tt} = Q_t$). We can roll this into a single variable $f_t = (f_{tt'})_{t' \ge t}$. Pulling all this together gives us our state variable $S_t = (R_t, E_t, p_t, f_t, b_t)$.
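For readers who like to see the bookkeeping, here is a minimal sketch of this state variable as a small data structure; the field values are made up.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    """S_t = (R_t, E_t, p_t, f_t, b_t) for the solar-storage application."""
    R: float            # energy in the battery (MWh)
    E: float            # solar energy generated over (t-1, t] (MWh)
    p: float            # current LMP ($/MWh)
    f: np.ndarray       # vector of temperature forecasts (f_tt')_{t' >= t}
    b: tuple            # bids (g, h) set earlier, in force over (t, t+1]

S0 = State(R=2.0, E=0.5, p=48.0, f=np.array([75.0, 78.0, 81.0]), b=(35.0, 90.0))
print(S0)
```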

The recognition that a forecast is part of the state variable seems to be quite recent. However, we can also argue that a forecast is really a part of our probability model of future events. Technically, if this probability model is changing over time (as forecasts are updated), then this probability model should also be part of the state variable. But the probability model can also be viewed as a latent variable, one that is used in calculating the expectation, but that is not explicitly represented in the state variable.

Decisions

Decisions come in different shapes and sizes: discrete actions, continuous scalars, and continuous or discrete vectors of arbitrary size. The reinforcement learning community works primarily with discrete actions, denoted a. The engineering controls community tends to work with low-dimensional continuous vectors denoted u. Finally, the operations research community often works with very high dimensional vectors denoted by x, typically in the context of convex resource allocation problems.

An important complication that arises in many applications involves lagged decisions. For example, we might have to purchase electricity contracts in month $t$ to deliver electricity in month $t'$. Alternatively, we might have to request at hour $t$ that a commercial building operator reduce its air conditioning in hour $t'$ (perhaps four hours in the future). We can represent these decisions as $a_{tt'}$, or as a vector $a_t = (a_{tt'})_{t' \ge t}$. Using this notation, we see that these lagged decisions are very similar to our forecasts (and work in a similar way, except that we control them).

When writing the model, it is important to define decisions but defer until later how a decision will be made. Standard modeling practice is to define an arbitrary function π(s), referred to as a policy, that returns an action a (or control u or decision x) given a state s. In practice, researchers tend to quickly adopt a particular class of functions to represent the policy, which can artificially narrow their search for a solution. For reasons that will become clear, we are going to let π represent the structure of the function used. Then, we let $A^\pi(s)$ be the function (policy) that returns an action a if we are in state s.

We can use $X^\pi(s)$ or $U^\pi(s)$ for our policies if we are using decision x or control u.

For our solar storage case application, the decision variable would be $a_t = (x_t, b_{t,t+1})$, where $x_t = (x_t^{GB}, x_t^{SB}, x_t^{SG})$ gives the energy flows from grid to battery, solar panel to battery, and solar panel to grid, respectively, while $b_{t,t+1} = (g_{t,t+1}, h_{t,t+1})$ represents the bids we make at time $t$ that will be used at time $t+1$. Next, we let $A^\pi(S_t)$ be the function (policy) that returns the action $a_t$ when we are in state $S_t$.

Exogenous Information

The reinforcement learning community often avoids explicit models of exogenous information by either assuming that the one-step transition matrix is given (the exogenous information is buried in this matrix), or by assuming the application is “model free,” which means that exogenous information is not observable.

The exogenous information process provides the information we need to describe the changes in the state variables. Since there is very little in the way of standard notation for modeling exogenous information, we have adopted the style of putting hats on exogenous information variables. We illustrate our style using our solar-storage application. For this system, the exogenous information comes in the form of changes in LMPs, which we denote by $\hat{p}_t = p_t - p_{t-1}$, changes in the energy from the solar panel, $\hat{E}_t$, and changes in the forecasts of future temperature, $\hat{f}_t$. We let $W_t$ be our generic exogenous information variable, which would be $W_t = (\hat{p}_t, \hat{E}_t, \hat{f}_t)$ for this application. If we were modeling storage over the grid, $\hat{p}_t = (\hat{p}_{ti})_{i \in I}$, where $\hat{p}_{ti}$ is the change in the LMP at node (bus) $i$. Thus, $W_t$ can have thousands of dimensions.

Transition Function

The concept of a transition function is widely used in the engineering controls community, but it is a term that is rarely heard in either the computer science or operations research communities. The transition function consists of a series of equations that describe how the state variables change over time.

For our solar storage application, let $\eta$ be the round-trip efficiency (typically around 0.80). If we let $A = (\eta, \eta, 0)^T$, then $A^T x_t$ is the net flow into or out of the battery. Also, between $t$ and $t+1$ there will be exogenous changes to the storage due to the interaction between the available energy from the solar panel $E_{t+1}$, the LMPs $p_t$, and the decision $b_t$. To keep the model simple, we represent this change using the random variable $\hat{R}_{t+1}$, which depends on the bid $b_t$ as well as the observations of the change in the LMP $\hat{p}_{t+1}$ and the change in the solar energy $\hat{E}_{t+1}$. Our transition function is then

$$R_{t+1} = R_t + A^T x_t + \hat{R}_{t+1} \qquad (5)$$
$$p_{t+1} = p_t + \hat{p}_{t+1} \qquad (6)$$
$$E_{t+1} = E_t + \hat{E}_{t+1} \qquad (7)$$
$$f_{t+1,t'} = f_{t,t'} + \hat{f}_{t+1,t'} \qquad (8)$$

These equations capture, respectively, the evolution of storage (the only controllable state), followed by the stochastic processes describing energy from wind, the loads (demands) on the grid, and the updating of forecasts. The variables with hats represent exogenous information that first became known at time $t+1$.


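Written as code, the transition function of equations (5) through (8) might look like the following sketch, which reuses the State structure from the earlier sketch and takes the exogenous changes as inputs; the numbers are illustrative.

```python
import numpy as np
from dataclasses import replace

ETA = 0.80
A = np.array([ETA, ETA, 0.0])   # maps x = (grid->battery, solar->battery, solar->grid)

def transition(state, x, bids, dR, dp, dE, df):
    """S_{t+1} = S^M(S_t, a_t, W_{t+1}): equations (5)-(8)."""
    return replace(
        state,
        R=state.R + A @ x + dR,      # (5) storage balance
        p=state.p + dp,              # (6) LMP update
        E=state.E + dE,              # (7) solar energy update
        f=state.f + df,              # (8) forecast updates
        b=bids,                      # bids chosen at t, used over (t, t+1]
    )

# One illustrative step: buy 0.2 MWh from the grid and store 0.3 MWh of solar.
S1 = transition(S0, x=np.array([0.2, 0.3, 0.0]), bids=(32.0, 95.0),
                dR=0.0, dp=3.5, dE=-0.1, df=np.array([0.5, -0.2, 0.0]))
print(S1)
```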

The transition function goes by many names in the literature: transfer function, system model, plant model, or simply model. While the notation $f(\ldots)$ is often used for the transition function, we use $S_{t+1} = S^M(S_t, a_t, W_{t+1})$ to represent our transition function (or “system model”). Note that at time $t$, the state $S_t$ is known, the action $a_t$ is computable (using our policy $A^\pi(S_t)$), but the new information $W_{t+1}$ is random at time $t$ (it will be known at time $t+1$).

This notation allows us to describe the second curse of dimensionality, which we introduced earlier. Let's consider what is involved in computing the one-step transition matrix. While it is convenient to assume that the transition matrix might simply be given, computing it requires solving the equation

$$p(s'\,|\,s,a) = \mathbb{E}\,\mathbb{1}_{\{s' = S^M(s,a,W)\}} \qquad (9)$$

where $\mathbb{1}_{\{E\}} = 1$ if the event $E$ is true, and the expectation is over the random variable $W$ (or $W_{t+1}$). If we are modeling a network with 100 storage devices, $W_{t+1}$ would have 300 dimensions. Calculating equation (9), then, means computing a 300-dimensional integral (or summation). It is easy to overlook the need to compute an expectation when calculating the one-step transition function. In fact, a more natural way of writing Bellman's equation is to use its expectation form

$$V(S_t) = \min_{a \in A}\left( r(S_t, a) + \gamma\, \mathbb{E}\left\{ V\!\left(S^M(S_t, a, W_{t+1})\right) \,\middle|\, S_t \right\} \right) \qquad (10)$$

Now we see the expectation explicitly, and we can understand that if our exogenous information process is multidimensional, then we are going to have a problem calculating this expectation. This is the second curse of dimensionality. Finally, if $a_t$ is a vector, then enumerating a set of discrete actions to solve the optimization problem introduces the third curse of dimensionality.
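As a sketch of the expectation in equation (9), the following hypothetical snippet estimates a single entry of the one-step transition matrix by sampling W on a toy one-dimensional storage problem; for a high-dimensional W, sampling like this remains feasible while enumerating the sum does not.

```python
import numpy as np

def estimate_transition_prob(s, a, s_next, step, sample_W, n=100_000, seed=0):
    """Monte Carlo estimate of p(s'|s,a) = E[ 1{s' = S^M(s,a,W)} ] (equation 9)."""
    rng = np.random.default_rng(seed)
    hits = sum(step(s, a, sample_W(rng)) == s_next for _ in range(n))
    return hits / n

# Toy example: discrete storage level in {0,...,5} and W in {-1, 0, +1}.
step = lambda s, a, w: max(min(s + a + w, 5), 0)
sample_W = lambda rng: rng.choice([-1, 0, 1], p=[0.25, 0.5, 0.25])
print(estimate_transition_prob(s=2, a=1, s_next=3, step=step, sample_W=sample_W))
```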

The Objective Function

Our last step involves writing the objective function. Let $C(s,a)$ be the cost (contribution if we are maximizing) if we are in state $s$ and take action $a$. If we fix our policy with the function $A^\pi(S_t)$ when we are in state $S_t$, we can write our objective function over a finite horizon $0, \ldots, T$ as

$$\min_{\pi \in \Pi}\, \mathbb{E}\, F^\pi(S_0, W) = \mathbb{E} \sum_{t=0}^{T} \gamma^t\, C\!\left(S_t, A^\pi(S_t)\right) \qquad (11)$$

where $S_{t+1} = S^M(S_t, A^\pi(S_t), W_{t+1})$ governs how we evolve from state $S_t$ to $S_{t+1}$. In virtually all practical applications, equation (11) is calculated using simulation.


We let $(W_1, W_2, \ldots, W_t, \ldots)$ represent a Monte Carlo sample realization of all the exogenous information. There are, of course, many possible sample realizations. We can either use one very long simulation, or take an average over a number of simulations.
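A minimal sketch of this simulation-based evaluation of equation (11): roll the policy forward against sampled exogenous information and average the discounted costs over replications. The policy, cost, transition, and sampling functions in the toy usage are placeholders.

```python
import numpy as np

def evaluate_policy(policy, cost, transition, sample_W, S0, T=96, gamma=1.0,
                    n_reps=200, seed=0):
    """Estimate E sum_t gamma^t C(S_t, A^pi(S_t)) (equation 11) by simulation."""
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(n_reps):
        S, total = S0, 0.0
        for t in range(T):
            a = policy(S)
            total += (gamma ** t) * cost(S, a)
            S = transition(S, a, sample_W(rng))    # S_{t+1} = S^M(S_t, a_t, W_{t+1})
        totals.append(total)
    return np.mean(totals), np.std(totals) / np.sqrt(n_reps)

# Toy usage: scalar storage state, threshold policy, Gaussian noise as W.
policy = lambda S: -1.0 if S > 3.0 else 1.0
cost = lambda S, a: a * 50.0
trans = lambda S, a, w: min(max(S + a + w, 0.0), 5.0)
sample_W = lambda rng: rng.normal(0.0, 0.5)
print(evaluate_policy(policy, cost, trans, sample_W, S0=2.0, T=24))
```

Any of the four policy classes discussed later can be plugged into the same loop.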

For some applications, the policy $A^\pi(S_t)$ is stationary. While stationary policies (and infinite horizon problems) are quite popular in the reinforcement learning community, applications in energy systems (in our experience) seem usually to require some sort of time-dependent policy (the policy itself is a function of time, and not just a function of a time-dependent state). Time of day, day of week, and seasonal patterns tend to be present across applications, and these tend to require the use of time-dependent policies.

The use of expectations is so widespread that we tend to use them blindly as a kind of default objective function. Expectations represent the correct operator when we want to minimize (or maximize) our average performance. However, there are many applications where we have to pay serious attention to risk. For example, in the unit commitment problem, we have to worry about the risk that we do not have enough generating capacity to cover the network load. Also, a supplier signing a contract to deliver energy at a fixed price in the future has to worry about the risk that electricity might cost more than the contract. If we want to depend on high penetrations of wind and solar, we have to worry about the risk that there will not be enough energy available from these intermittent sources. And if we are interested in investing in solar while counting on the value of solar renewable energy certificates (a market-driven form of subsidy), investors will worry about the possibility that the value of these certificates might drop from year to year.

A relatively simple approach to handling risk is to replace the expectation with a quantile. If $W$ is a random variable, let $Q^\alpha(W)$ be the $\alpha$ quantile of $W$. Equation (11) now becomes

$$\min_{\pi \in \Pi}\, Q^\alpha F^\pi(S_0, W) = Q^\alpha \sum_{t=0}^{T} \gamma^t\, C\!\left(S_t, A^\pi(S_t)\right) \qquad (12)$$

This simple change is actually not so simple. First, the quantile operator is no longer additive, which prevents us from using the Bellman optimality equation. Second, it is much harder to estimate the quantile of a function using standard Monte Carlo methods than it is to estimate the mean.
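The hypothetical snippet below illustrates the second point by comparing the sample mean and the 0.95 quantile of simulated episode costs drawn from a made-up heavy-tailed distribution; the quantile estimate typically fluctuates more than the mean for the same number of replications.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated total costs for many replications of some policy (heavy right tail, made up).
costs = rng.lognormal(mean=4.0, sigma=1.0, size=10_000)

for n in (100, 1_000, 10_000):
    sample = costs[:n]
    print(n, round(sample.mean(), 1), round(np.quantile(sample, 0.95), 1))
```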

The finance community has developed a rich theory around risk measures, especially a class known as coherent risk measures, which enjoy certain properties that are important in finance. Coherent risk measures enjoy properties such as convexity (with respect to the costs), monotonicity (higher costs mean higher risk), translationary invariance (adding a constant does not change the risk), and positive homogeneity (multiplying the costs times a scaling factor is the same as multiplying the risk times the same scaling factor).



For more on this topic, see Shapiro, Dentcheva, and Ruszczynski (2009) and Ruszczynski (2010). In our work, we have found that we might worry about the risk of running out of energy or losing money on a contract; such risk measures are not coherent, but rather fall in a category we are calling threshold risk measures (see Collado and Powell [2013]). Considerable attention is also being given to worst-case performance, sometimes referred to as robust optimization (for an excellent introduction to the field of robust optimization, see Ben-Tal, Ghaoui, and Nemirovski [2009]). The general field of stochastic optimization using risk measures or “robust” objectives is quite young, but with tremendous applications in energy.

The biggest challenge with finding the best policy, whether we use equation (11), (12), or any other risk measure, is determining what is meant by a search over policies. There is a fairly wide range of communities working in the general area of making effective decisions over time under uncertainty. Part of the diversity is explained by differences in problems addressed by different communities. But a contributing factor is the lack of a standard vocabulary for describing problems, and there seems to be a surprising misunderstanding in terms of what is meant by the word policy. We address this issue next.

Policies

If we were solving a deterministic problem, our challenge would be to find an optimal set of decisions. When we work on stochastic problems, the challenge is to find a set of functions that are known in the literature as policies. Since Bellman's seminal work in the 1950s on dynamic programming, there has been a widely shared presumption that this means solving Bellman's optimality equation (equations (3) or (10)). While this works well on a small class of problems, it has proven to be astonishingly difficult for most real applications, and certainly those that we have encountered in energy applications. As a result, different communities have identified a variety of strategies to solve different problem classes, a situation we have come to refer to as the jungle of stochastic optimization, illustrated in figure 6. Our conclusion, however, is that the range of strategies is not as diverse as it seems, since similar policies are often disguised by differences in presentation style and notation.

There seems to be some confusion about the meaning of the term dynamic program. Dynamic programming is often equated with Bellman's optimality equation. Actually, a dynamic program is a sequential decision problem, as depicted in equation (11). Bellman's optimality equation is (1) a characterization of an optimal policy, and (2) a foundation for finding a policy. A policy, on the other hand, is a mapping from a state to an action — any mapping. A policy, for example, may require solving a decision tree or a linear (or integer) program.

The search for an optimal (or good) policy can seem like an impossible task. How in the world are we supposed to search over some arbitrary class of functions? This seems even more hopeless when our decisions involve hundreds or thousands of dimensions (as occurs in the stochastic unit commitment problem) and have to obey a complex set of constraints. It turns out that a good guide is to look at the range of tools that people are already using.

The Four Classes of Policies

My own experience stumbling around the jungle of stochastic optimization for several decades has led to the conclusion that all the different algorithmic strategies can be boiled down into four fundamental classes: policy function approximations, cost function approximations, policies based on value function approximations, and look-ahead policies.

Policy function approximations (PFAs) are functions that return an action given a state, without solving an imbedded minimization or maximization problem.

Cost function approximation (CFA) covers policies where we minimize a cost function, usually modified in some way to achieve better long-term performance. In rare cases, minimizing a single-period cost may be optimal.

Policies based on value function approximations (VFAs) is the class of policies most often associated with dynamic programming, where we optimize a one-period cost/contribution plus an approximation of the value of future costs/contributions given the downstream state.

Look-ahead policies cover the entire class of policies where we optimize over a finite horizon H, but then just implement the first-period decision. Look-ahead policies include rolling/receding horizon procedures, decision trees, rollout heuristics, Monte Carlo tree search, model predictive control, and stochastic programming.

The four classes of policies can be briefly summarized as PFAs, CFAs, VFAs, and look-ahead policies. A more detailed summary of these policies, illustrated using energy applications, follows.

Policy function approximations represent the class of policies that are described by an analytic function of some form. PFAs come in a variety of forms: (1) lookup tables (wear a coat if it is raining); (2) parametric models, such as $A^\pi(S_t\,|\,\theta) = \theta_0 + \theta_1 S_t + \theta_2 S_t^2$; (3) neural networks; and (4) nonparametric functional approximations.

In our energy application, the decision to charge or discharge a battery is governed by the rule: “charge if $p_t < g_t$ or discharge if $p_t > h_t$,” which is a form of PFA. PFAs are the only class of policy that does not require solving a minimization problem.


A policy based on a cost function approximation might be written

$$A_t^\pi(S_t) = \arg\min_{a \in \mathcal{A}_t(\theta)} C^\pi(S_t, a\,|\,\theta) \qquad (13)$$

where $C^\pi(S_t, a\,|\,\theta)$ is a modification of the original cost function $C(S_t, a)$ designed to achieve better long-term performance. The vector $\theta$ represents parameters that can be tuned to improve the performance of the policy, and may affect the cost function itself or the constraint set $\mathcal{A}_t(\theta)$. For our energy application, the set $\mathcal{A}_t(\theta)$ would capture the flow conservation constraints on energy flows as well as logical limits on bids. Cost functions are widely used in engineering practice, but they have not received the respect they deserve. In the energy setting, a cost function is often used for the economic dispatch problem, which is a linear program that is allowed to adjust generators up or down (but cannot turn them on or off) to meet the current set of demands in real time.
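As a sketch of a cost function approximation in this economic dispatch setting, the snippet below solves a small dispatch linear program whose load requirement is inflated by a tunable factor θ, one illustrative way of modifying the cost problem; the generator data are made up and scipy is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog

def dispatch_with_reserve(costs, capacities, load_forecast, theta=0.15):
    """CFA sketch: a one-period economic dispatch that must cover
    (1 + theta) times the forecasted load; theta is the tunable parameter."""
    n = len(costs)
    # minimize sum_i cost_i * g_i  subject to  sum_i g_i >= (1 + theta) * load
    A_ub = -np.ones((1, n))                      # -sum g_i <= -(1+theta)*load
    b_ub = [-(1.0 + theta) * load_forecast]
    bounds = [(0.0, cap) for cap in capacities]
    res = linprog(costs, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x

# Three illustrative generators (cheap, mid-cost, expensive peaker) covering 900 MW.
print(dispatch_with_reserve(costs=[20.0, 35.0, 80.0],
                            capacities=[500.0, 400.0, 300.0],
                            load_forecast=900.0))
```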

Policies based on value function approximations would be written

$$A_t^\pi(S_t\,|\,\theta) = \arg\min_{a_t}\left( C(S_t, a_t) + \gamma\, \mathbb{E}\left\{ \bar{V}_{t+1}\!\left(S^M(S_t, a_t, W_{t+1})\right) \,\middle|\, S_t \right\} \right) \qquad (14)$$

where we have replaced the value function with some sort of approximation. For many problems, it is useful to define the post-decision state variable $S_t^a$, which is the state at time $t$ immediately after a decision has been made (see Powell [2011], chapter 4). For example, in the energy storage problem we introduced earlier, the post-decision state would be

$$S_t^a = \left(R_t^a, p_t, E_t, f_t, b_{t-1}, b_t\right)$$

where

$$R_t^a = R_t + A^T x_t$$

We have to augment the state variable to reflect that we not only have the bids $b_{t-1}$ made at time $t-1$, but also the bids $b_t$ that we just chose (to be used at time $t+1$). Note that the post-decision state is a deterministic function of $S_t$ and $a_t$, allowing us to write our VFA-based policy as

$$A_t^\pi(S_t\,|\,\theta) = \arg\min_{a_t}\left( C(S_t, a_t) + \bar{V}_t^a(S_t^a) \right) \qquad (15)$$

The policy $\pi$ determines the structure of the value function approximation. For example, it is very popular to use a linear model with a predefined set of basis functions, which we would write as


Figure 6. The Jungle of Stochastic Optimization. (The figure collects the many subcommunity labels: stochastic programming, Markov decision processes, simulation optimization, stochastic search, reinforcement learning, optimal control, policy search, model predictive control, decision trees, and Q-learning.)


$$A_t^\pi(S_t\,|\,\theta) = \arg\min_{a_t}\left( C(S_t, a_t) + \sum_{f \in \mathcal{F}} \theta_f\, \phi_f(S_t^a) \right) \qquad (16)$$

where the basis functions $\phi_f(s)$ represent the type of policy, while the regression parameters $\theta_f$ are the tunable parameters.
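A hypothetical sketch of equation (16): the value of the post-decision state is approximated by a small set of basis functions, the coefficients are fit by least squares to sampled downstream values, and the policy minimizes the one-period cost plus this approximation. The basis functions, the discrete action set, and the training samples are all illustrative.

```python
import numpy as np

def basis(R_post):
    # phi_f(S^a): illustrative basis functions of the post-decision storage level.
    return np.array([1.0, R_post, R_post ** 2])

def fit_theta(post_states, observed_values):
    """Least-squares fit of theta in V(S^a) ~ sum_f theta_f phi_f(S^a)."""
    Phi = np.array([basis(s) for s in post_states])
    theta, *_ = np.linalg.lstsq(Phi, np.array(observed_values), rcond=None)
    return theta

def vfa_policy(R, price, theta, actions=(-1.0, 0.0, 1.0), capacity=5.0):
    """Equation (16): argmin_a C(S,a) + sum_f theta_f phi_f(S^a)."""
    best_a, best_val = None, np.inf
    for a in actions:
        R_post = min(max(R + a, 0.0), capacity)      # post-decision storage
        val = price * a + basis(R_post) @ theta      # buying (a > 0) costs money
        if val < best_val:
            best_a, best_val = a, val
    return best_a

# Toy fit: pretend fuller batteries have more (negative-cost) value downstream.
theta = fit_theta(post_states=[0, 1, 2, 3, 4, 5],
                  observed_values=[0, -40, -75, -100, -115, -120])
print(vfa_policy(R=2.0, price=30.0, theta=theta))
```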

Look-ahead policies are widely used in practice. In their most popular form, a deterministic approximation of the future would be used, producing a policy of the form

$$A_t^\pi(S_t\,|\,\theta) = \arg\min_{(a_t, a_{t+1}, \ldots, a_{t+H}) \in \mathcal{A}(\theta^A)}\; \sum_{t'=t}^{t+\theta^H} C(S_{t'}, a_{t'}\,|\,\theta^C) \qquad (17)$$

where the look-ahead horizon is represented as one tunable parameter $\theta^H$, but where the cost function or constraints can also depend on tunable parameters ($\theta^C$ and $\theta^A$, respectively). There is a large community that attempts to build uncertainty into the look-ahead model. In computer science, this falls under the umbrella of Monte Carlo tree search, designed for discrete action spaces. The operations research community, with its focus on vector-valued decisions, uses the framework of stochastic programming, which approximates future uncertainties by building a scenario tree of sampled outcomes (not unlike Monte Carlo tree search); see Sen and Higle (1999) for an easy-to-read tutorial. The difference is that the scenario tree is constructed in advance. In this setting, $\theta$ would be the set of parameters that govern how the scenario tree is generated. Look-ahead policies are the only policy class that does not require some sort of functional approximation. While this can be quite attractive, the cost is that this is a relatively brute-force approach, and it is computationally the most difficult.
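A minimal sketch of a deterministic look-ahead policy in the spirit of equation (17) for the storage problem: solve a small linear program over the price forecast, then implement only the first-period decision. The storage parameters are assumptions, and scipy's linprog is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog

def lookahead_decision(R0, price_forecast, capacity=5.0, rate=1.0, eta=0.9):
    """Deterministic look-ahead: optimize buy/sell over the forecast horizon,
    then return only the first-period (buy, sell) decision."""
    H = len(price_forecast)
    # Variables: buy_0..buy_{H-1}, sell_0..sell_{H-1}; buying costs, selling earns.
    c = np.concatenate([price_forecast, -np.asarray(price_forecast)])
    A_ub, b_ub = [], []
    for k in range(H):                      # keep 0 <= R_{k+1} <= capacity
        row = np.zeros(2 * H)
        row[:k + 1] = eta                   # charging adds eta per MWh bought
        row[H:H + k + 1] = -1.0             # discharging removes energy sold
        A_ub.append(row);  b_ub.append(capacity - R0)
        A_ub.append(-row); b_ub.append(R0)
    bounds = [(0.0, rate)] * (2 * H)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[0], res.x[H]               # implement only the first decision

# Illustrative 6-period price forecast: buy while cheap, hold for the spike.
print(lookahead_decision(R0=1.0, price_forecast=[35, 30, 40, 120, 60, 45]))
```

In a rolling-horizon implementation, this solve is repeated at every period as the forecast is updated.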

In addition to this list of policies, it is possible to create a wide range of hybrids. For example, we might combine a deterministic look-ahead policy with a value function approximation, or a look-ahead policy with a cost function approximation (with modified costs and constraints). Look-ahead policies or policies based on value function approximations can also be combined with low-dimensional policy function approximations.

Note that three of our policies involve some sort of function approximation, whether it be the cost function approximation, the policy function approximation, or the value function approximation. Functions can be approximated using three broad strategies: lookup tables, parametric functions (these may be linear or nonlinear in the parameters), and nonparametric functions. In practice, look-ahead policies are often solved using a form of (parametric) cost function approximation, which typically enters the problem in the form of bonuses, penalties, and modified constraints (such as reserve capacity).

Look-ahead policies combined with cost function approximations are popular in planning electricity markets. Grid operators use large optimization models to determine which generators to turn off or on (the unit commitment problem) by planning over perhaps a 48-hour horizon (the look-ahead part). Built into this model are constraints to ensure that we have sufficient reserve, where we might require that total generation is $(1 + \theta)$ times the forecasted load (this is the cost function approximation part).

Note that all of the policies above depend on some type of tunable parameters, which was represented using $\theta$. Once we fix the class of policy, we can tune the parameters using

$$\min_\theta\, \mathbb{E} \sum_{t=0}^{T} C\!\left(S_t, A^\pi(S_t\,|\,\theta)\right) \qquad (18)$$

In general, we cannot compute the expectation exactly. As a result, we have to turn to the large field of stochastic search algorithms.
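As a sketch of this tuning loop, the hypothetical snippet below estimates the objective in equation (18) by simulation for a two-parameter threshold policy and runs a crude random search over θ; the price model and parameter ranges are made up.

```python
import numpy as np

def simulate_cost(theta, n_reps=100, T=96, seed=0):
    """Noisy estimate of the objective in equation (18) for a two-parameter
    threshold policy theta = (buy_below, sell_above); the price model is made up."""
    buy_below, sell_above = theta
    rng = np.random.default_rng(seed)        # common random numbers across candidates
    total = 0.0
    for _ in range(n_reps):
        R, cost = 0.0, 0.0
        for _ in range(T):
            p = 50.0 + 15.0 * rng.standard_normal()
            if rng.random() < 0.02:
                p += 300.0                    # occasional price spike
            if p < buy_below and R < 5.0:
                R += 0.9; cost += p           # buy 1 MWh, 90 percent efficient
            elif p > sell_above and R >= 1.0:
                R -= 1.0; cost -= p           # sell 1 MWh
        total += cost
    return total / n_reps

# Crude stochastic search: sample candidate thetas and keep the best noisy estimate.
rng = np.random.default_rng(1)
candidates = [(rng.uniform(20, 50), rng.uniform(55, 150)) for _ in range(25)]
best = min(candidates, key=simulate_cost)
print(tuple(round(v, 1) for v in best), round(simulate_cost(best), 1))
```

More careful stochastic search methods replace the fixed candidate list with adaptive sampling, but the structure of the loop is the same.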

Choosing a Policy

The list of policies above transforms the vague search over policies in equation (11) into a much more concrete process of identifying specific classes of policies, and then optimizing the tunable parameters. We have used all of these policies in our energy systems research.

So how do we choose between these? Not surprisingly, it all depends on the structure of your policy. I present some guidelines from our own empirical work:

Policy function approximations are best when the structure of the policy is obvious. For example, we might want to charge a battery when prices are below one number, and discharge when they are above a higher number. This structure is simply obvious (or may even be specified by regulations). When this is the case, we only have to search for the best values of the tunable parameters.

There are problems where minimizing single-period costs is optimal, and others where it is a good initial heuristic. For example, our utility may want to assign repair resources to jobs at least cost, but this might ignore a job that requires moving a long distance. We can introduce a bonus for covering jobs based on how long they have been waiting.

Value function approximations work well when you feel that approximating the value of being in a future state is relatively simple. This does not mean the problem is small. We have successfully used VFAs to optimize large transportation systems. VFAs seem to work especially well for energy storage problems, and problems where the value of being in a state can be approximated in a straightforward way using standard statistical tools. High dimensionality is not an issue if you are willing to live with separability.

If you have some type of forecast (prices, loads, weather), then a look-ahead policy is going to be a natural choice.



The question is: can you use a deterministic approximation of the future, or do you need to model uncertainty explicitly in the look-ahead model? This is not an obvious choice, and solving a stochastic look-ahead model can be quite hard (basically, it is a stochastic, dynamic program, but typically on a somewhat simplified problem). Deterministic look-ahead models can sometimes provide good solutions for stochastic models.

Often, a deterministic look-ahead policy may work well, but we are concerned about robustness. However, achieving a more robust solution may be fairly obvious. For example, standard industry practice with the unit commitment problem is to require reserve capacity to account for possible network failures. I would call this a hybrid look-ahead/CFA policy.

It is generally best to use a policy with a min or max operator (that is, anything but PFAs) if your decision is a vector that has to satisfy a set of constraints. These problems can generally be solved using some sort of math programming solver, such as Cplex, Gurobi, CVX, or LOQO, that is designed to handle constraints. PFAs work well when we have a good understanding of the structure of the policy, but this generally only happens for very low-dimensional decisions. However, it is possible to combine a PFA with an optimization-based policy by including the PFA in the objective function in the form of a penalty term for deviating from what the PFA feels would be best for a particular variable. For example, our economic dispatch model (which minimizes cost to meet demand, a form of CFA) may wish to discharge energy to the grid because of a high price, but our battery may be close to the point where further discharge will degrade its lifetime, requiring that we deviate from our guidelines for discharging the battery (a form of PFA).

Closing Notes

The applications in energy systems open up a wide array of challenges that require making decisions in the presence of different sources of uncertainty. The thoughts in this article are designed to bring together the different communities working in stochastic optimization. The contributions of the reinforcement learning community in the solution of problems with discrete action spaces and nonconvex objective functions would benefit from the skills of the operations research communities (approximate dynamic programming, stochastic programming, simulation-optimization) that are familiar with vector-valued decisions with convex objective functions.

References

Bellman, R. E. 1957. Dynamic Programming. Princeton, NJ: Princeton University Press.

Ben-Tal, A.; Ghaoui, L. El; and Nemirovski, A. 2009. Robust Optimization. Princeton, NJ: Princeton University Press.

Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, vol. 2: Approximate Dynamic Programming, 4th ed. Belmont, MA: Athena Scientific.

Birge, J. R., and Louveaux, F. 1997. Introduction to Stochastic Programming. Berlin: Springer Verlag.

Collado, R., and Powell, W. B. 2013. Threshold Risk Measures for Dynamic Optimization. Working paper, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ.

Fu, M. C. 2008. What You Should Know About Simulation and Derivatives. Naval Research Logistics 55(8): 723–736. dx.doi.org/10.1002/nav.20313

Lagoudakis, M. G., and Parr, R. 2003. Least-Squares Policy Iteration. Journal of Machine Learning Research 4: 1107–1149.

Powell, W. B. 2011. Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Hoboken, NJ: John Wiley & Sons. dx.doi.org/10.1002/9781118029176

Puterman, M. L. 2005. Markov Decision Processes, 2nd ed., vol. 7. Hoboken, NJ: John Wiley and Sons.

Ruszczynski, A. 2010. Risk-Averse Dynamic Programming for Markov Decision Processes. Mathematical Programming 125(2): 235–261. dx.doi.org/10.1007/s10107-010-0393-3

Sen, S., and Higle, J. L. 1999. An Introductory Tutorial on Stochastic Linear Programming Models. Interfaces 29(2): 33–61. dx.doi.org/10.1287/inte.29.2.33

Shapiro, A.; Dentcheva, D.; and Ruszczynski, A. 2009. Lectures on Stochastic Programming: Modeling and Theory. Philadelphia, PA: SIAM. dx.doi.org/10.1137/1.9780898718751

Spall, J. C. 2003. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Hoboken, NJ: John Wiley and Sons. dx.doi.org/10.1002/0471722138

Stengel, R. F. 1994. Optimal Control and Estimation. New York: Dover Publications.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.

Swisher, J. R.; Jacobson, S. H.; and Yücesan, E. 2003. Discrete-Event Simulation Optimization Using Ranking, Selection, and Multiple Comparison Procedures: A Survey. ACM Transactions on Modeling and Computer Simulation (TOMACS) 13(2): 134–154. dx.doi.org/10.1145/858481.858484

Werbos, P. J. 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Cambridge, MA: Harvard University.

Warren Powell is a professor in the Department of Operations Research and Financial Engineering at Princeton University, where he has taught since 1981. He is director of CASTLE Labs (www.castlelab.princeton.edu), which specializes in computational stochastic optimization and learning, with applications in energy, transportation, health, and finance. He has over 190 refereed publications and is the author of Approximate Dynamic Programming: Solving the Curses of Dimensionality and (with Ilya Ryzhov) Optimal Learning. He is a Fellow of INFORMS, where he has served in numerous leadership and service roles.
