
Joint Control of Manufacturing and Onsite Microgrid System via Novel Neural-Network Integrated Reinforcement Learning Algorithms

Wenqing Hu, Zeyi Sun, Jiaojiao Yang, Louis Steinmeister, and Kaibo Xu

Abstract—Microgrid is a promising distributed energy supply technology that typically consists of storage devices, generation capacities including renewable sources, and controllable loads. It has been widely investigated and applied for residential and commercial end-use customers as well as critical facilities. In this paper, we propose a joint dynamic control model of microgrids and manufacturing systems, formulated as a Markov Decision Process (MDP), to identify an optimal control strategy for both microgrid components and the manufacturing system so that the energy cost of production is minimized without sacrificing production throughput. The proposed MDP model has a high-dimensional state/action space and is complicated in that the state and action spaces have both discrete and continuous parts that are intertwined through constraints. To resolve these challenges, a novel reinforcement learning algorithm that leverages both on-policy temporal difference control (TD-control) and deterministic policy gradient (DPG) algorithms is proposed. In this algorithm, the values of discrete decision actions are learned through neural-network integrated temporal difference iteration, while the parameterized values of continuous actions are learned from deterministic policy gradients. The constraints are addressed via proximal projection operators at the policy gradient updates. Experiments for a manufacturing system with an onsite microgrid with renewable sources have been implemented to identify optimal control actions for both the manufacturing system and microgrid components towards cost optimality. The experimental results show the effectiveness of combining TD control and policy gradient methodologies in addressing the "curse of dimensionality" in dynamic decision-making with high-dimensional and complicated state and action spaces.

Index Terms—Microgrid, Manufacturing System, Reinforcement Learning, Markov Decision Process, On-Policy Temporal Difference Learning, Deterministic Policy Gradient

I. INTRODUCTION

A microgrid is a localized autonomous energy system that consists of distributed energy sources and loads, which can operate either separated from, or connected to, external utility

Manuscript received xx, 2020; revised xxx. (Corresponding author: Zeyi Sun.)

W. Hu is with the Department of Mathematics and Statistics, Missouri University of Science and Technology (formerly University of Missouri, Rolla), Rolla, MO, USA. Email: [email protected].

Z. Sun is with Mininglamp Academy of Sciences, Mininglamp Technology, Beijing, 100084, China. Email: [email protected].

J. Yang is with the School of Mathematics and Statistics, Anhui Normal University, Wuhu, Anhui, China. Email: [email protected].

L. Steinmeister is with the Department of Mathematics and Statistics, Missouri University of Science and Technology (formerly University of Missouri, Rolla), Rolla, MO, USA. Email: [email protected].

K. Xu is with Mininglamp Academy of Sciences, Mininglamp Technology, Beijing, 100084, China. Email: [email protected].

power grids [1], [2], [3]. It is considered a reliable solution to satisfy the growing demand for electric power by strengthening the resilience of the grid and mitigating its disturbances [4], [5], [6].

Various studies on microgrids have been conducted for residential houses [7], [8], [9] and critical facilities such as medical centers, financial corporations, military bases, and jails [10], [11]. However, since manufacturing is traditionally not considered a critical facility, applications of microgrids in manufacturing have been less frequently reported.

However, manufacturing activities dominate energy consumption and greenhouse-gas (GHG) emissions in the industrial sector [12], which accounts for one third of the total energy consumption in the U.S. [13]. Furthermore, it is hardly possible to maintain manufacturing operations without electricity supply nowadays; even a very short power outage can have detrimental impacts on manufacturing companies [14], [15], [16], [17].

Thus, research focusing on the optimal design and component sizing of microgrids for manufacturing plants has recently been launched [18], [19], [20]. For example, a Mixed Integer Non-Linear Programming optimization model was proposed for sizing the capacity of an onsite generation system with renewable sources and a battery energy storage system, considering the energy loads from both the manufacturing system and the HVAC system in a typical manufacturing plant [18].

In addition, optimal energy control from the manufacturing side has also been widely investigated [21], [22], [23], [24]. For example, a simulation-based model was proposed to investigate the energy control of manufacturing systems in demand response programs [21]. An analytical model was later proposed to identify an optimal production schedule for typical manufacturing systems in Time-of-Use demand response programs [22].

However, the study of joint energy control and management of both the on-site microgrid generation system and the manufacturing plant simultaneously has not yet been fully launched. The major challenges impeding research in this area can be summarized from two aspects, i.e., modeling and solution. On one hand, the combined system including both manufacturing and microgrid exhibits complex interactions when controls are implemented on both sides. For example, controls on the manufacturing system influence the energy demand that must be met by controlling the operations of the microgrid to achieve an energy flow balance. The


manufacturing system itself is a complex system in which the interrelationships among different machines need to be quantified. The manufacturing throughput should not be sacrificed when energy control for the manufacturing system is implemented. All these factors need to be carefully considered when modeling the decision-making as a Markov Decision Process (MDP).

On the other hand, it can be expected that the space of states and actions in the MDP model will be very large, which makes most existing reinforcement learning algorithms and strategies for solving MDPs less effective. An initial study by the authors has shown that a traditional algorithm, e.g., vanilla Q-learning integrated with a neural network, can only work for a small-sized model and cannot sufficiently address a model with a large space size [25].

Therefore, there is an urgent need to extend the research on microgrid technology from the traditional residential sector, commercial sector, critical facilities, etc., to the manufacturing end-use customer, specifically considering the joint energy control of both the energy supply from the microgrid and the energy load from the manufacturing system. In this paper, a joint control model encompassing both microgrids and manufacturing systems is established using MDP. A novel reinforcement learning algorithm is proposed for solving the MDP; it leverages both on-policy temporal difference control (TD-control) for discrete decision actions and deterministic policy gradient (DPG) algorithms for continuous actions, together with function approximation of the action-value function via a neural network. Experiments based on a manufacturing system with an onsite microgrid with renewable sources are implemented under real parameters to identify optimal control actions for both the manufacturing system and microgrid components towards cost optimality. It is empirically validated that the optimal policies found by our reinforcement learning algorithm are more efficient in production and incur less cost when compared to randomly sampled policies and a routing operation policy.

The remainder of the paper is organized as follows. Section II introduces the dynamic decision-making model using MDP. Section III introduces our novel neural-network integrated reinforcement learning algorithm that leverages both on-policy temporal difference control (TD-control) and deterministic policy gradient (DPG) algorithms. Section IV implements experiments on numerical case studies. Section V concludes the paper and discusses future work.

II. MARKOV DECISION PROCESS (MDP) FOR JOINT CONTROL OF MANUFACTURING AND ONSITE MICROGRID SYSTEMS

A. Formulate Joint Energy Control Problem Using Markov Decision Process (MDP)

A Markov Decision Process (MDP) model is proposed to model the decision-making of the joint control of both the on-site microgrid generation system and the manufacturing system. The microgrid system used is a typical setup consisting of a gas turbine generator, a battery bank, as well as solar PV modules and wind turbines, as shown in Figure 1. The manufacturing system modeled is a typical serial production line with $N$ machines and $N-1$ buffers, as shown in Figure 2 (where $N = 5$, and work-in-progress parts are stored in buffers). Let $i = 1, 2, ..., N$ be the index of the machines and $i = 1, 2, ..., N-1$ be the index of the buffers.

Figure 1: A microgrid with various components.

Figure 2: A typical manufacturing system with N machines and N − 1 buffers (here N = 5).

The time horizon is discretized into a set of discrete intervals, with the actual time duration of each interval being $\Delta t$. The time variable $t$ denotes the index of the decision epochs of these discrete intervals, at which the control actions identified from the optimal policy and the given states are implemented. The state, policy, state transition, objective function, and constraints of the proposed MDP are introduced as follows.

System State. Let the system states form a state space $\mathcal{S}$. The system state at decision epoch $t$ is denoted by $\mathbf{S}_t$. It includes the states of the manufacturing system ($S^{\mathrm{mfg}}_t$), the microgrid system ($S^{\mathrm{mic}}_t$), and exogenous environmental features ($S^{\mathrm{env}}_t$), so that $\mathbf{S}_t = (S^{\mathrm{mfg}}_t, S^{\mathrm{mic}}_t, S^{\mathrm{env}}_t)$. $S^{\mathrm{mfg}}_t$ can be denoted by $S^{\mathrm{mfg}}_t = (S^{M_1}_t, S^{M_2}_t, ..., S^{M_N}_t, S^{B_1}_t, S^{B_2}_t, ..., S^{B_{N-1}}_t)$, where $S^{M_i}_t$ ($i = 1, 2, ..., N$) denotes the state of machine $i$ in the manufacturing system at decision epoch $t$, and $S^{B_i}_t$ ($i = 1, 2, ..., N-1$) denotes the state of buffer $i$ in the manufacturing system at decision epoch $t$. Machine states include operational, blockage, starvation, off, and breakdown. Blockage means that the machine itself has not failed but the completed part cannot be delivered to the downstream buffer due to the breakdown of specific downstream machines. Starvation means that the machine itself has not failed but there is no incoming part from the upstream buffer due to the breakdown of specific upstream machines. The set of machine states is thus $\{Opr, Blo, Sta, Off, Brk\}$, where $Opr$, $Blo$, $Sta$, $Off$, and $Brk$ denote the operational, blockage, starvation, off, and breakdown states, respectively. Each state has a corresponding power consumption level, as illustrated in Figure 3. The buffer state is quantified by the number of work-in-process parts stored in each buffer at decision epoch $t$.

$S^{\mathrm{mic}}_t$ can be denoted by $S^{\mathrm{mic}}_t = (g^s_t, g^w_t, g^g_t, SOC_t)$, where $g^s_t$, $g^w_t$, and $g^g_t$ denote the working status of the solar PV, wind turbine, and generator, respectively, of the onsite microgrid generation system at decision epoch $t$ (working = 1, not working = 0), and the non-negative real number $SOC_t$ denotes the state of charge of the battery system at decision epoch $t$.

Figure 3: Operation and energy states.

The exogenous environmental feature state $S^{\mathrm{env}}_t$ can be denoted by $S^{\mathrm{env}}_t = (I_t, v_t)$, where $I_t$ denotes the solar irradiance at decision epoch $t$ and $v_t$ denotes the wind speed at decision epoch $t$. The exogenous features affect the system dynamics and the cost function but cannot be influenced by the control actions; they are time and weather dependent. The model formulation assumes the availability of a deterministic forecast of the exogenous state information. The states $(I_t, v_t)$ are taken from one year's data (assumed to be 360 days and 24 hours/day, so in total 8640 hours).
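To make this layout concrete, here is a minimal Python sketch of the composite state $\mathbf{S}_t = (S^{\mathrm{mfg}}_t, S^{\mathrm{mic}}_t, S^{\mathrm{env}}_t)$; the class and field names are ours for illustration and are not taken from the project's released code.

```python
from typing import NamedTuple, Tuple

class SystemState(NamedTuple):
    """Composite state S_t = (S_mfg, S_mic, S_env); names are illustrative."""
    machines: Tuple[str, ...]   # each in {"Opr", "Blo", "Sta", "Off", "Brk"}
    buffers: Tuple[int, ...]    # N - 1 work-in-process counts
    g_solar: int                # solar PV working status (0/1)
    g_wind: int                 # wind turbine working status (0/1)
    g_gen: int                  # generator working status (0/1)
    soc: float                  # battery state of charge SOC_t
    irradiance: float           # solar irradiance I_t
    wind_speed: float           # wind speed v_t

# Example: a five-machine line at an initial condition.
s0 = SystemState(("Opr",) * 5, (100,) * 4, 1, 1, 0, 0.5, 850.0, 6.2)
```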

Control Actions and Policy. All admissible actions constitute an action space $\mathcal{A}$. Let $\pi$ be the policy that maps states $\mathcal{S}$ to actions $\mathcal{A}$. The control actions adopted at decision epoch $t$ are denoted by $\mathbf{A}_t$, which includes the control actions for the manufacturing system ($A^{\mathrm{mfg}}_t$) and the microgrid system ($A^{\mathrm{mic}}_t$), i.e., $\mathbf{A}_t = (A^{\mathrm{mfg}}_t, A^{\mathrm{mic}}_t)$. $A^{\mathrm{mfg}}_t$ can be denoted by $A^{\mathrm{mfg}}_t = (a^1_t, a^2_t, ..., a^N_t)$, where $a^i_t$ ($i = 1, 2, \ldots, N$) is the control action for machine $i$ at decision epoch $t$. The actions include the K-action, W-action, and H-action. The K-action keeps the original machine state and can be applied to machines in the $Opr$, $Blo$, $Sta$, $Off$, and $Brk$ states (note that machine repair is not considered a control action in this paper; repair is assumed to be an automatic reaction, so a K-action applied to a broken-down machine implies that repair will be carried out). The H-action turns off the machine and can only be applied to machines in the $Opr$, $Blo$, and $Sta$ states. The W-action turns on a machine that was previously turned off and can only be applied to machines in the $Off$ state. Note that for model simplicity, we assume the energy consumption and time required for the transitions between different machine states can be ignored.

$A^{\mathrm{mic}}_t$ specifies the actions that adjust the working status of the components of the microgrid as well as the corresponding energy flow and allocation in the joint system. It can be denoted by
$$A^{\mathrm{mic}}_t = (a^s_t, a^w_t, a^g_t, s^m_t, s^b_t, s^{sb}_t, w^m_t, w^b_t, w^{sb}_t, g^m_t, g^b_t, g^{sb}_t, p^m_t, p^b_t, b^m_t).$$
Here $a^s_t$, $a^w_t$, and $a^g_t$ are the actions adjusting the working status (1 = connected to the load, 0 = not connected) of the solar PV, wind turbine, and generator in the microgrid; $s^m_t$, $s^b_t$, and $s^{sb}_t$ denote the solar energy used for supporting manufacturing, charging the battery, and sold back to the grid, respectively. Similarly, the notations $w$ and $g$ with the corresponding superscripts denote the allocation of the energy generated by the wind turbine and the generator, respectively. $p^m_t$ and $p^b_t$ denote the use of the energy purchased from the grid for supporting manufacturing and for charging the battery, respectively. Note that the energy purchased from the grid is not considered for sell-back. Finally, $b^m_t$ denotes the energy discharged by the battery for supporting manufacturing; it is given by a binary variable $\delta^{bm}_t$, so that $b^m_t = b \cdot \delta^{bm}_t \cdot \Delta t$ for some discharging rate $b > 0$. Note that the energy discharged by the battery is not considered for sell-back either.

State Transition. Let $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ be the transition probability function, so that $P(\mathbf{S}, \mathbf{A}, \mathbf{S}') \equiv \Pr(\mathbf{S}'|\mathbf{S}, \mathbf{A})$ is the probability of transitioning to state $\mathbf{S}'$ given that the previous state was $\mathbf{S}$ and action $\mathbf{A}$ was taken at state $\mathbf{S}$. The state transition given state $\mathbf{S}$ and adopted action $\mathbf{A}$ at decision epoch $t$ is partially deterministic (that is, for some states $\mathbf{S}$, $\mathbf{S}'$ and actions $\mathbf{A}$ we have $P(\mathbf{S}, \mathbf{A}, \mathbf{S}') = 0$ or $1$) and partially stochastic (that is, for some states $\mathbf{S}$, $\mathbf{S}'$ and actions $\mathbf{A}$ we have $0 < P(\mathbf{S}, \mathbf{A}, \mathbf{S}') < 1$), although they can all be encoded in the transition probability function $P$. It is assumed that the state transition happens at the beginning of each interval, when the decision is made.

For the manufacturing system, the buffer state at decision epoch $t+1$ is obtained by (II.1) from the states and control actions at decision epoch $t$ of the upstream and downstream machines:
$$S^{B_i}_{t+1} = S^{B_i}_t + I(S^{M_i}_t, a^i_t) - I(S^{M_{i+1}}_t, a^{i+1}_t), \quad 0 \le S^{B_i}_t \le N_i. \tag{II.1}$$
Here $N_i$ is the capacity of buffer $i$, and $I(S^{M_i}_t, a^i_t)$ is an indicator function defined by (II.2):
$$I(S^{M_i}_t, a^i_t) = \begin{cases} 1, & \text{when } S^{M_i}_t = Opr \text{ and } a^i_t = K; \\ 0, & \text{when } S^{M_i}_t \neq Opr \text{ or } a^i_t = H. \end{cases} \tag{II.2}$$
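As an illustration, the buffer update (II.1)-(II.2) can be written as a one-step function; the clamping to $[0, N_i]$ enforces the capacity bound stated with (II.1), and all names are ours.

```python
def next_buffer_levels(buffers, machine_states, actions, capacities):
    """Buffer transition (II.1): a buffer gains a part when its upstream machine
    runs and loses one when its downstream machine runs; the result is clamped
    to [0, N_i] to respect the capacity bound stated with (II.1)."""
    def produces(state, action):
        # The indicator I(S^Mi_t, a^i_t) of (II.2)
        return 1 if state == "Opr" and action == "K" else 0

    return [
        min(max(level + produces(machine_states[i], actions[i])
                      - produces(machine_states[i + 1], actions[i + 1]), 0), cap)
        for i, (level, cap) in enumerate(zip(buffers, capacities))
    ]
```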

Referring to the literature on statistical methods for machine reliability [26], we assume that $L_i$, the random lifetime of machine $i$, follows a Weibull distribution with specific shape and scale parameters. The probabilities that machine $i$ goes into the breakdown or non-breakdown state at the next decision epoch $t+1$, given that it is not in the breakdown state at the current decision epoch $t$, are described by (II.3) and (II.4), respectively:
$$\Pr(S^{M_i}_{t+1} = Brk \mid S^{M_i}_t \neq Brk, S^{M_i}_t \neq Off) = \Pr(L_i < t + \Delta t), \tag{II.3}$$
$$\Pr(S^{M_i}_{t+1} \neq Brk \mid S^{M_i}_t \neq Brk, S^{M_i}_t \neq Off) = \Pr(L_i \ge t + \Delta t). \tag{II.4}$$
Whether the machine is in the $Off$ state at the next decision epoch is determined by (II.5):
$$S^{M_i}_{t+1} = Off \quad \text{if } (S^{M_i}_t = Off \text{ and } a^i_t = K) \text{ or } (S^{M_i}_t \neq Off \text{ and } a^i_t = H). \tag{II.5}$$
In addition, we assume that $D_i$, the random repair time of machine $i$, follows an exponential distribution [27]. The probabilities that machine $i$ completes or does not complete the repair at the next decision epoch $t+1$, given that it is under repair at the current decision epoch $t$, are described by (II.6) and (II.7), respectively:
$$\Pr(S^{M_i}_{t+1} \neq Brk \mid S^{M_i}_t = Brk) = \Pr(D_i < t + \Delta t), \tag{II.6}$$
$$\Pr(S^{M_i}_{t+1} = Brk \mid S^{M_i}_t = Brk) = \Pr(D_i \ge t + \Delta t). \tag{II.7}$$
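A minimal sketch of the one-step breakdown/repair sampling implied by (II.3)-(II.7), using the Weibull lifetime and exponential repair-time distributions as written above (names ours):

```python
import math
import random

def weibull_cdf(t, scale, shape):
    """Pr(L_i < t) for the Weibull machine lifetime in (II.3)-(II.4)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def exponential_cdf(t, mean):
    """Pr(D_i < t) for the exponential repair time in (II.6)-(II.7)."""
    return 1.0 - math.exp(-t / mean)

def sample_reliability_transition(state, t, dt, scale, shape, mttr):
    """One-step breakdown/repair sample following (II.3)-(II.7)."""
    if state == "Brk":
        # (II.6)-(II.7): repair completes with probability Pr(D_i < t + dt)
        return "Opr" if random.random() < exponential_cdf(t + dt, mttr) else "Brk"
    if state == "Off":
        return "Off"   # an Off machine neither fails nor gets repaired here
    # (II.3)-(II.4): breakdown occurs with probability Pr(L_i < t + dt)
    return "Brk" if random.random() < weibull_cdf(t + dt, scale, shape) else state
```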


Thus, the probabilities that machine $i$ is in the $Sta$ and $Blo$ states are described by (II.8) and (II.9), respectively:
$$\begin{aligned} \Pr(S^{M_i}_{t+1} = Sta) = {} & \Pr(S^{M_i}_{t+1} \neq Brk \text{ or } Off) \cdot \Pr(S^{B_{i-1}}_{t+1} = 0) \cdot \Pr(S^{M_{i-1}}_{t+1} = Brk) \\ & + \Pr(S^{M_i}_{t+1} \neq Brk \text{ or } Off) \cdot \Pr(S^{B_{i-1}}_{t+1} = 0) \cdot \Pr(S^{M_{i-1}}_{t+1} = Sta) \\ & + \Pr(S^{M_i}_{t+1} \neq Brk \text{ or } Off) \cdot \Pr(S^{B_{i-1}}_{t+1} = 0) \cdot \Pr(S^{M_{i-1}}_{t+1} = Off), \end{aligned} \tag{II.8}$$
$$\begin{aligned} \Pr(S^{M_i}_{t+1} = Blo) = {} & \Pr(S^{M_i}_{t+1} \neq Brk \text{ or } Off) \cdot \Pr(S^{B_i}_{t+1} = N_i) \cdot \Pr(S^{M_{i+1}}_{t+1} = Brk) \\ & + \Pr(S^{M_i}_{t+1} \neq Brk \text{ or } Off) \cdot \Pr(S^{B_i}_{t+1} = N_i) \cdot \Pr(S^{M_{i+1}}_{t+1} = Blo) \\ & + \Pr(S^{M_i}_{t+1} \neq Brk \text{ or } Off) \cdot \Pr(S^{B_i}_{t+1} = N_i) \cdot \Pr(S^{M_{i+1}}_{t+1} = Off). \end{aligned} \tag{II.9}$$
The probability that machine $i$ is in the operational state can thus be calculated by (II.10):
$$\Pr(S^{M_i}_{t+1} = Opr) = \Pr(S^{M_i}_{t+1} \neq Brk) - \Pr(S^{M_i}_{t+1} = Sta) - \Pr(S^{M_i}_{t+1} = Blo). \tag{II.10}$$
Therefore, the probability of the system operation state transition between the current decision epoch and the next can be calculated using (II.1)-(II.10) when $A^{\mathrm{mfg}}_t$ is adopted based on a given $S^{\mathrm{mfg}}_t$.

Therefore, the probability of system operation state transi-tion between the current decision epoch and the next decisionepoch can be calculated by using (II.1)-(II.10) when Amfgt isadopted based on a given Smfgt .

For the microgrid system, the state transition of the solar PV is determined by the action adopted, while the state transition of the wind turbine is determined by the action adopted and the variation of the wind speed. They are formulated by (II.11) and (II.12), respectively:
$$g^s_{t+1} = \begin{cases} 1, & \text{if } a^s_{t+1} = 1, \\ 0, & \text{if } a^s_{t+1} = 0, \end{cases} \tag{II.11}$$
$$g^w_{t+1} = \begin{cases} 1, & \text{if } a^w_{t+1} = 1 \text{ and } v_{ci} \le v_{t+1} \le v_{co}, \\ 0, & \text{if } a^w_{t+1} = 0 \text{ or } v_{t+1} > v_{co} \text{ or } v_{t+1} < v_{ci}, \end{cases} \tag{II.12}$$
where $v_{ci}$ and $v_{co}$ are the cut-in and cut-off wind speeds (m/s), respectively.
The state transition of the generator is determined by the control action adopted, as formulated by (II.13):
$$g^g_{t+1} = \begin{cases} 1, & \text{if } a^g_{t+1} = 1, \\ 0, & \text{if } a^g_{t+1} = 0. \end{cases} \tag{II.13}$$
The state transition of the battery (i.e., the SOC) is determined by the charging and discharging that happen between $t$ and $t+1$ as well as the original SOC, as formulated by (II.14):
$$SOC_{t+1} = SOC_t + (s^b_t + w^b_t + g^b_t + p^b_t)\eta - b^m_t/\eta, \tag{II.14}$$
where $\eta$ is the charging/discharging efficiency.
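The microgrid transitions (II.11)-(II.14) are simple enough to state directly in code. A minimal sketch, with argument names standing in for the paper's symbols:

```python
def microgrid_transition(a_s, a_w, a_g, v_next, soc, s_b, w_b, g_b, p_b, b_m,
                         eta, v_ci, v_co):
    """State transitions (II.11)-(II.14) of the microgrid components.
    All argument names are illustrative stand-ins for the paper's symbols."""
    g_solar = 1 if a_s == 1 else 0                                   # (II.11)
    g_wind = 1 if a_w == 1 and v_ci <= v_next <= v_co else 0         # (II.12)
    g_gen = 1 if a_g == 1 else 0                                     # (II.13)
    soc_next = soc + (s_b + w_b + g_b + p_b) * eta - b_m / eta       # (II.14)
    return g_solar, g_wind, g_gen, soc_next
```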

Objective Function. The objective is to identify an optimal policy that, from the given state, minimizes the cost incurred from time $t$ to the end of the decision horizon. The overall cost of moving from state $\mathbf{S}$ to state $\mathbf{S}'$ under action $\mathbf{A}$ is denoted $E(\mathbf{S}', \mathbf{A}, \mathbf{S}) \equiv E(\mathbf{S}'|\mathbf{S}, \mathbf{A})$. We will specify below that it equals the energy consumption cost plus the microgrid operational cost, minus the production throughput reward and the sell-back reward. At decision epoch $t$, a transition from state $\mathbf{S}_t$ to state $\mathbf{S}_{t+1}$ under action $\mathbf{A}_t$ results in an incurred cost $E(\mathbf{S}_{t+1}, \mathbf{A}_t, \mathbf{S}_t)$. The total incurred cost from time 0 to the end of the planning horizon, starting from state $\mathbf{S}$ under policy $\pi$, is given by
$$C(\mathbf{S}, \pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t E(\mathbf{S}_{t+1}, \pi(\mathbf{S}_t), \mathbf{S}_t) \,\Big|\, \mathbf{S}_0 = \mathbf{S}\right]. \tag{II.15}$$
Here $\gamma \in [0, 1)$ is the discount factor. The objective is to identify an optimal policy $\pi^* = \arg\min_{\pi \in \Pi} C(\mathbf{S}, \pi)$ that guides the decision maker to choose appropriate actions, based on the given system state, to minimize the total incurred cost $C(\mathbf{S}, \pi)$ in (II.15).

For our model, we have in particular $E(\mathbf{S}', \mathbf{A}, \mathbf{S}) = E(\mathbf{S}, \mathbf{A})$, the average total cost when action $\mathbf{A}$ is taken at state $\mathbf{S}$, which can be calculated by
$$E(\mathbf{S}, \mathbf{A}) = TF(\mathbf{S}, \mathbf{A}) + MC(\mathbf{S}, \mathbf{A}) - TP(\mathbf{S}, \mathbf{A}) - SB(\mathbf{S}, \mathbf{A}), \tag{II.16}$$
where $TF(\mathbf{S}, \mathbf{A})$ is the cost of the energy purchased from the grid, $MC(\mathbf{S}, \mathbf{A})$ is the operational cost of the onsite generation system, $TP(\mathbf{S}, \mathbf{A})$ is the reward for the production throughput of the manufacturing system, and $SB(\mathbf{S}, \mathbf{A})$ is the sell-back benefit. $TF(\mathbf{S}, \mathbf{A})$ can be calculated by
$$TF(\mathbf{S}, \mathbf{A}) = p_t \cdot r_{ct}, \tag{II.17}$$
where $p_t$ is the energy purchased from the grid at decision epoch $t$ and $r_{ct}$ is the rate of the energy consumption charge. $p_t$ can be calculated by
$$p_t = E^{\mathrm{mfg}}_t - (s^m_t + w^m_t + g^m_t + b^m_t), \tag{II.18}$$
where $E^{\mathrm{mfg}}_t$ is the total energy consumed by the manufacturing system at decision epoch $t$, determined by
$$E^{\mathrm{mfg}}_t = \sum_{i=1}^{N} PC^i_t \cdot \Delta t, \tag{II.19}$$
where $PC^i_t$ is the amount of power drawn by machine $i$ from $t$ to $t+1$, calculated by
$$PC^i_t = \begin{cases} 0, & \text{if } S^{M_i}_t = Brk \text{ or } S^{M_i}_t = Off, \\ PC^{Opr}_i, & \text{if } S^{M_i}_t = Opr, \\ PC^{Idl}_i, & \text{if } S^{M_i}_t = Sta \text{ or } S^{M_i}_t = Blo, \end{cases} \tag{II.20}$$
where $PC^{Opr}_i$ and $PC^{Idl}_i$ are the power levels of machine $i$ in the $Opr$ and $Sta$/$Blo$ states, respectively.
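A minimal sketch of the grid-purchase cost computation (II.17)-(II.20), assuming the machine power levels are passed in as lists; all names are ours:

```python
def grid_purchase_cost(machine_states, s_m, w_m, g_m, b_m,
                       pc_opr, pc_idle, dt, r_ct):
    """TF(S,A) in (II.17): price the grid purchase p_t of (II.18) at rate r_ct,
    with the manufacturing load E_mfg from (II.19)-(II.20)."""
    e_mfg = 0.0
    for state, p_run, p_idle in zip(machine_states, pc_opr, pc_idle):
        if state == "Opr":
            e_mfg += p_run * dt            # operating power level PC_Opr
        elif state in ("Sta", "Blo"):
            e_mfg += p_idle * dt           # idle power level PC_Idl
        # Brk and Off machines draw no power, per (II.20)
    p_t = e_mfg - (s_m + w_m + g_m + b_m)  # (II.18)
    return p_t * r_ct                      # (II.17)
```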

$MC(\mathbf{S}, \mathbf{A})$ can be calculated by
$$MC(\mathbf{S}, \mathbf{A}) = e^s_t \cdot r^s_{omc} + e^w_t \cdot r^w_{omc} + e^g_t \cdot r^g_{omc} + \frac{(b^m_t + s^b_t + w^b_t + g^b_t) \cdot \Delta t}{2e(SOC_{max} - SOC_{min})} \cdot r^b_{omc}, \tag{II.21}$$
where $e^s_t$, $e^w_t$, and $e^g_t$ are the energies generated at decision epoch $t$ by the onsite solar PV, wind turbine, and generator, respectively, calculated from the states and actions as specified below. $r^s_{omc}$, $r^w_{omc}$, and $r^g_{omc}$ are the unit operational and maintenance costs of generating power from the solar PV, wind turbine, and generator, respectively. $r^b_{omc}$ is the operational and maintenance cost of the battery storage system per unit charging/discharging cycle. $e$ is the capacity of the battery storage system. $SOC_{max}$ and $SOC_{min}$ are the maximum and minimum states of charge of the battery storage system, respectively. $e^s_t$ can be calculated according to [25]:
$$e^s_t = \begin{cases} 0, & \text{if } g^s_t = 0, \\ I_t \cdot a \cdot \delta / 1000, & \text{if } g^s_t = 1, \end{cases} \tag{II.22}$$

where $I_t$ is the solar irradiance of a certain location (W/m²) at decision epoch $t$, $a$ is the area of the solar PV system, and $\delta$ is the efficiency of the system. $e^w_t$ can be calculated according to [18]:
$$e^w_t = \begin{cases} 0, & \text{if } g^w_t = 0 \text{ or } v_t < v_{ci} \text{ or } v_t > v_{co}, \\ N_w \cdot RP_w \cdot \Delta t, & \text{if } g^w_t = 1 \text{ and } v_r \le v_t < v_{co}, \\ N_w \cdot RP_w \cdot \Delta t \cdot \dfrac{v_t - v_{ci}}{v_r - v_{ci}}, & \text{if } g^w_t = 1 \text{ and } v_{ci} \le v_t < v_r, \end{cases} \tag{II.23}$$
where $v_t$ is the wind speed (m/s) at decision epoch $t$, $v_r$ is the rated wind speed (m/s), $N_w$ is the number of wind turbines in the onsite generation system, and $RP_w$ is the rated power of a wind turbine (kW), determined by
$$RP_w = \frac{1}{2} \cdot \rho \cdot \pi \cdot r^2 \cdot v^3_{avg} \cdot \theta \cdot \eta_t \cdot \eta_g / 1000, \tag{II.24}$$
where $\rho$ is the density of air, $v_{avg}$ is the average wind speed, $\theta$ is the power coefficient, $r$ is the radius of the wind turbine blade, $\eta_t$ is its gearbox transmission efficiency, and $\eta_g$ is the electrical generator efficiency.
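For concreteness, (II.23)-(II.24) translate into the following short sketch (names ours; note the $\Delta t$ factor on the ramp branch, matching (II.23) as written above):

```python
import math

def wind_energy(g_w, v_t, dt, n_w, rp_w, v_ci, v_r, v_co):
    """e^w_t per (II.23): zero outside the cut-in/cut-off band, rated output
    above the rated speed, linear ramp between cut-in and rated speed."""
    if g_w == 0 or v_t < v_ci or v_t > v_co:
        return 0.0
    if v_t >= v_r:
        return n_w * rp_w * dt
    return n_w * rp_w * dt * (v_t - v_ci) / (v_r - v_ci)

def rated_power_kw(rho, r, v_avg, theta, eta_t, eta_g):
    """RP_w per (II.24); the /1000 converts W to kW."""
    return 0.5 * rho * math.pi * r ** 2 * v_avg ** 3 * theta * eta_t * eta_g / 1000.0
```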

$e^g_t$ can be calculated by
$$e^g_t = \begin{cases} 0, & \text{if } g^g_t = 0, \\ n_g \cdot G_p \cdot \Delta t, & \text{if } g^g_t = 1, \end{cases} \tag{II.25}$$
where $n_g$ is the number of generators and $G_p$ is the rated output power of the generator (kW). $TP(\mathbf{S}, \mathbf{A})$ can be determined by
$$TP(\mathbf{S}, \mathbf{A}) = q_t \cdot r_p, \tag{II.26}$$
where $q_t$ is the production count at decision epoch $t$ (written $q_t$ to distinguish it from the purchased energy $p_t$ in (II.18)) and $r_p$ is the unit reward for each unit of production. $q_t$ can be calculated by
$$q_t = \begin{cases} 1, & \text{if } S^{M_N}_t = Opr \text{ and } a^N_t = K, \\ 0, & \text{if } S^{M_N}_t \neq Opr \text{ or } a^N_t = H. \end{cases} \tag{II.27}$$

Note that in this paper, the concern of reaching the target production throughput is represented as a monetary reward and integrated into the objective function. This strategy circumvents the challenge of modeling throughput as a hard constraint in the MDP. Throughput modeling and quantification for typical manufacturing systems with N machines and N − 1 buffers, as used in this paper, remain major research challenges in the field of production system engineering when the blockage and starvation machine states are considered. Very limited research progress has been made (see [28], [29]); moreover, these works cannot address scenarios in which dynamic control agents are involved.

Similarly, the sell-back reward $SB(\mathbf{S}, \mathbf{A})$ can be calculated by
$$SB(\mathbf{S}, \mathbf{A}) = s_t \cdot r_{sb}, \tag{II.28}$$
where $s_t = s^{sb}_t + w^{sb}_t + g^{sb}_t$ is the energy sold back to the grid at decision epoch $t$ and $r_{sb}$ is the unit reward for sold-back energy.

B. Parameterization of the Action Space and Constraints

Our model imposes the following constraints on the action space $\mathcal{A}$, described below.

The battery state of charge must be maintained within a given range, which can be formulated by
$$SOC_{min} \le SOC_t + (s^b_t + w^b_t + g^b_t + p^b_t)\eta - \frac{b^m_t}{\eta} \le SOC_{max}. \tag{II.29}$$
Notice that (II.29) indicates that the battery charging/discharging actions on the microgrid are controlled by the current SOC state.

The actions that can be applied to the machines are restricted by the current machine states: the K-action can be applied to machines in the $Opr$, $Blo$, $Sta$, $Off$, and $Brk$ states; the H-action can only be applied to machines in the $Opr$, $Blo$, and $Sta$ states; the W-action can only be applied to machines in the $Off$ state, i.e.,
$$a^i_t \text{ can be } \begin{cases} K, & \text{if } S^{M_i}_t \in \{Opr, Blo, Sta, Off, Brk\}; \\ H, & \text{if } S^{M_i}_t \in \{Opr, Blo, Sta\}; \\ W, & \text{if } S^{M_i}_t = Off. \end{cases} \tag{II.30}$$
The energy flow balance for the energy generated by the solar PV, wind turbine, and generator can be formulated by (II.31), (II.32), and (II.33), respectively:
$$s^m_t + s^b_t + s^{sb}_t = e^s_t, \tag{II.31}$$
$$w^m_t + w^b_t + w^{sb}_t = e^w_t, \tag{II.32}$$
$$g^m_t + g^b_t + g^{sb}_t = e^g_t. \tag{II.33}$$

Notice that according to (II.22), (II.23), and (II.25), the energies $e^s_t$, $e^w_t$, $e^g_t$ depend on the working states of the microgrid $(g^s_t, g^w_t, g^g_t)$, the solar irradiance $I_t$, and the wind speed $v_t$; the constraints (II.31)-(II.33) therefore restrict the actions applied to the microgrid based on the current microgrid state and the environmental features.

The battery cannot be charged and discharged simultaneously. The charge/discharge constraint is represented as follows:
$$(s^b_t + w^b_t + g^b_t + p^b_t) \cdot b^m_t = 0. \tag{II.34}$$
Notice that since $b^m_t = \delta^{bm}_t \cdot b \cdot \Delta t$, we only seek binary choices $\delta^{bm}_t \in \{0, 1\}$ when $s^b_t = w^b_t = g^b_t = p^b_t = 0$.

The energy sold back to the grid and the energy purchased from the grid cannot both be nonzero simultaneously, which can be represented by
$$(s^{sb}_t + w^{sb}_t + g^{sb}_t)(p^m_t + p^b_t) = 0. \tag{II.35}$$
If the constraint (II.35) is satisfied with $s^{sb}_t = w^{sb}_t = g^{sb}_t = 0$, so that we allow $p^m_t + p^b_t \neq 0$, then by the supply-demand


balance principle, the energy purchased from the grid should equal the energy consumed from the grid, so we have
$$p^m_t + p^b_t = p_t \mathbf{1}_{\{p_t > 0\}}, \tag{II.36}$$
where $p_t$ is given by (II.18).

To simplify the model, we further assume that the energy purchased from the grid can be used either only for supporting manufacturing or only for charging the battery, but not both simultaneously, i.e.,
$$p^m_t \cdot p^b_t = 0. \tag{II.37}$$
Due to (II.37) and (II.35), if $s^{sb}_t = w^{sb}_t = g^{sb}_t = 0$, we can introduce a binary variable $\delta^{pb}_t \in \{0, 1\}$ (0 means the purchased energy is not used for battery charging, 1 means it is), so that $p^m_t = (1 - \delta^{pb}_t) p_t \mathbf{1}_{\{p_t > 0\}}$ and $p^b_t = \delta^{pb}_t p_t \mathbf{1}_{\{p_t > 0\}}$; otherwise, that is, if any of $s^{sb}_t$, $w^{sb}_t$, or $g^{sb}_t$ is nonzero, we have $p^m_t = p^b_t = 0$.

To facilitate the design of policy-gradient related algorithms for training, we further parameterize $(s^m_t, s^b_t, s^{sb}_t, w^m_t, w^b_t, w^{sb}_t, g^m_t, g^b_t, g^{sb}_t)$ by introducing proportionality parameters
$$\theta = (\lambda^m_s, \lambda^b_s, \lambda^m_w, \lambda^b_w, \lambda^m_g, \lambda^b_g) \tag{II.38}$$
and the representation
$$\begin{aligned} s^m_t &= e^s_t \cdot \lambda^m_s, & s^b_t &= e^s_t \cdot \lambda^b_s, & s^{sb}_t &= e^s_t \cdot (1 - \lambda^m_s - \lambda^b_s), \\ w^m_t &= e^w_t \cdot \lambda^m_w, & w^b_t &= e^w_t \cdot \lambda^b_w, & w^{sb}_t &= e^w_t \cdot (1 - \lambda^m_w - \lambda^b_w), \\ g^m_t &= e^g_t \cdot \lambda^m_g, & g^b_t &= e^g_t \cdot \lambda^b_g, & g^{sb}_t &= e^g_t \cdot (1 - \lambda^m_g - \lambda^b_g). \end{aligned} \tag{II.39}$$
These representations further simplify the constraints (II.31)-(II.33) into the following constraints:
$$\lambda^m_s \ge 0, \quad \lambda^b_s \ge 0, \quad 0 \le \lambda^m_s + \lambda^b_s \le 1, \tag{II.40}$$
$$\lambda^m_w \ge 0, \quad \lambda^b_w \ge 0, \quad 0 \le \lambda^m_w + \lambda^b_w \le 1, \tag{II.41}$$
$$\lambda^m_g \ge 0, \quad \lambda^b_g \ge 0, \quad 0 \le \lambda^m_g + \lambda^b_g \le 1. \tag{II.42}$$
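The representation (II.39) is a simple proportional split; a sketch:

```python
def allocate_energy(theta, e_s, e_w, e_g):
    """The representation (II.39): split each source's energy into shares for
    manufacturing, battery charging, and sell-back using the proportionality
    parameters theta = (lam_m_s, lam_b_s, lam_m_w, lam_b_w, lam_m_g, lam_b_g)."""
    lm_s, lb_s, lm_w, lb_w, lm_g, lb_g = theta
    solar = (e_s * lm_s, e_s * lb_s, e_s * (1.0 - lm_s - lb_s))
    wind = (e_w * lm_w, e_w * lb_w, e_w * (1.0 - lm_w - lb_w))
    gen = (e_g * lm_g, e_g * lb_g, e_g * (1.0 - lm_g - lb_g))
    return solar, wind, gen  # each tuple: (to manufacturing, to battery, sold back)
```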

To further handle the constraints (II.34), (II.35), and (II.37), we introduce the binary (0/1) variables $\delta^b_t = \mathbf{1}_{\{s^b_t + w^b_t + g^b_t > 0\}}$, $\delta^{sb}_t = \mathbf{1}_{\{s^{sb}_t + w^{sb}_t + g^{sb}_t > 0\}}$, and $\delta^p_t = \mathbf{1}_{\{p^m_t + p^b_t > 0\}}$. Then constraints (II.34), (II.35), and (II.37) become the discrete constraint
$$(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) \in \{(1, 1, 1, 0, 0), (1, 1, 0, 0, 0), (1, 0, 0, 1, 0), (0, 1, 1, 0, 0), (0, 1, 0, 0, 1), (0, 0, 0, 1, 1)\}. \tag{II.43}$$
Notice that (II.43) summarizes all discrete constraints on the control parameters of the microgrid. The remaining continuous constraints on the microgrid are only (II.29) and (II.40)-(II.42).
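For reference, the six admissible indicator combinations in (II.43) can be kept as a lookup set; a sketch:

```python
# The admissible tuples (delta_b, delta_p, delta_pb, delta_sb, delta_bm) of (II.43).
ADMISSIBLE_DELTAS = {
    (1, 1, 1, 0, 0), (1, 1, 0, 0, 0), (1, 0, 0, 1, 0),
    (0, 1, 1, 0, 0), (0, 1, 0, 0, 1), (0, 0, 0, 1, 1),
}

def deltas_admissible(deltas: tuple) -> bool:
    """Check a candidate indicator tuple against the discrete constraint (II.43)."""
    return deltas in ADMISSIBLE_DELTAS
```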

Based on (II.43), we can further write the constraints (II.29) and (II.40)-(II.42) as different constraints on $\theta$ (the variable in (II.38)):

(1) $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) = (0, 0, 0, 1, 1)$. The constraints on $\theta$ are given by
$$\lambda^b_s = \lambda^b_w = \lambda^b_g = 0; \quad 0 \le \lambda^m_s, \lambda^m_w, \lambda^m_g < 1; \quad SOC_{min} \le SOC_t - b\Delta t/\eta \le SOC_{max}. \tag{II.44}$$

(2) $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) = (0, 1, 0, 0, 1)$. The constraints on $\theta$ are given by
$$\lambda^b_s = \lambda^b_w = \lambda^b_g = 0; \quad \lambda^m_s = \lambda^m_w = \lambda^m_g = 1; \quad SOC_{min} \le SOC_t - b\Delta t/\eta \le SOC_{max}. \tag{II.45}$$

(3) $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) = (0, 1, 1, 0, 0)$. The constraints on $\theta$ are given by
$$\lambda^b_s = \lambda^b_w = \lambda^b_g = 0; \quad \lambda^m_s = \lambda^m_w = \lambda^m_g = 1; \quad SOC_{min} \le SOC_t + \eta\big(E^{\mathrm{mfg}} - (s^m_t + w^m_t + g^m_t + b\Delta t)\big) \cdot \mathbf{1}_{\{E^{\mathrm{mfg}} - (s^m_t + w^m_t + g^m_t + b\Delta t) > 0\}} \le SOC_{max}. \tag{II.46}$$

(4) $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) = (1, 1, 0, 0, 0)$. The constraints on $\theta$ are given by
$$\lambda^m_s \ge 0, \lambda^b_s > 0, \lambda^m_s + \lambda^b_s = 1; \quad \lambda^m_w \ge 0, \lambda^b_w > 0, \lambda^m_w + \lambda^b_w = 1; \quad \lambda^m_g \ge 0, \lambda^b_g > 0, \lambda^m_g + \lambda^b_g = 1; \quad SOC_{min} \le SOC_t + \eta(s^b_t + w^b_t + g^b_t) \le SOC_{max}. \tag{II.47}$$

(5) $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) = (1, 0, 0, 1, 0)$. The constraints on $\theta$ are given by
$$\lambda^m_s \ge 0, \lambda^b_s > 0, 0 \le \lambda^m_s + \lambda^b_s < 1; \quad \lambda^m_w \ge 0, \lambda^b_w > 0, 0 \le \lambda^m_w + \lambda^b_w < 1; \quad \lambda^m_g \ge 0, \lambda^b_g > 0, 0 \le \lambda^m_g + \lambda^b_g < 1; \quad SOC_{min} \le SOC_t + \eta(s^b_t + w^b_t + g^b_t) \le SOC_{max}. \tag{II.48}$$

(6) $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t) = (1, 1, 1, 0, 0)$. The constraints on $\theta$ are given by
$$\lambda^m_s \ge 0, \lambda^b_s > 0, \lambda^m_s + \lambda^b_s = 1; \quad \lambda^m_w \ge 0, \lambda^b_w > 0, \lambda^m_w + \lambda^b_w = 1; \quad \lambda^m_g \ge 0, \lambda^b_g > 0, \lambda^m_g + \lambda^b_g = 1; \quad SOC_{min} \le SOC_t + \eta\Big[(s^b_t + w^b_t + g^b_t) + \big(E^{\mathrm{mfg}} - (s^m_t + w^m_t + g^m_t + b\Delta t)\big) \cdot \mathbf{1}_{\{E^{\mathrm{mfg}} - (s^m_t + w^m_t + g^m_t + b\Delta t) > 0\}}\Big] \le SOC_{max}. \tag{II.49}$$

All effective constraints for the admissible actions in this problem are (II.30), (II.43), and (II.44)-(II.49).

III. NOVEL NEURAL-NETWORK INTEGRATED REINFORCEMENT LEARNING ALGORITHMS FOR THE MDP MODEL

A. Review of Previous Works and Our Contributions

As far as the authors are aware, there are very few published works that address the joint control of a manufacturing and onsite microgrid system using MDP and reinforcement learning algorithms. A few existing works in this direction are [30], [31], where Deep Q-learning (DQN) algorithms have been applied to the learning of the microgrid system only.

Our Contributions.
(1) We have designed a new model for the joint control of a manufacturing and onsite microgrid system.


(2) Our novel reinforcement learning algorithm integrates deterministic policy gradient (DPG) with on-policy temporal difference (TD) control to treat the co-existence of discrete and continuous states and actions. We also address the constraints via proximal projection operators at the policy gradient updates.

B. Abstract Formulation of the Model and a Deep Reinforcement Learning Algorithm for Solving It

A state $\mathbf{S}_t \in \mathcal{S}$ of the model consists of two parts, $\mathbf{S}_t = (\mathbf{S}^d_t, \mathbf{S}^c_t)$: the discrete part
$$\mathbf{S}^d_t = (S^{M_1}_t, S^{M_2}_t, ..., S^{M_N}_t, S^{B_1}_t, S^{B_2}_t, ..., S^{B_{N-1}}_t, g^s_t, g^w_t, g^g_t, I_t, v_t), \tag{III.1}$$
which consists of the machine, buffer, and microgrid states, as well as the coarse-grained solar irradiance and wind speed $(I_t, v_t)$. Here, to reduce complexity, an approximate coarse-graining scheme is applied to each pair of values $(I_t, v_t)$: they are approximated by the closest integers on a 20×20 grid, thus taking values among 20×20 different states. The continuous part is
$$\mathbf{S}^c_t = (SOC_t), \tag{III.2}$$
which consists of the SOC state.

An action $\mathbf{A}_t$ in the action space of the model consists of three parts, $\mathbf{A}_t = (\mathbf{A}^d_t, \mathbf{A}^c_t, \mathbf{A}^r_t)$: the discrete part
$$\mathbf{A}^d_t = (a^1_t, a^2_t, ..., a^N_t, a^s_t, a^w_t, a^g_t, \delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t), \tag{III.3}$$
which consists of the actions on each of the machines, the connect/disconnect actions of the solar PV, wind turbine, and generator, as well as the indicator variables in (II.43); the continuous part
$$\mathbf{A}^c_t = (s^m_t, s^b_t, s^{sb}_t, w^m_t, w^b_t, w^{sb}_t, g^m_t, g^b_t, g^{sb}_t), \tag{III.4}$$
which consists of the solar, wind, and generator energy used for supporting manufacturing, charging the battery, and sold back to the grid. The continuous part is parameterized by the variable $\theta$ introduced in (II.38), so that $\mathbf{A}^c_t = \mathbf{A}^c(\theta_t, \mathbf{S}_t)$. The remainder part is $\mathbf{A}^r_t = (p^m_t, p^b_t, b^m_t)$, which consists of the energy purchased from the grid for supporting manufacturing and for charging the battery, and the energy discharged by the battery for supporting manufacturing. These can be calculated directly from $\delta^{pb}_t$ and $\delta^{bm}_t$ together with (II.36) and (II.18), and hence from $\mathbf{A}^c_t$.

At a specific state $\mathbf{S}_t \in \mathcal{S}$, the actions that can be taken are restricted by this particular state via the restrictions
$$\mathbf{A}^d_t \in D^d(\mathbf{S}^d_t) \tag{III.5}$$
and
$$\theta \in D^c(\mathbf{S}^d_t, \mathbf{S}^c_t, \mathbf{A}^d_t), \tag{III.6}$$
where $D^d$ is the set of admissible discrete actions $\mathbf{A}^d_t$ that can be taken at the current state, and $D^c$ is the set of admissible parameters $\theta$ for the continuous actions $\mathbf{A}^c$ that can be taken at the current state. According to the discussion in Section II-B, $D^d$ is given by (II.30) and (II.43) and depends only on the discrete part of the current state (in fact, only on the current states of the machines), while $D^c$ is given by one of (II.44)-(II.49), which depend on both the continuous and discrete parts of the current state, as well as on the discrete actions.

From the above abstract formulation, we see the mathematical complexity of the problem: the state and action spaces contain both discrete and continuous parts, and the action constraints are determined by both the discrete and continuous parts of the states, as well as by the discrete part of the actions. To design an effective learning algorithm, we propose to integrate the on-policy TD control (SARSA) for finding the discrete part of the optimal control actions with a proximal projection of the deterministic policy gradient method associated with on-policy actor-critic (see [32]) for finding the continuous part of the optimal control actions. We employ on-policy rather than off-policy methods to deal with constraints that vary with the state; this enables more exploration over the variable constraints. In the use of SARSA and actor-critic, we borrow ideas from Deep Q-Learning (see [33], [34], [35]) and use a neural network as the function approximator of the action-value function $Q(\mathbf{S}, \mathbf{A})$. The proximal algorithm (see [36]) is a popular optimization technique in machine learning for handling constrained optimization problems, and here we combine it with the deterministic policy gradient iterations to approximate the continuous part of the optimal control policies. The proposed algorithm is written in pseudo-code as Algorithm 1.

C. Solution Algorithm for the Original Model

One major challenge in solving the MDP problem using Algorithm 1 is that the set of admissible parameters $\theta \in D^c(\mathbf{S}, \mathbf{A})$ for the continuous actions $\mathbf{A}^c$, determined by (II.44)-(II.49), has an intersection structure. Indeed, from (II.44)-(II.49) one can view $D^c(\mathbf{S}, \mathbf{A})$ as an intersection $D^c(\mathbf{S}, \mathbf{A}) = D^c \cap D^{SOC}(\theta, SOC_t, (\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t))$. Here we set the fixed simplex
$$D^c = \left\{(\lambda^m_s, \lambda^b_s, \lambda^m_w, \lambda^b_w, \lambda^m_g, \lambda^b_g) : \lambda^m_s, \lambda^b_s, \lambda^m_w, \lambda^b_w, \lambda^m_g, \lambda^b_g \ge 0, \; 0 \le \lambda^m_s + \lambda^b_s \le 1, \; 0 \le \lambda^m_w + \lambda^b_w \le 1, \; 0 \le \lambda^m_g + \lambda^b_g \le 1 \right\}. \tag{III.7}$$
The complexity of $D^c(\mathbf{S}, \mathbf{A})$ lies in the other part of the intersection, $D^{SOC}(\theta, SOC_t, (\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t))$, which varies with the choice of $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t)$ and depends on the SOC state. This makes the proximal projection onto $D^c(\mathbf{S}, \mathbf{A})$ in Step 7 of Algorithm 1 nearly impossible to compute in practice.

To fix this issue, we suggest relaxing the constraint $D^c(\mathbf{S}, \mathbf{A})$ by considering only its fixed simplex part $D^c$. Of course, if we only project onto $D^c$ at every proximal projection step in Step 7 of Algorithm 1, we may miss the SOC constraints in (II.44)-(II.49). We then repair this by using the $\theta$ found on the relaxed constraint set $D^c$ to determine the binary variables $(\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t)$ and to update the SOC. If the SOC values we obtain violate the additional constraints in $D^{SOC}(\theta, SOC_t, (\delta^b_t, \delta^p_t, \delta^{pb}_t, \delta^{sb}_t, \delta^{bm}_t))$, we simply clip them to the boundary values $SOC_{max}$ or $SOC_{min}$, so that the SOC constraints are not violated.


Algorithm 1 Training the Abstract Model via Integrating On-Policy TD Control (SARSA) and Proximal Projection of the Deterministic Policy Gradient

1: Input: state space $\mathcal{S}$, action space $\mathcal{A}$, constraints $D^d$, $D^c$; the neural network architecture $Q(\mathbf{S}, \mathbf{A}^d, \mathbf{A}^c(\theta), \mathbf{A}^r; \omega)$; discount factor $0 < \gamma < 1$; learning rates $\eta_\theta, \eta_\omega > 0$;
2: Initialization: initialize the weight vector $\omega_0$ from a given prior distribution; initial actions $\mathbf{A}^d_0, \mathbf{A}^r_0$ and action parameter $\theta_0$, $\mathbf{A}^c_0 = \mathbf{A}^c(\theta_0)$; initial state $\mathbf{S}_0 = (\mathbf{S}^d_0, \mathbf{S}^c_0)$;
3: for $t = 0, 1, 2, ...$ do
4:   Run one step of the MDP from state $\mathbf{S}_t$ under action $\mathbf{A}_t = (\mathbf{A}^d_t, \mathbf{A}^c_t, \mathbf{A}^r_t)$ and obtain a new state $\mathbf{S}_{t+1}$;
5:   Calculate the total cost $E(\mathbf{S}_t, \mathbf{A}_t)$;
6:   Identify
$$\mathbf{A}^d_{t+1} = \arg\min_{\mathbf{A}^d \in D^d(\mathbf{S}_{t+1})} Q(\mathbf{S}_{t+1}, \mathbf{A}^d, \mathbf{A}^c(\theta_t); \omega_t);$$
7:   Based on $\mathbf{A}^d_{t+1}$, update the policy parameter $\theta$ according to the deterministic policy gradient
$$\theta_{t+1} = \mathrm{prox}_{D^c(\mathbf{S}_{t+1}, \mathbf{A}^d_{t+1})}\left[\theta_t - \eta_\theta \nabla_\theta \mathbf{A}^c(\theta_t) \nabla_{\mathbf{A}^c} Q(\mathbf{S}_t, \mathbf{A}^d_t, \mathbf{A}^c_t, \mathbf{A}^r_t; \omega_t)\right];$$
8:   Based on $\mathbf{A}^d_{t+1}$ and $\theta_{t+1}$, obtain $\mathbf{A}^r_{t+1}$, so that $\mathbf{A}_{t+1} = (\mathbf{A}^d_{t+1}, \mathbf{A}^c_{t+1} = \mathbf{A}^c(\theta_{t+1}), \mathbf{A}^r_{t+1})$;
9:   Calculate the on-policy TD error:
$$\delta_t = E(\mathbf{S}_t, \mathbf{A}_t) + \gamma Q(\mathbf{S}_{t+1}, \mathbf{A}_{t+1}; \omega_t) - Q(\mathbf{S}_t, \mathbf{A}_t; \omega_t);$$
10:  Update the weight vector $\omega$ using actor-critic:
$$\omega_{t+1} = \omega_t - \eta_\omega \delta_t \nabla_\omega Q(\mathbf{S}_t, \mathbf{A}_t; \omega_t);$$
11: end for
12: Output: with the optimal $\omega^*$ and $\theta^*$ found, for each given state $\mathbf{S}$, output the approximate optimal policy $(\mathbf{A}^{*,d}, \mathbf{A}^c(\theta^*), \mathbf{A}^{*,r})$, where
$$(\mathbf{A}^{*,d}, \mathbf{A}^{*,r}) = \arg\min_{\mathbf{A}^d, \mathbf{A}^r \ \mathrm{admissible}} Q(\mathbf{S}, \mathbf{A}^d, \mathbf{A}^c(\theta^*), \mathbf{A}^r; \omega^*).$$

In this way, we can find an approximate solution to the optimal one with computationally achievable simulations.
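To make the control flow of Algorithm 1 concrete, here is a compact PyTorch-style sketch of one loop iteration. The Q-network signature, the map `ac` from $\theta$ to $\mathbf{A}^c$, the admissible-set enumerator, and the projection operator are stand-ins for problem-specific components and are not the paper's code; the critic update is realized as a squared-TD-error gradient step, a standard semi-gradient variant of Step 10.

```python
import torch

def algorithm1_step(q_net, opt_omega, s, a_d, theta, a_r, cost, s_next,
                    admissible, ac, project_dc, gamma=0.999, eta_theta=3e-3):
    """One loop iteration (Steps 4-10) of Algorithm 1. `q_net(s, a_d, a_c, a_r)`
    returns a scalar cost-to-go tensor; `ac` must be differentiable in theta."""
    # Step 6: admissible discrete action with the lowest predicted cost-to-go
    a_d_next = min(admissible(s_next),
                   key=lambda a: q_net(s_next, a, ac(theta), a_r).item())
    # Step 7: deterministic policy gradient on theta, then proximal projection
    theta = theta.clone().detach().requires_grad_(True)
    q_net(s, a_d, ac(theta), a_r).backward()
    with torch.no_grad():
        theta_next = project_dc(theta - eta_theta * theta.grad)
    # Steps 9-10: on-policy (SARSA-style) TD target and critic update
    target = cost + gamma * q_net(s_next, a_d_next, ac(theta_next), a_r).detach()
    loss = (target - q_net(s, a_d, ac(theta), a_r)).pow(2).mean()
    opt_omega.zero_grad()
    loss.backward()
    opt_omega.step()
    return a_d_next, theta_next
```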

Under this simplification, the solution algorithm for the original model becomes much simpler. We modify Algorithm 1 accordingly, so that:

(1) In Step 6 of Algorithm 1, we select an action $\mathbf{A}^d_{t+1}$ that takes into account only the constraint (II.30) and minimizes the Q-value (on-policy). The action $\mathbf{A}^d_{t+1}$ does not include the indicator variables $(\delta^b_{t+1}, \delta^p_{t+1}, \delta^{pb}_{t+1}, \delta^{sb}_{t+1}, \delta^{bm}_{t+1})$ in (II.43);

(2) In Step 7 of Algorithm 1, the constraint $D^c(\mathbf{S}_{t+1}, \mathbf{A}^d_{t+1})$ is replaced by the fixed constraint $D^c$ in (III.7). In this case the projection onto $D^c$ can be calculated directly (see Appendix A and the sketch after this list);

Table I: Machine Parameters.

| Machine | Mean time between failures, Scale/Shape | Mean time to repair (min) | Rated power at operation (kW) | Power at idle state (kW) |
|---|---|---|---|---|
| M1 | 111.39 min / 1.5766 | 4.95 | 115.5 | 105 |
| M2 | 51.1 min / 1.6532 | 11.7 | 115.5 | 105 |
| M3 | 110.9 min / 1.7174 | 15.97 | 115.5 | 105 |
| M4 | 239.1 min / 1.421 | 27.28 | 170.5 | 155 |
| M5 | 122.1 min / 1.591 | 18.37 | 132 | 120 |

(3) Once $\theta$ is chosen, we determine whether or not to choose $p^m_{t+1} + p^b_{t+1} \neq 0$, and whether $p^b_{t+1} \neq 0$, according to (II.35), (II.36), and (II.37); we also determine whether or not to choose $b^m_{t+1} \neq 0$ according to (II.34). The precise scheme for choosing $(p^m_{t+1}, p^b_{t+1}, b^m_{t+1})$ can be found in Appendix B;

(4) With all of these ready, Step 8 of Algorithm 1 is modified accordingly.
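Since the relaxed constraint set is the product of three capped 2-simplices, the projection in item (2) decomposes pairwise. A minimal NumPy sketch (the boundary case uses the standard sort-based simplex projection of Duchi et al.; function names are ours):

```python
import numpy as np

def project_capped_pair(y):
    """Euclidean projection of y in R^2 onto {x >= 0, x[0] + x[1] <= 1},
    the per-source constraint in (II.40)-(II.42)."""
    x = np.maximum(y, 0.0)
    if x.sum() <= 1.0:
        return x
    # Boundary case: project onto the simplex {x >= 0, sum(x) = 1}
    # via the sort-based rule of Duchi et al. (2008).
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - tau, 0.0)

def prox_dc(theta):
    """prox onto D^c in (III.7): the three (lambda^m, lambda^b) pairs for
    solar, wind, and generator are projected independently."""
    theta = np.asarray(theta, dtype=float)
    return np.concatenate([project_capped_pair(theta[i:i + 2]) for i in (0, 2, 4)])
```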

IV. CASE STUDY

In this section, numerical case studies are implemented to illustrate the benefits of the proposed modeling and solution strategies. Our source code is open and can be found in the GitHub repository [37] for this project.

We have carried out experiments on the manufacturing-microgrid system using a real-case parameter set. The manufacturing system includes five machines and four buffers, as shown in Figure 2. The parameters of the manufacturing system are taken from [38]. Specifically, the parameters related to the machines and buffers are shown in Table I and Table II, respectively. Note that the time between failures of each machine is modeled as a Weibull-distributed random variable with the respective scale and shape parameters, and the time to repair of each machine is modeled as an exponentially distributed random variable. The unit production reward $r_p$ is set to $10^4$ US dollars per unit.

The parameters of the microgrid used in the experiments are sized based on the manufacturing load according to the methods in [18]. The parameters related to the wind turbine, the battery storage system, and the solar panel and generator are listed in Table III, Table IV, and Table V, respectively. The solar irradiance and wind speed data are collected from Solar Energy Local [39] and the State Climatologist of Illinois [40], respectively.

To prevent computational overflow, in our numerical experiments we scaled all parameters by choosing different units of measurement: distance measured in km ($10^3$ m), time in hours (60 min = 3600 s), speed in km/h, energy in megawatts ($10^6$ W), monetary cost in $10^4$ US dollars, area in km² ($10^6$ m²), and mass in $10^6$ kg. Time periods are measured in hours. The neural network $Q(\mathbf{S}, \mathbf{A}; \omega)$ that we use to approximate the action-value function $Q(\mathbf{S}, \mathbf{A})$ contains two hidden layers with 100 neurons each, with Sigmoid and ReLU activations for layers 1 and 2, respectively. The outputs are then scaled back to the usual units of measurement, such as dollars for cost and kW for energy demand.
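A minimal PyTorch stand-in for this architecture follows; the input dimension of the concatenated state-action encoding is problem-specific and is chosen here only for illustration:

```python
import torch.nn as nn

input_dim = 64  # size of the encoded (S, A) vector; ours, not from the paper

# Two hidden layers of 100 neurons with Sigmoid / ReLU activations and a
# scalar action-value output, as described above.
q_net = nn.Sequential(
    nn.Linear(input_dim, 100), nn.Sigmoid(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)
```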


Table II: Buffer Parameters.

| | B1 | B2 | B3 | B4 |
|---|---|---|---|---|
| Capacity | 1000 | 1000 | 1000 | 1000 |
| Initial level | 100 | 100 | 100 | 100 |

Table III: Wind Turbine Parameters.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| v_ci (m/s), cut-in speed | 3 | η_t, gearbox efficiency | 0.9 |
| v_co (m/s), cut-off speed | 11 | η_g, generator efficiency | 0.9 |
| v_r (m/s), rated speed | 7 | θ, power coefficient | 0.593 |
| ρ (kg/m³), air density | 1.225 | r^w_omc ($/kWh) | 0.08 |
| r (m), blade radius | 25 | N_w (units) | 1 |

In order to validate the effective convergence of our re-inforcement learning algorithm, comparison has been carriedout for a pure Q-learning algorithm for the same microgrid-manufacturing system with a smaller size (2 machines and1 buffer) [25]. After discretization of the continuous statesand actions, the state space has a size of 3.8 × 103 and theaction space has a size of 2.6×104. The action-value function(Q-function) is approximated by a smaller fully-connectedneural network with two hidden layers and Sigmoid activation,where each hidden layer has 32 neurons. Again, we calculatedthe square norms of the differences of the neural-networkweight vectors for each two consecutive algorithm iterations(i.e. ‖ωt+1 − ωt‖22). The discount factor γ = 0.1. The neuralnetwork is trained using Adam [41] with different learningrates: Figure 4-(c) is for learning rate 0.001 and Figure 4-(d)is for learning rate 0.0001. It is seen that in these two cases,even after 104 iterations, the pure Q-learning algorithm cannotconverge due to the immense size of the discretized state-action spaces (a manifestation of the “curse of dimensional-ity”), indicating the effectiveness of our method that combinesdeterministic policy gradient with discrete Monte-Carlo typesearches.

Based on the optimal parameters (ω∗, θ∗) found for theneural-network Q(S,A;ω∗) and the continuous action Ac(θ∗)

Table IV: Battery Storage Parameters.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| e (kWh), capacity | 350 | b (kW), charging rate | 2 |
| SOC_max (%) | 95 | SOC_min (%) | 5 |
| r^b_omc ($/kWh) | 0.9 | η, efficiency | 0.99 |

Table V: Solar Panel and Generator Parameters.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| a (m²), panel area | 1400 | n_g (units), number of generators | 1 |
| δ, efficiency | 0.2 | G_p (kW), generator capacity | 650 |
| r^s_omc ($/kWh) | 0.17 | r^g_omc ($/kWh), operating cost | 0.45 |

Figure 4: Left to right: (a) evolution of weight differences $\|\omega_{t+1} - \omega_t\|^2_2$; (b) evolution of cumulative rewards; (c), (d) comparison with vanilla Q-learning.

(see Algorithm 1), we tested the corresponding microgrid-manufacturing model at a time horizon of 100 time pe-riods, with each time period equals one hour. At eachdecision point with state S, we store all admissible ac-tions Ad and remainder actions Ar in a tree and searchover the tree to identify optimal actions (A∗,d,A∗,r) =arg max

Ad,AradmissibleQ(S,Ad,Ac(θ∗),Ar;ω∗). These optimal

actions are then implemented for the MDP system to jumpto the next state and the incurred cost, throughput and energydemand are calculated. The results are compared with twobaseline scenarios: The first one runs under a random policy,while the second one is a routing policy. For the random policy,the accumulation of total incurred cost E(S,A) at (II.16), totalenergy cost TF (S,A) at (II.17) and total production unitspt in (II.26) for the optimal policy and random policy arecalculated for the system running at a total horizon of 100 timeperiods (each period = 1 hour). The system under randomlychosen policy starts with the same initial conditions as thesystem under optimal policy. The results of 3 experiments areshown in Figure 5, where red solid lines are for the optimalpolicy selected by reinforcement learning and black star linesare for the random policy. It is clearly seen from these resultsthat under the optimal policy the manufacturing system tendsto produce more throughput with less total cost and similar orless total energy cost (energy demand). More precisely, in oneof the experiments, the optimal policy found by reinforcementlearning over a time horizon of 100 time periods has an output(the quantity pt in (II.26)) of 73, while the randomly selectedpolicy only produces 24. At the same time, the total cost forthe optimal policy is −$728293.1002207897 and the total costfor the random policy is −$236214.01230431726. In terms ofenergy cost, optimal policy is also about one time less than therandom policy, with $876.4532621691491 for optimal policyand $1656.0785448269392 for random policy. The two otherexperiments behave very similarly.

For the routing policy, we consider a practical routing strategy adopted by many industrial practitioners, in which the production system and the microgrid are controlled or scheduled separately. The production schedule is generated to minimize total energy consumption without sacrificing the production target. This model is briefly introduced as follows. Let $x_{it}$ be the binary decision variable denoting the production schedule of the manufacturing system; it takes the value one when machine $i$ is scheduled for production in period $t$, and zero otherwise. The objective function can be formulated as
$$\min_{x_{it}} \sum_{t \in T} x_{it} \cdot p_i \cdot \Delta t, \tag{IV.1}$$
where $T$ is the set of all time periods $t$ and $p_i$ is the


rated power of machine $i$, and $\Delta t$ is the time duration of each discretization period. Note that for simplicity, the values of $PC^{Opr}_i$ are used for the rated power without considering the difference between $PC^{Opr}_i$ and $PC^{Idl}_i$.

Two constraints are formulated as follows:
$$\sum_{t \in T} x_{Nt} \cdot PR_N \ge TA, \tag{IV.2}$$
where $x_{Nt}$ is the decision variable for machine $N$ (i.e., the last machine) of the production system, $PR_N$ is the production rate of machine $N$, and $TA$ is the target production count. This constraint requires the production target to be satisfied. Note that it is based on the simplifying assumption that machine breakdowns and the resulting blockage/starvation are not considered.
$$0 \le B_{i(t+1)} = B_{it} + x_{it} \cdot PR_i - x_{(i+1)t} \cdot PR_{i+1} \le C_i, \tag{IV.3}$$
where $C_i$ is the capacity of buffer $i$ and $B_{i(t+1)}$ is the count of work-in-progress parts stored in buffer $i$ at the beginning of period $t+1$. This constraint enforces material flow balance and keeps the work-in-progress parts in each buffer within its respective bounds.
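For illustration, the integer program (IV.1)-(IV.3) can be stated with an off-the-shelf MILP modeler. A sketch using PuLP, with hypothetical data and the buffer recursion (IV.3) expressed cumulatively:

```python
from pulp import LpMinimize, LpProblem, LpVariable, lpSum

# Hypothetical data for a 5-machine line over a 100-hour horizon; the rated
# powers follow Table I, everything else is illustrative.
T, N, dt, TA = 100, 5, 1.0, 73
p = [115.5, 115.5, 115.5, 170.5, 132.0]    # rated powers p_i (kW)
PR = [1, 1, 1, 1, 1]                       # production rates PR_i (units/period)
C, B0 = [1000] * (N - 1), [100] * (N - 1)  # buffer capacities and initial levels

prob = LpProblem("routing_schedule", LpMinimize)
x = [[LpVariable(f"x_{i}_{t}", cat="Binary") for t in range(T)] for i in range(N)]

# (IV.1): minimize the total scheduled energy consumption
prob += lpSum(x[i][t] * p[i] * dt for i in range(N) for t in range(T))
# (IV.2): the last machine must meet the production target TA
prob += lpSum(x[N - 1][t] * PR[N - 1] for t in range(T)) >= TA
# (IV.3): buffer balance, written cumulatively so that 0 <= B_i(t+1) <= C_i
for i in range(N - 1):
    for t in range(T):
        level = B0[i] + lpSum(x[i][u] * PR[i] - x[i + 1][u] * PR[i + 1]
                              for u in range(t + 1))
        prob += level >= 0
        prob += level <= C[i]

prob.solve()
```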

After solving the aforementioned integer program, the production schedule that minimizes energy consumption without sacrificing production is obtained. The utilization of the microgrid then follows the empirical rules below. First, the battery storage system and the generator are typically considered backups for emergency situations in practice, and thus are not used in this routing policy.

Second, if the renewable sources are available at time period $t$, they are first used to satisfy the energy demand of production. If the renewable sources have a higher supply capability than the production demand at period $t$, the surplus is sold back to the grid. If the renewable sources have a lower supply capability than the production demand at period $t$, the demand gap is filled by purchasing electricity from the grid. Wind energy has a higher priority than solar energy, since wind energy cost is typically lower than solar energy cost.
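A hedged sketch of this empirical rule in Python (the function name and kWh units are our own illustrative choices): wind is consumed before solar, renewable surplus is sold back, and any remaining gap is purchased from the grid, while the battery and generator stay idle as stated above.

def dispatch(demand, wind, solar):
    # Returns (wind_used, solar_used, sold_back, purchased) for one period.
    wind_used = min(wind, demand)                # wind has priority over solar
    solar_used = min(solar, demand - wind_used)  # solar covers what remains
    sold_back = (wind - wind_used) + (solar - solar_used)
    purchased = max(demand - wind_used - solar_used, 0.0)
    return wind_used, solar_used, sold_back, purchased

# Example: demand 10 kWh, wind 6 kWh, solar 3 kWh
# dispatch(10.0, 6.0, 3.0) -> (6.0, 3.0, 0.0, 1.0)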

To match the production output of the optimal policy, we set the target output for this routing strategy at time horizon 100, i.e., the target production unit $TA$ in (IV.2), to 73, which is the total production throughput in units achieved by the optimal policy (the quantity $p_t$ in (II.26)) at time horizon 100. We found that the total energy cost under the routing policy found by this integer programming is $3,642.88, about four times the previously reported $876.45 for the optimal policy found by reinforcement learning. Two other experiments were conducted and the results are similar. The evolution of the total energy cost and total production throughput in units as a function of time is shown in Figure 5, where red solid lines correspond to the optimal policy selected by reinforcement learning and blue dashed lines to the routing strategy found by mixed-integer programming. It is clearly seen that reinforcement learning selects a policy that incurs less energy cost.

The overall comparison between the proposed reinforcement learning model, the random policy, and the routing policy is summarized in Table VI.

Table VI: Comparison Among Three Models.

                   Energy Cost ($)   Production Throughput (units)
Routing Strategy       3642.88                    73
Proposed Model          876.45                    73
Random Policy          1656.08                    24

Figure 5: Left to right: averages over 3 experiments comparing (a) total throughput in production units and (b) total energy cost incurred by the optimal, routing, and random policies. Red solid line = optimal policy, blue dashed line = routing strategy via mixed-integer programming, black star line = random policy.

V. CONCLUSION

This paper proposes a joint dynamic control model of microgrids and manufacturing systems using a Markov Decision Process (MDP) to identify an optimal control strategy for both microgrid components and the manufacturing system so that the energy cost for production can be minimized without sacrificing production throughput. A novel reinforcement learning algorithm that leverages both on-policy temporal difference control (TD-control) and deterministic policy gradient (DPG) algorithms is proposed to resolve the joint control of the microgrid and manufacturing system. Experiments for a manufacturing system with an onsite microgrid with renewable sources have been implemented, and the results show the effectiveness of combining TD control and policy gradient methodologies in addressing the “curse of dimensionality” in dynamic decision-making with high dimensional and complicated state and action spaces.

For future work, real-time decision making can be considered for emergency situations, such as natural disasters, that lead to the unavailability of external energy supplies from the grid.

APPENDIX A
THE PROJECTION ONTO THE SIMPLEX $D_c$

In this appendix we calculate directly the projection operator $\mathrm{prox}_{D_c}(\theta)$ for a given $\theta\in\mathbb{R}^6$ onto the simplex

$$D_c = D_c^s \times D_c^w \times D_c^g,$$

where $D_c^s$, $D_c^w$, $D_c^g$ are three simplices, all isomorphic to

$$D = \{(\lambda_1,\lambda_2) : \lambda_1 \geq 0,\ \lambda_2 \geq 0,\ 0 \leq \lambda_1+\lambda_2 \leq 1\}.$$


Let $\theta = (\theta_s, \theta_w, \theta_g)$, where each of $\theta_s, \theta_w, \theta_g \in \mathbb{R}^2$. Then we have

$$\mathrm{prox}_{D_c}(\theta) = \left(\mathrm{prox}_{D_c^s}(\theta_s),\ \mathrm{prox}_{D_c^w}(\theta_w),\ \mathrm{prox}_{D_c^g}(\theta_g)\right).$$

If $\theta_0 \in D$, then $\mathrm{prox}_D(\theta_0) = \theta_0$. For each $\theta_0 = (\theta_0^1, \theta_0^2) \in \mathbb{R}^2 \setminus D$, we can easily calculate $\mathrm{prox}_D(\theta_0)$ according to the following rules:

(I) if $\theta_0^1 < 0$ and $\theta_0^2 < 0$, then $\mathrm{prox}_D((\theta_0^1, \theta_0^2)) = (0, 0)$;

(II) if $0 \leq \theta_0^1 \leq 1$ and $\theta_0^2 < 0$, then $\mathrm{prox}_D((\theta_0^1, \theta_0^2)) = (\theta_0^1, 0)$;

(III) if $\theta_0^1 > 1$ and $\theta_0^1 - \theta_0^2 > 1$, then $\mathrm{prox}_D((\theta_0^1, \theta_0^2)) = (1, 0)$;

(IV) if $\theta_0^1, \theta_0^2 > 0$, $\theta_0^1 + \theta_0^2 \geq 1$ and $-1 \leq \theta_0^1 - \theta_0^2 \leq 1$, then

$$\mathrm{prox}_D((\theta_0^1, \theta_0^2)) = \left(\frac{1 + \theta_0^1 - \theta_0^2}{2},\ \frac{1 - \theta_0^1 + \theta_0^2}{2}\right);$$

(V) if $\theta_0^2 > 1$ and $\theta_0^1 - \theta_0^2 < -1$, then $\mathrm{prox}_D((\theta_0^1, \theta_0^2)) = (0, 1)$;

(VI) if $0 \leq \theta_0^2 \leq 1$ and $\theta_0^1 < 0$, then $\mathrm{prox}_D((\theta_0^1, \theta_0^2)) = (0, \theta_0^2)$.
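For reference, the following is a direct Python transcription of rules (I)–(VI); the function names and the use of numpy are our own illustrative choices, not part of the paper's code base.

import numpy as np

def prox_D(t1, t2):
    # Projection of (t1, t2) onto D = {(l1, l2): l1, l2 >= 0, l1 + l2 <= 1}.
    if t1 >= 0 and t2 >= 0 and t1 + t2 <= 1:
        return (t1, t2)                                   # already in D
    if t1 < 0 and t2 < 0:                                 # rule (I)
        return (0.0, 0.0)
    if 0 <= t1 <= 1 and t2 < 0:                           # rule (II)
        return (t1, 0.0)
    if t1 > 1 and t1 - t2 > 1:                            # rule (III)
        return (1.0, 0.0)
    if t1 > 0 and t2 > 0 and t1 + t2 >= 1 and -1 <= t1 - t2 <= 1:  # rule (IV)
        return ((1 + t1 - t2) / 2, (1 - t1 + t2) / 2)
    if t2 > 1 and t1 - t2 < -1:                           # rule (V)
        return (0.0, 1.0)
    if 0 <= t2 <= 1 and t1 < 0:                           # rule (VI)
        return (0.0, t2)
    raise ValueError("rules (I)-(VI) should be exhaustive outside D")

def prox_Dc(theta):
    # Componentwise projection of theta in R^6 onto D_c = D x D x D.
    return np.concatenate([prox_D(theta[2 * k], theta[2 * k + 1]) for k in range(3)])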

APPENDIX B
THE SCHEME FOR CHOOSING BATTERY CHARGING/DISCHARGING ACTIONS

Once $\theta$ is chosen, we determine whether or not we should choose $p^m_{t+1} + p^b_{t+1} \neq 0$, and whether $p^b_{t+1} \neq 0$, according to (II.35), (II.36) and (II.37); we also determine whether or not we choose $b^m_{t+1} \neq 0$ according to (II.34). To be precise, we first calculate the vector

$$T = \left( 1_{\{s^{sb}_{t+1}+w^{sb}_{t+1}+g^{sb}_{t+1}>0\}},\ 1_{\{s^{b}_{t+1}+w^{b}_{t+1}+g^{b}_{t+1}>0\}},\ 1_{\{SOC_{t+1}-b\Delta t/\eta-SOC_{\min}\geq 0\}} \right),$$

where $s^{sb}_{t+1}$, $w^{sb}_{t+1}$, $g^{sb}_{t+1}$, $s^{b}_{t+1}$, $w^{b}_{t+1}$ and $g^{b}_{t+1}$ can be calculated according to $\theta_{t+1}$ updated in Step 7 and (2), and $SOC_{t+1}$ is from $S_{t+1}$. Then we determine $p^m_{t+1}$, $p^b_{t+1}$ and $b^m_{t+1}$ according to the following two tables:

T            (1,1,1)    (1,1,0)    (1,0,1)    (1,0,0)    (0,1,1)
p^m_{t+1}       0          0          0          0       p_{t+1}/0
p^b_{t+1}       0          0          0          0       0/p_{t+1}
b^m_{t+1}       0          0        0/b∆t        0          0

T            (0,1,0)    (0,0,1)    (0,0,1)    (0,0,0)
p^m_{t+1}   p_{t+1}/0  p_{t+1}/0   p_{t+1}   p_{t+1}/0
p^b_{t+1}   0/p_{t+1}  0/p_{t+1}      0      0/p_{t+1}
b^m_{t+1}       0          0        b∆t          0

Here $0/b\Delta t$ means that we randomly pick $0$ or $b\Delta t$, and

$$p_{t+1} = \left[E^{mfg}_{t+1} - (s^m_{t+1} + w^m_{t+1} + g^m_{t+1})\right]\cdot 1_{\{E^{mfg}_{t+1} - (s^m_{t+1}+w^m_{t+1}+g^m_{t+1}) > 0\}},$$

$$p_{t+1} = \left[E^{mfg}_{t+1} - (s^m_{t+1} + w^m_{t+1} + g^m_{t+1} + b\Delta t)\right]\cdot 1_{\{E^{mfg}_{t+1} - (s^m_{t+1}+w^m_{t+1}+g^m_{t+1}+b\Delta t) > 0\}},$$

where the second expression applies in the columns where the battery discharges $b\Delta t$. The notation $(0/p_{t+1}, p_{t+1}/0)$ means that we randomly sample $(p^b_{t+1}, p^m_{t+1})$ so that their sum equals $p_{t+1}$.
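The two displayed formulas for $p_{t+1}$ amount to the following one-line helper (a sketch; argument names are our shorthand for $E^{mfg}_{t+1}$, $s^m_{t+1}$, $w^m_{t+1}$, $g^m_{t+1}$ and the optional battery discharge $b\Delta t$):

def residual_demand(E_mfg, s_m, w_m, g_m, b_discharge=0.0):
    # Unmet manufacturing demand after renewables/generator (and, when the
    # battery discharges, b*dt), clipped at zero by the indicator factor.
    gap = E_mfg - (s_m + w_m + g_m + b_discharge)
    return gap if gap > 0 else 0.0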

REFERENCES

[1] C. Mahieux and A. Oudalov. Microgrids Enter the Mainstream. RenewableEnergyFocus.com, http://www.renewableenergyfocus.com/view/43345/microgrids-enter-the-mainstream/, 2015.

[2] U.S. Department of Energy. Technical Report: How Microgrids Work. http://www.energy.gov/articles/how-microgrids-work, 2014.

[3] Lawrence Berkeley National Laboratory. Technical Report: Microgrid Definitions. https://building-microgrid.lbl.gov/microgrid-definitions, 2016.

[4] B. Lasseter. Microgrids [distributed power generation]. In Power Engineering Society Winter Meeting, 2001, volume 1, pages 146–149. IEEE, 2001.

[5] B. Lasseter. Microgrid. In Power Engineering Society Winter Meeting, 2002, volume 1, pages 305–308. IEEE, 2002.

[6] D.E. Olivares, A. Mehrizi-Sani, A.H. Etemadi, C.A. Canizares, R. Iravani, M. Kazerani, A.H. Hajimiragha, O. Gomis-Bellmunt, M. Saeedifard, R. Palma-Behnke, G.A. Jimenez-Estevez, and N.D. Hatziargyriou. Trends in microgrid control. IEEE Transactions on Smart Grid, 5(4):1905–1919, 2014.

[7] F. Ahourai and M.A. Al Faruque. Technical Report: Grid Impact Analysis of a Residential Microgrid under Various EV Penetration Rates in GridLab-D. Center for Embedded Computer Systems, Irvine, CA, 2013.

[8] L. Roggia, C. Rech, L. Schuch, J.E. Baggio, H.L. Hey, and J.R. Pinheiro. Design of a sustainable residential microgrid system including PHEV and energy storage device. In Proceedings of the 2011 14th European Conference on Power Electronics and Applications, pages 1–9. IEEE, 2011.

[9] A.D. Hawkes and M.A. Leach. Cost-effective operating strategy for residential micro-combined heat and power. Energy, 32(5):711–723, 2007.

[10] New York State Division of Homeland Security and Emergency Services (NYDHSES). Microgrids for Critical Facility Resiliency in New York State. https://www.nyserda.ny.gov/-/media/Microgrids-Report-Summary.pdf, 2014.

[11] M. Stadler. Microgrid Planning and Operations for Critical Facilities Considering Outages due to Natural Disasters. https://building-microgrid.lbl.gov/sites/all/files/DER-CAM%20-%20microgrid%20resilience_V7.pdf, 2014.

[12] J.R. Duflou, J.W. Sutherland, D. Dornfeld, C. Herrmann, J. Jeswiet, S. Kara, M. Hauschild, and K. Kellens. Towards energy and resource efficient manufacturing: A processes and systems approach. CIRP Annals-Manufacturing Technology, 61(2):587–609, 2012.

[13] U.S. Department of Energy. Annual Energy Review 2009. ftp://ftp.eia.doe.gov/multifuel/038409.pdf, 2010.

[14] F. Katiraei, C. Abbey, S. Tang, and M. Gauthier. Planned islanding on rural feeders—utility perspective. In Power and Energy Society General Meeting-Conversion and Delivery of Electrical Energy in the 21st Century, 2008 IEEE, pages 1–6. IEEE, 2008.

[15] J. Turkewitz. Unemployment Deepens Storm's Loss as Businesses Stay Closed. http://www.nytimes.com/2012/12/28/nyregion/unemployment-deepens-the-loss-from-hurricane-sandy.html?_r=0, 2012.

[16] M. Garber, L. Unger, J. White, and L. Wohlford. Hurricane Katrina's Effects on Industry Employment and Wages. Monthly Labor Review, August 2006. http://www.bls.gov/opub/mlr/2006/08/art3full.pdf, 2006.

[17] T. Loix and K.U. Leuven. The First Microgrid in the Netherlands: Bronsbergen. http://www.leonardo-energy.org/webfm_send/493, 2009.

[18] Md Monirul Islam and Zeyi Sun. Onsite generation system sizing for manufacturing plant considering renewable sources towards sustainability. Sustainable Energy Technologies and Assessments, 32:1–18, 2019.

[19] M.M. Islam, Z. Sun, and X. Yao. Simulation-based investigation for the application of microgrid with renewable sources in manufacturing systems towards sustainability. In ASEM 2016 International Annual Conference, Charlotte, NC, USA. American Society for Engineering Management, 2016.

[20] X. Zhong, M.M. Islam, H. Xiong, and Z. Sun. Design the capacity of onsite generation system with renewable sources for manufacturing plant. In Procedia Computer Science, Complex Adaptive Systems Conference, Chicago, IL, USA, volume 114, pages 433–440, 2017.

[21] L. Cuyler, Z. Sun, and L. Li. Simulation-based optimization of electricity demand response for sustainable manufacturing systems. In ASME 2014 International Manufacturing Science and Engineering Conference collocated with the JSME 2014 International Conference on Materials and Processing and the 42nd North American Manufacturing Research Conference, pages V001T05A002–V001T05A002. American Society of Mechanical Engineers, 2014.

[22] Y. Wang and L. Li. Time-of-Use based electricity demand response for sustainable manufacturing systems. Energy, 63:233–244, 2013.


[23] M. Fernandez, L. Li, and Z. Sun. "Just-for-Peak" buffer inventory for peak electricity demand reduction of manufacturing systems. International Journal of Production Economics, 146(1):178–184, 2013.

[24] Z. Sun, L. Li, M. Fernandez, and J. Wang. Inventory control for peak electricity demand reduction of manufacturing systems considering the tradeoff between production loss and energy savings. Journal of Cleaner Production, 82:84–93, 2014.

[25] Wenqing Hu, Zeyi Sun, Yunchao Zhang, and Yu Li. Joint manufacturing and onsite microgrid system control using Markov decision process and neural network integrated reinforcement learning. Procedia Manufacturing, 39:1242–1249, 2019.

[26] E. Kuznetsova, Y.F. Li, C. Ruiz, E. Zio, G. Ault, and K. Bell. Reinforcement learning for microgrid energy management. Energy, 59:133–146, 2013.

[27] Yves Dallery. On modeling failure and repair times in stochastic models of manufacturing systems using generalized exponential distributions. Queueing Systems, 15(1-4):199–209, 1994.

[28] Yong Wang and Lin Li. A novel modeling method for both steady-state and transient analyses of serial Bernoulli production systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1):97–108, 2014.

[29] Jingshan Li and Semyon M. Meerkov. Production Systems Engineering. Springer Science & Business Media, 2008.

[30] Peng Zeng, Heping Li, Haibo He, and Shuhui Li. Dynamic energy management of a microgrid using approximate dynamic programming and deep recurrent neural network learning. IEEE Transactions on Smart Grid, 10(4):4435–4445, 2019.

[31] Y. Ji, J. Wang, J. Xu, X. Fang, and H. Zhang. Real-time energy management of a microgrid using deep reinforcement learning. Energies, 12:2291, 2019.

[32] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. ICML (International Conference on Machine Learning), 2014.

[33] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. Neural Information Processing Systems (NIPS), 2013.

[34] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[35] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Second edition, in progress, complete draft online. MIT Press, November 5, 2017.

[36] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[37] Source code for our paper: https://github.com/huwenqing0606/rl-manufacturing.

[38] L. Li and Z. Sun. Dynamic energy control for energy efficiency improvement of sustainable manufacturing systems using Markov decision process. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 43, 2013.

[39] Solar Energy Local: Solar energy data and resources in the US. https://solarenergylocal.com/.

[40] State Climatologist Office for Illinois. http://www.isws.illinois.edu/atmos/statecli/wind/wind.htm.

[41] D.P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, arXiv:1412.6980 [cs.LG], 2015.

Wenqing Hu received the B.Sc. degree in Applied Mathematics from Peking University, Beijing, China, in 2008, and the Ph.D. degree in Mathematics from the University of Maryland, College Park, MD, USA, in 2013. Since 2016 he has been serving as an Assistant Professor in Mathematics at the Department of Mathematics and Statistics, Missouri University of Science and Technology, Rolla, MO, USA.

His research interests lie in probability theory and statistical methodology. He has been working on problems in data science, statistical machine learning, and optimization.

Zeyi Sun received the B.Eng. degree in materials science and engineering from Tongji University, Shanghai, China, in 2002, the M.Eng. degree in manufacturing from the University of Michigan, Ann Arbor, MI, USA, in 2010, and the Ph.D. degree in industrial engineering and operations research from the University of Illinois at Chicago, Chicago, IL, USA, in 2015. He served as an Assistant Professor with the Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO, USA, from 2015 to 2020. Currently, he is a senior research scientist with Mininglamp Academy of Sciences, Mininglamp Technology, Beijing, China.

His research interest is mainly focused on using reinforcement learning algorithms to solve dynamic decision-making problems formulated by Markov Decision Processes.

Jiaojiao Yang received the B.Sc. degree from Chaohu College, Hefei, Anhui, China, in 2011, and the Ph.D. degree in Applied Mathematics from the South China University of Technology in 2016. From 2016 to 2018 she served as a researcher at Huazhong University of Science and Technology, Wuhan, China, and since 2018 she has been an Assistant and then Associate Professor at Anhui Normal University, Wuhu, Anhui, China.

Her research interests are in the analysis of fractals, dynamical systems, and machine learning.

Louis Steimeister received his M.Sc. in Applied Mathematics from the Missouri University of Science and Technology through a joint program with Ulm University in Germany, with an emphasis on mathematical statistics and machine learning as well as their applications to finance. During his time in Ulm he started his own consulting business after his experience in risk management at KPMG. In 2016, he was awarded his B.Sc. in Business Mathematics from the University of Hamburg, having emphasized finance, stochastic processes, and statistics. Louis' research interests revolve around algorithmic trading, reinforcement learning, machine learning, financial mathematics, and mathematical statistics.

Kaibo Xu received his Bachelor degree (1998) in Computer Science from Beijing University of Chemical Technology and his Master (2005) and Ph.D. (2010) in Computer Science from the University of the West of Scotland. He worked as a Teaching Assistant (1998-2004), Lecturer (2004-2009), and Associate Professor (2009-2017) at Beijing Union University. He has supervised more than 20 master's and doctoral students who are successful in their academic and industrial careers. As the principal investigator, he has received 7 governmental funds and 5 industrial funds totaling 5M Chinese yuan. Dr. Xu has also consulted extensively and been involved in many industrial projects. He worked as the Chief Information Officer (CIO) of Yunbai Clothing Retail Group, China (2016-2019). Currently, he is serving as the vice president and principal scientist of MiningLamp Tech. His research interests include graph mining, knowledge graphs, and knowledge reasoning.