
Deep Reinforcement Learning for Advanced Energy Management of Hybrid Electric Vehicles

Roman Liessner, Christian Schroer, Ansgar Dietermann and Bernard Bäker
Dresden Institute of Automobile Engineering, TU Dresden, George-Bähr-Straße 1c, 01069 Dresden, Germany

Keywords: Energy Management, Deep Learning, Reinforcement Learning, Hybrid Electric Vehicle.

Abstract: Machine Learning seizes a substantial role in the development of future low-emission automobiles, as manufacturers are increasingly reaching limits with traditional engineering methods. Apart from autonomous driving, recent advances in reinforcement learning also offer great benefit for solving complex parameterization tasks. In this paper, deep reinforcement learning is used for the derivation of efficient operating strategies for hybrid electric vehicles. There, for achieving fuel-efficient solutions, a wide range of potential driving and traffic scenarios has to be anticipated, where intelligent and adaptive processes could bring significant improvements. The underlying research proves the ability of a reinforcement learning agent to learn nearly-optimal operating strategies without any prior route information and offers great potential for the inclusion of further variables into the optimization process.

1 INTRODUCTION

Today, automobile manufacturers are confronted with major demands regarding the development of new drive trains. On the one hand, strict legislation and regulation require an overall and continuous reduction of fuel consumption and emission levels of new vehicles. On the other hand, customer demands concerning vehicle dynamics, overall comfort and affordability keep rising. While purely electric vehicles often lack larger operation ranges and are still rather expensive due to high costs for production and development, hybrid electric vehicles (HEV) offer a good compromise to combine the benefits of conventional combustion engines and novel electric motors.

The use of an electric machine (EM) as a supplementary motor increases the degree of freedom of the driving unit, and a so-called operating strategy has to be applied for the efficient coordination of both energy converters. Ideally, a variety of factors with effect on fuel consumption are taken into account by such a strategy. These factors range from the driver's influence through different driving styles and habits, environmental conditions like traffic, route and road information up to internal state information of the vehicle like fuel and battery levels. In general, these factors can be seen as highly stochastic, intercorrelated and dependent on the individual situation. For example, the driving style of a sporty driver operating a sports car in a large city is significantly different from the driving style of the same driver in an all-terrain vehicle in the mountains.

These scenarios are captured in velocity profiles, so-called driving cycles, which are not only used for calibration but also for certification of new vehicles. Formerly, emission levels were derived purely from deterministic cycles (e.g. the “New European Driving Cycle”) for which an efficient strategy could specifically be optimized. Nevertheless, the ability of these synthetic cycles to represent reality can and should be questioned as the variety of potential driving scenarios in real-world traffic is huge. For that reason, future certification tests (Real Driving Emissions) (European Commission, 2017) will be performed directly on the road rather than in fully controlled testing environments, which poses major challenges for the manufacturers. Therefore, traditional approaches for the derivation of operating strategies with fixed rule-based strategies seem outdated, especially with the inclusion of diverse driver, route and vehicle information. In order to bring significant fuel and energy savings beyond certification procedures to the actual customer with every-day use of the vehicle, new and innovative approaches are required for the energy management of HEVs.

In recent years, machine learning has increasingly gained momentum in the engineering context where highly non-linear problems have to be solved and greater abstraction levels reduce valuable information. Reinforcement learning (RL) in particular offers significant benefit for planning and optimization tasks, mostly because an agent learns model-free, with great efficiency and through direct interaction with its environment. In contrast to traditional RL methods, which were bound to rather small and low-dimensional problems, the combination of RL and neural networks, called deep reinforcement learning, allows the application in very complex domains with high-dimensional sensory input. In the context of autonomous driving for example, RL can play a vital role in fine-grained control of the vehicle in challenging environments (Mirzaei and Givargis, 2017; Chae et al., 2017). Next to that, the energy management of modern vehicles appears to be a problem where state-of-the-art machine learning could dramatically improve current technology, bringing significant benefits to customers and the environment.

The main contributions of this paper are: (1) derivation of a deep reinforcement learning framework capable of learning nearly-optimal operating strategies; (2) the use of stochastic driver models for improved state generalization and for preventing the strategy from overfitting; (3) inclusion of the battery temperature with an additional power limitation into the optimization process.

The paper is structured into six sections. Section 2 establishes the fundamentals of hybrid vehicles, energy management and reinforcement learning. Section 3 introduces related work and a concrete formulation of the problem. In Section 4, the experimental setup for solving the energy management problem with RL is described, for which the results will be shown in Section 5. Section 6 concludes the paper and gives a prospect of potential future work.

2 BACKGROUND

2.1 Hybrid Electric Vehicles

The term 'hybrid vehicle' is classified by the UN as a vehicle which has at least two energy storages and two energy converters for locomotion (UNECE Transport Division, 2005). Today, the most widely spread implementation is the combination of a conventional internal combustion engine (ICE) with an electric machine. The biggest advantage of this combination is the ability to exploit vehicle deceleration for recuperation, meaning the conversion of braking energy into electric energy for recharging the electric energy storage. With the EM assisting the ICE in times of high loads, up to 30% of fuel could be saved, especially in urban environments, where the efficiency of vehicles operated solely by an ICE is typically rather low (Guzzella and Sciarretta, 2013).

If the combustion engine and the electric machine are both directly connected to the driving axle of the hybrid vehicle, they form a parallel power train structure where the EM can be switched on and off as desired. The basic architecture of parallel hybrid drive train structures is shown in Figure 1.

Figure 1: Structural scheme of a parallel mild hybrid.

Hybrid vehicles with a parallel power train are often designed as mild hybrids, meaning that the EM has a comparatively low power output compared to the ICE. In this case, the EM is mainly used for recuperation of braking energy and short-term support of the combustion engine in periods of high load. In contrast to plug-in hybrids, mild hybrids do not offer the opportunity to charge the electric storage externally, and the entirety of the energy used for support of the ICE has to be gained from recuperation. Deliberately, purely electric driving with only the EM operated for locomotion of the vehicle is not intended (Guzzella and Sciarretta, 2013). Mild hybrids can be seen as a compromise between vehicles operated by a single combustion engine and full hybrids with the ability to drive fully electric for long routes, saving costs in the development and production of the vehicle as well as operational costs for fuel and energy.

2.2 Operating Strategy

The control of HEVs can mainly be split into two levels: the component level (low-level), which controls the components of the driving unit with traditional feedback-control procedures, and the supervisory level (high-level) for monitoring and control of any power flow within the vehicle. The latter also processes any kind of vehicle and driving information (speed, torque, acceleration, slope) and thus can be described as the energy management system (EMS). The primary goal of the EMS is deriving optimal control signals for the low-level in order to achieve the best possible energy efficiency (Onori et al., 2016). The energy management is embedded in an overall operating strategy of the vehicle which, next to fuel saving, could also include other aspects like vehicle dynamics or driving comfort as well as certain technical goals. For example, limitations of the maximum battery power could be desired in order to avoid damage or to extend the lifetime of the component.

Most derivation processes for energy management start with model-based simulations. Deriving an optimal strategy can then be described as a multi-goal optimization process. Various methods exist for solving this problem. On the one hand, there are global optimization methods like dynamic programming (DP) (Kirschbaum et al., 2002) which offer the chance to determine optimal solutions for any given driving cycle. However, since these methods require full knowledge of the route in advance and are computationally very expensive, the application in the vehicle for online control seems infeasible. In practice, DP mostly serves for academic validation of other procedures. On the other hand, there are local optimization methods like the Equivalent Consumption Minimization Strategy (ECMS) where fuel and electric energy used for driving are balanced through a weighting factor (Chasse and Sciarretta, 2011). Nevertheless, selection of an appropriate factor requires further optimization effort with ECMS since it heavily depends on the state of the route being driven and the habits of the driver. Further details on traditional optimization methods for operating strategies can be found in (Guzzella and Sciarretta, 2013).

For adaptability towards different driver types, more innovative approaches are required. Besides stochastic dynamic programming as shown in (Leroy et al., 2012) and (Tate et al., 2008), machine learning and especially reinforcement learning offer good concepts for the efficient inclusion of specific driving and driver information into the optimization process of operating strategies for hybrid vehicles.

2.3 Reinforcement Learning

The basic idea of reinforcement learning algorithms is strongly geared towards how humans or animals learn and can be described as a learning process by trial and error. The interactive nature makes RL particularly interesting for learning problems with no existing set of structured and labeled training examples.

Figure 2: Scheme of the typical agent-environment interaction for reinforcement learning (Sutton and Barto, 2012).

The fundamental structure of any RL algorithm is shown in Figure 2 and can be described as a structured interaction of an agent with its environment. At every discrete timestep $t$ the agent is presented with a state $s_t$ from the environment, for which he must choose an action $a_t$. The agent determines his actions based on an internal policy $\pi$ which maps an action to every observable state:

$$\pi : s \mapsto a \qquad (1)$$

For every chosen action the agent receives a reward $r_{t+1}$ and a new state $s_{t+1}$. The goal of any RL algorithm is for the agent to adjust his policy in such a way that his return $g_t$, given as the weighted sum of the temporal rewards

$$g_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (2)$$

is maximized at any given timestep. The discount factor $\gamma \in [0,1)$ indicates the relevance of future rewards. In general, the agent's policy can be either stochastic or deterministic; the latter will further be denoted with $\mu$. In order for the agent to learn an optimal policy $\pi^*$ which maximizes his return in the given environment, the problem statement has to fulfill the Markov property, meaning that at any point of time all of the environment's historic state information is captured in the current state $s_t$, so that any following state and future rewards solely depend on $s_t$ and the agent's chosen action $a_t$. A process satisfying the Markov property can be described as a Markov decision process (MDP) (Puterman, 2010). Since complete satisfaction of this condition for real-world problems is often unfeasible, a good approximation of an MDP is often sufficient for the applicability of reinforcement learning.
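To illustrate how the discount factor weights future rewards, the following minimal Python sketch evaluates the return of Equation (2) for a finite episode; the function name and the finite-horizon truncation are illustrative choices and not part of the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return g_0 = sum_k gamma^k * r_{k+1} over a finite episode.

    `rewards` is the sequence r_1, ..., r_T observed after each step; the
    finite horizon stands in for the infinite sum of Eq. (2).
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards 1, 0, 2 with gamma = 0.5 give g_0 = 1 + 0.5*0 + 0.25*2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.5
```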

Different approaches exist for solving an MDP with RL, one of which is the concept of Temporal-Difference (TD) Learning. Here, the agent maintains an estimate of the achievable return which can iteratively be updated based on the truth found in the actually experienced state sequences and the received rewards during the interaction with the environment. One proven way of solving an MDP with TD methods is Q-learning, where the agent estimates a Q-value for every possible action, indicating the achievable future return when this action is chosen. After every interaction with the environment, an iterative update of the current estimate for the chosen action can be made by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ y_t - Q(s_t, a_t) \right] \qquad (3)$$

with

$$y_t = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \qquad (4)$$

With a sufficiently small learning rate $\alpha$, the iterative updates of the agent's estimates are proven to converge towards the real Q-values of the MDP. Knowledge of the Q-values automatically results in following the optimal policy if the agent in every state picks the action with the highest Q-value:

$$\pi^*(s) = \arg\max_a Q(s,a) \qquad (5)$$

which ultimately maximizes his return. Even though Q-learning has proven to be very effective for rather small and low-dimensional control tasks, the need for discrete state and action spaces limits the number of application options for real-world optimization problems.
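As a concrete illustration of the updates in Equations (3)–(5), the following Python sketch shows tabular Q-learning for a discretized problem; the table dimensions, learning rate and discount factor are illustrative placeholders, not values from the paper.

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative discretization
alpha, gamma = 0.1, 0.9              # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # tabular Q-value estimates

def q_update(s, a, r, s_next):
    """Temporal-difference update after observing transition (s, a, r, s_next)."""
    y = r + gamma * np.max(Q[s_next])   # target value, Eq. (4)
    Q[s, a] += alpha * (y - Q[s, a])    # Q-learning update, Eq. (3)

def greedy_action(s):
    """Greedy policy of Eq. (5): pick the action with the highest Q-value."""
    return int(np.argmax(Q[s]))
```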

2.4 Deep Reinforcement Learning

Deep reinforcement learning, as a combination of deep learning with artificial neural networks and the interactive learning structure of RL, has gained much attention in recent years. As the foundation of almost any major progress in artificial intelligence in the past decade, neural networks offer the essentials to make RL valuable for a whole new range of high-dimensional real-world optimization problems. Hence, deep neural networks can be deployed as non-linear function approximators with trainable parameters $\theta^Q$ in the context of Q-learning, resulting in the Deep Q-Network (DQN) algorithm (Volodymyr Mnih et al., 2013). Instead of iteratively updating running estimates of the actions' Q-values, the DQN is trained to output the correct values for any given state by minimizing the loss:

$$L(\theta^Q) = \left[ y_t - Q(s_t, a_t \mid \theta^Q) \right]^2 \qquad (6)$$

with

$$y_t = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1} \mid \theta^{Q'}) \qquad (7)$$

Although this combination was long believed to be inherently unstable, (Mnih et al., 2015) show effective learning of the DQN by introducing two main features: training the network with uncorrelated minibatches of past experience, called experience replay, and deriving target values $y_t$ with a separate network with parameters $\theta^{Q'}$, called the target network. In this manner, a DQN agent could be trained to achieve human-level performance in many games of an Atari 2600 emulator, receiving only unprocessed pixels as state input and the raw score as reward.
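A minimal sketch of the experience replay mechanism mentioned above is given below; the class name, capacity and batch size are assumptions for illustration, and uniform sampling is used to obtain largely uncorrelated minibatches.

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay memory storing (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```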

The processing of high-dimensional state vectors can be seen as a significant advantage for many engineering tasks, where data gathered from a large number of sensors often has to be filtered elaborately for use with conventional optimization methods. However, for many control tasks even bigger advantages are believed to arise from the use of continuous control parameters in order to avoid quality-limiting discretization errors. Plain Deep Q-Networks do not offer continuous output parameters. Thus, numerous evolutions of DQN have been proposed in the last years, mostly categorized as policy gradient algorithms like the Deep Deterministic Policy Gradient (DDPG) (Timothy P. Lillicrap et al., 2015) or Trust Region Policy Optimization (TRPO) (John Schulman et al., 2015). A continuous variant of DQN is proposed in (Shixiang Gu et al., 2016) with the Normalized Advantage Function. Continuous policy gradient methods use an actor-critic architecture where an actor represents the currently followed policy with output of continuous action variables, which are then evaluated by a second part called the critic. In case of DDPG, both actor and critic are represented as deep neural networks, where the actor performance is evaluated through a deterministic policy gradient (David Silver et al., 2014) derived from the critic as a DQN. The DDPG architecture is shown in Figure 3.

Figure 3: Basic structure of the DDPG actor-critic agent. Actor with parameters $\theta^\mu$, state $s$ as input and action based on deterministic policy $\mu$ as output. Critic taking both state and chosen action of the actor as input and giving out an according Q-value, based on parameters $\theta^Q$. Update signal for higher rewards for actor as policy gradient, derived with the chain rule of both networks (Timothy P. Lillicrap et al., 2015).
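The paper does not state which software framework was used; purely as an illustration, the following PyTorch sketch shows one common way to set up the two networks of Figure 3, with the hidden-layer sizes taken from Section 4.5 and the action fed into the second layer of the critic.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu): maps a state to an action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, h1=400, h2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, h1)
        self.fc2 = nn.Linear(h1, h2)
        self.out = nn.Linear(h2, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return torch.tanh(self.out(x))        # bounded action, cf. Eq. (21)

class Critic(nn.Module):
    """Q(s, a | theta_Q): the chosen action enters the network in the second layer."""
    def __init__(self, state_dim, action_dim, h1=400, h2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, h1)
        self.fc2 = nn.Linear(h1 + action_dim, h2)
        self.out = nn.Linear(h2, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)                    # linear output: scalar Q-value
```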

3 REINFORCEMENT LEARNING FOR HEV CALIBRATION

3.1 Related Work

Compared to traditional and most commonly used methods for deriving efficient operating strategies in the automotive industry like DP, ECMS or fuzzy approaches, machine learning and in particular reinforcement learning present new opportunities with significant advantages. Most of all, the agent would not require any prior information about the course of the route in order to decrease fuel consumption, since he is fully trained to do so with only current state information.

In (X. Lin et al., 2014), an RL agent is given control over particular parameters of the control unit within the model of a HEV. Given a discrete state consisting of power demand, vehicle speed and battery charging state, the agent was able to determine a strategy to regulate battery output voltage and gear translation, resulting in fuel savings of up to 42%. In (C. Liu and Y. L. Murphey, 2014), the discrete state space of the constrained optimization is extended with specific trip information like the remaining traveling distance, which made the operating strategy even more efficient. In both cases traditional Q-learning methods were used for optimization, which require discretization of the generally continuous state and control parameters within the vehicle.

Preliminary investigations (Patrick Wappler, 2016) have proven the high quality of the solutions gained by classic RL algorithms. The discrete representation of the results in a Q-table would also allow a straightforward approach for the integration in common energy management systems of HEVs. However, the well-known curse of dimensionality exponentially increases the size of the table with the inclusion of further variables, which limits the problem to rather small state and action spaces in order to deal with restrictions concerning memory and computing time. Additionally, without the use of approximation or interpolation methods the agent is unable to value any state he has not processed during training. With severely different driving scenarios occurring in the real world considering traffic or driving styles, the derivation of a fully complete discrete representation is unfeasible. A rough discretization excludes valuable information within the optimization problem and limits the achievable fuel savings.

3.2 Problem Formulation

The central goal of the energy management is finding a control signal $action(t)$ for the minimization of the fuel consumption $m_{Fuel}$ during a drive of time $t_0 \le t \le t_f$, expressed as the minimization of an integral value $J$:

$$J = \int_{t_0}^{t_f} m_{Fuel}(action(t), t)\, dt \qquad (8)$$

wherein $m_{Fuel}$ describes the fuel mass flow rate. The minimization of $J$ is subject to physical limitations concerning performance measures, limitations of the energy storage and variance of the charging state of the battery (SOC). Thus, the energy management optimization problem is a temporally limited optimization problem with secondary and boundary conditions (Onori et al., 2016).

3.2.1 Boundary Conditions

Unlike plug-in hybrids, mild hybrids cannot be charged externally, so they must be operated in a charge-conserving way. Hence a boundary condition can be stated:

$$SOC(t_f) = SOC_{target} \qquad (9)$$

which also allows for comparison of different solutions with an optimal solution found by dynamic programming. A desired target value could be:

$$SOC_{target} = SOC(t_0) \qquad (10)$$

maintaining the initial charging state of the battery at the beginning of the driving cycle. However, for practical application a hard constraint does not seem necessary, and ranging the terminal charging state $SOC(t_f)$ between threshold values is acceptable. With a sufficiently small range, a deviation from the target value does not lead to interference with the vehicle's functionality (Onori et al., 2016).

3.2.2 Secondary Conditions

Secondary conditions of the optimization problem can be stated for the state variables of the HEV as well as the control parameters. Limiting state variables like the SOC and the battery temperature $\vartheta_{bat}$ ensures safe use of the components, operation in regions of high efficiency and extended lifetime. Limitations of the control variables arise from physical limits of actuators and maximum performance (or power output $P$) of engines and battery.

Thus, it can be stated:

$$
\begin{aligned}
SOC_{min} &\le SOC(t) \le SOC_{max}, \\
P_{bat,min} &\le P_{bat}(t) \le P_{bat,max}, \\
\vartheta_{bat,min} &\le \vartheta_{bat}(t) \le \vartheta_{bat,max}, \\
Trq_{x,min} &\le Trq_x(t) \le Trq_{x,max}, \\
n_{x,min} &\le n_x(t) \le n_{x,max}
\end{aligned}
\qquad (11)
$$

for $x = ICE, EM$,

where $n$ and $Trq$ denote rotational speeds and torques of the EM and ICE. The notations $(\cdot)_{min}$ and $(\cdot)_{max}$ represent the minimum and maximum of the respective entries at every point of time.

4 REINFORCEMENT LEARNING SETUP

The use of deep reinforcement learning requires stating the problem as the typical interaction of an agent with his environment. Concerning the energy management problem of HEVs, the agent can be chosen to represent the control unit of the vehicle. The environment can then include any outside part of the world containing information for the derivation of an efficient operating strategy. Here the environment is assembled from two models, one of the vehicle and one of the driver.

4.1 Vehicle Model

The mild hybrid power train used for the vehicle model is briefly described in Section 2.1 and shown in Figure 1. Both energy converters are located on the crank shaft of the driving unit, where the torque is split according to the operating strategy and physical limitations. The ICE gets fuel from a fuel tank and the EM is supplied with electric energy from a battery. The same battery stores the recuperated energy from vehicle deceleration, downhill driving or load point shifting of the combustion engine.

Since targets concerning speed and acceleration are specified by driving cycles, the vehicle model is oriented backwards, with power demands calculated from the wheels to the driving unit and each intermediate component modelled individually. Figure 4 shows the signal flow of the vehicle model as a block diagram.

4.1.1 Vehicle Dynamics

With a predetermined speed $v_{veh}$ and slope $\delta_{veh}$ from the driving cycle and the vehicle parameters $par_{veh}$, the required power for the vehicle $P_{veh}$ can be determined by approximating the driving resistance caused by rolling friction and aerodynamic drag (Guzzella and Sciarretta, 2013):

$$P_{veh} = f(v_{veh}, \delta_{veh}, par_{veh}) \qquad (12)$$

With the addition of the dynamic rolling radius of the wheel $r$, the required torque $Trq_{whl}$ and the rotational speed at the wheels $n_{whl}$ can be calculated as:

$$Trq_{whl} = \frac{P_{veh} \cdot r}{v_{veh}} \qquad (13)$$

$$n_{whl} = \frac{v_{veh}}{2\pi \cdot r} \qquad (14)$$
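For illustration, Equations (13) and (14) translate directly into a small helper function; the function name, the unit choices and the standstill guard are assumptions added here, not part of the paper.

```python
import math

def wheel_demand(p_veh, v_veh, r_dyn):
    """Convert required vehicle power into wheel torque and rotational speed.

    p_veh in W, v_veh in m/s, dynamic rolling radius r_dyn in m; returns the
    wheel torque in Nm and the rotational speed in 1/s. A small velocity
    guard avoids division by zero at standstill.
    """
    if abs(v_veh) < 1e-3:
        return 0.0, 0.0
    trq_whl = p_veh * r_dyn / v_veh           # Eq. (13)
    n_whl = v_veh / (2.0 * math.pi * r_dyn)   # Eq. (14)
    return trq_whl, n_whl
```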

4.1.2 Consumption Model

An analytical description of the chemical processes within the combustion engine is barely possible, hence the fuel consumption of the ICE was modelled empirically based on measured data from a power train test bench. With input of the speed and torque at the wheels, the selected gear $G$ as well as the electric power of the electric motor $P_{EM,el}$ (derived by the chosen action of the agent), the model determines a fuel volume $m_{Fuel}$ consumed in a time span of $\Delta t = 1\,\mathrm{s}$:

$$m_{Fuel} = f(n_{whl}, Trq_{whl}, G, P_{EM,el}) \qquad (15)$$

4.1.3 Electric Motor Model

The model of the EM determines the losses due to the conversion between electric and mechanical power. Based on the efficiency for a given crankshaft speed $n_{crs}$ and the direction of the conversion, the empirical model outputs the power after losses:

$$P_{mech} = f(n_{crs}, P_{el}) \qquad (16)$$

$$P_{el} = f(n_{crs}, P_{mech}) \qquad (17)$$


Figure 4: Signal flow diagram of the HEV model.

4.1.4 Battery Model

For the calculation of the change in charging state, the battery was modelled as an equivalent two-terminal network, simplifying the underlying physics of batteries. Neglecting any effect of temperature or loading cycles on the battery's capacity, the change of charging state was purely determined from the previous state and the required power output from the battery, composed of the power for operation of the EM as well as any auxiliary consumers (e.g. the cooling system):

$$SOC_{t+1} = f(SOC_t, P_{bat}) \qquad (18)$$

For the results shown in Section 5.4, a temperature model of the battery was implemented, approximating the change of temperature due to the power output and the cooling control signal $Cool_{Control}$:

$$\vartheta_{bat,t+1} = f(\vartheta_{bat,t}, P_{bat}, Cool_{Control}) \qquad (19)$$

The utilization of the cooling system causes an additional power consumption $P_{cool,el}$.

4.2 Driver Model

The driver is implemented as driving cycles, i.e. specific velocity profiles which the vehicle drives through in simulation. Next to standard cycles for the determination of emission levels, cycles generated by a stochastic driver model from (Liessner et al., 2017) are used for a more realistic reproduction of real-world traffic scenarios. Hereby, diverse characteristics can be considered within those cycles, like different market features, specific driving styles and habits, or traffic situations at different points during the day. The tuning of operating strategies towards special features in real-world driving scenarios can be seen as a major benefit for customers concerning the fuel efficiency of their vehicles, since it can be assumed that, e.g., an operating strategy derived with European driving data will not work as efficiently in India or China.

4.3 States, Actions and Reward

4.3.1 States

At each timestep during simulative driving, the environment provides a state vector for the agent constituting the current internal state of the vehicle. Here, the state vector is defined by:

$$s_t = (n_{whl}, M_{whl}, SOC, \vartheta_{bat}, G) \qquad (20)$$

In the case of the underlying vehicle model not taking into account the battery temperature and the according power limit, $\vartheta_{bat}$ can be omitted from the state vector. In this form the environmental state complies with the Markov property in the sense that any following state of the environment (vehicle model) depends only on the current state and the action chosen by the agent. Hence reinforcement learning can be utilized to solve the underlying control problem.

4.3.2 Actions

For any given state the agent has to choose an action which he considers optimal for the maximization of his return. For the energy management problem, actions can include any parameterizable metric with an effect on fuel and energy efficiency. Commonly parameterized metrics in the optimization of operating strategies are the split of torque between the electric motor and the combustion engine, the choice of the gear or the management of the battery temperature. Here we only consider the choice of power output for the electric motor as an action controlled by the RL agent. Choice of gear and temperature regulation are implemented as heuristics. As mentioned, the electric motor is subject to multiple restrictions concerning the maximum possible power output. Not only is this limited by the current crankshaft speed but also, if considered by the model, by the maximum power output of the battery regarding its temperature. In contrast to discrete Q-tables, where the use of certain actions in specific states can be restricted by assigning extremely low Q-values, a continuous output of neural networks can hardly be limited without an additional feedback signal which the network can learn to cope with. Hence the selectable action of the agent was implemented as a percentage of the currently maximally applicable torque, with the output layer of the neural network being a tanh layer restricting the action to $a_t \in [-1, 1]$. The applicable power of the electric motor is then determined by:

$$P_{EM} = \begin{cases} a_t \cdot P_{EM,max} & \text{for } a_t \ge 0 \\ a_t \cdot P_{EM,min} & \text{for } a_t < 0 \end{cases} \qquad (21)$$

where the agent can choose to operate the EM either as a motor or as a generator by choosing positive or negative action values, respectively.
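A minimal sketch of this action-to-power mapping is given below. The function name is hypothetical, and since the paper does not spell out the sign convention of $P_{EM,min}$, the generating limit is passed here as a positive magnitude so that negative actions yield negative (generating) power.

```python
def em_power_from_action(a_t, p_gen_max, p_mot_max):
    """Map the actor output a_t in [-1, 1] to an EM power request, following Eq. (21).

    p_mot_max: currently applicable motoring power (positive, W).
    p_gen_max: currently applicable generating power magnitude (positive, W);
               both limits may vary with crankshaft speed and battery state.
    """
    a_t = max(-1.0, min(1.0, a_t))    # safety clamp around the tanh output
    if a_t >= 0:
        return a_t * p_mot_max        # motor mode
    return a_t * p_gen_max            # negative power: generator mode
```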

4.3.3 Reward

Since the main goal of the operating strategy is minimizing fuel consumption but the RL agent aims at maximizing his return, a common practice is to state the reward with a negative sign. Here the agent's reward based on his selected action is defined as the negative total energy usage per timestep by:

$$r_t = -(E_{che} + \kappa E_{el}) \qquad (22)$$

with

$$E_{che} = m_{Fuel} \cdot \rho_{Fuel} \cdot H_{Fuel} \qquad (23)$$

and

$$E_{el} = P_{bat} \cdot \Delta t \qquad (24)$$

where the agent not only gets incentivized to minimize fuel consumption but to do so with the least amount of electric energy possible. Thereby $E_{el}$ is weighted by a factor $\kappa = f(SOC)$ which rates the cost of electric energy proportionally to the deviation of the SOC from a target value $SOC_{target}$. In doing so, the agent should learn to balance his SOC over the course of a driving cycle as described in Section 3.2.1.
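The reward of Equations (22)–(24) can be sketched as follows. The exact shape of $\kappa = f(SOC)$ is not given in the paper, so a simple linear dependence on the SOC deviation with an illustrative gain is used here purely as a placeholder; the parameter names are likewise hypothetical.

```python
def reward(fuel_volume, rho_fuel, h_fuel, p_bat, soc, soc_target,
           dt=1.0, kappa_gain=2.0):
    """Negative energy-usage reward per timestep, cf. Eqs. (22)-(24).

    fuel_volume: fuel consumed in the timestep (Eq. (15)); rho_fuel and h_fuel
    are the fuel density and heating value; p_bat is the battery power over dt.
    """
    e_che = fuel_volume * rho_fuel * h_fuel             # chemical energy, Eq. (23)
    e_el = p_bat * dt                                   # electric energy, Eq. (24)
    # Placeholder for kappa = f(SOC): electric energy becomes more expensive
    # the further the SOC has dropped below its target value.
    kappa = max(0.0, 1.0 + kappa_gain * (soc_target - soc))
    return -(e_che + kappa * e_el)                      # Eq. (22)
```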

4.4 Training

With the stated RL-specific formulation, a DDPG algorithm was implemented for solving the energy management problem. With a lower computational cost than TRPO and a straightforward implementation, DDPG has proven its usability and sample efficiency in other implementations (Duan Yan et al., 2016). Additionally, the actor-critic architecture offers a good approach for possible application in real-world vehicle hardware with rather low computational capabilities, since the resulting operating strategy is captured in a single neural network.

Training was conducted for one specific driving cycle which could include any type of information about driving styles or traffic situations. The current operating strategy was evaluated every 5 episodes on that specific cycle to track the learning progress. In training episodes however, as mentioned, the agent was confronted with a unique stochastic cycle based on the original one in order to prevent the strategy from overfitting to a specific velocity profile. At the same time, the length of the stochastic cycles was varied in every episode. In order to increase the number of possible states which the network is confronted with during training, the initial state values were varied in each training episode on a random basis within defined bounds. For evaluation, the SOC and $\vartheta_{bat}$ were always initialized to 50% and 20°C in order to track the learning progress.

For general exploration of the state space, as in (Timothy P. Lillicrap et al., 2015), an Ornstein-Uhlenbeck process (Borodin and Salminen, 1996) was used for adding noise to the chosen actions of the agent. The exploration noise was slowly decayed over the first 1000 training episodes.
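A common discrete-time formulation of such an Ornstein-Uhlenbeck exploration process is sketched below; the parameter values are typical choices for DDPG and are assumptions here, as the paper does not report its exact noise settings.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise added to the actor's actions for exploration."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        """Reset the process to its mean at the start of an episode."""
        self.x[:] = self.mu

    def sample(self):
        """Advance the process by one step and return the current noise value."""
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x.copy()
```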

The training procedure of DDPG is shown as pseudocode in Algorithm 1.

Algorithm 1: Pseudocode for DDPG.

initialize networks
initialize replay buffer
for episode = 1, M do
    initialize exploration process
    observe initial state s_0
    for t = 1, T do
        choose action a_t with actor and add noise
        execute a_t
        observe r_{t+1} and s_{t+1}
        store [s_t, a_t, r_{t+1}, s_{t+1}] in buffer
        choose minibatch from buffer
        update critic with loss L from minibatch
        update actor with policy gradient
        update target networks
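For illustration, one pass through the inner update steps of Algorithm 1 could look as follows in PyTorch, reusing the Actor/Critic sketch from Section 2.4; the discount factor and the soft-update rate tau are illustrative values (the paper itself reports using γ = 0, cf. Section 4.5), and the batch layout is an assumption.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One minibatch update: critic regression, actor policy gradient, soft target update.

    `batch` is a list of (state, action, reward, next_state) tuples with states
    and actions stored as 1-D float arrays.
    """
    states, actions, rewards, next_states = zip(*batch)
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.float32)
    r = torch.as_tensor(rewards, dtype=torch.float32).unsqueeze(1)
    s2 = torch.as_tensor(next_states, dtype=torch.float32)

    # Critic update: regress Q(s, a) onto the target value y (cf. Eqs. (6) and (7)).
    with torch.no_grad():
        y = r + gamma * target_critic(s2, target_actor(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks towards the learned networks.
    for target, source in ((target_actor, actor), (target_critic, critic)):
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```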

4.5 Hyperparameters

Similar to the original architecture, both actor and critic were implemented as deep neural networks with two hidden layers of 400 and 300 neurons and Rectified Linear Unit activation. The actor receives a standardized state vector and outputs a continuous action as described. The critic additionally takes the actions into the second layer of the network and outputs a continuous Q-value through linear activation. An Adam optimizer was chosen for the update process of the neural networks with a learning rate of $10^{-5}$ for the critic and $10^{-4}$ for the actor. The replay buffer for sampling minibatches was initialized to a size of $10^5$, where in every timestep a minibatch of size 32 was sampled to update the critic network.
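Purely as an illustration of how these hyperparameters fit together, the following sketch wires up the components using the Actor/Critic and ReplayBuffer sketches from the previous sections; the state and action dimensions follow Equation (20) and the single EM power action, and PyTorch is again an assumption.

```python
import copy
import torch

state_dim, action_dim = 5, 1                         # state of Eq. (20), one EM power action

actor = Actor(state_dim, action_dim)                 # 400/300 hidden units, ReLU
critic = Critic(state_dim, action_dim)
target_actor = copy.deepcopy(actor)                  # target networks start as copies
target_critic = copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # actor learning rate
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-5)  # critic learning rate

buffer = ReplayBuffer(capacity=100_000)              # replay buffer size 10^5
batch_size = 32                                      # minibatch size per timestep
```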

For the training process a discount factor of $\gamma = 0$ was chosen, so the agent optimized his strategy only towards local rewards. However, since the reward function contains a dynamic weighting factor dependent on the current charging state, the agent turned out to adapt his strategy in a rather long-sighted way. In this manner, significantly better results were achieved compared to discount factors $\gamma > 0$.

5 RESULTS

5.1 Learning

Figure 5: Cumulative reward (return) of the agent during the training process for the New-York-City-Cycle.

Figure 6: Fuel saving in % compared to driving solely with the combustion engine (blue) and terminal battery charging state (green) during the training process for the New-York-City-Cycle.

Figures 5 and 6 show the exemplary training progress for simulative driving of the New-York-City-Cycle (NYCC), a characteristic cycle for low-speed city driving with many start-stop maneuvers. As shown, the agent continuously increases his cumulative reward, the return, with progressive training. In doing so, according to the reward function, the resulting energy consumption is decreased consistently, with a constant fuel saving of over 20% compared to driving with the combustion engine only, established very early in the training process. With the resulting battery charge at the terminal state of an episode very close to the initial value of 50%, the SOC can be described as balanced. In a low-speed environment where the rather low-powered electric motor can be used very effectively, a positive deviation of the battery charge throughout the cycle can well be expected for any efficient operating strategy. Since the agent did not have any hard constraint towards the charging state but was only implicitly incentivized through the reward function, he is free to make any compromise between the use of chemical and electric energy which he considers most efficient.

Figure 7: Excerpt of the split of overall torque (blue) between the combustion engine (red) and the electric motor (green) for the New-York-City-Cycle, controlled by the fully trained agent.

Figure 8: Comparison of battery loading trajectories for the New-York-City-Cycle with the strategy learned by the RL agent (blue) and the optimal strategy derived by Dynamic Programming (red).

Figure 7 shows an excerpt from the resulting strategy for the NYCC in terms of the split of torque between the combustion engine and the electric motor which the agent establishes with the choice of his actions. As shown, the agent fully learns to use moments of vehicle deceleration for recharging the battery with recuperation energy by choosing actions $a < 0$, which runs the electric motor in generator mode. It is worth mentioning that after 20 episodes of training this behavior is already nearly as sophisticated as after 1000 episodes of simulative driving.

It is recalled that the results for purely deterministic cycles only serve evaluation purposes. Since the agent is trained with random cycles similar to the one he is being evaluated on, he is able to deal with high amounts of stochasticity concerning the velocity profile of the drive and will control the electric motor just as efficiently. Due to the lack of space this cannot be fully presented here and will be briefly covered in Section 5.3.

In general, training the agent for a variety of different driving cycles is exceptionally stable. For driving profiles including rather high velocity and acceleration rates, the learning rate of the actor network had to be adjusted occasionally: the unavoidable increase of the reward magnitude through higher momentary fuel consumption leads to bigger gradients at the beginning of the training, where the neural networks are initialized with outputs close to zero and the loss increases accordingly. Clipping the rewards or gradients to certain maximum values would be a considerable option to avoid this, though useful information about the fuel consumption might be excluded at a later point in the training.

5.2 Quality Analysis

For analyzing the quality of the operating strategy learned by the agent, a comparison was made to a strategy derived by dynamic programming with discrete state and action spaces. With full knowledge of the cycle in advance, the result of DP can be seen as the globally optimal solution, even though a small error is unavoidable through the use of discrete parameters.

Table 1: Comparison of fuel savings for the strategy derived by the RL agent and a global optimum computed with DP, in contrast to driving with the ICE only.

Driving Cycle    RL       DP       ∆
NYCC             28.6%    29.5%    0.9%
WLTP             16.0%    16.3%    0.3%
US06             10.8%    11.2%    0.4%
FTP75            20.1%    20.7%    0.6%

Table 1 shows the resulting fuel savings for 4 different velocity profiles typically used in calibration and certification processes. The DP results are based on the terminal SOC of the RL strategy in order to assure comparability. As seen, only small deviations occur, where the optimal strategy achieves less than 1% more savings in fuel, which makes the strategy learned by the agent nearly optimal. A comparison of both battery charging trajectories resulting from the operating strategies for the NYCC is shown in Figure 8, confirming that the agent's strategy converges towards the global optimum and does not get stuck in a poor local optimum throughout the training process. In contrast to DP, the RL agent does not need any prior information about the driving route; he is trained to control the EM optimally with only the momentary state information provided by the vehicle environment. A local strategy optimization with nearly globally optimal solutions can be seen as a major advantage of deep reinforcement learning over traditional methods for the derivation of operating strategies for HEVs.

5.3 Stochastic Cycles

As mentioned, in every episode of the training process the agent is confronted with a different stochastic velocity profile with the same characteristics concerning speed and acceleration as the one he will be evaluated on. Next to preventing overfitting, this is mainly aimed at increasing the agent's ability to generalize his strategy towards potentially unknown states and driving profiles. Where traditional RL methods with discrete parameters require tedious approximation methods, generalizing knowledge is a major advantage of deep RL obtained through the use of neural networks.

Table 2: Comparison of a strategy trained with stochastic velocity profiles and another trained solely deterministically. Results are shown as the deviation of fuel saving in % compared to a globally optimal solution computed by dynamic programming.

Driving Cycle    Stochastic    Deterministic
NYCC             1.6%          1.5%
Rand_NYCC,1      0.95%         3.1%
Rand_NYCC,2      1.5%          6.6%
Rand_NYCC,3      1.7%          2.5%
Rand_NYCC,4      1.4%          3.1%
Rand_NYCC,5      1.6%          3.2%

Table 2 shows a comparison of two strategies, one trained with stochastic cycles based on the NYCC, the other trained purely with the deterministic NYCC, which was repeatedly driven through simulatively. While the deterministically derived strategy performs well on the original NYCC, it is shown that for any other stochastic velocity profile similar to the original NYCC the strategy performs significantly worse, which clearly indicates overfitting. The stochastically trained strategy, in contrast, shows consistently good results for any driven route similar to the original NYCC, with a deviation from the globally optimal solution of only 0.9-1.7%. Additionally, stochastic cycles could include any type of further information about driving habits or traffic scenarios which neural networks could easily take into account for the operating strategy. This results in a much more accurate representation of realistic traffic, potentially leading to increased fuel savings and emission reductions in the actual operation of the vehicle, rather than deriving approximate values on a test bench which could hardly be achieved in the real world.

5.4 Inclusion of Temperature Information

Figure 9: Battery temperature relative to the threshold for power limitation (green) together with the control of cooling power (blue) for simulative driving of a 5000 s long stochastic cycle based on the NYCC with the operating strategy learned by the agent.

As mentioned, for this paper a temperature approximation of the battery was included in the vehicle model, combined with a power restriction to prevent damage from overheating. Additionally, a cooling mechanism of the battery was implemented with a heuristic used for control. Figure 9 shows an exemplary development of the battery temperature for simulative driving of a 5000 s stochastic cycle, for which the agent has optimized his strategy. As it shows, the agent has learned to control the EM in such a manner that the temperature level of the battery remains below the threshold value over which the maximum power output of the battery would be restricted and the agent could not use the full potential of the electric motor. Additionally, a clear tendency can be seen to keep the temperature level close to the threshold level of the cooling heuristic, around 91%, in order to decrease the additional use of battery power for the cooling system. With the additional boundary condition implemented implicitly in the vehicle model and the agent not receiving any explicit information except the resulting reward based solely on the energy consumption, the agent still found nearly optimal solutions to the energy management problem. Here the major potential of deep reinforcement learning becomes apparent compared to traditional methods for the derivation of operating strategies with discrete variables, as additional optimization goals can be integrated into the training process without any significant extra cost or loss in quality.

6 CONCLUSION

The energy management of hybrid electric vehicles poses major challenges for automobile manufacturers, where traditional approaches show many deficiencies in processing additional information concerning real-world driving scenarios. In this paper, a deep reinforcement learning framework has been derived that offers great potential for solving many of those problems. It has been shown that a deep RL agent is capable of achieving nearly-optimal fuel consumption results with a locally trained strategy which can be applied online in the vehicle. In contrast to dynamic programming, no prior knowledge of the driving route is necessary, and the training with stochastic driving cycles allows for greater generalization to variously different velocity profiles. Additionally, deep reinforcement learning allows the efficient inclusion of further optimization criteria.

Another advantage can be seen in the potential adaptability of the agent, which would be able to learn from real driving data even during direct usage of the vehicle. A conceivable result would be an intelligent energy management system constantly adapting to specific driving habits, typically driven routes or other desirable characteristics of the vehicle owner or driver.

Future work can build on the presented results and the arising potential for further improvement. This includes a broader examination of the agent's capabilities of online generalization and adaptability, simulating alternating traffic scenarios or different driving habits while constantly updating the agent's strategy. Furthermore, a big field of research arises from the integration of additional vehicle or future route information, e.g. from navigation data. A predictive strategy offers great potential for further reduction of fuel consumption and emission levels.

As a next step, the algorithm could be applied on a power train test bench for assessing and refining the results learned with the simulation model. Similarly, the transfer into the real vehicle would be feasible.


REFERENCES

Borodin, A. N. and Salminen, P. (1996). Ornstein-Uhlenbeck process. In Borodin, A. N. and Salminen, P., editors, Handbook of Brownian Motion — Facts and Formulae, pages 412–448. Birkhäuser Basel, Basel.

C. Liu and Y. L. Murphey (2014). Power management for plug-in hybrid electric vehicles using reinforcement learning with trip information. In 2014 IEEE Transportation Electrification Conference and Expo (ITEC), pages 1–6.

Chae, H., Kang, C. M., Kim, B., Kim, J., Chung, C. C., and Choi, J. W. (2017). Autonomous braking system via deep reinforcement learning. CoRR, abs/1702.02302.

Chasse, A. and Sciarretta, A. (2011). Supervisory control of hybrid powertrains: An experimental benchmark of offline optimization and online energy management. Control Engineering Practice, 19(11):1253–1265.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller (2014). Deterministic policy gradient algorithms. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 387–395. JMLR Workshop and Conference Proceedings.

Duan Yan, Chen Xi, Houthooft Rein, Schulman John, and Abbeel Pieter (2016). Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, 2016:1329–1338.

European Commission (2017). Draft regulation: real-driving emissions in the Euro 6 regulation on emissions from light passenger and commercial vehicles.

Guzzella, L. and Sciarretta, A. (2013). Vehicle propulsion systems: Introduction to modeling and optimization. Springer, Heidelberg, 3rd edition.

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel (2015). Trust region policy optimization. CoRR.

Kirschbaum, F., Back, M., and Hart, M. (2002). Determination of the fuel-optimal trajectory for a vehicle along a known route. IFAC Proceedings Volumes, 35(1):235–239.

Leroy, T., Malaize, J., and Corde, G. (2012). Towards real-time optimal energy management of HEV powertrains using stochastic dynamic programming. In 2012 IEEE Vehicle Power and Propulsion Conference, pages 383–388. IEEE.

Liessner, R., Dietermann, A., Bäker, B., and Lüpkes, K. (2017). Generation of replacement vehicle speed cycles based on extensive customer data by means of Markov models and threshold accepting. SAE International Journal of Alternative Powertrains, 6(1).

Mirzaei, H. and Givargis, T. (2017). Fine-grained acceleration control for autonomous intersection management using deep reinforcement learning. CoRR, abs/1705.10432.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Onori, S., Serrao, L., and Rizzoni, G. (2016). Hybrid electric vehicles: Energy management strategies. Springer, London.

Patrick Wappler (2016). Applikation von Hybridfahrzeugen auf Basis künstlicher Intelligenz. Master's thesis, Technische Universität Dresden, Lehrstuhl für Fahrzeugmechatronik.

Puterman, M. L. (2010). Markov decision processes: Discrete stochastic dynamic programming. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken.

Shixiang Gu, Timothy P. Lillicrap, Ilya Sutskever, and Sergey Levine (2016). Continuous deep Q-learning with model-based acceleration. CoRR, abs/1603.00748.

Sutton, R. S. and Barto, A. G. (2012). Introduction to reinforcement learning. MIT Press, 2nd edition.

Tate, E. D., Grizzle, J. W., and Peng, H. (2008). Shortest path stochastic control for hybrid electric vehicles. International Journal of Robust and Nonlinear Control, 18(14):1409–1429.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra (2015). Continuous control with deep reinforcement learning. CoRR, abs/1509.02971.

UNECE Transport Division (2005). Vehicle regulations: Regulation No. 101, Revision 2.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller (2013). Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop.

X. Lin, Y. Wang, P. Bogdan, N. Chang, and M. Pedram (2014). Reinforcement learning based power management for hybrid electric vehicles. In 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 33–38.
