Learning Deep Neural Policies with Stability Guarantees

Shahbaz A. Khader1,2, Hang Yin1, Pietro Falco2 and Danica Kragic1

Abstract— Deep reinforcement learning (DRL) has been successfully used to solve various robotic manipulation tasks. However, most of the existing works do not address the issue of control stability. This is in sharp contrast to the control theory community where the well-established norm is to prove stability whenever a control law is synthesized. What makes traditional stability analysis difficult for DRL are the uninterpretable nature of the neural network policies and unknown system dynamics. In this work, unconditional stability is obtained by deriving an interpretable deep policy structure based on the energy shaping control of Lagrangian systems. Then, stability during physical interaction with an unknown environment is established based on passivity. The result is a stability guaranteeing DRL in a model-free framework that is general enough for contact-rich manipulation tasks. With an experiment on a peg-in-hole task, we demonstrate, to the best of our knowledge, the first DRL with stability guarantee on a real robotic manipulator.

I. INTRODUCTION

Deep reinforcement learning (DRL) has emerged as a promising approach for autonomously acquiring manipulation skills, even in the form of complex end-to-end visuomotor policies [1]. In the context of manipulation, where contact with the environment is inevitable, data-driven methods such as DRL that can automatically cope with complex contact dynamics are very appealing. DRL derives its power from expressive deep policies that can represent complex control laws. However, one aspect that is often forgotten is control stability. Although recent advances in stability-related DRL [2], [3], [4], [5] are encouraging, all of them are model-based approaches that either learn dynamics or make strong assumptions about it. Naturally, these methods would struggle in the context of robotic manipulation where highly nonlinear and possibly discontinuous contact dynamics [6] are common.

We focus on motion generation in manipulation where a rigid grasp is assumed to have been established. This is a well-studied problem [7], where stability is analyzed according to the Lyapunov method [8]. Stable RL is challenging because traditional Lyapunov analysis assumes known and interpretable forms of dynamics and control law, both of which are unavailable in the DRL setting. We approach the problem by asking the question: can Lyapunov stability be guaranteed by virtue of the policy structure alone, irrespective of the robot-environment interaction and the policy parameter values? If so, stable RL can be achieved in a

1 Robotics, Perception, and Learning lab, Royal Institute of Technology, Sweden. {shahak, hyin, dani}@kth.se.

2 ASEA Brown Boveri (ABB) Corporate Research, Sweden. [email protected].

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

Fig. 1. Learning neural policies with parameterized Lagrangian energy functions; ensuring stability guarantee via exploring in the parameter space.

model-free framework that avoids learning complex contact dynamics and requires only unconstrained policy search algorithms. In our previous works [9], [10], we addressed this question by leveraging the stability property of a passive manipulator interacting with a passive environment, even when the environment is unknown. However, none of these methods featured policies that were fully parameterized as a deep neural network, thus potentially limiting the expressivity of the learned policies.

In this paper, we follow [9] and [10] and formulate a model-free stable RL problem for learning manipulation skills that may involve physical contact with the environment. The main contribution is the parameterization of a deep neural policy, the structure of which alone guarantees stability without the need for learning dynamics or constrained policy search. The proposed policy, called the energy shaping (ES) policy, is derived from the first principles of Lagrangian mechanics. It consists of a position dependent convex potential function and a velocity dependent damping function; while the former is represented as an Input Convex Neural Network (ICNN) [11], the latter is a specially tailored fully connected network. We show through Lyapunov analysis that stability is guaranteed for any value of network parameters and also while interacting with any unknown but passive environment. Parameter space exploration is adopted, in the form of the Cross Entropy Method (CEM) [12], since the deterministic Lyapunov analysis adopted in this work allows only a deterministic policy. Our experiments, which include a real-world peg insertion task by a 7-DOF robot, validate the proposed policy structure and demonstrate the benefits of stability in RL.

II. RELATED WORK

Stability has been recently considered in several RL-related works. The analysis in [2] relies on smoothness properties of the learned dynamics model and, therefore, would not cope with contact-rich tasks. Another method [3] assumes a model prior and learns only the unactuated part of the dynamics. The work in [5] assumes an existing prior stabilizing controller while [4] uses Control Barrier Functions in a robust control setting. Methods such as [3], [5], [4] require good nominal dynamics models and can only tolerate disturbances to some extent. Contact dynamics can neither be easily modelled, even nominally, nor be treated as a disturbance, due to its magnitude and persistence. Our method depends only on a prior gravity model; it makes no assumptions on the contact dynamics, except that the environment is passive, and does not learn any dynamics.

Recently, [13] proposed a model-free stability guaranteed RL but it is limited to either a reward function that is the norm of the state, or to partial stability. Our method does not impose any limitations on the reward function and is based on the standard notion of Lyapunov stability. Like [9] and [10], our method also considers the stability of manipulators interacting with passive environments. Given the gravity compensation model, these methods can be considered as model-free RL. The method in [9] features an analytic policy with arguably limited expressivity. The policy in [10] has a part analytic structure that is not learned and a part deep network that is learned. In this work, we parameterize the policy almost entirely as a deep neural network and all parts of it are learned. The policy structure in [10] is comparable to the proposed method since it also aims for stability by virtue of the policy structure alone.

Many recent methods have used the Lagrangian formulation [14], [15] and also the closely related Hamiltonian formulation [16], [17], as physics-based priors for learning dynamics models. In contrast, our method focuses on model-free reinforcement learning and is influenced by the Lagrangian physics-based prior via the principle of energy shaping control. Interestingly, [15] uses a manually designed energy shaping control while leaving policy learning to future work. Sprangers et al. [18] employ energy shaping control within an RL framework but, unlike our case, it is based on the port-Hamiltonian formulation. Furthermore, that work is limited to linear-in-parameters basis function models instead of expressive deep models. The general form of the energy shaping control was also used in [19] for learning from demonstration, albeit without any neural network policy or reference to energy shaping of Lagrangian systems.

Parameter space exploration in the form of Evolution Strategies has recently emerged as a viable alternative to action space exploration in RL [20], [21]. A relevant benefit for robotics is smooth trajectories during rollout [22]. The simplest method in this class is CEM, which has been previously used for policy search in [23] and even stability-guaranteed policy search in [9]. In this work, we use CEM for parameter space exploration.

III. PRELIMINARIES

A. Reinforcement Learning

Reinforcement learning is a data-driven approach for solving Markov decision processes (MDP) that are defined by an environment (or dynamics) model p(s_{t+1}|s_t, a_t), an action space a ∈ A, a state space s ∈ S and a reward function r(s_t, a_t). The environment is usually assumed to be unknown. In the episodic setting, the goal is to find the optimal policy π_θ(a_t|s_t) of the agent, that acts on the environment through the action space, that maximizes the expected sum of reward for a desired episode length T:

$$\theta^* = \arg\max_\theta J(\theta); \quad J(\theta) = \mathbb{E}_{s_0, a_0, \ldots, s_T}\left[\sum_{t=0}^{T} r(s_t, a_t)\right], \tag{1}$$

Here {s_0, a_0, ..., s_T} is a sample trajectory (episode) from the distribution induced in the stochastic system.

Our formulation differs from the above since we choose the policy and the dynamics to be deterministic: a_t = π_θ(s_t) and s_{t+1} = f(s_t, a_t), respectively. A trajectory distribution is still induced as a result of maintaining a distribution over policy parameters p(θ). Adapting p(θ) until it reaches the solution is called parameter space exploration [21], [22]. This readily allows exploration for deterministic policies and also results in smooth trajectories [22].

B. Cross Entropy Method

Cross Entropy Method (CEM) [12] is a black-box optimization method. In the context of the RL problem, it uses a sampling distribution q(θ|Φ) to generate samples {θ_n}_{n=1}^{N_s} and evaluates the fitness of each of them according to the reward quantity J(θ_n). The fitness information is then used to compute updates for the sampling distribution parameters Φ. The iterative process can be formalized as:

$$\Phi_{i+1} = \arg\max_\Phi \sum_n I\big(J(\theta_n^i)\big) \log q(\theta_n^i|\Phi), \quad \theta_n^i \sim q(\theta|\Phi_i) \tag{2}$$

where the indicator function I(·) selects only the best N_e samples, or elites, over all N_s samples based on individual fitness J(θ_n). The update of Φ_{i+1} from the samples {θ_n^i}_{n=1}^{N_e} is performed by maximum likelihood estimation (MLE), which is simple and straightforward for a Gaussian q(θ|Φ). The index i directly corresponds to iterations in RL and eventually q(θ|Φ_i) converges to a narrow solution distribution with high fitness, thus solving Eq. (1) approximately.
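A minimal sketch of this update for a diagonal-Gaussian q(θ|Φ) is given below; the parameter dimensionality, elite count and the `evaluate` rollout callable are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def cem_update(mu, sigma, evaluate, n_samples=15, n_elites=3):
    """One CEM iteration for a diagonal-Gaussian q(theta | mu, sigma).

    evaluate: callable mapping a parameter vector theta to its return J(theta),
              e.g. by rolling out the deterministic policy (assumption).
    """
    # Sample candidate parameter vectors from the current distribution.
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)
    returns = np.array([evaluate(theta) for theta in thetas])

    # Keep the elite samples with the highest fitness (indicator I in Eq. (2)).
    elites = thetas[np.argsort(returns)[-n_elites:]]

    # MLE update of the Gaussian parameters from the elites.
    new_mu = elites.mean(axis=0)
    new_sigma = elites.std(axis=0) + 1e-6  # small floor to avoid premature collapse
    return new_mu, new_sigma
```

Iterating this update is the outer RL loop indexed by i in Eq. (2).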

C. Stability in RL

Lyapunov stability is the main tool used to study stability of controlled nonlinear systems such as robotic manipulators [8]. The system is said to be stable at an equilibrium point if the state trajectories that start close to it remain bounded in state space, and asymptotically stable if they also converge to it. Such properties are necessary for any controlled system to be certified safe and predictable. To prove Lyapunov stability, an appropriate Lyapunov function V(s), a positive definite, radially unbounded and scalar function of the state s, has to be shown to exist with the properties: V(s) is in C^1, V(s*) = 0 and V̇(s) < 0 ∀ s ≠ s*, where s* is the equilibrium point. Although stability analysis requires knowledge of the system dynamics, it is still possible to reason about the stability of unknown interacting systems if they can be shown to be passive. Passivity means that a system can only consume energy but not generate energy, or V̇(s) ≤ P_i, where P_i is the power that flows into the system.

A stable RL policy can guarantee all rollouts to be bounded in state space and eventually converge to the goal position demanded by the task. If the policy ensures that the goal position is the only equilibrium point, we can refer to the stability of the system instead of an equilibrium point.

D. Lagrangian Mechanics

Lagrangian mechanics can be used to describe motions of physical systems in any generalized coordinates. A robotic manipulator arm is no exception and its motion can be represented by the Euler-Lagrange equation,

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right) - \frac{\partial L}{\partial x} = u; \quad L(x, \dot{x}) = T(\dot{x}) - V(x) \tag{3}$$

where T and V are the kinetic and potential energies, respectively, the quantity L is the Lagrangian, x is the generalized coordinates and u is the generalized non-conservative force acting on the system. V represents the gravitational potential function with its natural equilibrium point x*, one with the least energy. For a manipulator under the influence of an external force f_ext, Eq. (3) can be expanded to the standard manipulator equation,

$$M(x)\ddot{x} + C(x, \dot{x})\dot{x} + g(x) = u + f_{ext}, \tag{4}$$

where M, C and g are the inertia matrix, the Coriolis matrix and the gravitational force vector, respectively.
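A standard property of Eq. (4) that the stability proof in Sec. IV-B relies on is the skew-symmetry of Ṁ − 2C, which holds for the usual Christoffel-symbol definition of C [7]; it yields the power-balance identity

$$\dot{x}^{T}\big(\dot{M}(x) - 2C(x,\dot{x})\big)\dot{x} = 0
\;\;\Longrightarrow\;\;
\frac{d}{dt}\Big(\tfrac{1}{2}\dot{x}^{T}M(x)\dot{x}\Big) = \dot{x}^{T}\big(u + f_{ext} - g(x)\big),$$

i.e., the rate of change of kinetic energy equals the power of the applied, external and gravity forces.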

E. Energy Shaping Control of Lagrangian Systems

The natural equilibrium point x* may not be our control objective, but instead it may be another desired goal x*_d. Furthermore, we may also be interested in shaping the trajectory while moving towards x*_d. A general control framework that incorporates both these aspects is energy shaping [24] control. The energy shaping control law is of the form u = β(x) + v(ẋ), where β(x) is the potential energy shaping term and v(ẋ) is the damping injection term. Potential energy shaping proposes to cancel out the natural potential function V and replace it with a desired one V_d such that not only is the new equilibrium point placed exactly at x*_d but it also helps render the desired trajectory. The damping injection term's purpose is to bring the system to rest at x*_d by avoiding perpetual oscillations. Energy shaping is as follows:

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right) - \frac{\partial L}{\partial x} = \beta(x) + v(\dot{x}) \tag{5}$$

$$\beta(x) = \frac{\partial}{\partial x}V(x) - \frac{\partial}{\partial x}V_d(x); \quad v(\dot{x}) = -D(\dot{x})\dot{x} \tag{6}$$

where D(ẋ) ≻ 0 is a damping matrix function. Note that v(ẋ) = −D(ẋ)ẋ is a particular choice for achieving ẋᵀv(ẋ) < 0, or dissipation. Let V_d(x) be convex with the unique minimum at x*_d. Then the controller

$$u = \frac{\partial}{\partial x}V(x) - \frac{\partial}{\partial x}V_d(x) - D(\dot{x})\dot{x} \tag{7}$$

can potentially shape the trajectory to any arbitrary x*_d if appropriate V_d(x) and D(ẋ) can be synthesized. The first term plays no role because it is only the gravity compensation. Unfortunately, no general solution exists for such synthesis except for some trivial cases. The benefit of this approach is that, using a Lyapunov function defined as V_d(x) + T(ẋ), closed-loop stability and passivity can be easily achieved for any V_d(x) and D(ẋ). See Sec. IV-B.
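As an illustrative special case (not taken from the paper), choosing a quadratic desired potential and a constant damping matrix reduces Eq. (7) to gravity-compensated PD regulation towards x*_d, with hypothetical gains K_p, K_d ≻ 0:

$$V_d(x) = \tfrac{1}{2}(x - x_d^{*})^{T}K_p(x - x_d^{*}),\quad D(\dot{x}) = K_d
\;\;\Longrightarrow\;\;
u = \frac{\partial}{\partial x}V(x) - K_p(x - x_d^{*}) - K_d\dot{x}.$$

The proposed ES policy replaces such hand-designed choices with learned networks.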

F. Input Convex Neural Network (ICNN)

Fully connected ICNN, or FICNN, a version of ICNN [11], has the following architecture for a layer i = 0, ..., k − 1:

$$z_{i+1} = g_i\big(W_i^{(z)} z_i + W_i^{(x)} x + b_i\big) \tag{8}$$

where z_i represents layer activations and z_0, W_0^{(z)} ≡ 0. FICNN is designed to be a scalar-valued function that is convex in its input, hence the name ICNN. Convexity is guaranteed if W_i^{(z)} are non-negative and the non-linear activation functions g_i are convex and non-decreasing. The remaining parameters W_i^{(x)} and b_i have no constraints. The network input can be any general quantity although it will only be the position coordinate in our formulation.

IV. METHODOLOGY

Our goal is stability-guaranteed DRL for manipulation tasks that may involve contact with the environment. Furthermore, we would like to obtain stability by virtue of the policy structure alone so that an unconstrained policy search within a model-free framework is possible.

A. Structuring Policies from Physics Prior

It was shown in Sec. III-E that the control form in Eq. (7), comprising a convex potential function V_d(x) with a unique minimum at x*_d and also a damping term D(ẋ) ≻ 0, can generate a trajectory to any arbitrary x*_d. We shall consider x*_d = 0 without loss of generality. As a first step, we drop the term ∂V(x)/∂x for all subsequent analysis since it is the gravity compensation control that is assumed to be readily available. We define the deterministic policy as:

$$
\begin{aligned}
u = \pi(x, \dot{x};\, \eta, \psi, \phi)
&= -\frac{\partial}{\partial x}\Big(\tfrac{1}{2}x^{T}S(\eta)x + \Psi(x;\psi)\Big) - D(\dot{x};\phi)\dot{x} \\
&= -S(\eta)x - \frac{\partial}{\partial x}\Psi(x;\psi) - D(\dot{x};\phi)\dot{x}
\end{aligned} \tag{9}
$$

where S(η) ≻ 0 contributes to a strongly convex potential function xᵀS(η)x, Ψ(x;ψ) is a more flexible convex potential function and D(ẋ;φ) is the damping function. Following the usual practice in energy shaping control, D(ẋ;φ) does not depend on x while Ψ(x;ψ) depends only on x. We shall drop the parameter notations η, ψ and φ whenever necessary in order to simplify notation. The deterministic policy structure shall be referred to as the energy shaping (ES) policy and is also illustrated in Fig. 2.

Fig. 2. Network architecture for the energy shaping policy: a quadratic potential, an ICNN potential network with sReLU activations whose gradient is obtained by automatic differentiation (AutoDiff), and a damping network of fully connected modules with tanh hidden non-linearities and ReLU outputs.

The convex function Ψ(x) need not have a unique minimum, which then corresponds to non-asymptotic stability. If the task requires such a policy then the agent can learn a practically negligible S and an appropriate Ψ(x) without a unique minimum. Alternatively, by learning a significant S in addition to Ψ(x), the agent can achieve a strongly convex potential function with a unique minimum, one that corresponds to asymptotic stability. It is enough to show that Ψ(x) is convex with minimum at 0 and S(η) ≻ 0 for any values of their respective parameters.

S(η) is defined as a diagonal matrix with the parameter η as the diagonal elements. η is clamped at lower and higher ends with positive numbers to ensure positive definiteness and to limit the contribution of the quadratic function.

We propose to implement Ψ(x) using the FICNN version of ICNN [11]. ICNN has been used previously to model a Lyapunov function in [25]. In our case, the ICNN potential function will only be a part of a Lyapunov function. The non-negativity of W_i^{(z)} in Eq. (8) is implemented by applying the ReLU function on the parameter itself. The convexity of g_i in Eq. (8) is satisfied by adopting the smooth variant of ReLU (sReLU), introduced in [25], which is convex, non-negative and differentiable. Differentiability of the activation function is required to ensure that Ψ(x) is continuously differentiable, or class C^1. Both aspects are shown in Fig. 2. Unlike the intended use of FICNN, our requirement is such that Ψ(x) should have a minimum at 0. For any random initialization of FICNN this is not necessarily true. We could find the minimum through any gradient based optimization and add offsets to both x and Ψ to give Ψ(0) = 0. However, we adopt a less expensive method, one that avoids any numerical optimization, by simply setting b_i = 0 in Eq. (8). With z_0, W_0^{(z)} ≡ 0 and x = 0:

$$z_{i+1} = g_i\big(W_i^{(z)} z_i + W_i^{(x)} x\big) = 0 \quad \text{for } i = 0, \ldots, k-1 \tag{10}$$

Together with the fact that Ψ(x) ≥ 0, due to the sReLU activation function, it means that x = 0 is a minimum with Ψ(0) = 0. By considering the combined potential function in Eq. (9), we also have (xᵀSx + Ψ(x)) > 0 ∀ x ≠ 0. Finally, as shown in Fig. 2, the gradient −∂Ψ(x)/∂x is obtained readily by automatic differentiation. This is an instance of inference-time automatic differentiation as opposed to the usual case of training-time automatic differentiation.
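A minimal PyTorch sketch of such a convex potential is given below. The class name, layer sizes, the piecewise-quadratic smooth ReLU and the zero-bias construction are illustrative assumptions about an implementation, not code from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexPotential(nn.Module):
    """FICNN-style convex potential Psi(x) with Psi(0) = 0 (sketch)."""

    def __init__(self, dim=3, hidden=16, layers=2):
        super().__init__()
        # W^(x) maps the raw input into every layer; no sign constraint, no bias.
        self.Wx = nn.ModuleList(
            [nn.Linear(dim, hidden, bias=False) for _ in range(layers)]
            + [nn.Linear(dim, 1, bias=False)])
        # W^(z) propagates activations; constrained non-negative via ReLU on the weights.
        self.Wz = nn.ParameterList(
            [nn.Parameter(0.1 * torch.rand(hidden, hidden)) for _ in range(layers - 1)]
            + [nn.Parameter(0.1 * torch.rand(1, hidden))])

    @staticmethod
    def srelu(z, d=0.1):
        # Smoothed ReLU as in [25]: zero for z <= 0, quadratic on [0, d],
        # linear beyond d. Convex, non-decreasing, non-negative and C^1.
        return torch.where(z <= 0, torch.zeros_like(z),
                           torch.where(z < d, z ** 2 / (2 * d), z - d / 2))

    def forward(self, x):
        z = self.srelu(self.Wx[0](x))                        # z_1, since W_0^(z) = 0
        for Wx, Wz in zip(self.Wx[1:], self.Wz):
            z = self.srelu(Wx(x) + F.linear(z, F.relu(Wz)))  # non-negative W^(z)
        return z.squeeze(-1)                                 # scalar Psi(x) >= 0

def potential_gradient(psi, x):
    """dPsi/dx at x via (inference-time) automatic differentiation."""
    x = x.detach().clone().requires_grad_(True)
    return torch.autograd.grad(psi(x).sum(), x)[0]
```

Because all biases are absent and the activation is non-negative, Ψ(0) = 0 is a minimum by construction, matching Eq. (10).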

The damping function in Eq. (9) is structured as shown in Fig. 2. Two independent fully connected networks (FCNN) with tanh hidden layer non-linearity output the elements of the damping matrix. The first and second networks output the diagonal and the off-diagonal elements, respectively. The diagonal elements are first made non-negative through a ReLU activation and then positive by adding a small positive number. The output layer of the second FCNN has no non-linearity. Together, the two outputs combine to form a lower triangular matrix with positive diagonal elements. The product of the lower triangular matrix with its own transpose produces the required positive definite damping matrix D(ẋ). A similar strategy was used previously for a symmetric positive definite inertia matrix in [14]. The two networks may also be combined into one if that is preferable.
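A matching sketch of the damping network and the assembled action of Eq. (9) follows; it reuses ConvexPotential and potential_gradient from the sketch above, and the names and layer sizes are again assumptions (the clamping range for the diagonal S follows the reported η_min and η_max).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DampingNetwork(nn.Module):
    """Positive definite damping matrix D(xdot) = L(xdot) L(xdot)^T (sketch)."""

    def __init__(self, dim=3, hidden=8):
        super().__init__()
        n_off = dim * (dim - 1) // 2
        # Two fully connected modules with tanh hidden non-linearity.
        self.diag_net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, dim))
        self.offdiag_net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                         nn.Linear(hidden, n_off))
        self.dim = dim

    def forward(self, xdot):
        # Diagonal: ReLU then a small positive offset -> strictly positive.
        diag = F.relu(self.diag_net(xdot)) + 1e-3
        off = self.offdiag_net(xdot)             # unconstrained off-diagonal terms
        L = torch.diag_embed(diag)
        rows, cols = torch.tril_indices(self.dim, self.dim, offset=-1)
        L[..., rows, cols] = off                  # fill strictly lower triangle
        return L @ L.transpose(-1, -2)            # symmetric positive definite D

def es_policy_action(x, xdot, S_diag, psi, damping):
    """u = -S x - dPsi/dx - D(xdot) xdot, following Eq. (9) (sketch).

    No backpropagation through the policy is needed at rollout time, since
    CEM explores directly in parameter space.
    """
    S = torch.diag(S_diag.clamp(1e-3, 5.0))       # clamped diagonal quadratic term
    grad_psi = potential_gradient(psi, x)
    D = damping(xdot)
    return -(x @ S) - grad_psi - (D @ xdot.unsqueeze(-1)).squeeze(-1)
```

Since D = L Lᵀ with a strictly positive diagonal for L, D(ẋ) is positive definite for any parameter values, which is what Theorem 1 in Sec. IV-B requires.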

B. Stability of the Energy Shaping Policy

In the following theorem, we show that the proposed policy parameterization is automatically stable and passive, independent of the network parameter values.

Theorem 1: Let S be symmetric positive definite, Ψ(x) be convex in x, a generalized coordinate, with minimum at 0 and Ψ(0) = 0, and the function D(ẋ) be also positive definite. Then a gravity compensated manipulator under the control of the deterministic policy u = −Sx − ∂Ψ(x)/∂x − D(ẋ)ẋ is: (i) globally asymptotically stable at x = 0 if f_ext = 0 and (ii) passive with respect to f_ext if f_ext ≠ 0.

Proof: Consider the Lyapunov function

$$V(x, \dot{x}) = \tfrac{1}{2}x^{T}Sx + \Psi(x) + \tfrac{1}{2}\dot{x}^{T}M(x)\dot{x}$$

where M(x) is the inertia matrix. We already have V(x, ẋ) in class C^1, V(0, 0) = 0, V(x, ẋ) > 0 ∀ x ≠ 0, ẋ ≠ 0, and ||x||, ||ẋ|| → ∞ =⇒ V(x, ẋ) → ∞. The first three properties are achieved by design as explained earlier. The last property follows from the fact that for large values of x and ẋ, V(x, ẋ) behaves like a quadratic function because Ψ(x) only grows linearly thanks to the sReLU activation.

Taking the time derivative,

$$\dot{V} = \dot{x}^{T}Sx + \dot{x}^{T}\frac{\partial}{\partial x}\Psi(x) + \dot{x}^{T}M(x)\ddot{x} + \tfrac{1}{2}\dot{x}^{T}\dot{M}(x)\dot{x}$$

Substituting M(x)ẍ from Eq. (4), substituting u from Eq. (9), utilizing the skew-symmetric property of Ṁ − 2C and utilizing the positive definiteness of D(ẋ) leads to the last required property for global asymptotic stability:

$$\dot{V} = -\dot{x}^{T}D(\dot{x})\dot{x} < 0 \quad (f_{ext} = 0)$$

When considering f_ext ≠ 0,

$$\dot{V} = \dot{x}^{T}f_{ext} - \dot{x}^{T}D(\dot{x})\dot{x} < \dot{x}^{T}f_{ext}$$

which fulfills the condition for passivity that requires V̇ to be upper bounded by the flow of energy from external sources.
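For completeness, the substitution step can be expanded as follows; this is a sketch assuming gravity is exactly compensated so that g(x) cancels from Eq. (4):

$$
\begin{aligned}
\dot{V} &= \dot{x}^{T}Sx + \dot{x}^{T}\frac{\partial \Psi}{\partial x}
  + \dot{x}^{T}\big(u + f_{ext} - C(x,\dot{x})\dot{x}\big)
  + \tfrac{1}{2}\dot{x}^{T}\dot{M}(x)\dot{x} \\
&= \dot{x}^{T}Sx + \dot{x}^{T}\frac{\partial \Psi}{\partial x}
  + \dot{x}^{T}\Big({-S}x - \frac{\partial \Psi}{\partial x} - D(\dot{x})\dot{x}\Big)
  + \dot{x}^{T}f_{ext}
  + \tfrac{1}{2}\dot{x}^{T}\big(\dot{M} - 2C\big)\dot{x} \\
&= -\dot{x}^{T}D(\dot{x})\dot{x} + \dot{x}^{T}f_{ext},
\end{aligned}
$$

where the last quadratic term vanishes by the skew-symmetry of Ṁ − 2C noted in Sec. III-D.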

Remark: A Cartesian space control version of the ES policy can be implemented based on the Jacobian transpose method presented in Sections 8.6.2 and 9.2.2 of [7]. In the case of redundant manipulators, where the Cartesian space is not a generalized coordinate system, stability can be preserved by adopting a slightly reformulated Lyapunov function that considers a joint space kinetic energy term and joint space viscous friction. More details can be found in the above references.

C. Stability-Guaranteed RL with Energy Shaping Policy

In Sec. IV-B, we showed that our policy parameterization leads to unconditional global asymptotic stability for free motion and passivity when acted upon by an external force. When the external force is a result of physical interaction with passive objects in the environment, the overall system is also stable [26], [27]. Passive objects are those that cannot generate energy and are the most common cases (see Sec. VI). Note that asymptotic stability may be lost due to motion constrained by obstacles but stability is still guaranteed. We have achieved the goal regarding policy structuring since our policy guarantees stability for potentially contact-rich tasks without any conditions or constraints for policy initialization or search. The policy search is also model-free since it does not require any learned dynamics model.

With the deep policy in place, we only require a suitable policy search algorithm. The ES policy is a deterministic policy and injecting noise into its action space would invalidate the stability proof in Theorem 1. Fortunately, it is unconditionally stable for any network parameter value and is, therefore, suitable for unconstrained parameter space exploration strategies. Since CEM (see Sec. III-B) is one such strategy, we formulate a CEM based RL with the state and action defined as s = [xᵀ ẋᵀ]ᵀ and a = u, respectively, the parameter set θ = {η, ψ, φ}, and a Gaussian with diagonal covariance for the sampling distribution q(θ|Φ).
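As a final sketch, the parameters θ = {η, ψ, φ} held by the networks above could be flattened into the single vector sampled by the CEM update of Sec. III-B; the rollout callable and all names are placeholders, not the paper's implementation.

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def make_evaluate(policy_modules, env_rollout):
    """Build the fitness function J(theta) used by cem_update (sketch).

    policy_modules: iterable of the networks holding eta, psi and phi.
    env_rollout: placeholder callable returning the episode return for the
                 current parameter values (assumption, not from the paper).
    """
    params = [p for m in policy_modules for p in m.parameters()]

    def evaluate(theta_vec):
        # Load the sampled parameter vector into the networks, then roll out
        # the deterministic ES policy for one episode.
        with torch.no_grad():
            vector_to_parameters(
                torch.as_tensor(theta_vec, dtype=torch.float32), params)
        return env_rollout()

    return evaluate
```

In use, the initial CEM mean could be taken as parameters_to_vector(params).detach().numpy(), after which cem_update from Sec. III-B is iterated with this evaluate function.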

V. EXPERIMENTAL RESULTS

For validating our method, we focus on contact-rich manipulation tasks. Contact-rich manipulation is relevant for robotics because it may be thought of as representing the superset of robot motion problems. This is because it covers both the path generation and the force interaction aspects of manipulation. Contact-rich manipulation is usually represented by various insertion tasks. A simplified representative problem is the so-called peg-in-hole problem [28], [29].

Adopting the experimental setups from [9], we start with a simulated 2D block-insertion task without a manipulator (Fig. 3a), scale up to a 7-DOF manipulator peg-in-hole (Fig. 3b,d) and then move to a physical version of peg-in-hole (Fig. 3c,e). We use the MuJoCo physics simulator and a physical YuMi robot for the experiments. The proposed method will be referred to as energy shaping with CEM (ES-CEM). For baseline comparison, we use a standard deep policy with Proximal Policy Optimization [30] (ANN-PPO) and the normalizing flow control policy [10] with CEM (NF-CEM). We use the PyTorch version of the RL software framework garage [31]. All simulation experiments are done using five random seeds and the resulting variances are incorporated into the error bars in the plots.

Fig. 3. Experimental setup. (a) 2D Block-insertion: a cubical block of side 50 mm is to be inserted into a slot with a clearance of 2 mm. The initial position and an illustrative path are also shown. (b & d) Simulated peg-in-hole: a cylindrical peg of diameter 23 mm and a square hole of width and depth of 25 mm and 40 mm, respectively. (c & e) Physical peg-in-hole: a cylindrical peg of diameter 27 mm and a cylindrical hole of width and depth of 28 mm and 40 mm, respectively. Insertion depths of 25 mm and 35 mm are considered successful for the 2D block-insertion and the robotic insertion tasks, respectively.

NF-CEM is the main baseline since it also has stability properties. The main idea here is to learn a complex coordinate transformation function, through RL, such that a fixed spring-damper regulator in the transformed coordinate space can solve the task. Despite the presence of the deep network, control stability is preserved. The spring-damper is idealized to be 'normal', defined as identity matrices for the stiffness (K) and damping (D) gains.

Values of relevant hyperparameters are: number of rollouts or CEM samples per iteration N_s = 15; number of elite samples in CEM N_e = 3; episode length T = 200; initial value of η is 0.1; clipping values for η are η_min = 1e−3 and η_max = 5; initial standard deviations for q(θ|Φ), or Φ^0_σ, are 0.2 and 0.15 for ES-CEM and NF-CEM, respectively; initial action standard deviation for PPO is 2.0; number of flow elements in the NF policy n_flow = 2; and dimension of flow element layers in the NF policy n_h_flow = 16. The initial mean for q(θ|Φ), or Φ^0_µ, is obtained by randomly initializing all network parameters with the PyTorch functions xavier_uniform and zeros for weights and biases, respectively. We used the default settings in garage for PPO. The reward function presented in [32] is used for all our experiments.

Fig. 4. Potential Function Parameterizations for 2D Block-insertion: RL experiments with potential functions of ICNN, quadratic function and ICNN plus quadratic function. (a) Learning curve. (b) Success rate of insertions. Contour plots of potential functions with the corresponding trajectory overlaid: (c) ICNN, (d) quadratic function and (e) ICNN plus quadratic function. Blue corresponds to a parameter sample from iteration 0 (randomly initialized parameter distribution) and green corresponds to a parameter sample from the last iteration.

A. 2D Block-Insertion

A block of mass that is permitted to only slide on a surface but not rotate is to be inserted into a slot with a small clearance. The block is controlled by applying a pair of orthogonal forces at its center of mass. The initial position is shown in Fig. 3a along with an illustrative path. We used hidden layer sizes [16,16] for ICNN, [8,8] for each of the FCNN damping modules and [16,16] for ANN-PPO.

Since we are primarily interested in the modeling capacity of ICNN, we conduct an ablation study with only ICNN, only the quadratic function and the combination of both. The damping network part remains the same. In Fig. 4a-b we see that ICNN by itself is capable of solving the task while the pure quadratic case is not. The combination model is necessary for our theoretical stability proof to hold. From Fig. 4c-e we can see the potential function contour plots for sample cases from the first and last iterations. The quadratic potential is clearly incapable of moving the block in curved paths. Even for the other two cases, the initial iteration samples are always unsuccessful.

The NF policy turned out to be dependent on the latent space spring-damper system (see Fig. 5a-b). It failed to solve the task for the default value of the stiffness parameter K = I. However, NF-CEM succeeds for K = 2I. We conclude that the NF policy is competitive to the ES policy but incurs an additional burden of extra task sensitive hyperparameters. We also compare with ANN-PPO as a standard baseline. In Fig. 5a-b we see that ANN-PPO has the worst performance. This is consistent with results in [10] for a simpler version of the task. Following [10] and [9], we attribute the better performance of ES-CEM and NF-CEM to the ability of stable policies to form attractors to the goal. Finally, we showcase the strength of our approach with respect to standard action space exploration. Fig. 5c-d show that even when randomly initialized, the trajectories induced by the ES policy tend towards the goal. The trajectories are also smooth owing to parameter space exploration.

Fig. 5. Reinforcement Learning, 2D Block-insertion: RL results of ES-CEM along with that of the baselines. (a) Learning curve. (b) Success rate of insertions. (c) Smooth trajectory rollouts of ES-CEM that tend towards the goal even in the first iteration. (d) Noisy and divergent trajectory rollouts of the standard ANN policy (action space exploration) in the first iteration.

Fig. 6. Reinforcement Learning, peg-in-hole: RL results of ES-CEM along with that of the baselines. (a) Learning curve. (b) Success rate of insertions.


B. Simulated Peg-In-Hole

A simulated 7-DOF manipulator arm is expected to learn to insert a rigidly held peg into a hole with low clearance. To simplify, only the translation part of the Cartesian space is considered for RL while the rotation part is controlled with a stable impedance controller (PD regulator) that regulates towards the goal orientation. Our method can be extended to include rotation by adopting the same formulation as [9]. The initial position is a fixed but randomly generated configuration as shown in Fig. 3d. We use hidden layer sizes of [24,24] for ICNN, [12,12] for each of the FCNN modules in the damper network and [32,32] for the standard ANN policy in the ANN-PPO baseline.

Fig. 6 also leads to the same observation that the NF policy is competitive to the ES policy only when its latent space spring-damper system is configured appropriately (K = 4I). Such tuning was not necessary with the NF-PPO setup in [10], perhaps due to larger state space coverage. ANN-PPO again performed worse than the rest due to the same reason mentioned earlier. We notice that its performance is considerably better for the peg-in-hole task than the simpler block-insertion task. We attribute this to the compliant behaviour in the rotation space compared to the perfectly rigid interaction in the block-insertion task.

Fig. 7. Samples of initial position: drawn with the mean of the training time position and standard deviation σ_q0 = 0.05. The leftmost is the training position.

Fig. 8. Reinforcement Learning, physical peg-in-hole: RL results of ES-CEM. (a) Learning curve. (b) Success rate of insertions.

C. Physical Peg-In-Hole

In this experiment, we realize a physical version of the peg-in-hole task and evaluate only the ES-CEM case. The initial position is as shown in Fig. 3e. We use identical settings and hyperparameters as in the simulated case except Φ^0_σ = 0.3, T = 500, and an insertion clearance of 1 mm. The insertion clearance was lowered from 2 mm to 1 mm to increase the difficulty of the problem. Similar to the simulation case, the robot exhibited stable behaviour with trajectories tending towards the goal right from the beginning.

In Table I, we report the success rate for different initial state distributions after training. We conducted a set of 15 trials for each of four initial position (q_0) distributions, whose means were the fixed position used for training and element-wise standard deviations (σ_q0) as shown in the table. The distributions are defined for the joint position coordinates. Figure 7 shows that the distributions did induce enough spread in the Cartesian space translation component to challenge low clearance insertion. The observed generalization to unseen initial positions can be attributed to the policy's global stability.

VI. DISCUSSIONS AND CONCLUSIONS

In this paper, we investigated the possibility of achieving stability-guaranteed DRL for robotic manipulation tasks. We formulated a strategy of 1) structuring a deep policy with an unconditional stability guarantee, and 2) adopting a model-free policy search that preserves stability. The first was achieved by deriving a deep policy structure from energy shaping control of Lagrangian mechanics. The second was realized by adopting the well-known parameter space search algorithm CEM. Our method is model-free as long as the manipulator is gravity compensated. Neither the stability proof nor the policy search requires the prior knowledge or learning of any model. Moreover, stability is preserved even when interacting with unknown passive environments. The method was validated on simulated and physical contact-rich tasks. Our result is notable because no prior work has demonstrated, to the best of our knowledge, a stability-guaranteed DRL on a robotic manipulator.

TABLE I: GENERALIZATION TO INITIAL POSITION

σ_q0 (rad):  0.02 | 0.04 | 0.05 | 0.1
Success %:   100  | 80   | 60   | 33.3

The main benefit of our work is that it proposes one way of moving away from analytic shallow policies to expressive deep policies while guaranteeing stability. In addition to improving predictability and safety, stability also has the potential to reduce sample complexity in RL [10]. This is due to the fact that the goal position information is directly incorporated into the policy. In fact, due to the inherent stability of the policy, it is possible to deploy state-of-the-art policy gradient algorithms such as PPO instead of CEM. Although such an approach will invalidate the formal stability proof, the user will still enjoy the practical benefit of stability as was demonstrated on the NF policy in [10]. Else, more sophisticated parameter space exploration strategies such as CMA-ES [33] and NES [34], [20] are also an option to scale up without invalidating the stability proof. Another possibility to exploit the stability property would be in a sparse reward setting since convergence behavior towards the goal can potentially minimize reward shaping.

The stability property of the proposed ES policy is shared by the previously proposed NF policy structure [10], although the latter is not completely a deep policy. Our results indicate that the practical learning outcome of the NF policy is dependent on task specific initialization of its unlearned spring-damper system. The proposed ES policy has no such hyperparameters. The NF policy will come in handy when the spring-damper values can positively bias the policy and thus speed up the learning process, whereas the ES policy has no such mechanism to incorporate a bias.

An important consideration is the model capacity of the ES policy. While an ANN policy has the potential to approximate any nonlinear function, the same cannot be said about the ES policy. Nevertheless, it may be noted that the ES policy can potentially represent any policy within the energy shaping control form for a given fully actuated Lagrangian system. This is because the ICNN can represent any convex function [11] and the damping network, parameterized by fully-connected network modules, could represent all nonlinear damping functions. As an alternative to scaling up the ICNN network, combining multiple convex functions through any of the convexity preserving operations can also be used to construct complex convex functions. However, activating a number of convex functions in sequence is not supported since that would require more sophisticated stability analysis techniques. The energy shaping control structure may very well be more limiting than an arbitrary nonlinear control law. A theoretical analysis of that nature is outside the scope of this work. Our empirical results show that the ES policy is expressive enough to learn complex contact-rich tasks. For example, the block insertion result shows that the ES policy can represent curved paths and also the necessary interaction behavior that regulates the exchange of forces and exploits contacts.

Other than the possible limitations in model capacity, a few other limitations are also worth mentioning: i) If setting the bias terms in ICNN to zero is to be avoided, as a measure to enhance representational capacity, one would need to introduce a convex optimization step to discover the minimum of the ICNN (see Sec. IV-A). This is just an additional step and will not jeopardize the RL process. ii) The ES policy is only valid for fully actuated Lagrangian systems. This is not a serious limitation because all serial link manipulators are fully actuated. iii) The ES policy is valid only for episodic tasks with a unique goal position, thus excluding cyclic or rhythmic tasks. iv) The stability analysis can be extended to the rotational space only for the Euler angle representation. A straightforward extension can be done according to [9]. Lastly, v) stability is preserved only for passive environments. This is also not a serious limitation because most objects in the environment are unactuated and hence passive. Notable exceptions are other robots, actuated mechanisms and human users.

The importance of a stability guarantee in DRL cannot be overstated. It is one of the main means to achieve safety in DRL. Although many recent works have made significant contributions in this field, very rarely do they consider a robotic manipulator interacting with the environment. As a result, most of these methods are either not suitable or do not scale well to the manipulation case. Our work can be seen as an attempt to bridge this gap.

REFERENCES

[1] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[2] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, "Safe model-based reinforcement learning with stability guarantees," in Advances in Neural Information Processing Systems, 2017, pp. 908–918.

[3] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, "End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks," in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, 2019, pp. 3387–3395.

[4] J. Choi, F. Castaneda, C. Tomlin, and K. Sreenath, "Reinforcement learning for safety-critical control under model uncertainty, using control Lyapunov functions and control barrier functions," in Proceedings of Robotics: Science and Systems, 2020.

[5] R. Cheng, A. Verma, G. Orosz, S. Chaudhuri, Y. Yue, and J. Burdick, "Control regularization for reduced variance reinforcement learning," in Proceedings of the 36th International Conference on Machine Learning, ICML, 2019, pp. 1141–1150.

[6] S. A. Khader, H. Yin, P. Falco, and D. Kragic, "Data-efficient model learning and prediction for contact-rich manipulation tasks," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4321–4328, 2020.

[7] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo, Robotics: Modelling, Planning and Control. Springer Science & Business Media, 2010.

[8] J.-J. E. Slotine, W. Li et al., Applied Nonlinear Control. Prentice Hall, Englewood Cliffs, NJ, 1991.

[9] S. A. Khader, H. Yin, P. Falco, and D. Kragic, "Stability-guaranteed reinforcement learning for contact-rich manipulation," IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. 1–8, 2020.

[10] S. A. Khader, H. Yin, P. Falco, and D. Kragic, "Learning stable normalizing-flow control for robotic manipulation," arXiv preprint arXiv:2011.00072, 2020.

[11] B. Amos, L. Xu, and J. Z. Kolter, "Input convex neural networks," in International Conference on Machine Learning, 2017, pp. 146–155.

[12] R. Y. Rubinstein and D. P. Kroese, The Cross Entropy Method: A Unified Approach To Combinatorial Optimization, Monte-Carlo Simulation (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2004.

[13] M. Han, L. Zhang, J. Wang, and W. Pan, "Actor-critic reinforcement learning for control with stability guarantee," IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6217–6224, 2020.

[14] M. Lutter, C. Ritter, and J. Peters, "Deep Lagrangian networks: Using physics as model prior for deep learning," in International Conference on Learning Representations (ICLR), 2019.

[15] Y. D. Zhong and N. Leonard, "Unsupervised learning of Lagrangian dynamics from images for prediction and control," Advances in Neural Information Processing Systems, vol. 33, 2020.

[16] S. Greydanus, M. Dzamba, and J. Yosinski, "Hamiltonian neural networks," Advances in Neural Information Processing Systems, vol. 32, pp. 15379–15389, 2019.

[17] P. Toth, D. J. Rezende, A. Jaegle, S. Racaniere, A. Botev, and I. Higgins, "Hamiltonian generative networks," in International Conference on Learning Representations (ICLR), 2020.

[18] O. Sprangers, R. Babuska, S. P. Nageshrao, and G. A. Lopes, "Reinforcement learning for port-Hamiltonian systems," IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1017–1027, 2014.

[19] S. M. Khansari-Zadeh and O. Khatib, "Learning potential functions from human demonstrations with encapsulated dynamic and compliant behaviors," Autonomous Robots, vol. 41, no. 1, pp. 45–69, 2017.

[20] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.

[21] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, "Parameter space noise for exploration," in International Conference on Learning Representations (ICLR), 2018.

[22] T. Ruckstieß, M. Felder, and J. Schmidhuber, "State-dependent exploration for policy gradient methods," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 234–249.

[23] S. Mannor, R. Y. Rubinstein, and Y. Gat, "The cross entropy method for fast policy search," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 512–519.

[24] R. Ortega, A. J. Van Der Schaft, I. Mareels, and B. Maschke, "Putting energy back in control," IEEE Control Systems Magazine, vol. 21, no. 2, pp. 18–33, 2001.

[25] J. Z. Kolter and G. Manek, "Learning stable deep dynamics models," in Advances in Neural Information Processing Systems, 2019, pp. 11128–11136.

[26] N. Hogan and S. P. Buerger, "Impedance and interaction control," in Robotics and Automation Handbook. CRC Press, 2018, pp. 375–398.

[27] J. E. Colgate and N. Hogan, "Robust control of dynamically interacting systems," International Journal of Control, vol. 48, no. 1, pp. 65–88, 1988.

[28] S.-k. Yun, "Compliant manipulation for peg-in-hole: Is passive compliance a key to learn contact motion?" in 2008 IEEE International Conference on Robotics and Automation, 2008, pp. 1647–1652.

[29] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, "Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8943–8950.

[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017.

[31] The garage contributors, "Garage: A toolkit for reproducible reinforcement learning research," https://github.com/rlworkgroup/garage, 2019.

[32] S. Levine, N. Wagener, and P. Abbeel, "Learning contact-rich manipulation skills with guided policy search," in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 156–163.

[33] N. Hansen and A. Ostermeier, "Completely derandomized self-adaptation in evolution strategies," Evolutionary Computation, vol. 9, no. 2, pp. 159–195, 2001.

[34] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, "Natural evolution strategies," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014.