Optimal Adaptive Control of Nonlinear Discrete-time Systems

S. Jagannathan
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO, USA (sarangap@mst.edu)


Page 1: Optimal Adaptive Control of Nonlinear Discrete-time Systems (ece.mst.edu/media/academic/ece/documents/c/1/CDC_workshop_2011...)

Optimal Adaptive Control of Nonlinear Discrete-time Systems

S. Jagannathan
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO, USA
(sarangap@mst.edu)

Page 2:

Outline

Introduction

Background on Optimal Adaptive Control

Optimal Adaptive Control of Unknown Dynamic Systems

Challenges of Value and/or Policy Iteration-based Schemes

Online Optimal Adaptive Control w/o Policy and/or Value Iterations

Online Optimal Tracking

Hardware Implementation

Extensions to Continuous-time

Conclusions

Page 3:

Introduction

Reinforcement learning (RL) / adaptive dynamic programming (ADP) techniques with online approximators such as neural networks (NNs) have been widely employed in the literature in the context of optimal adaptive control (Si, Barto, Powell & Wunsch, 2004):

- Policy and/or value iterations are utilized
- NN reconstruction errors are ignored
- Convergence of the NN implementations is not provided
- Full or partial knowledge of the system dynamics is needed, and in some cases a model of the system is needed

The Bellman equation forms the basis for a family of RL/ADP schemes for finding optimal adaptive policies forward in time:

- RL/ADP-based schemes do not require the system dynamics
- Convergence and/or closed-loop stability can be demonstrated by using the temporal difference (TD) error method
- RL/ADP-based online methods are preferred over offline-based schemes

Page 4:

Background

F.L. Lewis, D. Vrabie, and V. Syrmos, “Optimal Control” Third edition, To appear, John Wiley Press.

Page 5:

Optimal Control of Discrete-time Systems

Consider the discrete-time (DT) nonlinear affine system given by

x_{k+1} = f(x_k) + g(x_k) u_k,   (1)

where x_k ∈ R^n is the system state and u_k ∈ R^m is the control input.

Value function:

V(x_k) = Σ_{i=k}^∞ γ^{i-k} r(x_i, u_i) = Σ_{i=k}^∞ γ^{i-k} (Q(x_i) + u_i^T R u_i),   (2)

where 0 < γ ≤ 1 is a discount factor, u_k = h(x_k) is a feedback control input, Q(x) > 0, R > 0, and the stage cost is r(x_k, u_k) = Q(x_k) + u_k^T R u_k.

Note: Discounting makes initial time steps more important, and traditional learning techniques usually learn during the first few time steps; hence discounting is not preferred for learning schemes. The value function is the positive definite solution to the Bellman equation that satisfies V(0) = 0.
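As a quick illustration of (2), the value of a fixed policy can be evaluated numerically by rolling the system forward and accumulating discounted stage costs. The scalar dynamics, policy, and gains below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Evaluate V(x_0) of (2) for a fixed policy u = h(x) by forward simulation.
# All system and cost definitions here are illustrative stand-ins.
f = lambda x: 0.8 * x          # drift f(x)
g = lambda x: 1.0              # input gain g(x)
h = lambda x: -0.3 * x         # a fixed feedback policy u = h(x)
Q = lambda x: x**2             # state cost Q(x)
R, gamma = 1.0, 0.95           # input weight and discount factor

def value(x, steps=500):
    V = 0.0
    for i in range(steps):
        u = h(x)
        V += gamma**i * (Q(x) + u * R * u)   # discounted stage cost r(x_i, u_i)
        x = f(x) + g(x) * u                  # step the system (1)
    return V

print(value(1.0))
```

Since the closed loop here is x_{k+1} = 0.5 x_k, the series can be summed in closed form and used as a sanity check.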

Page 6:

Optimal Control for DT System

Discrete-time Hamilton-Jacobi-Bellman (HJB) equation:

V*(x_k) = min_{h(·)} [ r(x_k, h(x_k)) + γ V*(x_{k+1}) ].   (3)

Optimal control policy:

h*(x_k) = arg min_{h(·)} [ r(x_k, h(x_k)) + γ V*(x_{k+1}) ]
        = −(1/2) R^{-1} g^T(x_k) ∂V*(x_{k+1})/∂x_{k+1}.   (4)

Note: To obtain the optimal control h*(x_k), the DT HJB equation (3) needs to be solved. However, due to the nonlinear nature of the DT HJB equation, finding its solution is difficult or impossible.

Page 7:

Hamiltonian

Discrete-time Hamiltonian:

H(x_k, h(x_k), ΔV_h) = r(x_k, h(x_k)) + V_h(x_{k+1}) − V_h(x_k),   (5)

where ΔV_h(k) = V_h(x_{k+1}) − V_h(x_k) is the forward difference operator. It is important to note that the Hamiltonian is the temporal difference (TD) error for a prescribed control policy u_k = h(x_k).

Iterative methods will be shown to obtain the value function and optimal control from the DT HJB.

Page 8:

Traditional Optimal Control of Linear Systems

Linear Quadratic Regulator. DT linear time-invariant system dynamics:

x_{k+1} = A x_k + B u_k.   (6)

Infinite horizon value function:

V(x_k) = Σ_{i=k}^∞ (x_i^T Q x_i + u_i^T R u_i).   (7)

Bellman equation:

V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1}).   (8)

Page 9:

Bellman Equation for DT LQR: ARE

Value function in quadratic form of the state: V(x_k) = x_k^T P x_k for some P.

DT Hamiltonian:

H(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k) − x_k^T P x_k.   (9)

Use the stationarity condition ∂H(x_k, u_k)/∂u_k = 0 to obtain the optimal control

u_k = −K x_k = −(B^T P B + R)^{-1} B^T P A x_k.   (10)

Substitute (10) into (9) to obtain the DT Algebraic Riccati Equation (DARE)

A^T P A − P + Q − A^T P B (B^T P B + R)^{-1} B^T P A = 0.   (11)

The DT ARE is exactly the Bellman optimality equation for the DT LQR.
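A minimal numerical sketch of (10)-(11): the DARE can be solved by fixed-point (Riccati) iteration on P, which is essentially value iteration for the LQR. The system matrices below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Illustrative 2-state example (assumed values, not from the slides).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve the DT ARE (11) by fixed-point iteration on P, starting from P = 0.
P = np.zeros((2, 2))
for _ in range(2000):
    K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # gain from (10)
    P = A.T @ P @ A - A.T @ P @ B @ K + Q               # Riccati recursion

K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)       # final LQR gain
print(K)
```

At convergence, P satisfies (11) and the closed-loop matrix A − BK is Schur stable.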

Page 10:

Value Function Approximation (VFA)

VFA is the key for implementing PI and VI online for dynamical systems with infinite state and action spaces.

Linear system: since the value function is quadratic in the state in the LQR case, it can be written as

V(x_k) = x_k^T P x_k = (vec(P))^T (x_k ⊗ x_k) ≡ p̄^T x̄_k.   (12)

Nonlinear system: according to the Weierstrass higher-order approximation theorem, there exists a dense basis set {φ_i(x)} such that

V(x) = Σ_{i=1}^∞ w_i φ_i(x) = Σ_{i=1}^L w_i φ_i(x) + Σ_{i=L+1}^∞ w_i φ_i(x) ≡ W_L^T φ_L(x) + ε_L(x),   (13)

with basis vector φ_L(x) = [φ_1(x) φ_2(x) ... φ_L(x)] : R^n → R^L, and ε_L(x) converges uniformly to zero as L → ∞.
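The linear-system case of (12) can be checked numerically: V(x) = x^T P x equals vec(P)^T (x ⊗ x). The matrix and evaluation point below are arbitrary illustrative values.

```python
import numpy as np

# Check of (12): quadratic form equals vec(P) dotted with the Kronecker square.
P = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative symmetric P
x = np.array([1.0, -2.0])                # illustrative state

v1 = x @ P @ x                           # V(x) = x^T P x
v2 = P.flatten() @ np.kron(x, x)         # vec(P)^T (x ⊗ x)
print(v1, v2)
```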

Page 11:

Optimal Adaptive Control of Linear System

Example: solving the DT LQR online by using PI and VI.

System dynamics:

x_{k+1} = A x_k + B u_k.   (14)

Value function:

V(x_k) = Σ_{i=k}^∞ (x_i^T Q x_i + u_i^T R u_i) = x_k^T P x_k.   (15)

Optimal control:

u_k = −K x_k = −(B^T P B + R)^{-1} B^T P A x_k.   (16)

Note: solving for the optimal control needs P, which is the solution of the Riccati equation. Next, PI and VI will be used to solve for P online.

Page 12:

Optimal Adaptive Control via PI

PI for DT LQR. Initialization: select any admissible control h_0(x_k). Start iteration j = 0 until convergence.

Policy evaluation step:

p_{j+1}^T (x̄_k − x̄_{k+1}) = r(x_k, h_j(x_k)) = x_k^T (Q + K_j^T R K_j) x_k,   (17)

where p_{j+1} = vec(P_{j+1}).

Policy improvement step:

u_k^{j+1} = h_{j+1}(x_k) = −(B^T P_{j+1} B + R)^{-1} B^T P_{j+1} A x_k.   (18)

Page 13:

Optimal Adaptive Control via VI

VI for DT LQR. Initialization: select any control h_0(x_k) (admissibility is not required). Start iteration j = 0 until convergence.

Value update step:

p_{j+1}^T x̄_k = r(x_k, h_j(x_k)) + p_j^T x̄_{k+1} = x_k^T (Q + K_j^T R K_j) x_k + p_j^T x̄_{k+1},   (19)

where p_{j+1} = vec(P_{j+1}).

Control policy update:

u_k^{j+1} = h_{j+1}(x_k) = −(B^T P_{j+1} B + R)^{-1} B^T P_{j+1} A x_k.   (20)

Note: the control policy updates (18), (20) require the system dynamics A, B, which cannot be known beforehand.

Page 14:

Policy (PI) and Value Iteration(VI) Flowchart

Note: Differences between PI and VI: 1) VI does not require an initial admissible control; 2) the value update (19) differs from the policy evaluation (17).

Page 15:

Optimal Adaptive Control via Policy Approximation

Second 'Actor' Neural Network: to relax the requirement of full knowledge of the system dynamics, a second neural network is introduced for the control policy.

Actor NN:

u_k = h(x_k) = U^T σ(x_k),   (21)

where σ(x) : R^n → R^M is the activation (basis) function and U ∈ R^{M×m} are the weights (unknown parameters).

An update law is required for the NN weights.

Note: after convergence of the critic NN parameters to W_{j+1} in PI or VI, the actor NN is used to update the control policy based on W_{j+1}. The actor NN avoids the need for the state drift dynamics f(·) (or A in the linear system).

Page 16:

Actor-Critic Implementation of TD Learning

Fig 1. Temporal Difference Learning using Policy Iteration.

The value function is estimated by observing the current and the next state, and the cost incurred. Based on the new value, the action is updated.

Page 17:

Nonlinear Optimal Control via Offline

Discrete-time unknown nonlinear affine system:

x_{k+1} = f(x_k) + g(x_k) u(x_k) ≡ f_k + g_k u_k.   (1)

Infinite horizon cost function:

V(x_k) = Σ_{n=k}^∞ (Q(x_n) + u_n^T R u_n) = Q(x_k) + u_k^T R u_k + V(x_{k+1}),   (2)

where Q(x_k) > 0 and R ∈ R^{m×m} > 0.

The control input u_k must be admissible (Chen & Jagannathan, 2008): u_k is continuous on a compact set; u_k stabilizes (1) with u(0) = 0; and V(x_0) is finite for all x_0.

T. Dierks, B. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, pp. 851-860, 2009.

Page 18:

Nonlinear Optimal Control (cont.)

The optimal control input for (1) is found by solving ∂V(x_k)/∂u_k = 0 and revealed to be

u_k* = −(1/2) R^{-1} g^T(x_k) ∂V(x_{k+1})/∂x_{k+1}.   (3)

Optimal control (3) is generally not implementable for unknown nonlinear systems due to g(x_k) and x_{k+1}.

This work relaxes the need for explicit knowledge of (1) while ensuring convergence and stability:

- Online NN system identification scheme
- Offline optimal control using only the learned NN model of the system

Page 19:

Identifier Design

The nonlinear system (1) is rewritten as

x_{k+1} = f_k + g_k u_k ≡ h(x_k, u_k) = w_s^T σ(v^T z(x_k)) + ε(x_k) ≡ w_s^T σ̄(z(x_k)) + ε(x_k).   (4)

The proposed nonlinear identifier is written as

x̂_{k+1} = ŵ_{s,k}^T σ̄(z_k) + v_k,   (5)

where the robust term is

v_k = θ̂_k x̃_k / (x̃_k^T x̃_k + C_s),   (6)

x̃_k = x_k − x̂_k is the identification error, θ̂_k is an additional tunable parameter, and C_s > 0 is a constant.

Page 20:

NN Identifier (cont.)

Assumption: the reconstruction error term ε(x_k) is assumed to be upper bounded by a function of the estimation error such that

ε^T(x_k) ε(x_k) ≤ ε_M + θ x̃_k^T x̃_k,   (7)

where ε_M is a constant bound.

The proposed tuning laws for ŵ_{s,k} and θ̂_k are given by

ŵ_{s,k+1} = ŵ_{s,k} + α_s σ̄(z_k) x̃_{k+1}^T   (8)

and

θ̂_{k+1} = θ̂_k + γ_s x̃_k^T x̃_k / (x̃_k^T x̃_k + C_s).   (9)

The asymptotic error convergence can be demonstrated using the Lyapunov candidate

L = x̃_k^T x̃_k + (1/α_s) tr[w̃_{s,k}^T w̃_{s,k}] + (1/γ_s) θ̃_k^2.   (10)

Page 21:

Estimation of Control Coefficient Matrix

Using the online system identification scheme (5), an estimate for g(x_k) is given by

ĝ(x_k) = ŵ_{s,k}^T σ̄'(z_k) v^T (∂z_k/∂u_k),   (11)

where σ̄'(z_k) is the derivative of the activation function with respect to z_k and ∂z_k/∂u_k is a constant matrix.

After a sufficiently long online learning session, the robust term (6) approaches zero, and the dynamic system can be rewritten as

x_{k+1} = f(x_k) + g(x_k) u_k ≈ x̂_{k+1} = ŵ_{s,k}^T σ̄(z_k).   (12)

Then observing

∂x_{k+1}/∂u_k = g(x_k) ≈ ∂x̂_{k+1}/∂u_k

yields the relationship shown in (11).
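A stripped-down illustration of online identification in the spirit of (5) and (8): a linear-in-the-parameters model ŵ^T σ(z_k) is tuned from the one-step prediction error with a normalized gradient update. The plant, basis, and gain below are illustrative assumptions, not the slides' NN identifier.

```python
import numpy as np

# Minimal online identifier sketch: tune w_hat so that w_hat^T sigma(z_k)
# predicts x_{k+1}. The "unknown plant" and settings are illustrative.
rng = np.random.default_rng(1)
w_true = np.array([0.8, 0.5])                 # unknown plant: x+ = 0.8 x + 0.5 u
sigma = lambda x, u: np.array([x, u])         # regressor sigma(z_k)

w_hat = np.zeros(2)
alpha = 0.5                                   # normalized learning gain
x = 0.0
for k in range(200):
    u = rng.uniform(-1, 1)                    # persistently exciting input
    xn = w_true @ sigma(x, u)                 # true next state
    x_pred = w_hat @ sigma(x, u)              # identifier prediction, cf. (5)
    s = sigma(x, u)
    w_hat += alpha * s * (xn - x_pred) / (1 + s @ s)   # normalized update, cf. (8)
    x = xn
print(w_hat)
```

With noiseless data and exciting inputs, the parameter estimate converges to the true plant parameters.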

Page 22:

Offline Heuristic Dynamic Programming

Only the NN model learned online is used for the offline training.

1) Initialize the cost function V_0(·) = 0 and select the initial stabilizing control input:

u_0(x_k) = arg min_u [ Q(x_k) + u^T R u + V_0(x̂_{k+1}) ].

2) Compute the value function:

V_1(x_k) = Q(x_k) + u_0^T R u_0 + V_0(ŵ_{s,k}^T σ̄(z_k)).

3) Optimization is achieved by iterating between a sequence of control policies and value functions (proof of convergence shown in Al-Tamimi, Lewis, & Abu-Khalaf, 2008):

u_i(x_k) = arg min_u [ Q(x_k) + u^T R u + V_i(x̂_{k+1}) ],
V_{i+1}(x_k) = Q(x_k) + u_i^T R u_i + V_i(ŵ_{s,k}^T σ̄(z_k)).
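The three steps above can be sketched for a scalar example: HDP alternates a grid-based policy improvement with a least-squares value fit over an offline training mesh. The stand-in model f̂, ĝ, the polynomial basis, and all settings are illustrative assumptions, not the slides' learned NN identifier.

```python
import numpy as np

# Minimal HDP sketch on a scalar system (all values illustrative).
f_hat = lambda x: 0.9 * x          # stands in for the learned drift model
g_hat = 1.0                        # stands in for the learned input gain
Q, R = 1.0, 1.0

xs = np.linspace(-1.0, 1.0, 41)    # offline training mesh
phi = lambda x: np.stack([x**2, x**4], axis=-1)   # even polynomial basis for V
w = np.zeros(2)                    # V_0 = 0

for i in range(50):
    # Policy improvement: minimize Q(x) + u^T R u + V_i(x+) over a control grid
    us = np.linspace(-2.0, 2.0, 201)
    xn = f_hat(xs[:, None]) + g_hat * us[None, :]
    cost = Q * xs[:, None]**2 + R * us[None, :]**2 + phi(xn) @ w
    u_star = us[np.argmin(cost, axis=1)]
    # Value update: least-squares fit of V_{i+1} over the mesh
    target = Q * xs**2 + R * u_star**2 + phi(f_hat(xs) + g_hat * u_star) @ w
    w, *_ = np.linalg.lstsq(phi(xs), target, rcond=None)
print(w)
```

For this linear-quadratic stand-in the iterates should approach the scalar Riccati solution (quadratic coefficient near 1.48, negligible quartic term).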

Page 23:

NN Implementation of Offline HDP

Value function (2) and optimal control (3) are assumed to have NN representations

V_i(x_k) = W_{Vi}^T φ(x_k) + ε_{Vi},   (13)
u_i(x_k) = W_{Ai}^T σ(x_k) + ε_{Ai}.   (14)

The NN approximations of (13) and (14) are written as

V̂_i(x_k) = Ŵ_{Vi}^T φ(x_k),   (15)
û_i(x_k) = Ŵ_{Ai}^T σ(x_k).   (16)

Critic weights are tuned at each iteration using the method of weighted residuals:

Ŵ_{Vi+1} = ( ∫ φ(x_k) φ^T(x_k) dx_k )^{-1} ∫ φ(x_k) d(x_k, x̂_{k+1}, Ŵ_{Vi}, Ŵ_{Ai}) dx_k.   (17)

Page 24:

Tuning the Action Network

Define the action error to be the difference between the estimated control input (16) and the estimated optimal control:

e_{ai,j} = û_{i,k} − ũ_{i,k} = Ŵ_{Ai,j}^T σ(x_k) + (1/2) R^{-1} ĝ^T(x_k) ∂V̂_i(x̂_{j,k+1})/∂x̂_{j,k+1}
        = Ŵ_{Ai,j}^T σ(x_k) + (1/2) R^{-1} ( ŵ_{s,k}^T σ̄'(ẑ_{j,k}) ∂ẑ_{j,k}/∂û_{j,k} )^T ( ∂φ^T(x̂_{j,k+1})/∂x̂_{j,k+1} ) Ŵ_{Vi}.   (18)

Action network weight update:

Ŵ_{Ai,j+1} = Ŵ_{Ai,j} − α_j σ(x_k) e_{ai,j}^T / ( σ^T(x_k) σ(x_k) + 1 ).   (19)

Proof of action network convergence can be shown.

Page 25:

Remarks

The results indicate that only a nearly optimal control law is possible when the NN reconstruction errors are explicitly considered.

When the NN reconstruction errors are ignored, the NN estimate of the optimal control law converges to the optimal control uniformly.

Although the NN reconstruction errors are often ignored in the literature, proofs of convergence for the NN implementations of HDP are rarely studied.

Page 26:

Offline HDP Flowchart

Page 27:

Simulation Results

Example 1: Linear system

x_{k+1} = [0.3  0.8; 0.8  1.8] x_k + [0; 1] u_k.

Online identifier parameters: α_s = 0.09, γ_s = 0.0027.

Offline training set and learning parameter: x_1 ∈ [−0.5, 0.5], x_2 ∈ [−0.5, 0.5], α_j = 0.01.

Value network activation functions: {x_1^2, x_1 x_2, x_2^2, x_1^4, x_1^3 x_2, ..., x_2^6}.

Action network activation functions: {x_1, x_2, x_1^3, x_1^2 x_2, ..., x_2^5}.

DARE solution: u_k = −[0.691  1.3405] x_k.

Page 28:

Linear System Results

Fig. 2: state identification error (x_1, x_2) versus iteration index k. Fig. 3: state trajectories in the (x_1, x_2) plane, DARE solution versus NN solution.

Page 29:

Nonlinear System Example

Example 2: Nonlinear system

x_{k+1} = [ −sin(0.5 x_{2,k}) ; −cos(1.4 x_{2,k}) sin(0.9 x_{1,k}) ] + [0; 1] u_k.

NN identifier, value and action network parameters are the same as in Example 1.

The HDP algorithm described in Al-Tamimi, Lewis, and Abu-Khalaf (2008) was also implemented using full knowledge of the nonlinear system dynamics for comparison.

Page 30:

Figs. 4-7: simulation results for the nonlinear system example.

Page 31:

Challenges of PI and VI-based Schemes

- Iterative-based schemes may not be suitable for hardware implementation
- PI and VI-based schemes may not converge if a suitable number of iterations is not chosen
- In certain instances, a model of the nonlinear system needs to be created before designing the PI and VI schemes

Page 32:

Effect of Insufficient Iterations using HDP

Consider the discrete-time nonlinear affine system

x_{k+1} = f(x_k) + g(x_k) u_k.

Simulation parameters:
- Initial states: x_0 = [0.5  −4.5]^T
- Q = I_{2×2}, R = 1
- Sampling interval: T_s = 0.1 sec

Page 33:

Effect of Insufficient Iterations

Fig. 9: regulation errors e_1 and e_2 for number of iterations = 1000 and for number of iterations = 10.

From these figures it is obvious that when the number of iterations is decreased, policy iteration cannot force the regulation errors to converge to zero.

Page 34:

Online Near Optimal Regulation

Optimal control:

u_k* = −(1/2) R^{-1} g^T(x_k) ∂J(x_{k+1})/∂x_{k+1}.   (20)

No offline scheme or internal dynamics f(x_k) are required.

To begin, the cost function (2) is assumed to have an NN representation written as

J(x_k) = W_c^T φ(x_k) + ε_c(x_k).   (21)

The NN approximation of (21) is written as

Ĵ(x_k) = Ŵ_{c,k}^T φ(x_k).   (22)

The cost function error (residual or TD error) dynamics are formed as

e_{c,k} = r(x_{k−1}, u_{k−1}) + Ŵ_{c,k}^T φ(x_k) − Ŵ_{c,k}^T φ(x_{k−1}) ≡ r_{k−1} + Ŵ_{c,k}^T Δφ(x_{k−1}).   (23)

Page 35:

Auxiliary Critic Error System

Define the auxiliary error vector

E_{c,k} = Y_{k−1} + Ŵ_{c,k}^T X_{k−1},   (24)

where

Y_{k−1} = [r_{k−1}  r_{k−2}  ...  r_{k−1−j}],   X_{k−1} = [Δφ_{k−1}  Δφ_{k−2}  ...  Δφ_{k−1−j}].

Auxiliary error dynamics:

E_{c,k+1} = Y_k + Ŵ_{c,k+1}^T X_k.   (25)

The system (25) can be viewed as an affine nonlinear system with control input Ŵ_{c,k+1}.

Page 36:

Critic Weight Update

Therefore, the critic weight update is written as

Ŵ_{c,k+1} = X_k (X_k^T X_k)^{-1} (α_c E_{c,k} − Y_k)^T,   (26)

and (24) becomes

E_{c,k+1}^T = α_c E_{c,k}^T.   (27)

We can show that (X_k^T X_k)^{-1} exists.

The weight update (26) is comparable to the least squares updates used in offline training. Instead of summing over a mesh of training points, the update (26) represents a sum over the system's time history stored in E_c(k). Thus, the update (26) uses data collected in real time instead of data formed offline.
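The least-squares flavor of (24)-(26) can be illustrated by evaluating a fixed policy from trajectory data: the TD relation W^T(φ(x_k) − φ(x_{k+1})) = r(x_k, u_k) is solved in batch over the stored time history. The linear system, policy, and quadratic basis below are illustrative assumptions.

```python
import numpy as np

# Batch least-squares critic fit from a stored time history (illustrative values).
A = np.array([[0.9, 0.2], [0.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[0.1, 0.3]])                 # fixed stabilizing policy u = -K x

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])   # quadratic basis

X_hist, Y_hist = [], []                    # regressors and stage costs
x = np.array([1.0, -1.0])
for k in range(40):
    u = -K @ x
    xn = A @ x + B @ u
    X_hist.append(phi(x) - phi(xn))        # -delta phi along the trajectory
    Y_hist.append(x @ Q @ x + u @ R @ u)   # stage cost r(x_k, u_k)
    x = xn

W, *_ = np.linalg.lstsq(np.array(X_hist), np.array(Y_hist), rcond=None)

# Cross-check against the Lyapunov solution P of policy evaluation
Ac = A - B @ K
P = np.zeros((2, 2))
for _ in range(2000):
    P = Ac.T @ P @ Ac + Q + K.T @ R @ K
print(W, [P[0, 0], 2 * P[0, 1], P[1, 1]])
```

The recovered weights correspond to the quadratic value x^T P x, so W matches [p11, 2 p12, p22].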

Page 37:

Action Network

The optimal control (3) has an NN representation written as

u(x_k) = W_A^T σ(x_k) + ε_A(x_k).   (28)

The NN approximation of (28) takes the form

û_k = û(x_k) = Ŵ_{A,k}^T σ(x_k).   (29)

The optimal control signal error is defined as

e_{a,k} = Ŵ_{A,k}^T σ(x_k) + (1/2) R^{-1} ĝ^T(x_k) ( ∂φ^T(x_{k+1})/∂x_{k+1} ) Ŵ_{c,k}.   (30)

Action network weight update:

Ŵ_{A,k+1} = Ŵ_{A,k} − α_a σ(x_k) e_{a,k}^T / ( σ^T(x_k) σ(x_k) + 1 ).   (31)

In the absence of NN reconstruction errors, the action and critic errors converge to zero asymptotically and the action and critic NN weights remain bounded; that is, û → u*.

Page 38:

Offline and Online Schemes: A Comparison

Page 39:

Simulation Results: Example 1

Example 1: Linear system

x_{k+1} = [0  0.8; 0.8  1.8] x_k + [0; 1] u_k.

Results of solving the discrete-time algebraic Riccati equation (DARE), which requires A to be known:

u_k = −[0.6239  1.2561] x_k,
J_k = 1.5063 x_{1,k}^2 + 2.0098 x_{1,k} x_{2,k} + 3.7879 x_{2,k}^2.

Initial stabilizing control: u_{0,k} = −[0.5  1.4] x_k.

Implementation of the proposed optimal regulator does not require the A matrix to be known.

Page 40:

Optimal Regulation Results

Cost approximator activation functions: {x_1^2, x_1 x_2, x_2^2, x_1^4, x_1^3 x_2, ..., x_2^6}.

Action network activation functions: {x_1, x_2, x_1^3, x_1^2 x_2, ..., x_2^5}.

Tuning constants: α_c = 10^{-6}, α_a = 0.1.

Final cost estimator weights:

Ŵ_c = [1.5071  2.0097  3.7886  0.0082  0.0015  0.0025  0.0030  0.0014  0.0020  0.0000  0.0000  0.0008  0.0003  0.0009  0.0002]^T,

which closely matches the DARE cost J_k = 1.5063 x_{1,k}^2 + 2.0098 x_{1,k} x_{2,k} + 3.7879 x_{2,k}^2.

Final weights of the action network:

Ŵ_A = [0.6208  1.2586  0.0589  0.0338  0.0095  ...  0.0092  0.0049  0.0074  0.0050  0.0075  0.0054]^T,

which closely matches the DARE control u(k) = −[0.6239  1.2561] x(k).

Page 41:

Optimal Regulation Results (cont.)

Fig. 10

Fig. 11 Fig. 12

Page 42:

Simulation Results: Example 2

Regulation Example 2: Nonlinear system

x_{1,k+1} = x_{1,k} x_{2,k},
x_{2,k+1} = x_{1,k}^2 + 1.8 x_{2,k} + u_k.

Tuning parameters and activation functions are the same as in Example 1.

Initial stabilizing control: u_{0,k} = −[0.4  1.24] x_k.

Online results are compared to the result obtained via offline training (Chen & Jagannathan, 2008).

Page 43:

Online Results for the Nonlinear System

Fig. 13 Fig. 14

Fig. 15 Fig. 16

Page 44:

Optimal Tracking Results (cont.)

Fig. 17

Page 45:

Optimal Tracking

For regulation, the optimal control inputs are constructed so that u_k → 0 when x_k → 0.

In contrast, tracking control inputs typically consist of feedback and feedforward terms, so that u_k is not zero when the tracking error e_k becomes zero.

In the literature, finite horizon cost functions of the form

J_k = Σ_{j=k}^N (e_j^T Q e_j + u_j^T R u_j)

are often considered for tracking to mitigate the fact that u_k does not vanish (Lewis and Syrmos, 1995).

Objective: design the control input to ensure that the nonlinear system (1) tracks a desired trajectory in an optimal manner.

T. Dierks and S. Jagannathan, “Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics”, Proc. of the IEEE Conference on Decision and Control, pp. 6750-6755, December 2009.

Page 46:

Optimal Tracker Development

To begin, write the desired trajectory dynamics

x_{d,k+1} = f(x_{d,k}) + g(x_{d,k}) u_{d,k}.   (32)

Tracking error e_k = x_k − x_{d,k}, with dynamics given by

e_{k+1} = f_{e,k} + g(x_k) u_{e,k},   (33)

with f_{e,k} = f(x_k) − f(x_{d,k}) and u_{e,k} = u_k − u_{d,k}.

The infinite horizon cost function for tracking can be given by

J_{e,k} = Σ_{i=0}^∞ r_{e,k+i} = r_{e,k} + J_{e,k+1},   (34)

with r_{e,k} = Q_e(e_k) + u_{e,k}^T R_e u_{e,k}, Q_e(e_k) > 0, and R_e ∈ R^{m×m} > 0.

The signal u_{e,k} is admissible; thus J_{e,k} is finite.

Page 47:

Optimal Tracking Control Input

The optimal feedback control is found by solving ∂J_{e,k}/∂u_{e,k} = 0:

u_{e,k}* = −(1/2) R_e^{-1} g^T(x_k) ∂J(e_{k+1})/∂e_{k+1}.   (35)

The optimal control is then given by

u_k = u_{d,k} − (1/2) R_e^{-1} g^T(x_k) ∂J(e_{k+1})/∂e_{k+1},   (36)

where the feedforward term is found from (32) as

u_{d,k} = g(x_{d,k})^{-1} (x_{d,k+1} − f(x_{d,k})).   (37)

The optimal control requires the nonlinear cost function, feedback and feedforward values, which will be approximated by OLAs.

Page 48:

Feedforward Control Input Approximation

The feedforward control input (37) is approximated using an OLA and written as

û_{d,k} = g(x_{d,k})^{-1} (x_{d,k+1} − θ̂_d^T φ_d(x_{d,k})).   (38)

The feedforward OLA update is derived using Lyapunov methods to be

θ̂_{d,k+1} = θ̂_{d,k} + α_d φ_{d,k} (e_{k+1} − g(x_k) û_{e,k})^T.   (39)

The estimate of the system control input (36) is then written as

û_k = û_{d,k} + û_{e,k}.   (40)

Page 49:

Remarks

Remark 1: For offline ADP, the optimal cost function is learned using a mesh of training points. In contrast, in this work the optimal cost function is learned online using the system's time history.

Remark 2: By construction, the feedback control input error e_{a,k} becomes zero when the tracking error e_k is zero. As a result, the feedback control estimator (41) cannot be updated once the tracking error has converged to zero.

To prevent the control input error from becoming zero before the cost function and nearly optimal control policy have been learned, external probing noise may be introduced into the nonlinear system dynamics to ensure the tracking error is persistently exciting.
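The probing-noise idea in Remark 2 can be sketched as follows. The zero-mean, variance-0.04 distribution matches the later design slides; the learning-phase schedule (`k_learn`) is an assumption for illustration.

```python
import numpy as np

# Sketch of persistence-of-excitation probing noise: zero-mean noise with
# variance 0.04 (from the design slides) is added to the control while the
# cost/control OLAs are still learning; the cutoff schedule is assumed.
rng = np.random.default_rng(42)

def control_with_probing(u_hat, k, k_learn=500):
    """Add exploration noise only during the (assumed) learning phase."""
    if k < k_learn:
        return u_hat + rng.normal(0.0, np.sqrt(0.04), size=np.shape(u_hat))
    return u_hat

u = control_with_probing(np.zeros(2), k=10)
print(u.shape)  # (2,)
```

During learning the perturbed control keeps the tracking error nonzero and hence persistently exciting; after `k_learn` the estimated control is applied unmodified.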


Online Regulation and Tracking: Comparison


Nonlinear System

Consider the nonlinear system

    x_{1,k+1} = sin(0.5 x_{2,k}) x_{1,k}^2 + u_{1,k}
    x_{2,k+1} = cos(1.4 x_{2,k}) sin(0.9 x_{1,k}) + u_{2,k}

Desired trajectory:

    x_{1d,k} = sin(0.25 k)
    x_{2d,k} = cos(0.25 k)

Initial stabilizing control:

    u_{0,k} = [0.1  0; 0  0.1] e_k

No offline training is required to implement the proposed optimal tracking control scheme. 11 past values are used to form the auxiliary cost function error.
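The simulation setup above can be sketched in closed loop under the initial stabilizing control alone. The dynamics and trajectory follow the reconstruction on this slide and should be treated as assumptions; the initial state is arbitrary.

```python
import numpy as np

# Closed-loop sketch under the initial stabilizing control u_0 = diag(0.1, 0.1) e_k,
# with g(x) = I assumed; dynamics/trajectory are the reconstruction above.

def f(x):
    return np.array([np.sin(0.5 * x[1]) * x[0] ** 2,
                     np.cos(1.4 * x[1]) * np.sin(0.9 * x[0])])

def x_d(k):
    return np.array([np.sin(0.25 * k), np.cos(0.25 * k)])

K0 = np.diag([0.1, 0.1])
x = np.array([0.5, -0.5])      # arbitrary initial state
errs = []
for k in range(200):
    e = x - x_d(k)
    u = K0 @ e                  # initial stabilizing control, no feedforward yet
    x = f(x) + u
    errs.append(np.linalg.norm(e))

print(np.isfinite(errs).all())  # True
```

As expected for a merely stabilizing (not optimal) policy, the trajectory stays bounded but does not track the reference closely; the OLA scheme is what drives the cost toward its optimum online.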


OLA and Parameter Design

The OLA's considered for implementation were neural networks:
 10 hidden layer neurons were used in both the cost function and feedback networks along with the sigmoid activation function
 25 hidden layer neurons were used in the feedforward estimator along with the radial basis activation function

Tuning gains were selected according to the theoretical results:
 Cost function parameter tuning gain: α_ec = 0.1
 Feedback control signal parameter tuning gain: α_ea = 0.1
 Feedforward control signal parameter tuning gain: α_d = 0.05

All tunable parameters were initialized to zero.

Probing noise was generated from a random distribution with zero mean and variance of 0.04 to ensure the tracking error was persistently exciting.
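The two OLA structures above can be sketched directly. The fixed random input weights and RBF centers are assumptions (the slide only fixes the neuron counts, activations, and zero-initialized tunable parameters).

```python
import numpy as np

# Sketch of the two OLA structures: a 10-neuron sigmoid net (cost and feedback
# estimators) and a 25-neuron RBF net (feedforward estimator). Inner weights
# and centers are fixed random values (assumed); output layers start at zero.
rng = np.random.default_rng(1)

V_sig = rng.standard_normal((2, 10))    # fixed input weights, sigmoid net
W_sig = np.zeros((10, 1))               # tunable output weights (init 0)
centers = rng.uniform(-1, 1, (25, 2))   # fixed RBF centers
W_rbf = np.zeros((25, 2))               # tunable output weights (init 0)

def sigmoid_net(x):
    h = 1.0 / (1.0 + np.exp(-(x @ V_sig)))       # sigmoid hidden layer
    return h @ W_sig

def rbf_net(x):
    h = np.exp(-np.sum((centers - x) ** 2, axis=1))  # radial basis hidden layer
    return h @ W_rbf

x = np.array([0.3, -0.2])
print(sigmoid_net(x).shape, rbf_net(x).shape)  # (1,) (2,)
```

With only the output layer tuned, each network is linear in its adjustable parameters, which is what makes the Lyapunov-based update laws tractable.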


Optimal Tracking Results

Fig. 18


Results for Optimal Tracking Control

Fig. 19, Fig. 20: All parameters remain bounded and converge to constant values, as the stability theorem suggested.


Hardware Implementation


Multi-modal Engine Control

Objective is to control an SI engine to minimize emissions and maximize fuel efficiency:
 Engine dynamics are unknown
 States are not available
 Nonstrict nonlinear discrete-time system

High levels of exhaust gas recirculation (EGR) dilute cylinder contents. Although the stoichiometric equivalence ratio is maintained, partial fuel burns and misfires become frequent.

Lean combustion changes the equivalence ratio, causing partial fuel burns and misfires.

Control of the fuel input attempts to reduce cyclic dispersion of heat release.


Lean Engine Control

Daw model equations: states are total air and total fuel; output is heat release or acceleration

    x_1(k+1) = AF(k) + F(k)[x_1(k) - R·CE(k)·x_2(k)] + d_1(k)
    x_2(k+1) = [1 - CE(k)] F(k) x_2(k) + MF(k) + u(k) + d_2(k)
    y(k) = CE(k) x_2(k)

with combustion efficiency

    CE(k) = CE_max / (1 + 100^(φ_m - φ(k))),  φ(k) = R x_2(k) / x_1(k)

This model represents spark engine dynamics as a nonlinear system in nonstrict feedback form.

S. Jagannathan and P. He, “Neural network-based state feedback control of nonlinear discrete-time system in non-strict feedback form”, IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 2073-2087, December 2008.
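A Daw-type cycle-to-cycle iteration of the kind described above can be sketched as follows. All parameter values (R, CE_max, φ_m, residual fraction F, air/fuel feed rates) are illustrative guesses, not the experimental calibration used in the cited work.

```python
import numpy as np

# Hedged sketch of a Daw-type lean-combustion cycle map; parameters assumed.
R, CE_max, phi_m, F = 14.6, 1.0, 0.67, 0.7
AF = 10.0                 # fresh air per cycle (assumed)
MF = AF / R * 0.9         # fresh fuel per cycle, slightly lean (assumed)

def CE(x1, x2):
    phi = R * x2 / x1                       # equivalence ratio
    return CE_max / (1.0 + 100.0 ** (phi_m - phi))

def step(x1, x2, u=0.0):
    ce = CE(x1, x2)
    x1n = AF + F * (x1 - R * ce * x2)       # residual air plus fresh charge
    x2n = (1.0 - ce) * F * x2 + MF + u      # unburned residual fuel plus fresh fuel
    y = ce * x2                             # heat release ~ burned fuel
    return x1n, x2n, y

x1, x2 = 2.0 * AF, 2.0 * MF
hr = []
for _ in range(300):
    x1, x2, y = step(x1, x2, u=0.0)
    hr.append(y)

print(np.isfinite(hr).all() and min(hr) >= 0)  # True
```

The control input u(k) enters the fuel state x_2, which is exactly where the adaptive NN controller acts to suppress cyclic dispersion in the heat-release output y(k).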


Lean Engine Control and Diagnostics: Experimental Results

[Fig.: NN observer and NN controller in closed loop with the engine; the observer reconstructs the states and heat release from measurements, and the adaptive critic network generates the control inputs.]

 Adaptive critic neural network controller for lean engine operation reduces NOx by over 90%, unburned hydrocarbons by 30%, and CO by 10% from stoichiometric levels.
 Fuel efficiency is enhanced by 10%.
 Misfires have been minimized.

[Fig.: Engine heat release during misfire at φ = 0.76 (measured and estimated HR, and control input u(k)); cyclic dispersion in heat release without (left) and with (right) control.]


Lean Engine Control

Time series of heat release for equivalence ratio set point = 0.77:
 Time series of heat release shows more stable heat release output with the controller active.
 Return maps of current cycle heat release plotted against the next cycle show 64% reduced cyclic dispersion.

[Fig. 21: Heat release time series with the control action initiated; return maps without control (uncontrolled HR: φ = 0.77, COV: 0.3851) and with control (controlled HR: φ = 0.77 + 3.9%, COV: 0.1364).]
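The dispersion statistic quoted above (COV 0.3851 uncontrolled vs 0.1364 controlled) is the coefficient of variation of the heat-release series. A sketch with synthetic stand-in data:

```python
import numpy as np

# Sketch of the cyclic-dispersion metric: COV = std/mean of heat release.
# The two series below are synthetic stand-ins, not experimental data.
rng = np.random.default_rng(7)

def cov(series):
    return np.std(series) / np.mean(series)

uncontrolled = 800 + 300 * rng.standard_normal(1000)  # widely dispersed HR (J)
controlled = 800 + 100 * rng.standard_normal(1000)    # tighter HR (J)

print(cov(uncontrolled) > cov(controlled))  # True
```

A return map simply plots HR(k+1) against HR(k); a tighter cluster on that map corresponds to a lower COV of the series.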


Reinforcement Learning Based EGR

[Fig.: Block diagram of the reinforcement-learning EGR controller: an observer estimates the states and output of the nonlinear system from y(k); a critic NN produces the value estimate Q̂(k); two action NNs generate the control input u(k). Inset: controller output tracking error (%) at φ = 0.72 versus iteration k.]

P. Shih, B. Kaul, S. Jagannathan, and J. Drallmeier, "Reinforcement learning based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation", IEEE Transactions on Neural Networks, vol. 19, no. 8, pp. 1369-1388, August 2008.


Experimental Data

EGR = 18%.

[Fig. 22: Uncontrolled and controlled return maps of heat release at EGR = 18%; controlled estimation errors (ê_x1 mean = -4.76e-006, ê_x2 mean = 3.09e-006, e_y mean = -0.00229); heat release and control input at EGR = 18%, with the controller turned on at k = 524.]


Mobile Robot Formation Control

A wedge formation consisting of a leader L and two followers F1 and F2 was considered, with prescribed desired separations d_{ij} and bearings.

Initial stabilizing controls for leader i and each follower j:

    u_{ieo} = 1.3 g_i(k) e_{ic}(k),  u_{jeo} = 1.4 g_j(k) e_{jc}(k)

To evaluate the performance of the optimal controller, a second non-optimal NN control scheme was selected for each robot according to (Jagannathan, 2006):

    u_{NN}(k) = B^{-1}(k) (K e_c(k) + Ŵ^T(k) σ(V^T z(k)))

    Ŵ(k+1) = Ŵ(k) + α σ(k) e_c^T(k+1) - Γ ||e_c(k+1)|| Ŵ(k)

B. Brenner, T. Dierks, and S. Jagannathan, "Neuro optimal control of robot formations", Proc. of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 234-241, April 2011.
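A rough sketch of a one-layer NN comparison controller of the kind described above. The matrices B, K, V, the tanh activation, the gains, and the error signals are all illustrative assumptions, and the weight update mirrors the reconstruction on this slide rather than the exact published law.

```python
import numpy as np

# Sketch of a non-optimal one-layer NN controller with a norm-decay weight
# update; all dimensions, gains, and signals below are assumed.
rng = np.random.default_rng(3)

V = rng.standard_normal((4, 8))   # fixed inner-layer weights (assumed)
W = np.zeros((8, 2))              # tunable outer-layer weights
K = 0.5 * np.eye(2)               # fixed feedback gain (assumed)
alpha, Gamma = 0.1, 0.01          # learning gain and decay gain (assumed)

def sigma(z):
    return np.tanh(z)

def u_nn(e_c, z, B_inv):
    # u_NN(k) = B^{-1}(k)(K e_c(k) + W^T sigma(V^T z(k)))
    return B_inv @ (K @ e_c + W.T @ sigma(V.T @ z))

def update_W(e_c_next, z):
    # W(k+1) = W(k) + alpha*sigma(k)*e_c^T(k+1) - Gamma*||e_c(k+1)||*W(k)
    global W
    W = (W + alpha * np.outer(sigma(V.T @ z), e_c_next)
           - Gamma * np.linalg.norm(e_c_next) * W)

for k in range(20):
    z = rng.standard_normal(4)
    update_W(0.8 ** k * np.ones(2), z)   # synthetic decaying error history

print(np.isfinite(W).all())  # True
```

Unlike the optimal scheme, this controller tunes its weights purely to drive the formation error down, with no cost function in the loop, which is why it serves as the baseline in the cost comparison that follows.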


Experimental Results

Primary Components of the Missouri S&T Robot Testbed:
1. MST Mote
2. Microstrain Inertial Sensor
3. CMU2 Camera
4. US Digital Wheel Encoders
5. Sharp IR Sensors


Experimental Results

[Fig. 24: Optimal paths of the leader and follower in the X-Y plane.]

[Fig. 25: Feedback, cost, and feedforward weight magnitudes versus time; all remain bounded.]


Experimental Results: With Obstacle Avoidance

[Figs. 26 and 27: Robot paths with obstacle avoidance under the proposed optimal control û_e and the non-optimal NN control u_NN.]


Experimental Results: With Obstacle Avoidance

[Figs. 28 and 29: Robot paths with obstacle avoidance under û_e and u_NN.]

Total accumulated cost J_Total = Σ_{k=0}^{k_final} J(e_c(k)) for each robot and control policy:

    Robot | û_e    | u_NN
    L     | 126.04 | 192.71
    F     | 829.18 | 2635
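The accumulated-cost comparison can be sketched with a simple quadratic stage cost, which is an assumption; the slide does not specify the exact form of J(e_c(k)), and the error histories below are synthetic.

```python
import numpy as np

# Sketch of J_Total = sum_k J(e_c(k)) with an assumed quadratic stage cost.
def total_cost(errors, Q=None):
    errors = np.asarray(errors)
    Q = np.eye(errors.shape[1]) if Q is None else Q
    return float(sum(e @ Q @ e for e in errors))

# Two synthetic error histories: a faster-decaying (optimal-like) policy
# accumulates less cost than a slower one, mirroring the rows of the table.
k = np.arange(100)
e_opt = np.stack([0.90 ** k, 0.90 ** k], axis=1)
e_nn = np.stack([0.97 ** k, 0.97 ** k], axis=1)

print(total_cost(e_opt) < total_cost(e_nn))  # True
```

This is the same bookkeeping behind the table: summing the stage cost over the run lets two controllers with similar final errors still be ranked by how expensively they got there.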


Continuous-time RL/ADP Schemes

 Dierks & Jagannathan, ACC 2010: a single OLA is used to learn the HJB equation for the online optimal control of nonlinear continuous-time systems.

[Figs.: Optimal regulator block diagram; optimal tracker block diagram.]


Conclusions

 RL using value and policy iteration-based schemes helps solve optimal control forward-in-time without needing the system dynamics.

 Due to implementation challenges, online RL/ADP schemes without an iterative approach for optimal control are preferred.

 Discrete-time design and implementation is preferred for hardware implementation even though the Lyapunov first difference is nonlinear with respect to the states. Lyapunov theory guarantees that all signals remain bounded, while under ideal conditions the approximated control input approaches the optimal input asymptotically.

 Simulation and hardware implementation results verify the theoretical results.