-
Basics of DHP type Adaptive Critics / Approximate Dynamic Programming
and some application issues
TUTORIAL
George G. Lendaris
NW Computational Intelligence Laboratory
Portland State University, Portland, OR
April 3, 2006
-
Basic Control Design Problem
[Block diagram: Controller -> Plant]
Controller designer needs the following:
-- Problem domain specifications
-- Design objectives / Criteria for success
-- All available a priori information about Plant and Environment
High Level Design Objective: Intelligent Control
-
In this tutorial, we focus on the Adaptive Critic Method (ACM):
A methodology for designing an (approximately) optimal controller for a given plant according to a stated criterion, via a learning process.
ACM may be implemented using two neural networks (or Fuzzy systems):
---> one in role of controller, and
---> one in role of critic.
-
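Simple, 1-Hidden-Layer, Feedforward Neural Network [to implement y = f(x)]
[Figure: network with input nodes 1, 2 (inputs x1, x2), hidden nodes 3, 4, and output nodes 5, 6 (outputs y1, y2); weights w31, w32, w41, w42, w53, w54, w63, w64; node outputs o1 through o6.]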
Can obtain partial derivatives via Dual of trained NN
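As a rough illustration of the dual-network idea, the sketch below computes the derivatives of the outputs with respect to the inputs by sending a signal backward through a 2-2-2 network, and compares them with finite differences. The weight values, the tanh hidden units, and the linear output units are assumptions made for this example; the slide does not specify them.

```python
import math

# Hypothetical weights; w_ij is the weight into node i from node j.
W_HID = [[0.5, -0.3],   # w31, w32
         [0.8, 0.2]]    # w41, w42
W_OUT = [[1.0, -0.5],   # w53, w54
         [0.3, 0.7]]    # w63, w64


def forward(x):
    """Forward pass: returns hidden outputs (o3, o4) and outputs (y1, y2)."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W_HID]
    y = [sum(w * oh for w, oh in zip(row, hidden)) for row in W_OUT]
    return hidden, y


def jacobian_via_dual(x):
    """dy_i/dx_j, found by sending a unit signal backward from output i."""
    hidden, _ = forward(x)
    jac = []
    for i in range(len(W_OUT)):
        # Backward through the output weights, then through tanh'(net) = 1 - o^2.
        delta = [W_OUT[i][h] * (1.0 - hidden[h] ** 2) for h in range(len(W_HID))]
        # Backward through the input-to-hidden weights.
        jac.append([sum(delta[h] * W_HID[h][j] for h in range(len(W_HID)))
                    for j in range(len(x))])
    return jac


def jacobian_numeric(x, eps=1e-6):
    """Central-difference check of the same derivatives."""
    jac = [[0.0] * len(x) for _ in W_OUT]
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps
        xm = list(x)
        xm[j] -= eps
        yp, ym = forward(xp)[1], forward(xm)[1]
        for i in range(len(W_OUT)):
            jac[i][j] = (yp[i] - ym[i]) / (2 * eps)
    return jac
```

The backward pass reuses the trained weights in transposed order, which is why no separate training is needed to obtain the derivatives.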
-
Overview of Adaptive Critic method
User provides the Design objectives / Criteria for success through a Utility Function, U(t) (local cost).
Then, a new utility function is defined (Bellman Eqn.),
$$J(t) = \sum_{k=0}^{\infty} \gamma^{k}\, U(t+k)$$ [cost to go]
which is to be minimized [~ Dynamic Programming].
[We note: $J(t) = U(t) + \gamma J(t+1)$, the Bellman Recursion]
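A small numerical check can make the relation between the cost-to-go sum and the Bellman recursion concrete. The sketch below assumes a finite utility sequence (so the cost-to-go beyond the last step is taken as zero) and a discount factor gamma; the function names are hypothetical.

```python
def cost_to_go(utilities, gamma):
    """J(t) for every t, via the backward recursion J(t) = U(t) + gamma * J(t+1)."""
    J = [0.0] * (len(utilities) + 1)   # assumption: J is zero past the horizon
    for t in reversed(range(len(utilities))):
        J[t] = utilities[t] + gamma * J[t + 1]
    return J[:-1]


def cost_to_go_direct(utilities, gamma, t):
    """J(t) from the defining sum: sum over k of gamma^k * U(t+k)."""
    return sum(gamma ** k * u for k, u in enumerate(utilities[t:]))


U = [1.0, 0.5, 0.25, 0.0]
J = cost_to_go(U, gamma=0.9)
# J[0] = 1.0 + 0.9*0.5 + 0.81*0.25 = 1.6525 (up to rounding)
```

Both functions return the same values; the recursion simply avoids recomputing the tail of the sum at every time step.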
-
Family of Adaptive Critic Methods:
The critic approximates either J(t) or the gradient of J(t) wrt state vector R(t) [$\nabla J(R)$].
Two members of this family:
-- Heuristic Dynamic Programming (HDP): Critic approximates J(t) (cf. Q Learning)
-- Dual Heuristic Programming (DHP): Critic approximates $\nabla J(R(t)) \equiv \lambda(t)$
[Today, focus on DHP]
-
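The DHP training process may be described via two primary feedback loops: an upper loop that trains the controller (actionNN), and a lower loop that trains the criticNN.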
-
Desire a training "Delta Rule" for $w_{ij}$ to minimize cost-to-go J(t).
Obtain this via $\partial J(t) / \partial w_{ij}(t)$ and the chain rule of differentiation.
To develop a feel for the weight update rule, consider a partial block diagram and a little math (discrete time):
[Block diagram: R(t) -> Controller (weights $w_{ij}$) -> u(t) -> PLANT -> R(t+1) -> Critic -> J(t+1)]
-
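The weights in the controller NN are updated with the objective of minimizing J(t):
$$\Delta w_{ij}(t) = -\,lcoef \cdot \frac{\partial J(t)}{\partial w_{ij}(t)}$$
where
$$\frac{\partial J(t)}{\partial w_{ij}(t)} = \sum_{k=1}^{a} \frac{\partial J(t)}{\partial u_k(t)} \, \frac{\partial u_k(t)}{\partial w_{ij}(t)}$$
and
$$\frac{\partial J(t)}{\partial u_k(t)} = \frac{\partial U(t)}{\partial u_k(t)} + \gamma \, \frac{\partial J(t+1)}{\partial u_k(t)}$$
and
$$\frac{\partial J(t+1)}{\partial u_k(t)} = \sum_{s=1}^{n} \frac{\partial J(t+1)}{\partial R_s(t+1)} \, \frac{\partial R_s(t+1)}{\partial u_k(t)}$$
Call the term $\partial J(t+1) / \partial R_s(t+1)$: $\lambda_s(t+1)$ (to be output of critic).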
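To see how the first chain-rule step turns into a weight update, consider a minimal sketch that assumes a purely linear controller, u_i = sum_j w_ij * R_j (an assumption for illustration; the tutorial's controller is a neural network). For that case the derivative of u_k with respect to w_ij is R_j when k = i and zero otherwise, so the sum over k collapses to a single term. The function name is hypothetical.

```python
def controller_weight_updates(dJ_du, R, lcoef):
    """Delta w_ij = -lcoef * dJ/du_i * R_j for a linear controller u_i = sum_j w_ij * R_j.

    dJ_du: list of dJ(t)/du_i(t), one entry per control output
    R:     list of state values R_j(t)
    lcoef: learning coefficient of the delta rule
    """
    return [[-lcoef * dJ_du[i] * R[j] for j in range(len(R))]
            for i in range(len(dJ_du))]


# Example: two control outputs, three state variables.
updates = controller_weight_updates(dJ_du=[2.0, -1.0], R=[0.5, 1.0, -2.0], lcoef=0.1)
```

For a multi-layer controller the same dJ/du values would instead be backpropagated through the network to reach each weight.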
-
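It follows that Controller training is based on:
$$\frac{\partial J(t)}{\partial u_k(t)} = \frac{\partial U(t)}{\partial u_k(t)} + \gamma \sum_{s=1}^{n} \frac{\partial J(t+1)}{\partial R_s(t+1)} \, \frac{\partial R_s(t+1)}{\partial u_k(t)}$$
[$\partial J(t+1) / \partial R_s(t+1)$: Via CRITIC; $\partial R_s(t+1) / \partial u_k(t)$: Via Plant Model]
Similarly, Critic training is based on:
$$\frac{\partial J(t)}{\partial R_s(t)} = \frac{dU(t)}{dR_s(t)} + \gamma \sum_{k=1}^{n} \frac{\partial J(t+1)}{\partial R_k(t+1)} \left[ \frac{\partial R_k(t+1)}{\partial R_s(t)} + \sum_{m=1}^{a} \frac{\partial R_k(t+1)}{\partial u_m(t)} \, \frac{\partial u_m(t)}{\partial R_s(t)} \right]$$
[$\partial R_k(t+1) / \partial R_s(t)$ and $\partial R_k(t+1) / \partial u_m(t)$: Via Plant Model]
[Bellman Recursion & Chain Rule used in above.]
Plant model is needed to calculate partial derivatives for DHP.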
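The two training signals can be checked on a toy problem where the answer is known in closed form. The sketch below assumes a scalar linear plant, a quadratic utility, and a fixed linear controller u = w*R; none of these choices or numbers come from the slides. For that case J(R) = p*R^2, so the exact critic output is 2*p*R, and the critic's desired output computed from the recursion should reproduce it.

```python
# Hypothetical scalar example (none of these numbers come from the slides):
#   plant:      R(t+1) = A*R(t) + B*u(t)
#   utility:    U(t)   = Q*R(t)**2 + RU*u(t)**2
#   controller: u(t)   = w*R(t)
A, B, Q, RU, GAMMA = 0.9, 0.5, 1.0, 0.1, 0.95


def controller_signal(u, lam_next):
    """dJ(t)/du(t) = dU/du + gamma * lambda(t+1) * dR(t+1)/du."""
    return 2.0 * RU * u + GAMMA * lam_next * B


def critic_target(R, w, lam_next):
    """Desired critic output for state R(t), given the critic's lambda(t+1)."""
    u = w * R
    dU_dR = 2.0 * Q * R + 2.0 * RU * u * w   # total derivative: direct + via u
    dRnext_dR = A + B * w                    # plant model: direct + via u
    return dU_dR + GAMMA * lam_next * dRnext_dR


def exact_lambda(R, w):
    """Closed-form dJ/dR for the fixed controller (needs GAMMA*(A+B*w)**2 < 1)."""
    a_cl = A + B * w
    p = (Q + RU * w * w) / (1.0 - GAMMA * a_cl * a_cl)   # J(R) = p*R**2
    return 2.0 * p * R


R, w = 0.8, -0.6
lam_next = exact_lambda((A + B * w) * R, w)
# critic_target(R, w, lam_next) and exact_lambda(R, w) agree up to rounding
```

Both functions need the plant derivatives (here the constants A and B), which is the role the plant model plays in DHP.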
-
Key ADP Process Parameters
Specification of:
-- state variables
-- size of NN structure, connection pattern, and type of activation functions
-- learn rate of weight update rules
-- discount factor (gamma of Bellman Equation)
-- scaling of variable values (unit hypersphere)
-- "lesson plan" for training
-
Pole-Cart Benchmark Problem
-
Experimental Procedure ("lesson plan"):
A. Train 3 passes through sequence (5, -10, 20, -5, -20, 10) [degrees from vertical]. Train 30 sec. on each angle.
B. Accumulate absolute values of U: C(1), C(2), C(3).
C. Perform TEST pass through train sequence (30 sec. each angle). Accumulate U values: C(4).
D. Perform GENERALIZE pass through sequence (-23, -18, -8, 3, 13, 23) [degrees from vertical] and accumulate U values: C(5).
E. Perform GENERALIZE pass through sequence (-38, -33, 23, 38) [degrees from vertical] and accumulate U values: C(6).
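The procedure above can be sketched as a small driver that produces the six accumulators C(1) through C(6). The callback `run_segment` is hypothetical: it stands for whatever simulation starts the pole at a given angle, runs for the given duration with training on or off, and returns the accumulated |U|. The 30-second duration for the GENERALIZE passes is an assumption; the slide states it only for the train and test passes.

```python
TRAIN_SEQ = (5, -10, 20, -5, -20, 10)      # degrees from vertical
GEN_SEQ_1 = (-23, -18, -8, 3, 13, 23)
GEN_SEQ_2 = (-38, -33, 23, 38)
SEGMENT_SEC = 30.0                         # assumed for every pass


def run_pass(angles, run_segment, train):
    """Accumulate |U| over one pass through a sequence of starting angles."""
    return sum(run_segment(angle, SEGMENT_SEC, train) for angle in angles)


def lesson_plan(run_segment):
    """Return {1: C(1), ..., 6: C(6)} following steps A through E."""
    C = {}
    for i in (1, 2, 3):                                     # steps A and B
        C[i] = run_pass(TRAIN_SEQ, run_segment, train=True)
    C[4] = run_pass(TRAIN_SEQ, run_segment, train=False)    # step C: TEST
    C[5] = run_pass(GEN_SEQ_1, run_segment, train=False)    # step D: GENERALIZE
    C[6] = run_pass(GEN_SEQ_2, run_segment, train=False)    # step E: GENERALIZE
    return C
```

Keeping the sequences as data makes it easy to compare strategies on exactly the same schedule.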
-
STRATEGIES TO SOLVE EARLIER EQUATIONS:
Strategy 1. Straight application of the equations.
Strategy 2. Basic 2-stage process [flip/flop]. [e.g., Santiago/Werbos, Prokhorov/Wunsch]
During stage 1, train criticNN, not actionNN; during stage 2, train actionNN, not criticNN.
Strategy 3. Modified 1st stage of 2-stage process.
While training criticNN during stage 1, keep parameters constant in the module that calculates the critic's desired output, $\lambda(R)$. Then adjust weights all at once at end of stage 1.
Strategy 4. Single-stage process, using modifications introduced in Strategy 3.
-
Progress of training process under each of the strategies. [Pole angles during the first 80 sec. of training. Note how fast Strategies 4a & 4b learn.]
-
Pole angles during part of test sequence. [Markings below and parallel to axis are plotting artifacts.]
Controller Responses to Disturbances
-
The following table lists the value of the performance measure for each DHP train strategy, averaged over 4 separate training runs.

Strategy:   1      2a     2b/3b  3a     4a     4b
C(1)        368    740    883    209    107    160
C(4)        3.2    3.1    2.9    2.8    2.4    2.2
C(5)        5.1    7.9    7.4    7.5    6.2    5.5
C(6)        30.5   24.1   25.4   20.7   14.5   14.0

We observe in this table that strategies 4a & 4b converge the fastest [lowest values of C(1)], and also appear to yield the best controllers [lowest values in the C(4) and C(6) rows, and among the lowest in the C(5) row]. We note separately that strategy 1 (when it converges) yields controllers on par with the better ones.
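The row-by-row comparison can be reproduced with a few lines. The sketch below simply re-enters the table values and ranks the strategies in each row from lowest (best) to highest; the helper name `ranking` is hypothetical.

```python
STRATEGIES = ("1", "2a", "2b/3b", "3a", "4a", "4b")
TABLE = {
    "C(1)": (368, 740, 883, 209, 107, 160),
    "C(4)": (3.2, 3.1, 2.9, 2.8, 2.4, 2.2),
    "C(5)": (5.1, 7.9, 7.4, 7.5, 6.2, 5.5),
    "C(6)": (30.5, 24.1, 25.4, 20.7, 14.5, 14.0),
}


def ranking(row):
    """Strategies ordered from lowest (best) to highest value in one row."""
    return [s for _, s in sorted(zip(TABLE[row], STRATEGIES))]


for row in TABLE:
    print(row, ranking(row))
```

Strategies 4a and 4b take the two lowest values in the C(1), C(4), and C(6) rows; in the C(5) row strategy 1 is lowest, followed by 4b and 4a.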
-
S1: Simultaneous/Classical Strategy
S2: Flip-Flop Strategy
S4: Shadow Critic Strategy
Introduce a copy of the criticNN in the lower loop. Run both loops simultaneously. Lower loop: train the copy for an epoch; upload the weight values from the copy (called shadow critic) into the active criticNN; repeat.
S5: Shadow Controller Strategy
Introduce a copy of the actionNN in the upper loop. Run both loops simultaneously. Upper loop: train the copy for an epoch; upload the weight values from the copy (called shadow controller) into the active actionNN; repeat.
S6: Double Shadow Strategy
Make use of the NN copies in both training loops. Run both loops simultaneously. Both loops: train the copies for an epoch; upload the weight values from the copies into their respective active NNs.
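A minimal sketch of the shadow-critic idea follows, under strong simplifying assumptions that are not from the slides: a scalar linear plant, a quadratic utility, a one-weight critic lambda(R) = c*R, and a controller held fixed at u = W*R so that only the critic loop is exercised. During each epoch the desired outputs are computed with the active critic, which stays constant, while the shadow copy is trained; the shadow weight is then uploaded.

```python
# Hypothetical scalar setup (not from the slides); controller fixed at u = W*R.
A, B, Q, RU, GAMMA, W = 0.9, 0.5, 1.0, 0.1, 0.95, -0.6
TRAIN_STATES = (-1.0, -0.5, 0.5, 1.0)


def desired_output(R, c_active):
    """Critic target for state R, using the ACTIVE critic lambda(R) = c_active*R."""
    u = W * R
    R_next = A * R + B * u
    dU_dR = 2.0 * Q * R + 2.0 * RU * u * W          # total derivative of U wrt R
    return dU_dR + GAMMA * (c_active * R_next) * (A + B * W)


def train_shadow_critic(epochs=40, passes=5, lrate=0.5):
    """Return the critic weight after repeated shadow-train / upload cycles."""
    c_active = 0.0
    for _ in range(epochs):
        c_shadow = c_active                          # copy of the active critic
        for _ in range(passes):
            for R in TRAIN_STATES:
                # Active critic is held constant while the shadow learns.
                error = desired_output(R, c_active) - c_shadow * R
                c_shadow += lrate * error * R        # delta-rule step
        c_active = c_shadow                          # upload shadow weights
    return c_active


c = train_shadow_critic()
# For this toy problem c approaches 2*p, with p = (Q + RU*W**2) / (1 - GAMMA*(A + B*W)**2)
```

Holding the target-generating critic fixed within an epoch gives the shadow copy a stationary regression problem, which is the motivation usually given for this kind of scheme.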
Lendaris, G.G., T.T. Shannon & A. Rustan (1999), "A Comparison of Training Algorithms for DHP Adaptive Critic Neuro-control," Proceedings of International Conference on Neural Networks '99 (IJCNN'99), Washington, DC, IEEE Press.
-
Autonomous Vehicle Example (Two-axle terrestrial vehicle)
Objective: Demonstrate design via DHP (from scratch) of a steering controller which effects a Change of Lanes (on a straight highway).
Controller Task: Specify sequence of steering angles to effect accelerations needed to change orientations of vehicle velocity vector.
-
Design Scenarios -- general:
Maintain approximately constant forward velocity during lane change.
Control angular velocity of wheel.
Thus, Controller needs 2 outputs:
1.) Steering angle
2.) Wheel rotation velocity
-
Major unknown for controller: Coefficient of Friction (cof) between tire/road at any point in time.
It can change abruptly (based on road conditions).
Requires Robustness and/or fast on-line adaptation.
-
Tire side-force/side-slip-angle as a function of COF
-
Assume on-board sensors for constructing state information
Only 3 accelerometers needed:
-- x direction accel. of chassis
-- lateral accel. at rear axle
-- lateral accel. at front axle
Plus load cells at front & rear axles (to give estimate of c.g. location)
[Simulation used analytic proxies]
-
Criteria for various Design Scenarios:
1. Velocity Error: reduce to zero.
2. Velocity Rate: limit rate of velocity corrections.
3. Y-direction Error: reduce to zero.
4. Lateral front axle acceleration: re. "comfort" design specification.
5. Friction Sense: estimate closeness to knee of cof curves.
-
Utility Functions for three Design Scenarios: [different combinations of above criteria]
1. U(1,2,3)
2. U(1,2,3,5)
3. U(1,2,3,4,5)
All applied to task of designing controller for autonomous 2-axle terrestrial vehicle.
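One way to picture a utility function assembled from a subset of the criteria is as a weighted sum of per-criterion penalty terms. That form, the unit default weights, and the penalty values below are assumptions for illustration only; the slides do not give the functional form of U.

```python
# Criterion numbers included in each of the three design scenarios.
SCENARIOS = {1: (1, 2, 3), 2: (1, 2, 3, 5), 3: (1, 2, 3, 4, 5)}


def utility(penalties, criteria, weights=None):
    """Sum the (optionally weighted) penalty terms of the selected criteria.

    penalties: dict mapping criterion number (1..5) to a nonnegative penalty
    criteria:  tuple of criterion numbers to include, e.g. (1, 2, 3)
    weights:   optional dict of per-criterion weights (assumed 1.0 if absent)
    """
    weights = weights or {}
    return sum(weights.get(c, 1.0) * penalties[c] for c in criteria)


# Hypothetical penalty values at one time step.
penalties = {1: 0.04, 2: 0.01, 3: 0.25, 4: 0.5, 5: 0.1}
U1 = utility(penalties, SCENARIOS[1])   # criteria 1, 2, 3
U2 = utility(penalties, SCENARIOS[2])   # adds criterion 5 (friction sense)
U3 = utility(penalties, SCENARIOS[3])   # adds criterion 4 (lateral acceleration)
```

Switching scenario only changes which terms enter the sum, which mirrors how the same DHP process is reused with different design criteria.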
-
Design Scenario 2.
Add Criterion 5 (friction sense) in U2. This is intended to allow aggressive lane changes on dry pavement, PLUS make lane changes on icy road conditions as aggressively as the icy road will allow.
-
Generalize Test: Ice patch at point where wheel angle is greatest in dry pavement case, retrained with coefficient modified.
-
Conclusions from Utility Function Expts.
Controller designs resulting via DHP satisfy an intuitive sense of being "good".
A control engineer knows that controller design requires careful specification of objectives, and that as the design criteria change, the controllers change.
For DHP, control objectives are contained in the Utility Function. The DHP process embodied the different requirements for the three design scenarios in qualitatively distinct controllers, all yielding intuitively "good" results according to the design constraints.
-
Conclusions from Utility Function Expts., cont.
Of particular interest to the present researchers is the ability of the DHP method to accept design criteria crafted into the Utility function by the human designers, and to then evolve a controller design whose response "looks and feels" like one a human designer might have designed.
-
Expanded task for Autonomous Vehicle Controller:
Additional road condition of encountering an ice patch while doing lane change; also, accommodate side-wind load disturbances.
-
Aircraft Control Augmentation System
-
System configuration during DHP Design of Controller Augmentation System
-
Pilot stick-x doublet signal (arbitrary scale in the Figure), and roll-rate responses of 3 aircraft: LoFLYTE w/Unaugmented Control, LoFLYTE w/Augmented Control, and LoFLYTE*. (Note: Responses of latter two essentially coincide.)
-
Stick-x doublet: pilot's stick signal vs. augmented signal (the latter is sent to aircraft actuators)
-
Roll-rate error (for above stick-x signal) between LoFLYTE* and LoFLYTE w/Unaugmented Control, and between LoFLYTE* and LoFLYTE w/Augmented Control signals.
-
[Figure: Roll1. Blue: LoFLYTE w/Unaugmented Control; Red: LoFLYTE w/Augmented Control; Black: LoFLYTE*]
-
[Figure: Roll2. Blue: LoFLYTE w/Unaugmented Control; Red: LoFLYTE w/Augmented Control; Black: LoFLYTE*]
-
Pitch-rate error (for above stick-x signal) between LoFLYTE* and LoFLYTE w/Unaugmented Control, and between LoFLYTE* and LoFLYTE w/Augmented Control signals.
-
Yaw-rate error (for above stick-x signal) between LoFLYTE* and LoFLYTE w/Unaugmented Control, and between LoFLYTE* and LoFLYTE w/Augmented Control signals.
-
Augmentation commands for stick-y and pedal that the controller learned to provide to make the induced a) pitch (stick-y) and b) yaw (pedal) responses of LoFLYTE match those of LoFLYTE*.
-
[Figure: Pitch w/ cg Shift. Blue: LoFLYTE w/Unaugmented Control; Red: LoFLYTE w/Augmented Control; Black: LoFLYTE*]
-
Conclusions re. Control Augmentation Example:
Basic motivation behind this work is the desire to ultimately generate a non-linear controller that has control capabilities equivalent to those of an experienced pilot.
A successful base demonstration was presented here.
-
Perform search over Criterion Function (J function) Surface in Weight (State) Space