Resilient Control in Cooperative and Adversarial Multi-agent Networks
Posted 10-Mar-2018
F.L. Lewis, National Academy of Inventors
Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
Talk available online at http://www.UTA.edu/UTARI/acs
Supported by: ONR – Tony Seman, Lynn Petersen, Marc Steinberg; US TARDEC – Dariusz Mikulski; AFOSR Europe; NSF
Reinforcement Learning for Resilient Control in Cooperative and Adversarial Multi-agent Networks:
CPS Applications in Microgrid and Human-Robot Interactions
Structure of Natural and Manmade Systems
J.J. Finnigan, Complex science for a complex world
Examples: the Internet, ecosystems, professional collaboration networks, the Barcelona rail network, clusters of galaxies
Peer-to-peer relationships in networked systems
Distributed decision & control – the local nature of physical laws
Flocking – Reynolds, Computer Graphics 1987
Reynolds’ Rules:
- Alignment: align headings
- Cohesion: steer towards the average position of neighbors – towards the c.g.
- Separation: steer to maintain separation from neighbors
Heading update (local voting protocol): $\dot\theta_i = \sum_{j\in N_i} a_{ij}(\theta_j - \theta_i)$
Nature and biological systems are resilient!
- Resilient Control in Cooperative Multi-agent Networks
- Modeling and Control of Adversarial Networked Teams
- CPS Applications: Distributed Control of Renewable Energy Microgrids; Shared Learning in Human-Robot Interactions

Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2014
Key Point
Lyapunov functions and performance indices must depend on the graph topology.
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Synchronized Motion of Biological Groups
- Fish school
- Birds flock
- Locusts swarm
- Fireflies synchronize
Nature and biological systems are resilient.
Communication Graph
[Figure: 6-node example graph]
Diameter = length of the longest path between two nodes
Volume = sum of in-degrees: $\mathrm{Vol} = \sum_{i=1}^N d_i$
Spanning tree: a root node that reaches all other nodes
Strongly connected: for all nodes $i$ and $j$ there is a path from $i$ to $j$
Tree: every node has in-degree 1; leader (root) node and followers
Communication Graph $(V,E)$ with $N$ nodes
Adjacency matrix $A=[a_{ij}]$, with $a_{ij}>0$ if $(v_j,v_i)\in E$, i.e. if $j\in N_i$
In-neighbors of node $i$: row sum = in-degree $d_i = \sum_{j=1}^N a_{ij}$
Out-neighbors of node $i$: column sum = out-degree $d_i^o = \sum_{j=1}^N a_{ji}$
Example (6 nodes; edge $a_{42}$ shown in the figure):
$A = \begin{bmatrix} 0&0&1&0&0&0\\ 1&0&0&0&0&1\\ 1&1&0&0&0&0\\ 0&1&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&1&1&0 \end{bmatrix}$
(John Baras – social standing)
Algebraic Graph Theory
Adjacency matrix $A=[a_{ij}]$
In-degree matrix $D=\mathrm{diag}\{d_1,\dots,d_N\}$
Graph Laplacian matrix $L = D - A$
[Figures: two 6-node example topologies]
Graph Eigenvalues for Different Communication Topologies
- Directed tree – chain of command
- Directed ring – gossip network: OSCILLATIONS
- Directed graph vs. undirected graph
[Figures: eigenvalue plots for 6-node directed tree, directed ring, directed and undirected graphs]
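The topology-dependence of the Laplacian spectrum is easy to check numerically; a minimal sketch, assuming a 6-node directed ring and a 6-node chain as in the figures:

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A, with D the diagonal in-degree matrix."""
    return np.diag(A.sum(axis=1)) - A

N = 6
# Directed ring ("gossip network"): node i listens only to node i-1
ring = np.zeros((N, N))
for i in range(N):
    ring[i, (i - 1) % N] = 1.0

# Directed tree / chain of command: node i listens only to node i-1, no wrap
tree = np.zeros((N, N))
for i in range(1, N):
    tree[i, i - 1] = 1.0

# Ring: complex Laplacian eigenvalues -> oscillatory consensus transients
print("ring eigenvalues:", np.round(np.linalg.eigvals(laplacian(ring)), 3))
# Tree: all eigenvalues real -> no oscillations
print("tree eigenvalues:", np.round(np.linalg.eigvals(laplacian(tree)), 3))
```

The ring Laplacian has eigenvalues $1-e^{2\pi\mathrm{i}k/6}$, which are complex, explaining the oscillations noted on the slide; the chain's Laplacian is triangular, so its eigenvalues are real.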
Dynamic Graph – the Distributed Structure of Control
Communication graph with $N$ nodes or agents; adjacency matrix $A=[a_{ij}]$; state at node $i$ is $x_i(t)$
Each node has an associated state: $\dot x_i = u_i$
Standard local voting protocol: $u_i = \sum_{j\in N_i} a_{ij}(x_j - x_i)$
Then $\dot x_i = \sum_{j\in N_i} a_{ij}x_j - d_i x_i$, so with $x=[x_1,\dots,x_N]^T$, $u=[u_1,\dots,u_N]^T$, $D=\mathrm{diag}\{d_1,\dots,d_N\}$:
$u = -Dx + Ax = -(D-A)x = -Lx$, where $L = D - A$ is the graph Laplacian matrix
Closed-loop dynamics: $\dot x = -Lx$
If each $x_i$ is an $n$-vector then $\dot x = -(L\otimes I_n)x$
Synchronization problem: $x_i(t) - x_j(t) \to 0$
Applications: formation control; synchronization of rigid-body spacecraft dynamics; synchronization of trust opinions in multi-agent networks; diurnal rhythm synchronization in nature.
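A minimal simulation of the local voting protocol $\dot x = -Lx$, using the 6-node adjacency matrix from the earlier slide:

```python
import numpy as np

# Adjacency matrix of the 6-node example graph from the earlier slide
A = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian L = D - A

# Local voting protocol: xdot_i = sum_j a_ij (x_j - x_i), i.e. xdot = -L x
x = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])
dt = 0.01
for _ in range(20000):              # forward-Euler integration to t = 200
    x = x + dt * (-L @ x)

print("final states:", np.round(x, 6))   # all entries nearly equal
print("spread:", x.max() - x.min())      # -> 0: consensus reached
```

This graph contains a spanning tree, so all states converge to a common consensus value (a weighted average of the initial states).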
Controlled Consensus: Cooperative Tracker
Node state $\dot x_i = u_i$. Distributed local voting protocol with control node $v$:
$u_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + b_i(v - x_i)$
with $b_i > 0$ if control node $v$ is in the neighborhood of node $i$, $b_i = 0$ otherwise; $B=\mathrm{diag}\{b_i\}$.
With global state $x=[x_1,\dots,x_N]^T$ this is $\dot x = -(L+B)x + B\underline{1}v$ (local neighborhood tracking error form).
Theorem. Let the graph have a spanning tree and $b_i \ne 0$ for at least one root node. Then $L+B$ is nonsingular with all eigenvalues having positive real parts, and $-(L+B)$ is asymptotically stable.
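The theorem can be checked numerically for the 6-node example graph, pinning the control node to node 1 (a root of a spanning tree):

```python
import numpy as np

# 6-node example graph; node 1 is the root of a spanning tree
A = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Pin the control node v to node 1 only: b_1 > 0, all other b_i = 0
b = np.zeros(6)
b[0] = 1.0
B = np.diag(b)

eigs = np.linalg.eigvals(L + B)
print("eigenvalues of L+B:", np.round(eigs, 4))
print("min real part:", eigs.real.min())   # > 0: -(L+B) is asymptotically stable
```

Without the pinning term $B$, the Laplacian $L$ always has a zero eigenvalue; pinning a root node shifts the whole spectrum into the open right half-plane.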
Agent Dynamics and Local Feedback Design
Agent dynamics $\dot x_i = A x_i + B u_i$, with a two-state pair $(A,B)$ and local feedback $u_i = K\varepsilon_i$, gain $K = [0.5\;\;0.5]$.
Couple 6 agents with the communication graph: the nodes synchronize to a consensus heading.
[Plot: agent trajectories in the (x, y) plane converging to a common heading]
Local neighborhood tracking error: $\varepsilon_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + g_i(x_0 - x_i)$, with control $u_i = K\varepsilon_i$ and leader state $x_0$ (the c.g.).
Example – Non-Resilience in Communication Networks
Same agent dynamics and local feedback design: $\dot x_i = A x_i + B u_i$, $u_i = K\varepsilon_i$ with $K=[0.5\;\;0.5]$, and local neighborhood tracking error $\varepsilon_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + g_i(x_0 - x_i)$.
ADD another comm. link – more information flow.
Causes an unstable formation! WHY?
[Plot: diverging agent trajectories in the (x, y) plane]
Trust Propagation and Consensus – Network Security
Foundation work by John Baras.
Define $\sigma_{ij}$ as the trust that node $i$ has for node $j$, with $\sigma_{ij}\in[-1,1]$: $-1$ = distrust, $0$ = no opinion, $+1$ = complete trust.
Define the trust vector of node $i$ as the $N$-vector $\sigma_i = [\sigma_{i1},\dots,\sigma_{iN}]^T$ (e.g. entry $\sigma_{i3}$ is the trust node $i$ has for node 3).
Standard local voting protocol: $\dot\sigma_i = \sum_{j\in N_i} a_{ij}(\sigma_j - \sigma_i)$ – difference of opinion with neighbors.
Inspired by social behavior in flocks, herds, teams.
Closed-loop trust dynamics: $\dot\sigma = -(L\otimes I_N)\sigma$, with global trust vector $\sigma=[\sigma_1^T,\dots,\sigma_N^T]^T\in\mathbb{R}^{N^2}$.
Trust Propagation & Consensus
Six-node network. Nodes 1, 2, 4 initially distrust node 5 (initial trusts are negative).
[Plot: convergence of trust components vs. time, values in $[-1,1]$]
Other nodes agree that node 5 has negative trust.
Nodes converge to a consensus heading.
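A sketch of the trust-propagation protocol on the 6-node graph; the specific initial trust values below are illustrative, not from the slides:

```python
import numpy as np

# 6-node example graph
A = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# sigma[i, j] = trust that node i has for node j, in [-1, 1]
rng = np.random.default_rng(0)
sigma = rng.uniform(0.2, 0.8, size=(6, 6))
sigma[[0, 1, 3], 4] = -0.8      # nodes 1, 2, 4 initially distrust node 5

dt = 0.01
for _ in range(30000):          # sigma_dot_i = sum_j a_ij (sigma_j - sigma_i)
    sigma = sigma + dt * (-L @ sigma)

print("consensus trust vector:", np.round(sigma[0], 4))
print("max disagreement:", np.abs(sigma - sigma[0]).max())   # -> 0
```

Each column of sigma obeys the same consensus dynamics, so the network converges to a single shared trust opinion about every node.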
Trust-Based Control: Swarms/Formations
Trust dynamics: $\dot\sigma_i = \sum_{j\in N_i} a_{ij}(\sigma_j - \sigma_i)$
Motion dynamics, heading angle $\theta_i$, trust-weighted protocol: $\dot\theta_i = \sum_{j\in N_i} a_{ij}\,\sigma_{ij}^c(\theta_j - \theta_i)$, with
$\dot x_i = V\cos\theta_i,\qquad \dot y_i = V\sin\theta_i$
[Plots: convergence of trust and convergence of headings vs. time; (x, y) trajectories converging to a common heading]
Trust-Based Control: Swarms/Formations – Malicious Node
Node 5 injects negative trust values into the trust-weighted protocol $\dot\theta_i = \sum_{j\in N_i} a_{ij}\,\sigma_{ij}^c(\theta_j - \theta_i)$.
Divergence of trust and divergence of headings: causes an unstable formation.
[Plots: diverging trust values, headings, and (x, y) trajectories]
Internal attack: a malicious node puts out bad trust values, i.e. false information (c.f. virus propagation).
Trust-Based Control: Swarms/Formations – CUT OUT the Malicious Node
Node 5 injects negative trust values. If node 3 distrusts node 5, cut out node 5 from the trust-weighted protocol $\dot\theta_i = \sum_{j\in N_i} a_{ij}\,\sigma_{ij}^c(\theta_j - \theta_i)$.
Convergence of trust: the other nodes agree that node 5 has negative trust. Restabilizes the formation: convergence of headings.
[Plots: trust values in $[-1,1]$, heading angles, and (x, y) trajectories with node 5 excluded]
Work by Sajal Das
Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
Optimality in Biological Systems – Why are Biological Systems Resilient?
Cell homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular metabolism; permeability control of the cell membrane.
http://www.accessexcellence.org/RC/VL/GG/index.html
Reinforcement Learning
Every living organism improves its control actions based on rewards received from the environment.
The resources available to living organisms are usually meager: nature uses optimal control.
1. Apply a control. Evaluate the benefit of that control.
2. Improve the control policy.
RL finds optimal policies by evaluating the effects of suboptimal policies.
Optimality provides an organizational principle for behavior: Charles Darwin showed that optimal control over long timescales is responsible for natural selection of species.
Different Methods of Learning – Actor-Critic Learning
Reinforcement learning (Ivan Pavlov, 1890s): an adaptive learning system applies control inputs to the system and observes its outputs in the environment; a critic compares performance against the desired performance and produces a reinforcement signal that tunes the actor.
We want OPTIMAL performance: ADP – Approximate Dynamic Programming (Sutton & Barto book).
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
Books: F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on reinforcement learning and differential games.
F.L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
F. Lewis, D. Vrabie, and K. Vamvoudakis, “Reinforcement learning and feedback control,” IEEE Control Systems Magazine, Dec. 2012.
Optimal Control – The Linear Quadratic Regulator (LQR)
System $\dot x = Ax + Bu$, control $u = -Kx$.
User-prescribed optimization criterion, with weights $(Q,R)$: $V(x(t)) = \int_t^\infty \big(x^TQx + u^TRu\big)\,d\tau$
Off-line design loop using the ARE: $0 = A^TP + PA + Q - PBR^{-1}B^TP$, then $K = R^{-1}B^TP$; on-line real-time control loop.
A FORMAL DESIGN PROCEDURE – but an offline procedure that requires knowledge of the system dynamics model $(A,B)$. System modeling is expensive, time consuming, and inaccurate.
OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS.
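The offline LQR design loop is a few lines with SciPy; the $(A,B,Q,R)$ values here are illustrative, not from the slides:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative two-state plant (assumed values)
A = np.array([[0.0, 1.0],
              [-1.0, -1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.eye(1)

# Offline design: solve 0 = A'P + PA + Q - P B R^-1 B' P for P
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)       # K = R^-1 B' P

residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("K =", K)
print("ARE residual norm:", np.linalg.norm(residual))
closed = np.linalg.eigvals(A - B @ K)
print("closed-loop eigenvalues:", closed)   # open left half-plane
```

Note that the design requires the full model $(A,B)$ up front, which is exactly the limitation the online RL methods below remove.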
Applications at Boeing Defense, Space & Security – Kevin Wise and Eugene Lavretsky
Highly reliable adaptive uncertainty-approximation compensators for flight control applications: unmanned aircraft – Phantom Ray.
Tailkit adaptive control systems for Joint Direct Attack Munition (JDAM) munitions: Mk-82, Mk-84, and laser-guided variants. Fielded for defense in Iraq and Afghanistan.
Adaptive Control Structures for Resilience:
A. Optimal control  B. Zero-sum games  C. Non-zero-sum games
For each: 1. System dynamics  2. Value/cost function  3. Bellman equation  4. HJ solution equation (Riccati eq.)  5. Policy iteration – gives the structure we need
We want to find optimal control solutions online in real time, using adaptive control techniques, without knowing the full dynamics – for nonlinear systems and general performance indices.
Adaptive control is online and works for unknown systems, but is generally not optimal. Optimal control is offline, and needs to know the system dynamics to solve the design equations.
We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this! RL brings together optimal control and adaptive control.
Derivation of the Nonlinear Optimal Regulator
Nonlinear system dynamics: $\dot x = f(x,u) = f(x) + g(x)u$
Cost/value: $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^TRu\big)\,d\tau$
Leibniz gives the differential equivalent – the Bellman equation, in terms of the Hamiltonian function:
$H\big(x,u,\tfrac{\partial V}{\partial x}\big) = r(x,u) + \tfrac{\partial V^T}{\partial x}\dot x = r(x,u) + \tfrac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = 0$
Stationarity condition $\tfrac{\partial H}{\partial u} = 0$ gives the stationary control policy:
$u = h(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V}{\partial x}$
Substituting yields the Hamilton-Jacobi-Bellman equation:
$0 = Q(x) + \tfrac{\partial V^{*T}}{\partial x}f(x) - \tfrac14\,\tfrac{\partial V^{*T}}{\partial x}\,g\,R^{-1}g^T\,\tfrac{\partial V^*}{\partial x}$
Off-line solution: the HJB is hard to solve and may not have a smooth solution; the dynamics must be known.
Integral Reinforcement Learning (IRL) – Draguna Vrabie
Same system $\dot x = f(x) + g(x)u$, value $V(x(t)) = \int_t^\infty \big(Q(x)+u^TRu\big)\,d\tau$, and stationary policy $u = h(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V}{\partial x}$ from the stationarity condition $\partial H/\partial u = 0$. To find online methods for optimal control, iterate on these two equations.
Split the value as $V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + \int_{t+T}^\infty r(x,u)\,d\tau$, which gives the "good" IRL Bellman equation:
$V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T)),\qquad V(0)=0$
This is equivalent to the CT Bellman equation $0 = r(x,u) + \tfrac{\partial V^T}{\partial x}f(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big)$, but solves the Bellman equation (a nonlinear Lyapunov equation) without knowing the system dynamics.
IRL Policy Iteration (an initial stabilizing control is needed):
Policy evaluation – IRL Bellman equation (cost update): $V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T))$
Policy improvement (control gain update): $u_{k+1} = h_{k+1}(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V_k}{\partial x}$
$f(x)$ and $g(x)$ do not appear in the cost update; $g(x)$ is needed only for the control update.
D. Vrabie proved convergence to the optimal value and control (Automatica 2009, Neural Networks 2009).
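For the LQR special case, this policy iteration reduces to Kleinman's algorithm: policy evaluation is a Lyapunov (Bellman) equation and policy improvement is $K_{k+1}=R^{-1}B^TP_k$. A minimal sketch with illustrative matrices:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative plant (assumed values)
A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

K = np.zeros((1, 2))      # initial stabilizing policy (A itself is stable here)
for _ in range(10):
    Ak = A - B @ K
    # Policy evaluation: Bellman (Lyapunov) equation Ak'P + P Ak + Q + K'RK = 0
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Policy improvement: K = R^-1 B' P
    K = np.linalg.solve(R, B.T @ P)

print("policy-iteration P:\n", P)
print("ARE P:\n", solve_continuous_are(A, B, Q, R))   # the two agree
```

IRL computes exactly this Lyapunov-equation solution, but from measured data rather than from $(A,B)$.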
Converges to the solution of the HJB equation:
$0 = Q(x) + \tfrac{dV^{*T}}{dx}f - \tfrac14\,\tfrac{dV^{*T}}{dx}\,g\,R^{-1}g^T\,\tfrac{dV^*}{dx}$
Value Function Approximation (VFA) to Solve the Bellman Equation
IRL Bellman equation: $V_k(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,d\tau + V_k(x(t+T))$
Approximate the value by a Weierstrass approximator network $V(x) = W^T\phi(x)$:
$W_k^T\phi(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,d\tau + W_k^T\phi(x(t+T))$
$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,d\tau$
A scalar equation with vector unknowns – the same form as standard system-ID problems: regression vector $\phi(x(t)) - \phi(x(t+T))$, reinforcement on the time interval $[t, t+T]$.
Now use RLS along the trajectory to get the new weights $W_k$, then find the updated feedback:
$u_{k+1}(x) = h_{k+1}(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V_k}{\partial x} = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial\phi^T(x(t))}{\partial x}W_k$
Real-time implementation: Approximate Dynamic Programming – direct optimal adaptive control for partially unknown CT systems.
Integral Reinforcement Learning (IRL) – data set at time $[t, t+T)$: $x(t),\ \rho(t,t+T),\ x(t+T)$
Repeating over intervals $[t, t+T),\ [t+T, t+2T),\ [t+2T, t+3T),\dots$: observe $x(t)$, apply $u_k = L_kx$, observe the cost integral, update $P$. Do RLS until convergence to $P_k$, then update the control gain $K_{k+1} = R^{-1}B^TP_k$.
Or use batch least-squares:
$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} x^T\big(Q + K_k^TRK_k\big)x\,d\tau \equiv \rho(t, t+T)$
Solving this Bellman equation solves the Lyapunov equation without knowing the dynamics.
This is a data-based approach that uses measurements of $x(t)$, $u(t)$ instead of the plant dynamical model. A is not needed anywhere.
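A minimal data-based sketch of the batch least-squares IRL loop; the second-order plant below is illustrative, and the model is used only to generate the measurements, never inside the learning loop (which sees only data and $B$):

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_are

# Plant, used ONLY to generate measurements
A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

def phi(x):
    """Quadratic basis: x'Px = W'phi(x) with W = [p11, 2*p12, p22]."""
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def run_interval(x0, K, T=0.5, n=500):
    """Closed-loop run over [t, t+T]: return x(t+T) and the measured
    reinforcement rho(t, t+T) = integral of x'(Q + K'RK)x dtau."""
    step = expm((A - B @ K) * T / n)
    M = Q + K.T @ R @ K
    x, rho, dt = x0.copy(), 0.0, T / n
    for _ in range(n):
        xn = step @ x
        rho += 0.5 * dt * (x @ M @ x + xn @ M @ xn)   # trapezoidal integral
        x = xn
    return x, rho

K = np.zeros((1, 2))                  # initial stabilizing gain
rng = np.random.default_rng(1)
for _ in range(8):                    # IRL policy iterations
    Phi, Rho = [], []
    for _ in range(8):                # batch of measured intervals
        x0 = rng.uniform(-2, 2, size=2)
        xT, rho = run_interval(x0, K)
        Phi.append(phi(x0) - phi(xT))
        Rho.append(rho)
    W = np.linalg.lstsq(np.array(Phi), np.array(Rho), rcond=None)[0]
    P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])
    K = np.linalg.solve(R, B.T @ P)   # gain update uses only B, never A

print("learned K:", K)
print("optimal K:", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))
```

Each batch solves $W_k^T[\phi(x(t))-\phi(x(t+T))]=\rho(t,t+T)$ in least squares, which is the Lyapunov equation in disguise; the gains converge to the ARE solution.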
Direct Optimal Adaptive Controller (Draguna Vrabie)
Actor/critic structure for CT systems: a hybrid continuous/discrete dynamic controller (with memory) whose internal state is the observed cost over the interval.
System $\dot x = Ax + Bu$; the critic measures $x^TQx + u^TRu$ through a ZOH with period $T$; the actor applies the gain $K$.
Run RLS or batch least-squares to identify the value of the current control; update the feedback gain after the critic has converged.
The reinforcement interval $T$ can be selected online, on the fly – it can change.
Solves the Riccati equation online without knowing the A matrix.
OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS.
Optimal Adaptive IRL for CT Systems – a New Structure of Adaptive Controllers (D. Vrabie, 2009)
$V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T)),\qquad u_{k+1}(x) = h_{k+1}(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V_k}{\partial x}$
Analogy with the brain (Doya, Kimura, Kawato 2001): theta waves at 4-8 Hz – reinforcement learning; motor control at 200 Hz (spinal cord); limbic system.
Oscillation is a fundamental property of neural tissue. The brain has multiple adaptive clocks with different timescales:
- theta rhythm (hippocampus, thalamus, 4-10 Hz): sensory processing, memory and voluntary control of movement; deliberative evaluation
- gamma rhythms (30-100 Hz, hippocampus and neocortex): high cognitive activity – consolidation of memory, spatial mapping of the environment (place cells)
The high-frequency processing is due to the large amounts of sensory data to be processed.
OPTIMALITY AND RESILIENCE OF BIOLOGICAL SYSTEMS
D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller – A Continuous-Time Formulation,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
Simulation 1 – F-16 aircraft pitch rate controller (Stevens and Lewis 2003)
$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215\\ 0.82225 & -1.07741 & -0.17555\\ 0 & 0 & -1 \end{bmatrix}x + \begin{bmatrix}0\\0\\1\end{bmatrix}u$
with state $x = [\alpha\ \ q\ \ \delta_e]^T$ and weights $Q = I$, $R = I$. Select a quadratic NN basis set for VFA.
Exact solution of the ARE $0 = A^TP + PA + Q - PBR^{-1}B^TP$:
$W^{*T} = [p_{11}\ \ 2p_{12}\ \ 2p_{13}\ \ p_{22}\ \ 2p_{23}\ \ p_{33}] = [1.4245\ \ 1.1682\ \ {-0.1352}\ \ 1.4349\ \ {-0.1501}\ \ 0.4329]$
[Plots vs. time: system states, control signal, controller parameters, and critic parameters P(1,1), P(1,2), P(2,2) converging to their optimal values]
Simulations on the F-16 autopilot: the A matrix is not needed; the critic parameters converge to the steady-state Riccati equation solution. Solves the ARE $0 = A^TP + PA + Q - PBR^{-1}B^TP$ online without knowing A.
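The quoted exact solution can be checked by solving the ARE for the F-16 model directly; the sign pattern of $A$ below is restored from the published version of this example and should be treated as an assumption:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# F-16 pitch-axis model; signs restored (assumption, see lead-in)
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B = np.array([[0.0], [0.0], [1.0]])
Q = np.eye(3)
R = np.eye(1)

P = solve_continuous_are(A, B, Q, R)
# Critic weight vector W* = [p11, 2p12, 2p13, p22, 2p23, p33]
W = np.array([P[0, 0], 2 * P[0, 1], 2 * P[0, 2],
              P[1, 1], 2 * P[1, 2], P[2, 2]])
print("W =", np.round(W, 4))    # compare against the slide's exact solution

residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("ARE residual norm:", np.linalg.norm(residual))
```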
Simulation 2: Load Frequency Control of an Electric Power System (A. Al-Tamimi, D. Vrabie, Youyi Wang)
$\dot x = Ax + Bu$, with state $x(t) = [\Delta f_g(t)\ \ \Delta P_g(t)\ \ \Delta X_g(t)\ \ \Delta E(t)]^T$: frequency, generator output, governor position, integral control.
$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0\\ 0 & -1/T_T & 1/T_T & 0\\ -1/(RT_G) & 0 & -1/T_G & -1/T_G\\ K_E & 0 & 0 & 0 \end{bmatrix},\qquad B = \begin{bmatrix} 0\\ 0\\ 1/T_G\\ 0\end{bmatrix}$
ARE solution using the full dynamics model $(A,B)$, from $0 = A^TP + PA + Q - PBR^{-1}B^TP$:
$P_{ARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751\\ 0.4766 & 0.7831 & 0.1237 & 0.3829\\ 0.0601 & 0.1237 & 0.0513 & 0.0298\\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}$
[Plot: P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) converging vs. time]
Data per update: $x(t),\ x(t+T),\ \rho(t:t+T)$, with IRL period $T = 0.1$ s and fifteen data points per batch. Hence, the value estimate was updated every 1.5 s. Solves the ARE online without knowing A.
$P_{critic\;NN} = \begin{bmatrix} 0.4802 & 0.4768 & 0.0603 & 0.4754\\ 0.4768 & 0.7887 & 0.1239 & 0.3834\\ 0.0603 & 0.1239 & 0.0567 & 0.0300\\ 0.4754 & 0.3843 & 0.0300 & 2.3433 \end{bmatrix}$
[Plot: system states vs. time]
Adaptive Control (plant, control, output):
- Identify the system model – indirect adaptive control
- Identify the controller – direct adaptive control
- Identify the performance value $V(x) = W^T\phi(x)$ – optimal adaptive control
OPTIMALITY YIELDS RESILIENCE, ROBUSTNESS, AND PERFORMANCE GUARANTEES
Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
Bounded Rationality – fast decisions using a new concept of graphical games to limit each agent's decision horizon.
Speed up multi-agent decision-making by imposing sparse, efficient communication structures: faster decisions based on local information; under certain conditions, the local decisions are globally optimal; Nash equilibrium on graphs is defined.
Manufacturing as the Interaction of Multiple Agents
Each machine has its own dynamics and cost function; neighboring machines influence each other most strongly; there are local optimization requirements as well as global necessities.
Each process has its own dynamics:
$\dot\delta_i = A\delta_i - (d_i+g_i)B_iu_i + \sum_{j\in N_i} e_{ij}B_ju_j$
and cost function:
$J_i\big(\delta_i(0),u_i,u_{-i}\big) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i} u_j^TR_{ij}u_j\Big)dt$
Each process helps other processes achieve optimality and efficiency.
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2014
Key Point
Lyapunov functions and performance indices must depend on the graph topology.
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Graphical Games: Synchronization – the Cooperative Tracker Problem
Node dynamics: $\dot x_i = Ax_i + B_iu_i$, with $x_i(t)\in\mathbb{R}^n$, $u_i(t)\in\mathbb{R}^{m_i}$
Target generator (leader) dynamics: $\dot x_0 = Ax_0$
Synchronization problem: $x_i(t)\to x_0(t),\ \forall i$
Local neighborhood tracking error (Lihua Xie): $\delta_i = \sum_{j\in N_i} e_{ij}(x_j - x_i) + g_i(x_0 - x_i)$
Local neighborhood tracking error dynamics – local agent dynamics driven by neighbors’ controls:
$\dot\delta_i = A\delta_i - (d_i+g_i)B_iu_i + \sum_{j\in N_i} e_{ij}B_ju_j$
Define the local neighborhood performance index – values driven by neighbors’ controls:
$J_i\big(\delta_i(0),u_i,u_{-i}\big) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt = \tfrac12\int_0^\infty L_i\big(\delta_i(t),u_i(t),u_{-i}(t)\big)\,dt$
K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.
Impose Optimality to get Resilience and Robustness
New Differential Graphical Game
State dynamics of agent $i$ – local dynamics: $\dot\delta_i = A\delta_i - (d_i+g_i)B_iu_i + \sum_{j\in N_i}e_{ij}B_ju_j$
Control action of player $i$: $u_i$
Local value function of player $i$ – depends only on graph neighbors:
$J_i\big(\delta_i(0),u_i,u_{-i}\big) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$
OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS.
Standard Multi-Agent Differential Game (for comparison)
Central dynamics: $\dot z = Az + \sum_{i=1}^N B_iu_i$
Control action of player $i$: $u_i$
Value function of player $i$ – depends on ALL other control actions:
$J_i\big(z(0),u_1,\dots,u_N\big) = \tfrac12\int_0^\infty\Big(z^TQ_iz + \sum_{j=1}^N u_j^TR_{ij}u_j\Big)dt$
Team Interest vs. Self Interest: Cooperation vs. Collaboration
The objective function of each player can be written as a team-average term plus a conflict-of-interest term. For three players:
$J_1 = \tfrac13(J_1+J_2+J_3) + \tfrac13\big((J_1-J_2)+(J_1-J_3)\big) = J_{team} + J_1^{coi}$
$J_2 = \tfrac13(J_1+J_2+J_3) + \tfrac13\big((J_2-J_1)+(J_2-J_3)\big) = J_{team} + J_2^{coi}$
$J_3 = \tfrac13(J_1+J_2+J_3) + \tfrac13\big((J_3-J_1)+(J_3-J_2)\big) = J_{team} + J_3^{coi}$
For $N$ players:
$J_i = \tfrac1N\sum_{j=1}^N J_j + \tfrac1N\sum_{j=1}^N (J_i - J_j) = J_{team} + J_i^{coi},\qquad i=1,\dots,N$
For $N$-player zero-sum games, the first term is zero, i.e. the players have no goals in common.
New Definition of Nash Equilibrium for Graphical Games
Def: Interactive Nash equilibrium. Policies $u_1^*, u_2^*, \dots, u_N^*$ are in Interactive Nash equilibrium if:
1. $J_i^* = J_i\big(u_i^*, u_{G_i}^*\big) \le J_i\big(u_i, u_{G_i}^*\big),\ \forall i$ – i.e. they are in Nash equilibrium;
2. For every pair $i,j$ there exists a policy $u_j$ such that $J_i\big(u_j^*, u_{G_j}^*\big) \ne J_i\big(u_j, u_{G_j}^*\big)$.
That is, every player can find a policy that changes the value of every other player. This restores the symmetry of Nash equilibrium.
Theorem 3. Let $(A,B_i)$ be reachable for all $i$. Let agent $i$ be in local best response: $J_i\big(u_i^*, u_{-i}\big) \le J_i\big(u_i, u_{-i}\big),\ \forall i$. Then $u_1^*, u_2^*, \dots, u_N^*$ are in global Interactive Nash equilibrium iff the graph is strongly connected.
Graphical Game Solution Using Integral Reinforcement Learning
Value function: $V_i(\delta_i(t)) = \tfrac12\int_t^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$
The differential equivalent (Leibniz formula) is Bellman's equation:
$H_i\Big(\delta_i,\tfrac{\partial V_i}{\partial\delta_i},u_i,u_{-i}\Big) = \tfrac{\partial V_i^T}{\partial\delta_i}\Big(A\delta_i-(d_i+g_i)B_iu_i+\sum_{j\in N_i}e_{ij}B_ju_j\Big) + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12 u_i^TR_{ii}u_i + \tfrac12\sum_{j\in N_i}u_j^TR_{ij}u_j = 0$
Stationarity condition: $\tfrac{\partial H_i}{\partial u_i} = 0 \;\Rightarrow\; u_i = (d_i+g_i)R_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i}$
1. Coupled HJ equations:
$0 = \tfrac{\partial V_i^T}{\partial\delta_i}A_i^c\delta_i + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12(d_i+g_i)^2\tfrac{\partial V_i^T}{\partial\delta_i}B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \tfrac12\sum_{j\in N_i}(d_j+g_j)^2\tfrac{\partial V_j^T}{\partial\delta_j}B_jR_{jj}^{-1}R_{ij}R_{jj}^{-1}B_j^T\tfrac{\partial V_j}{\partial\delta_j},\quad \forall i$
where $A_i^c\delta_i = A\delta_i - (d_i+g_i)^2B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \sum_{j\in N_i}e_{ij}(d_j+g_j)B_jR_{jj}^{-1}B_j^T\tfrac{\partial V_j}{\partial\delta_j}$
2. Best-response HJ equations – the other players have fixed policies $u_{-i}$:
$0 = H_i\Big(\delta_i,\tfrac{\partial V_i}{\partial\delta_i},u_i^*,u_{-i}\Big) = \tfrac{\partial V_i^T}{\partial\delta_i}A_i^c\delta_i + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12(d_i+g_i)^2\tfrac{\partial V_i^T}{\partial\delta_i}B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \tfrac12\sum_{j\in N_i}u_j^TR_{ij}u_j$
where $A_i^c\delta_i = A\delta_i - (d_i+g_i)^2B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \sum_{j\in N_i}e_{ij}B_ju_j$
To find online methods for the graphical game, iterate on these two equations.
Online Solution of Graphical Games: use reinforcement learning – POLICY ITERATION, with convergence results. MULTI-AGENT LEARNING (Kyriakos Vamvoudakis).
Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
What is a micro-grid?
- A micro-grid is a small-scale power system that provides the power for a group of consumers.
- A micro-grid enables local power support for local and critical loads.
- A micro-grid has the ability to work in both grid-connected and islanded modes.
- A micro-grid facilitates the integration of Distributed Energy Resources (DER).
Photo from: http://www.horizonenergygroup.com
Micro-grid applications: the main building block of smart grids; rural plants; business buildings, hospitals, and factories.
Smart-grid photo from: http://www.sustainable-sphere.com
Distributed Generators (DG) / Distributed Energy Resources (DER):
- Non-renewables: internal combustion engines, micro-turbines, fuel cells
- Renewables: photovoltaic, wind, hydroelectric, biomass
Micro-grid Advantages
- A micro-grid provides high-quality, reliable power to critical consumers.
- During main-grid disturbances, a micro-grid can quickly disconnect from the main grid and provide reliable power for its local loads.
- DGs can simply be installed close to the loads, which significantly reduces power transmission line losses.
- By using renewable energy resources, a micro-grid reduces CO2 emissions.
Micro-grid Hierarchical Control Structure (microgrid tied to the main grid)
- Tertiary control: optimal operation in both operating modes; power flow control in grid-tied mode.
- Secondary control: voltage deviation mitigation; frequency deviation alleviation. Do cooperative control here to synchronize frequency and voltage – maintains stability.
- Primary control: voltage stability provision; frequency stability preserving; plug-and-play capability for DGs.
Bidram, A., & Davoudi, A., “Hierarchical structure of microgrids control system,” IEEE Transactions on Smart Grid, vol. 3, pp. 1963-1976, Dec. 2012.
Cooperative Game-theoretic Control of Active Loads in DC Microgrids
Ling-ling Fan, Vahidreza Nasirian, Hamidreza Modares, Frank L. Lewis, Yong-duan Song, and Ali Davoudi
[Figure: power buffer operation during a step change in power demand – input power $p_{in}$, output power $p_{out}$, and stored energy $e$ at times $t_1$, $t_2$, $t_3$]
[Figure: power buffers in a microgrid network, with line resistances $r_{ij}$, source $v_{s1}$, bus voltages $v_i$, controls $u_i$, and stored energies $e_i$]
Active Load Power Buffer:
$\dot e_i = \dfrac{v_i^2}{r_i} - p_i,\qquad \dot r_i = u_i$
with stored energy $e_i$, input impedance $r_i$, bus voltage $v_i$, control input $u_i$, and output power $p_i$ (a disturbance).
Define coupled performance indices:
$J_i = \int_0^\infty\Big(\mathbf{x}^TQ_i\mathbf{x} + \sum_{j}\rho_{ij}u_j^2\Big)dt,\qquad i = M+1,\dots,M+N$
Solve for the bus voltage to get coupled agent dynamics in the buffer states $x_i = [e_i\ \ r_i\ \ p_i]^T$: linear dynamics driven by the agent's own control, its neighbors' controls, and a disturbance $w_i$, for $i = M+1,\dots,M+N$.
Define the communication graph: a sparse, efficient topology. Optimal design provides resilience and disturbance rejection.
Vahid Nasirian, Reza Modares, Dr. Ali Davoudi
Micro-grid Secondary Control: Distributed CPS Structure
[Figure: microgrid with DG 1–DG 8 and the cyber communication framework with communication links]
Cyber-Physical System (CPS):
- Physical layer: the interconnect structure of the power grid
- Cyber layer: a sparse, efficient communication network to allow cooperative control for synchronization of voltage and frequency
Work of Ali Bidram with Dr. A. Davoudi
Resilience because of: a sparse, efficient communication network; optimal design – graph games; distributed computation – no single point of failure.
Case Study: physical microgrid layer and cyber communication layer – a 48 V DC distribution network with three active loads and three sources.
[Plots vs. time for the case study: (a) MG voltage (V), (b) buffer voltage (V), (c) load voltage (V), (d) source current (A), (e) stored energy (J), (f) input resistance (Ω), (g) output power (W), (h) stored energy vs. input resistance for buffers 4, 5, 6; events A–D mark a load change at terminal 5]
On-policy RL
- Target policy: the policy that we are learning about.
- Behavior policy: the policy that generates actions and behavior.
In on-policy RL the target policy and the behavior policy are the same.
Off-policy RL
In off-policy RL the target policy and the behavior policy are different.
Humans can learn optimal policies while actually applying suboptimal policies.
Off-policy IRL
System $\dot x = f(x) + g(x)u$, value $J(x) = \int_t^\infty r\big(x(\tau),u(\tau)\big)\,d\tau$
On-policy IRL:
$J^{[i]}\big(x(t)\big) - J^{[i]}\big(x(t+T)\big) = \int_t^{t+T}\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau,\qquad u^{[i+1]} = -\tfrac12 R^{-1}g^T\nabla J^{[i]}$
Off-policy IRL: write the dynamics as $\dot x = f + gu^{[i]} + g\big(u - u^{[i]}\big)$, where $u$ is the applied (behavior) control. Then
$J^{[i]}\big(x(t)\big) - J^{[i]}\big(x(t+T)\big) = \int_t^{t+T}\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau + \int_t^{t+T} 2u^{[i+1]T}R\big(u - u^{[i]}\big)d\tau$
This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$: they can be found simultaneously online using measured data, using the Kronecker product and VFA.
1. Completely unknown system dynamics.
2. Can use the applied $u(t)$ for: disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015; robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012; exploring probing noise – without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012.
DDO – Yu Jiang & Zhong-Ping Jiang, Automatica 2012 (system, value): must know $g(x)$.
LQ Tracker Problem for Continuous-Time Systems
System dynamics: $\dot x = Ax + Bu,\qquad y = Cx$
Assume the reference trajectory is generated by $\dot y_d = Fy_d$.
Augmented system state: $X(t) = \big[x(t)^T\ \ y_d(t)^T\big]^T$
Augmented system: $\dot X = \begin{bmatrix} A & 0\\ 0 & F\end{bmatrix}X + \begin{bmatrix}B\\0\end{bmatrix}u \equiv TX + B_1u$
Value function: $V\big(X(t)\big) = \tfrac12\int_t^\infty e^{-\gamma(\tau-t)}\big(X^TQ_1X + u^TRu\big)d\tau = \tfrac12 X(t)^TPX(t)$, with $Q_1 = C_1^TQC_1,\ C_1 = [C\ \ {-I}]$
LQT Bellman equation:
$0 = \big(TX + B_1u\big)^TPX + X^TP\big(TX + B_1u\big) - \gamma X^TPX + X^TQ_1X + u^TRu$
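The LQT design reduces to a standard ARE for the shifted augmented matrix $T-\tfrac{\gamma}{2}I$, since $-\gamma P = (-\tfrac{\gamma}{2}I)^TP + P(-\tfrac{\gamma}{2}I)$. A sketch with illustrative plant and reference-generator values (assumptions, not from the slides):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative plant and reference generator
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
F = np.array([[0.0]])          # constant reference y_d
gamma = 0.2                    # discount factor

# Augmented system X = [x; y_d]
T = np.block([[A, np.zeros((2, 1))], [np.zeros((1, 2)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)])
Q1 = C1.T @ np.eye(1) @ C1
R = np.eye(1)

# Discounted ARE: T'P + PT - gamma*P + Q1 - P B1 R^-1 B1' P = 0,
# i.e. the standard ARE for the shifted matrix T - (gamma/2) I
P = solve_continuous_are(T - 0.5 * gamma * np.eye(3), B1, Q1, R)
K = np.linalg.solve(R, B1.T @ P)
print("tracker gain K =", K)

res = T.T @ P + P @ T - gamma * P + Q1 - P @ B1 @ np.linalg.solve(R, B1.T @ P)
print("discounted ARE residual norm:", np.linalg.norm(res))
```

The discount $\gamma>0$ is what makes the infinite-horizon tracking value finite even when $y_d$ does not decay.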
Online Solution to the CT LQT ARE: Off-Policy IRL
Write the augmented dynamics with a behavior input: $\dot X = TX + B_1u = T_iX + B_1\big(u + K^iX\big)$, where $T_i \equiv T - B_1K^i$.
Off-policy IRL Bellman equation:
$e^{-\gamma T}X(t+T)^TP^iX(t+T) - X(t)^TP^iX(t) = -\int_t^{t+T}e^{-\gamma(\tau-t)}X^T\big(Q_1 + K^{iT}RK^i\big)X\,d\tau + 2\int_t^{t+T}e^{-\gamma(\tau-t)}\big(u + K^iX\big)^TRK^{i+1}X\,d\tau$
Algorithm. Online off-policy IRL algorithm for LQT:
- Online step: apply a fixed control input and collect some data.
- Offline step: policy evaluation and improvement using least-squares on the collected data.
No knowledge of the dynamics is required.
Humans can learn optimal policies while actually applying suboptimal policies.
RL for Human-Robot Interactions
H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-robot Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, to appear 2015.
Two-loop HRI learning controller based on human-factors performance studies. Humans learn two components in HRI:
1. must learn to compensate for nonlinear robot dynamics – robot-specific inner control loop using neural adaptive learning;
2. must learn to perform the task – task-specific outer-loop control design using online reinforcement learning (IRL optimal tracking), with a prescribed impedance model.
Our result: the robot adapts in order to minimize human effort. Implemented on the PR2 robot.
[Diagram: human force $f_h$, prescribed impedance model, robot manipulator with torque design and feedforward control]
Modeling and Control of Adversarial Networked Teams
Zero-Sum Games
Bipartite Synchronization on Antagonistic Communication Graphs
Multi-agent Pursuit-Evasion Games
L2 Gain Problem

System:  ẋ = f(x) + g(x) u + k(x) d,  y = h(x)
Performance output:  z = [ y^T  u^T ]^T
Control:  u = l(x);  d: disturbance

Find control u(t) so that

∫_0^∞ ||z(t)||² dt / ∫_0^∞ ||d(t)||² dt = ∫_0^∞ ( h^T h + u^T u ) dt / ∫_0^∞ ||d(t)||² dt ≤ γ²

for all L2 disturbances and a prescribed gain γ.
H‐Infinity Control Using Neural Networks
Zero-Sum differential game: Nature as the opposing player
Disturbance Rejection
Two Antagonistic Inputs
V*(x(0)) = min_u max_d V(x(0), u, d) = min_u max_d ∫_0^∞ ( h^T(x) h(x) + u^T R u − γ² d^T d ) dt
Define 2‐player zero‐sum game as
min_u max_d V(x(0), u, d) = max_d min_u V(x(0), u, d)

The game has a unique value (saddle-point solution) iff this Nash condition holds
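For a linear system with quadratic cost, the saddle point follows from the game algebraic Riccati equation; a scalar sketch with made-up numbers:

```python
import numpy as np

# Scalar zero-sum LQ game:  dx/dt = a x + b u + k d,
# J = integral of (q x^2 + r u^2 - gamma^2 d^2) dt   (hypothetical numbers)
a, b, k, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 1.0, 2.0

# Game ARE:  2 a p + q - p^2 (b^2/r - k^2/gamma^2) = 0
s = b**2 / r - k**2 / gamma**2            # must be > 0 for a saddle point
p = (2 * a + np.sqrt(4 * a**2 + 4 * q * s)) / (2 * s)   # positive root

u_gain = -(b / r) * p                     # minimizing player:  u* = u_gain * x
d_gain = (k / gamma**2) * p               # maximizing player:  d* = d_gain * x

residual = 2 * a * p + q - p**2 * s       # ARE residual (should be ~0)
a_cl = a + b * u_gain + k * d_gain        # saddle-point closed loop (stable)
print(p, residual, a_cl)
```

The gain condition s > 0 is the scalar analogue of the attenuation level γ being large enough for the game to have a value.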
Optimal Game Design Yields:
Resilience
Robustness
Guaranteed Performance
Disturbance Rejection
Two Antagonistic Inputs
control node v
Cooperative and Adversarial Multi-agent Network: adversarial disturbance at each agent
Agents want to cooperate to synchronize to the control leader
Zero-sum Graphical Games
Cooperative Synchronization

System (agent i):  ẋ_i = A x_i + B u_i + D d_i,  y_i = C x_i
Leader:  ẋ_0 = A x_0,  y_0 = C x_0

Global error dynamics, δ = x − (1_N ⊗ x_0):
δ̇ = (I_N ⊗ A) δ + (I_N ⊗ B) u + (I_N ⊗ D) d,  y = (I_N ⊗ C) δ

Performance output (H∞):  ||z(t)||² = δ^T Q δ + u^T R u
Bounded L2 synchronization error:

∫_0^∞ ||z(t)||² dt / ∫_0^∞ d^T(t) T d(t) dt ≤ γ²,  where ||z(t)||² = δ^T Q δ + u^T R u

H∞ performance index:

J(u, d) = ∫_0^∞ ( δ^T Q δ + u^T R u − γ² d^T T d ) dt
Resilience to Network Disturbances
Condition on graph topology:  (L + G) + (L + G)^T > 0
Condition for existence of local OPFB:  R K C = B^T P + M
Kristian Hengster-Movric, Cooperative H∞ Synchronization
J. Gadewadikar, Frank L. Lewis, L. Xie, V. Kucera and M. Abu-Khalaf, “Parameterization of all stabilizing H∞static state-feedback gains: Application to output-feedback design,” Automatica, vol. 43, no. 9, pp. 1597-1604, September 2007
Condition on graph topology:  (L + G) + (L + G)^T > 0
Condition for existence of local OPFB:  R K C = B^T P + M
Neighborhood output error:  e_y = ((L + G) ⊗ C) δ
Fine: local OPFB structure  |  Coarse: global OPFB structure
Symmetry between OPFB and graph-topology information restrictions: the same structure appears in both
Communication Network Limits Available Information Exchange
Communication Network Tries to Destroy Optimality
Proper Controls Design Restores Optimality
u_i = c K e_{yi} = c K ( Σ_{j∈N_i} a_ij (y_j − y_i) + g_i (y_0 − y_i) ),  y_i = C x_i
Network info flow restrictions
Local system measurement restrictions
2-player Zero-sum Nash Equilibrium

Command generator:  ẋ_0 = A x_0,  y_0 = C x_0
Agent dynamics:  ẋ_i = A x_i + B u_i + D d_i,  y_i = C x_i
Nash Equilibrium
J(u, d) = ∫_0^∞ ( δ^T Q δ + u^T R u − γ² d^T T d ) dt
H‐inf performance index
Resilience to Network Disturbances
control node v
Cooperative and Adversarial Multi-agent Network: adversarial disturbance at each agent
Agents want to cooperate to synchronize to the control leader
Resilient Optimal Design for Adversarial Networks
Modeling and Control of Adversarial Networked Teams
Zero-Sum Games
Bipartite Synchronization on Antagonistic Communication Graphs
Multi-agent Pursuit-Evasion Games
Structural Balance for Networked Multi-agent Systems
In almost all human endeavors There are TWO Opposing Teams - WHY?
Sports
War
International conglomerates in macroeconomics
Social Psychology: Dynamics of Friendship Groups
Can be understood in terms of TRIANGLES of relationships
Stable groups:  (+, +, +),  (+, −, −)
Unstable groups:  (+, +, −),  (−, −, −)

Stable groups have an odd number of friendship links
Structurally Balanced = Stability
Bipartite Synchronization on Antagonistic Communication Graphs
Node BipartitionRed edge weights are negative
Two opposing antagonistic groups
Structural Balance
Black edge weightsare positive
All triangles have an odd number of positive edges
Adjacency matrix A = [a_ij]:

A = [ 0 0 1 0 0 0
      1 0 0 0 0 1
      1 1 0 0 0 0
      0 1 0 0 0 0
      0 0 1 0 0 0
      0 0 0 1 1 0 ]
In-neighbors of node i:  N_i
Absolute row sum:  d_i^s = Σ_{j∈N_i} |a_ij|
Absolute In-Degree Matrix:  D^s = diag(d_1^s, …, d_N^s)
Signed graph Laplacian matrix:  L^s = D^s − A
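These definitions can be sketched numerically (a hypothetical 3-node signed graph, not the slide's 6-node example), including the gauge transformation that reduces a structurally balanced signed Laplacian to an ordinary one:

```python
import numpy as np

# Hypothetical 3-node signed graph: nodes 0,1 are friends, node 2 opposes both
A = np.array([[ 0.,  1., -1.],
              [ 1.,  0., -1.],
              [-1., -1.,  0.]])

D_s = np.diag(np.abs(A).sum(axis=1))   # absolute in-degree matrix D^s
L_s = D_s - A                          # signed Laplacian  L^s = D^s - A

# Structural balance: a gauge matrix with +/-1 entries (one sign per camp)
# turns the signed Laplacian into an ordinary one with zero row sums
sigma = np.diag([1.0, 1.0, -1.0])      # camp labels: nodes {0,1} vs node {2}
L_gauged = sigma @ L_s @ sigma
print(L_s)
print(np.allclose(L_gauged.sum(axis=1), 0))
```

Note the rows of L^s itself do not sum to zero when antagonistic edges are present; zero row sums are recovered only after the gauge transformation, which is exactly what structural balance guarantees.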
Bipartite Consensus

Agent dynamics:  ẋ_i = A x_i + B u_i,  i = 1, …, N
Leader dynamics:  ẋ_0 = A x_0

New bipartite distributed control law:
u_i = c K ( Σ_{j∈N_i} ( a_ij x_j − |a_ij| x_i ) + g_i x_0 − |g_i| x_i )
One group goes to the leader's state  x_0
The opposing group goes to the negative of the leader's state  −x_0
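A minimal simulation of the bipartite law (single-integrator agents, i.e. A = 0, B = K = 1, on a hypothetical 3-node signed graph with nodes 0,1 opposing node 2; node 0 pinned to the leader):

```python
import numpy as np

# Signed graph: nodes {0,1} cooperate, node 2 is antagonistic to both
A = np.array([[ 0.,  1., -1.],
              [ 1.,  0., -1.],
              [-1., -1.,  0.]])
g = np.array([1., 0., 0.])        # node 0 is pinned to the leader
x0 = 1.0                          # leader state
c = 1.0                           # coupling gain

x = np.array([0.5, -0.2, 0.3])    # arbitrary initial agent states
dt = 0.01
for _ in range(5000):
    # Bipartite law: u_i = c ( sum_j (a_ij x_j - |a_ij| x_i) + g_i (x0 - x_i) )
    u = c * (A @ x - np.abs(A).sum(axis=1) * x + g * (x0 - x))
    x = x + dt * u                # single-integrator agents

print(np.round(x, 3))             # camp {0,1} -> +x0, node 2 -> -x0
```

The cooperating camp synchronizes to the leader while the antagonistic node converges to the negative of the leader's state, as the slide states.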
Time-Varying Edge Weights

Undirected graph, adjacency matrix A = A^T = [a_ij]

Adaptive edge weights:  dA/dt = A²,  i.e.  da_ij/dt = Σ_k a_ik a_kj

Theorem: For almost any initial weights A(0), the network becomes structurally balanced and splits into two antagonistic teams of almost the same size.
Emergent Behaviors: Emergence of Two Antagonistic Teams
S. A. Marvel, J. Kleinberg, R. D. Kleinberg, and S. H. Strogatz, “Continuous-time model of structural balance,” Proceedings of the National Academy of Sciences, vol. 108, no. 5, pp. 1771–1776, 2011.
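The Marvel et al. model can be simulated directly; a short sketch (random symmetric initial weights, Euler integration) that checks structural balance of the sign pattern at blowup:

```python
import numpy as np

# Integrate dA/dt = A^2 from random symmetric initial weights; Marvel et al.
# show the weights blow up in finite time with a structurally balanced
# sign pattern, splitting the nodes into two antagonistic factions.
rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric relationship weights

dt = 1e-4
while np.abs(A).max() < 1e6:           # Euler steps until weights diverge
    A = A + dt * (A @ A)

S = np.sign(A)
# Balance check: every triangle has an odd number of positive edges,
# i.e. the product of its three edge signs is positive
balanced = all(S[i, j] * S[j, k] * S[i, k] > 0
               for i in range(n)
               for j in range(i + 1, n)
               for k in range(j + 1, n))
print(balanced)
```

Near blowup the matrix is dominated by a rank-one term from the leading eigenvector of A(0), whose sign pattern defines the two teams.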
Emergent Behaviors: Emergence of Two Antagonistic Teams
dA/dt = A²
http://vodpod.com/watch/5255321-mathematical-model-shows-how-groups-split-into-factions-physorg-com
January 4, 2011, by Bill Steele
Our revels now are ended. These our actors, as I foretold you, were all spirits, and are melted into air, into thin air.
The cloud-capped towers, the gorgeous palaces, the solemn temples, the great globe itself, yea, all which it inherit, shall dissolve, and, like this insubstantial pageant faded, leave not a rack behind.
We are such stuff as dreams are made on, and our little life is rounded with a sleep.
Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare