Resilient Control in Cooperative and Adversarial Multi-agent Networks
Posted 10-Mar-2018
F.L. Lewis, National Academy of Inventors
Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
Talk available online at http://www.UTA.edu/UTARI/acs
Supported by: ONR – Tony Seman, Lynn Petersen, Marc Steinberg; US TARDEC – Dariusz Mikulski; AFOSR Europe; NSF
Reinforcement Learning for Resilient Control in Cooperative and Adversarial Multi-agent Networks:
CPS Applications in Microgrid and Human-Robot Interactions
Structure of Natural and Manmade Systems
J.J. Finnigan, Complex science for a complex world
Examples: the Internet, ecosystems, professional collaboration networks, the Barcelona rail network, clusters of galaxies
Peer-to-peer relationships in networked systems
Distributed decision & control – the local nature of physical laws
Flocking – Reynolds, Computer Graphics 1987
Reynolds’ Rules:
- Alignment: align headings
- Cohesion: steer towards the average position of neighbors – towards the c.g.
- Separation: steer to maintain separation from neighbors
Heading update (local voting protocol): $\dot\theta_i = \sum_{j\in N_i} a_{ij}(\theta_j - \theta_i)$
Nature and biological systems are resilient!
- Resilient Control in Cooperative Multi-agent Networks
- Modeling and Control of Adversarial Networked Teams
- CPS Applications: Distributed Control of Renewable Energy Microgrids; Shared Learning in Human-Robot Interactions

Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2014
Key Point
Lyapunov functions and performance indices must depend on the graph topology.
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Synchronized Motion of Biological Groups
- Fish school
- Birds flock
- Locusts swarm
- Fireflies synchronize
Nature and biological systems are resilient.
Communication Graph
[Figure: 6-node example graph]
Diameter = length of the longest path between two nodes
Volume = sum of in-degrees: $\mathrm{Vol} = \sum_{i=1}^N d_i$
Spanning tree: a root node that reaches all other nodes
Strongly connected: for all nodes $i$ and $j$ there is a path from $i$ to $j$
Tree: every node has in-degree 1; leader (root) node and followers
Communication Graph $(V,E)$ with $N$ nodes
Adjacency matrix $A=[a_{ij}]$, with $a_{ij}>0$ if $(v_j,v_i)\in E$, i.e. if $j\in N_i$
In-neighbors of node $i$: row sum = in-degree $d_i = \sum_{j=1}^N a_{ij}$
Out-neighbors of node $i$: column sum = out-degree $d_i^o = \sum_{j=1}^N a_{ji}$
Example (6 nodes; edge $a_{42}$ shown in the figure):
$A = \begin{bmatrix} 0&0&1&0&0&0\\ 1&0&0&0&0&1\\ 1&1&0&0&0&0\\ 0&1&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&1&1&0 \end{bmatrix}$
(John Baras – social standing)
Algebraic Graph Theory
Adjacency matrix $A=[a_{ij}]$
In-degree matrix $D=\mathrm{diag}\{d_1,\dots,d_N\}$
Graph Laplacian matrix $L = D - A$
[Figures: two 6-node example topologies]
Graph Eigenvalues for Different Communication Topologies
- Directed tree – chain of command
- Directed ring – gossip network: OSCILLATIONS
- Directed graph vs. undirected graph
[Figures: eigenvalue plots for 6-node directed tree, directed ring, directed and undirected graphs]
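The topology-dependence of the Laplacian spectrum is easy to check numerically; a minimal sketch, assuming a 6-node directed ring and a 6-node chain as in the figures:

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A, with D the diagonal in-degree matrix."""
    return np.diag(A.sum(axis=1)) - A

N = 6
# Directed ring ("gossip network"): node i listens only to node i-1
ring = np.zeros((N, N))
for i in range(N):
    ring[i, (i - 1) % N] = 1.0

# Directed tree / chain of command: node i listens only to node i-1, no wrap
tree = np.zeros((N, N))
for i in range(1, N):
    tree[i, i - 1] = 1.0

# Ring: complex Laplacian eigenvalues -> oscillatory consensus transients
print("ring eigenvalues:", np.round(np.linalg.eigvals(laplacian(ring)), 3))
# Tree: all eigenvalues real -> no oscillations
print("tree eigenvalues:", np.round(np.linalg.eigvals(laplacian(tree)), 3))
```

The ring Laplacian has eigenvalues $1-e^{2\pi\mathrm{i}k/6}$, which are complex, explaining the oscillations noted on the slide; the chain's Laplacian is triangular, so its eigenvalues are real.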
Dynamic Graph – the Distributed Structure of Control
Communication graph with $N$ nodes or agents; adjacency matrix $A=[a_{ij}]$; state at node $i$ is $x_i(t)$
Each node has an associated state: $\dot x_i = u_i$
Standard local voting protocol: $u_i = \sum_{j\in N_i} a_{ij}(x_j - x_i)$
Then $\dot x_i = \sum_{j\in N_i} a_{ij}x_j - d_i x_i$, so with $x=[x_1,\dots,x_N]^T$, $u=[u_1,\dots,u_N]^T$, $D=\mathrm{diag}\{d_1,\dots,d_N\}$:
$u = -Dx + Ax = -(D-A)x = -Lx$, where $L = D - A$ is the graph Laplacian matrix
Closed-loop dynamics: $\dot x = -Lx$
If each $x_i$ is an $n$-vector then $\dot x = -(L\otimes I_n)x$
Synchronization problem: $x_i(t) - x_j(t) \to 0$
Applications: formation control; synchronization of rigid-body spacecraft dynamics; synchronization of trust opinions in multi-agent networks; diurnal rhythm synchronization in nature.
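A minimal simulation of the local voting protocol $\dot x = -Lx$, using the 6-node adjacency matrix from the earlier slide:

```python
import numpy as np

# Adjacency matrix of the 6-node example graph from the earlier slide
A = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian L = D - A

# Local voting protocol: xdot_i = sum_j a_ij (x_j - x_i), i.e. xdot = -L x
x = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])
dt = 0.01
for _ in range(20000):              # forward-Euler integration to t = 200
    x = x + dt * (-L @ x)

print("final states:", np.round(x, 6))   # all entries nearly equal
print("spread:", x.max() - x.min())      # -> 0: consensus reached
```

This graph contains a spanning tree, so all states converge to a common consensus value (a weighted average of the initial states).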
Controlled Consensus: Cooperative Tracker
Node state $\dot x_i = u_i$. Distributed local voting protocol with control node $v$:
$u_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + b_i(v - x_i)$
with $b_i > 0$ if control node $v$ is in the neighborhood of node $i$, $b_i = 0$ otherwise; $B=\mathrm{diag}\{b_i\}$.
With global state $x=[x_1,\dots,x_N]^T$ this is $\dot x = -(L+B)x + B\underline{1}v$ (local neighborhood tracking error form).
Theorem. Let the graph have a spanning tree and $b_i \ne 0$ for at least one root node. Then $L+B$ is nonsingular with all eigenvalues having positive real parts, and $-(L+B)$ is asymptotically stable.
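The theorem can be checked numerically for the 6-node example graph, pinning the control node to node 1 (a root of a spanning tree):

```python
import numpy as np

# 6-node example graph; node 1 is the root of a spanning tree
A = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Pin the control node v to node 1 only: b_1 > 0, all other b_i = 0
b = np.zeros(6)
b[0] = 1.0
B = np.diag(b)

eigs = np.linalg.eigvals(L + B)
print("eigenvalues of L+B:", np.round(eigs, 4))
print("min real part:", eigs.real.min())   # > 0: -(L+B) is asymptotically stable
```

Without the pinning term $B$, the Laplacian $L$ always has a zero eigenvalue; pinning a root node shifts the whole spectrum into the open right half-plane.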
Agent Dynamics and Local Feedback Design
Agent dynamics $\dot x_i = A x_i + B u_i$, with a two-state pair $(A,B)$ and local feedback $u_i = K\varepsilon_i$, gain $K = [0.5\;\;0.5]$.
Couple 6 agents with the communication graph: the nodes synchronize to a consensus heading.
[Plot: agent trajectories in the (x, y) plane converging to a common heading]
Local neighborhood tracking error: $\varepsilon_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + g_i(x_0 - x_i)$, with control $u_i = K\varepsilon_i$ and leader state $x_0$ (the c.g.).
Example – Non-Resilience in Communication Networks
Same agent dynamics and local feedback design: $\dot x_i = A x_i + B u_i$, $u_i = K\varepsilon_i$ with $K=[0.5\;\;0.5]$, and local neighborhood tracking error $\varepsilon_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + g_i(x_0 - x_i)$.
ADD another comm. link – more information flow.
Causes an unstable formation! WHY?
[Plot: diverging agent trajectories in the (x, y) plane]
Trust Propagation and Consensus – Network Security
Foundation work by John Baras.
Define $\sigma_{ij}$ as the trust that node $i$ has for node $j$, with $\sigma_{ij}\in[-1,1]$: $-1$ = distrust, $0$ = no opinion, $+1$ = complete trust.
Define the trust vector of node $i$ as the $N$-vector $\sigma_i = [\sigma_{i1},\dots,\sigma_{iN}]^T$ (e.g. entry $\sigma_{i3}$ is the trust node $i$ has for node 3).
Standard local voting protocol: $\dot\sigma_i = \sum_{j\in N_i} a_{ij}(\sigma_j - \sigma_i)$ – difference of opinion with neighbors.
Inspired by social behavior in flocks, herds, teams.
Closed-loop trust dynamics: $\dot\sigma = -(L\otimes I_N)\sigma$, with global trust vector $\sigma=[\sigma_1^T,\dots,\sigma_N^T]^T\in\mathbb{R}^{N^2}$.
Trust Propagation & Consensus
Six-node network. Nodes 1, 2, 4 initially distrust node 5 (initial trusts are negative).
[Plot: convergence of trust components vs. time, values in $[-1,1]$]
Other nodes agree that node 5 has negative trust.
Nodes converge to a consensus heading.
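A sketch of the trust-propagation protocol on the 6-node graph; the specific initial trust values below are illustrative, not from the slides:

```python
import numpy as np

# 6-node example graph
A = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# sigma[i, j] = trust that node i has for node j, in [-1, 1]
rng = np.random.default_rng(0)
sigma = rng.uniform(0.2, 0.8, size=(6, 6))
sigma[[0, 1, 3], 4] = -0.8      # nodes 1, 2, 4 initially distrust node 5

dt = 0.01
for _ in range(30000):          # sigma_dot_i = sum_j a_ij (sigma_j - sigma_i)
    sigma = sigma + dt * (-L @ sigma)

print("consensus trust vector:", np.round(sigma[0], 4))
print("max disagreement:", np.abs(sigma - sigma[0]).max())   # -> 0
```

Each column of sigma obeys the same consensus dynamics, so the network converges to a single shared trust opinion about every node.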
Trust-Based Control: Swarms/Formations
Trust dynamics: $\dot\sigma_i = \sum_{j\in N_i} a_{ij}(\sigma_j - \sigma_i)$
Motion dynamics, heading angle $\theta_i$, trust-weighted protocol: $\dot\theta_i = \sum_{j\in N_i} a_{ij}\,\sigma_{ij}^c(\theta_j - \theta_i)$, with
$\dot x_i = V\cos\theta_i,\qquad \dot y_i = V\sin\theta_i$
[Plots: convergence of trust and convergence of headings vs. time; (x, y) trajectories converging to a common heading]
Trust-Based Control: Swarms/Formations – Malicious Node
Node 5 injects negative trust values into the trust-weighted protocol $\dot\theta_i = \sum_{j\in N_i} a_{ij}\,\sigma_{ij}^c(\theta_j - \theta_i)$.
Divergence of trust and divergence of headings: causes an unstable formation.
[Plots: diverging trust values, headings, and (x, y) trajectories]
Internal attack: a malicious node puts out bad trust values, i.e. false information (c.f. virus propagation).
Trust-Based Control: Swarms/Formations – CUT OUT the Malicious Node
Node 5 injects negative trust values. If node 3 distrusts node 5, cut out node 5 from the trust-weighted protocol $\dot\theta_i = \sum_{j\in N_i} a_{ij}\,\sigma_{ij}^c(\theta_j - \theta_i)$.
Convergence of trust: the other nodes agree that node 5 has negative trust. Restabilizes the formation: convergence of headings.
[Plots: trust values in $[-1,1]$, heading angles, and (x, y) trajectories with node 5 excluded]
Work by Sajal Das
Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
Optimality in Biological Systems – Why are Biological Systems Resilient?
Cell homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular metabolism; permeability control of the cell membrane.
http://www.accessexcellence.org/RC/VL/GG/index.html
Reinforcement Learning
Every living organism improves its control actions based on rewards received from the environment.
The resources available to living organisms are usually meager: nature uses optimal control.
1. Apply a control. Evaluate the benefit of that control.
2. Improve the control policy.
RL finds optimal policies by evaluating the effects of suboptimal policies.
Optimality provides an organizational principle for behavior: Charles Darwin showed that optimal control over long timescales is responsible for natural selection of species.
Different Methods of Learning – Actor-Critic Learning
Reinforcement learning (Ivan Pavlov, 1890s): an adaptive learning system applies control inputs to the system and observes its outputs in the environment; a critic compares performance against the desired performance and produces a reinforcement signal that tunes the actor.
We want OPTIMAL performance: ADP – Approximate Dynamic Programming (Sutton & Barto book).
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
Books: F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on reinforcement learning and differential games.
F.L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
F. Lewis, D. Vrabie, and K. Vamvoudakis, “Reinforcement learning and feedback control,” IEEE Control Systems Magazine, Dec. 2012.
Optimal Control – The Linear Quadratic Regulator (LQR)
System $\dot x = Ax + Bu$, control $u = -Kx$.
User-prescribed optimization criterion, with weights $(Q,R)$: $V(x(t)) = \int_t^\infty \big(x^TQx + u^TRu\big)\,d\tau$
Off-line design loop using the ARE: $0 = A^TP + PA + Q - PBR^{-1}B^TP$, then $K = R^{-1}B^TP$; on-line real-time control loop.
A FORMAL DESIGN PROCEDURE – but an offline procedure that requires knowledge of the system dynamics model $(A,B)$. System modeling is expensive, time consuming, and inaccurate.
OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS.
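The offline LQR design loop is a few lines with SciPy; the $(A,B,Q,R)$ values here are illustrative, not from the slides:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative two-state plant (assumed values)
A = np.array([[0.0, 1.0],
              [-1.0, -1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.eye(1)

# Offline design: solve 0 = A'P + PA + Q - P B R^-1 B' P for P
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)       # K = R^-1 B' P

residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("K =", K)
print("ARE residual norm:", np.linalg.norm(residual))
closed = np.linalg.eigvals(A - B @ K)
print("closed-loop eigenvalues:", closed)   # open left half-plane
```

Note that the design requires the full model $(A,B)$ up front, which is exactly the limitation the online RL methods below remove.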
Applications at Boeing Defense, Space & Security – Kevin Wise and Eugene Lavretsky
Highly reliable adaptive uncertainty-approximation compensators for flight control applications: unmanned aircraft – Phantom Ray.
Tailkit adaptive control systems for Joint Direct Attack Munition (JDAM) munitions: Mk-82, Mk-84, and laser-guided variants. Fielded for defense in Iraq and Afghanistan.
Adaptive Control Structures for Resilience:
A. Optimal control  B. Zero-sum games  C. Non-zero-sum games
For each: 1. System dynamics  2. Value/cost function  3. Bellman equation  4. HJ solution equation (Riccati eq.)  5. Policy iteration – gives the structure we need
We want to find optimal control solutions online in real time, using adaptive control techniques, without knowing the full dynamics – for nonlinear systems and general performance indices.
Adaptive control is online and works for unknown systems, but is generally not optimal. Optimal control is offline, and needs to know the system dynamics to solve the design equations.
We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this! RL brings together optimal control and adaptive control.
Derivation of the Nonlinear Optimal Regulator
Nonlinear system dynamics: $\dot x = f(x,u) = f(x) + g(x)u$
Cost/value: $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^TRu\big)\,d\tau$
Leibniz gives the differential equivalent – the Bellman equation, in terms of the Hamiltonian function:
$H\big(x,u,\tfrac{\partial V}{\partial x}\big) = r(x,u) + \tfrac{\partial V^T}{\partial x}\dot x = r(x,u) + \tfrac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = 0$
Stationarity condition $\tfrac{\partial H}{\partial u} = 0$ gives the stationary control policy:
$u = h(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V}{\partial x}$
Substituting yields the Hamilton-Jacobi-Bellman equation:
$0 = Q(x) + \tfrac{\partial V^{*T}}{\partial x}f(x) - \tfrac14\,\tfrac{\partial V^{*T}}{\partial x}\,g\,R^{-1}g^T\,\tfrac{\partial V^*}{\partial x}$
Off-line solution: the HJB is hard to solve and may not have a smooth solution; the dynamics must be known.
Integral Reinforcement Learning (IRL) – Draguna Vrabie
Same system $\dot x = f(x) + g(x)u$, value $V(x(t)) = \int_t^\infty \big(Q(x)+u^TRu\big)\,d\tau$, and stationary policy $u = h(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V}{\partial x}$ from the stationarity condition $\partial H/\partial u = 0$. To find online methods for optimal control, iterate on these two equations.
Split the value as $V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + \int_{t+T}^\infty r(x,u)\,d\tau$, which gives the "good" IRL Bellman equation:
$V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T)),\qquad V(0)=0$
This is equivalent to the CT Bellman equation $0 = r(x,u) + \tfrac{\partial V^T}{\partial x}f(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big)$, but solves the Bellman equation (a nonlinear Lyapunov equation) without knowing the system dynamics.
IRL Policy Iteration (an initial stabilizing control is needed):
Policy evaluation – IRL Bellman equation (cost update): $V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T))$
Policy improvement (control gain update): $u_{k+1} = h_{k+1}(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V_k}{\partial x}$
$f(x)$ and $g(x)$ do not appear in the cost update; $g(x)$ is needed only for the control update.
D. Vrabie proved convergence to the optimal value and control (Automatica 2009, Neural Networks 2009).
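For the LQR special case, this policy iteration reduces to Kleinman's algorithm: policy evaluation is a Lyapunov (Bellman) equation and policy improvement is $K_{k+1}=R^{-1}B^TP_k$. A minimal sketch with illustrative matrices:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative plant (assumed values)
A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

K = np.zeros((1, 2))      # initial stabilizing policy (A itself is stable here)
for _ in range(10):
    Ak = A - B @ K
    # Policy evaluation: Bellman (Lyapunov) equation Ak'P + P Ak + Q + K'RK = 0
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Policy improvement: K = R^-1 B' P
    K = np.linalg.solve(R, B.T @ P)

print("policy-iteration P:\n", P)
print("ARE P:\n", solve_continuous_are(A, B, Q, R))   # the two agree
```

IRL computes exactly this Lyapunov-equation solution, but from measured data rather than from $(A,B)$.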
Converges to the solution of the HJB equation:
$0 = Q(x) + \tfrac{dV^{*T}}{dx}f - \tfrac14\,\tfrac{dV^{*T}}{dx}\,g\,R^{-1}g^T\,\tfrac{dV^*}{dx}$
Value Function Approximation (VFA) to Solve the Bellman Equation
IRL Bellman equation: $V_k(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,d\tau + V_k(x(t+T))$
Approximate the value by a Weierstrass approximator network $V(x) = W^T\phi(x)$:
$W_k^T\phi(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,d\tau + W_k^T\phi(x(t+T))$
$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,d\tau$
A scalar equation with vector unknowns – the same form as standard system-ID problems: regression vector $\phi(x(t)) - \phi(x(t+T))$, reinforcement on the time interval $[t, t+T]$.
Now use RLS along the trajectory to get the new weights $W_k$, then find the updated feedback:
$u_{k+1}(x) = h_{k+1}(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V_k}{\partial x} = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial\phi^T(x(t))}{\partial x}W_k$
Real-time implementation: Approximate Dynamic Programming – direct optimal adaptive control for partially unknown CT systems.
Integral Reinforcement Learning (IRL) – data set at time $[t, t+T)$: $x(t),\ \rho(t,t+T),\ x(t+T)$
Repeating over intervals $[t, t+T),\ [t+T, t+2T),\ [t+2T, t+3T),\dots$: observe $x(t)$, apply $u_k = L_kx$, observe the cost integral, update $P$. Do RLS until convergence to $P_k$, then update the control gain $K_{k+1} = R^{-1}B^TP_k$.
Or use batch least-squares:
$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} x^T\big(Q + K_k^TRK_k\big)x\,d\tau \equiv \rho(t, t+T)$
Solving this Bellman equation solves the Lyapunov equation without knowing the dynamics.
This is a data-based approach that uses measurements of $x(t)$, $u(t)$ instead of the plant dynamical model. A is not needed anywhere.
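A minimal data-based sketch of the batch least-squares IRL loop; the second-order plant below is illustrative, and the model is used only to generate the measurements, never inside the learning loop (which sees only data and $B$):

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_are

# Plant, used ONLY to generate measurements
A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

def phi(x):
    """Quadratic basis: x'Px = W'phi(x) with W = [p11, 2*p12, p22]."""
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def run_interval(x0, K, T=0.5, n=500):
    """Closed-loop run over [t, t+T]: return x(t+T) and the measured
    reinforcement rho(t, t+T) = integral of x'(Q + K'RK)x dtau."""
    step = expm((A - B @ K) * T / n)
    M = Q + K.T @ R @ K
    x, rho, dt = x0.copy(), 0.0, T / n
    for _ in range(n):
        xn = step @ x
        rho += 0.5 * dt * (x @ M @ x + xn @ M @ xn)   # trapezoidal integral
        x = xn
    return x, rho

K = np.zeros((1, 2))                  # initial stabilizing gain
rng = np.random.default_rng(1)
for _ in range(8):                    # IRL policy iterations
    Phi, Rho = [], []
    for _ in range(8):                # batch of measured intervals
        x0 = rng.uniform(-2, 2, size=2)
        xT, rho = run_interval(x0, K)
        Phi.append(phi(x0) - phi(xT))
        Rho.append(rho)
    W = np.linalg.lstsq(np.array(Phi), np.array(Rho), rcond=None)[0]
    P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])
    K = np.linalg.solve(R, B.T @ P)   # gain update uses only B, never A

print("learned K:", K)
print("optimal K:", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))
```

Each batch solves $W_k^T[\phi(x(t))-\phi(x(t+T))]=\rho(t,t+T)$ in least squares, which is the Lyapunov equation in disguise; the gains converge to the ARE solution.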
Direct Optimal Adaptive Controller (Draguna Vrabie)
Actor/critic structure for CT systems: a hybrid continuous/discrete dynamic controller (with memory) whose internal state is the observed cost over the interval.
System $\dot x = Ax + Bu$; the critic measures $x^TQx + u^TRu$ through a ZOH with period $T$; the actor applies the gain $K$.
Run RLS or batch least-squares to identify the value of the current control; update the feedback gain after the critic has converged.
The reinforcement interval $T$ can be selected online, on the fly – it can change.
Solves the Riccati equation online without knowing the A matrix.
OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS.
Optimal Adaptive IRL for CT Systems – a New Structure of Adaptive Controllers (D. Vrabie, 2009)
$V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T)),\qquad u_{k+1}(x) = h_{k+1}(x) = -\tfrac12 R^{-1}g^T(x)\tfrac{\partial V_k}{\partial x}$
Analogy with the brain (Doya, Kimura, Kawato 2001): theta waves at 4-8 Hz – reinforcement learning; motor control at 200 Hz (spinal cord); limbic system.
Oscillation is a fundamental property of neural tissue. The brain has multiple adaptive clocks with different timescales:
- theta rhythm (hippocampus, thalamus, 4-10 Hz): sensory processing, memory and voluntary control of movement; deliberative evaluation
- gamma rhythms (30-100 Hz, hippocampus and neocortex): high cognitive activity – consolidation of memory, spatial mapping of the environment (place cells)
The high-frequency processing is due to the large amounts of sensory data to be processed.
OPTIMALITY AND RESILIENCE OF BIOLOGICAL SYSTEMS
D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller – A Continuous-Time Formulation,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
Simulation 1 – F-16 aircraft pitch rate controller (Stevens and Lewis 2003)
$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215\\ 0.82225 & -1.07741 & -0.17555\\ 0 & 0 & -1 \end{bmatrix}x + \begin{bmatrix}0\\0\\1\end{bmatrix}u$
with state $x = [\alpha\ \ q\ \ \delta_e]^T$ and weights $Q = I$, $R = I$. Select a quadratic NN basis set for VFA.
Exact solution of the ARE $0 = A^TP + PA + Q - PBR^{-1}B^TP$:
$W^{*T} = [p_{11}\ \ 2p_{12}\ \ 2p_{13}\ \ p_{22}\ \ 2p_{23}\ \ p_{33}] = [1.4245\ \ 1.1682\ \ {-0.1352}\ \ 1.4349\ \ {-0.1501}\ \ 0.4329]$
[Plots vs. time: system states, control signal, controller parameters, and critic parameters P(1,1), P(1,2), P(2,2) converging to their optimal values]
Simulations on the F-16 autopilot: the A matrix is not needed; the critic parameters converge to the steady-state Riccati equation solution. Solves the ARE $0 = A^TP + PA + Q - PBR^{-1}B^TP$ online without knowing A.
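The quoted exact solution can be checked by solving the ARE for the F-16 model directly; the sign pattern of $A$ below is restored from the published version of this example and should be treated as an assumption:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# F-16 pitch-axis model; signs restored (assumption, see lead-in)
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B = np.array([[0.0], [0.0], [1.0]])
Q = np.eye(3)
R = np.eye(1)

P = solve_continuous_are(A, B, Q, R)
# Critic weight vector W* = [p11, 2p12, 2p13, p22, 2p23, p33]
W = np.array([P[0, 0], 2 * P[0, 1], 2 * P[0, 2],
              P[1, 1], 2 * P[1, 2], P[2, 2]])
print("W =", np.round(W, 4))    # compare against the slide's exact solution

residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("ARE residual norm:", np.linalg.norm(residual))
```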
Simulation 2: Load Frequency Control of an Electric Power System (A. Al-Tamimi, D. Vrabie, Youyi Wang)
$\dot x = Ax + Bu$, with state $x(t) = [\Delta f_g(t)\ \ \Delta P_g(t)\ \ \Delta X_g(t)\ \ \Delta E(t)]^T$: frequency, generator output, governor position, integral control.
$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0\\ 0 & -1/T_T & 1/T_T & 0\\ -1/(RT_G) & 0 & -1/T_G & -1/T_G\\ K_E & 0 & 0 & 0 \end{bmatrix},\qquad B = \begin{bmatrix} 0\\ 0\\ 1/T_G\\ 0\end{bmatrix}$
ARE solution using the full dynamics model $(A,B)$, from $0 = A^TP + PA + Q - PBR^{-1}B^TP$:
$P_{ARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751\\ 0.4766 & 0.7831 & 0.1237 & 0.3829\\ 0.0601 & 0.1237 & 0.0513 & 0.0298\\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}$
[Plot: P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) converging vs. time]
Data per update: $x(t),\ x(t+T),\ \rho(t:t+T)$, with IRL period $T = 0.1$ s and fifteen data points per batch. Hence, the value estimate was updated every 1.5 s. Solves the ARE online without knowing A.
$P_{critic\;NN} = \begin{bmatrix} 0.4802 & 0.4768 & 0.0603 & 0.4754\\ 0.4768 & 0.7887 & 0.1239 & 0.3834\\ 0.0603 & 0.1239 & 0.0567 & 0.0300\\ 0.4754 & 0.3843 & 0.0300 & 2.3433 \end{bmatrix}$
[Plot: system states vs. time]
Adaptive Control (plant, control, output):
- Identify the system model – indirect adaptive control
- Identify the controller – direct adaptive control
- Identify the performance value $V(x) = W^T\phi(x)$ – optimal adaptive control
OPTIMALITY YIELDS RESILIENCE, ROBUSTNESS, AND PERFORMANCE GUARANTEES
Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
Bounded Rationality – fast decisions using a new concept of graphical games to limit each agent's decision horizon.
Speed up multi-agent decision-making by imposing sparse, efficient communication structures: faster decisions based on local information; under certain conditions, the local decisions are globally optimal; Nash equilibrium on graphs is defined.
Manufacturing as the Interaction of Multiple Agents
Each machine has its own dynamics and cost function; neighboring machines influence each other most strongly; there are local optimization requirements as well as global necessities.
Each process has its own dynamics:
$\dot\delta_i = A\delta_i - (d_i+g_i)B_iu_i + \sum_{j\in N_i} e_{ij}B_ju_j$
and cost function:
$J_i\big(\delta_i(0),u_i,u_{-i}\big) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i} u_j^TR_{ij}u_j\Big)dt$
Each process helps other processes achieve optimality and efficiency.
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2014
Key Point
Lyapunov functions and performance indices must depend on the graph topology.
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Graphical Games: Synchronization – the Cooperative Tracker Problem
Node dynamics: $\dot x_i = Ax_i + B_iu_i$, with $x_i(t)\in\mathbb{R}^n$, $u_i(t)\in\mathbb{R}^{m_i}$
Target generator (leader) dynamics: $\dot x_0 = Ax_0$
Synchronization problem: $x_i(t)\to x_0(t),\ \forall i$
Local neighborhood tracking error (Lihua Xie): $\delta_i = \sum_{j\in N_i} e_{ij}(x_j - x_i) + g_i(x_0 - x_i)$
Local neighborhood tracking error dynamics – local agent dynamics driven by neighbors’ controls:
$\dot\delta_i = A\delta_i - (d_i+g_i)B_iu_i + \sum_{j\in N_i} e_{ij}B_ju_j$
Define the local neighborhood performance index – values driven by neighbors’ controls:
$J_i\big(\delta_i(0),u_i,u_{-i}\big) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt = \tfrac12\int_0^\infty L_i\big(\delta_i(t),u_i(t),u_{-i}(t)\big)\,dt$
K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.
Impose Optimality to get Resilience and Robustness
New Differential Graphical Game
State dynamics of agent $i$ – local dynamics: $\dot\delta_i = A\delta_i - (d_i+g_i)B_iu_i + \sum_{j\in N_i}e_{ij}B_ju_j$
Control action of player $i$: $u_i$
Local value function of player $i$ – depends only on graph neighbors:
$J_i\big(\delta_i(0),u_i,u_{-i}\big) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$
OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS.
Standard Multi-Agent Differential Game (for comparison)
Central dynamics: $\dot z = Az + \sum_{i=1}^N B_iu_i$
Control action of player $i$: $u_i$
Value function of player $i$ – depends on ALL other control actions:
$J_i\big(z(0),u_1,\dots,u_N\big) = \tfrac12\int_0^\infty\Big(z^TQ_iz + \sum_{j=1}^N u_j^TR_{ij}u_j\Big)dt$
Team Interest vs. Self Interest: Cooperation vs. Collaboration
The objective function of each player can be written as a team-average term plus a conflict-of-interest term. For three players:
$J_1 = \tfrac13(J_1+J_2+J_3) + \tfrac13\big((J_1-J_2)+(J_1-J_3)\big) = J_{team} + J_1^{coi}$
$J_2 = \tfrac13(J_1+J_2+J_3) + \tfrac13\big((J_2-J_1)+(J_2-J_3)\big) = J_{team} + J_2^{coi}$
$J_3 = \tfrac13(J_1+J_2+J_3) + \tfrac13\big((J_3-J_1)+(J_3-J_2)\big) = J_{team} + J_3^{coi}$
For $N$ players:
$J_i = \tfrac1N\sum_{j=1}^N J_j + \tfrac1N\sum_{j=1}^N (J_i - J_j) = J_{team} + J_i^{coi},\qquad i=1,\dots,N$
For $N$-player zero-sum games, the first term is zero, i.e. the players have no goals in common.
New Definition of Nash Equilibrium for Graphical Games
Def: Interactive Nash equilibrium. Policies $u_1^*, u_2^*, \dots, u_N^*$ are in Interactive Nash equilibrium if:
1. $J_i^* = J_i\big(u_i^*, u_{G_i}^*\big) \le J_i\big(u_i, u_{G_i}^*\big),\ \forall i$ – i.e. they are in Nash equilibrium;
2. For every pair $i,j$ there exists a policy $u_j$ such that $J_i\big(u_j^*, u_{G_j}^*\big) \ne J_i\big(u_j, u_{G_j}^*\big)$.
That is, every player can find a policy that changes the value of every other player. This restores the symmetry of Nash equilibrium.
Theorem 3. Let $(A,B_i)$ be reachable for all $i$. Let agent $i$ be in local best response: $J_i\big(u_i^*, u_{-i}\big) \le J_i\big(u_i, u_{-i}\big),\ \forall i$. Then $u_1^*, u_2^*, \dots, u_N^*$ are in global Interactive Nash equilibrium iff the graph is strongly connected.
Graphical Game Solution Using Integral Reinforcement Learning
Value function: $V_i(\delta_i(t)) = \tfrac12\int_t^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$
The differential equivalent (Leibniz formula) is Bellman's equation:
$H_i\Big(\delta_i,\tfrac{\partial V_i}{\partial\delta_i},u_i,u_{-i}\Big) = \tfrac{\partial V_i^T}{\partial\delta_i}\Big(A\delta_i-(d_i+g_i)B_iu_i+\sum_{j\in N_i}e_{ij}B_ju_j\Big) + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12 u_i^TR_{ii}u_i + \tfrac12\sum_{j\in N_i}u_j^TR_{ij}u_j = 0$
Stationarity condition: $\tfrac{\partial H_i}{\partial u_i} = 0 \;\Rightarrow\; u_i = (d_i+g_i)R_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i}$
1. Coupled HJ equations:
$0 = \tfrac{\partial V_i^T}{\partial\delta_i}A_i^c\delta_i + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12(d_i+g_i)^2\tfrac{\partial V_i^T}{\partial\delta_i}B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \tfrac12\sum_{j\in N_i}(d_j+g_j)^2\tfrac{\partial V_j^T}{\partial\delta_j}B_jR_{jj}^{-1}R_{ij}R_{jj}^{-1}B_j^T\tfrac{\partial V_j}{\partial\delta_j},\quad \forall i$
where $A_i^c\delta_i = A\delta_i - (d_i+g_i)^2B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \sum_{j\in N_i}e_{ij}(d_j+g_j)B_jR_{jj}^{-1}B_j^T\tfrac{\partial V_j}{\partial\delta_j}$
2. Best-response HJ equations – the other players have fixed policies $u_{-i}$:
$0 = H_i\Big(\delta_i,\tfrac{\partial V_i}{\partial\delta_i},u_i^*,u_{-i}\Big) = \tfrac{\partial V_i^T}{\partial\delta_i}A_i^c\delta_i + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12(d_i+g_i)^2\tfrac{\partial V_i^T}{\partial\delta_i}B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \tfrac12\sum_{j\in N_i}u_j^TR_{ij}u_j$
where $A_i^c\delta_i = A\delta_i - (d_i+g_i)^2B_iR_{ii}^{-1}B_i^T\tfrac{\partial V_i}{\partial\delta_i} + \sum_{j\in N_i}e_{ij}B_ju_j$
To find online methods for the graphical game, iterate on these two equations.
Online Solution of Graphical Games: use reinforcement learning – POLICY ITERATION, with convergence results. MULTI-AGENT LEARNING (Kyriakos Vamvoudakis).
Multi-agent Networks on Communication Graphs
- Robustness of Optimal Design
- Reinforcement Learning
- Cooperative Agents – Games on Communication Graphs
- CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
What is a micro-grid?
- A micro-grid is a small-scale power system that provides the power for a group of consumers.
- A micro-grid enables local power support for local and critical loads.
- A micro-grid has the ability to work in both grid-connected and islanded modes.
- A micro-grid facilitates the integration of Distributed Energy Resources (DER).
Photo from: http://www.horizonenergygroup.com
Micro-grid applications: the main building block of smart grids; rural plants; business buildings, hospitals, and factories.
Smart-grid photo from: http://www.sustainable-sphere.com
Distributed Generators (DG) / Distributed Energy Resources (DER):
- Non-renewables: internal combustion engines, micro-turbines, fuel cells
- Renewables: photovoltaic, wind, hydroelectric, biomass
Micro-grid Advantages
- A micro-grid provides high-quality, reliable power to critical consumers.
- During main-grid disturbances, a micro-grid can quickly disconnect from the main grid and provide reliable power for its local loads.
- DGs can simply be installed close to the loads, which significantly reduces power transmission line losses.
- By using renewable energy resources, a micro-grid reduces CO2 emissions.
Micro-grid Hierarchical Control Structure (microgrid tied to the main grid)
- Tertiary control: optimal operation in both operating modes; power flow control in grid-tied mode.
- Secondary control: voltage deviation mitigation; frequency deviation alleviation. Do cooperative control here to synchronize frequency and voltage – maintains stability.
- Primary control: voltage stability provision; frequency stability preserving; plug-and-play capability for DGs.
Bidram, A., & Davoudi, A., “Hierarchical structure of microgrids control system,” IEEE Transactions on Smart Grid, vol. 3, pp. 1963-1976, Dec. 2012.
Cooperative Game-theoretic Control of Active Loads in DC Microgrids
Ling-ling Fan, Vahidreza Nasirian, Hamidreza Modares, Frank L. Lewis, Yong-duan Song, and Ali Davoudi
[Figure: power buffer operation during a step change in power demand – input power $p_{in}$, output power $p_{out}$, and stored energy $e$ at times $t_1$, $t_2$, $t_3$]
[Figure: power buffers in a microgrid network, with line resistances $r_{ij}$, source $v_{s1}$, bus voltages $v_i$, controls $u_i$, and stored energies $e_i$]
Active Load Power Buffer:
$\dot e_i = \dfrac{v_i^2}{r_i} - p_i,\qquad \dot r_i = u_i$
with stored energy $e_i$, input impedance $r_i$, bus voltage $v_i$, control input $u_i$, and output power $p_i$ (a disturbance).
Define coupled performance indices:
$J_i = \int_0^\infty\Big(\mathbf{x}^TQ_i\mathbf{x} + \sum_{j}\rho_{ij}u_j^2\Big)dt,\qquad i = M+1,\dots,M+N$
Solve for the bus voltage to get coupled agent dynamics in the buffer states $x_i = [e_i\ \ r_i\ \ p_i]^T$: linear dynamics driven by the agent's own control, its neighbors' controls, and a disturbance $w_i$, for $i = M+1,\dots,M+N$.
Define the communication graph: a sparse, efficient topology. Optimal design provides resilience and disturbance rejection.
Vahid Nasirian, Reza Modares, Dr. Ali Davoudi
Micro-grid Secondary Control: Distributed CPS Structure
[Figure: microgrid with DG 1–DG 8 and the cyber communication framework with communication links]
Cyber-Physical System (CPS):
- Physical layer: the interconnect structure of the power grid
- Cyber layer: a sparse, efficient communication network to allow cooperative control for synchronization of voltage and frequency
Work of Ali Bidram with Dr. A. Davoudi
Resilience because of: a sparse, efficient communication network; optimal design – graph games; distributed computation – no single point of failure.
Case Study: physical microgrid layer and cyber communication layer – a 48 V DC distribution network with three active loads and three sources.
[Plots vs. time for the case study: (a) MG voltage (V), (b) buffer voltage (V), (c) load voltage (V), (d) source current (A), (e) stored energy (J), (f) input resistance (Ω), (g) output power (W), (h) stored energy vs. input resistance for buffers 4, 5, 6; events A–D mark a load change at terminal 5]
On-policy RL
- Target policy: the policy that we are learning about.
- Behavior policy: the policy that generates actions and behavior.
In on-policy RL the target policy and the behavior policy are the same.
Off-policy RL
In off-policy RL the target policy and the behavior policy are different.
Humans can learn optimal policies while actually applying suboptimal policies.
Off-policy IRL
System $\dot x = f(x) + g(x)u$, value $J(x) = \int_t^\infty r\big(x(\tau),u(\tau)\big)\,d\tau$
On-policy IRL:
$J^{[i]}\big(x(t)\big) - J^{[i]}\big(x(t+T)\big) = \int_t^{t+T}\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau,\qquad u^{[i+1]} = -\tfrac12 R^{-1}g^T\nabla J^{[i]}$
Off-policy IRL: write the dynamics as $\dot x = f + gu^{[i]} + g\big(u - u^{[i]}\big)$, where $u$ is the applied (behavior) control. Then
$J^{[i]}\big(x(t)\big) - J^{[i]}\big(x(t+T)\big) = \int_t^{t+T}\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau + \int_t^{t+T} 2u^{[i+1]T}R\big(u - u^{[i]}\big)d\tau$
This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$: they can be found simultaneously online using measured data, using the Kronecker product and VFA.
1. Completely unknown system dynamics.
2. Can use the applied $u(t)$ for: disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015; robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012; exploring probing noise – without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012.
DDO – Yu Jiang & Zhong-Ping Jiang, Automatica 2012 (system, value): must know $g(x)$.
LQ Tracker Problem for Continuous-Time Systems
System dynamics: $\dot x = Ax + Bu,\qquad y = Cx$
Assume the reference trajectory is generated by $\dot y_d = Fy_d$.
Augmented system state: $X(t) = \big[x(t)^T\ \ y_d(t)^T\big]^T$
Augmented system: $\dot X = \begin{bmatrix} A & 0\\ 0 & F\end{bmatrix}X + \begin{bmatrix}B\\0\end{bmatrix}u \equiv TX + B_1u$
Value function: $V\big(X(t)\big) = \tfrac12\int_t^\infty e^{-\gamma(\tau-t)}\big(X^TQ_1X + u^TRu\big)d\tau = \tfrac12 X(t)^TPX(t)$, with $Q_1 = C_1^TQC_1,\ C_1 = [C\ \ {-I}]$
LQT Bellman equation:
$0 = \big(TX + B_1u\big)^TPX + X^TP\big(TX + B_1u\big) - \gamma X^TPX + X^TQ_1X + u^TRu$
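The LQT design reduces to a standard ARE for the shifted augmented matrix $T-\tfrac{\gamma}{2}I$, since $-\gamma P = (-\tfrac{\gamma}{2}I)^TP + P(-\tfrac{\gamma}{2}I)$. A sketch with illustrative plant and reference-generator values (assumptions, not from the slides):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative plant and reference generator
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
F = np.array([[0.0]])          # constant reference y_d
gamma = 0.2                    # discount factor

# Augmented system X = [x; y_d]
T = np.block([[A, np.zeros((2, 1))], [np.zeros((1, 2)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)])
Q1 = C1.T @ np.eye(1) @ C1
R = np.eye(1)

# Discounted ARE: T'P + PT - gamma*P + Q1 - P B1 R^-1 B1' P = 0,
# i.e. the standard ARE for the shifted matrix T - (gamma/2) I
P = solve_continuous_are(T - 0.5 * gamma * np.eye(3), B1, Q1, R)
K = np.linalg.solve(R, B1.T @ P)
print("tracker gain K =", K)

res = T.T @ P + P @ T - gamma * P + Q1 - P @ B1 @ np.linalg.solve(R, B1.T @ P)
print("discounted ARE residual norm:", np.linalg.norm(res))
```

The discount $\gamma>0$ is what makes the infinite-horizon tracking value finite even when $y_d$ does not decay.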
Online Solution to the CT LQT ARE: Off-Policy IRL
Write the augmented dynamics with a behavior input: $\dot X = TX + B_1u = T_iX + B_1\big(u + K^iX\big)$, where $T_i \equiv T - B_1K^i$.
Off-policy IRL Bellman equation:
$e^{-\gamma T}X(t+T)^TP^iX(t+T) - X(t)^TP^iX(t) = -\int_t^{t+T}e^{-\gamma(\tau-t)}X^T\big(Q_1 + K^{iT}RK^i\big)X\,d\tau + 2\int_t^{t+T}e^{-\gamma(\tau-t)}\big(u + K^iX\big)^TRK^{i+1}X\,d\tau$
Algorithm. Online off-policy IRL algorithm for LQT:
- Online step: apply a fixed control input and collect some data.
- Offline step: policy evaluation and improvement using least-squares on the collected data.
No knowledge of the dynamics is required.
Humans can learn optimal policies while actually applying suboptimal policies.
RL for Human-Robot Interactions
H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-robot Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, to appear 2015.
Two-loop HRI learning controller based on human-factors performance studies. Humans learn two components in HRI:
1. must learn to compensate for nonlinear robot dynamics – robot-specific inner control loop using neural adaptive learning;
2. must learn to perform the task – task-specific outer-loop control design using online reinforcement learning (IRL optimal tracking), with a prescribed impedance model.
Our result: the robot adapts in order to minimize human effort. Implemented on the PR2 robot.
[Diagram: human force $f_h$, prescribed impedance model, robot manipulator with torque design and feedforward control]
Modeling and Control of Adversarial Networked Teams
Zero-Sum Games
Bipartite Synchronization on Antagonistic Communication Graphs
Multi-agent Pursuit-Evasion Games
L2 Gain Problem

System:  ẋ = f(x) + g(x) u + k(x) d,  y = h(x)
Performance output:  z = [ y^T  u^T ]^T
Control:  u = l(x);  d: disturbance

Find control u(t) so that

∫_0^∞ ||z(t)||² dt / ∫_0^∞ ||d(t)||² dt = ∫_0^∞ ( h^T h + u^T u ) dt / ∫_0^∞ ||d(t)||² dt ≤ γ²

for all L2 disturbances and a prescribed gain γ.
H‐Infinity Control Using Neural Networks
Zero-Sum differential game: Nature as the opposing player
Disturbance Rejection
Two Antagonistic Inputs
V*(x(0)) = min_u max_d V(x(0), u, d) = min_u max_d ∫_0^∞ ( h^T(x) h(x) + u^T R u − γ² d^T d ) dt
Define 2‐player zero‐sum game as
min_u max_d V(x(0), u, d) = max_d min_u V(x(0), u, d)

The game has a unique value (saddle-point solution) iff this Nash condition holds
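For a linear system with quadratic cost, the saddle point follows from the game algebraic Riccati equation; a scalar sketch with made-up numbers:

```python
import numpy as np

# Scalar zero-sum LQ game:  dx/dt = a x + b u + k d,
# J = integral of (q x^2 + r u^2 - gamma^2 d^2) dt   (hypothetical numbers)
a, b, k, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 1.0, 2.0

# Game ARE:  2 a p + q - p^2 (b^2/r - k^2/gamma^2) = 0
s = b**2 / r - k**2 / gamma**2            # must be > 0 for a saddle point
p = (2 * a + np.sqrt(4 * a**2 + 4 * q * s)) / (2 * s)   # positive root

u_gain = -(b / r) * p                     # minimizing player:  u* = u_gain * x
d_gain = (k / gamma**2) * p               # maximizing player:  d* = d_gain * x

residual = 2 * a * p + q - p**2 * s       # ARE residual (should be ~0)
a_cl = a + b * u_gain + k * d_gain        # saddle-point closed loop (stable)
print(p, residual, a_cl)
```

The gain condition s > 0 is the scalar analogue of the attenuation level γ being large enough for the game to have a value.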
Optimal Game Design Yields:
Resilience
Robustness
Guaranteed Performance
Disturbance Rejection
Two Antagonistic Inputs
control node v
Cooperative and Adversarial Multi-agent Network: adversarial disturbance at each agent
Agents want to cooperate to synchronize to the control leader
Zero-sum Graphical Games
Cooperative Synchronization

System (agent i):  ẋ_i = A x_i + B u_i + D d_i,  y_i = C x_i
Leader:  ẋ_0 = A x_0,  y_0 = C x_0

Global error dynamics, δ = x − (1_N ⊗ x_0):
δ̇ = (I_N ⊗ A) δ + (I_N ⊗ B) u + (I_N ⊗ D) d,  y = (I_N ⊗ C) δ

Performance output (H∞):  ||z(t)||² = δ^T Q δ + u^T R u
Bounded L2 synchronization error:

∫_0^∞ ||z(t)||² dt / ∫_0^∞ d^T(t) T d(t) dt ≤ γ²,  where ||z(t)||² = δ^T Q δ + u^T R u

H∞ performance index:

J(u, d) = ∫_0^∞ ( δ^T Q δ + u^T R u − γ² d^T T d ) dt
Resilience to Network Disturbances
Condition on graph topology:  (L + G) + (L + G)^T > 0
Condition for existence of local OPFB:  R K C = B^T P + M
Kristian Hengster-Movric, Cooperative H∞ Synchronization
J. Gadewadikar, Frank L. Lewis, L. Xie, V. Kucera and M. Abu-Khalaf, “Parameterization of all stabilizing H∞static state-feedback gains: Application to output-feedback design,” Automatica, vol. 43, no. 9, pp. 1597-1604, September 2007
Condition on graph topology:  (L + G) + (L + G)^T > 0
Condition for existence of local OPFB:  R K C = B^T P + M
Neighborhood output error:  e_y = ((L + G) ⊗ C) δ
Fine: local OPFB structure  |  Coarse: global OPFB structure
Symmetry between OPFB and graph-topology information restrictions: the same structure appears in both
Communication Network Limits Available Information Exchange
Communication Network Tries to Destroy Optimality
Proper Controls Design Restores Optimality
u_i = c K e_{yi} = c K ( Σ_{j∈N_i} a_ij (y_j − y_i) + g_i (y_0 − y_i) ),  y_i = C x_i
Network info flow restrictions
Local system measurement restrictions
2-player Zero-sum Nash Equilibrium

Command generator:  ẋ_0 = A x_0,  y_0 = C x_0
Agent dynamics:  ẋ_i = A x_i + B u_i + D d_i,  y_i = C x_i
Nash Equilibrium
J(u, d) = ∫_0^∞ ( δ^T Q δ + u^T R u − γ² d^T T d ) dt
H‐inf performance index
Resilience to Network Disturbances
control node v
Cooperative and Adversarial Multi-agent Network: adversarial disturbance at each agent
Agents want to cooperate to synchronize to the control leader
Resilient Optimal Design for Adversarial Networks
Modeling and Control of Adversarial Networked Teams
Zero-Sum Games
Bipartite Synchronization on Antagonistic Communication Graphs
Multi-agent Pursuit-Evasion Games
Structural Balance for Networked Multi-agent Systems
In almost all human endeavors There are TWO Opposing Teams - WHY?
Sports
War
International conglomerates in macroeconomics
Social Psychology: Dynamics of Friendship Groups
Can be understood in terms of TRIANGLES of relationships
Stable groups:  (+, +, +),  (+, −, −)
Unstable groups:  (+, +, −),  (−, −, −)

Stable groups have an odd number of friendship links
Structurally Balanced = Stability
Bipartite Synchronization on Antagonistic Communication Graphs
Node BipartitionRed edge weights are negative
Two opposing antagonistic groups
Structural Balance
Black edge weightsare positive
All triangles have an odd number of positive edges
Adjacency matrix A = [a_ij]:

A = [ 0 0 1 0 0 0
      1 0 0 0 0 1
      1 1 0 0 0 0
      0 1 0 0 0 0
      0 0 1 0 0 0
      0 0 0 1 1 0 ]
In-neighbors of node i:  N_i
Absolute row sum:  d_i^s = Σ_{j∈N_i} |a_ij|
Absolute In-Degree Matrix:  D^s = diag(d_1^s, …, d_N^s)
Signed graph Laplacian matrix:  L^s = D^s − A
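These definitions can be sketched numerically (a hypothetical 3-node signed graph, not the slide's 6-node example), including the gauge transformation that reduces a structurally balanced signed Laplacian to an ordinary one:

```python
import numpy as np

# Hypothetical 3-node signed graph: nodes 0,1 are friends, node 2 opposes both
A = np.array([[ 0.,  1., -1.],
              [ 1.,  0., -1.],
              [-1., -1.,  0.]])

D_s = np.diag(np.abs(A).sum(axis=1))   # absolute in-degree matrix D^s
L_s = D_s - A                          # signed Laplacian  L^s = D^s - A

# Structural balance: a gauge matrix with +/-1 entries (one sign per camp)
# turns the signed Laplacian into an ordinary one with zero row sums
sigma = np.diag([1.0, 1.0, -1.0])      # camp labels: nodes {0,1} vs node {2}
L_gauged = sigma @ L_s @ sigma
print(L_s)
print(np.allclose(L_gauged.sum(axis=1), 0))
```

Note the rows of L^s itself do not sum to zero when antagonistic edges are present; zero row sums are recovered only after the gauge transformation, which is exactly what structural balance guarantees.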
Bipartite Consensus

Agent dynamics:  ẋ_i = A x_i + B u_i,  i = 1, …, N
Leader dynamics:  ẋ_0 = A x_0

New bipartite distributed control law:
u_i = c K ( Σ_{j∈N_i} ( a_ij x_j − |a_ij| x_i ) + g_i x_0 − |g_i| x_i )
One group goes to the leader's state  x_0
The opposing group goes to the negative of the leader's state  −x_0
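A minimal simulation of the bipartite law (single-integrator agents, i.e. A = 0, B = K = 1, on a hypothetical 3-node signed graph with nodes 0,1 opposing node 2; node 0 pinned to the leader):

```python
import numpy as np

# Signed graph: nodes {0,1} cooperate, node 2 is antagonistic to both
A = np.array([[ 0.,  1., -1.],
              [ 1.,  0., -1.],
              [-1., -1.,  0.]])
g = np.array([1., 0., 0.])        # node 0 is pinned to the leader
x0 = 1.0                          # leader state
c = 1.0                           # coupling gain

x = np.array([0.5, -0.2, 0.3])    # arbitrary initial agent states
dt = 0.01
for _ in range(5000):
    # Bipartite law: u_i = c ( sum_j (a_ij x_j - |a_ij| x_i) + g_i (x0 - x_i) )
    u = c * (A @ x - np.abs(A).sum(axis=1) * x + g * (x0 - x))
    x = x + dt * u                # single-integrator agents

print(np.round(x, 3))             # camp {0,1} -> +x0, node 2 -> -x0
```

The cooperating camp synchronizes to the leader while the antagonistic node converges to the negative of the leader's state, as the slide states.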
Time-Varying Edge Weights

Undirected graph, adjacency matrix A = A^T = [a_ij]

Adaptive edge weights:  dA/dt = A²,  i.e.  da_ij/dt = Σ_k a_ik a_kj

Theorem: For almost any initial weights A(0), the network becomes structurally balanced and splits into two antagonistic teams of almost the same size.
Emergent Behaviors: Emergence of Two Antagonistic Teams
S. A. Marvel, J. Kleinberg, R. D. Kleinberg, and S. H. Strogatz, “Continuous-time model of structural balance,” Proceedings of the National Academy of Sciences, vol. 108, no. 5, pp. 1771–1776, 2011.
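The Marvel et al. model can be simulated directly; a short sketch (random symmetric initial weights, Euler integration) that checks structural balance of the sign pattern at blowup:

```python
import numpy as np

# Integrate dA/dt = A^2 from random symmetric initial weights; Marvel et al.
# show the weights blow up in finite time with a structurally balanced
# sign pattern, splitting the nodes into two antagonistic factions.
rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric relationship weights

dt = 1e-4
while np.abs(A).max() < 1e6:           # Euler steps until weights diverge
    A = A + dt * (A @ A)

S = np.sign(A)
# Balance check: every triangle has an odd number of positive edges,
# i.e. the product of its three edge signs is positive
balanced = all(S[i, j] * S[j, k] * S[i, k] > 0
               for i in range(n)
               for j in range(i + 1, n)
               for k in range(j + 1, n))
print(balanced)
```

Near blowup the matrix is dominated by a rank-one term from the leading eigenvector of A(0), whose sign pattern defines the two teams.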
Emergent Behaviors: Emergence of Two Antagonistic Teams
dA/dt = A²
http://vodpod.com/watch/5255321-mathematical-model-shows-how-groups-split-into-factions-physorg-com
January 4, 2011, by Bill Steele
Our revels now are ended. These our actors, as I foretold you, were all spirits, and are melted into air, into thin air.
The cloud-capped towers, the gorgeous palaces, the solemn temples, the great globe itself, yea, all which it inherit, shall dissolve, and, like this insubstantial pageant faded, leave not a rack behind.
We are such stuff as dreams are made on, and our little life is rounded with a sleep.
Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare