STOCHASTIC OPTIMIZATION METHODS FOR THE SIMULTANEOUS CONTROL OF PARAMETER-DEPENDENT SYSTEMS
Umberto Biccari
Fundación Deusto and Universidad de Deusto, Bilbao
[email protected] | cmc.deusto.es/umberto-biccari
Joint work with: Ana Navarro (Universitat de València) and Enrique Zuazua (FAU, Fundación Deusto and Universidad Autónoma de Madrid)
June 12, 2020
INTRODUCTION
Keywords

Key concepts of the presentation:
• parameter-dependent models
• simultaneous controllability
• stochastic optimization
Parameter-dependent models

Parameter-dependent models appear in many real-life applications, to describe physical phenomena which may have different realizations:

    x'_ν(t) = A_ν x_ν(t) + B u(t),   0 < t < T,
    x_ν(0) = x^0,                    ν ∈ K.

Example 1: linearized cart-inverted pendulum system

    d/dt (x_ν, v_ν, θ_ν, ω_ν)^T = A_ν (x_ν, v_ν, θ_ν, ω_ν)^T + B u,

    A_ν = [ 0   0            1   0
            0  −ν/M          0   0
            0   0            0   1
            0  (ν+M)/(Mℓ)    0   0 ],      B = (0, 1, 0, −1)^T.

Example 2: system of thermoelasticity

    w_tt − µ∆w − (λ+µ)∇div(w) + α∇θ = u 1_ω,
    θ_t − ∆θ + β div(w_t) = 0,

where λ and µ are the Lamé coefficients.

Lebeau and Zuazua, Null controllability of a system of linear thermoelasticity, 2002
Simultaneous controllability

We look for a unique parameter-independent control u such that, at time T > 0, the corresponding solution x_ν satisfies

    x_ν(T) = x^T,   for all ν ∈ K.

In the ODE setting, simultaneous controllability is equivalent to the classical controllability of the augmented system

    x' = A x + B u,

with x = (x_{ν_1}, . . . , x_{ν_|K|})^T ∈ R^{N|K|}, the same control u acting on every component, and where the matrices A and B are given by

    A = diag(A_{ν_1}, . . . , A_{ν_|K|}) ∈ R^{N|K|×N|K|}   and   B = (B, . . . , B)^T ∈ R^{N|K|×1}.

Lohéac and Zuazua, From averaged to simultaneous controllability, 2016
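The augmented pair (A, B) above can be assembled directly; a minimal NumPy sketch (the function name is illustrative, not from the talk):

```python
import numpy as np

def augmented_system(A_list, B):
    """Assemble the augmented pair for simultaneous controllability:
    A = blockdiag(A_nu1, ..., A_nu|K|) and B = (B, ..., B)^T (B stacked |K| times)."""
    K, N = len(A_list), A_list[0].shape[0]
    A_aug = np.zeros((N * K, N * K))
    for k, A_nu in enumerate(A_list):
        # each realization occupies one diagonal block
        A_aug[k * N:(k + 1) * N, k * N:(k + 1) * N] = A_nu
    B_aug = np.vstack([B] * K)
    return A_aug, B_aug
```

Simultaneous controllability can then be checked, for moderate N|K|, via the Kalman rank condition on the augmented pair.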
Computation of simultaneous controls

    û = argmin_{u ∈ L²(0,T;R^M)} F_ν(u),

    F_ν(u) := (1/2) E[ ‖x_ν(T) − x^T‖²_{R^N} ] + (β/2) ‖u‖²_{L²(0,T;R^M)}.

For a finite parameter set K the expectation is the average

    F_ν(u) = (1/|K|) Σ_{ν_k ∈ K} f_{ν_k}(u) + (β/2) ‖u‖²_{L²(0,T;R^M)},   f_{ν_k}(u) := (1/2) ‖x_{ν_k}(T) − x^T‖²_{R^N}.

Typical approaches:
• Gradient Descent (GD): u^{k+1} = u^k − η_k ∇F_ν(u^k)
• Conjugate Gradient (CG)

Nocedal and Wright, Numerical optimization, 1999
Ciarlet, Introduction à l'analyse numérique matricielle et à l'optimisation, 1988

Both approaches have a high computational cost when dealing with large parameter sets.
Stochastic optimization

STOCHASTIC GRADIENT DESCENT (SGD)
This is a simplification of the classical GD in which, instead of computing ∇F_ν for all parameters ν ∈ K, in each iteration this gradient is estimated on the basis of a single randomly picked configuration:

    u^{k+1} = u^k − η_k ∇f_{ν_k}(u^k).

Robbins and Monro, A stochastic approximation method, 1951

CONTINUOUS STOCHASTIC GRADIENT (CSG)
This is a variant of SGD, based on the idea of reusing previously obtained information to improve the efficiency of the algorithm:

    u^{k+1} = u^k − η_k G^k,   G^k = Σ_{ℓ=1}^k α_ℓ ∇f_{ν_ℓ}(u^ℓ).

Pflug, Bernhardt, Grieshammer and Stingl, A new stochastic gradient method for the efficient solution of structural optimization problems with infinitely many state problems, 2020
OPTIMIZATION ALGORITHMS
Gradient Descent

    u^{k+1} = u^k − η_k ∇F_ν(u^k) = u^k − η_k ( β u^k − (1/|K|) Σ_{ν∈K} B^T p^k_ν ),

where, for each ν ∈ K, the state x_ν and the adjoint p_ν solve

    x'_ν(t) = A_ν x_ν(t) + B u(t),   0 < t < T,   x_ν(0) = x^0,
    p'_ν(t) = −A^T_ν p_ν(t),         0 < t < T,   p_ν(T) = −(x_ν(T) − x^T).

Convergence

Since F_ν is convex, if we take η_k constant and small enough, we have

    ‖u^k − û‖²_{R^N} ≤ ‖u^0 − û‖²_{R^N} e^{−2 C_GD k},   C_GD = ln( (ρ+1)/(ρ−1) ),

where û is the minimizer and ρ the condition number of the problem. Hence

    ‖u^k − û‖²_{R^N} < ε  →  k = O( ln(ε^{−1}) / C_GD )  →  cost_GD = O( |K| ln(ε^{−1}) / C_GD ).
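One possible time discretization of this iteration is sketched below (forward Euler, scalar control; names and the discretization choices are illustrative assumptions, not the authors' code):

```python
import numpy as np

def gd_step(u, A_list, B, x0, xT, beta, eta, dt):
    """One GD iteration u <- u - eta * grad F(u), with the gradient
    beta*u - (1/|K|) * sum_nu B^T p_nu assembled from one forward and one
    adjoint solve per parameter. u holds the control on the time grid."""
    nt = u.size
    grad = beta * u.copy()
    for A in A_list:
        # forward solve: x' = A x + B u, x(0) = x0
        x = x0.copy()
        for n in range(nt):
            x = x + dt * (A @ x + B[:, 0] * u[n])
        # backward adjoint solve: p' = -A^T p, p(T) = -(x(T) - xT)
        p = -(x - xT)
        Bp = np.zeros(nt)
        for n in range(nt - 1, -1, -1):
            Bp[n] = B[:, 0] @ p
            p = p + dt * (A.T @ p)
        grad -= Bp / len(A_list)
    return u - eta * grad
```

Each call costs one forward and one adjoint solve per parameter, which is where the |K| factor in cost_GD comes from.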
GD - practical considerations

The expected exponential convergence of GD may be violated in practice.

The convergence rate is given in terms of the constant C_GD(ρ), which is positive, decreasing, and converges to zero as ρ → +∞.

Bad conditioning of the minimization problem therefore degrades the actual convergence of GD.

Example

    min_{x ∈ R³} ( (1/2) x^T Q_τ x − b^T x ),

    Q_τ = [ 1  0  0
            0  τ  0
            0  0  τ² ],   b = −(1, 1, 1)^T,   ρ = λ_max/λ_min = τ².

    τ     iterations       ρ
    2            27        4
    5           161       25
    10          633      100
    20         2511      400
    50        15619     2500

Meza, Steepest descent, 2010
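The effect can be reproduced in a few lines; a sketch assuming a fixed step η = 2/(λ_min + λ_max) and a gradient-norm stopping rule, so the counts will differ somewhat from the table above, but the growth with τ is the same:

```python
import numpy as np

def gd_iterations(tau, tol=1e-6, max_iter=10**6):
    """Iterations GD needs on min_x (1/2) x^T Q_tau x - b^T x,
    with Q_tau = diag(1, tau, tau^2) and b = -(1, 1, 1)^T."""
    Q = np.diag([1.0, tau, tau ** 2])
    b = -np.ones(3)
    eta = 2.0 / (1.0 + tau ** 2)   # 2 / (lambda_min + lambda_max)
    x = np.zeros(3)
    for k in range(1, max_iter + 1):
        grad = Q @ x - b
        if np.linalg.norm(grad) < tol:
            return k
        x = x - eta * grad
    return max_iter
```

The iteration count grows roughly like ρ = τ², in line with the table.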
Conjugate Gradient

The optimality condition for F_ν can be written as a linear system: since

    ∇F_ν(u) = β u − (1/|K|) Σ_{ν∈K} B^T p_ν = (βI + E[L*_{T,ν} L_{T,ν}]) u + E[L*_{T,ν}(y_ν(T) − x^T)],

setting A = βI + E[L*_{T,ν} L_{T,ν}] and b = −E[L*_{T,ν}(y_ν(T) − x^T)] yields A u = b, where

    L_{T,ν} : L²(0,T;R^M) → R^N,    u ↦ z_ν(T),
    L*_{T,ν} : R^N → L²(0,T;R^M),   p_{T,ν} ↦ B^T p_ν,

and y_ν, z_ν solve

    y'_ν(t) = A_ν y_ν(t),            0 < t < T,   y_ν(0) = x^0,
    z'_ν(t) = A_ν z_ν(t) + B u(t),   0 < t < T,   z_ν(0) = 0.

Convergence

    ‖u^k − û‖²_{R^N} ≤ 4 ‖u^0 − û‖²_{R^N} e^{−2 C_CG k},   C_CG = ln( (√ρ+1)/(√ρ−1) ),

    ‖u^k − û‖²_{R^N} < ε  →  k = O( ln(ε^{−1}) / C_CG )  →  cost_CG = O( |K| ln(ε^{−1}) / C_CG ).
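Once A = βI + E[L*L] and b are available (in practice through matrix-vector products), the system A u = b is solved by standard CG for symmetric positive definite matrices; a generic sketch:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Standard CG for a symmetric positive definite system A u = b."""
    n = b.size
    max_iter = max_iter or n
    u = np.zeros(n)
    r = b - A @ u           # residual
    d = r.copy()            # search direction
    rs = r @ r
    if np.sqrt(rs) < tol:
        return u
    for _ in range(max_iter):
        Ad = A @ d
        alpha = rs / (d @ Ad)
        u = u + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return u
```

In exact arithmetic the loop terminates after at most n iterations, the finite termination property of CG.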
CG - practical considerations

The expected exponential convergence of CG may be violated in practical experiments, although the situation is less critical than for GD.

• The constant C_CG(ρ) depends on the square root of ρ, hence CG is less sensitive to the conditioning of the problem.
• CG enjoys the finite termination property: if we apply CG to solve an N-dimensional problem, the algorithm converges in at most N iterations.
Stochastic Gradient Descent

    u^{k+1} = u^k − η_k ∇f_{ν_k}(u^k) = u^k − η_k ( β u^k − B^T p^k_{ν_k} ),   ν_k i.i.d. sampled from K,

where x_{ν_k} and p_{ν_k} solve

    x'_{ν_k}(t) = A_{ν_k} x_{ν_k}(t) + B u(t),   0 < t < T,   x_{ν_k}(0) = x^0,
    p'_{ν_k}(t) = −A^T_{ν_k} p_{ν_k}(t),         0 < t < T,   p_{ν_k}(T) = −(x_{ν_k}(T) − x^T).

Applying SGD for minimizing F_ν(u) thus requires, at each iteration k, only one resolution of the dynamics.
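Abstracting the forward/adjoint computation into a sampled gradient ∇f_ν, the SGD loop is a few lines; a sketch with Robbins-Monro steps η_k = η_0/k (names are illustrative):

```python
import numpy as np

def sgd(grad_f, nus, u0, eta0=1.0, iters=2000, seed=0):
    """SGD with Robbins-Monro step sizes eta_k = eta0 / k, which satisfy
    sum eta_k = +inf and sum eta_k^2 < +inf."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for k in range(1, iters + 1):
        nu = nus[rng.integers(len(nus))]       # one randomly picked realization
        u = u - (eta0 / k) * grad_f(u, nu)
    return u
```

On the toy objective F(u) = (1/|K|) Σ_ν (u − ν)²/2, whose minimizer is the mean of K, the iterates with η_k = 1/k reduce to a running average of the sampled ν's and converge to that mean.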
Stochastic Gradient Descent - convergence

In SGD the iterate sequence (u^k)_{k≥1} is a stochastic process determined by the random sequence (ν_k)_{k≥1} ⊂ K. Hence, the convergence properties are stated in expectation, E[‖u^{k+1} − û‖²_{R^N}], or in the sense of almost sure convergence.

Bach and Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, 2011
Bottou, Online learning and stochastic approximations, 1998

In SGD, convergence is guaranteed if the step sizes are chosen such that E[‖∇F_ν(u^k)‖²] is bounded above by a deterministic quantity. In particular, a fixed step size η_k = η, even if small, does not guarantee convergence. A standard approach is to use a decreasing sequence such that

    Σ_{k=1}^∞ η_k = +∞   and   Σ_{k=1}^∞ η_k² < +∞.

Robbins and Monro, A stochastic approximation method, 1951
Bottou, Curtis and Nocedal, Optimization methods for large-scale machine learning, 2018
Stochastic Gradient Descent - convergence

If η_k is properly chosen, by means of standard martingale techniques one can show that SGD converges almost surely:

    u^k → û almost surely, as k → +∞.

Convergence rate

Because of the noise introduced by the random selection of the descent direction, the convergence of SGD is only sublinear:

    E[‖u^k − û‖²_{R^N}] = O(k^{−1}),

    E[‖u^k − û‖²_{R^N}] < ε  →  k = O(ε^{−1})  →  cost_SGD = O(ε^{−1}).
Continuous Stochastic Gradient

    u^{k+1} = u^k − η_k G^k,   G^k = Σ_{ℓ=1}^k α_ℓ ∇f_{ν_ℓ}(u^ℓ) = Σ_{ℓ=1}^k α_ℓ ( β u^ℓ − B^T p^ℓ_{ν_ℓ} ).

CONVERGENCE PROPERTIES
As the optimization process evolves, the approximated gradient G^k converges almost surely to the full gradient of the objective functional:

    G^k → ∇F_ν almost surely, as k → +∞.

Hence CSG is a less noisy algorithm than SGD and has a better convergence behavior. In particular, convergence may be guaranteed also with a fixed learning rate η_k = η.

Pflug, Bernhardt, Grieshammer and Stingl, A new stochastic gradient method for the efficient solution of structural optimization problems with infinitely many state problems, 2020
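The reuse of past gradients can be sketched as follows. This is a simplification with uniform weights α_ℓ = 1/k; in the method of Pflug et al. the weights come from nearest-neighbor approximations in parameter space, so this sketch only illustrates the structure of the update, not their algorithm:

```python
import numpy as np

def csg(grad_f, nus, u0, eta=0.1, iters=5000, seed=0):
    """CSG sketch for a scalar control: the search direction G_k aggregates
    all previously computed stochastic gradients (uniform weights 1/k here)."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    G_sum = 0.0
    for k in range(1, iters + 1):
        nu = nus[rng.integers(len(nus))]   # one sampled realization per step
        G_sum = G_sum + grad_f(u, nu)      # accumulate past stochastic gradients
        G = G_sum / k                      # G_k = sum_l alpha_l grad f_{nu_l}(u^l)
        u = u - eta * G                    # a fixed step size eta is admissible
    return u
```

On the same toy quadratic objective as before, the iterates approach the minimizer (the mean of the sampled parameters) even with the fixed step size.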
NUMERICAL SIMULATIONS
Numerical simulations

Linearized cart-inverted pendulum system

    d/dt (x_ν, v_ν, θ_ν, ω_ν)^T = A_ν (x_ν, v_ν, θ_ν, ω_ν)^T + B u,

    A_ν = [ 0   0            1   0
            0  −ν/M          0   0
            0   0            0   1
            0  (ν+M)/(Mℓ)    0   0 ],      B = (0, 1, 0, −1)^T.

• The system includes a cart of mass M and a rigid pendulum of length ℓ.
• The pendulum is anchored to the cart, and a variable mass, described by the parameter ν, is placed at its free end.
• The cart moves on a horizontal plane. The states x_ν(t) and v_ν(t) describe its position and velocity, respectively.
• During the motion of the cart the pendulum deviates from the initial vertical position by an angle θ_ν(t), with angular velocity ω_ν(t).
• Starting from an initial state (x^i, v^i, 0, 0), we want to compute a parameter-independent control function u steering all the realizations of the system in time T to the final state (x^f, 0, 0, 0).
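For a given ν the matrices of the linearized model can be written down directly; a minimal NumPy sketch (the function name and the Kalman-rank usage below are illustrative, not from the talk):

```python
import numpy as np

def pendulum_matrices(nu, M=10.0, ell=1.0):
    """Linearized cart-inverted pendulum matrices for tip mass nu,
    cart mass M and pendulum length ell."""
    A = np.array([
        [0.0, 0.0,                  1.0, 0.0],
        [0.0, -nu / M,              0.0, 0.0],
        [0.0, 0.0,                  0.0, 1.0],
        [0.0, (nu + M) / (M * ell), 0.0, 0.0],
    ])
    B = np.array([[0.0], [1.0], [0.0], [-1.0]])
    return A, B
```

Each individual realization satisfies the Kalman rank condition rank[B, A_ν B, A_ν²B, A_ν³B] = 4; the simultaneous problem additionally requires controllability of the augmented system over all ν ∈ K.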
Numerical simulations

Input data:
• x^0 = (−1, 1, 0, 0)^T
• x^T = (0, 0, 0, 0)^T
• T = 1 s
• ε = 10^{−4}
• M = 10
• ℓ = 1
• ν ∈ K = {ν_1, . . . , ν_|K|}, with ν_1 = 0.1 and ν_|K| = 1
Numerical simulations

                 GD                 CG                SGD                CSG
    |K|    Iter.    Time      Iter.    Time     Iter.    Time     Iter.    Time
    2      1868     45.1s       12      1.1s    2195     33.1s     930     18.6s
    10     1869    150.1s       13      2.6s    2106     31.4s     923     17.4s
    100    1870   1799.5s       12     17.7s    2102     28.9s     929     17.4s
    250       -         -       13     50.3s    2080     28.2s     928     17.9s
    500       -         -       13    101.3s    2099     32.9s     927     21.5s
Numerical simulations

CSG outperforms SGD in terms of the number of iterations it requires to converge and, consequently, in total computational time. This is because the CSG optimization process is less noisy than that of SGD, yielding a better convergence behavior.
Conclusions

We compared the GD, CG, SGD and CSG algorithms for the minimization of a quadratic functional associated with the simultaneous controllability of linear parameter-dependent models.

We observed the following:

1. The GD approach is the worst in terms of computational complexity, as a consequence of the bad conditioning of the simultaneous controllability problem.
2. The choice of SGD or CSG instead of CG is preferable only when dealing with parameter sets of large cardinality |K|.
Open problems

SIMULTANEOUS CONTROLLABILITY OF PDE MODELS

• In the PDE setting, simultaneous controllability is a quite delicate issue because of the appearance of peculiar phenomena which are not detected at the ODE level.
• For some PDE systems, simultaneous controllability may be understood by looking at the spectral properties of the model. Roughly speaking, one needs all the eigenvalues to have multiplicity one in order to be able to observe every eigenmode independently. This fact generally yields restrictions on the validity of simultaneous controllability, which may be difficult to tackle at the numerical level.

Dáger and Zuazua, Controllability of star-shaped networks of strings, 2001
Open problems

COMPARISON WITH THE GREEDY METHODOLOGY

• The greedy approach aims to approximate the dynamics and controls of linear parameter-dependent systems by identifying the most meaningful realizations of the parameters.

Lazar and Zuazua, Greedy controllability of finite dimensional linear systems, 2016
Hernández-Santamaría, Lazar and Zuazua, Greedy controllability of finite dimensional linear systems, 2019

• A comparison of the greedy and stochastic approaches would be an interesting issue.
THANK YOU FOR YOUR ATTENTION!

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694126-DYCON).