Fast Evolutionary Optimization: Comparison-Based Optimizers Need Comparison-Based Surrogates. PPSN...
DESCRIPTION
Taking inspiration from approximate ranking, this paper investigates the use of a rank-based Support Vector Machine as a surrogate model within CMA-ES, enforcing the invariance of the approach with respect to monotonic transformations of the fitness function. Whereas the choice of the SVM kernel is known to be a critical issue, the proposed approach uses the covariance matrix adapted by CMA-ES within a Gaussian kernel, ensuring the adaptation of the kernel to the currently explored region of the fitness landscape at almost no computational overhead. The empirical validation of the approach on standard benchmarks, against CMA-ES and recent surrogate-based variants of CMA-ES, demonstrates the efficiency and scalability of the proposed approach.
TRANSCRIPT
Comparison-Based Optimizers Need Comparison-Based Surrogates
Ilya Loshchilov (1,2), Marc Schoenauer (1,2), Michèle Sebag (2,1)
(1) TAO Project-team, INRIA Saclay - Île-de-France
(2) Laboratoire de Recherche en Informatique (UMR CNRS 8623), Université Paris-Sud, 91128 Orsay Cedex, France
Contents

1. Motivation
   - Why Comparison-Based Surrogates?
   - Previous Work
2. Background
   - Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
   - Support Vector Machine (SVM)
3. ACM-ES
   - Algorithm
   - Experiments
Why Comparison-Based Surrogates?

Surrogate-model (meta-model) assisted optimization:

- Construct an approximation model M(x) of f(x).
- Optimize the model M(x) in lieu of f(x) to reduce the number of costly evaluations of the function f(x).
Example

$f(x) = x_1^2 + (x_1 + x_2)^2$

An efficient Evolutionary Algorithm (EA) with surrogate models may be 4.3 times faster on $f(x)$.
But the same EA is only 2.4 times faster on $f'(x) = f(x)^{1/4}$! [a]

[a] CMA-ES with quadratic meta-model (lmm-CMA-ES) on $f_{Schwefel}$ in 2-D
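A minimal Python sketch of why this happens: the ranking of candidate points, which is all a comparison-based optimizer sees, is unchanged by the monotonic transform $f \mapsto f^{1/4}$. The function is the slide's example; the random sample points are illustrative.

```python
import numpy as np

def f(x):
    # The slide's example: f(x) = x1^2 + (x1 + x2)^2
    return x[0] ** 2 + (x[0] + x[1]) ** 2

rng = np.random.default_rng(0)
points = rng.standard_normal((10, 2))

f_values = np.array([f(x) for x in points])
g_values = f_values ** 0.25  # monotonic transform f' = f^(1/4)

# The argsort of both value vectors is identical: every pairwise
# comparison, hence the behavior of a comparison-based optimizer,
# is preserved, while a regression surrogate of f' would differ.
assert (np.argsort(f_values) == np.argsort(g_values)).all()
```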
Ordinal Regression in Evolutionary Computation

Goal: find the function $F(x)$ which preserves the ordering of the training points $x_i$ ($x_i$ has rank $i$):

$x_i \succ x_j \Leftrightarrow F(x_i) > F(x_j)$

$F(x)$ is invariant to any rank-preserving transformation.
CMA-ES with Ranking Support Vector Machine on Rosenbrock: [1]

[Figure: mean fitness vs. number of function evaluations for n = 2, 5, 10, comparing the $(\mu_I, \lambda)$-CMA-ES with surrogate-assisted variants using polynomial kernels (d = 2, d = 4) and an RBF kernel ($\gamma = 1$).]
[1] T. Runarsson (2006). "Ordinal Regression in Evolutionary Computation"
Exploit the local topography of the function

CMA-ES adapts the covariance matrix C, which describes the local structure of the function.

Mahalanobis (fully weighted Euclidean) distance:

$d(x_i, x_j) = \sqrt{(x_i - x_j)^T C^{-1} (x_i - x_j)}$
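A small Python sketch of this distance; the diagonal $C$ below is an illustrative stand-in for the matrix CMA-ES would actually adapt.

```python
import numpy as np

def mahalanobis(xi, xj, C):
    """Mahalanobis distance sqrt((xi - xj)^T C^{-1} (xi - xj))."""
    d = xi - xj
    # solve() avoids forming C^{-1} explicitly
    return np.sqrt(d @ np.linalg.solve(C, d))

# An anisotropic C, as CMA-ES would adapt on an elongated valley:
C = np.array([[4.0, 0.0],
              [0.0, 1.0]])
# Along the "long" axis the same Euclidean step counts as shorter:
print(mahalanobis(np.array([1.0, 0.0]), np.zeros(2), C))  # 0.5
print(mahalanobis(np.array([0.0, 1.0]), np.zeros(2), C))  # 1.0
```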
Results of CMA-ES with quadratic meta-models: [2]

[Figure: O(FLOP) per saved function evaluation vs. problem dimension n (2-16) for lmm-CMA-ES.]

- Speed-up: a factor of 2-4 for n ≥ 4
- Complexity: from $O(n^4)$ to $O(n^6)$; becomes intractable for n > 16
- Rank-preserving invariance: NO
[2] S. Kern et al. (2006). "Local Meta-Models for Optimization Using Evolution Strategies"
Tractable or Efficient?

Answer: tractable, efficient, and invariant.
Ingredients: CMA-ES (Adaptive Encoding) and Ranking SVM.
Covariance Matrix Adaptation Evolution Strategy: Decompose to Understand

While CMA-ES is by definition CMA plus ES, the algorithmic decomposition has only recently been presented. [3]

Algorithm 1: CMA-ES = Adaptive Encoding + ES
1: $x_i \leftarrow m + \sigma N_i(0, I)$, for $i = 1 \ldots \lambda$
2: $f_i \leftarrow f(B x_i)$, for $i = 1 \ldots \lambda$
3: if Evolution Strategy (ES) then
4:   $\sigma \leftarrow \sigma \exp^{\propto}\left(\frac{\text{success rate}}{\text{expected success rate}} - 1\right)$
5: if Cumulative Step-Size Adaptation ES (CSA-ES) then
6:   $\sigma \leftarrow \sigma \exp^{\propto}\left(\frac{\|\text{evolution path}\|}{\|\text{expected evolution path}\|} - 1\right)$
7: $B \leftarrow \text{AECMA-Update}(B x_1, \ldots, B x_\mu)$
[3] N. Hansen (2008). "Adaptive Encoding: How to Render Search Coordinate System Invariant"
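A minimal Python sketch of Algorithm 1 under stated simplifications: the step-size rules of lines 3-6 are elided, and `ae_update` is only a stub for the AECMA-Update of Hansen (2008).

```python
import numpy as np

def ae_update(B, selected):
    # Stub: the real AECMA-Update adapts B so that B B^T tracks the
    # covariance of the selected steps, PCA-style.
    return B

def es_with_adaptive_encoding(f, m, sigma, B, lam, mu, n_iters):
    """Sketch of Algorithm 1 with a constant step size."""
    n = len(m)
    for _ in range(n_iters):
        x = m + sigma * np.random.randn(lam, n)      # line 1: sample in encoded space
        fitness = np.array([f(B @ xi) for xi in x])  # line 2: evaluate decoded points
        order = np.argsort(fitness)                  # minimization
        # lines 3-6 (step-size adaptation) omitted in this sketch
        m = x[order[:mu]].mean(axis=0)               # recombine the mu best
        B = ae_update(B, x[order[:mu]])              # line 7: adapt the encoding
    return B @ m                                     # decoded incumbent
```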
Adaptive Encoding: Inspired by Principal Component Analysis (PCA)

[Figure: Principal Component Analysis (left) vs. the Adaptive Encoding update (right).]
Support Vector Machine for Classification: Linear Classifier

[Figure: linearly separable data with candidate hyperplanes $L_1$, $L_2$, $L_3$; the maximum-margin hyperplane $\langle w, x \rangle = b$, margin boundaries at $b \pm 1$, margin width $2/\|w\|$, and the support vectors $x_i$.]
Main Idea

Training data: $D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p, y_i \in \{-1, +1\}\}_{i=1}^n$

$\langle w, x_i \rangle \geq b + \varepsilon \Rightarrow y_i = +1$; $\langle w, x_i \rangle \leq b - \varepsilon \Rightarrow y_i = -1$.
Dividing by $\varepsilon > 0$:
$\langle w, x_i \rangle - b \geq +1 \Rightarrow y_i = +1$; $\langle w, x_i \rangle - b \leq -1 \Rightarrow y_i = -1$.

Optimization Problem: Primal Form

Minimize over $\{w, \xi\}$: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$
subject to: $y_i(\langle w, x_i \rangle - b) \geq 1 - \xi_i$, $\xi_i \geq 0$
Optimization Problem: Dual Form

From the Lagrange theorem, instead of minimizing $F$:
Minimize over $\{\alpha, G\}$: $F - \sum_i \alpha_i G_i$, subject to $\alpha_i \geq 0$, $G_i \geq 0$.

Leaving out the details, the dual form is:
Maximize over $\{\alpha\}$: $\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to: $0 \leq \alpha_i \leq C$, $\sum_{i=1}^n \alpha_i y_i = 0$

Properties

Decision function: $F(x) = \mathrm{sign}(\sum_{i=1}^n \alpha_i y_i \langle x_i, x \rangle - b)$
The dual form may be solved using a standard quadratic programming solver.
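As a sketch of that last remark, here is the dual above written as a QP and handed to a generic solver. The use of cvxopt is an assumption for illustration; any QP solver with box and equality constraints would do.

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_svm_dual(X, y, C):
    """Solve the soft-margin SVM dual; X is (n, p), y has entries +/-1."""
    y = y.astype(float)
    n = len(y)
    K = X @ X.T                                     # Gram matrix <xi, xj>
    # cvxopt minimizes (1/2) a^T P a + q^T a, i.e. the negated dual objective
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # box: 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))                    # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
    return alpha, w
```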
Support Vector Machine for Classification: Non-Linear Classifier

[Figure: (a) data that are not linearly separable; (b) a mapping $\Phi$ to a feature space; (c) the margin hyperplanes $\langle w, \Phi(x) \rangle - b = \pm 1$, margin width $2/\|w\|$, and support vectors $x_i$ in that space.]
Non-linear classification with the "kernel trick"

Maximize over $\{\alpha\}$: $\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to: $\alpha_i \geq 0$, $\sum_{i=1}^n \alpha_i y_i = 0$,
where $K(x, x') \stackrel{\mathrm{def}}{=} \langle \Phi(x), \Phi(x') \rangle$ is the kernel function.

Decision function: $F(x) = \mathrm{sign}(\sum_{i=1}^n \alpha_i y_i K(x_i, x) - b)$
Support Vector Machine for Classification, Non-Linear Classifier: Kernels

- Polynomial: $k(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d$
- Gaussian or Radial Basis Function (RBF): $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
- Hyperbolic tangent: $k(x_i, x_j) = \tanh(\kappa \langle x_i, x_j \rangle + c)$

[Figure: example decision boundaries for the polynomial (left) and Gaussian (right) kernels.]
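The first two kernels as plain Python functions (the hyperbolic tangent kernel is analogous):

```python
import numpy as np

def poly_kernel(xi, xj, d=2):
    """Polynomial kernel (<xi, xj> + 1)^d."""
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian/RBF kernel exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))
```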
Ranking Support Vector Machine

Find $F(x)$ which preserves the ordering of the training points.

[Figure: ranked training points and the level sets $L(r_1)$, $L(r_2)$ of the ranking function along the direction $w$.]
Primal problem

Minimize over $\{w, \xi\}$: $\frac{1}{2}\|w\|^2 + \sum_{i=1}^{N-1} C_i \xi_i$
subject to:
$\langle w, \Phi(x_i) - \Phi(x_{i+1}) \rangle \geq 1 - \xi_i$  $(i = 1 \ldots N-1)$
$\xi_i \geq 0$  $(i = 1 \ldots N-1)$

Dual problem

Maximize over $\{\alpha\}$: $\sum_{i=1}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N-1} \alpha_i \alpha_j K(x_i - x_{i+1}, x_j - x_{j+1})$
subject to: $0 \leq \alpha_i \leq C_i$  $(i = 1 \ldots N-1)$

Rank surrogate function in the case 1 rank = 1 point:

$F(x) = \sum_{i=1}^{N-1} \alpha_i (K(x_i, x) - K(x_{i+1}, x))$
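A sketch of evaluating this rank surrogate in Python, assuming the $\alpha_i$ have already been obtained from the dual above and `X_train` is sorted by rank (best first):

```python
def rank_surrogate(x, X_train, alpha, kernel):
    """F(x) = sum_i alpha_i (K(x_i, x) - K(x_{i+1}, x)).

    Higher F means better rank: the primal constraints push
    F(x_i) above F(x_{i+1}) for consecutive ranks.
    """
    return sum(alpha[i] * (kernel(X_train[i], x) - kernel(X_train[i + 1], x))
               for i in range(len(X_train) - 1))
```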
Model Learning: Non-Separable Ellipsoid Problem

$K(x_i, x_j) = e^{-\frac{(x_i - x_j)^T (x_i - x_j)}{2\sigma^2}}$;  $K_C(x_i, x_j) = e^{-\frac{(x_i - x_j)^T C^{-1} (x_i - x_j)}{2\sigma^2}}$
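A sketch of $K_C$ in Python, assuming $C^{-1}$ comes from the covariance matrix currently held by CMA-ES; with $C = I$ it reduces to the standard kernel $K$ on the left.

```python
import numpy as np

def kernel_c(xi, xj, C_inv, sigma):
    """Covariance-adapted RBF kernel: an RBF in the Mahalanobis metric of C."""
    d = xi - xj
    return np.exp(-(d @ C_inv @ d) / (2.0 * sigma ** 2))
```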
Optimization or Filtering? Don't Be Too Greedy

Optimization: significant potential speed-up if the surrogate model is global and accurate enough.
Filtering: "guaranteed" speed-up with a local surrogate model.
[Figure: probability density of the rank of retained offspring. Prescreen $\tilde{\lambda}$ pre-children with the surrogate, evaluate the best $\lambda'$, and retain $\lambda$ according to rank.]
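A minimal sketch of the filtering step, assuming minimization and a surrogate where lower values are better; the names are illustrative, not taken from the paper's code.

```python
def prescreen(sample, surrogate, f, lambda_pre, lambda_prime):
    """Rank lambda_pre pre-children with the cheap surrogate and spend
    true evaluations of f only on the best lambda_prime of them."""
    pre_children = [sample() for _ in range(lambda_pre)]
    pre_children.sort(key=surrogate)   # cheap: surrogate ranking only
    best = pre_children[:lambda_prime]
    return [(f(x), x) for x in best]   # costly: true evaluations
```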
ACM-ES Optimization Loop

1. Select the best training points.
2. Apply the change of coordinates, defined from the current covariance matrix $C$ and the current mean value $m$ [4].
3. Build a surrogate model using Rank SVM.
4. Generate pre-children and rank them according to the surrogate fitness function.
5. Evaluate the most promising $\lambda'$ children with the true fitness function.
6. Retain $\lambda$ offspring according to rank.
7. Add new training points and update the parameters of CMA-ES.

[Figure: one ACM-ES generation. A: select training points. B: build a rank-based surrogate model. C: generate pre-children. D: select the most promising children. Inset: the rank-density plot of the previous slide, with prescreening ($\tilde{\lambda}$), evaluation ($\lambda'$), and rank-based retention ($\lambda$).]
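A sketch of one generation tying steps 1-7 together. The `cma` interface (`sample`, `inverse_covariance`, `update`) and the helper `fit_rank_svm` are hypothetical stand-ins, not the paper's actual code; `prescreen` is the filter sketched on the previous slide.

```python
def acmes_generation(cma, archive, f, fit_rank_svm, prescreen,
                     n_train, lambda_pre, lambda_prime):
    """One ACM-ES generation; `archive` holds (fitness, x) pairs."""
    archive.sort(key=lambda fx: fx[0])               # 1. best training points first
    training = archive[:n_train]
    C_inv = cma.inverse_covariance()                 # 2. current change of coordinates
    surrogate = fit_rank_svm(training, C_inv)        # 3. Rank SVM surrogate
    evaluated = prescreen(cma.sample, surrogate, f,  # 4-6. rank pre-children, evaluate
                          lambda_pre, lambda_prime)  #      the most promising ones
    archive.extend(evaluated)                        # 7. grow the training archive
    cma.update(evaluated)                            #    and update CMA-ES parameters
```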
Results: Speed-Up

[Figure: speed-up of ACM-ES over CMA-ES vs. problem dimension on Schwefel, Schwefel^{1/4}, Rosenbrock, Noisy Sphere, Ackley, and Ellipsoid.]

| Function       | n  | λ   | λ′ | ε    | ACM-ES             | spu | CMA-ES            |
|----------------|----|-----|----|------|--------------------|-----|-------------------|
| Schwefel       | 10 | 10  | 3  |      | 801 ± 36           | 3.3 | 2667 ± 87         |
|                | 20 | 12  | 4  |      | 3531 ± 179         | 2.0 | 7042 ± 172        |
|                | 40 | 15  | 5  |      | 13440 ± 281        | 1.7 | 22400 ± 289       |
| Schwefel^{1/4} | 10 | 10  | 3  |      | 1774 ± 37          | 4.1 | 7220 ± 206        |
|                | 20 | 12  | 4  |      | 6138 ± 82          | 2.5 | 15600 ± 294       |
|                | 40 | 15  | 5  |      | 22658 ± 390        | 1.8 | 41534 ± 466       |
| Rosenbrock     | 10 | 10  | 3  |      | 2059 ± 143 (0.95)  | 3.7 | 7669 ± 691 (0.90) |
|                | 20 | 12  | 4  |      | 11793 ± 574 (0.75) | 1.8 | 21794 ± 1529      |
|                | 40 | 15  | 5  |      | 49750 ± 2412 (0.9) | 1.6 | 82043 ± 3991      |
| NoisySphere    | 10 | 10  | 3  | 0.15 | 766 ± 90 (0.95)    | 2.7 | 2058 ± 148        |
|                | 20 | 12  | 4  | 0.11 | 1361 ± 212         | 2.8 | 3777 ± 127        |
|                | 40 | 15  | 5  | 0.08 | 2409 ± 120         | 2.9 | 7023 ± 173        |
| Ackley         | 10 | 10  | 3  |      | 892 ± 28           | 4.1 | 3641 ± 154        |
|                | 20 | 12  | 4  |      | 1884 ± 50          | 3.5 | 6641 ± 108        |
|                | 40 | 15  | 5  |      | 3690 ± 80          | 3.3 | 12084 ± 247       |
| Ellipsoid      | 10 | 10  | 3  |      | 1628 ± 95          | 3.8 | 6211 ± 264        |
|                | 20 | 12  | 4  |      | 8250 ± 393         | 2.3 | 19060 ± 501       |
|                | 40 | 15  | 5  |      | 33602 ± 548        | 2.1 | 69642 ± 644       |
| Rastrigin      | 5  | 140 | 70 |      | 23293 ± 1374 (0.3) | 0.5 | 12310 ± 1098 (0.75) |

Mean number of function evaluations (± standard deviation) to reach the target fitness; "spu" is the speed-up of ACM-ES over CMA-ES; values in parentheses are success rates; ε is the noise level of NoisySphere.
Results: Learning Time

[Figure: model learning time (ms) vs. problem dimension on log-log axes; slope = 1.13.]

The cost of model learning/testing increases quasi-linearly with the dimension $d$ on the Sphere function.
Summary

ACM-ES

- ACM-ES is 2 to 4 times faster on uni-modal problems.
- Invariant to rank-preserving transformations: Yes.
- The computational complexity (the cost of the speed-up) is $O(n)$, compared to $O(n^6)$ for lmm-CMA-ES.
- The source code is available online: http://www.lri.fr/~ilya/publications/ACMESppsn2010.zip

Open Questions

- Extension to multi-modal optimization
- Adaptation of selection pressure and surrogate model complexity
Thank you for your attention!
Questions?
Parameters

SVM Learning:

- Number of training points: $N_{training} = 30\sqrt{d}$ for all problems, except Rosenbrock and Rastrigin, where $N_{training} = 70\sqrt{d}$
- Number of iterations: $N_{iter} = 50000\sqrt{d}$
- Kernel function: RBF with $\sigma$ equal to the average distance between the training points
- The cost of constraint violation: $C_i = 10^6 (N_{training} - i)^{2.0}$

Offspring Selection Procedure:

- Number of test points: $N_{test} = 500$
- Number of evaluated offspring: $\lambda' = \lambda/3$
- Offspring selection pressure parameters: $\sigma^2_{sel0} = 2\sigma^2_{sel1} = 0.8$
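A small sketch of the rank-dependent violation costs $C_i$ above: better-ranked training points receive a larger cost, so the learned surrogate is pushed hardest to respect the ordering of the best points.

```python
import numpy as np

def violation_costs(n_training):
    """C_i = 10^6 (N_training - i)^2.0 for the i-th ordering constraint."""
    i = np.arange(1, n_training)
    return 1e6 * (n_training - i) ** 2.0

print(violation_costs(5))  # the top-ranked constraint gets the largest cost
```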