Evaluating the Population Size Adaptation Mechanism for CMA-ES on the BBOB Noiseless Testbed
Kouhei Nishida and Youhei Akimoto
Shinshu University, Japan
Introduction: CMA-ES

- The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a stochastic search algorithm using the multivariate normal distribution.
  1. Generate candidate solutions $(x_i^{(t)})_{i=1,\dots,\lambda}$ from $\mathcal{N}(m^{(t)}, C^{(t)})$.
  2. Evaluate $f(x_i^{(t)})$ and sort them: $f(x_{1:\lambda}) < \dots < f(x_{\lambda:\lambda})$.
  3. Update the distribution parameters $\theta^{(t)} = (m^{(t)}, C^{(t)})$ using the ranking of the candidate solutions.
- The CMA-ES has default values for all strategy parameters (such as the population size λ and the learning rate η_c).
- A population size larger than the default value improves its performance in the following scenarios:
  1. well-structured multimodal functions
  2. noisy functions
- Tuning the population size in advance can easily become very expensive.
Introduction: Population Size Adaptation

- As a measure for the adaptation, we consider the randomness of the parameter update.
- To quantify the randomness of the parameter update, we introduce the evolution path in the parameter space.
- To keep the randomness of the parameter update at a certain level, the population size is adapted online.

Advantages of adapting the population size online:
- It does not require tuning the population size in advance.
- On rugged functions, it may accelerate the search by reducing the population size after converging into the basin of a local minimum.
Rank-µ update CMA-ES

The rank-µ update CMA-ES, which is a component of the CMA-ES, repeats the following procedure.
  1. Generate candidate solutions $(x_i^{(t)})_{i=1,\dots,\lambda}$ from $\mathcal{N}(m^{(t)}, C^{(t)})$.
  2. Evaluate $f(x_i^{(t)})$ and sort them: $f(x_{1:\lambda}) < \dots < f(x_{\lambda:\lambda})$.
  3. Update the distribution parameters $\theta^{(t)} = (m^{(t)}, C^{(t)})$ using the ranking of the candidate solutions:
$$\theta^{(t+1)} = \theta^{(t)} + \Delta\theta^{(t)}$$
$$\Delta m^{(t)} = \eta_m \sum_{i=1}^{\lambda} w_i \left( x_{i:\lambda}^{(t)} - m^{(t)} \right)$$
$$\Delta C^{(t)} = \eta_c \sum_{i=1}^{\lambda} w_i \left( \left( x_{i:\lambda}^{(t)} - m^{(t)} \right)\left( x_{i:\lambda}^{(t)} - m^{(t)} \right)^{\mathsf T} - C^{(t)} \right)$$
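A minimal Python sketch of one such iteration; the recombination weights (uniform over the best half) and the function names are our own illustration, not the authors' reference implementation:

import numpy as np

def rank_mu_update(f, m, C, lam, eta_m=1.0, eta_c=0.1, rng=np.random.default_rng(0)):
    # 1. Generate lam candidate solutions from N(m, C).
    X = rng.multivariate_normal(m, C, size=lam)
    # 2. Evaluate and sort so that X[0] = x_{1:lam} is the best candidate.
    X = X[np.argsort([f(x) for x in X])]
    # Illustrative recombination weights: nonnegative, summing to one over the best half.
    mu = lam // 2
    w = np.zeros(lam)
    w[:mu] = 1.0 / mu
    # 3. Update theta = (m, C) from the ranking only, via the steps y_i = x_{i:lam} - m.
    Y = X - m
    dm = eta_m * (w @ Y)
    dC = eta_c * sum(wi * (np.outer(y, y) - C) for wi, y in zip(w, Y))
    return m + dm, C + dC

For example, m, C = rank_mu_update(lambda x: np.sum(x**2), np.zeros(5), np.eye(5), lam=20) performs one update on the 5-D sphere function.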
Population Size Adaptation: Measurement

To quantify the randomness of the parameter update, we introduce the evolution path in the space $\Theta$ of the distribution parameters $\theta = (m, C)$:
$$p^{(t+1)} = (1 - \beta)\, p^{(t)} + \sqrt{\beta (2 - \beta)}\, \Delta\theta^{(t)}$$
The evolution path accumulates the successive steps in the parameter space $\Theta$.

Figure: An image of the evolution path: (a) less tendency, (b) strong tendency.
Population Size Adaptation: Measurement

- We measure the length of the evolution path based on the KL-divergence:
$$\|p\|_\theta^2 = p^{\mathsf T} I(\theta)\, p \approx \mathrm{KL}(\theta \,\|\, \theta + p)$$
  The KL-divergence measures the difference between two probability distributions.
- We measure the randomness of the parameter update by the ratio between $\|p^{(t+1)}\|_\theta^2$ and its expected value $\gamma^{(t+1)} \approx \mathbb{E}[\|p^{(t+1)}\|_\theta^2]$ under a random function:
$$\gamma^{(t+1)} = (1 - \beta)^2 \gamma^{(t)} + \beta (2 - \beta) \sum_{i=1}^{\lambda} w_i^2 \left( d\, \eta_m^2 + \frac{d(d+1)}{2}\, \eta_c^2 \right)$$
- Two important cases:
  - a random function: $\|p\|_\theta^2 / \gamma \approx 1$
  - too large $\lambda$: $\|p\|_\theta^2 / \gamma \to \infty$
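A sketch of the measurement step, keeping the m- and C-components of the path separately. The closed-form squared norm uses the Fisher metric of the Gaussian family, $\delta m^{\mathsf T} C^{-1} \delta m + \tfrac12 \operatorname{tr}((C^{-1}\delta C)^2)$, which is our reading of the KL-based norm above; the default β and the variable names are assumptions:

import numpy as np

def measure_randomness(p_m, p_C, gamma, dm, dC, C, w, eta_m, eta_c, beta=0.2):
    d = len(p_m)
    # Evolution path in parameter space: p <- (1 - beta) p + sqrt(beta (2 - beta)) dtheta.
    c = np.sqrt(beta * (2.0 - beta))
    p_m = (1.0 - beta) * p_m + c * dm
    p_C = (1.0 - beta) * p_C + c * dC
    # ||p||_theta^2 = p^T I(theta) p for N(m, C):
    #   p_m^T C^{-1} p_m + 0.5 tr((C^{-1} p_C)^2)  ~  KL(theta || theta + p).
    Cinv = np.linalg.inv(C)
    A = Cinv @ p_C
    sq_norm = p_m @ Cinv @ p_m + 0.5 * np.trace(A @ A)
    # Expected squared norm under a random function (formula above).
    gamma = (1.0 - beta) ** 2 * gamma + beta * (2.0 - beta) * np.sum(w ** 2) * (
        d * eta_m ** 2 + d * (d + 1) / 2 * eta_c ** 2)
    return p_m, p_C, gamma, sq_norm / gamma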
Population Size Adaptation: Adaptation

- If $\|p^{(t+1)}\|_{\theta^{(t)}}^2 / \gamma^{(t+1)} < \alpha$, regarding the update as inaccurate, the population size is increased with
$$\lambda^{(t+1)} = \left\lfloor \lambda^{(t)} \exp\left( \beta \left( \alpha - \frac{\|p^{(t+1)}\|_{\theta^{(t)}}^2}{\gamma^{(t+1)}} \right) \right) \right\rfloor \vee \left( \lambda^{(t)} + 1 \right).$$
- If $\|p^{(t+1)}\|_{\theta^{(t)}}^2 / \gamma^{(t+1)} > \alpha$, regarding the update as sufficiently accurate, the population size is decreased with
$$\lambda^{(t+1)} = \left\lfloor \lambda^{(t)} \exp\left( \beta \left( \alpha - \frac{\|p^{(t+1)}\|_{\theta^{(t)}}^2}{\gamma^{(t+1)}} \right) \right) \right\rfloor \vee \lambda_{\min}.$$
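Both branches share the same multiplicative update and differ only in the lower bound applied by ∨ (max), so the rule fits in a few lines. A sketch; the defaults for β and λ_min are assumptions, not specified on this slide:

import math

def adapt_population_size(lam, ratio, alpha=1.1, beta=0.2, lam_min=4):
    # ratio = ||p^{(t+1)}||^2_{theta^{(t)}} / gamma^{(t+1)}
    new_lam = math.floor(lam * math.exp(beta * (alpha - ratio)))
    if ratio < alpha:
        # Update regarded as inaccurate: increase, by at least one individual.
        return max(new_lam, lam + 1)
    # Update regarded as sufficiently accurate: decrease, bounded below by lam_min.
    return max(new_lam, lam_min)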
Algorithm Variants

We use the default setting for most of the parameters. The modified parameters are the learning rate for the mean vector, $c_m$, and the threshold $\alpha$ deciding whether the parameter update is considered accurate or not.

  PSAaLmC: $\alpha = \sqrt{2}$, $c_m = 0.1$
  PSAaLmD: $\alpha = \sqrt{2}$, $c_m = 1/D$
  PSAaSmC: $\alpha = 1.1$, $c_m = 0.1$
  PSAaSmD: $\alpha = 1.1$, $c_m = 1/D$

- The greater $\alpha$ is, the greater the population size tends to be kept.
- From our preliminary study, we set $c_c = \sqrt{2}/(D + 1)\, c_m$.
Restart Strategy

For each (re-)start of the algorithm, we initialize the mean vector $m \sim \mathcal{U}[-4, 4]^D$ and the covariance matrix $C = 2^2 I$. The maximum number of f-calls is set to $10^5 D$.

Termination conditions:
- tolf: median(fiqr_hist) < 10^{-12} · abs(median(fmin_hist))
  - the objective function value differences are too small to sort them without being affected by numerical errors.
- tolx: median(xiqr_hist) < 10^{-12} · min(abs(xmed_hist))
  - the coordinate value differences are too small to update the parameters without being affected by numerical errors.
- maxcond: cond(C) > 10^{14}
  - the matrix operations using C are not reliable due to numerical errors.
- maxeval: #f-calls ≥ 5 × 10^4 D (for noiseless) or 10^5 D (for noisy)
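A sketch of these checks as code; the history arrays (interquartile ranges and medians of recent f-values and coordinates, matching fiqr_hist, fmin_hist, xiqr_hist, and xmed_hist above) are assumed to be maintained by the caller, since the slide does not specify the window:

import numpy as np

def should_terminate(fiqr_hist, fmin_hist, xiqr_hist, xmed_hist, C, n_evals, D,
                     noisy=False):
    # tolf: f-value differences too small to sort reliably.
    if np.median(fiqr_hist) < 1e-12 * abs(np.median(fmin_hist)):
        return "tolf"
    # tolx: coordinate differences too small to update parameters reliably.
    if np.median(xiqr_hist) < 1e-12 * np.min(np.abs(xmed_hist)):
        return "tolx"
    # maxcond: matrix operations on C no longer numerically reliable.
    if np.linalg.cond(C) > 1e14:
        return "maxcond"
    # maxeval: per-run evaluation budget.
    if n_evals >= (1e5 if noisy else 5e4) * D:
        return "maxeval"
    return None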
BIPOP-CMA-ES

BIPOP restart strategy: a restart strategy with two budgets of function evaluations (see the sketch below).
- One budget is for an incrementally increasing population size,
  - to tackle well-structured multimodal functions or noisy functions.
- The other is for a relatively small population size and a relatively small step-size,
  - to tackle weakly-structured multimodal functions.
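For reference, a compact sketch of the BIPOP bookkeeping under our reading of Hansen (2009): the regime with the smaller spent budget runs next. Here run_cmaes is a hypothetical routine returning the number of f-calls it used, and the exact doubling and sampling formulas are assumptions:

import numpy as np

def bipop(run_cmaes, lam_def, max_evals, rng=np.random.default_rng(0)):
    evals_large = evals_small = 0
    n_large = 0
    while evals_large + evals_small < max_evals:
        if evals_large <= evals_small:
            # Budget 1: population size doubled at each restart, default step-size.
            lam_large = lam_def * 2 ** n_large
            evals_large += run_cmaes(lam=lam_large, sigma0=2.0)
            n_large += 1
        else:
            # Budget 2: smaller randomized population size and smaller step-size.
            lam = int(lam_def * (lam_large / (2 * lam_def)) ** rng.uniform() ** 2)
            evals_small += run_cmaes(lam=max(lam, lam_def),
                                     sigma0=2.0 * 10 ** (-2 * rng.uniform()))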
Noiseless: Unimodal Function
Figure: aRT scaling with dimension (2–40) at target Δf = 1e−8 on f1 Sphere, f5 Linear slope, f7 Step-ellipsoid, and f8 Rosenbrock original, for PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, and BIPOP-CMA-ES.
The aRT is higher than for the best 2009 portfolio on most of the unimodal functions, due to the lack of the step-size adaptation. On the Step-ellipsoid function, where the step-size adaptation is less important, our algorithm performs well.
Noiseless: Well-structured Multimodal Function
Figure: aRT scaling with dimension (2–40) at target Δf = 1e−8 on f15 Rastrigin, f17 Schaffer F7 (condition 10), f18 Schaffer F7 (condition 1000), and f19 Griewank-Rosenbrock F8F2.
The performance of the tested algorithms is similar to that of the BIPOP-CMA-ES, even without the step-size adaptation. On Griewank-Rosenbrock especially, the tested algorithm is partly better than the best 2009 portfolio.
Noiseless: Weakly-structured Multimodal Function
Figure: ECDFs (proportion of function+target pairs vs. log10 of #f-evals / dimension) on f20–f24 in 5-D and 20-D, for PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA, and best 2009.
The BIPOP-CMA-ES performs better than the tested algorithms because the tested algorithms have no mechanism to tackle weakly-structured multimodal functions.
Noiseless: Comparing the variants
Figure: ECDFs on f15–f19 in 10-D, for PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA, and best 2009.
Variants with α = 1.1 are better than ones with α = √2.
Noiseless Summary
- On well-structured multimodal functions, the tested algorithm performs well without the step-size adaptation.
- Due to the lack of the step-size adaptation, the aRT is higher than for the best 2009 portfolio on most of the unimodal functions.
- When the step-size adaptation is less important, the tested algorithm performs well.
- Variants with α = 1.1 tend to be better than ones with α = √2.
Noisy: Unimodal Function
Figure: aRT scaling with dimension (2–40) at target Δf = 1e−8 on f101–f103 Sphere moderate (Gauss/unif/Cauchy) and f104–f106 Rosenbrock moderate (Gauss/unif/Cauchy), for PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, and BIPOP-CMA-ES.
On the sphere functions, the algorithm is slower than the BIPOP-CMA-ES due to the lack of the step-size adaptation. The failure on the Rosenbrock functions is mainly due to the same reason.
Noisy: Unimodal Function
Figure: aRT scaling with dimension (2–40) at target Δf = 1e−8 on f113 Step-ellipsoid Gauss, f114 Step-ellipsoid unif, and f115 Step-ellipsoid Cauchy.
On the step-ellipsoid functions, where the step-size adaptation is less important, the algorithm performs well.
Noisy: Well-structured Multimodal Function
Figure: aRT scaling with dimension (2–40) at target Δf = 1e−8 on f122 Schaffer F7 Gauss, f123 Schaffer F7 unif, and f124 Schaffer F7 Cauchy.
On the Schaffer functions, the performance of the tested algorithm is similar to the best 2009 portfolio, and partly better than it.
Noisy: Comparing the variants
Figure: aRT scaling with dimension (2–40) at target Δf = 1e−8 on f113 Step-ellipsoid Gauss, f116 Ellipsoid Gauss, and f119 Sum of different powers Gauss.
The algorithms using c_m = 1/D sometimes get worse in low dimensions because the learning rate is too large.
Noisy: Comparing the variants
Figure: ECDFs on f101–f130 in 10-D, for PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA, and best 2009.
Variants with α = 1.1 are better than ones with α = √2.
Noisy Summary
- On well-structured multimodal functions, the tested algorithm performs similarly to the best 2009 portfolio.
- Due to the lack of the step-size adaptation, the convergence speed scales worse on the sphere functions, and the aRT is higher than for the best 2009 portfolio on most of the unimodal functions.
- Variants with α = 1.1 tend to be better than ones with α = √2.
- c_m = 1/D is too large in low dimensions.
Conclusion

Summary
- On well-structured multimodal functions, the tested algorithm performs similarly to the best 2009 portfolio.
- Due to the lack of the step-size adaptation, the aRT is higher than for the best 2009 portfolio on most of the unimodal functions and the weakly-structured functions.
- On noisy functions, c_m = 1/D is too large in low dimensions.

Future Work
- We will incorporate the rank-one update and the step-size adaptation.
Using a small learning rate works as averaging the mean vector over successive iterations.

Figure: mean vector trajectories (a) with a larger learning rate and (b) with a smaller learning rate.
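Unrolling the mean update from slide 4 makes the averaging explicit, assuming the weights sum to one so that $\Delta m^{(t)} = \eta_m(\bar x^{(t)} - m^{(t)})$ with $\bar x^{(t)} = \sum_{i=1}^{\lambda} w_i x_{i:\lambda}^{(t)}$ (here $\eta_m$ plays the role of $c_m$ from the variant settings):
$$m^{(t+1)} = (1 - \eta_m)\, m^{(t)} + \eta_m \bar x^{(t)} = (1 - \eta_m)^{t+1} m^{(0)} + \eta_m \sum_{k=0}^{t} (1 - \eta_m)^{t-k}\, \bar x^{(k)}$$
That is, $m^{(t+1)}$ is an exponentially weighted average of the past weighted means, with an effective horizon of roughly $1/\eta_m$ iterations; a smaller learning rate therefore averages over more iterations.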