
Gradient-based Adaptive Stochastic Search

Enlu Zhou

H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

November 5, 2014

1 / 41


Outline

1 Introduction

2 GASS for non-differentiable optimization

3 GASS for simulation optimization

4 Conclusions

2 / 41


Introduction

We consider

x* ∈ arg max_{x∈X} H(x).

Given any x ∈ X, H(x) can be evaluated exactly.

We are interested in objective functions that:
  lack structural properties (such as convexity and differentiability),
  have multiple local optima,
  can only be assessed by “black-box” evaluation.

4 / 41


Examples of Objective Functions

5 / 41

Background

Stochastic Search: use a randomized mechanism to generate a sequence of iterates
e.g., simulated annealing (Kirkpatrick et al. 1983), genetic algorithms (Goldberg 1989), tabu search (Glover 1990), nested partitions method (Shi and Ólafsson 2000), pure adaptive search (Zabinsky 2003), sequential Monte Carlo simulated annealing (Zhou and Chen 2011), model-based algorithms (survey by Zlochin et al. 2004).

Model-based Algorithms: generate candidate solutions from a sampling distribution (i.e., a probabilistic model)
e.g., ant colony optimization (Dorigo and Gambardella 1997), annealing adaptive search (Romeijn and Smith 1994), estimation of distribution algorithms (Mühlenbein and Paaß 1996), the cross-entropy method (Rubinstein 1997), model reference adaptive search (Hu et al. 2007).

6 / 41


Model-based optimization

[Figure 1: Optimization via model-based methods; two panels over the solution space X, showing a sampling distribution and the objective H(x).]

Examples for the sequence of distributions {g_k} include the following:

(a) proportional selection scheme — introduced in estimation of distribution algorithms (EDAs) [45], and the instantiation of the model reference adaptive search (MRAS) method in [21];

(b) Boltzmann distribution with decreasing (nonincreasing) temperature schedule — used in the annealing adaptive search (AAS) [41, 33];

(c) optimal importance sampling measure — used in the cross-entropy (CE) method [8].

Case (c) is easiest to implement, but may not converge to a global optimum. The other two cases have nice theoretical properties: case (a) guarantees improvement in expectation [21], whereas case (b) guarantees improvement in stochastic order [41]. All three will be used in the proposed research.

However, in all three cases, the sequence {g_k} is unknown explicitly a priori, or else the problem would essentially be solved. So at iteration k, sampling is done from a surrogate distribution (or distributions) that approximates g_k. There are two main approaches that have been adopted:

• Markov chain Monte Carlo approximation to the target distribution at each iteration [41], which involves generating a sequence of sample points from a sequence of distributions {π_{k,i}} following a Markov chain that asymptotically converges to the target distribution (at that iteration), i.e.,

π_{k,1}, π_{k,2}, ... → g_k;

e.g., a common implementation is the “hit-and-run” algorithm [35, 43, 44];

• projection onto distributions that are easy to work with, e.g., use a family of parameterized distributions {f_θ}, and project g_k onto the family to obtain a sequence that converges to the (final) target distribution, i.e.,

f_{θ_0}, f_{θ_1}, f_{θ_2}, ... → g_∞;

a common implementation minimizes the Kullback-Leibler (KL) divergence between f_{θ_k} and g_k at each iteration, because it leads to analytically tractable solutions if the parameterized distributions are from the exponential family.

The first approach is adopted by AAS and generates a sequence of candidate solutions at each iteration. MRAS and the CE method follow the second approach, which leads to a population of candidate solutions, from which an elite set is selected and used to update the distribution. We will follow the second approach in the proposed research.

7 / 41


Outline

1 Introduction

2 GASS for non-differentiable optimization

3 GASS for simulation optimization

4 Conclusions

8 / 41

Reformulation

Original problem:

x* ∈ arg max_{x∈X} H(x),   X ⊆ R^n.

Let {f(x; θ)} be a parameterized family of probability density functions on X. Then

∫ H(x) f(x; θ) dx ≤ H(x*) ≜ H*,   ∀ θ ∈ R^d.

“=” is achieved if and only if ∃ θ* s.t. the probability mass of f(x; θ*) is concentrated on a subset of the optimal solutions.

New problem:

θ* ∈ arg max_{θ∈R^d} ∫ H(x) f(x; θ) dx.

9 / 41


Why reformulation?

Possible Scenarios:

Original Problem (arg max_{x∈X} H(x))      →   New Problem (arg max_θ ∫ H(x) f(x; θ) dx)
Discrete in x                              →   Continuous in θ
Non-differentiable in x                    →   Differentiable in θ

Incorporate model-based optimization into gradient-based optimization:
1) Generate candidate solutions from f(·; θ) on the solution space X.
2) Use a gradient-based method to update the parameter θ.

Combine the robustness of model-based optimization with the relatively fast convergence of gradient-based optimization.

10 / 41


More reformulation

For an arbitrary but fixed θ′ ∈ R^d, define the function

l(θ; θ′) ≜ ln( ∫ S_{θ′}(H(x)) f(x; θ) dx ).

The shape function S_{θ′}(·): R → R+ is chosen to ensure

0 < l(θ; θ′) ≤ ln( S_{θ′}(H*) )   ∀ θ,

and “=” is achieved if ∃ a θ* s.t. the probability mass of f(x; θ*) is concentrated on a subset of the global optima.

So consider

max_θ l(θ; θ′).
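As an illustration only (this particular choice, and the constants H_lb and λ, are assumptions for exposition, not necessarily the shape function used in the paper), any positive, strictly increasing choice works; for example, if a lower bound H_lb on H is known,

\[
S_{\theta'}(y) = \exp\!\Big(\tfrac{y - H_{\mathrm{lb}}}{\lambda}\Big), \ \lambda > 0,
\qquad \Rightarrow \qquad
l(\theta;\theta') = \ln\!\int \exp\!\Big(\tfrac{H(x)-H_{\mathrm{lb}}}{\lambda}\Big) f(x;\theta)\,dx
\;\le\; \tfrac{H^{*}-H_{\mathrm{lb}}}{\lambda} = \ln S_{\theta'}(H^{*}),
\]

with equality exactly when f(·; θ) concentrates its mass on the global maximizers of H. This simple example does not depend on θ′; in general S_{θ′} may, e.g., through a quantile of H under f(·; θ′).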

11 / 41


Parameter updating

Suppose {f(·; θ)} is an exponential family of densities, i.e.,

f(x; θ) = exp{ θᵀT(x) − φ(θ) },   φ(θ) = ln{ ∫ exp(θᵀT(x)) dx }.

Then

∇_θ l(θ; θ′)|_{θ=θ′} = E_{p(·;θ′)}[T(X)] − E_{θ′}[T(X)],
∇²_θ l(θ; θ′)|_{θ=θ′} = Var_{p(·;θ′)}[T(X)] − Var_{θ′}[T(X)],

where p(x; θ′) ≜ S_{θ′}(H(x)) f(x; θ′) / ∫ S_{θ′}(H(x)) f(x; θ′) dx.

A Newton-like scheme for updating θ:

θ_{k+1} = θ_k + α_k ( Var_{θ_k}[T(X)] + εI )^{−1} ( E_{p(·;θ_k)}[T(X)] − E_{θ_k}[T(X)] ),   α_k > 0, ε > 0.

12 / 41


Parameter updating

A Newton-like scheme for updating θ

θ_{k+1} = θ_k + α_k ( Var_{θ_k}[T(X)] + εI )^{−1} ( E_{p(·;θ_k)}[T(X)] − E_{θ_k}[T(X)] ),   α_k > 0, ε > 0.

Var_{θ_k}[T(X)] = E[ (∇_θ ln f(X; θ_k))² ] is the Fisher information matrix, leading to the following facts:

Var_{θ_k}[T(X)]^{−1} is the minimum-variance step size in stochastic approximation.

Var_{θ_k}[T(X)]^{−1} adapts the gradient step to our belief about promising regions. (Think about T(X) = X; see the note below.)

Var_{θ_k}[T(X)]^{−1} ∇_θ l(θ; θ_k)|_{θ=θ_k} is the gradient of l(θ; θ_k) on the statistical manifold equipped with the Fisher metric.
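To make the second point concrete, consider an illustrative special case (an assumption for exposition only): let f(·; θ) be a Gaussian family with known covariance Σ, parameterized through its mean μ, so that T(X) = X and Var_{θ_k}[T(X)] = Σ. The update then reads

\[
\theta_{k+1} = \theta_k + \alpha_k\,(\Sigma + \epsilon I)^{-1}\big(\mathrm{E}_{p_k}[X] - \mu_k\big),
\]

i.e., the natural parameter (and hence the mean, since μ = Σθ) is pulled toward the S-weighted sample mean E_{p_k}[X], with the step shaped by the spread of the current sampling distribution.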

13 / 41


Main algorithm: GASS

Gradient-based Adaptive Stochastic Search (GASS)
1) Initialization: set k = 0.
2) Sampling: draw samples x_k^i ~ f(x; θ_k) i.i.d., i = 1, 2, ..., N_k.
3) Updating: update the parameter θ according to
   θ_{k+1} = θ_k + α_k ( Var_{θ_k}[T(X)] + εI )^{−1} ( E_{p_k}[T(X)] − E_{θ_k}[T(X)] ),
   where Var_{θ_k}[T(X)] and E_{p_k}[T(X)] are estimates using the samples {x_k^i, i = 1, ..., N_k}.
4) Stopping: if some stopping criterion is satisfied, stop and return the current best sampled solution; else, set k := k + 1 and go back to step 2).
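The following is a minimal, illustrative sketch of these steps in Python for a 1-D problem. It assumes a Gaussian sampling family with sufficient statistic T(x) = (x, x²), an exponential shape function S(y) = exp(y/λ), and arbitrary default constants; none of these specific choices come from the slides, and the sketch is meant to show the flow of steps 1)-4), not to reproduce the paper's implementation.

```python
import numpy as np

def gass_maximize(H, mu0=0.0, sigma0=5.0, Nk=200, iters=300,
                  lam=5.0, eps=1e-6, alpha0=2.0, alpha_decay=0.6, seed=0):
    """Minimal 1-D GASS sketch: Gaussian sampling family, T(x) = (x, x^2)."""
    rng = np.random.default_rng(seed)
    mu, var = mu0, sigma0 ** 2
    theta = np.array([mu / var, -0.5 / var])               # natural parameters of N(mu, var)
    best_x, best_val = mu0, -np.inf
    for k in range(iters):
        alpha = alpha0 / (k + 1) ** alpha_decay            # step size alpha_k
        x = rng.normal(mu, np.sqrt(var), size=Nk)          # step 2: x_k^i ~ f(.; theta_k)
        vals = H(x)
        if vals.max() > best_val:                          # track best sampled solution
            best_x, best_val = x[np.argmax(vals)], vals.max()
        w = np.exp((vals - vals.max()) / lam)              # shape function S(y) = exp(y/lam) (assumed)
        w /= w.sum()                                       # self-normalized weights defining p_k
        T = np.stack([x, x ** 2], axis=1)                  # sufficient statistic T(x)
        E_p = w @ T                                        # estimate of E_{p_k}[T(X)]
        E_th = T.mean(axis=0)                              # estimate of E_{theta_k}[T(X)]
        V = np.cov(T.T, bias=True) + eps * np.eye(2)       # estimate of Var_{theta_k}[T(X)] + eps*I
        theta = theta + alpha * np.linalg.solve(V, E_p - E_th)   # step 3: Newton-like update
        theta[1] = min(theta[1], -1e-6)                    # keep the variance parameter valid
        var = -0.5 / theta[1]
        mu = theta[0] * var
    return best_x, best_val                                # step 4: return best sampled solution

if __name__ == "__main__":
    H = lambda x: -0.1 * x ** 2 + np.sin(3.0 * x)          # toy multimodal objective
    print(gass_maximize(H))
```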

14 / 41

Accelerated algorithm: GASS_avg

GASS can be viewed as a stochastic approximation algorithm for finding θ*.

Accelerated GASS: use Polyak averaging with online feedback

θ_{k+1} = θ_k + α_k ( Var_{θ_k}[T(X)] + εI )^{−1} ( E_{p_k}[T(X)] − E_{θ_k}[T(X)] ) + α_k c ( θ̄_k − θ_k ),

where θ̄_k = (1/k) Σ_{i=1}^{k} θ_i is the average of the iterates.
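In the sketch given after the previous slide, this amounts to keeping a running average of the iterates and adding the feedback term; the snippet below is an illustrative continuation of that sketch (c is an assumed constant, and theta_bar should be initialized to theta before the loop):

```python
# inside the loop of the earlier sketch: Polyak averaging with online feedback
theta_bar = theta_bar + (theta - theta_bar) / (k + 1)               # running average of the iterates
step = np.linalg.solve(V, E_p - E_th) + c * (theta_bar - theta)     # feedback toward the average
theta = theta + alpha * step                                        # replaces the plain update
```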

15 / 41


Convergence analysis

The updating of θ can be rewritten in the form of generalized Robbins-Monro iterates:

θ_{k+1} = θ_k + α_k [ D(θ_k) + b_k + ξ_k ],

where D(θ_k) is the gradient field, b_k is the bias term, and ξ_k is the noise term.

D(θ_k) = ( Var_{θ_k}[T(X)] + εI )^{−1} ∇_θ l(θ; θ_k)|_{θ=θ_k}.

It can be viewed as a noisy discretization of the ordinary differential equation (ODE)

θ̇_t = D(θ_t),   t ≥ 0.

16 / 41


Convergence analysis

θ_{k+1} = θ_k + α_k [ D(θ_k) + b_k + ξ_k ],

θ̇_t = D(θ_t),   t ≥ 0.

Assumption: α_k ↘ 0 as k → ∞, and Σ_{k=0}^{∞} α_k = ∞.

Lemma 1: Under certain assumptions, b_k → 0 w.p.1 as k → ∞.

Lemma 2: Under certain assumptions, for any T > 0,

lim_{k→∞}  sup_{ {n: 0 ≤ Σ_{i=k}^{n−1} α_i ≤ T} }  ‖ Σ_{i=k}^{n} α_i ξ_i ‖  = 0,   w.p.1.

17 / 41


Convergence results

Theorem (Asymptotic Convergence). Assume that D(θ_t) is continuous with a unique integral curve and some regularity conditions hold. Then the sequence {θ_k} converges to a limit set of the ODE w.p.1. Furthermore, if the limit sets of the ODE are isolated equilibrium points, then w.p.1 {θ_k} converges to a unique equilibrium point.

Implication: GASS converges to a stationary point of l(θ; θ′).

18 / 41

Convergence results

Theorem (Asymptotic Convergence Rate). Let α_k = α_0/k^α for 0 < α < 1. For a given constant τ > 2α, let N_k = Θ(k^{τ−α}). Assume the sequence {θ_k} converges to a unique equilibrium point θ* w.p.1. If Assumptions 1, 2, and 3 hold, then

k^{τ/2} (θ_k − θ*) → N(0, Q M Qᵀ)   in distribution,

where Q is an orthogonal matrix such that Qᵀ(−J_L(θ*))Q = Λ with Λ a diagonal matrix, and the (i, j)th entry of the matrix M is given by M_{(i,j)} = (Qᵀ Φ Σ Φᵀ Q)_{(i,j)} (Λ_{(i,i)} + Λ_{(j,j)})^{−1}.

Implication: The asymptotic convergence rate of GASS is O(1/√(k^τ)).

19 / 41

Numerical results

[Figure: Comparison of average performance of GASS, GASS_avg, MRAS, and the modified CE; function value vs. total sample size, with the true optimum marked, on the 2-D Dejong5, 4-D Shekel, 50-D Powell, and 10-D Rosenbrock functions.]

20 / 41

Numerical results

[Figure: Comparison of average performance of GASS, GASS_avg, MRAS, and the modified CE; function value vs. total sample size on the 50-D Griewank, 50-D Trigonometric, 50-D Pinter, and 50-D Sphere functions.]

21 / 41

Numerical results

[Figure: Average performance of GASS and GASS_avg on 200-dimensional benchmark problems; function value vs. total sample size on the 200-D Griewank, Trigonometric, Levy, and Sphere functions.]

22 / 41

Numerical results

GASS_avg and GASS find the ε-optimal solutions in all the runs for 7 out of the 8 benchmark problems (all except the Shekel function).

Accuracy: GASS_avg and GASS find better solutions than the modified CE method (Rubinstein 1998) on badly-scaled functions, are comparable to it on multi-modal functions, and outperform Model Reference Adaptive Search (Hu et al. 2007) on all the problems.

Convergence speed: GASS_avg always converges faster than GASS; both are faster than MRAS on all the problems and faster than the modified CE on most problems.

23 / 41


Resource allocation in communication networks

Q users may transmit or receive signals using N carriers, under a power budget B_q for the qth user. The objective is to maximize the total transmission rate (sum-rate) by optimally allocating each user's power resource to the carriers.

24 / 41

Resource allocation in communication networks

max_{p_q(k), ∀q, ∀k}   Σ_{q=1}^{Q} Σ_{k=1}^{N} log( 1 + |H_qq|² p_q(k) / ( N_0 + Σ_{r=1, r≠q}^{Q} |H_rq(k)|² p_r(k) ) )

subject to:
  Σ_{k=1}^{N} p_q(k) ≤ B_q,   q = 1, ..., Q,
  p_q(k) ≥ 0,   q = 1, ..., Q,  k = 1, ..., N.

The sampling distribution f(·; θ) is chosen to be the Dirichlet distribution, whose support is a multi-dimensional simplex.
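A small illustrative sketch of how candidate allocations could be drawn and scored under this choice (the function names, the parameterization of the channel gains, and the extra "slack" Dirichlet component used to handle the budget inequality are all assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def sample_allocation(alpha, B, rng):
    """Draw one candidate power allocation p (Q x N) from per-user Dirichlet models.

    alpha: (Q, N+1) Dirichlet parameters; the extra component is a slack slot so that
    sum_k p_q(k) <= B_q.  B: (Q,) power budgets.  Illustrative sketch only.
    """
    frac = np.stack([rng.dirichlet(a) for a in alpha])      # each row lies on a simplex
    return B[:, None] * frac[:, :-1]                        # drop the slack slot, scale by budget

def sum_rate(p, Hdirect, Hcross, N0):
    """Sum-rate objective of the slide: p (Q, N), Hdirect (Q,), Hcross (Q, Q, N)."""
    Q, N = p.shape
    total = 0.0
    for q in range(Q):
        for k in range(N):
            interference = sum(abs(Hcross[r, q, k]) ** 2 * p[r, k]
                               for r in range(Q) if r != q)
            total += np.log(1.0 + abs(Hdirect[q]) ** 2 * p[q, k] / (N0 + interference))
    return total
```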

25 / 41


Resource allocation in communication networks

Figure: Numerical results on resource allocation in communication networks. IWFA, DDPA, MADP, and GPA are distributed algorithms. The other algorithms are multi-start runs of NEOS Solvers: http://neos-server.org/neos/.

26 / 41

Discrete optimization

X is a discrete set.

Discrete-GASS: use a discrete distribution. Sampling is easy, but the parameter is of high dimension.

Annealing-GASS: use a Boltzmann distribution. The parameter is always of dimension 1, but sampling (by MCMC) is more expensive and inexact. Annealing-GASS converges to the set of optimal solutions in probability.

27 / 41


Numerical results

|X| ≈ 10^6 (Shekel), 10^16 (Rosenbrock), 10^80 (others).

[Figure: Average performance of Discrete-GASS, Annealing-GASS, MRAS, SA (geometric temperature), and SA (logarithmic temperature); function value vs. total sample size on the 4-D Shekel, 50-D Powell, 10-D Rosenbrock, and 50-D Griewank functions.]

28 / 41

Numerical results

[Figure: Average performance of Discrete-GASS, Annealing-GASS, MRAS, SA (geometric temperature), and SA (logarithmic temperature); function value vs. total sample size on the 50-D Rastrigin, 50-D Levy, 50-D Pinter, and 50-D Sphere functions.]

29 / 41

Numerical results

Discrete-GASS outperforms MRAS in both accuracy and convergence rate.

The Annealing-GASS algorithm is an improvement over multi-start simulated annealing with geometric and logarithmic temperature schedules.

Discrete-GASS provides accurate solutions in most of the problems; Annealing-GASS yields accurate solutions only in the low-dimensional problem and the badly-scaled problems.

Discrete-GASS usually needs more computation time per iteration than Annealing-GASS, but needs fewer iterations to converge.

30 / 41


Implementation & Software

Most tuning parameters can be set to their defaults; the step size {α_k} needs to be chosen carefully.

Choice of sampling distribution:
  X is a continuous set: (truncated) Gaussian
  X is a simplex (with or without interior): Dirichlet
  X is a discrete set: discrete, Boltzmann

Software available at http://enluzhou.gatech.edu/software.html

31 / 41


Outline

1 Introduction

2 GASS for non-differentiable optimization

3 GASS for simulation optimization

4 Conclusions

32 / 41

Simulation optimization: introduction

Simulation optimization:

max_{x∈X} H(x) ≜ E_{ξ_x}[ h(x, ξ_x) ].

Computer simulation of a complex system:

[Diagram: input x → computer simulation → output y, observed as a noisy evaluation h(x, ξ_x)]

Example: a queueing system (x: service rate; H: waiting time + staffing cost; ξ_x: arrival/service times).

X is a continuous set.
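For instance, a toy single-server queue along these lines could provide one noisy evaluation h(x, ξ_x) per simulation run; everything below (horizon, arrival rate, cost weights) is an illustrative assumption, not taken from the slides:

```python
import numpy as np

def h_queue(service_rate, rng, customers=200, arrival_rate=1.0, staff_cost=0.5):
    """One noisy evaluation h(x, xi): negative (average wait + staffing cost) for an
    M/M/1-style FIFO queue, so that larger is better (toy model for illustration)."""
    arrivals = np.cumsum(rng.exponential(1.0 / arrival_rate, customers))  # arrival times
    depart, total_wait = 0.0, 0.0
    for a in arrivals:
        start = max(a, depart)                               # wait until the server is free
        total_wait += start - a
        depart = start + rng.exponential(1.0 / service_rate)
    return -(total_wait / customers + staff_cost * service_rate)
```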

33 / 41


Simulation optimization: introduction

Main solution methods:
  Ranking & selection (for problems with a finite solution space)
  Stochastic approximation
  Response surface methods
  Sample average approximation
  Stochastic search methods

34 / 41

GASS for simulation optimization

Gradient-based Adaptive Stochastic Search (GASS)
1) Initialization.
2) Sampling: draw samples x_k^i ~ f(x; θ_k) i.i.d., i = 1, 2, ..., N_k.
3) Estimation: simulate each x_k^i for M_k times; estimate
   Ĥ(x_k^i) = (1/M_k) Σ_{j=1}^{M_k} h(x_k^i, ξ_k^{i,j}).
4) Updating: update the parameter θ according to
   θ_{k+1} = θ_k + α_k ( Var_{θ_k}[T(X)] + εI )^{−1} ( E_{p_k}[T(X)] − E_{θ_k}[T(X)] ),
   where Var_{θ_k}[T(X)] and E_{p_k}[T(X)] are estimates using {x_k^i} and {Ĥ(x_k^i)}.
5) Stopping.
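Relative to the sketch given earlier for the exact-evaluation case, the only change is the estimation step: each sampled x_k^i is simulated M_k times and the sample average replaces the exact H(x_k^i). Illustratively (relying on numpy as np from the earlier sketch, with `simulate` an assumed user-supplied function such as `h_queue` above):

```python
def H_bar(simulate, x, M, rng):
    """Estimate H(x) by averaging M independent simulation runs h(x, xi)."""
    return np.mean([simulate(x, rng) for _ in range(M)])

# in the sampling loop of the earlier sketch:
#   vals = np.array([H_bar(h_queue, xi, Mk, rng) for xi in x])
```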

35 / 41

Two-timescale GASS

Motivated by two-timescale stochastic approximation (Borkar 1997):

Two-timescale GASS (GASS_2T)
Assume α_k → 0, β_k → 0, β_k/α_k → 0.
  Draw samples x_k^i ~ f(x; θ_k) i.i.d., i = 1, ..., N, and carry out a computer simulation for each x_k^i once.
  Update the gradient and Hessian estimates in GASS on the fast timescale with step size α_k.
  Update θ_k on the slow timescale with step size β_k.

Intuition: the sampling distribution can be viewed as fixed while the gradient and Hessian estimates are updated over many iterations, so only a small sample size N is needed.
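One way to realize this numerically (a sketch under my own assumptions about the recursions, not necessarily the paper's exact update rules): smooth the gradient and variance estimates with the fast step size α_k, and move θ with the slow step size β_k. The quantities g_hat and V_hat persist across iterations and are initialized before the loop; other names are from the earlier sketch.

```python
# two-timescale sketch (illustrative), inside the loop of the earlier sketch:
g_hat = (1 - alpha) * g_hat + alpha * (E_p - E_th)       # fast timescale: gradient estimate
V_hat = (1 - alpha) * V_hat + alpha * V                  # fast timescale: variance/Hessian estimate
theta = theta + beta * np.linalg.solve(V_hat + eps * np.eye(len(theta)), g_hat)  # slow timescale
```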

36 / 41


Numerical results

[Figure: Average performance of GASS, GASS_2T, and CEOCBA (He et al. 2010) on problems with independent noise; function value vs. number of function evaluations on the 10-D Powell, 10-D Trigonometric, 10-D Pinter, and 10-D Rastrigin functions.]

37 / 41


Outline

1 Introduction

2 GASS for non-differentiable optimization

3 GASS for simulation optimization

4 Conclusions

38 / 41

Conclusions

By reformulating a hard optimization problem into a differentiable one, we can incorporate direct gradient search into stochastic search.

A class of gradient-based adaptive stochastic search (GASS) algorithms for non-differentiable optimization, black-box optimization, and simulation optimization problems.

Convergence results and numerical results show that GASS is a promising and competitive method.

39 / 41


References

E. Zhou and J. Hu, “Gradient-based Adaptive Stochastic Search for Non-differentiable Optimization”, IEEE Transactions on Automatic Control, 59(7), pp. 1818-1832, 2014.

E. Zhou and S. Bhatnagar, “Gradient-based Adaptive Stochastic Search for Simulation Optimization over Continuous Space”, submitted.

40 / 41

Thank you!

41 / 41