

JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 91, No. 3, pp. 585-615, DECEMBER 1996

Stochastic Comparison Algorithm for Continuous Optimization with Estimation

G. BAO² AND C. G. CASSANDRAS³

Communicated by Y. C. Ho

Abstract. The problem of stochastic optimization for arbitrary objective functions presents a dual challenge. First, one needs to repeatedly estimate the objective function; when no closed-form expression is available, this is only possible through simulation. Second, one has to face the possibility of determining local, rather than global, optima. In this paper, we show how the stochastic comparison approach recently proposed in Ref. 1 for discrete optimization can be used in continuous optimization. We prove that the continuous stochastic comparison algorithm converges to an ε-neighborhood of the global optimum for any ε > 0. Several applications of this approach to problems with different features are provided and compared to simulated annealing and gradient-descent algorithms.

Key Words. Stochastic optimization, simulation, estimation, stochastic comparison, simulated annealing.

1. Introduction

Stochastic optimization is an area of obvious importance and one that presents major challenges from both the theoretical and practical points of view. The proliferation of discrete-event systems (DES), for instance, has given rise to problems involving the design and analysis of highly complex

¹This work was supported in part by the National Science Foundation under Grants EID-92-12122 and ECS-88-01912, and by a grant from United Technologies/Otis Elevator Company.

²Graduate Student, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, Massachusetts. Currently, Senior Engineer, Qualcomm Incorporated, San Diego, California.

³Professor, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, Massachusetts.

585 | 0022-3239/96/1200-0585$09.50/0 © 1996 Plenum Publishing Corporation


stochastic systems such as computer/communication networks, flexible manufacturing systems, and transportation systems, where objective functions to be optimized often cannot be expressed in closed form. Therefore, each value of the objective function can only be estimated through simulation or through direct observation of actual data.

Stochastic optimization problems can generally be classified into two categories: in discrete optimization, the objective function depends on the elements of a discrete set (e.g., a subset of the integers), whereas in continuous optimization the objective function depends on a continuous parameter vector. In this paper, we consider only the continuous optimization problem. In solving this problem, one common approach is based on estimated gradient information which drives the optimization process to a minimal point. However, this approach can easily lead to a local minimum if the objective function has multiple local minima. To overcome this problem, it is necessary to allow the optimization process to occasionally move to a bad neighboring point so as to provide the opportunity to jump out of a local minimum. To accomplish this, several algorithms have been proposed based on a variety of random search schemes (e.g., see Refs. 2-3). Simulated annealing (SA) is one such algorithm which has been successfully applied to solve some practical problems (Refs. 4-5). However, SA usually requires the accurate evaluation of objective function values. Furthermore, to our knowledge, there is no theoretical analysis for the SA algorithm applied to stochastic optimization problems, where the objective function needs to be repeatedly estimated. Experimental evidence does show SA to work in some problems with Monte Carlo estimates of the objective function at each iteration. However, (a) very long simulation times are required to get accurate estimates, and (b) the optimization process usually converges extremely slowly. One of the key reasons for this is that each optimization search is mainly concentrated on the neighbors of the current point. If the neighboring area is small, it is hard to jump out of a local minimum. If the neighboring area is large, an occasional very bad move is always possible, which makes the optimization process inherently inefficient as far as fast convergence is concerned. Moreover, the SA algorithm is not intended to solve problems in a setting where the estimates used have a large variance, which is often the case in practice.

To overcome these problems in discrete optimization, Gong et al. recently developed a new algorithm referred to as the stochastic comparison (SC) algorithm (Ref. 1), which was in turn inspired by the stochastic ruler (SR) algorithm introduced by Yan and Mukai (Ref. 6). The SC algorithm developed by Gong et al. is intended to solve discrete optimization problems with an objective function estimated through Monte Carlo simulations. This algorithm has been shown to be very efficient in solving discrete optimization problems with estimation. First, it is a global search algorithm. In addition,


it replaces the cooling strategy used in SA by one that is efficiently coupled to the estimation noise involved: Instead of an artificial cooling function of the form exp(−Δ/T) in SA (further discussed in the next section), the SC algorithm increases the number of estimated objective function comparisons gradually before each move is made.

Our main contribution in this paper is to propose an SC algorithm for continuous stochastic optimization problems and to provide a proof of convergence to an ε-neighborhood of the global optimum for arbitrary ε > 0. We subsequently provide some applications of the proposed continuous SC algorithm to three different types of problems in order to highlight the advantages and the limitations of this approach and to compare it to simulated annealing and gradient-descent algorithms.

In Section 2, we will set up the continuous stochastic optimization problem and briefly describe the gradient-descent and SA approaches. In Section 3, we review the discrete stochastic comparison (DSC) algorithm and some of its properties which will be used in our main result. In Section 4, we introduce the continuous stochastic comparison (CSC) algorithm and provide our main result. In Section 5, we present three problems where the CSC approach is used and compare our results to simulated annealing and gradient-based algorithms. This allows us to make a more systematic comparison between these different approaches, describe their relative advantages and limitations, and identify classes of problems for which each may be suitable.

2. Stochastic Optimization Problem Setup

For a stochastic system, let L(θ, ω) denote a sample performance function, where θ ∈ Θ is a continuous parameter vector and ω ∈ Ω is a sample point. For each given θ, let E[L(θ, ω)] denote the mean of L(θ, ω). For complex stochastic systems, a closed-form expression for E[L(θ, ω)] in terms of the parameter θ is usually unavailable. Therefore, for each given θ, E[L(θ, ω)] has to be estimated through simulation or direct observation of the system. For simplicity, set

g(θ) = E[L(θ, ω)],

and let ĝ(θ, ω) represent an estimate of g(θ) based on a sample path ω. Let

W(θ) = ĝ(θ) − g(θ)

be the estimation error (we have dropped co for simplicity), which is assumed to form an iid sequence over all estimation points and to have a symmetric


pdf. As we will see in Section 5, this assumption may be violated in some practical problems of interest, yet the key properties of the CSC algorithm are preserved in the problems considered in this paper.

By repeated estimations ĝ(θ) of g(θ) based on simulation or direct observation under different values of θ, we are interested in finding an optimal point θ ∈ Θ* so that g(θ) is minimized, where Θ* is defined as the set

Θ* = {θ ∈ Θ | g(θ) ≤ g(θ′), ∀θ′ ∈ Θ}. (1)

For this optimization problem, there are two major approaches. The first approach is based on using estimated derivative information of g(θ) to drive an iterative process to a minimal point. The simplest and most common derivative estimation algorithm is the so-called brute force (BF) method. For example, letting g′(θ) denote the derivative of g(θ), a first-derivative estimator under this method is of the form

[g′(θ)]BF-est = (1/Δθ)[ĝ(θ + Δθ) − ĝ(θ)]. (2)

The BF estimator is generally biased, since Δθ is a finite quantity and the ratio above is an approximation of the true derivative. Further, an appropriate choice of Δθ is very difficult to make. On the one hand, if Δθ is too large, the estimate gives a bad approximation of the true derivative due to the generally nonlinear nature of the performance function. On the other hand, if Δθ is too small, the estimate faces a numerical stability problem, since the random fluctuation of the values of ĝ(θ) may dominate the final estimation result and since the expression above becomes a ratio of two very small numbers. However, the BF estimator is the most general estimator, since it does not depend on any specific structure of the stochastic system under study.
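The bias/variance tradeoff in the choice of Δθ can be sketched numerically (this is an illustration only, not from the paper; the quadratic test function, noise level, and Δθ values are all invented):

```python
import random

def bf_derivative(g_hat, theta, dtheta):
    # Brute-force estimate (2): [g_hat(theta + dtheta) - g_hat(theta)] / dtheta
    return (g_hat(theta + dtheta) - g_hat(theta)) / dtheta

# Hypothetical noisy performance estimate: g(theta) = theta^2, so g'(1) = 2.
def g_hat(theta):
    return theta ** 2 + random.uniform(-0.01, 0.01)

random.seed(0)
coarse = [bf_derivative(g_hat, 1.0, 0.5) for _ in range(100)]   # biased, low variance
fine = [bf_derivative(g_hat, 1.0, 1e-3) for _ in range(100)]    # nearly unbiased, high variance

coarse_spread = max(coarse) - min(coarse)
fine_spread = max(fine) - min(fine)
```

With Δθ = 0.5, the curvature of θ² shifts every estimate toward 2.5 (a bias of 0.5), but the noise contribution is tiny; with Δθ = 10⁻³, the bias disappears while the noise-to-Δθ ratio makes individual estimates swing by tens of units.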

In the context of discrete event systems (DES), where this stochastic optimization problem is often encountered, new techniques such as perturbation analysis (PA, Refs. 7-8) and the likelihood ratio (LR, Refs. 9-11) method have been developed. In many cases, estimators based on these techniques are unbiased. For example, the infinitesimal perturbation analysis (IPA) estimator of the first derivative of g(θ), which is of the form

[g′(θ)]IPA-est = ∂L(θ, ω)/∂θ, (3)

can be shown to be unbiased over a class of DES that satisfy the commuting condition; see Refs. 7-8.

Once a first-derivative estimate is available, we can use various gradient-descent algorithms to adjust θᵢ, the value of θ at the ith iteration, by an


amount Δθᵢ as follows:

Δθᵢ = −k[∂g(θᵢ)/∂θ]est,

where k is an adjustable parameter. Gradient-descent algorithms have been widely used, including some recent applications to neural network computing (Ref. 12). When an estimate of the gradient is used, we usually add a factor to reduce oscillatory behavior as follows:

Δθᵢ = −(k/i^α)[∂g(θᵢ)/∂θ]est,

where i is the number of iterations and 0 < α < 1. However, this gradient-based approach has several drawbacks:

(a) We have to assume that g(θ) has only one minimum point to prevent the optimization process from falling into a local minimum.

(b) We have to assume that g′(θ) = 0 can only occur at the minimum point; otherwise, our optimization process may oscillate around a saddle point because of the estimation noise.

(c) Current derivative estimation techniques have many limitations. For example, as already mentioned, the brute force estimator is sensitive to the choice of Δθ and is generally biased. In the context of DES, PA and LR techniques also have their limitations. For example, it is well known that the LR estimator generally has high variance. On the other hand, beyond IPA, PA estimators depend on the structure of the system and have to be developed on a case-by-case basis.

(d) Regardless of the method used, to get a relatively accurate derivative estimate, either a long simulation run (or directly observed sample path) is needed or several repeated simulations have to be performed.
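The damped update Δθᵢ = −(k/i^α)[∂g(θᵢ)/∂θ]est can be sketched as follows (a minimal illustration; the convex test function, noise level, and parameter values k, α are invented, not taken from the paper):

```python
import random

def gradient_descent(grad_est, theta0, k=0.5, alpha=0.6, iters=500):
    # theta_{i+1} = theta_i - (k / i**alpha) * [dg/dtheta]_est, i = 1, 2, ...
    theta = theta0
    for i in range(1, iters + 1):
        theta -= (k / i ** alpha) * grad_est(theta)
    return theta

random.seed(1)
# Hypothetical single-minimum case: g(theta) = (theta - 3)^2 with a noisy gradient.
grad_est = lambda th: 2.0 * (th - 3.0) + random.uniform(-0.5, 0.5)
theta_final = gradient_descent(grad_est, theta0=0.0)
```

The decreasing step size k/i^α averages out the gradient noise, so the iterate settles near the minimizer θ = 3; with a constant step, it would keep oscillating with an amplitude set by the noise.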

Obviously, for a general function g(θ) which has multiple local minima, the derivative-driven optimization approach may ultimately settle down at such a local minimum. Therefore, even in the absence of estimation noise, several algorithms of the random search variety have been developed to overcome this problem. We limit ourselves here to a brief review of simulated annealing (SA), since it is the most thoroughly analyzed algorithm to date and it has been applied to some practical problems (Ref. 4). The idea of SA is very simple. In fact, it is very close to the derivative approach with one exception: it allows the optimization process to move to bad neighboring points (i.e., higher cost or uphill points) with a positive probability. In order to make sure that the optimization process will finally settle down at some


point close to the optimum, a control parameter called the "temperature" is used so that the probability of a bad move is gradually reduced to zero. This procedure is also referred to as cooling down. A typical SA algorithm that one might use for continuous parameter optimization with estimation is outlined as follows (see Ref. 5).

Step 1. Get an initial point θ.

Step 2. Get an initial temperature T > 0.

Step 3. Perform the following loop L times (L is a parameter of the algorithm).

Step 3a. Pick a random neighbor θ′ of θ in (θ − Δθ, θ + Δθ).

Step 3b. Let Δ = g(θ′) − g(θ).

Step 3c. If Δ < 0 (downhill move), set θ = θ′.

Step 3d. If Δ > 0 (uphill move), set θ = θ′ with probability exp(−Δ/T).

Step 4. Set T = rT (reduced temperature), where r, 0 < r < 1, is called the cooling ratio.

Step 5. Go back to Step 3 if a given stopping condition has not been met.
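The steps above can be sketched as follows (a minimal illustration, not the authors' code; the multimodal test function and all parameter values are invented, and g is evaluated exactly here rather than estimated):

```python
import math
import random

def simulated_annealing(g, theta, T, dtheta, L, r, sweeps):
    for _ in range(sweeps):                                  # Step 5: repeat until done
        for _ in range(L):                                   # Step 3: inner loop
            cand = theta + random.uniform(-dtheta, dtheta)   # Step 3a: random neighbor
            delta = g(cand) - g(theta)                       # Step 3b
            if delta < 0:                                    # Step 3c: downhill move
                theta = cand
            elif random.random() < math.exp(-delta / T):     # Step 3d: uphill move
                theta = cand
        T *= r                                               # Step 4: cool down
    return theta

# Hypothetical objective with many local minima; global minimum at theta = 0.
g = lambda th: th * th + 10.0 * (1.0 - math.cos(2.0 * math.pi * th))
random.seed(2)
best = simulated_annealing(g, theta=4.0, T=10.0, dtheta=1.0, L=50, r=0.9, sweeps=60)
```

Early on, the high temperature lets the process climb out of the local well at θ = 4; as T shrinks geometrically, uphill moves become rare and the trajectory freezes near a low-cost well.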

This algorithm converges very slowly. One of the major reasons is that the optimization search for lower-cost points is mainly concentrated on the neighbors of the current point. If the neighboring area is small, it is hard to jump out of a local minimum. If the neighboring area is large, an occasional very bad move is always possible, which makes it difficult to converge to the optimal point quickly. Furthermore, to our knowledge, there is little work reported on the analysis of convergence for the SA algorithm when the objective function has to be estimated through Monte Carlo simulation. For discrete optimization, Ref. 13 provides a proof of convergence under certain constraining conditions on the estimation noise.

Recently, motivated by the stochastic ruler (SR) algorithm (Ref. 6), Gong et al. proposed a scheme referred to as the stochastic comparison (SC) algorithm (Ref. 1). This algorithm is aimed at discrete optimization problems with an objective function estimated through Monte Carlo simulation. It is shown to be more efficient in solving discrete optimization problems with estimation, since it performs a global search for the optimal point. A key feature of this algorithm is that it couples the estimation noise with a "cooling" strategy as follows: Instead of cooling down the randomization of the search sequence by the artificial function exp(−Δ/T), the SC algorithm increases the number of estimated objective function comparisons gradually


before each move has to be made. In the next section, we give a brief review of this discrete optimization algorithm (see Ref. 1 for further details), as some results will be used in our study of the continuous SC algorithm presented in Section 4.

3. Review of the SC Algorithm for Discrete Optimization with Estimation

For a finite set S, assume that the objective function g(i), i ∈ S, can be estimated by ĝ(i). Let

W(i) = ĝ(i) − g(i)

be the estimation error, assumed to have a symmetric distribution. Our objective is to find some i ∈ S* which minimizes g(i), where

S* = {i ∈ S | g(i) ≤ g(j), ∀j ∈ S}.

To solve this discrete optimization problem, we first define a preselected probability matrix R(i, j), with i, j ∈ S. Here, R(i, j) stands for the probability that state j will be selected as the next state visited by the optimization process, given that the current state is i. For simplicity, we usually set

R(i, j) = r, ∀i, j ∈ S;

i.e., all next-state probabilities are equal, since we do not have any a priori information on the cost function for each possible state. We also let

R(i, j) = 0, for i = j.

Next, we define {Mk}, k = 1, 2, …, to be a nondecreasing integer sequence and assume that {Mk} increases logarithmically (a logarithmic increase is needed to guarantee convergence of the optimization process; see Ref. 1 for details). Here, Mk is the number of comparisons made between estimates of the current state and a candidate next state before the kth move is made. The SC algorithm proposed by Gong et al. then works as follows.

Discrete Stochastic Comparison (DSC) Algorithm.

Step 1. Initialize: X0 = i0, k = 0.

Step 2. For given Xk = i, choose a next candidate state Zk from S - {i} with probabilities

R(i, j), j ∈ S − {i}.


Step 3. For a chosen Zk = j, set

X_{k+1} = Zk, with probability pk,
X_{k+1} = Xk, with probability 1 − pk,

where

pk = {P[ĝ(j) < ĝ(i)]}^Mk;

the actual implementation of this step is described below.

Step 4. Replace k by k + 1, and go to Step 2.

Implementation of Step 3. The probability pk is actually not calculable, since we do not know the underlying probability functions. However, it is realizable in the following way: both ĝ(j) and ĝ(i) are estimated Mk times. If ĝ(j) < ĝ(i) every time, then we set X_{k+1} = Zk. Otherwise, we set X_{k+1} = Xk. This corresponds to Mk independent trials, where pk is the probability of Mk successes.
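The comparison test in Step 3 can be sketched as follows (a toy illustration with an invented three-state cost vector and uniform estimation noise; the states and values are not from the paper):

```python
import random

def dsc_step(g_hat, i, j, m_k):
    # Accept candidate j only if its estimate beats the current state's
    # estimate in every one of the m_k independent comparisons (Step 3).
    for _ in range(m_k):
        if not (g_hat(j) < g_hat(i)):
            return i              # a single failed comparison: stay at X_k = i
    return j                      # m_k successes: move to X_{k+1} = j

g_true = [5.0, 1.0, 3.0]          # hypothetical costs of states 0, 1, 2
g_hat = lambda s: g_true[s] + random.uniform(-0.5, 0.5)

random.seed(3)
next_state = dsc_step(g_hat, i=0, j=1, m_k=10)   # clearly better candidate
```

Here the noise intervals around g(0) = 5 and g(1) = 1 do not overlap, so all ten comparisons succeed and the move is accepted; when two costs are close, a larger m_k sharpens the acceptance probability pk = {P[ĝ(j) < ĝ(i)]}^m_k toward 0.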

As we can see, the process {Xk} under the SC algorithm is a time-inhomogeneous Markov chain. Its one-step state transition probabilities for a given testing number Mk are

Pij(Mk) = R(i, j){P[ĝ(j) < ĝ(i)]}^Mk, if j ≠ i,
Pij(Mk) = 1 − Σ_{s ≠ i} R(i, s){P[ĝ(s) < ĝ(i)]}^Mk, if j = i. (4)

The following theorem from Ref. 1 characterizes the Markov chain {Xk } and establishes convergence to a point in S*.

Theorem 3.1. The Markov chain {Xk} generated by the discrete stochastic comparison algorithm is strongly ergodic. Furthermore,

(i) lim_{k→∞} sup_{x0} ‖x(s, k) − e*‖ = 0,
(ii) lim_{k→∞} P[Xk ∈ S*] = 1,

where

x(s, k) = x0 P(s, k) = x0 ∏_{t=s}^{k−1} P(Mt),

x0 is the initial probability vector, and e* is the optimal probability vector.

For the purpose of completeness, we state below several lemmas which lead to the proof of Theorem 3.1. These results can also be found in Ref. 1.


We first let

z(M) = [z1(M), z2(M), …, zn(M)]

be a stationary probability vector for a Markov chain with transition probability matrix

P(M) = [Pij(M)].

Here, M is fixed and z(M) is called the quasi-stationary probability distribution of the time-inhomogeneous Markov chain {Xk}. We then have

z(M)P(M) = z(M) and Σ_{i∈S} zi(M) = 1.

If we assume that there is only one optimal point, which is state 1 (without loss of generality), that is, S* = {1}, the following lemma can be proved.

Lemma 3.1. For z(M) defined above, if S* = {1}, then

lim_{M→∞} [z1(M) z2(M) … zn(M)] = [1 0 … 0].

Further, for the nondecreasing sequence {Mk}, there exists an M* < ∞ such that, for Mk > M*, we have

zi(M_{k+1}) > zi(Mk), for i ∈ S̄*,

zi(M_{k+1}) < zi(Mk), for i ∉ S̄*,

where

S̄* = {i ∈ S | sp(i) ≥ sp(j), ∀j ∈ S},

sp(i) = Σ_{j∈S−{i}} P[ĝ(i) < ĝ(j)].

For the case where the optimal set S* contains more than one point, a similar result can be established.

From Lemma 3.1, the following two lemmas are proved (see Ref. 1 for details).

Lemma 3.2. The Markov chain generated by the SC algorithm is weakly ergodic.


Lemma 3.3. The probability vector z(M), defined by z(M)P(M) = z(M), satisfies

Σ_{k=0}^{∞} ‖z(M_{k+1}) − z(Mk)‖ < ∞.

By Lemma 3.2, Lemma 3.3, and Theorem V.4.3 in Ref. 14, it follows that the time-inhomogeneous Markov chain {Xk} is strongly ergodic and conclusions (i) and (ii) in Theorem 3.1 hold.

In the next section, we will extend the SC algorithm in order to solve continuous parameter optimization problems.

4. SC Algorithm for Continuous Optimization with Estimation

Let us return to the continuous optimization problem. Without loss of generality, we consider θ ∈ Θ to be a scalar. Assume that Θ is a bounded interval. Then, the SC scheme that we propose and analyze for the continuous optimization problem is as follows.

Continuous Stochastic Comparison (CSC) Algorithm.

Step 1. Initialize: X0 = s0, k = 0.

Step 2. For a given Xk = sk, choose the next candidate point Zk from Θ, with Zk uniformly distributed over Θ.

Step 3. For a chosen Zk = rk, set

X_{k+1} = Zk, with probability pk,
X_{k+1} = Xk, with probability 1 − pk,

where

pk = {P[ĝ(rk) < ĝ(sk)]}^Mk.

Step 4. Replace k by k + 1, and go to Step 2.

Remark 4.1. As in the DSC algorithm, the probability pk is actually not calculable, since we do not know the underlying probability functions. However, it is realizable in a way similar to the DSC case: both ĝ(rk) and ĝ(sk) are estimated Mk times. If ĝ(rk) < ĝ(sk) every time, we set X_{k+1} = Zk. Otherwise, we set X_{k+1} = Xk.
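A compact sketch of the CSC iteration, using the realization in Remark 4.1 (the bounded interval, noisy quadratic objective, and {Mk} schedule below are invented for illustration and are not the paper's test problems):

```python
import math
import random

def csc(g_hat, lo, hi, iters, m_of_k):
    x = random.uniform(lo, hi)                     # Step 1: initial point
    for k in range(1, iters + 1):
        z = random.uniform(lo, hi)                 # Step 2: uniform candidate over Theta
        # Step 3 (Remark 4.1): move only if z wins all M_k noisy comparisons.
        if all(g_hat(z) < g_hat(x) for _ in range(m_of_k(k))):
            x = z
    return x                                       # Step 4 is the loop itself

random.seed(4)
g_true = lambda th: (th - 2.0) ** 2                # hypothetical objective on [0, 10]
g_hat = lambda th: g_true(th) + random.uniform(-1.0, 1.0)
m_of_k = lambda k: 1 + int(math.log2(k))           # nondecreasing, logarithmic growth
x_final = csc(g_hat, 0.0, 10.0, iters=2000, m_of_k=m_of_k)
```

Note that candidates are drawn uniformly over the whole interval at every step, so the search is global; only the acceptance test, not the sampling, becomes stricter as Mk grows.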

We can see that the process {Xk} is a time-inhomogeneous Markov process with a continuous state space. Note that, when the SC algorithm is


applied to a continuous objective function, we cannot generally conclude that

lim_{k→∞} P[Xk ∈ Θ*] = 1,

for Θ* defined in (1), since Θ* could be a single point (the probability that the optimization process finally stays at a single point is trivially zero). Our approach, in this case, is to define an interval

Θε = {θ ∈ Θ | g(θ) ≤ g(θ′) + ε, ∀θ′ ∈ Θ},

called an ε-optimal interval. For ε > 0, |Θε| > 0. Before presenting our main result, we make two assumptions.

Assumption 4.1. g(θ) is a continuous function of θ, with θ ∈ Θ and |Θ| < ∞.

Assumption 4.2. For each estimate ĝ(θ) of g(θ), the error W(θ) = ĝ(θ) − g(θ) has a symmetric pdf.

Under these assumptions, we have the following theorem.

Theorem 4.1. For an objective function g(θ), under Assumptions 4.1 and 4.2, let {Xk} be the Markov process generated by the continuous stochastic comparison algorithm. Then,

lim_{k→∞} P[Xk ∈ Θε] = 1, for any ε > 0.

Proof. We first prove the case where g(θ) is a bounded function on Θ, that is, −∞ < c1 ≤ g(θ) ≤ c2 < ∞ for some constants c1, c2. Let

a = inf{g(θ)}, b = sup{g(θ)}.

We then divide (a, b) into N = 2⌈(b − a)/ε⌉ equal intervals. We index these intervals in the direction from a to b with i = 1, …, N. Therefore, the interval length is

δ = (b − a)/N ≤ ε/2.


Fig. 1. Functions g(θ) and h(θ) for the proof of Theorem 4.1.

Now, we define N sets

Θi = {θ ∈ Θ | a + (i − 1)δ ≤ g(θ) < a + iδ}, i = 1, 2, …, N.

Since δ ≤ ε/2, we have Θ1 ∪ Θ2 ⊆ Θε; see Fig. 1 for an illustration. Therefore,

P[Xk ∈ Θ1 ∪ Θ2] ≤ P[Xk ∈ Θε].

As we will see next, the sets Θi, i = 1, …, N, are needed for the purpose of constructing a new function which will serve as an objective function to which the results from the DSC scheme can be applied.

We now construct a new function h(θ) as follows (see also Fig. 1):

h(θ) = a + δ, for θ ∈ Θ1,
h(θ) = a + 2δ, for θ ∈ Θ2,
h(θ) = a + (i − 1)δ, for θ ∈ Θi, i ≥ 3.

Let the optimization sequence generated by applying our CSC algorithm with h(θ) as an objective function be {Yk}. Since, by construction,

h(θ) ≥ g(θ), for θ ∈ Θ1 ∪ Θ2,
h(θ) ≤ g(θ), for θ ∈ Θi, i ≥ 3,


we have

P[Yk ∈ Θ1] ≤ P[Yk ∈ Θ1 ∪ Θ2] ≤ P[Xk ∈ Θ1 ∪ Θ2] ≤ P[Xk ∈ Θε] ≤ 1, (5)

where

P[Yk ∈ Θ1 ∪ Θ2] ≤ P[Xk ∈ Θ1 ∪ Θ2]

follows from the previous inequality and the updating scheme in the CSC algorithm.

Next, we define a mapping from the continuous variable Zk (the candidate next state in the CSC algorithm) to an integer i ∈ {1, …, N}, using a mapping function p in the following way: p(Zk) = i, i = 1, …, N, if Zk ∈ Θi. This specifies a one-to-one mapping from the sequence {Yk} to {p(Yk)}; the latter is a sequence produced by the DSC algorithm for an underlying discrete state space {1, …, N} using probabilities R(i, j) = |Θj|/|Θ| in the discrete domain [with a minor modification to allow R(i, j) > 0 when i = j, which, clearly, will not affect convergence, since this case represents a step where no change occurs, as long as every state is reachable from any other state]. Thus, {p(Yk)} is the result of an optimization process where DSC is applied. By Theorem 3.1, we know that

lim_{k→∞} P[p(Yk) = 1] = lim_{k→∞} P[Yk ∈ Θ1] = 1.

From this equation and observing in (5) that

lim_{k→∞} P[Yk ∈ Θ1] ≤ lim_{k→∞} P[Xk ∈ Θε],

we have

lim_{k→∞} P[Xk ∈ Θε] = 1.

This establishes the result when g(θ) is bounded. For the case where g(θ) does not have an upper bound (but does have a lower bound), we can choose an upper bound b such that b > a and let h(θ) = b whenever g(θ) > b, for any θ ∈ Θ. The proof then follows by a similar argument as above. Finally, the case where g(θ) does not have a lower bound is meaningless, since we are trying to minimize this function. □

One of the key features of the CSC (as well as the DSC) algorithm is the fact that the iteration steps are based on the estimated order of two objective function values, g(θ) and g(θ′), not their cardinal values. Thus, we exploit the fact that order statistics are generally very robust with respect


to estimation noise, an idea also exploited in Ref. 15. In other words, estimating the order of two variables is a much simpler problem than estimating their actual values. It follows that good moves (i.e., moves that reduce cost) can be made based on relatively poor estimates. This, in turn, implies that short simulation runs are adequate, which means that one can expect the CSC algorithm to converge quickly in general. In fact, in practice, one can see that the algorithm rapidly identifies points that are close to optimal (see next section).

In the next section, we provide three applications of the CSC algorithm to different types of stochastic optimization problems. Based on our observations, the CSC algorithm usually works very well in the initial stage of the optimization process. However, as g(Xk) approaches the optimal value and as Mk grows large, the chance that ĝ(Zk) < ĝ(Xk) Mk times at Step 3 of the CSC algorithm becomes very small, even if g(Zk) < g(Xk). We will see in the examples of the next section that the optimization trajectory tends to stay at a state close to optimal for a very long time. This seems to be a key feature of the CSC scheme: points close to the global optimum are quickly identified, but getting to the actual optimum may take a very long time. Thus, in order to further speed up the optimization process, one idea that we suggest is to reduce the search space when Xk does not change for a sufficiently long time. In this way, we can increase the chance that Zk is selected over Xk if indeed g(Zk) < g(Xk). By using this heuristic, we may not be able to reach the global minimum if it turns out that we are currently close to a local minimum. However, since the trajectory does not change for a long time, the difference between local and global minimum will be very small. Further, the probability that we are really close to the global optimum is also very high.
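The search-space-reduction heuristic suggested above might be realized as follows (one plausible sketch, not the authors' implementation; the stall threshold, shrink factor, and quadratic test problem are invented):

```python
import math
import random

def csc_shrinking(g_hat, lo, hi, iters, m_of_k, patience=100, shrink=0.5):
    x = random.uniform(lo, hi)
    stall = 0                                  # iterations since X_k last moved
    for k in range(1, iters + 1):
        z = random.uniform(lo, hi)
        if all(g_hat(z) < g_hat(x) for _ in range(m_of_k(k))):
            x, stall = z, 0
        else:
            stall += 1
        if stall >= patience:                  # X_k unchanged for a long time:
            half = shrink * (hi - lo) / 2.0    # shrink the interval around X_k
            lo, hi = max(lo, x - half), min(hi, x + half)
            stall = 0
    return x

random.seed(5)
g_hat = lambda th: (th - 2.0) ** 2 + random.uniform(-1.0, 1.0)   # noisy quadratic
m_of_k = lambda k: 1 + int(math.log2(k))
x_final = csc_shrinking(g_hat, 0.0, 10.0, iters=2000, m_of_k=m_of_k)
```

Shrinking the interval around a long-unchanged Xk concentrates candidates near it, raising the chance that a genuinely better Zk survives all Mk comparisons, at the cost of possibly excluding a distant global minimum.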

5. Applications and Comparison with Other Schemes

In this section, we apply the CSC algorithm to three types of stochastic optimization problems with different features. In the first problem, the objective function is selected to have multiple local minima. In the second, there is a single minimum in a convex objective function, so that gradient-based algorithms can be used and compared to the CSC scheme. Finally, we consider the application of CSC to a class of routing problems and compare it again to gradient-based schemes. In this case, we have a constrained optimization problem over a vector of controllable parameters.

5.1. Stochastic Optimization with Multiple Minima. In this example, we consider a purely artificial function and formulate a problem that satisfies


the conditions of Theorem 4.1. In particular, we consider

g(θ) = 100[2 + sin(θ − 1.57) − exp(−0.1θ)],

and let W(θ) be a uniformly distributed random variable over the interval [−v, v]. We will consider two cases: v = 60 (moderate noise) and v = 100 (large noise). Figure 2 shows the function g(θ). Note that there are several minima, with a global minimum at θ* = 0. Clearly, a gradient-based approach in this case can easily be trapped at a local minimum with a value greater than 40. Moreover, the noise is significant, so as to make estimates based on a single observation quite unreliable in a cardinal sense. We emphasize again, however, that the CSC scheme is driven by the relative order of estimates at two different values of θ.
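Assuming the bracketed reading of the printed formula above (which makes the global minimum g(0) ≈ 0, matching the trajectories below), the test function and its noisy estimates are straightforward to code (function names are ours):

```python
import math
import random

def g(theta):
    # Multimodal test function of Section 5.1: global minimum g(0) ~ 0,
    # local minima whose values grow as theta increases (all above 40
    # once exp(-0.1*theta) is small enough).
    return 100.0 * (2.0 + math.sin(theta - 1.57) - math.exp(-0.1 * theta))

def g_noisy(theta, v=60.0):
    # One estimate of g(theta): the true value plus uniform noise on [-v, v].
    return g(theta) + random.uniform(-v, v)
```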

Fig. 2. Function g(θ) with multiple minima (cost vs. parameter value).


The number of estimate comparisons Mk used in every iteration of the CSC algorithm is chosen to be of the form

Mk = 1 + ⌊log_s(k/c)⌋

for the kth iteration, where s and c are adjustable parameters. For implementation purposes, we adopt an algorithm similar to the exponential backoff algorithm widely used in CSMA/CD on Ethernet. Specifically, we define Mk as follows:

Mk = 1,  for 1 ≤ k < b,
Mk = i,  for b(q^(i−1) − 1)/(q − 1) ≤ k < b(q^i − 1)/(q − 1) and i > 1.

Although this definition may seem complex, it is actually easily implemented. That is,

Mk = 1, for b iterations,
Mk = 2, for bq iterations,

and, in general,

Mk = j, for bq^(j−1) iterations.
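The schedule above is easy to implement; the sketch below (function name ours) uses the values b = 100 and q = 2 that appear in Fig. 3:

```python
def M(k, b=100, q=2):
    # Exponential-backoff comparison schedule: M_k = 1 for the first block
    # of b iterations, 2 for the next b*q, and in general j for b*q**(j-1)
    # iterations, so M_k grows only logarithmically in k.
    if k < b:
        return 1
    i = 2
    # Block i covers b(q**(i-1) - 1)/(q - 1) <= k < b(q**i - 1)/(q - 1).
    while k >= b * (q ** i - 1) // (q - 1):
        i += 1
    return i
```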

It can be shown that Mk increases logarithmically. Figure 3 shows a typical optimization trajectory obtained, where one can see that a value very close to the optimum (i.e., zero) is attained after about 800 estimate comparisons. Note that we plot the estimated cost as a function of comparisons, rather than algorithm iterations; this is done because the number of comparisons Mk varies over iterations. Thus, the total number of comparisons made is a more accurate measure of the time needed to converge. It also allows a fair comparison with the SA algorithm discussed next. An additional optimization trajectory obtained for a large-noise case may be found in Ref. 16, indicating that the CSC algorithm is not very sensitive to large estimate variances, a feature generally associated with ordinal optimization; see, e.g., Ref. 15.

For comparison purposes, we also used a simulated annealing (SA) algorithm for this problem. Figure 4 shows the case where v = 60 (moderate noise). The optimization trajectory periodically reaches the global minimum, but never converges to it. For a larger noise case (v = 100) and parameters L, T, r empirically selected to provide the best possible convergence observed, numerical results included in Ref. 16 show that the SA algorithm may not converge to the global minimum; instead, it is common to see that the algorithm is trapped at some local minimum. We should point out that one should not attempt to draw general conclusions on the relative performance of the SA and CSC algorithms. The main goal of this example


Fig. 3. Optimization trajectory using the CSC algorithm: v = 60, b = 100, q = 2 (estimated cost vs. number of comparisons).

is to illustrate the fact that the SA algorithm can be trapped in a local minimum, whereas the CSC algorithm avoids this problem. A second observation is the relative robustness of a stochastic comparison approach with respect to estimation noise, a point already made in Section 4.

5.2. Optimal Admission Control Problem. Here, we investigate a scalar optimization problem which is commonly encountered in the control of queueing systems and which involves estimation of a cost function through simulation or direct observation of a sample path. We consider a single-server queueing system with infinite queueing capacity operating under a first-in-first-out discipline. The arrival rate is λ and the service rate is μ. Admission control is applied in the following way: when a customer arrives, he is admitted with probability p and rejected with probability 1 − p. The


Fig. 4. Optimization trajectory using the SA algorithm: v = 60, L = 1, T = 100, r = 0.997 (estimated cost vs. number of comparisons).

cost function of interest is defined as follows:

C(p) = T(p) + R(1 − p),

where T(p) is the mean delay of admitted customers at steady state and R is a fixed penalty applied to each rejected customer. We can easily see that, by controlling p, we can trade off the mean delay (which decreases as p decreases) against the expected rejection cost R(1 − p). Thus, for some given R, the problem is to find an optimal admission probability p which minimizes the cost function above.

For a system with arbitrary interarrival and service time distributions, T(p) and hence C(p) can only be estimated through simulation or direct observation of a sample path under p. Both the gradient-based and CSC approaches described next can be applied in such a general setting. For


simplicity, however, we consider Poisson arrivals and exponentially distributed service times, in which case an analytical expression for C(p) is easily derived, since we obtain an M/M/1 queueing system with arrival rate λp. This allows us to use the analytically derived optimal solution p* and corresponding cost C(p*) as a frame of reference. The objective function C(p) in this case is convex. One therefore expects that a gradient-descent algorithm is well suited for this type of optimization problem. In what follows, we apply such an algorithm to be used as a baseline for comparison with the CSC algorithm. Obviously, we cannot expect the latter to perform as well as the former in this case. Therefore, our goal here is to explore how much slower the CSC approach would be compared to a gradient-based scheme.
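The M/M/1 reference values are easy to reproduce. The sketch below (function names ours) assumes that the mean delay T(p) is the M/M/1 mean system time 1/(μ − λp) for the thinned arrival stream of rate λp, and uses the parameter values λ = 1, μ = 1, R = 10 quoted later in this section; it yields a minimum cost of about 6.32, consistent with the cost floors visible in Figs. 5 and 7.

```python
import math

def C(p, lam=1.0, mu=1.0, R=10.0):
    # Admitted stream is Poisson with rate lam*p, so the M/M/1 mean
    # system time is T(p) = 1/(mu - lam*p); rejections cost R*(1 - p).
    return 1.0 / (mu - lam * p) + R * (1.0 - p)

def p_star(lam=1.0, mu=1.0, R=10.0):
    # dC/dp = lam/(mu - lam*p)**2 - R = 0  =>  mu - lam*p = sqrt(lam/R).
    return (mu - math.sqrt(lam / R)) / lam
```

With the default values, p* = 1 − 1/√10 ≈ 0.684 and C(p*) = 2√10 ≈ 6.32.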

Gradient-Based Approach Using Perturbation Analysis. This approach may be used for a system with arbitrary interarrival and service time distributions, where no closed-form expression for C(p) is available. We have assumed Poisson arrivals and exponentially distributed service times only for the purpose of comparing our results with an analytically obtained value of the optimal parameter p* and the corresponding optimal cost C(p*). In general, however, for any given p, the cost C(p) is estimated through simulation of the system; the first derivative T′(p) of T(p) is estimated from the same simulation using perturbation analysis (PA). In particular, we used an algorithm based on smoothed perturbation analysis (SPA), which is described in Ref. 17. Let T̂′(p) denote this SPA estimate obtained from a simulation performed under p. We then use a gradient-descent algorithm where, after the nth step [i.e., a simulation under a value p(n) used to obtain the derivative estimate T̂′(p(n))], the current value p(n) is changed by

Δp(n) = −(r/n^a)[T̂′(p(n)) − R],

where r and a are parameters used to control the stepsize sequence in the algorithm and n is the iteration index. We have chosen r = 0.5 and a = 0.7, which satisfy standard conditions guaranteeing convergence. This is a typical optimization algorithm that has been used in similar problems involving queueing systems; e.g., see Ref. 18 or Ref. 19 for further details.
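A minimal noise-free sketch of this update rule follows (the function name, the feasibility clipping, and the iteration count are ours; in the paper, T̂′ comes from SPA applied to a simulation run):

```python
def gradient_descent(T_prime_hat, p0, R=10.0, r=0.5, a=0.7, num_iters=100):
    """Gradient descent for C(p) = T(p) + R(1 - p) with diminishing
    stepsizes r/n**a.

    T_prime_hat(p) -- a (possibly noisy) estimate of T'(p), e.g. from SPA
    """
    p = p0
    for n in range(1, num_iters + 1):
        # C'(p) = T'(p) - R, so Delta p(n) = -(r/n**a)[T'(p(n)) - R].
        p -= (r / n ** a) * (T_prime_hat(p) - R)
        p = min(max(p, 0.0), 0.99)  # keep p a feasible, stable probability
    return p
```

With the exact M/M/1 derivative T′(p) = λ/(μ − λp)² and λ = μ = 1, R = 10, the iterates settle near p* ≈ 0.684 (a smaller r than the paper's 0.5 is used in this noise-free check to avoid large early steps).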

In Fig. 5, we show an optimization trajectory based on this gradient-descent algorithm using SPA derivative estimates. Note that the accuracy (i.e., variance) of the derivative estimate depends on the length of the simulation run used. In Fig. 5, the length of every simulation performed in order to obtain the derivative estimates T̂′(p(n)), n = 1, 2, …, is selected to be such that 100 admitted customers are initially observed, and this number is increased by a factor of 1.2 at every iteration. The case where this number is kept constant leads to much slower convergence (see Ref. 16). It is interesting that this choice is crucial, in view of the fact that the CSC scheme, as we


Fig. 5. Optimization trajectory using the gradient-based algorithm with PA derivative estimation: number of admitted customers per iteration continuously increased by a factor of 1.2 (cost trajectory and minimum cost vs. cumulative number of arrivals).

will see, does not require long simulation runs. As already mentioned, this is because only the order of the estimates is important in that case, so that accurate estimation of C(p) is not critical. Finally, note that C(p) is plotted as a function of the total number of arrivals observed, rather than algorithm iterations, which gives an accurate measure of the time required for convergence.

CSC Approach. We begin by pointing out that this problem does not actually satisfy the required conditions for the CSC algorithm; see Theorem 4.1. First, the estimation noise W(p) is not identically distributed. In fact, it depends on p. In general, at higher utilization values (values of p close to 1), estimates of the mean delay have higher variance. Second, the pdf of W(p) is also not quite symmetric. Nonetheless, we proceed with the CSC


algorithm, anticipating once again that an approach based on ordinal estimate values is relatively immune to noise characteristics.

To apply the CSC approach, we first choose the sequence {Mk} in the same way as discussed in Section 5.1. Each iteration then consists of Mk simulation runs to estimate the cost at the current value of p and at some candidate value as specified by the CSC algorithm in Section 4. In what follows, we have chosen an initial value of p0 = 0.2 and parameter values λ = 1, μ = 1, R = 10.0.

In Fig. 6, we show a typical optimization trajectory (additional sample trajectories may also be found in Ref. 16). In this case, the CSC algorithm identifies points very close to the optimal cost within 30,000 customer arrivals, which is not much higher than the corresponding number under the gradient-descent algorithm above. However, it may also take much longer when an occasional jump away from a neighborhood of the optimal point

Fig. 6. Optimization trajectory using CSC (cost trajectory and minimum cost vs. cumulative number of arrivals).


occurs in the algorithm (see examples included in Ref. 16). Note that the simulations here are kept short (100 admitted customers for each cost estimate used in the comparisons).

A problem sometimes arising in the CSC approach is that an occasional jump to a particularly bad point (for some p close to 1) slows down the convergence of the CSC algorithm. Taking advantage of the known fact that C(p) → ∞ as p → 1, we have repeated the previous process by restricting the search space a priori to p < 0.9 (this is referred to as the modified CSC process). The resulting trajectory that corresponds to Fig. 6 is shown in Fig. 7. Now, the algorithm identifies points very close to the optimum within about 50,000 observed customer arrivals.

As already pointed out, we believe that a key feature of the CSC algorithm is its ability to determine points that are close to optimal relatively quickly, which is of great value in practice. It tends, however, to remain close

Fig. 7. Optimization trajectory using modified CSC (cost trajectory and minimum cost vs. cumulative number of arrivals).


to optimal for a very long time before it converges to the actual optimum.

In comparing the gradient-descent and CSC algorithms for this problem, it is interesting to observe the following. Although the former is the obvious choice for a simple convex optimization problem, its convergence speed is not dramatically faster, considering the need for relatively long simulation runs at each step to obtain accurate derivative estimates. In general, there are two parameters that a gradient-descent algorithm requires: a sequence of stepsizes (determined by r and a in the specific algorithm used here) and the length of the observation period, which controls the estimate accuracy. Gradient-based algorithms may be quite sensitive to the choice of these parameters. In the case of the CSC algorithm, the analogs are the sequence {Mk}, which controls the number of comparisons at each step, and again the length of the observation period (simulation run). The latter, however, is not critical, precisely because it is only the order of estimates of the cost function that matters, as opposed to their cardinal values. The CSC approach also forgoes the need for derivative estimation. Its main drawback so far appears to be the fact that it may remain at a near-optimal point for a very long time.

5.3. Optimal Routing Problem. We now consider a more complicated constrained vector optimization problem. Assume that we have M servers with infinite queueing capacity operating under a first-in-first-out discipline. The arrival rate is λ and the service rates of the M servers are μi, i = 1, …, M. The routing control is applied in the following way: when a customer arrives, he is routed with probability pi to queue i. The cost function of interest is the customer mean delay T(p), where p = (p1, …, pM). Thus, the problem is to find the optimal routing probability vector p* so that the mean delay is minimized. This is a classic problem in the queueing literature that has attracted considerable attention over the years (e.g., Ref. 20).

Similar to the admission control problem, for a system with arbitrary interarrival and service time distributions, T(p) can only be estimated through simulation or direct observation of a sample path under p. Both the gradient-based and CSC approaches described next can be applied in such a general setting. For simplicity, however, we consider Poisson arrivals and exponentially distributed service times, in which case an analytical expression for T(p) is easily derived. This allows us to use the analytically derived optimal solution p* and corresponding cost T(p*) as a frame of reference.

Denoting the mean delay at node i by Ti(pi), we have

T(p) = ∑_{i=1}^{M} pi Ti(pi). (6)
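Under the M/M/1 assumptions, Eq. (6) can be evaluated directly; the sketch below (function name ours) uses the three-server data of this section, λ = 4 and μ = (1, 2, 3):

```python
def mean_delay(p, lam=4.0, mu=(1.0, 2.0, 3.0)):
    # Eq. (6) with M/M/1 nodes: T(p) = sum_i p_i T_i(p_i), where node i
    # sees Poisson arrivals at rate lam*p_i, so T_i(p_i) = 1/(mu_i - lam*p_i).
    # Stability requires lam*p_i < mu_i for every i.
    return sum(pi / (mi - lam * pi) for pi, mi in zip(p, mu))
```

At the initial point p = (0.2, 0.4, 0.4) used below, this gives T(p) = 16/7 ≈ 2.29.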


In what follows, rather than concentrating on how fast convergence is attained, we concentrate on how close to optimal we can get with a given simulation time budget. In other words, we will fix a time horizon and explore the behavior of each optimization scheme within this horizon. From a practical standpoint, this type of question is often the one of primary concern.

Gradient-Based Approach Using Perturbation Analysis. As already mentioned, this approach may be used for a system with arbitrary interarrival and service time distributions, where no closed-form expression for T(p) is available. Setting

f(pi) = pi Ti′(pi) + Ti(pi), i = 1, …, M,

and using (6), we know that the optimal routing probabilities must satisfy the following condition, which can be derived easily using Lagrange multipliers:

f(p1) = f(p2) = ⋯ = f(pM). (7)

In general, for any given vector p = (p1, …, pM), the cost Ti(pi) is estimated through simulation, and the first derivative Ti′(pi) of Ti(pi) is estimated from the same simulation using smoothed perturbation analysis (SPA) as described in Ref. 17. Let T̂i(pi) and T̂i′(pi) denote the estimates of Ti(pi) and Ti′(pi). Define

f̂(pi) = pi T̂i′(pi) + T̂i(pi), i = 1, …, M.

We then search for pi, i = 1, …, M, to satisfy condition (7) with the estimates T̂i(pi) and T̂i′(pi). The algorithm that we implement is briefly described as follows.

Step 1. For given p, T̂i(pi), T̂i′(pi), i = 1, …, M, choose pj and pk such that f̂(pj) < f̂(pk). They are usually chosen so that j corresponds to the minimum value and k corresponds to the maximum value over f̂(pi), i = 1, …, M.

Step 2. Find some Δ such that f̂(pj + Δ) > f̂(pj) and f̂(pk − Δ) < f̂(pk).

Step 3. Repeat Steps 1 and 2 until a given simulation budget is exhausted.
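The steps above can be sketched as follows, using the analytical f(pi) available in the M/M/1 case in place of the SPA-based estimate f̂, and a fixed small Δ; the function names and the fixed-Δ simplification are ours:

```python
def f_hat(i, p_i, lam=4.0, mu=(1.0, 2.0, 3.0)):
    # f(p_i) = d/dp_i [p_i T_i(p_i)] = p_i T_i'(p_i) + T_i(p_i); with the
    # M/M/1 forms T_i = 1/(mu_i - lam p_i), T_i' = lam/(mu_i - lam p_i)**2,
    # this collapses algebraically to mu_i/(mu_i - lam p_i)**2.
    return mu[i] / (mu[i] - lam * p_i) ** 2

def balance_step(p, delta=0.005):
    # Steps 1-2 with a fixed small Delta: move probability mass delta from
    # the queue with the largest f to the queue with the smallest, driving
    # the f values toward equality as required by condition (7).
    vals = [f_hat(i, pi) for i, pi in enumerate(p)]
    j = vals.index(min(vals))  # receives mass (smallest f)
    k = vals.index(max(vals))  # gives up mass (largest f)
    p = list(p)
    p[j] += delta
    p[k] -= delta
    return p
```

Iterating balance_step from p = (0.2, 0.4, 0.4) preserves the probability constraint and drives the mean delay toward its minimum of about 1.4, the level visible in Fig. 10.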

As an example, we consider a system with three servers. The service rates are μ1 = 1.0, μ2 = 2.0, μ3 = 3.0. The arrival rate is λ = 4.0. The initial probabilities are p = (0.2, 0.4, 0.4). The optimization trajectory is shown in


Fig. 8. Optimization trajectory using the gradient-based algorithm with PA derivative estimation: 100 customers observed per iteration (optimization trajectory and optimal cost vs. cumulative number of observed arrivals).

Fig. 8, where we stop after 200,000 arrivals are observed, with 100 customers observed in each iteration. When the total number of observed arrivals in each iteration is small, it is possible that no departure event is observed at some node, so that we cannot evaluate the derivatives using SPA estimates. In this case, we assume that the arrival rate to this node is very small, so that we can use an approximate formula; for instance, set the derivative estimate to 1 based on the M/M/1 analytical expression for T(p).

CSC Approach. Again, we point out that this problem does not actually satisfy the required conditions for the CSC algorithm; see Theorem 4.1. First, the estimation noise W(p) is not identically distributed. In fact, it depends on p. In general, at higher utilization values (values of pi close to 1), estimates of the mean delay have higher variance. Nonetheless, we proceed with the CSC algorithm, anticipating once again that an approach based


Fig. 9. Optimization trajectory using CSC: 100 customers observed per iteration (optimization trajectory and optimal cost vs. cumulative number of observed arrivals).

on ordinal estimate values is relatively immune to noise characteristics. In what follows, we consider the same three-server example studied

above. We have chosen the same initial probabilities p = (0.2, 0.4, 0.4). An optimization trajectory is shown in Fig. 9; for additional trajectories with longer simulation runs for each iteration, see Ref. 16. Two observations are worth making. First, reducing the length of simulation runs hardly affects the final value attained. Second, it is interesting to compare Figs. 8 and 9 and note that the final estimated cost values at the end of the process are comparable.

In addition, to improve the convergence rate of the CSC algorithm, we introduced the following simple heuristic: we reduce the search space gradually when we find that the parameters stay constant for 15 iterations. We call this the modified CSC approach and show an optimization trajectory in Fig. 10.


Fig. 10. Optimization trajectory using modified CSC: 400 customers observed per iteration (cost trajectory and minimum cost vs. cumulative number of arrivals).

5.4. Comparison of Algorithms and Issues in Stochastic Optimization. The applications described in the previous sections contribute to the identification of some key issues related to stochastic optimization. In what follows, we outline these issues and provide an informal description of how the various schemes we have considered address the corresponding issue. This also provides us with the opportunity to summarize the findings of the previous three sections.

(i) Global Convergence. This is the first issue one is confronted with. As already discussed, gradient-based techniques can only guarantee convergence to a local optimum. The SA and CSC schemes are motivated in part by the need to overcome this difficulty. As we have seen, SA may not converge when multiple minima are present in the cost function of interest, depending on the estimate noise characteristics; see also Ref. 16. On the


other hand, the CSC algorithm converges under the conditions imposed in Theorem 4.1.

(ii) Convergence Speed. Assuming that an algorithm converges to a global optimum, the issue of speed is always one that is hard to characterize analytically and typically depends on algorithm parameters that one tends to choose rather empirically (e.g., the scaling factor in gradient-based algorithms or the cooling ratio in SA). When a single optimal point is guaranteed, gradient-based algorithms may converge quite fast, provided one carefully selects the two critical parameters: the scaling factors applied to the derivative estimates and the observation interval length over which the derivatives are estimated. The SA algorithm is known to be very slow. Finally, the CSC approach tends to reach a point very close to optimal quite fast, but generally takes a long time to converge to the actual global minimum.

Given the practical importance of optimization algorithms, an alternative approach for evaluating the performance of such algorithms is the one suggested in the last section: define a time budget and evaluate how close to the optimal point an algorithm can get within this time budget.

(iii) Robustness with Respect to Algorithm Parameters. In gradient-based algorithms, one often finds that the selection of the scaling factors can be very crucial. The same is known to be true of the cooling ratio in SA, as well as the choice of search neighborhood required in this case. The corresponding parameter in the CSC scheme is the number of comparisons at each iteration. Our experience to date, by no means exhaustive, indicates that the sequence {Mk} does not seem to be as critical in the overall behavior of the algorithm. We point this out simply to suggest further research in this direction rather than to draw any general conclusions.

(iv) Robustness with Respect to Estimation Noise. The quality of the estimates driving an algorithm is closely related to convergence speed. If one requires great accuracy at each iteration, then one obviously requires significant time in collecting data for estimation purposes. The sensitivity to noise in SA is actually quite well known. We have also found the same to be true of the derivative estimates used in the last two sections. In contrast, the general behavior of the CSC algorithm does not seem to be affected by this issue. Once again, this must be attributed to the inherent robustness of order statistics, since the CSC scheme is driven by mere comparisons of estimates and not cardinal values.

(v) Generality. It is always desirable to have at one's disposal an optimization scheme of general applicability. In this respect, the SA and CSC algorithms fulfill this requirement. In contrast, gradient-based algorithms require derivative information, which is not always easy to obtain. On the other hand, it should also be pointed out that structural information regarding the system or process being optimized is often invaluable. As an


example, we saw in Section 5.2 that knowledge of the fact that the cost function C(p) approaches infinity as p approaches 1 allowed us to modify the search space of the CSC algorithm and improve its convergence speed.

(vi) Use of Heuristics to Improve Algorithm Performance. This is a potentially important issue, since one is often forced to resort to heuristics when an algorithm does not work well in practice. For example, the selection of a scaling factor sequence in gradient-based algorithms is frequently based on simple heuristics within certain theoretical guidelines; most practitioners choose it so that it decreases roughly like (1/n)^(1/2). Unfortunately, heuristics are more often than not problem-dependent, so that a rigorous general study is infeasible. In the case of the CSC algorithm, we find it interesting that a simple heuristic like the one employed in Section 5.3 can significantly improve the algorithm behavior: if no change in the controlled parameters occurs within a certain number of iterations, then we gradually reduce the search space.

6. Conclusions

The problem of stochastic optimization for arbitrary objective functions presents a dual challenge. First, one needs to repeatedly estimate the objective function; when no closed-form expression is available, this is only possible through simulation or direct observation of a sample path. Second, one has to face the possibility of being trapped in the optimization process at a local, rather than global, optimum. Gradient-based algorithms can be very efficient when (a) one knows a priori that the global optimum is uniquely specified by the point where the gradient is zero and (b) efficient gradient estimation techniques are available, which usually requires advance knowledge of a specific structure for the problem. In a general setting, one must resort to various algorithms of the random search variety.

In this paper, we have used the stochastic comparison (SC) approach proposed in Ref. 1 for discrete optimization and adapted it to the case of continuous optimization. We have shown that the continuous stochastic comparison (CSC) algorithm that we have described in Section 4 converges to an ε-neighborhood of the global optimum for any ε > 0. This approach is applied to three problems with different features and compared to simulated annealing and gradient-descent algorithms. An attractive feature of the CSC algorithm is its ability to identify points in the neighborhood of the global optimum relatively fast, which is of significant value in practice. The key to this speed is the fact that the algorithm is driven by the relative order of estimates of the objective function at two points, rather than their cardinal values. Exploiting the robustness of order statistics, the algorithm is quick


to proceed to better points using potentially very noisy estimates. On the other hand, final convergence to the actual optimal point may take a considerable amount of time. This suggests the possibility of simple heuristics at the late phase of an optimization process, such as those used in the applications of Section 5.

The generality of the CSC approach compared to gradient-descent schemes is an obvious attractive feature. In comparing the CSC approach to simulated annealing (SA), we find the latter limited by the need for highly accurate objective function estimates at each step.

Finally, we have used the experience accumulated from the applications in Section 5 to summarize what we view as key features in stochastic optimization and compare gradient-based, SA, and CSC algorithms on this basis. The goal of this summary (see Section 5.4) is primarily to provide some informal guidelines and to suggest directions for further research in this area. Given some of the attractive features of the CSC scheme, a potentially promising area is the combination of the stochastic comparison principle with structural information pertaining to a given system or process being optimized. As an example, one may exploit the CSC algorithm's ability to quickly identify a near-optimal point and then employ a gradient-based scheme to determine the actual optimal point.

References

1. GONG, W. B., HO, Y. C., and ZHAI, W., Stochastic Comparison Algorithm for Discrete Optimization with Estimation, Proceedings of the 31st IEEE Conference on Decision and Control, pp. 795-802, 1992.

2. TORN, A., and ZILINSKAS, A., Global Optimization, Lecture Notes in Computer Science, Springer Verlag, Berlin, Germany, 1987.

3. GOLDBERG, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Publishing Company, Reading, Massachusetts, 1989.

4. AARTS, E., and KORST, J., Simulated Annealing and Boltzmann Machines, John Wiley and Sons, New York, New York, 1989.

5. JOHNSON, D. S., ARAGON, C. R., MCGEOCH, L. A., and SCHEVON, C., Optimization by Simulated Annealing: An Experimental Evaluation, Part 1: Graph Partitioning, Operations Research, Vol. 37, pp. 865-892, 1989.

6. YAN, D., and MUKAI, H., Stochastic Discrete Optimization, SIAM Journal on Control and Optimization, Vol. 30, pp. 549-612, 1992.

7. Ho, Y. C., and CAO, X. R., Perturbation Analysis of Discrete-Event Dynamic Systems, Kluwer Academic Publishers, Dordrecht, Netherlands, 1991.

8. GLASSERMAN, P., Gradient Estimation via Perturbation Analysis, Kluwer Academic Publishers, Dordrecht, Netherlands, 1991.

9. GLYNN, P. W., Likelihood Ratio Gradient Estimation: An Overview, Proceedings of the 1987 Winter Simulation Conference, pp. 336-375, 1987.


10. REIMAN, M., and WEISS, A., Sensitivity Analysis for Simulations via Likelihood Ratios, Operations Research, Vol. 37, pp. 830-844, 1989.

11. RUBINSTEIN, R. Y., Sensitivity Analysis and Performance Extrapolation for Computer Simulation Models, Operations Research, Vol. 37, pp. 72-81, 1989.

12. HERTZ, J., KROGH, A., and PALMER, R. G., Introduction to the Theory of Neural Computation, Addison-Wesley Publishing Company, Reading, Massachusetts, 1991.

13. GELFAND, S. B., and MITTER, S. K., Simulated Annealing with Noisy or Imprecise Energy Measurements, Journal of Optimization Theory and Applications, Vol. 62, pp. 49-62, 1989.

14. ISAACSON, D. L., and MADSEN, R. W., Markov Chains: Theory and Applications, John Wiley and Sons, New York, New York, 1976.

15. HO, Y. C., SREENIVAS, R., and VAKILI, P., Ordinal Optimization of Discrete-Event Dynamic Systems, Journal of Discrete Event Dynamic Systems, Vol. 2, pp. 61-88, 1992.

16. BAO, G., and CASSANDRAS, C. G., Stochastic Comparison Algorithm: Theory and Applications, Technical Report, Department of Electrical and Computer Engineering, University of Massachusetts, 1995.

17. GONG, W. B., Smoothed Perturbation Analysis Algorithm for a G/G/1 Routing Problem, Proceedings of the 1988 Winter Simulation Conference, pp. 525-531, 1988.

18. CHONG, E. K. P., and RAMADGE, P. J., Convergence of Recursive Optimization Algorithms Using Infinitesimal Perturbation Analysis Estimates, Journal of Discrete Event Dynamic Systems, Vol. 1, pp. 339-372, 1992.

19. MOHANTY, B., and CASSANDRAS, C. G., The Effect of Model Uncertainty on Some Optimal Routing Problems, Journal of Optimization Theory and Applications, Vol. 77, pp. 257-290, 1993.

20. BERTSEKAS, D. P., and GALLAGER, R. G., Data Networks, Prentice Hall, Englewood Cliffs, New Jersey, 1987.