
Properties of Multivariate q-Gaussian Distributions

and its application to Smoothed Functional

Algorithms for Stochastic Optimization

A Project Report

Submitted in partial fulfilment of the

requirements for the Degree of

Master of Engineering

in

System Science and Automation

by

Debarghya Ghoshdastidar

ELECTRICAL ENGINEERING

and

COMPUTER SCIENCE & AUTOMATION

INDIAN INSTITUTE OF SCIENCE

BANGALORE – 560 012, INDIA

JUNE 2012


TO

People who work,

People who dream,

and

People who work

even in their dreams


Acknowledgements

I thank Dr. Ambedkar Dukkipati and Prof. Shalabh Bhatnagar for giving me the opportunity to work on this problem. Their invaluable guidance made this work possible. I also thank the other faculty members of the CSA, EE, ECE and Mathematics departments for blessing me with immense knowledge during coursework. I sincerely thank Prof. Christophe Vignat, Ecole Polytechnique Federale de Lausanne, for his advice, which helped me improve my work.

I would like to thank all the members of the Algorithmic Algebra Lab and the Stochastic Systems Lab for their support. The names of Abhranil, Saswata, Prabu Chandran, Gaurav and Maria need special mention.

Last, but not least, I thank my family and close friends for providing constant encouragement even in the toughest times.


Publications based on this Thesis

1. D. Ghoshdastidar, A. Dukkipati, and S. Bhatnagar. q-Gaussian based smoothed functional algorithms for stochastic optimization. In International Symposium on Information Theory. IEEE, 2012.


Abstract

The importance of the q-Gaussian distribution is due to its power-law nature and its close association with the popular Gaussian distribution. This distribution arises from nonextensive information theory, and it has the interesting property that uncorrelated q-Gaussian random variates show a special kind of inter-dependence.

In this work, we study some key properties related to higher order moments and co-moments of the multivariate q-Gaussian distribution. We use the important features of this distribution to improve upon the smoothing properties of a Gaussian kernel. Based on this, we propose a Smoothed Functional scheme for gradient and Hessian estimation using the q-Gaussian distribution.

Using the above estimation technique, we propose four two-timescale algorithms for optimization of a stochastic objective function using gradient descent and Newton based search methods. We prove that the proposed algorithms converge to a local optimum. The performance of the algorithms is shown by simulation results on a queuing model. The results show that the q-Gaussian based schemes provide better tuning of the algorithms, which improves their performance compared to their Gaussian counterparts.


Contents

Acknowledgements

Publications based on this Thesis

Abstract

Notation and Abbreviations

1 Introduction
1.1 Stochastic Optimization
1.2 Nonextensive Information Theory
1.3 Motivation and Summary of current work
1.4 Organization of this report

2 Background and Preliminaries
2.1 Problem Framework
2.2 Smoothed Functionals
2.3 q-Gaussian distribution

3 Some properties of multivariate q-Gaussian
3.1 Generalized co-moments of joint q-Gaussian distribution
3.2 q-Gaussian as a Smoothing Kernel
3.3 Optimization using q-Gaussian SF

4 q-Gaussian based Gradient Descent Algorithms
4.1 Gradient Estimation with q-Gaussian SF
4.1.1 One-simulation q-Gaussian SF Gradient Estimate
4.1.2 Two-simulation q-Gaussian SF Gradient Estimate
4.2 Proposed Gradient Descent Algorithms
4.3 Convergence of Gradient SF Algorithms
4.3.1 Convergence of Gq-SF1 Algorithm
4.3.2 Convergence of Gq-SF2 Algorithm

5 q-Gaussian based Newton Search Algorithms
5.1 Hessian Estimation using q-Gaussian SF
5.1.1 One-simulation q-Gaussian SF Hessian Estimate
5.1.2 Two-simulation q-Gaussian SF Hessian Estimate
5.2 Proposed Newton-based Algorithms
5.3 Convergence of Newton SF Algorithms
5.3.1 Convergence of Nq-SF1 Algorithm
5.3.2 Convergence of Nq-SF2 Algorithm

6 Simulations using Proposed Algorithms
6.1 Numerical Setting
6.2 Implementation Issues
6.3 Experimental Results

7 Conclusions

Bibliography


Notation and Abbreviations

N        Set of all natural numbers, {0, 1, 2, . . .}

R        Set of all real numbers

R+       Set of all positive real numbers

P        Probability of an event

Ep       Expectation of a random variable with probability distribution p

PS       Projection onto a set S

‖.‖      Euclidean norm if the argument is a vector, and induced 2-norm for a matrix

∇^n_x    nth derivative with respect to the vector x

x(i)     ith element of the vector x

xn or x(n)   nth element of a sequence

Ai,j     Element in the ith row and jth column of the matrix A

IN×N     N × N identity matrix

Yn       The state of a stochastic process at the nth time instant


Chapter 1

Introduction

1.1 Stochastic Optimization

Optimization lies at the heart of the engineering sciences. Modern systems, which cannot be modeled or analyzed using a deterministic approach, require optimization techniques based on a stochastic framework. "Methods for stochastic optimization provide a means of coping with inherent system noise and coping with models or systems that are highly nonlinear, high dimensional, or otherwise inappropriate for classical deterministic methods of optimization" (Gentle et al. [17]). Stochastic techniques play a key role in optimization problems where the objective function does not have an analytic expression. Such problems are often encountered in discrete event systems, which are quite common in the engineering and financial worlds. Most often, the data, obtained via statistical surveys or simulation, contain only noisy estimates of the objective function to be optimized.

One of the most commonly used solution methodologies involves stochastic approximation algorithms, particularly the Robbins-Monro algorithm [34], which is used to find the zeros of a given function. Based on this approach, gradient descent algorithms have been developed, in which the parameters controlling the system proceed towards the zeros of the gradient of the average cost. However, these algorithms require an estimate of the cost gradient. Kiefer and Wolfowitz [25] provide such a gradient estimate using several parallel simulations of the system. More efficient techniques for gradient estimation, using one or two simulations, have been developed based on the smoothed functional approach [24, 35], perturbation analysis [22, 39] and likelihood ratios [29]. A stochastic variation of Newton-based optimization methods, also known as adaptive Newton-based schemes, has also been studied in the literature. These algorithms require estimation of the Hessian of the average cost, along with the gradient estimate. Such estimates may be obtained by finite differences [16, 36], simultaneous perturbation [40] or the smoothed functional scheme [5].

We give special mention to the smoothed functional (SF) approach proposed by Katkovnik and Kulchitsky [24]. In this method, the gradient of the expected cost is approximated by its convolution with a multivariate normal distribution. Such a technique requires just one simulation and hence proves to be efficient. A two-simulation SF algorithm based on a finite difference gradient estimate has been developed in [41], which performs better than one-simulation algorithms. Rubinstein [35] showed that the gradient of the cost function can be approximated by its convolution with any function which satisfies certain conditions. Such functions are often referred to as smoothing kernels. Bhatnagar [5] extended the gradient estimation method to the Hessian case.

When the above estimation schemes are employed in gradient or Newton based optimization methods, the time complexity of the algorithms increases, as each update iteration requires the estimation procedure. A more efficient approach is to simultaneously perform gradient estimation and parameter updates using different step-size schedules. This class of algorithms constitutes the multi-timescale stochastic approximation algorithms [6]. Two-timescale stochastic optimization algorithms have been developed using simultaneous perturbation [8, 9] and smoothed functionals [7].

The main issue with such algorithms is that, although convergence of the algorithm to a local optimum is guaranteed, the global optimum cannot be achieved in practice. Bhatnagar [5] provides a detailed comparison of the performance of various multi-timescale algorithms for stochastic optimization in a queuing system. The results presented there indicate that the two-simulation SF schemes, using the multivariate Gaussian distribution, outperform other algorithms. The results also show that the performance of the SF algorithms depends considerably on several tuning parameters, such as the variance of the normal distribution and the step-sizes.

We look into smoothing kernels arising out of a generalization of classical information theory, known as nonextensive information theory, which has gained popularity in several domains in recent years.

1.2 Nonextensive Information Theory

Shannon [38] provided the concept of entropy as a measure of the uncertainty given by probability distributions. A continuous form of the Shannon entropy functional has been extensively studied in statistical mechanics, probability and statistics. It is defined as

H(p) = -\int_{\mathcal{X}} p(x) \ln p(x)\, dx,    (1.1)

where p(.) is a pdf defined on the sample space X. Kullback's minimum discrimination theorem [27] establishes important connections between statistics and information theory. A special case of this theorem is Jaynes' maximum entropy principle [23], by which the exponential family can be obtained by maximizing the Shannon entropy functional subject to certain moment constraints.

Several generalizations of Shannon's entropy have been studied in the literature [26, 33, 31, 13]. These generalized entropy functionals find extensive use in physics, communications, computer science, statistics etc. One of the most recently studied generalized information measures is due to Tsallis [44]. Although this generalization had been introduced earlier by Havrda and Charvat [20], Tsallis provided interpretations in the context of statistical mechanics. Dukkipati et al. [15, 14] provide a measure theoretic formulation of the continuous form of the Tsallis entropy functional, defined as

H_q(p) = \frac{1 - \int_{\mathcal{X}} [p(x)]^q \, dx}{q - 1}, \qquad q \in \mathbb{R}.    (1.2)

This function results when the natural logarithm in (1.1) is replaced by the q-logarithm, defined as ln_q(x) = (x^{1-q} - 1)/(1 - q), q ∈ R, q ≠ 1. The Shannon entropy can be retrieved from (1.2) as q → 1. It is called nonextensive because of its pseudo-additive nature [44]. Borges [10] presents a detailed discussion of the mathematical structure induced by this nonextensive entropy functional. Suyari [42] generalized the Shannon-Khinchin axioms to this case.

The popularity of Tsallis entropy in computer science and statistics is due to the power-law nature of the distributions obtained by maximizing Tsallis entropy. It has been observed that most real-world data exhibit a power-law behavior [30, 3]. Recently, Tsallis entropy has been used to study this behavior in different settings such as finance, earthquakes and network traffic [1, 2]. Compared to the exponential family, the Tsallis distributions, i.e. the family of distributions resulting from maximization of Tsallis entropy, have an additional shape parameter q, similar to that in (1.2), which controls the nature of the power-law tails.

1.3 Motivation and Summary of current work

Though stochastic optimization algorithms are guaranteed to converge to a local optimum of the objective function, the challenge is to achieve the global optimum. The smoothed functional schemes have become popular due to their smoothing effect on local fluctuations. However, the Gaussian smoothing kernel, which is used in practice, does not provide "ideal performance". Although we can reduce the cost with proper tuning of parameters, we cannot achieve globally optimal performance. Hence, new methods are sought.

We propose a new SF method where the smoothing kernel is a q-Gaussian distribution, which is a power-law generalization of the Gaussian distribution resulting from nonextensive information theory. First, we prove an important result related to the co-moments of the multivariate q-Gaussian distribution. We show that the multivariate q-Gaussian distribution satisfies all the conditions for smoothing kernels discussed in [35]. The "shape parameter" q, which controls the power-law behavior of the q-Gaussian, also controls the smoothness of the convolution, thereby providing additional tuning.

We illustrate methods for gradient and Hessian estimation using the q-Gaussian smoothing kernel. We also present two-timescale algorithms for stochastic optimization using q-Gaussian based SF, and prove the convergence of the proposed algorithms to the neighbourhood of a local optimum. Further, we perform simulations on a queuing network to illustrate the benefits of the q-Gaussian based SF algorithms compared to their Gaussian counterparts.

1.4 Organization of this report

The rest of the report is organized as follows. The framework for the optimization problem and some of the preliminaries are presented in Chapter 2. Some general results regarding the multivariate q-Gaussian distribution are proved in Chapter 3; we also provide some insights regarding optimization using q-Gaussian SF in this chapter. Chapter 4 deals with gradient descent algorithms using q-Gaussian SF. It presents the approach of gradient estimation, the corresponding algorithms and their convergence. Similar results and proposed algorithms for adaptive Newton-based search are presented in Chapter 5. Chapter 6 deals with implementation of the proposed algorithms, and simulations based on a numerical setting. Finally, Chapter 7 provides the concluding remarks.


Chapter 2

Background and Preliminaries

In this chapter, we first describe the framework of the optimization problem. We also discuss the Smoothed Functional scheme, which is commonly used to estimate derivatives of a stochastic function. Finally, we provide a brief overview of the generalization of the Gaussian distribution studied in nonextensive information theory.

2.1 Problem Framework

We assume there exists a stochastic process, and a cost function associated with it. A few assumptions are made regarding the process and its associated cost, which are reasonable even in a real-world scenario.

Let {Yn : n ∈ N} ⊂ R^d be a parameterized Markov process, depending on a tunable parameter θ ∈ C, where C is a compact and convex subset of R^N. Let Pθ(x, dy) denote the transition kernel of {Yn} when the operative parameter is θ ∈ C. Let h : R^d → R+ ∪ {0} be a Lipschitz continuous cost function associated with the process.

Assumption I. The process {Yn} is ergodic for any given θ as the operative parameter, i.e., as L → ∞,

\frac{1}{L} \sum_{m=0}^{L-1} h(Y_m) \to E_{\nu_\theta}[h(Y)],

where ν_θ is the stationary distribution of {Yn}.


Our objective is to minimize the long-run average cost

J(\theta) = \lim_{L \to \infty} \frac{1}{L} \sum_{m=0}^{L-1} h(Y_m) = \int_{\mathbb{R}^d} h(x)\, \nu_\theta(dx),    (2.1)

by choosing an appropriate θ ∈ C. The existence of the above limit is assured by Assumption I and the fact that h is continuous, hence measurable. In addition, we assume

that the average cost J(θ) satisfies the following condition:

Assumption II. The function J(.) is twice continuously differentiable with respect to any θ ∈ C, with a bounded third derivative.

Definition 2.1 (Non-anticipative sequence). A random sequence of parameter vectors, (θ(n))_{n≥0} ⊂ C, controlling a process {Yn} ⊂ R^d, is said to be non-anticipative if the conditional probability P(Y_{n+1} ∈ dy | F_n) = P_{θ(n)}(Y_n, dy) almost surely for all n ≥ 0 and all Borel sets dy ⊂ R^d, where F_n = σ(θ(m), Y_m, m ≤ n) is the associated σ-field.

It can be verified that under a non-anticipative parameter sequence (θ(n)), the given process along with the updates, {(Yn, θ(n))}_{n≥0}, is Markov. We assume the existence of a stochastic Lyapunov function.

Assumption III. Let (θ(n)) be a sequence of random parameters, obtained using an iterative scheme, controlling the process {Yn}, and let F_n = σ(θ(m), Y_m, m ≤ n), n ≥ 0 be the sequence of associated σ-fields. There exist ε_0 > 0, a compact set K ⊂ R^d, and a continuous function V : R^d → R+ ∪ {0}, with lim_{‖x‖→∞} V(x) = ∞, such that under any non-anticipative sequence (θ(n))_{n≥0},

(i) sup_n E[V(Y_n)^2] < ∞, and

(ii) E[V(Y_{n+1}) | F_n] ≤ V(Y_n) − ε_0, whenever Y_n ∉ K, n ≥ 0.

While Assumption II is a technical requirement, Assumption III ensures that the process under a tunable parameter remains stable. Assumption III will not be required, for instance, if the single-stage cost function h is, in addition, bounded.
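For intuition, the following small sketch (our own illustration, not part of the original report) instantiates this framework with a hypothetical parameterized Markov process and estimates the long-run average cost of (2.1) by simple time-averaging; the transition rule and the cost function are arbitrary stand-ins chosen in the spirit of Assumptions I-III.

```python
import numpy as np

def estimate_average_cost(theta, L=50_000, rng=None):
    """Estimate J(theta) = lim (1/L) sum h(Y_m) for a toy parameterized Markov
    process.  The AR(1)-type transition and the cost h below are illustrative
    stand-ins for the (unspecified) system of Assumption I."""
    rng = np.random.default_rng(rng)
    d = len(theta)
    y = np.zeros(d)
    total = 0.0
    for _ in range(L):
        # mean-reverting transition kernel P_theta(y, dy'), pulled towards theta
        y = 0.9 * y + 0.1 * np.asarray(theta) + 0.05 * rng.standard_normal(d)
        total += np.linalg.norm(y - 1.0)        # Lipschitz single-stage cost h(y)
    return total / L

print("estimated J(theta):", estimate_average_cost([0.5, 1.5], rng=0))
```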


2.2 Smoothed Functionals

Here, we present an overview of the smoothed functional approach proposed by Katkovnik and Kulchitsky [24]. We consider a real-valued function f : C → R, defined over a compact set C. Its smoothed functional is defined as

S_\beta[f(\theta)] = \int_{-\infty}^{\infty} G_\beta(\eta) f(\theta - \eta)\, d\eta = \int_{-\infty}^{\infty} G_\beta(\theta - \eta) f(\eta)\, d\eta,    (2.2)

where G_β : R^N → R is a kernel function, with a parameter β taking values from R. The idea behind using smoothed functionals is that if f(θ) is not well-behaved, i.e., it has a fluctuating character, then S_β[f(θ)] is "better-behaved". This ensures that an optimization algorithm with objective function f(θ) does not get stuck at a local minimum, but converges to the global minimum. The parameter β controls the degree of smoothness.

Rubinstein [35] established that the SF algorithm achieves these properties if the kernel function satisfies the following sufficient conditions:

(P1) G_β(η) = (1/β^N) G(η/β), where G(x) denotes G_β(x) for the case β = 1, i.e., G(η/β) = G_1(η^(1)/β, η^(2)/β, . . . , η^(N)/β),

(P2) G_β(η) is piecewise differentiable in η,

(P3) G_β(η) is a probability distribution function, i.e., S_β[f(θ)] = E_{G_β(η)}[f(θ − η)],

(P4) lim_{β→0} G_β(η) = δ(η), where δ(η) is the Dirac delta function, and

(P5) lim_{β→0} S_β[f(θ)] = f(θ).

A two-sided form of the SF is defined as

S'_\beta[f(\theta)] = \frac{1}{2} \int_{-\infty}^{\infty} G_\beta(\eta) \big( f(\theta - \eta) + f(\theta + \eta) \big)\, d\eta.

This can be rewritten as

S'_\beta[f(\theta)] = \frac{1}{2} \int_{-\infty}^{\infty} G_\beta(\theta - \eta) f(\eta)\, d\eta + \frac{1}{2} \int_{-\infty}^{\infty} G_\beta(\eta - \theta) f(\eta)\, d\eta.    (2.3)

The normal distribution satisfies the above conditions, and has been used as a kernel in [24, 41]. The SF approach provides a method [7] for estimating the gradient or Hessian of any function which satisfies Assumptions I-III, as shown in [5], where the Gaussian smoothing kernel is used. The gradient estimator obtained using (2.2) is given by

\nabla_\theta J(\theta) \approx \frac{1}{\beta M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \eta(n)\, h(Y_m)    (2.4)

for large M, L and small β. The stochastic process {Y_m} is governed by the parameter (θ(n) + βη(n)), where θ(n) ∈ C ⊂ R^N is obtained through an iterative scheme, and η(n) = (η^(1)(n), . . . , η^(N)(n))^T is an N-dimensional vector composed of i.i.d. standard normal distributed random variables η^(1)(n), . . . , η^(N)(n). Similarly, a two-simulation gradient estimator has been suggested using (2.3), which is of the following form

\nabla_\theta J(\theta) \approx \frac{1}{2\beta M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \eta(n) \big( h(Y_m) - h(Y'_m) \big)    (2.5)

for large M, L and small β, where {Y_m} and {Y'_m} are two processes governed by the parameters (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively, θ(n) and η(n) being defined as earlier. The respective one- and two-simulation estimates for the Hessian case are given by

\nabla^2_\theta J(\theta) \approx \frac{1}{\beta^2 M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} H(\eta(n))\, h(Y_m)    (2.6)

and

\nabla^2_\theta J(\theta) \approx \frac{1}{2\beta^2 M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} H(\eta(n)) \big[ h(Y_m) + h(Y'_m) \big],    (2.7)

where H(η) is a matrix given by H(η)_{i,j} = (η^(i))^2 − 1 for i = j, and H(η)_{i,j} = η^(i) η^(j) for i ≠ j.
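As a concrete illustration of (2.4)-(2.7), the sketch below (our own addition) forms the one-simulation Gaussian SF gradient and Hessian estimates for a deterministic stand-in cost, so that h(Y_m) is replaced by a direct evaluation of the objective; the function f and all parameter values are hypothetical choices, not taken from this report.

```python
import numpy as np

def gaussian_sf_estimates(f, theta, beta=0.2, M=100_000, rng=None):
    """One-simulation Gaussian SF estimates of the gradient (2.4) and the
    Hessian (2.6) of f at theta, with one cost sample per perturbation."""
    rng = np.random.default_rng(rng)
    N = len(theta)
    grad = np.zeros(N)
    hess = np.zeros((N, N))
    for _ in range(M):
        eta = rng.standard_normal(N)            # i.i.d. N(0,1) perturbation
        cost = f(theta + beta * eta)            # stands in for h(Y_m)
        grad += eta * cost
        H = np.outer(eta, eta)                  # H(eta): eta_i eta_j off-diagonal,
        np.fill_diagonal(H, eta**2 - 1.0)       # eta_i^2 - 1 on the diagonal
        hess += H * cost
    return grad / (beta * M), hess / (beta**2 * M)

f = lambda x: np.sum(x**2)                      # hypothetical smooth objective
g, H = gaussian_sf_estimates(f, np.array([1.0, -0.5]), rng=0)
print(g)   # close to the true gradient 2*theta
print(H)   # close to the true Hessian 2*I
```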


2.3 q-Gaussian distribution

The q-Gaussian distribution was initially developed to describe the process of Lévy superdiffusion [32], but has later been studied in other fields, such as finance [37]. Its importance lies in its power-law nature, due to which the tails of the q-Gaussian decay at a slower rate than those of the Gaussian distribution, depending on the choice of q.

It results from maximizing Tsallis entropy under certain 'deformed' moment constraints, known as normalized q-expectations, defined as

\langle f \rangle_q = \frac{\int_{\mathbb{R}} f(x)\, p(x)^q\, dx}{\int_{\mathbb{R}} p(x)^q\, dx}.    (2.8)

This form of expectation considers an escort distribution

p_q(x) = \frac{p(x)^q}{\int_{\mathbb{R}} p(x)^q\, dx},

and has been shown to be compatible with the foundations of nonextensive statistics [46]. Prato and Tsallis [32] maximized Tsallis entropy under the constraints 〈x〉_q = µ_q and 〈(x − µ_q)^2〉_q = β_q^2, which are known as the q-mean and q-variance, respectively. These are generalizations of the standard first and second moments, and tend to the usual mean and variance, respectively, as q → 1. This results in the q-Gaussian distribution of the form

G_{q,\beta}(x) = \frac{1}{\beta_q K_q} \left( 1 - \frac{(1-q)}{(3-q)\beta_q^2}\, (x - \mu_q)^2 \right)_+^{\frac{1}{1-q}} \quad \text{for all } x \in \mathbb{R},    (2.9)

where y_+ = max(y, 0) is called the Tsallis cut-off condition [45], which ensures that the above expression is well-defined, and K_q is the normalizing constant, given by

K_q = \begin{cases} \sqrt{\pi}\, \sqrt{\dfrac{3-q}{1-q}}\, \dfrac{\Gamma\!\left(\frac{2-q}{1-q}\right)}{\Gamma\!\left(\frac{5-3q}{2(1-q)}\right)} & \text{for } -\infty < q < 1, \\[2mm] \sqrt{\pi}\, \sqrt{\dfrac{3-q}{q-1}}\, \dfrac{\Gamma\!\left(\frac{3-q}{2(q-1)}\right)}{\Gamma\!\left(\frac{1}{q-1}\right)} & \text{for } 1 < q < 3, \end{cases}

with Γ being the Gamma function¹, which exists because its arguments are positive over the specified intervals.

The function defined in (2.9) is not integrable for q ≥ 3, and hence the q-Gaussian is a probability density function only when q < 3. Further, it has been shown by Prato and Tsallis [32] that the variance of the above distribution is finite only for q < 5/3, and is given by β = √((3−q)/(5−3q)) β_q. In this report, we refer to the expression which involves the variance, instead of the q-variance, as it is more compatible with the analysis in the stochastic scenario.

A multivariate form of the q-Gaussian distribution has been proposed in [47]. Vignat and Plastino [48] provided an explicit form of this distribution, which will be used in this report. Considering the usual covariance matrix of the N-variate distribution to be of the form E[XX^T] = β^2 I_{N×N}, it is defined as

G_{q,\beta}(X) = \frac{1}{\beta^N K_{q,N}} \left( 1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)} \frac{\|X\|^2}{\beta^2} \right)_+^{\frac{1}{1-q}} \quad \text{for all } X \in \mathbb{R}^N,    (2.10)

where K_{q,N} is the normalizing constant given by

K_{q,N} = \begin{cases} \left(\dfrac{(N+4)-(N+2)q}{1-q}\right)^{\!N/2} \pi^{N/2}\, \dfrac{\Gamma\!\left(\frac{2-q}{1-q}\right)}{\Gamma\!\left(\frac{2-q}{1-q}+\frac{N}{2}\right)} & \text{for } q < 1, \\[3mm] \left(\dfrac{(N+4)-(N+2)q}{q-1}\right)^{\!N/2} \pi^{N/2}\, \dfrac{\Gamma\!\left(\frac{1}{q-1}-\frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{q-1}\right)} & \text{for } 1 < q < \left(1 + \frac{2}{N+2}\right). \end{cases}    (2.11)

The multivariate normal distribution can be obtained as a special case when q → 1. A similar distribution can also be obtained by maximizing Rényi entropy [12]. In this report, we study the multivariate q-Gaussian distribution, and develop smoothed functional algorithms based on it.

¹The Gamma function is defined as Γ(z) = \int_0^\infty t^{z-1} e^{-t}\, dt for z ∈ C, Re(z) > 0.
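As a small illustration (our own sketch, not part of the report), the density (2.10) and its normalizing constant (2.11) can be evaluated numerically as follows; the function names are ours.

```python
import numpy as np
from scipy.special import gamma

def k_qN(q, N):
    """Normalizing constant K_{q,N} of (2.11); valid for q < 1 + 2/(N+2), q != 1."""
    c = (N + 4) - (N + 2) * q
    if q < 1:
        return (c / (1 - q))**(N / 2) * np.pi**(N / 2) * \
               gamma((2 - q) / (1 - q)) / gamma((2 - q) / (1 - q) + N / 2)
    return (c / (q - 1))**(N / 2) * np.pi**(N / 2) * \
           gamma(1 / (q - 1) - N / 2) / gamma(1 / (q - 1))

def q_gaussian_pdf(x, q, beta=1.0):
    """Multivariate q-Gaussian density (2.10) with covariance beta^2 * I."""
    x = np.asarray(x, dtype=float)
    N = x.size
    c = (N + 4) - (N + 2) * q
    base = 1.0 - (1.0 - q) / c * np.dot(x, x) / beta**2
    return max(base, 0.0)**(1.0 / (1.0 - q)) / (beta**N * k_qN(q, N))

# Density at the origin for N = 2 and two admissible values of q (q < 1 + 2/(N+2) = 1.5)
print(q_gaussian_pdf([0.0, 0.0], q=0.5), q_gaussian_pdf([0.0, 0.0], q=1.3))
```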


Chapter 3

Some properties of multivariate q-Gaussian

Before going into further analysis of q-Gaussians as smoothing kernels, we look at the support set of the multivariate q-Gaussian distribution with covariance β^2 I_{N×N}. We denote the support set as

\Omega_q = \begin{cases} \left\{ x \in \mathbb{R}^N : \|x\|^2 < \dfrac{\big((N+4)-(N+2)q\big)\beta^2}{(1-q)} \right\} & \text{for } q < 1, \\[2mm] \mathbb{R}^N & \text{for } 1 < q < \left(1 + \frac{2}{N+2}\right). \end{cases}    (3.1)

A standard q-Gaussian distribution has mean zero and unit variance, so the support set can be expressed as above by substituting β = 1.

3.1 Generalized co-moments of joint q-Gaussian distribution

We first state the following result, which provides an expression for the moments of an N-variate q-Gaussian distributed random vector. This is a consequence of the results presented in [19]. The result is considered only for q < (1 + 2/(N+2)), as above this interval the variance of the q-Gaussian is not finite [32].

Proposition 3.1. Suppose X = (X^(1), X^(2), . . . , X^(N)) ∈ R^N is a random vector whose components are uncorrelated and identically distributed, each according to a q-Gaussian distribution with zero mean and unit variance, where the parameter q ∈ (−∞, 1) ∪ (1, 1 + 2/(N+2)). Also, let

\rho(X) = \left( 1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)} \|X\|^2 \right).

Then, for any b, b_1, b_2, . . . , b_N ∈ Z_+ ∪ {0}, we have

E_{G_q}\!\left[ \frac{\big(X^{(1)}\big)^{b_1} \big(X^{(2)}\big)^{b_2} \cdots \big(X^{(N)}\big)^{b_N}}{\big(\rho(X)\big)^{b}} \right] = \begin{cases} K \left( \dfrac{(N+4)-(N+2)q}{|1-q|} \right)^{\sum_{i=1}^{N} \frac{b_i}{2}} \displaystyle\prod_{i=1}^{N} \frac{b_i!}{2^{b_i}\left(\frac{b_i}{2}\right)!} & \text{if } b_i \text{ is even for all } i = 1, 2, \ldots, N, \\[2mm] 0 & \text{otherwise}, \end{cases}    (3.2)

where

K = \begin{cases} \dfrac{\Gamma\!\left(\frac{1}{1-q}-b+1\right)\, \Gamma\!\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{1-q}+1\right)\, \Gamma\!\left(\frac{1}{1-q}-b+1+\frac{N}{2}+\sum_{i=1}^{N}\frac{b_i}{2}\right)} & \text{if } q \in (-\infty, 1), \\[3mm] \dfrac{\Gamma\!\left(\frac{1}{q-1}\right)\, \Gamma\!\left(\frac{1}{q-1}+b-\frac{N}{2}-\sum_{i=1}^{N}\frac{b_i}{2}\right)}{\Gamma\!\left(\frac{1}{q-1}+b\right)\, \Gamma\!\left(\frac{1}{q-1}-\frac{N}{2}\right)} & \text{if } q \in \left(1, 1+\frac{2}{N+2}\right), \end{cases}    (3.3)

which exists only if the above Gamma functions exist. The Gamma functions exist under the condition b < (1 + \frac{1}{1-q}) if q < 1, and under \left(\frac{1}{q-1} + b - \frac{N}{2} - \sum_{i=1}^{N}\frac{b_i}{2}\right) > 0 for 1 < q < (1 + \frac{2}{N+2}).

Proof. Since ρ(X) is non-negative over Ω_q, we have

E_{G_q(X)}\!\left[ \frac{(X^{(1)})^{b_1} (X^{(2)})^{b_2} \cdots (X^{(N)})^{b_N}}{(\rho(X))^{b}} \right] = \frac{1}{K_{q,N}} \int_{\Omega_q} (x^{(1)})^{b_1} (x^{(2)})^{b_2} \cdots (x^{(N)})^{b_N} \left( 1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)} \|x\|^2 \right)^{\frac{1}{1-q}-b} dx.

The second equality in (3.2) can be easily proved. If b_i is odd for some i = 1, . . . , N, then the above integrand is odd, and its integral is zero over Ω_q, which is symmetric with respect to any axis by definition. For the other cases, since the integrand is even, the integral is the same over every orthant. Hence, we may consider the integration over the first orthant, i.e., where each component is positive. For q < 1, we can reduce the above integral, using [19, Eq. 4.635], to obtain

E_{G_q(X)}\!\left[ \frac{(X^{(1)})^{b_1} (X^{(2)})^{b_2} \cdots (X^{(N)})^{b_N}}{(\rho(X))^{b}} \right] = \frac{\prod_{i=1}^{N} \Gamma\!\left(\frac{b_i+1}{2}\right)}{K_{q,N}\, \Gamma(\bar{b})} \left( \frac{(N+4)-(N+2)q}{1-q} \right)^{\bar{b}} \int_0^1 (1-y)^{\left(\frac{1}{1-q}-b\right)} y^{(\bar{b}-1)}\, dy,

where we set \bar{b} = \left( \frac{N}{2} + \sum_{i=1}^{N} \frac{b_i}{2} \right). We can observe that the integral is in the form of a Beta function¹. Since the b_i are even, we can expand Γ((b_i+1)/2) using the expansion of the Gamma function at half-integers to get Γ((b_i+1)/2) = \frac{b_i!}{2^{b_i}(b_i/2)!}\sqrt{\pi}. The claim is obtained by substituting K_{q,N} from (2.11) and using the relation B(m, n) = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}. It is easy to verify that all the Gamma functions in the equality exist provided b < (1 + \frac{1}{1-q}). The result for the interval 1 < q < (1 + \frac{2}{N+2}) can be proved in a similar way (see Eq. 4.635 and Eq. 4.636 of [19]).

Corollary 3.2. In the limiting case as q → 1,

\lim_{q\to 1}\, E_{G_q(X)}\!\left[ \frac{(X^{(1)})^{b_1} (X^{(2)})^{b_2} \cdots (X^{(N)})^{b_N}}{(\rho(X))^{b}} \right] = \prod_{i=1}^{N} E_{G(X)}\!\left[ \big(X^{(i)}\big)^{b_i} \right].

Proof. The term ρ(X) → 1 as q → 1. Also, in the limiting case, we retrieve the normal distribution, for which uncorrelatedness implies independence. So, the above expression turns out to be a product of individual moments.

¹The Beta function is defined as B(m, n) = \int_0^1 t^{m-1}(1-t)^{n-1}\, dt for m, n ∈ C, Re(m), Re(n) > 0.
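To make the formula concrete, the following sketch (our own addition) compares the closed form (3.2)-(3.3) with direct numerical integration for one small example, N = 2, q = 0.5, b = 1, b_1 = 2, b_2 = 0, where both evaluate to 2.

```python
import numpy as np
from math import factorial
from scipy.integrate import dblquad
from scipy.special import gamma

N, q = 2, 0.5                   # dimension and entropic index (q < 1 here)
b, b1, b2 = 1, 2, 0             # exponents: E[ X1^2 X2^0 / rho(X)^1 ]
c = (N + 4) - (N + 2) * q
s = 1.0 / (1.0 - q)

# Closed form (3.2)-(3.3)
K = gamma(s - b + 1) * gamma(s + 1 + N / 2) / \
    (gamma(s + 1) * gamma(s - b + 1 + N / 2 + (b1 + b2) / 2))
prod = (factorial(b1) / (2**b1 * factorial(b1 // 2))) * (factorial(b2) / (2**b2 * factorial(b2 // 2)))
closed = K * (c / (1 - q))**((b1 + b2) / 2) * prod
print("closed form:", closed)                     # 2.0 for these choices

# Direct integration of the same expectation against the density (2.10)
K_qN = (c / (1 - q))**(N / 2) * np.pi**(N / 2) * gamma((2 - q) / (1 - q)) / gamma((2 - q) / (1 - q) + N / 2)
rho = lambda x1, x2: 1.0 - (1.0 - q) / c * (x1**2 + x2**2)
pdf = lambda x1, x2: max(rho(x1, x2), 0.0)**s / K_qN
R = np.sqrt(c / (1 - q))                          # support radius of Omega_q
val, _ = dblquad(lambda x2, x1: x1**b1 * x2**b2 * pdf(x1, x2) / rho(x1, x2)**b
                 if rho(x1, x2) > 0 else 0.0,
                 -R, R, lambda x1: -R, lambda x1: R)
print("numerical  :", val)                        # ~ 2.0
```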


3.2 q-Gaussian as a Smoothing Kernel

The first step in applying q-Gaussians for SF algorithms is to ensure that the q-Gaussian satisfies the Rubinstein conditions [35].

Proposition 3.3. The N-dimensional q-Gaussian distribution satisfies the kernel properties (P1)-(P5) for all q < (1 + 2/(N+2)) and q ≠ 1.

Proof. (P1) From (2.10), it is evident that G_{q,β}(x) = (1/β^N) G_q(x/β).

(P2) For 1 < q < (1 + 2/(N+2)), G_{q,β}(x) > 0 for all x ∈ R^N. Thus,

\nabla_x G_{q,\beta}(x) = - \frac{2x}{\big((N+4)-(N+2)q\big)\beta^2}\, \frac{G_{q,\beta}(x)}{\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)\beta^2}\|x\|^2\right)}.    (3.4)

For q < 1, (3.4) holds when x ∈ Ω_q. On the other hand, when x ∉ Ω_q, we have G_{q,β}(x) = 0 and hence ∇_x G_{q,β}(x) = 0. Thus, G_{q,β}(x) is differentiable for q > 1, and piecewise differentiable for q < 1.

(P3) G_{q,β}(x) is a distribution for q < (1 + 2/(N+2)) and hence the corresponding SF S_{q,β}(.), parameterized by both q and β, can be written as S_{q,β}[f(θ)] = E_{G_{q,β}(x)}[f(θ − x)].

(P4) G_{q,β} is a probability distribution satisfying lim_{β→0} G_{q,β}(0) = ∞. So, lim_{β→0} G_{q,β}(x) = δ(x).

(P5) This property holds due to convergence in the mean:

\lim_{\beta\to 0} S_{q,\beta}[f(\theta)] = \int_{-\infty}^{\infty} \lim_{\beta\to 0} G_{q,\beta}(x) f(\theta - x)\, dx = \int_{-\infty}^{\infty} \delta(x) f(\theta - x)\, dx = f(\theta).

Hence the claim.

From the above result, it follows that the q-Gaussian can be used as a kernel function, and hence, given a particular value q ∈ (−∞, 1) ∪ (1, 1 + 2/(N+2)) and some β > 0, the one-sided and two-sided SFs of any function f : R^N → R are respectively given by

S_{q,\beta}[f(\theta)] = \int_{\Omega_q} G_{q,\beta}(\theta - x) f(x)\, dx,    (3.5)

S'_{q,\beta}[f(\theta)] = \frac{1}{2} \int_{\Omega_q} G_{q,\beta}(\theta - x) f(x)\, dx + \frac{1}{2} \int_{\Omega_q} G_{q,\beta}(x - \theta) f(x)\, dx,    (3.6)

where the nature of the SFs is controlled by both q and β.

3.3 Optimization using q-Gaussian SF

We present an example to illustrate the smoothing properties of Gaussian and q-Gaussian kernels for different values of q, to motivate the proposed SF method using the q-Gaussian. The results are for the one-dimensional case, where we use the above mentioned kernels to find the SF of the function f(x) = x^2 − (1/4) e^{−x^2} cos(8πx). The corresponding one-sided SFs (3.5) are shown in Figure 3.1.

Figure 3.1: (Top) Unsmoothed function; (Middle row) SFs for Gaussian and q-Gaussian kernels using β = 0.05 for q = 0.5, 1 (Gaussian case) and 1.5, respectively; (Bottom row) SFs using β = 0.09 for the same values of q.

The figures illustrate that, for a particular value of β, different degrees of smoothness can be obtained by varying the value of q. It is observed that for a low value of β, the SFs obtained using q-Gaussians with q > 1 are smoother than the Gaussian SF, but as β increases, q-Gaussians with q < 1 become smoother.
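For readers who wish to reproduce the flavour of this experiment, the sketch below (our own addition) computes the one-sided SF (3.5) in the one-dimensional case by numerical convolution with Gaussian and unit-variance q-Gaussian kernels; the evaluation points and integration range are arbitrary choices, not taken from the report.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

f = lambda x: x**2 - 0.25 * np.exp(-x**2) * np.cos(8 * np.pi * x)

def q_gauss_1d(x, q, beta):
    """Unit-variance univariate q-Gaussian, i.e. (2.10) with N = 1."""
    if abs(q - 1.0) < 1e-12:
        return np.exp(-x**2 / (2 * beta**2)) / (beta * np.sqrt(2 * np.pi))
    c = 5.0 - 3.0 * q
    if q < 1:
        K = np.sqrt(np.pi * c / (1 - q)) * gamma((2 - q) / (1 - q)) / gamma((2 - q) / (1 - q) + 0.5)
    else:
        K = np.sqrt(np.pi * c / (q - 1)) * gamma(1 / (q - 1) - 0.5) / gamma(1 / (q - 1))
    base = 1.0 - (1.0 - q) / c * (x / beta)**2
    return np.where(base > 0, base, 0.0)**(1.0 / (1.0 - q)) / (beta * K)

def smoothed(theta, q, beta):
    """One-sided SF (3.5): numerical convolution of f with the kernel."""
    val, _ = quad(lambda u: q_gauss_1d(u, q, beta) * f(theta - u), -1.0, 1.0, limit=200)
    return val

for q in (0.5, 1.0, 1.5):
    print(q, [round(smoothed(t, q, beta=0.09), 4) for t in (-0.25, 0.0, 0.25)])
```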

We now illustrate the effect of function smoothing in optimization. It can be observed that the global minimum of the function is at x = 0, with several local extrema. We try to search for the global minimum of f(x) in the interval [−1, 1] using standard gradient-descent and Newton-based methods. In each iteration, we project the update onto [−1, 1] to satisfy the space constraint. We denote this by a projection function P_{[−1,1]}. We incorporate a 'suitable' decreasing step-size so that the Newton-based algorithm converges. The update rules are given in Table 3.1, along with the distance of the update from the global optimum after 100 iterations.

Algorithm           Update rule                                                   Distance from optimum
Gradient-descent    x_{n+1} = x_n − (1/n) ∇_x f(x_n)                              0.00007538
Newton's method     x_{n+1} = x_n − (1/n) (∇^2_x f(x_n))^{−1} ∇_x f(x_n)          0.96199169

Table 3.1: Optimization of the unsmoothed function.

The next two tables show the result of optimization when the one-sided smoothed version of f is used as the objective function. The cases in which the final distance from the optimum is relatively small (< 0.01) are highlighted. It can be observed that the gradient based approach gives better performance than Newton's method. Moreover, for higher values of β, the error decreases, which is expected as the function becomes smoother. However, the scenario changes in the stochastic case since, there, an additional error is present, which increases for higher values of β.

Gradient-descent:
q \ β      0.050        0.100        0.150
0.00       0.00000003   0.10686676   0.00000019
0.25       0.00000012   0.10177132   0.00000014
0.50       0.24282084   0.08734088   0.00000019
0.75       0.24294724   0.00000433   0.00000017
Gaussian   0.00000014   0.00000011   0.00332414
1.25       0.00067325   0.22314089   0.00000254
1.50       0.72345646   0.23916780   0.00888054

Newton's method:
q \ β      0.050        0.100        0.150
0.00       0.24342130   0.11581702   0.01461908
0.25       0.24127418   0.09847975   0.00193110
0.50       0.48939828   0.08670668   0.00016341
0.75       0.24615714   0.00164180   0.00063249
Gaussian   0.73622977   0.00206699   0.00018992
1.25       0.49121175   0.23889724   0.00013602
1.50       0.48663598   0.00239447   0.22118680

Table 3.2: Performance of Gradient-descent (first table) and Newton's method (second table).


Chapter 4

q-Gaussian based Gradient Descent Algorithms

4.1 Gradient Estimation with q-Gaussian SF

The objective is to estimate the gradient of the average cost, ∇θJ(θ), using the SF approach, where the existence of ∇θJ(θ) follows from Assumption II.

4.1.1 One-simulation q-Gaussian SF Gradient Estimate

Rubinstein [35] defined the gradient of the smoothed functional (the smoothed gradient) as

\nabla_\theta S_{q,\beta}[J(\theta)] = \int_{\Omega_q} \nabla_\theta G_{q,\beta}(\theta - \eta) J(\eta)\, d\eta;

recall that Ω_q is the support set defined in (3.1). As there is no functional relationship between θ and η over Ω_q, i.e., dη^(j)/dθ^(i) = 0 for all i, j,

\nabla^{(i)}_\theta G_{q,\beta}(\theta - \eta) = \frac{1}{\beta^N K_{q,N}}\, \frac{2\big(\eta^{(i)} - \theta^{(i)}\big)}{\beta^2\big((N+4)-(N+2)q\big)} \left( 1 - \frac{(1-q)\sum_{k=1}^{N}\big(\theta^{(k)} - \eta^{(k)}\big)^2}{\big((N+4)-(N+2)q\big)\beta^2} \right)^{\frac{q}{1-q}} = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)}\, \frac{\big(\eta^{(i)} - \theta^{(i)}\big)}{\rho\!\left(\frac{\theta-\eta}{\beta}\right)}\, G_{q,\beta}(\theta - \eta),    (4.1)


where ρ(η) = \left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta\|^2\right). Hence, substituting η' = (η − θ)/β, and using the symmetry of G_{q,β}(.) and ρ(.), we can write

\nabla_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta\big((N+4)-(N+2)q\big)} \int_{\Omega_q} \frac{\eta'}{\rho(\eta')} G_q(\eta') J(\theta + \beta\eta')\, d\eta' = \frac{2}{\beta\big((N+4)-(N+2)q\big)}\, E_{G_q(\eta')}\!\left[ \frac{\eta'}{\rho(\eta')} J(\theta + \beta\eta') \,\Big|\, \theta \right].    (4.2)

In the sequel (Proposition 4.10), we show that ‖∇θS_{q,β}[J(θ)] − ∇θJ(θ)‖ → 0 as β → 0. Hence, for large M and small β, the form of the gradient estimate suggested by (4.2) is

\nabla_\theta J(\theta) \approx \frac{2}{\beta\big((N+4)-(N+2)q\big) M} \sum_{n=0}^{M-1} \frac{\eta(n)\, J(\theta + \beta\eta(n))}{\rho(\eta(n))},    (4.3)

where η(0), η(1), . . . , η(M − 1) are uncorrelated identically distributed standard q-Gaussian random vectors. Considering that in two-timescale algorithms (discussed in the next section) the value of θ is updated concurrently with the gradient estimation procedure, we estimate ∇θJ(θ(n)) at each stage. Using an approximation of (2.1), we can write (4.3) as

\nabla_\theta J(\theta(n)) \approx \frac{2}{\beta M L \big((N+4)-(N+2)q\big)} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \frac{\eta(n)\, h(Y_m)}{\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta(n)\|^2\right)}    (4.4)

for large L, where the process {Y_m} has the same transition kernel as defined in Assumption I, except that it is governed by the parameter (θ(n) + βη(n)).
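As a concrete illustration (our own sketch), the per-sample term of (4.4) can be written as the following function, where eta is a standard N-variate q-Gaussian perturbation and cost is a single observation of the single-stage cost h(Y_m), both assumed given.

```python
import numpy as np

def gq_sf1_grad_term(eta, cost, q, beta):
    """Single-sample contribution to the one-simulation q-Gaussian SF
    gradient estimate (4.4): 2 * eta * h / (beta * ((N+4)-(N+2)q) * rho(eta))."""
    N = len(eta)
    c = (N + 4) - (N + 2) * q
    rho = 1.0 - (1.0 - q) / c * np.dot(eta, eta)
    return 2.0 * eta * cost / (beta * c * rho)

# Averaging such terms over M perturbations and L cost samples per perturbation
# gives the full estimate; the two-simulation version (4.7) replaces
# 2*h(Y_m) by (h(Y_m) - h(Y'_m)).
```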

4.1.2 Two-simulation q-Gaussian SF Gradient Estimate

In a similar manner, based on (2.3), the gradient of the two-sided SF can be written as

\nabla_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{2} \int_{\Omega_q} \nabla_\theta G_{q,\beta}(\theta - \eta) J(\eta)\, d\eta + \frac{1}{2} \int_{\Omega_q} \nabla_\theta G_{q,\beta}(\eta - \theta) J(\eta)\, d\eta.    (4.5)

The first integral can be obtained as in (4.2). The second integral is evaluated as

\int_{\Omega_q} \nabla_\theta G_{q,\beta}(\eta - \theta) J(\eta)\, d\eta = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)} \int_{\Omega_q} \frac{(\eta - \theta)}{\rho\!\left(\frac{\eta-\theta}{\beta}\right)} G_{q,\beta}(\eta - \theta) J(\eta)\, d\eta = -\frac{2}{\beta\big((N+4)-(N+2)q\big)} \int_{\Omega_q} \frac{\eta'}{\rho(\eta')} G_q(\eta') J(\theta - \beta\eta')\, d\eta',

where η' = (θ − η)/β. Thus, we obtain the gradient as a conditional expectation:

\nabla_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{\beta\big((N+4)-(N+2)q\big)}\, E_{G_q(\eta)}\!\left[ \frac{\eta}{\rho(\eta)} \big( J(\theta + \beta\eta) - J(\theta - \beta\eta) \big) \,\Big|\, \theta \right].    (4.6)

In the sequel (Proposition 4.12), we show that ‖∇θS'_{q,β}[J(θ)] − ∇θJ(θ)‖ → 0 as β → 0, which can be used to approximate (4.6) as

\nabla_\theta J(\theta(n)) \approx \frac{1}{\beta M L \big((N+4)-(N+2)q\big)} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \frac{\eta(n)\big( h(Y_m) - h(Y'_m) \big)}{\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta(n)\|^2\right)}    (4.7)

for large M, L and small β, where {Y_m} and {Y'_m} are governed by the parameters (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively.

4.2 Proposed Gradient Descent Algorithms

In this section, we propose two-timescale algorithms corresponding to the estimates obtained in (4.4) and (4.7). Let (a(n)) and (b(n)) be two step-size sequences satisfying the following assumption.

Assumption IV. (a(n))_{n≥0} and (b(n))_{n≥0} are positive step-size sequences which satisfy

\sum_{n=0}^{\infty} a(n)^2 < \infty, \quad \sum_{n=0}^{\infty} b(n)^2 < \infty, \quad \sum_{n=0}^{\infty} a(n) = \sum_{n=0}^{\infty} b(n) = \infty, \quad \text{and} \quad a(n) = o(b(n)), \text{ i.e., } \frac{a(n)}{b(n)} \to 0 \text{ as } n \to \infty.

For θ = (θ^(1), . . . , θ^(N))^T ∈ R^N, let P_C(θ) = (P_C(θ^(1)), . . . , P_C(θ^(N)))^T represent the projection of θ onto the set C. (Z^(i)(n), i = 1, . . . , N)_{n≥0} are quantities used to estimate ∇θJ(θ) via the recursions below.

The algorithms require the generation of N-dimensional random vectors consisting of uncorrelated q-Gaussian distributed random variates. However, there is no standard algorithm in the literature for this multivariate case. This issue is addressed later in Section 6.2.
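Purely for illustration, one way to generate such vectors is through the classical spherical representations of the multivariate q-Gaussian family: a Gaussian scale mixture (Student-t form) for 1 < q, and a Beta-distributed squared radius on a ball for q < 1. The sketch below rests on those representations and is our own assumption; it is not necessarily the generation scheme adopted in Section 6.2.

```python
import numpy as np

def sample_q_gaussian(N, q, size=1, rng=None):
    """Draw `size` samples from the standard N-variate q-Gaussian (2.10) with
    beta = 1, for q < 1 + 2/(N+2), q != 1.  Sketch based on spherical
    representations of the family; treat as an assumption, not the report's method."""
    rng = np.random.default_rng(rng)
    c = (N + 4) - (N + 2) * q
    if q > 1:
        # Student-t representation: nu degrees of freedom, scale chosen so E[XX^T] = I
        nu = 2.0 / (q - 1.0) - N                  # > 2 whenever q < 1 + 2/(N+2)
        sigma2 = c / ((q - 1.0) * nu)
        z = rng.standard_normal((size, N))
        w = rng.chisquare(nu, size=size)          # radial mixing variable
        return np.sqrt(sigma2) * z * np.sqrt(nu / w)[:, None]
    # q < 1: bounded support; the scaled squared radius is Beta(N/2, 1/(1-q) + 1)
    r_max = np.sqrt(c / (1.0 - q))
    u = rng.standard_normal((size, N))
    u /= np.linalg.norm(u, axis=1, keepdims=True) # uniform direction on the sphere
    r = r_max * np.sqrt(rng.beta(N / 2.0, 1.0 / (1.0 - q) + 1.0, size=size))
    return r[:, None] * u

x = sample_q_gaussian(N=4, q=1.2, size=200_000, rng=0)
print(np.cov(x, rowvar=False).round(2))           # should be close to the identity
```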

Algorithm 1 : The Gq-SF1 Algorithm

1: Fix M, L, q and β.
2: Set Z^(i)(0) = 0, i = 1, . . . , N.
3: Fix the parameter vector θ(0) = (θ^(1)(0), θ^(2)(0), . . . , θ^(N)(0))^T.
4: for n = 0 to M − 1 do
5:   Generate a random vector η(n) = (η^(1)(n), η^(2)(n), . . . , η^(N)(n))^T from a standard N-dimensional q-Gaussian distribution.
6:   for m = 0 to L − 1 do
7:     Generate the simulation Y_{nL+m} governed with parameter (θ(n) + βη(n)).
8:     for i = 1 to N do
9:       Z^(i)(nL + m + 1) = (1 − b(n)) Z^(i)(nL + m) + b(n) \left[ \dfrac{2\,\eta^{(i)}(n)\, h(Y_{nL+m})}{\beta\big((N+4)-(N+2)q\big)\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta(n)\|^2\right)} \right].
10:      end for
11:   end for
12:   for i = 1 to N do
13:     θ^(i)(n + 1) = P_C\big(θ^(i)(n) − a(n) Z^(i)(nL)\big).
14:   end for
15:   Set θ(n + 1) = (θ^(1)(n + 1), θ^(2)(n + 1), . . . , θ^(N)(n + 1))^T.
16: end for
17: Output θ(M) = (θ^(1)(M), . . . , θ^(N)(M))^T as the final parameter vector.

The Gq-SF2 algorithm is similar to the Gq-SF1 algorithm, except that we use two parallel simulations Y_{nL+m} and Y'_{nL+m}, governed with parameters (θ(n) + βη(n)) and (θ(n) − βη(n)) respectively, and update the gradient estimate in Step 9 using the single-stage cost function of both simulations, as in (4.7).


Algorithm 2 : The Gq-SF2 Algorithm

1: Fix M, L, q and β.
2: Set Z^(i)(0) = 0, i = 1, . . . , N.
3: Fix the parameter vector θ(0) = (θ^(1)(0), θ^(2)(0), . . . , θ^(N)(0))^T.
4: for n = 0 to M − 1 do
5:   Generate a random vector η(n) = (η^(1)(n), η^(2)(n), . . . , η^(N)(n))^T from a standard N-dimensional q-Gaussian distribution.
6:   for m = 0 to L − 1 do
7:     Generate two simulations Y_{nL+m} and Y'_{nL+m} governed with control parameters (θ(n) + βη(n)) and (θ(n) − βη(n)) respectively.
8:     for i = 1 to N do
9:       Z^(i)(nL + m + 1) = (1 − b(n)) Z^(i)(nL + m) + b(n) \left[ \dfrac{\eta^{(i)}(n)\big(h(Y_{nL+m}) - h(Y'_{nL+m})\big)}{\beta\big((N+4)-(N+2)q\big)\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta(n)\|^2\right)} \right].
10:      end for
11:   end for
12:   for i = 1 to N do
13:     θ^(i)(n + 1) = P_C\big(θ^(i)(n) − a(n) Z^(i)(nL)\big).
14:   end for
15:   Set θ(n + 1) = (θ^(1)(n + 1), θ^(2)(n + 1), . . . , θ^(N)(n + 1))^T.
16: end for
17: Output θ(M) = (θ^(1)(M), . . . , θ^(N)(M))^T as the final parameter vector.
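The two algorithms can be condensed into the following simulation-style sketch (our own illustration; the cost oracle, step-size choices and projection set are hypothetical, and sample_q_gaussian refers to the sampling sketch given before Algorithm 1):

```python
import numpy as np

def gq_sf(cost, theta0, q, beta, M=500, L=50, two_sim=False, C=(-5.0, 5.0), rng=None):
    """Two-timescale q-Gaussian SF gradient descent: Gq-SF1 if two_sim=False,
    Gq-SF2 if two_sim=True.  `cost(theta)` returns one noisy sample of h."""
    rng = np.random.default_rng(rng)
    theta = np.array(theta0, dtype=float)
    N = theta.size
    c = (N + 4) - (N + 2) * q
    Z = np.zeros(N)
    for n in range(M):
        a_n, b_n = 1.0 / (n + 1), 1.0 / (n + 1)**0.55           # a(n) = o(b(n))
        eta = sample_q_gaussian(N, q, size=1, rng=rng)[0]       # standard q-Gaussian vector
        rho = 1.0 - (1.0 - q) / c * np.dot(eta, eta)
        Z_start = Z.copy()                                      # Z(nL), used in Step 13
        for _ in range(L):                                      # faster timescale: Step 9
            if two_sim:
                g = eta * (cost(theta + beta * eta) - cost(theta - beta * eta)) / (beta * c * rho)
            else:
                g = 2.0 * eta * cost(theta + beta * eta) / (beta * c * rho)
            Z = (1.0 - b_n) * Z + b_n * g
        theta = np.clip(theta - a_n * Z_start, *C)              # slower timescale: Step 13 with P_C
    return theta

# Toy usage: noisy quadratic cost with minimum at (1, -2); parameter values are illustrative.
noisy = lambda th: np.sum((th - np.array([1.0, -2.0]))**2) + 0.1 * np.random.randn()
print(gq_sf(noisy, [0.0, 0.0], q=1.2, beta=0.1, two_sim=True, rng=1))   # should end up near [1, -2]
```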

4.3 Convergence of Gradient SF Algorithms

We now look into the convergence of the algorithms proposed in Section 4.2. The analysis presented in the shorter version of this report [18] was along the lines of [5]. In this report, we deviate from that approach to provide a more straightforward technique to prove the convergence of the algorithms to a local optimum.

4.3.1 Convergence of Gq-SF1 Algorithm

First, let us consider the update along the faster timescale, i.e., Step 9 of the Gq-SF1 algorithm. Defining b̄(p) = b(n) and ā(p) = a(n) for nL ≤ p < (n + 1)L, it follows from Assumption IV that ā(p) = o(b̄(p)), Σ_p b̄(p) = ∞ and Σ_p b̄(p)^2 < ∞. We can rewrite Step 9 as the following iteration, which runs for L steps:

Z(p + 1) = Z(p) + \bar{b}(p)\big[ g(Y_p) - Z(p) \big],    (4.8)

where, for all nL ≤ p < (n + 1)L, g(Y_p) = \frac{2\eta(n) h(Y_p)}{\beta((N+4)-(N+2)q)\,\rho(\eta(n))}. Here, ρ(.) is defined as in Proposition 3.1, and {Y_p : p ∈ N} is a Markov process with stationary distribution ν_{(θ(n)+βη(n))}, as the parameter updates θ(n) and η(n) are held fixed. We state the following two results from [11], adapted to our scenario. These results lead to the stability and convergence of iteration (4.8).

Lemma 4.1. Consider the iteration x_{p+1} = x_p + γ(p) f(x_p, Y_p). Let the following conditions hold:

1. {Y_p : p ∈ N} is a Markov process satisfying Assumptions I and III,

2. for each x ∈ R^N with x_p ≡ x for all p ∈ N, {Y_p} has a unique invariant probability measure ν_x,

3. f(., .) is Lipschitz continuous in its first argument, uniformly w.r.t. the second, and

4. (γ(p))_{p≥0} are step-sizes satisfying \sum_{p=0}^{\infty} γ(p) = ∞ and \sum_{p=0}^{\infty} γ^2(p) < ∞.

Then the update x_p converges to the stable fixed points of the ordinary differential equation (ODE)

\dot{x}(t) = \bar{f}\big(x(t), \nu_{x(t)}\big),

where \bar{f}(x, \nu_x) = E_{\nu_x}[f(x, Y)].

Lemma 4.2. Suppose the limit f_\infty\big(x(t)\big) = \lim_{a\uparrow\infty} \frac{\bar{f}\big(a x(t), \nu_{a x(t)}\big)}{a} exists uniformly on compacts, and furthermore, the ODE \dot{x}(t) = f_\infty\big(x(t)\big) is well-posed and has the origin as its unique globally asymptotically stable equilibrium. Then sup_p ‖x_p‖ < ∞ almost surely, where (x_p)_{p∈N} is obtained as per the recursion in Lemma 4.1.

It can be observed that iteration (4.8) is a special case of Lemma 4.1, where the invariant measure ν_{(θ(n)+βη(n))} is independent of the updates (Z(p))_{nL<p<(n+1)L}. So we consider the following ODEs:

\dot{\theta}(t) = 0,    (4.9)

\dot{Z}(t) = \frac{2\eta(t)\, J\big(\theta(t) + \beta\eta(t)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(t))} - Z(t).    (4.10)


In view of (4.9) and the fact that these iterations are on the faster timescale, we let θ(t) ≡ θ and η(t) ≡ η (constants) in (4.10). Hence, we consider

\dot{Z}(t) = \frac{2\eta\, J\big(\theta + \beta\eta\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - Z(t).    (4.11)

Lemma 4.3. The sequence of updates (Z(p)) is uniformly bounded with probability 1.

Proof. It can be easily verified that iteration (4.8) satisfies the first four conditions of Lemma 4.1, the fifth not being applicable. Thus, by Lemma 4.1, (Z(p)) converges to the ODE (4.11), as

E_{\nu_{(\theta+\beta\eta)}}\!\left[ \frac{2\eta\, h(Y_p)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - Z(t) \right] = \frac{2\eta\, J(\theta+\beta\eta)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - Z(t).

We can also see that

\lim_{a\uparrow\infty} \frac{1}{a}\left( \frac{2\eta\, J(\theta+\beta\eta)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - a Z(t) \right) = -Z(t).

Hence, Lemma 4.2 can be used to arrive at the claim.

We now define time steps {t_n} such that t_0 = 0 and t_n = \sum_{i=0}^{n-1} b(i) for n ≥ 1. We also define an interpolated trajectory \bar{Z}(t) such that \bar{Z}(t_n) = Z(nL) for all n, and over the interval [t_n, t_{n+1}] it is the linear interpolation between Z(nL) and Z(nL + L). Based on the definition given below, we can consider the interpolated trajectory \bar{Z}(t) as a perturbation of the trajectory Z(t) in (4.11).

Definition 4.4 ((τ-µ) perturbation). For specified constants τ > 0, µ > 0 and a given ODE

\dot{x}(t) = f\big(x(t)\big),    (4.12)

a (τ-µ) perturbed trajectory of (4.12) is a map y : [0, ∞) → R^N that satisfies the following condition: there exists an increasing sequence (τ_k)_{k≥0} ⊂ [0, ∞) with τ_{k+1} − τ_k ≥ τ for all k, such that on each interval [τ_k, τ_{k+1}] there exists a solution x^k(t) of (4.12) with |x^k(t) − y(t)| < µ for all t ∈ [τ_k, τ_{k+1}].


We state the following lemma due to [21], which is based on the above definition.

Lemma 4.5. Let K = {x : f(x) = 0} be the asymptotically stable attracting set of (4.12). Then, given τ, ε > 0, there exists µ_0 > 0 such that for all µ ∈ [0, µ_0], any (τ-µ) perturbation of (4.12) converges to the ε-neighbourhood of K, defined as K^ε = {x : ‖x − x_0‖ < ε, x_0 ∈ K}.

Lemma 4.6. Given τ, µ > 0, \big(\theta(t_n + \cdot), \bar{Z}(t_n + \cdot)\big) is eventually a bounded (τ-µ) perturbation of (4.9)-(4.11).

Proof. Since a(n) = o(b(n)) and the parameter is fixed at θ(n) over [t_n, t_{n+1}], we can write the parameter update (Step 13 of Gq-SF1) as

\theta^{(i)}(n + 1) = P_C\big(\theta^{(i)}(n) - b(n)\zeta(n)\big),

where ζ(n) = o(1). Thus, the parameter update recursion appears quasi-static when viewed from the timescale of (b(n)). Now, we define (τ_n) such that τ_0 = 0 and, for n ≥ 1, τ_n = min{t_k : t_k ≥ τ_{n−1} + τ}. We also define the functions θ^n(t), Z^n(t), t ∈ [τ_n, τ_{n+1}], which are the solutions of the ODEs

\dot{\theta}^n(t) = 0,

\dot{Z}^n(t) = \frac{2\eta(n)\, J\big(\theta(n) + \beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} - Z^n(t)

over the interval (τ_n, τ_{n+1}), with θ^n(τ_n) = θ(n) and Z^n(τ_n) = \bar{Z}(\tau_n), respectively, i.e., the values given by the algorithm. Using arguments based on Gronwall's lemma¹ [8], it can now be shown that

\lim_{n\to\infty}\, \sup_{t\in[\tau_n, \tau_{n+1}]} \|Z^n(t) - \bar{Z}(t)\| = 0 \quad \text{with probability 1}.    (4.13)

The claim follows as a consequence of (4.13).

¹Gronwall's lemma states that for continuous functions u(.), v(.) ≥ 0 and scalars C, K, T > 0,

u(t) \leq C + K \int_0^t u(s) v(s)\, ds \quad \text{for all } t \in [0, T]

implies

u(t) \leq C \exp\!\left( K \int_0^t v(s)\, ds \right) \quad \text{for all } t \in [0, T].


Corollary 4.7. \left\| Z(nL) - \left( \frac{2\eta(n)\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} \right) \right\| \to 0 almost surely as n → ∞.

Proof. The claim follows by applying Lemma 4.5 to (4.11) for every µ > 0.

Thus, we can claim that the update Z(nL) eventually tracks the quantity \left(\frac{2\eta(n) J(\theta(n)+\beta\eta(n))}{\beta((N+4)-(N+2)q)\,\rho(\eta(n))}\right). So, Steps 13 and 15 of the Gq-SF1 algorithm can be written as

\theta(n+1) = P_C\!\left( \theta(n) - a(n)\left[ \frac{2\eta(n)\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} \right] \right) = P_C\Big( \theta(n) + a(n)\big[ -\nabla_{\theta(n)} J\big(\theta(n)\big) + \Delta\big(\theta(n)\big) + \xi_n \big] \Big),    (4.14)

where the error in the gradient estimate is given by

\Delta\big(\theta(n)\big) = \nabla_{\theta(n)} J\big(\theta(n)\big) - \nabla_{\theta(n)} S_{q,\beta}\big[J\big(\theta(n)\big)\big]    (4.15)

and the noise term is

\xi_n = \nabla_{\theta(n)} S_{q,\beta}\big[J\big(\theta(n)\big)\big] - \frac{2\eta(n)\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} = \frac{2}{\beta\big((N+4)-(N+2)q\big)} \left( E_{G_q(\eta)}\!\left[ \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n)+\beta\eta(n)\big) \,\Big|\, \theta(n) \right] - \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n)+\beta\eta(n)\big) \right),    (4.16)

which is a martingale difference term. Let F_n = σ(θ(0), . . . , θ(n), η(0), . . . , η(n − 1)) denote the σ-field generated by the mentioned quantities. We can observe that (F_n)_{n≥0} is a filtration, where ξ_0, . . . , ξ_{n−1} are F_n-measurable for each n > 0.

We state the following result from [28], adapted to our scenario, which leads to the convergence of the update in (4.14).

Lemma 4.8. Consider the iteration x_{n+1} = P_C\big(x_n + γ(n)(f(x_n) + ξ_n)\big), where

1. P_C is the projection onto a constraint set C, which is closed and bounded,

2. f(.) is a continuous function,

3. (γ(n))_{n≥0} is a positive sequence satisfying γ(n) ↓ 0, \sum_{n=0}^{\infty} γ(n) = ∞, and

4. \sum_{n=0}^{m} γ(n)ξ_n converges almost surely as m → ∞.

Under the above conditions, the update (x_n) converges to the ODE

\dot{x}(t) = \bar{P}_C\big(f(x(t))\big),    (4.17)

where \bar{P}_C\big(f(x)\big) = \lim_{\varepsilon \downarrow 0} \left( \frac{P_C\big(x + \varepsilon f(x)\big) - x}{\varepsilon} \right). Further, if f(.) = -\nabla_x g(.), where g is a continuously differentiable function, then the invariant set of (4.17) is given by

K = \Big\{ x \in C \,\Big|\, \bar{P}_C\big(f(x)\big) = 0 \Big\} = \Big\{ x \in C \,\Big|\, \nabla_x g(x)^T\, \bar{P}_C\big(-\nabla_x g(x)\big) = 0 \Big\}.

The following result shows that the noise term ξ_n satisfies the last condition in Lemma 4.8.

Lemma 4.9. Let M_n = \sum_{k=0}^{n-1} a(k)\xi_k. Then, for all values of q ∈ (0, 1) ∪ (1, 1 + 2/(N+2)), (M_n, F_n)_{n∈N} is an almost surely convergent martingale sequence.

Proof. We can easily observe that, for all k ≥ 0,

E[\xi_k \mid F_k] = \frac{2}{\beta\big((N+4)-(N+2)q\big)} \left( E\!\left[ \frac{\eta(k)}{\rho(\eta(k))} J\big(\theta(k)+\beta\eta(k)\big) \,\Big|\, \theta(k) \right] - E\!\left[ \frac{\eta(k)}{\rho(\eta(k))} J\big(\theta(k)+\beta\eta(k)\big) \,\Big|\, F_k \right] \right).

So E[ξ_k | F_k] = 0, since θ(k) is F_k-measurable, whereas η(k) is independent of F_k. It follows that (ξ_n, F_n)_{n∈N} is a martingale difference sequence, and hence (M_n, F_n)_{n∈N} is a martingale sequence.

By the Lipschitz continuity of h, there exists α_1 > 0 such that for all p, |h(Y_p)| ≤ α_1(1 + ‖Y_p‖), and hence, by Assumption III, we can claim

E\big[h(Y_p)^2\big] \leq 2\alpha_1^2\big(1 + E[\|Y_p\|^2]\big) < \infty \quad \text{a.s.}

Thus, applying Jensen's inequality, we have J(θ(k) + βη)^2 ≤ E[h(Y_p)^2] < ∞ for all η, which implies sup_η J(θ + βη)^2 < ∞ for all θ ∈ C. Now,

E\big[\|\xi_k\|^2 \mid F_k\big] = \sum_{j=1}^{N} E\Big[\big(\xi_k^{(j)}\big)^2 \,\Big|\, F_k\Big] \leq \frac{8}{\beta^2\big((N+4)-(N+2)q\big)^2} \sum_{j=1}^{N} E\!\left[ \left( E\!\left[ \frac{\eta(k)^{(j)}}{\rho(\eta(k))} J\big(\theta(k)+\beta\eta(k)\big) \,\Big|\, \theta(k) \right] \right)^2 + \left( \frac{\eta(k)^{(j)}}{\rho(\eta(k))} J\big(\theta(k)+\beta\eta(k)\big) \right)^2 \,\Bigg|\, F_k \right].

By Jensen's inequality, we have

E\big[\|\xi_k\|^2 \mid F_k\big] \leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2} \sum_{j=1}^{N} E\!\left[ \frac{\big(\eta(k)^{(j)}\big)^2}{\rho(\eta(k))^2} J\big(\theta(k)+\beta\eta(k)\big)^2 \,\Bigg|\, \theta(k) \right] \leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2}\, \sup_{\eta} \Big( J\big(\theta(k)+\beta\eta\big)^2 \Big) \sum_{j=1}^{N} E\!\left[ \frac{\big(\eta^{(j)}\big)^2}{\rho(\eta)^2} \right].

We apply Proposition 3.1 to study the nature of E\big[(\eta^{(j)})^2/\rho(\eta)^2\big]. We can observe that in this case b = 2, and b_i = 2 if i = j and b_i = 0 otherwise. So \sum_{i=1}^{N} \frac{b_i}{2} = 1 and \prod_{i=1}^{N} \frac{b_i!}{2^{b_i}(b_i/2)!} = \frac{1}{2}. For q < 1, we have the condition b < (1 + \frac{1}{1-q}), which can be satisfied only if q > 0. Using the relation Γ(n + 1) = nΓ(n), we can write

E\!\left[ \frac{\big(\eta^{(j)}\big)^2}{\rho(\eta)^2} \right] = \left( \frac{(N+4)-(N+2)q}{1-q} \right) \frac{1}{2}\, \frac{\Gamma\!\left(\frac{1}{1-q}-1\right)\Gamma\!\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{1-q}+1\right)\Gamma\!\left(\frac{1}{1-q}+\frac{N}{2}\right)} = \frac{\big((N+4)-(N+2)q\big)\big((N+2)-Nq\big)}{4q}.    (4.18)

We can verify that for 1 < q < (1 + \frac{2}{N+2}), the condition mentioned in Proposition 3.1 is always satisfied. Hence,

E\!\left[ \frac{\big(\eta^{(j)}\big)^2}{\rho(\eta)^2} \right] = \frac{\big((N+4)-(N+2)q\big)\big((N+2)-Nq\big)}{4q}.    (4.19)

Thus, it can be seen that E\big[(\eta^{(j)})^2/\rho(\eta)^2\big] < ∞ for all q ∈ (0, 1) ∪ (1, 1 + \frac{2}{N+2}), j = 1, 2, . . . , N, and so E[‖ξ_k‖^2 | F_{k−1}] < ∞ for all k. Hence, since \sum_n a(n)^2 < \infty,

\sum_{n=0}^{\infty} E\big[\|M_{n+1} - M_n\|^2\big] = \sum_{n=0}^{\infty} a(n)^2 E\big[\|\xi_n\|^2\big] < \infty.

The claim follows from the martingale convergence theorem [49].

Although the previous result shows that the noise is bounded, (4.18) shows that the noise term can become quite large for very low values of $q$ (close to 0). Now, we deal with the error term $\Delta\big(\theta(n)\big)$ in (4.14).

Proposition 4.10. For a given $q < \big(1+\frac{2}{N+2}\big)$, $q \neq 1$, and for all $\theta \in C$, the error term satisfies $\big\|\nabla_\theta S_{q,\beta}[J(\theta)] - \nabla_\theta J(\theta)\big\| = o(\beta)$.

Proof. For small $\beta > 0$, using the Taylor series expansion of $J(\theta+\beta\eta)$ around $\theta \in C$,

$$J(\theta+\beta\eta) = J(\theta) + \beta\eta^T\nabla_\theta J(\theta) + \frac{\beta^2}{2}\eta^T\nabla^2_\theta J(\theta)\,\eta + o(\beta^2).$$

So we can write (4.2) as

$$\nabla_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\big((N+4)-(N+2)q\big)}\left(\frac{J(\theta)}{\beta}\,\mathbb{E}_{G_q(\eta)}\left[\frac{\eta}{\rho(\eta)}\right] + \mathbb{E}_{G_q(\eta)}\left[\frac{\eta\eta^T}{\rho(\eta)}\right]\nabla_\theta J(\theta) + \frac{\beta}{2}\,\mathbb{E}_{G_q(\eta)}\left[\frac{\eta\,\eta^T\nabla^2_\theta J(\theta)\,\eta}{\rho(\eta)}\,\middle|\,\theta\right] + o(\beta)\right). \qquad (4.20)$$

We consider each term in (4.20). The $i$th component of the first term is $\mathbb{E}_{G_q(\eta)}\big[\frac{\eta^{(i)}}{\rho(\eta)}\big] = 0$ by Proposition 3.1 for all $i = 1, \ldots, N$. Similarly, the $i$th component of the third term can be written as

$$\frac{\beta}{2}\,\mathbb{E}_{G_q(\eta)}\left[\frac{\eta\,\eta^T\nabla^2_\theta J(\theta)\,\eta}{\rho(\eta)}\right]^{(i)} = \frac{\beta}{2}\sum_{j=1}^{N}\sum_{k=1}^{N}\big[\nabla^2_\theta J(\theta)\big]_{j,k}\,\mathbb{E}_{G_q(\eta)}\left[\frac{\eta^{(i)}\eta^{(j)}\eta^{(k)}}{\rho(\eta)}\right].$$

It can be observed that in all cases, each term in the summation is an odd function, and


so from Proposition 3.1, we can show that the third term in (4.20) is zero. Using a similar argument, we claim that the off-diagonal terms in $\mathbb{E}_{G_q(\eta)}\big[\frac{\eta\eta^T}{\rho(\eta)}\big]$ are zero, while the diagonal terms are of the form $\mathbb{E}_{G_q(\eta)}\big[\frac{(\eta^{(i)})^2}{\rho(\eta)}\big]$, which exists for all $q \in (-\infty, 1)\cup\big(1, 1+\frac{2}{N+2}\big)$ as the conditions in Proposition 3.1 are always satisfied on this interval. Further, we can compute that for all $q \in \big(-\infty, 1+\frac{2}{N+2}\big)$, $q \neq 1$,

$$\mathbb{E}_{G_q(\eta)}\left[\frac{(\eta^{(i)})^2}{\rho(\eta)}\right] = \frac{(N+4)-(N+2)q}{2}. \qquad (4.21)$$

The claim follows by substituting the above expression in (4.20).

Now, we consider the following ODE for the slowest timescale recursion:

$$\dot{\theta}(t) = \bar{P}_C\big(-\nabla_\theta J(\theta(t))\big), \qquad (4.22)$$

where $\bar{P}_C\big(f(x)\big) = \lim_{\varepsilon\downarrow 0}\left(\dfrac{P_C\big(x+\varepsilon f(x)\big)-x}{\varepsilon}\right)$. In accordance with Lemma 4.8, it can be observed that the stable points of (4.22) lie in the set

$$K = \Big\{\theta \in C \,\Big|\, \bar{P}_C\big(-\nabla_\theta J(\theta)\big) = 0\Big\} = \Big\{\theta \in C \,\Big|\, \nabla_\theta J(\theta)^T\,\bar{P}_C\big(-\nabla_\theta J(\theta)\big) = 0\Big\}. \qquad (4.23)$$

We have the following key result which shows that iteration (4.14) tracks ODE (4.22),

and hence, the convergence of our algorithm is proved.

Theorem 4.11. Under Assumptions I – IV, given $\varepsilon > 0$ and $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, there exists $\beta_0 > 0$ such that for all $\beta \in (0, \beta_0]$, the sequence $(\theta(n))$ obtained using Gq-SF1 converges to a point in the $\varepsilon$-neighbourhood of the stable attractor of (4.22), with probability 1 as $n \to \infty$.

Proof. It immediately follows from Lemmas 4.8 and 4.9 that the update in (4.14) converges to the ODE

$$\dot{\theta}(t) = \bar{P}_C\Big(-\nabla_\theta J(\theta(t)) + \Delta\big(\theta(t)\big)\Big). \qquad (4.24)$$

But, from Proposition 4.10, we have $\big\|\Delta\big(\theta(n)\big)\big\| = o(\beta)$ for all $n$. Then, for any given $\varepsilon, \tau > 0$, we can invoke Lemma 4.5 by considering the sequence $(\tau_k)_{k\geq 0}$ with $\tau_0 = 0$ and


$\tau_k = \sum_{p=1}^{n_k} a(p)$, with the indices $n_k = \min\big\{n : \sum_{p=1}^{n} a(p) > \tau_{k-1} + \tau\big\}$. This gives us a $\beta_0$ such that for all $\beta \in (0, \beta_0]$, any $(\tau\text{-}\mu)$ perturbation of (4.22) converges to the $\varepsilon$-neighbourhood of (4.23). Hence, the claim.

4.3.2 Convergence of Gq-SF2 Algorithm

Since the proof of convergence here is along the lines of Gq-SF1, we do not describe it explicitly; we only briefly describe the modifications required in this case. In the faster timescale, as $n \to \infty$, the updates given by $Z(nL)$ track the function

$$\left(\frac{\eta(n)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}\right)\Big(J\big(\theta(n)+\beta\eta(n)\big) - J\big(\theta(n)-\beta\eta(n)\big)\Big).$$
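For concreteness, a minimal Python sketch of this faster-timescale recursion is given below. The cost samples `h_plus` and `h_minus` (single-stage costs observed under the perturbed parameters) are hypothetical inputs assumed to come from the two simulations; this is an illustration of the update, not the thesis implementation.

```python
import numpy as np

def rho(eta, q):
    # rho(eta) = 1 - (1-q)/((N+4)-(N+2)q) * ||eta||^2
    N = eta.shape[0]
    return 1.0 - (1.0 - q) / ((N + 4) - (N + 2) * q) * np.dot(eta, eta)

def gq_sf2_gradient_step(Z, b_n, eta, h_plus, h_minus, q, beta):
    """One faster-timescale update of the two-simulation gradient estimate Z."""
    N = eta.shape[0]
    scale = beta * ((N + 4) - (N + 2) * q) * rho(eta, q)
    g = eta * (h_plus - h_minus) / scale   # quantity tracked by Z(nL)
    return (1.0 - b_n) * Z + b_n * g
```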

So we can rewrite the slower timescale update for the Gq-SF2 algorithm in a manner similar to (4.14), where the noise term is

$$\xi_n = \frac{1}{\beta\big((N+4)-(N+2)q\big)}\left(\mathbb{E}_{G_q(\eta(n))}\left[\frac{\eta(n)}{\rho(\eta(n))}\, J\big(\theta(n)+\beta\eta(n)\big)\,\middle|\,\theta(n)\right] - \frac{\eta(n)}{\rho(\eta(n))}\, J\big(\theta(n)+\beta\eta(n)\big)\right) - \frac{1}{\beta\big((N+4)-(N+2)q\big)}\left(\mathbb{E}_{G_q(\eta(n))}\left[\frac{\eta(n)}{\rho(\eta(n))}\, J\big(\theta(n)-\beta\eta(n)\big)\,\middle|\,\theta(n)\right] - \frac{\eta(n)}{\rho(\eta(n))}\, J\big(\theta(n)-\beta\eta(n)\big)\right),$$

which can be divided into two parts, each being bounded (as in Lemma 4.9). We discuss the error term in the following proposition.

Proposition 4.12. For a given $q < \big(1+\frac{2}{N+2}\big)$, $q \neq 1$, we have $\big\|\nabla_\theta S'_{q,\beta}[J(\theta)] - \nabla_\theta J(\theta)\big\| = o(\beta)$ for all $\theta \in C$.

Proof. Using Taylor's expansion, we have for small $\beta$

$$J(\theta+\beta\eta) - J(\theta-\beta\eta) = 2\beta\eta^T\nabla_\theta J(\theta) + o(\beta^2).$$


One can use arguments similar to those in Proposition 4.10 to rewrite (4.6) as

$$\nabla_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{\big((N+4)-(N+2)q\big)}\,\mathbb{E}_{G_q(\eta)}\left[\frac{2}{\rho(\eta)}\,\eta\eta^T\right]\nabla_\theta J(\theta) + o(\beta),$$

which leads to the claim.

This leads to a result similar to Theorem 4.11, which proves convergence of the Gq-SF2

algorithm.

Theorem 4.13. Under Assumptions I – IV, given $\varepsilon > 0$ and $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, there exists $\beta_0 > 0$ such that for all $\beta \in (0, \beta_0]$, the sequence $(\theta(n))$ obtained using Gq-SF2 converges to a point in the $\varepsilon$-neighbourhood of the stable attractor of (4.22), with probability 1 as $n \to \infty$.

Theorems 4.11 and 4.13 establish the existence of some $\beta_0 > 0$, for a given $\varepsilon > 0$, such that the proposed gradient-descent algorithms converge to the $\varepsilon$-neighbourhood of a local minimum. However, these results do not give the precise value of $\beta_0$. Further, they do not guarantee that this neighbourhood lies within close proximity of a global minimum.


Chapter 5

q-Gaussian based Newton Search

Algorithms

5.1 Hessian Estimation using q-Gaussian SF

In this chapter, we extend the proposed gradient based algorithms by incorporating an additional Hessian estimate. This leads to algorithms similar to a Newton based search. The existence of $\nabla^2_\theta J(\theta)$ is assumed as per Assumption II, and we estimate it using the SF approach.

5.1.1 One-simulation q-Gaussian SF Hessian Estimate

By following Rubinstein [35], we define the smoothed Hessian, or Hessian of the SF, as

$$\nabla^2_\theta S_{q,\beta}[J(\theta)] = \int_{\Omega_q}\nabla^2_\theta G_{q,\beta}(\theta-\eta)\, J(\eta)\, d\eta, \qquad (5.1)$$

where $\Omega_q$ is the support set of the q-Gaussian distribution as defined earlier (3.1). Now, the $(i,j)$th element of $\nabla^2_\theta G_{q,\beta}(\theta-\eta)$ is

$$\big[\nabla^2_\theta G_{q,\beta}(\theta-\eta)\big]_{i,j} = \frac{4q}{\beta^4\big((N+4)-(N+2)q\big)^2}\,\frac{\big(\theta^{(i)}-\eta^{(i)}\big)\big(\theta^{(j)}-\eta^{(j)}\big)}{\rho\big(\frac{\theta-\eta}{\beta}\big)^2}\, G_{q,\beta}(\theta-\eta)$$


for $i \neq j$, where $\rho(X) = \left(1 - \frac{(1-q)}{(N+4)-(N+2)q}\|X\|^2\right)$. For $i = j$, we have

$$\big[\nabla^2_\theta G_{q,\beta}(\theta-\eta)\big]_{i,i} = -\frac{2}{\beta^2\big((N+4)-(N+2)q\big)}\,\frac{1}{\rho\big(\frac{\theta-\eta}{\beta}\big)}\, G_{q,\beta}(\theta-\eta) + \frac{4q}{\beta^4\big((N+4)-(N+2)q\big)^2}\,\frac{\big(\theta^{(i)}-\eta^{(i)}\big)^2}{\rho\big(\frac{\theta-\eta}{\beta}\big)^2}\, G_{q,\beta}(\theta-\eta).$$

Thus, we can write

$$\nabla^2_\theta G_{q,\beta}(\theta-\eta) = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)}\, H\Big(\frac{\theta-\eta}{\beta}\Big)\, G_{q,\beta}(\theta-\eta), \qquad (5.2)$$

where

$$\big[H(\eta)\big]_{i,j} = \begin{cases}\dfrac{2q}{\big((N+4)-(N+2)q\big)}\,\dfrac{\eta^{(i)}\eta^{(j)}}{\rho(\eta)^2} & \text{for } i \neq j,\\[2ex] \dfrac{2q}{\big((N+4)-(N+2)q\big)}\,\dfrac{\big(\eta^{(i)}\big)^2}{\rho(\eta)^2} - \dfrac{1}{\rho(\eta)} & \text{for } i = j.\end{cases} \qquad (5.3)$$

The function $H(\cdot)$ is a generalization of a similar function given in [5], which is recovered as $q \to 1$. Hence, from (5.1), we have

$$\nabla^2_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)}\int_{\Omega_q} G_{q,\beta}(\theta-\eta)\, H\Big(\frac{\theta-\eta}{\beta}\Big)\, J(\eta)\, d\eta.$$

Substituting $\eta' = \frac{\eta-\theta}{\beta}$, we can write

$$\nabla^2_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)}\,\mathbb{E}_{G_q(\eta')}\Big[H(\eta')\, J(\theta+\beta\eta')\,\Big|\,\theta\Big]. \qquad (5.4)$$

As in the case of the gradient, we show in the sequel (Proposition 5.4) that as $\beta \to 0$, $\|\nabla^2_\theta S_{q,\beta}[J(\theta)] - \nabla^2_\theta J(\theta)\| \to 0$. Hence, we obtain an estimate of $\nabla^2_\theta J(\theta(n))$ of the form

$$\nabla^2_\theta J(\theta(n)) \approx \frac{2}{\beta^2 ML\big((N+4)-(N+2)q\big)}\sum_{n=0}^{M-1}\sum_{m=0}^{L-1} H\big(\eta(n)\big)\, h(Y_m), \qquad (5.5)$$

for large $M$, $L$ and small $\beta$, where the process $\{Y_m\}$ is governed by the parameter $(\theta(n)+\beta\eta(n))$.
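As an illustration, the following is a minimal Python sketch of the Monte-Carlo form of (5.5). The arrays `etas` (the sampled q-Gaussian perturbations) and `costs` (the corresponding single-stage costs from the simulation) are hypothetical inputs; only the construction of $H(\eta)$ from (5.3) and the averaging in (5.5) are shown.

```python
import numpy as np

def H_matrix(eta, q):
    """Construct H(eta) of (5.3) for a standard N-dimensional q-Gaussian sample eta."""
    N = eta.shape[0]
    C = (N + 4) - (N + 2) * q
    rho = 1.0 - (1.0 - q) / C * np.dot(eta, eta)
    H = (2.0 * q / C) * np.outer(eta, eta) / rho**2   # common term for all entries
    H[np.diag_indices(N)] -= 1.0 / rho                # extra -1/rho on the diagonal
    return H

def hessian_estimate_1sim(etas, costs, q, beta):
    """Sketch of (5.5): etas has shape (M, N); costs has shape (M, L)."""
    M, N = etas.shape
    L = costs.shape[1]
    C = (N + 4) - (N + 2) * q
    acc = np.zeros((N, N))
    for n in range(M):
        acc += H_matrix(etas[n], q) * costs[n].sum()
    return 2.0 * acc / (beta**2 * M * L * C)
```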


5.1.2 Two-simulation q-Gaussian SF Hessian Estimate

Similarly, the Hessian of the two-sided SF can be defined as

$$\nabla^2_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{2}\int_{\Omega_q}\nabla^2_\theta G_{q,\beta}(\theta-\eta)\, J(\eta)\, d\eta + \frac{1}{2}\int_{\Omega_q}\nabla^2_\theta G_{q,\beta}(\eta-\theta)\, J(\eta)\, d\eta = \mathbb{E}_{G_q(\eta)}\left[\frac{1}{\beta^2\big((N+4)-(N+2)q\big)}\, H(\eta)\Big(J(\theta+\beta\eta) + J(\theta-\beta\eta)\Big)\right]. \qquad (5.6)$$

Using Proposition 5.8 (discussed later), we obtain the estimate

$$\nabla^2_\theta J(\theta(n)) \approx \frac{1}{\beta^2 ML\big((N+4)-(N+2)q\big)}\sum_{n=0}^{M-1}\sum_{m=0}^{L-1} H\big(\eta(n)\big)\Big[h(Y_m) + h(Y'_m)\Big], \qquad (5.7)$$

for large $M$, $L$ and small $\beta$, where $Y_m$ and $Y'_m$ are governed by the parameters $(\theta(n)+\beta\eta(n))$ and $(\theta(n)-\beta\eta(n))$, respectively.

5.2 Proposed Newton-based Algorithms

In this section, we propose two-timescale algorithms that perform a Newton based search and require one and two simulations, respectively. In particular, the one-simulation (resp. two-simulation) algorithm uses the gradient and Hessian estimates obtained from (4.2) and (5.4) (resp. (4.6) and (5.6)). One of the problems with Newton-based algorithms is that the Hessian has to be positive definite for the algorithm to progress in the descent direction. This is satisfied in the neighbourhood of a local minimum, but it may not hold always. Hence, the estimate obtained during the recursion has to be projected onto the space of symmetric positive definite matrices. Let $P_{pd} : \mathbb{R}^{N\times N} \mapsto \{\text{symmetric matrices with eigenvalues} > \epsilon\}$ be the function that projects any $N \times N$ matrix onto the set of symmetric positive definite matrices whose minimum eigenvalue exceeds $\epsilon$, for some $\epsilon > 0$. We assume that the projection $P_{pd}$ satisfies the following:

Assumption V. If $(A_n)_{n\in\mathbb{N}}, (B_n)_{n\in\mathbb{N}} \subset \mathbb{R}^{N\times N}$ are sequences of matrices such that $\lim_{n\to\infty}\|A_n - B_n\| = 0$, then $\lim_{n\to\infty}\|P_{pd}(A_n) - P_{pd}(B_n)\| = 0$ as well.


Such a projection always exists since the set of positive definite matrices is dense in $\mathbb{R}^{N\times N}$. Methods for performing the projection and the inverse computation are discussed later.
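A minimal sketch of one possible choice of such a projection, based on an eigen-decomposition that raises all eigenvalues below $\epsilon$ to $\epsilon$, is shown below; this is only one admissible construction of $P_{pd}$, not a prescription.

```python
import numpy as np

def project_pd(W, eps=0.1):
    """Project a matrix onto the symmetric matrices with all eigenvalues >= eps."""
    S = 0.5 * (W + W.T)                # symmetrize the estimate
    vals, vecs = np.linalg.eigh(S)     # eigen-decomposition of the symmetric part
    vals = np.maximum(vals, eps)       # clip small or negative eigenvalues
    return vecs @ np.diag(vals) @ vecs.T

# Step 17 of the algorithms then uses the inverse of the projected estimate:
# M_nL = np.linalg.inv(project_pd(W_nL, eps))
```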

The basic approach of the algorithms is similar to the proposed gradient descent algorithms, and we use two step-size sequences, $(a(n))_{n\geq 0}$ and $(b(n))_{n\geq 0}$, satisfying Assumption IV. In the recursions below, the estimate of $\nabla_\theta J(\theta)$ is obtained from the sequence $(Z^{(i)}(n),\, i = 1, \ldots, N)_{n\geq 0}$, while $(W_{i,j}(n),\, i, j = 1, \ldots, N)_{n\geq 0}$ estimates $\nabla^2_\theta J(\theta)$.

Algorithm 3 : The Nq-SF1 Algorithm
1: Fix $M$, $L$, $q$, $\beta$ and $\epsilon$.
2: Set $Z^{(i)}(0) = 0$, $W_{i,j}(0) = 0$, $i, j = 1, \ldots, N$.
3: Fix the parameter vector $\theta(0) = (\theta^{(1)}(0), \theta^{(2)}(0), \ldots, \theta^{(N)}(0))^T$.
4: for $n = 0$ to $M-1$ do
5:  Generate a standard $N$-dimensional q-Gaussian distributed random vector $\eta(n)$.
6:  for $m = 0$ to $L-1$ do
7:   Generate the simulation $Y_{nL+m}$ governed with parameter $(\theta(n)+\beta\eta(n))$.
8:   for $i = 1$ to $N$ do
9:    $Z^{(i)}(nL+m+1) = (1-b(n))\,Z^{(i)}(nL+m) + b(n)\left[\dfrac{2\,\eta^{(i)}(n)\,h(Y_{nL+m})}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}\right]$, where $\rho(\eta(n)) = 1 - \frac{(1-q)}{(N+4)-(N+2)q}\|\eta(n)\|^2$.
10:   $W_{i,i}(nL+m+1) = (1-b(n))\,W_{i,i}(nL+m) + b(n)\left[\dfrac{2\,h(Y_{nL+m})}{\beta^2\big((N+4)-(N+2)q\big)}\left(\dfrac{2q\,\big(\eta^{(i)}(n)\big)^2}{\big((N+4)-(N+2)q\big)\rho(\eta(n))^2} - \dfrac{1}{\rho(\eta(n))}\right)\right]$.
11:   for $j = i+1$ to $N$ do
12:    $W_{i,j}(nL+m+1) = (1-b(n))\,W_{i,j}(nL+m) + b(n)\left[\dfrac{4q\,\eta^{(i)}(n)\,\eta^{(j)}(n)\,h(Y_{nL+m})}{\beta^2\big((N+4)-(N+2)q\big)^2\rho(\eta(n))^2}\right]$.
13:    $W_{j,i}(nL+m+1) = W_{i,j}(nL+m+1)$.
14:   end for
15:  end for
16: end for
17: Project $W(nL)$ using $P_{pd}$, and compute its inverse. Let $M(nL) = P_{pd}(W(nL))^{-1}$.
18: for $i = 1$ to $N$ do
19:  $\theta^{(i)}(n+1) = P_C\left(\theta^{(i)}(n) - a(n)\sum_{j=1}^{N} M_{i,j}(nL)\, Z^{(j)}(nL)\right)$.
20: end for
21: Set $\theta(n+1) = (\theta^{(1)}(n+1), \theta^{(2)}(n+1), \ldots, \theta^{(N)}(n+1))^T$.
22: end for
23: Output $\theta(M) = (\theta^{(1)}(M), \theta^{(2)}(M), \ldots, \theta^{(N)}(M))^T$ as the final parameter vector.


In the Nq-SF2 algorithm, we generate two simulations $Y_{nL+m}$ and $Y'_{nL+m}$, governed with parameters $(\theta(n)+\beta\eta(n))$ and $(\theta(n)-\beta\eta(n))$ respectively, instead of a single simulation. The single-stage cost functions of both simulations are used to update the gradient and Hessian estimates, which are then used in the optimization update rule (Step 19).

Algorithm 4 : The Nq-SF2 Algorithm
1: Fix $M$, $L$, $q$, $\beta$ and $\epsilon$.
2: Set $Z^{(i)}(0) = 0$, $W_{i,j}(0) = 0$, $i, j = 1, \ldots, N$.
3: Fix the parameter vector $\theta(0) = (\theta^{(1)}(0), \theta^{(2)}(0), \ldots, \theta^{(N)}(0))^T$.
4: for $n = 0$ to $M-1$ do
5:  Generate a standard $N$-dimensional q-Gaussian distributed random vector $\eta(n)$.
6:  for $m = 0$ to $L-1$ do
7:   Generate two independent simulations $Y_{nL+m}$ and $Y'_{nL+m}$ governed with parameters $(\theta(n)+\beta\eta(n))$ and $(\theta(n)-\beta\eta(n))$, respectively.
8:   for $i = 1$ to $N$ do
9:    $Z^{(i)}(nL+m+1) = (1-b(n))\,Z^{(i)}(nL+m) + b(n)\left[\dfrac{\eta^{(i)}(n)\big(h(Y_{nL+m}) - h(Y'_{nL+m})\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}\right]$.
10:   $W_{i,i}(nL+m+1) = (1-b(n))\,W_{i,i}(nL+m) + b(n)\left[\dfrac{h(Y_{nL+m}) + h(Y'_{nL+m})}{\beta^2\big((N+4)-(N+2)q\big)}\left(\dfrac{2q\,\big(\eta^{(i)}(n)\big)^2}{\big((N+4)-(N+2)q\big)\rho(\eta(n))^2} - \dfrac{1}{\rho(\eta(n))}\right)\right]$.
11:   for $j = i+1$ to $N$ do
12:    $W_{i,j}(nL+m+1) = (1-b(n))\,W_{i,j}(nL+m) + b(n)\left[\dfrac{2q\,\eta^{(i)}(n)\,\eta^{(j)}(n)\big(h(Y_{nL+m}) + h(Y'_{nL+m})\big)}{\beta^2\big((N+4)-(N+2)q\big)^2\rho(\eta(n))^2}\right]$.
13:    $W_{j,i}(nL+m+1) = W_{i,j}(nL+m+1)$.
14:   end for
15:  end for
16: end for
17: Project $W(nL)$ using $P_{pd}$, and compute its inverse. Let $M(nL) = P_{pd}(W(nL))^{-1}$.
18: for $i = 1$ to $N$ do
19:  $\theta^{(i)}(n+1) = P_C\left(\theta^{(i)}(n) - a(n)\sum_{j=1}^{N} M_{i,j}(nL)\, Z^{(j)}(nL)\right)$.
20: end for
21: Set $\theta(n+1) = (\theta^{(1)}(n+1), \theta^{(2)}(n+1), \ldots, \theta^{(N)}(n+1))^T$.
22: end for
23: Output $\theta(M) = (\theta^{(1)}(M), \theta^{(2)}(M), \ldots, \theta^{(N)}(M))^T$ as the final parameter vector.


5.3 Convergence of Newton SF Algorithms

5.3.1 Convergence of Nq-SF1 Algorithm

The convergence analysis is quite similar to that of Gq-SF1, along with an additional Hessian update. Let us consider the updates in the faster timescale, i.e., Steps 9 and 10 of the Nq-SF1 algorithm. We use $b(p)$ defined in Section 4.3, i.e., $b(p) = b(n)$ for all $nL \leq p < (n+1)L$. Observing that $\theta(n)$ and $\eta(n)$ are fixed during these updates, we can rewrite Step 9 as the following iteration, which runs for $L$ steps:

$$Z(p+1) = Z(p) + b(p)\big(g_1(Y_p) - Z(p)\big), \qquad (5.8)$$

where $g_1(Y_p) = \left(\dfrac{2\,\eta(n)\,h(Y_p)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}\right)$ for $nL \leq p < (n+1)L$. Here, $\rho(\cdot)$ is as defined above, and $\{Y_p\}_{p\in\mathbb{N}}$ is a Markov process with the stationary distribution $\nu_{(\theta(n)+\beta\eta(n))}$.

Similarly, the update of the Hessian matrix can be expressed as

$$W(p+1) = W(p) + b(p)\big(g_2(Y_p) - W(p)\big), \qquad (5.9)$$

where $g_2(Y_p) = \left(\dfrac{2\,H(\eta(n))\,h(Y_p)}{\beta^2\big((N+4)-(N+2)q\big)}\right)$ for $nL \leq p < (n+1)L$.

The iterations (5.8) and (5.9) are independent of each other for a fixed $n$, and hence they can be dealt with separately. It follows from Lemma 4.1 that the updates lead to the following ODEs:

$$\dot{\theta}(t) = 0, \qquad (5.10)$$
$$\dot{Z}(t) = \frac{2\,\eta(t)\,J\big(\theta(t)+\beta\eta(t)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(t))} - Z(t), \qquad (5.11)$$
$$\dot{W}(t) = \frac{2\,H(\eta(t))\,J\big(\theta(t)+\beta\eta(t)\big)}{\beta^2\big((N+4)-(N+2)q\big)} - W(t), \qquad (5.12)$$

where, as before, $\theta(t) \equiv \theta$ and $\eta(t) \equiv \eta$ in (5.11)–(5.12) in lieu of (5.10). The results related to the gradient updates stated in Section 4.3 (Lemmas 4.3, 4.6 and Corollary 4.7) still hold. We prove similar results for the Hessian update.


Lemma 5.1. The sequence $(W(p))$ is uniformly bounded with probability 1.

Proof. Observing the fact that

$$\mathbb{E}_{\nu_{(\theta+\beta\eta)}}\left[\frac{2\,H(\eta)\,h(Y_p)}{\beta^2\big((N+4)-(N+2)q\big)} - W(t)\right] = \frac{2\,H(\eta)\,J\big(\theta+\beta\eta\big)}{\beta^2\big((N+4)-(N+2)q\big)} - W(t),$$

the proof turns out to be exactly similar to that of Lemma 4.3.

We again refer to the time steps $t_n$ with $t_0 = 0$ and $t_n = \sum_{i=0}^{n-1} b(i)$ for $n \geq 1$. Defining $\bar{W}(t)$ as the interpolation between $W(nL)$ and $W(nL+L)$ over the interval $[t_n, t_{n+1}]$, we have the following.

Lemma 5.2. Given $\tau, \mu > 0$, $\big(\theta(t_n + \cdot), \bar{W}(t_n + \cdot)\big)$ is eventually a bounded $(\tau\text{-}\mu)$ perturbation of (4.9)–(4.11).

Corollary 5.3. $\left\|W(nL) - \left(\dfrac{2\,H(\eta(n))\,J(\theta(n)+\beta\eta(n))}{\beta^2\big((N+4)-(N+2)q\big)}\right)\right\| \to 0$ almost surely as $n \to \infty$.

Thus, both the $Z(\cdot)$ and $W(\cdot)$ updates eventually track the gradient and Hessian of $S_{q,\beta}[J(\theta)]$. So, after incorporating the projection considered in Step 17, we can write Steps 17 and 19 of the Nq-SF1 algorithm as follows:

$$\theta(n+1) = P_C\left(\theta(n) - a(n)\left[P_{pd}\left(\frac{2H(\eta(n))\,J\big(\theta(n)+\beta\eta(n)\big)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}\frac{2\,\eta(n)\,J\big(\theta(n)+\beta\eta(n)\big)}{\beta\rho(\eta(n))\big((N+4)-(N+2)q\big)}\right]\right)$$
$$= P_C\Big(\theta(n) + a(n)\Big[-P_{pd}\big(\nabla^2_{\theta(n)}J(\theta(n))\big)^{-1}\nabla_{\theta(n)}J\big(\theta(n)\big) + \Delta\big(\theta(n)\big) + \xi_n\Big]\Big), \qquad (5.13)$$

where, using (4.2) and (5.4), we have

$$\Delta\big(\theta(n)\big) = P_{pd}\big(\nabla^2_{\theta(n)}J(\theta(n))\big)^{-1}\nabla_{\theta(n)}J\big(\theta(n)\big) - P_{pd}\Big(\nabla^2_{\theta(n)}S_{q,\beta}\big[J(\theta(n))\big]\Big)^{-1}\nabla_{\theta(n)}S_{q,\beta}\big[J(\theta(n))\big] \qquad (5.14)$$


and

$$\xi_n = \mathbb{E}\left[P_{pd}\left(\frac{2H(\eta(n))\,J\big(\theta(n)+\beta\eta(n)\big)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}\frac{2\,\eta(n)\,J\big(\theta(n)+\beta\eta(n)\big)}{\beta\rho(\eta(n))\big((N+4)-(N+2)q\big)}\,\middle|\,\theta(n)\right] - P_{pd}\left(\frac{2H(\eta(n))\,J\big(\theta(n)+\beta\eta(n)\big)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}\frac{2\,\eta(n)\,J\big(\theta(n)+\beta\eta(n)\big)}{\beta\rho(\eta(n))\big((N+4)-(N+2)q\big)}. \qquad (5.15)$$

The objective is to show that the error term $\Delta\big(\theta(n)\big)$ and the noise term $\xi_n$ satisfy conditions similar to those mentioned in Section 4.3. Then, we can use similar arguments to prove the convergence of Nq-SF1. Considering the error term, we have the following proposition regarding the convergence of the Hessian of the SF to the Hessian of the objective function $J$ as $\beta \to 0$.

Proposition 5.4. For a given $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, we have

$$\big\|\nabla^2_\theta S_{q,\beta}[J(\theta)] - \nabla^2_\theta J(\theta)\big\| = o(\beta)$$

for all $\theta \in C$ and $\beta > 0$.

Proof. We use Taylor's expansion of $J(\theta+\beta\eta)$ around $\theta \in C$ to rewrite (5.4) as

$$\nabla^2_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)}\Big(\mathbb{E}\big[H(\eta)J(\theta)\,|\,\theta\big] + \beta\,\mathbb{E}\big[H(\eta)\,\eta^T\nabla_\theta J(\theta)\,|\,\theta\big] + \frac{\beta^2}{2}\,\mathbb{E}\big[H(\eta)\,\eta^T\nabla^2_\theta J(\theta)\eta\,|\,\theta\big] + \frac{\beta^3}{6}\,\mathbb{E}\big[H(\eta)\,\eta^T(\nabla^3_\theta J(\theta)\eta)\eta\,|\,\theta\big] + o(\beta^3)\Big). \qquad (5.16)$$

Let us consider each of the terms in (5.16). It is evident that for all $i, j = 1, \ldots, N$, $i \neq j$, $\mathbb{E}\big[H(\eta)_{i,j}\big] = 0$. Even for the diagonal elements, we have for all $i = 1, 2, \ldots, N$,

$$\mathbb{E}\big[H(\eta)_{i,i}\big] = \frac{2q}{\big((N+4)-(N+2)q\big)}\,\mathbb{E}\left[\frac{(\eta^{(i)})^2}{\rho(\eta)^2}\right] - \mathbb{E}\left[\frac{1}{\rho(\eta)}\right]. \qquad (5.17)$$


For $q < 1$, a simple application of Proposition 3.1 shows that

$$\mathbb{E}\left[\frac{1}{\rho(\eta)}\right] = \frac{\Gamma\big(\frac{1}{1-q}\big)\,\Gamma\big(\frac{1}{1-q}+1+\frac{N}{2}\big)}{\Gamma\big(\frac{1}{1-q}+1\big)\,\Gamma\big(\frac{1}{1-q}+\frac{N}{2}\big)} = \frac{\frac{1}{1-q}+\frac{N}{2}}{\frac{1}{1-q}} = \frac{N+2-Nq}{2}. \qquad (5.18)$$

Similarly, $\mathbb{E}\big[\frac{1}{\rho(\eta)}\big] = \frac{N+2-Nq}{2}$ for $q \in \big(1, 1+\frac{2}{N+2}\big)$. Substituting this expression in (5.17), we get $\mathbb{E}\big[H(\eta)_{i,i}\big] = 0$. Thus, the first term in (5.16) is zero. Expanding the inner product,

the $(i,j)$th element of the second term can be written as

$$\big[\mathbb{E}\big[H(\eta)\,\eta^T\nabla_\theta J(\theta)\,|\,\theta\big]\big]_{i,j} = \sum_{k=1}^{N}\big[\nabla_\theta J(\theta)\big]^{(k)}\,\mathbb{E}\big[\eta^{(k)}H(\eta)_{i,j}\big] = \begin{cases}\dfrac{2q}{\big((N+4)-(N+2)q\big)}\displaystyle\sum_{k=1}^{N}\big[\nabla_\theta J(\theta)\big]^{(k)}\,\mathbb{E}\left[\dfrac{\eta^{(i)}\eta^{(j)}\eta^{(k)}}{\rho(\eta)^2}\right] & \text{if } i \neq j,\\[2ex] \dfrac{2q}{\big((N+4)-(N+2)q\big)}\displaystyle\sum_{k=1}^{N}\big[\nabla_\theta J(\theta)\big]^{(k)}\,\mathbb{E}\left[\dfrac{(\eta^{(i)})^2\eta^{(k)}}{\rho(\eta)^2}\right] - \displaystyle\sum_{k=1}^{N}\big[\nabla_\theta J(\theta)\big]^{(k)}\,\mathbb{E}\left[\dfrac{\eta^{(k)}}{\rho(\eta)}\right] & \text{if } i = j.\end{cases}$$

In all the expectations above, the total power of the components in the numerator is odd; hence, for any combination of $i, j, k$, the integrands are odd functions and so, by Proposition 3.1, the expectations are zero in all cases. Thus, the second term in (5.16) is zero. For similar reasons, the fourth term in (5.16) is also zero. Now, we consider the third term.

For $i \neq j$, we have

$$\mathbb{E}\big[H(\eta)_{i,j}\big(\eta^T\nabla^2_\theta J(\theta)\eta\big)\,|\,\theta\big] = \frac{2q}{\big((N+4)-(N+2)q\big)}\sum_{k,l=1}^{N}\big[\nabla^2_\theta J(\theta)\big]_{k,l}\,\mathbb{E}\left[\frac{\eta^{(i)}\eta^{(j)}\eta^{(k)}\eta^{(l)}}{\rho(\eta)^2}\right],$$

which is zero unless $i = k, j = l$ or $i = l, j = k$. So, using the fact that $\nabla^2_\theta J(\theta)$ is symmetric, i.e., $[\nabla^2_\theta J(\theta)]_{k,l} = [\nabla^2_\theta J(\theta)]_{l,k}$, we can write

$$\mathbb{E}\big[H(\eta)_{i,j}\big(\eta^T\nabla^2_\theta J(\theta)\eta\big)\,|\,\theta\big] = \frac{4q\,[\nabla^2_\theta J(\theta)]_{i,j}}{\big((N+4)-(N+2)q\big)}\,\mathbb{E}\left[\frac{(\eta^{(i)})^2(\eta^{(j)})^2}{\rho(\eta)^2}\right]. \qquad (5.19)$$

Referring to Proposition 3.1, we have in this case $b = b_i = b_j = 2$ and $b_k = 0$ for all other $k$. So $\sum_{k=1}^{N}\frac{b_k}{2} = 2$ and $\frac{b_i!}{2^{b_i}(\frac{b_i}{2})!} = \frac{b_j!}{2^{b_j}(\frac{b_j}{2})!} = \frac{1}{2}$. For any $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, the


arguments of the Gamma functions are positive, and hence, the Gamma functions exist.

For $q < 1$, we have

$$\mathbb{E}\left[\frac{(\eta^{(i)})^2(\eta^{(j)})^2}{\rho(\eta)^2}\right] = \left(\frac{(N+4)-(N+2)q}{1-q}\right)^2\left(\frac{1}{4}\right)\frac{\Gamma\big(\frac{1}{1-q}-1\big)\,\Gamma\big(\frac{1}{1-q}+1+\frac{N}{2}\big)}{\Gamma\big(\frac{1}{1-q}+1\big)\,\Gamma\big(\frac{1}{1-q}+1+\frac{N}{2}\big)} = \left(\frac{(N+4)-(N+2)q}{1-q}\right)^2\frac{1}{4}\,\frac{1}{\frac{1}{1-q}\big(\frac{1}{1-q}-1\big)} = \frac{\big((N+4)-(N+2)q\big)^2}{4q},$$

while for $q > 1$, we again have $\mathbb{E}\left[\dfrac{(\eta^{(i)})^2(\eta^{(j)})^2}{\rho(\eta)^2}\right] = \dfrac{\big((N+4)-(N+2)q\big)^2}{4q}$.

Hence, by substituting in (5.19), we obtain

$$\mathbb{E}\big[H(\eta)_{i,j}\big(\eta^T\nabla^2_\theta J(\theta)\eta\big)\,|\,\theta\big] = \big((N+4)-(N+2)q\big)\big[\nabla^2_\theta J(\theta)\big]_{i,j}$$

for $i \neq j$. Now, for $i = j$, using (5.3),

$$\mathbb{E}\big[H(\eta)_{i,i}\big(\eta^T\nabla^2_\theta J(\theta)\eta\big)\,|\,\theta\big] = \frac{2q}{\big((N+4)-(N+2)q\big)}\sum_{k,l=1}^{N}\big[\nabla^2_\theta J(\theta)\big]_{k,l}\,\mathbb{E}\left[\frac{(\eta^{(i)})^2\eta^{(k)}\eta^{(l)}}{\rho(\eta)^2}\right] - \sum_{k,l=1}^{N}\big[\nabla^2_\theta J(\theta)\big]_{k,l}\,\mathbb{E}\left[\frac{\eta^{(k)}\eta^{(l)}}{\rho(\eta)}\right].$$

Observing that the above expectations are zero for $k \neq l$, we have

$$\mathbb{E}\big[H(\eta)_{i,i}\big(\eta^T\nabla^2_\theta J(\theta)\eta\big)\,|\,\theta\big] = \frac{2q\,[\nabla^2_\theta J(\theta)]_{i,i}}{\big((N+4)-(N+2)q\big)}\,\mathbb{E}\left[\frac{(\eta^{(i)})^4}{\rho(\eta)^2}\right] + \frac{2q}{\big((N+4)-(N+2)q\big)}\sum_{k\neq i}\big[\nabla^2_\theta J(\theta)\big]_{k,k}\,\mathbb{E}\left[\frac{(\eta^{(i)})^2(\eta^{(k)})^2}{\rho(\eta)^2}\right] - \sum_{k=1}^{N}\big[\nabla^2_\theta J(\theta)\big]_{k,k}\,\mathbb{E}\left[\frac{(\eta^{(k)})^2}{\rho(\eta)}\right]. \qquad (5.20)$$

We again refer to Proposition 3.1 to compute each term in (5.20). For the first term,


$b = 2$ and $b_i = 4$. So we can verify that the conditions in Proposition 3.1 hold for all values of $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$. We have

$$\mathbb{E}\left[\frac{(\eta^{(i)})^4}{\rho(\eta)^2}\right] = \left(\frac{(N+4)-(N+2)q}{1-q}\right)^2\left(\frac{4!}{2^4\, 2!}\right)\frac{\Gamma\big(\frac{1}{1-q}-1\big)}{\Gamma\big(\frac{1}{1-q}+1\big)} = \frac{3\big((N+4)-(N+2)q\big)^2}{4q}$$

for $q \in (0,1)$. The same result also holds for $1 < q < \big(1+\frac{2}{N+2}\big)$. The second term in (5.20) is similar to the one in (5.19), and can be computed in the same way. From (4.21), we have $\mathbb{E}_{G_q(\eta)}\big[\frac{(\eta^{(k)})^2}{\rho(\eta)}\big] = \frac{(N+4)-(N+2)q}{2}$. Substituting all these terms in (5.20) results in the following.

the following.

E[H(η)i,i

(ηT∇2

θJ(θ)η)|θ]

=((N + 4)− (N + 2)q

)(3

2

[∇2θJ(θ)

]i,i

+1

2

∑k 6=i

[∇2θJ(θ)

]k,k− 1

2

N∑k=1

[∇2θJ(θ)

]k,k

)

=((N + 4)− (N + 2)q

) [∇2θJ(θ)

]i,i.

The claim follows by substituting all the above expressions in (5.16).

The following result is a direct consequence of Propositions 4.10 and 5.4.

Corollary 5.5. Under Assumption V, $\|\Delta(\theta)\| = o(\beta)$ for all $q \in \big(0, 1+\frac{2}{N+2}\big)$, $q \neq 1$.

Proof. We write $\Delta(\cdot)$ as

$$\Delta(\theta) = \Big(P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1} - P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big)\nabla_\theta J(\theta) + P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\big(\nabla_\theta J(\theta) - \nabla_\theta S_{q,\beta}[J(\theta)]\big),$$

which implies that

$$\big\|\Delta(\theta)\big\| \leq \Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1} - P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\,\big\|\nabla_\theta J(\theta)\big\| + \Big\|P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\,\big\|\nabla_\theta J(\theta) - \nabla_\theta S_{q,\beta}[J(\theta)]\big\|. \qquad (5.21)$$


Since $\nabla_\theta J(\theta)$ is continuously differentiable on the compact set $C$, we have $\sup_{\theta\in C}\|\nabla_\theta J(\theta)\| < \infty$. Also, since $P_{pd}(\cdot)$ is a positive definite matrix, its inverse always exists, i.e., for any given matrix $A$, $\|(P_{pd}(A))^{-1}\| < \infty$ for any matrix norm. Thus, in order to justify the claim, we need to show that the other terms are $o(\beta)$. From Proposition 4.10, it follows that $\big\|\nabla_\theta J(\theta) - \nabla_\theta S_{q,\beta}[J(\theta)]\big\| = o(\beta)$, and we can write

$$\Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1} - P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\| = \Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1}P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big(P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big) - P_{pd}\big(\nabla^2_\theta J(\theta)\big)\Big)\Big\| \leq \Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1}\Big\|\,\Big\|P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\,\Big\|P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big) - P_{pd}\big(\nabla^2_\theta J(\theta)\big)\Big\|.$$

The first two terms are finite due to the positive definiteness of $P_{pd}(\cdot)$, while from Proposition 5.4, the third term is $o(\beta)$. Hence, the claim.

The following result deals with the noise term $\xi_n$ in a way similar to the Gq-SF1 algorithm. For this, we consider the filtration $(\mathcal{F}_n)_{n\geq 0}$ as defined earlier in Chapter 4, i.e., $\mathcal{F}_n = \sigma\big(\theta(0), \ldots, \theta(n), \eta(0), \ldots, \eta(n-1)\big)$.

Lemma 5.6. Defining $M_n = \sum_{k=0}^{n-1} a(k)\xi_k$, $(M_n, \mathcal{F}_n)_{n\geq 0}$ is an almost surely convergent martingale sequence for all $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$.

Proof. As $\theta(k)$ is $\mathcal{F}_k$-measurable, while $\eta(k)$ is independent of $\mathcal{F}_k$ for all $k \geq 0$, we can conclude that $\mathbb{E}[\xi_k|\mathcal{F}_k] = 0$. Thus $(\xi_k, \mathcal{F}_k)_{k\geq 0}$ is a martingale difference sequence and $(M_k, \mathcal{F}_k)_{k\geq 0}$ is a martingale sequence. As shown in Lemma 4.9,

$$\mathbb{E}\big[\|\xi_k\|^2\,\big|\,\mathcal{F}_k\big] \leq 4\,\mathbb{E}\left[\left\|P_{pd}\left(\frac{2H(\eta)\,J(\theta+\beta\eta)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}\frac{2\,\eta\, J(\theta+\beta\eta)}{\beta\rho(\eta)\big((N+4)-(N+2)q\big)}\right\|^2\right] \leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2}\,\mathbb{E}\left[\left\|P_{pd}\left(\frac{2H(\eta)\,J(\theta+\beta\eta)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}\right\|^2\left\|\frac{\eta}{\rho(\eta)}\right\|^2 J(\theta+\beta\eta)^2\right] \leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2}\,\sup_\eta J(\theta+\beta\eta)^2\,\sup_\eta\left(\frac{1}{\Lambda_{\max}^2}\right)\sum_{j=1}^{N}\mathbb{E}\left[\frac{(\eta^{(j)})^2}{\rho(\eta)^2}\right],$$


where $\Lambda_{\max}$ is the maximum eigenvalue of the projected Hessian matrix, which is always greater than $\epsilon$ by the definition of $P_{pd}$. The claim follows using arguments similar to those in Lemma 4.9.

Thus, we have the main theorem, which affirms the convergence of the Nq-SF1 algorithm. The proof of this theorem is exactly the same as that of Theorem 4.11.

Theorem 5.7. Under Assumptions I – V, given $\varepsilon > 0$ and $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, there exists $\beta_0 > 0$ such that for all $\beta \in (0, \beta_0]$, the sequence $(\theta(n))$ obtained using Nq-SF1 converges almost surely as $n \to \infty$ to a point in the $\varepsilon$-neighbourhood of the set of stable attractors of the ODE

$$\dot{\theta}(t) = \bar{P}_C\Big(-P_{pd}\big(\nabla^2_{\theta(t)}J(\theta(t))\big)^{-1}\nabla_{\theta(t)}J\big(\theta(t)\big)\Big), \qquad (5.22)$$

where the domain of attraction is given by

$$K = \Big\{\theta \in C \,\Big|\, \nabla_\theta J(\theta)^T\,\bar{P}_C\Big(-P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1}\nabla_\theta J(\theta)\Big) = 0\Big\}. \qquad (5.23)$$

5.3.2 Convergence of Nq-SF2 Algorithm

The convergence of the Nq-SF2 algorithm to a local minimum can be shown by extending the results of the previous section in the same way as done for the Gq-SF2 algorithm. We only show the result regarding convergence of the smoothed Hessian to the Hessian of the objective function as $\beta \to 0$, which has been used in (5.7).

Proposition 5.8. For a given $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, for all $\theta \in C$ and $\beta > 0$, $\big\|\nabla^2_\theta S'_{q,\beta}[J(\theta)] - \nabla^2_\theta J(\theta)\big\| = o(\beta)$.

Proof. For small $\beta > 0$, using Taylor's expansions of $J(\theta+\beta\eta)$ and $J(\theta-\beta\eta)$ around $\theta \in C$,

$$J(\theta+\beta\eta) + J(\theta-\beta\eta) = 2J(\theta) + \beta^2\eta^T\nabla^2_\theta J(\theta)\,\eta + o(\beta^3).$$


Thus the Hessian of the two-sided SF (5.6) is

$$\nabla^2_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{\beta^2\big((N+4)-(N+2)q\big)}\Big(\mathbb{E}\big[2H(\eta)J(\theta)\,|\,\theta\big] + \beta^2\,\mathbb{E}\big[H(\eta)\,\eta^T\nabla^2_\theta J(\theta)\eta\,|\,\theta\big] + o(\beta^3)\Big),$$

which can be computed as in Proposition 5.4 to arrive at the claim.

The key result related to convergence of Nq-SF2 follows.

Theorem 5.9. Under Assumptions I – V, given $\varepsilon > 0$ and $q \in (0,1)\cup\big(1, 1+\frac{2}{N+2}\big)$, there exists $\beta_0 > 0$ such that for all $\beta \in (0, \beta_0]$, the sequence $(\theta(n))$ obtained using Nq-SF2 converges almost surely as $n \to \infty$ to a point in the $\varepsilon$-neighbourhood of the set of stable attractors of the ODE (5.22).

Thus, the convergence analysis of all the proposed algorithms shows that by choosing a small $\beta > 0$ and any $q \in \big(0, 1+\frac{2}{N+2}\big)$, all the SF algorithms converge to a local optimum. In the special case of $q \to 1$, we retrieve the algorithms presented in [5], and so the corresponding convergence analysis holds.


Chapter 6

Simulations using Proposed

Algorithms

6.1 Numerical Setting

We consider a multi-node network of M/G/1 queues with feedback as shown in the figure

below. The setting here is a generalized version of that considered in [5].

Figure 6.1: Queuing Network.

There are $K$ nodes, which are fed with independent Poisson external arrival processes with rates $\lambda_1, \lambda_2, \ldots, \lambda_K$, respectively. After departing from the $i$th node, a customer either leaves the system with probability $p_i$ or enters the $(i+1)$th node with probability $(1-p_i)$. Once the service at the $K$th node is completed, the customer may rejoin the 1st node with probability $(1-p_K)$. The service time processes of each node, $\{S^i_n(\theta_i)\}_{n\geq 1}$, $i = 1, 2, \ldots, K$,


are defined as

$$S^i_n(\theta_i) = U_i(n)\left(\frac{1}{R_i} + \|\theta_i(n) - \bar{\theta}_i\|^2\right), \qquad (6.1)$$

where for all $i = 1, 2, \ldots, K$, $R_i$ are constants and $U_i(n)$ are independent samples drawn from the uniform distribution on $(0,1)$. The service time of each node depends on the $N_i$-dimensional tunable parameter vector $\theta_i$, whose individual components lie in a certain interval $\big[\big(\theta^{(j)}_i\big)_{\min}, \big(\theta^{(j)}_i\big)_{\max}\big]$, $j = 1, 2, \ldots, N_i$, $i = 1, 2, \ldots, K$. Here, $\theta_i(n)$ represents the $n$th update of the parameter vector at the $i$th node, and $\bar{\theta}_i$ represents the target parameter vector corresponding to the $i$th node.
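For example, a minimal sketch of this service-time model is shown below; the argument names (current parameter, target parameter, and the constant $R_i$) are illustrative placeholders for the corresponding quantities of node $i$.

```python
import numpy as np

rng = np.random.default_rng(0)

def service_time(theta_i, theta_i_target, R_i):
    """Sample S_n^i(theta_i) = U_i(n) * (1/R_i + ||theta_i(n) - target||^2)."""
    U = rng.uniform(0.0, 1.0)
    return U * (1.0 / R_i + np.sum((theta_i - theta_i_target) ** 2))

# Example: a node with R_i = 10, a 2-dimensional parameter and target 0.3 per component.
s = service_time(np.array([0.1, 0.6]), np.array([0.3, 0.3]), 10.0)
```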

The cost function is chosen to be the sum of the total waiting times of all the customers in the system. For the cost to be minimum, $S^i_n(\theta_i)$ should be minimum, and hence we should have $\theta_i(n) = \bar{\theta}_i$, $i = 1, \ldots, K$. Let us denote

$$\theta = \begin{pmatrix}\theta_1\\ \theta_2\\ \vdots\\ \theta_K\end{pmatrix} \quad\text{and}\quad \bar{\theta} = \begin{pmatrix}\bar{\theta}_1\\ \bar{\theta}_2\\ \vdots\\ \bar{\theta}_K\end{pmatrix}.$$

It is evident that $\theta, \bar{\theta} \in \mathbb{R}^N$, where $N = \sum_{i=1}^{K} N_i$. In order to compare the performance of the various algorithms, we consider the performance measure to be the Euclidean distance between $\theta(n)$ and $\bar{\theta}$, given by

$$\|\theta(n) - \bar{\theta}\| = \left[\sum_{i=1}^{K}\sum_{j=1}^{N_i}\big(\theta^{(j)}_i(n) - \bar{\theta}^{(j)}_i\big)^2\right]^{1/2}.$$

6.2 Implementation Issues

Before discussing the results of the simulations, we address some issues regarding the implementation of the proposed algorithms.

All the algorithms require generation of a multivariate q-Gaussian distributed random vector whose individual components are uncorrelated and identically distributed. This implies that the random variables are q-independent [47]. In the limiting case of $q \to 1$, q-independence is equivalent to independence of the random variables, so we can use standard algorithms to generate i.i.d. samples. This is not possible for q-Gaussians with $q \neq 1$. Thistleton et al. [43] proposed an algorithm for generating one-dimensional q-Gaussian distributed random variables using a generalized Box–Muller transformation. This method can be easily extended to the two-variable case. However, there exists no standard algorithm for generating $N$-variate q-Gaussian random vectors. Hence, we perform the simulations with random vectors consisting of $N$ i.i.d. samples of univariate q-Gaussian random variables.
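A minimal sketch of the univariate generator is given below. It follows our reading of the generalized Box–Muller construction of [43], in which the logarithm of the standard transform is replaced by the q-logarithm with parameter $q' = \frac{1+q}{3-q}$; the exact scaling convention for the "standard" q-Gaussian used in the algorithms should be checked against [43] and Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_log(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1)/(1-q), reducing to ln(x) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_gaussian_samples(q, size):
    """Generalized Box-Muller transform producing q-Gaussian deviates (q < 3)."""
    q_prime = (1.0 + q) / (3.0 - q)
    u1 = rng.uniform(size=size)
    u2 = rng.uniform(size=size)
    return np.sqrt(-2.0 * q_log(u1, q_prime)) * np.cos(2.0 * np.pi * u2)

# N i.i.d. univariate samples used as the perturbation vector eta(n):
eta = q_gaussian_samples(q=0.9, size=4)
```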

The projection of the Hessian in the Newton based algorithms requires some discussion. We require the projected matrix to have eigenvalues bounded below by $\epsilon > 0$. This can be done by performing an eigen-decomposition of the Hessian update $W(nL)$ and computing the projected matrix by setting all eigenvalues less than $\epsilon$ equal to $\epsilon$. However, this method requires a large amount of time and memory for the eigen-decomposition. To reduce the computational effort, various methods have been studied in the literature [4, 40, 50] for obtaining projected Hessian estimates. We consider a variation of Newton's method in which the off-diagonal elements of the matrix are set to zero. This is similar to the Jacobi variant algorithms discussed in [5], and it simplifies the update, as Steps 11–15 in the algorithms are no longer required. The projection of the Hessian can then be obtained by projecting each of the diagonal elements onto $[\epsilon, \infty)$, and the inverse can be computed directly. The simulations are shown using this method.
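A minimal sketch of this Jacobi variant of the projection and inversion step is shown below, assuming (for illustration) that the diagonal Hessian estimates are held in a vector `W_diag`.

```python
import numpy as np

def jacobi_newton_direction(Z, W_diag, eps=0.1):
    """Jacobi variant of Steps 17 and 19: project each diagonal Hessian entry to
    [eps, inf) and scale the gradient estimate componentwise by the inverse."""
    d = np.maximum(W_diag, eps)   # projection of each diagonal element onto [eps, infinity)
    return Z / d                  # equivalent to Ppd(diag(W))^{-1} Z

# Slow-timescale update (componentwise), with P_C the projection onto [theta_min, theta_max]:
# theta = np.clip(theta - a_n * jacobi_newton_direction(Z, W_diag, eps), theta_min, theta_max)
```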

The analysis on the faster timescale for the Newton based algorithms shows that the gradient and Hessian updates are independent, and hence their convergence to the smoothed gradient and Hessian, respectively, can be analyzed independently. This also provides scope to update the gradient and Hessian on different timescales without affecting the convergence of the algorithms. Bhatnagar [5] used a faster timescale to update the Hessian, due to which the inverse of the Hessian converges much faster compared to the gradient update. This improves the performance of the N-SF algorithms. We also study the effect of the timescale on the proposed algorithms.


6.3 Experimental Results

For the simulations, we first consider a two-queue network with arrival rates $\lambda_1 = 0.2$ and $\lambda_2 = 0.1$ at the two nodes, respectively. We consider that all customers leaving node 1 enter node 2, i.e., $p_1 = 0$, while customers serviced at node 2 may leave the system with probability $p_2 = 0.4$. We also fix the constants in the service time at $R_1 = 10$ and $R_2 = 20$. The service parameters for both nodes are two-dimensional vectors, $N_1 = N_2 = 2$, with components lying in the interval $\big[\big(\theta^{(j)}_i\big)_{\min}, \big(\theta^{(j)}_i\big)_{\max}\big] = [0.1, 0.6]$ for $i = 1, 2$, $j = 1, 2$. Thus the constrained space $C$ is given by $C = [0.1, 0.6]^4 \subset \mathbb{R}^4$. We fix the target parameter vector at $\bar{\theta} = (0.3, 0.3, 0.3, 0.3)^T$.

The simulations were performed on an Intel Core i5 machine with 3.7 GiB of memory. We run the algorithms varying the values of $q$ and $\beta$, while all the other parameters are held fixed at $M = 10000$, $L = 100$ and $\epsilon = 0.1$. For all cases, the initial parameter is taken to be $\theta(0) = (0.1, 0.1, 0.6, 0.6)^T$. The step-size sequences $(a(n))_{n\geq 0}$ and $(b(n))_{n\geq 0}$ were taken to be $a(n) = \frac{1}{n}$ and $b(n) = \frac{1}{n^{3/4}}$, respectively. For each $(q, \beta)$ pair, 20 independent runs were performed, which took about 10 seconds. We compare the performance of all the proposed algorithms with the four SF algorithms proposed in [5], which use Gaussian smoothing. Figure 6.2 shows the convergence behavior of the proposed q-Gaussian based algorithms with $q = 0.9$, and that of the corresponding Gaussian based algorithms, where the smoothness parameter is $\beta = 0.1$ in all cases.

Figure 6.2: Convergence behavior of various algorithms for β = 0.1.


The following tables present the performance of the gradient based algorithms and the Jacobi variant of the Newton based algorithms. The values of $\beta$ are chosen over the range where the Gaussian SF algorithms perform well [5], whereas the values of $q$ are chosen such that they are spread uniformly over the range $\big(0, 1+\frac{2}{N+2}\big) = \big(0, \frac{4}{3}\big)$. It may be noted that $q = 1$ corresponds to the Gaussian case. We show the distance of the final updates $\theta(M)$ from the optimum $\bar{\theta}$, averaged over 20 trials. The variance of the updates is also shown to indicate the stability of the algorithms.

We observe that the two-simulation algorithms always perform better than their one-simulation counterparts. The results show that, in general, the Gq-SF2 algorithm gives better performance than Nq-SF2. This is due to the approximation considered in the Hessian update, as it has been observed in the literature that the Newton based approach usually performs better than the gradient based one. We now focus on the effect of the value of $q$ used in the algorithms. For each value of $\beta$, the instances where the distance for the q-Gaussian SF is less than that of the Gaussian SF are highlighted. Also, the least distance obtained for each $\beta$ is marked.

In the case of the gradient descent algorithms, it can be seen that better results are usually obtained for $q > 1$. Also, over the interval $(0,1)$, the distance is smaller when the value of $q$ is close to 1 (the Gaussian case). In fact, $q = 0.9$ seems to give better performance than other values of $q$ when all values of $\beta$ are considered. This observation is similar to the results in [18]. The poor performance for low values of $q$ is due to the noise term addressed in Lemma 4.9, whose variance may become quite high as $q$ gets closer to zero. The Newton based algorithms, however, perform well over a wider range of $q$. This may be due to the fact that, in these algorithms, the parameter updates are such that they lead to some amount of cancellation of the noise effects in the gradient and Hessian updates. But, in all the algorithms, it is observed that for higher values of $\beta$ (for example, $\beta = 0.5$), when the Gaussian algorithms tend to become unstable, the q-Gaussian algorithms with $q < 1$ are relatively more stable, whereas those with $q > 1$ show unstable results (high variance). Thus, it can be claimed that for higher values of $q$, the region of stability, $(0, \beta_0)$, mentioned in the convergence theorems becomes smaller.


q \ β      0.01           0.025          0.05           0.075          0.1            0.25           0.5
0.1        0.2353±0.0946  0.1763±0.0751  0.1024±0.0620  0.0975±0.0993  0.0668±0.0553  0.0772±0.0772  0.0954±0.0986
0.2        0.2378±0.1056  0.1809±0.1040  0.0974±0.0594  0.0741±0.0424  0.1052±0.0992  0.0670±0.0392  0.1380±0.1088
0.3        0.2452±0.1142  0.1624±0.0975  0.0802±0.0389  0.0943±0.0737  0.0545±0.0485  0.0607±0.0365  0.1212±0.0908
0.4        0.2603±0.1031  0.1659±0.0820  0.1074±0.0733  0.0862±0.0587  0.0715±0.0361  0.0891±0.0796  0.1566±0.1217
0.5        0.2524±0.1157  0.1675±0.1085  0.1267±0.0802  0.0810±0.0508  0.0841±0.0841  0.0713±0.0786  0.1346±0.1039
0.6        0.2174±0.0724  0.1444±0.0558  0.0751±0.0509  0.0888±0.0981  0.0610±0.0359  0.0892±0.0730  0.1467±0.1082
0.7        0.2011±0.0786  0.1151±0.0515  0.0616±0.0244  0.0794±0.0853  0.0647±0.0668  0.0603±0.0396  0.0950±0.0350
0.8        0.1754±0.1072  0.1012±0.0691  0.0481±0.0379  0.0333±0.0145  0.0298±0.0130  0.0563±0.0966  0.1037±0.0393
0.9        0.1798±0.0614  0.0768±0.0286  0.0371±0.0160  0.0262±0.0078  0.0244±0.0083  0.0243±0.0075  0.0936±0.0302
Gaussian   0.1664±0.0929  0.0653±0.0286  0.0387±0.0152  0.0274±0.0096  0.0243±0.0075  0.0258±0.0115  0.1173±0.0530
1.1        0.1523±0.0652  0.0767±0.0203  0.0398±0.0133  0.0296±0.0114  0.0204±0.0081  0.0246±0.0104  0.1326±0.0550
1.2        0.2018±0.0765  0.0746±0.0245  0.0353±0.0148  0.0274±0.0125  0.0264±0.0072  0.0312±0.0114  0.2208±0.0727
1.3        0.2012±0.0727  0.0961±0.0215  0.0413±0.0205  0.0306±0.0079  0.0275±0.0085  0.0336±0.0107  0.3238±0.0776

Table 6.1: Performance of Gq-SF1 algorithm for different values of q and β.

q \ β      0.01           0.025          0.05           0.075          0.1            0.25           0.5
0.1        0.1587±0.1472  0.0623±0.0301  0.0734±0.1287  0.0184±0.0045  0.0168±0.0059  0.0243±0.0182  0.0290±0.0171
0.2        0.0702±0.0569  0.0455±0.0344  0.0204±0.0098  0.0759±0.1204  0.0185±0.0088  0.0393±0.0437  0.0670±0.0607
0.3        0.1423±0.0775  0.0659±0.0748  0.1402±0.1787  0.0210±0.0110  0.0228±0.0125  0.0409±0.0648  0.0549±0.0474
0.4        0.1101±0.0599  0.1124±0.1545  0.0274±0.0170  0.0190±0.0095  0.0169±0.0095  0.0146±0.0036  0.0538±0.0301
0.5        0.1243±0.0740  0.0460±0.0344  0.0268±0.0255  0.0256±0.0170  0.0163±0.0121  0.0480±0.0595  0.0801±0.0557
0.6        0.1090±0.0708  0.0387±0.0198  0.0285±0.0164  0.0536±0.0554  0.0363±0.0380  0.0277±0.0186  0.0754±0.0384
0.7        0.0588±0.0400  0.0307±0.0076  0.0262±0.0203  0.0173±0.0097  0.0754±0.1131  0.0335±0.0314  0.0804±0.1089
0.8        0.0544±0.0176  0.0238±0.0034  0.0196±0.0093  0.0093±0.0043  0.0075±0.0050  0.0150±0.0056  0.0473±0.0232
0.9        0.0490±0.0203  0.0213±0.0070  0.0114±0.0037  0.0090±0.0017  0.0061±0.0021  0.0088±0.0034  0.0466±0.0166
Gaussian   0.0501±0.0118  0.0215±0.0067  0.0112±0.0037  0.0062±0.0017  0.0066±0.0020  0.0097±0.0032  0.0742±0.0351
1.1        0.0517±0.0125  0.0236±0.0042  0.0113±0.0032  0.0077±0.0026  0.0062±0.0015  0.0104±0.0034  0.0773±0.0325
1.2        0.0531±0.0195  0.0204±0.0075  0.0107±0.0050  0.0071±0.0025  0.0056±0.0023  0.0113±0.0036  0.1471±0.0685
1.3        0.0529±0.0112  0.0208±0.0064  0.0101±0.0029  0.0079±0.0021  0.0064±0.0007  0.0159±0.0053  0.2196±0.0543

Table 6.2: Performance of Gq-SF2 algorithm for different values of q and β.


q \ β      0.01           0.025          0.05           0.075          0.1            0.25           0.5
0.1        0.2913±0.0714  0.1971±0.0675  0.0855±0.0229  0.0575±0.0165  0.0518±0.0205  0.0480±0.0187  0.1145±0.0458
0.2        0.2640±0.0774  0.1322±0.0484  0.0882±0.0376  0.0587±0.0210  0.0515±0.0168  0.0448±0.0148  0.0813±0.0329
0.3        0.3055±0.0966  0.1952±0.0596  0.0894±0.0352  0.0641±0.0206  0.0469±0.0187  0.0460±0.0153  0.1221±0.0420
0.4        0.2592±0.0823  0.1891±0.0865  0.0940±0.0375  0.0588±0.0173  0.0505±0.0187  0.0480±0.0253  0.1057±0.0319
0.5        0.2892±0.0837  0.1549±0.0612  0.0805±0.0326  0.0589±0.0193  0.0517±0.0172  0.0493±0.0194  0.1010±0.0315
0.6        0.2344±0.0858  0.1356±0.0614  0.0737±0.0276  0.0580±0.0149  0.0470±0.0203  0.0450±0.0191  0.0973±0.0395
0.7        0.2660±0.0923  0.1621±0.0682  0.0826±0.0305  0.0674±0.0207  0.0457±0.0144  0.0440±0.0162  0.1308±0.0412
0.8        0.2737±0.0805  0.1776±0.0851  0.0842±0.0237  0.0556±0.0209  0.0462±0.0178  0.0445±0.0117  0.1055±0.0364
0.9        0.2597±0.0826  0.1554±0.0647  0.0739±0.0300  0.0524±0.0175  0.0391±0.0102  0.0431±0.0177  0.1340±0.0686
Gaussian   0.2356±0.0844  0.1226±0.0593  0.0761±0.0321  0.0539±0.0199  0.0440±0.0131  0.0427±0.0208  0.1804±0.0676
1.1        0.2299±0.1069  0.1420±0.0423  0.0681±0.0194  0.0548±0.0177  0.0455±0.0179  0.0452±0.0142  0.1919±0.0773
1.2        0.2126±0.0650  0.1267±0.0609  0.0754±0.0333  0.0720±0.0189  0.0450±0.0231  0.0396±0.0142  0.2898±0.0786
1.3        0.2621±0.0887  0.1920±0.0835  0.0970±0.0366  0.0664±0.0265  0.0608±0.0234  0.0702±0.0352  0.3631±0.0776

Table 6.3: Performance of Jacobi variant of Nq-SF1 algorithm for different values of q and β.

q \ β      0.01           0.025          0.05           0.075          0.1            0.25           0.5
0.1        0.0788±0.0372  0.0281±0.0092  0.0165±0.0061  0.0120±0.0026  0.0088±0.0035  0.0121±0.0052  0.0421±0.0197
0.2        0.0883±0.0322  0.0305±0.0082  0.0182±0.0073  0.0129±0.0030  0.0087±0.0026  0.0097±0.0045  0.0287±0.0123
0.3        0.0951±0.0345  0.0372±0.0107  0.0167±0.0040  0.0116±0.0042  0.0075±0.0034  0.0098±0.0034  0.0348±0.0112
0.4        0.0726±0.0241  0.0251±0.0089  0.0162±0.0055  0.0109±0.0023  0.0113±0.0043  0.0103±0.0036  0.0365±0.0092
0.5        0.0954±0.0329  0.0277±0.0097  0.0145±0.0041  0.0112±0.0037  0.0108±0.0036  0.0101±0.0029  0.0279±0.0140
0.6        0.0808±0.0297  0.0318±0.0132  0.0147±0.0042  0.0115±0.0039  0.0102±0.0026  0.0098±0.0029  0.0346±0.0128
0.7        0.0693±0.0328  0.0247±0.0108  0.0130±0.0052  0.0099±0.0043  0.0091±0.0028  0.0105±0.0035  0.0458±0.0157
0.8        0.0796±0.0308  0.0320±0.0100  0.0164±0.0051  0.0102±0.0043  0.0113±0.0034  0.0100±0.0035  0.0339±0.0090
0.9        0.0612±0.0220  0.0292±0.0055  0.0146±0.0027  0.0108±0.0035  0.0104±0.0037  0.0105±0.0044  0.0484±0.0221
Gaussian   0.0718±0.0222  0.0248±0.0116  0.0117±0.0046  0.0115±0.0031  0.0087±0.0023  0.0100±0.0028  0.0690±0.0287
1.1        0.0680±0.0263  0.0240±0.0111  0.0147±0.0055  0.0127±0.0037  0.0074±0.0025  0.0090±0.0053  0.0969±0.0432
1.2        0.0629±0.0219  0.0257±0.0110  0.0141±0.0040  0.0103±0.0041  0.0095±0.0036  0.0114±0.0053  0.1024±0.0498
1.3        0.0827±0.0372  0.0523±0.0212  0.0209±0.0097  0.0134±0.0044  0.0104±0.0033  0.0257±0.0146  0.2108±0.0849

Table 6.4: Performance of Jacobi variant of Nq-SF2 algorithm for different values of q and β.


We study the effect of the timescales on the proposed algorithms. Since both the one-simulation and two-simulation algorithms involve the step-sizes in a similar way, we consider only the Gq-SF2 and Nq-SF2 algorithms. As mentioned in the previous section, we can update the Hessian on a different timescale, independent of the gradient estimation. We use a step-size sequence $(c(n))$ for the Hessian estimation. We fix the step-size sequence corresponding to the slower timescale at $a(n) = \frac{1}{n}$, $n \geq 1$, and let the other sequences be of the form $b(n) = \frac{1}{n^\gamma}$, $c(n) = \frac{1}{n^\delta}$.

q \ δ      0.55           0.65           0.75           Gq-SF2

0.1 0.0133±0.0059 0.0095±0.0031 0.0087±0.0030 0.0229±0.0064

0.2 0.0089±0.0030 0.0095±0.0024 0.0097±0.0038 0.0243±0.0253

0.3 0.0094±0.0038 0.0110±0.0041 0.0100±0.0036 0.0473±0.0778

0.4 0.0112±0.0039 0.0104±0.0040 0.0097±0.0031 0.0250±0.0216

0.5 0.0111±0.0053 0.0108±0.0030 0.0101±0.0045 0.0403±0.0330

0.6 0.0114±0.0047 0.0100±0.0051 0.0080±0.0014 0.0505±0.0944

0.7 0.0121±0.0033 0.0101±0.0040 0.0087±0.0031 0.0098±0.0040

0.8 0.0109±0.0028 0.0098±0.0029 0.0096±0.0032 0.0101±0.0053

0.9 0.0084±0.0031 0.0089±0.0040 0.0095±0.0044 0.0085±0.0018

Gaussian 0.0093±0.0028 0.0076±0.0033 0.0088±0.0032 0.0067±0.0021

1.1 0.0075±0.0042 0.0070±0.0023 0.0080±0.0028 0.0068±0.0028

1.2 0.0086±0.0030 0.0070±0.0019 0.0078±0.0036 0.0079±0.0012

1.3 0.0072±0.0031 0.0091±0.0024 0.0102±0.0048 0.0086±0.0028

Table 6.5: Performance of Gq-SF2 and Nq-SF2 (Jacobi variant) algorithms for different values of q and δ, where β = 0.1 and the step-sizes are a(n) = 1/n, b(n) = 1/n^{0.75}, c(n) = 1/n^δ.

q \ δ      0.55           0.65           0.75           0.85           Gq-SF2

0.1 0.0123±0.0076 0.0343±0.0554 0.0115±0.0025 0.0115±0.0031 0.0302±0.0197

0.2 0.0416±0.0618 0.0123±0.0038 0.0091±0.0031 0.0105±0.0029 0.0232±0.0162

0.3 0.0173±0.0098 0.0147±0.0124 0.0092±0.0044 0.0119±0.0032 0.0452±0.0373

0.4 0.0255±0.0222 0.0187±0.0088 0.0121±0.0037 0.0096±0.0036 0.0187±0.0114

0.5 0.0203±0.0096 0.0108±0.0041 0.0103±0.0028 0.0105±0.0027 0.0424±0.0702

0.6 0.0160±0.0140 0.0126±0.0038 0.0096±0.0036 0.0114±0.0027 0.0223±0.0202

0.7 0.0180±0.0224 0.0115±0.0036 0.0078±0.0033 0.0098±0.0035 0.0118±0.0095

0.8 0.0148±0.0060 0.0083±0.0031 0.0095±0.0031 0.0104±0.0027 0.0055±0.0028

0.9 0.0096±0.0042 0.0101±0.0032 0.0075±0.0025 0.0098±0.0034 0.0076±0.0018

Gaussian 0.0096±0.0024 0.0067±0.0042 0.0094±0.0027 0.0077±0.0025 0.0068±0.0015

1.1 0.0096±0.0027 0.0095±0.0039 0.0069±0.0032 0.0074±0.0032 0.0060±0.0013

1.2 0.0086±0.0029 0.0071±0.0023 0.0082±0.0020 0.0087±0.0019 0.0075±0.0010

1.3 0.0087±0.0020 0.0080±0.0037 0.0064±0.0023 0.0139±0.0057 0.0073±0.0023

Table 6.6: Performance of Gq-SF2 and Nq-SF2 (Jacobi variant) algorithms for different values of q and δ, where β = 0.1 and the step-sizes are a(n) = 1/n, b(n) = 1/n^{0.85}, c(n) = 1/n^δ.


In order to satisfy Assumption IV, we need to consider $\gamma, \delta \in (0.5, 1)$. We study the relative performance of the Gq-SF2 and Nq-SF2 (Jacobi variant) algorithms when the Hessian is updated on various timescales. We vary $\delta$ such that $b(n) = o\big(c(n)\big)$, due to reasons discussed in [5]. Tables 6.5 and 6.6 show the effect of $\delta$ on the Nq-SF2 algorithm for various values of $q$. The value of $\beta$ is held fixed at $\beta = 0.1$, and $\gamma$ is fixed at 0.75 and 0.85, respectively. The value of $\delta$ is varied from 0.55 to $\gamma$. We mark the cases where Nq-SF2 performs better than Gq-SF2.

We perform similar experiments in a higher dimensional case. For this, we consider a five-node network with external arrival rate $\lambda_i = 0.2$, $i = 1, \ldots, 5$, at each node. The probability of leaving the system after service at each node is $p_i = 0.2$ for all nodes. The service process of each node is controlled by a 10-dimensional parameter vector, with the constant set at $R_i = 10$, $i = 1, \ldots, 5$. Thus, we have a 50-dimensional constrained optimization problem, where each component can vary over the interval $[0.1, 0.6]$ and the target is 0.3. The parameters of the algorithms are held fixed at $M = 10000$, $L = 100$ and $\epsilon = 0.1$. Each component of the initial parameter vector is taken to be $\theta^{(i)}(0) = 0.6$ for all $i = 1, 2, \ldots, 50$. The step-sizes were taken to be $a(n) = \frac{1}{n}$, $b(n) = \frac{1}{n^{0.75}}$ and $c(n) = \frac{1}{n^{0.65}}$. For each $(q, \beta)$ pair, 20 independent runs were performed, which took about 1 minute on average.

Figure 6.3: Convergence behavior of various algorithms for β = 0.1.


It can be observed that the trend changes in the higher dimensional case. The smaller values of $q$ become more significant, as the noise term does not scale up to be very high compared to the 4-dimensional case. In fact, for the Nq-SF2 algorithm, low values of $q$ (close to 0) make the algorithm more stable and also improve performance significantly. However, we cannot conclude a trend for the best case in the Gq-SF2 algorithm, although small values of $q$ still give better performance.

q \ β      0.025          0.05           0.075          0.1            0.25
0.1        0.7237±0.0514  0.6323±0.0477  0.5721±0.0481  0.5325±0.0329  0.3128±0.0305
0.2        0.6858±0.0834  0.5921±0.0398  0.5278±0.0431  0.4899±0.0279  0.2769±0.0387
0.3        0.6695±0.0591  0.5385±0.0386  0.4913±0.0347  0.4422±0.0328  0.3113±0.1233
0.4        0.6827±0.1260  0.5426±0.1293  0.4687±0.1018  0.4683±0.1974  0.3005±0.1387
0.5        0.7196±0.1721  0.5039±0.0935  0.4798±0.1551  0.4118±0.1226  0.5363±0.3535
0.6        0.8145±0.1709  0.7290±0.3343  0.5164±0.1147  0.5883±0.2394  0.7116±0.2943
0.7        1.1038±0.1390  0.9438±0.2537  0.7993±0.2745  0.7315±0.2545  0.9571±0.3265
0.8        1.1176±0.2065  0.9393±0.2182  0.8644±0.2965  0.7553±0.2653  0.9573±0.2190
0.9        0.9266±0.2015  0.6436±0.3200  0.5634±0.2493  0.5634±0.2893  0.8520±0.2093
Gaussian   0.9548±0.0546  0.8618±0.0208  0.6182±0.0089  0.7225±0.0115  0.9768±0.0720

Table 6.7: Performance of the Gq-SF2 algorithm for different values of q and β.

q \ β      0.025          0.05           0.075          0.1            0.25
0.1        1.0572±0.1257  0.3125±0.0316  0.2306±0.0279  0.2013±0.0233  0.4786±0.0745
0.2        1.0994±0.1117  0.3604±0.0638  0.2318±0.0202  0.2205±0.0279  0.6431±0.1135
0.3        1.2004±0.0975  0.5092±0.1137  0.2917±0.0242  0.2664±0.0307  0.8322±0.1238
0.4        1.2188±0.0887  0.7111±0.1847  0.3529±0.0470  0.3280±0.0463  1.0707±0.1847
0.5        1.3280±0.1547  0.9714±0.2154  0.5552±0.1214  0.4624±0.1020  1.2536±0.1482
0.6        1.4026±0.1281  1.1468±0.1503  0.8221±0.1575  0.7008±0.1915  1.3794±0.1196
0.7        1.4614±0.1136  1.2622±0.1383  1.0808±0.1629  0.9673±0.1896  1.4844±0.1196
0.8        1.4725±0.0939  1.3228±0.1452  1.1361±0.1636  1.0394±0.1783  1.5204±0.1283
0.9        1.5142±0.0971  1.3295±0.1019  1.1470±0.1248  1.1067±0.1631  1.5617±0.1039
Gaussian   1.4739±0.0820  1.2346±0.0934  1.0459±0.1641  1.0301±0.1617  1.6134±0.0823

Table 6.8: Performance of the Nq-SF2 (Jacobi variant) algorithm for different values of q and β.


Chapter 7

Conclusions

The power-law behavior of q-Gaussians provides better control over the smoothing of functions as compared to the Gaussian distribution. This property allows better tuning of algorithms that involve such distributions as smoothing kernels, and gives a better trade-off between local fluctuations and the overall error incurred while smoothing.

We have extended the Gaussian smoothed functional approach for gradient and Hessian estimation to the q-Gaussian case, and developed optimization algorithms based on it. We propose four two-timescale algorithms using gradient and Newton based search. These algorithms generalize the ones proposed in [5]. We use a queuing network example to show that for some values of q, the results provided by the proposed algorithms are significantly better than those of the Gaussian SF algorithms.

We also present proofs of convergence of the proposed algorithms, providing conditions under which the algorithms converge to a local minimum of the objective function. In the course of the convergence analysis, we come across some interesting properties of the multivariate q-Gaussian distribution. We show that the q-Gaussian satisfies the Rubinstein conditions [35], and also provide an expression for the higher order generalized co-moments of the distribution.


Bibliography

[1] S. Abe and N. Suzuki. Itineration of the internet over nonequilibrium stationary

states in Tsallis statistics. Physical Review E, 67(016106), 2003.

[2] S. Abe and N. Suzuki. Scale-free statistics of time interval between successive earth-

quakes. Physica A: Statistical Mechanics and its Applications, 350:588–596, 2005.

[3] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science,

286:509–512, 1999.

[4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.

[5] S. Bhatnagar. Adaptive Newton-based multivariate smoothed functional algorithms

for simulation optimization. ACM Transactions on Modeling and Computer Simula-

tion, 18(1):27–62, 2007.

[6] S. Bhatnagar and V. S. Borkar. Two timescale stochastic approximation scheme

for simulation-based parametric optimization. Probability in the Engineering and

Informational Sciences, 12:519–531, 1998.

[7] S. Bhatnagar and V. S. Borkar. Multiscale chaotic SPSA and smoothed functional

algorithms for simulation optimization. Simulation, 79(9):568–580, 2003.

[8] S. Bhatnagar, M. C. Fu, S. I. Marcus, and S. Bhatnagar. Two timescale algorithms for

simulation optimization of hidden Markov models. IIE Transactions, 33(3):245–258,

2001.


[9] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I. J. Wang. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation, 13(2):180–209, 2003.
[10] E. P. Borges. A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Physica A: Statistical Mechanics and its Applications, 340:95–101, 2004.
[11] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[12] J. Costa, A. Hero, and C. Vignat. On solutions to multivariate maximum α-entropy problems. Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, 2683:211–226, 2003.
[13] Z. Daroczy. Generalized information functions. Information and Control, 16(1):36–51, 1970.
[14] A. Dukkipati, S. Bhatnagar, and M. N. Murty. Gelfand-Yaglom-Perez theorem for generalized relative entropy functionals. Information Sciences, 177(24):5707–5714, 2007.
[15] A. Dukkipati, S. Bhatnagar, and M. N. Murty. On measure-theoretic aspects of nonextensive entropy functionals and corresponding maximum entropy prescriptions. Physica A: Statistical Mechanics and its Applications, 384(2):758–774, 2007.
[16] V. Fabian. Stochastic approximation. In J. J. Rustagi, editor, Optimizing Methods in Statistics, pages 439–470, New York, 1971. Academic Press.
[17] J. E. Gentle, W. Hardle, and Y. Mori. Handbook of Computational Statistics: Concepts and Methods. Springer, 2004.
[18] D. Ghoshdastidar, A. Dukkipati, and S. Bhatnagar. q-Gaussian based smoothed functional algorithms for stochastic optimization. In International Symposium on Information Theory. IEEE, 2012.


[19] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series and Products (5th ed.). Elsevier, 1994.
[20] J. Havrda and F. Charvat. Quantification method of classification processes: Concept of structural a-entropy. Kybernetika, 3(1):30–35, 1967.
[21] M. W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks, 2:331–349, 1989.
[22] Y. C. Ho and X. R. Cao. Perturbation Analysis of Discrete Event Dynamical Systems. Kluwer Academic Publishers, 1991.
[23] E. T. Jaynes. Information theory and statistical mechanics. The Physical Review, 106(4):620–630, 1957.
[24] V. Y. A. Katkovnik and Y. U. Kulchitsky. Convergence of a class of random search algorithms. Automation and Remote Control, 8:1321–1326, 1972.
[25] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462–466, 1952.
[26] A. N. Kolmogorov. New metric invariant of transitive dynamical systems and endomorphisms of Lebesgue spaces. Doklady of Russian Academy of Sciences, 119(5):861–864, 1958.
[27] S. Kullback. Information Theory and Statistics. John Wiley and Sons, N.Y., 1959.
[28] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York, 1978.
[29] P. L'Ecuyer and P. W. Glynn. Stochastic optimization by simulation: Convergence proofs for the GI/G/1 queue in steady state. Management Science, 40(11):1562–1578, 1994.
[30] V. Pareto. Manuale di economia politica. Societa Editrice Libraria, 1906.


[31] A. Perez. Risk estimates in terms of generalized f-entropies. In Proceedings of the Colloquium on Information Theory, Debrecen 1967, pages 299–315, Budapest, 1968. Janos Bolyai Mathematical Society.
[32] D. Prato and C. Tsallis. Nonextensive foundation of Levy distributions. Physical Review E, 60(2):2398–2401, 1999.
[33] A. Renyi. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1960, volume 1, pages 547–561, Berkeley, California, 1961. University of California Press.
[34] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
[35] R. Y. Rubinstein. Simulation and the Monte Carlo Method. John Wiley, New York, 1981.
[36] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. Annals of Statistics, 13:236–245, 1985.
[37] A. H. Sato. q-Gaussian distributions and multiplicative stochastic processes for analysis of multiple financial time series. Journal of Physics: Conference Series, 201(012008), 2010.
[38] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27, 1948.
[39] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–334, 1992.
[40] J. C. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45:1839–1853, 2000.


[41] M. A. Styblinski and T. S. Tang. Experiments in nonconvex optimization: Stochastic approximation with function smoothing and simulated annealing. Neural Networks, 3(4):467–483, 1990.
[42] H. Suyari. Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy. IEEE Transactions on Information Theory, 50:1783–1787, 2004.
[43] W. J. Thistleton, J. A. Marsh, K. Nelson, and C. Tsallis. Generalized Box-Muller method for generating q-Gaussian random deviates. IEEE Transactions on Information Theory, 53(12):4805–4810, 2007.
[44] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487, 1988.
[45] C. Tsallis. Some comments on Boltzmann-Gibbs statistical mechanics. Chaos, Solitons & Fractals, 6:539–559, 1995.
[46] C. Tsallis, R. S. Mendes, and A. R. Plastino. The role of constraints within generalized nonextensive statistics. Physica A: Statistical Mechanics and its Applications, 261(3–4):534–554, 1998.
[47] S. Umarov and C. Tsallis. Multivariate generalizations of the q-central limit theorem. arXiv:cond-mat/0703533, 2007.
[48] C. Vignat and A. Plastino. Central limit theorem and deformed exponentials. Journal of Physics A: Mathematical and Theoretical, 40(45), 2007.
[49] D. Williams. Probability with Martingales. Cambridge University Press, 1991.
[50] X. Zhu and J. C. Spall. A modified second-order SPSA optimization algorithm for finite samples. International Journal of Adaptive Control and Signal Processing, 16:397–409, 2002.