
Optimization

Background:

• Problem: given a function f(x) defined on X , find x∗ such that f(x∗) ≥ f(x) for all x ∈ X .

The value x∗ is called a maximizer of f and is written argmax_X f.

• In general, argmax_X f may not be unique. It will be unique if −f is strictly convex. A function g is strictly convex if

g(λx + (1 − λ)y) < λg(x) + (1 − λ)g(y) for all 0 < λ < 1 and all x ≠ y ∈ X .

A twice continuously differentiable function g is strictly convex if its Hessian matrix Hij(x) = ∂²g(x)/∂xi∂xj is positive definite for all x. (Positive definiteness is sufficient but not necessary: g(x) = x⁴ is strictly convex even though g′′(0) = 0.)

• If X is discrete the problem is known as combinatorial optimization.

Methods for smooth functions in one dimension:

• Suppose X = R. A bracket is a triple of values x1 < x2 < x3 such that f(x2) ≥ max(f(x1), f(x3)). If f is continuous, a bracket must contain a local maximizer.

• There are no good generic methods for finding a bracket. If −f is convex, we can always find a bracket by letting x1 → −∞, x3 → ∞, and placing x2 very close to x1 (if f(x1) > f(x3)) or x3 (if f(x3) > f(x1)).

• Bisection: Suppose we have a bracket x1 < x2 < x3. Define a new point x∗ equal to either (x1 + x2)/2 or (x2 + x3)/2, depending on whether x2 − x1 > x3 − x2 or x2 − x1 < x3 − x2. In the former case, there are two possibilities for the location of f(x∗) relative to f(x2), each of which allows us to shrink the bracket:

1. f(x∗) > f(x2) ⇒ (x1, x2, x3) → (x1, x∗, x2).

2. f(x∗) ≤ f(x2) ⇒ (x1, x2, x3) → (x∗, x2, x3).

Similarly, in the latter case, one of the following must hold:

3. f(x∗) > f(x2) ⇒ (x1, x2, x3) → (x2, x∗, x3).

4. f(x∗) ≤ f(x2) ⇒ (x1, x2, x3) → (x1, x2, x∗).

The difference x3 − x1 decreases by a factor no greater than 3/4 at each iteration, so the convergence is linear.


• Example: Let ρ(x) = x² if |x| < c and ρ(x) = c(2|x| − c) if |x| ≥ c, where c is a constant (c = 1.345 is a good choice). Given data X1, X2, . . . , Xn, minimizing

∑_i ρ(Xi − µ)

over µ gives the “Winsorized mean”, a robust estimate of the center of the distribution. The following Octave code calculates the Winsorized mean using bisection:


## Calculate Q(x) = sum_i rho(X_i - x) for the data in X, with scale parameter c.

function Q = WM(X, x, c)

A = abs(X-x);

I = (A < c);

Z = (A.^2).*I + c*(2*A-c).*(1-I);

Q = sum(Z);

endfunction

## A standard value for c.

c = 1.345;

## Use the arithmetic mean as a starting value for the center bracket point.

mu(2) = mean(X);

F(2) = WM(X, mu(2), c);

## Get the left bracket point.

mu(1) = mu(2) - 1;

while (1)

f = WM(X, mu(1), c);

if (f > F(2))

F(1) = f;

break;

endif

mu(1) = mu(1) - 1;

endwhile

## Get the right bracket point.

mu(3) = mu(2) + 1;

while (1)

f = WM(X, mu(3), c);

if (f > F(2))

F(3) = f;

break;

endif

mu(3) = mu(3) + 1;

endwhile


## Bisection.

while (mu(3)-mu(1) > 1e-6)

## Work on the left interval.

if (mu(2)-mu(1) > mu(3)-mu(2))

## The location and function value of the new point.

mm = mean(mu(1:2));

ff = WM(X, mm, c);

if (ff < F(2))

mu = [mu(1), mm, mu(2)];

F = [F(1), ff, F(2)];

else

mu = [mm, mu(2), mu(3)];

F = [ff, F(2), F(3)];

endif

## Work on the right interval.

else

## The location and function value of the new point.

mm = mean(mu(2:3));

ff = WM(X, mm, c);

if (ff < F(2))

mu = [mu(2), mm, mu(3)];

F = [F(2), ff, F(3)];

else

mu = [mu(1), mu(2), mm];

F = [F(1), F(2), ff];

endif

endif

endwhile

Newton’s Method

• Suppose f has two continuous derivatives. Then all interior local maxima x of f satisfy f′(x) = 0.


• More generally, write the vector of partial derivatives of f with respect to x (the gradient of f) as follows:

∇f(x) = (∂f/∂x1, . . . , ∂f/∂xm)′.

An interior local maximizer x of f satisfies ∇f(x) = 0.

• The Hessian of f, denoted Hf, is an m × m matrix for each value of x, defined by (Hf)ij = ∂²f/∂xi∂xj. For twice continuously differentiable functions, Hf is symmetric at each x. If −f is convex in a neighborhood of a local maximizer x, then −Hf(x) is positive semidefinite there (positive definite if −f is strictly convex).

• Suppose f has a continuous Hessian at x0. Then we can approximate f quadratically in a neighborhood of x0 using

f(x) ≈ f(x0) + ∇f(x0)′(x − x0) + (1/2)(x − x0)′Hf(x0)(x − x0).

• This leads to the following approximation to the gradient:

∇f(x) ≈ ∇f(x0) + Hf(x0)(x − x0).

We can solve this expression for the stationary point ∇f(x) = 0, giving:

x = x0 − Hf(x0)^{-1}∇f(x0).

• This is the Newton-Raphson method for multidimensional optimization. We define a sequence of iterates starting at an arbitrary value x0, and update using the rule xi+1 = xi − Hf(xi)^{-1}∇f(xi).

• The Newton-Raphson algorithm converges at a quadratic rate near the maximizer whenever −f is convex and f has two continuous derivatives; far from the maximizer the steps may need to be damped to guarantee convergence.
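As a concrete illustration (not part of the original notes; the objective function is an arbitrary choice), here is a minimal Octave sketch of the Newton-Raphson update in one dimension, applied to the concave function f(x) = log(x) − x, whose maximizer is x = 1.

## Maximize f(x) = log(x) - x by Newton-Raphson.
## f'(x) = 1/x - 1 and f''(x) = -1/x^2, so the maximizer is x = 1.
x = 0.2;                  ## starting value
for it = 1:50
  g = 1/x - 1;            ## gradient f'(x)
  h = -1/x^2;             ## Hessian f''(x)
  step = h \ g;           ## Newton step (H \ g in higher dimensions)
  x = x - step;
  if (abs(step) < 1e-10)  ## stop when the update is negligible
    break;
  endif
endfor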

Maximum Likelihood

• Suppose we observe independent realizations Y1, . . . , Yn, where the density of Yi is fi(·; θ), and θ ∈ R^d is an unknown parameter. Then the joint density is given by

p(Y1, . . . , Yn) = ∏_{i=1}^n fi(Yi; θ),

and the log-likelihood is given by

L(θ|Y1, . . . , Yn) = ∑_{i=1}^n log fi(Yi; θ).


• The maximum likelihood estimator (MLE) of θ is defined to be argmax_θ L(θ|Y1, . . . , Yn).

• In this setting, the gradient ∇L(θ) is known as the score function, and is often written s(θ). It has the form

s(θ) = ∑_i ∇fi(Yi; θ)/fi(Yi; θ) = ∑_i si(θ).

• The score function is a random vector with expected value 0:

Eθ s(θ) = ∑_i ∫ (∇fi(Yi; θ)/fi(Yi; θ)) fi(Yi; θ) dYi

= ∑_i ∫ ∇fi(Yi; θ) dYi

= ∑_i ∇ ∫ fi(Yi; θ) dYi

= 0,

where the last two steps interchange differentiation and integration and use ∫ fi(Yi; θ) dYi = 1.

• The Hessian of the log-likelihood function has the form:

HL(θ) = ∑_{i=1}^n [ fi(Yi; θ)Hfi(θ) − ∇fi(Yi; θ)∇fi(Yi; θ)′ ] / fi²(Yi; θ).

• We can write

Eθ [ fi(Yi; θ)Hfi(θ)/fi²(Yi; θ) ] = ∫ Hfi(θ) dYi = ∇² ∫ fi(Yi; θ) dYi = 0.

Thus,

Eθ HL(θ) = −∑_{i=1}^n ∫ si(θ)si(θ)′ fi(Yi; θ) dYi = −∑_i covθ(si(θ)) = −covθ(s(θ)).

The quantity covθ(s(θ)) = −Eθ HL(θ) is the Fisher information, denoted I(θ).

• The Fisher information matrix is positive semidefinite (it is a covariance matrix), and −I can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update:

θi+1 = θi + I(θi)^{-1}s(θi).

This is the Fisher scoring algorithm. It has essentially the same convergence properties as Newton-Raphson, but it is often easier to compute I than HL.

The Fisher information matrix also gives the inverse of the asymptotic variance of the MLE θ̂.


• Example: Logistic regression. Suppose we observe independent pairs (Yi, Xi), i = 1, . . . , n with Yi ∈ {0, 1}, and the probability law is given by:

P(Yi = 1|Xi) = exp(β′Xi)/(1 + exp(β′Xi))

P(Yi = 0|Xi) = 1/(1 + exp(β′Xi)).

Our goal is to compute the MLE of β. The joint likelihood is

P(Y1, . . . , Yn|X1, . . . , Xn) = ∏_{i:Yi=1} [exp(β′Xi)/(1 + exp(β′Xi))] · ∏_{i:Yi=0} [1/(1 + exp(β′Xi))].

The log-likelihood function is:

L(β) = β′ ∑_{i:Yi=1} Xi − ∑_i log(1 + exp(β′Xi)).

The score function is:

s(β) = ∑_{i:Yi=1} Xi − ∑_i [exp(β′Xi)/(1 + exp(β′Xi))] Xi.

The Hessian of L is:

HL(β) = −∑_i [exp(β′Xi)/(1 + exp(β′Xi))²] XiXi′.

The score function is a linear combination of the Xi, and the information matrix is a linear combination of the XiXi′. This is typical in exponential family regression models.

The coefficients of the XiXi′ in the expression for HL(β) are negative. Thus HL is negative definite as long as the Xi span R^m, and it follows that −L is globally convex, so there is a unique local maximizer, which is also the global maximizer.

The Yi do not appear in HL(β), so the observed information −HL(β) coincides with the expected (Fisher) information I(β).
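Since the observed and expected information coincide here, Newton-Raphson and Fisher scoring are the same algorithm. The following Octave sketch (an illustration, not part of the original notes; the function name and convergence constants are arbitrary choices) implements the update β ← β − HL(β)^{-1}s(β) using the score and Hessian formulas above, assuming X is an n × m design matrix and Y an n × 1 vector of 0/1 responses.

## Newton-Raphson (= Fisher scoring) for logistic regression.
## X is n x m, Y is n x 1 with entries in {0,1}.
function b = LogitFit(Y, X)
  b = zeros(columns(X), 1);
  for it = 1:50
    eta = X*b;
    p = exp(eta) ./ (1 + exp(eta));  ## P(Y=1|X) at the current beta
    G = X' * (Y - p);                ## score s(beta)
    H = -X' * diag(p .* (1-p)) * X;  ## Hessian H_L(beta)
    if (norm(G) < 1e-8)
      break;
    endif
    b = b - H \ G;                   ## Newton step
  endfor
endfunction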

• Example: Heteroscedastic regression

Suppose we are fitting a linear regression model in which E(Y|X) = α + βX and var(Y|X) = cX². We wish to estimate α, β, and c using maximum likelihood. The log-likelihood (up to an additive constant) is

L(α, β, c) = −(1/2) ∑_i (Yi − α − βXi)²/(cXi²) − n log(c)/2 − ∑_i log(Xi²)/2.

The gradient is

∇L(α, β, c) = (1/2) ∑_i ( 2ri/(cXi²), 2ri/(cXi), ri²/(c²Xi²) − 1/c )′,

where ri = Yi − α − βXi. The Hessian matrix is

HL = −∑_i [ 1/(cXi²), 1/(cXi), ri/(c²Xi²) ;
           1/(cXi), 1/c, ri/(c²Xi) ;
           ri/(c²Xi²), ri/(c²Xi), ri²/(c³Xi²) − 1/(2c²) ].

Using E(ri) = 0 and E(ri²) = cXi², the expected Hessian is:

E HL = −∑_i [ 1/(cXi²), 1/(cXi), 0 ;
             1/(cXi), 1/c, 0 ;
             0, 0, 1/(2c²) ].

Here is the Octave code to fit this model using Fisher scoring:

## Calculate the score function.

function G = score(alpha, beta, c, Y, X)

G = zeros(3,1); ## return the score as a column vector

## The residuals.

R = Y - alpha - beta*X;

## The gradient with respect to alpha, beta, and c.

G(1) = sum(R./X.^2)/c;

G(2) = sum(R./X)/c;

G(3) = sum(R.^2./X.^2)/(2*c^2) - length(Y)/(2*c);

endfunction

## Calculate the expected Hessian matrix.

function H = EHess(c, X)

H = zeros(3,3); ## entries not set below are zero in expectation

H(1,1) = -sum(1./X.^2)/c;

H(1,2) = -sum(1./X)/c;

H(2,1) = -sum(1./X)/c;

H(2,2) = -length(X)/c;

H(3,3) = -length(X)/(2*c^2);

endfunction


## Calculate the maximum likelihood estimates using Fisher scoring.

## Q contains (alpha, beta, c) on return.

function Q = Fit(Y, X)

Q = zeros(3,1); ## Q holds (alpha, beta, c) as a column vector

## Starting values from OLS.

Q(2) = cov(Y,X)/var(X);

Q(1) = mean(Y) - Q(2)*mean(X);

Q(3) = 1;

while (1)

G = score(Q(1), Q(2), Q(3), Y, X);

## Test for convergence.

if (norm(G) < 1e-6)

break;

endif

## Take a step.

H = EHess(Q(3), X);

Q = Q - H \ G;

endwhile

endfunction

Methods using only one derivative

• In many cases, it is practical to compute the first derivative of L(θ), but not the Hessian or Fisher information matrix. In this case, we know that ∇L(θ) gives the steepest uphill direction from a given position θ.

One strategy is to always move in the direction of ∇L(θ). That is, if we are currently at a point θi, the next iterate is

θi+1 = θi + λ∇L(θi).

We do not know in advance how far to move in the direction ∇L(θ) (i.e. the value of λ), so we can use a one-dimensional line search (e.g. bisection) to find the best value. The resulting algorithm is known as the steepest ascent algorithm:


1. Start at θ0, set i = 0.

2. Set Gi = ∇L(θi).

3. Use a one-dimensional line search to obtain λi = argmax_λ L(θi + λGi).

4. Set θi+1 = θi + λiGi, increment i, and return to step 2.

There are many variants of this algorithm. The main issue is whether a complete line search is done in step 3, or whether we take a single uphill step, or a fixed number of uphill steps. The best strategy in a given situation depends on the relative expense of evaluating L(θ) and ∇L(θ).

The steepest ascent algorithm never takes a downhill step.

The steepest ascent algorithm almost always converges to a local or global maximum in practice. In theory the method can converge to a saddle point, but a saddle point is an unstable limit and hence is rarely reached in practice.
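As an illustration of the single-uphill-step variant (not from the notes; the function handles and backtracking constants are arbitrary choices), here is a minimal Octave sketch of steepest ascent with a crude backtracking line search.

## Steepest ascent with a backtracking line search.
## f and g are handles returning the objective and its gradient.
function x = SteepAscent(f, g, x)
  for it = 1:10000
    G = g(x);
    if (norm(G) < 1e-8)
      break;
    endif
    lam = 1;
    ## Halve the step until it goes uphill.
    while (f(x + lam*G) <= f(x) && lam > 1e-12)
      lam = lam/2;
    endwhile
    x = x + lam*G;
  endfor
endfunction

## Example: maximize f(x,y) = -(x-1)^2 - 2*(y+3)^2.
f = @(v) -(v(1)-1)^2 - 2*(v(2)+3)^2;
g = @(v) [-2*(v(1)-1); -4*(v(2)+3)];
xopt = SteepAscent(f, g, [0; 0]);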

• The trajectory followed by the steepest ascent algorithm can be characterized geometrically in terms of the level sets of the objective function. The gradient ∇f(x0) of f, evaluated at x0, is perpendicular to the level set of f at x0. The extreme values of the restricted function g(λ) = f(x0 + λ∇f(x0)) are located at the points λ∗ such that ∇f(x0) is tangent to the level curve of f at x0 + λ∗∇f(x0). Thus over the course of each line search, the angle between ∇f(x0) and the level sets moves from π/2 to 0.

• The steepest ascent method is globally convergent for concave functions, but tends to be very slow. It is not finitely convergent for any interesting class of functions. It is possible to construct optimization algorithms using only first derivatives that are finitely convergent for quadratic functions. These methods are generally preferred to steepest ascent.

• Example: Covariance structure modeling. Suppose we observe Y1, . . . , Yn ∈ R^p, independent and Gaussian with mean µ ∈ R^p and covariance Σ ∈ R^{p×p}. The density of Yi is

(2π)^{-p/2} |Σ|^{-1/2} exp(−(1/2)(Yi − µ)′Σ^{-1}(Yi − µ)).

The log-likelihood of Yi (up to an additive constant) is:

−(1/2) log|Σ| − (1/2) trace(Σ^{-1}(Yi − µ)(Yi − µ)′),

where we use the identity

v′Av = tr(v′Av) = tr(Avv′)

for a vector v and compatible square matrix A.


The log-likelihood of the entire sample is (up to a multiplicative and an additive constant):

−(1/2) log|Σ| − (1/2) trace(Σ^{-1}S),

where S = ∑_i (Yi − µ)(Yi − µ)′/n.

If Σ is unconstrained (with p(p + 1)/2 free parameters) and µ is known, the MLE is Σ̂ = S (the sample covariance, but with n rather than n − 1 in the denominator).

More generally, we may have Σ = Σ(θ), where θ ∈ R^r and r < p(p + 1)/2 is the number of free parameters. Some examples with closed-form solutions for the MLE are:

1. Independence with equal variances: Σrs = θ I(r = s). The MLE is θ̂ = ∑_{ij} (Yij − µj)²/(np).

2. Independence with unequal variances: Σrs = θr I(r = s). The MLE is θ̂r = ∑_i (Yir − µr)²/n.

Many other examples do not have closed-form solutions, for example the exchangeable model Σrs = θr I(r = s) + θ_{p+1} I(r ≠ s), or the autoregressive model Σrs = θ^{|r−s|}.

By the chain rule,

∂L/∂θj = ∑_{p,q} ∂L/∂Σpq · ∂Σpq/∂θj.

Therefore it is sufficient to be able to differentiate L with respect to an unconstrained covariance Σ, and then differentiate Σ with respect to θ. The first term is given in the next section. The second term depends on the parameterization, but is usually straightforward to calculate.

Derivatives of functions of matrices.

• The linear function f(v) = θ′v, where θ ∈ R^d, can be expressed

f(v) = ∑_j θjvj.

Therefore

∂f/∂vk = θk,

and the gradient of f is

∇f = θ.


• The quadratic form f(v) = v′Av, for a vector v and compatible square matrix A, can be expressed

f(v) = ∑_{ij} vivjAij,

which can be rewritten

f(v) = ∑_i vi²Aii + ∑_{i≠j} vivjAij.

Therefore

∂f/∂vk = 2vkAkk + ∑_{j≠k} vjAkj + ∑_{i≠k} viAik.

If A is symmetric, this reduces to

∂f/∂vk = 2vkAkk + 2∑_{j≠k} vjAkj = 2(Av)k,

so the gradient of f is

∇f = 2Av.

• The i, j minor Mij of a square matrix A is the determinant of the matrix formed by deleting row i and column j from A. The i, j cofactor is Cij = (−1)^{i+j}Mij. The adjugate (classical adjoint) matrix adj(A) is the transpose of the matrix of cofactors: [adj(A)]ij = Cji.

The Laplace expansion for the determinant is

|A| = ∑_j AijCij,

which holds for any chosen value of i. Therefore

∂|A|/∂Aij = Cij = [adj(A)]ji.

We use the notation ∂|A|/∂A = adj(A)′ to express this set of derivatives.

• By Cramer's rule, A^{-1} = adj(A)/|A|. Therefore ∂log(|A|)/∂A = adj(A)′/|A| = A^{-T}, which equals A^{-1} when A is symmetric (the case needed below).

• Suppose we wish to compute ∂tr(AB)/∂Aij. Since tr(AB) = ∑_i (AB)ii = ∑_{ik} AikBki, it follows that ∂tr(AB)/∂Aij = Bji. Packing all the derivatives together yields

∂tr(AB)/∂A = B′.


• Since AA^{-1} = I, the product rule gives A · ∂A^{-1}/∂Aij + ∂A/∂Aij · A^{-1} = 0, hence ∂A^{-1}/∂Aij = −A^{-1} · ∂A/∂Aij · A^{-1} = −A^{-1}_{:i}A^{-1}_{j:}. In particular, ∂A^{-1}_{qr}/∂Aij = −A^{-1}_{qi}A^{-1}_{jr}.

• We can use the chain rule to determine ∂tr(Σ^{-1}C)/∂Σ:

∂tr(Σ^{-1}C)/∂Σij = ∑_{qr} ∂tr(Σ^{-1}C)/∂Σ^{-1}_{qr} · ∂Σ^{-1}_{qr}/∂Σij

= ∑_{qr} Crq · ∂Σ^{-1}_{qr}/∂Σij

= −∑_{qr} Crq · Σ^{-1}_{qi} · Σ^{-1}_{jr}

= −(Σ^{-T}C′Σ^{-T})ij.

• Suppose f is a differentiable function from R^{m×m} to R. If we can determine ∂f(M)/∂M for general M, then the derivative for symmetric M is given by:

∂f(Ms)/∂Ms = ∂f(M)/∂M + (∂f(M)/∂M)′ − diag(∂f(M)/∂M).

Therefore the gradient ∂L/∂Σ for the Gaussian covariance structure model (writing C = S for the sample second-moment matrix above) is

Σ^{-1}CΣ^{-1} − Σ^{-1} + diag(Σ^{-1} − Σ^{-1}CΣ^{-1})/2.

From this we can verify the closed-form solution Σ̂ = C for the unconstrained multivariate Gaussian MLE. We can also use the chain rule to construct the gradient for any covariance parameterization Σ = Σ(θ). Note that not all parameterizations lead to convex log-likelihoods, so the MLE is not always unique.

Conjugate-Gradient Methods

• Steepest ascent is the most obvious maximization method that does not require second derivatives. In fact, it is not always best to move in the direction of the gradient. Suppose f(x, y) = −100x² − y². The maximum is at the origin, and the level curves are elongated parallel to the y axis. The gradient is ∇f(x, y) = (−200x, −2y)′. If we start on the x or y axis, we reach the optimum in one line search. More generally, from the point (x, y), we move to the optimum of the following line search:

g_{x,y}(λ) = −100(x − 200λx)² − (y − 2λy)²,

which is given by

λ = (200²x² + 4y²)/(200³x² + 8y²).


Consider two starting values: (.9, 9) and (−.1, −9). In both cases, the trajectory moves almost parallel to the x axis at each step, since the slope of the gradient is y/(100x). The y-axis forms a ridge that could be followed directly to the optimum, but the one-dimensional strategy always overshoots the ridge to get a slightly higher value.

• Conjugate-gradient methods are designed to avoid the zigzag trajectories that steepest ascent often follows. There are a number of formulations. The general principle is to search along a set of conjugate directions. Specifically, given a symmetric matrix A ∈ R^{m×m}, a set of A-conjugate directions vj ∈ R^m is defined by the property v′iAvj = 0 for i ≠ j.

The principal axes (eigenvectors of A) are conjugate, but there are many sets of conjugate directions besides the principal axes.

• Fact: If we have m conjugate directions {vj}, and we perform exact line searches along the {vj} in any order, then an exact maximum of f(x) = −x′Ax − b′x will be found in at most m steps.

For example, suppose we wish to maximize −5x² − 5y² + 2xy. It is easy to verify that the vectors v1 = (1, 0)′ and v2 = (1/5, 1)′ are conjugate with respect to the matrix

A = ( −5   1
       1  −5 )

that defines the quadratic form. Now suppose we start at an arbitrary point (x0, y0), and search in the direction v1, giving the one-dimensional problem

−5(x0 + λ)² − 5y0² + 2(x0 + λ)y0.

The optimum is reached at λ = (2y0 − 10x0)/10, so the next point is (y0/5, y0). This point differs from the maximizer at 0 by a multiple of v2, so the global optimum will be reached after the next line search.

Note that the conjugate directions v1 and v2 above are distinct from the principal axes, which are approximately (.71, −.71)′ and (.71, .71)′. The principal axes (eigenvectors) of a quadratic form are more expensive to compute than its maximum. However we will see that some other set of conjugate directions can be computed much more cheaply.

• If the objective function f is not quadratic, then we can use a set of directions that are conjugate with respect to its Hessian Hf(x). The resulting algorithm will not be finitely convergent, but will generally converge at a quadratic rate.

• One direct way to construct a set of A-conjugate directions is to use the Gram-Schmidt procedure, or to use the eigenvectors of A. But both of these methods require explicit knowledge of A. Moreover, computing the eigenvectors of A is actually more difficult than inverting A, which would give the optimum directly. To make conjugate gradient methods practical, we need a better approach.


• The following recursions give a set of A-conjugate directions, starting with an arbitrary initial vector g1 = h1:

gi+1 = gi − λiAhi

hi+1 = gi+1 + γihi

λi = h′igi / h′iAhi

γi = g′i+1gi+1 / g′igi.

It follows directly that (i) g′i+1hi = 0, (ii) g′ihi = g′igi, and (iii) λi = g′igi/h′iAhi. Using these identities, we can show by induction that (a) the hi are A-conjugate and (b) the gi are orthogonal.

To prove orthogonality of the gi, we have

g′i+1gi = g′igi − λig′iAhi

= g′igi − g′igi · g′iAhi/h′iAhi

= g′igi(1 − g′iAhi/h′iAhi).

Since h′iAhi = h′iAgi + γi−1h′iAhi−1, and by induction h′iAhi−1 = 0, we have h′iAhi = h′iAgi = g′iAhi (by symmetry of A), so the expression above is zero.

If j < i, g′i+1gj = g′igj − λig′jAhi, where g′igj = 0 by induction. Also, g′jAhi = (hj − γj−1hj−1)′Ahi = 0 by induction.

Turning to the conjugacy of the hi, we have

h′i+1Ahi = h′i+1(gi − gi+1)/λi

= (h′i+1gi − h′i+1gi+1)/λi

= ((gi+1 + γihi)′gi − h′i+1gi+1)/λi

= (γih′igi − h′i+1gi+1)/λi

= (g′i+1gi+1 − g′i+1gi+1)/λi

= 0.

If j < i, then

h′i+1Ahj = (gi+1 + γihi)′(gj − gj+1)/λj

= γih′i(gj − gj+1)/λj,

since g′i+1(gj − gj+1) = 0 by orthogonality. As long as i > j + 1 we can substitute gi + γi−1hi−1 for hi, and the gi term will contribute nothing to the inner product. Thus the final line above is proportional to

h′j+1(gj − gj+1),

which can be rewritten

(gj+1 + γjhj)′(gj − gj+1) = γjg′jgj − g′j+1gj+1 = 0,

using g′j+1gj = 0, h′jgj+1 = 0, h′jgj = g′jgj, and the definition of γj.

• If A is m × m and g1 = h1 is the gradient at an arbitrary starting point x1, then it is guaranteed that a sequence of line searches in the directions hi will reach the exact maximizer of a quadratic form in A after at most m steps.

• If g1 is the gradient at x1, and we perform a line search along h1 leading to x2, then g2 is the gradient at x2. We can extend this argument by induction: if gi is the gradient at xi and we carry out a line search along hi giving xi+1, then gi+1 is the gradient at xi+1. Hence the gi can be determined through line searches and gradient computations; we never need to know A.

To prove the final fact, suppose we are maximizing f(x) = −x′Ax/2 + b′x, so the gradient is ∇f(x) = −Ax + b. Specifically, the gradient at xi is gi = −Axi + b, and xi+1 is of the form xi + λhi, so ∇f(xi+1) = −Axi+1 + b = gi − λAhi. When λ gives the optimum of the line search, h′i∇f(xi+1) = 0. Therefore h′igi = λh′iAhi, which gives λ = λi as defined above.

• When gi+1 and hi are nearly anti-parallel, the steepest ascent procedure would fall into a zigzagging pattern. The search direction hi+1 = gi+1 + γihi, being a positive linear combination of gi+1 and hi, undergoes a cancellation in the direction of the zigzagging.

• The quantity g′igi measures how close xi is to the optimum, since at the optimum gi = 0. Therefore γi captures the relative progress made in the previous line search (smaller values of γi correspond to greater progress). Consider the update hi+1 = gi+1 + γihi; the method resets h to the gradient after a lot of progress is made, but when less progress is made, h is set to a compromise between the gradient and the previous search direction.

• The method described above is called the Fletcher-Reeves approach. A variant, the Polak-Ribiere update, is identical to the Fletcher-Reeves procedure except that the update for γi is changed to γi = (gi+1 − gi)′gi+1/g′igi. This update is equal to the Fletcher-Reeves update for exact quadratic forms (where the gi are orthogonal), but in general it is different. There is some disagreement about the relative merits of the two updates, but usually the difference is small.

Random effects models


• Linear random coefficient model. In a standard least squares regression model, the mean relationship between the response Y and covariates X is parameterized as a linear relationship

E(Y|X) = α + β′X,

which may derive from a model such as

Y = α + β′X + ε.

According to this model, the same linear relationship applies equally to every unit in the population. For example, if X is a measure of cumulative lifetime smoking and Y is a measure of lung function, the model states that each incremental unit of smoking exposure changes lung functioning by β units. An individual's actual measured lung function Y differs from the idealized value E(Y|X) due to the error variable ε, which is thought of as comprising both measurement error and other forms of dispersion within the population.

It may be of interest to get a better handle on these “other sources of dispersion within the population.” One way to do this is to consider people as differing in their sensitivity to smoke. Thus in place of a universal β, we have a smoke sensitivity coefficient βi for each subject, with the βi considered to be independent realizations from a “random effects distribution.” For convenience, if we take βi ∼ N(γ, τ²), we would have

Yi = α + βiXi + εi

as a description of the ith subject's response. The parameters of this model are α, γ, τ², and σ² = var(ε). Note that βi is not a model parameter; the βi are unobserved random variables. If a given individual has βi < γ, then (supposing that lower values of the response variable indicate poorer functioning) that individual is more sensitive to smoke than an average member of the population. Similarly, if βi > γ then the individual is less sensitive to smoke than an average member of the population.

In order to estimate and carry out inference for α, γ, τ², and σ², we must average over the βi to get a model in which all random quantities are observed. In this example, we know that the distribution of each Yi is Gaussian, so it is sufficient to determine its mean and variance, which are

E(Y|X) = α + γX,   var(Y|X) = X²τ² + σ².

The log-likelihood for the data set is (up to an additive constant)

−(1/2) ∑_i [ log(Xi²τ² + σ²) + (Yi − α − γXi)²/(Xi²τ² + σ²) ].

The gradient of this log-likelihood can be directly calculated, so conjugate gradient methods can be used to numerically compute the MLE.

• Mixed logistic regression. Continuing with the previous example, suppose that the response variable Y is the occurrence of lung cancer, rather than a continuous measure of lung function. Using the logistic regression framework we have

P(Y = 1|X) = 1/(1 + exp(−α − βiX)),

where βi ∼ N(γ, τ²) represents a random risk coefficient for subject i. The probability of observing response Yi for subject i can be written

(1 + exp(−α − βiX))^{−Yi} (1 + exp(α + βiX))^{Yi−1}.

In order to average over βi we need to calculate (dropping the constant (2π)^{-1/2})

Qi ≡ ∫ (1 + exp(−α − βiX))^{−Yi} (1 + exp(α + βiX))^{Yi−1} exp(−(βi − γ)²/(2τ²)) τ^{-1} dβi.

However, unlike in the linear case, this integral cannot be evaluated in closed form. The log-likelihood for the entire sample is

∑_i log Qi,

so the score function is

∑_i ∇Qi/Qi,

where, differentiating under the integral sign,

∇Qi = ∫ ∇[ (1 + exp(−α − βiX))^{−Yi} (1 + exp(α + βiX))^{Yi−1} exp(−(βi − γ)²/(2τ²)) τ^{-1} ] dβi.

The numerator and denominator of each such ratio must be numerically approximated in order to calculate the score function.
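One simple way to approximate Qi numerically is quadrature on a fixed grid. The following Octave sketch (my own illustration; the grid width and size are arbitrary choices, and Gauss-Hermite quadrature would be more efficient) approximates Qi by the trapezoidal rule.

## Approximate Q_i: integrate the logistic term against the N(gamma, tau^2)
## kernel by the trapezoidal rule on a grid of beta values.
function Q = Qi(y, x, alpha, gamma, tau)
  b = linspace(gamma - 8*tau, gamma + 8*tau, 401);  ## grid over beta_i
  eta = alpha + b*x;
  f = (1 + exp(-eta)).^(-y) .* (1 + exp(eta)).^(y-1);
  k = exp(-(b - gamma).^2/(2*tau^2)) / tau;  ## kernel, constants dropped
  Q = trapz(b, f.*k);
endfunction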

Stochastic Search


• A probabilistic generalization of the usual optimization problem is stated as follows:

Let H(Y|X) denote a family of distribution functions indexed by X ∈ X ⊂ R, and let M(X) = E(Y|X) denote the regression function. Suppose further that there exists θ ∈ R such that M(X) is strictly increasing for X < θ and strictly decreasing for X > θ.

Suppose we are able to specify an X and obtain a realization from H(Y|X) (but we cannot observe M(X) directly). The goal is to define an algorithm that produces a sequence X1, X2, . . . that converges to θ in some sense. The algorithm can select each Xi based on the values Xj and on the realizations Yj ∼ H(Y|Xj) for j < i.

• Note that if H(Y|X) is degenerate (i.e. there exists Y(X) such that P(Y = Y(X)|X) = 1), then this problem reduces to deterministic optimization.

• An algorithm giving convergence in probability was discovered by J. Kiefer and J. Wolfowitz (Ann. Math. Statist., 1952, pp. 462-466), based on an earlier procedure for stochastic root finding due to H. Robbins and S. Monro (Ann. Math. Statist., 1951, pp. 400-407).

Begin by determining any sequences {an} and {cn} such that:

1. cn → 0

2. ∑_n an = ∞

3. ∑_n ancn < ∞

4. ∑_n a²n/c²n < ∞.

For example, the sequences an = 1/n and cn = n^{-1/3} satisfy all four conditions. Define Zn by the rule

Zn+1 = Zn + an(Y2n − Y2n−1)/cn,

where Y2n−1 is drawn according to H(Y|Zn − cn) and Y2n is drawn according to H(Y|Zn + cn) (independently).

Then if H(Y|X) has finite variance for all X, and if M(X) satisfies the following conditions, Kiefer and Wolfowitz proved that Zn → θ in probability (∀ε > 0, P(|Zn − θ| > ε) → 0).

1. There exist B, β > 0 such that

|x − θ| + |y − θ| < β ⇒ |M(x) − M(y)| < B|x − y|.

2. There exist ρ > 0 and R such that

|x − y| < ρ ⇒ |M(x) − M(y)| < R.


3. For every δ > 0 there exists π(δ) > 0 such that

|z − θ| > δ ⇒ inf_{0<ε<δ/2} |M(z + ε) − M(z − ε)|/ε > π(δ).

The result also applies if M(X) is only defined on an interval [C1, C2]. In this case, if the sequence Zn jumps out of [C1, C2], its value is set to the closer of C1 and C2.

• The procedure is related to an earlier “stochastic root finding” procedure discovered by H. Robbins and S. Monro. The problem (in the previous notation) is to find X such that M(X) = α. The proposal is to use the sequence Zn+1 = Zn + an(α − Yn), where Yn is drawn according to H(Y|Zn) and an is a positive sequence satisfying ∑_n an = ∞ and ∑_n a²n < ∞ (e.g. an = 1/n).

EM Algorithm

• Suppose we observe Y with density fθ(Y), where L(θ|Y) = log fθ(Y) is a complicated function of θ, and therefore is hard to optimize. Suppose further that the nature of the problem suggests a random variable Z such that, if Z were observed, the augmented log-likelihood L(θ|Y, Z) = log fθ(Y, Z) would be simpler, in that the following two conditions hold:

1. We can compute the following conditional expectation as a function of θ in closed form:

Q(θ|θ′) = Eθ′(L(θ|Y, Z)|Y).

2. We can optimize Q(θ|θ′) as a function of θ for fixed θ′. That is, we can compute

M(θ′) = argmax_θ Q(θ|θ′).

Under these conditions, the EM algorithm constructs a sequence of iterates according to the rule θi+1 = M(θi). Under certain additional regularity conditions (C. F. J. Wu, Ann. Statist., 1983, pp. 95-103), the θi converge to a stationary point of L(θ|Y). If L(θ|Y) is unimodal and has only one stationary point, then the convergence is to the MLE. When convergence occurs, the rate is typically linear. Step 1 is called the E-step, and step 2 is called the M-step.

• The EM algorithm, like the steepest ascent and conjugate gradient algorithms, but unlike Newton's method, is monotonic, in that the sequence θn of iterates satisfies L(θn+1|Y) ≥ L(θn|Y). Begin by observing that


L(θ|Y) = log fθ(Y, Z) − log fθ(Z|Y)

= ∫ fθ′(Z|Y) log fθ(Y, Z) dZ − ∫ fθ′(Z|Y) log fθ(Z|Y) dZ

= Q(θ|θ′) − ∫ fθ′(Z|Y) log fθ(Z|Y) dZ

= Q(θ|θ′) − K(θ|θ′),

where the second line holds because the left-hand side does not depend on Z, so integrating against fθ′(Z|Y) leaves it unchanged.

Continuing, we have

L(θ|Y) − L(θ′|Y) = Q(θ|θ′) − Q(θ′|θ′) + K(θ′|θ′) − K(θ|θ′)

= Q(θ|θ′) − Q(θ′|θ′) − ∫ log(fθ(Z|Y)/fθ′(Z|Y)) fθ′(Z|Y) dZ.

If we set θ∗ = argmax_τ Q(τ|θ′), then Q(θ∗|θ′) − Q(θ′|θ′) ≥ 0. By Jensen's inequality,

−∫ log(fθ(Z|Y)/fθ′(Z|Y)) fθ′(Z|Y) dZ ≥ −log ∫ (fθ(Z|Y)/fθ′(Z|Y)) fθ′(Z|Y) dZ = 0.

Thus L(θ∗|Y) − L(θ′|Y) ≥ 0.

• For any value of θ,

L(θ|Y) ≥ Q(θ|θ′) + L(θ′|Y) − Q(θ′|θ′),

with equality for θ = θ′. This leads to the characterization of the EM algorithm as a minorization or optimization transfer algorithm. Optimizing the surrogate function Q(θ|θ′) always drives L(θ) uphill.

• Example: Finite mixture models. Suppose π1, . . . , πK are positive constants with ∑_j πj = 1, and let f1, . . . , fK denote density functions. Then ∑_j πjfj is also a density function. It is the density of an observation generated according to the following two-stage procedure: (i) generate κ ∈ {1, . . . , K} with P(κ = k) = πk; (ii) generate Y ∼ fκ. We observe Y but not κ.

Suppose we observe Y1, . . . , Yn according to the mixture density, and we wish to estimate the parameters π = (π1, . . . , πK)′ using maximum likelihood. The log-likelihood for the observed data is:

L(π|{Yj}) = ∑_{j=1}^n log( ∑_{k=1}^K πkfk(Yj) ).


Suppose we were to observe as complete data the pairs (Yj, κj), where κj is the random variable defined in (i) above. Then the log-likelihood for the complete data is:

L(π|{Yj, κj}) = ∑_{j=1}^n [ log(πκj) + log(fκj(Yj)) ].

The E-step of the EM algorithm requires that we compute the expectation of L(π|{Yj, κj}) given the Yj for a fixed probability vector π′. This is more easily accomplished if we rewrite the complete-data log-likelihood as follows:

L(π|{Yj, κj}) = ∑_{j=1}^n ∑_{k=1}^K [ log(πk)I(κj = k) + log(fk(Yj))I(κj = k) ].

This is linear in the random variables I(κj = k), so the expectation is obtained by plugging in the values of the conditional expectations

Eπ′(I(κj = k)|Yj) = Pπ′(κj = k|Yj).

Using Bayes' theorem, we have

Pπ′(κj = k|Yj) ∝ P(Yj|κj = k)Pπ′(κj = k) = fk(Yj)π′k.

Therefore the expectation is given by:

Eπ′(I(κj = k)|Yj) = fk(Yj)π′k / ∑_ℓ fℓ(Yj)π′ℓ ≡ pjk,

and the Q function takes the form

Q(π|π′) = ∑_{j=1}^n ∑_{k=1}^K [ log(πk)pjk + log(fk(Yj))pjk ].

Note that only the first term involves π, so the M-step depends only on the first term. We update the parameters using the rule

πk ← ∑_j pjk / n.

• Example: Gaussian mixture models. Continuing with the previous example, suppose that fk(Y) ∝ exp(−(Y − µk)′Σk^{-1}(Y − µk)/2)/|Σk|^{1/2} is Gaussian, and we wish to estimate π1, . . . , πK, µ1, . . . , µK, and Σ1, . . . , ΣK.

The complete-data log-likelihood becomes

L(π, {Σk}, {µk}|{Yj, κj}) = ∑_{j=1}^n ∑_{k=1}^K [ log(πk) − (1/2)( log|Σk| + (Yj − µk)′Σk^{-1}(Yj − µk) ) ] I(κj = k).

To carry out the E-step, the posterior probabilities pjk can be computed as in the previous example. To carry out the M-step for the mean and variance parameters, use the weighted moment updates

µk ← ∑_j pjkYj / ∑_j pjk,

Σk ← ∑_j pjk(Yj − µk)(Yj − µk)′ / ∑_j pjk.

The probability parameters πk are updated as in the previous example.

Here is the Octave code for calculating the maximum likelihood estimates for a Gaussian finite mixture model using the EM algorithm:

## Fit a K-component Gaussian mixture to the data in Z, where each row

## is an observation and each column is a variable.

## The dimensions of Z.

p = size(Z,2);

n = size(Z,1);

## Number of classes.

K = 2;

## Starting values for class means.

M = randn(p,K);

## Starting values for the marginal class probabilities.

P = ones(K,1)/K;

## Starting values for within-class covariances, stacked vertically.

S = [];

for k=1:K

S = [S; eye(p,p)];

endfor

## Keep track of the marginal log-likelihood.

LL = -Inf;


while (1)

## Calculate L1(i,k) = log P(Z_i | Q_i=k) and L(i,k) = log P(Z_i, Q_i=k)

## for each observation i and each class k.

for k=1:K

r = Z - ones(n,1)*M(:,k)';

z = S((k-1)*p+1:k*p,:) \ r';

L1(:,k) = -0.5*sum(r' .* z)' - 0.5*log(det(S((k-1)*p+1:k*p,:)));

L(:,k) = L1(:,k) + log(P(k));

endfor

## Calculate the marginal log-likelihood.

Lnew = sum(log(P' * exp(L1)')');

if (Lnew-LL < 1e-6)

break;

endif

LL = Lnew;

## Calculate P(Q_i=k | Z_i), taking care in the normalization.

Q = L - max(L')' * ones(1,K);

Q = exp(Q);

Q = Q ./ (sum(Q')' * ones(1,K));

## Update the class means.

M = (Z' * Q) ./ (ones(p,1) * sum(Q));

## Update the covariances.

for k=1:K

r = (sqrt(Q(:,k)) * ones(1,p)) .* (Z - ones(n,1)*M(:,k)');

S((k-1)*p+1:k*p,:) = r'*r / sum(Q(:,k));

endfor

## Update the marginal group probabilities.

P = sum(Q)'/n;

endwhile

• Example: Contingency tables with collapsed cells. Suppose we observe an m × n contingency table with independent rows and columns. If Xij denotes the number of observations in cell (i, j), then EXij = Npiqj, where N is the total number of observations, pi is the probability of an observation falling in row i, and qj is the probability of an observation falling in column j. Now suppose that we are unable to distinguish between certain cells, so we can only observe their sum. To be concrete, consider a 2 × 3 table where we can only observe the following 4 values:

Z1 = X11 + X12

Z2 = X21

Z3 = X22 + X13

Z4 = X23.

The likelihood for the observable data is:

L({pi}, {qj}|{Zj}) ∝ (p1q1 + p1q2)^{Z1} (p2q1)^{Z2} (p2q2 + p1q3)^{Z3} (p2q3)^{Z4}.

We could optimize this function using gradient methods, but it would require a fair amount of programming. On the other hand, consider the log-likelihood for the {Xij} (some of which are unobservable):

L({pi}, {qj}|{Xij}) ∝ ∑_{ij} Xij log(piqj).

This is linear in the Xij, so to compute the Q(·|·) function we need only compute the conditional means of these values given the Zj, and then plug them in. Specifically, set

X∗11 = E(X11|{Zj}) = p1q1Z1/(p1q1 + p1q2)

X∗12 = E(X12|{Zj}) = p1q2Z1/(p1q1 + p1q2)

X∗22 = E(X22|{Zj}) = p2q2Z3/(p2q2 + p1q3)

X∗13 = E(X13|{Zj}) = p1q3Z3/(p2q2 + p1q3),

with the other X∗ij set to their directly-observable values. Finally, we can update the parameter estimates using

pi ← ∑_j X∗ij/N

qj ← ∑_i X∗ij/N.

This defines one iteration of the EM algorithm.

• Example: Missing data in a multivariate normal model. Suppose that we observe Yj ∈ R^{dj} with Yj ∼ N(µj, Σj). Suppose further that there exist injective functions σj : {1, . . . , dj} → {1, . . . , d} such that µj(k) = µ(σj(k)) and Σj(k1, k2) = Σ(σj(k1), σj(k2)). In words, the components of each Yj comprise a subset of the components of a common multivariate normal distribution.

Suppose that we want to estimate θ = (µ, Σ) based on the following log-likelihood:


Lθ({Yj}) = ∑_j [ −(1/2) log|Σj| − (1/2)(Yj − µj)′Σj^{-1}(Yj − µj) ].

Note that this might not be advisable if the σj are viewed as random variables that could be dependent on the Yj. In general, Lθ({Yj}) cannot be optimized in closed form. A conjugate gradient algorithm would be a good choice here, but the EM algorithm is considerably simpler to derive and program (at least for a statistician).

The complete data are random vectors Zj ∈ R^d such that Zj ∼ N(µ, Σ) and Yj(k) = Zj(σj(k)). The complete-data log-likelihood is:

Lθ({Zj}) = ∑_j [ −(1/2) log|Σ| − (1/2)(Zj − µ)′Σ^{-1}(Zj − µ) ]

∝ −(1/2) log|Σ| − (1/2) tr( Σ^{-1} · (1/n) ∑_j (Zj − µ)(Zj − µ)′ ).

To compute the Q(θ|θ′) function, we need the values ηj = Eµ′,Σ′(Zj|Yj) and Cj = Eµ′,Σ′(ZjZj′|Yj). Once we have these values, the Q(θ|θ′) function can be expressed:

Q(θ|θ′) = −(1/2) log|Σ| − (1/2) tr( Σ^{-1} · (1/n) ∑_j (Cj − ηjµ′ − µηj′ + µµ′) )

= −(1/2) log|Σ| − (1/2) tr( Σ^{-1}(µ − η̄)(µ − η̄)′ + Σ^{-1}(C̄ − η̄η̄′) ),

where η̄ = ∑_j ηj/n and C̄ = ∑_j Cj/n. The maximum of Q(θ|θ′) occurs where µ = η̄ and

Σ = C̄ − η̄η̄′ = ∑_j [ cov(Zj|Yj) + (ηj − η̄)(ηj − η̄)′ ]/n.

The values ηj and Cj can easily be computed using the usual formulas for the conditional mean and variance of a multivariate normal random vector. Specifically, let Z∗ denote the missing components of Z. Then:

E(Z∗|Y) = EZ∗ + cov(Z∗, Y)cov(Y)^{-1}(Y − EY)

cov(Z∗|Y) = cov(Z∗) − cov(Z∗, Y)cov(Y)^{-1}cov(Y, Z∗),

where all of the quantities on the right hand side are computed under θ′ = (µ′, Σ′).

• General optimization transfer. The basic idea behind the EM algorithm can be extended to optimization problems where there is no data augmentation, or even to problems that have no likelihood. In fact, the reason the procedure works has only to do with convexity, and nothing whatsoever to do with probability. We retain the idea of the minorant Q(θ|θ′) and shift it by −Q(θ′|θ′), so that Q(θ′|θ′) = 0 and

L(θ) ≥ Q(θ|θ′) + L(θ′),

so that optimizing Q(θ|θ′) (with respect to θ) drives L(θ) uphill. There are many ways of constructing such a Q(·|·). For example, if −L(θ) is convex and differentiable, then

L(θ) ≥ L′(θ)(θ − θ′) + L(θ′),

suggesting Q(θ|θ′) = L′(θ)(θ − θ′).

• Example: L1 regression. The L1 regression problem is to minimize

L(β) = ∑_i |Yi − Xi′β| = ∑_i √((Yi − Xi′β)²).

When u ≥ 0 and v > 0, (√u − √v)² ≥ 0 ⇒ (u − v)/(2√v) ≥ √u − √v. Letting

u = (Yi − Xi′β)²

v = (Yi − Xi′β̃)²

for two vectors of regression coefficients β and β̃, it follows that for each i

[ (Yi − Xi′β)² − (Yi − Xi′β̃)² ] / (2|Yi − Xi′β̃|) ≥ |Yi − Xi′β| − |Yi − Xi′β̃|.

This gives the following majorant (since this is a minimization rather than a maximization problem):

Q(β|β̃) = ∑_i [ (Yi − Xi′β)² − (Yi − Xi′β̃)² ] / (2|Yi − Xi′β̃|).

That is, L(β) ≤ Q(β|β̃) + L(β̃), with equality at β = β̃.

Thus the optimization transfer algorithm is identical to the iteratively reweighted least squares algorithm (a code sketch is given below):

1. Set j = 0, begin at β0.

2. Define the weights wi = 1/|Yi − Xi′βj|.

3. Regress the Yi on the Xi using weighted least squares with weights wi. Let βj+1 denote the weighted least squares solution.

4. Set j ← j + 1 and return to (2).


Constrained Optimization

• Suppose we want to optimize f(Z) over a set S ⊂ R^d, where S is defined by a set of parametric constraints: Z ∈ S ⟺ gi(Z) ≥ 0, i = 1, . . . , m1, and hi(Z) = 0, i = 1, . . . , m2. The first type of constraint function is called an inequality constraint, or barrier function. The second type is called an equality constraint. Points in S are called feasible points, and S is called the feasible region.

• If S is convex (x, y ∈ S ⇒ λx + (1 − λ)y ∈ S for all 0 ≤ λ ≤ 1) and −f is strictly convex, then f restricted to S has at most one maximizer.

• If the maximizer lies in the interior of S, then unconstrained optimization procedures will find it. The usual situation, in which the maximizer lies on the boundary, is more difficult. In this case there is an unknown subset of the inequality constraints gi that become equalities at the maximizer; these are called the active constraints. If we can identify this subset of constraints, then the other inequality constraints can be ignored. However there are a very large number of subsets, so it can be hard to find the active constraints by direct search.

• Iterative algorithms that approach the extreme value from the interior of S are called interior point algorithms. Algorithms that approach it along the boundary of S are called active set algorithms.

• If there is a single equality constraint h(Z) = 0, then the classical method of Lagrange multipliers can be used. Specifically, the optimum must occur where ∇h(Z) and ∇f(Z) are parallel.

• Example: Multinomial probabilities. Suppose counts n1, . . . , nk summing to N are observed according to the multinomial mass function, which is proportional to ∏_j pj^{nj}, where the pj are unknown probabilities, so pj ≥ 0 and ∑_j pj = 1. To obtain the MLE, we can maximize L(p1, . . . , pk) = ∑_j nj log(pj). The constraint is h(p1, . . . , pk) = ∑_j pj − 1 = 0. The gradients are ∇h = (1, . . . , 1)′ and ∇L = (n1/p1, . . . , nk/pk)′. Clearly the gradients are parallel if pj = cnj for some c. Since the constraint ∑_j pj = 1 must be satisfied as well, we must have c = 1/N, giving p̂j = nj/N.

• Identity constraints are of the form h(Z) = Zj1 − Zj2, i.e. a constraint that equates two variables. Suppose we are using any gradient-based optimization procedure (e.g. steepest ascent or conjugate gradient). In these settings, identity constraints can be handled by the chain rule. Specifically, if hi is the search direction at iteration i, we define a “packed gradient” h̃i by summing all components of hi that are constrained to a common value.

• Example: Nonlinear Gompertz regression. Suppose we want to minimize f(a, b, c) = ∑_i (Yi − (a + bXi)/(c + Xi))². The gradient is given by:

∇f(a, b, c) = −2 ∑_i (Yi − (a + bXi)/(c + Xi)) · ( 1/(c + Xi), Xi/(c + Xi), −(a + bXi)/(c + Xi)² )′.


Now suppose we want to impose the constraint a ≡ c. Then by summing the gradient components for a and c, and writing c = a throughout, we get:

∇̃f(a, b) = −2 ∑_i (Yi − (a + bXi)/(a + Xi)) · ( 1/(a + Xi) − (a + bXi)/(a + Xi)², Xi/(a + Xi) )′.

• If there are only linear equality constraints (and no inequality constraints), then the basic idea behind Lagrange multipliers can be extended to any gradient-based optimization procedure. Let hi denote the search direction at iteration i that would be obtained for the unconstrained problem. Let g1, . . . , gr denote the coefficient vectors for the linear equality constraints. Then we can construct h̃i by projecting hi onto the orthogonal complement of g1, . . . , gr. As long as we start at a point that satisfies the constraints, line searches in the directions h̃i will always preserve the constraints.

• Another approach to handling linear equality constraints is reparameterization. Suppose we have the constraint A′Z = b, where A is tall and thin with independent columns (so the system is underdetermined). Writing A = QR, the constraint becomes Q′Z = R^{-T}b. Let Z0 be any vector satisfying Q′Z0 = R^{-T}b, and substitute for Z the quantity Z0 + (I − QQ′)W. The optimization is now unconstrained as a function of W, and the solution, expressed in the original coordinates, is Z∗ = Z0 + (I − QQ′)W∗, where W∗ is a solution to the reparameterized problem.
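The following Octave sketch illustrates the substitution on a toy problem (my own example, not from the notes): minimize ‖Z − t‖² subject to A′Z = b. For this quadratic objective the reparameterized problem in W has a closed-form solution, so no iterative solver is needed.

## Reparameterize the constraint A'Z = b, then optimize freely over W.
A = [1 0; 0 1; 1 1];       ## 3 x 2: two linear constraints on Z in R^3
b = [1; 2];
[Q, R] = qr(A, 0);         ## economy QR: A = Q*R with Q 3 x 2
Z0 = Q * (R' \ b);         ## satisfies Q'*Z0 = R^{-T}*b, hence A'*Z0 = b
P = eye(3) - Q*Q';         ## projector onto the feasible directions
t = [0; 0; 5];             ## target point for the objective ||Z - t||^2
W = P * (t - Z0);          ## unconstrained solution in the W coordinates
Zstar = Z0 + P*W;          ## optimal feasible point; A'*Zstar equals b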

• If the objective function is linear and any combination of linear equality and inequality constraints is present, then the problem is called a linear program. The problem can be expressed as follows:

maximize c′x subject to: x ≥ 0, Ax ≤ b.

A classical method called the simplex algorithm is an active set approach to solvinglinear programs.

• To extend the Lagrange multiplier approach to multiple inequality constraints, construct the Lagrangian:

L(x; λ1, . . . , λr) = f(x) + ∑_j λjgj(x).

• The dual function of f is the unconstrained supremum of the Lagrangian over x:

g(λ1, . . . , λr) = sup_x L(x; λ1, . . . , λr).

• For example, in the linear program (keeping only the constraint Ax ≤ b), the Lagrangian is L(x; λ) = c′x − λ′(Ax − b). The dual function is ∞ if A′λ − c ≠ 0 and λ′b otherwise.


• If x is feasible and λi ≥ 0 for i = 1, . . . , r, then f(x) ≤ g(λ1, . . . , λr):

f(x) ≤ f(x) + ∑_i λigi(x) ≤ g(λ1, . . . , λr).

Thus sup_x f(x) ≤ inf_λ g(λ), where the supremum is subject to the constraints gi(x) ≥ 0 and the infimum is subject to the constraints λi ≥ 0. The dual function is an upper bound for f (f is called the primal function). The difference inf_λ g(λ) − sup_x f(x) is called the optimal duality gap.

• When the optimal duality gap is zero, the condition of strong duality is said to hold. Strong duality is guaranteed if Slater's condition holds: −f is convex, and there exists a strictly feasible point (one where every inequality constraint holds strictly).

• For example, returning to the linear program, the dual problem is to minimize λ′b subject to A′λ = c and λ ≥ 0. Note that the numbers of variables and equations are reversed relative to the primal. Strong duality holds for linear programs except when the primal and dual problems are both infeasible.

• Suppose we have an algorithm that constructs a sequence xi converging to a solution of the primal problem, and a sequence λi converging to a solution of the dual problem. Then the constrained supremum sup_x f(x) is trapped: f(xi) ≤ sup_x f(x) ≤ g(λi). This gives us the ability to construct stopping criteria based on absolute error.

• Suppose x∗ is a solution to the primal problem, and λ∗ is a solution to the dual problem λ∗ = argmin_{λ≥0} g(λ). If the duality gap is zero, it follows that

f(x∗) = g(λ∗) ≥ f(x∗) + ∑_j λ∗jgj(x∗),

which implies that ∑_j λ∗jgj(x∗) ≤ 0. Since each term λ∗jgj(x∗) is nonnegative, every term must equal zero, so at the solution the Lagrange multipliers are zero for the inactive constraints. This property is called complementary slackness.

• Primal-dual algorithms are a general class that follows the strategy of starting with a feasible pair (x, λ), then taking, in pairs, an uphill primal step (in x) and a downhill dual step (in λ). When the primal and dual objective values meet, convergence has occurred.

Example:

Suppose our aim is to optimize the linear function

f(x1, x2) = ax1 + bx2,

where a and b are given constants, subject to the constraint x1² + x2² ≤ 1. The single barrier function is

g1(x1, x2) = 1 − x1² − x2²,

and the Lagrangian is

L(x1, x2; λ) = ax1 + bx2 + λ(1 − x1² − x2²).

To construct the dual function g(λ) (not to be confused with the barrier function g1(x)), we calculate the unconstrained supremum of L(x1, x2; λ) over all values of x1 and x2. Using calculus, we find that the supremum occurs at

x1 = a/(2λ),   x2 = b/(2λ).

Thus the dual function is

g(λ) = (a² + b² + 4λ²)/(4λ).

To minimize the dual function over λ ≥ 0, we first note that g(λ) → ∞ as λ → 0 and as λ → ∞. Thus the minimum must occur at a point where g′(λ) = 0. Solving for this point yields

λ∗ = √(a² + b²)/2.

Therefore f attains its constrained supremum

g(λ∗) = √(a² + b²).

The constrained supremum of f is the same as the unconstrained supremum of L(x1, x2; λ∗), which is attained at

x1 = a/√(a² + b²),   x2 = b/√(a² + b²).

In this case, we can confirm the solution by direct constrained optimization of f. It is easier to work in polar coordinates:


x1 = r cos θ,   x2 = r sin θ.

The objective function is

f(r, θ) = ar cos θ + br sin θ.

Since this is linear in r, the constrained optimum must occur when r = 0 or r = 1. To identify the optimal θ, differentiating with respect to θ yields

θ = arctan(b/a).

Converting back to the original coordinates yields

x1 = r cos(arctan(b/a)) = ra/√(a² + b²)

x2 = r sin(arctan(b/a)) = rb/√(a² + b²).

Substituting this into f with r = 0 and r = 1 yields 0 and

a²/√(a² + b²) + b²/√(a² + b²) = √(a² + b²),

respectively. Thus r = 1 gives the maximum, in agreement with the value obtained by minimizing the dual function.

Extensions to smooth barrier functions

• Suppose the constraints gj are differentiable, and let (x∗, λ∗) be solutions to the primal and dual problems in a setting with zero optimal duality gap. Due to complementary slackness, g(λ∗) = L(x∗; λ∗), hence x∗ is an unconstrained extreme value of the Lagrangian at λ = λ∗. It follows that

∇f(x∗) + ∑_j λ∗j∇gj(x∗) = 0.

Combining this with the following three conditions:

gj(x∗) ≥ 0

λ∗j ≥ 0

gj(x∗)λ∗j = 0,

we have the KKT conditions. Under convexity and a constraint qualification such as Slater's condition, a pair (x∗, λ∗) is optimal if and only if it satisfies the KKT conditions.


• Penalty function methods are a general class of algorithms that turn constrained problems into unconstrained problems with approximately the same solution. The basic idea is to choose barrier functions φj(x) such that as x approaches the boundary of S, at least one φj approaches −∞. The maximizer of F(x) = f(x) + ε∑_j φj(x) is slightly interior to the maximizer of the constrained problem, and under general conditions converges to that value as ε → 0. Beginning with a strictly feasible point, a gradient method can be used to optimize F(x). For example, the penalized linear program becomes:

F(x) = c′x + ε∑_j log(xj) + ε∑_i log(bi − Ai,:x).

• There are a number of heuristics that improve the performance of penalty algorithms. Define x0 to be the maximizer of Φ(x) = ∑_j φj(x), and for c < Φ(x0), define

ψ(c) = argmax_{x:Φ(x)=c} f(x).

The parametric curve ψ(c) connects x0 to the maximizer x∗. It is called the central path. Most methods start at x0 and attempt to follow the central path to x∗.

Dynamic Programming

• Suppose we have a Markov chain X1, . . . , Xm, so that

P(Xt = i|X1 = j1, . . . , Xt−1 = jt−1) = P(Xt = i|Xt−1 = jt−1).

Let P(Xt = i|Xt−1 = j) ≡ Pt(i|j) and P(X1 = i) ≡ P1(i).

Suppose the distribution is known but no data are observed. It may be of interest to identify the sequence with highest probability under the Markov chain distribution, that is

argmax_{x1,...,xm} P(X1 = x1, . . . , Xm = xm).

For an independent sequence this is easy: each Xk can be set to the maximizer of the marginal distribution P(Xk). But for a dependent sequence the maximizer of the joint distribution will generally not maximize P(Xk) at each time point.

Suppose the sample space at each time point is {1, . . . , K}. Then the sample space of X1, . . . , Xm contains K^m points. Finding the most probable sequence seems intractable, but surprisingly the maximizer can be calculated in only order K²m operations.

To see this, let

δk(x) = max_{x1,...,xk−1} P(X1 = x1, . . . , Xk−1 = xk−1, Xk = x).


Then

δk(x) = max_{xk−1} ( max_{x1,...,xk−2} P(X1 = x1, . . . , Xk−1 = xk−1, Xk = x) )

= max_{xk−1} ( max_{x1,...,xk−2} P(X1 = x1, . . . , Xk−1 = xk−1) P(Xk = x|Xk−1 = xk−1) )

= max_{xk−1} δk−1(xk−1) P(Xk = x|Xk−1 = xk−1).

Thus δk(x) can be calculated in K steps if δk−1(1), . . . , δk−1(K) are known, and all δk(x) values can be calculated in order K²m operations. Once the δk(x) values are all known, we can use the fact that

max_x δm(x) = max_{x1,...,xm} P(X1 = x1, . . . , Xm = xm)

to determine the greatest probability. The last step is to determine the actual sequence that attains this probability. To do this, we trace backward from the value of x maximizing δm(x). When calculating the δk(x) values, we record the value of xk−1 maximizing

δk−1(xk−1) P(Xk = x|Xk−1 = xk−1).

Let this value be denoted gk(x). Starting from argmax_x δm(x), we can trace back using the gk(·) functions to construct a sequence maximizing the joint probability (it may not be unique).

Simulated Annealing

• Suppose we want to maximize a real-valued function f with domain S. Simulated annealing is an attractive optimization procedure if any of the following hold: (i) f has many local extreme values, (ii) S is discrete, (iii) f is not smooth, (iv) the derivative of f is much more expensive to compute than the function value.

• Suppose we have a family of “proposal distributions” on S indexed by X′ ∈ S: R(X|X′). From a starting value X0, define a sequence of points in S using the rules

Zn ∼ R(·|Xn−1)

P(Xn = Zn) = min(exp((f(Zn) − f(Xn−1))/Tn), 1)

P(Xn = Xn−1) = 1 − P(Xn = Zn).

• Note that if f(Zn) > f(Xn−1) then Xn = Zn: uphill proposals are always accepted, but some downhill proposals are accepted as well.

• As Tn → 0, downhill proposals are accepted less often.


• If Tn → 0 sufficiently slowly, then a.s. convergence to the global extremum occurs for a reasonable class of problems. Theoretically, the rate of Tn should have the form

Tn ∼ 1/log(c + n).

However this rate is too slow to be useful for most practical problems, since for n of manageable size the estimate Xn will have substantial variability. There is some experimental evidence that the optimum can be identified for many problems using a heuristic cooling rule, such as decreasing Tn by 10% each time a fixed number of steps is accepted.
