
Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 11:

SemiDefinite Programming

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

SDP: SemiDefinite Prog

• Generalization of LP.
• Deals with symmetric positive semidefinite matrices (Linear Matrix Inequalities, LMI).
• Solved with IPMs.
• Numerous applications: eigenvalue optimization problems, LP, quasi-convex programs, convex quadratically constrained optimization, robust mathematical programming, matrix norm minimization, combinatorial optimization, control theory, statistics.

This lecture is based on two survey papers:
• L. Vandenberghe and S. Boyd, Semidefinite Programming, SIAM Review 38 (1996) pp. 49-95.
• M.J. Todd, Semidefinite Optimization, Acta Numerica 10 (2001) pp. 515-560.

The Internet resources:
• http://www-unix.mcs.anl.gov/otc/InteriorPoint/
• http://www.zib.de/helmberg/semidef.html

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

SDP: Background

Def. A matrix H ∈ R^{n×n} is positive semidefinite if x^T H x ≥ 0 for all x ∈ R^n. We write H ⪰ 0.

Def. A matrix H ∈ R^{n×n} is positive definite if x^T H x > 0 for all x ≠ 0. We write H ≻ 0.

We denote with SR^{n×n} (SR^{n×n}_+) the set of symmetric positive semidefinite (positive definite) matrices.

Let U, V ∈ SR^{n×n}. We define the inner product between U and V as U • V = trace(U^T V), where trace(H) = Σ_{i=1}^n h_ii. The associated norm is the Frobenius norm, written ‖U‖_F = (U • U)^{1/2} (or just ‖U‖).

Def. Linear Matrix Inequalities
Let U, V ∈ SR^{n×n}.
We write U ⪰ V iff U − V ⪰ 0.
We write U ≻ V iff U − V ≻ 0.
We write U ⪯ V iff U − V ⪯ 0.
We write U ≺ V iff U − V ≺ 0.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Properties

1. If P ∈ R^{m×n} and Q ∈ R^{n×m}, then trace(PQ) = trace(QP).

2. If U, V ∈ SR^{n×n}, and Q ∈ R^{n×n} is orthogonal (i.e. Q^T Q = I), then U • V = (Q^T U Q) • (Q^T V Q).
More generally, if P is nonsingular, then U • V = (P U P^T) • (P^{-T} V P^{-1}).

3. Every U ∈ SR^{n×n} can be written as U = Q Λ Q^T, where Q is orthogonal and Λ is diagonal. Then UQ = QΛ. In other words the columns of Q are the eigenvectors, and the diagonal entries of Λ the corresponding eigenvalues of U.

4. If U ∈ SR^{n×n} and U = Q Λ Q^T, then trace(U) = trace(Λ) = Σ_i λ_i.

5. For U ∈ SR^{n×n}, the following are equivalent:
(i) U ⪰ 0 (U ≻ 0);
(ii) x^T U x ≥ 0 for all x ∈ R^n (x^T U x > 0 for all x ≠ 0);
(iii) if U = Q Λ Q^T, then Λ ⪰ 0 (Λ ≻ 0);
(iv) U = P^T P for some matrix P (U = P^T P for some square nonsingular matrix P).

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Properties (continued)

6. Every U ∈ SR^{n×n} has a square root U^{1/2} ∈ SR^{n×n}.
Proof: From Properties 3 and 5 (iii) we get U = Q Λ Q^T with Λ ⪰ 0. Take U^{1/2} = Q Λ^{1/2} Q^T, where Λ^{1/2} is the diagonal matrix whose diagonal contains the (nonnegative) square roots of the eigenvalues of U, and verify that U^{1/2} U^{1/2} = U.

7. Suppose

U = [ A   B^T ]
    [ B   C   ],

where A and C are symmetric and A ≻ 0. Then U ⪰ 0 (U ≻ 0) iff C − B A^{-1} B^T ⪰ 0 (≻ 0).
The matrix C − B A^{-1} B^T is called the Schur complement of A in U.
Proof: follows easily from the factorization:

[ A   B^T ]   [ I        0 ] [ A   0                  ] [ I   A^{-1} B^T ]
[ B   C   ] = [ B A^{-1} I ] [ 0   C − B A^{-1} B^T   ] [ 0   I          ].

8. If U ∈ SR^{n×n} and x ∈ R^n, then x^T U x = U • (x x^T).

5
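A quick numerical illustration of Property 7 (a sketch in Python/NumPy; the block matrices below are arbitrary test data, not from the slides): given A ≻ 0, positive definiteness of U can be tested either directly or via the Schur complement.

    import numpy as np

    A = np.array([[4.0, 1.0], [1.0, 3.0]])      # A is positive definite
    B = np.array([[1.0, 0.0], [2.0, 1.0]])
    C = np.array([[5.0, 2.0], [2.0, 4.0]])

    U = np.block([[A, B.T], [B, C]])             # U = [A B^T; B C]
    S = C - B @ np.linalg.solve(A, B.T)          # Schur complement C - B A^{-1} B^T

    # The two tests agree: U is positive definite iff S is (given A ≻ 0)
    print(np.linalg.eigvalsh(U).min() > 0, np.linalg.eigvalsh(S).min() > 0)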

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Primal and Dual SDPs

Consider a primal SDP

min  C • X
s.t. A_i • X = b_i,  i = 1..m,
     X ⪰ 0,

where all A_i ∈ SR^{n×n}, b ∈ R^m, C ∈ SR^{n×n} are given, and X ∈ SR^{n×n} is the variable.

The associated dual SDP

max  b^T y
s.t. Σ_{i=1}^m y_i A_i + S = C,
     S ⪰ 0,

where y ∈ R^m and S ∈ SR^{n×n} are the variables.

An equivalent dual SDP has the form

max  b^T y
s.t. Σ_{i=1}^m y_i A_i ⪯ C.

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Weak Duality in SDP

Theorem (Weak Duality)
If X is feasible in the primal and (y, S) in the dual, then

C • X − b^T y = X • S ≥ 0.

Proof:

C • X − b^T y = (Σ_{i=1}^m y_i A_i + S) • X − b^T y
             = Σ_{i=1}^m (A_i • X) y_i + S • X − b^T y
             = S • X = X • S.

Further, since X is positive semidefinite, it has a square root X^{1/2} (Property 6), and so

X • S = trace(XS) = trace(X^{1/2} X^{1/2} S) = trace(X^{1/2} S X^{1/2}) ≥ 0.

We use Property 1 and the fact that S and X^{1/2} are positive semidefinite, hence X^{1/2} S X^{1/2} is positive semidefinite and its trace is nonnegative.

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

SDP Example 1

Minimize the Maximum Eigenvalue

We wish to choose x ∈ R^k to minimize the maximum eigenvalue of A(x) = A_0 + x_1 A_1 + ... + x_k A_k, where A_i ∈ R^{n×n} and A_i = A_i^T.

Observe that

λ_max(A(x)) ≤ t

if and only if

λ_max(A(x) − tI) ≤ 0,

or equivalently, if and only if

λ_min(tI − A(x)) ≥ 0.

This holds iff tI − A(x) ⪰ 0. So we get the SDP in the dual form:

max  −t
s.t. tI − A(x) ⪰ 0,

where the variable is y := (t, x).

Application: This problem arises for example in stabilizing a differential equation.

8
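A small numerical sanity check of the equivalence above (Python/NumPy sketch; the matrices A_0, A_1 and the point (x, t) are made up for illustration):

    import numpy as np

    A0 = np.array([[1.0, 0.5], [0.5, -2.0]])
    A1 = np.array([[0.0, 1.0], [1.0, 3.0]])
    x, t = 0.3, 2.5

    Ax = A0 + x * A1
    lam_max = np.linalg.eigvalsh(Ax).max()

    # lambda_max(A(x)) <= t  iff  t*I - A(x) is positive semidefinite
    lhs = lam_max <= t
    rhs = np.linalg.eigvalsh(t * np.eye(2) - Ax).min() >= 0
    print(lhs, rhs)        # the two tests always agree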


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

SDP Example 2

Logarithmic Chebyshev Approximation

Suppose we wish to solve Ax ≈ b approximately, where A = [a_1 ... a_n]^T ∈ R^{n×k} and b ∈ R^n. In Chebyshev approximation we minimize the ℓ∞-norm of the residual, i.e., we solve

min  max_i |a_i^T x − b_i|.

This can be cast as a linear program, with x and an auxiliary variable t:

min  t
s.t. −t ≤ a_i^T x − b_i ≤ t,  i = 1..n.

In some applications b_i has the dimension of a power or intensity, and it is typically expressed on a logarithmic scale. In such cases the more natural optimization problem is

min  max_i |log(a_i^T x) − log(b_i)|

(assuming a_i^T x > 0 and b_i > 0).

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Log Chebyshev Approximation

The logarithmic Chebyshev approximation problem can be cast as a semidefinite program. To see this, note that

|log(a_i^T x) − log(b_i)| = log max(a_i^T x / b_i,  b_i / a_i^T x).

Hence the problem can be rewritten as the following nonlinear program

min  t
s.t. 1/t ≤ a_i^T x / b_i ≤ t,  i = 1..n,

or,

min  t
s.t. [ t − a_i^T x / b_i   0               0 ]
     [ 0                   a_i^T x / b_i   1 ]  ⪰ 0,   i = 1..n,
     [ 0                   1               t ]

which is a semidefinite program.

10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

SDP Example 3

Stats: Minimum Trace Factor Analysis

Assume x ∈ R^n is a random vector, with mean x̄ and covariance matrix Σ. We take a large number of samples y = x + w, where the measurement noise w has zero mean, is uncorrelated with x, and has an unknown but diagonal covariance matrix D. It follows that Σ̂ = Σ + D, where Σ̂ denotes the covariance matrix of y.

We assume that we can estimate Σ̂ with high confidence, i.e., we consider it a known, constant matrix.

We do not know Σ, the covariance of x, or D, the covariance matrix of the measurement noise. But they are both positive semidefinite, so we know that Σ lies in the convex set

C = { Σ̂ − D | Σ̂ − D ⪰ 0, D ⪰ 0, D diagonal }.

We can derive bounds on linear functions of Σ by solving SDPs over the set C.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Min Trace Factor Analysis

Consider the sum of components of x, i.e., e^T x. The variance of e^T x is given by

e^T Σ e = e^T Σ̂ e − trace(D).

Since we do not know Σ, we cannot say exactly what e^T Σ e is. But we can compute its bounds. The upper bound is e^T Σ̂ e. A lower bound can be obtained by solving the SDP

max  Σ_{i=1}^n d_i
s.t. Σ̂ − diag(d) ⪰ 0,
     d ≥ 0.

Fletcher interpreted this as the educational testing problem. The vector y gives the scores of a random student on a series of n tests, and e^T y gives the total score. One considers the test to be reliable if the variance of the measured total scores e^T y is close to the variance of e^T x over the entire population. The quantity

ρ = e^T Σ e / e^T Σ̂ e

is called the reliability of the test. SDP provides a lower bound for ρ.

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Logarithmic Barrier Function

Define the logarithmic barrier function for the cone SR^{n×n}_+ of positive definite matrices:

f : SR^{n×n}_+ → R,
f(X) = − ln det X  if X ≻ 0,   +∞ otherwise.

Let us evaluate its derivatives. Let X ≻ 0, H ∈ SR^{n×n}. Then

f(X + αH) = − ln det[X(I + αX^{-1}H)]
          = − ln det X − ln(1 + α trace(X^{-1}H) + O(α²))
          = f(X) − α X^{-1} • H + O(α²),

so that f′(X) = −X^{-1} and Df(X)[H] = −X^{-1} • H.

Similarly

f′(X + αH) = −[X(I + αX^{-1}H)]^{-1}
           = −[I − αX^{-1}H + O(α²)] X^{-1}
           = f′(X) + α X^{-1} H X^{-1} + O(α²),

so that f″(X)[H] = X^{-1} H X^{-1} and D²f(X)[H, G] = (X^{-1} H X^{-1}) • G.

Finally,

f‴(X)[H, G] = −X^{-1} H X^{-1} G X^{-1} − X^{-1} G X^{-1} H X^{-1}.

13
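The derivative formula can be checked numerically (Python/NumPy sketch; X and H below are arbitrary symmetric test matrices, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4))
    X = M @ M.T + 4 * np.eye(4)                         # a positive definite X
    H = rng.standard_normal((4, 4)); H = (H + H.T) / 2  # symmetric direction

    f = lambda X: -np.log(np.linalg.det(X))
    alpha = 1e-6
    fd = (f(X + alpha * H) - f(X)) / alpha              # finite-difference slope
    analytic = -np.trace(np.linalg.inv(X) @ H)          # Df(X)[H] = -X^{-1} • H
    print(fd, analytic)                                 # agree to several digits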

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Logarithmic Barrier Function

Theorem:
f(X) = − ln det X is a convex barrier for SR^{n×n}_+.

Proof:
Define φ(α) = f(X + αH). We know that f is convex if, for every X ∈ SR^{n×n}_+ and every H ∈ SR^{n×n}, φ(α) is convex in α.

Consider the set of α such that X + αH ≻ 0. On this set

φ″(α) = D²f(X̄)[H, H] = (X̄^{-1} H X̄^{-1}) • H,   where X̄ = X + αH.

Since X̄ ≻ 0, so is V = X̄^{-1/2} (Property 6), and

φ″(α) = (V² H V²) • H = trace(V² H V² H) = trace((V H V)(V H V)) = ‖V H V‖²_F ≥ 0.

So φ is convex.

When X ≻ 0 approaches a singular matrix, its determinant approaches zero and f(X) → ∞.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Simplified Notation

Define A : SR^{n×n} → R^m by

A X = (A_i • X)_{i=1}^m ∈ R^m.

Note that, for any X ∈ SR^{n×n} and y ∈ R^m,

(A X)^T y = Σ_{i=1}^m (A_i • X) y_i = (Σ_{i=1}^m y_i A_i) • X,

so the adjoint of A is given by

A* y = Σ_{i=1}^m y_i A_i.

A* is a mapping from R^m to SR^{n×n}.

With this notation the primal SDP becomes

min  C • X
s.t. A X = b,
     X ⪰ 0,

where X ∈ SR^{n×n} is the variable.

The associated dual SDP writes

max  b^T y
s.t. A* y + S = C,
     S ⪰ 0,

where y ∈ R^m and S ∈ SR^{n×n} are the variables.

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Solving SDPs with IPMs

Replace the primal SDP

min  C • X
s.t. A X = b,
     X ⪰ 0,

with the primal barrier SDP

min  C • X + µ f(X)
s.t. A X = b

(with a barrier parameter µ ≥ 0).

Formulate the Lagrangian

L(X, y, S) = C • X + µ f(X) − y^T (A X − b),

with y ∈ R^m, and write the first order conditions (FOC) for a stationary point of L:

C + µ f′(X) − A* y = 0.

Use f(X) = − ln det X and f′(X) = −X^{-1}. Denote S = µ X^{-1}, i.e., XS = µI. For a positive definite matrix X its inverse is also positive definite. The FOC now become:

A X = b,
A* y + S = C,
X S = µ I,

with X ≻ 0 and S ≻ 0.

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Newton direction

We derive the Newton direction for the system:

A X = b,
A* y + S = C,
−µ X^{-1} + S = 0.

Recall that the variables in the FOC are (X, y, S), where X, S ∈ SR^{n×n}_+ and y ∈ R^m. Hence we look for a direction (∆X, ∆y, ∆S), where ∆X, ∆S ∈ SR^{n×n} and ∆y ∈ R^m.

The differentiation in the above system is a nontrivial operation. The direction is the solution of the system:

[ A                    0    0 ] [ ∆X ]   [ ξ_b ]
[ 0                    A*   I ] [ ∆y ] = [ ξ_C ]
[ µ(X^{-1} ⊙ X^{-1})   0    I ] [ ∆S ]   [ ξ_µ ]

We introduce a useful notation P ⊙ Q for n × n matrices P and Q. This is an operator from SR^{n×n} to SR^{n×n} defined by

(P ⊙ Q) U = ½ (P U Q^T + Q U P^T).

17
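The symmetrized product operator is straightforward to implement; a Python/NumPy sketch (illustration only, with random test matrices):

    import numpy as np

    def sym_prod_apply(P, Q, U):
        # Apply (P ⊙ Q) to U:  (P ⊙ Q)U = 0.5 * (P U Q^T + Q U P^T)
        return 0.5 * (P @ U @ Q.T + Q @ U @ P.T)

    rng = np.random.default_rng(1)
    P, Q = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
    U = rng.standard_normal((3, 3)); U = (U + U.T) / 2
    R = sym_prod_apply(P, Q, U)
    print(np.allclose(R, R.T))     # True: the result is symmetric, as the Newton system requires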


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 12:

Linear Algebra Issues

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Linear Algebra of IPM for LP

First order optimality conditions

A x = b,
A^T y + s = c,
X S e = µ e.

Newton direction

[ A   0    0 ] [ ∆x ]   [ ξ_p ]
[ 0   A^T  I ] [ ∆y ] = [ ξ_d ]
[ S   0    X ] [ ∆s ]   [ ξ_µ ]

where

[ ξ_p ]   [ b − A x        ]
[ ξ_d ] = [ c − A^T y − s  ]
[ ξ_µ ]   [ µe − X S e     ]

Use the third equation to eliminate

∆s = X^{-1}(ξ_µ − S ∆x) = −X^{-1} S ∆x + X^{-1} ξ_µ

from the second equation and get

[ −Θ^{-1}  A^T ] [ ∆x ]   [ ξ_d − X^{-1} ξ_µ ]
[  A       0   ] [ ∆y ] = [ ξ_p              ]

where Θ = X S^{-1} is a diagonal scaling matrix.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Weighted Least Squares

Augmented system (symmetric but indefinite)

[ −Θ^{-1}  A^T ] [ ∆x ]   [ r ]
[  A       0   ] [ ∆y ] = [ h ]

where

[ r ]   [ ξ_d − X^{-1} ξ_µ ]
[ h ] = [ ξ_p              ]

Eliminate

∆x = Θ A^T ∆y − Θ r

to get the normal equations (a symmetric, positive definite system)

(A Θ A^T) ∆y = g = A Θ r + h.

The matrix A Θ A^T always has the same sparsity structure (only Θ changes in subsequent iterations).

Two step solution method:
• factorization to LDL^T form,
• backsolve to compute direction ∆y.

3
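A minimal Python/NumPy sketch of this two-step solve (small random placeholders for A, Θ and g; a production IPM would use a sparse Cholesky code):

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 3, 6
    A = rng.standard_normal((m, n))
    theta = rng.uniform(0.1, 10.0, n)        # diagonal of Θ = X S^{-1}
    g = rng.standard_normal(m)

    M = (A * theta) @ A.T                    # A Θ A^T  (m×m, positive definite)
    L = np.linalg.cholesky(M)                # factorization step
    u = np.linalg.solve(L, g)                # forward backsolve
    dy = np.linalg.solve(L.T, u)             # backward backsolve
    print(np.allclose(M @ dy, g))            # True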

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Linear Algebra of IPMs

Cholesky factorization

L D L^T = A Θ A^T.

Cholesky factorization is nothing else but the Gaussian Elimination process that exploits two properties of the matrix:
• symmetry;
• positive definiteness.

Def. A matrix H ∈ R^{m×m} is symmetric if H = H^T.

Example: H1 is symmetric, H2 is not.

H1 = [ 5 3 ]     H2 = [ 5 3 ]
     [ 3 8 ]          [ 2 8 ]

Def. A matrix H ∈ R^{m×m} is positive definite if x^T H x > 0 for any x ≠ 0.

Example: H3 is positive definite, H4 is not.

H3 = [ 5 3 ]     H4 = [ 4 3 ]
     [ 3 2 ]          [ 3 2 ]

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Pivots from the diagonal

Lemma 1: If

H = [ p  a^T ]
    [ a  W   ]  ∈ R^{m×m}

is positive definite, then W − (1/p) a a^T is also positive definite.

Proof: Observe that p > 0. Otherwise, for x = e_1 ∈ R^m we would have x^T H x = p ≤ 0.

We prove the main result by contradiction. Suppose W − (1/p) a a^T is not positive definite, that is, there exists x̄ ∈ R^{m−1}, x̄ ≠ 0, such that

x̄^T (W − (1/p) a a^T) x̄ ≤ 0.

Define x = (α, x̄^T)^T ∈ R^m, where α = −x̄^T a / p, and compute:

x^T H x = [α, x̄^T] [ p  a^T ] [ α ]
                   [ a  W   ] [ x̄ ]
        = α² p + 2α x̄^T a + x̄^T W x̄
        = x̄^T (W − (1/p) a a^T) x̄ + (1/p)(α p + x̄^T a)²
        = x̄^T (W − (1/p) a a^T) x̄ ≤ 0,

which contradicts the positive definiteness of H.

Corollary.
If GE is applied to a positive definite matrix, then all pivots can be chosen from the diagonal.
Diagonal pivoting preserves symmetry.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Cholesky Factorization

Use the Cholesky factorization

L D L^T = A Θ A^T,

where:
L is a lower triangular matrix; and
D is a diagonal matrix.

Hence replace the difficult equation

(A Θ A^T) · ∆y = g

with a sequence of easy equations:

L · u = g,
D · v = u,
L^T · ∆y = v.

Note that

g = L u = L(D v) = L D (L^T ∆y) = (L D L^T) ∆y = (A Θ A^T) ∆y.

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Existence of LDLT factors

Lemma 2: The decomposition H = L D L^T with d_ii > 0 for all i exists iff H is positive definite (PD).

Proof:
Part 1 (⇒) Let H = L D L^T with d_ii > 0. Take any x ≠ 0 and let u = L^T x. Since L is a unit lower triangular matrix it is nonsingular, so u ≠ 0 and

x^T H x = x^T L D L^T x = u^T D u = Σ_{i=1}^m d_ii u_i² > 0.

Part 2 (⇐) Proof by induction on the dimension of H. For m = 1: H = h_11 = d_11 > 0 iff H is PD.

Assume the result is true for m = k − 1 ≥ 1. Let

H = [ W    a ]
    [ a^T  q ]  ∈ R^{k×k}

be a given k × k positive definite matrix with W ∈ R^{(k−1)×(k−1)}, a ∈ R^{k−1} and q ∈ R. Note first that since H is PD, W is also PD. Indeed, for any (x, 0) ∈ R^k we have

[x^T, 0] [ W    a ] [ x ]
         [ a^T  q ] [ 0 ]  = x^T W x > 0   for all x ∈ R^{k−1}, x ≠ 0.

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Existence of LDLT factors

From the inductive hypothesis we know that W = L D L^T with d_ii > 0. Let

[ W    a ]   [ L    0 ] [ D  0 ] [ L^T  l ]
[ a^T  q ] = [ l^T  1 ] [ 0  d ] [ 0    1 ],

where l is the solution of the equation (L D) l = a (it is well defined since L and D are nonsingular) and d is given by d = q − l^T D l.

Hence the matrix H = [ W  a ; a^T  q ] has an L D L^T decomposition. It remains to prove that d > 0. Consider the vector

x = [ −L^{-T} l ]
    [  1        ].

Since H is positive definite, we get

0 < x^T H x
  = [−l^T L^{-1}, 1] [ L    0 ] [ D  0 ] [ L^T  l ] [ −L^{-T} l ]
                     [ l^T  1 ] [ 0  d ] [ 0    1 ] [  1        ]
  = [0, 1] [ D  0 ] [ 0 ]
           [ 0  d ] [ 1 ]
  = d,

which completes the proof.

8


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Linear Equation Systems

Let H ∈ R^{n×n} be a nonsingular matrix

H = [ h_11 h_12 ... h_1n ]
    [ h_21 h_22 ... h_2n ]
    [  ...  ...  ...  ...]
    [ h_n1 h_n2 ... h_nn ].

The general linear equation system

H x = b

is difficult to solve. Suppose we represent H in the following form:

H = [ 1               ] [ u_11 u_12 ... u_1n ]
    [ l_21 1          ] [      u_22 ... u_2n ]
    [  ...      ...   ] [            ...     ]
    [ l_n1 l_n2 ...  1 ] [              u_nn ].

The system H x = b can then be solved as a sequence of two (easy) triangular systems:

L z = b,
U x = z.

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Triangular Equation Systems

The system with a unit lower triangular matrix

[ 1                ] [ z_1 ]   [ b_1 ]
[ l_21 1           ] [ z_2 ] = [ b_2 ]
[  ...       ...   ] [ ... ]   [ ... ]
[ l_n1 l_n2 ...  1 ] [ z_n ]   [ b_n ]

can be solved as follows:

for i = 1, 2, ..., n
    z_i := b_i
    for j = i+1, i+2, ..., n
        b_j := b_j − l_ji z_i

The system with an upper triangular matrix

[ u_11 u_12 ... u_1n ] [ x_1 ]   [ z_1 ]
[      u_22 ... u_2n ] [ x_2 ] = [ z_2 ]
[            ...     ] [ ... ]   [ ... ]
[              u_nn  ] [ x_n ]   [ z_n ]

can be solved as follows:

for i = n, n-1, ..., 1
    x_i := z_i / u_ii
    for j = 1, 2, ..., i-1
        z_j := z_j − u_ji x_i

10
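The two loops translate directly into Python (a sketch with NumPy arrays, indices shifted to 0-based):

    import numpy as np

    def forward_subst(L, b):
        # Solve L z = b with L unit lower triangular (ones on the diagonal)
        b = b.astype(float).copy()
        n = len(b)
        z = np.zeros(n)
        for i in range(n):
            z[i] = b[i]
            for j in range(i + 1, n):
                b[j] -= L[j, i] * z[i]
        return z

    def backward_subst(U, z):
        # Solve U x = z with U upper triangular
        z = z.astype(float).copy()
        n = len(z)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = z[i] / U[i, i]
            for j in range(i):
                z[j] -= U[j, i] * x[i]
        return x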

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Gaussian Elimination

The matrix A and the right-hand side at the beginning of step k of Gaussian Elimination have the form

A^(k) = [ a_11  a_12  ...  a_1,k-1          a_1k          ...  a_1n          ]
        [ 0     a_22^(2)   a_2,k-1^(2)      a_2k^(2)      ...  a_2n^(2)      ]
        [ ...        ...        ...              ...                ...      ]
        [ 0     0     ...  a_{k-1,k-1}^(k-1) a_{k-1,k}^(k-1) ... a_{k-1,n}^(k-1) ]
        [ 0     0     ...  0                a_kk^(k)      ...  a_kn^(k)      ]
        [ ...        ...        ...              ...                ...      ]
        [ 0     0     ...  0                a_nk^(k)      ...  a_nn^(k)      ]

b^(k) = ( b_1, b_2^(2), ..., b_{k-1}^(k-1), b_k^(k), ..., b_n^(k) )^T,

i.e., the first k − 1 columns are already in upper triangular form.

Elimination operations at step k:

for i = k+1, k+2, ..., n
    m_ik := a_ik^(k) / a_kk^(k)
    for j = k+1, k+2, ..., n
        a_ij^(k+1) := a_ij^(k) − m_ik a_kj^(k)
    b_i^(k+1) := b_i^(k) − m_ik b_k^(k)

After step k of Gaussian Elimination, A^(k+1) and b^(k+1) have the same form as above with k replaced by k + 1: one more column of the triangle is completed.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Elementary Operations in GE

Step k of Gaussian Elimination can be viewed as multiplying A^(k) and b^(k) by the following n × n matrix

M_k = [ 1                                  ]
      [    ...                             ]
      [         1                          ]
      [         −m_{k+1,k}   1             ]
      [            ...             ...     ]
      [         −m_{n,k}              1    ]

(the identity, except for the multipliers −m_{i,k} below the diagonal in column k).

After n − 1 steps

A^(n) = M_{n−1} M_{n−2} ... M_1 A = [ a_11  a_12     ...  a_1n     ]
                                    [ 0     a_22^(2) ...  a_2n^(2) ]
                                    [ ...        ...               ]
                                    [ 0     0        ...  a_nn^(n) ]

is an upper triangular matrix.

Gaussian Elimination implicitly transforms the matrix A into the product A = LU. Normally, the matrix L is not stored. However, it is possible to save the elementary row multipliers m_ik and obtain an explicit form of the matrix L.

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Accuracy of Computations

When you compute

S = π r²,

you probably set π = 3.14. Beware of the finite precision of the computer.

IEEE standard relative precision: practically all computers in the world use

ε ≈ 2.2 × 10^{-16}.

This means that, in floating-point arithmetic,

1 + ε ≠ 1,   but   1 + ½ε = 1.

13
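This can be observed directly in Python (a sketch using only the standard library):

    import sys

    eps = sys.float_info.epsilon          # ≈ 2.22e-16 for IEEE double precision
    print(eps)
    print(1.0 + eps == 1.0)               # False: 1 + eps is distinguishable from 1
    print(1.0 + eps / 2 == 1.0)           # True:  1 + eps/2 rounds back to 1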

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Accuracy and Gaussian Elim.

General Idea: Avoid divisions by small numbers because they cause growth of numerical errors.

Partial Pivoting
Choose the element with the largest absolute value in a column as a pivot.

Complete Pivoting
Choose the element with the largest absolute value in the active submatrix as a pivot.

Complete pivoting ensures excellent accuracy. Partial pivoting provides acceptable accuracy. No pivoting at all is extremely dangerous.

Example
GE without pivoting applied to the matrix

[ 1 1 1 ]
[ 1 1 2 ]
[ 3 4 5 ]

gives in the second step the following 2 × 2 Schur complement

[ 0 1 ]
[ 1 2 ],

so there is a zero at the place of the pivot.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Computational Complexity

A flop is a floating point operation:

x := x + a · b.

LU decomposition of an unsymmetric n × n matrix needs about ⅓ n³ flops.

Sketch of the proof.
At iteration k of the LU decomposition one flop is executed for every element of the active submatrix of size (n − k) × (n − k). Thus the whole decomposition requires

Σ_{k=1}^{n−1} (n − k)² ≈ ⅓ n³

flops.

Solving an equation with an LU decomposition of an unsymmetric n × n matrix needs n² flops.

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Additional Cost of Pivoting

Partial Pivoting
At step k we choose max_{i≥k} |a_ik^(k)|, so overall we add

Σ_{k=1}^{n−1} (n − k) ≈ ½ n²

comparisons of real numbers.

Complete Pivoting
At step k we choose max_{i,j≥k} |a_ij^(k)|, so overall we add

Σ_{k=1}^{n−1} (n − k)² ≈ ⅓ n³

comparisons of real numbers.

A comparison of two real numbers has a cost comparable to a single flop. Thus partial pivoting adds very little to the cost of the LU decomposition. However, complete pivoting doubles the cost of the decomposition. The guarantee of accuracy is costly!

GE with complete pivoting is two times more expensive than GE with partial pivoting.

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Symmetric Gaussian Elim.

Let H ∈ R^{m×m} be a symmetric positive definite matrix

H = [ h_11 h_12 ... h_1m ]
    [ h_21 h_22 ... h_2m ]
    [  ...           ... ]
    [ h_m1 h_m2 ... h_mm ].

By applying Gaussian Elimination to it, we can represent it in the form H = L D L^T:

[ 1                ] [ d_11              ] [ 1  l_21 ... l_m1 ]
[ l_21 1           ] [      d_22         ] [    1    ... l_m2 ]
[  ...       ...   ] [            ...    ] [           ...    ]
[ l_m1 l_m2 ...  1 ] [              d_mm ] [              1   ]

Example 1:

[ 1 −1  2 ]   [ 1       ] [ 1     ] [ 1 −1  2 ]
[−1  3  0 ] = [−1  1    ] [   2   ] [    1  1 ]
[ 2  0  9 ]   [ 2  1  1 ] [     3 ] [       1 ]

Example 2:

[ 1  1 −1 ]   [ 1       ] [ 1     ] [ 1  1 −1 ]
[ 1  5  7 ] = [ 1  1    ] [   4   ] [    1  2 ]
[−1  7 22 ]   [−1  2  1 ] [     5 ] [       1 ]

17


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 13:

Linear Algebra Issues (cont’d)

Definite & Indefinite Systems

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Definite & Indefinite Systems

Cholesky factorization fails for an indefinite matrix.

Example 1: Negative pivot d_22 < 0.

[ 3 2 ]   [ 1    0 ] [ 3   0   ] [ 1  2/3 ]
[ 2 1 ] = [ 2/3  1 ] [ 0  −1/3 ] [ 0  1   ]

Example 2: d_11 = 0. Can't start the elimination.

[ 0 2 ]
[ 2 5 ]  = ???

IPMs:
For the indefinite augmented system

[ −Θ^{-1}  A^T ] [ ∆x ]   [ r ]
[  A       0   ] [ ∆y ] = [ h ]

one needs to use some special tricks.

For the positive definite normal equations

(A Θ A^T) ∆y = g

one can compute the Cholesky factorization.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Symmetric Factorization

Two step solution method:

• factorization to LDLT form,

• backsolve to compute direction ∆y.

A symmetric nonsingular matrix H is factoriz-

able if there exists a diagonal matrix D and unit

lower triangular matrix L such that H = LDLT .

A symmetric matrix H is strongly factorizable

if for any permutation matrix P a factorization

PHPT = LDLT exists.

The general symmetric indefinite matrix is not

factorizable.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Factorizing Indefinite Matrix

Two options are possible:

1. Replace the diagonal matrix D with a block-diagonal one and allow 2 × 2 (indefinite) pivots

[ 0 a ]        [ 0 a ]
[ a 0 ]  and   [ a d ].

Hence obtain a decomposition H = L D L^T with block-diagonal D.

2. Regularize the indefinite matrix to produce a quasidefinite matrix

K = [ −E  A^T ]
    [  A  F   ],

where
E ∈ R^{n×n} is positive definite,
F ∈ R^{m×m} is positive definite, and
A ∈ R^{m×n} has full row rank.

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Using 2 × 2 Pivots

Replace the diagonal matrix D with a block-diagonal one and allow 2 × 2 (indefinite) pivots

[ 0 a ]        [ 0 a ]
[ a 0 ]  and   [ a d ].

Hence obtain a decomposition H = L D L^T with block-diagonal D.

Drawback:
One cannot predict when a zero pivot will appear. Consequently, the pivots have to be chosen dynamically (one cannot reorder the matrix beforehand and in that way separate the symbolic and numerical factorization phases).

This considerably slows down the computations.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Quasidefinite Matrices

A symmetric matrix is called quasidefinite if

K = [ −E  A^T ]
    [  A  F   ],

where E ∈ R^{n×n} and F ∈ R^{m×m} are positive definite, and A ∈ R^{m×n} has full row rank.

A symmetric nonsingular matrix K is factorizable if there exist a diagonal matrix D and a unit lower triangular matrix L such that K = L D L^T.

The symmetric matrix K is strongly factorizable if for any permutation matrix P a factorization P K P^T = L D L^T exists.

Vanderbei (1995) proved that symmetric quasidefinite matrices are strongly factorizable.

For any quasidefinite matrix there exists a Cholesky-like factorization

K = L D L^T,

where D is diagonal but not positive definite: n negative pivots and m positive pivots.

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Quasidefinite Matrix

Def:

An indefinite matrix

H = [ −E  A^T ]
    [  A  F   ],

where E ∈ R^{n×n} is positive definite and F ∈ R^{m×m} is positive definite, is called quasidefinite.

Lemma 1: If G ∈ R^{n×n} is positive definite and p > 0, then for any a ∈ R^n the matrix G + (1/p) a a^T is positive definite.

Proof: Take any x ∈ R^n, x ≠ 0, and observe that

x^T (G + (1/p) a a^T) x = x^T G x + (1/p)(x^T a)² > 0.

Lemma 2: If G ∈ R^{n×n} is negative definite and p < 0, then for any a ∈ R^n the matrix G + (1/p) a a^T is negative definite.

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Factorizing Quasidefinite Mtx

Lemma 3:
A quasidefinite matrix is strongly factorizable.

Sketch of the proof.
We show that the elimination of any diagonal pivot in a quasidefinite matrix preserves quasidefiniteness of the Schur complement. We consider two cases depending on where the first pivot comes from (the −E part or the F part).

Suppose the first pivot is from the −E part:

[ p  e^T   u^T   ]
[ e  −E_1  A_1^T ]
[ u  A_1   F     ],

where p < 0 is the pivot element and E_1 is a positive definite matrix. After the elimination of the pivot, we obtain the following Schur complement

[ −E_1 − (1/p) e e^T    A_1^T − (1/p) e u^T ]
[  A_1 − (1/p) u e^T    F − (1/p) u u^T     ].      (1)

The matrix −E_1 − (1/p) e e^T is negative definite (Lect 12: Lemma 1) and the matrix F − (1/p) u u^T is positive definite (Lemma 1), so (1) is quasidefinite.

8


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

From Indefinite to Quasidef.

Indefinite matrix

H = [ −Q − Θ^{-1}  A^T ]
    [  A           0   ].

Vanderbei (1995): replace Ax = b with Ax ≤ b,

H_V = [ −Θ_s^{-1}   0            I   ]
      [  0          −Q − Θ^{-1}  A^T ]
      [  I           A           0   ],

and eliminate Θ_s^{-1}:

K = [ −Q − Θ^{-1}  A^T  ]
    [  A           Θ_s  ].

Saunders (1996):

H_S = [ −Q − Θ^{-1}  A^T ]   [ −γ² I_n  0      ]
      [  A           0   ] + [  0       δ² I_m ],

where γ, δ ≥ √ε ≈ 10^{-8}.

A & G (1999): use dynamic regularization

H = [ −Q − Θ^{-1}  A^T ]   [ −R_p  0   ]
    [  A           0   ] + [  0    R_d ],

where R_p ∈ R^{n×n} and R_d ∈ R^{m×m} are primal and dual regularizations.

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

From Indefinite to Quasidef.

The indefinite matrix

H = [ −Q − Θ^{-1}  A^T ]
    [  A           0   ]

in IPMs can be converted to a quasidefinite one. Regularize the indefinite matrix to produce a quasidefinite matrix: use dynamic regularization

H = [ −Q − Θ^{-1}  A^T ]   [ −R_p  0   ]
    [  A           0   ] + [  0    R_d ],

where
R_p ∈ R^{n×n} is a primal regularization,
R_d ∈ R^{m×m} is a dual regularization.

For any quasidefinite matrix there exists a Cholesky-like factorization

H = L D L^T,

where D is diagonal but not positive definite:
n negative pivots;
m positive pivots.

10
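A small Python/NumPy sketch of the pivot-sign statement: symmetric elimination with pivots taken in the natural order on a (made-up) quasidefinite matrix yields n negative pivots followed by m positive ones (illustration only, not the regularization used in an IPM code):

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 3, 2
    E = 2.0 * np.eye(n)                     # E positive definite
    F = 1.5 * np.eye(m)                     # F positive definite
    A = rng.standard_normal((m, n))
    K = np.block([[-E, A.T], [A, F]])       # quasidefinite matrix

    pivots = []
    S = K.copy()
    for k in range(n + m):                  # diagonal pivots, natural order
        p = S[0, 0]
        pivots.append(p)
        S = S[1:, 1:] - np.outer(S[1:, 0], S[0, 1:]) / p   # Schur complement
    print(np.sign(pivots))                  # n entries -1 followed by m entries +1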

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Primal Regularization

Primal barrier problem

min  z_P = c^T x + ½ x^T Q x − µ Σ_{j=1}^n (ln x_j + ln s_j)
s.t. A x = b,
     x + s = u,
     x, s > 0.

[ −Q − Θ^{-1}  A^T ] [ ∆x ]   [ f ]
[  A           0   ] [ ∆y ] = [ h ]

Primal regularized barrier problem

min  z_P + ½ (x − x_0)^T R_p (x − x_0)
s.t. A x = b,
     x + s = u,
     x, s > 0.

[ −Q − Θ^{-1} − R_p  A^T ] [ ∆x ]   [ f′ ]
[  A                 0   ] [ ∆y ] = [ h  ]

where f′ = f − R_p (x − x_0).

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Dual Regularization

Dual barrier problem

max  z_D = b^T y − u^T w − ½ x^T Q x + µ Σ_{j=1}^n (ln z_j + ln w_j)
s.t. A^T y + z − w − Q x = c,
     x ≥ 0, z, w > 0.

[ −Q − Θ^{-1}  A^T ] [ ∆x ]   [ f ]
[  A           0   ] [ ∆y ] = [ h ]

Dual regularized barrier problem

max  z_D − ½ (y − y_0)^T R_d (y − y_0)
s.t. A^T y + z − w − Q x = c,
     x ≥ 0, z, w > 0.

[ −Q − Θ^{-1}  A^T ] [ ∆x ]   [ f  ]
[  A           R_d ] [ ∆y ] = [ h′ ]

where h′ = h − R_d (y − y_0).

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Large Problems are Sparse

Large problems are always sparse. Linear programs are perfect examples of that.

Exploiting sparsity in computations leads to huge savings. Exploiting sparsity means mainly avoiding useless computations: computations for which the result is known, as for example multiplications with zero.

Example: Consider the multiplication

       [ 2  1  0   4  0   0 ]   [  2 ]
A x =  [ 0  2  0  −1  5  −1 ] · [  0 ]
       [ 3  0  3   8  0   5 ]   [  5 ]
                                [  0 ]
                                [  0 ]
                                [ −2 ]

It requires computing

2 · A.1 + 5 · A.3 − 2 · A.6

(a combination of columns 1, 3 and 6 of A) and involves only five multiplications and five additions. We say that this matrix-vector multiplication needs 5 flops.

13
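A sketch of the same computation in Python, storing only the nonzeros of x and of each column of A (columns indexed from 1 as on the slide; a made-up toy representation, not a library format):

    # Columns of A stored sparsely as {row: value}; only nonzero entries of x kept.
    A_cols = {
        1: {1: 2, 3: 3},            # A.1 = (2, 0, 3)
        3: {3: 3},                  # A.3 = (0, 0, 3)
        6: {2: -1, 3: 5},           # A.6 = (0, -1, 5)
    }
    x = {1: 2, 3: 5, 6: -2}         # nonzeros of x

    y = {}                          # the result Ax, kept sparse as well
    flops = 0
    for j, xj in x.items():
        for i, aij in A_cols[j].items():
            y[i] = y.get(i, 0) + aij * xj     # one multiply-add per stored nonzero
            flops += 1
    print(y, flops)                 # {1: 4, 3: 11, 2: 2}  5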

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Sparsity Issues in QP

Example

[ 1 1       ]^{-1}   ( [ 1         ]   [ 1 1       ] )^{-1}
[ 1 2 1     ]      = ( [ 1 1       ] · [   1 1     ] )
[   1 2 1   ]        ( [   1 1     ]   [     1 1   ] )
[     1 2 1 ]        ( [     1 1   ]   [       1 1 ] )
[       1 2 ]        ( [       1 1 ]   [         1 ] )

  [ 1 −1  1 −1  1 ]   [  1             ]   [  5 −4  3 −2  1 ]
  [    1 −1  1 −1 ]   [ −1  1          ]   [ −4  4 −3  2 −1 ]
= [       1 −1  1 ] · [  1 −1  1       ] = [  3 −3  3 −2  1 ]
  [          1 −1 ]   [ −1  1 −1  1    ]   [ −2  2 −2  2 −1 ]
  [             1 ]   [  1 −1  1 −1  1 ]   [  1 −1  1 −1  1 ]

Conclusion:
the inverse of a sparse matrix may be dense.

IPMs for QP:
Do not explicitly invert the matrix Q + Θ^{-1} in the matrix A (Q + Θ^{-1})^{-1} A^T.
Use the augmented system instead.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

IPMs: LP vs QP

Augmented system in LP

[ −Θ^{-1}  A^T ] [ ∆x ]   [ ξ_d − X^{-1} ξ_µ ]
[  A       0   ] [ ∆y ] = [ ξ_p              ]

Eliminate ∆x from the first equation and get the normal equations

(A Θ A^T) ∆y = g.

Augmented system in QP

[ −Q − Θ^{-1}  A^T ] [ ∆x ]   [ ξ_d − X^{-1} ξ_µ ]
[  A           0   ] [ ∆y ] = [ ξ_p              ]

Eliminate ∆x from the first equation and get the normal equations

(A (Q + Θ^{-1})^{-1} A^T) ∆y = g.

One can use normal equations in LP, but not in QP. Normal equations in QP may become almost completely dense even for sparse matrices A and Q. Thus, in QP, usually the indefinite augmented system form is used.

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Quasidefinite System in NLP

Lemma 4:
If f : R^n → R is strictly convex, g : R^n → R^m is convex, both f and g are twice differentiable, and A(x) has full row rank for any x, then the augmented system matrix

H = [ Q(x, y)  A(x)^T    ]
    [ A(x)     −Z Y^{-1} ]

is quasidefinite for any x and any z, y > 0.

Proof:
From the (strict) convexity of f and the convexity of g, the matrix Q(x, y) is positive definite. Since z, y > 0, the matrix Z Y^{-1} is also positive definite. Hence H is quasidefinite.

Remark 1:
Note that if f is convex but not strictly convex, then regularization is needed to make Q̃(x, y) = Q(x, y) + R_p positive definite.

Remark 2:
The assumption that A(x) has full row rank for any x is an example of a regularity condition.

16


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 14:

Linear Algebra Issues (cont’d)

Exploiting Sparsity

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Large Problems are Sparse

Large problems are always sparse. Linear programs are perfect examples of that.

Exploiting sparsity in computations leads to huge savings. Exploiting sparsity means mainly avoiding useless computations: computations for which the result is known, as for example multiplications with zero.

Example: Consider the multiplication

       [ 2  1  0   4  0   0 ]   [  2 ]
A x =  [ 0  2  0  −1  5  −1 ] · [  0 ]
       [ 3  0  3   8  0   5 ]   [  5 ]
                                [  0 ]
                                [  0 ]
                                [ −2 ]

It requires computing

2 · A.1 + 5 · A.3 − 2 · A.6

and involves only five multiplications and five additions. We say that this matrix-vector multiplication needs 5 flops.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Sparse Systems of Equations

A dense system of linear equations

x1 + x2 + x3 + x5 = 11
2x1 − x2 + x3 − x4 + 2x5 = 9
3x1 − x2 − x3 − x4 + 2x5 = 4
x1 − x2 + 3x3 + x4 − x5 = 7
2x1 + x3 − x4 + 7x5 = 36

needs GE, or the LU decomposition of the matrix involved.

What do you do for a sparse system of linear equations?

x1 + x2 + x3 + x5 = 11
x2 − x3 = −1
x1 + x3 + x4 + x5 = 13
x2 + 2x3 = 8
4x4 − 2x5 = 6

Use common sense:
• try to simplify,
• solve for "easy" variables.
You might be unaware of that ... but you exploit sparsity.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

LU factors of Sparse Matrix

The best pivot candidates (in terms of sparsity) are singleton rows/columns.

Example (a 6 × 6 sparsity pattern; the slide shows the pattern after each permutation, x = nonzero, p = pivot):

pivot: a_52   (row 5 has 1 nonzero, column 2 has 1 nonzero)
pivot: a_64   (row 6 has 2 nonzeros, column 4 has 1 nonzero)
pivot: a_11   (row 1 has 3 nonzeros, column 1 has 3 nonzeros)

After each choice the pivot row and column are permuted to the front of the remaining submatrix.

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

LU factors of Sparse Matrix

Continuing the example:

pivot: a_35   (row 3 has 4 nonzeros, column 5 has 3 nonzeros)
pivot: a_23   (row 2 has 4 nonzeros, column 3 already in place)

After these pivot choices the permuted matrix (rows ordered 5, 6, 1, 3, 2, 4 and columns ordered 2, 4, 1, 5, 3, 6) has a triangular sparsity pattern, so the elimination produces no fill-in.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Sparse Equation Systems

The original linear system was

x1 + 2x2 − x4 = 1
x2 + x3 − x4 + x5 = 6
3x1 + 3x2 − x5 = 4
x3 + 2x6 = 15
2x2 = 4
x2 + x4 = 6.

One should start from equation 5, then 6, etc. The reordered linear system is

2x2 = 4
x2 + x4 = 6
2x2 − x4 + x1 = 1
3x2 + 3x1 − x5 = 4
x2 − x4 + x5 + x3 = 6
x3 + 2x6 = 15,

and the corresponding sparse matrix (rows ordered 5, 6, 1, 3, 2, 4 and columns ordered 2, 4, 1, 5, 3, 6) has a triangular sparsity pattern.

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

General Sparse Systems

A single step of Gaussian Elimination applied to

A = [ p  v^T ]
    [ u  A_1 ]

produces the following Schur complement:

A_1 − p^{-1} u v^T.

The effect on the sparsity pattern: wherever row 1 has a nonzero v_j and column 1 has a nonzero u_i, the update may create a new nonzero (fill-in) in position (i, j) of the Schur complement. The 8 × 8 example on the slide marks the original nonzeros with x and the fill-in created by eliminating the pivot p with f.

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Markowitz Pivot Choice

Let r_i and c_i, i = 1, 2, ..., n, be the numbers of nonzero entries in row and column i, respectively. The elimination of the pivot a_ij needs

f_ij = (r_i − 1)(c_j − 1)

flops. This step creates at most f_ij new nonzero entries in the Schur complement.

Markowitz: choose the pivot with min_{i,j} f_ij.

The 8 × 8 example on the slide shows such a choice: the pivot p sits in a row with only two nonzeros, so its elimination creates very little fill-in (marked f in the updated pattern).

8
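A sketch of the Markowitz rule in Python (illustrated on a small dense 0/1 pattern; a real sparse code would keep row/column counts incrementally and also screen pivots for numerical stability):

    import numpy as np

    # 0/1 sparsity pattern of the active submatrix (made-up example)
    pattern = np.array([
        [1, 1, 0, 1],
        [0, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 1, 0, 1],
    ])

    r = pattern.sum(axis=1)          # nonzeros per row
    c = pattern.sum(axis=0)          # nonzeros per column

    best, best_merit = None, None
    for i in range(pattern.shape[0]):
        for j in range(pattern.shape[1]):
            if pattern[i, j]:        # pivot must be a nonzero
                merit = (r[i] - 1) * (c[j] - 1)
                if best_merit is None or merit < best_merit:
                    best, best_merit = (i, j), merit
    print(best, best_merit)          # pivot position with the smallest Markowitz merit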


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Sparsity in Cholesky Factors

Example: Consider a symmetric 4 × 4 matrix

H = [ x x x x ]
    [ x x     ]
    [ x   x   ]
    [ x     x ],

where x denotes a nonzero and empty spaces denote zeros. Direct application of the Cholesky factorization would produce a completely dense lower triangular factor

L = [ x       ]
    [ x x     ]
    [ x x x   ]
    [ x x x x ].

However, it suffices to reorder (symmetrically) the rows and columns of H,

P H P^T = [ x     x ]
          [   x   x ]
          [     x x ]
          [ x x x x ],

to obtain a sparse Cholesky factor

L = [ x       ]
    [   x     ]
    [     x   ]
    [ x x x x ].

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Minimum Degree Ordering

How to permute the rows and columns of H to get the sparsest possible Cholesky factorization? This is a difficult problem, but heuristics can help.

Example: Consider a symmetric 6 × 6 matrix H whose sparsity pattern is shown on the slide; row 1 has four nonzeros and row 2 only two.

Suppose h_11 is the first pivot: eliminating it couples all rows/columns in which column 1 has a nonzero and creates several fill-in entries (marked f on the slide).

Suppose h_22 is the first pivot instead: swap rows 1 and 2 and columns 1 and 2. The elimination then creates no fill-in, because row 2 has only one off-diagonal nonzero.

10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Minimum Degree Ordering

Recall the Markowitz pivot choice.
Let r_i and c_i, i = 1, 2, ..., n, be the numbers of nonzero entries in row and column i, respectively. The elimination of the pivot a_ij needs

f_ij = (r_i − 1)(c_j − 1)

flops. Choose the pivot with min_{i,j} f_ij (the 8 × 8 pattern from the previous slide illustrates such a choice).

In the symmetric positive definite case:
pivots are chosen from the diagonal and r_i = c_i, hence choose the pivot with min_i r_i.

Minimum degree ordering: choose an element with the minimum number of nonzeros in a row.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Nested Dissection

Remove a few nodes to disconnect the graph.

Consider an 11 × 11 symmetric matrix H (the slide shows its sparsity pattern, with 4 to 6 nonzeros per row) and the associated adjacency graph on the nodes 1, ..., 11.

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Nested Dissection (cont’d)

Remove (permute to the end) nodes 4 and 7. In the permuted matrix P H P^T the rows and columns are ordered 1, 2, 3, 5, 6, 8, 9, 10, 11, 4, 7; with the separator nodes 4 and 7 moved last, the remaining graph is disconnected and the corresponding diagonal blocks can be eliminated independently.

13

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Linear Algebra of IPMs

Cholesky factorization

L D L^T = A Θ A^T.

It involves a preparation step:
• minimum degree ordering (reduces the number of nonzeros of L);
• symbolic factorization (predicts the sparsity structure of L).

Computational complexity of the different steps:
• minimum degree ordering: O(Σ_i n_i²)
• numerical factorization: O(Σ_i n_i²)
• symbolic factorization: O(Σ_i n_i)
• backsolve: O(Σ_i n_i)

where n_i is the number of nonzero entries in column L.i of L.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Linear Algebra: SM vs IPM

Suppose an LP of dimension m × n is solved. Iterations to reach the optimum:

              Simplex Method      IPM
Theory        nonpolynomial       O(√n)
Practice      O(m + n)            O(log₁₀ n)

But ... one iteration of the simplex method is usually significantly less expensive.

The simplex method solves an equation with the basis matrix:

[ B  N       ] [ x_B ]   [ b ]
[ 0  I_{n−m} ] [ x_N ] = [ 0 ],

which reduces to

B x_B = b.

The IPM solves an equation with the matrix A Θ A^T:

(A Θ A^T) ∆y = g.

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

References:

Gondzio,
Implementing Cholesky Factorization for Interior Point Methods of Linear Programming, Optimization, 27 (1993) pp. 121–140.

Andersen, Gondzio, Meszaros and Xu,
Implementation of Interior Point Methods for Large Scale Linear Programming, in: Interior Point Methods in Mathematical Programming, T. Terlaky (ed.), Kluwer Academic Publishers, 1996, pp. 189–252.

Altman and Gondzio,
Regularized Symmetric Indefinite Systems in Interior Point Methods for Linear and Quadratic Optimization, Optimization Methods and Software, 11-12 (1999), pp. 275–302.

16


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 15:

Tree Sparse Problems

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Block-Angular LPs

Primal Block-Angular Structure

A = [ A_1                         ]
    [      A_2                    ]
    [           ...               ]
    [                A_n          ]
    [ B_1  B_2  ...  B_n  B_{n+1} ].

Dual Block-Angular Structure

A = [ A_1                 C_1 ]
    [      A_2            C_2 ]
    [           ...       ... ]
    [                A_n  C_n ].

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Primal block-angular structure

A = [ A_1                         ]
    [      A_2                    ]
    [           ...               ]
    [                A_n          ]
    [ B_1  B_2  ...  B_n  B_{n+1} ].

Normal-equations matrix

A A^T = [ A_1 A_1^T                                    A_1 B_1^T ]
        [            A_2 A_2^T                         A_2 B_2^T ]
        [                        ...                   ...       ]
        [                            A_n A_n^T         A_n B_n^T ]
        [ B_1 A_1^T  B_2 A_2^T  ...  B_n A_n^T         C         ],

where

C = Σ_{i=1}^{n+1} B_i B_i^T.

Implicit inverse:

A_i A_i^T = L_i L_i^T,
B̃_i = B_i A_i^T L_i^{-T},
S = C − Σ_{i=1}^n B̃_i B̃_i^T = L_S L_S^T.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Dual block-angular structure

A = [ A_1                 C_1 ]
    [      A_2            C_2 ]
    [           ...       ... ]
    [                A_n  C_n ].

Normal-equations matrix

A A^T = [ A_1 A_1^T                           ]
        [            A_2 A_2^T                ]   + C C^T,
        [                        ...          ]
        [                           A_n A_n^T ]

where C ∈ R^{m×k} defines a rank-k corrector.

Implicit inverse (Sherman-Morrison-Woodbury):

A_i A_i^T = L_i L_i^T,   i.e.  diag(A_1 A_1^T, ..., A_n A_n^T) = L L^T,
C̃_i = L_i^{-1} C_i,
S = I_k + Σ_{i=1}^n C̃_i^T C̃_i = L_S L_S^T,

(A A^T)^{-1} = (L L^T + C C^T)^{-1}
             = (L L^T)^{-1} − (L L^T)^{-1} C S^{-1} C^T (L L^T)^{-1}.

4
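A Python/NumPy sketch of this implicit inverse: solving (L L^T + C C^T) x = g by the Sherman-Morrison-Woodbury formula, using only solves with L L^T and a small k × k system (random data, for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    m, k = 8, 2
    L = np.tril(rng.standard_normal((m, m))) + 5 * np.eye(m)   # Cholesky factor of the block part
    C = rng.standard_normal((m, k))                             # rank-k corrector
    g = rng.standard_normal(m)

    def solve_LLt(rhs):
        return np.linalg.solve(L.T, np.linalg.solve(L, rhs))

    # (LL^T + CC^T)^{-1} g = (LL^T)^{-1} g - (LL^T)^{-1} C S^{-1} C^T (LL^T)^{-1} g,
    # with S = I_k + C^T (LL^T)^{-1} C
    w = solve_LLt(g)
    Z = solve_LLt(C)                        # (LL^T)^{-1} C, column by column
    S = np.eye(k) + C.T @ Z
    x = w - Z @ np.linalg.solve(S, C.T @ w)

    print(np.allclose((L @ L.T + C @ C.T) @ x, g))   # True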

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Stochastic LP with recourse

The two-stage stochastic program

min_{x∈X}  c^T x + E_ξ { q^T y(ξ) }
s.t. T(ξ) x + W y(ξ) = h(ξ),
     x ≥ 0, y(ξ) ≥ 0,  for all ξ ∈ Ξ.

Assume the random data has a joint finite discrete distribution {(ξ_k, p_k), k = 1..N} with Σ_k p_k = 1.

We then have a stochastic program with fixed recourse:

min_{x∈X}  c^T x + Σ_{k=1}^N p_k q_k^T y_k
s.t. T(ξ_k) x + W y_k = h(ξ_k),  k = 1..N,
     x ≥ 0, y_k ≥ 0,  k = 1..N.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Stochastic LPs (cont’d)

The deterministic equivalent formulation

min_{x∈X}  c^T x + p_1 q_1^T y_1 + p_2 q_2^T y_2 + ... + p_N q_N^T y_N
s.t. T_1 x + W y_1                     = h_1
     T_2 x         + W y_2             = h_2
     ...                      ...
     T_N x                   + W y_N   = h_N
     x ≥ 0, y_1 ≥ 0, y_2 ≥ 0, ..., y_N ≥ 0.

The constraint matrix has dual block-angular structure.

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Multi-stage Stoch. Programs

[Figure: an event tree with nodes 1–7 over Periods 1, 2 and 3, whose leaves correspond to Scenarios 1–4, together with the structured constraint matrix it induces.]

A symmetrical event tree with p realizations at each node and T + 1 periods corresponds to p^T scenarios.

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Real-life Stochastic Programs

[Figure: sparsity pattern of a multistage stochastic linear program.]
[Figure: the reordered matrix of the multistage SLP.]

8


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Exploiting Structure in IPMs

Interior Point Methods:

• are well-suited to large-scale optimization

• can take advantage of the parallelism

Large problems are “structured”:

• partial separability

• spatial distribution

• dynamics

• uncertainty

• etc.

Object-Oriented Parallel Solver (OOPS)

http://www.maths.ed.ac.uk/~gondzio/parallel/solver.html

• Exploits structure

• Runs in parallel

• Solves problems with millions of variables

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Data Structure

Linear Algebra Module:
• Given A, x and Θ, compute Ax, A^T x, A Θ A^T.
• Given A Θ A^T, compute the Cholesky factorization A Θ A^T = L L^T.
• Given L and r, solve Lx = r and L^T x = r.

Common choice: a single data structure to perform the general sparse operations.

How can we deal with block structures? Blocks may be
• general sparse matrices,
• dense matrices,
• very particular matrices (identity, projection, GUB),
• block-structured.

It is worth considering many data structures.

10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

MDO for Sparse Matrix

A 6 × 6 sparsity-pattern example (x = nonzero, p = pivot, f = fill-in); row 1 of H has four nonzeros, row 2 only two.

Pivot h_11: eliminating the first pivot creates several fill-in entries f in the rows/columns coupled through column 1.

Pivot h_22: after swapping rows/columns 1 and 2, the elimination of the (new) first pivot creates no fill-in.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

MDO for Block-Sparse Matrix

[Figure: the same idea applied block-wise to a block-sparse matrix H, contrasting the choice of pivot block H_11 with the choice of pivot block H_22.]

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Primal block-angular

[Figure: block sparsity patterns of Q, A and A^T for a primal block-angular problem, and of the corresponding augmented system matrix H.]

Reorder blocks: {1,3; 2,4; 5}.

[Figure: the reordered matrix P H P^T.]

13

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Dual block-angular

[Figure: block sparsity patterns of Q, A and A^T for a dual block-angular problem, and of the corresponding augmented system matrix H.]

Reorder blocks: {1,4; 2,5; 3}.

[Figure: the reordered matrix P H P^T.]

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Block Tree Structure

Structured Matrix:

[Figure: a nested block-structured matrix built from blocks Q_ij, diagonal blocks D_ij, border blocks B_ij and corrector blocks C_31, C_32.]

Associated Tree:

[Figure: the corresponding block elimination tree: the root exhibits dual block-angular structure; its children D_1 and D_2 exhibit primal block-angular structure, with the D_ij and B_ij blocks as leaves and C_31, C_32 as correctors.]

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Reordered Augmented Matrix

[Figure: the augmented system matrix corresponding to the block tree structure of the previous slide and its symmetrically reordered form, assembled from the D_ij, B_ij, C̃_3j and Q_ij blocks and their transposes.]

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

OOPS

• Every node in the block elimination tree has its own linear algebra implementation (depending on its type).
• Each implementation is a realisation of an abstract linear algebra interface.
• Different implementations are available for different structures.

[Figure: the Matrix Interface (operations Factorize, y = Mx, y = M^T x, SolveL, SolveL^T) and its implementations: RankCorrector (rank corrector implementation), PrimalBlockAng and DualBlockAng (implicit factorizations), BorderedBlockDiag (implicit factorization), SparseMatrix (general sparse linear algebra) and DenseMatrix (general dense linear algebra).]

⇒ Rebuild the block elimination tree with matrix interface structures.

17

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

References:

Gondzio and Sarkissian,
Parallel Interior Point Solver for Structured Linear Programs, Mathematical Programming 96 (2003) 561-584.

Gondzio and Grothey,
Reoptimization with the Primal-Dual Interior Point Method, SIAM J. on Optimization 13 (2003) 842-864.

Gondzio and Grothey,
Direct Solution of Linear Systems of Size 10^9 Arising in Optimization with Interior Point Methods, in: Parallel Processing and Applied Mathematics, R. Wyrzykowski et al. (eds), Lecture Notes in Computer Science 3911 (2006) pp. 513–525.

Gondzio and Grothey,
A New Unblocking Technique to Warmstart Interior Point Methods based on Sensitivity Analysis, SIAM J. on Optimization (accepted).

18


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 16:

Markowitz Portfolio

Optimization

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

The Markowitz Model

The material given in this lecture comes from the book of R. J. Vanderbei: Linear Programming: Foundations and Extensions, Kluwer, Boston, 1997.

Suppose a collection of potential investments I = {1, 2, ..., n} is given. Let R_i denote the return in the next time period on investment i ∈ I. R_i is a random variable.

A portfolio is a vector x ∈ R^n; x_i, i ∈ I, specifies the fraction of assets put into investment i. The portfolio satisfies Σ_{i∈I} x_i = 1 and has the return

R = Σ_{i∈I} x_i R_i,

which is uncertain. The reward associated with such a portfolio is the expected return:

E R = Σ_{i∈I} x_i E R_i.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

The Markowitz Model

There is a risk associated with this investment. It is measured with the variance of the return:

Var(R) = E (R − E R)²
       = E ( Σ_{i∈I} x_i R_i − E ( Σ_{i∈I} x_i R_i ) )²
       = E ( Σ_{i∈I} x_i R_i − Σ_{i∈I} x_i E R_i )²
       = E ( Σ_{i∈I} x_i (R_i − E R_i) )²
       = E ( Σ_{i∈I} x_i R̃_i )²,

where R̃_i = R_i − E R_i.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

The Markowitz Model

The investor has two major concerns:
• maximizing the reward E R; and
• minimizing the associated risk Var(R).

These objectives are usually contradictory. There is no chance to increase the reward without exposing yourself to a higher risk. Conversely, if you don't want to take more risk, you have to accept a lower reward.

In the Markowitz model (1959), these two objectives are combined and a quadratic optimization problem is solved:

min  −Σ_{i∈I} x_i E R_i + λ E ( Σ_{i∈I} x_i R̃_i )²
s.t. Σ_{i∈I} x_i = 1,
     x_i ≥ 0,

where λ is a positive parameter that represents the investor's attitude to risk. The higher the value of λ, the less risk the investor is ready to incur.

Harry Markowitz received the 1990 Nobel prize for this work.

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

The Markowitz Model (cont’d)

The quadratic program has an equivalent formulation

min  c^T x + λ x^T Q x
s.t. e^T x = 1,
     x ≥ 0,

where c ∈ R^n with c_i = −E R_i is the (negated) expected return of asset i, and Q ∈ R^{n×n} with Q_ij = E (R̃_i R̃_j) is the covariance of R_i and R_j.

Although this QP has only one linear constraint, its solution for larger n is not trivial because the matrix Q is often completely dense.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Estimates of IER and Var(R)

The Markowitz Model requires the knowledge of the joint distribution of the variables R_i. These distributions are not known but may be estimated using historical data. Suppose, for simplicity, that only four classes of assets are considered for possible investments: Long-term government bonds (LTB), S&P 500, NASDAQ Composite and Gold.

The historical data for the last 10 years for these assets may look like:

Year   LTB    S&P    NASDAQ  Gold
1990   1.084  0.961  0.824   0.925
1991   1.054  1.302  1.605   0.957
1992   1.072  1.056  1.065   1.064
1993   1.107  1.106  0.978   1.123
1994   1.047  1.024  0.952   0.992
1995   1.074  1.109  1.205   0.971
1996   1.036  1.063  1.453   1.063
1997   1.046  0.934  1.056   1.087
1998   1.051  0.928  1.105   1.073
1999   1.071  1.314  1.513   0.945

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Estimate of IER

A way to estimate E R_i is to take the (arithmetic) average of the historical returns:

E R_i = (1/T) Σ_{t=1}^T R_i(t).

However, it is a very poor estimate. If in two subsequent years the returns were 2.0 and 0.5 respectively, then this estimate would give E R_i = (2.0 + 0.5)/2 = 1.25, even though the investment in this particular asset made two years ago would have the return 2.0 · 0.5 = 1.

A better way is to use a geometric mean:

E R_i = ( Π_{t=1}^T R_i(t) )^{1/T}.

By taking logarithms of both sides, we obtain:

log E R_i = (1/T) Σ_{t=1}^T log R_i(t).

7
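A small Python sketch contrasting the two estimates on the 2.0/0.5 example and on one column of the table above (plain Python, standard library only):

    import math

    def arithmetic_mean(returns):
        return sum(returns) / len(returns)

    def geometric_mean(returns):
        return math.exp(sum(math.log(r) for r in returns) / len(returns))

    print(arithmetic_mean([2.0, 0.5]), geometric_mean([2.0, 0.5]))   # 1.25 vs 1.0

    # LTB returns 1990-1999 from the table on the previous slide
    ltb = [1.084, 1.054, 1.072, 1.107, 1.047, 1.074, 1.036, 1.046, 1.051, 1.071]
    print(arithmetic_mean(ltb), geometric_mean(ltb))                 # very close here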

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Estimate of IER (cont’d)

Additionally, one should pay more attention to the most recent data. Indeed, the 1999 return is more likely to repeat in 2000 than the 1990 one. Use a discounted sum of logarithms:

log E R_i = ( 1 / Σ_{t=1}^T s^{T−t} ) · Σ_{t=1}^T s^{T−t} log R_i(t),

where

s(t) = s^{T−t} / Σ_{t=1}^T s^{T−t},   t = 1, 2, ..., T,

is the discount weight (with discount factor s). Indeed, £1 today is worth more than £1 next year.

8


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Estimate of Var(R)

Recall that

Var(R) = E ( Σ_{i∈I} x_i R̃_i )²,   where R̃_i = R_i − E R_i.

Thus Var(R) is a quadratic function of x:

Var(R) = x^T Q x,   where Q_ij = E (R̃_i R̃_j).

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Estimate of Var(R) (cont’d)

Suppose a discounted sum is used to estimate E R_i, i.e.

E R_i = Σ_{t=1}^T s(t) R_i(t),
where s(t) = s^{T−t} / Σ_{t=1}^T s^{T−t},  t = 1, 2, ..., T.

Analogously, we use a discounted sum to estimate Var(R), i.e.

Var(R) = E ( Σ_{i∈I} x_i R̃_i )² = Σ_{t=1}^T s(t) ( Σ_{i∈I} x_i R̃_i(t) )²,

where

R̃_i(t) = R_i(t) − E R_i = R_i(t) − Σ_{t=1}^T s(t) R_i(t).

10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

How to represent Var(R) ?

There are two ways of handling the double summation in the definition of Var(R):

Var(R) = Σ_{t=1}^T s(t) ( Σ_{i∈I} x_i R̃_i(t) )².

Possibility 1.
Write:

Var(R) = x^T Q x,

where the matrix Q ∈ R^{n×n} has coefficients

Q_ij = Σ_{t=1}^T s(t) R̃_i(t) R̃_j(t),   i, j = 1, 2, ..., n.

This gives a dense matrix Q and a very difficult nonseparable QP problem to solve.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Another representation Var(R)

Possibility 2.

Write:

Var(R) = u^T D u,

where the diagonal matrix D ∈ R^{T×T} has coefficients:

Dtt = s(t),   t = 1, 2, ..., T,

and u ∈ R^T is defined as follows:

ut = ∑_{i∈I} xi R̄i(t),   t = 1, 2, ..., T.

We rewrite the last equation in the form

Fx = u,

where the matrix F ∈ R^{T×n} is built of coefficients

fti = R̄i(t),   t = 1, ..., T,  i = 1, ..., n.

This formulation leads to a separable QP that is easier to solve with an IPM.
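A minimal NumPy sketch of the two representations (random placeholder data, discount rate s = 0.9 assumed); it builds Q, D and F from the centered returns and checks that x^T Q x and u^T D u coincide for a test portfolio:

    import numpy as np

    # Placeholder data, not from the lecture: Rhist[t, i] = return of asset i in period t.
    rng = np.random.default_rng(0)
    T, n = 10, 4
    Rhist = 1.0 + 0.1 * rng.standard_normal((T, n))
    s = 0.9                                            # assumed discount rate

    w = s ** (T - np.arange(1, T + 1))                 # s^(T-t)
    w = w / w.sum()                                    # the weights s(t)

    ER = w @ Rhist                                     # discounted estimate of IE R_i
    Rbar = Rhist - ER                                  # centered returns R_i(t) - IE R_i

    Q = Rbar.T @ np.diag(w) @ Rbar                     # Possibility 1: dense n x n matrix
    D = np.diag(w)                                     # Possibility 2: diagonal T x T matrix
    F = Rbar                                           # ... and F with f_ti = centered return

    x = np.full(n, 1.0 / n)                            # a feasible portfolio, e^T x = 1
    u = F @ x
    print(x @ Q @ x, u @ D @ u)                        # both give the same Var(R)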

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

IPMs and the Markowitz Model

The original Markowitz quadratic program with

the matrix Q ∈ Rn×n

min   c^T x + (1/2) x^T Q x
s.t.  e^T x = 1,                    (1)
      x ≥ 0,

can thus be replaced by the following equivalent separable one

min   c^T x + (1/2) u^T D u
s.t.  e^T x = 1,                    (2)
      Fx − u = 0,
      x ≥ 0.

Suppose the number of assets n is much larger than the number of time periods T, i.e. T ≪ n.

13

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

IPMs/M. Model (details)

The primal barrier QP

min   c^T x + (1/2) u^T D u − µ ∑_{j=1}^{n} ln xj
s.t.  e^T x = 1,
      Fx − u = 0.

Lagrangian:

L(x, u, y, z, µ) = c^T x + (1/2) u^T D u − y(e^T x − 1) − z^T(Fx − u) − µ ∑_{j=1}^{n} ln xj,

where y ∈ R and z ∈ R^T are dual variables associated with the linear constraints e^T x = 1 and Fx − u = 0, respectively.

The conditions for a stationary point:

∇x L = c − ye − F^T z − µX^{-1}e = 0
∇u L = Du + z = 0
−∇y L = e^T x − 1 = 0
−∇z L = Fx − u = 0.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

IPMs/M. Model (details)

We denote

s = µX^{-1}e,   i.e.   XSe = µe,

and get the first order optimality conditions:

e^T x = 1,
Fx − u = 0,
F^T z + ey + s = c,
Du + z = 0,
XSe = µe.

We rewrite the FOC in matrix form:

[ e^T   0 ] [ x ]   [ 1 ]
[  F   −I ] [ u ] = [ 0 ],

[ e   F^T ] [ y ]   [ s ]   [ 0   0 ] [ x ]   [ c ]
[ 0   −I  ] [ z ] + [ 0 ] − [ 0   D ] [ u ] = [ 0 ],

XSe = µe.

Finally, we derive the Newton equation system and reduce it to the following form:

[ −Θ^{-1}    0     e    F^T ] [ ∆x ]   [ rx ]
[    0      −D     0    −I  ] [ ∆u ] = [ ru ]
[   e^T      0     0     0  ] [ ∆y ]   [ ry ]
[    F      −I     0     0  ] [ ∆z ]   [ rz ]

where Θ^{-1} = X^{-1}S.
15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

IPMs: separable form of MM

Although the problem (2) has n + T variables

(x, u) and T +1 linear constraints (while (1) had

n variables and only one constraint), for small T ,

(2) is usually much easier to solve than (1).

Indeed, in this case, the Newton equation system

can be reduced to the following normal equations

system

(A (D + Θ^{-1})^{-1} A^T) ∆y = g,

where

A = [ e^T   0 ]
    [  F   −I ]   ∈ R^{(T+1)×(n+T)},

D + Θ^{-1} = [ 0   0 ]   [ Θ^{-1}   0 ]
             [ 0   D ] + [   0      0 ]   ∈ R^{(n+T)×(n+T)}.

Observe that D + Θ−1 is a diagonal matrix.
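A minimal NumPy sketch of this reduction (F, D, Θ^{-1} and g are placeholder data, not quantities of an actual IPM iteration); it assembles A and the diagonal D + Θ^{-1} and solves the small (T+1)×(T+1) normal equations:

    import numpy as np

    rng = np.random.default_rng(1)
    T, n = 10, 500
    F = rng.standard_normal((T, n))
    D_diag = rng.uniform(0.01, 1.0, T)                 # diagonal of D (the weights s(t))
    theta_inv = rng.uniform(0.1, 10.0, n)              # diagonal of Theta^{-1} = X^{-1} S

    # A = [ e^T  0 ; F  -I ]  in R^{(T+1) x (n+T)}
    A = np.zeros((T + 1, n + T))
    A[0, :n] = 1.0
    A[1:, :n] = F
    A[1:, n:] = -np.eye(T)

    # (D + Theta^{-1}) is diagonal: Theta^{-1} in the x-part, D in the u-part.
    diag = np.concatenate([theta_inv, D_diag])
    M = (A / diag) @ A.T                               # A diag^{-1} A^T, small and dense

    g = rng.standard_normal(T + 1)
    dy = np.linalg.solve(M, g)                         # Newton step for (y, z)
    print(dy.shape)                                    # (T + 1,)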

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

IPMs: separable form of MM

Summing up, when (2) is solved, we deal with

the normal equations system of dimension T +1.

The augmented system corresponding to the non-

separable formulation (1) would be a completely

dense indefinite system of dimension n+1. Since

T ≪ n, the separable QP is much easier to solve.

17

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Solution of the M. Model

Optimal portfolios for different λ:

λ        LTB    S&P    NASDAQ  Gold
0.0      0.020  0.056  0.924   0.000
0.5      0.061  0.084  0.840   0.015
1.0      0.248  0.214  0.516   0.022
10.0     0.431  0.239  0.213   0.117
100.0    0.749  0.103  0.103   0.045
1000.0   0.890  0.040  0.015   0.055

Efficient frontier:

[Figure: the efficient frontier, plotting risk against return (return range about 1.05–1.14), with the optimal portfolios marked for λ = 0, 0.25, 0.5, 1, 2.5, 5, 10, 100 and 1000.]

18

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Solution of the M. Model (cnt’d)

The smaller the λ, the more risky investments

are present in the optimal portfolio.

The border of the green-shaded area is called the

efficient frontier. For any portfolio inside the

shaded area one can find a better portfolio that

either:

• offers the same reward with less risk; or

• gives a larger reward with the same risk.

In real-life problems one considers many possible

assets (more than 4 classes in our example). The

resulting quadratic programming problem then

becomes much larger.

19


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 17:

Financial Planning Problems

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Modeling Multistage SLPs

Stochastic Programming Models can be ap-

plied to multistage decision problems under un-

certainty satisfying the following two conditions:

• The underlying stochastic process is discrete

and independent of the decisions;

• The dynamics is represented by an equation

zt+1 = Gt(zt, xt, ξt),

where zt ∈ Zt and xt ∈ Xt are the state and

the decision (control) of the dynamic system

at stage t, respectively, and ξt is the realiza-

tion of the stochastic process.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Modeling Multistage SLPs

With the Stochastic Program we associate an

event tree. Decision variables are indexed with

the nodes of the tree. The transition equation

relates variables associated with a given node

and its predecessor in the event tree. With every

arc from a node to its successor we associate a

unique realization of the random variable.

[Figures: a symmetric event tree and an unsymmetric event tree.]

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Event Tree

Every node of the event tree is indexed by the pair (t, n), where t ∈ {1, .., T} and n ∈ {1, .., N(t)}, T is the planning horizon (number of stages) and N(t) is the number of nodes in stage t.

The immediate ancestor (predecessor) of node (t, n) is (t − 1, a(t, n)), where a(t, n) is a mapping from the set of nodes onto itself. The probabilistic event that led from (t−1, a(t, n)) to its son (t, n) is identified by a mapping denoted by s(t, n).

The probability π(t, n) that the stochastic pro-

cess visits node (t, n) is given by the recurrence

formula

π(t, n) = P(s(t, n) | (t−1, a(t, n))) · π(t−1, a(t, n)),

where P(s(t, n) | (t−1, a(t, n))) is the conditional

probability of the transition from (t−1, a(t, n)) to

(t, n).

The variables (state and control) are indexed by the node. The dynamics is defined at each node (t, n) by

z(t, n) = G_{(t,n)}( z(t−1, a(t, n)), x(t, n), ξ(t−1, s(t, n)) ).

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Description of Event Tree

The applications a(t, n) and s(t, n) enable the

mathematical programming formulation of many

multistage decision problems under uncertainty.

[Figure: a three-stage event tree with root (1,1), second-stage nodes (2,1)–(2,3) and third-stage nodes (3,1)–(3,6); N(1) = 1, N(2) = 3, N(3) = 6, Z(1) = 3, Z(2) = 2.]

Assume that:

The number of possible events at node (t, n) is the same for all nodes at stage t and equal to Z(t). The conditional probabilities P(s(t, n) | (t − 1, a(t, n))) depend on time (stage) t but not on the particular node in this stage. Consequently, for any t, we have N(t) = Z(t − 1) · N(t − 1).

Now a(t, n) and s(t, n) are computed recursively:

a(t, n) = ⌈ n / Z(t−1) ⌉
s(t, n) = n − (a(t, n) − 1) · Z(t − 1)
5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Event Tree: Example

Consider the following event tree

[Figure: the three-stage event tree again, highlighting node (2,2) and its two sons (3,3) and (3,4).]

Consider the nodes of stage 3. Their ancestors

belong to stage 2 at which Z(3− 1) = 2, hence

a(3,3) = ⌈ 3 / Z(3−1) ⌉ = 2;
a(3,4) = ⌈ 4 / Z(3−1) ⌉ = 2.

Nodes (3,3) and (3,4) are sons of node (2,2). The indices of the sons are the following:

s(3,3) = 3 − (a(3,3) − 1) · Z(3 − 1) = 1;
s(3,4) = 4 − (a(3,4) − 1) · Z(3 − 1) = 2.
6
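A short Python sketch of the two formulas for the regular tree above (Z(1) = 3, Z(2) = 2), reproducing the example:

    import math

    Z = {1: 3, 2: 2}                      # branching factors Z(t)

    def ancestor(t, n):                   # a(t, n)
        return math.ceil(n / Z[t - 1])

    def son_index(t, n):                  # s(t, n)
        return n - (ancestor(t, n) - 1) * Z[t - 1]

    print(ancestor(3, 3), son_index(3, 3))   # 2 1
    print(ancestor(3, 4), son_index(3, 4))   # 2 2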

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Node Labels in the Event Tree

There are two “natural” node labeling rules:

- breadth-first search; and

- depth-first search.

Breadth-first search

[Figure: breadth-first labels on the three-stage event tree: (1,1) → 1, (2,1) → 2, (2,2) → 3, (2,3) → 4, and (3,1), ..., (3,6) → 5, ..., 10.]

Recursively defined mapping:

a node (t, n) has the following label

fb(t, n) = fb(t − 1, N(t − 1)) + n,

with fb(1,1) = 1.
7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Labels in the Event Tree (cnt’d)

Depth-first search

[Figure: depth-first labels on the three-stage event tree: (1,1) → 1, (2,1) → 2, (3,1) → 3, (3,2) → 4, (2,2) → 5, (3,3) → 6, (3,4) → 7, (2,3) → 8, (3,5) → 9, (3,6) → 10; h(3) = 1, h(2) = 3, h(1) = 10.]

Define a function h(t) that gives the number of

nodes of any subtree whose root node is at stage t.

Use the backward recursive formula

h(t − 1) = h(t) · Z(t − 1) + 1,

with h(T) = 1.

We get h(3) = 1, h(2) = 3 and h(1) = 10.

Recursively defined mapping:

a node (t, n) has the following label

fd(t, n) = fd(t−1, a(t, n)) + h(t) · (s(t, n)−1) + 1,

with fd(1,1) = 1.
8
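A short Python sketch of both labeling rules for the same example tree (Z(1) = 3, Z(2) = 2, T = 3); the printed labels match the two figures:

    import math

    Z = {1: 3, 2: 2}
    N = {1: 1, 2: 3, 3: 6}
    T = 3

    def a(t, n):                      # ancestor
        return math.ceil(n / Z[t - 1])

    def s(t, n):                      # son index
        return n - (a(t, n) - 1) * Z[t - 1]

    def fb(t, n):                     # breadth-first label
        return 1 if t == 1 else fb(t - 1, N[t - 1]) + n

    h = {T: 1}                        # h(t): number of nodes of a subtree rooted at stage t
    for t in range(T, 1, -1):
        h[t - 1] = h[t] * Z[t - 1] + 1

    def fd(t, n):                     # depth-first label
        return 1 if t == 1 else fd(t - 1, a(t, n)) + h[t] * (s(t, n) - 1) + 1

    print([fb(3, n) for n in range(1, 7)])   # [5, 6, 7, 8, 9, 10]
    print([fd(3, n) for n in range(1, 7)])   # [3, 4, 6, 7, 9, 10]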


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Financial Planning Problem

Consider a multi-period financial planning prob-

lem. At every stage t = 0, ..., T −1 we can buy

or sell different assets from the set J = {1, ..., J}

(e.g. bonds, stock, real estate), we can lend

the money to other parties or borrow it from the

bank.

The return of asset j at stage t is uncertain.

We have an initial sum S0 to invest and we want

to maximize the expected final wealth ST

(or to maximize its expected utility U(ST)).

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Asset Liability Management

Asset Liability Modeling

Suppose that at every stage t, we contribute a

certain amount of cash Ct to the portfolio and

we pay a certain liability Lt.

Such a financial planning problem is called the

asset liability management problem.

This problem is of crucial importance to

pension funds and insurance companies.

Note a dynamic aspect of decisions to be taken:

the portfolio is to be re-balanced at every stage.

Note a stochastic aspect of the problem:

the returns of assets are uncertain.

We model this problem using an event tree and decision variables associated with its nodes (t, n). We assume that the mappings a(t, n) and s(t, n), which define the ancestor and the son of node (t, n), respectively, are known.

[Figure: a fragment of the event tree showing node (t, n), its ancestor (t−1, a(t, n)) and the realization on the connecting arc.]
10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Asset Liability Management

With asset j ∈ J at node (t, n) we associate:

x_{j,t,n}    the position in asset j in node (t, n);
xb_{j,t,n}   the amount of asset j bought in (t, n);
xs_{j,t,n}   the amount of asset j sold in (t, n).

For any t : 1 ≤ t ≤ T, we write the inventory equation for asset j at node (t, n)

x_{j,t,n} = (1 + r_{j,t,n}) · x_{j,t−1,a(t,n)} + xb_{j,t,n} − xs_{j,t,n},

where r_{j,t,n} is the return of asset j corresponding to moving from node (t−1, a(t, n)) to node (t, n) in the event tree.

Initial inventory equation for asset j:

x_{j,0,1} = x_j^{initial} + xb_{j,0,1} − xs_{j,0,1}.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Cash Balance

Let Pj be the initial price of asset j. We assume that the transaction costs are proportional to the value of the asset bought or sold. To buy xb_{j,t,n} of asset j at stage t we have to pay

(1 + γb) · Pj · xb_{j,t,n}.

Analogously, for selling xs_{j,t,n} of asset j at stage t we get

(1 − γs) · Pj · xs_{j,t,n}.

Let m_{t,n}, mb_{t,n} and ml_{t,n} be the cash held, borrowed and lent at time t at node n, respectively. The borrowing and lending have return rates rb_{t,n} and rl_{t,n}, respectively. For example, for the money ml_{t−1,a(t,n)} lent at stage t − 1, we receive back (1 + rl_{t,n}) · ml_{t−1,a(t,n)} at stage t.

The cash balance equation states that the cash inflow is equal to the cash outflow:

C_{t,n} + mb_{t,n} + (1 + rl_{t,n}) ml_{t−1,a(t,n)} + ∑_{j=1}^{J} (1 − γs) Pj xs_{j,t,n}
  = L_{t,n} + ml_{t,n} + (1 + rb_{t,n}) mb_{t−1,a(t,n)} + ∑_{j=1}^{J} (1 + γb) Pj xb_{j,t,n}

at any node (t, n), t = 1, ..., T, n = 1, ..., N(t).
12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Special Cash Balance Equations

Initial cash balance

S0 + C_{0,1} + mb_{0,1} + ∑_{j=1}^{J} (1 − γs) Pj xs_{j,0,1}
  = L_{0,1} + ml_{0,1} + ∑_{j=1}^{J} (1 + γb) Pj xb_{j,0,1}

i.e., we assume that there was an initial portfolio of assets, so we can also sell at stage t = 0.

Final cash balance

C_{T,n} + mb_{T,n} + (1 + rl_{T,n}) ml_{T−1,a(T,n)} + ∑_{j=1}^{J} (1 − γs) Pj xs_{j,T,n}
  = ST + L_{T,n} + ml_{T,n} + (1 + rb_{T,n}) mb_{T−1,a(T,n)} + ∑_{j=1}^{J} (1 + γb) Pj xb_{j,T,n}

i.e., we assume that the final portfolio can include assets, so we can also buy at stage T.

These two constraints may be used, for example, in an asset liability management problem for a pension fund or an insurance company.

(Indeed, life does not end at time T .)

13

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Additional Constraints

Policy restrictions for asset mix:

wlo_j · ∑_{i=1}^{J} x_{i,t,n} ≤ x_{j,t,n} ≤ wup_j · ∑_{i=1}^{J} x_{i,t,n},   ∀j, t, n.

Also the contributions are bounded:

Clo_{t,n} ≤ C_{t,n} ≤ Cup_{t,n},   ∀t, n.

The total asset value at the end of period t,

A_{t,n} = ∑_{j=1}^{J} (1 + r_{j,t,n}) Pj x_{j,t−1,a(t,n)} + (1 + rl_{t,n}) ml_{t−1,a(t,n)} − (1 + rb_{t,n}) mb_{t−1,a(t,n)},

should not drop below the minimum funding ratio Fmin applied to the liabilities of this period. To get more flexibility in modeling (to ensure complete recourse), we allow a deficit Z_{t,n}:

A_{t,n} ≥ Fmin · L_{t,n} − Z_{t,n},   ∀t, n,

which we shall penalize in the objective.

The final value of assets at the end of the planning horizon should cover the final liabilities and ensure a final wealth of at least ST, hence

A_{T,n} ≥ Fend · L_{T,n} + S_{T,n},   ∀n.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Objective

In a simple financial planning problem we usually

maximize the expected value of the final portfolio

converted into cash.

In an asset liability management problem we may be more flexible:

• we accept (small) deficits, Zt,n;

• we can increase the contributions, Ct,n;

• we can borrow cash, mbt,n, etc.

Suppose we:

• penalize for deficits; and

• minimize contributions.

Hence we get the following objective:

min  ∑_{t=0}^{T−1} ∑_{n=1}^{N(t)} π_{t,n} C_{t,n} + λ ∑_{t=1}^{T} ∑_{n=1}^{N(t)} π_{t,n} Z_{t,n} / L_{t,n}.

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Fin. Planning Prob: Example

Consider a two-year planning problem with the

initial portfolio in cash S0=1000. The portfolio

is re-balanced at the beginning of each year. Any

transaction incurs a proportional cost of 2%.

There are two assets to choose from. Their val-

ues at t = 0 are 50 and 80, respectively.

In year 1, assets A and B have two possible re-

turns: (-6%,-4%) and (+12%,+8%) with prob-

abilities 0.2 and 0.8, respectively. In year 2, these

returns are (-8%,-6%) and (+12%,+10%) with

probabilities 0.4 and 0.6, respectively.

At the end of year 2 (or at the beginning of year

3) all assets are sold and converted into cash.

The objective of the manager is to maximize the

expected value of the terminal wealth in cash.

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Example cont’d: Event Tree

We use an event tree with two possible events in year 1, and 4 events in year 2. (t, n) denotes node n in year t and π_{t,n} is the probability of reaching node (t, n).

[Figure: the event tree of the example. Root (0,1) branches to (1,1) with π_{1,1} = 0.2 (returns −6%, −4%) and (1,2) with π_{1,2} = 0.8 (returns +12%, +8%). Node (1,1) branches to (2,1) with π_{2,1} = 0.08 (returns −8%, −6%) and (2,2) with π_{2,2} = 0.12 (returns +12%, +10%); node (1,2) branches to (2,3) with π_{2,3} = 0.32 (returns −8%, −6%) and (2,4) with π_{2,4} = 0.48 (returns +12%, +10%).]

With every node (t, n) we associate 6 variables. Let x_{A,t,n}, xb_{A,t,n} and xs_{A,t,n} be the position, the amount purchased and the amount sold of asset A, respectively. The variables x_{B,t,n}, xb_{B,t,n} and xs_{B,t,n} correspond to asset B.

Overall, we have 7 × 6 = 42 variables. However, some of them must be zero:

xs_{A,0,1} = xs_{B,0,1} = 0.

17

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Example cont’d: Equations

The initial inventory equations:

x_{A,0,1} = xb_{A,0,1}   and   x_{B,0,1} = xb_{B,0,1},

and

50(1 + γ) · xb_{A,0,1} + 80(1 + γ) · xb_{B,0,1} = S0.

The inventory equations in year 1:

x_{A,1,1} = 0.94 · x_{A,0,1} + xb_{A,1,1} − xs_{A,1,1},
x_{B,1,1} = 0.96 · x_{B,0,1} + xb_{B,1,1} − xs_{B,1,1},
x_{A,1,2} = 1.12 · x_{A,0,1} + xb_{A,1,2} − xs_{A,1,2},
x_{B,1,2} = 1.08 · x_{B,0,1} + xb_{B,1,2} − xs_{B,1,2}.

The inventory equations in year 2:

x_{A,2,1} = 0.92 · x_{A,1,1} + xb_{A,2,1} − xs_{A,2,1},
x_{B,2,1} = 0.94 · x_{B,1,1} + xb_{B,2,1} − xs_{B,2,1},
etc.

Cash balance at every node:

50(1−γ) xs_{A,t,n} + 80(1−γ) xs_{B,t,n} = 50(1+γ) xb_{A,t,n} + 80(1+γ) xb_{B,t,n}

for t = 1, 2, and n = 1, ..., N(t).
18

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Example cont’d: Objective

At the beginning of year 3 (or at the end of year

2), we sell all assets to convert our portfolio into

cash.

The objective is to maximize:

50(1−γ) · ∑_{n=1}^{4} π_{2,n} · x_{A,2,n} + 80(1−γ) · ∑_{n=1}^{4} π_{2,n} · x_{B,2,n}.

All variables must satisfy

x_{j,t,n} ≥ 0,   xb_{j,t,n} ≥ 0,   xs_{j,t,n} ≥ 0,

for j ∈ {A, B}, t = 0, 1, 2, and n = 1, ..., N(t).

Overall, we have:
42 − 2 = 40 non-negative decision variables; and
3 + 4 + 8 + 6 = 21 constraints.

19

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

ALM: Extensions

Introduce two more (nonnegative) variables per final node (scenario) i ∈ Lt to model the positive and negative variation from the mean:

(1 − ct) ∑_{j=1}^{J} vj xh_{i,j} + s+_i − s−_i = y.

Since s+_i and s−_i cannot both be positive, the variance is expressed as

Var(X) = ∑_{i∈Lt} pi (s+_i − s−_i)² = ∑_{i∈Lt} pi ((s+_i)² + (s−_i)²).

We model downside risk using a semi-variance

IE[(X − IE X)²_−] = ∑_{i∈Lt} pi (s+_i)².

Downside risk can be taken into account

• in the objective, or

• as a constraint.

20

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

ALM: Extensions

Concave utility function (risk-aversion).

Suppose an initial sum of b = 10000 is to be invested

and a liability of L = 11000 is to be paid at the end of

the planning period. If the final wealth, W exceeds the

liability L, a return of 4% per year from the excess W − L

is ensured. Otherwise, if W is lower than the liability L,

the deficit L − W will have to be covered by a loan that

costs 10% per year.

[Figure: piecewise-linear utility of final wealth, with slope 4% on the excess above 11,000 and slope 10% on the deficit below 11,000.]

The investor’s utility is proportional to the inter-

est payment resulting from the excess or deficit.

u(w) =

{

0.04 · (w − 11000) if w ≥ 11000,

0.10 · (w − 11000) if w < 11000.

log(w − l) is a commonly used concave utility

function.
21

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

ALM: Extensions

xi = (x_{i,1}, ..., x_{i,J}) denotes the portfolio in node i.

Standard Markowitz formulation:

max   y − σ ∑_{i∈Lt} pi (d_i^T x_i − y)²
s.t.  (C1)  ∑_{i∈Lt} pi d_i^T x_i − y = 0
      (C2)  B x_{a(i)} − A x_i = 0,   i ≠ 0
      (C3)  A x_0 = b

Downside risk constrained:

max   y
s.t.  ∑_{i∈Lt} pi (s+_i)² ≤ ρ
      d_i^T x_i + s+_i − s−_i − y = 0,   i ∈ Lt
      (C1) − (C3)

22

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

ALM: Extensions

Nonlinear utility function:

max   ∑_{i∈Lt} pi log (d_i^T x_i)
s.t.  ∑_{i∈Lt} pi (s+_i)² ≤ ρ
      d_i^T x_i + s+_i − s−_i − y = 0,   i ∈ Lt
      (C1) − (C3)

Skewness formulation:

max   y + γ ∑_{i∈Lt} pi (s+_i − s−_i)³
s.t.  ∑_{i∈Lt} pi (s+_i)² ≤ ρ
      d_i^T x_i + s+_i − s−_i − y = 0,   i ∈ Lt
      (C1) − (C3)

23

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

References:

Gondzio and Kouwenberg

High performance computing for asset liability

management, Operations Research, 49 (2001)

pp. 879–891.

Gondzio and Grothey

Parallel interior point solver for structured quad-

ratic programs: Application to financial planning

problems, Annals of Operations Research 152

(2007) pp. 319–339.

Gondzio and Grothey

Solving nonlinear portfolio optimization problems

with the primal-dual interior point method,

European Journal of Operational Research 181

(2007) pp. 1019–1029.

24


IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Classification

We consider a set of points X = {x1, x2, . . . , xn}, xi ∈ Rℓ to be classified into two subsets of “good” and “bad” ones: X = G ∪ B and G ∩ B = ∅.
We look for a function f : X → R such that f(x) ≥ 0 if x ∈ G and f(x) < 0 if x ∈ B.

Linear Classification

We consider a case when f is a linear function:

f(x) = wTx + b,

where w ∈ Rℓ and b ∈ R.

In other words we look for a hyperplane which separates “good” points from “bad” ones.

In such a case the decision rule is given by d = sgn(f(x)).

If f(xi) ≥ 0, then di = +1 and xi ∈ G. If f(xi) < 0, then di = −1 and xi ∈ B.

We say that there is a linearly separable training sample

S = ((x1, d1), (x2, d2), . . . , (xn, dn)).

3

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

How does it work?

Given a linearly separable database (training sample)

S = ((x1, d1), (x2, d2), . . . , (xn, dn))

find a separating hyperplane

wTx + b = 0,

which satisfies

di(wTxi + b) ≥ 1, ∀i = 1, 2, . . . , n.

Given a new (unclassified) point z, compute

dz = sgn(wTz + b)

to decide whether z is “good” or “bad”.

4

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Lecture 18:

Machine Learning and Data Mining

with Support Vector Machines

• V.N. Vapnik: Statistical Learning Theory, John Wiley & Sons, New York, 1998.

• N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines and Other

Kernel Based Learning Methods, Cambridge University Press, 2000.

• J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor and J. Vandevalle, Least

Squares Support Vector Machines, World Scientific, 2002.

• M.C. Ferris, T.S. Munson: Interior-Point Methods for Massive Support Vector Machines,

SIAM Journal on Optimization 13 (2003), pp 783–804.

• J. Ma, J. Theiler, S. Perkins: Accurate Online Support Vector Regression, Neural Compu-

tation, 15 (2003), pp 2683-2703. (Tech Report, Los Alamos National Lab, Los Alamos, NM

87545, USA.)

1

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Support Vector Machines (SVMs) are a standard

method used to separate points belonging to two (or

more) sets in n-dimensional space by a linear or

nonlinear surface (Vapnik, Cristianini and Shawe-

Taylor, Suykens et al.).

[Figure: a cloud of “good” (G) and “bad” (B) points in the plane, separated by a straight line.]

SVMs have numerous applications:

• Finance: detecting fraud transactions with credit cards,

• Finance: credit scoring,

• CS: pattern recognition, image recognition, hand-written digit recognition,

• Bioinformatics: finding the functions of particular genes,

• Medicine: diagnosing patients,

• Telecommunications: detecting customers who will switch to another operator,

• and many others ...

2


IPMs for LP, QP, NLP J. Gondzio, Turin 2008

QP Formulation

Finding a separating hyperplane can be formulated as a quadratic programming problem:

min  (1/2) w^T w

s.t. di(wTxi + b) ≥ 1, ∀i = 1, 2, . . . , n.

In this formulation the Euclidean norm of w is minimized.

This is clearly a convex optimization problem.

(We can minimize ‖w‖1 or ‖w‖∞ and then the problem can be reformulated as an LP.)

Two major difficulties:

• Clusters may not be separable at all

−→ minimize the error of misclassifications;

• Clusters may be separable by a nonlinear manifold

−→ find the right feature map.

7

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Difficult Cases

Nonseparable clusters:

[Figure: two clusters of good (G) and bad (B) points that cannot be perfectly separated; two misclassified points lie at distances ξ1 and ξ2 from the separating hyperplane.]
Errors when defining clusters of good and bad points.

Minimize the global error of misclassifications: ξ1 +ξ2.

Use nonlinear feature map:

B

G

G

G

G BBB

B

BB

B BBBB

B

G

GG

G

G

G

B

GG

G

B

GG B

G

BBB

B

B BBBB

B

G

GG

GG

G G

G G

GG

GG

GG

G

G

BB

B

B

G GG

Φ

8

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Example

Consider a database of mortgages in a bank.

For each mortgage (customer) i = 1, 2, . . . , n we know xi = (ai, li, pi, si, ci,mi) ∈ R6, where:

ai is the age of the customer,

li is the number of loans this customer has,

pi is the post code of the customer’s address,

si is the salary of the customer,

ci is the type of customer’s contract (part-time, temporary, open-ended, etc),

mi is the outstanding balance on the mortgage.

For each of these mortgages we also know di which indicates if the mortgage is “good” or “bad”.

We find a hyperplane

wTx + b = 0,

where w ∈ R6 and b ∈ R which separates “good” and “bad” mortgages.

Suppose a new application for a mortgage has been made.

We collect information z = (az, lz, pz, sz, cz,mz) ∈ R6 for the applicant and compute

dz = sgn(wTz + b)

to decide whether the mortgage should be given or not.

5

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Separating Hyperplane

To guarantee a nonzero margin of separation we look for a hyperplane

wTx + b = 0,

such that

wTxi + b ≥ 1 for “good” points;

wTxi + b ≤ −1 for “bad” points.

This is equivalent to:

w^T xi / ‖w‖ + b / ‖w‖ ≥ 1 / ‖w‖   for “good” points;
w^T xi / ‖w‖ + b / ‖w‖ ≤ −1 / ‖w‖   for “bad” points.

In this formulation the normal vector of the separating hyperplane, w/‖w‖, has unit length.
In this case the margin between “good” and “bad” points is measured by 2/‖w‖.
We would like this margin to be maximised.
This can be achieved by minimising the norm ‖w‖.

6


IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Dual Quadratic Problem (continued)

Observe that the dual problem has a neat formulation in which only dual variables y are present.

(The primal variables (w, b, ξ) do not appear in the dual.)

Define a matrix Q ∈ Rn×n such that qij = didj(xTi xj) .

Rewrite the (dual) quadratic program:

max   e^T y − (1/2) y^T Q y,
s.t.  d^T y = 0,
      0 ≤ y ≤ λe,

where e is the vector of ones in Rn.

The matrix Q corresponds to a specific linear kernel function.

11

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Dual Quadratic Problem (continued)

The primal problem is convex, hence the dual problem must be well stated too.
The dual problem is to maximise a concave function. We can prove this directly.

Lemma

The matrix Q is positive semidefinite.

Proof:

Define

Z = [d1x1|d2x2| . . . |dnxn]T ∈ Rn×ℓ

and observe that

Q = ZZT (i.e., qij = didj(xTi xj)).

For any u ∈ Rn we have

uTQu = (uTZ)(ZTu) = ‖ZTu‖2 ≥ 0

hence Q is positive semidefinite.
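A small NumPy sketch of this construction on random placeholder data; it forms Z and Q = Z Z^T and checks numerically that the smallest eigenvalue is (up to rounding) nonnegative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, ell = 8, 3
    X = rng.standard_normal((n, ell))          # rows are the points x_i (placeholder data)
    d = rng.choice([-1.0, 1.0], size=n)        # labels d_i

    Z = d[:, None] * X                         # row i is d_i * x_i
    Q = Z @ Z.T                                # q_ij = d_i d_j (x_i^T x_j)

    eigvals = np.linalg.eigvalsh(Q)
    print(eigvals.min() >= -1e-12)             # numerically positive semidefinite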

12

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Linearly nonseparable case

If perfect linear separation is impossible, then for each misclassified data point we introduce a slack variable ξi which measures the distance between the hyperplane and the misclassified point.

Finding the best hyperplane can be formulated as a quadratic programming problem:

min  (1/2) w^T w + λ ∑_{i=1}^{n} ξi

s.t. di(wTxi + b) + ξi ≥ 1, ∀i = 1, 2, . . . , n,

ξi ≥ 0 ∀i = 1, 2, . . . , n,

where λ (λ > 0) controls the penalisation for misclassifications.

We will derive the dual quadratic problem.

We associate Lagrange multipliers y ∈ Rn (y≥0) and s ∈ Rn (s≥0)

with the constraints di(wTxi + b) + ξi ≥ 1 and ξ ≥ 0, and write the Lagrangian

L(w, b, ξ, y, s) = (1/2) w^T w + λ ∑_{i=1}^{n} ξi − ∑_{i=1}^{n} yi (di(w^T xi + b) + ξi − 1) − s^T ξ.

9

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Dual Quadratic Problem

Stationarity conditions (with respect to all primal variables):

∇w L(w, b, ξ, y, s) = w − ∑_{i=1}^{n} yi di xi = 0
∇ξi L(w, b, ξ, y, s) = λ − yi − si = 0
∇b L(w, b, ξ, y, s) = ∑_{i=1}^{n} yi di = 0.

Substituting these equations into the Lagrangian function we get

L(w, b, ξ, y, s) = ∑_{i=1}^{n} yi − (1/2) ∑_{i,j=1}^{n} di dj yi yj (xi^T xj).

Hence the dual problem has the form:

max   ∑_{i=1}^{n} yi − (1/2) ∑_{i,j=1}^{n} di dj yi yj (xi^T xj)
s.t.  ∑_{i=1}^{n} di yi = 0,
      0 ≤ yi ≤ λ,   ∀i = 1, 2, . . . , n.

10


IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Support Vector Machines

Two step procedure:

Use nonlinear mapping to transform data into a feature space F ;

Use linear machine to classify objects in the feature space.

Requirement: Only the inner products of data with the new point can be used.

Kernel Function

A kernel is a function K, such that for all x, z ∈ X

K(x, z) = 〈φ(x), φ(z)〉,

where φ is a mapping from X to an (inner product) feature space F .

We use 〈., .〉 to denote a scalar product.

Linear Kernel K(x, z) = xTz.

Polynomial Kernel K(x, z) = (xTz + 1)d.

Gaussian Kernel K(x, z) = exp( −‖x − z‖² / σ² ).
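A short Python sketch of the three kernels (σ and the polynomial degree are free parameters chosen here for illustration):

    import numpy as np

    def linear_kernel(x, z):
        return float(np.dot(x, z))

    def polynomial_kernel(x, z, degree=3):
        return float((np.dot(x, z) + 1.0) ** degree)

    def gaussian_kernel(x, z, sigma=1.0):
        diff = np.asarray(x) - np.asarray(z)
        return float(np.exp(-np.dot(diff, diff) / sigma ** 2))

    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))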

15

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Question

Consider a particular case:

Database of mortgages for a big lender:

n = 1000000 mortgages;

ℓ = 6 (or, more realistically, ℓ = 20).

Which problem is easier to solve:

the Primal (page 9) or the Dual (page 11) ?

16

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Definitions

The original quantities are called attributes.

The quantities introduced to describe the data are called features.

The task of choosing a suitable representation is called feature selection.

X ⊂ Rℓ is the input space.

F ⊂ Rk is the feature space.

The function φ : X 7→ F is called feature map.

We would like to transform input space into feature space

and see a clear separation (preferably linear) in the feature space.

13

IPMs for LP, QP, NLP J. Gondzio, Turin 2008

Example: Nonlinear Feature Map

[Figure: good and bad points separable only by an ellipse-shaped boundary with axes 2p and 2q; after the feature map Φ the two groups become linearly separable.]

Use polar coordinates:

Define φ : R2 7→ R2 such that (xi1, xi2) 7→ (ri, θi),

where ri = ( (xi1 − a)²/p² + (xi2 − b)²/q² )^{1/2} and θi = tan⁻¹( (xi2 − b) p / ((xi1 − a) q) ).

Attributes: (xi1, xi2)

Features: (ri, θi)

Feature map: φ

14


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 19:

Nonlinear Programming

Linesearch Methods

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Unconstrained Nonlin. Opt.

Consider an unconstrained nonlinear optimiza-

tion problem

minx

f(x)

where x ∈ Rn, and f : Rn 7→ R is a twice differ-

entiable but not necessarily convex function.

Lemma 1

First Order Necessary Conditions.

If x is a local minimizer and f is continuously dif-

ferentiable in an open neighbourhood of x, then

∇f(x) = 0.

Proof (by contradiction)

Suppose ∇f(x) ≠ 0. Define p = −∇f(x) and note that ∇f(x)^T p = −‖∇f(x)‖² < 0. Since ∇f is continuous near x, there exists a scalar a > 0 such that ∇f(x + αp)^T p < 0 for all α ∈ [0, a]. Take any α ∈ (0, a]. From Taylor’s theorem we

get

f(x + αp) = f(x) + α∇f(x + βp)T p,

for some β ∈ (0, α]. Thus f(x + αp) < f(x) for

all α ∈ (0, a]. Hence x is not a local minimizer.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Optimality Conditions

Lemma 2

Second Order Necessary Conditions.

If x is a local minimizer and ∇2f is continuous

in an open neighbourhood of x, then ∇f(x) = 0

and ∇2f(x) is positive semidefinite.

Proof

From Lemma 1, ∇f(x) = 0. Suppose ∇2f(x) is

not positive semidefinite. Then we can choose

a vector p such that pT∇2f(x)p < 0, and from

continuity of ∇2f around x, there exists a scalar

a such that pT∇2f(x+αp)p < 0 for all α ∈ (0, a].

From the second-order Taylor expansion around

x, we then have for any α ∈ (0, a]

f(x + αp) = f(x) + α∇f(x)^T p + (1/2) α² p^T ∇²f(x + βp) p

for some β ∈ (0, α). Since ∇f(x) = 0 and

pT∇2f(x + βp)p < 0, then f(x + αp) < f(x).

Hence x is not a local minimizer.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Optimality Conditions (cont’d)

Lemma 3

Second Order Sufficient Conditions.

Suppose ∇²f is continuous in an open neighbourhood of x, ∇²f(x) is positive definite and ∇f(x) = 0. Then x is a strict local minimizer of f.

Proof

Since ∇²f is continuous and positive definite at x, it remains positive definite in some neighbourhood of x. Let B = {z : ‖z − x‖ < r} be such a neighbourhood. For any nonzero vector p such that ‖p‖ < r, we have x + p ∈ B and so

f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(z) p
         = f(x) + (1/2) p^T ∇²f(z) p

where z = x + αp for some α ∈ (0, 1). Since z ∈ B, we have p^T ∇²f(z) p > 0, and therefore

f(x + p) > f(x).

Hence x is a strict local minimizer.

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Quadratic Model of the Function

Let f : Rn 7→ R be twice continuously differen-

tiable at x0.

From Taylor’s theorem we can replace (locally) f with its second order approximation

f(x0 + p) = f(x0) + ∇f(x0)^T p + (1/2) p^T ∇²f(x0) p + r3(p),

where the remainder satisfies:

lim_{p→0} r3(p) / ‖p‖² = 0.

The quadratic model

m(x0 + p) = f(x0) + ∇f(x0)^T p + (1/2) p^T ∇²f(x0) p

is a good approximation of f in the neighbourhood of x0, i.e. if p is small.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Rosenbrock Function

Example: f(x) = 10(x2 − x1²)² + (1 − x1)².

[Figure: level sets of the Rosenbrock function in the (x1, x2)-plane, with the minimizer at (1, 1).]

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Descent Directions

Steepest Descent Direction

p = −∇f(x0).

Indeed, for small α that satisfies

α ∇f(x0)^T ∇²f(x0) ∇f(x0) / (2‖∇f(x0)‖²) < 1

we have

m(x0 + αp) = f(x0) − α‖∇f(x0)‖² + (1/2) α² ∇f(x0)^T ∇²f(x0) ∇f(x0)
           < f(x0).

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Descent Directions (cont’d)

Newton Direction

This direction points to the minimum of the

quadratic model.

The quadratic model has a minimum at the point x0 + p such that

∇m(x0 + p) = ∇f(x0) + ∇²f(x0) p = 0,

i.e.

p = −(∇²f(x0))^{-1} ∇f(x0).

If ∇²f(x0) is positive definite, then (∇²f(x0))^{-1} is also positive definite, hence

∇f(x0)^T p = −∇f(x0)^T (∇²f(x0))^{-1} ∇f(x0) < 0.

Thus, for α ∈ (0, 1] we get

m(x0 + αp) = f(x0) − α(1 − α/2) ∇f(x0)^T (∇²f(x0))^{-1} ∇f(x0)
           < f(x0) − (α/2) ∇f(x0)^T (∇²f(x0))^{-1} ∇f(x0)
           < f(x0).

8


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Other Descent Directions

Let B be a symmetric positive definite matrix.

Define

p = −B−1∇f(x0).

Since B is positive definite, its inverse is also

positive definite and

∇f(x0)Tp = −∇f(x0)

TB−1∇f(x0) < 0.

Thus, for small α that satisfies

α ∇f(x0)^T B^{-1} ∇²f(x0) B^{-1} ∇f(x0) / (2 ∇f(x0)^T B^{-1} ∇f(x0)) < 1

we get

m(x0 + αp) = f(x0) − α ∇f(x0)^T B^{-1} ∇f(x0) + (1/2) α² ∇f(x0)^T B^{-1} ∇²f(x0) B^{-1} ∇f(x0)
           < f(x0).

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Descent Directions: Summary

The steepest descent direction is easy to com-

pute (only the gradient of f). There is a guar-

antee that for sufficiently small α some progress

in the objective can be achieved. Although the

algorithms that use the steepest descent direc-

tions are globally convergent, they may be very

slow in practice.

If ∇2f(x0) is positive definite, the Newton direc-

tion can be used. The computation of this direc-

tion needs the evaluation of the second derivative

of f as well as the solution of the system of linear

equations:

(∇2f(x0)) p = −∇f(x0).

This may be quite a significant effort.

10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Approximate Hessian

If

• ∇2f(x0) is not positive definite, or

• ∇2f(x0) is very expensive to compute, or

• the linear system with ∇2f(x0) is too hard,

then instead of using exact ∇2f(x0), we can use

its positive definite approximation B.

Example:

Quasi-Newton Methods

• DFP: Davidon, Fletcher, Powell

• BFGS: Broyden, Fletcher, Goldfarb, Shanno

Low-rank approximations of the Hessian matrix.

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Line Search

The material given in this lecture comes from

the book of J. Nocedal & S. Wright:

Numerical Optimization,

Springer Series in Operations Research,

Springer-Verlag Telos, 1999.

To analyse the behaviour of function f along di-

rection p, define φ : R 7→ R as

φ(α) = f(x0 + αp).

If p is a descent direction for f , then for suffi-

ciently small α

φ(α) < φ(0).

However, a tiny step size α would mean that only tiny progress towards optimality can be made.

We want α to be reasonably large.

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Two Conds for the Step Size

Sufficient Decrease Condition

f(x0 + αp) ≤ f(x0) + c1α∇f(x0)T p,

for some constant c1 ∈ (0,1).

Curvature Condition

∇f(x0 + αp)T p ≥ c2∇f(x0)T p,

for some constant c2 ∈ (c1,1).

The sufficient decrease condition requires the re-

duction in the objective to be proportional to

the step size α and the directional derivative

∇f(x0)T p.

The curvature condition can be expressed as

φ′(α) ≥ c2 φ′(0).

Thus it requires the slope of φ at α to be at least c2 times the (negative) slope at zero. In other words, if the slope φ′(α) is strongly negative, then we should increase α. However, if it is only slightly negative or even positive, then little or no progress can be expected from moving further along direction p.
13
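A small Python sketch that evaluates the two Wolfe conditions on a grid of step sizes, using the Rosenbrock-type function from a few slides back; the constants c1, c2 and the grid are illustrative choices, not the line search algorithm of Nocedal & Wright:

    import numpy as np

    f = lambda x: 10 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2
    grad = lambda x: np.array([-40 * x[0] * (x[1] - x[0] ** 2) - 2 * (1 - x[0]),
                               20 * (x[1] - x[0] ** 2)])

    x0 = np.array([0.0, 1.0])
    p = -grad(x0)                        # steepest descent direction
    c1, c2 = 1e-4, 0.9
    slope0 = grad(x0) @ p                # phi'(0) < 0 for a descent direction

    for alpha in [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]:
        sufficient = f(x0 + alpha * p) <= f(x0) + c1 * alpha * slope0
        curvature = grad(x0 + alpha * p) @ p >= c2 * slope0
        print(alpha, sufficient, curvature)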

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Two Conds for the Step Size

[Figure: two graphs of φ(α) = f(xk + αpk) together with the line l(α); the intervals of step sizes acceptable under the Wolfe conditions are marked.]

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Good step size α exists

Lemma 4

Let f : Rn 7→ R be continuously differentiable at

x0. Let p be a descent direction at x0, such that

f is bounded below along a ray {x0+αp|α > 0}.

If 0 < c1 < c2 < 1, then there exist intervals of

step lengths that satisfy Wolfe’s conditions:

f(x0 + αp) ≤ f(x0) + c1α∇f(x0)T p,

∇f(x0 + αp)T p ≥ c2∇f(x0)T p.

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Good step size α exists

Proof. Since φ(α)=f(x0+αp) is bounded below

∀α > 0 and 0 < c1 < 1, the line l(α) = f(x0)+

c1α∇f(x0)T p must intersect the graph of φ at

least once. Let α0 be the smallest intersecting

value of α, that is

f(x0 + α0p) = f(x0) + c1α0∇f(x0)T p.

The sufficient decrease condition holds for

α ∈ [0, α0].

By the mean value theorem, ∃αx ∈ (0, α0):

f(x0 + α0p) − f(x0) = α0∇f(x0 + αxp)T p.

By combining this and the earlier equality:

∇f(x0 + αxp)^T p = c1 ∇f(x0)^T p > c2 ∇f(x0)^T p,

since c1 < c2 and ∇f(x0)T p < 0.

Therefore αx satisfies Wolfe’s conditions. Since

both these conditions are satisfied as strict in-

equalities, there exists a neighbourhood of αx in

which the Wolfe’s inequalities hold.

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Goldstein Conditions

An alternative to Wolfe’s conditions are the Gold-

stein Conditions:

f(x0 + αp) ≤ f(x0) + c α∇f(x0)T p,

f(x0 + αp) ≥ f(x0) + (1 − c)α∇f(x0)T p,

with 0 < c < 1/2.

[Figure: graph of φ(α) = f(xk + αpk) between the two Goldstein lines f(xk) + cα∇f(xk)^T pk and f(xk) + (1−c)α∇f(xk)^T pk; the acceptable interval of step sizes is marked.]

17

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Convergence of Line Search M.

Def. A function f is Lipschitz continuous on D if there exists a constant L such that

‖f(x) − f(y)‖ ≤ L‖x − y‖ for all x, y ∈ D.

Theorem (Zoutendijk)

Consider an iteration of the form

xk+1 = xk + αkpk,

where pk is a descent direction and αk satis-

fies the Wolfe’s conditions. Let θk be the angle

between pk and the steepest descent direction

−∇fk, i.e.

cos θk = −∇fk^T pk / (‖∇fk‖ ‖pk‖).

Assume f is bounded below, continuously dif-

ferentiable on an open set D, and the initial

point x0 ∈ D is given. Assume also that ∇f is Lipschitz continuous on D and the level set

L = {x : f(x) < f(x0)} ⊂ D.

Then ∑_{k≥0} cos²θk ‖∇fk‖² < ∞.

18

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Proof of Zoutendijk Theorem

Proof. From the second Wolfe condition:

(∇f_{k+1} − ∇fk)^T pk ≥ (c2 − 1) ∇fk^T pk.

Lipschitz continuity of ∇f implies that

(∇f_{k+1} − ∇fk)^T pk ≤ αk L ‖pk‖².

By combining these two inequalities, we get

αk ≥ ((c2 − 1)/L) · (∇fk^T pk / ‖pk‖²).

Hence from the first Wolfe condition we get

f_{k+1} ≤ fk − c1 · ((1 − c2)/L) · (∇fk^T pk)² / ‖pk‖².

Let c = c1 (1 − c2)/L > 0. From the definition of cos θk, we get

f_{k+1} ≤ fk − c · cos²θk ‖∇fk‖².

Summing these inequalities for j = 0, 1, · · · , k, we obtain

f_{k+1} ≤ f0 − c · ∑_{j=0}^{k} cos²θj ‖∇fj‖².

Since f is bounded below, for any k, f0 − f_{k+1} is less than some positive constant. Hence

∑_{k≥0} cos²θk ‖∇fk‖² < ∞.

19

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Global Convergence

Corollary 1

lim_{k→∞} cos θk ‖∇fk‖ = 0.

Corollary 2

If for all k, cos²θk ≥ δ > 0, then

lim_{k→∞} ‖∇fk‖ = 0.

Thus it suffices that the angle between pk and

−∇fk does not get too close to π/2. Then

the descent algorithm with line search satisfying

Wolfe’s conditions is globally convergent.

The convergence is global but it may be slow.

[Figure: a zigzagging sequence of iterates x0, x1, x2 illustrating slow convergence.]

20


Interior Point Methods

for Linear, Quadratic

and Nonlinear Programming

Turin 2008

Jacek Gondzio

Lecture 20:

Nonlinear Programming

Trust Region Methods

1

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Trust Region Nonlinear Opt.

The material given in this lecture comes from the book of J. Nocedal & S. Wright:
Numerical Optimization,
Springer Series in Operations Research,
Springer-Verlag Telos, 1999.

Consider an unconstrained nonlinear optimization problem

min_x f(x)

where x ∈ Rn, and f : Rn → R is a twice differentiable but not necessarily convex function.

The quadratic model

mk(xk + p) = f(xk) + ∇f(xk)^T p + (1/2) p^T Bk p

is a good approximation of f in the neighbourhood of xk, i.e. if p is small.

It is natural to make a step from xk to the new xk+1 in a direction that minimizes the quadratic model. The quadratic model is valid for small p, say, ‖p‖ ≤ ∆. Thus the direction pk ∈ Rn is the solution of the following problem

min   mk(p) = fk + ∇fk^T p + (1/2) p^T Bk p
s.t.  ‖p‖ ≤ ∆k.

2

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Trust Region: Illustration

[Figure: contours of the model and the level set of f around xk, with the minimizer and two trust regions of radii ∆1 and ∆2.]

Trust region around the point xk.

Two possible directions depending on the size of

the trust region ∆1 and ∆2.

3

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Measuring the Progress

The radius of the trust region is chosen at each

iteration. This choice is based on the agreement

of the model function mk and the true objective

function f .

Given a step pk we define the ratio

ρk =f(xk) − f(xk + pk)

mk(0) − mk(pk).

The numerator is called the actual reduction,

and the denominator is the predicted reduction.

Since the step pk is obtained by minimizing the

model mk over the region that includes p = 0,

the predicted reduction will always be nonnega-

tive.

4

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Measuring the Progress

If ρk < 0, the new objective value f(xk + pk) is

greater than the current value f(xk), so the step

must be rejected and the trust region should be

reduced.

On the other hand, if ρk is close to 1, there is

a good agreement between the model mk and

the function f , so it is safe to expand the trust

region for the next iteration.

If ρk > 0 but it is not close to 1, we do not alter

the trust region. If ρk is close to zero, we shrink

the trust region.

5

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Trust Region Algorithm

Given ∆ > 0, ∆0 ∈ (0, ∆) and η ∈ [0, 1/4):
for k = 0, 1, 2, · · ·
    Find pk (by solving the trust region problem);
    Evaluate ρk;
    if ρk < 1/4
        ∆k+1 = (1/4)‖pk‖
    else
        if ρk > 3/4 and ‖pk‖ = ∆k
            ∆k+1 = min{2∆k, ∆}
        else
            ∆k+1 = ∆k;
    if ρk > η
        xk+1 = xk + pk
    else
        xk+1 = xk;
end (for).
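A minimal Python sketch of this loop, using the Cauchy point (introduced on the following slides) as an inexpensive approximate solution of the subproblem; ∆, ∆0, η and the Rosenbrock test function are illustrative choices only:

    import numpy as np

    def cauchy_point(g, B, delta):
        gBg = g @ B @ g
        tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g) ** 3 / (delta * gBg), 1.0)
        return -tau * delta / np.linalg.norm(g) * g

    def trust_region(f, grad, hess, x, delta_max=1.0, delta=0.5, eta=0.1, iters=500):
        for _ in range(iters):
            g, B = grad(x), hess(x)
            p = cauchy_point(g, B, delta)
            pred = -(g @ p + 0.5 * p @ B @ p)            # m(0) - m(p), positive when g != 0
            rho = (f(x) - f(x + p)) / pred
            if rho < 0.25:
                delta = 0.25 * np.linalg.norm(p)
            elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
                delta = min(2 * delta, delta_max)
            if rho > eta:
                x = x + p
        return x

    f = lambda x: 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2
    grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0] ** 2) - 2 * (1 - x[0]),
                               200 * (x[1] - x[0] ** 2)])
    hess = lambda x: np.array([[1200 * x[0] ** 2 - 400 * x[1] + 2, -400 * x[0]],
                               [-400 * x[0], 200.0]])

    print(trust_region(f, grad, hess, np.array([0.0, 1.0])))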

6

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

The Cauchy Point

Find the direction p^S_k that solves the linear version of the trust region subproblem:

min   fk + ∇fk^T p
s.t.  ‖p‖ ≤ ∆k.

Find the step size τk > 0 that minimizes mk(τ p^S_k) within the trust region, i.e. solve:

min_{τ>0}   mk(τ p^S_k)
s.t.        ‖τ p^S_k‖ ≤ ∆k.

Set p^C_k = τk p^S_k.

7

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

The Cauchy Point (cont’d)

[Figure: contours of the model and the trust region around xk, with the steepest descent step p^S_k and the Cauchy point p^C_k on it.]

It is possible to write a closed-form definition of the Cauchy point. Indeed, the first minimization gives

p^S_k = −(∆k / ‖∇fk‖) ∇fk.

It is the steepest descent direction, and the maximum step size within the trust region is made in this direction.

8


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Cauchy Point: Explicit Formula

The problem of the line search along pSk can also

be solved explicitly. We consider two cases:

Case 1: ∇fTk Bk∇fk ≤ 0.

The function mk(τpSk ) decreases monotonically

with τ so τk is the largest value that satisfies the

trust region bound, namely τk = 1.

Case 2: ∇fTk Bk∇fk > 0.

The function mk(τpSk ) is a convex quadratic in

τ , so τk is either the unconstrained minimizer

of this quadratic, ‖∇fk‖³/(∆k ∇fk^T Bk ∇fk), or the

boundary value 1, whichever comes first.

Summing up, p^C_k = −τk (∆k / ‖∇fk‖) ∇fk, where

τk = 1                                         if ∇fk^T Bk ∇fk ≤ 0;
τk = min{ ‖∇fk‖³ / (∆k ∇fk^T Bk ∇fk), 1 }      otherwise.
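A small NumPy sketch checking the closed-form τk against a brute-force one-dimensional minimization of mk(τ p^S_k) over (0, 1]; g, B and ∆k here are random placeholder data:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    g = rng.standard_normal(n)
    M = rng.standard_normal((n, n))
    B = M + M.T                                   # symmetric, possibly indefinite
    delta = 0.7

    pS = -delta / np.linalg.norm(g) * g           # steepest descent step to the boundary
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g) ** 3 / (delta * gBg), 1.0)

    m = lambda p: g @ p + 0.5 * p @ B @ p         # the model minus the constant f_k
    taus = np.linspace(1e-6, 1.0, 20001)
    tau_brute = taus[np.argmin([m(t * pS) for t in taus])]
    print(tau, tau_brute)                          # the two values should agree closely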

9

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Improvement on the C. Point

If we choose the Cauchy point then we actually make a step in the steepest descent direction. (We

just use the specific step size.) We thus make

negligible use of the second-order information.

Being a variant of the steepest descent method,

an algorithm that always makes the step towards

the Cauchy point is globally convergent but very

slow in practice.

We should make better use of the second order

information.

To simplify the notation from now on we drop

the subscripts k from pk, fk,∇fk and Bk.

Additionally, we denote g = ∇fk.

10

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Direction p depends on ∆

Examine the influence of ∆ on the solution of the

trust region subproblem. For very small ∆, the

step in the steepest descent direction −g should

be made:

p(∆) ≈ −∆ g / ‖g‖.

For very large ∆ (and positive definite B), the

full step in the Newton-like direction should be

made:

p(∆) = pB = −B−1g.

[Figure: the curved trajectory p(∆) at xk, starting along the steepest descent direction −g, passing through pU and ending at the full step pB.]

11

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Dogleg Method

For the intermediate values of ∆, the solution

p(∆) typically follows the curved trajectory in

the Figure.

We can approximate the curved trajectory with

two line segments. The first runs from the origin

to the unconstrained minimizer along the steep-

est descent direction

pU = −( g^T g / (g^T B g) ) g.

The second line segment runs from pU to pB.

The following is a parametric description of the two-segment trajectory:

p(τ) = τ pU,                          0 ≤ τ ≤ 1,
p(τ) = pU + (τ − 1)(pB − pU),         1 < τ ≤ 2.

12

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Dogleg Method (cont’d)

Lemma 1 Let B be positive definite. Then

(i) ‖p(τ)‖ is an increasing function of τ ,

(ii) m(p(τ)) is a decreasing function of τ .

Proof: Both (i) and (ii) hold for any τ ∈ [0,1].

Suppose τ ∈ (1,2]. Let τ = 1 + α.

For (i), define h(α) and show that h′(α) ≥ 0 for α ∈ (0, 1):

h(α) = (1/2) ‖pU + α(pB − pU)‖²
     = (1/2) ‖pU‖² + α (pU)^T (pB − pU) + (1/2) α² ‖pB − pU‖².

13

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Proof of Lemma 1 (cont’d)

h′(α) = −(pU)^T (pU − pB) + α ‖pB − pU‖²
      ≥ −(pU)^T (pU − pB)
      = ( g^T g / (g^T B g) ) g^T ( −( g^T g / (g^T B g) ) g + B^{-1} g )
      = ( g^T g · g^T B^{-1} g / (g^T B g) ) ( 1 − (g^T g)² / ((g^T B g)(g^T B^{-1} g)) )
      ≥ 0.

Substitute u = B^{1/2} g and v = B^{-1/2} g and use |u^T v| ≤ ‖u‖ ‖v‖.

For (ii), define

h(α) = m(p(1 + α)) = f(xk) + (pU + α(pB − pU))^T g + (1/2) (pU + α(pB − pU))^T B (pU + α(pB − pU)),

and show that h′(α) ≤ 0 for α ∈ (0, 1):

h′(α) = (pB − pU)^T (g + B pU) + α (pB − pU)^T B (pB − pU)
      ≤ (pB − pU)^T (g + B pU + B(pB − pU))
      = (pB − pU)^T (g + B pB) = 0.

14

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Dogleg Method (continued)

From Lemma 1 we conclude that the path p(τ) has

to intersect the trust region boundary ‖p‖ = ∆

at exactly one point if ‖pB‖ ≥ ∆, and nowhere

otherwise.

Since m is decreasing along the path, the optimal

solution of the trust region subproblem (along

the path) is either pB (if ‖pB‖ ≤ ∆) or the point

of intersection of the path and the trust region

boundary. In the latter case, we compute τ by

solving the following scalar quadratic equation

‖pU + (τ − 1)(pB − pU)‖² = ∆².
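A minimal NumPy sketch of the dogleg step for positive definite B (random placeholder data); it returns pB when it fits inside the trust region and otherwise intersects the two-segment path with the boundary by solving the scalar quadratic above:

    import numpy as np

    def dogleg(g, B, delta):
        pB = -np.linalg.solve(B, g)
        if np.linalg.norm(pB) <= delta:
            return pB
        pU = -(g @ g) / (g @ B @ g) * g
        if np.linalg.norm(pU) >= delta:
            return delta / np.linalg.norm(pU) * pU       # boundary hit on the first segment
        d = pB - pU
        # Solve ||pU + t d||^2 = delta^2 for t = tau - 1 in (0, 1].
        a, b, c = d @ d, 2 * pU @ d, pU @ pU - delta ** 2
        t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root
        return pU + t * d

    rng = np.random.default_rng(3)
    n = 5
    M = rng.standard_normal((n, n))
    B = M @ M.T + n * np.eye(n)                           # positive definite placeholder
    g = rng.standard_normal(n)

    for delta in (0.1, 1.0, 10.0):
        p = dogleg(g, B, delta)
        print(delta, round(float(np.linalg.norm(p)), 4))  # the norm never exceeds delta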

15

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Two-D. Subspace Minimization

The dogleg method can be made more sophis-

ticated by allowing the search in the entire two-

dimensional subspace spanned by vectors g and

B−1g instead of the search along the path p.

The trust region subproblem is then replaced by

min   m(p) = f + g^T p + (1/2) p^T B p
s.t.  ‖p‖ ≤ ∆
      p ∈ span[g, B^{-1}g].

Clearly, the Cauchy point pC is feasible for this

subproblem so the optimal solution yields at least

as much reduction in m as the Cauchy point,

resulting in the global convergence of the algo-

rithm. Since the space span[g, B−1g] contains

the whole dogleg trajectory, the two-dimensional

subspace search is an obvious extension of the

dogleg method.

16


IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Two-D. Subspace Minimization

An important advantage of the two-dimensional

subspace minimization approach is that it can

easily handle the case of indefinite B.

When B is indefinite, the two-dimensional sub-

space is replaced by span[g, (B +λI)−1g], where

λ > 0 is chosen such that the matrix B + λI is

positive definite.

If B is indefinite, then at least one of its eigen-

values is negative. Let λ1 be the most negative

eigenvalue of B. It suffices to take λ > −λ1 to

make B + λI positive definite.

There exist efficient methods that find the most

negative eigenvalue of the matrix.

17

IPMs for LP, QP, NLP, J. Gondzio, Turin 2008

Example

Consider the Rosenbrock function:

f(x) = 100(x2 − x1²)² + (1 − x1)².

[Figure: level sets of the Rosenbrock function in the (x1, x2)-plane, with the minimizer at (1, 1).]

The function has a deep “valley” around the

points that satisfy the equation x2 = x1².

Build the quadratic model at point (0,1) and

solve the trust region subproblem for different

values of ∆: ∆1 = 1/4 and ∆2 = 1/2.

18