Minimizing Differences of Convex Functions-The DCA
Algorithms for Minimizing Differences of Convex Functions and Applications to
Facility Location and Clustering
Mau Nam Nguyen
(joint work with D. Giles and R. B. Rector)
Fariborz Maseeh Department of Mathematics and Statistics, Portland State University
Mathematics and Statistics Seminar, WSU Vancouver, April 12, 2017
Convex Functions
Definition
Let I ⊂ ℝ be an interval and let f : I → ℝ be a function. The function f is said to be CONVEX on I if
f(λx + (1 − λ)u) ≤ λf(x) + (1 − λ)f(u)
for all x, u ∈ I and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.
Figure: Convex Function and Nonconvex Function.
Convex Functions
Example
Let I = ℝ = (−∞, ∞) and let f(x) = |x| for x ∈ ℝ. Then f is a convex function. Indeed, for any x, u ∈ ℝ and λ ∈ (0, 1),
f(λx + (1 − λ)u) = |λx + (1 − λ)u| ≤ |λx| + |(1 − λ)u| = λ|x| + (1 − λ)|u| = λf(x) + (1 − λ)f(u).
Figure: Convex Function and Nonconvex Function.
Characterizations for Convexity
Theorem
Let f : I → ℝ be a differentiable function, where I is an open interval in ℝ. Then f is convex on I if and only if f′ is monotone increasing on I.

Theorem
Let f : I → ℝ be a twice differentiable function, where I is an open interval in ℝ. Then f is convex on I if and only if f″(x) ≥ 0 for all x ∈ I.
Example
The following functions are convex:
(i) f(x) = ax² + bx + c, x ∈ ℝ, a > 0.
(ii) f(x) = e^x, x ∈ ℝ.
(iii) f(x) = −ln(x), x ∈ (0, ∞).
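These examples can be spot-checked numerically. The sketch below (my own illustrative code, not from the talk) samples random pairs and tests the defining inequality f(λx + (1 − λ)u) ≤ λf(x) + (1 − λ)f(u):

```python
import math
import random

def convex_on(f, lo, hi, trials=1000, tol=1e-9):
    """Randomized check of f(l*x + (1-l)*u) <= l*f(x) + (1-l)*f(u) on (lo, hi)."""
    rng = random.Random(0)
    for _ in range(trials):
        x, u = rng.uniform(lo, hi), rng.uniform(lo, hi)
        lam = rng.random()
        if f(lam * x + (1 - lam) * u) > lam * f(x) + (1 - lam) * f(u) + tol:
            return False
    return True

print(convex_on(lambda x: 2 * x**2 + x - 1, -10, 10))  # (i) with a = 2 > 0
print(convex_on(math.exp, -5, 5))                      # (ii)
print(convex_on(lambda x: -math.log(x), 1e-3, 10))     # (iii)
print(convex_on(lambda x: x**3, -5, 5))                # fails: x^3 is not convex on R
```

Such a check can only falsify convexity, of course; the theorems above give the actual proofs.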
Convex Sets
Definition
A subset Ω of ℝⁿ is CONVEX if [a, b] ⊂ Ω whenever a, b ∈ Ω. Equivalently, Ω is convex if λa + (1 − λ)b ∈ Ω for all a, b ∈ Ω and λ ∈ [0, 1].
Figure: Convex set and nonconvex set.
Convex Functions
Definition
Let f : Ω → ℝ be defined on a convex set Ω ⊂ ℝⁿ. The function f is said to be CONVEX on Ω if
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ Ω and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.
Figure: Convex Function and Nonconvex Function.
Convex Functions
Theorem
Let f_i : ℝⁿ → ℝ be convex functions for all i = 1, …, m. Then the following functions are convex as well:
(i) The multiple λf for any λ > 0.
(ii) The sum function ∑_{i=1}^m f_i.
(iii) The maximum function max_{1≤i≤m} f_i.
Let f : ℝⁿ → ℝ be convex and let φ : ℝ → ℝ be convex and nondecreasing. Then the composition φ∘f is convex.
Let B : ℝⁿ → ℝᵖ be an affine mapping and let f : ℝᵖ → ℝ be a convex function. Then the composition f∘B is convex.
Relationship between Convex Functions and Convex Sets
Theorem
A function f : ℝⁿ → ℝ is convex if and only if its epigraph epi f is a convex subset of the product space ℝⁿ × ℝ.
Figure: Epigraphs of convex function and nonconvex function.
Pierre de Fermat
Pierre de Fermat (1601-1665) was a French lawyer at the Parlement of Toulouse, France, and an amateur mathematician who is given credit for early developments that led to calculus.
Fermat-Torricelli Problem
In the early 17th century, at the end of his book Treatise on Minima and Maxima, the French mathematician Fermat (1601-1665) proposed the following problem: given three points in the plane, find a point such that the sum of its distances to the three given points is minimal.

In general: given a_1, a_2, …, a_m ∈ ℝⁿ, find x ∈ ℝⁿ which minimizes the function
f(x) = ∑_{i=1}^m ‖x − a_i‖.
Weiszfeld’s Algorithm
The gradient of the objective function f is
∇f(x) = ∑_{i=1}^m (x − a_i)/‖x − a_i‖,  x ∉ {a_1, a_2, …, a_m}.
Solving the equation ∇f(x) = 0 gives
x = (∑_{i=1}^m a_i/‖x − a_i‖) / (∑_{i=1}^m 1/‖x − a_i‖) =: F(x).
For continuity, define F(x) := x for x ∈ {a_1, a_2, …, a_m}.
Weiszfeld's algorithm: choose a starting point x_0 ∈ ℝⁿ and define
x_{k+1} = F(x_k) for k ∈ ℕ.
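The iteration x_{k+1} = F(x_k) can be sketched in a few lines (my own code, not from the talk), with the convention F(a_i) := a_i handled explicitly:

```python
import numpy as np

def weiszfeld(anchors, x0, iters=1000):
    """Iterate x_{k+1} = F(x_k), the weighted-average map of Weiszfeld's algorithm."""
    A = np.asarray(anchors, dtype=float)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(A - x, axis=1)
        if np.any(d < 1e-14):            # F(a_i) := a_i, so stop at an anchor
            break
        w = 1.0 / d
        x = (w[:, None] * A).sum(axis=0) / w.sum()
    return x

# the Fermat point of this triangle lies on the axis of symmetry x = 2
anchors = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
x_star = weiszfeld(anchors, x0=(1.0, 1.0))
```

For three non-collinear anchors the limit is the classical Fermat point, at which the three unit vectors toward the anchors sum to zero.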
Kuhn’s Convergence Theorem
Weiszfeld also claimed that if x_0 ∉ {a_1, a_2, …, a_m}, where the a_i for i = 1, …, m are not collinear, then (x_k) converges to the unique optimal solution of the problem. A counterexample, a corrected statement, and a proof of convergence were given by Kuhn in 1972.

Theorem
Let (x_k) be the sequence generated by Weiszfeld's algorithm. Suppose that x_k ∉ {a_1, a_2, …, a_m} for all k ≥ 0. Then (x_k) converges to the optimal solution of the problem.
C¹ Functions

A function f : ℝⁿ → ℝ is called a C¹ function if it has all partial derivatives ∂f/∂x_i for i = 1, …, n, and all these partial derivatives are continuous functions. The gradient of f at a point x is denoted by
∇f(x) = (∂f/∂x_1(x), …, ∂f/∂x_n(x)).

Example
Let f(x) = ‖x‖² = ∑_{i=1}^n x_i². Then f is a C¹ function with
∂f/∂x_i(x) = 2x_i for x = (x_1, …, x_n),
and ∇f(x) = (2x_1, …, 2x_n) = 2x.
C¹ Convex Functions

Theorem
Let f : ℝⁿ → ℝ be a C¹ function. Then f is convex if and only if
f(x̄) + ⟨∇f(x̄), x − x̄⟩ ≤ f(x) for all x, x̄ ∈ ℝⁿ.
Figure: C1 Convex Functions.
Lipschitz Continuous Functions and C^{1,1} Functions

Definition
A function g : ℝⁿ → ℝᵐ is said to be Lipschitz continuous if there exists a constant ℓ ≥ 0 such that
‖g(x) − g(u)‖ ≤ ℓ‖x − u‖ for all x, u ∈ ℝⁿ.

A C¹ function f : ℝⁿ → ℝ is called a C^{1,1} function if its gradient is Lipschitz continuous, i.e., there exists a constant ℓ > 0 such that
‖∇f(x) − ∇f(u)‖ ≤ ℓ‖x − u‖ for all x, u ∈ ℝⁿ.
C^{1,1} Functions

Theorem
Let f : ℝⁿ → ℝ be a C^{1,1} function whose gradient is Lipschitz continuous with Lipschitz constant ℓ. Then
f(x) ≤ f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (ℓ/2)‖x − x̄‖²
for all x, x̄ ∈ ℝⁿ.
Figure: C^{1,1} Functions.
The Gradient Method
Let f : ℝⁿ → ℝ be a C¹ function and let (α_k) be a sequence of positive real numbers called the sequence of step sizes. The gradient method generates a sequence (x_k) as follows:
Choose x_0 ∈ ℝⁿ.
Define x_{k+1} = x_k − α_k ∇f(x_k) for k ≥ 0.

Given a C¹ function f : ℝⁿ → ℝ and α > 0, define g : ℝⁿ → ℝⁿ by g(x) = x − α∇f(x). Then the gradient method with constant step size α is given by
Choose x_0 ∈ ℝⁿ.
Define x_{k+1} = x_k − α∇f(x_k) = g(x_k) for k ≥ 0.
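The constant-step scheme can be sketched as follows (a generic illustration, not code from the talk), applied to a quadratic whose gradient ∇f(x) = Qx − b is Lipschitz with ℓ = λ_max(Q):

```python
import numpy as np

def gradient_method(grad, x0, alpha, iters=2000):
    """x_{k+1} = x_k - alpha * grad(x_k), constant step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# f(x) = 0.5 <Qx, x> - <b, x>; the minimizer solves Qx = b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
ell = np.linalg.eigvalsh(Q).max()        # Lipschitz constant of the gradient
x_min = gradient_method(lambda x: Q @ x - b, np.zeros(2), alpha=1.0 / ell)
```

The step size 1/ℓ is exactly the choice analyzed in the convergence theorem below.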
The Gradient Method
Theorem
Let f : ℝⁿ → ℝ be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0. Consider the gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then (f(x_k)) is monotone decreasing and
0 ≤ f(x_k) − f(x*) ≤ ‖x_0 − x*‖²/(2kα) for all k ≥ 1.
Nesterov’s Accelerated Gradient Method
Let f : ℝⁿ → ℝ be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and let α = 1/ℓ.
Choose y_1 = x_0 ∈ ℝⁿ and t_0 = 0.
Define x_k = y_k − α∇f(y_k) = g(y_k) for k ≥ 1.
Update t_k = (1 + √(1 + 4t_{k−1}²))/2 for k ≥ 1.
Define y_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}) for k ≥ 1.
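The updates above can be sketched directly (my own generic code; with t_0 = 0 the first momentum coefficient is (t_1 − 1)/t_2 = 0, so the first step is a plain gradient step):

```python
import numpy as np

def nesterov_agm(grad, x0, alpha, iters=500):
    """Accelerated gradient: x_k = y_k - alpha*grad(y_k), with momentum on y."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    t = 0.0                                                  # t_0 = 0
    for _ in range(iters):
        x = y - alpha * grad(y)                              # x_k = g(y_k)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # t_k
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t_new**2)) / 2.0 # t_{k+1}
        y = x + ((t_new - 1.0) / t_next) * (x - x_prev)      # momentum step
        x_prev, t = x, t_new
    return x_prev

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
alpha = 1.0 / np.linalg.eigvalsh(Q).max()
x_min = nesterov_agm(lambda x: Q @ x - b, np.zeros(2), alpha)
```

On the same quadratic as before, the O(1/k²) bound of the next theorem already guarantees the iterates are close to the minimizer x* = (0.6, −0.8) after a few hundred steps.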
Nesterov's Accelerated Gradient Method

Theorem
Let f : ℝⁿ → ℝ be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0. Consider Nesterov's accelerated gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then
0 ≤ f(x_k) − f(x*) ≤ 2‖x_0 − x*‖²/(α(k + 1)²) for all k ≥ 1.
Distance Functions
Given a set Ω ⊂ ℝⁿ, the distance function associated with Ω is defined by
d(x; Ω) := inf{‖x − ω‖ : ω ∈ Ω}.
For each x ∈ ℝⁿ, the Euclidean projection from x to Ω is defined by
P(x; Ω) := {ω ∈ Ω : ‖x − ω‖ = d(x; Ω)}.
Figure: Distance function.
Squared Distance Functions
Theorem
Let Ω be a nonempty, closed, convex subset of ℝⁿ. Then:
For each x ∈ ℝⁿ, the Euclidean projection P(x; Ω) is a singleton.
The projection mapping is nonexpansive:
‖P(x_1; Ω) − P(x_2; Ω)‖ ≤ ‖x_1 − x_2‖ for all x_1, x_2 ∈ ℝⁿ.
In addition, the squared distance function φ(x) = [d(x; Ω)]² for x ∈ ℝⁿ is a C^{1,1} function with
∇φ(x) = 2[x − P(x; Ω)].
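For a concrete instance, take Ω to be the closed unit ball, where the projection has a closed form. The sketch below (my own code) evaluates both the projection and the gradient formula ∇φ(x) = 2[x − P(x; Ω)]:

```python
import numpy as np

def proj_unit_ball(x):
    """Euclidean projection onto the closed unit ball of R^n."""
    x = np.asarray(x, dtype=float)
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def sq_dist_grad(x):
    """Gradient of phi(x) = d(x; B)^2, namely 2 [x - P(x; B)]."""
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - proj_unit_ball(x))

x = np.array([3.0, 4.0])          # ||x|| = 5, so d(x; B) = 4
p = proj_unit_ball(x)             # (0.6, 0.8), on the unit sphere
g = sq_dist_grad(x)               # 2 * (2.4, 3.2) = (4.8, 6.4)
```

This closed-form projection is exactly what Nesterov's smoothing technique on the next slide requires.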
Nesterov’s Smoothing Technique
Theorem
Given any a ∈ ℝⁿ and μ > 0, a Nesterov smoothing approximation of φ(x) := ‖x − a‖ has the representation
f_μ(x) = ‖x − a‖²/(2μ) − (μ/2)[d((x − a)/μ; 𝔹)]²,
where 𝔹 is the closed unit ball of ℝⁿ. In addition,
∇f_μ(x) = P((x − a)/μ; 𝔹).
The gradient ∇f_μ is Lipschitz continuous with constant ℓ_μ = 1/μ.

Y. Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A, 103 (2005), 127–152.
Generalized Fermat-Torricelli Problems
Consider the following version of the Fermat-Torricelli problem:
minimize F(x) := ∑_{i=1}^m ‖x − a_i‖ subject to x ∈ ℝⁿ.
A smooth approximation of F is given by
F_μ(x) = ∑_{i=1}^m ( ‖x − a_i‖²/(2μ) − (μ/2)[d((x − a_i)/μ; 𝔹)]² ).
Generalized Fermat-Torricelli Problems
The function F_μ is continuously differentiable on ℝⁿ with its gradient given by
∇F_μ(x) = ∑_{i=1}^m P((x − a_i)/μ; 𝔹).
Its gradient is Lipschitz continuous with constant
L_μ = m/μ.
Moreover, one has the following estimate:
F_μ(x) ≤ F(x) < F_μ(x) + mμ/2 for all x ∈ ℝⁿ.
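Since F_μ is C^{1,1} with L_μ = m/μ, the gradient method with step size α = μ/m applies directly. A sketch (my own code; the anchors and the value of μ are illustrative). Note that for points farther than μ from every a_i the smoothed term equals ‖x − a_i‖ − μ/2, so for small μ the minimizer of F_μ essentially coincides with the Fermat-Torricelli point:

```python
import numpy as np

def proj_unit_ball(z):
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

def grad_F_mu(x, anchors, mu):
    """grad F_mu(x) = sum_i P((x - a_i)/mu; B)."""
    return sum(proj_unit_ball((x - a) / mu) for a in anchors)

def minimize_smoothed(anchors, mu=0.01, iters=20000):
    A = [np.asarray(a, dtype=float) for a in anchors]
    x = sum(A) / len(A)              # start from the centroid
    alpha = mu / len(A)              # step 1/L_mu with L_mu = m/mu
    for _ in range(iters):
        x = x - alpha * grad_F_mu(x, A, mu)
    return x

# Fermat point of this triangle: (2, 2/sqrt(3))
x_star = minimize_smoothed([(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)])
```

In practice one would pair F_μ with the accelerated method above, since the Lipschitz constant m/μ grows as the smoothing parameter shrinks.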
Generalized Fermat-Torricelli Problems
To demonstrate the method, let us consider a numerical example.

Example
The longitude/latitude coordinates in decimal format of US cities are recorded, for example, at http://www.realestate3d.com/gps/uslatlongdegmin.htm. Our goal is to find a point that minimizes the sum of distances to the given points representing the cities. If we consider the Euclidean norm, an approximate optimal value is V* = 23409.33, and a suboptimal solution is x* = (38.63, 97.35). If we consider the ℓ1-norm, the algorithm presented in this section allows us to find an approximate optimal value V* = 28724.68 and a suboptimal solution x* = (39.48, 97.22).
Generalized Fermat-Torricelli Problems
Example
The graphs below show the convergence of the algorithm (function value versus iteration k) for different norms: sum norm, max norm, and Euclidean norm.

Figure: A generalized Fermat-Torricelli problem with different norms.
Differences of Convex Functions
Consider the problem
minimize f(x) := g(x) − h(x), x ∈ ℝⁿ,
where g : Rn → R and h : Rn → R are convex.
We call g − h a DC decomposition of f .
g(x) = x⁴
h(x) = 3x² + x
f = g − h
Examples of DC Programming
Fermat-Torricelli
f(x) = ∑_{i=1}^m c_i ‖x − a_i‖

Clustering
f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖²

Multifacility Location
f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖
The DCA for Clustering
Problem formulation¹: Let a_i for i = 1, …, m be m data points in ℝⁿ.
Minimize f(x_1, …, x_k) := ∑_{i=1}^m min{‖x_ℓ − a_i‖² : ℓ = 1, …, k}
over x_ℓ ∈ ℝⁿ, ℓ = 1, …, k.

¹L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim., 27 (2007), 503–608.
K-Mean Clustering
Let x_1, x_2, …, x_m be the data points and let c_1, …, c_k denote the centers.
- Randomly select k cluster centers.
- Assign each data point to the nearest center.
- Find the average of the data points assigned to each center.
- Repeat the second step with the new centers obtained in the third step until the centroids no longer move.

Although k-means clustering is effective in many situations, it also has some disadvantages:
- The k-means algorithm does not necessarily find the optimal solution.
- The algorithm is sensitive to the initially selected cluster centers.
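The steps above can be sketched as follows (my own code; the two-blob data set and the deterministic initialization are illustrative, whereas the algorithm as stated picks the initial centers at random):

```python
import random

def kmeans(points, k, init=None, iters=100):
    """Plain k-means; init defaults to k randomly chosen data points."""
    centers = [list(p) for p in (init if init is not None else random.sample(points, k))]
    for _ in range(iters):
        # assign each point to the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((pi - ci) ** 2 for pi, ci in zip(p, centers[j])))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster
        new_centers = [[sum(c) / len(cl) for c in zip(*cl)] if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:     # centroids no longer move
            break
        centers = new_centers
    return centers

# two well-separated blobs; seeding one center in each blob recovers their means
data = [(0.0, 0.0), (0.2, 0.1), (-0.2, 0.2), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers = kmeans(data, 2, init=[data[0], data[3]])
```

Seeding both centers inside the same blob can leave the algorithm stuck in a poor split, which is exactly the initialization sensitivity noted above.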
DCA for Clustering and K-Means²

- Both DCA1 and DCA2 are better than k-means: the objective values given by DCA1 and DCA2 are much smaller than those computed by k-means.
- DCA2 is the best among the three algorithms: it provides the best solution in the shortest time. DCA2 is very fast and can therefore handle large-scale problems.

f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖_1

This is a nonsmooth nonconvex program for which there are rarely efficient solution algorithms, especially in the large-scale setting.

²L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim., 27 (2007), 503–608.
Subgradients and Fenchel Conjugates of Convex Functions
Definition
Let f : ℝⁿ → ℝ be a convex function. A subgradient of f at x̄ is any v ∈ ℝⁿ such that
⟨v, x − x̄⟩ ≤ f(x) − f(x̄) for all x ∈ ℝⁿ.
The subdifferential ∂f(x̄) of f at x̄ is the set of all subgradients of f at x̄.

Definition
Let f : ℝⁿ → ℝ be a function. The Fenchel conjugate of f is defined by
f*(x) = sup{⟨x, u⟩ − f(u) : u ∈ ℝⁿ}, x ∈ ℝⁿ.
DC Algorithm-The DCA
Consider the problem
minimize f(x) := g(x) − h(x), x ∈ ℝⁿ,
where g : ℝⁿ → ℝ and h : ℝⁿ → ℝ are convex.

The DCA³:
INPUT: x_1 ∈ ℝⁿ, N ∈ ℕ
for k = 1, …, N do
  Find y_k ∈ ∂h(x_k)
  Find x_{k+1} ∈ ∂g*(y_k)
end for
OUTPUT: x_{N+1}

³P.D. Tao, L.T.H. An: A d.c. optimization algorithm for solving the trust-region subproblem, SIAM J. Optim., 8 (1998), 476–505.
An Example of the DCA
minimize f(x) = x⁴ − 3x² − x, x ∈ ℝ.
Then f(x) = g(x) − h(x), where g(x) = x⁴ and h(x) = 3x² + x. We have
∂h(x) = {6x + 1},  ∂g*(y) = argmin_{x∈ℝ} (x⁴ − yx) = { ∛(y/4) },
so the DCA iteration is
x_{k+1} = ∛((6x_k + 1)/4).
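The resulting fixed-point iteration is easy to run (a sketch; the cube root is taken with sign handling so the step is also valid when y < 0):

```python
def dca_step(x):
    """One DCA step: y = 6x + 1 in dh(x), then x+ = (y/4)^(1/3) in dg*(y)."""
    y = 6.0 * x + 1.0
    r = abs(y / 4.0) ** (1.0 / 3.0)
    return r if y >= 0 else -r

def f(t):
    return t**4 - 3.0 * t**2 - t

x = 1.0
trajectory = [x]
for _ in range(100):
    x = dca_step(x)
    trajectory.append(x)
# any limit satisfies 4x^3 - 6x - 1 = 0, i.e. f'(x) = 0
```

The limit (≈ 1.30) is a stationary point of f. Starting from a different x_0 the DCA may converge to a different stationary point, since the underlying problem is nonconvex.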
DC Algorithm-The DCA
Definition
A function h : ℝⁿ → ℝ is called γ-convex (γ ≥ 0) if the function defined by k(x) := h(x) − (γ/2)‖x‖², x ∈ ℝⁿ, is convex. If there exists γ > 0 such that h is γ-convex, then h is called strongly convex with parameter γ.

Theorem
Consider the sequence (x_k) generated by the DCA. Suppose that g is γ₁-convex and h is γ₂-convex. Then
f(x_k) − f(x_{k+1}) ≥ ((γ₁ + γ₂)/2)‖x_{k+1} − x_k‖² for all k ∈ ℕ.
DC Algorithm-The DCA
Definition
We say that an element x̄ ∈ ℝⁿ is a stationary point of the function f = g − h if ∂g(x̄) ∩ ∂h(x̄) ≠ ∅. In the case where g and h are differentiable, x̄ is a stationary point of f if and only if ∇f(x̄) = ∇g(x̄) − ∇h(x̄) = 0.

Theorem
Consider the sequence (x_k) generated by the DCA. Then (f(x_k)) is a decreasing sequence. Suppose further that f is bounded from below and that g is γ₁-convex and h is γ₂-convex with γ₁ + γ₂ > 0. If (x_k) is bounded, then every subsequential limit of the sequence (x_k) is a stationary point of f.
DC Decomposition
f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖

We will utilize the fact that

f(x_1, …, x_k) = ∑_{i=1}^m [ ∑_{ℓ=1}^k ‖x_ℓ − a_i‖ − max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ‖x_ℓ − a_i‖ ]
= ∑_{i=1}^m ∑_{ℓ=1}^k ‖x_ℓ − a_i‖ − ∑_{i=1}^m max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ‖x_ℓ − a_i‖.
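This decomposition works because the maximal (k−1)-term partial sum is the total sum minus the smallest term, so the bracket reduces to min_ℓ ‖x_ℓ − a_i‖. A quick numeric check of the scalar identity min(c) = ∑c − max_r ∑_{ℓ≠r} c_ℓ (my own snippet):

```python
import random

rng = random.Random(1)
for _ in range(1000):
    c = [rng.uniform(-10.0, 10.0) for _ in range(6)]
    total = sum(c)
    excluded_max = max(total - c[r] for r in range(len(c)))  # best r-excluded sum
    assert abs(min(c) - (total - excluded_max)) < 1e-9
```

Both terms of the difference are convex (sums and maxima of norms), which is exactly what makes this a DC decomposition.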
DC Decomposition
We obtain the μ-smoothing approximation f_μ = g_μ − h_μ, where

g_μ(x_1, …, x_k) = (1/(2μ)) ∑_{i=1}^m ∑_{ℓ=1}^k ‖x_ℓ − a_i‖²,

h_μ(x_1, …, x_k) = (μ/2) ∑_{i=1}^m ∑_{ℓ=1}^k [d((x_ℓ − a_i)/μ; 𝔹)]²
  + ∑_{i=1}^m max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ( (1/(2μ))‖x_ℓ − a_i‖² − (μ/2)[d((x_ℓ − a_i)/μ; 𝔹)]² ).

To implement the DCA, we need ∂g*_μ and ∂h_μ…
∂g_μ
Using the Frobenius norm in a space of matrices, we express g_μ as
G_μ(X) = (m/(2μ))‖X‖² − (1/μ)⟨X, B⟩ + (k/(2μ))‖A‖²,
with the inner product ⟨A, B⟩ = ∑_{ℓ=1}^k ∑_{j=1}^n a_{ℓj} b_{ℓj}, where
X is the k × n matrix with rows x_1, …, x_k,
A is the m × n matrix with rows a_1, …, a_m, and
B is the k × n matrix whose every row is the sum a_1 + ⋯ + a_m.

∇G_μ(X) = (m/μ)X − (1/μ)B, and X ∈ ∂G*_μ(Y) iff Y ∈ ∂G_μ(X), so
∇G*_μ(Y) = (1/m)(B + μY).
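The matrix form of G_μ and the conjugate-gradient formula can be verified numerically (my own snippet; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n, mu = 5, 3, 2, 0.1
A = rng.standard_normal((m, n))        # rows a_1, ..., a_m
X = rng.standard_normal((k, n))        # rows x_1, ..., x_k
B = np.tile(A.sum(axis=0), (k, 1))     # every row is a_1 + ... + a_m

# elementwise definition of g_mu versus the matrix form G_mu
g_val = sum(np.linalg.norm(x - a) ** 2 for x in X for a in A) / (2 * mu)
G_val = (m * np.sum(X * X) - 2 * np.sum(X * B) + k * np.sum(A * A)) / (2 * mu)

Y = (m * X - B) / mu                   # grad G_mu(X)
X_back = (B + mu * Y) / m              # grad G*_mu(Y) recovers X
```

Substituting Y = ∇G_μ(X) into ∇G*_μ gives (B + (mX − B))/m = X, confirming that the two gradient maps are inverse to each other.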
∂h_μ
Write h_μ = H_1 + H_2, where
H_1(x_1, …, x_k) = (μ/2) ∑_{i=1}^m ∑_{ℓ=1}^k [d((x_ℓ − a_i)/μ; 𝔹)]²,
H_2(x_1, …, x_k) = ∑_{i=1}^m max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ( (1/(2μ))‖x_ℓ − a_i‖² − (μ/2)[d((x_ℓ − a_i)/μ; 𝔹)]² ).
∂h_μ
∇H_1(X) is the k × n matrix whose ℓth row is
∑_{i=1}^m [ (x_ℓ − a_i)/μ − P((x_ℓ − a_i)/μ; 𝔹) ], ℓ = 1, …, k.

For each i = 1, …, m, there is some index R_i such that the R_i-excluded sum is maximal. If we call this sum F_{R_i}, then ∇F_{R_i} is the k × n matrix whose ℓth row, for ℓ ≠ R_i, is P((x_ℓ − a_i)/μ; 𝔹), and whose R_i-th row is 0.

A subgradient V ∈ ∂H_2(X) is then given by
V = ∑_{i=1}^m ∇F_{R_i}.
Multifacility Location Algorithm
INPUT: X_1 ∈ dom g, N ∈ ℕ
for k = 1, …, N do
  Compute Y_k = ∇H_1(X_k) + V_k, where V_k ∈ ∂H_2(X_k)
  Compute X_{k+1} = (1/m)(B + μY_k)
end for
OUTPUT: X_{N+1}
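A compact sketch of the whole scheme (my own illustrative code, not the authors' implementation; the data, μ, and the initialization are made up). Each step computes Y_k = ∇H_1(X_k) + V_k with V_k ∈ ∂H_2(X_k), then X_{k+1} = (1/m)(B + μY_k). The excluded index R_i maximizes the r-excluded sum, i.e. it minimizes the per-center smoothed term t_ℓ:

```python
import numpy as np

def proj_rows_ball(Z):
    """Project each row of Z onto the closed unit ball."""
    norms = np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
    return Z / norms

def dca_multifacility(A, X0, mu=0.1, iters=300):
    """DCA for min_X sum_i min_l ||x_l - a_i|| via the smoothed split g_mu - h_mu."""
    A = np.asarray(A, dtype=float)
    X = np.asarray(X0, dtype=float).copy()
    m = A.shape[0]
    B = np.tile(A.sum(axis=0), (X.shape[0], 1))
    for _ in range(iters):
        Y = np.zeros_like(X)
        for a in A:
            Z = (X - a) / mu
            P = proj_rows_ball(Z)
            Y += Z - P                           # row l of grad H_1
            # t_l = (mu/2)(||Z_l||^2 - d(Z_l; B)^2); R_i = argmin t_l
            t = 0.5 * mu * (np.sum(Z**2, axis=1) - np.sum((Z - P)**2, axis=1))
            V = P.copy()
            V[int(np.argmin(t))] = 0.0           # R_i-th row of grad F_{R_i} is 0
            Y += V
        X = (B + mu * Y) / m                     # X_{k+1} = grad G*_mu(Y_k)
    return X

# two clusters of three points; one initial center seeded near each cluster
A = np.array([[0.0, 0.0], [0.2, 0.1], [-0.2, 0.2],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
X = dca_multifacility(A, X0=A[[0, 3]])
```

At a fixed point, each center satisfies the smoothed Fermat-Torricelli condition for the points it serves, so with this initialization the two rows of X settle near the two clusters.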
Clustering
Results
Convergence plots of f(X_k) versus iteration:
- Fermat Torricelli, ℝ¹⁰ (Algorithm 4)
- Eight Rings, ℝ² (Algorithm 5)
- Multifacility, ℝ² (Algorithm 5)
- Multifacility, ℝ¹⁰ (Algorithm 5)
- Multifacility Sets, ℝ² (Algorithm 7)
References
[1] An, Belghiti, Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim., 27 (2007), 503–608.
[2] Nam, Rector, Giles: Minimizing differences of convex functions and applications to facility location and clustering. arXiv:1511.07595 (2015).
[3] Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A, 103 (2005), 127–152.
[4] Rockafellar: Convex Analysis. Princeton University Press, Princeton, NJ, 1970.