Minimizing Differences of Convex Functions-The DCA
Algorithms for Minimizing Differences of Convex Functions and Applications to
Facility Location and Clustering
Mau Nam Nguyen
(joint work with D. Giles and R. B. Rector)
Fariborz Maseeh Department of Mathematics and Statistics, Portland State University
Mathematics and Statistics Seminar, WSU Vancouver, April 12, 2017
Convex Functions
Definition
Let I ⊂ ℝ be an interval and let f : I → ℝ be a function. The function f is said to be CONVEX on I if
f(λx + (1 − λ)u) ≤ λf(x) + (1 − λ)f(u)
for all x, u ∈ I and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.
Figure: Convex Function and Nonconvex Function.
Convex Functions
Example
Let I = ℝ = (−∞, ∞) and let f(x) = |x| for x ∈ ℝ. Then f is a convex function. Indeed, for any x, u ∈ ℝ and λ ∈ (0, 1),
f(λx + (1 − λ)u) = |λx + (1 − λ)u| ≤ |λx| + |(1 − λ)u| = λ|x| + (1 − λ)|u| = λf(x) + (1 − λ)f(u).
Figure: Convex Function and Nonconvex Function.
Characterizations for Convexity
Theorem
Let f : I → ℝ be a differentiable function, where I is an open interval in ℝ. Then f is convex on I if and only if f′ is monotone increasing on I.

Theorem
Let f : I → ℝ be a twice differentiable function, where I is an open interval in ℝ. Then f is convex on I if and only if f″(x) ≥ 0 for all x ∈ I.
Example
The following functions are convex:
(i) f(x) = ax² + bx + c, x ∈ ℝ, a > 0.
(ii) f(x) = e^x, x ∈ ℝ.
(iii) f(x) = −ln(x), x ∈ (0, ∞).
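These examples can be spot-checked numerically. The sketch below (my own illustrative code, not from the talk) samples random pairs and tests the defining inequality f(λx + (1 − λ)u) ≤ λf(x) + (1 − λ)f(u):

```python
import math
import random

def convex_on(f, lo, hi, trials=1000, tol=1e-9):
    """Randomized check of f(l*x + (1-l)*u) <= l*f(x) + (1-l)*f(u) on (lo, hi)."""
    rng = random.Random(0)
    for _ in range(trials):
        x, u = rng.uniform(lo, hi), rng.uniform(lo, hi)
        lam = rng.random()
        if f(lam * x + (1 - lam) * u) > lam * f(x) + (1 - lam) * f(u) + tol:
            return False
    return True

print(convex_on(lambda x: 2 * x**2 + x - 1, -10, 10))  # (i) with a = 2 > 0
print(convex_on(math.exp, -5, 5))                      # (ii)
print(convex_on(lambda x: -math.log(x), 1e-3, 10))     # (iii)
print(convex_on(lambda x: x**3, -5, 5))                # fails: x^3 is not convex on R
```

Such a check can only falsify convexity, of course; the theorems above give the actual proofs.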
Convex Sets
Definition
A subset Ω of ℝⁿ is CONVEX if [a, b] ⊂ Ω whenever a, b ∈ Ω. Equivalently, Ω is convex if λa + (1 − λ)b ∈ Ω for all a, b ∈ Ω and λ ∈ [0, 1].
Figure: Convex set and nonconvex set.
Convex Functions
Definition
Let f : Ω → ℝ be defined on a convex set Ω ⊂ ℝⁿ. The function f is said to be CONVEX on Ω if
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ Ω and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.
Figure: Convex Function and Nonconvex Function.
Convex Functions
Theorem
Let f_i : ℝⁿ → ℝ be convex functions for all i = 1, …, m. Then the following functions are convex as well:
(i) The multiple λf for any λ > 0.
(ii) The sum function ∑_{i=1}^m f_i.
(iii) The maximum function max_{1≤i≤m} f_i.
Let f : ℝⁿ → ℝ be convex and let φ : ℝ → ℝ be convex and nondecreasing. Then the composition φ∘f is convex.
Let B : ℝⁿ → ℝᵖ be an affine mapping and let f : ℝᵖ → ℝ be a convex function. Then the composition f∘B is convex.
Relationship between Convex Functions and Convex Sets
Theorem
A function f : ℝⁿ → ℝ is convex if and only if its epigraph epi f is a convex subset of the product space ℝⁿ × ℝ.
Figure: Epigraphs of convex function and nonconvex function.
Pierre de Fermat
Pierre de Fermat (1601-1665) was a French lawyer at the Parlement of Toulouse, France, and an amateur mathematician who is given credit for early developments that led to calculus.
Fermat-Torricelli Problem
In the early 17th century, at the end of his book Treatise on Minima and Maxima, the French mathematician Fermat (1601-1665) proposed the following problem: given three points in the plane, find a point such that the sum of its distances to the three given points is minimal.

In general: given a_1, a_2, …, a_m ∈ ℝⁿ, find x ∈ ℝⁿ which minimizes the function
f(x) = ∑_{i=1}^m ‖x − a_i‖.
Weiszfeld’s Algorithm
The gradient of the objective function f is
∇f(x) = ∑_{i=1}^m (x − a_i)/‖x − a_i‖,  x ∉ {a_1, a_2, …, a_m}.
Solving the equation ∇f(x) = 0 gives
x = (∑_{i=1}^m a_i/‖x − a_i‖) / (∑_{i=1}^m 1/‖x − a_i‖) =: F(x).
For continuity, define F(x) := x for x ∈ {a_1, a_2, …, a_m}.
Weiszfeld's algorithm: choose a starting point x_0 ∈ ℝⁿ and define
x_{k+1} = F(x_k) for k ∈ ℕ.
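The iteration x_{k+1} = F(x_k) can be sketched in a few lines (my own code, not from the talk), with the convention F(a_i) := a_i handled explicitly:

```python
import numpy as np

def weiszfeld(anchors, x0, iters=1000):
    """Iterate x_{k+1} = F(x_k), the weighted-average map of Weiszfeld's algorithm."""
    A = np.asarray(anchors, dtype=float)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(A - x, axis=1)
        if np.any(d < 1e-14):            # F(a_i) := a_i, so stop at an anchor
            break
        w = 1.0 / d
        x = (w[:, None] * A).sum(axis=0) / w.sum()
    return x

# the Fermat point of this triangle lies on the axis of symmetry x = 2
anchors = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
x_star = weiszfeld(anchors, x0=(1.0, 1.0))
```

For three non-collinear anchors the limit is the classical Fermat point, at which the three unit vectors toward the anchors sum to zero.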
Kuhn’s Convergence Theorem
Weiszfeld also claimed that if x_0 ∉ {a_1, a_2, …, a_m}, where the a_i for i = 1, …, m are not collinear, then (x_k) converges to the unique optimal solution of the problem. A counterexample, a corrected statement, and a proof of convergence were given by Kuhn in 1972.

Theorem
Let (x_k) be the sequence generated by Weiszfeld's algorithm. Suppose that x_k ∉ {a_1, a_2, …, a_m} for all k ≥ 0. Then (x_k) converges to the optimal solution of the problem.
C¹ Functions

A function f : ℝⁿ → ℝ is called a C¹ function if it has all partial derivatives ∂f/∂x_i for i = 1, …, n, and all these partial derivatives are continuous functions. The gradient of f at a point x is denoted by
∇f(x) = (∂f/∂x_1(x), …, ∂f/∂x_n(x)).

Example
Let f(x) = ‖x‖² = ∑_{i=1}^n x_i². Then f is a C¹ function with
∂f/∂x_i(x) = 2x_i for x = (x_1, …, x_n),
and ∇f(x) = (2x_1, …, 2x_n) = 2x.
C¹ Convex Functions

Theorem
Let f : ℝⁿ → ℝ be a C¹ function. Then f is convex if and only if
f(x̄) + ⟨∇f(x̄), x − x̄⟩ ≤ f(x) for all x, x̄ ∈ ℝⁿ.
Figure: C1 Convex Functions.
Lipschitz Continuous Functions and C^{1,1} Functions

Definition
A function g : ℝⁿ → ℝᵐ is said to be Lipschitz continuous if there exists a constant ℓ ≥ 0 such that
‖g(x) − g(u)‖ ≤ ℓ‖x − u‖ for all x, u ∈ ℝⁿ.

A C¹ function f : ℝⁿ → ℝ is called a C^{1,1} function if its gradient is Lipschitz continuous, i.e., there exists a constant ℓ > 0 such that
‖∇f(x) − ∇f(u)‖ ≤ ℓ‖x − u‖ for all x, u ∈ ℝⁿ.
C^{1,1} Functions

Theorem
Let f : ℝⁿ → ℝ be a C^{1,1} function whose gradient is Lipschitz continuous with Lipschitz constant ℓ. Then
f(x) ≤ f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (ℓ/2)‖x − x̄‖²
for all x, x̄ ∈ ℝⁿ.
Figure: C^{1,1} Functions.
The Gradient Method
Let f : ℝⁿ → ℝ be a C¹ function and let (α_k) be a sequence of positive real numbers called the sequence of step sizes. The gradient method generates a sequence (x_k) as follows:
Choose x_0 ∈ ℝⁿ.
Define x_{k+1} = x_k − α_k ∇f(x_k) for k ≥ 0.

Given a C¹ function f : ℝⁿ → ℝ and α > 0, define g : ℝⁿ → ℝⁿ by g(x) = x − α∇f(x). Then the gradient method with constant step size α is given by
Choose x_0 ∈ ℝⁿ.
Define x_{k+1} = x_k − α∇f(x_k) = g(x_k) for k ≥ 0.
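The constant-step scheme can be sketched as follows (a generic illustration, not code from the talk), applied to a quadratic whose gradient ∇f(x) = Qx − b is Lipschitz with ℓ = λ_max(Q):

```python
import numpy as np

def gradient_method(grad, x0, alpha, iters=2000):
    """x_{k+1} = x_k - alpha * grad(x_k), constant step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# f(x) = 0.5 <Qx, x> - <b, x>; the minimizer solves Qx = b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
ell = np.linalg.eigvalsh(Q).max()        # Lipschitz constant of the gradient
x_min = gradient_method(lambda x: Q @ x - b, np.zeros(2), alpha=1.0 / ell)
```

The step size 1/ℓ is exactly the choice analyzed in the convergence theorem below.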
The Gradient Method
Theorem
Let f : ℝⁿ → ℝ be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0. Consider the gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then (f(x_k)) is monotone decreasing and
0 ≤ f(x_k) − f(x*) ≤ ‖x_0 − x*‖²/(2kα) for all k ≥ 1.
Nesterov’s Accelerated Gradient Method
Let f : ℝⁿ → ℝ be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and let α = 1/ℓ.
Choose y_1 = x_0 ∈ ℝⁿ and t_0 = 0.
Define x_k = y_k − α∇f(y_k) = g(y_k) for k ≥ 1.
Update t_k = (1 + √(1 + 4t_{k−1}²))/2 for k ≥ 1.
Define y_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}) for k ≥ 1.
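The updates above can be sketched directly (my own generic code; with t_0 = 0 the first momentum coefficient is (t_1 − 1)/t_2 = 0, so the first step is a plain gradient step):

```python
import numpy as np

def nesterov_agm(grad, x0, alpha, iters=500):
    """Accelerated gradient: x_k = y_k - alpha*grad(y_k), with momentum on y."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    t = 0.0                                                  # t_0 = 0
    for _ in range(iters):
        x = y - alpha * grad(y)                              # x_k = g(y_k)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # t_k
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t_new**2)) / 2.0 # t_{k+1}
        y = x + ((t_new - 1.0) / t_next) * (x - x_prev)      # momentum step
        x_prev, t = x, t_new
    return x_prev

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
alpha = 1.0 / np.linalg.eigvalsh(Q).max()
x_min = nesterov_agm(lambda x: Q @ x - b, np.zeros(2), alpha)
```

On the same quadratic as before, the O(1/k²) bound of the next theorem already guarantees the iterates are close to the minimizer x* = (0.6, −0.8) after a few hundred steps.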
Nesterov's Accelerated Gradient Method

Theorem
Let f : ℝⁿ → ℝ be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0. Consider Nesterov's accelerated gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then
0 ≤ f(x_k) − f(x*) ≤ 2‖x_0 − x*‖²/(α(k + 1)²) for all k ≥ 1.
Distance Functions
Given a set Ω ⊂ ℝⁿ, the distance function associated with Ω is defined by
d(x; Ω) := inf{‖x − ω‖ : ω ∈ Ω}.
For each x ∈ ℝⁿ, the Euclidean projection from x to Ω is defined by
P(x; Ω) := {ω ∈ Ω : ‖x − ω‖ = d(x; Ω)}.
Figure: Distance function.
Squared Distance Functions
Theorem
Let Ω be a nonempty, closed, convex subset of ℝⁿ. Then:
For each x ∈ ℝⁿ, the Euclidean projection P(x; Ω) is a singleton.
The projection mapping is nonexpansive:
‖P(x_1; Ω) − P(x_2; Ω)‖ ≤ ‖x_1 − x_2‖ for all x_1, x_2 ∈ ℝⁿ.
In addition, the squared distance function φ(x) = [d(x; Ω)]² for x ∈ ℝⁿ is a C^{1,1} function with
∇φ(x) = 2[x − P(x; Ω)].
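For a concrete instance, take Ω to be the closed unit ball, where the projection has a closed form. The sketch below (my own code) evaluates both the projection and the gradient formula ∇φ(x) = 2[x − P(x; Ω)]:

```python
import numpy as np

def proj_unit_ball(x):
    """Euclidean projection onto the closed unit ball of R^n."""
    x = np.asarray(x, dtype=float)
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def sq_dist_grad(x):
    """Gradient of phi(x) = d(x; B)^2, namely 2 [x - P(x; B)]."""
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - proj_unit_ball(x))

x = np.array([3.0, 4.0])          # ||x|| = 5, so d(x; B) = 4
p = proj_unit_ball(x)             # (0.6, 0.8), on the unit sphere
g = sq_dist_grad(x)               # 2 * (2.4, 3.2) = (4.8, 6.4)
```

This closed-form projection is exactly what Nesterov's smoothing technique on the next slide requires.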
Nesterov’s Smoothing Technique
Theorem
Given any a ∈ ℝⁿ and μ > 0, a Nesterov smoothing approximation of φ(x) := ‖x − a‖ has the representation
f_μ(x) = ‖x − a‖²/(2μ) − (μ/2)[d((x − a)/μ; 𝔹)]²,
where 𝔹 is the closed unit ball of ℝⁿ. In addition,
∇f_μ(x) = P((x − a)/μ; 𝔹).
The gradient ∇f_μ is Lipschitz continuous with constant ℓ_μ = 1/μ.

Y. Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A, 103 (2005), 127–152.
Generalized Fermat-Torricelli Problems
Consider the following version of the Fermat-Torricelli problem:
minimize F(x) := ∑_{i=1}^m ‖x − a_i‖ subject to x ∈ ℝⁿ.
A smooth approximation of F is given by
F_μ(x) = ∑_{i=1}^m ( ‖x − a_i‖²/(2μ) − (μ/2)[d((x − a_i)/μ; 𝔹)]² ).
Generalized Fermat-Torricelli Problems
The function F_μ is continuously differentiable on ℝⁿ with its gradient given by
∇F_μ(x) = ∑_{i=1}^m P((x − a_i)/μ; 𝔹).
Its gradient is Lipschitz continuous with constant
L_μ = m/μ.
Moreover, one has the following estimate:
F_μ(x) ≤ F(x) < F_μ(x) + mμ/2 for all x ∈ ℝⁿ.
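Since F_μ is C^{1,1} with L_μ = m/μ, the gradient method with step size α = μ/m applies directly. A sketch (my own code; the anchors and the value of μ are illustrative). Note that for points farther than μ from every a_i the smoothed term equals ‖x − a_i‖ − μ/2, so for small μ the minimizer of F_μ essentially coincides with the Fermat-Torricelli point:

```python
import numpy as np

def proj_unit_ball(z):
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

def grad_F_mu(x, anchors, mu):
    """grad F_mu(x) = sum_i P((x - a_i)/mu; B)."""
    return sum(proj_unit_ball((x - a) / mu) for a in anchors)

def minimize_smoothed(anchors, mu=0.01, iters=20000):
    A = [np.asarray(a, dtype=float) for a in anchors]
    x = sum(A) / len(A)              # start from the centroid
    alpha = mu / len(A)              # step 1/L_mu with L_mu = m/mu
    for _ in range(iters):
        x = x - alpha * grad_F_mu(x, A, mu)
    return x

# Fermat point of this triangle: (2, 2/sqrt(3))
x_star = minimize_smoothed([(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)])
```

In practice one would pair F_μ with the accelerated method above, since the Lipschitz constant m/μ grows as the smoothing parameter shrinks.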
Generalized Fermat-Torricelli Problems
To demonstrate the method, let us consider a numerical example.

Example
The longitude/latitude coordinates in decimal format of US cities are recorded, for example, at http://www.realestate3d.com/gps/uslatlongdegmin.htm. Our goal is to find a point that minimizes the sum of distances to the given points representing the cities. If we consider the Euclidean norm, an approximate optimal value is V* = 23409.33, and a suboptimal solution is x* = (38.63, 97.35). If we consider the ℓ1-norm, the algorithm presented in this section allows us to find an approximate optimal value V* = 28724.68 and a suboptimal solution x* = (39.48, 97.22).
Generalized Fermat-Torricelli Problems
Example
The graphs below show the convergence of the algorithm (function value versus iteration k) for different norms: sum norm, max norm, and Euclidean norm.

Figure: A generalized Fermat-Torricelli problem with different norms.
Differences of Convex Functions
Consider the problem
minimize f(x) := g(x) − h(x), x ∈ ℝⁿ,
where g : Rn → R and h : Rn → R are convex.
We call g − h a DC decomposition of f .
g(x) = x⁴
h(x) = 3x² + x
f = g − h
Examples of DC Programming
Fermat-Torricelli
f(x) = ∑_{i=1}^m c_i ‖x − a_i‖

Clustering
f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖²

Multifacility Location
f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖
The DCA for Clustering
Problem formulation¹: Let a_i for i = 1, …, m be m data points in ℝⁿ.
Minimize f(x_1, …, x_k) := ∑_{i=1}^m min{‖x_ℓ − a_i‖² : ℓ = 1, …, k}
over x_ℓ ∈ ℝⁿ, ℓ = 1, …, k.

¹L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim., 27 (2007), 503–608.
K-Mean Clustering
Let x_1, x_2, …, x_m be the data points and let c_1, …, c_k denote the centers.
- Randomly select k cluster centers.
- Assign each data point to the nearest center.
- Find the average of the data points assigned to each center.
- Repeat the second step with the new centers obtained in the third step until the centroids no longer move.

Although k-means clustering is effective in many situations, it also has some disadvantages:
- The k-means algorithm does not necessarily find the optimal solution.
- The algorithm is sensitive to the initially selected cluster centers.
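The steps above can be sketched as follows (my own code; the two-blob data set and the deterministic initialization are illustrative, whereas the algorithm as stated picks the initial centers at random):

```python
import random

def kmeans(points, k, init=None, iters=100):
    """Plain k-means; init defaults to k randomly chosen data points."""
    centers = [list(p) for p in (init if init is not None else random.sample(points, k))]
    for _ in range(iters):
        # assign each point to the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((pi - ci) ** 2 for pi, ci in zip(p, centers[j])))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster
        new_centers = [[sum(c) / len(cl) for c in zip(*cl)] if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:     # centroids no longer move
            break
        centers = new_centers
    return centers

# two well-separated blobs; seeding one center in each blob recovers their means
data = [(0.0, 0.0), (0.2, 0.1), (-0.2, 0.2), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers = kmeans(data, 2, init=[data[0], data[3]])
```

Seeding both centers inside the same blob can leave the algorithm stuck in a poor split, which is exactly the initialization sensitivity noted above.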
DCA for Clustering and K-Means²

- Both DCA1 and DCA2 are better than k-means: the objective values given by DCA1 and DCA2 are much smaller than those computed by k-means.
- DCA2 is the best among the three algorithms: it provides the best solution in the shortest time. DCA2 is very fast and can therefore handle large-scale problems.

f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖_1

This is a nonsmooth nonconvex program for which there are rarely efficient solution algorithms, especially in the large-scale setting.

²L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim., 27 (2007), 503–608.
Subgradients and Fenchel Conjugates of Convex Functions
Definition
Let f : ℝⁿ → ℝ be a convex function. A subgradient of f at x̄ is any v ∈ ℝⁿ such that
⟨v, x − x̄⟩ ≤ f(x) − f(x̄) for all x ∈ ℝⁿ.
The subdifferential ∂f(x̄) of f at x̄ is the set of all subgradients of f at x̄.

Definition
Let f : ℝⁿ → ℝ be a function. The Fenchel conjugate of f is defined by
f*(x) = sup{⟨x, u⟩ − f(u) : u ∈ ℝⁿ}, x ∈ ℝⁿ.
DC Algorithm-The DCA
Consider the problem
minimize f(x) := g(x) − h(x), x ∈ ℝⁿ,
where g : ℝⁿ → ℝ and h : ℝⁿ → ℝ are convex.

The DCA³:
INPUT: x_1 ∈ ℝⁿ, N ∈ ℕ
for k = 1, …, N do
  Find y_k ∈ ∂h(x_k)
  Find x_{k+1} ∈ ∂g*(y_k)
end for
OUTPUT: x_{N+1}

³P.D. Tao, L.T.H. An: A d.c. optimization algorithm for solving the trust-region subproblem, SIAM J. Optim., 8 (1998), 476–505.
An Example of the DCA
minimize f(x) = x⁴ − 3x² − x, x ∈ ℝ.
Then f(x) = g(x) − h(x), where g(x) = x⁴ and h(x) = 3x² + x. We have
∂h(x) = {6x + 1},  ∂g*(y) = argmin_{x∈ℝ} (x⁴ − yx) = { ∛(y/4) },
so the DCA iteration is
x_{k+1} = ∛((6x_k + 1)/4).
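The resulting fixed-point iteration is easy to run (a sketch; the cube root is taken with sign handling so the step is also valid when y < 0):

```python
def dca_step(x):
    """One DCA step: y = 6x + 1 in dh(x), then x+ = (y/4)^(1/3) in dg*(y)."""
    y = 6.0 * x + 1.0
    r = abs(y / 4.0) ** (1.0 / 3.0)
    return r if y >= 0 else -r

def f(t):
    return t**4 - 3.0 * t**2 - t

x = 1.0
trajectory = [x]
for _ in range(100):
    x = dca_step(x)
    trajectory.append(x)
# any limit satisfies 4x^3 - 6x - 1 = 0, i.e. f'(x) = 0
```

The limit (≈ 1.30) is a stationary point of f. Starting from a different x_0 the DCA may converge to a different stationary point, since the underlying problem is nonconvex.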
DC Algorithm-The DCA
Definition
A function h : ℝⁿ → ℝ is called γ-convex (γ ≥ 0) if the function defined by k(x) := h(x) − (γ/2)‖x‖², x ∈ ℝⁿ, is convex. If there exists γ > 0 such that h is γ-convex, then h is called strongly convex with parameter γ.

Theorem
Consider the sequence (x_k) generated by the DCA. Suppose that g is γ₁-convex and h is γ₂-convex. Then
f(x_k) − f(x_{k+1}) ≥ ((γ₁ + γ₂)/2)‖x_{k+1} − x_k‖² for all k ∈ ℕ.
DC Algorithm-The DCA
Definition
We say that an element x̄ ∈ ℝⁿ is a stationary point of the function f = g − h if ∂g(x̄) ∩ ∂h(x̄) ≠ ∅. In the case where g and h are differentiable, x̄ is a stationary point of f if and only if ∇f(x̄) = ∇g(x̄) − ∇h(x̄) = 0.

Theorem
Consider the sequence (x_k) generated by the DCA. Then (f(x_k)) is a decreasing sequence. Suppose further that f is bounded from below and that g is γ₁-convex and h is γ₂-convex with γ₁ + γ₂ > 0. If (x_k) is bounded, then every subsequential limit of the sequence (x_k) is a stationary point of f.
DC Decomposition
f(x_1, …, x_k) = ∑_{i=1}^m min_{ℓ=1,…,k} ‖x_ℓ − a_i‖

We will utilize the fact that

f(x_1, …, x_k) = ∑_{i=1}^m [ ∑_{ℓ=1}^k ‖x_ℓ − a_i‖ − max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ‖x_ℓ − a_i‖ ]
= ∑_{i=1}^m ∑_{ℓ=1}^k ‖x_ℓ − a_i‖ − ∑_{i=1}^m max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ‖x_ℓ − a_i‖.
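This decomposition works because the maximal (k−1)-term partial sum is the total sum minus the smallest term, so the bracket reduces to min_ℓ ‖x_ℓ − a_i‖. A quick numeric check of the scalar identity min(c) = ∑c − max_r ∑_{ℓ≠r} c_ℓ (my own snippet):

```python
import random

rng = random.Random(1)
for _ in range(1000):
    c = [rng.uniform(-10.0, 10.0) for _ in range(6)]
    total = sum(c)
    excluded_max = max(total - c[r] for r in range(len(c)))  # best r-excluded sum
    assert abs(min(c) - (total - excluded_max)) < 1e-9
```

Both terms of the difference are convex (sums and maxima of norms), which is exactly what makes this a DC decomposition.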
DC Decomposition
We obtain the μ-smoothing approximation f_μ = g_μ − h_μ, where

g_μ(x_1, …, x_k) = (1/(2μ)) ∑_{i=1}^m ∑_{ℓ=1}^k ‖x_ℓ − a_i‖²,

h_μ(x_1, …, x_k) = (μ/2) ∑_{i=1}^m ∑_{ℓ=1}^k [d((x_ℓ − a_i)/μ; 𝔹)]²
  + ∑_{i=1}^m max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ( (1/(2μ))‖x_ℓ − a_i‖² − (μ/2)[d((x_ℓ − a_i)/μ; 𝔹)]² ).

To implement the DCA, we need ∂g*_μ and ∂h_μ…
∂g_μ
Using the Frobenius norm in a space of matrices, we express g_μ as
G_μ(X) = (m/(2μ))‖X‖² − (1/μ)⟨X, B⟩ + (k/(2μ))‖A‖²,
with the inner product ⟨A, B⟩ = ∑_{ℓ=1}^k ∑_{j=1}^n a_{ℓj} b_{ℓj}, where
X is the k × n matrix with rows x_1, …, x_k,
A is the m × n matrix with rows a_1, …, a_m, and
B is the k × n matrix whose every row is the sum a_1 + ⋯ + a_m.

∇G_μ(X) = (m/μ)X − (1/μ)B, and X ∈ ∂G*_μ(Y) iff Y ∈ ∂G_μ(X), so
∇G*_μ(Y) = (1/m)(B + μY).
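The matrix form of G_μ and the conjugate-gradient formula can be verified numerically (my own snippet; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n, mu = 5, 3, 2, 0.1
A = rng.standard_normal((m, n))        # rows a_1, ..., a_m
X = rng.standard_normal((k, n))        # rows x_1, ..., x_k
B = np.tile(A.sum(axis=0), (k, 1))     # every row is a_1 + ... + a_m

# elementwise definition of g_mu versus the matrix form G_mu
g_val = sum(np.linalg.norm(x - a) ** 2 for x in X for a in A) / (2 * mu)
G_val = (m * np.sum(X * X) - 2 * np.sum(X * B) + k * np.sum(A * A)) / (2 * mu)

Y = (m * X - B) / mu                   # grad G_mu(X)
X_back = (B + mu * Y) / m              # grad G*_mu(Y) recovers X
```

Substituting Y = ∇G_μ(X) into ∇G*_μ gives (B + (mX − B))/m = X, confirming that the two gradient maps are inverse to each other.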
∂h_μ
Write h_μ = H_1 + H_2, where
H_1(x_1, …, x_k) = (μ/2) ∑_{i=1}^m ∑_{ℓ=1}^k [d((x_ℓ − a_i)/μ; 𝔹)]²,
H_2(x_1, …, x_k) = ∑_{i=1}^m max_{r=1,…,k} ∑_{ℓ=1, ℓ≠r}^k ( (1/(2μ))‖x_ℓ − a_i‖² − (μ/2)[d((x_ℓ − a_i)/μ; 𝔹)]² ).
∂h_μ
∇H_1(X) is the k × n matrix whose ℓth row is
∑_{i=1}^m [ (x_ℓ − a_i)/μ − P((x_ℓ − a_i)/μ; 𝔹) ], ℓ = 1, …, k.

For each i = 1, …, m, there is some index R_i such that the R_i-excluded sum is maximal. If we call this sum F_{R_i}, then ∇F_{R_i} is the k × n matrix whose ℓth row, for ℓ ≠ R_i, is P((x_ℓ − a_i)/μ; 𝔹), and whose R_i-th row is 0.

A subgradient V ∈ ∂H_2(X) is then given by
V = ∑_{i=1}^m ∇F_{R_i}.
Multifacility Location Algorithm
INPUT: X_1 ∈ dom g, N ∈ ℕ
for k = 1, …, N do
  Compute Y_k = ∇H_1(X_k) + V_k, where V_k ∈ ∂H_2(X_k)
  Compute X_{k+1} = (1/m)(B + μY_k)
end for
OUTPUT: X_{N+1}
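A compact sketch of the whole scheme (my own illustrative code, not the authors' implementation; the data, μ, and the initialization are made up). Each step computes Y_k = ∇H_1(X_k) + V_k with V_k ∈ ∂H_2(X_k), then X_{k+1} = (1/m)(B + μY_k). The excluded index R_i maximizes the r-excluded sum, i.e. it minimizes the per-center smoothed term t_ℓ:

```python
import numpy as np

def proj_rows_ball(Z):
    """Project each row of Z onto the closed unit ball."""
    norms = np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
    return Z / norms

def dca_multifacility(A, X0, mu=0.1, iters=300):
    """DCA for min_X sum_i min_l ||x_l - a_i|| via the smoothed split g_mu - h_mu."""
    A = np.asarray(A, dtype=float)
    X = np.asarray(X0, dtype=float).copy()
    m = A.shape[0]
    B = np.tile(A.sum(axis=0), (X.shape[0], 1))
    for _ in range(iters):
        Y = np.zeros_like(X)
        for a in A:
            Z = (X - a) / mu
            P = proj_rows_ball(Z)
            Y += Z - P                           # row l of grad H_1
            # t_l = (mu/2)(||Z_l||^2 - d(Z_l; B)^2); R_i = argmin t_l
            t = 0.5 * mu * (np.sum(Z**2, axis=1) - np.sum((Z - P)**2, axis=1))
            V = P.copy()
            V[int(np.argmin(t))] = 0.0           # R_i-th row of grad F_{R_i} is 0
            Y += V
        X = (B + mu * Y) / m                     # X_{k+1} = grad G*_mu(Y_k)
    return X

# two clusters of three points; one initial center seeded near each cluster
A = np.array([[0.0, 0.0], [0.2, 0.1], [-0.2, 0.2],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
X = dca_multifacility(A, X0=A[[0, 3]])
```

At a fixed point, each center satisfies the smoothed Fermat-Torricelli condition for the points it serves, so with this initialization the two rows of X settle near the two clusters.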
Clustering
Results
Convergence plots of f(X_k) versus iteration:
- Fermat Torricelli, ℝ¹⁰ (Algorithm 4)
- Eight Rings, ℝ² (Algorithm 5)
- Multifacility, ℝ² (Algorithm 5)
- Multifacility, ℝ¹⁰ (Algorithm 5)
- Multifacility Sets, ℝ² (Algorithm 7)
References
[1] An, Belghiti, Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim., 27 (2007), 503–608.
[2] Nam, Rector, Giles: Minimizing differences of convex functions and applications to facility location and clustering. arXiv:1511.07595 (2015).
[3] Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A, 103 (2005), 127–152.
[4] Rockafellar: Convex Analysis. Princeton University Press, Princeton, NJ, 1970.