



The Global Optimization Geometry of Low-Rank Matrix Optimization

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B. Wakin

Abstract—This paper considers general rank-constrained optimization problems that minimize a general objective function f(X) over the set of rectangular n × m matrices that have rank at most r. To tackle the rank constraint and also to reduce the computational burden, we factorize X into UV^T, where U and V are n × r and m × r matrices, respectively, and then optimize over the small matrices U and V. We characterize the global optimization geometry of the nonconvex factored problem and show that the corresponding objective function satisfies the robust strict saddle property as long as the original objective function f satisfies restricted strong convexity and smoothness properties, ensuring global convergence of many local search algorithms (such as noisy gradient descent) in polynomial time for solving the factored problem. We also provide a comprehensive analysis for the optimization geometry of a matrix factorization problem where we aim to find n × r and m × r matrices U and V such that UV^T approximates a given matrix X*. Aside from the robust strict saddle property, we show that the objective function of the matrix factorization problem has no spurious local minima and obeys the strict saddle property not only for the exact-parameterization case where rank(X*) = r, but also for the over-parameterization case where rank(X*) < r and the under-parameterization case where rank(X*) > r. These geometric properties imply that a number of iterative optimization algorithms (such as gradient descent) converge to a global solution with random initialization.

Index Terms—Low-rank optimization, matrix sensing, matrix factorization, nonconvex optimization, optimization geometry

I. INTRODUCTION

Low-rank matrices arise in a wide variety of applications throughout science and engineering, ranging from quantum tomography [1], signal processing [2], machine learning [3], [4], and so on; see [5] for a comprehensive review. In all of these settings, we often encounter the following rank-constrained optimization problem:

minimize_{X ∈ R^{n×m}} f(X), subject to rank(X) ≤ r,   (1)

where the objective function f : R^{n×m} → R is smooth. Whether the objective function f is convex or nonconvex, the rank constraint renders low-rank matrix optimizations of the form (1) highly nonconvex and computationally NP-hard in general [6]. Significant efforts have been devoted to transforming (1) into a convex problem by replacing the rank constraint with one involving the nuclear norm. This strategy has been widely utilized in matrix inverse problems [7] arising in signal processing [5], machine learning [8], and control [6]. With convex analysis techniques, nuclear norm minimization has been proved to provide optimal performance in recovering low-rank matrices [9]. However, in spite of the optimal performance, solving nuclear norm minimization is very computationally expensive even with specialized first-order algorithms. For example, the singular value thresholding algorithm [10] requires performing an expensive singular value decomposition (SVD) in each iteration, making it computationally prohibitive in large-scale settings. This prevents nuclear norm minimization from scaling to practical problems.

This work was supported by NSF grant CCF-1409261, NSF grant CCF-1464205, and NSF CAREER grant CCF-1149225. The authors are with the Department of Electrical Engineering, Colorado School of Mines. Email: {zzhu, qiuli, gtang, mwakin}@mines.edu.

To relieve the computational bottleneck, recent studies propose to factorize the variable into X = UV^T and optimize over the n × r and m × r matrices U and V rather than the n × m matrix X. The rank constraint in (1) is then automatically satisfied through the factorization. This strategy is usually referred to as the Burer-Monteiro type decomposition after the authors of [11], [12]. Plugging this parameterization of X into (1), we can recast the program into the following one:

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} h(U, V) := f(UV^T).   (2)

The bilinear nature of the parameterization renders the objective function of (2) nonconvex. Hence, it can potentially have spurious local minima (i.e., local minimizers that are not global minimizers) or even saddle points. With technical innovations in analyzing the landscape of nonconvex functions, however, several recent works have shown that the factored objective function h(U, V) in certain matrix inverse problems has no spurious local minima [13]–[15].
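To make the factored formulation (2) concrete, the following minimal Python/NumPy sketch (our own illustration, not code from the paper) evaluates h(U, V) = f(UV^T) and its gradient via the chain rule, ∇_U h = ∇f(UV^T)V and ∇_V h = ∇f(UV^T)^T U, for a user-supplied smooth f. The function name and the quadratic placeholder f are ours and purely for illustration.

import numpy as np

def factored_objective_and_grad(U, V, f, grad_f):
    # h(U, V) = f(U V^T); chain rule gives grad_U h = grad_f(U V^T) V
    # and grad_V h = grad_f(U V^T)^T U.
    X = U @ V.T
    G = grad_f(X)
    return f(X), G @ V, G.T @ U

# Placeholder f: quadratic loss around a fixed matrix X_star (illustration only).
n, m, r = 6, 5, 2
rng = np.random.default_rng(0)
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
f = lambda X: 0.5 * np.linalg.norm(X - X_star, 'fro') ** 2
grad_f = lambda X: X - X_star

U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
h_val, gU, gV = factored_objective_and_grad(U, V, f, grad_f)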

A. Summary of results and outline

In this paper, we provide a comprehensive geometric analysis for solving general low-rank optimizations of the form (1) using the factorization approach (2). Our work rests on the recent works [16]–[20] ensuring that a number of iterative optimization methods (such as gradient descent) converge to a local minimum with random initialization provided the problem satisfies the so-called strict saddle property (see Definition 3 in Section II). If the objective function further obeys the robust strict saddle property [16] (see Definition 4 in Section II) or belongs to the class of so-called X functions [17], the recent works [16], [17] show that many local search algorithms can converge to a local minimum in polynomial time. The implications of this line of work have had a tremendous impact on a number of nonconvex problems in applied mathematics, signal processing, and machine learning.

arXiv:1703.01256v2 [cs.IT] 4 Jan 2018


We begin this paper in Section II with the notions of strict saddle, strict saddle property, and robust strict saddle property. Considering that many invariant functions are not strongly convex (or even convex) in any neighborhood around a local minimum point, we then provide a revised robust strict saddle property^1 requiring a regularity condition (see Definition 8 in Section II) rather than strong convexity near the local minimum points (which is one of the requirements of the robust strict saddle property). The stochastic gradient descent algorithm is guaranteed to converge to a local minimum point in polynomial time for problems satisfying the revised robust strict saddle property [16], [20].

In Section III, we consider the geometric analysis for solving general low-rank optimizations of the form (1) using the factorization approach (2). Provided the objective function f satisfies certain restricted strong convexity and smoothness conditions, we show that the low-rank optimization problem with the factorization (2) (with an additional regularizer—see Section III for the details) obeys the revised robust strict saddle property. In Section III-C, we consider a stylized application in matrix sensing where the measurement operator satisfies the restricted isometry property (RIP) [7]. In the case of Gaussian measurements, as guaranteed by this robust strict saddle property, a number of iterative optimization algorithms can find the unknown matrix X* of rank r in polynomial time with high probability when the number of measurements exceeds a constant times (n + m)r².

Our main approach for analyzing the optimization geometry of (2) is based on the geometric analysis of the following nonsquare low-rank matrix factorization problem: given X* ∈ R^{n×m},

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} ‖UV^T − X*‖_F².   (3)

In particular, we show that the optimization geometry of the low-rank matrix factorization problem (3) is preserved for the general low-rank optimization (2) under certain restricted strong convexity and smoothness conditions on f. Thus, in Appendix A, we provide a comprehensive geometric analysis for (3), which can be viewed as an important foundation of many popular matrix factorization problems such as the matrix sensing problem and matrix completion. We show that the low-rank matrix factorization problem (3) (with an additional regularizer) has no spurious local minima and obeys the strict saddle property—that is, the objective function in (3) has a directional negative curvature at all critical points other than local minima—not only for the exact-parameterization case where rank(X*) = r, but also for the over-parameterization case where rank(X*) < r and the under-parameterization case where rank(X*) > r. The strict saddle property and lack of spurious local minima ensure that a number of local search algorithms applied to the matrix factorization problem (3) converge to global optima, which correspond to the best rank-r approximation to X*. Further, we completely analyze the low-rank matrix factorization problem (3) for the exact-parameterization case and show that it obeys the revised robust strict saddle property.

^1 A similar notion of a revised robust strict saddle property has also been utilized in [20], which shows that noisy gradient descent converges to a local minimum in a number of iterations that depends only poly-logarithmically on the dimension. In a nutshell, [20] has a different focus than this work: the focus in [20] is on providing convergence analysis of a noisy gradient descent algorithm with a robust strict saddle property, while in the present paper we establish a robust strict saddle property for nonsymmetric matrix factorization and more general low-rank optimization (including matrix sensing) problems with the factorization approach.

B. Relation to existing work

Unlike the objective functions of convex optimizations, which have simple landscapes where, e.g., all local minimizers are global ones, the objective functions of general nonconvex programs have much more complicated landscapes. In recent years, by exploiting the underlying optimization geometry, a surge of progress has been made in providing theoretical justifications for matrix factorization problems such as (2) using a number of previously heuristic algorithms (such as alternating minimization, gradient descent, and the trust region method). Typical examples include phase retrieval [21]–[23], blind deconvolution [24], [25], dictionary learning [26], [27], phase synchronization [28] and matrix sensing and completion [14], [29]–[34].

These iterative algorithms can be sorted into two categories based on whether a good initialization is required. One set of algorithms consists of two steps: initialization and local refinement. Provided the function satisfies a regularity condition or similar properties, a good guess lying in the attraction basin of the global optimum can lead to global convergence of the subsequent iterative refinement. We can obtain such initializations by spectral methods for phase retrieval [22], phase synchronization [28] and low-rank matrix recovery problems [29], [30], [35], [36]. As we have mentioned, a regularity condition is also adopted in the revised robust strict saddle property.

Another category of works attempts to analyze the landscape of the objective functions in a larger space rather than only the regions near the global optima. We can further separate these approaches into two types based on whether they involve the strict saddle property or the robust strict saddle property. The strict saddle property and lack of spurious local minima are proved for low-rank, positive semidefinite (PSD) matrix recovery [13] and completion [14], PSD matrix optimization problems with generic objective functions [37], low-rank nonsquare matrix estimation from linear observations [15], low-rank nonsquare optimization problems with generic objective functions [38] and generic nuclear norm regularized problems [39]. The strict saddle property along with the lack of spurious local minima ensures that a number of iterative algorithms, such as gradient descent [16] and the trust region method [40], converge to the global minimum with random initialization [16], [18], [27].

A few other works which are closely related to our work attempt to study the global geometry by characterizing the landscapes of the objective functions in the whole space rather than the regions near the global optima or all the critical points. As we discussed before, a number of local search algorithms are guaranteed to find a local optimum (which is also the global optimum if there are no spurious local minima) because of this robust strict saddle property. In [16], the authors proved that tensor decomposition problems satisfy this robust strict saddle property. Sun et al. [21] studied the global geometry of the phase retrieval problem. The very recent work in [41] analyzed the global geometry for PSD low-rank matrix factorization of the form (3) and the related matrix sensing problem when the rank is exactly parameterized (i.e., r = rank(X*)). The factorization approach for matrix inverse problems with quadratic loss functions is considered in [34]. We extend this line of work by considering general rank-constrained optimization problems including a set of matrix inverse problems.

Finally, we remark that our work is also closely related to the recent works on low-rank matrix factorization of the form (3) and its variants [13]–[15], [29]–[31], [34], [38], [41]. As we discussed before, most of these works except [34], [41] (but including [15], which also focuses on nonsymmetric matrix sensing) only characterize the geometry either near the global optima or at all the critical points. Instead, we characterize the global geometry for general (rather than PSD) low-rank matrix factorization and sensing. Because the analysis is different, the proof strategy in the present paper is also very different from that of [15], [38]. The results for PSD matrix sensing in [41] build heavily on the concentration properties of Gaussian measurements, while our results for matrix sensing depend on the RIP of the measurement operator and thus can be applied to other matrix sensing problems whose measurement operator is not necessarily drawn from a Gaussian measurement ensemble. Also, [34] considers matrix inverse problems with quadratic loss functions and its proof strategy is very different from that in the present paper: the proof in [34] is specialized to quadratic loss functions, while we consider the rank-constrained optimization problem with general objective functions in (1), and our proof utilizes the fact that the gradient and Hessian of low-rank matrix sensing are respectively very close to those of low-rank matrix factorization. Furthermore, in terms of matrix factorization, we show that the objective function in (3) obeys the strict saddle property and has no spurious local minima not only for exact-parameterization (r = rank(X*)), but also for over-parameterization (r > rank(X*)) and under-parameterization (r < rank(X*)). Local (rather than global) geometry results for exact-parameterization and under-parameterization are also covered in [38]. As noted above, the work in [34], [41] for low-rank matrix factorization only focuses on exact-parameterization (r = rank(X*)). The under-parameterization result implies that we can find the best rank-r approximation to X* by many efficient iterative optimization algorithms such as gradient descent.

C. Notation

Before proceeding, we first briefly introduce some notation used throughout the paper. The symbols I and 0 respectively represent the identity and zero matrices with appropriate sizes. Also, I_n is used to denote the n × n identity matrix. The set of r × r orthonormal matrices is denoted by O_r := {R ∈ R^{r×r} : R^TR = I}. For any natural number n, we let [n] or 1 : n denote the set {1, 2, ..., n}. We use |Ω| to denote the cardinality (i.e., the number of elements) of a set Ω. MATLAB notations are adopted for matrix indexing; that is, for the n × m matrix A, its (i, j)-th element is denoted by A[i, j], its i-th row (or column) is denoted by A[i, :] (or A[:, i]), and A[Ω1, Ω2] refers to the |Ω1| × |Ω2| submatrix obtained by taking the elements in rows Ω1 of columns Ω2 of the matrix A. Here Ω1 ⊂ [n] and Ω2 ⊂ [m]. We use a ≳ b (or a ≲ b) to represent that there is a constant such that a ≥ Const · b (or a ≤ Const · b).

If a function h(U, V) has two arguments, U ∈ R^{n×r} and V ∈ R^{m×r}, we occasionally use the notation h(W) when we stack these two arguments into a new one as W = [U; V], i.e., the (n + m) × r matrix obtained by stacking U on top of V. For a scalar function f(Z) with a matrix variable Z ∈ R^{n×m}, its gradient is an n × m matrix whose (i, j)-th entry is [∇f(Z)][i, j] = ∂f(Z)/∂Z[i, j] for all i ∈ {1, 2, ..., n}, j ∈ {1, 2, ..., m}. The Hessian of f(Z) can be viewed as an nm × nm matrix with [∇²f(Z)][i, j] = ∂²f(Z)/(∂z[i]∂z[j]) for all i, j ∈ {1, ..., nm}, where z[i] is the i-th entry of the vectorization of Z. An alternative way to represent the Hessian is as a bilinear form defined via [∇²f(Z)](A, B) = Σ_{i,j,k,ℓ} (∂²f(Z)/(∂Z[i, j]∂Z[k, ℓ])) A[i, j] B[k, ℓ] for any A, B ∈ R^{n×m}. These two notations will be used interchangeably whenever the specific form can be inferred from context.

II. PRELIMINARIES

In this section, we provide a number of important definitions in optimization and group theory. To begin, suppose h(x) : R^n → R is twice differentiable.

Definition 1 (Critical points). A point x is a critical point of h(x) if ∇h(x) = 0.

Definition 2 (Strict saddles; or ridable saddles in [27]). A critical point x is a strict saddle if the Hessian matrix evaluated at this point has a strictly negative eigenvalue, i.e., λ_min(∇²h(x)) < 0.

Definition 3 (Strict saddle property [16]). A twice differentiable function satisfies the strict saddle property if each critical point either corresponds to a local minimum or is a strict saddle.

Intuitively, the strict saddle property requires a function to have a directional negative curvature at all of the critical points other than local minima. This property allows a number of iterative algorithms, such as noisy gradient descent [16] and the trust region method [40], to further decrease the function value at all the strict saddles and thus converge to a local minimum.
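As a toy illustration of Definitions 1–3 (ours, not an example from the paper), h(x, y) = x² − y² has a critical point at the origin whose Hessian has a strictly negative eigenvalue, so the origin is a strict saddle; the short sketch below checks this numerically.

import numpy as np

# h(x, y) = x^2 - y^2: gradient vanishes at the origin, Hessian = diag(2, -2).
grad = lambda z: np.array([2 * z[0], -2 * z[1]])
hessian = np.array([[2.0, 0.0], [0.0, -2.0]])

origin = np.zeros(2)
assert np.allclose(grad(origin), 0)             # the origin is a critical point
assert np.linalg.eigvalsh(hessian).min() < 0    # strictly negative eigenvalue: strict saddle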

In [16], the authors proposed a noisy gradient descent algorithm for the optimization of functions satisfying the robust strict saddle property.

Definition 4 (Robust strict saddle property [16]). Given α, γ, ε, δ, a twice differentiable h(x) satisfies the (α, γ, ε, δ)-robust strict saddle property if for every point x at least one of the following applies:
1) There exists a local minimum point x* such that ‖x* − x‖ ≤ δ, and the function h(x′) restricted to a 2δ neighborhood of x* (i.e., ‖x* − x′‖ ≤ 2δ) is α-strongly convex;
2) λ_min(∇²h(x)) ≤ −γ;
3) ‖∇h(x)‖ ≥ ε.

In words, the above robust strict saddle property says that for any point whose gradient is small, either the Hessian matrix evaluated at this point has a strictly negative eigenvalue, or the point is close to a local minimum point. Thus the robust strict saddle property not only requires that the function obeys the strict saddle property, but also that it is well-behaved (i.e., strongly convex) near the local minima and has a large gradient at points far away from the critical points.

Intuitively, when the gradient is large, the function value will decrease in one step of gradient descent; when the point is close to a saddle point, the noise introduced in noisy gradient descent can help the algorithm escape the saddle point and the function value will also decrease; when the point is close to a local minimum point, the algorithm then converges to a local minimum. Ge et al. [16] rigorously showed that the noisy gradient descent algorithm (see [16, Algorithm 1]) outputs a local minimum in a polynomial number of steps if the function h(x) satisfies the robust strict saddle property.
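The following sketch (our simplified illustration, not [16, Algorithm 1] itself) conveys the idea behind noisy gradient descent: a standard gradient step plus an isotropic perturbation, whose noise is what lets the iterates leave strict saddles. All names and parameter choices here are ours.

import numpy as np

def noisy_gradient_descent(grad, x0, step=1e-2, noise=1e-2, iters=1000, seed=0):
    # Gradient step plus a small random perturbation at every iteration.
    # The perturbation allows the iterates to escape strict saddle points.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x) + noise * rng.standard_normal(x.shape)
    return x

# Example: h(x, y) = x^2 - y^2 started exactly at the strict saddle (0, 0).
grad = lambda z: np.array([2 * z[0], -2 * z[1]])
x_final = noisy_gradient_descent(grad, [0.0, 0.0])   # |y| grows: the saddle is escaped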

It is proved in [16] that tensor decomposition problems satisfy this robust strict saddle property. However, requiring local strong convexity prohibits the potential extension of the analysis in [16] for the noisy gradient descent algorithm to many other problems, for which the objective cannot be strongly convex in any neighborhood around the local minimum points. Typical examples include matrix factorization problems, due to the rotational degrees of freedom at any critical point. This motivates us to weaken the local strong convexity assumption, relying on the approach used in [22], [29], and to provide the following revised robust strict saddle property for such problems. To that end, we list some necessary definitions related to groups and the invariance of a function under a group action.

Definition 5 (Definition 7.1 in [42]). A (closed) binary operation, ∘, is a law of composition that produces an element of a set from two elements of the same set. More precisely, let G be a set and a1, a2 ∈ G be arbitrary elements. Then (a1, a2) → a1 ∘ a2 ∈ G.

Definition 6 (Definition 7.2 in [42]). A group is a set G together with a (closed) binary operation ∘ such that for any elements a, a1, a2, a3 ∈ G the following properties hold:
• Associative property: a1 ∘ (a2 ∘ a3) = (a1 ∘ a2) ∘ a3.
• There exists an identity element e ∈ G such that e ∘ a = a ∘ e = a.
• There is an element a^{-1} ∈ G such that a^{-1} ∘ a = a ∘ a^{-1} = e.

With this definition, it is common to denote a group just by G without specifying the binary operation when it is clear from the context.

Definition 7. Given a function h(x) : R^n → R and a group G of operators on R^n, we say h is invariant under the group action (or under an element a of the group) if

h(a(x)) = h(x)

for all x ∈ R^n and a ∈ G.

Suppose the group action also preserves the energy of x, i.e., ‖a(x)‖ = ‖x‖ for all a ∈ G. Since for any x ∈ R^n, h(a(x)) = h(x) for all a ∈ G, it is straightforward to stratify the domain of h(x) into equivalence classes. The vectors in each of these equivalence classes differ by a group action. One implication is that when considering the distance between two points x1 and x2, it is helpful to use the distance between their corresponding classes:

dist(x1, x2) := min_{a1 ∈ G, a2 ∈ G} ‖a1(x1) − a2(x2)‖ = min_{a ∈ G} ‖x1 − a(x2)‖,   (4)

where the second equality follows because ‖a1(x1) − a2(x2)‖ = ‖a1(x1 − a1^{-1}a2(x2))‖ = ‖x1 − a1^{-1}a2(x2)‖ and a1^{-1}a2 ∈ G. Another implication is that the function h(x) cannot possibly be strongly convex (or even convex) in any neighborhood around its local minimum points because of the existence of the equivalence classes. Before presenting the revised robust strict saddle property for invariant functions, we list two examples to illuminate these concepts.

Example 1: As one example, consider the phase retrieval problem of recovering an n-dimensional complex vector x* from

y_i = |b_i^H x*|, i = 1, ..., p,

the magnitudes of its projections onto a collection of known complex vectors b1, b2, ..., bp [21], [22]. The unknown x* can be estimated by solving the following natural least-squares formulation [21], [22]:

minimize_{x ∈ C^n} h(x) = (1/(2p)) Σ_{i=1}^p (y_i² − |b_i^H x|²)²,

where we note that here the domain of x is C^n. For this case, we denote the corresponding group

G = {e^{jθ} : θ ∈ [0, 2π)}

and the group action as a(x) = e^{jθ}x, where a = e^{jθ} is an element of G. It is clear that h(a(x)) = h(x) for all a ∈ G. Due to this invariance of h(x), it is impossible to recover the global phase factor of the unknown x*, and the function h(x) is not strongly convex in any neighborhood of x*.

Example 2: As another example, we revisit the general factored low-rank optimization problem (2):

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} h(U, V) = f(UV^T).

We recast the two variables U, V into W = [U; V]. For this example, we denote the corresponding group G = O_r and the group action on W as a(W) = [UR; VR], where a = R ∈ G. We have that h(a(W)) = h(W) for all a ∈ G since UR(VR)^T = UV^T for any R ∈ O_r. Because of this invariance, in general h(W) is not strongly convex in any neighborhood around its local minimum points even when f(X) is a strongly convex function; see [41] for the symmetric low-rank factorization problem and Theorem 2 in Appendix A for the nonsymmetric low-rank factorization problem.


In the examples illustrated above, due to the invariance, the function is not strongly convex (or even convex) in any neighborhood around its local minimum points, and thus one cannot apply the standard approach in optimization to show convergence within a small neighborhood around a local minimum point. To overcome this issue, Candès et al. [22] utilized the so-called regularity condition as a sufficient condition for local convergence of gradient descent applied to the phase retrieval problem. This approach has also been applied to the matrix sensing problem [29] and semidefinite optimization [35].

Definition 8 (Regularity condition [22], [29]). Suppose h(x) : R^n → R is invariant under the group action of the given group G. Let x* ∈ R^n be a local minimum point of h(x). Define the set B(δ, x*) as

B(δ, x*) := {x ∈ R^n : dist(x, x*) ≤ δ},

where the distance dist(x, x*) is defined in (4). Then we say the function h(x) satisfies the (α, β, δ)-regularity condition if for all x ∈ B(δ, x*), we have

⟨∇h(x), x − a(x*)⟩ ≥ α dist(x, x*)² + β‖∇h(x)‖²,   (5)

where a = arg min_{a′ ∈ G} ‖x − a′(x*)‖.

We remark that (α, β) in the regularity condition (5) must satisfy αβ ≤ 1/4, since by the Cauchy–Schwarz inequality

⟨∇h(x), x − a(x*)⟩ ≤ ‖∇h(x)‖ dist(x, x*)

and by the inequality of arithmetic and geometric means

α dist²(x, x*) + β‖∇h(x)‖² ≥ 2√(αβ) dist(x, x*)‖∇h(x)‖.

Lemma 1 ([22], [29]). If the function h(x) restricted to a δ neighborhood of x* satisfies the (α, β, δ)-regularity condition, then as long as gradient descent starts from a point x0 ∈ B(δ, x*), the gradient descent update

x_{t+1} = x_t − ν∇h(x_t)

with step size 0 < ν ≤ 2β obeys x_t ∈ B(δ, x*) and

dist²(x_t, x*) ≤ (1 − 2να)^t dist²(x0, x*)

for all t ≥ 0.

The proof is given in [22]. To keep the paper self-contained, we also provide the proof of Lemma 1 in Appendix B. We remark that the decreasing rate satisfies 1 − 2να ∈ [0, 1) since we choose ν ≤ 2β and αβ ≤ 1/4.

Now we establish the following revised robust strict saddle property for invariant functions by replacing the strong convexity condition in Definition 4 with the regularity condition.

Definition 9 (Revised robust strict saddle property for invariant functions). Given a twice differentiable h(x) : R^n → R and a group G, suppose h(x) is invariant under the group action and the energy of x is also preserved under the group action, i.e., h(a(x)) = h(x) and ‖a(x)‖² = ‖x‖² for all a ∈ G. Given α, β, γ, ε, δ, h(x) satisfies the (α, β, γ, ε, δ)-robust strict saddle property if for any point x at least one of the following applies:
1) There exists a local minimum point x* such that dist(x, x*) ≤ δ, and the function h(x′) restricted to a 2δ neighborhood of x* (i.e., dist(x′, x*) ≤ 2δ) satisfies the (α, β, 2δ)-regularity condition defined in Definition 8;
2) λ_min(∇²h(x)) ≤ −γ;
3) ‖∇h(x)‖ ≥ ε.

Compared with Definition 4, the revised robust strict saddle property requires the regularity condition (a local descent condition) instead of strong convexity in a small neighborhood around any local minimum point. With the convergence guarantee in Lemma 1, the convergence analysis of the stochastic gradient descent algorithm in [16] for robust strict saddle functions can also be applied to the revised robust strict saddle functions defined in Definition 9, with the same convergence rate.² We omit the details here and refer the reader to [20] for more details on this. In the rest of the paper, the robust strict saddle property refers to the one in Definition 9.

III. LOW-RANK MATRIX OPTIMIZATION WITH THE FACTORIZATION APPROACH

In this section, we consider the minimization of general rank-constrained optimization problems of the form (1) using the factorization approach (2) (which we repeat as follows):

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} h(U, V) = f(UV^T),

where the rank constraint in (1) is automatically satisfied by the factorization approach. With the necessary assumptions on f given in Section III-A, we provide a geometric analysis of the factored problem in Section III-B. We then present a stylized application in matrix sensing in Section III-C.

A. Assumptions and regularizer

Before presenting our main results, we lay out the necessary assumptions on the objective function f(X). As is known, without any assumptions on the problem, even minimizing traditional quadratic objective functions is challenging. For this reason, we focus on problems satisfying the following two assumptions.

Assumption 1. f(X) has a critical point X* ∈ R^{n×m} which has rank r.

Assumption 2. f(X) is (2r, 4r)-restricted strongly convex and smooth, i.e., for any n × m matrices X, D with rank(X) ≤ 2r and rank(D) ≤ 4r, the Hessian of f(X) satisfies

a‖D‖_F² ≤ [∇²f(X)](D, D) ≤ b‖D‖_F²   (6)

for some positive a and b.

Assumption 1 is equivalent to the existence of a rank-r matrix X* such that ∇f(X*) = 0, which is very mild and holds in many matrix inverse problems including matrix sensing [7], matrix completion [9] and 1-bit matrix completion [43], where the unknown matrix to be recovered is a critical point of f.

² As mentioned previously, a similar notion of a revised robust strict saddle property has also recently been utilized in [20].


Assumption 2 is also utilized in [30, Conditions 5.3 and 5.4] and [38], where weighted low-rank matrix factorization and a set of matrix inverse problems are proved to satisfy the (2r, 4r)-restricted strong convexity and smoothness condition (6). We discuss matrix sensing as a typical example satisfying this assumption in Section III-C.

Combining Assumption 1 and Assumption 2, we have that X* is the unique global minimum of (1).

Proposition 1. Suppose f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (6) with positive a and b. Assume X* is a critical point of f(X) with rank(X*) = r. Then X* is the global minimum of (1), i.e.,

f(X*) ≤ f(X), ∀ X ∈ R^{n×m}, rank(X) ≤ r,

and equality holds only at X = X*.

The proof of Proposition 1 is given in Appendix C. We note that Proposition 1 guarantees that X* is the unique global minimum of (1), and it is expected that solving the factored problem (9) also gives X*. Proposition 1 differs from [38] in that it only requires X* to be a critical point, while [38] requires X* to be a global minimum of f.

Before presenting the main result, we note that if f satisfies (6) with positive a and b and we rescale f as f′ = (2/(a + b)) f, then f′ satisfies

(2a/(a + b))‖D‖_F² ≤ [∇²f′(X)](D, D) ≤ (2b/(a + b))‖D‖_F².

It is clear that f and f′ have the same optimization geometry (despite the scaling difference). Let a′ = 2a/(a + b) = 1 − c and b′ = 2b/(a + b) = 1 + c with c = (b − a)/(a + b). We have 0 < a′ ≤ 1 ≤ b′ and a′ + b′ = 2. Thus, throughout the paper and without loss of generality, we assume

a = 1 − c, b = 1 + c, c ∈ [0, 1).   (7)

Now let X* = ΦΣΨ^T = Σ_{i=1}^r σ_i φ_i ψ_i^T be a reduced SVD of X*, where Σ is a diagonal matrix with σ1 ≥ ··· ≥ σr along its diagonal. Denote

U* = ΦΣ^{1/2}R, V* = ΨΣ^{1/2}R   (8)

for any R ∈ O_r. We first introduce the following ways to stack U and V together, which are widely used throughout the paper:

W = [U; V], Ŵ = [U; −V], W* = [U*; V*], Ŵ* = [U*; −V*].

Before moving on, we note that for any solution (U, V) to (2), (UR1, VR2) is also a solution to (2) for any R1, R2 ∈ R^{r×r} such that UR1R2^TV^T = UV^T. As an extreme example, one may take R1 = cI and R2 = (1/c)I where c can be arbitrarily large. In order to address this ambiguity (i.e., to reduce the search space of W for (3)), we utilize the trick in [15], [29], [30], [38] of introducing a regularizer ρ, and turn to solving the following problem:

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} G(W) := h(W) + ρ(W),   (9)

where

ρ(W) := (µ/4)‖U^TU − V^TV‖_F².

We remark that W* is still a global minimizer of the factored problem (9) since both the first term and ρ(W) achieve their global minima at W*. The regularizer ρ(W) is applied to force the difference between the Gram matrices of U and V to be as small as possible. The global minimum of ρ(W) is 0, which is achieved when U and V have the same Gram matrices, i.e., when W belongs to

E := {W = [U; V] : U^TU − V^TV = 0}.   (10)

Informally, we can view (9) as finding a point in E that also minimizes the first term in (9). This is rigorously established in the following result, which reveals that any critical point W of G(W) belongs to E (that is, U and V are balanced factors of their product UV^T) for any µ > 0.

Lemma 2 ([38, Theorem 3]). Suppose G(W) is defined as in (9) with µ > 0. Then any critical point W of G(W) belongs to E, i.e.,

∇G(W) = 0 ⇒ U^TU = V^TV.   (11)

For completeness, we include the proof of Lemma 2 in Appendix D.
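For concreteness, here is a small sketch (our own, with a placeholder quadratic f and our own function names) that evaluates G(W) = f(UV^T) + (µ/4)‖U^TU − V^TV‖_F² and the gradient of the balancing regularizer; by Lemma 2, at any critical point of G this regularizer forces U^TU − V^TV to vanish.

import numpy as np

def regularized_objective(U, V, f, mu=0.5):
    # G(W) = f(U V^T) + (mu/4) * ||U^T U - V^T V||_F^2
    gap = U.T @ U - V.T @ V
    return f(U @ V.T) + (mu / 4) * np.linalg.norm(gap, 'fro') ** 2

def regularizer_grad(U, V, mu=0.5):
    # grad_U rho = mu * U (U^T U - V^T V),  grad_V rho = -mu * V (U^T U - V^T V)
    gap = U.T @ U - V.T @ V
    return mu * U @ gap, -mu * V @ gap

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
f = lambda X: 0.5 * np.linalg.norm(X - X_star, 'fro') ** 2    # placeholder smooth f
U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
G_val = regularized_objective(U, V, f)
rU, rV = regularizer_grad(U, V)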

B. Global geometry for general low-rank optimization

We now characterize the global optimization geometry of the factored problem (9). As explained in Section II, G(W) is invariant under the orthonormal matrices R ∈ O_r, so we first recall the discussion in Section II about the revised robust strict saddle property for invariant functions. To that end, we follow the notion of the distance between equivalence classes for invariant functions defined in (4) and define the distance between W1 and W2 as follows:

dist(W1, W2) := min_{R1 ∈ O_r, R2 ∈ O_r} ‖W1R1 − W2R2‖_F = min_{R ∈ O_r} ‖W1 − W2R‖_F.   (12)

For convenience, we also denote by R(W1, W2) the best rotation matrix R such that ‖W1 − W2R‖_F achieves its minimum, i.e.,

R(W1, W2) := arg min_{R′ ∈ O_r} ‖W1 − W2R′‖_F,   (13)

which is also known as the orthogonal Procrustes problem [44]. The solution to the above minimization problem is characterized by the following lemma.

Lemma 3 ([44]). Let W2^TW1 = LSP^T be an SVD of W2^TW1. An optimal solution for the orthogonal Procrustes problem (13) is given by

R(W1, W2) = LP^T.

Moreover, we have

W1^TW2R(W1, W2) = (W2R(W1, W2))^TW1 = PSP^T ⪰ 0.


To ease the notation, we drop W1 and W2 in R(W1, W2) and write R instead of R(W1, W2) when they (W1 and W2) are clear from the context. Now we are well equipped to present the robust strict saddle property for G(W) in the following result.

Theorem 1. Define the following regions:

R1 := {W : dist(W, W*) ≤ σ_r^{1/2}(X*)},

R2 := {W : σ_r(W) ≤ √(1/2) σ_r^{1/2}(X*), ‖WW^T‖_F ≤ (20/19)‖W*W*^T‖_F},

R3′ := {W : dist(W, W*) > σ_r^{1/2}(X*), ‖W‖ ≤ (20/19)‖W*‖, σ_r(W) > √(1/2) σ_r^{1/2}(X*), ‖WW^T‖_F ≤ (20/19)‖W*W*^T‖_F},

R3″ := {W : ‖W‖ > (20/19)‖W*‖ = (20/19)√2 ‖X*‖^{1/2}, ‖WW^T‖_F ≤ (10/9)‖W*W*^T‖_F},

R3‴ := {W : ‖WW^T‖_F > (10/9)‖W*W*^T‖_F = (20/9)‖X*‖_F}.

Let G(W) be defined as in (9) with µ = 1/2. Suppose f(X) has a critical point X* ∈ R^{n×m} of rank r and satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (6) with positive constants a = 1 − c, b = 1 + c, where

c ≲ σ_r^{3/2}(X*) / (‖X*‖_F ‖X*‖^{1/2}).   (14)

Then G(W) has the following robust strict saddle property:
1) For any W ∈ R1, G(W) satisfies the local regularity condition:

⟨∇G(W), W − W*R⟩ ≳ σ_r(X*) dist²(W, W*) + (1/‖X*‖)‖∇G(W)‖_F²,   (15)

where dist(W, W*) and R are defined in (12) and (13), respectively.
2) For any W ∈ R2, G(W) has a directional negative curvature, i.e.,

λ_min(∇²G(W)) ≲ −σ_r(X*).   (16)

3) For any W ∈ R3 = R3′ ∪ R3″ ∪ R3‴, G(W) has a large gradient:

‖∇G(W)‖_F ≳ σ_r^{3/2}(X*), ∀ W ∈ R3′;   (17)

‖∇G(W)‖_F ≳ ‖W‖³, ∀ W ∈ R3″;   (18)

‖∇G(W)‖_F ≳ ‖WW^T‖_F^{3/2}, ∀ W ∈ R3‴.   (19)

The proof of this result is given in Appendix L. The main proof strategy is to utilize Assumption 1 and Assumption 2 on the function f to control the deviation between the gradient (and the Hessian) of the general low-rank optimization (9) and the counterpart of the matrix factorization problem, so that the landscape of the general low-rank optimization (9) has a similar geometric structure. To that end, in Appendix A, we provide a comprehensive geometric analysis for the matrix factorization problem (3). The reason for choosing µ = 1/2 is also discussed in Appendix A-F. We note that the results in Appendix A are also of independent interest, as we show that the objective function in (3) obeys the strict saddle property and has no spurious local minima not only for exact-parameterization (r = rank(X*)), but also for over-parameterization (r > rank(X*)) and under-parameterization (r < rank(X*)). Several remarks follow.

Remark 1. Note that

R1 ∪ R2 ∪ R3′ ⊇ {W : ‖W‖ ≤ (20/19)‖W*‖_F, ‖WW^T‖_F ≤ (10/9)‖W*W*^T‖_F},

which further implies

R1 ∪ R2 ∪ R3′ ∪ R3″ ⊇ {W : ‖WW^T‖_F ≤ (10/9)‖W*W*^T‖_F}.

Thus, we conclude that R1 ∪ R2 ∪ R3′ ∪ R3″ ∪ R3‴ = R^{(n+m)×r}. Now the convergence analysis of the stochastic gradient descent algorithm in [16], [20] for robust strict saddle functions also holds for G(W).

Remark 2. The constants involved in Theorem 1 can be found in Appendix L through the proof. Theorem 1 states that the objective function of the general low-rank optimization (9) also satisfies the robust strict saddle property when (14) holds. The requirement on c in (14) can be weakened if one only wishes to ensure that the properties of g(W) are preserved for G(W) in some of the regions. For example, the local regularity condition (15) holds when

c ≤ 1/50,

which is independent of X*. With the analysis of the global geometric structure of G(W), Theorem 1 ensures that many local search algorithms converge to X* (which is the global minimum of (1), as guaranteed by Proposition 1) with random initialization. In particular, stochastic gradient descent applied to the matrix sensing problem (22) is guaranteed to find the global minimum X* in polynomial time.

Remark 3. Local (rather than global) geometry results for the general low-rank optimization (9) are also covered in [38], which only characterizes the geometry at the critical points. Instead, Theorem 1 characterizes the global geometry of the general low-rank optimization (9). Because the analysis is different, the proof strategy for Theorem 1 is also very different from that of [38]. Since [38] only considers local geometry, the result there requires c ≤ 0.2, which is slightly less restrictive than the requirement in (14).

Remark 4. To explain the necessity of the requirement on the constants a and b in (14), we utilize the symmetric weighted PCA problem (so that we can visualize the landscape of the factored problem in Figure 1) as an example, where the objective function is

f(X) = (1/2)‖Ω ⊙ (X − X*)‖_F²,   (20)

where Ω ∈ R^{n×n} contains positive entries and ⊙ denotes the entrywise (Hadamard) product. The Hessian quadratic form of f(X) is given by [∇²f(X)](D, D) = ‖Ω ⊙ D‖_F² for any D ∈ R^{n×n}. Thus, we have

min_{ij} |Ω[i, j]|² ≤ [∇²f(X)](D, D)/‖D‖_F² ≤ max_{ij} |Ω[i, j]|².

Comparing with (6), we see that f satisfies the restricted strong convexity and smoothness conditions with the constants a = min_{ij} |Ω[i, j]|² and b = max_{ij} |Ω[i, j]|². In this case, we also note that if each entry Ω[i, j] is nonzero (i.e., min_{ij} |Ω[i, j]|² > 0), the function f(X) is strongly convex, rather than only restrictedly strongly convex, implying that (20) has a unique optimal solution X*. By applying the factorization approach, we get the factored objective function

h(U) = (1/2)‖Ω ⊙ (UU^T − X*)‖_F².   (21)

To illustrate the necessity of the requirement on the constants a and b in (14) so that the factored problem (21) has no spurious local minima and obeys the robust strict saddle property, we set X* = [1 1; 1 1], which is a rank-1 matrix and can be factorized as X* = U*U*^T with U* = [1; 1]. We then plot the landscapes of the factored objective function h(U) with Ω = [1 1; 1 1] and Ω = [8 1; 1 8] in Figure 1. We observe from Figure 1 that as long as the elements in Ω have a small dynamic range (which corresponds to a small b/a), h(U) has no spurious local minima, but if the elements in Ω have a large dynamic range (which corresponds to a large b/a), spurious local minima can appear in h(U).

Figure 1. Landscapes of h(U) in (21) with X* = [1 1; 1 1] and (a) Ω = [1 1; 1 1]; (b) Ω = [8 1; 1 8].

Remark 5. The global geometry of low-rank matrix recovery, but with analysis customized to linear measurements and quadratic loss functions, is also covered in [34], [41]. Since Theorem 1 only requires the (2r, 4r)-restricted strong convexity and smoothness property (6), aside from low-rank matrix recovery [45], it can also be applied to many other low-rank matrix optimization problems [46] which do not necessarily involve quadratic loss functions. Typical examples include 1-bit matrix completion [43], [47] and Poisson principal component analysis (PCA) [48]. We refer to [38] for more discussion on this issue. In the next section, we consider a stylized application of Theorem 1 in matrix sensing and compare it with the result in [41].

C. Stylized application: Matrix sensing

In this section, we extend the previous geometric analysis to the matrix sensing problem

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} G(W) := (1/2)‖A(UV^T − X*)‖_2² + ρ(W),   (22)

where A : R^{n×m} → R^p is a known linear measurement operator and X* is the unknown rank-r matrix to be recovered. In this case, we have

f(X) = (1/2)‖A(X − X*)‖_2².

The derivative of f(X) at X* is

∇f(X*) = A^*A(X* − X*) = 0,

where A^* is the adjoint of A, which implies that f(X) satisfies Assumption 1. The Hessian quadratic form ∇²f(X)[D, D] for any n × m matrices X and D is given by

∇²f(X)[D, D] = ‖A(D)‖².

The following matrix Restricted Isometry Property (RIP) serves as a way to link the low-rank matrix factorization problem (29) with the matrix sensing problem (22) and certifies that f(X) satisfies Assumption 2.

Definition 10 (Restricted Isometry Property (RIP) [7], [49]). The map A : R^{n×m} → R^p satisfies the r-RIP with constant δ_r if³

(1 − δ_r)‖D‖_F² ≤ ‖A(D)‖² ≤ (1 + δ_r)‖D‖_F²   (23)

holds for any n × m matrix D with rank(D) ≤ r.

If A satisfies the 4r-restricted isometry property with constant δ_{4r}, then f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (6) with constants a = 1 − δ_{4r} and b = 1 + δ_{4r}, since

(1 − δ_{4r})‖D‖_F² ≤ ∇²f(X)[D, D] = ‖A(D)‖² ≤ (1 + δ_{4r})‖D‖_F²   (24)

for any rank-4r matrix D. Comparing (24) with (6), we note that the RIP is stronger than the restricted strong convexity and smoothness property (6), as the RIP gives that (24) holds for all n × m matrices X, while Assumption 2 only requires that (6) holds for all rank-2r matrices X.

³ By abuse of notation, we adopt the conventional notation δ_r for the RIP constant. The subscript r can be used to distinguish the RIP constant δ_r from δ, which is used as a small constant in Section II.

Now, applying Theorem 1, we obtain a similar geometric guarantee for the matrix sensing problem (22) when A satisfies the RIP.

Corollary 1. Let R1, R2, R3′, R3″, R3‴ be the regions defined in Theorem 1. Let G(W) be defined as in (22) with µ = 1/2 and A satisfying the 4r-RIP with

δ_{4r} ≲ σ_r^{3/2}(X*) / (‖X*‖_F ‖X*‖^{1/2}).   (25)

Then G(W) has the following robust strict saddle property:

1) For any W ∈ R1, G(W) satisfies the local regularity condition:

⟨∇G(W), W − W*R⟩ ≳ σ_r(X*) dist²(W, W*) + (1/‖X*‖)‖∇G(W)‖_F²,   (26)

where dist(W, W*) and R are defined in (12) and (13), respectively.
2) For any W ∈ R2, G(W) has a directional negative curvature, i.e.,

λ_min(∇²G(W)) ≲ −σ_r(X*).

3) For any W ∈ R3 = R3′ ∪ R3″ ∪ R3‴, G(W) has a large gradient:

‖∇G(W)‖_F ≳ σ_r^{3/2}(X*), ∀ W ∈ R3′;

‖∇G(W)‖_F ≳ ‖W‖³, ∀ W ∈ R3″;

‖∇G(W)‖_F ≳ ‖WW^T‖_F^{3/2}, ∀ W ∈ R3‴.

Remark 6. The constants involved in Corollary 1 are the same as those in Theorem 1 and can be found in Appendix L through the proof. Similarly, the requirement on δ_{4r} in (25) can be weakened if one only wishes to ensure that the properties of g(W) are preserved for G(W) in some of the regions. For example, the local regularity condition (26) holds when

δ_{4r} ≤ 1/50,

which is independent of X*. Note that Tu et al. [29, Section 5.4, (5.15)] provided a similar regularity condition. However, the result there requires δ_{6r} ≤ 1/25 and dist(W, W*) ≤ (1/(2√2)) σ_r^{1/2}(X*), which defines a smaller region than R1. Based on this local regularity condition, Tu et al. [29] showed that gradient descent with a good initialization (which is close enough to W*) converges to the unknown matrix W* (and hence X*). With the analysis of the global geometric structure of G(W), Corollary 1 ensures that many local search algorithms can find the unknown matrix X* in polynomial time.

Remark 7. A Gaussian A will have the RIP with high probability when the number of measurements p is comparable to the number of degrees of freedom in an n × m matrix with rank r. By a Gaussian A we mean that the ℓ-th element of y = A(X), namely y_ℓ, is given by

y_ℓ = ⟨X, A_ℓ⟩ = Σ_{i=1}^n Σ_{j=1}^m X[i, j] A_ℓ[i, j],

where the entries of each n × m matrix A_ℓ are independent and identically distributed normal random variables with zero mean and variance 1/p. Specifically, a Gaussian A satisfies (23) with high probability when [5], [7], [45]

p ≳ r(n + m) (1/δ_r²).

Now utilizing the inequality ‖X*‖_F ≤ √r ‖X*‖ in (14), we conclude that in the case of Gaussian measurements, the robust strict saddle property is preserved for the matrix sensing problem with high probability when the number of measurements exceeds a constant times (n + m)r²κ(X*)³, where κ(X*) = σ1(X*)/σr(X*). This further implies that, when applying the stochastic gradient descent algorithm to the matrix sensing problem (22) with Gaussian measurements, we are guaranteed to find the unknown matrix X* in polynomial time with high probability when

p ≳ (n + m)r²κ(X*)³.   (27)

When X* is an n × n PSD matrix, Li et al. [41] showed that the corresponding matrix sensing problem with Gaussian measurements has similar global geometry to the low-rank PSD matrix factorization problem when the number of measurements satisfies

p ≳ nr² σ1^4(X*)/σr²(X*).   (28)

Comparing (27) with (28), we find that both results for the number of measurements needed depend similarly on the rank r, but slightly differently on the spectrum of X*. We finally remark that the sampling complexity in (27) is O((n + m)r²), which is slightly larger than the information-theoretically optimal bound O((n + m)r) for matrix sensing. This is because Corollary 1 is a direct consequence of Theorem 1, in which we directly characterize the landscape of the objective function in the whole space by combining the results for matrix factorization in Appendix A and the restricted strong convexity and smoothness condition. We believe this mismatch is an artifact of our proof strategy and could be mitigated by a different approach, like utilizing the properties of quadratic loss functions [34]. If one desires only to characterize the geometry at the critical points, then O((n + m)r) measurements are enough to ensure the strict saddle property and lack of spurious local minima for matrix sensing [15], [38].
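To make the Gaussian measurement model above concrete, here is a small sketch (our own illustration; dimensions, names, and the choice of p are ours) that draws the matrices A_ℓ with i.i.d. N(0, 1/p) entries, applies A and its adjoint, and evaluates the matrix sensing loss f(X) = (1/2)‖A(X − X*)‖_2².

import numpy as np

rng = np.random.default_rng(0)
n, m, r = 10, 8, 2
p = 5 * r * (n + m)                                     # number of measurements (illustrative choice)
A_mats = rng.standard_normal((p, n, m)) / np.sqrt(p)    # entries i.i.d. N(0, 1/p)

A = lambda X: np.tensordot(A_mats, X, axes=([1, 2], [0, 1]))   # y_l = <X, A_l>
A_adj = lambda y: np.tensordot(y, A_mats, axes=(0, 0))         # adjoint A^*(y)

X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
f = lambda X: 0.5 * np.linalg.norm(A(X - X_star)) ** 2
grad_f = lambda X: A_adj(A(X - X_star))   # = A^* A (X - X^*); vanishes at X = X^*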

APPENDIX A
THE OPTIMIZATION GEOMETRY OF LOW-RANK MATRIX FACTORIZATION

In this appendix, we consider the low-rank matrix factorization problem

minimize_{U ∈ R^{n×r}, V ∈ R^{m×r}} g(W) := (1/2)‖UV^T − X*‖_F² + ρ(W),   (29)

where ρ(W) is the regularizer used in (9), repeated here:

ρ(W) = (µ/4)‖U^TU − V^TV‖_F².

We provide a comprehensive geometric analysis for the matrix factorization problem (29). In particular, we show that the objective function in (29) obeys the strict saddle property and has no spurious local minima not only for exact-parameterization (r = rank(X*)), but also for over-parameterization (r > rank(X*)) and under-parameterization (r < rank(X*)). For the exact-parameterization case, we further show that the objective function satisfies the robust strict saddle property, ensuring global convergence of many local search algorithms in polynomial time. Since we believe these results are also of independent interest, and to make the presentation easy to follow, we only present the main results in this appendix and defer the proofs to other appendices.

A. Relationship to PSD low-rank matrix factorization

Similar to (8), let X* = ΦΣΨ^T = Σ_{i=1}^r σ_i φ_i ψ_i^T be a reduced SVD of X*, where Σ is a diagonal matrix with σ1 ≥ ··· ≥ σr along its diagonal, and denote U* = ΦΣ^{1/2}R, V* = ΨΣ^{1/2}R for any R ∈ O_r. The following result to some degree characterizes the relationship between the nonsymmetric low-rank matrix factorization problem (29) and the following PSD low-rank matrix factorization problem [41]:

minimize_{U ∈ R^{n×r}} ‖UU^T − M‖_F²,   (30)

where M ∈ R^{n×n} is a rank-r PSD matrix.

Lemma 4. Suppose g(W) is defined as in (29) with µ > 0. Then we have

g(W) ≥ min{µ/4, 1/8} ‖WW^T − W*W*^T‖_F².

In particular, if we choose µ = 1/2, then we have

g(W) = (1/8)‖WW^T − W*W*^T‖_F² + (1/4)‖U^TU* − V^TV*‖_F².

The proof of Lemma 4 is given in Appendix E. Informally, Lemma 4 indicates that minimizing g(W) also results in minimizing ‖WW^T − W*W*^T‖_F² (which has the same form as the objective function in (30)) and hence the distance between W and W* (though W* is not available a priori). The global geometry of the PSD low-rank matrix factorization problem (30) was recently analyzed by Li et al. in [41].
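Since Lemma 4 gives an exact identity when µ = 1/2, it is easy to sanity-check numerically; the sketch below (ours, with an arbitrary random instance) compares the two sides at a random W.

import numpy as np

rng = np.random.default_rng(0)
n, m, r = 7, 5, 2
Phi, _ = np.linalg.qr(rng.standard_normal((n, r)))
Psi, _ = np.linalg.qr(rng.standard_normal((m, r)))
sig = np.array([3.0, 1.5])
X_star = Phi @ np.diag(sig) @ Psi.T
U_star, V_star = Phi @ np.diag(np.sqrt(sig)), Psi @ np.diag(np.sqrt(sig))
W_star = np.vstack([U_star, V_star])

U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
W = np.vstack([U, V])

mu = 0.5
g = 0.5 * np.linalg.norm(U @ V.T - X_star, 'fro') ** 2 \
    + (mu / 4) * np.linalg.norm(U.T @ U - V.T @ V, 'fro') ** 2
rhs = 0.125 * np.linalg.norm(W @ W.T - W_star @ W_star.T, 'fro') ** 2 \
      + 0.25 * np.linalg.norm(U.T @ U_star - V.T @ V_star, 'fro') ** 2
assert np.isclose(g, rhs)   # the identity in Lemma 4 with mu = 1/2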

B. Characterization of critical points

We first provide the gradient and Hessian expressions for g(W). The gradient of g(W) is given by

∇_U g(U, V) = (UV^T − X*)V + µU(U^TU − V^TV),
∇_V g(U, V) = (UV^T − X*)^TU − µV(U^TU − V^TV),

which can be rewritten as

∇g(W) = [(UV^T − X*)V; (UV^T − X*)^TU] + µŴŴ^TW.

Standard computations give the Hessian quadratic form [∇²g(W)](Δ, Δ) for any Δ = [Δ_U; Δ_V] ∈ R^{(n+m)×r} (where Δ_U ∈ R^{n×r} and Δ_V ∈ R^{m×r}) as

[∇²g(W)](Δ, Δ) = ‖Δ_U V^T + UΔ_V^T‖_F² + 2⟨UV^T − X*, Δ_U Δ_V^T⟩ + [∇²ρ(W)](Δ, Δ),   (31)

where, with Δ̂ := [Δ_U; −Δ_V],

[∇²ρ(W)](Δ, Δ) = µ⟨Ŵ^TW, Δ̂^TΔ⟩ + µ⟨ŴΔ^T, ΔŴ^T⟩ + µ⟨ŴŴ^T, ΔΔ^T⟩.   (32)
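The Hessian quadratic form (31)–(32) can be evaluated directly; the following sketch (ours, with our own function names) implements it and cross-checks it against a finite-difference approximation of g along the direction Δ.

import numpy as np

def g(U, V, X_star, mu=0.5):
    return 0.5 * np.linalg.norm(U @ V.T - X_star, 'fro') ** 2 \
           + (mu / 4) * np.linalg.norm(U.T @ U - V.T @ V, 'fro') ** 2

def hess_quad(U, V, DU, DV, X_star, mu=0.5):
    # [grad^2 g(W)](Delta, Delta) from (31)-(32), with W_hat = [U; -V], Delta_hat = [DU; -DV].
    W, Wh = np.vstack([U, V]), np.vstack([U, -V])
    D, Dh = np.vstack([DU, DV]), np.vstack([DU, -DV])
    quad = np.linalg.norm(DU @ V.T + U @ DV.T, 'fro') ** 2 \
           + 2 * np.sum((U @ V.T - X_star) * (DU @ DV.T))
    quad += mu * (np.sum((Wh.T @ W) * (Dh.T @ D))
                  + np.sum((Wh @ D.T) * (D @ Wh.T))
                  + np.sum((Wh @ Wh.T) * (D @ D.T)))
    return quad

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
U, V = rng.standard_normal((n, r)), rng.standard_normal((m, r))
DU, DV = rng.standard_normal((n, r)), rng.standard_normal((m, r))

# Second-order finite difference of t -> g(U + t*DU, V + t*DV) at t = 0.
t = 1e-4
fd = (g(U + t*DU, V + t*DV, X_star) - 2 * g(U, V, X_star) + g(U - t*DU, V - t*DV, X_star)) / t ** 2
assert np.isclose(hess_quad(U, V, DU, DV, X_star), fd, rtol=1e-4)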

By Lemma 2, we can simplify the equations characterizing the critical points as follows:

∇_U g(U, V) = UU^TU − X*V = 0,   (33)
∇_V g(U, V) = VV^TV − X*^TU = 0.   (34)

Now suppose W is a critical point of g(W). We can apply the Gram-Schmidt process to orthogonalize the columns of U, i.e., write Ū = UR, where Ū has orthogonal (but possibly zero) columns and R ∈ O_r = {R ∈ R^{r×r} : R^TR = I}.⁴ Also let V̄ = VR. Since U^TU = V^TV, we have Ū^TŪ = V̄^TV̄; thus the columns of V̄ are also orthogonal. Noting that UV^T = ŪV̄^T, we conclude that g(W̄) = g(W) and that W̄ = [Ū; V̄] is also a critical point of g(W), since ∇_U g(W̄) = ∇_U g(W)R = 0 and ∇_V g(W̄) = ∇_V g(W)R = 0. Also, for any Δ ∈ R^{(n+m)×r}, we have [∇²g(W)](Δ, Δ) = [∇²g(W̄)](ΔR, ΔR), indicating that W and W̄ carry the same Hessian information. Thus, without loss of generality, we assume U and V have orthogonal, but possibly zero, columns. With this, we use u_i and v_i to denote the i-th columns of U and V, respectively. It follows from ∇g(W) = 0 that

‖u_i‖²u_i = X*v_i,
‖v_i‖²v_i = X*^Tu_i,

which indicates that

(u_i, v_i) ∈ {(√σ1 φ1, √σ1 ψ1), ..., (√σr φr, √σr ψr), (0, 0)}.

⁴ Another way to find R is via the SVD. Let U = LΣR^T be a reduced SVD of U, where L is an n × r orthonormal matrix, Σ is an r × r diagonal matrix with non-negative diagonal entries, and R ∈ O_r. Then Ū = UR = LΣ has orthogonal, but possibly zero, columns.

Now we identify all the critical points of g(W ) in thefollowing lemma, which is formally proved with an algebraicapproach in Appendix F.

Lemma 5. Let X? = ΦΣΨT =∑ri=1 σiφiψ

Ti be a reduced

SVD of X? and g(W ) be defined as in (29) with µ > 0. Any

4Another way to find R is via the SVD. Let U = LΣRT be a reducedSVD of U , where L is an n× r orthonormal matrix, Σ is an r× r diagonalmatrix with non-negative diagonals, and R ∈ Or . Then U = UR = LΣis orthogonal, but has possible zero columns.

Page 11: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

11

W =

[UV

]is a critical point of g(W ) if and only if W ∈ C

with

C :=

W =

[UV

]:U = ΦΛ1/2R,V = ΨΛ1/2R,R ∈ Or,

Λ is diagonal,Λ ≥ 0, (Σ−Λ)Σ = 0

.

(35)

Intuitively, (35) means that a critical point W of g(W )is one such that UV T is a rank-` approximation to X?

with ` ≤ r and U and V are equal factors of this rank-`approximation. Let λ1, λ2, . . . , λr denote the diagonals of Λ.Unlike Σ, we note that these diagonals λ1, λ2, . . . , λr are notnecessarily placed in decreasing or increasing order. Actually,this equation (Σ−Λ)Σ = 0 is equivalent to

λi ∈ σi, 0

for all i ∈ 1, 2, . . . , r. Further, we introduce the set ofoptimal solutions:

X :=

W =

[UV

]: U = ΦΣ1/2R,V = ΨΣ1/2R,R ∈ Or

.

(36)

It is clear that the set X containing all the optimal solutions,the set C containing all the critical points and the set Econtaining all the points with balanced factors have the nestingrelationship: X ⊂ C ⊂ E . Before moving to the next section,we provide one more result regarding W ∈ E . The proof ofthe following result is given in Appendix G.

Lemma 6. For any ∆ =

[∆U

∆V

]∈ R(n+m)×r and W ∈ E

where E is defined in (10), we have

‖∆UUT‖2F + ‖∆V V

T‖2F = ‖∆UVT‖2F + ‖∆V U

T‖2F ,(37)

and

∇2ρ(W ) 0. (38)

C. Strict saddle property

Lemma 6 implies that the Hessian of ρ(W ) evaluated atany critical point W is PSD, i.e., ∇2ρ(W ) 0 for all W ∈C. Despite this fact, the following result establishes the strictsaddle property for g(W ).

Theorem 2. Let g(W ) be defined as in (29) with µ > 0 and

rank(X?) = r. Let W =

[UV

]be any critical point satisfying

∇g(W ) = 0, i.e., W ∈ C. Any W ∈ C \ X is a strict saddleof g(W ) satisfying

λmin(∇2g(W )) ≤ −1

2

∥∥∥WWT −W ?W ?T∥∥∥ ≤ −σr(X?).

(39)

Furthermore, g(W ) is not strongly convex at any globalminimum point W ∈ X .

The proof of Theorem 2 is given in Appendix H. We notethat this strict saddle property is also covered in [38, Theorem3], but with much looser bounds (in particular, directly apply-ing [38, Theorem 3] gives λmin(∇2g(W )) ≤ −0.1σr(X

?)rather than λmin(∇2g(W )) ≤ −σr(X?) in (39)). Theorem 2actually implies that g(W ) has no spurious local minima(since all local minima belong to X ) and obeys the strict saddleproperty. With the strict saddle property and lack of spuriouslocal minima for g(W ), the recent results [18], [19] ensurethat gradient descent converges to a global minimizer almostsurely with random initialization. We also note that Theorem 2states that g(W ) is not strongly convex at any global minimumpoint W ∈ X because of the invariance property of g(W ).This is the reason we introduce the distance in (12) and alsothe robust strict saddle property in Definition 9.

D. Extension to over-parameterized case: rank(X?) < r

In this section, we briefly discuss the over-parameterizedscenario where the low-rank matrix X? has rank smaller thanr. Similar to Theorem 2, the following result shows that thestrict saddle property also holds in this case.

Theorem 3. Let X? = ΦΣΨT =∑r′

i=1 σiφiψTi be a

reduced SVD of X? with r′ ≤ r, and let g(W ) be defined

as in (29) with µ > 0. Any W =

[UV

]is a critical point of

g(W ) if and only if W ∈ C with

C :=

W =

[UV

]:U = ΦΛ1/2R,V = ΨΛ1/2R,RRT = Ir′ ,

Λ is diagonal,Λ ≥ 0, (Σ−Λ)Σ = 0

Further, all the local minima (which are also global) belongto the following set

X =

W =

[UV

]:U = ΦΣ1/2R,V = ΨΣ1/2R,

RRT = Ir′

Finally, any W ∈ C \X is a strict saddle of g(W ) satisfying

λmin(∇2g(W )) ≤ −1

2

∥∥∥WWT −W ?W ?T∥∥∥ ≤ −σr′(X?).

The proof of Theorem 3 is given in Appendix I. We note thatthis strict saddle property is also covered in [38, Theorem 3],but with much looser bounds (in particular, directly applying[38, Theorem 3] gives λmin(∇2g(W )) ≤ −0.1σr′(X

?) ratherthan λmin(∇2g(W )) ≤ −σr′(X?) in Theorem 3).

E. Extension to under-parameterized case: rank(X?) > r

We further discuss the under-parameterized case whererank(X?) > r. In this case, (3) is also known as the low-rankapproximation problem as the product UV T forms a rank-rapproximation to X?. Similar to Theorem 2, the followingresult shows that the strict saddle property also holds for g(W )in this scenario.

Page 12: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

12

Theorem 4. Let X? = ΦΣΨT =∑r′

i=1 σiφiψTi be a

reduced SVD of X? with r′ > r and σr(X?) > σr+1(X?).5

Also let g(W ) be defined as in (29) with µ > 0. Any

W =

[UV

]is a critical point of g(W ) if and only if W ∈ C

with

C :=

W =

[UV

]: U = Φ[:,Ω]Λ1/2R,V = Ψ[:,Ω]Λ1/2R,

Λ = Σ[Ω,Ω],RRT = I`,Ω ⊂ 1, 2, . . . , r′, |Ω| = ` ≤ r

where we recall that Φ[:,Ω] is a submatrix of Φ obtainedby keeping the columns indexed by Ω and Σ[Ω,Ω] is an `×` matrix obtained by taking the elements of Σ in rows andcolumns indexed by Ω.

Further, all local minima belong to the following set

X =

W =

[UV

]: Λ = Σ[1 : r, 1 : r],R ∈ Or,

U = Φ[:, 1 : r]Λ1/2R,V = Ψ[:, 1 : r]Λ1/2R

.

Finally, any W ∈ C \X is a strict saddle of g(W ) satisfying

λmin(∇2g(W )) ≤ −(σr(X?)− σr+1(X?)).

The proof of Theorem 4 is given in Appendix J. It followsfrom Eckart-Young-Mirsky theorem [50] that for any W ∈ X ,UV T is the best rank-r approximation to X?. Thus, this strictsaddle property ensures that the local search algorithms ap-plied to the factored problem (29) converge to global optimumwhich corresponds to the best rank-r approximation to X?.

F. Robust strict saddle property

We now consider the revised robust strict saddle propertydefined in Definition 9 for the low-rank matrix factorizationproblem (29). As guaranteed by Theorem 2, g(W ) satisfiesthe strict saddle property for any µ > 0. However, too smalla µ would make analyzing the robust strict saddle propertydifficult. To see this, we denote

f(W ) =1

2

∥∥∥UV T −X?∥∥∥2

F

for convenience. Thus we can rewrite g(W ) as the sum of

f(W ) and ρ(W ). Note that for any W =

[UV

]∈ C where C

is the set of critical points defined in (35), W =

[UM

VM−1

]is

a critical point of f(W ) for any invertible M ∈ Rr×r. Thisfurther implies that the gradient at W reduces to

∇g(W ) = ∇ρ(W ),

which could be very small if µ is very small since ρ(W ) =µ4

∥∥∥UTU − V TV∥∥∥2

F. On the other hand, W could be far

5If σr1 = · · · = σr = · · · = σr2 with r1 ≤ r ≤ r2, then the optimalrank-r approximation to X? is not unique. For this case, the optimal solutionset X for the factorized problem needs to be changed correspondingly, butthe main arguments still hold.

away from any point in X for some M that is not well-conditioned. Therefore, we choose a proper µ controlling theimportance of the regularization term such that for any W thatis not close to the critical points X , g(W ) has large gradient.Motivated by Lemma 4, we choose µ = 1

2 .The following result establishes the robust strict saddle

property for g(W ).

Theorem 5. Let R1,R2,R′3,R′′3 ,R′′′3 be the regions as de-fined in Theorem 1. Let g(W ) be defined as in (29) withµ = 1

2 . Then g(W ) has the following robust strict saddleproperty:

1) For any W ∈ R1, g(W ) satisfies local regularitycondition:

〈∇g(W ),W −W ?R〉 ≥ 1

32σr(X

?) dist2(W ,W ?)

+1

48‖X?‖‖∇g(W )‖2F ,

(40)

where dist(W ,W ?) and R are defined in (12) and (13),respectively.

2) For any W ∈ R2, g(W ) has a directional negativecurvature:

λmin

(∇2g(W )

)≤ −1

4σr(X

?). (41)

3) For any W ∈ R3 = R′3 ∪ R′′3 ∪ R′′′3 , g(W ) has largegradient descent:

‖∇g(W )‖F ≥1

10σ3/2r (X?), ∀ W ∈ R′3; (42)

‖∇g(W )‖F >39

800‖W ‖3, ∀ W ∈ R′′3 ; (43)

‖∇g(W )‖F >1

20

∥∥∥WWT∥∥∥3/2

F, ∀ W ∈ R′′′3 . (44)

The proof is given in Appendix K. We present several re-marks better illustrating the above robust strict saddle propertyto conclude this section.

Remark 8. Both the right hand sides of (43) and (44) dependon W . We can further obtain lower bounds for them byutilizing the fact that W is large enough in both regions R′′3and R′′′3 . Specifically, noting that ‖W ‖ > 20

19‖W?‖ for all

W ∈ R′′3 , we have

‖∇g(W )‖F >39

780

√2‖X?‖3/2

for any W ∈ R′′3 . Similarly, it follows from the fact‖WWT‖F > 20

9 ‖X?‖F for all W ∈ R′′′3 that

‖∇g(W )‖F >√

20

27‖X?‖3/2F

for any W ∈ R′′′3 .

Remark 9. Recall that all the strict saddles of g(W ) areactually rank deficient (see Theorem 2). Thus the regionR2 attempts to characterize all the neighbors of the sad-dle saddles by including all rank deficient points. Actually,(41) holds not only for W ∈ R2, but for all W suchthat σr(W ) ≤

√12σ

1/2r (X?). The reason we add another

Page 13: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

13

constraint controlling the term ‖W ?W ?T‖F is to ensure thisnegative curvature property in the region R2 also holds for thematrix sensing problem discussed in next section. This is thesame reason we add two more constraints ‖W ‖ ≤ 20

19‖W?‖F

and ‖WWT‖F ≤ 109 ‖W

?W ?T‖F for the region R′3.

APPENDIX BPROOF OF LEMMA 1

Denote ax,x? = arg mina′∈G ‖x − a′(x?)‖. Utilizing thedefinition of distance in (4), the regularity condition (5) andthe assumption that µ ≤ 2β, we have

dist2(xt+1,x?)

=∥∥xt+1 − axt+1,x?(x?)

∥∥2

≤ ‖xt − ν∇h(xt)− axt,x?(x?)‖2

= ‖xt − axt,x?(x?)‖2 + ν2 ‖∇h(xt)‖2

− 2ν 〈xt − axt,x?(x?),∇h(xt)〉≤ (1− 2να) dist2(xt,x

?)− ν(2β − ν) ‖∇h(xt)‖2

≤ (1− 2να) dist2(xt,x?)

where the fourth line uses the regularity condition (5) and thelast line holds because ν ≤ 2β. Thus we conclude xt ∈ B(δ)for all t ∈ N if x0 ∈ B(δ) by noting that 0 ≤ 1 − 2να < 1since αβ ≤ 1

4 and ν ≤ 2β.

APPENDIX CPROOF OF PROPOSITION 1

First note that if X? is a critical point of f , then

∇f(X?) = 0.

Now for any X ∈ Rn×m with rank(X) ≤ r, the second orderTaylor expansion gives

f(X) =f(X?) + 〈∇f(X?),X −X?〉

+1

2[∇2f(X)](X −X?,X −X?)

= f(X?) +1

2[∇2f(X)](X −X?,X −X?)

where X = tX? + (1− t)X for some t ∈ [0, 1]. This Taylorexpansion together with (6) (both X and X ′−X? have rankat most 2r) gives

f(X)− f(X?) ≥ a‖X −X?‖2F .

APPENDIX DPROOF OF LEMMA 2

Any critical point (see Definition 1) W =

[UV

]satisfies

∇G(W ) = 0, i.e.,

∇f(UV T)V + µU(UTU − V TV

)= 0, (45)

(∇f(UV T))TU − µV(UTU − V TV

)= 0. (46)

By (46), we obtain

(∇f(UV T))TU = µ(UTU − V TV

)V T.

Multiplying (45) by UT and plugging it in the expression forUT∇f(UV T) from the above equation gives(UTU − V TV

)V TV +UTU

(UTU − V TV

)= 0,

which further implies

UTUUTU = V TV V TV .

In order to show (11), note that UTU and V TV are theprincipal square roots (i.e., PSD square roots) of UTUUTUand V TV V TV , respectively. Utilizing the result that a PSDmatrix A has a unique PSD matrix B such that Bk = A forany k ≥ 1 [50, Theorem 7.2.6], we obtain

UTU = V TV

for any critical point W .

APPENDIX EPROOF OF LEMMA 4

We first rewrite the objective function g(W ):

g(W ) =1

2

∥∥∥UV T −U?V ?T∥∥∥2

F+µ

4

∥∥∥UTU − V TV∥∥∥2

F

≥ minµ, 1

2(∥∥∥UV T −U?V ?T

∥∥∥2

F+

1

4

∥∥∥UTU − V TV∥∥∥2

F

)= minµ, 1

2(

1

4

∥∥∥WWT −W ?W ?T∥∥∥2

F+ g′(W )

),

where the second line attains the equality when µ = 12 , and

g′(W ) in the last line is defined as

g′(W ) :=1

2

∥∥∥UV T −U?V ?T∥∥∥2

F− 1

4

∥∥∥UUT −U?U?T∥∥∥2

F

− 1

4

∥∥∥V V T − V ?V ?T∥∥∥2

F+

1

4

∥∥∥UTU − V TV∥∥∥2

F.

We further show g′(W ) is always nonnegative:

g′(W ) =1

2

∥∥∥UV T −U?V ?T∥∥∥2

F− 1

4

∥∥∥UUT −U?U?T∥∥∥2

F

− 1

4

∥∥∥V V T − V ?V ?T∥∥∥2

F+

1

4

∥∥∥UTU − V TV∥∥∥2

F.

=1

2

∥∥∥UV T −U?V ?T∥∥∥2

F+

1

2

∥∥∥UTU?∥∥∥2

F+

1

2

∥∥∥V TV ?∥∥∥2

F

− 1

2trace

(UTUV TV

)− 1

4

∥∥∥U?U?T∥∥∥2

F− 1

4

∥∥∥V ?V ?T∥∥∥2

F

=1

2

∥∥∥UTU? − V TV ?∥∥∥2

F+

1

2

∥∥∥U?V ?T∥∥∥2

F

− 1

4

∥∥∥U?U?T∥∥∥2

F− 1

4

∥∥∥V ?V ?T∥∥∥2

F

=1

2

∥∥∥UTU? − V TV ?∥∥∥2

F≥ 0,

where the last line follows because U?TU? = V ?TV ?. Thus,we have

g(W ) ≥ minµ4,

1

8∥∥∥WWT −W ?W ?T

∥∥∥2

F,

and

g(W ) =1

8

∥∥∥WWT −W ?W ?T∥∥∥2

F+

1

4

∥∥∥UTU? − V TV ?∥∥∥2

F

if µ = 12 .

Page 14: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

14

APPENDIX FPROOF OF LEMMA 5

We first repeat that X? = ΦΣΨT = is a reduced SVDof X?. We separate U into two parts—the projections ontothe column space of Φ and its orthogonal complement—bydenoting U = ΦΛ

1/21 R1 +E1 with R1 ∈ Or, ET

1 Φ = 0 andΛ1 being a r× r diagonal matrix with non-negative elementsalong its diagonal. Similarly, denote V = ΨΛ

1/22 R2 + E2,

where R2 ∈ Or, ET2 Ψ = 0, Λ2 is a r × r diagonal matrix

with non-negative elements along its diagonal. Recall that any

critical point W =

[UV

]satisfies

∇Uρ(U ,V ) = UUTU −X?V = 0,

∇V ρ(U ,V ) = V V TV −X?TU = 0.

Plugging U = ΦΛ1/21 R1 + E1 and V = ΨΛ

1/22 R2 + E2

into the above equations gives

ΦΛ3/21 R1 + ΦΛ

1/21 R1E

T1E1 +E1R

T1 Λ1R1

+E1ET1E1 −ΦΣΛ

1/22 R2 = 0,

(47)

ΨΛ3/22 R2 + ΨΛ

1/22 R2E

TE2 +E2RT2 Λ2R2

+E2ET2E2 −ΨΣΛ

1/21 R1 = 0.

(48)

Since E1 is orthogonal to Φ, (47) further implies that

ΦΛ3/21 R1 + ΦΛ

1/21 R1E

T1E1 −ΦΣΛ

1/22 R2 = 0, (49)

E1RT1 Λ1R1 +E1E

T1E1 = 0. (50)

From (50), we have⟨E1R

T1 Λ1R1 +E1E

T1E1,E1

⟩=⟨RT

1 Λ1R1,ET1E1

⟩+ ‖E1‖2F = 0,

which further implies ‖E1‖2F = 0 by noting that⟨RT

1 Λ1R1,ET1E1

⟩≥ 0 since it is the inner product between

two PSD matrices. Thus E1 = 0. With a similar argument wealso have E2 = 0.

With E1 = E2 = 0, (49) reduces to

ΦΛ3/21 R1 −ΦΣΛ

1/22 R2 = 0.

Since Φ is orthogonal and R1 ∈ Or, the above equationimplies that

Λ3/21 = ΣΛ

1/22 R2R

T1 .

Let Ω denote the set of locations of the non-zero diagonals inΛ2, i.e., Λ2[i, i] > 0 for all i ∈ Ω. Then [RT

1 ]Ω = [RT2 ]Ω since

otherwise ΣΛ1/22 R2R

T1 is not a diagonal matrix anymore.

Then we have

Λ3/21 = ΣΛ

1/22 (51)

implying that the set of the locations of non-zero diagonals inΛ1 is identical to Ω. A similar argument applied to (48) gives

Λ3/22 = ΣΛ

1/21 . (52)

Noting that (51) implies Λ3/21 [i, i] = Σ[i, i]Λ

1/22 [i, i] and (52)

implies Λ3/22 [i, i] = Σ[i, i]Λ

1/21 [i, i], for all i ∈ Ω we have

Λ1[i, i] = Λ2[i, i] = Σ[i, i]. For i /∈ Ω, we have Λ1[i, i] =Λ2[i, i] = 0. Thus Λ1 = Λ2. For convenience, denote Λ =Λ1 = Λ2 with Λ[i, i] = λi.

Finally, we note that U = ΦΛ1/2R1 =∑i∈Ω λiφiR1[i, :]

and V = ΨΛ1/2R2 =∑i∈Ω λiψiR2[i, :] implying that only

[RT1 ]Ω and [RT

2 ]Ω play a role in U and V , respectively. Thusone can set R1 = R2 since we already proved [RT

1 ]Ω =[RT

2 ]Ω.

APPENDIX GPROOF OF LEMMA 6

Utilizing the result that any point W ∈ E satisfiesW

TW = UTU − V TV = 0, we directly obtain

‖∆UUT‖2F + ‖∆V V

T‖2F = ‖∆UVT‖2F + ‖∆V U

T‖2F

since ‖∆UUT‖2F = trace

(∆UU

TU∆U

)=

trace(∆UV

TV∆U

)= ‖∆UV

T‖2F (and similarlyfor the other two terms).

We then rewrite the last two terms in (32) as⟨W ∆

T,∆WT

⟩+⟨WW

T,∆∆T

⟩=⟨W

T∆,∆TW

⟩+⟨W

T∆, W

T∆⟩

=⟨W

T∆, W

T∆ + ∆TW

⟩=

1

2

⟨W

T∆ + ∆TW , W

T∆ + ∆TW

⟩+

1

2

⟨W

T∆−∆TW , W

T∆ + ∆TW

⟩=

1

2

∥∥∥WT∆ + ∆TW

∥∥∥2

F

where the last line holds because⟨A−AT,A+AT

⟩= 0.

Plugging these with the factor WTW = 0 into the Hessian

quadrature form [∇2ρ(W )](∆,∆) defined in (32) gives

[∇2ρ(W )](∆,∆) ≥ µ

2

∥∥∥WT∆ + ∆TW

∥∥∥2

F≥ 0.

This implies that the Hessian of ρ evaluated at any W ∈ E isPSD, i.e., ∇2ρ(W ) 0.6

APPENDIX HPROOF OF THEOREM 2 (STRICT SADDLE PROPERTY FOR

(29))

We begin the proof of Theorem 2 by characterizing any

W ∈ C \ X . For this purpose, let W =

[UV

], where

U = ΦΛ1/2R,V = ΨΛ1/2R,R ∈ Or,Λ is diagonal,Λ ≥0, (Σ − Λ)Σ = 0, and rank(Λ) < r. Denote the cor-

responding optimal solution W ? =

[U?

V ?

], where U? =

ΦΣ1/2R,V ? = ΨΣ1/2R. Let

k = arg maxi

σi − λi

6This can also be observed since any critical point W is a global minimumof ρ(W ), which directly indicates that ∇2ρ(W ) 0.

Page 15: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

15

denote the location of the first zero diagonal element in Λ.Noting that λi ∈ σi, 0, we conclude that

λk = 0, φTkU = 0, ψT

kV = 0. (53)

In words, φk and ψk are orthogonal to U and V , respectively.Let α ∈ Rr be the eigenvector associated with the smallesteigenvalue of WTW . Such α simultaneously lives in the nullspaces of U and V since W is rank deficient indicating

0 = αTWTWα = αTUTUα+αTV TV α,

which further impliesαTUTUα = 0,

αTV TV α = 0.(54)

With this property, we construct ∆ by setting ∆U = φkαT

and ∆V = ψkαT. Now we show that W is a strict saddle

by arguing that g(W ) has a strictly negative curvature alongthe constructed direction ∆, i.e., [∇2g(W )](∆,∆) < 0. Tothat end, we compute the five terms in (31) as follows∥∥∥∆UV

T +U∆TV

∥∥∥2

F= 0 (since (54)),⟨

UV T −X?,∆U∆TV

⟩= λk − σk = −σk (since (53)),⟨

WTW , ∆

T∆⟩

= 0 (since WTW = 0),⟨

W ∆T,∆WT

⟩= trace

(∆

TW∆TW

)= 0,⟨

WWT,∆∆T

⟩= trace

(W

T∆∆TW

)= 0,

where WTW = 0 since UTU − V TV = 0, the last

two lines utilize ∆TW = 0 (or W

T∆ = 0) because

∆TW = αφT

kU − αψTkV = 0 (see (53)). Plugging these

terms into (31) gives

[∇2g(W )](∆,∆)

=∥∥∥∆UV

T +U∆TV

∥∥∥2

F+ 2

⟨UV T −X?,∆U∆T

V

⟩+ µ

⟨W

TW , ∆

T∆⟩

+ µ⟨W ∆

T,∆WT

⟩+ µ

⟨WW

T,∆∆T

⟩= −2σk.

The proof of the strict saddle property is completed by notingthat

‖∆‖2F = ‖∆U‖2F + ‖∆V ‖2F =∥∥φkαT

∥∥2

F+∥∥ψkαT

∥∥2

F= 2,

which further implies

λmin

(∇2g(W )

)≤ [∇2g(W )](∆,∆)

‖∆‖2F≤ −2σk

2

= −‖Λ−Σ‖ = −1

2

∥∥∥WWT −W ?W ?T∥∥∥ ,

where the first equality holds because

‖Λ−Σ‖ = maxiσi − λi = σk,

and the second equality follows since

WWT −W ?W ?T =1

2Q (Λ−Σ)QT,

Q =

[Φ/√

2

Ψ/√

2

], QTQ = I.

We finish the proof of (39) by noting that

σk = σk(X?) ≥ σr(X?).

Now suppose W ? ∈ X . Applying (38), which states thatthe Hessian of ρ evaluated at any critical point W is PSD, wehave

[∇2g(W ?)][∆,∆]

=∥∥∥∆UV

?T +U?∆TV

∥∥∥2

F+ 2

⟨U?V ?T −X?,∆U∆T

V

⟩+ [∇2ρ(W ?)][∆,∆]

≥∥∥∥∆UV

?T +U?∆TV

∥∥∥2

F+ 2

⟨U?V ?T −X?,∆U∆T

V

⟩≥ 0

since U?V ?T−X? = 0. We show g is not strongly convex atW ? by arguing that λmin(∇2g(W ?)) = 0. For this purpose,we first recall that U? = ΦΣ1/2,V ? = ΨΣ1/2, where weassume R = I without loss of generality. Let e1, e2, . . . , erbe the standard orthobasis for Rr, i.e., e` is the `-th column of

the r× r identity matrix. Construct ∆(i,j) =

[∆

(i,j)U

∆(i,j)V

], where

∆(i,j)U = U?eje

Ti −U

?eieTj , ∆

(i,j)V = V ?eje

Ti −U

?eieTj ,

for any 1 ≤ i < j ≤ r. That is, the `-th columns of thematrices ∆

(i,j)U and ∆

(i,j)V are respectively given by

∆(i,j)U [:, `] =

σ

1/2j φj , ` = i,

−σ1/2i φi, ` = j,

0, otherwise,

,

∆(i,j)V [:, `] =

σ

1/2j ψj , ` = i,

−σ1/2i ψi, ` = j,

0, otherwise,

for any 1 ≤ i < j ≤ r. We then compute the five terms in(31) as follows∥∥∥∆(i,j)

U V ?T +U?(∆(i,j)V )T

∥∥∥2

F

=∥∥∥σ1/2

i σ1/2j

(φjψ

Ti − φiψ

Tj + φiψ

Tj − φjψ

Ti

)∥∥∥2

F= 0,

〈U?V ?T −X?,∆(i,j)U (∆

(i,j)V )T〉 = 0 (as U?V ?T −X? = 0),

〈W?TW ?, ∆

T

(i,j)∆(i,j)〉 = 0 (as W?TW ? = 0),

〈W?∆

T

(i,j),∆(i,j)W?T〉 = trace(W

?T∆(i,j)W

?T∆(i,j)) = 0,

〈W?W

?T,∆(i,j)∆

T(i,j)〉 = trace(W

?T∆(i,j)∆

T(i,j)W

?) = 0,

where the last two lines hold because

W?T

∆(i,j) = U?TU?(ejeTi − eieT

j )− V ?TV ?(ejeTi − eieT

j )

= 0

Page 16: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

16

since U?TU? = V ?TV ?.Thus, we obtain the Hessian evaluated at the optimal

solution point W ? along the direction ∆(i,j):[∇2g(W ?)

] (∆(i,j),∆(i,j)

)= 0

for all 1 ≤ i < j ≤ r. This proves that g(W ) is not stronglyconvex at a global minimum point W ? ∈ X .

APPENDIX IPROOF OF THEOREM 3 (STRICT SADDLE PROPERTY OF

g(W ) WHEN OVER-PARAMETERIZED)

Let X? = ΦΣΨT =∑r′

i=1 σiφiψTi be a reduced SVD

of X? with r′ ≤ r. Using an approach similar to that inAppendix F for proving Lemma 5, we can show that anyW =[UV

]is a critical point of g(W ) if and only if W ∈ C with

C =

W =

[UV

]: U = ΦΛ1/2R,V = ΨΛ1/2R,RRT = Ir′ ,

Λ is diagonal,Λ ≥ 0, (Σ−Λ)Σ = 0

.

Recall that

X =

W =

[UV

]:U = ΦΣ1/2R,V = ΨΣ1/2R,

RRT = Ir′

.

It is clear that X is the set of optimal solutions since for anyW ∈ X , g(W ) achieves its global minimum, i.e., g(W ) = 0.

Using an approach similar to that in Appendix H for provingTheorem 2, we can show that any W ∈ C\X is a strict saddlesatisfying

λmin

(∇2g(W )

)≤ −σr′(X?).

APPENDIX JPROOF OF THEOREM 4 (STRICT SADDLE PROPERTY OF

g(W ) WHEN UNDER-PARAMETERIZED)

Let X? = ΦΣΨT =∑r′

i=1 σiφiψTi be a reduced SVD of

X? with r′ > r and σr(X?) > σr+1(X?). Using an approachsimilar to that in Appendix F for proving Lemma 5, we can

show that any W =

[UV

]is a critical point of g(W ) if and

only if W ∈ C with

C =

W =

[UV

]: U = Φ[:,Ω]Λ1/2R,V = Ψ[:,Ω]Λ1/2R,

Λ = Σ[Ω,Ω],RRT = I`,Ω ⊂ 1, . . . , r′, |Ω| = ` ≤ r.

Intuitively, a critical point is one such that UV T is a rank-`approximation to X? with ` ≤ r and U and V are equalfactors of their product UV T.

It follows from the Eckart-Young-Mirsky theorem [50] thatthe set of optimal solutions is given by

X =

W =

[UV

]: U = Φ[:, 1 : r]Λ1/2R,

V = Ψ[:, 1 : r]Λ1/2R,Λ = Σ[1 : r, 1 : r],R ∈ Or.

Now we characterize any W ∈ C \ X by letting W =

[UV

],

where

U = Φ[:,Ω]Λ1/2R,V = Ψ[:,Ω]Λ1/2R,

Λ = Σ[Ω,Ω],R ∈ R`×r,RRT = I`,

Ω ⊂ 1, 2, . . . , r′, |Ω| = ` ≤ r,Ω 6= 1, 2, . . . , r.

Let α ∈ Rr be the eigenvector associated with the smallesteigenvalue of UTU (or V TV ). By the typical structures inU and V (see the above equation), we have

‖V α‖2F = ‖Uα‖2F = σ2r(U)

=

σj(X

?), |Ω| = r and j = max Ω0, |Ω| < r,

(55)

where j > r because Ω 6= 1, 2, . . . , r. Note that there alwaysexists an index

i ∈ 1, 2, . . . , r, i 6= Ω

since Ω 6= 1, 2, . . . , r and |Ω| ≤ r. We construct ∆ bysetting

∆U = φiαT, ∆V = ψiα

T.

Since i /∈ Ω, we have

UT∆U = UTφiαT = 0,

V T∆V = V TψiαT = 0.

(56)

We compute the five terms in (31) as follows∥∥∥∆UVT +U∆T

V

∥∥∥2

F

=∥∥∥∆UV

T∥∥∥2

F+∥∥∥U∆T

V

∥∥∥2

F+ 2 trace

(UT∆UV

T∆V

)= 2σ2

r(U),⟨UV T −X?,∆U∆T

V

⟩=⟨UV T −X?,φiψ

Ti

⟩= −

⟨X?,φiψ

Ti

⟩= −σi(X?),⟨

WTW , ∆

T∆⟩

= 0 (since WTW = 0),⟨

W ∆T,∆WT

⟩= trace

(W

T∆WT∆

)= 0,⟨

WWT,∆∆T

⟩= trace

(W

T∆∆TW

)= 0,

where the last equality in the first line holds because

UT∆U = 0 (see (56)) and∥∥∥∆UV

T∥∥∥2

F=∥∥∥U∆T

V

∥∥∥2

F=

σ2r(U) (see (55)), W

TW = 0 in the third line holds since

UTU − V TV = 0, and WT∆ = 0 in the fourth and last

lines holds because

WT∆ = UT∆U − V T∆V = 0.

Now plugging these terms into (31) yields

[∇2g(W )](∆,∆)

= ‖∆UVT +U∆T

V ‖2F + 2〈UV T −X?,∆U∆TV 〉

+ µ(〈WTW , ∆

T∆〉+ 〈W ∆

T,∆WT〉+ 〈WW

T,∆∆T〉)

= −2(σi(X?)− σ2

r(U)).

Page 17: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

17

The proof of the strict saddle property is completed by notingthat

‖∆‖2F = ‖∆U‖2F + ‖∆V ‖2F = 2,

which further implies

λmin

(∇2g(W )

)≤ −2

σi(X?)− σ2

r(U)

‖∆‖2F≤ − (σr(X

?)− σr+1(X?)) ,

where the last inequality holds because of (55) and becausei ≤ r.

APPENDIX KPROOF OF THEOREM 5 (ROBUST STRICT SADDLE FOR

g(W ))

We first establish the following useful results.

Lemma 7. For any two PSD matrices A,B ∈ Rn×n, we have

σn(A) trace(B) ≤ trace (AB) ≤ ‖A‖ trace(B).

Proof of Lemma 7. Let A = Φ1Λ1ΦT1 and B = Φ2Λ2Φ

T2

be the eigendecompositions of A and B, respectively. HereΛ1 (Λ2) is a diagonal matrix with the eigenvalues of A (B)along its diagonal. We first rewrite trace (AB) as

trace (AB) = trace(Λ1Φ

T1 Φ2Λ2Φ

T2 Φ1

).

Noting that Λ1 is a diagonal matrix, we have

trace(Λ1Φ

T1 Φ2Λ2Φ

T2 Φ1

)≥ min

iΛ1[i, i] · trace

(ΦT

1 Φ2Λ2ΦT2 Φ1

)= σn(A) trace(B).

The other direction follows similarly.

Corollary 2. For any two matricesA ∈ Rr×r andB ∈ Rn×r,we have

σr(A)‖B‖F ≤ ‖AB‖F ≤ ‖A‖‖B‖F .

We provide one more result before proceeding to prove themain theorem.

Lemma 8. Suppose A,B ∈ Rn×r such that ATB =

BTA 0 is PSD. If ‖A−B‖ ≤√

22 σr(B), we have⟨(

AAT −BBT)A,A−B

⟩︸ ︷︷ ︸

(ℵ1)

≥ 1

16(trace((A−B)T(A−B)BTB)︸ ︷︷ ︸

(ℵ2)

+ ‖AAT −BBT‖2F︸ ︷︷ ︸(ℵ3)

).

(57)

Proof. Denote E = A −B. We first rewrite the terms (ℵ1),(ℵ2) and (ℵ3) as follows

(ℵ1) = trace

((ETE

)2

+ 3ETEETB +(ETB

)2

+ETEBTB

),

(ℵ2) = trace(ETEBTB

),

(ℵ3) = trace

((ETE

)2

+ 4ETEETB + 2(ETB

)2

+ 2ETEBTB

),

where ETB = ATB −BTB = BTE. Now we have

(ℵ1)− 1

16(ℵ2)− 1

16(ℵ3)

= trace

(15

16

(ETE

)2

+11

4ETEETB +

7

8

(ETB

)2

+13

16ETEBTB

)=

∥∥∥∥∥√

121

56ETE +

√7

8ETB

∥∥∥∥∥2

F

+ trace

(13

16ETEBTB − 137

112ETEETE

)≥ trace

(13

16ETEσ2

r(B)− 137

112ETE‖E‖2

)≥ trace

((13

16− 137

112

1

2

)σ2r(B)ETE

)≥ 0,

where the third line follows from Lemma 7 and the fourth lineholds because by assumption ‖E‖ ≤

√2

2 σr(B).

Now we turn to prove the main results. Recall that µ = 12

throughout the proof.

A. Regularity condition for the region R1

It follows from Lemma 3 that WTW ?R = RTW ?TWis PSD, where R = arg minR′∈Or

‖W −W ?R′‖2F . We firstperform the change of variable W ?R → W ? to avoid Rin the following equations. With this change of variable wehave instead WTW ? = W ?TW is PSD. We now rewrite thegradient ∇g(W ) as follows:

∇g(W ) =

[0 UV T −U?V ?T

V UT − V ?U?T 0

]W

+ µW (WTW )

=1

2

(WWT −W ?W ?T

)W +

1

2W

?W

?TW

+ (µ− 1

2)WW

TW

=1

2

(WWT −W ?W ?T

)W +

1

2W

?W

?TW .

(58)

Page 18: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

18

Plugging this into the left hand side of (40) gives

〈∇g(W ),W −W ?〉

=1

2

⟨(WWT −W ?W ?T

)W ,W −W ?

⟩+

1

2

⟨W

?W

?TW ,W −W ?

⟩=

1

2

⟨(WWT −W ?W ?T

)W ,W −W ?

⟩+

1

2

⟨W

?W

?T,WWT

⟩(59)

where the last line follows from the fact that W ?TW?

= 0.We first show the first term in the right hand side of the aboveequation is sufficiently large⟨(

WWT −W ?W ?T)W ,W −W ?

⟩≥ 1

16trace

((W −W ?)T(W −W ?)W ?TW ?

)+

1

16

∥∥∥WWT −W ?W ?T∥∥∥2

F

≥ 1

16σr(W

?TW ?) ‖W −W ?‖2F

+1

16

∥∥∥WWT −W ?W ?T∥∥∥2

F

=1

8σr(X

?) ‖W −W ?‖2F +1

16

∥∥∥WWT −W ?W ?T∥∥∥2

F,

(60)

where the first inequality follows from Lemma 8 sinceWTW ? = W ?TW is PSD and ‖W −W ?‖ ≤σ

1/2r (X?) =

√2

2 σr(W?), the second inequality follows from

Lemma 7, and the last line holds because σr(W

?TW

?)

=

σr

(U?TU?

+ V?TV?)

= 2σr (Σ) = 2σr (X?). We thenshow the second term in the right hand side of (59) is lowerbounded by⟨

W?W

?T,WWT

⟩=

1

2 ‖X?‖

∥∥∥W ?TW

?∥∥∥ trace

(W

?TWWTW

?)

≥ 1

2 ‖X?‖trace

(W

?TW

?W

?TWWTW

?)

=1

2 ‖X?‖

∥∥∥W ?W

?TW∥∥∥2

F

(61)

where the first line holds because∥∥∥W ?T

W?∥∥∥ =∥∥∥U?T

U?

+ V?TV?∥∥∥ = 2 ‖Σ‖ = 2 ‖X?‖, and the inequality

follows from Lemma 7.On the other hand, we attempt to control the gradient of

g(W ). To that end, it follows from (58) that

‖∇g(W )‖2F

=1

4

∥∥∥(WWT −W ?W ?T)W + W

?W

?TW∥∥∥2

F

≤ 12

47

∥∥∥(WWT −W ?W ?T)W∥∥∥2

F+ 12

∥∥∥W ?W

?TW∥∥∥2

F

≤ 12

47‖W ‖2

∥∥∥WWT −W ?W ?T∥∥∥2

F+ 12

∥∥∥W ?W

?TW∥∥∥2

F,

(62)

where the first inequality holds since (a+ b)2 ≤ 1+εε a2 +(1+

ε)b2 for any ε > 0.Combining (59)-(62), we can conclude the proof of (40) as

long as we can show the following inequality:

1

8

∥∥∥WWT −W ?W ?T∥∥∥2

F

≥ 1

47

‖W ‖2

‖X?‖

∥∥∥WWT −W ?W ?T∥∥∥2

F.

To that end, we upper bound ‖W ‖ as follows:

‖W ‖ ≤ ‖W ?‖+ ‖W −W ?‖≤√

2σ1/21 (X?) + ‖W −W ?‖F

≤ (√

2 + 1)σ1/21 (X?)

since ‖W ?‖ =√

2σ(1/2)1 (X?) and dist(W ,W ?) ≤

σ(1/2)r (X?). This completes the proof of (40).

B. Negative curvature for the region R2

To show (41), we utilize a strategy similar to that used inAppendix H for proving the strict saddle property of g(W )by constructing a direction ∆ such that the Hessian evaluatedat W along this direction is negative. For this purpose, denote

Q =

[Φ/√

2

Ψ/√

2

], (63)

where we recall that Φ and Ψ consist of the left and rightsingular vectors of X?, respectively. The optimal solutionW ?

has a compact SVD W ? = Q(√

2Σ1/2)R. For notationalconvenience, we denote Σ = 2Σ, where Σ is a diagonalmatrix whose diagonal entries in the upper left corner areσ1, . . . , σr.

For any W , we can always divide it into two parts, theprojections onto the column spaces of Q and its orthogonalcomplement, respectively. Equivalently, we can write

W = QΛ1/2R+E, (64)

where QΛ1/2R is a compact SVD form representing the

projection of W onto the column space of Q, and ETQ = 0(i.e., E is orthogonal to Q). Here R ∈ Or and Λ is a diagonalmatrix whose diagonal entries in the upper left corner areλ1, . . . , λr, but the diagonal entries are not necessarily placedeither in decreasing or increasing order. In order to characterizethe neighborhood near all strict saddles C \ X , we considerW such that σr(W ) ≤

√38σ

1/2r (X?). Let k := arg mini λi

denote the location of the smallest diagonal entry in Λ. It isclear that

λk ≤ σ2r(W ) ≤ 3

8σr(X

?). (65)

Let α ∈ Rr be the eigenvector associated with the smallesteigenvalue of WTW .

Recall that µ = 12 . We show that the function g(W ) at W

has directional negative curvature along the direction

∆ = qkαT. (66)

Page 19: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

19

We repeat the Hessian evaluated at W for ∆ as follows

[∇2g(W )](∆,∆)

=∥∥∥∆UV

T +U∆TV

∥∥∥2

F︸ ︷︷ ︸Π1

+2⟨UV T −X?,∆U∆T

V

⟩︸ ︷︷ ︸

Π2

+1

2

⟨∆W

T,∆WT

⟩︸ ︷︷ ︸

Π3

+1

2

⟨W ∆

T,∆WT

⟩︸ ︷︷ ︸

Π4

+1

2

⟨WW

T,∆∆T

⟩︸ ︷︷ ︸

Π5

.

The remaining part is to bound the five terms.Bounding terms Π1, Π3 and Π4: We first rewrite these threeterms:

Π1 = ‖∆UVT‖2F + ‖U∆T

V ‖2F + 2⟨U∆T

V ,∆UVT⟩,

Π3 =⟨∆W

T,∆WT

⟩= ‖∆UU

T‖2F + ‖∆V VT‖2F

− ‖∆UVT‖2F − ‖∆V U

T‖2F ,

Π4 =⟨U∆T

U ,∆UUT⟩

+⟨V∆T

V ,∆V VT⟩

− 2⟨U∆T

V ,∆UVT⟩

≤ ‖∆UUT‖2F + ‖∆V V

T‖2F − 2⟨U∆T

V ,∆UVT⟩,

which implies

Π1 +1

2Π3 +

1

2Π4

≤ ‖∆UVT‖2F + ‖U∆T

V ‖2F + ‖∆UUT‖2F + ‖∆V V

T‖2F

− 1

2‖∆UV

T‖2F −1

2‖∆V U

T‖2F +⟨U∆T

V ,∆UVT⟩

= ‖W∆T‖2F −1

2

∥∥∥∆UVT −U∆T

V

∥∥∥2

F

≤ ‖W∆T‖2F .(67)

Noting that ∆T∆ = αqTk qkα

T = ααT, we now compute‖W∆T‖2F as

‖W∆T‖2F = trace(WTW∆T∆

)= trace

(WTWααT

)= σ2

r(W ).

Plugging this into (67) gives

Π1 +1

2Π3 +

1

2Π4 ≤ σ2

r(W ). (68)

Bounding terms Π2 and Π5: To obtain an upper bound forthe term Π2, we first rewrite it as follows

Π2 =⟨UV T −X?,∆U∆T

V

⟩=

1

2

⟨[0 UV T −U?V ?T

V UT − V ?U?T 0

],∆∆T

⟩=

1

4

⟨WWT −W ?W ?T,∆∆T

⟩− 1

4

⟨WW

T∆∆T

⟩+

1

4

⟨W

?W

?T,∆∆T

⟩.

We then have

2Π2 +1

2Π5 =

1

2

⟨WWT −W ?W ?T,∆∆T

⟩+

1

2

⟨W

?W

?T,∆∆T

⟩.

(69)

To bound these two terms in the above equation, we note that

∆∆T =

r∑i=1

α2i qkq

Tk = qkq

Tk =

1

2

[φkφ

Tk φkψ

Tk

ψkφTk ψkψ

Tk

].

Then we have⟨W

?W

?T,∆∆T

⟩=

1

2

⟨[ΦΣΦT −ΦΣΨT

−ΨΣΦT ΨΣΨT

],

[φkφ

Tk φkψ

Tk

ψkφTk ψkψ

Tk

]⟩= 0,

and⟨WWT −W ?W ?T,∆∆T

⟩=⟨QΛQT − 2QΛ

1/2RET +EET −QΣQT, qkq

Tk

⟩= λk − σk

where the last utilizes the fact that ETqk = 0 since E isorthogonal to Q.

Plugging these into (69) gives

2Π2 +1

2Π5 =

1

2(λk − σk). (70)

Merging together: Putting (68) and (70) together yields

[∇2g(W )](∆,∆) = Π1 +1

2Π3 +

1

2Π4 + 2Π2 +

1

2Π5

≤ σ2r(W ) +

1

2(λk − σk)

≤ 1

2σr(X

?) +1

2(1

2σr(X

?)− 2σr(X?))

≤ −1

4σr(X

?),

where the third line follows because by assumption σr(W ) ≤√12σ

1/2r (X?), by construction λk ≤ 1

2σr(X?) (see (65)), and

σk ≥ σr = 2σr(X?). This completes the proof of (41).

C. Large gradient for the region R′3 ∪R′′3 ∪R′′′3 :In order to show that g(W ) has a large gradient in the three

regions R′3 ∪ R′′3 ∪ R′′′3 , we first provide a lower bound forthe gradient. By (58), we have

‖∇g(W )‖2F

=1

4

∥∥∥(WWT −W ?W ?T)W + W

?W

?TW∥∥∥2

F

=1

4

(∥∥∥(WWT −W ?W ?T)W∥∥∥2

F+∥∥∥W ?

W?TW∥∥∥2

F

)+

1

2

⟨(WWT −W ?W ?T

)W , W

?W

?TW⟩

=1

4

(∥∥∥(WWT −W ?W ?T)W∥∥∥2

F+∥∥∥W ?

W?TW∥∥∥2

F

)+

1

2

⟨WWTWWT, W

?W

?T⟩

≥ 1

4

∥∥∥(WWT −W ?W ?T)W∥∥∥2

F,

(71)

Page 20: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

20

where the third equality follows because W ?TW?

=U?TU? − V ?TV ? = 0 and the last line utilizes the fact thatthe inner product between two PSD matrices is nonnegative.

1) Large gradient for the regionR′3:: To show ‖∇g(W )‖2Fis large for any W ∈ R′3, again, for any W ∈ R(n+m)×r, weutilize (64) to write W = QΛ

1/2R+E, where Q is defined

in (63), QΛ1/2R is a compact SVD form representing the

projection of W onto the column space of Q, and ETQ = 0(i.e., E is orthogonal to Q). Plugging this form of W intothe last term of (71) gives∥∥∥(WWT −W ?W ?T

)W∥∥∥2

F=∥∥∥QΛ

1/2(Λ−Σ)R+QΛ

1/2REET +ERTΛR+EETE

∥∥∥2

F

=∥∥∥QΛ

1/2(Λ−Σ)R+QΛ

1/2REET

∥∥∥2

F

+∥∥∥ERTΛR+EETE

∥∥∥2

F(72)

since Q is orthogonal to E. The remaining part is to showat least one of the two terms is large for any W ∈ R′3 byconsidering the following two cases.

Case I: ‖E‖2F ≥425σr(X

?). As E is large, we bound thesecond term in (72):∥∥∥ERTΛR+EETE

∥∥∥2

F≥ σ2

r

(RTΛR+ETE

)‖E‖2F

= σ4r (W ) ‖E‖2F

≥ (1

2)2 4

25σ3r(X?) =

1

25σ3r(X?),

(73)

where the first inequality follows from Corollary 2, the firstequality follows from the fact WTW = RTΛR+ETE, andthe last inequality holds because by assumption that σ2

r(W ) ≥12σr(X

?) and ‖E‖2F ≥425σr(X

?).Case II: ‖E‖2F ≤

425σr(X

?). In this case, we start bybounding the diagonal entries in Λ. First, utilizing Weyl’sinequality for perturbation of singular values [50, Theorem3.3.16] gives ∣∣∣σr(W )−min

1/2

i

∣∣∣ ≤ ‖E‖2,which implies

miniλ

1/2

i ≥ σr(W )− ‖E‖2 ≥√

1

2σ1/2r (X?)− 2

5σ1/2r (X?),

(74)

where we utilize ‖E‖2 ≤ ‖E‖F ≤ 25σ

1/2r (X?). On the other

hand,

dist(W ,W ?) ≤∥∥∥Q(Λ

1/2 −Σ1/2

)R+E∥∥∥F

≤∥∥∥Q(Λ

1/2 −Σ1/2

)R∥∥∥F

+ ‖E‖F ,

which together with the assumption that dist(W ,W ?) ≥σ

1/2r (X?) gives∥∥∥Λ1/2 −Σ

1/2∥∥∥F≥ σ1/2

r (X?)− 2

5σ1/2r (X?) =

3

5σ1/2r (X?).

We now bound the first term in (72):∥∥∥QΛ1/2

(Λ−Σ)R+QΛ1/2REET

∥∥∥F

≥ miniλ

1/2

i

∥∥∥(Λ−Σ)R+REET∥∥∥F

≥ miniλ

1/2

i

(∥∥(Λ−Σ)R∥∥F−∥∥∥REET

∥∥∥F

)≥ (

√1

2− 2

5)

((√

2 +

√1

2− 2

5

)3

5− 4

25

)σ3/2r (X?)

(75)

where the third line holds because ‖EET‖F ≤ ‖E‖2F ≤425σr(X

?), mini λ1/2

i ≥(√

12 −

25

1/2r (X?) by (74), and

∥∥Λ−Σ∥∥F

=

√√√√ r∑i=1

(σi − λi

)2=

√√√√ r∑i=1

1/2i − λ1/2

i

)2 (σ

1/2i + λ

1/2

i

)2

≥(σ1/2r + min

1/2

i

)√√√√ r∑i=1

1/2i − λ1/2

i

)2

=(σ1/2r + min

1/2

i

)∥∥∥Λ1/2 −Σ1/2∥∥∥F

(√

2 +

√1

2− 2

5

)3

5σr(X

?).

Combining (71) with (72), (73) and (75) gives

‖∇g(W )‖F ≥1

10σ3/2r (X?).

This completes the proof of (42).2) Large gradient for the region R′′3 :: By (71), we have

‖∇g(W )‖F ≥1

2

∥∥∥(WWT −W ?W ?T)W∥∥∥2

F.

Now (43) follows directly from the fact ‖W ‖ > 2019‖W

?‖ andthe following result.

Lemma 9. For any A,B ∈ Rn×r with ‖A‖ ≥ α‖B‖ andα > 1, we have∥∥∥(AAT −BBT

)A∥∥∥F≥ (1− 1

α2)‖A‖3.

Proof. Let A = Φ1Λ1RT1 and B = Φ2Λ2R

T2 be the SVDs

of A and B, respectively. Then∥∥∥(AAT −BBT)A∥∥∥F

=∥∥∥Φ1Λ

31 −Φ2Λ

22Φ

T2 Φ1Λ1

∥∥∥F

≥∥∥∥Λ3

1 −ΦT1 Φ2Λ

22Φ

T2 Φ1Λ1

∥∥∥F

≥∥∥Λ3

1 −Λ22Λ1

∥∥F

≥ (1− 1

α2)‖A‖3.

Page 21: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

21

3) Large gradient for the region R′′′3 :: By (58), we have

〈∇g(W ),W 〉

=

⟨1

2

(WWT −W ?W ?T

)W +

1

2W

?W

?TW ,W

⟩≥ 1

2

⟨(WWT −W ?W ?T

)W ,W

⟩≥ 1

2

(∥∥∥WWT∥∥∥2

F−∥∥∥WWT

∥∥∥F

∥∥∥W ?W ?T∥∥∥F

)>

1

20

∥∥∥WWT∥∥∥2

F(76)

where the last line holds because ‖W ?W ?T‖F <910‖WWT‖F . On the other hand, we have

〈∇g(W ),W 〉 ≤ ‖∇g(W )‖F ‖W ‖

≤ ‖∇g(W )‖F(‖WWT‖F

)1/2

,

which further indicates that

‖∇g(W )‖F >1

20

∥∥∥WWT∥∥∥3/2

F.

This completes the proof of (44).

APPENDIX LPROOF OF THEOREM 1 (ROBUST STRICT SADDLE FOR

G(W ))

Throughout the proofs, we always utilizeX = UV T unlessstated otherwise. To give a sense that the geometric result inTheorem 5 for g(W ) is also possibly preserved for G(W ),we first compute the derivative of G(W ) as

∇G(W ) =

[∇f(UV T)V

(∇f(UV T))TU

]+ µWW

TW . (77)

For any ∆ =

[∆U

∆V

]∈ R(n+m)×r, algebraic calculation gives

the Hessian quadrature form [∇2G(W )](∆,∆) as

[∇2G(W )](∆,∆)

= [∇2f(UV T)](∆UVT +U∆T

V ,∆UVT +U∆T

V )

+ 2〈∇f(UV T),∆U∆TV 〉+ [∇2ρ(W )](∆,∆)

(78)

where [∇2ρ(W )](∆,∆) is defined in (32). Thus, it is ex-pected that G(W ), ∇G(W ), and ∇2G(W ) are close totheir counterparts (i.e., g(W ), ∇g(W ) and ∇2g(W )) for thematrix factorization problem when f(X) satisfies the (2r, 4r)-restricted strong convexity and smoothness condition (6).

Before moving to the main proofs, we provide several usefulresults regarding the deviations of the gradient and Hessian.We start with a useful characterization of the restricted strongconvexity and smoothness condition.

Lemma 10. Suppose f satisfies the (2r, 4r)-restricted strongconvexity and smoothness condition (6) with positive constantsa = 1− c and b = 1 + c, c ∈ [0, 1). Then any n×m matricesC,D,H with rank(C), rank(D) ≤ r and rank(H) ≤ 2r,we have

|〈∇f (C)−∇f (D)− (C −D),H〉| ≤ c ‖C −D‖F ‖H‖F .

Proof of Lemma 10. We first invoke [38, Proposition 2] whichstates that under Assumption 2 for any n × m matricesZ,D,H of rank at most 2r, we have∣∣[∇2f(Z)](D,H)− 〈D,H〉

∣∣ ≤ c ‖D‖F ‖H‖F . (79)

Now using integral form of the mean value theorem for ∇f ,we have

|〈∇f (C)−∇f (D)− (C −D),H〉|

=

∣∣∣∣∫ 1

0

[∇2f(tC + (1− t)D)

](C −D,H)− 〈C −D,H〉dt

∣∣∣∣≤∫ 1

0

∣∣[∇2f(tC + (1− t)D)]

(C −D,H)− 〈C −D,H〉∣∣ dt

≤∫ 1

0

c ‖C −D‖F ‖H‖F dt = c ‖C −D‖F ‖H‖F .

where the second inequality follows from (79) since tC+(1−t)D, C −D, and H all are rank at most 2r.

The following result controls the deviation of the gradientbetween the general low-rank optimization (9) and the matrixfactorization problem by utilizing the (2r, 4r)-restricted strongconvexity and smoothness condition (6).

Lemma 11. Suppose f(X) has a critical point X? ∈ Rn×mof rank r and satisfies the (2r, 4r)-restricted strong convexityand smoothness condition (6) with positive constants a = 1−cand b = 1 + c, c ∈ [0, 1). Then, we have

‖∇G(W )−∇g(W )‖F ≤ c∥∥∥WWT −W ?W ?T

∥∥∥F‖W ‖ .

Proof of Lemma 11. We bound the deviation directly:

‖∇G(W )−∇g(W )‖F = max‖∆‖F =1

〈∇G(W )−∇g(W ),∆〉

= max‖∆‖F =1

⟨∇f(X),∆UV

T⟩−⟨X −X?,∆UV

T⟩

+⟨∇f(X),U∆T

V

⟩−⟨X −X?,U∆T

V

⟩= max‖∆‖F =1

⟨∇f(X)−∇f(X?)− (X −X?),∆UV

T⟩

+⟨∇f(X)−∇f(X?)− (X −X?),U∆T

V

⟩≤ max‖∆‖F =1

c ‖X −X?‖F(∥∥∥∆UV

T∥∥∥F

+∥∥∥U∆T

V

∥∥∥F

)≤ c‖UV T −X?‖F (‖V ‖+ ‖U‖)≤ c‖WWT −W ?W ?T‖F ‖W ‖ ,

where the last equality follows from Assumption 1 that∇f(X?) = 0 and and the first inequality utilizesLemma 10.

Similarly, the next result controls the deviation of theHessian between the matrix sensing problem and the matrixfactorization problem.

Lemma 12. Suppose f(X) has a critical point X? ∈ Rn×mof rank r and satisfies the (2r, 4r)-restricted strong convexityand smoothness condition (6) with positive constants a = 1−c

Page 22: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

22

and b = 1 + c, c ∈ [0, 1). Then, for any ∆ =

[∆U

∆V

]∈

R(n+m)×r the following holds:∣∣∇2G(W )[∆,∆]−∇2g(W )[∆,∆]∣∣

≤ 2c∥∥∥UV T −X?

∥∥∥F

∥∥∥∆U∆TV

∥∥∥F

+ c∥∥∥∆UV

T +U∆TV

∥∥∥2

F.

Proof of Lemma 12. First note that

∇2G(W )[∆,∆]−∇2g(W )[∆,∆]

= 2⟨∇f(X),∆U∆T

V

⟩− 2

⟨X −X?,∆U∆T

V

⟩+ [∇2f(X)](∆UV

T +U∆TV )−

∥∥∥∆UVT +U∆T

V

∥∥∥2

F.

Now utilizing Lemma 10 and (6), we have∣∣∇2G(W )[∆,∆]−∇2g(W )[∆,∆]∣∣

≤ 2∣∣∣⟨∇f(X)−∇f(X?),∆U∆T

V

⟩− 〈X −X?,∆U∆T

V 〉∣∣∣

+

∣∣∣∣[∇2f(X)](∆UVT +U∆T

V )−∥∥∥∆UV

T +U∆TV

∥∥∥2

F

∣∣∣∣≤ 2c

∥∥∥UV T −X?∥∥∥F

∥∥∥∆U∆TV

∥∥∥F

+ c∥∥∥∆UV

T +U∆TV

∥∥∥2

F.

We provide one more result before proceeding to prove themain theorem.

Lemma 13. [13, Lemma E.1] Let A and B be two n × rmatrices such that ATB = BTA is PSD. Then∥∥∥(A−B)AT

∥∥∥2

F≤ 1

2(√

2− 1)

∥∥∥AAT −BBT∥∥∥2

F.

A. Local descent condition for the region R1

Similar to what used in Appendix K-A, we perform thechange of variable W ?R→W ? to avoid R in the followingequations. With this change of variable we have insteadWTW ? = W ?TW is PSD.

We first control |〈∇G(W )−∇g(W ),W −W ?〉| as fol-lows:

|〈∇G(W )−∇g(W ),W −W ?〉|

≤∣∣∣〈∇f(X), (U −U?)V T〉 − 〈X −X?, (U −U?)V T〉

∣∣∣+∣∣〈∇f(X),U(V − V ?)T〉 − 〈X −X?,U(V − V ?)T〉

∣∣≤ c ‖X −X?‖F

(‖(U −U?)V T‖F + ‖U(V − V ?)T‖F

)≤ c‖WWT −W ?W ?T‖F

∥∥W (W −W ?)T∥∥F

≤ c

2(√

2− 1)‖WWT −W ?W ?T‖2F

where the second inequality utilizes ∇f(X?) = 0 andLemma 10, and the last inequality follows from Lemma 13.The above result along with (59)-(60) gives

〈∇G(W ),W −W ?〉≥ 〈∇g(W ),W −W ?〉 − |〈∇G(W )−∇g(W ),W −W ?〉|

≥ 〈∇g(W ),W −W ?〉 − c

2(√

2− 1)

∥∥∥WWT −W ?W ?T∥∥∥2

F

≥ 1

16σr(X

?) dist2(W ,W ?) +1

32

∥∥∥WWT −W ?W ?T∥∥∥2

F

+1

4‖X?‖

∥∥∥W ?W

?TW∥∥∥2

F

− c

2(√

2− 1)

∥∥∥WWT −W ?W ?T∥∥∥2

F

≥ 1

16σr(X

?) dist2(W ,W ?) +1

160

∥∥∥WWT −W ?W ?T∥∥∥2

F

+1

4‖X?‖

∥∥∥W ?W

?TW∥∥∥2

F

(80)

where we utilize c ≤ 150 .

On the other hand, we control ‖∇G(W )‖F with Lemma 11controlling the deviation between ∇G(W ) and ∇g(W ) asfollows:

‖∇G(W )‖2F = ‖∇g(W ) +∇G(W )−∇g(W )‖2F≤ 20

19‖∇g(W )‖2F + 20 ‖∇g(W )−∇G(W )‖2F

≤ 20

19‖∇g(W )‖2F + 20c2‖W ‖2

∥∥∥WWT −W ?W ?T∥∥∥2

F

=5

19

∥∥∥(WWT −W ?W ?T)W + W

?W

?TW∥∥∥2

F

+ 20c2‖W ‖2∥∥∥WWT −W ?W ?T

∥∥∥2

F

≤(

5

19

100

99+ 20c2

)∥∥∥(WWT −W ?W ?T)W∥∥∥2

F

+ 25∥∥∥W ?

W?TW∥∥∥2

F

≤ (5

19

100

99+ 50c2)(

√2 + 1)2‖X?‖‖WWT −W ?W ?T‖2F

+ 25‖W?W

?TW ‖2F ,

(81)

where the first inequality holds since (a+ b)2 ≤ 1+εε a2 +(1+

ε)b2 for any ε > 0, and the fourth line follows from (58).Now combining (80)-(81) and assuming c ≤ 1

50 gives

〈∇G(W ),W −W ?〉

≥ 1

16σr(X

?) dist2(W ,W ?) +1

260‖X?‖‖∇G(W )‖2F .

This completes the proof of (15).

B. Negative curvature for the region R2

Let ∆ = qkαT be defined as in (66). First note that∥∥∥∆UV

T +U∆TV

∥∥∥2

F≤ 2

∥∥∥∆UVT∥∥∥2

F+ 2

∥∥∥U∆TV

∥∥∥2

F

≤ 2∥∥∥W∆T

∥∥∥2

F= 2σ2

r(W ) ≤ σr(X?),

Page 23: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

23

where the last equality holds because σr(W ) ≤√12σ

1/2r (X?). Also utilizing the particular structure in

∆ yields

∥∥∥∆U∆TV

∥∥∥F

=1

2

∥∥∥φkψTk

∥∥∥F

=1

2.

Due to the assumption 2019‖W

?W ?T‖F ≥ ‖WWT‖F , wehave

‖UV T −X?‖F ≤√

2

2‖WWT −W ?W ?T‖F

≤√

2

2(20

19‖W ?W ?T‖F + ‖W ?W ?T‖F ) =

39√

2

19‖X?‖F .

Now combining the above results with Lemma 12, we have

∇2G(W )[∆,∆]

≤ ∇2g(W )[∆,∆] +∣∣∇2G(W )[∆,∆]−∇2g(W )[∆,∆]

∣∣≤ −1

4σr(X

?) + 2c∥∥∥UV T −X?

∥∥∥F

∥∥∥∆U∆TV

∥∥∥F

+ c∥∥∥∆UV

T +U∆TV

∥∥∥2

F

≤ −1

4σr(X

?) +39

19

√2c‖X?‖F + cσr(X

?)

≤ −1

6σr(X

?),

where the last line holds when c ≤ σr(X?)50‖X?‖F . This completes

the proof of (16).

C. Large gradient for the region R′3 ∪R′′3 ∪R′′′3 :

To show that G(W ) has large gradient in these three re-gions, we mainly utilize Lemma 11 to guarantee that ∇G(W )is close to ∇g(W ).

1) Large gradient for the region R′3: Utilizing Lemma 11,we have

‖∇G(W )‖F≥ ‖∇g(W )‖F − ‖∇G(W )−∇g(W )‖F≥ ‖∇g(W )‖F − c

∥∥∥WWT −W ?W ?T∥∥∥F‖W ‖

≥ ‖∇g(W )‖F − c(10

9‖W ?W ?T‖F + ‖W ?W ?T‖F ) ‖W ‖

≥ 1

10σ3/2r (X?)− c19

92‖X?‖F

20

19

√2‖X?‖1/2

≥ 1

27σ3/2r (X?),

where the fourth line follows because∥∥∥W ?W ?T

∥∥∥F

=

2‖X?‖F and ‖W ‖ ≤ 2019

√2‖X?‖1/2, and the last line holds

if c ≤ 1100

σ3/2r (X?)

‖X?‖F ‖X?‖1/2 . This completes the proof of (17).

2) Large gradient for the region R′′3 : Utilizing Lemma 11again, we have

‖∇G(W )‖F≥ ‖∇g(W )‖F − c

(∥∥∥WWT∥∥∥F

+∥∥∥W ?W ?T

∥∥∥F

)‖W ‖

≥ 39

800‖W ‖3 − c

(10

9

∥∥∥W ?W ?T∥∥∥F

+∥∥∥W ?W ?T

∥∥∥F

)‖W ‖

≥ 39

800‖W ‖3 − c19

92 ‖X?‖F ‖W ‖

≥ 39

800‖W ‖3 − 19

450‖X?‖ ‖W ‖

≥ 1

50‖W ‖3,

where the fourth line holds if c ≤ 1100

σ3/2r (X?)

‖X?‖F ‖X?‖1/2 and thelast follows from the fact that

‖W ‖ > 20

19‖W ?‖ ≥ 20

19

√2‖X?‖1/2.

This completes the proof of (18).3) Large gradient for the region R′′′3 : To show (19), we

first control |〈∇G(W )−∇g(W ),W 〉| as follows:

|〈∇G(W )−∇g(W ),W 〉|

= 2∣∣∣⟨∇f(UV T),UV T

⟩−⟨UV T −X?,UV T

⟩∣∣∣≤ 2c

∥∥∥UV T −X?∥∥∥F

∥∥∥UV T∥∥∥F

≤ 2c19

20

√2‖WWT‖F

1

2‖WWT‖F =

19

20

√2c‖WWT‖2F ,

where the first inequality utilizes the fact ∇f(X?) = 0 andLemma 10, and the last inequality holds because∥∥∥UV T −X?

∥∥∥F≤√

2

2

∥∥∥WWT −W ?W ?T∥∥∥F

≤√

2

2

(9

10

∥∥∥WWT∥∥∥F

+∥∥∥WWT

∥∥∥F

)=

19√

2

20

∥∥∥WWT∥∥∥F

and

‖WWT‖2F = ‖UUT‖2F + ‖V V T‖2F + 2‖UV T‖2F≥ 4‖UV T‖2F

by noting that

‖UUT‖2F+‖V V T‖2F−2‖UV T‖2F = ‖UTU−V TV ‖2F ≥ 0.

Now utilizing (76) to provide a lower bound for〈∇g(W ),W 〉, we have

|〈∇G(W ),W 〉|≥ 〈∇g(W ),W 〉 − |〈∇G(W )−∇g(W ),W 〉|

>1

20‖WWT‖2F −

19

20

√2c‖WWT‖2F

≥ 1

45‖WWT‖2F ,

where the last line holds when c ≤ 150 . Thus,

‖∇G(W )‖F ≥1

‖W ‖|〈∇G(W ),W 〉| > 1

45‖WWT‖3/2F ,

Page 24: The Global Optimization Geometry of Low-Rank Matrix ... · 3 the authors proved that tensor decompostion problems satisfy this robust strict saddle property. Sun et al. [21] studied

24

where we utilize ‖W ‖ ≤(‖WWT‖F

)1/2

. This completesthe proof of (19).

REFERENCES

[1] S. Aaronson, “The learnability of quantum states,” Proceedings of theRoyal Society of London A: Mathematical, Physical and EngineeringSciences, vol. 463, no. 2088, pp. 3089–3114, 2007.

[2] Z. Liu and L. Vandenberghe, “Interior-point method for nuclear normapproximation with application to system identification,” SIAM Journalon Matrix Analysis and Applications, vol. 31, no. 3, pp. 1235–1256,2009.

[3] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrixfactorization,” in Advances in Neural Information Processing Systems,pp. 1329–1336, 2004.

[4] L. Xu and M. Davenport, “Dynamic matrix recovery from incompleteobservations under an exact low-rank constraint,” in Advances in NeuralInformation Processing Systems, pp. 3585–3593, 2016.

[5] M. A. Davenport and J. Romberg, “An overview of low-rank matrixrecovery from incomplete observations,” IEEE Journal of SelectedTopics in Signal Processing, vol. 10, no. 4, pp. 608–622, 2016.

[6] M. Fazel, H. Hindi, and S. Boyd, “Rank minimization and applicationsin system theory,” in American Control Conference, vol. 4, pp. 3273–3278, IEEE, 2004.

[7] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-ranksolutions of linear matrix equations via nuclear norm minimization,”SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.

[8] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick, “Large-scale image classification with trace-norm regularization,” in IEEE Con-ference on Computer Vision and Pattern Recognition (CVPR), pp. 3386–3393, IEEE, 2012.

[9] E. J. Candes and B. Recht, “Exact matrix completion via convexoptimization,” Foundations of Computational Mathematics, vol. 9, no. 6,pp. 717–772, 2009.

[10] J.-F. Cai, E. J. Candes, and Z. Shen, “A singular value thresholding al-gorithm for matrix completion,” SIAM Journal on Optimization, vol. 20,no. 4, pp. 1956–1982, 2010.

[11] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm forsolving semidefinite programs via low-rank factorization,” MathematicalProgramming, vol. 95, no. 2, pp. 329–357, 2003.

[12] S. Burer and R. D. Monteiro, “Local minima and convergence in low-rank semidefinite programming,” Mathematical Programming, vol. 103,no. 3, pp. 427–444, 2005.

[13] S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Global optimal-ity of local search for low rank matrix recovery,” arXiv preprintarXiv:1605.07221, 2016.

[14] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious localminimum,” arXiv preprint arXiv:1605.07272, 2016.

[15] D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi, “Non-squarematrix sensing without spurious local minima via the Burer-Monteiroapproach,” arXiv preprint arXiv:1609.03240, 2016.

[16] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Proceedings ofThe 28th Conference on Learning Theory, pp. 797–842, 2015.

[17] J. Sun, Q. Qu, and J. Wright, “When are nonconvex problems notscary?,” arXiv preprint arXiv:1510.06096, 2015.

[18] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradient descentconverges to minimizers,” University of California, Berkeley, vol. 1050,p. 16, 2016.

[19] I. Panageas and G. Piliouras, “Gradient descent only converges tominimizers: Non-isolated critical points and invariant regions,” arXivpreprint arXiv:1605.00405, 2016.

[20] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “Howto escape saddle points efficiently,” arXiv preprint arXiv:1703.00887,2017.

[21] J. Sun, Q. Qu, and J. Wright, “A geometric analysis of phase retrieval,”arXiv preprint arXiv:1602.06664, 2016.

[22] E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval via Wirtingerflow: Theory and algorithms,” IEEE Transactions on Information The-ory, vol. 61, no. 4, pp. 1985–2007, 2015.

[23] Y. Chen and E. Candes, “Solving random quadratic systems of equationsis nearly as easy as solving linear systems,” in Advances in NeuralInformation Processing Systems, pp. 739–747, 2015.

[24] K. Lee, N. Tian, and J. Romberg, “Fast and guaranteed blind multi-channel deconvolution under a bilinear system model,” arXiv preprintarXiv:1610.06469, 2016.

[25] X. Li, S. Ling, T. Strohmer, and K. Wei, “Rapid, robust, and reli-able blind deconvolution via nonconvex optimization,” arXiv preprintarXiv:1606.04933, 2016.

[26] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon,“Learning sparsely used overcomplete dictionaries.,” in Conference onLearning Theory (COLT), pp. 123–137, 2014.

[27] J. Sun, Q. Qu, and J. Wright, “Complete dictionary recovery over thesphere II: Recovery by Riemannian trust-region method,” arXiv preprintarXiv:1511.04777, 2015.

[28] H. Liu, M.-C. Yue, and A. M.-C. So, “On the estimation performanceand convergence rate of the generalized power method for phasesynchronization,” arXiv preprint arXiv:1603.00211, 2016.

[29] S. Tu, R. Boczar, M. Soltanolkotabi, and B. Recht, “Low-rank solu-tions of linear matrix equations via Procrustes flow,” arXiv preprintarXiv:1507.03566, 2015.

[30] L. Wang, X. Zhang, and Q. Gu, “A unified computational and statisticalframework for nonconvex low-rank matrix estimation,” arXiv preprintarXiv:1610.05275, 2016.

[31] R. Sun and Z.-Q. Luo, “Guaranteed matrix completion via non-convexfactorization,” IEEE Transactions on Information Theory, vol. 62, no. 11,pp. 6535–6579, 2016.

[32] C. Jin, S. M. Kakade, and P. Netrapalli, “Provable efficient online matrixcompletion via non-convex stochastic gradient descent,” arXiv preprintarXiv:1605.08370, 2016.

[33] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion us-ing alternating minimization,” in Proceedings of the Forty-Fifth AnnualACM Symposium on Theory of Computing, pp. 665–674, ACM, 2013.

[34] R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in noncon-vex low rank problems: A unified geometric analysis,” arXiv preprintarXiv:1704.00708, 2017.

[35] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi, “Dropping convexity forfaster semi-definite optimization,” arXiv preprint, 2015.

[36] T. Zhao, Z. Wang, and H. Liu, “A nonconvex optimization frameworkfor low rank matrix estimation,” in Advances in Neural InformationProcessing Systems, pp. 559–567, 2015.

[37] Q. Li and G. Tang, “The nonconvex geometry of low-rank matrix opti-mizations with general objective functions,” arXiv:1611.03060, 2016.

[38] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin, “Global optimality in low-rankmatrix optimization,” arXiv preprint arXiv:1702.07945, 2017.

[39] Q. Li, Z. Zhu, and G. Tang, “Geometry of factored nuclear normregularization,” arXiv:1704.01265, 2017.

[40] A. R. Conn, N. I. Gould, and P. L. Toint, Trust region methods. SIAM,2000.

[41] X. Li, Z. Wang, J. Lu, R. Arora, J. Haupt, H. Liu, and T. Zhao,“Symmetry, saddle points, and global geometry of nonconvex matrixfactorization,” arXiv preprint arXiv:1612.09296, 2016.

[42] G. S. Chirikjian and A. B. Kyatkin, Harmonic Analysis for Engineersand Applied Scientists: Updated and Expanded Edition. Courier DoverPublications, 2016.

[43] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bitmatrix completion,” Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.

[44] N. Higham and P. Papadimitriou, “Matrix procrustes problems,” Rapporttechnique, University of Manchester, 1995.

[45] E. J. Candes and Y. Plan, “Tight oracle inequalities for low-rank matrixrecovery from a minimal number of noisy random measurements,” IEEETransactions on Information Theory, vol. 57, no. 4, pp. 2342–2359,2011.

[46] M. Udell, C. Horn, R. Zadeh, and S. Boyd, “Generalized low rankmodels,” arXiv preprint arXiv:1410.0342, 2014.

[47] T. Cai and W.-X. Zhou, “A max-norm constrained minimization ap-proach to 1-bit matrix completion.,” Journal of Machine LearningResearch, vol. 14, no. 1, pp. 3619–3647, 2013.

[48] J. Salmon, Z. Harmany, C.-A. Deledalle, and R. Willett, “Poisson noisereduction with non-local PCA,” Journal of Mathematical Imaging andVision, vol. 48, no. 2, pp. 279–294, 2014.

[49] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEETransactions on Information Theory, vol. 51, no. 12, pp. 4203–4215,2005.

[50] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge UniversityPress, 2012.