arxiv:1412.8132v1 [math.st] 28 dec 2014arxiv:1412.8132v1 [math.st] 28 dec 2014...

40
arXiv:1412.8132v1 [math.ST] 28 Dec 2014 Robust Matrix Completion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France Karim Lounici School of Mathematics, Georgia Institute of Technology Atlanta, GA 30332-0160, USA and Alexandre B. Tsybakov CREST 3, Av. Pierre Larousse, 92240 Malakoff, France Abstract This paper considers the problem of recovery of a low-rank matrix in the situation when most of its entries are not observed and a fraction of observed entries are corrupted. The observations are noisy realizations of the sum of a low rank matrix, which we wish to recover, with a second matrix having a complementary sparse structure such as element-wise or column-wise sparsity. We analyze a class of estimators obtained by solving a constrained convex optimization problem that combines the nu- clear norm and a convex relaxation for a sparse constraint. Our results are obtained for the simultaneous presence of random and deterministic patterns in the sampling scheme. We provide guarantees for recovery of low-rank and sparse components from partial and corrupted observations in the presence of noise and show that the obtained rates of convergence are minimax optimal. 1 Introduction In recent years, there have been a considerable interest in statistical inference for high-dimensional matrices. One particular problem is matrix completion where one observes only a small number N m 1 m 2 of the entries of a high- dimensional m 1 × m 2 matrix L 0 of rank r and aims at inferring the missing entries. In the noiseless setting, [6, 13, 22] established the following remarkable 1

Upload: others

Post on 11-Jun-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

arX

iv:1

412.

8132

v1 [

mat

h.ST

] 2

8 D

ec 2

014

Robust Matrix Completion

Olga Klopp

CREST and MODAL’X

University Paris Ouest, 92001 Nanterre, France

Karim Lounici

School of Mathematics, Georgia Institute of Technology

Atlanta, GA 30332-0160, USA

and

Alexandre B. Tsybakov

CREST

3, Av. Pierre Larousse, 92240 Malakoff, France

Abstract

This paper considers the problem of recovery of a low-rank matrix in

the situation when most of its entries are not observed and a fraction of

observed entries are corrupted. The observations are noisy realizations of

the sum of a low rank matrix, which we wish to recover, with a second

matrix having a complementary sparse structure such as element-wise

or column-wise sparsity. We analyze a class of estimators obtained by

solving a constrained convex optimization problem that combines the nu-

clear norm and a convex relaxation for a sparse constraint. Our results

are obtained for the simultaneous presence of random and deterministic

patterns in the sampling scheme. We provide guarantees for recovery of

low-rank and sparse components from partial and corrupted observations

in the presence of noise and show that the obtained rates of convergence

are minimax optimal.

1 Introduction

In recent years, there have been a considerable interest in statistical inferencefor high-dimensional matrices. One particular problem is matrix completionwhere one observes only a small number N ≪ m1m2 of the entries of a high-dimensional m1 × m2 matrix L0 of rank r and aims at inferring the missingentries. In the noiseless setting, [6, 13, 22] established the following remarkable

1

Page 2: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

result: assuming that it satisfies some low coherence condition, L0 can be recov-ered exactly by constrained nuclear norm minimization with high probabilityfrom only N & rmaxm1,m2 log2(m1+m2) entries observed uniformly at ran-dom. A more common situation in applications corresponds to the noisy settingin which the few available entries are corrupted by noise. Noisy matrix comple-tion has been extensively studied recently (see, e.g., [15, 23, 18, 21, 11, 16, 4]).

The matrix completion problem is motivated by a variety of applications.An important question in applications is whether or not matrix completionprocedures are robust to corruptions. Suppose that we observe noisy entries ofA0 = L0 + S0 where L0 is an unknown low-rank matrix and S0 correspondsto some gross/malicious corruptions. We wish to recover L0, but only observea few entries of A0 and, among those, a fraction happens to be corrupted byS0. Of course, we do not know which entries are corrupted. It has been shownempirically that uncontrolled and potentially adversarial gross errors that mightaffect only a few observations are particularly harmful. For example, Xu etal [26] showed that a very popular matrix completion procedure using nuclearnorm minimization can fail dramatically even if S0 contains only a single nonzerocolumn. It is particularly relevant in applications to recommendation systemswhere malicious users try to manipulate the prediction of matrix completionalgorithms by introducing spurious perturbations S0. Hence, the need for newtechniques robust to the presence of corruptions S0.

With these motivations in mind, we consider robust matrix completion. LetA0 ∈ R

m1×m2 be an unknown matrix. We suppose that A0 = L0 + S0 whereL0 is low rank and S0 is assumed to have some low complexity structure suchas entry-wise sparsity or column-wise sparsity. We consider the observations(Xi, Yi) satisfying the trace regression model

Yi = tr(XTi A0) + ξi, i = 1, . . . , N, (1)

where tr(A) denotes the trace of the matrix A; the noise variables ξi are inde-pendent with zero mean; Xi are m1 ×m2 matrices with values in

X =ej(m1)eTk (m2), 1 ≤ j ≤ m1, 1 ≤ k ≤ m2

, (2)

where el(m) are the canonical basis vectors in Rm. Thus, we observe entries

of matrix A0 with random noise. Our goal is to obtain accurate estimatesof the decomposition (L0, S0) based on the observations (Xi, Yi) in the high-dimensional setting N ≪ m1m2.

We assume that the set of observations indexed by i = 1, . . . , N is the unionof two components Ω and Ω with Ω∩Ω = ∅. One of them, Ω, corresponds to the“non-corrupted” observations of noisy entries of L0, i.e. observations for whichthe corresponding entry of S0 is zero. The other component, Ω, corresponds tothe observations for which the corresponding entry of S0 is non vanishing. Givenan observation, we do not know if it belongs to the corrupted or non-corruptedpart of observations and we have that |Ω| + |Ω| = N . The sizes of the sets Ωand Ω are assumed non-random.

2

Page 3: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

A particular case of this setting is the matrix decomposition problem whenN = m1m2, i.e., we observe all the entries of A0. Several recent works con-sider the matrix decomposition problem, mostly in the noiseless setting, ξi ≡ 0.Chandrasekaran et al [7] analyzed the case when the matrix S0 is sparse, withsmall number of non-zero entries. They proved that it is possible to recover per-fectly (L0, S0) under additional sufficient identifiability conditions. This modelwas further studied by Hsu et al in [14] where they give milder conditions forthe recovery of (L0, S0). Also in the noiseless setting, Candes et al [5] studiedthe same model but with positions of corruptions chosen uniformly at random.Xu et al [26] studied a model, in which the matrix S0 is column-wise sparse withrelatively small number of non-zero columns. Their method guarantees approx-imate recovery for the uncorrupted columns of the low-rank component, L0.Agarwal et al [1] consider a general model in which the observations are noisyrealizations of a linear transformation of A0. Their setup includes the matrixdecomposition problem and some other statistical models of interest but doesnot cover the matrix completion problem. Agarwal et al provide a general resulton approximate recovery of the pair (L0, S0) imposing a “spikiness condition”on the low-rank component L0. Their analysis includes as particular cases boththe entry-wise and column-wise corruptions.

The robust matrix completion setting, when N < m1m2, was first consideredby Candes et al [5] in the noiseless case and for element-wise sparse S0. In thispaper, they suppose that the support of S0 is selected uniformly at randomand assume that N = 0.1m1m2 or some other fixed fraction. Chen et al [8]consider also the noiseless case but column-wise sparse S0. They proved thatthe same procedure as in [7] can recover the non-corrupted columns of L0 andidentify the set of indices of the corrupted columns. This was done under thefollowing assumptions: the locations of the non-corrupted columns are chosenuniformly at random; L0 satisfies some sparse/low-rank incoherence condition;the total number of corrupted columns is small and a sufficient number of noncorrupted entries is observed. In further works, Chen et al [9] and Li [19] considernoiseless robust matrix completion in the case of element-wise sparse S0. Theyprove exact reconstruction of the low-rank component under an incoherencecondition on L0 and some additional assumptions on the number of corruptedobservations.

To the best of our knowledge, the present paper is the first study of robustmatrix completion with noise. Our method allows us to treat simultaneouslythe case of column-wise and element-wise corruptions. It is important to notethat we do not require strong assumptions on the unknown matrices (such asincoherence condition) or additional conditions on the number of corrupted ob-servations as in the noiseless case. We emphasize that we do not need to knowthe rank of L0 nor the sparsity level of S0. We do not need to observe all entriesof A0 either. We only need to know an upper bound on the maximum of theabsolute values of the entries of L0 and S0. Such information is often availablein applications; for example, in recommendation systems, this bound is just themaximum rating. Another important point is that our method allows us to con-sider quite general and unknown sampling distribution. All the previous works

3

Page 4: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

on noiseless robust matrix completion assume the uniform sampling distribu-tion. However, in practice, the observed entries are not guaranteed to followuniform scheme and its distribution is not known exactly.

We establish oracle inequalities for the cases of entry-wise and columns-wisesparse S0. For example, in the case of column-wise corruptions, we show thefollowing bound on the normalized Frobenius error of the estimator (L, S) of(L0, S0): with high probability

‖L− L0‖22m1m2

+‖S0 − S‖22m1m2

.r max(m1,m2) + |Ω|

|Ω| +s

m2

where the symbol . means that the inequality holds up to a multiplicativeabsolute constant and a logarithmic factor. Here, r denotes the rank of L0, sis the number of corrupted columns, |Ω| and |Ω| are respectively the numberof non-corrupted and corrupted observations. Note that, when the number ofcorrupted columns s and the proportion of corrupted observations |Ω|/|Ω| aresmall, this bound implies that O(r max(m1,m2)) observations are enough forsuccessful and robust to corruptions matrix completion. We also show that, inboth case of column-wise and element-wise corruptions, the obtained rates ofconvergence are minimax optimal (up to a log factor).

The rest of the paper is organized as follows. Section 2.1 contains the no-tation and definitions. We introduce our estimator in Section 2.2 and our as-sumptions on the sampling scheme in Section 2.3. In Section 3, we present ageneral result for the estimation error. In Sections 4 and 5, we specialize it tothe cases of column-wise and entry-wise corruptions. In Section 6, we provethat our estimator is rate minimax optimal up to a logarithmic factor. TheAppendix contains the proofs.

2 Preliminaries

2.1 Notation and definitions

We provide here a brief summary of the notation used in the paper. For anyset I, |I| denotes its cardinal and I its complement. Let a ∨ b = max(a, b) anda ∧ b = min(a, b).

For a matrix A, Ai is its ith column and Aij is its (i, j)−th entry. LetI ⊂ 1, . . .m1×1, . . .m2 be a subset of indices. Given a matrix A, we defineits restriction on I, AI , in the following way: (AI)ij = Aij if (ij) ∈ I and(AI)ij = 0 if not. Let Id denote the matrix of ones, i.e. Idij = 1 for any (i, j)and 0 the zero matrix, i.e. for any (i, j), 0ij = 0.

For any p ≥ 1, we denote by ‖ · ‖p the usual lp−norm. Additionally, weuse the following matrix norms: ‖A‖∗ is the nuclear norm (the sum of singularvalues), ‖A‖ is the operator norm (the largest singular value), ‖A‖∞ is thelargest absolute value of the entries:

‖A‖∞ = max1≤j≤m1,1≤k≤m2

|Ajk|,

4

Page 5: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

‖A‖2,1 is the sum of l2 norms of the columns of A and ‖A‖2,∞ is the largest l2norm of the columns of A:

‖A‖2,1 =

m2∑

k=1

‖Ak‖2 and ‖A‖2,∞ = max1≤k≤m2

‖Ak‖2.

The scalar product of two matrices A,B is defined by 〈A,B〉 = tr(AB⊤).

Notation related to corruptions:

In the case of column-wise sparse matrix S0, let I ⊂ 1, . . . ,m1×1, . . . ,m2be a subset of indices such that

I = 1, . . . ,m1 × C (3)

where C ⊂ 1, . . . ,m2 is the subset of indices of the non-zero columns of S0.For element-wise sparse matrix S0, we denote by I the set of indices of thenon-zero elements of S0. Let I denote the complement of I.

We use a norm-based regularizer R : Rm1×m2 → R+ to constrain the struc-

ture of S0. We define the associated dual norm

R∗(A) = supR(B)≤1

〈A,B〉. (4)

Let |A| denote the matrix with entries the absolute values of the entries of A.A norm R(·) is called absolute if it depends only on the absolute values of theentries of A

R(A) = R(|A|).For instance, lp and ‖ · ‖2,1 norms are absolute. We call R(·) monotonic if|A| ≤ |B| implies R(A) ≤ R(B) (inequalities between matrices are understoodto hold component-wise). An absolute norm is monotonic and vice versa (see,e.g. [2]).

For any index set I ⊂ 1, . . . ,m1× 1, . . . ,m2 we define the subspace MI

MI =V ∈ R

m1×m2 : Vij = 0 for all (ij) 6∈ I. (5)

A summary of the notation:

• We set d = m1 +m2, m = m1 ∧m2 and M = m1 ∨m2.

• Let ǫini=1 be an i.i.d. Rademacher sequence and we define the followingstochastic terms

ΣR =1

n

i∈Ω

ǫiXi, Σ =1

N

i∈Ω

ξiXi and W =1

N

i∈Ω

Xi

• We denote by r the rank of the matrix L0.

• N is the number of observations, n = |Ω| is the number of non-corruptedobservations, æ = N/n and |Ω| is the number of corrupted observations.

• In what follows, we use the symbol C for a generic positive constant,which is independent of n,m1,m2 and r, s, and may take different valuesat different places.

5

Page 6: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

2.2 Convex relaxation for robust matrix completion

In the case of usual matrix completion (S0 = 0) one of the most popular methodis based on constrained nuclear norm minimization. For example, in [16] Kloppintroduces the following constrained matrix lasso estimator

A ∈ arg min‖A‖∞≤a

1

n

n∑

i=1

(Yi − 〈Xi, A〉)2 + λ‖A‖∗

,

where λ > 0 is a regularization parameter and a is an upper bound on ‖L0‖∞.We introduce an additional norm-based penalty that accounts for corrup-

tions induced by the matrix S0. This penalty depends on the structure of S0.Let (L, S) be the following estimator of (L0, S0):

(L, S) ∈ arg min‖L‖∞≤a

‖S‖∞≤a

1

N

N∑

i=1

(Yi − 〈Xi, L+ S〉)2 + λ1‖L‖∗ + λ2R(S)

. (6)

Here λ1 > 0 and λ2 > 0 are regularization parameters and a is an upperbound on ‖L0‖∞ and ‖S0‖∞. Note that our method (and all the proofs) can beimmediately adapted to the case of two different upper bounds for ‖L0‖∞ and‖S0‖∞, as it can be the case in some applications.

Let us provide two examples for the possible sparsity structure of S0 andassociated regularizer R :

• Example 1. Suppose that S0 is column-sparse, that is, it has a smallnumber s < m2 of non-zero columns. We use the (2, 1)−norm constraintfor such column-wise sparsity structure: R(A) = ‖A‖2,1. The associateddual norm is R∗(A) = ‖A‖2,∞.

• Example 2. Suppose now that S0 is sparse, that is, it has s << m1m2

non-zero entries. An usual choice of regularizer for such element-wisesparsity structure is the l1−norm: R(A) = ‖A‖1. The associated dualnorm is R∗(A) = ‖A‖∞.

In these two examples, the regularizer R is decomposable with respect to aproperly chosen set of indices I. That is, for any matrix A ∈ R

m1×m2 we have

R(A) = R(AI) + R(AI). (7)

For instance, the (2, 1)−norm is decomposable with respect to any set I suchthat

I = 1, . . . ,m1 × C (8)

where C ⊂ 1, . . . ,m2. The element-wise l1−norm is decomposable with re-spect to any subspace of indices I.

6

Page 7: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

2.3 Assumptions on the sampling scheme and the noise

Works on matrix completion (S0 = 0) usually assume i.i.d. observations Xi.For robust matrix completion it is more realistic to assume the existence of twocomponents in the observed entries. The first component, Xi, i ∈ Ω is a setof i.i.d. random matrices with some unknown distribution on

X ′ =ej(m1)eTk (m2), (j, k) ∈ I

. (9)

They are called non-corrupted observations (recall that the entries of S0 corre-sponding to indices in I are zero). These observations are of the same type asin the usual matrix completion. On this non-corrupted part of observations werequire some assumptions on sampling distribution (see Assumption 1, 2 andAssumptions 5 or 9 below).

The second component Xi, i ∈ Ω is a set of matrices with values in

X ′′ =

ej(m1)eTk (m2), (j, k) ∈ I

.

They are called corrupted observations. Importantly, we make no assumptionson how the entries corrupted by S0 are sampled. Thus, for any i ∈ Ω, wehave that the index of the corresponding entry is in I and make no furtherassumption. If we take the example of recommendation systems, this partitioninto Xi, i ∈ Ω and Xi, i ∈ Ω accounts for the difference in behavior ofnormal and malicious users.

As there is no hope for recovering the unobserved entries of S0, one shouldconsider only the estimation of restriction of S0 to Ω. This is equivalent toassume that we estimate the whole S0 when all unobserved entries of S0 arezero (c.f. [8]). This assumption will be done throughout the paper.

For i ∈ Ω, we suppose that Xi are i.i.d copies of a random matrix X havingdistribution Π on the set X ′. Let πjk = P

(X = ej(m1)eTk (m2)

)be the probabil-

ity to observe the (j, k)-th entry. One of the particular settings of this problemis the case of the uniform on X ′ distribution Π which was previously consideredin works on noiseless robust matrix completion, see, e.g., [8]. Our main result,Theorem 1, which applies to a general convex program (6), is obtained under amore general sampling model. In particular, we suppose that any non-corruptedelement is sampled with positive probability:

Assumption 1. There exists a positive constant µ ≥ 1 such that for any (j, k) ∈I,

πjk ≥ (µ|I|)−1.

In the case when Π is the of uniform distribution on X ′, µ = 1. Let us set‖A‖2L2(Π) = E

(〈A,X〉2

). Assumption 1 implies that

‖A‖2L2(Π) ≥ (µ |I|)−1‖AI‖22. (10)

Let us denote by π·k =m1

Σj=1

πjk the probability to observe an element from the

k-th column and by πj· =m2

Σk=1

πjk the probability to observe an element from

7

Page 8: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

the j-th row. Additionally, we suppose that, no column or row is sampled witha very high probability:

Assumption 2. There exists a positive constant L ≥ 1 such that

maxi,j

(π·k, πj·) ≤ L/m.

In Sections 4 and 5 we apply Theorem 1 to particular cases of column-wisesparse and element-wise sparse corruptions. It is done under more restrictiveassumptions on sampling distribution (see Assumptions 5 and 9).

In what follows, we suppose that the noise variables ξi are sub-gaussian:

Assumption 3. There exists constants σ, c1, c2 > 0, such that

maxi=1,...,n

E exp(ξ2i /σ

2)< c1 and Eξ2i ≥ c2σ

2 for any 1 ≤ i ≤ n.

3 Upper bounds for general regularizers

In this section we state our main result which applies to a general convex pro-gram (6) whenever R is based on an absolute norm and is a decomposableregularizer. In the next sections, we consider in details two particular choicesR = ‖ · ‖1 and R = ‖ · ‖2,1.

Let

Ψ1 = µ2m1m2 r(

æ2λ21 + a2 (E (‖ΣR‖))2)

+ a2 µ

log(d)

n,

Ψ2 = µ aR(IdΩ)

(λ2 a

λ1E (‖ΣR‖) + æλ2 + aE (R∗(ΣR))

)

,

Ψ3 =µ |Ω|

(a2 + σ2 log(d)

)

N

(aE (‖ΣR‖)

λ1+

aE (R∗(ΣR))

λ2+ æ

)

+a2|I|m1m2

,

Ψ4 = µ a2√

log(d)

n+ µ aR(IdΩ) [æλ2 + aE (R∗(ΣR))]

+

[aE (R∗(ΣR))

λ2+ æ

]µ |Ω|

(a2 + σ2 log(d)

)

N(11)

where d = m1 +m2.

Theorem 1. Assume that ‖L0‖∞ ≤ a, ‖S0‖∞ ≤ a for some constant a andAssumptions 1 - 3 are satisfied. Let λ1 > 4 ‖Σ‖, λ2 ≥ 4 (R∗(Σ) + 2aR∗(W )).Then, with probability at least 1 − 4.5 d−1,

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

≤ C Ψ1 + Ψ2 + Ψ3 (12)

where C is a numerical constant. Moreover, with the same probability

‖SI‖22|I| ≤ CΨ4.

8

Page 9: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

The first term in (12), Ψ1, corresponds to the estimation error associatedwith matrix completion of a rank r matrix. The second and the third termsaccount for the error induced by corruptions. In the next two sections we applyTheorem 1 to the case of the element-wise and column-wise sparse structure ofthe matrix S0.

4 Column-wise sparsity

Suppose that S0 has at most s non-zero columns. We will suppose that s ≤m2/2. Then, we use ‖ · ‖2,1 regularization and the convex program (6) takesform

(L, S) ∈ arg min‖L‖∞≤a

‖S‖∞≤a

1

N

N∑

i=1

(Yi − 〈Xi, L+ S〉)2 + λ1‖L‖1 + λ2‖S‖2,1

. (13)

In the case of column-wise sparse S0 we have |I| = m1s and, using the Cauchy-

Schwarz inequality, ‖IdΩ‖2,1 ≤√

s|Ω|. Then, Ψ2, Ψ3 and Ψ4 take form

Ψ′2 = µ a

s|Ω|(aλ2λ1

E (‖ΣR‖) + æλ2 + aE‖ΣR‖2,∞)

,

Ψ′3 =

µ |Ω|(a2 + σ2 log(d)

)

N

(aE (‖ΣR‖)

λ1+

aE‖ΣR‖2,∞λ2

+ æ

)

+a2s

m2,

Ψ′4 = µ a2

log(d)

n+ µ a

s|Ω| [æλ2 + aE‖ΣR‖2,∞]

+

[aE‖ΣR‖2,∞

λ2+ æ

]µ |Ω|

(a2 + σ2 log(d)

)

N.

Specializing Theorem 1 to this case yields the following corollary:

Corollary 4. Let (L, S) be a solution of the convex program (13) with theregularization parameters (λ1, λ2) such that

λ1 > 4 ‖Σ‖ and λ2 ≥ 4 (‖Σ‖2,∞ + 2a‖W‖2,∞) .

Assume that ‖L0‖∞ ≤ a and ‖S0‖∞ ≤ a. Then, with probability at least 1 −4.5 d−1

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

≤ C Ψ1 + Ψ′2 + Ψ′

3 .

where C is a numerical constant. Moreover, with the same probability

‖SI‖22|I| ≤ CΨ′

4.

9

Page 10: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

In order to get a bound in a closed form we need to obtain suitable upperbounds on the stochastic terms Σ, ΣR and W . We obtain such bounds underan additional assumption on the column marginal sampling distribution. Set

π(2)·,k =

∑m1

j=1 π2jk. We assume that

Assumption 5. There exists a positive constant γ ≥ 1 such that

maxk

π(2)·,k ≤ γ2

|I|m2.

This condition prevents columns to be sampled with high probability andguarantees that the non-corrupted observations are well spread out among thecolumns. In particular, it is satisfied when the distribution Π is uniform orapproximately uniform, i.e. when πjk ≍ 1

m1(m2−s). Note that Assumption 5 is

less restrictive than assuming that Π is uniform as it was done in previous workson noiseless robust matrix completion. It implies the following milder conditionon the marginal sampling distribution

maxk

π·k ≤√

2 γ

m2. (14)

Condition (14) is sufficient to control ‖Σ‖2,∞ and ‖ΣR‖2,∞ while the control on‖W‖2,∞ requires a stronger Assumption 5.

Lemma 6. Let Xi be i.i.d. with distribution Π on X ′ which satisfies Assump-tions 1, 2 and 5. Assume that (ξi)

ni=1 are independent and satisfy Assumption

3. Suppose that N ≤ m1m2, n ≤ |I| and logm2 ≥ 1. Then, there exists anumerical constant C > 0 such that with probability at least 1 − e−t

(i) ‖Σ‖ ≤ Cσmax

(√

L(t+ log d)

æNm,

(logm)(t+ log d)

N

)

and

E ‖ΣR‖ ≤ C

(√

L log(d)

nm+

log2 d

N

)

;

(ii) ‖Σ‖2,∞ ≤ Cσ

γ(t+ log(d))

æNm2+t+ log d

N

and

E ‖Σ‖2,∞ ≤ Cσ

γ log(d)

æNm2+

log d

N

;

(iii) ‖ΣR‖2,∞ ≤ C

γ(t+ log(d))

nm2+t+ log d

n

and

E ‖ΣR‖2,∞ ≤ C

γ log(d)

nm2+

log d

n

;

10

Page 11: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

(iv) ‖W‖2,∞ ≤ C

γ(t+ logm2)1/4√

æNm2

(

1 +

m2(t+ logm2)

n

)1/2

+t+ logm2

N

E‖W‖2,∞ ≤ C

γ log1/4(d)√

æNm2

(

1 +

m2 log d

n

)1/2

+log d

N

.

Let

n∗ = 2 log(d)

(m2

γ∨ m log2m

L

)

. (15)

Recall that æ = Nn ≥ 1. If n ≥ n∗, using bounds given by Lemma 6, we can

chose the regularization parameters λ1 and λ2 in the following way:

λ1 = C (σ ∨ a)

L log(d)

Nmand λ2 = C γ (σ ∨ a)

log(d)

Nm2, (16)

where C > 0 is a large enough numerical constant.With this choice of the regularization parameters Corollary 4 implies

Corollary 7. Let Xi be i.i.d. with distribution Π on X ′ which satisfies Assump-tions 1, 2 and 5. Assume that (ξi)

ni=1 are independent, satisfy Assumption 3 and

‖L0‖∞ ≤ a, ‖S0‖∞ ≤ a. Suppose that N ≤ m1m2 and n∗ ≤ n. Let (L, S) be asolution of the convex program (13) with the regularization parameters (λ1, λ2)given by (16). Then, with probability at least 1 − 6/d

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

≤ Cµ,γ,L(σ ∨ a)2 log(d) ærM + |Ω|

n+

a2s

m2

(17)

where Cµ,γ,L > 0 can depend only on µ, γ, L. Moreover, with the same proba-bility

‖SI‖22|I| ≤ Cµ,γ,L

æ(σ ∨ a)2 |Ω| log(d)

n+

a2s

m2.

Remarks. 1. In the upper bound (17) we have two terms. The first termproportional to rM is the same as in the case of the usual matrix completion (c.f.[18, 16]). The second one is induced by corruptions and is proportional to thenumber of corrupted columns s and to the number of corrupted observations,|Ω|. This term is equal to zero if there is no corruptions.

2. When all entries of A0 are observed (i.e. we are in the case of matrixdecomposition problem) the bound (17) is analogous to the corresponding boundin [1]. Indeed, then |Ω| = sm1, N = m1m2,æ ≤ 2 and we get

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

. (σ ∨ a)2(

rM

m1m2+

s

m2

)

.

The estimator studied in [1] for matrix decomposition problem is similar toour program (13). The difference between these estimators is that in (13) the

11

Page 12: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

minimum is taken over a ball in ‖ · ‖∞ norm when the program of [1] uses theminimization over a ball in ‖ · ‖2,∞ norm and requires a bound on this norm ofthe unknown matrix L0.

3. Suppose that the number of corrupted columns is small (s ≪ m2). Then,Corollary 7 guarantees, that the prediction error of our estimator is small when-ever the number of non-corrupted observations n satisfies the following condition

n & (m1 ∨m2)rank(L0) + |Ω| (18)

where |Ω| is the number of corrupted observations. This quantifies the samplesize sufficient for successful, robust to corruptions, matrix completion. Whenrank(L0) is small and s≪ m2, the right hand side of (18) is considerably smallerthan the total number of entries m1m2.

5 Element-wise sparsity

We assume now that S0 has s non-zero entries but they do not necessarily layin a small subset of columns. We will assume that s ≤ m1m2

2 . In this case, weuse ‖ · ‖1 regularization and the convex program (6) takes the form

(L, S) ∈ arg min‖L‖∞≤a

‖S‖∞≤a

1

N

N∑

i=1

(Yi − 〈Xi, L+ S〉)2 + λ1‖L‖∗ + λ2‖S‖1

. (19)

Recall that, here, I = (j, k) : (S0)jk 6= 0 is the support of the non-zero

entries of S0. Then, we have |I| = s, ‖IdΩ‖1 = |Ω| and Ψ2, Ψ3, Ψ4 take form

Ψ′′2 = µ a |Ω|

(aλ2λ1

E (‖ΣR‖) + æλ2 + aE‖ΣR‖2,∞)

,

Ψ′′3 =

µ |Ω|(a2 + σ2 log(d)

)

N

(aE (‖ΣR‖)

λ1+

aE‖ΣR‖2,∞λ2

+ æ

)

+a2s

m1m2,

Ψ′′4 = µ a2

log(d)

n+ µ a |Ω| [æλ2 + aE‖ΣR‖2,∞]

+

[aE‖ΣR‖2,∞

λ2+ æ

]µ |Ω|

(a2 + σ2 log(d)

)

N.

Specializing Theorem 1 to this case yields the following corollary:

Corollary 8. Let (L, S) be a solution of the convex program (19) with theregularization parameters (λ1, λ2) such that

λ1 > 4 ‖Σ‖ and λ2 ≥ 4 (‖Σ‖∞ + 2a‖W‖∞) .

Assume that ‖L0‖∞ ≤ a and ‖S0‖∞ ≤ a. Then, with probability at least 1 −4.5 d−1

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

≤ C Ψ1 + Ψ′′2 + Ψ′′

3 .

12

Page 13: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

where C is a numerical constant. Moreover, with the same probability

‖SI‖22|I| ≤ CΨ′′

4 .

In order to get a bound in a closed form we need to obtain suitable upperbounds on the stochastic terms Σ,ΣR and W . We obtain such bounds underan additional assumption on the sampling distribution. This condition preventsany entry from being sampled too often and guarantees that the observationsare well spread out over the non-corrupted entries.

Assumption 9. There exists a positive constant γ ≥ 1 such that

maxi,j

πij ≤µ1

|I| .

This assumption together with Assumption 1 implies that the samplingdistribution Π is approximately uniform, i.e. πjk ≍ 1

|I| . In particular, since

|I| ≤ m1m2

2 , Assumption 9 implies Assumption 2 for L = 2µ1.

Lemma 10. Let Xi be i.i.d. with distribution Π on X ′ which satisfies Assump-tion 1 and 9. Assume that (ξi)

ni=1 are independent and satisfy Assumption 3.

Then, there exists a numerical constant C > 0 such that with probability at least1 − e−t

(i) ‖W‖∞ ≤ C

µ1

æm1m2+

µ1(t+ log d))

æNm1m2+t+ log d

N

and

E‖W‖∞ ≤ C

(

µ1

æm1m2+

µ1 log d

æNm1m2+

log d

N

)

;

(ii) ‖Σ‖∞ ≤ Cσ

µ1(t+ log d)

æNm1m2+t+ log d

N

and

E ‖Σ‖∞ ≤ Cσ

(√

µ1 log d

æNm1m2+

log d

N

)

;

(iii) ‖ΣR‖∞ ≤ C

µ1(t+ log d)

nm1m2+t+ log d

n

and

E ‖ΣR‖∞ ≤ C

(√

µ1 log d

nm1m2+

log d

n

)

.

Using Lemma 6, (i) and Lemma 10, under the conditions

m1m2 log d

µ1≥ n ≥ 2m log(d) log2(m)

L(20)

13

Page 14: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

we can choose the regularization parameters λ1 and λ2 in the following way:

λ1 = C(σ ∨ a)

µ1 log(d)

Nmand λ2 = C(σ ∨ a)

log(d)

N. (21)

With this choice of the regularization parameters Corollary 4 and Lemma 14imply

Corollary 11. Let Xi be i.i.d. with distribution Π on X ′ which satisfies As-sumptions 1 and 9. Assume that (ξi)

ni=1 are independent, satisfy Assumption

3 and ‖L0‖∞ ≤ a, ‖S0‖∞ ≤ a. Suppose that N ≤ m1m2 and condition (20)

holds. Let (L, S) be a solution of the convex program (13) with the regularizationparameters (λ1, λ2) given by (21). Then, with probability at least 1 − 6/d

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

≤ Cµ,µ1æ(σ ∨ a)2 log(d)

rM + |Ω|n

+a2s

m1m2

(22)

where Cµ,µ1> 0 can depend only on µ and µ1. Moreover, with the same proba-

bility

‖SI‖22|I| ≤ Cµ,µ1

æ(σ ∨ a)2|Ω| log(d)

n+

a2s

m1m2.

Remarks. 1. As in the column sparsity case, we observe two terms in theupper bound (22). The first term, proportional to rM/n, is the same as in thecase of the usual matrix completion (c.f. [18, 16]). The second and the thirdones are induced by corruptions and are proportional to the number of nonzeroentries in S0, s, and to the number of corrupted observations, |Ω|. We willprove in Section 6 below that these error terms are of the correct order (up toa logarithmic factor). When s ≪ n < m1m2, this bound implies that, one canrecover a low-rank matrix from a nearly minimal number of observations, evenwhen a part of these samples has been corrupted.

2. When all entries of A0 are observed (i.e. we are in the case of matrixdecomposition problem) the bound (17) is analogous to the corresponding boundin [1]. Indeed, then |Ω| ≤ s,N = m1m2,æ ≤ 2 and we get

‖L0 − L‖22m1m2

+‖S0 − S‖22m1m2

. (σ ∨ a)2(

rM

m1m2+

s

m1m2

)

.

6 Minimax lower bounds

In this section, we prove the minimax lower bounds showing that the ratesattained by our estimator are optimal up to a logarithmic factor. We will denoteby inf(L,S) the infimum over all pairs of estimators (L, S) for the decomposition

A0 = L0+S0 where L, S both take values in Rm1×m2 . For any A0 ∈ R

m1×m2 , letPA0

denote the probability distribution of the observations (X1, Y1, . . . , Xn, Yn)satisfying (1).

14

Page 15: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

We begin with the case of column-wise sparsity. For any matrix S ∈ Rm1×m2 ,

we denote by ‖S‖2,0 the number of nonzero columns of S. For any integers0 ≤ r ≤ min(m1,m2), 0 ≤ s ≤ m2 and any a > 0, we consider the class ofmatrices

AGS(r, s, a) =A0 = L0 + S0 ∈ R

m1×m2 : rank(L0) ≤ r, ‖L0‖∞ ≤ a,

and ‖S0‖2,0 ≤ s , ‖S0‖∞ ≤ a (23)

and define

ψGS(N, r, s) = (σ ∧ a)2

(

Mr + |Ω|n

+s

m2

)

.

The following theorem gives a lower bound on the estimation risk in the case ofcolumn-wise sparsity.

Theorem 2. Suppose that m1,m2 ≥ 2. Fix a > 0 and integers 1 ≤ r ≤min(m1,m2) and 1 ≤ s ≤ m2/2. Let Assumption 9 be satisfied. Assume thatMr ≤ n, |Ω| ≤ sm1 and æ ≤ 1 + s/m2. Suppose that the variables ξi arei.i.d. Gaussian N (0, σ2), σ2 > 0, for i = 1, . . . , n. Then, there exist absoluteconstants β ∈ (0, 1) and c > 0, such that

inf(L,S)

sup(L0,S0)∈AGS(r,s,a)

PA0

(

‖L− L0‖22m1m2

+‖S − S0‖22m1m2

> cψGS(N, r, s)

)

≥ β.

We consider now the case of element-wise sparsity. For any matrix S ∈Rm1×m2 , we denote by ‖S‖0 the number of nonzero entries of S. For any

integers 0 ≤ r ≤ min(m1,m2), 0 ≤ s ≤ m1m2/2 and any a > 0, we consider theclass of matrices

AS(r, s, a) =A0 = L0 + S0 ∈ R

m1×m2 : rank(L0) ≤ r,

‖S0‖0 ≤ s, ‖L0‖∞ ≤ a, ‖S0‖∞ ≤ a

and define

ψS(N, r, s) = (σ ∧ a)2

Mr + |Ω|n

+s

m1m2

.

We have the following theorem for the lower bound in the case of standardsparsity.

Theorem 3. Assume that m1,m2 ≥ 2. Fix a > 0 and integers 1 ≤ r ≤min(m1,m2) and 1 ≤ s ≤ m1m2/2. Let Assumption 9 be satisfied. Assume thatMr ≤ n and there exists a constant ρ > 0 such that |Ω| ≤ ρ rM . Suppose thatthe variables ξi are i.i.d. Gaussian N (0, σ2), σ2 > 0, for i = 1, . . . , n. Then,there exist absolute constants β ∈ (0, 1) and c > 0, such that

inf(L,S)

sup(L0,S0)∈AS(r,s,a)

PA0

(‖L− L0‖22m1m2

+‖S − S0‖22m1m2

> cψS(N, r, s)

)

≥ β. (24)

15

Page 16: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

Appendix

A Proof of Theorem 1

1) Let F(L, S) = 1N

∑Ni=1 (Yi − 〈Xi, L+ S〉)2 +λ1‖L‖∗+λ2R(S), ∆L = L0− L

and ∆S = S0 − S. Using F(L, S) ≤ F(L0, S0) and (1) we get

1

N

N∑

i=1

(〈Xi,∆L+ ∆S〉 + ξi)2+λ1‖L‖∗+λ2R(S) ≤ 1

N

N∑

i=1

ξ2i +λ1‖L0‖∗+λ2R(S0).

Hence,

1

N

i∈Ω

〈Xi,∆L+ ∆S〉2 ≤ 1

N

i∈Ω

|〈ξiXi,∆L+ ∆S〉| − 1

N

i∈Ω

〈Xi,∆L+ ∆S〉2

︸ ︷︷ ︸

I

+ 2 |〈Σ,∆L〉| + λ1

(

‖L0‖∗ − ‖L‖∗)

︸ ︷︷ ︸

II

+ 2 |〈Σ,∆SI〉| + λ2

(

R(S0) −R(S))

︸ ︷︷ ︸

III

(25)

where Σ = 1N

i∈Ω ξiXi and we use 〈Σ,∆S〉 = 〈Σ,∆SI〉.We consider the following random event

A =

max1≤i≤N

|ξi| ≤ Cσ√

log d

. (26)

A standard bound on the maximum of sub-gaussian variables and N ≤ m1m2

imply that there exists a numerical constant C such that P(A) ≥ 1 − 12d .

We start by estimating I. On the event A we have that

I ≤ 1

4N

i∈Ω

ξ2i ≤ C σ2|Ω| log(d)

N. (27)

Now we estimate II. Let PS be the projector on the linear vector subspace Sand let S⊥ be the orthogonal complement of S. For any A ∈ R

m1×m2 , let uj(A)and vj(A) denote respectively the left and right orthonormal singular vectorsof A. S1(A) is the linear span of uj(A), S2(A) is the linear span of vj(A).We set

P⊥A(B) = PS⊥

1(A)BPS⊥

2(A) and PA(B) = B −P⊥

A(B).

16

Page 17: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

By definition of P⊥L0

, for any matrix B the singular vectors of P⊥L0

(B) areorthogonal to the space spanned by the singular vectors of L0. This impliesthat

∥∥L0 + P⊥

L0(∆L)

∥∥∗

= ‖L0‖∗ +∥∥P⊥

L0(∆L)

∥∥∗. Then

∥∥∥L∥∥∥∗

= ‖L0 + ∆L‖∗=∥∥L0 + P⊥

L0(∆L) + PL0

(∆L)∥∥∗

≥∥∥L0 + P⊥

L0(∆L)

∥∥∗− ‖PL0

(∆L)‖∗= ‖L0‖∗ +

∥∥P⊥

L0(∆L)

∥∥∗− ‖PL0

(∆L)‖∗ .

(28)

Note that from (28) we get

‖L0‖∗ −∥∥∥L∥∥∥∗≤ ‖PL0

(∆L)‖∗ −∥∥P⊥

L0(∆L)

∥∥∗. (29)

Using (29), by the duality between the nuclear and the operator norms, weobtain

II ≤ 2‖Σ‖‖∆L‖∗ + λ1(‖PL0

(∆L)‖∗ −∥∥P⊥

L0(∆L)

∥∥∗

).

Using λ1 ≥ 4‖Σ‖ and the triangle inequality we get

II ≤ 3

2λ1 ‖PL0

(∆L)‖∗ ≤ 3

2λ1

√2r ‖∆L‖2 (30)

where r = rank(L0) and we use rank(PL0(∆L)) ≤ 2 rank(L0).

For the third term in (25) we use the duality between the R and R∗ and∆SI = SI :

III ≤ 2R∗(Σ)R(SI) + λ2

(

R(S0) −R(S))

.

Now, λ2 ≥ 4R∗(Σ) impliesIII ≤ λ2R(S0). (31)

Putting (30), (31) and (27) into (25) we get that on the event A

1

n

i∈Ω

〈Xi,∆L+ ∆S〉2 ≤ 3 æλ1√2

√r ‖∆L‖2 + æλ2R(S0) +

Cσ2|Ω| log(d)

n(32)

where æ = N/n.

2) We will show now that, with hight probability,

1

n

i∈Ω

〈Xi,∆L+ ∆S〉2 ≥ ‖∆L+ ∆S‖2L2(Π) − E (33)

with an appropriate choice of E . In order to prove it, we use Lemma 15 (ii)which guarantees that (33) holds if the pair of matrices (∆L,∆S) belongs to aproperly chosen constrain set.

17

Page 18: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

We start by defining such constrain set. For some positive constants δ1 andδ2 we define the following set of matrices

B(δ1, δ2) =

B ∈ Rm1×m2 : ‖B‖2L2(Π) ≤ δ21 and R(B) ≤ δ2

and let D(τ, κ) be the set of the pair of matrices (A,B) such that

D(τ, κ) =

B,A ∈ Rm1×m2 : ‖A+B‖2L2(Π) ≥

64 log(d)

log (6/5) n,

‖A+B‖∞ ≤ 1, ‖A‖∗ ≤ √τ ‖AI‖2 + κ

where κ and τ < m1 ∧m2 are some positive constants.By definition of L, S we have that ‖∆L+ ∆S‖∞ ≤ 4a. We now consider two

cases, depending on whether the couple 14a (∆L,∆S) belongs to a set D(τ, κ) ∩

Rm1×m2 × B(δ1, δ2) or not.

Case 1: Suppose first that ‖∆L+ ∆S‖2L2(Π) < 16a2

64 log(d)

log (6/5) n. An easy

calculation gives

‖∆L+ ∆S‖2L2(Π) ≥1

2‖∆L‖2L2(Π) − ‖∆S‖2L2(Π) (34)

which together with Lemma 14 implies that with probability at least 1 − 2/d

‖∆L‖2L2(Π) ≤ ∆1 (35)

where

∆1 = C

a2√

log(d)

n+ aR(IdΩ) [æλ2 + aE (R∗(ΣR))]

+

(aE (R∗(ΣR))

λ2+ æ

) |Ω|(a2 + σ2 log(d)

)

N

.

Case 2: We consider now the case∥∥∥∆L+ ∆S

∥∥∥

2

L2(Π)≥ 16a2

64 log(d)

log (6/5) n.

Lemma 13 implies that on the event A

‖∆L‖∗ ≤ 4√

2r‖∆L‖2 +λ2 a

λ1R(IdΩ) +

Cσ2|Ω| log(d)

Nλ1

≤ 4√

2r‖∆LI‖2 + 8a

2r|I| +λ2 a

λ1R(IdΩ) +

Cσ2|Ω| log(d)

Nλ1.

(36)

On the other hand, (50) implies that on the event A

R(∆S) ≤ 8aR(IdΩ) +|Ω|(8a2 + Cσ2 log(d)

)

Nλ2. (37)

18

Page 19: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

Using Lemma 14 and (37) we obtain that, with probability at least 1 −2.5 d−1,

∆S

4a∈ B

(√∆1

4a, 2R(IdΩ) +

C|Ω|(a2 + σ2 log(d)

)

4aNλ2

)

= B.

This, together with (36), implies that, with probability at least 1 − 2.5 d−1, wehave that 1

4a (∆L,∆S) ∈ D(τ, κ) ∩Rm1×m2 × B

with

τ = 32r and κ = 2

2r|I| +λ24λ1

R(IdΩ) +Cσ2|Ω| log(d)

4aNλ1.

Then, we can apply Lemma 15 (ii).From Lemma 15 (ii) and (32) we obtain that with probability at least 1 −

4.5 d−1 one has

1

2‖∆L+ ∆S‖2L2(Π) ≤

3 æλ1√2

√r ‖∆L‖2 + CE (38)

where

E = µ a2 r |I| (E (‖ΣR‖))2

+ 8a2√

2r|I|E (‖ΣR‖) + λ2R(IdΩ)

(a2E (‖ΣR‖)

λ1+ aæ

)

+|Ω|(a2 + σ2 log(d)

)

N

(aE (‖ΣR‖)

λ1+

aE (R∗(ΣR))

λ2+ æ

)

+ ∆1.

(39)

Putting

3 æ√2λ1

√r ‖∆L‖2 ≤ 9 æ2 µm1m2 r λ

21

2+

‖∆L‖224µm1m2

≤ 9 æ2 µm1m2 r λ21

2+

‖∆LI‖224µm1m2

+a2|I|µm1m2

.

in (38) we get

‖∆L+ ∆S‖2L2(Π) ≤9 æ2 µm1m2 r λ

21

4+

‖∆LI‖224µm1m2

+a2|I|µm1m2

+ CE .

Using again (34), Lemma 14, (10) and |I| ≤ m1m2 we obtain

‖∆LI‖22µ|I| ≤ C

æ2 µm1m2 r λ21 +

a2|I|µm1m2

+ E

.

Finally, using |I| ≤ m1m2 and√

2r|I|E (‖ΣR‖) ≤ |I|µm1m2

+µm1m2 r (E (‖ΣR‖))2

we get

19

Page 20: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

‖∆LI‖22|I| ≤ C Ψ1 + Ψ2 + Ψ3 (40)

where

Ψ1 = µ2m1m2 r(

æ2λ21 + (aE (‖ΣR‖))2)

+ a2 µ

log(d)

n,

Ψ2 = µ aR(IdΩ)

(aλ2λ1

E (‖ΣR‖) + æλ2 + aE (R∗(ΣR))

)

,

Ψ3 =µ |Ω|

(a2 + σ2 log(d)

)

N

(aE (‖ΣR‖)

λ1+

aE (R∗(ΣR))

λ2+ æ

)

+a2|I|m1m2

.

Now, using

‖∆L‖22 ≤ ‖∆LI‖22 + ‖∆LI‖22 ≤ ‖∆LI‖22 + 4a2|I|

together with Lemma 14 and the inequalities (40), (35) we get the statement ofTheorem 1.

Lemma 12. Assume that λ2 ≥ 4 (R∗(Σ) + 2aR∗(W )). Then, we have

R(∆SI) ≤ 3R(∆SΩ) +1

Nλ2

4a2|Ω| +∑

i∈Ω

ξ2i

Proof. Let ∂‖·‖∗, and ∂R denote the subdifferentials of ‖·‖∗ and R respectively.By the standard condition for optimality over a convex set we are guaranteedthat

− 2

N

N∑

i=1

(

Yi −⟨

Xi, L+ S⟩)⟨

Xi, L+ S − L− S⟩

+ λ1

∂‖L‖∗, L− L⟩

+ λ2

∂R(S), S − S⟩

≥ 0

(41)

for all feasible pairs (L, S). In particular, for (L, S0) we obtain

− 2

N

N∑

i=1

(

Yi −⟨

Xi, L+ S⟩)

〈Xi,∆S〉 + λ2

∂R(S),∆S⟩

≥ 0

which implies

− 2

N

N∑

i=1

〈Xi,∆S〉2 −2

N

i∈Ω

〈Xi,∆L〉 〈Xi,∆S〉 −2

N

i∈Ω

ξi 〈Xi,∆S〉

− 2

N

i∈Ω

〈Xi,∆L〉 〈Xi,∆S〉 − 2 〈Σ,∆S〉 + λ2

∂R(S),∆S⟩

≥ 0.

20

Page 21: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

Recall that ‖∆L‖∞ ≤ 2a, then using

− 2

N

N∑

i=1

〈Xi,∆S〉2 −2

N

i∈Ω

〈Xi,∆L〉 〈Xi,∆S〉 −2

N

i∈Ω

ξi 〈Xi,∆S〉

≤ 1

N

i∈Ω

〈Xi,∆L〉2 +1

N

i∈Ω

ξ2

≤ 4a2|Ω|N

+1

N

i∈Ω

ξ2i

we get

λ2

∂R(S), S − S0

≤ 2

∣∣∣∣∣

1

N

i∈Ω

〈Xi,∆L〉Xi,∆S

⟩∣∣∣∣∣

+ 2 |〈Σ,∆S〉| +4a2|Ω|N

+1

N

i∈Ω

ξ2i

≤ 2R∗

(

1

N

i∈Ω

〈Xi,∆L〉Xi

)

R(∆S) + 2R∗(Σ)R(∆S)

+4a2|Ω|N

+1

N

i∈Ω

ξ2i .

(42)

By the definition of R∗ we have that

R∗

(

1

N

i∈Ω

〈Xi,∆L〉Xi

)

= supR(B)≤1

1

N

i∈Ω

〈Xi,∆L〉Xi, B

= supR(B)≤1

1

N

i∈Ω

〈Xi,∆L〉 〈Xi, B〉 .(43)

Using that R is an absolute norm, (43) and ‖∆L‖∞ ≤ 2a imply

R∗

(

1

N

i∈Ω

〈Xi,∆L〉Xi

)

≤ 2a supR(B)≤1

1

N

i∈Ω

〈Xi, B〉

= 2aR∗(W )

(44)

where W = 1N

i∈ΩXi.Now, by convexity of R(·) and definition of subdifferential, we have

R(S0) ≥ R(S) +⟨

∂R(S),∆S⟩

. (45)

Putting (44) and (45) into (42) we obtain

λ2

(

R(S) −R(S0))

≤ 4aR∗(W )R(∆S) + 2R∗(Σ)R(∆S) +4a2|Ω|N

+1

N

i∈Ω

ξ2i .

21

Page 22: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

On the other hand, by decomposability of R(·), using (S0)I = 0 and thetriangle inequality we have

R(S0 − ∆S) −R(S0) = R ((S0 − ∆S)I) + R ((S0 − ∆S)I) −R ((S0)I)

≥ R ((∆S)I) −R ((∆S)I) .(46)

Using λ2 ≥ 4 (2aR∗(W ) + R∗(Σ)) and (46) we get that

λ2 (R ((∆S)I) −R ((∆S)I))

≤ λ22

(R (∆SI) + R ((∆S)I)) +4a2|Ω|N

+1

N

i∈Ω

ξ2i

which implies

R (∆SI) ≤ 3R (∆SI) +1

Nλ2

4a2|Ω| +∑

i∈Ω

ξ2i

. (47)

Assuming that all unobserved entries of S0 are zero implies (S0)I = (S0)Ω. On

the other hand, as R(·) is a monotonic norm, we get that SI = SΩ. Indeed,

having a non-zero element on the non-observed part means larger R(S) but

do not decrease 1N

∑Ni=1 (Yi − 〈Xi, A〉)2. So we have that ∆SI = ∆SΩ, which

together with (47) imply the statement of the Lemma 12.

Lemma 13. Suppose that λ1 ≥ 4‖Σ‖ and λ2 ≥ 4R∗(Σ). Then,

∥∥P⊥

L0(∆L)

∥∥∗≤ 3 ‖PL0

(∆L)‖∗ +λ2 a

λ1R(IdΩ) +

1

Nλ1

i∈Ω

ξ2i .

Proof. Using (41) for (L, S) = (L0, S0) we obtain

− 2

N

N∑

i=1

〈Xi,∆S + ∆L〉2 − 2

N

i∈Ω

〈ξiXi,∆L + ∆S〉

− 2 〈Σ, (∆S)I〉 − 2 〈Σ,∆L〉 + λ1

∂‖L‖∗,∆L⟩

+ λ2

∂R(S),∆S⟩

≥ 0.

(48)

By convexity of ‖ · ‖∗ and R(·) and using the definition of the subdifferential,we have

‖L0‖∗ ≥ ‖L‖∗ +⟨

∂‖L‖∗,∆L⟩

R(S0) ≥ R(S) +⟨

∂R(S),∆S⟩

which together with (48) yields

λ1

(

‖L‖∗ − ‖L0‖∗)

+ λ2

(

R(S) −R(S0))

≤ 2‖Σ‖‖∆L‖∗ + 2R∗(Σ)R (∆SI)

+1

N

i∈Ω

ξ2i .

22

Page 23: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

Using λ1 ≥ 4‖Σ‖, λ2 ≥ 4R∗(Σ), the triangle inequality and (29) we get

λ1(∥∥P⊥

L0(∆L)

∥∥∗− ‖PL0

(∆L)‖∗)

+ λ2

(

R(S) −R(S0))

≤ λ12

(∥∥P⊥

L0(∆L)

∥∥∗

+ ‖PL0(∆L)‖∗

)+λ22R(

SI

)

+1

N

i∈Ω

ξ2i .

Assuming that all unobserved entries of S0 are zero implies R(S0) ≤ aR(IdΩ).This together with the last inequality get

∥∥P⊥

L0(∆L)

∥∥∗≤ 3 ‖PL0

(∆L)‖∗ +λ2 a

λ1R(IdΩ) +

1

Nλ1

i∈Ω

ξ2i

as stated.

Lemma 14. Let n > m1 and λ2 ≥ 4 (R∗(Σ) + 2aR∗(W )). Assume that Xi arei.i.d. with distribution Π on X ′ which satisfies Assumption 1 and 2. Assumethat ‖S0‖∞ ≤ a for some constant a and Assumption 3 is satisfied. Then, withprobability at least 1 − 2.5 d−1

‖∆S‖2L2(Π) ≤ C

a2√

log(d)

n+ aR(IdΩ) [æλ2 + aE (R∗(ΣR))]

+

(aE (R∗(ΣR))

λ2+ æ

) |Ω|(a2 + σ2 log(d)

)

N

.

Proof. Using F(L, S) ≤ F(L, S0) and (1) we obtain

1

N

N∑

i=1

(〈Xi,∆L+ ∆S〉 + ξi)2

+ λ2R(S)

≤ 1

N

N∑

i=1

(〈Xi,∆L〉 + ξi)2

+ λ2R(S0)

which implies

1

N

i∈Ω

〈Xi,∆S〉2 +1

N

i∈Ω

〈Xi,∆S〉2 +2

N

i∈Ω

〈Xi,∆L〉 〈Xi,∆S〉 +2

N

i∈Ω

〈ξiXi,∆S〉

+2

N

i∈Ω

〈Xi,∆L〉 〈Xi,∆SI〉 + 2 〈Σ,∆SI〉 + λ2R(S) ≤ λ2R(S0).

Using (44) and the duality between R and R∗, we obtain

1

N

i∈Ω

〈Xi,∆S〉2 ≤ 2 (2aR∗(W ) + R∗(Σ))R(∆SI) + λ2

(

R(S0) −R(S))

+2

N

i∈Ω

〈Xi,∆L〉2 +2

N

i∈Ω

ξ2.

23

Page 24: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

Then, ∆SI = SI and λ2 ≥ 4 (R∗(Σ) + 2aR∗(W )) imply

1

N

i∈Ω

〈Xi,∆S〉2 ≤ λ2R (S0) +2

N

i∈Ω

〈Xi,∆L〉2 +2

N

i∈Ω

ξ2. (49)

Now, Lemma 12 and ‖∆S‖∞ ≤ 2a imply that on the event A defined in (26)

R(∆S) ≤ 4R(∆SΩ) +|Ω|(4a2 + Cσ2 log(d)

)

Nλ2

≤ 8aR(IdΩ) +|Ω|(4a2 + Cσ2 log(d)

)

Nλ2.

(50)

For a δ > 0 we define the following constrain set

C(δ) =

A ∈ Rm1×m2 : ‖A‖∞ ≤ 1, ‖A‖2L2(Π) ≥

64 log(d)

log (6/5) n,R(A) ≤ δ

.

(51)By the definition of S we have that ‖∆S‖∞ ≤ 2a. We consider now two

cases depending on whenever the matrix ∆S/2a belongs to the set

C(

4R(IdΩ) +|Ω|(8a2 + Cσ2 log(d)

)

2aNλ2

)

or not.

Case I: If ‖∆S‖2L2(Π) ≤ 4a2√

64 log(d)log(6/5)n , then the statement of the Lemma

14 holds.

Case II: If ‖∆S‖2L2(Π) > 4a2√

64 log(d)log(6/5)n , then the inequality (50) implies

that on the event A

∆S

2a∈ C

(

4R(IdΩ) +|Ω|(8a2 + Cσ2 log(d)

)

2aNλ2

)

and we can apply Lemma 15 (i). Le inequality (49), Lemma 15 (i), ‖∆L‖∞ ≤ 2aand R (S0) ≤ aR(IdI) imply that with probability at least 1 − 2.5 d−1

‖∆S‖2L2(Π) ≤ C

aR(IdΩ) [æλ2 + aE (R∗(ΣR))]

+

[aE (R∗(ΣR))

λ2+ æ

] |Ω|(a2 + σ2 log(d)

)

N

as stated.

Lemma 15. Let Xi be i.i.d. with distribution Π on X ′ which satisfies Assump-tion 1 and 2. Then,

24

Page 25: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

(i) for any S ∈ C(δ)

1

n

i∈Ω

〈Xi, S〉2 ≥ 1

2‖S‖2L2(Π) − 8δE (R∗(ΣR))

with probability at least 1 − 2

d.

(ii) For any couple (L, S) ∈ D(τ, κ) ∩ Rm1×m2 × B(δ1, δ2)

1

n

i∈Ω

〈Xi, L+ S〉2 ≥ 1

2‖L+ S‖2L2(Π) −

360µ |I| τ (E (‖ΣR‖))2

+4δ21 + 8δ2 E (R∗(ΣR)) + 8κE (‖ΣR‖)

with probability at least 1 − 2

d.

Proof. We prove (i) and (ii) together. Denote A = S for (i) and A = L+ S for(ii). Put

E =

8δE (R∗(ΣR)) for (i)

360µ |I| τ (E (‖ΣR‖))2

+ 4δ21 + 8δ2 E (R∗(ΣR)) + 8κE (‖ΣR‖) for (ii)

and

C =

C(δ) for (i)D(τ, κ) ∩ (Rm1×m2 × B(δ1, δ2)) for (ii).

We will show that the probability of the following “bad” event is small

B =

∃A ∈ C such that

∣∣∣∣∣

1

n

i∈Ω

〈Xi, A〉2 − ‖A‖2L2(Π)

∣∣∣∣∣>

1

2‖A‖2L2(Π) + E

.

Note that B contains the complement of the event that we are interested in.In order to estimate the probability of B we use a standard peeling argument.

Let ν =

64 log(d)

log (6/5) nand α =

6

5. For l ∈ N set

Sl =

A ∈ C : αl−1ν ≤ ‖A‖2L2(Π) ≤ αlν

.

If the event B holds for some matrix A ∈ C(r), then A belongs to some Sl and

∣∣∣∣∣

1

n

i∈Ωr

〈Xi, A〉2 − ‖A‖2L2(Π)

∣∣∣∣∣>

1

2‖A‖2L2(Π) + E

>1

2αl−1ν + E

=5

12αlν + E .

(52)

25

Page 26: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

For each T > ν consider the following set of matrices

C(T ) =

A ∈ C : ‖A‖2L2(Π) ≤ T

and the following event

Bl =

∃A ∈ C(αlν) :

∣∣∣∣∣

1

n

i∈Ω

〈Xi, A〉2 − ‖A‖2L2(Π)

∣∣∣∣∣>

5

12αlν + E

.

Note that A ∈ Sl implies that A ∈ C(αlν). Then (52) implies that Bl holds andwe get B ⊂ ∪Bl. Thus, it is enough to estimate the probability of the simplerevent Bl and then apply the union bound. Such an estimation is given by theLemma 16 which implies that P (Bl) ≤ exp(−c5 nα2lν2). Using the union boundwe obtain

P (B) ≤∞

Σl=1

P (Bl)

≤∞

Σl=1

exp(−c5 nα2l ν2)

≤∞

Σl=1

exp(−(2 c5 n log(α) ν2

)l)

where we used ex ≥ x. We finally obtain, for ν =

64 log(d)

log (6/5) n,

P (B) ≤ exp(−2 c5 n log(α) ν2

)

1 − exp (−2 c5 n log(α) ν2)=

exp (− log(d))

1 − exp (− log(d)).

Let

ZT = supA∈C(T )

∣∣∣∣∣

1

n

i∈Ω

〈Xi, A〉2 − ‖A‖2L2(Π)

∣∣∣∣∣.

Lemma 16. Let Xi be i.i.d. with distribution Π on X which satisfies Assump-tion 1 and 2. Then,

P

(

ZT >5

12T + E

)

≤ exp(−c5 nT 2)

where c5 =1

128.

Proof. Our approach is standard: first we show that ZT concentrates aroundits expectation and then we upper bound the expectation.

Massart’s concentration inequality (see e.g. [3, Theorem 14.2]) implies that

P

(

ZT ≥ E (ZT ) +1

9

(5

12T

))

≤ exp(−c5 nT 2

)(53)

26

Page 27: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

where c5 =1

128. Next we bound the expectation E (ZT ). Using a standard

symmetrization argument (see e.g. [17, Theorem 2.1]) we obtain

E (ZT ) = E

(

supA∈C(T )

∣∣∣∣∣

1

n

i∈Ω

〈Xi, A〉2 − E

(

〈X,A〉2)∣∣∣∣∣

)

≤ 2E

(

supA∈C(T )

∣∣∣∣∣

1

n

i∈Ω

ǫi 〈Xi, A〉2∣∣∣∣∣

)

where ǫini=1 is an i.i.d. Rademacher sequence. The assumption ‖A‖∞ = 1implies |〈Xi, A〉| ≤ 1. Then, the contraction inequality (see e.g. [17]) yields

E (ZT ) ≤ 8E

(

supA∈C(T )

∣∣∣∣∣

1

n

i∈Ω

ǫi 〈Xi, A〉∣∣∣∣∣

)

= 8E

(

supA∈C(T )

|〈ΣR, A〉|)

where ΣR =1

n

n∑

i=1

ǫiXi.

In order to control E

(

supA∈C(T )

|〈ΣR, A〉|)

we will consider separately the case

of C = C(δ) and C = D(τ, κ) ∩ Rm1×m2 × B(δ1, δ2).

Let us consider first the case A ∈ C(δ) and ‖A‖2L2(Π) ≤ T :

By the definition of C(δ), (51), we have that

R(A) ≤ δ.

Then, by the duality between R and R∗, we obtain

E (ZT ) ≤ 8E

(

supR(A)≤δ

|〈ΣR, A〉|)

≤ 8δE (R∗(ΣR)) .

Finally, using the concentration bound (53) we obtain that

P

(

ZT >5

12T + E

)

≤ exp(−c5 nT 2)

with c5 =1

128and E = 8δ E (R∗(ΣR)) as stated.

Now we consider the case of A = L + S where the couple (L, S) ∈ D(τ, κ),S ∈ B(δ1, δ2) and ‖L + S‖2L2(Π) ≤ T . By the definition of B(δ1, δ2), we havethat

R(S) ≤ δ2.

On the other hand, by the definition of D(τ, κ), we have

‖L‖∗ ≤ √τ‖LI‖2 + κ

27

Page 28: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

and‖L‖L2(Π) ≤ ‖L+ S‖L2(Π) + ‖S‖L2(Π) ≤

√T + δ1.

The last two inequalities imply

‖L‖∗ ≤√

µ |I| τ (√T + δ1) + κ.

LetΓ1 =

µ |I| τ (√T + δ1) + κ.

Then we have the following estimation on E

(

supA∈C(T )

|〈ΣR, A〉|)

:

E

(

supA∈C(T )

|〈ΣR, A〉|)

≤ 8E

(

sup‖L‖∗≤Γ1

|〈ΣR, L〉| + supR(A)≤δ2

|〈ΣR, A〉|)

≤ 8 Γ1 E (‖ΣR‖) + δ2 E (R∗(ΣR)) .

Finally, using

1

9

(5

12T

)

+ 8√

µ |I| τ T E (‖ΣR‖) ≤(

1

9+

8

9

)5

12T + 44µ |I| τ (E (‖ΣR‖))2 ,

δ1√

µ |I| τ ,E (‖ΣR‖) ≤ µ |I| τ (E (‖ΣR‖))2

+δ212

and the concentration bound (53) we obtain that

P

(

ZT >5

12T + E

)

≤ exp(−c5 nT 2)

with c5 =1

128and

E = 360µ |I| τ (E (‖ΣR‖))2

+ 4δ21 + 8δ2 E (R∗(ΣR)) + 8κE (‖ΣR‖) (54)

as stated.

With λ1 and λ2 given by (16) we obtain

Ψ1 = µ2æ2(σ ∨ a)2M r log d

N,

Ψ′2 ≤ µ2æ2(σ ∨ a)2 log(d)

|Ω|N

+a2s

m2,

Ψ′3 =

µæ|Ω|(a2 + σ2 log(d)

)

N+

a2s

m2

Ψ′4 ≤ µæ2|Ω|

(a2 + σ2 log(d)

)

N+ a2

log(d)

n+

a2s

m2

and we get the statement of the Corollary 7.

28

Page 29: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

B Proof of Theorems 2 and 3

Note that assumption æ ≤ 1 + s/m2 implies that

|Ω|n

≤ s

m2. (55)

Assume w.l.o.g. that m1 ≥ m2. For a γ ≤ 1, define

L =

L = (lij) ∈ Rm1×r : lij ∈

0, γ(σ∧a)(rM

n

)1/2

, ∀1 ≤ i ≤ m1, 1 ≤ j ≤ r

,

and consider the associated set of block matrices

L =

L = ( L · · · L O ) ∈ Rm1×m2 : L ∈ L

,

where O denotes the m1×(m2−r⌊m2/(2r)⌋) zero matrix, and ⌊x⌋ is the integerpart of x.

We build similarly the set of matrices

S =

S = (sij) ∈ Rm1×s : sij ∈

0, γ(σ ∧ a)

, ∀1 ≤ i ≤ m1, 1 ≤ j ≤ s

,

andS =

S = ( O S ) ∈ Rm1×m2 : S ∈ S

,

where O is the m1 × (m2 − s) zero matrix.We now set

A = A = L+ S : L ∈ L, S ∈ S .

Remark 1. In the case m1 < m2, we only need to change the construction ofthe low rank component of the test set. We first build a matrix L =

(L O

)∈

Rr×m2 where L ∈ Rr×(m2/2) with entries in

0, γ(σ ∧ a)(rMn

)1/2

and then we

replicate this matrix to obtain a block matrix L of size m1 ×m2

L =

L

...

L

O

.

By construction, any element of A as well as the difference of any two ele-ments of A can be decomposed into a low rank component L of rank at most rand a group sparse component S with at most s nonzero columns. In addition,the entries of any matrix in A take values in [0, a]. Thus, A ⊂ AGS(r, s, a).

29

Page 30: arXiv:1412.8132v1 [math.ST] 28 Dec 2014arXiv:1412.8132v1 [math.ST] 28 Dec 2014 RobustMatrixCompletion Olga Klopp CREST and MODAL’X University Paris Ouest, 92001 Nanterre, France

Let us first establish the lower bound involving the matrix completion term,rM/n. Let A ⊂ A such that for any A = L+S ∈ A, S = 0. Varshamov-Gilbertbound (cf. Lemma 2.9 in [25]) guarantees the existence of a subset A0 ⊂ A withcardinality Card(A0) ≥ 2(rM)/8 + 1 containing the zero m1 ×m2 matrix 0 andsuch that, for any two distinct elements A1 and A2 of A0,

‖A1 −A2‖22 ≥Mr

8

(

γ2(σ ∧ a)2Mr

n

)⌊m2

r

≥ γ2

16(σ ∧ a)2m1m2

Mr

n. (56)

Using that, conditionally on Xi, the distributions of ξi are Gaussian, we getthat, for any A ∈ A0, the Kullback-Leibler divergence K

(P0,PA

)between P0

and PA satisfies

K(P0,PA

)=

|Ω|2σ2

‖A‖2L2(Π) ≤µγ2Mr

2. (57)

From (57) we deduce that the condition

1

Card(A0) − 1

A∈A0

K(P0,PA) ≤ α log(Card(A0) − 1

)(58)

is satisfied for any α > 0 if γ > 0 is chosen as a sufficiently small numericalconstant depending on α. In view of (56) and (58) and using the application ofTheorem 2.5 in [25] implies

inf(L,S)

sup(L0,S0)∈AGS(r,s,a)

PA0

(

‖L− L0‖22m1m2

+‖S − S0‖22m1m2

>C(σ ∧ a)2Mr

n

)

≥ β

(59)for some absolute constants β ∈ (0, 1).

We now describe the construction of a testing set for lower bounding the error induced by corruptions. Let $\bar{\mathcal{A}} \subset \mathcal{A}$ be such that, for any $A = L + S \in \bar{\mathcal{A}}$, we have $L = 0$. The Varshamov-Gilbert bound (cf. Lemma 2.9 in [25]) guarantees the existence of a subset $\mathcal{A}^0 \subset \bar{\mathcal{A}}$ with cardinality $\mathrm{Card}(\mathcal{A}^0) \ge 2^{sm_1/8} + 1$ containing the zero $m_1 \times m_2$ matrix $\mathbf{0}$ and such that, for any two distinct elements $A_1$ and $A_2$ of $\mathcal{A}^0$,
$$\|S_1 - S_2\|_2^2 \ge \frac{sm_1}{8}\,\gamma^2(\sigma \wedge a)^2 = \frac{\gamma^2(\sigma \wedge a)^2\, s}{8\, m_2}\, m_1 m_2 \quad (60)$$
and, for any $A \in \mathcal{A}^0$, the Kullback-Leibler divergence $K(\mathbb{P}_0, \mathbb{P}_A)$ between $\mathbb{P}_0$ and $\mathbb{P}_A$ satisfies
$$K(\mathbb{P}_0, \mathbb{P}_A) = \frac{|\Omega|}{2\sigma^2}\,\gamma^2(\sigma \wedge a)^2 \le \frac{\gamma^2 m_1 s}{2},$$
which implies that condition (58) is satisfied for any $\alpha > 0$ if $\gamma > 0$ is chosen as a sufficiently small numerical constant depending on $\alpha$. Now, an application of Theorem 2.5 in [25] implies
$$\inf_{(\hat L, \hat S)}\ \sup_{(L_0, S_0) \in \mathcal{A}_{GS}(r,s,a)} \mathbb{P}_{A_0}\left( \frac{\|\hat L - L_0\|_2^2}{m_1 m_2} + \frac{\|\hat S - S_0\|_2^2}{m_1 m_2} > C(\sigma \wedge a)^2\,\frac{s}{m_2} \right) \ge \beta \quad (61)$$


for some absolute constant $\beta \in (0,1)$. Inequalities (55), (59) and (61) imply the statement of Theorem 2.

The proof of Theorem 3 follows the same arguments as that of Theorem 2 and is therefore omitted here. The sparse component is chosen in the following set:
$$\mathcal{S} = \Big\{ S = (s_{ij}) \in \mathbb{R}^{m_1 \times m_2} :\ s_{ij} \in \big\{0,\ \gamma(\sigma \wedge a)\big\},\ \forall\, 1 \le i \le m_1,\ \lfloor m_2/2 \rfloor + 1 \le j \le m_2 \Big\}.$$

C Control on the stochastic terms

Part (i) is proven in Lemma 5 and Lemma 6 in [16]. We start by proving (ii). For the sake of brevity, we set $X_i(j,k) = \langle X_i,\, e_j(m_1) e_k(m_2)^\top \rangle$. By definition of $\Sigma$ and $\|\cdot\|_{2,\infty}$, we have
$$\|\Sigma\|_{2,\infty}^2 = \max_{1 \le k \le m_2} \sum_{j=1}^{m_1} \Big( \frac{1}{N} \sum_{i \in \Omega} \xi_i X_i(j,k) \Big)^2.$$

For any fixed $k$, we have
$$\sum_{j=1}^{m_1} \Big( \frac{1}{N} \sum_{i \in \Omega} \xi_i X_i(j,k) \Big)^2 = \frac{1}{N^2} \sum_{i_1, i_2 \in \Omega} \xi_{i_1} \xi_{i_2} \sum_{j=1}^{m_1} X_{i_1}(j,k)\, X_{i_2}(j,k) = \Xi^\top A_k \Xi, \quad (62)$$
where $\Xi = (\xi_1, \dots, \xi_n)^\top$ and $A_k \in \mathbb{R}^{|\Omega| \times |\Omega|}$ has entries
$$a_{i_1 i_2}(k) = \frac{1}{N^2} \sum_{j=1}^{m_1} X_{i_1}(j,k)\, X_{i_2}(j,k).$$

We freeze the $X_i$'s and apply the version of the Hanson-Wright inequality in [24] to get that there exists a numerical constant $C$ such that, with probability at least $1 - e^{-t}$,
$$\big| \Xi^\top A_k \Xi - \mathbb{E}[\Xi^\top A_k \Xi \mid X_i] \big| \le C \sigma^2 \big( \|A_k\|_2 \sqrt{t} + \|A_k\|\, t \big). \quad (63)$$
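The shape of the bound (63) can be checked by a small Monte Carlo experiment: for a fixed matrix $A$ and a standard Gaussian vector $\Xi$, the tail of $|\Xi^\top A \Xi - \mathbb{E}\,\Xi^\top A \Xi|$ at the threshold $C(\|A\|_2\sqrt{t} + \|A\|t)$ should sit below $e^{-t}$. The sizes and the value $C = 2$ below are illustrative and are not the constants of [24].

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.normal(size=(d, d)) / d             # a fixed ("frozen") matrix
fro = np.linalg.norm(A, "fro")              # ||A||_2 (Frobenius norm)
op = np.linalg.norm(A, 2)                   # ||A||   (operator norm)

n_mc = 100000
Xi = rng.normal(size=(n_mc, d))             # rows: independent N(0, I_d) draws
quad = np.einsum("ni,ij,nj->n", Xi, A, Xi)  # Xi^T A Xi for each draw
dev = np.abs(quad - np.trace(A))            # E[Xi^T A Xi] = tr(A) when sigma = 1

for t in [1.0, 2.0, 4.0]:
    thr = 2.0 * (fro * np.sqrt(t) + op * t)  # Hanson-Wright threshold, C = 2
    assert (dev > thr).mean() <= np.exp(-t)
```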

Next, we note that
$$\|A_k\|_2^2 = \sum_{i_1, i_2} a_{i_1 i_2}^2(k) \le \frac{1}{N^4} \sum_{i_1, i_2} \Big( \sum_{j=1}^{m_1} X_{i_1}^2(j,k) \Big) \Big( \sum_{j=1}^{m_1} X_{i_2}^2(j,k) \Big) \le \frac{1}{N^4} \Big( \sum_{i_1} \sum_{j=1}^{m_1} X_{i_1}^2(j,k) \Big)^2 = \Big( \frac{1}{N^2} \sum_{i_1} \sum_{j=1}^{m_1} X_{i_1}(j,k) \Big)^2,$$
where we have used the Cauchy-Schwarz inequality in the first line and $X_i^2(j,k) = X_i(j,k)$.


Note that $Z_i(k) := \sum_{j=1}^{m_1} X_i(j,k)$ follows a Bernoulli distribution with parameter $\pi_{\cdot k}$ and, consequently, $Z(k) = \sum_{i=1}^{|\Omega|} Z_i(k)$ follows a Binomial distribution $\mathcal{B}(|\Omega|, \pi_{\cdot k})$. We apply the Bernstein inequality (see, for instance, Proposition 2.9 in [20]) to get, for any $t > 0$,
$$\mathbb{P}\Big( |Z(k) - \mathbb{E}[Z(k)]| \ge 2\sqrt{|\Omega| \pi_{\cdot k}\, t} + t \Big) \le 2 e^{-t}.$$
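A quick simulation (with illustrative parameters) of the Bernstein bound just used: for $Z \sim \mathcal{B}(|\Omega|, \pi)$, the event $|Z - \mathbb{E} Z| \ge 2\sqrt{|\Omega|\pi t} + t$ should occur with frequency at most $2e^{-t}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, pi_k = 500, 0.05                  # |Omega| and pi_{.k} (illustrative)
Z = rng.binomial(n_obs, pi_k, size=200000)

for t in [1.0, 2.0, 4.0]:
    thr = 2.0 * np.sqrt(n_obs * pi_k * t) + t
    freq = (np.abs(Z - n_obs * pi_k) >= thr).mean()
    assert freq <= 2.0 * np.exp(-t)      # Bernstein tail holds empirically
```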

Consequently, we get with probability at least $1 - 2e^{-t}$ that
$$\|A_k\|_2^2 \le \left( \frac{|\Omega| \pi_{\cdot k} + 2\sqrt{|\Omega| \pi_{\cdot k}\, t} + t}{N^2} \right)^2$$
and, using $\|A_k\| \le \|A_k\|_2$, that
$$\|A_k\| \le \frac{|\Omega| \pi_{\cdot k} + 2\sqrt{|\Omega| \pi_{\cdot k}\, t} + t}{N^2}.$$
Note also that
$$\mathbb{E}[\Xi^\top A_k \Xi \mid X_i] = \frac{\sigma^2}{N^2}\, Z(k).$$

Combining the last three displays with (63) we get, up to a rescaling of the constants, with probability at least $1 - e^{-t}$, that
$$\sum_{j=1}^{m_1} \Big( \frac{1}{N} \sum_{i \in \Omega} \xi_i X_i(j,k) \Big)^2 \le \frac{C \sigma^2}{N^2} \Big( |\Omega| \pi_{\cdot k} + 2\sqrt{|\Omega| \pi_{\cdot k}\, t} + t \Big) \big( 1 + \sqrt{t} + t \big).$$

Replacing $t$ by $t + \log m_2$ in the above display and using a union bound gives, with probability at least $1 - e^{-t}$,
$$\|\Sigma\|_{2,\infty} \le \frac{C\sigma}{N} \Big( |\Omega| \pi_{\cdot k} + 2\sqrt{|\Omega| \pi_{\cdot k} (t + \log m_2)} + (t + \log m_2) \Big)^{1/2} \big( 1 + \sqrt{t + \log m_2} + t + \log m_2 \big)^{1/2}$$
$$\le \frac{C\sigma}{N} \Big( \sqrt{|\Omega| \pi_{\cdot k}} + \sqrt{t + \log m_2} \Big) \Big( 1 + \sqrt{t + \log m_2} \Big).$$
Assuming that $\log m_2 \ge 1$, we get with probability at least $1 - e^{-t}$ that
$$\|\Sigma\|_{2,\infty} \le \frac{C\sigma}{N} \Big( \sqrt{|\Omega| \pi_{\cdot k} (t + \log m_2)} + (t + \log m_2) \Big).$$

Using (14), we get that there exists a numerical constant $C > 0$ such that, with probability at least $1 - e^{-t}$,
$$\|\Sigma\|_{2,\infty} \le \frac{C\sigma}{N} \left( \gamma^{1/2} \sqrt{\frac{n(t + \log m_2)}{m_2}} + (t + \log m_2) \right).$$
We use Lemma 17 to obtain the bound on $\mathbb{E}\|\Sigma\|_{2,\infty}$.


The proof of (iii) for $\Sigma_R$ is exactly the same as that for $\Sigma$. We just need to replace $\xi_i$ by $\epsilon_i$, $\sigma$ by $1$, and $N$ by $n$.

We now turn to the proof of (iv) and establish the bound on $\|W\|_{2,\infty}$. We have
$$\|W\|_{2,\infty}^2 = \max_{1 \le k \le m_2} \sum_{j=1}^{m_1} \Big( \frac{1}{N} \sum_{i \in \Omega} X_i(j,k) \Big)^2.$$

For any fixed $k$, we have
$$\sum_{j=1}^{m_1} \Big( \frac{1}{N} \sum_{i \in \Omega} X_i(j,k) \Big)^2 = \frac{1}{N^2} \sum_{i \in \Omega} \sum_{j=1}^{m_1} X_i^2(j,k) + \frac{1}{N^2} \sum_{i_1 \ne i_2} \sum_{j=1}^{m_1} X_{i_1}(j,k)\, X_{i_2}(j,k).$$

We study the first term on the right-hand side of the above display. Note that
$$\frac{1}{N^2} \sum_{i \in \Omega} \sum_{j=1}^{m_1} X_i^2(j,k) = \frac{Z(k)}{N^2}.$$
Using the concentration on $Z(k)$ obtained above, we get with probability at least $1 - e^{-t}$ that
$$\frac{1}{N^2} \sum_{i \in \Omega} \sum_{j=1}^{m_1} X_i^2(j,k) \le \frac{|\Omega| \pi_{\cdot k}}{N^2} + \frac{2\sqrt{|\Omega| \pi_{\cdot k}\, t}}{N^2} + \frac{t}{N^2}. \quad (64)$$

We now study the U-statistic of order 2,
$$U_2 = \frac{1}{N^2} \sum_{i_1 \ne i_2} \sum_{j=1}^{m_1} \big[ X_{i_1}(j,k)\, X_{i_2}(j,k) - \pi_{j,k}^2 \big].$$
We will use a Bernstein-type concentration inequality for U-statistics. To this end, we set $X_i(\cdot,k) = (X_i(1,k), \dots, X_i(m_1,k))^\top$ and
$$h\big(X_{i_1}(\cdot,k),\, X_{i_2}(\cdot,k)\big) = \sum_{j=1}^{m_1} \big[ X_{i_1}(j,k)\, X_{i_2}(j,k) - \pi_{j,k}^2 \big].$$
Set $e_0(m_1) = 0_{m_1}$, the zero vector in $\mathbb{R}^{m_1}$. Note that $X_i(\cdot,k)$ takes values in $\{e_j(m_1),\ 0 \le j \le m_1\}$. For any function $g : \{e_j(m_1),\ 0 \le j \le m_1\}^2 \to \mathbb{R}$, we denote $\|g\|_{L_\infty} = \max_{0 \le j_1, j_2 \le m_1} |g(e_{j_1}(m_1), e_{j_2}(m_1))|$.


We will need the following quantities to control the tail behavior of $U_2$:
$$A = \|h\|_{L_\infty}, \qquad B^2 = \max\left\{ \Big\| \sum_{i_1} \mathbb{E}\, h^2\big(X_{i_1}(\cdot,k), \cdot\big) \Big\|_{L_\infty},\ \Big\| \sum_{i_2} \mathbb{E}\, h^2\big(\cdot, X_{i_2}(\cdot,k)\big) \Big\|_{L_\infty} \right\},$$
$$C = \sum_{i_1 \ne i_2} \mathbb{E}\big[ h^2\big(X_{i_1}(\cdot,k),\, X'_{i_2}(\cdot,k)\big) \big]$$
and
$$D = \sup\left\{ \mathbb{E} \sum_{i_1 \ne i_2} h\big[X_{i_1}(\cdot,k),\, X'_{i_2}(\cdot,k)\big]\, f_{i_1}\big[X_{i_1}(\cdot,k)\big]\, g_{i_2}\big[X'_{i_2}(\cdot,k)\big] :\ \mathbb{E} \sum_{i_1} f_{i_1}^2\big(X_{i_1}(\cdot,k)\big) \le 1,\ \mathbb{E} \sum_{i_2} g_{i_2}^2\big(X'_{i_2}(\cdot,k)\big) \le 1 \right\},$$
where the $X'_i(\cdot,k)$ are independent replications of $X_i(\cdot,k)$ and $f_{i_1}, g_{i_2} : \mathbb{R}^{m_1} \to \mathbb{R}$.

We now find the above quantities in our particular setting. It is not hard to see that $A = \max\big\{\pi^{(2)}_{\cdot k},\ 1 - \pi^{(2)}_{\cdot k}\big\} \le 1$. We also have that

$$\begin{aligned}
C &= \sum_{i_1 \ne i_2} \left\{ \mathbb{E}\Big[ \big\langle X_{i_1}(\cdot,k),\, X'_{i_2}(\cdot,k) \big\rangle^2 \Big] - \Big( \sum_{j=1}^{m_1} \pi_{j,k}^2 \Big)^2 \right\} \\
&= |\Omega|(|\Omega| - 1) \left\{ \mathbb{E}\Big[ \big\langle X_{i_1}(\cdot,k),\, X'_{i_2}(\cdot,k) \big\rangle \Big] - \Big( \sum_{j=1}^{m_1} \pi_{j,k}^2 \Big)^2 \right\} \\
&= |\Omega|(|\Omega| - 1) \left\{ \sum_{j=1}^{m_1} \pi_{j,k}^2 - \Big( \sum_{j=1}^{m_1} \pi_{j,k}^2 \Big)^2 \right\} \le |\Omega|(|\Omega| - 1)\, \pi^{(2)}_{\cdot k},
\end{aligned}$$
where we have used in the second line that $\langle X_{i_1}(\cdot,k), X'_{i_2}(\cdot,k) \rangle^2 = \langle X_{i_1}(\cdot,k), X'_{i_2}(\cdot,k) \rangle$, since $\langle X_{i_1}(\cdot,k), X'_{i_2}(\cdot,k) \rangle$ takes values in $\{0, 1\}$.

We now derive a bound on $D$. By concavity of $\sqrt{\cdot}$, we get
$$\sum_i \mathbb{E}^{1/2}\big[ f_i^2(X_i(\cdot,k)) \big] \le |\Omega|^{1/2} \sqrt{\mathbb{E}\Big[ \sum_i f_i^2(X_i(\cdot,k)) \Big]} \le |\Omega|^{1/2},$$
where we used $\mathbb{E}\big[ \sum_i f_i^2(X_i(\cdot,k)) \big] \le 1$. Thus, the Cauchy-Schwarz inequality


implies
$$\begin{aligned}
D &\le \sum_{i_1 \ne i_2} \mathbb{E}^{1/2}\big[ h^2(X_{i_1}, X'_{i_2}) \big]\, \mathbb{E}^{1/2}\big[ f_{i_1}^2(X_{i_1}(\cdot,k)) \big]\, \mathbb{E}^{1/2}\big[ g_{i_2}^2(X'_{i_2}(\cdot,k)) \big] \\
&\le \max_{i_1 \ne i_2} \mathbb{E}^{1/2}\big[ h^2(X_{i_1}, X'_{i_2}) \big] \sum_{i_1, i_2} \mathbb{E}^{1/2}\big[ f_{i_1}^2(X_{i_1}(\cdot,k)) \big]\, \mathbb{E}^{1/2}\big[ g_{i_2}^2(X'_{i_2}(\cdot,k)) \big] \\
&\le \max_{i_1 \ne i_2} \mathbb{E}^{1/2}\big[ h^2(X_{i_1}, X'_{i_2}) \big]\, |\Omega| \le |\Omega| \Big( \sum_{j=1}^{m_1} \pi_{j,k}^2 \Big)^{1/2} = |\Omega| \big[ \pi^{(2)}_{\cdot k} \big]^{1/2},
\end{aligned}$$
where we have used $\mathbb{E}[h^2(X_{i_1}, X'_{i_2})] \le \sum_{j=1}^{m_1} \pi_{j,k}^2$ (which follows from a reasoning similar to that used to bound $C$).

Finally, we get a bound on $B$. Set $\pi_{0,k} = 1 - \pi_{\cdot k}$. Note first that

$$\Big\| \sum_{i_1} \mathbb{E}\, h^2\big(X_{i_1}(\cdot,k), \cdot\big) \Big\|_{L_\infty} = |\Omega| \max_{0 \le j' \le m_1} \sum_{j=0}^{m_1} h^2\big(e_j(m_1), e_{j'}(m_1)\big)\, \pi_{j,k} \le |\Omega| \big( \pi^{(2)}_{\cdot k} \big)^2 + |\Omega| \max_{1 \le j' \le m_1} \pi_{j',k}.$$
By symmetry, we obtain the same bound on $\big\| \sum_{i_2} \mathbb{E}\, h^2\big(\cdot, X_{i_2}(\cdot,k)\big) \big\|_{L_\infty}$. Thus we get
$$B \le |\Omega|^{1/2} \Big( \pi^{(2)}_{\cdot k} + \max_{1 \le j' \le m_1} \pi_{j',k}^{1/2} \Big).$$

Set $\bar U_2 = \sum_{i_1 \ne i_2} h\big(X_{i_1}(\cdot,k),\, X_{i_2}(\cdot,k)\big)$. We apply a decoupling argument (see, for instance, Theorem 3.4.1, page 125 in [10]) to get that there exists a constant $C > 0$ such that, for any $u > 0$,
$$\mathbb{P}\left( \sum_{i_1 \ne i_2} h\big(X_{i_1}(\cdot,k),\, X_{i_2}(\cdot,k)\big) \ge u \right) \le C\, \mathbb{P}\left( \sum_{i_1 \ne i_2} h\big(X_{i_1}(\cdot,k),\, X'_{i_2}(\cdot,k)\big) \ge u/C \right),$$
where $X'_i(\cdot,k)$ is an independent copy of $X_i(\cdot,k)$ (i.e., $X'_i(\cdot,k) \stackrel{d}{=} X_i(\cdot,k)$).

Next, Theorem 3.3 in [12] gives, for any $u > 0$,
$$\mathbb{P}\left( \sum_{i_1 \ne i_2} h\big(X_{i_1}(\cdot,k),\, X'_{i_2}(\cdot,k)\big) \ge u \right) \le C \exp\left[ -\frac{1}{C} \min\left( \frac{u^2}{C^2},\ \frac{u}{D},\ \frac{u^{2/3}}{B^{2/3}},\ \frac{u^{1/2}}{A^{1/2}} \right) \right]$$
for some absolute constant $C > 0$ (not to be confused with the quantity $C$ defined above).


Combining the previous display with our controls on $A$, $B$, $C$, $D$, we get for any $t > 0$, with probability at least $1 - 2e^{-t}$, that
$$\begin{aligned}
\left| \frac{1}{N^2} \sum_{i_1 \ne i_2} \sum_{j=1}^{m_1} X_{i_1}(j,k)\, X_{i_2}(j,k) \right| &\le \frac{|\Omega|(|\Omega| - 1)}{N^2} \sum_{j=1}^{m_1} \pi_{j,k}^2 + \frac{C}{N^2} \big( C t^{1/2} + D t + B t^{3/2} + A t^2 \big) \\
&\le \frac{|\Omega|(|\Omega| - 1)}{N^2} \sum_{j=1}^{m_1} \pi_{j,k}^2 + C \Bigg[ \frac{|\Omega|(|\Omega| - 1)}{N^2}\, \pi^{(2)}_{\cdot k}\, t^{1/2} + \frac{|\Omega|}{N^2} \Big( \sum_{j=1}^{m_1} \pi_{j,k}^2 \Big)^{1/2} t \\
&\qquad + \frac{|\Omega|^{1/2}}{N^2} \Big( \pi^{(2)}_{\cdot k} + \max_{1 \le j' \le m_1} \pi_{j',k}^{1/2} \Big) t^{3/2} + \frac{t^2}{N^2} \Bigg],
\end{aligned}$$
where $C > 0$ is a numerical constant.

Combining the previous display with (64), we get for any $t > 0$, with probability at least $1 - 3e^{-t}$, that

$$\begin{aligned}
\sum_{j=1}^{m_1} \Big( \frac{1}{N} \sum_{i \in \Omega} X_i(j,k) \Big)^2 &\le \frac{|\Omega|(|\Omega| - 1)}{N^2}\, \pi^{(2)}_{\cdot k} + C \Bigg[ \left( \frac{|\Omega|(|\Omega| - 1)}{N^2}\, \pi^{(2)}_{\cdot k} + \frac{2\sqrt{|\Omega| \pi_{\cdot k}}}{N^2} \right) t^{1/2} + \frac{|\Omega|}{N^2}\, \pi_{\cdot k} \\
&\qquad + \left( \frac{|\Omega|}{N^2} \big( \pi^{(2)}_{\cdot k} \big)^{1/2} + \frac{1}{N^2} \right) t + \frac{|\Omega|^{1/2}}{N^2} \Big( \pi^{(2)}_{\cdot k} + \max_{1 \le j' \le m_1} \pi_{j',k}^{1/2} \Big) t^{3/2} + \frac{t^2}{N^2} \Bigg].
\end{aligned}$$

Set $\pi_{\max} = \max_{1 \le k \le m_2} \pi_{\cdot k}$ and $\pi^{(2)}_{\max} = \max_{1 \le k \le m_2} \pi^{(2)}_{\cdot k}$. Using a union bound argument and up to a rescaling of the constants, we get with probability at least $1 - e^{-t}$ that
$$\begin{aligned}
\|W\|_{2,\infty}^2 &\le \frac{|\Omega|(|\Omega| - 1)}{N^2}\, \pi^{(2)}_{\max} + C \Bigg[ \left( \frac{|\Omega|(|\Omega| - 1)}{N^2}\, \pi^{(2)}_{\max} + \frac{2\sqrt{|\Omega| \pi_{\max}}}{N^2} \right) (t + \log m_2)^{1/2} \\
&\qquad + \frac{|\Omega|}{N^2}\, \pi_{\max} + \frac{|\Omega|}{N^2} \big( \pi^{(2)}_{\max} \big)^{1/2} (t + \log m_2) \\
&\qquad + \frac{|\Omega|^{1/2}}{N^2} \Big( \pi^{(2)}_{\max} + \max_{j,k} \pi_{j,k}^{1/2} \Big) (t + \log m_2)^{3/2} + \frac{(t + \log m_2)^2}{N^2} \Bigg].
\end{aligned}$$

Recall that $|\Omega| = n$ and $æ = N/n$. Assumption 5 and $n \le |I|$ imply that there exists a numerical constant $C > 0$ such that, with probability at least $1 - e^{-t}$,
$$\|W\|_{2,\infty}^2 \le C \left( \frac{\gamma^2}{æ N m_2} \left( \sqrt{t + \log m_2} + (t + \log m_2)\, \frac{\sqrt{m_2}}{n} \right) + \frac{(t + \log m_2)^2}{N^2} \right),$$
where we have used that $\pi_{j,k} \le \pi_{\cdot k} \le \sqrt{2}\gamma/m_2$. Using again Lemma 17, we get
$$\mathbb{E}\|W\|_{2,\infty} \le C \left( \frac{\gamma \log^{1/4}(m_2)}{\sqrt{æ N m_2}} \left( 1 + \frac{\sqrt{m_2 \log m_2}}{n} \right)^{1/2} + \frac{\log m_2}{N} \right),$$
where $C > 0$ is an absolute constant, possibly different from the one in the previous display.


D Proof of Lemma 10

Recall that we set $X_i(j,k) = \langle X_i,\, e_j(m_1) e_k(m_2)^\top \rangle$. We have
$$\|\Sigma\|_\infty = \max_{1 \le j \le m_1,\ 1 \le k \le m_2} \left| \frac{1}{N} \sum_{i \in \Omega} \xi_i X_i(j,k) \right|.$$

For any fixed $(j,k)$, we have, in view of Assumptions 3 and 9, that
$$\sum_{i \in \Omega} \mathbb{E}\left[ \frac{1}{N^2}\, \xi_i^2 X_i^2(j,k) \right] \le \frac{|\Omega|\, \sigma^2 \mu_1}{N^2 m_1 m_2} =: v,$$
and
$$\sum_{i \in \Omega} \mathbb{E}\left[ \frac{1}{N^k}\, |\xi_i|^k X_i^k(j,k) \right] \le \frac{k!}{2}\, v \left( \frac{\sigma}{cN} \right)^{k-2},$$
where we have used the fact that, under Assumption 3, the $\psi_2$ Orlicz norm $\|\xi\|_{\psi_2}$ satisfies $\|\xi\|_{\psi_2} \le c\sigma$, together with the well-known relation
$$\mathbb{E}[|\xi|^k] \le k\, \Gamma(k/2)\, \|\xi\|_{\psi_2}^k, \quad \forall\, k \ge 1.$$
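The moment relation in the last display (reconstructed here in the form $\mathbb{E}|\xi|^k \le k\,\Gamma(k/2)\,\|\xi\|_{\psi_2}^k$) can be checked directly for a standard Gaussian, whose $\psi_2$ norm under the convention $\|\xi\|_{\psi_2} = \inf\{c > 0 : \mathbb{E}\, e^{\xi^2/c^2} \le 2\}$ equals $\sqrt{8/3}$; both the convention and the sketch below are illustrative.

```python
from math import gamma, pi, sqrt

# For xi ~ N(0,1), E exp(xi^2/c^2) = (1 - 2/c^2)^(-1/2) for c^2 > 2,
# so the psi_2 norm (convention: E exp(xi^2/c^2) <= 2) is sqrt(8/3).
psi2 = sqrt(8.0 / 3.0)

for k in range(1, 11):
    # Exact absolute moment of N(0,1): E|xi|^k = 2^{k/2} Gamma((k+1)/2) / sqrt(pi).
    moment = 2 ** (k / 2) * gamma((k + 1) / 2) / sqrt(pi)
    assert moment <= k * gamma(k / 2) * psi2 ** k
```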

We can apply Proposition 2.9, page 24 in [20] to get, for any $t > 0$,
$$\mathbb{P}\left( \left| \frac{1}{N} \sum_{i \in \Omega} \xi_i X_i(j,k) \right| > \sqrt{\frac{2\mu_1 \sigma^2 t}{æ N m_1 m_2}} + \frac{\sigma t}{cN} \right) \le 2 e^{-t}.$$

Replacing $t$ by $t + \log(m_1 m_2)$ and using a union bound, we obtain
$$\mathbb{P}\left( \left\| \frac{1}{N} \sum_{i \in \Omega} \xi_i X_i \right\|_\infty > \sqrt{\frac{2\mu_1 \sigma^2 (t + \log(m_1 m_2))}{æ N m_1 m_2}} + \frac{\sigma (t + \log(m_1 m_2))}{cN} \right) \le 2 e^{-t}.$$
The bound in expectation follows from Lemma 17. The bounds on $\|\Sigma_R\|_\infty$ and $\mathbb{E}\|\Sigma_R\|_\infty$ follow from the same argument. By a similar, and actually simpler, argument we also get that
$$\mathbb{P}\left( \|W - \mathbb{E}[W]\|_\infty > \sqrt{\frac{2\mu_1 (t + \log(m_1 m_2))}{æ N m_1 m_2}} + \frac{t + \log(m_1 m_2)}{N} \right) \le 2 e^{-t},$$
and Assumption 9 implies that $\|\mathbb{E}[W]\|_\infty \le \frac{\mu_1}{æ m_1 m_2}$.

E Technical Lemma

We need the following technical lemma.


Lemma 17. Let $Y$ be a non-negative random variable. Let there exist $A \ge 0$ and $a_j > 0$, $\alpha_j > 0$ for any $1 \le j \le m$ such that
$$\mathbb{P}\Big( Y > A + \sum_{j=1}^m a_j t^{\alpha_j} \Big) \le e^{-t}, \quad \forall\, t > 0.$$
Then we have
$$\mathbb{E}[Y] \le A + \sum_{j=1}^m a_j \alpha_j \Gamma(\alpha_j),$$
where $\Gamma(\cdot)$ is the Gamma function.

Proof. We have
$$\mathbb{E}[Y] = \int_0^\infty \mathbb{P}(Y > t)\, dt = \int_0^A \mathbb{P}(Y > t)\, dt + \int_A^\infty \mathbb{P}(Y > t)\, dt \le A + \int_0^\infty \mathbb{P}(Y > A + u)\, du.$$
Next we use the change of variable
$$u = \sum_{j=1}^m a_j v^{\alpha_j},$$
which is a one-to-one transformation of $\mathbb{R}_+$ onto itself under our assumptions on the $a_j$, $\alpha_j$. Thus we get
$$\int_0^\infty \mathbb{P}(Y > A + u)\, du = \int_0^\infty \mathbb{P}\Big( Y > A + \sum_{j=1}^m a_j v^{\alpha_j} \Big) \Big( \sum_{j=1}^m a_j \alpha_j v^{\alpha_j - 1} \Big) dv \le \int_0^\infty \Big( \sum_{j=1}^m a_j \alpha_j v^{\alpha_j - 1} \Big) e^{-v}\, dv \le \sum_{j=1}^m a_j \alpha_j \Gamma(\alpha_j).$$
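Lemma 17 is sharp in the following sense, which also gives a quick numerical check: if $E \sim \mathrm{Exp}(1)$ and $Y = A + \sum_j a_j E^{\alpha_j}$, then $\mathbb{P}(Y > A + \sum_j a_j t^{\alpha_j}) = e^{-t}$ because $t \mapsto A + \sum_j a_j t^{\alpha_j}$ is increasing, while $\mathbb{E}[Y] = A + \sum_j a_j \Gamma(\alpha_j + 1) = A + \sum_j a_j \alpha_j \Gamma(\alpha_j)$ attains the bound. The values of $A$, $a_j$, $\alpha_j$ below are illustrative.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(4)
A, a, alpha = 0.5, np.array([1.0, 2.0]), np.array([0.5, 2.0])  # illustrative

# Y = A + sum_j a_j E^{alpha_j} with a single E ~ Exp(1) meets the tail
# hypothesis of Lemma 17 with equality.
E = rng.exponential(size=2_000_000)
Y = A + (a[None, :] * E[:, None] ** alpha[None, :]).sum(axis=1)

bound = A + sum(aj * alj * gamma(alj) for aj, alj in zip(a, alpha))
assert abs(Y.mean() - bound) < 0.05   # the lemma's bound is attained here
```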

References

[1] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Noisy matrix decomposition via convex relaxation: optimal rates in high dimensions. Ann. Statist., 40(2):1171–1197, 2012.

[2] F.L. Bauer, J. Stoer, and C. Witzgall. Absolute and monotonic norms. Numerische Mathematik, 3:257–264, 1961.


[3] P. Buehlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

[4] T. T. Cai and W. Zhou. Matrix completion via max-norm constrained optimization. URL = http://dx.doi.org/10.1007/978-1-4612-0537-1, 2013.

[5] E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(1):1–37, 2009.

[6] Emmanuel J. Candes and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5):2053–2080, 2010.

[7] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim., 21(2):572–596, 2011.

[8] Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion with corrupted columns. ICML, pages 873–880, 2011.

[9] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Constantine Caramanis. Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory, 59(7):4324–4337, 2013.

[10] Víctor H. de la Peña and Evarist Giné. Decoupling. Probability and its Applications (New York). Springer-Verlag, New York, 1999. From dependence to independence. Randomly stopped processes. U-statistics and processes. Martingales and beyond.

[11] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrix reconstruction. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.

[12] Evarist Giné, Rafał Latała, and Joel Zinn. Exponential and moment inequalities for U-statistics. In High Dimensional Probability, II (Seattle, WA, 1999), volume 47 of Progr. Probab., pages 13–38. Birkhäuser Boston, Boston, MA, 2000.

[13] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, 2011.

[14] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.

[15] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. J. Mach. Learn. Res., 11:2057–2078, 2010.

[16] Olga Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.


[17] Vladimir Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg, 2011. Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d'Été de Probabilités de Saint-Flour.

[18] Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5):2302–2329, 2011.

[19] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1), 2013.

[20] Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. With a foreword by Jean Picard.

[21] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.

[22] Ben Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.

[23] A. Rohde and A. Tsybakov. Estimation of high-dimensional low-rank matrices. Annals of Statistics, 39(2):887–930, 2011.

[24] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab., 18:no. 82, 9 pp., 2013.

[25] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original. Translated by Vladimir Zaiats.

[26] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Trans. Inform. Theory, 58(5):3047–3064, 2012.
