
Two-sample testing of high-dimensional linear regression coefficients via complementary sketching

Fengnan Gao* and Tengyao Wang†

(November 27, 2020)

Abstract

We introduce a new method for two-sample testing of high-dimensional linear regression coefficients without assuming that those coefficients are individually estimable. The procedure works by first projecting the matrices of covariates and response vectors along directions that are complementary in sign in a subset of the coordinates, a process which we call 'complementary sketching'. The resulting projected covariates and responses are aggregated to form two test statistics, which are shown to have essentially optimal asymptotic power under a Gaussian design when the difference between the two regression coefficients is sparse and dense respectively. Simulations confirm that our methods perform well in a broad class of settings.

1 Introduction

Two-sample testing problems are commonplace in statistical applications across different scientific fields, wherever researchers want to compare observations from different samples. In its most basic form, a two-sample Gaussian mean testing problem is formulated as follows: upon observing two samples X1, . . . , Xn1 iid ~ N(μ1, σ^2) and Y1, . . . , Yn2 iid ~ N(μ2, σ^2), we wish to test

H0 : μ1 = μ2 versus H1 : μ1 ≠ μ2. (1)

This leads to the introduction of the famous two-sample Student's t-test. In a slightly more involved form in the parametric setting, we observe X1, . . . , Xn1 iid ~ F_{θ1,γ1} and Y1, . . . , Yn2 iid ~ F_{θ2,γ2} and would like to test H0 : θ1 = θ2 versus H1 : θ1 ≠ θ2, where γ1 and γ2 are nuisance parameters.

Linear regression models have long been a staple of statistics. A two-sample testing problem in linear regression arises in the following classical setting: fix p ≪ min{n1, n2}; we observe an n1-dimensional response vector Y1 with an associated design

*Fudan University. Email: [email protected]
†University College London. Email: [email protected]


matrix X1 ∈ R^{n1×p} in the first sample, and an n2-dimensional response Y2 with design matrix X2 ∈ R^{n2×p} in the second sample. We assume that in both samples the responses are generated from standard linear models

Y1 = X1β1 + ε1,    Y2 = X2β2 + ε2, (2)

for some unknown regression coefficients β1, β2 ∈ R^p and independent homoscedastic noise vectors ε1 | (X1, X2) ~ N_{n1}(0, σ^2 I_{n1}) and ε2 | (X1, X2) ~ N_{n2}(0, σ^2 I_{n2}). The purpose is to test H0 : β1 = β2 versus H1 : β1 ≠ β2. Suppose that β̂ is the least squares estimate of β = β1 = β2 under the null hypothesis, and that β̂1, β̂2 are the least squares estimates of β1 and β2 respectively under the alternative hypothesis. Define the residual sums of squares as

RSS1 = ‖Y1 − X1β̂1‖_2^2 + ‖Y2 − X2β̂2‖_2^2,    RSS0 = ‖Y1 − X1β̂‖_2^2 + ‖Y2 − X2β̂‖_2^2. (3)

The classical generalised likelihood ratio test (Chow, 1960) compares the F-statistic

F = {(RSS0 − RSS1)/p} / {RSS1/(n1 + n2 − 2p)} ~ F_{p, n1+n2−2p} (4)

against upper quantiles of the F_{p, n1+n2−2p} distribution. It is well known that in the classical asymptotic regime, where p is fixed and n1, n2 → ∞, this generalised likelihood ratio test is asymptotically optimal.
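For concreteness, the F-statistic in (4) can be computed with plain least squares, as in the following sketch (our own illustration; the function name is ours, not from the paper):

```python
import numpy as np

def chow_f_statistic(X1, Y1, X2, Y2):
    """Classical F-statistic (4) for testing H0: beta1 = beta2 in the
    low-dimensional two-sample linear model (2), assuming p << min(n1, n2)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    # RSS1: residual sum of squares under the alternative (separate fits).
    b1, *_ = np.linalg.lstsq(X1, Y1, rcond=None)
    b2, *_ = np.linalg.lstsq(X2, Y2, rcond=None)
    rss1 = np.sum((Y1 - X1 @ b1) ** 2) + np.sum((Y2 - X2 @ b2) ** 2)
    # RSS0: residual sum of squares under the null (a single pooled fit).
    X = np.vstack([X1, X2])
    Y = np.concatenate([Y1, Y2])
    b0, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss0 = np.sum((Y - X @ b0) ** 2)
    # The result is compared against upper quantiles of F_{p, n1+n2-2p}.
    return ((rss0 - rss1) / p) / (rss1 / (n1 + n2 - 2 * p))
```

Under H0 the statistic hovers around 1, while a genuine difference between β1 and β2 inflates it; this classical recipe breaks down once p is comparable to n1 and n2, which is the regime addressed in the remainder of the paper.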

High-dimensional datasets are ubiquitous in the contemporary era of Big Data. As the dimension p of modern data in genetics, signal processing, econometrics and other fields is often comparable to the sample size n, the most significant challenge in high-dimensional data is that the fixed-p-large-n setup prevalent in classical statistical inference is no longer valid. Yet the principle remains that statistical inference is only possible when the sample size is sufficiently large relative to the true parameter size. Most advances in high-dimensional statistical inference so far have been made under some 'sparsity' condition, i.e. all but a small (often vanishing) fraction of the p-dimensional model parameters are zero. This assumption in effect reduces the parameter size to an estimable level, and it makes sense in many applications because often only a few covariates are really responsible for the response, though identifying these few covariates is still a nontrivial task. In the high-dimensional regression setting Y = Xβ + ε, where Y ∈ R^n, X ∈ R^{n×p} and β ∈ R^p with p, n → ∞ simultaneously, a common assumption is k log p/n → 0 with k = ‖β‖_0 := Σ_{j=1}^p 1{βj ≠ 0}. Thus k is the true parameter size, which vanishes relative to the sample size n, and log p is understood as the penalty to pay for not knowing where the k true parameters are.

Aiming to take a step towards understanding the fundamental aspects of two-sample hypothesis testing in high dimensions, this paper is primarily concerned with the following testing problem: we need to decide whether or not the responses in the two samples have different linear dependencies on the covariates. More specifically, under the same regression setting as in (2) with min{p, n} → ∞, we wish to test the global null hypothesis

H0 : β1 = β2 (5)

against the composite alternative

H1 : ‖β1 − β2‖_2 ≥ 2ρ, ‖β1 − β2‖_0 ≤ k. (6)

In other words, we assume that under the alternative hypothesis the difference between the two regression coefficients is a k-sparse vector with ℓ2 norm at least 2ρ (the additional factor of 2 here simplifies relevant statements under the reparametrisation we will introduce in Section 2). Throughout this paper, we do not assume sparsity of β1 or β2 under the alternative.

Classical F-tests no longer work well on the above testing problem, for the simple reason that it is not possible to obtain good estimates of the β's through naive least squares estimators, which are needed to form the RSS quantities in (3) measuring the model's goodness of fit. A standard way out is to impose some form of sparsity on both β1 and β2 to ensure that both quantities are estimable. To the best of our knowledge, this is the off-the-shelf approach taken in most of the literature; see, for instance, Städler and Mukherjee (2012); Xia, Cai and Cai (2015). Nevertheless, it is both more interesting and more relevant in applications to study the testing problem where neither β1 nor β2 is estimable but only β1 − β2 is sparse.

As an example of an application where such assumptions arise naturally, consider the area of differential networks. Here, researchers are interested in whether two biological networks formulated as Gaussian graphical models, such as 'brain connectivity' networks and gene–gene interaction networks (Xia, Cai and Cai, 2015; Charbonnier, Verzelen and Villers, 2015), differ between two subgroups of a population. Such complex networks are mostly high-dimensional in nature, in the sense that the number of nodes or features in the networks is large relative to the number of observations. These networks are often dense, as interactions within different brain regions or among genes are omnipresent; but because the two subpopulations are subject to roughly the same physiology, the differences between their networks are small, i.e. only a few edges differ from one network to the other. In this case of dense coefficients, a sparsity assumption on β1 and β2 makes no sense, and it is impossible to obtain reasonable estimates of either regression coefficient β1 or β2 when p is of the same magnitude as n. For this reason, any approach to detecting the difference between β1 and β2 that is built upon comparing estimates of β1 and β2 in some way fails. In fact, no inference on β1 or β2 is possible unless we make other stringent structural assumptions on the model. However, certain inference on the coefficient difference β1 − β2, such as testing the zero null against the sparse alternative, is feasible by exploiting the sparse difference between the networks, without strong additional assumptions.

1.1 Related works

The two-sample testing problem in its most general form is not well understood in high dimensions. Most of the existing literature has focused on testing the equality of means, namely the high-dimensional analogue of (1); see, e.g. Cai, Liu and Xia (2014); Chen, Li and Zhong (2019). Similarly to our setup, in mean testing problems we may allow for non-sparse means in each sample and test only for sparse differences between the two population means (Cai, Liu and Xia, 2014). The intuitive approach for testing equality of means is to eliminate the dense nuisance parameter by taking the difference of the two sample means, thus reducing the problem to a one-sample problem of testing a sparse mean against a global zero null, also known as the 'needle in a haystack' problem studied previously by e.g. Ingster (1997); Donoho and Jin (2004). Such a reduction, however, is more intricate in the regression problem, as a result of the different design matrices in the two samples.

The literature on two-sample testing in the high-dimensional regression setting is scarce. Städler and Mukherjee (2012), Xia, Cai and Cai (2018) and Xia, Cai and Sun (2020) have proposed methods that work under additional assumptions ensuring that both β1 and β2 can be consistently estimated. Charbonnier, Verzelen and Villers (2015) and Zhu and Bradic (2016) are the only existing works in the literature we are aware of that allow for non-sparse regression coefficients β1 and β2. Specifically, Charbonnier, Verzelen and Villers (2015) look at a sequence of possible supports of β1 and β2 on a Lasso-type solution path and then apply a variant of the classical F-test to the lower-dimensional problems restricted to these supports, with the test p-values adjusted by a Bonferroni correction. Zhu and Bradic (2016) (after some elementary transformation) use a Dantzig-type selector to obtain an estimate of (β1 + β2)/2, and then use it to construct a test statistic based on a specific moment condition satisfied under the null hypothesis. As both tests depend on the estimation of nuisance parameters, their power can be compromised if those nuisance parameters are dense.

1.2 Our contributions

Our contributions are four-fold.

1. We propose a novel method to solve the testing problem formulated in (5) and (6) for model (2). Through complementary sketching, a delicate linear transformation of both the designs and the responses, our method turns the testing problem with two different designs into one with a single design of dimension m × p, where m = n1 + n2 − p. After taking the difference of the two regression coefficients, the problem is reduced to testing whether the coefficient in the reduced one-sample regression is zero against sparse alternatives. The transformation is carefully chosen so that the error distribution in the reduced one-sample regression is homoscedastic. This paves the way for constructing test statistics from the transformed covariates and responses. Our method is easy to implement and does not involve any complications arising from solving computationally expensive optimisation problems. Moreover, when complementary sketching is combined with any method designed for one-sample problems, our proposal in effect supplies a novel class of testing and estimation procedures for the corresponding two-sample problems.

2. In the sparse regime, where the sparsity parameter k ~ p^α in the alternative (6) for any fixed α ∈ (0, 1/2), we show that the detection limit of our procedure, defined as the minimal ‖β1 − β2‖_2 necessary for asymptotic almost sure separation of the alternative from the null, is minimax optimal up to a multiplicative constant under a Gaussian design. More precisely, we show that in the asymptotic regime where n1, n2, p diverge at a fixed ratio, if ρ ≥ √(8k log p/(nκ1)), where κ1 is a constant depending on n1/n2 and p/m only, then our test has asymptotic power 1 almost surely. On the other hand, in the same asymptotic regime, if ρ ≤ √(c_α k log p/(nκ1)) for some c_α depending on α only, then almost surely no test has asymptotic size 0 and power 1.

3. Our results reveal the effective sample size of the two-sample testing problem. Here, by effective sample size, we mean the sample size of a corresponding one-sample testing problem (i.e. testing β = 0 in a linear model Y = Xβ + ε with rows of X following the same distribution as the rows of X1 and X2) that has an asymptotically equal detection limit; see the discussion after Theorem 5 for a detailed definition. At first glance, one might think that the effective sample size is m, the number of rows in the reduced design, which would suggest that the reduction to the one-sample problem has rendered the original two-sample structure irrelevant. On reflection, however, since an imbalance between the numbers of observations in X1 and X2 clearly makes testing more difficult, the effective sample size must also incorporate this effect. We see from the previous point that, uniformly for any α less than and bounded away from 1/2, the detection boundary is of order ρ^2 ≍ k log p/(nκ1), with the precise definition of κ1 given in Lemma 12. Writing n1/n2 = r and p/m = s, our results in the sparse case imply that the two-sample testing problem has the same order of detection limit as a one-sample problem with sample size nκ1 = m(r^{−1} + r + 2)^{−1}. We note that this effective sample size is proportional to m, is maximised for each fixed m when r = 1 (i.e. n1 = n2), and approaches m/n in the most imbalanced designs. This is in agreement with the intuition that testing is easiest when n1 = n2 and impossible when n1 and n2 are too imbalanced. Our study thus sheds light on the intrinsic difference between two-sample and one-sample testing problems and characterises the precise dependence of the difficulty of the two-sample problem on the sample size and dimensionality parameters.
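The effective sample size formula can be checked numerically; the following snippet (our own illustration) evaluates nκ1 = m(r^{−1} + r + 2)^{−1} and confirms that, for fixed m, it is maximised at r = 1, where it equals m/4:

```python
# Numerical check of the effective sample size n * kappa1 = m / (1/r + r + 2),
# where r = n1/n2 and m = n1 + n2 - p.
def effective_sample_size(m, r):
    return m / (1.0 / r + r + 2.0)

m = 120
values = {r: effective_sample_size(m, r) for r in (0.1, 0.5, 1.0, 2.0, 10.0)}
# Maximised at r = 1 (balanced samples), where it equals m / 4 = 30,
# decaying towards 0 as the design becomes arbitrarily imbalanced.
```

Note that the function is symmetric in r and 1/r, matching the intuition that it does not matter which of the two samples is the larger one.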

4. We observe a phase transition phenomenon in how the minimax detection limit depends on the sparsity parameter k. In addition to establishing the minimax rate optimal detection limit of our procedure in the sparse case k ≍ p^α for α ∈ [0, 1/2), we also prove that a modified version of our procedure, suited to denser signals, achieves the minimax optimal detection limit up to logarithmic factors in the dense regime k ≍ p^α for α ∈ (1/2, 1). However, the detection limit is of order ρ ≍ √(k log p/(nκ1)) in the sparse regime, but of order ρ ≍ p^{−1/4} up to logarithmic factors in the dense regime. Such a phase transition phenomenon is qualitatively similar to results previously reported for the one-sample testing problem (see, e.g. Ingster, Tsybakov and Verzelen, 2010; Arias-Castro, Candès and Plan, 2011).

1.3 Organisation of the paper

We describe our methodology in detail in Section 2 and establish its theoretical properties in Section 3. Numerical results illustrating the finite-sample performance of our proposed algorithm are given in Section 4. Proofs of our main results are deferred to Section 5, with ancillary results in Section 6.

1.4 Notation

For any positive integer n, we write [n] := {1, . . . , n}. For a vector v = (v1, . . . , vn)⊤ ∈ R^n, we define ‖v‖_0 := Σ_{i=1}^n 1{vi ≠ 0}, ‖v‖_∞ := max_{i∈[n]} |vi| and ‖v‖_q := (Σ_{i=1}^n |vi|^q)^{1/q} for any positive integer q, and let S^{n−1} := {v ∈ R^n : ‖v‖_2 = 1}. The support of a vector v is defined by supp(v) := {i ∈ [n] : vi ≠ 0}.

For n ≥ m, O^{n×m} denotes the space of n × m matrices with orthonormal columns. For a ∈ R^p, we define diag(a) to be the p × p diagonal matrix with diagonal entries filled with the elements of a, i.e. (diag(a))_{i,j} = 1{i=j} a_i. For A ∈ R^{p×p}, we define diag(A) to be the p × p diagonal matrix with diagonal entries coming from A, i.e. (diag(A))_{i,j} = 1{i=j} A_{i,j}. We also write tr(A) := Σ_{i∈[p]} A_{i,i}. For a symmetric matrix A ∈ R^{p×p} and k ∈ [p], the k-operator norm of A is defined by

‖A‖_{k,op} := sup_{v ∈ S^{p−1} : ‖v‖_0 ≤ k} |v⊤Av|.
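To make the definition concrete, here is a brute-force evaluation of ‖A‖_{k,op} (our own sketch; it relies on the fact that, for symmetric A and a fixed support S, the supremum of |v⊤Av| over unit vectors v supported on S is the spectral radius of A_{S,S}):

```python
import itertools
import numpy as np

def k_operator_norm(A, k):
    """Brute-force k-operator norm of a symmetric matrix A: maximise the
    spectral radius of A_{S,S} over all supports S of size k. Exponential
    in k, so only feasible for small p and k."""
    p = A.shape[0]
    best = 0.0
    for S in itertools.combinations(range(p), k):
        sub = A[np.ix_(S, S)]
        best = max(best, float(np.max(np.abs(np.linalg.eigvalsh(sub)))))
    return best
```

For example, ‖I_p‖_{k,op} = 1 for every k, while for a diagonal matrix the k-operator norm is the largest absolute diagonal entry for every k ≥ 1.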

    For any ๐‘† โŠ† [๐‘›], we write ๐‘ฃ๐‘† for the |๐‘†|-dimensional vector obtained by extracting co-ordinates of ๐‘ฃ in ๐‘† and ๐ด๐‘†,๐‘† the matrix obtained by extracting rows and columns of ๐ดindexed by ๐‘†.

    Given two sequences (๐‘Ž๐‘›)๐‘›โˆˆN and (๐‘๐‘›)๐‘›โˆˆN such that ๐‘๐‘› > 0 for all ๐‘›, we write ๐‘Ž๐‘› =๐’ช(๐‘๐‘›) if |๐‘Ž๐‘›| โ‰ค ๐ถ๐‘๐‘› for some constant ๐ถ. If the constant ๐ถ depends on some parameter๐‘ฅ, we write ๐‘Ž๐‘› = ๐’ช๐‘ฅ(๐‘๐‘›) instead. Also, ๐‘Ž๐‘› = ๐’ช(๐‘๐‘›) denotes ๐‘Ž๐‘›/๐‘๐‘› โ†’ 0.

2 Testing via complementary sketching

In this section, we describe our testing strategy. Since we are only interested in the difference between the regression coefficients of the two linear models, we reparametrise (2) with γ := (β1 + β2)/2 and θ := (β1 − β2)/2 to separate the nuisance parameter from the parameter of interest. Define

Θ_{p,k}(ρ) := {θ ∈ R^p : ‖θ‖_2 ≥ ρ and ‖θ‖_0 ≤ k}.

Under this new parametrisation, the null and alternative hypotheses can be equivalently formulated as

H0 : θ = 0 and H1 : θ ∈ Θ_{p,k}(ρ).

The parameter of interest θ is now k-sparse under the alternative hypothesis. However, its inference is confounded by the possibly dense nuisance parameter γ ∈ R^p. A natural idea, then, is to eliminate the nuisance parameter from the model. In the special design setting where X1 = X2 (in particular, n1 = n2), this can be achieved by considering the sparse regression model Y1 − Y2 = 2X1θ + (ε1 − ε2). While this only works in a special, idealised setting, it nevertheless motivates our general testing procedure.


To introduce our test, we first concatenate the design matrices and response vectors to form

X = (X1⊤, X2⊤)⊤ and Y = (Y1⊤, Y2⊤)⊤.

A key idea of our method is to project X and Y respectively along n − p pairs of directions that are complementary in sign in a subset of their coordinates, a process we call complementary sketching. Specifically, assume n1 + n2 > p, define n := n1 + n2 and m := n − p, and let A1 ∈ R^{n1×m} and A2 ∈ R^{n2×m} be chosen such that

A1⊤A1 + A2⊤A2 = I_m and A1⊤X1 + A2⊤X2 = 0. (7)

In other words, A := (A1⊤, A2⊤)⊤ is a matrix with orthonormal columns that are orthogonal to the column space of X. Such A1 and A2 exist since the orthogonal complement of the column space of X has dimension at least m. Define Z := A1⊤Y1 + A2⊤Y2 and W := A1⊤X1 − A2⊤X2. From the above construction, using (7) together with β1 = γ + θ and β2 = γ − θ, we have

Z = A1⊤X1β1 + A2⊤X2β2 + (A1⊤ε1 + A2⊤ε2) = Wθ + ξ, (8)

where ξ | W ~ N_m(0, σ^2 A⊤A) = N_m(0, σ^2 I_m). We note that, similarly to conventional sketching (see, e.g. Mahoney, 2011), the complementary sketching operation above synthesises m data points from the original n observations. However, unlike conventional sketching, where one projects the design X and the response Y by the same sketching matrix S ∈ R^{m×n} to obtain the sketched data (SX, SY), here we project X and Y along different directions to obtain (Ā⊤X, A⊤Y), where Ā := (A1⊤, −A2⊤)⊤ is complementary in sign to A in its second block. Moreover, the main purpose of conventional sketching is to trade statistical efficiency for computational speed by summarising the raw data with a smaller number of synthesised data points, whereas the main aim of our complementary sketching operation is to eliminate the nuisance parameter; surprisingly, as we will see in Section 3, essentially no statistical efficiency is lost through complementary sketching in this two-sample testing setting.

To summarise, after projecting X and Y via complementary sketching to obtain W and Z, we reduce the original two-sample testing problem to a one-sample problem with m observations, in which we test the global null θ = 0 against sparse alternatives using the data (W, Z). From here, we can construct test statistics as functions of W and Z, and we describe two different tests. The first testing procedure, detailed in Algorithm 1, computes the sum of squares of hard-thresholded inner products between the response Z and the standardised columns of the design matrix W in (8). We denote the output of Algorithm 1 with input X1, X2, Y1, Y2 and tuning parameters λ and τ by ψ^sparse_{λ,τ}(X1, X2, Y1, Y2). As we will see in Section 3, the choice of λ = 2σ√(log p) and τ = kσ^2 log p is suitable for testing against sparse alternatives when k ≤ p^{1/2}.

Algorithm 1: Pseudo-code for the complementary sketching-based test ψ^sparse_{λ,τ}.
Input: X1 ∈ R^{n1×p}, X2 ∈ R^{n2×p}, Y1 ∈ R^{n1}, Y2 ∈ R^{n2} satisfying n1 + n2 − p > 0, a hard threshold level λ ≥ 0, and a test threshold level τ > 0.
1 Set m ← n1 + n2 − p.
2 Form A ∈ O^{n×m} with columns orthogonal to the column space of (X1⊤, X2⊤)⊤.
3 Let A1 and A2 be the submatrices formed by the first n1 and the last n2 rows of A.
4 Set Z ← A1⊤Y1 + A2⊤Y2 and W ← A1⊤X1 − A2⊤X2.
5 Compute Q ← diag(W⊤W)^{−1/2} W⊤Z.
6 Compute the test statistic T := Σ_{j=1}^p Q_j^2 1{|Q_j| ≥ λ}.
7 Reject the null hypothesis if T ≥ τ.

In the dense case k > p^{1/2}, on the other hand, one option would be to choose λ = 0. However, it turns out to be difficult to set the test threshold level τ in this dense case using the known problem parameters. We therefore study instead the following second test: we apply Steps 1 to 4 of Algorithm 1 to obtain the vector Z, and then define

ψ^dense_η(X1, X2, Y1, Y2) := 1{‖Z‖_2^2 ≥ η},

for a suitable choice of threshold level η.

The computational complexity of both ψ^sparse_{λ,τ} and ψ^dense_η depends on Step 2 of Algorithm 1. In practice, we can form the projection matrix A as follows. We first generate an n × m matrix M with independent N(0, 1) entries, and then project the columns of M onto the orthogonal complement of the column space of X to obtain M̃ := (I_n − XX†)M, where X† is the Moore–Penrose pseudoinverse of X. Finally, we extract an orthonormal basis from the columns of M̃ via a QR decomposition M̃ = AR, where R is upper triangular and A is a (random) n × m matrix with orthonormal columns that can be used in Step 2 of Algorithm 1. The overall computational complexity of our tests is therefore of order O(n^2 p + nm^2). Finally, it is worth emphasising that, while the matrix A generated this way is random, our test statistics T = Σ_{j=1}^p Q_j^2 1{|Q_j| ≥ λ} and ‖Z‖_2^2 are in fact deterministic. To see this, we observe that both

W⊤Z = (A1⊤X1 − A2⊤X2)⊤(A1⊤Y1 + A2⊤Y2) = (X1⊤, −X2⊤) A A⊤ Y

and ‖Z‖_2^2 = Y⊤AA⊤Y depend on A only through AA⊤, which is determined by the column space of A. Moreover, by Lemma 9, the quantities (‖W_j‖_2^2)_{j∈[p]}, being the diagonal entries of W⊤W = 4X1⊤A1A1⊤X1, are also functions of X alone. This attests that both test statistics, and consequently our two tests, are deterministic in nature.

3 Theoretical analysis

We now turn to the analysis of the theoretical performance of ψ^sparse_{λ,τ} and ψ^dense_η. We consider both the size and the power of each test, as well as minimax lower bounds for the smallest detectable signal strength.

In addition to working under the regression model (2), we further assume the following conditions in our theoretical analysis.


(C1) All entries of X1 and X2 are independent standard normals.

(C2) The parameters n1, n2, p satisfy m = n1 + n2 − p > 0 and lie in the asymptotic regime where n1/n2 → r and p/m → s as n1, n2, p → ∞.

The main purpose of assuming (C1) is to ensure that, after standardising its columns to unit length, the matrix W computed in Step 4 of Algorithm 1 satisfies a restricted isometry condition with high probability, as stated in Proposition 2. In fact, as revealed in our proofs, even if the matrices X1 and X2 do not follow the Gaussian design, all results in this section remain true as long as W satisfies (9) and (11). The condition n1 + n2 − p > 0 in (C2) is necessary in this two-sample problem, since otherwise, for any prescribed value of Δ := β1 − β2, the system of equations with β1 as the unknown,

X1β1 = Y1 and X2β1 = Y2 − X2Δ,

has at least one solution whenever (X1⊤, X2⊤)⊤ has rank n. As a result, except in some pathological cases, we can always find β1, β2 ∈ R^p that fit the data perfectly with Y1 = X1β1 and Y2 = X2β2, which makes the testing problem impossible. Finally, we have carried out the proofs of our theoretical results with finite-sample arguments wherever possible. Nevertheless, owing to the lack of finite-sample bounds on the empirical spectral density of matrix-variate Beta distributions, all results in this section are presented in the asymptotic regime set out in Condition (C2). Under this condition, we are able to exploit existing results from random matrix theory to obtain a sharp dependence of the detection limit on s and r.

In what follows, we also make the simplifying assumption that the noise variance σ^2 is known. By replacing X1, X2, Y1, Y2 with X1/σ, X2/σ, Y1/σ and Y2/σ, we may further assume without loss of generality that σ^2 = 1, which we do for the rest of this section. In practice, if the noise variance is unknown, it can be replaced by a consistent estimator σ̂^2 (see, e.g. Reid, Tibshirani and Friedman, 2016, and the references therein).

3.1 Sparse case

We consider in this subsection the test ψ^sparse_{λ,τ}, which is suitable for distinguishing β1 and β2 that differ in a small number of coordinates; this setting exhibits the more subtle phenomena and is hence of most interest to us. Our first result states that, with a hard-thresholding level λ of order √(log p), the test has asymptotic size 0.

Theorem 1. If Condition (C2) holds and β1 = β2, then, with the choice of parameters τ > 0 and λ = √((4 + ε) log p) for any ε > 0, we have

ψ^sparse_{λ,τ}(X1, X2, Y1, Y2) → 0 almost surely.

The almost sure statements in Theorem 1 and subsequent results in this section are with respect to the randomness both in X = (X1⊤, X2⊤)⊤ and in ε = (ε1⊤, ε2⊤)⊤. However, a closer inspection of the proof of Theorem 1 shows that the statement remains true if we allow an arbitrary sequence of matrices X (indexed by p) and only consider almost sure convergence with respect to the distribution of ε.

The control of the asymptotic power of ψ^sparse_{λ,τ} is more involved. A key step in the argument is to show that W⊤W is suitably close to a multiple of the identity matrix. More precisely, in Proposition 2 below, we derive entrywise and k-operator norm controls of the Gram matrix of the design matrix sketch W.

    Proposition 2. Under Conditions (C1) and (C2), we further assume ๐‘˜ โˆˆ [๐‘] and let ๐‘Šbe defined as in Algorithm 1. Then with probability 1,

    max๐‘—โˆˆ[๐‘]

    โƒ’โƒ’โƒ’โƒ’โƒ’(๐‘ŠโŠค๐‘Š )๐‘—,๐‘—4๐‘›๐œ…1 โˆ’ 1

    โƒ’โƒ’โƒ’โƒ’โƒ’โ†’ 0, (9)

    where ๐œ…1 is as defined in Lemma 12. Moreover, define ๏ฟฝฬƒ๏ฟฝ = ๐‘Š diag(๐‘ŠโŠค๐‘Š )โˆ’1/2. If

    ๐‘˜ log(๐‘’๐‘/๐‘˜)๐‘›

    โ†’ 0, (10)

    then there exists ๐ถ๐‘ ,๐‘Ÿ > 0, depending only on ๐‘  and ๐‘Ÿ, such that with probability 1, thefollowing holds for all but finitely many ๐‘:

    โ€–๏ฟฝฬƒ๏ฟฝโŠค๏ฟฝฬƒ๏ฟฝ โˆ’ ๐ผโ€–๐‘˜,op โ‰ค ๐ถ๐‘ ,๐‘Ÿ

    โˆš๏ธƒ๐‘˜ log(๐‘’๐‘/๐‘˜)

    ๐‘›. (11)

We note that condition (10) is relatively mild and is satisfied whenever $k \le p^{\alpha}$ for any $\alpha \in [0, 1)$.

All theoretical results in this section except for Theorem 1 assume the random design Condition (C1). However, as revealed by the proofs, for any given (deterministic) sequence of $X$, these results remain true as long as (9) and (11) are satisfied. The asymptotic nature of Proposition 2 results from our application of Bai et al. (2015, Theorem 1.1), which guarantees almost sure convergence of the empirical spectral distribution of Beta random matrices in the weak topology. This sets the tone for the asymptotic nature of our results, which depend on the aforementioned limiting spectral distribution.
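The concentration in (9) can be illustrated numerically by forming the Gram matrix of the sketch through the identity $W^\top W = 4(X_1^\top X_1)(X^\top X)^{-1}(X_2^\top X_2)$ used in the proof of Proposition 2, and comparing its diagonal with $4n\kappa_1$, where $n\kappa_1$ is taken from the explicit expression (12) below. This is our own sanity-check sketch; it assumes the sketch dimension is $m = n_1 + n_2 - p$, consistent with $s = p/m$ in Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 600, 400, 300
n, m = n1 + n2, n1 + n2 - p                  # m = n - p assumed here
r, s = n1 / n2, p / m
n_kappa1 = n * r / ((1 + r) ** 2 * (1 + s))  # effective sample size, display (12)

X1 = rng.standard_normal((n1, p))
X2 = rng.standard_normal((n2, p))
X = np.vstack([X1, X2])
# W^T W = 4 (X1^T X1)(X^T X)^{-1}(X2^T X2), as in the proof of Proposition 2
G = 4 * (X1.T @ X1) @ np.linalg.solve(X.T @ X, X2.T @ X2)

# Display (9): the diagonal of W^T W concentrates around 4 n kappa_1.
mean_ratio = float(np.mean(np.diag(G)) / (4 * n_kappa1))
assert abs(mean_ratio - 1) < 0.2
```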

The following theorem provides power control for our procedure $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ when the $\ell_2$ norm of the scaled difference in regression coefficients, $\theta = (\beta_1 - \beta_2)/2$, exceeds an appropriate threshold.

Theorem 3. Under Conditions (C1) and (C2), assume further that $k \in [p]$ and that (10) holds. If $\theta = (\beta_1 - \beta_2)/2 \in \Theta_{p,k}(\rho)$ with $\rho \ge \sqrt{8k \log p/(n\kappa_1)}$, and we set input parameters $\lambda = 2\sqrt{\log p}$ and $\tau \le 3k \log p$ in Algorithm 1, then
$$\psi^{\mathrm{sparse}}_{\lambda,\tau}(X_1, X_2, Y_1, Y_2) \xrightarrow{\mathrm{a.s.}} 1.$$

The size and power controls in Theorems 1 and 3 jointly provide an upper bound on the minimax detection threshold. Specifically, let $P^X_{\beta_1,\beta_2}$ be the conditional distribution of $Y_1, Y_2$ given $X_1, X_2$ under model (2). Conditionally on the design matrices $X_1$ and $X_2$, and given $k \in [p]$ and $\rho > 0$, the (conditional) minimax risk of testing $H_0: \beta_1 = \beta_2$ against $H_1: \theta = (\beta_1 - \beta_2)/2 \in \Theta_{p,k}(\rho)$ is defined as

    โ„ณ๐‘‹(๐‘˜, ๐œŒ) := inf๐œ“

    {๏ธƒsup๐›ฝโˆˆR๐‘

    ๐‘ƒ๐‘‹๐›ฝ,๐›ฝ(๐œ“ ฬธ= 0) + sup๐›ฝ1,๐›ฝ2โˆˆR๐‘

    (๐›ฝ1โˆ’๐›ฝ2)/2โˆˆฮ˜๐‘,๐‘˜(๐œŒ)

    ๐‘ƒ๐‘‹๐›ฝ1,๐›ฝ2(๐œ“ ฬธ= 1)}๏ธƒ,

where we suppress all dependence on the dimensions of the data for notational simplicity, and the infimum is taken over all $\psi : (X_1, Y_1, X_2, Y_2) \mapsto \{0, 1\}$. If $\mathcal{M}_X(k, \rho) \xrightarrow{\mathrm{p}} 0$, there exists a test $\psi$ that with asymptotic probability 1 correctly differentiates the null and the alternative. On the other hand, if $\mathcal{M}_X(k, \rho) \xrightarrow{\mathrm{p}} 1$, then asymptotically no test can do better than a random guess. The following corollary provides an upper bound on the signal size $\rho$ for which the minimax risk is asymptotically zero.

Corollary 4. Let $k \in [p]$ and assume Conditions (C1), (C2) and (10). If $\rho \ge \sqrt{8k\log p/(n\kappa_1)}$, and we set input parameters $\lambda = 2\sqrt{\log p}$ and $\tau \in (0, 3k\log p]$ in Algorithm 1, then
$$\mathcal{M}_X(k, \rho) \le \sup_{\beta \in \mathbb{R}^p} P^X_{\beta,\beta}(\psi^{\mathrm{sparse}}_{\lambda,\tau} \ne 0) + \sup_{\substack{\beta_1, \beta_2 \in \mathbb{R}^p \\ (\beta_1-\beta_2)/2 \in \Theta_{p,k}(\rho)}} P^X_{\beta_1,\beta_2}(\psi^{\mathrm{sparse}}_{\lambda,\tau} \ne 1) \xrightarrow{\mathrm{a.s.}} 0.$$

Corollary 4 shows that the test $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ has an asymptotic detection limit, measured in $\|\beta_1 - \beta_2\|_2$, of at most $\sqrt{8k\log p/(n\kappa_1)}$ for all $k$ satisfying (10). While (10) is satisfied for $k \le p^{\alpha}$ with any $\alpha \in [0, 1)$, the detection limit upper bound in Corollary 4 is suboptimal when $\alpha > 1/2$, as we will see later in Theorem 6. On the other hand, the following theorem shows that when $\alpha < 1/2$, the detection limit of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ is essentially optimal.

Theorem 5. Under Conditions (C1) and (C2), if we further assume $k \le p^{\alpha}$ for some $\alpha \in [0, 1/2)$ and $\rho \le \sqrt{(1 - 2\alpha - \varepsilon)k\log p/(4n\kappa_1)}$ for some $\varepsilon \in (0, 1 - 2\alpha]$, then $\mathcal{M}_X(k, \rho) \xrightarrow{\mathrm{a.s.}} 1$.

For any fixed $\alpha < 1/2$, Theorem 5 shows that if the squared signal $\ell_2$ norm is a factor of $32/(1 - 2\alpha - \varepsilon)$ smaller than the level shown to be detectable by $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ in Corollary 4, then all tests are asymptotically powerless in differentiating the null from the alternative. In other words, in the sparse regime where $k \le p^{\alpha}$ for $\alpha < 1/2$, the test $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ has a minimax optimal detection limit, measured in $\|\beta_1 - \beta_2\|_2$, up to constants depending only on $\alpha$.

It is illuminating to relate the above results to the corresponding ones for the one-sample problem in the sparse regime ($\alpha < 1/2$). Let $X$ be an $n \times p$ matrix with independent $N(0, 1)$ entries and $Y = X\beta + \epsilon$ with $\epsilon \mid X \sim N(0, I_n)$, and consider the one-sample problem of testing $H_0: \beta = 0$ against $H_1: \beta \in \Theta_{p,k}(\rho)$. Theorems 2 and 4 of Arias-Castro, Candès and Plan (2011) state that, under the additional assumption that all nonzero entries of $\beta$ have equal absolute values, the detection limit for the one-sample problem is at $\rho \asymp \sqrt{k\log p/n}$, up to constants depending on $\alpha$. Thus, Corollary 4 and Theorem 5 suggest that the two-sample problem with model (2) has, up to multiplicative constants, the same detection limit as the one-sample problem with sample size $n\kappa_1$. By the explicit expression for $\kappa_1$ from Lemma 12, we have
$$n\kappa_1 = \frac{nr}{(1+r)^2(1+s)}, \qquad (12)$$
which unveils how this 'effective sample size' depends on the relative proportions between the sample sizes $n_1, n_2$ and the dimension $p$ of the problem. It is to be expected that the effective sample size is proportional to $m$, the number of observations constructed from $X_1$ and $X_2$ in $W$. More intriguingly, (12) also gives a precise characterisation of how the effective sample size depends on the imbalance between the numbers of observations in $X_1$ and $X_2$. For a fixed $n = n_1 + n_2$, $n\kappa_1$ is maximised when $n_1 = n_2$, and converges to $n_1 m/n$ (respectively $n_2 m/n$) if $n_1/n \to 0$ (respectively $n_2/n \to 0$).
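The comments above on sample-size balance can be checked directly from (12). A small sketch of ours (again taking $m = n_1 + n_2 - p$ for the sketch dimension, an assumption consistent with $s = p/m$ in Section 4):

```python
import numpy as np

def effective_sample_size(n1, n2, p):
    """n * kappa_1 = n r / ((1+r)^2 (1+s)), display (12); equals n1*n2*m/n^2."""
    n, m = n1 + n2, n1 + n2 - p
    r, s = n1 / n2, p / m
    return n * r / ((1 + r) ** 2 * (1 + s))

n, p = 1000, 400
vals = {n1: effective_sample_size(n1, n - n1, p) for n1 in range(100, 901, 100)}
assert max(vals, key=vals.get) == 500          # maximised at the balanced split

# Heavily unbalanced samples: n * kappa_1 approaches n1 * m / n as n1/n -> 0.
assert abs(effective_sample_size(5, 995, p) - 5 * (n - p) / n) < 0.1
```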

3.2 Dense case

We now turn our attention to our second test, $\psi^{\mathrm{dense}}_{\eta}$. The following theorem states a sufficient signal $\ell_2$ norm for which $\psi^{\mathrm{dense}}_{\eta}$ is asymptotically powerful in distinguishing the null from the alternative.

Theorem 6. Let $\eta = m + 2^{3/2}\sqrt{m\log p} + 4\log p$. Under Conditions (C1) and (C2), assume further that $k \in [p]$, that $\rho^2 \ge 2\sqrt{m\log p}/(n\kappa_1)$ and that (10) is satisfied.

(a) If $\beta_1 = \beta_2$, then $\psi^{\mathrm{dense}}_{\eta}(X_1, X_2, Y_1, Y_2) \xrightarrow{\mathrm{a.s.}} 0$.

(b) If $\theta = (\beta_1 - \beta_2)/2 \in \Theta_{p,k}(\rho)$, then $\psi^{\mathrm{dense}}_{\eta}(X_1, X_2, Y_1, Y_2) \xrightarrow{\mathrm{a.s.}} 1$.

Consequently, $\mathcal{M}_X(k, \rho) \xrightarrow{\mathrm{a.s.}} 0$.
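The threshold $\eta$ in Theorem 6 has the form of a Laurent–Massart-type upper-tail bound for a $\chi^2_m$ variable at level about $p^{-2}$. Under our reading that $\psi^{\mathrm{dense}}_{\eta}$ rejects when $\|Z\|_2^2$ exceeds $\eta$ (an assumption of ours here; the precise form of the test is specified in Algorithm 1), part (a) can be illustrated by simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 500, 10_000
eta = m + 2**1.5 * np.sqrt(m * np.log(p)) + 4 * np.log(p)

# Under H0, Z ~ N(0, I_m), so ||Z||_2^2 ~ chi^2_m, and eta bounds its upper
# tail at level about p^{-2}: rejections should essentially never occur.
stats = np.sum(rng.standard_normal((1000, m)) ** 2, axis=1)
reject_rate = float(np.mean(stats > eta))
assert reject_rate < 0.01
```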

Theorem 6 indicates that the signal $\ell_2$ norm sufficient for asymptotically powerful testing via $\psi^{\mathrm{dense}}_{\eta}$ does not depend on the sparsity level. While the above result is valid for all $k \in [p]$, it is most interesting in the dense regime where $k \ge p^{1/2}$. More precisely, comparing Theorem 6 with Corollary 4, we see that if $k^2\log p > m$ and $k\log(ep/k) \le n/(2C_{s,r})$, the test $\psi^{\mathrm{dense}}_{\eta}$ has a smaller provable detection limit than $\psi^{\mathrm{sparse}}_{\lambda,\tau}$. In our asymptotic regime (C2), $2\sqrt{m\log p}/(n\kappa_1)$ is, up to constants depending on $s$ and $r$, of order $p^{-1/2}\log^{1/2} p$. The following theorem establishes that the detection limit of $\psi^{\mathrm{dense}}_{\eta}$ is minimax optimal up to poly-logarithmic factors in the dense regime.

Theorem 7. Under Conditions (C1) and (C2), if we further assume $p^{1/2} \le k \le p^{\alpha}$ for some $\alpha \in [1/2, 1]$ and $\rho = \mathcal{O}(p^{-1/4}\log^{-3/4} p)$, then $\mathcal{M}_X(k, \rho) \xrightarrow{\mathrm{a.s.}} 1$.

Theorem 7 shows that the lower bound on the detectable signal size $\rho^2$ prescribed in Theorem 6 is necessary up to poly-logarithmic factors. The following proposition makes it explicit that the upper bound on sparsity imposed by (10) in Theorem 6 cannot be completely removed; that is, the same result may not hold if we allow $k$ to be a constant fraction of $n$.

Proposition 8. If $k = p \ge \min\{n_1, n_2\}$, then $\mathcal{M}_X(k, \rho) = 1$. If $k = p$ and $p/n_1, p/n_2 \in [\varepsilon, 1)$ for any fixed $\varepsilon \in (0, 1)$, and $\theta = (\beta_1 - \beta_2)/2 \in \Theta_{p,k}(\rho)$ with
$$\rho^2 = \mathcal{O}\biggl(\max\biggl\{\frac{p}{(n_1 - p)^2}, \frac{p}{(n_2 - p)^2}\biggr\}\biggr),$$
then $\mathcal{M}_X(k, \rho) \xrightarrow{\mathrm{a.s.}} 1$.

4 Numerical studies

In this section, we study the finite-sample performance of our proposed procedures via numerical experiments. Unless otherwise stated, the data-generating mechanism for all simulations in this section is as follows. We first generate design matrices $X_1$ and $X_2$ with independent $N(0, 1)$ entries. Then, for a given sparsity level $k$ and signal strength $\rho$, we set $\Delta = (\Delta_j)_{j \in [p]}$ so that $(\Delta_1, \ldots, \Delta_k)^\top \sim \rho\,\mathrm{Unif}(\mathcal{S}^{k-1})$ and $\Delta_j = 0$ for $j > k$. We then draw $\beta_1 \sim N_p(0, I_p)$ and define $\beta_2 := \beta_1 + \Delta$. Finally, we generate $Y_1$ and $Y_2$ as in (2), with $\epsilon_1 \sim N_{n_1}(0, I_{n_1})$ and $\epsilon_2 \sim N_{n_2}(0, I_{n_2})$ independent of each other.
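This data-generating mechanism can be transcribed as follows (a sketch of ours; normalising a $k$-dimensional Gaussian yields the $\mathrm{Unif}(\mathcal{S}^{k-1})$ draw):

```python
import numpy as np

def simulate(n1, n2, p, k, rho, rng):
    """Generate (X1, X2, Y1, Y2) following the mechanism of Section 4."""
    X1 = rng.standard_normal((n1, p))
    X2 = rng.standard_normal((n2, p))
    delta = np.zeros(p)
    u = rng.standard_normal(k)
    delta[:k] = rho * u / np.linalg.norm(u)   # rho * Unif(S^{k-1})
    beta1 = rng.standard_normal(p)
    beta2 = beta1 + delta
    Y1 = X1 @ beta1 + rng.standard_normal(n1)
    Y2 = X2 @ beta2 + rng.standard_normal(n2)
    return X1, X2, Y1, Y2, beta1, beta2

rng = np.random.default_rng(0)
X1, X2, Y1, Y2, b1, b2 = simulate(500, 500, 400, 10, 2.0, rng)
assert np.isclose(np.linalg.norm(b2 - b1), 2.0)   # ||beta2 - beta1||_2 = rho
```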

In Section 4.1, we supply the oracle value $\hat{\sigma}^2 = 1$ to our procedures to check whether their finite-sample performance accords with our theory. In all subsequent subsections, where we compare our methods against other procedures, we estimate the noise variance $\sigma^2$ with the method-of-moments estimator proposed by Dicker (2014). Specifically, after obtaining $W$ and $Z$ in Step 4 of Algorithm 1, we compute

$$\hat{\kappa}_1 := \frac{1}{p}\operatorname{tr}\Bigl(\frac{1}{m}W^\top W\Bigr) \quad \text{and} \quad \hat{\kappa}_2 := \frac{1}{p}\operatorname{tr}\Bigl\{\Bigl(\frac{1}{m}W^\top W\Bigr)^2\Bigr\} - \frac{1}{pm}\Bigl\{\operatorname{tr}\Bigl(\frac{1}{m}W^\top W\Bigr)\Bigr\}^2.$$

This allows us to estimate

$$\hat{\sigma}^2 := \biggl\{1 + \frac{p\hat{\kappa}_1^2}{(m+1)\hat{\kappa}_2}\biggr\}\frac{\|Z\|_2^2}{m} - \frac{\hat{\kappa}_1}{m(m+1)\hat{\kappa}_2}\|W^\top Z\|_2^2.$$

We implement our tests $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ on the standardised data $X_1/\hat{\sigma}$, $X_2/\hat{\sigma}$, $Y_1/\hat{\sigma}$ and $Y_2/\hat{\sigma}$, with the tuning parameters $\lambda = \sqrt{4\log p}$, $\tau = 3k\log p$ and $\eta = m + \sqrt{8m\log p} + 4\log p$, as suggested by Theorems 1, 3 and 6.

4.1 Effective sample size in two-sample testing

We first investigate how the empirical power of our test $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ depends on the various problem parameters. In light of our results in Theorems 1 and 3, we define
$$\nu := \frac{rn\rho^2}{\sigma^2(1+s)(1+r)^2 k\log p}, \qquad (13)$$
where $s := p/m$ and $r := n_1/n_2$. As discussed after Theorem 3, the quantity $rn/\{(1+s)(1+r)^2\}$ in the definition of $\nu$ can be viewed as the effective sample size of the testing problem. In Figure 1, we plot the estimated power of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ against $\nu$ over 100 Monte Carlo repetitions for $n = 1000$, $k = 10$, $\rho \in \{0, 0.2, \ldots, 2\}$ and various values of $p$ and $n_1$. In the left panel of Figure 1, $p$ ranges from 100 to 900, corresponding to $s$ ranging from 1/9 to 9. In the right panel, we vary $n_1$ from 100 to 900, corresponding to $r$ varying between 1/9 and 9. In both panels, the power curves for different $s$ and $r$ values overlap one another, with the phase transitions all occurring at around $\nu \approx 1.5$. This conforms well with the effective sample size and the detection limit articulated in our theory.

Figure 1: Power function of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$, estimated over 100 Monte Carlo repetitions, plotted against $\nu$, as defined in (13), in various parameter settings. Left panel: $n_1 = n_2 = 500$, $p \in \{100, 200, \ldots, 900\}$, $k = 10$, $\rho \in \{0, 0.2, \ldots, 2\}$. Right panel: $n_1 \in \{100, 200, \ldots, 900\}$, $n_2 = 1000 - n_1$, $p = 400$, $k = 10$, $\rho \in \{0, 0.2, \ldots, 2\}$.

4.2 Comparison with other methods

Next, we compare the performance of our procedures against competitors from the existing literature. The only methods we are aware of that allow for dense regression coefficients $\beta_1$ and $\beta_2$ are those proposed by Zhu and Bradic (2016) and Charbonnier, Verzelen and Villers (2015). In addition, we also include in our comparisons the classical likelihood ratio test, denoted by $\psi^{\mathrm{LRT}}$, which rejects the null when the $F$-statistic defined in (4) exceeds the upper $\alpha$-quantile of an $F_{p,\,n-2p}$ distribution. Note that the likelihood ratio test is only well-defined if $p < \min\{n_1, n_2\}$. The test proposed by Zhu and Bradic (2016), which we denote by $\psi^{\mathrm{ZB}}$, requires that $n_1 = n_2$ (when the two samples do not have equal sizes, a subset of the larger sample is discarded so that the test applies). Specifically, writing $X_+ := X_1 + X_2$, $X_- := X_1 - X_2$ and $Y_+ := Y_1 + Y_2$, $\psi^{\mathrm{ZB}}$ first estimates $\gamma = (\beta_1 + \beta_2)/2$ and
$$\Pi := \{\mathbb{E}(X_+^\top X_+)\}^{-1}\mathbb{E}(X_+^\top X_-)$$
by solving Dantzig-Selector-type optimisation problems. Based on the resulting estimators $\hat{\gamma}$ and $\hat{\Pi}$, $\psi^{\mathrm{ZB}}$ then computes the test statistic
$$T^{\mathrm{ZB}} := \frac{\|\{X_- - X_+\hat{\Pi}\}^\top\{Y_+ - X_+\hat{\gamma}\}\|_\infty}{\|Y_+ - X_+\hat{\gamma}\|_2}.$$

Their test rejects the null if the test statistic exceeds an empirical upper $\alpha$-quantile (obtained via Monte Carlo simulation) of $\|\xi\|_\infty$ for $\xi \sim N(0, \{X_- - X_+\hat{\Pi}\}^\top\{X_- - X_+\hat{\Pi}\})$. As the estimation of $\Pi$ involves solving a sequence of $p$ Dantzig Selector problems, which is often time-consuming, we have implemented $\psi^{\mathrm{ZB}}$ with the oracle choice $\hat{\Pi} = \Pi$, which is equal to $I_p$ when the covariates in the two design matrices $X_1$ and $X_2$ follow independent centred distributions with the same covariance matrix. The test proposed by Charbonnier, Verzelen and Villers (2015), denoted here by $\psi^{\mathrm{CVV}}$, first performs a LARS regression (Efron et al., 2004) of the concatenated response $Y = (Y_1^\top, Y_2^\top)^\top$ against the block design matrix
$$\begin{pmatrix} X_1 & X_1 \\ X_2 & -X_2 \end{pmatrix}$$
to obtain a sequence of regression coefficients $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2) \in \mathbb{R}^{p+p}$. Then, for every $\hat{\beta}$ on the LARS solution path with $\|\hat{\beta}\|_0 \le \min\{n_1, n_2\}/2$, they restrict the original testing problem to the subset of coordinates where either $\hat{\beta}_1$ or $\hat{\beta}_2$ is non-zero, and form test statistics based on the Kullback–Leibler divergence between the two samples restricted to these coordinates. The sequence of test statistics is then compared with Bonferroni-corrected thresholds at size $\alpha$. For both $\psi^{\mathrm{LRT}}$ and $\psi^{\mathrm{CVV}}$, we set $\alpha = 0.05$.

Figure 2: Power comparison of different methods at sparsity levels $k \in \{1, 10, \lfloor p^{1/2}\rfloor, 0.1p, p\}$ and different signal $\ell_2$ norms $\rho$ on a logarithmic grid (noise variance $\sigma^2 = 1$). Left panel: $n_1 = n_2 = 1200$, $p = 1000$, $\rho \in [0, 20]$; right panel: $n_1 = n_2 = 500$, $p = 800$, $\rho \in [0, 10]$.

Figure 2 compares the estimated power, as a function of $\|\beta_1 - \beta_2\|_2$, of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ against that of $\psi^{\mathrm{LRT}}$, $\psi^{\mathrm{ZB}}$ and $\psi^{\mathrm{CVV}}$. We ran all methods on the same 100 datasets for each set of parameters. We performed numerical experiments in two high-dimensional settings with different sample-size-to-dimension ratios: $p = 1000$, $n_1 = n_2 = 1200$ in the left panel, and $p = 800$, $n_1 = n_2 = 500$ in the right panel. Here, we took $n_1 = n_2$ to maximise the power of $\psi^{\mathrm{ZB}}$. Also, since the likelihood ratio test requires $p < \min\{n_1, n_2\}$, it is implemented only in the left panel. For each experiment, we varied $k$ in the set $\{1, 10, \lfloor p^{1/2}\rfloor, 0.1p, p\}$ to examine different sparsity levels.

We see in Figure 2 that both $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ show promising finite-sample performance. Neither of our tests produced any false positives under the null ($\rho = 0$), and both showed better power than $\psi^{\mathrm{ZB}}$ and $\psi^{\mathrm{CVV}}$. In the more challenging setting of the right panel, with $p > \max\{n_1, n_2\}$, our test $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ reaches power close to 1 in the sparsest case at a signal $\ell_2$ norm more than 10 times smaller than the competitors require. Note, though, that in the densest case in the right panel ($k = 800$), $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ do not have saturated power curves, because the noise variance is over-estimated by $\hat{\sigma}^2$ in this setting.

We also observe that the power of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ has a stronger dependence on the sparsity level $k$ than that of $\psi^{\mathrm{dense}}_{\eta}$. For $k \le \sqrt{p}$, $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ appears much more sensitive to the signal size. As $k$ increases, $\psi^{\mathrm{dense}}_{\eta}$ eventually outperforms $\psi^{\mathrm{sparse}}_{\lambda,\tau}$, which is consistent with our


observed phase-transition behaviour, as discussed after Theorem 6. It is interesting to note that when the likelihood ratio test is well-defined (left panel), it has better power than $\psi^{\mathrm{dense}}_{\eta}$. This is partly because the theoretical choice of the threshold $\eta$ is relatively conservative, so as to ensure that the asymptotic size of the test is 0 almost surely. In comparison, the rejection threshold for the likelihood ratio test is chosen to have asymptotic size $\alpha = 0.05$ (with $p$ fixed and $n \to \infty$), and its empirical size is sometimes observed to exceed 0.08.

4.3 Model misspecification

We have so far focused on the case of Gaussian random designs $X_1, X_2$ with identity covariance and Gaussian regression noises $\epsilon_1, \epsilon_2$. These assumptions have helped us gain theoretical insights into the two-sample testing problem; however, our proposed testing procedures can still be used even when they are not satisfied. To evaluate the performance of our procedures under model misspecification, we consider the following four setups:

(a) Correlated design: the rows of $X_1$ and $X_2$ are independently drawn from $N(0, \Sigma)$ with $\Sigma = (2^{-|j_1 - j_2|})_{j_1, j_2 \in [p]}$.

(b) Rademacher design: the entries of $X_1$ and $X_2$ are independent Rademacher random variables.

(c) One-way balanced ANOVA design: assume $d_1 := n_1/p$ and $d_2 := n_2/p$ are integers, and let $X_1$ and $X_2$ be the block-diagonal matrices
$$X_1 = \begin{pmatrix} 1_{d_1} & & \\ & \ddots & \\ & & 1_{d_1} \end{pmatrix}, \qquad X_2 = \begin{pmatrix} 1_{d_2} & & \\ & \ddots & \\ & & 1_{d_2} \end{pmatrix},$$
where $1_d$ denotes the all-one vector in $\mathbb{R}^d$.

(d) Heavy-tailed noise: we generate both $\epsilon_1$ and $\epsilon_2$ with independent $t_4/\sqrt{2}$ entries. Note that the $\sqrt{2}$ denominator standardises the noise to have unit variance, to allow easier comparison between settings.
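The four designs can be generated along the following lines (a sketch of ours; the parameter choices echo those used later in this subsection):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, p = 500, 800

# (a) correlated design: Sigma_{j1,j2} = 2^{-|j1-j2|}
idx = np.arange(p)
Sigma = 2.0 ** (-np.abs(idx[:, None] - idx[None, :]))
X1_corr = rng.multivariate_normal(np.zeros(p), Sigma, size=n1, method="cholesky")

# (b) Rademacher design: independent +/-1 entries
X1_rad = rng.choice([-1.0, 1.0], size=(n1, p))

# (c) one-way balanced ANOVA design: block diagonal with all-one blocks 1_{d1}
p_anova = 250
d1 = n1 // p_anova
X1_anova = np.kron(np.eye(p_anova), np.ones((d1, 1)))

# (d) heavy-tailed noise standardised to unit variance: t_4 / sqrt(2)
eps1 = rng.standard_t(df=4, size=n1) / np.sqrt(2)

assert X1_corr.shape == X1_rad.shape == (n1, p)
assert X1_anova.shape == (n1, p_anova) and X1_anova.sum() == n1
```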

In setups (a) to (c), we keep $\epsilon_1 \sim N_{n_1}(0, I_{n_1})$ and $\epsilon_2 \sim N_{n_2}(0, I_{n_2})$, and in setup (d), we keep $X_1$ and $X_2$ with independent $N(0, 1)$ entries. Figure 3 compares the performance of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ with that of $\psi^{\mathrm{ZB}}$ and $\psi^{\mathrm{CVV}}$. In all settings, we set $n_1 = n_2 = 500$ and $k = 10$. In settings (a), (b) and (d), we choose $p = 800$ and vary $\rho$ from 0 to 20; in setting (c), we choose $p = 250$ and vary $\rho$ from 0 to 50. We see that $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ is robust to model misspecification and exhibits good power in all settings. The test $\psi^{\mathrm{dense}}_{\eta}$ is robust to non-normal designs and noise, but exhibits a slight reduction in power under the correlated design. The advantage of $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ over the competing methods is least pronounced in the ANOVA design of setting (c), where each row vector of the design matrices has all its mass concentrated in one coordinate. In all other settings, where the rows of the design matrices are more 'incoherent' in the sense that all coordinates have similar magnitude, $\psi^{\mathrm{sparse}}_{\lambda,\tau}$ and $\psi^{\mathrm{dense}}_{\eta}$ start to have nontrivial power at a signal $\ell_2$ norm 10 to 20 times smaller than that of the competitors.

Figure 3: Power functions of different methods in models with non-Gaussian design or non-Gaussian noise, plotted against the signal $\ell_2$ norm $\rho$ on a logarithmic grid. (a) correlated design, $\Sigma = (2^{-|i-j|})_{i,j \in [p]}$; (b) Rademacher design; (c) one-way balanced ANOVA design; (d) Gaussian design with $t_4/\sqrt{2}$-distributed noise. Details of the models are given in Section 4.3.

5 Proof of main results

Proof of Theorem 1. Under the null hypothesis where $\beta_1 = \beta_2$, we have $\theta = 0$ and therefore $Z = W\theta + \xi = \xi \sim N_m(0, I_m)$. In particular, $Q_j \sim N(0, 1)$ for all $j \in [p]$. Thus, by a union bound, we have for $\lambda = \sqrt{(4+\varepsilon)\log p}$ and any $\tau > 0$ that
$$\mathbb{P}(T \ge \tau) \le \sum_{j=1}^{p}\mathbb{P}(|Q_j| \ge \lambda) \le p e^{-\lambda^2/2} = p^{-1-\varepsilon/2}.$$
The almost sure convergence then follows from the Borel–Cantelli lemma after noting that $p^{-1-\varepsilon/2}$ is summable for any $\varepsilon > 0$.
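The exponent bookkeeping in the last display can be verified mechanically (a standalone arithmetic check):

```python
import math

p, eps = 1000, 0.4
lam = math.sqrt((4 + eps) * math.log(p))

# p * exp(-lam^2 / 2) = p * p^{-(4+eps)/2} = p^{-1-eps/2}
lhs = p * math.exp(-lam**2 / 2)
rhs = p ** (-1 - eps / 2)
assert math.isclose(lhs, rhs, rel_tol=1e-9)
```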

Proof of Proposition 2. Let $H \in \mathcal{O}^{p \times p}$ be any orthogonal matrix. Since $X \stackrel{\mathrm{d}}{=} XH$, by Lemma 9 we have
$$H^\top W^\top W H = 4(H^\top X_1^\top X_1 H)(H^\top X^\top X H)^{-1}(H^\top X_2^\top X_2 H) \stackrel{\mathrm{d}}{=} 4(X_1^\top X_1)(X^\top X)^{-1}(X_2^\top X_2) = W^\top W. \qquad (14)$$
In particular, all diagonal entries of $W^\top W$ have the same distribution, and all off-diagonal entries of $W^\top W$ have the same distribution. So it suffices to study $(W^\top W)_{1,1}$ and $(W^\top W)_{1,2}$.

    Let ๐‘‹ = ๐‘„๐‘‡ be the QR decomposition of ๐‘‹, which is almost surely unique if werequire the upper-triangular matrix ๐‘‡ to have non-negative entries on the diagonal. Let๐‘„1 be the submatrix obtained from the first ๐‘›1 rows of ๐‘„. By Lemma 13, ๐‘„1 and ๐‘‡ areindependent and ๐‘‡ has independent entries distributed as ๐‘‡๐‘—,๐‘— = ๐‘ก๐‘— > 0 with ๐‘ก2๐‘— โˆผ ๐œ’2๐‘›โˆ’๐‘—+1for ๐‘— โˆˆ [๐‘] and ๐‘‡๐‘—,๐‘˜ = ๐‘ง๐‘—,๐‘˜ โˆผ ๐‘(0, 1) for 1 โ‰ค ๐‘— < ๐‘˜ โ‰ค ๐‘.

    Define ๐ต := ๐‘„โŠค1 ๐‘„1 and let ๐ต = ๐‘‰ ฮ›๐‘‰ โŠค be its eigendecomposition, which is al-most surely unique if we require the diagonal entries of ฮ› to be non-increasing and thediagonal entries of ๐‘‰ to be nonnegative. By Lemma 13, ๐‘„ is uniformly distributedon O๐‘›ร—๐‘, which means ๐‘„ d= ๐‘„๐ป for any ๐ป โˆˆ O๐‘ร—๐‘. Consequently ๐‘„1 d= ๐‘„1๐ป and๐ต

    d= ๐ปโŠค๐ต๐ป = (๐ปโŠค๐‘‰ )ฮ›(๐ปโŠค๐‘‰ )โŠค. Since the group O๐‘ร—๐‘ acts transitively on itself throughleft multiplication, the joint density of ๐‘‰ and ฮ› must be a function of ฮ› only. In particular,๐‘‰ and ฮ› are independent.

    Note that ๐‘‹1 = ๐‘„1๐‘‡ . Thus, ๐‘‹โŠค1 ๐‘‹1 = ๐‘‡โŠค๐ต๐‘‡ and ๐‘‹โŠค2 ๐‘‹2 = ๐‘‡โŠค(๐ผ๐‘โˆ’๐ต)๐‘‡ . By Lemma 9,we have

    ๐‘ŠโŠค๐‘Š = 4๐‘‹โŠค1 ๐ด1๐ดโŠค1 ๐‘‹1 = 4(๐‘‹โŠค1 ๐‘‹1)(๐‘‹โŠค1 ๐‘‹1 +๐‘‹โŠค2 ๐‘‹2)โˆ’1(๐‘‹โŠค2 ๐‘‹2)= 4๐‘‡โŠค๐ต(๐ผ๐‘ โˆ’๐ต)๐‘‡ = 4๐‘‡โŠค๐‘‰ ฮ›(๐ผ๐‘ โˆ’ ฮ›)๐‘‰ โŠค๐‘‡. (15)

    Let 1 โ‰ฅ ๐œ†1 โ‰ฅ ยท ยท ยท โ‰ฅ ๐œ†๐‘ โ‰ฅ 0 be the diagonal entries of ฮ›. Define ๐‘Ž๐‘— = ๐œ†๐‘—(1โˆ’ ๐œ†๐‘—) for ๐‘— โˆˆ [๐‘]and set ๐‘Ž := (๐‘Ž1, . . . , ๐‘Ž๐‘). We can write ๐‘ก21 = ๐‘ 21 + ๐‘Ÿ21 with ๐‘ 21 โˆผ ๐œ’2๐‘ and ๐‘Ÿ21 โˆผ ๐œ’2๐‘›โˆ’๐‘ suchthat ๐‘ 1 โ‰ฅ 0, ๐‘Ÿ1 โ‰ฅ 0 are independent from each other and independent of everything else.By Lemma 13, we have that ๐บ๐‘—,1 := ๐‘ 1๐‘‰๐‘—,1 for ๐‘— โˆˆ [๐‘] are independent ๐‘(0, 1) randomvariables. Note that

    14(๐‘Š

    โŠค๐‘Š )1,1 =๐‘โˆ‘๏ธ๐‘—=1

    ๐‘ก21๐‘Ž๐‘—๐‘‰2๐‘—,1 =

    ๐‘ก21๐‘ 21

    ๐‘โˆ‘๏ธ๐‘—=1

    ๐‘Ž๐‘—๐บ2๐‘—,1.

    Let ๐›ฟ > 0 be chosen later. By Laurent and Massart (2000, Lemma 1), applied conditionallyon ๐‘Ž, we have with probability at least 1โˆ’ 6๐›ฟ that all of the following inequalities hold:

    โ€–๐‘Žโ€–1 โˆ’ 2โ€–๐‘Žโ€–2โˆš๏ธ

    log(1/๐›ฟ) โ‰ค๐‘โˆ‘๏ธ๐‘—=1

    ๐‘Ž๐‘—๐บ2๐‘—,1 โ‰ค โ€–๐‘Žโ€–1 + 2โ€–๐‘Žโ€–2

    โˆš๏ธlog(1/๐›ฟ) + 2โ€–๐‘Žโ€–โˆž log(1/๐›ฟ),

    ๐‘โˆ’ 2โˆš๏ธ๐‘ log(1/๐›ฟ) โ‰ค ๐‘ 21 โ‰ค ๐‘+ 2

    โˆš๏ธ๐‘ log(1/๐›ฟ) + 2 log(1/๐›ฟ),

    ๐‘›โˆ’ 2โˆš๏ธ๐‘› log(1/๐›ฟ) โ‰ค ๐‘ก21 โ‰ค ๐‘›+ 2

    โˆš๏ธ๐‘› log(1/๐›ฟ) + 2 log(1/๐›ฟ).

Using the fact that $\|a\|_\infty \le 1/4$, we have with probability at least $1 - 6\delta$ that
$$\frac{n - 2\sqrt{n\log(1/\delta)}}{p + 2\sqrt{p\log(1/\delta)} + 2\log(1/\delta)}\Bigl\{\|a\|_1 - 2\|a\|_2\sqrt{\log(1/\delta)}\Bigr\} \le \frac{1}{4}(W^\top W)_{1,1} \le \frac{n + 2\sqrt{n\log(1/\delta)} + 2\log(1/\delta)}{p - 2\sqrt{p\log(1/\delta)}}\Bigl\{\|a\|_1 + 2\|a\|_2\sqrt{\log(1/\delta)} + \tfrac{1}{2}\log(1/\delta)\Bigr\}.$$
If $\log(1/\delta) = \mathcal{O}(p)$, then for each $p$, with probability at least $1 - 6\delta$, we have that
$$\biggl|\frac{(W^\top W)_{1,1}}{4} - \frac{n}{p}\|a\|_1\biggr| \le \|a\|_1\frac{2\sqrt{n\log(1/\delta)}}{p}\Bigl(1 + \sqrt{n/p}\Bigr) + \frac{n}{p}\biggl\{2\|a\|_2\sqrt{\log(1/\delta)} + \frac{\log(1/\delta)}{2}\biggr\} + \mathcal{O}_s\biggl(\frac{\|a\|_1\log(1/\delta)}{p} + \frac{\|a\|_2\log(1/\delta)}{\sqrt{p}} + \frac{\log^{3/2}(1/\delta)}{p^{1/2}}\biggr). \qquad (16)$$

    By the definition of ๐ต, we have for ๐ป := ๐‘‡ (๐‘‹โŠค๐‘‹)โˆ’1/2 โˆˆ O๐‘ร—๐‘ that

    ๐ปโŠค๐ต๐ป = ๐ปโŠค๐‘‡โˆ’โŠค๐‘‹โŠค1 ๐‘‹1๐‘‡โˆ’1๐ป = (๐‘‹โŠค๐‘‹)โˆ’1/2(๐‘‹โŠค1 ๐‘‹1)(๐‘‹โŠค๐‘‹)โˆ’1/2,

    which follows the matrix-variate Beta distribution Beta๐‘(๐‘›1/2, ๐‘›2/2) as defined beforeLemma 12. Hence the diagonal elements of ฮ› are the same as the eigenvalues of aBeta๐‘(๐‘›1/2, ๐‘›2/2) random matrix. By Lemma 12, throughout the rest of this proof wemay restrict ourselves to an almost sure event on which

    โ€–๐‘Žโ€–1/๐‘โ†’ ๐œ…1 and โ€–๐‘Žโ€–2/โˆš๐‘โ†’ ๐œ…2.

    By (16), for each ๐‘, with probability at least 1โˆ’ 6๐›ฟ, we haveโƒ’โƒ’โƒ’โƒ’โƒ’(๐‘ŠโŠค๐‘Š )1,1โˆ’ 4๐‘›๐‘ โ€–๐‘Žโ€–1

    โƒ’โƒ’โƒ’โƒ’โƒ’ โ‰ค 8โˆš๏ธ๐‘› log(1/๐›ฟ)(๏ธ(๐œ…1 +๐œ…2)โˆš๏ธ๐‘›/๐‘+๐œ…1)๏ธ+๐’ช๐‘ 

    (๏ธƒlog(1/๐›ฟ)+ log

    3/2(1/๐›ฟ)๐‘1/2

    )๏ธƒ.

    (17)For the first claim in the proposition, we take ๐›ฟ = 1/๐‘3. Using (17), Lemma 12, a unionbound over ๐‘— โˆˆ [๐‘] and the Borelโ€“Cantelli lemma (noting that 1/๐‘2 is summable), weobtain that,

    max๐‘—โˆˆ[๐‘]

    โƒ’โƒ’โƒ’โƒ’โƒ’(๐‘ŠโŠค๐‘Š )๐‘—,๐‘—4๐‘›๐œ…1 โˆ’ 1

    โƒ’โƒ’โƒ’โƒ’โƒ’ a.s.โˆ’โˆ’โ†’ 0.

For the second part of the claim, define $\Delta := W^\top W - 4np^{-1}\|a\|_1 I_p$. By Lemma 11, there exists a $1/4$-net $\mathcal{N}$ of cardinality at most $\binom{p}{k}9^k$ such that
$$\|\Delta\|_{k,\mathrm{op}} \le 2\sup_{u \in \mathcal{N}} u^\top\Delta u. \qquad (18)$$

By (14), we have $u^\top\Delta u \stackrel{\mathrm{d}}{=} e_1^\top\Delta e_1$ for all $u \in \mathcal{S}^{p-1}$. Hence, by (17), (18) and a union bound over $u \in \mathcal{N}$, there exists $C_s > 0$, depending only on $s$, such that for each $p$, with probability at least $1 - 6|\mathcal{N}|\delta$,
$$\|\Delta\|_{k,\mathrm{op}} \le 8\sqrt{n\log(1/\delta)}\Bigl\{(\kappa_1 + \kappa_2)\sqrt{n/p} + \kappa_1\Bigr\} + \mathcal{O}_s\biggl(\log(1/\delta) + \frac{\log^{3/2}(1/\delta)}{p^{1/2}}\biggr). \qquad (19)$$

If $\log(1/\delta) = \mathcal{O}(p)$, as will be verified for our choice of $\delta$ later, then on the event where (19) holds, we have $\|W^\top W\|_{k,\mathrm{op}} = 4(1 + o(1))np^{-1}\|a\|_1$ and $\|\operatorname{diag}(W^\top W)\|_{\mathrm{op}} = 4(1 + o(1))np^{-1}\|a\|_1$. Defining
$$D := \biggl(\frac{\operatorname{diag}(W^\top W)}{4np^{-1}\|a\|_1}\biggr)^{1/2},$$
we have
$$\tilde{W}^\top\tilde{W} - \frac{W^\top W}{4n\|a\|_1/p} = (D^{-1} - I)\frac{W^\top W}{4n\|a\|_1/p}D^{-1} + \frac{W^\top W}{4n\|a\|_1/p}(D^{-1} - I).$$

    Taking ๐‘˜-operator norms on both sides of the above identity, since ๐ท is a diagonal matrix,we haveโƒฆโƒฆโƒฆโƒฆโƒฆ๏ฟฝฬƒ๏ฟฝโŠค๏ฟฝฬƒ๏ฟฝ โˆ’ ๐‘ŠโŠค๐‘Š4๐‘›โ€–๐‘Žโ€–1/๐‘

    โƒฆโƒฆโƒฆโƒฆโƒฆ๐‘˜,opโ‰ค โ€–๐ทโˆ’1 โˆ’ ๐ผโ€–op

    โƒฆโƒฆโƒฆโƒฆโƒฆ ๐‘ŠโŠค๐‘Š4๐‘›โ€–๐‘Žโ€–1/๐‘

    โƒฆโƒฆโƒฆโƒฆโƒฆ๐‘˜,op

    (โ€–๐ทโˆ’1โ€–op + 1)

    โ‰ค (2 + ๐’ช(1))โ€–๐ทโˆ’1 โˆ’ ๐ผ๐‘โ€–op โ‰ค (1 + ๐’ช(1)) max๐‘—โˆˆ[๐‘]

    โƒ’โƒ’โƒ’โƒ’โƒ’ (๐‘ŠโŠค๐‘Š )๐‘—,๐‘—4๐‘›๐‘โˆ’1โ€–๐‘Žโ€–1 โˆ’ 1

    โƒ’โƒ’โƒ’โƒ’โƒ’.

Combining the above inequality with the observation that $|(W^\top W)_{j,j} - 4np^{-1}\|a\|_1| \le \|\Delta\|_{k,\mathrm{op}}$, we have by (19) and Lemma 12 that, asymptotically with probability at least $1 - 6|\mathcal{N}|\delta$,
$$\|\tilde{W}^\top\tilde{W} - I\|_{k,\mathrm{op}} \le \biggl\|\tilde{W}^\top\tilde{W} - \frac{W^\top W}{4n\|a\|_1/p}\biggr\|_{k,\mathrm{op}} + \biggl\|\frac{W^\top W}{4n\|a\|_1/p} - I_p\biggr\|_{k,\mathrm{op}} \le (2 + o(1))\frac{\|\Delta\|_{k,\mathrm{op}}}{4np^{-1}\|a\|_1} \le (4 + o(1))\Bigl\{(1 + \kappa_2/\kappa_1)\sqrt{n/p} + 1\Bigr\}\sqrt{\frac{\log(1/\delta)}{n}}.$$

    Choosing ๐›ฟ := (10๐‘’๐‘/๐‘˜)โˆ’(๐‘˜+4). By (10), we indeed have log(1/๐›ฟ) = (๐‘˜ + 4) log(10๐‘’๐‘/๐‘˜) =๐‘œ(๐‘), as required in the above calculation. Also,

    |๐’ฉ |๐›ฟ โ‰ค 9๐‘˜(๏ธƒ๐‘

    ๐‘˜

    )๏ธƒ(๏ธƒ10๐‘’๐‘๐‘˜

    )๏ธƒโˆ’(๐‘˜+4)โ‰ค(๏ธƒ

    9๐‘’๐‘๐‘˜

    )๏ธƒ๐‘˜(๏ธƒ10๐‘’๐‘๐‘˜

    )๏ธƒโˆ’(๐‘˜+4)โ‰ค 0.9

    ๐‘˜

    (๐‘’๐‘/๐‘˜)4 โ‰ค max{๐‘โˆ’2, 0.9

    โˆš๐‘},

which is summable over $p$. Thus, by the Borel–Cantelli lemma, we see that for any $\varepsilon > 0$, with probability 1, the events
$$\biggl\{\|\tilde{W}^\top\tilde{W} - I\|_{k,\mathrm{op}} > (4 + \varepsilon)\Bigl\{(1 + \kappa_2/\kappa_1)\sqrt{n/p} + 1\Bigr\}\sqrt{\frac{(k+4)\log(10ep/k)}{n}}\biggr\}$$
happen only finitely often. Hence the desired conclusion holds with, for instance, $C_{s,r} = 5\bigl(1 + \sqrt{1 + 1/s} + \sqrt{s + r - 1 + 1/s + 1/r}\bigr)$.

Proof of Theorem 3. By Proposition 2, it suffices to work with a deterministic sequence of $W$ such that (9) and (11) hold, which we henceforth assume in this proof.

Define $\tilde{\theta} = (\tilde{\theta}_1, \ldots, \tilde{\theta}_p)^\top$ by $\tilde{\theta}_j := \theta_j\|W_j\|_2$. Then, from (8), we have
$$Z = \tilde{W}\tilde{\theta} + \xi$$
for $\xi \sim N_m(0, I_m)$. Write $Q := (Q_1, \ldots, Q_p)^\top$ and $S := \operatorname{supp}(\theta) = \operatorname{supp}(\tilde{\theta})$; then
$$Q_S = (\tilde{W}^\top Z)_S \sim N_k\Bigl((\tilde{W}^\top\tilde{W})_{S,S}\tilde{\theta}_S,\ (\tilde{W}^\top\tilde{W})_{S,S}\Bigr).$$

Our strategy will be to control $\|Q_S\|_2$. To this end, by (9), we have
$$\|\tilde\theta_S\|_2^2 = \sum_{j\in S}\theta_j^2\|W_j\|_2^2 \ge \{4 - o(1)\}n\kappa_1\sum_{j\in S}\theta_j^2 = (4 - o(1))n\kappa_1\rho^2. \quad (20)$$

Moreover, by (11) and (10), we have for sufficiently large $p$ that
$$\|I_k - (\widetilde W^\top\widetilde W)_{S,S}\|_{\mathrm{op}} \le \|I_p - \widetilde W^\top\widetilde W\|_{k,\mathrm{op}} \le C_{s,r}\sqrt{\frac{k\log(ep/k)}{n}} = o(1). \quad (21)$$

    Define ๐‘… := โ€–(๏ฟฝฬƒ๏ฟฝโŠค๏ฟฝฬƒ๏ฟฝ )๐‘†,๐‘†๐œƒ๐‘†โ€–2. By the triangle inequality, (20) and (21), we have

    ๐‘…2 โ‰ฅ[๏ธโ€–๐œƒ๐‘†โ€–2

    {๏ธ1โˆ’ โ€–๐ผ๐‘˜ โˆ’ (๏ฟฝฬƒ๏ฟฝโŠค๏ฟฝฬƒ๏ฟฝ )๐‘†,๐‘†โ€–op

    }๏ธ]๏ธ2โ‰ฅ (4โˆ’ ๐’ช(1))๐‘›๐œ…1๐œŒ2 โ‰ฅ (32โˆ’ ๐’ช(1))๐‘˜ log ๐‘, (22)

    where we have evoked the condition ๐œŒ2 โ‰ฅ 8๐‘˜ log ๐‘/(๐‘›๐œ…1) in the final bound.By Lemma 14 and (21), for sufficiently large ๐‘, we have with probability at least

    1โˆ’ 2๐‘โˆ’2 that

    โ€–๐‘„๐‘†โ€–22 โ‰ฅ (1โˆ’ ๐’ช(1))(๐‘˜ +๐‘…2)โˆ’ (2 + ๐’ช(1)){๏ธโˆš๏ธ

    2(๐‘˜ + 2๐‘…2) log ๐‘+ 2 log ๐‘}๏ธ

    โ‰ฅ (1โˆ’ ๐’ช(1))(๐‘˜ +๐‘…2)โˆ’ (2 + ๐’ช(1))โˆš๏ธ

    (๐‘˜ + 2๐‘…2)๐‘…2/16โˆ’ (1/8 + ๐’ช(1))๐‘…2

    โ‰ฅ(๏ธƒ

    1โˆ’ 14โˆš

    2โˆ’ ๐’ช(1)

    )๏ธƒ๐‘˜ +

    (๏ธƒ78 โˆ’

    1โˆš2โˆ’ ๐’ช(1)

    )๏ธƒ๐‘…2 โ‰ฅ 5๐‘˜ log ๐‘, (23)

    where both the second and the final inequalities hold because of (22).From (23), using the tuning parameters ๐œ† = 2

    โˆšlog ๐‘ and ๐œ = ๐‘˜ log ๐‘, we have for

    sufficiently large ๐‘ that with probability at least 1โˆ’ 2๐‘โˆ’2,

    ๐‘‡ =๐‘โˆ‘๏ธ๐‘—=1

    ๐‘„2๐‘—1{|๐‘„๐‘— |โ‰ฅ๐œ†} โ‰ฅ โ€–๐‘„๐‘†โ€–22 โˆ’ ๐‘˜๐œ†2 โ‰ฅ ๐‘˜ log ๐‘ โ‰ฅ ๐œ,

    which allows us to reject the null. The desired almost sure convergence follows by theBorelโ€“Cantelli lemma since 1/๐‘2 is summable over ๐‘ โˆˆ N.

Proof of Corollary 4. The first inequality follows from the definition of $\mathcal M_X(k,\rho)$. An inspection of the proofs of Theorems 1 and 3 reveals that both results depend only on the complementary-sketched model $Z = W\theta + \xi$, and hence hold uniformly over $(\beta_1, \beta_2)$. Thus, we have from Theorem 1 that $\sup_{\beta\in\mathbb R^p} P^X_{\beta,\beta}(\psi^{\mathrm{sparse}}_{\lambda,\tau} \ne 0) \stackrel{\mathrm{a.s.}}{\longrightarrow} 0$ and from Theorem 3 that $\sup_{\beta_1,\beta_2\in\mathbb R^p:\,(\beta_1-\beta_2)/2\in\Theta_{p,k}(\rho)} P^X_{\beta_1,\beta_2}(\psi^{\mathrm{sparse}}_{\lambda,\tau} \ne 1) \stackrel{\mathrm{a.s.}}{\longrightarrow} 0$. Combining the two completes the proof.


Proof of Theorem 5. By considering a trivial test $\psi \equiv 0$, we see that $\mathcal M \le 1$. Thus, it suffices to show that $\mathcal M \ge 1 - o(1)$. Also, by Proposition 2, it suffices to work with a deterministic sequence of $X$ (and hence $W$) such that (9) and (11) hold, which we henceforth assume in this proof.

    Let ๐ฟ := (๐‘‹โŠค1 ๐‘‹1 +๐‘‹โŠค2 ๐‘‹2)โˆ’1(๐‘‹โŠค2 ๐‘‹2 โˆ’๐‘‹โŠค1 ๐‘‹1) and ๐œ‹ be the uniform distribution on

    ฮ˜0 := {๐œƒ โˆˆ {๐‘˜โˆ’1/2๐œŒ,โˆ’๐‘˜โˆ’1/2๐œŒ, 0}๐‘ : โ€–๐œƒโ€–0 = ๐‘˜} โŠ† ฮ˜.

    We write ๐‘ƒ0 := ๐‘ƒ๐‘‹0,0 and let ๐‘ƒ๐œ‹ :=โˆซ๏ธ€๐œƒโˆˆฮ˜0 ๐‘ƒ

    ๐‘‹๐ฟ๐œƒ,๐œƒ ๐‘‘๐œ‹(๐œƒ) denote the uniform mixture of ๐‘ƒ๐‘‹๐›พ,๐œƒ for

    {(๐›พ, ๐œƒ) : ๐œƒ โˆˆ ฮ˜0, ๐›พ = ๐ฟ๐œƒ}. Let โ„’ := ๐‘‘๐‘ƒ๐œ‹/๐‘‘๐‘ƒ0 be the likelihood ratio between the mixturealternative ๐‘ƒ๐œ‹ and the simple null ๐‘ƒ0. We have that

    โ„ณโ‰ฅ inf๐œ“

    {๏ธ‚1โˆ’ (๐‘ƒ0 โˆ’ ๐‘ƒ๐œ‹)๐œ“

    }๏ธ‚= 1โˆ’ 12

    โˆซ๏ธ โƒ’โƒ’โƒ’โƒ’โƒ’1โˆ’ ๐‘‘๐‘ƒ๐œ‹๐‘‘๐‘ƒ0

    โƒ’โƒ’โƒ’โƒ’โƒ’๐‘‘๐‘ƒ0

    โ‰ฅ 1โˆ’ 12

    {๏ธƒโˆซ๏ธ (๏ธƒ1โˆ’ ๐‘‘๐‘ƒ๐œ‹

    ๐‘‘๐‘ƒ0

    )๏ธƒ2๐‘‘๐‘ƒ0

    }๏ธƒ1/2โ‰ฅ 1โˆ’ 12{๐‘ƒ0(โ„’

    2)โˆ’ 1}1/2.

So it suffices to prove that $P_0(\mathcal L^2) \le 1 + o(1)$. Writing $\widetilde X_1 = X_1L + X_1$ and $\widetilde X_2 = X_2L - X_2$ and suppressing the dependence of the $P$'s on $X$ in the notation, by the definition of $P_\pi$, we compute that
$$\mathcal L = \int \frac{dP_{L\theta,\theta}}{dP_0}\,d\pi(\theta) = \int \frac{e^{-\frac12(\|Y_1 - X_1L\theta - X_1\theta\|^2 + \|Y_2 - X_2L\theta + X_2\theta\|^2)}}{e^{-\frac12(\|Y_1\|^2 + \|Y_2\|^2)}}\,d\pi(\theta) = \int e^{\langle\widetilde X_1\theta,\, Y_1\rangle - \frac12\|\widetilde X_1\theta\|^2 + \langle\widetilde X_2\theta,\, Y_2\rangle - \frac12\|\widetilde X_2\theta\|^2}\,d\pi(\theta).$$

For $\theta \sim \pi$ and some fixed $J_0 \subseteq [p]$ with $|J_0| = k$, let $\pi_{J_0}$ be the distribution of $\theta_{J_0}$ conditional on $\mathrm{supp}(\theta) = J_0$. Let $J, J'$ be independently and uniformly distributed on $\{J_0 \subseteq [p] : |J_0| = k\}$. By Fubini's theorem, we have
$$\begin{aligned} P_0(\mathcal L^2) &= \int\!\!\int_{\theta,\theta'} e^{\frac12\|\widetilde X_1(\theta+\theta')\|^2 - \frac12\|\widetilde X_1\theta\|^2 - \frac12\|\widetilde X_1\theta'\|^2 + \frac12\|\widetilde X_2(\theta+\theta')\|^2 - \frac12\|\widetilde X_2\theta\|^2 - \frac12\|\widetilde X_2\theta'\|^2}\,d\pi(\theta)\,d\pi(\theta')\\ &= \int\!\!\int_{\theta,\theta'} e^{\theta^\top(\widetilde X_1^\top\widetilde X_1 + \widetilde X_2^\top\widetilde X_2)\theta'}\,d\pi(\theta)\,d\pi(\theta') = \int\!\!\int_{\theta,\theta'} e^{\theta^\top W^\top W\theta'}\,d\pi(\theta)\,d\pi(\theta')\\ &= \mathbb E\biggl[\mathbb E\biggl\{e^{\tilde\theta_{J\cap J'}^\top\tilde\theta'_{J\cap J'}} \underbrace{\int_{\theta'_{J'}}\!\int_{\theta_J} e^{\tilde\theta_J^\top(\widetilde W^\top\widetilde W - I_p)_{J,J'}\tilde\theta'_{J'}}\,d\pi_J(\theta_J)\,d\pi_{J'}(\theta'_{J'})}_{\mathcal I}\ \bigg|\ J, J'\biggr\}\biggr], \quad (24)\end{aligned}$$
where we have employed Lemmas 10 and 9 in the penultimate equality and used the decomposition $\theta^\top W^\top W\theta' = \tilde\theta^\top\widetilde W^\top\widetilde W\tilde\theta' = \tilde\theta^\top(\widetilde W^\top\widetilde W - I_p)\tilde\theta' + \tilde\theta^\top\tilde\theta'$ in the final step.

We first control the integral $\mathcal I$. Define $\tilde\theta := (\tilde\theta_1,\dots,\tilde\theta_p)^\top$ and $\tilde\theta' := (\tilde\theta'_1,\dots,\tilde\theta'_p)^\top$ such that $\tilde\theta_j := \theta_j\|W_j\|_2$ and $\tilde\theta'_j := \theta'_j\|W_j\|_2$. By (9), we have
$$\vartheta := \max\Bigl\{\max_{j\in J}|\tilde\theta_j|,\ \max_{j\in J'}|\tilde\theta'_j|\Bigr\} \le (1 + o(1))\sqrt{\frac{4n\kappa_1\rho^2}{k}}. \quad (25)$$


By (25) and (11), we have for any $\rho = O(n^{-1/4}\log^{-3/4} p)$ (which includes the case $\rho \le \sqrt{\frac{(1-2\alpha-\varepsilon)k\log p}{4n\kappa_1}}$ for $k \le p^\alpha$ with $\alpha < 1/2$) that
$$\vartheta^2\|(\widetilde W^\top\widetilde W - I_p)_{J,J'}\|_{\mathrm F} \le \vartheta^2\|(\widetilde W^\top\widetilde W - I_p)_{J\cup J',J\cup J'}\|_{\mathrm F} \le \sqrt{2k}\,\vartheta^2\|\widetilde W^\top\widetilde W - I_p\|_{2k,\mathrm{op}} \le (1 + o(1))\sqrt{2k}\,\frac{4n\kappa_1\rho^2}{k}\cdot C_{s,r}\sqrt{\frac{2k\log\{ep/(2k)\}}{n}} = O(\log^{-1} p).$$
Consequently, by Arias-Castro, Candès and Plan (2011, Lemma 4), we have
$$\mathcal I \le e^{2\vartheta^2\|(\widetilde W^\top\widetilde W - I_p)_{J,J'}\|_{\mathrm F}\log(3k)} + \vartheta^2\|(\widetilde W^\top\widetilde W - I_p)_{J,J'}\|_{\mathrm F} \le e^{o(1)} + o(1) = 1 + o(1).$$

Plugging the above display into (24), we obtain
$$\begin{aligned} P_0(\mathcal L^2) &\le (1 + o(1))\,\mathbb E\Bigl\{\mathbb E\Bigl(e^{\tilde\theta_{J\cap J'}^\top\tilde\theta'_{J\cap J'}}\ \Big|\ J\cap J'\Bigr)\Bigr\} = (1 + o(1))\,\mathbb E\biggl[\mathbb E\biggl\{\prod_{j\in J\cap J'}\exp\bigl(\tilde\theta_j\tilde\theta'_j\bigr)\ \bigg|\ J\cap J'\biggr\}\biggr]\\ &\le (1 + o(1))\,\mathbb E\biggl\{\prod_{j\in J\cap J'}\cosh(\vartheta^2)\biggr\} = (1 + o(1))\,\mathbb E\bigl\{\cosh^{|J\cap J'|}(\vartheta^2)\bigr\}. \quad (26)\end{aligned}$$

Hence, it suffices to show that $\mathbb E\{\cosh^{|J\cap J'|}(\vartheta^2)\} = 1 + o(1)$. Note that $|J\cap J'| \sim \mathrm{HyperGeom}(k; k, p)$ is a hypergeometric random variable (defined as the number of black balls obtained in $k$ draws without replacement from an urn containing $p$ balls, $k$ of which are black). Let $B \sim \mathrm{Bin}(k, k/p)$. By Hoeffding (1963, Theorem 4) and the fact that $\cosh(x) \le e^x$ for all $x \ge 0$, we have
$$\mathbb E\bigl\{\cosh^{|J\cap J'|}(\vartheta^2)\bigr\} \le \mathbb E e^{\vartheta^2|J\cap J'|} \le \mathbb E e^{\vartheta^2 B} = \Bigl\{1 + \frac{k}{p}\bigl(e^{\vartheta^2} - 1\bigr)\Bigr\}^k \le \exp\Bigl\{\frac{k^2}{p}\bigl(e^{\vartheta^2} - 1\bigr)\Bigr\}. \quad (27)$$
Since $\alpha \in [0, 1/2)$ and $\rho \le \sqrt{\frac{(1-2\alpha-\varepsilon)k\log p}{4n\kappa_1}}$, from (25), we can deduce that
$$\vartheta \le \sqrt{(1 - 2\alpha - \varepsilon + o(1))\log p}$$
and hence $e^{\vartheta^2}k^2/p = p^{-\varepsilon + o(1)} = o(1)$. So, from (27), we have $\mathbb E\{\cosh^{|J\cap J'|}(\vartheta^2)\} = 1 + o(1)$, which completes the proof.

Proof of Theorem 6. As in the proof of Theorem 3, we work with a deterministic sequence of $W$ such that (9) and (11) are satisfied. For $\tilde\theta = (\tilde\theta_1,\dots,\tilde\theta_p)^\top$ such that $\tilde\theta_j := \theta_j\|W_j\|_2$, we have from (8) that
$$Z = \widetilde W\tilde\theta + \xi,$$
for $\xi \sim N_m(0, I_m)$. Hence, under the null hypothesis, we have $\|Z\|_2^2 \sim \chi^2_m$, which by Laurent and Massart (2000, Lemma 1) yields that
$$\mathbb P\bigl\{\|Z\|_2^2 \ge m + 2\sqrt{m\log(1/\delta)} + 2\log(1/\delta)\bigr\} \le \delta.$$


Setting $\delta = p^{-2}$, for $\eta = m + 2^{3/2}\sqrt{m\log p} + 4\log p$, we have by the Borel–Cantelli lemma that $\psi^{\mathrm{dense}}_\eta(X_1, X_2, Y_1, Y_2) \stackrel{\mathrm{a.s.}}{\longrightarrow} 0$.

On the other hand, under the alternative hypothesis, $\|Z\|_2^2 \sim \chi^2_m(\|W\theta\|_2^2)$, which by Birgé (2001, Lemma 8.1) implies that
$$\mathbb P\bigl\{\|Z\|_2^2 \le m + \|W\theta\|_2^2 - 2\sqrt{(m + 2\|W\theta\|_2^2)\log(1/\delta)}\bigr\} \le \delta.$$
Again setting $\delta = p^{-2}$, observe from (11) and (10) that
$$\|W\theta\|_2^2 = \|\widetilde W\tilde\theta\|_2^2 = \|\tilde\theta\|_2^2 + \tilde\theta^\top(\widetilde W^\top\widetilde W - I_p)\tilde\theta \ge \|\tilde\theta\|_2^2\bigl(1 - \|\widetilde W^\top\widetilde W - I_p\|_{k,\mathrm{op}}\bigr) \ge (4 - o(1))n\kappa_1\rho^2,$$
where the final bound comes from (20). Similarly we bound $\|W\theta\|_2^2 \le (4 + o(1))n\kappa_1\rho^2$. If $\rho^2 \ge \frac{2\sqrt{m\log p}}{n\kappa_1}$, then from the above display, we have $\|W\theta\|_2^2 \ge (8 - o(1))\sqrt{m\log p} \gg \log p$ and $\|W\theta\|_2^2 \le (8 + o(1))\sqrt{m\log p} \ll m$, and so
$$\begin{aligned} m + \|W\theta\|_2^2 - 2\sqrt{(m + 2\|W\theta\|_2^2)\log(1/\delta)} - \eta &\ge \|W\theta\|_2^2 - 2^{5/2}\sqrt{(m + 2\|W\theta\|_2^2)\log p} - 4\log p\\ &\ge (1 - o(1))\|W\theta\|_2^2 - 2^{5/2}(1 + o(1))\sqrt{m\log p} > 0\end{aligned}$$
asymptotically. Consequently, $\mathbb P(\|Z\|_2^2 \le \eta) \le p^{-2}$ and, by the Borel–Cantelli lemma, we have $\psi^{\mathrm{dense}}_\eta(X_1, X_2, Y_1, Y_2) \stackrel{\mathrm{a.s.}}{\longrightarrow} 1$.
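As with the sparse case, the dense rejection rule can be sketched directly. The following is an illustration only (not the paper's code): it simulates $\|Z\|_2^2$ under the null (central $\chi^2_m$) and under an alternative whose non-centrality is taken to be $8\sqrt{m\log p}$, matching the regime in the proof, and compares both with $\eta = m + 2^{3/2}\sqrt{m\log p} + 4\log p$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 500, 400

def dense_test(Z, m, p):
    """Return 1 (reject) iff ||Z||_2^2 >= eta = m + 2^{3/2} sqrt(m log p) + 4 log p."""
    eta = m + 2 ** 1.5 * np.sqrt(m * np.log(p)) + 4 * np.log(p)
    return int(np.sum(Z ** 2) >= eta)

Z_null = rng.normal(size=m)                 # ||Z||^2 ~ chi^2_m under the null
mu = rng.normal(size=m)
mu *= np.sqrt(8 * np.sqrt(m * np.log(p))) / np.linalg.norm(mu)
Z_alt = mu + rng.normal(size=m)             # noncentral chi^2, ncp = 8 sqrt(m log p)

print(dense_test(Z_null, m, p), dense_test(Z_alt, m, p))
```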

Proof of Theorem 7. We follow the proof of Theorem 5 up to (26). Let $B \sim \mathrm{Bin}(k, k/p)$. By Hoeffding (1963, Theorem 4) and the fact that $\cosh(x) \le e^{x^2/2}$ for all $x \in \mathbb R$, we have
$$\mathbb E\bigl\{\cosh^{|J\cap J'|}(\vartheta^2)\bigr\} \le \mathbb E e^{|J\cap J'|\vartheta^4/2} \le \mathbb E e^{\vartheta^4 B/2} = \Bigl\{1 + \frac{k}{p}\bigl(e^{\vartheta^4/2} - 1\bigr)\Bigr\}^k \le \exp\Bigl\{\frac{k^2}{p}\bigl(e^{\vartheta^4/2} - 1\bigr)\Bigr\}.$$
If $\alpha \in [1/2, 1]$ and $\rho = O(n^{-1/4}\log^{-3/4} p)$, then by (25), $\vartheta^4 = O(p^{1-2\alpha}) = O(1)$, and hence $(e^{\vartheta^4/2} - 1)k^2/p = (1/2 + o(1))k^2\vartheta^4/p = o(1)$. By (26), $P_0(\mathcal L^2) = 1 + o(1)$ and we conclude as in the proof of Theorem 5.

Proof of Proposition 8. As in the proof of Theorem 5, it suffices to control $P_0(\mathcal L^2)$ for some choice of prior $\pi$. We write $\lambda_{\min}(W^\top W)$ for the minimum eigenvalue of $W^\top W$ and let $\theta$ be an associated eigenvector with $\ell_2$ norm equal to $\rho$. We choose $\pi$ to be the Dirac measure at $\theta$. Then by (24), we have
$$P_0(\mathcal L^2) = e^{\theta^\top W^\top W\theta} = e^{\rho^2\lambda_{\min}(W^\top W)}.$$
When $p \ge n_1$ or $p \ge n_2$, by Lemma 9, $W^\top W = (X_1^\top X_1)(X^\top X)^{-1}(X_2^\top X_2)$ is singular. Hence $\lambda_{\min}(W^\top W) = 0$ and $P_0(\mathcal L^2) = 1$, which implies that $\mathcal M_X(k,\rho) = 1$.

On the other hand, if $p < \min\{n_1, n_2\}$, we work on the almost sure event where (9) and (11) hold. Let $T$ and $\Lambda$ be defined as in the proof of Proposition 2; then by (15), we have
$$\lambda_{\min}(W^\top W) \le 4\|T\|_{\mathrm{op}}^2\,\lambda_{\min}\bigl(\Lambda(I - \Lambda)\bigr) \le 4\|X^\top X\|_{\mathrm{op}}\min\bigl\{\lambda_{\min}(\Lambda),\ 1 - \lambda_{\max}(\Lambda)\bigr\}.$$


Using tail bounds for the operator norm of a Gaussian random matrix (see, e.g., Wainwright, 2019, Theorem 6.1), we have
$$\|X^\top X\|_{\mathrm{op}} \le n\biggl(1 + \sqrt{\frac{p}{n}} + \sqrt{\frac{2\log p}{n}}\biggr)^2 \le 5n$$
asymptotically with probability 1. Moreover, by Bai et al. (2015, Theorem 1.1), there is an almost sure event on which the empirical spectral distribution of $\Lambda$ converges weakly to a distribution supported on $[t_\ell, t_r]$, for $t_\ell$ and $t_r$ defined in (28). We will work on this almost sure event henceforth. For $p/n_1 \to \xi \in [\varepsilon, 1)$ and $p/n_2 \to \eta \in [\varepsilon, 1)$, we have $\limsup_{p\to\infty}\lambda_{\min}(\Lambda) \le t_\ell$ and $\liminf_{p\to\infty}\lambda_{\max}(\Lambda) \ge t_r$. On the other hand, Taylor expanding the expressions for $t_\ell$ and $t_r$ in (28) with respect to $1 - \xi$ and $1 - \eta$ respectively, we obtain that

$$t_\ell = \frac{1}{4\eta}(1 - \xi)^2 + O_\varepsilon\bigl((1 - \xi)^3\bigr), \qquad 1 - t_r = \frac{1}{4\xi}(1 - \eta)^2 + O_\varepsilon\bigl((1 - \eta)^3\bigr).$$

Therefore, $\min\{\lambda_{\min}(\Lambda),\ 1 - \lambda_{\max}(\Lambda)\} = O_\varepsilon(\min\{(1 - \xi)^2, (1 - \eta)^2\})$. Using the condition on $\rho^2$, we have
$$\rho^2\lambda_{\min}(W^\top W) = o\biggl(\max\biggl\{\frac{1}{(1 - \xi)^2 p},\ \frac{1}{(1 - \eta)^2 p}\biggr\}\biggr)\,O_\varepsilon\bigl(n\min\{(1 - \xi)^2, (1 - \eta)^2\}\bigr) = o(1),$$
which implies that $P_0(\mathcal L^2) = 1 + o(1)$ and $\mathcal M_X \stackrel{\mathrm{a.s.}}{\longrightarrow} 1$.

6 Ancillary results

Lemma 9. Let $n_1, n_2, p, m$ be positive integers such that $n_1 + n_2 = p + m = n$. Let $X = (X_1^\top, X_2^\top)^\top \in \mathbb R^{n\times p}$ be a non-singular matrix with block components $X_1 \in \mathbb R^{n_1\times p}$ and $X_2 \in \mathbb R^{n_2\times p}$. Let $A_1 \in \mathbb R^{n_1\times m}$ and $A_2 \in \mathbb R^{n_2\times m}$ be chosen to satisfy (7). Then
$$X_1^\top A_1A_1^\top X_1 = -X_1^\top A_1A_2^\top X_2 = (X_1^\top X_1)(X^\top X)^{-1}(X_2^\top X_2).$$

Proof. The first equality follows immediately from (7). Define $\widetilde X_1 := X_1(X^\top X)^{-1/2}$ and $\widetilde X_2 := X_2(X^\top X)^{-1/2}$. Then $\widetilde X := (\widetilde X_1^\top, \widetilde X_2^\top)^\top$ has orthonormal columns with the same column span as $X$, and so
$$\begin{pmatrix}\widetilde X_1 & A_1\\ \widetilde X_2 & A_2\end{pmatrix} \in \mathbb O^{n\times n}.$$
In particular, $\widetilde X_1\widetilde X_1^\top + A_1A_1^\top = I_{n_1}$. Therefore,
$$X_1^\top A_1A_1^\top X_1 = X_1^\top(I_{n_1} - \widetilde X_1\widetilde X_1^\top)X_1 = X_1^\top X_1 - X_1^\top X_1(X^\top X)^{-1}X_1^\top X_1 = X_1^\top X_1(X^\top X)^{-1}(X^\top X - X_1^\top X_1) = (X_1^\top X_1)(X^\top X)^{-1}(X_2^\top X_2),$$
where the last equality holds by noting the block structure of $X$.


Lemma 10. For $X_1 \in \mathbb R^{n_1\times p}$ and $X_2 \in \mathbb R^{n_2\times p}$, define $L := (X_1^\top X_1 + X_2^\top X_2)^{-1}(X_2^\top X_2 - X_1^\top X_1)$, $\widetilde X_1 := X_1(L + I_p)$ and $\widetilde X_2 := X_2(L - I_p)$. We have
$$\widetilde X_1^\top\widetilde X_1 + \widetilde X_2^\top\widetilde X_2 = 4X_1^\top X_1(X_1^\top X_1 + X_2^\top X_2)^{-1}X_2^\top X_2.$$

Proof. Write $G_1 := X_1^\top X_1$ and $G_2 := X_2^\top X_2$. It is clear that
$$L - I_p = -2(X_1^\top X_1 + X_2^\top X_2)^{-1}X_1^\top X_1 = -2(G_1 + G_2)^{-1}G_1, \qquad L + I_p = 2(X_1^\top X_1 + X_2^\top X_2)^{-1}X_2^\top X_2 = 2(G_1 + G_2)^{-1}G_2.$$
Therefore, we have
$$\begin{aligned}\frac14\bigl(\widetilde X_1^\top\widetilde X_1 + \widetilde X_2^\top\widetilde X_2\bigr) &= \frac14\bigl\{(L + I_p)^\top X_1^\top X_1(L + I_p) + (L - I_p)^\top X_2^\top X_2(L - I_p)\bigr\}\\ &= G_2(G_1 + G_2)^{-1}G_1(G_1 + G_2)^{-1}G_2 + G_1(G_1 + G_2)^{-1}G_2(G_1 + G_2)^{-1}G_1\\ &= -G_1(G_1 + G_2)^{-1}G_1(G_1 + G_2)^{-1}G_2 - G_1(G_1 + G_2)^{-1}G_2(G_1 + G_2)^{-1}G_2 + 2G_1(G_1 + G_2)^{-1}G_2\\ &= G_1(G_1 + G_2)^{-1}G_2.\end{aligned}$$
The proof is complete by recalling the definitions of $G_1$ and $G_2$.
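Lemma 10 is a purely algebraic identity, so it is easy to sanity-check numerically; the sketch below is an illustration with arbitrary dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, p = 30, 25, 10
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(size=(n2, p))

G1, G2 = X1.T @ X1, X2.T @ X2
L = np.linalg.solve(G1 + G2, G2 - G1)     # L = (G1 + G2)^{-1}(G2 - G1)
Xt1 = X1 @ (L + np.eye(p))
Xt2 = X2 @ (L - np.eye(p))

lhs = Xt1.T @ Xt1 + Xt2.T @ Xt2
rhs = 4 * G1 @ np.linalg.solve(G1 + G2, G2)
print(np.max(np.abs(lhs - rhs)))          # numerically zero up to rounding
```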

The following lemma concerns the control of the $k$-operator norm of a symmetric matrix. Similar results have been derived in previous works (see, e.g. Wang, Berthet and Samworth, 2016, Lemma 2). For completeness, we include a statement and proof of the specific version we use.

Lemma 11. For any symmetric matrix $M \in \mathbb R^{p\times p}$ and $k \in [p]$, there exists a subset $\mathcal N \subseteq \mathcal S^{p-1}$ such that $|\mathcal N| \le \binom{p}{k}9^k$ and
$$\|M\|_{k,\mathrm{op}} \le 2\sup_{u\in\mathcal N} u^\top Mu.$$

Proof. Define $\mathcal B_0(k) := \bigcup_{J\subseteq[p],\,|J|=k} S_J$, where $S_J := \{v \in \mathcal S^{p-1} : v_i = 0\ \text{for all}\ i\notin J\}$. For each $S_J$, we find a $1/4$-net $\mathcal N_J$ of cardinality at most $9^k$ (Vershynin, 2012, Lemma 5.2). Define $\mathcal N := \bigcup_{J\subseteq[p],\,|J|=k}\mathcal N_J$, which has the desired upper bound on its cardinality. By construction, for $v \in \arg\max_{u\in\mathcal B_0(k)} u^\top Mu$, there exists $\tilde v \in \mathcal N$ such that $|\mathrm{supp}(v)\cup\mathrm{supp}(\tilde v)| \le k$ and $\|v - \tilde v\|_2 \le 1/4$. We have
$$\|M\|_{k,\mathrm{op}} = v^\top Mv = v^\top M(v - \tilde v) + (v - \tilde v)^\top M\tilde v + \tilde v^\top M\tilde v \le 2\|v - \tilde v\|_2\|M\|_{k,\mathrm{op}} + \tilde v^\top M\tilde v \le \frac12\|M\|_{k,\mathrm{op}} + \sup_{u\in\mathcal N}u^\top Mu.$$
The desired inequality is obtained after rearranging terms in the above display.
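For small $p$, the $k$-operator norm can be computed by brute force, which is useful for checking bounds like Lemma 11 numerically. The sketch below assumes the definition $\|M\|_{k,\mathrm{op}} = \sup\{|u^\top Mu| : \|u\|_2 = 1, \|u\|_0 \le k\}$, which for symmetric $M$ equals the largest spectral norm of a $k\times k$ principal submatrix; it is an illustrative computation only.

```python
import numpy as np
from itertools import combinations

def k_op_norm(M, k):
    """Brute-force sup of |u' M u| over k-sparse unit vectors, for symmetric M.

    Restricted to a support set J, the supremum is the spectral norm of M[J, J],
    so we maximise over all k x k principal submatrices.
    """
    p = M.shape[0]
    return max(np.max(np.abs(np.linalg.eigvalsh(M[np.ix_(J, J)])))
               for J in combinations(range(p), k))

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 8))
M = (A + A.T) / 2
print([k_op_norm(M, k) for k in (1, 3, 8)])   # non-decreasing in k
```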

The following lemma describes the asymptotic limits of the nuclear and Frobenius norms of the product of a matrix-variate Beta-distributed random matrix and its reflection.

Recall that for $n_1 + n_2 > p$, we say that a $p\times p$ random matrix $B$ follows a matrix-variate Beta distribution with parameters $n_1/2$ and $n_2/2$, written $B \sim \mathrm{Beta}_p(n_1/2, n_2/2)$, if $B = (S_1 + S_2)^{-1/2}S_1(S_1 + S_2)^{-1/2}$, where $S_1 \sim W_p(n_1, I_p)$ and $S_2 \sim W_p(n_2, I_p)$ are independent Wishart matrices and $(S_1 + S_2)^{1/2}$ is the symmetric matrix square root of $S_1 + S_2$. Recall also that the spectral distribution function of any $p\times p$ matrix $A$ is defined as $F^A(t) := p^{-1}\sum_{i=1}^p \mathbb 1_{\{\lambda_i^A \le t\}}$, where the $\lambda_i^A$ are the eigenvalues (counted with multiplicity) of $A$. Further, given a sequence $(A_n)_{n\in\mathbb N}$ of matrices, their limiting spectral distribution function $F$ is defined as the weak limit of the $F^{A_n}$, if it exists.

Lemma 12. Let $B \sim \mathrm{Beta}_p(n_1/2, n_2/2)$ and suppose that $\lambda_1,\dots,\lambda_p$ are the eigenvalues of $B$. Define $a = (a_1,\dots,a_p)^\top$, with $a_j = \lambda_j(1 - \lambda_j)$ for $j \in [p]$. In the asymptotic regime of (C2), we have
$$\|a\|_1/p \stackrel{\mathrm{a.s.}}{\longrightarrow} \kappa_1, \qquad \|a\|_2/\sqrt p \stackrel{\mathrm{a.s.}}{\longrightarrow} \kappa_2,$$
where
$$\kappa_1 = \frac{r}{(1 + r)^2(1 + s)} \quad\text{and}\quad \kappa_2^2 = \frac{r(r + s - rs + r^2s + rs^2)}{(1 + r)^4(1 + s)^3}.$$

Proof. We first look at the limiting spectral distribution of $B$. From the asymptotic relations between $n_1$, $n_2$ and $p$ in (C2), we have that
$$p/n_1 \to \xi := \frac{s + sr}{r + sr} \quad\text{and}\quad p/n_2 \to \eta := \frac{s + sr}{1 + s}.$$

Define the left and right limits
$$t_\ell,\, t_r := \frac{(\xi + \eta)\eta + \xi\eta(\xi - \eta) \mp 2\xi\eta\sqrt{\xi - \xi\eta + \eta}}{(\xi + \eta)^2}. \quad (28)$$
By Bai et al. (2015, Theorem 1.1), almost surely, the weak limit $F$ of $F^B$ exists and is of the form $\max\{1 - 1/\xi, 0\}\delta_0 + \max\{1 - 1/\eta, 0\}\delta_1 + \mu$, where $\delta_0$ and $\delta_1$ are point masses at $0$ and $1$ respectively, and $\mu$ has density
$$t \mapsto \frac{(\xi + \eta)\sqrt{(t_r - t)(t - t_\ell)}}{2\pi\xi\eta\, t(1 - t)}\,\mathbb 1_{[t_\ell, t_r]}(t)$$

with respect to the Lebesgue measure on $\mathbb R$. Define $h_1 : t \mapsto t(1 - t)$. By the portmanteau lemma (see, e.g. van der Vaart, 2000, Lemma 2.2), we have almost surely that
$$\|a\|_1/p = F^Bh_1 \to Fh_1 = \frac{\xi + \eta}{2\pi\xi\eta}\int_{t_\ell}^{t_r}\sqrt{(t_r - t)(t - t_\ell)}\,dt = \frac{\xi + \eta}{16\xi\eta}(t_r - t_\ell)^2 = \frac{r}{(1 + r)^2(1 + s)}.$$


Similarly, for $h_2 : t \mapsto t^2(1 - t)^2$, we have almost surely that
$$\begin{aligned}\|a\|_2^2/p \to Fh_2 &= \frac{\xi + \eta}{2\pi\xi\eta}\int_{t_\ell}^{t_r} t(1 - t)\sqrt{(t_r - t)(t - t_\ell)}\,dt\\ &= \frac{\xi + \eta}{256\xi\eta}(t_r - t_\ell)^2\bigl(8t_\ell - 5t_\ell^2 + 8t_r - 6t_\ell t_r - 5t_r^2\bigr) = \frac{r(r + s - rs + r^2s + rs^2)}{(r + 1)^4(s + 1)^3}.\end{aligned}$$
Defining $\kappa_1 := Fh_1$ and $\kappa_2 := (Fh_2)^{1/2}$, we arrive at the lemma.

The following result concerning the QR decomposition of a Gaussian random matrix is probably well known. However, since we did not find a result in this exact form in the existing literature, we have included a proof here for completeness. Recall that for $n \ge p$, the set $\mathbb O^{n\times p}$ can be equipped with a uniform probability measure that is invariant under the action of left multiplication by $\mathbb O^{n\times n}$ (see, e.g., the Stiefel manifold in Muirhead, 2009, Section 2.1.4).

Lemma 13. Suppose $n \ge p$ and $X$ is an $n\times p$ random matrix with independent $N(0, 1)$ entries. Write $X = HT$, with $H$ taking values in $\mathbb O^{n\times p}$ and $T$ an upper-triangular $p\times p$ matrix with non-negative diagonal entries. This decomposition is almost surely unique. Moreover, $H$ and $T$ are independent, with $H$ uniformly distributed on $\mathbb O^{n\times p}$ with respect to the invariant measure and $T = (t_{j,k})_{j,k\in[p]}$ having independent entries satisfying $t_{j,j}^2 \sim \chi^2_{n-j+1}$ and $t_{j,k} \sim N(0, 1)$ for $1 \le j < k \le p$.

Proof. The uniqueness of the QR decomposition follows since $X$ has rank $p$ almost surely. The marginal distribution of $T$ then follows from the Bartlett decomposition of $X^\top X$ (Muirhead, 2009, Theorem 3.2.4) and the relationship between the QR decomposition of $X$ and the Cholesky decomposition of $X^\top X$.

For any fixed $Q \in \mathbb O^{n\times n}$, we have $QX \stackrel{d}{=} X$. Since $\mathbb O^{n\times n}$ acts transitively (by left multiplication) on $\mathbb O^{n\times p}$, the joint density of $H$ and $T$ must be constant in $H$ for each value of $T$. In particular, we have that $H$ and $T$ are independent, and that $H$ is uniformly distributed on $\mathbb O^{n\times p}$ with respect to the translation-invariant measure.

The lemma below provides concentration inequalities for the norm of a multivariate normal random vector with near-identity covariance.

Lemma 14. Let $X \sim N_d(\mu, \Sigma)$. Suppose $\|\mu\|_2 = R$ and $\|\Sigma - I\|_{\mathrm{op}} \le 1/2$. Then
$$\mathbb P\Bigl[\|X\|_2^2 - (d + R^2)(1 + 2\|\Sigma - I\|_{\mathrm{op}}) \ge \bigl\{2\sqrt{(d + 2R^2)t} + 2t\bigr\}(1 + 2\|\Sigma - I\|_{\mathrm{op}})\Bigr] \le 2e^{-t},$$
$$\mathbb P\Bigl[\|X\|_2^2 - (d + R^2)(1 - 2\|\Sigma - I\|_{\mathrm{op}}) \le -\bigl\{2\sqrt{(d + 2R^2)t} + 2t\bigr\}(1 + 2\|\Sigma - I\|_{\mathrm{op}})\Bigr] \le 2e^{-t}.$$

    Proof. Define ๐‘Œ = ฮฃโˆ’1/2(๐‘‹ โˆ’ ๐œ‡). Then

    โ€–๐‘‹โ€–22 = ๐‘Œ โŠคฮฃ๐‘Œ + 2๐œ‡โŠคฮฃ1/2๐‘Œ + โ€–๐œ‡โ€–22 = โ€–๐‘Œ + ๐œ‡โ€–22 + ๐‘Œ โŠค(ฮฃโˆ’ ๐ผ)๐‘Œ + 2๐œ‡โŠค(ฮฃ1/2 โˆ’ ๐ผ)๐‘Œ.

    28

  • Observe that โ€–๐‘Œ + ๐œ‡โ€–22 โˆผ ๐œ’2๐‘‘(๐‘…2). By Birgรฉ (2001, Lemma 8.1), we have

    P{๏ธโ€–๐‘Œ + ๐œ‡โ€–22 โ‰ฅ ๐‘‘+๐‘…2 + 2

    โˆš๏ธ(๐‘‘+ 2๐‘…2)๐‘ก+ 2๐‘ก

    }๏ธโ‰ค ๐‘’โˆ’๐‘ก (29)

    P{๏ธโ€–๐‘Œ + ๐œ‡โ€–22 โ‰ค ๐‘‘+๐‘…2 โˆ’ 2

    โˆš๏ธ(๐‘‘+ 2๐‘…2)๐‘ก

    }๏ธโ‰ค ๐‘’โˆ’๐‘ก (30)

    On the other hand, we also have

    |๐‘Œ โŠค(ฮฃโˆ’ ๐ผ)๐‘Œ |+ |2๐œ‡โŠค(ฮฃ1/2 โˆ’ ๐ผ)๐‘Œ | โ‰ค โ€–๐‘Œ โ€–22โ€–ฮฃโˆ’ ๐ผโ€–op + 2๐‘…โ€–๐‘Œ โ€–2โ€–ฮฃ1/2 โˆ’ ๐ผโ€–opโ‰ค (2โ€–๐‘Œ โ€–22 +๐‘…2)โ€–ฮฃโˆ’ ๐ผโ€–op, (31)

    where the final inequality follows from a Taylor expansion of {๐ผ+ (ฮฃโˆ’ ๐ผ)}1/2 after notingโ€–ฮฃ โˆ’ ๐ผโ€–op โ‰ค 1/2 and the elementary inequality 2๐‘Ž๐‘ โ‰ค ๐‘Ž2 + ๐‘2 for ๐‘Ž, ๐‘ โˆˆ R. By Laurentand Massart (2000, Lemma 1), we have with probability at least 1โˆ’ ๐‘’โˆ’๐‘ก that

    โ€–๐‘Œ โ€–22 โ‰ค ๐‘‘+ 2โˆš๐‘‘๐‘ก+ 2๐‘ก. (32)

    Combining (29), (31) and (32), we have with probability at least 1โˆ’ 2๐‘’โˆ’๐‘ก that

    โ€–๐‘‹โ€–22 โ‰ค{๏ธ๐‘‘+๐‘…2 + 2

    โˆš๏ธ(๐‘‘+ 2๐‘…2)๐‘ก+ 2๐‘ก

    }๏ธ(1 + 2โ€–ฮฃโˆ’ ๐ผโ€–op).

    Similarly, for the lower bound, we have again by (30), (31) and (32) that

    โ€–๐‘‹โ€–22 โ‰ฅ ๐‘‘+๐‘…2 โˆ’ 2โˆš๏ธ

    (๐‘‘+ 2๐‘…2)๐‘กโˆ’ {2๐‘‘+ 4โˆš๐‘‘๐‘ก+ 4๐‘ก+๐‘…2}โ€–ฮฃโˆ’ ๐ผโ€–op

    โ‰ฅ (๐‘‘+๐‘…2)(1โˆ’ 2โ€–ฮฃโˆ’ ๐ผโ€–op)โˆ’{๏ธ2โˆš๏ธ

    (๐‘‘+ 2๐‘…2)๐‘ก+ 2๐‘ก}๏ธ(1 + 2โ€–ฮฃโˆ’ ๐ผโ€–op).

    holds with probability at least 1โˆ’ 2๐‘’โˆ’๐‘ก.
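The chi-square tail bounds of Laurent and Massart (2000, Lemma 1) and Birgé (2001, Lemma 8.1) invoked in (29), (30) and (32) can be sanity-checked by simulation. The sketch below handles only the central case $\mu = 0$, $\Sigma = I$ (so $R = 0$), and compares empirical tail frequencies with $e^{-t}$; it is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(6)
d, t, reps = 50, 2.0, 100_000

norms = np.sum(rng.normal(size=(reps, d)) ** 2, axis=1)   # chi^2_d samples
upper = d + 2 * np.sqrt(d * t) + 2 * t    # upper-tail threshold from (29)/(32), R = 0
lower = d - 2 * np.sqrt(d * t)            # lower-tail threshold from (30), R = 0

print(np.mean(norms >= upper), np.exp(-t))   # empirical tail vs the bound e^{-t}
print(np.mean(norms <= lower), np.exp(-t))
```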

References

Arias-Castro, E., Candès, E. J. and Plan, Y. (2011) Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist., 39, 2533–2556.

Bai, Z., Hu, J., Pan, G. and Zhou, W. (2015) Convergence of the empirical spectral distribution function of Beta matrices. Bernoulli, 21, 1538–1574.

Birgé, L. (2001) An alternative point of view on Lepski's method. In de Gunst, M., Klaassen, C. and van der Vaart, A. (Eds.), Lecture Notes–Monograph Series, 36, 113–133.

Cai, T. T., Liu, W. and Xia, Y. (2014) Two-sample test of high dimensional means under dependence. J. Roy. Statist. Soc., Ser. B, 76, 349–372.

Charbonnier, C., Verzelen, N. and Villers, F. (2015) A global homogeneity test for high-dimensional linear regression. Electron. J. Statist., 9, 318–382.

Chen, S. X., Li, J. and Zhong, P.-S. (2019) Two-sample and ANOVA tests for high dimensional means. Ann. Statist., 47, 1443–1474.

Chow, G. C. (1960) Tests of equality between sets of coefficients in two linear regressions. Econometrica, 28, 591–605.

Dicker, L. H. (2014) Variance estimation in high-dimensional linear models. Biometrika, 101, 269–284.

Donoho, D. and