Two-sample testing of high-dimensional linear regression coefficients via complementary sketching
Fengnan Gao* and Tengyao Wang†
(November 27, 2020)
Abstract
We introduce a new method for two-sample testing of high-dimensional linear regression coefficients without assuming that those coefficients are individually estimable. The procedure works by first projecting the matrices of covariates and response vectors along directions that are complementary in sign in a subset of the coordinates, a process which we call "complementary sketching". The resulting projected covariates and responses are aggregated to form two test statistics, which are shown to have essentially optimal asymptotic power under a Gaussian design when the difference between the two regression coefficients is sparse and dense respectively. Simulations confirm that our methods perform well in a broad class of settings.
1 Introduction

Two-sample testing problems are commonplace in statistical applications across different scientific fields, wherever researchers want to compare observations from different samples. In its most basic form, a two-sample Gaussian mean testing problem is formulated as follows: upon observing two samples X₁, …, X_{n₁} i.i.d. ~ N(μ₁, σ²) and Y₁, …, Y_{n₂} i.i.d. ~ N(μ₂, σ²), we wish to test

H₀ : μ₁ = μ₂ versus H₁ : μ₁ ≠ μ₂. (1)

This leads to the famous two-sample Student's t-test. In a slightly more involved form in the parametric setting, we observe X₁, …, X_{n₁} i.i.d. ~ F_{θ₁,γ₁} and Y₁, …, Y_{n₂} i.i.d. ~ F_{θ₂,γ₂} and would like to test H₀ : θ₁ = θ₂ versus H₁ : θ₁ ≠ θ₂, where γ₁ and γ₂ are nuisance parameters.
Linear regression models have long been one of the staples of statistics. A two-sample testing problem in linear regression arises in the following classical setting: fix p ≪ min{n₁, n₂}; we observe an n₁-dimensional response vector Y₁ with an associated design
*Fudan University. Email: [email protected]
†University College London. Email: [email protected]
matrix X₁ ∈ ℝ^{n₁×p} in the first sample, and an n₂-dimensional response Y₂ with design matrix X₂ ∈ ℝ^{n₂×p} in the second sample. We assume that in both samples the responses are generated from standard linear models

Y₁ = X₁β₁ + ε₁,  Y₂ = X₂β₂ + ε₂, (2)

for some unknown regression coefficients β₁, β₂ ∈ ℝ^p and independent homoscedastic noise vectors ε₁ | (X₁, X₂) ~ N_{n₁}(0, σ²I_{n₁}) and ε₂ | (X₁, X₂) ~ N_{n₂}(0, σ²I_{n₂}). The purpose is to test H₀ : β₁ = β₂ versus H₁ : β₁ ≠ β₂. Suppose that β̂ is the least squares estimate of β = β₁ = β₂ under the null hypothesis and β̂₁, β̂₂ are the least squares estimates of β₁ and β₂ respectively under the alternative hypothesis. Define the residual sums of squares as

RSS₁ = ‖Y₁ − X₁β̂₁‖₂² + ‖Y₂ − X₂β̂₂‖₂²,  RSS₀ = ‖Y₁ − X₁β̂‖₂² + ‖Y₂ − X₂β̂‖₂². (3)
The classical generalised likelihood ratio test (Chow, 1960) compares the F-statistic

F = [(RSS₀ − RSS₁)/p] / [RSS₁/(n₁ + n₂ − 2p)] ~ F_{p, n₁+n₂−2p} (4)

against upper quantiles of the F_{p, n₁+n₂−2p} distribution. It is well known that in the classical asymptotic regime where p is fixed and n₁, n₂ → ∞, the above generalised likelihood ratio test is asymptotically optimal.
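The classical test above can be sketched in a few lines (an illustrative implementation of (3) and (4); the function and variable names are ours, not the authors'):

```python
import numpy as np

def chow_f_statistic(X1, y1, X2, y2):
    """F-statistic (4) for testing H0: beta1 = beta2 with fixed p < min(n1, n2)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    # Separate least-squares fits under the alternative
    b1 = np.linalg.lstsq(X1, y1, rcond=None)[0]
    b2 = np.linalg.lstsq(X2, y2, rcond=None)[0]
    rss1 = np.sum((y1 - X1 @ b1) ** 2) + np.sum((y2 - X2 @ b2) ** 2)
    # Pooled fit under the null
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    rss0 = np.sum((y - X @ b0) ** 2)
    df = n1 + n2 - 2 * p
    F = ((rss0 - rss1) / p) / (rss1 / df)
    return F, (p, df)   # compare F against upper quantiles of F_{p, df}
```

Since the null model is nested in the alternative, RSS₀ ≥ RSS₁ always holds and the returned statistic is nonnegative.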
High-dimensional datasets are ubiquitous in the contemporary era of Big Data. As the dimension p of modern data in genetics, signal processing, econometrics and other fields is often comparable to the sample size n, the most significant challenge of high-dimensional data is that the fixed-p-large-n setup prevalent in classical statistical inference is no longer valid. Yet the philosophy remains true that statistical inference is only possible when the sample size is sufficiently large relative to the true parameter size. Most advances in high-dimensional statistical inference so far have been made under some "sparsity" condition, i.e., all but a small (often vanishing) fraction of the p-dimensional model parameters are zero. This assumption in effect reduces the parameter size to an estimable level, and it makes sense in many applications because often only a few covariates are really responsible for the response, though the identification of these few covariates is still a nontrivial task. In the high-dimensional regression setting Y = Xβ + ε, where Y ∈ ℝ^n, X ∈ ℝ^{n×p} and β ∈ ℝ^p with p, n → ∞ simultaneously, a common assumption is that k log p/n → 0, where k = ‖β‖₀ := Σ_{j=1}^p 1_{β_j ≠ 0}. Here, k is the true parameter size, which vanishes relative to the sample size n, and log p is understood as the penalty to pay for not knowing where the k true parameters are.
Aiming to take a step towards understanding the fundamental aspects of two-sample hypothesis testing in high dimensions, this paper is primarily concerned with the following testing problem: we need to decide whether or not the responses in the two samples have different linear dependencies on the covariates. More specifically, under the same regression setting as in (2) with min{n, p} → ∞, we wish to test the global null hypothesis

H₀ : β₁ = β₂ (5)
against the composite alternative
H₁ : ‖β₁ − β₂‖₂ ≥ 2ρ, ‖β₁ − β₂‖₀ ≤ k. (6)

In other words, we assume that under the alternative hypothesis, the difference between the two regression coefficients is a k-sparse vector with ℓ₂ norm at least 2ρ (the additional factor of 2 here exists to simplify relevant statements under the reparametrisation that we will introduce later in Section 2). Throughout this paper, we do not assume sparsity of β₁ or β₂ under the alternative.
Classical F-tests no longer work well for the above testing problem, for the simple reason that it is not possible to obtain good estimates of the β's through naive least squares estimators, which are needed to compute the residual sums of squares in (3) that measure the models' goodness of fit. A standard way out is to impose some kind of sparsity on both β₁ and β₂ to ensure that both quantities are estimable. To the best of our knowledge, this is the off-the-shelf approach taken by most of the literature; see, for instance, Städler and Mukherjee (2012); Xia, Cai and Cai (2015). Nevertheless, it is both more interesting and more relevant in applications to study the testing problem where neither β₁ nor β₂ is estimable but only β₁ − β₂ is sparse.
As an example of an application where such assumptions arise naturally, consider the area of differential networks. Here, researchers are interested in whether two biological networks formulated as Gaussian graphical models, such as brain connectivity networks and gene–gene interaction networks (Xia, Cai and Cai, 2015; Charbonnier, Verzelen and Villers, 2015), differ between two subgroups of a population. Such complex networks are mostly high-dimensional in nature, in the sense that the number of nodes or features in the networks is large relative to the number of observations. These networks are often dense, as interactions among different brain regions or genes are omnipresent; but because the two subpopulations are subject to roughly the same physiology, the differences between their networks are small, i.e., only a few edges differ from one network to the other. In this case of dense coefficients, a sparsity assumption makes no sense and it is impossible to obtain reasonable estimates of either regression coefficient β₁ or β₂ when p is of the same magnitude as n. For this reason, any approach to detecting the difference between β₁ and β₂ that is built upon comparing estimates of β₁ and β₂ fails. In fact, any inference on β₁ or β₂ is impossible unless we make some other stringent structural assumptions on the model. However, certain inference on the coefficient difference β₁ − β₂, such as testing the zero null against the sparse alternative, is feasible by exploiting the sparse difference between the networks without stringent assumptions.
1.1 Related work

The two-sample testing problem in its most general form is not well understood in high dimensions. Most of the existing literature has focused on testing the equality of means, namely the high-dimensional analogue of (1); see, e.g., Cai, Liu and Xia (2014); Chen, Li and Zhong (2019). Similarly to our setup, in the mean testing problem, one may allow for non-sparse means in each sample and test only for sparse differences between the two population means (Cai, Liu and Xia, 2014). The intuitive approach for testing equality
of means is to eliminate the dense nuisance parameter by taking the difference of the sample means, thus reducing the task to a one-sample problem of testing a sparse mean against the global zero null, also known as the "needle in a haystack" problem, well studied previously by e.g. Ingster (1997); Donoho and Jin (2004). Such a reduction, however, is more intricate in the regression problem, as a result of the different design matrices of the two samples.
The literature on two-sample testing in the high-dimensional regression setting is scarce. Städler and Mukherjee (2012), Xia, Cai and Cai (2018) and Xia, Cai and Sun (2020) have proposed methods that work under additional assumptions ensuring that both β₁ and β₂ can be consistently estimated. Charbonnier, Verzelen and Villers (2015) and Zhu and Bradic (2016) are the only existing works in the literature we are aware of that allow for non-sparse regression coefficients β₁ and β₂. Specifically, Charbonnier, Verzelen and Villers (2015) look at a sequence of possible supports of β₁ and β₂ on a Lasso-type solution path and then apply a variant of the classical F-tests to the lower-dimensional problems restricted to these supports, with the test p-values adjusted by a Bonferroni correction. Zhu and Bradic (2016) (after some elementary transformation) use a Dantzig-type selector to obtain an estimate of (β₁ + β₂)/2 and then use it to construct a test statistic based on a specific moment condition satisfied under the null hypothesis. As both tests depend on the estimation of nuisance parameters, their power can be compromised if such nuisance parameters are dense.
1.2 Our contributions

Our contributions are four-fold.
1. We propose a novel method to solve the testing problem formulated in (5) and (6) for model (2). Through complementary sketching, a delicate linear transformation of both the designs and the responses, our method turns the testing problem with two different designs into one with a single design of dimension m × p, where m = n₁ + n₂ − p. After taking the difference of the two regression coefficients, the problem is reduced to testing whether the coefficient in the reduced one-sample regression is zero against sparse alternatives. The transformation is carefully chosen so that the error distribution in the reduced one-sample regression is homoscedastic. This paves the way for constructing test statistics using the transformed covariates and responses. Our method is easy to implement and does not involve any complications arising from solving computationally expensive optimisation problems. Moreover, when complementary sketching is combined with any method designed for one-sample problems, our proposal supplies a novel class of testing and estimation procedures for the corresponding two-sample problems.
2. In the sparse regime, where the sparsity parameter k ~ p^α in the alternative (6) for any fixed α ∈ (0, 1/2), we show that the detection limit of our procedure, defined as the minimal ‖β₁ − β₂‖₂ necessary for asymptotic almost sure separation of the alternative from the null, is minimax optimal up to a multiplicative constant under a Gaussian design. More precisely, we show that in the asymptotic regime where n₁, n₂, p diverge at a fixed ratio, if ρ ≥ √(8k log p/(mρ₁)), where ρ₁ is a constant depending on n₁/n₂ and p/m only, then our test has asymptotic power 1 almost surely. On the other hand, in the same asymptotic regime, if ρ ≤ √(c_α k log p/(mρ₁)) for some c_α depending on α only, then almost surely no test has asymptotic size 0 and power 1.
3. Our results reveal the effective sample size of the two-sample testing problem. Here, by effective sample size, we mean the sample size of a corresponding one-sample testing problem (i.e. testing β = 0 in a linear model Y = Xβ + ε with rows of X following the same distribution as the rows of X₁ and X₂) that has an asymptotically equal detection limit; see the discussion after Theorem 5 for a detailed definition. At first glance, one might think that the effective sample size is m, the number of rows of the reduced design. This hints that the reduction to the one-sample problem has made the original two-sample problem obsolete. However, on deeper thought, as an imbalance between the numbers of observations in X₁ and X₂ clearly makes testing more difficult, the effective sample size has to incorporate this effect as well. We see from the previous point that, uniformly for any α less than and bounded away from 1/2, the detection boundary is of order ρ² ≍ k log p/(mρ₁), with the precise definition of ρ₁ given in Lemma 12. Writing n₁/n₂ → r and p/m → ν, our results in the sparse case imply that the two-sample testing problem has the same order of detection limit as a one-sample problem with sample size mρ₁ = n(r⁻¹ + r + 2)⁻¹(1 + ν)⁻¹. We note that this effective sample size is proportional to m and, for each fixed ν, is maximised when r = 1 (i.e. n₁ = n₂) and approaches n₁m/n in the most imbalanced designs. This is in agreement with the intuition that testing is easiest when n₁ = n₂ and impossible when n₁ and n₂ are too imbalanced. Our study thus sheds light on the intrinsic difference between two-sample and one-sample testing problems and characterises the precise dependence of the difficulty of the two-sample problem on the sample size and dimensionality parameters.
4. We observe a phase transition phenomenon in how the minimax detection limit depends on the sparsity parameter k. In addition to establishing the minimax rate-optimal detection limit of our procedure in the sparse case where k ≍ p^α for α ∈ [0, 1/2), we also prove that a modified version of our procedure, suited for denser signals, is able to achieve the minimax optimal detection limit up to logarithmic factors in the dense regime k ≍ p^α for α ∈ (1/2, 1). However, the detection limit is of order ρ ≍ √(k log p/(mρ₁)) in the sparse regime, but of order ρ ≍ n^{−1/4} up to logarithmic factors in the dense regime. Such a phase transition phenomenon is qualitatively similar to results previously reported for the one-sample testing problem (see, e.g. Ingster, Tsybakov and Verzelen, 2010; Arias-Castro, Candès and Plan, 2011).
1.3 Organisation of the paper

We describe our methodology in detail in Section 2 and establish its theoretical properties in Section 3. Numerical results illustrate the finite-sample performance of our proposed
algorithm in Section 4. Proofs of our main results are deferred until Section 5 withancillary results in Section 6.
1.4 Notation

For any positive integer p, we write [p] := {1, …, p}. For a vector v = (v₁, …, v_p)^⊤ ∈ ℝ^p, we define ‖v‖₀ := Σ_{j=1}^p 1_{v_j ≠ 0}, ‖v‖_∞ := max_{j∈[p]} |v_j| and ‖v‖_q := (Σ_{j=1}^p |v_j|^q)^{1/q} for any positive integer q, and let S^{p−1} := {v ∈ ℝ^p : ‖v‖₂ = 1}. The support of a vector v is defined by supp(v) := {j ∈ [p] : v_j ≠ 0}.
For n ≥ m, O^{n×m} denotes the space of n × m matrices with orthonormal columns. For u ∈ ℝ^n, we define diag(u) to be the n × n diagonal matrix with diagonal entries filled with the elements of u, i.e., (diag(u))_{i,j} = 1_{i=j} u_i. For A ∈ ℝ^{n×n}, define diag(A) to be the n × n diagonal matrix with diagonal entries coming from A, i.e., (diag(A))_{i,j} = 1_{i=j} A_{i,i}. We also write tr(A) := Σ_{i∈[n]} A_{i,i}. For a symmetric matrix A ∈ ℝ^{p×p} and k ∈ [p], the k-operator norm of A is defined by

‖A‖_{k,op} := sup_{v ∈ S^{p−1} : ‖v‖₀ ≤ k} |v^⊤Av|.
For any S ⊆ [p], we write v_S for the |S|-dimensional vector obtained by extracting the coordinates of v in S, and A_{S,S} for the matrix obtained by extracting the rows and columns of A indexed by S.

Given two sequences (a_n)_{n∈ℕ} and (b_n)_{n∈ℕ} such that b_n > 0 for all n, we write a_n = O(b_n) if |a_n| ≤ Cb_n for some constant C. If the constant C depends on some parameter x, we write a_n = O_x(b_n) instead. Also, a_n = o(b_n) denotes a_n/b_n → 0.
2 Testing via complementary sketching

In this section, we describe our testing strategy. Since we are only interested in the difference between the regression coefficients of the two linear models, we reparametrise (2) with γ := (β₁ + β₂)/2 and θ := (β₁ − β₂)/2 to separate the nuisance parameter from the parameter of interest. Define

Θ_{k,p}(ρ) := {θ ∈ ℝ^p : ‖θ‖₂ ≥ ρ and ‖θ‖₀ ≤ k}.

Under this new parametrisation, the null and alternative hypotheses can be equivalently formulated as

H₀ : θ = 0 and H₁ : θ ∈ Θ_{k,p}(ρ).

The parameter of interest θ is now k-sparse under the alternative hypothesis. However, its inference is confounded by the possibly dense nuisance parameter γ ∈ ℝ^p. A natural idea, then, is to eliminate the nuisance parameter from the model. In the special design setting where X₁ = X₂ (in particular, n₁ = n₂), this can be achieved by considering the sparse regression model Y₁ − Y₂ = 2X₁θ + (ε₁ − ε₂). While the above example only works in a special, idealised setting, it nevertheless motivates our general testing procedure.
To introduce our test, we first concatenate the design matrices and response vectors to form

X = (X₁^⊤, X₂^⊤)^⊤ and Y = (Y₁^⊤, Y₂^⊤)^⊤.
A key idea of our method is to project X and Y respectively along n − p pairs of directions that are complementary in sign on a subset of their coordinates, a process we call complementary sketching. Specifically, assume n₁ + n₂ > p, define n := n₁ + n₂ and m := n − p, and let A₁ ∈ ℝ^{n₁×m} and A₂ ∈ ℝ^{n₂×m} be chosen such that

A₁^⊤A₁ + A₂^⊤A₂ = I_m and A₁^⊤X₁ + A₂^⊤X₂ = 0. (7)

In other words, A := (A₁^⊤, A₂^⊤)^⊤ is a matrix with orthonormal columns that are orthogonal to the column space of X. Such A₁ and A₂ exist since the orthogonal complement of the column space of X has dimension at least m. Define Z := A₁^⊤Y₁ + A₂^⊤Y₂ and W := A₁^⊤X₁ − A₂^⊤X₂. From the above construction, we have

Z = A₁^⊤X₁β₁ + A₂^⊤X₂β₂ + (A₁^⊤ε₁ + A₂^⊤ε₂) = Wθ + ξ, (8)

where ξ | W ~ N_m(0, σ²A^⊤A) = N_m(0, σ²I_m). We note that, similarly to conventional sketching (see, e.g. Mahoney, 2011), the complementary sketching operation above synthesises m data points from the original n observations. However, unlike conventional sketching, where one projects the design X and response Y by the same sketching matrix S ∈ ℝ^{m×n} to obtain the sketched data (SX, SY), here we project X and Y along different directions to obtain (Ã^⊤X, A^⊤Y), where Ã := (A₁^⊤, −A₂^⊤)^⊤ is complementary in sign to A in its second block. Moreover, the main purpose of conventional sketching is to trade off statistical efficiency for computational speed by summarising the raw data with a smaller number of synthesised data points, whereas the main aim of our complementary sketching operation is to eliminate the nuisance parameter; surprisingly, as we will see in Section 3, essentially no loss of statistical efficiency is introduced by complementary sketching in this two-sample testing setting.
To summarise, after projecting X and Y via complementary sketching to obtain W and Z, we reduce the original two-sample testing problem to a one-sample problem with m observations, where we test the global null of θ = 0 against sparse alternatives using the data (W, Z). From here, we can construct test statistics as functions of W and Z, for which we describe two different tests. The first testing procedure, detailed in Algorithm 1, computes the sum of squares of hard-thresholded inner products between the response Z and the standardised columns of the design matrix W in (8). We denote the output of Algorithm 1 with inputs X₁, X₂, Y₁ and Y₂ and tuning parameters λ and τ by φ^{sparse}_{λ,τ}(X₁, X₂, Y₁, Y₂). As we will see in Section 3, the choice of λ = 2σ√(log p) and τ = kσ² log p would be suitable for testing against sparse alternatives in the case of k ≤ p^{1/2}. On the other hand, in the dense case when k > p^{1/2}, one option would be to choose λ = 0. However, it turns out to be difficult to set the test threshold level τ in this dense case using the known problem parameters. Therefore, we instead study the following second test. We apply Steps 1 to 4 of Algorithm 1 to obtain the vector Z, and then define our test as

φ^{dense}_τ(X₁, X₂, Y₁, Y₂) := 1_{‖Z‖₂² ≥ τ},
Algorithm 1: Pseudo-code for the complementary sketching-based test φ^{sparse}_{λ,τ}.
Input: X₁ ∈ ℝ^{n₁×p}, X₂ ∈ ℝ^{n₂×p}, Y₁ ∈ ℝ^{n₁}, Y₂ ∈ ℝ^{n₂} satisfying n₁ + n₂ − p > 0, a hard threshold level λ ≥ 0, and a test threshold level τ > 0.
1 Set m ← n₁ + n₂ − p.
2 Form A ∈ O^{n×m} with columns orthogonal to the column space of (X₁^⊤, X₂^⊤)^⊤.
3 Let A₁ and A₂ be the submatrices formed by the first n₁ and last n₂ rows of A.
4 Set W ← A₁^⊤X₁ − A₂^⊤X₂ and Z ← A₁^⊤Y₁ + A₂^⊤Y₂.
5 Compute t ← diag(W^⊤W)^{−1/2} W^⊤Z.
6 Compute the test statistic

T := Σ_{j=1}^p t_j² 1_{|t_j| ≥ λ}.

7 Reject the null hypothesis if T ≥ τ.
for a suitable choice of threshold level τ.

The computational complexity of both φ^{sparse}_{λ,τ} and φ^{dense}_τ depends on Step 2 of Algorithm 1. In practice, we can form the projection matrix A as follows. We first generate an n × m matrix M with independent N(0, 1) entries, and then project the columns of M onto the orthogonal complement of the column space of X to obtain M̃ := (I_n − XX^+)M, where X^+ is the Moore–Penrose pseudoinverse of X. Finally, we extract an orthonormal basis from the columns of M̃ via a QR decomposition M̃ = AR, where R is upper triangular and A is a (random) n × m matrix with orthonormal columns that can be used in Step 2 of Algorithm 1. The overall computational complexity of our tests is therefore of order O(n²p + np²). Finally, it is worth emphasising that while the matrix A generated this way is random, our test statistics T = Σ_{j=1}^p t_j² 1_{|t_j| ≥ λ} and ‖Z‖₂² are in fact deterministic. To see this, we observe that both

W^⊤Z = (A₁^⊤X₁ − A₂^⊤X₂)^⊤(A₁^⊤Y₁ + A₂^⊤Y₂) = (X₁^⊤, −X₂^⊤) AA^⊤ (Y₁^⊤, Y₂^⊤)^⊤

and ‖Z‖₂² = Y^⊤AA^⊤Y depend on A only through AA^⊤, which is determined by the column space of A. Moreover, by Lemma 9, the quantities (‖W_j‖₂²)_{j∈[p]}, being the diagonal entries of W^⊤W = 4X₁^⊤A₁A₁^⊤X₁, are also functions of X alone. This attests that both test statistics, and consequently our two tests, are deterministic in nature.
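The whole construction can be sketched in a few lines of numpy (an illustrative implementation of the recipe above, not the authors' code; names are ours):

```python
import numpy as np

def complementary_sketch(X1, X2, rng=None):
    """Form A = (A1; A2) with orthonormal columns orthogonal to col((X1; X2))
    via the Gaussian-initialisation + projection + QR recipe described above."""
    rng = np.random.default_rng(0) if rng is None else rng
    n1, p = X1.shape
    n2 = X2.shape[0]
    n, m = n1 + n2, n1 + n2 - p
    X = np.vstack([X1, X2])
    M = rng.standard_normal((n, m))
    M_tilde = M - X @ (np.linalg.pinv(X) @ M)   # project onto col(X)^perp
    A, _ = np.linalg.qr(M_tilde)                # n x m, orthonormal columns
    return A[:n1], A[n1:]

def sparse_test_statistic(X1, X2, y1, y2, lam):
    """Steps 1-6 of Algorithm 1: T = sum_j t_j^2 * 1{|t_j| >= lam}."""
    A1, A2 = complementary_sketch(X1, X2)
    W = A1.T @ X1 - A2.T @ X2                   # sketched design (m x p)
    Z = A1.T @ y1 + A2.T @ y2                   # sketched response (m,)
    t = (W.T @ Z) / np.sqrt(np.sum(W ** 2, axis=0))
    return np.sum(t ** 2 * (np.abs(t) >= lam))
```

Although the QR factor A is random, the resulting statistic depends on A only through AA^⊤, in line with the determinism discussion above.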
3 Theoretical analysis

We now turn to the analysis of the theoretical performance of φ^{sparse}_{λ,τ} and φ^{dense}_τ. We consider both the size and the power of each test, as well as minimax lower bounds for the smallest detectable signal strength.
In addition to working under the regression model (2), we further assume the followingconditions in our theoretical analysis.
(C1) All entries of X₁ and X₂ are independent standard normals.
(C2) The parameters n₁, n₂, p satisfy m = n₁ + n₂ − p > 0 and lie in the asymptotic regime where n₁/n₂ → r and p/m → ν as n₁, n₂, p → ∞.
The main purpose of assuming (C1) is to ensure that, after standardising its columns to have unit length, the matrix W computed in Step 4 of Algorithm 1 satisfies a restricted isometry condition with high probability, as in Proposition 2. In fact, as revealed in the proofs, even if the matrices X₁ and X₂ do not follow the Gaussian design, as long as W satisfies (9) and (11), all results in this section remain true. The condition n₁ + n₂ − p > 0 in (C2) is necessary in this two-sample problem, since otherwise, for any prescribed value of Δ := β₁ − β₂, the equation system with β₁ as the unknown

(X₁^⊤, X₂^⊤)^⊤ β₁ = (Y₁^⊤, (Y₂ + X₂Δ)^⊤)^⊤

has at least one solution whenever (X₁^⊤, X₂^⊤)^⊤ has rank n. As a result, except in some pathological cases, we can always find β₁, β₂ ∈ ℝ^p that fit the data perfectly with Y₁ = X₁β₁ and Y₂ = X₂β₂, which makes the testing problem impossible. Finally, we have carried out the proofs of our theoretical results with finite-sample arguments wherever possible. Nevertheless, due to the lack of finite-sample bounds on the empirical spectral density of matrix-variate Beta distributions, all results in this section are presented under the asymptotic regime set out in Condition (C2). Under this condition, we were able to exploit existing results from random matrix theory to obtain a sharp dependence of the detection limit on n and p.
In what follows, we also make the simplifying assumption that the noise variance σ² is known. By replacing Y₁, Y₂, ε₁, ε₂ with Y₁/σ, Y₂/σ, ε₁/σ and ε₂/σ, we may further assume that σ² = 1 without loss of generality, which we do for the rest of this section. In practice, if the noise variance is unknown, we can replace it with a consistent estimator σ̂² (see e.g. Reid, Tibshirani and Friedman, 2016, and references therein).
3.1 Sparse case

We consider in this subsection the test φ^{sparse}_{λ,τ}, which is suitable for distinguishing β₁ and β₂ that differ in a small number of coordinates; this setting exhibits the more subtle phenomena and hence is of most interest to us. Our first result below states that with a choice of hard-thresholding level λ of order √(log p), the test has asymptotic size 0.

Theorem 1. If Condition (C2) holds and β₁ = β₂, then, with the choice of parameters τ > 0 and λ = √((4 + ε) log p) for any ε > 0, we have

φ^{sparse}_{λ,τ}(X₁, X₂, Y₁, Y₂) → 0 almost surely.
The almost sure statements in Theorem 1 and subsequent results in this section are with respect to both the randomness in X = (X₁^⊤, X₂^⊤)^⊤ and the randomness in ε = (ε₁^⊤, ε₂^⊤)^⊤. However, a closer inspection of the proof of Theorem 1 tells us that the statement is still true if we
allow an arbitrary sequence of matrices X (indexed by n) and only consider almost sure convergence with respect to the distribution of ε.
The control of the asymptotic power of φ^{sparse}_{λ,τ} is more involved. A key step in the argument is to show that W^⊤W is suitably close to a multiple of the identity matrix. More precisely, in Proposition 2 below, we derive entrywise and k-operator norm controls on the Gram matrix of the sketched design W.
Proposition 2. Assume Conditions (C1) and (C2), let k ∈ [p] and let W be as defined in Algorithm 1. Then, with probability 1,

max_{j∈[p]} | (W^⊤W)_{j,j}/(4mρ₁) − 1 | → 0, (9)

where ρ₁ is as defined in Lemma 12. Moreover, define Ŵ = W diag(W^⊤W)^{−1/2}. If

k log(np/k)/m → 0, (10)

then there exists C_{r,ν} > 0, depending only on r and ν, such that with probability 1 the following holds for all but finitely many n:

‖Ŵ^⊤Ŵ − I‖_{k,op} ≤ C_{r,ν} √(k log(np/k)/m). (11)

We note that condition (10) is relatively mild and is satisfied whenever k ≤ p^α for any α ∈ [0, 1).
All theoretical results in this section except for Theorem 1 assume the random design Condition (C1). However, as revealed by the proofs, for any given (deterministic) sequence of X, these results remain true as long as (9) and (11) are satisfied. The asymptotic nature of Proposition 2 results from our application of Bai et al. (2015, Theorem 1.1), which guarantees almost sure convergence of the empirical spectral distribution of Beta random matrices in the weak topology. This sets the tone for the asymptotic nature of our results, which depend on the aforementioned limiting spectral distribution.
The following theorem provides a power control for our procedure φ^{sparse}_{λ,τ} when the ℓ₂ norm of the scaled difference in regression coefficients θ = (β₁ − β₂)/2 exceeds an appropriate threshold.

Theorem 3. Assume Conditions (C1) and (C2), let k ∈ [p] and suppose that (10) holds. If θ = (β₁ − β₂)/2 ∈ Θ_{k,p}(ρ) with ρ ≥ √(8k log p/(mρ₁)), and we set input parameters λ = 2√(log p) and τ ≤ 3k log p in Algorithm 1, then

φ^{sparse}_{λ,τ}(X₁, X₂, Y₁, Y₂) → 1 almost surely.
The size and power controls in Theorems 1 and 3 jointly provide an upper bound on the minimax detection threshold. Specifically, let P^X_{β₁,β₂} be the conditional distribution of Y₁, Y₂ given X₁, X₂ under model (2). Conditionally on the design matrices X₁ and X₂ and
given k ∈ [p] and ρ > 0, the (conditional) minimax risk of testing H₀ : β₁ = β₂ against H₁ : θ = (β₁ − β₂)/2 ∈ Θ_{k,p}(ρ) is defined as

M_X(k, ρ) := inf_φ { sup_{β∈ℝ^p} P^X_{β,β}(φ ≠ 0) + sup_{β₁,β₂∈ℝ^p : (β₁−β₂)/2 ∈ Θ_{k,p}(ρ)} P^X_{β₁,β₂}(φ ≠ 1) },

where we suppress all dependence on the dimensions of the data for notational simplicity and the infimum is taken over all tests φ : (X₁, Y₁, X₂, Y₂) ↦ {0, 1}. If M_X(k, ρ) → 0 in probability, there exists a test φ that with asymptotic probability 1 correctly distinguishes the null from the alternative. On the other hand, if M_X(k, ρ) → 1 in probability, then asymptotically no test can do better than a random guess. The following corollary provides an upper bound on the signal size ρ for which the minimax risk is asymptotically zero.
Corollary 4. Let k ∈ [p] and assume Conditions (C1), (C2) and (10). If ρ ≥ √(8k log p/(mρ₁)), and we set input parameters λ = 2√(log p) and τ ∈ (0, 3k log p] in Algorithm 1, then

M_X(k, ρ) ≤ sup_{β∈ℝ^p} P^X_{β,β}(φ^{sparse}_{λ,τ} ≠ 0) + sup_{β₁,β₂∈ℝ^p : (β₁−β₂)/2 ∈ Θ_{k,p}(ρ)} P^X_{β₁,β₂}(φ^{sparse}_{λ,τ} ≠ 1) → 0 almost surely.
Corollary 4 shows that the test φ^{sparse}_{λ,τ} has an asymptotic detection limit, measured in ‖β₁ − β₂‖₂, of at most √(8k log p/(mρ₁)) for all k satisfying (10). While (10) is satisfied for k ≤ p^α with any α ∈ [0, 1), the detection limit upper bound of Corollary 4 is suboptimal when α > 1/2, as we will see later in Theorem 6. On the other hand, the following theorem shows that when α < 1/2, the detection limit of φ^{sparse}_{λ,τ} is essentially optimal.
Theorem 5. Assume Conditions (C1) and (C2), and further assume that k ≤ p^α for some α ∈ [0, 1/2) and that ρ ≤ √((1 − 2α − η)k log p/(4mρ₁)) for some η ∈ (0, 1 − 2α]. Then M_X(k, ρ) → 1 almost surely.
For any fixed α < 1/2, Theorem 5 shows that if the signal ℓ₂ norm is a factor of √(32/(1 − 2α − η)) smaller than what can be detected by φ^{sparse}_{λ,τ} as shown in Corollary 4, then all tests are asymptotically powerless in distinguishing the null from the alternative. In other words, in the sparse regime where k ≤ p^α for α < 1/2, the test φ^{sparse}_{λ,τ} has a minimax optimal detection limit, measured in ‖β₁ − β₂‖₂, up to constants depending on α only.
It is illuminating to relate the above results to the corresponding ones for the one-sample problem in the sparse regime (α < 1/2). Let W be an m × p matrix with independent N(0, 1) entries and Z = Wβ + ξ with ξ | W ~ N(0, I_m), and consider the one-sample problem of testing H₀ : β = 0 against H₁ : β ∈ Θ_{k,p}(ρ). Theorems 2 and 4 of Arias-Castro, Candès and Plan (2011) state that, under the additional assumption that all nonzero entries of β have equal absolute values, the detection limit for the one-sample problem is at ρ ≍ √(k log p/m), up to constants depending on α. Thus, Corollary 4 and Theorem 5 suggest that the two-sample problem under model (2) has, up to multiplicative constants, the same detection limit as the one-sample problem with sample size mρ₁. By the explicit expression for ρ₁ from Lemma 12, we have

mρ₁ = nr/{(1 + r)²(1 + ν)}, (12)
which unveils how this "effective sample size" depends on the relative proportions between the sample sizes n₁, n₂ and the dimension p of the problem. It is to be expected that the effective sample size is proportional to m, the number of observations constructed from X₁ and X₂ in W. More intriguingly, (12) also gives a precise characterisation of how the effective sample size depends on the imbalance between the numbers of observations in X₁ and X₂. For a fixed n = n₁ + n₂, mρ₁ is maximised when n₁ = n₂, and converges to n₁m/n (respectively n₂m/n) if n₁/n → 0 (respectively n₂/n → 0).
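To make the imbalance effect concrete, one can tabulate (12) numerically (a small illustrative helper; the function name is ours, and the formula is taken from (12) as reconstructed above):

```python
def effective_sample_size(n1, n2, p):
    """m * rho1 = n*r / ((1+r)^2 (1+nu)) from (12), with r = n1/n2, nu = p/m."""
    n, m = n1 + n2, n1 + n2 - p
    r, nu = n1 / n2, p / m
    return n * r / ((1 + r) ** 2 * (1 + nu))
```

For instance, with n = 1000 and p = 200, the balanced split effective_sample_size(500, 500, 200) equals m/4 = 200, while the imbalanced split effective_sample_size(900, 100, 200) gives only 72, in line with the discussion above.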
3.2 Dense case

We now turn our attention to our second test, φ^{dense}_τ. The following theorem states a sufficient signal ℓ₂ norm for which φ^{dense}_τ is asymptotically powerful in distinguishing the null from the alternative.

Theorem 6. Let τ = m + 2^{3/2}√(m log n) + 4 log n. Assume Conditions (C1) and (C2), and further assume that k ∈ [p], that ρ² ≥ 2√(m log n)/(mρ₁) and that (10) is satisfied.

(a) If β₁ = β₂, then φ^{dense}_τ(X₁, X₂, Y₁, Y₂) → 0 almost surely.

(b) If θ = (β₁ − β₂)/2 ∈ Θ_{k,p}(ρ), then φ^{dense}_τ(X₁, X₂, Y₁, Y₂) → 1 almost surely.

Consequently, M_X(k, ρ) → 0 almost surely.
Theorem 6 indicates that the signal ℓ2 norm sufficient for asymptotically powerful testing via ψ^dense_r does not depend upon the sparsity level. While the above result is valid for all s ∈ [p], it is most interesting in the dense regime where s ≥ p^{1/2}. More precisely, by comparing Theorem 6 with Corollary 4, we see that if s² log n > m and s log(ep/s) ≤ n/(2C_{τ,π}), the test ψ^dense_r has a smaller provable detection limit than ψ^sparse_{λ,t}. In our asymptotic regime (C2), 2√(m log n)/(nκ1) is, up to constants depending on τ and π, of order m^{−1/2} log^{1/2} n. The following theorem establishes that the detection limit of ψ^dense_r is minimax optimal, up to poly-logarithmic factors, in the dense regime.
Theorem 7. Under Conditions (C1) and (C2), if we further assume that p^{1/2} ≤ s ≤ p^α for some α ∈ [1/2, 1] and ρ = o(n^{−1/4} log^{−3/4} n), then ℳ_n(s, ρ) →^{a.s.} 1.
Theorem 7 points out that the lower bound on the detectable signal size ρ² prescribed in Theorem 6 is necessary up to poly-logarithmic factors. The following proposition makes it explicit that the upper bound on sparsity imposed by (10) in Theorem 6 cannot be completely removed, i.e. the same result may not hold if we allow s to be a constant fraction of n.
Proposition 8. If s = p ≥ min{n1, n2}, then ℳ_n(s, ρ) = 1. If s = p and p/n1, p/n2 ∈ [η, 1) for any fixed η ∈ (0, 1), and θ = (β1 − β2)/2 ∈ Θ_{s,p}(ρ) with

    ρ² = o( max{ n/(n1 − p)², n/(n2 − p)² } ),

then ℳ_n(s, ρ) →^{a.s.} 1.
4 Numerical studies

In this section, we study the finite-sample performance of our proposed procedures via numerical experiments. Unless otherwise stated, the data-generating mechanism for all simulations in this section is as follows. We first generate design matrices X1 and X2 with independent N(0, 1) entries. Then, for a given sparsity level k and a signal strength ρ, we set Δ = (Δ_j)_{j∈[p]} so that (Δ1, …, Δ_k)^⊤ ∼ ρ·Unif(S^{k−1}) and Δ_j = 0 for j > k. We then draw β1 ∼ N_p(0, I_p) and define β2 := β1 + Δ. Finally, we generate Y1 and Y2 as in (2), with ε1 ∼ N_{n1}(0, I_{n1}) and ε2 ∼ N_{n2}(0, I_{n2}) independent of each other.
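The data-generating mechanism above can be sketched in a few lines of numpy. This is our own minimal illustration (the function name, seed and example arguments are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n1, n2, p, k, rho, sigma=1.0):
    """Simulate the two-sample regression data described above:
    beta2 - beta1 = Delta has k nonzero coordinates and l2 norm rho."""
    X1 = rng.standard_normal((n1, p))
    X2 = rng.standard_normal((n2, p))
    delta = np.zeros(p)
    u = rng.standard_normal(k)
    delta[:k] = rho * u / np.linalg.norm(u)   # rho * Unif(S^{k-1}) on first k coords
    beta1 = rng.standard_normal(p)            # beta1 ~ N_p(0, I_p)
    beta2 = beta1 + delta
    Y1 = X1 @ beta1 + sigma * rng.standard_normal(n1)
    Y2 = X2 @ beta2 + sigma * rng.standard_normal(n2)
    return X1, X2, Y1, Y2, beta1, beta2

X1, X2, Y1, Y2, beta1, beta2 = generate_data(n1=100, n2=120, p=50, k=5, rho=2.0)
signal_norm = float(np.linalg.norm(beta2 - beta1))
```

By construction, `signal_norm` equals the prescribed ρ and the difference of coefficients is k-sparse.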
In Section 4.1, we supply the oracle value σ² = 1 to our procedures to check whether their finite-sample performance is in accordance with our theory. In all subsequent subsections, where we compare our methods against other procedures, we estimate the noise variance σ² with the method-of-moments estimator proposed by Dicker (2014). Specifically, after obtaining W and Z in Step 4 of Algorithm 1, we compute

    m̂1 := (1/p) tr( W^⊤W/m )   and   m̂2 := (1/p) tr{ (W^⊤W/m)² } − (1/(mp)) { tr( W^⊤W/m ) }².

This allows us to estimate

    σ̂² := { 1 + p m̂1² / ((m + 1) m̂2) } ‖Z‖2²/m − m̂1/{m(m + 1) m̂2} ‖W^⊤Z‖2².

We implement our estimators ψ^sparse_{λ,t} and ψ^dense_r on the standardised data X1/σ̂, X2/σ̂, Y1/σ̂ and Y2/σ̂ with the tuning parameters λ = √(4 log p), t = 3s log p and r = m + √(8m log n) + 4 log n, as suggested by Theorems 1, 3 and 6.
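The moment-based variance estimate above can be sketched as follows. This is our own illustration, following our reconstruction of the formulas in the style of Dicker (2014); the symbols m, W, Z are as in Algorithm 1, and the null example (θ = 0, σ² = 1) is an assumption of the demo:

```python
import numpy as np

def noise_variance_estimate(W, Z):
    """Method-of-moments noise variance estimate (sketch): m1, m2 are the
    first two spectral moment estimates of W^T W / m, combined as above."""
    m, p = W.shape
    G = W.T @ W / m
    m1 = np.trace(G) / p
    m2 = np.trace(G @ G) / p - np.trace(G) ** 2 / (m * p)
    return ((1 + p * m1 ** 2 / ((m + 1) * m2)) * np.sum(Z ** 2) / m
            - m1 / (m * (m + 1) * m2) * np.sum((W.T @ Z) ** 2))

rng = np.random.default_rng(1)
m, p = 2000, 200
W = rng.standard_normal((m, p))
Z = rng.standard_normal(m)            # null model: Z is pure noise, sigma^2 = 1
sigma2_hat = noise_variance_estimate(W, Z)
```

With m much larger than p and unit noise variance, `sigma2_hat` should land close to 1.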
4.1 Effective sample size in two-sample testing

We first investigate how the empirical power of our test ψ^sparse_{λ,t} depends on the various problem parameters. In light of our results in Theorems 1 and 3, we define

    ν := nτρ² / { σ²(1 + τ)²(1 + π) s log p },    (13)

where π := p/m and τ := n1/n2. As discussed after Theorem 3, nτ/{(1 + τ)²(1 + π)} in the definition of ν can be viewed as the effective sample size in the testing problem. In Figure 1, we plot the estimated power of ψ^sparse_{λ,t} against ν over 100 Monte Carlo repetitions for n = 1000, k = 10, ρ ∈ {0, 0.2, …, 2} and various values of p and n1. In the left panel of Figure 1, p ranges from 100 to 900, which corresponds to π ranging from 1/9 to 9. In the right panel, we vary n1 from 100 to 900, which corresponds to τ varying between 1/9 and 9. In both panels, the power curves for the different π and τ values overlap each other, with the phase transition occurring at around ν ≈ 1.5 in all cases. This conforms well with the effective sample size and the detection limit articulated in our theory.
Figure 1: Power function of ψ^sparse_{λ,t}, estimated over 100 Monte Carlo repetitions, plotted against ν, as defined in (13), in various parameter settings. Left panel: n1 = n2 = 500, p ∈ {100, 200, …, 900}, k = 10, ρ ∈ {0, 0.2, …, 2}. Right panel: n1 ∈ {100, 200, …, 900}, n2 = 1000 − n1, p = 400, k = 10, ρ ∈ {0, 0.2, …, 2}.
4.2 Comparison with other methods

Next, we compare the performance of our procedures against competitors in the existing literature. The only methods we are aware of that allow for dense regression coefficients β1 and β2 are those proposed by Zhu and Bradic (2016) and Charbonnier, Verzelen and Villers (2015). In addition, we also include in our comparison the classical likelihood ratio test, denoted by ψ_LRT, which rejects the null when the F-statistic defined in (4) exceeds the upper α-quantile of an F_{p, n−2p} distribution. Note that the likelihood ratio test is only well defined if p < min{n1, n2}. The test proposed by Zhu and Bradic (2016), which we denote by ψ_ZB, requires that n1 = n2 (when the two samples do not have equal sample sizes, a subset of the larger sample is discarded so that the test applies). Specifically, writing X₊ := X1 + X2, X₋ := X1 − X2 and Y₊ := Y1 + Y2, ψ_ZB first estimates γ = (β1 + β2)/2 and

    Π := {E(X₊^⊤X₊)}^{−1} E(X₊^⊤X₋)

by solving Dantzig-Selector-type optimisation problems. Then, based on the obtained estimators γ̂ and Π̂, ψ_ZB proceeds to compute the test statistic

    T_ZB := ‖(X₋ − X₊Π̂)^⊤(Y₊ − X₊γ̂)‖_∞ / ‖Y₊ − X₊γ̂‖2.

Their test rejects the null if the test statistic exceeds an empirical upper-α-quantile (obtained via Monte Carlo simulation) of ‖g‖_∞ for g ∼ N(0, {X₋ − X₊Π̂}^⊤{X₋ − X₊Π̂}). As the estimation of Π involves solving a sequence of p Dantzig Selector problems, which is often time-consuming, we have implemented ψ_ZB with the oracle choice Π̂ = Π, which is equal to I_p when the covariates in the two design matrices X1 and X2 follow independent centred distributions with the same covariance matrix. The test proposed by Charbonnier, Verzelen and Villers (2015), denoted here by ψ_CVV, first performs a LARS regression (Efron et al., 2004) of the concatenated response Y = (Y1^⊤, Y2^⊤)^⊤ against the block design
matrix

    ( X1   X1 )
    ( X2  −X2 )

to obtain a sequence of regression coefficients θ̂ = (θ̂1, θ̂2) ∈ R^{p+p}. Then, for every θ̂ on the LARS solution path with ‖θ̂‖0 ≤ min{n1, n2}/2, they restrict the original testing problem to the subset of coordinates where either θ̂1 or θ̂2 is non-zero, and form test statistics based on the Kullback–Leibler divergence between the two samples restricted to these coordinates. The sequence of test statistics is then compared with Bonferroni-corrected thresholds at size α. For both ψ_LRT and ψ_CVV, we set α = 0.05.
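For concreteness, the T_ZB statistic described above can be sketched in a few lines of numpy. This is our own minimal illustration (not Zhu and Bradic's code), taking the oracle choice Π̂ = I_p by default as in our experiments; the null example with an oracle γ̂ is an assumption of the demo:

```python
import numpy as np

def zb_statistic(X1, X2, Y1, Y2, gamma_hat, Pi_hat=None):
    """Sketch of T_ZB = ||(X_- - X_+ Pi)^T resid||_inf / ||resid||_2,
    with resid = Y_+ - X_+ gamma_hat and oracle Pi_hat = I_p by default."""
    p = X1.shape[1]
    if Pi_hat is None:
        Pi_hat = np.eye(p)
    X_plus, X_minus = X1 + X2, X1 - X2
    resid = (Y1 + Y2) - X_plus @ gamma_hat
    return float(np.max(np.abs((X_minus - X_plus @ Pi_hat).T @ resid))
                 / np.linalg.norm(resid))

# Null example: beta1 = beta2 = beta, with the oracle gamma_hat = beta.
rng = np.random.default_rng(4)
n, p = 200, 50
X1, X2 = rng.standard_normal((n, p)), rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y1 = X1 @ beta + rng.standard_normal(n)
Y2 = X2 @ beta + rng.standard_normal(n)
stat = zb_statistic(X1, X2, Y1, Y2, gamma_hat=beta)
```

Under the null with the oracle centring, the statistic behaves like twice the maximum of p standard normal magnitudes, so it stays moderate.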
Figure 2: Power comparison of different methods at different sparsity levels k ∈ {1, 10, ⌊p^{1/2}⌋, 0.1p, p} and different signal ℓ2 norms ρ on a logarithmic grid (noise variance σ² = 1). Left panel: n1 = n2 = 1200, p = 1000, ρ ∈ [0, 10]; right panel: n1 = n2 = 500, p = 800, ρ ∈ [0, 20].
Figure 2 compares the estimated power, as a function of ‖β1 − β2‖2, of ψ^sparse_{λ,t} and ψ^dense_r against that of ψ_LRT, ψ_ZB and ψ_CVV. We ran all methods on the same 100 datasets for each set of parameters. We performed numerical experiments in two high-dimensional settings with different sample-size-to-dimension ratios: p = 1000, n1 = n2 = 1200 in the left panel, and p = 800, n1 = n2 = 500 in the right panel. Here, we took n1 = n2 to maximise the power of ψ_ZB. Also, since the likelihood ratio test requires p < min{n1, n2}, it is only implemented in the left panel. For each experiment, we varied k in the set {1, 10, ⌊p^{1/2}⌋, 0.1p, p} to examine different sparsity levels.

We see in Figure 2 that both ψ^sparse_{λ,t} and ψ^dense_r show promising finite-sample performance. Both of our tests produced no false positives under the null when ρ = 0, and showed better power than ψ_ZB and ψ_CVV. In the more challenging setting of the right panel, with p > max{n1, n2}, our test ψ^sparse_{λ,t} reaches power close to 1 in the sparsest case at a signal ℓ2 norm more than 10 times smaller than the competitors require. Note, though, that in the densest case in the right panel (k = 800), ψ^sparse_{λ,t} and ψ^dense_r did not have saturated power curves, because the noise variance is over-estimated by σ̂² in this setting.
We also observe that the power of ψ^sparse_{λ,t} depends more strongly on the sparsity level k than that of ψ^dense_r. For k ≤ √p, ψ^sparse_{λ,t} appears much more sensitive to the signal size. As k increases, ψ^dense_r eventually outperforms ψ^sparse_{λ,t}, which is consistent with the phase transition behaviour discussed after Theorem 6. It is interesting to note that, when the likelihood ratio test is well defined (left panel), it has better power than ψ^dense_r. This is partly due to the fact that the theoretical choice of the threshold r is relatively conservative, so as to ensure that the asymptotic size of the test is 0 almost surely. In comparison, the rejection threshold for the likelihood ratio test is chosen to have asymptotic size α = 0.05 (for p fixed and n → ∞), and its empirical size is sometimes observed to be larger than 0.08.
4.3 Model misspecification

We have so far focused on the case of Gaussian random designs X1, X2 with identity covariance and Gaussian regression noises ε1, ε2, in line with the modelling assumptions that have helped us gain theoretical insights into the two-sample testing problem. However, our proposed testing procedures can still be used even when these assumptions are not satisfied. To evaluate the performance of our proposed procedures under model misspecification, we consider the following four setups:
(a) Correlated design: the rows of X1 and X2 are independently drawn from N_p(0, Σ) with Σ = (2^{−|j1−j2|})_{j1,j2∈[p]}.

(b) Rademacher design: the entries of X1 and X2 are independent Rademacher random variables.
(c) One-way balanced ANOVA design: assume r1 := n1/p and r2 := n2/p are integers, and that X1 and X2 are the block-diagonal matrices

    X1 = diag(1_{r1}, …, 1_{r1}),   X2 = diag(1_{r2}, …, 1_{r2}),

each with p diagonal blocks, where 1_r denotes the all-one vector in R^r.
(d) Heavy-tailed noise: we generate both ε1 and ε2 with independent t4/√2 entries. Note that the √2 in the denominator standardises the noise to have unit variance, to ensure easier comparison between settings.
In setups (a) to (c), we keep ε1 ∼ N_{n1}(0, I_{n1}) and ε2 ∼ N_{n2}(0, I_{n2}), and in setup (d) we keep X1 and X2 with independent N(0, 1) entries. Figure 3 compares the performance of ψ^sparse_{λ,t} and ψ^dense_r with that of ψ_ZB and ψ_CVV. In all settings, we set n1 = n2 = 500 and k = 10. In settings (a), (b) and (d), we choose p = 800 and vary ρ from 0 to 20; in setting (c), we choose p = 250 and vary ρ from 0 to 50. We see that ψ^sparse_{λ,t} is robust to model misspecification and exhibits good power in all settings. The test ψ^dense_r is robust to non-normal designs and noise, but exhibits a slight reduction in power under the correlated design. The advantage of ψ^sparse_{λ,t} and ψ^dense_r over the competing methods is least pronounced in the ANOVA design of setting (c), where each row vector of the design matrices has all its mass concentrated in one coordinate. In all other settings, where the rows of the design matrices are more 'incoherent', in the sense that all coordinates have similar magnitudes, ψ^sparse_{λ,t} and ψ^dense_r start having nontrivial power at a signal ℓ2 norm 10 to 20 times smaller than that of the competitors.
Figure 3: Power functions of different methods in models with non-Gaussian design or non-Gaussian noise, plotted against the signal ℓ2 norm ρ on a logarithmic grid. (a) correlated design with Σ = (2^{−|i−j|})_{i,j∈[p]}; (b) Rademacher design; (c) one-way balanced ANOVA design; (d) Gaussian design with t4/√2-distributed noise. Details of the models are given in Section 4.3.
5 Proof of main results

Proof of Theorem 1. Under the null hypothesis, where β1 = β2, we have θ = 0 and therefore Z = Wθ + ξ = ξ ∼ N_m(0, I_m). In particular, Z_j ∼ N(0, 1) for each j ∈ [m]. Thus, by a union bound, we have for λ = √((4 + δ) log p) and any δ > 0 that

    P(T ≥ t) ≤ Σ_{j=1}^m P(|Z_j| ≥ λ) ≤ m e^{−λ²/2} = m p^{−2−δ/2}.

The almost sure convergence then follows from the Borel–Cantelli lemma, after noting that m p^{−2−δ/2} is of order p^{−1−δ/2} under (C2), which is summable for any δ > 0.
Proof of Proposition 2. Let H ∈ O^{p×p} be any orthogonal matrix. Then, since X =^d XH, by Lemma 9 we have

    H^⊤W^⊤WH = 4(H^⊤X1^⊤X1H)(H^⊤X^⊤XH)^{−1}(H^⊤X2^⊤X2H) =^d 4(X1^⊤X1)(X^⊤X)^{−1}(X2^⊤X2) = W^⊤W.    (14)

In particular, all diagonal entries of W^⊤W have the same distribution, and all off-diagonal entries of W^⊤W have the same distribution. So it suffices to study (W^⊤W)_{1,1} and (W^⊤W)_{1,2}.

Let X = QR be the QR decomposition of X, which is almost surely unique if we require the upper-triangular matrix R to have non-negative entries on the diagonal. Let Q1 be the submatrix obtained from the first n1 rows of Q. By Lemma 13, Q1 and R are independent, and R has independent entries distributed as R_{i,i} = t_i > 0 with t_i² ∼ χ²_{n−i+1} for i ∈ [p] and R_{i,j} = z_{i,j} ∼ N(0, 1) for 1 ≤ i < j ≤ p.

Define B := Q1^⊤Q1 and let B = UΛU^⊤ be its eigendecomposition, which is almost surely unique if we require the diagonal entries of Λ to be non-increasing and the diagonal entries of U to be non-negative. By Lemma 13, Q is uniformly distributed on O^{n×p}, which means that Q =^d QH for any H ∈ O^{p×p}. Consequently, Q1 =^d Q1H and B =^d H^⊤BH = (H^⊤U)Λ(H^⊤U)^⊤. Since the group O^{p×p} acts transitively on itself through left multiplication, the joint density of U and Λ must be a function of Λ only. In particular, U and Λ are independent.

Note that X1 = Q1R. Thus, X1^⊤X1 = R^⊤BR and X2^⊤X2 = R^⊤(I_p − B)R. By Lemma 9, we have

    W^⊤W = 4X1^⊤A1A1^⊤X1 = 4(X1^⊤X1)(X1^⊤X1 + X2^⊤X2)^{−1}(X2^⊤X2) = 4R^⊤B(I_p − B)R = 4R^⊤UΛ(I_p − Λ)U^⊤R.    (15)

Let 1 ≥ μ1 ≥ ⋯ ≥ μp ≥ 0 be the diagonal entries of Λ. Define ℓi = μi(1 − μi) for i ∈ [p] and set ℓ := (ℓ1, …, ℓp). We can write t1² = s1² + r1², with s1² ∼ χ²_p and r1² ∼ χ²_{n−p}, such that s1 ≥ 0 and r1 ≥ 0 are independent of each other and of everything else. By Lemma 13, the G_{i,1} := s1U_{i,1}, for i ∈ [p], are independent N(0, 1) random variables. Note that

    (1/4)(W^⊤W)_{1,1} = Σ_{i=1}^p t1² ℓi U_{i,1}² = (t1²/s1²) Σ_{i=1}^p ℓi G_{i,1}².

Let δ > 0 be chosen later. By Laurent and Massart (2000, Lemma 1), applied conditionally on ℓ, we have with probability at least 1 − 6δ that all of the following inequalities hold:

    ‖ℓ‖1 − 2‖ℓ‖2√(log(1/δ)) ≤ Σ_{i=1}^p ℓi G_{i,1}² ≤ ‖ℓ‖1 + 2‖ℓ‖2√(log(1/δ)) + 2‖ℓ‖_∞ log(1/δ),
    p − 2√(p log(1/δ)) ≤ s1² ≤ p + 2√(p log(1/δ)) + 2 log(1/δ),
    n − 2√(n log(1/δ)) ≤ t1² ≤ n + 2√(n log(1/δ)) + 2 log(1/δ).
Using the fact that ‖ℓ‖_∞ ≤ 1/4, we have with probability at least 1 − 6δ that

    [ (n − 2√(n log(1/δ))) / (p + 2√(p log(1/δ)) + 2 log(1/δ)) ] · { ‖ℓ‖1 − 2‖ℓ‖2√(log(1/δ)) }
      ≤ (1/4)(W^⊤W)_{1,1}
      ≤ [ (n + 2√(n log(1/δ)) + 2 log(1/δ)) / (p − 2√(p log(1/δ))) ] · { ‖ℓ‖1 + 2‖ℓ‖2√(log(1/δ)) + (1/2) log(1/δ) }.

If log(1/δ) = o(p), then for each n, with probability at least 1 − 6δ, we have that

    | (W^⊤W)_{1,1}/4 − (n/p)‖ℓ‖1 | ≤ ‖ℓ‖1 · (2√(p log(1/δ))/p)(1 + √(n/p)) + (n/p){ 2‖ℓ‖2√(log(1/δ)) + (1/2) log(1/δ) }
      + O_τ( ‖ℓ‖1 log(1/δ)/p + ‖ℓ‖2 log(1/δ)/√p + log^{3/2}(1/δ)/p^{1/2} ).    (16)

By the definition of B, we have for H := R(X^⊤X)^{−1/2} ∈ O^{p×p} that

    H^⊤BH = H^⊤R^{−⊤}X1^⊤X1R^{−1}H = (X^⊤X)^{−1/2}(X1^⊤X1)(X^⊤X)^{−1/2},

which follows the matrix-variate Beta distribution Beta_p(n1/2, n2/2) as defined before Lemma 12. Hence the diagonal elements of Λ are distributed as the eigenvalues of a Beta_p(n1/2, n2/2) random matrix. By Lemma 12, throughout the rest of this proof we may restrict ourselves to an almost sure event on which

    ‖ℓ‖1/p → κ1 and ‖ℓ‖2/√p → κ2.

By (16), for each n, with probability at least 1 − 6δ, we have

    | (W^⊤W)_{1,1} − 4nκ1 | ≤ 8√(n log(1/δ)) { (κ1 + κ2)√(n/p) + κ1 } + O_τ( log(1/δ) + log^{3/2}(1/δ)/p^{1/2} ).    (17)

For the first claim in the proposition, we take δ = 1/p³. Using (17), Lemma 12, a union bound over j ∈ [p] and the Borel–Cantelli lemma (noting that 1/p² is summable), we obtain that

    max_{j∈[p]} | (W^⊤W)_{j,j}/(4nκ1) − 1 | →^{a.s.} 0.

For the second part of the claim, define Δ := W^⊤W − 4np^{−1}‖ℓ‖1 I_p. By Lemma 11, there exists a 1/4-net 𝒩 of cardinality at most (p choose s) 9^s such that

    ‖Δ‖_{s,op} ≤ 2 sup_{u∈𝒩} u^⊤Δu.    (18)

By (14), we have u^⊤Δu =^d e1^⊤Δe1 for all u ∈ S^{p−1}. Hence, by (17), (18) and a union bound over u ∈ 𝒩, there exists C_τ > 0, depending only on τ, such that for each n, with probability at least 1 − 6|𝒩|δ,

    ‖Δ‖_{s,op} ≤ 8√(n log(1/δ)) { (κ1 + κ2)√(n/p) + κ1 } + O_τ( log(1/δ) + log^{3/2}(1/δ)/p^{1/2} ).    (19)
If log(1/δ) = o(p), as will be verified for our choice of δ below, then on the event where (19) holds, we have ‖W^⊤W‖_{s,op} = 4(1 + o(1))np^{−1}‖ℓ‖1 and ‖diag(W^⊤W)‖op = 4(1 + o(1))np^{−1}‖ℓ‖1. Defining D := { diag(W^⊤W) / (4np^{−1}‖ℓ‖1) }^{1/2}, we have

    Ŵ^⊤Ŵ − W^⊤W/(4n‖ℓ‖1/p) = (D^{−1} − I) [ W^⊤W/(4n‖ℓ‖1/p) ] D^{−1} + [ W^⊤W/(4n‖ℓ‖1/p) ] (D^{−1} − I).

Taking s-operator norms on both sides of this identity, and using the fact that D is a diagonal matrix, we have

    ‖ Ŵ^⊤Ŵ − W^⊤W/(4n‖ℓ‖1/p) ‖_{s,op} ≤ ‖D^{−1} − I‖op · ‖ W^⊤W/(4n‖ℓ‖1/p) ‖_{s,op} · (‖D^{−1}‖op + 1)
      ≤ (2 + o(1)) ‖D^{−1} − I_p‖op ≤ (1 + o(1)) max_{j∈[p]} | (W^⊤W)_{j,j}/(4np^{−1}‖ℓ‖1) − 1 |.

Combining the above inequality with the observation that |(W^⊤W)_{j,j} − 4np^{−1}‖ℓ‖1| ≤ ‖Δ‖_{s,op}, we have by (19) and Lemma 12 that, asymptotically, with probability at least 1 − 6|𝒩|δ,

    ‖Ŵ^⊤Ŵ − I‖_{s,op} ≤ ‖ Ŵ^⊤Ŵ − W^⊤W/(4n‖ℓ‖1/p) ‖_{s,op} + ‖ W^⊤W/(4n‖ℓ‖1/p) − I_p ‖_{s,op}
      ≤ (2 + o(1)) ‖Δ‖_{s,op} / (4np^{−1}‖ℓ‖1) ≤ (4 + o(1)) { (1 + κ2/κ1)√(n/p) + 1 } √(log(1/δ)/n).

Now choose δ := (10ep/s)^{−(s+4)}. By (10), we indeed have log(1/δ) = (s + 4) log(10ep/s) = o(p), as required in the above calculation. Also,

    |𝒩|δ ≤ 9^s (p choose s) (10ep/s)^{−(s+4)} ≤ (9ep/s)^s (10ep/s)^{−(s+4)} ≤ 0.9^s (ep/s)^{−4} ≤ max{ p^{−2}, 0.9^{√p} },

which is summable over n. Thus, by the Borel–Cantelli lemma, we see that for any ε > 0, with probability 1, the events

    { ‖Ŵ^⊤Ŵ − I‖_{s,op} > (4 + ε) { (1 + κ2/κ1)√(n/p) + 1 } √((s + 4) log(10ep/s)/n) }

happen only finitely often. Hence the desired conclusion holds with, for instance, C_{τ,π} = 5(1 + √(1 + 1/π) + √(τ + π − 1 + 1/τ + 1/π)).
Proof of Theorem 3. By Proposition 2, it suffices to work with a deterministic sequence of X such that (9) and (11) hold, which we henceforth assume in this proof.

Define η = (η1, …, ηp)^⊤ by ηj := θj‖Wj‖2. Then, from (8), we have

    Z = Ŵη + ξ,
for ξ ∼ N_m(0, I_m). Write v := (v1, …, vp)^⊤ := Ŵ^⊤Z and S := supp(θ) = supp(η); then

    v_S = (Ŵ^⊤Z)_S ∼ N_{|S|}( (Ŵ^⊤Ŵ)_{S,S} η_S, (Ŵ^⊤Ŵ)_{S,S} ).

Our strategy will be to control ‖v_S‖2 from below. To this end, by (9), we have

    ‖η_S‖2² = Σ_{j∈S} θj² ‖Wj‖2² ≥ {4 − o(1)} nκ1 Σ_{j∈S} θj² = (4 − o(1)) nκ1 ρ².    (20)

Moreover, by (11) and (10), we have for sufficiently large n that

    ‖ I_{|S|} − (Ŵ^⊤Ŵ)_{S,S} ‖op ≤ ‖ I_p − Ŵ^⊤Ŵ ‖_{s,op} ≤ C_{τ,π} √( s log(ep/s)/n ) = o(1).    (21)

Define μ := ‖(Ŵ^⊤Ŵ)_{S,S} η_S‖2. By the triangle inequality, (20) and (21), we have

    μ² ≥ [ ‖η_S‖2 { 1 − ‖I_{|S|} − (Ŵ^⊤Ŵ)_{S,S}‖op } ]² ≥ (4 − o(1)) nκ1 ρ² ≥ (32 − o(1)) s log p,    (22)

where we have invoked the condition ρ² ≥ 8s log p/(nκ1) in the final bound.

By Lemma 14 and (21), for sufficiently large n, we have with probability at least 1 − 2n^{−2} that

    ‖v_S‖2² ≥ (1 − o(1))(s + μ²) − (2 + o(1)) { √(2(s + 2μ²) log n) + 2 log n }
      ≥ (1 − o(1))(s + μ²) − (2 + o(1)) √((s + 2μ²)μ²/16) − (1/8 + o(1)) μ²
      ≥ ( 1 − 1/(4√2) − o(1) ) s + ( 7/8 − 1/√2 − o(1) ) μ² ≥ 5 s log p,    (23)

where both the second and the final inequalities hold because of (22).

From (23), using the tuning parameters λ = 2√(log p) and t = s log p, we have for sufficiently large n that, with probability at least 1 − 2n^{−2},

    T = Σ_{j=1}^p vj² 1{|vj| ≥ λ} ≥ ‖v_S‖2² − sλ² ≥ s log p ≥ t,

which allows us to reject the null. The desired almost sure convergence follows by the Borel–Cantelli lemma, since 1/n² is summable over n ∈ N.
Proof of Corollary 4. The first inequality follows from the definition of ℳ_n(s, ρ). An inspection of the proofs of Theorems 1 and 3 reveals that both results depend only on the complementary-sketched model Z = Wθ + ξ, and hence hold uniformly over (β1, β2). Thus, we have from Theorem 1 that sup_{β∈R^p} P_{β,β}(ψ^sparse_{λ,t} ≠ 0) →^{a.s.} 0, and from Theorem 3 that sup_{(β1,β2)∈R^p×R^p : (β1−β2)/2∈Θ_{s,p}(ρ)} P_{β1,β2}(ψ^sparse_{λ,t} ≠ 1) →^{a.s.} 0. Combining the two completes the proof.
Proof of Theorem 5. By considering the trivial test ψ ≡ 0, we see that ℳ ≤ 1. Thus, it suffices to show that ℳ ≥ 1 − o(1). Also, by Proposition 2, it suffices to work with a deterministic sequence of X (and hence W) such that (9) and (11) hold, which we henceforth assume in this proof.

Let δ := (X1^⊤X1 + X2^⊤X2)^{−1}(X2^⊤X2 − X1^⊤X1) and let ν be the uniform distribution on

    Θ0 := { θ ∈ {s^{−1/2}ρ, −s^{−1/2}ρ, 0}^p : ‖θ‖0 = s } ⊂ Θ.

We write P0 := P_{0,0} and let P_ν := ∫_{θ∈Θ0} P_{δθ,θ} dν(θ) denote the uniform mixture of P_{γ,θ} over {(γ, θ) : θ ∈ Θ0, γ = δθ}. Let L := dP_ν/dP0 be the likelihood ratio between the mixture alternative P_ν and the simple null P0. We have

    ℳ ≥ inf_ψ { 1 − (P0 − P_ν)ψ } = 1 − (1/2) ∫ |1 − dP_ν/dP0| dP0 ≥ 1 − (1/2) { ∫ (1 − dP_ν/dP0)² dP0 }^{1/2} ≥ 1 − (1/2){ P0(L²) − 1 }^{1/2}.

So it suffices to prove that P0(L²) ≤ 1 + o(1). Writing W̃1 = X1(δ + I_p) and W̃2 = X2(δ − I_p), and suppressing the dependence of the P's on n in the notation, by the definition of P_ν we compute that

    L = ∫ (dP_{δθ,θ}/dP0) dν(θ) = ∫ e^{−(1/2)(‖Y1 − X1δθ − X1θ‖² + ‖Y2 − X2δθ + X2θ‖²)} / e^{−(1/2)(‖Y1‖² + ‖Y2‖²)} dν(θ)
      = ∫ e^{⟨W̃1θ, Y1⟩ − (1/2)‖W̃1θ‖² + ⟨W̃2θ, Y2⟩ − (1/2)‖W̃2θ‖²} dν(θ).

For θ ∼ ν and some fixed J0 ⊂ [p] with |J0| = s, let ν_{J0} be the distribution of θ_{J0} conditional on supp(θ) = J0. Let J, J′ be independent, each uniformly distributed on {J0 ⊂ [p] : |J0| = s}. By Fubini's theorem, we have

    P0(L²) = ∫∫ e^{(1/2)‖W̃1(θ+θ′)‖² − (1/2)‖W̃1θ‖² − (1/2)‖W̃1θ′‖² + (1/2)‖W̃2(θ+θ′)‖² − (1/2)‖W̃2θ‖² − (1/2)‖W̃2θ′‖²} dν(θ) dν(θ′)
      = ∫∫ e^{θ^⊤(W̃1^⊤W̃1 + W̃2^⊤W̃2)θ′} dν(θ) dν(θ′)
      = ∫∫ e^{θ^⊤W^⊤Wθ′} dν(θ) dν(θ′)
      = E[ E{ e^{η_{J∩J′}^⊤η′_{J∩J′}} ∫∫ e^{η_J^⊤(Ŵ^⊤Ŵ − I_p)_{J,J′} η′_{J′}} dν_J(θ_J) dν_{J′}(θ′_{J′}) | J, J′ } ],    (24)

where we have employed Lemmas 10 and 9 in the penultimate equality, and used the decomposition θ^⊤W^⊤Wθ′ = η^⊤Ŵ^⊤Ŵη′ = η^⊤(Ŵ^⊤Ŵ − I_p)η′ + η^⊤η′ in the final step.

We first control the inner integral I. Define η := (η1, …, ηp)^⊤ and η′ := (η′1, …, η′p)^⊤ by ηj := θj‖Wj‖2 and η′j := θ′j‖Wj‖2. By (9), we have

    a := max{ max_{j∈J} |ηj|, max_{j∈J′} |η′j| } ≤ (1 + o(1)) √(4nκ1ρ²/s).    (25)
By (25) and (11), we have for any ρ = o(n^{−1/4} log^{−3/4} n) — which includes the case ρ ≤ √((1 − 2α − η)s log p/(4nκ1)) for s ≤ p^α with α < 1/2 — that

    a² ‖(Ŵ^⊤Ŵ − I_p)_{J,J′}‖_F ≤ a² ‖(Ŵ^⊤Ŵ − I_p)_{J∪J′,J∪J′}‖_F ≤ √(2s) a² ‖Ŵ^⊤Ŵ − I_p‖_{2s,op}
      ≤ (1 + o(1)) √(2s) · (4nκ1ρ²/s) · C_{τ,π} √( 2s log{ep/(2s)}/n ) = o(log^{−1} n).

Consequently, by Arias-Castro, Candès and Plan (2011, Lemma 4), we have

    I ≤ e^{2a²‖(Ŵ^⊤Ŵ−I_p)_{J,J′}‖_F log(3p)} + a² ‖(Ŵ^⊤Ŵ − I_p)_{J,J′}‖_F ≤ e^{o(1)} + o(1) = 1 + o(1).

Plugging the above bound into (24), we obtain

    P0(L²) ≤ (1 + o(1)) E{ E( e^{η_{J∩J′}^⊤η′_{J∩J′}} | J ∩ J′ ) } = (1 + o(1)) E[ E{ Π_{j∈J∩J′} exp(ηjη′j) | J ∩ J′ } ]
      ≤ (1 + o(1)) E{ cosh^{|J∩J′|}(a²) },    (26)

since each ηjη′j is symmetric and bounded by a² in absolute value. Hence, it suffices to show that E{cosh^{|J∩J′|}(a²)} = 1 + o(1). Note that |J ∩ J′| ∼ HyperGeom(p; s, s) is a hypergeometric random variable (defined as the number of black balls obtained from s draws without replacement from an urn containing p balls, s of which are black). Let B ∼ Bin(s, s/p). By Hoeffding (1963, Theorem 4) and the fact that cosh(x) ≤ e^x for all x ≥ 0, we have

    E{ cosh^{|J∩J′|}(a²) } ≤ E e^{a²|J∩J′|} ≤ E e^{a²B} = { 1 + (s/p)(e^{a²} − 1) }^s ≤ exp{ (s²/p)(e^{a²} − 1) }.    (27)

Since α ∈ [0, 1/2) and ρ ≤ √((1 − 2α − η)s log p/(4nκ1)), from (25) we can deduce that

    a ≤ √( (1 − 2α − η + o(1)) log p ),

and hence e^{a²}s²/p = p^{−η+o(1)} = o(1). So, from (27), we have E{cosh^{|J∩J′|}(a²)} = 1 + o(1), which completes the proof.
Proof of Theorem 6. As in the proof of Theorem 3, we work with a deterministic sequence of X such that (9) and (11) are satisfied. For η = (η1, …, ηp)^⊤ with ηj := θj‖Wj‖2, we have from (8) that

    Z = Ŵη + ξ,

for ξ ∼ N_m(0, I_m). Hence, under the null hypothesis, we have ‖Z‖2² ∼ χ²_m, which by Laurent and Massart (2000, Lemma 1) yields that

    P{ ‖Z‖2² ≥ m + 2√(m log(1/δ)) + 2 log(1/δ) } ≤ δ.
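The Laurent–Massart tail bound invoked above, P{χ²_m ≥ m + 2√(mt) + 2t} ≤ e^{−t}, is easy to check numerically. The following is our own small simulation sketch, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
m, t, reps = 50, 3.0, 20000
samples = rng.chisquare(m, size=reps)          # ||xi||^2 with xi ~ N(0, I_m)
threshold = m + 2 * np.sqrt(m * t) + 2 * t     # Laurent--Massart upper deviation
emp_tail = float(np.mean(samples >= threshold))
```

The empirical tail probability `emp_tail` should come out well below the guaranteed bound e^{−3} ≈ 0.0498 (the bound is not tight).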
Setting δ = n^{−2}, for r = m + 2^{3/2}√(m log n) + 4 log n, we have by the Borel–Cantelli lemma that ψ^dense_r(X1, X2, Y1, Y2) →^{a.s.} 0.

On the other hand, under the alternative hypothesis, ‖Z‖2² ∼ χ²_m(‖Ŵη‖2²), which by Birgé (2001, Lemma 8.1) implies that

    P{ ‖Z‖2² ≤ m + ‖Ŵη‖2² − 2√( (m + 2‖Ŵη‖2²) log(1/δ) ) } ≤ δ.

Again setting δ = n^{−2}, observe from (11) and (10) that

    ‖Ŵη‖2² = ‖η‖2² + η^⊤(Ŵ^⊤Ŵ − I_p)η ≥ ‖η‖2² ( 1 − ‖Ŵ^⊤Ŵ − I_p‖_{s,op} ) ≥ (4 − o(1)) nκ1 ρ²,

where the final bound comes from (20). Similarly, we bound ‖Ŵη‖2² ≤ (4 + o(1)) nκ1 ρ². If ρ² ≥ 2√(m log n)/(nκ1), then from the above display (taking ρ² equal to this lower bound, to which we may reduce since ‖Z‖2² is stochastically increasing in ‖Ŵη‖2²) we have ‖Ŵη‖2² ≥ (8 − o(1))√(m log n) ≫ log n and ‖Ŵη‖2² ≤ (8 + o(1))√(m log n) ≪ m, and so

    m + ‖Ŵη‖2² − 2√( (m + 2‖Ŵη‖2²) log(1/δ) ) − r ≥ ‖Ŵη‖2² − 2^{5/2}√( (m + 2‖Ŵη‖2²) log n ) − 4 log n
      ≥ (1 − o(1)) ‖Ŵη‖2² − 2^{5/2}(1 + o(1)) √(m log n) > 0

asymptotically. Consequently, P(‖Z‖2² ≤ r) ≤ n^{−2}, and by the Borel–Cantelli lemma we have ψ^dense_r(X1, X2, Y1, Y2) →^{a.s.} 1.
Proof of Theorem 7. We follow the proof of Theorem 5 up to (26). Let B ∼ Bin(s, s/p). By Hoeffding (1963, Theorem 4) and the fact that cosh(x) ≤ e^{x²/2} for all x ∈ R, we have

    E{ cosh^{|J∩J′|}(a²) } ≤ E e^{|J∩J′|a⁴/2} ≤ E e^{a⁴B/2} = { 1 + (s/p)(e^{a⁴/2} − 1) }^s ≤ exp{ (s²/p)(e^{a⁴/2} − 1) }.

If α ∈ [1/2, 1] and ρ = o(n^{−1/4} log^{−3/4} n), then by (25), a⁴ = o(p^{1−2α}) = o(1), and hence (e^{a⁴/2} − 1)s²/p = (1/2 + o(1)) s²a⁴/p = o(1). By (26), P0(L²) = 1 + o(1), and we conclude as in the proof of Theorem 5.
Proof of Proposition 8. As in the proof of Theorem 5, it suffices to control P0(L²) for some choice of prior ν. We write λmin(W^⊤W) for the minimum eigenvalue of W^⊤W and let v be an associated eigenvector with ℓ2 norm equal to ρ. We choose ν to be the Dirac measure at v. Then, by (24), we have

    P0(L²) = e^{v^⊤W^⊤Wv} = e^{ρ²λmin(W^⊤W)}.

When p ≥ n1 or p ≥ n2, by Lemma 9, W^⊤W = 4(X1^⊤X1)(X^⊤X)^{−1}(X2^⊤X2) is singular. Hence λmin(W^⊤W) = 0 and P0(L²) = 1, which implies that ℳ_n(s, ρ) = 1.

On the other hand, if p < min{n1, n2}, we work on the almost sure event where (9) and (11) hold. Let R and Λ be defined as in the proof of Proposition 2; then by (15) we have

    λmin(W^⊤W) ≤ 4‖R‖op² λmin( Λ(I − Λ) ) ≤ 4‖X^⊤X‖op min{ λmin(Λ), 1 − λmax(Λ) }.
Using tail bounds for the operator norm of a Gaussian random matrix (see, e.g. Wainwright, 2019, Theorem 6.1), we have

    ‖X^⊤X‖op ≤ n( 1 + √(p/n) + √(2 log n / n) )² ≤ 5n

asymptotically with probability 1. Moreover, by Bai et al. (2015, Theorem 1.1), there is an almost sure event on which the empirical spectral distribution of Λ converges weakly to a distribution supported on [t_ℓ, t_r], for t_ℓ and t_r defined in (28). We will work on this almost sure event henceforth. For p/n1 → b ∈ [η, 1) and p/n2 → c ∈ [η, 1), we have lim sup_{n→∞} λmin(Λ) ≤ t_ℓ and lim inf_{n→∞} λmax(Λ) ≥ t_r. On the other hand, Taylor expanding the expressions for t_ℓ and t_r in (28) with respect to 1 − b and 1 − c respectively, we obtain that

    t_ℓ = (1/4) c (1 − b)² + O_η( (1 − b)³ ),
    1 − t_r = (1/4) b (1 − c)² + O_η( (1 − c)³ ).

Therefore, min{λmin(Λ), 1 − λmax(Λ)} = O_η( min{(1 − b)², (1 − c)²} ). Using the condition on ρ², we have

    ρ² λmin(W^⊤W) = o( max{ 1/((1 − b)²n), 1/((1 − c)²n) } ) · O_η( n min{(1 − b)², (1 − c)²} ) = o(1),

which implies that P0(L²) = 1 + o(1) and ℳ_n →^{a.s.} 1.
6 Ancillary results

Lemma 9. Let n1, n2, p, m be positive integers such that n1 + n2 = p + m = n. Let X = (X1^⊤, X2^⊤)^⊤ ∈ R^{n×p} be a non-singular matrix with block components X1 ∈ R^{n1×p} and X2 ∈ R^{n2×p}. Let A1 ∈ R^{n1×m} and A2 ∈ R^{n2×m} be chosen to satisfy (7). Then

    X1^⊤A1A1^⊤X1 = X2^⊤A2A2^⊤X2 = (X1^⊤X1)(X^⊤X)^{−1}(X2^⊤X2).

Proof. The first equality follows immediately from (7). Define Q̂1 := X1(X^⊤X)^{−1/2} and Q̂2 := X2(X^⊤X)^{−1/2}. Then Q̂ := (Q̂1^⊤, Q̂2^⊤)^⊤ has orthonormal columns with the same column span as X, and so

    ( Q̂1  A1 )
    ( Q̂2  A2 )   ∈ O^{n×n}.

In particular, Q̂1Q̂1^⊤ + A1A1^⊤ = I_{n1}. Therefore,

    X1^⊤A1A1^⊤X1 = X1^⊤(I_{n1} − Q̂1Q̂1^⊤)X1 = X1^⊤X1 − X1^⊤X1(X^⊤X)^{−1}X1^⊤X1
      = X1^⊤X1(X^⊤X)^{−1}(X^⊤X − X1^⊤X1) = (X1^⊤X1)(X^⊤X)^{−1}(X2^⊤X2),

where the last equality holds by noting the block structure of X.
Lemma 10. For X1 ∈ R^{n1×p} and X2 ∈ R^{n2×p}, define δ := (X1^⊤X1 + X2^⊤X2)^{−1}(X2^⊤X2 − X1^⊤X1), W̃1 := X1(δ + I_p) and W̃2 := X2(δ − I_p). We have

    W̃1^⊤W̃1 + W̃2^⊤W̃2 = 4X1^⊤X1(X1^⊤X1 + X2^⊤X2)^{−1}X2^⊤X2.

Proof. Write G1 := X1^⊤X1 and G2 := X2^⊤X2. It is clear that

    δ − I_p = −2(X1^⊤X1 + X2^⊤X2)^{−1}X1^⊤X1 = −2(G1 + G2)^{−1}G1,
    δ + I_p = 2(X1^⊤X1 + X2^⊤X2)^{−1}X2^⊤X2 = 2(G1 + G2)^{−1}G2.

Therefore, we have

    (1/4)(W̃1^⊤W̃1 + W̃2^⊤W̃2) = (1/4){ (δ + I_p)^⊤G1(δ + I_p) + (δ − I_p)^⊤G2(δ − I_p) }
      = G2(G1 + G2)^{−1}G1(G1 + G2)^{−1}G2 + G1(G1 + G2)^{−1}G2(G1 + G2)^{−1}G1
      = −G1(G1 + G2)^{−1}G1(G1 + G2)^{−1}G2 − G1(G1 + G2)^{−1}G2(G1 + G2)^{−1}G2 + 2G1(G1 + G2)^{−1}G2
      = G1(G1 + G2)^{−1}G2.

The proof is complete on recalling the definitions of G1 and G2.
The following lemma concerns the control of the s-operator norm of a symmetric matrix. Similar results have been derived in previous works (see, e.g. Wang, Berthet and Samworth, 2016, Lemma 2). For completeness, we include a statement and proof of the specific version we use.

Lemma 11. For any symmetric matrix M ∈ R^{p×p} and s ∈ [p], there exists a subset 𝒩 ⊂ S^{p−1} such that |𝒩| ≤ (p choose s) 9^s and

    ‖M‖_{s,op} ≤ 2 sup_{u∈𝒩} u^⊤Mu.

Proof. Define ℬ0(s) := ∪_{J⊂[p], |J|=s} S_J, where S_J := {v ∈ S^{p−1} : v_i = 0, ∀i ∉ J}. For each S_J, we find a 1/4-net 𝒩_J of cardinality at most 9^s (Vershynin, 2012, Lemma 5.2). Define 𝒩 := ∪_{J⊂[p], |J|=s} 𝒩_J, which has the desired upper bound on its cardinality. By construction, for ṽ ∈ arg max_{u∈ℬ0(s)} u^⊤Mu, there exists a v ∈ 𝒩 such that |supp(ṽ) ∪ supp(v)| ≤ s and ‖ṽ − v‖2 ≤ 1/4. We have

    ‖M‖_{s,op} = ṽ^⊤Mṽ = ṽ^⊤M(ṽ − v) + (ṽ − v)^⊤Mv + v^⊤Mv ≤ 2‖ṽ − v‖2 ‖M‖_{s,op} + v^⊤Mv
      ≤ (1/2)‖M‖_{s,op} + sup_{u∈𝒩} u^⊤Mu.

The desired inequality is obtained after rearranging terms in the above display.
The following lemma describes the asymptotic limit of the nuclear and Frobenius normsof the product of a matrix-variate Beta-distributed random matrix and its reflection.
Recall that, for n1 + n2 > p, we say that a p × p random matrix B follows the matrix-variate Beta distribution with parameters n1/2 and n2/2, written B ∼ Beta_p(n1/2, n2/2), if B = (S1 + S2)^{−1/2} S1 (S1 + S2)^{−1/2}, where S1 ∼ W_p(n1, I_p) and S2 ∼ W_p(n2, I_p) are independent Wishart matrices and (S1 + S2)^{1/2} is the symmetric matrix square root of S1 + S2. Recall also that the spectral distribution function of any p × p matrix A is defined as F_A(t) := p^{−1} Σ_{i=1}^p 1{λ_i^A ≤ t}, where the λ_i^A are the eigenvalues of A, counted with multiplicity. Further, given a sequence (A_p)_{p∈N} of matrices, their limiting spectral distribution function F is defined as the weak limit of the F_{A_p}, if it exists.
Lemma 12. Let B ∼ Beta_p(n1/2, n2/2) and suppose that λ1, . . . , λp are the eigenvalues of B. Define μ = (μ1, . . . , μp)ᵀ, with μ_i = λ_i(1 − λ_i) for i ∈ [p]. In the asymptotic regime of (C2), we have
‖μ‖_1/p →^{a.s.} ρ1 and ‖μ‖_2/√p →^{a.s.} ρ2,
where
ρ1 = r/{(1 + r)²(1 + s)} and ρ2² = r(r + s − rs + r²s + rs²)/{(1 + r)⁴(1 + s)³}.
Proof. We first study the limiting spectral distribution of B. From the asymptotic relations between n1, n2 and p in (C2), we have that
p/n1 → y := (s + rs)/(r + rs) and p/n2 → z := (s + rs)/(1 + s).
Define the left and right edges of the support
t_ℓ, t_r := {z(y + z) + yz(y − z) ∓ 2yz√(y + z − yz)}/(y + z)². (28)
By Bai et al. (2015, Theorem 1.1), almost surely, the weak limit F of F_B exists and is of the form max{1 − 1/y, 0}δ_0 + max{1 − 1/z, 0}δ_1 + ν, where δ_0 and δ_1 are point masses at 0 and 1 respectively, and ν has density
t ↦ {(y + z)√((t_r − t)(t − t_ℓ))}/{2πyz t(1 − t)} · 1_{[t_ℓ, t_r]}(t)
with respect to the Lebesgue measure on ℝ.
with respect to the Lebesgue measure on R. Define โ1 : ๐ก โฆโ ๐ก(1โ ๐ก). By the portmanteaulemma (see, e.g. van der Vaart, 2000, Lemma 2.2), we have almost surely that
โ๐โ1/๐ = ๐น๐ตโ1 โ ๐นโ1 =๐ + ๐2๐๐๐
โซ๏ธ ๐ก๐๐กโ
โ๏ธ(๐ก๐ โ ๐ก)(๐กโ ๐กโ)๐๐ก =
๐ + ๐16๐๐ (๐ก๐ โ ๐กโ)
2
= ๐(1 + ๐)2(1 + ๐ ) .
Similarly, for โ2 : ๐ก โฆโ ๐ก2(1โ ๐ก)2, we have almost surely that
โ๐โ22/๐โ ๐นโ2 =๐ + ๐2๐๐๐
โซ๏ธ ๐ก๐๐กโ
๐ก(1โ ๐ก)โ๏ธ
(๐ก๐ โ ๐ก)(๐กโ ๐กโ)๐๐ก
= ๐ + ๐256๐๐ (๐ก๐ โ ๐กโ)2(8๐กโ โ 5๐ก2โ + 8๐ก๐ โ 6๐กโ๐ก๐ โ 5๐ก2๐)
= ๐(๐ + ๐ โ ๐๐ + ๐2๐ + ๐๐ 2)
(๐ + 1)4(๐ + 1)3 .
Define ๐ 1 := ๐นโ1 and ๐ 2 := (๐นโ2)1/2, we arrive at the lemma.
The following result concerning the QR decomposition of a Gaussian random matrix is probably well known. However, since we did not find results in this exact form in the existing literature, we have included a proof here for completeness. Recall that for p ≥ q, the set O^{p×q} can be equipped with a uniform probability measure that is invariant under the action of left multiplication by O^{p×p} (see, e.g. Stiefel manifold in Muirhead, 2009, Section 2.1.4).
Lemma 13. Suppose p ≥ q and Z is a p × q random matrix with independent N(0, 1) entries. Write Z = HT, with H taking values in O^{p×q} and T an upper-triangular q × q matrix with non-negative diagonal entries. This decomposition is almost surely unique. Moreover, H and T are independent, with H uniformly distributed on O^{p×q} with respect to the invariant measure and T = (t_{i,j})_{i,j∈[q]} having independent entries satisfying t_{i,i}² ∼ χ²_{p−i+1} and t_{i,j} ∼ N(0, 1) for 1 ≤ i < j ≤ q.
Proof. The uniqueness of the QR decomposition follows since Z has rank q almost surely. The marginal distribution of T then follows from the Bartlett decomposition of ZᵀZ (Muirhead, 2009, Theorem 3.2.4) and the relationship between the QR decomposition of Z and the Cholesky decomposition of ZᵀZ.
For any fixed Q ∈ O^{p×p}, we have QZ =^d Z. Since O^{p×p} acts transitively (by left multiplication) on O^{p×q}, the joint density of H and T must be constant in H for each value of T. In particular, we have that H and T are independent, and that H is uniformly distributed on O^{p×q} with respect to the invariant measure.
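The distributional claims of Lemma 13 are easy to probe by simulation. In the sketch below (ours; signed_qr is a hypothetical helper that normalises numpy's QR output so that T has a non-negative diagonal, matching the lemma's convention), the Monte Carlo means of the squared diagonal entries are compared with the chi-squared degrees of freedom p − i + 1.

```python
import numpy as np

def signed_qr(Z):
    """QR decomposition Z = HT in which the diagonal of T is forced to be
    non-negative, the normalisation used in Lemma 13."""
    H, T = np.linalg.qr(Z)
    sign = np.sign(np.diag(T))
    sign[sign == 0] = 1.0
    return H * sign, sign[:, None] * T   # flip columns of H, rows of T

rng = np.random.default_rng(2)
p, q, reps = 6, 3, 4000
diag_sq = np.empty((reps, q))
for i in range(reps):
    Z = rng.standard_normal((p, q))
    H, T = signed_qr(Z)
    diag_sq[i] = np.diag(T) ** 2
assert np.allclose(H @ T, Z)             # still a valid factorisation

# Lemma 13: t_{i,i}^2 ~ chi^2_{p-i+1}, so E t_{i,i}^2 = p - i + 1
print(diag_sq.mean(axis=0))              # should be close to [6, 5, 4]
```

The sign-flipping step matters: without it, numpy's QR output is not distributed as in the lemma, since the diagonal of its T factor can be negative.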
The lemma below provides concentration inequalities for the norm of a multivariatenormal random vector with near-identity covariance.
Lemma 14. Let X ∼ N_p(μ, Σ). Suppose ‖μ‖_2 = M and ‖Σ − I‖_op ≤ 1/2. Then
P[‖X‖_2² − (p + M²)(1 + 2‖Σ − I‖_op) ≥ {2√((p + 2M²)t) + 2t}(1 + 2‖Σ − I‖_op)] ≤ 2e^{−t},
P[‖X‖_2² − (p + M²)(1 − 2‖Σ − I‖_op) ≤ −{2√((p + 2M²)t) + 2t}(1 + 2‖Σ − I‖_op)] ≤ 2e^{−t}.
Proof. Define Z = Σ^{−1/2}(X − μ). Then
‖X‖_2² = ZᵀΣZ + 2μᵀΣ^{1/2}Z + ‖μ‖_2² = ‖Z + μ‖_2² + Zᵀ(Σ − I)Z + 2μᵀ(Σ^{1/2} − I)Z.
Observe that ‖Z + μ‖_2² ∼ χ²_p(M²). By Birgé (2001, Lemma 8.1), we have
P{‖Z + μ‖_2² ≥ p + M² + 2√((p + 2M²)t) + 2t} ≤ e^{−t}, (29)
P{‖Z + μ‖_2² ≤ p + M² − 2√((p + 2M²)t)} ≤ e^{−t}. (30)
On the other hand, we also have
|Zᵀ(Σ − I)Z| + |2μᵀ(Σ^{1/2} − I)Z| ≤ ‖Z‖_2²‖Σ − I‖_op + 2M‖Z‖_2‖Σ^{1/2} − I‖_op ≤ (2‖Z‖_2² + M²)‖Σ − I‖_op, (31)
where the final inequality follows from a Taylor expansion of {I + (Σ − I)}^{1/2} after noting ‖Σ − I‖_op ≤ 1/2 and the elementary inequality 2ab ≤ a² + b² for a, b ∈ ℝ. By Laurent and Massart (2000, Lemma 1), we have with probability at least 1 − e^{−t} that
‖Z‖_2² ≤ p + 2√(pt) + 2t. (32)
Combining (29), (31) and (32), we have with probability at least 1 − 2e^{−t} that
‖X‖_2² ≤ {p + M² + 2√((p + 2M²)t) + 2t}(1 + 2‖Σ − I‖_op).
Similarly, for the lower bound, we have again by (30), (31) and (32) that
‖X‖_2² ≥ p + M² − 2√((p + 2M²)t) − {2p + 4√(pt) + 4t + M²}‖Σ − I‖_op
≥ (p + M²)(1 − 2‖Σ − I‖_op) − {2√((p + 2M²)t) + 2t}(1 + 2‖Σ − I‖_op)
with probability at least 1 − 2e^{−t}.
References

Arias-Castro, E., Candès, E. J. and Plan, Y. (2011) Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist., 39, 2533–2556.
Bai, Z., Hu, J., Pan, G. and Zhou, W. (2015) Convergence of the empirical spectral distribution function of Beta matrices. Bernoulli, 21, 1538–1574.
Birgรฉ, L. (2001) An alternative point of view on Lepskiโs method. In de Gunst, M andKlaassen, C. and van der Vaart, A. Eds,. Lecture Notes-Monograph Series 36, 113โ133.
Cai, T. T., Liu, W. and Xia, Y. (2014) Two-sample test of high dimensional means underdependence. J. Roy. Statist. Soc., Ser. B, 76, 349โ372.
Charbonnier, C., Verzelen, N. and Villers, F. (2015) A global homogeneity test for high-dimensional linear regression. Electron. J. Statist., 9, 318–382.
Chen, S. X., Li, J. and Zhong, P.-S. (2019) Two-sample and ANOVA tests for high-dimensional means. Ann. Statist., 47, 1443–1474.
Chow, G. C. (1960) Tests of equality between sets of coefficients in two linear regressions.Econometrica, 28, 591โ605.
Dicker, L. H. (2014) Variance estimation in high-dimensional linear models. Biometrika,101, 269โ284.
Donoho, D. and