Variable Selection in Sparse Regression with Quadratic Measurements
Jun Fan ∗ Lingchen Kong ∗ Liqun Wang† and Naihua Xiu ∗
Department of Applied Mathematics, Beijing Jiaotong University ∗
(E-mail: [email protected], [email protected], [email protected])
Department of Statistics, University of Manitoba †
(E-mail: [email protected])
Final Version, October 2016
Abstract
Regularization methods for high-dimensional variable selection and estimation have been intensively studied in recent years and most of them are developed in the framework of linear regression models. However, in many real data problems, e.g., in compressive sensing, signal processing and imaging, the response variables are nonlinear functions of the unknown parameters. In this paper we introduce a so-called quadratic measurements regression model that extends the usual linear model. We study the `q regularized least squares method for variable selection and establish the weak oracle property of the corresponding estimator. Moreover, we derive a fixed point equation and use it to construct an efficient algorithm for numerical optimization. Numerical examples are given to demonstrate the finite sample performance of the proposed method and the efficiency of the algorithm.
Keywords: sparsity, `q-regularization, moderate deviation, weak oracle property, optimization algorithm.
To appear in Statistica Sinica, 28 (2018), 1157-1178. doi:10.5705/ss.202015.0335
∗Supported by the National Natural Science Foundation of China (11431002, 11171018).
†Supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
Fan et al (2016): Variable Selection in Sparse Regression with Quadratic Measurements 2
1 Introduction and Motivation
In the era of big data, massive and high-dimensional data have become available in many scientific fields, e.g., genome and health science, economics and finance, astronomy and physics, and signal processing and imaging. The large size and high dimensionality of data pose significant challenges to traditional statistical methodologies; see, e.g., Donoho (2000) and Fan and Lv (2010) for excellent overviews. As pointed out by these
authors, a common feature in high-dimensional data analysis is the sparsity of the predictors
and one of the main goals is to select the most relevant variables to accurately predict a
response variable of interest.
Various regularization methods have been proposed in the literature, e.g., bridge regres-
sion (Frank and Friedman (1993)), the LASSO (Tibshirani (1996)), the SCAD and other
folded-concave penalties (Fan and Li (2001)), the Elastic-Net penalty (Zou and Hastie
(2005)), the adaptive LASSO (Zou (2006)), the group LASSO (Yuan and Lin (2006)), the
Dantzig selector (Candes and Tao (2007)), and the MCP (Zhang (2010)). Recently, Lv and Fan (2009) pointed out that there is a distinction as well as a close relation between the model selection problem in statistics and the sparse recovery problem in compressive sensing and signal processing. Moreover, they proposed a unified approach to deal with both problems.
However, most existing statistical methods for variable selection are developed in the
context of sparse linear regression. On the other hand, there is a large number of real data problems, especially in compressive sensing, signal processing and imaging, and statistics, where the regression relationships are nonlinear in the unknown parameters. The
following are some examples.
Example 1.1. Compressive sensing has been intensively studied in the last decade and the
main goal is to reconstruct sparse signals from the observations. Recently, the theory has
been extended to nonlinear compressive sensing and, in particular, to the so-called quadratic compressive sensing that aims to find the sparse signal β solving minβ∈Rp ‖β‖0 subject to yi = βTZiβ + xTi β + εi, i = 1, · · · , n, where ‖β‖0 is the number of nonzero entries of β, yi, εi ∈ R, xi ∈ Rp is a real vector and Zi ∈ Rp×p is a real matrix. For more details see,
e.g., Beck and Eldar (2013), Blumensath (2013) and Ohlsson et al (2013).
There is a special class of problems in optical imaging, where partially spatially incoherent light, such as in sub-wavelength optical imaging, results in a quadratic relationship between the input object β and the image intensity yi as yi ≈ βTZiβ, i = 1, · · · , n, where Zi is known from the mutual intensity and the impulse response function of the optical system (Shechtman et al (2011) and Shechtman et al (2012)).
Example 1.2. Phase retrieval plays an important role in X-ray crystallography, transmis-
sion electron microscopy, coherent diffractive imaging, etc. Generally speaking, the problem
is to recover the lost phase information from the observed magnitudes. In particular, in the real phase retrieval problem the goal is to find β ∈ Rp in yi = βT(zizTi)β + εi, i = 1, · · · , n, where zi ∈ Rp and yi ∈ R are observed variables and εi are random errors (Candes, Strohmer and Voroninski (2013), Candes, Li and Soltanolkotabi (2015), Eldar and Mendelson (2014), Lecue and Mendelson (2013), Netrapalli, Jain and Sanghavi (2013), Cai, Li and Ma (2015)).
Example 1.3. In wireless ad hoc and sensor networks, localization is crucial for building
low-cost, low-power and multi-functional sensor networks in which direct measurements of
all nodes’ locations via GPS or other similar means are not feasible (Biswas and Ye (2004),
Meng, Ding and Dasgupta (2008), Wang et al (2008)). The most important element of any localization algorithm is to measure the distances between sensors and anchors. However,
the acquired data are usually imprecise because of the measurement noise and estimation
errors. Suppose p-dimensional vectors x1, x2, ..., xn are the known sensor positions and
β ∈ Rp is the signal source location that is unknown and to be determined. Then the
measured distance yi from the source to each sensor node is given by y2i = ‖xi − β‖22 + εi, i = 1, · · · , n, where εi is a random error. Again, the above relation can be written as y2i − ‖xi‖22 = βTβ − 2xTi β + εi.
Example 1.4. Measurement error is ubiquitous in statistical data analysis. Wang (2003,
2004) showed that for a class of measurement error models to be identifiable and consis-
tently estimable, at least the first two conditional moments of the response variable given
the observed predictors are needed. Wang and Leblanc (2008) showed that in a general
nonlinear model this second-order least squares estimator (SLSE) is asymptotically more
efficient than the ordinary least squares estimator when the regression error has nonzero
third moment, and the two estimators have the same asymptotic variances when the error
term has symmetric distribution. In a linear model, the SLSE is derived based on the first
two conditional moments E(yi|xi) = xTi β and E(y2i |xi) = (xTi β)2 + σ2, i = 1, · · · , n, where
β is the vector of regression coefficients and σ2 is the variance of the regression error. It
is easy to see that the above second moment can be written as E(y2i|xi) = θTZiθ with θ = (βT, σ)T and

Zi = ( xixTi  0
       0      1 ).
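As a quick numerical sanity check (our own illustration, not from the paper; all variable names are hypothetical), the block-diagonal structure of Zi indeed reproduces the second conditional moment:

```python
import numpy as np

# Check that theta^T Z_i theta = (x_i^T beta)^2 + sigma^2 with
# theta = (beta^T, sigma)^T and Z_i = blockdiag(x_i x_i^T, 1).
rng = np.random.default_rng(0)
p = 3
beta = rng.normal(size=p)
sigma = 0.5
x_i = rng.normal(size=p)

theta = np.concatenate([beta, [sigma]])
Z_i = np.zeros((p + 1, p + 1))
Z_i[:p, :p] = np.outer(x_i, x_i)   # upper-left block x_i x_i^T
Z_i[p, p] = 1.0                    # lower-right scalar block

lhs = theta @ Z_i @ theta
rhs = (x_i @ beta) ** 2 + sigma ** 2
```

The two quantities agree because θTZiθ = βT(xixTi)β + σ².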
In these examples, the main goal is to recover the sparse signals in regression setups
where the response variable is a quadratic function of the unknown parameters, and thus
not covered by linear regression models. Despite their wide applications, the high-dimensional variable selection problem in such models has not been studied in the statistical literature.
In this paper we attempt to fill this gap. First, we introduce a so-called quadratic measurements regression (QMR) model as an extension of the usual linear model. Then we study the `q-regularized least squares (q-RLS) estimation in this model and establish its
weak oracle property (Lv and Fan (2009)). Moreover, using moderate deviations we show
that the estimators of the nonzero coefficients have an exponential convergence rate. To
deal with the problem of numerical optimization, we derive a fixed point equation that is
necessary for global optimality. This allows us to construct an iterative algorithm and to
establish its convergence. Finally, we present some numerical examples to demonstrate the
efficiency of the proposed method and algorithm.
In section 2 we introduce the quadratic measurements model and the q-RLS estimation.
In section 3 we discuss the weak oracle property of the q-RLS estimator using the mod-
erate deviation technique. In section 4, we deal with a special case of a purely quadratic
measurements model that has applications in some important problems. In section 5, we
derive a fixed point equation and construct an algorithm for numerical minimization. In
section 6, we present some numerical examples to illustrate our proposed method and to demonstrate its finite sample performance. Finally, conclusions and discussions are given
in section 7, while technical lemmas and proofs are given in the Appendices.
2 The quadratic measurements model
Motivated by the examples in the previous section, we define the quadratic measurements
regression (QMR) model as
yi = βTZiβ + xTi β + εi, i = 1, · · · , n, (1)
where yi ∈ R is the observed response, xi ∈ Rp is the vector of predictors, Zi ∈ Sp×p is
a symmetric design matrix, β ∈ Rp is the vector of unknown parameters, and εi ∈ R are
independent and identically distributed random errors with mean 0 and variance σ2. When
Zi ≡ 0, this reduces to the usual linear model
yi = xTi β + εi, i = 1, · · · , n. (2)
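For illustration (our own sketch, not part of the paper; dimensions and values are arbitrary), data from the QMR model (1) can be generated directly:

```python
import numpy as np

# Simulate y_i = beta^T Z_i beta + x_i^T beta + eps_i with symmetric Z_i;
# the sparse beta has only its first s entries nonzero.
rng = np.random.default_rng(1)
n, p, s = 50, 20, 3
beta = np.zeros(p)
beta[:s] = [1.5, -2.0, 1.0]

X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, p, p))
Z = (Z + Z.transpose(0, 2, 1)) / 2          # symmetrize each Z_i
eps = rng.normal(scale=0.1, size=n)

y = np.einsum("j,ijk,k->i", beta, Z, beta) + X @ beta + eps
# Setting Z identically to zero recovers the linear model (2).
y_lin = X @ beta + eps
```

With Zi ≡ 0 the quadratic term vanishes and y reduces to the usual linear model, as noted in the text.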
In this paper we are mainly interested in the high-dimensional case where p > n,
although our theory applies to the case p ≤ n as well. Throughout the paper we assume
that log p = o(nϱ) for some constant ϱ ∈ (0, 1), and E exp(δ0|ε1|) < ∞ for some δ0 > 0.
As mentioned earlier, in compressive sensing and signal processing the main goal is to
identify and estimate the smallest possible number of nonzero coefficients. Thus we consider
the problem of estimating unknown parameters of model (1) under the sparsity constraint
‖β‖0 ≤ s, where s < n is a certain integer; accordingly, we study the `q-regularized least squares (q-RLS) problem

minβ∈Rp Ln(β) := `n(β) + λn‖β‖qq, (3)

where `n(β) = ∑ni=1 (yi − βTZiβ − xTi β)2, λn > 0 and q ∈ (0, 1). The `q-regularization has
been widely used in compressive sensing. Compared to `1-regularization, this method tends
to produce precise signal reconstruction with fewer measurements (Chartrand (2007)), and
increases the robustness to noise and image non-sparsity (Saab, Chartrand and Yilmaz
(2008)). Moreover, Krishnan and Fergus (2009) demonstrated very high efficiency of `1/2
and `2/3 regularization in image deconvolution.
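The objective in (3) is easy to evaluate directly; the sketch below (our own, with hypothetical names, using q = 1/2) spells out the residual sum of squares and the `q penalty:

```python
import numpy as np

def qrls_objective(beta, y, X, Z, lam, q=0.5):
    """L_n(beta) = sum_i (y_i - beta^T Z_i beta - x_i^T beta)^2 + lam * ||beta||_q^q."""
    fitted = np.einsum("j,ijk,k->i", beta, Z, beta) + X @ beta
    loss = np.sum((y - fitted) ** 2)
    penalty = lam * np.sum(np.abs(beta) ** q)   # nonconvex for 0 < q < 1
    return loss + penalty

# Tiny sanity check: at the data-generating beta with no noise, the loss
# term vanishes and only the penalty remains.
rng = np.random.default_rng(2)
n, p = 10, 4
beta0 = np.array([1.0, 0.0, -2.0, 0.0])
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, p, p)); Z = (Z + Z.transpose(0, 2, 1)) / 2
y = np.einsum("j,ijk,k->i", beta0, Z, beta0) + X @ beta0
val = qrls_objective(beta0, y, X, Z, lam=1.0)
```

In this noiseless example val equals λ‖β0‖qq = |1|^{1/2} + |−2|^{1/2} = 1 + √2.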
A minimizer β of the optimization problem (3) is called the q-RLS estimator; it is a generalization of the bridge estimator in linear models (Frank and Friedman (1993)). It is well known that the bridge estimator has various desirable properties, including sparsity and consistency (Knight and Fu (2000), Huang, Horowitz and Ma (2008)). A natural question is whether the q-RLS solution of (3) continues to enjoy these properties in the more general model (1). To answer this question, we study the moderate deviation (MD) of β, which gives the rate of convergence to β∗ at rates slower than n−1/2 (Kallenberg (1983)).
Although we are mainly interested in the variable selection problem, our results on identifiability and the numerical optimization algorithm apply also to the case q ≥ 1. However, our consistency results for selection and estimation hold only for the case where q ∈ (0, 1); this is not surprising given the analogous well-known facts for linear models (Fan and Li (2001), Zou (2006)).
Throughout the paper we use the following notation. For any d-dimensional vector v = (v1, · · · , vd)T, let |v| = (|v1|, · · · , |vd|)T, v2 = (v21, · · · , v2d)T, ‖v‖2 = (∑di=1 v2i)1/2, ‖v‖1 = ∑di=1 |vi| and ‖v‖∞ = max{|v1|, · · · , |vd|}. For any set Γ ⊆ {1, · · · , d}, denote its cardinality by |Γ| and Γc = {1, · · · , d}\Γ. For any n × d matrix A = [aij], let ‖A‖F = (∑ni=1 ∑dj=1 a2ij)1/2 and |A|∞ = max1≤i,j≤d |aij|. Denote by AΓ the sub-matrix of A consisting of its columns associated with the index set Γ ⊆ {1, · · · , d}, by AΓ′ the sub-matrix of A consisting of its rows indexed by Γ′ ⊆ {1, · · · , n}, and by AΓ′Γ the sub-matrix consisting of the rows and columns of A indexed by Γ′ and Γ, respectively. We use the notation vΓ for a column or a row vector v. Finally, denote by ed,j the jth column of the d × d identity matrix Id.
3 Weak oracle property
In this section we discuss the moderate deviation and consistency of the q-RLS estimators.
Let β∗ be the true parameter value of model (1) and Γ∗ = supp(β∗) := {j : eTp,jβ∗ ≠ 0, j = 1, · · · , p}. Without loss of generality, let |Γ∗| = s < n. Let X = (x1, · · · , xn)T, where
xi = (xi1, · · · , xip)T , i = 1, · · · , n. Then following Huang, Horowitz and Ma (2008), we
assume that there exist constants 0 < c ≤ c̄ < ∞ such that

c ≤ min{|eTp,jβ∗|, j ∈ Γ∗} ≤ max{|eTp,jβ∗|, j ∈ Γ∗} ≤ c̄.
Following the literature (e.g., Zou and Hastie (2005), Huang, Horowitz and Ma (2008),
Fan, Fan and Barut (2014)), the data are assumed to be standardized so that

∑ni=1 yi = 0,  ∑ni=1 xij = 0,  max{∑ni=1 x2ij, ∑ni=1 |Zi|2∞} = n,  j = 1, · · · , p. (4)

In the linear model, the third equality above reduces to ∑ni=1 x2ij = n.
3.1 Identifiability of β∗
For the sparse linear model, Donoho and Elad (2003) introduced the concept of spark and
showed that the uniqueness of β∗ can be characterized by spark(X) which is defined as
the minimum number of linearly dependent columns of the design matrix X. Another way
to express this property is via the s-regularity of X, i.e., any s columns of X are linearly
independent. Indeed, X is s-regular if and only if spark(X) ≥ s + 1 (Beck and Eldar
(2013)). Further, in the linear model, −X is the Jacobian matrix of the residual function
R(β) = y − Xβ, where y = (y1, · · · , yn)T. Correspondingly, under model (1) the residual function is R(β) = (R1(β), · · · , Rn(β))T with Ri(β) = yi − βTZiβ − xTi β, and hence the Jacobian is (−2Z1β − x1, · · · , −2Znβ − xn)T.
Definition 3.1. The affine transform A(β) = (Z1β + x1, · · · , Znβ + xn)T is said to be uniformly s-regular if A(β)Γ has full column rank for any Γ ⊆ {1, · · · , p} with |Γ| = s and β ∈ Rp with supp(β) ⊆ Γ.
Obviously, the uniform s-regularity of A(β) implies the s-regularity of X. It is straightforward to verify that A(β) is uniformly s-regular if and only if the submatrix AΓ(β1) = (ZΓΓ1 β1, · · · , ZΓΓn β1)T + XΓ has full column rank for any index set Γ ⊆ {1, · · · , p} with |Γ| = s and β1 ∈ Rs.
In the linear model, we have AΓ(β1) = XΓ since Zi ≡ 0, and therefore the uniform
s-regularity of A(β) reduces to the s-regularity of X. On the other hand, if Zi ≡ Ip as in
Example 1.3, then A(β) is uniformly s-regular provided ∑ni=1 xi = 0 and X is s-regular.
Proposition 3.1. Let yi = β∗TZiβ∗ + xTi β∗, i = 1, · · · , n. Then the system of equations βTZiβ + xTi β = yi, i = 1, · · · , n, has a unique solution β∗ satisfying ‖β∗‖0 ≤ s if A(β) is uniformly 2s-regular.
3.2 Moderate deviation and consistency
It is well known that strong convexity is the standard condition for the existence of a unique solution to a convex optimization problem. When the objective function is twice differentiable, an equivalent condition is that the Hessian is uniformly positive definite. To establish the consistency of an M-estimator in high dimensions, Negahban et al (2012) introduced the concept of restricted strong convexity, which requires the objective function to be strongly convex on a certain set. To analyze the accuracy of a greedy method for the sparsity-constrained optimization problem, Bahmani, Raj and Boufounos (2013) used the stable restricted Hessian, which bounds the curvature of the loss function over the sparse subspaces locally from above and below with bounds of the same order. However, the calculation of the exact Hessian of our model is costly. Fortunately, the transform A(β) has a special structure that allows us not only to use the Jacobian to obtain the gradient ∇`n(β) = −2A(2β)TR(β), but also to employ it to approximate the Hessian near β∗, as shown below. Hence we introduce the following conditions.
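The gradient formula ∇`n(β) = −2A(2β)TR(β) can be verified numerically; the following check (our own illustration on synthetic data, using a finite-difference comparison) confirms it:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, p, p)); Z = (Z + Z.transpose(0, 2, 1)) / 2
y = rng.normal(size=n)

def loss(beta):
    r = y - np.einsum("j,ijk,k->i", beta, Z, beta) - X @ beta
    return np.sum(r ** 2)

def grad(beta):
    # A(2*beta) stacks the rows (2 Z_i beta + x_i)^T; R(beta) is the residual.
    A2 = 2 * np.einsum("ijk,k->ij", Z, beta) + X
    R = y - np.einsum("j,ijk,k->i", beta, Z, beta) - X @ beta
    return -2 * A2.T @ R

beta = rng.normal(size=p)
num = np.array([(loss(beta + 1e-6 * e) - loss(beta - 1e-6 * e)) / 2e-6
                for e in np.eye(p)])
assert np.allclose(grad(beta), num, atol=1e-4)
```

The central-difference gradient matches the closed form, since each residual has derivative −(2Ziβ + xi) when Zi is symmetric.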
Condition 1 (Uniformly Stable Restricted Jacobian).
(a) For any Γ ⊆ {1, · · · , p} with |Γ| = s and β ∈ Rp satisfying supp(β) ⊆ Γ, there exists a positive constant c1 that bounds all eigenvalues of n−1((A(β)Γ)TA(β)Γ) from below.
(b) For any Γ ⊆ {1, · · · , p} with |Γ| = s and β ∈ Rp satisfying supp(β) ⊆ Γ and ‖β‖ ≤ (2c̄ + 3√((σ2 + 1)/c1))√s, there exists a positive constant c2 that bounds all eigenvalues of n−1((A(β)Γ)TA(β)Γ) from above.
It is easy to see that (a) and (b) are respectively equivalent to the following conditions.
(a′) For any Γ ⊆ {1, · · · , p} with |Γ| = s and β1 ∈ Rs, there exists a positive constant c1 that bounds all eigenvalues of n−1(AΓ(β1)TAΓ(β1)) from below.
(b′) For any Γ ⊆ {1, · · · , p} with |Γ| = s and β1 ∈ S := {u ∈ Rs : ‖u‖ ≤ (2c̄ + 3√((σ2 + 1)/c1))√s}, there exists a positive constant c2 that bounds all eigenvalues of n−1(AΓ(β1)TAΓ(β1)) from above.
Again, in the linear model (2), Condition 1 reduces to the first assumption of Condition 2 in Fan, Fan and Barut (2014) that the eigenvalues of n−1XTΓXΓ are bounded from below and above. For the general case, (a′) is similar to the restricted strong convexity in Negahban et al (2012). Indeed, the minimization problem (3) is derived from the original optimization problem minβ∈Rp `n(β) subject to ‖β‖0 ≤ s. So, we first consider the
unconstrained optimization problem

minβ1∈Rs (1/n)ℓ̃n(β1) := (1/n)∑ni=1 (yi − βT1 ZΓ∗Γ∗i β1 − xTiΓ∗β1)2 (5)

that is clearly non-convex and may not have a unique solution in general. However, one can calculate the Hessian matrix of the objective function (1/n)ℓ̃n(β1) at β∗Γ∗ as

∇2((1/n)ℓ̃n(β∗Γ∗)) = (2/n)AΓ∗(2β∗Γ∗)TAΓ∗(2β∗Γ∗) − (4/n)∑ni=1 εiZΓ∗Γ∗i.
Since ‖ZΓ∗Γ∗i‖2F ≤ s2|Zi|2∞, the third equality of (4) implies that ∑ni=1 ‖ZΓ∗Γ∗i‖2F ≤ ns2. Further, it follows from Chebyshev's inequality and s = o(√n) that

(1/n)‖∑ni=1 εiZΓ∗Γ∗i‖F → 0 in probability, as n→∞.
Hence Condition 1 (a′) ensures that the Hessian matrix ∇2(n−1ℓ̃n(β∗Γ∗)) is strictly positive definite, and therefore the problem (5) has a unique solution in a neighborhood of β∗Γ∗ with probability approaching one. It follows that the minimization problem (3) may have a unique solution in a neighborhood of β∗, as in Negahban et al (2012). Moreover, (a′) implies that AΓ(β1) has full column rank for any Γ with |Γ| = s, and therefore A(β) is uniformly s-regular.
Further, (b′) is similar to the upper bound of the stable restricted Hessian. In particular, if s is finite, then (b′) implies that the curvature of the loss function is bounded from above at locations within a neighbourhood of the origin. From the proof in Appendix A, one can see that (b′) ensures a more accurate convergence rate.
Condition 2 (Asymptotic Property of Design Matrix). Let κ1n = |X|∞ and κ2n = max1≤i≤n |Zi|∞ be such that, as n→∞,

κ1n√s/√n → 0,  κ2ns3/2/√n → 0. (6)
The first convergence in (6) is the same as in Fan, Fan and Barut (2014, Condition 2).
The second convergence in (6) and (8) below are required to deal with the quadratic term
in the low-dimensional space Rs.
Condition 3 (Partial Orthogonality). For any Γ ⊆ {1, · · · , p} with |Γ| = s, there exists a positive constant c0 such that

(1/√n)|∑ni=1 xiΓ ⊗ xiΓc|∞ ≤ c0,  (1/√n)(|∑ni=1 xiΓ ⊗ ZΓΓci|∞ + |∑ni=1 xiΓc ⊗ ZΓΓi|∞) ≤ c0,

and

(1/√n)(|∑ni=1 ZΓΓi ⊗ ZΓcΓi|∞ + |∑ni=1 ZΓΓi ⊗ ZΓcΓci|∞) ≤ c0,

where ⊗ is the Kronecker product.
Again, in the linear model (2), Condition 3 coincides with the partial orthogonality condition of Huang, Horowitz and Ma (2008) that n−1/2|∑ni=1 xijxik| ≤ c0 for any j ∈ Γ, k ∈ Γc, which is an essential assumption for the consistency when p > n.
Condition 4 (Asymptotic Property of Tuning Parameter). Let λn ≥ σc^{1−q}√(n log p) be such that, as n→∞,

n^{q/2}s^{3−q}(log n)^{2−q}/λn → 0,  λns^{(4−q)/(2(1−q))}/n → 0 (7)

and

λnκ1ns log n/n → 0,  λnκ2ns2 log n/n → 0. (8)
The inequality here is equivalent to λn > 2√((1 + C)n log p) for some positive constant C, which is used in Fan, Fan and Barut (2014). The first convergence is similar to the first one in their condition (4.4) that γns^{3/2}κ21n(log2 n)2 = o(λ2n/n), where γn = γ0((√s log n)/√n + λn‖d0‖2/n), γ0 is a positive constant and d0 is an s-dimensional vector of nonnegative weights, and the first convergence in (8) is similar to their second convergence in (4.4). In particular, the first convergence of this condition is trivial when s is finite and the inequality in Condition 4 holds. The second convergence implies that λns2/n → 0, i.e., the penalty parameter λn is o(n) if s is finite. If, for example, λn = nδ for a positive constant δ, then Condition 4 implies that δ ∈ (1/2, 1) and log p = o(n^{2δ−1}). Thus, Condition 4 imposes a range for the penalty parameter with respect to the sample size n and dimension p. It is easy to verify that Condition 2 also implies that s = o(√n), as needed to approximate the Hessian through the Jacobian.
The proof of the following is given in Appendix A.
Theorem 3.1. (Moderate Deviation). Under model (1), if Conditions 1-4 hold, then there exists a strict local minimizer β = (βTΓ∗, βTΓ∗c)T of (3) and a positive constant C0 < min{1/(8σ2), 1/(2c̄2σ2), c21/(8c̄2σ2)} such that

P(βΓ∗c = 0) ≥ 1 − exp(−C0a2n) (9)

and

P(‖βΓ∗ − β∗Γ∗‖2 ≤ rn) ≥ 1 − exp(−C0a2n), (10)
where

βΓ∗ ∈ argminβ1∈Rs Ln(β1) := ∑ni=1 (yi − βT1 ZΓ∗Γ∗i β1 − xTiΓ∗β1)2 + λn‖β1‖qq,

rn = an/√n + 2c^{q−1}λn√s/(c1n), and {an} is a sequence of positive numbers such that, as n→∞,

an/√(s log n) → ∞, (11)

anκ1n√s/√n → 0,  anκ2ns3/2/√n → 0, (12)

and

an(n^{q/2}s^{(4−q)/2}/λn)^{1/(2−q)} → 0. (13)
Note that since max(κ21n, κ22n) ≥ 1, conditions (12) imply an√s/√n → 0. Again, if s is finite and λn = nδ for some δ ∈ (1/2, 1), then conditions (11)-(13) simplify to

an/√log n → ∞,  anκ1n/√n → 0,  anκ2n/√n → 0,  an/n^{(2δ−q)/(2(2−q))} → 0.

It follows that {an} tends to infinity faster than √log n but slower than n^{(2δ−q)/(2(2−q))} = o(√n). This differs from the case of the linear model with fixed dimension p ≪ n, where only anκ1n/√n → 0 is required to establish the MD of M-estimators (Fan (2012), Fan, Yan and Xiu (2014)). Here we assume (11)-(13) to cover the case of p ≫ n.
By inequality (9), the q-RLS estimator correctly selects the nonzero variables with probability approaching one exponentially fast. It follows from (10) that the estimators of the nonzero variables are consistent with an exponential rate of convergence. Theorem 3.1 also implies that P(‖β − β∗‖2 > rn) ≤ exp(−C0a2n), i.e., the tail probability decreases exponentially at rate a2n, like that of the Gaussian distribution.
Theorem 3.1 gives general results on the MD. By taking an = √s log n, we obtain the familiar forms of the convergence rate.
Theorem 3.2. (Weak Oracle Property). Under model (1), if Conditions 1-4 hold, then there exists a strict local minimizer β = (βTΓ∗, βTΓ∗c)T of (3) such that, for sufficiently large n,

P(βΓ∗c = 0) ≥ 1 − n^{−C0s log n} (14)

and

P(‖βΓ∗ − β∗Γ∗‖2 ≤ (√s log n)/√n + 2c^{q−1}λn√s/(c1n)) ≥ 1 − n^{−C0s log n}. (15)
In particular, when Zi ≡ 0, Conditions 1-4 reduce to similar conditions of Huang, Horowitz and Ma (2008) and Fan, Fan and Barut (2014) for the linear model (2). Consequently, we have the following result, which is similar to theirs.
Corollary 3.1. Under the linear model (2), the results of Theorem 3.2 hold, provided Conditions 1-3, (7) and the first condition in (8) of Condition 4 hold, that is,
(1) for each Γ ⊆ {1, · · · , p} with |Γ| = s, the eigenvalues of n−1XTΓXΓ are bounded from below and above by some positive constants c1 and c2 respectively;
(2) κ1n√s/√n → 0, as n→∞;
(3) for each Γ ⊆ {1, · · · , p} with |Γ| = s, there exists a positive constant c0 such that

n−1/2|∑ni=1 xijxik| ≤ c0, ∀j ∈ Γ, k ∈ Γc;

(4) λn ≥ σc^{1−q}√(n log p) and n^{q/2}s^{3−q}(log n)^{2−q}/λn → 0, λns^{(4−q)/(2(1−q))}/n → 0, λnκ1ns log n/n → 0, as n→∞.
Remark 3.1. To deal with the case p > n, Huang, Horowitz and Ma (2008) showed that the marginal bridge estimators satisfy P(βΓ∗c = 0) → 1 and P(eTp,jβ ≠ 0, j ∈ Γ∗) → 1. Here we provide the rate of this convergence. The result (15) is slightly different from Theorem 2 in Fan, Fan and Barut (2014), which has

P(‖βΓ∗ − β∗Γ∗‖2 ≤ γ0((√s log n)/√n + λn‖d0‖2/n)) ≥ 1 − O(n−cs),

where γ0 and c are two positive constants and d0 is an s-dimensional vector of nonnegative weights. To find the constant c, we use the number √log n to dominate the constant γ0, which results in a slower consistency rate. To compensate for this loss, the right hand side of (15) tends to one at a faster rate.
4 Purely quadratic model
In the previous sections we have studied the regularized least squares method in the QMR model, which is an extension of the linear model. However, as demonstrated in Examples 1.1 and 1.2, in many applications in phase retrieval and optical imaging the models do not contain a linear term. Therefore in this section we consider the purely quadratic measurements model
yi = βTZiβ + εi, i = 1, · · · , n. (16)
In particular, this covers the phase retrieval model where Zi = zizTi as demonstrated in
Example 1.2. As this model differs from the general model (1), some theoretical conditions
and results in the previous sections need to be modified.
4.1 Identifiability of β∗
The absence of the linear term in model (16) makes β∗ and −β∗ indistinguishable from the observed data. In the phase retrieval literature, e.g., Balan, Casazza and Edidin (2006) and Ohlsson and Eldar (2014), this problem is treated by identifying ±β for any β ∈ Rp.
Without loss of generality, we assume that the first nonzero element of β∗ is positive.
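The sign ambiguity is easy to verify numerically (our own illustration, using the phase retrieval design Zi = zizTi from Example 1.2):

```python
import numpy as np

# In the purely quadratic model y_i = beta^T Z_i beta, the measurements
# cannot distinguish beta from -beta: the quadratic form is even in beta.
rng = np.random.default_rng(4)
n, p = 6, 4
z = rng.normal(size=(n, p))
Z = np.einsum("ij,ik->ijk", z, z)        # phase retrieval: Z_i = z_i z_i^T
beta = rng.normal(size=p)

y_plus = np.einsum("j,ijk,k->i", beta, Z, beta)
y_minus = np.einsum("j,ijk,k->i", -beta, Z, -beta)
assert np.allclose(y_plus, y_minus)
# Under the convention above, the sign is fixed by requiring the first
# nonzero entry of beta to be positive.
```

For this design yi reduces to (zTiβ)2, which is plainly invariant under β ↦ −β.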
For the phase retrieval problem, Balan, Casazza and Edidin (2006) and Bandeira et al (2014) introduce the complement property, which is necessary and sufficient for identifiability. For the sparse regression, Ohlsson and Eldar (2014) propose the more general concept of the s-complement property. In the phase retrieval model where Zi = zizTi, the s-complement property of {zi} means that either {ziΓ}i∈N or {ziΓ}i∈Nc spans Rs for every subset N ⊆ {1, · · · , n} and Γ ⊆ {1, · · · , p} with |Γ| = s. Here the identifiability of β∗ in
(1) is guaranteed by the uniform s-regularity of the affine transform A(β). In model (16),
the residual function R(β) = (R1(β), · · · , Rn(β))T with Ri(β) = yi − βTZiβ has Jacobian matrix (−2Z1β, · · · , −2Znβ)T. Hence Definition 3.1 is modified as follows.
Definition 4.1. The linear transform B(β) = (Z1β, · · · , Znβ)T is said to be uniformly s-regular if B(β)Γ has full column rank for any Γ ⊆ {1, · · · , p} with |Γ| = s and β ∈ Rp\{0} with supp(β) ⊆ Γ.
If Zi = zizTi, then the uniform s-regularity of B(β) is equivalent to the s-complement property of {zi}. Further, it is straightforward to verify that B(β) is uniformly s-regular if and only if the submatrix BΓ(β1) = (ZΓΓ1 β1, · · · , ZΓΓn β1)T has full column rank for any Γ ⊆ {1, · · · , p} with |Γ| = s and β1 ∈ Rs\{0}. The proof of the following is analogous to that of Theorem 4 in Ohlsson and Eldar (2014) and is therefore omitted.
Proposition 4.1. Under model (16), β∗ is the unique solution satisfying ‖β∗‖0 ≤ s if B(β) is uniformly 2s-regular.
4.2 Weak oracle property
To derive the MD and consistency results under model (16), we modify Conditions 1-4 in section 3.2 as follows.
Condition 1′ (Uniformly Stable Restricted Jacobian).
(a) For any Γ ⊆ {1, · · · , p} with |Γ| = s and β1 ∈ S1 := {u ∈ Rs : |{j : |eTs,ju| ≥ c/2}| ≥ s − [s/2]}, there exists a positive constant c1 that bounds all eigenvalues of n−1(BΓ(β1)TBΓ(β1)) from below.
(b) For any Γ ⊆ {1, · · · , p} with |Γ| = s and β1 ∈ S, there exists a positive constant c2 that bounds all eigenvalues of n−1(BΓ(β1)TBΓ(β1)) from above.
Condition 2′ (Asymptotic Property of Design Matrix). As n→∞, n−1/2κ2ns3/2 → 0.
Condition 3′ (Partial Orthogonality). For any Γ ⊆ {1, · · · , p} with |Γ| = s, there exists a positive constant c0 such that

n−1/2(|∑ni=1 ZΓΓi ⊗ ZΓcΓi|∞ + |∑ni=1 ZΓΓi ⊗ ZΓcΓci|∞) ≤ c0.
Condition 4′ (Asymptotic Property of Tuning Parameter). Let λn ≥ σc^{1−q}√(n log p) be such that, as n→∞,

n^{q/2}s^{3−q}(log n)^{2−q}/λn → 0,  λns^{(4−q)/(2(1−q))}/n → 0  and  λnκ2ns2 log n/n → 0.
For the phase retrieval model, since BΓ(β1)TBΓ(β1) = ∑ni=1 (zTiΓβ1)2ziΓzTiΓ, Condition 1′ implies that at β∗Γ∗,

c1‖u‖2 ≤ (1/n)∑ni=1 (zTiΓ∗β∗Γ∗)2(zTiΓ∗u)2 ≤ c2‖u‖2,  ∀u ∈ Rs\{0}.
This is similar to Corollary 7.6 of Candes, Li and Soltanolkotabi (2015), that for some δ ∈ (0, 1),

(1 − δ)‖h‖2 ≤ n−1∑ni=1 |〈zi, β∗〉|2|〈zi, h〉|2 ≤ (1 + δ)‖h‖2

for all h ∈ Cp with ‖h‖ = 1, provided n > p and the zi are sampled from suitable distributions such as the Gaussian.
Theorem 4.1. (Moderate Deviation). Under model (16), if Conditions 1′-4′ hold and

βΓ∗ ∈ argminβ1∈Rs Ln(β1) := ∑ni=1 (yi − βT1 ZΓ∗Γ∗i β1)2 + λn‖β1‖qq,

then there exists a strict local minimizer β = (βTΓ∗, βTΓ∗c)T of (3) such that (9) and (10) hold with {an} satisfying (11), (13) and the second condition in (12).
Similar to Theorem 3.2, we can obtain the familiar form of the convergence rate for β by taking an = √s log n.
Theorem 4.2. (Weak Oracle Property). Under model (16) and Conditions 1′-4′, there exists a strict local minimizer β = (βTΓ∗, βTΓ∗c)T of (3) such that (14) and (15) hold.
For the phase retrieval problem, Candes, Strohmer and Voroninski (2013) used convex relaxation to construct a consistent estimator of the matrix β∗(β∗)T but not of β∗ itself. The consistency of estimating β∗ was studied by Eldar and Mendelson (2014) and Lecue and Mendelson (2013). We obtain the following weak oracle property as a consequence of Theorem 4.2.
Corollary 4.1. Under model (16) with Zi = zizTi, i = 1, 2, ..., n, the result of Theorem 4.2 holds if Conditions 2′, 4′ and the following hold:
(1) for any Γ ⊆ {1, · · · , p} with |Γ| = s, there exist positive constants c1 and c2 such that n−1∑ni=1 (zTiΓβ1)2(zTiΓu)2 ≥ c1‖u‖2 for any β1 ∈ S1, and n−1∑ni=1 (zTiΓβ1)2(zTiΓu)2 ≤ c2‖u‖2 for any β1 ∈ S;
(2) for any Γ ⊆ {1, · · · , p} with |Γ| = s, there exists a positive constant c0 such that n−1/2(|∑ni=1 (ziΓ ⊗ ziΓc)(ziΓ ⊗ ziΓ)T|∞ + |∑ni=1 (ziΓ ⊗ ziΓc)(ziΓ ⊗ ziΓc)T|∞) ≤ c0.
5 Optimization algorithm
The numerical computation of the q-RLS estimator as the solution of (3) is an important and challenging issue, since the `q (0 < q < 1) regularization yields a nonconvex, nonsmooth, and non-Lipschitz optimization problem. Recently, this type of problem has attracted much attention in the field of optimization, including the development of optimality conditions and computational algorithms; see, e.g., Xu et al (2012b), Chen, Niu and Yuan (2013), Lu (2014) and references therein. In this section, we propose an algorithm for the minimization problem (3). Since n and p are given, to simplify notation we omit the subscript n of `n(β) and λn, so that (3) is written as
minβ∈Rp L(β) := `(β) + λ‖β‖qq, (17)
where λ > 0. To motivate our algorithm we establish a fixed point equation. We start by considering the simple minimization problem

minu∈R ϕt(u) := (1/2)(u − t)2 + λ|u|q, (18)

where t ∈ R, λ > 0 and q ∈ (0, 1). For this problem Chen, Xiu and Peng (2014) show that there exists an implicit function hλ,q(·) such that the minimizer u of (18) satisfies u = hλ,q(t). In particular, for q = 1/2, Xu et al (2012b) give the explicit expression hλ,1/2(t) = (2/3)t(1 + cos((2π/3) − (2/3)φλ(t))) with φλ(t) = arccos((λ/4)(|t|/3)^{−3/2}).
Theorem 5.1. There exists a function h_{λ,q}(·) and a constant r > 0 such that any minimizer β of problem (17) satisfies
β = H_{λτ,q}( β − τ∇ℓ(β) )   (19)
for any τ ∈ (0, min{G_r^{−1}, 1}), where G_r = sup_{β∈B_r} ‖∇²ℓ(β)‖_2, B_r = {β ∈ R^p : ‖β‖_2 ≤ r}, and H_{λ,q}(u) = (h_{λ,q}(u_1), ..., h_{λ,q}(u_p))^T for u = (u_1, ..., u_p)^T ∈ R^p.
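Equation (19) can be sanity-checked in the convex special case q = 1, where h_{λ,1} is the soft-thresholding map given later in this section and ℓ is a one-dimensional quadratic; the toy loss below is an assumption of ours for illustration, not the quadratic-measurements model of the paper.

```python
def soft(t, lam):
    # h_{lam,1}(t) = max(0, t - lam) - max(0, -t - lam), soft thresholding
    return max(0.0, t - lam) - max(0.0, -t - lam)

# one-dimensional loss l(b) = 0.5*(b - 2)^2 with penalty lam*|b|
lam, tau = 0.5, 0.5
grad = lambda b: b - 2.0
beta_hat = soft(2.0, lam)                           # exact minimizer: 1.5
# fixed point equation (19): beta_hat = H_{lam*tau,1}(beta_hat - tau*grad(beta_hat))
fixed_point = soft(beta_hat - tau * grad(beta_hat), lam * tau)
```

In the convex case this is exactly the proximal-gradient fixed-point characterization; Theorem 5.1 extends the same identity to the nonconvex range q ∈ (0, 1).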
Remark 5.1. The result of Theorem 5.1 remains true for any function ℓ that is bounded from below, twice continuously differentiable, and for which lim_{‖β‖→∞} ℓ(β) = ∞. An appropriate algorithm can be derived similarly to that below.
Remark 5.2. The general ℓ_q minimization problem min_{β∈R^p} f(β) + λ‖β‖_q^q with λ > 0, q ∈ (0, 1) has been well studied in the optimization literature, and efficient algorithms have been proposed for f(β) = ‖Xβ − y‖². For example, Chen, Xu and Ye (2010) derived lower bounds for nonzero entries of the local minimizer and presented a hybrid orthogonal matching pursuit-smoothing gradient method, while Xu et al (2012b) provided a globally necessary optimality condition for the case q = 1/2 and proposed an efficient iterative algorithm. More recently, the general ℓ_q problem has been studied by Chen, Niu and Yuan (2013), who proposed a smoothing trust region Newton method for solving a class of non-Lipschitz optimization problems. Lu (2014) studied iterative reweighted methods for a smooth and bounded (from below) function f with an L_f-Lipschitz continuous gradient satisfying ‖∇f(β) − ∇f(β′)‖ ≤ L_f‖β − β′‖. Bian, Chen and Ye (2015) proposed interior point algorithms for solving a class of non-Lipschitz nonconvex optimization problems with nonnegative bounded constraints. In these works the solution sequence of the algorithm converges to a stationary point derived from the Karush-Kuhn-Tucker conditions.
Based on (19), we propose a fixed point iterative algorithm (FPIA).
Algorithm 5.1.
Step 0. Given λ > 0, ε ≥ 0, γ, α ∈ (0, 1) and δ > 0, choose an arbitrary β⁰ and set k = 0.
Step 1. (a) Compute ∇ℓ(β^k) from ∇ℓ(β) = 2 Σ_{i=1}^n (β^T Z_i β + x_i^T β − y_i)(2Z_i β + x_i);
(b) Compute β^{k+1} = H_{λτ_k,q}(β^k − τ_k∇ℓ(β^k)) with τ_k = γα^{j_k}, where j_k is the smallest nonnegative integer such that
L(β^k) − L(β^{k+1}) ≥ (δ/2)‖β^k − β^{k+1}‖_2².   (20)
Step 2. Stop if
‖β^{k+1} − β^k‖_2 / max{1, ‖β^k‖_2} ≤ ε.
Otherwise, replace k by k + 1 and go to Step 1.
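A minimal sketch of Algorithm 5.1 in Python, assuming for illustration the convex case q = 1 (so that H reduces to componentwise soft thresholding) and a separable quadratic loss in place of the quadratic-measurements loss; all names and parameter values are ours, not the paper's implementation.

```python
def soft(t, lam):
    # componentwise h_{lam,1}: soft thresholding
    return max(0.0, t - lam) - max(0.0, -t - lam)

def fpia(grad, loss, p, lam, gamma=0.99, alpha=0.5, delta=1e-4,
         eps=1e-6, max_iter=500):
    """Sketch of the fixed point iterative algorithm (FPIA), q = 1 case."""
    beta = [0.0] * p
    L = lambda b: loss(b) + lam * sum(abs(x) for x in b)
    for _ in range(max_iter):
        g = grad(beta)
        tau = gamma
        for _ in range(50):  # Armijo-type line search: tau_k = gamma * alpha^{j_k}
            cand = [soft(b - tau * gi, lam * tau) for b, gi in zip(beta, g)]
            diff2 = sum((a - b) ** 2 for a, b in zip(beta, cand))
            if L(beta) - L(cand) >= 0.5 * delta * diff2:  # condition (20)
                break
            tau *= alpha
        step = diff2 ** 0.5
        denom = max(1.0, sum(b * b for b in beta) ** 0.5)
        beta = cand
        if step <= eps * denom:  # stopping rule of Step 2
            break
    return beta

# toy separable problem: loss(b) = 0.5*(b0 - 1)^2 + 0.5*(b1 + 2)^2, lam = 0.5
loss = lambda b: 0.5 * (b[0] - 1.0) ** 2 + 0.5 * (b[1] + 2.0) ** 2
grad = lambda b: [b[0] - 1.0, b[1] + 2.0]
beta = fpia(grad, loss, p=2, lam=0.5)
```

For q ∈ (0, 1) the same loop applies with soft thresholding replaced by the map h_{λ,q} discussed below.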
An important step here is to evaluate the operator H_{λ,q}(·). As discussed before, h_{λ,q}(·) has an explicit expression when q = 1/2. For more general q ∈ (0, 1), by Lemma B.1 in Appendix B there exists a constant t^* > 0 such that h_{λ,q}(t) > 0, h_{λ,q}(t) − t + λq h_{λ,q}(t)^{q−1} = 0 and 1 + λq(q − 1)h_{λ,q}(t)^{q−2} > 0 for t > t^*; and h_{λ,q}(t) < 0, h_{λ,q}(t) − t − λq|h_{λ,q}(t)|^{q−1} = 0 and 1 + λq(q − 1)|h_{λ,q}(t)|^{q−2} > 0 for t < −t^*. Hence one can use the function fsolve in Matlab to obtain the desired solution at each iteration.
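In place of Matlab's fsolve, the positive-branch stationarity equation h − t + λq h^{q−1} = 0 can be solved by a plain Newton iteration; this sketch is our illustration, and it assumes the caller supplies a |t| above the threshold t^*.

```python
def h_lam_q(t, lam, q, tol=1e-12, max_iter=100):
    """Solve h - t + lam*q*sign(h)*|h|^(q-1) = 0 by Newton's method
    on the positive branch; assumes |t| is above the threshold t*."""
    sign = 1.0 if t > 0 else -1.0
    u = abs(t)  # start from |t|; the root lies in (0, |t|)
    for _ in range(max_iter):
        g = u - abs(t) + lam * q * u ** (q - 1.0)
        dg = 1.0 + lam * q * (q - 1.0) * u ** (q - 2.0)
        u_new = u - g / dg
        if u_new <= 0.0:       # safeguard: keep the iterate positive
            u_new = 0.5 * u
        if abs(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    return sign * u

u = h_lam_q(2.0, 0.1, 0.5)
residual = u - 2.0 + 0.1 * 0.5 * u ** (-0.5)  # stationarity condition at u > 0
```

The second-order condition 1 + λq(q − 1)u^{q−2} > 0 can be checked at the returned point to confirm a local minimizer.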
Another important step is the computation of step length τk, which represents a trade-
off between the speed of reduction of the objective function L and search time for the
optimal length. According to Theorem 5.1, the ideal choice of τk depends on the maximum
eigenvalue of the Hessian ∇²ℓ(β^k) at the kth iteration, which is expensive to calculate. A
more practical strategy is to perform an inexact line search to identify a step length that
achieves adequate reduction in L. One such technique is the so-called Armijo-type line
search that is adopted in our algorithm. In our context this method requires finding the
smallest nonnegative integer jk such that (20) holds. That this can be done successfully is
assured by Lemmas B.3 and B.4 in Appendix B. We also verify the convergence property
of the FPIA by Theorem B.1.
Remark 5.3. Xu et al (2012b) studied the q-regularized least squares method with q = 1/2 in a linear model and proposed several strategies, besides cross validation, for choosing the optimal regularization parameter λ. Analogous to their method, we can derive the range of the optimal regularization parameter in our problem as
λ ∈ [ (√96/(9τ)) |[B_τ(β)]_{s+1}|^{3/2}, (√96/(9τ)) |[B_τ(β)]_s|^{3/2} ),
where B_τ(β) = β − τ∇ℓ(β) and |[B_τ(β)]_k| is the kth largest component of B_τ(β) in magnitude for each k = 1, ..., p. Xu et al (2012b) suggest that λ = (√96/(9τ)) |[B_τ(β)]_{s+1}|^{3/2} is a reliable choice, with an approximation such as β ≈ β^k. They recommend this strategy for s-sparsity problems and cross validation for more general problems.
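The selection rule of Remark 5.3 is straightforward to implement; the helper below (names and numbers ours, for illustration) sorts the components of B_τ(β) = β − τ∇ℓ(β) by magnitude and returns the stated interval for λ.

```python
import math

def lambda_range(beta, grad_vec, tau, s):
    # Remark 5.3: [ c*|[B]_{s+1}|^{3/2}, c*|[B]_s|^{3/2} ) with
    # c = sqrt(96)/(9*tau) and B = beta - tau * grad l(beta)
    b = sorted((abs(x - tau * g) for x, g in zip(beta, grad_vec)), reverse=True)
    c = math.sqrt(96.0) / (9.0 * tau)
    return c * b[s] ** 1.5, c * b[s - 1] ** 1.5  # [lower, upper)

# toy iterate and gradient vector (illustrative numbers)
lo, hi = lambda_range([1.0, 0.2, 0.05], [0.0, 0.0, 0.0], tau=0.5, s=1)
```

The lower endpoint is the λ that Xu et al (2012b) recommend for an s-sparsity problem.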
Our algorithm can also be used to compute the q-RLS estimator for q ≥ 1. Indeed, similar to Lemma B.1, we can show that there exists a unique function h_{λ,q}(t) such that the global minimizer of problem (18) is u = h_{λ,q}(t). In particular, we can obtain explicit expressions of this function for q = 1, 3/2, 2:
h_{λ,1}(t) = max(0, t − λ) − max(0, −t − λ),
h_{λ,3/2}(t) = ( √((9/16)λ² + |t|) − (3/4)λ )² sign(t),
h_{λ,2}(t) = t/(1 + 2λ).
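These closed forms can be checked against the stationarity condition u − t + λq sign(u)|u|^{q−1} = 0 of problem (18); the sketch below does so for q = 3/2 and q = 2 at illustrative values of λ and t.

```python
import math

def h1(t, lam):   # q = 1: soft thresholding
    return max(0.0, t - lam) - max(0.0, -t - lam)

def h32(t, lam):  # q = 3/2
    v = math.sqrt(9.0 / 16.0 * lam * lam + abs(t)) - 0.75 * lam
    return v * v * (1.0 if t >= 0 else -1.0)

def h2(t, lam):   # q = 2: ridge-type shrinkage
    return t / (1.0 + 2.0 * lam)

lam, t = 0.5, 2.0
u32 = h32(t, lam)
res32 = u32 - t + 1.5 * lam * math.sqrt(u32)  # u - t + q*lam*u^{q-1}, q = 3/2
u2 = h2(t, lam)
res2 = u2 - t + 2.0 * lam * u2                # u - t + q*lam*u, q = 2
```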
6 Numerical Examples
In this section we present two examples to illustrate the proposed approach and demon-
strate the finite sample performance of the q-RLS estimator. The first example is the
second-order least squares method described in Example 1.4, and the second is the quadratic
equations problem considered by Beck and Eldar (2013). In a phase diagram study Xu et
al (2012a) point out that the ℓ_q-regularization method yields sparser solutions for smaller values of q in the range [1/2, 1), while there is no significant difference for q ∈ (0, 1/2].
In view of these findings, we use q = 1/2 in both examples. In addition, following the
literature we use 5-fold cross validation to choose the parameter λ. In each simulation 100
Monte Carlo samples were generated and in each case the true value β∗ was generated ran-
domly with s nonzero components from the standard normal distribution. The numerical
optimization is done using FPIA with iteration stopping criterion
‖β^{k+1} − β^k‖ / max{1, ‖β^{k+1}‖} ≤ 10^{−6},
or until the maximum running time of 5000 seconds is reached.
To evaluate the selection and estimation accuracy of our method, we calculated the mean squared error (MSE), which is the average of ‖β̂ − β^*‖_2²; the false positive (FP), the number of zero coefficients incorrectly identified as nonzero; and the false negative (FN), the number of nonzero coefficients incorrectly identified as zero. We also report the rate of successful recovery (SR) using the criterion Γ̂ = Γ^* and ‖β̂ − β^*‖_2² ≤ 2.5 × 10^{−5}, where Γ̂ = {j : β̂_j ≠ 0} and Γ^* = {j : β^*_j ≠ 0}.
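For concreteness, the four criteria can be computed as follows for a single replication (in the paper the per-replication squared error is averaged over Monte Carlo samples to give the reported MSE); the helper name and the toy vectors are ours.

```python
def selection_metrics(beta_hat, beta_star, tol=0.0):
    # FP, FN, squared error, and the successful-recovery criterion of Section 6
    support_hat = {j for j, b in enumerate(beta_hat) if abs(b) > tol}
    support_star = {j for j, b in enumerate(beta_star) if b != 0.0}
    fp = len(support_hat - support_star)   # zeros declared nonzero
    fn = len(support_star - support_hat)   # nonzeros declared zero
    sq_err = sum((a - b) ** 2 for a, b in zip(beta_hat, beta_star))
    success = (support_hat == support_star) and sq_err <= 2.5e-5
    return fp, fn, sq_err, success

fp, fn, sq_err, ok = selection_metrics([1.001, 0.0, 0.3], [1.0, 0.0, 0.0])
```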
6.1 Example 1: Second-order least squares method
We applied the second-order least squares method described in Example 1.4 to the variable
selection problem in linear model (2). It is known that in low-dimensional set-ups the
SLS estimator is asymptotically more efficient than the ordinary least squares estimator
when the error distribution is asymmetric. Therefore it is interesting to see whether this robustness property carries over to high-dimensional regularized estimation. In particular, we
considered the q-regularized second-order least squares (q-RSLS) problem
min_θ Σ_{i=1}^n ρ_i(θ)^T W_i ρ_i(θ) + λ‖β‖_q^q,
where θ = (β^T, σ²)^T, ρ_i(θ) = (y_i − x_i^T β, y_i² − (x_i^T β)² − σ²)^T and W_i is a 2 × 2 nonnegative definite weight matrix. Here the objective function becomes that of the q-regularized least squares (q-RLS) method if the weight is taken to be W_i = diag(1, 0). To simplify computation, we used the weight
W_i = ( 0.75  0.1
        0.1   0.25 ),
which is not necessarily optimal according to Wang and Leblanc (2008).
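A direct transcription of the q-RSLS criterion, as a sketch with illustrative data (this is not the simulation code of the paper; the noise-free toy values are chosen so that the fit term vanishes at the true parameters):

```python
def qrsls_objective(theta, x, y, W, lam, q):
    # sum_i rho_i(theta)^T W rho_i(theta) + lam*||beta||_q^q, with
    # rho_i = (y_i - x_i^T beta, y_i^2 - (x_i^T beta)^2 - sigma2)^T
    *beta, sigma2 = theta
    total = 0.0
    for xi, yi in zip(x, y):
        m = sum(a * b for a, b in zip(xi, beta))
        r = (yi - m, yi * yi - m * m - sigma2)
        total += (r[0] * (W[0][0] * r[0] + W[0][1] * r[1])
                  + r[1] * (W[1][0] * r[0] + W[1][1] * r[1]))
    return total + lam * sum(abs(b) ** q for b in beta)

W = [[0.75, 0.1], [0.1, 0.25]]
# at the true parameters of a noise-free toy model only the penalty remains
val = qrsls_objective([2.0, 0.0, 0.0], [[1.0, 0.5], [0.2, -1.0]], [2.0, 0.4],
                      W, lam=0.1, q=0.5)
```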
We considered five error distributions: e1 = logN(0, 0.1²) − e^{0.005}, e2 = (χ²(5) − 5)/100, e3 = 0.01t, e4 = U[−0.1, 0.1] and e5 = N(0, 0.1²). In each case, we took dimension p = 400 with sparsity s = 8 and sample size n = 200.
The results in Table 1 show that q-RSLS and q-RLS perform well in identifying zero coefficients; this is expected for ℓ_q-regularized methods with q = 1/2. Although both methods have fairly low FP values, those of q-RLS are about three times higher than those of q-RSLS. Moreover, the MSE of the q-RSLS estimator is about three times smaller than that of the q-RLS estimator. The results in Table 2 show clearly that q-RSLS
smaller than that of the q-RLS estimator. The results in Table 2 show clearly that q-RSLS
has much higher rate of SR than q-RLS does, and this is true not only for the skewed
error distributions, such as log-normal and Chi-square, but also for normal or uniform
distributions.
Table 1: Selection and estimation results of Example 1.

error      method   FP(mean)  FP(se)  FN(mean)  FN(se)  MSE
e1         q-RSLS   0.12      0.04    0.00      0.00    3.41e-05
           q-RLS    0.27      0.05    0.00      0.00    1.38e-04
e2         q-RSLS   0.12      0.04    0.00      0.00    2.91e-05
           q-RLS    0.21      0.05    0.00      0.00    9.34e-05
e3         q-RSLS   0.09      0.03    0.00      0.00    1.32e-05
           q-RLS    0.22      0.05    0.00      0.00    9.51e-05
e4         q-RSLS   0.09      0.03    0.00      0.00    3.34e-05
           q-RLS    0.29      0.05    0.00      0.00    1.64e-04
e5         q-RSLS   0.11      0.03    0.00      0.00    2.14e-05
           q-RLS    0.24      0.05    0.00      0.00    1.30e-04
Noiseless  q-RSLS   0.10      0.03    0.00      0.00    3.80e-05
           q-RLS    0.19      0.04    0.00      0.00    1.02e-04
Table 2: Rates of Successful Recovery of Example 1.

method    e1     e2     e3     e4     e5     Noiseless
q-RSLS    0.62   0.78   0.88   0.52   0.86   0.56
q-RLS     0.08   0.15   0.13   0.06   0.12   0.10
6.2 Example 2: Quadratic measurements
We considered model (16) with ε_i ∼ N(0, σ²). A noise-free version of this model was considered by Beck and Eldar (2013). For the sake of comparison we set σ = 0.01 and generated matrices Z_i = z_i z_i^T, i = 1, 2, ..., n, with vectors z_i ∈ R^p drawn from the standard normal distribution. We considered n = 80, p = 120 with various sparsity levels s = 3, 4, ..., 10. For comparison, we calculated the q-RLS estimator for q = 1/2, 1, 3/2, 2.
The results are given in Table 3, with the results for q = 2 omitted since they are very similar to those for q = 3/2. They show clearly that the FP values with q = 1/2 are much lower than in the other cases. In particular, the FP values with q = 3/2, 2 equal the number of true zero coefficients, which means that no variable selection was performed. The MSE and FN values with q = 1/2 are both very small; this demonstrates that the q-RLS with q = 1/2 is efficient and stable in variable selection and estimation. Compared to the results in Beck and Eldar (2013), our SR rates are lower when s = 3, 4 but significantly higher when s = 5, 6, ..., 10.
To see the effectiveness of our numerical algorithm FPIA, we also ran the simulations
with n = 3p/4, s = 0.05p, and p = 100, 200, 300, 400, 500. The results in Table 4 show
that, as the dimension increases, the FP and FN, as well as MSE, remain fairly low and
stable. In all cases, the rates of successful recovery are over 50% and reach 86% when
p = 200.
7 Conclusions and Discussion
Although the problem of high-dimensional variable selection with quadratic measurements
arises in many areas in physics and engineering, such as compressive sensing, signal process-
ing and imaging, it has not been studied in the statistical literature. We proposed a quadratic
measurements regression model and studied the `q-regularization method in this model.
We have established the weak oracle property of the q-RLS estimator in the high-dimensional case where n and p are allowed to diverge, including the case p ≫ n. To compute the q-
Table 3: Selection and estimation results of Example 2.

‖β^*‖_0  method    FP(mean)  FP(se)  FN(mean)  FN(se)  MSE        SR
3        q = 1/2     3.95     0.64    0.36     0.09    1.56e-03   0.57
         q = 1      64.36     5.84    1.07     0.14    1.47e-01   0.00
         q = 3/2   117.00     0.00    0.00     0.00    2.04e-01   0.00
4        q = 1/2     3.73     0.65    0.09     0.04    6.98e-04   0.62
         q = 1      62.64     5.79    1.63     0.20    2.97e-01   0.00
         q = 3/2   116.00     0.00    0.00     0.00    2.15e-01   0.00
5        q = 1/2     4.66     0.71    0.05     0.02    4.97e-05   0.61
         q = 1      76.75     5.38    1.55     0.23    3.08e-01   0.00
         q = 3/2   115.00     0.00    0.00     0.00    3.18e-01   0.00
6        q = 1/2     5.99     0.88    0.04     0.02    4.75e-05   0.58
         q = 1      81.88     5.07    1.50     0.26    2.60e-01   0.00
         q = 3/2   114.00     0.00    0.00     0.00    4.36e-01   0.00
7        q = 1/2     4.70     0.84    0.07     0.03    3.37e-05   0.63
         q = 1      83.32     4.91    1.01     0.30    3.27e-01   0.00
         q = 3/2   113.00     0.00    0.00     0.00    5.76e-01   0.00
8        q = 1/2     3.76     0.77    0.32     0.14    5.22e-02   0.67
         q = 1      87.54     4.37    1.28     0.30    2.78e-01   0.00
         q = 3/2   111.99     0.01    0.00     0.00    8.02e-01   0.00
9        q = 1/2     4.01     0.97    0.34     0.16    4.92e-02   0.73
         q = 1      86.05     4.38    1.53     0.34    3.30e-01   0.00
         q = 3/2   111.00     0.00    0.00     0.00    6.35e-01   0.00
10       q = 1/2     5.46     0.46    0.11     0.03    2.68e-02   0.58
         q = 1      84.69     4.22    1.50     0.36    3.56e-01   0.00
         q = 3/2   110.00     0.00    0.00     0.00    6.57e-01   0.00
Table 4: The successful recoveries of Example 2.

p     ‖β^*‖_0  FP(mean)  FP(se)  FN(mean)  FN(se)  MSE        SR
100    5        2.99      0.60    0.12     0.07    1.90e-03   0.73
200   10        3.40      0.80    0.05     0.02    2.49e-05   0.86
300   15        9.50      1.20    0.09     0.03    5.17e-04   0.53
400   20       11.34      1.43    0.11     0.05    5.26e-04   0.53
500   25       13.07      2.56    0.07     0.03    5.45e-04   0.51
RLS estimator, we have derived a fixed point equation and designed an efficient algorithm
and established its convergence. We have presented two numerical examples to illustrate
the proposed method. The numerical results show that this method performs very well in
most of the cases.
In general, the classical moderate deviation principle is given in the form
P( ‖β̂ − β^*‖ > r_n ) = exp( −(I(β^*)/2)(r_n√n)² + o((r_n√n)²) ),
where I(β^*) is the rate function. We have derived an upper bound for the rate function and the speed of convergence a_n², which is slower than the standard (r_n√n)² (Theorem 3.1). The result of Theorem 3.1 implies that the q-RLS estimator can correctly select the
nonzero variables with probability converging to one. Compared to the linear model, the
quadratic measurements model is more complex and therefore it is harder to obtain the
MD rate. Under some further assumptions, it is possible to establish more accurate results.
Another open question is the asymptotic normality of the q-RLS estimator for model (1).
It deserves further research.
We have studied the generalized bridge estimator because of the simplicity and tractability of numerical optimization. We focused on the ℓ_q regularization with q < 1, mainly because in phase retrieval and compressive sensing the primary goal is to find the smallest set of predictors and the ℓ_q method with q < 1 helps to achieve this goal. Our identification
results and numerical optimization algorithm apply when q ≥ 1. Of course in such cases
the consistency results do not hold generally as in linear models. It is also interesting to
investigate the SCAD and other regularization methods in quadratic measurements mod-
els. Our model (1) can be viewed as a special case of the partially linear index model
y = Σ_{j=1}^d f_j(β^T w_j) + x^T β + ε. While it is interesting to study the regularization estimation problem in this model, the theory and method are much more complicated.
Acknowledgments
We are grateful to the Editor, an associate editor and two anonymous reviewers for their
comments and suggestions that helped to improve the previous version of this paper.
References
Bahmani, S., Raj, B. and Boufounos, P. T. (2013). Greedy sparsity-constrained optimiza-
tion. J. Mach. Learn. Res. 14, 807-841.
Balan, R., Casazza, P. and Edidin, D. (2006). On signal reconstruction without phase.
Appl. Comput. Harmon. Anal. 20, 345-356.
Bandeira, A. S., Cahill, J., Mixon, D. G. and Nelson, A. A. (2014). Saving phase: Injectivity
and stability for phase retrieval. Appl. Comput. Harmon. Anal. 37, 106-125.
Beck, A. and Eldar, Y. C. (2013). Sparsity constrained nonlinear optimization: Optimality
conditions and algorithms. SIAM J. Optim. 23, 1480-1509.
Bian, W., Chen, X. and Ye, Y. (2015). Complexity analysis of interior point algorithms
for non-Lipschitz and nonconvex minimization. Math. Program. 149, 301-327.
Biswas, P. and Ye, Y. (2004). Semidefinite programming for ad hoc wireless sensor network
localization. In Proceedings of the 3rd international symposium on Information processing
in sensor networks, Berkeley, CA, 46-54.
Blumensath, T. (2013). Compressed sensing with nonlinear observations and related non-
linear optimization problems. IEEE Trans. Inform. Theory 59, 3466-3474.
Buhlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods,
theory and applications. Springer, Heidelberg.
Cai, T., Li, X., and Ma, Z. (2015). Optimal rates of convergence for noisy sparse phase
retrieval via thresholded Wirtinger flow. arXiv preprint arXiv:1506.03382.
Candes, E., Strohmer, T., and Voroninski, V. (2013). Phaselift: Exact and stable signal
recovery from magnitude measurements via convex programming. Communications on
Pure and Applied Mathematics 66, 1241-1274.
Candes, E., Li, X., and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow:
Theory and algorithms. IEEE Trans. Inform. Theory 61, 1985-2007.
Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is
much larger than n. Ann. Statist. 35, 2313-2351.
Chartrand, R. (2007). Exact reconstruction of sparse signals via nonconvex minimization.
IEEE Signal Process. Lett. 14, 707-710.
Chen, X., Niu, L. and Yuan, Y. (2013). Optimality conditions and smoothing trust region
Newton method for non-Lipschitz optimization. SIAM J. Optim. 23, 1528-1552.
Chen, X., Xu, F., and Ye, Y. (2010). Lower bound theory of nonzero entries in solutions
of `2 − `p minimization. SIAM J. Sci. Comput. 32, 2832-2852.
Chen, Y., Xiu, N. and Peng, D. (2014). Global solutions of non-Lipschitz S2 − Sp mini-
mization over positive semidefinite cone. Optimization Letters 8, 2053-2064.
Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimen-
sionality. AMS Math Challenges Lecture, 1-32.
Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthog-
onal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences
100, 2197-2202.
Eldar, Y. C., and Mendelson, S. (2014). Phase retrieval: Stability and recovery guarantees.
Appl. Comput. Harmon. Anal. 36, 473-494.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. J. Amer. Statist. Assoc. 96, 1348-1360.
Fan, J., and Lv, J. (2010). A selective overview of variable selection in high dimensional
feature space. Statist. Sinica 20, 101-148.
Fan, J., Fan, Y. and Barut, E. (2014). Adaptive Robust Variable Selection. Ann. Statist.
42, 324-351.
Fan, J. (2012). Moderate deviations for M-estimators in linear models with φ-mixing errors. Acta Math. Sin. (Engl. Ser.) 28, 1275-1294.
Fan, J., Yan, A. and Xiu, N. (2014). Asymptotic properties for M-estimators in linear models with dependent random errors. J. Stat. Plan. Infer. 148, 49-66.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression
tools (with discussion). Technometrics 35, 109-148.
Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36, 587-613.
Kallenberg, W. C. M. (1983). On moderate deviation theory in estimation. Ann. Statist.
11, 498-504.
Knight, K. and Fu, W. J. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28,
1356-1378.
Krishnan, D. and Fergus, R. (2009). Fast image deconvolution using hyper-Laplacian priors.
Advances in Neural Information Processing Systems, 1033-1041.
Lecue, G. and Mendelson, S. (2013). Minimax rate of convergence and the performance of
ERM in phase recovery. Preprint. Available at arXiv:1311.5024
Lu, Z. (2014). Iterative reweighted minimization methods for lp regularized unconstrained
nonlinear programming. Math. Program. 147, 277-307.
Lv, J., and Fan, Y. (2009). A unified approach to model selection and sparse recovery using
regularized least squares. Ann. Statist. 37, 3498-3528.
Meng, C., Ding, Z. and Dasgupta, S. (2008). A semidefinite programming approach to
source localization in wireless sensor networks. IEEE Signal Processing Letters 15, 253-
256.
Netrapalli, P., Jain, P., and Sanghavi, S. (2013). Phase retrieval using alternating mini-
mization. Advances in Neural Information Processing Systems, 2796-2804.
Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework
for high-dimensional analysis of M -estimators with decomposable regularizers. Statist.
Sci. 27, 538-557.
Ohlsson, H., and Eldar, Y. C. (2014). On conditions for uniqueness in sparse phase retrieval.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
1841-1845.
Ohlsson, H., Yang, A. Y., Dong, R., Verhaegen, M. and Sastry, S. S. (2014). Quadratic basis
pursuit. Regularization, Optimization, Kernels, and Support Vector Machines, 195.
Saab, R., Chartrand, R., and Yilmaz, O. (2008). Stable sparse approximations via non-
convex optimization. In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 3885-3888.
Shechtman, Y., Eldar, Y. C., Szameit, A. and Segev, M. (2011). Sparsity-based sub-
wavelength imaging with partially spatially incoherent light via quadratic compressed
sensing. Optics Express 19, 14807-14822.
Shechtman, Y., Szameit, A., Bullkich, E., et al. (2012). Sparsity-based single-shot sub-
wavelength coherent diffractive imaging. Nature Materials 11, 455-459.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.
Ser. B Stat. Methodol. 58, 267-288.
Wang, L. (2003). Estimation of nonlinear models with Berkson measurement error models.
Statist. Sinica 13, 1201-1210.
Wang, L. (2004). Estimation of nonlinear models with Berkson measurement errors. Ann.
Statist. 32, 2343-2775.
Wang, L. and Leblanc, A. (2008). Second-order nonlinear least squares estimation. Ann.
Inst. Stat. Math. 60, 883-900.
Wang, Z., Zheng, S., Boyd, S. and Ye, Y. (2008). Further relaxations of the SDP approach
to sensor network localization. SIAM J. Optim. 19, 655-671.
Xu, Z., Guo, H., Wang, Y. and Zhang H. (2012a). Representative of L1/2 regularization
among Lq(0 < q ≤ 1) regularizations: an experimental study based on phase diagram.
Acta Autom. Sinica 38, 1225-1228.
Xu, Z., Chang X., Xu, F. and Zhang, H. (2012b). L1/2 regularization: A thresholding
representation theory and a fast solver. Neural Networks and Learning Systems. IEEE
Transactions on Neural Networks and Learning Systems 23, 1013-1027.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68, 49-67.
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty.
Ann. Statist., 38, 894-942.
Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-
dimensional sparse estimation problems. Statist. Sci. 27, 576-593.
Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101,
1418-1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J.
R. Stat. Soc. Ser. B Stat. Methodol. 67, 301-320.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood
models. Ann. Statist., 36, 1509-1533.
Appendix A Proofs of MD and weak oracle property
A.1 Proofs of Theorems 3.1 and 3.2
Without loss of generality, in the following we let Γ^* = {1, ..., s} and β^* = (β_1^{*T}, 0^T)^T. Correspondingly, we partition Z_i and x_i as
Z_i = ( Z_i^{11}  Z_i^{12}
        Z_i^{21}  Z_i^{22} )   and   x_i = (x_i^{1T}, x_i^{2T})^T,
where Z_i^{11} is an s × s symmetric matrix and Z_i^{22} is a (p − s) × (p − s) symmetric matrix. For convenience, we also denote
L_n(β_1) := Σ_{i=1}^n ( y_i − β_1^T Z_i^{11} β_1 − x_i^{1T} β_1 )² + λ_n ‖β_1‖_q^q
and C_1 = 2c + 3√((σ² + 1)/c_1).
We first prove some lemmas.
Lemma A.1. Let {w_n} be a sequence of real numbers and assume that {b_n} and {B_n} are two sequences of positive numbers tending to infinity. If
B_n ≥ Σ_{i=1}^n w_i²   and   (b_n/√B_n) max_{1≤i≤n} |w_i| → 0,
then, for any τ > 0,
limsup_{n→∞} b_n^{−2} log P( |Σ_{i=1}^n w_i ε_i| > b_n √B_n τ ) ≤ −τ²/(2σ²),
or
P( |Σ_{i=1}^n w_i ε_i| > b_n √B_n τ ) ≤ exp( −b_n²τ²/(2σ²) + o(b_n²) ).
Proof. The proof is similar to that of Lemma 3.2 in Fan, Yan and Xiu (2014) and is therefore omitted.
Lemma A.2. Assume that Conditions 1-2 and 4 hold. Let {a_n} be a sequence of positive numbers satisfying (11) and (12). Then for any τ > 0,
P( (a_n√n)^{−1} sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > τ ) ≤ exp( −a_n²τ²/(2c_2σ²) + o(a_n²) ).
Proof. Let A = {v ∈ R^s : ‖v‖ ≤ 1} and denote r_n = 1/n. Then by Lemma 14.27 in Buhlmann and van de Geer (2011), we have
A ⊆ ∪_{j=1}^{m_n} B(v_j, r_n),
where m_n = (1 + 2n)^s and B(v_j, r_n) = {v ∈ R^s : ‖v − v_j‖ ≤ r_n, v_j ∈ A} for j = 1, ..., m_n. By a similar method to the proof of the second result of Lemma 5.1 in Fan, Yan and Xiu (2014), we use Lemma A.1 with B_n = nc_2 and b_n = a_n to obtain that for any τ_1 > 0 and ε_1 ∈ (0, τ_1/2),
P( (a_n√n)^{−1} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > τ_1 ) ≤ m_n exp( −a_n²(τ_1 − ε_1)²/(2c_2σ²) + o(a_n²) ).   (21)
Further, denote r′_n = C_1√s/n. Again, by Lemma 14.27 of Buhlmann and van de Geer (2011), we have
S ⊆ ∪_{j=1}^{m_n} B(u_j, r′_n),
where B(u_j, r′_n) = {u ∈ R^s : ‖u − u_j‖ ≤ r′_n, u_j ∈ S} for j = 1, ..., m_n. Analogous to (21), we obtain that for any ε ∈ (0, τ/2) and ε_1 ∈ (0, (τ − ε)/2),
P( (a_n√n)^{−1} sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > τ ) ≤ m_n² exp( −a_n²(τ − ε − ε_1)²/(2c_2σ²) + o(a_n²) ).
From (11) we conclude that a_n^{−2} log m_n² = 2a_n^{−2} s log(1 + 2n) → 0, which together with the above inequality implies that
limsup_{n→∞} a_n^{−2} log P( (a_n√n)^{−1} sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > τ ) ≤ −(τ − ε − ε_1)²/(2c_2σ²).
Since ε and ε_1 are arbitrary, we have for large enough n,
P( (a_n√n)^{−1} sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > τ ) ≤ exp( −a_n²τ²/(2c_2σ²) + o(a_n²) ).
Lemma A.3. Under the assumptions of Lemma A.2, there exists β̂_1 = arg min_{β_1∈R^s} L_n(β_1) such that
P( ‖β̂_1 − β_1^*‖ ≤ r_n ) ≥ 1 − exp( −(1 + c_1²/4)a_n²/(2c_2σ²) + o(a_n²) ).   (22)
Proof. To show the existence of the minimizer β̂_1, we consider the level set {β_1 ∈ R^s : L_n(β_1) ≤ L_n(β_1^*)}. It is apparent that
inf_{β_1∈R^s} L_n(β_1) = inf_{β_1 ∈ {β_1∈R^s : L_n(β_1) ≤ L_n(β_1^*)}} L_n(β_1).
Since L_n(·) is continuous and the level set is non-empty and closed, L_n(·) has at least one minimizer β̂_1 in the level set.
Now we prove (22). For notational convenience, we denote Z̃ = (Z̃_1, ..., Z̃_n), Σ̃_n = Z̃Z̃^T/n and ε = (ε_1, ..., ε_n)^T, where Z̃_i = Z_i^{11}(β̂_1 + β_1^*) + x_i^1. Obviously, Condition 1 implies that Σ̃_n is invertible. Then by the definition of β̂_1 we have L_n(β̂_1) ≤ L_n(β_1) for any β_1 ∈ R^s, which implies
Σ_{i=1}^n ε_i² + λ_n Σ_{j=1}^s |β_{1j}^*|^q ≥ Σ_{i=1}^n ( ε_i − (β̂_1 − β_1^*)^T Z̃_i )² + λ_n Σ_{j=1}^s |e_{s,j}^T β̂_1|^q
= Σ_{i=1}^n ε_i² − 2(β̂_1 − β_1^*)^T Z̃ε + λ_n Σ_{j=1}^s |e_{s,j}^T β̂_1|^q + n(β̂_1 − β_1^*)^T Σ̃_n (β̂_1 − β_1^*),
and therefore
n(β̂_1 − β_1^*)^T Σ̃_n (β̂_1 − β_1^*) ≤ 2(β̂_1 − β_1^*)^T Z̃ε + λ_n Σ_{j=1}^s ( |e_{s,j}^T β_1^*|^q − |e_{s,j}^T β̂_1|^q ).   (23)
By a similar method to the proof of relation (8) in Huang, Horowitz and Ma (2008), we conclude from Condition 1, the second convergence of Condition 4 and the strong law of large numbers that for large enough n,
‖β̂_1 − β_1^*‖ ≤ C_1√s and ‖β̂_1 + β_1^*‖ ≤ C_1√s, a.s.,
and
‖β̂_1 − β_1^*‖² ≤ (2/(nc_1))‖β̂_1 − β_1^*‖‖Z̃ε‖ + η_n/(nc_1)
 ≤ (2C_1√s/(nc_1)) sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ + λ_n s c^q/(nc_1), a.s.,   (24)
where η_n = λ_n Σ_{j=1}^s ( |e_{s,j}^T β_1^*|^q − |e_{s,j}^T β̂_1|^q ). Therefore,
1 = P( ‖β̂_1 − β_1^*‖² ≤ (2C_1√s/(nc_1)) sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ + λ_n s c^q/(nc_1) )
 ≤ P( ‖β̂_1 − β_1^*‖² ≤ 2C_1√s a_n/(c_1√n) + λ_n s c^q/(nc_1) ) + P( (a_n√n)^{−1} sup_{u∈S} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > 1 ),
which together with Lemma A.2 yields that
P( ‖β̂_1 − β_1^*‖ > r′_n ) ≤ exp( −a_n²/(2c_2σ²) + o(a_n²) ),   (25)
where r′_n = ( 2C_1 a_n√s/(c_1√n) + λ_n s c^q/(nc_1) )^{1/2}.
Since r′_n → 0 as n → ∞, it follows that for large enough n,
(1/2)|e_{s,j}^T β_1^*| ≤ |e_{s,j}^T β̂_1| ≤ (3/2)|e_{s,j}^T β_1^*|, j = 1, ..., s,
when ‖β̂_1 − β_1^*‖ ≤ r′_n. By the mean value theorem and the Cauchy-Schwarz inequality, we have, for large enough n,
η_n ≤ 2c^{q−1}λ_n√s ‖β̂_1 − β_1^*‖
when ‖β̂_1 − β_1^*‖ ≤ r′_n. Combining the above inequality, (23), the Cauchy-Schwarz inequality and Condition 1, we have, for large enough n,
‖β̂_1 − β_1^*‖ ≤ (2/(nc_1))‖Z̃ε‖ + 2c^{q−1}λ_n√s/(nc_1)
when ‖β̂_1 − β_1^*‖ ≤ r′_n. Therefore it follows from the first inequality of (24) that for large enough n,
1 = P( ‖β̂_1 − β_1^*‖² ≤ (2/(nc_1))‖β̂_1 − β_1^*‖‖Z̃ε‖ + η_n/(nc_1) )
 ≤ P( ‖β̂_1 − β_1^*‖ ≤ (2/(nc_1))‖Z̃ε‖ + 2c^{q−1}λ_n√s/(c_1 n) ) + P( ‖β̂_1 − β_1^*‖ > r′_n )
 ≤ P( ‖β̂_1 − β_1^*‖ ≤ a_n/√n + 2c^{q−1}λ_n√s/(c_1 n) ) + P( (a_n√n)^{−1} sup_{‖u‖≤C_1√s} ‖Σ_{i=1}^n (Z_i^{11}u + x_i^1)ε_i‖ > c_1/2 ) + P( ‖β̂_1 − β_1^*‖ > r′_n ).
Then, by Lemma A.2 and (25) we have
P( ‖β̂_1 − β_1^*‖ ≥ r_n ) ≤ exp( −a_n²/(2c_2σ²) + o(a_n²) ) + exp( −c_1²a_n²/(8c_2σ²) + o(a_n²) ),
which yields
limsup_{n→∞} a_n^{−2} log P( ‖β̂_1 − β_1^*‖ ≥ r_n ) ≤ −1/(2c_2σ²) − c_1²/(8c_2σ²).
Thus, we have
P( ‖β̂_1 − β_1^*‖ ≥ r_n ) ≤ exp( −(1 + c_1²/4)a_n²/(2c_2σ²) + o(a_n²) ),
which yields (22).
Proof of Theorem 3.1. Denote b_n = a_n + 2c^{q−1}λ_n√s/(c_1√n) and r_n = ( a_n/√n + 2c^{q−1}λ_n√s/(c_1 n) )√s. We first show that
λ_n^{−1} r_n^{1−q} b_n √n s → 0 and λ_n^{−1} r_n^{2−q} √n s² → 0, as n → ∞.   (26)
The first convergence of (26) follows from the second convergence of Condition 4 and (13). Since λ_n s²/n → 0, the inequality of Condition 2 implies that
s^{3/2}/√n ≤ (λ_n s²/n)·(√n/λ_n) ≤ (λ_n s²/n)·( 1/(σc^{1−q}√(log p)) ) → 0,   (27)
which yields
λ_n^{−1} r_n^{2−q} √n s² = λ_n^{−1} r_n^{1−q} b_n √n s · (s^{3/2}/√n) → 0.
For any u = (u_1^T, u_2^T)^T ∈ R^p with u_1 ∈ R^s, we show that there exists a sufficiently large constant C such that
P( L_n(β̂_1, 0) = inf_{‖u‖_1 ≤ C} L_n(β_1^* + r_n u_1, r_n u_2) ) ≥ 1 − exp( −C_0 a_n² + o(a_n²) ),   (28)
which implies that, with probability 1 − exp(−C_0 a_n² + o(a_n²)), (β̂_1^T, 0^T)^T is a local minimizer in the ball {β^* + r_n u : ‖u‖_1 ≤ C}, so that both (9) and (10) hold.
Denote ζ_{1i} = Z_i^{11}(2β_1^* + r_n u_1) + x_i^1 and ζ_{2i} = 2Z_i^{21}(β_1^* + r_n u_1) + x_i^2 + r_n Z_i^{22} u_2, and define the event
E_1 := { ‖Σ_{i=1}^n ζ_{2i} ε_i‖_∞ ≤ 4(1 + c²)^{1/2} b_n √n s }.
For any u_2 ∈ R^{p−s}, we show that under the event E_1,
L_n(β_1^* + r_n u_1, r_n u_2) ≥ L_n(β_1^* + r_n u_1, 0).   (29)
Clearly, L_n(β_1^* + r_n u_1, r_n u_2) = L_n(β_1^* + r_n u_1, 0) when ‖u_2‖_1 = 0. We proceed to show (29) for ‖u_2‖_1 > 0. It follows that
L_n(β_1^* + r_n u_1, r_n u_2) − L_n(β_1^* + r_n u_1, 0)
= −2r_n Σ_{i=1}^n u_2^T ζ_{2i} ε_i + r_n² Σ_{i=1}^n (u_2^T ζ_{2i})² + 2r_n² Σ_{i=1}^n u_1^T ζ_{1i} u_2^T ζ_{2i} + λ_n r_n^q ‖u_2‖_q^q
≥ −2r_n Σ_{i=1}^n u_2^T ζ_{2i} ε_i + 2r_n² Σ_{i=1}^n u_1^T ζ_{1i} u_2^T ζ_{2i} + λ_n r_n^q ‖u_2‖_q^q.   (30)
We now use the fact that |u^T A v| ≤ ‖u‖_1‖Av‖_∞ ≤ |A|_∞‖u‖_1‖v‖_1 for any n × d matrix A and vectors u ∈ R^n, v ∈ R^d to bound |Σ_{i=1}^n u_1^T ζ_{1i} u_2^T ζ_{2i}|. Noting that
|Σ_{i=1}^n u_1^T ζ_{1i} u_2^T ζ_{2i}| ≤ ‖u_1‖_1‖u_2‖_1 |Σ_{i=1}^n ζ_{1i} ζ_{2i}^T|_∞,
we then estimate the upper bound of |Σ_{i=1}^n ζ_{1i} ζ_{2i}^T|_∞. Recalling the definition of |·|_∞, we calculate e_{s,j}^T ζ_{1i} ζ_{2i}^T e_{p−s,k} for each j = 1, ..., s and k = 1, ..., p − s. It is easy to check that
e_{s,j}^T ζ_{1i} ζ_{2i}^T e_{p−s,k} = 2(2β_1^* + r_n u_1)^T Z_i^{11} e_{s,j} e_{p−s,k}^T Z_i^{21}(β_1^* + r_n u_1) + 2x_i^{1T} e_{s,j} e_{p−s,k}^T Z_i^{21}(β_1^* + r_n u_1)
+ (2β_1^* + r_n u_1)^T Z_i^{11} e_{s,j} e_{p−s,k}^T x_i^2 + x_i^{1T} e_{s,j} e_{p−s,k}^T x_i^2
+ r_n(2β_1^* + r_n u_1)^T Z_i^{11} e_{s,j} e_{p−s,k}^T Z_i^{22} u_2 + r_n x_i^{1T} e_{s,j} e_{p−s,k}^T Z_i^{22} u_2.
So,
|Σ_{i=1}^n e_{s,j}^T ζ_{1i} ζ_{2i}^T e_{p−s,k}|
≤ 2‖2β_1^* + r_n u_1‖_1 ‖β_1^* + r_n u_1‖_1 |Σ_{i=1}^n Z_i^{11} e_{s,j} e_{p−s,k}^T Z_i^{21}|_∞ + 2‖β_1^* + r_n u_1‖_1 |Σ_{i=1}^n e_{s,j}^T x_i^1 e_{p−s,k}^T Z_i^{21}|_∞
+ ‖2β_1^* + r_n u_1‖_1 |Σ_{i=1}^n Z_i^{11} e_{s,j} e_{p−s,k}^T x_i^2|_∞ + |Σ_{i=1}^n x_i^{1T} e_{s,j} e_{p−s,k}^T x_i^2|_∞
+ r_n‖2β_1^* + r_n u_1‖_1 ‖u_2‖_1 |Σ_{i=1}^n Z_i^{11} e_{s,j} e_{p−s,k}^T Z_i^{22}|_∞ + r_n‖u_2‖_1 |Σ_{i=1}^n x_i^{1T} e_{s,j} e_{p−s,k}^T Z_i^{22}|_∞
≤ ( 2(2cs + r_nC)(cs + r_nC) + 2(cs + r_nC) + (2cs + r_nC) + 1 + r_n(2cs + r_nC)C + r_nC )√n c_0,
where the last inequality follows from Condition 3. Since r_n → 0 as n → ∞, we conclude that for large enough n,

|Σ_{i=1}^n e_{s,j}^T ζ_{1i} ζ_{2i}^T e_{p−s,k}| ≤ (12c^2 s^2 + 12cs + 1) √n c_0 ≤ (12c^2 + 1) c_0 √n s^2,

and therefore

|Σ_{i=1}^n u_1^T ζ_{1i} u_2^T ζ_{2i}| ≤ ‖u_1‖_1 ‖u_2‖_1 |Σ_{i=1}^n ζ_{1i} ζ_{2i}^T|_∞ ≤ (12c^2 + 1) c_1 C √n s^2 ‖u_2‖_1.   (31)
Note that

|Σ_{i=1}^n u_2^T ζ_{2i} ε_i| ≤ ‖u_2‖_1 ‖Σ_{i=1}^n ζ_{2i} ε_i‖_∞

and ‖u_2‖_q^q ≥ C^{q−1} ‖u_2‖_1. Under the event E_1, it follows from (26), (30) and (31) that

L_n(β_1^* + r_n u_1, r_n u_2) − L_n(β_1^* + r_n u_1, 0)
≥ −2r_n ‖Σ_{i=1}^n ζ_{2i} ε_i‖_∞ ‖u_2‖_1 − (12c^2 + 1) c_1 C r_n^2 √n s^2 ‖u_2‖_1 + C^{q−1} λ_n r_n^q ‖u_2‖_1
≥ λ_n r_n^q ‖u_2‖_1 ( −8(2(1 + c^2))^{1/2} λ_n^{−1} r_n^{1−q} b_n √n s − (12c^2 + 1) c_1 C λ_n^{−1} r_n^{2−q} √n s^2 + C^{q−1} ) > 0

when ‖u_2‖_1 > 0. That is, (29) holds.
On the other hand, under the event {‖β̂_1 − β_1^*‖ ≤ r_n}, we conclude from ‖β̂_1 − β_1^*‖_1 ≤ √s ‖β̂_1 − β_1^*‖ that ‖β̂ − β^*‖_1 = ‖β̂_1 − β_1^*‖_1 ≤ r_n √s, which yields

inf_{‖u‖_1 ≤ C} L_n(β_1^* + r_n u_1, r_n u_2) ≤ L_n(β̂) = L_n(β̂_1, 0) ≤ L_n(β_1^* + r_n u_1, 0).

Combining this and (29), we have L_n(β̂) = inf_{‖u‖_1 ≤ C} L_n(β_1^* + r_n u_1, r_n u_2) under the event E_1 ∩ {‖β̂_1 − β_1^*‖ ≤ r_n}. That is,

E_1 ∩ {‖β̂_1 − β_1^*‖ ≤ r_n} ⊆ { β̂ ∈ arg inf_{‖u‖_1 ≤ C} L_n(β_1^* + r_n u_1, r_n u_2) }.   (32)
To complete the proof of (28), we need to verify that

P( ‖Σ_{i=1}^n ζ_{2i} ε_i‖_∞ > 4(2(1 + c^2))^{1/2} b_n √n s ) ≤ exp( −b_n^2/(4σ^2) + o(b_n^2) ).   (33)

Denote the jth element of ζ_{2i} by ζ_{2ij}. Since ‖β_1^* + r_n u_1‖_1 ≤ ‖β_1^*‖_1 + r_n ‖u_1‖_1 ≤ cs + r_n C, we use the Cauchy–Schwarz inequality to obtain

|ζ_{2ij}| ≤ 2|e_{p−s,j}^T Z_i^{21}(β_1^* + r_n u_1)| + |x_{ij}^2| + r_n |e_{p−s,j}^T Z_i^{22} u_2|
≤ 2‖e_{p−s,j}^T Z_i^{21}‖_∞ ‖β_1^* + r_n u_1‖_1 + κ_{1n} + r_n ‖e_{p−s,j}^T Z_i^{22}‖_∞ ‖u_2‖_1
≤ (2c + 3r_n C) κ_{2n} s + κ_{1n}.   (34)
By a similar calculation, we have

Σ_{i=1}^n ζ_{2ij}^2 ≤ 4 Σ_{i=1}^n ( 4(e_{p−s,j}^T Z_i^{21} β_1^*)^2 + 4r_n^2 (e_{p−s,j}^T Z_i^{21} u_1)^2 + (x_{ij}^2)^2 + r_n^2 (e_{p−s,j}^T Z_i^{22} u_2)^2 )
≤ 4 Σ_{i=1}^n ( 4‖e_{p,j}^T Z_i‖_∞^2 (‖β_1^*‖_1^2 + r_n^2 ‖u_1‖_1^2 + r_n^2 ‖u_2‖_1^2) + (x_{ij}^2)^2 )
≤ 4 Σ_{i=1}^n ( 4‖Z_i‖_∞^2 (‖β_1^*‖_1^2 + r_n^2 ‖u‖_1^2) + (x_{ij}^2)^2 )
≤ 4(4c^2 + 4r_n^2 C^2 + 1) n s^2.

Write B_n = 4(4c^2 + 4r_n^2 C^2 + 1) n s^2. Since the limits (8) and (12) imply, respectively, that

(λ_n √s / √n) · (κ_{2n} s + κ_{1n})/(√n s) → 0 and a_n (κ_{2n} s + κ_{1n})/(√n s) → 0,

it follows from (34) and r_n → 0 that

b_n max_{1≤i≤n} |ζ_{2ij}| / √B_n → 0, as n → ∞.
We use Lemma A.1 to obtain that

P( |Σ_{i=1}^n ζ_{2ij} ε_i| > b_n √B_n ) ≤ exp( −b_n^2/(2σ^2) + o(b_n^2) ),

which, combined with the relation r_n → 0, leads to

P( |Σ_{i=1}^n ζ_{2ij} ε_i| > 4(2(1 + c^2))^{1/2} b_n √n s ) ≤ exp( −b_n^2/(2σ^2) + o(b_n^2) ).   (35)
Note that the first relation of Condition 4 implies that

b_n > 2c^{q−1} λ_n √s / √n ≥ 2σ √(log p).

Therefore we conclude that

P( ‖Σ_{i=1}^n ζ_{2i} ε_i‖_∞ > 4(2(1 + c^2))^{1/2} b_n √n s ) ≤ Σ_{j=s+1}^p P( |Σ_{i=1}^n ζ_{2ij} ε_i| > 4(2(1 + c^2))^{1/2} b_n √n s )
≤ exp( −b_n^2/(4σ^2) + o(b_n^2) ),
which yields (33). Further, by Lemma A.3, (32) and (33), we have

P( β̂ ∈ arg inf_{‖u‖_1 ≤ C} L_n(β_1^* + r_n u_1, r_n u_2) ) ≥ P( E_1 ∩ {‖β̂_1 − β_1^*‖ ≤ r_n} ) ≥ 1 − exp( −C_0 a_n^2 + o(a_n^2) ). □
Proof of Theorem 3.2. It suffices to show that the sequence a_n = √s log n satisfies (11)–(13). First, it is clear that a_n/√(s log n) → ∞. Further, it follows from (27) that

a_n/√n = √s log n/√n ≤ (max(s, log n))^{3/2}/√n → 0.
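The elementary bound √s · log n ≤ (max(s, log n))^{3/2} used in this step can be sanity-checked numerically; a standalone illustration (the grid of s and n values is an arbitrary choice):

```python
import math

# sqrt(s) * log(n) <= max(s, log n)^(3/2), since sqrt(s) <= max(s, log n)^(1/2)
# and log n <= max(s, log n).
for s in [1, 2, 5, 20, 100, 10_000]:
    for n in [2, 10, 1_000, 10**6, 10**12]:
        lhs = math.sqrt(s) * math.log(n)
        rhs = max(s, math.log(n)) ** 1.5
        assert lhs <= rhs + 1e-9, (s, n)
print("inequality holds on the grid")
```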
Moreover, the inequality in Condition 4 and (8) imply that

a_n κ_{1n} √s/√n = (λ_n κ_{1n} s log n/n) · (√n/λ_n) ≤ (λ_n κ_{1n} s log n/n) · 1/(σ c^{1−q} √(log p)) → 0

and

a_n κ_{2n} s^{3/2}/√n = (λ_n κ_{2n} s^2 log n/n) · (√n/λ_n) ≤ (λ_n κ_{2n} s^2 log n/n) · 1/(σ c^{1−q} √(log p)) → 0.

Therefore, by the first convergence of Condition 4, we obtain

a_n^{2−q} n^{q/2} s^{(4−q)/2}/λ_n = n^{q/2} s^{3−q} (log n)^{2−q}/λ_n → 0,

which completes the proof. □
A.2 Proofs of Theorems 4.1 and 4.2.

Here we continue to use the notation of Appendix A.1, and first provide two lemmas, Lemmas A.4 and A.5, which correspond to Lemmas A.2 and A.3 there.
Lemma A.4. For the model (16), assume that Conditions 1′–2′ and 4′ hold. Let {a_n} be a sequence of positive numbers satisfying (11) and (12). Then, for any τ > 0,

P( (1/(a_n √n)) sup_{u∈S′} ‖Σ_{i=1}^n (Z_i^{11} u) ε_i‖ > τ ) ≤ exp( −a_n^2 τ^2/(2c^2 σ^2) + o(a_n^2) ),

where S′ = S_1 ∩ S.
Proof. First note that

{u ∈ R^s : ‖u‖_0 ≥ s − [s/2]} ⊆ ∪_{k=[s/2]}^s {u ∈ R^s : ‖u‖_0 = k},

and Lemma 14.27 of Buhlmann and van de Geer (2011) implies that, in the subspace R^k,

{v ∈ R^k : min_{1≤l≤k} |e_{k,l}^T v| ≥ c/2, ‖v‖ ≤ C_1 √s}
⊆ ∪_{j=1}^{(1+2n)^k} {v ∈ R^k : ‖v − v_j‖ ≤ 1/n, min_{1≤l≤k} |e_{k,l}^T v_j| ≥ c/2, ‖v_j‖ ≤ C_1 √s}.
Since

Σ_{k=[s/2]}^s C_s^k (1 + 2n)^k ≤ (1 + 2n)^s Σ_{k=[s/2]}^s C_s^k ≤ (2 + 4n)^s

and

{u ∈ R^s : |{j : |e_{s,j}^T u| ≥ c/2}| ≥ s − [s/2]} = {u ∈ R^s : ‖u‖_0 ≥ s − [s/2], |e_{s,j}^T u| ≥ c/2, j ∈ supp(u)},

we have

S′ ⊆ ∪_{j=1}^{(2+4n)^s} {u ∈ R^s : ‖u − u_j‖ ≤ 1/n, u_j ∈ S′}.

Then we can use a method similar to the proof of Lemma A.2 to get the desired result. □
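The counting bound Σ_{k=[s/2]}^s C_s^k (1 + 2n)^k ≤ (2 + 4n)^s used above is elementary; a quick numerical check (illustration only, with an arbitrary grid of s and n):

```python
import math

# left side: sum of binomial(s, k) * (1 + 2n)^k over k = floor(s/2), ..., s
def lhs(s, n):
    return sum(math.comb(s, k) * (1 + 2 * n) ** k for k in range(s // 2, s + 1))

# right side: (2 + 4n)^s = (1 + 2n)^s * 2^s
def rhs(s, n):
    return (2 + 4 * n) ** s

for s in [1, 2, 3, 5, 8, 12]:
    for n in [1, 2, 10, 50]:
        assert lhs(s, n) <= rhs(s, n), (s, n)
print("covering-number bound verified on the grid")
```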
Lemma A.5. Under the assumptions of Lemma A.4, L_n(β_1) has two minimizers β̂_1 and −β̂_1 such that

P( ‖(−β̂_1) − (−β_1^*)‖ ≤ r_n ) = P( ‖β̂_1 − β_1^*‖ ≤ r_n ) ≥ 1 − exp( −(1 + c_1^2/4) a_n^2/(2c^2 σ^2) + o(a_n^2) ).
Proof. Define

S_1(β_1^*) = { u ∈ R^s : |{ j : |e_{s,j}^T u + e_{s,j}^T β_1^*| ≥ c/2 }| ≥ s − [s/2] }

and

S_1(−β_1^*) = { u ∈ R^s : |{ j : |e_{s,j}^T u − e_{s,j}^T β_1^*| ≥ c/2 }| ≥ s − [s/2] }.
We first show that

S_1(β_1^*) ∪ S_1(−β_1^*) = R^s.   (36)

It is obvious that S_1(β_1^*) ∪ S_1(−β_1^*) ⊆ R^s. To show the opposite inclusion, we need the following two facts: for any u ∈ R^s,

|{ j : |e_{s,j}^T u + e_{s,j}^T β_1^*| ≥ c/2 }| ≤ [s/2] ⟺ |{ j : |e_{s,j}^T u + e_{s,j}^T β_1^*| < c/2 }| ≥ s − [s/2]

and

{ j : |e_{s,j}^T u + e_{s,j}^T β_1^*| < c/2 } ⊆ { j : |e_{s,j}^T u − e_{s,j}^T β_1^*| ≥ c/2 }.

It is clear that the first fact holds, so we only need to check the second. Note that, for each j with |e_{s,j}^T u + e_{s,j}^T β_1^*| < c/2, it is easy to verify that

−2e_{s,j}^T β_1^* − c/2 < e_{s,j}^T u − e_{s,j}^T β_1^* < −2e_{s,j}^T β_1^* + c/2.
Combining this and the assumption 0 < c ≤ min{ |e_{p,j}^T β^*| : j ∈ Γ^* }, we have

e_{s,j}^T u − e_{s,j}^T β_1^* < −3c/2 if e_{s,j}^T β_1^* ≥ c, and e_{s,j}^T u − e_{s,j}^T β_1^* > 3c/2 if e_{s,j}^T β_1^* ≤ −c,

which yields |e_{s,j}^T u − e_{s,j}^T β_1^*| ≥ c/2. Therefore the second fact holds. It follows that any β_1 ∉ S_1(β_1^*), i.e., any β_1 with |{ j : |e_{s,j}^T β_1 + e_{s,j}^T β_1^*| ≥ c/2 }| ≤ [s/2], satisfies β_1 ∈ S_1(−β_1^*) by the above two facts, which further implies that (36) holds.
Note that for any β_1 ∈ S_1(β_1^*) we have −β_1 ∈ S_1(−β_1^*), and for any β_1 ∈ S_1(−β_1^*) we have −β_1 ∈ S_1(β_1^*); that is, S_1(β_1^*) = −S_1(−β_1^*). Since L_n(β_1) is an even function, it follows from (36) that

min_{β_1∈R^s} L_n(β_1) = min_{β_1∈S_1(β_1^*)} L_n(β_1) = min_{β_1∈S_1(−β_1^*)} L_n(β_1).

By a method similar to the proof of Lemma A.3, we can show that there exists a minimizer β̂_1 = arg min_{β_1∈S_1(β_1^*)} L_n(β_1) such that (22) holds. Therefore the desired result follows and the proof is completed. □
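The identity (36) can also be checked empirically. The sketch below is illustrative only: the dimension s, the constant c, and the vector β_1^* (all entries at least c in magnitude, as the proof assumes) are arbitrary choices, and membership in S_1(±β_1^*) is tested by direct counting:

```python
import random

def in_S1(u, b, c):
    # membership in S1(b): at least s - floor(s/2) coordinates j with |u_j + b_j| >= c/2
    s = len(u)
    big = sum(1 for uj, bj in zip(u, b) if abs(uj + bj) >= c / 2)
    return big >= s - s // 2

random.seed(0)
s, c = 7, 0.5
beta1 = [random.choice([-1.0, 1.0]) * random.uniform(c, 2.0) for _ in range(s)]
for _ in range(10_000):
    u = [random.uniform(-3.0, 3.0) for _ in range(s)]
    # (36): every u lies in S1(beta1*) or in S1(-beta1*)
    assert in_S1(u, beta1, c) or in_S1(u, [-bj for bj in beta1], c)
print("every sampled u lies in S1(beta1*) ∪ S1(-beta1*)")
```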
Proof of Theorem 4.1. From Lemmas A.4 and A.5, we can use the same method as for the model (1) to prove that, under the event E_1 ∩ {‖β̂_1 − β_1^*‖ ≤ r_n}, (β̂_1^T, 0^T)^T is a local minimizer in the ball {β^* + r_n u : ‖u‖_1 ≤ C} and (−β̂_1^T, 0^T)^T is a local minimizer in the ball {−β^* + r_n u : ‖u‖_1 ≤ C}. As mentioned before, we identify vectors β, β′ ∈ R^p that satisfy β′ = ±β. Then there exists a strict local minimizer β̂ such that both results (9) and (10) remain true. □
The proof of Theorem 4.2 is analogous to that of Theorem 3.2.
Appendix B Analysis of the optimization algorithm
Lemma B.1 [Chen, Xiu and Peng (2014)]. Let t ∈ R, λ > 0 and q ∈ (0, 1) be given, and let t^* = ((2 − q)/(2(1 − q))) (2λ(1 − q))^{1/(2−q)}. For any t_0 > t^*, there exists a unique implicit function u = h_{λ,q}(t) on (t^*, ∞) such that u_0 = h_{λ,q}(t_0), u = h_{λ,q}(t) > 0, h_{λ,q}(t) − t + λq h_{λ,q}(t)^{q−1} = 0, and u = h_{λ,q}(t) is continuously differentiable with

h′_{λ,q}(t) = 1/(1 + λq(q − 1) h_{λ,q}(t)^{q−2}) > 0.

For any t_0 < −t^*, there exists a unique implicit function u = h_{λ,q}(t) on (−∞, −t^*) such that u_0 = h_{λ,q}(t_0), u = h_{λ,q}(t) < 0, h_{λ,q}(t) − t − λq|h_{λ,q}(t)|^{q−1} = 0, and u = h_{λ,q}(t) is continuously differentiable with

h′_{λ,q}(t) = 1/(1 + λq(q − 1)|h_{λ,q}(t)|^{q−2}) > 0.

Furthermore, the global solution u of the problem (18) satisfies

u = H_{λ,q}(t) := { h_{λ,q}(t), if t < −t^*;  −(2λ(1 − q))^{1/(2−q)} or 0, if t = −t^*;  0, if −t^* < t < t^*;  (2λ(1 − q))^{1/(2−q)} or 0, if t = t^*;  h_{λ,q}(t), if t > t^*. }

In particular, h_{λ,1/2}(t) = (2/3) t (1 + cos((2π/3) − (2/3) φ_λ(t))) with φ_λ(t) = arccos((λ/4)(|t|/3)^{−3/2}).
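For q = 1/2 the closed form above is directly computable. The following standalone sketch evaluates h_{λ,1/2} (resolving the tie at ±t^* to 0, and using the q = 1/2 threshold value t^* = (3/2)λ^{2/3} consistent with the closed form) and checks the stationarity equation from the lemma at one sample point; the test values are arbitrary:

```python
import math

def half_threshold(t, lam):
    """h_{lam,1/2}(t): minimizer of 0.5*(u-t)^2 + lam*|u|^(1/2), with 0 on [-t*, t*]."""
    t_star = 1.5 * lam ** (2.0 / 3.0)  # (2-q)/(2(1-q)) * (2*lam*(1-q))^(1/(2-q)) at q = 1/2
    if abs(t) <= t_star:
        return 0.0
    phi = math.acos((lam / 4.0) * (abs(t) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * t * (1.0 + math.cos(2.0 * math.pi / 3.0 - 2.0 * phi / 3.0))

u = half_threshold(2.0, 1.0)
# stationarity from Lemma B.1 at q = 1/2, u > 0: u - t + (lam/2) * u^(-1/2) = 0
assert abs(u - 2.0 + 0.5 * u ** (-0.5)) < 1e-10
assert half_threshold(1.0, 1.0) == 0.0  # 1.0 is below the threshold t* = 1.5
```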
Lemma B.2. For q ∈ (0, 1) and λ > 0, let ū = arg min_{u∈R^p} (1/2)‖u − b‖_2^2 + λ‖u‖_q^q for b ∈ R^p. Then ū = H_{λ,q}(b), with H_{λ,q} applied componentwise.

The result is an immediate consequence of Lemma B.1 and therefore the proof is omitted.
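Since the prox separates across coordinates, Lemma B.2 can be checked by brute force in one dimension. In this illustrative sketch (q = 1/2, grid ranges and test points chosen arbitrarily), the grid minimizer of (1/2)(u − b)^2 + λ|u|^{1/2} vanishes for b below the threshold and matches the closed-form value above it:

```python
lam = 1.0
t_star = 1.5 * lam ** (2.0 / 3.0)  # threshold from Lemma B.1 at q = 1/2

def prox_grid(b, lam, lo=-5.0, hi=5.0, steps=200_001):
    # brute-force grid search for argmin_u 0.5*(u-b)^2 + lam*|u|^(1/2)
    best_u, best_f = 0.0, 0.5 * b * b  # value at u = 0
    for i in range(steps):
        u = lo + (hi - lo) * i / (steps - 1)
        f = 0.5 * (u - b) ** 2 + lam * abs(u) ** 0.5
        if f < best_f:
            best_u, best_f = u, f
    return best_u

assert prox_grid(1.0, lam) == 0.0                 # b = 1.0 < t* = 1.5: zero wins
assert abs(prox_grid(2.0, lam) - 1.6054) < 1e-2   # b = 2.0 > t*: nonzero minimizer
```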
Proof of Theorem 5.1. For any τ > 0, define the auxiliary problem

min_{β∈R^p} F_τ(β, u) := ℓ(u) + ⟨∇ℓ(u), β − u⟩ + (1/(2τ))‖β − u‖_2^2 + λ‖β‖_q^q, ∀ u ∈ R^p.   (37)

It is easy to check that the problem (37) is equivalent to the minimization problem

min_{β∈R^p} (1/2)‖β − (u − τ∇ℓ(u))‖_2^2 + λτ‖β‖_q^q.

For any r > 0, let B_r = {β ∈ R^p : ‖β‖_2 ≤ r} and G_r = sup_{β∈B_r} ‖∇^2 ℓ(β)‖_2. For any τ ∈ (0, G_r^{−1}] and β, u ∈ B_r, we have

L(β) = ℓ(u) + ⟨∇ℓ(u), β − u⟩ + (1/2)(β − u)^T ∇^2 ℓ(ξ)(β − u) + λ‖β‖_q^q
= F_τ(β, u) + (1/2)(β − u)^T ∇^2 ℓ(ξ)(β − u) − (1/(2τ))‖β − u‖_2^2
≤ F_τ(β, u) + (1/2)‖∇^2 ℓ(ξ)‖_2 ‖β − u‖_2^2 − (1/(2τ))‖β − u‖_2^2
≤ F_τ(β, u) + (G_r/2)‖β − u‖_2^2 − (1/(2τ))‖β − u‖_2^2
≤ F_τ(β, u),   (38)

where ξ = u + α(β − u) for some α ∈ (0, 1), and the second inequality follows from ‖ξ‖_2 ≤ r.

Further, let β̃ ∈ arg min_{β∈R^p} F_τ(β, β̂), where β̂ is a global minimizer of the problem (17). Since L(β) ≥ 0 and lim_{‖β‖_2→∞} L(β) = ∞, there exists a positive constant r_1 such that ‖β̂‖_2 ≤ r_1. Note that

∇ℓ(β) = 2 Σ_{i=1}^m (β^T Z_i β + x_i^T β − y_i)(2Z_i β + x_i),   (39)

which implies that ∇ℓ(β) is continuously differentiable. Then take

r_2 = r_1 + sup_{β∈B_{r_1}} ‖∇ℓ(β)‖_2.

Hence it follows from Lemma B.2 that ‖β̃‖_2 ≤ r_2 for any τ ∈ (0, 1]. By the definitions of β̂ and β̃, we obtain from the inequality (38) that, for any τ ∈ (0, min{G_{r_2}^{−1}, 1}),

F_τ(β̃, β̂) ≤ F_τ(β̂, β̂) = L(β̂) ≤ L(β̃) ≤ F_τ(β̃, β̂),

which leads to F_τ(β̃, β̂) = F_τ(β̂, β̂). Therefore β̂ is also a minimizer of the problem (37) with u = β̂. The result then follows from Lemma B.2. □
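The fixed-point characterization suggests the iteration β^{k+1} = H_{λτ_k,q}(β^k − τ_k∇ℓ(β^k)). The sketch below runs such an iteration for q = 1/2 on synthetic quadratic measurements; the problem sizes, random data, and the crude step-halving rule are illustrative assumptions for this sketch, not the paper's experimental setup:

```python
import numpy as np

# Synthetic quadratic measurements y_i = b'Z_i b + x_i'b (noiseless, arbitrary sizes).
rng = np.random.default_rng(0)
m, p = 30, 4
Z = rng.standard_normal((m, p, p))
Z = (Z + Z.transpose(0, 2, 1)) / 2.0               # symmetric Z_i
X = rng.standard_normal((m, p))
beta_true = np.array([1.0, -1.0, 0.0, 0.0])
y = np.einsum('i,mij,j->m', beta_true, Z, beta_true) + X @ beta_true
lam = 0.1

def resid(b):
    return np.einsum('i,mij,j->m', b, Z, b) + X @ b - y

def grad(b):
    # formula (39): grad ell(b) = 2 * sum_i r_i * (2 Z_i b + x_i)
    r = resid(b)
    return 2.0 * ((2.0 * np.einsum('mij,j->mi', Z, b) + X) * r[:, None]).sum(axis=0)

def L(b):
    r = resid(b)
    return r @ r + lam * np.sum(np.abs(b) ** 0.5)

def H_half(v, lt):
    # componentwise half thresholding H_{lt,1/2} from Lemma B.1 (q = 1/2)
    out = np.zeros_like(v)
    big = np.abs(v) > 1.5 * lt ** (2.0 / 3.0)
    phi = np.arccos((lt / 4.0) * (np.abs(v[big]) / 3.0) ** (-1.5))
    out[big] = (2.0 / 3.0) * v[big] * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

b = np.zeros(p)
obj = [L(b)]
for _ in range(200):
    tau = 0.9                                      # crude halving in the spirit of Lemma B.3
    while True:
        b_new = H_half(b - tau * grad(b), lam * tau)
        if L(b_new) <= obj[-1] + 1e-12 or tau < 1e-12:
            break
        tau *= 0.5
    b = b_new
    obj.append(L(b))

assert obj[-1] < obj[0]                            # objective decreased over the run
```

The step-halving safeguard guarantees the objective never increases, which is the same mechanism that (38) formalizes: once τ is below the local inverse curvature, the proximal step is a descent step.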
Lemma B.3. Let g_k = ‖∇ℓ(β^k)‖_2 and G_k = sup_{β∈B_k} ‖∇^2 ℓ(β)‖_2, where B_k = {β ∈ R^p : ‖β‖_2 ≤ ‖β^k‖_2 + g_k}. For any δ > 0 and γ, α ∈ (0, 1), define

j_k = { 0, if γ(G_k + δ) ≤ 1;  −[log_α γ(G_k + δ)] + 1, otherwise. }

Then (20) holds.
Proof. From the definitions of τ_k and j_k, it is easy to check that

G_k − 1/τ_k ≤ −δ.   (40)

Indeed, when γ(G_k + δ) ≤ 1 we take τ_k = γ, which yields

G_k − 1/τ_k = (γG_k − 1)/γ ≤ −δ.

If γ(G_k + δ) > 1, then

τ_k = γ α^{j_k} ≤ γ α^{−log_α γ(G_k+δ)} = 1/(G_k + δ),

which also leads to (40).

Note that

β^{k+1} ∈ arg min_{β∈R^p} F_{τ_k}(β, β^k)   (41)

and

‖β^{k+1}‖_2 ≤ ‖β^k − τ_k ∇ℓ(β^k)‖_2 ≤ ‖β^k‖_2 + g_k,
which yields β^{k+1} ∈ B_k. Similarly to (38), we obtain from (40) that

L(β^{k+1}) ≤ F_{τ_k}(β^{k+1}, β^k) + (1/2)‖β^{k+1} − β^k‖_2^2 ( ‖∇^2 ℓ(ξ_k)‖_2 − 1/τ_k )
≤ F_{τ_k}(β^{k+1}, β^k) + (1/2)‖β^{k+1} − β^k‖_2^2 ( G_k − 1/τ_k )
≤ F_{τ_k}(β^{k+1}, β^k) − (δ/2)‖β^{k+1} − β^k‖_2^2,

where ξ_k = β^k + ϱ(β^{k+1} − β^k) for some ϱ ∈ (0, 1), and ξ_k ∈ B_k leads to the second inequality. Combining this and (41), we have

L(β^k) − L(β^{k+1}) = F_{τ_k}(β^k, β^k) − L(β^{k+1}) ≥ F_{τ_k}(β^{k+1}, β^k) − L(β^{k+1}) ≥ (δ/2)‖β^{k+1} − β^k‖_2^2,

which completes the proof. □
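The step-size rule of Lemma B.3 can be exercised directly; in this illustrative snippet (all constants arbitrary), j_k is computed as in the lemma and inequality (40) is verified:

```python
import math

def step_size(gamma, alpha, G_k, delta):
    """tau_k = gamma * alpha**j_k with j_k as defined in Lemma B.3."""
    if gamma * (G_k + delta) <= 1.0:
        j_k = 0
    else:
        j_k = -math.floor(math.log(gamma * (G_k + delta), alpha)) + 1
    return gamma * alpha ** j_k

gamma, alpha, delta = 0.9, 0.5, 0.1
for G_k in [0.5, 1.0, 10.0, 250.0]:
    tau = step_size(gamma, alpha, G_k, delta)
    assert G_k - 1.0 / tau <= -delta, (G_k, tau)   # inequality (40)
print("step-size rule satisfies (40) for all tested G_k")
```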
Lemma B.4. Let {β^k} and {τ_k} be generated by FPIA. Then,

(i) {β^k} is bounded; and

(ii) there is a nonnegative integer j̄ such that τ_k ∈ [γα^{j̄}, γ].

Proof. Lemma B.3 implies that {L(β^k)} is strictly decreasing. From this, ℓ(·) ≥ 0 and the definition of L(·), it is easy to check that {β^k} is bounded. Since ℓ(·) is twice continuously differentiable, it then follows from the boundedness of {β^k} that there exist two positive constants ḡ and Ḡ such that sup_{k≥0} g_k ≤ ḡ and sup_{k≥0} G_k ≤ Ḡ. Define j̄ = max(0, −[log_α γ(Ḡ + δ)] + 1). Then 0 ≤ j_k ≤ j̄, which combined with the definition of τ_k implies that τ_k ∈ [γα^{j̄}, γ]. □
Now we consider the convergence of the sequence {β^k}. To this end, we slightly extend h_{λ,q}(·) to all of R by setting

h_{λ,q}(t) := 0 for |t| ≤ t^*,   (42)

and keeping h_{λ,q}(t) as in Lemma B.1 for |t| > t^*. Then we have the following result.
Theorem B.1. Let {β^k} be the sequence generated by FPIA. Then,

(i) {L(β^k)} converges to L(β̄), where β̄ is any accumulation point of {β^k};

(ii) lim_{k→∞} ‖β^{k+1} − β^k‖_2/τ_k = 0;

(iii) any accumulation point of {β^k} is a stationary point of the minimization problem (17) when γ ≤ ((q/(16(1 − q))) ḡ^{−1})^{(2−q)/(1−q)} (λ(1 − q))^{1/(1−q)}, where ḡ = sup_{k≥0} ‖∇ℓ(β^k)‖_2.
Proof. (i) Since {β^k} is bounded, it has at least one accumulation point. Since {L(β^k)} is monotonically decreasing and L(·) ≥ 0, {L(β^k)} converges to a constant L̄ ≥ 0. Since L(β) is continuous, we have {L(β^k)} → L̄ = L(β̄), where β̄ is an accumulation point of {β^k} as k → ∞.

(ii) From the definition of β^{k+1} and (20), we have

Σ_{k=0}^n ‖β^{k+1} − β^k‖_2^2 ≤ (2/δ) Σ_{k=0}^n [L(β^k) − L(β^{k+1})] = (2/δ)[L(β^0) − L(β^{n+1})] ≤ (2/δ) L(β^0).

Hence Σ_{k=0}^∞ ‖β^{k+1} − β^k‖_2^2 < ∞ and ‖β^{k+1} − β^k‖_2 → 0 as k → ∞. Then the second result of Lemma B.4 leads to the result (ii).

(iii) Since {β^k} and {τ_k} have convergent subsequences, without loss of generality, assume that

β^k → β̄ and τ_k → τ̄, as k → ∞.   (43)
It suffices to prove that β̄ and τ̄ satisfy (19). Note that

‖β̄ − H_{λτ̄,q}(β̄ − τ̄∇ℓ(β̄))‖_2 ≤ ‖β̄ − β^{k+1}‖_2 + ‖H_{λτ_k,q}(β^k − τ_k∇ℓ(β^k)) − H_{λτ̄,q}(β̄ − τ̄∇ℓ(β̄))‖_2 =: I_1 + I_2.   (44)

The result (ii) and (43) imply that I_1 → 0 as k → ∞.

To complete the proof, we need to show I_2 → 0 for q ∈ (0, 1). For i = 1, …, p, denote

v_i^k = e_{p,i}^T(β^k − τ_k∇ℓ(β^k)), v̄_i = e_{p,i}^T(β̄ − τ̄∇ℓ(β̄)), t_i^* = ((2 − q)/(2(1 − q)))[2λτ̄(1 − q)]^{1/(2−q)}

and β_i = (2λτ̄(1 − q))^{1/(2−q)}. Then it suffices to prove that

h_{λτ_k,q}(v_i^k) → h_{λτ̄,q}(v̄_i)   (45)

when v_i^k → v̄_i as k → ∞. We only give the proof of (45) for v̄_i > 0, because the case of v̄_i < 0 can be proved similarly.

For v̄_i < t_i^*, the limit (43) and the definition of h_{λτ,q} imply that h_{λτ_k,q}(v_i^k) = 0 = h_{λτ̄,q}(v̄_i). For v̄_i > t_i^*, one can conclude from (43) and the continuity of h_{λτ,q} on (t_i^*, ∞) that h_{λτ_k,q}(v_i^k) → h_{λτ̄,q}(v̄_i). For v̄_i = t_i^*, we show that any subsequence of {v_i^k} converging to v̄_i, without loss of generality say {v_i^k} itself, must satisfy

v_i^k ≤ t_i^*, for large enough k.   (46)
We prove the above inequality by contradiction: suppose that, along a further subsequence, v_i^k > t_i^* for all k. Denote ∆ = (q/(16(1 − q)))(λ(1 − q))^{1/(2−q)} and δ_i = (t_i^* − β_i)/4. Note that t_i^* > β_i implies that δ_i = 2∆(2τ̄)^{1/(2−q)} > 0. The second limit of (43) implies τ̄ ≥ τ_k/2 for large enough k, and hence δ_i ≥ 2∆τ_k^{1/(2−q)} for large enough k. Since τ_k^{(1−q)/(2−q)} ∆^{−1} ≤ γ^{(1−q)/(2−q)} ∆^{−1} ≤ ḡ^{−1}, for large enough k we have

τ_k ‖∇ℓ(β^k)‖_2 ≤ τ_k ḡ ≤ ∆ τ_k^{1/(2−q)} ≤ δ_i/2,

and therefore

e_{p,i}^T β^k = v_i^k + τ_k [∇ℓ(β^k)]_i ≥ v_i^k − τ_k ‖∇ℓ(β^k)‖_2 ≥ v_i^k − δ_i/2.

Combining this, the result (ii) and v_i^k → t_i^*, we have

e_{p,i}^T β^{k+1} ≥ e_{p,i}^T β^k − δ_i/2 ≥ v_i^k − δ_i ≥ t_i^* − 2δ_i = β_i + 2δ_i, for large enough k.   (47)

Note that h_{λτ,q} is continuous on (t_i^*, ∞) and lim_{k→∞} h_{λτ_k,q}(v_i^k) = β_i. For large enough k, we have e_{p,i}^T β^{k+1} = h_{λτ_k,q}(v_i^k) ∈ [β_i − δ_i, β_i + δ_i], which contradicts (47). So (46) holds. By the definition of h_{λ,q}(·) in (42), we have h_{λτ_k,q}(v_i^k) = 0 = h_{λτ̄,q}(v̄_i).