Lecture 14: Variable Selection - Beyond LASSO

Hao Helen Zhang

Outline:
Lq Methods
Penalties with Oracle Properties
Penalties for Correlated Predictors
Group Lasso
Variable Selection in Nonparametric Regression



Extension of LASSO

To achieve oracle properties:

Lq penalty with 0 < q < 1
SCAD penalty (Fan and Li 2001; Zhang et al. 2007)
Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

J_λ(β) = λ ∑_{j=1}^p w_j |β_j|.

Nonnegative garotte (Breiman, 1995)

To handle correlated predictors:

Elastic net (Zou and Hastie 2005), adaptive elastic net

J_λ(β) = ∑_{j=1}^p [ λ|β_j| + (1 − λ)β_j² ],   λ ∈ [0, 1].

OSCAR (Bondell and Reich, 2006)
Adaptive elastic net (Zou and Zhang 2009)

For group variable selection,

group LASSO (Yuan and Lin 2006)
supnorm penalty (Zhang et al. 2008)

For nonparametric (nonlinear) models

LBP (Zhang et al. 2004)
COSSO (Lin and Zhang 2006; Zhang 2006; Zhang and Lin 2006)
group lasso based methods

Various Penalty Functions

[Figure: four panels plotting penalty functions against x on [−4, 4]: the L_0 penalty |x|_0, the L_0.5 penalty |x|^0.5, the L_1 penalty |x|, and the SCAD penalty p(|x|) with λ = 0.8 and a = 3.]

About Lq Penalty Methods

q = 0 corresponds to variable selection; the penalty is the number of nonzero coefficients, i.e., the model size

For q ≤ 1, more weight is placed on the coordinate directions

q = 1 is the smallest q such that the constraint region is convex

q can be a fraction. Maybe estimate q from the data?
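For reference (this display is not on the slide; it is the standard definition), the Lq penalty under discussion is

J_λ(β) = λ ∑_{j=1}^p |β_j|^q,   q ≥ 0,

so q = 0 counts the nonzero coefficients, q = 1 gives the lasso, and q = 2 gives ridge regression.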

Motivations:

LASSO imposes the same penalty on all coefficients

Ideally, a small penalty should be applied to large coefficients (for small bias), while a large penalty should be applied to small coefficients (for better sparsity)

LASSO is not an oracle procedure

Methods with oracle properties include:

SCAD, weighted lasso, adaptive lasso, nonnegative garrote, etc.

What Is a Good Penalty?

A good penalty function should lead to an estimator that is

“nearly” unbiased

sparse: small coefficients are thresholded to zero (reduced model complexity)

continuous in the data (robust against small perturbations)

How to make these happen? (Fan and Li 2002)

sufficient condition for unbiasedness: J′(|β|) = 0 for large |β| (the penalty is bounded by a constant).

necessary and sufficient condition for sparsity: J(|β|) is singular at 0

Most of the oracle penalties are concave, so numerically they suffer from multiple local minima.

Smoothly Clipped Absolute Deviation Penalty

p_λ(β) =
  λ|β|                                    if |β| ≤ λ,
  −(|β|² − 2aλ|β| + λ²) / (2(a − 1))      if λ < |β| ≤ aλ,
  (a + 1)λ² / 2                           if |β| > aλ,

where a > 2 and λ > 0 are tuning parameters.

A quadratic spline function with two knots at λ and aλ.

singular at the origin; continuous first-order derivative.

constant penalty on large coefficients; not convex
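A minimal R sketch (ours, not from the slides) that evaluates the SCAD penalty exactly as defined above; the function name scad_penalty is arbitrary.

# SCAD penalty; a > 2 and lambda > 0 are the tuning parameters
scad_penalty <- function(beta, lambda, a = 3.7) {
  b <- abs(beta)
  ifelse(b <= lambda,
         lambda * b,
         ifelse(b <= a * lambda,
                -(b^2 - 2 * a * lambda * b + lambda^2) / (2 * (a - 1)),
                (a + 1) * lambda^2 / 2))
}

# Reproduce the shape in the earlier figure (lambda = 0.8, a = 3)
curve(scad_penalty(x, lambda = 0.8, a = 3), from = -4, to = 4,
      xlab = "x", ylab = "p(|x|)")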

SCAD Penalty

[Figure: the SCAD penalty L(w) plotted against w for w ∈ [−2, 2].]

Computational Issues

The SCAD penalty is not convex

Local Quadratic Approximation (LQA; Fan and Li, 2001)

Local Linear Approximation (LLA; Zou and Li, 2008)
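A brief aside (our summary, not on the slide): at the current iterate β^(k), LLA replaces the SCAD penalty by its tangent line in |β_j|,

p_λ(|β_j|) ≈ p_λ(|β_j^(k)|) + p′_λ(|β_j^(k)|) (|β_j| − |β_j^(k)|),

so each LLA step is a convex weighted-lasso problem with weights p′_λ(|β_j^(k)|) and can reuse standard lasso algorithms.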

Nonnegative Garotte

Breiman (1995): solve

min_{c_1,…,c_p} ‖y − ∑_{j=1}^p x_j β_j^ols c_j‖² + λ_n ∑_{j=1}^p c_j

subject to c_j ≥ 0, j = 1, …, p,

and denote the solution by the c_j's. The garotte solution is

β_j^garotte = c_j β_j^ols,   j = 1, …, p.

A sufficiently large λ_n shrinks some c_j to exactly 0 (sparsity)

The nonnegative garotte estimate is selection consistent. See also Yuan and Lin (2005).

Adaptive LASSO

Proposed by Zou (2006), Zhang and Lu (2007), Wang et al. (2007):

β^alasso = argmin_β ‖y − Xβ‖² + λ ∑_{j=1}^p w_j |β_j|,

where w_j ≥ 0 for all j.

Adaptive choices of weights

The weights are data-dependent and adaptively chosen from the data.

Large coefficients receive small weights (and small penalties)

Small coefficients receive large weights (and penalties)

How to Compute Weights

Need some good initial estimates β_j.

In general, we require the β_j's to be root-n consistent.

The weights are calculated as

w_j = 1 / |β_j|^γ,   j = 1, …, p,

where γ > 0 is another tuning parameter.

For example, the β_j's can be chosen as the OLS estimates β_j^ols.

Alternatively, if collinearity is a concern, one can use the ridge estimates β_j^ridge.

As the sample size n grows,

the weights for zero-coefficient predictors get inflated (to ∞);

the weights for nonzero-coefficient predictors converge to a finite constant.

Oracle Properties of Adaptive LASSO

Assume the first q variables are the important ones, and

(1/n) X_n^T X_n → C = ( C_{11}  C_{12} ; C_{21}  C_{22} ),

where C is a positive definite matrix.

Assume the initial estimators β are root-n consistent. Suppose λ_n/√n → 0 and λ_n n^{(γ−1)/2} → ∞.

Consistency in variable selection:

lim_n P(A_n = A_0) = 1,   where A_n = {j : β_{j,n}^alasso ≠ 0}.

Asymptotic normality:

√n (β_{A_0,n}^alasso − β_{A_0,0}) →_d N(0, σ² C_{11}^{−1}).

Additional Comments

The oracle properties are closely related to the super-efficiency phenomenon (Lehmann and Casella 1998).

The adaptive lasso simultaneously estimates large coefficients (asymptotically) without bias and thresholds small estimates to zero.

The adaptive lasso uses the same rationale as SCAD.

Computational Advantages of Adaptive LASSO

The adaptive lasso

has continuous solutions (continuous in data)

is a convex optimization problem

can be solved by the same efficient LARS algorithm to obtain the entire solution path.

By contrast,

The bridge regression (Frank and Friedman 1993) uses the Lq penalty.

The bridge with 0 < q < 1 has the oracle properties (Knight and Fu, 2000), but the bridge solution with q < 1 is not continuous (see the earlier picture).

Fit Adaptive LASSO using LARS algorithm

Algorithm:

1. Define x_j** = x_j / w_j for j = 1, …, p.

2. Solve the lasso problem for all λ_n:

   β** = argmin_β ‖y − ∑_{j=1}^p x_j** β_j‖² + λ_n ∑_{j=1}^p |β_j|.

3. Output β_j^alasso = β_j** / w_j, j = 1, …, p.

The LARS algorithm is used to compute the entire solution path of the lasso in Step 2.

The computational cost is of order O(np²), the same order as a single OLS fit.
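A minimal R sketch (ours) of the three-step algorithm above, using glmnet in place of LARS; a predictor matrix X (n x p, with n > p) and response y are assumed, and gamma is the weight exponent.

library(glmnet)

gamma <- 1
beta_init <- coef(lm(y ~ X))[-1]            # root-n consistent initial estimate (OLS)
w <- 1 / abs(beta_init)^gamma               # adaptive weights

Xs  <- sweep(X, 2, w, "/")                  # Step 1: x_j** = x_j / w_j
cvf <- cv.glmnet(Xs, y, alpha = 1, standardize = FALSE)   # Step 2: ordinary lasso
beta_star   <- as.matrix(coef(cvf, s = "lambda.min"))[-1, 1]
beta_alasso <- beta_star / w                # Step 3: transform back

# Equivalent shortcut: glmnet applies the weights directly via
# glmnet(X, y, penalty.factor = w).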

Adaptive Solution Plot

Connection of Nonnegative Garotte and Adaptive LASSO

Let γ = 1. The garotte can be reformulated as

min_β ‖y − ∑_{j=1}^p x_j β_j‖² + λ_n ∑_{j=1}^p |β_j| / |β_j^ols|

subject to β_j β_j^ols ≥ 0, j = 1, …, p.

The nonnegative garotte can be regarded as the adaptive lasso (γ = 1) with an additional sign constraint.

Zou (2006) showed that, if λ_n → ∞ and λ_n/√n → 0, then the nonnegative garotte is consistent for variable selection.

Adaptive LASSO for Generalized Linear Models (GLMs)

Assume the data have a density of the form

f(y | x, θ) = h(y) exp(yθ − φ(θ)),   θ = x^T β,

which belongs to the exponential family with canonical parameter θ. The adaptive lasso minimizes the penalized negative log-likelihood

β_n^alasso = argmin_β ∑_{i=1}^n [ −y_i (x_i^T β) + φ(x_i^T β) ] + λ_n ∑_{j=1}^p w_j |β_j|,

where

w_j = 1 / |β_j^mle|^γ,   j = 1, …, p,

and β_j^mle is the maximum likelihood estimator.
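A hedged R sketch (ours) for the logistic-regression case of this criterion, using glmnet's penalty.factor argument to carry the weights w_j; X (n x p, with n > p so the MLE exists) and a binary response y are assumed.

library(glmnet)

mle <- glm(y ~ X, family = binomial())      # unpenalized MLE used for the weights
w   <- 1 / abs(coef(mle)[-1])^1             # gamma = 1

cvf <- cv.glmnet(X, y, family = "binomial", penalty.factor = w)
coef(cvf, s = "lambda.min")                 # adaptive lasso estimates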

Adaptive LASSO for GLMs

For logistic regression, the adaptive lasso solves

β_n^alasso = argmin_β ∑_{i=1}^n [ −y_i (x_i^T β) + log(1 + e^{x_i^T β}) ] + λ_n ∑_{j=1}^p w_j |β_j|.

For Poisson log-linear regression models, the adaptive lasso solves

β_n^alasso = argmin_β ∑_{i=1}^n [ −y_i (x_i^T β) + exp(x_i^T β) ] + λ_n ∑_{j=1}^p w_j |β_j|.

Zou (2006) shows that, under some mild regularity conditions,

β_n^alasso enjoys the oracle properties if λ_n is chosen appropriately.

The limiting covariance matrix of β_n^alasso is I_{11}^{−1}, where I_{11} is the Fisher information computed with the true sub-model known.

Adaptive LASSO for High Dimensional Linear Regression

Huang, Ma, and Zhang (2008) consider the case p → ∞ as n → ∞. If a reasonable initial estimator is available, they show:

Under appropriate conditions, β^alasso is consistent in variable selection, and the nonzero-coefficient estimators have the same asymptotic distribution they would have if the true model were known (oracle properties).

Under a partial orthogonality condition, i.e., unimportant covariates are only weakly correlated with important covariates, marginal regression can be used to obtain the initial estimator, and the oracle property of β^alasso still holds even when p > n.

For example, marginal regression initial estimates work in the essentially uncorrelated case

|corr(X_j, X_k)| ≤ ρ_n = O(n^{−1/2}),   ∀ j ∈ A_0, k ∈ A_0^c.

Correlated Predictors

Motivations:

If p > n or p ≫ n, the lasso can select at most n variables before it saturates, because of the nature of the convex optimization problem.

If there exist high pairwise correlations among predictors, the lasso tends to select only one variable from the group and does not care which one is selected.

For usual n > p situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression (Tibshirani, 1996).

Methods: elastic net, OSCAR, adaptive elastic net

Motivations of Elastic Net

A typical microarray data set has many thousands of predictors (genes) and often fewer than 100 samples.

Genes sharing the same biological ‘pathway’ tend to have high correlations between them (Segal and Conklin, 2003).

The genes belonging to the same pathway form a group.

The ideal gene selection method should do two things:

eliminate the trivial genes

automatically include whole groups in the model once one gene among them is selected (‘grouped selection’).

The lasso lacks the ability to reveal this grouping information.

Elastic net

Zou and Hastie (2005) combine the lasso and the ridge

β = argmin_β ‖y − Xβ‖² + λ_2 ‖β‖² + λ_1 ‖β‖_1.

The l1 part of the penalty generates a sparse model.

The quadratic part of the penalty

removes the limitation on the number of selected variables;

encourages grouping effect;

stabilizes l1 regularization path.

The new estimator is called the Elastic Net. The solution is

β^enet = (1 + λ_2) β,

which removes the double shrinkage effect.

Elastic Net Regularization

The elastic net penalty function

has a singularity at the vertex (necessary for sparsity)

has strictly convex edges (encouraging grouping)

Properties: the elastic net solution

simultaneously does automatic variable selection and continuous shrinkage

can select groups of correlated variables, retaining ‘all the big fish’.

The elastic net often outperforms the lasso in terms of prediction accuracy in numerical studies.

Strictly Convex Penalty vs LASSO penalty

Consider

β = argmin_β ‖y − Xβ‖² + λ J(β),

where J(β) > 0 for β ≠ 0.

Assume x_i = x_j for some i, j ∈ {1, …, p}.

If J is strictly convex, then β_i = β_j for any λ > 0.

If J(β) = ‖β‖_1, then β_i β_j ≥ 0, and β* is another minimizer, where

β*_k = β_k                   if k ≠ i, k ≠ j,
       (β_i + β_j) · s        if k = i,
       (β_i + β_j) · (1 − s)  if k = j,

for any s ∈ [0, 1].

Comments

There is a clear distinction between strictly convex penalty functions and the lasso penalty.

Strict convexity guarantees the grouping effect in the extreme situation with identical predictors.

In contrast, the lasso does not even have a unique solution.

The elastic net penalty with λ_2 > 0 is strictly convex, thus enjoying the grouping effect.

Results on the Grouping Effect

Given data (y, X) and parameters (λ_1, λ_2), where the response y is centered and the predictors X are standardized, let β^enet(λ_1, λ_2) be the elastic net estimate. Suppose β_i^enet(λ_1, λ_2) β_j^enet(λ_1, λ_2) > 0; then

(1/‖y‖_1) |β_i^enet(λ_1, λ_2) − β_j^enet(λ_1, λ_2)| < (√2 / λ_2) √(1 − ρ_ij),

where ρ_ij = x_i^T x_j is the sample correlation.

The closer ρ_ij is to 1, the more grouping in the solution.

The larger λ_2 is, the more grouping in the solution.

Effective Degrees of Freedom

Effective df describes the model complexity.

df is very useful in estimating the prediction accuracy of the fitted model.

df is well studied for linear smoothers: ŷ = Sy, and df(ŷ) = tr(S).

For the l1-related methods, the non-linear nature makes the analysis difficult.

Elastic net: degrees of freedom

For the elastic net, an unbiased estimate of df is

df(β^enet) = tr(H_{λ_2}(A)),

where A = {j : β_j^enet ≠ 0, j = 1, …, p} is the active set (containing all the selected variables), and

H_{λ_2}(A) = X_A (X_A^T X_A + λ_2 I)^{−1} X_A^T.

Special case for the lasso: λ_2 = 0 and

df(β^lasso) = the number of nonzero coefficients in β^lasso.
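A short R sketch (ours) computing this unbiased df estimate directly from the formula above; the design matrix X, the ridge parameter lambda2, and the fitted elastic net coefficient vector beta_enet are assumed to exist.

A  <- which(beta_enet != 0)                          # active set
XA <- X[, A, drop = FALSE]
H  <- XA %*% solve(crossprod(XA) + lambda2 * diag(length(A))) %*% t(XA)
df_enet <- sum(diag(H))                              # tr(H_lambda2(A))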

Computing and Tuning Elastic Net

Solution path algorithm:

The elastic net solution path is piecewise linear.

Given a fixed λ_2, a stage-wise algorithm called LARS-EN efficiently solves the entire elastic net solution path.

The computational effort is the same as that of a single OLS fit.

R package: elasticnet

Two-dimensional cross validation
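A hedged R sketch (ours) of the two-dimensional tuning idea. The slides reference the elasticnet package and the (λ_1, λ_2) parameterization; this sketch instead uses glmnet, whose equivalent knobs are the mixing parameter alpha and the overall penalty lambda. X and y are assumed to exist.

library(glmnet)

alphas <- seq(0.1, 1, by = 0.1)
set.seed(1)
foldid <- sample(rep(1:10, length.out = length(y)))   # common folds across alpha
cvs  <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a, foldid = foldid))
best <- which.min(sapply(cvs, function(cv) min(cv$cvm)))
alphas[best]                 # selected mixing parameter
cvs[[best]]$lambda.min       # selected penalty level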

Adaptive Elastic Net

Zou and Zhang (2009) propose

β^aenet = (1 + λ_2) argmin_β ‖y − Xβ‖² + λ_2 ‖β‖² + λ_1 ∑_{j=1}^p w_j |β_j|,

where w_j = (|β_j^enet|)^{−γ} for j = 1, …, p, and γ > 0 is a constant.

Main results:

Under weak regularity conditions, the adaptive elastic-net has the oracle property.

Numerically, the adaptive elastic-net deals with the collinearity problem better than the other oracle-like methods and has better finite-sample performance.

Motivation of Group Lasso

The problem of group selection arises naturally in many practical situations. For example, in multifactor ANOVA,

each factor has several levels and can be expressed through a group of dummy variables;

the goal is to select important main effects and interactions for accurate prediction, which amounts to the selection of groups of derived input variables.

Variable selection amounts to the selection of important factors (groups of variables) rather than individual derived variables.

Another Motivation: Nonparametric Regression

Additive Regression Models

Each component in the additive model may be expressed as a linear combination of a number of basis functions of the original measured variable.

The selection of important measured variables corresponds to the selection of groups of basis functions.

Variable selection amounts to the selection of groups of basis functions (those corresponding to the same variable).

Group Selection Setup

Consider the general regression problem with J factors:

y = ∑_{j=1}^J X_j β_j + ε,

y is an n × 1 vector

ε ∼ N(0, σ2I )

Xj is an n × pj matrix corresponding to the jth factor

βj is a coefficient vector of size pj , j = 1, . . . , J.

Both y and the X_j's are centered. Each X_j is orthonormalized, i.e.,

X_j^T X_j = I_{p_j},   j = 1, …, J,

which can be done through Gram-Schmidt orthonormalization. Denote X = (X_1, X_2, …, X_J) and β = (β_1^T, β_2^T, …, β_J^T)^T.

Group LASSO

Yuan and Lin (2006) propose

β^glasso = argmin_β ‖y − ∑_{j=1}^J X_j β_j‖² + λ ∑_{j=1}^J √p_j ‖β_j‖,

where

λ ≥ 0 is a tuning parameter.

‖β_j‖ = √(β_j^T β_j) is the empirical L2 norm.

The group lasso penalty encourages sparsity at the factor level.

When p1 = · · · = pJ = 1, the group lasso reduces to the lasso.
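A minimal R sketch (ours) of fitting the group lasso with the grpreg package, one of several implementations (Yuan and Lin's original algorithm differs); X is the combined design matrix with columns ordered by factor, y the response, and group a vector giving the factor index of each column, e.g. group = c(1, 1, 1, 2, 2, 3).

library(grpreg)

fit <- grpreg(X, y, group = group, penalty = "grLasso")     # group lasso path
cvf <- cv.grpreg(X, y, group = group, penalty = "grLasso")  # tune lambda by CV
coef(cvf)                       # coefficients; groups enter or leave together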

Degrees of Freedom for Group LASSO

df = ∑_{j=1}^J I(‖β_j^glasso‖ > 0) + ∑_{j=1}^J (‖β_j^glasso‖ / ‖β_j^ols‖)(p_j − 1).
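A short R sketch (ours) that evaluates this df formula, assuming the per-group coefficient vectors are held in lists beta_gl (group lasso fit) and beta_ols (OLS fit), each element of length p_j.

norm2 <- function(v) sqrt(sum(v^2))
df_gl <- sum(sapply(beta_gl, function(b) norm2(b) > 0)) +
         sum(mapply(function(b, o) (norm2(b) / norm2(o)) * (length(b) - 1),
                    beta_gl, beta_ols))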


Motivation of Adaptive Group LASSO

The group lasso can suffer from estimation inefficiency and selection inconsistency. To remedy this, Wang and Leng (2006) propose

β^aglasso = argmin_β ‖y − ∑_{j=1}^J X_j β_j‖² + λ ∑_{j=1}^J w_j ‖β_j‖,

where the weights are w_j = ‖β_j‖^{−γ}, j = 1, …, J, computed from initial estimates β_j.

The adaptive group LASSO has oracle properties. In other words, β^aglasso is root-n consistent, model selection consistent, and asymptotically normal and efficient.

Function ANOVA Decomposition

Similar to classical ANOVA, any multivariate function f is decomposed as

f(x) = β_0 + ∑_{j=1}^p f_j(x_j) + ∑_{j<k} f_{jk}(x_j, x_k) + … + f_{1⋯p}(x_1, …, x_p)

Side conditions guarantee uniqueness of the decomposition (Wahba 1990, Gu 2002).

Product Domain

Consider a multivariate function f (x) = f (x1, . . . , xp).

Let Xj be the domain for xj , i.e., xj ∈ Xj

X = ∏_{j=1}^p X_j is the product domain for x = (x_1, …, x_p)

Examples:

continuous domain: X_j = [0, 1], and X = [0, 1]^p.

discrete domain: X_j = {1, 2, …, K}

Averaging Operator: Aj

A_j is the averaging operator on X_j that averages out x_j from the active argument list

A_j satisfies A_j² = A_j.

A_j f is constant in x_j, but not necessarily an overall constant function

Examples: p = 2.

Example 1: X_1 = {1, …, K_1} and X_2 = {1, …, K_2}.

A_1 f = f(1, x_2), A_2 f = f(x_1, 1); or

A_j f = ∑_{x_j=1}^{K_j} f(x_1, x_2)/K_j for j = 1, 2

Example 2: X_1 = X_2 = [0, 1], X = [0, 1]².

A_1 f = f(0, x_2), A_2 f = f(x_1, 0); or

A_j f = ∫_0^1 f(x_1, x_2) dx_j for j = 1, 2

Multiway ANOVA Decomposition

f(x) = { ∏_{j=1}^p (I − A_j + A_j) } f = ∑_S { ∏_{j∈S} (I − A_j) ∏_{j∉S} A_j } f
     = ∑_S f_S
     = β_0 + ∑_{j=1}^p f_j(x_j) + ∑_{j<k} f_{jk}(x_j, x_k) + … + f_{1⋯p}(x_1, …, x_p)

where

S ⊆ {1, …, p} lists the active arguments in f_S

the summation is over all of the 2^p subsets of {1, …, p}.

ANOVA Interpretation

β_0 = ∏_{j=1}^p A_j(f) is a constant (the overall mean).

f_j = f_j(x_j) = (I − A_j) ∏_{k≠j} A_k(f) is the x_j main effect.

f_{jk} = f_{jk}(x_j, x_k) = (I − A_j)(I − A_k) ∏_{l≠j,k} A_l(f) is the x_j-x_k interaction.

Side conditions: A_j f_S = 0, ∀ j ∈ S.

Example: p = 2.

β_0 = A_1 A_2 f,
f_1 = (I − A_1) A_2 f = A_2 f − A_1 A_2 f = A_2 f − β_0,
f_2 = (I − A_2) A_1 f = A_1 f − A_1 A_2 f = A_1 f − β_0,
f_{12} = (I − A_1)(I − A_2) f = f(x_1, x_2) − f_1(x_1) − f_2(x_2) − β_0.

Discrete Domain Example: p = 2

Domain: X_1 = {1, …, K_1} and X_2 = {1, …, K_2}.

Example: A_1 f = f(1, x_2) and A_2 f = f(x_1, 1).

β_0 = A_1 A_2 f = f(1, 1)
f_1 = (I − A_1) A_2 f = f(x_1, 1) − f(1, 1)
f_2 = (I − A_2) A_1 f = f(1, x_2) − f(1, 1)
f_{12} = (I − A_1)(I − A_2) f = f(x_1, x_2) − f(x_1, 1) − f(1, x_2) + f(1, 1)

Example: A_j f = ∑_{x_j=1}^{K_j} f(x_1, x_2)/K_j for j = 1, 2.

β_0 = A_1 A_2 f = f_··
f_1 = (I − A_1) A_2 f = f_{x_1·} − f_··
f_2 = (I − A_2) A_1 f = f_{·x_2} − f_··
f_{12} = (I − A_1)(I − A_2) f = f(x_1, x_2) − f_{x_1·} − f_{·x_2} + f_··,

where f_·· is the overall mean, f_{x_1·} = ∑_{x_2=1}^{K_2} f(x_1, x_2)/K_2 is the marginal average over x_2, and f_{·x_2} = ∑_{x_1=1}^{K_1} f(x_1, x_2)/K_1 is the marginal average over x_1.

Continuous Domain Example: p = 2

Domain X1 = X2 = [0, 1].

Example: A_1 f = f(0, x_2) and A_2 f = f(x_1, 0).

β_0 = A_1 A_2 f = f(0, 0)
f_1 = (I − A_1) A_2 f = f(x_1, 0) − f(0, 0)
f_2 = (I − A_2) A_1 f = f(0, x_2) − f(0, 0)
f_{12} = (I − A_1)(I − A_2) f = f(x_1, x_2) − f(x_1, 0) − f(0, x_2) + f(0, 0)

Example: A_j f = ∫_0^1 f(x_1, x_2) dx_j for j = 1, 2.

β_0 = A_1 A_2 f = ∫_0^1 ∫_0^1 f(x_1, x_2) dx_1 dx_2
f_1 = (I − A_1) A_2 f = ∫_0^1 f(x_1, x_2) dx_2 − β_0
f_2 = (I − A_2) A_1 f = ∫_0^1 f(x_1, x_2) dx_1 − β_0
f_{12} = (I − A_1)(I − A_2) f = f(x_1, x_2) − ∫_0^1 f dx_1 − ∫_0^1 f dx_2 + ∫_0^1 ∫_0^1 f(x_1, x_2) dx_1 dx_2

Truncated Models

Additive models

f(x) = β_0 + ∑_{j=1}^p f_j(x_j),

Claim Xj as unimportant if the function fj = 0

Two-way interaction model

f(x) = β_0 + ∑_{j=1}^p f_j(x_j) + ∑_{j<k} f_{jk}(x_j, x_k).

The interaction effect between X_j and X_k is unimportant if f_{jk} = 0.

Variable Selection for Additive Models

Y_i = μ + ∑_{j=1}^p f_j(X_ij) + ε_i,

where

µ is the intercept

the component fj ’s are unknown smooth functions

ε_i is an unobserved random variable with mean 0 and finite variance σ².

Suppose some of the additive components f_j are zero. The goals are

to distinguish the nonzero components from the zero components

and to estimate the nonzero components consistently

(Traditional) Smoothing Spline Fits

[Figure: traditional smoothing spline estimates of the four additive components f1(x1), f2(x2), f3(x3), and f4(x4), each plotted over [0, 1].]

Traditional splines do not produce sparse function estimates, as zero functions are not estimated as exactly zero.

Existing Methods for Variable Selection for Additive Models

Multivariate Adaptive Regression Splines (MARS) (Friedman 1991)

Classification and Regression Trees (CART, Breiman et al. 1984)

Basis Pursuit (Zhang et al. 2004); Component Selection and Smoothing Operator (COSSO, Lin and Zhang, 2006)

Group LASSO (Huang), Group SCAD (Wang et al. 2007)

Sparse Additive Models (SpAM, Ravikumar 2007)

Example: to identify the correct sparse model structure

Y = f_1(X_1) + f_2(X_2) + 0(X_3) + 0(X_4) + ε.

Review of Basis-Expansion Methods

Assume

f_j(x_j) = ∑_{k=1}^{m_n} β_jk φ_k(x_j),   j = 1, …, p,

where {φ_k : 1 ≤ k ≤ m_n} is the set of m_n basis functions.

MARS (Friedman, 1991) builds models by sequentially searching for and adding basis functions, followed by basis pruning (mimicking stepwise selection).

Likelihood Basis Pursuit (LBP)

An early attempt to generalize the LASSO to the multivariate spline regression context. Consider additive models with a basis expansion:

f(x) = ∑_{j=1}^p f_j(x_j) = ∑_{j=1}^p ∑_{k=1}^{m_j} β_jk φ_k(x_j).

The LASSO penalty is imposed on the basis coefficients:

min ∑_{i=1}^n [y_i − f(x_i)]² + λ ∑_{j=1}^p ∑_{k=1}^{m_j} |β_jk|.

See Zhang et al. (2004).

Properties of LBP

Each X_j is associated with multiple coefficients β_jk.

To claim X_j as an unimportant variable, one needs β_jk = 0 for all k = 1, …, m_j.

Sequential testing was suggested (based on a Monte Carlo bootstrap).

Group-LASSO Methods

Huang et al. (2010) consider the adaptive group lasso:

min_{μ, β_1, …, β_p} ∑_{i=1}^n [ y_i − μ − ∑_{j=1}^p ∑_{k=1}^{m_n} β_jk φ_k(x_ij) ]² + λ ∑_{j=1}^p w_j ‖β_j‖_2,

where β_j = (β_j1, …, β_j m_n)^T.

The φ_k's are normalized B-spline basis functions.
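A hedged R sketch (ours) of this kind of estimator: each predictor is expanded in B-spline basis functions and a group penalty is applied to each variable's block of coefficients. It uses splines::bs and grpreg; the slides do not prescribe a particular implementation, and this version uses the plain (non-adaptive) group lasso penalty (grpreg's group.multiplier argument is where adaptive weights w_j could go). X (n x p) and y are assumed.

library(splines)
library(grpreg)

mn <- 6                                            # m_n basis functions per variable
Z  <- do.call(cbind, lapply(1:ncol(X), function(j) bs(X[, j], df = mn)))
group <- rep(1:ncol(X), each = mn)                 # one group per original variable

cvf  <- cv.grpreg(Z, y, group = group, penalty = "grLasso")
beta <- coef(cvf)[-1]                              # drop the intercept mu
selected <- unique(group[beta != 0])               # indices of selected variables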

Two Important Issues

Can we do nonparametric variable selection via a function shrinkage operator?

Can the estimator achieve the nonparametric oracle properties?

Answer: COSSO (Lin and Zhang, 2006) and the adaptive COSSO.

COmponent Selection and Smoothing Operator (COSSO)

COSSO is a soft-thresholding operator on the function components that achieves sparse function estimation:

min_{f∈F} (1/n) ∑_{i=1}^n [y_i − f(x_i)]² + λ J(f),

where J(f) is the sum of the RKHS norms of the components:

J(f) = ∑_{j=1}^p ‖P^j f‖_F.

P^j is the projection operator of f onto the function space of X_j

J(f) is continuous in f; J(f) is convex

See Lin and Zhang (2006) and Zhang and Lin (2006) for details.

Model Selection Consistency of COSSO

The COSSO solution f is a linear combination of p components:

f(x) = b + ∑_{j=1}^p θ_j [ ∑_{i=1}^n c_i R_j(x_i, x) ].

Theorem 2. Let F be the second-order Sobolev space of periodic functions. Under the tensor product design, when λ → 0 and nλ² → ∞:

If f_j = 0, then with probability tending to 1, θ_j = 0 (implying f_j is dropped from the model).

If f_j ≠ 0, then with probability tending to 1, θ_j > 0.

Example 1

Dimension p = 10, sample size n = 100.

Generate y = f (x) + ε, with ε ∼ N(0, 1.74)

True regression function

f (x) = 5g1(x1) + 3g2(x2) + 4g3(x3) + 6g4(x4).

var{5g_1(X_1)} = 2.08, var{3g_2(X_2)} = 0.80, var{4g_3(X_3)} = 3.30, var{6g_4(X_4)} = 9.45.

MARS is done in R with the function “mars” in the “mda” library.

The magnitude of each functional component is measured by the empirical L1 norm: n^{−1} ∑_{i=1}^n |f_j(x_ij)| for each j.

Table 1. The average ISEs over 100 runs of Example 1.

Comp. symm.:
              t = 0          t = 1          t = 3
COSSO(GCV)    0.93 (0.05)    0.92 (0.04)    0.97 (0.07)
COSSO(5CV)    0.80 (0.03)    0.97 (0.05)    1.07 (0.06)
MARS          1.57 (0.07)    1.24 (0.06)    1.30 (0.06)

AR(1):
              ρ = −0.5       ρ = 0          ρ = 0.5
COSSO(GCV)    0.94 (0.05)    1.04 (0.07)    0.98 (0.07)
COSSO(5CV)    1.03 (0.06)    1.03 (0.06)    0.98 (0.05)
MARS          1.32 (0.07)    1.34 (0.07)    1.36 (0.08)

Selection Frequency of Variables

For the independent case, we have

Variable      1    2    3    4    5    6    7    8    9   10
COSSO(GCV)  100  100  100  100   14   11   18   15   11   13
COSSO(5CV)  100   94  100  100    1    1    3    2    4    2
MARS        100  100  100  100   35   35   34   39   28   35

Nonparametric Oracle Property

A nonparametric regression estimator is nonparametric-oracle if

(1) ‖f̂ − f‖ = O_p(n^{−α}) at the optimal rate α > 0.

(2) f̂_j = 0 for all j ∈ U (the index set of unimportant components) with probability tending to 1.

The COSSO is not oracle:

For consistent estimation, λ approaches zero at the rate n^{−4/5}, which is faster than needed for consistent model selection.

Can we oraclize COSSO?

Adaptive (weighted) COSSO

Recently we proposed

min_{f∈F} (1/n) ∑_{i=1}^n [y_i − f(x_i)]² + λ ∑_{j=1}^p w_j ‖P^j f‖_F.

0 < w_j ≤ ∞ are pre-specified weights

Role of the weights: adaptively adjusting the penalty on individual components based on their relative importance

Large weights are imposed on unimportant components ⇒ sparsity

Small weights are imposed on important components ⇒ small bias

Choices of Weights

Recall that ‖g‖²_{L2} = ∫ [g(x)]² dx.

Given an initial estimate f, we construct the weights as

w_j = ‖P^j f‖_{L2}^{−γ},   j = 1, …, p,

where γ > 0 is an extra parameter which can be adaptively chosen.

The initial f should be a good estimator (say, consistent) so that it accurately indicates the relative importance of the different function components.

In practice, use the traditional smoothing spline or COSSO fit as f to construct the weights.

The adaptive COSSO estimator is nonparametric-oracle (Storlie et al. 2011).
