TRANSCRIPT
Two novel applications of selective inference
Robert Tibshirani, Stanford University
April 20, 2015
1 / 36
Outline
1. Review of polyhedral lemma for selective inference
2. Clustering followed by lasso — Protolasso (joint with Stephen Reid)
3. Assessment of internally derived predictors (joint with Sam Gross and Jonathan Taylor)
2 / 36
Where selective inference started for me
"A significance test for the lasso" – Annals of Statistics 2014 (with discussion)
Richard Lockhart, Simon Fraser University, Vancouver (PhD student of David Blackwell, Berkeley, 1979)
Jonathan Taylor, Stanford University (PhD student of Keith Worsley, 2001)
Ryan Tibshirani, CMU (PhD student of Taylor, 2011)
Rob Tibshirani, Stanford
3 / 36
What it’s like to work with Jon Taylor
4 / 36
References
I Reid and Tibshirani (2015). Sparse regression and marginal testing using cluster prototypes. arXiv:1503.00334
I Jason D. Lee, Dennis L. Sun, Yuekai Sun, Jonathan E. Taylor (2014). Exact post-selection inference, with application to the lasso. arXiv:1311.6238
I Gross, Taylor, Tibshirani (2015). In preparation.
I Hastie, Tibshirani, Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. June 26, 2015 (has a chapter on selective inference!)
5 / 36
Ongoing work on selective inference
I Optimal inference after model selection (Fithian, Sun, Taylor)
I LAR+forward stepwise (Taylor, Lockhart, Tibs times 2)
I Forward stepwise with grouped variables (Loftus and Taylor)
I Sequential selection procedures (G'Sell, Wager, Chouldechova, Tibs), to appear in JRSSB
I PCA (Choi, Taylor, Tibs)
I Marginal screening (Lee and Taylor)
I Many means problem (Reid, Taylor, Tibs)
I Asymptotics (Tian and Taylor)
I Bootstrap (Ryan Tibshirani+friends) in preparation
6 / 36
R package in development
selectiveInference
Joshua Loftus, Jon Taylor, Ryan Tibshirani, Rob Tibshirani
I Forward stepwise regression, including AIC stopping rule and categorical variables: forwardStepInf(x,y)
I Fixed-lambda lasso: fixedLassoInf(x,y)
I Least angle regression: larInf(x,y)
These compute p-values and selection intervals; a hypothetical call sequence is sketched below.
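The sketch below shows how these calls might look. The function names and the (x, y) arguments are taken from this slide; the simulated data and any further arguments the eventual package may need (e.g. a noise estimate) are assumptions, since the package is still in development.

```r
# Hypothetical usage sketch: function names as listed on this slide;
# the released package may differ in names and required arguments.
library(selectiveInference)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] + rnorm(n)

fs.out    <- forwardStepInf(x, y)   # forward stepwise p-values / intervals
lasso.out <- fixedLassoInf(x, y)    # lasso at a fixed lambda
lar.out   <- larInf(x, y)           # least angle regression
```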
7 / 36
The polyhedral lemma
I Response vector y ∼ N(µ, Σ). Suppose we make a selection that can be written as
Ay ≤ b
with A, b not depending on y.
I Then for any vector η,
F^{[V−, V+]}_{η^T µ, σ² η^T η}(η^T y) | {Ay ≤ b} ∼ Unif(0, 1)
(truncated Gaussian distribution), where V−, V+ are (computable) constants that are functions of η, A, b. A minimal sketch of this computation is given below.
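A minimal sketch of the pivot computation, assuming Σ = σ²I; the function name and arguments are illustrative, not part of the package. Following Lee et al. (2014), y is decomposed along η, the limits V−, V+ are read off the rows of A, and the truncated Gaussian CDF is evaluated at η^T y.

```r
# Sketch (assumes Sigma = sigma^2 * I): pivot for eta' mu given the selection {A y <= b}.
# Under eta' mu = mu0, the returned value is Unif(0,1) conditional on selection.
truncated_gaussian_pivot <- function(A, b, y, eta, sigma, mu0 = 0) {
  cvec <- eta / sum(eta^2)                  # direction along which y is truncated
  z    <- y - cvec * sum(eta * y)           # part of y not affected by eta' y
  Ac   <- as.vector(A %*% cvec)
  rho  <- (b - as.vector(A %*% z)) / Ac
  vminus <- max(c(rho[Ac < 0], -Inf))       # lower truncation limit V-
  vplus  <- min(c(rho[Ac > 0],  Inf))       # upper truncation limit V+
  s <- sigma * sqrt(sum(eta^2))             # sd of eta' y
  (pnorm(sum(eta * y), mu0, s) - pnorm(vminus, mu0, s)) /
    (pnorm(vplus, mu0, s) - pnorm(vminus, mu0, s))
}
```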
8 / 36
[Figure: geometry of the polyhedral lemma, showing the polyhedron {Ay ≤ b}, the decomposition of y into Pη y and Pη⊥ y, and the truncation limits V−(y), V+(y) for η^T y.]
9 / 36
The polyhedral lemma: some intuition
I Forward stepwise regression, with standardized predictors: suppose that we choose predictor 2 at step 1, with a positive sign.
I Then predictor 2 "won the competition" and so we know that 〈x2, y〉 > ±〈xj, y〉 for all j ≠ 2.
I Hence to carry out inference for β2, we condition on the set of y such that 〈x2, y〉 > ±〈xj, y〉 for all j ≠ 2. This can be written in the form Ay ≤ b (see the sketch after this list).
I From this, it turns out that V− = |〈xk, y〉| where xk has the second largest absolute inner product with y, and V+ = +∞. We carry out inference using the Gaussian distribution truncated to lie in (V−, +∞).
I In subsequent steps ℓ > 1, we condition on the results of the "competitions" in all of the previous ℓ − 1 steps.
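A small illustrative helper (the name is hypothetical) that builds the step-1 constraint set: the winning predictor must beat every other predictor and its negation, and must have a positive inner product with y. Its output can be fed into the pivot sketch above.

```r
# Encode the step-1 "competition" as A y <= 0 (b = 0), assuming standardized columns of X.
step1_constraints <- function(X, win) {
  losers <- setdiff(seq_len(ncol(X)), win)
  rows <- lapply(losers, function(j) {
    rbind( X[, j] - X[, win],   #  <xj, y> <= <x_win, y>
          -X[, j] - X[, win])   # -<xj, y> <= <x_win, y>
  })
  A <- rbind(do.call(rbind, rows), -X[, win])   # last row: <x_win, y> >= 0 (positive sign)
  list(A = A, b = rep(0, nrow(A)))
}
```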
10 / 36
Protolasso: regression with correlated predictors
I Outcome y, predictors X1, X2, . . . , Xp.
I The predictors fall into K groups, and are correlated within groups. The groups may be pre-defined, or discovered via clustering.
11 / 36
Protolasso: A picture
[Figure: two panels plotting values (roughly −3 to 3) against variable index (0 to 100): ordinary lasso (top) and protolasso (bottom).]
12 / 36
Protolasso
1. Cluster predictors into disjoint groups C1, C2, . . . , CK. Can use any clustering method. We use hierarchical clustering here.
2. Choose prototypes Pk ∈ Ck (one from each cluster):
   Pk = xj with j = argmax_{j∈Ck} |corr(xj, y)|
   for each cluster Ck ⊂ {1, 2, . . . , p} and k = 1, 2, . . . , K.
3. Run the lasso on the prototypes, giving selected prototypes M ⊂ P.
4. Do inference on:
   I Selected prototypes
   I Other members of selected clusters.
A rough code sketch of steps 1-3 follows.
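A rough sketch of steps 1-3, assuming the glmnet package for the lasso fit; the function name, and treating K and λ as user inputs, are choices made here for illustration. The conditional inference of step 4 (the stacked constraints described later) is not shown.

```r
library(glmnet)

# Sketch of protolasso selection: cluster, take one prototype per cluster, run the lasso.
protolasso_select <- function(x, y, K, lambda) {
  clusters <- cutree(hclust(dist(t(scale(x)))), k = K)       # step 1: hierarchical clustering
  protos <- sapply(split(seq_len(ncol(x)), clusters), function(idx) {
    idx[which.max(abs(cor(x[, idx, drop = FALSE], y)))]      # step 2: most correlated member
  })
  fit <- glmnet(x[, protos, drop = FALSE], y)                # step 3: lasso on prototypes only
  selected <- protos[which(as.numeric(coef(fit, s = lambda))[-1] != 0)]
  list(prototypes = protos, selected = selected)
}
```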
13 / 36
Post selection inference
Lee et al (2014)
I Focus on the lasso as the selection procedure:
   β̂_λ = argmin_β [ (1/2) ||y − Xβ||_2^2 + λ ||β||_1 ]
I Define the set E ⊂ {1, 2, . . . , p} of selected variables and zE, their coefficient signs.
I Two important results:
   I Affine constraint:
     {E, zE} = {A_{L,λ} y ≤ b_{L,λ}}
   I Conditional distribution:
     F^{[V−, V+]}_{η^T µ, σ² η^T η}(η^T y) | {A_{L,λ} y ≤ b_{L,λ}} ∼ Unif(0, 1)
I Lee and Taylor (2014) extend the results to marginal correlation selection as well.
14 / 36
Post selection inference
Extending to protolasso
I Condition on the two-stage selection:
   I Prototype set: P̂ = P, and the signs of the marginal correlations of the prototypes with y
   I Subset of prototypes selected by the lasso: M̂ = M, and their signs.
I Stack the constraint matrices: one per cluster prototype and one for the second-stage lasso. Reid and Tibshirani (2015).
I Do inference on η^T µ using the distribution of η^T y conditional on the selection procedure.
15 / 36
Post selection inference
Extending to protolasso
I p-values or selection intervals:
   I For selected prototypes: η = (X_M^T)^+_j (partial regression coefficients)
   I For other members of selected clusters: η = (X_{M \ {Pk} ∪ {k1}}^T)^+_j (swap out the prototype for another cluster member)
16 / 36
A picture
[Figure: the same two-panel display as before: values plotted against variable index (0 to 100) for the ordinary lasso (top) and the protolasso (bottom).]
17 / 36
The Knockoff method
Barber and Candes (2014)
I FDR-controlling framework for variable selection.
I Construct a knockoff matrix X̃ of p fake predictors, with the properties
   1. corr(x̃j, x̃k) = corr(xj, xk)
   2. corr(x̃j, xk) = corr(xj, xk), ∀ j ≠ k
   3. corr(x̃j, xj) small
I Then run a competition: run your selection procedure on the 2p predictors (X, X̃) and see whether xj or x̃j comes in first.
I These are equally likely for noise predictors; xj should win if it's signal!
18 / 36
Knockoff for Lasso
Barber and Candes (2014)
I Statistics for each pair of original and knockoff variables: Wj = Zj − Z̃j, j = 1, 2, . . . , p, where Zj = sup{λ : β̂j(λ) ≠ 0} and similarly for Z̃j.
I Data-dependent threshold:
   T = min{ t ∈ W : (1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1) ≤ q },
   where W = {|Wj| : j = 1, . . . , p} \ {0}.
I Select features with Wj ≥ T to control the FDR at level q. (A small sketch of this threshold computation follows.)
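A small sketch of the threshold computation from the entry values Z (originals) and Z̃ (knockoffs); the function name is illustrative. It returns the W statistics, the threshold, and the selected set.

```r
# Knockoff threshold: smallest t with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q.
knockoff_select <- function(Z, Ztilde, q = 0.10) {
  W  <- Z - Ztilde
  ts <- sort(setdiff(abs(W), 0))                     # candidate thresholds t in W
  ok <- vapply(ts, function(t) (1 + sum(W <= -t)) / max(sum(W >= t), 1) <= q, logical(1))
  thresh <- if (any(ok)) min(ts[ok]) else Inf        # no feasible t: select nothing
  list(W = W, threshold = thresh, selected = which(W >= thresh))
}
```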
19 / 36
Knockoff
Barber and Candes (2014)
Lemma. Let ε ∈ {±1}^p be a sign sequence independent of W = (W1, W2, . . . , Wp), with εj = +1 for all non-null variables and εj i.i.d. ∼ {±1} for null variables. Then
(W1, W2, . . . , Wp) =_d (ε1 · W1, ε2 · W2, . . . , εp · Wp)
20 / 36
Knockoff method for protolasso
1. Cluster the columns of X: {C1, C2, . . . , CK}.
2. Split the data by rows into two (roughly) equal parts:
   y = (y(1), y(2)) and X = (X(1), X(2)).
3. Extract prototypes P̂ = {P̂1, P̂2, . . . , P̂K} using only y(1) and X(1).
4. Form the knockoff matrix X̃(2) from X(2).
5. Reduce to the matrices X(2)_P̂ and X̃(2)_P̂.
6. Proceed with knockoff screening using y(2) and [X(2)_P̂ X̃(2)_P̂]. (A structural sketch follows.)
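A structural sketch of the split-and-screen recipe. The knockoff construction and screening on the second half are delegated to a hypothetical helper, knockoff_screen, standing in for the Barber and Candes filter; only the data split and prototype extraction are spelled out.

```r
# Steps 1-3 and 6 in code form; `knockoff_screen` is a hypothetical placeholder
# for knockoff construction + screening on the held-out half.
protolasso_knockoff <- function(x, y, K, q = 0.10, knockoff_screen) {
  n  <- nrow(x)
  i1 <- sample(n, floor(n / 2))                               # rows for prototype extraction
  i2 <- setdiff(seq_len(n), i1)                               # rows for screening
  clusters <- cutree(hclust(dist(t(scale(x)))), k = K)        # step 1: cluster the columns
  protos <- sapply(split(seq_len(ncol(x)), clusters), function(idx) {
    idx[which.max(abs(cor(x[i1, idx, drop = FALSE], y[i1])))] # step 3: uses (X(1), y(1)) only
  })
  knockoff_screen(x[i2, protos, drop = FALSE], y[i2], q = q)  # steps 4-6 on (X(2), y(2))
}
```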
21 / 36
Knockoffs for protolasso
[Figure: schematic of the row split, with prototype columns (marked x) chosen from (X(1), y(1)), the knockoff matrix X̃(2) built from X(2), and screening run with y(2).]
22 / 36
FDR control
We show that Barber and Candes' lemmas are satisfied, and hence this procedure achieves FDR control.
23 / 36
Knockoffs for protolasso
I W_k^(2) = Z_k^(2) − Z̃_k^(2).
[Figure: simulation results showing average FDP (left column) and average power (right column) versus signal amplitude (1 to 9), for three simulation settings.]
24 / 36
Summary
I High correlation amongst variables.
I Cluster and prototype.
I Then test/regress, using the post-selection framework.
I Extensions to knockoff and non-Gaussian models.
I Extension to marginal testing (Prototest)
25 / 36
Assessment of internally derived predictors
I In biomarker studies, an important problem is to compare a predictor of disease outcome derived from gene expression levels to standard clinical predictors.
I Comparing them on the same dataset that was used to derive the biomarker predictor can lead to results strongly biased in favor of the biomarker predictor.
I Efron and Tibshirani (2002) proposed a method called "pre-validation" for making a fairer comparison between the two sets of predictors. Doesn't exactly work.
I Can do better via selective inference
26 / 36
[Figure: schematics labeled "Naive" and "Pre-validation", showing how X, y, Z and η enter each approach.]
27 / 36
An example
An example of this problem arose in the paper of van 't Veer et al. (2002). Their microarray data has 4918 genes measured over 78 cases, taken from a study of breast cancer. There are 44 cases in the good prognosis group and 34 in the poor prognosis group. The microarray predictor was constructed as follows:
1. 70 genes were selected, having the largest absolute correlation with the 78 class labels.
2. Using these 70 genes, a nearest centroid classifier was constructed.
3. Applying the classifier to the 78 microarrays gave a dichotomous predictor ηj for each case j.
4. η was then compared with clinical predictors for predicting good or bad prognosis (logistic regression).
28 / 36
Pre-validation
1. Divide the cases up into K = 13 equal-sized parts of 6 cases each.
2. Set aside one of the parts. Using only the data from the other 12 parts, select the genes having absolute correlation at least 0.3 with the class labels, and form a nearest centroid classification rule.
3. Use the rule to predict the class labels for the 13th part.
4. Do steps 2 and 3 for each of the 13 parts, yielding a "pre-validated" microarray predictor zj for each of the 78 cases.
5. Fit a logistic regression model to the pre-validated microarray predictor and the 6 clinical predictors.
A minimal sketch of steps 1-4 follows.
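A minimal sketch of steps 1-4, with the class labels y coded 0/1 and the gene matrix in x; the fold assignment, the 0.3 cutoff and the squared-distance centroid rule follow the description above, while the function name is illustrative.

```r
# Pre-validation: build the microarray predictor for each case using only the other folds.
prevalidate <- function(x, y, K = 13, cutoff = 0.3) {
  folds <- sample(rep(seq_len(K), length.out = nrow(x)))      # step 1: K equal-sized parts
  z <- numeric(nrow(x))
  for (k in seq_len(K)) {
    held  <- folds == k                                       # step 2: set one part aside
    genes <- which(abs(cor(x[!held, , drop = FALSE], y[!held])) >= cutoff)
    cent0 <- colMeans(x[!held & y == 0, genes, drop = FALSE])
    cent1 <- colMeans(x[!held & y == 1, genes, drop = FALSE])
    d0 <- rowSums(sweep(x[held, genes, drop = FALSE], 2, cent0)^2)
    d1 <- rowSums(sweep(x[held, genes, drop = FALSE], 2, cent1)^2)
    z[held] <- as.numeric(d1 < d0)                            # step 3: nearest centroid call
  }
  z                                                           # step 4: pre-validated predictor
}
```

Step 5 would then be an ordinary logistic regression of the outcome on z together with the clinical predictors, e.g. glm(y ~ z + clinical, family = binomial).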
29 / 36
P-values on Breast cancer data
Naive test: 1.1e-06 ;
Prevalidated test: 0.0378
30 / 36
Pre-validation doesn’t exactly work
Type I error is somewhat close to the nominal level, but anti-conservative.
31 / 36
Selective inference to the rescue!
I Response y, external predictors Z = (Z1, Z2, . . . , Zm), biomarkers X = (X1, X2, . . . , Xp).
I We model y ∼ X to give η, then we run the lasso y ∼ (Z, η).
I If the first-stage model is simple, like
   "choose η = Xj most correlated with y",
then we can simply add constraints to the matrix A in the selection rule Ay ≤ b.
32 / 36
More generally
I Fit y ∼ X by least squares or the lasso to yield a prediction η.
I Then fit y ∼ (Z, η). We define the test for the significance of η via
   T = (RSS0 − RSS)/RSS ∼ P
I If the first-stage model is least squares on p predictors, then P = F_{c,d} with c = p, d = n − (m + p).
I If the first-stage model is the lasso, then P is a union of truncated F distributions. (A rough sketch of the unadjusted computation follows.)
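A rough sketch of the unadjusted computation for the least-squares first stage: it forms T = (RSS0 − RSS)/RSS and compares it to the F_{c,d} reference with c = p, d = n − (m + p) as stated above, without the selective truncation; the function name is illustrative.

```r
# Naive version (no selection adjustment): F reference with c = p, d = n - (m + p).
test_eta_naive <- function(y, X, Z) {
  n <- length(y); p <- ncol(X); m <- ncol(Z)
  eta  <- fitted(lm(y ~ X - 1))            # first stage: internally derived predictor
  fit0 <- lm(y ~ Z)                        # clinical predictors only
  fit1 <- lm(y ~ Z + eta)                  # clinical predictors plus eta
  RSS0 <- sum(resid(fit0)^2)
  RSS  <- sum(resid(fit1)^2)
  Tstat <- (RSS0 - RSS) / RSS
  list(T = Tstat,
       pval = pf(Tstat, df1 = p, df2 = n - (m + p), lower.tail = FALSE))
}
```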
33 / 36
Simulation
n = 50, nx = 200, nz = 20; the coefficients of X are zero; λ set to 1.5
[Figure: QQ plots of observed versus expected null p-values, for "Null Simulation, Prevalidation" (left) and "Null Simulation, Selective Inference" (right).]
34 / 36
P-values on Breast cancer data
Naive test: 1.1e-06 ;
Prevalidated test: 0.0378
Selective test: 0.5028
35 / 36
Conclusions
I Selective inference is an exciting new area. Lots of potential research problems and generalizations (grad students take note)!!
I Practitioners: look for the selectiveInference R package
I Coming soon: Deep Selective Inference
36 / 36