TRANSCRIPT
Two novel applications of selective inference
Robert Tibshirani, Stanford University
April 20, 2015
1 / 36
Outline
1. Review of polyhedral lemma for selective inference
2. Clustering followed by lasso — Protolasso (joint with Stephen Reid)
3. Assessment of internally derived predictors (joint with Sam Gross and Jonathan Taylor)
2 / 36
Where selective inference started for me
"A significance test for the lasso" – Annals of Statistics 2014 (with discussion)
Richard Lockhart, Simon Fraser University, Vancouver (PhD student of David Blackwell, Berkeley, 1979)
Jonathan Taylor, Stanford University (PhD student of Keith Worsley, 2001)
Ryan Tibshirani, CMU (PhD student of Taylor, 2011)
Rob Tibshirani, Stanford
3 / 36
What it’s like to work with Jon Taylor
4 / 36
References
I Reid and Tibshirani (2015). Sparse regression and marginal testing using cluster prototypes. arXiv:1503.00334
I Jason D. Lee, Dennis L. Sun, Yuekai Sun, Jonathan E. Taylor (2014). Exact post-selection inference, with application to the lasso. arXiv:1311.6238
I Gross, Taylor, Tibshirani (2015). In preparation.
I Hastie, Tibshirani, Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. June 26, 2015 (has a chapter on selective inference!)
5 / 36
Ongoing work on selective inference
I Optimal inference after model selection (Fithian, Sun, Taylor)
I LAR+forward stepwise (Taylor, Lockhart, Tibs times 2)
I Forward stepwise with grouped variables (Loftus and Taylor)
I Sequential selection procedures (G'Sell, Wager, Chouldechova, Tibs), to appear in JRSSB
I PCA (Choi, Taylor, Tibs)
I Marginal screening (Lee and Taylor)
I Many means problem (Reid, Taylor, Tibs)
I Asymptotics (Tian and Taylor)
I Bootstrap (Ryan Tibshirani+friends) in preparation
6 / 36
R package in development
selectiveInference
Joshua Loftus, Jon Taylor, Ryan Tibshirani, Rob Tibshirani
I Forward stepwise regression, including AIC stopping rule and categorical variables: forwardStepInf(x,y)
I Fixed-lambda lasso: fixedLassoInf(x,y)
I Least angle regression: larInf(x,y)
These compute p-values and selection intervals; a hypothetical call sequence is sketched below.
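The sketch below shows how these calls might look. The function names and the (x, y) arguments are taken from this slide; the simulated data and any further arguments the eventual package may need (e.g. a noise estimate) are assumptions, since the package is still in development.

```r
# Hypothetical usage sketch: function names as listed on this slide;
# the released package may differ in names and required arguments.
library(selectiveInference)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] + rnorm(n)

fs.out    <- forwardStepInf(x, y)   # forward stepwise p-values / intervals
lasso.out <- fixedLassoInf(x, y)    # lasso at a fixed lambda
lar.out   <- larInf(x, y)           # least angle regression
```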
7 / 36
The polyhedral lemma
I Response vector y ∼ N(µ, Σ). Suppose we make a selection that can be written as
Ay ≤ b
with A, b not depending on y.
I Then for any vector η,
F^{[V−, V+]}_{η^T µ, σ² η^T η}(η^T y) | {Ay ≤ b} ∼ Unif(0, 1)
(truncated Gaussian distribution), where V−, V+ are (computable) constants that are functions of η, A, b. A minimal sketch of this computation is given below.
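A minimal sketch of the pivot computation, assuming Σ = σ²I; the function name and arguments are illustrative, not part of the package. Following Lee et al. (2014), y is decomposed along η, the limits V−, V+ are read off the rows of A, and the truncated Gaussian CDF is evaluated at η^T y.

```r
# Sketch (assumes Sigma = sigma^2 * I): pivot for eta' mu given the selection {A y <= b}.
# Under eta' mu = mu0, the returned value is Unif(0,1) conditional on selection.
truncated_gaussian_pivot <- function(A, b, y, eta, sigma, mu0 = 0) {
  cvec <- eta / sum(eta^2)                  # direction along which y is truncated
  z    <- y - cvec * sum(eta * y)           # part of y not affected by eta' y
  Ac   <- as.vector(A %*% cvec)
  rho  <- (b - as.vector(A %*% z)) / Ac
  vminus <- max(c(rho[Ac < 0], -Inf))       # lower truncation limit V-
  vplus  <- min(c(rho[Ac > 0],  Inf))       # upper truncation limit V+
  s <- sigma * sqrt(sum(eta^2))             # sd of eta' y
  (pnorm(sum(eta * y), mu0, s) - pnorm(vminus, mu0, s)) /
    (pnorm(vplus, mu0, s) - pnorm(vminus, mu0, s))
}
```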
8 / 36
[Figure: geometry of the polyhedral lemma, showing the polyhedron {Ay ≤ b}, the decomposition of y into Pη y and Pη⊥ y, and the truncation limits V−(y), V+(y) for η^T y.]
9 / 36
The polyhedral lemma: some intuition
I Forward stepwise regression, with standardized predictors: suppose that we choose predictor 2 at step 1, with a positive sign.
I Then predictor 2 "won the competition" and so we know that 〈x2, y〉 > ±〈xj, y〉 for all j ≠ 2.
I Hence to carry out inference for β2, we condition on the set of y such that 〈x2, y〉 > ±〈xj, y〉 for all j ≠ 2. This can be written in the form Ay ≤ b (see the sketch after this list).
I From this, it turns out that V− = |〈xk, y〉| where xk has the second largest absolute inner product with y, and V+ = +∞. We carry out inference using the Gaussian distribution truncated to lie in (V−, +∞).
I In subsequent steps ℓ > 1, we condition on the results of the "competitions" in all of the previous ℓ − 1 steps.
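A small illustrative helper (the name is hypothetical) that builds the step-1 constraint set: the winning predictor must beat every other predictor and its negation, and must have a positive inner product with y. Its output can be fed into the pivot sketch above.

```r
# Encode the step-1 "competition" as A y <= 0 (b = 0), assuming standardized columns of X.
step1_constraints <- function(X, win) {
  losers <- setdiff(seq_len(ncol(X)), win)
  rows <- lapply(losers, function(j) {
    rbind( X[, j] - X[, win],   #  <xj, y> <= <x_win, y>
          -X[, j] - X[, win])   # -<xj, y> <= <x_win, y>
  })
  A <- rbind(do.call(rbind, rows), -X[, win])   # last row: <x_win, y> >= 0 (positive sign)
  list(A = A, b = rep(0, nrow(A)))
}
```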
10 / 36
Protolasso: regression with correlated predictors
I Outcome y, predictors X1, X2, . . . , Xp.
I The predictors fall into K groups, and are correlated within groups. The groups may be pre-defined, or discovered via clustering.
11 / 36
Protolasso: A picture
[Figure: two panels plotting values (roughly −3 to 3) against variable index (0 to 100): ordinary lasso (top) and protolasso (bottom).]
12 / 36
Protolasso
1. Cluster predictors into disjoint groups C1, C2, . . . , CK. Can use any clustering method. We use hierarchical clustering here.
2. Choose prototypes Pk ∈ Ck (one from each cluster):
   Pk = xj with j = argmax_{j∈Ck} |corr(xj, y)|
   for each cluster Ck ⊂ {1, 2, . . . , p} and k = 1, 2, . . . , K.
3. Run the lasso on the prototypes, giving selected prototypes M ⊂ P.
4. Do inference on:
   I Selected prototypes
   I Other members of selected clusters.
A rough code sketch of steps 1-3 follows.
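A rough sketch of steps 1-3, assuming the glmnet package for the lasso fit; the function name, and treating K and λ as user inputs, are choices made here for illustration. The conditional inference of step 4 (the stacked constraints described later) is not shown.

```r
library(glmnet)

# Sketch of protolasso selection: cluster, take one prototype per cluster, run the lasso.
protolasso_select <- function(x, y, K, lambda) {
  clusters <- cutree(hclust(dist(t(scale(x)))), k = K)       # step 1: hierarchical clustering
  protos <- sapply(split(seq_len(ncol(x)), clusters), function(idx) {
    idx[which.max(abs(cor(x[, idx, drop = FALSE], y)))]      # step 2: most correlated member
  })
  fit <- glmnet(x[, protos, drop = FALSE], y)                # step 3: lasso on prototypes only
  selected <- protos[which(as.numeric(coef(fit, s = lambda))[-1] != 0)]
  list(prototypes = protos, selected = selected)
}
```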
13 / 36
Post selection inference
Lee et al (2014)
I Focus on the lasso as the selection procedure:
   β̂_λ = argmin_β [ (1/2) ||y − Xβ||_2^2 + λ ||β||_1 ]
I Define the set E ⊂ {1, 2, . . . , p} of selected variables and zE, their coefficient signs.
I Two important results:
   I Affine constraint:
     {E, zE} = {A_{L,λ} y ≤ b_{L,λ}}
   I Conditional distribution:
     F^{[V−, V+]}_{η^T µ, σ² η^T η}(η^T y) | {A_{L,λ} y ≤ b_{L,λ}} ∼ Unif(0, 1)
I Lee and Taylor (2014) extend the results to marginal correlation selection as well.
14 / 36
Post selection inference
Extending to protolasso
I Condition on the two-stage selection:
   I Prototype set: P̂ = P, and the signs of the marginal correlations of the prototypes with y
   I Subset of prototypes selected by the lasso: M̂ = M, and their signs.
I Stack the constraint matrices: one per cluster prototype and one for the second-stage lasso. Reid and Tibshirani (2015).
I Do inference on η^T µ using the distribution of η^T y conditional on the selection procedure.
15 / 36
Post selection inference
Extending to protolasso
I p-values or selection intervals:
   I For selected prototypes: η = (X_M^T)^+_j (partial regression coefficients)
   I For other members of selected clusters: η = (X_{M \ {Pk} ∪ {k1}}^T)^+_j (swap out the prototype for another cluster member)
16 / 36
A picture
[Figure: the same two-panel display as before: values plotted against variable index (0 to 100) for the ordinary lasso (top) and the protolasso (bottom).]
17 / 36
The Knockoff method
Barber and Candes (2014)
I FDR-controlling framework for variable selection.
I Construct a knockoff matrix X̃ of p fake predictors, with the properties
   1. corr(x̃j, x̃k) = corr(xj, xk)
   2. corr(x̃j, xk) = corr(xj, xk), ∀ j ≠ k
   3. corr(x̃j, xj) small
I Then run a competition: run your selection procedure on the 2p predictors (X, X̃) and see whether xj or x̃j comes in first.
I These are equally likely for noise predictors; xj should win if it's signal!
18 / 36
Knockoff for Lasso
Barber and Candes (2014)
I Statistics for each pair of original and knockoff variables: Wj = Zj − Z̃j, j = 1, 2, . . . , p, where Zj = sup{λ : β̂j(λ) ≠ 0} and similarly for Z̃j.
I Data-dependent threshold:
   T = min{ t ∈ W : (1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1) ≤ q },
   where W = {|Wj| : j = 1, . . . , p} \ {0}.
I Select features with Wj ≥ T to control the FDR at level q. (A small sketch of this threshold computation follows.)
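A small sketch of the threshold computation from the entry values Z (originals) and Z̃ (knockoffs); the function name is illustrative. It returns the W statistics, the threshold, and the selected set.

```r
# Knockoff threshold: smallest t with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q.
knockoff_select <- function(Z, Ztilde, q = 0.10) {
  W  <- Z - Ztilde
  ts <- sort(setdiff(abs(W), 0))                     # candidate thresholds t in W
  ok <- vapply(ts, function(t) (1 + sum(W <= -t)) / max(sum(W >= t), 1) <= q, logical(1))
  thresh <- if (any(ok)) min(ts[ok]) else Inf        # no feasible t: select nothing
  list(W = W, threshold = thresh, selected = which(W >= thresh))
}
```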
19 / 36
Knockoff
Barber and Candes (2014)
Lemma. Let ε ∈ {±1}^p be a sign sequence independent of W = (W1, W2, . . . , Wp), with εj = +1 for all non-null variables and εj i.i.d. ∼ {±1} for null variables. Then
(W1, W2, . . . , Wp) =_d (ε1 · W1, ε2 · W2, . . . , εp · Wp)
20 / 36
Knockoff method for protolasso
1. Cluster the columns of X: {C1, C2, . . . , CK}.
2. Split the data by rows into two (roughly) equal parts:
   y = (y(1), y(2)) and X = (X(1), X(2)).
3. Extract prototypes P̂ = {P̂1, P̂2, . . . , P̂K} using only y(1) and X(1).
4. Form the knockoff matrix X̃(2) from X(2).
5. Reduce to the matrices X(2)_P̂ and X̃(2)_P̂.
6. Proceed with knockoff screening using y(2) and [X(2)_P̂ X̃(2)_P̂]. (A structural sketch follows.)
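A structural sketch of the split-and-screen recipe. The knockoff construction and screening on the second half are delegated to a hypothetical helper, knockoff_screen, standing in for the Barber and Candes filter; only the data split and prototype extraction are spelled out.

```r
# Steps 1-3 and 6 in code form; `knockoff_screen` is a hypothetical placeholder
# for knockoff construction + screening on the held-out half.
protolasso_knockoff <- function(x, y, K, q = 0.10, knockoff_screen) {
  n  <- nrow(x)
  i1 <- sample(n, floor(n / 2))                               # rows for prototype extraction
  i2 <- setdiff(seq_len(n), i1)                               # rows for screening
  clusters <- cutree(hclust(dist(t(scale(x)))), k = K)        # step 1: cluster the columns
  protos <- sapply(split(seq_len(ncol(x)), clusters), function(idx) {
    idx[which.max(abs(cor(x[i1, idx, drop = FALSE], y[i1])))] # step 3: uses (X(1), y(1)) only
  })
  knockoff_screen(x[i2, protos, drop = FALSE], y[i2], q = q)  # steps 4-6 on (X(2), y(2))
}
```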
21 / 36
Knockoffs for protolasso
[Figure: schematic of the row split, with prototype columns (marked x) chosen from (X(1), y(1)), the knockoff matrix X̃(2) built from X(2), and screening run with y(2).]
22 / 36
FDR control
We show that Barber and Candes' lemmas are satisfied, and hence this procedure achieves FDR control.
23 / 36
Knockoffs for protolasso
I W_k^(2) = Z_k^(2) − Z̃_k^(2).
[Figure: simulation results showing average FDP (left column) and average power (right column) versus signal amplitude (1 to 9), for three simulation settings.]
24 / 36
Summary
I High correlation amongst variables.
I Cluster and prototype.
I Then test/regress, using the post-selection framework.
I Extensions to knockoff and non-Gaussian models.
I Extension to marginal testing (Prototest)
25 / 36
Assessment of internally derived predictors
I In biomarker studies, an important problem is to compare a predictor of disease outcome derived from gene expression levels to standard clinical predictors.
I Comparing them on the same dataset that was used to derive the biomarker predictor can lead to results strongly biased in favor of the biomarker predictor.
I Efron and Tibshirani (2002) proposed a method called "pre-validation" for making a fairer comparison between the two sets of predictors. Doesn't exactly work.
I Can do better via selective inference
26 / 36
[Figure: schematics labeled "Naive" and "Pre-validation", showing how X, y, Z and η enter each approach.]
27 / 36
An example
An example of this problem arose in the paper of van 't Veer et al. (2002). Their microarray data has 4918 genes measured over 78 cases, taken from a study of breast cancer. There are 44 cases in the good prognosis group and 34 in the poor prognosis group. The microarray predictor was constructed as follows:
1. 70 genes were selected, having the largest absolute correlation with the 78 class labels.
2. Using these 70 genes, a nearest centroid classifier was constructed.
3. Applying the classifier to the 78 microarrays gave a dichotomous predictor ηj for each case j.
4. η was then compared with clinical predictors for predicting good or bad prognosis (logistic regression).
28 / 36
Pre-validation
1. Divide the cases up into K = 13 equal-sized parts of 6 cases each.
2. Set aside one of the parts. Using only the data from the other 12 parts, select the genes having absolute correlation at least 0.3 with the class labels, and form a nearest centroid classification rule.
3. Use the rule to predict the class labels for the 13th part.
4. Do steps 2 and 3 for each of the 13 parts, yielding a "pre-validated" microarray predictor zj for each of the 78 cases.
5. Fit a logistic regression model to the pre-validated microarray predictor and the 6 clinical predictors.
A minimal sketch of steps 1-4 follows.
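A minimal sketch of steps 1-4, with the class labels y coded 0/1 and the gene matrix in x; the fold assignment, the 0.3 cutoff and the squared-distance centroid rule follow the description above, while the function name is illustrative.

```r
# Pre-validation: build the microarray predictor for each case using only the other folds.
prevalidate <- function(x, y, K = 13, cutoff = 0.3) {
  folds <- sample(rep(seq_len(K), length.out = nrow(x)))      # step 1: K equal-sized parts
  z <- numeric(nrow(x))
  for (k in seq_len(K)) {
    held  <- folds == k                                       # step 2: set one part aside
    genes <- which(abs(cor(x[!held, , drop = FALSE], y[!held])) >= cutoff)
    cent0 <- colMeans(x[!held & y == 0, genes, drop = FALSE])
    cent1 <- colMeans(x[!held & y == 1, genes, drop = FALSE])
    d0 <- rowSums(sweep(x[held, genes, drop = FALSE], 2, cent0)^2)
    d1 <- rowSums(sweep(x[held, genes, drop = FALSE], 2, cent1)^2)
    z[held] <- as.numeric(d1 < d0)                            # step 3: nearest centroid call
  }
  z                                                           # step 4: pre-validated predictor
}
```

Step 5 would then be an ordinary logistic regression of the outcome on z together with the clinical predictors, e.g. glm(y ~ z + clinical, family = binomial).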
29 / 36
P-values on Breast cancer data
Naive test: 1.1e-06 ;
Prevalidated test: 0.0378
30 / 36
Pre-validation doesn’t exactly work
Type I error is somewhat close to the nominal level, but anti-conservative.
31 / 36
Selective inference to the rescue!
I Response y, external predictors Z = (Z1, Z2, . . . , Zm), biomarkers X = (X1, X2, . . . , Xp).
I We model y ∼ X to give η, then we run the lasso y ∼ (Z, η).
I If the first-stage model is simple, like
   "choose η = Xj most correlated with y",
then we can simply add constraints to the matrix A in the selection rule Ay ≤ b.
32 / 36
More generally
I Fit y ∼ X by least squares or the lasso to yield a prediction η.
I Then fit y ∼ (Z, η). We define the test for the significance of η via
   T = (RSS0 − RSS)/RSS ∼ P
I If the first-stage model is least squares on p predictors, then P = F_{c,d} with c = p, d = n − (m + p).
I If the first-stage model is the lasso, then P is a union of truncated F distributions. (A rough sketch of the unadjusted computation follows.)
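A rough sketch of the unadjusted computation for the least-squares first stage: it forms T = (RSS0 − RSS)/RSS and compares it to the F_{c,d} reference with c = p, d = n − (m + p) as stated above, without the selective truncation; the function name is illustrative.

```r
# Naive version (no selection adjustment): F reference with c = p, d = n - (m + p).
test_eta_naive <- function(y, X, Z) {
  n <- length(y); p <- ncol(X); m <- ncol(Z)
  eta  <- fitted(lm(y ~ X - 1))            # first stage: internally derived predictor
  fit0 <- lm(y ~ Z)                        # clinical predictors only
  fit1 <- lm(y ~ Z + eta)                  # clinical predictors plus eta
  RSS0 <- sum(resid(fit0)^2)
  RSS  <- sum(resid(fit1)^2)
  Tstat <- (RSS0 - RSS) / RSS
  list(T = Tstat,
       pval = pf(Tstat, df1 = p, df2 = n - (m + p), lower.tail = FALSE))
}
```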
33 / 36
Simulation
n = 50, nx = 200, nz = 20; the coefficients of X are zero; λ set to 1.5
[Figure: QQ plots of observed versus expected null p-values, for "Null Simulation, Prevalidation" (left) and "Null Simulation, Selective Inference" (right).]
34 / 36
P-values on Breast cancer data
Naive test: 1.1e-06 ;
Prevalidated test: 0.0378
Selective test: 0.5028
35 / 36
Conclusions
I Selective inference is an exciting new area. Lots of potential research problems and generalizations (grad students take note)!!
I Practitioners: look for the selectiveInference R package
I Coming soon: Deep Selective Inference
36 / 36