TRANSCRIPT
Optimal model selection with V-fold cross-validation: how should V be chosen?
Sylvain Arlot (joint work with Matthieu Lerasle)
CNRS · École Normale Supérieure (Paris), DI/ENS, Équipe Sierra
High Dimensional Probability VII, Cargèse, May 29th, 2014
Density estimation: data ξ_1, …, ξ_n

[Figure: a sample of data points in [0, 1]]
Goal: estimate the common density s⋆ of the ξ_i

[Figure: the data together with the target density s⋆ on [0, 1]]
Problem: density estimation

Data D_n: ξ_1, …, ξ_n ∈ Ξ (i.i.d. ∼ P, with density s⋆ w.r.t. µ)

Least-squares loss: γ(t, ξ) = ‖t‖²_{L²(µ)} − 2 t(ξ)

Goal: learn t ∈ S = {measurable functions Ξ → ℝ} such that E_{ξ∼P}[γ(t; ξ)] =: Pγ(t) is minimal.

Pγ(t) = ∫ t² dµ − 2 ∫ t s⋆ dµ = ∫ (t − s⋆)² dµ − ‖s⋆‖²_{L²(µ)}

⇒ the true density satisfies s⋆ ∈ argmin_{t ∈ S} Pγ(t), and the excess risk is

ℓ(s⋆, t) := Pγ(t) − Pγ(s⋆) = ‖t − s⋆‖²_{L²(µ)}
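As a concrete illustration of this loss (not part of the talk; the [0, 1] sample space and all names are illustrative), here is a minimal sketch evaluating the empirical least-squares risk P_nγ(t) = ‖t‖²_{L²(µ)} − (2/n) Σ_i t(ξ_i) for a piecewise-constant candidate t; it is reused in the sketches below.

```python
import numpy as np

def ls_risk_histogram(heights, edges, xi):
    """Empirical least-squares risk P_n γ(t) = ||t||²_{L²(µ)} − (2/n) Σ_i t(ξ_i)
    for a piecewise-constant t with the given bin heights and edges on [0, 1]."""
    widths = np.diff(edges)
    sq_norm = np.sum(heights ** 2 * widths)  # ||t||²_{L²(µ)}
    # t(ξ_i): the height of the bin containing each observation
    idx = np.clip(np.searchsorted(edges, xi, side="right") - 1, 0, len(heights) - 1)
    return sq_norm - 2.0 * np.mean(heights[idx])
```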
Estimators: example: histogram

[Figure: a histogram estimator of s⋆ on [0, 1]]
Estimator selection: regular histograms

[Figure: four regular-histogram estimators of s⋆ on [0, 1], with different numbers of bins]
Estimator selection: Parzen, Gaussian kernel

[Figure: four Gaussian-kernel (Parzen) estimators of s⋆ on [0, 1], with different bandwidths]
Estimator selection

Estimator/learning algorithm: ŝ : D_n ↦ ŝ(D_n) ∈ S.
Example: the least-squares estimator on a model S_m ⊂ S:

ŝ_m ∈ argmin_{t ∈ S_m} { P_n γ(t) }   where   P_n γ(t) := (1/n) Σ_{ξ ∈ D_n} γ(t; ξ)

Example of model: histograms.

Estimator collection (ŝ_m)_{m ∈ M} ⇒ how to choose m̂ = m̂(D_n)?
E.g., a family of models (S_m)_{m ∈ M} ⇒ a family of least-squares estimators.

Goal: minimize the risk, i.e., prove an oracle inequality (in expectation or with large probability):

ℓ(s⋆, ŝ_m̂) ≤ C inf_{m ∈ M} { ℓ(s⋆, ŝ_m) } + R_n
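For a regular-histogram model S_m, minimizing P_nγ(t) over S_m has a closed form: on each bin, the optimal height is the empirical frequency divided by the bin width. A minimal sketch, continuing the snippet above and still assuming data in [0, 1]:

```python
def ls_histogram_estimator(xi, n_bins):
    """Least-squares estimator ŝ_m on the regular-histogram model with n_bins bins:
    bin heights are empirical frequencies divided by the bin width."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(xi, bins=edges)
    heights = counts / (len(xi) * np.diff(edges))
    return heights, edges
```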
Bias-variance trade-off

E[ℓ(s⋆, ŝ_m)] = Bias + Variance

Bias (approximation error): ℓ(s⋆, s⋆_m) = inf_{t ∈ S_m} ℓ(s⋆, t)

Variance (estimation error), for regular histograms on ℝ with bin size d_m^{−1}:

(d_m − ‖s⋆_m‖²_{L²(µ)}) / n ≈ d_m / n

Bias-variance trade-off ⇔ avoid both overfitting and underfitting.
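The trade-off can be seen numerically by approximating the excess risk ℓ(s⋆, ŝ_m) = ‖ŝ_m − s⋆‖²_{L²(µ)} by quadrature and averaging it over replicates. A sketch in the same illustrative setup, with a hypothetical density s⋆(x) = 1.5 − x (not from the talk):

```python
def excess_risk(heights, edges, s_star, n_grid=10_000):
    """Midpoint-rule approximation of ℓ(s⋆, t) = ∫ (t − s⋆)² dµ on [0, 1]."""
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False) + 0.5 / n_grid
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(heights) - 1)
    return np.mean((heights[idx] - s_star(x)) ** 2)

def sample_s_star(rng, n):
    """Draw n points from the hypothetical density s⋆(x) = 1.5 − x on [0, 1]
    by inverting its c.d.f. F(x) = 1.5x − x²/2."""
    return 1.5 - np.sqrt(2.25 - 2.0 * rng.uniform(size=n))

s_star = lambda x: 1.5 - x
rng = np.random.default_rng(0)
for d_m in (2, 8, 32, 128):  # small d_m: bias dominates; large d_m: variance ≈ d_m/n
    risks = [excess_risk(*ls_histogram_estimator(sample_s_star(rng, 500), d_m), s_star)
             for _ in range(100)]
    print(d_m, round(np.mean(risks), 4))
```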
Validation principle

[Figure sequence: the same data set shown split into a learning sample, used to fit each estimator, and a validation sample, used to evaluate it]
V-fold cross-validation

Hold-out: split the data into a learning sample ξ_1, …, ξ_q and a validation sample ξ_{q+1}, …, ξ_n:

ŝ_m^{(ℓ)} ∈ argmin_{t ∈ S_m} { Σ_{i=1}^{q} γ(t, ξ_i) }

P_n^{(v)} := (1/(n − q)) Σ_{i=q+1}^{n} δ_{ξ_i}   ⇒   hold-out criterion P_n^{(v)} γ(ŝ_m^{(ℓ)})

V-fold cross-validation: B = (B_j)_{1 ≤ j ≤ V} a partition of {1, …, n}; with P_n^{(j)} the empirical distribution of (ξ_i)_{i ∈ B_j} and ŝ_m^{(−j)} trained on the remaining data,

R_VF(ŝ_m; D_n; B) = (1/V) Σ_{j=1}^{V} P_n^{(j)} γ(ŝ_m^{(−j)})   and   m̂ ∈ argmin_{m ∈ M} { R_VF(ŝ_m) }
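A minimal sketch of this criterion for the histogram example, reusing the helpers above (the random draw of the partition B is illustrative; the slides assume blocks of equal size n/V):

```python
def vfold_criterion(xi, n_bins, V, rng=None):
    """V-fold CV criterion R_VF(ŝ_m; D_n; B): average over the folds B_j of the
    hold-out risk P_n^(j) γ(ŝ_m^(−j)), with ŝ_m^(−j) trained without block B_j."""
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(xi)), V)  # B_1, ..., B_V
    crit = 0.0
    for j in range(V):
        train = xi[np.concatenate([folds[k] for k in range(V) if k != j])]
        heights, edges = ls_histogram_estimator(train, n_bins)
        crit += ls_risk_histogram(heights, edges, xi[folds[j]])
    return crit / V
```

Model selection then takes m̂ as the minimizer over candidate bin counts, e.g. `min(candidates, key=lambda d: vfold_criterion(xi, d, V))`.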
Bias of cross-validation

In this talk, we always assume Card(B_j) = n/V for all j.

Ideal criterion: Pγ(ŝ_m).

General analysis of the bias: if

E[Pγ(ŝ_m(D_n))] = α(m) + β(m)/n ,

then, since ŝ_m^{(−j)} is built from n(V − 1)/V observations,

E[R_VF(ŝ_m; D_n; B)] = E[P_n^{(j)} γ(ŝ_m^{(−j)})] = E[Pγ(ŝ_m^{(−j)})] = α(m) + (V/(V − 1)) · β(m)/n

⇒ the bias decreases with V and vanishes as V → ∞.
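The V/(V − 1) factor can be checked numerically: average the V-fold criterion over many replicates and compare it with the averaged true risk Pγ(ŝ_m), approximated by quadrature. A sketch in the illustrative setup above:

```python
def true_risk(heights, edges, s_star, n_grid=10_000):
    """Midpoint-rule approximation of Pγ(t) = ∫ t² dµ − 2 ∫ t s⋆ dµ on [0, 1]."""
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False) + 0.5 / n_grid
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(heights) - 1)
    t = heights[idx]
    return np.mean(t ** 2) - 2.0 * np.mean(t * s_star(x))

rng = np.random.default_rng(1)
for V in (2, 5, 10):  # the gap E[R_VF] − E[Pγ(ŝ_m)] = β(m)/((V−1)n) shrinks with V
    gap = np.mean([vfold_criterion(sample_s_star(rng, 200), 20, V, rng)
                   - true_risk(*ls_histogram_estimator(sample_s_star(rng, 200), 20),
                               s_star)
                   for _ in range(200)])
    print(V, round(gap, 4))
```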
Bias and model selection

m̂ ∈ argmin_{m ∈ M} { R_VF(ŝ_m) }   vs.   m⋆ ∈ argmin_{m ∈ M} { Pγ(ŝ_m(D_n)) }

Perfect ranking among (ŝ_m)_{m ∈ M} ⇔ for all m, m′ ∈ M,

sign( R_VF(ŝ_m) − R_VF(ŝ_{m′}) ) = sign( Pγ(ŝ_m) − Pγ(ŝ_{m′}) )

Key quantities:

E[Pγ(ŝ_m) − Pγ(ŝ_{m′})] = α(m) − α(m′) + (β(m) − β(m′))/n

E[R_VF(ŝ_m) − R_VF(ŝ_{m′})] = α(m) − α(m′) + (V/(V − 1)) · (β(m) − β(m′))/n

⇒ V-fold CV favours models m with smaller complexity β(m).
Suboptimality of V-fold cross-validation

Theorem
If m̂ is selected by V-fold cross-validation, with V fixed as n grows, then, with probability at least 1 − ♦n^{−2},

ℓ(s⋆, ŝ_m̂) ≥ (1 + κ(V)) inf_{m ∈ M} { ℓ(s⋆, ŝ_m) }

with κ(V) > 0.

Proved in least-squares regression for a specific model selection problem (A. 2008, arXiv:0802.0566).
Also valid in least-squares density estimation for a specific model selection problem.
Bias-corrected VFCV / V-fold penalization

Bias-corrected V-fold CV (Burman, 1989):

R_VF,corr(ŝ_m; D_n; B) := R_VF(ŝ_m; D_n; B) + P_n γ(ŝ_m) − (1/V) Σ_{j=1}^{V} P_n γ(ŝ_m^{(−j)})

Resampling heuristics (Efron, 1983) + V-fold subsampling and the penalization principle ⇒ V-fold penalty (A. 2008):

pen_VF(ŝ_m; D_n; B) := ((V − 1)/V) Σ_{j=1}^{V} ( P_n − P_n^{(−j)} ) γ(ŝ_m^{(−j)})

R_VF,corr(ŝ_m; D_n; B) = P_n γ(ŝ_m(D_n)) + pen_VF(ŝ_m; D_n; B)

Least-squares density estimation:

R_VF(ŝ_m; D_n; B) = P_n γ(ŝ_m(D_n)) + ((V − 1/2)/(V − 1)) pen_VF(ŝ_m; D_n; B)
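A sketch of the V-fold penalty in the same illustrative setup. For the least-squares loss, (P_n − P_n^{(−j)})γ(t) = −2(P_n t − P_n^{(−j)} t), since the ‖t‖² term cancels; in practice the partition B should be the same one used for the criterion (here it is redrawn, for illustration only):

```python
def vfold_penalty(xi, n_bins, V, rng=None):
    """V-fold penalty pen_VF(ŝ_m; D_n; B) = ((V−1)/V) Σ_j (P_n − P_n^(−j)) γ(ŝ_m^(−j)).
    The ||t||² part of γ cancels, leaving −2 (P_n t − P_n^(−j) t) for each fold."""
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(xi)), V)
    pen = 0.0
    for j in range(V):
        train = xi[np.concatenate([folds[k] for k in range(V) if k != j])]
        heights, edges = ls_histogram_estimator(train, n_bins)
        t = lambda pts: heights[np.clip(
            np.searchsorted(edges, pts, side="right") - 1, 0, n_bins - 1)]
        pen += -2.0 * (np.mean(t(xi)) - np.mean(t(train)))
    return (V - 1) / V * pen
```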
Oracle inequality with V-fold penalization

Histograms on ℝ + technical assumptions: ‖s⋆‖_∞ < ∞ and, for every λ ∈ m ∈ M, µ(λ) ≥ 1/n.

Theorem (A. & Lerasle 2012, arXiv:1210.5830)
With probability at least 1 − e^{−x}, for all ε > 0 with δ_− + ε < 1, for any

m̂(C) ∈ argmin_{m ∈ M} { P_n γ(ŝ_m(D_n)) + C pen_VF(ŝ_m; D_n; B) } ,

ℓ(s⋆, ŝ_{m̂(C)}) ≤ ((1 + δ_+ + ε)/(1 − δ_− − ε)) inf_{m ∈ M} { ℓ(s⋆, ŝ_m) } + R_{n,x}

where δ := 2(C/(V − 1) − 1) and R_{n,x} := (κ(V, s⋆, δ, ε)/(ε³ n)) [x + ln(Card(M))]².

A similar result holds for more general models.
Related results: van der Laan, Dudoit & Keleş (2004); Celisse (2012) for the leave-p-out case.
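The theorem suggests the following selection rule, sketched with the helpers above over a hypothetical grid of bin counts; note that, with the theorem's normalization, δ vanishes at C = V − 1, and larger C overpenalizes:

```python
def select_n_bins(xi, candidates, C, V, rng=0):
    """m̂(C) ∈ argmin_m { P_nγ(ŝ_m(D_n)) + C · pen_VF(ŝ_m; D_n; B) } over a
    grid of histogram bin counts (illustrative driver, not the paper's code)."""
    def crit(d):
        heights, edges = ls_histogram_estimator(xi, d)
        return ls_risk_histogram(heights, edges, xi) + C * vfold_penalty(xi, d, V, rng)
    return min(candidates, key=crit)
```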
Key elements of the proof

Ideal penalty: pen_id(m) = (P − P_n) γ(ŝ_m).

It suffices to prove that, with large probability, for all m ∈ M, |pen_VF(m) − pen_id(m)| ≤ δ(m) "small".

Concentration of a U-statistic of order two (Houdré & Reynaud-Bouret, 2003), where (ψ_λ)_{λ ∈ Λ_m} is an orthonormal basis of S_m:

pen_VF(ŝ_m; D_n; B) − pen_id(m) = (P_n − P) γ(s⋆_m)
  − (2V/((V − 1)n²)) Σ_{1 ≤ k ≠ k′ ≤ V} Σ_{i ∈ B_k, j ∈ B_{k′}} Σ_{λ ∈ Λ_m} (ψ_λ(ξ_i) − Pψ_λ)(ψ_λ(ξ_j) − Pψ_λ)

This is more precise than concentrating (P_n − P_n^{(−j)}) γ(ŝ_m^{(−j)}) for each j plus a union bound over j ∈ {1, …, V}.
Variance of the (corrected) VFCV criterion

Histogram model with constant bin size d_m^{−1} (to simplify).

Theorem (A. & Lerasle 2012, arXiv:1210.5830)

var( R_VF(ŝ_m; D_n; B) ) = ((1 + O(1))/n) var_P(s⋆_m) + (2/n²) (1 + 4/(V − 1) + O(1/V + 1/n)) A(m)

var( R_VF,corr(ŝ_m; D_n; B) ) = ((1 + O(1))/n) var_P(s⋆_m) + (2/n²) (1 + 4/(V − 1) − 1/n) A(m)

where A(m) ≈ d_m.
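These formulas can be probed by simulation: estimate var(R_VF(ŝ_m; D_n; B)) from independent replicates of D_n and watch the 4/(V − 1) term shrink as V grows. A sketch, with `sampler` a hypothetical data-generating function (e.g. sample_s_star above):

```python
def criterion_variance(n, n_bins, V, sampler, n_rep=500, seed=0):
    """Monte-Carlo estimate of var(R_VF(ŝ_m; D_n; B)) over replicates of D_n;
    sampler(rng, n) draws one data set of size n (illustrative)."""
    rng = np.random.default_rng(seed)
    crits = [vfold_criterion(sampler(rng, n), n_bins, V, rng) for _ in range(n_rep)]
    return np.var(crits, ddof=1)
```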
Variance and model selection

m̂ ∈ argmin_{m ∈ M} { C(m; D_n) } is invariant under any random translation of C ⇒ var(C(m; D_n)) by itself is meaningless.

Heuristic argument: for every m ∉ argmin_{m ∈ M} E[ℓ(s⋆, ŝ_m)], we want to minimize

P(m̂ = m) = P(∀ m′ ∈ M, C(m) − C(m′) < 0)
 ≍ min_{m′ ≠ m} P( C(m) − C(m′) < 0 )
 ≈ min_{m′ ≠ m} P( E[C(m) − C(m′)] − ξ √var(C(m) − C(m′)) < 0 )   (ξ ∼ N(0, 1))
 = Φ( − max_{m′ ≠ m} E[C(m) − C(m′)] / √var(C(m) − C(m′)) )

with Φ the standard Gaussian c.d.f.

Constant bias among (R_VF,corr)_{2 ≤ V ≤ n} ⇒ compare

var( R_VF,corr(ŝ_m; D_n; B) − R_VF,corr(ŝ_{m′}; D_n; B) )
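The last display is easy to evaluate once the first two moments of the increments C(m) − C(m′) are known, or estimated by simulation. A sketch (names illustrative):

```python
from math import erf, sqrt

def selection_prob_heuristic(mean_gaps, var_gaps):
    """Heuristic P(m̂ = m) ≈ Φ(−max_{m′} E[C(m) − C(m′)] / √var(C(m) − C(m′))),
    with Φ the N(0, 1) c.d.f.; mean_gaps[k], var_gaps[k] are the mean and
    variance of C(m) − C(m′_k) over the competing models m′_k."""
    z = max(e / sqrt(v) for e, v in zip(mean_gaps, var_gaps))
    return 0.5 * (1.0 + erf(-z / sqrt(2.0)))
```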
Empirical assessment of the heuristic

[Figure: empirical P(m is selected) against the heuristic estimate of P(m is selected), for LOO, 10-fold, 5-fold, 2-fold and E[pen_id]]
Variance and model selection

Δ(m, m′, V) := R_VF,corr(ŝ_m) − R_VF,corr(ŝ_{m′})

Theorem (A. & Lerasle 2012, arXiv:1210.5830)
If S_m ⊂ S_{m′} are two histogram models with constant bin sizes d_m^{−1}, d_{m′}^{−1}, then

var( Δ(m, m′, V) ) = 4 (1 + 2/n + 1/n²) · var_P(s⋆_m − s⋆_{m′}) / n + 2 (1 + 4/(V − 1) − 1/n) · B(m, m′) / n²

where B(m, m′) ∝ ‖s⋆_m − s⋆_{m′}‖ d_m.

The two terms are of the same order if ‖s⋆_m − s⋆_{m′}‖ ≈ d_m/n.
Similar formulas hold for general models S_m, S_{m′}.
Variance of C(m) − C(m⋆) as a function of d_m and V

[Figure: var(C(m) − C(m⋆)) against the model dimension d_m, for LOO, 10-fold, 5-fold, 2-fold and E[pen_id]]
Conclusion: how to choose V?

Bias: decreases with V / can be removed (V-fold penalization).
Variance: decreases with V / close to its minimum for V = 5 or 10.
⇒ best performance for the largest V (not necessarily valid in general in learning).
Computational complexity: O(V) in general.

Open problems:
precise concentration of the V-fold criterion in a general setting;
full proof of the link between variance and model selection performance (second-order optimality?).
Simulation setting: densities

[Figure: the two target densities on [0, 1], labelled L and S]
Simulation setting: model families

[Figure: example estimators from the two model families on [0, 1], labelled Regu and Dya2]
Simulation: E[ ℓ(s⋆, ŝ_m̂) / inf_{m ∈ M} ℓ(s⋆, ŝ_m) ]

Procedure    L–Dya2          S–Dya2
Cp           8.52 ± 0.24     3.26 ± 0.04
pen2F        10.27 ± 0.24    2.46 ± 0.03
pen5F        7.53 ± 0.19     2.21 ± 0.03
pen10F       6.76 ± 0.17     2.14 ± 0.03
penLOO       6.41 ± 0.18     2.08 ± 0.03
2FCV         6.41 ± 0.16     2.10 ± 0.02
5FCV         6.27 ± 0.16     2.09 ± 0.03
10FCV        6.25 ± 0.16     2.07 ± 0.03
LOO          6.41 ± 0.18     2.08 ± 0.03
E[pen_id]    6.62 ± 0.18     2.09 ± 0.03
Simulation: E[ ℓ(s⋆, ŝ_m̂) / inf_{m ∈ M} ℓ(s⋆, ŝ_m) ]

Procedure         L–Dya2         S–Dya2
C⋆ × Cp           4.38 ± 0.09    3.01 ± 0.04
C⋆ × pen2F        5.12 ± 0.12    2.10 ± 0.02
C⋆ × pen5F        3.80 ± 0.07    1.95 ± 0.02
C⋆ × pen10F       3.66 ± 0.06    1.91 ± 0.02
C⋆ × penLOO       3.61 ± 0.06    1.91 ± 0.02
2FCV              6.41 ± 0.16    2.10 ± 0.02
5FCV              6.27 ± 0.16    2.09 ± 0.03
10FCV             6.25 ± 0.16    2.07 ± 0.03
LOO               6.41 ± 0.18    2.08 ± 0.03
C⋆ × E[pen_id]    3.66 ± 0.06    1.93 ± 0.02

where C⋆ = 2 for L–Dya2 and C⋆ = 1.5 for S–Dya2.
Φ( − max_{m′ ≠ m} E[C(m) − C(m′)] / √var(C(m) − C(m′)) )

[Figure: this heuristic quantity against the model dimension, for LOO, 10-fold, 5-fold, 2-fold and E[pen_id]]
Probability of selection of every m

[Figure: P(m is selected) against the model dimension, for LOO, 10-fold, 5-fold, 2-fold and E[pen_id]]
Probability of selection of every m (zoom)

[Figure: same plot, zoomed in on dimensions 0 to 40]