Validation methods - PyData Israel
![Page 1: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/1.jpg)
Validation methods
Nathaniel Shimoni
PyData Israel, 16/2/2017
![Page 2: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/2.jpg)
Validation techniques
Basics
Model selection
Early stopping
Train-test split
Kfold
Leave one out (LOO)
Leave P out
Group Kfold
Leave one group out
Time series
Sliding window
Anchored sliding window
Time based group Kfold
Unbalanced data
Stratified methods
Why? Grouped data
[Figure: timeline of months Jan to Dec with a "train" span; labels 0 and 1; data split into Fold 1, Fold 2 and Fold 3]
Hyper-parameter tuning
𝝁, 𝜽, 𝜸, 𝜷, 𝜶, 𝜹
![Page 3: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/3.jpg)
![Page 4: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/4.jpg)
We use validation to balance two things:
1. Fit our training data as well as we can (while…)
2. Generalize well, to get the best performance on unseen data (i.e. avoid over-fitting)
We use validation to:
• Select the best model
• Select the best hyper-parameters
• Stop the training process early
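The idea of selecting hyper-parameters on data the model was not fitted on can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the synthetic dataset, the `LogisticRegression` model, the candidate values of `C` and the helper name `pick_best_c` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def pick_best_c(candidates=(0.01, 0.1, 1.0, 10.0), random_state=0):
    """Fit one model per candidate C and score each on a held-out validation set."""
    # Synthetic data stands in for a real training set (assumption).
    X, y = make_classification(n_samples=400, n_features=10, random_state=random_state)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    scores = {}
    for c in candidates:
        model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        # Accuracy on samples the model has not seen during fitting.
        scores[c] = model.score(X_val, y_val)
    # Keep the candidate with the highest validation accuracy.
    best_c = max(scores, key=scores.get)
    return best_c, scores


if __name__ == "__main__":
    best_c, scores = pick_best_c()
    print("validation accuracy per C:", scores)
    print("selected C:", best_c)
```

The same loop works for model selection: iterate over different estimators instead of different values of `C`.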
![Page 5: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/5.jpg)
![Page 6: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/6.jpg)
Train-test split
The most basic validation technique
It is based on a hold-out sample
We split the data randomly, train on one part, and test our performance on the held-out part the model has not seen
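A hold-out split of this kind might look like the sketch below, using scikit-learn's `train_test_split`. The synthetic data, the 25% test fraction, the model and the helper name `hold_out_accuracy` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def hold_out_accuracy(test_size=0.25, random_state=0):
    """Random hold-out split: fit on the train part, score on the unseen part."""
    # 200 synthetic samples stand in for a real dataset (assumption).
    X, y = make_classification(n_samples=200, n_features=8, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return len(X_train), len(X_test), model.score(X_test, y_test)


if __name__ == "__main__":
    n_train, n_test, accuracy = hold_out_accuracy()
    print(f"train={n_train} test={n_test} accuracy={accuracy:.3f}")
```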
![Page 7: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/7.jpg)
The train-test split validation method is very common. Its main benefits:
• Computational efficiency
• Simplicity
But it has two disadvantages:
• We're losing a large amount of data
• Might suffer from skew / bias
Cross Validation
[Figure: data split into Fold 1, Fold 2 and Fold 3]
Stratification
Select folds in a way that keeps an equal proportion of the target variable in each fold
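To make the skew problem and the effect of stratification concrete, here is a small sketch comparing scikit-learn's `KFold` and `StratifiedKFold` on an unbalanced, sorted target. The toy target (90 zeros followed by 30 ones) and the helper name `positive_rate_per_fold` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold


def positive_rate_per_fold(y, splitter):
    """Return the share of positive labels in each test fold."""
    X = np.zeros((len(y), 1))  # features are irrelevant for the split itself
    return [float(np.mean(y[test_idx])) for _, test_idx in splitter.split(X, y)]


# Unbalanced target, sorted by class: 25% positives overall (assumption).
y = np.array([0] * 90 + [1] * 30)

# Plain K-fold without shuffling cuts the sorted data into contiguous blocks,
# so the class balance differs strongly between folds.
plain = positive_rate_per_fold(y, KFold(n_splits=3))

# Stratified K-fold keeps the 25% positive rate in every fold.
stratified = positive_rate_per_fold(y, StratifiedKFold(n_splits=3))

if __name__ == "__main__":
    print("KFold:          ", plain)
    print("StratifiedKFold:", stratified)
```

Shuffling (`KFold(n_splits=3, shuffle=True)`) reduces the skew on average, but only stratification keeps the proportion fixed per fold.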
![Page 8: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/8.jpg)
![Page 9: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/9.jpg)
What if the samples are drawn from different groups?
Can we retrain for new groups?
Yes: use Kfold / stratified Kfold
No: use group-based folding methods
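A group-based folding method can be sketched with scikit-learn's `GroupKFold`, which never places samples from the same group in both the train and the test part of a fold. The group labels (hypothetical patient ids) and the helper name `group_folds` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def group_folds(groups, n_splits=3):
    """Return (train_groups, test_groups) for each fold of a GroupKFold split."""
    groups = np.asarray(groups)
    X = np.zeros((len(groups), 1))  # features are irrelevant for the split itself
    folds = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, groups=groups):
        folds.append(
            (set(groups[train_idx].tolist()), set(groups[test_idx].tolist()))
        )
    return folds


# Two samples per hypothetical patient id.
patients = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"]

if __name__ == "__main__":
    for train_groups, test_groups in group_folds(patients):
        print("train:", sorted(train_groups), "test:", sorted(test_groups))
```

Because every test fold contains only unseen groups, the score estimates performance on new groups rather than on new samples from known groups.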
![Page 10: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/10.jpg)
![Page 11: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/11.jpg)
![Page 12: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/12.jpg)
Time series data?
Are we predicting a specific time frame?
Are we predicting future events?
• Use time based folds / split
• Use sliding window
• Use anchored sliding window
• Random split is more like an imputation problem
[Figure: timeline of months Jan to Dec split into a train period and a validation period]
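The two window schemes can be sketched with scikit-learn's `TimeSeriesSplit`. The reading used here is an assumption: "anchored" means the training window always starts at the first month and grows, while "sliding" means a fixed-size training window that moves forward (`max_train_size`). The monthly index and the helper name `time_folds` are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def time_folds(n_splits=3, max_train_size=None):
    """Return (train_months, validation_months) per fold, in time order."""
    X = np.zeros((len(MONTHS), 1))  # one row per month; values are irrelevant here
    splitter = TimeSeriesSplit(n_splits=n_splits, max_train_size=max_train_size)
    folds = []
    for train_idx, val_idx in splitter.split(X):
        folds.append(
            ([MONTHS[i] for i in train_idx], [MONTHS[i] for i in val_idx])
        )
    return folds


# Anchored window: training always starts in Jan and grows with each fold.
anchored = time_folds()

# Sliding window: training is limited to the 3 months before validation.
sliding = time_folds(max_train_size=3)

if __name__ == "__main__":
    for name, folds in (("anchored", anchored), ("sliding", sliding)):
        for train_months, val_months in folds:
            print(name, "train:", train_months, "validation:", val_months)
```

In both schemes the validation months always come after the training months, so the model is never evaluated on data older than what it was trained on.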
![Page 13: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/13.jpg)
![Page 14: Validation methods - PyData Israel](https://reader036.vdocuments.site/reader036/viewer/2022062523/58f260491a28abc9488b4595/html5/thumbnails/14.jpg)
Thank you!
LinkedIn / GitHub