
Page 1

Benchmarking missing-values approaches for predictive models on health databases. CIMD presentation

Alexandre Perez-Lebel 1,2, Gaël Varoquaux 1,2, Marine Le Morvan 2, Julie Josse 2, Jean-Baptiste Poline 1

1 McGill University, Canada
2 Inria, France

October 11, 2021

1 / 35

Page 2

Content

1. Overview
2. Introduction
3. Benchmark
  • Methods benchmarked
  • Datasets
  • Protocol
4. Results
  • Prediction performance
  • Computational time
  • Significance
5. Discussion
  • Interpretation
  • Limitations
  • Conclusion

2 / 35

Page 3

Overview

• Benchmark on real-world data.

• 4 health databases with missing values.

• Arbitrarily defined prediction tasks.

• Methods to handle missing values: Missing Incorporated in Attribute (MIA) vs imputation.

• Evaluate prediction score and computational time.

3 / 35

Page 4

Overview

Results:

• MIA performed better at little cost, but not always significantly.

• Conditional imputation is on a par with constant imputation.

• Complex imputation can be intractable at large scale.

4 / 35

Page 5

Introduction

5 / 35

Page 6

Introduction: scope of the study

Focus on supervised learning with missing values. Different tradeoffs: risk minimization instead of parameter estimation.

In supervised learning, most statistical models and machine learning algorithms are not designed for incomplete data.

How to deal with missing values in this framework?

• Delete samples having missing values → to avoid.

• Use imputation:
  • Constant imputation (mean, median).
  • Conditional imputation (KNN, MICE).

• Adapt or create predictive models to handle missing values natively:
  • Boosted trees with the Missing Incorporated in Attribute (MIA) adaptation [Twala et al., 2008].
  • NeuMiss networks in the regression setting [Le Morvan et al., 2020].

6 / 35

Page 7

Introduction: problem

• How does MIA experimentally compare to imputation?

• Constant imputation vs conditional imputation.

7 / 35

Page 8

Imputation

Replace missing values with plausible values.

• Constant imputation: mean or median.

• Conditional imputation: X_mis ← E[X_mis | X_obs]. MICE [Buuren and Groothuis-Oudshoorn, 2010] or KNN.

Add a binary mask to keep track of imputed values.
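As a concrete illustration (a minimal scikit-learn sketch on toy data, not the benchmark's own code), the two imputation families plus the binary mask:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

# Constant imputation: each missing entry gets its column mean;
# add_indicator appends the binary mask as extra columns.
X_mean = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: estimate X_mis from X_obs.
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)        # KNN
X_iter = IterativeImputer(max_iter=10).fit_transform(X)   # MICE-like
```

With `add_indicator=True`, `X_mean` has two extra 0/1 columns flagging which entries were imputed, which is what the "+mask" variants below feed to the learner.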

8 / 35

Page 9

Missing Incorporated in Attribute (MIA)

Adaptation of boosted-trees to account for missing values.

Idea: for each split on a variable, all samples with a missing value in this variable are either sent to the left or to the right child node, depending on which option leads to the lowest risk.

9 / 35

Page 10

Methods benchmarked

Table: Methods compared in the main experiment.

In-article name   Imputer   Mask   Predictive model
MIA               –         –      Gradient-boosted trees
Mean              Mean      No     Gradient-boosted trees
Mean+mask         Mean      Yes    Gradient-boosted trees
Median            Median    No     Gradient-boosted trees
Median+mask       Median    Yes    Gradient-boosted trees
Iterative         MICE      No     Gradient-boosted trees
Iterative+mask    MICE      Yes    Gradient-boosted trees
KNN               KNN       No     Gradient-boosted trees
KNN+mask          KNN       Yes    Gradient-boosted trees

10 / 35

Page 11

Datasets

• Traumabase [The Traumabase Group, ], 20 000 samples.

• UK BioBank [Sudlow et al., ], 500 000 samples.

• MIMIC-III [Johnson et al., ], 60 000 samples.

• NHIS [National Center for Health Statistics, 2017], 88 000 samples.

Defined 13 prediction tasks (10 classification, 3 regression). Outcomes chosen arbitrarily.

Feature selection:

• ANOVA (11 tasks)

• Expert knowledge (2 tasks).
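The ANOVA screening step can be sketched with scikit-learn's SelectKBest (synthetic data for illustration; the benchmark keeps 100 features, and fits the selector on a held-out third of the samples):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# ANOVA screening: keep the k features with the largest F-statistic
# between feature and outcome (k = 5 here; the benchmark keeps 100).
kept = SelectKBest(f_classif, k=5).fit(X, y).get_support(indices=True)
```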

11 / 35

Page 12

Figure: Types of features before encoding (categorical, ordinal, numerical) and number of features, for each task of Traumabase, UKBB, MIMIC and NHIS.

12 / 35

Page 13

Figure: Missing values distribution: proportion of missing values (%) per feature, for each task, grouped by database (Traumabase, UKBB, MIMIC, NHIS).

13 / 35

Page 14

Table: Correlation between features.

                                              Threshold
Database     Task                 # features   0.1    0.2    0.3
Traumabase   death screening      92           68%    41%    22%
Traumabase   hemo                 12           50%    23%    12%
Traumabase   hemo screening       76           65%    36%    20%
Traumabase   platelet screening   90           67%    40%    22%
Traumabase   septic screening     76           68%    37%    18%
UKBB         breast 25            11           40%    20%    19%
UKBB         breast screening     100          26%    12%    8%
UKBB         fluid screening      100          21%    10%    6%
UKBB         parkinson screening  100          28%    16%    11%
UKBB         skin screening       100          24%    11%    8%
MIMIC        hemo screening       100          22%    6%     3%
MIMIC        septic screening     100          21%    6%     2%
NHIS         income screening     78           15%    6%     4%
Average                           79           40%    20%    12%

14 / 35

Page 15

Experimental protocol

• 9 methods, 13 prediction tasks.

• One-hot encode categorical features.

• Feature selection trained on 1/3 of the samples: 5 trials, 100 features.

• Sub-sampled the tasks: 2 500, 10 000, 25 000 and 100 000 samples.

• Cross-validation.

• Tuned the hyper-parameters of the predictive model.

• Evaluate prediction with accuracy or R².

15 / 35

Page 16

Results - Prediction performance

Figure: Prediction performance. Distributions of relative prediction scores of the 9 methods (MIA; constant imputation: Mean, Mean+mask, Median, Median+mask; conditional imputation: Iterative, Iterative+mask, KNN, KNN+mask), for n = 2500, 10000, 25000 and 100000, colored by database (Traumabase, UKBB, MIMIC, NHIS).

Mean ranks: MIA 1.9, Mean+mask 2.8, Median+mask 3.1, Iterative+mask 3.7, Mean 5.0, Iterative 6.4, KNN+mask 6.4, Median 6.7, KNN 8.8.

For a specific task: relative prediction score = prediction score − average prediction score of the 9 methods.

Iterative = MICE

16 / 35

Page 17

Results - Computational time

Figure: Computational time. Distributions of relative total training times of the 9 methods (MIA, Mean, Mean+mask, Median, Median+mask, Iterative, Iterative+mask, KNN, KNN+mask), for n = 2500, 10000, 25000 and 100000, colored by database (Traumabase, UKBB, MIMIC, NHIS).

For a specific task: relative prediction time = prediction time − average prediction time of the 9 methods.

17 / 35

Page 18

Results - Significance

Friedman test [Friedman, 1937]. Null hypothesis: "methods are equivalent".

Size      p-value
2500      1.6e-10
10000     2.6e-10
25000     2.8e-04
100000    8.5e-03

Nemenyi test [Nemenyi, 1963]. Once the Friedman test is rejected, the Nemenyi test can be applied. It provides a critical difference CD, which is the minimal difference between the average ranks of two algorithms for them to be significantly different (N: number of datasets, k: number of algorithms):

CD = q_α √( k(k+1) / (6N) )

18 / 35
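The two tests above can be run numerically. A sketch with hypothetical scores for the Friedman test, and the CD in the deck's setting (k = 9 methods, N = 13 tasks) with the two-tailed Nemenyi critical value q_0.05 ≈ 3.102 for k = 9 taken from Demšar's (2006) table:

```python
import math
from scipy.stats import friedmanchisquare

# Hypothetical scores of three methods on five tasks (one list per method).
m1 = [0.81, 0.77, 0.90, 0.65, 0.72]
m2 = [0.79, 0.74, 0.88, 0.64, 0.70]
m3 = [0.70, 0.69, 0.80, 0.60, 0.66]
stat, p = friedmanchisquare(m1, m2, m3)  # H0: the methods are equivalent

# Nemenyi critical difference: k = 9 methods, N = 13 tasks.
k, N, q_alpha = 9, 13, 3.102
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
```

Two methods are declared significantly different when their average ranks differ by more than `cd` (about 3.33 here).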

Page 19

Figure: Mean ranks by method and by size of dataset (Nemenyi critical-difference diagrams). Methods listed per panel:

• Size=2500, N=13: MIA, Mean+mask, Median+mask, Iterative+mask, Mean, Median, Iterative, KNN+mask, KNN.
• Size=10000, N=12: MIA, Mean+mask, Median+mask, Iterative+mask, Mean, KNN+mask, Iterative, Median, KNN.
• Size=25000, N=7: MIA, Median+mask, Mean+mask, Iterative+mask, Mean, KNN+mask, Median, Iterative, KNN.
• Size=100000, N=4: MIA, Mean+mask, Median+mask, Iterative+mask, Mean, Median, Iterative.

19 / 35

Page 20

Results - Significance

Same conclusions with the Wilcoxon one-sided signed-rank test [Wilcoxon, 1945].

20 / 35

Page 21

Findings and interpretation

Findings:

• MIA takes the lead at little cost, although not significantly.

• Adding the mask improves prediction.

Interpretation:

• Good imputation does not imply good prediction:
  • Low correlation between features.
  • Strong non-linear mechanisms.
  • Constant imputation provides a simple structure that can be extracted by the learner.

• The missingness is informative (MNAR, or the outcome depends on missingness) → imputation is not applicable.

21 / 35

Page 22

Strengths and limitations

Limitations:

• Not every difference is significant.

• Would benefit from more datasets, and from more datasets with a large number of samples.

Strengths of the benchmark:

• 12 000 CPU hours.

• Many datasets (only 6% of empirical NeurIPS articles build upon more than 10 datasets [Bouthillier and Varoquaux, 2020]).

• Real data.

22 / 35

Page 23

Conclusion

23 / 35

Page 24

Conclusion

• Using MIA provides small but systematic improvement over imputation.

• Complex imputation is intractable at large scale.

• Experiments suggest that missingness is informative: imputation is not well grounded.

• Directly handling missing values in the predictive model is to be considered.

• Change habits in practice: there are better choices than imputation.

24 / 35

Page 25

Reviewers’ feedback

Manuscript submitted to GigaScience.

Some comments from the reviewers:

• What about multiple imputation?

• Break boxplots by task.

• Relative prediction score difficult to interpret.

25 / 35

Page 26

Acknowledgments

Thank you for your attention.

26 / 35

Page 27

Appendix

27 / 35

Page 28

Introduction: the problem of missing values

• Missing values are omnipresent in real-world problems.

• They have long been studied in the statistical literature, within the inferential framework.

[Rubin, 1976] defined several missing-values mechanisms:

• Missing At Random (MAR): the probability that a value is missing depends only on the observed variables.

• Missing Not At Random (MNAR): the missingness can depend on both the observed and unobserved values.

Most missing-values methods in inference rely on the MAR hypothesis, since theoretical results show that the mechanism can then be ignored. In practice, real data are often MNAR.
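The difference can be simulated to see why MNAR is not ignorable. A synthetic sketch (the logistic missingness models are assumptions for illustration): under MNAR, large values of x2 are preferentially removed, so the observed mean of x2 is biased low, while under MAR it is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)  # always observed
x2 = rng.normal(size=n)  # subject to missingness

# MAR: the probability that x2 is missing depends only on the observed x1.
mar = rng.random(n) < 1 / (1 + np.exp(-x1))
x2_mar = np.where(mar, np.nan, x2)

# MNAR: the probability that x2 is missing depends on x2 itself
# (large values under-reported), so the observed mean is biased low.
mnar = rng.random(n) < 1 / (1 + np.exp(-x2))
x2_mnar = np.where(mnar, np.nan, x2)
```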

28 / 35

Page 29

Supplementary experiment

Figure: Prediction performance (supplementary experiment). Distributions of relative prediction scores of MIA and the linear models with imputation (Linear+Mean, Linear+Mean+mask, Linear+Med, Linear+Med+mask, Linear+Iter, Linear+Iter+mask, Linear+KNN, Linear+KNN+mask), for n = 2500, 10000, 25000 and 100000, colored by database (Traumabase, UKBB, MIMIC, NHIS).

Mean ranks: MIA 1.3, Linear+Mean+mask 3.9, Linear+Med+mask 4.0, Linear+Iter+mask 4.6, Linear+Mean 5.5, Linear+Med 5.9, Linear+Iter 5.9, Linear+KNN+mask 6.2, Linear+KNN 7.8.

29 / 35

Page 30

Supplementary experiment

Figure: Computational time (supplementary experiment). Distributions of relative total training times of MIA and the linear models with imputation, for n = 2500, 10000, 25000 and 100000, colored by database (Traumabase, UKBB, MIMIC, NHIS).

30 / 35

Page 31

Results - Significance

Table: Wilcoxon one-sided signed rank test.

Method             2500        10000       25000      100000
Mean               1.2e-03**   4.6e-02*    2.3e-02*   6.2e-02
Mean+mask          4.0e-02*    2.3e-01     1.5e-01    6.2e-02
Median             5.2e-03*    1.7e-03**   2.3e-02*   6.2e-02
Median+mask        4.0e-02*    2.1e-01     1.5e-01    1.2e-01
Iterative          5.2e-03*    3.2e-02*    3.9e-02*   6.2e-02
Iterative+mask     2.4e-02*    2.1e-01     4.7e-01    6.2e-02
KNN                1.2e-04**   2.4e-04**   3.1e-02*   –
KNN+mask           1.2e-04**   7.3e-04**   3.1e-02*   –
Linear+Mean        6.1e-04**   4.9e-04**   7.8e-03*   6.2e-02
Linear+Mean+mask   8.5e-04**   7.3e-04**   1.6e-02*   6.2e-02
Linear+Med         6.1e-04**   4.9e-04**   7.8e-03*   6.2e-02
Linear+Med+mask    6.1e-04**   4.9e-04**   1.6e-02*   6.2e-02
Linear+Iter        3.1e-03**   1.2e-03**   1.6e-02*   6.2e-02
Linear+Iter+mask   2.3e-03**   1.2e-03**   1.6e-02*   6.2e-02
Linear+KNN         1.2e-04**   2.4e-04**   1.6e-02*   5.0e-01
Linear+KNN+mask    1.2e-04**   2.4e-04**   3.1e-02*   5.0e-01

31 / 35

Page 32

References I

Bouthillier, X. and Varoquaux, G. (2020). Survey of machine-learning experimental methods at NeurIPS 2019 and ICLR 2020. Research report, Inria Saclay Île-de-France.

Buuren, S. v. and Groothuis-Oudshoorn, K. (2010). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, pages 1–68.

Friedman, M. (1937). The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. Journal of the American Statistical Association, 32(200):675–701.

32 / 35

Page 33

References II

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035.

Le Morvan, M., Josse, J., Moreau, T., Scornet, E., and Varoquaux, G. (2020). NeuMiss networks: differentiable programming for supervised learning with missing values. Advances in Neural Information Processing Systems, 33:5980–5990.

National Center for Health Statistics (2017). National Health Interview Survey (NHIS).

Nemenyi, P. (1963). Distribution-free Multiple Comparisons. PhD thesis, Princeton University.

33 / 35

Page 34

References III

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., and Collins, R. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.

The Traumabase Group. Traumabase.

34 / 35

Page 35

References IV

Twala, B. E. T. H., Jones, M. C., and Hand, D. J. (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29:950–956.

Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83.

35 / 35