Data Science: Principles, Practice, Potential and Pitfalls
Dionisio Acosta
Institute of Health Informatics, University College London
June 2019
Dionisio Acosta (UCL) June 2019 1 / 20
Outline
Principles: what underpins Data Science
Practice: showcase of exemplar applications
Pitfalls: practices that could lead to over-optimism
Potential: current research that addresses fundamental problems
Principles: what underpins Data Science?
The aim is to address engineering practices that support data-driven healthcare organisations.
Data-driven healthcare and Evidence-based practice (Smith, A. 1996)
What are optimal (automated) decisions? What are the prediction tolerance levels? What are the value functions of actions?
We are concerned with supporting human decision making, but by and large we are concerned with how to support data-driven automated decision making.
There are degrees of support: visualisation, prediction, decision making, automated (AI) planning.
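The question about value functions of actions can be made concrete with a small expected-value calculation. The sketch below is an illustration only, not a method from the talk: the two actions, the utility numbers and the function names are all hypothetical. It shows how a value function over actions turns a predicted probability into a decision, and how the implied probability threshold (one reading of a "tolerance level") follows from the utilities.

```python
# Hypothetical utilities, for illustration only. Real values would come
# from clinical and economic evaluation, not from this sketch.
UTILITY = {
    ("treat", True): 0.9,    # condition present and treated
    ("treat", False): -0.2,  # condition absent, unnecessary treatment
    ("wait", True): -1.0,    # condition present but not treated
    ("wait", False): 0.0,    # condition absent, correctly left alone
}


def expected_value(action, p_condition):
    """Expected utility of an action given the predicted probability."""
    return (p_condition * UTILITY[(action, True)]
            + (1.0 - p_condition) * UTILITY[(action, False)])


def best_action(p_condition):
    """Action with the highest expected utility."""
    return max(("treat", "wait"),
               key=lambda action: expected_value(action, p_condition))


def treatment_threshold():
    """Probability at which both actions have equal expected utility."""
    gain_if_present = UTILITY[("treat", True)] - UTILITY[("wait", True)]
    loss_if_absent = UTILITY[("wait", False)] - UTILITY[("treat", False)]
    return loss_if_absent / (gain_if_present + loss_if_absent)


print(best_action(0.05), best_action(0.5), round(treatment_threshold(), 3))
```

With these made-up utilities a missed condition is far more costly than an unnecessary treatment, so the threshold sits well below 0.5. Changing the utilities moves the threshold, which is why the value function has to be agreed before a prediction model can drive automated decisions.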
Principles: what underpins Data Science?
Data models and platforms (HW & SW): every research question induces a data model.
The fundamental problem of model selection: very well specified in Statistics but less understood elsewhere.
Statistics (Computational), Computer Science, Machine Learning, Database Systems, Distributed and High Performance Computing.
Practice: exemplar applications
Some example applications that provide an overview of the art of the possible:
Management of chest pain
Predicting emergency admissions
Pharmacotherapy management in Parkinson’s Disease
Automatic data quality
Breast cancer treatment selector
Inspirational Example: Management of Chest Pain
Mean follow-up of 21 ± 5 months
Zacharias, K., et al. (2017). European Heart Journal-Cardiovascular Imaging, 18 (2), 195-202. doi:10.1093/ehjci/jew049
Predicting emergency admissions
Error reduction from 20% to 2% using a TBATS model
Odera, I. and Acosta, D. (2014)
Pharmacotherapy Management in Parkinson’s Disease
Nguyen, V., et al. (2018) Studies in Health Technology and Informatics, pp. 156–160.
Automatic Data Quality Control
Saez, C., et al. (2016) Journal of the American Medical Informatics Association 23, pp. 1085–1095. doi:10.1093/jamia/ocw010
Breast Cancer Treatment Selector
Patkar, V., et al. (2012) BMJ Open 2(3): e000439. doi:10.1136/bmjopen-2011-000439
Pitfalls: practices that could lead to over-optimism
Cross-validation practices
Model selection using AUC and model complexity
Model selection in penalised models and p-values
Model application context
Sample size in penalised regression models
Covariance matrix in the high-dimension, small-sample-size context
Cross Validation Practices
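The figures for these slides are not part of the transcript, so the sketch below does not reproduce them. It illustrates one well-known cross-validation practice that leads to over-optimism: selecting features on the full dataset before cross-validating. The data are pure noise, so an honest accuracy estimate should sit near 0.5. All function names, the nearest-centroid classifier and the sizes (40 samples, 1000 features, 10 selected) are assumptions chosen for illustration.

```python
import random


def make_noise_data(n, p, rng):
    """n samples of p Gaussian noise features; labels carry no signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced labels, unrelated to X
    return X, y


def top_features(X, y, rows, k):
    """Indices of the k features with the largest class-mean gap on `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(j):
        m1 = sum(X[i][j] for i in pos) / len(pos)
        m0 = sum(X[i][j] for i in neg) / len(neg)
        return abs(m1 - m0)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_predict(X, y, train, feats, i):
    """Nearest-centroid label for row i, with centroids from `train` only."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        dist = 0.0
        for j in feats:
            centre = sum(X[r][j] for r in members) / len(members)
            dist += (X[i][j] - centre) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, k_feats, n_folds, select_inside_folds):
    """Cross-validated accuracy with selection inside or outside the folds."""
    rows = list(range(len(X)))
    # Selecting on all rows lets the held-out labels leak into the model.
    leaked = None if select_inside_folds else top_features(X, y, rows, k_feats)
    correct = 0
    for fold in range(n_folds):
        test = [i for i in rows if i % n_folds == fold]
        train = [i for i in rows if i % n_folds != fold]
        feats = (top_features(X, y, train, k_feats)
                 if select_inside_folds else leaked)
        correct += sum(centroid_predict(X, y, train, feats, i) == y[i]
                       for i in test)
    return correct / len(rows)


rng = random.Random(0)
X, y = make_noise_data(n=40, p=1000, rng=rng)
leaky = cv_accuracy(X, y, 10, 5, select_inside_folds=False)
honest = cv_accuracy(X, y, 10, 5, select_inside_folds=True)
print(f"selection outside folds: {leaky:.2f}; inside folds: {honest:.2f}")
```

Selecting outside the folds typically reports an accuracy well above chance even though no signal exists, whereas repeating the selection inside each training fold brings the estimate back towards 0.5. The same reasoning applies to any data-dependent step, such as scaling, imputation or hyperparameter tuning.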
Sample Size and Penalised Regression
$$\frac{k \log p}{n} \to 0 \qquad (1)$$
Meinshausen, N., Yu, B., 2009. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics 37, 246–270.
Wainwright, M.J., 2009. Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using L1-Constrained Quadratic Programming (Lasso). IEEE Transactions on Information Theory 55, 2183–2202.
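To get a feel for condition (1), the ratio can be evaluated for a few study sizes. The sketch assumes the usual reading of the symbols in the cited papers, which the slide does not spell out: n is the number of samples, p the number of candidate predictors and k the number of truly non-zero coefficients. The function name and the example values are hypothetical.

```python
import math


def sparsity_ratio(k, p, n):
    """Evaluate k * log(p) / n, the quantity that should tend to zero."""
    return k * math.log(p) / n


# Fixed sparsity and dimension; only the sample size grows.
for n in (100, 1_000, 10_000):
    print(n, round(sparsity_ratio(k=10, p=1_000, n=n), 3))
```

The condition is asymptotic, so there is no universal cut-off for a single dataset. The ratio is still a useful sanity check: because p enters only through its logarithm, the sample size has to grow roughly in proportion to the number of relevant predictors k rather than to the total number of candidates.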
Covariance matrix in high dimension and small sample size
Consider the effect on graphical methods and PCA
Ledoit, O., et al. (2012) The Annals of Statistics 40(2): 1024–60. doi:10.1214/12-AOS989.
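A small simulation shows why the sample covariance matrix is unreliable in this regime. The data are drawn with true covariance equal to the identity, so every true eigenvalue is 1, and the largest eigenvalue of the sample covariance is estimated by power iteration. This is a sketch under stated assumptions, not the estimator from the cited paper: the function name, the dimensions and the iteration count are chosen for illustration.

```python
import random


def largest_sample_eigenvalue(n, p, rng, iters=100):
    """Largest eigenvalue of the sample covariance of n draws from N(0, I_p)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]

    def cov_times(v):
        # S v = Xc^T (Xc v) / (n - 1), without forming the p x p matrix S.
        proj = [sum(row[j] * v[j] for j in range(p)) for row in Xc]
        return [sum(Xc[i][j] * proj[i] for i in range(n)) / (n - 1)
                for j in range(p)]

    # Power iteration: the norm of S v converges to the top eigenvalue.
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    eig = 0.0
    for _ in range(iters):
        w = cov_times(v)
        eig = sum(a * a for a in w) ** 0.5
        v = [a / eig for a in w]
    return eig


rng = random.Random(0)
few_samples = largest_sample_eigenvalue(n=15, p=30, rng=rng)
many_samples = largest_sample_eigenvalue(n=600, p=30, rng=rng)
print(f"n=15: {few_samples:.2f}; n=600: {many_samples:.2f}; true value: 1")
```

With fewer samples than dimensions the largest sample eigenvalue is far above the true value of 1, and the matrix has rank at most n - 1, so it cannot be inverted. PCA then reports dominant components that are artefacts of sampling, and graphical methods that rely on the inverse covariance have nothing valid to work with. This is the motivation for shrinkage estimators.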
Potential: addressing fundamental problems
Differential Privacy and Distributed Learning (Balcan et al. 2012)
Data-driven phenotypes: Learning longitudinal phenotypes
Differential Privacy and Distributed Learning
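The content of this slide is not in the transcript. As a generic illustration of differential privacy, and not necessarily the approach presented in the talk or in the cited work, the sketch below releases a count with the standard Laplace mechanism. The records, the predicate and the function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical patient ages; the query counts patients aged 65 or over.
ages = [34, 71, 65, 80, 52, 90]
rng = random.Random(1)
released = private_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)
print(round(released, 2))
```

A smaller epsilon gives stronger privacy and a noisier answer. In a distributed setting each site could release noisy aggregates of this kind instead of patient-level data, at the cost of accuracy that has to be budgeted across all queries.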
Learning Longitudinal Phenotypes
Concluding Remarks
The data scientist reaches out to statisticians, computer scientists, informaticians, etc., to achieve the aim of creating data-driven organisations.
Data Science extends the language at my disposal to formulate models. The benefit is that I can find succinct ways to model reality.
The challenge is that, like learning a new language, one makes endless mistakes and those mistakes, however small, stick around.
The hope is that new generations are able to see those mistakes, sometimes kindly correct me, but most importantly go out there by themselves and be able to express reality faithfully.