Post on 07-Jan-2017
TRANSCRIPT
Tune up your data science process
Benjamin S. Skrainka
February 10, 2016
Benjamin S. Skrainka Tune up your data science process February 10, 2016 1 / 24
The correctness problem
A lot of (data) science is unscientific:
- "My code runs, so the answer must be correct"
- "It passed Explain Plan, so the answer is correct"
- "This model is too complex to have a design document"
- "It is impossible to unit test scientific code"
- "The lift from the direct mail campaign is 10%"
Correctness matters
Bad (data) science:
- Costs real money and can kill people
- Will eventually damage your reputation and career
- Could expose you to litigation
- An issue of basic integrity and sleeping at night
Objectives
Today’s goals:
- Introduce the VV&UQ framework to evaluate the correctness of scientific models
- Survey good habits to improve the quality of your work
Verification, Validation, & Uncertainty Quantification
Introduction to VV&UQ
Verification, Validation, & Uncertainty Quantification provides an epistemological framework for evaluating the correctness of scientific models:

- Evidence of correctness should accompany any prediction
- In the absence of evidence, assume predictions are wrong
- Popper: you can only disprove or fail to disprove a model
- VV&UQ is inductive whereas science is deductive
Reference: Verification and Validation in Scientific Computing by Oberkampf & Roy
Definitions of VV&UQ
Definitions of terms (Oberkampf & Roy):
- Verification:
  - "solving equations right"
  - i.e., the code implements the model correctly
- Validation:
  - "solving right equations"
  - i.e., the model has high fidelity to reality
Definitions of VV&UQ will vary depending on the source…

→ Most organizations do not even practice verification…
Definition of UQ
Definition of Uncertainty Quantification (Oberkampf & Roy):
Process of identifying, characterizing, and quantifying those factors in an analysis which could affect the accuracy of computational results
- Do your assumptions hold? When do they fail?
- Does your model apply to the data/situation?
- Where does your model break down? What are its limits?
Verification of code
Does your code implement the model correctly?
- Unit test everything you can:
  - Scientific code can be unit tested
  - Test special cases
  - Test on cases with analytic solutions
  - Test on synthetic data
- A unit test framework will set up and tear down fixtures
- You should be able to recover parameters from Monte Carlo data
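Parameter recovery from Monte Carlo data can be sketched as follows; the data-generating process, coefficients, sample size, and tolerance are all illustrative assumptions:

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulate data from known parameters, then check we recover them.
rng = np.random.default_rng(0)
true_beta = np.array([2.0, -1.5])
X = np.column_stack([np.ones(10_000), rng.normal(size=10_000)])
y = X @ true_beta + rng.normal(scale=0.1, size=10_000)

beta_hat = fit_ols(X, y)
assert np.allclose(beta_hat, true_beta, atol=0.01)
```

If the estimator cannot recover parameters it generated itself, it certainly cannot be trusted on real data.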
Verification of SQL
Passing Explain Plan doesn’t mean your SQL is correct:
- Garbage in, garbage out
- Check a simple case you can compute by hand
- Check the join plan is correct
- Check aggregate statistics
- Check the answer is compatible with reality
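One way to check an aggregate against a hand computation is to run the same query on a tiny table whose answer you already know; the table, columns, and values below are made up for illustration:

```python
import sqlite3

# Build a toy table whose GROUP BY totals we can compute by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [("east", 10.0), ("east", 5.0), ("west", 7.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Hand-computed totals for the same three rows:
assert result == {"east": 15.0, "west": 7.0}
```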
Unit test
```python
import unittest2 as unittest
import assignment as problems

class TestAssignment(unittest.TestCase):

    def test_zero(self):
        result = problems.question_zero()
        self.assertEqual(result, 9198)

    ...

if __name__ == '__main__':
    unittest.main()
```
Unit test

Figure 1: [screenshot not transcribed]
Validation of model
Check your model is a good (enough) representation of reality:
“All models are wrong but some are useful” – George Box
- Run an experiment
- Perform specification testing
- Test that assumptions hold
- Beware of endogenous features
Approaches to experimentation
Many ways to test:
- A/B test
- Multi-armed bandit
- Bayesian A/B test
- Wald sequential analysis
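A classical A/B test can be sketched as a two-proportion z-test; the conversion counts and sample sizes below are hypothetical:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail prob.
    return z, p_value

# Hypothetical: 200/10,000 conversions in control, 260/10,000 in treatment
z, p = two_proportion_z(200, 10_000, 260, 10_000)
```

In practice you would fix the sample size (or use a sequential design) before looking at the data, rather than peeking repeatedly.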
Uncertainty quantification
There are many types of uncertainty which affect the robustness of your model:

- Parameter uncertainty
- Structural uncertainty
- Algorithmic uncertainty
- Experimental uncertainty
- Interpolation uncertainty
Classified as aleatoric (statistical) and epistemic (systematic)
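Parameter uncertainty, for example, can be quantified with a bootstrap; the data below are synthetic and the statistic (the mean) is just an illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=500)  # hypothetical sample

# Resample with replacement to approximate the sampling
# distribution of the mean, then read off a 95% interval.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```

Reporting the interval `[lo, hi]` alongside the point estimate communicates how much the estimate could move under resampling.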
Good habits
Act like a software engineer
Use best practices from software engineering:
- Good design of code
- Follow a sensible coding convention
- Version control
- Use the same file structure for every project
- Unit test
- Use PEP8 or equivalent
- Perform code reviews
Reproducible research
‘Document what you do and do what you document’:
- Keep a journal!
- Data provenance
- How data was cleaned
- Design document
- Specification & requirements
Do you keep a journal? You should. Fermi taught me that. – John A. Wheeler
Follow a workflow
Use a workflow like CRISP-DM:
1. Define business question and metric
2. Understand data
3. Prepare data
4. Build model
5. Evaluate
6. Deploy
Ensures you don’t forget any key steps
Automate your data pipeline
One-touch build of your application or paper:
- Automate the entire workflow from raw data to final result
- Ensures you perform all steps
- Ensures all steps are known – no one-off manual adjustments
- Avoids stupid human errors
- Auto-generate all tables and figures
- Saves time when handling new data… which always has subtle changes in formatting
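A one-touch pipeline can be as simple as an ordered list of steps, each consuming the previous step's output; the steps here are toy placeholders for real cleaning and reporting stages:

```python
# Hypothetical one-touch pipeline: run() executes every step in
# order, so no step can be forgotten or applied by hand.
def clean(raw: str) -> str:
    """Drop blank lines and strip whitespace."""
    return "\n".join(line.strip() for line in raw.splitlines() if line.strip())

def summarize(clean_text: str) -> str:
    """Produce a tiny 'report' from the cleaned data."""
    return f"rows: {len(clean_text.splitlines())}"

STEPS = [clean, summarize]

def run(raw: str) -> str:
    out = raw
    for step in STEPS:
        out = step(out)
    return out

report = run("  a \n\n b \n c  \n")  # raw data in, final report out
```

In a real project the same idea is usually expressed with `make` or a workflow tool, so the whole paper or application rebuilds from raw data with one command.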
Write flexible code to handle data
Use constants/macros to access data fields:
- Code will clearly show what data matters
- Easier to understand code and data pipeline
- Easier to debug data problems
- Easier to handle changes in data formatting
Python example
```python
import numpy as np

# Set up indicators: name each column index once
ix_gdp = 7
...

# Load & clean data (genfromtxt returns a 2-D array, so integer
# column indexing works)
m_raw = np.genfromtxt('bea_gdp.csv', delimiter=',', skip_header=1)
gdp = m_raw[:, ix_gdp]
...
```
Politics. . .
Often, there is political pressure to violate best practice:
- Examples:
  - 80% confidence intervals
  - Absurd attribution window
  - Two-year forecast horizon but only three months of data
- Hard to do the right thing vs. senior management
- Recruit a high-level scientist to advocate
- Particularly common with forecasting:
  - Often requested by management for CYA
  - Insist on a 'panel of experts' for impossible decisions
Conclusion
Need to raise the quality of data science:
- VV&UQ provides a rigorous framework:
  - Verification: solve the equations right
  - Validation: solve the right equations
  - Uncertainty quantification: how robust is the model to unknowns?
Adopting good habits provides huge gains for minimal effort