correctness in data science - data science pop-up seattle

#datapopupseattle Correctness in Data Science Benjamin Skrainka Principal Data Scientist + Lead Instructor, Galvanize bskrainka galvanize

Upload: domino-data-lab

Post on 21-Apr-2017

TRANSCRIPT

Page 1: Correctness in Data Science - Data Science Pop-up Seattle

#datapopupseattle

Correctness in Data Science

Benjamin Skrainka
Principal Data Scientist + Lead Instructor, Galvanize

bskrainka galvanize

Page 2: Correctness in Data Science - Data Science Pop-up Seattle

#datapopupseattle

UNSTRUCTURED Data Science POP-UP in Seattle

www.dominodatalab.com

Produced by Domino Data Lab

Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.

Page 3: Correctness in Data Science - Data Science Pop-up Seattle

Correctness in Data Science

Benjamin S. Skrainka

October 7, 2015

Benjamin S. Skrainka Correctness in Data Science October 7, 2015 1 / 24

Page 4: Correctness in Data Science - Data Science Pop-up Seattle

The correctness problem

A lot of (data) science is unscientific:

- “My code runs, so the answer must be correct”
- “It passed Explain Plan, so the answer is correct”
- “This model is too complex to have a design document”
- “It is impossible to unit test scientific code”
- “The lift from the direct mail campaign is 10%”

Page 5: Correctness in Data Science - Data Science Pop-up Seattle

Correctness matters

Bad (data) science:

- Costs real money and can kill people
- Will eventually damage your reputation and career
- Could expose you to litigation
- An issue of basic integrity and sleeping at night

Page 6: Correctness in Data Science - Data Science Pop-up Seattle

Objectives

Today’s goals:

- Introduce VV&UQ framework to evaluate correctness of scientific models
- Survey good habits to improve quality of your work

Page 7: Correctness in Data Science - Data Science Pop-up Seattle

Verification, Validation, & Uncertainty Quantification

Page 8: Correctness in Data Science - Data Science Pop-up Seattle

Introduction to VV&UQ

Verification, Validation, & Uncertainty Quantification provides epistemological framework to evaluate correctness of scientific models:

- Evidence of correctness should accompany any prediction
- In absence of evidence, assume predictions are wrong
- Popper: can only disprove or fail to disprove a model
- VV&UQ is inductive whereas science is deductive

Reference: Verification and Validation in Scientific Computing by Oberkampf & Roy

Page 9: Correctness in Data Science - Data Science Pop-up Seattle

Definitions of VV&UQ

Definitions of terms (Oberkampf & Roy):

- Verification:
  - “solving equations right”
  - I.e., code implements the model correctly
- Validation:
  - “solving right equations”
  - I.e., model has high fidelity to reality

Definitions of VV&UQ will vary depending on source...

=> Most organizations do not even practice verification...

Page 10: Correctness in Data Science - Data Science Pop-up Seattle

Definition of UQ

Definition of Uncertainty Quantification (Oberkampf & Roy):

    Process of identifying, characterizing, and quantifying those factors in an analysis which could affect accuracy of computational results

- Do your assumptions hold? When do they fail?
- Does your model apply to the data/situation?
- Where does your model break down? What are its limits?

Page 11: Correctness in Data Science - Data Science Pop-up Seattle

Verification of code

Does your code implement the model correctly?

- Unit test everything you can:
  - Scientific code can be unit tested
  - Test special cases
  - Test on cases with analytic solutions
  - Test on synthetic data
- Unit test framework will set up and tear down fixtures
- Should be able to recover parameters from Monte Carlo data
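
A minimal sketch of what such tests might look like, assuming a simple linear model. The functions `fit_line` and `simulate` are hypothetical names invented for this illustration; they are not from the talk. One test uses a case with an analytic solution, the other checks that known parameters are recovered from synthetic (Monte Carlo) data within a tolerance.

```python
import random
import unittest


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def simulate(intercept, slope, noise_sd, n, seed):
    """Draw synthetic data from a linear model whose parameters are known."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


class TestFitLine(unittest.TestCase):

    def test_analytic_case(self):
        # Noiseless points on y = 1 + 2x must be fitted exactly.
        intercept, slope = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
        self.assertAlmostEqual(intercept, 1.0)
        self.assertAlmostEqual(slope, 2.0)

    def test_recovers_parameters(self):
        # Estimates from synthetic data should land close to the truth.
        xs, ys = simulate(intercept=1.5, slope=-0.7, noise_sd=0.5,
                          n=5000, seed=42)
        intercept, slope = fit_line(xs, ys)
        self.assertAlmostEqual(intercept, 1.5, delta=0.1)
        self.assertAlmostEqual(slope, -0.7, delta=0.05)
```

Saved to a file, the tests can be run with `python -m unittest`. The fixed seed keeps the Monte Carlo test reproducible; the tolerances are illustrative and should be chosen from the expected sampling error.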

Page 12: Correctness in Data Science - Data Science Pop-up Seattle

Verification of SQL

Passing Explain Plan doesn’t mean your SQL is correct:

- Garbage in, garbage out
- Check a simple case you can compute by hand
- Check join plan is correct
- Check aggregate statistics
- Check answer is compatible with reality
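
Some of these checks can be sketched with Python's built-in `sqlite3` module. The tables, columns, and values below are made up for illustration and small enough to compute by hand. The sketch checks the query against the hand-computed answer, checks that the join neither drops nor duplicates rows, and checks that an aggregate is preserved.

```python
import sqlite3

# Tiny in-memory database whose correct answers are obvious by inspection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);
""")

# Query under test: revenue per region.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Hand-computed answer: east = 3, west = 5 + 7 = 12.
assert rows == [("east", 3.0), ("west", 12.0)]

# Join check: every order should match exactly one customer.
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
n_joined = conn.execute("""
    SELECT COUNT(*)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
""").fetchone()[0]
assert n_joined == n_orders

# Aggregate check: grouping must preserve the grand total.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
assert sum(amount for _, amount in rows) == total
```

The same pattern applies to a production query: run it against a small fixture where the answer is known before trusting it on the full data.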

Page 13: Correctness in Data Science - Data Science Pop-up Seattle

Unit test

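# The standard library's unittest replaces the unittest2 backport on current Python.
import unittest

import assignment as problems  # module under test


class TestAssignment(unittest.TestCase):

    def test_zero(self):
        # Compare the computed answer against a known value.
        result = problems.question_zero()
        self.assertEqual(result, 9198)

    ...


if __name__ == '__main__':
    unittest.main()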

Page 14: Correctness in Data Science - Data Science Pop-up Seattle

Unit test

Figure 1: [figure not reproduced in transcript]

Page 15: Correctness in Data Science - Data Science Pop-up Seattle

Validation of model

Check your model is a good (enough) representation of reality:

“All models are wrong but some are useful” – George Box

- Run an experiment
- Perform specification testing
- Test assumptions hold
- Beware of endogenous features

Page 16: Correctness in Data Science - Data Science Pop-up Seattle

Approaches to experimentation

Many ways to test:

- A/B test
- Multi-armed bandit
- Bayesian A/B test
- Wald sequential analysis
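
As one illustration of the first approach, a classical A/B test on conversion rates can be sketched as a two-sided, two-proportion z-test. This is a common textbook choice, not necessarily the method used in the talk; the function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for equality of two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms are identical.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 200 of 2000 convert in arm A, 250 of 2000 in arm B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The test assumes a sample size fixed in advance. Checking the p-value repeatedly while data arrives inflates the false-positive rate, which is the situation sequential methods such as Wald's are designed for.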

Page 17: Correctness in Data Science - Data Science Pop-up Seattle

Uncertainty quantification

There are many types of uncertainty which affect the robustness of your model:

- Parameter uncertainty
- Structural uncertainty
- Algorithmic uncertainty
- Experimental uncertainty
- Interpolation uncertainty

Classified as aleatoric (statistical) and epistemic (systematic)
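
Parameter uncertainty of the statistical kind can be estimated with a percentile bootstrap, sketched below. This is one standard technique chosen for illustration, not one named in the talk; `bootstrap_ci` is a hypothetical helper and the sample is simulated. It does not capture epistemic sources such as a misspecified model.

```python
import random
import statistics


def bootstrap_ci(data, estimator, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Re-estimate on resamples drawn with replacement from the data.
    estimates = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2.0) * n_resamples)
    hi_idx = int((1.0 - alpha / 2.0) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Simulated sample standing in for real observations.
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]
low, high = bootstrap_ci(sample, statistics.mean)
print(f"mean = {statistics.mean(sample):.2f}, "
      f"95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the point estimate gives the reader evidence of how much the estimate could move under resampling.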

Page 18: Correctness in Data Science - Data Science Pop-up Seattle

Good habits

Page 19: Correctness in Data Science - Data Science Pop-up Seattle

Act like a software engineer

Use best practices from software engineering:

- Good design of code
- Follow a sensible coding convention
- Version control
- Use same file structure for every project
- Unit test
- Use PEP8 or equivalent
- Perform code reviews

Page 20: Correctness in Data Science - Data Science Pop-up Seattle

Reproducible research

‘Document what you do and do what you document’:

- Keep a journal!
- Data provenance
- How data was cleaned
- Design document
- Specification & requirements

“Do you keep a journal? You should. Fermi taught me that.” – John A. Wheeler
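
One lightweight way to record data provenance in a journal is to log a cryptographic hash of each raw file together with a note. The sketch below is an assumption about how this could be done; the function names and the `journal.jsonl` file name are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def file_sha256(path):
    """Hash a file in chunks so large data sets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_provenance(data_path, note, journal_path="journal.jsonl"):
    """Append one journal entry recording which data was used and why."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_file": str(data_path),
        "sha256": file_sha256(data_path),
        "note": note,
    }
    with open(journal_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

If the hash of a file changes between runs, the data changed, and the journal shows when that happened.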

Page 21: Correctness in Data Science - Data Science Pop-up Seattle

Follow a workflow

Use a workflow like CRISP-DM:

1. Define business question and metric
2. Understand data
3. Prepare data
4. Build model
5. Evaluate
6. Deploy

Ensures you don’t forget any key steps

Page 22: Correctness in Data Science - Data Science Pop-up Seattle

Automate your data pipeline

One-touch build of your application or paper:

- Automate entire workflow from raw data to final result
- Ensures you perform all steps
- Ensures all steps are known – no one-off manual adjustments
- Avoids stupid human errors
- Auto generate all tables and figures
- Save time when handling new data... which always has subtle changes in formatting
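
A one-touch build can be as small as a chain of functions with a single entry point. The sketch below assumes a hypothetical CSV with `region` and `sales` columns; all names are illustrative. Every cleaning rule lives in code, and the final table is generated rather than typed.

```python
import csv


def load(raw_path):
    """Read raw rows exactly as delivered; the raw file is never edited."""
    with open(raw_path, newline="") as f:
        return list(csv.DictReader(f))


def clean(rows):
    """All adjustments live here, so each one is documented and repeatable."""
    return [
        {"region": r["region"].strip().lower(), "sales": float(r["sales"])}
        for r in rows
        if r["sales"].strip() != ""  # drop rows with missing sales
    ]


def summarize(rows):
    """Total sales per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["sales"]
    return totals


def write_table(totals, out_path):
    """Generate the final table automatically instead of copying numbers."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "total_sales"])
        for region in sorted(totals):
            writer.writerow([region, f"{totals[region]:.2f}"])


def build(raw_path, out_path):
    """One-touch build: raw data in, final table out."""
    write_table(summarize(clean(load(raw_path))), out_path)
```

When a new data delivery arrives, rerunning `build` repeats every step, and any formatting change shows up as a failure in `clean` rather than as a silent manual fix.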

Page 23: Correctness in Data Science - Data Science Pop-up Seattle

Write flexible code to handle data

Use constants/macros to access data fields:

- Code will clearly show what data matters
- Easier to understand code and data pipeline
- Easier to debug data problems
- Easier to handle changes in data formatting

Page 24: Correctness in Data Science - Data Science Pop-up Seattle

Python example

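import numpy as np

# Set up column indices: name each field once, then use the name everywhere
ix_gdp = 7

...

# Load & clean data
# np.genfromtxt returns a 2-D float array, so column slicing works;
# skip_header=1 skips the row of column names (non-numeric cells become NaN).
m_raw = np.genfromtxt('bea_gdp.csv', delimiter=',', skip_header=1)

gdp = m_raw[:, ix_gdp]

...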

Page 25: Correctness in Data Science - Data Science Pop-up Seattle

Politics...

Often, there is political pressure to violate best practice:

- Examples:
  - 80% confidence intervals
  - Absurd attribution window
  - Two year forecast horizon but only three months of data
- Hard to do right thing vs. senior management
- Recruit a high-level scientist to advocate
- Particularly common with forecasting:
  - Often requested by management for CYA
  - Insist on a ‘panel of experts’ for impossible decisions

Page 26: Correctness in Data Science - Data Science Pop-up Seattle

Conclusion

Need to raise the quality of data science:

- VV&UQ provides rigorous framework:
  - Verification: solve the equations right
  - Validation: solve the right equations
  - Uncertainty quantification: how robust is model to unknowns?
- Adopting good habits provides huge gains for minimal effort
