Dan Mallinger, Data Science Practice Manager, Think Big Analytics at MLconf NYC


Post on 15-Jul-2015


TRANSCRIPT

Page 1: Dan Mallinger, Data Science Practice Manager, Think Big Analytics at MLconf NYC

Think Big, Start Smart, Scale Fast

Analytics Communication: Re-Introducing Complex Models

Page 2

• Director of Data Science at Think Big

• I work at the intersection of statistics and technology

• But also business and analytics

• Too often I see data scientists limit themselves and their businesses

Dan Mallinger

Page 3

1. Importance of Communication

2. Lost Tools of Analytics Communication

3. Tricks for those in Regulated Environments

4. More Communication

Today

Page 4

Not Today

Page 5

• Familiar = Clear

• Clear = Explainable

• Explainable = Understood

• Understood = Trustworthy

“Explainable” Model Fallacy

Page 6

Better Communication Yields…

Page 7

Bad Communication and Black Boxes…

Page 8

Why We Should Care: We Won’t Waste Money

Alas, not even a 250Gb server was sufficient: even after waiting three days, the data couldn't even be loaded. […]

Steve said it would be difficult for managers to accept a process that involved sampling.

Page 9

hlm.html('Test1', test1_score__eoy~test1_score__boy + ...

is_special_ed * perc_free_lunch ...

other_score * support_rec ...

(is_focal | inst_sid), data=kinder)

Technically this is a regression…

So simple anyone can understand it!

Why We Should Care: You Can’t Explain Your Models Anyway

Page 10

• If your model needs to be re-fit every month, it probably has an eating disorder

• Be a better communicator to yourself

Why We Should Care: Some of Us Don’t Understand Our Models

Page 11

Meet Bob

Page 12

• Predicting “Membership” (not really; this is a dummy outcome)

• Pick a “black box” model

• Build understanding

Airline Data

Page 13

Danger! Does Your Manager Know What Strata Are?!

Manager Doesn’t Trust Samples?

Page 14

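• Easy:

library(nnet)  # provides nnet()

# Fit 5000 networks, each on a random 10,000-row sample of `raw`
models <- lapply(1:5000, function(i) {
  rand.rows <- sample.int(nrow(raw), size = 10000)
  df <- raw[rand.rows, c(dep.cols, ind.cols)]
  nnet(Member ~ ., data = df, size = 10, trace = FALSE)  # return the fitted model
})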

• Easier:

library(bootstrap)

• Bootstrap!

– Simple, but underused

– Resample data, rebuild models

– Parametric and non-parametric bootstrapping (bias/variance)

Gist of non-parametric: resample and refit a bunch of times, then treat the results as a distribution for confidence intervals

Manager Doesn’t Trust Samples?
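The non-parametric bootstrap described on this slide can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the data frame is a synthetic stand-in for the airline data, the column names `miles`, `flights`, and `noise` are hypothetical, and the network size and replicate count are kept small so the example runs quickly.

```r
library(nnet)  # recommended package; provides nnet()

set.seed(42)

# Hypothetical stand-in for the talk's `raw` airline data.
n <- 2000
raw <- data.frame(miles = rnorm(n), flights = rnorm(n), noise = rnorm(n))
raw$Member <- factor(rbinom(n, 1, plogis(1.5 * raw$miles - raw$flights)))

# Fixed holdout set used to score every refit model.
holdout.rows <- sample.int(n, size = 500)
holdout <- raw[holdout.rows, ]
train <- raw[-holdout.rows, ]

# One bootstrap replicate: resample training rows with replacement,
# refit the network, and score it on the holdout set.
boot.accuracy <- function(i) {
  rows <- sample.int(nrow(train), replace = TRUE)
  m <- nnet(Member ~ ., data = train[rows, ], size = 3, trace = FALSE)
  pred <- predict(m, newdata = holdout, type = "class")
  mean(pred == holdout$Member)
}

acc <- vapply(1:50, boot.accuracy, numeric(1))

# Percentile interval: treat the replicates as the sampling distribution.
ci <- quantile(acc, probs = c(0.025, 0.975))
print(ci)
```

A narrow interval is the kind of evidence a manager can read without knowing what a stratum is: the model's accuracy barely moves when the sample changes.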

Page 15

Stability of Model

Page 16

• Bob has convinced his manager that his sampling strategy is acceptable (Good Job, Bob!)

• But he hasn’t built trust in the model

Now What?

Page 17

Bob Doesn’t Explain Variables Like This…

Page 18

• If X matters, then shuffling it should hurt our model

• Then bootstrap for confidence intervals

• Most R models have a method for this (see caret)

Shining a light into the parameters of our black box

Variable Importance
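A hand-rolled version of the shuffle-and-bootstrap idea might look like the sketch below. It is an assumption-laden illustration rather than the method used in the talk: the data are synthetic, the column names are hypothetical, the bootstrap resamples only the evaluation rows (the model is not refit), and accuracy is measured on the training data for brevity.

```r
library(nnet)  # recommended package; provides nnet()

set.seed(1)

# Hypothetical stand-in data: `noise` is unrelated to the outcome by construction.
n <- 2000
raw <- data.frame(miles = rnorm(n), flights = rnorm(n), noise = rnorm(n))
raw$Member <- factor(rbinom(n, 1, plogis(1.5 * raw$miles - raw$flights)))
vars <- c("miles", "flights", "noise")

m <- nnet(Member ~ ., data = raw, size = 3, trace = FALSE)

accuracy <- function(model, data) {
  mean(predict(model, newdata = data, type = "class") == data$Member)
}

# One replicate: bootstrap the evaluation rows, shuffle one column,
# and record how much accuracy is lost.
perm.drop <- function(var) {
  rows <- sample.int(nrow(raw), replace = TRUE)
  original <- raw[rows, ]
  shuffled <- original
  shuffled[[var]] <- sample(shuffled[[var]])
  accuracy(m, original) - accuracy(m, shuffled)
}

# Mean accuracy drop and a percentile interval for each variable.
importance <- t(sapply(vars, function(v) {
  drops <- replicate(100, perm.drop(v))
  c(mean = mean(drops), quantile(drops, c(0.025, 0.975)))
}))
print(round(importance, 3))
```

If a variable matters, its interval should sit clearly above zero; an interval straddling zero suggests the model barely uses it.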

Page 19

Shining a light into the parameters of our black box

Variable Importance: Bob’s Data

Page 20

• Similar to variable importance

• How do relationships in our model play out in different settings?

• How much does our model depend on accurate measurement?

Sensitivity and Robustness
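The two questions on this slide can be probed with a sketch like the one below. It is illustrative only: the data and column names are hypothetical, the "settings" are simply the other inputs pinned at a low, middle, and high quantile, and measurement error is simulated as added Gaussian noise.

```r
library(nnet)  # recommended package; provides nnet()

set.seed(7)

# Hypothetical stand-in data.
n <- 2000
raw <- data.frame(miles = rnorm(n), flights = rnorm(n), noise = rnorm(n))
raw$Member <- factor(rbinom(n, 1, plogis(1.5 * raw$miles - raw$flights)))
vars <- c("miles", "flights", "noise")

m <- nnet(Member ~ ., data = raw, size = 3, trace = FALSE)

# Sensitivity: sweep one input across its observed range while the
# other inputs are pinned at quantile `q`.
profile <- function(var, q = 0.5, steps = 25) {
  grid <- data.frame(seq(min(raw[[var]]), max(raw[[var]]), length.out = steps))
  names(grid) <- var
  for (o in setdiff(vars, var)) {
    grid[[o]] <- quantile(raw[[o]], q, names = FALSE)
  }
  prob <- as.numeric(predict(m, newdata = grid, type = "raw"))
  cbind(grid[var], q = q, prob = prob)
}

# The same relationship in three different settings.
prof <- do.call(rbind, lapply(c(0.1, 0.5, 0.9), function(q) profile("miles", q)))

# Robustness: add measurement noise to one input and track accuracy.
noisy.accuracy <- function(var, sd) {
  noisy <- raw
  noisy[[var]] <- noisy[[var]] + rnorm(nrow(noisy), sd = sd)
  mean(predict(m, newdata = noisy, type = "class") == noisy$Member)
}

robust <- sapply(c(0, 0.5, 1, 2), function(s) noisy.accuracy("miles", s))
print(head(prof))
print(robust)
```

Plotting `prob` against `miles` with one line per `q` shows how the relationship shifts across settings, and `robust` shows how quickly accuracy decays as the measurement gets worse.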

Page 21

Sensitivity and Robustness Example

My code wasn’t working, so thanks to:

https://beckmw.wordpress.com/2013/10/07/sensitivity-analysis-for-neural-networks/

Page 22

More Sensitivity and Robustness

Manual variable permutation in R

library(sensitivity)

Page 23

• Bob’s manager has told him that black box models are not allowed

• But Bob’s neural net performed better than anything else. Oh dear!

Dang!

Page 24

• Bob’s work in neural nets can be leveraged!

• Generically: Prototype selection

• Identify points on the decision boundary to improve model

• Specifically: Extracting decision trees from neural nets

Blackbox to Whitebox
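One simple way to get a tree out of a trained network is a surrogate model: label the data with the network's own predictions and fit a decision tree to those labels. The sketch below shows that idea only. It is a plain surrogate tree, not the specific extraction algorithm or prototype selection method referenced in the talk, and the data and column names are hypothetical.

```r
library(nnet)   # recommended package; provides nnet()
library(rpart)  # recommended package; provides rpart()

set.seed(3)

# Hypothetical stand-in data.
n <- 2000
raw <- data.frame(miles = rnorm(n), flights = rnorm(n), noise = rnorm(n))
raw$Member <- factor(rbinom(n, 1, plogis(1.5 * raw$miles - raw$flights)))

# The "black box".
net <- nnet(Member ~ miles + flights + noise, data = raw, size = 3, trace = FALSE)

# Relabel the data with what the network predicts, not the true outcome.
raw$net.class <- factor(predict(net, newdata = raw, type = "class"),
                        levels = levels(raw$Member))

# The "white box": a tree trained to mimic the network.
tree <- rpart(net.class ~ miles + flights + noise, data = raw, method = "class")

# Fidelity: how often the tree agrees with the network.
tree.class <- as.character(predict(tree, newdata = raw, type = "class"))
fidelity <- mean(tree.class == as.character(raw$net.class))

print(tree)
print(fidelity)
```

Fidelity is worth reporting alongside the tree: a readable tree that disagrees with the network often is explaining a different model.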

Page 25

Blackbox to Whitebox: Methodology

“Extracting Decision Trees from Trained Neural Networks” - Krishnan & Bhattacharya

Also: https://github.com/dvro/scikit-protopy

Page 26

• Bob has shown how variables impact his black box

• He’s shown how they behave in different contexts

• He’s shown how robust they are to errors

• But he hasn’t told us why we should care

Now What?

Page 27

Accuracy, false positive rates, and confusion matrices are CONSTRUCTS

Metrics and Assessment

Page 28

• Enterprises are slow: predict KPIs, not KRIs

• Give confidence bands, sensitivities, and impact of context changes

• Build a story about the model internals and assumptions; tie to domain knowledge of audience

• Explainability is up to the modeler, not the model *

• Unless, of course, your regulator says otherwise!

Conclusions

Page 29

We’re hiring!

http://thinkbig.teradata.com

Thanks!