Dan Mallinger, Data Science Practice Manager, Think Big Analytics at MLconf NYC
TRANSCRIPT
Think Big, Start Smart, Scale Fast
Analytics Communication: Re-Introducing Complex Models
• Director of Data Science at Think Big
• I work in the intersection of statistics and technology
• But also business and analytics
• Too often see data scientists limit themselves and their businesses
Dan Mallinger
1. Importance of Communication
2. Lost Tools of Analytics Communication
3. Tricks for those in Regulated Environments
4. More Communication
Today
Not Today
• Familiar = Clear
• Clear = Explainable
• Explainable = Understood
• Understood = Trustworthy
“Explainable” Model Fallacy
Better Communication Yields…
Bad Communication and Black Boxes…
Why We Should Care: We Won’t Waste Money
Alas, not even a 250Gb server was sufficient: even after waiting three days, the data couldn't even be loaded. […]
Steve said it would be difficult for managers to accept a process that involved sampling.
hlm.html('Test1', test1_score__eoy~test1_score__boy + ...
is_special_ed * perc_free_lunch ...
other_score * support_rec ...
(is_focal | inst_sid), data=kinder)
Technically this is a regression…
So simple anyone can understand it!
Why We Should Care: You Can’t Explain Your Models Anyway
• If your model needs to be re-fit every month, it probably has an eating disorder
• Be a better communicator to yourself
Why We Should Care: Some of Us Don’t Understand Our Models
Meet Bob
• Predicting “Membership” (not really; this is a dummy outcome)
• Pick a “black box” model
• Build understanding
Airline Data
Danger! Does Your Manager Know What Strata Are?!
Manager Doesn’t Trust Samples?
• Easy:
library(nnet)
sapply(1:5000, function(i) {
  rand.rows <- sample.int(nrow(raw), size=10000,
                          replace=TRUE)  # bootstrap: sample WITH replacement
  df <- raw[rand.rows, c(dep.cols, ind.cols)]
  m <- nnet(Member ~ ., data=df, size=10)
})
• Easier:
library(bootstrap)
• Bootstrap!
– Simple, but underused
– Resample data, rebuild models
– Parametric and non-parametric bootstrapping (bias/variance)
Gist of non-parametric: Do it a bunch of times, treat results as distribution for CI
Manager Doesn’t Trust Samples?
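The talk's examples are in R; as a rough, self-contained Python sketch of the non-parametric bootstrap described above (illustrative data and statistic, not from the talk): resample with replacement, recompute the statistic each time, and read a confidence interval off the percentiles of the results.

```python
import random
import statistics

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(500)]  # stand-in dataset

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05):
    """Percentile-method bootstrap CI for `stat` over `data`."""
    n = len(data)
    boots = []
    for _ in range(n_boot):
        resample = [random.choice(data) for _ in range(n)]  # WITH replacement
        boots.append(stat(resample))
    boots.sort()
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot) - 1]

lo, hi = bootstrap_ci(data, statistics.mean)
print(lo, hi)  # the true mean (0) should typically fall inside this interval
```

The same loop works for any refit statistic, including model coefficients or scores, which is how the `nnet` resampling loop on the previous slide turns into a distribution over models.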
Stability of Model
• Bob has convinced his manager that his sampling strategy is acceptable (Good Job, Bob!)
• But he hasn’t built trust in the model
Now What?
Bob Doesn’t Explain Variables Like This…
• If X matters, then shuffling it should hurt our model
• Then bootstrap for confidence intervals
• Most R models have a method for this (see caret)
Shining a light into the parameters of our black box
Variable Importance
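In R, caret (and most model packages) will compute this for you; here is a from-scratch Python sketch of the shuffling idea above, with a toy stand-in model and data (all names illustrative): score the model, shuffle one column, re-score, and call the increase in error that variable's importance.

```python
import random

random.seed(0)
n = 400
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [3 * a + 0.1 * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

def model(a, b):
    # stand-in for a fitted black box (here, the true generating function)
    return 3 * a + 0.1 * b

def mse(preds, truth):
    return sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)

base = mse([model(a, b) for a, b in zip(x1, x2)], y)

def permutation_importance(col):
    # shuffle one input column, re-score, report the increase in error
    cols = [list(x1), list(x2)]
    random.shuffle(cols[col])
    preds = [model(a, b) for a, b in zip(cols[0], cols[1])]
    return mse(preds, y) - base

imp1, imp2 = permutation_importance(0), permutation_importance(1)
print(imp1, imp2)  # shuffling x1 hurts far more than shuffling x2
```

Wrapping this in the bootstrap loop from earlier gives the confidence intervals the slide asks for.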
Shining a light into the parameters of our black box
Variable Importance: Bob’s Data
• Similar to variable importance
• How do relationships in our model play out in different settings?
• How much does our model depend on accurate measurement?
Sensitivity and Robustness
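One simple way to answer "how much does our model depend on accurate measurement?" is to add measurement noise to one input at a time and watch how far predictions move. A minimal Python sketch, with a toy stand-in model (the talk uses R and the `sensitivity` package):

```python
import random

random.seed(1)

def model(a, b):
    # stand-in fitted model: strongly driven by `a`, nearly flat in `b`
    return 5 * a + 0.2 * b

points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(300)]

def sensitivity(col, noise_sd=0.1, reps=500):
    """Mean absolute prediction change when input `col` is re-measured
    with Gaussian error of sd `noise_sd`."""
    total = 0.0
    for _ in range(reps):
        a, b = random.choice(points)
        noisy = [a, b]
        noisy[col] += random.gauss(0, noise_sd)
        total += abs(model(noisy[0], noisy[1]) - model(a, b))
    return total / reps

s0, s1 = sensitivity(0), sensitivity(1)
print(s0, s1)  # predictions move far more under noise in the first input
```

A model whose predictions swing wildly under plausible measurement error is a model the business should not lean on too hard.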
Sensitivity and Robustness Example
My code wasn’t working, so thanks to:
https://beckmw.wordpress.com/2013/10/07/sensitivity-analysis-for-neural-networks/
More Sensitivity and Robustness
Manual variable permutation in R
library(sensitivity)
• Bob’s manager has told him that black box models are not allowed
• But Bob’s neural net performed better than anything else. Oh dear!
Dang!
• Bob’s work in neural nets can be leveraged!
• Generically: Prototype selection
• Identify points on the decision boundary to improve model
• Specifically: Extracting decision trees from neural nets
Blackbox to Whitebox
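The extraction idea, much simplified from Krishnan & Bhattacharya: query the trained black box for labels on probe points, then fit an interpretable model to mimic it. A toy Python sketch where a one-split decision stump stands in for the tree and a threshold function stands in for the neural net (all illustrative):

```python
import random

random.seed(2)

def black_box(x):
    # pretend this is a trained neural net we can only query
    return 1 if x > 0.37 else 0

# 1. generate probe points and label them with the black box
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [black_box(x) for x in xs]

# 2. fit a decision stump: choose the split minimizing disagreement
def fit_stump(xs, ys):
    best_thr, best_err = None, float("inf")
    for thr in sorted(xs):
        err = sum((x > thr) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr

thr = fit_stump(xs, ys)
print(thr)  # recovers a split close to the hidden 0.37
```

The white-box model inherits most of the black box's behavior while staying explainable, which is exactly what Bob's manager asked for.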
Blackbox to Whitebox: Methodology
“Extracting Decision Trees from Trained Neural Networks” - Krishnan & Bhattacharya
Also: https://github.com/dvro/scikit-protopy
• Bob has shown how variables impact his black box
• He’s shown how they behave in different contexts
• He’s shown how robust they are to errors
• But he hasn’t told us why we should care
Now What?
Accuracy, False Positive Rates, and Confusion Matrices are CONSTRUCTS
Metrics and Assessment
• Enterprises are slow: predict KPIs, not KRIs
• Give confidence bands, sensitivities, and impact of context changes
• Build a story about the model internals and assumptions; tie to domain knowledge of audience
• Explainability is up to the modeler, not the model *
• Unless, of course, your regulator says otherwise!
Conclusions
We’re hiring!
http://thinkbig.teradata.com
Thanks!