© 2016 KNIME.com AG. All Rights Reserved.

Data Science for Everyone

Greg Landrum

Rosaria Silipo

KNIME

Introduction to the characters

• The scientist (chemist, business analyst, domain expert, etc.).

– Deep domain knowledge

– Strong analytics needs (questions that need to be answered!)

• The data scientist (analyst, modeler, informatician, etc.)

– Deep knowledge of analytics, data processing

– Knows KNIME (and other tools)

The specific scenario/problem

The scientist:

“I’m trying to discover a new anti-malaria medicine. I’ve got a new dataset from a high-throughput screen against a malaria target. Doing the next experiments is expensive. I want to pick the right compounds from our inventory to try next.”

The scenario/problem

• Given a new dataset, clean it up so that a model can be built

• Build and validate a model from that dataset

• Use the model to prioritize a set of items from a catalog

• Let the user pick from that prioritized list

The steps for doing this

• Cleaning up the data

• Building and validating a model

• Ranking a set of new items from a catalog

• Letting the user pick the items they are interested in

• Providing an Excel file

This is a familiar pattern; we know how to do this.

A guided analytics solution

• The data scientist builds a data preparation and modeling workflow in KNIME capturing their most robust approach along with a solid validation protocol that won’t let a low-quality model pass.

• The data scientist deploys this model as a web application using the KNIME server.

• The scientist can then upload their data, build and validate a model, and then apply it to generate predictions for the items in their catalog in order to decide which experiments to do next.

Data Cleaning

The 80% problem

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

Data Cleaning & Data Scientists

https://twitter.com/mrogati/status/601538814746628096

7 Techniques for Dimensionality Reduction

Column Reduction based on:

1. Missing values
2. High correlation
3. Low standard deviation
4. PCA
5. Infrequent choice in random forest shallow trees
6. Backward Feature Elimination
7. Forward Feature Construction

Whitepaper on KNIME web site https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
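
As a rough illustration of techniques 1 to 3 (missing values, high correlation, low standard deviation), the sketch below filters columns of a small in-memory table. It is not the KNIME workflow from the whitepaper: the function names, the dict-of-lists table layout, and all three thresholds are assumptions chosen for the example, and the standard-deviation filter presumes the columns are on comparable scales.

```python
import math
import statistics


def missing_ratio(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def reduce_columns(table, max_missing=0.4, min_std=0.01, max_corr=0.95):
    """Return the names of the columns that survive three simple filters.

    table: dict mapping column name -> list of numbers (None = missing).
    All thresholds are illustrative assumptions, not values from the slides.
    """
    kept = []
    for name, values in table.items():
        if missing_ratio(values) > max_missing:
            continue  # technique 1: too many missing values
        present = [v for v in values if v is not None]
        if len(present) < 2 or statistics.stdev(present) < min_std:
            continue  # technique 3: (nearly) constant column
        kept.append(name)

    # Technique 2: of each highly correlated pair, keep the earlier column.
    selected = []
    for name in kept:
        redundant = False
        for other in selected:
            pairs = [(a, b) for a, b in zip(table[name], table[other])
                     if a is not None and b is not None]
            if len(pairs) < 2:
                continue  # not enough overlapping rows to compare
            xs, ys = zip(*pairs)
            if abs(pearson(xs, ys)) > max_corr:
                redundant = True
                break
        if not redundant:
            selected.append(name)
    return selected
```

Calling `reduce_columns` on a table returns the surviving column names in their original order, so the result can be used directly to select columns for modeling.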

Dataset Quality Measures

Additional Techniques for Data Dimensionality Reduction:

• Low Skewness

• Outlier Removal

Measure Dataset Quality Before and After:

• Average Error (%) from Cross-Validation

• Normalized Cronbach Alpha
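
For readers unfamiliar with Cronbach's alpha, here is a minimal sketch of the standard formula, alpha = k/(k-1) * (1 - sum of column variances / variance of row totals). The slides do not say how the value is normalized, so this computes the plain coefficient; the function name and the list-of-columns input layout are assumptions made for the example.

```python
import statistics


def cronbach_alpha(columns):
    """Cronbach's alpha for a list of equally long numeric columns.

    Raises ZeroDivisionError if the row totals have zero variance.
    """
    k = len(columns)
    if k < 2:
        raise ValueError("need at least two columns")
    # Sum of the sample variances of the individual columns.
    item_variances = sum(statistics.variance(col) for col in columns)
    # Sample variance of the per-row sums across all columns.
    row_totals = [sum(row) for row in zip(*columns)]
    total_variance = statistics.variance(row_totals)
    return k / (k - 1) * (1 - item_variances / total_variance)
```

Computing the value once on the raw table and once on the cleaned table gives a simple before-and-after comparison.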

Data Cleaning as a Process

• Reliable (cross-domain)

• Repeatable (not automatic)

• Interactive (human expert supervised)

• From a Web Browser (no KNIME expertise)

• On demand

CRM Dataset

• Customer Data

• Task is Upselling prediction

• Product is lawyer insurance

• Lawyer Insurance 0/1 is Target

• If lawyer insurance was bought, a lawyer was assigned a little while later

• 10K data rows x 33 data columns

From KNIME WebPortal: Login

From KNIME WebPortal: Start

From KNIME WebPortal: Upload File

Only .table and .csv files

From KNIME WebPortal: Initial Dataset Quality

From KNIME WebPortal: Missing Values

From KNIME WebPortal: Outliers

From KNIME WebPortal: Low Standard Deviation

From KNIME WebPortal: Low Skewness & High Correlation

From KNIME WebPortal: Final Dataset Quality

From KNIME WebPortal: Back to Refine

From KNIME WebPortal: Final Dataset Quality Again

From KNIME WebPortal: Workflow successful

Workflow

Metanode “Dataset Quality”

Summary of Dataset Quality

Malaria Dataset

• Compound screening data (high-throughput screen against a malaria target)

• Task is Pf3D7_ps_hit = yes/no

• Primary & secondary readouts, SMILES, experiment date, sample

• Many primary readout values are missing (?)

• 6675 data rows x 8 data columns

From KNIME WebPortal: Initial Dataset Quality

From KNIME WebPortal: Missing Values

From KNIME WebPortal: Outliers

From KNIME WebPortal: Low Standard Deviation

From KNIME WebPortal: Low Skewness & High Correlation

From KNIME WebPortal: Final Dataset Quality

That was easy!

Happy scientist!

Model building

The modeling and prediction workflow

Reading the cleaned data and adding the chemistry-specific details

Building a model

Evaluating the model

Ranking and picking new items

Robust learning: use multiple models and representations

• Multiple models:

– Random forest (representation 2)

– Gradient boosting (representation 1)

– Fingerprint Bayes (representation 1)

– Logistic regression (representation 1)

– Logistic regression (representation 2)

• Combine predictions using "model fusion"
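
The slides do not say which fusion rule is used. One common choice, shown here purely as an assumption, is to average the positive-class probabilities that the individual models assign to each item. The function name `fuse_mean` and the input layout are hypothetical.

```python
from statistics import fmean


def fuse_mean(model_probabilities):
    """Average per-item probabilities across models.

    model_probabilities: one list per model, each holding the predicted
    probability of the positive class for the same items in the same order.
    """
    lengths = {len(p) for p in model_probabilities}
    if len(lengths) != 1:
        raise ValueError("all models must score the same items")
    # zip(*...) regroups the scores so each tuple holds one item's scores.
    return [fmean(item_scores) for item_scores in zip(*model_probabilities)]
```

Other rules, such as the median or the maximum, fit the same shape; only the aggregation function inside the list comprehension changes.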

Validation

• The model will be used for ranking new items

• To ensure that it is useful, we will evaluate it on both overall accuracy (using Cohen’s Kappa) and the accuracy of early picks (using enrichment)
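
The two measures named above can be sketched in a few lines. This is a generic illustration, not the Scorer or ROC node implementation: the function names are hypothetical, labels are assumed to be 0/1 with 1 marking a hit, and enrichment is taken to mean the hit rate among the top-ranked items divided by the overall hit rate.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for two equally long lists of class labels.

    Raises ZeroDivisionError when chance agreement is exactly 1
    (for example, when every label belongs to a single class).
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    labels = set(y_true) | set(y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (observed - expected) / (1 - expected)


def enrichment_factor(scores, y_true, top_n):
    """Hit rate among the top_n highest-scored items over the overall hit rate."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(label for _, label in ranked[:top_n])
    return (top_hits / top_n) / (sum(y_true) / len(y_true))
```

A kappa near 0 means agreement no better than chance, and an enrichment factor of 1 means the ranking picks hits no faster than random selection.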

Validation

• Parameters from the Scorer node, adapted to model fusion

• Accuracy parameters from the ROC node

When the model isn’t good enough

Accuracy thresholds are set by the data scientist when building the workflow

The workflow ends here. No sense continuing with a model that's unreliable/misleading.
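
The same gate can be expressed in code as a check that halts processing when the model misses its thresholds. The sketch below is an assumption about how such a gate might look outside KNIME: the exception class, the function name, and the default threshold values are all hypothetical.

```python
class ModelRejected(Exception):
    """Raised when a model misses the preset quality thresholds."""


def check_model_quality(kappa, enrichment, min_kappa=0.4, min_enrichment=2.0):
    """Stop processing unless both metrics reach their thresholds.

    The default thresholds are placeholders; in the workflow they are
    chosen by the data scientist.
    """
    failures = []
    if kappa < min_kappa:
        failures.append(f"kappa {kappa:.2f} < {min_kappa:.2f}")
    if enrichment < min_enrichment:
        failures.append(f"enrichment {enrichment:.2f} < {min_enrichment:.2f}")
    if failures:
        raise ModelRejected("; ".join(failures))
```

Raising an exception, rather than returning a flag, makes it hard for later steps to run by accident on a rejected model.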

Making predictions

Read items from catalog

Generate predictions

Show histogram and ask for number of items to consider

Interactive selection

Download Excel file
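
The ranking and export steps can be sketched as follows. This is only an illustration: the function names and column headers are hypothetical, and CSV stands in for the Excel file because the Python standard library has no Excel writer.

```python
import csv
import io


def pick_top_items(catalog_ids, scores, n):
    """Return the n (item id, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(catalog_ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


def to_csv(rows):
    """Serialize (item id, score) rows as CSV text (stand-in for the Excel export)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["item_id", "score"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Here `n` plays the role of the number of items the user asks to consider after looking at the histogram of predictions.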

Interactive selection

Create images for the table

Create plots

Keep only rows that are selected in the table

The output, Excel at last!

That’s it!

• Whitepapers & workflows for the two different parts coming soon!

• For more information, email: [email protected]

The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States.

KNIME® is also registered in Germany.