1

Analyzing Diabetes with H2O & R Overview

Powered by: Data Science Center of Excellence

By Josh W. Smitherman - Chief Data Scientist

[email protected]

Better Healthcare Insights

2

About Me

Josh W. Smitherman is Chief Data Scientist at Colaberry Inc., where he leads the data science practice, integrating advanced analytical capabilities into clients’ and organizations’ value chain processes. Josh holds a Master’s in Operations Research from Southern Methodist University in Dallas, Texas, a Master Certificate in Applied Statistics from Penn State, and a B.B.A. in Management Information Systems from Texas Tech University. He is a certified Lean Six Sigma Black Belt and has implemented continuous improvement and analytical processes within many organizations. He has worked on data science projects over the last 10 years in retail, CPG, healthcare (clinical trials and patient outcome health), consumer and auto financial services, oil and gas, and supply chain. He has worked with companies like Wal-Mart, GE, Dr. Pepper Snapple Group, Pergo, Mondelez, MyHealth Access Network, CBRE, Costco, Lowe’s, Home Depot, State Farm, UK healthcare networks, and many others.

3

Agenda

• Review technical paper & Colaberry Data Science Center of Excellence
• Overview of H2O.ai and R: how they work together, benefits, what can be done, etc.
• Analytical process of predicting diabetes
• Highlight key findings, analytical capabilities used, and other applications

4

Disclaimer

This may go without saying, but just in case: the results of the analysis demonstrated in this document do not in any way suggest that anyone should self-diagnose or take any medical or other action based on them. Please see a certified and licensed physician for any medical concerns or questions you might have.

5

Review Technical Paper & Colaberry Data Science Center of Excellence

Recent publication: “Analyzing Diabetes with H2O & R (feat. Plotly)” by Josh W. Smitherman, Chief Data Scientist

• The paper walks through the technical aspects of performing machine learning and predictive analytics on diabetes cases using R and H2O.ai
• The paper deploys machine learning and visualization methods using:
  • K-means clustering of patients
  • H2O.ai deep learning, gradient boosting machine, generalized linear models, and distributed random forest
  • Visualization using the “plotly” package, which gives charts a D3.js look and feel
• What’s different about this paper is that it combines several areas of machine learning and predictive analytics to tackle the problem of identifying groups of individuals at risk of developing diabetes

The Colaberry Data Science Center of Excellence is an approach and methodology used to deploy integrated data science and advanced analytical capabilities within an organization’s processes, applications, and strategies.

• Domain Expertise
• Best Practices
• Delivery Models
• 360° Support

6

Overview of H2O.ai and R

What is R?

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S

language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by

John Chambers and colleagues. R can be considered as a different implementation of S. There are some important

differences, but much code written for S runs unaltered under R.

From R Project Intro Page: https://www.r-project.org/about.html

RStudio is an R IDE (Integrated Development Environment) that aids in the

development of R scripts and programs.

Why use R?

Cutting-edge machine learning and predictive analytics packages usually show up first in R due to its rapid deployment and community of contributors

One of the largest communities of contributors, including companies, large research institutes and universities, practitioners, and PhDs in a variety of statistical, engineering, business, and other quantitative areas

Easily deploy models on local machines, web applications, distributed systems like Spark and Hadoop, various operating systems, and much more

It’s FREE!!!

7


What is H2O.ai?

H2O.ai is an in-memory engine that lets you run analytics from R (and other languages such as Python, Java, etc.) in a distributed fashion. This helps when you need to run large or computationally intensive algorithms for quicker and deeper analytics than a single node allows. H2O.ai has a nice set of pre-built algorithms for clustering, gradient boosting machine, random forest, deep learning, and others. Another appeal of this environment is its programming API, which supports fast computation and building applications on top of the analytical results.

H2O.ai has developed a workflow tool and web interface that allows users to develop models on the H2O.ai platform. It is similar to IPython Notebook, which allows you to add code, graphs, comments, etc. to an analytical project.

Why use H2O.ai?

Run models in a parallelized fashion that speeds up processing and returns results quickly vs. running on traditional local machines (without in-memory optimization)

Algorithms such as deep learning, random forest, and gradient boosting machines can be run on H2O.ai’s distributed in-memory process for fast, accurate modeling results

Open source and integrates right into R, Python, Java, and others

8


How Do H2O.ai & R Work Together?

H2O.ai is a platform that R runs on top of. That means you execute R code in RStudio or the R console (or any other IDE), but you initiate an H2O session and make calls to H2O functions within it.

You can run H2O within your R interface on a local laptop, or H2O can be installed on a server, a Hadoop/Spark cluster, etc. The process is still the same: you execute R-style code in your R session, but the work runs within the H2O platform.

9

Analytical Process of Predicting Diabetes

Data Source – CDC & NHANES

The National Health and Nutrition Examination Survey (NHANES), gathered, synthesized, and published by the US Centers for Disease Control and Prevention (CDC), contains a variety of data points on several health-related topics.

Within R, we are able to load the NHANES data into the H2O platform and run a variety of analyses on the diabetes data set.

10


Data Source – NHANES Structure & Analysis Goals

The NHANES data set contains various health and socioeconomic data points on a variety of cases.

The goal of our analysis is to use these variables/features to explain and predict the factors behind higher diabetes risk in various groups of cases.

Using R + H2O enables us to process these variables and cases in a fast, scalable way while performing advanced analytical processes on the data set.

Example variables: Gender, Race, Education, Income / Poverty, Weight, Height, Pulse, Blood Pressure, Alcohol Usage, Cholesterol, Pregnancy, Depression, Narcotics, Smoking, Diabetes

Structure Stats
Number of Variables: 78
Number of Rows: 20,293

11


Exploratory Data Analysis: Prevalence of Diabetes

The proportion of diabetes found within each group observed:

• Overall Proportion of Diabetes Cases (Yes = 14%)
• Gender Proportion of Diabetes Cases (Male, Female)
• Education Proportion of Diabetes Cases
• Race Proportion of Diabetes Cases

12


Exploratory Data Analysis: Missing Value Issue within NHANES

The NHANES data has a high degree of missing values across many different variables.

Since 32 out of the 78 total variables (~41%) contain 50% or more NAs, we will need to formulate a strategy for how we can go forward with our EDA, clustering, and predictive analytics.

To get a good understanding of the population health risk of diabetes, we must infer some data points from other values. This process is known as data imputation, and is discussed next.

variable          Count_NA  Percentage   variable          Count_NA  Percentage   variable          Count_NA  Percentage
Length            12382     100%         Alcohol12PlusYr   2084      17%          HomeRooms         94        1%
HeadCirc          12382     100%         LittleInterest    1879      15%          HomeOwn           82        1%
TVHrsDayChild     12382     100%         Depressed         1873      15%          SleepHrsNight     28        0%
CompHrsDayChild   12382     100%         DaysMentHlthBad   1801      15%          SleepTrouble      4         0%
AgeRegMarij       10562     85%          DaysPhysHlthBad   1795      14%          Work              2         0%
UrineFlow2        10491     85%          HealthGen         1780      14%          PhysActive        1         0%
UrineVol2         10483     85%          BPSys1            1409      11%          ID                0         0%
PregnantNow       9772      79%          BPDia1            1409      11%          SurveyYr          0         0%
Age1stBaby        9226      75%          UrineFlow1        1398      11%          Gender            0         0%
RegularMarij      8672      70%          HHIncome          1359      11%          Age               0         0%
AgeFirstMarij     8670      70%          HHIncomeMid       1359      11%          Race1             0         0%
nBabies           8445      68%          DirectChol        1230      10%          Diabetes          0         0%
nPregnancies      8182      66%          TotChol           1230      10%          WTINT2YR          0         0%
SmokeAge          7338      59%          BPSys2            1226      10%          WTMEC2YR          0         0%
Testosterone      7279      59%          BPDia2            1226      10%          SDMVPSU           0         0%
SmokeNow          7154      58%          BPSys3            1203      10%          SDMVSTRA          0         0%
TVHrsDay          6530      53%          BPDia3            1203      10%
CompHrsDay        6526      53%          Poverty           1186      10%
PhysActiveDays    6507      53%          BPSysAve          964       8%
SexOrientation    5538      45%          BPDiaAve          964       8%
AlcoholDay        5393      44%          Pulse             942       8%
SexNumPartYear    5345      43%          UrineVol1         644       5%
Marijuana         5313      43%          BMI_WHO           637       5%
SexAge            4251      34%          Education         632       5%
SexNumPartnLife   3855      31%          MaritalStatus     624       5%
SameSex           3751      30%          Smoke100          619       5%
SexEver           3749      30%          BMI               578       5%
HardDrugs         3746      30%          Height            559       5%
AlcoholYear       3555      29%          Weight            557       4%
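
A missing-value summary like the table above can be produced by counting NAs per variable and dividing by the number of rows. The sketch below shows the idea in plain Python rather than the R + H2O workflow used in the deck; the records and the `missing_summary` helper are hypothetical, and `None` stands in for R's `NA`.

```python
# Hypothetical toy records; None marks a missing value (NA in R).
records = [
    {"Age": 34, "BMI": 27.1, "SmokeAge": None},
    {"Age": 51, "BMI": None, "SmokeAge": None},
    {"Age": 8, "BMI": 16.4, "SmokeAge": None},
    {"Age": 62, "BMI": 31.0, "SmokeAge": 17},
]

def missing_summary(rows):
    """Return (variable, count_na, fraction_na) tuples, most-missing first."""
    n = len(rows)
    summary = []
    for var in rows[0].keys():
        count_na = sum(1 for row in rows if row[var] is None)
        summary.append((var, count_na, count_na / n))
    return sorted(summary, key=lambda item: item[1], reverse=True)

summary = missing_summary(records)
for var, count_na, frac in summary:
    print(f"{var:<10}{count_na:>5}{frac:>8.0%}")

# Variables at or above the 50% missing threshold discussed on this slide.
mostly_missing = [var for var, _, frac in summary if frac >= 0.5]
```

Sorting by the NA count mirrors the ordering of the table and makes the heavily missing variables easy to pick out.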

13


Exploratory Data Analysis: Data Imputation

There are situations in data analytics projects where missing values are not relevant and can therefore be ignored without any consequences for the analytical results. Then there are situations where ignoring missing values is not possible if the analysis is to be completed. The latter is the situation with the NHANES data set and the analysis of diabetes. What is needed is a method called data imputation, which is defined next.

Definition from Wikipedia

“the process of replacing missing data with substituted values. When substituting for a

data point, it is known as "unit imputation"; when substituting for a component of a data

point, it is known as "item imputation". Because missing data can create problems for

analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise

deletion of cases that have missing values. That is to say, when one or more values

are missing for a case, most statistical packages default to discarding any case that

has a missing value, which may introduce bias or affect the representativeness of the

results. Imputation preserves all cases by replacing missing data with an estimated

value based on other available information. Once all missing values have been

imputed, the data set can then be analysed using standard techniques for complete

data.”

[Diagram labels: Impute Function, Missing Values]
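
The deck does not state which imputation method was used, so the following is only a minimal sketch of the simplest variant, mean imputation, in plain Python. The `impute_mean` helper and the cholesterol readings are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical total-cholesterol readings with two missing entries.
tot_chol = [4.0, None, 5.0, 6.0, None]
imputed = impute_mean(tot_chol)
```

Mean imputation leaves the variable's mean unchanged but shrinks its spread, which is one reason to compare the distributions before and after imputing.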

14


Exploratory Data Analysis: Measure of Distribution

From the distribution it appears the imputation did not alter the overall structure. That is what we want to see.

[Distribution plot panels: Total Cholesterol, Weight, Weight]

15


Exploratory Data Analysis: Dot Plot of Variables, Which are Higher for Diabetes?

16


Exploratory Data Analysis: Dot Plot of Variables, Which are Higher for Diabetes?

In the dot chart we can see all the numeric variables on the right. The blue dots show the mean of each variable for those with diabetes, and the green dots show the mean for those without diabetes.

The scale is read relative to the overall mean, which is 1; any other value is above or below the overall mean. For example, age for the green dot (no diabetes) is at ~0.96, which means 4% below average, while the blue dot (with diabetes) is at 1.28, indicating they are 28% higher in age than the average.

Compared to the total average, those with diabetes are higher or lower on the following variables:

Positive Related Factors
• Weight (+10%)
• Sex Number of Partners (+22%)
• Number of Pregnancies (+22%)
• Number of Babies (+23%)
• Days of Physical Bad Health (+70%)
• Days with Mentally Bad Health (+17%)
• All Systolic Blood Pressure Readings (+6%)
• BMI (+12%)

Negative Related Factors
• Alcohol Days (-13%)
• Alcohol Years (-44%)
• Testosterone (-14%)
• All Diastolic Blood Pressure Readings (-3%)
• Urine Volume Readings (-5%)
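
The scale used in the dot chart is each group's mean divided by the overall mean. A minimal sketch in plain Python, with hypothetical ages chosen so that the ratios match the 0.96 and 1.28 read off the chart for age:

```python
from statistics import mean

# Hypothetical (has_diabetes, age) pairs, for illustration only.
cases = [(True, 64), (False, 48), (False, 40), (False, 56),
         (False, 44), (False, 52), (False, 50), (False, 46)]

def relative_means(pairs):
    """Return each group's mean divided by the overall mean."""
    overall = mean(value for _, value in pairs)
    groups = {}
    for label, value in pairs:
        groups.setdefault(label, []).append(value)
    return {label: mean(values) / overall for label, values in groups.items()}

ratios = relative_means(cases)
print(ratios)
```

A ratio of 1 means the group sits exactly at the overall mean, so each percentage in the lists above is simply the ratio minus 1.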

17

Clustering

18

Clustering Cases – Diabetes

What is Clustering? Why use it?

We want to cluster observations into groups so that we can carry out different activities for each group and better meet our objectives. For example, in our case we want to identify groups with certain characteristics, so that once we know which groups have higher or lower rates of diabetes we can direct preventive care activities at a subset of individuals versus the whole enchilada.

We will use the popular k-means method to perform our clustering within H2O. K-means partitions observations around what is called a centroid, which is the calculated center of a group. It sets K points as initial centroids, then loops through multiple iterations to find the optimal center point for each group. In machine learning and statistics this type of process is called an unsupervised learning algorithm, because we never give the algorithm labeled training or testing data to learn from and tune against. It simply attempts to find the optimal points by minimizing the sum of squared errors over all the observations, based on a distance metric (Euclidean, Manhattan, etc.).

Suppose we had the data to the left plotted in a chart. We can use the k-means method to find the center points and assign a unique cluster identifier to each group, partitioning each observation into a homogeneous cluster group. The marked point in each group is the center point the k-means method assigned to that group.
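
The assign-then-recenter loop described above can be sketched in a few lines of plain Python. This is not H2O's implementation: it assumes Euclidean distance and a naive initialization that takes the first K points as centroids, and the 2-D points are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm: assign to nearest centroid, then re-average."""
    centroids = list(points[:k])  # naive init: first k points (an assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    # Sum of squared distances from each point to its assigned centroid.
    sse = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, sse

# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels, sse = kmeans(points, k=2)
```

The returned `sse` is the quantity k-means tries to minimize, and it is also what gets compared across values of K when choosing the number of clusters.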

19


How to determine the right number of clusters in H2O?

• What we are looking for is the “elbow” of the curve, which determines the number of clusters
• Cycle through many values of K: up to the “elbow” the drop in the sum of squared distances is significant, but beyond it there really isn’t much improvement
• We can see in our case we have our “elbow” around K = 40

H2O.ai incredibly fast processing: only 13 min with H2O, versus the 20 hours it would have taken without H2O!!!
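
The elbow is usually read off the chart by eye, but one simple rule of thumb can be sketched as follows: pick the first K after which the relative drop in the sum of squared distances falls below a threshold. The `find_elbow` helper, the 10% threshold, and the values are all hypothetical; the numbers were chosen so the elbow lands at K = 40 as on the slide.

```python
def find_elbow(sse_by_k, min_gain=0.10):
    """Return the first k after which the relative drop in SSE falls below min_gain."""
    ks = sorted(sse_by_k)
    for current, following in zip(ks, ks[1:]):
        gain = (sse_by_k[current] - sse_by_k[following]) / sse_by_k[current]
        if gain < min_gain:
            return current
    return ks[-1]

# Hypothetical within-cluster sum of squares for a few values of K.
sse_by_k = {10: 900.0, 20: 600.0, 30: 420.0, 40: 300.0, 50: 285.0, 60: 275.0}
elbow = find_elbow(sse_by_k)
```

The threshold is a judgment call; a stricter or looser value moves the chosen K, which is why the chart is still worth inspecting.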

20


Cluster sizes:

• Graph shows each cluster and its number of observations
• Cluster sizes vary widely
• We will need to filter some out or re-cluster
• Usually we want to keep them, since as we scale up they will potentially gain more observations

21


Which clusters are about the same and which are really different?

• Use multidimensional scaling (MDS) to combine the variables into a single metric by cluster
• Each variable’s centers are on different scales
• Normalized and fitted to a single metric
• Determine the distance between clusters
• Example: clusters 17 and 26 are very close together
• Example: clusters 39 and 4 are very different groups
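
MDS starts from a matrix of pairwise distances between the cluster centers. The sketch below covers only that first step in plain Python: it standardizes each variable across clusters and computes Euclidean distances, and it does not perform the MDS projection itself. The cluster ids reuse those named above, but the center values are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize_columns(centers):
    """Z-score each variable across clusters so no single scale dominates."""
    columns = list(zip(*centers.values()))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return {
        cid: tuple((v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats))
        for cid, row in centers.items()
    }

# Hypothetical cluster centers as (age, weight in kg); values are illustrative.
centers = {17: (45.0, 80.0), 26: (47.0, 82.0), 39: (70.0, 95.0), 4: (12.0, 40.0)}
scaled = standardize_columns(centers)
ids = sorted(scaled)
distance = {(a, b): math.dist(scaled[a], scaled[b]) for a in ids for b in ids if a < b}
closest = min(distance, key=distance.get)
farthest = max(distance, key=distance.get)
```

With these made-up centers, clusters 17 and 26 come out as the closest pair and clusters 4 and 39 as the farthest, which is the kind of reading the MDS chart supports.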

22


Proportion of Diabetes Across Each Cluster

• Graph shows the proportion of diabetes within each cluster
• Clusters 5, 18, 30, and 31 seem to have high levels of diabetes
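
Both the cluster sizes and the proportion of diabetes per cluster come from a simple group-by over the cluster assignments. A minimal sketch in plain Python, with a hypothetical `diabetes_rate_by_cluster` helper and made-up assignments:

```python
from collections import defaultdict

def diabetes_rate_by_cluster(assignments):
    """Map cluster id -> (size, share of cases with diabetes)."""
    counts = defaultdict(lambda: [0, 0])  # cluster id -> [size, diabetes cases]
    for cluster_id, has_diabetes in assignments:
        counts[cluster_id][0] += 1
        counts[cluster_id][1] += int(has_diabetes)
    return {cid: (size, yes / size) for cid, (size, yes) in counts.items()}

# Hypothetical (cluster id, has_diabetes) pairs.
assignments = [(5, True), (5, True), (5, False), (12, False), (12, False),
               (12, False), (12, True), (18, True)]
rates = diabetes_rate_by_cluster(assignments)
```

Keeping the size next to the rate is useful because a very small cluster can show an extreme proportion from only a handful of cases.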

23

Predicting Diabetes

24


Predictive Models in H2O by Cluster for Variable Importance

1) Distributed Random Forest:
• An ensemble decision tree method
• The model combines the predictions from multiple decision trees
• Each tree gets a vote, as in the bagging method
• It is distributed in the sense that the decision tree process is parallelized across the nodes of the H2O cluster

2) Gradient Boosting Machine:
• Can be an ensemble of either regression or classification tree models; we will use classification tree GBM
• Is an extension of “multiple additive regression trees” (MART) that combines two methods, gradient optimization of a loss function and boosting of “weak” classifiers, producing a committee-based approach
• Uses distributed trees as well

3) Generalized Linear Model:
• A general regression toolkit for fitting linear models
• Since our target is binary (diabetes found, not found), we will use a logistic regression approach
• Uses likelihood optimization in fitting the model parameters

4) Deep Learning:
• Similar to neural networks: deep learners are feedforward neural networks with many layers and can fall into other categories such as Deep Belief Network (DBN), Deep Neural Network (DNN), etc.
• Weights are adapted to minimize the error on the labeled training data

[Diagram labels: Cluster Groups; GBM, DRF, GLM, DL; Important Variables; Target: Diabetes]
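
To make "likelihood optimization" concrete for the logistic regression case, here is a minimal sketch that fits a one-variable model by gradient ascent on the log-likelihood. It is plain Python with hypothetical data and is not how H2O fits its GLM; the learning rate and epoch count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-likelihood with respect to b0 and b1.
        grad0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        grad1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += learning_rate * grad0
        b1 += learning_rate * grad1
    return b0, b1

# Hypothetical standardized BMI values and diabetes labels (1 = yes).
bmi_z = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
diabetes = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(bmi_z, diabetes)
print(f"P(diabetes | BMI z-score of 2.0) = {sigmoid(b0 + b1 * 2.0):.2f}")
```

A positive `b1` means the predicted probability of diabetes rises with the predictor, which is the kind of signal a variable importance ranking summarizes across many variables.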

25


Predictive Models in H2O by Cluster for Variable Importance + Model Performance

• Includes variable importance, percentage of importance, and diabetes = “Yes” error rate

Summary of Loss Matrix

Normal Balanced Problems

Actual \ Pred   No    Yes
No              98%   2%
Yes             5%    95%

This is a case where classes are well balanced.

Class Imbalance Problems

Actual \ Pred   No    Yes
No              98%   2%
Yes             40%   60%

This is typical with class imbalance. The model predicts the No’s well but performs poorly on the imbalanced class, Yes.
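
The row percentages in these matrices are each cell's count divided by the total for that actual class. The sketch below uses hypothetical counts, chosen to reproduce the class-imbalance percentages, and shows why overall accuracy can look fine while the "Yes" error rate is poor.

```python
def row_rates(confusion):
    """Convert actual-by-predicted counts into row proportions."""
    rates = {}
    for actual, predicted_counts in confusion.items():
        total = sum(predicted_counts.values())
        rates[actual] = {pred: count / total for pred, count in predicted_counts.items()}
    return rates

# Hypothetical counts: outer key = actual class, inner key = predicted class.
confusion = {
    "No": {"No": 980, "Yes": 20},
    "Yes": {"No": 40, "Yes": 60},
}
rates = row_rates(confusion)
yes_error_rate = rates["Yes"]["No"]
overall_accuracy = (980 + 60) / (980 + 20 + 40 + 60)
```

With these counts the overall accuracy is above 94% even though 40% of the actual "Yes" cases are missed, which is why the per-class error rate is reported for diabetes = “Yes”.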

26


Diabetes Risk by Cluster

• Includes variable importance, percentage of importance, and diabetes = “Yes” error rate

27

Summary of the Analytics

What did we do in this walkthrough?

1st

2nd

3rd

4th
