1
Analyzing Diabetes with H2O & R Overview
Powered by: Data Science Center of Excellence
By Josh W. Smitherman - Chief Data Scientist
Better Healthcare Insights
2
About Me
Josh W. Smitherman is Chief Data Scientist at Colaberry Inc., where he leads the data
science practice of integrating advanced analytical capabilities into clients’ and
organizations’ value chain processes. Josh holds a Master’s in Operations Research from
Southern Methodist University in Dallas, Texas, a Master Certificate in Applied Statistics from
Penn State, and a B.B.A. in Management Information Systems from Texas Tech University.
He is a certified Lean Six Sigma Black Belt and has implemented continuous improvement
and analytical processes within many organizations. He has worked on data science
projects over the last 10 years in retail, CPG, healthcare (clinical trials and patient outcome
health), consumer and auto financial services, oil and gas, and supply chain. He has
worked with companies like Wal-Mart, GE, Dr. Pepper Snapple Group, Pergo, Mondelez,
MyHealth Access Network, CBRE, Costco, Lowe’s, Home Depot, State Farm, UK
healthcare networks, and many others.
3
Agenda
Review technical paper & Colaberry Data Science Center of Excellence
Overview of H2O.ai and R, how they work together, benefits, what can be done, etc.
Analytical process of predicting Diabetes
Highlight key findings, analytical capabilities used, and other applications
4
Disclaimer
This may go without saying, but just in case: the results of the analysis demonstrated in this document do not in any way suggest that anyone should self-diagnose or take any medical or other action as a result. Please see a certified and licensed physician for any medical concerns or questions you might have.
5
Review Technical Paper & Colaberry Data Science Center of Excellence
Recent publication of “Analyzing Diabetes with H2O & R (feat. Plotly)” by Josh
W. Smitherman, Chief Data Scientist
• The paper walks through the technical aspects of performing machine learning
and predictive analytics on diabetes cases using R and H2O.ai
• The paper deploys machine learning & visualization methods using:
• K-means clustering of patients
• H2O.ai deep learning, gradient boosting machine, generalized linear
models, and distributed random forest
• Visualization using the “plotly” package, which gives charts a D3.js look
and feel
• What’s different about this paper is that it combines several areas of machine
learning and predictive analytics to tackle the problem of identifying groups of
individuals at risk of developing diabetes.
Colaberry Data Science Center of Excellence is an approach and methodology
used to deploy integrated data science and advanced analytical
capabilities within an organization’s processes, applications, and strategies.
• Domain Expertise
• Best Practices
• Delivery Models
• 360° Support
6
Overview of H2O.ai and R
What is R?
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S
language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by
John Chambers and colleagues. R can be considered as a different implementation of S. There are some important
differences, but much code written for S runs unaltered under R.
From R Project Intro Page: https://www.r-project.org/about.html
RStudio is an R IDE (Integrated Development Environment) that aids in the
development of R scripts and programs.
Why use R?
Cutting-edge machine learning and predictive analytics packages
usually show up first in R due to its rapid deployment and community of
contributors
Largest community of contributors, including companies, large
research institutes and universities, practitioners, and PhDs in a variety
of statistical, engineering, business, and other quantitative areas
Easily deploy models on local machines, web applications, distributed
systems like Spark and Hadoop, various operating systems, and much more
It’s FREE!!!
7
Overview of H2O.ai and R
What is H2O.ai?
H2O.ai is an in-memory engine that lets you run algorithms from R (and other languages such as Python, Java, etc.) in a
distributed fashion. This helps when you need to run large or computationally heavy algorithms for quicker and deeper
analytics versus a single node. H2O.ai has a nice set of pre-built algorithms for clustering, gradient boosting machines,
random forest, deep learning, and others. Another great appeal of this environment is its programming API, which
supports fast computation and building applications on top of the analytical results.
H2O.ai has developed a workflow tool and interface that allows users to develop
models on the H2O.ai platform on the web. It is similar to IPython Notebook, which
allows you to add code, graphs, comments, etc. to an analytical project.
Why use H2O.ai?
Run models in a parallelized fashion that speeds up processing and returns
results quickly vs. running on traditional local machines (without in-memory
optimization)
Great algorithms such as deep learning, random forest, and gradient boosting
machines can be run on the H2O.ai distributed in-memory process for fast,
accurate modeling results
Open source and integrates right into R, Python, Java, and others
8
Overview of H2O.ai and R
How does H2O.ai & R Work Together?
H2O.ai is a platform that R code runs against. What that means is that you
execute R code in RStudio or an R console window (or any other IDE), but
you initiate an H2O session and make calls to H2O functions from it.
You can run H2O within your R interface on a local laptop, or H2O can
be installed on a server, a Hadoop/Spark cluster, etc. The process is
still the same: you write R-style code in your R session, but the work
executes within the H2O platform.
9
Analytical Process of Predicting Diabetes
Data Source – CDC & NHANES
The National Health and Nutrition Examination Survey (NHANES), with data
gathered, synthesized, and published by the US Centers for Disease
Control and Prevention (CDC), contains a variety of data points around
several health-related aspects.
Within R, we are able to load NHANES data into the H2O platform and run
a variety of capabilities on the Diabetes data set.
10
Analytical Process of Predicting Diabetes
Data Source – NHANES Structure & Analysis Goals
The NHANES data set contains various health and
socioeconomic data points on a variety of cases.
The goal in our analysis is to use these variables/features to
explain and predict the factors associated with higher diabetes
risk in various groups of cases.
Using R + H2O will enable us to process these variables
and cases in a fast, scalable way while performing advanced
analytical processes on the data set.
Variables include: Gender, Race, Education, Income / Poverty, Weight, Height, Pulse, Blood Pressure, Alcohol Usage, Cholesterol, Pregnancy, Depression, Narcotics, Smoking, Diabetes
Structure Stats
Number of Variables: 78
Number of Rows: 20,293
11
Analytical Process of Predicting Diabetes
Exploratory Data Analysis: Prevalence of Diabetes
The proportion of diabetes found within each group observed:
Overall Proportion of Diabetes Cases (Yes = 14%)
Gender Proportion of Diabetes Cases (Male, Female)
Education Proportion of Diabetes Cases
Race Proportion of Diabetes Cases
12
Analytical Process of Predicting Diabetes
Exploratory Data Analysis: Missing Value Issue within NHANES
The NHANES data has a high degree of missing values across
many different variables.
Since 32 out of the 78 total variables (~41%) contain 50% or
more NAs, we will need to formulate a
strategy for how to go forward with our EDA, clustering,
and predictive analytics.
To get a good understanding of the population health risk of
diabetes, we must infer some data points based on other values.
This process is known as data imputation, and is discussed
next.
variable Count_NA Percentage variable Count_NA Percentage variable Count_NA Percentage
Length 12382 100% Alcohol12PlusYr 2084 17% HomeRooms 94 1%
HeadCirc 12382 100% LittleInterest 1879 15% HomeOwn 82 1%
TVHrsDayChild 12382 100% Depressed 1873 15% SleepHrsNight 28 0%
CompHrsDayChild 12382 100% DaysMentHlthBad 1801 15% SleepTrouble 4 0%
AgeRegMarij 10562 85% DaysPhysHlthBad 1795 14% Work 2 0%
UrineFlow2 10491 85% HealthGen 1780 14% PhysActive 1 0%
UrineVol2 10483 85% BPSys1 1409 11% ID 0 0%
PregnantNow 9772 79% BPDia1 1409 11% SurveyYr 0 0%
Age1stBaby 9226 75% UrineFlow1 1398 11% Gender 0 0%
RegularMarij 8672 70% HHIncome 1359 11% Age 0 0%
AgeFirstMarij 8670 70% HHIncomeMid 1359 11% Race1 0 0%
nBabies 8445 68% DirectChol 1230 10% Diabetes 0 0%
nPregnancies 8182 66% TotChol 1230 10% WTINT2YR 0 0%
SmokeAge 7338 59% BPSys2 1226 10% WTMEC2YR 0 0%
Testosterone 7279 59% BPDia2 1226 10% SDMVPSU 0 0%
SmokeNow 7154 58% BPSys3 1203 10% SDMVSTRA 0 0%
TVHrsDay 6530 53% BPDia3 1203 10%
CompHrsDay 6526 53% Poverty 1186 10%
PhysActiveDays 6507 53% BPSysAve 964 8%
SexOrientation 5538 45% BPDiaAve 964 8%
AlcoholDay 5393 44% Pulse 942 8%
SexNumPartYear 5345 43% UrineVol1 644 5%
Marijuana 5313 43% BMI_WHO 637 5%
SexAge 4251 34% Education 632 5%
SexNumPartnLife 3855 31% MaritalStatus 624 5%
SameSex 3751 30% Smoke100 619 5%
SexEver 3749 30% BMI 578 5%
HardDrugs 3746 30% Height 559 5%
AlcoholYear 3555 29% Weight 557 4%
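A missing-value summary like the table above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R/H2O code from the paper; the function name `missing_summary` and the toy records are hypothetical, and `None` stands in for R's NA.

```python
def missing_summary(rows, columns):
    """Return [(column, na_count, na_percent)] sorted by na_count, largest first."""
    total = len(rows)
    summary = []
    for col in columns:
        # Count the records where this variable is missing.
        na_count = sum(1 for row in rows if row.get(col) is None)
        percent = round(100 * na_count / total) if total else 0
        summary.append((col, na_count, percent))
    summary.sort(key=lambda item: item[1], reverse=True)
    return summary


# Hypothetical toy records, not NHANES data; None marks a missing value.
rows = [
    {"Age": 34, "BMI": 27.1, "SmokeAge": None},
    {"Age": 51, "BMI": None, "SmokeAge": None},
    {"Age": 8, "BMI": 16.4, "SmokeAge": None},
    {"Age": 62, "BMI": 31.0, "SmokeAge": 17},
]
summary = missing_summary(rows, ["Age", "BMI", "SmokeAge"])

# Variables with 50% or more missing values, the threshold used on this slide.
high_na = [col for col, _, pct in summary if pct >= 50]
```

Filtering on the 50% threshold gives the list of variables that need an explicit strategy before clustering or modeling.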
13
Analytical Process of Predicting Diabetes
Exploratory Data Analysis: Data Imputation
There are situations in data analytics projects where missing values are not
relevant and can therefore be ignored without any consequences for the analytical
results. Then there are situations where ignoring missing values is not
possible if the analysis is to be completed. The latter is the situation with
the NHANES data set and the analysis of diabetes. What is needed is a method
called data imputation, which is defined next.
Definition from Wikipedia
“the process of replacing missing data with substituted values. When substituting for a
data point, it is known as "unit imputation"; when substituting for a component of a data
point, it is known as "item imputation". Because missing data can create problems for
analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise
deletion of cases that have missing values. That is to say, when one or more values
are missing for a case, most statistical packages default to discarding any case that
has a missing value, which may introduce bias or affect the representativeness of the
results. Imputation preserves all cases by replacing missing data with an estimated
value based on other available information. Once all missing values have been
imputed, the data set can then be analysed using standard techniques for complete
data.”
Impute Function
Missing Values
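To make the idea concrete, here is a minimal sketch of one simple form of imputation: replacing each missing numeric value with the mean of the observed values. The slides do not state which imputation method the paper used, so mean imputation is an assumption chosen for illustration, and `impute_mean` and the sample weights are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to estimate from (e.g. a 100% missing variable); leave as-is.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical weights in kg; None marks a missing value.
weights = [70.0, None, 80.0, 90.0, None]
imputed = impute_mean(weights)
```

Mean imputation keeps the variable's mean unchanged but shrinks its spread, which is one reason to compare the distributions before and after imputing.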
14
Analytical Process of Predicting Diabetes
Exploratory Data Analysis: Measure of Distribution
From the distribution it appears the imputation did not alter the overall structure. That is what we want to see.
Total Cholesterol Weight Weight
15
Analytical Process of Predicting Diabetes
Exploratory Data Analysis: Dot Plot of Variables, Which are Higher for Diabetes?
16
Analytical Process of Predicting Diabetes
Exploratory Data Analysis: Dot Plot of Variables, Which are Higher for Diabetes?
In the dot chart, all the numeric variables are listed on the
right. The blue dots show the mean of each variable for those
with diabetes, and the green dots show the mean for those without
diabetes.
The scale is read relative to the overall mean, which is 1; values
above or below 1 are above or below the overall mean. For example,
age for the green dot (no diabetes) is at ~0.96, which means
4% below average, while the blue dot (with diabetes) is at 1.28,
which means 28% above the average age.
Compared to the overall average, those with diabetes are higher or
lower on the variables below:
Positively Related Factors
• Weight (+10%)
• Number of Sex Partners (+22%)
• Number of Pregnancies (+22%)
• Number of Babies (+23%)
• Days of Physical Bad Health (+70%)
• Days with Mentally Bad Health (+17%)
• All Systolic Blood Pressure Readings (+6%)
• BMI (+12%)
Negatively Related Factors
• Alcohol Days (-13%)
• Alcohol Years (-44%)
• Testosterone (-14%)
• All Diastolic Blood Pressure Readings (-3%)
• Urine Volume Readings (-5%)
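The scale used in the dot chart, each group's mean divided by the overall mean, can be sketched as follows. This is a plain-Python illustration, not the paper's R code; `relative_means` and the four toy records are hypothetical.

```python
from statistics import mean


def relative_means(records, variable, group_key="Diabetes"):
    """Return {group: group mean / overall mean} for one numeric variable."""
    overall = mean(r[variable] for r in records)
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r[variable])
    return {g: mean(vals) / overall for g, vals in groups.items()}


# Hypothetical toy records, not NHANES data.
records = [
    {"Diabetes": "No", "Age": 30},
    {"Diabetes": "No", "Age": 40},
    {"Diabetes": "No", "Age": 50},
    {"Diabetes": "Yes", "Age": 80},
]
ratios = relative_means(records, "Age")
# A ratio of 1.0 is the overall mean; 1.28 would read as 28% above it.
```

Dividing by the overall mean puts variables with very different units on one comparable axis, which is what lets age, BMI, and blood pressure share a single chart.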
17
Clustering
18
Clustering Cases – Diabetes
What is Clustering? Why use it?
We want to cluster observations into groups so that we can run different
activities on each group and better meet our objectives. For
example, in our case we want to identify groups with certain
characteristics so that, once we know which groups have
higher or lower rates of diabetes, we can perform the
needed preventive care activities on a subset of individuals versus
the whole enchilada.
We will use the popular k-means method to perform our clustering
activities within H2O. K-means partitions observations based on what
is called a centroid, which is a calculation of the group’s center. It sets
K points as initial centroids, then loops through multiple iterations to
find the optimal center point for each group. This type of process in
machine learning and statistics is called an unsupervised learning
algorithm because we never give the algorithm a labeled target
to learn and tune against. It simply attempts to find optimal center points
by reducing the sum of squared errors across all the observations based
on a distance metric (Euclidean, Manhattan, etc.).
Suppose we had the data to the left plotted in a chart. We can
use the k-means method to find the center points and assign a
unique cluster identifier to each group, placing each
observation in a homogeneous cluster. The marker in each group is the
center point the k-means method assigned to that group.
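The assign-then-update loop described above can be sketched in plain Python. This is a minimal illustration of the basic k-means iteration, not H2O's implementation; seeding the centroids with the first K points is an assumption made to keep the example deterministic, and the function name and sample points are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic k-means with Euclidean distance; returns (centroids, labels)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, labels


# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
```

The returned labels are the cluster identifiers, and the centroids are the center points that would be marked on the chart.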
19
Clustering Cases – Diabetes
How to determine the right number of clusters in H2O?
• What we are looking for is the “elbow” of the
curve, which determines the
number of clusters
• Cycle through many values of K: before the
“elbow” the drop in the sum of squared distances is
significant, but beyond it there
really isn’t much improvement
• We can see in our case we have our
“elbow” around K = 40
H2O.ai’s incredibly fast processing:
only 13 minutes with H2O, versus the
20 hours it would have taken without it!
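The elbow search amounts to running k-means for a range of K values and recording the total within-cluster sum of squared distances for each. The sketch below shows that in plain Python under the same assumptions as any toy example: it is not the H2O call used in the paper, the first K points seed the centroids, and the function name and sample points are hypothetical.

```python
import math


def kmeans_sse(points, k, iterations=100):
    """Run a basic k-means and return the within-cluster sum of squared distances."""
    centroids = list(points[:k])  # assumption: first k points seed the centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


# Hypothetical data with three well-separated groups.
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (1.0, 0.0), (0.0, 1.0),
          (11.0, 0.0), (10.0, 1.0), (1.0, 10.0), (0.0, 11.0)]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
# How much each extra cluster reduces the error; the elbow is where this collapses.
drops = {k: sse_by_k[k - 1] - sse_by_k[k] for k in range(2, 6)}
```

With this toy data the error falls sharply up to K = 3 and barely moves afterwards, so the elbow sits at 3; on the NHANES data the same reading of the curve gave K = 40.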
20
Clustering Cases – Diabetes
Cluster sizes:
• Graph shows each cluster and the number
of observations in each
• Cluster sizes vary widely
• We will need to filter some out or re-cluster
• We usually want to keep small clusters, since they
may gain more observations as we scale up
21
Clustering Cases – Diabetes
Which clusters are about the same and which are really different?
• Using multidimensional scaling (MDS) to combine the
variables into a single metric by cluster
• Each variable’s centers are on different scales
• Normalized and fitted to a single metric
• Determine the distance between clusters
• Example: clusters 17 and 26 are very close
together
• Example: clusters 39 and 4 are very different
groups
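The normalization and distance steps in the list above can be sketched as follows. This covers only the inputs that an MDS routine would take (standardized cluster centers and their pairwise distances), not the MDS projection itself; z-score scaling is an assumption, and the function names and the three cluster centers are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize_columns(centers):
    """Z-score each variable so centers on different scales become comparable."""
    scaled_columns = []
    for col in zip(*centers):
        mu, sigma = mean(col), pstdev(col)
        scaled_columns.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def pairwise_distances(rows):
    """Euclidean distance between every pair of rows."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical cluster centers: (age, total cholesterol, BMI).
centers = {
    "A": (35.0, 4.8, 24.0),
    "B": (36.0, 4.9, 24.5),
    "C": (68.0, 5.9, 31.0),
}
names = list(centers)
dist = pairwise_distances(standardize_columns(list(centers.values())))
# dist[i][j] is the distance between clusters names[i] and names[j].
```

Clusters with a small distance, like A and B here, would land close together on the MDS chart, while C would sit far from both.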
22
Clustering Cases – Diabetes
Proportion of Diabetes Across Each Cluster
• Graph shows the proportion of diabetes within
each cluster
• Clusters 5, 18, 30, and 31 seem to have high
levels of diabetes
23
Predicting Diabetes
24
Clustering Cases – Diabetes
Predictive Models in H2O by Cluster for Variable importance
1) Distributed Random Forest:
• An ensemble decision tree method
• Using predictions from multiple decision trees, the model combines each resulting
prediction
• Each tree gets a vote, as in the bagging method
• It is distributed in that the decision tree building is parallelized across the H2O
cluster
2) Gradient Boosting Machine:
• Can be an ensemble of either regression or classification tree models; we will use
classification-tree GBM
• Is an extension of “Multiple Additive Regression Trees” (MART), which combines two methods,
gradient optimization of a loss function and boosting of “weak” classifiers, producing a
committee-based approach
• Uses distributed trees as well
3) Generalized Linear Model:
• A generalized linear regression tool kit for fitting linear models
• Since our target is binary (diabetes found, not found), we will use a logistic regression
approach
• Uses likelihood optimization in fitting the model parameters
4) Deep Learning:
• Similar to neural networks: deep learners are feedforward neural networks with many
layers and can fall into other categories such as Deep Belief Network (DBN), Deep
Neural Network (DNN), etc.
• Weights are adapted to minimize the error on the labeled training data
Cluster Groups
GBM DRF GLM DL
Important Variables
Target Diabetes
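As one concrete example of the methods listed, the logistic regression idea behind item 3 (a binary target fitted by likelihood optimization) can be sketched with a single feature. This is a teaching sketch in plain Python, not H2O's GLM solver: plain gradient ascent on the log-likelihood is an assumption chosen for simplicity, and the function names, learning rate, and toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            grad0 += residual
            grad1 += residual * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Hypothetical standardized feature; 1 = diabetes found, 0 = not found.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

risk_high = sigmoid(b0 + b1 * 2.0)   # predicted probability at a high feature value
risk_low = sigmoid(b0 + b1 * -2.0)   # predicted probability at a low feature value
```

The sign and size of the fitted coefficient `b1` is what feeds a variable-importance reading for a linear model: a larger positive value means the feature pushes the predicted risk up more strongly.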
25
Clustering Cases – Diabetes
Predictive Models in H2O by Cluster for Variable importance + Model Performance
• Includes variable importance, percentage
of importance, and diabetes = “Yes” error
rate
Summary of Loss Matrix
Normal Balanced Problems
Actual \ Pred   No    Yes
No              98%   2%
Yes             5%    95%
This is a case where classes are well balanced.
Class Imbalance Problems
Actual \ Pred   No    Yes
No              98%   2%
Yes             40%   60%
This is typical with class imbalance. The model predicts the No’s well,
but it performs poorly on the imbalanced class, Yes.
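A loss matrix like the ones above is a table of actual versus predicted labels with each row converted to percentages. The sketch below builds one in plain Python; the function name and the toy labels are hypothetical and were chosen so the rates reproduce the class-imbalance example (98% / 2% and 40% / 60%).

```python
def loss_matrix(actual, predicted, labels=("No", "Yes")):
    """Return {actual: {predicted: share of that actual class}}."""
    counts = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    rates = {}
    for a in labels:
        row_total = sum(counts[a].values())
        rates[a] = {p: (counts[a][p] / row_total if row_total else 0.0) for p in labels}
    return rates


# Hypothetical imbalanced sample: 50 actual "No" cases and only 5 actual "Yes" cases.
actual = ["No"] * 50 + ["Yes"] * 5
predicted = ["No"] * 49 + ["Yes"] * 1 + ["No"] * 2 + ["Yes"] * 3

rates = loss_matrix(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
yes_error = rates["Yes"]["No"]  # share of true diabetes cases the model missed
```

Overall accuracy here is about 95% even though 40% of the true “Yes” cases are missed, which is why the diabetes = “Yes” error rate is reported separately for each cluster.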
26
Clustering Cases – Diabetes
Diabetes Risk by Cluster
• Includes variable importance,
percentage of importance, and diabetes
= “Yes” error rate
27
In Summary of the Analytics
What did we do in this walk through?
1st
2nd
3rd
4th