

Application of Data Mining Techniques for Determining Factors Associated with Overweight and Obesity Among California Adults

Elizabeth A. Ortega, California State University, Long Beach

ABSTRACT

This paper describes the application of supervised data mining methods using SAS® Enterprise Miner 12.3 (EM) to data from the 2013-2014 California Health Interview Survey (CHIS), in order to better understand obesity and the indicators that may predict it. CHIS, the largest health survey ever conducted in any state, samples California households through random-digit-dialing (RDD). EM was used to apply logistic regression, decision tree and neural network models to predict a binary variable, Overweight/Obese Status, which indicates whether an individual has a Body Mass Index (BMI) greater than 25. These models were compared to assess which categories of information, such as demographic factors or insurance status, and which individual factors, like race, best predict whether an individual is overweight/obese.

Keywords: Enterprise Miner, Data Mining, Decision Trees, Neural Networks, Logistic Regression, Gradient Boosting.

INTRODUCTION

Obesity and the risks that come with it are increasingly a concern for people around the world. The health risks are especially high for adults, with obesity increasing the incidence of diabetes, heart disease, and several types of cancer. Besides increasing the risk of disease, being overweight has several other effects. Obesity greatly affects an individual's quality of life by affecting their attitudes, emotions, and ability to live and work as they normally would. It is known that lifestyle factors and demographic factors change the prevalence of obesity and being overweight. In order to study these relationships in more depth, data mining techniques will be used on a data set focusing on adults in California.

The source of the data is the California Health Interview Survey (CHIS), specifically the results of the survey from 2013-2014. This data set includes health information on 19,516 adults in California. This sample of adults was obtained by placing telephone calls to California households. The random-digit-dialing (RDD) method was used to ensure that the sample was a random selection of households.

The target variable is a binary variable that has a value of either 1 or 2 to signify whether or not an individual is overweight or obese. A total of 19 variables are used as inputs. Some of the variables capture demographic information, like race measured according to the census, gender and self-reported age. Other variables concern health behaviors, like walking habits, fast food consumption, and whether respondents were able to readily access fresh fruits and vegetables in their neighborhood. Also taken into account were variables concerning income, poverty status, employment and food security, that is, whether food was available whenever respondents were hungry. The entire list of inputs into SAS Enterprise Miner is given in the Variables Used table below.


The entire data set of 19,516 adults was partitioned into a training data set and a validation data set, so that some observations were held back to test the effectiveness of the models created. The training data set was 67% of the original data and the validation data set was 23% of the original. It is necessary to ensure that the algorithm does not model the training data too closely, since the goal is to create models that are applicable and accurate for other data sets as well. Reserving a portion of the data as a validation set serves this purpose.
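The partitioning step can be illustrated with a minimal sketch. The paper's partition was made inside Enterprise Miner; the function name `partition`, the fixed seed, and the choice to give the validation set whatever the training share leaves over are all assumptions of this sketch, not the paper's exact procedure.

```python
import random


def partition(records, train_fraction=0.67, seed=0):
    """Shuffle a copy of the records and split it into (training, validation).

    Hypothetical helper: the validation set here is simply the remainder
    after the training share, which is an assumption of this sketch.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in row identifiers for the 19,516 adults in the data set.
train, valid = partition(range(19516))
print(len(train), len(valid))
```

Models are fit on `train` only, and the misclassification rate is then reported separately on `valid`, so a model that merely memorizes the training rows shows a visibly worse validation rate.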

Variables Used (n=20)

CHIS Name  Description                                              Role    Type
AC11       # OF TIMES DRANK SODA LAST MONTH                         Input   Interval
AC31_P1    # TIMES ATE FAST FOOD PAST WEEK                          Input   Ordinal
AC42       HOW OFTEN FIND FRESH FRUIT/VEG IN NEIGHBORHOOD           Input   Ordinal
AC44       NEIGHBORHOOD FRUIT/VEG AFFORDABLE                        Input   Ordinal
AC46       # OF TIMES DRANK SWEET FRUIT DRINKS PAST MONTH           Input   Interval
AC47_P1    # OF GLASSES OF WATER DRANK YESTERDAY                    Input   Interval
AC48_P1    # OF GLASSES OF NON-LOW/FAT MILK DRANK YESTERDAY         Input   Interval
AHEDC_P1   EDUCATIONAL ATTAINMENT                                   Input   Ordinal
AK10_P     RESPONDENT’S EARNINGS LAST MONTH                         Input   Interval
AL5        RECEIVING FOOD STAMP BENEFITS                            Input   Binary
AM1        # TIMES FOOD DIDN’T LAST, COULDN’T AFFORD MORE, PAST YR  Input   Ordinal
AM2        # TIMES COULDN’T AFFORD TO EAT BALANCED MEALS            Input   Ordinal
FAMTYP_P   FAMILY TYPE                                              Input   Nominal
OMBSRR_P1  OMB SELF-REPORT RACE ETHNICITY                           Input   Nominal
OVRWT      OVERWEIGHT OR OBESE                                      Target  Binary
SMKCUR     CURRENT SMOKER                                           Input   Binary
SRAGE_P1   SELF-REPORTED AGE                                        Input   Ordinal
SRSEX      GENDER                                                   Input   Binary
WRKST_P1   WORKING STATUS                                           Input   Ordinal
YRUS_P1    YEARS LIVED IN THE U.S.                                  Input   Ordinal

VISUALIZATION WITH TABLEAU

The first portion of the paper presents a visual analysis of the data set, focusing on a subset of the 19 predictor variables that will be used throughout the paper to explain and model obesity. Tableau® Desktop software will be used to examine the distribution of obesity between the genders and across other factors. Tableau is analytics software intended for exploring and analyzing data visually. Only those visualizations that provide insight into this large data set will be shown. Statistical tests will also be performed to determine whether the differences seen in the Tableau images are significant.

The first variable to be examined will be gender, or SRSEX, and its relationship to obesity. The original data set of adults includes 12,002 (61.5%) overweight/obese adults and 7,514 (38.5%) non-overweight adults. The data set includes 11,628 females (SRSEX = 2), which is 59.58% of the data set, and 7,888 males (SRSEX = 1), which is 40.42% of the data set. The figures below show the original breakdown of gender in the data set and the original breakdown of adults who have obese/overweight BMI values and those who do not. It is important to keep in mind when looking at later visualizations of this data that females outnumber males and that there are more overweight/obese adults than adults who are not overweight.

The graphic below shows that there are more overweight/obese women than overweight/obese men in this data set. However, it is important to keep the breakdown above in mind.

The percentage of men who are overweight/obese is actually larger than the percentage of women who are, as is shown in the tables below. Someone who had seen only the visualization above might conclude that women are obese/overweight more often than men, when it is actually the other way around. A t test was used to see if the difference between the genders was statistically significant. The test yielded a t value of -17.22 and a p value of <.0001, which indicates a statistically significant difference in obesity between the genders.
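The gap between raw counts and rates can be made concrete with a small sketch. The counts below are hypothetical toy numbers, not the CHIS figures, and the pooled-variance two-sample t statistic is an assumption about the form of the test, since the paper does not state which variant was run.

```python
import math


def pooled_t(successes_a, n_a, successes_b, n_b):
    """Pooled-variance two-sample t statistic for two 0/1 variables."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # For a 0/1 variable the sum of squared deviations from the mean
    # equals n * p * (1 - p).
    ss_a = n_a * p_a * (1 - p_a)
    ss_b = n_b * p_b * (1 - p_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


# Hypothetical toy counts (NOT the survey data): the larger group has more
# overweight members in absolute terms but a lower rate.
women_over, women_n = 90, 150
men_over, men_n = 70, 100

women_rate = women_over / women_n
men_rate = men_over / men_n
t = pooled_t(women_over, women_n, men_over, men_n)
print(women_rate, men_rate, t)
```

In this toy example the women's count is higher while the women's rate is lower, which is the same pattern the bar chart and the percentage tables show for the survey data.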


Another factor that may play a part in obesity is age. To show this visually, the graph below uses the binned self-reported age variable SRAGE together with the gender variable explored earlier to examine the difference in obesity among different ages and genders.

The value shown on the left is the lower boundary of each age bin. The blue bars, which are plotted as negative numbers on the axis, show the number of overweight/obese men in the data set for that age bin. The pink bars, plotted as positive numbers, show the same for women. It can be seen that the distribution of obesity by age is similar for both genders, but obesity varies greatly with age.

Analysis of Variance (ANOVA) was used on the age groups to see if the groups did in fact differ in terms of obesity. The F value from this test was 31.34 and the corresponding p value was <.0001. This result suggests that there is a statistically significant difference in obesity between the age groups.
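As a rough illustration of how the F value in a one-way ANOVA is formed, the sketch below computes it for 0/1 overweight indicators grouped by age bin. The three groups are made-up toy data, and the function name `one_way_anova_f` is hypothetical; the paper's own figure came from statistical software run on the full sample.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        # Variation of the group mean around the grand mean.
        ss_between += len(g) * (mean - grand_mean) ** 2
        # Variation of observations around their own group mean.
        ss_within += sum((v - mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data: 1 = overweight/obese, 0 = not, for three hypothetical age bins.
toy_age_bins = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
f_stat = one_way_anova_f(toy_age_bins)
print(f_stat)
```

A large F value means the group means differ by more than the within-group scatter would explain, which is what the reported value of 31.34 indicates for the age groups.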

Race is another demographic variable that may explain obesity in this population of adults. The variable OMBSRR_P1 records self-reported race and ethnicity according to the Office of Management and Budget (OMB) standards. A value of 1 signifies that the individual’s race/ethnicity is Hispanic, 2 signifies White, Non-Hispanic, 3 signifies African-American, 4 signifies American Indian or Alaskan Native, 5 signifies Asian, 6 represents Other, and 7 signifies that the respondent identified with two or more races or ethnicities.

The figure below shows the distribution of overweight/obese adults (the value 1 on the left) and adults who are not overweight (the value 0) among the different races, which are shown on the horizontal axis.

Races with a more equal distribution of obese and non-obese individuals have squares that are similar in pigmentation. One example is group 2, White Non-Hispanic, which has almost as many non-obese individuals as obese ones. Races with an unequal distribution show one square as gray and one square as very red, like group 1, Hispanics. This group has substantially more individuals who are overweight/obese than not, with 73% of Hispanics in the sample being overweight/obese.


An ANOVA was also conducted on these groups. The resulting F statistic was 136.09 with a p value of <.0001, signifying that there is a statistically significant difference in the distribution of obesity between the races.

From the graph below one can also see that group 5, Asians, has significantly more individuals who are not overweight than individuals who are. Overweight individuals make up only 37% of the Asian adults sampled, almost the reverse of the distribution of obesity in the entire sample including all races.

Also explored visually was the relationship between educational attainment and obesity. The graphic below shows the distribution of obesity among the different levels of educational attainment, which range from 1, which signifies that the adult has had no formal education, to 10, which signifies that the respondent holds a doctorate degree. The horizontal axis represents the number of overweight/obese adults in each of these educational categories. ANOVA was also used to compare these groups, and the results showed a statistically significant difference in obesity among the education levels. When the groups are compared to one another using Bonferroni tests, there is a statistically significant difference between those who have graduate degrees (master’s or doctorate degrees) and those who do not. Those without graduate degrees tend to be obese/overweight more often than those with them.

DECISION TREES

The EM diagram below shows the different decision trees fit to the training data set and tested for effectiveness with the validation data set.


In total, 5 types of decision tree algorithms were used on the data, as well as an interactive decision tree and a gradient boosting decision tree. The first type of decision tree used was a simple classification and regression tree (CART) algorithm, which builds a binary decision tree and splits nodes based on Gini impurity. Gini impurity measures how often a randomly chosen element would be labeled incorrectly if it were labeled randomly according to the distribution of labels in the subset. The formula for calculating the Gini index of a node is below for a data set T with examples from n classes.
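The formula itself did not survive in this transcript. The standard definition of Gini impurity, which matches the description above, is reproduced here as a stand-in; the symbol p_j, the relative frequency of class j in T, is notation assumed for this reconstruction.

```latex
\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^{2}
```

A node containing a single class has an impurity of 0, and for a binary target the maximum of 0.5 is reached when the two classes are equally frequent.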

This CART tree yielded 10 significant variables, with race being the most important. The next most important variable was the number of times an individual ate fast food in the last week. The variables and their importance for this first tree are listed below. The decision tree diagram in its entirety is also shown. The diagrams for the following trees are much more complicated, since those trees are not binary and contain many more nodes than this tree, so they will not be shown.


The original data set was 62% overweight/obese and 38% not overweight or obese, so the nodes below were chosen because they show a substantially different distribution of obesity. For example, Asian women have a much lower risk of obesity, with only 29.6% of them being overweight/obese and 70.4% being not overweight or obese.

A group that has an even higher rate of obesity than the original data set is the group composed of Hispanics, African-Americans or American Indians who are older than 26 and eat fast food more than once a week. This group has an 80.1% prevalence of obesity.
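To see why these nodes are informative, one can compare their Gini impurity with that of the full data set. The sketch below uses the class proportions quoted above; the function name `gini` is hypothetical, and the 19.9% share for the last group is simply the complement of the reported 80.1%.

```python
def gini(proportions):
    """Gini impurity of a node, given the class proportions in that node."""
    return 1.0 - sum(p * p for p in proportions)


# Class proportions (overweight/obese, not overweight) taken from the text.
root = gini([0.62, 0.38])            # whole data set
asian_women = gini([0.296, 0.704])   # low-risk node
high_risk = gini([0.801, 0.199])     # high-risk node; 0.199 is the complement

print(root, asian_women, high_risk)
```

Both nodes are purer than the root, which is what a split chosen on Gini impurity is designed to achieve.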

Tree 2 was the best tree in terms of the misclassification rate. It used the C4.5 algorithm with a maximum of 4 branches. Instead of Gini impurity, the C4.5 algorithm uses entropy to decide whether or not a node should be made. The formula for entropy is below:
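The entropy formula is also missing from this transcript. The standard Shannon entropy used by C4.5-style algorithms is given here as a stand-in, for a data set T with examples from n classes; the symbol p_j, the relative frequency of class j in T, is notation assumed for this reconstruction.

```latex
\mathrm{entropy}(T) = -\sum_{j=1}^{n} p_j \log_2 p_j
```

Entropy is 0 for a node that contains a single class and reaches its maximum of 1 bit for a binary target when the two classes are equally frequent.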


This algorithm yielded 13 variables of importance, with race again being the most important variable. The list of variables in order of importance is below. The misclassification rate for this tree was .321 for the training data set and .332 for the validation data set.

The remaining trees did not follow a specific algorithm, apart from the third tree, which used the C4.5 algorithm with a maximum of 6 branches. None of them classified substantially better than Tree 4, so Tree 4 is the last tree for which I will list results individually. This tree used variance, entropy for nominal variables and Gini impurity for ordinal variables as its splitting criteria. The variables in order of importance are below. The tree used 19 variables, that is, all of the available inputs, to classify the data into groups. The misclassification rate for this tree was .305 for training and .322 for validation.


GRADIENT BOOSTING

Gradient boosting is a method for reducing error in decision trees: it fits a sequence of trees, each one trained to correct the errors of the trees before it. In this case the error rate is the misclassification rate, which is the proportion of predictions that incorrectly predict the value of our target variable.

Neither type of error is treated as more important here, whether we misclassify an overweight/obese individual as not overweight or classify a person who is not overweight as overweight/obese. When comparing the best models from each method, we will consider the two types of error in more detail, since it may be more important to correctly classify overweight/obese individuals.
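The misclassification rate and the two kinds of error described above can be computed as in the following sketch. The labels are toy values, the 1/0 coding (1 = overweight/obese) is an assumption made for readability, and `classification_summary` is a hypothetical helper rather than anything produced by Enterprise Miner.

```python
def classification_summary(actual, predicted, positive=1):
    """Count both error types and return the overall misclassification rate."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # not overweight, but classified as overweight/obese
        elif a == positive:
            fn += 1  # overweight/obese, but classified as not overweight
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "false_positive": fp,
        "false_negative": fn,
        "misclassification_rate": (fp + fn) / total,
    }


# Toy labels only: 1 = overweight/obese, 0 = not overweight.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 1, 1, 1, 0, 1]
summary = classification_summary(actual, predicted)
print(summary)
```

Because the rate adds the two error counts together, two models with the same misclassification rate can still differ in which kind of mistake they make more often.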

DECISION TREES, CONTINUED

The remaining trees, including the gradient boosting tree, had several similarities. The interactive decision tree is excluded from this comparison because it was built from user inputs that decided when to create a node and in what order the variables were used. The trees are compared below using the model comparison tool in Enterprise Miner. The ROC curves are shown for these models for both the training and validation data sets.

Race was the most important variable in all of these decision trees, no matter the criterion used to create the nodes. Age, gender, and fast food frequency were among the 5 most important variables in each of the trees, usually followed by earnings and educational attainment.

Next came the variables on the subjects’ drinking habits: how often they had soda in the last month, low fat or skim milk, sweet fruit drinks, or how often they drank water. These variables were usually grouped together but varied slightly in their order of importance from tree to tree. The food security variables, such as whether or not a subject was using food stamp benefits and whether or not they always had food available when they were hungry, were significant in all trees but consistently at the bottom of the list in terms of importance.

This was surprising, since I expected these variables to contribute more heavily to whether or not a subject was overweight. The misclassification rate for all of these trees was around .32-.33 and did not change substantially from the training data set to the validation data set. As can be seen in the ROC curves below for all the trees, Tree 4 seems to be the best, and all of the trees provide a substantial increase in classifying power over random chance.


CLUSTER ANALYSIS AS INPUT TO DECISION TREES

A decision tree was also used to model the segments created by cluster analysis, by placing a cluster node before a decision tree node and changing the target variable to SEGMENT instead of OVRWT.

This method had a much lower misclassification rate than the decision trees without this feature; however, the two types of decision trees are not comparable to one another. Using a decision tree here simply helps to understand the clusters more clearly. For this data set, the best number of clusters was two.

On the training data this method had a .46 misclassification rate, and a .56 rate on the validation data set. When analyzing the cluster means and the variables used (17 of the original 19), a picture began to emerge of the types of adults in each cluster. Cluster 1 adults on average had soda and sweet drinks only twice a month. These adults earned almost twice as much monthly, on average, as the adults in Cluster 2. This cluster consisted mainly of females and had more White and Asian adults than Cluster 2.

Cluster 2, on the other hand, contained individuals who on average had soda 22 times a month and 15 sweet drinks a month. Besides earning only $1,200 a month, these individuals were mainly males and were more likely than those in Cluster 1 to be Hispanic and current smokers. Cluster 1 made up 84.38% of the data and Cluster 2 made up 15.62%. The variables used in the clustering and subsequent tree are shown below in order of importance, along with their importance to the model.


NEURAL NETWORKS

Neural networks with different algorithms and numbers of hidden layers were used to see which one could model obesity in the training data best while also working well for the validation data set. The algorithms used, with varying numbers of hidden layers, were Back Propagation and Levenberg-Marquardt.

The Levenberg-Marquardt model with 5 hidden layers outperformed the other neural networks in terms of misclassification rate. The misclassification rate for this model was .331 for the training data set and .334 for the validation data set. Neural networks are a "black box" in terms of their ability to be interpreted, and are much less intuitive and explainable than decision trees, for example. Despite that cost, for this data set they did not perform substantially better than the other methods in terms of misclassification.

The ROC curves comparing this model and the baseline, for both the training and validation data sets, are below. This method does work well; however, its issues with interpretability and the fact that it does not show a substantial decrease in the error rate suggest that it may not be the best model for obesity in this data set.


LOGISTIC REGRESSION

In order to determine whether these data mining techniques, which at times require large amounts of computing power, offer substantial insight into our data set over traditional methods, logistic regression models were fit to the data. The stepwise, forward and backward selection methods were used to find the best models for the training data, and the models were also judged on misclassification rate. All of these selection methods yielded a model which was significant at the α = .05 level.

The model selected using these procedures yielded a misclassification rate of .332 for the training data and .335 for the validation data. The Akaike Information Criterion (AIC) value for this model is 16071.46. It includes the following twelve variables, all of which are individually significant as predictors at the α = .05 level: Fast Food, Non-Low/Fat Milk, Educational Attainment, Food Stamp Benefits, Couldn’t Afford Balanced Meals, Family Type, Race, Smoker, Age, Gender, Working Status, Years Living in the U.S.
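For readers unfamiliar with the AIC, the sketch below shows how it is formed from a fitted model's log-likelihood, using the usual definition AIC = 2k - 2 ln L. The predicted probabilities are toy values, the helper names are hypothetical, and how Enterprise Miner counts the parameters k for class-level inputs is not stated in the paper, so the number reported above cannot be reproduced from this sketch.

```python
import math


def log_likelihood(actual, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(actual, probabilities)
    )


def aic(loglik, n_parameters):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_parameters - 2 * loglik


# Toy example: two observations, each predicted at probability 0.5,
# from a hypothetical model with a single parameter.
toy_aic = aic(log_likelihood([1, 0], [0.5, 0.5]), n_parameters=1)
print(toy_aic)
```

The penalty term 2k is why the selection procedures can prefer a model with fewer variables even when a larger model fits the training data slightly better.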

Overall, this classical method did not perform substantially differently from the data mining techniques used on this data set. It seems that logistic regression is a good option for modeling obesity in adults, especially because it is already familiar to most individuals in the health field, where these data mining techniques may not be.

ERROR RATE COMPARISON

Comparing the classification charts for the best decision tree, logistic regression model and neural network shows the differences in classification for each of the methods. All of the methods classified individuals as overweight/obese more often than as not overweight. Also, all three of the models had more observations that they incorrectly classified as overweight than observations that were incorrectly classified as not overweight.

This is positive, since it is more beneficial that our model correctly identifies those who are overweight/obese, as that is our focus. The models were similar in terms of the rates at which they incorrectly and correctly classified the data. Below are the classification charts for the best decision tree, neural network and logistic regression model, in that order, for the training data set.

CONCLUSIONS

The best model overall, in terms of misclassification rate, was the decision tree model using a variation of the C4.5 algorithm. It had the lowest misclassification rates for the training and validation data sets, which were .305 and .322 respectively. The best neural network model was not far behind, with misclassification rates of .331 and .334 for training and validation. The best logistic regression model had rates of .332 and .335.

The data mining techniques had slightly lower error rates than the classical technique of logistic regression for predicting our target variable of whether an individual is obese/overweight or not. However, the classical method still resulted in a model that was comparable to those using more advanced techniques. It seems that for predicting obesity in this sample, logistic regression is a viable option. This may be due to the relatively large sample on which this study was based and the fact that the sample contained very little missing data for the predictors used.

Overall, the same types of variables were significant in most of the techniques used. Demographic factors like gender, educational attainment and race were more significant predictors of obesity, whereas factors that had to do with an individual’s health behaviors, like their soda drinking habits, were less significant. Although this study is limited, since only 19 variables about the adults were used as inputs into the models, this may have implications for how physicians and public health officials tackle the growing problem of obesity in the United States.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.
