
Readmission of Diabetes Patients

Project Report by

Rahmawati Nusantari, Maria D. Marroquin, Essenam Kakpo, Hong Lu

Team 10

INSY 5339 – Th. 7 pm-9:50 pm

May 12, 2016


Table of Contents

Problem Domain
  Dataset Summary
  Encounters (Records)
  Features (Attributes)
  Target Variable
  Prediction
Data Cleaning Process
  Data Cleaning Tools
  Missing Values
  Irrelevant Data
  Data Imbalance
  Past Cleaning Efforts
    Discretization
    Various SMOTE Percentages
Algorithms Utilized
  Classifiers
  Comparison of Classifiers
Factor Experimental Design
  Number of Attributes
  Noise
Experiments
  Combination Sets with Each Classifier
  Summary of Results
Analysis and Conclusion
  ROC Curves
  Additional Analysis
  Overall Observations
References


Problem Domain

Dataset Summary

The dataset was obtained from the UCI Machine Learning Repository, where it is listed under the name Diabetes 130-US Hospitals. According to the dataset description, the data was prepared to analyze factors related to readmission, as well as other outcomes, for patients with diabetes. It represents 10 years (1999-2008) of clinical care at 130 U.S. hospitals and integrated delivery networks, and contains 101,766 unique inpatient encounters (instances) with 50 attributes, for a total of 5,088,300 cells.

Encounters (Records)

As stated on the UCI dataset information page, the dataset contains encounters that satisfied the following criteria:

•   It is an inpatient encounter (a hospital admission).
•   It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
•   The length of stay was at least 1 day and at most 14 days.
•   Laboratory tests were performed during the encounter.
•   Medications were administered during the encounter.

Features (Attributes)

The attributes represent patient and hospital outcomes. The dataset mostly contains nominal attributes, such as medical specialty and gender, but it also includes a few ordinal attributes, such as age and weight, and continuous attributes, such as time in hospital (days) and number of medications. The following table lists each attribute, its description, and the percentage of missing values.


Attributes and Target Variable Table

Feature name | Type | Description and values | % missing
Encounter ID | Numeric | Unique identifier of an encounter | 0%
Patient number | Numeric | Unique identifier of a patient | 0%
Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2%
Gender | Nominal | Values: male, female, and unknown/invalid | 0%
Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100) | 0%
Weight | Numeric | Weight in pounds | 97%
Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0%
Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0%
Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0%
Time in hospital | Numeric | Integer number of days between admission and discharge | 0%
Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52%
Medical specialty | Nominal | Integer identifier of the admitting physician's specialty, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53%
Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0%
Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0%
Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0%
Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0%
Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0%
Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0%
Diagnosis 1 | Nominal | The primary diagnosis (coded as the first three digits of ICD-9); 848 distinct values | 0%
Diagnosis 2 | Nominal | Secondary diagnosis (coded as the first three digits of ICD-9); 923 distinct values | 0%
Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as the first three digits of ICD-9); 954 distinct values | 1%
Number of diagnoses | Numeric | Number of diagnoses entered into the system | 0%
Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">200", ">300", "normal", and "none" if not measured | 0%
A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured | 0%
Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: "change" and "no change" | 0%
Diabetes medications | Nominal | Indicates if any diabetic medication was prescribed. Values: "yes" and "no" | 0%
24 features for medications | Nominal | For the generic names metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: "up" if the dosage was increased during the encounter, "down" if decreased, "steady" if unchanged, and "no" if the drug was not prescribed | 0%
Readmitted | Nominal | Days to inpatient readmission. Values: "<30" if the patient was readmitted in less than 30 days, ">30" if readmitted in more than 30 days, and "No" for no record of readmission | 0%


Target Variable

The last attribute in the previous table, Readmitted, is the class attribute. Its distribution is as follows:

•   Encounters of patients who were not readmitted to the hospital (No): 54,864 encounters.
•   Encounters of patients who were readmitted more than 30 days after discharge (>30): 35,545 encounters.
•   Encounters of patients who were readmitted within 30 days of discharge (<30): 11,357 encounters.

Prediction

We want to predict whether, and when, diabetes patients will be readmitted to the hospital based on several factors (attributes).

[Figure: diagram of the Readmission target variable and its three values: <30, >30, No]


Data Cleaning Process

Data cleaning is commonly defined as the process of detecting and correcting corrupt or inaccurate records in a dataset, table, or database.[1] Data quality is an important component of any data mining effort; for this reason, many data scientists spend from 50% to 80% of their time preparing and cleaning their data before it can be mined for insights.[2] There are four broad categories of data quality problems: missing data, abnormal data (outliers), departure from models, and goodness-of-fit.[3] For this project, our team mainly dealt with missing data. We also addressed the imbalance in the class variable using SMOTE.

Data Cleaning Tools

Our team used Microsoft Excel to perform the data cleaning. To understand the variables and the meaning of the data, we consulted the research article from which the data originates: "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records" by Beata Strack et al.

Missing Values

The article identified three attributes with most of their records missing: weight (97%), payer code (52%), and medical specialty (53%). Weight was not consistently recorded because the data predates the HITECH provisions of the American Recovery and Reinvestment Act of 2009, while payer code was deemed irrelevant by the researchers. As a result, these three attributes were deleted. There were also 23 medication attributes, such as metformin and other generic medications, with zero values in 79% to 99% of their records, a zero value indicating that the medication was not prescribed to the patient; all 23 were deleted. Insulin was the only medication attribute retained, since it had data in more than 50% of its records and is considered prevalent in diabetic patient cases.

Irrelevant Data

The class attribute indicates whether a patient is readmitted to the hospital within 30 days, after more than 30 days, or not at all. The discharge disposition attribute takes 29 distinct values that indicate, for example, that patients were discharged to home or to another hospital, sent to hospice (terminally ill patients), or passed away.

[1] https://en.wikipedia.org/wiki/Data_cleansing
[2] Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
[3] Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.


To correctly include only living patients who were not in hospice, we removed records with discharge disposition codes 11, 13, 14, 19, 20, and 21; these codes match the instances of patients who were deceased or sent to hospice. This step removed 2,423 instances.

Data Imbalance

SMOTE (Synthetic Minority Oversampling Technique) is a filter that samples the data and alters the class distribution, and it can be used to adjust the relative frequency between the minority and majority classes. SMOTE does not under-sample the majority classes; instead, it oversamples the minority class by creating synthetic instances using a k-nearest-neighbor approach. The user can specify the oversampling percentage and the number of neighbors to use when creating synthetic instances.[4] Our team tried SMOTE in different combinations and ultimately decided to apply a 200% synthetic minority oversample with 3 nearest neighbors, as shown below.

[Figure: SMOTE filter settings in WEKA]

[4] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
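The same cleaning and oversampling steps can also be scripted against the Weka Java API rather than clicked through the Explorer. The following is a minimal sketch, not our exact procedure; the ARFF file name and the discharge-attribute name are assumptions, and the SMOTE filter comes from Weka's separately installed SMOTE package.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE; // requires Weka's SMOTE package

public class CleanAndBalance {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name for the exported dataset.
        Instances data = new DataSource("diabetic_data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // readmitted is the last attribute

        // Drop encounters of deceased or hospice patients
        // (discharge disposition codes 11, 13, 14, 19, 20, 21).
        Set<String> excluded = new HashSet<>(Arrays.asList("11", "13", "14", "19", "20", "21"));
        int dischargeIdx = data.attribute("discharge_disposition_id").index(); // assumed attribute name
        for (int i = data.numInstances() - 1; i >= 0; i--) {
            if (excluded.contains(data.instance(i).stringValue(dischargeIdx))) {
                data.delete(i);
            }
        }

        // Oversample the minority class by 200% using 3 nearest neighbors.
        SMOTE smote = new SMOTE();
        smote.setPercentage(200.0);
        smote.setNearestNeighbors(3);
        smote.setClassValue("0"); // "0" auto-detects the minority class (<30)
        smote.setInputFormat(data);
        Instances balanced = Filter.useFilter(data, smote);
        System.out.println("Instances after SMOTE: " + balanced.numInstances());
    }
}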


The following graphs and matrices compare the data before SMOTE and after the 200% SMOTE was applied to the minority class (<30).

[Figure: class distribution graphs before and after 200% SMOTE]

[Figure: confusion matrices before and after 200% SMOTE, using J48 and BayesNet, for the original data and the SMOTE 200% data]


Past Cleaning Efforts

Discretization

As part of our initial data cleaning efforts, we recoded several attributes stored as integer identifiers into nominal values: diagnosis 1, diagnosis 2, diagnosis 3, admission type, discharge disposition, and admission source. The first three attributes (the diagnoses) are coded in ICD-9 (International Statistical Classification of Diseases and Related Health Problems, Ninth Revision); for example, codes 390-459 and 785 are diseases of the circulatory system, so they can be grouped into a single circulatory category, as sketched below. After converting all the integer identifiers into nominal values, however, the results did not show significant improvement.
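The recoding of the diagnosis attributes amounts to a lookup from ICD-9 code ranges to disease categories. A minimal sketch of such a helper follows; only the circulatory range is named in this report, so the remaining ranges (taken from the Strack et al. grouping) and the helper itself are illustrative assumptions.

public class Icd9Grouper {
    // Maps the first three digits of an ICD-9 code to a broad disease category.
    // Only 390-459 and 785 (circulatory) appear in this report; the other ranges
    // follow the Strack et al. grouping and are assumptions here.
    public static String icd9Group(String code) {
        if (code == null || code.isEmpty() || code.equals("?")) return "Missing";
        if (code.startsWith("V") || code.startsWith("E")) return "Other"; // supplementary codes
        double c = Double.parseDouble(code);
        if (c >= 250 && c < 251) return "Diabetes";                    // 250.xx
        if ((c >= 390 && c <= 459) || c == 785) return "Circulatory";  // example from the text
        if ((c >= 460 && c <= 519) || c == 786) return "Respiratory";
        if ((c >= 520 && c <= 579) || c == 787) return "Digestive";
        if ((c >= 580 && c <= 629) || c == 788) return "Genitourinary";
        if (c >= 710 && c <= 739) return "Musculoskeletal";
        if (c >= 800 && c <= 999) return "Injury";
        if (c >= 140 && c <= 239) return "Neoplasms";
        return "Other";
    }
}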

Various SMOTE Percentages

We also applied different SMOTE percentages, mainly to the <30 minority class, but saw no significant improvement. We ultimately decided to apply a 200% increase to the <30 minority class, as mentioned earlier.

[Figure: class distribution graphs with different SMOTE percentages: 250% on <30; 350% on <30; 500% on <30; and 350% on <30 with 50% on >30]


Algorithms Utilized

After data cleaning and pre-processing, we selected three classifiers for the experiment design:

Classifiers

•   J48. A decision tree learner: it repeatedly finds the attribute whose split most increases prediction accuracy.
•   Naïve Bayes. A probabilistic classifier that builds its model from per-attribute probabilities, assuming the attributes are independent given the class.
•   Bayes Net. A probabilistic graphical model that represents a set of random variables and their conditional dependencies through a directed acyclic graph.
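In the Weka Java API these three classifiers correspond to the following classes; a minimal sketch with Weka's default settings:

import weka.classifiers.Classifier;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;

public class Classifiers {
    // The three classifiers compared in our experiments, with default settings.
    public static Classifier[] all() {
        return new Classifier[] { new J48(), new NaiveBayes(), new BayesNet() };
    }
}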

Comparison of Classifiers

Since we selected two Bayes classifiers, we compared Naïve Bayes and Bayes Net. A Naïve Bayes classifier is a simple model describing a particular class of Bayesian network, one in which all the features are conditionally independent of each other given the class. Because of this assumption, there are certain problems Naïve Bayes cannot solve; an advantage, however, is that it requires only a small amount of training data to estimate the parameters necessary for classification. A Bayesian network models relationships between features in a much more general way: it makes no blanket independence assumption, and all the dependencies have to be modeled explicitly. If these relationships are known, or there is enough data to derive them, a Bayesian network may be appropriate. The following two examples illustrate the difference between the algorithms. In the first, a fruit may be considered an apple if it is red, round, and about 10 centimeters in diameter. A Naïve Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
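Formally, the independence assumption lets the posterior factor into per-feature terms (the notation for the apple example is ours):

$$P(\text{apple} \mid \text{red}, \text{round}, \text{size}) \propto P(\text{apple})\, P(\text{red} \mid \text{apple})\, P(\text{round} \mid \text{apple})\, P(\text{size} \mid \text{apple})$$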


In the second example, presume there are two events that could cause grass to be wet: either the sprinkler is on, or it is raining. Presume also that rain has a direct effect on the use of the sprinkler (when it rains, the sprinkler is usually not turned on). The situation can then be modeled with a Bayesian network in which all three variables have two possible values, T (true) and F (false), and the two parent attributes (Sprinkler and Rain) are correlated.
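With G = grass wet, S = sprinkler, and R = rain, the network encodes the joint distribution as a product of local conditionals, one per node given its parents, so a query such as "is it raining, given the grass is wet?" is answered by summing over the unobserved variable:

$$P(G,S,R) = P(G \mid S,R)\, P(S \mid R)\, P(R)$$

$$P(R{=}T \mid G{=}T) = \frac{\sum_{s \in \{T,F\}} P(G{=}T, S{=}s, R{=}T)}{\sum_{s,r \in \{T,F\}} P(G{=}T, S{=}s, R{=}r)}$$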


Factor Experimental Design

We selected two factors for our experimental design: number of attributes and noise. Each combination shown in the 2-Factor Experimental Design table below was run in an experiment with each algorithm, using 10 random seeds.

Number of Attributes

Because of the number of attributes this dataset contains, both originally and after cleaning, we decided to analyze the effect of decreasing the number of attributes, comparing experimental runs on the full cleaned dataset against runs on a reduced cleaned dataset. We used Weka's InfoGain tool, which evaluates the worth of each attribute by measuring its information gain with respect to the class. The tool ranked all attributes, and we selected the top 10; we then compared experiment results from the dataset containing all 22 attributes against the same dataset containing only the top 10 attributes.

Noise

Noise refers to the modification of original values, such as distortion of a voice during a phone call or fuzziness on a computer screen. We wanted to observe the effect of added noise on classification performance, so noise was selected as our second factor. We added 10% noise to the target variable only, ran the experiments, and compared the results against the dataset without noise. A sketch of both preparation steps follows.
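Both factors can be reproduced with stock Weka components; the sketch below is illustrative and the file name is an assumption. InfoGainAttributeEval with a Ranker keeps the top 10 attributes, and the AddNoise filter flips 10% of the class labels.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddNoise;

public class FactorPrep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes_cleaned.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // Factor 1: keep only the 10 attributes ranked highest by information gain.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10);
        selector.setSearch(ranker);
        selector.SelectAttributes(data);
        Instances selected = selector.reduceDimensionality(data);

        // Factor 2: introduce 10% noise into the class (last) attribute only.
        AddNoise noise = new AddNoise();
        noise.setAttributeIndex("last");
        noise.setPercent(10);
        noise.setRandomSeed(1);
        noise.setInputFormat(data);
        Instances noisy = Filter.useFilter(data, noise);

        System.out.println(selected.numAttributes() + " attributes kept; "
                + noisy.numInstances() + " instances with noisy labels");
    }
}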

2-Factor Experimental Design Table

            | ALL ATTRIBUTES                | SELECTED ATTRIBUTES
NO NOISE    | C1: All Attributes & No Noise | C3: Selected Attributes & No Noise
NOISE (10%) | C2: All Attributes & Noise    | C4: Selected Attributes & Noise


Experiments

As previously mentioned, our experimental design was composed of two factors (attribute selection and noise), giving us four sets of experiments to run:

•   C1: All Attributes & No Noise
•   C2: All Attributes & Noise
•   C3: Selected Attributes & No Noise
•   C4: Selected Attributes & Noise

Combination Sets with Each Classifier

C1
E1: Performance of J48 for All Attributes, No Noise
E2: Performance of Naïve Bayes for All Attributes, No Noise
E3: Performance of Bayes Net for All Attributes, No Noise

C2
E4: Performance of J48 for All Attributes, 10% Noise
E5: Performance of Naïve Bayes for All Attributes, 10% Noise
E6: Performance of Bayes Net for All Attributes, 10% Noise

C3
E7: Performance of J48 for Selected Attributes, No Noise
E8: Performance of Naïve Bayes for Selected Attributes, No Noise
E9: Performance of Bayes Net for Selected Attributes, No Noise

C4
E10: Performance of J48 for Selected Attributes, 10% Noise
E11: Performance of Naïve Bayes for Selected Attributes, 10% Noise
E12: Performance of Bayes Net for Selected Attributes, 10% Noise

Each of the experiments E1, E2, ..., E12 was run 10 separate times with a different seed each time, ensuring that the algorithm used a slightly different training set on each run. For every experiment, the percentage split was 66% training and 34% testing. For each of C1-C4, we used three different algorithms:

•   Experiments E1, E4, E7, and E10 use the J48 algorithm.
•   Experiments E2, E5, E8, and E11 use the Naïve Bayes algorithm.
•   Experiments E3, E6, E9, and E12 use the Bayes Net algorithm.
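A minimal sketch of this experimental loop for one combination set is shown below; the file name is an assumption, and the same pattern applies to the other sets.

import java.util.Random;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunCombinationSet {
    public static void main(String[] args) throws Exception {
        // Assumed file name for the C1 dataset (all attributes, no noise).
        Instances data = new DataSource("c1_all_attributes_no_noise.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new J48(), new NaiveBayes(), new BayesNet() };
        for (Classifier base : classifiers) {
            for (int seed = 1; seed <= 10; seed++) {
                // Shuffle with the run's seed, then split 66% train / 34% test.
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(seed));
                int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize,
                        shuffled.numInstances() - trainSize);

                Classifier model = AbstractClassifier.makeCopy(base);
                model.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(model, test);
                System.out.printf("%s, seed %d: %.4f%% accuracy%n",
                        base.getClass().getSimpleName(), seed, eval.pctCorrect());
            }
        }
    }
}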

The following tables are the results of the experiments conducted:


Results Tables

Accuracy (%) by seed:

Seed | E1 (J48) | E2 (Naïve Bayes) | E3 (Bayes Net) | E4 (J48)
1 | 57.3161 | 56.3468 | 64.6274 | 53.5954
2 | 57.6682 | 56.4794 | 63.8679 | 53.6364
3 | 57.5814 | 56.4673 | 64.2633 | 53.2554
4 | 57.7960 | 56.7085 | 64.4418 | 53.7015
5 | 57.4174 | 56.6458 | 63.9474 | 53.4338
6 | 57.7502 | 57.0051 | 64.5816 | 53.2988
7 | 57.4680 | 56.7519 | 64.2416 | 53.1903
8 | 57.5838 | 56.7977 | 63.8968 | 53.4893
9 | 57.9696 | 57.1039 | 63.9354 | 53.7497
10 | 58.0709 | 57.1208 | 64.4225 | 53.6219
Average | 57.6622 | 56.7427 | 64.2226 | 53.49725
Std Dev | 0.2397 | 0.2704 | 0.2931 | 0.196119


Seed | E5 (Naïve Bayes) | E6 (Bayes Net) | E7 (J48) | E8 (Naïve Bayes)
1 | 52.5826 | 59.4695 | 57.415 | 55.2882
2 | 52.9274 | 59.3369 | 57.4849 | 55.4642
3 | 52.6405 | 59.5611 | 56.87 | 55.4618
4 | 52.9708 | 59.8143 | 57.3113 | 55.469
5 | 53.0769 | 59.5322 | 56.776 | 55.2472
6 | 53.0311 | 59.5274 | 57.1907 | 56.0043
7 | 52.9539 | 59.4406 | 57.2631 | 55.6137
8 | 52.9395 | 59.2669 | 57.2004 | 55.416
9 | 53.2771 | 59.3851 | 57.2438 | 55.5052
10 | 53.3856 | 59.6889 | 57.1739 | 55.5341
Average | 52.97854 | 59.50229 | 57.19291 | 55.50037
Std Dev | 0.245655392 | 0.162799778 | 0.219752874 | 0.207621349


Seed | E9 (Bayes Net) | E10 (J48) | E11 (Naïve Bayes) | E12 (Bayes Net)
1 | 55.2882 | 52.6887 | 51.4082 | 51.4082
2 | 55.4642 | 53.511 | 51.9701 | 51.9701
3 | 55.4618 | 52.9853 | 51.8519 | 51.8519
4 | 55.469 | 53.1769 | 51.9471 | 51.9471
5 | 55.2472 | 52.7538 | 51.6639 | 51.6639
6 | 56.0043 | 52.9829 | 51.6831 | 51.6831
7 | 55.6137 | 52.7213 | 51.6772 | 51.6772
8 | 55.416 | 52.7646 | 51.6674 | 51.6674
9 | 55.5052 | 53.1457 | 51.7206 | 51.7206
10 | 55.5341 | 53.1022 | 51.9471 | 51.9471
Average | 55.50037 | 52.9832 | 51.7537 | 51.7537
Std Dev | 0.207621349 | 0.2475 | 0.1668 | 0.1668


Summary of Results

From the experiment results (average accuracies tabulated below), we can infer that C1 (All Attributes & No Noise) is the best combination, since it gives the highest results across all algorithms. Next is C3 (Selected Attributes & No Noise), which gives slightly lower results that are still significantly higher than those of C2 (third best) and C4 (fourth best). Regarding the accuracy of the algorithms, Bayes Net leads by a significant margin over J48 and Naïve Bayes across all four combinations. J48 comes second, performing up to 1.7 percentage points higher than Naïve Bayes across all experiments. We can say the selected attributes used in C3 and C4 are the most relevant because, controlling for noise, accuracy declines by less than 1.5 percentage points when switching from all attributes to only the selected ones.

Results of Experiments: Accuracy Averages (%)

Combination | Naïve Bayes | J48 | Bayes Net
C1 | 56.74272 | 57.66216 | 64.22257
C2 | 52.97854 | 53.49725 | 59.50229
C3 | 55.50037 | 57.19291 | 63.85893
C4 | 51.7537 | 52.9832 | 59.1443


Standard Errors Graph

[Figure: line graph of the standard errors of each classifier across C1-C4]

Looking at the standard errors across experiments, we noticed that C1 stands out, with relatively high values and the highest mean standard error of the four combinations; the mean standard errors for C2, C3, and C4 are roughly the same. The line graph of the standard errors shows a different trend for each algorithm. The area under the J48 curve from C1 to C4 is roughly similar to that of Naïve Bayes, and both are relatively high. Bayes Net, on the other hand, shows a slightly lower standard error in general, with a smaller area under its curve. We can infer that Bayes Net has the smallest standard error of the three algorithms.



Analysis and Conclusion

To evaluate the performance of our algorithms, we use ROC curves.

ROC Curves

A receiver operating characteristic (ROC) curve is a graphical plot that shows the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or recall in machine learning; the false positive rate is also known as the fall-out and can be calculated as 1 - specificity. To determine which algorithm performs better, we look at the tendency of each curve: the closer the curve follows the Y-axis and then the top border of the plot, the larger the area under it, and the more accurate the test.
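In terms of confusion-matrix counts, these two rates are:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN} = 1 - \text{specificity}$$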

To get a plot of the curves, we used Weka's Knowledge Flow tool, loading the following workflow:

[Figure: Weka Knowledge Flow workflow used to generate the ROC curves]

The above Knowledge Flow loads the specified file, assigns which attribute is considered the class, chooses which class value to plot the curve for, and applies a percentage split. The three algorithms are then run on the dataset with the selected parameters, their performance is recorded, and the results are used to plot the ROC curves.
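The same curve data can also be obtained programmatically. Below is a minimal sketch using Weka's ThresholdCurve with Naïve Bayes and the NO class value; the file name is an assumption.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes_cleaned.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);

        int noIdx = data.classAttribute().indexOfValue("NO"); // curve for the NO class
        System.out.println("AUC for NO: " + eval.areaUnderROC(noIdx));

        // TPR/FPR points at every threshold, ready for plotting.
        Instances curve = new ThresholdCurve().getCurve(eval.predictions(), noIdx);
        System.out.println(curve.numInstances() + " curve points generated");
    }
}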


The following are the ROC curves for each experiment set when the class value is NO.

ROC Curve Graphs When Readmission is NO

[Figure: ROC curves for C1]
[Figure: ROC curves for C2]


[Figure: ROC curves for C3]
[Figure: ROC curves for C4]

From the above graphs, we observe that the area under the curve is greater for Naïve Bayes in each of the four experiment sets, C1, C2, C3, and C4. We can therefore conclude that, according to these ROC curves, Naïve Bayes is more accurate than the other algorithms.


Additional Analysis

Let us take a look at the confusion matrices obtained from C1 and C3, which give us the most relevant models.

Confusion Matrix Tables for C1 and C3

[Figure: confusion matrices for C1 and C3]

We can conclude from these confusion matrices that Bayes Net gives a higher percentage of true positives across the class values <30, >30, and NO. Bayes Net therefore appears to be the best predictor from the point of view of our confusion matrices.

Overall Observations

Considering the average accuracy, average standard deviation, ROC curves, attributes, and classifier evaluation, we recommend the following for the Readmission of Diabetes Patients dataset:

•   Class balancing: SMOTE increased overall model accuracy (see the SMOTE Comparison Matrices below).
•   Classifier: Bayes Net gives the highest accuracy. The Naïve Bayes classifier assumes independence between all attributes given a class, which is seldom true; that is why it is called "naïve." In contrast, a Bayesian network lets you build a more detailed (truer) model of the problem using several layers of dependencies: it can track cause-effect relationships among the attributes and the class while calculating and drawing the probabilistic graph.
•   Attributes factor: using all attributes instead of the top 10 gives the highest accuracy.


SMOTE Comparison Matrices

•   Original data using J48

=== Confusion Matrix ===
     a     b     c   <-- classified as
 15914  2664   117 |  a = NO
  8222  3728   141 |  b = >30
  2396  1371    47 |  c = <30

•   After SMOTE 200% using J48

=== Confusion Matrix ===
     a     b     c   <-- classified as
 13827  2683  1231 |  a = NO
  7596  3282  1331 |  b = >30
  3427  1433  6660 |  c = <30
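As a check, overall accuracy can be read off these two J48 matrices (computed from the counts above):

$$\text{Acc}_{\text{original}} = \frac{15914+3728+47}{34600} \approx 56.9\%, \qquad \text{Acc}_{\text{SMOTE}} = \frac{13827+3282+6660}{41470} \approx 57.3\%$$

while recall on the <30 class rises from 47/3814 ≈ 1.2% to 6660/11520 ≈ 57.8%.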

•   Original data using Naïve Bayes

=== Confusion Matrix ===
     a     b     c   <-- classified as
 16009  2138   548 |  a = NO
  8230  3168   693 |  b = >30
  2430   927   457 |  c = <30

•   After SMOTE 200% using Naïve Bayes

=== Confusion Matrix ===
     a     b     c   <-- classified as
 15294  1439  1008 |  a = NO
  8445  2092  1672 |  b = >30
  4721   818  5981 |  c = <30


•   Original data using BayesNet

=== Confusion Matrix ===
     a     b     c   <-- classified as
 13302  4867   526 |  a = NO
  5138  6440   513 |  b = >30
  1662  1800   352 |  c = <30

•   After SMOTE 200% using BayesNet

=== Confusion Matrix ===
     a     b     c   <-- classified as
 12746  4934    61 |  a = NO
  5802  6312    95 |  b = >30
  1714  2063  7743 |  c = <30


References

•   Beata Strack et al., "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records."
•   Hindawi Publishing Corporation: http://www.hindawi.com/journals/bmri/2014/781670/tab1/
•   Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
•   UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
•   Naïve Bayes for Dummies: http://blog.aylien.com/post/120703930533/naive-bayes-for-dummies-a-simple-explanation
•   Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
•   Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
•   Wikipedia:
    https://en.wikipedia.org/wiki/Data_cleansing
    https://en.wikipedia.org/wiki/Receiver_operating_characteristic
    https://en.wikipedia.org/wiki/Bayesian_network