Cardiac Rhythm Classification

Imran A. Khan

L. Barnes Machine Learning, SYS 6016 / CS 6316

1. Summary of Results

The task is to design and evaluate models for cardiac rhythm classification and to recommend the approach with the smallest test-set and cross-validation errors. In this research, I use several machine learning approaches: tree learning, rule learning, instance-based learning, and ensemble methods. The goal is to distinguish atrial fibrillation from normal sinus rhythm and from normal sinus rhythm with ectopy.

After experiments with different classifiers, this research shows that Random Forest, with 500 trees and 4 attributes (HRV, LDs, COSEn, and DFA), gives the highest prediction accuracy of 93.43%.

2. Problem Description

The goal of this research is to classify types of cardiac arrhythmia by building models and explaining the results based on data from the University of Virginia (UVa) Health System Heart Station. UVa physicians used Holter monitors to record 2,895 24-hour RR-interval time series between 10/2010 and 12/2014 for clinical reasons.

The training data (rhythm-training.csv) comprise 2,178 labeled instances (rows) and 8 numeric attributes (columns). There are no missing values. The last column gives the cardiac rhythm class: atrial fibrillation AF (1), normal sinus rhythm NSR (2), and normal sinus rhythm NSR with ectopy (3). The test data (rhythm-testing.csv) comprise 512 unlabeled instances. A detailed description of the attributes is given in Table 1.

No  Attribute  Description
1   PID        Unique patient id
2   HR         Heart rate
3   HRV        Heart rate variability
4   AGE        Age of patient
5   LDs        Local Dynamics score for 12-beat segments [2]
6   COSEn      Coefficient of Sample Entropy assessed on 30-minute segments [1]
7   DFA        Detrended Fluctuation Analysis (long-range correlations in non-stationary signals) [3]
8   Class      {1, 2, 3} class attribute: AF (1), NSR (2), or NSR with Ectopy (3)

Table 1: Attribute descriptions


The PID variable is not considered in model building, so in total only 6 attributes are used to predict the Class variable.

To evaluate the models, I use two main performance metrics: cross-validation error and test error.

Cross-Validation Error: 10-fold cross-validation on the training data is used to estimate how well each model performs.

Test Error: each model's predictions on the held-out test data are evaluated to check for overfitting.

Evaluation measures of prediction success:

- Correctly classified instances
- Incorrectly classified instances
- Kappa statistic
- Confusion matrix

I also perform some preprocessing in Weka version 3.6.9:

- Remove PID from the predictor list
- Convert the Class variable from numeric to nominal
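The 10-fold cross-validation used throughout the report can be sketched in Python. This is an illustrative sketch of the fold splitting only, not Weka's implementation; the instance count comes from the training file described above.

```python
import random

def ten_fold_splits(n_instances, n_folds=10, seed=0):
    """Partition instance indices into disjoint folds; each fold serves
    once as the test part while the remaining folds form the training part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        held_out = set(folds[k])
        train = [i for i in indices if i not in held_out]
        yield train, folds[k]

# 2,178 labeled training instances, as in the report
splits = list(ten_fold_splits(2178))
assert len(splits) == 10
# every instance is tested exactly once across the 10 folds
assert sorted(i for _, test in splits for i in test) == list(range(2178))
```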

3. Tree Learning

a. Decision Tree

To build a good decision tree, I try the same decision tree model with different parameters. For all decision trees, I use the Weka J48 classifier with numFolds = 2, minNumObj = 2, and confidenceFactor = 0.15.

The decision tree model is built on the 6 attributes to predict the Class variable, in both unpruned and pruned versions.

Model 1: 6 attributes, unpruned

Model 2: 6 attributes, pruned

I also examine the contribution of the 6 attributes using Weka's GainRatioAttributeEval selection evaluator with the Ranker search method. The importance of an attribute is measured by its gain ratio with respect to the class.

Average rank  Attribute
1             HRV
2             COSEn
3             DFA
4             LDs
5             AGE
6             HR

Table 2: Attribute importance ranking
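GainRatioAttributeEval scores each attribute by its information gain divided by the split's intrinsic information. A minimal sketch for a discrete attribute follows; the attribute values and labels are made up for illustration and are not from the rhythm data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Information gain of the split divided by its intrinsic value,
    in the spirit of Weka's GainRatioAttributeEval (discrete attributes)."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    intrinsic = entropy(attr_values)  # split information
    return gain / intrinsic if intrinsic else 0.0

# toy example: a binary attribute that separates the classes perfectly
attr = ["low", "low", "high", "high", "high", "low"]
labels = [1, 1, 2, 2, 2, 1]
assert gain_ratio(attr, labels) == 1.0  # perfect split -> ratio 1
```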


Following the importance ranking in Table 2, I build decision trees for 3 further attribute sets (excluding Age and HR, excluding Age and LDs, excluding Age and DFA). The accuracy of the model excluding Age and LDs is lower than the others, so I only consider 2 of these models: one without Age and HR, and one without Age and DFA.

The results show that accuracy improves from model to model based on the cross-validation accuracy rate. All accuracy and error rates are calculated using 10-fold cross-validation, where the training data is divided into 10 parts: in each fold, 9 parts are used for training and 1 part for testing. The following table summarizes the structure of each tree.

Tree/Feature                                         Leaves  Tree size  Accuracy  Incorrect  Kappa
Model 1: 6 attributes, unpruned                          56        111    91.55%      8.45%  0.783
Model 2: 6 attributes, pruned                            34         67    92.05%      7.95%  0.795
Model 3: 4 attributes, pruned (exclude Age and HR)       11         21    92.25%      7.75%  0.801
Model 4: 4 attributes, pruned (exclude Age and DFA)      21         41    92.65%      7.35%  0.812

Table 3: Model information

In terms of Weka processing time, Model 1 takes 0.06 seconds and Model 2 takes 0.1 seconds, while Models 3 and 4 each take 0.04 seconds.

Accuracy increases slightly from Model 1 to Model 4, and the Kappa statistic moves closer to 1 with each model. This shows that the last model, which excludes Age and DFA, performs best. However, this model is more complex than Model 3, with a tree size of 41, and gains only a little accuracy over Model 3. Thus, Model 3 is chosen as the best model.

Model 1               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            161      17      23     201
b = 2              9    1587      46    1642
c = 3             15      74     246     335
Total            185    1678     315    2178

Table 4: Confusion matrix of Model 1
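The accuracy and Kappa statistic reported for Model 1 in Table 3 can be reproduced from this confusion matrix; a quick check in Python:

```python
def accuracy_and_kappa(matrix):
    """Observed accuracy and Cohen's Kappa from a confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in matrix)
    p_observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa

# confusion matrix of Model 1 (Table 4)
model1 = [[161,   17,  23],
          [  9, 1587,  46],
          [ 15,   74, 246]]
acc, kappa = accuracy_and_kappa(model1)
assert round(acc * 100, 2) == 91.55  # matches Table 3
assert round(kappa, 3) == 0.783      # matches Table 3
```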


Model 2               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            160      17      24     201
b = 2              4    1593      45    1642
c = 3             13      70     252     335
Total            177    1680     321    2178

Table 5: Confusion matrix of Model 2

Model 3               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            159      14      28     201
b = 2              4    1596      42    1642
c = 3             14      67     254     335
Total            177    1677     324    2178

Table 6: Confusion matrix of Model 3

Model 4               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            157      17      27     201
b = 2              2    1595      45    1642
c = 3             11      58     266     335
Total            170    1670     338    2178

Table 7: Confusion matrix of Model 4

From Tables 4-7, I can see that the pruned tree always performs better than the unpruned tree. Pruning also makes the tree significantly simpler, with fewer leaves and a smaller tree size.

Comparing the confusion matrices in the tables above, the number of correct predictions for Model 1 is 161 + 1,587 + 246 = 1,994, while Model 2 has 2,005 correct predictions, Model 3 has 2,009, and Model 4 has 2,016. As can be observed from Tables 5 and 6, the number of correct predictions in classes 2 and 3 has increased, whereas the correct predictions for class 1 have decreased slightly.


Figure 1: Decision tree plot of Model 3

Figure 1 displays the decision tree of Model 3. It can be observed that COSEn and DFA are the 2 most important predictors, as they sit at the top of the tree and recur at many internal nodes.

b. Random Forests

A Random Forest is a collection of decision trees, each built on a different random subset of predictors. Once the trees are built, the final prediction is the class on which the most trees agree (majority vote).

I consider 3 random forest models, each with 500 trees.
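The majority vote over trees can be sketched as follows. This is a minimal sketch of the voting step only; tie-breaking and Weka's exact vote-combination scheme are glossed over.

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Combine per-tree class votes into one prediction by majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]

# e.g. 5 trees vote NSR (2), 3 vote AF (1), 2 vote NSR with Ectopy (3)
votes = [2, 2, 1, 2, 3, 1, 2, 1, 2, 3]
assert forest_predict(votes) == 2
```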

Random Forest                                Accuracy  Out-of-Bag Error  Incorrect  Kappa
Model 5: 6 attributes                          93.48%            0.0634      6.52%  0.833
Model 6: 4 attributes (exclude Age and HR)     93.43%            0.0666      6.57%  0.833
Model 7: 4 attributes (exclude Age and DFA)    93.11%            0.0652      6.89%  0.823

Table 8: Random forest model information


The random forest models perform significantly better than the decision tree models, with higher accuracy rates. Models 5 and 6 have nearly identical accuracy and Kappa values. Model 5 takes longer to run than Model 6 (11.85 seconds versus 10.67 seconds). It is hard to conclude which model is better; however, from the confusion matrices in Tables 9 and 10, Model 5 has one more correct prediction than Model 6.

Model 5               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            163      17      21     201
b = 2              3    1601      38    1642
c = 3              7      56     272     335
Total            173    1674     331    2178

Table 9: Confusion matrix of Model 5

Model 6               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            163      14      24     201
b = 2              0    1599      43    1642
c = 3              9      53     273     335
Total            172    1666     340    2178

Table 10: Confusion matrix of Model 6

Model 7               Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            157      19      25     201
b = 2              1    1601      40    1642
c = 3              6      59     270     335
Total            164    1679     335    2178

Table 11: Confusion matrix of Model 7

Model 7 has the lowest accuracy of the three random forests (Table 11), and its evaluation measures in Table 8 are smaller than those of Models 5 and 6. This shows that the model with all 6 attributes performs best.

4. Rule Learning

In this part, I consider two different algorithms: JRIP (Repeated Incremental Pruning to Produce Error Reduction, RIPPER, proposed by William W. Cohen) and PART (which combines the divide-and-conquer strategy with the separate-and-conquer strategy of rule learning) [6]. Each is built first with 6 attributes and then with 4 attributes (excluding Age and HR). I analyze the results to see which has better performance based on cross-validation accuracy.
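Both algorithms ultimately produce an ordered list of if-then rules plus a default class. Applying such a rule list can be sketched as follows; the rules and thresholds shown are made up for illustration, not the rules RIPPER actually learned here.

```python
def classify(instance, rules, default=2):
    """Apply an ordered rule list: the first matching rule fires;
    otherwise fall back to the default class (here NSR = 2)."""
    for condition, klass in rules:
        if condition(instance):
            return klass
    return default

# hypothetical rules in the spirit of a learned rule list
rules = [
    (lambda x: x["COSEn"] > -1.0 and x["DFA"] < 0.6, 1),  # -> AF
    (lambda x: x["HRV"] > 0.2, 3),                        # -> NSR with Ectopy
]
assert classify({"COSEn": -0.5, "DFA": 0.4, "HRV": 0.1}, rules) == 1
assert classify({"COSEn": -2.0, "DFA": 1.0, "HRV": 0.1}, rules) == 2
```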

a. 6 attributes

The results in Table 12 show that the accuracy of JRIP is slightly higher than that of PART. The number of correct predictions is 168 + 1,591 + 259 = 2,018 for Model 8 and 2,007 for Model 9 (Tables 13 and 14). It can be concluded that the JRIP model performs better than the PART model.

Rule Learning   Accuracy  Incorrect  Kappa
Model 8: JRIP     92.65%      7.35%  0.812
Model 9: PART     92.15%      7.85%  0.801

Table 12: Rule learning models with 6 attributes

Model 8: JRIP         Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            168      16      17     201
b = 2              4    1591      47    1642
c = 3             11      65     259     335
Total            183    1672     323    2178

Table 13: Confusion matrix of Model 8

Model 9: PART         Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            164      16      21     201
b = 2              4    1579      59    1642
c = 3             11      60     264     335
Total            179    1655     344    2178

Table 14: Confusion matrix of Model 9

b. 4 attributes (exclude Age and HR)

I also exclude the Age and HR variables and rebuild the models using JRIP and PART to see whether that improves the prediction performance.

Rule Learning    Accuracy  Incorrect  Kappa
Model 10: JRIP     92.06%      7.94%  0.799
Model 11: PART     91.74%      8.26%  0.791

Table 15: Model 10 and 11 information


The results from JRIP and PART with 4 prediction attributes are somewhat lower than with 6 attributes. The accuracy of JRIP is again higher than that of PART: 2,005 correct predictions for JRIP versus 1,998 for PART.

Model 10: JRIP        Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            153      19      29     201
b = 2              1    1585      56    1642
c = 3             14      54     267     335
Total            168    1658     352    2178

Table 16: Confusion matrix of Model 10

Model 11: PART        Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            152      18      31     201
b = 2              7    1580      55    1642
c = 3             11      58     266     335
Total            170    1656     352    2178

Table 17: Confusion matrix of Model 11

c. 4 attributes (exclude Age and DFA)

I also build models with 4 attributes, excluding Age and DFA, using JRIP and PART to see whether that improves the prediction performance.

Rule Learning    Accuracy  Incorrect  Kappa
Model 12: JRIP     92.33%      7.67%  0.801
Model 13: PART     92.19%      7.81%  0.802

Table 18: Model 12 and 13 information

The JRIP and PART results with these 4 attributes are higher than those with the 4 attributes excluding Age and HR, though still slightly below the 6-attribute models. The accuracy of JRIP is again higher than that of PART, with more correct predictions (JRIP = 2,011; PART = 2,008).

Model 12: JRIP        Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            154      23      24     201
b = 2              0    1596      46    1642
c = 3              9      65     261     335
Total            163    1684     331    2178

Table 19: Confusion matrix of Model 12


Model 13: PART        Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            155      19      27     201
b = 2              7    1584      51    1642
c = 3              8      58     269     335
Total            170    1661     347    2178

Table 20: Confusion matrix of Model 13

5. Instance-Based Learning

For instance-based learning, I use the k-Nearest Neighbor classifier with Weka's IBk algorithm. A model with 6 attributes is considered with two k values (k=5 and k=10). In addition, a model with 4 attributes is considered with k ranging from 1 to 10, to see which k is good enough for classification.
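IBk's prediction step amounts to a majority vote over the k closest training instances. A plain (unweighted, Euclidean-distance) sketch, with toy 2-D data in place of the rhythm attributes:

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=10):
    """Predict the class of `query` by majority vote among the k
    nearest training instances (Euclidean distance, unweighted)."""
    neighbours = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))
    top_k = [y for _, y in neighbours[:k]]
    return Counter(top_k).most_common(1)[0][0]

# toy data: class 1 clustered near the origin, class 2 near (5, 5)
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [1, 1, 1, 2, 2, 2]
assert knn_predict(X, y, (0.5, 0.5), k=3) == 1
assert knn_predict(X, y, (5.5, 5.5), k=3) == 2
```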

a. 6 attributes

From Table 21, I can see that IBK10 performs better than IBK5, with a higher accuracy rate and a better Kappa statistic.

Instance-based learning   Accuracy  Incorrect  Kappa
Model 14: IBK5              92.56%      7.44%  0.805
Model 15: IBK10             93.16%      6.84%  0.821

Table 21: Model 14 and 15 information

The confusion matrices show that the k-Nearest Neighbor model with k=10 has a better overall prediction rate, mainly on classes 2 and 3 (Tables 22 and 23).

Model 14: IBK5        Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            170      19      12     201
b = 2              8    1606      28    1642
c = 3             13      82     240     335
Total            191    1707     280    2178

Table 22: Confusion matrix of Model 14


Model 15: IBK10       Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            165      19      17     201
b = 2              2    1612      28    1642
c = 3             11      72     252     335
Total            178    1703     297    2178

Table 23: Confusion matrix of Model 15

b. 4 attributes (exclude Age and HR)

A k-Nearest Neighbor classifier with 4 attributes (excluding Age and HR) is fitted to the data with k values ranging from 1 to 10.

k      Accuracy  Incorrect  Kappa
IBK1     90.27%      9.73%  0.753
IBK2     89.90%     10.10%  0.732
IBK3     92.15%      7.85%  0.799
IBK4     91.87%      8.13%  0.785
IBK5     92.56%      7.44%  0.805
IBK6     92.29%      7.71%  0.797
IBK7     92.70%      7.30%  0.810
IBK8     92.88%      7.12%  0.814
IBK9     92.79%      7.21%  0.813
IBK10    93.16%      6.84%  0.821

Table 24: Fitted models with 4 attributes (excluding Age and HR) for different k values

Following the information in Table 24 and the plot in Figure 2, it can be concluded that k=10 nearest neighbors is the best choice for the classifier. Accuracy generally increases as the number of nearest neighbors increases.


Figure 2: Accuracy percentage for different numbers of nearest neighbors

The confusion matrix of Model 16 (the model with k=10) shows 2,025 correct predictions. It performs better than Model 14 and is comparable to Model 15, despite using 2 fewer attributes.

Model 16: IBK10       Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            158      19      24     201
b = 2              0    1608      34    1642
c = 3              7      69     259     335
Total            165    1696     317    2178

Table 25: Confusion matrix of Model 16

6. Neural Network Learning

For neural network training, the data are divided into three sets:

1. Training (70%, 1,524 records)
2. Validation (15%, 327 records)
3. Testing (15%, 327 records)
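The 70/15/15 split sizes follow directly from the 2,178 training records; a quick check (the report does not specify the partitioning tool beyond these percentages):

```python
n = 2178                      # labeled records in rhythm-training.csv
n_train = int(0.70 * n)       # 1,524 training records
n_val = round(0.15 * n)       # 327 validation records
n_test = n - n_train - n_val  # 327 testing records
assert (n_train, n_val, n_test) == (1524, 327, 327)
```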

The model with the smallest validation error is chosen as the optimal neural network. The parameters of the neural network are:

- Number of input nodes = 6
- Number of output nodes = 3
- Number of hidden layers = 1
- Number of nodes in the hidden layer = 10


The performance of the neural network is shown in Figure 3.

Figure 3: Change in MSE with increasing iterations

The minimum MSE is obtained at the 39th iteration. The confusion matrices for the training, validation, and testing data are given in Figure 4.

The misclassification error on the training set is 6.5%; on the validation set it is reduced to 6.4%. On the test set, an outstanding result is obtained: a misclassification error of 3.1%, i.e. 96.9% accuracy. This is the best result obtained so far. The overall accuracy is 94%, with 6% error.


Figure 4: Confusion matrices for Training, validation, testing and overall performance

7. Ensemble Methods

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. There are various ensemble methods; here, I use three of them, namely AdaBoostM2, Bag, and StackingC, to build three different models, with 'Tree' selected as the weak learner in each. I then use the same algorithms to build models with 4 attributes.


a. 6 attributes model

Model using AdaBoostM2

To build this model, 100 tree learners are used. AdaBoostM2 is a well-known boosting algorithm for multiclass classification (3+ classes). The algorithm trains learners sequentially, computing a weighted classification error at each step. The results of this model are:

Training set accuracy = 91.80%
Test set accuracy = 93.27%

The change in test classification error with an increasing number of trees is given in Figure 5.

Figure 5: Change in test classification error with number of trees for AdaBoostM2
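The sequential reweighting idea can be sketched with one SAMME-style multiclass boosting round. This is a simplification for illustration: MATLAB's AdaBoostM2 uses a pseudo-loss formulation rather than the plain 0-1 error shown here.

```python
from math import log, exp

def boosting_round(weights, correct, n_classes=3):
    """One reweighting step: compute the weighted error of the current
    weak learner, its vote weight alpha, and up-weight its mistakes."""
    total = sum(weights)
    err = sum(w for w, ok in zip(weights, correct) if not ok) / total
    alpha = log((1 - err) / err) + log(n_classes - 1)  # SAMME vote weight
    new_w = [w if ok else w * exp(alpha) for w, ok in zip(weights, correct)]
    norm = sum(new_w)
    return alpha, [w / norm for w in new_w]

# 10 instances, uniform weights; the weak learner misclassifies 2 of them
weights = [0.1] * 10
correct = [True] * 8 + [False] * 2
alpha, new_weights = boosting_round(weights, correct)
assert alpha > 0                        # better than random -> positive vote
assert new_weights[9] > new_weights[0]  # misclassified instances gain weight
```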

Model using Bag

Bagging, which stands for "bootstrap aggregating," is a type of ensemble learning. To bag a weak learner such as a decision tree, I generate many bootstrap replicas of the dataset and grow a decision tree on each replica. Each bootstrap replica is obtained by randomly sampling N observations with replacement. The predicted response of the trained ensemble is obtained by combining the predictions of the individual trees. This ensemble uses 200 weak decision tree learners. The results of the model are:

Training set accuracy = 100%
Test set accuracy = 94.04%
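A bootstrap replica and the resulting out-of-bag set can be sketched as follows (this illustrates the resampling idea behind the out-of-bag error in Table 8, not MATLAB's exact implementation):

```python
import random

def bootstrap_replica(n, seed=0):
    """Draw n indices with replacement; the indices never drawn form
    the out-of-bag (OOB) set used for the OOB error estimate."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = set(range(n)) - set(in_bag)
    return in_bag, out_of_bag

in_bag, oob = bootstrap_replica(2178)
assert len(in_bag) == 2178
# roughly (1 - 1/n)^n, about 37%, of the instances land out of bag
assert 0.30 < len(oob) / 2178 < 0.45
```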


Using 5-fold cross-validation, the cross-validation error is calculated as the number of trees increases. The change in classification error with the number of trees for the test and cross-validated sets is given in Figure 6.

Figure 6: The change in classification error with number of trees for test and cross validation

Model using StackingC

StackingC is an efficient version of the stacking method. This learning method combines the predictions of base learners (JRIP, Random Forest, and NaiveBayes) using a meta-learner (linear regression) [5].
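The stacking idea can be sketched as follows: each base learner's class-probability output becomes an input feature for the meta-learner, which StackingC then fits with linear regression. The probability values below are made up for illustration.

```python
def meta_features(base_outputs):
    """Concatenate the class-probability outputs of the base learners
    into one feature row for the level-1 (meta) model."""
    row = []
    for probs in base_outputs:
        row.extend(probs)
    return row

# hypothetical class-probability outputs of the three base learners
# (JRIP, Random Forest, NaiveBayes) for a single instance
jrip_p = [0.10, 0.85, 0.05]
rf_p = [0.20, 0.70, 0.10]
nb_p = [0.05, 0.90, 0.05]
row = meta_features([jrip_p, rf_p, nb_p])
assert len(row) == 9  # 3 base learners x 3 classes
```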

Ensemble method       Accuracy  Incorrect  Kappa
Model 17: StackingC     93.48%      6.52%  0.833

Table 26: Model 17 information


Of the three models, the one using the Bag method gives the highest test-set prediction accuracy. The confusion matrix of the StackingC model is given in Table 27.

Model 17: StackingC   Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            161      19      21     201
b = 2              2    1601      39    1642
c = 3              9      52     274     335
Total            172    1672     334    2178

Table 27: Confusion matrix of Model 17

b. 4 attributes

In this part, I rerun the above algorithms with 4 attributes (excluding HR and Age). The number of trees used in Bagging is 100.

Ensemble method        Accuracy  Incorrect  Kappa
Model 18: AdaBoostM2     92.06%      7.94%  0.797
Model 19: Bagging        92.38%      7.62%  0.806
Model 20: StackingC      93.02%      6.98%  0.822

Table 28: Model 18, 19, and 20 information

Model 18 takes 0.66 seconds to run and Model 19 takes 0.33 seconds. The StackingC method takes the longest (>100 seconds) because it uses a random forest among its base learners. However, the StackingC algorithm gives a very good result, with a high accuracy rate and a high Kappa statistic (Tables 29, 30, and 31), although its accuracy with 4 attributes is slightly lower than with 6 attributes.

Model 18              Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            160      16      25     201
b = 2              3    1590      49    1642
c = 3             17      63     255     335
Total            180    1669     329    2178

Table 29: Confusion matrix of Model 18


Model 19              Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            161      17      23     201
b = 2              2    1589      51    1642
c = 3             14      59     262     335
Total            177    1665     336    2178

Table 30: Confusion matrix of Model 19

Model 20              Predicted class
Actual class   a = 1   b = 2   c = 3   Total
a = 1            159      14      28     201
b = 2              0    1598      44    1642
c = 3              9      57     269     335
Total            168    1669     341    2178

Table 31: Confusion matrix of Model 20

8. Comparison of All Models

The cardiac rhythm data consist of 2,178 instances and 8 variables. There are no missing values, and the PID column is removed before building and evaluating the classifiers. In this research, I use 5 method families, implemented mainly in Weka: tree learning (decision tree, random forest), rule learning (JRIP and PART), instance-based learning (IBk), neural networks, and ensemble methods (AdaBoostM2, Bag, StackingC), with 10-fold cross-validation.

The decision tree is built using different post-pruning settings. The decision tree is the simplest method to analyze and visualize; however, it does not give high prediction accuracy.

Random Forest, Neural Networks, and the Ensemble Methods have the longest building times; each takes from 10 to 15 seconds to build. These models are hard to interpret: they can be used for prediction but give little direct insight into the data.

For the decision tree and random forest, it is clear that excluding Age and HR when building the prediction model can improve the prediction accuracy, while excluding DFA does not have a significant impact on model accuracy.

For the instance-based method, I pick k-Nearest Neighbor with k = 10, the parameter value that gives the highest accuracy.

I have applied some of the above models to the test set. The decision tree model with 6 attributes (90.07%) has lower accuracy than the random forest with 500 trees and 6 attributes (91.91%). Model 7, without Age and DFA, has the highest predicted classification accuracy on the test data, at 94.48%. The other models I selected to test (Models 6, 8, 14, and 18) have lower prediction accuracy than Model 7. However, Table 32 shows that the random forest with 500 trees and the StackingC models give the best accuracy percentages on the training data. The test set consists of only 270 instances, about 12% of the 2,178 training instances, so the test-set accuracy estimates may be unreliable.

Random Forest with 500 trees on 4 attributes and StackingC on 6 attributes give similar prediction accuracy, but I pick the Random Forest model because it uses fewer attributes. This makes the Random Forest a simpler model and helps avoid overfitting (Table 32).

Model                                                    Accuracy  Incorrect  Kappa

Decision tree
Model 1: 6 attributes, unpruned                            91.55%      8.45%  0.783
Model 2: 6 attributes, pruned                              92.05%      7.95%  0.795
Model 3: 4 attributes, pruned (exclude Age and HR)         92.25%      7.75%  0.801
Model 4: 4 attributes, pruned (exclude Age and DFA)        92.65%      7.35%  0.812

Random forests
Model 5: 6 attributes                                      93.48%      6.52%  0.833
Model 6: 4 attributes (exclude Age and HR)                 93.43%      6.57%  0.833
Model 7: 4 attributes (exclude Age and DFA)                93.11%      6.89%  0.823

Rule learning
Model 8: JRIP with 6 attributes                            92.65%      7.35%  0.812
Model 9: PART with 6 attributes                            92.15%      7.85%  0.801
Model 10: JRIP with 4 attributes (exclude Age and HR)      92.06%      7.94%  0.799
Model 11: PART with 4 attributes (exclude Age and HR)      91.74%      8.26%  0.791
Model 12: JRIP with 4 attributes (exclude Age and DFA)     92.33%      7.67%  0.801
Model 13: PART with 4 attributes (exclude Age and DFA)     92.19%      7.81%  0.802

Instance-based learning
Model 14: IBK5 with 6 attributes                           92.56%      7.44%  0.805
Model 15: IBK10 with 6 attributes                          93.16%      6.84%  0.821
Model 16: IBK10 with 4 attributes (exclude Age and HR)     93.16%      6.84%  0.821

Neural network learning
Overall accuracy                                           94.00%      6.00%  -

Ensemble methods
AdaBoostM2 with 6 attributes                               91.80%      8.20%  -
Bagging with 6 attributes (training set)                  100.00%      0.00%  -
Model 17: StackingC with 6 attributes                      93.48%      6.52%  0.833
Model 18: AdaBoostM2 with 4 attributes (exclude Age
and HR)                                                    92.06%      7.94%  0.797
Model 19: Bagging with 4 attributes (exclude Age and HR)   92.38%      7.62%  0.806
Model 20: StackingC with 4 attributes (exclude Age
and HR)                                                    93.02%      6.98%  0.822

Table 32: Prediction accuracy comparison

References

1. Lake DE, Moorman JR. "Accurate estimation of entropy in very short physiological time series: the problem of atrial fibrillation detection in implanted ventricular devices." Am J Physiol Heart Circ Physiol, 300:H319-H325, 2011.
2. Moss TJ, Lake DE, Moorman JR. "Local dynamics of heart rate: detection and prognostic implications." Physiological Measurement. In press.
3. Peng CK, Havlin S, et al. "Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series." Chaos, 5:82, 1995.
4. Ian H. Witten, Eibe Frank, Mark A. Hall. "Data Mining: Practical Machine Learning Tools and Techniques." Third Edition.
5. Ian H. Witten. Data Mining with Weka. Available: http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/slides/Class4-DataMiningWithWeka-2013.pdf
6. Data Mining Rule-based Classifiers. Available: http://staffwww.itn.liu.se/~aidvi/courses/06/dm/lectures/lec4.pdf