Stat 5600 ESD Poster




Classifying Erythemato-Squamous Diseases

Randall Reese
Utah State University, Stat 5600


Introduction

We examine the classification of erythemato-squamous diseases (ESDs) into their diagnostic categories. We use a series of clinical and histopathological (i.e., examined via biopsy) attributes and a variety of classification methods for multiclass data.

Erythemato-squamous diseases are dermatological conditions marked (especially in their early stages) by a redness of the squamous cells, which make up the vast majority of the human epidermis.

The six ESDs we examine are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. These six ESDs serve as the response variable in our comparison of the classification techniques.

The Attributes

A clinical attribute is an attribute that can be observed visually without biopsy. Each is encoded by the values 0, 1, 2, 3 (unless otherwise indicated) on an increasing scale of severity. We considered the following 12 such attributes.

• erythema

• scaling

• definite borders

• itching

• koebner phenomenon

• polygonal papules

• follicular papules

• oral mucosal involvement

• knee and elbow involvement

• scalp involvement

• family history (0 if no, 1 otherwise)

• age (linear, in years)

A histopathological attribute is an attribute that is observed via biopsy. Each is encoded by the values 0, 1, 2, 3 on an increasing scale of severity. We considered the following 22 such attributes.

• melanin incontinence

• eosinophils in infiltrate

• PNL infiltrate

• fibrosis of papillary dermis

• exocytosis

• acanthosis

• hyperkeratosis

• parakeratosis

• clubbing of rete ridges

• elongation of rete ridges

• thin suprapapillary epidermis

• spongiform pustule

• Munro microabscess

• focal hypergranulosis

• disappearance of granular layer

• vacuolisation and damage of basal layer

• spongiosis

• saw-tooth appearance of retes

• follicular horn plug

• perifollicular parakeratosis

• inflammatory mononuclear infiltrate

• band-like infiltrate
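As a minimal sketch of the encoding described above, the following shows how one patient record might be parsed and validated. It assumes the comma-separated layout of the UCI dermatology file (the 11 coded clinical attributes, then the 22 histopathological attributes, then age, then the ESD type, with "?" marking a missing age). The names `Record` and `parse_record` and the sample row are hypothetical and do not come from the poster.

```python
from typing import List, NamedTuple, Optional

N_CLINICAL_CODED = 11  # erythema ... family history (every clinical attribute except age)
N_HISTO = 22           # histopathological attributes


class Record(NamedTuple):
    clinical: List[int]   # 11 coded clinical attributes
    histo: List[int]      # 22 histopathological attributes
    age: Optional[int]    # None when the age is missing
    esd_type: int         # response: ESD type 1..6


def parse_record(line: str) -> Record:
    """Parse one comma-separated row under the assumed column layout."""
    fields = line.strip().split(",")
    expected = N_CLINICAL_CODED + N_HISTO + 2  # + age + ESD type
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    clinical = [int(v) for v in fields[:N_CLINICAL_CODED]]
    histo = [int(v) for v in fields[N_CLINICAL_CODED:N_CLINICAL_CODED + N_HISTO]]
    if any(not 0 <= v <= 3 for v in clinical + histo):
        raise ValueError("severity codes must lie in 0..3")
    age = None if fields[-2] == "?" else int(fields[-2])
    esd_type = int(fields[-1])
    if not 1 <= esd_type <= 6:
        raise ValueError("ESD type must lie in 1..6")
    return Record(clinical, histo, age, esd_type)


# Hypothetical row (not taken from the data set): missing age, ESD type 3.
row = ",".join(["2"] * 10 + ["1"] + ["0"] * 22 + ["?", "3"])
record = parse_record(row)
print(record.age, record.esd_type)
```

Keeping the clinical and histopathological codes in separate fields makes it straightforward to fit the "clinical only" models on one subset of the attributes.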

Methods

Using the above attributes (or at times a subset thereof), we apply different classification methods to classify patients into one of the six ESD categories.

Note that for logistic regression, we instead classify patients by whether or not they have an ESD associated with autoimmune disorders. Three ESDs are associated with autoimmune disorders: psoriasis (Type 1), lichen planus (Type 3), and chronic dermatitis (Type 5). The other three ESDs have no known association with autoimmune disorders.
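The binary response used for logistic regression can be sketched as a simple recoding of the six ESD types. The function name `autoimmune_label` is hypothetical; the grouping of Types 1, 3, and 5 is the one stated above.

```python
# ESD types with an autoimmune association, as listed in the text.
AUTOIMMUNE_TYPES = {1, 3, 5}


def autoimmune_label(esd_type: int) -> int:
    """Return 1 for an autoimmune-associated ESD type, 0 otherwise."""
    if esd_type not in range(1, 7):
        raise ValueError("ESD type must lie in 1..6")
    return 1 if esd_type in AUTOIMMUNE_TYPES else 0


labels = [autoimmune_label(t) for t in (1, 2, 3, 4, 5, 6)]
print(labels)  # [1, 0, 1, 0, 1, 0]
```

Because this response has two classes rather than six, the logistic regression accuracies are not directly comparable with those of the multiclass methods.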

The classification methods we use are as follows:

• Logistic Regression

• Linear Discriminant Analysis

• Quadratic Discriminant Analysis

• Classification Trees (1-SE)

• Random Forests

• MultiClass Boosting

Table of ESD Types

Our training set contains the following breakdown of ESD types:

Totals for Each ESD Type

ESD Type        1    2    3    4    5    6
Total Subjects  111  60   71   48   48   20
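A short worked calculation on the class totals above gives the sample size and the accuracy of a naive rule that always predicts the most common type. This baseline is not part of the poster; it is offered only as a reference point for reading the accuracies, and the variable names are illustrative.

```python
# Class totals copied from the table of ESD types.
counts = {1: 111, 2: 60, 3: 71, 4: 48, 5: 48, 6: 20}

total = sum(counts.values())
majority_type = max(counts, key=counts.get)
baseline_accuracy = counts[majority_type] / total

print(total)                       # 358
print(majority_type)               # 1
print(f"{baseline_accuracy:.1%}")  # 31.0%
```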

Classification Trees

10-fold CV accuracy given by ESD type:

ESD Type  Accuracy %
1         97.30%
2         93.33%
3         98.59%
4         93.75%
5         100%
6         85%
Overall   96.09%
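The overall figure can be recovered from the per-type accuracies and the class totals. The sketch below assumes that every subject is predicted exactly once across the ten folds, so that each per-type accuracy is a count of correct predictions divided by the class total.

```python
# Class totals and per-type classification tree accuracies from the tables.
counts = [111, 60, 71, 48, 48, 20]
accuracy_pct = [97.30, 93.33, 98.59, 93.75, 100.0, 85.0]

# Recover the number of correctly classified subjects of each type.
correct = [round(n * a / 100) for n, a in zip(counts, accuracy_pct)]
overall = sum(correct) / sum(counts)

print(correct)            # [108, 56, 70, 45, 48, 17]
print(f"{overall:.2%}")   # 96.09%
```

The overall accuracy is therefore a count-weighted average, so the small Type 6 class (20 subjects) has little influence on it despite its lower accuracy.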

Fig. 1: Classification Tree using 1-SE Rule with cp = 0.017
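The 1-SE rule mentioned in the caption can be sketched as follows: find the tree with the smallest cross-validated error, then pick the simplest tree whose error is within one standard error of that minimum. The field names (`cp`, `nsplit`, `xerror`, `xstd`) mirror an rpart-style complexity table, which is an assumption since the poster does not name the tree software. The numbers in the table are purely hypothetical and are not the poster's results.

```python
def one_se_choice(cp_table):
    """Return the simplest row whose CV error is within 1 SE of the minimum."""
    best = min(cp_table, key=lambda row: row["xerror"])
    threshold = best["xerror"] + best["xstd"]
    eligible = [row for row in cp_table if row["xerror"] <= threshold]
    # Fewest splits = simplest tree among the eligible candidates.
    return min(eligible, key=lambda row: row["nsplit"])


# Hypothetical complexity table, for illustration only.
cp_table = [
    {"cp": 0.300, "nsplit": 0, "xerror": 1.000, "xstd": 0.040},
    {"cp": 0.100, "nsplit": 1, "xerror": 0.700, "xstd": 0.040},
    {"cp": 0.050, "nsplit": 2, "xerror": 0.400, "xstd": 0.035},
    {"cp": 0.020, "nsplit": 4, "xerror": 0.115, "xstd": 0.020},
    {"cp": 0.010, "nsplit": 6, "xerror": 0.100, "xstd": 0.020},
    {"cp": 0.005, "nsplit": 8, "xerror": 0.110, "xstd": 0.020},
]

chosen = one_se_choice(cp_table)
print(chosen["nsplit"], chosen["cp"])  # 4 0.02
```

In this made-up table the minimum error occurs at 6 splits, but the 4-split tree is within one standard error of it and is preferred for being smaller.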

Fig. 2: Variable Importance for Random Forests.

The 16 most important variables were retained.

Random Forests

10-fold CV accuracy given by ESD type:

ESD Type  Accuracy %
1         100%
2         96.67%
3         100%
4         81.25%
5         100%
6         85%
Overall   96.09%

Even with only the 16 retained attributes, random forests obtain very high 10-fold cross-validated accuracy.
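The 10-fold cross-validated accuracy reported throughout can be sketched in a few lines. This is a generic illustration, not the poster's code: the fold assignment (a seeded shuffle, without stratification) is an assumption, and the classifier is a trivial majority-class stand-in so that the example needs nothing beyond the standard library. All function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(X, y, fit, predict, k=10, seed=0):
    """Pooled accuracy: each subject is predicted once, by a model not trained on it."""
    correct = 0
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(y)


# Stand-in classifier: always predict the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Labels built from the class totals in the table of ESD types; features are unused here.
y = [1] * 111 + [2] * 60 + [3] * 71 + [4] * 48 + [5] * 48 + [6] * 20
X = [None] * len(y)
baseline = cv_accuracy(X, y, fit_majority, predict_majority)
print(f"{baseline:.1%}")  # 31.0%
```

Any of the methods compared on the poster could be plugged in through the `fit` and `predict` arguments in place of the stand-in.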

Accuracies Summarized

The 10-fold cross-validated accuracies for each method are given below. Where noted, “Full” means all attributes were used, “Clinical” means that only the clinical attributes were used, and “Var. Select” indicates that variable selection was used.

Classification                  Accuracy Rate
Log Reg (Full)                  99.16%
Log Reg (Clinical)              84.08%
Log Reg (Var. Select)           95.25%
LDA                             96.63%
QDA (Full)                      82.36%
QDA (Clinical)                  72.07%
Classf Trees (Full)             96.09%
Classf Trees (Clinical)         84.92%
Rand Forest (Full)              96.93%
Rand Forest (Var. Select)       96.09%
MultiClass Boost (Full)         96.09%
MultiClass Boost (Clinical)     84.92%
MultiClass Boost (Var. Select)  95.81%
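As a rough guide to how much sampling noise these accuracies carry, the sketch below computes a binomial standard error for a few of the reported rates, using the 358 subjects from the table of ESD types. Treating cross-validated predictions as independent trials is only an approximation, and this calculation is not part of the poster.

```python
import math


def accuracy_se(accuracy, n):
    """Binomial standard error of an accuracy estimated from n predictions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


n_subjects = 358  # sum of the class totals
reported = [
    ("Rand Forest (Full)", 0.9693),
    ("Rand Forest (Var. Select)", 0.9609),
    ("MultiClass Boost (Var. Select)", 0.9581),
]
for name, acc in reported:
    print(f"{name}: {acc:.2%} +/- {accuracy_se(acc, n_subjects):.2%}")
```

Under this approximation each standard error is about one percentage point, which is of the same order as the gaps between the best-performing multiclass methods.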

Conclusions

Overall, random forests gave us the most consistent classification results. We were able to obtain over 96% accuracy, even after variable selection.

MultiClass boosting (from the R package maboost) was also quite accurate. On the same subset of variables selected by random forests, MultiClass boosting had a 10-fold cross-validated accuracy of 95.81%, which is quite respectable.

Further Work

Further work in ESD classification could focus on more accurate classification using only clinical attributes. This could lead to faster diagnosis and lower the overall cost of diagnosis for the patient.

Additional ESD types could also be added to the data set, and even more robust diagnostic methods could be attempted.

The data set for this report comes from the following paper:

G. Demiroz, H. Guvenir, and N. Ilter (1998). Learning Differential Diagnosis of ESD using Voting Feature Intervals. Artif. Intell. in Med., 13(3):147-165.

It is referenced on the UC Irvine Machine Learning Repository under the multivariate data category (“Dermatology Data Set”).