
Intelligent Database Systems Lab

National Yunlin University of Science and Technology

GMDH-based feature ranking and selection for improved classification of medical data

Advisor : Dr. Hsu

Presenter : Yu-San Hsieh

Author : R.E. Abdel-Aal

2005. BI.456-468


N.Y.U.S.T.

I. M.

Outline

─ Motivation
─ Objective
─ Method
─ Material
─ Results
─ Conclusions


Motivation

Accuracy is very important in classifiers used for medical applications.


Objective

Improve classification performance on medical data.


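Method

First stage: feature ranking
─ GMDH algorithm

1. Representation

[Figure: GMDH network layer. Inputs x1, x2, x3, x4 are combined in pairs to give the candidate outputs z1, …, z_{m(m-1)/2}, each approximating the output y.]

2. Selection and stopping

[Figure: squared error versus iteration. Each iteration gives the squared errors r_1^2, r_2^2, …, r_{m(m-1)/2}^2, and r_min is the smallest of them.]

An increasing r_min means the model is becoming too complex:
1. Overfitting the estimation data
2. Performing poorly on the new selection data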
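The pair enumeration and the stopping rule on this slide can be sketched in Python. This is a minimal illustration, not the author's implementation: the function names `candidate_pairs` and `last_good_layer` and the error values are hypothetical, and the polynomial fitting done at each node is omitted.

```python
from itertools import combinations


def candidate_pairs(m):
    """Return every pair of input indices (1-based); there are m(m-1)/2 of them."""
    return list(combinations(range(1, m + 1), 2))


def last_good_layer(r_min_per_layer):
    """Return the 0-based index of the last layer before r_min starts to increase.

    r_min_per_layer[i] is the smallest squared error on the selection data
    among the candidate outputs of layer i.
    """
    for i in range(1, len(r_min_per_layer)):
        if r_min_per_layer[i] > r_min_per_layer[i - 1]:
            return i - 1  # r_min went up: stop at the previous layer
    return len(r_min_per_layer) - 1


pairs = candidate_pairs(4)  # inputs x1..x4
print(len(pairs), pairs[0], pairs[-1])

# Made-up r_min values for four successive layers (illustration only).
print(last_good_layer([0.30, 0.21, 0.18, 0.22]))
```

With four inputs there are six candidate pairs, and for the made-up error sequence the sketch stops at the third layer, the last one before r_min rises.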


Method

First stage: feature ranking
─ AIM abductive network

1. Representation

2. Selection and stopping

Overfitting is avoided by using the CPM (complexity penalty multiplier) control:
1. CPM > 1: simpler models that are less accurate but generalize better.
2. CPM < 1: more complex models that may overfit the training data and degrade actual prediction performance.
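The effect of CPM can be illustrated with a small Python sketch. It assumes a predicted-squared-error criterion of the form PSE = FSE + CPM · (2K/N) · σp², where FSE is the fitting squared error on the training data, K the number of model coefficients, N the number of training samples and σp² a prior estimate of the error variance. This formula is not stated on the slides and is an assumption here, as are the function names and all numbers.

```python
def predicted_squared_error(fse, k, n, sigma2, cpm=1.0):
    """Assumed criterion: fitting error plus a CPM-scaled complexity penalty."""
    return fse + cpm * (2.0 * k / n) * sigma2


def select_model(candidates, n, sigma2, cpm):
    """Pick the (k, fse) candidate with the smallest predicted squared error."""
    return min(
        candidates,
        key=lambda c: predicted_squared_error(c[1], c[0], n, sigma2, cpm),
    )


# Hypothetical candidates: (number of coefficients K, fitting squared error FSE).
candidates = [(3, 0.20), (6, 0.15), (12, 0.13)]

for cpm in (0.2, 1.0, 3.0):
    k, fse = select_model(candidates, n=100, sigma2=0.5, cpm=cpm)
    print(f"CPM={cpm}: chosen model has K={k}")
```

With these made-up numbers, CPM = 0.2 selects the 12-coefficient model, CPM = 1 the 6-coefficient model and CPM = 3 the 3-coefficient model, which matches the direction described on the slide.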


Method

Second stage: feature selection

─ A compact m-feature subset can be obtained by taking the first m features from the top of the ranking list. Ex: with ranking list {2,6,7,8,1,5,3,4,9}, the selected 6-feature subset is {2,6,7,8,1,5}.

─ The optimum subset of features is determined by repeatedly forming subsets of k features, starting from the top of the ranking list. Ex: with ranking list {2,6,7,8,1,5,3,4,9}, the best subset is chosen from among {2,6,7,8,1,5}, {6,7,8,1,5,3}, …

─ As k increases, performance on an evaluation dataset first improves and then starts to deteriorate because the model overfits the training data.
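The second stage can be sketched in Python as follows. The sketch reads "starting from the top of the ranking list" as nested prefix subsets (the first k features for k = 1, 2, …), which is an assumption. The function names are hypothetical, and `toy_score` is only a stand-in for training a classifier and scoring it on the evaluation dataset.

```python
def top_m(ranking, m):
    """Return the first m features from the top of the ranking list."""
    return ranking[:m]


def best_prefix_subset(ranking, evaluate):
    """Try the top-k subsets for k = 1..len(ranking) and keep the best one.

    evaluate(subset) must return a score on the evaluation dataset,
    higher meaning better.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranking) + 1):
        subset = ranking[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


ranking = [2, 6, 7, 8, 1, 5, 3, 4, 9]  # breast cancer ranking from the slides
print(top_m(ranking, 6))  # [2, 6, 7, 8, 1, 5]


def toy_score(subset):
    """Hypothetical score that peaks at six features.

    It only mimics the 'improve, then deteriorate' behaviour.
    """
    return -abs(len(subset) - 6)


print(best_prefix_subset(ranking, toy_score))
```

In practice `evaluate` would train a model on the training data using only the given features and return its accuracy or AUC on the evaluation data.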


Material

Two standard medical diagnosis datasets from the UCI Machine Learning Repository were used for this study.
─ Wisconsin breast cancer dataset
─ Cleveland heart disease dataset

Data split: 70% / 30%


Results

The breast cancer data
─ Ranking for the feature set: {2,6,7,8,1,5,3,4,9}

[Figure: features selected and features ranked; labels 7, 5, 9]


Results

Rough set data analysis of dataset

[Figure: two panels, each marked "Overfitting" and "3%"]


Results

[Figure annotations: Standard error ↓ (two panels), AUC ↑, 3% (two panels)]
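For reference, the AUC shown in these results can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below is a generic Python illustration with made-up scores, not the procedure used in the paper; the function name `auc` is hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve from pairwise comparisons.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up classifier outputs for three positive and three negative cases.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

An AUC of 1.0 means the classifier ranks every positive case above every negative one, while 0.5 corresponds to chance-level ranking.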


Results

The heart disease data
─ Ranking for the feature set: {13,12,9,3,2,10,8,4,5,11,1,7,6}

[Figure: features selected and features ranked]


Results

[Figure annotations: 3%, 6%, Overfitting]


Results

[Figure annotations: AUC ↑ (two panels)]

Requires fewer than half of the input features.

Models using the reduced feature set will be more efficient.


Conclusions

Improved implementation and performance of classifiers for medical screening and diagnosis.

Feature reduction is particularly useful with high-dimensional data characterized by a large number of features and relatively few training examples.


My opinion

Advantage: Preprocess
Disadvantage:
Apply: Clustering, Association Rule, …
