Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
GMDH-based feature ranking and selection for improved
classification of medical data
Advisor : Dr. Hsu
Presenter : Yu-San Hsieh
Author : R.E. Abdel-Aal
2005. BI.456-468
Outline
─ Motivation
─ Objective
─ Method
─ Material
─ Results
─ Conclusions
Motivation
─ Accuracy is very important in classifiers used for medical applications.
Objective
─ Improve classification performance on medical data.
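Method
First stage – feature ranking ─ GMDH algorithm
1. Representation: the inputs x1, x2, x3, x4, … are combined in pairs, so m inputs give m(m-1)/2 candidate outputs z1, …, z_{m(m-1)/2}, each an estimate of y.
2. Selection and stopping: each candidate is scored by its squared error on the selection data (r_1^2, r_2^2, …, r_{m(m-1)/2}^2); rmin is the smallest of these at a given iteration.
[Figure: squared error versus iteration, with rmin marked]
An increasing rmin means the model is becoming too complex:
1. Overfitting the estimation data.
2. Performing poorly on the new selection data.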
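The representation count and the stopping rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `candidate_pairs` and `stopping_layer` are hypothetical, the squared errors are made-up numbers, and fitting the pairwise models themselves is left out.

```python
from itertools import combinations


def candidate_pairs(m):
    """List every pair of inputs a layer would try: m(m-1)/2 pairs."""
    return list(combinations(range(1, m + 1), 2))


def stopping_layer(layer_errors):
    """Return (layer, r_min) for the last layer before r_min stops falling.

    layer_errors[i] holds the squared errors of all candidates in layer i+1,
    assumed to be measured on the selection data.
    """
    best_layer, best_rmin = None, float("inf")
    for idx, errors in enumerate(layer_errors, start=1):
        rmin = min(errors)
        if rmin >= best_rmin:
            break  # r_min is rising: deeper layers would overfit
        best_layer, best_rmin = idx, rmin
    return best_layer, best_rmin


pairs = candidate_pairs(4)  # x1..x4 give 4*3/2 = 6 pairs

# Illustrative (made-up) squared errors for three successive layers.
errors = [[0.30, 0.25, 0.40], [0.22, 0.18, 0.27], [0.21, 0.24, 0.26]]
layer, rmin = stopping_layer(errors)
print(len(pairs), layer, rmin)
```

With these toy numbers rmin falls from 0.25 to 0.18 and then rises to 0.21, so growth stops at the second layer.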
Method
First stage – feature ranking ─ AIM abductive network
1. Representation
2. Selection and stopping: avoid overfitting by using the CPM (complexity penalty multiplier) control
─ CPM > 1: simpler models that are less accurate but generalize better.
─ CPM < 1: more complex models that may overfit the training data and degrade actual prediction performance.
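To make the effect of CPM concrete, the sketch below scores two candidate models with a penalized criterion. The slide does not give the formula, so the form used here, fitting error plus CPM × (2K/N) × a prior error variance, is an assumption, as are the names `predicted_squared_error` and `pick_model` and all the numbers.

```python
def predicted_squared_error(fse, n_coeffs, n_samples, prior_var, cpm=1.0):
    """Fitting squared error plus a complexity penalty scaled by CPM.

    Assumed form: PSE = FSE + CPM * (2K / N) * prior_var, where K is the
    number of model coefficients and N the number of training samples.
    """
    return fse + cpm * (2.0 * n_coeffs / n_samples) * prior_var


def pick_model(candidates, n_samples, prior_var, cpm):
    """Return the name of the candidate with the lowest penalized error."""
    best = min(
        candidates,
        key=lambda c: predicted_squared_error(
            c["fse"], c["k"], n_samples, prior_var, cpm
        ),
    )
    return best["name"]


# Made-up candidates: the complex model fits better but has more coefficients.
candidates = [
    {"name": "simple", "fse": 0.10, "k": 3},
    {"name": "complex", "fse": 0.06, "k": 12},
]

low_cpm_choice = pick_model(candidates, n_samples=100, prior_var=0.25, cpm=0.5)
high_cpm_choice = pick_model(candidates, n_samples=100, prior_var=0.25, cpm=2.0)
print(low_cpm_choice, high_cpm_choice)
```

Under these assumptions a CPM below 1 favors the complex candidate, while a CPM above 1 shifts the choice to the simpler one.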
Method
Second stage – feature selection
─ A compact m-feature subset is obtained by taking the first m features from the top of the ranking list. Ex: ranking list {2,6,7,8,1,5,3,4,9}, selected 6-feature subset is {2,6,7,8,1,5}.
─ The optimum subset of features is determined by repeatedly forming subsets of k features, starting from the top of the ranking list. Ex: ranking list {2,6,7,8,1,5,3,4,9}, choose the best subset among {2,6,7,8,1,5}, {6,7,8,1,5,3}…
─ As k increases, performance on an evaluation dataset first improves and then starts to deteriorate as the model overfits the training data.
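The selection step can be sketched as a loop over top-of-list subsets. This is an illustration under assumptions: it forms each k-feature subset by taking the first k features from the top of the ranking list, `evaluate` is a hypothetical caller-supplied function returning an error on the evaluation dataset, and the error curve is made up to mimic "improves, then deteriorates".

```python
def top_k_subsets(ranking):
    """Yield (k, first k ranked features) for k = 1 .. len(ranking)."""
    for k in range(1, len(ranking) + 1):
        yield k, ranking[:k]


def select_optimum_subset(ranking, evaluate):
    """Return the top-k subset with the lowest evaluation error.

    evaluate: hypothetical function mapping a feature list to an error
    measured on the evaluation dataset.
    """
    best_subset, best_error = None, float("inf")
    for _, subset in top_k_subsets(ranking):
        error = evaluate(subset)
        if error < best_error:
            best_subset, best_error = subset, error
    return best_subset, best_error


ranking = [2, 6, 7, 8, 1, 5, 3, 4, 9]
six_features = ranking[:6]  # the 6-feature example from the slide

# Made-up error curve: improves up to k = 6, then deteriorates.
toy_errors = {1: 0.12, 2: 0.09, 3: 0.07, 4: 0.06, 5: 0.05,
              6: 0.04, 7: 0.05, 8: 0.06, 9: 0.07}
best, err = select_optimum_subset(ranking, lambda s: toy_errors[len(s)])
print(six_features, best, err)
```

In practice `evaluate` would train a classifier on the chosen features and score it on held-out data; the toy lookup only stands in for that step.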
Material
Two standard medical diagnosis datasets from the UCI Machine Learning Repository were used for this study.
─ Wisconsin breast cancer dataset
─ Cleveland heart disease dataset
[Figure: dataset split, 70% / 30%]
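A 70% / 30% split can be sketched as follows. The slide shows only the proportions, so treating the larger part as training data and the smaller as evaluation data, shuffling before the cut, and the name `split_70_30` are all assumptions.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle the rows reproducibly and cut them 70% / 30%."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * 7 // 10  # integer arithmetic avoids float rounding
    return rows[:cut], rows[cut:]


# Stand-in for dataset rows: 100 record indices.
train, evaluation = split_70_30(range(100))
print(len(train), len(evaluation))
```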
Results
The breast cancer data
─ Ranking for the feature set: {2,6,7,8,1,5,3,4,9}
[Figure: features selected versus features ranked; labels 7, 5, 9]
Results
Rough set data analysis of dataset
[Figure: two panels, each annotated "Overfitting" and "3%"]
Results
[Table annotations: standard error ↓ (twice), AUC ↑, 3% (twice)]
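For readers unfamiliar with AUC, the sketch below computes the area under the ROC curve as the fraction of (positive, negative) pairs that the classifier ranks correctly, counting ties as half. It is a generic illustration, not the evaluation code behind the reported results; the scores and labels are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, higher meaning more likely positive.
    labels: 1 for positive cases, 0 for negative cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up scores for six cases, three positive and three negative.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = auc(scores, labels)
print(area)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the area is 8/9.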
Results
The heart disease data
─ Ranking for the feature set: {13,12,9,3,2,10,8,4,5,11,1,7,6}
[Figure: features selected versus features ranked]
Results
[Figure annotations: 3%, 6%, "Overfitting"]
Results
[Table annotations: AUC ↑ (twice)]
─ Requires fewer than half of the input features.
─ Models using the reduced feature set will be more efficient.
Conclusions
─ Improved implementation and performance of classifiers for medical screening and diagnosis.
─ Feature reduction is particularly useful with high-dimensional data characterized by a large number of features and relatively few training examples.
My opinion
─ Advantage: Preprocess
─ Disadvantage:
─ Apply: Clustering, Association Rule……