Biological Data Mining by Genetic Programming
AI Project #2
Biointelligence lab
Cho, Dong-Yeon
© 2006 SNU CSE Biointelligence Lab
Project Purpose
  Medical diagnosis
    To predict the presence or absence of a disease given the results of various medical tests carried out on a patient
    Human experts (M.D.) vs. machine (GP)
  Two data sets
    Heart Disease
    Diabetes
Heart Disease
  Data description
    Number of patients: 270 (absence: 150, presence: 120)
    13 attributes
      age
      sex
      chest pain type (4 values)
      resting blood pressure
      serum cholesterol in mg/dl
      fasting blood sugar > 120 mg/dl
      resting electrocardiographic results (values 0, 1, 2)
      maximum heart rate achieved
      exercise-induced angina
      oldpeak = ST depression induced by exercise relative to rest
      the slope of the peak exercise ST segment
      number of major vessels (0-3) colored by fluoroscopy
      thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
Learning a Classifier
  GP settings
    Functions
      Numerical and conditional operators: {+, -, *, /, exp, log, sin, cos, sqrt, iflte, ifltz, …}
      Some operators must be protected against illegal operations (e.g., division by zero or the logarithm of a non-positive number).
    Terminals
      Input attributes and constants: {x1, x2, …, x13, R}, where R ∈ [a, b]
    Additional parameters
      Threshold value
      For preprocessing (normalization)
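A minimal Python sketch of protected operators follows. The function names, the fallback values (1.0 for division by zero, 0.0 for the logarithm of zero), the clamping range of exp, and the argument order of iflte and ifltz are all assumptions made for illustration; the project does not prescribe them.

```python
import math

EPS = 1e-9  # assumed tolerance for "too close to zero"

def protected_div(a, b):
    """Division that returns 1.0 (assumed fallback) when the divisor is near zero."""
    return a / b if abs(b) > EPS else 1.0

def protected_log(a):
    """Logarithm of |a|; returns 0.0 (assumed fallback) when a is near zero."""
    return math.log(abs(a)) if abs(a) > EPS else 0.0

def protected_sqrt(a):
    """Square root of |a|, so negative inputs do not raise an error."""
    return math.sqrt(abs(a))

def protected_exp(a):
    """Exponential with the argument clamped to [-50, 50] to avoid overflow."""
    return math.exp(max(-50.0, min(a, 50.0)))

def iflte(a, b, c, d):
    """Assumed semantics: if a <= b then c else d."""
    return c if a <= b else d

def ifltz(a, b, c):
    """Assumed semantics: if a < 0 then b else c."""
    return b if a < 0 else c
```

With these definitions an evolved expression such as protected_div(x1, protected_log(x2)) always evaluates to a finite number, whatever the attribute values are.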
Cross Validation (1/3)
  K-fold cross validation
    The data set is randomly divided into k subsets. One of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set.
    [Figure: the data divided into six subsets D1-D6 of 45 patients each. Splits shown: training D1-D5, test D6; training D1-D4 and D6, test D5; training D2-D6, test D1.]
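The splitting scheme can be sketched in Python as follows. The function name k_fold_splits and the fixed seed are assumptions for illustration; the example call uses the 270 patients and k = 6, which gives subsets of 45.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random division of the data
    folds = [indices[i::k] for i in range(k)]   # k subsets of nearly equal size
    splits = []
    for i in range(k):
        test = folds[i]
        # The other k-1 subsets are put together to form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_splits(270, 6)
```

Each patient index appears in exactly one test set, so every patient is used for testing once and for training k-1 times.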
Cross Validation (2/3)
  Confusion matrix for test data sets
    Number of patients = p + q + r + s
    Accuracy = $\frac{p + s}{p + q + r + s}$

                       Predict
                       Positive   Negative
    True   Positive    p          q
           Negative    r          s
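A small Python sketch of the counts p, q, r, s and the accuracy. It assumes that the rows of the matrix are the true classes and the columns the predicted classes, and that classes are encoded as 1 (positive) and 0 (negative); the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (p, q, r, s): rows are the true class, columns the predicted class."""
    p = q = r = s = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth and predicted:
            p += 1      # true positive, predicted positive
        elif truth and not predicted:
            q += 1      # true positive, predicted negative
        elif not truth and predicted:
            r += 1      # true negative, predicted positive
        else:
            s += 1      # true negative, predicted negative
    return p, q, r, s

def accuracy(p, q, r, s):
    """Fraction of correctly classified patients: (p + s) / (p + q + r + s)."""
    return (p + s) / (p + q + r + s)

p, q, r, s = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```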
Cross Validation (3/3)
  Cross validation and confusion matrix
    At least 10 runs for your k value.
    Show the confusion matrix for the best result of your experiments.

    Run        Accuracy
    1
    2
    …
    10
    Average
Initialization
  Maximum initial depth of trees Dmax is set.
  Full method (each branch has depth = Dmax):
    nodes at depth d < Dmax randomly chosen from function set F
    nodes at depth d = Dmax randomly chosen from terminal set T
  Grow method (each branch has depth ≤ Dmax):
    nodes at depth d < Dmax randomly chosen from F ∪ T
    nodes at depth d = Dmax randomly chosen from T
  Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
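The two methods and ramped half-and-half can be sketched in Python as below. Trees are represented as nested tuples, the function and terminal sets are small illustrative subsets, and ramping the depth limit from 2 up to Dmax is an assumption (a common convention), not something the slide specifies.

```python
import random

# Illustrative subsets only (assumptions); the project's sets are larger.
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2}   # function name -> arity
TERMINALS = ["x1", "x2", "x3", "R"]            # "R" stands for a random constant

def full(d_max, rng, depth=0):
    """Full method: functions at depth < d_max, terminals exactly at d_max."""
    if depth == d_max:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return (name,) + tuple(full(d_max, rng, depth + 1) for _ in range(FUNCTIONS[name]))

def grow(d_max, rng, depth=0):
    """Grow method: any symbol from F or T at depth < d_max, terminals at d_max."""
    if depth == d_max:
        return rng.choice(TERMINALS)
    symbol = rng.choice(sorted(FUNCTIONS) + TERMINALS)
    if symbol not in FUNCTIONS:
        return symbol
    return (symbol,) + tuple(grow(d_max, rng, depth + 1) for _ in range(FUNCTIONS[symbol]))

def tree_depth(tree):
    """Depth of the deepest branch; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(pop_size, d_max, rng):
    """Alternate full and grow while cycling the depth limit over 2..d_max."""
    depths = list(range(min(2, d_max), d_max + 1))
    population = []
    for i in range(pop_size):
        depth_limit = depths[(i // 2) % len(depths)]
        method = full if i % 2 == 0 else grow
        population.append(method(depth_limit, rng))
    return population
```

For example, ramped_half_and_half(20, 4, random.Random(0)) yields ten trees from the full method and ten from the grow method, with depth limits 2, 3 and 4.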
Fitness Function
  Maximization problem
    Number of the correctly classified patients
  Minimization problem
    Number of the incorrectly classified patients
    Mean squared error
      $Error = \frac{1}{N} \sum_{i=1}^{N} \left( O_{true}^{i} - O_{GP}^{i} \right)^{2}$
      N: number of training data
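Both minimization criteria can be written in a few lines of Python. This is a sketch: the function names are illustrative, and the threshold of 0.5 used to turn a GP output into a class label is an assumed value.

```python
def mean_squared_error(o_true, o_gp):
    """Error = (1/N) * sum over i of (O_true_i - O_GP_i)^2."""
    n = len(o_true)
    return sum((t - g) ** 2 for t, g in zip(o_true, o_gp)) / n

def misclassified(o_true, o_gp, threshold=0.5):
    """Number of patients whose thresholded GP output differs from the true class."""
    return sum(1 for t, g in zip(o_true, o_gp) if (g > threshold) != bool(t))

error = mean_squared_error([1, 0, 1], [0.5, 0.5, 1.0])
wrong = misclassified([1, 0, 1], [0.5, 0.5, 1.0])
```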
Selection (1/2)
  Fitness proportional (roulette wheel) selection
    The roulette wheel can be constructed as follows.
      Calculate the total fitness for the population.
        $F = \sum_{k=1}^{\mathrm{POP\_SIZE}} f(i_k)$
      Calculate selection probability p_k for each chromosome v_k.
        $p_k = \frac{f(i_k)}{F}, \quad k = 1, 2, \ldots, \mathrm{POP\_SIZE}$
      Calculate cumulative probability q_k for each chromosome v_k.
        $q_k = \sum_{j=1}^{k} p_j, \quad k = 1, 2, \ldots, \mathrm{POP\_SIZE}$
Procedure: Proportional_Selection
  Generate a random number r from the range [0, 1].
  If r ≤ q_1, then select the first chromosome v_1; else, select the kth chromosome v_k (2 ≤ k ≤ pop_size) such that q_{k-1} < r ≤ q_k.
  k     p_k        q_k
  1     0.082407   0.082407
  2     0.110652   0.193059
  3     0.131931   0.324989
  4     0.121423   0.446412
  5     0.072597   0.519009
  6     0.128834   0.647843
  7     0.077959   0.725802
  8     0.102013   0.827816
  9     0.083663   0.911479
  10    0.088521   1.000000
  $F = \sum_{k=1}^{pop\_size} f(i_k) = 0.036441$
  [Figure: roulette wheel laid out on an axis from 0 to 1]
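The construction of the wheel and the selection procedure can be sketched in Python as follows. The sketch assumes a maximization problem with non-negative fitness values; the function names are illustrative. The binary search returns the first position whose cumulative probability is at least r, which is the condition q_{k-1} < r ≤ q_k (with 0-based indices).

```python
import bisect
import itertools
import random

def cumulative_probabilities(fitnesses):
    """Return the cumulative probabilities q_k for a list of fitness values."""
    total = sum(fitnesses)                          # total fitness F
    probabilities = [f / total for f in fitnesses]  # selection probabilities p_k
    return list(itertools.accumulate(probabilities))

def select_index(cumulative, r):
    """Return the 0-based index k with q_{k-1} < r <= q_k."""
    k = bisect.bisect_left(cumulative, r)
    return min(k, len(cumulative) - 1)   # guard against rounding in the last q_k

def proportional_selection(population, fitnesses, rng):
    """Spin the wheel once and return the selected chromosome."""
    cumulative = cumulative_probabilities(fitnesses)
    return population[select_index(cumulative, rng.random())]
```

Calling proportional_selection repeatedly, once per slot in the mating pool, samples chromosomes with probability proportional to their fitness.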
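Selection (2/2)
  Tournament selection
    Tournament size q
  Ranking-based selection
    $p_i = \frac{1}{\mathrm{POP\_SIZE}} \left( \eta^{+} - (\eta^{+} - \eta^{-}) \frac{i - 1}{\mathrm{POP\_SIZE} - 1} \right)$
    $1 \le \eta^{+} \le 2$ and $\eta^{-} = 2 - \eta^{+}$
    i: rank of the individual (i = 1 receives the largest probability)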
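A Python sketch of both selection schemes. It assumes a maximization problem, tournament contestants drawn with replacement, and the standard linear ranking formula with rank 1 as the best individual; the function and parameter names are illustrative.

```python
import random

def tournament_selection(population, fitnesses, q, rng):
    """Draw q contestants at random (with replacement) and return the fittest."""
    contestants = [rng.randrange(len(population)) for _ in range(q)]
    winner = max(contestants, key=lambda idx: fitnesses[idx])
    return population[winner]

def linear_ranking_probabilities(pop_size, eta_plus):
    """Selection probabilities p_i for ranks i = 1 (best) to pop_size (worst)."""
    if pop_size < 2 or not 1.0 <= eta_plus <= 2.0:
        raise ValueError("need pop_size >= 2 and 1 <= eta_plus <= 2")
    eta_minus = 2.0 - eta_plus
    return [
        (eta_plus - (eta_plus - eta_minus) * (i - 1) / (pop_size - 1)) / pop_size
        for i in range(1, pop_size + 1)
    ]

probabilities = linear_ranking_probabilities(5, 1.5)
```

A larger tournament size q, or a value of eta_plus closer to 2, increases the selection pressure toward the best individuals.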
GP Flowchart
  [Figure: flowcharts of the GA loop and the GP loop]
Bloat
  Bloat = “survival of the fattest”, i.e., the tree sizes in the population increase over time
  Ongoing research and debate about the reasons
  Needs countermeasures, e.g.
    Prohibiting variation operators that would deliver “too big” children
    Parsimony pressure: penalty for being oversized
      $Fitness = OriginalFitness + Penalty(\#N, \#D)$
      $\phantom{Fitness} = Error + C(\#N, \#D)$
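One possible form of the penalty term is sketched below in Python. It assumes that #N is the number of nodes and #D the depth of the tree, that trees are nested tuples, and that the penalty is a weighted sum of the two; the weights c_nodes and c_depth are hypothetical parameters to be tuned in the experiments.

```python
def count_nodes(tree):
    """Number of nodes (#N, assumed meaning) in a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])

def tree_depth(tree):
    """Depth (#D, assumed meaning) of the deepest branch; a terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def penalized_fitness(error, tree, c_nodes=0.001, c_depth=0.0):
    """Error plus an assumed linear size penalty (minimization problem)."""
    return error + c_nodes * count_nodes(tree) + c_depth * tree_depth(tree)

tree = ("+", "x1", ("*", "x2", "x3"))
fitness = penalized_fitness(0.2, tree, c_nodes=0.01, c_depth=0.1)
```

Of two trees with the same error, the smaller one then receives the better (lower) fitness.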
Experiments
  Two problems
    Heart Disease
    Pima Indian diabetes
  Various experimental setups
    Termination condition: maximum_generation
    Various settings
      Effects of the penalty term
      Different function and terminal sets
      Selection methods and their parameters
      Crossover and mutation probabilities
Results
  For each problem
    Result table and your analysis

                  Training                        Test
                  Average ± SD   Best   Worst     Average ± SD   Best   Worst
      Setting 1
      Setting 2
      Setting 3

    Present the optimal classifier
    Draw a learning curve for the run where the best solution was found.
    Compare with the results of neural networks (optional).
    Different k for cross validation (optional)
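The Average, SD, Best and Worst entries of the result table can be computed from the per-run errors of one setting, for example as in the Python sketch below. It assumes that the reported values are errors (lower is better) and uses the sample standard deviation; the function name summarize is illustrative.

```python
import statistics

def summarize(errors):
    """Summarize the per-run errors of one experimental setting."""
    return {
        "average": statistics.mean(errors),
        "sd": statistics.stdev(errors),   # sample standard deviation (n - 1)
        "best": min(errors),              # lowest error
        "worst": max(errors),             # highest error
    }

summary = summarize([0.2, 0.1, 0.3])
```

If accuracies are reported instead of errors, best and worst swap roles (max and min).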
  [Figure: learning curve, fitness (error) plotted against generation]
References
  Source codes
    GP libraries (C, C++, JAVA, …)
    MATLAB toolbox
  Web sites
    http://www.cs.bham.ac.uk/~cmf/GPLib/GPLib.html
    http://cs.gmu.edu/~eclab/projects/ecj/
    http://www.geneticprogramming.com/GPpages/software.html
    …
Pay Attention!
  Due: Nov. 16, 2006
  Submission
    Source code and executable file(s)
      Proper comments in the source code
      Via e-mail
    Report: hardcopy!!
      Running environments and libraries (or packages) which you used
      Results for many experiments with various parameter settings
      Analysis and explanation about the results in your own way