Page 1: Biological data mining by Genetic Programming AI Project #2 Biointelligence lab Cho, Dong-Yeon (dycho@bi.snu.ac.kr)

Biological Data Mining by Genetic Programming

AI Project #2

Biointelligence lab

Cho, Dong-Yeon

(dycho@bi.snu.ac.kr)

© 2006 SNU CSE Biointelligence Lab

Project Purpose

Medical Diagnosis
  - To predict the presence or absence of a disease given the results of various medical tests carried out on a patient
  - Human experts (M.D.) vs. machine (GP)

Two Data Sets
  - Heart Disease
  - Diabetes

Heart Disease

Data Description
  - Number of patients: 270 (absence: 150, presence: 120)
  - 13 attributes
      1. age
      2. sex
      3. chest pain type (4 values)
      4. resting blood pressure
      5. serum cholesterol in mg/dl
      6. fasting blood sugar > 120 mg/dl
      7. resting electrocardiographic results (values 0, 1, 2)
      8. maximum heart rate achieved
      9. exercise-induced angina
      10. oldpeak = ST depression induced by exercise relative to rest
      11. the slope of the peak exercise ST segment
      12. number of major vessels (0-3) colored by fluoroscopy
      13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Learning a Classifier

GP settings
  - Functions
      Numerical and conditional operators: {+, -, *, /, exp, log, sin, cos, sqrt, iflte, ifltz, …}
      Some operators must be protected against illegal operations, such as division by zero or the logarithm of a non-positive number.
  - Terminals
      Input attributes and constants: {x1, x2, …, x13, R}, where R ∈ [a, b]
  - Additional parameters
      Threshold value
      For preprocessing (normalization)
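
The protected operators and the threshold mentioned above could be sketched as follows. This is only an illustration: the fallback values (1.0 for division, 0.0 for log), the clamping range for exp, the reading of iflte as "if a <= b then c else d", and the default threshold are all assumptions, not settings given in the slides.

```python
import math

EPS = 1e-9  # assumed tolerance for "too close to zero"

def pdiv(a, b):
    """Protected division: return 1.0 when the divisor is (almost) zero."""
    return a / b if abs(b) > EPS else 1.0

def plog(a):
    """Protected log: use |a|, and return 0.0 for a (near-)zero argument."""
    return math.log(abs(a)) if abs(a) > EPS else 0.0

def psqrt(a):
    """Protected square root: take the root of |a|."""
    return math.sqrt(abs(a))

def pexp(a):
    """Protected exp: clamp the argument to avoid overflow."""
    return math.exp(max(-50.0, min(50.0, a)))

def iflte(a, b, c, d):
    """Conditional operator (assumed meaning): c if a <= b, otherwise d."""
    return c if a <= b else d

def classify(gp_output, threshold=0.0):
    """Turn the real-valued GP output into a class label (1 = presence)."""
    return 1 if gp_output > threshold else 0
```

Whatever fallback values are chosen, they should be reported along with the other GP settings, since they change the behaviour of the evolved expressions.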

Cross Validation (1/3)

K-fold Cross Validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the test set, and the other k-1 subsets are put together to form the training set.

[Figure: the data are split into six subsets D1-D6 of 45 patients each. Each subset is held out in turn as the test set: train on D1-D5 and test on D6; train on D1-D4 and D6 and test on D5; …; train on D2-D6 and test on D1.]
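
A minimal sketch of the k-fold split described above, assuming the patients are addressed by index 0..n-1. The function name `k_fold_splits` and the fixed seed are illustrative choices, not part of the assignment.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random division of the data
    folds = [indices[i::k] for i in range(k)]   # k subsets of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 270 patients and k = 6, every test fold holds 45 patients.
splits = list(k_fold_splits(270, 6))
```

Each patient appears in exactly one test fold, so the test accuracies of the k folds can be averaged into one estimate per run.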

Cross Validation (2/3)

Confusion matrix for the test data sets

                   Predicted
  True        Positive   Negative
  Positive       p          q
  Negative       r          s

Number of patients = p + q + r + s

$$\text{Accuracy} = \frac{p + s}{p + q + r + s}$$
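
The confusion matrix and accuracy above can be computed with a few lines. The sketch assumes labels are coded 1 for positive (disease present) and 0 for negative, with rows as the true class and columns as the predicted class; the label vectors are made-up illustration data.

```python
def confusion_matrix(y_true, y_pred):
    """Return (p, q, r, s): rows are the true class, columns the predicted class."""
    pairs = list(zip(y_true, y_pred))
    p = sum(1 for t, y in pairs if t == 1 and y == 1)  # true positive, predicted positive
    q = sum(1 for t, y in pairs if t == 1 and y == 0)  # true positive, predicted negative
    r = sum(1 for t, y in pairs if t == 0 and y == 1)  # true negative, predicted positive
    s = sum(1 for t, y in pairs if t == 0 and y == 0)  # true negative, predicted negative
    return p, q, r, s

def accuracy(p, q, r, s):
    """Fraction of correctly classified patients."""
    return (p + s) / (p + q + r + s)

# Made-up labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
p, q, r, s = confusion_matrix(y_true, y_pred)
```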

Cross Validation (3/3)

Cross validation and confusion matrix
  - Perform at least 10 runs for your k value.
  - Show the confusion matrix for the best result of your experiments.

  Run        Accuracy
  1
  2
  …
  10
  Average

Initialization

The maximum initial depth of trees, Dmax, is set.

Full method (each branch has depth = Dmax):
  - nodes at depth d < Dmax are randomly chosen from the function set F
  - nodes at depth d = Dmax are randomly chosen from the terminal set T

Grow method (each branch has depth ≤ Dmax):
  - nodes at depth d < Dmax are randomly chosen from F ∪ T
  - nodes at depth d = Dmax are randomly chosen from T

Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
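
The full and grow methods can be sketched with trees stored as nested tuples. The small function and terminal sets below are illustrative subsets, and ramping the depth limit from 2 up to Dmax is an assumption borrowed from common GP practice; the slide only states that each method supplies half of the population.

```python
import random

# Illustrative subsets of the function and terminal sets (assumptions).
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2}  # operator name -> arity
TERMINALS = ["x1", "x2", "x3", "R"]

def make_tree(max_depth, method, rng, depth=0):
    """Build a tree: either a terminal string or a tuple (function, child, ...)."""
    if depth == max_depth:
        pick_terminal = True                  # both methods: terminals at the limit
    elif method == "full":
        pick_terminal = False                 # full: functions only above the limit
    else:                                     # grow: uniform choice from F ∪ T
        n_f, n_t = len(FUNCTIONS), len(TERMINALS)
        pick_terminal = rng.random() < n_t / (n_f + n_t)
    if pick_terminal:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    children = tuple(make_tree(max_depth, method, rng, depth + 1)
                     for _ in range(FUNCTIONS[name]))
    return (name,) + children

def tree_depth(tree):
    """Depth of a tree; a single terminal has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(pop_size, d_max, rng):
    """Alternate full and grow trees while cycling the depth limit over 2..d_max."""
    depths = list(range(2, d_max + 1))
    population = []
    for i in range(pop_size):
        depth = depths[(i // 2) % len(depths)]
        method = "full" if i % 2 == 0 else "grow"
        population.append(make_tree(depth, method, rng))
    return population

population = ramped_half_and_half(12, 4, random.Random(0))
```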

Fitness Function

Maximization problem
  - Number of correctly classified patients

Minimization problem
  - Number of incorrectly classified patients
  - Mean squared error

$$\text{Error} = \frac{1}{N} \sum_{i=1}^{N} \left( O_{\text{true}}^{(i)} - O_{\text{GP}}^{(i)} \right)^2$$

N: number of training data
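
Both kinds of fitness can be computed directly from the GP outputs. The sketch below assumes targets coded as 0/1 and a threshold of 0.5 for counting correct classifications; the numbers are made up for illustration.

```python
def mean_squared_error(targets, outputs):
    """Minimization fitness: average squared difference over the N training cases."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n

def count_correct(targets, outputs, threshold=0.5):
    """Maximization fitness: number of correctly classified patients."""
    return sum(1 for t, o in zip(targets, outputs)
               if (1 if o > threshold else 0) == t)

# Made-up targets and GP outputs for four patients.
targets = [1, 0, 1, 0]
outputs = [0.9, 0.2, 0.4, 0.1]
mse = mean_squared_error(targets, outputs)
correct = count_correct(targets, outputs)
```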

Selection (1/2)

Fitness proportional (roulette wheel) selection

The roulette wheel can be constructed as follows.

  - Calculate the total fitness for the population, where f(i_k) is the fitness of the k-th individual.

    $$F = \sum_{k=1}^{\text{POP\_SIZE}} f(i_k)$$

  - Calculate the selection probability p_k for each chromosome v_k.

    $$p_k = \frac{f(i_k)}{F}, \quad k = 1, 2, \ldots, \text{POP\_SIZE}$$

  - Calculate the cumulative probability q_k for each chromosome v_k.

    $$q_k = \sum_{j=1}^{k} p_j, \quad k = 1, 2, \ldots, \text{POP\_SIZE}$$

Procedure: Proportional_Selection
  - Generate a random number r from the range [0, 1].
  - If r ≤ q_1, then select the first chromosome v_1; else, select the kth chromosome v_k (2 ≤ k ≤ pop_size) such that q_{k-1} < r ≤ q_k.

  k     p_k        q_k
  1     0.082407   0.082407
  2     0.110652   0.193059
  3     0.131931   0.324989
  4     0.121423   0.446412
  5     0.072597   0.519009
  6     0.128834   0.647843
  7     0.077959   0.725802
  8     0.102013   0.827802
  9     0.083663   0.911479
  10    0.088521   1.000000

$$F = \sum_{k=1}^{\text{pop\_size}} f(i_k) = 0.036441$$

[Figure: scale from 0 to 1 with ticks every 0.1.]
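
The wheel construction and the selection procedure can be sketched as below. It assumes a maximization problem with non-negative fitness values; the names `build_wheel` and `select_index` are illustrative, and indices are 0-based, so index 0 corresponds to v_1.

```python
import bisect
import itertools
import random

def build_wheel(fitnesses):
    """Return (p, q): selection and cumulative probabilities per chromosome."""
    total = sum(fitnesses)                        # total fitness F
    probs = [f / total for f in fitnesses]        # p_k = f_k / F
    cumulative = list(itertools.accumulate(probs))
    cumulative[-1] = 1.0                          # guard against rounding error
    return probs, cumulative

def select_index(cumulative, r):
    """Return the first index k with r <= q_k, i.e. q_{k-1} < r <= q_k."""
    return bisect.bisect_left(cumulative, r)

def roulette_select(cumulative, rng):
    """Spin the wheel once with a fresh random number r in [0, 1)."""
    return select_index(cumulative, rng.random())

# Made-up fitness values: p = 0.1, 0.2, 0.3, 0.4.
probs, cumulative = build_wheel([1.0, 2.0, 3.0, 4.0])
picks = [roulette_select(cumulative, random.Random(seed)) for seed in range(20)]
```

Using a binary search over the cumulative probabilities keeps each selection at O(log n), which matters once the population has more than a handful of individuals.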

Selection (2/2)

Tournament selection
  - Tournament size q

Ranking-based selection
  - 2 ≤ μ ≤ POP_SIZE
  - 1 ≤ η+ ≤ 2 and η− = 2 − η+

$$p_i = \frac{1}{\mu} \left( \eta^{+} - \left( \eta^{+} - \eta^{-} \right) \frac{i - 1}{\mu - 1} \right)$$
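
A rough sketch of both schemes. Tournament selection draws q individuals at random and keeps the best one. For ranking, the code assumes the commonly used linear-ranking form, in which the best rank receives η+/μ, the worst receives η−/μ, and η− = 2 − η+; the parameter names and default values are illustrative.

```python
import random

def tournament_select(fitnesses, q, rng, minimize=True):
    """Draw q distinct individuals at random and return the index of the best."""
    contestants = rng.sample(range(len(fitnesses)), q)
    pick = min if minimize else max
    return pick(contestants, key=lambda i: fitnesses[i])

def linear_ranking_probs(mu, eta_plus=1.5):
    """Selection probabilities for ranks i = 1..mu, where rank 1 is the best."""
    eta_minus = 2.0 - eta_plus
    return [(eta_plus - (eta_plus - eta_minus) * (i - 1) / (mu - 1)) / mu
            for i in range(1, mu + 1)]

# Made-up error values (lower is better).
errors = [0.3, 0.1, 0.4]
winner = tournament_select(errors, q=3, rng=random.Random(0))
rank_probs = linear_ranking_probs(mu=5, eta_plus=1.5)
```

A larger tournament size q, or an η+ closer to 2, increases selection pressure; q = 2 and η+ near 1 give weaker pressure.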

GP Flowchart

[Figure: flowcharts of the GA loop and the GP loop.]

Bloat

Bloat = “survival of the fattest”, i.e., the tree sizes in the population increase over time

Ongoing research and debate about the reasons

Needs countermeasures, e.g.
  - Prohibiting variation operators that would deliver “too big” children
  - Parsimony pressure: penalty for being oversized

$$\text{Fitness} = \text{Original Fitness} + \text{Penalty}(\#N, \#D) = \text{Error} + C(\#N, \#D)$$
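
One possible form of the parsimony penalty, sketched under several assumptions: #N is read as the number of nodes and #D as the tree depth, the penalty C is taken to be a weighted sum of the two, trees are nested tuples of the form (function, child, ...), and the weights are arbitrary illustration values.

```python
def count_nodes(tree):
    """Number of nodes (#N) in a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])

def tree_depth(tree):
    """Depth (#D) of a nested-tuple tree; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def penalized_fitness(error, tree, node_weight=0.01, depth_weight=0.0):
    """Error plus an assumed linear size penalty (to be minimized)."""
    return error + node_weight * count_nodes(tree) + depth_weight * tree_depth(tree)

# x1 + (x2 * R): five nodes, depth 2.
tree = ("+", "x1", ("*", "x2", "R"))
fitness = penalized_fitness(0.2, tree)
```

The weights trade accuracy against size: too large a penalty drives the population toward trivial trees, so its effect is worth checking experimentally.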

Experiments

Two problems
  - Heart Disease
  - Pima Indian diabetes

Various experimental setups
  - Termination condition: maximum_generation
  - Various settings
      Effects of the penalty term
      Different function and terminal sets
      Selection methods and their parameters
      Crossover and mutation probabilities

Results

For each problem
  - Result table and your analysis
  - Present the optimal classifier
  - Draw a learning curve for the run where the best solution was found.
  - Compare with the results of neural networks (optional).
  - Different k for cross validation (optional)

               Training                          Test
               Average ± SD   Best   Worst       Average ± SD   Best   Worst
  Setting 1
  Setting 2
  Setting 3
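
The entries of one table row can be computed from the per-run results as sketched below. It assumes the recorded quantity is an error (lower is better) and uses the sample standard deviation; the three values are placeholders, not experimental results.

```python
import statistics

def summarize(errors):
    """Average, SD, best, and worst over the runs of one setting."""
    return {
        "average": statistics.mean(errors),
        "sd": statistics.stdev(errors),   # sample standard deviation (n - 1)
        "best": min(errors),              # lowest error
        "worst": max(errors),             # highest error
    }

# Placeholder error values for three runs.
row = summarize([0.1, 0.2, 0.3])
```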

[Figure: example plot with x-axis Generation and y-axis Fitness (Error).]

References

Source Codes
  - GP libraries (C, C++, JAVA, …)
  - MATLAB toolbox

Web sites
  - http://www.cs.bham.ac.uk/~cmf/GPLib/GPLib.html
  - http://cs.gmu.edu/~eclab/projects/ecj/
  - http://www.geneticprogramming.com/GPpages/software.html
  - …

Pay Attention!

Due: Nov. 16, 2006

Submission
  - Source code and executable file(s)
      Proper comments in the source code
      Via e-mail
  - Report: Hardcopy!!
      Running environments and libraries (or packages) which you used
      Results for many experiments with various parameter settings
      Analysis and explanation about the results in your own way