Genetic Programming for the
Automatic Construction of Features in
Skin-Lesion Image Classification
Jonathan Streater
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2010
Abstract
This dissertation describes the design and implementation of a genetic programming
system which automatically constructs feature equations for the classification of skin
lesion images as a part of a real-world dermatological image retrieval system. It uses
generalized co-occurrence matrices (GCMs) and standard mathematical functions,
combined stochastically and evaluated using the feature selection techniques of Fisher's
discriminant ratio and the classification accuracy of either a Bayes classifier or support
vector machine. It deals with the notion of GP closure with 'shell' functions and is
able to arbitrarily combine information from different color channels, both unique
designs compared with similar GP systems. Further, it can evolve features iteratively to
complement each other. The implementation here is able to create small numbers of
features which are able to classify better than most of the traditional set of Haralick
features, even when the Haralick features are created with a greater number of GCM
parameters. However, the system developed here does exhibit two notable problems
for future work. The run-time is notably long and the amount of data collected
in-house is not yet great enough to significantly measure the ability of the system to
generalize. However, these problems are fixable and the work described has resulted
in a system which aids classification relatively well and, just as importantly, shows
much potential.
Acknowledgements
Many thanks to my supervisor, Lucia Ballerini, who provided invaluable guidance.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my
own except where explicitly stated otherwise in the text, and that this work has not been
submitted for any other degree or professional qualification except as specified.
(Jonathan Streater)
Table of Contents
Chapter 1. Introduction
Chapter 2. Related Work
2.1 Query-By-Example Content Based Image Retrieval System
2.1.1 The System
2.1.2 Color Features
2.1.3 Texture Features
2.1.4 Feature Selection and Retrieval
2.1.5 Synthesized Features
2.2 Genetic Programming for Feature Construction
Chapter 3. Conceptual Background
3.1 Haralick Features
3.2 Genetic Programming
3.2.1 The GP Algorithm
3.2.2 Representation, Terminals, and Functions
3.2.3 Initialization
3.2.4 Selection and Reproduction
3.2.5 Fitness and Test Cases
3.2.6 Closure and Sufficiency
3.3 Machine Learning with Feature Selection
3.3.1 The Machine Learning Problem
3.3.2 Feature Selection
3.3.3 Feature Relationships
3.3.4 Fisher's Discriminant Ratio
3.3.5 Naïve Bayes Classifier
3.3.6 Support Vector Machine
3.3.7 Leave-One-Out Cross-Validation
Chapter 4. Design and Implementation
4.1 Motivation and Overall Design
4.2 Basic GP Implementation
4.3 Training Data
4.4 Representation of Individuals
4.5 Fitness
4.6 Iterative Genetic Programming
4.7 Parameters
Chapter 5. Results and Analysis
5.1 Exploration of Population, Generation, and Depth
5.2 Pooled FDR Features Compared to Haralick Features
5.3 Wrapper Fitness and Iterative GP
Chapter 6. Conclusions
6.1 Future Work
6.2 Conclusions
Bibliography
Chapter 1. Introduction
This project is focused on the problem of using Genetic Programming and the relevant
machine learning techniques, such as a Bayes Classifier and Support Vector Machine,
to automatically construct and select features to best aid classification, both in
efficiency and performance, of skin lesion images. It is part of a larger project to build
and enhance a query-by-example content-based image retrieval system which can
return skin lesion images based on similarity to a given query image [1]. The entirety
of work here is based on attempting to improve the ability of this specific system to
classify these types of images correctly and, in so doing, to improve the chances of
success for a potentially educational and/or commercial dermatological system. Thus,
this is both an exploration of machine learning concepts which may be able to enhance
this system and a practical implementation of these concepts in a specific and real
engineering problem.
The work described in what follows was initially envisioned as a search for and
analysis of meta-heuristic algorithms, such as Genetic Algorithms and Ant Colony
Optimization, to select features for the image retrieval system. Before this
dissertation had even begun, work on the image retrieval system had already
constructed over 17,000 possible features to use in classification and was still
generating more. Therefore, the combinatorial optimization problem of choosing the
best set of features which could make classification accuracy the highest while at the
same time allowing the classifications to occur in a reasonable amount of time became
an imperative. However, as research for this project developed, it became evident that
while algorithms like Genetic Algorithms could very well be effective search strategies
for feature selection, search wasn't a primary concern of many of the most successful
feature selection strategies. This is demonstrated, for example, in a competition in
2003 in which dozens of research teams competed to see who could attain the best
feature selection and classification results [2]. All of the contestants focused not on
search strategies, often using the most basic greedy forward selection algorithms, but
on methods more directly related to measuring, analyzing, and processing features, and
classifying using these. This, combined with the scant but successful work that
has already begun in the area of genetic programming for feature construction in image
classification, was the prime motivation for the shift in the project's focus. The
hope is that genetic programming, in conjunction with statistical machine learning
tools, can simultaneously construct and select features which provide quicker and more
accurate results.
By this benchmark, the work here was able to largely accomplish its goal. The project
reports the results of a GP implementation which utilizes a filter and two wrapper
feature selection techniques for constructing feature equations. Further, it attempts to
use these to iteratively build complementary features. What it means for a feature to
be complementary and for a feature selection technique to be able or not able to find
complementary features will be explained in chapter 3. But even the most basic
methods, building features independently, were able to perform at least as well as the
standard feature equations in the domain chosen for constructing features, Haralick
texture features. Further, results indicate that, given more time and resources, there is
the possibility of increasing performance and further exploring possible techniques for
additional implementations. Importantly, given the time, data, and computing
limitations of this project, it was evident that the GP implementation had not yet
reached its full potential in generating solutions. All that would be required would be
to run it for longer and with larger populations. It is also likely that exploring a greater
complexity in the image data input, genetic programming techniques, and machine
learning tools would generate improvements as well. The limitations of training data
and computational resources available to the project were noteworthy and will be
addressed.
Chapter 2 is an exposition of directly relevant background material to this project,
including an account of work on the query-by-example content based image retrieval
system which the work of this dissertation is for. This relevant background work is
presented first to set the stage for the work done here. However, if some of the
concepts listed in it are completely unfamiliar it might be helpful to take a look at some
of the concepts described in chapter 3 first. Chapter 3 is focused on explaining
background theory for the tools used in this project from the three fields of digital
image analysis, genetic programming, and machine learning with feature selection.
Chapter 4 explains the implementation of work completed, including the tools
incorporated from other works such as GPlab, and justifies design decisions. Chapter
5 details the experiments run, lists results, and analyses them along with the problems
encountered. Finally, chapter 6 explores possibilities for future work and concludes.
Chapter 2. Related Work
This chapter introduces and explains the directly relevant work which precedes this
project. These works include a close look at the image retrieval system that this
project is a part of and at research which also combines genetic programming with
feature selection for image classification. Though there has been little work in
applying genetic programming to feature construction in the domain of texture features,
the two works cited here form a basis for comparison.
2.1 Query-By-Example Content Based Image Retrieval System
2.1.1 The System
The feature construction work described in this project is for a query-by-example
content-based image retrieval system of non-melanoma skin lesions [1]. Though in
general there are many new content-based image-retrieval systems [3], and many
within the medical domain [4], most of these are based on radiological images.
Within dermatology, computer vision systems have focused on techniques for
segmentation, feature extraction [5], and classification [6-8], often for cancer detection
and especially for melanoma. Though melanoma is a very dangerous kind of cancer,
other types are much more prevalent. The query-by-example CBIR system which this
project is a part of is the first of its kind to cover the five classes of skin lesions: Actinic
Keratosis (AK), Basal Cell Carcinoma (BCC), Melanocytic Nevus/Mole (ML),
Squamous Cell Carcinoma (SCC), and Seborrhoeic Keratosis (SK). It uses color
images digitally captured and segmented by project-affiliated team members.
It is hoped that this tool can be useful for dermatologists as well as non-expert users
with its ability to retrieve images based on similarity to a query image, allowing the
perusal of large databases of skin lesions based not only on diagnosis but on similar
visual attributes, and so be a decision-support as well as educational tool. Thus the
central goal of this project is to construct and select features of skin lesion images so
that the system may best classify and retrieve relevant images, hopefully significantly
improving the effectiveness of the system as a whole. These features are extracted
from the skin lesion images so that they can later be used to compute the 'similarity' of
the images without using an entire digital image as input to a classifier. This similarity
score can then be used to retrieve images which are near the given query image. The
entire success of the system relies on providing the retrieval system with features which
are able to discriminate as uniquely as possible between the different classes of
pathologies and having a classifier or similarity metric which can most effectively take
advantage of these features and separate them out correctly [1].
For the original project, features used in the system are taken from color and texture.
In general there is a vast quantity of possible methods and combinations of methods,
hand-crafted and empirically compared, for generating many different features. The
idea is to sort and choose some features among the many extracted for their ability to
aid classification of images. Unfortunately, this method for feature selection leads to a
complicated and time-consuming combinatorial optimization problem about which
features to use and it can lead to feature vectors of large size. The genetic
programming approach in this project is an attempt to perform this search while at the
same time constructing novel features which are able to combine image inputs in new
ways, not constrained by human intuition. Further, it's hoped that features can be built
with respect to other, already evolved features and so be effective in smaller subsets.
The next two sections are an attempt to show how thousands of features are extracted
for selection from images and to set the stage for features which will be relevant to GP
feature construction.
2.1.2 Color Features
In the original project, the color of a lesion is represented by its mean color and its
covariance matrix, so that:

$$\mu_X = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Here, $N$ is the number of pixels in the lesion and $x_i$ is the color component of channel
$X$ ($X, Y \in \{R, G, B\}$) of pixel $i$. And so, using the RGB color space, the covariance
matrix is:

$$\Sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}\,(x_i - \mu_X)(y_i - \mu_Y)$$
Features were constructed using the RGB, HSV, CIE_Lab, CIE_Lch, Munsell color
coordinate system [9], and Otha [10] color spaces. The colors components were
normalized by the average of the same component of the safe, non-lesion skin of the
same patient [1].
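As a concrete illustration, these statistics are inexpensive to compute. The sketch below is illustrative Matlab (the language of the larger project), assuming pixels is an N-by-3 matrix holding the RGB values of the N lesion pixels:

    % Per-channel mean color and 3x3 covariance of the lesion pixels.
    meanColor = mean(pixels, 1);    % 1-by-3 vector of channel means
    covColor  = cov(pixels, 1);     % covariance normalized by N, as above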
2.1.3 Texture Features
Texture features are taken from generalized co-occurrence matrices where a
co-occurrence matrix is a matrix which is taken over an image to be the distribution of
co-occurring values at some offset. For the original project, distances of one to six and
orientations of 0°, 45°, 90°, and 135° are used. Much of this statistical feature analysis is
founded on the popular and successful Haralick feature equations which are used on
these co-occurrence matrices [11]. The background for this will be detailed in chapter
3.
Generalized co-occurrence matrices are generated from images coded with n color
channels. For example with an image in RGB, there are six co-occurrence matrices
(RR), (GG), (BB), (RG), (RB), and (GB). For orientation invariance, the matrices are
averaged with respect to θ, and several quantization levels are used for
the color spaces of RGB, HSV, and CIE_Lab [1]. From each generalized
co-occurrence matrix, 12 Haralick features are extracted including energy, contrast,
correlation, entropy, homogeneity, inverse difference, cluster shade, cluster
prominence, max probability, autocorrelation, dissimilarity, and variance [11]. All of
the combinations of these features, inter-pixel distances, color pairs, color spaces, and
grey level quantizations result in 3888 texture features. The total number of texture features is
brought up to 9,720 by also extracting from the sum and difference histograms, varying
displacement, orientation, quantization level, and color spaces, and using the features
sum mean, sum variance, sum energy, sum entropy, diff mean, diff variance, diff
energy, diff entropy, cluster shade, cluster prominence, contrast, homogeneity,
correlation, angular second moment, and entropy [12]. The actual feature equations
behind the names can be viewed in detail in the works cited.
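To make the named statistics concrete, the following Matlab sketch computes three of them; P here is an assumed variable holding one normalized co-occurrence matrix (see chapter 3 for how such matrices are built):

    % Illustrative versions of three of the Haralick statistics listed above.
    [j, i] = meshgrid(1:size(P,2), 1:size(P,1));    % grey-level index grids
    energy   = sum(P(:).^2);                        % angular second moment
    contrast = sum(sum((i - j).^2 .* P));           % local intensity variation
    entrop   = -sum(P(P > 0) .* log(P(P > 0)));     % randomness of the texture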
2.1.4 Feature Selection and Retrieval
In order to choose the optimal set of these thousands of features, a greedy forward
selection algorithm and a genetic algorithm are used. Both algorithms maximize the
number of correctly retrieved images but the GA had superior results. Note that skin
lesion images are taken on a Canon EOS 350D SLR with a resolution of about 0.03mm
and then manually segmented. Ground truth is provided by the medical co-authors
and retrieval resulted in a precision of 59-63% using the Bhattacharyya distance metric
and the Euclidean distance for retrieval [1].
2.1.5 Synthesized Features
Synthesized features, created using a genetic algorithm, improved the performance of
the system by 8% [13]. The GA used already created features, for example the texture
features listed in 2.1.3, for the synthesis of new composite features. It used several
genetic operators to combine the old features in novel ways by representing them as
strings and then using index numbers to indicate the features and operators used to
combine the features. These operators were: taking only one of the features
considered, or adding, subtracting, multiplying, or dividing them. The fact that this
very simple GA was able to increase the performance of the system is a major
motivation for this project. One suggestion for further study from this work [1] was to
pursue the development of a large number of operators which can combine an arbitrary
number of features. GP is also mentioned as a possible avenue which may be more
powerful than GA. For example, it pointed to [14] which used GP to automatically
generate features to improve classification performance (up to 87%) and to drastically
reduce the complexity of feature vectors (from 78 features to 4).
2.2 Genetic Programming for Feature Construction
It might be concluded from the above methodology for feature extraction and selection
(where just about every method is tried and then the best are taken from the bag by a
feature selection algorithm) that there doesn't seem to be any designated set of “right”
features. In fact, this is a key motivation of Aurnhammer [14]. That work attempts
to use Genetic Programming to avoid having to hand-craft features for the
given task at hand, and instead to make a system which can generalize automatically
without a great deal of time spent constructing, extracting, and selecting features. Part of this is the
combinatorial optimization problem that is generated by having so many features to
select from [1]. The hope is that, instead of relying on human intuition and analyses
to engineer features, Genetic Programming can automatically generate a reasonably
small yet effective number of features by combining the basic building blocks which
are typically used to construct features, and evaluating their fitness automatically.
Using co-occurrence matrices, typical mathematical functions used in Haralick
features, and a fitness function combining Fisher's Discriminant Ratio with a simple
minimum distance classifier, Aurnhammer improves from a Haralick feature
classification accuracy of 67% to a Genetic Programming generated feature accuracy
of 87% using only 4 features. It should be noted that this genetic programming
algorithm is iterative in that it evolves a best feature and then evolves another feature
with respect to the feature(s) already evolved. Thus new features are generated which
classify well with the already generated features. This is an idea which relates to the
complexity of how features relate to each other and how they relate to the machine
learning problem at hand and will be discussed in chapter 3.
The GP features were created using the Open Beagle GP framework for C++, the Intel
OpenCV library for image processing, and the publicly available real-world
photography database VisTex. Therefore it seems that GP might be able, if adapted
and applied to our skin lesion classification problem, to generate a small number of
effective features for classification. This approach has the potential to generate new
and effective features that discriminate well between pathologies while at the same
time producing a manageably small set of features that work well together. It should
be noted, however, that while this implementation is directly motivational for this
dissertation, it is based on a very different database and is primarily aimed at being a
demonstration of evolving feature equations. This is to be contrasted with the image
retrieval system that this project is based on, which addresses a real and practical
engineering problem. Further, the combination of feature selection methods used here
is different, and as will be discussed, the implementation for this project deals with
the GP property of closure and with color channels differently. The GP
implementation and results will be further contrasted with this work in chapter 5.
Using genetic programming for the construction and selection of features for the
classification of images, especially with respect to texture features, is a relatively
unexplored area. Other than [14], the only other directly related research found was
[15]. But again, this is a demonstration of the success which might feasibly be
attained with this technique rather than a practical application. It too uses the VisTex database
and attains performance comparable to the Haralick features, and somewhat better performance
by combining the evolved features with the Haralick features. It uses the K-means
clustering algorithm for fitness.
Other work less directly related but worth mentioning includes work using a GP to
reduce dimensionality of input features for a classifier using a generalized linear
machine, k-nearest neighbors, or maximum likelihood classifier [16] and work using a
GA to simultaneously select features and optimize a support vector machine [17] [18].
The latter supplied guidance for the SVM wrapper approach described in chapter 4
where the parameters of the SVM are evolved with individuals in the population.
In that work, however, the algorithm is only doing feature selection, whereas here
feature construction and selection are occurring together.
Chapter 3. Conceptual Background
The following covers the three main domains which converge in the work for this
project. These domains are: features for digital image analysis, genetic programming,
and machine learning with feature selection.
3.1 Haralick Features
For some time now it has been possible to digitally process all sorts of image data, and
so it has become imperative to come up with effective ways to deal with these
complicated two-dimensional arrays of information. For this problem, Haralick [11] is
concerned with the creation of texture features for the classification of images. The
features established in that work have remained some of the most popular features for
these purposes largely because of their simplicity and effectiveness. As it is noted in
the original paper, categorizing image data is very difficult precisely because one needs
to deal with such large blocks of cells. However, once features are defined for the
large blocks of image data, reducing the incredible dimensionality of the problem, they
can be used in any number of pattern-recognition techniques.
To solve the problem of how to construct such features, Haralick uses the idea of texture.
This is a cue used by human beings, and the idea is to also use it as the
foundation for feature construction methods for use by digital computers. Texture is a
property of all surfaces and contains important information about the surface, its
makeup, and its relationship to the surroundings. Specifically, it is about the spatial
distribution of gray tones and can be evaluated, for example, as on the one hand fine,
coarse, or smooth and on the other hand, rippled, lineated, or irregular [11].
The resulting procedure for calculating texture features from blocks of image data
relies on the assumption that texture information is based on the average spatial
relationship which gray tones in an image have with each other. Thus a set of
“gray-tone spatial-dependence probability-distribution matrices” are assumed to
adequately represent this average of textural information. They are computed for
various orientations and distances between neighboring cell pairs in the image, and
then plugged into feature equations to produce feature values. The resulting features,
focused on macroscopic notions of texture rather than picking out any specific classes
of texture, contain information such as the homogeneity, contrast, boundaries,
dependencies, and complexity in an image [11].
As formulated in the original paper, “such matrices of gray-tone spatial dependence
frequencies are a function of the angular relationship between the neighboring
resolution cells as well as a function of the distance between them.” So for an image $I$
which has $N_x$ columns, $N_y$ rows, and $N_g$ grey levels, the co-occurrence matrix is
of dimension $N_g \times N_g$, where the value in position $(i, j)$ is determined by the number of
co-occurrences of the grey levels $i$ and $j$ which are an inter-pixel distance $d$ and
orientation $\theta$ apart [11]:

$$P(i, j \mid d, \theta) = \#\{\,((k,l),(m,n)) : I(k,l) = i,\; I(m,n) = j,\; (m,n) \text{ at distance } d \text{ and angle } \theta \text{ from } (k,l)\,\}$$
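The following Matlab sketch implements this definition for a single offset. It is illustrative only: I is assumed to hold grey levels in 1..Ng, and the (dr, dc) offset corresponds to one (d, θ) pair, e.g. d = 1, θ = 0° gives (0, 1):

    % Minimal grey-level co-occurrence matrix for one offset.
    function P = glcm(I, Ng, dr, dc)
        P = zeros(Ng);
        [rows, cols] = size(I);
        for r = 1:rows
            for c = 1:cols
                r2 = r + dr;  c2 = c + dc;
                if r2 >= 1 && r2 <= rows && c2 >= 1 && c2 <= cols
                    P(I(r,c), I(r2,c2)) = P(I(r,c), I(r2,c2)) + 1;
                end
            end
        end
        P = P + P';          % count both orderings, making P symmetric [11]
        P = P / sum(P(:));   % normalize counts to co-occurrence probabilities
    end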
Note that, as mentioned in chapter 2, a generalized co-occurrence matrix (GCM) is
simply the idea of the gray-level co-occurrence matrix adapted to a color space instead of
using only grey levels. All of the 14 Haralick feature equations suggested in [11] are
based on the use of these co-occurrence matrices. GCMs in the RGB color space will
be used with Haralick equations computed using several orientation and inter-pixel distance
settings for comparison in chapter 5.
Haralick features are, as mentioned, an attempt to home in on information about
homogeneity, contrast, organized structure, complexity, and transitions. Their
equations are described exactly in the appendix of [11]. They are a large part of the
many features generated in [1], forming the basis of the texture features there. These
generalized co-occurrence matrices will also be the foundation of features generated by
genetic programming in this dissertation.
Figure 1 – Taken from [11], a demonstration of GLCM calculations. (a) gray-tone values for a
4x4 image (b) the general form of any gray-tone spatial dependence matrix with gray tone
values 0-3. #(i,j) stands for number of times gray tones i and j have been neighbors. (c-f)
calculation for all 4 distance 1 gray-tone spatial-dependence matrices.
3.2 Genetic Programming
Genetic Programming is a technique for the automatic and systematic solving of
problems by means of algorithmic evolution, regardless of domain [19] [20]. This is
why it is a good candidate for the solving of the feature selection and construction
problem for image classification which often suffers from domain specificity and
bloated numbers of features. Just as in many domains it has matched or exceeded
human intuition and engineering, even creating patentable solutions in some cases, it has
the potential to improve on the classification performance of the hand-crafted Haralick
features and features like them. This was demonstrated in principle in the two works cited
in the previous chapter, and the hope is to expand this method to work in the
specific engineering problem associated with the skin lesion image retrieval system.
3.2.1 The GP Algorithm
The basic idea of Genetic Programming is to stochastically create and then transform a
population of programs, a generation at a time, into better and better solutions for the
problem at hand. This is done by, at every generation, evaluating the fitness of the
population and then using genetic operators to push and build the changing population
of programs towards better and better solutions. The idea is that we don't know
exactly how to make these good solutions but we do have methods for judging and
measuring the goodness of solutions that the stochastic evolutionary process produces.
The principle of survival of the fittest and the genetic operators ensure that when good
solutions are found, they propagate in the population and eventually are even built on
top of to produce better solutions down the road. What the 'programs' being
evolved are just depends on the domain. For feature construction, we are evolving
feature extraction equations.
Figure 2 – Taken from [18], the general outline of the genetic programming algorithm.
The Genetic Programming algorithm is basically: randomly generate a population of
valid programs, execute each program and measure its fitness, select some individuals
with a probability based on fitness scores to participate in genetic operations, and apply
the operators to the individuals to create a new generation of individuals. This process
begins again with the calculation of the fitness scores of the new population of
solutions and is repeated until an optimal solution is found or some other stopping
condition is met. The end result is that the best individual found is returned. In order
to accomplish this whole process, methods need to be established for the representation
of programs, the evaluation of fitness, the selection of individuals from the population,
and the execution of genetic operators.
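The loop below is a high-level Matlab sketch of this process. The helper functions (initPopulation, fitness, select, crossover, mutate) are hypothetical stand-ins for whatever a concrete GP system such as GPlab provides, and an even population size is assumed:

    pop = initPopulation(popSize, maxDepth);      % random valid programs
    for gen = 1:numGenerations
        scores = cellfun(@fitness, pop);          % execute and score each one
        newPop = cell(1, popSize);
        for k = 1:2:popSize
            p1 = select(pop, scores);             % fitness-based selection
            p2 = select(pop, scores);
            [c1, c2] = crossover(p1, p2);         % swap random subtrees
            newPop{k}   = mutate(c1);
            newPop{k+1} = mutate(c2);
        end
        pop = newPop;
    end
    scores = cellfun(@fitness, pop);
    [~, bestIdx] = max(scores);                   % return the best individual
    best = pop{bestIdx};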
3.2.2 Representation, Terminals, and Functions
Programs are often visualized as trees rather than as lines of code. In figure 3, it is
easy to see that variables and constants in the tree, known as terminals, take their places
as leaves of the tree while the operations in the program, known as functions, take their
places everywhere above the terminals, up to the top [19]. In the case of the picture
below, the terminal set is made up of x, 3, and y. The function set is made up of max,
+, and *.
These programs are often represented in the genetic program in prefix notation, such as
max(plus(x,x),plus(x,times(3,y))), because this makes it easier to manipulate and to
visualize the branches of the tree and the relationships of the functions in it. The first
computer language used to implement Genetic Programming was LISP because it
represents operations in this way and because its dynamic lists and automatic garbage
collection make it much easier to implement and manipulate a population of programs
[18]. As we shall see, however, other computer languages associated with AI research
or scientific computing, such as MATLAB, also have many of these same capabilities.
It is possible to implement the population in prefix notation or more explicitly in a tree
data structure.
Figure 3 – Taken from [18], graphical
representation of the tree of an individual
in a genetic program.
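As a toy illustration of this representation, Matlab can hold an individual as a string of nested prefix calls (plus, times, and max are built-in functions) and execute it directly; this is roughly the mechanism that string-based GP systems rely on, though their actual machinery is more elaborate:

    x = 2;  y = 5;
    individual = 'max(plus(x,x), plus(x, times(3,y)))';
    value = eval(individual);   % max(2+2, 2+3*5) = 17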
3.2.3 Initialization
Once we have representations for individuals in the population, it is necessary to
initialize the first generation of the genetic program. There are three predominant
methods for this named grow, full, and ramped-half-and-half. The full method
generates trees by randomly selecting from the function set until every branch of the
tree has reached a pre-specified depth. Here, terminals are installed. This creates a
full tree with branches all to the same depth, though trees still might have different
sizes (numbers of nodes) because some functions might take in different numbers of
operands (different arity). In order to increase the diversity of the population, the
grow method creates trees by randomly selecting from both the function and terminal
sets, granting the possibility that branches in the tree might be of different lengths.
Similar to the full method though, if a branch reaches some pre-specified depth, it is
capped with a random terminal. The third method is a combination of the full and
grow methods and is an attempt to further increase the diversity of the initial population.
In the ramped half-and-half method, half of the population is generated with the grow
method and half is generated with the full method. Further, this is done with a range
of depth limits [20].
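Both the grow and full methods reduce to the same recursive sketch, shown below in Matlab. randFunction and randTerminal are hypothetical helpers returning node structs drawn from the function and terminal sets:

    function node = randTree(depth, maxDepth, method)
        growLeaf = strcmp(method, 'grow') && rand < 0.5;  % leaves allowed early
        if depth == maxDepth || growLeaf
            node = randTerminal();                % cap the branch with a leaf
        else
            node = randFunction();                % internal operator node
            for a = 1:node.arity                  % recurse once per argument
                node.args{a} = randTree(depth + 1, maxDepth, method);
            end
        end
    end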
3.2.4 Selection and Reproduction
After creating the initial population and then evaluating its fitness, it is necessary to
probabilistically select individuals, based on fitness, for reproductive operations. Two
often-used methods for selection are fitness-proportionate selection and tournament
selection. For fitness-proportionate selection, the chances of being picked to
participate in reproduction are directly proportional to the individual's fitness
compared to all other individuals. That is, its chances of being selected are its fitness
divided by the sum of the fitnesses of all individuals [21]. This method provides very high
selection pressure as it is possible for an individual with an extremely high fitness
compared to the rest of the population to completely swamp the next generation. This
can be good if there is a need to take advantage of good solutions but it can also destroy
the diversity of the population, narrowing the possibility of finding new paths for
solutions. In tournament selection, a number of individuals are randomly chosen from
the population, compared with each other, and the best individual is selected. This
method has a lower selection pressure and allows a larger variety of individuals to be
selected for reproduction.
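Tournament selection is short enough to show in full; the Matlab sketch below assumes a cell array pop of individuals, a matching vector of scores in which higher is better, and a tournament size k:

    function winner = tournamentSelect(pop, scores, k)
        idx = randi(numel(pop), 1, k);    % k random entrants, with replacement
        [~, best] = max(scores(idx));     % compare them and keep the fittest
        winner = pop{idx(best)};
    end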
Once individuals have been selected, genetic operators are executed on them to make
new individuals to fill the next generation of programs. The most basic operators are
crossover and mutation. For crossover, two individuals are selected, a crossover node
is chosen randomly in the tree of each parent, and the branches below the two points are
swapped. The new individuals are placed into the new population rather than overwriting
the originals, so that the parents have the chance to be selected for reproduction again.
Often function nodes are given a greater chance of being chosen for swap points than
terminal nodes to prevent terminal nodes from always being chosen, and to prevent
such a small amount of genetic material from being exchanged in most crossovers.
For mutation, a single individual is selected, a random mutation point is chosen on the
individual, and everything below the point is replaced with a randomly generated tree.
It is also possible to simply replace the chosen node by a single function from the
function set which has the same arity [19]. Finally, reproduction can be executed
instead of crossover or mutation. In this case, an individual is selected from the
population and simply copied over to the next generation. The process of creating
new generations of programs, evaluating them, and creating still more, hopefully better
generations of programs, continues until some stopping condition. This can be, for
example, the execution of a preset number of generations.
3.2.5 Fitness and Test Cases
Finally, an implementation of Genetic Programming requires that an appropriate
fitness function, as well as the terminal and function sets, be chosen. For a
terminal set, this means gathering samples of data which can be plugged into the
evolved equations. These depend entirely on the particular problem that the Genetic
Program is being designed to solve. If a controller for a robot is being designed, it
would be prudent to include sensor inputs in the terminal sets and robot actions in the
function set. If functions for the construction of features for classification of images
are being designed, it might be prudent to include the relevant image data as terminals
and commonly used mathematical operations as functions. In this case, it would also
be necessary to choose a fitness function which could readily measure the goodness of
a given generated feature equation.
But before evaluating fitness, the input data, or test cases, are plugged into the
terminals of the given program and the program is executed. In the robot example, test
cases would include the data for sensors that the different terminal variables represent.
Once the test cases have been plugged in and the program executed, then the fitness can
be determined from the robot's behavior resulting from using the given program. If
the robot drives off a cliff, killing itself, a fitness of 0 might be awarded. On the other
hand if the robot successfully completes its given task, such as navigating a maze, then
some high score could be given. In the case of skin-lesion classification, test cases
could be GCMs produced by sample images. Here, knowledge of the true classes is
part of the test cases. The goal is to evolve some equation which can help connect the
sample input to the true class.
3.2.6 Closure and Sufficiency
Note finally that 'programs' generated by the genetic program must have the properties
of closure and sufficiency. Closure requires that any sub-tree must be able to be
processed by any function in the function set. This is because in the process of
evolution, nodes are joined arbitrarily and so any combination of them might be
generated. Closure also requires that no program that can be generated can be
crashed at run-time by a particular evaluation, for example, by trying to divide by zero.
Two ways to ensure closure are by requiring that every function takes any input and
produces as output the same type of data, and by using modified versions of functions
so that they will run no matter what numbers are given to them. For example, if the
function divide is given a zero for the denominator, it should return as output what it
received as input for the numerator [19]. The choices for fitness functions and
methods to ensure these properties will be addressed in the design and implementation
for this project in chapter 4.
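A protected division of the kind just described might look like the following Matlab sketch (illustrative, not GPlab's exact code):

    function out = protectedDivide(num, den)
        if den == 0
            out = num;        % return the numerator unchanged, as in [19]
        else
            out = num / den;
        end
    end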
3.3 Machine Learning with Feature Selection
This section addresses the related ideas of statistical machine learning and feature
selection and describes several specific learning methods which are used in this project.
As the ultimate goal of feature construction is classification, these methods are used for
evaluating individuals in the genetic program.
3.3.1 The Machine Learning Problem
Machine learning tasks are defined by problems which are established and solved by a
series of examples rather than predefined rules. Often this means taking many
training examples along with their associated correct answers and having the learning
machine decipher the underlying rules which connect them. For example, this could
be the case of showing a learning machine many training examples of email messages
and having it learn to classify the email as spam or ham. Or this could be the case of
showing a learning machine many training examples of skin lesion images and having
it learn to classify the images into one of five skin lesion diagnoses. These are both
cases of supervised learning of a classification problem where the correct answer is
used to teach the machine to categorize data into one of N categories. This type of
learning machine will be the backdrop for what follows, as this is the type used for
this project.
The inputs into the learning machine are called features. If one were training a
machine to learn how to diagnose the illness of a patient, the features could be the
various symptoms and characteristics of the patient such as her current temperature or
age. These should contain information which allows the learning machine to separate
out the true classes (in this example the true illnesses). As more and more information
becomes available, however, it becomes more and more complicated for the learning
machine to correctly learn the underlying signal from training examples. Thus feature
selection is a possible way to aid the learning machine's job. This is the problem of
choosing the best set of available features, while feature construction is the problem of
modifying available raw data into a useful form for learning [22] [23].
3.3.2 Feature Selection
The principle motivations for doing feature selection and construction are to increase
classifying performance, save resources, and better understand the data. A large
reason that these goals are important but hard to attain, however, is the “curse of
dimensionality.” That is, two points which are close together in two-dimensional space
are likely far apart in 200-dimensional space. Because the number of input features is
central to deciding the space of all possible solutions, as the number of features
increases so does the space of hypotheses, thus making the learning problem that much
more complex and difficult. A linear increase in the number of features causes an
exponential increase in the hypothesis space. And the more complicated the learning
problem is, the greater the volume of training data that is required. Therefore, good
feature selection and construction can make the learning problem easier by eliminating
redundant or irrelevant features and thus enhancing performance and saving resources
[22].
According to [23] there are three dimensions to feature selection: search, evaluation
criterion definition, and evaluation criterion estimation. Search is the method by
which subsets of a larger whole of features are sifted through. Often an exhaustive
search of every possible subset size and subset combination is computationally
intractable, slow, and in any case may lead to over-fitting problems where the learning
machine is able to fit training examples very well but not generalize to new examples
effectively. Search methods include forward selection, backward elimination, and
genetic algorithms. Evaluation criterion definition is the means by which to judge the
goodness of features and evaluation criterion estimation is the means of estimating the
goodness of features given the amount of data and the evaluation criterion.
Two broad types of feature selection methods are termed filters and wrappers (figure 4).
The defining difference between the two is the evaluation criterion definition.
Wrappers judge features based on their performance with a learning machine, for
example the classification accuracy of a bayes classifier, support vector machine, or
neural network, and filters are basically any ranking function which does not use the
performance of a learning machine. One approach for filters is to use them to rank
individual features. This can be especially useful when there are massive numbers of
features (e.g. 10,000) and relatively few training examples (e.g. 100) [23]. They are
also often much cheaper computationally to run than wrappers. Filters for ranking
individual features can be tricky, however, because they don't necessarily give
information about how different features work together.
Figure 4 – Taken from [22], a pictorial explanation of the makeup of the two broad categories
of feature selection methods: filters and wrappers.
3.3.3 Feature Relationships
In fact, a feature may be useless by itself but become effective when paired with a
certain other feature. Or, a feature may be effective by itself but provide absolutely
zero additional information when paired with another. Fisher's Discriminant Ratio is
an example of a univariate feature selection scoring method which only reveals
information about one feature alone. A ranking index based on the Relief Algorithm
is an example of a multivariate method which reveals the relevance of multiple features
together. To be clear, the terms univariate and multivariate in the context of feature
selection refer to the ability of the particular method at hand to give information about
feature relationships. Obviously, it is possible to compute FDR for multiple
dimensions. But while the resulting FDR scores will allow the ranking of features and
so allow something like cheap dimensionality reduction, the scores do not give
information about how features may act together. For example, if there are four
features ranked by FDR, it is possible to select the top two features in an effort to
reduce the dimensions. Often this is reasonable when there are very many features.
But it may be that features ranked 1 and 3 by FDR will for some reason complement
each other and allow a classifier to perform better than features ranked 1 and 2. Thus
if you want to be more certain that you are taking the best two features, either a
multivariate feature selection method or a wrapper method which uses classification
accuracy as the score should be used. These will score how well the 2 given features
will work together. Note that though FDR can be combined into one score for many
features, it is still not giving a score which reveals information about how the given
features may complement each other.
The pictures below are a two-dimensional example of feature relevance. The left picture shows an
example where one of the features is very effective at separating the classes and the
other, x2, is irrelevant. The right picture, however, shows a situation where both
features are required to effectively separate the two classes. For a much more in depth
and formal discussion of feature relevance, see [23].
Figure 5 – Taken from [22], a demonstration of two possible cases of feature relationships.
On the left, projection x2 is uninformative and could be discarded without a loss of
information with respect to the classes. On the right, both projections are informative and
needed to define the classes. A univariate filter, such as FDR, is not necessarily able to give
good information about cases where there are redundant or irrelevant features.
3.3.4 Fisher’s Discriminant Ratio
With the general idea of feature selection and feature relationships explained, consider
several methods of feature selection used in this project. A prominent univariate
feature ranking measure is Fisher's Discriminant Ratio, or Fisher's criterion. It is
the ratio of between-class variance to within-class variance. Roughly, the more
tightly clustered the classes are and the more separated they are from each
other, the higher the FDR score will be [23].
$$S_B = \sum_{k=1}^{K} N_k\,(\mu_k - \mu)(\mu_k - \mu)^{T} \qquad (4)$$

$$S_W = \sum_{k=1}^{K} \sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^{T} \qquad (5)$$

where there are $N$ column vectors, $K$ classes $\{C_1, \ldots, C_K\}$, and the mean of class $k$,
$\mu_k$, contains $N_k$ members. The mean of the data set is $\mu$ and finally the FDR is:

$$\mathrm{FDR} = \frac{\operatorname{trace}(S_B)}{\operatorname{trace}(S_W)}.$$
Again, the value of FDR is that it is often cheaper computationally to run and might
help avoid over-fitting problems with few examples. However, it is a univariate
scoring method and so only gives the score with respect to a given feature by itself.
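For a single feature, the scatter definitions in (4) and (5) collapse to scalars, as in this Matlab sketch; x is a vector of one feature's values and labels holds each sample's class:

    function j = fdr1d(x, labels)
        classes = unique(labels);
        mu = mean(x);
        sb = 0;  sw = 0;
        for k = 1:numel(classes)
            xk = x(labels == classes(k));
            sb = sb + numel(xk) * (mean(xk) - mu)^2;   % between-class scatter
            sw = sw + sum((xk - mean(xk)).^2);         % within-class scatter
        end
        j = sb / sw;
    end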
3.3.5 Naïve Bayes Classifier
The naïve Bayes classifier is a classifier based both on Bayes' theorem and a strong
independence assumption. In words, Bayes' theorem is basically the prior probability
times the likelihood divided by the evidence. However, because the
evidence is the same for every class and the prior is typically taken to be the same for
every class, we only need to compute the likelihood. Further, we make the
strong assumption of class-conditional independence. That is, the values of attributes
are independent of each other given the class label. This vastly simplifies the
calculations needed. So to calculate the posterior probability for a sample given the
classes, we have:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

where $X = (x_1, \ldots, x_n)$ is a vector of $n$ features and $C_i$ is the $i$th class. If feature values are
continuous, often a normal distribution is assumed for them. To calculate the
posterior probability of a sample for each class, the mean and variance of the features
for every class is computed from the training data, plugged into the normal distribution
along with the feature values for the sample, and multiplied together as the equation
above shows. This is computed for every class. Finally, to classify the sample, the
class is chosen which has the highest posterior probability [24].
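The whole procedure fits in a few lines of Matlab. In this sketch Xtrain is a samples-by-features matrix, ytrain its class labels, and x a single sample to classify; the Gaussian densities are written out directly:

    function c = naiveBayesPredict(Xtrain, ytrain, x)
        classes = unique(ytrain);
        post = zeros(size(classes));
        for k = 1:numel(classes)
            Xk = Xtrain(ytrain == classes(k), :);
            mu = mean(Xk, 1);
            sd = std(Xk, 0, 1);
            % class-conditional independence: multiply the per-feature
            % Gaussian likelihoods evaluated at the sample's feature values
            post(k) = prod(exp(-(x - mu).^2 ./ (2*sd.^2)) ./ (sqrt(2*pi)*sd));
        end
        [~, best] = max(post);    % pick the class with the highest posterior
        c = classes(best);
    end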
3.3.6 Support Vector Machine
Relatively speaking, Support Vector Machines are a fairly new method for
classification of linear and nonlinear data [24] [25]. This class of learning machine
has garnered a lot of attention because of its high accuracy as well as its greater
resistance to problems such as over-fitting or local minima which, for example, neural
networks have more trouble with. The idea of an SVM is to use a nonlinear mapping
to transform the training data into a higher dimension and in this space search for a
linear boundary, a hyperplane, which optimally separates the data. This search for the
maximum marginal hyperplane is done using 'support vectors' and 'margins'. It is a
search for a hyperplane which has the largest margin between classes. The idea is that
this will likely lead to the best classification of future data. Any training points which
fall on either border of the margin are called support vectors and are the most difficult
points to classify. Regardless of other points in the set of training samples, these
support vectors define the margin. Further, a trained SVM with few support vectors is
able to generalize well, even in high dimensionality [24]. A separating hyperplane can
be written as

$$W \cdot X + b = 0 \qquad (7)$$
where W is a vector of weights on X, a vector of input features. The weights can be
adjusted so that the borders of the margin can be written as
$$W \cdot X + b \ge 1 \;\text{ for } y_i = +1, \qquad W \cdot X + b \le -1 \;\text{ for } y_i = -1 \qquad (8)$$
This can be rewritten as
$$y_i\,(W \cdot X_i + b) \ge 1, \quad \forall i \qquad (9)$$
This establishes the two hypotheses defining how class y should be defined, here either
+1 or -1. The maximal margin is thus given by $\frac{2}{\lVert W \rVert}$, where $\lVert W \rVert$ is the Euclidean
norm of $W$. Rewriting the above with a Lagrange multiplier into a constrained
optimization problem, it is possible to find the maximum marginal hyperplane and the
support vectors [24]. Once this is done with the training data, we have a trained
support vector machine. Using the above, a decision boundary can be written for the
classification of future data:
$$d(X^{T}) = \sum_{i=1}^{l} y_i\,\alpha_i\, X_i \cdot X^{T} + b_0 \qquad (10)$$

Here, $y_i$ is the class label of support vector $X_i$, $X^{T}$ is a test feature vector, $\alpha_i$ and $b_0$ are
parameters, and the sum runs over the $l$ support vectors. The sign of the result of plugging a test feature vector into the equation
above determines which side of the hyperplane the test samples are on and thus decides
the predicted class. This works for linearly separable data. For nonlinear data it is
possible to map the original data into a higher dimension using a kernel function, find
the linear maximal marginal hyperplane in the higher dimension, and then substitute
back to translate this into a nonlinear boundary in the original space [24]. Examples of
kernel functions include [25]:

Polynomial kernel: $K(X_i, X_j) = (X_i \cdot X_j + 1)^{h}$

Gaussian radial basis function kernel: $K(X_i, X_j) = e^{-\lVert X_i - X_j \rVert^{2} / 2\sigma^{2}}$

Sigmoid kernel: $K(X_i, X_j) = \tanh(\kappa\, X_i \cdot X_j - \delta)$
SVM has been described here for binary classification problems. However, this can
be transformed to solve multiclass problems as well. One approach is: if there are m
classes, m SVMs are trained, the ith of which learns to separate the ith class from the
rest. The predicted class is chosen according to the SVM which returns the largest
positive distance from the margin [25].
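The one-vs-rest scheme can be sketched in Matlab as below, assuming training data Xtrain/ytrain and a test sample x; trainSVM and svmMargin are hypothetical stand-ins for any binary SVM implementation that returns a signed distance from the separating hyperplane:

    models = cell(1, m);
    for i = 1:m
        yBinary = 2*(ytrain == i) - 1;             % class i vs. the rest
        models{i} = trainSVM(Xtrain, yBinary);
    end
    margins = cellfun(@(mdl) svmMargin(mdl, x), models);
    [~, predicted] = max(margins);                 % largest margin wins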
3.3.7 Leave-One-Out Cross-Validation
K-fold cross-validation is a method by which a learning machine can be evaluated
and/or have its parameters set. In the case of this project it is used with learning
machines to evaluate the goodness of inputs to the learning machine. The training
data is split into k sets of approximately equal size. The learning machine is then
trained on all of the sets of training data except for one group which is used to test the
learning machine‟s accuracy. Then the sets of data are rotated so that the old left out
test set is included for training, and one of the former training sets is used to test for
classification accuracy. The sets are rotated so that all are used for training and testing
and the classification accuracies are averaged together. Leave-one-out cross-validation
is specifically when k is equal to the size of the training set, and so training is
done on all but one sample which is used for testing. Just as before, all of the training
samples are rotated so that all are used for training and testing [25].
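In Matlab the procedure is a single loop; trainAndTest below is a hypothetical helper that trains a classifier on the given data and returns 1 if the held-out sample is classified correctly:

    n = size(X, 1);
    correct = 0;
    for i = 1:n
        trainIdx = [1:i-1, i+1:n];      % every sample except the ith
        correct = correct + trainAndTest(X(trainIdx,:), y(trainIdx), ...
                                         X(i,:), y(i));
    end
    accuracy = correct / n;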
Chapter 4. Design and Implementation
This chapter describes the design strategy as well as the implementation of the work
undertaken for this dissertation. Further, it attempts to justify design decisions.
4.1 Motivation and Overall Design
This project is based on the implementation of a genetic program which evolves
features for the effective classification of skin lesion images. It sits at the intersection
of the fields of genetic programming, machine learning, and digital image analysis.
Thus it has been necessary to understand these three fields in general and, specifically,
the most effective and feasible ways to bring them together.
The problem from image analysis has been how to take in a massive matrix of raw
image data and turn it into a simple feature or set of features which contain adequate
information for classification of similar images. This has often been solved, for
example, by intuitively creating features which contain information about image
texture. Machine learning provides the tools by which to take these features for input,
calculated for many different sample images, and learn-by-example a set of statistical
“rules” which allow it to classify future images correctly. Thus the classification of
image data has often been solved by extracting these sorts of hand-crafted or intuitively
designed features from images and then using some particular learning machine, for
example a neural network or a naïve bayes classifier, to learn to classify future sample
images. This was the basic method of [1]. However, there is a general shortcoming
to this method: there is as yet no set of features which has been established to work
across image classification domains or problems. Thus, because of this and human
intuition's shortcomings when it comes to dealing with digital information at the
lowest level, often the method for choosing features to use in such classification
problems amounts to spending a great deal of time trying many different types of
features as well as empirically hand-crafting useful features specifically for the
particular job at hand. There is then the problem of feature selection as it is necessary
to determine, of the many features extracted, which ones are the best. This is where
the third topic, genetic programming, comes in. Its use aims to aid classification.
Genetic programming provides a means by which to evolve "programs" which perform
some needed role according to some set standard. In this case, the needed role is
generating feature values from image data, and the set standard is the best classification
accuracy that we can get. Therefore, we can use genetic programming to automatically
generate mathematical functions which can be applied to images to extract feature
values. We hope to do this in such a way that the generated features provide adequate
information for classification and yet come in smaller subsets than would be
generated by empirically testing and crafting many kinds of features. Thus we are
solving the problem of constructing effective features for the problem at hand and the
problem of selecting a near-optimal set of these features for classification,
automatically and at the same time. The aim is that this will not only avoid some of
the necessary hand-crafting of features for the specific domain in question, but it will
also improve overall performance of the system developed. Hopefully this would be
because the evolved features outperform the crafted features. But we could also
improve performance by adding the set of evolved features to the set of best crafted
features. For example, in the projects most similar to this one [14] [15], while evolved
features did at least as well as crafted features, and often better, the two sets put
together were better still.
So how do we design and set up Genetic Programming for these purposes? Besides
the implementation of the basic GP, it needs to be decided how the appropriate training
data will be handled, what is going to be evolved and how to represent it, what tools
will be used to evaluate the goodness of the individuals in the GP population, how to
evolve multiple complementary features, what parameters should be used for the GP as
well as the machine learning techniques incorporated, and finally, what to do with the
resulting evolved features.
4.2 Basic GP Implementation
The language chosen to implement genetic programming was Matlab. This was
chosen partly because the previous work on skin lesion feature construction and
classification was done in Matlab. In addition, it is perfectly suitable for feature
construction and selection of image data using genetic programming. This is because
it is a language with many useful mathematical tools, and it offers dynamic lists,
automatic garbage collection, prefix notation for mathematical functions, and easy
execution of strings as code. These are all characteristics cited by Koza, the creator of
genetic programming, as defining suitable genetic programming languages [20]. The
basic infrastructure of the genetic programming algorithm
implemented here is provided by GPlab, a free and open source Genetic Programming
toolbox for Matlab [26]. Although GPlab has been extremely useful, providing the
nuts and bolts of population representation, creation, evolution, and visualization, it
was ultimately not sufficient on its own for the purposes of this project. It was necessary
to heavily modify and extend GPlab so that, among other supportive changes, GPlab
could load, manipulate, and evaluate large matrices as terminals in feature equations;
individuals could contain and evolve separate 'shell' functions for turning matrices into
scalars; GPlab could evaluate the fitness of an individual's resulting feature values
using machine learning techniques for feature selection; and GPlab could run genetic
algorithms iteratively so that multiple features can be evolved to complement each
other. In the case of using a support vector machine as a wrapper, it was also necessary
to set up the GP so it could evolve the cost parameter along with individuals.
4.3 Training Data
From the beginning of the project, it was decided that traditional texture feature
equations would be the basis for evolving features. That is, the materials used to
construct individual feature equations would consist of generalized co-occurrence
matrices (GCMs) generated from the training samples of lesion images as well as
mathematical functions which are commonly used in, for example, the texture feature
equations of Haralick. This provided building blocks for equations, a starting point
and concise domain for building an initial GP feature construction system for future
expansion, as well as a set of already well established and often used Haralick features
to use for comparison. In genetic programming terms, the GCMs of skin lesions are
the set of terminals and the mathematical functions used are the set of functions. All
of the experiments conducted in this project use GCMs constructed with a
quantization level of 64 and an interpixel distance of 5 in the RGB color space. This
means that for each sample image there are 6 matrices (from the 6 color channel pairs
RR, GG, BB, RG, RB, BG), each of which is 64x64. And because the available training
set consists of 100 images, with 20 images per class of skin lesion, the training set
is 100 rows by 6 columns, each row being a sample, each column being a different
color channel pair, and each item in the training set being a 64x64 matrix. If, to
generate the GCMs, the quantization level were changed to 256, the training set would
be full of 256x256 matrices. The problems associated with the scarcity of available
training data will be discussed below.
Note that these matrices are all generated from the segmented lesion part of the skin
lesion images. If it were thought that useful information were held in the healthy skin
part of the images, GCMs could be generated from these and included in the training
set. Then the training set would be 100 rows by 12 columns. All such changes that
increase the size of the training set, however, have associated computational costs.
Thus, the experiments here are based on a quantization level of 64 and do not include
healthy skin. All of the GCMs are pre-computed from the images and loaded at the
start of the genetic program. So, the basic idea of the whole proposal is this: we
calculate generalized co-occurrence matrices from skin lesion images, use these
matrices as terminals in a genetic programming algorithm which evolves some best
feature or set of features, and then use the feature values produced by these features as
inputs to a classifier of skin lesion images. See figure 6.
Figure 6 – A graphical outline of the overall design. The end result of the GP should be a
best equation that can be used to generate feature values for inputs into a classifier. In the
case of iterative GP, there should be several equations for several features that work together.
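As an illustration of the training-set layout just described, the following sketch shows
how such a set might be assembled in Matlab. The helpers loadSegmentedLesion and
computeGCM are hypothetical stand-ins, not functions from the actual project code.

    % Sketch of pre-computing the 100-by-6 training set of GCMs.
    % computeGCM is assumed: it returns a 64x64 generalized co-occurrence
    % matrix for a channel pair, quantization level, and interpixel distance.
    channelPairs = {'RR','GG','BB','RG','RB','BG'};
    nSamples = 100;
    trainingSet = cell(nSamples, numel(channelPairs));
    for s = 1:nSamples
        img = loadSegmentedLesion(s);                  % segmented lesion only
        for c = 1:numel(channelPairs)
            trainingSet{s, c} = computeGCM(img, channelPairs{c}, 64, 5);
        end
    end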
4.4 Representation of Individuals
An individual feature equation is made up by a combination of some functions from the
function set and by, in the case of a 100x6 training set, variables X1 through X6.
These variables represent where the corresponding 6 matrices of a given sample image
are to be plugged in as terminals. They are created either by one of the genetic
programming initialization methods or by genetic operations on individuals that
already exist; in the implementation here, this means ramped half-and-half
initialization, mutation, and crossover. Individuals are stored both as strings in prefix
notation and as tree data structures. These strings are straightforward to evaluate in
Matlab. See, for example, the fairly complicated yet successful function which evolved
in one experiment to use all of the GCMs:
mydivide(mydivide(mylog(mydivide(mylog(X6),mydivide(X1,minus(X1,minus(X6,times(times(mylog(X3),X5),X6)))))),mydivide(X2,mylog(X5))),X4)
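The way such a string might be evaluated on one sample is sketched below. The
variable-binding mechanics inside the modified GPlab are assumptions here, and a
small random stand-in replaces the real training set so the snippet runs on its own.

    % Sketch of evaluating an individual's equation string on one sample.
    trainingSet = repmat({rand(64)}, 100, 6);   % stand-in for the real GCMs
    equation = 'times(X1, plus(X4, X6))';       % a small example string
    s = 1;                                      % sample index
    for c = 1:6
        eval(sprintf('X%d = trainingSet{s, %d};', c, c));  % bind X1..X6
    end
    resultMatrix = eval(equation);   % element-wise ops keep this 64x64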
Another central reason to expect improved features from the GP is its ability to produce
feature equations which arbitrarily select and combine the GCMs of different color
channels. This is readily displayed in the tree representation of the same individual
(figure 7) and is unlike the work in [14], where only grey levels are used.
Figure 7 – A graphical representation of the tree of an individual created
by the GP implemented in this project. It illustrates the arbitrary
combination of all of the GCMs, something which traditional Haralick
features do not do.
In order to make sure that every possible branch of every possible equation is able to be
executed without error, the functions operate individually on the elements of the
matrices. For instance, the function 'times' is element-wise multiplication rather than matrix multiplication. This
way, inputs to functions are always guaranteed to be matrices and the final result of an
equation is also a matrix. But because we need a scalar value in the end, it was
necessary to add functions which have been dubbed 'shell' functions to each individual.
As well as having the standard set of functions, each individual also contains two
additional shell functions, drawn from two separate sets. The first shell function takes
in a matrix and produces a vector, and the second takes in a vector and produces a scalar
value. These shell functions are always the outermost two functions, wrapped around
the rest. This
final scalar value is the feature value for the given sample skin lesion image. This
method of ensuring closure is to be contrasted with [14], where strongly typed
functions are used so that there is only one larger group of functions. These functions
are strongly typed so that their output depends on the type of their input: given a matrix
input a function will return a matrix, and given a scalar input it will return a scalar.
Though it must be done somehow, [14] does not explain how it guarantees that
evolved programs result in scalars.
The two shell functions in an individual each have an independent chance of
point-mutation at every mutation; for the experiments here, the shell mutation rate was
set to 25%. The standard function set includes the element-wise Matlab functions
times (.*), mydivide (./), plus, minus, cos, sin, mysqrt, and mylog (see table 2). The
inner-most shell function set includes the standard Matlab functions mean, max, min,
and sum, as well as the function row, which places the column vectors of a matrix into
one long vector. The outer-most shell function set includes mean, max, min, and sum.
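A sketch of how the two shell layers might wrap an evaluated equation follows; the
particular choice of shells here is arbitrary and for illustration only.

    % Sketch of applying the two 'shell' functions to an evaluated equation.
    resultMatrix = rand(64);              % stand-in for an equation's output
    row = @(M) M(:)';                     % inner-shell option: one long vector
    innerShell = @(M) mean(M);            % matrix -> vector (column means)
    outerShell = @(v) max(v);             % vector -> scalar
    featureValue = outerShell(innerShell(resultMatrix));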
The functions 'mydivide', 'mylog', and 'mysqrt' exist so that there is a check to
prevent operations that result in infinity or NaN evaluations. Mydivide, for example,
should return the numerator itself in the case of a zero denominator. However, it is
impossible to completely prevent the fitness evaluations of generated equations from
resulting in NaN. For example, using Fisher's Discriminant Ratio on an equation that
produces feature values which are all the same will result in a zero-divided-by-zero
evaluation and a NaN result. To cope with these cases, a soft constraint is added: NaN
fitness evaluations are checked for and result in fitness scores of 0.
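The protected functions might look roughly as follows. Mydivide matches the
behavior described above, while the exact protected forms of mylog and mysqrt are
assumptions based on common GP practice, not taken from the project code.

    % Sketch of the protected, element-wise functions.
    function out = mydivide(a, b)
        out = a ./ b;
        out(b == 0) = a(b == 0);      % zero denominator: return the numerator
    end

    function out = mylog(a)
        out = log(abs(a));            % assumed protected form
        out(a == 0) = 0;              % avoid log(0) = -Inf
    end

    function out = mysqrt(a)
        out = sqrt(abs(a));           % assumed: avoid complex results
    end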
Two other processing steps for evolved feature values, before they are evaluated for
fitness, are standardization and precision rounding. To standardize a vector of sample
feature values, the mean of the values is subtracted from each element and then divided
by the standard deviation of the values. The hope is that this will help erase problems
for classification introduced by features being scaled differently. The precision
rounding is simply the rounding of any value generated by the features, and of any
fitness values, to a precision of 12 decimal places. Without this check, Matlab's
floating-point arithmetic can produce slightly inexact numbers; for example, when it
should compute 0 as an answer, it may compute 1e-19. When another number is then
divided by 1e-19, instead of generating NaN, and thus a 0 fitness, it may generate a
very high fitness. Rounding to a precision of 12 decimal places solves this problem.
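In Matlab, these two post-processing steps amount to a couple of lines, sketched here
with a random stand-in for the feature values:

    % Sketch of standardization and precision rounding of feature values.
    f = randn(100, 1);                    % stand-in: one value per sample
    f = (f - mean(f)) / std(f);           % standardize: zero mean, unit std
    f = round(f * 1e12) / 1e12;           % round to 12 decimal places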
4.5 Fitness
Before its fitness can be evaluated, an individual must first be evaluated on all 100
training samples, with the appropriate GCMs plugged into the appropriate X variables
contained in the individual. This occurs for every individual in every generation, and
produces a column vector of size 100 which is essentially 100 feature
input values for 100 skin lesion images, computed using the evolved equation. These
values are then standardized, and a fitness value is calculated according to a feature
selection method which attempts to measure how useful the feature values would be in
aiding a potential classifier to classify images.
There is a concern here with regard to over-fitting as this means that all of the available
training data is being used for evaluating individuals in every generation. The issue of
how to utilize the available training data is important both with respect to genetic
programming and in the use of a statistical learning machine. Often in machine
learning situations when there is ample data, it is split into training, validation, and test
sets. The validation set is used to select parameters, the training set is used for training,
and the test set is used to evaluate the learning machine‟s performance. This is done
largely to remedy the problem of over-fitting, where the learning machine, rather than
learning an effectively general rule for classification, homes in on noise in the training
set and interprets it as the true underlying signal. Thus, the learning machine's
performance when introduced to new samples for classification is poor.
Unfortunately, as is the case here, it is not always possible to have an ample set of
training data. The skin lesion images for this dissertation are directly captured and
processed by a group associated with the larger skin lesion image retrieval project to
which this work belongs. In the collected set used for this project, there are only 100
images, 20 from each class. There are more images in the whole set, but these make
the classes drastically uneven (2 classes have only 20 samples). It was decided that for
the work here, a balanced set of samples, albeit limited in number, would be used, in
the interest of avoiding the extra complexity of working with an uneven data set.
Using leave-one-out cross-validation as the basis for accuracy scores in the wrapper
approach is an attempt to take the most advantage of the available
data. Though using leave-one-out cross validation is a common method in this
situation [27], it is important to note that it is known to be a high-variance estimator of
generalization error [28]. Therefore we can compare the relative accuracies of
evolved features and Haralick features, but we cannot yet be sure about generalization
abilities. Hopefully, future collection of data will enable additional training methods
in related future work.
Other possibilities for using the data might be to train on a data set of 15 samples per
class and test on a data set of 5 samples per class. Also, whether cross validation is
used or split training and testing sets are used, it might also be beneficial to split up data
between generations of the genetic program. This way, not all of the training and
testing data are used repeatedly on every generation. Rather, it could be rotated,
resulting in a sort of cross-validation, or it could just be split up enough so that it could
last the total number of generations. A systematic exploration of these possibilities
would be fruitful, and an implementation of them in the current system would be fairly
straightforward. However, the time required for these experiments made them
infeasible to include in the results here. Further, it is probable that this exploration
would be best done when at least some more data is available. Thus, the common
method of cross-validation is used on all 100 samples in every generation.
There are three methods used here for evaluating the feature values produced by
evaluating the individuals on all 100 samples: the filter method of a score based on the
Fisher Discriminant Ratio (FDR), and the wrapper methods based on the prediction
accuracy of either a naïve Bayes classifier or a support vector machine using
leave-one-out cross-validation, where the cross-validation is used to attempt to predict
the future prediction accuracy of the model with the available data. In the case of the
wrappers, the FDR is also used: it is first computed on the candidate features and used
to provide a
threshold. If a feature does not have an FDR above 0.2, its classification accuracy is
not computed and its fitness is set to 0. This saves some training time, and it also
prevents the classification process from taking in features which cause the classifiers
to work with singular matrices.
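A sketch of this FDR gate is given below. The multiclass form used, between-class
over within-class variance, is one common definition of the Fisher discriminant ratio;
the exact formula in the project code may differ, and wrapperAccuracy is a
hypothetical stand-in for the LOOCV accuracy of the chosen classifier.

    % Sketch of the FDR-based gate applied before the wrapper evaluation.
    f = randn(100, 1);  y = repelem((1:5)', 20);   % stand-in values and labels
    classes = unique(y);  mu = mean(f);
    between = 0;  within = 0;
    for k = 1:numel(classes)
        fk = f(y == classes(k));
        between = between + numel(fk) * (mean(fk) - mu)^2;
        within  = within  + sum((fk - mean(fk)).^2);
    end
    fdr = between / within;          % between-class over within-class variance
    if isnan(fdr) || fdr <= 0.2
        fitness = 0;                 % do not bother training the classifier
    else
        fitness = wrapperAccuracy(f, y);   % assumed: LOOCV classifier accuracy
    end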
There are many other potential fitness evaluations and many potential combinations of
these. For this project, however, these three were the most useful, expedient, and
interesting to test. Further, of these three, only FDR was used in the directly relevant
work cited in chapter 2, and unlike here, it was never used alone. Other options and
combinations of options would certainly be worth exploring. For example, a scoring
method based on the multivariate Relief algorithm as well as embedded methods both
seem to be very popular in the newer feature selection literature [27] [22].
The Bayes classifier is a simple yet empirically effective classifier, and its
implementation was provided by work already done for the overall skin lesion retrieval
project. The support vector machine is a newer and very popular classifier known for
its quick training, high accuracy, and resistance to over-fitting despite often needing
few training samples. The SVM implementation used in this project is provided by
libsvm, a support vector machine library for Matlab [29]. FDR, essentially the variance
between classes over the variance within classes, is perhaps the weakest performer of
the three. However, it is also quite fast to compute, as well as effective at helping to
prevent over-fitting when there are relatively many features and few training samples.
Therefore, it may aid the generalization ability of the classifier ultimately used, both as
a fitness method on its own and in conjunction with the wrapper methods.
A final concern is that, unlike the FDR and Bayes classifier methods, the SVM requires
the setting of parameters. In this case this means setting C, the penalty parameter of
the error term, and, if a radial basis kernel is used, the hyperparameter γ (see the listed
SVM kernels in the background section). Typically, parameters for an SVM are found
by cross-validating on a set of data samples with a grid search over the most
commonly reasonable parameter values and choosing the parameters which give the
best accuracy [30]. However, because of the small amount of data, and because it is
impossible to pre-cross-validate on features which do not yet exist (because we are
evolving them), setting these parameters is not a straightforward procedure.
Three possibilities were considered. First, the parameter values could be included in
each individual so that they are evolved along with the features. This makes sense, as
the best choice of parameters is mainly a function of the data set at hand, and the best
features and their associated best parameters could reinforce each other.
Unfortunately, this could greatly increase the complexity of the hypothesis space that
the GP is attempting to explore. It would be possible to reduce this space as much as
possible by using a linear kernel and a standard set of fewer than 10 possible settings
for the C parameter. A second option would be to select the parameter values before
the GP runs by doing a cross-validation grid search on features which are thought to be
useful or similar to what will be evolved; in this case the parameter values are
inflexible, and the features will be evolved for those particular parameter settings.
Finally, it might be possible, though costly and of questionable value, to run a small
cross-validation grid search at every fitness evaluation of individuals. The first option,
where C is evolved with the feature equations, is used here, with a radial basis function
kernel whose hyperparameters are always the libsvm defaults. It would be best if the
hyperparameters were also evolved, but the concern for this first implementation was
to keep the hypothesis space smaller. Further, evaluating SVMs with a radial basis
function
kernel was much quicker than with a linear kernel. The parameter C is initialized
randomly and then has a chance of mutating to another of the possible C values at every
generation (10e-5, 10e-3, 10e-1, 10e1, 10e3). In future work this set could be increased.
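A sketch of how C might be initialized and mutated inside the GP, under the rates
listed in table 1, follows; the actual placement of this logic in the modified GPlab
operators is an assumption.

    % Sketch of initializing and mutating the SVM cost parameter C.
    Cvalues = [10e-5 10e-3 10e-1 10e1 10e3];      % the set used here (table 1)
    C = Cvalues(randi(numel(Cvalues)));           % random initialization
    % inside the mutation operator, with an independent 50% chance:
    if rand < 0.5
        C = Cvalues(randi(numel(Cvalues)));       % jump to another setting
    end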
4.6 Iterative Genetic Programming
Following the concept introduced by [14], an iterative genetic programming algorithm
was developed to work with GPlab so that features could be developed together to
complement each other. The motivation here is related to what it means to be a set of
relevant features. Review the brief discussion of feature relevance from chapter 3:
One feature alone might be very good by itself at aiding classification and we could
take this to be a relevant feature. However, suppose we have a second feature which is
just as good at aiding classification but when paired with the first, is barely able to
increase classification accuracy. What is happening is that the second feature is
redundant and adds no new information for the classifier to take advantage of. This
makes the second feature irrelevant. Consider instead two features which are barely
worthwhile when treated by themselves, but which, when combined, enable the
classifier to outperform the first single feature. This is a general idea of what the
iterative GP is attempting to do: to find these complementary features.
This is just a gloss of the highly complicated relationships which might exist between
possible features. But the whole idea is to consider these highly complicated
relationships when trying to automatically evolve feature construction equations. For
these reasons, it may not be maximally effective to run many genetic programming
algorithms in parallel, evolve many best equations, and then throw them all together as
input features for a classifier. This is because the features are never actually evaluated
with respect to each other. Instead, the idea is to evolve features so that they are
chosen for their ability to complement each other. Thus, it might be possible to run
genetic programming algorithms iteratively so that one best equation is evolved after
another. This way, each subsequent feature equation is evolved with respect to how
well it can aid classification with the previously evolved feature equations. This is
why it is important to distinguish between univariate and multivariate feature selection
procedures. Note that, in this process, FDR cannot take the principal role in fitness
evaluation because it is a univariate feature scoring method: it scores how good a
feature is by itself. And this is another reason why it is important to include the
wrapper methods using a Bayes classifier or SVM.
The implementation of iterative genetic programming here evolves a population for a
maximum number of generations. It then takes the best individual from this first
iteration, and uses it in a second iteration of new individuals which all have their fitness
evaluated with the first evolved feature. After each feature is evolved, its feature
values are added to a feature list. Every subsequent feature evolved is scored by its
ability to aid classification along with the features already evolved. The hope is that at
the end of n iterations, there are n features which complement each other very well.
Not only are these n features effective but n is hopefully small, making the
classification problem that much less complex. Note that, except in the first iteration
of the SVM wrapper GP, the parameter C from the best individual of the previous
iteration is copied over exactly to the individuals in the first generation of the next
iteration. This is an attempt to preserve a parameter value that already works for the
features found in previous iterations. After the first generation of the next iteration, C
can still mutate between generations.
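The overall loop can be sketched as follows; runGP and wrapperAccuracy stand in for
the modified GPlab machinery and are assumptions, not the project's actual interfaces.

    % Sketch of the iterative GP: each iteration evolves one new feature whose
    % fitness is always scored together with the previously evolved features.
    y = repelem((1:5)', 20);                   % stand-in class labels
    featureList = zeros(100, 0);               % grows by one column per iteration
    for iter = 1:3
        fitnessFcn = @(f) wrapperAccuracy([featureList, f], y);
        [bestEqn, bestValues] = runGP(trainingSet, fitnessFcn);  % assumed API
        featureList = [featureList, bestValues];   % new feature's 100x1 values
    end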
4.7 Parameters
In genetic programming there are a number of parameters to decide on and set. Here
are some brief explanations of the most important ones chosen and used in the modified
and extended version of GPlab used for this project. Three of the most basic and
important parameters in any genetic program are the size of the population, the
stopping condition, and the depth restriction for the size of program trees. The size of
the population is important for allowing a large amount of variation and exploration for
the GP and the depth restriction is important in the same regard but also because it
restricts the amount of bloat possible. Bloat is a classic genetic programming problem
where programs tend to grow in size without any substantial increase in fitness. The
stopping condition for GP runs here is just taken to be a preset number of generations to
run. These three parameters are systematically tested and discussed in chapter 5.
Other important parameters, though their exact settings are not known to affect results
as drastically, are the selection type, tournament size, operator probabilities, tree depth
initialization, dynamic depth restrictions, and elitism type. The selection type used is
the GPlab default, lexictour. Lexictour is similar to tournament selection, where some
number of individuals is randomly drawn from the population and only the best of
these survive; the size of tournaments is set according to the default, which is 1% of
the population size. The only difference between tournament selection and lexictour
selection is that if two individuals have the same fitness, the one with smaller size is
chosen. Another possibility would be roulette selection, but this is known to cause very
high selection pressure and might prevent a good exploration of the space of feature
equations.
The implementation here uses variable crossover rates. The relative number of good
individuals that each genetic operator produces is used to decide whether to increase or
decrease the rate at which the different operators occur. Both crossover and mutation
rates are initialized at 50%. However, over the course of a run, it was observed that
crossover would typically range between 50% and 95% and mutation would typically
range between 5% and 50%. Reproduction is set at a fixed 10%.
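The adaptive idea can be sketched as below; GPlab's actual update rule differs in its
details, so this is only an illustration of the principle, with made-up counts.

    % Sketch of the variable-rate principle: operators producing relatively
    % more good children get their probabilities nudged upward.
    pCross = 0.5;  pMut = 0.5;                % both initialized at 50%
    goodFromCross = 30;  goodFromMut = 10;    % e.g. counts from last generation
    share = goodFromCross / (goodFromCross + goodFromMut);
    pCross = 0.9 * pCross + 0.1 * share;      % smoothed move toward its share
    pMut = 1 - pCross;   % crossover and mutation sum to 1; the fixed 10%
                         % reproduction rate is handled separately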
In addition to the absolute maximum for the depth of trees, there is also a dynamic level
maximum. This dynamic level is set smaller than the absolute maximum but can be
surpassed if the new individual is better than the best tree so far. The maximum level
for tree initialization and the dynamic limit are typically set around one or two levels
smaller than the absolute maximum for tree size. The idea behind these settings is to
try to strictly limit growth of bloat but give equations at least a chance for some small
growth if the fitness increase is there.
The type of elitism used is 'keepbest'. In filling the next generation, many new children
are created using the genetic operators. These are then pooled with the parents, and all
of them are given the chance to move to the next generation depending on their priority
for survival and the amount of space in the next generation. Using 'keepbest' elitism,
the best individual of all of the parents and children is always given the highest
priority. However, in the interest of diversity, the rest of the children are given priority
over their parents, regardless of whether they have lower fitness: the children are
ranked by their fitness, followed by the parents ranked by theirs. Stronger versions of
elitism include 'halfelitism' and 'totalelitism', where more individuals are taken to the
next generation based on their fitness rather than on whether they are a parent or a
child.
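A sketch of the 'keepbest' survival ordering, under the description above, follows; the
stand-in populations are random, and the duplicate of the best individual that reappears
later in the ranking is ignored for brevity.

    % Sketch of 'keepbest' survival ordering.
    children = struct('fitness', num2cell(rand(1, 20)));   % stand-in individuals
    parents  = struct('fitness', num2cell(rand(1, 20)));
    populationSize = 20;
    pool = [children, parents];
    [~, b]    = max([pool.fitness]);                 % the single best individual
    [~, cidx] = sort([children.fitness], 'descend'); % children outrank parents
    [~, pidx] = sort([parents.fitness], 'descend');
    ranking = [pool(b), children(cidx), parents(pidx)];
    nextGen = ranking(1:populationSize);             % fill the available space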
Finally, note the complications for the project due to run-time problems. The use of
the Bayes classifier and SVM was questionable as the project began, because learning
machines are known to be much more costly to run for feature selection. In fact, it
became clear early on that run-time for the genetic program would be an important
consideration in producing features; unexpectedly, however, this was not because of
the use of any wrapper method. The long run-times for the genetic programming
system implemented here were due mostly to the time it takes to evaluate every
individual in a given generation. That is, to evaluate the feature values for a feature
equation, it is required to plug in the terminal GCM matrices, each 64x64 in the tests
here, and evaluate them in sometimes very large equations. These evaluations occur in
a loop, one sample at a time, a process which is often very inefficient compared to the
identical operations carried out via matrices, where all samples would be placed into
one matrix and computed together. This is because, in adapting GPlab to work with
large GCM matrices, it was necessary to discard a piece of code which could bypass
Matlab's 'nested 32' error. This error prevents matrix operations which are nested by
more than 32 brackets or parentheses, and so of course prevents the large batch
processing of 100 samples all at once, where each sample holds a large number of
sizeable matrices. It is not known whether bypassing the nested 32 error would greatly
improve performance, but this run-time consideration, combined with the time limit on
this project and the need to run many trials per experiment, put an upper bound on the
number and variations of experiments possible. The following experiments represent
over 120 days of computing time. Run-time is not mentioned in the related work, so a
comparison in this regard is not possible.
GP Parameters
Parameter                                Values Used
Elitism                                  keepbest
C parameter for SVM                      10e-5, 10e-3, 10e-1, 10e1, 10e3
C mutation rate                          50% chance inside mutation operator
Selection                                lexictour
Shell mutation rate                      2 independent 25% chances
Crossover and overall mutation rates     variable
Fitness                                  FDR filter, Bayes wrapper, SVM wrapper
# iterations for iterative GP            1, 2, and 3
Table 1 – List of primary GP parameters and the values used.
Function and Terminal Sets
Mathematical Name     Matlab command (all are element-wise matrix operations;
                      operations with the 'my' prefix are modified to ensure closure)
Function Set
Sine                  sin()
Cosine                cos()
Multiplication        times() (.*)
Division              mydivide() (./)
Subtraction           minus() (-)
Addition              plus() (+)
Square Root           mysqrt()
Natural Logarithm     mylog()
Terminal Set          6 GCMs (color space RGB, interpixel distance 5, quantization 64)
Table 2 – List of the function and terminal sets used to create individuals in the GP population.
Chapter 5. Results and Analysis
This chapter explains the experiments run as well as the results attained, evaluates the
resulting performance, and compares the results to the related works described in
chapter 2.
5.1 Exploration of Population, Generation, and Depth
Experiments were run with the FDR filter, Bayes wrapper, and SVM wrapper GPs.
First, the FDR filter was used to systematically explore the GP's behavior, varying the
population size, the number of generations, and the maximum depth. Then the top 25%
of all of the individuals created by the FDR filter GP were pooled for classification
performance evaluations using a Bayes classifier and a greedy feature selection
algorithm. Each set of experiments consists of ten runs of the GP with the given
parameters. The run-time for one iteration of one of the filter GPs ranged from about
2 hours to greater than 2 days, with the average being more than 1 day. Run-times were
recorded for every experiment, but exact values are not reported here because the GPs,
in an effort to get results as quickly as possible, were run on many different computers
with different levels of unrelated processing going on. It is still possible, however, to
estimate the relative run-times of experiments as the number of generations × the size
of the population × the maximum depth of the equations × the number of iterations.
The value taken from each run is the best individual returned. The initial experiments
are an attempt to explore the behavior of the GP, the goal being to find parameters
which result in the highest scores while remaining constrained by run-time. The results
from the filter GP experiments are also useful for discovering classification
performance when using the FDR filter for fitness. Experiments 1 and 4 are the same
experiment, repeated for comparison.
Round 1 – Vary Population
Experiment   Generations   Population   Depth   FDR (avg. of 10)
1            35            50           12      .8493
2            35            150          12      .8479
3            35            250          12      1.0711
Table 3 – Round 1 of experiments, where the population size is varied. The average is
slightly deceiving, but the FDR distribution generally increases with population size.
Clearly, increasing the population size shifts upward the distribution of FDR scores
returned in the best individual. Unfortunately, using a population of 250 over 35
generations resulted in run-times of about 2 days.
Figure 8 – statistical illustration of experiments, varying population. The red
line is the median, the top and bottom of the boxes are the 25th and 75th
percentiles, the whiskers above and below the boxes signify data which are not
considered outliers but are in the extremes of the distribution, and the red
crosses are outliers.
Figure 9 – An example of the evolution and result from 1 run of the ‘vary population’
experiments. Top left – the tree of the resulting best individual. Top right – the evolution
of fitness over the run. Bottom left – the evolution of genetic operator probabilities over
the run. Bottom right – The evolution of the complexity of the population over the run.
The intron count is calculated in order to save time.
Round 2 – Vary Generation
Experiment   Generations   Population   Depth   FDR (avg. of 10)
4            35            50           12      .8493
5            70            50           12      .8733
6            110           50           12      1.0008
Table 4 – Round 2 of experiments, where the number of generations is varied.
An increase in the number of generations shifts upward the resulting distribution of
best individuals. As in Round 1 of the experiments, the third group shows the greatest
increase as well as the largest spread of individuals between the 25th and 75th
percentiles.
Figure 10 -- statistical illustration of experiments, varying generation. The red
line is the median, the top and bottom of the boxes are the 25th and 75th percentiles,
the whiskers above and below the boxes signify data which are not considered
outliers, and the red crosses are outliers.
Figure 11 -- An example of the evolution and result from 1 run of the ‘vary generation’
experiments. Top left – the tree of the resulting best individual. Top right – the evolution
of fitness over the run. Bottom left – the evolution of genetic operator probabilities over
the run. Bottom right – The evolution of the complexity of the population over the run.
The intron count is not allowed to increase in order to save time.
Round 3 – Vary Depth
Experiment   Generations   Population   Depth   FDR (avg. of 10)
7            50            150          4       .7650
8            50            150          8       .8737
9            50            150          12      1.1192
Table 5 – Round 3 of experiments, where the maximum depth of individuals in the GP is varied.
Once again, the highest setting feasible within the maximum allowable run-time (for
the purposes of reporting here) returns the distribution with the greatest FDR scores.
While round 3 seems to show the greatest increase in scores with increasing settings,
suggesting that depth may have a greater effect on score than population size or
number of generations, this is not exactly a fair comparison: the rounds of experiments
do not have the same values for the parameters which were not being varied.
Figure 12 -- statistical illustration of experiments, varying depth. The red line is the
median, the top and bottom of the boxes are the 25th and 75th percentiles, the whiskers
above and below the boxes signify data which are not considered outliers, and the red
crosses are outliers.
Figure 13 -- An example of the evolution and result from 1 run of the ‘vary depth’
experiments. Top left – the tree of the resulting best individual. Top right – the evolution
of fitness over the run. Bottom left – the evolution of genetic operator probabilities over
the run. Bottom right – The evolution of the complexity of the population over the run.
The intron count is not allowed to increase in order to save time.
5.2 Pooled FDR Features Compared to Haralick Features
Even though features selected according to the FDR filter are not guaranteed to work
well together, since FDR is a univariate method, it is very possible that some of them
do. So classification was run on the pool of the top 25% of FDR scores from the 3
rounds of experiments to find out how they would perform. Individuals were selected
from this pool using a greedy forward-selection algorithm, with the leave-one-out
cross-validation accuracy of the Bayes classifier as the evaluation method; a sketch of
the procedure is given below. The performance is shown using 1, 4, and 10 features.
Again, the evolved equations were created by the FDR filter GP, which used as
terminals GCMs with quantization level 64 and interpixel distance 5. For comparison,
forward-selection was run on the 72 Haralick features which were created with the
same GCM settings (12 Haralick equations × 6 color channels × 1 interpixel distance
(5) × 1 quantization level (64)). Additionally, forward-selection was run on 1296
Haralick features which were created using combinations of additional GCM
parameters (12 Haralick equations × 6 color channels × 6 interpixel distances (5, 10,
15, 20, 25, 30) × 3 quantization levels (64, 128, 256)). These additional parameters
should give an advantage to this group's performance. See table 6.
Classifier Performance with Bayes Classifier and Greedy Selection

FDR filter GP features (selected from top 25% of total)
# Features Selected   1     4     10
Accuracy              51%   67%   72%

72 Haralick features (same GCM parameters)
# Features Selected   1     4     10
Accuracy              37%   66%   65%

1296 Haralick features (more GCM parameters)
# Features Selected   1     4     10
Accuracy              41%   61%   72%

Table 6 – Comparison of the classification accuracy of the feature equations produced by the
FDR filter GP and of the Haralick features.
The evolved features generally have a fair advantage over the Haralick features. The
least favorable comparison is between the evolved features and the features selected
from the 1296 Haralick set, where both reach 72% accuracy. Even this reflects
favorably on the evolved features, as there were fewer than 200 GP features before
selection, created using GCMs from only one interpixel distance setting and one
quantization setting; in contrast, there were 1296 Haralick features before selection,
created with 6 interpixel distances and 3 different quantization levels. Also, the single
best evolved feature is able to classify 51% of samples correctly, a significant leap over
the best single Haralick feature. It seems, however, that adding more and more features
improves the accuracy less and less. In this regard, though the univariate
FDR-filter-evolved features are not exactly working together to provide much more
information, they are not significantly less complementary to each other than the
Haralick features when selecting up to 10 features.
5.3 Wrapper Fitness and Iterative GP
Finally, the wrapper GPs were given 10 trials using 3 iterations, run once with the
Bayes wrapper as the fitness function and once with the SVM wrapper. The parameters
were chosen in an attempt to maximize score while finishing in a reasonable time: a
population of 250 and depth of 8, run for 35 generations and 3 iterations. One iteration
of one GP took about 1 day to complete, so these trials took about 30–40 days of
computing time. In the future it should be a priority to investigate the results attainable
when iterating at least 4 times.

First, the distributions of the classification accuracies produced by the FDR GP runs
and the wrapper GP runs are compared using one feature (figure 14). Note that the
FDR features are a distribution over 200 equations, the top 25% of FDR scores
produced by the FDR GP under a number of different GP parameters, while the results
for the Bayes wrapper and SVM wrapper GPs are the distributions of just 10 GP runs
each. With this limited number of runs, it seems that both wrapper methods are able to
produce equations with higher classification accuracies more consistently than the
FDR GP. Further, the SVM is slightly better than the Bayes classifier, though with so
few trials, especially with 3 iterations, this is not an entirely confident comparison
between the two. The distribution of GPs using 3 features is from 7 trials rather than
10. Figure 15 summarizes the classification accuracies attained over all the wrapper
GP trials.
Figure 14 – A comparison of the classification accuracies of evolved equations using
FDR, Bayes, and SVM fitness methods. Note that while the FDR distribution is
made of 200 samples, the Bayes and SVM distributions are made of only 10 each.
Figure 15 – A comparison of the classification accuracies of evolved equations
using the Bayes and SVM wrappers with 1, 2, and 3 features. Note that the
distributions with 1 and 2 features are from 10 samples each, while the
distributions with 3 are from 7 samples each.
As the number of features was increased, the classification accuracy increased and
surpassed both the Haralick features and the evolved FDR filter GP features. The
highest scores were produced by the SVM wrapper GP with 3 features, which evenly
matched the very best accuracy of the Haralick and FDR-evolved features using 10
features for classification. These scores were produced with only 3 iterations of the
GP, whereas the Haralick features were selected from over a thousand computed
features and the FDR features from the top 25% of 800 evolved features, both using a
greedy feature selection algorithm.
These results show promise for the wrapper-based approach to evolving features,
though it is not clear from these results whether, or by how much, the SVM is better
than the Bayes classifier. While the margin between the best of the Haralick, FDR, and
wrapper features is small, the wrapper features did perform at least as well as the larger
sets of features generated otherwise, with a smaller feature set generated in fewer runs.
Unfortunately, the wrapper method implemented here is the least tuned part of these
experiments. Though the FDR filter GP clearly has more potential and could use trial
runs with greater population sizes, it seems that the wrapper GPs could also use more
development in, for example, how the genetic operators decide when and how
individuals are modified. In this implementation, the mutation operator is responsible
for providing independent chances to change the shell functions, the SVM parameter
C, and the feature equations themselves. But this seems to have a bad effect on the
diversity of the population, as illustrated in figure 16, where the mutation operator
never results in improvements in fitness, causing its probability of occurring to remain
low. The equation is able to evolve to a classification accuracy of 72%, but in the 2nd
and 3rd iterations the populations have converged on only one individual; diversity is
supplied only at the very beginning of these runs. If diversity can be restored with
adjustments to the algorithm, the SVM wrapper should improve.
Figure 16 – Note that the scaling is different in every fitness plot; the maximum
classification accuracies are listed in the boxes of run statistics. The evolution of fitness in
iterations 1 (top left), 2 (top right), and 3 (bottom left). The populations in 2 and 3 get
stuck after making only initial advances in fitness. The bottom right plot reveals that
mutation is not contributing much after initialization, in contrast to, for example, figure 13,
where mutation continues to play a part in the evolution of the FDR-evolved features. Note
that the initial high median values are an artifact of a starting population with many zero
fitness values, due to the FDR threshold preventing training with some of the worst features.
Figure 17 – An example of feature trees produced by 3 iterations of the SVM wrapper GP.
It has been established that the FDR evolved features can often perform better than the
standard set of Haralick features and that, though there is a smaller set of trials than
would be preferred, the wrapper methods can more consistently create smaller subsets
of features which are at least as good as the FDR-evolved features. Further, both of
these methods have room to improve, with respect both to increasing primary
parameters such as population size and to tweaking how the genetic operators split up
their evolutionary duties.
But it is interesting to take a brief, closer look at the features evolved. In this regard,
table 7 shows the confusion matrices of 3 similarly performing features evolved with
the three fitness functions. Also, figures 18 and 19 show 3D visualizations of the
feature values produced by the FDR filter and the Bayes wrapper. With the focus on
developing a GP implementation for feature construction, it is not possible to develop
a very detailed appreciation of the space of features and their relationships. However,
we can at least build a feel for how the evolved features classify and misclassify the
classes and how they place data points in the feature space relative to each other.
We can conclude from the confusion matrices that, at least in this small snapshot, there
are no drastic, systematic differences in how the classes are classified using the
different fitness methods; there are only slight variations in classification decisions.
In all of the matrices in table 7, for example, AK is the most difficult class to classify.
Moving on to the visualizations, these make it obvious that the classes are not perfectly
linearly separable. This may indicate that a linear kernel is not the best option for the
SVM approach. Further, it seems clear that there are a number of ways that the classes
can be flipped relative to each other and variously clumped together or spread out.
The visualizations show that the features do separate the classes reasonably well to the
human eye; however, the classes are still fairly cluttered together. For comparison,
table 8 shows the confusion matrix for the figure 19 data. It shows that the class which
is most difficult to classify according to the confusion matrices is also the class which
lies in the space between the other classes.
Confusion Matrix – FDR (Bayes); accuracy 65%, 3 features
                 predicted
actual    AK   BCC   ML   SCC   SK
AK         8     3    4     2    3
BCC        2    14    3     1    0
ML         2     0   16     0    2
SCC        4     2    0    13    1
SK         1     1    3     1   14

Confusion Matrix – SVM; accuracy 64%, 3 features
                 predicted
actual    AK   BCC   ML   SCC   SK
AK        10     2    2     3    3
BCC        2    15    3     0    0
ML         3     1   14     0    2
SCC        3     2    1    12    2
SK         3     2    1     1   13

Confusion Matrix – Bayes; accuracy 64%, 3 features
                 predicted
actual    AK   BCC   ML   SCC   SK
AK        10     2    2     3    3
BCC        4    14    1     1    0
ML         2     1   16     0    1
SCC        3     0    4    13    0
SK         0     2    2     2   14

Table 7 – The confusion matrices displaying the classification predictions against their true
classes. Similar classification accuracies were deliberately chosen here to compare the
classification characteristics. At least in this small snapshot, there seem to be no systematic
differences in error between the three methods of fitness; for example, in every method AK
is the most poorly classified class. Note that these accuracy scores are at the bottom of the
distribution for the wrapper methods and at the top for the FDR filter method.
Figure 18 – Visualization of the feature values made by 3 of the FDR-filter produced
equations. The class of a point is indicated by color. Classification accuracy using these
features was 64%.
Figure 19 – Visualization of the feature values made by 3 of the Bayes-wrapper-produced
equations. The class of a point is indicated by color. Classification accuracy using these
features was 69%.
Though the evolved features were able to classify better than the Haralick features, and
with less complicated GCMs, the performance results show much room for
improvement. The best of any run of the GP was 72%. This is comparable to the
results obtained in [15] but below the 84% obtained in [14]. However, it is important
to point out the differences between the work completed here and there. The work here
was on a real engineering problem with the related real-world problem of very limited
(but hopefully growing) data.

In contrast, both of the GP projects cited were completed on the same relatively large
and publicly available online database of digital photographs. Importantly, these
images are specifically photographs of different classes of textures. Working with
these is an easier problem than using texture features to distinguish between skin
lesions.
Confusion Matrix – Bayes; accuracy 69%
                 predicted
actual    AK   BCC   ML   SCC   SK
AK        13     2    3     0    2
BCC        4    13    1     2    0
ML         3     0   14     1    2
SCC        3     1    2    13    1
SK         0     1    2     1   16

Table 8 – The confusion matrix corresponding to the features from figure 19. It makes sense
that AK is the class most often misclassified in the previous tables, as it lies at the intersection
of just about all of the other classes (blue dots in figure 19). Notice, however, that with these
features it is not worse than two other classes, and that correspondingly, the blue dots are
much more tightly clumped in figure 19 than in figure 18.
Whereas the texture photographs are fairly straightforward to distinguish even with the
human eye, it takes a dermatologist's careful diagnosis to distinguish between skin
lesions. Also, [14] split the photographs up, with four patches of data coming from one
image, so that the data set could be increased to 480 samples. Compare this with the
100 samples of 5 classes of skin lesion images used here, where each sample comes not
only from a different image but also from a different patient.
It is quite clear from every experiment conducted here that the full potential of this GP
implementation, even without any optimizations of the current system, has not been
reached. Even the highest values used for population, depth, and generations did not
result in a convergence of scores to a maximum. It is expected and hoped that the
classification of skin lesion images can be improved quite easily by increasing
population sizes and generation counts.
But there are clear problems with the implementation and experiments here. Due to
the small data set, cross-validation is used for optimization and there is no data set left
over to verify the generalization abilities of the evolved features. This should be a
priority in the future and will become possible as more data is collected from the field.
Also, it seems even from the very few trial runs with the support vector machine that
the SVM wrapper GP's potential is not being reached either. It seems, for example,
that the mutation operator is not able to introduce new genetic material into the
population. The probability of mutation should increase when that operator produces
good individuals; unfortunately, the mutation probability never rises using the wrapper
methods, which may be partially because of the more complicated hypothesis space
created by evolving the C parameter for the SVM. Also, though a non-linear kernel is
used for the SVM, only one parameter is evolved. Though also
including the kernel parameters in evolution would make the hypothesis space still
more complicated, it seems necessary to eventually incorporate them as well; in this
regard, radial basis functions are probably the best option. It would be best if separate
genetic operators were used for mutating the individual equations, the shell functions,
and the parameters of the SVM. This way, the variable mutation rate for each could go
up and down according to its own merits. As it is now, it may be that bad equation
mutations prevent an exploration of the SVM parameter space, or vice versa.
Chapter 6. Conclusions
This chapter provides ideas for work to continue with the GP system resulting from
this project and it concludes with a summary of the work developed here, results,
and the problems encountered.
6.1 Future Work
Before making any grand plans for expanding the current project, of which many are
possible, the issue of run-time should be addressed. If addressing Matlab's nested 32
error could decrease the run-time of the GP, it would greatly ease the exploration of
different parameters and new methods. It is clear that, though the results of feature
construction and selection via genetic programming presented here do show success,
the system shows no signs of having been pushed to its full potential yet. It would
benefit both from more run-time and from an efficiency increase in the GP algorithm
itself. Even without any changes, it is likely that more run-time alone, for greater
populations with more generations and greater depth limits, could produce even better
results. Further, it would also be statistically reassuring to confirm the current results
with at least 10 more runs of the GP in the various experimental settings.
The possibilities for future work lie in the three domains which converge for this
project: digital image analysis, genetic programming, and machine learning. For
image analysis, GCMs generated with a wider variety of parameters could be used as
terminals, for example, using matrices generated from different interpixel distances all
together. This way, the GP would have the ability to arbitrarily combine even more
information from the image. Going further, it might be fruitful to investigate
terminals which are not provided only by GCMs. Perhaps color information could be
included, for example. Also, healthy skin images could be incorporated.
Investigating advanced genetic programming techniques would also likely be helpful.
In particular, novel, ad-hoc methods for combining or manipulating feature
equations in the genetic operators, designed specifically to mix image analysis
information in helpful ways, could be pursued. Adding more functions to the
function set, or designing ad-hoc functions, would also be worth exploring.
Finally, the use of Automatically Defined Functions (ADFs) might hold promise. An
ADF is a piece of program whose parts are always defined together in a certain way;
for example, (X1 - X2) / X3 could be treated as a single unit which gets plugged
into new, larger equations. This way, code which is known to potentially work well
can be reused and built upon. Perhaps the original Haralick feature equations
themselves could be defined as ADFs and given the chance to become part of larger
evolved equations, or the best feature from a previous iteration could be used as
an ADF in future iterations.
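For concreteness, a minimal sketch of the (X1 - X2) / X3 fragment as a reusable
building block follows, wrapped in a protected-division shell in the spirit of the
shell functions used in this system; all names are illustrative.

    % Sketch: an ADF-style reusable subexpression.
    protdiv = @(a, b) a ./ (b + (b == 0));         % shell: protected division
    adf     = @(x1, x2, x3) protdiv(x1 - x2, x3);  % the reusable building block

    % A larger evolved equation can then call the ADF as a single node:
    evolved = @(x1, x2, x3, x4) log(1 + abs(adf(x1, x2, x3))) .* x4;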
Finally, a greater focus on sound machine learning techniques can only aid further
performance gains. This would not only help performance increase in the future but
would also allow more accurate measurement of the performance of the system already
in place. In this regard, learning will improve greatly as more skin lesion images
are collected. And, as already mentioned, a variety of training methods could be
systematically investigated. At present, all 100 samples are used in every
generation of the population. Using fewer samples every generation would give the
learning machine less to work with, but if less training data is evaluated each
generation, perhaps rotated as the generations advance, this might help reduce
over-fitting. It would also reduce the run-time, since evaluating equations is the
greatest use of time in a run.
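A minimal sketch of such a rotation scheme follows; the subset size, shift, and
data are placeholders, and the population-evaluation call is hypothetical.

    % Sketch: rotating a smaller training subset each generation.
    numSamples = 100;
    subsetSize = 50;
    shift = 10;                          % samples rotated per generation
    X = randn(numSamples, 4);            % placeholder feature data
    y = randi(5, numSamples, 1);         % placeholder class labels

    for gen = 1:30
        start = mod((gen - 1) * shift, numSamples);
        idx = mod(start:(start + subsetSize - 1), numSamples) + 1;  % wraps around
        trainX = X(idx, :);
        trainY = y(idx);
        % fitness = evaluatePopulation(population, trainX, trainY);  % hypothetical
    end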
Another relevant avenue is the handling of extremely skewed data. The main reason
more data is not being used for training in this project is that the classes are so
uneven: in a data set of about 500 samples with 5 classes, 2 classes have only 20
samples each. A test set built from the data beyond the 100 samples used here would
be statistically questionable, because for two of the five classes there are no
samples beyond this set. The best answer to all of these problems is simply more
data, but a handle on how the extra data could be soundly used would be helpful as
well.
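One sound way to use all of the skewed data would be stratified cross-validation,
sketched below on the assumption that the Statistics Toolbox's cvpartition is
available (it stratifies automatically when given class labels); the class sizes
are illustrative.

    % Sketch: stratified 5-fold partitioning for heavily skewed classes.
    labels = [ones(240,1); 2*ones(160,1); 3*ones(60,1); ...
              4*ones(20,1); 5*ones(20,1)];        % 500 samples, 5 uneven classes
    cv = cvpartition(labels, 'KFold', 5);
    for f = 1:cv.NumTestSets
        trainIdx = training(cv, f);   % logical index of this fold's training set
        testIdx  = test(cv, f);       % even 20-sample classes give ~4 per test fold
    end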
Additional machine learning techniques could also be used for fitness evaluation.
As already mentioned, filters based on Relief algorithm scores and embedded methods
seem to be attracting attention in feature selection. In the same regard, there are
many other types of filters (such as information-theoretic ones) and many other
types of classifiers (such as Bayesian neural networks) which may hold promise. An
easy addition to the current implementation would be to simply try additional
kernels in the already implemented support vector machine. This would require the
individuals to hold additional parameters to evolve with them, which would in turn
make the hypothesis space that much more complicated. But perhaps this trade-off
would be worthwhile once the described problem with diversity is smoothed out.
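As a first step, the candidate kernels could be compared offline by
cross-validation accuracy before being added to the evolution. A minimal sketch
using LIBSVM's MATLAB interface [29] follows; the data and parameter values are
placeholders, and the -v option makes svmtrain return cross-validation accuracy
rather than a model.

    % Sketch: comparing LIBSVM kernels by 5-fold cross-validation accuracy.
    features = randn(100, 3);        % placeholder feature vectors
    labels   = randi(5, 100, 1);     % placeholder class labels

    % -t 0 linear, 1 polynomial, 2 RBF, 3 sigmoid
    kernels = {'-t 0', '-t 1 -d 3', '-t 2 -g 0.5', '-t 3'};
    for k = 1:numel(kernels)
        opts = sprintf('%s -c 1 -v 5', kernels{k});
        acc = svmtrain(labels, features, opts);   % returns CV accuracy (%)
        fprintf('kernel %s: %.2f%%\n', kernels{k}, acc);
    end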
6.2 Conclusions
A genetic programming system which automatically evolves feature equations for the
classification of skin lesion images has been developed and described. The system
uses generalized co-occurrence matrices and ordinary mathematical functions which
are combined stochastically and evaluated by a simple FDR filter or by a
combination of FDR score and one of two classifiers, naïve Bayes or SVM. This
system is similar to two other systems in its use of texture GCMs to evolve
features. However, it is unique in its choice of classifiers, in the methods by
which it guarantees closure and allows the arbitrary combination of color channels,
and in its application to a real world problem. Again, instead of working only in
grey-scale like [14], this system is able to arbitrarily combine color channels
into one feature value. Whereas the two previous works (described in chapter 2 and
contrasted with in chapter 5) demonstrated the possibility of successfully applying
GP to image classification using GCMs on a large texture photography database, this
project was developed as an effort to improve the classification abilities of a
dermatological system. With the real world problem of skin-lesion classification,
however, came the real world problem of scant data. Cross-validation was therefore
used to obtain accuracy evaluations with the learning machines; however, there was
as yet no way to measure the ability of the features to generalize. Further, very
long run-times, caused at least partially by a restriction in the programming
language used in this project, severely limited the exploration of the GP system's
abilities. Even so, the system was still able to outperform the traditional
Haralick features by a fair margin and, more importantly, demonstrated that there
is certainly room for improvement, even before further development of the system.
The GP system evolved features which aided classification with higher accuracy
using fewer features.
Bibliography
[1] Ballerini, L., Li, X., Fisher, R., & Rees, J. (2010). A Query-by-Example
Content-Based Image Retrieval System of Non-Melanoma Skin Lesions.
Medical Content-Based Retrieval for Clinical Decision Support, 31–38.
Springer. Retrieved from
http://www.springerlink.com/index/N027330482888GL3.pdf.
[2] Liu, H., Motoda, H. Computational methods of feature selection. Chapman
& Hall/CRC. 2008.
[3] Muller, H., Michoux, N., Bandon, D., and Geissbuhler, A., "A review of
content-based image retrieval systems in medical applications - clinical
benefits and future directions," International Journal of Medical
Informatics, vol. 73, 2004, pp. 1-23.
[4] Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences,
and trends of the new age. ACM Computing Surveys 40(2) (April 2008)
5:1–5:60
[5] Wollina, U., Burroni, M., Torricelli, R., Gilardi, S., Dell'Eva, G., Helm, C.,
and Bardey, W., "Digital dermoscopy in clinical practise: a three-centre
analysis," Skin Research and Technology, vol. 13, May 2007, pp. 133-142.
[6] Schmid-saugeons, P., Guillod, J., and Thiran, J.P., Towards a
computer-aided diagnosis system for pigmented skin lesions, Computerized
Medical Imaging and Graphics, vol. 27, 2003, pp. 65-78.
[7] Maglogiannis, I., Pavlopoulos, S., Koutsouris, D., "An integrated computer
supported acquisition, handling, and characterization system for pigmented
skin lesions in dermatological images," IEEE Transactions on Information
Technology in Biomedicine, vol. 9, 2005, pp. 86-98.
[8] Celebi, M.E., Kingravi, M.A., Uddin, B., Iyatomi, H., Aslandogan, Y.A.,
Stoecker, W.V., and Moss, R.H., "A methodological approach to the
classification of dermoscopy images," Computerized Medical Imaging and
Graphics, vol. 31, 2007, pp. 362-373.
[9] Rahman, M.M., Desai, B.C., Bhattacharya, P.: Image retrieval-based
decision support system for dermatoscopic images. In: IEEE Symposium
on Computer-Based Medical Systems, Los Alamitos, CA, USA, IEEE
Computer Society (2006) 285-290
[10] Ohta, Y.I., Kanade, T., Sakai, T.: Color information for region
segmentation. Computer Graphics and Image Processing 13(1) (July
1980) 222-241
[11] Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image
classification. IEEE Transactions on Systems, Man and Cybernetics 3(6)
(1973) 610-621
[12] Unser, M.: Sum and difference histograms for texture classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence 8(1) (January
1986) 118-125
[13] Ballerini, L., Li, X., Fisher, R., Aldridge, B., & Rees, J. (2010).
Content-Based Image Retrieval of Skin Lesions by Evolutionary Feature
Synthesis. Applications of Evolutionary Computation, 312–319. Springer.
Retrieved from
http://www.springerlink.com/index/X0V1425K97G70578.pdf.
[14] Aurnhammer, M., "Evolving Features by Genetic Programming," in Applications
of Evolutionary Computing: EvoWorkshops 2007 (EvoCOMNET, EvoFIN, EvoIASP,
EvoINTERACTION, EvoMUSART, EvoSTOC and EvoTRANSLOG), Valencia, Spain,
April 11-13, 2007, Proceedings.
[15] B. Lam and V. Ciesielski, "Discovery of human-competitive image texture
feature extraction programs using genetic programming," In: GECCO (2).
Volume 3103 of Lecture Notes in Computer Science, 2004, pp. 1114-1125.
[16] J.R. Sherrah, R.E. Bogner, and A. Bouzerdoum, "The Evolutionary
Pre-Processor: Automatic Feature Extraction for Supervised Classification
using Genetic Programming," Evolutionary Computation, 1996.
[17] C. Huang and C. Wang, "A GA-based feature selection and parameters
optimization for support vector machines," Expert Systems with
Applications, vol. 31, 2006, pp. 231-240.
[18] H. Frohlich, O. Chapelle, and B. Scholkopf, "Feature selection for support
vector machines by means of genetic algorithm," Proceedings. 15th IEEE
International Conference on Tools with Artificial Intelligence, 2002, pp.
142-148.
[19] R. Poli, W.B. Langdon, and N.F. McPhee, "A Field Guide to Genetic
Programming," 2008.
[20] J.R. Koza and M.J. Hall, "Genetic Programming: A Paradigm for Genetically
Breeding Populations of Computer Programs to Solve Problems," Technical
Report STAN-CS-90-1314, Stanford University, 1990.
[21] Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine
Learning. Addison-Wesley, Reading, MA (1989)
[22] Liu, H., Motoda, H., Computational methods of feature selection, Chapman
& Hall/CRC, 2008.
[23] Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. (eds.), Feature Extraction:
Foundations and Applications, Springer, 2006.
[24] J. Han and M. Kamber, Data mining: concepts and techniques, Morgan
Kaufmann, 2006.
[25] R.G. Brereton and G.R. Lloyd, "Support vector machines for classification
and regression," The Analyst, vol. 135, 2010, pp. 230-267.
[26] Silva, Sara. GPLAB - A Genetic Programming Toolbox for MATLAB. Software
available at http://gplab.sourceforge.net
[27] Guyon, I. and Elisseeff, A., "An Introduction to Variable and Feature
Selection," Journal of Machine Learning Research, vol. 3, 2003, pp.
1157-1182.
[28] Vapnik, V. Estimation of dependencies based on empirical data. Springer
series in statistics. Springer, 1982.
[29] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support
vector machines, 2001. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
[30] Hsu, C.-W., Chang, C.-C., and Lin, C.-J., "A Practical Guide to Support
Vector Classification," Technical report, Department of Computer Science,
National Taiwan University, 2010.