Genetic Programming for the
Automatic Construction of Features in
Skin-Lesion Image Classification
Jonathan Streater
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2010
Abstract
This dissertation describes the design and implementation of a genetic programming
system which automatically constructs feature equations for the classification of skin
lesion images as a part of a real-world dermatological image retrieval system. It uses
generalized co-occurrence matrices (GCMs) and standard mathematical functions,
combined stochastically and evaluated using the feature selection techniques of Fisher's
discriminant ratio and the classification accuracy of either a Bayes classifier or support
vector machine. It deals with the notion of GP closure with 'shell' functions and is
able to arbitrarily combine information from different color channels, both unique
designs compared with similar GP systems. Further, it can evolve features iteratively to
complement each other. The implementation here is able to create small numbers of
features which are able to classify better than most of the traditional set of Haralick
features, even when the Haralick features are created with a greater number of GCM
parameters. However, the system developed here does exhibit two notable problems
for future work. The run-time is notably long and the amount of data collected
in-house is not yet great enough to significantly measure the ability of the system to
generalize. However, these problems are fixable and the work described has resulted
in a system which aids classification relatively well and, just as importantly, shows
much potential.
Acknowledgements
Many thanks to my supervisor, Lucia Ballerini, who provided invaluable guidance.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my
own except where explicitly stated otherwise in the text, and that this work has not been
submitted for any other degree or professional qualification except as specified.
(Jonathan Streater)
Table of Contents
Chapter 1. Introduction
Chapter 2. Related Work
2.1 Query-By-Example Content Based Image Retrieval System
2.1.1 The System
2.1.2 Color Features
2.1.3 Texture Features
2.1.4 Feature Selection and Retrieval
2.1.5 Synthesized Features
2.2 Genetic Programming for Feature Construction
Chapter 3. Conceptual Background
3.1 Haralick Features
3.2 Genetic Programming
3.2.1 The GP Algorithm
3.2.2 Representation, Terminals, and Functions
3.2.3 Initialization
3.2.4 Selection and Reproduction
3.2.5 Fitness and Test Cases
3.2.6 Closure and Sufficiency
3.3 Machine Learning with Feature Selection
3.3.1 The Machine Learning Problem
3.3.2 Feature Selection
3.3.3 Feature Relationships
3.3.4 Fisher's Discriminant Ratio
3.3.5 Naïve Bayes Classifier
3.3.6 Support Vector Machine
3.3.7 Leave-One-Out Cross-Validation
Chapter 4. Design and Implementation
4.1 Motivation and Overall Design
4.2 Basic GP Implementation
4.3 Training Data
4.4 Representation of Individuals
4.5 Fitness
4.6 Iterative Genetic Programming
4.7 Parameters
Chapter 5. Results and Analysis
5.1 Exploration of Population, Generation, and Depth
5.2 Pooled FDR Features Compared to Haralick Features
5.3 Wrapper Fitness and Iterative GP
Chapter 6. Conclusions
6.1 Future Work
6.2 Conclusions
Bibliography
Chapter 1. Introduction
This project is focused on the problem of using Genetic Programming and the relevant
machine learning techniques, such as a Bayes Classifier and Support Vector Machine,
to automatically construct and select features to best aid classification, both in
efficiency and performance, of skin lesion images. It is part of a larger project to build
and enhance a query-by-example content-based image retrieval system which can
return skin lesion images based on similarity to a given query image [1]. The entirety
of work here is based on attempting to improve the ability of this specific system to
classify these types of images correctly and, in so doing, to improve the chances of
success for a potentially educational and/or commercial dermatological system. Thus,
this is both an exploration of machine learning concepts which may be able to enhance
this system and a practical implementation of these concepts in a specific and real
engineering problem.
The work described in what follows was initially envisioned as a search for and
analysis of meta-heuristic algorithms, such as Genetic Algorithms and Ant Colony
Optimization, to select features for the image retrieval system. Before this
dissertation had even begun, work on the image retrieval system had already
constructed over 17,000 possible features to use in classification and was still
generating more. Therefore, the combinatorial optimization problem of choosing the
best set of features which could make classification accuracy the highest while at the
same time allowing the classifications to occur in a reasonable amount of time became
an imperative. However, as research for this project developed, it became evident that
while algorithms like Genetic Algorithms could very well be effective search strategies
for feature selection, search wasn't a primary concern of many of the most successful
feature selection strategies. This is demonstrated, for example, in a competition in
2003 in which dozens of research teams competed to see who could attain the best
feature selection and classification results [2]. All of the contestants focused not on
search strategies, often using the most basic greedy forward selection algorithms, but
on methods more directly related to measuring, analyzing, and processing features, and
classifying using these. This, combined with the scant but successful work that
has already begun in the area of genetic programming for feature construction in image
classification, was the prime motivation for the shift in the project's focus. The
hope is that genetic programming, in conjunction with statistical machine learning
tools, can simultaneously construct and select features which provide quicker and more
accurate results.
By this benchmark, the work here was able to largely accomplish its goal. The project
reports the results of a GP implementation which utilizes a filter and two wrapper
feature selection techniques for constructing feature equations. Further, it attempts to
use these to iteratively build complementary features. What it means for a feature to
be complementary and for a feature selection technique to be able or not able to find
complementary features will be explained in chapter 3. But even the most basic
methods, building features independently, were able to perform at least as well as the
standard feature equations in the domain chosen for constructing features, Haralick
texture features. Further, results indicate that, given more time and resources, there is
the possibility of increasing performance and further exploring possible techniques for
additional implementations. Importantly, given the time, data, and computing
limitations of this project, it was evident that the GP implementation had not yet
reached its full potential in generating solutions. All that would be required would be
to run it for longer and with larger populations. It is also likely that exploring a greater
complexity in the image data input, genetic programming techniques, and machine
learning tools would generate improvements as well. The limitations of training data
and computational resources available to the project were noteworthy and will be
addressed.
Chapter 2 is an exposition of directly relevant background material to this project,
including an account of work on the query-by-example content based image retrieval
system which the work of this dissertation is for. This relevant background work is
presented first to set the stage for the work done here. However, if some of the
concepts listed in it are completely unfamiliar it might be helpful to take a look at some
of the concepts described in chapter 3 first. Chapter 3 is focused on explaining
background theory for the tools used in this project from the three fields of digital
image analysis, genetic programming, and machine learning with feature selection.
Chapter 4 explains the implementation of work completed, including the tools
incorporated from other works such as GPlab, and justifies design decisions. Chapter
5 details the experiments run, lists results, and analyses them along with the problems
encountered. Finally, chapter 6 explores possibilities for future work and concludes.
Chapter 2. Related Work
This chapter introduces and explains the directly relevant work which precedes this
project. These works include a close look at the image retrieval system that this
project is a part of and at research which also combines genetic programming with
feature selection for image classification. Though there has been little work in
applying genetic programming to feature construction in the domain of texture features,
the two works cited here form a basis for comparison.
2.1 Query-By-Example Content Based Image Retrieval System
2.1.1 The System
The feature construction work described in this project is for a query-by-example
content-based image retrieval system of non-melanoma skin lesions [1]. Though in
general there are many new content-based image-retrieval systems [3], and many
within the medical domain [4], most of these are based on radiological images.
Within dermatology, computer vision systems have focused on techniques for
segmentation, feature extraction [5], and classification [6-8], often for cancer detection
and especially for melanoma. Though melanoma is a very dangerous kind of cancer,
other types are much more prevalent. The query-by-example CBIR system which this
project is a part of is the first of its kind to cover the five classes of skin lesions: Actinic
Keratosis (AK), Basal Cell Carcinoma (BCC), Melanocytic Nevus/Mole (ML),
Squamous Cell Carcinoma (SCC), and Seborrhoeic Keratosis (SK). It uses color
images digitally captured and segmented by project-affiliated team members.
It is hoped that this tool can be useful for dermatologists as well as non-expert users
with its ability to retrieve images based on similarity to a query image, allowing the
perusal of large databases of skin lesions based not only on diagnosis but on similar
visual attributes, and so be a decision-support as well as educational tool. Thus the
central goal of this project is to construct and select features of skin lesion images so
that the system may best classify and retrieve relevant images, hopefully significantly
improving the effectiveness of the system as a whole. These features are extracted
from the skin lesion images so that they can later be used to compute the 'similarity' of
the images without using an entire digital image as input to a classifier. This similarity
score can then be used to retrieve images which are near the given query image. The
entire success of the system relies on providing the retrieval system with features which
are able to discriminate as uniquely as possible between the different classes of
pathologies and having a classifier or similarity metric which can most effectively take
advantage of these features and separate them out correctly [1].
For the original project, features used in the system are taken from color and texture.
In general there is a vast quantity of possible methods and combinations of methods,
hand-crafted and empirically compared, for generating many different features. The
idea is to sort and choose some features among the many extracted for their ability to
aid classification of images. Unfortunately, this method for feature selection leads to a
complicated and time-consuming combinatorial optimization problem about which
features to use and it can lead to feature vectors of large size. The genetic
programming approach in this project is an attempt to perform this search while at the
same time constructing novel features which are able to combine image inputs in new
ways, not constrained by human intuition. Further, it's hoped that features can be built
with respect to other, already evolved features and so be effective in smaller subsets.
The next two sections are an attempt to show how thousands of features are extracted
for selection from images and to set the stage for features which will be relevant to GP
feature construction.
2.1.2 Color Features
In the original project, the color of a lesion is represented by its mean color and its
covariance matrix, so that:

$$\mu_X = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Here, $N$ is the number of pixels in the lesion and $x_i$ is the color component of channel
$X$ ($X, Y \in \{R, G, B\}$) of pixel $i$. And so, using the RGB color space, the covariance
matrix is:

$$\Sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}\,(x_i - \mu_X)(y_i - \mu_Y)$$
Features were constructed using the RGB, HSV, CIE_Lab, CIE_Lch, Munsell color
coordinate system [9], and Otha [10] color spaces. The colors components were
normalized by the average of the same component of the safe, non-lesion skin of the
same patient [1].
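As a concrete illustration, these statistics are inexpensive to compute. The sketch below is illustrative Matlab (the language of the larger project), assuming pixels is an N-by-3 matrix holding the RGB values of the N lesion pixels:

    % Per-channel mean color and 3x3 covariance of the lesion pixels.
    meanColor = mean(pixels, 1);    % 1-by-3 vector of channel means
    covColor  = cov(pixels, 1);     % covariance normalized by N, as above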
2.1.3 Texture Features
Texture features are taken from generalized co-occurrence matrices where a
co-occurrence matrix is a matrix which is taken over an image to be the distribution of
co-occurring values at some offset. For the original project, distances of one to six and
orientations of 0°, 45°, 90°, and 135° are used. Much of this statistical feature analysis is
founded on the popular and successful Haralick feature equations which are used on
these co-occurrence matrices [11]. The background for this will be detailed in chapter
3.
Generalized co-occurrence matrices are generated from images coded with n color
channels. For example with an image in RGB, there are six co-occurrence matrices
(RR), (GG), (BB), (RG), (RB), and (GB). For orientation invariance, the matrices are
averaged with respect to θ, and several quantization levels are used for
the color spaces of RGB, HSV, and CIE_Lab [1]. From each generalized
co-occurrence matrix, 12 Haralick features are extracted including energy, contrast,
correlation, entropy, homogeneity, inverse difference, cluster shade, cluster
prominence, max probability, autocorrelation, dissimilarity, and variance [11]. All of
the combinations of these features, inter-pixel distances, color pairs, color spaces, and
grey level quantizations result in 3888 texture features. The total number of texture features is
brought up to 9,720 by also extracting from the sum and difference histograms, varying
displacement, orientation, quantization level, and color spaces, and using the features
sum mean, sum variance, sum energy, sum entropy, diff mean, diff variance, diff
energy, diff entropy, cluster shade, cluster prominence, contrast, homogeneity,
correlation, angular second moment, and entropy [12]. The actual feature equations
behind the names can be viewed in detail in the works cited.
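To make the named statistics concrete, the following Matlab sketch computes three of them; P here is an assumed variable holding one normalized co-occurrence matrix (see chapter 3 for how such matrices are built):

    % Illustrative versions of three of the Haralick statistics listed above.
    [j, i] = meshgrid(1:size(P,2), 1:size(P,1));    % grey-level index grids
    energy   = sum(P(:).^2);                        % angular second moment
    contrast = sum(sum((i - j).^2 .* P));           % local intensity variation
    entrop   = -sum(P(P > 0) .* log(P(P > 0)));     % randomness of the texture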
2.1.4 Feature Selection and Retrieval
In order to choose the optimal set of these thousands of features, a greedy forward
selection algorithm and a genetic algorithm are used. Both algorithms maximize the
number of correctly retrieved images but the GA had superior results. Note that skin
lesion images are taken on a Canon EOS 350D SLR with a resolution of about 0.03mm
and then manually segmented. Ground truth is provided by the medical co-authors
and retrieval resulted in a precision of 59-63% using the Bhattacharyya distance metric
and the Euclidean distance for retrieval [1].
2.1.5 Synthesized Features
Synthesized features, created using a genetic algorithm, improved the performance of
the system by 8% [13]. The GA used already created features, for example the texture
features listed in 2.1.3, for the synthesis of new composite features. It used several
genetic operators to combine the old features in novel ways by representing them as
strings and then using index numbers to indicate the features and operators used to
combine the features. These operators were: taking only one of the features
considered, or adding, subtracting, multiplying, or dividing them. The fact that this
very simple GA was able to increase the performance of the system is a major
motivation for this project. One suggestion for further study from this work [1] was to
pursue the development of a large number of operators which can combine an arbitrary
number of features. GP is also mentioned as a possible avenue which may be more
powerful than GA. For example, it pointed to [14] which used GP to automatically
generate features to improve classification performance (up to 87%) and to drastically
reduce the complexity of feature vectors (from 78 features to 4).
2.2 Genetic Programming for Feature Construction
It might be concluded from the above methodology for feature extraction and selection
(where just about every method is tried and then the best are taken from the bag by a
feature selection algorithm) that there doesn't seem to be any designated set of “right”
features. In fact, this is a key motivation of Aurnhammer [14]. That work attempts
to use Genetic Programming to avoid having to hand-craft features for the
given task at hand, and instead to make a system which can generalize automatically
without a great deal of time spent constructing, extracting, and selecting features. Part of this is the
combinatorial optimization problem that is generated by having so many features to
select from [1]. The hope is that, instead of relying on human intuition and analyses
to engineer features, Genetic Programming can automatically generate a reasonably
small yet effective number of features by combining the basic building blocks which
are typically used to construct features, and evaluating their fitness automatically.
Using co-occurrence matrices, typical mathematical functions used in Haralick
features, and a fitness function combining Fisher's Discriminant Ratio with a simple
minimum distance classifier, Aurnhammer improves from a Haralick feature
classification accuracy of 67% to a Genetic Programming generated feature accuracy
of 87% using only 4 features. It should be noted that this genetic programming
algorithm is iterative in that it evolves a best feature and then evolves another feature
with respect to the feature(s) already evolved. Thus new features are generated which
classify well with the already generated features. This is an idea which relates to the
complexity of how features relate to each other and how they relate to the machine
learning problem at hand and will be discussed in chapter 3.
The GP features were created using the Open Beagle GP framework for C++, the Intel
OpenCV library for image processing, and the publicly available real-world
photography database VisTex. Therefore it seems that GP might be able, if adapted
and applied to our skin lesion classification problem, to generate a small number of
effective features for classification. This approach has the potential to generate new
and effective features that discriminate well between pathologies while at the same
time producing a manageably small set of features that work well together. It should
be noted, however, that while this implementation is directly motivational for this
dissertation, it is based on a very different database and is primarily aimed at being a
demonstration of evolving feature equations. This is to be contrasted with the image
retrieval system that this project is based on, which addresses a real and practical
engineering problem. Further, the combination of feature selection methods used here
is different, and as will be discussed, the implementation for this project deals with
the GP property of closure and with color channels differently. The GP
implementation and results will be further contrasted with this work in chapter 5.
Using genetic programming for the construction and selection of features for the
classification of images, especially with respect to texture features, is a relatively
unexplored area. Other than [14], the only other directly related research found was
[15]. But again, this is a demonstration of the success which might feasibly be
attained with this technique rather than a practical application. It too uses the VisTex database
and attains performance comparable to the Haralick features, and somewhat better performance
by combining the evolved features with the Haralick features. It uses the K-means
clustering algorithm for fitness.
Other work less directly related but worth mentioning includes work using a GP to
reduce dimensionality of input features for a classifier using a generalized linear
machine, k-nearest neighbors, or maximum likelihood classifier [16] and work using a
GA to simultaneously select features and optimize a support vector machine [17] [18].
The latter supplied guidance for the SVM wrapper approach described in chapter 4
where the parameters of the SVM are evolved with individuals in the population.
In that work, however, the algorithm is only doing feature selection, whereas here
feature construction and selection are occurring together.
Chapter 3. Conceptual Background
The following covers the three main domains which converge in the work for this
project. These domains are: features for digital image analysis, genetic programming,
and machine learning with feature selection.
3.1 Haralick Features
For some time now it has been possible to digitally process all sorts of image data, and
so it has become imperative to come up with effective ways to deal with these
complicated two-dimensional arrays of information. For this problem, Haralick [11] is
concerned with the creation of texture features for the classification of images. The
features established in that work have remained some of the most popular features for
these purposes largely because of their simplicity and effectiveness. As it is noted in
the original paper, categorizing image data is very difficult precisely because one needs
to deal with such large blocks of cells. However, once features are defined for the
large blocks of image data, reducing the incredible dimensionality of the problem, they
can be used in any number of pattern-recognition techniques.
To solve the problem of how to construct such features, Haralick uses the idea of texture.
This is a cue used by human beings, and the idea is to also use it as the
foundation for feature construction methods for use by digital computers. Texture is a
property of all surfaces and contains important information about the surface, its
makeup, and its relationship to the surroundings. Specifically, it is about the spatial
distribution of gray tones and can be evaluated, for example, as on the one hand fine,
coarse, or smooth and on the other hand, rippled, lineated, or irregular [11].
The resulting procedure for calculating texture features from blocks of image data
relies on the assumption that texture information is based on the average spatial
relationship which gray tones in an image have with each other. Thus a set of
“gray-tone spatial-dependence probability-distribution matrices” are assumed to
adequately represent this average of textural information. They are computed for
various orientations and distances between neighboring cell pairs in the image, and
then plugged into feature equations to produce feature values. The resulting features,
focused on macroscopic notions of texture rather than picking out any specific classes
of texture, contain information such as the homogeneity, contrast, boundaries,
dependencies, and complexity in an image [11].
As formulated in the original paper, “such matrices of gray-tone spatial dependence
frequencies are a function of the angular relationship between the neighboring
resolution cells as well as a function of the distance between them.” So for an image $I$
which has $N_x$ columns, $N_y$ rows, and $N_g$ grey levels, the co-occurrence matrix is
of dimension $N_g \times N_g$, where the value in position $(i, j)$ is determined by the number of
co-occurrences of the grey levels $i$ and $j$ which are an inter-pixel distance $d$ and
orientation $\theta$ apart [11]:

$$P(i, j \mid d, \theta) = \#\{\,((k,l),(m,n)) : I(k,l) = i,\; I(m,n) = j,\; (m,n) \text{ at distance } d \text{ and angle } \theta \text{ from } (k,l)\,\}$$
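The following Matlab sketch implements this definition for a single offset. It is illustrative only: I is assumed to hold grey levels in 1..Ng, and the (dr, dc) offset corresponds to one (d, θ) pair, e.g. d = 1, θ = 0° gives (0, 1):

    % Minimal grey-level co-occurrence matrix for one offset.
    function P = glcm(I, Ng, dr, dc)
        P = zeros(Ng);
        [rows, cols] = size(I);
        for r = 1:rows
            for c = 1:cols
                r2 = r + dr;  c2 = c + dc;
                if r2 >= 1 && r2 <= rows && c2 >= 1 && c2 <= cols
                    P(I(r,c), I(r2,c2)) = P(I(r,c), I(r2,c2)) + 1;
                end
            end
        end
        P = P + P';          % count both orderings, making P symmetric [11]
        P = P / sum(P(:));   % normalize counts to co-occurrence probabilities
    end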
Note that, as mentioned in chapter 2, a generalized co-occurrence matrix (GCM) is
simply the idea of the gray-level co-occurrence matrix adapted to a color space instead of
using only grey levels. All of the 14 Haralick feature equations suggested in [11] are
based on the use of these co-occurrence matrices. GCMs in the RGB color space will
be used with Haralick equations computed using several orientation and inter-pixel distance
settings for comparison in chapter 5.
Haralick features are, as mentioned, an attempt to home in on information about
homogeneity, contrast, organized structure, complexity, and transitions. Their
equations are described exactly in the appendix of [11]. They are a large part of the
many features generated in [1], forming the basis of the texture features there. These
generalized co-occurrence matrices will also be the foundation of features generated by
genetic programming in this dissertation.
Figure 1 – Taken from [11], a demonstration of GLCM calculations. (a) gray-tone values for a
4x4 image (b) the general form of any gray-tone spatial dependence matrix with gray tone
values 0-3. #(i,j) stands for number of times gray tones i and j have been neighbors. (c-f)
calculation for all 4 distance 1 gray-tone spatial-dependence matrices.
3.2 Genetic Programming
Genetic Programming is a technique for the automatic and systematic solving of
problems by means of algorithmic evolution, regardless of domain [19] [20]. This is
why it is a good candidate for the solving of the feature selection and construction
problem for image classification which often suffers from domain specificity and
bloated numbers of features. Just as in many domains it has matched or exceeded
human intuition and engineering, even creating patentable solutions in some cases, it has
the potential to improve on the classification performance of the hand-crafted Haralick
features and features like them. This was demonstrated in principle in the two works cited
in the previous chapter, and the hope is to expand this method to work in the
specific engineering problem associated with the skin lesion image retrieval system.
3.2.1 The GP Algorithm
The basic idea of Genetic Programming is to stochastically create and then transform a
population of programs, a generation at a time, into better and better solutions for the
problem at hand. This is done by, at every generation, evaluating the fitness of the
population and then using genetic operators to push and build the changing population
of programs towards better and better solutions. The idea is that we don't know
exactly how to make these good solutions but we do have methods for judging and
measuring the goodness of solutions that the stochastic evolutionary process produces.
The principle of survival of the fittest and the genetic operators ensure that when good
solutions are found, they propagate in the population and eventually are even built on
top of to produce better solutions down the road. What the 'programs' being
evolved are just depends on the domain. For feature construction, we are evolving
feature extraction equations.
Figure 2 – Taken from [18], the general outline of the genetic programming algorithm.
The Genetic Programming algorithm is basically: randomly generate a population of
valid programs, execute each program and measure its fitness, select some individuals
with a probability based on fitness scores to participate in genetic operations, and apply
the operators to the individuals to create a new generation of individuals. This process
begins again with the calculation of the fitness scores of the new population of
solutions and is repeated until an optimal solution is found or some other stopping
condition is met. The end result is that the best individual found is returned. In order
to accomplish this whole process, methods need to be established for the representation
of programs, the evaluation of fitness, the selection of individuals from the population,
and the execution of genetic operators.
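The loop below is a high-level Matlab sketch of this process. The helper functions (initPopulation, fitness, select, crossover, mutate) are hypothetical stand-ins for whatever a concrete GP system such as GPlab provides, and an even population size is assumed:

    pop = initPopulation(popSize, maxDepth);      % random valid programs
    for gen = 1:numGenerations
        scores = cellfun(@fitness, pop);          % execute and score each one
        newPop = cell(1, popSize);
        for k = 1:2:popSize
            p1 = select(pop, scores);             % fitness-based selection
            p2 = select(pop, scores);
            [c1, c2] = crossover(p1, p2);         % swap random subtrees
            newPop{k}   = mutate(c1);
            newPop{k+1} = mutate(c2);
        end
        pop = newPop;
    end
    scores = cellfun(@fitness, pop);
    [~, bestIdx] = max(scores);                   % return the best individual
    best = pop{bestIdx};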
3.2.2 Representation, Terminals, and Functions
Programs are often visualized as trees rather than as lines of code. In figure 3, it is
easy to see that variables and constants in the tree, known as terminals, take their places
as leaves of the tree while the operations in the program, known as functions, take their
places everywhere above the terminals, up to the top [19]. In the case of the picture
below, the terminal set is made up of x, 3, and y. The function set is made up of max,
+, and *.
These programs are often represented in the genetic program in prefix notation, such as
max(plus(x,x),plus(x,times(3,y))), because this makes it easier to manipulate and to
visualize the branches of the tree and the relationships of the functions in it. The first
computer language used to implement Genetic Programming was LISP because it
represents operations in this way and because its dynamic lists and automatic garbage
collection make it much easier to implement and manipulate a population of programs
[18]. As we shall see, however, other computer languages associated with AI research
or scientific computing, such as MATLAB, also have many of these same capabilities.
It is possible to implement the population in prefix notation or more explicitly in a tree
data structure.
Figure 3 – Taken from [18], graphical
representation of the tree of an individual
in a genetic program.
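As a toy illustration of this representation, Matlab can hold an individual as a string of nested prefix calls (plus, times, and max are built-in functions) and execute it directly; this is roughly the mechanism that string-based GP systems rely on, though their actual machinery is more elaborate:

    x = 2;  y = 5;
    individual = 'max(plus(x,x), plus(x, times(3,y)))';
    value = eval(individual);   % max(2+2, 2+3*5) = 17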
3.2.3 Initialization
Once we have representations for individuals in the population, it is necessary to
initialize the first generation of the genetic program. There are three predominant
methods for this named grow, full, and ramped-half-and-half. The full method
generates trees by randomly selecting from the function set until every branch of the
tree has reached a pre-specified depth. Here, terminals are installed. This creates a
full tree with branches all to the same depth, though trees still might have different
sizes (numbers of nodes) because some functions might take in different numbers of
operands (different arity). In order to increase the diversity of the population, the
grow method creates trees by randomly selecting from both the function and terminal
sets, granting the possibility that branches in the tree might be of different lengths.
Similar to the full method though, if a branch reaches some pre-specified depth, it is
capped with a random terminal. The third method is a combination of the full and
grow methods and is an attempt to further increase the diversity of the initial population.
In the ramped half-and-half method, half of the population is generated with the grow
method and half is generated with the full method. Further, this is done with a range
of depth limits [20].
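Both the grow and full methods reduce to the same recursive sketch, shown below in Matlab. randFunction and randTerminal are hypothetical helpers returning node structs drawn from the function and terminal sets:

    function node = randTree(depth, maxDepth, method)
        growLeaf = strcmp(method, 'grow') && rand < 0.5;  % leaves allowed early
        if depth == maxDepth || growLeaf
            node = randTerminal();                % cap the branch with a leaf
        else
            node = randFunction();                % internal operator node
            for a = 1:node.arity                  % recurse once per argument
                node.args{a} = randTree(depth + 1, maxDepth, method);
            end
        end
    end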
3.2.4 Selection and Reproduction
After creating the initial population and then evaluating its fitness, it is necessary to
probabilistically select individuals, based on fitness, for reproductive operations. Two
often-used methods for selection are fitness-proportionate selection and tournament
selection. For fitness-proportionate selection, the chances of being picked to
participate in reproduction are directly proportional to the individual's fitness
compared to all other individuals. That is, its chances of being selected are its fitness
divided by the sum of the fitnesses of all individuals [21]. This method provides very high
selection pressure as it is possible for an individual with an extremely high fitness
compared to the rest of the population to completely swamp the next generation. This
can be good if there is a need to take advantage of good solutions but it can also destroy
the diversity of the population, narrowing the possibility of finding new paths for
solutions. In tournament selection, a number of individuals are randomly chosen from
the population, compared with each other, and the best individual is selected. This
method has a lower selection pressure and allows a larger variety of individuals to be
selected for reproduction.
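Tournament selection is short enough to show in full; the Matlab sketch below assumes a cell array pop of individuals, a matching vector of scores in which higher is better, and a tournament size k:

    function winner = tournamentSelect(pop, scores, k)
        idx = randi(numel(pop), 1, k);    % k random entrants, with replacement
        [~, best] = max(scores(idx));     % compare them and keep the fittest
        winner = pop{idx(best)};
    end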
Once individuals have been selected, genetic operators are executed on them to make
new individuals to fill the next generation of programs. The most basic operators are
crossover and mutation. For crossover, two individuals are selected, a crossover node
is chosen randomly in the tree of each parent, and the branches below the two points are
swapped. The new individuals are placed into the new population rather than overwriting
the originals, so that the parents have the chance to be selected for reproduction again.
Often function nodes are given a greater chance of being chosen for swap points than
terminal nodes to prevent terminal nodes from always being chosen, and to prevent
such a small amount of genetic material from being exchanged in most crossovers.
For mutation, a single individual is selected, a random mutation point is chosen on the
individual, and everything below the point is replaced with a randomly generated tree.
It is also possible to simply replace the chosen node by a single function from the
function set which has the same arity [19]. Finally, reproduction can be executed
instead of crossover or mutation. In this case, an individual is selected from the
population and simply copied over to the next generation. The process of creating
new generations of programs, evaluating them, and creating still more, hopefully better
generations of programs, continues until some stopping condition. This can be, for
example, the execution of a preset number of generations.
3.2.5 Fitness and Test Cases
Finally, an implementation of Genetic Programming requires that an appropriate
fitness function, as well as the terminal and function sets, be chosen. For a
terminal set, this means gathering samples of data which can be plugged into the
evolved equations. These depend entirely on the particular problem that the Genetic
Program is being designed to solve. If a controller for a robot is being designed, it
would be prudent to include sensor inputs in the terminal sets and robot actions in the
function set. If functions for the construction of features for classification of images
are being designed, it might be prudent to include the relevant image data as terminals
and commonly used mathematical operations as functions. In this case, it would also
be necessary to choose a fitness function which could readily measure the goodness of
a given generated feature equation.
But before evaluating fitness, the input data, or test cases, are plugged into the
terminals of the given program and the program is executed. In the robot example, test
cases would include the data for sensors that the different terminal variables represent.
Once the test cases have been plugged in and the program executed, then the fitness can
be determined from the robot's behavior resulting from using the given program. If
the robot drives off a cliff, killing itself, a fitness of 0 might be awarded. On the other
hand if the robot successfully completes its given task, such as navigating a maze, then
some high score could be given. In the case of skin-lesion classification, test cases
could be GCMs produced by sample images. Here, knowledge of the true classes is
part of the test cases. The goal is to evolve some equation which can help connect the
sample input to the true class.
3.2.6 Closure and Sufficiency
Note finally that 'programs' generated by the genetic program must have the properties
of closure and sufficiency. Closure requires that any sub-tree must be able to be
processed by any function in the function set. This is because in the process of
evolution, nodes are joined arbitrarily and so any combination of them might be
generated. Closure also requires that no program that can be generated can be
crashed at run-time by a particular evaluation, for example, by trying to divide by zero.
Two ways to ensure closure are by requiring that every function takes any input and
produces as output the same type of data, and by using modified versions of functions
so that they will run no matter what numbers are given to them. For example, if the
function divide is given a zero for the denominator, it should return as output what it
received as input for the numerator [19]. The choices for fitness functions and
methods to ensure these properties will be addressed in the design and implementation
for this project in chapter 4.
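A protected division of the kind just described might look like the following Matlab sketch (illustrative, not GPlab's exact code):

    function out = protectedDivide(num, den)
        if den == 0
            out = num;        % return the numerator unchanged, as in [19]
        else
            out = num / den;
        end
    end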
3.3 Machine Learning with Feature Selection
This section addresses the related ideas of statistical machine learning and feature
selection and describes several specific learning methods which are used in this project.
As the ultimate goal of feature construction is classification, these methods are used for
evaluating individuals in the genetic program.
3.3.1 The Machine Learning Problem
Machine learning tasks are defined by problems which are established and solved by a
series of examples rather than predefined rules. Often this means taking many
training examples along with their associated correct answers and having the learning
machine decipher the underlying rules which connect them. For example, this could
be the case of showing a learning machine many training examples of email messages
and having it learn to classify the email as spam or ham. Or this could be the case of
showing a learning machine many training examples of skin lesion images and having
it learn to classify the images into one of five skin lesion diagnoses. These are both
cases of supervised learning of a classification problem where the correct answer is
used to teach the machine to categorize data into one of N categories. This type of
learning machine will be the backdrop for what follows, as this is the type used for
this project.
The inputs into the learning machine are called features. If one were training a
machine to learn how to diagnose the illness of a patient, the features could be the
various symptoms and characteristics of the patient such as her current temperature or
age. These should contain information which allows the learning machine to separate
out the true classes (in this example the true illnesses). As more and more information
becomes available, however, it becomes more and more complicated for the learning
machine to correctly learn the underlying signal from training examples. Thus feature
selection is a possible way to aid the learning machine's job. This is the problem of
choosing the best set of available features, while feature construction is the problem of
modifying available raw data into a useful form for learning [22] [23].
3.3.2 Feature Selection
The principle motivations for doing feature selection and construction are to increase
classifying performance, save resources, and better understand the data. A large
reason that these goals are important but hard to attain, however, is the “curse of
dimensionality.” That is, two points which are close together in two-dimensional space
are likely far apart in 200-dimensional space. Because the number of input features is
central to deciding the space of all possible solutions, as the number of features
increases so does the space of hypotheses, thus making the learning problem that much
more complex and difficult. A linear increase in the number of features causes an
exponential increase in the hypothesis space. And the more complicated the learning
problem is, the greater the volume of training data that is required. Therefore, good
feature selection and construction can make the learning problem easier by eliminating
redundant or irrelevant features and thus enhancing performance and saving resources
[22].
According to [23] there are three dimensions to feature selection: search, evaluation
criterion definition, and evaluation criterion estimation. Search is the method by
which subsets of a larger whole of features are sifted through. Often an exhaustive
search of every possible subset size and subset combination is computationally
intractable, slow, and in any case may lead to over-fitting problems where the learning
machine is able to fit training examples very well but not generalize to new examples
effectively. Search methods include forward selection, backward elimination, and
genetic algorithms. Evaluation criterion definition is the means by which to judge the
goodness of features and evaluation criterion estimation is the means of estimating the
goodness of features given the amount of data and the evaluation criterion.
Two broad types of feature selection methods are termed filters and wrappers (figure 4).
The defining difference between the two is the evaluation criterion definition.
Wrappers judge features based on their performance with a learning machine, for
example the classification accuracy of a bayes classifier, support vector machine, or
neural network, and filters are basically any ranking function which does not use the
performance of a learning machine. One approach for filters is to use them to rank
individual features. This can be especially useful when there are massive numbers of
features (e.g. 10,000) and relatively few training examples (e.g. 100) [23]. They are
also often much cheaper computationally to run than wrappers. Filters for ranking
individual features can be tricky, however, because they don't necessarily give
information about how different features work together.
Figure 4 – Taken from [22], a pictorial explanation of the makeup of the two broad categories
of feature selection methods: filters and wrappers.
3.3.3 Feature Relationships
In fact, a feature may be useless by itself but become effective when paired with a
certain other feature. Or, a feature may be effective by itself but provide absolutely
zero additional information when paired with another. Fisher's Discriminant Ratio is
an example of a univariate feature selection scoring method which only reveals
information about one feature alone. A ranking index based on the Relief Algorithm
is an example of a multivariate method which reveals the relevance of multiple features
together. To be clear, the terms univariate and multivariate in the context of feature
selection refer to the ability of the particular method at hand to give information about
feature relationships. Obviously, it is possible to compute FDR for multiple
dimensions. But while the resulting FDR scores will allow the ranking of features and
so allow something like cheap dimensionality reduction, the scores do not give
information about how features may act together. For example, if there are four
features ranked by FDR, it is possible to select the top two features in an effort to
reduce the dimensions. Often this is reasonable when there are very many features.
But it may be that features ranked 1 and 3 by FDR will for some reason complement
each other and allow a classifier to perform better than features ranked 1 and 2. Thus
if you want to be more certain that you are taking the best two features, either a
multivariate feature selection method or a wrapper method which uses classification
accuracy as the score should be used. These will score how well the 2 given features
will work together. Note that though FDR can be combined into one score for many
features, it is still not giving a score which reveals information about how the given
features may complement each other.
The pictures below are a two-dimensional example of feature relevance. The left picture shows an
example where one of the features is very effective at separating the classes and the
other, x2, is irrelevant. The right picture, however, shows a situation where both
features are required to effectively separate the two classes. For a much more in depth
and formal discussion of feature relevance, see [23].
Figure 5 – Taken from [22], a demonstration of two possible cases of feature relationships.
On the left, projection x2 is uninformative and could be discarded without a loss of
information with respect to the classes. On the right, both projections are informative and
needed to define the classes. A univariate filter, such as FDR, is not necessarily able to give
good information about cases where there are redundant or irrelevant features.
3.3.4 Fisher’s Discriminant Ratio
With the general idea of feature selection and feature relationships explained, consider
several methods of feature selection used in this project. A prominent univariate
feature ranking measure is Fisher's Discriminant Ratio, or Fisher's criterion. It is
the ratio of between-class variance to within-class variance. Roughly, the more
tightly clustered the classes are and the more separated they are from each
other, the higher the FDR score will be [23].
$$S_B = \sum_{k=1}^{K} N_k\,(\mu_k - \mu)(\mu_k - \mu)^{T} \qquad (4)$$

$$S_W = \sum_{k=1}^{K} \sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^{T} \qquad (5)$$

where there are $N$ column vectors, $K$ classes $\{C_1, \ldots, C_K\}$, and the mean of class $k$,
$\mu_k$, contains $N_k$ members. The mean of the data set is $\mu$ and finally the FDR is:

$$\mathrm{FDR} = \frac{\operatorname{trace}(S_B)}{\operatorname{trace}(S_W)}.$$
Again, the value of FDR is that it is often cheaper computationally to run and might
help avoid over-fitting problems with few examples. However, it is a univariate
scoring method and so only gives the score with respect to a given feature by itself.
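For a single feature, the scatter definitions in (4) and (5) collapse to scalars, as in this Matlab sketch; x is a vector of one feature's values and labels holds each sample's class:

    function j = fdr1d(x, labels)
        classes = unique(labels);
        mu = mean(x);
        sb = 0;  sw = 0;
        for k = 1:numel(classes)
            xk = x(labels == classes(k));
            sb = sb + numel(xk) * (mean(xk) - mu)^2;   % between-class scatter
            sw = sw + sum((xk - mean(xk)).^2);         % within-class scatter
        end
        j = sb / sw;
    end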
3.3.5 Naïve Bayes Classifier
The naïve Bayes classifier is a classifier based both on Bayes' theorem and a strong
independence assumption. In words, Bayes' theorem is basically the prior probability
times the likelihood divided by the evidence. However, because the
evidence is the same for every class and the prior is typically taken to be the same for
every class, we only need to compute the likelihood. Further, we make the
strong assumption of class-conditional independence. That is, the values of attributes
are independent of each other given the class label. This vastly simplifies the
calculations needed. So to calculate the posterior probability for a sample given the
classes, we have:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

where $X = (x_1, \ldots, x_n)$ is a vector of $n$ features and $C_i$ is the $i$th class. If feature values are
continuous, often a normal distribution is assumed for them. To calculate the
posterior probability of a sample for each class, the mean and variance of the features
for every class is computed from the training data, plugged into the normal distribution
along with the feature values for the sample, and multiplied together as the equation
above shows. This is computed for every class. Finally, to classify the sample, the
class is chosen which has the highest posterior probability [24].
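The whole procedure fits in a few lines of Matlab. In this sketch Xtrain is a samples-by-features matrix, ytrain its class labels, and x a single sample to classify; the Gaussian densities are written out directly:

    function c = naiveBayesPredict(Xtrain, ytrain, x)
        classes = unique(ytrain);
        post = zeros(size(classes));
        for k = 1:numel(classes)
            Xk = Xtrain(ytrain == classes(k), :);
            mu = mean(Xk, 1);
            sd = std(Xk, 0, 1);
            % class-conditional independence: multiply the per-feature
            % Gaussian likelihoods evaluated at the sample's feature values
            post(k) = prod(exp(-(x - mu).^2 ./ (2*sd.^2)) ./ (sqrt(2*pi)*sd));
        end
        [~, best] = max(post);    % pick the class with the highest posterior
        c = classes(best);
    end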
3.3.6 Support Vector Machine
Relatively speaking, Support Vector Machines are a fairly new method for
classification of linear and nonlinear data [24] [25]. This class of learning machine
has garnered a lot of attention because of its high accuracy as well as its greater
resistance to problems such as over-fitting or local minima which, for example, neural
networks have more trouble with. The idea of an SVM is to use a nonlinear mapping
to transform the training data into a higher dimension and in this space search for a
linear boundary, a hyperplane, which optimally separates the data. This search for the
maximum marginal hyperplane is done using 'support vectors' and 'margins'. It is a
search for a hyperplane which has the largest margin between classes. The idea is that
this will likely lead to the best classification of future data. Any training points which
fall on either border of the margin are called support vectors and are the most difficult
points to classify. Regardless of other points in the set of training samples, these
support vectors define the margin. Further, a trained SVM with few support vectors is
able to generalize well, even in high dimensionality [24]. A separating hyperplane can
be written as

$$W \cdot X + b = 0 \qquad (7)$$
where W is a vector of weights on X, a vector of input features. The weights can be
adjusted so that the borders of the margin can be written as
$$W \cdot X + b \ge 1 \;\text{ for } y_i = +1, \qquad W \cdot X + b \le -1 \;\text{ for } y_i = -1 \qquad (8)$$
This can be rewritten as
$$y_i\,(W \cdot X_i + b) \ge 1, \quad \forall i \qquad (9)$$
This establishes the two hypotheses defining how class y should be defined, here either
+1 or -1. The maximal margin is thus given by $\frac{2}{\lVert W \rVert}$, where $\lVert W \rVert$ is the Euclidean
norm of $W$. Rewriting the above with a Lagrange multiplier into a constrained
optimization problem, it is possible to find the maximum marginal hyperplane and the
support vectors [24]. Once this is done with the training data, we have a trained
support vector machine. Using the above, a decision boundary can be written for the
classification of future data:
$$d(X^{T}) = \sum_{i=1}^{l} y_i\,\alpha_i\, X_i \cdot X^{T} + b_0 \qquad (10)$$

Here, $y_i$ is the class label of support vector $X_i$, $X^{T}$ is a test feature vector, $\alpha_i$ and $b_0$ are
parameters, and the sum runs over the $l$ support vectors. The sign of the result of plugging a test feature vector into the equation
above determines which side of the hyperplane the test samples are on and thus decides
the predicted class. This works for linearly separable data. For nonlinear data it is
possible to map the original data into a higher dimension using a kernel function, find
the linear maximal marginal hyperplane in the higher dimension, and then substitute
back to translate this into a nonlinear boundary in the original space [24]. Examples of
kernel functions include [25]:

Polynomial kernel: $K(X_i, X_j) = (X_i \cdot X_j + 1)^{h}$

Gaussian radial basis function kernel: $K(X_i, X_j) = e^{-\lVert X_i - X_j \rVert^{2} / 2\sigma^{2}}$

Sigmoid kernel: $K(X_i, X_j) = \tanh(\kappa\, X_i \cdot X_j - \delta)$
SVM has been described here for binary classification problems. However, this can
be transformed to solve multiclass problems as well. One approach is: if there are m
classes, m SVMs are trained, the ith of which learns to separate the ith class from the
rest. The predicted class is chosen according to the SVM which returns the largest
positive distance from the margin [25].
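The one-vs-rest scheme can be sketched in Matlab as below, assuming training data Xtrain/ytrain and a test sample x; trainSVM and svmMargin are hypothetical stand-ins for any binary SVM implementation that returns a signed distance from the separating hyperplane:

    models = cell(1, m);
    for i = 1:m
        yBinary = 2*(ytrain == i) - 1;             % class i vs. the rest
        models{i} = trainSVM(Xtrain, yBinary);
    end
    margins = cellfun(@(mdl) svmMargin(mdl, x), models);
    [~, predicted] = max(margins);                 % largest margin wins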
3.3.7 Leave-One-Out Cross-Validation
K-fold cross-validation is a method by which a learning machine can be evaluated
and/or have its parameters set. In the case of this project it is used with learning
machines to evaluate the goodness of inputs to the learning machine. The training
data is split into k sets of approximately equal size. The learning machine is then
trained on all of the sets of training data except for one group which is used to test the
learning machine‟s accuracy. Then the sets of data are rotated so that the old left out
test set is included for training, and one of the former training sets is used to test for
classification accuracy. The sets are rotated so that all are used for training and testing
and the classification accuracies are averaged together. Leave-one-out cross-validation
is specifically when k is equal to the size of the training set, and so training is
done on all but one sample which is used for testing. Just as before, all of the training
samples are rotated so that all are used for training and testing [25].
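In Matlab the procedure is a single loop; trainAndTest below is a hypothetical helper that trains a classifier on the given data and returns 1 if the held-out sample is classified correctly:

    n = size(X, 1);
    correct = 0;
    for i = 1:n
        trainIdx = [1:i-1, i+1:n];      % every sample except the ith
        correct = correct + trainAndTest(X(trainIdx,:), y(trainIdx), ...
                                         X(i,:), y(i));
    end
    accuracy = correct / n;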
Chapter 4. Design and Implementation
This chapter describes the design strategy as well as the implementation of the work
undertaken for this dissertation. Further, it attempts to justify design decisions.
4.1 Motivation and Overall Design
This project is based on the implementation of a genetic program which evolves
features for the effective classification of skin lesion images. It sits at the intersection
of the fields of genetic programming, machine learning, and digital image analysis.
Thus it has been necessary to understand these three fields in general and, specifically,
the most effective and feasible ways to bring them together.
The problem from image analysis has been how to take in a massive matrix of raw
image data and turn it into a simple feature or set of features which contain adequate
information for classification of similar images. This has often been solved, for
example, by intuitively creating features which contain information about image
texture. Machine learning provides the tools by which to take these features for input,
calculated for many different sample images, and learn-by-example a set of statistical
“rules” which allow it to classify future images correctly. Thus the classification of
image data has often been solved by extracting these sorts of hand-crafted or intuitively
designed features from images and then using some particular learning machine, for
example a neural network or a naïve bayes classifier, to learn to classify future sample
images. This was the basic method of [1]. However, there is a general shortcoming
to this method: there is as yet no set of features which has been established to work
across image classification domains or problems. Thus, because of this and human
intuition's shortcomings when it comes to dealing with digital information at the
lowest level, often the method for choosing features to use in such classification
problems amounts to spending a great deal of time trying many different types of
features as well as empirically hand-crafting useful features specifically for the
particular job at hand. There is then the problem of feature selection as it is necessary
to determine, of the many features extracted, which ones are the best. This is where
the third topic, genetic programming, comes in. Its use aims to aid classification.
Genetic programming provides a means by which to evolve "programs" which perform
some needed role according to some set standard. In this case, the needed role is
generating feature values from image data, and the set standard is the best classification
accuracy that we can get. Therefore, we can use genetic programming to automatically
generate mathematical functions which can be applied to images to extract feature
values. We hope to do this in such a way that the generated features provide adequate
information for classification and yet come in smaller subsets than would be
generated by empirically testing and crafting many kinds of features. Thus we are
solving the problem of constructing effective features for the problem at hand and the
problem of selecting a near-optimal set of these features for classification,
automatically and at the same time. The aim is that this will not only avoid some of
the necessary hand-crafting of features for the specific domain in question, but it will
also improve overall performance of the system developed. Hopefully this would be
because the evolved features outperform the crafted features. But we could also
improve performance by adding the set of evolved features to the set of best crafted
features. For example, in the projects most similar to this one [14] [15], while evolved
features did at least as well as crafted features, and often better, the two sets put
together were better still.
So how do we design and set up Genetic Programming for these purposes? Besides
the implementation of the basic GP, it needs to be decided how the appropriate training
data will be handled, what is going to be evolved and how to represent it, what tools
will be used to evaluate the goodness of the individuals in the GP population, how to
evolve multiple complementary features, what parameters should be used for the GP as
well as the machine learning techniques incorporated, and finally, what to do with the
resulting evolved features.
4.2 Basic GP Implementation
The language chosen to implement genetic programming was Matlab. This was
chosen partly because the previous work on skin lesion feature construction and
classification was done in Matlab. In addition, it is perfectly suitable for feature
construction and selection of image data using genetic programming. This is because
it is a language with many useful mathematical tools, and it offers dynamic lists,
automatic garbage collection, prefix notation for mathematical functions, and easy
execution of strings as code. These are all characteristics cited by Koza, the creator of
genetic programming, as defining suitable genetic programming languages [20]. The
basic infrastructure of the genetic programming algorithm
implemented here is provided by GPlab, a free and open source Genetic Programming
toolbox for Matlab [26]. Although GPlab has been extremely useful, providing the
nuts and bolts of population representation, creation, evolution, and visualization, it
was ultimately not sufficient on its own for the purposes of this project. It was necessary
to heavily modify and extend GPlab so that, among other supportive changes, GPlab
could load, manipulate, and evaluate large matrices as terminals in feature equations;
individuals could contain and evolve separate 'shell' functions for turning matrices into
scalars; GPlab could evaluate the fitness of an individual's resulting feature values
using machine learning techniques for feature selection; and GPlab could run genetic
algorithms iteratively so that multiple features can be evolved to complement each
other. In the case of using a support vector machine as a wrapper, it was also necessary
to set up the GP so it could evolve the cost parameter along with individuals.
4.3 Training Data
From the beginning of the project, it was decided that traditional texture feature
equations would be the basis for evolving features. That is, the materials used to
construct individual feature equations would consist of generalized co-occurrence
matrices (GCMs) generated from the training samples of lesion images as well as
mathematical functions which are commonly used in, for example, the texture feature
equations of Haralick. This provided building blocks for equations, a starting point
and concise domain for building an initial GP feature construction system for future
expansion, as well as a set of already well established and often used Haralick features
to use for comparison. In genetic programming terms, the GCMs of skin lesions are
the set of terminals and the mathematical functions used are the set of functions. All
of the experiments conducted in this project use GCMs constructed with a
quantization level of 64 and an interpixel distance of 5 in the RGB color space. This
means that for each sample image there are 6 matrices (from the 6 color channel pairs
RR, GG, BB, RG, RB, BG), each of which is 64x64. And because the available training
set consists of 100 images, with 20 images per class of skin lesion, the training set
is 100 rows by 6 columns, each row being a sample, each column being a different
color channel pair, and each item in the training set being a 64x64 matrix. If, to
generate the GCMs, the quantization level were changed to 256, the training set would
be full of 256x256 matrices. The problems associated with the scarcity of available
training data will be discussed below.
Note that these matrices are all generated from the segmented lesion part of the skin
lesion images. If it were thought that useful information were held in the healthy skin
part of the images, GCMs could be generated from these and included in the training
set. Then the training set would be 100 rows by 12 columns. All such changes that
increase the size of the training set, however, have associated computational costs.
Thus, the experiments here are based on a quantization level of 64 and do not include
healthy skin. All of the GCMs are pre-computed from the images and loaded at the
start of the genetic program. So, the basic idea of the whole proposal is this: we
calculate generalized co-occurrence matrices from skin lesion images, use these
matrices as terminals in a genetic programming algorithm which evolves some best
feature or set of features, and then use the feature values produced by these features as
inputs to a classifier of skin lesion images. See figure 6.
Figure 6 – A graphical outline of the overall design. The end result of the GP should be a
best equation that can be used to generate feature values for inputs into a classifier. In the
case of iterative GP, there should be several equations for several features that work together.
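As an illustration of the training-set layout just described, the following sketch shows
how such a set might be assembled in Matlab. The helpers loadSegmentedLesion and
computeGCM are hypothetical stand-ins, not functions from the actual project code.

    % Sketch of pre-computing the 100-by-6 training set of GCMs.
    % computeGCM is assumed: it returns a 64x64 generalized co-occurrence
    % matrix for a channel pair, quantization level, and interpixel distance.
    channelPairs = {'RR','GG','BB','RG','RB','BG'};
    nSamples = 100;
    trainingSet = cell(nSamples, numel(channelPairs));
    for s = 1:nSamples
        img = loadSegmentedLesion(s);                  % segmented lesion only
        for c = 1:numel(channelPairs)
            trainingSet{s, c} = computeGCM(img, channelPairs{c}, 64, 5);
        end
    end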
4.4 Representation of Individuals
An individual feature equation is made up by a combination of some functions from the
function set and by, in the case of a 100x6 training set, variables X1 through X6.
These variables represent where the corresponding 6 matrices of a given sample image
are to be plugged in as terminals. They are created either by one of the genetic
programming initialization methods or by genetic operations on individuals that
already exist; in the implementation here, this means ramped half-and-half
initialization, mutation, and crossover. Individuals are stored both as strings in prefix
notation and as tree data structures. These strings are straightforward to evaluate in
Matlab. See, for example, the fairly complicated yet successful function which evolved
in one experiment to use all of the GCMs:
mydivide(mydivide(mylog(mydivide(mylog(X6),mydivide(X1,minus(X1,minus(X6,times(times(mylog(X3),X5),X6)))))),mydivide(X2,mylog(X5))),X4)
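The way such a string might be evaluated on one sample is sketched below. The
variable-binding mechanics inside the modified GPlab are assumptions here, and a
small random stand-in replaces the real training set so the snippet runs on its own.

    % Sketch of evaluating an individual's equation string on one sample.
    trainingSet = repmat({rand(64)}, 100, 6);   % stand-in for the real GCMs
    equation = 'times(X1, plus(X4, X6))';       % a small example string
    s = 1;                                      % sample index
    for c = 1:6
        eval(sprintf('X%d = trainingSet{s, %d};', c, c));  % bind X1..X6
    end
    resultMatrix = eval(equation);   % element-wise ops keep this 64x64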
Another central reason to expect improved features from the GP is its ability to produce
feature equations which arbitrarily select and combine the GCMs of different color
channels. This is readily displayed in the tree representation of the same individual
(figure 7) and is unlike the work in [14], where only grey levels are used.
Figure 7 – A graphical representation of the tree of an individual created
by the GP implemented in this project. It illustrates the arbitrary
combination of all of the GCMs, something which traditional Haralick
features do not do.
In order to make sure that every possible branch of every possible equation is able to be
executed without error, the functions operate individually on the elements of the
matrices. For instance, the function 'times' is element-wise multiplication rather than matrix multiplication. This
way, inputs to functions are always guaranteed to be matrices and the final result of an
equation is also a matrix. But because we need a scalar value in the end, it was
necessary to add functions which have been dubbed 'shell' functions to each individual.
As well as having the standard set of functions, each individual also contains two
additional shell functions, drawn from two separate sets. The first shell function takes
in a matrix and produces a vector, and the second takes in a vector and produces a scalar
value. These shell functions are always the outermost two functions, wrapped around
the rest. This
final scalar value is the feature value for the given sample skin lesion image. This
method of ensuring closure is to be contrasted with [14], where strongly typed
functions are used so that there is only one larger group of functions. These functions
are strongly typed so that their output depends on the type of their input: given a matrix
input a function will return a matrix, and given a scalar input it will return a scalar.
Though it must be done somehow, [14] does not explain how it guarantees that
evolved programs result in scalars.
The two shell functions in an individual each have an independent chance of
point-mutation at every mutation; for the experiments here, the shell mutation rate was
set to 25%. The standard function set includes the element-wise Matlab functions
times (.*), mydivide (./), plus, minus, cos, sin, mysqrt, and mylog (see table 2). The
inner-most shell function set includes the standard Matlab functions mean, max, min,
and sum, as well as the function row, which places the column vectors of a matrix into
one long vector. The outer-most shell function set includes mean, max, min, and sum.
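A sketch of how the two shell layers might wrap an evaluated equation follows; the
particular choice of shells here is arbitrary and for illustration only.

    % Sketch of applying the two 'shell' functions to an evaluated equation.
    resultMatrix = rand(64);              % stand-in for an equation's output
    row = @(M) M(:)';                     % inner-shell option: one long vector
    innerShell = @(M) mean(M);            % matrix -> vector (column means)
    outerShell = @(v) max(v);             % vector -> scalar
    featureValue = outerShell(innerShell(resultMatrix));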
The functions 'mydivide', 'mylog', and 'mysqrt' exist so that there is a check to
prevent operations that result in infinity or NaN evaluations. Mydivide, for example,
should return the numerator itself in the case of a zero denominator. However, it is
impossible to completely prevent the fitness evaluations of generated equations from
resulting in NaN. For example, using Fisher's Discriminant Ratio on an equation that
produces feature values which are all the same will result in a zero-divided-by-zero
evaluation and a NaN result. To cope with these cases, a soft constraint is added: NaN
fitness evaluations are checked for and result in fitness scores of 0.
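The protected functions might look roughly as follows. Mydivide matches the
behavior described above, while the exact protected forms of mylog and mysqrt are
assumptions based on common GP practice, not taken from the project code.

    % Sketch of the protected, element-wise functions.
    function out = mydivide(a, b)
        out = a ./ b;
        out(b == 0) = a(b == 0);      % zero denominator: return the numerator
    end

    function out = mylog(a)
        out = log(abs(a));            % assumed protected form
        out(a == 0) = 0;              % avoid log(0) = -Inf
    end

    function out = mysqrt(a)
        out = sqrt(abs(a));           % assumed: avoid complex results
    end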
Two other processing steps for evolved feature values, before they are evaluated for
fitness, are standardization and precision rounding. To standardize a vector of sample
feature values, the mean of the values is subtracted from each element and then divided
by the standard deviation of the values. The hope is that this will help erase problems
for classification introduced by features being scaled differently. The precision
rounding is simply the rounding of any value generated by the features, and of any
fitness values, to a precision of 12 decimal places. Without this check, Matlab's
floating-point arithmetic can produce slightly inexact numbers; for example, when it
should compute 0 as an answer, it may compute 1e-19. When another number is then
divided by 1e-19, instead of generating NaN, and thus a 0 fitness, it may generate a
very high fitness. Rounding to a precision of 12 decimal places solves this problem.
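In Matlab, these two post-processing steps amount to a couple of lines, sketched here
with a random stand-in for the feature values:

    % Sketch of standardization and precision rounding of feature values.
    f = randn(100, 1);                    % stand-in: one value per sample
    f = (f - mean(f)) / std(f);           % standardize: zero mean, unit std
    f = round(f * 1e12) / 1e12;           % round to 12 decimal places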
4.5 Fitness
Before its fitness can be evaluated, an individual must first be evaluated on all 100
training samples, with the appropriate GCMs plugged into the appropriate X variables
contained in the individual. This occurs for every individual in every generation, and
produces a column vector of size 100 which is essentially 100 feature
input values for 100 skin lesion images, computed using the evolved equation. These
values are then standardized, and a fitness value is calculated according to a feature
selection method which attempts to measure how useful the feature values would be in
aiding a potential classifier to classify images.
There is a concern here with regard to over-fitting as this means that all of the available
training data is being used for evaluating individuals in every generation. The issue of
how to utilize the available training data is important both with respect to genetic
programming and in the use of a statistical learning machine. Often in machine
learning situations when there is ample data, it is split into training, validation, and test
sets. The validation set is used to select parameters, the training set is used for training,
and the test set is used to evaluate the learning machine‟s performance. This is done
largely to remedy the problem of over-fitting, where the learning machine, rather than
learning an effectively general rule for classification, homes in on noise in the training
set and interprets it as the true underlying signal. Thus, the learning machine's
performance when introduced to new samples for classification is poor.
Unfortunately, as is the case here, it is not always possible to have an ample set of
training data. The skin lesion images for this dissertation are directly captured and
processed by a group associated with the larger skin lesion image retrieval project to
which this work belongs. In the collected set used for this project, there are only 100
images, 20 from each class. There are more images in the whole set, but these make
the classes drastically uneven (2 classes have only 20 samples). It was decided that for
the work here, a balanced set of samples, albeit limited in number, would be used, in
the interest of avoiding the extra complexity of working with an uneven data set.
Using leave-one-out cross-validation as the basis for accuracy scores in the wrapper
approach is an attempt to take the most advantage of the available
data. Though using leave-one-out cross validation is a common method in this
situation [27], it is important to note that it is known to be a high-variance estimator of
generalization error [28]. Therefore we can compare the relative accuracies of
evolved features and Haralick features, but we cannot yet be sure about generalization
abilities. Hopefully, future collection of data will enable additional training methods
in related future work.
Other possibilities for using the data might be to train on a data set of 15 samples per
class and test on a data set of 5 samples per class. Also, whether cross validation is
used or split training and testing sets are used, it might also be beneficial to split up data
between generations of the genetic program. This way, not all of the training and
testing data are used repeatedly on every generation. Rather, it could be rotated,
resulting in a sort of cross-validation, or it could just be split up enough so that it could
last the total number of generations. A systematic exploration of these possibilities
would be fruitful, and an implementation of them in the current system would be fairly
straightforward. However, the time required for these experiments made them
infeasible to include in the results here. Further, it is probable that this exploration
would be best done when at least some more data is available. Thus, the common
method of cross-validation is used on all 100 samples in every generation.
There are three methods used here for evaluating the feature values produced by
evaluating the individuals on all 100 samples: the filter method of a score based on the
Fisher Discriminant Ratio (FDR), and the wrapper methods based on the prediction
accuracy of either a naïve Bayes classifier or a support vector machine using
leave-one-out cross-validation, where the cross-validation is used to attempt to predict
the future prediction accuracy of the model with the available data. In the case of the
wrappers, the FDR is also used: it is first computed on the candidate features and used
to provide a
threshold. If a feature does not have an FDR above 0.2, its classification accuracy is
not computed and its fitness is set to 0. This saves some training time, and it also
prevents the classification process from taking in features which cause the classifiers
to work with singular matrices.
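A sketch of this FDR gate is given below. The multiclass form used, between-class
over within-class variance, is one common definition of the Fisher discriminant ratio;
the exact formula in the project code may differ, and wrapperAccuracy is a
hypothetical stand-in for the LOOCV accuracy of the chosen classifier.

    % Sketch of the FDR-based gate applied before the wrapper evaluation.
    f = randn(100, 1);  y = repelem((1:5)', 20);   % stand-in values and labels
    classes = unique(y);  mu = mean(f);
    between = 0;  within = 0;
    for k = 1:numel(classes)
        fk = f(y == classes(k));
        between = between + numel(fk) * (mean(fk) - mu)^2;
        within  = within  + sum((fk - mean(fk)).^2);
    end
    fdr = between / within;          % between-class over within-class variance
    if isnan(fdr) || fdr <= 0.2
        fitness = 0;                 % do not bother training the classifier
    else
        fitness = wrapperAccuracy(f, y);   % assumed: LOOCV classifier accuracy
    end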
There are many other potential fitness evaluations and many potential combinations of
these. For this project, however, these three were the most useful, expedient, and
interesting to test. Further, of these three, only FDR was used in the directly relevant
work cited in chapter 2, and unlike here, it was never used alone. Other options and
combinations of options would certainly be worth exploring. For example, a scoring
method based on the multivariate Relief algorithm as well as embedded methods both
seem to be very popular in the newer feature selection literature [27] [22].
The Bayes classifier is a simple yet empirically effective classifier, and its
implementation was provided by work already done for the overall skin lesion retrieval
project. The support vector machine is a newer and very popular classifier known for
its quick training, high accuracy, and resistance to over-fitting despite often needing
few training samples. The SVM implementation used in this project is provided by
libsvm, a support vector machine library for Matlab [29]. FDR, essentially the variance
between classes over the variance within classes, is perhaps the weakest performer of
the three. However, it is also quite fast to compute, as well as effective at helping to
prevent over-fitting when there are relatively many features and few training samples.
Therefore, it may aid the generalization ability of the classifier ultimately used, both as
a fitness method on its own and in conjunction with the wrapper methods.
A final concern is that, unlike the FDR and Bayes classifier methods, the SVM requires
the setting of parameters. In this case this means setting C, the penalty parameter of
the error term, and, if a radial basis kernel is used, the hyperparameter γ (see the listed
SVM kernels in the background section). Typically, parameters for an SVM are found
by cross-validating on a set of data samples with a grid search over the most
commonly reasonable parameter values and choosing the parameters which give the
best accuracy [30]. However, because of the small amount of data, and because it is
impossible to pre-cross-validate on features which do not yet exist (because we are
evolving them), setting these parameters is not a straightforward procedure.
Three possibilities were considered. First, the parameter values could be included in
each individual so that they are evolved along with the features. This makes sense, as
the best choice of parameters is mainly a function of the data set at hand, and the best
features and their associated best parameters could reinforce each other.
Unfortunately, this could greatly increase the complexity of the hypothesis space that
the GP is attempting to explore. It would be possible to reduce this space as much as
possible by using a linear kernel and a standard set of fewer than 10 possible settings
for the C parameter. A second option would be to select the parameter values before
the GP runs by doing a cross-validation grid search on features which are thought to be
useful or similar to what will be evolved; in this case the parameter values are
inflexible, and the features will be evolved for those particular parameter settings.
Finally, it might be possible, though costly and of questionable value, to run a small
cross-validation grid search at every fitness evaluation of individuals. The first option,
where C is evolved with the feature equations, is used here, with a radial basis function
kernel whose hyperparameters are always the libsvm defaults. It would be best if the
hyperparameters were also evolved, but the concern for this first implementation was
to keep the hypothesis space smaller. Further, evaluating SVMs with a radial basis
function
kernel was much quicker than with a linear kernel. The parameter C is initialized
randomly and then has a chance of mutating to another of the possible C values at every
generation (10e-5, 10e-3, 10e-1, 10e1, 10e3). In future work this set could be increased.
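A sketch of how C might be initialized and mutated inside the GP, under the rates
listed in table 1, follows; the actual placement of this logic in the modified GPlab
operators is an assumption.

    % Sketch of initializing and mutating the SVM cost parameter C.
    Cvalues = [10e-5 10e-3 10e-1 10e1 10e3];      % the set used here (table 1)
    C = Cvalues(randi(numel(Cvalues)));           % random initialization
    % inside the mutation operator, with an independent 50% chance:
    if rand < 0.5
        C = Cvalues(randi(numel(Cvalues)));       % jump to another setting
    end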
4.6 Iterative Genetic Programming
Following the concept introduced by [14], an iterative genetic programming algorithm
was developed to work with GPlab so that features could be developed together to
complement each other. The motivation here is related to what it means to be a set of
relevant features. Review the brief discussion of feature relevance from chapter 3:
One feature alone might be very good by itself at aiding classification and we could
take this to be a relevant feature. However, suppose we have a second feature which is
just as good at aiding classification but when paired with the first, is barely able to
increase classification accuracy. What is happening is that the second feature is
redundant and adds no new information for the classifier to take advantage of. This
makes the second feature irrelevant. Consider instead two features which are barely
worthwhile when treated by themselves, but which, when combined, enable the
classifier to outperform the first single feature. This is a general idea of what the
iterative GP is attempting to do: to find these complementary features.
This is just a gloss of the highly complicated relationships which might exist between
possible features. But the whole idea is to consider these highly complicated
relationships when trying to automatically evolve feature construction equations. For
these reasons, it may not be maximally effective to run many genetic programming
algorithms in parallel, evolve many best equations, and then throw them all together as
input features for a classifier. This is because the features are never actually evaluated
with respect to each other. Instead, the idea is to evolve features so that they are
chosen for their ability to complement each other. Thus, it might be possible to run
genetic programming algorithms iteratively so that one best equation is evolved after
another. This way, each subsequent feature equation is evolved with respect to how
well it can aid classification with the previously evolved feature equations. This is
why it is important to distinguish between univariate and multivariate feature selection
procedures. Note that, in this process, FDR cannot take the principal role in fitness
evaluation because it is a univariate feature scoring method: it scores how good a
feature is by itself. And this is another reason why it is important to include the
wrapper methods using a Bayes classifier or SVM.
The implementation of iterative genetic programming here evolves a population for a
maximum number of generations. It then takes the best individual from this first
iteration, and uses it in a second iteration of new individuals which all have their fitness
evaluated with the first evolved feature. After each feature is evolved, its feature
values are added to a feature list. Every subsequent feature evolved is scored by its
ability to aid classification along with the features already evolved. The hope is that at
the end of n iterations, there are n features which complement each other very well.
Not only are these n features effective but n is hopefully small, making the
classification problem that much less complex. Note that, except in the first iteration
of the SVM wrapper GP, the parameter C from the best individual of the previous
iteration is copied over exactly to the individuals in the first generation of the next
iteration. This is an attempt to preserve a parameter value that already works for the
features found in previous iterations. After the first generation of the next iteration, C
can still mutate between generations.
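The overall loop can be sketched as follows; runGP and wrapperAccuracy stand in for
the modified GPlab machinery and are assumptions, not the project's actual interfaces.

    % Sketch of the iterative GP: each iteration evolves one new feature whose
    % fitness is always scored together with the previously evolved features.
    y = repelem((1:5)', 20);                   % stand-in class labels
    featureList = zeros(100, 0);               % grows by one column per iteration
    for iter = 1:3
        fitnessFcn = @(f) wrapperAccuracy([featureList, f], y);
        [bestEqn, bestValues] = runGP(trainingSet, fitnessFcn);  % assumed API
        featureList = [featureList, bestValues];   % new feature's 100x1 values
    end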
4.7 Parameters
In genetic programming there are a number of parameters to decide on and set. Here
are some brief explanations of the most important ones chosen and used in the modified
and extended version of GPlab used for this project. Three of the most basic and
important parameters in any genetic program are the size of the population, the
stopping condition, and the depth restriction for the size of program trees. The size of
the population is important for allowing a large amount of variation and exploration for
the GP and the depth restriction is important in the same regard but also because it
restricts the amount of bloat possible. Bloat is a classic genetic programming problem
where programs tend to grow in size without any substantial increase in fitness. The
stopping condition for GP runs here is just taken to be a preset number of generations to
run. These three parameters are systematically tested and discussed in chapter 5.
Other important parameters, though their exact settings are not known to affect results
as drastically, are the selection type, tournament size, operator probabilities, tree depth
initialization, dynamic depth restrictions, and elitism type. The selection type used is
the GPlab default, lexictour. Lexictour is similar to tournament selection, where some
number of individuals is randomly drawn from the population and only the best of
these survive; the size of tournaments is set according to the default, which is 1% of
the population size. The only difference between tournament selection and lexictour
selection is that if two individuals have the same fitness, the one with smaller size is
chosen. Another possibility would be roulette selection, but this is known to cause very
high selection pressure and might prevent a good exploration of the space of feature
equations.
The implementation here uses variable crossover rates. The relative number of good
individuals that each genetic operator produces is used to decide whether to increase or
decrease the rate at which the different operators occur. Both crossover and mutation
rates are initialized at 50%. However, over the course of a run, it was observed that
crossover would typically range between 50% and 95% and mutation would typically
range between 5% and 50%. Reproduction is set at a fixed 10%.
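The adaptive idea can be sketched as below; GPlab's actual update rule differs in its
details, so this is only an illustration of the principle, with made-up counts.

    % Sketch of the variable-rate principle: operators producing relatively
    % more good children get their probabilities nudged upward.
    pCross = 0.5;  pMut = 0.5;                % both initialized at 50%
    goodFromCross = 30;  goodFromMut = 10;    % e.g. counts from last generation
    share = goodFromCross / (goodFromCross + goodFromMut);
    pCross = 0.9 * pCross + 0.1 * share;      % smoothed move toward its share
    pMut = 1 - pCross;   % crossover and mutation sum to 1; the fixed 10%
                         % reproduction rate is handled separately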
In addition to the absolute maximum for the depth of trees, there is also a dynamic level
maximum. This dynamic level is set smaller than the absolute maximum but can be
surpassed if the new individual is better than the best tree so far. The maximum level
for tree initialization and the dynamic limit are typically set around one or two levels
smaller than the absolute maximum for tree size. The idea behind these settings is to
try to strictly limit growth of bloat but give equations at least a chance for some small
growth if the fitness increase is there.
The type of elitism used is 'keepbest'. In filling the next generation, many new children
are created using the genetic operators. These are then pooled with the parents, and all
of them are given the chance to move to the next generation depending on their priority
for survival and the amount of space in the next generation. Using 'keepbest' elitism,
the best individual of all of the parents and children is always given the highest
priority. However, in the interest of diversity, the rest of the children are given priority
over their parents, regardless of whether they have lower fitness: the children are
ranked by their fitness, followed by the parents ranked by theirs. Stronger versions of
elitism include 'halfelitism' and 'totalelitism', where more individuals are taken to the
next generation based on their fitness rather than on whether they are a parent or a
child.
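A sketch of the 'keepbest' survival ordering, under the description above, follows; the
stand-in populations are random, and the duplicate of the best individual that reappears
later in the ranking is ignored for brevity.

    % Sketch of 'keepbest' survival ordering.
    children = struct('fitness', num2cell(rand(1, 20)));   % stand-in individuals
    parents  = struct('fitness', num2cell(rand(1, 20)));
    populationSize = 20;
    pool = [children, parents];
    [~, b]    = max([pool.fitness]);                 % the single best individual
    [~, cidx] = sort([children.fitness], 'descend'); % children outrank parents
    [~, pidx] = sort([parents.fitness], 'descend');
    ranking = [pool(b), children(cidx), parents(pidx)];
    nextGen = ranking(1:populationSize);             % fill the available space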
Finally, note the complications for the project due to run-time problems. The use of
the Bayes classifier and SVM was questionable as the project began, because learning
machines are known to be much more costly to run for feature selection. In fact, it
became clear early on that run-time for the genetic program would be an important
consideration in producing features; unexpectedly, however, this was not because of
the use of any wrapper method. The long run-times for the genetic programming
system implemented here were due mostly to the time it takes to evaluate every
individual in a given generation. That is, to evaluate the feature values for a feature
equation, it is required to plug in the terminal GCM matrices, each 64x64 in the tests
here, and evaluate them in sometimes very large equations. These evaluations occur in
a loop, one sample at a time, a process which is often very inefficient compared to the
identical operations carried out via matrices, where all samples would be placed into
one matrix and computed together. This is because, in adapting GPlab to work with
large GCM matrices, it was necessary to discard a piece of code which could bypass
Matlab's 'nested 32' error. This error prevents matrix operations which are nested by
more than 32 brackets or parentheses, and so of course prevents the large batch
processing of 100 samples all at once, where each sample holds a large number of
sizeable matrices. It is not known whether bypassing the nested 32 error would greatly
improve performance, but this run-time consideration, combined with the time limit on
this project and the need to run many trials per experiment, put an upper bound on the
number and variations of experiments possible. The following experiments represent
over 120 days of computing time. Run-time is not mentioned in the related work, so a
comparison in this regard is not possible.
GP Parameters
Parameter                                Values Used
Elitism                                  keepbest
C parameter for SVM                      10e-5, 10e-3, 10e-1, 10e1, 10e3
C mutation rate                          50% chance inside mutation operator
Selection                                lexictour
Shell mutation rate                      2 independent 25% chances
Crossover and overall mutation rates     variable
Fitness                                  FDR filter, Bayes wrapper, SVM wrapper
# iterations for iterative GP            1, 2, and 3
Table 1 – List of primary GP parameters and the values used.
Function and Terminal Sets
Mathematical Name     Matlab command (all are element-wise matrix operations;
                      operations with the 'my' prefix are modified to ensure closure)
Function Set
Sine                  sin()
Cosine                cos()
Multiplication        times() (.*)
Division              mydivide() (./)
Subtraction           minus() (-)
Addition              plus() (+)
Square Root           mysqrt()
Natural Logarithm     mylog()
Terminal Set          6 GCMs (color space RGB, interpixel distance 5, quantization 64)
Table 2 – List of the function and terminal sets used to create individuals in the GP population.
Chapter 5. Results and Analysis
This chapter explains the experiments run as well as the results attained, evaluates the
resulting performance, and compares the results to the related works described in
chapter 2.
5.1 Exploration of Population, Generation, and Depth
Experiments were run with the FDR filter, Bayes wrapper, and SVM wrapper GPs.
First, the FDR filter was used to systematically explore the GP's behavior, varying the
population size, the number of generations, and the maximum depth. Then the top 25%
of all of the individuals created by the FDR filter GP were pooled for classification
performance evaluations using a Bayes classifier and a greedy feature selection
algorithm. Each set of experiments consists of ten runs of the GP with the given
parameters. The run-time for one iteration of one of the filter GPs ranged from about
2 hours to greater than 2 days, with the average being more than 1 day. Run-times were
recorded for every experiment, but exact values are not reported here because the GPs,
in an effort to get results as quickly as possible, were run on many different computers
with different levels of unrelated processing going on. It is still possible, however, to
estimate the relative run-times of experiments as the number of generations × the size
of the population × the maximum depth of the equations × the number of iterations.
The value taken from each run is the best individual returned. The initial experiments
are an attempt to explore the behavior of the GP, the goal being to find parameters
which result in the highest scores while remaining constrained by run-time. The results
from the filter GP experiments are also useful for discovering classification
performance when using the FDR filter for fitness. Experiments 1 and 4 are the same
experiment, repeated for comparison.
Round 1 – Vary Population
Experiment   Generations   Population   Depth   FDR (avg. of 10)
1            35            50           12      .8493
2            35            150          12      .8479
3            35            250          12      1.0711
Table 3 – Round 1 of experiments, where the population size is varied. The average is
slightly deceiving, but the FDR distribution generally increases with population size.
Clearly, increasing the population size shifts upward the distribution of FDR scores
returned in the best individual. Unfortunately, using a population of 250 over 35
generations resulted in run-times of about 2 days.
Figure 8 – statistical illustration of experiments, varying population. The red
line is the median, the top and bottom of the boxes are the 25th and 75th
percentiles, the whiskers above and below the boxes signify data which are not
considered outliers but are in the extremes of the distribution, and the red
crosses are outliers.
Figure 9 – An example of the evolution and result from 1 run of the ‘vary population’
experiments. Top left – the tree of the resulting best individual. Top right – the evolution
of fitness over the run. Bottom left – the evolution of genetic operator probabilities over
the run. Bottom right – The evolution of the complexity of the population over the run.
The intron count is calculated in order to save time.
Round 2 – Vary Generation
Experiment   Generations   Population   Depth   FDR (avg. of 10)
4            35            50           12      .8493
5            70            50           12      .8733
6            110           50           12      1.0008
Table 4 – Round 2 of experiments, where the number of generations is varied.
An increase in the number of generations shifts upward the resulting distribution of
best individuals. As in Round 1 of the experiments, the third group shows the greatest
increase as well as the largest spread of individuals between the 25th and 75th
percentiles.
Figure 10 -- statistical illustration of experiments, varying generation. The red
line is the median, the top and bottom of the boxes are the 25th and 75th percentiles,
the whiskers above and below the boxes signify data which are not considered
outliers, and the red crosses are outliers.
Figure 11 -- An example of the evolution and result from 1 run of the ‘vary generation’
experiments. Top left – the tree of the resulting best individual. Top right – the evolution
of fitness over the run. Bottom left – the evolution of genetic operator probabilities over
the run. Bottom right – The evolution of the complexity of the population over the run.
The intron count is not allowed to increase in order to save time.
Round 3 – Vary Depth
Experiment   Generations   Population   Depth   FDR (avg. of 10)
7            50            150          4       .7650
8            50            150          8       .8737
9            50            150          12      1.1192
Table 5 – Round 3 of experiments, where the maximum depth of individuals in the GP is varied.
Once again, the highest setting feasible within the maximum allowable run-time (for
the purposes of reporting here) returns the distribution with the greatest FDR scores.
While round 3 seems to show the greatest increase in scores with increasing settings,
suggesting that depth may have a greater effect on score than population size or
number of generations, this is not exactly a fair comparison: the rounds of experiments
do not have the same values for the parameters which were not being varied.
Figure 12 -- statistical illustration of experiments, varying depth. The red line is the
median, the top and bottom of the boxes are the 25th and 75th percentiles, the whiskers
above and below the boxes signify data which are not considered outliers, and the red
crosses are outliers.
Figure 13 -- An example of the evolution and result from 1 run of the ‘vary depth’
experiments. Top left – the tree of the resulting best individual. Top right – the evolution
of fitness over the run. Bottom left – the evolution of genetic operator probabilities over
the run. Bottom right – The evolution of the complexity of the population over the run.
The intron count is not allowed to increase in order to save time.
5.2 Pooled FDR Features Compared to Haralick Features
Even though features selected according to the FDR filter are not guaranteed to work
well together, since FDR is a univariate method, it is very possible that some of them
do. So classification was run on the pool of the top 25% of FDR scores from the 3
rounds of experiments to find out how they would perform. Individuals were selected
from this pool using a greedy forward-selection algorithm, with the leave-one-out
cross-validation accuracy of the Bayes classifier as the evaluation method; a sketch of
the procedure is given below. The performance is shown using 1, 4, and 10 features.
Again, the evolved equations were created by the FDR filter GP, which used as
terminals GCMs with quantization level 64 and interpixel distance 5. For comparison,
forward-selection was run on the 72 Haralick features which were created with the
same GCM settings (12 Haralick equations × 6 color channels × 1 interpixel distance
(5) × 1 quantization level (64)). Additionally, forward-selection was run on 1296
Haralick features which were created using combinations of additional GCM
parameters (12 Haralick equations × 6 color channels × 6 interpixel distances (5, 10,
15, 20, 25, 30) × 3 quantization levels (64, 128, 256)). These additional parameters
should give an advantage to this group's performance. See table 6.
Classifier Performance with Bayes Classifier and Greedy Selection

FDR filter GP features (selected from top 25% of total)
# Features Selected   1     4     10
Accuracy              51%   67%   72%

72 Haralick features (same GCM parameters)
# Features Selected   1     4     10
Accuracy              37%   66%   65%

1296 Haralick features (more GCM parameters)
# Features Selected   1     4     10
Accuracy              41%   61%   72%

Table 6 – Comparison of the classification accuracy of the feature equations produced by the
FDR filter GP and of the Haralick features.
The evolved features generally have a fair advantage over the Haralick features. The
least favorable comparison is between the evolved features and the features selected
from the 1296 Haralick set, where both reach 72% accuracy. Even this reflects
favorably on the evolved features, as there were fewer than 200 GP features before
selection, created using GCMs from only one interpixel distance setting and one
quantization setting; in contrast, there were 1296 Haralick features before selection,
created with 6 interpixel distances and 3 different quantization levels. Also, the single
best evolved feature is able to classify 51% of samples correctly, a significant leap over
the best single Haralick feature. It seems, however, that adding more and more features
improves the accuracy less and less. In this regard, though the univariate
FDR-filter-evolved features are not exactly working together to provide much more
information, they are not significantly less complementary to each other than the
Haralick features when selecting up to 10 features.
5.3 Wrapper Fitness and Iterative GP
Finally, the wrapper GPs were given 10 trials using 3 iterations, run once with the
Bayes wrapper as the fitness function and once with the SVM wrapper. The parameters
were chosen in an attempt to maximize score while finishing in a reasonable time: a
population of 250 and depth of 8, run for 35 generations and 3 iterations. One iteration
of one GP took about 1 day to complete, so these trials took about 30–40 days of
computing time. In the future it should be a priority to investigate the results attainable
when iterating at least 4 times.

First, the distributions of the classification accuracies produced by the FDR GP runs
and the wrapper GP runs are compared using one feature (figure 14). Note that the
FDR features are a distribution over 200 equations, the top 25% of FDR scores
produced by the FDR GP under a number of different GP parameters, while the results
for the Bayes wrapper and SVM wrapper GPs are the distributions of just 10 GP runs
each. With this limited number of runs, it seems that both wrapper methods are able to
produce equations with higher classification accuracies more consistently than the
FDR GP. Further, the SVM is slightly better than the Bayes classifier, though with so
few trials, especially with 3 iterations, this is not an entirely confident comparison
between the two. The distribution of GPs using 3 features is from 7 trials rather than
10. Figure 15 summarizes the classification accuracies attained over all the wrapper
GP trials.
Figure 14 – A comparison of the classification accuracies of evolved equations using
FDR, Bayes, and SVM fitness methods. Note that while the FDR distribution is
made of 200 samples, the Bayes and SVM distributions are made of only 10 each.
Figure 15 – A comparison of the classification accuracies of evolved equations
using the Bayes and SVM wrappers with 1, 2, and 3 features. Note that the
distributions with 1 and 2 features are from 10 samples each, while the
distributions with 3 are from 7 samples each.
As the number of features was increased, the classification accuracy increased and
surpassed both the Haralick features and the evolved FDR filter GP features. The
highest scores were produced by the SVM wrapper GP with 3 features, which evenly
matched the very best accuracy of the Haralick and FDR-evolved features using 10
features for classification. These scores were produced with only 3 iterations of the
GP, whereas the Haralick features were selected from over a thousand computed
features and the FDR features from the top 25% of 800 evolved features, both using a
greedy feature selection algorithm.
These results show promise for the wrapper-based approach to evolving features,
though it is not clear from these results whether, or by how much, the SVM is better
than the Bayes classifier. While the margin between the best of the Haralick, FDR, and
wrapper features is small, the wrapper features did perform at least as well as the larger
sets of features generated otherwise, with a smaller feature set generated in fewer runs.
Unfortunately, the wrapper method implemented here is the least tuned part of these
experiments. Though the FDR filter GP clearly has more potential and could use trial
runs with greater population sizes, it seems that the wrapper GPs could also use more
development in, for example, how the genetic operators decide when and how
individuals are modified. In this implementation, the mutation operator is responsible
for providing independent chances to change the shell functions, the SVM parameter
C, and the feature equations themselves. But this seems to have a bad effect on the
diversity of the population, as illustrated in figure 16, where the mutation operator
never results in improvements in fitness, causing its probability of occurring to remain
low. The equation is able to evolve to a classification accuracy of 72%, but in the 2nd
and 3rd iterations the populations have converged on only one individual; diversity is
supplied only at the very beginning of these runs. If diversity can be restored with
adjustments to the algorithm, the SVM wrapper should improve.
Figure 16 – Note that the scaling is different in every fitness plot; the maximum
classification accuracies are listed in the boxes of run statistics. The evolution of fitness in
iterations 1 (top left), 2 (top right), and 3 (bottom left). The populations in 2 and 3 get
stuck after making only initial advances in fitness. The bottom right plot reveals that
mutation is not contributing much after initialization, in contrast to, for example, figure 13,
where mutation continues to play a part in the evolution of the FDR-evolved features. Note
that the initial high median values are an artifact of a starting population with many zero
fitness values, due to the FDR threshold preventing training with some of the worst features.
Figure 17 – An example of feature trees produced by 3 iterations of the SVM wrapper GP.
It has been established that the FDR evolved features can often perform better than the
standard set of Haralick features and that, though there is a smaller set of trials than
would be preferred, the wrapper methods can more consistently create smaller subsets
of features which are at least as good as the FDR-evolved features. Further, both of
these methods have room to improve, with respect both to increasing primary
parameters such as population size and to tweaking how the genetic operators split up
their evolutionary duties.
But it is interesting to take a brief, closer look at the features evolved. In this regard,
table 7 shows the confusion matrices of 3 similarly performing features evolved with
the three fitness functions. Also, figures 18 and 19 show 3D visualizations of the
feature values produced by the FDR filter and the Bayes wrapper. With the focus on
developing a GP implementation for feature construction, it is not possible to develop
a very detailed appreciation of the space of features and their relationships. However,
we can at least build a feel for how the evolved features classify and misclassify the
classes and how they place data points in the feature space relative to each other.
We can conclude from the confusion matrices that, at least in this small snapshot, there
are no drastic, systematic differences in how the classes are classified using the
different fitness methods; there are only slight variations in classification decisions.
In all of the matrices in table 7, for example, AK is the most difficult class to classify.
Moving on to the visualizations, these make it obvious that the classes are not perfectly
linearly separable. This may indicate that a linear kernel is not the best option for the
SVM approach. Further, it seems clear that there are a number of ways that the classes
can be flipped relative to each other and variously clumped together or spread out.
The visualizations show that the features do separate the classes reasonably well to the
human eye; however, the classes are still fairly cluttered together. For comparison,
table 8 shows the confusion matrix for the figure 19 data. It shows that the class which
is most difficult to classify according to the confusion matrices is also the class which
lies in the space between the other classes.
Confusion Matrix – FDR (Bayes); accuracy 65%, 3 features
                 predicted
actual    AK   BCC   ML   SCC   SK
AK         8     3    4     2    3
BCC        2    14    3     1    0
ML         2     0   16     0    2
SCC        4     2    0    13    1
SK         1     1    3     1   14

Confusion Matrix – SVM; accuracy 64%, 3 features
                 predicted
actual    AK   BCC   ML   SCC   SK
AK        10     2    2     3    3
BCC        2    15    3     0    0
ML         3     1   14     0    2
SCC        3     2    1    12    2
SK         3     2    1     1   13

Confusion Matrix – Bayes; accuracy 64%, 3 features
                 predicted
actual    AK   BCC   ML   SCC   SK
AK        10     2    2     3    3
BCC        4    14    1     1    0
ML         2     1   16     0    1
SCC        3     0    4    13    0
SK         0     2    2     2   14

Table 7 – The confusion matrices displaying the classification predictions against their true
classes. Similar classification accuracies were deliberately chosen here to compare the
classification characteristics. At least in this small snapshot, there seem to be no systematic
differences in error between the three methods of fitness; for example, in every method AK
is the most poorly classified class. Note that these accuracy scores are at the bottom of the
distribution for the wrapper methods and at the top for the FDR filter method.
Figure 18 – Visualization of the feature values made by 3 of the FDR-filter produced
equations. The class of a point is indicated by color. Classification accuracy using these
features was 64%.
Figure 19 – Visualization of the feature values made by 3 of the Bayes-wrapper-produced
equations. The class of a point is indicated by color. Classification accuracy using these
features was 69%.
Though the evolved features were able to classify better than the Haralick features, and
with less complicated GCMs, the performance results show much room for
improvement. The best of any run of the GP was 72%. This is comparable to the
results obtained in [15] but below the 84% obtained in [14]. However, it is important
to point out the differences between the work completed here and there. The work here
was on a real engineering problem with the related real-world problem of very limited
(but hopefully growing) data.

In contrast, both of the GP projects cited were completed on the same relatively large
and publicly available online database of digital photographs. Importantly, these
images are specifically photographs of different classes of textures. Working with
these is an easier problem than using texture features to distinguish between skin
lesions.
Confusion Matrix – Bayes; accuracy 69%
                 predicted
actual    AK   BCC   ML   SCC   SK
AK        13     2    3     0    2
BCC        4    13    1     2    0
ML         3     0   14     1    2
SCC        3     1    2    13    1
SK         0     1    2     1   16

Table 8 – The confusion matrix corresponding to the features from figure 19. It makes sense
that AK is the class most often misclassified in the previous tables, as it lies at the intersection
of just about all of the other classes (blue dots in figure 19). Notice, however, that with these
features it is not worse than two other classes, and that correspondingly, the blue dots are
much more tightly clumped in figure 19 than in figure 18.
Whereas the texture photographs are fairly straightforward to distinguish even with the
human eye, it takes a dermatologist's careful diagnosis to distinguish between skin
lesions. Also, [14] split the photographs up, with four patches of data coming from one
image, so that the data set could be increased to 480 samples. Compare this with the
100 samples of 5 classes of skin lesion images used here, where each sample comes not
only from a different image but also from a different patient.
It is quite clear from every experiment conducted here that the full potential of this GP
implementation, even without any optimizations of the current system, has not been
reached. Even the highest values used for population, depth, and generations did not
result in a convergence of scores to a maximum. It is expected and hoped that the
classification of skin lesion images can be improved quite easily by increasing
population sizes and generation counts.
But there are clear problems with the implementation and experiments here. Due to
the small data set, cross-validation is used for optimization and there is no data set left
over to verify the generalization abilities of the evolved features. This should be a
priority in the future and will become possible as more data is collected from the field.
Also, it seems even from the very few trial runs with the support vector machine that
the SVM wrapper GP's potential is not being reached either. It seems, for example,
that the mutation operator is not able to introduce new genetic material into the
population. The probability of mutation should increase when that operator produces
good individuals; unfortunately, the mutation probability never rises using the wrapper
methods, which may be partially because of the more complicated hypothesis space
created by evolving the C parameter for the SVM. Also, though a non-linear kernel is
used for the SVM, only one parameter is evolved. Though also
including the kernel parameters in evolution would make the hypothesis space still
more complicated, it seems necessary to eventually incorporate them as well; in this
regard, radial basis functions are probably the best option. It would be best if separate
genetic operators were used for mutating the individual equations, the shell functions,
and the parameters of the SVM. This way, the variable mutation rate for each could go
up and down according to its own merits. As it is now, it may be that bad equation
mutations prevent an exploration of the SVM parameter space, or vice versa.
Chapter 6. Conclusions
This chapter provides ideas for work to continue with the GP system resulting from
this project and it concludes with a summary of the work developed here, results,
and the problems encountered.
6.1 Future Work
Before making any grand plans for expanding the current project, of which many are
possible, the issue of run-time should be addressed. If addressing Matlab's nested 32
error could decrease the run-time of the GP, it would greatly ease the exploration of
different parameters and new methods. It is clear that, though the results of feature
construction and selection via genetic programming presented here do show success,
the system shows no signs of having been pushed to its full potential yet. It would
benefit both from more run-time and from an efficiency increase in the GP algorithm
itself. Even without any changes, it is likely that more run-time alone, for greater
populations with more generations and greater depth limits, could produce even better
results. Further, it would also be statistically reassuring to confirm the current results
with at least 10 more runs of the GP in the various experimental settings.
The possibilities for future work lie in the three domains which converge for this
project: digital image analysis, genetic programming, and machine learning. For
image analysis, GCMs generated with a wider variety of parameters could be used as
terminals, for example, using matrices generated from different interpixel distances all
together. This way, the GP would have the ability to arbitrarily combine even more
information from the image. Going further, it might be fruitful to investigate
terminals which are not provided only by GCMs. Perhaps color information could be
included, for example. Also, healthy skin images could be incorporated.
Investigating advanced genetic programming techniques would also likely be helpful.
In particular, novel, ad-hoc methods for combining or manipulating feature
equations in the genetic operators, designed specifically to mix image analysis
information in helpful ways, could be pursued. Adding more functions to the
function set, or designing ad-hoc functions, would also be worth exploring.
Finally, the use of Automatically Defined Functions (ADFs) might hold promise. An
ADF is a piece of program whose parts are always defined together in a certain way;
for example, (X1 - X2) / X3 could be treated as a single unit which gets plugged
into new, larger equations. This way, code which is known to potentially work well
can be reused and built upon. Perhaps the original Haralick feature equations
themselves could be defined as ADFs and given the chance to become part of larger
evolved equations, or the best feature from a previous iteration could be used as
an ADF in future iterations.
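For concreteness, a minimal sketch of the (X1 - X2) / X3 fragment as a reusable
building block follows, wrapped in a protected-division shell in the spirit of the
shell functions used in this system; all names are illustrative.

    % Sketch: an ADF-style reusable subexpression.
    protdiv = @(a, b) a ./ (b + (b == 0));         % shell: protected division
    adf     = @(x1, x2, x3) protdiv(x1 - x2, x3);  % the reusable building block

    % A larger evolved equation can then call the ADF as a single node:
    evolved = @(x1, x2, x3, x4) log(1 + abs(adf(x1, x2, x3))) .* x4;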
Finally, a greater focus on sound machine learning techniques can only aid further
performance gains. This would not only help performance increase in the future but
would also allow more accurate measurement of the performance of the system already
in place. In this regard, learning will improve greatly as more skin lesion images
are collected. And, as already mentioned, a variety of training methods could be
systematically investigated. At present, all 100 samples are used in every
generation of the population. Using fewer samples every generation would give the
learning machine less to work with, but if less training data is evaluated each
generation, perhaps rotated as the generations advance, this might help reduce
over-fitting. It would also reduce the run-time, since evaluating equations is the
greatest use of time in a run.
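A minimal sketch of such a rotation scheme follows; the subset size, shift, and
data are placeholders, and the population-evaluation call is hypothetical.

    % Sketch: rotating a smaller training subset each generation.
    numSamples = 100;
    subsetSize = 50;
    shift = 10;                          % samples rotated per generation
    X = randn(numSamples, 4);            % placeholder feature data
    y = randi(5, numSamples, 1);         % placeholder class labels

    for gen = 1:30
        start = mod((gen - 1) * shift, numSamples);
        idx = mod(start:(start + subsetSize - 1), numSamples) + 1;  % wraps around
        trainX = X(idx, :);
        trainY = y(idx);
        % fitness = evaluatePopulation(population, trainX, trainY);  % hypothetical
    end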
Another relevant avenue is the handling of extremely skewed data. The main reason
more data is not being used for training in this project is that the classes are so
uneven: in a data set of about 500 samples with 5 classes, 2 classes have only 20
samples each. A test set built from the data beyond the 100 samples used here would
be statistically questionable, because for two of the five classes there are no
samples beyond this set. The best answer to all of these problems is simply more
data, but a handle on how the extra data could be soundly used would be helpful as
well.
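One sound way to use all of the skewed data would be stratified cross-validation,
sketched below on the assumption that the Statistics Toolbox's cvpartition is
available (it stratifies automatically when given class labels); the class sizes
are illustrative.

    % Sketch: stratified 5-fold partitioning for heavily skewed classes.
    labels = [ones(240,1); 2*ones(160,1); 3*ones(60,1); ...
              4*ones(20,1); 5*ones(20,1)];        % 500 samples, 5 uneven classes
    cv = cvpartition(labels, 'KFold', 5);
    for f = 1:cv.NumTestSets
        trainIdx = training(cv, f);   % logical index of this fold's training set
        testIdx  = test(cv, f);       % even 20-sample classes give ~4 per test fold
    end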
Additional machine learning techniques could also be used for fitness evaluation.
As already mentioned, filters based on Relief algorithm scores and embedded methods
seem to be attracting attention in feature selection. In the same regard, there are
many other types of filters (such as information-theoretic ones) and many other
types of classifiers (such as Bayesian neural networks) which may hold promise. An
easy addition to the current implementation would be to simply try additional
kernels in the already implemented support vector machine. This would require the
individuals to hold additional parameters to evolve with them, which would in turn
make the hypothesis space that much more complicated. But perhaps this trade-off
would be worthwhile once the described problem with diversity is smoothed out.
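As a first step, the candidate kernels could be compared offline by
cross-validation accuracy before being added to the evolution. A minimal sketch
using LIBSVM's MATLAB interface [29] follows; the data and parameter values are
placeholders, and the -v option makes svmtrain return cross-validation accuracy
rather than a model.

    % Sketch: comparing LIBSVM kernels by 5-fold cross-validation accuracy.
    features = randn(100, 3);        % placeholder feature vectors
    labels   = randi(5, 100, 1);     % placeholder class labels

    % -t 0 linear, 1 polynomial, 2 RBF, 3 sigmoid
    kernels = {'-t 0', '-t 1 -d 3', '-t 2 -g 0.5', '-t 3'};
    for k = 1:numel(kernels)
        opts = sprintf('%s -c 1 -v 5', kernels{k});
        acc = svmtrain(labels, features, opts);   % returns CV accuracy (%)
        fprintf('kernel %s: %.2f%%\n', kernels{k}, acc);
    end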
6.2 Conclusions
A genetic programming system which automatically evolves feature equations for the
classification of skin lesion images has been developed and described. The system
uses generalized co-occurrence matrices and ordinary mathematical functions which
are combined stochastically and evaluated by a simple FDR filter or by a
combination of FDR score and one of two classifiers, naïve Bayes or SVM. This
system is similar to two other systems in its use of texture GCMs to evolve
features. However, it is unique in its choice of classifiers, in the methods by
which it guarantees closure and allows the arbitrary combination of color channels,
and in its application to a real world problem. Again, instead of working only in
grey-scale like [14], this system is able to arbitrarily combine color channels
into one feature value. Whereas the two previous works (described in chapter 2 and
contrasted with in chapter 5) demonstrated the possibility of successfully applying
GP to image classification using GCMs on a large texture photography database, this
project was developed as an effort to improve the classification abilities of a
dermatological system. With the real world problem of skin-lesion classification,
however, came the real world problem of scant data. Cross-validation was therefore
used to obtain accuracy evaluations with the learning machines; however, there was
as yet no way to measure the ability of the features to generalize. Further, very
long run-times, caused at least partially by a restriction in the programming
language used in this project, severely limited the exploration of the GP system's
abilities. Even so, the system was still able to outperform the traditional
Haralick features by a fair margin and, more importantly, demonstrated that there
is certainly room for improvement, even before further development of the system.
The GP system evolved features which aided classification with higher accuracy
using fewer features.
Bibliography
[1] Ballerini, L., Li, X., Fisher, R., & Rees, J. (2010). A Query-by-Example
Content-Based Image Retrieval System of Non-Melanoma Skin Lesions.
Medical Content-Based Retrieval for Clinical Decision Support, 31–38.
Springer. Retrieved from
http://www.springerlink.com/index/N027330482888GL3.pdf.
[2] Liu, H., Motoda, H. Computational methods of feature selection. Chapman
& Hall/CRC. 2008.
[3] Muller, H., Michoux, N., Bandon, D., and Geissbuhler, A., "A review of
content-based image retrieval systems in medical applications - clinical
benefits and future directions," International Journal of Medical
Informatics, vol. 73, 2004, pp. 1-23.
[4] Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences,
and trends of the new age. ACM Computing Surveys 40(2) (April 2008)
5:1–5:60
[5] Wollina, U., Burroni, M., Torricelli, R., Gilardi, S., Dell'Eva, G., Helm, C.,
and Bardey, W., "Digital dermoscopy in clinical practise: a three-centre
analysis," Skin Research and Technology, vol. 13, May 2007, pp. 133-142.
[6] Schmid-saugeons, P., Guillod, J., and Thiran, J.P., Towards a
computer-aided diagnosis system for pigmented skin lesions, Computerized
Medical Imaging and Graphics, vol. 27, 2003, pp. 65-78.
[7] Maglogiannis, I., Pavlopoulos, S., Koutsouris, D., "An integrated computer
supported acquisition, handling, and characterization system for pigmented
skin lesions in dermatological images," IEEE Transactions on Information
Technology in Biomedicine, vol. 9, 2005, pp. 86-98.
[8] Celebi, M.E., Kingravi, M.A., Uddin, B., Iyatomi, H., Aslandogan, Y.A.,
Stoecker, W.V., and Moss, R.H., "A methodological approach to the
classification of dermoscopy images," Computerized Medical Imaging and
Graphics, vol. 31, 2007, pp. 362-373.
[9] Rahman, M.M., Desai, B.C., Bhattacharya, P.: Image retrieval-based
decision support system for dermatoscopic images. In: IEEE Symposium
on Computer-Based Medical Systems, Los Alamitos, CA, USA, IEEE
Computer Society (2006) 285-290
[10] Ohta, Y.I., Kanade, T., Sakai, T.: Color information for region
segmentation. Computer Graphics and Image Processing 13(1) (July
1980) 222-241
[11] Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image
classification. IEEE Transactions on Systems, Man and Cybernetics 3(6)
(1973) 610-621
[12] Unser, M.: Sum and difference histograms for texture classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence 8(1) (January
1986) 118-125
[13] Ballerini, L., Li, X., Fisher, R., Aldridge, B., & Rees, J. (2010).
Content-Based Image Retrieval of Skin Lesions by Evolutionary Feature
Synthesis. Applications of Evolutionary Computation, 312–319. Springer.
Retrieved from
http://www.springerlink.com/index/X0V1425K97G70578.pdf.
[14] Aurnhammer, M., "Evolving Features by Genetic Programming," in Applications
of Evolutionary Computing: EvoWorkshops 2007 (EvoCOMNET, EvoFIN, EvoIASP,
EvoINTERACTION, EvoMUSART, EvoSTOC and EvoTRANSLOG), Valencia, Spain,
April 11-13, 2007, Proceedings.
[15] B. Lam and V. Ciesielski, "Discovery of human-competitive image texture
feature extraction programs using genetic programming," In: GECCO (2).
Volume 3103 of Lecture Notes in Computer Science, 2004, pp. 1114-1125.
[16] J.R. Sherrah, R.E. Bogner, and A. Bouzerdoum, "The Evolutionary
Pre-Processor: Automatic Feature Extraction for Supervised Classification
using Genetic Programming," Evolutionary Computation, 1996.
[17] C. Huang and C. Wang, "A GA-based feature selection and parameters
optimization for support vector machines," Expert Systems with
Applications, vol. 31, 2006, pp. 231-240.
[18] H. Frohlich, O. Chapelle, and B. Scholkopf, "Feature selection for support
vector machines by means of genetic algorithm," Proceedings. 15th IEEE
International Conference on Tools with Artificial Intelligence, 2002, pp.
142-148.
[19] R. Poli, W.B. Langdon, and N.F. McPhee, "A Field Guide to Genetic
Programming," 2008.
[20] J.R. Koza and M.J. Hall, "Genetic Programming: A Paradigm for Genetically
Breeding Populations of Computer Programs to Solve Problems," Technical
Report STAN-CS-90-1314, Stanford University, 1990.
[21] Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine
Learning. Addison-Wesley, Reading, MA (1989)
[22] Liu, H., Motoda, H., Computational methods of feature selection, Chapman
& Hall/CRC, 2008.
[23] Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. (eds.), Feature Extraction:
Foundations and Applications, Springer, 2006.
[24] J. Han and M. Kamber, Data mining: concepts and techniques, Morgan
Kaufmann, 2006.
[25] R.G. Brereton and G.R. Lloyd, "Support vector machines for classification
and regression," The Analyst, vol. 135, 2010, pp. 230-267.
[26] Silva, Sara. GPLAB - A Genetic Programming Toolbox for MATLAB. Software
available at http://gplab.sourceforge.net
[27] Guyon, I. and Elisseeff, A., "An Introduction to Variable and Feature
Selection," Journal of Machine Learning Research, vol. 3, 2003, pp.
1157-1182.
[28] Vapnik, V. Estimation of dependencies based on empirical data. Springer
series in statistics. Springer, 1982.
[29] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support
vector machines, 2001. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
[30] Hsu, C.-W., Chang, C.-C., and Lin, C.-J., "A Practical Guide to Support
Vector Classification," Technical report, Department of Computer Science,
National Taiwan University, 2010.