
ESTIMATION OF DISTRIBUTION ALGORITHMS

A New Tool for Evolutionary Computation


Genetic Algorithms and Evolutionary Computation

Consulting Editor, David E. Goldberg

Additional titles in the series:

Efficient and Accurate Parallel Genetic Algorithms, Erick Cantu-Paz ISBN: 0-7923-7466-5

OmeGA: A Competent Genetic Algorithm for Solving Permutation and Scheduling Problems, Dimitri Knjazew ISBN: 0-7923-7460-6

Genetic Algorithms and Evolutionary Computation

http://www.wkap.nl/series.htm


ESTIMATION OF DISTRIBUTION ALGORITHMS

A New Tool for Evolutionary Computation

edited by

Pedro Larrañaga
Jose A. Lozano

University of the Basque Country, Spain

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


ISBN 978-1-4613-5604-2    ISBN 978-1-4615-1539-5 (eBook)    DOI 10.1007/978-1-4615-1539-5

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 2002 Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 2002. Softcover reprint of the hardcover 1st edition 2002.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher.

Printed on acid-free paper.



Contents

List of Figures xi
List of Tables xvii
Preface xxiii
Contributing Authors xxvii
Series Foreword xxxiii

Part I Foundations

1 An Introduction to Evolutionary Algorithms (J.A. Lozano) 3
   1. Introduction 3
   2. Genetic Algorithms 6
   3. Evolution Strategies 14
   4. Evolutionary Programming 19
   5. Summary 20

2 An Introduction to Probabilistic Graphical Models (P. Larrañaga) 27
   1. Introduction 27
   2. Notation 28
   3. Bayesian networks 31
   4. Gaussian networks 44
   5. Simulation 51
   6. Summary 51

3 A Review on Estimation of Distribution Algorithms (P. Larrañaga) 57

   1. Introduction 57
   2. EDA approaches to optimization 58
   3. EDA approaches to combinatorial optimization 64
   4. EDA approaches in continuous domains 80
   5. Summary 90

4 Benefits of Data Clustering in Multimodal Function Optimization via EDAs (J.M. Peña, J.A. Lozano, P. Larrañaga) 101
   1. Introduction 101
   2. Data clustering in evolutionary algorithms for multimodal function optimization 103
   3. BNs and CGNs applied to data clustering 105
   4. Further considerations about the EMDA 111
   5. Experimental results 113
   6. Conclusions 123

5 Parallel Estimation of Distribution Algorithms (J.A. Lozano, R. Sagarna, P. Larrañaga) 129
   1. Introduction 129
   2. Sequential EBNABIC 130
   3. Parallel EBNABIC 133
   4. Numerical evaluation 138
   5. Summary and conclusions 142

6 Mathematical Modeling of Discrete Estimation of Distribution Algorithms (C. González, J.A. Lozano, P. Larrañaga) 147
   1. Introduction 147
   2. Using Markov chains to model EDAs 148
   3. Dynamical systems in the modeling of some EDAs 155
   4. Other approaches to modeling EDAs 159
   5. Conclusions 161

Part II Optimization

7 An Empirical Comparison of Discrete Estimation of Distribution Algorithms (R. Blanco, J.A. Lozano) 167
   1. Introduction 167
   2. Experimental framework 168
   3. Sets of test functions 169
   4. Experimental results 173
   5. Conclusions 177

8 Results in Function Optimization with EDAs in Continuous Domain (E. Bengoetxea, T. Miquélez, P. Larrañaga, J.A. Lozano) 181
   1. Introduction 181
   2. Description of the optimization problems 182
   3. Algorithms to test 183
   4. Brief description of the experiments 185
   5. Conclusions 193

9 Solving the 0-1 Knapsack Problem with EDAs (R. Sagarna, P. Larrañaga) 195
   1. Introduction 195
   2. The 0-1 knapsack problem 196
   3. Binary representation 197
   4. Representation based on permutations 202
   5. Experimental results 203
   6. Conclusions 208

10 Solving the Traveling Salesman Problem with EDAs (V. Robles, P. de Miguel, P. Larrañaga) 211
   1. Introduction 211
   2. Review of algorithms for the TSP 212
   3. A new approach: Solving the TSP with EDAs 217
   4. Experimental results with EDAs 221
   5. Conclusions 226

11 EDAs Applied to the Job Shop Scheduling Problem (J.A. Lozano, A. Mendiburu) 231
   1. Introduction 231
   2. EDAs in job shop scheduling problems 233
   3. Hybridization 234
   4. Experimental results 237
   5. Conclusions 240

12 Solving Graph Matching with EDAs Using a Permutation-Based Representation (E. Bengoetxea, P. Larrañaga, I. Bloch, A. Perchant) 243
   1. Introduction 244
   2. Graph matching as a combinatorial optimization problem with constraints 245
   3. Representing a matching as a permutation 247
   4. Obtaining a permutation with discrete EDAs 254
   5. Obtaining a permutation with continuous EDAs 256
   6. Experimental results. The human brain example 257
   7. Conclusions and further work 262

Part III Machine Learning

13 Feature Subset Selection by Estimation of Distribution Algorithms (I. Inza, P. Larrañaga, B. Sierra) 269
   1. Introduction 269
   2. Feature Subset Selection: Basic components 271
   3. FSS by EDAs in small and medium scale domains 273
   4. FSS by EDAs in large scale domains 282
   5. Conclusions and future work 289

14 Feature Weighting for Nearest Neighbor by EDAs (I. Inza, P. Larrañaga, B. Sierra) 295
   1. Introduction 295
   2. Related work 296
   3. Learning weights by Bayesian and Gaussian networks 299
   4. Experimental comparison 302
   5. Summary and future work 308

15 Rule Induction by Estimation of Distribution Algorithms (B. Sierra, E.A. Jiménez, I. Inza, P. Larrañaga, J. Muruzábal) 313
   1. Introduction 313
   2. A review of Classifier Systems 314
   3. An approach to rule induction by means of EDAs 315
   4. Empirical comparison 318
   5. Conclusions and future work 320

16 Partial Abductive Inference in Bayesian Networks: An Empirical Comparison Between GAs and EDAs (L.M. de Campos, J.A. Gámez, P. Larrañaga, S. Moral, T. Romero) 323
   1. Introduction 324
   2. Query types in probabilistic expert systems 324
   3. Solving queries 326
   4. Tackling the problem with Genetic Algorithms 327
   5. Tackling the problem with Estimation of Distribution Algorithms 330
   6. Experimental evaluation 331
   7. Concluding remarks 338

17 Comparing K-Means, GAs and EDAs in Partitional Clustering (J. Roure, P. Larrañaga, R. Sangüesa) 343
   1. Introduction 343
   2. Partitional clustering 345
   3. Iterative algorithms 345
   4. Genetic Algorithms in partitional clustering 347
   5. Estimation of Distribution Algorithms in partitional clustering 351
   6. Experimental results 352
   7. Conclusions 355

18 Adjusting Weights in Artificial Neural Networks using Evolutionary Algorithms (C. Cotta, E. Alba, R. Sagarna, P. Larrañaga) 361
   1. Introduction 362
   2. An evolutionary approach to ANN training 363
   3. Experimental results 368
   4. Conclusions 373

Index 379


List of Figures

1.1 The modified Ackley function in two dimensions. 5

1.2 Pseudocode for the SGA. 7

1.3 One-point crossover. 8

1.4 The mutation operator. 8

1.5 An example of a four-point crossover operator. 10

1.6 Pseudocode for a general ES. 15

1.7 An example of recombination applied to four parents with global discrete recombination in the search space and intermediary recombination in the strategy parameters. 17

2.1 Structure for a probabilistic graphical model defined over X = (X1, X2, X3, X4, X5, X6). 29

2.2 Checking conditional independencies in a probabilistic graphical model by means of the u-separation criterion for undirected graphs. 30

2.3 Different degrees of complexity in the structure of probabilistic graphical models. 31

2.4 Structure, local probabilities and resulting factorization for a Bayesian network with four variables (X1, X3 and X4 with two possible values, and X2 with three possible values). 33

2.5 Pseudocode for the PC algorithm. 36

2.6 The K2 algorithm. 41

2.7 The Chow and Liu MWST algorithm. 43

2.8 Pseudocode for the Probabilistic Logic Sampling method. 44

2.9 Structure, local densities, and resulting factorization for a Gaussian network with four variables. 46

3.1 Illustration of the EDA approach to optimization. 63


3.2 Pseudocode for the EDA approach. 64

3.3 Pseudocode for UMDA. 65

3.4 Pseudocode for the PBIL algorithm. 67

3.5 Pseudocode for the cGA. 68

3.6 Graphical representation of the probability model of the proposed EDAs in combinatorial optimization without interdependencies (UMDA, PBIL, cGA). 69

3.7 The MIMIC approach to estimation of the joint probability distribution at generation l. The symbols h_l(X) and h_l(X | Y) denote the empirical entropy of X and the empirical entropy of X given Y respectively. Both are estimated from the selected individuals of generation l. 70

3.8 Pseudocode for the COMIT algorithm. 71

3.9 Graphical representation of the probability models for the proposed EDAs in combinatorial optimization with pairwise dependencies (MIMIC, tree structure, BMDA). 73

3.10 Pseudocode for the EcGA. 75

3.11 Pseudocode for the algorithms EBNApc, EBNAK2+pen, and EBNABIC. 77

3.12 Graphical representation of probability models for the proposed EDAs in combinatorial optimization with multiple dependencies (FDA, EBNA, BOA, LFDA and EcGA). 79

3.13 Pseudocode for learning the joint density function in UMDAc. 81

3.14 Graphical representation of the probability models for the proposed EDAs for optimization in continuous domains without dependencies between the variables (UMDAc, SHCLVND, PBILc). 82

3.15 Adaptation of the MIMIC approach to a multivariate Gaussian density function. 84

3.16 Pseudocode for the EMNAglobal approach. 85

3.17 Pseudocode for the EMNAa approach. 86

3.18 Pseudocode for the EMNAi approach. 87

3.19 Pseudocode for the EGNAee, EGNABGe, and EGNABIC algorithms. 88

4.1 Schematics of the EMDA (top) and the BS-EM algorithm (bottom). 109

4.2 Graph structures for Fgrid16 (x) (left) and Feat28 (x) (right). Dashed lines indicate the optimal cuts. 116

4.3 Dynamics of the EMDA in the Ftwo-max problem. The horizontal axis represents the number of ones in a solution whereas the vertical axis denotes the number of corresponding solutions in the population of the generation indicated. 116

4.4 Dynamics of the EMDA in the continuous Ftwo-max problem. The horizontal axis represents the sum of the genes of a solution whereas the vertical axis denotes the number of corresponding solutions in the population of the generation indicated. 121

5.1 Pseudocode for the EBNABIC. 131

5.2 Pseudocode for the sequential structural learning algorithm, SeqBIC. 133

5.3 Pseudocode for manager MNGl. 134

5.4 Pseudocode for explorer EPRl. 135

5.5 Pseudocode for manager MNG2. 136

5.6 Pseudocode for explorer EPR2. 137

5.7 Speed-up produced by PA1BIC (left) and PA2BIC (right) for the OneMax problem. 140

5.8 Speed-up produced by PA1BIC (left) and PA2BIC (right) for the EqualProducts problem. 140

5.9 Best solution produced by PA1BIC (left) and PA2BIC (right) for the OneMax problem. 142

5.10 Best solution produced by PA1BIC (left) and PA2BIC (right) for the EqualProducts problem. 142

6.1 Pseudocode for a general EDA algorithm. 149

6.2 Pseudocode for PBIL. 154

6.3 Pseudocode for UMDA in binary spaces. 155

7.1 Convergence velocity in FOneMax (above) and FPlateau (below). 174

7.2 Convergence velocity in FCheckerBoard (above) and FEqualProducts (below). 175

7.3 Time/dimension scalability in FOneMax (above) and FCheckerBoard (below). 178

7.4 Evaluations/dimension in FOneMax (above) and FCheckerBoard (below). 179

8.1 Plots of the problems to be optimized with continuous EDAs and ES techniques (n = 2). 184

8.2 Evolution of the different continuous EDAs for the Summation cancellation problem with a dimension of 10. 192


9.1 First fit algorithm. 199

10.1 Neighbourhood Search Method. 214

10.2 Pseudocode for the EDA approach. 217

10.3 Translation of an individual to a valid tour. 218

10.4 Possibilities for incorporating TSP heuristics into a GA. 219

10.5 Using local search techniques in EDAs. Heuristic EDAs. 220

10.6 Learning curves for a 120-cities problem. Discrete and continuous EDAs with UMDA learning. 226

11.1 Pseudocode for algorithm HI. 235 11.2 Pseudocode for algorithm H2. 236

12.1 Traditional representation of an individual for the problem of graph matching, when G1 (the model graph) contains 6 nodes and G2 (the data graph representing the segmented image) contains 11 nodes. 246

12.2 Pseudocode to compute the solution represented by a permutation-based individual. 249

12.3 Example of three permutation-based individuals and a similarity measure w(i,j) between nodes of the data graph (for all i, j in V2) for a data graph of 10 nodes, |V2| = 10. 251

12.4 Result of the generation of the individual after the completion of phase 1 for the example in Figure 12.3, where six nodes of G2 have been matched (|V1| = 6). 252

12.5 Generation of the solutions for the example individuals in Figure 12.3 after the first step of phase 2 (|V1| = 6). 252

12.6 Result of the generation of the solutions after the completion of phase 2. 253

12.7 Example of redundancy in the permutation-based approach. The two individuals represent the same solution shown at the bottom of the figure. 253

12.8 Pseudocode to translate from a continuous value in R^n to a discrete permutation composed of discrete values. 257

13.1 In this 3-feature (F1, F2, F3) problem, each individual in the space represents a feature subset, a possible solution for the FSS problem. In each individual, a filled rectangle indicates that the corresponding feature is included in the feature subset. 272

13.2 FSS-EBNA method. 274

14.1 FW-EBNA method. 301

16.1 A small Bayesian network. 326

16.2 Two possible clique trees for the same network. 327


16.3 A plot of %mass for experiment 1. 334

16.4 A plot of %mass for experiment 2. 334

16.5 A plot of %mass for experiment 3. 335

16.6 A plot of mass for experiment 4. 335

16.7 A plot of mass for experiment 5. 337

17.1 K-Means algorithm. 346

18.1 The weights of an ANN are encoded into a linear binary string in GAs, or into a 2k-dimensional real vector in ESs (k weights plus k stepsizes). The EDA encoding is similar to that of the ES, excluding the stepsizes, i.e. a k-dimensional real vector. 366

18.2 Convergence plot of different EAs on the KILN database. 373


List of Tables

2.1 Variables (Xi), number of possible values of variables (ri), set of variable parents of a variable (Pai), number of possible instantiations of the parent variables (qi). 33

3.1 The initial population, D0. 59

3.2 The selected individuals, D0^Se, from the initial population. 60

3.3 The population of the first generation, D1. 61

3.4 The selected individuals, D1^Se, from the population of the first generation. 62

4.1 Performance of the UMDA, EBNABIC and EMDA in the discrete domains considered. The numbers of evaluations and runtimes are average values over 10 independent runs. The numbers of times that each optimum is reached summarize the final results of these 10 runs. 117

4.2 Performance of the UMDAc, EGNABGe and EMDA in the continuous domains considered. The numbers of evaluations and runtimes are average values over 10 independent runs. The numbers of times that each optimum is reached summarize the final results of these 10 runs. 122

5.1 Time-related results for OneMax using PA1BIC. 139

5.2 Time-related results for OneMax using PA2BIC. 139

5.3 Time-related experimental results for EqualProducts using PA1BIC. 140

5.4 Time-related experimental results for EqualProducts using PA2BIC. 141

5.5 Algorithm performance-related results for OneMax using PA1BIC. 143

5.6 Algorithm performance-related results for OneMax using PA2BIC. 143


5.7 Algorithm performance-related results for EqualProducts using PA1BIC. 143

5.8 Algorithm performance-related results for EqualProducts using PA2BIC. 144

7.1 Experimental results on the convergence reliability test. 176

8.1 Mean values of experimental results after 10 executions for the problem Summation cancellation with a dimension of 10 and 50 (optimum fitness value = 1.0E+5). 186

8.2 Mean values of experimental results after 10 executions for the problem Griewangk with a dimension of 10 and 50 (optimum fitness value = 0). 186

8.3 Mean values of experimental results after 10 executions for the problem Sphere model with a dimension of 10 and 50 (optimum fitness value = 0). 187

8.4 Mean values of experimental results after 10 executions for the problem Rosenbrock generalized with a dimension of 10 and 50 (optimum fitness value = 0). 187

8.5 Mean values of experimental results after 10 executions for the problem Ackley with a dimension of 10 and 50 (optimum fitness value = 0). 188

8.6 Mean values of the computation time after 10 executions for the problem Summation cancellation with a dimension of 10 and 50. 192

9.1 0-1 knapsack problem with 7 items. 196

9.2 Knapsack problem. Binary representation. Average of the best results. n = 50. Greedy: 1713. 203

9.3 Knapsack problem. Binary representation. Average of the best results. n = 200. Greedy: 8010. 203

9.4 Knapsack problem. Binary representation. Average of the best results. n = 1000. Greedy: 41425. 204

9.5 Knapsack problem. Representation based on permutation. Mean of the best results. n = 50. Greedy: 1713. 204

9.6 Knapsack problem. Representation based on permutation. Mean of the best results. n = 200. Greedy: 8010. 204

9.7 Knapsack problem. Representation based on permutation. Mean of the best results. n = 1000. Greedy: 41425. 205

10.1 Tour length for the Gröstel24 problem. 222

10.2 No. of generations and execution time for the Gröstel24 problem. 222

10.3 Tour length for the Gröstel48 problem. 224


10.4 No. of generations and execution time for the Gröstel48 problem. 224

10.5 Tour length for the Gröstel120 problem. 225

10.6 No. of generations and execution time for the Gröstel120 problem. 225

11.1 Experimental results with continuous EDAs for FT10 x 10. 237

11.2 Experimental results with continuous EDAs for FT20 x 5. 238

11.3 Experimental results with discrete EDAs for FT10 x 10. 239

11.4 Experimental results with discrete EDAs for FT20 x 5. 239

12.1 Mean values of experimental results after 10 executions for each algorithm of the inexact graph matching problem of the Human Brain example. 260

13.1 Details of small and medium dimensionality experimental domains. 277

13.2 Accuracy percentages of the NB classifier on real datasets without feature selection and using the five FSS methods shown. The last row shows the average accuracy percentages for all six domains. 279

13.3 Cardinalities of finally selected feature subsets for the NB classifier on real datasets without feature selection and using the five FSS methods shown. It must be taken into account that when no FSS is applied to NB, it uses all the features. 279

13.4 Mean stop-generation for FSS-GAs and FSS-EBNA. The standard deviation of the mean is also reported. The initial generation is considered to be the zero generation. 280

13.5 Number of generations needed on average (and their standard deviation) by FSS-GA-o, FSS-GA-u and FSS-EBNA to discover the optimum feature subset in artificial domains. The initial generation is considered as generation zero. 281

13.6 Details of large-dimensionality experimental domains. 283

13.7 Accuracy percentages of the NB classifier on real datasets without feature selection and using FSS-GA-o and FSS-GA-u. The last row shows the average accuracy percentages for all six domains. 284

13.8 Accuracy percentages of the NB classifier on real datasets using FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE. The last row shows the average accuracy percentages for all six domains. 285


13.9 Cardinalities of finally selected feature subsets for the NB classifier on real datasets without feature selection and using FSS-GA-o and FSS-GA-u. It must be taken into account that when no FSS is applied to NB, it uses all the features. 285

13.10 Cardinalities of finally selected features subsets for the NB classifier on real datasets using FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE. 286

13.11 Mean stop-generation for FSS algorithms. The standard deviation of the mean is also reported. The initial generation is considered to be the zero generation. 286

13.12 Average CPU times (in seconds) for the induction of different probabilistic models (standard deviations are nearly zero) in each generation of the EDA search. The last column shows the average CPU time to estimate the predictive accuracy of a feature subset by the NB classifier. 287

13.13 Number of generations needed on average (and their standard deviation) by FSS-GA-o, FSS-GA-u, FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE to discover a feature subset that equals or surpasses the estimated accuracy level of the feature subset which induces the domain. The initial generation is considered to be the zero generation. 289

14.1 Details of experimental domains. 303

14.2 Accuracy percentages of the NN algorithm using the 5 FW methods shown and without FW. The standard deviation of the estimated percentage is also reported. 304

14.3 Mean stop-generation for FW-GA-o, FW-EBNA and FW-EGNA. The standard deviation of the mean is also reported. The initial generation is considered to be the zero generation. 306

14.4 Average CPU times (in seconds) for the induction of different probabilistic models (standard deviations are nearly zero) in each generation of the EDA search. The last column shows the average CPU time to estimate the predictive accuracy of a feature weight set. 307

15.1 Details of experimental domains. 318

15.2 Estimated accuracy of the three EDA approaches using disjunctions of 2 simple rules. The average accuracy and standard deviation of 5 runs of a 10-fold cross-validation procedure are reported. 319


15.3 Estimated accuracy of the three EDA approaches using disjunctions of 4 simple rules. The average accuracy and standard deviation of 5 runs of a 10-fold cross-validation procedure are reported. 319

15.4 CN2 and RIPPER results. The estimated accuracy and standard deviation of a single 10-fold cross-validation procedure are reported. 320

16.1 Some characteristics of the networks used in the experiments. 332

16.2 Description of the experiments. 333

16.3 Results for experiment 1. Population size was 300 for UMDA, 500 for MIMIC, 250 for EBNA and 100 for GA. 336

16.4 Results for experiment 2. Population size was 400 for UMDA, 400 for MIMIC, 200 for EBNA and 200 for GA. 336

16.5 Results for experiment 3. Population size was 500 for UMDA, 500 for MIMIC, 500 for EBNA and 300 for GA. 336

16.6 Results for experiment 4. Population size was 100 for UMDA, 100 for MIMIC, 100 for EBNA and 100 for GA. 337

16.7 Results for experiment 5. Population size was 500 for UMDA, 500 for MIMIC, 300 for EBNA and 200 for GA. 337

17.1 Dataset descriptions. 353

17.2 Results for Cleveland. 355

17.3 Results for Wine. 355

17.4 Results for Iris. 355

17.5 Results for Soybean small. 356

17.6 Results for Voting. 356

18.1 Results obtained with the BC database. 370

18.2 Results obtained with the ECOLI database. 371

18.3 Results obtained with the KILN database. 371


Preface

The study and use of heuristic techniques for optimization have been successfully developed during the last decade. Among these techniques, Evolutionary Computation (Genetic Algorithms, Evolution Strategies, Evolutionary Programming and Genetic Programming) has been the reference.

This book is devoted to a new paradigm for Evolutionary Computation, named Estimation of Distribution Algorithms (EDAs). Based on Genetic Algorithms (GAs), this new class of algorithms generalizes GAs by replacing the crossover and mutation operators with the learning and sampling of the probability distribution of the best individuals of the population at each iteration of the algorithm. Working in this way, the relationships between the variables involved in the problem domain are explicitly and effectively captured and exploited.
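The select/learn/sample loop described above can be sketched in a few lines. The following minimal UMDA-style EDA for binary strings is an illustrative sketch of the general scheme, not code from the book; the function and parameter names are our own, and the model used is the simplest possible one (independent univariate marginals):

```python
import random

def umda(fitness, n_vars, pop_size=100, n_select=50, n_gens=50):
    """Minimal UMDA-style EDA for binary strings (illustrative sketch).

    Each generation: select the best individuals, estimate the marginal
    probability of a 1 at every position from them, and sample a new
    population from those marginals. No crossover or mutation is used.
    """
    random.seed(0)  # fixed seed so the sketch is reproducible
    population = [[random.randint(0, 1) for _ in range(n_vars)]
                  for _ in range(pop_size)]
    for _ in range(n_gens):
        # Selection: keep the best individuals of the current population.
        selected = sorted(population, key=fitness, reverse=True)[:n_select]
        # Learning: univariate marginals estimated from the selected set.
        probs = [sum(ind[i] for ind in selected) / n_select
                 for i in range(n_vars)]
        # Sampling: draw a new population from the estimated distribution.
        population = [[1 if random.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
    return max(population, key=fitness)

# OneMax (count the ones): selection pushes every marginal towards 1.
best = umda(sum, 20)
```

On OneMax the univariate model happens to match the problem structure exactly; the EDAs with pairwise and multivariate models reviewed in Part I (MIMIC, EBNA, BOA, etc.) replace the marginal-estimation step with the learning of a probabilistic graphical model over the selected individuals.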

This text constitutes the first compilation and review of the techniques and applications of this new tool for performing Evolutionary Computation. The book is divided into three parts and comprises a total of 18 chapters.

Part I is dedicated to the foundations of EDAs. In this part, different paradigms for Evolutionary Computation are introduced, and the probabilistic graphical models (Bayesian networks and Gaussian networks) used in learning and sampling the probability distribution of the selected individuals at each iteration of EDAs are presented. In addition, a review of the existing EDA approaches is carried out. EDAs based on the learning of mixture models are also presented, and some approaches to the parallelization of the learning task are introduced. This part concludes with the mathematical modeling of some of the proposed EDA approaches.

Part II brings together several applications of EDAs to optimization problems and reports on the results obtained. Among the problems solved are the traveling salesman problem, the job shop scheduling problem and the knapsack problem, as well as the optimization of some well-known combinatorial and continuous functions. This part ends with a chapter devoted to an EDA-based approach to the inexact graph matching problem.


Part III presents the application of EDAs to problems that arise in the Machine Learning field. Concretely, the problems considered are: feature subset selection, feature weighting in K-NN classifiers, rule induction, partial abductive inference in Bayesian networks, partitional clustering, and the search for optimal weights in artificial neural networks.

This book can be a useful and interesting tool for researchers working in the field of Evolutionary Computation, and also for engineers who face real-world optimization problems in their everyday work and who are here provided with a new and powerful tool. Moreover, this book may be used by graduate students in computer science and by anyone interested in taking part in the development of this new methodology, which, in the coming years, will provide us with interesting and appealing challenges.


Acknowledgments

First and foremost we want to acknowledge the 25 contributors. Without their work and effort this book would not have been possible. A special thanks to Sara-Jayne Farmer and David E. Goldberg, who read an early version of the text and provided much constructive criticism and advice.

Our work was partially supported by the University of the Basque Country and by the Department of Education, University and Research of the Basque Government under grants 9/UPV/EHU 00140.226-12084/2000 and PI 1999-40 respectively.

We are grateful to the editorial staff of Kluwer Academic Publishers, especially Lance Wobus and Sharon Palleschi, for their patience, interest, and helpfulness in bringing this project to a successful conclusion.

Our families have been a constant source of encouragement throughout this project. Pedro's greatest debt is to his wife, Maria, and his daughters, Nagore and Ana. Jose Antonio's deepest gratitude is to his wife, Susana. We warmly appreciate their understanding and support.


Contributing Authors

Enrique Alba has been working with evolutionary algorithms and neural networks since 1991 in the Department of Computer Science of the University of Malaga, Spain. He received his PhD in Computer Science in 1999 from this university with a dissertation on parallel genetic algorithms and their applications. At present he works as an Assistant Professor in this department, and his current interests are evolutionary algorithms, parallel algorithms, and real-life applications. He has received national and international awards for his research activities, and has authored many journal and conference papers on evolutionary algorithms and neural networks.

Endika Bengoetxea works as a Lecturer at the Department of Technology and Computer Architecture at the University of the Basque Country, Spain. He earned his BSc in Computer Science from both the University of Brighton (UK) and the University of the Basque Country in 1994, and his MSc in Medical Imaging from the University of Aberdeen (UK) in 1999. He is currently a PhD student at ENST (Signal and Image Processing Department) in Paris.

Rosa Blanco received her degree in Computer Science from the University of the Basque Country (Spain) in 2000. She is currently a candidate for the PhD degree in the Department of Computer Science and Artificial Intelligence at the University of the Basque Country. Her research interests are probabilistic graphical models, K-NN classifiers and classifier systems.

Isabelle Bloch is Professor at ENST (Signal and Image Processing Department, Image Processing and Interpretation Group) in Paris, France. She graduated from Ecole des Mines de Paris in 1986, received her PhD from ENST Paris in 1990, and the "Habilitation à Diriger des Recherches" from University Paris 5 in 1995. Her research interests include 3D image and object processing, 3D


and fuzzy mathematical morphology, discrete 3D geometry and topology, decision theory, information fusion in image processing, fuzzy set theory, evidence theory, structural pattern recognition, spatial reasoning and medical imaging.

Carlos Cotta received the MSc and PhD degrees in Computer Science in 1994 and 1998, respectively, from the University of Malaga, Spain. Since 1999 he has been an Assistant Professor at the Department of Computer Science of the University of Malaga. His research interests are primarily in evolutionary algorithms, especially hybridization and combinatorial optimization, with secondary interests in parallel and distributed systems.

Luis M. de Campos received his MSc degree in Mathematics in 1984. He defended his PhD thesis, "Characterization and study of fuzzy measures and integrals through probabilities", in 1988, and became Associate Professor in Computer Science in 1991 at the University of Granada, Spain. His current areas of research interest include numerical representations of uncertainty, graphical models and Bayesian networks, machine learning and information retrieval.

Pedro de Miguel has been Professor of Computer Science at the Universidad Politecnica de Madrid, Spain, since 1981. His main research interest is currently mobile systems for vehicle fleet management.

Jose A. Gamez received the MSc degree in Computer Science in 1991, and the PhD degree in Computer Science in 1998, both from the University of Granada, Spain. He is an Assistant Professor at the Department of Computer Science, University of Castilla-La Mancha, Spain. His research interests include probabilistic reasoning, Bayesian networks and evolutionary computation.

Cristina González received the MSc degree in Mathematics in 1999 from the University of the Basque Country, Spain. She is currently a PhD student at the Department of Computer Science and Artificial Intelligence at the University of the Basque Country. Her research interest is focused on mathematical aspects of estimation of distribution algorithms.

Iñaki Inza is a Lecturer at the Department of Computer Science and Artificial Intelligence of the University of the Basque Country, Spain. His research interests reside in machine learning, evolutionary algorithms and Bayesian networks.


Contributing Authors xxix

Elias A. Jimenez received the MSc degree in 2000 in Computer Science from the University of the Basque Country, Spain. He is currently a PhD student at the Department of Computer Science and Artificial Intelligence at the University of the Basque Country. His research interests include evolutionary computation and classifier systems.

Pedro Larrañaga is Associate Professor at the Department of Computer Science and Artificial Intelligence at the University of the Basque Country, Spain, where he leads the Intelligent Systems Group. His research interests include probabilistic graphical models, evolutionary algorithms, optimization and machine learning.

Jose A. Lozano is Associate Professor at the Department of Computer Science and Artificial Intelligence at the University of the Basque Country, Spain. His research interests include probabilistic graphical models, evolutionary algorithms, optimization and machine learning.

Alexander Mendiburu is a Lecturer at the Department of Technology and Computer Architecture of the University of the Basque Country, Spain. He obtained his BSc in Computer Science from the University of the Basque Country in 1995. His main research areas are scheduling and the vehicle routing problem.

Teresa Miquelez is Senior Lecturer at the Department of Technology and Computer Architecture of the University of the Basque Country, Spain. She graduated in Computer Science from the University of the Basque Country in 1983. Her main research interests are artificial intelligence and its application to medicine.

Serafín Moral is Professor of Computer Science and Artificial Intelligence at the University of Granada, Spain. His main research topics are imprecise probabilities (foundations, conditioning, independence, entropy, ...) and graphical dependence structures (propagation algorithms, approximate algorithms, abduction, computation with non-probabilistic representations, ...).

Jorge Muruzabal (PhD Statistics, University of Minnesota, 1992) is Associate Professor at the University Rey Juan Carlos, Madrid, Spain. He is currently a member of EVONET, the European Network of Excellence in Evolutionary Computing, and of SIGKDD, the ACM Special Interest Group on Knowledge Discovery and Data Mining. His main research interests are evolutionary algorithms, neural networks, Bayesian modelling and data mining.

Jose M. Peña gained his Computer Science Engineer degree from the University of the Basque Country, Spain, and his BSc in Computer Science from the University of Brighton, UK. He is currently undertaking research towards a PhD degree in Computer Science at the Department of Computer Science and Artificial Intelligence of the University of the Basque Country.

Aymeric Perchant graduated from ENST, Paris, France. He received his PhD in 2000 for work on morphisms of graphs with fuzzy attributes for the recognition of structural scenes.

Victor Robles worked at the European Organization for Nuclear Research (CERN), Geneva, in the Open System Environment section. He is currently a Lecturer in Operating Systems and Web Services Design at the Department of Architecture and Technology, School of Computer Science, Madrid, Spain. His research interests are local optimization, evolutionary computation, vehicle routing and Internet programming.

Txomin Romero is a PhD student at the University of the Basque Country, Spain, where he is currently a Systems Analyst at the Computer Center. He is also a member of the Intelligent Systems Group at the Department of Computer Science and Artificial Intelligence. His research interests include machine learning, probabilistic graphical models, estimation of distribution algorithms and partial abductive inference.

Josep Roure received the BSc and MSc degrees in Computer Science in 1993 and 1994, respectively, from the Technical University of Catalonia (UPC), Spain. Currently, he is a PhD student in the PhD Program on Artificial Intelligence of the UPC, and he works as Senior Lecturer at the Technical School of Mataró, Spain.

Ramon Sagarna received the MSc degree in 2000 in Computer Science from the University of the Basque Country, Spain. He is currently a PhD student at the Department of Computer Science and Artificial Intelligence at the University of the Basque Country. Among others, his interests include evolutionary computation, neural networks, data analysis, data mining and knowledge discovery.


Ramon Sangüesa is Associate Professor at the Software Department of the Technical University of Catalonia, Spain. Previously he was a researcher at the Spanish Superior Council for Scientific Research. His research interests are artificial intelligence, machine learning and learning agents.

Basilio Sierra is a Lecturer at the Department of Computer Science and Artificial Intelligence of the University of the Basque Country, Spain. He received his BSc in Computer Science in 1990 and his MSc in Computer Science and Architecture in 1992, and returned to the university in 1995 after some industry experience. He received his PhD in Computer Science in 2000 from the University of the Basque Country. His research interests include machine learning, Bayesian networks and evolutionary computation.


Series Foreword Genetic Algorithms and Evolutionary Computation

David E. Goldberg, Consulting Editor University of Illinois at Urbana-Champaign Email: [email protected]

Researchers and practitioners alike are increasingly turning to search, optimization, and machine-learning procedures based on natural selection and natural genetics to solve problems across the spectrum of human endeavor. These genetic algorithms and techniques of evolutionary computation are solving problems and inventing new hardware and software that rival human designs. The Kluwer International Series on Genetic Algorithms and Evolutionary Computation publishes research monographs, edited collections, and graduate-level texts in this rapidly growing field. Primary areas of coverage include the theory, implementation, and application of genetic algorithms (GAs), evolution strategies (ESs), evolutionary programming (EP), learning classifier systems (LCSs) and other variants of genetic and evolutionary computation (GEC). The series also publishes texts in related fields such as artificial life, adaptive behavior, artificial immune systems, agent-based systems, neural computing, fuzzy systems, and quantum computing, as long as GEC techniques are part of, or inspiration for, the system being described.

This volume on estimation of distribution algorithms (EDAs) is a particularly welcome addition to the series, because EDAs have become one of the fastest growing techniques within genetic and evolutionary computation. Indeed, EDAs belong to GEC as they use selection to choose good subsets of samples, but in another sense EDAs throw out the genetics and instead build probabilistic models of the observed best points. For this reason, EDAs are sometimes called probabilistic model-building GAs, but removing the genetics is interesting on two counts. First, by borrowing liberally from techniques developed in statistics, artificial intelligence, and clustering, EDAs can push toward higher competence (better speed, solution quality, and reliability on harder problem instances) without being hampered by considerations of biological plausibility or past practices within the GEC field. Second, EDAs help us understand the role of various genetic operators in creating a kind of distributed model of good solutions across the population, thereby giving a better perspective on exactly what genetics in a population is doing for us.

Estimation of Distribution Algorithms is divided into three parts: (1) foundations, (2) optimization, and (3) machine learning. The coverage is scholarly, logical, and accessible. It should be of interest to newcomers and old timers alike. The foundations section lays down the basics of GEC, the basics of EDAs, and key theory. The breadth of the optimization section is remarkable, covering discrete, continuous, and a variety of combinatorial problems. The machine learning section works with rules, neural networks, subset selection, and inference in Bayesian networks, to name a few. This book is a good introduction to the state of the art at the same time that it covers many of the difficult issues faced by researchers on the front lines. I urge those who are interested in EDAs to study this well-crafted book today.


I

FOUNDATIONS


Chapter 1

An Introduction to Evolutionary Algorithms

J.A. Lozano
Department of Computer Science and Artificial Intelligence

University of the Basque Country

[email protected]

Abstract   In this first chapter an introduction to Evolutionary Algorithms is given. The introduction is focused on optimization. The basic components of the most widely used Evolutionary Algorithms (Genetic Algorithms, Evolution Strategies and Evolutionary Programming) are explained in detail. We give pointers to the literature on their theoretical foundations.

Keywords: Evolutionary Algorithms, Genetic Algorithms, Evolution Strategies, Evolutionary Programming, optimization

1.      Introduction

Evolutionary Algorithms (EAs) are a set of techniques with the common feature that they are all inspired by the natural evolution of species. In the natural world, the evolution of a species is carried out by means of selection and random changes. These two elements can be translated to computers in two different ways. Computers can be used to simulate the evolution of species, but these elements can also be used to build computer systems that, following the principles of natural evolution established by Darwin (1859), evolve to optimize a function. Mathematical population genetics has used the first approach since the 1960s. The second approach has been used extensively by the Computer Science community during the last two decades, mainly in the field of optimization, and this work is the subject of this chapter. Simulating this process on a computer results in stochastic optimization techniques that can often outperform classical methods of optimization when applied to difficult real-world problems (Fogel, 1994).

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


Formally, an optimization problem is given by a pair (Ω, h) composed of a search space Ω and a function h:

\[
h : \Omega \rightarrow \mathbb{R} .
\]

The problem consists of searching for the point (or points) x* ∈ Ω such that, in the case of maximization (if the symbol ≥ is replaced by ≤ in the following expression, we obtain the minimization problem):

\[
\forall x \in \Omega : \; h(x^*) \geq h(x) .
\]

The set Ω can be finite or infinite and can be defined by a set of restrictions. The optimization problem can have many levels of difficulty, for instance multimodality, non-linearity, non-differentiability or complicated restrictions. The function h is commonly called the fitness or objective function in the EA literature.

We introduce two simple examples to understand the kind of optimization problems that EAs try to solve.

In the combinatorial field, the traveling salesman problem (TSP) is probably the most appealing. In the TSP, a set of n cities {C_1, C_2, ..., C_n} and a matrix D = [d_{i,j}], i, j = 1, 2, ..., n, representing the distances between the cities are given. The objective is to find a tour of minimum length that visits each city once and finally returns to the starting city. Formally, if we denote by π a permutation of the n cities, the search space is Ω = {π | π is a permutation of {1, 2, ..., n}}. The function to minimize is:

\[
h(\pi) = \sum_{i=1}^{n-1} d_{\pi_i, \pi_{i+1}} + d_{\pi_n, \pi_1}
\]

where π_i represents the ith element of the permutation π.
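As an illustrative sketch (not part of the original text), the tour length above can be computed in Python; the 4-city distance matrix is invented for the example:

```python
# Tour length for the TSP: sum consecutive distances along the
# permutation and close the cycle back to the starting city.
def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

# Hypothetical symmetric distance matrix for 4 cities.
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]

print(tour_length([0, 1, 3, 2], dist))  # prints 18 (2 + 4 + 3 + 9)
```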

On the other hand, one of the most used (as a benchmark) functions in the numerical field is a generalization (Bäck, 1996) of a function after Ackley (Ackley, 1987). It can be written as:

\[
h(x) = -c_1 \cdot \exp\left(-c_2 \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\right) - \exp\left(\frac{1}{n}\sum_{i=1}^{n} \cos(c_3 \cdot x_i)\right) + c_1 + e .
\]

In this case, the problem is minimization, the search space is Ω = ℝⁿ, a point of the search space is represented as x = (x_1, x_2, ..., x_n) and n is the dimension of the space. Different values can be given to the constants c_1, c_2 and c_3, the most common being c_1 = 20, c_2 = 0.2 and c_3 = π. A plot of the function in two dimensions with these values can be seen in Figure 1.1.
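As an illustrative sketch (not part of the original text), the function can be implemented in Python; the defaults follow the constants quoted in the text (note that another widely used variant of the benchmark takes c_3 = 2π):

```python
import math

def ackley(x, c1=20.0, c2=0.2, c3=math.pi):
    """Generalized Ackley function (to be minimized); h(0, ..., 0) = 0."""
    n = len(x)
    root_mean_square = math.sqrt(sum(xi * xi for xi in x) / n)
    mean_cos = sum(math.cos(c3 * xi) for xi in x) / n
    return -c1 * math.exp(-c2 * root_mean_square) - math.exp(mean_cos) + c1 + math.e

print(round(ackley([0.0, 0.0]), 12))  # 0.0 at the global minimum
```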

Figure 1.1  The modified Ackley function in two dimensions.

Three main different but related approaches have been developed independently in the field of EAs. Genetic Algorithms have their roots in the work of Holland (1975), although a set of earlier related works can be seen in Fogel (1998), and it was the book by Goldberg (1989) which contributed to the broad extension of the field. Genetic Algorithms were initially developed in the field of combinatorial optimization, but their applications quickly spread to the numerical field. Evolution Strategies, developed in Germany by Rechenberg (1973) and Schwefel (1981), were concerned with problems in the continuous domain, and Evolutionary Programming was initiated in the USA by Fogel (1962, 1964). It was initially applied to discrete optimization, but was later broadly applied in the field of numerical optimization. We will present each technique separately, and emphasize their similarities and differences. In addition, the reader can consult a paper by Bäck et al. (1993) where the differences and similarities between Evolution Strategies and Evolutionary Programming are stated, a paper by Hoffmeister and Bäck (1991) for Genetic Algorithms and Evolution Strategies, and the book by Fogel (1995) for a comparison between the three groups of different EAs.

The main advantages of EAs when compared with classical methods of optimization are the following. First, EAs are widely applicable, even in problems where there are no derivatives or where the problem is not completely defined. They can deal with multimodalities, discontinuities and constraints, with noisy functions, multiple-criteria decision making processes or with problems given by a simulation model. Second, they do not make any assumptions about the search space. Third, the cost to adapt EAs to a new optimization problem is relatively low, and they can be tailored to each problem with little effort. Finally, EAs can run interactively, i.e. the parameters can be changed in the course of their execution.


EAs also have some disadvantages. For instance, there is no guarantee of finding the globally optimal solutions and there are no reliable stopping criteria. There is not yet a sufficient theoretical basis. Often, these methods are computationally expensive. It is very difficult to compare different algorithms other than experimentally. It is not possible to know how far the solution obtained by the algorithm is from the global optimum. Finally, and probably the worst characteristic, is the fact that the algorithms depend on a set of parameters that have to be tuned experimentally for the problem at hand, with this tuning itself being a hard optimization problem in some cases. In Evolution Strategies, in particular, this tuning is done using a self-adaptation technique. Nevertheless, the exponential growth of the field is due to the good results obtained in practical problems.

It is important to note that EAs are not a set of techniques ready to be applied, but a set of mechanisms that have to be modified and tailored to the optimization problem at hand.

Before continuing with this chapter, it is useful to note that several books about EAs exist, and that most of the material in this chapter has been adapted from them. The books by Goldberg (1989), Bäck (1996), Rudolph (1997), Schwefel (1995) and the edited volume by Reeves (1993) are all useful, and some reviews (Fogel, 1994; Bäck and Schwefel, 1996; Bäck et al., 1997) can also be of great value.

EAs comprise, of course, many more techniques than those presented here. We want to mention explicitly Genetic Programming (Koza, 1992), because this technique has sometimes been used in the field of optimization.

The rest of this chapter is structured as follows. Genetic Algorithms will be introduced in Section 2. The next section will be dedicated to Evolution Strategies, leaving Evolutionary Programming for Section 4. A summary will be given in the final section.

2.      Genetic Algorithms

As we have said, Genetic Algorithms (GAs) have their basis in the work of Holland (1975), but their popularity is mainly due to Goldberg (1989). Basically, these algorithms try to model the evolution process of natural beings, taking their components and some of the nomenclature from this field. For instance, we talk about populations to refer to a set of possible solutions to a problem; each solution is called an individual or chromosome, and each part of an individual is called a gene.

The basic components of the algorithms are: a population of individuals, a set of random operators to modify the individuals, and a selection procedure over the individuals. Informally speaking, the mechanics of the algorithms is as follows. The algorithms maintain at each step a set of individuals, the population. Some of these individuals are selected and some random operators are applied to them to create new individuals and, consequently, a new population.

t := 0
Initialize P(t)
Evaluate P(t)
while not terminate do
    for i = 1, ..., N/2 do
        Select two parents from P(t)
        Apply crossover to the two parents with probability p_c
        Mutate the offspring with probability p_m
        Introduce the two new individuals into P(t+1)
    od
    t := t + 1
od

Figure 1.2  Pseudocode for the SGA.

2.1 Simple Genetic Algorithm

The Simple Genetic Algorithm (SGA) is the simplest GA. The individuals in the SGA are encoded as 0-1 strings of length n, i.e. each individual can be written as x = (x_1, x_2, ..., x_n) and belongs to the search space Ω = 𝔹ⁿ = {0, 1}ⁿ. A selection operator, proportional-based selection, is applied to the population, and two random recombination operators, called crossover and mutation, are applied to the selected individuals. Pseudocode for the SGA can be seen in Figure 1.2. Its components are explained in detail below.

Proportional-based selection chooses an individual using a probability distribution that depends on the fitness value of each individual in the population. In this way, given an individual x_j, and assuming that the population is composed of N individuals and that we want to maximize the function h, the probability of selecting individual x_j, p_s(x_j), can be written as:

\[
p_s(x_j) = \frac{h(x_j)}{\sum_{k=1}^{N} h(x_k)} .
\]
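Proportional selection is often implemented as roulette-wheel sampling. The following Python sketch is illustrative (the population and fitness function are toy examples, and fitness values are assumed positive):

```python
import random

def proportional_select(pop, h):
    """Roulette wheel: pick one individual with probability
    proportional to its (positive) fitness value h."""
    fits = [h(x) for x in pop]
    total = sum(fits)
    r = random.uniform(0.0, total)
    acc = 0.0
    for x, f in zip(pop, fits):
        acc += f
        if r <= acc:
            return x
    return pop[-1]  # guard against floating-point rounding

random.seed(0)
pop = [(0, 0), (0, 1), (1, 0), (1, 1)]
h = lambda x: 1 + x[0] + x[1]     # toy fitness; (1, 1) is the fittest
print(proportional_select(pop, h))
```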

The crossover operator used in the SGA is called one-point crossover, and it is applied to two individuals. This operator is applied with a probability p_c, known as the crossover probability. A crossing point in the individuals is chosen from a uniform distribution over {1, 2, ..., n-1}, and a new individual is created by joining the part of the first individual on the left of the crossing point with the part of the second individual on the right of the crossing point, i.e. the two selected individuals mix their "genetic information". The new individual is known as the "child" of the original individuals. A second child is created from the remaining information in them. Figure 1.3 shows an example of the application of the one-point crossover.

Figure 1.3  One-point crossover.

The mutation operator is applied to one individual. Given an individual, each gene is flipped from 1 to 0 or from 0 to 1 with a mutation probability of p_m. Figure 1.4 shows an example of the application of the mutation operator.

Figure 1.4  The mutation operator.
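The one-point crossover and bit-flip mutation just described can be sketched in Python (an illustrative implementation, not from the book):

```python
import random

def one_point_crossover(p1, p2):
    """Choose a crossing point uniformly in {1, ..., n-1} and swap tails."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(x, pm):
    """Flip each gene (0 <-> 1) independently with probability pm."""
    return [1 - g if random.random() < pm else g for g in x]

random.seed(0)
child1, child2 = one_point_crossover([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1])
print(child1, child2)                   # complementary children around one cut
print(mutate([0, 0, 1, 1, 0, 0], 0.0))  # pm = 0 leaves the individual intact
```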

2.2 Extensions and modifications of the SGA

Obviously, the algorithm presented in the previous section is the most basic algorithm, and many extensions and modifications of it can be carried out. These extensions and modifications usually refer to the genetic operators (selection, crossover and mutation) and to how the individuals are coded.

2.2.1   Selection operators.   Many different selection operators can be found in the GA literature. A summary and analysis of these can be consulted in Goldberg and Deb (1991) and Bäck (1996). We describe here the mechanisms of the most widely used selection operators.

Elitism is an important concept when using selection operators. An obvious defect (from an optimization point of view) with the SGA is that there is no guarantee that the best member of the population will survive into the next generation. The solution, called elitism, was proposed by De Jong (1975) and consists of forcing the best member of the current population to be a member of the next population.

Another important concept introduced by De Jong (1975) is the generation gap. In the SGA, the offspring totally replace their parents, but it is possible to generate only a proportion G of offspring to replace some selected members of the current population. The extreme case in which G = 1/N, i.e. at each step of the algorithm a single individual is generated, has been proposed by Whitley and Kauth (1988) and has been used extensively. This approach is known as the steady-state GA.

Brindle (1981) proposes a selection operator in which individuals are forced to become parents a number of times based on their expected frequencies as predicted by their fitness function values. This is carried out by following a policy of random sampling without replacement. That is, each individual x_j is first selected an integer number of times,

\[
\left\lfloor \frac{N \cdot h(x_j)}{\sum_{k=1}^{N} h(x_k)} \right\rfloor ,
\]

and the remaining

\[
N - \sum_{j=1}^{N} \left\lfloor \frac{N \cdot h(x_j)}{\sum_{k=1}^{N} h(x_k)} \right\rfloor
\]

individuals are selected by carrying out a random sampling, each individual x_j taking part with probability proportional to the fractional part

\[
\frac{N \cdot h(x_j)}{\sum_{k=1}^{N} h(x_k)} - \left\lfloor \frac{N \cdot h(x_j)}{\sum_{k=1}^{N} h(x_k)} \right\rfloor .
\]
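An illustrative Python sketch of this remainder scheme follows (for brevity the leftover slots are sampled with replacement on the fractional parts, whereas Brindle samples without replacement):

```python
import random

def remainder_sampling(pop, h, rng=random):
    """Stochastic remainder selection: each individual is first selected
    floor(N*h/total) times; leftover slots are filled by sampling on the
    fractional parts (with replacement here, as a simplification)."""
    N = len(pop)
    fits = [h(x) for x in pop]
    total = sum(fits)
    expected = [N * f / total for f in fits]
    selected = []
    for x, e in zip(pop, expected):
        selected.extend([x] * int(e))        # deterministic integer part
    fracs = [e - int(e) for e in expected]   # fractional remainders
    while len(selected) < N:
        selected.append(rng.choices(pop, weights=fracs, k=1)[0])
    return selected

random.seed(0)
fitness = {'a': 4.0, 'b': 2.0, 'c': 1.0, 'd': 1.0}   # toy fitness table
print(remainder_sampling(list(fitness), fitness.get))
```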

In linear ranking selection (Baker, 1987), the probability of selecting an individual is related to the ranking of the fitness function values of the individuals in the population rather than the particular fitness value of each of them. If we rank all the individuals of the population, then in Baker's selection mechanism the probability assigned to each individual is calculated as follows. Let η⁺ denote the expected number of times that the best individual x_{1:N} is selected, i.e. η⁺ = N · p_1, and η⁻ the minimum expected value assigned to x_{N:N}, i.e. η⁻ = N · p_N. Then a linear mapping of the form:

\[
p_j = \frac{1}{N} \left( \eta^{+} - (\eta^{+} - \eta^{-}) \cdot \frac{j-1}{N-1} \right)
\]

gives the probability assigned to individual x_{j:N}, and the constraints \(\sum_{i=1}^{N} p_i = 1\) and \(p_i \geq 0 \; \forall i \in \{1, 2, \ldots, N\}\) require that 1 ≤ η⁺ ≤ 2 and η⁻ = 2 - η⁺.
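The linear mapping above can be sketched in Python (illustrative; rank j = 1 is the best individual):

```python
def linear_ranking_probs(N, eta_plus):
    """Selection probabilities for ranks j = 1..N under Baker's linear
    ranking, with eta_minus = 2 - eta_plus as required by the constraints."""
    assert 1.0 <= eta_plus <= 2.0
    eta_minus = 2.0 - eta_plus
    return [(eta_plus - (eta_plus - eta_minus) * (j - 1) / (N - 1)) / N
            for j in range(1, N + 1)]

print(linear_ranking_probs(5, 2.0))  # [0.4, 0.3, 0.2, 0.1, 0.0]
```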

Tournament selection was proposed by Goldberg and Deb (1991). In this selection operator, a previously fixed number of individuals, Q, is randomly selected and the best of these Q individuals is selected for recombination. This selection was later specialized to Boltzmann tournament selection (Mahfoud, 1993), where once the individuals are chosen for the tournament, the competition is carried out using a Boltzmann distribution. The probabilities of selecting an individual in tournament selection can be seen in Bäck (1996).

Figure 1.5  An example of a four-point crossover operator.

Truncation selection has been introduced by Mühlenbein and Schlierkamp-Voosen (1993). In this selection scheme a number M ≥ N of individuals are generated, and the best N individuals are selected from these to form the next population.

Many more selection mechanisms can be found in the GA literature, but the ones presented here are the most used in practice.
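As illustrative sketches (not from the book), tournament and truncation selection can be written as follows, using a toy population whose fitness is the individual's own value:

```python
import random

def tournament_select(pop, h, q):
    """Draw q individuals uniformly at random and return the fittest."""
    return max(random.sample(pop, q), key=h)

def truncation_select(pop, h, n):
    """Truncation selection: keep the n best of the generated population."""
    return sorted(pop, key=h, reverse=True)[:n]

random.seed(0)
pop = list(range(20))          # toy individuals; fitness is the value itself
fitness = lambda x: x
print(tournament_select(pop, fitness, 3))
print(truncation_select(pop, fitness, 5))  # [19, 18, 17, 16, 15]
```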

Goldberg and Deb (1991), Bäck (1996) and Rudolph (1997) analyzed selection operators using a measure called the takeover time. The takeover time is the number of iterations that an algorithm with only the selection operator needs to reach a uniform population, i.e. a population in which all the individuals are the same. The idea behind the takeover time is related to the selective pressure imposed by the selection operators. A selection operator with a small takeover time exerts strong selective pressure and, conversely, a large takeover time implies weak selective pressure. An ordering of some selection mechanisms from lower to higher selective pressure is: proportional selection, linear ranking, tournament selection and truncation selection.

2.2.2 Crossover operators. As with selection operators, there are also many crossover operators proposed in the GA literature. Many of them are operators tailored to a particular application, but we concentrate here on operators for 0-1 codings.

First, it is easy to propose operators as generalizations of the one-point crossover. For instance, it is easy to describe an operator in which two or more points are selected and the new individual is composed by joining alternating pieces from the first and second parent. An example of the application of a four-point crossover can be seen in Figure 1.5. The most general situation is the operator called uniform crossover (Syswerda, 1991), where for each bit one of the parents is chosen randomly.


All the crossover operators previously introduced use two parents, but there are crossover operators in the literature that use, for instance, all the parents of the population. BSC (Syswerda, 1993) is one of these operators. For each gene, the value for a new individual is chosen from the whole population by using a probability distribution that depends on the objective function values of the individuals.

Rudolph (1997) analyzes three general crossover operators that include some of the ones proposed above. The author reaches the conclusion that in all these cases the probability of drawing a specific gene for a new individual is the same regardless of the crossover operator chosen.

2.2.3   Encodings.   When dealing with a practical optimization problem to be solved with GAs, two alternatives have to be faced. On one hand, it is possible to encode every solution to the problem as a 0-1 string and to use the classical genetic operators. In this case, some encoding and decoding functions, to translate from the 0-1 string to the solution of the problem, have to be designed. On the other hand, each solution can be encoded in a natural, non 0-1, way and new operators can be defined for this encoding.

This second alternative is the one that has motivated the use of non 0-1 encodings in GAs. The most commonly used non-binary encodings are integer codes, real codes and permutation-based codes.

In integer codes, each gene can take an integer value in a set {1, 2, ..., r}. In this case, most of the crossover operators designed for binary strings can be applied to this encoding and only mutation has to be changed. Some studies have been carried out comparing integer with binary encodings, but no definite conclusions have been reached.

Real codes maintain an individual in which each gene is a real number. In this case, both the crossover operator and the mutation operator have to be designed appropriately for each concrete application.

Finally, permutation-based coding is a type of encoding in which each individual represents a permutation. This is the most straightforward encoding in combinatorial problems such as the TSP, the VRP (vehicle routing problem) and others. Much work has been carried out in the design of operators for this encoding, a review of which is Larrañaga et al. (1999). Examples of applications of integer and permutation-based codings can be seen in Lozano et al. (1998).

2.2.4   Parameters.   GAs depend on several parameters that have to be instantiated for each practical application. These parameters include: the size of the population N, the crossover probability p_c and the mutation probability p_m.

A lot of work has been dedicated to the search for a set of optimal parameters for GAs. Researchers have concentrated mainly on the mutation probability p_m and have given little attention to the parameter p_c. This is probably due to the conclusion in Schaffer et al. (1989) that it is much more important to establish an optimum value for p_m than for p_c. In most applications, the parameter p_c takes a value of 1.

Grefenstette (1986) carried out probably the first work related to the choice of an optimal set of parameters. The author used a GA to optimize the parameters. De Jong (1975) proposed a value for p_m of 1/n, and this value was similarly proposed by Mühlenbein (1992) for unimodal functions. Bäck (1993) also showed that this value was the optimal value for the Onemax function. Other works related to the probability of mutation can be found in Hesser and Männer (1990), Suzuki (1995) and Rudolph (1994).

Some authors (Davis, 1989; Fogarty, 1989; Reeves, 1995; Michalewicz and Janikov, 1991) have proposed modifying the mutation probability during the execution of the algorithm. This was based on the idea that the mutation probability would have to take into account the diversity of the population.

2.3 Theoretical modeling of GAs

Much effort has been employed in the theoretical study of GAs, with varying success.

The most commonly used result is the schema theorem proposed by Holland (1975). Holland considers that in each population GAs are sampling subspaces of the search space, and that each individual is sampling several subspaces. These subspaces are defined by means of schemata.

Definition 1.1 For a binary representation, a schema s = (s_1, s_2, ..., s_n) is a string belonging to {0, 1, *}^n that represents a subspace of 𝔹^n = {0, 1}^n, such that a string x ∈ 𝔹^n belongs to the schema if it satisfies the condition:

x_i ≠ s_i  ⇔  s_i = * ,  ∀ i = 1, 2, ..., n .

Informally, the schema theorem establishes that the expected number of individuals belonging to a particular schema s at step t + 1 increases exponentially if the individuals that belong to the schema s at step t have an average fitness value better than the average fitness value of the population at step t. To state the theorem we need some notation:

Definition 1.2 Given a schema s we define the dimension of s, dim(s), as the number of symbols * in s. The length of s, δ(s), is the difference between the rightmost and leftmost fixed (non-*) positions.

In the next theorem, N_s(P(t)) represents the number of individuals that belong to the schema s in the population P(t) at time t, h̄(s, t) represents the average fitness value of the individuals in schema s at time t, and h̄(t) represents the average fitness value of the individuals of P(t).


Theorem 1 (Holland, 1975) The expected number of offspring belonging to some schema s after an iteration of the SGA, E[N_s(P(t + 1)) | P(t)], verifies:

E[N_s(P(t + 1)) | P(t)]  ≥  N_s(P(t)) · (h̄(s, t) / h̄(t)) · (1 − (δ(s) / (n − 1)) p_c) · (1 − p_m)^(n − dim(s)) .
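As a small illustration of these definitions and of the schema-theorem bound (a sketch of ours; the function names are not from the text), the following code checks schema membership, computes dim(s) and δ(s), and evaluates the right-hand side of the inequality:

```python
def matches(x, s):
    """x belongs to schema s iff every fixed (non-*) position of s agrees with x."""
    return all(si == '*' or xi == si for xi, si in zip(x, s))

def dim(s):
    """Dimension of a schema: the number of '*' symbols."""
    return s.count('*')

def length(s):
    """Length delta(s): distance between the leftmost and rightmost fixed positions."""
    fixed = [i for i, si in enumerate(s) if si != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def schema_bound(N_s, h_s, h_bar, s, pc, pm):
    """Right-hand side of the schema theorem: a lower bound on E[N_s(P(t+1)) | P(t)]."""
    n = len(s)
    return N_s * (h_s / h_bar) * (1 - length(s) / (n - 1) * pc) * (1 - pm) ** (n - dim(s))
```

For instance, a schema with above-average fitness (h̄(s, t)/h̄(t) = 2) but large defining length loses much of its advantage under a high crossover probability.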

Starting from this theorem, Goldberg (1989) proposed the building block hypothesis: "short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations". This result has been criticized by Rudolph (1997), who concluded that this hypothesis is only valid for one generation. Related to the notion of schemata and building blocks is the facetwise approach to GAs (Goldberg, 1998).

Besides schema theory, many other mathematical tools have been used to analyze GAs. The most common are Markov chains and dynamical systems.

Markov chains can be used because the population at time t + 1 only depends probabilistically on the population at time t. An explicit expression for the elements of the Markov chain that models the SGA (where the states of the Markov chain are the different populations) was given independently by Vose and Liepins (1991) and Davis and Principe (1993). Using a more abstract Markov chain model, Eiben et al. (1991) analyzed the limit behavior of GAs through the properties of their operators. The authors concluded that the SGA visits the optimum infinitely often, and that if the selection operator is elitist then the GA converges. A relaxation of this condition, while maintaining the convergence properties, can be seen in Lozano et al. (1999). Convergence rates (how quickly the population comes to include the individual with the highest fitness value) have been given by Suzuki (1995) in a general framework for an elitist algorithm, where these convergence rates depend on the eigenvalues of the Markov chain matrix. Rudolph (1997) established some convergence rates for particular sets of functions. Recent work on convergence rates is due to He and Kang (1999), who apply the theory of convergence rates from the field of Markov chains to GAs with infinite search spaces. A review of work on modeling GAs with Markov chains can be seen in Rudolph (1998).

Markov chains have also been used to build simulation models of the behavior of GAs with small populations and dimension (Whitley, 1992; De Jong et al., 1995).

Dynamical systems are another mathematical tool that can be used to theoretically study GAs. The first work using this tool is Bridges and Goldberg (1987). Nix and Vose (1992) analyzed the SGA as a dynamical system. They consider an infinite population SGA and analyze the dynamics of the SGA in various problems. This work was further developed in Vose (1999), where a study of the characteristics of the particular dynamical system (i.e. fixed points, basins of attraction, etc.) is carried out. Some recent works (Prügel-Bennett and Shapiro, 1997; van Nimwegen et al., 1999) have followed this approach with a twofold objective: they looked for the relationship between the finite population model and the infinite population model, and they also tried to give a more macroscopic view than that of Vose (1999).

3. Evolution Strategies

3.1 Introduction

Evolution Strategies (ESs) were introduced by Rechenberg (1973) and further developed by Schwefel (1981). ESs are applied to problems in continuous domains, and represent an alternative, more widely applicable approach than the techniques developed in the field of numerical analysis.

Like GAs, ESs maintain at each step of the algorithm a set of solution candidates, the population. Some of these individuals are selected as parents for reproduction and, after their offspring are generated, the next population is created. The reproduction process comprises two types of operators: recombination operators, which take two or more individuals and produce one individual, and mutation operators, which take one individual and produce one individual. These two operators are stochastic in the sense that the result of their application is not known in advance, only in probability. The selection operator uses the survival-of-the-fittest principle.

The components of an ES are the following: coding (the representation of the individuals), fitness or objective function, parent selection mechanism, definition of the operators (recombination and mutation), selection mechanism and algorithm parameters (population size, mutation rate, recombination rate, etc.).

A pseudocode for a general ES where the previous components have been taken into account can be seen in Figure 1.6. In this figure, reproduction summarizes recombination and mutation, and the selection of individuals to form the next population may include individuals from the parent population.

3.2 ESs components

In this subsection we review the components of ESs. The first component is the encoding. ESs are usually applied in continuous domains, i.e. the search space Ω is included in ℝ^n. Unlike GAs, in ESs the individual contains not only its position in the search space but also some information about the mutation of that individual. In fact, ESs emphasize mutation. Mutation is carried out (in the most general framework) by adding a random vector, drawn from a multivariate normal distribution with mean zero, to the individual. Information related to that multivariate normal distribution is then incorporated into each individual.

t := 0
Initialize P(t)
Evaluate P(t)
while not terminate do
    P'(t) = select parents(P(t))
    P''(t) = reproduction(P'(t))
    Evaluate P''(t)
    P(t + 1) = select(P''(t) ∪ P(t))
    t := t + 1
od

Figure 1.6 Pseudocode for a general ES.

The mutation parameter space S (called the strategy parameter space) is composed of standard deviations and rotation angles that represent the covariance values:

S = ℝ₊^(n_σ) × [−π, π]^(n_α)

where n_σ and n_α represent the number of standard deviations and the number of covariances (rotation angles) used. The individual space is therefore given by Ω × S, and an individual a can be represented by:

a = ( (x_1, x_2, ..., x_n), (σ_1, σ_2, ..., σ_{n_σ}), (α_1, α_2, ..., α_{n_α}) )

where

• x refers to the elements of the search space, i.e. x ∈ Ω.

• σ represents the standard deviations, where n_σ is the number of different standard deviations that are considered.

• α is a vector from which the covariance values can be calculated.

Different simplifications can be carried out on this set of parameters, resulting in different algorithms. The most studied ESs in the literature are cited below.

In the first ES proposed in the literature (Rechenberg, 1973), n_σ = 1 and n_α = 0, i.e. for each component x_i of x, a random number sampled from a univariate normal distribution N(0, σ) was added. The value σ is usually called the search step because it is similar to the search step used in deterministic numerical analysis methods.

A further step consists of using n_σ = n and n_α = 0. In this case, each component of x is modified using a value sampled from a univariate normal distribution whose standard deviation depends on the component index. That is, x_i is modified by adding a number sampled from N(0, σ_i).

The most general case is when n_σ = n (although situations where 1 ≤ n_σ ≤ n could be given) and n_α = n · (n − 1)/2. In this case, a vector sampled from a multivariate normal distribution whose covariance matrix has non-zero off-diagonal elements is added.

One of the principal differences between ESs and GAs is that in ESs the parameters are changed in the course of evolution. This process is called self-adaptation. The strategy parameters incorporated by each individual are mutated as well as its position in the search space. This mutation depends on the type of strategy used.

In the simplest ES (n_σ = 1, n_α = 0) this mutation is carried out using the following rules:

σ' = σ · exp(τ₀ · N(0, 1))          (1.1)

x'_i = x_i + σ' · N_i(0, 1)         (1.2)

where the value of τ₀ used is set near 1/√n.
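The two rules above can be sketched as follows (a minimal illustration of ours, not the book's code): the single step size is first mutated log-normally, and each component of x is then perturbed with the new step size.

```python
import math
import random

def mutate(x, sigma):
    """Self-adaptive mutation for the simplest ES (n_sigma = 1, n_alpha = 0)."""
    tau0 = 1.0 / math.sqrt(len(x))                  # tau_0 set near 1/sqrt(n)
    # Equation (1.1): log-normal mutation of the step size (keeps sigma > 0)
    new_sigma = sigma * math.exp(tau0 * random.gauss(0.0, 1.0))
    # Equation (1.2): perturb each component with the mutated step size
    new_x = [xi + new_sigma * random.gauss(0.0, 1.0) for xi in x]
    return new_x, new_sigma
```

The log-normal update guarantees that the step size stays positive, which is why it is preferred over an additive perturbation of σ.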

In the second case, where n_σ = n and n_α = 0, the mutation is carried out in the following way:

σ'_i = σ_i · exp(τ' · N(0, 1) + τ · N_i(0, 1))

x'_i = x_i + σ'_i · N_i(0, 1) .

Here the values of the parameters used are τ ≈ 1/√(2√n) and τ' ≈ 1/√(2n).

In the most general case (n_σ = n and n_α = n · (n − 1)/2), the standard deviations and also the values corresponding to the covariances have to be mutated:

σ'_i = σ_i · exp(τ' · N(0, 1) + τ · N_i(0, 1))

α'_j = α_j + β · N_j(0, 1)

x' = x + N(0, C') .

In this case, the values of τ and τ' are similar to the previous case and the value of β is approximately 0.0873. These values were suggested by Schwefel (1981). Matrix C' is the covariance matrix of the mutation vector, and the values of its elements can be calculated using the parameters α_ij (the vector α is reindexed in matrix form) and σ_i in the following way:

c'_ij = σ_i²                                if i = j
c'_ij = ((σ_i² − σ_j²)/2) · tan(2 α_ij)     if i ≠ j

Vector N(0, C') is created by first obtaining a vector z_u from N(0, σ̄) (σ̄ represents a diagonal matrix with the (i, i)th entry equal to σ_i) and then using rotation matrices R_ij:

z_c = ( ∏_{i=1}^{n−1} ∏_{j=i+1}^{n} R_ij(α_ij) ) · z_u .
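A sketch of this construction (ours, assuming a row-wise reindexing of the angles α_ij): the axis-aligned Gaussian vector z_u is rotated successively in every (i, j) coordinate plane.

```python
import math
import random

def rotate(z, i, j, alpha):
    """Apply the planar rotation R_ij(alpha) to vector z in the (i, j) plane."""
    c, s = math.cos(alpha), math.sin(alpha)
    zi, zj = z[i], z[j]
    z[i] = c * zi - s * zj
    z[j] = s * zi + c * zj
    return z

def correlated_sample(sigma, alpha):
    """Draw a correlated mutation vector z_c: first z_u from N(0, diag(sigma)),
    then all n(n-1)/2 planar rotations (alpha assumed reindexed row-wise)."""
    n = len(sigma)
    z = [random.gauss(0.0, s) for s in sigma]   # axis-aligned vector z_u
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            z = rotate(z, i, j, alpha[k])
            k += 1
    return z
```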


[Figure 1.7 shows four parent individuals, each with five search-space components x_i and five strategy parameters σ_i. Each component of the offspring's x is copied from a randomly chosen parent, while each component of its σ is the average (σ_i + σ_j)/2 of two parents' values.]

Figure 1.7 An example of recombination applied to four parents with global discrete recombination in the search space and intermediary recombination in the strategy parameters.

Rotation matrices, R_ij(α_ij) = [r_kl], k, l = 1, 2, ..., n, are unit matrices modified by r_ii = r_jj = cos(α_ij) and r_ij = −r_ji = −sin(α_ij) (Rudolph, 1992). It is important to note that modifying the covariance values in this way keeps the covariance matrix positive definite (Rudolph, 1992).

Another component of ESs is recombination. In ESs, recombination was largely forgotten and has only appeared in the ESs field in the last decade. Like mutation, recombination takes into account all the elements of the individual, i.e. both its position in the search space and its mutation parameters. In addition, each component can be recombined in a different way. Recombination takes more than one individual and produces only one individual. There are two main types of recombination used in ESs. The first strategy, called dual recombination, chooses two individuals. In the second strategy, called global recombination, one parent is chosen and held fixed while, for each component of its vectors, a second parent is randomly chosen anew from the complete population. Two different ways are used to create the new individual: discrete or intermediary recombination. In discrete recombination, for each gene, a value is chosen randomly from one of the parents. Conversely, in intermediary recombination, the new gene is calculated by averaging the values of the parents in their corresponding genes. The most common strategy followed in ESs is to use discrete recombination on the individuals' positions in the search space, and global intermediary recombination on their strategy parameters. This strategy is justified by empirical evidence, because no theoretical justification for it yet exists.

An example to illustrate the recombination process in ESs can be seen in Figure 1.7. In this figure global discrete recombination in points of the search space and intermediary recombination in the strategy parameters have been used. The first individual is held fixed, and the individual marked with a square is the one chosen for recombination in each component.
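The common strategy just described can be sketched as follows (a hypothetical helper of ours): the first parent is held fixed, and a second parent is drawn anew for every component, discretely for the search-space part and by averaging for the strategy parameters.

```python
import random

def recombine(parents_x, parents_sigma):
    """Global discrete recombination on x, global intermediary on sigma.
    The first parent is held fixed; a second parent is redrawn per component."""
    first_x, first_s = parents_x[0], parents_sigma[0]
    # discrete: keep the first parent's gene or take it from a random parent
    child_x = [random.choice((xi, random.choice(parents_x)[i]))
               for i, xi in enumerate(first_x)]
    # intermediary: average the first parent's value with a random parent's value
    child_s = [(si + random.choice(parents_sigma)[i]) / 2.0
               for i, si in enumerate(first_s)]
    return child_x, child_s
```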


Selection is another component of ESs. In ESs, the size of the parent population is denoted by μ and the number of descendants created by λ. An important characteristic of selection in ESs is the fact that selection is deterministic. Two main types of selection mechanism are distinguished; they are represented by (μ, λ) and (μ + λ).

In (μ, λ) selection the next population is formed from the best μ individuals of the offspring population. In (μ + λ) selection the best μ individuals from both the parent and offspring populations are chosen for the next generation. It is important to notice two facts: first, the second selection mechanism is an elitist strategy, i.e. the best individuals from both parents and descendants go to the next population, and second, in this second selection mechanism, the value of λ could be smaller than μ, whereas in (μ, λ) selection, λ ≥ μ.

Many controversial arguments have been given in favor of one or the other selection method, but there is no theory supporting either of them. As in GAs, selective pressure, and therefore takeover time, has been the measure used to evaluate ES selection mechanisms; the selection pressure of (μ + λ) selection is much greater than that of (μ, λ). In particular, for an ES with the common values μ = 15 and λ = 100, the takeover time for (μ + λ) is τ = 2 and for (μ, λ) is τ = 460 (Bäck, 1996).

Parent selection is carried out randomly, from a uniform distribution over the population.
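Both mechanisms reduce to a deterministic truncation of a sorted pool; the following sketch (ours, assuming a minimization problem) shows the only difference between them, namely which pool is truncated:

```python
def comma_selection(offspring, fitness, mu):
    """(mu, lambda) selection: the best mu of the offspring only (needs lambda >= mu)."""
    return sorted(offspring, key=fitness)[:mu]

def plus_selection(parents, offspring, fitness, mu):
    """(mu + lambda) selection: the best mu of parents and offspring together (elitist)."""
    return sorted(parents + offspring, key=fitness)[:mu]
```

The elitism of the plus strategy is visible directly: a parent can only be displaced by a strictly better offspring.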

3.3 Theoretical foundations

Theoretical study of ESs has followed two different but related paths: the convergence characteristics of the algorithm, and the rate of convergence or local behavior, both near and far from an optimum.

For the convergence results, the first studies were carried out for the (1 + 1)-ES (with mutation, but without recombination or self-adaptation). For it, Devorye (1976), Solis and Wets (1981) and Pinter (1984), among others, established the basis of its convergence. Rudolph (1997), pp. 162-167, reviewed all these results and rewrote them in a Markov chain and supermartingale framework. In addition, Rudolph extended the theory to the non-elitist strategies (1, λ). These convergence results are, of course, given in probabilistic terms. A sufficient condition for convergence in mean, and complete convergence, to the global optimum of the elitist strategy (1 + λ) is that, at each step of the search, the probability of entering an arbitrarily small neighborhood of the global optimum is greater than a given positive number. In the case of non-elitist strategies, some conditions on the improvement between two generations can be obtained.

Multimembered strategies ((μ, λ) and (μ + λ) with μ > 1) are treated again in Rudolph (1997), pp. 199-205, arriving at results similar to the ones obtained for the one-individual strategies.


For the rate of convergence, most work has been dedicated to studying the behavior of the algorithm near the optimum. This behavior is simulated using the sphere function (Equation 1.3). These studies have tried to give rules for the modification of the standard deviations that will drive the algorithm faster to the optimum.

The first theoretical results (Rechenberg, 1973) were given for the (1 + 1)-ES with n_σ = 1 and only mutation. The author calculated convergence rates for the sphere function (Equation 1.3) and the corridor function (Equation 1.4):

h_1(x) = Σ_{i=1}^{n} x_i²                        (1.3)

h_2(x) = x_1 ,   x ∈ ℝ × [−b, b]^{n−1}           (1.4)

Using the optimal convergence rate, Rechenberg gave a rule for adjusting the standard deviations during evolution: the 1/5-success rule. This means that, on average, one out of five mutations should cause an improvement in the objective function. The rule for adjusting the standard deviation tries to follow this 1/5-success rule: if the ratio of successful mutations is greater than 1/5, then the standard deviation should be increased; otherwise the standard deviation should be decreased.
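The resulting adaptation step can be sketched as follows (ours; the multiplicative factor 0.85 is a commonly used value in the literature, assumed here rather than taken from the text):

```python
def adapt_sigma(sigma, success_ratio, factor=0.85):
    """1/5-success rule: increase the step size when more than 1/5 of recent
    mutations improved the objective, decrease it when fewer did."""
    if success_ratio > 0.2:
        return sigma / factor    # successes are frequent: search more boldly
    if success_ratio < 0.2:
        return sigma * factor    # successes are rare: search more locally
    return sigma
```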

Schwefel (1995) generalized the results obtained by Rechenberg to the case of the (1, λ)-ES with n_σ = 1 and without recombination or self-adaptation.

A series of papers by Beyer (Beyer, 1995a; Beyer, 1995b; Beyer, 1996) analyzed the rate of progress for large dimension problems (n ≫ 30) when different strategies are used: (μ, λ) without recombination (Beyer, 1995b), (μ/μ, λ) with recombination (Beyer, 1995a), and (1, λ) with the simplest self-adaptation (n_σ = 1). All these analyses were carried out using the sphere model. Recently, Oyman et al. (2000) and Oyman and Beyer (2000) have analyzed the (1, λ)-ES and the (μ/μ, λ)-ES respectively on the parabolic ridge. The objective was to simulate the behavior of ESs far from an optimum.

Rudolph (1997) has developed rates of convergence for the (1, λ)-ES on a class of convex functions known as (K, Q)-strongly convex.

4. Evolutionary Programming

Evolutionary Programming (EP) (Fogel, 1962; Fogel, 1964) was proposed as a method to simulate evolution, and as a learning process to generate artificial intelligence. To do that, Fogel carried out a series of experiments where a simulated environment was described by a sequence of symbols from a finite alphabet. The problem was defined as evolving an algorithm that operated on the sequence of symbols observed so far. The objective was to produce an output symbol that was likely to maximize the benefit of the algorithm, given the next symbol to appear in the environment and a well-defined payoff function. Finite state machines (FSMs) provided a useful representation for the required behavior.

EP operated on FSMs as follows. A population of parent FSMs was randomly constructed, where each machine was given the sequence of symbols and a payoff value was assigned to its output. Offspring were created by modifying the machines using simple operations (add a state, delete a state, change the initial state, change a state transition, and change an output symbol). The offspring were evaluated and the best formed the new population. Clearly, this algorithm is quite similar to that of GAs or ESs.

Modern EP was proposed by Fogel (1992), and works mainly in continuous domains. Its application is very similar to that of ESs, and the basic body of the algorithm resembles that of ESs too. The main differences are the following:

• The strategy parameters in EP are mutated using different perturbation equations (Bäck, 1996, p. 94) from the lognormal perturbation used in ESs (Equation 1.1).

• The populations of parents and offspring are of the same size, i.e. λ = μ.

• There is no recombination operator in EP, i.e. only the mutation operator is applied to individuals.

• Selection is not deterministic in EP as it is in ESs. The selection operator used in EP is a type of tournament selection which depends on a parameter Q and is applied to the combined population of parents and offspring. For each individual x, a set S of Q further individuals is chosen at random from the population. Individual x is compared with all the individuals of S, and a value that represents the number of individuals in S with a worse fitness value than x is assigned to x. After that, the individuals of the population are ranked according to the assigned value, and the best μ individuals form the new population.
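This tournament can be sketched as follows (ours; higher fitness is assumed better, and the Q opponents are drawn at random without replacement):

```python
import random

def ep_select(population, fitness, mu, q):
    """EP tournament selection: each individual scores one win per randomly
    chosen opponent with a worse fitness value; the mu highest scorers
    form the new population."""
    scored = []
    for x in population:
        opponents = random.sample(population, q)   # the set S of q opponents
        wins = sum(1 for o in opponents if fitness(o) < fitness(x))
        scored.append((wins, x))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:mu]]
```

Because the opponents are random, a mediocre individual can occasionally survive, which is what makes EP selection stochastic rather than the deterministic truncation of ESs.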

5. Summary

This chapter gave an introduction to the most common EAs that are applied to optimization. The most important components of these have been explained, and some pointers to their theoretical analysis have been given. EAs constitute a growing and exciting field of research that needs more theoretical foundations and more mathematical analysis.

Some pitfalls, such as the choice of values for algorithm parameters, have to be overcome either by means of mathematical results, or by the design of new algorithms where the number of parameters can be reduced.


References

Ackley, D. H. (1987). A Connectionist Machine for Genetic Hillclimbing. Kluwer Academic Press.
Bäck, T. (1993). Optimal mutation rates in genetic search. In Forrest, S., editor, Proceedings of the Fifth International Conference on Genetic Algorithms, pages 2-9. Morgan Kaufmann Publishers.
Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice. Oxford University Press.
Bäck, T., Hammel, U., and Schwefel, H.-P. (1997). Evolutionary computation: Comments on the history and current state. IEEE Transactions on Evolutionary Computation, 1(1):3-17.
Bäck, T., Rudolph, G., and Schwefel, H.-P. (1993). Evolutionary programming and evolution strategies: similarities and differences. In Fogel, D. B. and Atmar, W., editors, Proceedings of the Second Annual Conference on Evolutionary Programming, pages 11-22. Evolutionary Programming Society.
Bäck, T. and Schwefel, H.-P. (1996). Evolutionary computation: An overview. In Proceedings of the Third IEEE Conference on Evolutionary Computation, pages 20-29. IEEE Press.
Baker, J. E. (1987). Reducing bias and inefficiency in the selection algorithm. In Grefenstette, J. J., editor, Proceedings of the Second International Conference on Genetic Algorithms and Their Applications, pages 14-21. Lawrence Erlbaum Associates.
Beyer, H.-G. (1995a). Toward a theory of evolution strategies: On the benefits of sex - the (μ/μ, λ) theory. Evolutionary Computation, 3(1):81-111.
Beyer, H.-G. (1995b). Toward a theory of evolution strategies: The (μ, λ)-theory. Evolutionary Computation, 2(4):381-407.
Beyer, H.-G. (1996). Toward a theory of evolution strategies: Self-adaptation. Evolutionary Computation, 3(3):311-347.
Bridges, C. L. and Goldberg, D. E. (1987). An analysis of reproduction and crossover in a binary-coded genetic algorithm. In Grefenstette, J. J., editor, Proceedings of the Second International Conference on Genetic Algorithms, pages 9-13. Lawrence Erlbaum Associates.
Brindle, A. (1991). Genetic algorithm for function optimization. Doctoral Dissertation, University of Alberta.
Darwin, C. (1859). The Origin of Species by Means of Natural Selection or the Preservation of Favoured Races in the Struggle for Life. Mentor Reprint, 1958, New York.
Davis, L. (1989). Adapting operator probabilities in genetic algorithms. In Schaffer, J. D., editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 61-69. Morgan Kaufmann.
Davis, T. E. and Principe, J. C. (1993). A Markov chain framework for the simple genetic algorithm. Evolutionary Computation, 1(3):269-288.
De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. Doctoral Dissertation, University of Michigan.
De Jong, K. A., Spears, W. M., and Gordon, D. F. (1995). Using Markov chains to analyze GAFOs. In Whitley, D. and Vose, M. D., editors, Foundations of Genetic Algorithms 3, pages 115-138. Morgan Kaufmann.
Devorye, L. P. (1976). On the convergence of statistical search. IEEE Transactions on Systems, Man, and Cybernetics, 6(1):46-56.
Eiben, A. E., Aarts, E. H. L., and Van Hee, K. M. (1991). Global convergence of genetic algorithms: A Markov chain analysis. In Schwefel, H.-P. and Männer, R., editors, Parallel Problem Solving from Nature, PPSN I. Lecture Notes in Computer Science, volume 496, pages 4-12. Springer-Verlag.
Fogarty, T. C. (1989). Varying the probability of mutation in the genetic algorithm. In Schaffer, J. D., editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 104-109. Morgan Kaufmann.
Fogel, D. B. (1992). Evolving Artificial Intelligence. PhD Thesis, University of California, San Diego, CA.
Fogel, D. B. (1994). An introduction to evolutionary computation. IEEE Transactions on Neural Networks, 5(1):3-14.
Fogel, D. B. (1995). Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, Piscataway, New Jersey.
Fogel, D. B. (1998). Evolutionary Computation: The Fossil Record. IEEE Press.
Fogel, L. J. (1962). Autonomous automata. Industrial Research, 4:14-19.
Fogel, L. J. (1964). On the Organization of Intellect. Doctoral Dissertation, University of California, Los Angeles, CA.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
Goldberg, D. E. (1998). The Race, the Hurdle, and the Sweet Spot: Lessons from Genetic Algorithms for the Automation of Design Innovation and Creativity. Technical Report IlliGAL No. 98007, University of Illinois at Urbana-Champaign.
Goldberg, D. E. and Deb, K. (1991). A comparative analysis of selection schemes used in genetic algorithms. In Rawlins, G. J. E., editor, Foundations of Genetic Algorithms, pages 69-93. Morgan Kaufmann.
Grefenstette, J. J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1):122-128.
He, J. and Kang, L. (1999). On the convergence rates of genetic algorithms. Theoretical Computer Science, 229(1-2):23-39.
Hesser, J. and Männer, R. (1990). Towards an optimal mutation probability for genetic algorithms. In Parallel Problem Solving from Nature, PPSN I. Lecture Notes in Computer Science, volume 496, pages 23-32. Springer-Verlag.
Hoffmeister, F. and Bäck, T. (1991). Genetic algorithms and evolution strategies: Similarities and differences. In Schwefel, H.-P. and Männer, R., editors, Parallel Problem Solving from Nature, PPSN I. Lecture Notes in Computer Science, volume 496, pages 455-470. Springer.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
Larrañaga, P., Kuijpers, C. M. H., Murga, R. H., Inza, I., and Dizdarevic, S. (1999). Genetic algorithms for the travelling salesman problem: A review of representations and operators. Artificial Intelligence Review, 13:129-170.
Lozano, J. A., Larrañaga, P., Albizuri, F. X., and Graña, M. (1999). Genetic algorithms: Bridging the convergence gap. Theoretical Computer Science, 229(1-2):11-22.
Lozano, J. A., Larrañaga, P., and Graña, M. (1998). Partitional cluster analysis with genetic algorithms: searching for the number of clusters. In Hayashi, C., Ohsumi, N., Yajima, K., Tanaka, Y., Bock, H. H., and Baba, Y., editors, Data Science, Classification and Related Methods, pages 117-125. Springer.
Mahfoud, S. W. (1993). Finite Markov chain models of an alternative selection strategy for the genetic algorithm. Complex Systems, 7:155-170.
Michalewicz, Z. and Janikov, C. Z. (1991). Handling constraints in genetic algorithms. In Belew, R. and Booker, L. B., editors, Proceedings of the Fourth International Conference on Genetic Algorithms, pages 151-157. Morgan Kaufmann.
Mühlenbein, H. (1992). How genetic algorithms really work. I: Mutation and hillclimbing. In Männer, R. and Manderick, B., editors, Parallel Problem Solving from Nature II, pages 15-25. North-Holland.
Mühlenbein, H. and Schlierkamp-Voosen, D. (1993). Predictive models for the breeder genetic algorithm. I: Continuous parameter optimization. Evolutionary Computation, 1(1):25-49.
Nix, A. E. and Vose, M. D. (1992). Modeling genetic algorithms with Markov chains. Annals of Mathematics and Artificial Intelligence, 5:79-88.
Oyman, A. I. and Beyer, H.-G. (2000). Analysis of the (μ/μ, λ)-ES on the parabolic ridge. Evolutionary Computation, 8(3):267-289.
Oyman, A. I., Beyer, H.-G., and Schwefel, H.-P. (2000). Analysis of the (1, λ)-ES on the parabolic ridge. Evolutionary Computation, 8(3):249-265.
Pinter, J. (1984). Convergence properties of stochastic optimization procedures. Mathematische Operationsforschung und Statistik, Series Optimization, 15:53-61.
Prügel-Bennett, A. and Shapiro, J. L. (1997). The dynamics of genetic algorithms in simple random Ising systems. Physica D, 104(1):75-114.
Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart.
Reeves, C. R. (1993). Modern Heuristic Techniques for Combinatorial Optimization. Blackwell Scientific Publications.
Reeves, C. R. (1995). A genetic algorithm for flowshop sequencing. Computers & Operations Research, 22:5-13.
Rudolph, G. (1992). On correlated mutations in evolution strategies. In Männer, R. and Manderick, B., editors, Parallel Problem Solving from Nature, PPSN II, pages 105-114. North-Holland.
Rudolph, G. (1994). Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96-101.
Rudolph, G. (1997). Convergence Properties of Evolutionary Algorithms. Kovac, Hamburg.
Rudolph, G. (1998). Finite Markov chain results in evolutionary computation: A tour d'horizon. Fundamenta Informaticae, 35(1-4):67-89.
Schaffer, J. D., Caruana, R. A., Eshelman, L. J., and Das, R. (1989). A study of control parameters affecting online performance of genetic algorithms for function optimization. In Schaffer, J. D., editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 51-60. Morgan Kaufmann.
Schwefel, H.-P. (1981). Numerical Optimization of Computer Models. John Wiley & Sons, Inc.
Schwefel, H.-P. (1995). Evolution and Optimum Seeking. John Wiley & Sons, Inc.
Solis, F. J. and Wets, R. J.-B. (1981). Minimization by random search techniques. Mathematics of Operations Research, 6:19-30.
Suzuki, J. (1995). A Markov chain analysis on simple genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 25(4):655-659.
Syswerda, G. (1991). Schedule optimization using genetic algorithms. In Davis, L., editor, Handbook of Genetic Algorithms, pages 332-349. Van Nostrand Reinhold.
Syswerda, G. (1993). Simulated crossover in genetic algorithms. In Whitley, L. D., editor, Foundations of Genetic Algorithms 2, pages 239-255. Morgan Kaufmann.
van Nimwegen, E., Crutchfield, J. P., and Mitchell, M. (1999). Statistical dynamics of the royal road genetic algorithms. Theoretical Computer Science, 229(1-2):41-102.
Vose, M. D. (1999). The Simple Genetic Algorithm: Foundations and Theory. MIT Press.
Vose, M. D. and Liepins, G. E. (1991). Punctuated equilibria in genetic search. Complex Systems, 5:31-44.
Whitley, D. and Kauth, J. (1988). GENITOR: A different genetic algorithm. In Proceedings of the Rocky Mountain Conference on Artificial Intelligence, volume II, pages 118-130.
Whitley, L. D. (1992). An executable model of a simple genetic algorithm. In Whitley, D., editor, Foundations of Genetic Algorithms 2, pages 45-62. Morgan Kaufmann.

Page 54: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation

Chapter 2

An Introduction to Probabilistic Graphical Models

P. Larrañaga
Department of Computer Science and Artificial Intelligence
University of the Basque Country
[email protected]

Abstract: In this chapter we will introduce two probabilistic graphical models -Bayesian networks and Gaussian networks- that will be used to carry out factorization of the probability distribution of the selected individuals in Estimation of Distribution Algorithms based approaches. For both paradigms we will present different algorithms to induce the underlying model from data, as well as some methods to simulate such models.

Keywords: Probabilistic graphical models, Bayesian networks, Gaussian networks, model induction, simulation

1. Introduction

In this chapter, we will introduce the probabilistic graphical model paradigm (Howard and Matheson, 1981; Pearl, 1988; Lauritzen, 1996), which has become a popular representation for encoding uncertain knowledge in expert systems over the last decade. We will restrict ourselves to probabilistic graphical models whose structural part is a directed acyclic graph.

Once the paradigm has been introduced in an abstract way, we will present two probabilistic graphical models that will be used in this book to obtain factorizations of the probability distribution of selected individuals in Estimation of Distribution Algorithms (EDAs) based approaches: Bayesian networks and Gaussian networks. In this chapter we emphasize the aspects related to model induction and simulation of the two previously mentioned paradigms, due to their applicability to EDA approaches.

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


The organization of this chapter is as follows. In Section 2 we introduce a general notation that will allow us to consider Bayesian networks and Gaussian networks as instances of it. In this section the semantics of probabilistic graphical models, based on the concept of conditional (in)dependence, will be explained by means of the separation criterion. We will also present three different degrees of complexity in these probabilistic graphical models -trees, polytrees and multiply connected structures- depending on the structure of the model. In Section 3 the Bayesian network paradigm is introduced. We will pay attention to two tasks that are important for the EDA approach: structure learning from data and simulation. Section 4, where the Gaussian network paradigm is introduced, is organized in a similar way to the previous one. We conclude in Section 5 with some considerations about the kind of research in probabilistic graphical models that can be of interest for EDAs.

2. Notation

In this section, we introduce a general notation that is useful for the two probabilistic graphical models developed in this chapter -Bayesian and Gaussian networks.

We use X_i to represent a random variable. A possible instance of X_i is denoted x_i. ρ(X_i = x_i) (or simply ρ(x_i)) represents the generalized probability distribution (DeGroot, 1970, p. 19) over the point x_i. Similarly, we use X = (X_1, ..., X_n) to represent an n-dimensional random variable, and x = (x_1, ..., x_n) to represent one of its possible instances. The joint generalized probability distribution of X is denoted ρ(X = x) (or simply ρ(x)). The generalized conditional probability distribution of the variable X_i given the value x_j of the variable X_j is represented as ρ(X_i = x_i | X_j = x_j) (or simply ρ(x_i | x_j)). We use D to represent a data set, i.e. a set of N instances of the variable X = (X_1, ..., X_n).

If the variable X_i is discrete, ρ(X_i = x_i) = p(X_i = x_i) (or simply p(x_i)) is called the mass probability for the variable X_i. If all the variables in X are discrete, ρ(X = x) = p(X = x) (or simply p(x)) is the joint probability mass, and ρ(X_i = x_i | X_j = x_j) = p(X_i = x_i | X_j = x_j) (or simply p(x_i | x_j)) is the conditional mass probability of the variable X_i given that X_j = x_j.

In the case that X_i is continuous, ρ(X_i = x_i) = f(X_i = x_i) (or simply f(x_i)) is the density function of X_i. If all the variables in X are continuous, ρ(X = x) = f(X = x) (or simply f(x)) is the joint density function, and ρ(X_i = x_i | X_j = x_j) = f(X_i = x_i | X_j = x_j) (or simply f(x_i | x_j)) is the conditional density function of the variable X_i given that X_j = x_j.

Let X = (X_1, ..., X_n) be a vector of random variables. We use x_i to denote a value of X_i, the ith component of X, and y = (x_i)_{X_i ∈ Y} to denote a value of Y ⊆ X. A probabilistic graphical model for X is a graphical factorization of the joint generalized probability distribution, ρ(X = x) (or simply


ρ(x)). The representation consists of two components: a structure and a set of local generalized probability distributions. The structure S for X is a directed acyclic graph (DAG) that represents a set of conditional (in)dependence (Dawid, 1979) assertions on the variables of X.

The structure S for X represents, for each variable X_i, the assertion that X_i and {X_1, ..., X_n} \ Pa_i^S are independent given Pa_i^S, i = 2, ..., n, where Pa_i^S denotes the set of parents of X_i in S. Thus, the factorization is as follows:

p(x) = p(x_1, ..., x_n)
     = p(x_1) · p(x_2 | x_1) · ... · p(x_i | x_1, ..., x_{i-1}) · ... · p(x_n | x_1, ..., x_{n-1})
     = p(x_1) · p(x_2 | pa_2^S) · ... · p(x_i | pa_i^S) · ... · p(x_n | pa_n^S) = ∏_{i=1}^n p(x_i | pa_i^S).   (2.1)

The local generalized probability distributions associated with the probabilistic graphical model are precisely those in the previous equation.
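To make this factorization concrete, here is a minimal Python sketch; the three-variable DAG and its local conditional tables are hypothetical, chosen only to illustrate how a product of local conditionals defines the joint distribution.

```python
# Hypothetical 3-variable DAG: X1 -> X2, X1 -> X3 (all variables binary).
# p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1)

parents = {1: (), 2: (1,), 3: (1,)}

# Local conditional tables: local[i][parent values][value of X_i].
local = {
    1: {(): [0.6, 0.4]},
    2: {(0,): [0.7, 0.3], (1,): [0.2, 0.8]},
    3: {(0,): [0.5, 0.5], (1,): [0.9, 0.1]},
}

def joint(x):
    """Evaluate p(x) as the product of the local conditionals."""
    p = 1.0
    for i in sorted(parents):
        pa_vals = tuple(x[j - 1] for j in parents[i])
        p *= local[i][pa_vals][x[i - 1]]
    return p

# The factorized joint still sums to 1 over all instances.
total = sum(joint((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```

The same product form scales to any DAG: each factor only needs a table indexed by the values of the node's parents.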

In this presentation, we assume that the local generalized probability distributions depend on a finite set of parameters θ_S ∈ Θ_S. Thus, we rewrite the previous equation as follows:

p(x | θ_S) = ∏_{i=1}^n p(x_i | pa_i^S, θ_i)   (2.2)

where θ_S = (θ_1, ..., θ_n). Taking both components of the probabilistic graphical model into account, the model will be represented by M = (S, θ_S).

Figure 2.1 Structure for a probabilistic graphical model defined over X = (X_1, X_2, X_3, X_4, X_5, X_6).

Example 2.1 The structure of the probabilistic graphical model represented in Figure 2.1 provides us with the following factorization of the joint generalized probability distribution:

p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) · p(x_2) · p(x_3 | x_1, x_2) · p(x_4 | x_3) · p(x_5 | x_3, x_6) · p(x_6)   (2.3)


Step 1. Obtain the smallest subgraph containing Y, Z, W and their ancestors

Step 2. Moralize the obtained subgraph

Step 3. If every path between the variables in Y and the variables in Z in the obtained undirected graph is blocked by a variable in W, then I(Y, Z | W)

Figure 2.2 Checking conditional independencies in a probabilistic graphical model by means of the u-separation criterion for undirected graphs.

To understand the underlying semantics of a probabilistic graphical model we introduce the separation criterion for undirected graphs. According to this criterion -see Figure 2.2- to check the conditional independence between variables Y and Z given W, we need to consider the smallest subgraph containing Y, Z, W and their ancestors. This subgraph must then be moralized: an edge is added between parents with a common child, and the direction of the arcs is removed, that is, arcs are transformed into edges. If every path between the variables in Y and the variables in Z in the resulting undirected graph is blocked by a variable in W, then we say that in the original graph the variables Y and Z are conditionally independent given W.
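The three steps of Figure 2.2 can be sketched in Python as follows; the dictionary-of-parents graph encoding is an assumption of this sketch, and the example queries use the structure of Figure 2.1.

```python
from itertools import combinations

def u_separated(parents, Y, Z, W):
    """Check I(Y, Z | W): ancestral subgraph, moralization, blocked paths."""
    # Step 1: smallest subgraph containing Y, Z, W and their ancestors.
    nodes = set(Y) | set(Z) | set(W)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in nodes:
                nodes.add(p)
                stack.append(p)
    # Step 2: moralize (marry parents of common children), drop directions.
    edges = {frozenset((v, p)) for v in nodes for p in parents.get(v, ())}
    for v in nodes:
        for p1, p2 in combinations(parents.get(v, ()), 2):
            edges.add(frozenset((p1, p2)))
    # Step 3: every path between Y and Z must be blocked by W, so search
    # the undirected graph with the nodes of W removed.
    reachable = set(Y) - set(W)
    frontier = list(reachable)
    while frontier:
        v = frontier.pop()
        for e in edges:
            if v in e:
                u = next(iter(e - {v}))
                if u not in W and u not in reachable:
                    reachable.add(u)
                    frontier.append(u)
    return not (reachable & set(Z))

# Structure of Figure 2.1: X3 <- X1, X3 <- X2, X4 <- X3, X5 <- X3, X5 <- X6.
parents = {3: (1, 2), 4: (3,), 5: (3, 6)}
print(u_separated(parents, {4}, {5}, {3}))  # True: X4 independent of X5 given X3
print(u_separated(parents, {1}, {2}, {3}))  # False: X1, X2 married via X3
```

The second query shows why moralization matters: X_1 and X_2 have no directed connection, yet conditioning on their common child X_3 makes them dependent.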

Depending on the connectivity of the model structure we can consider different degrees of complexity in probabilistic graphical models:

• Tree

In this type of structure, each variable can have at most one parent variable. As a consequence, the factorization is as follows:

p(x | θ_S) = ∏_{i=1}^n p(x_i | x_{j(i)}, θ_i)   (2.4)

where X_{j(i)} is the (possibly empty) parent of variable X_i.

• Polytree

In this type of structure, the factorization is as follows:

p(x | θ_S) = ∏_{i=1}^n p(x_i | x_{j1(i)}, x_{j2(i)}, ..., x_{jr(i)}, θ_i)   (2.5)


where {X_{j1(i)}, X_{j2(i)}, ..., X_{jr(i)}} is the (possibly empty) set of parents of the variable X_i in the polytree, and the parents of each variable are mutually independent, i.e.

p(x_{j1(i)}, x_{j2(i)}, ..., x_{jr(i)}) = ∏_{k=1}^r p(x_{jk(i)}),  ∀i = 1, ..., n   (2.6)

• Multiply connected

While in tree and polytree structures, given two nodes in the DAG, there is at most one path connecting them, in multiply connected structures two nodes in the DAG can be connected by more than one path. As a result, the factorization is as follows:

p(x | θ_S) = ∏_{i=1}^n p(x_i | pa_i^S, θ_i).   (2.7)

a) Tree structure  b) Polytree  c) Multiply connected

Figure 2.3 Different degrees of complexity in the structure of probabilistic graphical models.

Figure 2.3 shows a graphical representation of the different types of structures introduced in this section.

3. Bayesian networks

3.1 Introduction

Bayesian networks have attracted growing interest in recent years, as shown by the large number of dedicated books and the wide range of theoretical and practical publications in this field. Textbooks include the classic Pearl (1988). Neapolitan (1990) explains the basics of propagation algorithms and these are studied in detail by Shafer (1996). Jensen (1996) is a recommended tutorial introduction, while in Castillo et al. (1997) another sound


introduction with many worked examples can be found. Lauritzen (1996) provides a mathematical analysis of graphical models. More recently, Cowell et al. (1999) is an excellent compilation covering recent advances in the field. Other good sources of tutorial material are Charniak (1991), Henrion et al. (1991), Pearl (1993) and Heckerman and Wellman (1995).

The Bayesian network paradigm is used mainly to reason in domains with intrinsic uncertainty. The reasoning inside the model, that is, the propagation of evidence through the model, depends on the structure reflecting the conditional (in)dependencies between the variables. Cooper (1990) proved that this task is NP-hard in the general case of multiply connected Bayesian networks. The most popular algorithm to accomplish this task was proposed by Lauritzen and Spiegelhalter (1988) -later improved by Jensen et al. (1990)- and is based on a manipulation of the Bayesian network which starts with the moralization of the graphical model and its posterior triangulation, followed by the finding of the cliques of the moral graph, and finally the building of a junction tree from the cliques.

In contrast with the common use of Bayesian networks for reasoning tasks, in this section we concentrate on model induction from data and the simulation of the induced models, due to the importance of both problems in EDA based approaches.

3.2 Notation

As an instance of the probabilistic graphical model introduced in Section 2, we have that, in the particular case in which each variable X_i ∈ X is discrete, the probabilistic graphical model is called a Bayesian network.

If the variable X_i has r_i possible values, x_i^1, ..., x_i^{r_i}, the local distribution, p(x_i | pa_i^{j,S}, θ_i), is an unrestricted discrete distribution:

p(x_i^k | pa_i^{j,S}, θ_i) = θ_{x_i^k | pa_i^j} ≡ θ_ijk   (2.8)

where pa_i^{1,S}, ..., pa_i^{q_i,S} denote the values of Pa_i^S, the set of parents of the variable X_i in the structure S. The term q_i denotes the number of possible different instances of the parent variables of X_i. Thus, q_i = ∏_{X_g ∈ Pa_i} r_g.

The local parameters are given by θ_i = ((θ_ijk)_{k=1}^{r_i})_{j=1}^{q_i}. In other words, the parameter θ_ijk represents the conditional probability of variable X_i being in its kth value, given that the set of its parent variables is in its jth value.

Example 2.2 In order to understand the introduced notation, we obtain from Figure 2.4 the values expressed in Table 2.1.

Notice also that, from the factorization of the joint probability mass derived from the structure of the Bayesian network in Figure 2.4, the number of parameters needed to specify the joint probability mass is reduced from 23 to 11.
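The reduction from 23 to 11 parameters can be checked directly; the sketch below takes the cardinalities and parent sets of Table 2.1 as input.

```python
# Cardinalities and parent sets from Figure 2.4 / Table 2.1.
r = {1: 2, 2: 3, 3: 2, 4: 2}
parents = {1: [], 2: [], 3: [1, 2], 4: [3]}

def q(i):
    """Number of joint instantiations of Pa_i (1 for root nodes)."""
    prod = 1
    for g in parents[i]:
        prod *= r[g]
    return prod

# Full joint table: one free parameter fewer than the number of cells.
full = 1
for ri in r.values():
    full *= ri
full -= 1

# Factorized model: dim(S) = sum_i q_i * (r_i - 1).
dim_S = sum(q(i) * (r[i] - 1) for i in r)

print(full, dim_S)  # 23 11
```

The gap widens rapidly with more variables, which is the practical motivation for factorized representations.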


Figure 2.4 Structure, local probabilities and resulting factorization for a Bayesian network with four variables (X_1, X_3 and X_4 with two possible values, and X_2 with three possible values).

Table 2.1 Variables (X_i), number of possible values of the variables (r_i), set of variable parents of a variable (Pa_i), number of possible instantiations of the parent variables (q_i).

variable   possible values   parent variables   possible values of the parents
X_i        r_i               Pa_i               q_i
X_1        2                 ∅                  0
X_2        3                 ∅                  0
X_3        2                 {X_1, X_2}         6
X_4        2                 {X_3}              2

From Figure 2.4, we see that to assess a Bayesian network, the user needs to specify:

• A structure by means of a directed acyclic graph which reflects the set of conditional (in)dependencies among the variables,

• the unconditional probabilities for all root nodes (nodes with no predecessors), p(x_i | ∅, θ_i), and

• the conditional probabilities for all other nodes, given all possible combinations of their direct predecessors, θ_ijk = p(x_i^k | pa_i^{j,S}, θ_i).


3.3 Model induction

Once the Bayesian network is built, it constitutes an efficient device to perform probabilistic inference. Nevertheless, the problem of building such a network remains. The structure and conditional probabilities necessary for characterizing the Bayesian network can be provided either externally by experts -time consuming and subject to mistakes- or by automatic learning from a database of cases. The learning task can be separated into two subtasks: structure learning, that is, identifying the topology of the Bayesian network, and parametric learning, estimating the numerical parameters (conditional probabilities) for a given network topology.

The easier access to huge databases in recent years has led to a large number of model learning algorithms being proposed. We classify the different approaches to Bayesian network model induction according to the nature of the modeling (detecting conditional (in)dependencies versus score+search methods).

The reader can consult some good reviews on model induction in Bayesian networks in Heckerman (1995), Buntine (1996), Sangüesa and Cortés (1998) and Krause (1998).

3.3.1 Detecting conditional (in)dependencies. Every algorithm that tries to recover the structure of the Bayesian network by detecting (in)dependencies takes as input some conditional (in)dependence relations between subsets of variables of the model, and outputs a directed acyclic graph that represents a large percentage (even all of them, if possible) of these relations. Once the structure has been learnt, the conditional probability distributions required to completely specify the model are estimated from the database -using some of the different approaches to parameter learning- or are given by an expert.

Following de Campos (1998), the input information for the algorithms belonging to this category can have one of the following forms:

• A database from which, with the help of some statistical tests -see Kreiner (1989) for a review of conditional independence tests-, it is possible to determine the correctness of some conditional (in)dependence relationships,

• An n-dimensional probability distribution where it is possible to test the veracity of the conditional (in)dependence relationships, and

• A list containing relations of conditional dependence and independence between triplets of variables.


Although from a formal point of view there are no differences between the three types of input information, from a practical point of view the differences can be seen more clearly. Some of them are related to:

• The cost of performing the statistical tests, which increases with the number of variables that we take into account to carry out the tests, and

• The reliability of the results of the tests, which are less robust if the number of variables is too high.

The different algorithms can be classified by considering some criteria such as the type of directed acyclic graph they recover, their efficiency relative to the number and the order of conditional (in)dependencies to check, the reliability of the solution, the robustness of the solution, and so on.

In this section we present the main characteristics of the PC algorithm introduced by Spirtes et al. (1991). Like almost all recovery algorithms based on independence detection, the PC algorithm starts by forming the complete undirected graph, then "thins" that graph by removing edges with zero order conditional independence relations, "thins" again with first order conditional independence relations, and so on. The set of variables conditioned on needs only to be a subset of the set of variables adjacent to one or the other of the variables conditioned.

We can see the pseudocode of the PC algorithm in Figure 2.5. Adj(G, A) represents the set of vertices adjacent to the vertex A in the undirected graph G. Note that the graph G is continually updated, so Adj(G, A) is constantly changing as the algorithm progresses.

The book by Spirtes et al. (1993) provides a good review for the induction of Bayesian networks by detecting conditional (in)dependencies.

3.3.2 Score+search methods. Although the approach to model elicitation based on detecting conditional (in)dependencies is quite appealing due to its closeness to the semantics of Bayesian networks, a large percentage of the developed structure learning algorithms belong to the category of score+search methods.

To use this learning approach we need to define a metric that measures the goodness of every candidate Bayesian network with respect to a datafile of cases. In addition, we also need a procedure to move in an intelligent way through the space of possible networks.

Search approaches

In the majority of the score+search approaches the search is performed in the space of directed acyclic graphs that represent the feasible Bayesian network


PC algorithm
1. Form the complete undirected graph G on the vertex set V = {X_1, ..., X_n}
2. r := 0
3. repeat
     repeat
       (a) select an ordered pair of variables X_i and X_j that are adjacent in G such that |Adj(G, X_i)\{X_j}| ≥ r, and a subset S(X_i, X_j) ⊆ Adj(G, X_i)\{X_j} of cardinality r
       (b) if I(X_i, X_j | S(X_i, X_j)) delete the edge X_i - X_j from G, and record S(X_i, X_j) in Sepset(X_i, X_j) and Sepset(X_j, X_i)
     until all ordered pairs of adjacent variables X_i and X_j such that |Adj(G, X_i)\{X_j}| ≥ r and all S(X_i, X_j) of cardinality r have been tested for u-separation
     r := r + 1
   until for each ordered pair of adjacent vertices X_i, X_j, we have |Adj(G, X_i)\{X_j}| < r
4. For each triplet of vertices X_i, X_j, X_l such that the pair X_i, X_j and the pair X_j, X_l are both adjacent in G but the pair X_i, X_l are not adjacent in G, orient X_i - X_j - X_l as X_i → X_j ← X_l if and only if X_j is not in Sepset(X_i, X_l)
5. repeat
     if X_i → X_j, X_j and X_l are adjacent, X_i and X_l are not adjacent, and there is no arrowhead at X_j, then orient X_j - X_l as X_j → X_l
     if there is a directed path from X_i to X_j, and an edge between X_i and X_j, then orient X_i - X_j as X_i → X_j
   until no more edges can be oriented

Figure 2.5 Pseudocode for the PC algorithm.


structures. The number of possible structures for a domain with n variables is given by the following recursive formula obtained by Robinson (1977):

f(n) = Σ_{i=1}^n (-1)^{i+1} C(n, i) 2^{i(n-i)} f(n-i);  f(0) = 1;  f(1) = 1.   (2.9)
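A direct transcription of Robinson's recursion makes the super-exponential growth of the DAG space concrete; `math.comb` supplies the binomial coefficient C(n, i).

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def f(n):
    """Number of DAGs on n labelled nodes (Robinson, 1977)."""
    if n <= 1:
        return 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * f(n - i)
               for i in range(1, n + 1))

for n in range(1, 6):
    print(n, f(n))
# 1 1 / 2 3 / 3 25 / 4 543 / 5 29281
```

Already for 10 variables the count exceeds 4 × 10^18, which is why exhaustive search over structures is hopeless.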

Other possibilities include searching in the space of equivalence classes of Bayesian networks (Chickering, 1996) -when a score that verifies the likelihood equivalence property is used- or in the space of orderings of the variables (Larrañaga et al., 1996a).

The problem of finding the best network according to some criterion, from the set of all networks in which each node has no more than K parents (K > 1), is NP-hard (Chickering et al., 1995). This result motivates the use of different heuristic search algorithms.

These heuristic search methods can be more efficient when the model selection criterion, C(S, D), is separable, that is, when the model selection criterion can be written as a product of variable-specific criteria, such as:

C(S, D) = ∏_{i=1}^n c(X_i, Pa_i, D_{X_i ∪ Pa_i})   (2.10)

where D_{X_i ∪ Pa_i} denotes the dataset D restricted to the variables X_i and Pa_i.

Among all the heuristic search strategies used to find good models in the space of Bayesian network structures, we have the following alternatives: greedy search (Buntine, 1991; Cooper and Herskovits, 1992), simulated annealing (Chickering et al., 1995), tabu search (Bouckaert, 1995), genetic algorithms (Larrañaga et al., 1996b; Myers et al., 1999) and evolutionary programming (Wong et al., 1999).

Because it is widely used in the modeling phase of EDA based approaches, we briefly introduce Algorithm B (Buntine, 1991). Algorithm B is a greedy search heuristic which starts with an arc-less structure and, at each step, adds the arc giving the maximum improvement in the score used. The algorithm stops when adding an arc no longer increases the score.

Another way to induce models from data quickly and efficiently -something that is crucial in EDA based approaches- consists of the use of local search strategies. Starting with a given structure, the addition or deletion of the arc with the maximum increase in the scoring measure is performed at every step. Local search strategies stop when no modification of the structure improves the scoring measure. The main drawback of local search strategies is that they depend heavily on the initial structure.


Score metrics

• Penalized maximum likelihood

Given a database D with N cases, D = {x_1, ..., x_N}, one might calculate, for any structure S, the maximum likelihood estimate, θ̂, for the parameters θ and the associated maximized log likelihood, log p(D | S, θ̂). This can be used as a crude measure of the success of the structure S in describing the observed data D. It seems appropriate to score each structure by means of its associated maximized log likelihood and thus, to seek out (using an appropriate search strategy) the structure that maximizes log p(D | S, θ̂).

Using the notation introduced in Section 2 we obtain:

log p(D | S, θ̂) = log ∏_{w=1}^N p(x_w | S, θ̂) = log ∏_{w=1}^N ∏_{i=1}^n p(x_{w,i} | pa_i^S, θ_i) = Σ_{i=1}^n Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log θ_ijk   (2.11)

where N_ijk denotes the number of cases in D in which the variable X_i has the value x_i^k and Pa_i has its jth value, and N_ij = Σ_{k=1}^{r_i} N_ijk.

Taking into account that the maximum likelihood estimate for θ_ijk is given by θ̂_ijk = N_ijk / N_ij, we obtain:

log p(D | S, θ̂) = Σ_{i=1}^n Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log (N_ijk / N_ij).   (2.12)

When the model is complex, the sampling error associated with the maximum likelihood estimator implies that the maximum likelihood estimate is not really a believable value for the parameter -even when sample sizes appear large. On the other hand, the monotonicity of the likelihood with respect to the complexity of the structure usually leads the search towards complete networks. A common response to these difficulties is to incorporate some form of penalty for model complexity into the maximized likelihood.

There is a wide range of suggested penalty functions. A general formula for a penalized maximum likelihood score is as follows:

Σ_{i=1}^n Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log (N_ijk / N_ij) - f(N) dim(S)   (2.13)

where dim(S) is the dimension -the number of parameters needed to specify the model- of the Bayesian network with a structure given by S. We have dim(S) = Σ_{i=1}^n q_i(r_i - 1). On the other hand, f(N) is a non-negative


penalization function. Some examples of f(N) are Akaike's Information Criterion (AIC) (Akaike, 1974), where f(N) = 1, and the Jeffreys-Schwarz criterion, sometimes called the Bayesian Information Criterion (BIC) (Schwarz, 1978), where f(N) = (1/2) log N.
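The penalized score of Eq. (2.13) is straightforward to compute from the sufficient statistics N_ijk; the counts below are hypothetical, and the layout of the `counts` dictionary is an assumption of this sketch.

```python
from math import log

def penalized_ll(counts, f_N):
    """Penalized log likelihood: sum_ijk N_ijk log(N_ijk/N_ij) - f(N) dim(S).
    counts[i][j] is the list (N_ij1, ..., N_ij r_i) for node i, parent config j."""
    score, dim = 0.0, 0
    for per_node in counts.values():
        for Nij_ks in per_node:
            Nij = sum(Nij_ks)
            dim += len(Nij_ks) - 1   # contributes q_i * (r_i - 1) overall
            for Nijk in Nij_ks:
                if Nijk:
                    score += Nijk * log(Nijk / Nij)
    return score - f_N * dim

# Two binary nodes, X1 -> X2, hypothetical counts from N = 100 cases.
counts = {
    1: [[40, 60]],              # X1 has no parents: a single configuration
    2: [[30, 10], [15, 45]],    # X2 given X1 = 0 and given X1 = 1
}
N = 100
bic = penalized_ll(counts, 0.5 * log(N))   # BIC: f(N) = (1/2) log N
aic = penalized_ll(counts, 1.0)            # AIC: f(N) = 1
print(round(bic, 2), round(aic, 2))  # -130.44 -126.53
```

Since log N grows without bound, BIC penalizes extra parameters more heavily than AIC for large samples, and so tends to select sparser structures.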

• Bayesian scores. Marginal likelihood

In the Bayesian approach to Bayesian network model induction from data, we express our uncertainty about the model (structure and parameters) by defining a variable whose states correspond to the possible network structure hypotheses S^h, and assessing the probability p(S^h).

After this is done, given a random sample D = {x_1, ..., x_N} from the physical probability distribution for X, we compute the posterior distribution of the structure given the database, p(S^h | D), and the posterior distribution of the parameters given the structure and the database, p(θ_S | D, S^h). By making use of these distributions, the expectations of interest can be computed.

Using the Bayes rule, we have:

p(S^h | D) = p(S^h) · p(D | S^h) / Σ_S p(S) · p(D | S)   (2.14)

p(θ_S | D, S^h) = p(θ_S | S^h) · p(D | θ_S, S^h) / p(D | S^h)   (2.15)

where p(D | S^h) = ∫ p(D | θ_S, S^h) · p(θ_S | S^h) dθ_S.

In the Bayesian model averaging approach we estimate the joint distribution for X, p(x), by averaging over all possible models and their parameters:

p(x) = Σ_S p(S | D) ∫ p(x | θ_S, S) p(θ_S | D, S) dθ_S.   (2.16)

If we try to apply this Bayesian model averaging approach to the induction of Bayesian networks, we must sum over all possible structures, which results in an intractable approach. Two common approximations to the former equation are used instead. The first is known as selective model averaging (Madigan and Raftery, 1994), where only a reduced set S of promising structures is taken into account and the previous equation is approximated in the following way:

p(x) ≈ Σ_{S ∈ S} p(S | D) ∫ p(x | θ_S, S) p(θ_S | D, S) dθ_S.   (2.17)


In the second approximation, known as Bayesian model selection, we select a single "good" model S^h and estimate the joint distribution for X using:

p(x) ≈ ∫ p(x | θ_S, S^h) p(θ_S | D, S^h) dθ_S.   (2.18)

This second approximation is the only one that can be applied to EDA based approaches when the model search is done from a Bayesian point of view, due to the large computational costs associated with Bayesian model averaging and selective model averaging.

A score commonly used in Bayesian model selection is the logarithm of the relative posterior probability of the model:

log p(S | D) ∝ log p(S, D) = log p(S) + log p(D | S).   (2.19)

Under the assumption that the prior distribution over the structure is uniform, an equivalent criterion is the log marginal likelihood of the data given the structure.

It is possible -see Cooper and Herskovits (1992) and Heckerman et al. (1995) for details- to compute the marginal likelihood efficiently and in closed form under some general assumptions.

In the following we present the K2 algorithm (Cooper and Herskovits, 1992), due to its use in EDA approaches. Given a Bayesian network model, if the cases occur independently, there are no missing values, and the density of the parameters given the structure is uniform, then the previous authors showed that

p(D | S) = ∏_{i=1}^n ∏_{j=1}^{q_i} [(r_i - 1)! / (N_ij + r_i - 1)!] ∏_{k=1}^{r_i} N_ijk!   (2.20)

The K2 algorithm assumes that an ordering on the variables is available and that, a priori, all structures are equally likely. It searches, for every node, the set of parent nodes that maximizes the following function:

g(i, Pa_i) = ∏_{j=1}^{q_i} [(r_i - 1)! / (N_ij + r_i - 1)!] ∏_{k=1}^{r_i} N_ijk!   (2.21)

The K2 algorithm is a greedy heuristic. It starts by assuming that a node has no parents, after which, at every step, it incrementally adds the parent whose addition most increases the probability of the resulting structure. The K2 algorithm stops adding parents to the nodes when the


Algorithm K2

INPUT: A set of n nodes, an ordering on the nodes, an upper bound u on the number of parents a node may have, and a database D containing N cases
OUTPUT: For each node, a printout of the parents of the node

BEGIN K2
  FOR i := 1 TO n DO
    BEGIN
      Pa_i := ∅
      P_old := g(i, Pa_i)
      OKToProceed := TRUE
      WHILE OKToProceed AND |Pa_i| < u DO
        BEGIN
          Let Z be the node in Pred(X_i)\Pa_i that maximizes g(i, Pa_i ∪ {Z})
          P_new := g(i, Pa_i ∪ {Z})
          IF P_new > P_old THEN
            BEGIN
              P_old := P_new
              Pa_i := Pa_i ∪ {Z}
            END
          ELSE OKToProceed := FALSE
        END
      WRITE('Node:', X_i, 'Parents of this node:', Pa_i)
    END
END K2

Figure 2.6 The K2 algorithm.


addition of a single parent cannot increase the probability. Obviously, this approach does not guarantee obtaining the structure with the highest probability.
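The greedy search above can be sketched in code. The following is a minimal Python illustration (our own sketch, not the authors' implementation), assuming discrete variables encoded as integers; the score g of Eq. 2.21 is computed in log space with lgamma to avoid factorial overflow:

```python
import math
import random

def log_g(i, parents, data, arity):
    """log of the K2 score g(i, Pa_i) of Eq. 2.21. data is a list of
    cases (tuples of ints), arity[i] is the number of states r_i of X_i."""
    r_i = arity[i]
    counts = {}  # parent configuration j -> per-state counts N_ijk
    for case in data:
        j = tuple(case[p] for p in parents)
        if j not in counts:
            counts[j] = [0] * r_i
        counts[j][case[i]] += 1
    score = 0.0
    for n_ijk in counts.values():
        n_ij = sum(n_ijk)
        # log[(r_i - 1)! / (N_ij + r_i - 1)!] + sum_k log(N_ijk!)
        score += math.lgamma(r_i) - math.lgamma(n_ij + r_i)
        score += sum(math.lgamma(n + 1) for n in n_ijk)
    return score

def k2(data, arity, order, u):
    """Greedy K2 structure search: for each node, add the predecessor
    (in the given ordering) that most improves g, until no improvement
    or the bound u on the number of parents is reached."""
    parents = {i: [] for i in order}
    for pos, i in enumerate(order):
        p_old = log_g(i, parents[i], data, arity)
        pred = order[:pos]
        while len(parents[i]) < u:
            candidates = [z for z in pred if z not in parents[i]]
            if not candidates:
                break
            z_best = max(candidates,
                         key=lambda z: log_g(i, parents[i] + [z], data, arity))
            p_new = log_g(i, parents[i] + [z_best], data, arity)
            if p_new > p_old:
                p_old = p_new
                parents[i].append(z_best)
            else:
                break
    return parents

# Toy database: X1 copies X0 with probability 0.9, X2 is independent
random.seed(3)
data = []
for _ in range(300):
    x0 = random.randint(0, 1)
    x1 = x0 if random.random() < 0.9 else 1 - x0
    x2 = random.randint(0, 1)
    data.append((x0, x1, x2))
parents = k2(data, arity=[2, 2, 2], order=[0, 1, 2], u=2)
```

With this data the search recovers X0 as the single parent of X1, and X0, having no predecessors in the ordering, stays parentless.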

• Scores based on information theory

Scores that compare two probability distributions are called scoring rules. We denote by S(p(x), p'(x)) the score of the two probability distributions to be compared: p(x) (the true one) and p'(x) (the one associated with the alternative model). A score S is called a proper score if S(p(x), p(x)) ≥ S(p(x), p'(x)) for all p'(x). Although an infinite number of functions may serve as proper scores (McCarthy, 1956), the logarithmic score has received particular attention in the literature:

$$S(p(x), p'(x)) = \sum_{x} p(x) \log p'(x). \qquad (2.22)$$

One interesting property of the logarithmic score is that it is equivalent to the Kullback-Leibler cross-entropy measure (Kullback, 1951):

$$D_{K\text{-}L}(p(x), p'(x)) = \sum_{x} p(x) \log \frac{p(x)}{p'(x)} = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log p'(x). \qquad (2.23)$$

This formula represents the difference between the information contained in the actual distribution p(x) and the information contained in the approximate distribution p'(x). Since the expression $\sum_x p(x) \log p(x)$ does not depend on the approximate representation, the logarithmic scoring rule is a linear transformation of the Kullback-Leibler cross-entropy measure, and minimizing the Kullback-Leibler cross-entropy measure is equivalent to maximizing the logarithmic scoring rule.

Because of its interest for the development of EDA approaches, we present the MWST (Maximum Weight Spanning Tree) algorithm proposed by Chow and Liu (1968). These authors asked the following question: if we measured (or estimated) a probability distribution, p(x), what is the tree-dependent distribution, p^t(x), that best approximates p(x)? As a distance criterion between p(x) and p^t(x), Chow and Liu chose the Kullback-Leibler cross-entropy measure D_{K-L}(p(x), p^t(x)) and showed that it is minimized by projecting p(x) on any MWST, where the weight on the branch (X_i, X_j) is defined by the mutual information measure:

$$I(X_i, X_j) = \sum_{x_i, x_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}. \qquad (2.24)$$


Algorithm MWST

Step 1. From the given (observed) distribution p(x), compute the pairwise distributions p(x_i, x_j) for all variable pairs
Step 2. Using the pairwise distributions, compute all n(n-1)/2 branch weights and order them by magnitude
Step 3. Assign the largest two branches to the tree to be constructed
Step 4. Examine the next-largest branch, and add it to the tree unless it forms a loop, in which case discard it and examine the next-largest branch
Step 5. Repeat Step 4 until n-1 branches have been selected (at this point the spanning tree has been constructed)
Step 6. p^t(x) can be computed by selecting an arbitrary root node and forming the product: p^t(x) = ∏_{i=1}^{n} p(x_i | x_{j(i)})

Figure 2.7 The Chow and Liu MWST algorithm.

The algorithm to obtain p^t(x) is shown in Figure 2.7.
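As an illustration, the MWST procedure of Figure 2.7 can be sketched as follows, estimating the pairwise distributions and mutual information (Eq. 2.24) from a database of discrete cases. The function and variable names are ours; a union-find structure replaces the explicit loop check:

```python
import math
import random
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i, X_j) of Eq. 2.24 from a
    list of discrete cases (tuples of ints)."""
    n = len(data)
    pi, pj, pij = Counter(), Counter(), Counter()
    for case in data:
        pi[case[i]] += 1
        pj[case[j]] += 1
        pij[(case[i], case[j])] += 1
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        # p(a,b) log[ p(a,b) / (p(a) p(b)) ], with counts c, pi[a], pj[b]
        mi += p_ab * math.log(p_ab * n * n / (pi[a] * pj[b]))
    return mi

def chow_liu_tree(data, n_vars):
    """MWST step of the Chow and Liu algorithm: order all n(n-1)/2
    branch weights by mutual information and add branches greedily,
    discarding those that would close a loop (Kruskal's procedure)."""
    weights = sorted(((mutual_information(data, i, j), i, j)
                      for i, j in combinations(range(n_vars), 2)),
                     reverse=True)
    comp = list(range(n_vars))  # union-find representative of each node

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    tree = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:            # no loop: accept the branch
            comp[ri] = rj
            tree.append((i, j))
        if len(tree) == n_vars - 1:
            break
    return tree

# Toy chain X0 -> X1 -> X2: each variable copies its parent w.p. 0.9
random.seed(7)
data = []
for _ in range(2000):
    x0 = random.randint(0, 1)
    x1 = x0 if random.random() < 0.9 else 1 - x0
    x2 = x1 if random.random() < 0.9 else 1 - x1
    data.append((x0, x1, x2))
tree = chow_liu_tree(data, 3)
```

On this chain-generated data the recovered spanning tree contains exactly the branches (X0, X1) and (X1, X2), since the direct X0-X2 mutual information is lower.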

Other works where scores based on information theory are used to induce Bayesian networks can be found in Herskovits and Cooper (1990), Lam and Bacchus (1994) and Bouckaert (1995).

3.4 Simulation

The simulation of Bayesian networks can be considered as an alternative to the exact propagation methods developed to reason with these networks. For our purposes, related to EDA-based approaches, it is enough to obtain a database that reflects the probabilistic relations between the variables.

A good number of approaches to the simulation of Bayesian networks have been developed over recent years. Among them, we can mention the following: the likelihood weighting method, developed independently by Fung and Chang (1990) and Shachter and Peot (1990) and later investigated by Shwe and Cooper (1991), the backward-forward sampling method introduced by Fung and Del Favero (1994), the Markov sampling method proposed by Pearl (1987), and the systematic sampling method by Bouckaert (1994). See Bouckaert et al. (1996) for a comparison of the previous methods on different random Bayesian networks. In Chavez and Cooper (1990), Dagum and Horvitz (1993), Hrycej


Algorithm PLS

Find an ancestral ordering, π, of the nodes in the Bayesian network
FOR j = 1, 2, ..., N
  FOR i = 1, 2, ..., n
    x_{π(i)} ← generate a value from p(x_{π(i)} | pa_{π(i)})

Figure 2.8 Pseudocode for the Probabilistic Logic Sampling method.

(1990), Jensen et al. (1993) and Salmeron et al. (2000), other simulation methods can be consulted.

In this section, we introduce the Probabilistic Logic Sampling (PLS) method proposed by Henrion (1988). In this method the instances are generated one variable at a time in a forward way, that is, a variable is sampled after all its parents have already been sampled. Thus, the variables must be ordered in such a way that the values for Pa_{π(i)} are assigned before X_{π(i)} is sampled. An ordering of the variables satisfying this property is called an ancestral ordering. This simulation strategy is called forward sampling since it goes from parents to children. Once the values of Pa_{π(i)} have been assigned, we simulate a value for X_{π(i)} using the distribution p(x_{π(i)} | pa_{π(i)}). Figure 2.8 shows a pseudocode of the method.
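A minimal sketch of PLS for binary variables, assuming the conditional distributions are given as tables mapping parent configurations to p(X_i = 1); the names and the toy two-node network are illustrative, not from the original:

```python
import random

def probabilistic_logic_sampling(order, parents, cpts, n_samples, rng):
    """Forward sampling (PLS, Henrion 1988): nodes are visited in an
    ancestral ordering, and each X_i is drawn from p(x_i | pa_i) once
    all its parents have been instantiated. cpts[i] maps a tuple of
    parent values to the probability that X_i = 1 (binary variables)."""
    db = []
    for _ in range(n_samples):
        case = {}
        for i in order:                                  # ancestral ordering
            pa_values = tuple(case[p] for p in parents[i])
            case[i] = 1 if rng.random() < cpts[i][pa_values] else 0
        db.append(case)
    return db

# Two-node network X0 -> X1 with p(x0=1) = 0.3 and p(x1=1 | x0) = 0.2 / 0.9
parents = {0: [], 1: [0]}
cpts = {0: {(): 0.3}, 1: {(0,): 0.2, (1,): 0.9}}
rng = random.Random(5)
db = probabilistic_logic_sampling([0, 1], parents, cpts, 20000, rng)
```

The simulated database reproduces the marginals implied by the network, e.g. p(x1 = 1) = 0.3 · 0.9 + 0.7 · 0.2 = 0.41 up to sampling noise.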

4. Gaussian networks

4.1 Introduction

In this section we introduce one example of the probabilistic graphical model paradigm where it is assumed that the joint density function follows a multivariate Gaussian density (Whittaker, 1990). This paradigm is known as a Gaussian network (Shachter and Kenley, 1989). In it, the number of parameters needed to specify a multivariate Gaussian density is reduced.

We have organized the presentation of this paradigm in a similar manner to the case of Bayesian networks. After introducing the notation used throughout the section, we present some methods for model induction and simulation of Gaussian networks.

4.2 Notation

The other particular case of probabilistic graphical models to be considered in this chapter is when each variable Xi E X is continuous and each local density function is the linear-regression model:


$$f(x_i \mid pa_i, \theta_i) = N\Big(x_i;\; m_i + \sum_{x_j \in pa_i} b_{ji}(x_j - m_j),\; v_i\Big) \qquad (2.25)$$

where $N(x; \mu, \sigma^2)$ is a univariate normal distribution with mean $\mu$ and variance $\sigma^2$. Given this form, a missing arc from $X_j$ to $X_i$ implies that $b_{ji} = 0$ in the former linear-regression model. The local parameters are given by $\theta_i = (m_i, b_i, v_i)$, where $b_i = (b_{1i}, \ldots, b_{i-1,i})^t$ is a column vector. We call a probabilistic graphical model constructed with these local density functions a Gaussian network, after Shachter and Kenley (1989).

The interpretation of the components of the local parameters is as follows: $m_i$ is the unconditional mean of $X_i$, $v_i$ is the conditional variance of $X_i$ given $Pa_i$, and $b_{ji}$ is a linear coefficient reflecting the strength of the relationship between $X_j$ and $X_i$. See Figure 2.9 for an example of a Gaussian network in a 4-dimensional space.

In order to see the relation between Gaussian networks and multivariate normal densities, we consider that the joint probability density function of the continuous n-dimensional variable X is a multivariate normal distribution iff:

$$f(x) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} e^{-\frac{1}{2}(x - \mu)^t \Sigma^{-1} (x - \mu)} \qquad (2.26)$$

where $\mu$ is the vector of means, $\Sigma$ is an $n \times n$ covariance matrix, and $|\Sigma|$ denotes the determinant of $\Sigma$. The inverse of this matrix, $W = \Sigma^{-1}$, whose elements are denoted by $w_{ij}$, is referred to as the precision matrix.

The former density can be written as a product of n conditional densities, namely

$$f(x) = \prod_{i=1}^{n} f(x_i \mid x_1, \ldots, x_{i-1}) = \prod_{i=1}^{n} N\Big(x_i;\; \mu_i + \sum_{j=1}^{i-1} b_{ji}(x_j - \mu_j),\; v_i\Big) \qquad (2.27)$$

where $\mu_i$ is the unconditional mean of $X_i$, $v_i$ is the variance of $X_i$ given $X_1, \ldots, X_{i-1}$, and $b_{ji}$ is a linear coefficient reflecting the strength of the relationship between variables $X_j$ and $X_i$ (DeGroot, 1970). This notation gives us the possibility of interpreting a multivariate normal distribution as a Gaussian network where there is an arc from $X_j$ to $X_i$ whenever $b_{ji} \neq 0$ with $j < i$.

On the other hand, given a Gaussian network it is possible to generate a multivariate normal density. The unconditional means in both paradigms verify that $m_i = \mu_i$ for all $i = 1, \ldots, n$, and Shachter and Kenley (1989) describe the general transformation from $v$ and $\{b_{ji} \mid j < i\}$ of a given Gaussian network to the precision matrix $W$ of the normal distribution that the Gaussian network represents. The transformation can be done with the following recursive formula, in which $W(i)$ denotes the $i \times i$ upper left submatrix, $b_i$ denotes the column vector $(b_{1i}, \ldots, b_{i-1,i})^t$ and $b_i^t$ its transpose:


$$W(i+1) = \begin{pmatrix} W(i) + \dfrac{b_{i+1} b_{i+1}^t}{v_{i+1}} & -\dfrac{b_{i+1}}{v_{i+1}} \\[6pt] -\dfrac{b_{i+1}^t}{v_{i+1}} & \dfrac{1}{v_{i+1}} \end{pmatrix} \qquad (2.28)$$

for $i > 0$, and $W(1) = \frac{1}{v_1}$.

[Figure 2.9 shows the structure, the local densities, and the resulting factorization of the joint density function for a Gaussian network with four variables. Its local densities are $f(x_1) = N(x_1; m_1, v_1)$, $f(x_2) = N(x_2; m_2, v_2)$, $f(x_3 \mid x_1, x_2) = N(x_3; m_3 + b_{13}(x_1 - m_1) + b_{23}(x_2 - m_2), v_3)$ with $\theta_3 = (m_3, b_3, v_3)$ and $b_3 = (b_{13}, b_{23})^t$, and $f(x_4 \mid x_3) = N(x_4; m_4 + b_{34}(x_3 - m_3), v_4)$ with $\theta_4 = (m_4, b_4, v_4)$ and $b_4 = (b_{34})^t$.]

Figure 2.9 Structure, local densities, and resulting factorization for a Gaussian network with four variables.

Example 2.3 Using the Gaussian network introduced in Figure 2.9, where $X_1 \equiv N(x_1; m_1, v_1)$, $X_2 \equiv N(x_2; m_2, v_2)$, $X_3 \equiv N(x_3; m_3 + b_{13}(x_1 - m_1) + b_{23}(x_2 - m_2), v_3)$ and $X_4 \equiv N(x_4; m_4 + b_{34}(x_3 - m_3), v_4)$, we obtain that the precision matrix is given by:

$$W = \begin{pmatrix}
\frac{1}{v_1} + \frac{b_{13}^2}{v_3} & \frac{b_{13} b_{23}}{v_3} & -\frac{b_{13}}{v_3} & 0 \\[4pt]
\frac{b_{13} b_{23}}{v_3} & \frac{1}{v_2} + \frac{b_{23}^2}{v_3} & -\frac{b_{23}}{v_3} & 0 \\[4pt]
-\frac{b_{13}}{v_3} & -\frac{b_{23}}{v_3} & \frac{1}{v_3} + \frac{b_{34}^2}{v_4} & -\frac{b_{34}}{v_4} \\[4pt]
0 & 0 & -\frac{b_{34}}{v_4} & \frac{1}{v_4}
\end{pmatrix} \qquad (2.29)$$

The Gaussian network representation of a multivariate normal distribution is better suited to model elicitation and understanding than the standard representation, in which one needs to guarantee that the assessed covariance matrix


is positive-definite. Also, notice that it is necessary to test whether the database D with N cases, D = {x^1, ..., x^N}, follows a multivariate normal distribution.

4.3 Model induction

In this section we present three different approaches to induce Gaussian networks from data. While the first is based on edge exclusion tests, the other two are score+search methods. As in the section devoted to Bayesian networks, one score corresponds to a penalized maximum likelihood metric and the other is a Bayesian score.

4.3.1 Edge exclusion tests. Dempster (1972) introduces graphical Gaussian models where the structure of the precision matrix is modeled, rather than the variance matrix itself. The idea of this modeling is to simplify the joint n-dimensional normal density by testing whether a particular element $w_{ij}$, with $i = 1, \ldots, n-1$ and $j > i$, of the $n \times n$ precision matrix $W$ can be set to zero. Wermuth (1976) shows that fitting these models is equivalent to testing for conditional independence between the corresponding elements of the n-dimensional variable X. Speed and Kiiveri (1986) show that these tests correspond to testing whether the edge connecting the vertices of $X_i$ and $X_j$ in the conditional independence graph can be eliminated. Hence, such tests are known as edge exclusion tests. Many graphical model selection procedures start by making the $n(n-1)/2$ single edge exclusion tests -excluding the edge connecting $X_i$ and $X_j$ corresponds to accepting the null hypothesis $H_0$: $w_{ij} = 0$, with the alternative $H_A$: $w_{ij}$ unspecified- evaluating the likelihood ratio statistic and comparing it to a $\chi^2$ distribution. However, the use of this distribution is only asymptotically correct. In this section we introduce -borrowed from Smith and Whittaker (1998)- one alternative to these tests based on the likelihood ratio test.

The likelihood ratio test statistic to exclude the edge between $X_i$ and $X_j$ from a graphical Gaussian model is $T_{lik} = -n \log(1 - r_{ij|rest}^2)$, where $r_{ij|rest}$ is the sample partial correlation of $X_i$ and $X_j$ adjusted for the remaining variables. The latter can be expressed (Whittaker, 1990) in terms of the maximum likelihood estimates of the elements of the precision matrix as $r_{ij|rest} = -\hat{w}_{ij}(\hat{w}_{ii}\hat{w}_{jj})^{-\frac{1}{2}}$.

Smith and Whittaker (1998) obtain the density and distribution functions of the likelihood ratio test statistic under the null hypothesis. These expressions are of the form:

1 flik(t) = gx(t) + 4(t - 1)(2n + l)gx(t)N- I + O(N-2 ) (2.30)

(2.31)


where $g_{\chi}(t)$ and $G_{\chi}(t)$ are the density and distribution functions, respectively, of a $\chi^2_1$ variable.
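As an illustration of the classical test, the following sketch computes the likelihood ratio statistic from the estimated precision matrix and compares it against the asymptotic $\chi^2_1$ critical value (3.84 at the 0.05 level). The data and the function name are ours, chosen only for illustration:

```python
import numpy as np

def edge_exclusion_statistic(data, i, j):
    """Likelihood ratio statistic T_lik = -n log(1 - r_{ij|rest}^2)
    for excluding the edge X_i -- X_j, with the sample partial
    correlation obtained from the estimated precision matrix."""
    n = data.shape[0]
    W = np.linalg.inv(np.cov(data, rowvar=False))  # estimated precision matrix
    r = -W[i, j] / np.sqrt(W[i, i] * W[j, j])      # partial correlation
    return -n * np.log(1.0 - r ** 2)

rng = np.random.default_rng(0)
# X and Y are independent; Z depends strongly on X
x = rng.normal(size=5000)
y = rng.normal(size=5000)
z = x + 0.1 * rng.normal(size=5000)
data = np.column_stack([x, y, z])
t_xy = edge_exclusion_statistic(data, 0, 1)  # small: accept H0, drop the edge
t_xz = edge_exclusion_statistic(data, 0, 2)  # very large: keep the edge
```

Here t_xy stays near the null distribution (well below the 3.84 critical value in most samples) while t_xz is enormous, so the X-Y edge is excluded and the X-Z edge retained.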

4.3.2 Score+search methods. The main idea behind this approach consists of defining a measure for each candidate Gaussian network, combined with an intelligent search through the space of different structures.

Search approaches

All the comments made for Bayesian networks are also valid in the case of Gaussian networks. In fact, Algorithm B or variants of the general local search strategy introduced in Section 3.3.2 can also be applied in this case.

Score metrics

• Penalized maximum likelihood

Denoting by $L(D \mid S, \theta)$ the likelihood of the database $D = \{x^1, \ldots, x^N\}$ given the Gaussian network model $M = (S, \theta)$, we have that:

$$L(D \mid S, \theta) = \prod_{r=1}^{N} \prod_{i=1}^{n} N\Big(x_{ir};\; m_i + \sum_{x_j \in Pa_i} b_{ji}(x_{jr} - m_j),\; v_i\Big). \qquad (2.32)$$

The maximum likelihood estimates for $\theta = (\theta_1, \ldots, \theta_n)$, denoted as $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_n)$, are obtained by maximizing $L(D \mid S, \theta)$, or equivalently by maximizing $\ln L(D \mid S, \theta)$. The expression for the latter is:

$$\ln L(D \mid S, \theta) = \sum_{r=1}^{N} \sum_{i=1}^{n} \Big[ -\ln\big(\sqrt{2\pi v_i}\big) - \frac{1}{2 v_i}\Big(x_{ir} - m_i - \sum_{x_j \in Pa_i} b_{ji}(x_{jr} - m_j)\Big)^2 \Big]. \qquad (2.33)$$

$\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_n)$ are obtained as the solutions of the following system of equations:

$$\frac{\partial}{\partial m_i} \ln L(D \mid S, \theta) = 0 \qquad i = 1, \ldots, n$$

$$\frac{\partial}{\partial v_i} \ln L(D \mid S, \theta) = 0 \qquad i = 1, \ldots, n \qquad (2.34)$$

$$\frac{\partial}{\partial b_{ji}} \ln L(D \mid S, \theta) = 0 \qquad j = 1, \ldots, i-1 \text{ and } X_j \in Pa_i.$$

It can be proved (Larrañaga et al., 2001) that the maximum likelihood estimates for $\theta_i = (m_i, b_i, v_i)$, with $i = 1, \ldots, n$, $j = 1, \ldots, i-1$ and $X_j \in Pa_i$, are:


$$\hat{m}_i = \bar{X}_i \qquad (2.35)$$

$$\hat{b}_{ji} = \frac{S_{X_j X_i}}{S^2_{X_j}} \qquad (2.36)$$

and, for the case of two parents $Pa_i = \{X_j, X_k\}$,

$$\hat{v}_i = S^2_{X_i} - \frac{S^2_{X_k} (S_{X_j X_i})^2 + S^2_{X_j} (S_{X_k X_i})^2 - 2\, S_{X_j X_k} S_{X_j X_i} S_{X_k X_i}}{S^2_{X_j} S^2_{X_k} - (S_{X_j X_k})^2} \qquad (2.37)$$

where

$$\bar{X}_i = \frac{1}{N} \sum_{r=1}^{N} x_{ir} \qquad (2.38)$$

denotes the sample mean of variable $X_i$,

$$S_{X_j X_i} = \frac{1}{N} \sum_{r=1}^{N} (x_{jr} - \bar{X}_j)(x_{ir} - \bar{X}_i) \qquad (2.39)$$

denotes the sample covariance between variables $X_j$ and $X_i$, and

$$S^2_{X_j} = \frac{1}{N} \sum_{r=1}^{N} (x_{jr} - \bar{X}_j)^2 \qquad (2.40)$$

denotes the sample variance of variable $X_j$.

As explained in Section 3.3.2 of this chapter, a general formula for a penalized maximum likelihood score is as follows:

$$\sum_{r=1}^{N} \sum_{i=1}^{n} \Big[ -\ln\big(\sqrt{2\pi v_i}\big) - \frac{1}{2 v_i}\Big(x_{ir} - m_i - \sum_{x_j \in Pa_i} b_{ji}(x_{jr} - m_j)\Big)^2 \Big] - f(N)\dim(S). \qquad (2.41)$$

The number of parameters, dim(S), needed to specify a Gaussian network model with a structure given by S can be obtained with the following formula:

$$\dim(S) = 2n + \sum_{i=1}^{n} |Pa_i|. \qquad (2.42)$$

In fact, for each variable $X_i$, we need to specify its mean, $m_i$, its conditional variance, $v_i$, and its regression coefficients, $b_{ji}$. The comments made about $f(N)$ in Section 3.3.2 are also valid in this case.
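A sketch of the penalized score of Equation 2.41, assuming $f(N) = \log(N)/2$ (the BIC penalization) and using least squares for the maximum likelihood estimates; the function and data are our own illustration:

```python
import numpy as np

def penalized_loglik(data, parents, f_N=None):
    """Penalized maximum likelihood score (Eq. 2.41) of a Gaussian
    network structure. data is an N x n array; parents maps each
    variable index to the list of its parent indices. The default
    penalization f(N) = log(N)/2 corresponds to the BIC score."""
    N, n = data.shape
    if f_N is None:
        f_N = 0.5 * np.log(N)
    loglik = 0.0
    for i, pa in parents.items():
        y = data[:, i]
        if pa:
            # least squares regression of X_i on its parents (MLE)
            X = np.column_stack([np.ones(N)] + [data[:, j] for j in pa])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
        else:
            resid = y - y.mean()
        v = resid @ resid / N       # MLE of the conditional variance
        # sum over cases of -ln(sqrt(2 pi v)) - resid^2/(2v); the
        # quadratic term sums to N/2 at the maximum likelihood estimate
        loglik += -N * np.log(np.sqrt(2 * np.pi * v)) - N / 2.0
    dim_S = 2 * n + sum(len(pa) for pa in parents.values())
    return loglik - f_N * dim_S

# Data with a strong X0 -> X1 dependence: the arc should be rewarded
rng = np.random.default_rng(1)
x0 = rng.normal(size=500)
x1 = 2.0 * x0 + rng.normal(size=500)
data = np.column_stack([x0, x1])
s_edge = penalized_loglik(data, {0: [], 1: [0]})
s_empty = penalized_loglik(data, {0: [], 1: []})
```

On this data the structure containing the arc scores higher than the empty structure: the likelihood gain far exceeds the penalty for the extra regression coefficient.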


• Bayesian scores

In Geiger and Heckerman (1994), a continuous version of the BDe metric (Heckerman et al., 1995) for Gaussian networks, called BGe (Bayesian Gaussian equivalence), is obtained. This metric verifies the interesting property of score equivalence. This means that two Gaussian networks that are isomorphic -they represent the same conditional independence and dependence assertions- receive the same score.

The metric is based upon the fact that the normal-Wishart distribution is conjugate with respect to the multivariate normal. This fact allows us to obtain a closed formula for the computation of the marginal likelihood of the data given the structure.

It can be proved (Geiger and Heckerman, 1994) that the marginal likelihood for a general Gaussian network can be calculated using the following formula:

$$L(D \mid S) = \prod_{i=1}^{n} \frac{L(D^{X_i \cup Pa_i} \mid S_c)}{L(D^{Pa_i} \mid S_c)} \qquad (2.43)$$

where each term is of the form given in Equation 2.44, $D^{X_i \cup Pa_i}$ is the database D restricted to the variables $X_i \cup Pa_i$, and $S_c$ represents a complete network structure.

Combining the results provided by the theorems given in DeGroot (1970), pp. 178 and 180, Geiger and Heckerman (1994) obtain:

(2.44)

where $c(n, \alpha)$ is defined as follows:

(2.45)

This result yields a metric for scoring the marginal likelihood of any Gaussian network.

See Geiger and Heckerman (1994) for a discussion of the three components of the user's prior knowledge that are relevant to learn Gaussian networks: (1) the prior probabilities p(S), (2) the parameters $\alpha$ and $\nu$, and (3) the parameters $\mu_0$ and $T_0$.


5. Simulation

In the book by Ripley (1987), pp. 98-99, two general approaches to sampling from multivariate normal distributions are presented. The first approach is based on the Cholesky decomposition of the covariance matrix, while the second, known as the conditioning method, generates instances of X by computing X_1, then X_2 conditional on X_1, and so on. This second method has some similarities with the PLS algorithm introduced in Section 3.4 for the simulation of Bayesian networks.

For the simulation of a univariate normal distribution, a simple method based on the sum of 12 uniform variables can be applied. See Box and Muller (1958), Marsaglia et al. (1976), Brent (1974), or the more recent methods based on the ratio-of-uniforms (Best, 1979; Ripley, 1983) for alternatives.
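The Cholesky approach can be sketched as follows: if $\Sigma = L L^t$ and $z \sim N(0, I)$, then $\mu + Lz \sim N(\mu, \Sigma)$. The parameter values below are illustrative:

```python
import numpy as np

def sample_mvn_cholesky(mu, sigma, n_samples, rng):
    """Draw samples from N(mu, sigma) via the Cholesky decomposition:
    with sigma = L L^t and z ~ N(0, I), mu + L z ~ N(mu, sigma)."""
    L = np.linalg.cholesky(sigma)       # lower-triangular factor
    z = rng.normal(size=(n_samples, len(mu)))
    return mu + z @ L.T

rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
samples = sample_mvn_cholesky(mu, sigma, 100_000, rng)
```

The empirical mean and covariance of the samples converge to mu and sigma as the number of samples grows.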

6. Summary

In this chapter we have introduced two instances of probabilistic graphical models -Bayesian networks and Gaussian networks- that are of interest for the development of Estimation of Distribution Algorithms.

In both paradigms we have reviewed some methods for structure learning from data and for the simulation of such models, with special emphasis on the first task. In fact, for Bayesian and Gaussian networks we have presented some induction methods based on detecting conditional (in)dependencies, as well as methods that take a score+search point of view. We emphasize the structure learning task because the different Estimation of Distribution Algorithm approaches depend strongly on the manner in which the joint probability density function underlying the selected individuals in each generation is modeled.

Notes

1. Given three disjoint sets of variables Y, Z, W, we say that Y is conditionally independent of Z given W if for any y, z, w we have p(y | z, w) = p(y | w). If this is the case, we denote it I(Y, Z | W).

2. Pa_i represents the set of parents -variables from which an arrow that ends in X_i comes out- of the variable X_i in the probabilistic graphical model with structure given by S.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723.

Best, D. J. (1979). Some easily programmed pseudo-random normal generators. Aust. Comput. J., 11:60-62.

Bouckaert, R. R. (1994). A stratified simulation scheme for inference in Bayesian belief networks, pages 110-117. Morgan Kaufmann Publishers, San Francisco.


Bouckaert, R. R. (1995). Bayesian Belief Networks: From Construction to Inference. PhD Thesis, University of Utrecht.

Bouckaert, R. R., Castillo, E., and Gutierrez, J. M. (1996). A modified simulation scheme for inference in Bayesian networks. International Journal of Approximate Reasoning, 14:55-80.

Box, G. E. P. and Muller, M. E. (1958). A note on the generation of random normal deviates. Ann. Math. Statist., 29:610-611.

Brent, R. P. (1974). A Gaussian pseudo random generator. Comm. Assoc. Comput. Mach., 17:704-706.

Buntine, W. (1991). Theory refinement in Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52-60.

Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(2):195-210.

Castillo, E., Gutierrez, J. M., and Hadi, A. S. (1997). Expert Systems and Probabilistic Network Models. Springer-Verlag, New York.

Charniak, E. (1991). Bayesian networks without tears. AI Magazine, 12:50-63.

Chavez, R. M. and Cooper, G. F. (1990). A randomized approximation algorithm for probabilistic inference on Bayesian belief networks. Networks, 20(5):661-685.

Chickering, D. M., Geiger, D., and Heckerman, D. (1994). Learning Bayesian networks is NP-hard. Technical report, Microsoft Research, Redmond, WA.

Chickering, D. M., Geiger, D., and Heckerman, D. (1995). Learning Bayesian networks: Search methods and experimental results. In Preliminary Papers of the Fifth International Workshop on Artificial Intelligence and Statistics, pages 112-128.

Chickering, M. (1996). Learning equivalence classes of Bayesian network structures. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pages 150-157, Portland, OR. Morgan Kaufmann.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467.

Cooper, G. F. (1990). The computational complexity of probabilistic inference using belief networks. Artificial Intelligence, 42:393-405.

Cooper, G. F. and Herskovits, E. A. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer-Verlag, New York.

Dagum, P. and Horvitz, E. (1993). A Bayesian analysis of simulation algorithms for inference in belief networks. Networks, 23(5):499-516.


Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41:1-31.

de Campos, L. M. (1998). Automatic learning of graphical models. I: Basic methods. In Gamez, J. A. and Puerta, J. M., editors, Probabilistic Expert Systems, pages 113-140. Ediciones de la Universidad de Castilla-La Mancha. (In Spanish).

DeGroot, M. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.

Dempster, A. P. (1972). Covariance selection. Biometrics, 28:157-175.

Fung, R. M. and Chang, K. C. (1990). Weighting and integrating evidence for stochastic simulation in Bayesian networks. In Henrion, M., Shachter, R. D., Kanal, L. N., and Lemmer, J. F., editors, Uncertainty in Artificial Intelligence, volume 5, pages 209-220, Amsterdam. Elsevier.

Fung, R. M. and Del Favero, B. (1994). Backward simulation in Bayesian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 227-234. Morgan Kaufmann Publishers, San Francisco.

Geiger, D. and Heckerman, D. (1994). Learning Gaussian networks. Technical report, Microsoft Advanced Technology Division, Microsoft Corporation, Seattle, Washington.

Heckerman, D. (1995). A tutorial on learning with Bayesian networks. Technical report, Microsoft Advanced Technology Division, Microsoft Corporation, Seattle, Washington.

Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

Heckerman, D. and Wellman, M. P. (1995). Bayesian networks. Communications of the ACM, 38:27-30.

Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Lemmer, J. F. and Kanal, L. N., editors, Uncertainty in Artificial Intelligence, volume 2, pages 149-163. North-Holland, Amsterdam.

Henrion, M., Breese, J. S., and Horvitz, E. J. (1991). Decision analysis and expert systems. AI Magazine, 12:64-91.

Herskovits, E. and Cooper, G. (1990). Kutató: An entropy-driven system for construction of probabilistic expert systems from databases. Technical report, Knowledge Systems Laboratory, Medical Computer Science, Stanford University, California.

Howard, R. and Matheson, J. (1981). Influence diagrams. In Howard, R. and Matheson, J., editors, Readings on the Principles and Applications of Decision Analysis, volume 2, pages 721-764. Strategic Decisions Group, Menlo Park, CA.

Hrycej, T. (1990). Gibbs sampling in Bayesian networks. Artificial Intelligence, 46(3):351-363.


Jensen, C. S., Kong, A., and Kjærulff, U. (1993). Blocking Gibbs sampling in very large probabilistic expert systems. Technical report, Department of Mathematics and Computer Science, University of Aalborg, Denmark.

Jensen, F. V. (1996). An Introduction to Bayesian Networks. UCL Press, London.

Jensen, F. V., Olesen, K. G., and Andersen, S. K. (1990). An algebra of Bayesian belief universes for knowledge based systems. Networks, 20(5):637-659.

Krause, P. J. (1998). Learning probabilistic networks. Technical report, Philips Research Laboratories.

Kreiner, S. (1989). On tests of conditional independence. Technical report, Statistical Research Unit, University of Copenhagen.

Kullback, S. (1951). On information and sufficiency. Annals of Math. Stats., 22:79-86.

Lam, W. and Bacchus, F. (1994). Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(4):269-293.

Larrañaga, P., Kuijpers, C. M. H., Murga, R. H., and Yurramendi, Y. (1996a). Searching for the best ordering in the structure learning of Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26(4):487-493.

Larrañaga, P., Lozano, J. A., and Bengoetxea, E. (2001). Estimation of Distribution Algorithms based on multivariate normal and Gaussian networks. Technical Report KZZA-IK-1-01, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Spain.

Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R. H., and Kuijpers, C. M. H. (1996b). Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912-926.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157-224.

Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89:1535-1546.

Marsaglia, G., Ananthanarayanan, K., and Paul, N. J. (1976). Improvements on fast methods for generating normal random variables. Information Processing Letters, 5:27-30.

McCarthy, J. (1956). Measures of the value of information. In Proceedings of the National Academy of Sciences, pages 645-655.

Myers, J. W., Laskey, K. B., and Levitt, T. (1999). Learning Bayesian networks from incomplete data with stochastic search algorithms. In Proceedings of


the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 476-485.

Neapolitan, R. E. (1990). Probabilistic Reasoning in Expert Systems. John Wiley and Sons, New York.

Pearl, J. (1986). Fusion, propagation and structuring in belief networks. Artificial Intelligence, 29:241-288.

Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32:245-257.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, Palo Alto, CA.

Pearl, J. (1993). Belief networks revisited. Artificial Intelligence, 59:49-56.

Ripley, B. D. (1983). Computer generation of random variables: a tutorial. Int. Statist. Rev., 51:301-319.

Ripley, B. D. (1987). Stochastic Simulation. John Wiley and Sons.

Robinson, R. W. (1977). Counting unlabelled acyclic digraphs. In Lecture Notes in Mathematics: Combinatorial Mathematics V, pages 28-43. Springer-Verlag.

Salmeron, A., Cano, A., and Moral, S. (2000). Importance sampling in Bayesian networks using probability trees. Computational Statistics and Data Analysis, 34:387-413.

Sangüesa, R. and Cortés, U. (1998). Learning causal networks from data: a survey and a new algorithm for recovering possibilistic causal networks. AI Communications, 10:31-61.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2):461-464.

Shachter, R. and Kenley, C. (1989). Gaussian influence diagrams. Management Science, 35:527-550.

Shachter, R. D. and Peot, M. A. (1990). Simulation approaches to general probabilistic inference on belief networks. In Henrion, M., Shachter, R. D., Kanal, L. N., and Lemmer, J. F., editors, Uncertainty in Artificial Intelligence, volume 5, pages 221-234. Elsevier, Amsterdam.

Shafer, G. R. (1996). Probabilistic Expert Systems. Society for Industrial and Applied Mathematics.

Shwe, M. and Cooper, G. (1991). An empirical analysis of likelihood-weighting simulation on a large multiply connected medical belief network. Comput. and Biomed. Res., 24:453-475.

Smith, P. W. and Whittaker, J. (1998). Edge exclusion tests for graphical Gaussian models. In Learning in Graphical Models, pages 555-574. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Speed, T. P. and Kiiveri, H. (1986). Gaussian Markov distributions over finite graphs. Annals of Statistics, 14:138-150.

Spirtes, P., Glymour, C., and Scheines, R. (1991). An algorithm for fast recovery of sparse causal graphs. Social Science Computing Reviews, 9:62-72.


Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Lecture Notes in Statistics 81, Springer-Verlag.

Wermuth, N. (1976). Model search among multiplicative models. Biometrics, 32:253-263.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley and Sons.

Wong, M. L., Lam, W., and Leung, K. S. (1999). Using evolutionary computation and minimum description length principle for data mining of probabilistic knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(2):174-178.


P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002

Chapter 3

A Review on Estimation of Distribution Algorithms

P. Larrañaga
Department of Computer Science and Artificial Intelligence

University of the Basque Country [email protected]

Abstract    In this chapter, we review the Estimation of Distribution Algorithms proposed for the solution of combinatorial optimization problems and optimization in continuous domains. The different approaches to Estimation of Distribution Algorithms are ordered by the complexity of the interrelations that they are able to express. All will be introduced using one unified notation.

Keywords:   Estimation of Distribution Algorithms, combinatorial optimization, optimization in continuous domains, without dependencies, bivariate dependencies, multiple dependencies, mixture models

1. Introduction

The behavior of the evolutionary computation algorithms introduced in the first chapter depends on several parameters associated with them (crossover and mutation operators, probabilities of crossover and mutation, size of the population, rate of generational reproduction, number of generations, etc.). If a researcher does not have experience in using this type of approach for the resolution of a concrete optimization problem, then the choice of suitable values for the parameters is itself converted into an optimization problem, as was shown by Grefenstette (1986). This reason, together with the fact that prediction of the movements of the populations in the search space is extremely difficult, has motivated the birth of a class of algorithms known as Estimation of Distribution Algorithms (EDAs). EDAs were introduced in the field of evolutionary computation for the first time by Mühlenbein and Paaß (1996), and some similar approaches can be found in the book by Zhigljavsky (1991).


Holland (1975) already recognized that taking interacting variables into account would be beneficial to genetic algorithms. This unexploited source of knowledge was called linkage information. Following this idea, in other approaches developed by different authors (Goldberg et al., 1989; Goldberg et al., 1993; Kargupta, 1996; Kargupta and Goldberg, 1997; Bandyopadhyay et al., 1998; Lobo et al., 1998; van Kemenade, 1998; Bosman and Thierens, 1999b), simple genetic algorithms were extended to process building blocks.

In EDAs there are neither crossover nor mutation operators. Instead, the new population of individuals is sampled from a probability distribution, which is estimated from a database containing selected individuals from the previous generation. Whereas in evolutionary computation heuristics the interrelations -building blocks- between the different variables representing the individuals are kept implicitly in mind, in EDAs the interrelations are explicitly expressed through the joint probability distribution associated with the individuals selected at each iteration. In fact, estimation of the joint probability distribution associated with the database containing these selected individuals constitutes the bottleneck of this new heuristic.

The fundamental objective of this chapter is to review the different approaches developed for EDAs. We will explain the elements of these approaches by means of a simple example. Later we will review the EDA approaches to combinatorial optimization problems as well as to optimization in continuous domains.

Works that collect this kind of approach to optimization can be found in Larrañaga et al. (1999a, 1999b) and Pelikan et al. (1999b). See González et al. (2001) in this book for a review of theoretical results concerning EDAs.

2. EDA approaches to optimization

In order to better understand the different components and steps of EDAs, we apply the simplest version of this approach to a simple optimization example.

Suppose that we are trying to maximize the OneMax function defined in a 6-dimensional space. Thus, we try to obtain the maximum of the function h(x) = Σ_{i=1}^{6} x_i with x_i ∈ {0, 1}.

The initial population is obtained at random by sampling the following probability distribution: p_0(x) = Π_{i=1}^{6} p_0(x_i), where p_0(x_i = 1) = 0.5 for i = 1, ..., 6. This means that the joint probability distribution from which we are sampling is factorized as a product of six univariate marginal probability distributions, each following a Bernoulli distribution with parameter value equal to 0.5. We denote by D_0 the file containing 20 cases -see Table 3.1- obtained from this simulation.

In a second step, we select some of the individuals from D_0. This can be done using one of the standard selection methods that are common in evolutionary


Table 3.1 The initial population, D_0.

       X1 X2 X3 X4 X5 X6 | h(x)
   1    1  0  1  0  1  0 |  3
   2    0  1  0  0  1  0 |  2
   3    0  0  0  1  0  0 |  1
   4    1  1  1  0  0  1 |  4
   5    0  0  0  0  0  1 |  1
   6    1  1  0  0  1  1 |  4
   7    0  1  1  1  1  1 |  5
   8    0  0  0  1  0  0 |  1
   9    1  1  0  1  0  0 |  3
  10    1  0  1  0  0  0 |  2
  11    1  0  0  1  1  1 |  4
  12    1  1  0  0  0  1 |  3
  13    1  0  1  0  0  0 |  2
  14    0  0  0  0  1  1 |  2
  15    0  1  1  1  1  1 |  5
  16    0  0  0  1  0  0 |  1
  17    1  1  1  1  1  0 |  5
  18    0  1  0  1  1  0 |  3
  19    1  0  1  1  1  1 |  5
  20    1  0  1  1  0  0 |  3


Table 3.2 The selected individuals, D_0^Se, from the initial population.

       X1 X2 X3 X4 X5 X6
   1    1  0  1  0  1  0
   4    1  1  1  0  0  1
   6    1  1  0  0  1  1
   7    0  1  1  1  1  1
  11    1  0  0  1  1  1
  12    1  1  0  0  0  1
  15    0  1  1  1  1  1
  17    1  1  1  1  1  0
  18    0  1  0  1  1  0
  19    1  0  1  1  1  1

computation. Let us assume that our selection method is truncation, and that we select half of the population. We denote by D_0^Se the data file containing the selected individuals. If there are ties between the evaluation function values of some individuals (see for instance the individuals numbered 1, 9, 12, 18 and 20), then the selection is done in a probabilistic manner. For example here, we will need to select 3 individuals from the set of individuals whose evaluation function value is 3.

Once we have the 10 selected individuals, D_0^Se -see Table 3.2- we would like to explicitly express -by means of the joint probability distribution- the characteristics of these selected individuals. Although we are aware that it would be a good idea for the joint probability distribution to take into account all the dependencies between the variables, in this example we are using the simplest model possible to express the joint probability distribution. In this simple model, every variable is considered independent of the rest. This is expressed mathematically as:

p_1(x) = p_1(x_1, ..., x_6) = Π_{i=1}^{6} p(x_i | D_0^Se).     (3.1)

Thus, we only need 6 parameters to specify the model. Each of these parameters, p(x_i | D_0^Se) with i = 1, ..., 6, will be estimated from the data file D_0^Se by means of its corresponding relative frequency, p(x_i = 1 | D_0^Se).

In this case, the values of the parameters are:

p(x_1 = 1 | D_0^Se) = 0.7    p(x_2 = 1 | D_0^Se) = 0.7    p(x_3 = 1 | D_0^Se) = 0.6
p(x_4 = 1 | D_0^Se) = 0.6    p(x_5 = 1 | D_0^Se) = 0.8    p(x_6 = 1 | D_0^Se) = 0.7.     (3.2)


Table 3.3 The population of the first generation, D_1.

       X1 X2 X3 X4 X5 X6 | h(x)
   1    1  1  1  1  1  1 |  6
   2    1  0  1  0  1  1 |  4
   3    1  1  1  1  1  0 |  5
   4    0  1  0  1  1  1 |  4
   5    1  1  1  1  0  1 |  5
   6    1  0  0  1  1  1 |  4
   7    0  1  0  1  1  0 |  3
   8    1  1  1  0  1  0 |  4
   9    1  1  1  0  0  1 |  4
  10    1  0  0  1  1  1 |  4
  11    1  1  0  0  1  1 |  4
  12    1  0  1  1  1  0 |  4
  13    0  1  1  0  1  1 |  4
  14    0  1  1  1  1  0 |  4
  15    0  1  1  1  1  1 |  5
  16    0  1  1  0  1  1 |  4
  17    1  1  1  1  1  0 |  5
  18    0  1  0  0  1  0 |  2
  19    0  0  1  1  0  1 |  3
  20    1  1  0  1  1  1 |  5

By sampling this joint probability distribution, p_1(x), we obtain a new population, D_1, of individuals. In Table 3.3 we represent such a data file consisting of 20 individuals, each of which is evaluated using h(x).

Again, we select the best 10 individuals from D_1, obtaining -as can be seen in Table 3.4- the data file D_1^Se.

From this data file we can obtain

p_2(x) = p_2(x_1, ..., x_6) = Π_{i=1}^{6} p(x_i | D_1^Se)     (3.3)

where p(x_i | D_1^Se) with i = 1, ..., 6 is estimated from the data file D_1^Se by means of its corresponding relative frequency, p(x_i = 1 | D_1^Se).

In this case the values of the parameters are:

p(x_1 = 1 | D_1^Se) = 0.9    p(x_2 = 1 | D_1^Se) = 0.8    p(x_3 = 1 | D_1^Se) = 0.8
p(x_4 = 1 | D_1^Se) = 0.7    p(x_5 = 1 | D_1^Se) = 0.8    p(x_6 = 1 | D_1^Se) = 0.7.     (3.4)


Table 3.4 The selected individuals, D_1^Se, from the population of the first generation.

       X1 X2 X3 X4 X5 X6
   1    1  1  1  1  1  1
   2    1  0  1  0  1  1
   3    1  1  1  1  1  0
   5    1  1  1  1  0  1
   6    1  0  0  1  1  1
   8    1  1  1  0  1  0
   9    1  1  1  0  0  1
  15    0  1  1  1  1  1
  17    1  1  1  1  1  0
  20    1  1  0  1  1  1

These three steps, consisting of (i) selecting some individuals from a population of individuals, (ii) learning the joint probability distribution of the selected individuals, and (iii) sampling from the learnt distribution, are repeated until a -previously established- stopping criterion is met.

In Figure 3.1, as well as in the pseudocode of Figure 3.2, we can see a schematic of the EDA approach. At the beginning, M individuals are generated at random, for example from a uniform distribution (discrete or continuous) for each variable. These M individuals constitute the initial population, D_0, and each of them is evaluated. In a first step, a number N (N ≤ M) of individuals are selected (usually those with the best objective function values). Next, induction of the n-dimensional probabilistic model that best reflects the interdependencies between the n variables is carried out. In a third step, M new individuals (the new population) are obtained from simulation of the probability distribution learnt in the previous step. These three steps are repeated until a stopping condition is verified. Examples of stopping conditions are: a fixed number of populations or of different evaluated individuals is reached, uniformity in the generated population, or no improvement with regard to the best individual obtained in the previous generation.

The main problem with EDAs is how the probability distribution p_l(x) is estimated. Obviously, computation of all the parameters needed to specify the joint probability distribution is impractical. This has led to several approximations where the probability distribution is assumed to factorize according to a probability model.


[Figure: population tables D_0 and D_l with evaluation values; panels labelled "Selection of N ≤ M individuals", "Induction of the probability model", and "Sampling from p_l(x)"]

Figure 3.1 Illustration of the EDA approach to optimization.


EDA

D_0 ← Generate M individuals (the initial population) at random

Repeat for l = 1, 2, ... until the stopping criterion is met

    D_{l-1}^Se ← Select N ≤ M individuals from D_{l-1} according to the selection method

    p_l(x) = p(x | D_{l-1}^Se) ← Estimate the probability distribution of an individual being among the selected individuals

    D_l ← Sample M individuals (the new population) from p_l(x)

Figure 3.2 Pseudocode for the EDA approach.

With the example and the pseudocode of Figure 3.2 we have seen how to apply the EDA approach to function optimization. In Part II of this book, several optimization problems will be solved while in Part III different EDA approaches are applied to some machine learning problems. See Zhang (2000) for differences and connections between function optimization and function approximation (learning) with EDAs.
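The loop of Figure 3.2 can be sketched in runnable form for the OneMax example of this section. This is our own illustrative sketch, not code from the book: the names (`onemax`, `umda_onemax`) and the parameter defaults are ours, truncation ties are broken by sort order rather than at random (a simplification), and the model is the product of univariate marginals used in the example.

```python
import random

def onemax(x):
    """The evaluation function h(x): the number of ones in x."""
    return sum(x)

def umda_onemax(n=6, M=20, N=10, generations=25, seed=3):
    """Minimal EDA loop of Figure 3.2 with univariate marginals."""
    rng = random.Random(seed)
    p = [0.5] * n          # p_0: Bernoulli(0.5) for every variable
    best = None
    for _ in range(generations):
        # Sample the new population D_l from the current model p_l(x)
        population = [[1 if rng.random() < p[i] else 0 for i in range(n)]
                      for _ in range(M)]
        # Truncation selection: keep the N best individuals (D_l^Se)
        selected = sorted(population, key=onemax, reverse=True)[:N]
        if best is None or onemax(selected[0]) > onemax(best):
            best = selected[0]
        # Estimate each marginal by its relative frequency, as in Eq. (3.1)
        p = [sum(ind[i] for ind in selected) / N for i in range(n)]
    return best

print(umda_onemax())  # with enough generations this typically reaches all ones
```

With these settings the marginal probabilities drift towards 1 and the sampled populations concentrate on the optimum, mirroring the move from Table 3.1 to Table 3.3.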

3. EDA approaches to combinatorial optimization

3.1 Introduction

In this section we review proposed approaches to using EDAs for combinatorial optimization. We have organized this review by the complexity of the probabilistic model used to learn the interdependencies between the variables from the database of selected individuals.

The different parts of this section are therefore: without dependencies, bivariate dependencies, multivariate dependencies, and mixture models.

3.2 Without dependencies

In all the works belonging to this category it is assumed that the n-dimensional joint probability distribution factorizes as a product of n univariate and independent probability distributions. That is, p_l(x) = Π_{i=1}^{n} p_l(x_i) -see Figure 3.6 for a graphical representation. Obviously this assumption is very far


UMDA

D_0 ← Generate M individuals (the initial population) at random

Repeat for l = 1, 2, ... until the stopping criterion is met

    D_{l-1}^Se ← Select N ≤ M individuals from D_{l-1} according to the selection method

    p_l(x) = p(x | D_{l-1}^Se) = Π_{i=1}^{n} p_l(x_i) = Π_{i=1}^{n} (Σ_{j=1}^{N} δ_j(X_i = x_i | D_{l-1}^Se)) / N ← Estimate the joint probability distribution

    D_l ← Sample M individuals (the new population) from p_l(x)

Figure 3.3 Pseudocode for UMDA.

from what happens in a difficult optimization problem, where interdependencies between the variables usually exist.

3.2.1 UMDA. Mühlenbein (1998) introduced the Univariate Marginal Distribution Algorithm (UMDA). UMDA generalizes previous work by Syswerda (1993), Eshelman and Schaffer (1993) and Mühlenbein and Voigt (1996).

Pseudocode for UMDA can be seen in Figure 3.3. As we can see there, the model used to estimate the joint probability distribution of the selected individuals at each generation, p_l(x), is as simple as possible. It is factorized as a product of independent univariate marginal distributions. That is:

p_l(x) = p(x | D_{l-1}^Se) = Π_{i=1}^{n} p_l(x_i).     (3.5)

Each univariate marginal distribution is estimated from marginal frequencies:

p_l(x_i) = (Σ_{j=1}^{N} δ_j(X_i = x_i | D_{l-1}^Se)) / N     (3.6)

where

δ_j(X_i = x_i | D_{l-1}^Se) = 1 if, in the jth case of D_{l-1}^Se, X_i = x_i, and 0 otherwise.     (3.7)

See Santana and Ochoa (1999) and Santana et al. (2000) for modifications of the basic UMDA in the simulation phase. In Alba et al. (2000) one application


to the feature subset selection problem can be seen, while Rivera (1999) applies UMDA to search in a classifier system. For a mathematical analysis of UMDA, work by Mahnig and Mühlenbein (2000) and Mühlenbein and Mahnig (2000) can be consulted.

3.2.2 PBIL. PBIL (Population Based Incremental Learning) was introduced by Baluja (1994), and later improved by Baluja and Caruana (1995), with the objective of obtaining the optimum of a function defined in the binary space Ω = {0, 1}^n. In each generation, the population of individuals is represented by a vector of probabilities:

p_l(x) = (p_l(x_1), ..., p_l(x_i), ..., p_l(x_n))     (3.8)

where p_l(x_i) refers to the probability of obtaining a value of 1 in the ith component of D_l, the population of individuals in the lth generation.

The algorithm works as follows. At each generation, using the probability vector p_l(x), M individuals are obtained. Each of these M individuals is evaluated and the N best of them (N ≤ M) are selected. We denote them by:

x^l_{1:M}, ..., x^l_{k:M}, ..., x^l_{N:M}.

These selected individuals are used to update the probability vector by using a Hebbian inspired rule:

p_{l+1}(x) = (1 - α) p_l(x) + α (1/N) Σ_{k=1}^{N} x^l_{k:M}     (3.9)

where α ∈ (0, 1] is a parameter of the algorithm. Note that the PBIL algorithm only belongs to the EDA approach as shown in Figure 3.2 in the case that α = 1. In this case PBIL coincides with UMDA.

Figure 3.4 shows pseudocode for the PBIL algorithm.

The PBIL algorithm has received great interest from the research community and many papers on it exist in the literature. Baluja (1995) compares the PBIL algorithm with six other heuristics in different optimization problems, and some empirical comparisons can also be found in the work of Monmarché et al. (1999, 2000). Galic and Höhfeld (1996), Maxwell and Anderson (1999) and Gallagher (2000) apply PBIL to the problem of obtaining optimal weights between nodes in a neural network architecture. Servais et al. (1997) and Kvasnicka et al. (1996) propose an extension of the method to general cardinality spaces. Other papers giving modifications of the basic method include Schmidt et al. (1999), where genetics is added to the standard algorithm, Fyfe (1999), which extends the algorithm to non-static multiobjective problems, and Baluja (1997), which proposes a parallel version of PBIL. Sukthankar et al. (1997) give an application to a problem that arises in intelligent vehicle domains, Salustowicz and Schmidhuber (1997, 1998) show


Obtain an initial probability vector p_0(x)

while no convergence do
begin
    Using p_l(x) obtain M individuals: x^l_1, ..., x^l_k, ..., x^l_M
    Evaluate and rank x^l_1, ..., x^l_k, ..., x^l_M
    Select the N (N ≤ M) best individuals: x^l_{1:M}, ..., x^l_{k:M}, ..., x^l_{N:M}
    Update the probability vector p_{l+1}(x) = (p_{l+1}(x_1), ..., p_{l+1}(x_n)):
        for i = 1, ..., n do
            p_{l+1}(x_i) = (1 - α) p_l(x_i) + α (1/N) Σ_{k=1}^{N} x^l_{i,k:M}
end

Figure 3.4 Pseudocode for the PBIL algorithm.

applications of PBIL to genetic programming, and Inza et al. (2001c) give an application in a medical domain.

Work on theoretical aspects of the PBIL algorithm includes the following: Höhfeld and Rudolph (1997) prove mean convergence of the algorithm for linear functions, Berny (2000a) derives the PBIL algorithm from a gradient dynamical system, González et al. (2001a) show a strong dependence of the algorithm on its initial parameters, and González et al. (2001b) analyze the algorithm using discrete dynamical systems. The doctoral dissertation of Juels (1997) can also be consulted.

Thathachar and Sastry (1987) also give an algorithm similar to PBIL.
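The update rule of Eq. (3.9) is easy to state in code. The sketch below is our own illustration (the function name and example values are ours); it applies the rule componentwise and shows that with α = 1 the vector simply becomes the selected individuals' marginal frequencies, i.e. PBIL coincides with UMDA.

```python
def pbil_update(p, selected, alpha=0.1):
    """One PBIL step (Eq. 3.9): p_{l+1}(x_i) = (1 - alpha) * p_l(x_i)
    + alpha * (mean of bit i over the N selected individuals)."""
    n, N = len(p), len(selected)
    return [(1 - alpha) * p[i] + alpha * sum(ind[i] for ind in selected) / N
            for i in range(n)]

p = [0.5, 0.5, 0.5]
sel = [[1, 0, 1], [1, 1, 0]]           # two selected individuals
print(pbil_update(p, sel, alpha=1.0))  # [1.0, 0.5, 0.5]: the marginal frequencies
```

With a small α the vector moves only slightly towards the selected individuals, which is what distinguishes PBIL's incremental learning from UMDA's full re-estimation.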

3.2.3 cGA. Harik et al. (1998) present an algorithm called the compact Genetic Algorithm (cGA) that also belongs to this family. The algorithm (for binary representations) begins by initializing a vector of probabilities where each component follows a Bernoulli distribution with parameter 0.5. Next, two individuals are generated at random from this vector of probabilities. After the individuals are evaluated, a competition between them is carried out. The competition is held at the level of each of the unidimensional variables, in such a way that if for the ith position the winning individual takes a value different from the loser, then the ith component of the vector of probabilities increases


or diminishes by a constant amount, which depends on whether the ith position of the winning individual was a one or a zero. This process of adaptation of the vector of probabilities towards the winning individual continues until the vector of probabilities has converged. Figure 3.5 shows pseudocode for the cGA.

Figure 3.6 is a graphical representation of the probability model of EDAs without interdependencies.

Step 1. Initialize the probability vector p_0(x):
    p_0(x) = p_0(x_1, ..., x_i, ..., x_n) = (p_0(x_1), ..., p_0(x_i), ..., p_0(x_n)) = (0.5, ..., 0.5, ..., 0.5)

Step 2. l = l + 1. Sampling p_{l-1}(x), obtain two individuals: x^l_1, x^l_2

Step 3. Evaluate and rank x^l_1 and x^l_2, obtaining x^l_{1:2} (the best of both) and x^l_{2:2} (the worst of both)

Step 4. Update the probability vector p_l(x) towards x^l_{1:2}:
    for i = 1 to n
        if x^l_{i,1:2} ≠ x^l_{i,2:2} then
            if x^l_{i,1:2} = 1 then p_l(x_i) = p_{l-1}(x_i) + ε
            if x^l_{i,1:2} = 0 then p_l(x_i) = p_{l-1}(x_i) - ε

Step 5. Check if the probability vector p_l(x) has converged:
    for i = 1 to n do
        if p_l(x_i) > 0 and p_l(x_i) < 1 then return to Step 2

Step 6. p_l(x) represents the final solution

Figure 3.5 Pseudocode for the cGA.
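A compact runnable version of Figure 3.5 might look as follows. This is our own sketch, not the authors' implementation: `step` stands in for the update constant of Step 4, ties in Step 3 are resolved in favour of the first individual, and OneMax is used as the evaluation function in the usage line.

```python
import random

def cga(evaluate, n, step=0.1, seed=0):
    """Compact GA sketch: pairwise competitions nudge a probability
    vector until every component is absorbed at 0 or 1 (Figure 3.5)."""
    rng = random.Random(seed)
    p = [0.5] * n

    def sample():
        return [1 if rng.random() < p[i] else 0 for i in range(n)]

    while any(0.0 < pi < 1.0 for pi in p):
        a, b = sample(), sample()
        winner, loser = (a, b) if evaluate(a) >= evaluate(b) else (b, a)
        for i in range(n):
            if winner[i] != loser[i]:
                # Step 4: move p_i by a constant amount towards the winner's bit
                if winner[i] == 1:
                    p[i] = min(1.0, p[i] + step)
                else:
                    p[i] = max(0.0, p[i] - step)
    return p

final = cga(sum, 5)   # for OneMax the components tend to be absorbed at 1.0
```

Once a component reaches 0 or 1, both sampled individuals agree on that bit and it can no longer change, which is the convergence check of Step 5.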

3.3 Bivariate dependencies

Estimation of the joint probability distribution can be done quickly, without assuming independence between the variables -which is very far from the reality in some problems- by taking dependencies between pairs of variables into account. In this case it is enough to consider second-order statistics. While in the algorithms of the previous section learning of just the parameters was


[Figure: a set of disconnected nodes]

Figure 3.6 Graphical representation of the probability model of the proposed EDAs in combinatorial optimization without interdependencies (UMDA, PBIL, cGA).

carried out -the structure of the model remained fixed- in this subsection, parametric learning is extended to structural learning too.

3.3.1 MIMIC. In De Bonet et al. (1997) a greedy algorithm called MIMIC (Mutual Information Maximization for Input Clustering) is developed. MIMIC searches in each generation for the best permutation between the variables in order to find the probability distribution, p_l^π(x), that is closest to the empirical distribution of the set of selected points when using the Kullback-Leibler distance, where

p_l^π(x) = p_l(x_{i_1} | x_{i_2}) · p_l(x_{i_2} | x_{i_3}) ··· p_l(x_{i_{n-1}} | x_{i_n}) · p_l(x_{i_n})     (3.10)

and π = (i_1, i_2, ..., i_n) denotes a permutation of the indexes 1, 2, ..., n.

It can be proved that the Kullback-Leibler divergence between the two probability distributions, p_l(x) and p_l^π(x), as expressed in the previous equation, is a function of:

H_l^π(x) = h_l(X_{i_n}) + Σ_{j=1}^{n-1} h_l(X_{i_j} | X_{i_{j+1}})     (3.11)

where h(X) = -Σ_x p(X = x) log p(X = x) denotes the Shannon entropy of the variable X, and h(X | Y) = Σ_y h(X | Y = y) p(Y = y), with h(X | Y = y) = -Σ_x p(X = x | Y = y) log p(X = x | Y = y), denotes the mean uncertainty in X given Y.

The problem of searching for the best p_l^π(x) is equivalent to the search for the permutation π* that minimizes H_l^π(x). To find π*, De Bonet et al. (1997) propose a straightforward greedy algorithm that avoids searching through all


Step 1. Choose i_n = argmin_j h_l(X_j)

Step 2. For k = n-1, ..., 1:
    Choose i_k = argmin_j h_l(X_j | X_{i_{k+1}}), with j ≠ i_{k+1}, ..., i_n

Figure 3.7 The MIMIC approach to estimation of the joint probability distribution at generation l. The symbols h_l(X) and h_l(X | Y) denote the empirical entropy of X and the empirical entropy of X given Y respectively. Both are estimated from D_l^Se.

n! permutations. The idea is to select X_{i_n} as the variable with the smallest estimated entropy and then, at every following step, to pick the variable -from the set of variables not chosen so far- whose average conditional entropy with respect to the variable selected in the previous step is the smallest. Figure 3.7 shows pseudocode for the estimation of the joint probability distribution carried out by the MIMIC algorithm.
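The two steps of Figure 3.7 can be sketched with empirical entropies estimated from the selected individuals. This is our own illustration (the helper names are ours); `data` plays the role of D_l^Se, one row per individual.

```python
from math import log2
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy of a column of values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def cond_entropy(xs, ys):
    """Empirical conditional entropy h(X | Y) = h(X, Y) - h(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)

def mimic_permutation(data):
    """Greedy ordering of Figure 3.7: i_n minimizes h(X_j); each earlier
    i_k minimizes h(X_j | X_{i_{k+1}}) over the not-yet-chosen variables."""
    n = len(data[0])
    cols = [[row[i] for row in data] for i in range(n)]
    remaining = set(range(n))
    order = [min(remaining, key=lambda j: entropy(cols[j]))]   # i_n
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining,
                  key=lambda j: cond_entropy(cols[j], cols[order[0]]))
        order.insert(0, nxt)
        remaining.remove(nxt)
    return order   # the permutation (i_1, ..., i_n)
```

The returned permutation defines the chain factorization of Eq. (3.10), whose conditional probabilities are then estimated from the same database.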

3.3.2 COMIT. Baluja and Davies (1997a) proposed an algorithm that learns second-order probability distributions from the good solutions seen so far, and uses these statistics to generate optimal (in terms of maximum likelihood) dependency trees.

In this section we concentrate on an algorithm named COMIT (Combining Optimizers with Mutual Information Trees) that hybridizes the EDA approach with local optimizers. COMIT was introduced by Baluja and Davies (1997b, 1998) and its characteristics can be seen in Figure 3.8. Note the difference between this approach and the approach shown by the pseudocode in Figure 3.2, where estimation of the joint probability distribution is carried out from the individuals selected from the population using the model estimated in the previous generation.

Estimation of the probability distribution of the selected individuals in each generation is done using a tree structured Bayesian network learnt using the algorithm proposed by Chow and Liu (1968) -see Chapter 2 in this book for details. Once the probabilistic model is learnt, some individuals are sampled from it. The best of these individuals are considered as initial points for a fast-search procedure. Some of the best individuals obtained during these fast-


COMIT

D_0 ← Generate M individuals (the initial population) at random

Repeat for l = 1, 2, ... until the stopping criterion is met

    D_{l-1}^Se ← Select N ≤ M individuals from D_{l-1} according to the selection method

    p_l(x) = p(x | D_{l-1}^Se) = Π_{i=1}^{n} p_l(x_i | x_{j(i)}) ← Estimate the probability distribution of the selected individuals using the MWST algorithm of Chow and Liu (1968)

    D_l^Sa ← Sample M_1 individuals from p_l(x)

    D_l^FS ← Obtain M - N individuals by executing a fast-search procedure, initialized with the best solutions of D_l^Sa

Figure 3.8 Pseudocode for the COMIT algorithm.


searches are added to the individuals selected in the previous generation, to create the new population of individuals.
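COMIT's model-building step is the Chow and Liu (1968) algorithm: weight every pair of variables by its empirical mutual information and keep a maximum-weight spanning tree. The sketch below is our own (Kruskal-style, with a small union-find), not the authors' implementation; `data` again stands for the database of selected individuals.

```python
from math import log2
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) between two columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_edges(data):
    """Maximum-weight spanning tree over pairwise mutual information."""
    n = len(data[0])
    cols = [[row[i] for row in data] for i in range(n)]
    weights = sorted(((mutual_information(cols[i], cols[j]), i, j)
                      for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    edges = []
    for _, i, j in weights:          # greedily add the heaviest safe edge
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges                     # the n - 1 edges of the tree
```

Directing the tree away from an arbitrary root gives the factors p_l(x_i | x_{j(i)}) of Figure 3.8.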

3.3.3 BMDA. Pelikan and Mühlenbein (1999) propose a factorization of the joint probability distribution that only needs second-order statistics. Their approach, BMDA (Bivariate Marginal Distribution Algorithm), is based on the construction of a dependency graph, which is always acyclic but does not necessarily have to be connected. In fact the dependency graph can be seen as a set of trees that are not mutually connected. The basic idea underlying the algorithm for construction of the dependency graph is simple.

First, an arbitrary variable is chosen and added as a node of the graph. Second, we add to the graph the variable with the greatest dependency -measured by Pearson's χ² statistic- between any of the previously incorporated variables and the set of not yet added variables. This last step is repeated until no dependency between already added variables and the rest surpasses a previously fixed threshold. In that case, a variable is chosen at random from the set of variables not yet used. The whole process is repeated until all variables are added into the dependency graph.

In each generation the factorization obtained with BMDA is given by:

p_l(x) = Π_{X_r ∈ R_l} p_l(x_r) · Π_{X_i ∈ V\R_l} p_l(x_i | x_{j(i)})     (3.12)

where V denotes the set of n variables, R_l denotes the set containing the root variables -in generation l- of the connected components of the dependency graph, and X_{j(i)} is the variable connected to X_i that was added before X_i.

The probabilities for the root nodes, p_l(x_r), as well as the conditional probabilities, p_l(x_i | x_{j(i)}), are estimated from the database, D_{l-1}^Se, containing the selected individuals.

Figure 3.9 is a graphical representation of EDAs with pairwise dependencies.
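BMDA's dependency measure is Pearson's χ² statistic computed from the selected individuals. A minimal empirical version for two columns of the database might look like this (our own sketch; the function name is ours):

```python
from collections import Counter

def chi_square(xs, ys):
    """Pearson's chi-square statistic between two columns: the sum over
    contingency-table cells of (observed - expected)^2 / expected."""
    n = len(xs)
    ox, oy, oxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for x in ox:
        for y in oy:
            expected = ox[x] * oy[y] / n   # counts expected under independence
            stat += (oxy.get((x, y), 0) - expected) ** 2 / expected
    return stat

chi_square([0, 0, 1, 1], [0, 0, 1, 1])   # perfectly dependent columns
chi_square([0, 0, 1, 1], [0, 1, 0, 1])   # independent columns give 0
```

A large value of the statistic between two variables makes their edge a candidate for the dependency graph; values below the fixed threshold are treated as independence.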

3.4 Multiple dependencies

Several approaches to EDAs have been proposed in the literature where factorization of the joint probability distribution requires statistics of order greater than two.

As far as we know, the first work to mention the possibility of adapting the methods of model induction developed by the scientific community working on probabilistic graphical models to the EDA approach is that of Baluja and Davies (1997a). This possibility is mentioned again in their later work (Baluja and Davies, 1998), but unfortunately they only mention it and do not show evidence of an implementation.


[Figure: a) a chain of nodes (MIMIC structure); b) a tree structure; c) a forest of small trees (BMDA)]

Figure 3.9 Graphical representation of the probability models for the proposed EDAs in combinatorial optimization with pairwise dependencies (MIMIC, tree structure, BMDA).

3.4.1 EcGA. Harik (1999) presents an algorithm -the Extended compact Genetic Algorithm (EcGA)- whose basic idea consists of using a marginal product model to estimate the joint probability distribution of the selected individuals in each generation. This means that, in each generation, the factorization of the joint probability distribution is a product of marginal distributions of variable size, each defined over the variables contained in the same group. The grouping is carried out using a greedy forward algorithm that obtains a partition of the n variables. Each group of variables is assumed to be independent of the rest -as shown in Figure 3.12. In this way, the factorization of the joint probability distribution on the n variables is of the form:

p_l(x) = Π_{c ∈ C_l} p_l(x_c)     (3.13)

where C_l denotes the set of groups in the lth generation, and p_l(x_c) represents the marginal distribution of the variables X_c, that is, the variables that belong to the cth group in the lth generation.

As EcGA obtains a partition of the set of variables, we have that for all l, and for all c, k ∈ C_l with c ≠ k:

∪_{c ∈ C_l} X_c = {X_1, ..., X_n},    X_c ∩ X_k = ∅.     (3.14)

The greedy algorithm that carries out the grouping begins with a partition of n clusters (one variable in each cluster). Then, the algorithm performs the union of the two groups that results in the greatest reduction of a measure that combines the sum of the entropies of the marginal distributions with a penalty for the complexity of the model, based on the minimum description length (MDL) principle (Rissanen, 1978).

More precisely, the measure that EcGA tries to minimize in each generation has two components:

• The compressed population complexity, defined using the entropy of the marginal distributions as follows:

-N Σ_{c ∈ C_l} Σ_{x_c} p(X_c = x_c) log p(X_c = x_c)     (3.15)

and

• The model complexity, which takes into account the dimension of the model in this way:

log N Σ_{c ∈ C_l} dim X_c     (3.16)

where dim X_c represents the number of parameters needed to specify the marginal distribution of X_c. If all the unidimensional variables belonging to the cth group were binary, then we would have dim X_c = 2^{|X_c|} - 1.

Taking both components into account, the measure that EcGA tries to minimize in each generation is:

-N Σ_{c ∈ C_l} Σ_{x_c} p(X_c = x_c) log p(X_c = x_c) + log N Σ_{c ∈ C_l} dim X_c.     (3.17)

This measure is called the combined complexity by Harik (1999).

The greedy search used by EcGA begins each generation by postulating that all the variables are independent. It performs a steepest ascent search where, at each step, the algorithm attempts to merge each pair of groups into a larger group. It judges the merit of these merges by their combined complexity. If the best combination leads to a decrease in combined complexity, then that merger is carried out. This process continues until no further pair of groups can be merged. The resulting marginal product model is then the one used for that generation.
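The combined complexity of Eq. (3.17) and the greedy merging can be sketched for binary variables as follows. This is our own illustration (names are ours), with the dimension of a group of binary variables taken as 2^|X_c| − 1, as in the text, and entropies in bits.

```python
from math import log2
from collections import Counter

def combined_complexity(data, groups):
    """Eq. (3.17): N * (sum of group entropies) + log N * (sum of group
    dimensions), with dim X_c = 2^|X_c| - 1 for binary variables."""
    N = len(data)
    entropy_sum, dim_sum = 0.0, 0
    for group in groups:
        counts = Counter(tuple(row[i] for i in group) for row in data)
        entropy_sum += -sum(c / N * log2(c / N) for c in counts.values())
        dim_sum += 2 ** len(group) - 1
    return N * entropy_sum + log2(N) * dim_sum

def ecga_partition(data):
    """Greedy steepest-ascent merging of groups while the combined
    complexity keeps decreasing."""
    groups = [[i] for i in range(len(data[0]))]
    score = combined_complexity(data, groups)
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                merged = [g for k, g in enumerate(groups) if k not in (a, b)]
                merged.append(groups[a] + groups[b])
                s = combined_complexity(data, merged)
                if s < score and (best is None or s < best[0]):
                    best = (s, merged)
        if best is None:
            break                      # no merger decreases the measure
        score, groups = best
    return groups
```

On a database where two binary variables are perfectly correlated, merging them lowers the entropy term by more than the model penalty grows, so the group is kept; for independent variables the penalty blocks the merge.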


EcGA

D_0 ← Generate M individuals (the initial population) at random

Repeat for l = 1, 2, ... until the stopping criterion is met

    D_{l-1}^Se ← Select N ≤ M individuals from D_{l-1} using the tournament selection method

    p_l(x) = p(x | D_{l-1}^Se) = Π_{c ∈ C_l} p_l(x_c | D_{l-1}^Se) ← Estimate the probability distribution of the selected individuals by means of a marginal product model. Model search uses steepest ascent search, minimizing:
        -N Σ_{c ∈ C_l} Σ_{x_c} p(X_c = x_c) log p(X_c = x_c) + log N Σ_{c ∈ C_l} dim X_c

    D_l ← Sample M individuals (the new population) from p_l(x)

Figure 3.10 Pseudocode for the EcGA.

As can be seen in Figure 3.10, tournament selection is also used in each generation to obtain the set of selected individuals.

In Sastry and Goldberg (2000) empirical relations for population size and convergence time are derived for the EcGA.

3.4.2 FDA. In the work of Mühlenbein et al. (1999) the FDA (Factorized Distribution Algorithm) is introduced. This algorithm applies to additively decomposed functions for which, using the running intersection property (Lauritzen, 1996), a factorization of the mass-probability based on residuals, $x_{b_i}$, and separators, $x_{c_i}$, is obtained. A function $h(x)$ is additively decomposed if:

$$h(x) = \sum_{s_i \in S} h_i(x_{s_i}) \qquad (3.18)$$

where the set $S = \{s_1, \ldots, s_k\}$ with $s_i \subseteq \{1, \ldots, n\}$ constitutes a covering of $\{1, \ldots, n\}$. If the following sets:

$$d_i = \bigcup_{j=1}^{i} s_j, \qquad b_i = s_i \setminus d_{i-1}, \qquad c_i = s_i \cap d_{i-1} \qquad (3.19)$$

satisfy these three conditions:


• $b_i \neq \emptyset$ for all $i = 1, \ldots, k$

• $d_k = \{1, 2, \ldots, n\}$

• $\forall\, i \geq 2 \;\; \exists\, j < i$ such that $c_i \subseteq s_j$

then the joint probability distribution can be factorized in this way:

$$p_l(x) = \prod_{i=1}^{k} p_l(x_{b_i} \mid x_{c_i}). \qquad (3.20)$$

This factorization remains valid for all iterations; only the estimation of the probabilities changes, which in each iteration is carried out from the database containing the selected individuals. In any case, the requirement of specifying the factorization of the joint probability distribution in advance is a drawback when applying the FDA approach to generic optimization problems. It is for this reason that, besides parametric learning, structural learning is also desirable.
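The sets of Eq. (3.19) and the three conditions above can be computed and checked mechanically. The following sketch (0-based variable indices; the function name is ours) returns the residuals $b_i$ and separators $c_i$ for a given covering, raising an error when a condition fails:

```python
def fda_factorization_sets(S, n):
    """Compute the sets d_i, b_i, c_i of Eq. 3.19 for a covering S of
    {0, ..., n-1} and check the three FDA conditions, including the
    running intersection property.  Returns residuals and separators."""
    d_prev = set()                      # d_{i-1}, starting from the empty set
    b, c = [], []
    for i, s in enumerate(S):
        b_i = set(s) - d_prev           # residual: b_i = s_i \ d_{i-1}
        c_i = set(s) & d_prev           # separator: c_i = s_i intersect d_{i-1}
        if not b_i:
            raise ValueError("condition 1 violated: empty residual")
        if i >= 1 and not any(c_i <= set(S[j]) for j in range(i)):
            raise ValueError("running intersection property violated")
        b.append(b_i)
        c.append(c_i)
        d_prev |= set(s)                # d_i = union of s_1, ..., s_i
    if d_prev != set(range(n)):
        raise ValueError("condition 2 violated: S does not cover all variables")
    return b, c
```

For the chain-like covering $\{ \{0,1\}, \{1,2\}, \{2,3\} \}$ this yields residuals $\{0,1\}, \{2\}, \{3\}$ and separators $\emptyset, \{1\}, \{2\}$, so the factorization of Eq. (3.20) becomes $p(x_0, x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)$.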

Theoretical results for the FDA can be found in Mühlenbein and Mahnig (1999a, 1999b, 1999c, 2000), Zhang and Mühlenbein (1999) and Mahnig and Mühlenbein (2000).

3.4.3 PADA. In Soto et al. (1999) the factorization is done using a Bayesian network with a polytree structure (no more than one undirected path connecting any pair of variables). The proposed algorithm is called PADA (Polytree Approximation of Distribution Algorithms) and can be considered a hybrid between a method for detecting conditional (in)dependencies and a procedure based on score+search.

3.4.4 EBNA_PC, EBNA_{K2+pen}, EBNA_BIC. In the work of Etxeberria and Larrañaga (1999) and Larrañaga et al. (2000a) the factorization of the joint probability distribution encoded by a Bayesian network is learnt from the database containing the selected individuals in each generation. The algorithm developed is called EBNA (Estimation of Bayesian Networks Algorithm).

As can be seen in Figure 3.11, the Bayesian network corresponding to the first iteration has an arc-less directed acyclic graph as its structure. The factorization of the joint probability distribution is therefore carried out using $n$ univariate uniform distributions. This means that the initial Bayesian network, $BN_0$, assigns the same probability to all points in the search space.

As we need to find an adequate model structure as quickly as possible, a simple algorithm which returns a good structure, even if it is not optimal, is preferred. An interesting algorithm with these characteristics is Algorithm B (Buntine, 1991). Algorithm B is a greedy search which starts with an arc-less structure and, at each step, adds the arc with the maximum improvement in the measure used. The algorithm stops when adding an arc would not increase


EBNA_PC, EBNA_{K2+pen}, EBNA_BIC
$BN_0$ ← $(S_0, \theta_0)$, where $S_0$ is an arc-less DAG and $\theta_0$ is uniform
$p_0(x) = \prod_{i=1}^{n} p(x_i)$, with each $p(x_i)$ uniform
$D_0$ ← Sample $M$ individuals from $p_0(x)$
For $l = 1, 2, \ldots$ until the stopping criterion is met
  $D_{l-1}^{Se}$ ← Select $N$ individuals from $D_{l-1}$
  $S_l$ ← Find the best structure according to a criterion:
    conditional (in)dependence tests → EBNA_PC
    penalized Bayesian score+search → EBNA_{K2+pen}
    penalized maximum likelihood+search → EBNA_BIC
  $\theta^l$ ← Calculate $\theta_{ijk}^l$ using $D_{l-1}^{Se}$ as the data set
  $D_l$ ← Sample $M$ individuals from $BN_l$ using PLS

Figure 3.11 Pseudocode for the EBNA_PC, EBNA_{K2+pen}, and EBNA_BIC algorithms.


the scoring measure. Another possibility for finding good models quickly is the use of local search strategies. Unlike Algorithm B, which starts from scratch each time, local search starts with the model created in the previous generation.

Several criteria for guiding the search for good model structures, based on different scores -BIC, K2+pen- as well as on testing conditional (in)dependencies between variables -the PC algorithm-, have been implemented, giving the different instantiations of EBNA: EBNA_BIC, EBNA_{K2+pen} and EBNA_PC.
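As an illustration of the penalized maximum likelihood route, here is a small greedy arc-addition search in the spirit of Algorithm B, scoring node families of binary variables with a BIC-style penalized log-likelihood. The function names, the binary-data restriction, and the tie-breaking are our assumptions, not the EBNA implementation:

```python
import math
from itertools import product

def bic_family(data, child, parent_set):
    """Decomposable BIC term for one node given its parents (binary data)."""
    n = len(data)
    counts = {}
    for row in data:
        key = (tuple(row[p] for p in parent_set), row[child])
        counts[key] = counts.get(key, 0) + 1
    pa_totals = {}
    for (pa, _), c in counts.items():
        pa_totals[pa] = pa_totals.get(pa, 0) + c
    ll = sum(c * math.log(c / pa_totals[pa]) for (pa, _), c in counts.items())
    dims = 2 ** len(parent_set)        # one free parameter per parent configuration
    return ll - 0.5 * math.log(n) * dims

def creates_cycle(parents, frm, to):
    """Would the arc frm -> to close a directed cycle (is `to` an ancestor of `frm`)?"""
    stack, seen = [frm], set()
    while stack:
        v = stack.pop()
        if v == to:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def algorithm_b_style(data, n_vars):
    """Greedy search: start arc-less, repeatedly add the best-scoring arc."""
    parents = {i: [] for i in range(n_vars)}
    scores = {i: bic_family(data, i, []) for i in range(n_vars)}
    while True:
        best_gain, best_arc = 0.0, None
        for frm, to in product(range(n_vars), repeat=2):
            if frm == to or frm in parents[to] or creates_cycle(parents, frm, to):
                continue
            gain = bic_family(data, to, parents[to] + [frm]) - scores[to]
            if gain > best_gain:
                best_gain, best_arc = gain, (frm, to)
        if best_arc is None:
            return parents             # adding any arc would not increase the score
        frm, to = best_arc
        parents[to].append(frm)
        scores[to] = bic_family(data, to, parents[to])
```

Because the score decomposes over node families, each candidate arc only requires re-scoring the family of its destination node, which is what makes this greedy scheme cheap.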

Note that in EBNA_{K2+pen} the result of Larrañaga et al. (2000a) bounding the number of parents that a node can have in the best structure is used to automatically control the complexity of the model.

Applications of the EBNA approach to different problems can be found in Bengoetxea et al. (2000, 2001a) -inexact graph matching-, Blanco and Lozano (2001) -combinatorial optimization-, de Campos et al. (2001) -partial abductive inference in Bayesian networks-, Inza et al. (2000, 2001a, 2001c) -feature subset selection-, Inza et al. (2001b) -feature weighting in K-NN-, Lozano and Mendiburu (2001) -job scheduling-, Sierra et al. (2001) -rule induction-, Robles et al. (2001) -traveling salesman problem-, Roure et al. (2001) -partitional clustering- and Sagarna and Larrañaga (2001) -knapsack problems-. See Sagarna (2000) and Lozano et al. (2001) for a parallel version of EBNA.

3.4.5 BOA. Pelikan et al. (1999a, 2000a, 2000b) and Pelikan and Goldberg (2000c) propose the BOA (Bayesian Optimization Algorithm). BOA uses the BDe (Bayesian Dirichlet equivalence) metric to measure the goodness of each structure. This Bayesian metric has the property that two structures reflecting the same conditional (in)dependencies receive the same score. The search used is a greedy search which starts from scratch in each generation. In order to reduce the cardinality of the search space, the constraint that each node in the Bayesian network has at most k parents is assumed. In Schwarz and Ocenasek (1999) some empirical comparisons between BOA and BMDA can be found.

See Pelikan and Goldberg (2000b) for a modification of the BOA approach in order to model hierarchical problems using a type of hybrid model called a Huffman network. In Pelikan et al. (2000c) BOA is adapted to include local structures by using decision graphs to guide the network construction.

Other work that uses Bayesian approaches to optimization with EDAs -but where the Bayesian network paradigm is not used- includes Zhang (1999), Zhang and Cho (2000) and Zhang and Shin (2000).

3.4.6 LFDA, FDA_L, FDA-BC, FDA-SC. Mühlenbein and Mahnig (1999c) introduce the LFDA (Learning Factorized Distribution Algorithm), which essentially follows the same approach as EBNA_BIC. The main difference is that in the LFDA the complexity of the learnt model is controlled by the BIC measure in conjunction with a restriction on the maximum number of parents each variable can have in the Bayesian network.

Ochoa et al. (1999) propose an initial algorithm, FDA_L, to learn -by means of conditional (in)dependence tests- a junction tree from a database. The underlying idea is, once a list of dependencies and independencies between the variables has been obtained, to return the junction tree that best satisfies those assertions.

Also, in Ochoa et al. (2000a) a structure learning algorithm that takes into account questions of reliability and computational cost is presented. The algorithm, called FDA-BC, studies the class of Factorized Distribution Algorithms with Bayesian networks of Bounded Complexity.

Similar ideas are introduced by Ochoa et al. (2000b) in the FDA-SC. In this case the factorization of the joint probability distribution is done using simple structures, i.e. trees, forests or polytrees.

Figure 3.12 Graphical representation of probability models for the proposed EDAs in combinatorial optimization with multiple dependencies (FDA, EBNA, BOA, LFDA and EcGA).

3.5 Mixture models

Pelikan and Goldberg (2000a) propose the use of more flexible probabilistic models to estimate the joint probability distribution of the selected individuals. In order to solve symmetry and multimodality problems they cluster the set of selected individuals, which is done in each generation using a fast clustering method (Forgy, 1965). The obtained model can be written as follows:

$$p_l(x) = \sum_{i=1}^{k} \pi_{l,i}\, p_{l,i}(x) \qquad (3.21)$$


where at generation $l$, $\pi_{l,i}$ denotes the weight of the $i$th mixture component, and $p_{l,i}(x)$ is the probability distribution of the $i$th cluster created from the selected individuals. One drawback of the method -which could be overcome using more sophisticated clustering approaches- is the a priori determination of the number of clusters. Also, as the authors note, in problems where the peaks of the function to be optimized are unequally sized, the approach is quite sensitive to strong selection methods.

See also the work of Peña et al. (2001) in Chapter 4 of this book for an approach to EDAs based on mixture models, where each component of the mixture is a Bayesian network and its corresponding weight is obtained with the EM algorithm.

4. EDA approaches in continuous domains

4.1 Introduction

In this section we review work on optimization in continuous domains with EDAs. The organization of the section is analogous to that of the previous one, as the different approaches have again been grouped by the complexity of the interdependencies between the variables that the learnt density function is able to express. We first review work where the density function is factorized as a product of $n$ univariate marginal densities; then an approach that uses bivariate (second-order) densities is presented; next, more general approaches that consider multiple dependencies are introduced; and we finish with a section devoted to mixture models.

4.2 Without dependencies

In work that does not take into account dependencies between the variables it is usual to assume that the joint density function follows an $n$-dimensional normal distribution which is factorized as a product of unidimensional and independent normal densities. Using the notation $X \equiv N(x; \mu, \Sigma)$, this is:

$$f_l(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2}. \qquad (3.22)$$

4.2.1 UMDA_c. The Univariate Marginal Distribution Algorithm for continuous domains (UMDA_c) was introduced by Larrañaga et al. (1999a, 2000b). In every generation and for every variable the UMDA_c carries out some statistical tests in order to find the density function that best fits the variable. Note that in this case, although the factorization of the joint density function is

$$f_l(x; \theta^l) = \prod_{i=1}^{n} f_l(x_i; \theta_i^l) \qquad (3.23)$$


UMDA_c (learning the joint density function)
Repeat for $l = 1, 2, \ldots$ until the stopping criterion is met
  for $i := 1$ to $n$ do
    (i) Select, via hypothesis tests, the density function $f_l(x_i; \theta_i^l)$ that best fits $D_{l-1}^{Se, x_i}$, the projection of the selected individuals onto the $i$th variable
    (ii) Obtain the maximum likelihood estimates for $\theta_i^l = (\theta_{i,1}^l, \ldots, \theta_{i,k_i}^l)$
  At each generation the learnt joint density function is expressed as:
  $f_l(x; \theta^l) = \prod_{i=1}^{n} f_l(x_i; \theta_i^l)$

Figure 3.13 Pseudocode for learning the joint density function in UMDA_c.

the UMDA_c is in fact a structure identification algorithm (something that does not happen with the UMDA in the discrete case), in the sense that the density components of the model are identified via hypothesis tests. The estimation of parameters is performed, once the densities have been identified, by their maximum likelihood estimates.

If all the univariate distributions are normal, then the two parameters to be estimated at each generation and for each variable are the mean, $\mu_i^l$, and the standard deviation, $\sigma_i^l$. It is well known that their respective maximum likelihood estimates are:

$$\hat{\mu}_i^l = \bar{X}_i^l = \frac{1}{N} \sum_{r=1}^{N} x_{i,r}^l, \qquad \hat{\sigma}_i^l = \sqrt{\frac{1}{N} \sum_{r=1}^{N} \left(x_{i,r}^l - \bar{X}_i^l\right)^2}. \qquad (3.24)$$

This particular case of the UMDA_c will be denoted UMDA_c^G (Univariate Marginal Distribution Algorithm for Gaussian models).

Figure 3.13 shows pseudocode for learning the joint density function in the UMDA_c.
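A minimal UMDA_c^G loop can be sketched as follows, under the assumption that every variable is modelled as a Gaussian (so the hypothesis-test step of Figure 3.13 is skipped) and with illustrative population sizes and truncation selection; all names are ours:

```python
import random
import statistics

def umda_g(fitness, n, pop_size=100, n_sel=50, n_gen=30, seed=0):
    """UMDA_c^G sketch: per-variable Gaussian refitted each generation
    with the maximum likelihood estimates of Eq. 3.24 (minimization)."""
    rng = random.Random(seed)
    mu = [0.0] * n
    sigma = [10.0] * n                       # broad initial spread
    best = None
    for _ in range(n_gen):
        pop = [[rng.gauss(mu[i], sigma[i]) for i in range(n)]
               for _ in range(pop_size)]
        pop.sort(key=fitness)                # keep the n_sel best individuals
        selected = pop[:n_sel]
        if best is None or fitness(selected[0]) < fitness(best):
            best = selected[0]
        for i in range(n):
            column = [ind[i] for ind in selected]
            mu[i] = statistics.fmean(column)                  # Eq. 3.24, mean
            sigma[i] = max(statistics.pstdev(column), 1e-6)   # Eq. 3.24, std dev
    return best
```

The floor on the standard deviation is a practical guard against complete collapse of the sampling distribution, not part of the algorithm as published.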

4.2.2 SHCLVND. Rudlof and Köppen (1996), in their SHCLVND (Stochastic Hill Climbing with Learning by Vectors of Normal Distributions), estimate the joint density function as a product of unidimensional and independent normal densities. The vector of means $\mu = (\mu_1, \ldots, \mu_i, \ldots, \mu_n)$ is adapted by means of the Hebbian rule:

$$\mu^{(l+1)} = \mu^{(l)} + \alpha \cdot \left(b^{(l)} - \mu^{(l)}\right) \qquad (3.25)$$


where $\mu^{(l+1)}$ denotes the vector of means in generation $l+1$, $\alpha$ denotes the learning rate, and $b^{(l)}$ denotes the barycenter of the $B$ best individuals (an amount fixed at the beginning) in the $l$th generation. Adaptation of the vector of variances $\sigma$ is carried out using a reduction policy in the following way:

$$\sigma^{(l+1)} = \sigma^{(l)} \cdot \beta \qquad (3.26)$$

where $\beta$ denotes a previously fixed constant ($0 < \beta < 1$).
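One SHCLVND adaptation step can be sketched as follows; we assume the Hebbian rule takes the form of a move of the means towards the barycenter of the B best individuals, as described in the text (function name is ours):

```python
def shclvnd_step(mu, sigma, best_individuals, alpha=0.1, beta=0.99):
    """One SHCLVND adaptation step: move the means towards the barycenter
    of the best individuals (assumed form of Eq. 3.25) and shrink the
    standard deviations multiplicatively (Eq. 3.26)."""
    n = len(mu)
    b = len(best_individuals)
    barycenter = [sum(ind[i] for ind in best_individuals) / b for i in range(n)]
    new_mu = [mu[i] + alpha * (barycenter[i] - mu[i]) for i in range(n)]
    new_sigma = [s * beta for s in sigma]    # multiplicative reduction policy
    return new_mu, new_sigma
```

Note that the variance schedule is independent of the data: it only decays geometrically, which is what makes SHCLVND a hill-climbing rather than a fully re-estimated scheme.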

4.2.3 PBILc. Sebag and Ducoulombier (1998) propose an extension (PBILc) of the boolean PBIL algorithm to continuous search spaces. As with the previous authors, they assume a joint density function that follows a Gaussian distribution factorizable as a product of unidimensional and independent marginal densities. The adaptation of each component of the vector of means is carried out using the following formula:

$$\mu_i^{(l+1)} = (1 - \alpha) \cdot \mu_i^{(l)} + \alpha \cdot \left(x_i^{best,1}(l) + x_i^{best,2}(l) - x_i^{worst}(l)\right) \qquad (3.27)$$

where $\mu_i^{(l+1)}$ represents the $i$th component of the mean vector $\mu^{(l+1)}$ at generation $l+1$, $x_i^{best,1}(l)$ ($x_i^{best,2}(l)$, $x_i^{worst}(l)$) denotes the best (second best, worst) individual of generation $l$, and $\alpha$ is a constant. For the adaptation of the vector of variances, they propose four heuristics: (i) use a constant value for all the marginals and all the generations; (ii) adjust it as in a $(1, \lambda)$ evolution strategy; (iii) calculate the sample variance of the $K$ best individuals of each generation; (iv) use a Hebbian rule, similar to the adaptation of the means.

Notice the similarity of this approach to the $(1, \lambda)$-ES (Schwefel, 1995).
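The mean adaptation of Eq. (3.27) can be written componentwise in a few lines (function name is ours):

```python
def pbilc_mean_update(mu, best1, best2, worst, alpha=0.05):
    """PBILc mean adaptation (Eq. 3.27), applied componentwise.

    mu: current mean vector; best1, best2, worst: the best, second best
    and worst individuals of the current generation."""
    return [(1.0 - alpha) * m + alpha * (b1 + b2 - w)
            for m, b1, b2, w in zip(mu, best1, best2, worst)]
```

The term best1 + best2 - worst pushes the mean towards the good individuals and away from the bad one, which is what distinguishes this rule from a plain relaxation towards the best sample.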

Figure 3.14 Graphical representation of the probability models for the proposed EDAs for optimization in continuous domains without dependencies between the variables (UMDA_c, SHCLVND, PBILc).


4.2.4 Servet et al. Servet et al. (1997) introduce a progressive approach to the problem. At each generation, and for each dimension $i$, they store an interval $(a_i^l, b_i^l)$ and a real number $z_i^l$ ($i = 1, \ldots, n$). $z_i^l$ represents the probability that the $i$th component of a solution lies in the right half of the current interval. In every generation the probabilities $z_i^l$ are calculated for each dimension, and when $z_i^l$ is closer to 1 (0) than a previously fixed quantity, the interval is reduced to its right (left) half.

4.3 Bivariate dependencies

4.3.1 MIMIC_c^G. This approach was introduced by Larrañaga et al. (1999a, 2000b) and constitutes an adaptation of the MIMIC algorithm (De Bonet et al., 1997) to continuous domains, where the underlying probability model for every pair of variables is assumed to be a bivariate Gaussian.

The idea, as in MIMIC for combinatorial optimization, is to describe the true joint density function by fitting the model as closely as possible to the empirical data, using only one univariate marginal density and $n - 1$ pairwise conditional density functions. In order to do this, the following result is used:

Theorem 1 (Whittaker, 1990; pp. 167) Let $X$ be an $n$-dimensional normal density function, $X \equiv N(x; \mu, \Sigma)$; then the entropy of $X$ is:

$$h(X) = \frac{1}{2} n \left(1 + \log 2\pi\right) + \frac{1}{2} \log |\Sigma|. \qquad (3.28)$$

Applying this result to univariate and bivariate normal density functions in order to define MIMIC_c^G, we obtain:

$$h(X) = \frac{1}{2}\left(1 + \log 2\pi\right) + \log \sigma_X \qquad (3.29)$$

$$h(X \mid Y) = \frac{1}{2}\left[\left(1 + \log 2\pi\right) + \log\left(\frac{\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2}{\sigma_Y^2}\right)\right] \qquad (3.30)$$

where $\sigma_X^2$ ($\sigma_Y^2$) denotes the variance of the univariate variable $X$ ($Y$) and $\sigma_{XY}$ denotes the covariance between the variables $X$ and $Y$.

Structure learning in MIMIC_c^G -see Figure 3.15- works as a straightforward greedy algorithm with two steps. In the first step, the variable with the smallest sample variance is chosen. In the second step, the variable $X$ whose estimate of $\frac{\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2}{\sigma_Y^2}$ with respect to the previously chosen variable $Y$ is smallest is chosen and linked to $Y$.
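The two greedy steps can be sketched as follows, using population statistics in place of the estimates (function name is ours; the returned list is the chain order, starting with the variable chosen in Step 1):

```python
import statistics

def mimic_c_chain(data):
    """Greedy chain selection for MIMIC_c^G: start with the variable of
    smallest sample variance, then repeatedly append the variable whose
    conditional variance given the last chosen one is smallest."""
    n = len(data[0])
    cols = [[row[i] for row in data] for i in range(n)]
    means = [statistics.fmean(c) for c in cols]
    var = [statistics.pvariance(c) for c in cols]
    order = [min(range(n), key=lambda j: var[j])]          # Step 1
    remaining = set(range(n)) - set(order)
    while remaining:                                       # Step 2
        y = order[-1]
        def cond_var(x):
            cov = statistics.fmean((a - means[x]) * (b - means[y])
                                   for a, b in zip(cols[x], cols[y]))
            return (var[x] * var[y] - cov ** 2) / var[y]   # argument of Eq. 3.30
        nxt = min(remaining, key=cond_var)
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

A variable perfectly correlated with the previously chosen one yields a conditional variance of zero, so it is picked before any independent variable of the same marginal variance.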

4.4 Multiple dependencies

In this section we introduce some approaches to EDAs for continuous do­mains in which the density function learnt at each generation is not restricted.


MIMIC_c^G
Step 1. Choose $i_n = \arg\min_j \hat{\sigma}_{X_j}^2$
Step 2. for $k = n-1, n-2, \ldots, 1$
  Choose $i_k = \arg\min_j \frac{\hat{\sigma}_{X_j}^2 \hat{\sigma}_{X_{i_{k+1}}}^2 - \hat{\sigma}_{X_j X_{i_{k+1}}}^2}{\hat{\sigma}_{X_{i_{k+1}}}^2}$, with $j$ ranging over the variables not yet chosen

Figure 3.15 Adaptation of the MIMIC approach to a multivariate Gaussian density function.

As a first approach we present a model where the density function corresponds to an unrestricted multivariate normal density that is learnt from scratch at each generation. The following two models correspond to adaptive and incremental versions of the former model, respectively. We also present approaches based on learning Gaussian networks by edge exclusion tests, as well as two score+search approaches to learn the appropriate Gaussian network at each generation. We finish the section with the IDEA framework.

As well as the references in the following sections, an interesting paper that tries to establish a relation between evolutionary strategies and reinforcement learning in real function optimization is Berny (2000b). In this work adaptive schemes based on gradient flows acting on the density parameters of a multivariate Gaussian density are created.

4.4.1 EMNA_global. This is an approach based on the estimation of a multivariate normal density function at each generation. As we can see in Figure 3.16, at each generation we estimate the vector of means, $\mu_l = (\mu_{1,l}, \ldots, \mu_{n,l})$, and the variance-covariance matrix, $\Sigma_l$, whose elements are denoted by $\sigma_{ij,l}^2$ with $i, j = 1, \ldots, n$. This means that we need to estimate $2n + \binom{n}{2}$ parameters at each generation: $n$ means, $n$ variances and $\binom{n}{2}$ covariances. These parameters are estimated by their maximum likelihood estimates in the following way:

$$\hat{\mu}_{i,l} = \bar{X}_i^l = \frac{1}{N} \sum_{r=1}^{N} x_{i,r}^l, \qquad i = 1, \ldots, n \qquad (3.31)$$

$$\hat{\sigma}_{i,l}^2 = \frac{1}{N} \sum_{r=1}^{N} \left(x_{i,r}^l - \bar{X}_i^l\right)^2, \qquad i = 1, \ldots, n \qquad (3.32)$$


EMNA_global
$D_0$ ← Generate $M$ individuals (the initial population) at random
Repeat for $l = 1, 2, \ldots$ until the stopping criterion is met
  $D_{l-1}^{Se}$ ← Select $N \leq M$ individuals from $D_{l-1}$ according to the selection method
  $f_l(x) = f(x \mid D_{l-1}^{Se}) = N(x; \mu_l, \Sigma_l)$ ← Estimate the multivariate normal density function from the selected individuals
  $D_l$ ← Sample $M$ individuals (the new population) from $f_l(x)$

Figure 3.16 Pseudocode for the EMNA_global approach.

$$\hat{\sigma}_{ij,l}^2 = \frac{1}{N} \sum_{r=1}^{N} \left(x_{i,r}^l - \bar{X}_i^l\right)\left(x_{j,r}^l - \bar{X}_j^l\right), \qquad i, j = 1, \ldots, n, \; i \neq j. \qquad (3.33)$$

Although the number of parameters that this approach needs to estimate at each generation is greater than in the case where the joint density function is estimated by means of Gaussian networks -see Section 4.4.4-, the mathematics needed to develop this approach is very simple. Note also that in the approach based on Gaussian networks where edge exclusion tests are used, it is necessary to calculate the same number of parameters as in this approach and then carry out hypothesis tests on them. On the other hand, in the approach where score+search is used to obtain the best Gaussian network, a search over the space of possible models is mandatory. For more details about EMNA_global see Larrañaga et al. (2001).
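The EMNA_global generation of Figure 3.16 can be sketched compactly; the population sizes, the uniform initialization, and the truncation selection are illustrative choices of ours, and a hand-rolled Cholesky factor is used to sample from $N(x; \mu_l, \Sigma_l)$:

```python
import math
import random

def estimate_mvn(selected):
    """ML estimates of the mean vector and covariance matrix (Eqs. 3.31-3.33)."""
    m, n = len(selected), len(selected[0])
    mu = [sum(ind[i] for ind in selected) / m for i in range(n)]
    cov = [[sum((ind[i] - mu[i]) * (ind[j] - mu[j]) for ind in selected) / m
            for j in range(n)] for i in range(n)]
    return mu, cov

def cholesky(a):
    """Lower-triangular L with L L^T = a (tiny floor for numerical safety)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(max(a[i][i] - s, 1e-12))
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def sample_mvn(mu, cov, rng):
    """Draw one individual from N(mu, cov) via x = mu + L z."""
    l = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(l[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mu))]

def emna_global(fitness, n, pop_size=200, n_sel=100, n_gen=40, seed=0):
    """Minimization loop following Figure 3.16 with truncation selection."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-10.0, 10.0) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness)
        mu, cov = estimate_mvn(pop[:n_sel])
        pop = [sample_mvn(mu, cov, rng) for _ in range(pop_size)]
    return min(pop, key=fitness)
```

Because the full covariance matrix is refitted each generation, the model can capture arbitrary pairwise dependencies without any structure search, which is the trade-off discussed in the text.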

4.4.2 EMNA_a. In this section we present an adaptive version of the approach introduced in the previous section. Pseudocode for EMNA_a (Estimation of Multivariate Normal Algorithm adaptive) is shown in Figure 3.17.

Once we obtain the first model, $N(x; \mu_1, \Sigma_1)$, whose parameters are estimated from the individuals selected from the initial population, the flow of EMNA_a is similar to that of a steady-state genetic algorithm.

First, we simulate one individual from the current multivariate normal density model. Next, we compare the goodness of this simulated individual with that of the worst individual in the current population. If the fitness of the simulated individual is better than that of the worst individual,


EMNA_a
$D_0$ ← Generate $M$ individuals (the initial population) at random
Select $N \leq M$ individuals from $D_0$ according to the selection method
Obtain the first multivariate normal density $N(x; \mu_1, \Sigma_1)$
Repeat for $l = 1, 2, \ldots$ until the stopping criterion is met
  Generate an individual $x_{ge}^l$ from $N(x; \mu_l, \Sigma_l)$
  if $x_{ge}^l$ is better than the worst individual, $x^{l,N}$, then
    add $x_{ge}^l$ to the population and drop $x^{l,N}$ from it
  Obtain $N(x; \mu_{l+1}, \Sigma_{l+1})$

Figure 3.17 Pseudocode for the EMNA_a approach.

then we replace the worst individual with the simulated one. Once a new individual replaces an existing one, it is necessary to update the parameters of the multivariate normal density function.

This update is done using the following formulae -see Larrañaga et al. (2001) for details- which can be obtained by simple algebraic manipulations:

$$\mu_{l+1} = \mu_l + \frac{1}{N}\left(x_{ge}^l - x^{l,N}\right) \qquad (3.34)$$

$$\hat{\sigma}_{ij,l+1}^2 = \hat{\sigma}_{ij,l}^2 - \frac{1}{N^2}\left(x_{ge,i}^l - x_i^{l,N}\right) \sum_{r=1}^{N}\left(x_j^{l,r} - \mu_j^l\right) - \frac{1}{N^2}\left(x_{ge,j}^l - x_j^{l,N}\right) \sum_{r=1}^{N}\left(x_i^{l,r} - \mu_i^l\right)$$
$$+ \frac{1}{N^2}\left(x_{ge,i}^l - x_i^{l,N}\right)\left(x_{ge,j}^l - x_j^{l,N}\right) - \frac{1}{N}\left(x_i^{l,N} - \mu_i^{l+1}\right)\left(x_j^{l,N} - \mu_j^{l+1}\right)$$
$$+ \frac{1}{N}\left(x_{ge,i}^l - \mu_i^{l+1}\right)\left(x_{ge,j}^l - \mu_j^{l+1}\right) \qquad (3.35)$$

Note that with this EMNA_a approach, the size of the population of individuals is kept fixed in every generation.
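The mean update of Eq. (3.34) is easy to verify against direct recomputation over the modified population; a sketch (function name is ours):

```python
def emna_a_mean_update(mu, new_ind, worst_ind, n_pop):
    """Eq. 3.34: updated mean after new_ind replaces worst_ind in a
    population of fixed size n_pop (applied componentwise)."""
    return [m + (x - w) / n_pop for m, x, w in zip(mu, new_ind, worst_ind)]
```

Replacing one individual shifts each mean component by exactly one N-th of the difference between the incoming and outgoing values, so no pass over the whole population is needed.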

4.4.3 EMNA_i. In this section we describe the EMNA_i (Estimation of Multivariate Normal Algorithm incremental) approach. It resembles the previous approach in that both generate only one individual from each model. The difference is that in EMNA_i each generated individual is added -as can be seen in Figure 3.18- to the population.


EMNA_i
$D_0$ ← Generate $M$ individuals (the initial population) at random
Select $N \leq M$ individuals from $D_0$ according to the selection method
Obtain the first multivariate normal density $N(x; \mu_1, \Sigma_1)$
Repeat for $l = 1, 2, \ldots$ until the stopping criterion is met
  Generate an individual $x_{ge}^l$ from $N(x; \mu_l, \Sigma_l)$
  Add $x_{ge}^l$ to the population
  Obtain $N(x; \mu_{l+1}, \Sigma_{l+1})$

Figure 3.18 Pseudocode for the EMNA_i approach.

This procedure uses the following updating rules:

(3.36)

(3.37)

Notice that in EMNA_i the number of individuals in the population increases as the algorithm evolves. For details about the formulae see Larrañaga et al. (2001).

4.4.4 EGNA_ee, EGNA_BGe, EGNA_BIC. One proposal for optimization in continuous domains based on the learning and simulation of Gaussian networks is illustrated in Figure 3.19. The basic steps of each iteration are:

• Learning the Gaussian network structure by using one of several methods (edge exclusion tests, Bayesian score+search, or penalized maximum likelihood+search).

• Computation of estimates for the parameters of the learnt Gaussian network structure.


EGNA_ee, EGNA_BGe, EGNA_BIC
For $l = 1, 2, \ldots$ until the stopping criterion is met
  $D_{l-1}^{Se}$ ← Select $S_e$ individuals from $D_{l-1}$
  (i) $S_l$ ← Structural learning via:
    edge exclusion tests → EGNA_ee
    Bayesian score+search → EGNA_BGe
    penalized maximum likelihood+search → EGNA_BIC
  (ii) $\theta^l$ ← Calculate the estimates for the parameters of $S_l$
  (iii) $M_l$ ← Create the Gaussian network model from $S_l$ and $\theta^l$
  (iv) $D_l$ ← Sample $M$ individuals from $M_l$ using the continuous version of the PLS algorithm

Figure 3.19 Pseudocode for the EGNA_ee, EGNA_BGe, and EGNA_BIC algorithms.

• Creation of the Gaussian network model.

• Simulation of the joint probability distribution function encoded by the Gaussian network learnt in the previous steps. For this last step, we use an adaptation of the PLS algorithm to continuous domains.

While in the EGNA_ee the Gaussian network is induced at each generation by means of edge exclusion tests, as explained in Chapter 2 (Section 4.3.1), the model induction in the EGNA_BGe and EGNA_BIC is carried out by score+search approaches. In the EGNA_BGe a Bayesian score that gives the same value to Gaussian networks reflecting the same conditional (in)dependencies is used. Details of this score are given in Chapter 2 (Section 4.3.2). On the other hand, in the EGNA_BIC a penalized maximum likelihood score based on the Bayesian Information Criterion is used. See also Chapter 2 (Section 4.3.2) for additional information. In both approaches, EGNA_BGe and EGNA_BIC, a local search is used to find good structures.

For more details about the EGNA_ee and EGNA_BGe see Larrañaga et al. (1999a, 2000b). The EGNA_BIC approach is described in more detail in Larrañaga et al. (2001). Applications of these EGNA approaches can be found in Cotta


et al. (2001), Bengoetxea et al. (2001a), Lozano and Mendiburu (2001) and Robles et al. (2001).

4.4.5 IDEA. Bosman and Thierens (1999b, 2000a) introduce the IDEA (Iterated Density Estimation Evolutionary Algorithm) approach to EDAs. Although IDEA is a general framework that can be used in discrete and continuous optimization, the new approaches proposed belong to continuous domains.

There are two main characteristics of IDEA. The first one is that in each generation individuals are sampled from a truncated distribution, where the truncation point is given by the worst individual found in the previous genera­tion. The second characteristic is that only part of the population is replaced in each generation.

Bosman and Thierens (2000b, 2000c, 2000d) contain some experiments in continuous optimization using IDEA. The density models used are multivariate normal, the histogram distribution, and the normal kernel distribution with a diagonal covariance matrix. While in the first two references the model is searched using a greedy search with the Kullback-Leibler divergence as score, in the last reference the model is obtained by a hypothesis test approach.

4.5 Mixture models

In order to obtain more flexible models, Gallagher et al. (1999) and Gallagher (2000) propose a finite Adaptive Gaussian Mixture model (AMix) density estimator. In AMix, the density function, $f_l(x)$, is factorized as a product of univariate densities:

$$f_l(x) = \prod_{i=1}^{n} f_l(x_i) \qquad (3.38)$$

where $f_l(x_i)$ is represented at each generation by a sum of component distributions:

$$f_l(x_i) = \sum_{j=1}^{K_l} \pi_{l,j}\, f_l(x_i \mid j) \qquad (3.39)$$

where $K_l$ is the number of mixture components, $\pi_{l,j}$ is the mixing coefficient for the $j$th component, and $f_l(x_i \mid j)$ represents the univariate Gaussian density corresponding to the $j$th cluster, that is $f_l(x_i \mid j) \equiv N(x_i; \mu_{i,j}^l, \sigma_{i,j}^{2,l})$. Adaptation of the model is done using the Adaptive Mixture Model (Priebe, 1994), which assumes that data points arrive sequentially. One interesting characteristic of the model is that the number of mixture components, $K_l$, is allowed to vary during the execution of the algorithm.
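Sampling one individual from such a product of univariate mixtures (Eqs. 3.38-3.39) can be sketched as follows; per variable, a component $j$ is drawn with probability $\pi_{l,j}$ and then a value from the corresponding Gaussian (function name is ours, and the component weights are shared across variables, as in Eq. 3.39):

```python
import random

def sample_mixture_product(weights, mus, sigmas, rng):
    """Draw one individual from f(x) = prod_i sum_j pi_j N(x_i; mu[i][j], sigma[i][j]).

    weights: mixing coefficients pi_j (one per component).
    mus, sigmas: per-variable lists of per-component means / std deviations."""
    x = []
    for i in range(len(mus)):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        x.append(rng.gauss(mus[i][j], sigmas[i][j]))
    return x
```

Drawing the component independently for each variable is what keeps the model univariate per dimension: dependencies between variables are not captured, only multimodality of each marginal.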


Gallagher (2000) proposes modeling each variable by a finite Gaussian kernel density estimator. The algorithm is called Fink (Finite number of kernels) and the model used is as follows:

$$f_l(x) = \prod_{i=1}^{n} f_l(x_i) \qquad (3.40)$$

with

$$f_l(x_i) = \frac{1}{N} \sum_{j=1}^{N} f_l(x_i \mid j) \qquad (3.41)$$

where for all $i = 1, \ldots, n$ and $j = 1, \ldots, N$

$$f_l(x_i \mid j) \equiv N(x_i; \mu_{i,j}^l, \sigma_{i,j}^{2,l}) \qquad (3.42)$$

and $N$ denotes the number of kernels, which equals the number of selected individuals. Fink introduces the principle of cooperative search in the following way. Each sample individual is created by first selecting a kernel at random, and then drawing from the chosen kernel. The update of the $\mu_{i,j}^l$ parameters is done by a PBIL-like rule that takes into account the best sample from each kernel as well as the single best sample.

Notice that, unlike AMix, in Fink all components (kernels) have equal weighting, eliminating the need to store and adapt mixture coefficients. Also, updating of the probability model in Fink is via PBIL-style equations, which is faster than the AMix update procedure.

Bosman and Thierens (2000e) present an extension of IDEA to learn mixtures of multivariate normal distributions in each generation by means of two fast clustering algorithms: leader and k-means.

Peña et al. (2001) -Chapter 4 in this book- introduce a model based on conditional Gaussian networks where each component of the mixture is a Gaussian network whose structure is a tree augmented network and where the weight of each component is obtained by an adaptation of the EM algorithm.

5. Summary

In this chapter we have presented a review of different approaches to EDAs

in combinatorial optimization and in optimization in continuous domains. The different approaches have been presented using a unified notation and have been organized by the complexity of the probabilistic graphical model learnt from data at each generation.


References

Alba, E., Santana, R., Ochoa, A., and Lazo, M. (2000). Finding typical testors by using an evolutionary strategy. In Proceedings of the Fifth Ibero American Symposium on Pattern Recognition, pages 267-278.

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University.

Baluja, S. (1995). An empirical comparison of seven iterative and evolutionary function optimization heuristics. Technical Report CMU-CS-95-193, Carnegie Mellon University.

Baluja, S. (1997). Genetic algorithms and explicit search statistics. Advances in Neural Information Processing Systems, 9:319-325.

Baluja, S. and Caruana, R. (1995). Removing the genetics from the standard genetic algorithm. In Prieditis, A. and Russell, S., editors, Proceedings of the International Conference on Machine Learning, pages 38-46. Morgan Kaufmann.

Baluja, S. and Davies, S. (1997a). Combining multiple optimization runs with optimal dependency trees. Technical Report CMU-CS-97-157, Carnegie Mellon University.

Baluja, S. and Davies, S. (1997b). Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. Technical Report CMU-CS-97-107, Carnegie Mellon University.

Baluja, S. and Davies, S. (1998). Fast probabilistic modeling for combinatorial optimization. In AAAI-98.

Bandyopadhyay, S., Kargupta, H., and Wang, G. (1998). Revisiting the GEMGA: Scalable evolutionary optimization through linkage learning. In Proceedings of the 1998 IEEE International Conference on Evolutionary Computation, pages 603-608. IEEE Press.

Bengoetxea, E., Larrañaga, P., Bloch, I., and Perchant, A. (2001a). Solving graph matching with EDAs using a permutation-based representation. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A., and Boeres, C. (2000). Inexact graph matching using learning and simulation of Bayesian networks. An empirical comparison between different approaches with synthetic data. In Workshop Notes of CaNew2000: Workshop on Bayesian and Causal Networks: From Inference to Data Mining. Fourteenth European Conference on Artificial Intelligence, ECAI2000. Berlin.

Bengoetxea, E., Miquélez, T., Lozano, J. A., and Larrañaga, P. (2001b). Empirical comparison of Estimation of Distribution Algorithms in continuous


optimization. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Berny, A. (2000a). An adaptive scheme for real function optimization acting as a selection operator. In Yao, X., editor, First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks.

Berny, A. (2000b). Selection and reinforcement learning for combinatorial optimization. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature - PPSN VI. Lecture Notes in Computer Science 1917, pages 601-610.

Blanco, R. and Lozano, J. A. (2001). Empirical comparison of Estimation of Distribution Algorithms in combinatorial optimization. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Bosman, P. A. N. and Thierens, D. (1999a). An algorithmic framework for density estimation based evolutionary algorithms. Technical Report UU-CS-1999-46, Utrecht University.

Bosman, P. A. N. and Thierens, D. (1999b). Linkage information processing in distribution estimation algorithms. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, volume 1, pages 60-67. Morgan Kaufmann Publishers, San Francisco, CA.

Bosman, P. A. N. and Thierens, D. (2000a). Continuous iterated density estimation evolutionary algorithms within the IDEA framework. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 197-200.

Bosman, P. A. N. and Thierens, D. (2000b). Expanding from discrete to continuous estimation of distribution algorithms: The IDEA. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature - PPSN VI. Lecture Notes in Computer Science 1917, pages 767-776.

Bosman, P. A. N. and Thierens, D. (2000c). IDEAs based on the normal kernels probability density function. Technical Report UU-CS-2000-11, Utrecht University.

Bosman, P. A. N. and Thierens, D. (2000d). Negative log-likelihood and statistical hypothesis testing as the basis of model selection in IDEAs. In Genetic and Evolutionary Computation Conference GECCO-2000. Late Breaking Papers, pages 51-58.

Bosman, P. A. N. and Thierens, D. (2000e). Mixed IDEAs. Technical Report UU-CS-2000-45, Utrecht University.


Buntine, W. (1991). Theory refinement in Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52-60. Morgan Kaufmann.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467.

Cotta, C., Alba, E., Sagarna, R., and Larrañaga, P. (2001). Adjusting weights in artificial neural networks using evolutionary algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

De Bonet, J. S., Isbell, C. L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 9.

de Campos, L. M., Gamez, J. A., Larrañaga, P., Moral, S., and Romero, T. (2001). Partial abductive inference in Bayesian networks: an empirical comparison between GAs and EDAs. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Eshelman, L. J. and Schaffer, J. D. (1993). Productive recombination and propagating and preserving schemata. Foundations of Genetic Algorithms, 3:299-314.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence. CIMAF99. Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency versus inter­pretability of classifications. Biometrics, 21:768.

Fyfe, C. (1999). Structured population-based incremental learning. Soft Computing, 2(4):191-198.

Galic, E. and Höhfeld, M. (1996). Improving the generalization performance of multi-layer perceptrons with population-based incremental learning. In Parallel Problem Solving from Nature. PPSN-IV, pages 740-750.

Gallagher, M. R. (2000). Multi-layer perceptron error surfaces: Visualization, structure and modelling. Doctoral thesis, Department of Computer Science and Electrical Engineering, University of Queensland.

Gallagher, M. R., Frean, M., and Downs, T. (1999). Real-valued evolutionary optimization using a flexible probability density estimator. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 840-846. Morgan Kaufmann.

Goldberg, D. E., Deb, K., Kargupta, H., and Harik, G. (1993). Rapid, accurate optimization of difficult problems using fast messy genetic algorithms. In Forrest, S., editor, Proceedings of the Fifth International Conference on Genetic Algorithms, pages 56-64. Morgan Kaufmann.


Goldberg, D. E., Korb, B., and Deb, K. (1989). Messy genetic algorithms: Motivation, analysis and first results. Complex Systems, 3(5):493-530.

Gonzalez, C., Lozano, J. A., and Larrañaga, P. (2001a). Analyzing the PBIL algorithm by means of discrete dynamical systems. Complex Systems. In press.

Gonzalez, C., Lozano, J. A., and Larrañaga, P. (2001b). The convergence behavior of the PBIL algorithm: a preliminary approach. In Kurkova, V., Steel, N. C., Neruda, R., and Karny, M., editors, International Conference on Artificial Neural Networks and Genetic Algorithms. ICANNGA-2001, pages 228-231. Springer.

Gonzalez, C., Lozano, J. A., and Larrañaga, P. (2001c). Mathematical modeling of discrete Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Grefenstette, J. J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1):122-128.

Harik, G. (1999). Linkage learning via probabilistic modeling in the ECGA. Technical Report IlliGAL Report 99010, University of Illinois at Urbana-Champaign.

Harik, G., Lobo, F. G., and Goldberg, D. E. (1998). The compact genetic algorithm. In Proceedings of the IEEE Conference on Evolutionary Computation, pages 523-528.

Höhfeld, M. and Rudolph, G. (1997). Towards a theory of population-based incremental learning. In Proceedings of the 4th International Conference on Evolutionary Computation, pages 1-5. IEEE Press.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. The Uni­versity of Michigan Press.

Inza, I., Larrañaga, P., Etxeberria, R., and Sierra, B. (2000). Feature subset selection by Bayesian networks based optimization. Artificial Intelligence, 123(1-2):157-184.

Inza, I., Larrañaga, P., and Sierra, B. (2001a). Feature subset selection by Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Inza, I., Larrañaga, P., and Sierra, B. (2001b). Feature weighting in K-NN by means of Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Inza, I., Merino, M., Larrañaga, P., Quiroga, J., Sierra, B., and Girala, M. (2001c). Feature subset selection by population-based incremental learning. A case study in the survival of cirrhotic patients with TIPS. Artificial Intelligence in Medicine. In press.


Juels, A. (1997). Topics in black-box combinatorial optimization. Doctoral thesis, University of California, Berkeley.

Kargupta, H. (1996). The gene expression messy genetic algorithm. In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 631-636. IEEE Press.

Kargupta, H. and Goldberg, D. E. (1997). Search, blackbox optimization, and sample complexity. In Belew, R. W. and Vose, M., editors, Foundations of Genetic Algorithms 4. Morgan Kaufmann. San Mateo, CA.

Kvasnicka, V., Pelikan, M., and Pospichal, J. (1996). Hill climbing with learning (an abstraction of genetic algorithms). Neural Network World, 6:773-796.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (1999a). Optimization by learning and simulation of Bayesian and Gaussian networks. Technical Report KZZA-IK-4-99, Department of Computer Science and Artificial Intelligence, University of the Basque Country.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000a). Combinatorial optimization by learning and simulation of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 343-352. Stanford.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000b). Optimization in continuous domains by learning and simulation of Gaussian networks. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 201-204.

Larrañaga, P., Etxeberria, R., Lozano, J. A., Sierra, B., Inza, I., and Peña, J. M. (1999b). A review of the cooperation between evolutionary computation and probabilistic graphical models. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 314-324. La Habana.

Larrañaga, P., Lozano, J. A., and Bengoetxea, E. (2001). Estimation of Distribution Algorithms based on multivariate normal and Gaussian networks. Technical Report KZZA-IK-1-01, Department of Computer Science and Artificial Intelligence, University of the Basque Country.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Lobo, F. G., Deb, K., Goldberg, D. E., Harik, G. R., and Wang, L. (1998). Compressed introns in a linkage learning genetic algorithm. In Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 551-558. Morgan Kaufmann.

Lozano, J. A. and Mendiburu, A. (2001). Estimation of Distribution Algorithms applied to the job shop scheduling problem: Some preliminary research. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Lozano, J. A., Sagarna, R., and Larrañaga, P. (2001). Parallel Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Mahnig, T. and Mühlenbein, H. (2000). Mathematical analysis of optimization methods using search distributions. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 205-208.

Maxwell, B. and Anderson, S. (1999). Training hidden Markov models using population-based learning. In Genetic and Evolutionary Computation Conference, GECCO-99.

Monmarche, N., Ramat, E., Desbarats, L., and Venturini, G. (2000). Probabilistic search with genetic algorithms and ant colonies. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 209-211.

Monmarche, N., Ramat, E., Dromel, G., Slimane, M., and Venturini, G. (1999). On the similarities between AS, BSC and PBIL: toward the birth of a new meta-heuristic. Technical Report 215, E3i, Université de Tours.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Mühlenbein, H. and Mahnig, T. (1999a). Convergence theory and applications of the factorized distribution algorithm. Journal of Computing and Information Technology, 7:19-32.

Mühlenbein, H. and Mahnig, T. (1999b). The Factorized Distribution Algorithm for additively decomposed functions. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 301-313. La Habana.

Mühlenbein, H. and Mahnig, T. (1999c). FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7(4):353-376.

Mühlenbein, H. and Mahnig, T. (2000). Evolutionary algorithms: From recombination to search distributions. Theoretical Aspects of Evolutionary Computing. Natural Computing, pages 137-176.

Mühlenbein, H., Mahnig, T., and Ochoa, A. (1999). Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics, 5:215-247.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature - PPSN IV, pages 178-187.

Mühlenbein, H. and Voigt, H.-M. (1996). Gene pool recombination in genetic algorithms. Metaheuristics: Theory and Applications, pages 53-62.

Ochoa, A., Mühlenbein, H., and Soto, M. (2000a). Factorized Distribution Algorithm using Bayesian networks. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 212-215.

Ochoa, A., Mühlenbein, H., and Soto, M. (2000b). A Factorized Distribution Algorithm using single connected Bayesian networks. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature - PPSN VI. Lecture Notes in Computer Science 1917, pages 787-796.

Ochoa, A., Soto, M., Santana, R., Madera, J., and Jorge, N. (1999). The Factorized Distribution Algorithm and the junction tree: A learning perspective. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 368-377. La Habana.

Pelikan, M. and Goldberg, D. E. (2000a). Genetic algorithms, clustering, and the breaking of symmetry. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature - PPSN VI. Lecture Notes in Computer Science 1917, pages 385-394.

Pelikan, M. and Goldberg, D. E. (2000b). Hierarchical problem solving and the Bayesian optimization algorithm. In Whitley, D., Goldberg, D., Cantu-Paz, E., Spector, L., Parmee, I., and Beyer, H.-G., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 267-274. Morgan Kaufmann.

Pelikan, M. and Goldberg, D. E. (2000c). Research on the Bayesian optimization algorithm. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 212-215.

Pelikan, M., Goldberg, D. E., and Cantu-Paz, E. (1999a). BOA: The Bayesian optimization algorithm. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, volume 1, pages 525-532. Morgan Kaufmann Publishers, San Francisco, CA. Orlando, FL.

Pelikan, M., Goldberg, D. E., and Cantu-Paz, E. (2000a). Bayesian optimization algorithm, population sizing, and time to convergence. In Whitley, D., Goldberg, D., Cantu-Paz, E., Spector, L., Parmee, I., and Beyer, H.-G., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 275-282. Morgan Kaufmann.

Pelikan, M., Goldberg, D. E., and Cantu-Paz, E. (2000b). Linkage problem, distribution estimation and Bayesian networks. Evolutionary Computation, 8(3):311-340.

Pelikan, M., Goldberg, D. E., and Lobo, F. (1999b). A survey of optimization by building and using probabilistic models. Technical Report IlliGAL Report 99018, University of Illinois at Urbana-Champaign.


Pelikan, M., Goldberg, D. E., and Sastry, K. (2000c). Bayesian optimization algorithm, decision graphs, and Occam's razor. Technical Report IlliGAL Report 200020, University of Illinois at Urbana-Champaign.

Pelikan, M. and Mühlenbein, H. (1999). The bivariate marginal distribution algorithm. Advances in Soft Computing-Engineering Design and Manufacturing, pages 521-535.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (2001). Benefits of data clustering in multimodal function optimization via EDAs. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Priebe, C. E. (1994). Adaptive mixtures. Journal of the American Statistical Association, 89(427):796-806.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14:465-471.

Rivera, J. (1999). Using Estimation of Distribution Algorithms as an evolutive component of the XCS classifier system. Technical report, University of La Habana (In Spanish).

Robles, V., de Miguel, P., and Larrañaga, P. (2001). Solving the travelling salesman problem with Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Roure, J., Sangüesa, R., and Larrañaga, P. (2001). Partitional clustering by means of Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Rudlof, S. and Köppen, M. (1996). Stochastic hill climbing by vectors of normal distributions. In Proceedings of the First Online Workshop on Soft Computing (WSC1). Nagoya, Japan.

Sagarna, R. (2000). Parallelization of Estimation of Distribution Algorithms. Master thesis, Department of Computer Science and Artificial Intelligence, University of the Basque Country (In Spanish).

Sagarna, R. and Larrañaga, P. (2001). Solving the knapsack problem with Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Salustowicz, R. and Schmidhuber, J. (1997). Probabilistic incremental program evolution. Evolutionary Computation, 5(2):123-141.

Salustowicz, R. and Schmidhuber, J. (1998). Learning to predict through probabilistic incremental program evolution and automatic task decomposition. Technical Report IDSIA-U-98, University of Lugano.

Santana, R. and Ochoa, A. (1999). Dealing with constraints with Estimation of Distribution Algorithms: The univariate case. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 378-384. La Habana.

Santana, R., Ochoa, A., Soto, M., Pereira, F. B., Machado, P., Costa, E., and Cardoso, A. (2000). Probabilistic evolution and the busy beaver problem. In Whitley, D., Goldberg, D., Cantu-Paz, E., Spector, L., Parmee, I., and Beyer, H.-G., editors, Proceedings of the Genetic and Evolutionary Computation Conference, page 380. Morgan Kaufmann.

Sastry, K. and Goldberg, D. E. (2000). On extended compact genetic algorithm. In GECCO-2000, Late Breaking Papers, Genetic and Evolutionary Computation Conference, pages 352-359.

Schmidt, M., Kristensen, K., and Jensen, T. (1999). Adding genetics to the standard PBIL algorithm. In Congress on Evolutionary Computation. CEC'99.

Schwarz, J. and Ocenasek, J. L. (1999). Experimental study: Hypergraph partitioning based on the simple and advanced algorithms BMDA and BOA. In Proceedings of the Fifth International Conference on Soft Computing, pages 124-130. Brno, Czech Republic.

Schwefel, H.-P. (1995). Evolution and Optimum Seeking. Wiley, New York.

Sebag, M. and Ducoulombier, A. (1998). Extending population-based incremental learning to continuous search spaces. In Parallel Problem Solving from Nature - PPSN V, pages 418-427. Springer-Verlag. Berlin.

Servais, M. P., de Jager, G., and Greene, J. R. (1997). Function optimization using multiple-base population based incremental learning. In Proceedings of the Eighth South African Workshop on Pattern Recognition.

Servet, I., Trave-Massuyes, L., and Stern, D. (1997). Telephone network traffic overloading diagnosis and evolutionary techniques. In Proceedings of the Third European Conference on Artificial Evolution (AE'97), pages 137-144.

Sierra, B., Jimenez, E., Inza, I., Larrañaga, P., and Muruzabal, J. (2001). Rule induction using Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Soto, M., Ochoa, A., Acid, S., and de Campos, L. M. (1999). Introducing the polytree approximation of distribution algorithm. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 360-367. La Habana.

Sukthankar, R., Baluja, S., and Hancock, J. (1997). Evolving an intelligent vehicle for tactical reasoning in traffic. In International Conference on Robotics and Automation.

Syswerda, G. (1993). Simulated crossover in genetic algorithms. Foundations of Genetic Algorithms 2, pages 239-255.

Thathachar, M. and Sastry, P. S. (1987). Learning optimal discriminant functions through a cooperative game of automata. IEEE Transactions on Systems, Man, and Cybernetics, 17(1).

Page 127: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation

100 Estimation of Distribution Algorithms

van Kemenade, C. H. M. (1998). Building block filtering and mixing. In Proceedings of the 1998 International Conference on Evolutionary Computation. IEEE Press.

Whittaker, J. (1990). Graphical models in applied multivariate statistics. John Wiley and Sons.

Zhang, B.-T. (1999). A Bayesian framework for evolutionary computation. In Proceedings of the Congress on Evolutionary Computation (CEC99), IEEE Press, pages 722-727.

Zhang, B.-T. (2000). Bayesian evolutionary algorithms for learning and optimization. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 220-222.

Zhang, B.-T. and Cho, D.-Y. (2000). Evolving neural trees for time series prediction using Bayesian evolutionary algorithms. In Proceedings of the First IEEE Workshop on Combinations of Evolutionary Computation and Neural Networks (ECNN-2000).

Zhang, B.-T. and Shin, S.-Y. (2000). Bayesian evolutionary optimization using Helmholtz machines. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Lecture Notes in Computer Science, 1917. Parallel Problem Solving from Nature - PPSN VI, pages 827-836.

Zhang, Q. and Mühlenbein, H. (1999). On global convergence of FDA with proportionate selection. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 340-343. La Habana.

Zhigljavsky, A. A. (1991). Theory of Global Random Search. Kluwer Academic Publishers.


Chapter 4

Benefits of Data Clustering in Multimodal Function Optimization via EDAs

J. M. Peña, J. A. Lozano, P. Larrañaga Department of Computer Science and Artificial Intelligence

University of the Basque Country

{ccbpepaj, lozano, ccplamup}@si.ehu.es

Abstract This chapter shows how Estimation of Distribution Algorithms (EDAs) can benefit from data clustering in order to optimize both discrete and continuous multimodal functions. To be exact, the advantage of incorporating clustering into EDAs is two-fold: to obtain all the best solutions rather than only one of them, and to alleviate the difficulties that affect many evolutionary algorithms when more than one global optimum exists. We propose the use of Bayesian networks and conditional Gaussian networks to perform such a data clustering when EDAs are applied to optimization in discrete and continuous multimodal domains, respectively. The dynamics and performance of our approach are shown by evaluating it on a number of symmetrical functions, some of them highly multimodal.

Keywords: EDAs, data clustering, Bayesian networks, conditional Gaussian networks

1. Introduction

Many optimization problems present several global optima. However, most

evolutionary algorithms are essentially developed to capture only one of the set of best solutions in the problem domain considered. For instance, given a multimodal function with several equally sized global peaks, the simple genetic algorithm (GA) can converge to only one of them. Moreover, this peak is randomly chosen due to the well-known genetic drift (De Jong, 1975): the simple GA has no means to decide among the different global peaks, and only the stochastic variations due to the genetic operators can make the population

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


drift to one of these peaks. Unfortunately, this behaviour is shared by most evolutionary algorithms. This is a drawback as we are interested in obtaining all the global optima of both discrete and continuous multimodal functions.

Furthermore, this interest is not only quantitative but also qualitative. That is, the existence of several global peaks is a major source of difficulties for many evolutionary algorithms. Basically, the problems appear because combining good solutions coming from different parts of the search space or basins often results in poor solutions. If this is the case, convergence may be very slow until the population drifts to one of the basins or, even worse, the algorithm may get stuck in a local optimum. Apparently, these problems are aggravated in algorithms based on building and simulating probabilistic graphical models (EDAs) (Pelikan and Goldberg, 2000).

Consequently, we propose taking advantage of data clustering in order to effectively and efficiently discover all the global optima of a given optimization problem by using EDAs. The work presented in this chapter is clearly inspired by the research of Pelikan and Goldberg (2000). In that work, the authors introduce for the first time data clustering as a tool to alleviate the problems derived from the existence of several optima in symmetrical discrete domains when EDAs are applied to perform the optimization. Roughly speaking, their motivation is to process separately, at each generation, the complementary parts of the solution space. Unlike Pelikan and Goldberg (2000), our proposal does not explicitly divide the solution space into different parts but takes advantage of probabilistic graphical models that are able to represent the complexity of the solution space. The models considered in this chapter are Bayesian networks (BNs) for discrete domains (Pearl, 1988; Peña et al., 2000; Peña et al., 2001a), and conditional Gaussian networks (CGNs) for continuous domains (Lauritzen, 1992; Lauritzen, 1996; Lauritzen and Wermuth, 1989; Peña et al., 2001b; Peña et al., 2001d; Peña et al., 2001a). Thus, we present a unified framework to tackle both discrete and continuous multimodal function optimization problems that extends the idea behind Pelikan and Goldberg (2000).

The remainder of this chapter is organized as follows. Section 2 is a presentation of the motivation and objectives of our proposal to combine EDAs with data clustering via BNs and CGNs to optimize multimodal functions. Section 3 introduces BNs and CGNs applied to data clustering. Section 4 deals with some practical issues related to our proposal. We present some experimental results of the performance achieved by our approach in Section 5. Finally, we draw conclusions in Section 6.


2. Data clustering in evolutionary algorithms for multimodal function optimization

The combination of data clustering and evolutionary algorithms has proven to be very successful as evidenced by the large body of research conducted in this direction. However, the purpose of this combination varies from work to work. According to our aims, two of the most relevant works that propose the use of clustering inside the evolutionary algorithms framework are Hocaoglu and Sanderson (1997), and Pelikan and Goldberg (2000). Whereas the authors of the first paper focus on discovering all peaks of multimodal functions, the authors of the second paper emphasize the goodness of data clustering to overcome the difficulties that appear when optimizing symmetrical discrete functions. We are interested here in a combination of the goals of these two papers.

The paper by Hocaoglu and Sanderson (1997) presents the minimal representation size cluster GA (MRSC_GA), previously introduced in Hocaoglu and Sanderson (1995), applied to multimodal function optimization and path planning. Using minimal representation size clustering (Segen and Sanderson, 1981), the initial population is clustered into multiple subpopulations that evolve separately for a number of iterations. The aim is that each subpopulation converges to one of the different peaks of the given multimodal function. Moreover, minimal representation size clustering ensures that no two subpopulations achieve the same optimum. Occasionally, the different subpopulations are merged to obtain a new population by applying an operator similar to the crossover operator to individuals selected from different subpopulations. The process is repeated with the multiple separated subpopulations resulting from the clustering of this new population. According to the results reported, the MRSC_GA exhibits good behaviour when applied to multimodal function optimization without being provided with a priori knowledge of the solution space.
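As a rough illustration (not the exact MRSC_GA: the `cluster` and `evolve` callbacks and all names are placeholders of ours), the alternation between isolated evolution and occasional cross-subpopulation merging can be sketched as:

```python
import random

def mrsc_ga_sketch(population, fitness, cluster, evolve, generations=20,
                   merge_every=5):
    """Schematic loop of a cluster-based multimodal GA: subpopulations
    found by clustering evolve in isolation, and are occasionally merged
    by crossing individuals drawn from different subpopulations."""
    subpops = cluster(population)
    for g in range(1, generations + 1):
        # each subpopulation evolves separately toward its own peak
        subpops = [evolve(sp, fitness) for sp in subpops]
        if g % merge_every == 0:
            merged = []
            for _ in range(len(population)):
                if len(subpops) > 1:
                    a_pop, b_pop = random.sample(subpops, 2)
                else:
                    a_pop = b_pop = subpops[0]
                a, b = random.choice(a_pop), random.choice(b_pop)
                cut = random.randrange(1, len(a))
                merged.append(a[:cut] + b[cut:])  # one-point crossover
            subpops = cluster(merged)  # re-cluster the merged population
    return subpops
```

The sketch only captures the control flow described above; the actual algorithm additionally uses the minimal-representation-size criterion to decide the number and membership of the subpopulations.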

On the other hand, Pelikan and Goldberg (2000) motivate the use of data clustering in evolutionary algorithms in general, and in EDAs in particular, as a means to overcome the disrupting effects that symmetry creates. Some other works that study the problem of symmetry in the search space are Collard and Aurand (1994), Naudts and Naudts (1998), Van Hoyweghen (2000), and Van Hoyweghen and Naudts (2000). Whereas the last three works focus on the negative influence that symmetry has on the dynamics of GAs, the first demonstrates how to take advantage of symmetry to reach the optima more quickly. However, in many cases, the GA proposed in Collard and Aurand (1994) is not a realistic approach, as it needs to be provided with a priori knowledge about the function to optimize.

Page 131: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation

104 Estimation of Distribution Algorithms

Among the different types of symmetry described by Van Hoyweghen and Naudts (2000), the work by Pelikan and Goldberg (2000) deals with what is known as symmetry on the alphabet or spin-flip symmetry. An objective function contains spin-flip symmetry when every pair of bit-complementary individuals has identical objective function values. Some functions suffering from spin-flip symmetry are graph partitioning problems, random number partitioning problems, graph coloring problems and two-max functions. Thus, the work by Pelikan and Goldberg (2000) is limited to combinatorial optimization where the individuals are strings of binary variables. The difficulties that spin-flip symmetry creates are due to the fact that combining promising solutions coming from complementary parts of the solution space often results in poor solutions. This fact may slow down convergence or even make the algorithm get stuck in a local optimum.

The purpose of Pelikan and Goldberg (2000) is to distinguish, at each generation, the complementary parts of the solution space in order to break the symmetry by treating each part separately. To be exact, they propose clustering the genotypes of the selected individuals of each generation. Then, each cluster is processed separately in the learning and simulation steps to obtain some offspring that are incorporated into the original population. The process is repeated until convergence is reached. Working in such a way avoids combining solutions that belong to complementary basins of the search space and results in an improvement of performance. As a side effect, multiple optima are discovered. According to results reported for the UMDA (Mühlenbein, 1998) using the Forgy algorithm (Forgy, 1965) to carry out the clustering, this approach works very well on spin-flip symmetrical functions of considerable dimension.
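The cluster-then-model step described above can be sketched compactly. This is a minimal sketch, assuming binary genotypes: a simple two-centroid Hamming-distance clustering stands in for the Forgy algorithm, and a UMDA-style step fits univariate marginals to each cluster and samples from them. The function names and the centroid initialization are our own illustrative choices, not the published method.

```python
import random

def kmeans2_bits(pop, iters=10):
    """Split binary strings into two clusters by Hamming distance to two centroids
    (a stand-in for the Forgy algorithm used by Pelikan and Goldberg)."""
    # Initialize with an arbitrary point and the point farthest (Hamming) from it
    c0 = pop[0]
    c1 = max(pop, key=lambda x: sum(a != b for a, b in zip(x, c0)))
    cents = [c0, c1]
    for _ in range(iters):
        clusters = [[], []]
        for x in pop:
            d = [sum(a != b for a, b in zip(x, c)) for c in cents]
            clusters[0 if d[0] <= d[1] else 1].append(x)
        # Recompute centroids as rounded per-bit means; keep old centroid if empty
        cents = [[round(sum(col) / len(cl)) for col in zip(*cl)] if cl else cents[j]
                 for j, cl in enumerate(clusters)]
    return clusters

def umda_sample(cluster, m):
    """Fit univariate marginals to one cluster and sample m offspring (UMDA-style step)."""
    n = len(cluster[0])
    p = [sum(x[i] for x in cluster) / len(cluster) for i in range(n)]
    return [[1 if random.random() < p[i] else 0 for i in range(n)] for _ in range(m)]
```

Processing each cluster separately in this way keeps offspring inside their own basin, which is exactly the symmetry-breaking effect the paragraph describes.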

Although it is not discussed by Hocaoglu and Sanderson (1997), the use of multiple separated subpopulations in the MRSC_GA also helps to avoid the harmful effects that the existence of several peaks involves. These undesirable effects are similar to those addressed above in the context of spin-flip symmetry.

2.1 Estimation of Mixture of Distributions Algorithm

The primary goal of this chapter is the proposal and evaluation of an enhancement of EDAs able to deal effectively and efficiently with both discrete and continuous multimodal function optimization problems. In the remainder of this chapter, this new EDA is referred to as the Estimation of Mixture of Distributions Algorithm (EMDA). In order to achieve our objective, the EMDA combines the benefits derived from the incorporation of data clustering into evolutionary algorithms that motivate Hocaoglu and Sanderson (1997) with those that motivate Pelikan and Goldberg (2000). These benefits are:


Benefits of Data Clustering in Multimodal Function Optimization via EDAs 105

• data clustering has proven to be an effective approach for overcoming the difficulties that multimodal function optimization involves for evolutionary algorithms in general and EDAs in particular, and

• data clustering is a reliable tool for obtaining all the global optima of multimodal functions. It should be noted that we aim to discover only global optima instead of local and global peaks as in Hocaoglu and Sanderson (1997).

Unlike the work by Pelikan and Goldberg (2000), the EMDA does not rely on either the well-known Forgy algorithm or alternative techniques to carry out data clustering. Our proposal is true to the EDA paradigm, as the model elicited from the selected individuals of each generation is itself what encodes a probabilistic clustering of these individuals. Thus, the selected individuals are not explicitly divided into clusters to break the symmetry. On the contrary, the EMDA accepts the symmetry of the solution space, because it takes advantage of probabilistic graphical models that are able to represent the complexity of the selected individuals. As a result, every model learnt at each iteration of the EMDA reveals the structure of the multimodal function that is being optimized, restricted to the selected individuals.

The models that we propose using are BNs when dealing with combinatorial optimization, and CGNs when facing optimization problems in continuous domains. These two classes of probabilistic graphical models have been successfully applied to data clustering (Peña et al., 2000; Peña et al., 2001b; Peña et al., 2001d; Peña et al., 2001a). Thus, the EMDA consists of the iteration of the same main steps as the generic EDA (Figure 4.1): selection of promising individuals from the current population, model learning from the selected individuals, and model sampling to obtain the offspring that are somehow incorporated into the current population to create the new population. Unsupervised learning of the BN or CGN should be provided with the number of clusters that we assume exist in the set of selected individuals. Ideally, this number should be the number of global optima of the function that we aim to optimize. If this is unknown, then it should be determined before starting the learning process involved in each iteration of the EMDA.

3. BNs and CGNs applied to data clustering

One of the basic problems that arises in a great variety of fields, including pattern recognition, machine learning and statistics, is the so-called data clustering problem (Anderberg, 1973; Hartigan, 1975; Kaufman and Rousseeuw, 1990). Despite the different interpretations and expectations it gives rise to, the generic data clustering problem involves the assumption that, in addition to the observed variables (or predictive attributes), there is a hidden variable. This last unobserved variable would reflect the cluster membership for every case in the database. Thus, the data clustering problem is also referred to as an example of learning from incomplete data due to the existence of such a hidden variable. Incomplete data represent a special case of missing data in which all the missing entries are concentrated in a single variable: the hidden cluster variable. That is, we refer to a given database as incomplete when all its cases are unlabeled.

The fundamental data clustering problem aims to discover groups in data. Each of these groups is called a cluster, a region in which the density of objects is locally higher than in other regions. From the point of view adopted in this section, the data clustering problem may be defined as the inference of the joint generalized probability distribution for a given database. In the context of the EMDA, this database groups the selected individuals of the current iteration. Alternatively, data clustering can be viewed as a data partitioning problem. Given data, we can ask ourselves how the data can be split into different partitions dependent on a quality criterion (Pelikan and Goldberg, 2000).

3.1 Notation

We follow the usual convention of denoting variables by upper-case letters and their states by the same letters in lower-case. We use a letter or letters in bold-face upper-case to designate a set of variables, and the same bold-face letter or letters in lower-case to denote an assignment of a state to each variable in a given set. The joint generalized probability distribution of X is represented as p(x), and p(x | y) denotes the generalized conditional probability distribution of X given Y = y. If all the variables in X are discrete, then p(x) denotes the joint probability mass function of X, and p(x | y) the conditional probability mass function of X given Y = y. If all the variables in X are continuous, then p(x) = f(x) is the joint probability density function of X, and f(x | y) denotes the conditional probability density function of X given Y = y.

When facing a data clustering problem, we assume the existence of an (n+1)-dimensional random variable X partitioned as X = (Y, C) into an n-dimensional observed variable Y and a unidimensional discrete hidden variable C. In the particular case where every component Yi of Y is discrete, the probabilistic graphical models that we aim to learn are called BNs. On the other hand, if every component Yi of Y is continuous and follows a Gaussian distribution, then the probabilistic graphical models that we want to learn are called CGNs.

3.2 BNs for data clustering

Given a discrete random variable X = (Y, C) = (Y1, ..., Yn, C), a BN for X (Pearl, 1988; Peña et al., 2000; Peña et al., 2001a) is a graphical factorization of the joint probability distribution of X. When applied to data clustering, a BN is defined by a directed acyclic graph s (model structure) determining the conditional (in)dependencies among the variables of Y, and a set of local probability distributions. The model structure yields a factorization of the joint probability distribution for X as follows:

p(x) = p(c) p(y | c) = p(c) ∏_{i=1}^{n} p(yi | pa(s)i, c)    (4.1)

where pa(s)i is the state of the parents of Yi in s, Pa(s)i, consistent with x.

The local probability distributions of the BN are those in Equation 4.1, and we assume that they depend on a finite set of parameters θs ∈ Θs. Moreover, let s^h denote the hypothesis that the conditional (in)dependence assertions implied by s hold in the true joint probability distribution of X. Therefore, Equation 4.1 can be rewritten as follows:

p(x | θs, s^h) = p(c | θs, s^h) ∏_{i=1}^{n} p(yi | pa(s)i, θi^c, s^h)    (4.2)

where θ^c = (θ1^c, ..., θn^c) denotes the parameters for the local probability distributions when C = c.

We limit our discussion to the case in which the local probability distribu­tions of each variable of the BN consist of a set of multinomial distributions, one for each configuration of the parents and the cluster variable C.
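Equation 4.1 can be evaluated directly once a structure and its local multinomial tables are fixed. Below is a minimal sketch for a hypothetical two-cluster, two-variable model with structure Y1 → Y2; all probability values, the table layout and the function name are illustrative, not taken from the chapter.

```python
def bn_joint(c, y, p_c, cpts, parents):
    """Evaluate p(x) = p(c) * prod_i p(yi | pa(s)i, c), as in Equation 4.1.
    cpts[i] maps (c, parent_states) -> a distribution over the states of Yi."""
    prob = p_c[c]
    for i, yi in enumerate(y):
        pa = tuple(y[j] for j in parents[i])
        prob *= cpts[i][(c, pa)][yi]
    return prob

# Hypothetical two-cluster model over (Y1, Y2) with structure Y1 -> Y2
p_c = {0: 0.5, 1: 0.5}
parents = [[], [0]]
cpts = [
    {(0, ()): [0.9, 0.1], (1, ()): [0.1, 0.9]},        # p(y1 | c)
    {(0, (0,)): [0.8, 0.2], (0, (1,)): [0.5, 0.5],
     (1, (0,)): [0.5, 0.5], (1, (1,)): [0.2, 0.8]},    # p(y2 | y1, c)
]
```

Summing the returned values over all states of (C, Y1, Y2) gives 1, which is a quick sanity check that the local tables are properly normalized.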

3.3 CGNs for data clustering

A random variable X = (Y, C) = (Y1, ..., Yn, C), where Y is continuous and C is discrete, is said to have a conditional Gaussian distribution (Lauritzen, 1992; Lauritzen, 1996; Lauritzen and Wermuth, 1989) if the distribution of Y, conditioned on each state of C, is a multivariate normal distribution:

f(y | C = c) ≡ N(y; μ(c), Σ(c))    (4.3)

whenever p(c) = p(C = c) > 0. Given C = c, μ(c) is the n-dimensional mean vector and Σ(c), the n × n variance matrix, is positive definite.

We define a CGN for X (Lauritzen, 1992; Lauritzen, 1996; Lauritzen and Wermuth, 1989; Peña et al., 2001b; Peña et al., 2001d; Peña et al., 2001a) as a probabilistic graphical model that encodes a conditional Gaussian distribution for X. Thus, a CGN is defined by a directed acyclic graph s (model structure) determining the conditional (in)dependencies among the variables of Y, a set of local probability density functions, and a multinomial distribution for the variable C. The model structure yields a factorization of the joint generalized probability density function for X as follows:

p(x) = p(c) f(y | c) = p(c) ∏_{i=1}^{n} f(yi | pa(s)i, c)    (4.4)

where pa(s)i is the state of the parents of Yi in s, Pa(s)i, consistent with x.

The local probability density functions and the multinomial distribution of the CGN are those in the previous equation, and we assume that they depend on a finite set of parameters θs ∈ Θs. Moreover, let s^h denote the hypothesis that the conditional (in)dependence assertions implied by s hold in the true joint generalized probability density function of X. Therefore, Equation 4.4 can be rewritten as follows:

p(x | θs, s^h) = p(c | θs, s^h) ∏_{i=1}^{n} f(yi | pa(s)i, θi^c, s^h)    (4.5)

where θ^c = (θ1^c, ..., θn^c) denotes the parameters for the local probability density functions when C = c.

In order to encode a conditional Gaussian distribution for X, each local probability density function of the CGN should be a linear-regression model. Thus, when C = c:

f(yi | pa(s)i, θi^c, s^h) = N(yi; mi^c + Σ_{yj ∈ pa(s)i} bji^c (yj − mj^c), vi^c)    (4.6)

where N(y; μ, σ²) is a univariate normal distribution with mean μ and standard deviation σ (σ > 0). Given this form, a missing arc from Yj to Yi implies that bji^c = 0 in the linear-regression model. When C = c, the local parameters are θi^c = (mi^c, bi^c, vi^c), i = 1, ..., n, where bi^c = (b1i^c, ..., b(i−1)i^c)^t is a column vector.

The interpretation of the components of the local parameters θi^c, i = 1, ..., n, is as follows: given C = c, mi^c is the unconditional mean of Yi, vi^c is the conditional variance of Yi given Pa(s)i, and bji^c, j = 1, ..., i−1, is a linear coefficient reflecting the strength of the relationship between Yj and Yi.
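Equation 4.6 also gives a direct recipe for ancestral sampling from one probabilistic cluster of a CGN: visit the variables in an order consistent with s, regress each yi on its already-sampled parents, and add Gaussian noise with variance vi^c. The following is a minimal sketch; the function name, argument layout and toy parameters are illustrative assumptions.

```python
import random

def sample_cgn_cluster(ms, bs, vs, parents, rng=random):
    """Draw one y from f(y | C = c) via the linear-regression form of Equation 4.6:
    yi ~ N(mi + sum_j bji * (yj - mj), vi).
    ms: cluster-conditional means, vs: conditional variances,
    bs[i]: {j: bji}, parents[i]: parent indices j < i (ancestral order)."""
    y = []
    for i in range(len(ms)):
        mu = ms[i] + sum(bs[i][j] * (y[j] - ms[j]) for j in parents[i])
        y.append(rng.gauss(mu, vs[i] ** 0.5))  # gauss takes the standard deviation
    return y
```

With a strong regression coefficient and a small conditional variance, the sampled child closely tracks its parent, which is how arcs in the structure translate into correlations in the offspring.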

3.4 Unsupervised learning of BNs and CGNs

One of the methods for learning BNs and CGNs from incomplete data is the well-known Bayesian Structural EM (BS-EM) algorithm developed by Friedman (1998). Due to its good performance, this algorithm has received special attention in the literature and has motivated several variants of itself (Meila and Jordan, 1998; Peña et al., 1999; Peña et al., 2000; Peña et al., 2001c; Thiesson et al., 1998). We use the BS-EM algorithm for explanatory purposes as well as in our experiments presented in Section 5.

EMDA
    Generate M individuals at random to create the initial population (d0)
    l = 0
    Repeat until the stopping criterion is met
        l++
        Let dl-1^Se group the N (N ≤ M) individuals selected from dl-1
            according to the selection method
        Let pl(x) represent the joint generalized probability distribution of x
            encoded by a BN or CGN learnt from dl-1^Se via the BS-EM algorithm
        Generate the offspring by sampling M individuals from pl(x)
        Let dl be the new population created by replacing part of dl-1
            with part of the offspring by using the replacement method

where the BS-EM algorithm is as follows:

BS-EM algorithm
    loop l = 0, 1, ...
        Run the EM algorithm to get the MAP parameters θ̂sl for sl given o
        Perform search over model structures, evaluating each one by
            Score(s : sl) = E[log p(h, o, s^h) | o, θ̂sl, sl^h]
                          = Σh p(h | o, θ̂sl, sl^h) log p(h, o, s^h)
        Let sl+1 be the model structure with the highest score
        if Score(sl : sl) = Score(sl+1 : sl) then return (sl, θ̂sl)

Figure 4.1 Schematics of the EMDA (top) and the BS-EM algorithm (bottom).

When applying the BS-EM algorithm to a data clustering problem, we assume that we have a database of N cases, d = {x1, ..., xN}, where every case is represented by an assignment to n of the n+1 variables involved in the problem domain. So, there are (n+1)N random variables that describe the database. The N cases of the database correspond to the selected individuals at each iteration of the EMDA. Let O denote the set of observed variables, that is, the nN variables that have assigned values. Similarly, let H denote the set of hidden or unobserved variables, that is, the N variables that reflect the unknown cluster membership of each case of d.


For learning BNs and CGNs from incomplete data, the BS-EM algorithm performs a search over the space of models based on the well-known EM algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997) and direct optimization of the Bayesian score. This results in an attempt to maximize the expected Bayesian score at each iteration instead of the true Bayesian score. As shown in Figure 4.1 (bottom), the BS-EM algorithm is comprised of two steps: an optimization of the BN or CGN parameters, usually by means of the EM algorithm, and a structural search for model selection.

To completely specify the BS-EM algorithm, we have to decide on the structural search procedure (the second step of the BS-EM algorithm of Figure 4.1). The usual approach is to perform a greedy hill-climbing search over BN or CGN structures, considering all possible additions, removals and reversals of a single arc at each point in the search. This structural search procedure is desirable as it exploits the decomposition properties of BNs and CGNs, and the factorization properties of the Bayesian score for complete data. However, any structural search procedure that exploits these properties can be used. The log marginal likelihood of the expected complete data is usually chosen as the score to guide the structural search.
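The neighbourhood explored by such a greedy hill-climbing search can be sketched directly: given a candidate structure as a set of arcs, the moves are single-arc additions, removals and reversals, keeping only acyclic results. The representation and function names below are our own illustrative choices, not taken from the BS-EM literature.

```python
from itertools import permutations

def has_cycle(arcs, n):
    """DFS-based cycle check on the directed graph over nodes 0..n-1."""
    adj = {i: [] for i in range(n)}
    for a, b in arcs:
        adj[a].append(b)
    state = {i: 0 for i in range(n)}  # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(u):
        state[u] = 1
        for v in adj[u]:
            if state[v] == 1 or (state[v] == 0 and dfs(v)):
                return True
        state[u] = 2
        return False
    return any(state[i] == 0 and dfs(i) for i in range(n))

def neighbors(arcs, n):
    """All structures reachable by one arc addition, removal or reversal (DAGs only)."""
    arcs = set(arcs)
    out = []
    for a, b in permutations(range(n), 2):
        if (a, b) in arcs:
            out.append(arcs - {(a, b)})             # removal (always stays acyclic)
            rev = (arcs - {(a, b)}) | {(b, a)}      # reversal
            if not has_cycle(rev, n):
                out.append(rev)
        elif (b, a) not in arcs:
            add = arcs | {(a, b)}                   # addition
            if not has_cycle(add, n):
                out.append(add)
    return out
```

A hill-climber would score every structure in `neighbors(current, n)` and move to the best one until no neighbour improves the score.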

Direct application of the BS-EM algorithm as it appears in Figure 4.1 (bottom) may result in an unrealistic and inefficient solution, because the computation of Score(s : sl) implies a huge computational expense, as it takes account of every possible completion of the database. It is common to use a relaxed version of the presented algorithm that only considers the most likely completion of the database to compute Score(s : sl), instead of considering every possible completion. Thus, this relaxed version of the BS-EM algorithm is comprised of the iteration of a parametric optimization for the current model, and a structural search once the database has been completed with the most likely completion by using the best estimate of the joint generalized probability distribution of the data so far (the current model). The completion is achieved by calculating the posterior probability distribution of the cluster variable C for each case of the database, p(c | yi, θ̂sl, sl^h). The case is assigned to the cluster where the maximum of this posterior probability distribution of C is reached. We use this relaxed version in our experiments.

It should be noted that the learnt model does not provide us with an explicit partition of the selected individuals into clusters, but with an encoding of the joint generalized probability distribution of these individuals. Thus, the clusters determined by the learnt BN or CGN should be understood as probabilistic clusters. Instead of belonging to a particular cluster, each selected individual yi implies a probability distribution for C, p(c | yi, θ̂sl, sl^h), that represents a probabilistic clustering of the individual.
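The most-likely-completion step amounts to computing p(c | y) for each case and taking the argmax. Here is a minimal sketch under a hypothetical naive-Bayes-style model, where each cluster factorizes the attributes independently; the model, table layout and function names are illustrative, not the chapter's TANB models.

```python
def posterior_c(y, p_c, cond):
    """p(c | y) ∝ p(c) * prod_i p(yi | c); cond[c][i][yi] gives p(yi | c)."""
    joint = {}
    for c in p_c:
        prob = p_c[c]
        for i, yi in enumerate(y):
            prob *= cond[c][i][yi]
        joint[c] = prob
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

def complete_database(db, p_c, cond):
    """Relaxed BS-EM completion: assign every unlabeled case to its most
    probable cluster under the current model."""
    labels = []
    for y in db:
        post = posterior_c(y, p_c, cond)
        labels.append(max(post, key=post.get))
    return labels
```

The full posterior returned by `posterior_c` is exactly the probabilistic clustering mentioned above; `complete_database` is the hard assignment used only to make the structural search tractable.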


4. Further considerations about the EMDA

4.1 Sampling the learnt model

In order to generate the offspring to create the new population, the learnt model must be sampled. By doing this, the number of offspring produced by each probabilistic cluster is determined by the marginal probability distribution of the cluster variable C in the learnt model. This implies that the number of offspring sampled by each probabilistic cluster is somehow proportional to its size.

According to preliminary experiments, a decisive factor in performing efficient and effective multimodal function optimization is to keep in the population a reasonable number of individuals representing each of the basins of the global optima, in order to avoid losing any of them as the optimization progresses. Based on these preliminary results, the model that the EMDA samples to generate the offspring in our experimental evaluation is not exactly the model learnt, but a slightly modified one. As we are interested in discovering all global optima and they are equally sized peaks, the EMDA samples the same number of individuals from each of the probabilistic clusters encoded by the learnt model. This is equivalent to modifying the marginal probability distribution of C in the learnt model to be a uniform distribution. The effects of other sampling alternatives on the performance of the EMDA need to be studied. For instance, we could explore sampling a number of individuals from each probabilistic cluster proportional to its average fitness. Sampling a number of offspring from each probabilistic cluster inversely proportional to its size or to its average fitness could also be an alternative to consider. This last option aims to sample more individuals from the clusters with the least number of representatives, or with the representatives with the worst average fitness, resulting in a positive discrimination of these probabilistic clusters that can aid their recovery.
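Setting the marginal of C to a uniform distribution is equivalent to drawing an (almost) equal share of the M offspring from each probabilistic cluster. Below is a minimal sketch, where each cluster is represented by an arbitrary draw function; the names are illustrative.

```python
def sample_equal_per_cluster(samplers, m):
    """samplers: one draw function per probabilistic cluster. Returns m offspring,
    split across clusters as evenly as possible (uniform marginal over C)."""
    k = len(samplers)
    # First (m mod k) clusters contribute one extra offspring
    counts = [m // k + (1 if i < m % k else 0) for i in range(k)]
    offspring = []
    for draw, n in zip(samplers, counts):
        offspring.extend(draw() for _ in range(n))
    return offspring
```

Swapping `counts` for fitness-proportional or inversely size-proportional shares would implement the sampling alternatives discussed above.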

4.2 Members of the EMDA family

The EMDA relies on unsupervised learning of BNs and CGNs in order to obtain all the global optima while avoiding the harmful effects of multimodality. However, the reader should be aware of the existence of other probabilistic graphical models that could also provide us with the same benefits as BNs and CGNs when considered under the EMDA paradigm. For instance, Peña et al. (2001b) and Thiesson et al. (1998) present some probabilistic graphical models for data clustering that are more flexible than BNs and CGNs. Thus, the EMDA leads us to a family of EMDAs, where the difference between the distinct members of this family consists of the class of probabilistic graphical models used to perform the clustering of the selected individuals.


It should be noted that the structure of a learnt BN or CGN for data clustering is independent of the value of the cluster variable C, and so, the model structure is the same for all values of C. However, the parameters of the local probability distributions depend on the value of C, and they may differ for different values of C. It is interesting to note that the constraint of having a single model structure for every value of C can be relaxed by considering models more flexible than BNs and CGNs. An example is the class of what are known as mixtures of DAG (MDAG) models (Thiesson et al., 1998). Roughly speaking, MDAG models represent a generalization of BNs and CGNs applied to data clustering where different model structures for the different values of the variable C are allowed.

Despite having received less attention than BNs and CGNs in the recent literature, MDAG models appear to be more appropriate than BNs and CGNs under the EMDA paradigm for the optimization of multimodal functions. The explanation is straightforward. The advantage of the EMDA over other members of the EDA paradigm is that the model learnt from the selected individuals, a BN or a CGN, is able to capture the multimodality of the function being optimized. Ideally, each probabilistic cluster of the learnt model corresponds to one of the several global optima that the function has. However, every probabilistic cluster involves the same set of conditional (in)dependence assertions as the rest, independently of the global optimum that is being modeled by that particular probabilistic cluster. This is because BNs and CGNs, when applied to data clustering, have a single model structure for all the values of C. On the other hand, MDAG models offer enough flexibility to encode a different set of conditional (in)dependencies for each probabilistic cluster. This fact, together with the possibility of having different parameters for each probabilistic cluster, makes MDAG models desirable paradigms for multimodal function optimization. However, for many problems the EMDA suffices, and the consideration of more flexible models such as MDAG models can be discarded. That is, the solution space of many problems, restricted to the selected individuals, can be perfectly modeled by unsupervised learning of BNs or CGNs. Thus, the use of models more flexible than these would not contribute to an improvement in the performance of the algorithm. See, for instance, the 14 problems that we use in our experimental evaluation of Section 5 (most of these are taken from the existing literature).

Finally, another example of an evolutionary algorithm that may be seen as a member of the EMDA family is the Adaptive Mixture Model Algorithm (AMix) (Gallagher et al., 1999). The enhancement that this algorithm proposes consists of the use of a Gaussian mixture to model the joint probability distribution of the selected individuals of each generation. Moreover, the number of components in the mixture is allowed to vary during the execution of the algorithm as new data points become available, based on whether or not the Mahalanobis distance between the existing model and these new points is greater than a prespecified threshold. Gallagher et al. (1999) factorize each component of the Gaussian mixture into univariate Gaussian distributions, so correlations among the different variables are not modeled. As far as we know, the AMix algorithm has not been studied in multimodal function optimization problems. However, it can easily be applied to them by simply considering that every component in the Gaussian mixture models the probability distribution of the selected individuals that belong to the basin of one of the global optima.

For the sake of conciseness, the remainder of this chapter considers the EMDA as it appears in Figure 4.1, i.e. the models learnt from the selected individuals are BNs for combinatorial optimization and CGNs for optimization in continuous domains. It is beyond the scope of this chapter to compare the different instances of what we have named the EMDA family. Moreover, BNs and CGNs are well-established classes of probabilistic graphical models that have been studied in depth. Besides, the use of models more flexible and, thus, more complex than BNs and CGNs to perform the clustering of the selected individuals (e.g. MDAG models) would also lengthen the optimization process, as their unsupervised learning is usually more computationally expensive.

5. Experimental results

This section is devoted to the experimental evaluation of the EMDA for combinatorial optimization as well as optimization in continuous domains. For this purpose, we use the UMDA (Mühlenbein, 1998) and the EBNA (Larrañaga et al., 2000b) as benchmarks for combinatorial optimization, and the UMDAc and the EGNA (Larrañaga et al., 2000a) for optimization in continuous domains. The comparison between the results achieved by the EMDA and those achieved by the benchmarks allows us to draw some conclusions about the efficiency and effectiveness of the EMDA.

Whereas the UMDA and the UMDAc are classic EDA instances, the EBNA and the EGNA have shown good performance on discrete and continuous optimization problems. Moreover, these two last algorithms are close in spirit to the EMDA, as they are also based on learning and simulation of probabilistic graphical models. However, neither the EBNA nor the EGNA use probabilistic graphical models to perform data clustering; they carry out supervised learning of them instead. The EBNA and the EGNA instances considered in this section make use of the B algorithm (Buntine, 1991) together with the BIC score (Schwarz, 1978) for the former and the BGe score (Geiger and Heckerman, 1995) for the latter, in order to perform learning of the probabilistic graphical models at each generation. They will be denoted as EBNA_BIC and EGNA_BGe, respectively. We refer the interested reader to the original works for more details.


In order to keep the cost of the optimization process carried out by the EMDA as low as possible, we propose limiting the BS-EM algorithm to learning Tree Augmented Naive Bayes (TANB) models (Friedman and Goldszmidt, 1996; Meila, 1999; Peña et al., 2000; Peña et al., 2001d). This is a sensible as well as usual decision to reduce the otherwise large search spaces of BNs and CGNs. TANB models constitute a class of compromise BNs and CGNs defined by the following condition: predictive attributes may have at most one other predictive attribute as a parent. Thus, TANB models represent an interesting trade-off between efficiency and effectiveness, that is, a balance between the cost of the learning process and the quality of the learnt models (Peña et al., 2000; Peña et al., 2001d).

It is well known that TANB models are able to solve data clustering problems of considerable size efficiently. Moreover, they avoid the difficulties of learning densely connected BNs and CGNs, and the painfully slow probabilistic inference when working with these. Also, generation of the offspring from the learnt model is accelerated when this is a TANB model, as few conditional dependencies among the variables are allowed.
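The TANB condition is easy to state as a structural check: each predictive attribute may have at most one other attribute as a parent, and the attribute-to-attribute arcs must contain no cycle. A minimal sketch follows, under an assumed representation where `attr_parents[i]` lists the attribute parents of Yi (the arc from the cluster variable C to every attribute is left implicit); the representation and function name are our own.

```python
def is_tanb(attr_parents, n):
    """Check the TANB constraint: every attribute has at most one attribute
    parent, and following parent chains never revisits a node (no cycles)."""
    if any(len(p) > 1 for p in attr_parents):
        return False
    for start in range(n):
        seen, i = set(), start
        while attr_parents[i]:
            i = attr_parents[i][0]
            if i in seen or i == start:
                return False  # cycle detected
            seen.add(i)
    return True
```

Restricting the structural search of the BS-EM algorithm to structures passing this check is what shrinks the search space in the way the paragraph describes.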

5.1 General considerations

In this subsection we discuss some decisions that are common to the EDA instances used in our experimental comparison (UMDA, UMDAc, EBNA_BIC, EGNA_BGe and EMDA): the selection method, the population replacement method and the stopping condition.

The five algorithms considered use truncation selection as the selection method, i.e. the best individuals of the current population, according to their objective function values, are selected. The way in which the new population is created consists of the replacement of the worst individuals of the current population by all the offspring.

The algorithms are stopped when the relative difference between the sum of the objective function values of all the individuals of the populations of two successive generations is less than a fixed value here referred to as precision.
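These two decisions can be sketched directly. The following is a minimal sketch; the relative-difference formula in `stop` is our own reading of the stopping rule, stated as one plausible implementation rather than the chapter's exact criterion.

```python
def truncation_select(pop, fitness, n):
    """Truncation selection: keep the n best individuals by objective value
    (maximization)."""
    return sorted(pop, key=fitness, reverse=True)[:n]

def stop(prev_pop, cur_pop, fitness, precision):
    """Stop when the relative change in the summed objective values between
    two successive generations falls below the precision threshold."""
    prev = sum(fitness(x) for x in prev_pop)
    cur = sum(fitness(x) for x in cur_pop)
    return abs(cur - prev) <= precision * max(abs(prev), 1e-12)
```

With precision 0, as in the two-max schedulings of Section 5.2, the rule only fires once the population's total fitness stops changing at all.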

The particular values for the population size, the number of selected individuals, the number of generated offspring and the precision may vary from objective function to objective function. We find it convenient to use a regular grammar to clearly identify these values. Thus, each of the objective functions used has an optimization scheduling represented as (α, β, γ, δ), where α is the size of the population, β is the number of selected individuals, γ is the size of the offspring and δ is the precision.

As noted earlier, unsupervised learning of the BN or CGN should be provided with the number of clusters that we assume exist in the set of selected individuals. In our experiments, this number is set to the number of global optima of the function that we aim to optimize.


All the experiments are run on a Pentium 550 MHz computer.

5.2 Combinatorial optimization

This subsection evaluates the EMDA as applied to combinatorial optimization. Most of the problems considered can be found in Pelikan and Goldberg (2000) and Pelikan et al. (2000). All the problems are defined in {0, 1}^n, i.e. over the set of binary individuals of length n.

We limit our current evaluation to multimodal functions that show spin-flip symmetry in the solution space. This class of multimodal functions represents a set of challenging problems for many EDAs and GAs (Naudts and Naudts, 1998; Pelikan and Goldberg, 2000; Van Hoyweghen, 2000; Van Hoyweghen and Naudts, 2000).

5.2.1 Problems.

Two-max problems. These are two simple spin-flip symmetrical functions:

    Ftwo-max(x) = | u - n/2 |                                        (4.7)

    Ftwo-max2(x) = Ftwo-max(x) - 5   if Ftwo-max(x) > 5
                   Ftwo-max(x)       otherwise                       (4.8)

where u is the sum of the bits in x and n is the length of x. The objective is to maximize the functions. For both functions, there are two global optima, x1* = (0, ..., 0) and x2* = (1, ..., 1), with fitness equal to n/2 for Ftwo-max and equal to n/2 - 5 for Ftwo-max2. In our case, n = 50. The optimization scheduling for both functions is (2000, 1000, 1999, 0). Ftwo-max2 is considered more difficult than Ftwo-max as it has two local optima in addition to the two global optima.

Graph bisection problems. Graph bisection problems aim to split the set of nodes of a given graph structure into two equally sized subsets so that the number of edges between the two subsets is minimized. We use two grid-like graph structures, cut in halves and connected by two edges, with sizes n = 16 and 36, resulting in the problems Fgrid16(x) and Fgrid36(x), respectively. We also consider three so-called caterpillar graph structures with sizes n = 28, 42 and 56, resulting in Fcat28(x), Fcat42(x) and Fcat56(x), respectively. Figure 4.2 shows the graph structures for Fgrid16(x) and Fcat28(x).

Each bit of a given individual represents one node of the graph structure. The value of the bit classifies the corresponding node into one of the two subsets.


Figure 4.2 Graph structures for Fgrid16(x) (left) and Fcat28(x) (right). Dashed lines indicate the optimal cuts.

[Three histograms omitted: populations at generations 0, 6 and 12.]

Figure 4.3 Dynamics of the EMDA in the Ftwo-max problem. The horizontal axis represents the number of ones in a solution whereas the vertical axis denotes the number of corresponding solutions in the population of the generation indicated.

It should be noted that only individuals with equal numbers of zeroes and ones represent feasible solutions. Thus, some individuals may need to be repaired. Although more specialized repair operators might be considered, we make use of a simple randomized repair operator: an infeasible solution is converted into a feasible one by randomly picking a number of the bits that are in the majority and changing them to their complementary values.

The fitness of a given individual is calculated as the size of the graph structure minus the number of edges connecting the two subsets of nodes encoded in the individual. Thus, the objective is to maximize. The global optima have an objective function value equal to n - 2 for Fgrid16(x) and Fgrid36(x), and equal to n - 1 for Fcat28(x), Fcat42(x) and Fcat56(x). It is easy to see that these five problems are spin-flip symmetrical problems and, thus, the global optima are represented by complementary individuals. The optimization scheduling is (2000, 1000, 1999, 0) for the five graph bisection problems.
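The repair operator and the fitness calculation can be sketched as follows (a hypothetical implementation, not the authors' code; it assumes an even number of nodes and an edge-list representation of the graph):

```python
import random

def repair(x, rng=random):
    """Randomized repair: flip randomly chosen majority bits until the
    individual encodes two equally sized node subsets (n assumed even)."""
    x = list(x)
    while sum(x) != len(x) // 2:
        majority = 1 if sum(x) > len(x) // 2 else 0
        idx = rng.choice([i for i, b in enumerate(x) if b == majority])
        x[idx] = 1 - majority
    return x

def bisection_fitness(x, edges):
    """Graph size minus the number of edges cut by the encoded bisection."""
    cut = sum(1 for (u, v) in edges if x[u] != x[v])
    return len(x) - cut
```

For a 4-node path graph with edges (0,1), (1,2), (2,3), the balanced cut [0,0,1,1] severs one edge and scores 3, while the alternating [0,1,0,1] severs all three and scores 1.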

In addition to the difficulties derived from their symmetrical nature, these graph bisection problems present another source of difficulty for GAs and EDAs: there are many local optima and only two global optima (Pelikan and Goldberg, 2000; Pelikan et al., 2000; Schwarz and Ocenasek, 1999).


Table 4.1 Performance of the UMDA, EBNABIC and EMDA in the discrete domains considered. The numbers of evaluations and runtimes are average values over 10 independent runs. The numbers of times that each optimum is reached summarize the final results of these 10 runs.

Problem      UMDA                            EBNABIC                          EMDA

Ftwo-max     35183 eval.   20 sec.  (5, 5)   31185 eval.   160 sec.  (7, 3)   25988 eval.  107 sec.  (10, 10)
Ftwo-max2    49176 eval.   33 sec.  (6, 4)   40181 eval.   207 sec.  (4, 6)   27987 eval.  194 sec.  (10, 10)
Fgrid16      57972 eval.   59 sec.  (5, 5)   60971 eval.   46 sec.   (4, 6)   23989 eval.  39 sec.   (10, 10)
Fgrid36      81560 eval.   68 sec.  (5, 4)   120941 eval.  336 sec.  (9, 1)   38182 eval.  161 sec.  (10, 10)
Fcat28       50776 eval.   34 sec.  (1, 8)   53374 eval.   80 sec.   (4, 6)   25988 eval.  82 sec.   (10, 10)
Fcat42       66968 eval.   58 sec.  (2, 3)   96953 eval.   310 sec.  (4, 4)   33184 eval.  189 sec.  (10, 10)
Fcat56       87957 eval.   91 sec.  (3, 4)   120741 eval.  776 sec.  (1, 2)   39981 eval.  359 sec.  (10, 10)

5.2.2 Results. Figure 4.3 shows the dynamics of the EMDA in the Ftwo-max problem. The histograms summarize the number of solutions (vertical axis) in the populations of generations 0, 6 and 12 with the number of ones denoted by the horizontal axis. As previously stated, the two global optima are complementary and lie on the left-most and right-most sides of the histograms. It is clear that, as the optimization progresses, the population drifts to both sides, as the BN learnt at each iteration of the EMDA is able to capture this division of the selected individuals. Finally, both global optima are discovered and appear in the population of the last generation of the EMDA. Moreover, the individuals of this last population are almost equally distributed between the two global peaks.


Table 4.1 summarizes the results achieved when applying the EMDA to each of the 7 combinatorial problems presented in the previous subsection. Additionally, this table reports the results reached by the UMDA and the EBNABIC in these problems for comparison. For each problem and each evolutionary algorithm, three results are given: the number of evaluations of the objective function until the stopping criterion is met, the runtime of the optimization process (in seconds), and the number of times that each of the global optima of the objective function is discovered. The first two results are average values over 10 independent runs, while the third summarizes these 10 runs by a pair (η, ξ), where η and ξ are the number of times that x1* and x2* are obtained, respectively. Obviously, the UMDA and the EBNABIC can reach at most one global optimum per run, while the EMDA can reach both optima in each run.

From Table 4.1 we can conclude that the EMDA shows more effective and efficient behaviour than the other two algorithms on the problems chosen. It is especially appealing that, for all the objective functions, the EMDA needs fewer evaluations of the objective function than the UMDA and the EBNABIC to reach convergence, without degrading the quality of the obtained solutions (the two global optima of every function are obtained in every run). Except in the two-max problems, the EBNABIC needs more evaluations than the UMDA to converge.

As expected, the number of evaluations has a decisive influence on the runtime of the optimization process measured in seconds. Here, the optimization process using the EBNABIC is the slowest of the three. On the other hand, the UMDA is the quickest, although the number of evaluations that it needs to converge in any of the 7 problems is much larger than the number needed by the EMDA. Obviously, this is due to the unsupervised learning of BNs that the EMDA performs, which is known to be a difficult and, sometimes, computationally expensive process. However, the runtime of the EMDA on this set of problems is considered reasonable.

Looking at Table 4.1, we discover that converging and obtaining the two complementary global optima of any of the 7 functions using the EMDA involves fewer evaluations than converging and obtaining only one using the UMDA or the EBNABIC. Thus, these results confirm what we already proposed: the incorporation of probabilistic clustering into EDAs is interesting not only because it allows all the global optima to be obtained, but also because it deals with symmetry in a natural way. That is, it avoids the combination of good solutions coming from complementary parts of the solution space, which often results in poor solutions that slow down convergence. We categorized this dual interest in developing the EMDA as quantitative and qualitative, i.e. a gain in effectiveness together with a gain in efficiency.


From the point of view of effectiveness, measured as the number of global optima recovered by each algorithm for each function, the EMDA outperforms the two benchmarks. This is not surprising, as this was part of the motivation that led us to propose it. Specifically, the EMDA always discovers the global optima independently of the actual problem. Moreover, the individuals of the last population of every run of the EMDA for any of the 7 domains are equally distributed between the two global optima. On the other hand, the UMDA and the EBNABIC suffer the effects of the symmetry and the existence of several peaks, especially in the caterpillar graph bisection problems. Their harmful effects can be observed in the fact that, for the 7 functions chosen, the UMDA and the EBNABIC need a larger number of evaluations to converge and discover at most one global optimum per run than the EMDA needs to reach convergence and discover both global optima. In addition, Table 4.1 shows that the caterpillar graph bisection problems are extremely difficult for the UMDA and the EBNABIC: these two algorithms get stuck in local optima in 9 out of the 30 runs performed for the three caterpillar graph bisection problems. The EMDA exhibits an unbeatable behaviour in these particular problems.

These results demonstrate the strength of the EMDA in particular, and of the combination of EDAs and probabilistic clustering in general, in alleviating the disrupting effects of spin-flip symmetry and in obtaining all the global optima of the objective function.

5.3 Optimization in continuous domains

This subsection evaluates the EMDA as applied to optimization in continuous domains. We limit our current evaluation to multimodal functions that show symmetry in the solution space. Specifically, we consider that a function F(x) exhibits symmetry in the solution space with respect to a when F(a + x) = F(a - x) for all x in the domain. As in the discrete case, this class of multimodal functions represents a set of challenging problems, since they involve the same harmful effects on many evolutionary algorithms as spin-flip symmetrical functions.

5.3.1 Problems.

Two-max problems. These are two simple symmetrical functions, similar to the discrete two-max problems introduced in the evaluation of the EMDA on combinatorial optimization problems:

    Ftwo-max(x) = | Σ_{i=1}^{n} x_i |,    -5 ≤ x_i ≤ 5,  i = 1, ..., n        (4.9)


    Ftwo-max2(x) = Ftwo-max(x) - 30   if Ftwo-max(x) > 30
                   Ftwo-max(x)        otherwise

                   -10 ≤ x_i ≤ 10,  i = 1, ..., n                             (4.10)

where the objective is to maximize these functions. For both functions, there are two global optima: x1* = (-5, ..., -5) and x2* = (5, ..., 5) for Ftwo-max, with fitness equal to 5n, and x1* = (-10, ..., -10) and x2* = (10, ..., 10) for Ftwo-max2, with fitness equal to 10n - 30. In our case, n = 10. The optimization scheduling for both functions is (2000, 1000, 1999, 1). Ftwo-max2 is considered more difficult than Ftwo-max as it has two local optima in addition to the two global optima.
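Eqs. 4.9 and 4.10 can be transcribed directly (our own sketch; the names f_two_max_c and f_two_max2_c are illustrative, and the box constraints are assumed to be enforced by the optimizer):

```python
def f_two_max_c(x):
    """Eq. 4.9: |sum_i x_i|, with -5 <= x_i <= 5."""
    return abs(sum(x))

def f_two_max2_c(x):
    """Eq. 4.10: subtracting 30 above the threshold creates two local
    optima where the sum of the genes equals -30 or 30."""
    f = abs(sum(x))
    return f - 30 if f > 30 else f
```

With n = 10, the global optima of Eq. 4.10 at (±10, ..., ±10) score 10n - 30 = 70, while the two local peaks score only 30.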

Mixture of normal distributions problems. The first example of this class of problems that we consider is defined as the joint probability density function of a mixture of two normal distributions with different mean vectors:

    Fmix1(x) = 0.5 N(x; μ1, Σ) + 0.5 N(x; μ2, Σ)                              (4.11)

where N(x; μ, Σ) is a multivariate normal density with n-dimensional mean vector μ and n × n covariance matrix Σ. In our problem, μ1 = (-1, ..., -1) and μ2 = (1, ..., 1). Moreover, we consider that the covariance matrix is diagonal with all the elements of the diagonal equal to 1. There are two global optima: x1* = (-1, ..., -1) and x2* = (1, ..., 1).

We also use two more examples of the mixture of normal distributions problems, here denoted Fmix2 and Fmix3. They are similar to Fmix1, but in these cases the non-zero elements of the covariance matrix are equal to 4 for Fmix2 and equal to 9 for Fmix3. The two global optima of Fmix2 are approximately x1* = (-0.99, ..., -0.99) and x2* = (0.99, ..., 0.99). For Fmix3 the global optima are around x1* = (-0.52, ..., -0.52) and x2* = (0.52, ..., 0.52).

The objective for the three functions is maximization, and -3 ≤ x_i ≤ 3 for i = 1, ..., n. In our case, n = 10. The optimization scheduling is (2000, 1000, 1999, 10^-8) for the first function and (2000, 1000, 1999, 10^-10) for the other two. It is easy to see that the three functions have been introduced in increasing order of difficulty.
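Since the covariance matrix is diagonal (var · I), the density of Eq. 4.11 factorizes and can be computed without any linear algebra. A minimal sketch (our own; normal_pdf and f_mix are illustrative names), which also shows why the optima of Fmix3 are pulled inward from the component means:

```python
import math

def normal_pdf(x, mu, var):
    """Density of an n-dimensional normal with covariance var * I."""
    n = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-sq / (2.0 * var)) / (2.0 * math.pi * var) ** (n / 2.0)

def f_mix(x, var=1.0):
    """Eq. 4.11 with mu1 = (-1, ..., -1) and mu2 = (1, ..., 1)."""
    n = len(x)
    return 0.5 * normal_pdf(x, [-1.0] * n, var) + 0.5 * normal_pdf(x, [1.0] * n, var)
```

With var = 9, the overlap of the two broad components moves the two modes to roughly (±0.52, ..., ±0.52), matching the values quoted above.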

Shekel's foxholes problems. We consider two instances of the well-known multimodal Shekel's foxholes problem (De Jong, 1975). The first instance is as follows:


[Three histograms omitted: populations at generations 0, 10 and 57.]

Figure 4.4 Dynamics of the EMDA in the continuous Ftwo-max problem. The horizontal axis represents the sum of the genes of a solution whereas the vertical axis denotes the number of corresponding solutions in the population of the generation indicated.

    FShekel1(x) = - Σ_{j=1}^{m} 1 / ( ||x - x_j*||² + c_j )                   (4.12)

where m is the number of global optima and c_j, j = 1, ..., m, is a coefficient that determines the height of each of the global peaks. The objective is minimization. In our case, m = 2, c1 = c2 = 0.001, x1* = (1, ..., 1) and x2* = (3, ..., 3). Thus, the value of the objective function at the global minima is equal to -1000. Moreover, 0 ≤ x_i ≤ 4 for i = 1, ..., n. The dimension of the problem is n = 5. The optimization scheduling is (2000, 1000, 1999, 50).

We refer to the second instance of the Shekel's foxholes problem as FShekel2. In this case, m = 3, c1 = c2 = c3 = 0.001, x1* = (1, ..., 1), x2* = (4, ..., 4) and x3* = (7, ..., 7). The value of the objective function at the global minima is equal to -1000, and 0 ≤ x_i ≤ 8 for i = 1, ..., n. The dimension of the problem is n = 5. The optimization scheduling is (5000, 1000, 500, 50). The objective is also minimization.
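Eq. 4.12 can be transcribed directly (our own sketch; the argument names optima and c are illustrative). At a peak x_j*, the j-th term contributes -1/c_j = -1000, so the value at a global minimum is approximately -1000 (the remaining terms add only a tiny amount):

```python
def f_shekel(x, optima, c):
    """Eq. 4.12: F(x) = -sum_j 1 / (||x - x_j*||^2 + c_j); to be minimized."""
    total = 0.0
    for x_star, c_j in zip(optima, c):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, x_star))
        total += 1.0 / (sq_dist + c_j)
    return -total
```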

5.3.2 Results. Figure 4.4 shows the dynamics of the EMDA in the continuous Ftwo-max problem. The histograms summarize the number of solutions (vertical axis) in the populations of generations 0, 10 and 57 whose sum of genes equals the value denoted by the horizontal axis. The two global optima are on the left-most and right-most sides of the histograms. Thus, it is clear that, as the EMDA progresses, the population drifts to both sides, since the CGN learnt at each iteration is able to capture this division of the selected individuals. Finally, both global optima are discovered and appear in the population of the last generation of the EMDA. Moreover, the individuals of this last population are almost equally distributed between the two global peaks.

Table 4.2 summarizes the results achieved when applying the EMDA to each of the 7 optimization problems presented in the previous subsection. Additionally, this table reports the results reached by the UMDAc and the EGNABGe in these


Table 4.2 Performance of the UMDAc, EGNABGe and EMDA in the continuous domains considered. The numbers of evaluations and runtimes are average values over 10 independent runs. The numbers of times that each optimum is reached summarize the final results of these 10 runs.

Problem      UMDAc                              EGNABGe                            EMDA

Ftwo-max     104149 eval.  36 sec.  (4, 6)      161120 eval.  104 sec.  (5, 5)     115343 eval.  152 sec.  (10, 10)
Ftwo-max2    169516 eval.  84 sec.  (2, 8)      185108 eval.  142 sec.  (6, 4)     141130 eval.  172 sec.  (10, 10)
Fmix1        90556 eval.   58 sec.  (7, 3)      92955 eval.   58 sec.   (5, 5)     78962 eval.   80 sec.   (10, 10)
Fmix2        77562 eval.   53 sec.  (8, 2)      88357 eval.   105 sec.  (4, 6)     63769 eval.   61 sec.   (10, 10)
Fmix3        50776 eval.   20 sec.  (0, 0)      57972 eval.   47 sec.   (3, 7)     44179 eval.   50 sec.   (10, 10)
FShekel1     77162 eval.   30 sec.  (5, 5)      102750 eval.  49 sec.   (5, 5)     56773 eval.   37 sec.   (10, 10)
FShekel2     40000 eval.   7 sec.   (0, 10, 0)  39900 eval.   18 sec.   (0, 10, 0) 42850 eval.   85 sec.   (10, 10, 10)

problems for comparison. For each problem and each evolutionary algorithm three results are given: the number of evaluations of the objective function until the stopping criterion is met, the runtime of the optimization process (in seconds), and the number of times that each of the global optima of the objective function is discovered. The first two results are average values over 10 independent runs. The third result is encoded using the same system as in Section 5.2.2.

Roughly speaking, the results achieved for the continuous domains repeat the patterns that appear for combinatorial optimization. Let us analyze in detail the results summarized in Table 4.2. Except in Ftwo-max (only for UMDAc) and FShekel2, the EMDA needs a smaller number of evaluations of the objective function than the UMDAc and the EGNABGe to achieve convergence. Thus,


the EMDA exhibits a more efficient behaviour than the other two evolutionary algorithms. Despite this, the EMDA usually takes a larger runtime than the other two algorithms. Again, the reason is the unsupervised learning of CGNs performed by the EMDA. However, its runtime is considered reasonable.

In addition to being the most efficient, the results of Table 4.2 confirm that the EMDA is also the most effective of the three algorithms considered: it always discovers all the global optima that exist in the 7 functions chosen. Moreover, except for Ftwo-max (only for the UMDAc) and FShekel2, the EMDA is able to converge to all the global optima in all the runs in fewer evaluations than the UMDAc and the EGNABGe, whereas these algorithms discover at most one of the existing optima. Thus, the EMDA reveals once again its benefits for multimodal function optimization from a qualitative as well as a quantitative point of view. To reinforce the effective behaviour shown by the EMDA, we should add that the individuals of the last population of every run of the EMDA for any of the 7 domains are equally distributed between the existing global optima. On the other hand, the UMDAc suffers the effects of the symmetry of the solution space when dealing with Fmix3, and is unable to reach any of the global optima of this function in the 10 runs carried out.

Finally, we should conclude that, as seen in the combinatorial optimization problems previously considered, the EMDA when applied to optimization in continuous domains fulfills all its objectives.

6. Conclusions

The main contribution of this chapter has been the introduction of a new member of the EDA family: the EMDA (Estimation of Mixture of Distributions Algorithm). The motivation that led us to the EMDA was two-fold. First, we wanted to obtain all the global optima when facing both discrete and continuous multimodal function optimization problems. Second, the optimization process needed to be efficient in addition to effective, i.e. it had to be able to overcome the difficulties derived from the existence of several global peaks in the function to optimize.

The main steps of the EMDA are the same as in any other EDA: selection of promising individuals, model learning and model sampling to generate a new population. The improvement of the EMDA over other EDAs lies in the model learnt at each iteration. This model is intended to capture the multimodality of the function to be optimized by clustering the selected individuals according to their genotypes. This avoids the harmful effects of multimodality, as individuals from different parts of the search space are treated separately. Furthermore, each cluster should ideally evolve towards a different global peak.

Unlike other works that divide the set of selected individuals of each generation into a set of clusters, the EMDA does not perform such an explicit partition of the selected individuals. The EMDA makes use of two well-known classes of


probabilistic graphical models to cluster the selected individuals: BNs for combinatorial optimization and CGNs for continuous optimization. This makes the EMDA fit into the EDA framework in a natural way, as well as providing a unified approach to both combinatorial and continuous multimodal function optimization.

Empirical evaluation of the EMDA for combinatorial as well as continuous optimization has been limited to some symmetrical functions. The functions chosen are known to be difficult problems for many evolutionary algorithms. This point has been confirmed by the results reported: the EMDA has outperformed the UMDA, UMDAc, EBNABIC and EGNABGe in the number of evaluations of the objective function needed to converge, and all the global optima were discovered for all the problems considered. This proves that the EMDA is able to deal with multimodal functions and discover all the existing global optima while alleviating the harmful effects that the existence of several global peaks implies for many other evolutionary algorithms.

Acknowledgments

This work was supported by the Spanish Ministry of Education and Culture under grant AP97 44673053.

References

Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press.

Buntine, W. (1991). Theory refinement in Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52-60. Morgan Kaufmann, Inc.

Collard, P. and Aurand, J. P. (1994). DGA: An efficient genetic algorithm. In Proceedings of the European Conference on Artificial Intelligence 1994, pages 487-492. John Wiley & Sons, Inc.

De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. Doctoral Dissertation. University of Michigan.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38.

Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21:768.

Friedman, N. (1998). The Bayesian Structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 129-138. Morgan Kaufmann, Inc.

Friedman, N. and Goldszmidt, M. (1996). Building classifiers using Bayesian networks. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1277-1284. AAAI Press.


Gallagher, M., Frean, M., and Downs, T. (1999). Real-valued evolutionary optimization using a flexible probability density estimator. In Proceedings of the Genetic and Evolutionary Computation Conference 1999, pages 840-846.

Geiger, D. and Heckerman, D. (1995). Learning Gaussian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 235-243.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc.

Hocaoglu, C. and Sanderson, A. C. (1995). Evolutionary speciation using minimal representation size clustering. In Evolutionary Programming IV: Proceedings of the Fourth Annual Conference on Evolutionary Programming, pages 187-203. MIT Press.

Hocaoglu, C. and Sanderson, A. C. (1997). Multimodal function optimization using minimal representation size clustering and its applications to planning multipaths. Evolutionary Computation, 5(1):81-104.

Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data. John Wiley & Sons, Inc.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000a). Combinatorial optimization by learning and simulation of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 343-352. Morgan Kaufmann, Inc.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000b). Optimization in continuous domains by learning and simulation of Gaussian networks. In Genetic and Evolutionary Computation Conference 2000. Proceedings of the Program Workshops, pages 201-204. Morgan Kaufmann, Inc.

Lauritzen, S. L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098-1108.

Lauritzen, S. L. (1996). Graphical Models. Clarendon Press.

Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17:31-57.

McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. John Wiley & Sons, Inc.

Meila, M. (1999). Learning with Mixtures of Trees. Doctoral Dissertation. Massachusetts Institute of Technology.

Meila, M. and Jordan, M. I. (1998). Estimating dependency structure as a hidden variable. Neural Information Processing Systems, 10:584-590.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Naudts, B. and Naudts, J. (1998). The effect of spin-flip symmetry on the performance of the simple GA. In Proceedings of Parallel Problem Solving from Nature V, pages 67-76. Springer-Verlag, Lecture Notes in Computer Science.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, Inc.

Pelikan, M. and Goldberg, D. E. (2000). Genetic algorithms, clustering, and the breaking of symmetry. In Proceedings of Parallel Problem Solving from Nature VI, pages 385-394. Springer-Verlag, Lecture Notes in Computer Science.

Pelikan, M., Goldberg, D. E., and Sastry, K. (2000). Bayesian optimization algorithm, decision graphs, and Occam's razor. Technical Report IlliGAL No. 2000020, Illinois.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (1999). Learning Bayesian networks for clustering by means of constructive induction. Pattern Recognition Letters, 20(11-13):1219-1230.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (2000). An improved Bayesian structural EM algorithm for learning Bayesian networks for clustering. Pattern Recognition Letters, 21(8):779-786.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (2001a). Geographical clustering of cancer incidence by means of Bayesian networks and conditional Gaussian networks. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, pages 266-271. Morgan Kaufmann, Inc.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (2001b). Learning conditional Gaussian networks for data clustering via edge exclusion tests. Submitted.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (2001c). Learning recursive Bayesian multinets for data clustering by means of constructive induction. Machine Learning, in press.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (2001d). Performance evaluation of compromise conditional Gaussian networks for data clustering. International Journal of Approximate Reasoning, in press.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464.

Schwarz, J. and Ocenasek, J. (1999). Experimental study: Hypergraph partitioning based on the simple and advanced algorithms BMDA and BOA. In Proceedings of the Fifth International Conference on Soft Computing, pages 124-130.

Segen, J. and Sanderson, A. C. (1981). Model inference and pattern discovery by minimal representation method. Technical Report CMU-RI-TR-82-2, Carnegie Mellon University.

Thiesson, B., Meek, C., Chickering, D. M., and Heckerman, D. (1998). Learning mixtures of DAG models. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 504-513. Morgan Kaufmann, Inc.


Van Hoyweghen, C. (2000). Detecting spin-flip symmetry in optimization problems. Theoretical Aspects of Evolutionary Computing.

Van Hoyweghen, C. and Naudts, B. (2000). Symmetry in the search space. In Proceedings of the Conference on Evolutionary Computation 2000, pages 1072-1079. IEEE Press.


Chapter 5

Parallel Estimation of Distribution Algorithms

J. A. Lozano, R. Sagarna, P. Larrañaga
Department of Computer Science and Artificial Intelligence

University of the Basque Country

{lozano. ccbsaalr. ccplamup}@si.ehu.es

Abstract This chapter describes parallel versions of some Estimation of Distribution Algorithms (EDAs). We concentrate on those algorithms that use Bayesian networks to model the probability distribution of the selected individuals, and particularly on those that use a score+search learning strategy. Apart from the evaluation of the fitness function, the biggest computational cost in these EDAs is due to the structure learning step. We aim to speed up the structure learning step by the use of parallelism. Two different approaches will be given and evaluated experimentally on a shared memory MIMD computer.

Keywords: Estimation of Distribution Algorithms, parallelism, Bayesian networks, structure learning

1. Introduction

Estimation of Distribution Algorithms (EDAs) (Mühlenbein and Paaß, 1996; Larrañaga et al., 2000a; Larrañaga et al., 2000b) constitute a set of promising optimization techniques. However, the most sophisticated approaches, those that make use of Bayesian and Gaussian networks, are computationally very expensive. Their computational cost is mainly due to the structure learning phase, that is, the elicitation of a probabilistic graphical model that encodes a factorization of the probability distribution of the set of selected individuals.

As our current focus is on combinatorial optimization problems, i.e. discrete domains, one should be aware that structure learning of a Bayesian network is an NP-hard problem (Chickering et al., 1995). Thus, it is mandatory to use simple algorithms in order to maintain a feasible computational cost.

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


Two different approaches (Larrañaga, 2001) are mainly used to learn the probability distribution by using Bayesian networks in EDAs: score+search and detecting conditional (in)dependencies. We concentrate on the first approach, and develop two parallel algorithms for it. These parallel learning algorithms, although exemplified in this chapter with the EBNA_BIC algorithm (Etxeberria and Larrañaga, 1999; Larrañaga et al., 2000a), could be adapted to other instances of EDAs that use the score+search approach to structure learning: BOA (Pelikan et al., 1999), EBNA_K2+pen (Larrañaga et al., 2000a) and LFDA (Mühlenbein and Mahnig, 1999).

Parallel structure learning of probabilistic graphical models has not received much attention in the literature so far. Some work can be found in non-Bayesian network paradigms: Sangüesa et al. (1998) and Xiang and Chu (1999) develop parallel structure learning algorithms for possibilistic networks and decomposable Markov networks, respectively. Xiang and Chu (1999), in addition, outline how their approach could be extended to Bayesian networks. One of our parallel algorithms has been inspired by these ideas.

In the Bayesian network field, Lam and Segre (2001) present an algorithm to distribute structure learning, but their work is not useful here. In addition to using a Minimum Description Length based score and imposing an a priori ordering on the set of variables, their structural search is carried out using a branch and bound algorithm. The computational cost of this branch and bound is too expensive for our purpose.

Although other components of the algorithm, such as the selection step or the sampling process, could clearly be parallelized, we restrict our attention to the structure learning process. The reason for this is that the computational cost implied by these other components is tiny compared with that of structure learning.

This chapter is organized as follows. Section 2 describes EBNA_BIC and its structure learning process in detail. Two parallel algorithms are introduced in Section 3, leaving Section 4 for a numerical comparison of the two versions. Section 5 draws conclusions and gives a summary.

2. Sequential EBNA_BIC

As previously stated, our objective is to parallelize the structure learning phase of a discrete EDA. We concentrate our attention on the EBNA_BIC algorithm, pseudocode for which can be found in Figure 5.1.

In the EBNA_BIC an initial Bayesian network is given (normally with an arcless structure). M individuals are sampled from this Bayesian network and, applying some selection rule, N of them are selected. With these N individuals a new Bayesian network is built (the structure search starts with the structure

Parallel Estimation of Distribution Algorithms 131

Algorithm EBNA_BIC

Step 1. Give an initial probability distribution p_0(x) using a Bayesian network
Step 2. Sample M individuals from p_0(x), obtain D_0 and set l = 1
Step 3. Select N individuals from D_{l-1}
Step 4. Find a good enough structure according to the penalized maximum likelihood (BIC)
Step 5. Calculate the parameters for the structure and obtain p_l(x)
Step 6. Sample M individuals from p_l(x) and obtain D_l
Step 7. If a stopping criterion is met, stop; else set l = l + 1 and go to Step 3

Figure 5.1 Pseudocode for the EBNA_BIC algorithm.

learned in the previous loop). This process is repeated until a stopping criterion is met.

As mentioned earlier, EBNA_BIC relies on a score+search approach to perform the Bayesian network structural search. To be exact, the score used is the penalized maximum likelihood, denoted BIC (Bayesian Information Criterion) (Schwarz, 1978). Given a structure S and a dataset D, this BIC score can be written as:

BIC(S, D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{1}{2} \log N \sum_{i=1}^{n} q_i (r_i - 1)    (5.1)

where:

• n is the number of variables of the Bayesian network.

• r_i is the number of different values that variable X_i can take.

• q_i is the number of different values that the parent variables of X_i, Pa_i, can take.

• N_ij is the number of individuals in D in which the variables Pa_i take their jth value.

• N_ijk is the number of individuals in D in which variable X_i takes its kth value and the variables Pa_i take their jth value.
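As an illustration, the counts N_ijk and N_ij of Eq. (5.1) can be accumulated in a single pass over the dataset. The following Python sketch is not from the chapter; the function and variable names (`bic_score`, `parents`, `arities`) are our own, and it assumes individuals are tuples of integer values starting at 0.

```python
import math

def bic_score(data, parents, arities):
    """BIC score of Eq. (5.1): the log-likelihood of each variable given
    its parents, minus a penalty of (1/2) log N per free parameter."""
    N = len(data)
    total = 0.0
    for i, pa in enumerate(parents):
        r_i = arities[i]
        q_i = 1                      # number of parent configurations
        for p in pa:
            q_i *= arities[p]
        counts = {}                  # (parent config j, value k) -> N_ijk
        for x in data:
            key = (tuple(x[p] for p in pa), x[i])
            counts[key] = counts.get(key, 0) + 1
        n_ij = {}                    # parent config j -> N_ij
        for (j, _k), c in counts.items():
            n_ij[j] = n_ij.get(j, 0) + c
        for (j, _k), n_ijk in counts.items():
            total += n_ijk * math.log(n_ijk / n_ij[j])
        total -= 0.5 * math.log(N) * q_i * (r_i - 1)
    return total
```

On a dataset of independent binary variables, adding a spurious arc lowers the score, which is exactly the behaviour the penalty term is there to produce.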

An important property of this score is that it is decomposable. This means that the score can be calculated as the sum of the separate local BIC scores for the variables, i.e. each variable Xi has associated with it a local BIC score (BIC(i, S, D)):

BIC(S, D) = \sum_{i=1}^{n} BIC(i, S, D)    (5.2)

BIC(i, S, D) = \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{1}{2} q_i (r_i - 1) \log N    (5.3)

The structure search algorithm used in EBNABIC is usually a hill-climbing algorithm. At each step, an exhaustive search is done through the set of possible arc modifications. An arc modification consists of adding or deleting an arc from the current structure S. The arc modification that maximizes the gain of the BIC score is used to update S, as long as it results in a DAG (Directed Acyclic Graph) structure (note that the structure of a Bayesian network must be a DAG). This cycle continues until there is no arc modification that improves the score. It is important to bear in mind that if we update S with the arc modification (j, i), then only BIC(i, S, D) needs to be recalculated.

The structural learning algorithm performs a sequence of actions that differs between the first step and all subsequent steps. In the first step, given a structure S and a database D, the change in the BIC is calculated for each possible arc modification. Thus, we have to calculate n(n - 1) terms, as there are n(n - 1) possible arc modifications. The arc modification that maximizes the gain of the BIC score, whilst maintaining the DAG structure, is applied to S. In the remaining steps, only changes to the BIC due to arc modifications related to the variable Xi (it is assumed that in the previous step, S was updated with the arc modification (j, i)) need to be considered. The remaining terms do not change, because the score is decomposable. In this case, the number of terms to be calculated is n - 2.

We use four memory structures for this algorithm. A vector BIC[i], i = 1, 2, ..., n, where BIC[i] stores the local BIC score of the current structure associated with variable Xi. A structure S[i], i = 1, 2, ..., n, with the DAG represented as adjacency lists, i.e. S[i] represents the list of the immediate successors of vertex Xi. An n × n matrix G, where each (j, i) entry represents the gain or loss in score associated with the arc modification (j, i). Finally, an n × n matrix paths[i, j], i, j = 1, 2, ..., n, that represents the number of paths between each pair of vertices (variables). This last structure is used to check whether an arc modification produces a DAG structure. For instance, it is possible to add the arc (j, i) to the structure if the number of paths from i to j is equal to 0, i.e. paths[i, j] = 0.
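The paths matrix admits a cheap incremental update: when the arc (j, i) is inserted, every path ending at Xj can be concatenated, through the new arc, with every path starting at Xi. A possible sketch (our own, not from the chapter; NumPy and the helper names are assumptions):

```python
import numpy as np

def make_paths(n):
    # paths[a, b] = number of directed paths from a to b;
    # the empty path gives paths[a, a] = 1 by convention
    return np.eye(n, dtype=np.int64)

def can_add_arc(paths, j, i):
    # Adding j -> i keeps the graph acyclic iff there is no path i -> j
    return i != j and paths[i, j] == 0

def add_arc(paths, j, i):
    # Every path a -> ... -> j now extends through the new arc
    # into every path i -> ... -> b
    paths += np.outer(paths[:, j], paths[i, :])
```

With this bookkeeping, the acyclicity test in Step 4 of the search is a constant-time lookup instead of a graph traversal.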

Algorithm SeqBIC

Input: D, S, paths
Step 1. for i = 1, ..., n calculate BIC[i]
Step 2. for i = 1, ..., n and j = 1, ..., n set G[j, i] = 0  /* initialize G */
Step 3. for i = 1, ..., n and j = 1, ..., n
          if (i != j) calculate G[j, i]  /* the change of the BIC produced by the arc modification (j, i) */
Step 4. find (j, i) such that paths[i, j] = 0 and G[j, i] >= G[r, s] for all r, s = 1, ..., n such that paths[s, r] = 0
Step 5. if G[j, i] > 0
          update S with arc modification (j, i)
          update paths
        else stop
Step 6. for k = 1, ..., n
          if (k != i and k != j) calculate G[k, i]
Step 7. go to Step 4

Figure 5.2 Pseudocode for the sequential structural learning algorithm, SeqBIC.

Pseudocode for the sequential structure learning algorithm, SeqBIC, can be seen in Figure 5.2.
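The SeqBIC loop can be sketched in Python as follows. This is our own illustrative version, not the authors' code: `bic_gain` and `apply_arc_modification` are hypothetical callbacks, and unlike Figure 5.2 we also refresh the gain of the arc just modified, so that toggling it back is scored correctly.

```python
def seq_bic(n, bic_gain, paths, apply_arc_modification):
    """Hill-climbing structural search in the spirit of SeqBIC.
    bic_gain(j, i) returns the BIC change of modifying arc (j, i)
    against the current structure; apply_arc_modification(j, i)
    updates S and the paths matrix."""
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                G[j][i] = bic_gain(j, i)
    while True:
        # best arc modification that keeps the structure a DAG
        best, best_gain = None, 0.0
        for i in range(n):
            for j in range(n):
                if i != j and paths[i][j] == 0 and G[j][i] > best_gain:
                    best, best_gain = (j, i), G[j][i]
        if best is None:          # no modification improves the score
            return
        j, i = best
        apply_arc_modification(j, i)
        # decomposability: only gains of arcs into X_i change
        for k in range(n):
            if k != i:
                G[k][i] = bic_gain(k, i)
```

With a toy gain function that rewards adding any absent arc and penalizes removing a present one, the search adds a single arc on two variables and then stops, rejecting the cycle-creating reverse arc.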

3. Parallel EBNA_BIC

In this section we explore parallelism for speeding up structure learning, and consequently EDAs. To this end, we decompose the structure learning process to take into account that some tasks in the structure learning algorithm SeqBIC can be carried out independently, e.g. the change in the BIC implied by an arc modification or the check of the DAG property. Two different parallel algorithms are proposed.

We consider a MIMD architecture with shared memory because this is available to us, but some generalizations to other architectures are obvious. The processors are partitioned in both algorithms as follows: one processor works as a search manager and the remaining processors are arc modification explorers.

Algorithm MNG1

Input: D, S, paths, k  /* number of explorers */
Step 1. for i = 1, ..., k
          start_explorer(i)
Step 2. send_start_signal to explorers  /* explorers start working */
Step 3. receive_final_signal from explorers  /* each explorer has processed its arc modifications */
Step 4. find (j, i), the best arc modification
Step 5. if G[j, i] > 0
          update S with arc modification (j, i)
          update paths
        else
          send_halt_signal to the explorers
          stop
Step 6. send_start_signal to the explorers  /* changes of the BIC due to arc modifications related to node Xi have to be calculated */
Step 7. go to Step 3

Figure 5.3 Pseudocode for manager MNG1.

3.1 PA1BIC

We call PA1BIC the first parallel algorithm proposed. PA1BIC is a straightforward parallelization of the sequential SeqBIC.

The manager (MNG1) in this case is dedicated to controlling the whole algorithm, synchronizing with the explorers, and carrying out some tasks that have to be centralized. MNG1 starts the explorers (EPR1) and recovers information from them, calculates the best arc modification, and updates the current structure. Figure 5.3 gives pseudocode for MNG1. The work carried out by the explorers consists of calculating the change in the BIC due to some arc modifications and choosing, from these, the best arc modification that maintains the DAG structure.

As in the SeqBIC, we can distinguish the first step of the search from the rest of the steps. In the first step, the changes in the BIC score for all the arc modifications have to be calculated. In the rest of the steps, only those changes due to arc modifications related to a single node, Xi, are calculated.

Algorithm EPR1

Step 1. receive_start_signal from manager
Step 2. calculate the set of arc modifications to examine
Step 3. for each assigned (j, i) calculate G[j, i]
Step 4. find (j, i) such that paths[i, j] = 0 and G[j, i] >= G[r, s] for all assigned (r, s) such that paths[s, r] = 0
Step 5. send_final_signal to the manager
Step 6. if NOT receive_halt_signal from manager
          receive_start_signal from manager
          calculate the set of arc modifications to examine
          for each assigned (j, i) calculate G[j, i]
          find (j, i) such that paths[i, j] = 0 and G[j, i] >= G[r, s] for all assigned (r, s) such that paths[s, r] = 0
          send_final_signal to manager
          go to Step 6

Figure 5.4 Pseudocode for explorer EPR1.

Each explorer, according to its unique identifier, is able to calculate the set of arc modifications that it has to process. In the first step, given the number of variables, n, and the number of explorers, k, we write n = ka + r. The arc modifications are distributed so that each of the r first explorers examines the arc modifications related to a + 1 nodes, and each of the remaining explorers processes the arc modifications related to a nodes. In the remaining steps, the changes due to n - 2 arc modifications have to be evaluated, and a similar partition is created.

The way in which the arc modifications are distributed between the explorers keeps the load evenly balanced. In the first step each explorer examines (a + 1)(n - 1) or a(n - 1) arc modifications, and in the remaining steps ⌊(n - 2)/k⌋ + 1 or ⌊(n - 2)/k⌋.

Pseudocode for EPR1 can be seen in Figure 5.4.
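The static partition described above (n = ka + r, with the first r explorers taking a + 1 nodes each) might be coded as follows; the function name and the 0-based node indices are our own assumptions, not the chapter's:

```python
def first_step_assignment(n, k, explorer_id):
    """Nodes whose arc modifications explorer `explorer_id` (0-based)
    examines in the first step.  With n = k*a + r, the first r
    explorers take a + 1 nodes each and the rest take a."""
    a, r = divmod(n, k)
    if explorer_id < r:
        start = explorer_id * (a + 1)
        return list(range(start, start + a + 1))
    start = r * (a + 1) + (explorer_id - r) * a
    return list(range(start, start + a))
```

Because each explorer can derive its share from its identifier alone, no communication with the manager is needed to distribute the work.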

3.2 PA2BIC

The second parallel version of the algorithm SeqBIC is introduced here. In the previous algorithm we calculated the change in the BIC for each arc modification without taking into account whether the arc modification produced a DAG or not. In this algorithm we first check if an arc modification produces

Algorithm MNG2

Input: D, S, paths, k  /* number of explorers */
Step 1. for i = 1, ..., k
          start_explorer(i)
Step 2. send_start_signal to explorers
Step 3. receive_final_signal from explorers  /* the set of arc modifications that produce DAGs has been calculated */
Step 4. for i = 1, ..., k
          assign a subset of valid arc modifications to explorer i
Step 5. send_start_signal to the explorers
Step 6. receive_final_signal from explorers  /* the best arc modification has been calculated by each explorer */
Step 7. find (j, i), the best arc modification
Step 8. if G[j, i] > 0
          update S with arc modification (j, i)
          update paths
        else
          send_halt_signal to the explorers
          stop
Step 9. go to Step 2

Figure 5.5 Pseudocode for manager MNG2.

a DAG and, if it does, then we calculate the change in the BIC produced by this arc modification.

In this case the manager (MNG2) distributes between the explorers (EPR2) the set of arc modifications for which the change in the BIC has to be calculated. Beforehand, each explorer checks a set of arc modifications and determines the subset of them that produces DAGs.

MNG2 starts the explorers and synchronizes with them. Once the explorers have calculated the set of arc modifications that produce DAGs, these are evenly distributed between the explorers. Each explorer calculates the change in the BIC and the best arc modification of its set. MNG2 then finds the best arc modification among those returned by the explorers. If there is an increase in the BIC, the best arc modification is used to update S and the loop starts again. Pseudocode for MNG2 can be seen in Figure 5.5.

Algorithm EPR2

Step 1. receive_start_signal from manager
Step 2. calculate the set of arc modifications to examine
Step 3. calculate the set of arc modifications that produce DAGs
Step 4. send_final_signal to manager
Step 5. receive_start_signal from manager
Step 6. for each assigned (j, i) calculate G[j, i]
Step 7. find (j, i) such that G[j, i] >= G[r, s] for all assigned (r, s)
Step 8. send_final_signal to manager
Step 9. if NOT receive_halt_signal from manager
          go to Step 1

Figure 5.6 Pseudocode for explorer EPR2.

EPR2 calculates the set of arc modifications for which DAG checks need to be done. Then, each explorer calculates the change in the BIC due to each arc modification assigned by MNG2 and identifies the best arc modification. This process is repeated until MNG2 halts the explorers. Pseudocode for EPR2 can be seen in Figure 5.6.

3.3 PA1BIC versus PA2BIC

There are many differences between the two parallel versions. The main difference relates to the number of changes in the BIC that need to be processed. In PA1BIC, n(n - 1) changes in the BIC score due to arc modifications are calculated in the first step, even though some of those modifications may not produce DAGs. In the remaining iterations only n - 2 changes in the BIC are considered. In contrast, in PA2BIC the changes in the BIC calculated in each iteration are only those due to arc modifications that produce DAGs, and thus their number cannot be calculated beforehand.

Therefore, it can be said that the first algorithm is more stable and is more amenable to theoretical analysis than the second.

Regarding load balance, in both algorithms the work appears to be evenly distributed between the explorers.

4. Numerical evaluation

We have carried out some numerical experiments to evaluate the performance of the SeqBIC, PA1BIC and PA2BIC algorithms. All the experiments have been carried out on a dedicated MIMD machine with shared memory and four processors, so experiments were only done for a number of explorers k = 1, 2, 3, 4. In all the experiments, 10 independent runs of an EBNA_BIC with ranking selection, elitism and selection of half of the population have been performed.

Two different functions of different complexity have been used in the exper­iments, the well-known OneMax function and the EqualProducts function.

The OneMax problem consists of maximizing:

OneMax(x) = \sum_{i=1}^{n} x_i

where x_i ∈ {0, 1}. Clearly OneMax is a linear function, which is easy to optimize. The computational cost of evaluating this function is tiny.

For the EqualProducts function a set of random real numbers {a_1, a_2, ..., a_n} in the interval [0, u] has to be given (in our case we used [0, 4]). The objective is to select some numbers in such a way that the difference between the products of selected and unselected numbers is minimized. Mathematically:

EqualProducts(x) = \prod_{i=1}^{n} h(x_i, a_i) - \prod_{i=1}^{n} h(1 - x_i, a_i)

where the function h is defined as:

h(x, a) = 1 if x = 0, and h(x, a) = a if x = 1

In this problem we do not know the optimum, but the closer the solution is to zero, the better. The computational cost of a function evaluation is larger than in the previous case.
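For reference, both test functions are straightforward to implement. The sketch below is ours, not the chapter's; it returns the absolute product difference for EqualProducts, since the chapter treats values closer to zero as better:

```python
def onemax(x):
    # linear function, maximized by the all-ones string
    return sum(x)

def equal_products(x, a):
    # |product of selected numbers - product of unselected numbers|;
    # the closer to zero, the better
    sel = unsel = 1.0
    for xi, ai in zip(x, a):
        if xi == 1:
            sel *= ai
        else:
            unsel *= ai
    return abs(sel - unsel)
```

In the chapter's setting the a_i would be drawn uniformly from [0, 4] before the run and then held fixed.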

Evaluation of the algorithms has been carried out along three different aspects or dimensions: time-related, solution quality-related and algorithm performance-related.

4.1 Time-related dimension

In the time-related dimension we want to know the gain in time implied by parallelism. To do this, we measure the processing time of the algorithms for different numbers of explorers. Here, we executed the algorithms for a fixed number of generations and obtained the following values:

• Total time of execution.

• CPU total time.

• Structural learning phase CPU time.

Table 5.1 Time-related experimental results for OneMax using PA1BIC.

num. of exp.    exec. t.    CPU t.    learning t.    CPU effic.    learning effic.
1 explorer      59.4        54.83     53.83          1.00          1.00
2 explorers     28.1        28.04     27.03          0.97          0.99
3 explorers     19.3        19.23     18.23          0.95          0.98
4 explorers     14.8        14.62     13.61          0.94          0.99

Table 5.2 Time-related experimental results for OneMax using PA2BIC.

num. of exp.    exec. t.    CPU t.    learning t.    CPU effic.    learning effic.
1 explorer      46.4        46.20     45.22          1.00          1.00
2 explorers     23.6        23.43     22.89          0.99          0.99
3 explorers     17.1        17.02     16.03          0.90          0.94
4 explorers     13.2        12.86     11.88          0.90          0.95

• Speed-up = sequential time / parallel time, for each of the second and third previous items.

• Efficiency = speed-up / number of processors, for each of the second and third previous items.
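The two derived measures can be computed directly from the tables; for example, the learning times of Table 5.1 for 4 explorers give a learning efficiency of about 0.99. A trivial sketch (the helper names are our own):

```python
def speed_up(sequential_time, parallel_time):
    # ratio of one-explorer time to k-explorer time
    return sequential_time / parallel_time

def efficiency(sequential_time, parallel_time, num_processors):
    # speed-up normalized by the number of processors; 1.0 is ideal
    return speed_up(sequential_time, parallel_time) / num_processors
```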

In particular, we used for both functions 20 generations, a population size of 100, and an individual dimension of 100.

Tables 5.1 and 5.2 summarize the results of the experiments with the OneMax function for PA1BIC and PA2BIC, respectively. It can be seen that there is a dramatic decrease in the execution time for both algorithms when we increase the number of explorers. This is also revealed in Tables 5.3 and 5.4, where the results for the EqualProducts function are shown.

For both functions the times obtained by PA2BIC are better than those obtained by PA1BIC, but the speed-up (as we will see later) and efficiency seem to be better with PA1BIC. Figures 5.7 and 5.8 show graphs of the speed-up reached by the algorithms for the OneMax and EqualProducts functions respectively. The continuous line represents the speed-up in CPU total time and the dashed line the speed-up in structure learning CPU time.

Figure 5.7 Speed-up produced by PA1BIC (left) and PA2BIC (right) for the OneMax problem.

Figure 5.8 Speed-up produced by PA1BIC (left) and PA2BIC (right) for the EqualProducts problem.

Table 5.3 Time-related experimental results for EqualProducts using PA1BIC.

num. of exp.    exec. t.    CPU t.    learning t.    CPU effic.    learning effic.
1 explorer      54.8        54.78     53.75          1.00          1.00
2 explorers     28.6        28.33     27.27          0.97          0.98
3 explorers     19.8        19.69     18.64          0.93          0.96
4 explorers     14.8        14.71     13.68          0.93          0.98

Table 5.4 Time-related experimental results for EqualProducts using PA2BIC.

num. of exp.    exec. t.    CPU t.    learning t.    CPU effic.    learning effic.
1 explorer      45.4        45.18     44.22          1.00          1.00
2 explorers     24.5        24.35     22.72          0.93          0.97
3 explorers     17.4        17.18     16.15          0.88          0.91
4 explorers     13.6        13.49     12.46          0.84          0.89

4.2 Solution quality-related dimension

It is clear that the general behaviour of the algorithm is the same independently of the version, SeqBIC, PA1BIC or PA2BIC, that we use. However, due to the speed-up produced by the parallelism, the solution reached when a fixed execution time is allowed can be different for the algorithms considered. In this subsection we try to measure the difference between the solutions found.

Characteristics measured in relation to the solution quality-related dimen­sion are the following:

• Best solution.

• CPU time.

Here we measure these characteristics approximately every 100 seconds of CPU time. Obviously, there are some minor differences between the CPU times of different executions because we only take measurements when a complete generation has finished. The CPU time allowed is limited to the CPU time that the best algorithm needs to reach convergence.

For this measure the population size is set to 100 and the dimension of the individuals to 300.

As can be seen in Figures 5.9 and 5.10, there is a large improvement in solution quality when the parallel algorithms are used.

4.3 Algorithm performance-related dimension

In this section we describe experimental results from measuring the performance of the parallel algorithms. Here, we are interested in how the load is balanced between the processors.

The characteristics related to the performance of the algorithms that we measure are:

• Manager CPU time.

• Explorer CPU times.

Figure 5.9 Best solution produced by PA1BIC (left) and PA2BIC (right) for the OneMax problem.

Figure 5.10 Best solution produced by PA1BIC (left) and PA2BIC (right) for the EqualProducts problem.

As can be seen in Tables 5.5 to 5.8 the load between the processors is very well balanced, and there are no significant differences between the two parallel algorithms. This justifies our job allocation between the processors.

5. Summary and conclusions

This chapter has presented a method for parallelizing some EDAs, in particular those that use Bayesian networks. The task parallelized is structure learning of the Bayesian network, which is the task mainly responsible for the computational cost of these EDAs. Two different algorithms have been proposed. Numerical comparisons on a MIMD shared memory architecture have been carried out. The results obtained show a CPU efficiency that in the worst case is about 0.86.

Table 5.5 Algorithm performance-related experimental results for OneMax using PA1BIC.

num. of exp.    MNG1 t.    EPR1 t. (1)    (2)      (3)      (4)
1 explorer      0.26       53.57
2 explorers     0.28       26.70          26.57
3 explorers     0.30       17.65          17.61    17.34
4 explorers     0.31       13.26          13.13    13.19    13.12

Table 5.6 Algorithm performance-related experimental results for OneMax using PA2BIC.

num. of exp.    MNG2 t.    EPR2 t. (1)    (2)      (3)      (4)
1 explorer      1.30       43.92
2 explorers     1.21       21.15          21.17
3 explorers     1.36       14.58          14.52    14.53
4 explorers     1.22       10.62          10.57    10.59    10.54

Table 5.7 Algorithm performance-related experimental results for EqualProducts using PA1BIC.

num. of exp.    MNG1 t.    EPR1 t. (1)    (2)      (3)      (4)
1 explorer      0.29       53.46
2 explorers     0.29       26.87          26.81
3 explorers     0.34       18.22          18.03    18.17
4 explorers     0.31       13.23          13.29    13.28    13.14

Table 5.8 Algorithm performance-related experimental results for EqualProducts using PA2BIC.

num. of exp.    MNG2 t.    EPR2 t. (1)    (2)      (3)      (4)
1 explorer      1.29       42.93
2 explorers     1.40       21.88          21.92
3 explorers     1.34       14.75          14.75    14.73
4 explorers     1.45       10.94          10.90    10.90    10.91

In view of these results we consider that it could be interesting to study other architectures which could also be used to parallelize EDAs. Clearly, most algorithms in the family of EDAs allow parallelizations similar to those given for Genetic Algorithms (Cantú-Paz, 2000).

References

Cantú-Paz, E. (2000). Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Publishers.

Chickering, D., Geiger, D., and Heckerman, D. (1995). Learning Bayesian networks: search methods and experimental results. In Proceedings of the Fifth Conference on Artificial Intelligence and Statistics, pages 112-128. Society for AI and Statistics.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization using Bayesian networks. In Second Symposium on Artificial Intelligence and Adaptive Systems, CIMAF'99, pages 332-339.

Lam, W. and Segre, A. (2001). A parallel learning algorithm for Bayesian inference networks. IEEE Transactions on Knowledge and Data Engineering. In press.

Larrañaga, P. (2001). A review on Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000a). Combinatorial optimization by learning and simulation of Bayesian networks. In Boutilier, C. and Goldszmidt, M., editors, Uncertainty in Artificial Intelligence, UAI-2000, pages 343-352. Morgan Kaufmann Publishers, San Francisco, CA.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000b). Optimization in continuous domains by learning and simulation of Gaussian

networks. In Wu, A. S., editor, Proc. of the Genetic and Evolutionary Computation Conference, GECCO-2000, Workshop Program, pages 201-204.

Mühlenbein, H. and Mahnig, T. (1999). FDA - A scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7(4):353-376.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Voigt, H., Ebeling, W., Rechenberg, I., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature, PPSN IV, Lecture Notes in Computer Science, volume 1141, pages 178-187.

Pelikan, M., Goldberg, D. E., and Cantú-Paz, E. (1999). BOA: The Bayesian Optimization Algorithm. In Banzhaf, W., Daida, J., Eiben, A., Garzon, M., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, pages 525-532. Morgan Kaufmann Publishers, San Francisco, CA.

Sangüesa, R., Cortés, U., and Gisolfi, A. (1998). A parallel algorithm for building possibilistic causal networks. International Journal of Approximate Reasoning, 18:251-270.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 7(2):461-464.

Xiang, Y. and Chu, T. (1999). Parallel learning of belief networks in large and difficult domains. Data Mining and Knowledge Discovery, 3:315-339.

Chapter 6

Mathematical Modeling of Discrete Estimation of Distribution Algorithms

C. González, J.A. Lozano, P. Larrañaga
Department of Computer Science and Artificial Intelligence

University of the Basque Country
{cristina, lozano, ccplamup}@si.ehu.es

Abstract In this chapter we discuss the theoretical aspects of Estimation of Distribution Algorithms (EDAs). We unify most of the results found in the EDA literature by introducing them into two general frameworks: Markov chains and dynamical systems. In addition, we use Markov chains to give a general convergence result for discrete EDAs. Some discrete EDAs are analyzed using this result, to obtain sufficient conditions for convergence.

Keywords: Estimation of Distribution Algorithms, Markov chains, dynamical sys­tems, convergence

1. Introduction

Estimation of Distribution Algorithms (EDAs) (Larrañaga et al., 2000a;

Larrañaga et al., 2000b; Mühlenbein and Paaß, 1996) are a promising new area of Evolutionary Computation. During recent years much effort has been dedicated to creating new EDAs and to EDA applications. This development has not been accompanied by mathematical analysis, i.e. little attention has been given to the theoretical aspects of EDAs. This lack of general mathematical analysis, together with the fact that existing results are specialized for each particular algorithm, makes a review a difficult task.

In this chapter, with the aim of offering a unified view, we introduce most of the results given in the literature related to convergence behavior into two general frameworks: Markov chains and dynamical systems. For other aspects of theoretical work related to population sizing, Pelikan et al. (2000b) can be consulted.

To overcome the lack of general results we use Markov chains to give a general convergence theorem for EDAs. The most common discrete EDAs are analyzed using this theorem, resulting in convergent and non-convergent algorithms. For those algorithms that do not converge, some conditions have been imposed on the parameters of their probability distributions to guarantee convergence.

To better understand the works that model EDAs with dynamical systems we show that all the apparently different dynamical systems given for EDAs can be obtained from the same equation.

The chapter is organized as follows: in Section 2, we model EDAs with Markov chains, and introduce a new general theorem about the limit behaviour of these algorithms. Section 3 is devoted to works using dynamical systems to study specific EDAs. Section 4 summarizes those results that fit neither the Markov chain nor the dynamical systems framework, and we give our conclusions in Section 5.

2. Using Markov chains to model EDAs

This section is devoted to modelling EDAs (Figure 6.1 shows a general pseudocode) by using Markov chains. First, we give a general theorem about the limit behaviour of EDAs and apply it to some of these algorithms. Then we show how Markov chains can be used to model PBIL, and give some results for this model.

Let us introduce some notation. The search space is represented by:

\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n    (6.1)

where \Omega_i = \{1, 2, \ldots, r_i\} for all i = 1, 2, \ldots, n, and n \in \mathbb{N} denotes the length or dimension of a vector x \in \Omega. The cardinality of the search space is |\Omega| = r_1 \cdot r_2 \cdots r_n = m. Without loss of generality we can consider the following optimization problem:

\min_{x \in \Omega} f(x)    (6.2)

where f : \Omega \to \mathbb{R} is the objective function. A population in the algorithms is a subset (in the multiset sense) of size M of elements of \Omega. Each population D_l can be represented as a vector

D_l = (d_{1l}, d_{2l}, \ldots, d_{ml})    (6.3)

where d_{il} is the number of copies of the ith individual in population D_l. Of course, \sum_{i=1}^{m} d_{il} = M. The number of different populations, v, is equal to the number of multisets of size M over m elements, i.e. the number of ways of distributing M indistinguishable balls among m boxes:


Mathematical Modeling of Discrete Estimation of Distribution Algorithms 149

v = \binom{M + m - 1}{m - 1}    (6.4)

An individual x^* such that f(x^*) \le f(x) for all x \in \Omega is a global optimum (minimum in our case) of equation (6.2), and D^* = \{D \mid \exists\, x \in D \text{ such that } f(x) = f(x^*)\} is the set of populations that contain a global optimum.
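Equation (6.4) is the standard stars-and-bars count of multisets. As a quick illustration (the values of m and M below are arbitrary), it can be checked against brute-force enumeration:

```python
import math
from itertools import combinations_with_replacement

# Tiny illustrative search space: m = 4 distinct individuals, populations of size M = 3.
m, M = 4, 3

# Equation (6.4): v = C(M + m - 1, m - 1) distinct populations (multisets).
v = math.comb(M + m - 1, m - 1)

# Brute-force check: enumerate every multiset of size M over m individuals.
assert v == sum(1 for _ in combinations_with_replacement(range(m), M))
print(v)  # -> 20
```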

EDA
    D_0 \leftarrow Generate M individuals (the initial population) randomly
    Repeat for l = 1, 2, \ldots until a stopping criterion is met
        D_{l-1}^{Se} \leftarrow Select N \le M individuals from D_{l-1} according to a selection method
        p_l(x) = p(x \mid D_{l-1}^{Se}) \leftarrow Estimate the probability distribution of an individual being among the selected individuals
        D_l \leftarrow Sample M individuals (the new population) from p_l(x)

Figure 6.1 Pseudocode for a general EDA algorithm.
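The loop of Figure 6.1 can be sketched directly in code. The sketch below is only an illustration, not any of the algorithms analyzed later: truncation selection stands in for the generic selection method, and a univariate product model stands in for the estimation step (the instances of EDAs differ precisely in that step).

```python
import random

def eda(f, n, M=50, N=25, generations=30, seed=0):
    """Minimal EDA sketch for minimizing f over the binary space {0,1}^n."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(M)]
    for _ in range(generations):
        # Select the N best individuals (truncation selection).
        selected = sorted(pop, key=f)[:N]
        # Estimate a distribution from the selected set (univariate model here).
        p = [sum(x[i] for x in selected) / N for i in range(n)]
        # Sample the next population from the estimated distribution.
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(M)]
    return min(pop, key=f)

# Minimize the number of ones: the optimum is the all-zeros string.
best = eda(lambda x: sum(x), n=10)
print(best)
```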

A general EDA can be modeled using a finite Markov chain whose state space is formed from the different populations that the algorithm can take:

\{D_1, D_2, \ldots, D_v\}    (6.5)

A Markov chain model can be used here because the population at step l only depends on the population at step l-1. Moreover, neither of the operations used to calculate the transition probabilities depends on the step parameter l, so the Markov chain is homogeneous.

2.1 General theorem for the convergence of discrete EDAs

The next theorem is a new general result about the limit behavior of discrete EDAs. We find a sufficient condition for the convergence of these algorithms. Before stating the theorem it is important to make clear what we understand by convergence in this section.



Definition 6.1 Let A be a discrete EDA. We say that A converges to a population that contains a global optimum when there exists some step l from which the algorithm always visits populations of D^*.

Therefore, if A converges to a population that contains a global optimum, this means that once a population of D^* is reached the chain will never visit a population D \notin D^*, i.e. the algorithm never loses a global optimum.

Theorem 1 Let A be an instance of EDAs such that:

p_l(x) \ge \delta > 0, \text{ for all } x \in \Omega \text{ and for all steps } l = 1, 2, \ldots    (6.6)

Then A visits populations of D^* infinitely often with probability one. If, additionally, the selection is elitist, then the EDA algorithm converges to a population that contains the global optimum.

Proof 1 Suppose that algorithm A is non-elitist. In this case we show that the Markov chain has a probability transition matrix Q = [q_{rs}]_{r,s = 1, 2, \ldots, v} whose entries are all positive.

The probability of going from population D_r to population D_s at step l of the algorithm is given by:

q_{rs} = P(D_s \mid D_r) = \sum_{D_r^{Se}} P_{sel}(D_r^{Se}) \, \frac{M!}{d_{1s}! \, d_{2s}! \cdots d_{ms}!} \prod_{i=1}^{m} \left( p_{D_r^{Se}}(x^i) \right)^{d_{is}} > 0    (6.7)

where the sum runs over the sets D_r^{Se} that can be selected from D_r, P_{sel}(D_r^{Se}) is the probability of selecting D_r^{Se} from D_r, and p_{D_r^{Se}}(x) is the estimate of the joint probability distribution obtained from D_r^{Se}. At step l of the algorithm, p_l(x) coincides with some p_{D_r^{Se}}(x); hence, by condition (6.6), every factor is positive and so is q_{rs}.

Hence the Markov chain is irreducible (all the states intercommunicate), and the chain is aperiodic. Since the chain is finite and irreducible, it is positive persistent. This implies the existence of a limit distribution:

\lim_{l \to \infty} q_{rs}^{(l)} = \pi_s    (6.8)

where q_{rs}^{(l)} is the probability of going from population D_r to population D_s in l steps, and the \pi_s are positive for all s = 1, 2, \ldots, v. Therefore the chain will visit D^* infinitely often with probability one; in fact it visits all the states infinitely often. This proves the first part of the theorem.

For the second part, if the selection is elitist, then when the global optimum is found it will never be lost, and therefore the algorithm converges to a population that contains the global optimum.



2.2 Applying the general theorem of convergence to some EDAs

Theorem 1 offers an easy way to argue that some instances of EDAs converge to the global optimum. Next we analyze examples of EDAs for which it can be quickly checked whether condition (6.6) is fulfilled:

2.2.1 UMDA. Mühlenbein (1998) proposes the Univariate Marginal Distribution Algorithm. UMDA uses the simplest model to estimate the joint probability distribution in each generation. This joint probability distribution is factorized as a product of independent univariate marginal distributions:

p_l(x) = \prod_{i=1}^{n} p_l(x_i) = \prod_{i=1}^{n} p(x_i \mid D_{l-1}^{Se})    (6.9)

These univariate marginal distributions are estimated from marginal frequencies:

\prod_{i=1}^{n} p(x_i \mid D_{l-1}^{Se}) = \prod_{i=1}^{n} \frac{\sum_{j=1}^{N} \delta_j(X_i = x_i \mid D_{l-1}^{Se})}{N}    (6.10)

where

\delta_j(X_i = x_i \mid D_{l-1}^{Se}) = \begin{cases} 1 & \text{if in the } j\text{th case of } D_{l-1}^{Se}, \; X_i = x_i \\ 0 & \text{otherwise.} \end{cases}    (6.11)

Hence, taking into account the way in which the probabilities are estimated, there can be situations where an x exists such that p_l(x) = 0.

For example, when the selected individuals at the previous step are such that \delta_j(X_i = x_i \mid D_{l-1}^{Se}) = 0 for all j = 1, \ldots, N, then for an individual x with x_i in the ith component we have p(x_i \mid D_{l-1}^{Se}) = 0 and therefore:

p_l(x) = p(x_i \mid D_{l-1}^{Se}) \prod_{\substack{k=1 \\ k \ne i}}^{n} p(x_k \mid D_{l-1}^{Se}) = 0.    (6.12)

Therefore condition (6.6) is not fulfilled and we can not ensure that UMDA visits a global optimum.

In fact, the Markov chain that models UMDA has m absorbing states. These absorbing states correspond to uniform populations. A uniform population is formed by M copies of the same individual, and can be represented by:



D_r = (0, \ldots, 0, M, 0, \ldots, 0).    (6.13)

In this case, the probability of visiting a new population D s from a uniform population Dr is:

P(D_s \mid D_r) = \begin{cases} 0 & \text{if } D_s \ne D_r \\ 1 & \text{otherwise.} \end{cases}    (6.14)

Therefore, if the chain visits one of these populations it will be trapped in it.

Clearly, UMDA non-convergence is due to the way in which the probabilities p(x_i \mid D_{l-1}^{Se}) are calculated. To overcome this problem the Laplace correction (Cestnik, 1990) can be applied. In this case the parameters p(x_i \mid D_{l-1}^{Se}) are calculated as:

p(x_i \mid D_{l-1}^{Se}) = \frac{\sum_{j=1}^{N} \delta_j(X_i = x_i \mid D_{l-1}^{Se}) + 1}{N + r_i}    (6.15)

With this change we ensure that condition (6.6) is fulfilled.
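As a concrete sketch (illustrative code, not taken from the chapter): estimating binary marginals as in equation (6.10), with and without the Laplace correction of equation (6.15) with r_i = 2, shows how the correction keeps every estimated marginal, and hence every p_l(x), strictly positive.

```python
def umda_marginals(selected, laplace=False):
    """Estimate p(x_i = 1 | D^Se) for binary variables from the selected set.

    With laplace=True, equation (6.15) is applied with r_i = 2 (binary genes),
    so no marginal can be exactly 0 or 1.
    """
    N = len(selected)
    n = len(selected[0])
    if laplace:
        return [(sum(x[i] for x in selected) + 1) / (N + 2) for i in range(n)]
    return [sum(x[i] for x in selected) / N for i in range(n)]

# Selected set in which X_2 is fixed to 0: plain frequencies give a 0 marginal,
# so any x with x_2 = 1 would receive probability p_l(x) = 0 (equation 6.12).
D_sel = [[1, 0, 0], [0, 0, 1], [1, 0, 1]]
print(umda_marginals(D_sel))                # second marginal is 0.0
print(umda_marginals(D_sel, laplace=True))  # every marginal strictly in (0, 1)
```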

2.2.2 MIMIC. Unlike UMDA, MIMIC (De Bonet et al., 1997) takes pairwise dependencies between the variables into account in order to estimate the joint probability distribution.

At each step of this algorithm, a permutation of the indexes i_1, i_2, \ldots, i_n that fulfills an entropy-related condition must be found before the probabilities can be estimated. Then the joint probability distribution is factorized as a chain:

p_l(x) = p_l(x_{i_1} \mid x_{i_2}) \cdot p_l(x_{i_2} \mid x_{i_3}) \cdots p_l(x_{i_{n-1}} \mid x_{i_n}) \cdot p_l(x_{i_n})    (6.16)

where each conditional probability is estimated from the database D_{l-1}^{Se} by using conditional relative frequencies. Hence, by the same argument used for UMDA, we can not state that MIMIC visits a global optimum. To fulfill condition (6.6) it is sufficient to make changes similar to those shown for UMDA.

2.2.3 EBNA algorithms. Etxeberria and Larrañaga (1999) and Larrañaga et al. (2000a) propose a set of algorithms in which the factorization of the joint probability distribution is encoded by a Bayesian network. The factorization can be written as:



p_l(x) = \prod_{i=1}^{n} p(x_i \mid pa_i^l)    (6.17)

where pa_i^l is the set of parents of variable X_i in the network learnt at generation l.

Different algorithms can be obtained by varying the structural search method. Two structural search methods are usually considered: score+search and detecting conditional (in)dependencies (EBNA_PC). In particular, two scores are used in the score+search approach: the BIC score (EBNA_BIC) and the K2+penalization score (EBNA_{K2+pen}).

In each case the convergence is only affected by the calculation of the parameters \theta_{ijk}, where \theta_{ijk} represents the conditional probability of variable X_i taking its kth value, given that the set of its parent variables takes its jth value. The parameters of the local probability distributions can be calculated in each generation using either:

• Their expected values, as obtained by Cooper and Herskovits (1992) for their score:

E[\theta_{ijk} \mid D_{l-1}^{Se}] = \frac{N_{ijk} + 1}{N_{ij} + r_i}    (6.18)

or

• The maximum-likelihood estimates:

\hat{\theta}_{ijk} = \frac{N_{ijk}}{N_{ij}}    (6.19)

where N_{ijk} denotes the number of cases in D_{l-1}^{Se} in which variable X_i takes its kth value and its parents, \mathbf{Pa}_i, are instantiated to their jth value, and N_{ij} represents the number of cases in which the parents of variable X_i take their jth value.

In the first case, we can conclude that when the selection is elitist, EBNAs converge to a population that contains the global optimum, because (6.18) is always a positive value. In the second case, as with UMDA and MIMIC, we can not ensure that EBNAs reach a global optimum, because the quantity (6.19) can be zero.

2.2.4 BOA. In Pelikan et al. (1999), Pelikan and Goldberg (2000a), Pelikan and Goldberg (2000b) and Pelikan et al. (2000a), the Bayesian Optimization Algorithm is proposed. BOA uses Bayesian networks to encode the joint probability distribution. The structural search is driven by the BDe score



(Heckerman et al., 1995). In this case the parameters of the local distributions are calculated following a Bayesian approach that avoids taking a zero value. Hence we can say that, when the selection is elitist, the algorithm converges to a population that contains the global optimum.

2.2.5 LFDA. LFDA (Mühlenbein and Mahnig, 1999), like the EBNAs and BOA, encodes the joint probability distribution with a Bayesian network. LFDA uses the same score as EBNA_BIC but limits the number of parents a variable can have. If the parameters of the local distributions are calculated using the maximum-likelihood estimates (6.19), the convergence of the algorithm can not be ensured. As in the previous algorithms, the use of the Laplace correction (Cestnik, 1990) provides convergence.

2.3 Modeling PBIL by means of Markov chains

In this subsection we analyze PBIL (Baluja, 1994) in binary spaces (Figure 6.2, where q_l^i is the probability of obtaining a 1 in the ith component at iteration l). This instance of EDAs does not exactly comply with the general model given in Section 2 (Figure 6.1), because the probability distribution at time l depends not only on the selected individuals but also on the probability distribution at time l-1. Therefore we can not apply Theorem 1 directly to PBIL.

PBIL
    Obtain an initial probability vector q_0 = (q_0^1, q_0^2, \ldots, q_0^n)
    Repeat for l = 0, 1, \ldots until a stopping criterion is met
        Using q_l obtain M individuals x_l^1, x_l^2, \ldots, x_l^M
        Evaluate and rank x_l^1, x_l^2, \ldots, x_l^M
        Select the N \le M best individuals x_l^{1:M}, x_l^{2:M}, \ldots, x_l^{N:M}
        Update the probability vector: q_{l+1} = (1 - \alpha) q_l + \alpha \frac{1}{N} \sum_{k=1}^{N} x_l^{k:M}

Figure 6.2 Pseudocode for PBIL.
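One PBIL iteration can be sketched as follows (an illustration with arbitrary parameter choices; PBIL's standard rule shifts the probability vector toward the mean of the N selected individuals with learning rate α):

```python
import random

def pbil_step(q, f, M=20, N=2, alpha=0.1, rng=random.Random(0)):
    """One PBIL iteration: sample M individuals from q, keep the N best
    for the minimization of f, and shift q toward their mean."""
    pop = [[1 if rng.random() < qi else 0 for qi in q] for _ in range(M)]
    best = sorted(pop, key=f)[:N]
    mean = [sum(x[i] for x in best) / N for i in range(len(q))]
    return [(1 - alpha) * qi + alpha * mi for qi, mi in zip(q, mean)]

q = [0.5] * 8
for _ in range(100):
    q = pbil_step(q, f=lambda x: sum(x))  # minimize the number of ones
print([round(qi, 2) for qi in q])  # components drift toward 0
```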

However, PBIL, like the other EDAs, can also be modeled using a Markov chain, because the probability vector q_l depends only on q_{l-1}, though not in the same way. González et al. (2001a) model PBIL using a Markov chain whose



state space is formed from the different values that the probability vector q_l can take.

In this work the authors apply the simplest version of the PBIL algorithm (draw two individuals and select the better of them) to the minimization of the well-known OneMax function in two dimensions. They find that the convergence results depend on the initial probability vector q_0 and on the value of the parameter \alpha. They show that the algorithm can converge to any point of the search space with probability as close to one as desired, whenever q_0 and \alpha tend to suitable values. Given a point x in the search space we have that:

P(\lim_{l \to \infty} q_l = x) \to 1, \text{ when } \alpha \to 1 \text{ and } q_0 \to x.    (6.20)

Thus, it can not be ensured that PBIL converges to the global optimum.

3. Dynamical systems in the modeling of some EDAs

This section summarizes work on dynamical systems and EDAs. Two works (Mahnig and Mühlenbein, 2000; González et al., 2001b) model UMDA and PBIL, respectively, with dynamical systems.

UMDA
    Obtain an initial vector q_0 = (q_0^1, q_0^2, \ldots, q_0^n)
    Repeat for l = 1, 2, \ldots until a stopping criterion is met
        Using q_{l-1} draw M individuals to obtain D_{l-1}
        Select N individuals from D_{l-1} according to proportional selection
        Set q_l to the univariate marginal frequencies of the selected individuals

Figure 6.3 Pseudocode for UMDA in binary spaces.

Mahnig and Mühlenbein (2000) develop their dynamical system from "linkage analysis". Below we see that their dynamical system can also be obtained following the ideas developed in González et al. (2001b). These ideas were previously used by Vose (1999) for the simple Genetic Algorithm.

Alternatively, as we will show in Section 3.2, Berny (2000a, 2000b) proceeds in the opposite direction, obtaining PBIL from a dynamical system.



3.1 A dynamical system for UMDA and PBIL

If we want to model UMDA (Figure 6.3, where q_l^i is the probability of obtaining a 1 in the ith component at the lth generation) and PBIL (Figure 6.2) by using dynamical systems, then the key problem is to associate a discrete dynamical system with both algorithms, such that the trajectories followed by the probability vectors \{q_l\}_{l = 0, 1, 2, \ldots} are related to the iterations of that discrete dynamical system.

UMDA and PBIL can be considered as a sequence of probability vectors, each given by a stochastic transition rule \tau:

\tau : [0, 1]^n \to [0, 1]^n    (6.21)

i.e. q_l = \tau(q_{l-1}) = \tau^l(q_0). We are interested in the trajectories followed by the iterations of \tau, and in particular in its limit behavior:

\lim_{l \to \infty} \tau^l(q_0)    (6.22)

A new operator \mathcal{G} is defined:

\mathcal{G} : [0, 1]^n \to [0, 1]^n    (6.23)

such that \mathcal{G}(q) = E[\tau(q)] = (E[\tau_1(q)], E[\tau_2(q)], \ldots, E[\tau_n(q)]). The operator \mathcal{G} is a deterministic function that gives the expected value of the random operator \tau. The iterations of \mathcal{G} are defined as \mathcal{G}^l(q) = \mathcal{G}(\mathcal{G}^{l-1}(q)), with \mathcal{G}_i(q) = E[\tau_i(q)] for all i = 1, 2, \ldots, n. The operator \mathcal{G} can be thought of as a discrete dynamical system:

q, \mathcal{G}(q), \mathcal{G}^2(q), \ldots, \mathcal{G}^l(q), \ldots    (6.24)

A similar operator was used by Vose (1999) for the simple Genetic Algorithm. In what follows, we give the dynamical systems for the two algorithms separately.

3.1.1 UMDA. We assume that the search space is binary, \Omega = \{0, 1\}^n, with cardinality |\Omega| = 2^n = m. Each population will be represented as:

D = (z_1, z_2, \ldots, z_m)    (6.25)



where Zi is the proportion of the ith individual in D (we use proportions instead of numbers of individuals because we will use infinite populations).

Calculation of each component of E[\tau(q)] for finite populations can be done using the formula:

E[\tau_i(q)] = \sum_{x \in \Omega} \sum_{D \in \mathcal{D}_x} x_i \, P(\text{obtain population } D \mid q) \, P(\text{select } x \mid D)    (6.26)

with \mathcal{D}_x = \{D \mid x \in D\}. However, if we work with infinite populations, then after drawing an infinite number of individuals from the probability vector q, a single population D_q is obtained. This population can be expressed as:

D_q = (p_q(x^1), p_q(x^2), \ldots, p_q(x^m))    (6.27)

where p_q(x) = \prod_{j=1}^{n} (q_j)^{x_j} (1 - q_j)^{1 - x_j}. Hence the ith component of E[\tau(q)] can be written for infinite populations as:

E[\tau_i(q)] = \sum_{x \in \Omega} x_i \, P(\text{select } x \mid D_q)    (6.28)

If we take into account the fact that we are using proportionate selection:

P(\text{select } x \mid D_q) = \frac{f(x) \, p_q(x)}{E_q[f]}    (6.29)

where E_q[f] denotes the expectation of f with respect to the probability distribution implied by the probability vector q. If we expand p_q(x), the dynamical system can be written as:

\mathcal{G}_i(q) = E[\tau_i(q)] = \frac{\sum_{x \in \Omega} x_i \, f(x) \, p_q(x)}{E_q[f]}    (6.30)

The expression obtained by Mahnig and Mühlenbein (2000) is the same as (6.30). These authors also give another equivalent expression:

\mathcal{G}_i(q) = q_i + q_i (1 - q_i) \, \frac{\partial E_q[f] / \partial q_i}{E_q[f]}    (6.31)



Using this last expression they state that UMDA transforms the discrete optimization problem into a continuous one, and also that the continuous optimization problem is solved by gradient search.
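For small n, the infinite-population UMDA system can be iterated exactly by enumerating \Omega. The sketch below is an illustration (not code from the chapter); it assumes f is a fitness maximized by proportionate selection and uses OneMax, for which the marginals are driven towards 1:

```python
from itertools import product

def G(q, f):
    """One step of the infinite-population UMDA system, equation (6.30)."""
    n = len(q)
    num = [0.0] * n
    mean_f = 0.0
    for x in product([0, 1], repeat=n):
        p = 1.0
        for qi, xi in zip(q, x):
            p *= qi if xi else (1 - qi)   # p_q(x), product of marginals
        mean_f += f(x) * p                # accumulate E_q[f]
        for i in range(n):
            num[i] += x[i] * f(x) * p     # accumulate the numerator of (6.30)
    return [ni / mean_f for ni in num]

onemax = sum  # fitness: number of ones
q = [0.5, 0.3, 0.7]
for _ in range(50):
    q = G(q, onemax)
print([round(qi, 3) for qi in q])  # each component approaches 1
```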

3.1.2 PBIL. González et al. (2001b) expressed PBIL (for N = 1) as a dynamical system of the form (6.24). They noted that, for PBIL, the function f can be seen as an ordering of the elements of the search space \Omega: the behavior of the algorithm is the same for two functions f_1 and f_2 if:

\forall x, x' \in \Omega \quad f_1(x) > f_1(x') \Leftrightarrow f_2(x) > f_2(x').    (6.32)

Only the ranking imposed by f on \Omega affects PBIL's behavior, and not the particular value that the function f takes at a point x. Thus, given an ordering of the elements of \Omega such that x^m is the best individual, the dynamical system can be written (see González et al. (2001b) for details) as:

\mathcal{G}(q) = (1 - \alpha) q + \alpha \sum_{i=1}^{m} x^i \, p_q(x^i) \sum_{k=1}^{M} \left( \sum_{j=1}^{i-1} p_q(x^j) \right)^{k-1} \left( \sum_{j=1}^{i} p_q(x^j) \right)^{M-k}    (6.33)

These authors studied the relationship between the iterations of the dynamical system and the trajectories followed by \tau. Their conclusion was that when the algorithm's parameter \alpha is near 0, the stochastic operator \tau follows, with high probability and for a long time, the iterations of the deterministic operator \mathcal{G}. This fact allowed them to study the discrete dynamical system instead of the iterations of PBIL. They performed a stability analysis of the dynamical system, discovering that all the points of the search space are fixed points of the system; moreover, the local optima are stable fixed points while the other fixed points in the search space are unstable. This result has various consequences, the most important of which is that PBIL converges to the global optimum on unimodal functions.
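The double sum in equation (6.33) is the probability that a given individual is the best of M independent draws from p_q, written in telescoping form. A small numerical check (arbitrary illustrative values; individuals assumed sorted from worst, x^1, to best, x^m):

```python
def prob_best_of_M(p, i, M):
    """Probability that individual i (0-based, list sorted worst-to-best)
    is the best among M independent draws from distribution p, computed
    with the inner sum of equation (6.33)."""
    below = sum(p[:i])      # mass of strictly worse individuals
    upto = below + p[i]     # mass of individuals no better than i
    return p[i] * sum(below ** (k - 1) * upto ** (M - k) for k in range(1, M + 1))

p = [0.1, 0.2, 0.3, 0.4]    # p_q over 4 individuals, sorted worst-to-best
M = 5
probs = [prob_best_of_M(p, i, M) for i in range(len(p))]
# Telescoping identity: each term equals upto^M - below^M, so they sum to 1.
assert abs(sum(probs) - 1.0) < 1e-12
print([round(x, 4) for x in probs])
```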

3.2 Obtaining Reinforcement Learning and PBIL algorithms from gradient dynamical systems

Berny (2000b) shows that Reinforcement Learning and PBIL algorithms can be derived from gradient dynamical systems acting on the probability vectors q as defined previously for UMDA and PBIL.

To do that, the author shows the equivalence of searching for an optimal string of function f and searching for a probability distribution pq over n that



maximizes the expectation of the function:

J_1(p_q) = E_{p_q}[f] = \sum_{x \in \Omega} p_q(x) \, f(x)    (6.34)

or the log-expectation of the exponential of the function:

J_2(p_q) = \ln \sum_{x \in \Omega} p_q(x) \, e^{f(x)}    (6.35)

If we try to optimize J_1(p_q) and J_2(p_q) by means of a gradient search, taking into account that the probability distribution p_q depends on the probability vector q, then two gradient dynamical systems can be obtained. The first, for J_1(p_q), can be written as:

q' = \varphi_1(q)    (6.36)

with components

\varphi_{1,i}(q) = q_i (1 - q_i) \, \frac{\partial J_1(p_q)}{\partial q_i}    (6.37)

and the second, for J_2(p_q), as:

q' = \varphi_2(q)    (6.38)

with components

\varphi_{2,i}(q) = q_i (1 - q_i) \, \frac{\partial J_2(p_q)}{\partial q_i}    (6.39)

From the first dynamical system Reinforcement Learning can be obtained by using stochastic approximation with a comparison scheme. PBIL is obtained from the second dynamical system with a Lagrange technique and stochastic approximations.

The author carried out a stability analysis of vertices and states and concluded that Reinforcement Learning and PBIL perform as well as hill climbing, since they can only converge to locally optimal solutions.

Similar developments were made by Berny (2000a) for real function optimization, but the author did not give any stability results.

4. Other approaches to modeling EDAs

This section covers those theoretical works that do not fit into the Markov chains or dynamical systems frameworks. We deal with those approaches that offer results about convergence or about the limit behaviour of particular instances of EDAs.

4.1 Limit behaviour of PBIL

Höhfeld and Rudolph (1997) present an analysis of the convergence behaviour of the PBIL algorithm when the search space is \Omega = \{0, 1\}^n. They prove that a simplified version of PBIL's update rule (only the best of the M trial



vectors is involved in updating the vector of probabilities):

q_l = (1 - \alpha) \, q_{l-1} + \alpha \, x_{l-1}^{1:M}    (6.40)

where x_{l-1}^{1:M} is the best of the M individuals drawn at step l-1, ensures convergence with probability one to the global optimum in the case of pseudoboolean linear functions.

The aim of these authors is to show that the stochastic sequence \{q_l\}_{l \ge 0} converges in mean (and therefore in probability) to the global optimum of the search space. In order to do that they require that, for a linear pseudoboolean function:

\lim_{l \to \infty} E[q_l] = x^*    (6.41)

where E[q_l] is the expectation of the probability vector at step l, and x^* is the optimum point of \Omega.

Thus, by studying the (deterministic) sequence \{E[q_l]\}_{l \ge 0}, the points of \Omega to which PBIL's stochastic process \{q_l\}_{l \ge 0} will eventually converge are identified, and global convergence in mean is obtained for PBIL with linear pseudoboolean functions.

4.2 Convergence for BEDA with infinite populations

Mühlenbein et al. (1999) introduce an EDA which uses Boltzmann selection: the Boltzmann Estimation of Distribution Algorithm (BEDA). In this work they show the convergence of a general BEDA for infinite populations.

Boltzmann selection has an interesting property. When the points have been generated according to a Boltzmann distribution (u > 0):

p_0(x) = \frac{u^{f(x)}}{\sum_{y \in \Omega} u^{f(y)}}    (6.42)

and Boltzmann selection is used with base v > 1:

p_{l,sel}(x) = \frac{p_l(x) \, v^{f(x)}}{\sum_{y \in \Omega} p_l(y) \, v^{f(y)}}    (6.43)

then after selection the selected points are also distributed according to a Boltzmann distribution:

p_{0,sel}(x) = \frac{(u \cdot v)^{f(x)}}{\sum_{y \in \Omega} (u \cdot v)^{f(y)}}    (6.44)



This fact allows us to write the probability distribution at step l of a BEDA as:

p_l(x) = \frac{(u \cdot v^l)^{f(x)}}{\sum_{y \in \Omega} (u \cdot v^l)^{f(y)}}    (6.45)
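The closure property behind equation (6.45) is easy to verify numerically. The sketch below (arbitrary illustrative f and bases) applies one Boltzmann selection step with base v to a Boltzmann distribution with base u and compares the result with the Boltzmann distribution with base u·v:

```python
def boltzmann(f_vals, base):
    """Boltzmann distribution base**f(x) / sum_y base**f(y), as in (6.42)."""
    w = [base ** f for f in f_vals]
    s = sum(w)
    return [wi / s for wi in w]

def boltzmann_select(p, f_vals, v):
    """Boltzmann selection with base v applied to p, as in (6.43)."""
    w = [pi * v ** f for pi, f in zip(p, f_vals)]
    s = sum(w)
    return [wi / s for wi in w]

f_vals = [0, 1, 2, 3]   # f over a 4-point search space (illustrative)
u, v = 0.8, 1.5
after_sel = boltzmann_select(boltzmann(f_vals, u), f_vals, v)
direct = boltzmann(f_vals, u * v)  # Boltzmann with base u*v, as in (6.44)
assert all(abs(a - b) < 1e-12 for a, b in zip(after_sel, direct))
print([round(x, 4) for x in after_sel])
```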

Using the previous arguments the authors prove the following theorem:

Theorem 2 Let f(x^*) = \min_{x \in \Omega} f(x). The minimum need not be unique. Let the distribution p_l(x) be given by equation (6.45). Let v > 1. Then

f(x) > f(x^*) \Rightarrow \lim_{l \to \infty} p_l(x) = 0.    (6.46)

If the minimum is unique, then

\lim_{l \to \infty} p_l(x^*) = 1.    (6.47)

5. Conclusions

This chapter had two goals. One was to organize the results obtained so far on the convergence of discrete EDAs. We have classified most of these works into two frameworks: Markov chains and dynamical systems. Our second goal was to give a general theorem about the limit behaviour of these algorithms. This theorem lets us analyze particular instances of EDAs.

The theoretical side of EDAs is still a little-explored area, and much work remains to be done. As we have seen in this chapter, discrete dynamical systems are a suitable tool for modelling discrete EDAs. Hence, in future research we will try to model other instances of discrete EDAs by means of discrete dynamical systems. We will also try to adapt the results obtained in discrete domains to continuous domains.

References

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University.

Berny, A. (2000a). An adaptive scheme for real function optimization acting as a selection operator. In Yao, X., editor, First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks.



Berny, A. (2000b). Selection and reinforcement learning for combinatorial optimization. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Lecture Notes in Computer Science 1917: Parallel Problem Solving from Nature - PPSN VI, pages 601-610.

Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning. Proceedings of the European Conference on Artificial Intelligence, pages 147-149.

Cooper, G. F. and Herskovits, E. A. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.

De Bonet, J. S., Isbell, C. L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, Vol. 9.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence, CIMAF99, Special Session on Distributions and Evolutionary Optimization, pages 332-339.

González, C., Lozano, J. A., and Larrañaga, P. (2001a). The convergence behavior of the PBIL algorithm: a preliminary approach. In Kurkova, V., Steel, N. C., Neruda, R., and Karny, M., editors, International Conference on Artificial Neural Networks and Genetic Algorithms, ICANNGA-2001, pages 228-231. Springer.

González, C., Lozano, J. A., and Larrañaga, P. (2001b). Analyzing the PBIL algorithm by means of discrete dynamical systems. Complex Systems. In press.

Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

Höhfeld, M. and Rudolph, G. (1997). Towards a theory of population-based incremental learning. In Proceedings of the 4th International Conference on Evolutionary Computation, pages 1-5. IEEE Press.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000a). Combinatorial optimization by learning and simulation of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 343-352. Morgan Kaufmann.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000b). Optimization in continuous domains by learning and simulation of Gaussian networks. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 201-204.

Mahnig, T. and Mühlenbein, H. (2000). Mathematical analysis of optimization methods using search distributions. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 205-208.



Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Mühlenbein, H. and Mahnig, T. (1999). FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7(4):353-376.

Mühlenbein, H., Mahnig, T., and Ochoa, A. (1999). Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics, 5:215-247.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Lecture Notes in Computer Science 1141: Parallel Problem Solving from Nature - PPSN IV, pages 178-187.

Pelikan, M. and Goldberg, D. E. (2000a). Hierarchical problem solving and the Bayesian optimization algorithm. In Whitley, D., Goldberg, D., Cantú-Paz, E., Spector, L., Parmee, I., and Beyer, H.-G., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 267-274. Morgan Kaufmann.

Pelikan, M. and Goldberg, D. E. (2000b). Research on the Bayesian optimization algorithm. In Wu, A. S., editor, Proceedings of the 2000 Genetic and Evolutionary Computation Conference Workshop Program, pages 212-215.

Pelikan, M., Goldberg, D. E., and Cantú-Paz, E. (1999). BOA: The Bayesian optimization algorithm. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, volume 1, pages 525-532. Morgan Kaufmann.

Pelikan, M., Goldberg, D. E., and Sastry, K. (2000a). Bayesian optimization algorithm, decision graphs, and Occam's razor. Technical Report IlliGAL Report 2000020, University of Illinois at Urbana-Champaign.

Pelikan, M., Goldberg, D. E., and Cantú-Paz, E. (2000b). Bayesian optimization algorithm, population sizing, and time to convergence. In Whitley, D., Goldberg, D., Cantú-Paz, E., Spector, L., Parmee, I., and Beyer, H.-G., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 275-282. Morgan Kaufmann.

Vose, M. D. (1999). The simple genetic algorithm: Foundations and theory. MIT Press.


II

OPTIMIZATION


Chapter 7

An Empirical Comparison of Discrete Estimation of Distribution Algorithms

R. Blanco and J. A. Lozano
Department of Computer Science and Artificial Intelligence
University of the Basque Country
{ccbblgor. lozano}@si.ehu.es

Abstract: In this paper we present an empirical comparison between different implementations of Estimation of Distribution Algorithms in discrete domains. The empirical comparison is carried out with respect to three different criteria: convergence velocity, convergence reliability and scalability. Different function sets are optimized depending on the aspect to be evaluated.

Keywords: Estimation of Distribution Algorithms, discrete domains, convergence velocity, convergence reliability, scalability

1. Introduction

Estimation of Distribution Algorithms (EDAs) (Mühlenbein and Paaß, 1996; Larrañaga et al., 2000a; Larrañaga et al., 2000b) are a new approach to solving optimization problems. EDAs are non-deterministic search algorithms based on a population of individuals, like Genetic Algorithms (GAs) (Goldberg, 1989). Whereas GAs use crossover and mutation operators, in EDAs these have been replaced by the learning and sampling of a probability distribution. This distribution is estimated from a database which contains selected individuals from the previous population.

In discrete domains the individuals of the population are composed of genes which can take a value in the range {0, 1, ..., k}, but in this chapter the search space is restricted to individuals whose genes can only take a value in {0, 1}. That is, an individual of the population is a binary string.

In this chapter we try to obtain some conclusions about the algorithms by means of an experimental evaluation in binary search spaces.



168 Estimation of Distribution Algorithms

The chapter is structured as follows. In Section 2 the instances of EDAs to be compared are presented and the criteria for the comparison are established. Section 3 presents the set of functions used in the comparison. Section 4 shows the experimental results, and conclusions are drawn in Section 5.

2. Experimental framework

Seven instances of EDAs of different complexity have been chosen to carry out the experimental comparison. In increasing order of the complexity of the probabilistic graphical model they learn, these are: Univariate Marginal Distribution Algorithm (UMDA) (Mühlenbein, 1998), Bit-Based Simulated Crossover (BSC) (Syswerda, 1993), Population-Based Incremental Learning (PBIL) (Baluja, 1994), Mutual Information Maximization of Input Clustering (MIMIC) (De Bonet et al., 1997) and Estimation of Bayesian Networks Algorithm (EBNA) (Etxeberria and Larrañaga, 1999). Three versions of the latter algorithm are considered: EBNA_BIC, EBNA_PC and EBNA_K2+pen (Larrañaga et al., 2000a).

To evaluate these different instances of EDAs, three criteria are used: convergence velocity, convergence reliability and scalability.

The convergence velocity measures the speed at which an algorithm approaches the global optimum. To assess it, a measure independent of the starting values is needed. We adopt the measure proposed in Schwefel (1988) and define the progress measure of a single run as:

P(t) = ln sqrt( f_max(1) / f_max(t) )

where f_max(i) is the best objective function value at the ith generation. To obtain reliable results, 100 independent runs are carried out for each instance of EDAs, with a population size of 10n (n denotes the individual dimension). Three stopping conditions are introduced. First, the algorithm is stopped when a fixed number of evaluations, 10^5, is reached. Second, the algorithm is stopped when all the individuals of the population are identical. Finally, the algorithm is stopped when the global optimum is found, in case it is known. The algorithm stops as soon as any one of these conditions is met. For each run, the best objective function value of each generation is recorded.
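The original formula is incomplete in our copy; the sketch below assumes the Bäck/Schwefel-style form P(t) = ln sqrt(f_max(1)/f_max(t)), which is negative and decreasing on a maximization run, as in the curves of Figures 7.1 and 7.2.

```python
import math

def progress_curve(best_per_generation):
    """Progress measure P(t) = ln sqrt(f_max(1) / f_max(t)) of a single
    run, where best_per_generation[t] is the best objective value at
    generation t (assumed positive)."""
    f1 = best_per_generation[0]
    return [math.log(math.sqrt(f1 / ft)) for ft in best_per_generation]

# Best-of-generation values of a hypothetical maximization run.
curve = progress_curve([65.0, 80.0, 95.0, 100.0])
```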

The second criterion used to compare the algorithms is convergence reliability. The aim of this criterion is to show whether the algorithms are able to find the global optimum. To do so, each algorithm is applied to each problem with different population sizes. We start with a population of 16 individuals and double this value until the global optimum is found in 20 consecutive and independent runs. This process is repeated 100 independent times for each algorithm. The maximum population size tried is 16384. An algorithm that does not find the global optimum in 20 consecutive runs with this last population size is assumed to be unable to guarantee finding the global optimum.
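The population-doubling protocol can be expressed as a small driver; `run_algorithm` is a hypothetical callable (our own name) that returns True when a run reaches the global optimum.

```python
def minimal_reliable_popsize(run_algorithm, start=16, max_size=16384, runs=20):
    """Convergence-reliability protocol sketch: double the population size
    until the optimum is found in `runs` consecutive independent runs;
    return None if even `max_size` fails."""
    size = start
    while size <= max_size:
        if all(run_algorithm(size, seed) for seed in range(runs)):
            return size
        size *= 2
    return None

# Toy stand-in: an "algorithm" that is reliable once the population is >= 128.
found = minimal_reliable_popsize(lambda pop, seed: pop >= 128)
```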

The last criterion is scalability. With this criterion we want to see how the behaviour of the EDAs changes as the problem dimension increases. The problem dimension starts at 50 and is increased in steps of 50 up to a maximum of 300. For each dimension, 10 consecutive and independent runs are executed with a population size of 10n, where n is the dimension. In this case only two stopping conditions are used: first, when all the individuals of the population are identical, and second, when the global optimum is found, in case it is known.

In all the experiments and for all the instances of EDAs, the population is created following an elitist approach, the selection method is truncation selection, and the number of selected individuals is set to half of the population size.

3. Sets of test functions

The choice of appropriate functions to assess the strengths and weaknesses of the different instances of EDAs depends strongly on the goals envisaged. For our goals, it is important to develop a set of functions that covers several features:

• including unimodal functions for comparison of convergence velocity

• including deceptive functions for comparison of convergence reliability

• including functions that are scalable with respect to their dimension n.

The objective functions are divided into two groups for the experimental comparison. The functions in the first group provide good mechanisms for testing convergence velocity; they are not very difficult to optimize. The second group contains the so-called deceptive functions, which are adequate for testing the convergence reliability of EDAs. The functions used for the scalability test belong to the group used for assessing convergence velocity.

3.1 Functions for the convergence velocity and scalability

3.1.1 OneMax problem. This is a well-known simple linear problem. It can be written as:

F_OneMax(x) = Σ_{i=1}^{n} x_i


The objective is to maximize F_OneMax with x_i ∈ {0, 1}. The global optimum is located at the point (1, 1, ..., 1).
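For concreteness, a one-line sketch of F_OneMax (names are ours):

```python
def one_max(x):
    """F_OneMax(x): the number of 1-bits in the individual."""
    return sum(x)

# The global optimum (1, 1, ..., 1) attains the maximum value n.
n = 10
optimum_value = one_max([1] * n)
```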

3.1.2 Plateau problem. This problem was proposed in Mühlenbein and Schlierkamp-Voosen (1993). In our case, each individual is an n-dimensional vector with n = m × 3 (the genes are divided into groups of three). We first define an auxiliary function g as:

g(x_1, x_2, x_3) = 1 if x_1 = 1 and x_2 = 1 and x_3 = 1, and 0 otherwise.

Now, we can define the Plateau function as:

F_Plateau(x) = Σ_{i=1}^{m} g(s_i)

where s_i = (x_{3i-2}, x_{3i-1}, x_{3i}). As with the previous function, the goal is to maximize F_Plateau, and the global optimum is located at the point (1, 1, ..., 1).
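A minimal sketch of the Plateau function as defined above (names are ours):

```python
def plateau(x):
    """F_Plateau: the genes are grouped into consecutive triples s_i and
    each triple contributes 1 only when all three of its bits are 1."""
    assert len(x) % 3 == 0, "n must equal 3m"
    return sum(1 for i in range(0, len(x), 3)
               if x[i] == x[i + 1] == x[i + 2] == 1)
```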

3.1.3 CheckerBoard problem. Baluja and Davies (1997) proposed this function. In this problem, an 8 × 8 grid is given. Each point of the grid can take the value 0 or 1. The goal of the problem is to create a checkerboard pattern of 0's and 1's on the grid: each point with a value of 1 should be surrounded in all four basic directions by points with a value of 0, and vice versa. The evaluation counts the number of correct surrounding bits; the corners are not included in the evaluation. The maximum value is 4(8 − 2)^2, and the problem dimension is n = 8^2. If we consider the grid as a matrix x = [x_ij], i, j = 1, ..., 8, and let δ(a, b) denote the Kronecker delta function, the checkerboard function can be written as:

F_CheckerBoard(x) = 4(8 − 2)^2 − Σ_{i=2}^{7} Σ_{j=2}^{7} { δ(x_ij, x_{i-1,j}) + δ(x_ij, x_{i+1,j}) + δ(x_ij, x_{i,j-1}) + δ(x_ij, x_{i,j+1}) }
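A sketch of the checkerboard evaluation on the flattened 64-bit individual (our own rendering):

```python
def checkerboard(bits, s=8):
    """F_CheckerBoard on an s x s grid (n = s*s): start from the maximum
    4*(s-2)**2 and subtract one point for every interior cell that agrees
    with one of its four neighbours (a Kronecker delta per neighbour)."""
    x = [bits[i * s:(i + 1) * s] for i in range(s)]
    delta = lambda a, b: 1 if a == b else 0
    penalty = sum(delta(x[i][j], x[i - 1][j]) + delta(x[i][j], x[i + 1][j]) +
                  delta(x[i][j], x[i][j - 1]) + delta(x[i][j], x[i][j + 1])
                  for i in range(1, s - 1) for j in range(1, s - 1))
    return 4 * (s - 2) ** 2 - penalty

# A perfect checkerboard pattern reaches the maximum value 4*(8-2)**2 = 144.
perfect = [(i + j) % 2 for i in range(8) for j in range(8)]
```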

3.1.4 EqualProducts problem. Like F_CheckerBoard, this function was presented by Baluja and Davies (1997). Given a set of n random real numbers {a_1, a_2, ..., a_n} from an interval [0, k], a subset of them is selected. The aim of the problem is to minimize the difference between the products of the selected and unselected numbers. We can write it as:

F_EqualProducts(x) = | Π_{i=1}^{n} h(x_i, a_i) − Π_{i=1}^{n} h(1 − x_i, a_i) |

where the function h is defined as:

h(x, a) = 1 if x = 0, and a if x = 1.

The optimum value is unknown because the set of real numbers is random; however, the nearer the value is to zero, the better.
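A sketch of our reading of this objective (the absolute difference of the two products, with h as defined above; names are ours):

```python
import random

def equal_products(x, a):
    """F_EqualProducts: |product of selected numbers (x_i = 1) minus
    product of unselected numbers (x_i = 0)|, i.e. h(0, a) = 1, h(1, a) = a."""
    selected, unselected = 1.0, 1.0
    for bit, ai in zip(x, a):
        if bit:
            selected *= ai
        else:
            unselected *= ai
    return abs(selected - unselected)

# A random instance, as in the problem statement.
rng = random.Random(0)
numbers = [rng.uniform(0.0, 4.0) for _ in range(8)]
score = equal_products([1, 0, 1, 0, 1, 0, 1, 0], numbers)
```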

3.2 Functions for the convergence reliability

The following functions are used to measure convergence reliability. They have been chosen because finding their optimum is a hard task.

3.2.1 SixPeaks. This problem (Baluja and Davies, 1997) can be defined mathematically as:

F_SixPeaks(x, t) = max{ tail(0, x), head(1, x), tail(1, x), head(0, x) } + R(x, t)

where

tail(b, x) = number of trailing b's in x
head(b, x) = number of leading b's in x

R(x, t) = n if (tail(0, x) > t and head(1, x) > t) or (tail(1, x) > t and head(0, x) > t), and 0 otherwise.

The goal is to maximize the function. This function has 4 global optima, located at the points:

(0, 0, ..., 0, 1, 1, ..., 1)  with t+1 leading 0's
(0, 0, ..., 0, 1, 1, ..., 1)  with t+1 trailing 1's
(1, 1, ..., 1, 0, 0, ..., 0)  with t+1 leading 1's
(1, 1, ..., 1, 0, 0, ..., 0)  with t+1 trailing 0's

These points are very difficult to reach because they are isolated. On the other hand, two local optima, (0, 0, ..., 0) and (1, 1, ..., 1), are very easily reachable. The value of t was set to n/2 − 1.
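A sketch of F_SixPeaks, assuming the standard reward of n when both run-length conditions hold (names are ours):

```python
def six_peaks(x, t):
    """F_SixPeaks: base score is the longest leading or trailing run of a
    single symbol; the reward n is added when a tail of one symbol and a
    head of the other both exceed t."""
    n = len(x)
    def head(b):
        count = 0
        for v in x:
            if v != b:
                break
            count += 1
        return count
    def tail(b):
        count = 0
        for v in reversed(x):
            if v != b:
                break
            count += 1
        return count
    base = max(tail(0), head(1), tail(1), head(0))
    reward = n if ((tail(0) > t and head(1) > t) or
                   (tail(1) > t and head(0) > t)) else 0
    return base + reward
```

With n = 10 and t = n/2 − 1 = 4, the isolated optimum 0000011111 scores 15, while the easily reachable local optima all-ones and all-zeros score only 10.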

3.2.2 Deceptive functions. All the following functions were proposed in Mühlenbein et al. (1999). This set is composed of deceptive decomposable functions with adjacent neighborhoods.

We first define some auxiliary functions:


• Function F^5_Mühl:

F^5_Mühl(x) = 4.0 for x = (0,0,0,0,0)
              3.0 for x = (0,0,0,0,1)
              2.0 for x = (0,0,0,1,1)
              1.0 for x = (0,0,1,1,1)
              3.5 for x = (1,1,1,1,1)
              0.0 otherwise.

• Function F^5_multimodal:

g(x) = 1 if F_OneMax(x) is odd, and 0 otherwise.

F^5_multimodal(x) = F_OneMax(x) + 2 g(x)

• Function F^3_cuban1:

F^3_cuban1(x) = 0.595 for x = (0,0,0)
                0.200 for x = (0,0,1)
                0.595 for x = (0,1,0)
                0.100 for x = (0,1,1)
                1.000 for x = (1,0,0)
                0.050 for x = (1,0,1)
                0.090 for x = (1,1,0)
                0.150 for x = (1,1,1)

• Function F^5_cuban1:

F^5_cuban1(x) = 4 F^3_cuban1(x_1, x_2, x_3) if x_2 = x_4 and x_3 = x_5, and 0 otherwise.

• Function F^5_cuban2:

F^5_cuban2(x) = F_OneMax(x)      if x_5 = 0
                0                if x_1 = 0 and x_5 = 1
                F_OneMax(x) − 2  if x_1 = 1 and x_5 = 1

The deceptive functions used in the experiments are:

• Function FC2:

FC2(x) = Σ_{j=1}^{m} F^5_Mühl(s_j)

where s_j = (x_{5j-4}, x_{5j-3}, x_{5j-2}, x_{5j-1}, x_{5j}) and n = 5m.

• Function FC3:

FC3(x) = Σ_{j=1}^{m} F^5_multimodal(s_j)

where s_j = (x_{5j-4}, x_{5j-3}, x_{5j-2}, x_{5j-1}, x_{5j}) and n = 5m.


• Function FC4:

FC4(x) = Σ_{j=1}^{m} F^5_cuban1(s_j)

where s_j = (x_{5j-4}, x_{5j-3}, x_{5j-2}, x_{5j-1}, x_{5j}) and n = 5m.

• Function FC5:

FC5(x) = F^5_cuban1(s_1) + Σ_{j=1}^{m} ( F^5_cuban2(s_{2j}) + F^5_cuban1(s_{2j+1}) )

where s_j = (x_{4j-3}, x_{4j-2}, x_{4j-1}, x_{4j}, x_{4j+1}) (consecutive substrings of length 5 that overlap in one variable) and n = 4(2m + 1) + 1.
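As an executable illustration of these additively decomposable deceptive functions, the 5-bit subfunction with the values tabulated above and the function FC2 can be sketched as follows (names are ours):

```python
# Non-zero values of the 5-bit deceptive subfunction: the global optimum
# 00000 (4.0) is surrounded by low values, while 11111 (3.5) attracts.
SUBFN = {
    (0, 0, 0, 0, 0): 4.0, (0, 0, 0, 0, 1): 3.0, (0, 0, 0, 1, 1): 2.0,
    (0, 0, 1, 1, 1): 1.0, (1, 1, 1, 1, 1): 3.5,
}

def subfn5(s):
    return SUBFN.get(tuple(s), 0.0)

def fc2(x):
    """FC2: additive decomposition into m disjoint 5-bit blocks (n = 5m)."""
    assert len(x) % 5 == 0
    return sum(subfn5(x[j:j + 5]) for j in range(0, len(x), 5))
```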

4. Experimental results

4.1 Convergence velocity

We used an individual size of 100 for F_OneMax, F_Plateau and F_CheckerBoard, and a dimension of 50 for F_EqualProducts. For PBIL, the value of the parameter α was set to 0.5.

The experimental results can be consulted in Figures 7.1 and 7.2.

The results for F_OneMax (Figure 7.1, above) and F_EqualProducts (Figure 7.2, below) are not surprising. In F_OneMax all the variables are clearly independent, so every algorithm can build a probabilistic model that reflects this lack of dependence between the variables. The situation in F_EqualProducts is the opposite: there seems to be no dependence model between the variables, so there is no means of building a useful probabilistic model. Given these arguments, it seems plausible that all the algorithms behave the same. PBIL is a special case: its convergence velocity depends strongly on the parameter α, and a bigger value of α will probably give a faster convergence velocity.

In F_CheckerBoard (Figure 7.2, above), each variable is strongly related to those surrounding it, and UMDA, BSC, PBIL and MIMIC are not able to capture these relations. However, the EBNAs build more complex probability models, and because of this their convergence velocity on this function is faster than that of the rest of the algorithms.

The most surprising results are those obtained in F_Plateau (Figure 7.1, below). Even though each variable is related to two other variables, and this relationship cannot be captured by UMDA, BSC and MIMIC, all the algorithms perform the same. This is probably due to the simple nature of the relation between the variables.

4.2 Convergence reliability

The results of applying the algorithms to the functions F_SixPeaks, FC2, FC3, FC4 and FC5 presented in Section 3 are shown in Table 7.1.

The number between square brackets shows the problem dimension of each function. For each function and each implementation of EDAs, three values



Figure 7.1 Convergence velocity in F_OneMax (above) and F_Plateau (below).



Figure 7.2 Convergence velocity in F_CheckerBoard (above) and F_EqualProducts (below).


Table 7.1 Experimental results on the convergence reliability test. Each cell shows: population size (occurrences in 100 experiments) / average generations / average evaluations.

Function         UMDA                 BSC                  PBIL                 MIMIC                EBNA_BIC               EBNA_PC                EBNA_K2+pen
F_SixPeaks [20]  -                    -                    -                    -                    2048(58)/3.3/6871.8    8192(61)/5.2/6871.8    256(64)/5.3/1354.5
FC2 [20]         512(60)/7.5/3853.8   512(57)/6.5/3327.5   512(63)/10.9/5596.7  512(58)/7.6/3897.6   256(58)/4.7/1214.8     256(61)/5.4/1389.6     256(68)/5.7/1452.3
FC3 [20]         32(67)/7/219.5       32(62)/6.1/197.61    32(75)/8.7/278.7     64(71)/5.5/353.53    128(75)/5.6/723.3      32(75)/6.8/220.52      128(83)/4.8/620.7
FC4 [21]         -                    -                    -                    -                    4096(70)/16.4/67120.6  4096(80)/15.5/63701.7  1024(59)/5.5/5556.5
FC5 [21]         128(63)/7.4/948.5    256(65)/6.03/1545.7  256(61)/10.1/2606    128(72)/6.9/891.7    256(62)/5.5/1406       128(65)/6.4/824.9      128(77)/6.3/805.4

are shown. The first value indicates the value closest to the average population size required to find the optimum individual in 20 consecutive runs, with the number of times this value appeared in the 100 experiments given in parentheses. The second value is the average number of generations required in those 20 runs with that population size, and the third value is the average number of function evaluations required in those 20 runs with that population size. The symbol "-" means that, in the 100 experiments, the algorithm did not reach the optimum in 20 consecutive runs with any population size, and we assume that the algorithm cannot guarantee finding the optimum.

It can be seen that a bigger population size is required as the complexity of the function increases. UMDA, BSC, PBIL and MIMIC are not powerful enough for the difficult functions. The most reliable instance is EBNA_K2+pen: this algorithm is able to find the optimum with lower population sizes than the others in most cases.

The simplest EDA instances could not find the optimum in two of the five problems. An interesting property is that, when they do find the optimum, they use small population sizes. It seems that if these algorithms are able to find the optimum they will do so with small populations; otherwise, enlarging the population size will not help them find it.

4.3 Scalability

Results of the scalability criterion for the functions F_OneMax and F_CheckerBoard are shown in Figures 7.3 and 7.4. Figure 7.3 indicates the increase in execution time with the problem dimension, and Figure 7.4 presents the increase in the number of evaluations required as the dimension grows.


As can be seen, the increase in execution time differs between the algorithms. While in UMDA, BSC and PBIL the execution time seems to increase linearly, in the algorithms that use more complex probabilistic models it seems to increase exponentially. In the latter case, it is known that learning a Bayesian network is an NP-hard problem; the use of local search should alleviate this, but if it does not, simpler algorithms have to be used.

5. Conclusions

In this chapter we have presented an empirical comparison of some discrete instances of EDAs. The results of the comparison suggest that, for simple functions, the performance of UMDA, BSC, PBIL and MIMIC is comparable to that of EBNA_BIC, EBNA_PC and EBNA_K2+pen. But for complex functions the former are not powerful enough, and EBNA_BIC, EBNA_PC and EBNA_K2+pen are needed. If the function is hard to optimize, UMDA, MIMIC, BSC and PBIL do not guarantee finding the optimum.

In conclusion, we can say that the EBNAs are more reliable than UMDA, PBIL, BSC and MIMIC, but at the expense of longer execution times.

References

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University.

Baluja, S. and Davies, S. (1997). Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. Technical Report CMU-CS-97-107, Carnegie Mellon University.

De Bonet, J., Isbell, C., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 9.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence, CIMAF 99, Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000a). Combinatorial optimization by learning and simulation of Bayesian networks. In Boutilier, C. and Goldszmidt, M., editors, Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI-2000, pages 343-352. Morgan Kaufmann Publishers, San Francisco, CA.



Figure 7.3 Time/dimension scalability in F_OneMax (above) and F_CheckerBoard (below).



Figure 7.4 Evaluations/dimension scalability in F_OneMax (above) and F_CheckerBoard (below).


Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000b). Optimization in continuous domains by learning and simulation of Gaussian networks. In Wu, A. S., editor, Proc. of the Genetic and Evolutionary Computation Conference, GECCO-2000, Workshop Program, pages 201-204.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Mühlenbein, H., Mahnig, T., and Ochoa, A. (1999). Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics, 5:215-247.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature, PPSN IV, pages 178-187.

Mühlenbein, H. and Schlierkamp-Voosen, D. (1993). The science of breeding and its application to the breeder genetic algorithm (BGA). Evolutionary Computation, 1:335-360.

Schwefel, H. P. (1988). Evolutionary learning optimum-seeking on parallel computer architectures. In A. Sydow, S. T. and Vichnevetsky, R., editors, Proceedings of the International Symposium on Systems Analysis and Simulation 1988, I: Theory and Foundations, pages 217-225, Berlin. Akademie-Verlag.

Syswerda, G. (1993). Simulated crossover in genetic algorithms. In Foundations of Genetic Algorithms, volume 2, pages 332-339.


Chapter 8

Experimental Results in Function Optimization with EDAs in Continuous Domain

E. Bengoetxea    T. Miquelez
Department of Computer Architecture and Technology

University of the Basque Country

{endika, teresa}@si.ehu.es

P. Larrañaga    J.A. Lozano
Department of Computer Science and Artificial Intelligence

University of the Basque Country

{ccplamup, lozano}@si.ehu.es

Abstract    This chapter shows experimental results of applying continuous Estimation of Distribution Algorithms to some well-known optimization problems. For this purpose, the UMDA_c, MIMIC_c, EGNA_BIC, EGNA_BGe, EGNA_ee, EMNA_global, and EMNA_a algorithms were implemented. Their performance was compared to that of Evolution Strategies (Schwefel, 1995). The optimization problems of choice were Summation cancellation, Griewangk, Sphere model, generalized Rosenbrock, and Ackley.

Keywords: Estimation of Distribution Algorithms, Gaussian networks, function op­timization, continuous domain

1. Introduction

The aim of this chapter is to show the results of applying Estimation of Distribution Algorithms (EDAs) in continuous domains to some well-known optimization problems. Evolution Strategies (ESs) (Schwefel, 1995) were also applied to the same functions in order to compare their performance with that of continuous EDAs.



The outline of this chapter is as follows: Section 2 describes the optimization problems that will be used, and Section 3 explains which algorithms will be applied. Section 4 gives a brief description of the experiments, and Section 5 shows the results obtained. Finally, Section 6 presents the conclusions.

2. Description of the optimization problems

In order to test the performance of continuous EDAs and ESs, some standard functions broadly used in the literature for comparing optimization techniques have been chosen. The functions are as follows:

Summation cancellation: This is a maximization problem introduced in Baluja and Davies (1997). For any individual x = (x_1, ..., x_n), the range of the components is -0.16 ≤ x_i ≤ 0.16, i = 1, ..., n. The fitness function is computed as follows:

F(x) = 1 / ( 10^{-5} + Σ_{i=1}^{n} |y_i| )    (8.1)

where y_1 = x_1 and y_i = x_i + y_{i-1}, i = 2, ..., n. The fittest individual is the one whose components are all equal to 0, which corresponds to a fitness value of 100000.

Griewangk: This is a minimization problem proposed in Torn and Zilinskas (1989). The fitness function is as follows:

F(x) = 1 + Σ_{i=1}^{n} x_i^2 / 4000 − Π_{i=1}^{n} cos( x_i / √i )    (8.2)

The range of all the components of the individual is -600 ≤ x_i ≤ 600, i = 1, ..., n, and the fittest individual corresponds to the value 0, which can only be obtained when all the components of the individual are 0.

Sphere model: This is another broadly known, simple minimization problem. It is also defined so that -600 ≤ x_i ≤ 600, i = 1, ..., n, and the fitness value of each individual is as follows:

F(x) = Σ_{i=1}^{n} x_i^2    (8.3)

The reader can easily appreciate that the fittest individual is the one whose components are all 0, which corresponds to the fitness value 0.

Generalized Rosenbrock: This minimization problem was proposed in Rosenbrock (1960). It was originally stated for only 2 dimensions and has been generalized to n dimensions (Salomon, 1998). It is defined as:

F(x) = Σ_{i=1}^{n-1} [ 100 (x_{i+1} − x_i^2)^2 + (1 − x_i)^2 ]    (8.4)


The optimum value is 0, which is obtained by the individual whose components are all set to 1. The range of values is -10 ≤ x_i ≤ 10, i = 1, ..., n.

Ackley: This optimization problem was proposed in Ackley (1987); it is again a minimization problem whose best value is 0. This fitness value is obtained by the individual whose components are all set to 0. Originally this problem was defined for two dimensions, but it has been generalized to n dimensions (Back, 1996). The fitness function is computed as follows:

F(x) = −20 exp( −0.2 √( (1/n) Σ_{i=1}^{n} x_i^2 ) ) − exp( (1/n) Σ_{i=1}^{n} cos(2πx_i) ) + 20 + exp(1)    (8.5)

Figure 8.1 shows a plot of each of the problems described so far for the particular case of n = 2.
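A minimal executable sketch of the five fitness functions above (our own Python rendering, not the chapter's ANSI C++ implementation; function names are ours):

```python
import math

def summation_cancellation(x):
    """Maximization; optimum 1e5 at x = (0, ..., 0), taking
    F = 1 / (1e-5 + sum |y_i|) with y_1 = x_1, y_i = x_i + y_{i-1}."""
    y, total = 0.0, 0.0
    for xi in x:
        y += xi
        total += abs(y)
    return 1.0 / (1e-5 + total)

def griewangk(x):
    """Minimization; optimum 0 at x = (0, ..., 0)."""
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i + 1)) for i, xi in enumerate(x))
    return 1.0 + s - p

def sphere(x):
    """Minimization; optimum 0 at x = (0, ..., 0)."""
    return sum(xi * xi for xi in x)

def rosenbrock(x):
    """Generalized Rosenbrock; minimum 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def ackley(x):
    """Minimization; optimum 0 at x = (0, ..., 0)."""
    n = len(x)
    return (-20.0 * math.exp(-0.2 * math.sqrt(sum(xi * xi for xi in x) / n))
            - math.exp(sum(math.cos(2 * math.pi * xi) for xi in x) / n)
            + 20.0 + math.e)
```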

3. Algorithms to test

For each of the optimization problems in continuous domains introduced in the previous section, the behaviour of 8 evolutionary computation algorithms has been compared. Of these 8 algorithms, 7 are examples of continuous EDAs, while the remaining one is an ES.

In particular, the 8 algorithms tested are the following:

• Evolution Strategy: a (μ + λ)-strategy with recombination.

• UMDA_c, in which the factorization of the joint density function is computed as the product of univariate normal densities.

• MIMIC_c, where the joint density is factorized by means of a chain-like model using statistics of first and second order.

• EGNA_BIC, in which the factorization is given by a Gaussian network. The search for a model in every generation is based on the penalized maximum likelihood criterion, and the heuristic that looks for the best structure is a local search initialized with the model obtained in the previous generation.

• EGNA_BGe, whose characteristics are similar to those of the previous algorithm except for the evaluation criterion: in this case a Bayesian metric is used.

• EGNA_ee, which induces Gaussian network models starting from the results of hypothesis tests applied to each of the arcs in the structure of a Gaussian network.


[Panels: (a) Summation cancellation, (b) Griewangk, (c) Sphere model, (d) generalized Rosenbrock, (e) Ackley]

Figure 8.1 Plots of the problems to be optimized with continuous EDAs and ES techniques (n = 2).

• EMNA_global, in which the induced model corresponds to a multivariate normal distribution.

• EMNA_a, in which, as in the previous algorithm, the model is a multivariate normal distribution. In this case the distribution is adapted at every step, following the same philosophy as a steady-state genetic algorithm (Whitley and Kauth, 1988).


This last algorithm has only been applied to a dimension of 10, but the rest of the algorithms have been tested with dimensions of both 10 and 50.

For more details about ESs the reader is referred to Schwefel (1995). The algorithms UMDA_c, MIMIC_c, EGNA_BGe and EGNA_ee are described in Larrañaga et al. (2000), and the algorithms EGNA_BIC, EMNA_global, and EMNA_a are further described in Larrañaga et al. (2001).

4. Brief description of the experiments

This section describes the experiments and the results obtained. Continuous EDAs were implemented in the ANSI C++ language, and the ES implementation was obtained from Schwefel (1995). Some of the selected problems were not implemented in the ES source code, and therefore some changes were made in order to include them in the ANSI C program.

The initial population for both continuous EDAs and ESs was generated using the same random generation procedure based on a uniform distribution.

In EDAs, the following parameters were used: a population of 2000 individuals (M = 2000), from which the best 1000 are selected (N = 1000) to estimate the density function; the elitist approach was chosen (that is, the best individual is always included in the next population and 1999 individuals are simulated).

The ES chosen was taken from Schwefel (1995). This is a standard (μ + λ)-strategy with discrete recombination of points in the search space and intermediary recombination of strategy parameters. The parameter values were set to μ = 15 and λ = 100, with lower bounds on the step sizes (absolute of 1E-30 and relative of 9.8E-07) and convergence-test parameters (absolute of 1E-30 and relative of 9.8E-07).

For all the continuous EDAs the ending criterion was to reach 301850 evaluations (i.e. the number of individuals generated) or to obtain a result closer than 1E-06 to the optimum solution.

These experiments were all executed on a two-processor Sun Ultra 80 computer under Solaris version 7 with 1 Gb of RAM.

4.1 Experimental results

Results such as the best individual obtained and the number of evaluations needed to reach the final solution were recorded for each of the experiments.

Each algorithm was executed 10 times, and the null hypothesis of equal distribution densities was tested. The results are shown in Tables 8.1 to 8.5. In order to check whether the difference in behaviour between the algorithms is statistically significant, the non-parametric Kruskal-Wallis and Mann-Whitney tests were used. This task was carried out with the statistical package S.P.S.S. release 9.00.


Table 8.1 Mean values of experimental results after 10 executions for the problem Summation cancellation with a dimension of 10 and 50 (optimum fitness value = 1.0E+5).

Dimension  Algorithm    Best fitness value        Number of evaluations

10         UMDAc        6.23390E+04 ± 1.87E+04    301850 ± 0.0
           MIMICc       5.79875E+04 ± 2.30E+04    301850 ± 0.0
           EGNABIC      5.50837E+04 ± 1.70E+04    301850 ± 0.0
           EGNABGe      9.99991E+04 ± 1.25E-01    190305 ± 1836.9
           EGNAee       9.99992E+04 ± 1.73E-01    195903 ± 1632.2
           EMNAglobal   9.99991E+04 ± 1.03E-01    192904 ± 1056.6
           EMNAa        6.40978E+00 ± 6.78E-01    301850 ± 0.0
           ES           3.57871E-03 ± 6.31E-03    43000 ± 8232.7

50         UMDAc        6.89860E-01 ± 2.92E-02    301850 ± 0.0
           MIMICc       6.91292E-01 ± 4.73E-02    301850 ± 0.0
           EGNABIC      7.23125E-01 ± 3.28E-02    301850 ± 0.0
           EGNABGe      9.17252E+03 ± 2.89E+04    278861.5 ± 17761.2
           EGNAee       8.62138E+04 ± 8.91E+03    301850 ± 0.0
           EMNAglobal   8.61907E+04 ± 1.31E+04    301850 ± 0.0
           ES           5.31544E-08 ± 1.05E-08    193000 ± 27507.6

Table 8.2 Mean values of experimental results after 10 executions for the problem Griewangk with a dimension of 10 and 50 (optimum fitness value = 0).

Dimension  Algorithm    Best fitness value        Number of evaluations

10         UMDAc        6.0783E-02 ± 1.93E-02     301850 ± 0.0
           MIMICc       7.3994E-02 ± 2.86E-02     301850 ± 0.0
           EGNABIC      3.9271E-02 ± 2.43E-02     301850 ± 0.0
           EGNABGe      7.6389E-02 ± 2.93E-02     301850 ± 0.0
           EGNAee       5.6840E-02 ± 3.82E-02     301850 ± 0.0
           EMNAglobal   5.1166E-02 ± 1.67E-02     301850 ± 0.0
           EMNAa        12.9407 ± 3.43            301850 ± 0.0
           ES           3.496E-02 ± 1.81E-02      25000 ± 1699.7

50         UMDAc        8.9869E-06 ± 9.36E-07     177912 ± 942.3
           MIMICc       9.0557E-06 ± 8.82E-07     177912 ± 942.3
           EGNABIC      1.7075E-04 ± 6.78E-05     250475 ± 18658.5
           EGNABGe      8.6503E-06 ± 7.71E-07     173514.2 ± 1264.3
           EGNAee       9.1834E-06 ± 5.91E-07     175313.3 ± 965.6
           EMNAglobal   8.7673E-06 ± 1.03E-06     216292 ± 842.8
           ES           1.479E-03 ± 3.12E-03      109000 ± 13703.2


Results in Function Optimization with EDAs in Continuous Domain 187

Table 8.3 Mean values of experimental results after 10 executions for the problem Sphere model with a dimension of 10 and 50 (optimum fitness value = 0).

Dimension  Algorithm    Best fitness value        Number of evaluations

10         UMDAc        6.7360E-06 ± 1.26E-06     74163.9 ± 1750.3
           MIMICc       7.2681E-06 ± 2.05E-06     74963.5 ± 1053.5
           EGNABIC      2.5913E-05 ± 3.71E-05     77162.4 ± 6335.4
           EGNABGe      7.1938E-06 ± 1.78E-06     74763.6 ± 1032.2
           EGNAee       7.3713E-06 ± 1.98E-06     73964 ± 16321
           EMNAglobal   7.3350E-06 ± 2.24E-06     94353.8 ± 842.8
           EMNAa        4.8107E+04 ± 1.32E+04     301000 ± 0.0
           ES           0 ± 0.0                   48200 ± 1135.3

50         UMDAc        8.9113E-06 ± 8.41E-07     211495.2 ± 1264.2
           MIMICc       8.9236E-06 ± 9.66E-07     211695.1 ± 1474.9
           EGNABIC      1.2126E-03 ± 7.69E-04     263869 ± 29977.5
           EGNABGe      8.7097E-06 ± 1.30E-06     204298.8 ± 1264.2
           EGNAee       8.3450E-06 ± 1.04E-06     209496.2 ± 1576.8
           EMNAglobal   8.5225E-06 ± 1.35E-06     247477.2 ± 1264.2
           ES           1.541E-45 ± 4.43E-46      173000 ± 4830.4

Table 8.4 Mean values of experimental results after 10 executions for the problem Rosenbrock generalized with a dimension of 10 and 50 (optimum fitness value = 0).

Dimension  Algorithm    Best fitness value        Number of evaluations

10         UMDAc        8.7204 ± 3.82E-02         301850 ± 0.0
           MIMICc       8.7141 ± 1.64E-02         301850 ± 0.0
           EGNABIC      8.8217 ± 0.16             268066.9 ± 69557.3
           EGNABGe      8.6807 ± 5.87E-02         164518.7 ± 24374.5
           EGNAee       8.7366 ± 2.23E-02         301850 ± 0.0
           EMNAglobal   8.7201 ± 4.33E-02         289056.4 ± 40456.9
           EMNAa        3263.0010 ± 1216.75       301000 ± 0.0
           ES           -                         -

50         UMDAc        48.8949 ± 4.04E-03        301850 ± 0.0
           MIMICc       48.8894 ± 1.11E-02        301850 ± 0.0
           EGNABIC      50.4995 ± 2.30            301850 ± 0.0
           EGNABGe      48.8234 ± 0.118           275663.1 ± 1750.3
           EGNAee       48.8893 ± 1.11E-02        301850 ± 0.0
           EMNAglobal   49.7588 ± 0.52            296252.8 ± 7287.1
           ES           -                         -


Table 8.5 Mean values of experimental results after 10 executions for the problem Ackley with a dimension of 10 and 50 (optimum fitness value = 0).

Dimension  Algorithm    Best fitness value        Number of evaluations

10         UMDAc        7.8784E-06 ± 1.17E-06     114943.5 ± 1413.5
           MIMICc       8.8351E-06 ± 9.01E-07     114743.6 ± 1032.3
           EGNABIC      5.2294 ± 4.49             229086.4 ± 81778.4
           EGNABGe      7.9046E-06 ± 1.39E-06     113944 ± 1632.2
           EGNAee       7.4998E-06 ± 1.72E-06     118541.7 ± 2317.8
           EMNAglobal   8.9265E-06 ± 6.89E-07     119141.4 ± 1032.3
           EMNAa        10.8849 ± 1.19            301000 ± 0.0
           ES           20 ± 0.0                  18000 ± 7180.2

50         UMDAc        9.0848E-06 ± 3.11E-07     296852.5 ± 1053.5
           MIMICc       9.6313E-06 ± 3.83E-07     295653.1 ± 632.1
           EGNABIC      1.9702E-02 ± 7.50E-03     288256.8 ± 29209.4
           EGNABGe      8.6503E-06 ± 3.79E-07     282059.9 ± 632.1
           EGNAee       6.8198 ± 0.27             301850 ± 0.0
           EMNAglobal   9.5926E-06 ± 2.39E-07     291255.3 ± 1349.2
           ES           20 ± 0.0                  88000 ± 19888.6

4.2 Comments on the results

The experimental results shown in Tables 8.1 to 8.5 reveal important differences between the algorithms depending on the optimization problem. Next, each problem is analyzed separately, showing in each case which algorithms appeared most suited and testing whether the differences in performance are statistically significant.

Summation cancellation: For the Summation cancellation example, the algorithms that reached the optimum solution in the 10 dimension case were EMNAglobal, EGNABGe and EGNAee. When applying the non-parametric tests to these algorithms we obtain p = 0.249 for the best fitness value and p < 0.001 for the number of evaluations required. This means that for this example there are statistically significant differences in the number of evaluations required to reach the best solution, but not in the best result obtained by these three algorithms.

For the 50 dimension case, the algorithms that performed best (that came closest to the best solution) were EMNAglobal and EGNAee, as EGNABGe shows worse results when the complexity of the problem increases. For both algorithms the final results obtained were close to the optimum, but in all cases they reached the maximum number of evaluations and their execution was stopped (they satisfied the ending criterion before


reaching the best solution). If we apply the non-parametric tests to these two algorithms we obtain p = 0.191 for the best fitness value and p < 0.001 for the number of evaluations required, which means that the difference in the fitness values obtained with these two best algorithms is not statistically significant, but the difference in the number of evaluations required for convergence is.

Griewangk: The Griewangk problem is very complex to optimize due to the many local minima it presents, as can be seen in Figure 8.1b. In the 10 dimension case all the mean fitness values in Table 8.2 appear quite similar for all but the EMNAa algorithm. Nevertheless, when performing the non-parametric test for all the results we obtain that differences

are significant for both the fitness value and the number of evaluations. The algorithm that shows the best behaviour is the ES, closely followed by the EGNABIC, EMNAglobal, EGNAee, MIMICc and UMDAc algorithms. When comparing the ES and these five continuous EDAs we do not obtain statistically significant differences in the best fitness value (p = 0.101). On the other hand, the differences in the number of evaluations required are significant (p < 0.001).

In the 50 dimension case the differences are bigger, and therefore the performance of the different algorithms can be seen more clearly. In this case the ES reached the ending criterion and stopped the search without reaching a solution as good as those obtained with the EDAs. Differences among all the continuous EDAs are statistically significant (p < 0.001), and the algorithms that reached their results quickest were EGNABGe, EGNAee, MIMICc and UMDAc, although the differences in their fitness values were not significant (p = 0.505). The fact that EGNABIC did not converge as quickly as the rest shows that this algorithm is more dependent on the complexity of the problem than the others.

Sphere model: This problem does not have any local minima, and its optimum fitness value is 0. This is again a very suitable optimization problem for the ES method, which obtained the optimum result in practically all executions for both 10 and 50 dimensions.

If we do not consider EMNAa, the differences for the 10 dimension case were not significant in the best results obtained (p = 0.197). The main differences are in the number of evaluations, which is statistically significant when performing the test for all the algorithms (p < 0.001); but when performing the same test excluding EMNAglobal we obtain p = 0.125 for the fitness values and p = 0.671 for the number of evaluations, which shows clearly that all the non-EMNA algorithms


do not require a significantly different number of evaluations to reach similar solutions.

In the 50 dimension case these differences appear to be more important, and here again the EGNABIC algorithm shows worse performance when the dimension of the problem increases from 10 to 50. If we exclude the latter algorithm from the non-parametric test we obtain that the differences between all the remaining continuous EDAs are not statistically significant (p = 0.719). The number of evaluations required by these algorithms (all continuous EDAs except EGNABIC) is still significantly different (p < 0.001).

Rosenbrock generalized: Rosenbrock is a problem, illustrated in Figure 8.1d, which does not contain many local minima or maxima. For this reason the ES method is very suitable and shows the best results for both the 10 and 50 dimension cases.

Looking at the performance of the continuous EDAs, excluding EMNAa, there are significant differences between all of them for the 10 dimension case (p = 0.002). If we perform the test for the algorithms MIMICc, EMNAglobal, UMDAc, EGNABGe and EGNAee we obtain that the differences in the best fitness value obtained are not significant (p = 0.074). Of all of them, EGNABGe appears to require a significantly smaller number of evaluations to converge.

In the 50 dimension case, the differences in the best fitness value obtained are significant between all the continuous EDAs, and the best fitness values are obtained with EGNABGe, EGNAee, MIMICc and UMDAc. The differences between these algorithms were not statistically significant (p = 0.375), but when the non-parametric test also includes EMNAglobal the differences become important (p < 0.001). The fastest convergence was achieved with the EGNABGe algorithm.

Ackley: The Ackley problem also has several local minima, as can be appreciated in Figure 8.1e. The ES method performed considerably worse in both 10 and 50 dimensions than the rest of the algorithms, which showed fitness results much closer to the optimum value of 0. Following the results shown in Table 8.5 for the 10 dimension case, the best results were obtained with EGNAee, UMDAc, EGNABGe, MIMICc and EMNAglobal. However, the non-parametric Kruskal-Wallis test showed that the differences in fitness value were not statistically significant for these algorithms (p = 0.085). In the 50 dimension case, of these 5 algorithms, EGNAee performed considerably worse than the rest. The only difference in fitness value that is not statistically significant is between the algorithms UMDAc and EGNABGe (p = 0.151), although there are


significant differences between them in the number of evaluations required (p < 0.001). This means that EGNABGe behaves better for this problem at a dimension of 50. If we perform the same hypothesis test with UMDAc, EGNABGe and EMNAglobal, the differences in fitness value become statistically significant (p = 0.005).

4.3 The evolution of the search

As an example to illustrate the different ways in which the continuous EDAs reach their final result, Figure 8.2 shows the behaviour of all the algorithms for the Summation cancellation problem with a dimension of 10. The figure makes clear that the algorithms that approach the optimum solution quickest are EMNAglobal, EGNABGe and EGNAee, which reach convergence. The rest of the algorithms do not show such good behaviour, and their execution was stopped when the maximum number of evaluations was reached. Another important aspect of the figure is that it clearly shows that EMNAglobal converges a bit faster than EGNABGe and EGNAee: if the execution had been stopped at about the 50th generation, this algorithm would have returned the best result. EGNABGe converges nearly as fast, but the results in Table 8.1 show that when the complexity of the problem increases from 10 to 50 its relative performance worsens. This has also been observed for most of the optimization problems in the experiments.

4.4 The computation time

The computation time is the CPU time of the process for each execution, and is therefore not dependent on the multiprogramming level at execution time. As an example of the difference in computation time between the algorithms, the Summation cancellation problem was again used for both 10 and 50 dimensions. The results are shown in Table 8.6. This computation time is presented as a measure to illustrate the different computational complexity of the algorithms. It is also important to note that all the operations for the estimation of the distribution, the simulation, and the evaluation of the new individuals are carried out through memory operations.

As expected, the CPU time of each algorithm corresponds to the complexity of its learning step. Accordingly, the algorithms with the shortest computation times are, in order, UMDAc and MIMICc. All the EGNA-type algorithms show a longer computation time due to the calculation of the structure learnt, which has no restriction on the number of parents of each variable.

It is also worth mentioning the computation time of EMNAglobal, which is a bit shorter than the EGNA-type ones. As EMNAglobal is based on the assumption that all variables are mutually dependent (the structure


[Figure: best fitness value vs. generations for UMDAc, MIMICc, EGNABIC, EGNABGe, EGNAee, EMNAglobal and EMNAa]

Figure 8.2 Evolution of the different continuous EDAs for the Summation cancella­tion problem with a dimension of 10.

Table 8.6 Mean values of the computation time after 10 executions for the problem Summation cancellation with a dimension of 10 and 50.

Algorithm    Dimension 10        Dimension 50

UMDAc        0:02:36 ± 0:00:00   0:03:23 ± 0:00:01
MIMICc       0:02:47 ± 0:00:01   0:04:12 ± 0:00:00
EGNABIC      0:07:15 ± 0:00:01   3:15:31 ± 0:00:04
EGNABGe      0:03:03 ± 0:00:02   4:03:13 ± 0:13:18
EGNAee       0:01:59 ± 0:00:01   3:19:37 ± 0:03:42
EMNAglobal   0:01:55 ± 0:00:00   3:16:07 ± 0:00:10
EMNAa        0:05:49 ± 0:00:04   -
ES           0:00:02 ± 0:00:00   0:00:29 ± 0:00:06

is a complete graph), no time is required to estimate the most suitable structure in the learning step. It is also important to note that the other EMNA-type algorithm (EMNAa) shows a much longer computation time than the rest, which made it unsuitable for the 50 dimension examples. The same happened in the rest of the optimization problems.


On the other hand, it is important to note that the ES shows a very short computation time for this Summation cancellation problem. However, although its performance is quite good for some of these problems (e.g. the Sphere model), in the case of Summation cancellation the ending criterion is reached too quickly to obtain a good solution and its execution stops prematurely.

5. Conclusions

In the light of the fitness values obtained, we can conclude the following: generally speaking, for small dimensions EMNAglobal and the EGNA-type algorithms perform better than the rest, but when the dimension increases some of the algorithms show poorer performance as a result of the higher complexity they must take into account (e.g. the case of EGNABIC). The EMNAa algorithm showed very poor behaviour on all these optimization problems, and its additional computational effort made it impossible to apply to the 50 dimension cases.

An important aspect to take into account is that the EMNAglobal algorithm appears to be the method that approaches the best results most quickly, although these results are not always the optima. Nevertheless, once this algorithm is near the optimum solution it requires more time than algorithms such as EGNABGe or EGNAee to satisfy the ending criterion. This is why this fact is not apparent in Tables 8.1 to 8.5.

Depending on the problem, the ES method showed better results than the continuous EDAs, but when the problem to optimize presents many local minima or maxima, continuous EDAs show more appropriate behaviour. The main drawback of continuous EDAs in general is the computation time they require, but for some problems the results they obtain are not attainable with methods in the ES category.

Acknowledgments

This article has been partially supported by the Spanish Ministry for Science and Education with the project HF1999-0107.

References

Ackley, D. H. (1987). A Connectionist Machine for Genetic Hillclimbing. Kluwer, Boston.

Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice. Oxford University Press.

Baluja, S. and Davies, S. (1997). Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. Technical Report CMU-CS-97-107, Carnegie Mellon University.


Larrañaga, P., Etxeberria, R., Lozano, J., and Peña, J. (2000). Optimization in continuous domains by learning and simulation of Gaussian networks. In Proceedings of the Workshop in Optimization by Building and Using Probabilistic Models, a workshop within the 2000 Genetic and Evolutionary Computation Conference, GECCO 2000, pages 201-204, Las Vegas, Nevada, USA.

Larrañaga, P., Lozano, J. A., and Bengoetxea, E. (2001). Estimation of Distribution Algorithms based on multivariate normal and Gaussian networks. Technical Report KZZA-IK-1-01, Department of Computer Science and Artificial Intelligence, University of the Basque Country.

Rosenbrock, H. H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal, 3:175-184.

Salomon, R. (1998). Evolutionary algorithms and gradient search: similarities and differences. IEEE Transactions on Evolutionary Computation, 2(2):45-55.

Schwefel, H.-P. (1995). Evolution and Optimum Seeking. Wiley InterScience.

Törn, A. and Žilinskas, A. (1989). Global Optimization. Lecture Notes in Computer Science 350. Springer-Verlag, Berlin Heidelberg.

Whitley, D. and Kauth, J. (1988). GENITOR: A different genetic algorithm. In Proceedings of the Rocky Mountain Conference on Artificial Intelligence, volume II, pages 118-130.


Chapter 9

Solving the 0-1 Knapsack Problem with EDAs

R. Sagarna and P. Larrañaga
Department of Computer Science and Artificial Intelligence
University of the Basque Country
{ccbsaalr, ccplamup}@si.ehu.es

Abstract In this chapter we present several approaches to the 0-1 knapsack problem based on Estimation of Distribution Algorithms. These approaches use two different types of representation, three methods for obtaining the initial population and two different methods for handling the problem's constraints. Experimental results for problems of different sizes are given.

Keywords: knapsack problem, Estimation of Distribution Algorithms, binary representation, permutation representation

1. Introduction

The knapsack problem can be described as selecting, from among various items that could be placed in a knapsack, those items which are most useful given that the knapsack has limited capacity. Knapsack problems have been intensively studied because of their simple structure and because they can model many classical industrial problems such as capital budgeting, cargo loading and stock cutting (Martello and Toth, 1990).

This chapter presents several adaptations of Estimation of Distribution Algorithms (EDAs) to the knapsack problem. These adaptations differ in the representation they use, the way in which they obtain their initial populations and the manner in which they treat the constraint related to the knapsack capacity. Experimental results obtained for problems of different sizes are used to compare the different approaches based on EDAs.

The rest of the chapter is structured in the following way. Section 2 introduces the mathematical notation for the knapsack problem. In Section 3 a binary representation for the problem is presented, together with the manner in which

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


Table 9.1 0-1 knapsack problem with 7 items.

Item     1   2   3   4   5   6   7
Profit  20  31  17  30  14  52  10
Weight  30  54  32  16  27  61   7

discrete and continuous EDAs can be applied to it. The next section has the same structure, but the representation used is based on permutations. Section 5 presents experimental results, while conclusions are given in Section 6.

2. The 0-1 knapsack problem

The knapsack problem considered here is the 0-1 knapsack problem, which is classified as NP-hard (Garey and Johnson, 1979). The 0-1 knapsack problem is: given a finite set of items, where for each item its weight and profit are known, select the subset of items that provides the maximum profit and whose sum of weights is bounded by the knapsack capacity.

In mathematical notation, if we denote by

• n the number of items

• Pi the profit of item i

• Wi the weight of item i

• c the capacity of the knapsack

then a solution for the 0-1 knapsack problem consists of selecting a subset of the items so as to:

• maximize $\sum_{i=1}^{n} p_i x_i$

• subject to the constraint $\sum_{i=1}^{n} w_i x_i \le c$

where for all $i = 1, \ldots, n$

$x_i = \begin{cases} 1 & \text{if item } i \text{ is selected} \\ 0 & \text{otherwise.} \end{cases}$

Example 9.1 We illustrate the 0-1 knapsack problem with a simple example consisting of 7 items, where the profits and weights associated with each item are shown in Table 9.1. We assume that the capacity of the knapsack is c = 100.


If we select the items numbered 1, 2 and 4 we obtain a profit of 81 with a combined weight (30 + 54 + 16 = 100) that does not exceed the capacity of the knapsack. However, it is not possible to select the items numbered 1, 2 and 3 because their combined weight (30 + 54 + 32 = 116) exceeds the capacity of the knapsack.
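The arithmetic of this example can be checked with a few lines of code; the profits, weights and capacity are those of Table 9.1 with c = 100.

```python
# Checking the two selections discussed in Example 9.1.
profits = [20, 31, 17, 30, 14, 52, 10]
weights = [30, 54, 32, 16, 27, 61, 7]
c = 100

def evaluate(selected):
    """Return (total profit, total weight, feasible?) for a set of
    item numbers (1-based, as in Table 9.1)."""
    p = sum(profits[i - 1] for i in selected)
    w = sum(weights[i - 1] for i in selected)
    return p, w, w <= c

print(evaluate({1, 2, 4}))  # (81, 100, True): feasible, profit 81
print(evaluate({1, 2, 3}))  # (68, 116, False): exceeds the capacity
```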

Approaches to the knapsack problem include both algorithms developed using greedy principles and exact methods. The greedy principle orders the items by nonincreasing efficiency, where efficiency is the ratio between profit and weight, then includes the most efficient items in the knapsack until its capacity is exceeded. Approaches based on the greedy principle include work by Ingargiola and Korsh (1973), Dembo and Hammer (1980), Martello and Toth (1988), Fayard and Plateau (1982) and Pisinger (1999). Alternatively, amongst the exact methods, Balas and Zemel's (1980) algorithm embeds the branch-and-bound technique, while Plateau and Elkihel's (1985) hybrid algorithm uses both branch-and-bound and dynamic programming.

More relevant to this chapter are the approaches based on Genetic Algorithms (GAs) and EDAs. Using GAs, Watannabe et al. (1992) and Gordon et al. (1993) developed approaches that use a binary representation, while Hinterding (1994) uses a representation based on permutations. Olsen (1994) proposes the use of penalty functions designed for the knapsack problem, and Simoes and Costa (2001) propose an approach to the 0-1 knapsack problem based on GAs in which the standard crossover operator is replaced by a biologically-inspired mechanism known as transposition.

Regarding EDAs, Baluja (1995) presents some results obtained with the PBIL algorithm, while Baluja and Davies (1998) show some empirical comparisons between COMIT and PBIL (see Chapter 3 for details of these algorithms). In both of these works a binary representation is used.

3. Binary representation

In this section, we introduce two new approaches based on binary representations. The first is based on discrete EDAs, and the second uses EDAs in continuous domains. In both approaches we assume that the variables are ordered, from left to right, by their ratios between profit and weight. This means that $X_1$ is the variable associated with the item with the largest ratio between profit and weight, and $X_n$ corresponds to the variable with the worst ratio.

3.1 Discrete EDAs

Representation. Each possible solution to the 0-1 knapsack problem is represented by a binary array of dimension n, written as:

$(x_1, \ldots, x_i, \ldots, x_n)$.


A value of 1 in the ith position indicates that the ith item has been selected for inclusion in the knapsack. From the point of view of EDAs, each bit represents the value of one random variable following a Bernoulli distribution. The cardinality of the search space is $2^n$.

Example 9.2 Continuing the example introduced in Section 2, if the 1st, 2nd and 4th are the only selected items, then the corresponding binary array is:

$(1, 1, 0, 1, 0, 0, 0)$.

Evaluation. Since the array of bits can represent a solution that exceeds the capacity of the knapsack, we have developed two approaches to evaluating these arrays:

• Penalization of arrays representing non-feasible solutions.

In this approach, if the array represents a non-feasible selection of items, we penalize its evaluation so that it is not competitive with the evaluations of feasible solutions.

This evaluation is done in the following manner:

$f(x_1, \ldots, x_i, \ldots, x_n) = \begin{cases} \sum_{i=1}^{n} p_i x_i & \text{if } \sum_{i=1}^{n} w_i x_i \le c \\ K\left(\sum_{i=1}^{n} w_i - \sum_{i=1}^{n} w_i x_i\right) & \text{if } \sum_{i=1}^{n} w_i x_i > c \end{cases}$   (9.1)

where K is a positive number such that for all $(x_1, \ldots, x_i, \ldots, x_n)$ with $\sum_{i=1}^{n} w_i x_i > c$ we have:

$K\left(\sum_{i=1}^{n} w_i - \sum_{i=1}^{n} w_i x_i\right) \le \min(p_1, \ldots, p_n)$   (9.2)

Inequality 9.2 means that all the item selections that correspond to non-feasible solutions will obtain a worse evaluation than the evaluation corresponding to any feasible solution.
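The penalized evaluation can be sketched as follows. The particular choice of K below is one simple option that satisfies inequality (9.2), not necessarily the one used in the chapter's experiments.

```python
# Sketch of the penalized evaluation (9.1): feasible selections score
# their total profit; infeasible ones score K * (sum(w) - selected weight).
profits = [20, 31, 17, 30, 14, 52, 10]
weights = [30, 54, 32, 16, 27, 61, 7]
c = 100

# Any infeasible selection has sum(w_i x_i) > c, hence
# sum(w) - sum(w_i x_i) < sum(w) - c, so this K satisfies (9.2).
K = min(profits) / (sum(weights) - c)

def fitness(x):
    """x is a binary list; returns the penalized evaluation of (9.1)."""
    weight = sum(w * xi for w, xi in zip(weights, x))
    if weight <= c:
        return sum(p * xi for p, xi in zip(profits, x))
    return K * (sum(weights) - weight)

print(fitness([1, 1, 0, 1, 0, 0, 0]))  # feasible: profit 81
print(fitness([1, 1, 1, 0, 0, 0, 0]))  # infeasible: penalized below min profit
```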

• First fit algorithm.

In this approach to selecting items for inclusion in the knapsack while avoiding violation of the capacity constraint, we scan the items from left to right, in order, selecting those that meet the capacity constraint and rejecting those that would violate it. This first fit algorithm is shown in Figure 9.1. Hinterding (1994) uses this algorithm with different orderings to initialize the population of a GA.


Repeat
    Search left to right for the first item that does not violate the capacity constraint
    if item found
        add it to the knapsack
    else
        terminate algorithm

Figure 9.1 First fit algorithm.
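A runnable sketch of the pseudocode in Figure 9.1: scan the items left to right (in this chapter they are assumed pre-ordered by profit/weight ratio) and add each one that still fits within the remaining capacity.

```python
# First fit: greedy left-to-right scan under the capacity constraint.
def first_fit(weights, c):
    """Return a binary selection vector built greedily left to right."""
    x = [0] * len(weights)
    remaining = c
    for i, w in enumerate(weights):
        if w <= remaining:
            x[i] = 1
            remaining -= w
    return x

# Items of Example 9.1 in their original order, capacity c = 100.
weights = [30, 54, 32, 16, 27, 61, 7]
print(first_fit(weights, 100))  # [1, 1, 0, 1, 0, 0, 0]
```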

Initialization. We have considered three different methods for obtaining the initial population:

• Each item is selected with equal probability, independent of the remaining items and of its ratio between profit and weight. This initialization will be called uniform.

In order to obtain an expected number of selected items equal to $nc/\sum_{i=1}^{n} w_i$, each item is selected with a probability equal to $c/\sum_{i=1}^{n} w_i$.

The probability vector from which we generate the initial population is therefore:

$(p_0(x_1), \ldots, p_0(x_i), \ldots, p_0(x_n)) = \left(\frac{c}{\sum_{i=1}^{n} w_i}, \ldots, \frac{c}{\sum_{i=1}^{n} w_i}\right)$   (9.3)

• Each item is selected with a probability proportional to its ratio between profit and weight. That is:

$(p_0(x_1), \ldots, p_0(x_i), \ldots, p_0(x_n)) \propto \left(\frac{p_1}{w_1}, \ldots, \frac{p_i}{w_i}, \ldots, \frac{p_n}{w_n}\right)$   (9.4)

Denoting by M the proportionality constant, and taking into account that the expected number of selected items must be $nc/\sum_{i=1}^{n} w_i$, we obtain:

$M \sum_{i=1}^{n} \frac{p_i}{w_i} = \frac{nc}{\sum_{j=1}^{n} w_j}$   (9.5)

or equivalently:

$M = \frac{nc}{\sum_{j=1}^{n} w_j \sum_{i=1}^{n} \frac{p_i}{w_i}}$   (9.6)


obtaining finally that:

$(p_0(x_1), \ldots, p_0(x_n)) = \left(\frac{p_1\,nc}{w_1 \sum_{j=1}^{n} w_j \sum_{i=1}^{n} \frac{p_i}{w_i}}, \ldots, \frac{p_n\,nc}{w_n \sum_{j=1}^{n} w_j \sum_{i=1}^{n} \frac{p_i}{w_i}}\right)$   (9.7)

Since it is not guaranteed that each of the components of this vector is smaller than 1, we obtain the initial population of individuals using the following probability distribution:

$p_0(x_i) = \min\left(1, \frac{p_i\,nc}{w_i \sum_{j=1}^{n} w_j \sum_{i=1}^{n} \frac{p_i}{w_i}}\right)$   (9.8)

for all $i = 1, \ldots, n$.
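The proportional initialization of (9.8) can be sketched as follows, using the data of Example 9.1; the Bernoulli sampling of one initial individual is included for illustration.

```python
# Sketch of the proportional initialization (9.8): each component of the
# probability vector is p_i*n*c / (w_i * sum_j w_j * sum_k p_k/w_k),
# capped at 1; an initial individual is sampled as Bernoulli draws.
import random

profits = [20, 31, 17, 30, 14, 52, 10]
weights = [30, 54, 32, 16, 27, 61, 7]
c, n = 100, 7

ratio_sum = sum(p / w for p, w in zip(profits, weights))
w_sum = sum(weights)
p0 = [min(1.0, p * n * c / (w * w_sum * ratio_sum))
      for p, w in zip(profits, weights)]

random.seed(0)
individual = [1 if random.random() < q else 0 for q in p0]
print([round(q, 3) for q in p0])
print(individual)
```

When no component needs capping, as here, the vector's components sum to exactly $nc/\sum w_i$, the expected number of selected items.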

• Probabilistic seed, in which, starting from the solution provided by the first fit algorithm, an initial population of solutions is obtained by simulating the following probability distribution:

$(p_0(x_1), \ldots, p_0(x_i), \ldots, p_0(x_n))$   (9.9)

where for all $i = 1, \ldots, n$:

$p_0(x_i) = \begin{cases} \alpha & \text{if item } i \text{ is selected by the first fit algorithm} \\ 1 - \alpha & \text{if item } i \text{ is not selected by the first fit algorithm.} \end{cases}$

In this chapter we fix the value of $\alpha$ to 0.95.

3.2 Continuous EDAs

Representation. For continuous EDAs, we need n + 1 variables to represent each item selection. A Gaussian network is used to express the interdependencies between these n + 1 variables. The first n variables are related to their corresponding items, and the (n + 1)th variable provides a threshold: the chosen items are those whose associated variable is larger than the threshold.

Example 9.3 Suppose we use the following vector of dimension 8 to represent a choice between the 7 items of Example 9.1:

$(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) = (10.4, 12.8, 9.4, 16.2, 7.14, 5.67, 9.14, 9.98)$.

This array is interpreted as selecting items 1, 2 and 4, because their corresponding values (10.4, 12.8 and 16.2) are the only ones that exceed the threshold of 9.98. This 8 dimensional array is therefore equivalent to the following 7 dimensional binary array:

$(1, 1, 0, 1, 0, 0, 0)$.
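The decoding step of Example 9.3 can be sketched in a few lines: the first n components are item scores, the (n+1)th is the threshold, and an item is selected when its score exceeds the threshold.

```python
# Decoding the continuous representation into a binary selection.
def decode(vector):
    """Map an (n+1)-dimensional real vector to a binary selection."""
    *scores, threshold = vector
    return [1 if s > threshold else 0 for s in scores]

x = (10.4, 12.8, 9.4, 16.2, 7.14, 5.67, 9.14, 9.98)
print(decode(x))  # [1, 1, 0, 1, 0, 0, 0] -> items 1, 2 and 4
```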


Evaluation. Evaluation is done on the binary transformation, so it is exactly the same as that used for discrete EDAs.

Initialization. We again consider three different initializations:

• Each item is selected with uniform probability, independent of the remaining items and of its ratio between profit and weight.

In a similar manner to the discrete case, if the probability of selecting each item is $c/\sum_{i=1}^{n} w_i$, and if $X_i \equiv N(\mu, \sigma^2)$ for all $i = 1, \ldots, n$ and $X_{n+1} \equiv N(\mu_{thr}, \sigma^2)$, we can obtain the value for $\mu_{thr}$, given that:

$p(X_i > X_{n+1}) = \frac{c}{\sum_{i=1}^{n} w_i}$   (9.10)

for all $i = 1, \ldots, n$. Noting that $X_i - X_{n+1} \equiv N(\mu - \mu_{thr}, 2\sigma^2)$, we obtain that the parameter $\mu_{thr}$ must satisfy:

$p(X_i > X_{n+1}) = p(X_i - X_{n+1} > 0) = \int_{0}^{\infty} \frac{1}{2\sigma\sqrt{\pi}}\, e^{-\frac{1}{4\sigma^2}(x_i - \mu + \mu_{thr})^2}\, dx_i = \frac{c}{\sum_{i=1}^{n} w_i}$   (9.11)

One way to satisfy this condition is to fix $\sigma$ and then choose $\mu - \mu_{thr}$ accordingly.

• Each item is selected with a probability proportional to its ratio between profit and weight.

Reasoning in a similar way to the discrete case, we have that for all $i = 1, \ldots, n$:

$p(X_i > X_{n+1}) = \frac{p_i\,nc}{w_i \sum_{j=1}^{n} w_j \sum_{i=1}^{n} \frac{p_i}{w_i}}$   (9.12)

If we fix the parameters of the (n+1)th variable, $X_{n+1} \equiv N(\mu_{thr}, \sigma^2)$, then each variable $X_i$ ($i = 1, \ldots, n$), which also follows a normal distribution, $X_i \equiv N(\mu_i, \sigma^2)$, must satisfy:

$p(X_i > X_{n+1}) = p(X_i - X_{n+1} > 0) = p(N(\mu_i - \mu_{thr}, 2\sigma^2) > 0) = \Phi\left(\frac{\mu_i - \mu_{thr}}{\sqrt{2}\,\sigma}\right)$   (9.13)

where $\Phi$ denotes the standard normal distribution function. In order to determine $\mu_i$ for all $i = 1, \ldots, n$ we fix the values of the parameters $\sigma$ and $\mu_{thr}$ (the latter to 0), to obtain that for all $i = 1, \ldots, n$:

$\mu_i = \sqrt{2}\,\sigma\,\Phi^{-1}\left(\frac{p_i\,nc}{w_i \sum_{j=1}^{n} w_j \sum_{k=1}^{n} \frac{p_k}{w_k}}\right)$   (9.14)

• Probabilistic seed as described in the previous section.
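Under the reconstruction of Eqs. (9.10)-(9.14) above, both initializations reduce to inverting the standard normal distribution function Φ. A minimal sketch, where the function names and the item data are ours (not the chapter's) and Python's `statistics.NormalDist` stands in for Φ:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: Phi = Z.cdf, Phi^(-1) = Z.inv_cdf

def mu_threshold(mu, sigma, p):
    """mu_thr making P(X_i > X_{n+1}) = p when X_i ~ N(mu, sigma^2) and
    X_{n+1} ~ N(mu_thr, sigma^2) are independent, so that
    X_i - X_{n+1} ~ N(mu - mu_thr, 2 sigma^2) as in Eq. (9.11)."""
    return mu - (2 ** 0.5) * sigma * Z.inv_cdf(p)

def mu_item(p, sigma=0.5, mu_thr=0.0):
    """mu_i solving Phi((mu_i - mu_thr) / (sqrt(2) sigma)) = p, i.e. the
    reconstruction of Eqs. (9.13)-(9.14) with sigma = 1/2, mu_thr = 0."""
    return mu_thr + (2 ** 0.5) * sigma * Z.inv_cdf(p)

# Hypothetical instance (these are NOT the data of Example 9.1).
C, w = 20.0, [5, 8, 7, 6, 9, 4, 3]
p_uniform = C / sum(w)                             # Eq. (9.10)
mu_thr = mu_threshold(mu=0.0, sigma=0.5, p=p_uniform)
# sanity check: the induced selection probability recovers p_uniform
assert abs(Z.cdf(-mu_thr / (2 ** 0.5 * 0.5)) - p_uniform) < 1e-6
```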


4. Representation based on permutations

Representation. Each possible solution for the 0-1 knapsack problem is represented by a permutation (π(1), ..., π(i), ..., π(n)) of the items to be selected.

Existing work on discrete EDAs (Santana and Ochoa, 1999; Bengoetxea et al., 2000) already deals with problems similar to the one of obtaining these permutations. These approaches all adapt the simulation phase in order to obtain a permutation. The problem with them is that the probability distribution learnt by the Bayesian network is changed by the constraint.

In this chapter we obtain each permutation from the simulation of a Gaussian network (Pelikan, 2000). We assume that the random variables in the Gaussian network are ordered, as in the binary representation, by their ratio between profit and weight.

If we denote by (x_1, ..., x_i, ..., x_n) the continuous vector obtained in the simulation of the Gaussian network, then, once the values x_i (i = 1, ..., n) are ordered from the largest to the smallest, π(i) is given by the rank of x_i in this ordering, for all i = 1, ..., n.

With this representation the cardinality of the search space is n!. This number is bigger (if n ≥ 4) than 2^n because the representation we are using is redundant.

Example 9.4 Assume that we have obtained the following 7-dimensional vector for the 7 items of Example 9.1:

(x1, x2, x3, x4, x5, x6, x7) = (10.4, 12.8, 9.4, 16.2, 7.14, 5.67, 9.14).

Ordering the values corresponding to the items, we obtain:

(π(1), π(2), π(3), π(4), π(5), π(6), π(7)) = (3, 2, 4, 1, 6, 7, 5).

This permutation indicates the order of selection for the items to be included in the knapsack.
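The rank decoding of Example 9.4 can be sketched as follows (the function name is ours):

```python
# Sketch of the rank decoding used in Example 9.4: pi(i) is the rank of x_i
# when all components are sorted from largest to smallest.
def decode_permutation(x):
    order = sorted(range(len(x)), key=lambda i: -x[i])  # indices, largest first
    pi = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        pi[i] = rank
    return tuple(pi)

x = (10.4, 12.8, 9.4, 16.2, 7.14, 5.67, 9.14)
print(decode_permutation(x))  # (3, 2, 4, 1, 6, 7, 5), as in Example 9.4
```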

Evaluation. Here, we don't use the evaluation via penalization, so each permutation is evaluated using the first fit algorithm described in Section 9.3.
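Section 9.3's first fit algorithm is not reproduced in this excerpt; the sketch below assumes the common reading that items are visited in the order encoded by the permutation and each item is packed whenever it still fits. Profits, weights and the capacity are made up for illustration:

```python
def first_fit(pi, profits, weights, capacity):
    """Greedily pack items in the order given by permutation pi (an assumed
    reading of the chapter's first fit evaluation; data is hypothetical)."""
    order = sorted(range(len(pi)), key=lambda i: pi[i])  # rank 1 first
    total_w = total_p = 0
    for i in order:
        if total_w + weights[i] <= capacity:  # item still fits
            total_w += weights[i]
            total_p += profits[i]
    return total_p

profits = [10, 7, 6, 9, 5, 4, 3]   # hypothetical data, not Example 9.1
weights = [5, 4, 6, 8, 3, 2, 7]
print(first_fit((3, 2, 4, 1, 6, 7, 5), profits, weights, capacity=15))  # 21
```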

Initialization. We consider three possible initializations, as seen in Section 3:

• Each item has the same probability of being in each of the n positions of the permutation.

To obtain this initialization all the random variables will follow the same normal distribution model. That is, X_i ≡ N(μ, σ²) for all i = 1, ..., n.

• We assign more probability to those items with larger ratios between profit and weight.


Table 9.2 Knapsack problem. Binary representation. Average of the best results. n = 50. Greedy: 1713.

                    penalty                              first fit
          uniform   proportional  prob. seed   uniform   proportional  prob. seed
UMDA      1731.8    1731.0        1717.8       1734.0    1734.0        1732.8
MIMIC     1731.2    1731.2        1716.8       1734.0    1734.0        1730.4
EBNApc    1731.0    1730.6        1720.4       1734.0    1733.0        1728.8
UMDAc     1731.2    1733.2        1713.0       1734.0    1732.4        1713.0
MIMICc    1731.2    1732.2        1713.0       1734.0    1734.0        1713.0
EGNAee    1729.3    1730.7        1715.2       1734.0    1733.6        1717.6

Table 9.3 Knapsack problem. Binary representation. Average of the best results. n = 200. Greedy: 8010.

                    penalty                              first fit
          uniform   proportional  prob. seed   uniform   proportional  prob. seed
UMDA      7964.0    7977.3        8011.6       8018.0    8018.0        8018.2
MIMIC     7977.2    7990.2        8013.0       8017.2    8018.0        8018.6
UMDAc     7935.4    8003.8        8010.4       8016.8    8016.8        8014.5
MIMICc    7950.2    7985.0        8010.0       8017.8    8016.8        8014.1

Here, we generate n-dimensional vectors whose ith component has expected value proportional to its ratio between profit and weight. That is, X_i ≡ N(μ_i, σ²) where μ_i ∝ p_i/w_i for all i = 1, ..., n.

• Probabilistic seed, as described in previous sections.

5. Experimental results

In this section we present the results of some experiments carried out with different numbers of objects (n = 50, 200 and 1000). For each experiment we randomly obtain the values for the profit and weight associated with each item, as well as the capacity of the knapsack.

In Tables 9.2 to 9.4 the average results over 10 independent runs are shown for, respectively, the 50, 200 and 1000 objects problems. All three tables correspond to the results obtained with a binary representation. As can be seen in the tables, we consider (see Section 3.1 for details) two ways of evaluating the individuals (penalization and the first fit algorithm).


Table 9.4 Knapsack problem. Binary representation. Average of the best results. n = 1000. Greedy: 41425.

                    penalty                              first fit
          uniform   proportional  prob. seed   uniform   proportional  prob. seed
UMDA      38212.8   39063.6       40895.6      41145.6   41097.2       41393.0
MIMIC     38307.8   39282.8       41070.2      41341.8   41239.6       41424.4
UMDAc     37545.8   40786.2       41425.0      41425.6   41425.0       41425.0
MIMICc    37647.2   39787.7       41425.0      41426.2   41425.4       41425.0

Table 9.5 Knapsack problem. Representation based on permutation. Mean of the best results. n = 50. Greedy: 1713.

          uniform   proportional  probabilistic seed
UMDAc     1734.0    1734.0        1713.0
MIMICc    1734.0    1734.0        1713.0
EGNAee    1734.0    1733.6        1713.0

Table 9.6 Knapsack problem. Representation based on permutation. Mean of the best results. n = 200. Greedy: 8010.

          uniform   proportional  probabilistic seed
UMDAc     8012.0    8012.1        8016.5
MIMICc    8005.8    8014.4        8016.2

These evaluations are combined with three initializations (uniform, proportional and by means of a probabilistic seeding).

In a similar manner, Tables 9.5 to 9.7 present the results obtained with the permutation based representation (see Section 3.2 for details) for the 50, 200 and 1000 objects problems. These tables take into account the same three initializations, but here the first fit algorithm was the only evaluation method considered.

Roughly speaking, the best results were obtained with the first fit algorithm as the way to verify the constraints of the 0-1 knapsack problem. For the smallest problem considered (n = 50), the best results were obtained with the uniform initialization and the binary representation.


Table 9.7 Knapsack problem. Representation based on permutation. Mean of the best results. n = 1000. Greedy: 41425.

          uniform   proportional  probabilistic seed
UMDAc     40246.0   40585.0       41427.0
MIMICc    40246.8   40478.7       41427.0

In the intermediate problem (n = 200), the best results were obtained with the first fit evaluation and the binary representation in conjunction with the discrete UMDA. Finally, in the biggest problem (n = 1000), the first fit evaluation in combination with the permutation based representation and a probabilistic seeding led to the best results.

The non-parametric tests of Kruskal-Wallis and Mann-Whitney were used to verify the null hypothesis of the same distribution. These tasks were carried out with the statistical package S.P.S.S. release 10.0.6. The results were as follows:

• Comparing different EDA algorithms.

Here, fixing the representation (binary or permutation based), the evaluation (penalty or first fit) and the initialization type (uniform, proportional or probabilistic seeding), we aim to compare the results obtained with the different EDA approaches.

- 50 objects

The differences were statistically significant for the case of discrete EDAs for a binary representation, with a penalty evaluation and a uniform initialization (p = 0.006), and also with a first fit evaluation and a probabilistic seeding (p = 0.0291). We also found statistically significant differences for the case of continuous EDAs, for a binary representation, a first fit evaluation and a probabilistic seeding (p = 0.0403). On the other hand, with a permutation based representation the tests did not detect that the differences between the three continuous EDAs were statistically significant.

- 200 objects

The following cases presented statistically significant differences for discrete EDAs: binary representation with a penalty evaluation and a uniform initialization (p = 0.009) or a proportional initialization (p = 0.0058), and also binary representation with a first fit evaluation and a uniform initialization (p = 0.0293).


In the case of continuous EDAs, the tests showed differences for: binary representation with a penalty evaluation and proportional initialization (p = 0.0013), and for the permutation based representation with uniform initialization (p = 0.0086).

- 1000 objects

For discrete EDAs all the differences were statistically significant except for the case of a penalty evaluation in conjunction with a uniform initialization. For continuous EDAs we obtained differences for: binary representation with penalty evaluation and proportional initialization (p = 0.0001), and permutation based representation with proportional initialization (p = 0.0227).

• Comparing different evaluations.

The objective in this point is to compare the behaviour of the algorithms once the initialization and the type of EDA were fixed. In fact, these comparisons are only valid for the results presented in Tables 9.2 to 9.4.

- 50 objects

In the case of discrete EDAs with a binary representation, the obtained differences between pairs of algorithms of the same complexity and same initialization were statistically significant. When comparing continuous EDAs with a binary representation, the cases with statistically significant differences were UMDAs with uniform initialization (p = 0.0293), MIMICs with uniform initialization (p = 0.0049) and proportional initialization (p = 0.0049), as well as EGNAees with uniform initialization (p = 0.0019).

- 200 objects

In this case, for discrete EDAs as well as for continuous EDAs with a binary representation, all the differences between pairs of algorithms of the same complexity and same initialization were statistically significant.

- 1000 objects

In this case we obtained the same behaviour as in the case of 200 objects, except for the continuous EDAs, where the differences when comparing the two types of evaluations were not statistically significant for the probabilistic seeding based initializations.

• Comparing different representations.

For the penalty evaluation we compare the results obtained with discrete and continuous EDAs of the same complexity and a binary representation.


For the first fit evaluation we extend the comparison, taking into account the permutation based representation.

- 50 objects

The differences were significant for: UMDAs with penalty evaluation and proportional initialization (p < 0.0001), UMDAs with penalty evaluation and probabilistic seeding (p = 0.0051), UMDAs with first fit evaluation and proportional initialization (p = 0.0115), and UMDAs with first fit evaluation and probabilistic seeding (p < 0.0001). Also MIMICs with first fit evaluation and probabilistic seeding (p < 0.0001), EBNApc versus EGNAee with penalty evaluation and probabilistic seeding (p = 0.0015), and EBNApc versus EGNAee with first fit evaluation and probabilistic seeding (p < 0.0001) presented statistically significant differences.

- 200 objects

In this example, all the differences were statistically significant except the following three cases: UMDAs with penalty evaluation and probabilistic seeding initialization (p = 0.1351), MIMICs with penalty evaluation and uniform initialization (p = 0.1668), and MIMICs with first fit evaluation and proportional initialization (p = 0.3420).

- 1000 objects

In this example all the differences were statistically significant.

• Comparing different initializations.

Here we compare, for algorithms with the same complexity and the same evaluation type, the results obtained for the three different initializations: uniform, proportional and probabilistic seeding.

- 50 objects

All the differences were statistically significant except for the case of UMDA algorithms with a binary representation and first fit evaluation (p = 0.1260).

- 200 objects

Except for UMDA algorithms with binary representation and first fit evaluation (p = 0.4508), all the differences were statistically significant.

- 1000 objects

In all the comparisons the obtained differences were statistically significant.
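The chapter's tests were run in SPSS; as a self-contained illustration, the two-sided Mann-Whitney U test with the normal approximation can be sketched in a few lines (average ranks for ties, no continuity correction, so p-values will differ slightly from SPSS's corrected or exact variants; the sample data is made up):

```python
from math import erf, sqrt

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation
    (average ranks for ties, no continuity correction). An illustrative
    sketch only, not the procedure SPSS applies."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):                # tied block gets the average rank
            ranks[k] = (i + 1 + j) / 2
        i = j
    r1 = sum(r for r, (_, src) in zip(ranks, pooled) if src == 0)
    n1, n2 = len(a), len(b)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)                # U statistic
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sd               # z <= 0 by construction
    return u, 1 + erf(z / sqrt(2))           # two-sided p = 2 * Phi(z)

u, p = mann_whitney_u([1731.8, 1731.0, 1717.8], [1713.0, 1712.4, 1710.9])
print(u, p)  # small p: the two samples clearly differ
```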


6. Conclusions

In this chapter we have introduced for the first time the application of EDAs to the 0-1 knapsack problem. We have introduced two different representations (binary and permutation based) in combination with two ways of maintaining the feasibility of the individuals (penalization and the first fit algorithm) and also three different initializations of the first population (uniform, proportional and probabilistic seeding).

With the experiments we have carried out in this preliminary work, we conclude the superiority of the first fit algorithm with respect to the penalization. More work must be done to obtain clear conclusions with respect to the other parameters.

References

Balas, E. and Zemel, E. (1980). An algorithm for large zero-one knapsack problems. Operations Research, 28:1130-1154.

Baluja, S. (1995). An empirical comparison of seven iterative and evolutionary function optimization heuristics. Technical Report CMU-CS-95-193, School of Computer Science, Carnegie Mellon University.

Baluja, S. and Davies, S. (1998). Fast probabilistic modeling for combinatorial optimization. In AAAI-98.

Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A., and Boeres, C. (2000). Inexact graph matching using learning and simulation of Bayesian networks. An empirical comparison between different approaches with synthetic data. In Workshop Notes of CaNew2000: Workshop on Bayesian and Causal Networks: From Inference to Data Mining. Fourteenth European Conference on Artificial Intelligence, ECAI2000.

Chu, P.C. and Beasley, J.E. (1998). A genetic algorithm for the multidimensional knapsack problem. Journal of Heuristics, 4:63-86.

Dembo, R.S. and Hammer, P.L. (1980). A reduction algorithm for knapsack problems. Methods of Operational Research, 36:49-60.

Fayard, D. and Plateau, G. (1982). An algorithm for the solution of the 0-1 knapsack problem. Computing, 28:269-287.

Garey, M.R. and Johnson, D.S. (1979). Computers and Intractability. A Guide to the Theory of NP-completeness. W.H. Freeman Co., San Francisco.

Gordon, W.S., Bohm, A.P.W., and Whitney, D. (1993). A note on the performance of genetic algorithms on the zero-one knapsack problem. Technical Report CS-93-108, Department of Computer Science, Colorado State University.

Hinterding, R. (1994). Mapping, order-independent genes and the knapsack problem. In IEEE Conference, pages 13-17.


Ingargiola, G.P. and Korsh, J.F. (1973). A reduction algorithm for zero-one single knapsack problems. Management Science, 20:460-463.

Martello, S. and Toth, P. (1988). A new algorithm for the 0-1 knapsack problem. Management Science, 34:633-644.

Martello, S. and Toth, P. (1990). Knapsack Problems: Algorithms and Computer Implementations. John Wiley and Sons.

Olsen, A.L. (1994). Penalty functions and the knapsack problem. In IEEE Conference, pages 554-558.

Pelikan, M. (2000). Solving permutation problems with continuous EDAs. Personal communication.

Pisinger, D. (1999). Core problems in knapsack algorithms. Operations Research, 47(4):570-575.

Plateau, G. and Elkihel, M. (1985). A hybrid algorithm for the 0-1 knapsack problem. Methods of Operations Research, 49:277-293.

Santana, R. and Ochoa, A. (1999). Dealing with constraints with Estimation of Distribution Algorithms: The univariate case. In Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF 99, pages 378-384.

Simões, A. and Costa, E. (2001). An evolutionary approach to the zero-one knapsack problem: Testing ideas from biology. In Kůrková, V., Steele, N. C., Neruda, R., and Kárný, M., editors, International Conference on Artificial Neural Networks and Genetic Algorithms, ICANNGA-2001, pages 236-239. Springer.

Watannabe, K., Ikeda, Y., Matsuo, S., and Tsuji, T. (1992). Improvements of the genetic algorithms and its applications. Technical report, Faculty of Engineering, Fukui University, Vol. 40, Issue 1.


Chapter 10

Solving the Traveling Salesman Problem with EDAs

V. Robles
P. de Miguel
Department of Computer Architecture and Technology
Technical University of Madrid
{vrobles, pmiguel}@fi.upm.es

P. Larrañaga
Department of Computer Science and Artificial Intelligence
University of the Basque Country
[email protected]

Abstract: In this chapter we present an approach for solving the Traveling Salesman Problem using Estimation of Distribution Algorithms (EDAs). This approach is based on using discrete and continuous EDAs to find the best possible solution. We also present a method in which domain knowledge (based on local search) is combined with EDAs to find better solutions. We show experimental results obtained on several standard examples for discrete and continuous EDAs, both alone and combined with a heuristic local search.

Keywords: Traveling Salesman Problem, Evolutionary Computation, Estimation of Distribution Algorithms, Genetic Algorithms, local search heuristics

1. Introduction

The objective of the Traveling Salesman Problem (TSP) is to find the shortest route for a traveling salesman who, starting from his home city, has to visit every city on a given list precisely once and then return to his home city. The main difficulty of this problem is the immense number of possible tours: (n - 1)!/2 for n cities.

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


The TSP is a relatively old problem: it was documented as early as 1759 by Euler (though not under that name), whose interest was in solving the knight's tour problem. The knight's tour problem is to visit each of the 64 squares of a chessboard exactly once using a knight. The term "traveling salesman" was first used in 1932, in a German book written by a veteran traveling salesman. The Rand Corporation introduced the TSP in 1948. The Corporation's reputation helped to make the TSP a well-known and popular problem.

Through the years the TSP has occupied the thoughts of numerous researchers. There are several reasons for this. First, the TSP is very easy to describe but very difficult to solve. No polynomial time algorithm is known with which it can be solved. This lack of any polynomial time algorithm is a characteristic of the class of NP-complete problems, of which the TSP is a classic example. Second, the TSP is broadly applicable to a variety of routing and scheduling problems. Third, since a lot of information is already known about the TSP, it has become a kind of test problem; new combinatorial optimization methods are often applied to the TSP so that an idea can be formed of their usefulness. Finally, a great number of problems approached with heuristic techniques in Artificial Intelligence are related to the search for the best permutation of n elements. Examples of these are problems in cryptanalysis, such as the discovery of a key of a simple substitution cipher (Spillman et al., 1993), or the breaking of transportation ciphers in cryptographic systems (Matthews, 1993).

The structure of this chapter is as follows. In Section 2 we introduce the different techniques used to solve the TSP. Section 3 presents a new approach for solving the TSP using EDAs. In Section 4 we present some experimental results using EDAs. Finally, conclusions are given in Section 5.

2. Review of algorithms for the TSP

There have been many different approaches to solving the TSP. We have split these approaches into three main groups: using TSP domain knowledge with heuristics, modern heuristics, and Evolutionary Computation.

2.1 Using TSP domain knowledge with heuristics

Several formulations and algorithms have been proposed for the TSP. Many of these approaches can be outperformed using specific domain knowledge. This domain knowledge can be divided into two groups: tour construction heuristics and tour improvement heuristics.

2.1.1 Tour construction heuristics. If we are solving a TSP problem, we can use a tour construction heuristic, such as nearest neighbour, Greedy, Clarke-Wright or Christofides (Johnson and McGeoch, 1997).


Nearest neighbour: The most natural heuristic is the nearest neighbour algorithm. In this algorithm the voyager always goes to the nearest unvisited location. For an n city problem, we can create, at most, n different tours, each one starting in a different city.
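The nearest neighbour construction can be sketched as follows (a minimal illustration on a made-up symmetric distance matrix, not the chapter's code):

```python
# Build a tour by always moving to the nearest unvisited city.
def nearest_neighbour_tour(dist, start=0):
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        last = tour[-1]
        tour.append(min(unvisited, key=lambda c: dist[last][c]))
        unvisited.remove(tour[-1])
    return tour

dist = [[0, 2, 9, 10],   # made-up symmetric distances for 4 cities
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(nearest_neighbour_tour(dist))  # [0, 1, 3, 2]
```

Running it from each of the n possible start cities and keeping the shortest result gives the "at most n different tours" mentioned above.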

Greedy: In the greedy heuristic we can see a tour as an instance of a graph with the cities as vertices and with edges of distance d between each pair of cities. Using this model, we can see a tour as a Hamiltonian cycle in the graph. To build the tour we insert one edge at a time starting with the shortest edge, and then repeatedly add the shortest remaining available edge, if adding it would not create a degree-3 vertex or a cycle of length less than n.

Clarke-Wright: The original name is the "Saving" algorithm of Clarke and Wright (Clarke and Wright, 1964), which was initially proposed to solve the Vehicle Routing Problem. We start with a tour in which one city is the depot, and the traveler must return to the depot after visiting each city. The savings are the amount by which the tour is shortened if we combine two cities into a single tour, thereby bypassing the depot. We can perform this bypass if it does not create a cycle or cause a non-depot vertex to become adjacent to more than two other non-depot vertices.

Christofides: This algorithm was developed in Christofides (1976). To solve the TSP we construct a minimum spanning tree T for the set of cities. Next, we compute a minimum length matching M of the vertices of odd degree in T. Combining M and T we obtain a connected graph in which every vertex has even degree. This graph must contain an Euler tour, i.e. a tour that passes through each edge exactly once. Such a cycle can be easily found.

Experimental results (Johnson et al., 2001b) show that these are, from best to worst: Christofides, then Clarke-Wright, then Greedy and finally nearest neighbour.

2.1.2 Tour improvement heuristics. Tour improvement heuristics can be used for postprocessing, i.e. each time we have a tour we can improve it using these local improvement algorithms.

2-opt and 3-opt: The 2-opt (Croes, 1992), and the 3-opt algorithms (Lin, 1965), are the most well-known local search algorithms. In the 2-opt algorithm each move consists of deleting two edges, breaking the tour into two paths, and reconnecting those paths in another possible way. In the 3-opt algorithm we have more possibilities, as by breaking the tour into three paths we have at least two possible resulting tours.
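The 2-opt move just described can be sketched as a naive local search (our own minimal implementation, not the chapter's code): deleting two edges and reconnecting the two paths amounts to reversing the segment between them.

```python
def tour_length(tour, dist):
    # closed tour: includes the edge from the last city back to the first
    return sum(dist[tour[i - 1]][tour[i]] for i in range(len(tour)))

def two_opt(tour, dist):
    """Repeat 2-opt moves (segment reversals) until no reversal shortens
    the tour; recomputing full lengths keeps the sketch short."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour

dist = [[0, 2, 9, 10],   # made-up symmetric distances for 4 cities
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(tour_length(two_opt([0, 2, 1, 3], dist), dist))  # 18
```

Real implementations evaluate only the two changed edges per move instead of recomputing the whole tour length.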

2.5-opt and Or-opt: Based on the 2-opt and 3-opt algorithms, some authors have created slightly more complicated algorithms, for instance 2.5-opt and Or-opt. In the 2.5-opt algorithm (Bentley, 1992), we expand the 2-opt heuristic to include a simple form of 3-opt move that can be found with little extra effort. We also have the Or-opt heuristic (Or, 1976).


Using 3-opt moves, the Or-opt heuristic takes a segment consisting of three or fewer consecutive cities and places it between two tour neighbours elsewhere in the tour.

Lin-Kernighan heuristic (LK): Perhaps the best local search algorithm for the TSP is the LK heuristic (Lin and Kernighan, 1973). It is based on 2-opt and 3-opt but it also uses some ideas that we will see later in Tabu Search. These ideas are based on avoiding some types of move depending on the contents of two different lists. For more information about this heuristic, we refer the reader to Johnson et al. (2001a).

The most widely-used tour improvement heuristics are LK, 2-opt and 3-opt. The best solutions are obtained with LK.

2.2 Modern heuristics

Besides the previous heuristics, which are used to create tours and to improve existing tours, there are some modern heuristic techniques which have been used for the TSP. Most of these heuristics use the idea of Neighbourhood Search (NS). NS is a widely used method for solving combinatorial optimization problems. A good introduction to NS can be found in Reeves (1993).

Step 1. Select a starting solution x_now ∈ X.

Step 2. Record the current best-known solution by setting x_best = x_now and define best_cost = c(x_best).

Step 3. Choose a solution x_next ∈ N(x_now). If the choice criteria cannot be satisfied, or other termination criteria apply, then the method stops.

Step 4. Re-set x_now = x_next, and if c(x_now) < best_cost, perform Step 2. Then return to Step 3.

Figure 10.1 Neighbourhood Search Method.
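The four steps of Figure 10.1 can be transcribed as a generic loop; the neighbour-choice rule and the extra step budget are supplied by the caller (the names and the `max_steps` cutoff are our additions):

```python
# Figure 10.1 as a generic loop; choose_neighbour returns None when the
# choice criteria cannot be satisfied (part of Step 3's termination test).
def neighbourhood_search(x_now, cost, choose_neighbour, max_steps=1000):
    x_best, best_cost = x_now, cost(x_now)           # Step 2
    for _ in range(max_steps):                       # other termination criteria
        x_next = choose_neighbour(x_now)             # Step 3
        if x_next is None:
            break
        x_now = x_next                               # Step 4
        if cost(x_now) < best_cost:
            x_best, best_cost = x_now, cost(x_now)   # re-perform Step 2
    return x_best, best_cost

# toy usage: walk an integer down towards the minimum of c(x) = x^2
print(neighbourhood_search(10, lambda x: x * x,
                           lambda x: x - 1 if x > 0 else None))  # (0, 0)
```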

A solution is specified by a vector x, the set of feasible solutions is denoted by X, and the cost of a solution is denoted by c(x). Each solution x ∈ X has an associated set of neighbours, N(x) ⊂ X, known as the neighbourhood of x. Each solution x_next can be reached directly from x_now by a single move. The type of neighbourhood will depend on the heuristic method used. Modern heuristics based on the idea of NS are Simulated Annealing (SA) and Tabu Search (TS).

Simulated Annealing (SA): This technique was originally proposed around twenty years ago by Kirkpatrick et al. (1983). It works by searching the set of all possible solutions, reducing the chance of getting stuck in a poor local optimum.


It does this by allowing moves to worse solutions under the control of a randomized scheme whose effect is determined by a "temperature" parameter. This parameter is initially high, allowing many inferior moves to be accepted, and is slowly reduced to a value where inferior moves are usually not accepted. SA has a close analogy with the thermodynamic process of annealing in physics. To solve the TSP with SA, Kirkpatrick et al. (1983) suggest the use of a neighbourhood structure based on 2-opt moves.

Tabu Search (TS): TS tries to model human memory processes, by recording previously seen solutions in simple but effective data structures. We create a tabu list of moves which have been made in the recent past, and which are forbidden for a certain number of iterations. This helps to avoid cycling and serves to promote a diversified search of solutions. While exploring the neighbourhood of a solution TS evaluates all the moves in a candidate list. The number of moves examined is one parameter of the search. The best move on the candidate list is accepted, unless it is in the tabu list. TS introduces diversification when there are no improvements from the moves available. This modern heuristic was developed by Glover (1986), and all the basic concepts can be found in Glover and Laguna (1993). Some TS heuristics for the TSP use 2-opt exchange as their basic move.

2.3 Evolutionary Algorithms

Evolutionary Algorithms are based on a model of natural evolution. Within these algorithms we can identify three different branches: Genetic Algorithms, Evolution Strategies and Evolutionary Programming. These algorithms are based on an initial population, which by means of selection, mutation and recombination evolve toward better regions in the search space. Individuals are measured using an objective function.

Genetic Algorithms (GAs): GAs (Holland, 1975) are based on the idea of biological evolution. A GA operates on populations of chromosomes (strings representing possible solutions). New chromosomes are produced by combining members of the population and replacing existing chromosomes. There are two operators commonly used in GAs: crossover and mutation. Using crossover we perform a type of neighbourhood search, and using mutation we can introduce some noise into the population to help avoid local minima. GAs have been widely used for solving the TSP. Experimental results (Larrañaga et al., 1999) show the superiority of the following operators: Genetic Edge Recombination Crossover (ER) (Whitley et al., 1989), Order Crossover (OX1) (Davis, 1985), Position Based Crossover (POS) (Syswerda, 1991) and Order Based Crossover (OX2) (Syswerda, 1991). More modern crossover operators are edge-2 and edge-3 (Mathias and Whitley, 1992), improvements of the ER crossover operator, and the maximum preservative crossover (MPX) (Freisleben and Merz, 1996).


GAs which include interaction with local searches (adaptive or not) are known as Memetic Algorithms (MAs) (Moscato, 1999). A key feature of the MA implementation is the use of available knowledge about the specific problem. In different contexts MAs are also known as Hybrid GAs, Genetic Local Search, etc.

Evolution Strategies (ES): ESs were born in 1964 at the Technical University of Berlin in Germany (Rechenberg, 1973). The first example was a simple mutation-selection mechanism working on one individual, which created one offspring per generation by means of Gaussian mutation. Another initial proposal from the University of Berlin was a multimembered ES in which one or more individuals were recombined to form one offspring. These strategies provided the basis for the (μ + λ)-ES, in which the μ best individuals are selected from a population of λ individuals. Individuals in ESs are vectors of real numbers. The main loop in an ES algorithm consists of recombination, mutation, evaluation and selection. Recombination produces one new individual from the selected parent individuals. In many ways this approach is similar to GAs, except that the primary operator is mutation and parameters are adapted as the search progresses.

Herdy and Patone (1994) use an ES to solve the TSP. In this solution four different mutation operators are created: inversion of a segment of the tour, insertion of a town at another point in the tour, reciprocal exchange of two towns, and displacement of a segment of the tour. A new recombination operator is also needed, to ensure that recombination produces only valid tours. It is important to note that this ES is based on individuals which are vectors of integer numbers rather than real numbers.

Evolutionary Programming (EP): A complete description of an EP algorithm is given in Fogel (1992). EP is similar to ESs, but has no recombination operator, and its fitness evaluation, mutation and selection differ from the corresponding operators in ESs (Bäck and Schwefel, 1993). The operator called mutation creates all the changes in the population between one generation and the next.

Much interesting information about evolutionary computation can be found in ENCORE, the Evolutionary Computation Repository Network. ENCORE is mirrored across several web pages and ftp sites.

We know that there are many TSP algorithms missing from this review. More detailed information, a bibliography of TSP-related papers, and software can be found on the following web page:

http://www.densis.fee.unicamp.br/~moscato/TSPBIB_home.html

Solving the Traveling Salesman Problem with EDAs 217

3. A new approach: Solving the TSP with EDAs

In this section we introduce a new heuristic for the TSP based on the use of EDAs.

We can use two different EDA approaches for the TSP. The first uses discrete EDAs, in which individuals are vectors of integer numbers. The second uses continuous EDAs, in which individuals are represented by vectors of real numbers. Both approaches need some modifications of the standard EDAs. These modifications are described in the following sections.

3.1 Using discrete EDAs

With discrete EDAs, learning is based on Bayesian networks, and all the calculations use integer numbers. We represent individuals using the path representation. In this representation, the n cities that should be visited are ordered into a list of n elements, so that if city i is the jth element of the list, city i is the jth city to be visited. The fitness function of individuals is easy to compute by just adding all distances between adjacent cities.
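Under the path representation, the fitness is just the length of the closed tour (a minimal sketch; the 4-city distance matrix is a hypothetical example, not data from the text):

```python
def tour_length(tour, dist):
    # sum the distances between adjacent cities, closing the cycle at the end
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

# hypothetical symmetric 4-city distance matrix
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 8],
        [10, 4, 8, 0]]
print(tour_length([0, 1, 3, 2], dist))  # 2 + 4 + 8 + 9 = 23
```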

Step 1. Generate M individuals (the initial population) randomly

Step 2. Select N individuals, N ≤ M, from the population, according to a selection method

Step 3. Estimate the probability distribution of an individual being among the selected individuals

Step 4. Sample M individuals (the new population) from the probability distribution created earlier

Step 5. If a stopping criterion is met stop, else go to Step 2

Figure 10.2 Pseudocode for the EDA approach.
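For a simple binary problem (not the TSP itself), the loop in Figure 10.2 reduces to the following UMDA-style sketch, in which the estimated "probability distribution" is just one independent marginal frequency per variable (an illustrative sketch, not the Bayesian-network learning discussed below):

```python
import random

def umda(f, n, M=100, generations=50):
    # Step 1: generate M random individuals
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    for _ in range(generations):
        # Step 2: truncation selection of the best half (N = M/2)
        selected = sorted(pop, key=f, reverse=True)[:M // 2]
        # Step 3: estimate one marginal frequency per variable
        p = [sum(ind[i] for ind in selected) / len(selected) for i in range(n)]
        # Step 4: sample a new population from those marginals
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(M)]
    # Step 5 (stopping criterion): here, a fixed number of generations
    return max(pop, key=f)

best = umda(sum, n=20)  # maximize OneMax: the count of ones
```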

Figure 10.2 gives pseudocode for this approach. In Step 1 we generate M individuals, where M is the size of the population that we are using. These individuals are generated randomly and must represent a correct tour, i.e. we visit every city precisely once. Step 2 to Step 5 form the main loop of the algorithm, and this loop is repeated until the stopping criterion is met. This stopping criterion can be, for example, reaching a certain number of generations, or the convergence of the Bayesian network. The loop contains three operations. The first is the selection of N individuals according to a selection method. In our experiments we have always selected the best half of the population, i.e. N = M/2. The second is the estimation of the probability distribution. Depending on the learning method used in the EDA, we will estimate different Bayesian network structures. For the TSP we have used the following learning methods: UMDA (Mühlenbein, 1998), MIMIC (De Bonet et al., 1997), TREE (Chow and Liu, 1968) and EBNA (Etxeberria and Larrañaga, 1999). The last step is sampling the Bayesian network. In standard discrete EDAs we have a problem with this step, because it is possible to generate incorrect tours in which one or more cities are not visited, or are visited more than once. To solve this problem we apply the ATM (All Time Modification) method (Bengoetxea et al., 2000), which ensures that all the generated individuals are correct. When doing the sampling, we must make sure that no number (city) is repeated. To achieve this, the ATM method dynamically modifies the sampling to avoid generating a number that has already appeared. With this approach our Bayesian network will have n variables, each with n possible values. The advantage of this method is that we always create correct individuals (tours), but it has a serious disadvantage: we are interfering with the sampling, and thus distorting the learned distribution.
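The idea behind this kind of constrained sampling can be illustrated as follows: when drawing the city for each position, cities that have already been placed are removed and the remaining probabilities are renormalized (a simplified sketch using independent per-position marginals rather than a Bayesian network; the function and variable names are illustrative, and the actual ATM method operates on the network's sampling order):

```python
import random

def sample_tour(probs):
    # probs[pos][city]: learned probability of placing `city` at position `pos`
    n = len(probs)
    used, tour = set(), []
    for pos in range(n):
        cities = [c for c in range(n) if c not in used]
        weights = [probs[pos][c] for c in cities]
        if sum(weights) == 0:
            # every remaining city had probability 0: fall back to uniform
            weights = [1.0] * len(cities)
        city = random.choices(cities, weights=weights)[0]
        tour.append(city)
        used.add(city)
    return tour
```

Every sampled individual is a valid permutation, at the price of sampling from a distribution that differs from the one actually learned.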

The sampled individuals are introduced into the population in an elitist way, that is, replacing the worst individual in the population if a new individual is better than it.

The results expected from this approach are not too exciting because, first, depending on the number of cities, there are many variables and possible values to learn, and second, the ATM method decisively influences the probability distribution.

3.2 Using continuous EDAs

With the use of continuous EDAs, learning is based on Gaussian networks, and all the calculations use real numbers. In this approach, individuals in a population are represented by vectors of real numbers. Thus, we need a method to translate these real vectors into a valid tour for the TSP. Figure 10.3 shows one such translation.

Original vector:   1.34   2.14   0.17   0.05  -1.23   2.18
Resulting tour:       4      5      3      2      1      6

Figure 10.3 Translation of an individual to a valid tour.

This is a 6-city example. In the original vector the generated real numbers lie between -3 and 3. The resulting tour is an integer vector in which each element is the rank of the corresponding value of the original vector after sorting. Thus, the fitness function for individuals is more complex to compute: first we must obtain the resulting tour, and then we apply the same formula used in discrete EDAs.
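The translation in Figure 10.3 amounts to replacing each component by its 1-based rank after sorting. A sketch (the function name is illustrative; the example numbers are those of the figure):

```python
def vector_to_tour(vec):
    # rank (1-based) of each component after sorting = position of that city
    order = sorted(range(len(vec)), key=lambda i: vec[i])
    ranks = [0] * len(vec)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

print(vector_to_tour([1.34, 2.14, 0.17, 0.05, -1.23, 2.18]))  # [4, 5, 3, 2, 1, 6]
```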

The pseudocode for continuous EDAs is the same as that used for discrete EDAs. In general there are two main differences between discrete and continuous EDAs: when estimating the probability distribution we learn a Gaussian network, and to calculate the fitness function we must first compute the resulting tour. Sampling of the Gaussian network is done using a simple method (Box and Muller, 1958).
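The Box-Muller transform turns two independent uniform deviates into two independent standard normal deviates; a univariate Gaussian with mean μ and standard deviation σ is then obtained as μ + σz. A minimal sketch of the method referenced above:

```python
import math
import random

def box_muller():
    # two uniforms on (0, 1] -> two independent N(0, 1) deviates
    u1 = 1.0 - random.random()   # shift away from 0 to avoid log(0)
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)
```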

For continuous EDAs the following learning methods are used here: UMDAc, MIMICc, EGNA and EMNA. For detailed information about these learning types see Chapter 3 in this volume.

We still have the same problem with the large number of variables and possible values to be learnt. Despite this, continuous EDAs seem to be a better algorithm for the TSP.

3.3 Use of local optimization with EDAs for the TSP

As we saw in Section 2, GAs which include interaction with local searches (adaptive or not) are known as Memetic Algorithms (MAs) (Moscato, 1999). A key feature, present in most MA implementations, is the use of a population-based search which attempts to use all available knowledge about the problem. From this point of view, if we can find an algorithm that introduces local search techniques into EDAs, this will be a type of MA, but one using EDAs rather than GAs.

In Freisleben and Merz (1996) there is pseudocode which summarizes the possibilities for incorporating TSP heuristics into a GA. This pseudocode is shown in Figure 10.4.

Step 1. Create the initial population by a tour construction heuristic
Step 2. Apply a tour improvement heuristic to the initial population
Step 3. Selection: select parents for mating
Step 4. Recombination: perform heuristic crossover
Step 5. Apply a tour improvement heuristic to offspring
Step 6. Mutation: mutate individuals with a given probability
Step 7. Replacement: replace some parents with new offspring
Step 8. If not converged go to Step 3
Step 9. Perform postprocessing by applying a tour improvement heuristic

Figure 10.4 Possibilities for incorporating TSP heuristics into a GA.

Previous attempts to use a TSP heuristic at particular steps of this template were rather discouraging. For example, Grefenstette (1987) used a heuristic crossover, tour construction heuristics and local hill-climbing in some experiments, and reported results that, although better than "blind" genetic search, were worse than those produced by a simple 2-opt tour improvement heuristic. Another example is Suh and van Gucht (1987), who applied 2-opt to some individuals of a GA population, and reached a quality of 1.73 times the optimum for a 100-city problem. Using only 2-opt and Or-opt as tour improvement operators, solutions can be found with an average quality of 1.37 times the optimum.

If we want to introduce TSP heuristics into EDAs, then we can apply the same concept as used in GAs. The resulting algorithm is shown in Figure 10.5.

Step 1. Create the initial population randomly

Step 2. Apply a tour improvement heuristic to the initial population

Step 3. Select individuals according to a selection method

Step 4. Estimate the probability distribution of an individual being among the selected individuals

Step 5. Sample individuals from the probability distribution. Apply a tour improvement heuristic to each new individual

Step 6. If the stopping criterion is not met go to Step 3

Step 7. Select the best individual in the last generation

Figure 10.5 Using local search techniques in EDAs. Heuristic EDAs.

The heuristic EDA is quite similar to the initial one, with the difference that a tour improvement heuristic is applied to the individuals of the initial population and to all newly sampled individuals. We have chosen the 2-opt algorithm as the tour improvement heuristic here. Despite being a very basic algorithm, this heuristic EDA has produced better results than standard EDAs, as shown in a later section.
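A minimal version of the 2-opt improvement heuristic used here can be sketched as follows (an illustrative sketch; practical implementations compute the length change of a move in constant time instead of re-evaluating the whole tour):

```python
def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def two_opt(tour, dist):
    # keep reversing segments while a reversal shortens the tour
    best = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, dist) < tour_length(best, dist):
                    best, improved = candidate, True
    return best
```

The result is a 2-opt local optimum: a tour that no single segment reversal can improve.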

4. Experimental results with EDAs

4.1 Introduction

Faced with the impossibility of carrying out an analytic comparison of the different EDAs presented in the previous section, we have carried out an empirical comparison between the different combinations of EDAs, learning types and local optimization.

With these experiments we want to measure the performance of the various EDAs in two main aspects: quality of the results and speed. Besides this, we also want to compare these results with other heuristics commonly used for the TSP. The experiments have been carried out on a Pentium II Xeon 500 MHz with 1 MB cache and 512 MB RAM under the Sun Solaris 2.7 operating system.

The following data files have been used in the empirical study: the well known Grostel24, Grostel48 and Grostel120. These can be obtained from many web or ftp sites. They represent the distances between 24, 48 and 120 imaginary cities. They are often used to assess the capabilities of TSP algorithms, and constitute a classic set of TSP experiments.

We focus on both discrete and continuous EDAs, both with and without local optimization. In discrete EDAs, we use the following learning methods: UMDA, MIMIC, TREE and EBNA, while in continuous EDAs we use: UMDAc, MIMICc, EGNA and EMNA. Discrete EDAs will be compared with GAs, which are the most similar heuristics inside the Evolutionary Computation field.

Results for the GAs are taken from the literature (Larrañaga et al., 1999), which uses the GENITOR algorithm (Whitley et al., 1989). In this algorithm, only one individual is created in each iteration. The new individual replaces the worst individual in the current population, but only if its evaluation function value is better. This is known as steady state. The criterion used to stop the algorithm is twofold: if the average cost of the population has not decreased in 1000 successive iterations the algorithm is stopped, and no more than 50000 evaluations are allowed in total. In the experiments shown here the following parameters have been used: population size 200, mutation probability 0.01 and selective pressure 1.90.
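The steady-state replacement just described can be sketched as follows (an illustrative sketch; the rest of the GENITOR machinery, including crossover and rank-based parent selection, is omitted):

```python
def steady_state_step(pop, cost, new_individual):
    # GENITOR-style replacement: the newcomer replaces the worst
    # individual, but only if its evaluation (tour cost) is better
    worst = max(range(len(pop)), key=lambda i: cost(pop[i]))
    if cost(new_individual) < cost(pop[worst]):
        pop[worst] = new_individual
    return pop
```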

For each of the combinations shown in the experiment, we have done 10 searches.

4.2 Results

4.2.1 Grostel24. Table 10.1 shows the best and average results obtained for each combination of population size, local optimization, and learning type for EDAs. For comparison, the table also shows results obtained with the GA using the crossover operators ER and OX2 and the SM (Scramble

Table 10.1 Tour length for the Grostel24 problem.

             Population & Local Optimization
             500-without     500-with        1000-without    1000-with
             Best   Aver     Best   Aver     Best   Aver     Best   Aver

Local-opt.   1272   1285                     1272   1272
GA-ER*       1272   1272
GA-OX2*      1300   1367

UMDA         1339   1495     1272   1272     1329   1496     1272   1272
MIMIC        1391   1486     1272   1272     1328   1451     1272   1272
TREE         1413   1486     1272   1272     1429   1442     1272   1272
EBNA         1431   1528     1272   1272     1329   1439     1272   1272

UMDAc        1289   1289     1272   1272     1289   1464     1272   1272
MIMICc       1289   1289     1272   1272     1300   1560     1272   1272
EGNA         1289   1306     1272   1272     1289   1307     1272   1272
EMNA         1289   1289     1272   1272     1272   1285     1272   1272

* Population size 200, SM mutation. Optimum: 1272.

Table 10.2 Number of generations and execution time for the Grostel24 problem.

             Population & Local Optimization
             500-without     500-with        1000-without    1000-with
             Gen.   Time     Gen.   Time     Gen.   Time     Gen.   Time

UMDA          75    00:14     19    00:27     78    00:55     12    00:35
MIMIC         47    00:09      4    00:06     58    00:36      4    00:12
TREE          37    08:58      4    00:51     46    22:29      2    00:57
EBNA          72    01:00     16    00:35     79    01:50      7    00:28

UMDAc        233    00:24     10    00:08    265    02:20      7    00:11
MIMICc       184    00:06      8    00:06    306    03:03      7    00:11
EGNA         263    05:31      7    00:25    298    06:05      6    00:20
EMNA          56    03:17      8    00:20     59    03:44      5    00:20

Mutation) (Syswerda, 1991) mutation operator. Taking into account the number of iterations needed by the EDAs, we also give results obtained using only local optimization.

We can now analyze the quality of these results. All learning types have similar results, with continuous EDAs giving much better results than discrete EDAs. Using continuous EDAs without local optimization, we only reach the optimum (1272) with the EMNA learning type, although the results for all the learning types are near this optimum.

Another interesting aspect to analyse is local optimization. Using EDAs with local optimization, all our tests reached the optimum, which is an improvement on the results obtained using only local optimization. Local optimization takes much of the total execution time, but makes the algorithm converge faster. Compared with the GA, this is better than the OX2 operator and equal to the ER operator.

From the running times shown in Table 10.2, we can see that, in general, MIMIC is the best learning type for both discrete and continuous EDAs. In both kinds of algorithm, using MIMIC and a population of 500 individuals, we reach the optimum solution in an average time of 6 seconds. Unfortunately we cannot compare these times with those obtained for GAs because they are not available to us.

In these results, a smaller population often gives better results. This is contrary to the intuition that more individuals imply better learning; however, fewer individuals may introduce some kind of "noise" into the system, and this noise could act as a type of mutation.

4.2.2 Grostel48. The results for Grostel48 are shown in Table 10.3. The discrete EDA results are not very good, and continuous EDAs are shown to be more efficient. In this test, we do not reach the optimum tour length (5046) without the help of local optimization, but with UMDAc, MIMICc and EGNA we have reached values only 1.015 times the optimum. Again, with the use of local optimization we get much better solutions, and by using it in continuous EDAs we frequently achieve the optimum.

The most significant differences are found in the running times, as shown in Table 10.4. Using UMDA and MIMIC with local optimization in continuous EDAs, the running time is about 5 minutes, but the running time of EMNA is several hours. For this reason, we have not tested EMNA with a population of 1000 individuals without local optimization. As before, the best algorithm is MIMIC.

Regarding the population size, the number of individuals is not decisive, and we can reach similar or even better solutions using smaller populations.

4.2.3 Grostel120. Results obtained for the Grostel120 problem are similar to those obtained for Grostel48. Without local optimization the continuous EDAs are the best algorithms, and a surprising result is that with a population of 1000 individuals the algorithm does not converge to a correct solution. Using

Table 10.3 Tour length for the Grostel48 problem.

             Population & Local Optimization
             500-without     500-with        1000-without    1000-with
             Best   Aver     Best   Aver     Best   Aver     Best   Aver

Local-opt.   5200   5290                     5188   5272
GA-ER*       5074   5138
GA-OX2*      5251   5715

UMDA         6715   7432     5079   5149     6683   7388     5067   5139
MIMIC        6679   7083     5046   5053     6104   6717     5046   5057
TREE           -      -      5046   5071       -      -      5046   5057
EBNA         7044   7476     5165   5193     6398   7336     5114   5146

UMDAc        5142   5248     5046   5048     5122   5245     5046   5046
MIMICc       5122   5176     5046   5046     5150   5228     5046   5050
EGNA         5122   5249     5046   5046     5129   5148     5046   5046
EMNA         5336   5532     5046   5048       -      -      5046   5046

* Population size 200, SM mutation. Optimum: 5046.

Table 10.4 Number of generations and execution time for the Grostel48 problem.

             Population & Local Optimization
             500-without     500-with        1000-without    1000-with
             Gen.   Time     Gen.   Time     Gen.   Time     Gen.   Time

UMDA         362    01:55     47    01:20    218    03:01     54    03:12
MIMIC        167    00:53     23    00:45    113    02:01     18    01:10
TREE           -      -        8    22:37      -      -        7    50:09
EBNA         306    52:16     63    12:02    261    47:50     65    14:45

UMDAc        481    01:59     78    04:10    327    04:03     52    05:16
MIMICc       327    01:17    126    06:47    300    03:51     59    05:59
EGNA         381    15:46     67    16:24   1905    30:10     42    16:14
EMNA          99  4:23:05     36  1:38:44      -      -       49  2:14:03

local optimization, MIMIC is again the best algorithm. The average result in

Table 10.5 Tour length for the Grostel120 problem.

             Population & Local Optimization
             500-without     500-with        1000-without    1000-with
             Best   Aver     Best   Aver     Best   Aver     Best   Aver

UMDA        14550  15530     7171   7257    14440  15127     7287   7298
MIMIC       13644  14432     7050   7092    12739  13444     7042   7079

UMDAc        7546   7667     7077   7113    39692  40344     7076   7103
MIMICc       7658   7767     7055   7078    35863  39246     7053   7101

Optimum: 6942.

Table 10.6 Number of generations and execution time for the Grostel120 problem.

             Population & Local Optimization
             500-without     500-with        1000-without    1000-with
             Gen.   Time     Gen.   Time     Gen.   Time     Gen.   Time

UMDA         385    22:46     55  1:42:52    368    52:10     42  2:40:58
MIMIC        306    52:08     51  1:03:09    348  1:44:59     42  1:42:55

UMDAc       1078    32:40     95  1:11:49    425    36:30     65  1:39:30
MIMICc      1284    42:32    113  1:25:43    545    47:50     67  1:42:55

continuous EDAs with MIMICc is only 1.02 times the optimum of 6942.

The learning curves of discrete and continuous EDAs without local optimization are also interesting. Figure 10.6 shows how the learning curves evolve with respect to time. Discrete EDAs begin to converge more quickly, but their final result is worse than in the continuous case. A possible solution is a combination of discrete and continuous EDAs: the first generations could use discrete EDAs, with later generations using continuous EDAs.

Although we are aware that the experiments on these three test files do not allow us to generalize the results obtained to other TSP problems, a certain uniformity of behaviour can be seen in these examples. Here, algorithms using UMDA and MIMIC learning gave the best results.

In Section 3 we discussed the ATM modification (Bengoetxea et al., 2000) needed to use discrete EDAs for the TSP. We think that the need

[Figure: fitness (tour length) plotted against time in seconds, comparing discrete and continuous EDAs.]

Figure 10.6 Learning curves for a 120-cities problem. Discrete and continuous EDAs with UMDA learning.

to use this modification (or a similar one) to fulfil the permutation constraint is probably the reason for the poor solutions found by this approach.

The use of local optimization has been very successful, giving solutions quite near to the optimum. In our opinion, the speed of EDAs for solving this problem must be improved. The use of more specific EDA implementations for the TSP can help improve their speed, but the real problem is that a few more cities can mean a much greater execution time.

5. Conclusions

In this paper, EDAs, a new tool in Evolutionary Computation, have been applied to the TSP. We have also incorporated domain knowledge into the problem resolution by using local search optimization based on the 2-opt algorithm. The feasibility of the proposed approach has been demonstrated by presenting performance results for TSP instances of between 24 and 120 cities. As this is the first use of EDAs for the TSP, there are many issues for future research. For example, the efficiency of the implementation could be increased to reduce computation times, and other types of local search heuristics could be used in the algorithm. We also need more tests of the proposed algorithm, with the population size depending on the number of cities.

References

Bäck, T., Rudolph, G., and Schwefel, H.-P. (1993). Evolutionary programming and evolution strategies: Similarities and differences. Technical report, University of Dortmund, Department of Computer Science, Germany.

Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A., and Boeres, C. (2000). Inexact graph matching using learning and simulation of Bayesian networks. An empirical comparison between different approaches with synthetic data. In Workshop Notes of CaNew2000: Workshop on Bayesian and Causal Networks: From Inference to Data Mining. Fourteenth European Conference on Artificial Intelligence, ECAI2000. Berlin.

Bentley, J. L. (1992). Fast algorithms for geometric travelling salesman problems. ORSA Journal on Computing, 4:125-128.

Box, G. E. P. and Muller, M. E. (1958). A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29:610-611.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467.

Christofides, N. (1976). Worst-case analysis of a new heuristic for the travelling salesman problem. Technical Report 388, Carnegie Mellon University.

Clarke, G. and Wright, J. W. (1964). Scheduling of vehicles from a central depot to a number of delivery points. Operations Research, 12:568-581.

Croes, G. A. (1958). A method for solving travelling salesman problems. Operations Research, 6:791-812.

Davis, L. (1985). Applying adaptive algorithms to epistatic domains. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 162-164.

De Bonet, J. S., Isbell, C. L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. In Mozer, M., Jordan, M., and Petsche, T., editors, Advances in Neural Information Processing Systems, volume 9.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence. CIMAF99. Special Session on Distributions and Evolutionary Optimization, pages 322-339.

Fogel, D. B. (1992). An analysis of evolutionary programming. In Proc. of the First Annual Conf. on Evolutionary Programming, pages 43-51.

Freisleben, B. and Merz, P. (1996). A genetic local search algorithm for solving symmetric and asymmetric traveling salesman problems. In Proc. IEEE Int. Conf. on Evolutionary Computation, pages 616-621.

Glover, F. (1986). Future paths for integer programming and links to Artificial Intelligence. Computers & Operations Research, 5:533-549.

Glover, F. and Laguna, M. (1993). Tabu search. In Modern Heuristic Techniques for Combinatorial Problems, pages 70-150. Blackwell Scientific Publications, Oxford.

Grefenstette, J. J. (1987). Incorporating problem specific knowledge into genetic algorithms. In Davis, L., editor, Genetic Algorithms and Simulated Annealing, pages 42-60. Morgan Kaufmann.

Herdy, M. and Patone, G. (1994). Evolution Strategy in action: 10 ES-demonstrations. In International Conference on Evolutionary Computation. The Third Parallel Problem Solving From Nature.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

Johnson, D. S., Aragon, C. R., McGeoch, L. A., and Schevon, C. (2001a). Optimization by simulated annealing: An experimental evaluation. Part III (the travelling salesman problem). In preparation.

Johnson, D. S., Bentley, J. L., McGeoch, L. A., and Rothberg, E. E. (2001b). Near optimal solutions to very large travelling salesman problems. In preparation.

Johnson, D. S. and McGeoch, L. A. (1997). The traveling salesman problem: a case study. In Aarts, E. H. L. and Lenstra, J. K., editors, Local Search in Combinatorial Optimization, pages 215-310. John Wiley and Sons, London.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220:671-680.

Larrañaga, P., Kuijpers, C. M. H., Murga, R. H., Inza, I., and Dizdarevic, S. (1999). Genetic algorithms for the travelling salesman problem: A review of representations and operators. Artificial Intelligence Review, 13:129-170.

Lin, S. (1965). Computer solutions of the travelling salesman problem. Bell System Technical Journal, 44:2245-2269.

Lin, S. and Kernighan, B. W. (1973). An effective heuristic algorithm for the travelling salesman problem. Operations Research, 21:498-516.

Mathias, K. and Whitley, D. (1992). Genetic operators, the fitness landscape and the traveling salesman problem. In Männer, R. and Manderick, B., editors, Parallel Problem Solving from Nature, pages 219-228. Elsevier.

Matthews, R. A. J. (1993). The use of genetic algorithms in cryptanalysis. Cryptologia, XVII(2):187-201.

Moscato, P. (1999). Memetic algorithms: A short introduction. In Corne, D., Glover, F., and Dorigo, M., editors, New Ideas in Optimization, pages 219-234. McGraw-Hill.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Or, I. (1976). Travelling Salesman-Type Combinatorial Problems and their Relation to the Logistics of Regional Blood Banking. Ph.D. Thesis, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.

Rechenberg, I. (1973). Optimierung Technischer Systeme Nach Prinzipien der Biologischen Information. Fromman Verlag, Stuttgart.

Reeves, C. R. (1993). Modern Heuristic Techniques for Combinatorial Problems. Blackwell Scientific Publications, Oxford.

Spillman, R., Janssen, M., Nelson, B., and Kepner, M. (1993). Use of a genetic algorithm in the cryptanalysis of simple substitution ciphers. Cryptologia, XVII(1):31-44.

Suh, J. Y. and van Gucht, D. (1987). Incorporating heuristic information into genetic search. In Grefenstette, J. J., editor, Proc. of the Second Int. Conf. on Genetic Algorithms, pages 100-107. Lawrence Erlbaum.

Syswerda, G. (1991). Schedule optimization using genetic algorithms. In Davis, L., editor, Handbook of Genetic Algorithms, pages 332-349. Van Nostrand Reinhold.

Whitley, D., Starkweather, T., and Fuquay, D. (1989). Scheduling problems and travelling salesmen: The genetic edge recombination operator. In Schaffer, J., editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 133-140. Morgan Kaufmann Publishers.

Chapter 11

Estimation of Distribution Algorithms Applied to the Job Shop Scheduling Problem: Some Preliminary Research

J. A. Lozano
Department of Computer Science and Artificial Intelligence

University of the Basque Country

[email protected]

A. Mendiburu
Department of Computer Architecture and Technology

University of the Basque Country

[email protected]

Abstract In this chapter we apply discrete and continuous Estimation of Distribution Algorithms to the job shop scheduling problem. We borrow the most successful codifications and hybridizations from the Genetic Algorithms literature. Estimation of Distribution Algorithms are then applied with these elements to the Fisher and Thompson (1963) datasets. The results are comparable with those obtained with Genetic Algorithms.

Keywords: Estimation of Distribution Algorithms, job shop scheduling, hybridiza­tion, local search

1. Introduction

Estimation of Distribution Algorithms (EDAs) (Mühlenbein and Paaß, 1996; Larrañaga et al., 2000a; Larrañaga et al., 2000b) constitute a new tool in the Evolutionary Computation field. They can be considered as a generalization of Genetic Algorithms (GAs). In EDAs, the reproduction operators (crossover and mutation) are substituted by the estimation and sampling of the probability distribution of the selected individuals.

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002

This chapter applies EDAs to the job shop scheduling problem (Blazewicz et al., 1985; Blazewicz et al., 1996). The job shop scheduling problem is a classical NP-hard (Lenstra and Kan, 1979) combinatorial optimization problem. A set of n jobs {J_1, J_2, ..., J_n} is given, and each job J_i is composed of an ordered set of m operations {O_i^1, O_i^2, ..., O_i^m}. Each operation O_i^j needs to be processed on a machine from the set of m machines {M_1, M_2, ..., M_m}, and requires a time t_ij. Some restrictions are imposed on the jobs and the machines: no operation can be interrupted, each machine can handle only one job at a time, and two operations of the same job cannot be processed on the same machine. The most common optimization problem is to minimize the makespan, i.e. the time needed for the last job to finish. Other objective functions have been considered in the literature (Anderson et al., 1997).
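To make the definition concrete, the makespan of a schedule can be computed by greedily decoding a sequence of job repetitions: each occurrence of job index j schedules the next unprocessed operation of J_j as early as its job and machine allow. This permutation-with-repetition codification is common in the GA literature on this problem; the sketch below is illustrative and not necessarily the codification used in this chapter, and the 2-job instance is hypothetical:

```python
def makespan(seq, routes, times):
    """seq: job indices, job j appearing once per operation of J_j.
    routes[j][k]: machine index of J_j's k-th operation.
    times[j][k]: processing time of that operation."""
    n = len(routes)
    next_op = [0] * n        # next unscheduled operation of each job
    job_ready = [0] * n      # time at which each job's last operation ends
    mach_ready = {}          # time at which each machine becomes free
    for j in seq:
        k = next_op[j]
        m = routes[j][k]
        start = max(job_ready[j], mach_ready.get(m, 0))
        job_ready[j] = mach_ready[m] = start + times[j][k]
        next_op[j] = k + 1
    return max(job_ready)

# hypothetical 2-job, 2-machine instance
routes = [[0, 1], [1, 0]]    # J_0: M_0 then M_1; J_1: M_1 then M_0
times = [[3, 2], [2, 4]]
print(makespan([0, 1, 0, 1], routes, times))  # 7
```

Any sequence containing each job index exactly m times decodes to a feasible schedule, which is why this codification is popular with evolutionary methods.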

Many works have been devoted to solving this problem with exact as well as approximate algorithms. The best results seem to be reached with Tabu Search (Nowicki and Smutnicki, 1996; Balas and Vazacopoulos, 1998). However, Genetic Algorithms (GAs) have also been applied extensively to this problem. Probably the first such work is by Davis (1985), and since then many papers have appeared in the GA literature. We briefly review the methods that obtained the best results, in order to borrow their most successful components for our approach.

Ono et al. (1996) use a codification where each individual has length m × n. The individual is divided into m parts of length n; part i represents the order of the jobs on machine Mi. The authors design special crossover and mutation operators for this codification. The Giffler and Thompson algorithm (Giffler and Thompson, 1960) is applied to each individual of the population to obtain an active schedule and to avoid non-feasible individuals. The authors carry out experiments on the famous Fisher and Thompson (1963) datasets and compare the results with those obtained by their previous approach (Kobayashi et al., 1995), which used the same codification but a different crossover operator. Their new approach obtains much better results.

Yamada and Nakano (1995) use a GA with a multi-step crossover and a local search algorithm based on the critical block neighborhood (CB). The authors carry out an experimental comparison between the proposed algorithm and Simulated Annealing with CB, and the algorithms of Fang et al. (1993), Davidor et al. (1993) and Mattfeld et al. (1994), again on the Fisher and Thompson (1963) datasets. Simulated Annealing with CB and the algorithm proposed by the authors obtain the best results on the first and second datasets respectively.

The job shop scheduling problem has previously been solved with EDAs by Baluja and Davies (1998). The authors use an algorithm called COMIT (whose probabilistic model takes second order statistics into account), and hybridize it with hill climbing and with the PBIL algorithm (Baluja, 1994).


EDAs Applied to the Job Shop Scheduling Problem 233

This chapter is organized as follows. Section 2 introduces the codification and the particular EDAs used in the application to the job shop scheduling problem, and Section 3 the proposed hybridization. Section 4 reports experimental results, while in Section 5 we draw some conclusions.

2. EDAs in job shop scheduling problems

2.1 Codifications

Two kinds of codification, called C1 and C2, are used in this chapter. The choice of these codifications is based on the results obtained by the GAs that used them.

The first codification, C1, is a classical permutation-based codification. Each individual is a permutation of the numbers {1, 2, ..., m × n}. Of course a permutation does not directly define a schedule, therefore an algorithm has to be applied to decode the permutation.

The second codification, C2, is close to the first. An individual is a vector of length m × n, where each gene can take a value in {1, 2, ..., n}. The vector is divided into pieces of length n, and the ith piece of the individual represents the order of the jobs on the ith machine. Hence, we can consider an individual as a set of m permutations of the jobs. Although in some cases a schedule could be obtained directly from this codification, there are situations where this is not possible: for instance, when none of the first operations of the jobs is situated in the first place of the corresponding machine permutation. Therefore, as with the previous codification, we need to use a decoding algorithm.

2.2 Algorithms

For the job shop scheduling problem discrete as well as continuous EDAs are used. A review of EDAs can be found in Larrañaga (2001). In this chapter we use some of the algorithms that appear in Larrañaga et al. (2000a, 2000b). In particular, the discrete algorithms used in the experiments are UMDA, BSC, PBIL (α = 0.5), MIMIC, EBNAPC and EBNAK2+pen, and the algorithms for continuous domains are UMDAc, MIMICc and EGNABGe.

Although the use of continuous optimization methods to solve a combinatorial optimization problem may seem a paradox, this approach is not new. Rudolph (1991) uses Evolution Strategies to solve the TSP and, particularly for the job shop scheduling problem, the work by Bean and Norman (1993) can be consulted. Both obtain good results with this approach to the problem.

Given a real vector (x1, x2, ..., xm×n) of length m × n, it is easy to obtain an individual of codification C1. A permutation is obtained from it by ranking the positions using the values xi, i = 1, 2, ..., m × n. We can see it with an example in which m = n = 3. Suppose we have the real vector:

(2.35, 3.42, 9.35, 0.32, 11.54, 10.42, 5.23, 4.2, 7.8)

the permutation obtained is:

(2 3 7 1 9 8 5 4 6).

A similar argument can be used for codification C2, just by restricting the ranking to each vector piece of length n. Using the previous real vector we obtain the following C2 individual:

(1 2 3 1 3 2 2 1 3) .
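This ranking decoding can be sketched in a few lines of Python (the function names and data layout are illustrative assumptions, not the authors' code):

```python
def rank_permutation(values):
    """Assign rank 1 to the position holding the smallest value, rank 2
    to the next smallest, and so on (the ranking decoding of the text)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    perm = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        perm[idx] = rank
    return perm

def decode_c1(vector):
    """C1: one global ranking over the whole vector."""
    return rank_permutation(vector)

def decode_c2(vector, n):
    """C2: rank each consecutive piece of length n independently."""
    return [r for i in range(0, len(vector), n)
            for r in rank_permutation(vector[i:i + n])]
```

With the real vector above, `decode_c1` returns (2 3 7 1 9 8 5 4 6) and `decode_c2` with n = 3 returns (1 2 3 1 3 2 2 1 3).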

Another possible source of problems with the kind of codification proposed and discrete EDAs is the sampling of individuals that belong to the codifications. In each step of an EDA a probability distribution is learnt from the set of selected individuals, and this distribution is sampled to obtain new individuals. However, there is no discrete EDA in the literature that can learn a probability distribution over a set of permutations. In the most general situation these algorithms can learn a probability distribution over a set Ω = Ω1 × Ω2 × ... × Ωn, where Ωi = {1, 2, ..., ri} and ri ∈ N. Therefore the sampling cannot guarantee permutation individuals, but only an individual in Ω.

To obtain permutation-based individuals a modification of the sampling process has to be carried out. Usually, in those algorithms that use Bayesian networks to codify the probability distribution, the sampling is carried out by means of the PLS algorithm (Henrion, 1988). In this algorithm the variables are instantiated following an ancestral order: to sample the ith ordered variable, the previous (i − 1) variables have to be instantiated.

A permutation can be obtained if the ith variable is not allowed to take the values instantiated to the previous variables. To do that, when the ith variable has to be sampled, we set the probability of the previously sampled values to 0 and renormalize the local probabilities of the remaining values to sum 1. With these changes a permutation is obtained. Of course, a small modification allows us to sample individuals from codification C2.
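A rough sketch of this modified sampling follows, simplified to marginal probabilities per variable rather than the conditional distributions of a learnt Bayesian network (the function name and probability layout are our assumptions):

```python
import random

def sample_permutation(prob, rng=random):
    """Sample one permutation of {1..n}. prob[i][v] is the probability that
    variable i takes value v+1; values already used by previous variables
    are set to probability 0 and the rest renormalized, as in the text."""
    n = len(prob)
    used, perm = set(), []
    for i in range(n):
        weights = [0.0 if v + 1 in used else prob[i][v] for v in range(n)]
        total = sum(weights)
        if total == 0.0:
            # all remaining values had zero probability: fall back to uniform
            weights = [0.0 if v + 1 in used else 1.0 for v in range(n)]
            total = sum(weights)
        r, acc = rng.random() * total, 0.0
        for v in range(n):
            acc += weights[v]
            if r < acc:
                perm.append(v + 1)
                used.add(v + 1)
                break
    return perm
```

By construction every value appears exactly once, so the sampled individual always belongs to the permutation codification.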

3. Hybridization

The need for hybridization appeared early in the application of GAs to job shop scheduling (Davis, 1985; Husband et al., 1991; Starkweather et al., 1992). This hybridization was used in two different ways: first, to design crossover and mutation operators, and second, to apply local search algorithms departing from the individuals of the population. We follow the second approach, as the proposed codifications do not always yield feasible schedules and EDAs have no reproduction operators.

In our case we include the hybridization in the decoding process. Two algorithms are used to decode the individuals. The first, H1, is based on the


Algorithm H1

Step 1. Build a set S with the first operation of each job, S = {O1^1, O2^1, ..., On^1}
Step 2. Determine an operation O ∈ S with the earliest completion time
Step 3. Determine the set C of the operations of S that are processed on the same machine M as O, C = {Oi^j ∈ S | Mi^j = M}
Step 4. Obtain the set C' of the operations of C that start before the completion time of operation O
Step 5. Select the operation O* ∈ C' which occurs leftmost in the permutation and delete it from S, S = S \ {O*}
Step 6. Add operation O* to the schedule and calculate its starting time
Step 7. If operation O* is not the last operation of its job, add the next operation to S
Step 8. If S ≠ ∅, go to Step 2; else finish

Figure 11.1 Pseudocode for algorithm H1.

well-known Giffler and Thompson (1960) algorithm and produces active schedules. H1 has the interesting property that the set of active schedules contains the optimum schedules. A pseudocode for H1 can be seen in Figure 11.1. The pseudocode is for C1; however, it can be easily adapted to the second codification: it is enough to consider in Step 5 that the permutation refers to the one associated with machine M.
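To make the decoding concrete, here is a Python sketch of H1 for codification C1. The data layout (per-job machine and duration matrices, and the flat operation index j·m + k + 1) is an assumption made for illustration; it is not the authors' implementation:

```python
def h1_decode(perm, machines, times):
    """Active-schedule decoding (H1) of a C1 permutation (sketch).
    machines[j][k], times[j][k]: machine and duration of the k-th
    operation of job j; operation k of job j has flat index j*m + k + 1."""
    n, m = len(machines), len(machines[0])
    pos = {v: i for i, v in enumerate(perm)}   # leftmost = smallest index
    job_ready = [0] * n     # completion time of each job's last scheduled op
    mach_ready = [0] * m    # completion time of each machine
    next_op = [0] * n       # next unscheduled operation of each job
    schedule, S = [], set(range(n))            # S: jobs with pending operations
    start = lambda j: max(job_ready[j], mach_ready[machines[j][next_op[j]]])
    while S:
        # Step 2: operation with the earliest completion time
        o = min(S, key=lambda j: start(j) + times[j][next_op[j]])
        mach = machines[o][next_op[o]]
        completion = start(o) + times[o][next_op[o]]
        # Steps 3-4: conflict set on that machine, starting before `completion`
        conflict = [j for j in S
                    if machines[j][next_op[j]] == mach and start(j) < completion]
        # Step 5: leftmost operation of the conflict set in the permutation
        j = min(conflict, key=lambda j: pos[j * m + next_op[j] + 1])
        # Steps 6-7: schedule it and advance the job
        s = start(j)
        schedule.append((j, next_op[j], mach, s, s + times[j][next_op[j]]))
        job_ready[j] = s + times[j][next_op[j]]
        mach_ready[mach] = job_ready[j]
        next_op[j] += 1
        if next_op[j] == m:
            S.remove(j)
    return schedule
```

The makespan of the decoded individual is simply the largest completion time in the returned schedule.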

The second algorithm, H2, has been proposed by Bierwirth and Mattfeld (1999). The authors develop an algorithm that depends on a parameter δ. Active schedules (δ = 1) as well as non-delay schedules (δ = 0) can be obtained from H2; hence it can be seen as a mixture of both active and non-delay schedules. The problem with this decoding is that when δ < 1 it cannot be ensured that the optimum schedule is in the set of resultant schedules. However, the authors point out that better results can be obtained with this algorithm, because the set of resultant schedules is much smaller than the set of active schedules.

A pseudocode for H2 can be seen in Figure 11.2. The adaptation to the second codification is again straightforward.


Algorithm H2

Step 1. Build a set S with the first operation of each job, S = {O1^1, O2^1, ..., On^1}
Step 2. Determine an operation O ∈ S with the earliest completion time
Step 3. Determine the set C of the operations of S that are processed on the same machine M as O, C = {Oi^j ∈ S | Mi^j = M}
Step 4. Determine the operation O' of C with the earliest starting time
Step 5. Obtain the set C' of the operations of C whose starting time is not bigger than the starting time of O' plus δ times the difference between the completion time of O and the starting time of O'
Step 6. Select the operation O* ∈ C' which occurs leftmost in the permutation and delete it from S, S = S \ {O*}
Step 7. Add operation O* to the schedule and calculate its starting time
Step 8. If operation O* is not the last operation of its job, add the next operation to S
Step 9. If S ≠ ∅, go to Step 2; else finish

Figure 11.2 Pseudocode for algorithm H2.
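The essential difference with respect to H1 is the δ-threshold used to build the conflict set, which might be sketched as follows (our own reading of the step: with δ = 1 the filter keeps everything starting before the earliest completion time, as in H1; with δ = 0 only the operations that can start earliest survive):

```python
def h2_filter(start_times, completion_o, delta):
    """Sketch of H2's threshold step: keep the operations of the conflict
    set C whose start time does not exceed s' + delta * (c - s'), where s'
    is the earliest start time in C and c the earliest completion time
    (that of operation O). Returns indices into start_times."""
    s_prime = min(start_times)
    threshold = s_prime + delta * (completion_o - s_prime)
    return [i for i, s in enumerate(start_times) if s <= threshold]
```

Tuning δ between 0 and 1 therefore trades the guarantee of containing the optimum (active schedules) against a smaller search space.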


Table 11.1 Experimental results with continuous EDAs for FT10 × 10.

FT10 × 10            H1                H2 (δ = 0.5)
                 Best    Mean       Best    Mean
C1  UMDAc        967     974.7      937     947.2
    MIMICc       967     971.9      938     944.4
    EGNABGe      976     982.7      943     951.4
C2  UMDAc        967     978.5      937     946.3
    MIMICc       967     979.0      938     948.7
    EGNABGe      967     975.7      937     944.3

4. Experimental results

To evaluate the chosen EDAs we use the classic Fisher and Thompson (1963) datasets. These are two scheduling problems. The first, denoted FT10 × 10, is a problem with 10 jobs and 10 machines. The second, FT20 × 5, has 20 jobs and 5 machines. The optimum makespans for these problems are 930 and 1165 respectively. These datasets have been used previously in the works, referenced in the introduction, that apply GAs to the job shop scheduling problem. These problems took more than 20 years to be solved.

The experimental parameters differ depending on the kind of EDAs, discrete or continuous, we use. In the discrete case we use a population size of 20n × m, while in the continuous case the population size is 2n × m. These parameters were established after some preliminary experiments. In all the experiments the selection method chosen is truncation selection, and the new population is built using elitism. Two stopping conditions were used: a maximum number of function evaluations, and convergence (similarity between the individuals of two consecutive generations).
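The selection and replacement scheme just described might be organized as follows (a sketch under our own assumptions; `learn_and_sample` stands in for whichever EDA model is used and the exact elitism policy of the chapter is not spelled out):

```python
def next_generation(population, fitness, learn_and_sample, trunc=0.5):
    """One EDA generation with truncation selection and elitism (sketch).
    fitness: function to minimize (e.g. the makespan of the decoded schedule).
    learn_and_sample(selected, k): estimates a probability distribution from
    the selected individuals and samples k new ones (UMDA, MIMIC, EBNA, ...)."""
    ranked = sorted(population, key=fitness)
    selected = ranked[:max(1, int(len(population) * trunc))]  # truncation
    offspring = learn_and_sample(selected, len(population) - 1)
    # elitism: the best individual of the current population always survives
    return [ranked[0]] + offspring
```

Convergence can then be detected by comparing the individuals of two consecutive generations, as stated above.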

For each codification, each problem, each hybrid algorithm and each EDA, 10 independent runs were carried out. The results can be seen in Tables 11.1 to 11.4. Each table reports the best and mean values obtained in the 10 independent runs.

In view of the results with continuous EDAs we can say that the proposed algorithms perform well on these problems. While in FT10 × 10 they cannot obtain the optimum solution, in problem FT20 × 5, which is supposed to be more difficult, EGNABGe and MIMICc are able to reach the optimum. It seems that UMDAc performs slightly worse than the others, but there are no big differences between them.


Table 11.2 Experimental results with continuous EDAs for FT20 × 5.

FT20 × 5             H1                  H2 (δ = 0.5)
                 Best    Mean         Best    Mean
C1  UMDAc        1180    1183.50      1176    1178.00
    MIMICc       1180    1187.70      1165    1177.20
    EGNABGe      1178    1184.15      1167    1177.18
C2  UMDAc        1178    1182.20      1175    1177.90
    MIMICc       1178    1186.40      1178    1178.40
    EGNABGe      1178    1184.19      1165    1176.94

In the case of discrete EDAs the results are slightly worse than those obtained with continuous EDAs. In addition, discrete EDAs use population sizes 10 times bigger than continuous EDAs.

Analyzing the performance of the different EDAs considered here, we can say that there are no differences between the proposed codifications, C1 and C2. From both the discrete and continuous results it can be deduced that hybridization H2 obtains better results than H1. The fact that the algorithms are not able to reach the global optimum in FT10 × 10 could be because the optimum schedule cannot be obtained with H2 and parameter value δ = 0.5.

In addition, it is important to notice the robustness of the approaches, shown by the fact that the mean value differs little from the best value.

We can compare our results with those obtained by the approaches summarized in the introduction. Ono et al. (1996) seems the most robust approach: they obtain the optimum in FT10 × 10 around 80% of the time and in FT20 × 5 around 22%; however, a population size of 600 was used (compare it with the ones used by our continuous approaches). In the experimental results reported by Yamada and Nakano (1995) it can be seen that, for FT10 × 10, their algorithm reaches the optimum with an average of 934.5, i.e. on this dataset their algorithm performs better than our approaches (a population size of 500 was used). However, in FT20 × 5 both approaches perform the same: the authors obtain the optimum and a mean value of 1177.3 (with a population size of 100). Finally, our approach outperforms the algorithms proposed in Baluja and Davies (1998), where the best values reached are 953 in FT10 × 10 and 1196 in FT20 × 5.


Table 11.3 Experimental results with discrete EDAs for FT10 × 10.

FT10 × 10            H1                H2 (δ = 0.5)
                 Best    Mean       Best    Mean
C1  UMDA         992     1008.8     945     945.0
    BSC          991     1003.3     944     944.8
    PBIL         990     1011.1     943     944.4
    MIMIC        994     1004.0     943     946.3
C2  UMDA         985     999.0      943     944.8
    BSC          986     999.8      939     944.5
    PBIL         994     1003.7     940     944.7
    MIMIC        994     1005.3     944     946.1

Table 11.4 Experimental results with discrete EDAs for FT20 × 5.

FT20 × 5             H1                  H2 (δ = 0.5)
                 Best    Mean         Best    Mean
C1  UMDA         1196    1200.1       1178    1178.2
    BSC          1194    1198.7       1175    1177.7
    PBIL         1190    1201.9       1177    1178.5
    MIMIC        1196    1208.7       1176    1178.5
C2  UMDA         1191    1198.0       1178    1178.0
    BSC          1194    1202.5       1177    1178.1
    PBIL         1201    1204.3       1178    1178.3
    MIMIC        1197    1206.5       1178    1178.6


5. Conclusions

In this chapter an application of some EDAs to the job shop scheduling problem has been carried out. This application has borrowed the most successful components used by GAs on this problem. The results with this simple approach are comparable to those obtained by GAs.

This is a preliminary approach and much work can be done to adapt the components of EDAs to the particular characteristics of the job shop scheduling problem. One adaptation that we propose for the future is to reflect the disjunctive graph in the structure learnt by the EDAs that use Bayesian or Gaussian networks.

References

Anderson, E., Glass, C., and Potts, C. (1997). Machine scheduling. In Aarts, E. and Lenstra, J., editors, Local Search in Combinatorial Optimization, pages 361-414. John Wiley & Sons.

Balas, E. and Vazacopoulos, A. (1998). Guided local search with shifting bottleneck for job shop scheduling. Management Science, 44:262-275.

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University.

Baluja, S. and Davies, S. (1998). Fast probabilistic modeling for combinatorial optimization. In AAAI-98.

Bean, J. and Norman, B. (1993). Random keys for job shop scheduling. Technical Report TR 93-7, Department of Industrial and Operations Engineering, The University of Michigan.

Bierwirth, C. and Mattfeld, D. (1999). Production scheduling and rescheduling with genetic algorithms. Evolutionary Computation, 7(1):1-17.

Blazewicz, J., Domschke, W., and Pesch, E. (1996). The job shop scheduling problem: Conventional and new solution techniques. European Journal of Operational Research, 93:1-33.

Blazewicz, J., Ecker, K., Schmidt, G., and Weglarz, J. (1985). Scheduling in Computer and Manufacturing Systems. Springer-Verlag.

Davidor, Y., Yamada, T., and Nakano, R. (1993). The ecological framework II: Improving GA performance at virtually zero cost. In Forrest, S., editor, Proceedings of the Fifth International Conference on Genetic Algorithms, ICGA-5, pages 171-176. Morgan Kaufmann.

Davis, L. (1985). Job shop scheduling with genetic algorithms. In Grefenstette, J., editor, Proceedings of the First International Conference on Genetic Algorithms and Their Applications, pages 136-140. Lawrence Erlbaum Associates.


Fang, H., Ross, P., and Corne, D. (1993). A promising genetic algorithm approach to job shop scheduling, rescheduling and open-shop scheduling problems. In Forrest, S., editor, Proceedings of the Fifth International Conference on Genetic Algorithms, ICGA-5, pages 375-382. Morgan Kaufmann.

Fisher, H. and Thompson, G. (1963). Probabilistic learning of local job-shop scheduling rules. In Muth, J. and Thompson, G., editors, Industrial Scheduling. Prentice-Hall, Englewood Cliffs, NJ.

Giffler, B. and Thompson, G. (1960). Algorithms for solving production scheduling problems. Operations Research, 8:487-503.

Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Lemmer, J. and Kanal, L., editors, Uncertainty in Artificial Intelligence, volume 2, pages 149-163. North-Holland, Amsterdam.

Husband, P., Mill, F., and Warrington, S. (1991). Genetic algorithms, production plan optimisation and scheduling. In Schwefel, H.-P. and Männer, R., editors, Parallel Problem Solving from Nature, PPSN I. Lecture Notes in Computer Science, volume 496, pages 80-84. Springer-Verlag.

Kobayashi, S., Ono, I., and Yamamura, M. (1995). An efficient genetic algorithm for job shop scheduling problems. In Proceedings of the Sixth International Conference on Genetic Algorithms, ICGA-6, pages 506-511.

Larrañaga, P. (2001). A review on Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000a). Combinatorial optimization by learning and simulation of Bayesian networks. In Boutilier, C. and Goldszmidt, M., editors, Uncertainty in Artificial Intelligence, UAI-2000, pages 343-352. Morgan Kaufmann Publishers, San Francisco, CA.

Larrañaga, P., Etxeberria, R., Lozano, J. A., and Peña, J. M. (2000b). Optimization in continuous domains by learning and simulation of Gaussian networks. In Wu, A. S., editor, Proc. of the Genetic and Evolutionary Computation Conference, GECCO-2000, Workshop Program, pages 201-204.

Lenstra, J. and Kan, A. R. (1979). Computational complexity of discrete optimization problems. Annals of Discrete Mathematics, 4:121-140.

Mattfeld, D., Kopfer, H., and Bierwirth, C. (1994). Control of parallel population dynamics by social-like behavior of GA-individuals. In Parallel Problem Solving from Nature III, pages 16-25.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Voigt, H., Ebeling, W., Rechenberg, I., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature, PPSN IV. Lecture Notes in Computer Science, volume 1141, pages 178-187.


Nowicki, E. and Smutnicki, C. (1996). A fast taboo search algorithm for the job shop problem. Management Science, 42:797-813.

Ono, I., Yamamura, M., and Kobayashi, S. (1996). A genetic algorithm for job-shop scheduling problems using job-based order crossover. In Fogel, D., editor, Proceedings of The Second IEEE Conference on Evolutionary Computation, pages 547-552. IEEE Computer Society Press.

Rudolph, G. (1991). Global optimization by means of distributed evolution strategies. In Schwefel, H.-P. and Männer, R., editors, Parallel Problem Solving from Nature, PPSN I. Lecture Notes in Computer Science, volume 496, pages 209-213. Springer-Verlag.

Starkweather, T., Whitley, D., Mathias, K., and McDaniel, S. (1992). Sequence scheduling with genetic algorithms. In Fandel, G., Gulledge, T., and Jones, J., editors, New Directions for Operations Research in Manufacturing, pages 129-148. Springer.

Yamada, T. and Nakano, R. (1995). A genetic algorithm with multi-step crossover for job-shop scheduling problems. In Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, pages 146-151. IEE Press.


Chapter 12

Solving Graph Matching with EDAs Using a Permutation-Based Representation

E. Bengoetxea
Department of Computer Architecture and Technology

University of the Basque Country

[email protected]

P. Larrañaga
Department of Computer Science and Artificial Intelligence

University of the Basque Country

[email protected]

I. Bloch, A. Perchant
Department of Signal and Image Processing
École Nationale Supérieure des Télécommunications
{bloch, perchant}@tsi.enst.fr

Abstract Graph matching has become an important area of research because of the potential advantages of using graphs for solving recognition problems. An example of its use is in image recognition problems, where structures to be recognized are represented by nodes in a graph that are matched against a model, which is also represented as a graph.

As the number of image recognition areas that make use of graphs is increasing, new techniques are being introduced in the literature. Graph matching can also be regarded as a combinatorial optimization problem with constraints and can be solved with evolutionary computation techniques such as Estimation of Distribution Algorithms.

This chapter introduces for the first time the use of Estimation of Distribution Algorithms with individuals represented as permutations to solve a particular graph matching problem. This is illustrated with the real problem of recognizing human brain images.


Keywords: Inexact Graph Matching, Estimation of Distribution Algorithms, Human Brain Images

1. Introduction

Representation of structural information by graphs is widely used in domains that include network modelling, psycho-sociology, image interpretation, and pattern recognition. There, graph matching is used to identify nodes and therefore structures. Most existing problems and methods in the graph matching domain assume graph isomorphism, where both graphs being matched have the same number of nodes and links. For some problems, this bijective condition between the two graphs is too strong and it is necessary to weaken it and express the correspondence as an inexact graph matching problem.

Examples of inexact graph matching can be found in the pattern recognition field, where structural recognition of images is performed: the model (also called the atlas or map depending on the application) is represented in the form of a graph, where each node contains information for a particular structure, and data graphs are generated from the images to be analyzed. Graph matching techniques are then used to determine which structure in the model corresponds to each of the structures in a given image. When the data graph is generated automatically from the image to be analyzed, the difficulty of accurately segmenting the image into meaningful entities means that oversegmentation techniques need to be applied (Perchant et al., 1999; Perchant and Bloch, 1999; Perchant, 2000). These ensure that the boundaries between the meaningful entities to be recognized will appear in the data image as clearly distinct structures. As a result, the number of nodes in the data graph increases and the isomorphism condition between the model and data graphs cannot be assumed. Such problems call for inexact graph matching, and similar examples can be found in other fields. There, the graph matching technique of choice has to perform the recognition process by returning a solution where each node in the data graph is matched with the corresponding node in the model graph.

In addition, another important aspect to be taken into account is the fact that some graph matching problems contain additional constraints on the matching that have to be satisfied in order to consider the matching as correct.

The complexity of the graph matching problem is mostly determined by the sizes of the model and data graphs. The problem has been proved to be NP-hard (Lovasz and Plummer, 1986), and therefore the use of heuristic methods is justified.

Different techniques have been applied to inexact graph matching, including combinatorial optimization (Cross and Hancock, 1999; Cross et al., 1997; Singh and Chaudhury, 1997), relaxation (Finch et al., 1997; Gold and Rangarajan, 1996; Hancock and Kittler, 1990; Wilson and Hancock, 1996; Wilson and Hancock, 1997), the EM algorithm (Cross and Hancock, 1998; Finch et al.,


Solving Graph Matching with EDAs Using a Permutation-Based Representation 245

1998), and Evolutionary Computation techniques such as Genetic Algorithms (GAs) (Boeres et al., 1999; Myers and Hancock, 2001).

This chapter proposes optimization through learning and simulation of probabilistic graphical models (such as Bayesian networks and Gaussian networks) as the method of choice. Adaptations of different Estimation of Distribution Algorithms (EDAs) for use in inexact graph matching are also introduced. EDAs are also modified to deal with additional constraints in a graph matching problem. Existing articles on using EDAs to solve the graph matching problem are Bengoetxea et al. (2000a) and Bengoetxea et al. (2000b), which compare EDAs with GAs in their use for this type of problem.

The outline of this chapter is as follows: Section 2 explains the graph matching problem, presenting it as a combinatorial optimization problem with constraints. Section 3 proposes a permutation-based approach for solving the inexact graph matching problem using EDAs. Sections 4 and 5 introduce a method for translating individuals containing a permutation into valid solutions of the inexact graph matching problem for the discrete and continuous domains respectively. Section 6 describes the experiment carried out and the results obtained. Finally, Section 7 gives conclusions and suggests further work.

2. Graph matching as a combinatorial optimization problem with constraints

In any combinatorial optimization problem an important influence on algorithm performance is the way that the problem is defined, in both the representation of individuals chosen, and the fitness function used to evaluate those individuals. This section gives some examples of representations (the encoding of points in the search space).

2.1 Representation of individuals

One of the most important tasks in defining any problem to be solved with heuristics is choosing an adequate representation of individuals, because this determines to a large extent the performance of the algorithms. An individual represents a solution, i.e. a point in the search space that has to be evaluated. For a graph matching problem, each individual represents a match between the nodes of a data graph G2 and those of a model graph G1.

A representation of individuals for this problem that was used with GAs in Boeres et al. (1999), and that could also be applied to EDAs, is the following: individuals with |V1|·|V2| binary genes or variables (containing only 0s and 1s), where |V1| and |V2| are the numbers of nodes in graphs G1 and G2 respectively. In each individual, the meaning of entry cij, 1 ≤ i ≤ |V1| and 1 ≤ j ≤ |V2|, is the following: cij = 1 means that the jth node of G2 is matched with the ith node of G1. The main drawback of this type of representation is the large number of


variables or genes that the individual contains, which increases the complexity of the problem that EDAs or GAs have to solve. The cardinality of the search space is also

2^(|V1|·|V2|)    (12.1)

which is quite large, although not all the individuals are valid (there are some restrictions to consider within the individuals).

Another possible representation that can be used either in GAs or EDAs consists of individuals which each contain |V2| genes or variables, where each variable can take any value between 1 and |V1|. More formally, the individual, as well as the solution it represents, can be defined as follows: for 1 ≤ k ≤ |V1| and 1 ≤ i ≤ |V2|, xi = k means that the ith node of G2 is matched with the kth node of G1. This is the representation used, for instance, in Bengoetxea et al. (2000a) and Bengoetxea et al. (2000b). In this representation, the number of possible solutions to the inexact graph matching problem is given by the following formula for the number of permutations with repetition:

Σ_{i1=1}^{|V2|−|V1|+1} ··· Σ_{i|V1|=1}^{|V2|−|V1|+1}  |V2|! / (i1! i2! ··· i|V1|!)    (12.2)

where the values ik (k = 1, ..., |V1|) satisfy the condition Σ_{k=1}^{|V1|} ik = |V2|. We will refer to this representation later, in Section 6, as traditional.

An example of the traditional representation of individuals is shown in Figure 12.1 for a particular example where the model graph G1 contains 6 nodes (labeled from 1 to 6) and the data graph G2 represents a segmented image and contains 11 nodes (labeled from 1 to 11). This individual represents a solution (a point in the search space) where the first two nodes of G2 are matched to node number 1 of G1, the next four nodes of G2 are matched to node number 2 of G1, and so on.

| 1 | 1 | 2 | 2 | 2 | 2 | 3 | 4 | 4 | 5 | 6 |

Figure 12.1 Traditional representation of an individual for the graph matching problem, when G1 (the model graph) contains 6 nodes and G2 (the data graph representing the segmented image) contains 11 nodes.
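Read this way, a traditional individual can be decoded into an explicit matching with a few lines of Python (an illustrative sketch; the helper name is ours):

```python
def decode_matching(individual):
    """Map each model-graph node to the list of data-graph nodes matched
    to it (traditional representation, 1-based node labels)."""
    matching = {}
    for data_node, model_node in enumerate(individual, start=1):
        matching.setdefault(model_node, []).append(data_node)
    return matching
```

For the individual of Figure 12.1 this yields {1: [1, 2], 2: [3, 4, 5, 6], 3: [7], 4: [8, 9], 5: [10], 6: [11]}.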

Another important aspect that determines which individual representation is the most appropriate is the fact that every problem has restrictions that have to be satisfied by the solutions (i.e. the individuals) in order to be considered correct or useful. For instance, when applying graph matching techniques to the recognition of human brain structures, it is important for any acceptable solution that all the main brain structures such as the cerebellum are identified (e.g. a solution where the cerebellum is not present in the brain

Page 268: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation

Solving Graph Matching with EDAs Using a Permutation-Based Representation 247

image could not be accepted!). Each particular problem has its own particular constraints, and the different representations of individuals chosen have to take these into account. The reader can find a review of types of individual representations, as well as of the resolution of the restrictions in the human brain problem, in Bengoetxea et al. (2000a). The same reference introduces different methods and mechanisms for generating correct individuals that satisfy these constraints. It is important to note that the procedure to handle those constraints is different for each individual representation, and therefore this aspect has to be taken into account in any representation in order to obtain correct solutions and to minimize the complexity of the problem.

3. Representing a matching as a permutation

Individual representations based on permutations have been typically applied to problems such as the Traveling Salesman Problem or the Vehicle Routing Problem, where either a salesman or a vehicle has to pass through a number of places at the minimum cost.

A permutation-based representation can also be used for problems such as inexact graph matching. In this case the meaning of the individual is completely different, as an individual does not directly show which node of G2 is matched with each node of G1. In fact, what we obtain from each individual is the order in which nodes will be analyzed and treated so as to compute the matching it represents.

For the individuals to contain a permutation, they will be the same size as the traditional ones described in Section 2.1 (i.e. |V2| variables long). However, the number of values that each variable can take will be |V2|, and not |V1| as in that representation. In fact, it is important to note that a permutation is a list of numbers in which all the values from 1 to n have to appear in an individual of size n. In other words, our new representation of individuals needs to satisfy a strong constraint in order to be considered correct, that is, each individual has to contain every value from 1 to n, where n = |V2|.

More formally, all the individuals used for our problem of inexact graph matching will be formed of |V2| genes or variables, which contain no repeated value within the individual and take values between 1 and |V2|. For 1 ≤ k ≤ |V2| and 1 ≤ i ≤ |V2|, x_i = k means that the kth node of G2 will be the ith node that is analyzed for its most appropriate match.

3.1 From the permutation to the solution it represents

Once the type of individual has been formally defined, we need a method to obtain a solution from the permutation itself, because the representation does not directly define the meaning of the solution. Every individual


248 Estimation of Distribution Algorithms

requires this step in order to be evaluated. As a result, it is important that this translation is performed by a fast and simple algorithm.

This section introduces a way of performing this step. A solution for the inexact graph matching problem can be calculated by comparing the nodes to each other and deciding which is more similar to which using a similarity function ω(i, j) defined for this purpose to compute the similarity between nodes i and j. The similarity measures used so far in the literature have been applied to two nodes, one from each graph, and their goal has been to help in the computation of the fitness of a solution, that is, the final value of a fitness function. However, the similarity measure ω(i, j) proposed in this section is quite different, as the two nodes to be evaluated are both in the data graph (i, j ∈ V2). With these new similarity values we will be able to look for the node in G2 that is most similar to any particular node that is also in G2. The aim of this is to identify, for each particular node of G2, which other nodes in the data graph are most similar to it, and to try to group it with the best set of already matched nodes.

We have not yet defined the exact basis for the similarity measure ω. Different aspects could be taken into account, and this topic will be further discussed in Section 3.3.

As explained in the introduction, each particular problem usually contains specific constraints that have to be satisfied by all the proposed solutions. If this is the case, another important aspect is to ensure that the solution represented by a permutation is always a correct individual. A solution will be considered correct only when it satisfies the conditions defined for the problem. In order to set restrictions on our problem and test how the optimization methods handle them, we will assume in this chapter that the only condition for an individual to be correct is that all the nodes of G2 have to be matched with a node of G1, and that every node of G1 is matched with at least one node of G2. These conditions will be satisfied by the translation procedure proposed next for both discrete and continuous domains.

Given an individual x = (x_1, ..., x_{|V1|}, x_{|V1|+1}, ..., x_{|V2|}), the procedure to do the translation is performed in two phases as follows:

• The first |V1| values (x_1, ..., x_{|V1|}), which directly represent nodes of V2, will be respectively matched to nodes 1, 2, ..., |V1| (that is, the node x_1 ∈ V2 is matched with the node 1 ∈ V1, the node x_2 ∈ V2 is matched with the node 2 ∈ V1, and so on, until the node x_{|V1|} ∈ V2 is matched with the node |V1| ∈ V1).

• For each of the following values of the individual, (x_{|V1|+1}, ..., x_{|V2|}), and following their order of appearance in the individual, the most similar node will be chosen from all the previous values in the individual by means of the similarity measure ω. For each of these nodes of G2, we


From discrete permutations to the solution

Definitions
  |V1|: number of nodes in the model graph G1
  |V2|: number of nodes in the data graph G2, with |V2| > |V1|
  n = |V2|: size of the individual (the permutation)
  x = (x_1, ..., x_{|V2|}): individual containing a permutation
  x_i ∈ {1, ..., n}: value of the ith variable in the individual
  PV_i = {x_1, ..., x_{i-1}}: set of values assigned in the individual to
    the variables X_1, ..., X_{i-1} (PV = previous values)
  ω(i, j): similarity function that measures the similarity of node i
    with respect to node j

Procedure
  Phase 1
    For i = 1, 2, ..., |V1| (first |V1| values in the individual, treated in order)
      Match node x_i ∈ V2 of data graph G2 with node i ∈ V1 in model graph G1
  Phase 2
    For i = |V1| + 1, ..., |V2| (remaining values in the individual, treated in this order)
      Let k ∈ PV_i be the most similar node to x_i among the nodes of PV_i
        (k = argmax_{j ∈ PV_i} ω(x_i, j))
      Match node x_i ∈ V2 of data graph G2 with the node of G1 that is
        matched to node k of G2

Figure 12.2  Pseudocode to compute the solution represented by a permutation-based individual.

assign the node of G1 that is matched to the most similar node of G2.

The first phase is very important in the generation of the individual, as it is also the one that ensures the correctness of the solution represented by the permutation: as all the values of V1 are assigned from the beginning, and as we assumed |V2| > |V1|, we conclude that all the nodes of G1 will be matched to some node of G2 in every solution represented by any permutation.


Therefore, this permutation-based representation is suitable for our problem. The procedure described in this section is shown as pseudocode in Figure 12.2.

3.2 Example

To demonstrate the representation of individuals containing permutations and the procedure for translating them into a point in the search space, we consider the example shown in Figure 12.3. In this example we consider an inexact graph matching problem with a data graph G2 of 10 nodes (|V2| = 10) and a model graph G1 of 6 nodes (|V1| = 6). We also use a similarity measure for the example (the ω(i, j) function), the results of which are shown in the same figure. This similarity function does not always have to be symmetrical, and in this example we are using a non-symmetrical one (see Section 3.3 for a discussion of this topic). The translation has to produce solutions of the same size (10 values), where each value lies between 1 and 6, that is, the number of the node of V1 with which the corresponding node of G2 is matched in the solution.

Figure 12.2 shows the procedure for both phases 1 and 2. Following the procedure for phase 1, the first 6 nodes will be matched, and we obtain the first matches for the three individuals in Figure 12.3 (the result is shown in Figure 12.4).

In the second phase, generation of the solution is completed by processing one by one all the remaining variables of the individual. For that, we choose the next variable that has not yet been treated, the 7th in our example. Here, the first individual in the example has the value 7 in its 7th position, which means that node 7 of G2 will be worked on next. Similarly, the nodes of G2 to be treated in the 7th position for the other two example individuals are nodes 10 and 4 respectively.

Next, in order to calculate the node of G1 that we have to assign to our node of G2 in the matching, we compare the nodes of V2 that appear before the 7th variable in the individual with it. Therefore, for the first individual, we compare the similarity between G2 node 7 and each of the G2 nodes 1 to 6. This similarity measure is given by the function ω shown in Figure 12.3. If we look at the 7th line in this table we see that in columns 1 to 6, the highest value is 0.96, in column 2. Therefore, following the algorithm in phase 2, we assign to node 7 the same match as node 2. As we can see in Figure 12.4, for the first individual, node 2 was assigned the value 2, so we also assign the value 2 to the 7th node of G2.

Similarly, for the second individual, the 7th variable of the individual is also processed. This has the value 10, so node 10 of G2 is the next to be matched. We compare this node with the previously matched nodes, i.e. nodes 5, 8, 7, 1, 6 and 9. The highest similarity value for these is ω = 0.97, in column 9. Therefore the most similar node is node 9, and


Individuals:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

| 5 | 8 | 7 | 1 | 6 | 9 | 10 | 3 | 4 | 2 |

| 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |

Similarity Function:

ω(i,j) |   1  |   2  |   3  |   4  |   5  |   6  |   7  |   8  |   9  |  10  |
   1   | 1.00 | 0.87 | 0.67 | 0.80 | 0.77 | 0.48 | 0.88 | 0.80 | 0.75 | 0.89 |
   2   | 0.03 | 1.00 | 0.96 | 0.13 | 0.73 | 0.90 | 0.15 | 0.66 | 0.74 | 0.92 |
   3   | 0.20 | 0.42 | 1.00 | 0.63 | 0.05 | 0.22 | 0.20 | 0.51 | 0.31 | 0.50 |
   4   | 0.52 | 0.50 | 0.88 | 1.00 | 0.49 | 0.88 | 0.08 | 0.91 | 0.38 | 0.47 |
   5   | 0.19 | 0.90 | 0.85 | 0.71 | 1.00 | 0.15 | 0.24 | 0.51 | 0.97 | 0.80 |
   6   | 0.47 | 0.87 | 0.67 | 0.80 | 0.77 | 1.00 | 0.88 | 0.80 | 0.75 | 0.87 |
   7   | 0.03 | 0.96 | 0.35 | 0.13 | 0.73 | 0.90 | 1.00 | 0.66 | 0.74 | 0.92 |
   8   | 0.20 | 0.42 | 0.93 | 0.63 | 0.05 | 0.22 | 0.20 | 1.00 | 0.31 | 0.50 |
   9   | 0.52 | 0.50 | 0.89 | 0.53 | 0.49 | 0.88 | 0.08 | 0.91 | 1.00 | 0.47 |
  10   | 0.19 | 0.90 | 0.85 | 0.71 | 0.18 | 0.15 | 0.24 | 0.51 | 0.97 | 1.00 |

Figure 12.3  Example of three permutation-based individuals and a similarity measure ω(i, j) between nodes of the data graph (∀i, j ∈ V2) for a data graph of 10 nodes (|V2| = 10).

node 10 of G2 will be matched to the same node of G1 as node 9 of G2 was. Looking at Figure 12.4, this is the 6th node of G1. Following the same process for the third individual, we obtain that node 4 of G2 is matched with node 3 of G1. Figure 12.5 shows the result of this first step of phase 2.

Continuing this procedure of phase 2 until the last variable, we obtain the solutions shown in Figure 12.6.

Note that each of the nodes of G2 is assigned a value between 1 and |V1| = 6. Note also that every node of G1 is matched to at least one node of G2, and that a value is given to every node of G2, giving a matching value to each of the segments in the data image (all the segments in the data image are therefore recognised as a structure of the model).


| 1 | 2 | 3 | 4 | 5 | 6 | - | - | - | - |
  1   2   3   4   5   6   7   8   9   10

| 4 | - | - | - | 1 | 5 | 3 | 2 | 6 | - |
  1   2   3   4   5   6   7   8   9   10

| - | - | - | - | 6 | 5 | 4 | 3 | 2 | 1 |
  1   2   3   4   5   6   7   8   9   10

Figure 12.4  Result of the generation of the individual after the completion of phase 1 for the example in Figure 12.3, where six nodes of G2 have been matched (|V1| = 6).

| 1 | 2 | 3 | 4 | 5 | 6 | 2 | - | - | - |
  1   2   3   4   5   6   7   8   9   10

| 4 | - | - | - | 1 | 5 | 3 | 2 | 6 | 6 |
  1   2   3   4   5   6   7   8   9   10

| - | - | - | 3 | 6 | 5 | 4 | 3 | 2 | 1 |
  1   2   3   4   5   6   7   8   9   10

Figure 12.5  Generation of the solutions for the example individuals in Figure 12.3 after the first step of phase 2 (|V1| = 6).

An important aspect of this individual representation based on permutations is that the cardinality of the search space is n!, which is higher than that of the traditional individual representation. This representation is tested here for the first time for use with EDAs in graph matching. In addition, it is important to note that a permutation-based approach can create redundancies in the solutions, as two different permutations may correspond to the same solution. An example of this is shown in Figure 12.7, where two individuals with different permutations represent exactly the same solution.


| 1 | 2 | 3 | 4 | 5 | 6 | 2 | 3 | 3 | 3 |
  1   2   3   4   5   6   7   8   9   10

| 4 | 2 | 2 | 2 | 1 | 5 | 3 | 2 | 6 | 6 |
  1   2   3   4   5   6   7   8   9   10

| 1 | 3 | 3 | 3 | 6 | 5 | 4 | 3 | 2 | 1 |
  1   2   3   4   5   6   7   8   9   10

Figure 12.6  Result of the generation of the solutions after the completion of phase 2.

Individual 1:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

Individual 2:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 | 8 | 10 |

Solution they represent:

| 1 | 2 | 3 | 4 | 5 | 6 | 2 | 3 | 3 | 3 |
  1   2   3   4   5   6   7   8   9   10

Figure 12.7  Example of redundancy in the permutation-based approach. The two individuals represent the same solution shown at the bottom of the figure.
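The two-phase translation and the worked example can be checked with a short Python script. The following sketch is ours, not part of the chapter; it encodes the similarity table of Figure 12.3 and reproduces the solutions of Figure 12.6 as well as the redundancy of Figure 12.7:

```python
# Similarity table omega(i, j) of Figure 12.3; W[i-1][j-1] = omega(i, j).
W = [
    [1.00, 0.87, 0.67, 0.80, 0.77, 0.48, 0.88, 0.80, 0.75, 0.89],
    [0.03, 1.00, 0.96, 0.13, 0.73, 0.90, 0.15, 0.66, 0.74, 0.92],
    [0.20, 0.42, 1.00, 0.63, 0.05, 0.22, 0.20, 0.51, 0.31, 0.50],
    [0.52, 0.50, 0.88, 1.00, 0.49, 0.88, 0.08, 0.91, 0.38, 0.47],
    [0.19, 0.90, 0.85, 0.71, 1.00, 0.15, 0.24, 0.51, 0.97, 0.80],
    [0.47, 0.87, 0.67, 0.80, 0.77, 1.00, 0.88, 0.80, 0.75, 0.87],
    [0.03, 0.96, 0.35, 0.13, 0.73, 0.90, 1.00, 0.66, 0.74, 0.92],
    [0.20, 0.42, 0.93, 0.63, 0.05, 0.22, 0.20, 1.00, 0.31, 0.50],
    [0.52, 0.50, 0.89, 0.53, 0.49, 0.88, 0.08, 0.91, 1.00, 0.47],
    [0.19, 0.90, 0.85, 0.71, 0.18, 0.15, 0.24, 0.51, 0.97, 1.00],
]

def decode(perm, n1=6):
    """Figure 12.2's two phases: turn a permutation of the data-graph
    nodes into a solution (position i holds the model node matched to
    data node i+1)."""
    match = {}
    for pos, node in enumerate(perm):
        if pos < n1:
            # Phase 1: the first |V1| nodes get model nodes 1..|V1|.
            match[node] = pos + 1
        else:
            # Phase 2: copy the match of the most similar treated node.
            k = max(perm[:pos], key=lambda j: W[node - 1][j - 1])
            match[node] = match[k]
    return [match[v] for v in range(1, len(perm) + 1)]

sols = [decode(p) for p in [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                            (5, 8, 7, 1, 6, 9, 10, 3, 4, 2),
                            (10, 9, 8, 7, 6, 5, 4, 3, 2, 1)]]
```

Note that `max` breaks ties in favour of the earliest treated node; the table of Figure 12.3 has no ties among the compared entries, so the result is unambiguous here.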

3.3 Defining the similarity concept

There are three important aspects to consider in order to define the similarity function ω for phase 2:

• The first is to decide which nodes have to be compared. In the example we propose comparing nodes from the same graph G2, that is, the model graph G1 is not taken into account. Other approaches could be considered, for instance taking into account the similarity of both nodes of G1 and nodes of G2 and assigning a weight to both values, or having a


fitness function capable of returning a value for individuals that are not complete.

• Another additional procedure, depending on the graph matching problem to be solved, is the recalculation of the similarity measure as the individual is being generated: the similarity value could be changed as nodes of the individual are being matched, by following a clustering procedure. This means that in phase 2 an extra clustering procedure would be required in order to update the function ω.

• And finally, the other aspect to take into account is the definition of the similarity itself. This factor depends on the problem, and its definition will determine to an important degree the behavior of the algorithm.

4. Obtaining a permutation with discrete EDAs

After describing how permutations can be used in graph matching to obtain correct solutions, the next step is to apply EDAs to this new type of individual in order to look for the permutation that represents the solution with the optimum fitness value. At first glance the problem seems a simple application of any EDA, applying the method described in Section 3.1.

4.1 On EDAs applied to graph matching

We will now define more formally the graph matching problem and the way of facing it with an EDA approach, based on the general notation introduced in Chapter 3.

We call G1 = (V1, E1) the model graph and G2 = (V2, E2) the data graph. V_i is the set of nodes and E_i is the set of arcs of graph G_i (i = 1, 2). We still assume that G2 contains more segments than G1. The graph matching task is accomplished by matching nodes from G2 with the nodes of the model graph G1.

We use a permutation as the representation of individuals, which means that the size of these individuals will be n = |V2| variables (that is, each individual can be written x = (x_1, ..., x_{|V2|})), and each of the x_i can take |V2| possible values.

4.2 Looking for correct individuals

The simulation of Bayesian networks has been used to reason with networks as an alternative to exact propagation methods. In EDAs simulation is used to create the individuals of the following generation based on the structure learned previously.


Among the various methods to perform the simulation process, the method of choice for this chapter is Probabilistic Logic Sampling (PLS), proposed in Henrion (1988).

Nevertheless, as explained in Section 2.1, whatever the representation of individuals selected, it is important to check that each individual is correct and satisfies all the restrictions of the problem, so that it can be considered a point in the search space. The interested reader can find a more exhaustive review of this topic in Bengoetxea et al. (2000a), where the authors propose different methods to obtain only correct individuals that satisfy the particular constraints of the problem. In the latter reference two methods to control the simulation step in EDAs are introduced: Last Time Manipulation (LTM) and All Time Manipulation (ATM).

Both methods are based on modifying the simulation step so that, during the simulation of each individual, the probabilities learned from the Bayesian network are modified. Each individual is generated variable by variable following the ancestral ordering, as in PLS, but the constraints are verified during the instantiation and the probabilities obtained from the learning are modified if necessary to ensure the correctness of the individual.

It is important to note that altering the probabilities at the simulation step, in whatever way, implies that the result of the algorithm is also modified to some extent.

For our concrete case of a permutation-based representation, and in order to lead EDAs to the generation of correct permutations only, either of these two methods can be used, and both LTM and ATM will behave in exactly the same way: the only difference between them is that LTM only intervenes in the simulation step when the number of values that have not yet appeared equals the number of variables still to be simulated in the individual, whereas ATM always adjusts the probabilities. As in this case this situation occurs for all the variables of all the individuals, both methods behave in the same way, ensuring in both cases that every individual will always contain a correct permutation.
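The way such a manipulation guarantees permutations can be sketched in a few lines of Python. This is an illustration of the underlying idea under a simplifying assumption (a univariate model, as in UMDA; the actual LTM/ATM procedures of Bengoetxea et al. (2000a) operate on the learned Bayesian network during PLS), and the function name is ours:

```python
import random

def sample_permutation(probs, seed=0):
    """Sample a correct permutation variable by variable.

    probs[i][k] is the learned probability that variable X_i takes the
    value k+1 (a univariate model is assumed here for simplicity).  At
    each step the probabilities of values that have already appeared are
    forced to zero and the remaining mass is renormalized, which is the
    essence of the ATM correction.
    """
    rng = random.Random(seed)
    n = len(probs)
    used = set()
    individual = []
    for i in range(n):
        weights = [0.0 if (k + 1) in used else probs[i][k] for k in range(n)]
        if sum(weights) == 0.0:
            # All remaining values had zero learned probability:
            # fall back to a uniform choice among the unused values.
            weights = [0.0 if (k + 1) in used else 1.0 for k in range(n)]
        value = rng.choices(range(1, n + 1), weights=weights, k=1)[0]
        individual.append(value)
        used.add(value)
    return individual
```

Whatever the learned probabilities, the output is always a valid permutation of 1..n, mirroring the guarantee stated above for LTM and ATM.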

4.3 Choosing the best discrete EDA algorithm

In order to test EDAs on the inexact graph matching problem defined above, three different EDAs were tested. Typical graph matching problems can have large complexity, and as the difference in behavior between EDAs is to a large extent due to the complexity of the probabilistic structure that they have to build, these three algorithms have been chosen as representatives of the three categories of EDAs introduced in Chapter 3: (1) UMDA (Mühlenbein, 1998), as an example of an EDA that considers no interdependencies between the variables; (2) MIMIC (De Bonet et al., 1997), as an example that belongs to


the category of pairwise dependencies; and (3) EBNA (Etxeberria and Larrañaga, 1999), in which multiple interdependencies are allowed.

5. Obtaining a permutation with continuous EDAs

Continuous EDAs provide another type of search that can be more suitable for some problems. But again, the main goal is to find a representation of individuals and a procedure to obtain an unambiguous solution to the matching from each of the possible permutations.

In this case we propose a strategy based on the previous section: translating the individual in the continuous domain into a correct permutation in the discrete domain, and proceeding next as explained in Section 3.1.

This procedure of translating from the continuous domain to the discrete domain has to be performed for each individual in order for it to be evaluated. Again, this process has to be fast enough to keep computation time low.

With all these aspects in mind, individuals of size n = |V2| will be defined. Each individual is obtained by sampling from an n-dimensional Gaussian distribution, and therefore can take any value in IR^n. With this new representation the individuals have no direct interpretation as the solution they represent: the values of the variables only show the way to translate from the continuous domain to a permutation, as with the discrete representation shown in Section 2.1, and they do not contain similarity values between nodes of any graph. This new type of representation can also be regarded as a way of moving the search from the discrete to the continuous domain, where the techniques that can be applied to the estimation of densities are completely different.

To obtain a translation to a discrete permutation, we order the continuous values of the individual and set the corresponding discrete values by assigning to each x_i ∈ {1, ..., |V2|} its respective rank. The procedure described in this section is shown as pseudocode in Figure 12.8.

For the simulation of a univariate normal distribution, a simple method based on the sum of 12 uniform variables (Box and Muller, 1958) is chosen. On the other hand, the sampling of multivariate normal distributions has been done by means of an adaptation of the conditioning method (Ripley, 1987) on the basis of the PLS algorithm. Note that in this continuous case it is not necessary to check whether all the values are different or not.

Again, for the continuous domain different EDAs are proposed, and these are tested in this chapter for their performance on a concrete inexact graph matching problem. Three different algorithms are chosen again, as representatives of their complexity category. These are UMDAc, MIMICc, and EGNA (Larrañaga et al., 2000).
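To illustrate the overall cycle, here is a minimal Python sketch of one generation of a UMDAc-style algorithm. It is an illustrative simplification under the assumption of independent univariate Gaussians (MIMICc and EGNA additionally model dependencies between variables), and the function name is ours:

```python
import random
import statistics

def umdac_step(population, fitnesses, n_select, seed=1):
    """One generation of a UMDAc-like continuous EDA (simplified sketch):
    select the best individuals, fit an independent normal distribution
    to each variable, and sample a new population from the learned model."""
    rng = random.Random(seed)
    ranked = sorted(zip(fitnesses, population), key=lambda t: -t[0])
    selected = [ind for _, ind in ranked[:n_select]]
    n_vars = len(selected[0])
    means = [statistics.mean(ind[i] for ind in selected)
             for i in range(n_vars)]
    # A tiny floor keeps the model from collapsing once a variable converges.
    stdevs = [statistics.stdev(ind[i] for ind in selected) + 1e-9
              for i in range(n_vars)]
    return [[rng.gauss(means[i], stdevs[i]) for i in range(n_vars)]
            for _ in range(len(population))]
```

Each sampled individual is then translated into a permutation (next section) before being evaluated by the fitness function.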


From a continuous value in IR^n to a discrete permutation

Definitions
  n = |V2|: size of the individual (the permutation), which is the
    number of nodes in the data graph G2
  x^C = (x^C_1, ..., x^C_{|V2|}): individual containing continuous
    values (the input)
  x^D = (x^D_1, ..., x^D_{|V2|}): individual containing a permutation
    of discrete values (the output)
  x^D_i ∈ {1, ..., n}: value of the ith variable in the individual

Procedure
  Order the values x^C_1, ..., x^C_{|V2|} of individual x^C using any
    fast sorting algorithm such as Quicksort
  Let K_i be the position that each value x^C_i, 1 ≤ i ≤ |V2|,
    occupies after ordering all the values
  Set the values of the individual x^D in the following way:
    ∀i = 1, ..., |V2|: x^D_i = K_i

Figure 12.8  Pseudocode to translate from a continuous value in IR^n to a discrete permutation composed of discrete values.
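The ordering step of Figure 12.8 is essentially a ranking. A minimal Python sketch of this translation (ours, not from the chapter):

```python
def to_permutation(xc):
    """Figure 12.8's translation: the smallest continuous value receives
    discrete value 1, the next smallest 2, and so on, yielding a correct
    permutation of 1..n."""
    order = sorted(range(len(xc)), key=lambda i: xc[i])  # indices, ascending
    xd = [0] * len(xc)
    for position, i in enumerate(order, start=1):
        xd[i] = position  # K_i: rank of x^C_i after ordering
    return xd

# Example: xc = (0.3, -1.2, 2.5) ranks to the permutation (2, 1, 3).
```

Ties among the continuous values occur with probability zero when sampling from a Gaussian, which is why no duplicate check is needed; if they did occur, the stable sort would still produce a valid permutation.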

6. Experimental results. The human brain example

6.1 Overview of the human brain example

The example chosen to test the new permutation-based representation is an inexact graph matching problem used for recognition of structures in Magnetic Resonance Images (MRI). The data graph G2 is generated from this image and contains a node for each region (subset of a brain structure). The model graph G1 is built from an anatomical atlas, and each node corresponds exactly to one brain structure. The experiments carried out in this chapter are focused on this type of graph, but could similarly be adapted to any other inexact graph matching problem.

More specifically, the model graph has been obtained from the main structures of the brainstem, the inner part of the brain, and it does not take into account the cerebral hemispheres. This reduced example is a shorter version of the brain image recognition problem in Perchant and Bloch (1999). The


fact that fewer structures have to be recognized (from 43 down to 12) reduces the complexity of the problem. In the same way, the human brain images have also been reduced, and the number of structures of the data image to be matched (the number of nodes of G2) is also reduced from 245 to 94. The number of arcs is also different in this problem: while in Perchant and Bloch (1999) G1 and G2

contained 417 and 1451 arcs, in these examples the number of arcs is 84 and 2868 respectively.

Regarding the similarity concept, for our experiments we have used only a similarity measure based on the grey level distribution, so that when the function ω returns a higher value for two nodes it indicates a more similar grey level distribution over the two segments of the data image. Another possible property could have been, for instance, the distance between the segments in the data image. In addition, no extra computation is performed during the generation of the individual (no clustering process is performed), and therefore the similarity measure ω is kept constant during the generation of individuals. These decisions were taken knowing the nature and properties of the data graph, which comes from a human brain NMR image in black and white. They were also a way to simplify the complexity of the problem.

6.2 Description of the experiment

The aim of these experiments is to test the performance of some of the discrete and continuous EDAs introduced in Chapter 3 of this volume on the same example. As the main difference between them is the number of dependencies between variables that they take into account, the more complex algorithms are expected to require more CPU time but also to reach a fitter final solution. This section describes the experiments and the results obtained. EDAs are also compared with a broadly known GA, GENITOR (Whitley and Kauth, 1988), which is a steady-state type algorithm (ssGA) (Michalewicz, 1992).

Both EDAs and GENITOR were implemented in ANSI C++, and the experiments were executed on a two-processor Ultra 80 Sun computer under Solaris version 7 with 1 Gb of RAM.

In the discrete case, all the algorithms were designed to end the search when a maximum of 100 generations was reached or when the population became uniform. GENITOR is a special case: as a ssGA it generates only one individual at each iteration, but it was programmed to generate the same number of individuals as the discrete EDAs by allowing more iterations (201900 individuals). In the continuous case, the ending criterion was to reach 301850 evaluations (i.e. the number of individuals generated).

The initial population for all the algorithms was generated using the same random generation procedure based on a uniform distribution. The fitness function used is described later in Section 6.3.


In EDAs, the following parameters were used: a population of 2000 individuals (M = 2000), from which the subset of the best 1000 is selected (N = 1000) to estimate the probability distribution, and the elitist approach was chosen (that is, the best individual is always included in the next population and 1999 individuals are simulated). In GENITOR a population of 2000 individuals was also set, with a mutation rate of Pm = ~ and a crossover probability of Pc = 1. The operators used in GENITOR were CX (Oliver et al., 1987) and EM (Banzhaf, 1990).

6.3 Definition of the fitness function

The definition of the fitness function for the graph matching problem is a very important factor in the resolution of the problem as well, as its behavior also determines how the optimization algorithm approaches the best solution. It is important to define appropriately the function that will be used to compare individuals and obtain the best solution. The aim of this chapter is not to review the different fitness functions for graph matching. For this reason, the function proposed in Perchant and Bloch (1999) will be used simply as an example of a fitness function for inexact graph matching. This function has been used to solve the problem applied to human brain images with GAs in Perchant et al. (1999) and Boeres et al. (1999), and with EDAs in Bengoetxea et al. (2000a) and Bengoetxea et al. (2000b). Following this function, an individual x = (x_1, ..., x_{|V2|}) will be evaluated as follows:

f(x; \rho_\sigma, \rho_\mu, \alpha) = \alpha \left[ \frac{1}{|V_2|\,|V_1|} \sum_{i=1}^{|V_2|} \sum_{j=1}^{|V_1|} \left( 1 - \left| c_{ij} - \rho_\sigma^{u_1^j}(u_2^i) \right| \right) \right] + (1-\alpha) \left[ \frac{1}{|E_2|\,|E_1|} \sum_{e_1=(u_1,v_1) \in E_1} \sum_{e_2=(u_2,v_2) \in E_2} \left( 1 - \left| c_{ij} c_{i'j'} - \rho_\mu^{e_1}(e_2) \right| \right) \right]    (12.3)

where

c_ij = 1 if x_i = j, and c_ij = 0 otherwise,

and α is a parameter used to adapt the weight of node and arc correspondences in f. ρ_σ = {ρ_σ^{u_1}: V_2 → [0, 1], u_1 ∈ V_1} is the set of functions that measure the correspondence between the nodes of the graphs G1 and G2. Similarly, ρ_μ = {ρ_μ^{(u_1,v_1)}: E_2 → [0, 1], (u_1, v_1) ∈ E_1} is the set of functions that measure the correspondence between the arcs of the two graphs. The value of f associated with each variable returns the goodness of the matching. Typically ρ_σ and ρ_μ are related to the similarities between node properties and arc properties respectively.

The function f(x; ρ_σ, ρ_μ, α) has to be maximized.
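The following Python sketch (ours, not from the chapter) evaluates a simplified version of (12.3), assuming, purely for illustration, that the correspondence functions ρ_σ and ρ_μ are supplied as pre-computed tables indexed by node and arc pairs:

```python
def fitness(x, n1, rho_sigma, rho_mu, E1, E2, alpha):
    """Sketch of the fitness function (12.3), with the assumed tables:
      rho_sigma[(j, i)]: correspondence of model node j with data node i
      rho_mu[(e1, e2)] : correspondence of model arc e1 with data arc e2
    x[i-1] is the model node matched to data node i, so c_ij = 1 iff x_i = j.
    """
    n2 = len(x)
    node_term = sum(
        1 - abs((1 if x[i - 1] == j else 0) - rho_sigma[(j, i)])
        for i in range(1, n2 + 1) for j in range(1, n1 + 1)
    ) / (n2 * n1)
    arc_term = sum(
        1 - abs((1 if x[u2 - 1] == u1 and x[v2 - 1] == v1 else 0)
                - rho_mu[((u1, v1), (u2, v2))])
        for (u1, v1) in E1 for (u2, v2) in E2
    ) / (len(E1) * len(E2))
    return alpha * node_term + (1 - alpha) * arc_term

# Toy instance (made up for this sketch): |V1| = 2, |V2| = 3.
x = [1, 1, 2]           # data nodes 1..3 matched to model nodes
E1 = [(1, 2)]
E2 = [(1, 3), (2, 3)]
# With every correspondence at 0.5, each term is 0.5 and f = 0.5 for any alpha.
rho_sigma = {(j, i): 0.5 for j in (1, 2) for i in (1, 2, 3)}
rho_mu = {(e1, e2): 0.5 for e1 in E1 for e2 in E2}
# With correspondences that agree perfectly with x, f reaches its maximum 1.
ps = {(j, i): float(x[i - 1] == j) for j in (1, 2) for i in (1, 2, 3)}
pm = {((u1, v1), (u2, v2)): float(x[u2 - 1] == u1 and x[v2 - 1] == v1)
      for (u1, v1) in E1 for (u2, v2) in E2}
```

Both terms lie in [0, 1], so f is bounded by 1, reached only when the node and arc correspondences fully agree with the matching encoded by x.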


Table 12.1  Mean values of experimental results after 10 executions of each algorithm on the inexact graph matching problem of the human brain example.

            Best fitness value   Execution time   Number of evaluations
UMDA             0.718623           00:53:29              85958
UMDAc            0.745036           03:01:05             301850
MIMIC            0.702707           00:57:30              83179
MIMICc           0.747970           03:01:07             301850
EBNA             0.716723           01:50:39              85958
EGNA             0.746893           04:13:39             301850
ssGA             0.693575           07:31:26             201900

                p < 0.001          p < 0.001            p < 0.001

6.4 Experimental results

Results such as the best individual obtained, the computation time, and the number of evaluations needed to reach the final solution were recorded for each experiment.

The computation time reported is the CPU time of the process for each execution, and therefore it does not depend on the multiprogramming level at the moment of execution. This computation time is presented as a measure to illustrate the different computational complexity of the algorithms. It is also important to note that all the operations for the estimation of the distribution, the simulation, and the evaluation of the new individuals are carried out in memory.

Each algorithm was executed 10 times, and the null hypothesis of the same distribution densities was tested for each of them. The non-parametric tests of Kruskal-Wallis and Mann-Whitney were used. This task was carried out with the statistical package S.P.S.S. release 9.00 and the results are shown in Table 12.1.
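The same non-parametric checks can be reproduced outside S.P.S.S.; the snippet below is a sketch using SciPy, and the sampled values are illustrative stand-ins, not the experimental data.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(0)
# Illustrative stand-ins for 10 best-fitness values per algorithm.
umda  = rng.normal(0.7186, 0.005, 10)
umdac = rng.normal(0.7450, 0.005, 10)
mimic = rng.normal(0.7027, 0.005, 10)

# Kruskal-Wallis: null hypothesis of identical distributions for all groups.
h_stat, p_all = kruskal(umda, umdac, mimic)

# Mann-Whitney U: pairwise comparison, e.g. UMDA vs. UMDAc.
u_stat, p_pair = mannwhitneyu(umda, umdac, alternative="two-sided")
```

With clearly separated groups, both tests reject the null hypothesis, mirroring the p < 0.001 entries in Table 12.1.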

This table shows the mean results for each of the experiments for the different parameters (best fitness value obtained, execution time and number of evaluations respectively). Additionally, the same Kruskal-Wallis and Mann-Whitney tests were applied to test the differences between particular algorithms. The results were as follows:

• Between algorithms of similar complexity only:

- UMDA vs. UMDAc. Fitness value: p < 0.001; CPU time: p < 0.001; Evaluations: p < 0.001.


Solving Graph Matching with EDAs Using a Permutation-Based Representation 261

- MIMIC vs. MIMICc . Fitness value: p < 0.001; CPU time: p < 0.001; Evaluations: p < 0.001.

- EBNA vs. EGNA. Fitness value: p < 0.001; CPU time: p < 0.001; Evaluations: p < 0.001.

From the results we can conclude that the differences between the discrete algorithms and their continuous counterparts are significant for all the algorithms analyzed. This means that the choice between a discrete learning algorithm and its continuous equivalent makes a large difference on all the parameters analyzed. It is important to note that the number of evaluations was expected to differ, as the ending criteria for the discrete and continuous domains were set differently. In all cases, the continuous algorithms obtained a fitter individual, but the CPU time and the number of individuals created were also larger.

• Between discrete EDAs only:

Fitness value: p < 0.001.

CPU time: p < 0.001.

Evaluations: p < 0.001.

In this case significant differences are also obtained in fitness value and CPU time, as well as in the number of evaluations. The discrete algorithm that obtained the best result was UMDA, closely followed by EBNA. The differences in CPU time also accord with the complexity of the learning algorithm used. Finally, the different numbers of evaluations mean that MIMIC required significantly fewer individuals to converge (to reach uniformity in the population), whereas the other two EDAs required roughly the same number of evaluations to converge.

The genetic algorithm GENITOR lags far behind the performance of EDAs. The computation time is also a factor to be taken into account: the fact that GENITOR requires about 7 hours per execution gives an idea of the complexity of the problem that these algorithms are dealing with.

• Between continuous EDAs only:

Fitness value: p = 0.342.

CPU time: p < 0.001.

Evaluations: p = 1.000.

In the case of the continuous algorithms, the differences in fitness value between the different learning methods are not significant in the light of


the results. Nevertheless, the CPU time required for each of them also accords with the complexity of the learning algorithm. On the other hand, as the ending criterion for all the continuous algorithms was to reach the same number of evaluations, there were obviously no differences between them in the number of evaluations. Regarding the differences in computation time between discrete and continuous EDAs, it is important to note that the latter require all 301850 individuals to be generated before they finish the search. Furthermore, the computation time for the continuous algorithms is longer than for their discrete equivalents as a result of several factors: firstly, the higher number of evaluations they perform per execution; secondly, the longer individual-to-solution translation procedure that has to be done for each of the individuals generated; and lastly, the longer time required to learn the model in continuous spaces.

In the light of the fitness values obtained, we can conclude the following: generally speaking, continuous algorithms perform better than discrete ones, both when comparing all of them together and when comparing only algorithms of equivalent complexity.

7. Conclusions and further work

This chapter introduces a new individual representation approach for EDAs applied to the inexact graph matching problem. This new individual representation can be applied in both discrete and continuous domains.

In experiments carried out with a real example, the performance of this new approach was compared between the discrete and continuous domains. Continuous EDAs showed better performance in terms of the fittest individual obtained, but required a longer execution time and more evaluations. Additionally, other fitness functions should be tested with this new approach. Techniques such as those of Bloch (1999a) and Bloch (1999b) could also help to introduce better similarity measures and therefore improve the results obtained considerably.

Acknowledgments

This chapter has been partially supported by the Spanish Ministry for Science and Education, and the French Ministry for Education, Research and Technology with the projects HF1999-0107 and Picasso-00773TE respectively. The authors would also like to thank R. Etxeberria, I. Inza and J.A. Lozano for their useful advice and contributions to this work.


References

Banzhaf, W. (1990). The molecular traveling salesman. Biological Cybernetics, 64:7-14.

Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A., and Boeres, C. (2000a). Inexact graph matching using learning and simulation of Bayesian networks. An empirical comparison between different approaches with synthetic data. In Proceedings of CaNew workshop, ECAI 2000 Conference, ECCAI, Berlin.

Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A., and Boeres, C. (2000b). Learning and simulation of Bayesian networks applied to inexact graph matching. International Journal of Approximate Reasoning. (submitted).

Bloch, I. (1999a). Fuzzy relative position between objects in image processing: a morphological approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7):657-664.

Bloch, I. (1999b). On fuzzy distances and their use in image processing under imprecision. Pattern Recognition, 32:1873-1895.

Boeres, C., Perchant, A., Bloch, I., and Roux, M. (1999). A genetic algorithm for brain image recognition using graph non-bijective correspondence. Unpublished manuscript.

Box, G.E.P. and Muller, M.E. (1958). A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29:610-611.

Cross, A.D.J. and Hancock, E.R. (1998). Graph matching with a dual-step EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1236-1253.

Cross, A.D.J. and Hancock, E.R. (1999). Convergence of a hill climbing genetic algorithm for graph matching. In Hancock, E.R. and Pelillo, M., editors, Lecture Notes in Computer Science 1654, pages 220-236, York, UK.

Cross, A.D.J., Wilson, R.C., and Hancock, E.R. (1997). Inexact graph matching using genetic search. Pattern Recognition, 30(6):953-970.

De Bonet, J.S., Isbell, C.L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, Vol. 9.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence, CIMAF99, Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Finch, A.W., Wilson, R.C., and Hancock, E.R. (1997). Matching Delaunay graphs. Pattern Recognition, 30(1):123-140.

Finch, A.W., Wilson, R.C., and Hancock, E.R. (1998). Symbolic graph matching with the EM algorithm. Pattern Recognition, 31(11):1777-1790.

Gold, S. and Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377-388.


Hancock, E.R. and Kittler, J. (1990). Edge-labeling using dictionary-based relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):165-181.

Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Lemmer, J.F. and Kanal, L.N., editors, Uncertainty in Artificial Intelligence, volume 2, pages 149-163. North-Holland, Amsterdam.

Larrañaga, P., Etxeberria, R., Lozano, J.A., and Peña, J.M. (2000). Optimization in continuous domains by learning and simulation of Gaussian networks. In Proceedings of the Workshop in Optimization by Building and Using Probabilistic Models, a workshop within the 2000 Genetic and Evolutionary Computation Conference, GECCO 2000, pages 201-204, Las Vegas, Nevada, USA.

Lovász, L. and Plummer, M.D. (1986). Matching Theory. Mathematics Studies. Elsevier Science, North-Holland.

Michalewicz, Z. (1992). Genetic Algorithms + Data Structures = Evolution Programs. Springer Verlag, Berlin Heidelberg.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Myers, R. and Hancock, E.R. (2001). Least commitment graph matching with genetic algorithms. Pattern Recognition, 34:375-394.

Oliver, J., Smith, D., and Holland, J. (1987). A study of permutation crossover operators on the TSP. In Grefenstette, J.J., editor, Proceedings of the Second International Conference on Genetic Algorithms and Their Applications, pages 224-230. Lawrence Erlbaum Associates.

Perchant, A. (2000). Morphism of graphs with fuzzy attributes for the recognition of structural scenes. PhD Thesis, École Nationale Supérieure des Télécommunications, Paris, France (in French).

Perchant, A. and Bloch, I. (1999). A new definition for fuzzy attributed graph homomorphism with application to structural shape recognition in brain imaging. In IMTC'99, 16th IEEE Instrumentation and Measurement Technology Conference, pages 1801-1806, Venice, Italy.

Perchant, A., Boeres, C., Bloch, I., Roux, M., and Ribeiro, C. (1999). Model-based scene recognition using graph fuzzy homomorphism solved by genetic algorithms. In GbR'99, 2nd International Workshop on Graph-Based Representations in Pattern Recognition, pages 61-70, Castle of Haindorf, Austria.

Ripley, B.D. (1987). Stochastic Simulation. John Wiley and Sons.

Singh, M. and Chaudhury, A.C.S. (1997). Matching structural shape descriptions using genetic algorithms. Pattern Recognition, 30(9):1451-1462.


Whitley, D. and Kauth, J. (1988). GENITOR: A different genetic algorithm. In Proceedings of the Rocky Mountain Conference on Artificial Intelligence, volume 2, pages 118-130.

Wilson, R.C. and Hancock, E.R. (1996). Bayesian compatibility model for graph matching. Pattern Recognition Letters, 17:263-276.

Wilson, R.C. and Hancock, E.R. (1997). Structural matching by discrete relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):634-648.


III

MACHINE LEARNING


Chapter 13

Feature Subset Selection by Estimation of Distribution Algorithms

I. Inza, P. Larrañaga, B. Sierra
Department of Computer Science and Artificial Intelligence

University of the Basque Country

{ccbincai, ccplamup, ccpsiarb}@si.ehu.es

Abstract Feature Subset Selection is a well known task in the Machine Learning, Data Mining, Pattern Recognition and Text Learning paradigms. In this chapter, we present a set of Estimation of Distribution Algorithms (EDAs) inspired techniques to tackle the Feature Subset Selection problem in Machine Learning and Data Mining tasks. Bayesian networks are used to factorize the probability distribution of best solutions in small and medium dimensionality datasets, and simpler probabilistic models are used in larger dimensionality domains. In a comparison with different sequential and genetic-inspired algorithms on natural and artificial datasets, EDA-based approaches have obtained encouraging accuracy results and need a smaller number of evaluations than genetic approaches.

Keywords: Feature Subset Selection, cross-validation, predictive accuracy, number of evaluations, Estimation of Distribution Algorithms, Genetic Algorithms

1. Introduction

In supervised Machine Learning and Data Mining processes, the goal of a supervised learning algorithm is to induce a classifier that allows us to classify new examples E* = {e_{L+1}, ..., e_{L+Q}} that are only characterized by their n descriptive features. To generate this classifier we have a set of L samples E = {e_1, ..., e_L}, each characterized by n descriptive features X = {X_1, ..., X_n} and the class label C = {w_1, ..., w_c} to which they belong. The classification part of Machine Learning and Data Mining can be seen as a "data-driven"

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms
© Springer Science+Business Media New York 2002


process where, putting less emphasis on prior hypotheses than is the case with classical statistics, a "general rule" is induced for classifying new examples using a learning algorithm. Many representations with different biases have been used to develop this "classification rule", and the Machine Learning and Data Mining communities have formulated the following question: are all of these n descriptive features useful for learning the "classification rule"? Trying to respond to this question, the Feature Subset Selection (FSS) approach reformulates it as follows: given a set of candidate features, select the best subset under some learning algorithm.

This dimensionality reduction produced by an FSS process has several advantages for a classification system on a specific task:

• A reduction in the cost of data acquisition.

• An improvement in the comprehensibility of the final classification model.

• Faster induction of the final classification model.

• An improvement in classification accuracy.

The attainment of higher classification accuracies is the usual objective of Machine Learning processes. It has long been proved that the classification accuracy of supervised classification algorithms is not monotonic with respect to the addition of features. Irrelevant or redundant features, depending on the specific characteristics of the learning algorithm, may degrade the predictive accuracy of the classification model. In our work, the objective of FSS will be maximization of the performance of the classification algorithm. In addition, with the reduction in the number of features, it is more likely that the final classifier is less complex and more understandable by humans.

Once its objective is fixed, FSS can be viewed as a search problem, with each state in the search space specifying a subset of the possible features of the task. Exhaustive evaluation of possible feature subsets is usually infeasible in practice because of the large amount of computational effort required. Many search techniques have been proposed for solving the FSS problem when there is no knowledge about the nature of the task, by carrying out an intelligent search in the space of possible solutions. As randomized, evolutionary, population-based search algorithms, Genetic Algorithms (GAs) are possibly the most commonly used search engines in the FSS process.

As an alternative paradigm to GAs, in this chapter we propose the use of EDA-inspired techniques for the FSS task. The choice of the specific EDA-inspired algorithm which performs FSS depends on the initial dimensionality (number of features) of the domain. An FSS problem is considered small scale, medium scale or large scale (Kudo and Sklansky, 2000) if the number of features is less than 20, between 20 and 49, or greater than 49, respectively. For small and medium


Feature Subset Selection by Estimation of Distribution Algorithms 271

scale domains we propose using the most attractive probabilistic paradigm, Bayesian networks, to factorize the probability distribution of the best solutions. For large scale domains, a large number of solutions is needed to induce a reliable Bayesian network, and we propose using simpler probabilistic structures: PBIL, BSC, MIMIC and TREE.
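For the large scale case, the simplest of these models, PBIL, keeps one independent probability per feature and shifts it toward the best sampled individuals. A minimal sketch follows; the population size, learning rate and OneMax-style toy fitness are illustrative choices, not the chapter's settings.

```python
import random

def pbil_step(p, popsize, n_best, lr, fitness):
    """One PBIL generation: sample a population from the probability vector p,
    then shift p toward the bitwise mean of the best sampled individuals."""
    pop = [[1 if random.random() < pi else 0 for pi in p] for _ in range(popsize)]
    best = sorted(pop, key=fitness, reverse=True)[:n_best]
    for i in range(len(p)):
        mean_i = sum(ind[i] for ind in best) / n_best
        p[i] = (1 - lr) * p[i] + lr * mean_i
    return p

# Toy usage: a OneMax-style fitness drives every probability toward 1.
random.seed(1)
p = [0.5] * 8
for _ in range(50):
    p = pbil_step(p, popsize=50, n_best=10, lr=0.1, fitness=sum)
```

After a few dozen generations the probability vector concentrates near the all-ones optimum, which is the convergence behaviour these simple univariate models rely on.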

The chapter is organized as follows. The next section introduces the FSS problem and its basic components. Section 3 presents the specific application of Bayesian networks to solve the FSS problem within the EDA paradigm for small and medium scale domains, and associated results for natural and artificial tasks. The applications of the PBIL, BSC, MIMIC and TREE probabilistic algorithms and their results for large scale natural and artificial domains are presented in Section 4. We finish the chapter with a brief set of conclusions and a description of possible avenues of future research in the field.

2. Feature Subset Selection: Basic components

Our work is located in Machine Learning and Data Mining, but the FSS literature includes numerous works from other fields such as Pattern Recognition (Ferri et al., 1994; Jain and Chandrasekaran, 1982; Kittler, 1978), Statistics (Miller, 1990; Narendra and Fukunaga, 1977) and Text-Learning (Mladenic, 1998; Yang and Pedersen, 1997). Thus, different research communities have exchanged and shared ideas on dealing with the FSS problem. A good review of FSS methods can be found in Liu and Motoda (1998).

The objective of FSS in a Machine Learning or a Data Mining framework (Aha and Bankert, 1994) is to reduce the number of features used to characterize a dataset so as to improve a learning algorithm's performance on a given task. Our objective will be the maximization of classification accuracy in a specific task for a specific learning algorithm; as a side effect, we will have a reduction in the number of features needed to induce the final classification model. The feature selection task can be viewed as a search problem, with each state in the search space identifying a subset of possible features. A partial ordering on this space, with each child having exactly one more feature than its parents, can be created.

Figure 13.1 expresses the search algorithm nature of the FSS process. The structure of this space suggests that any feature selection method must decide on four basic issues that determine the nature of the search process (Blum and Langley, 1997): a starting point in the search space, an organization of the search, an evaluation strategy for the feature subsets and a criterion for halting the search.

1. The starting point in the space. This determines the direction of the search. One might start with no features and successively add them, or one might start with all the features and successively remove them. One might also select an initial state somewhere in the middle of the search space.


Figure 13.1 In this 3-feature (F1, F2, F3) problem, each individual in the space represents a feature subset, a possible solution for the FSS problem. In each individual, a filled rectangle indicates that the corresponding feature is included in the feature subset.

2. The organization of the search. This determines the strategy for the search. Roughly speaking, search strategies can be complete or heuristic. The basis of complete search is the systematic examination of every possible feature subset. Three classic complete search implementations are depth-first, breadth-first and branch and bound search (Narendra and Fukunaga, 1977). Among heuristic algorithms, on the other hand, there are deterministic and non-deterministic heuristic algorithms. Classic deterministic heuristic FSS algorithms are sequential forward selection and sequential backward elimination (Kittler, 1978), floating selection methods (Pudil et al., 1994) and best-first search (Kohavi and John, 1997). They are deterministic in the sense that their runs always obtain the same solution. Non-deterministic heuristic search is used to escape from local maxima. Randomness is used for this purpose, which implies that one should not expect the same solution from different runs. Two classic implementations of non-deterministic search engines in FSS problems are the frequently applied GAs (Siedelecky and Sklansky, 1988) and Simulated Annealing (Doak, 1992).
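Of these strategies, sequential forward selection is the simplest to state precisely. The sketch below is a generic rendering, not any particular published implementation; the `evaluate` callback stands in for whatever evaluation function guides the search.

```python
def sfs(n_features, evaluate):
    """Sequential forward selection: greedily add the single feature that
    most improves the evaluation function; stop when no addition helps."""
    selected, best_score = set(), float("-inf")
    while True:
        candidates = [(evaluate(selected | {f}), f)
                      for f in range(n_features) if f not in selected]
        if not candidates:
            break
        score, f = max(candidates)
        if score <= best_score:
            break  # no improvement: a local optimum for this search strategy
        selected.add(f)
        best_score = score
    return selected, best_score
```

Sequential backward elimination is the mirror image: start from the full set and greedily delete the least useful feature until no deletion improves the score.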

3. Evaluation strategy for feature subsets. The evaluation function identifies promising areas of the search space by calculating the goodness of each proposed feature subset. The objective of the FSS algorithm is to maximize this function. The search algorithm uses the value returned by the evaluation function to guide the search. Some evaluation functions carry out this objective by looking only at the intrinsic characteristics of the data and measuring the power of a feature subset to discriminate between the classes of the problem: these evaluation functions are grouped under the title of filter strategies. These evaluation functions are usually monotonic and increase with the addition of

Page 292: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation

Feature Subset Selection by Estimation of Distribution Algorithms 273

features that can later damage the predictive accuracy of the final classifier. However, when the goal of FSS is maximization of classifier accuracy, the features selected should depend not only on the features and the target concept to be learned, but also on the special characteristics of the supervised classifier (Kohavi and John, 1997). The wrapper concept was proposed for this: it implies that the FSS algorithm conducts the search for a good subset by using the classifier itself as a part of the evaluation function, i.e. the same classifier that will be used to induce the final classification model. Once the classification algorithm is fixed, the idea is to train it with the feature subset found by the search algorithm, estimating the predictive accuracy on the training set and using that accuracy as the value of the evaluation function for that feature subset. In this way, any representational biases of the classifier used to construct the final classification model are included in the FSS process. The role of the supervised classification algorithm is the principal difference between the filter and wrapper approaches.
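A wrapper evaluation function of this kind can be sketched as follows. The interleaved fold scheme and the `fit`/`predict` interface are illustrative assumptions for the example, not a specific library's API.

```python
def wrapper_score(X, y, subset, fit, predict, k=10):
    """Wrapper evaluation sketch: estimate the classifier's accuracy by
    k-fold cross-validation using only the feature columns in `subset`.
    `fit(Xtr, ytr) -> model` and `predict(model, Xte) -> labels` stand in
    for whatever supervised classifier the wrapper is built around."""
    cols = sorted(subset)
    Xs = [[row[c] for c in cols] for row in X]
    n, correct = len(Xs), 0
    for fold in range(k):
        test_idx = list(range(fold, n, k))      # simple interleaved folds
        test_set = set(test_idx)
        Xtr = [Xs[i] for i in range(n) if i not in test_set]
        ytr = [y[i] for i in range(n) if i not in test_set]
        model = fit(Xtr, ytr)
        preds = predict(model, [Xs[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / n
```

The search algorithm then maximizes `wrapper_score` over subsets, so the biases of the final classifier are baked into every evaluation.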

4. Criterion for halting the search. An intuitive criterion for stopping the search is the non-improvement of the evaluation function value of alternative subsets. Another classic criterion is to fix a limit on the number of possible solutions to be visited during the search.

3. FSS by EDAs in small and medium scale domains

For small and medium dimensionality domains, we use the search scheme provided by the EBNA algorithm (Etxeberria and Larrañaga, 1999). An intuitive notation is used to represent each individual: there are n bits in each individual, each bit indicating whether a feature is present (1) or absent (0). Figure 13.2 shows an overview of the application of the EBNA search engine to the FSS problem (FSS-EBNA). In each generation of the search, the induced Bayesian network factorizes the probability distribution of the selected solutions. The Bayesian network contains n nodes, where each node represents one feature of the domain.

In our specific implementation of the EBNA algorithm, instead of better (but slower) techniques, a fast "score + search" procedure is used to learn the Bayesian network in each generation of the search. Algorithm B (Buntine, 1991) is used for learning Bayesian networks from data. Algorithm B is a greedy search heuristic which starts with an arc-less structure and, at each step, adds the arc which gives the maximum increase in the score: here, the score used is the BIC score (Schwarz, 1978). The algorithm stops when adding an arc no longer increases this score.
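The flavour of such a greedy "score + search" procedure can be sketched for binary variables. This is a simplification, not the book's implementation of Algorithm B: in particular, cycles are avoided here by only allowing arcs from lower to higher variable index, and all names are illustrative.

```python
import math
from itertools import product

def bic_node(data, child, parents):
    """BIC contribution of one binary node given its parents:
    log-likelihood minus (log N / 2) times the number of free parameters."""
    n = len(data)
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        c = counts.setdefault(key, [0, 0])
        c[row[child]] += 1
    loglik = 0.0
    for c0, c1 in counts.values():
        tot = c0 + c1
        for c in (c0, c1):
            if c:
                loglik += c * math.log(c / tot)
    n_params = 2 ** len(parents)   # one free parameter per parent configuration
    return loglik - 0.5 * math.log(n) * n_params

def greedy_bic_search(data, n_vars):
    """Greedy arc addition: stop when no arc improves the total BIC score."""
    parents = {v: [] for v in range(n_vars)}
    score = {v: bic_node(data, v, []) for v in range(n_vars)}
    improved = True
    while improved:
        improved, best = False, None
        for u, v in product(range(n_vars), repeat=2):
            if u < v and u not in parents[v]:   # simplified cycle avoidance
                gain = bic_node(data, v, parents[v] + [u]) - score[v]
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, u, v)
        if best:
            gain, u, v = best
            parents[v].append(u)
            score[v] += gain
            improved = True
    return parents
```

On data where two variables are strongly dependent, the BIC gain of linking them outweighs the penalty term, so the greedy loop adds that arc and then stops.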

Determination of a minimum population size to reliably estimate the parameters of a Bayesian network is not an easy task (Friedman and Yakhini,


Figure 13.2 The FSS-EBNA method. (The original diagram shows one generation of the cycle: the current population of N individuals with their evaluation function values; selection of N/2 individuals; induction of a Bayesian network with n nodes; sampling of N-1 individuals from the Bayesian network and calculation of their evaluation function values; and selection of the best N-1 of the pooled individuals to form the next population.)

1996). This difficulty is greater for real world problems where the true probability distribution is not known. Taking the dimensionality of our problems into account, we consider that a population size of 1,000 individuals is enough to reliably estimate the Bayesian network parameters.

In our approach, the best individual of the previous generation is maintained and N-1 individuals are created as offspring. An elitist approach is then used to form the successive populations. Instead of directly discarding the N-1 individuals from the previous generation and replacing them with the N-1 newly generated ones, the 2N-2 individuals are put together and the best N-1 are chosen from them. These best N-1 individuals, together with the best individual of the previous generation, form the new population. In this way, the populations converge faster to the best individuals found, but this also carries a risk of losing diversity within the population.
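The replacement scheme just described can be stated compactly; the function and argument names below are illustrative.

```python
def next_population(prev_pop, offspring, fitness):
    """Elitist replacement as described above: the previous best survives
    unconditionally; the remaining N-1 slots are filled with the best of
    the 2N-2 individuals formed by pooling the other N-1 old individuals
    with the N-1 offspring."""
    prev_sorted = sorted(prev_pop, key=fitness, reverse=True)
    best, rest = prev_sorted[0], prev_sorted[1:]
    pool = sorted(rest + offspring, key=fitness, reverse=True)
    return [best] + pool[:len(prev_pop) - 1]
```

Because the previous best always survives and the pool is truncated from the top, the best fitness in the population can never decrease, which is the source of both the fast convergence and the diversity risk noted above.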


3.1 Characteristics of the evaluation function

A wrapper approach is used here to calculate the evaluation function value of each proposed individual or feature subset. The value of the evaluation function of a feature subset found by the EBNA search technique, once the supervised classifier is fixed, is given by an accuracy estimation on the training data. The accuracy estimation, seen as a random variable, has an intrinsic uncertainty. 10-fold cross-validation, run multiple times and combined with a heuristic proposed by Kohavi and John (1997), is used to control this intrinsic uncertainty. The heuristic works as follows:

• If the standard deviation of the accuracy estimate is above 1%, another 10-fold cross-validation is executed.

• This is repeated until the standard deviation drops below 1%, up to a maximum of five times.
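Read as pseudocode, the heuristic above might look like this. Taking the standard deviation across the repeated cross-validation estimates is one plausible reading of the rule, stated here as an assumption; `run_cv` is assumed to return a single 10-fold cross-validation accuracy estimate.

```python
import statistics

def estimate_accuracy(run_cv, max_runs=5, std_threshold=0.01):
    """Noise-control heuristic in the Kohavi-John style described above:
    repeat 10-fold cross-validation until the standard deviation of the
    accuracy estimates drops below 1%, up to five repetitions."""
    estimates = [run_cv()]
    while len(estimates) < max_runs:
        if len(estimates) >= 2 and statistics.stdev(estimates) < std_threshold:
            break
        estimates.append(run_cv())
    return sum(estimates) / len(estimates), estimates
```

A stable estimator stops after the minimum two runs, while a noisy one is re-run up to the cap of five, matching the behaviour described for small versus large datasets.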

In this way, small datasets will be cross-validated many times, but larger ones may be cross-validated only once. Although FSS-EBNA is independent of the specific supervised classifier used within its wrapper approach, in our set of experiments we use the well known Naive-Bayes (NB) supervised classifier (Cestnik, 1990). This is a simple and fast classifier which uses Bayes rule to predict the class for each test instance, assuming that features are independent of each other given the class. Because of its simplicity and fast induction, it is commonly used in Data Mining tasks of high dimensionality (Kohavi and John, 1997; Mladenic, 1998). The probability of discrete features is estimated from data using maximum likelihood estimation and applying the Laplace correction. A Gaussian distribution is assumed to estimate the class conditional probabilities for continuous attributes. Unknown values in the test instance are skipped. Despite its independence assumption among variables, the literature shows that the NB classifier gives remarkably high accuracies in many domains (Langley and Sage, 1994), especially medical ones. Although NB scales well with irrelevant features, it can improve its accuracy by discarding correlated or redundant features, since correlated features violate the independence assumption on which it relies to predict the class. Thus, FSS can also play a "normalization" role that discards these groups of correlated features, ideally selecting just one of them for the final model.
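A minimal Naive-Bayes for discrete features with the Laplace correction can be sketched as follows; the Gaussian handling of continuous attributes and the skipping of unknown values are omitted for brevity, and the class layout is an illustrative simplification.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal Naive-Bayes for discrete features with Laplace correction."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.prior = {c: math.log(y.count(c) / len(y)) for c in self.classes}
        self.counts = defaultdict(Counter)   # counts[(feature index, class)][value]
        self.n_features = len(X[0])
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                self.counts[(i, c)][v] += 1
        # Distinct values per feature, used in the Laplace denominator.
        self.values = [sorted({row[i] for row in X}) for i in range(self.n_features)]
        return self

    def predict(self, row):
        def logpost(c):
            lp = self.prior[c]
            for i, v in enumerate(row):
                cnt = self.counts[(i, c)]
                # Laplace correction: add 1 to every count.
                lp += math.log((cnt[v] + 1) / (sum(cnt.values()) + len(self.values[i])))
            return lp
        return max(self.classes, key=logpost)
```

The Laplace correction keeps unseen feature values from zeroing out a class posterior, which is exactly the failure mode maximum likelihood estimation alone would have.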

3.2 The relevance of the stopping criteria

To stop the search algorithm, we have adopted an intuitive stopping criterion which takes the number of instances in the training set into account. In this way, we try to avoid the "overfitting" problem (Jain and Zongker, 1997):


• For datasets with more than 2,000 instances, the search is stopped when in a sampled new generation no feature subset appears with an evaluation function value improving the best subset found in the previous generation. Thus, the best subset of the search, found in the previous generation, is returned as the FSS-EBNA's solution.

• For smaller datasets the search is stopped when in a sampled new generation no feature subset appears whose evaluation function value improves, at least with a p-value smaller than 0.1 (using a 10-fold cross-validated paired t test between the folds of both estimations, taking only the first run into account when 10-fold cross-validation is repeated multiple times), upon the value of the evaluation function of the best feature subset of the previous generation. Thus, the best subset of the previous generation is returned as FSS-EBNA's solution.

For larger datasets the "overfitting" phenomenon has less impact, and we hypothesize that an improvement in the accuracy estimation over the training set will be coupled with an improvement in generalization accuracy over unseen instances. For smaller datasets, in contrast, in order to avoid the risk of "overfitting", continuation of the search is only allowed when a significant improvement appears in the accuracy estimation of the best individuals of consecutive generations. We hypothesize that when this significant improvement appears, the "overfitting" risk decays and there is a basis for further improvement of generalization accuracy over unseen instances.
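The small-dataset rule can be sketched with SciPy's paired t test. Treating the two lists as the 10 fold accuracies of the previous and new best subsets is an assumption for the example, and all names are illustrative.

```python
from scipy.stats import ttest_rel

def should_continue(prev_folds, new_folds, alpha=0.1):
    """Sketch of the small-dataset stopping rule described above: the search
    goes on only if the new generation's best subset improves on the previous
    best with p < 0.1 under a paired t test over the fold accuracies."""
    t_stat, p_value = ttest_rel(new_folds, prev_folds)
    mean_gain = sum(n - q for n, q in zip(new_folds, prev_folds)) / len(new_folds)
    return mean_gain > 0 and p_value < alpha
```

A consistent gain across folds yields a small p-value and the search continues; a mixed or negative result stops it, returning the previous generation's best subset.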

The work of Ng (1997) can be consulted to understand the essence of this stopping criterion. The author demonstrates that when cross-validation is used to select from a large pool of different classification models in a noisy task with too small a training set, it may not be advisable to pick the model with minimum cross-validation error: a model with higher cross-validation error could have better generalization power over novel test instances.

Another concept to consider in this stopping criterion is the wrapper nature of the proposed evaluation function. As we will see in the next section, the evaluation function value of each visited solution (the accuracy estimation of the NB classifier on the training set by multiple runs of 10-fold cross-validation, using only the features proposed by the solution) needs several seconds to be calculated (never more than 4 seconds for the datasets used here). As the creation of a new generation implies the evaluation of 1,000 new individuals, we only allow the search to continue when it demonstrates that it is able to escape from local optima and to discover new "best" solutions in each generation. When the wrapper approach is used, CPU time must also be controlled: we hypothesize that when the search is allowed to continue by our stopping criterion, the CPU times to evaluate a new generation

Page 296: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation

Feature Subset Selection by Estimation of Distribution Algorithms 277

Table 13.1 Details of small and medium dimensionality experimental domains.

Domain Number of instances Number of features

(1) Ionosphere       351     34
(2) Horse-colic      368     22
(3) Soybean-large    683     35
(4) Anneal           898     38
(5) Image            2,310   19
(6) Sick-euthyroid   3,163   25

of solutions are justified. For a more extensive study of this stopping criterion, the work by Inza et al. (2000) can be consulted.
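The two-regime rule can be sketched as follows; the function name, the dataset-size threshold and the significance proxy are our own illustrative choices, not the chapter's exact procedure:

```python
def should_continue(prev_best, new_best, n_instances, large=2000):
    """Decide whether the search continues for another generation.

    prev_best / new_best: (mean accuracy, standard error) of the best
    individual of two consecutive generations, as estimated by the
    repeated 10-fold cross-validation of the wrapper function.
    The `large` threshold and the significance rule are illustrative.
    """
    mean_prev, se_prev = prev_best
    mean_new, se_new = new_best
    if n_instances >= large:
        # larger datasets: any improvement of the best estimate justifies
        # continuing, since overfitting has less impact
        return mean_new > mean_prev
    # smaller datasets: demand an improvement that exceeds the combined
    # uncertainty of both estimates (a crude significance check)
    return mean_new - mean_prev > (se_prev ** 2 + se_new ** 2) ** 0.5
```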

3.3 Experiments in real domains

We test the power of FSS-EBNA on six real datasets of small and medium dimensionality. Table 13.1 gives the principal characteristics of these datasets. All the datasets come from the UCI repository (Murphy, 1995) and all have been frequently used in the FSS literature. As an exhaustive search over all feature combinations is infeasible, a heuristic search is needed.

To test the power of FSS-EBNA, a comparison with the following well-known FSS algorithms is carried out:

• Sequential Forward Selection (SFS) is a classic hill-climbing search algorithm (Kittler, 1978) which starts from an empty subset of features and sequentially selects features until no improvement is achieved in the evaluation function value. It performs the major part of its search near the empty feature set.

• Sequential Backward Elimination (SBE) is another classic hill-climbing algorithm (Kittler, 1978) which starts from the full set of features and sequentially deletes features until no improvement is achieved in the evaluation function value. It performs the major part of its search near the full feature set. Instead of working with a population of solutions, SFS and SBE try to optimize a single feature subset.

• GA with one-point crossover (FSS-GA-o).

• GA with uniform crossover (FSS-GA-u).

• FSS-EBNA.
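The greedy logic of SFS (and, by symmetry, SBE) can be sketched as follows; `evaluate` stands for any wrapper scorer (in the chapter, the cross-validated accuracy of NB on the selected features), and the function name is ours:

```python
def sfs(n_features, evaluate):
    """Sequential Forward Selection: greedy hill-climbing from the
    empty feature set, adding one feature at a time."""
    selected, best = set(), evaluate(frozenset())
    improved = True
    while improved:
        improved = False
        remaining = [f for f in range(n_features) if f not in selected]
        if not remaining:
            break
        score, feat = max((evaluate(frozenset(selected | {f})), f)
                          for f in remaining)
        if score > best:                 # keep the single best addition
            selected.add(feat)
            best = score
            improved = True
    return selected, best

# SBE is the mirror image: start from the full set and greedily delete
# the feature whose removal most improves the evaluation function.
```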

For all the FSS algorithms the wrapper evaluation function explained in the previous section is used. The SFS and SBE algorithms stop deterministically, and the FSS-GAs apply the same stopping criterion as FSS-EBNA.

Although the optimal selection of parameters is still an open problem for GAs (Grefenstette, 1986), for both FSS-GAs, guided by the recommendations of Bäck (1996), the probability of crossover is set to 1.0 and the mutation probability to 1/(problem dimensionality), values commonly used in the literature. Fitness-proportional selection is used to select individuals for crossover. In order to avoid any bias in the comparison, the remaining FSS-GA parameters are the same as FSS-EBNA's: the population size is set to 1,000 and the new population is formed from the best members of both the old population and the offspring.

Due to the non-deterministic nature of FSS-EBNA and FSS-GAs, 5 replications of 2-fold cross-validation (5x2cv) are applied to assess the predictive generalization accuracy of the compared FSS algorithms. In each replication, the available data is randomly partitioned into two equal-sized sets S1 and S2. The FSS algorithm is trained on each set and tested on the other. In this way, the reported accuracies are the mean of 10 accuracies; the standard deviation of the mean is also reported. We extend the comparison by running the NB classifier without feature selection. Table 13.2 shows accuracy results for the real datasets. Apart from a high accuracy level, we also focus our attention on achieving a reduced number of features: a good tradeoff between high accuracy and low cardinality of the selected feature subset is required.

A deeper analysis of the accuracy results is carried out by using statistical tests. The 5x2cv F test (Alpaydin, 1999) is performed to determine the significance degree of the accuracy differences between each algorithm and FSS-EBNA. Thus, in Table 13.2 the symbol '†' denotes a statistically significant difference from FSS-EBNA at the 0.05 confidence level, and '*' denotes a significant difference at the 0.1 confidence level. The meaning of these symbols is the same in all the tables of this chapter. Table 13.3 shows the average (and its standard deviation) number of features selected by each approach. Experiments are executed on a SGI-Origin 200 computer using the Naive-Bayes implementation of the MLC++ software (Kohavi et al., 1997).
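The 5x2cv F statistic itself is simple to compute from the ten fold-wise accuracy differences; this sketch follows Alpaydin's definition, with the critical-value lookup left to the reader:

```python
def f_statistic_5x2cv(acc_a, acc_b):
    """Combined 5x2cv F test statistic (Alpaydin, 1999).

    acc_a, acc_b: 5x2 lists of test accuracies of two algorithms, one
    entry per (replication, fold).  Under the null hypothesis of equal
    accuracy the statistic is approximately F-distributed with (10, 5)
    degrees of freedom; compare it against the tabulated critical value
    for the chosen confidence level (0.05 or 0.1 in this chapter).
    """
    sum_sq, sum_var = 0.0, 0.0
    for i in range(5):
        p1 = acc_a[i][0] - acc_b[i][0]   # fold-wise accuracy differences
        p2 = acc_a[i][1] - acc_b[i][1]
        p_bar = (p1 + p2) / 2.0
        sum_sq += p1 ** 2 + p2 ** 2
        sum_var += (p1 - p_bar) ** 2 + (p2 - p_bar) ** 2
    return sum_sq / (2.0 * sum_var)
```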

All FSS algorithms help NB to reduce the number of features needed to induce the final models. This dimensionality reduction is coupled with considerable accuracy improvements in all datasets except Anneal, where FSS-EBNA is the only algorithm able to significantly improve the accuracy of the NB model without feature selection. Although the accuracy differences between FSS algorithms are not statistically significant for most domains, FSS-EBNA has the best average accuracy of the compared methods.

Although SBE achieves similar accuracy results in many datasets relative to randomized algorithms, its major disadvantage is the small reduction that

Table 13.2 Accuracy percentages of the NB classifier on real datasets without feature selection and using the five FSS methods shown. The last row shows the average accuracy percentages for all six domains.

Domain    Without FSS     SFS             SBE             FSS-GA-o        FSS-GA-u        FSS-EBNA

(1)       84.84 ± 3.12†   90.25 ± 1.58*   91.39 ± 2.68    91.17 ± 3.19    90.97 ± 2.56*   92.40 ± 2.04
(2)       78.97 ± 2.98†   83.31 ± 1.98    82.12 ± 2.41*   83.43 ± 2.82    83.51 ± 1.47    83.93 ± 1.58
(3)       81.96 ± 3.46†   86.38 ± 3.30*   87.78 ± 3.90*   85.64 ± 4.06†   86.09 ± 4.37†   88.64 ± 1.70
(4)       93.01 ± 3.13*   86.72 ± 2.09†   92.49 ± 2.94*   92.95 ± 2.67*   93.13 ± 2.56    94.10 ± 3.00
(5)       79.95 ± 1.52†   88.65 ± 1.21    88.82 ± 1.74    88.67 ± 2.48    89.12 ± 1.56    88.98 ± 0.98
(6)       84.77 ± 2.70†   90.73 ± 0.55†   95.57 ± 0.16    95.97 ± 0.58    95.90 ± 0.43    96.14 ± 0.65

Average   83.91           87.67           89.69           89.63           89.78           90.69

Table 13.3 Cardinalities of finally selected feature subsets for the NB classifier on real datasets without feature selection and using the five FSS methods shown. It must be taken into account that when no FSS is applied to NB, it uses all the features.

Domain    Without FSS   SFS            SBE            FSS-GA-o       FSS-GA-u       FSS-EBNA

(1)       34            6.00 ± 1.41    21.30 ± 3.80   15.00 ± 2.36   12.66 ± 1.03   13.40 ± 2.11
(2)       22            6.00 ± 2.74    11.20 ± 2.65   5.00 ± 2.82    4.60 ± 1.75    6.10 ± 1.85
(3)       35            12.70 ± 2.71   23.50 ± 2.75   19.00 ± 2.09   19.16 ± 2.31   18.90 ± 2.76
(4)       38            5.50 ± 2.32    33.60 ± 2.91   21.66 ± 2.66   19.50 ± 2.25   20.50 ± 3.13
(5)       19            5.60 ± 1.57    9.40 ± 1.95    8.00 ± 1.41    8.00 ± 1.09    8.00 ± 0.66
(6)       25            0.00 ± 0.00    13.83 ± 1.32   10.66 ± 2.58   10.16 ± 1.72   9.80 ± 2.09

it produces in the number of features. In all domains, SBE is the algorithm with the lowest feature reduction, and this reduction is nearly insignificant in the Anneal dataset. Although the other sequential algorithm, SFS, returns the subsets with the smallest number of features in most datasets, its accuracy results in all except one dataset are significantly inferior to FSS-EBNA's.

Although both FSS-GA approaches and FSS-EBNA obtain similar accuracy results in many datasets, we note that the FSS-GA approaches need more generations than FSS-EBNA to arrive at similar (or lower) accuracy levels. Table 13.4 shows the generation in which the FSS-GA approaches and FSS-EBNA stop, using the stopping criterion explained above.

Starting from the fact that the accuracy differences between the two FSS-GA approaches are not statistically significant, it seems that FSS-GA-o is better suited than FSS-GA-u for the Horse-colic and Soybean-large datasets, while in Ionosphere and Anneal we see the opposite behaviour. However, the results show that FSS-EBNA arrives faster at similar or better accuracies (see also Table 13.2) than both FSS-GA approaches: it seems that FSS-EBNA, by using Bayesian networks, is able to capture the underlying structure of the problem faster than the FSS-GAs. Only in the Image dataset, the domain of lowest dimensionality, does FSS-EBNA not give an advantage. This superiority of EDA approaches

Table 13.4 Mean stop-generation for FSS-GAs and FSS-EBNA. The standard deviation of the mean is also reported. The initial generation is considered to be the zero generation.

Domain FSS-GA-o FSS-GA-u FSS-EBNA

Ionosphere       3.50 ± 0.84†   3.10 ± 0.56†   1.80 ± 0.42
Horse-colic      3.20 ± 1.13*   3.40 ± 0.51†   2.40 ± 0.69
Soybean-large    3.30 ± 0.82*   3.60 ± 0.51†   2.50 ± 0.70
Anneal           3.80 ± 0.42†   3.20 ± 0.44†   1.80 ± 0.42
Image            3.60 ± 0.84    3.70 ± 0.48    3.50 ± 0.42
Sick-euthyroid   4.50 ± 0.70*   4.80 ± 0.42*   3.50 ± 0.97

that use Bayesian networks over GAs in domains with interacting variables is also noted in the literature (Pelikan et al., 1998).

When the wrapper approach is used to calculate the evaluation function value of each subset found, the faster discovery of similar or better accuracies is a critical task. Although NB is a fast supervised classifier, it needs several CPU seconds to estimate the predictive accuracy (by 10-fold cross-validation run multiple times, as explained) of a feature subset on the training set: depending on the number of features selected, around 1 CPU second is needed in Ionosphere (the domain with fewest instances), while around 3 are needed in Image (the domain with most instances). Since the generation of a new population of solutions implies the evaluation of 1,000 new individuals, stopping earlier without damaging the accuracy level is highly desirable in order to save CPU time. On the other hand, the times for the induction of the Bayesian networks over the selected individuals are insignificant in all domains, so the EDA savings in CPU time relative to the GA approaches are maintained: 3 CPU seconds are needed on average in the Image domain (the domain with fewest features) and 14 CPU seconds in Anneal (the domain with most features).
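To make the cost structure concrete, here is a minimal sketch of such a wrapper evaluation function: a toy Gaussian Naive-Bayes (continuous features only, unlike the MLC++ implementation used in the chapter) scored by a single pass of k-fold cross-validation. All names are ours:

```python
import math
import random
import statistics

def nb_fit(X, y):
    """Per-class prior and per-feature (mean, stdev): a minimal
    Gaussian Naive-Bayes for continuous features."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = [(statistics.mean(col), statistics.pstdev(col) or 1e-9)
                 for col in zip(*rows)]
        model[c] = (len(rows) / len(X), stats)
    return model

def nb_predict(model, x):
    def log_posterior(c):
        prior, stats = model[c]
        s = math.log(prior)
        for v, (mu, sd) in zip(x, stats):
            s += -math.log(sd) - 0.5 * ((v - mu) / sd) ** 2
        return s
    return max(model, key=log_posterior)

def wrapper_accuracy(X, y, mask, folds=10, seed=0):
    """Evaluation function of one individual: k-fold cross-validated
    accuracy of NB restricted to the features where mask[j] == 1 (the
    chapter repeats the 10-fold estimate several times and averages;
    a single pass is shown)."""
    Xm = [[v for v, keep in zip(x, mask) if keep] for x in X]
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    correct = 0
    for k in range(folds):
        test = set(order[k::folds])
        train = [i for i in order if i not in test]
        model = nb_fit([Xm[i] for i in train], [y[i] for i in train])
        correct += sum(nb_predict(model, Xm[i]) == y[i] for i in test)
    return correct / len(X)
```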

3.4 Experiments in artificial domains

In order to enrich this comparison between the FSS-GA approaches and FSS-EBNA, we have designed three artificial datasets of 2,000 instances each, where we know the feature subset which induces each domain: Redundant-order-3, Redundant-order-5 and Redundant-order-7 all have 21 continuous features in the range [3,6]. The target concept in all three domains is to determine whether an instance is nearer (using Euclidean distance) to (0,0,...,0) or to (9,9,...,9). In principle, all 21 features participate in the distance calculation. As NB's predictive

Table 13.5 Number of generations needed on average (and their standard deviation) by FSS-GA-o, FSS-GA-u and FSS-EBNA to discover the optimum feature subset in artificial domains. The initial generation is considered as generation zero.

Domain              FSS-GA-o      FSS-GA-u      FSS-EBNA

Redundant-order-3   2.50 ± 1.76   3.33 ± 1.63   1.33 ± 0.81
Redundant-order-5   3.83 ± 2.04   3.33 ± 2.16   1.83 ± 0.75
Redundant-order-7   2.00 ± 1.67   2.50 ± 1.76   1.00 ± 1.09

power is heavily damaged by redundant features, we decide to generate groups of repeated features:

• There are 3 groups of 3 repeated features each in Redundant-order-3 while the remaining 12 features are not repeated. The class of the domain is induced by these 12 individual features and by one feature from each of the 3 groups.

• There are 3 groups of 5 repeated features each in Redundant-order-5, while the remaining 6 features are not repeated. The class of the domain is induced by these 6 individual features and by one feature from each of the 3 groups.

• There are 2 groups of 7 repeated features each in Redundant-order-7, while the remaining 7 features are not repeated. The class of the domain is induced by these 7 individual features and by one feature from each of the 2 groups.
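The three generators can be sketched in one parameterized routine; the function name and argument defaults are ours, and the labelling follows the chapter's description (the individual features plus one copy per redundant group participate in the distance):

```python
import random

def redundant_dataset(n=2000, groups=3, order=3, total=21, seed=1):
    """Generate a Redundant-order-`order` style dataset: `groups`
    groups of `order` identical features plus individual features, all
    drawn uniformly from [3, 6]; the label says which corner,
    (0,...,0) or (9,...,9), the instance is nearer to in Euclidean
    distance."""
    rng = random.Random(seed)
    n_individual = total - groups * order
    data = []
    for _ in range(n):
        individual = [rng.uniform(3, 6) for _ in range(n_individual)]
        group_vals = [rng.uniform(3, 6) for _ in range(groups)]
        # the group value is copied `order` times: redundant features
        x = individual + [v for v in group_vals for _ in range(order)]
        relevant = individual + group_vals      # one copy per group
        d0 = sum(v * v for v in relevant)
        d9 = sum((v - 9.0) ** 2 for v in relevant)
        data.append((x, 0 if d0 < d9 else 1))
    return data
```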

The order of the relations between the features of these domains is well suited to be covered by Bayesian networks, rather than by probabilistic approaches that are only able to cover interactions of order one or two: conditional probabilities must be taken into account not only for a position given the value of another position, but also for a position given the values of a set of other positions. Maintaining the framework given in the previous section, Table 13.5 shows the generation in which the FSS-GA approaches and FSS-EBNA discover a feature subset that equals or surpasses the estimated accuracy level of the feature subset which induces the domain. This stopping criterion is also used by Rana et al. (1996).

Although no statistically significant differences are achieved in the stop generations of the different algorithms, it seems that for the artificial datasets, FSS-EBNA needs fewer generations than the FSS-GA approaches to arrive at solutions of similar fitness. We therefore hypothesize that the superior behaviour of FSS-EBNA

with respect to FSS-GAs in natural domains is due to the existence of interacting features in these tasks. Table 13.5's results are achieved when the interacting variables of the same group are mapped together in the individual's representation. When we perform the same set of experiments but randomly separate the interacting features in the individual's representation, FSS-GA-o needs the following number of generations to discover a feature subset which equals or surpasses the estimated accuracy level of the feature subset which induces the domain:

• 4.00 ± 1.67 in Redundant-order-3.

• 4.66 ± 0.94 in Redundant-order-5.

• 3.50 ± 1.87 in Redundant-order-7.

While FSS-GA-u and FSS-EBNA are not influenced by the positions of variables in the individual's representation, FSS-GA-o suffers notably when interacting features are not coded together. This phenomenon has been noted in the GA literature by many authors (Harik and Goldberg, 1996; Thierens and Goldberg, 1993).
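The positional sensitivity of one-point crossover has a simple back-of-the-envelope form, sketched here for a pair of interacting loci (function name is ours):

```python
def one_point_disruption(pos_a, pos_b, length):
    """Probability that one-point crossover separates two interacting
    loci: the cut point (uniform over the length - 1 positions between
    bits) must fall between them, so disruption grows linearly with
    their distance in the representation.  Uniform crossover inherits
    every bit independently, so its disruption probability is 1/2 for
    any pair of loci, and EDAs sample each new individual from an
    explicit probability model, with no positional notion at all.
    """
    return abs(pos_a - pos_b) / (length - 1)
```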

4. FSS by EDAs in large scale domains

Although the FSS literature contains many papers, few of them tackle the

task of FSS in domains with more than 50 features (Aha and Bankert, 1994; Kudo and Sklansky, 2000; Mladenic, 1998). In this section we propose several EDA-inspired approaches to this kind of task.

For large dimensionality domains, instead of Bayesian networks, we propose the use of four simpler probabilistic models to perform FSS. It is well known in the Bayesian network literature (Friedman and Yakhini, 1996) that a large number of individuals is needed to induce a reliable Bayesian network in domains of large dimensionality. Obtaining a large number of individuals is not a problem in certain environments, but the calculation of the evaluation function of an individual takes several CPU seconds when a wrapper evaluation function is used. For large dimensionality problems, despite losing the capability of Bayesian networks to factorize multiple-order interactions among the variables of the problem, we prefer to use simpler probabilistic models that avoid an increase in the number of individuals in the population. In this chapter we use the following four probabilistic algorithms:

• PBIL (Baluja, 1994) (using α = 0.5) and BSC (Syswerda, 1993) univariate distribution models.

• MIMIC (De Bonet et al., 1997) chain distribution model.

Table 13.6 Details of large-dimensionality experimental domains.

Domain Number of instances Number of features

Audiology                 226     69
Arrhythmia                452     279
Cloud                     1,834   204
DNA                       3,186   180
Internet advertisements   3,279   1,558
Spambase                  4,601   57

• The optimal dependency tree algorithm proposed by Chow and Liu (1968) (we refer to this algorithm as TREE).
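To make the univariate end of this spectrum concrete, one generation of a PBIL-style model update and sampling can be sketched as follows; BSC's fitness weighting is only noted in a comment, and the function name and defaults are ours:

```python
import random

def univariate_eda_step(selected, pop_size, prev_probs=None,
                        alpha=0.5, seed=0):
    """One generation of a univariate EDA over feature-selection bit
    strings: estimate the marginal frequency of each bit in the
    selected individuals, optionally blend it with the previous
    probability vector using PBIL's rule
        p <- (1 - alpha) * p + alpha * freq,
    then sample a new population from the independent marginals.
    BSC differs in that each selected individual is weighted by its
    fitness when the frequencies are computed; that is omitted here.
    """
    n = len(selected[0])
    freqs = [sum(ind[j] for ind in selected) / len(selected)
             for j in range(n)]
    if prev_probs is None:
        probs = freqs
    else:
        probs = [(1 - alpha) * p + alpha * f
                 for p, f in zip(prev_probs, freqs)]
    rng = random.Random(seed)
    population = [[1 if rng.random() < p else 0 for p in probs]
                  for _ in range(pop_size)]
    return probs, population
```

MIMIC and TREE replace the independent marginals with a chain and a tree of pairwise conditionals, respectively, but follow the same estimate-then-sample cycle.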

We therefore have the following FSS algorithms: FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE. These algorithms differ from FSS-EBNA only in the probabilistic model employed to factorize the probability distribution of the selected solutions: instead of FSS-EBNA's Bayesian network, they use the corresponding probabilistic algorithm. They employ the same evaluation function scheme, basic wrapper classifier (NB), stopping criterion, population size and rule for forming successive populations as FSS-EBNA.

4.1 Experiments in real domains

We test the power of FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE on six real, large-dimensionality datasets. Table 13.6 shows the principal characteristics of these datasets. All except the Cloud dataset (Aha and Bankert, 1994) can also be downloaded from the UCI repository (Murphy, 1995).

Due to the large dimensionality of the datasets, we do not include sequential algorithms such as SFS and SBE in the comparison: we just compare our four EDA-inspired FSS algorithms with FSS-GA-o and FSS-GA-u. While Kudo and Sklansky (2000) recommend the use of sequential FSS algorithms in small and medium dimensionality problems, they advise against using them in large scale domains, and consider GAs the only practical way to obtain reasonable feature subsets in this kind of domain. The main reason for this is that sequential algorithms exhaustively search a specific part of the solution space (Doak, 1992) (SFS near the empty feature set and SBE near the full feature set), leaving large parts of the solution space unexplored. Sequential algorithms have no mechanism for jumping from one subset to a very different subset; instead they trace a sequence of subsets in which adjacent subsets differ by only one feature (Kudo and Sklansky, 2000). On the

Table 13.7 Accuracy percentages of the NB classifier on real datasets without feature selection and using FSS-GA-o and FSS-GA-u. The last row shows the average accuracy percentages for all six domains.

Domain Without FSS FSS-GA-o FSS-GA-u

Audiology                 52.39 ± 5.56†   68.29 ± 2.98   68.44 ± 4.46
Arrhythmia                39.91 ± 8.50†   63.23 ± 3.95   64.73 ± 3.52
Cloud                     68.18 ± 2.09†   74.49 ± 1.93   75.17 ± 1.22
DNA                       93.93 ± 0.67    94.00 ± 0.75   95.01 ± 0.56
Internet advertisements   95.23 ± 0.40*   96.10 ± 0.12   96.38 ± 0.47
Spambase                  81.71 ± 0.92†   88.92 ± 1.45   88.77 ± 1.28

Average 71.88 80.83 81.41

other hand, population-based algorithms make use of their randomized nature, allowing a search with a larger degree of diversity. Another reason to avoid sequential algorithms is that the application of SBE within a wrapper evaluation scheme is computationally prohibitive for large dimensionalities.

For medium dimensionality datasets, the probabilities of selecting and discarding a feature in a solution of the initial population are the same. However, in the huge search space of large scale domains, an adequate bias towards specific areas of the search space can notably improve the operation of population-based algorithms (GAs and EDAs) and avoid visiting a large number of solutions. Taking into account the expert considerations for the Cloud dataset (Aha and Bankert, 1994), the probability of selecting a feature in a solution of the initial population is biased to 0.1. The Audiology, Arrhythmia and Internet advertisements datasets suggest applying a similar bias: in these datasets the probability of selecting a feature in a solution of the initial population is biased to 0.05. The introduction of this bias does not alter the nature of the comparison between GAs and EDAs, and a large amount of CPU time is saved.
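Such a biased initialization is a short routine; the function name is ours, the probabilities are the chapter's:

```python
import random

def biased_initial_population(pop_size, n_features, p_select, seed=0):
    """Initial population for a large-dimensionality FSS task: each
    feature is selected with a small probability (0.1 for Cloud, 0.05
    for Audiology, Arrhythmia and Internet advertisements) instead of
    the unbiased 0.5 used for medium-dimensionality domains."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_select else 0
             for _ in range(n_features)]
            for _ in range(pop_size)]
```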

Tables 13.7 and 13.8 show accuracy results (and their standard deviation) for the example domains. For each domain, statistically significant differences relative to the algorithm with the best estimated accuracy are also noted in Tables 13.7 and 13.8. Tables 13.9 and 13.10 show the average number of features selected by each approach (and its standard deviation). For this comparison, the GA parameters from the previous section and the 5x2cv cross-validation procedure are used.

With the use of the FSS approaches, statistically significant accuracy improvements and notable dimensionality reductions are achieved relative to the no-FSS

Table 13.8 Accuracy percentages of the NB classifier on real datasets using FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE. The last row shows the average accuracy percentages for all six domains.

Domain FSS-PBIL FSS-BSC FSS-MIMIC FSS-TREE

Audiology       70.22 ± 2.78   68.29 ± 3.18   68.88 ± 3.93   70.09 ± 4.12
Arrhythmia      64.62 ± 2.70   65.01 ± 2.22   64.33 ± 1.82   64.51 ± 2.59
Cloud           75.18 ± 1.30   76.24 ± 1.25   76.31 ± 0.95   75.84 ± 0.98
DNA             94.86 ± 0.64   95.40 ± 0.40   95.53 ± 0.29   95.40 ± 0.28
Internet adv.   96.49 ± 0.21   96.37 ± 0.41   96.46 ± 0.46   96.69 ± 0.63
Spambase        88.63 ± 1.36   89.52 ± 1.38   89.80 ± 0.79   89.60 ± 0.93

Average 81.66 81.80 81.88 82.02

Table 13.9 Cardinalities of finally selected feature subsets for the NB classifier on real datasets without feature selection and using FSS-GA-o and FSS-GA-u. It must be taken into account that when no FSS is applied to NB, it uses all the features.

Domain Without FSS FSS-GA-o FSS-GA-u

Audiology                 69      14.00 ± 3.68    15.33 ± 3.50
Arrhythmia                279     15.40 ± 3.02    18.30 ± 4.71
Cloud                     204     26.40 ± 4.45    27.60 ± 3.86
DNA                       180     59.00 ± 8.35    55.80 ± 6.46
Internet advertisements   1,558   113.10 ± 7.52   108.00 ± 5.35
Spambase                  57      29.20 ± 3.88    29.00 ± 4.24

Table 13.10 Cardinalities of finally selected feature subsets for the NB classifier on real datasets using FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE.

Domain FSS-PBIL FSS-BSC FSS-MIMIC FSS-TREE

Audiology       10.66 ± 2.50    14.33 ± 4.67     13.33 ± 3.14    12.50 ± 2.34
Arrhythmia      13.60 ± 1.95    13.40 ± 2.36     17.60 ± 2.83    20.50 ± 6.13
Cloud           26.40 ± 3.47    30.00 ± 3.59     29.50 ± 4.83    30.60 ± 4.08
DNA             56.90 ± 5.83    56.90 ± 5.89     57.40 ± 7.04    59.40 ± 5.10
Internet adv.   114.30 ± 5.65   120.25 ± 18.00   122.25 ± 8.88   125.00 ± 17.60
Spambase        28.80 ± 3.82    29.10 ± 3.78     29.10 ± 3.41    30.50 ± 3.40

Table 13.11 Mean stop-generation for FSS algorithms. The standard deviation of the mean is also reported. The initial generation is considered to be the zero generation.

Domain       FSS-GA-o        FSS-GA-u        FSS-PBIL        FSS-BSC       FSS-MIMIC     FSS-TREE

Audiology    5.80 ± 0.42†    4.60 ± 0.96*    5.20 ± 1.03*    2.50 ± 0.70   2.80 ± 0.78   2.80 ± 0.78
Arrhythmia   8.70 ± 0.48†    8.80 ± 0.42†    8.30 ± 0.48*    7.10 ± 0.73   7.00 ± 0.66   7.20 ± 0.78
Cloud        10.50 ± 0.52*   10.60 ± 1.07*   10.40 ± 0.84    8.40 ± 0.51   8.40 ± 0.69   8.30 ± 0.82
DNA          12.80 ± 0.91†   11.80 ± 0.42†   11.30 ± 0.48†   8.70 ± 0.82   8.10 ± 0.73   8.40 ± 0.69
Int. adv.    4.70 ± 1.41     5.00 ± 1.41     5.00 ± 0.66     4.40 ± 1.26   4.30 ± 0.67   4.00 ± 1.63
Spambase     4.80 ± 1.03     5.20 ± 0.63     5.50 ± 1.17     4.20 ± 0.91   3.70 ± 0.82   4.20 ± 1.22

approach in all except the DNA dataset. All six FSS algorithms obtain similar accuracy results and dimensionality reductions in all the domains. However, as in the case of the small and medium dimensionality datasets, we note differences in the number of generations needed to achieve given accuracy levels. Table 13.11 shows the generation in which the FSS algorithms halt when the stopping criterion described earlier is used.

Table 13.11 shows two notably different kinds of behaviour. For each domain in Table 13.11, statistically significant differences relative to the algorithm that needs the lowest number of generations are noted. The results show that FSS-BSC, FSS-MIMIC and FSS-TREE arrive faster at similar fitness areas than FSS-PBIL and both FSS-GA approaches in all the domains. As in the case of the medium dimensionality datasets, capturing the underlying structure of the problem seems to be essential: as FSS-MIMIC and FSS-TREE are able to cover order-two interactions among the features of the task, this could be the reason for their good behaviour. Note also the good behaviour of FSS-BSC, a probabilistic algorithm which does not cover interactions among domain features: the explanation of these FSS-BSC results could be its direct use of the accuracy percentages to estimate the univariate probabilities, probabilities

Table 13.12 Average CPU times (in seconds) for the induction of the different probabilistic models (standard deviations are nearly zero) in each generation of the EDA search. The last column shows the average CPU time to estimate the predictive accuracy of a feature subset by the NB classifier.

Domain                    PBIL    BSC     MIMIC   TREE      | Naive-Bayes

Audiology                 1.2     1.3     1.8     2.2       | 1.0
Arrhythmia                4.0     4.2     12.2    25.3      | 2.6
Cloud                     2.3     2.4     6.5     14.6      | 7.2
DNA                       1.8     2.0     4.8     10.9      | 5.3
Internet advertisements   101.1   106.4   808.5   1,945.6   | 9.8
Spambase                  0.8     0.9     1.2     1.8       | 8.2

which are simulated to generate the new solutions of each EDA generation. On the other hand, the behaviour of FSS-PBIL, the other order-one probabilistic algorithm, is similar to that of the FSS-GA approaches. We suspect that the explanation of this result is the absence of a tuning process to select a value for the α parameter: previous studies indicate that a good selection of the PBIL α parameter is a critical task (González et al., 2001).

Because of the large dimensionality of the datasets, when the wrapper approach is employed to estimate the goodness of a feature subset, the faster discovery of solutions of similar fitness becomes a critical task. Despite the fast nature of the NB classifier, a large amount of CPU time is saved by avoiding the simulation of several generations of solutions. In order to understand the advantages of the EDA approach relative to the GA approach, the CPU times for the induction of the probabilistic models must be studied: the EDA approach has the added overhead of the calculation of a probabilistic model in each generation. Table 13.12 shows, for each domain, the average CPU time to induce the associated probabilistic model in each generation. The last column also shows the average CPU time needed to estimate the predictive accuracy of a single feature subset by the NB classifier: note that the times in the last column are not comparable with the previous columns, but they help to understand the magnitude of the CPU time savings when fewer generations are needed to achieve similar accuracy results.

As the CPU times for the induction of the probabilistic models are insignificant in all domains except Internet advertisements, the CPU time savings relative to the FSS-GA approaches shown in Table 13.11 are maintained. In the case of the Internet advertisements domain, as the order-two probabilistic approaches (MIMIC and TREE) need a large amount of CPU time in each generation, the advantage

of using them (in CPU time savings) relative to the FSS-GA approaches is considerably reduced. It must be noted that the FSS-GA CPU times for recombination operations in each generation are nearly zero.

4.1.1 Experiments in artificial domains. As in the case of small and medium scale domains, we have designed three artificial datasets of 2,000 instances each, where the feature subset which induces each domain is known: Red60of1 and Red30of3 have 100, and Red30of2 has 80, continuous features in the range [3,6], and the target concept is the same as in the artificial datasets of the previous section. The description of the three databases is as follows:

• No interactions appear among the features of the Red60of1 domain. While 60 features induce the class of the domain, the remaining 40 features are irrelevant.

• There are 30 groups of 2 repeated features each in the Red30of2 domain while the remaining 20 features are not repeated. The class of the domain is induced by these 20 individual features and one feature from each of the 30 groups.

• There are 30 groups of 3 repeated features each in the Red30of3 domain, while the remaining 10 features are not repeated. The class of the domain is induced by these 10 individual features and one feature from each of the 30 groups.

While the order of the relations among the features of Red60of1 is well suited to be covered by order-one probabilistic algorithms, order-two approaches are needed for Red30of2 and order-three approaches (i.e. Bayesian networks) for Red30of3. In Red30of3, conditional probabilities must be considered for a variable given the value of another variable, and also for a variable given the values of a set of other variables. Maintaining the framework given in the previous section, Table 13.13 shows the generation in which the GA and EDA approaches discover a feature subset that equals or surpasses the estimated accuracy level of the feature subset which induces the domain. For each domain, statistically significant differences relative to the algorithm which needs the lowest number of generations are also shown in Table 13.13.

In Red60of1, a domain with no interactions among the variables of the problem, the good behaviour of the FSS-GA approaches relative to the order-two FSS-EDA approaches must be noted. We think that the absence of a tuning process (González et al., 2001) to fix the α parameter of FSS-PBIL is critical to understanding its behaviour in this domain. However, with the appearance of interacting features in the tasks Red30of2 and Red30of3, the performance of the order-two probabilistic approaches (FSS-MIMIC and FSS-TREE) is noticeably different from the remaining algorithms: this superiority of FSS-EDA order-two

Table 13.13 Number of generations needed on average (and their standard deviation) by FSS-GA-o, FSS-GA-u, FSS-PBIL, FSS-BSC, FSS-MIMIC and FSS-TREE to discover a feature subset that equals or surpasses the estimated accuracy level of the feature subset which induces the domain. The initial generation is considered to be the zero generation.

Domain     FSS-GA-o        FSS-GA-u         FSS-PBIL         FSS-BSC          FSS-MIMIC      FSS-TREE

Red60of1   6.70 ± 0.48†    4.10 ± 0.31      12.80 ± 0.91†    7.60 ± 0.51†     8.60 ± 0.51†   8.00 ± 0.47†
Red30of2   22.40 ± 4.22†   73.50 ± 5.73†    66.30 ± 7.52†    36.40 ± 3.13†    15.10 ± 2.33   10.90 ± 1.52
Red30of3   21.00 ± 2.26†   119.00 ± 5.27†   113.80 ± 8.76†   89.50 ± 17.60†   18.90 ± 2.13   16.30 ± 1.33

approaches relative to FSS-GAs and order-one approaches in domains with interacting features is also noted in the literature (De Bonet et al., 1997; Pelikan and Mühlenbein, 1999). In this way, we hypothesize that in artificial domains, the superior behaviour of the order-two probabilistic approaches relative to the order-one approaches and FSS-GAs is due to the existence of interacting features in these tasks.

As in the case of the medium dimensionality datasets, Table 13.13's results are achieved when the interacting variables of the same group are mapped together in the individual's representation. When we perform the same set of experiments but randomly separate the interacting features in the individual's representation, FSS-GA-o needs the following number of generations to discover a feature subset which equals or surpasses the estimated accuracy level of the feature subset that induces the domain:

• 46.70 ± 6.53 in Red30of2.

• 48.00 ± 5.88 in Red30of3.

We note again that while FSS-GA-u and FSS-EDA approaches are not influenced by the positions of features in the individual's representation, FSS-GA-o noticeably suffers when interacting features are not coded together. As the Red60of1 domain has no interactions among the features of the task, it is not included in this comparison.
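This position sensitivity of one-point crossover can be illustrated with a minimal sketch (our own illustration, not code from the chapter): under one-point crossover, two genes are separated whenever the cut point falls between them, so genes coded far apart in the string are broken up more often, while uniform crossover inherits each position independently of where it sits.

```python
import random

def one_point_crossover(a, b):
    """One-point crossover: genes far apart in the string are separated
    whenever the cut point falls between them."""
    p = random.randrange(1, len(a))       # cut point in 1..len-1
    return a[:p] + b[p:], b[:p] + a[p:]

def uniform_crossover(a, b):
    """Uniform crossover: each position is inherited independently, so the
    chance of keeping two genes together does not depend on their distance."""
    mask = [random.random() < 0.5 for _ in a]
    c1 = [x if m else y for m, x, y in zip(mask, a, b)]
    c2 = [y if m else x for m, x, y in zip(mask, a, b)]
    return c1, c2
```

In both operators every position of the two children comes from one parent each, but only one-point crossover makes the survival of a gene pair depend on its encoding position.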

5. Conclusions and future work

The application of the EDAs paradigm to solve the well known FSS problem has been studied. While the most powerful probabilistic model (Bayesian networks) is used with small and medium dimensionality scale datasets, simpler probabilistic models are used with large dimensionalities. Making use of an appropriate probabilistic model, we note that FSS-GA approaches need more generations than an adequate FSS-EDA approach to discover similar fitness solutions. We show this behaviour on a set of real and artificial datasets. We also



show that while the performance of FSS-GA with uniform crossover and FSS-EDA approaches are not influenced by the bit positioning in the individual's representation, FSS-GA with one-point crossover shows a noticeable decay in its performance when interacting bits are not coded together.

While Bayesian networks are an adequate and non-CPU expensive probabilistic tool for small and medium dimensionality datasets, PBIL, BSC, MIMIC and TREE seem suitable for large dimensionality ones. However, because of the high CPU times needed for the induction of order-two algorithms in the Internet advertisements domain, the CPU time saving produced by this reduction in the number of solutions relative to FSS-GA approaches is noticeably reduced.

As future work, we envision the use of other probabilistic models with large dimensionality datasets, models which assume few or no dependencies among the variables of the domain. Another interesting possibility is the use of parallel algorithms to induce Bayesian networks in these kinds of tasks (Sangüesa et al., 1998; Xiang and Chu, 1999). When dimensionalities are higher than 1,000 variables, research is needed on the reduction of CPU times associated with the use of probabilistic order-two approaches.

Acknowledgments

The authors wish to thank D.W. Aha and R.L. Bankert for the donation of the Cloud dataset.

References

Aha, D.W. and Bankert, R.L. (1994). Feature selection for case-based classification of cloud types: An empirical comparison. In Proceedings of the AAAI'94 Workshop on Case-Based Reasoning, pages 106-112.

Alpaydin, E. (1999). Combined 5x2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11:1885-1892.

Back, T. (1996). Evolutionary Algorithms in Theory and Practice. Oxford University Press.

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Blum, A.L. and Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245-271.

Buntine, W. (1991). Theory refinement on Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52-60.



Cestnik, B. (1990). Estimating probabilities: a crucial task in machine learning. In Proceedings of the European Conference on Artificial Intelligence, pages 147-149.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467.

De Bonet, J.S., Isbell, C.L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. In Advances in Neural Information Processing Systems, volume 9. MIT Press.

Doak, J. (1992). An evaluation of feature selection methods and their application to computer security. Technical Report CSE-92-18, University of California at Davis.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence. CIMAF99. Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Ferri, F.J., Pudil, P., Hatef, M., and Kittler, J. (1994). Comparative study of techniques for large scale feature selection. In Gelsema, E.S. and Kanal, L.N., editors, Multiple Paradigms, Comparative Studies and Hybrid Systems, pages 403-413. North Holland.

Friedman, N. and Yakhini, Z. (1996). On the sample complexity of learning Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pages 274-282.

Gonzalez, C., Lozano, J.A., and Larrañaga, P. (2001). The convergence behavior of the PBIL algorithm: a preliminary approach. In Kurkova, V., Steel, N.C., Neruda, R., and Karny, M., editors, International Conference on Artificial Neural Networks and Genetic Algorithms. ICANNGA-2001, pages 228-231. Springer.

Grefenstette, J.J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1):122-128.

Harik, G.R. and Goldberg, D.E. (1996). Learning linkage. Technical Report IlliGAL Report 99003, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory.

Inza, I., Larrañaga, P., Etxeberria, R., and Sierra, B. (2000). Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 123(1-2):157-184.

Jain, A.K. and Chandrasekaran, B. (1982). Dimensionality and sample size considerations in pattern recognition practice. In Krishnaiah, P.R. and Kanal, L.N., editors, Handbook of Statistics, volume 2, pages 835-855. North-Holland.

Jain, A.K. and Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):153-158.



Kittler, J. (1978). Feature set search algorithms. In Chen, C., editor, Pattern Recognition and Signal Processing, pages 41-60. Sijthoff and Noordhoff.

Kohavi, R. and John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324.

Kohavi, R., Sommerfield, D., and Dougherty, J. (1997). Data mining using MLC++, a machine learning library in C++. International Journal of Artificial Intelligence Tools, 6:537-566.

Kudo, M. and Sklansky, J. (2000). Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33:25-41.

Langley, P. and Sage, S. (1994). Induction of selective Bayesian classifiers. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 399-406.

Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers.

Miller, A.J. (1990). Subset Selection in Regression. Chapman and Hall.

Mladenic, M. (1998). Feature subset selection in text-learning. In Proceedings of the Tenth European Conference on Machine Learning, pages 95-100.

Murphy, P. (1995). UCI Repository of machine learning databases. University of California, Department of Information and Computer Science.

Narendra, P. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917-922.

Ng, A.Y. (1997). Preventing "overfitting" of cross-validation data. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 245-253.

Pelikan, M., Goldberg, D.E., and Cantu-Paz, E. (1998). Linkage problem, distribution estimation, and Bayesian networks. Technical Report IlliGAL Report 98013, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory.

Pelikan, M. and Mühlenbein, H. (1999). The bivariate marginal distribution algorithm. In Advances in Soft Computing - Engineering Design and Manufacturing, pages 521-535. Springer-Verlag.

Pudil, P., Novovicova, J., and Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15(11):1119-1125.

Rana, S., Whitley, L.D., and Cogswell, R. (1996). Searching in the presence of noise. In Lecture Notes in Computer Science 1141: Parallel Problem Solving from Nature - PPSN IV, pages 198-207.

Sangüesa, R., Cortes, U., and Gisolfi, A. (1998). A parallel algorithm for building possibilistic causal networks. International Journal of Approximate Reasoning, 18(3-4):251-270.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461-464.



Siedlecki, W. and Sklansky, J. (1988). On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2:197-220.

Syswerda, G. (1993). Simulated crossover in genetic algorithms. In Whitley, L.D., editor, Foundations of Genetic Algorithms, volume 2, pages 239-255.

Thierens, D. and Goldberg, D.E. (1993). Mixing in genetic algorithms. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 38-45.

Xiang, Y. and Chu, T. (1999). Parallel learning of belief networks in large and difficult domains. Data Mining and Knowledge Discovery, 3(3):315-338.

Yang, Y. and Pedersen, J.O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412-420.


Chapter 14

Feature Weighting for Nearest Neighbor by Estimation of Distribution Algorithms

I. Inza, P. Larrañaga, B. Sierra
Department of Computer Science and Artificial Intelligence
University of the Basque Country
{ccbincai, ccplamup, ccpsiarb}@si.ehu.es

Abstract The accuracy of a Nearest Neighbor classifier depends heavily on the weight of each feature in its distance metric. In this paper, two new methods, FW-EBNA (Feature Weighting by Estimation of Bayesian Network Algorithm) and FW-EGNA (Feature Weighting by Estimation of Gaussian Network Algorithm), inspired by the Estimation of Distribution Algorithm (EDA) approach, are used together with a wrapper evaluation scheme to learn accurate feature weights for the Nearest Neighbor algorithm. While the FW-EBNA has a set of three possible discrete weights, the FW-EGNA works in a continuous range of weights. Both methods are compared in a set of natural and artificial domains with two sequential algorithms and one Genetic Algorithm.

Keywords: Feature Weighting, Nearest Neighbor, wrapper, Estimation of Distribution Algorithms, Bayesian networks, Gaussian networks

1. Introduction

The k-Nearest Neighbor (k-NN) classifier has long been used by the Pattern Recognition and Machine Learning communities (Dasarathy, 1991) in supervised classification tasks. The basic approach involves storing all training instances; then, when a test instance is presented, the training instances nearest (least distant) to this test instance are retrieved and used to predict its class. Distance is classically defined as follows:

distance(x, y) = \sum_{i=1}^{n} w_i \times difference(x_i, y_i)^2

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


where x = (x_1, ..., x_i, ..., x_n) and y = (y_1, ..., y_i, ..., y_n) are training instances and w_i is the weight value assigned to the ith feature. To compute the difference between two values, the overlap metric (Salzberg, 1991) is used for symbolic features and the absolute difference (after normalization) for numeric ones.

Dissimilarities among values of the same feature are computed and added together to obtain a representative value of the dissimilarity (distance) between the compared instances. In the basic NN approach, dissimilarities in each dimension are added in a naive manner, weighting dissimilarities in each dimension equally (for all features: w_i = 1). This approach is unrealistic, allowing irrelevant features to influence the distance computation and treating features with different degrees of relevance equally. With this handicapped approach, each time an unimportant feature is added to the feature set and a weight similar to the weight of an important feature is assigned to it, the number of training instances needed to maintain the predictive accuracy increases exponentially (Lowe, 1995). This phenomenon is known as the curse of dimensionality. In order to find a realistic weight for each feature of the problem, several approaches have been proposed by the Pattern Recognition and Machine Learning communities, under the heading of Feature Weighting for Nearest Neighbor.
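The weighted distance just described can be sketched in a few lines (a minimal illustration of the metric, assuming numeric features are already normalized to [0,1]; the function name is ours, not the chapter's):

```python
def weighted_distance(x, y, w):
    """Weighted squared-difference distance between two instances.

    Symbolic features use the overlap metric (0 if equal, 1 otherwise);
    numeric features use the absolute difference, assumed pre-normalized.
    A weight of 0.0 removes a feature from the computation entirely.
    """
    total = 0.0
    for xi, yi, wi in zip(x, y, w):
        if isinstance(xi, str) or isinstance(yi, str):
            diff = 0.0 if xi == yi else 1.0   # overlap metric
        else:
            diff = abs(xi - yi)               # normalized numeric difference
        total += wi * diff ** 2
    return total

# With uniform weights (w_i = 1) every feature counts equally:
print(weighted_distance([0.2, "a"], [0.5, "b"], [1.0, 1.0]))  # about 1.09
```

Setting w_i = 0 for an irrelevant feature makes it invisible to the classifier, which is exactly the effect the weighting methods below try to learn automatically.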

In this chapter we present two novel approaches based on the EDA paradigm that search for a set of appropriate weights for the NN algorithm. The first approach, called FW-EBNA (Feature Weighting by Estimation of Bayesian Network Algorithm), which uses Bayesian networks, performs a search in a set of three discrete weights. The second one, FW-EGNA (Feature Weighting by Estimation of Gaussian Network Algorithm), which uses Gaussian networks, allows the search to be carried out in the [0,1] continuous interval.

The chapter is organized as follows. The next section surveys previous approaches to the Feature Weighting (FW) problem. Section 3 presents the specific applications of Bayesian and Gaussian networks to solve the FW problem within the EDA paradigm. Section 4 presents the comparison of both EDAs with two sequential algorithms and one Genetic Algorithm based approach in a set of natural and artificial domains. The last section summarizes the contribution of the work and suggests lines of future research.

2. Related work

Since the Pattern Recognition and Machine Learning communities frequently use the NN classifier, they have also proposed many variants of it to address the FW problem. A complete review of these efforts was carried out by Wettschereck et al. (1997), who classified FW algorithms along five different dimensions. In this section, in order to properly locate both EDA-inspired approaches, and assuming that their principal contribution is the search mechanism, we organize



the principal FW algorithms by the search strategy that they use to optimize the set of weights.

Several well-known FW methods learn the set of weights by means of a hill-climbing, incremental and on-line strategy which makes only one pass through the training data. Applying the nearest neighbor's distance function, they iteratively adjust the feature weights after one or several classifications (on the training set) are made. The weight adjustment takes account of whether the classification given was correct or incorrect. Considering each training instance once, weight adjustment has the purpose of decreasing the distance among instances of the same class and increasing the distance among instances of different classes. Algorithms of this type can be seen in Salzberg (1991), Aha (1992) and Kira and Rendell (1992).
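A hedged sketch of this family of on-line weight adjusters follows; the per-feature agreement test, step size and clamping are our own illustrative choices, not the update rule of any of the cited algorithms:

```python
def online_weight_update(w, x, y, same_class, step=0.05):
    """One incremental weight update after comparing training instance x
    with its nearest neighbor y.

    Features that agree between same-class instances (or disagree between
    different-class instances) have their weight increased; features that
    behave the opposite way have it decreased. Weights stay in [0, 1].
    """
    new_w = []
    for wi, xi, yi in zip(w, x, y):
        agree = abs(xi - yi) < 0.5          # crude per-feature agreement test
        if same_class == agree:             # feature supports the desired outcome
            wi = min(1.0, wi + step)        # reward: increase its influence
        else:
            wi = max(0.0, wi - step)        # penalize: decrease its influence
        new_w.append(wi)
    return new_w
```

Because each training instance is seen only once, the cost of such schemes is a single pass over the data, which is what distinguishes them from the repeated-pass optimizers discussed next.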

Lowe (1995) and Scherf and Brauer (1997) have proposed another local search mechanism, gradient descent optimization, to optimize a set of continuous weights. Lowe (1995) applies gradient descent over the distance similarity metric to optimize feature weights so as to minimize the LOOCE (leave-one-out cross-validation error) on the training set, and tries to prevent large weight changes in the optimization process which are not statistically reliable on datasets with few examples. Scherf and Brauer (1997) apply gradient descent to optimize feature weights so as to minimize a function which reinforces the distance similarities between all training instances of the same class while decreasing the similarities between instances of different classes. Instead of being incremental on-line optimizers, these two approaches repeatedly pass through the training set: each time a new weight or set of weights is calculated, they use the training set to measure the value of the function to be optimized.

We can qualify hill-climbing and gradient descent optimization as local search engines in the sense that they cannot escape from local optima. Kohavi et al. (1997) propose a best-first search which has this capability of escaping from local optima. Using the wrapper approach, each time a weight set is found, the training set is used to estimate the accuracy of the proposed set by 10-fold cross-validation. In order to avoid the risk of overfitting, rather than considering a continuous weight space, they restrict the possible weights to a small, finite set.

All the algorithms reviewed so far are deterministic in the sense that all runs over the same data will always give the same result. Until now, Genetic Algorithms (GAs) and random sampling, which are two popular search strategies, have been the only non-deterministic search engines applied to the FW problem. With the non-deterministic approach, randomness is used to avoid getting stuck in local optima: this implies that one should not expect the same weight set solution from different runs. GAs are used in much work in this area (Kelly and Davis, 1991; Punch et al., 1993; Wilson and Martinez, 1996). These three



articles use the wrapper approach to guide the GA, with access to the training set each time a new weight set is found, but they differ slightly in the way that they apply it. Kelly and Davis (1991) measure the five-fold cross-validation accuracy on training data; Punch et al. (1993) use a 5-NN approach, with the LOOCE used on the training data, and propose a mixed fitness function which combines the LOOCE with the number of neighbors that were not used in the final classification of each training instance. Wilson and Martinez (1996) just measure the LOOCE on the training data.

Skalak (1994) uses Monte Carlo sampling to simultaneously select features (only two discrete weights are used: 0.0 and 1.0) and prototypes for NN. He also utilizes random mutation hill climbing (Papadimitriou and Steiglitz, 1982), a local search method that has a stochastic component: one point in the solution is changed at random until an improvement is achieved, with a bound on the maximum number of iterations. The wrapper approach is used, with a hold-out estimate to measure the feature and prototype set performance on the training set.

The feature selection algorithm for NN proposed by Aha and Bankert (1994), in a domain with 204 features, also has a random sampling component: they randomly sample a specific part of the feature space for a fixed number of iterations and then begin a beam search with the best feature subset (by the 10-fold cross-validation wrapper approach) found during those iterations.

All the algorithms we have reviewed state the FW task as a search problem. They are grouped under the term wrapper because they use feedback from the NN classifier itself during training to learn weights. In order to measure the value of the wrapper function to be optimized, on-line weighting algorithms use the training set only once, while the rest of the algorithms presented use the training set each time a new weight set is found.

Other well-known approaches do not state the FW task as a search problem, but learn feature weights from the intrinsic characteristics of the data. These approaches do not make use of the NN classifier itself to learn the weights, and they can therefore be grouped under the term filter (Kohavi and John, 1997). To learn a set of weights, these classic approaches make use of conditional probabilities (Creecy et al., 1992), class projections (Stanfill and Waltz, 1986; Howe and Cardie, 1997), mutual information (Wettschereck and Dietterich, 1995) and information gain (van den Bosch and Daelemans, 1993). In other interesting work, Cardie and Howe (1997) first build a decision tree to select features and then weight each feature according to its information gain score.



3. Learning weights by Bayesian and Gaussian networks

In this section we use the search mechanism provided by EBNA (Etxeberria and Larrañaga, 1999) to search for a set of appropriate weights for the FW task. In order to specify the nature of our search space for FW-EBNA, we restrict the space of possible weights to a set of discrete weights. Thus, we follow the findings of Kohavi et al. (1997) to determine the number of possible discrete weights. Applying a wrapper approach (Kohavi and John, 1997), these authors found that considering only a small set of weights gave better results than using a larger set: increasing the number of possible weights greatly increased the variance in their FW algorithm, which induced a deterioration in the overall performance. In FW-EBNA the search is also performed by a wrapper procedure in a discrete space of weights: we restrict our search space to three possible weights for each feature, i.e. {0.0, 0.5, 1.0}. If n is the number of features in a domain, then the cardinality of the search space for FW-EBNA is 3^n. With this restricted set of possible feature weights, a common notation is used to represent each individual: for a full n feature problem, there are n positions in each proposed solution, where each position indicates whether a feature has 0.0 weight, 0.5 weight or 1.0 weight.
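The representation can be sketched in a few lines (our own illustration; `random_individual` and `search_space_size` are hypothetical helper names, not functions from the chapter):

```python
import random

WEIGHTS = (0.0, 0.5, 1.0)   # the three discrete weights used by FW-EBNA

def random_individual(n):
    """A candidate solution for an n-feature problem: one of the 3**n
    possible weight vectors, with one weight per feature position."""
    return [random.choice(WEIGHTS) for _ in range(n)]

def search_space_size(n):
    """Cardinality of the FW-EBNA search space: 3^n."""
    return 3 ** n

print(search_space_size(10))  # 59049 candidate weight vectors for 10 features
```

Even for this restricted weight set the space grows exponentially with n, which is what motivates a distribution-based search rather than enumeration.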

In our specific implementation of EBNA, instead of better (but slow) techniques, a fast score + search procedure is used to learn the Bayesian network in each generation of the search. Algorithm B (Buntine, 1991) is used for learning Bayesian networks from data. Algorithm B is a greedy search heuristic which starts with an arc-less structure and, at each step, adds the arc giving the maximum increase in the score: the score used here is the BIC score (Schwarz, 1978). The algorithm stops when adding an arc does not increase the score.
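The greedy arc-addition loop of Algorithm B can be sketched generically as follows. This is a simplified sketch: `score` is a placeholder for the BIC score computed from the selected individuals, and the `max_parents` bound is our own illustrative addition, not part of the chapter's description.

```python
def greedy_structure_search(nodes, score, max_parents=3):
    """Greedy hill-climbing over DAG structures in the style of Algorithm B:
    start arc-less, repeatedly add the single arc with the maximum score
    increase, and stop when no arc improves the score.

    `score(arcs)` returns the metric (e.g. BIC) of the structure `arcs`,
    given as a set of (parent, child) pairs."""
    arcs = set()
    current = score(arcs)
    while True:
        best_arc, best_score = None, current
        for a in nodes:
            for b in nodes:
                if a == b or (a, b) in arcs:
                    continue
                cand = arcs | {(a, b)}
                if creates_cycle(cand) or \
                        len([p for p, c in cand if c == b]) > max_parents:
                    continue
                s = score(cand)
                if s > best_score:
                    best_arc, best_score = (a, b), s
        if best_arc is None:          # no arc increases the score: stop
            return arcs
        arcs.add(best_arc)
        current = best_score

def creates_cycle(arcs):
    """DFS check that the candidate arc set stays acyclic (a DAG)."""
    graph = {}
    for p, c in arcs:
        graph.setdefault(p, []).append(c)
    visited, stack = set(), set()
    def dfs(u):
        visited.add(u); stack.add(u)
        for v in graph.get(u, []):
            if v in stack or (v not in visited and dfs(v)):
                return True
        stack.discard(u)
        return False
    return any(dfs(u) for u in list(graph) if u not in visited)
```

A real implementation would evaluate BIC from the sufficient statistics of the selected population; the skeleton above only shows the search strategy the chapter describes.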

We can extend the set of possible weights to a continuous space of weights by using the EGNA (Larrañaga et al., 2000) search engine. By using Gaussian networks, we consider a continuous space of weights in the [0,1] range. In our specific implementation of EGNA a fast score + search procedure is also preferred for learning the Gaussian network in each search generation. The search method is again Algorithm B (Buntine, 1991) and BGe (Geiger and Heckerman, 1994) is the scoring metric applied.

As the execution of the NN algorithm has a high computational cost, it is traditionally applied over datasets with a low number of instances and features. Taking this low dimensionality into account, we hypothesize that Bayesian and Gaussian networks are attractive paradigms for discovering the relationships among the variables of the task, discarding the use of simpler probabilistic models which are not able to reflect multiple order relationships among domain variables.



Determination of a minimum population size to reliably estimate the parameters of Bayesian and Gaussian networks is not an easy task (Friedman and Yakhini, 1996). This difficulty is greater for real world problems where the true probability distribution is not known. Taking the dimensionality of our problems into account, we consider a population size of 1,000 individuals enough to reliably estimate the network parameters.

Since our objective is to research FW rather than the correct number of neighbors (k) to be considered for classification, the number of neighbors for FW-EBNA and FW-EGNA is fixed to one.

The wrapper schema (Kohavi and John, 1997) is applied to assess the evaluation function of each proposed solution, by calculating the LOOCE of the 1-NN classifier applied over the found set of weights. Following the basic EDA scheme, the initial population of weights is randomly created. For a stopping criterion, we use the findings of Ng (1997) and Kohavi and John (1997). Ng (1997), in a work about the overfitting phenomenon, demonstrates that when cross-validation is used to select from a large pool of different classification models in a noisy task with too small a training set, it may not be advisable to pick the model with minimum cross-validation error, and a model with higher cross-validation error could have better generalization power over novel test instances. Kohavi and John (1997) display the effect of overfitting in a Feature Subset Selection problem using a wrapper cross-validated approach when the number of instances is small. As the NN approach is usually applied over small (less than 1,000 training instances) and noisy training sets, we decide to stop the search when, in a sampled new generation, no feature weight set appears with a LOOCE improvement, with a p-value smaller than 0.1 (applying a cross-validated paired t test), over the lowest LOOCE of the previous generation. Thus, the feature weight set with the lowest LOOCE of the previous generation is returned as FW-EBNA's or FW-EGNA's solution.
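The generation loop with this stopping rule can be sketched as follows. Here `sample_generation` and `significantly_better` are placeholders: the former stands for sampling and evaluating one new generation and returning its best weight set, and the latter stands in for the cross-validated paired t test with the 0.1 p-value threshold described above.

```python
def run_fw_eda(sample_generation, significantly_better):
    """Generation loop with the chapter's stopping rule: the search continues
    only while the new generation's best solution is significantly better
    (by the supplied statistical test) than the previous generation's best;
    otherwise the previous best is returned as the solution.

    sample_generation(prev_best) -> best solution of a newly sampled generation
    significantly_better(new, old) -> True if `new` beats `old` significantly
    """
    best = sample_generation(None)        # best of the initial random generation
    while True:
        candidate = sample_generation(best)
        if not significantly_better(candidate, best):
            return best                   # no significant improvement: stop
        best = candidate
```

For example, with solutions represented simply by their LOOCE values and a crude "improvement larger than 0.005" test in place of the t test, the loop stops as soon as the error plateau is reached.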

Adopting this stopping criterion, we aim to avoid the overfitting risk of the wrapper process, by only allowing the search to continue when a significant improvement appears in the accuracy estimation of the best solutions of consecutive generations. We hypothesize that when this significant improvement appears, the overfitting risk decays and there is a basis for further generalization accuracy improvement over unseen instances. When this improvement is absent, we hypothesize that the search is getting stuck in an area of the search space without statistically significantly better solutions (compared to those already found). Therefore, it is best to stop the search to avoid the risk of overfitting.

Another consideration in this stopping criterion is the wrapper nature of the proposed evaluation function. As we will see in the next section, the evaluation function value of each visited solution needs several seconds to be calculated. As the simulation of a new generation of individuals implies the evaluation of 1,000


[Figure 14.1 is a diagram in the original; only its caption and labels are recoverable. It depicts one generation of FW-EBNA: the current population of N weight vectors with their leave-one-out errors (looce), the selection of N/2 individuals, the induction of the Bayesian network with n nodes, and the sampling of N-1 individuals from the Bayesian network together with the calculation of their evaluation function values.]

Figure 14.1 FW-EBNA method.



new individuals, we only allow the search to continue when it demonstrates that it is able to escape from local optima and can discover new best solutions in each generation. When the wrapper approach is used, the CPU time must also be controlled: we hypothesize that when the search is allowed to continue by our stopping criterion, the CPU times needed to evaluate a new generation of solutions are justified. A detailed study of this stopping criterion within a wrapper strategy is carried out in Inza et al. (2000).

In our approach, the best individual of the previous generation is maintained and N-1 individuals are created as offspring. An elitist approach is then used to form iterative populations. Instead of directly discarding the N-1 individuals from the previous generation and replacing them with N-1 newly generated ones, the 2N-2 individuals are put together and the best N-1 are chosen from them. These best N-1 individuals, together with the best individual of the previous generation, form the new population. In this way, the populations converge faster to the best individuals found, but this also carries a risk of losing diversity within the population. Figure 14.1 gives an overview of the FW-EBNA method. FW-EGNA only differs from FW-EBNA in the probabilistic model employed to factorize the probability distribution of selected solutions and in the set of possible weights.
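The elitist replacement just described can be sketched as follows (a minimal illustration with our own function names; in the chapter's setting `fitness` would rank solutions by lower LOOCE, here written as "higher is better"):

```python
def next_population(old_pop, offspring, fitness):
    """Elitist replacement: keep the previous generation's best individual,
    pool the remaining N-1 old individuals with the N-1 offspring (2N-2 in
    total), and fill the other N-1 slots with the best of that pool."""
    n = len(old_pop)
    ranked = sorted(old_pop, key=fitness, reverse=True)
    best, rest = ranked[0], ranked[1:]          # elite + remaining N-1
    pool = sorted(rest + offspring, key=fitness, reverse=True)
    return [best] + pool[:n - 1]
```

Because good old individuals can survive alongside good offspring, convergence accelerates, at the diversity cost the chapter notes.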

4. Experimental comparison

We have tested the power of FW-EBNA and FW-EGNA on four artificial and four real domains. All datasets, except 3-Weights and C-Weights, can be found in the UCI Repository (Murphy, 1995). Table 14.1 summarizes the characteristics of these domains.

LED24 is a well known artificial dataset with 7 equally relevant and 17 irrelevant binary features. In the Waveform-21 task, all the features have different degrees of relevance. Both datasets have a significant degree of noise.

The 3-Weights domain has 12 continuous features in the range [3,6]. Its target concept is to define whether the instance is closer (using the Euclidean metric in all dimensions and summing them) to (0,0,...,0) or to (9,9,...,9). In the distance computation a 1.0 weight is assigned to 4 features, a 0.5 weight to another 4 features and 0.0 to the 4 remaining ones. The 3-Weights dataset is inspired by the W domain proposed by Kohavi et al. (1997).

The C-Weights domain has 10 continuous features in the range [3,6] and its target concept is the same as that of the 3-Weights dataset. However, randomly created weights in the continuous range [0,1] are assigned for the distance computation of the features.

We hypothesize that, while the LED24 and 3-Weights domains are properly designed for the FW-EBNA approach, the continuous nature of Waveform-21 and C-Weights should be better suited to FW-EGNA. The four real domains arise from natural tasks for which the true weights are unknown.


Table 14.1 Details of experimental domains.

Domain          Number of instances    Number of features
LED24           600                    24
Waveform-21     600                    21
3-Weights       600                    12
C-Weights       600                    10
Glass           214                    9
CRX             690                    15
Vehicle         846                    18
Contraceptive   1,473                  9

Due to the large number of instances in the Contraceptive domain, instead of a LOOCE evaluation scheme, a multiple-times 10-fold cross-validation procedure is used (Kohavi and John, 1997). The 10-fold cross-validation is repeated until the standard deviation of the estimated accuracy drops below 1%, up to a maximum of five times.
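This repeated cross-validation scheme can be sketched as follows (a simplified sketch: `run_cv` is a placeholder for one full 10-fold cross-validation returning an accuracy percentage, and we start from two runs because a standard deviation needs at least two values):

```python
import statistics

def repeated_cv_accuracy(run_cv, max_repeats=5, std_threshold=1.0):
    """Repeat 10-fold cross-validation until the standard deviation of the
    per-run accuracy estimates (in %) drops below `std_threshold`, up to
    `max_repeats` runs, and return the mean accuracy.

    `run_cv()` performs one complete 10-fold CV and returns its accuracy.
    """
    accs = [run_cv(), run_cv()]          # stdev needs at least two runs
    while statistics.stdev(accs) >= std_threshold and len(accs) < max_repeats:
        accs.append(run_cv())
    return sum(accs) / len(accs)
```

Stable estimates stop after two runs; noisy ones are repeated up to the five-run cap, which bounds the wrapper's CPU cost on this large dataset.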

To test the power of FW-EBNA and FW-EGNA, a comparison with two sequential and one genetic FW algorithm is carried out. These FW algorithms are:

• Genetic Algorithm with one-point crossover (FW-GA-o). This performs the search in the same set of three discrete weights as FW-EBNA.

• An improved variation of the sequential DIET algorithm (called DIET-10) proposed by Kohavi et al. (1997). This also performs the search in the same three discrete weight space as above. DIET uses the best-first algorithm (Russell and Norvig, 1995) to guide the search. The search starts with the solution (0.5, 0.5, ..., 0.5) and is stopped when it encounters 10 consecutive nodes with no children having scores more than 0.1% better than their parent. The improvement introduced relative to Kohavi et al.'s (1997) implementation is that the original DIET algorithm stops the search when the number of consecutive nodes with worse child scores is only 5.

• IB4 (Aha, 1992) assigns weights to features by means of a hill-climbing, sequential, incremental and on-line strategy, with only one pass through the training data. As with FW-EGNA, IB4 performs the search in the [0,1]^n continuous weight space.
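As a reference point for the wrapper evaluation these FW methods share, here is a minimal sketch of a feature-weighted 1-NN classifier scored by leave-one-out accuracy. This is our own illustration under the usual weighted Euclidean distance, not the chapter's implementation; the function name and signature are assumptions.

```python
import numpy as np

def weighted_nn_loo_accuracy(X, y, w):
    """Leave-one-out accuracy (%) of a 1-NN classifier whose squared
    Euclidean distance scales feature j by weight w[j] in [0, 1]."""
    X = np.asarray(X, dtype=float) * np.sqrt(w)    # pre-scale features once
    hits = 0
    for i in range(len(X)):
        d = np.sum((X - X[i]) ** 2, axis=1)        # distances to all cases
        d[i] = np.inf                              # exclude the query itself
        hits += y[int(np.argmin(d))] == y[i]
    return 100.0 * hits / len(X)
```

Pre-scaling by the square root of each weight makes the ordinary squared Euclidean distance equal to the weighted one, so the distance loop stays simple.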

FW-GA-o and DIET-10 use the same wrapper evaluation function as the FW-EDA approaches. Although DIET-10 and IB4 stop deterministically, algorithm FW-GA-o applies the same halting criteria as the FW-EDAs. While DIET-10, FW-GA-o and FW-EBNA perform the search in the 3-weight discrete space, IB4 and FW-EGNA's search space is continuous.

Table 14.2 Accuracy percentages of the NN algorithm using the 5 FW methods shown and without FW. The standard deviation of the estimated percentage is also reported.

Domain         no-FW          DIET-10        FW-GA-o        FW-EBNA        IB4            FW-EGNA
LED24          47.37 ± 3.36†  63.84 ± 2.42*  68.64 ± 1.30   69.03 ± 1.54   66.70 ± 1.80   61.55 ± 1.90†
Waveform-21    76.20 ± 1.48   76.66 ± 1.62   76.71 ± 1.57   76.87 ± 1.04   77.96 ± 1.62   76.90 ± 1.48
3-Weights      77.19 ± 3.36†  81.91 ± 1.98*  82.88 ± 1.66   85.99 ± 1.57   80.32 ± 4.46†  82.00 ± 2.72
C-Weights      81.01 ± 1.14†  83.55 ± 1.56   83.93 ± 1.24   83.55 ± 1.56   81.98 ± 1.84†  84.33 ± 1.31
Avg. artif.    70.44          76.49          78.04          78.86          76.74          76.19
Glass          64.85 ± 2.15†  71.34 ± 4.89   71.32 ± 2.97   71.12 ± 5.01   61.13 ± 5.55*  70.09 ± 2.83
CRX            81.56 ± 1.92*  82.12 ± 2.01*  83.14 ± 1.81   83.74 ± 1.94   85.48 ± 0.92   82.17 ± 2.14*
Vehicle        67.33 ± 2.11   68.71 ± 1.48   69.86 ± 1.42   69.43 ± 2.11   64.65 ± 2.32*  69.58 ± 2.33
Contraceptive  43.61 ± 0.97†  47.54 ± 2.99   48.10 ± 2.50   48.32 ± 2.34   45.66 ± 2.66*  44.95 ± 2.10†
Avg. real      64.33          67.42          68.10          68.15          64.23          66.69

Although the optimal selection of parameters is still an open problem for GAs (Grefenstette, 1986), guided by the recommendations of Bäck (1996), the probability of crossover is set to 1.0 and the mutation probability to 1/n (these values are common in the literature). Fitness-proportional selection is used to select individuals for crossover. In order to avoid any bias in the comparison, the remaining FW-GA-o parameters are the same as in the EDA approaches: the population size is set to 1,000 and the new population is formed from the best members of both the old population and its offspring.
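Fitness-proportional (roulette-wheel) selection, as used by FW-GA-o, can be sketched as follows; the function name and the handling of an all-zero-fitness population are our own choices, not details given in the chapter.

```python
import random

def proportional_select(population, fitness, rng=random):
    """Select one parent with probability proportional to its fitness.
    Assumes non-negative fitness values."""
    scores = [fitness(ind) for ind in population]
    total = sum(scores)
    if total == 0:
        return rng.choice(population)      # degenerate case: uniform pick
    pick = rng.uniform(0, total)           # spin the roulette wheel
    acc = 0.0
    for ind, s in zip(population, scores):
        acc += s
        if acc >= pick:
            return ind
    return population[-1]                  # guard against float round-off
```

Individuals with higher wrapper accuracy are selected more often, but low-fitness individuals retain a nonzero chance, which preserves diversity.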

Because of the non-deterministic nature of FW-EBNA, FW-EGNA and the GA, 5 replications of 2-fold cross-validation (5x2cv) are applied to assess the predictive generalization accuracy of all the FW algorithms being compared. In each replication, the available data is randomly partitioned into two equal-sized sets S1 and S2. Each FW algorithm is then trained on one set and tested on the other. In this way, the reported accuracies are the mean of 10 accuracies. The standard deviation of the mean is also reported. We extend the comparison by running the NN algorithm with homogeneous weights (no-FW). Table 14.2 shows the accuracy results for these algorithms.

A deeper analysis of the accuracy results is carried out using statistical tests. The 5x2cv F test (Alpaydin, 1999) is performed to determine the degree of significance of the accuracy differences among the proposed algorithms. In Table 14.2, the symbol '†' denotes a statistically significant difference from the best FW algorithm in the domain at the 0.05 confidence level; '*', significance at the 0.1 level. These symbols have the same meaning in all the tables in this chapter. Experiments are executed on an SGI-Origin 200 computer.

Using different FW techniques, statistically significant accuracy improvements are achieved relative to the no-FW approach in all datasets except Waveform-21 and Vehicle. Although accuracy differences between FW algorithms are not statistically significant for most of the datasets, FW-EBNA has the best average accuracy among the compared methods in both the artificial and real datasets.

It is not an easy task to show significant accuracy differences among classification algorithms in real datasets. Kohavi and John (1997) argue that real datasets are already preprocessed to include only relevant features, and this makes the appearance of significant accuracy differences among compared classification techniques unlikely.

FW-GA-o, FW-EBNA and FW-EGNA, our three population-based algorithms, do not show statistically different accuracy levels in any of the domains. Note the irregular behaviour of IB4: while it shows the best accuracies in Waveform-21 and CRX, it has significant differences relative to the best algorithm in all the other datasets except LED24. On the other hand, the comparable behaviour of DIET-10 relative to the population-based algorithms in many of the datasets must be noted.

FW-EBNA has the best estimated accuracy in both of the artificial domains with discrete weights, LED24 and 3-Weights. On the other hand, in the domains with continuous weights, while FW-EGNA shows the best accuracy in C-Weights, its accuracy is surpassed by IB4 in Waveform-21. It must be remembered that, among the FW algorithms being compared, only FW-EGNA and IB4 perform the search in a continuous weight space.

Although FW-GA-o and FW-EBNA do not have statistically different accuracy levels for most of the datasets, we note that in several domains there are significant differences in the number of generations needed to obtain these similar accuracy levels. Table 14.3 shows the generations in which the population-based FW-GA-o and FW-EBNA stop under the stopping criterion described earlier. Statistically significant differences between these algorithms are marked. Table 14.3 also shows FW-EGNA's mean stop-generation, but since the nature of their search spaces is different, comparisons between FW-EGNA and the population-based discrete FW algorithms (FW-GA-o and FW-EBNA) should be made with caution.

Although FW-GA-o and FW-EBNA do not have statistically different accuracies in the LED24, Vehicle and Contraceptive domains, FW-EBNA needs statistically fewer generations than FW-GA-o to obtain the percentages shown. With these datasets, it seems that FW-EBNA, by using Bayesian networks, is able to capture the underlying structure of the problem faster than FW-GA-o. This superiority of EDA approaches using Bayesian networks relative to GAs has also been noted in the literature (Pelikan et al., 1998; Inza et al., 2001).

When the wrapper approach is used to calculate the evaluation function value of each found weight set, fast discovery of similar or better accuracies becomes a critical task. As the NN algorithm needs several CPU seconds to estimate the LOOCE of a proposed set of weights, a faster discovery of these similar-fitness solutions is highly desirable. By thus avoiding the simulation of several generations of solutions, a large amount of CPU time is saved.

Table 14.3 Mean stop-generation for FW-GA-o, FW-EBNA and FW-EGNA. The standard deviation of the mean is also reported. The initial generation is considered to be the zero generation.

Domain         FW-GA-o        FW-EBNA       FW-EGNA
LED24          6.50 ± 0.54†   4.50 ± 0.47   8.66 ± 0.51
Waveform-21    2.50 ± 1.37    2.33 ± 1.03   2.16 ± 1.83
3-Weights      2.50 ± 1.22    2.66 ± 0.81   3.50 ± 1.04
C-Weights      5.16 ± 1.16    3.33 ± 1.50   3.16 ± 1.16
Glass          1.00 ± 0.89    1.00 ± 0.63   1.66 ± 1.50
CRX            1.33 ± 1.75    1.33 ± 0.51   1.83 ± 1.83
Vehicle        3.16 ± 0.75*   1.16 ± 0.40   1.33 ± 0.81
Contraceptive  2.66 ± 0.81†   1.83 ± 0.98   2.00 ± 0.63

In order to understand the advantages of this fast discovery of similar-fitness solutions, the CPU times for the induction of the probabilistic models must be studied: the EDA approaches have the computational overhead of calculating a probabilistic model in each generation. Table 14.4 shows, for each domain, the average CPU time to induce the Bayesian network structure in each generation and the average CPU time needed to estimate the predictive accuracy of a single feature weight set. Although FW-EGNA is not included in this comparison between FW-GA-o and FW-EBNA, its average CPU time for the induction of the Gaussian network structure in each generation is also shown in Table 14.4.

From the results of Table 14.4, we can conclude that, since the CPU times for the induction of Bayesian networks in each EDA generation are insignificant, the CPU time saving relative to the GA approach, shown in Table 14.3, is maintained.

Let us now consider the accuracy results obtained by FW-EGNA. In the scheme used of 5 replications of 2-fold cross-validation (5x2cv), the feature weight set selected by FW-EGNA usually has the best accuracy estimation over the training fold of the FW algorithms compared, for most of the datasets. However, when this weight set is tested on the novel instances that form the second fold (instances unseen during the training process), a notable decay in the percentage accuracy is observed, and its accuracy levels become similar to those of the other FW algorithms. Although this "overfitting" risk also exists for the rest of the algorithms, the largest accuracy differences between the training fold and the test fold of the 5x2cv scheme appear for the FW-EGNA algorithm.


Table 14.4 Average CPU times (in seconds) for the induction of different probabilistic models (standard deviations are nearly zero) in each generation of the EDA search. The last column shows the average CPU time to estimate the predictive accuracy of a feature weight set.

Domain         FW-EGNA   FW-EBNA   1-NN acc. estim.
LED24          39.4      5.5       29.8
Waveform-21    21.3      5.0       35.0
3-Weights      2.2       3.6       22.7
C-Weights      1.5       3.4       21.4
Glass          1.0       2.2       3.3
CRX            5.2       4.2       34.9
Vehicle        10.3      4.4       60.4
Contraceptive  1.0       2.2       20.7

It seems that allowing a continuous set of feature weights does not result in better accuracy levels. These findings agree with those of Kohavi et al. (1997), who find that the extra power given by an increased set of weights does not further reduce the bias of the error and usually increases its variance; to avoid this, they recommend the use of a set of two or three different weights.

As the feature weights of the C-Weights and Waveform-21 domains are continuous, it is interesting to study the different behaviour of FW-EGNA in these domains. In C-Weights, a non-noisy domain, FW-EGNA has a slight advantage in its final accuracy relative to the other algorithms. On the other hand, in Waveform-21, a domain with a significant degree of noise (Breiman et al., 1984), FW-EGNA's accuracy decays significantly and the other continuous FW algorithm, IB4, obtains the most accurate result. Although FW-EGNA has the best estimation percentages on the training folds (in the 5x2cv estimation scheme) of both Waveform-21 and C-Weights, its accuracy on the new instances of the test fold decays significantly in the Waveform-21 domain. As C-Weights is not a noisy domain, the presence of a significant degree of noise appears to be the reason for this overfitting problem in Waveform-21. We have also seen this overfitted behaviour in the real datasets, domains where we can assume that noise exists.

In our case, unless we have a specifically designed and non-noisy domain such as C-Weights, where the use of a continuous weight set gives a slight advantage, the naive assumption that using more weights with a wrapper scheme, as in FW-EGNA, will reduce the classification error seems false.

We should stress that this conclusion holds only in the context of a wrapper approach to FW and for small training sets where the overfitting risk is high.


5. Summary and future work

The application of the EDA approach to the FW problem for the NN algorithm was studied in this chapter. Two powerful probabilistic models, Bayesian and Gaussian networks, are applied to factorize the probability distribution of weight set solutions: Bayesian networks are used with a set of three possible discrete weights and Gaussian networks are used with a continuous range of weights. Both new methods, FW-EBNA (Feature Weighting by Estimation of Bayesian Network Algorithm) and FW-EGNA (Feature Weighting by Estimation of Gaussian Network Algorithm), use the wrapper scheme for the evaluation of proposed weight set solutions. A comparison is performed in a set of natural and artificial domains with two sequential and one genetic-inspired algorithm.

While interesting accuracy results are obtained for FW-EBNA, the impact of overfitting on noisy datasets significantly reduces the accuracy of FW-EGNA. We therefore confirm the findings of Kohavi et al. (1997), who found that the extra power given by an increased set of weights does not further reduce the bias of the error and usually increases its variance, and who recommend using a set of two or three different weights. Unless we have a specifically designed and non-noisy domain with continuous feature weights, the naive assumption that using more weights with a wrapper scheme will reduce the classification error seems to be false.

We note that the genetic approach needs more generations than FW-EBNA to discover similar fitness solutions. In order to save CPU time, when the wrapper approach is used to calculate the evaluation function value of each found weight set, the fast achievement of similar or better accuracies becomes a critical task.

As future work, we envisage the use of other probabilistic models for the FW task within the EDA approach. Another interesting research avenue is the adoption of a different stopping criterion to avoid the overfitting risk in FW-EGNA.

Acknowledgments

The authors thank D. Wettschereck for his useful comments.

References

Aha, D.W. (1992). Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36:267-287.


Aha, D.W. and Bankert, R.L. (1994). Feature selection for case-based classification of cloud types: An empirical comparison. In Proceedings of the AAAI'94 Workshop on Case-Based Reasoning, pages 106-112.

Alpaydin, E. (1999). Combined 5x2cv F test for comparing supervised classification learning algorithms. Neural Computation, 11:1885-1892.

Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice. Oxford University Press.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth.

Buntine, W. (1991). Theory refinement in Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52-60.

Cardie, C. and Howe, N. (1997). Improving minority class prediction using case-specific feature weights. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 57-65.

Creecy, R.H., Masand, B.M., Smith, S.J., and Waltz, D.L. (1992). Trading MIPS and memory for knowledge engineering. Communications of the ACM, 35:48-64.

Dasarathy, B.V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence. CIMAF99. Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Friedman, N. and Yakhini, Z. (1996). On the sample complexity of learning Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pages 274-282.

Geiger, D. and Heckerman, D. (1994). Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Advanced Technology Division, Microsoft Corporation, Seattle, Washington.

Grefenstette, J.J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1):122-128.

Howe, N. and Cardie, C. (1997). Examining locally varying weights for nearest neighbor algorithms. In Lecture Notes in Artificial Intelligence: Case-Based Reasoning Research and Development: Second International Conference on Case-Based Reasoning, pages 455-466.

Inza, I., Larrañaga, P., Etxeberria, R., and Sierra, B. (2000). Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 123(1-2):157-184.

Inza, I., Larrañaga, P., and Sierra, B. (2001). Feature subset selection by estimation of distribution algorithms. In Larrañaga, P. and Lozano, J.A., editors, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Kelly, J.D. and Davis, L. (1991). A hybrid genetic algorithm for classification. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 645-650.

Kira, K. and Rendell, L.A. (1992). A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning, pages 249-256.

Kohavi, R. and John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324.

Kohavi, R., Langley, P., and Yun, Y. (1997). The utility of feature weighting in nearest-neighbor algorithms. In European Conference on Machine Learning, poster.

Larrañaga, P., Etxeberria, R., Lozano, J.A., and Peña, J.M. (2000). Optimization in continuous domains by learning and simulation of Gaussian networks. In Proceedings of the Workshop in Optimization by Building and Using Probabilistic Models, GECCO-2000, pages 201-204.

Lowe, D. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7:72-85.

Murphy, P. (1995). UCI Repository of machine learning databases. University of California, Department of Information and Computer Science.

Ng, A.Y. (1997). Preventing "overfitting" of cross-validation data. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 245-253.

Papadimitriou, C.H. and Steiglitz, K. (1982). Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall.

Pelikan, M., Goldberg, D.E., and Cantú-Paz, E. (1998). Linkage problem, distribution estimation, and Bayesian networks. IlliGAL Report 98013, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory.

Punch, W.F., Goodman, E.D., Pei, M., Chia-Shun, L., Hovland, P., and Enbody, R. (1993). Further research on feature selection and classification using genetic algorithms. In Proceedings of the International Conference on Genetic Algorithms, pages 557-564.

Russell, S.J. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall.

Salzberg, S.L. (1991). A nearest hyperrectangle learning method. Machine Learning, 6:251-276.

Scherf, M. and Brauer, W. (1997). Feature selection by means of a feature weighting approach. Technical Report FKI-221-97, Forschungsberichte Künstliche Intelligenz, Institut für Informatik, Technische Universität München, München, Germany.


Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461-464.

Skalak, D. (1994). Prototype and feature selection by sampling and random hill climbing algorithms. In Proceedings of the Eleventh International Conference on Machine Learning, pages 293-301.

Stanfill, C. and Waltz, D. (1986). Toward memory-based reasoning. Communications of the ACM, 29:1213-1228.

van den Bosch, A. and Daelemans, W. (1993). Data-oriented methods for grapheme-to-phoneme conversion. Technical Report 42, Tilburg University, Institute for Language Technology and Artificial Intelligence, Tilburg, The Netherlands.

Wettschereck, D., Aha, D.W., and Mohri, T. (1997). A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11:273-314.

Wettschereck, D. and Dietterich, T.G. (1995). An experimental comparison of the nearest-neighbor and nearest-hyperrectangle algorithms. Machine Learning, 19:1-25.

Wilson, R. and Martinez, T.R. (1996). Instance-based learning with genetically derived attribute weights. In Proceedings of the International Conference on Artificial Intelligence, Expert Systems and Neural Networks, pages 11-14.



Chapter 15

Rule Induction by Estimation of Distribution Algorithms

B. Sierra, E.A. Jimenez, I. Inza, P. Larrañaga
Department of Computer Science and Artificial Intelligence
University of the Basque Country
[email protected], [email protected], {ccbincai, ccplamup}@si.ehu.es

J. Muruzábal
Statistics and Decision Sciences Group
University Rey Juan Carlos
[email protected]

Abstract In this chapter preliminary work on the use of Estimation of Distribution Algorithms (EDAs) for the induction of classification rules is presented. Each individual obtained by simulation of the probability distribution learnt in each EDA generation represents a disjunction of a finite number of simple rules. The problem has been modeled to allow representations with different complexities. Experimental results comparing three types of EDAs (UMDA, a dependency tree and EBNA) with two classical rule induction algorithms (RIPPER and CN2) are shown.

Keywords: Machine Learning, Supervised classification, Rule induction, Estimation of Distribution Algorithms

1. Introduction

Among the different supervised classification paradigms developed in Statistics and Automatic Learning (Discriminant Analysis, Logistic Regression, Rule Induction, Classification Trees, K-NN, Neural Nets, Bayesian Nets, Multiclassifiers, etc.), systems based on the Rule Induction paradigm are very attractive because of their simplicity, transparency and comprehensibility.


These advantages meant that in early Artificial Intelligence research, most expert systems built were based on so-called production rules, which were extracted from interviews with experts in the subject being modeled. This knowledge extraction process was clearly tedious and prone to errors, so attempts were later made with algorithms that induced the production rules from a collection of cases containing information about the domain to be modeled.

Although statistical hypothesis tests initially played an important role in the induction of rules with good generalization capacity, other approaches were later developed (see Section 2 of this chapter) based on intelligent search for good rules. Using the EDA paradigm, a new approach to searching for good classification rules is presented in this chapter. This work can only be considered a preliminary approach, and the obtained results must be interpreted with caution.

The structure of the chapter is as follows. Section 2 presents a brief review of Classifier Systems. Section 3 presents a novel approach based on the use of the EDA paradigm to produce a set of IF-THEN rules. Each individual found during the EDA search is interpreted as a complex classification rule, formed from the disjunction of a finite number of simple rules. Three EDA approaches of different complexity (UMDA, a dependency tree and EBNA) are compared with two classic rule inducers (RIPPER and CN2) on two natural and one artificial problem. The result of this comparison appears in Section 4. The chapter concludes with a section on the conclusions of the work and possible lines of future research.

2. A review of Classifier Systems

Classifier Systems (CSs) were introduced by Holland (1986). To understand what CSs originally meant, one needs to envisage "... message-passing, rule-based production systems in which many rules are active simultaneously ... In such systems each rule can be looked upon as a tentative hypothesis about some aspect of the task environment, competing against other plausible hypotheses being entertained at the same time". The full scope of this message-passing activity of rules was later evidenced by Holland and coworkers. However, the original architecture has evolved in recent years towards relatively simpler variants. See Lanzi et al. (2000) for an update.

While certain random generalization operators have been shown to speed up the system's task, the leading rule discovery role in CSs has traditionally been assigned to suitable Genetic Algorithms (GAs) (Fidelis et al., 2000; Liu and Kwok, 2000; Muruzábal, 1999; Nonas and Poulovassilis, 1998), considering each individual found during the search process as a rule classifier. In a standard implementation, a scalar fitness measure is maintained by each rule or classifier Q → R in the population, where Q ∈ {0, 1, #} is the condition or antecedent part and R stands for the predicted class. In order to evolve the population of solutions, proportional or tournament selection operates on the basis of the fitness measure over the entire population. Once parents are selected, standard crossover and mutation are applied: both recombination operators usually focus on the condition part of each rule alone.

There are two main traditional methods to represent a CS by GAs. In the Michigan approach (Holland, 1986), each individual is represented by a fixed-length string and the CS is represented by the whole set of individuals in the population. On the other hand, the Pittsburgh approach (De Jong and Spears, 1991) proposes the use of a variable-length string, interpreting each individual as an independent CS.

Instead of GAs, we propose the use of EDAs as an alternative rule generating engine in CSs. Rivera and Santana (1999) consider a similar idea centered around Wilson's (1998) well-known XCS design, yet our approach substantially generalizes the syntax of the rules handled by the underlying EDA. For example, we include R as part of the structure being probabilistically modeled (they restrict attention to Q); further, we allow disjunctions of standard Qs as well as negations at particular coordinates in the conditions of classifiers. Another important difference is that Rivera and Santana (1999) use a simplified EDA based solely on marginal distributions, whereas we use a more detailed joint distribution model with higher order dependencies among portions of Q and R.

3. An approach to rule induction by means of EDAs

In this section an approach to modeling the induction of IF-THEN rules by means of EDAs is presented. We tackle the induction of a rule-based classificatory model as a combinatorial optimization task. An EDA scheme is used as the search mechanism to produce a population of rules. In this way, each individual simulated by the EDA process represents a rule-based classificatory model. Our approach incorporates ideas from both the Michigan and Pittsburgh approaches: while a fixed-length string representation is employed (Michigan), each individual found during the search is interpreted as an independent classifier (Pittsburgh).

The notation for the rest of this chapter is as follows. The n predictor variables are denoted by X = {X1, ..., Xn} and C designates the class label. The values of variable Xi (i = 1, ..., n) are denoted by xi. The number of possible different values of variable Xi is represented by ri (i = 1, ..., n). The class variable C has |C| different values, denoted by c. Variables with continuous values are discretized using a supervised procedure proposed by Dougherty et al. (1995). The existence of a dataset of cases TR = {(x1, c1), ..., (xR, cR)} is assumed. As this dataset is used to induce the set of IF-THEN classification rules, it is also applied to estimate, by means of a cross-validation scheme, the rule set's classification accuracy.

3.1 Individual codification

In this work we assume that each classificatory model is formed from one IF-THEN rule, which can have different degrees of complexity. Each rule is represented by n + 1 pieces of information: while the first n positions in the individual representation (one for each predictive variable) constitute the rule's antecedent, the last position represents the rule's consequent (the predicted class label). In order to represent each individual, the following three approaches are considered:

• Xi approach.

In this approach, each variable (predictor or class) takes values in its own range. As a result, the cardinality of the search space is calculated as:

|C| Π_{i=1}^{n} r_i.

Example 15.1 Given a domain with 5 predictor variables X1, ..., X5 with r1 = r2 = r3 = 3 and r4 = r5 = 4, and a class variable C with two possible values (|C| = 2), the (1,2,2,3,4,1) individual can be considered as the classificatory model whose unique rule is:

IF (X1 = 1 and X2 = 2 and X3 = 2 and X4 = 3 and X5 = 4) THEN C = 1

• Xi, # approach.

In this approach each predictor variable extends its alphabet with the symbol #, which means "don't care". Therefore, the cardinality of the search space is calculated as: |C| Π_{i=1}^{n} (r_i + 1). We can consider this as applying a kind of Feature Subset Selection (Kittler, 1978): the inclusion of the symbol # discards (or considers as irrelevant) the variables which have this symbol in the individual's codification.

Example 15.2 Following the example introduced in the previous paragraph, the (1,2,4,3,5,2) individual can be considered as the classificatory model whose rule only has three variables in the antecedent part:

IF (X1 = 1 and X2 = 2 and X4 = 3) THEN C = 2

• Xi, #, ≠Xi approach.


In this approach each predictor variable extends its alphabet with the symbols # and ≠xi, giving it a total of 2r_i + 1 possible values. Therefore the cardinality of the search space is: |C| Π_{i=1}^{n} (2r_i + 1).

Example 15.3 Following the example introduced in the previous paragraph, the (5,2,7,5,8,2) individual can be considered as the classificatory model whose unique rule is:

IF (X1 ≠ 1 and X2 = 2 and X3 ≠ 3 and X5 ≠ 3) THEN C = 2
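The three codings can be decoded mechanically. The sketch below handles the richest one (Xi, #, ≠Xi), and it reproduces Example 15.3; note that the code-to-predicate layout (values 1..ri, then #, then the ri negations) is our reading of the text, not the authors' code.

```python
def decode_rule(individual, r):
    """Decode one antecedent plus class under the Xi, #, !=Xi coding.
    r[i] is the number of values of variable X_{i+1}; codes are 1..r[i]
    for '= value', r[i]+1 for '#' (don't care), and r[i]+1+v for '!= v'."""
    *antecedent, c = individual
    conds = []
    for i, code in enumerate(antecedent):
        ri = r[i]
        if code <= ri:
            conds.append(f"X{i+1} = {code}")
        elif code == ri + 1:
            continue                      # '#': the variable is irrelevant
        else:
            conds.append(f"X{i+1} != {code - ri - 1}")
    return "IF (" + " and ".join(conds) + f") THEN C = {c}"
```

With r = [3, 3, 3, 4, 4], decoding (5, 2, 7, 5, 8, 2) recovers the rule of Example 15.3.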

3.2 Growing rule complexity

The rule induction models considered in this chapter have as basic elements the rules presented in the previous paragraphs. By the disjunction of these simple rules, a more complex rule antecedent can be constructed.

Example 15.4 Following the previous example, the individual (2,2,2,1,3,3,2,1,4,1,2) corresponds to the classificatory model whose only rule is:

IF (Xl = 2 and X 2 = 2 and X3 = 2 and X4 = 1 and X5 = 3) or (Xl = 3 and X 2 = 2 and X3 = 1 and X4 = 4 and X5 = 1) THEN C = 2

Keeping in mind that the number of simple antecedents, k, can vary, and that for each simple antecedent the Xi, #, ≠Xi approach can be used, the cardinality of the resulting search space is: |C| Π_{i=1}^{n} (2r_i + 1)^k.

3.3 Evaluation function

Each individual generated by the EDA approach is interpreted as a rule-based classificatory model. Its evaluation is performed with a 10-fold cross-validation scheme, and the score assigned is the percentage of dataset instances that are well classified when the individual's equivalent model statement is used. This means that, given an individual generated by the EDA approach, the entire dataset is traversed, counting a success every time a case in the dataset matches the rule expressed by the individual and the class proposed by the individual agrees with the class of the case. If the rule does not match the case, then a success is counted when the class expressed by the EDA individual and the class of the case are not the same.
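The counting scheme just described can be sketched in a few lines. The antecedent representation, function names and toy dataset below are our own illustration, not the chapter's implementation:

```python
def rule_matches(case, antecedent):
    """antecedent: list of (feature_index, op, value) with op in {'=', '!='}."""
    return all((case[i] == v) if op == '=' else (case[i] != v)
               for i, op, v in antecedent)

def score(antecedents, rule_class, dataset):
    """Fraction of cases classified consistently with a disjunctive rule."""
    hits = 0
    for case, true_class in dataset:
        covered = any(rule_matches(case, a) for a in antecedents)
        if covered:
            hits += true_class == rule_class   # rule fires: class must agree
        else:
            hits += true_class != rule_class   # rule silent: class must differ
    return hits / len(dataset)

# Toy dataset of (features, class) pairs and the rule "IF X0 = 1 THEN C = 2".
data = [((1, 0), 2), ((1, 1), 2), ((0, 0), 1), ((0, 1), 2)]
print(score([[(0, '=', 1)]], 2, data))   # 0.75: three of four cases consistent
```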

3.4 Probabilistic models

We propose the use of three probabilistic models of different complexities for the factorization of the probability distribution of the proposed solutions within the EDA approach. The following three probabilistic algorithms are applied:


Table 15.1 Details of experimental domains.

Domain         Number of instances   Number of features
Heart          270                   13
Cleveland      303                   13
2-Attractors   2,000                 12

• UMDA (Mühlenbein, 1998) univariate distribution model.

• The optimal dependency tree algorithm proposed by Chow and Liu (1968). We refer to this algorithm as TREE in this chapter.

• EBNA (Etxeberria and Larrañaga, 1999) multivariate model, based on the use of Bayesian networks.

The following decisions are made for these three EDA approaches. A population size of 1,000 individuals is used. Half of the best individuals form the group of selected individuals from which the probabilistic model is induced at each generation. In our approach, the best individual of the previous generation is maintained and N - 1 individuals are created as offspring. An elitist approach is then used to form successive populations. Instead of directly discarding the N - 1 individuals from the previous generation and replacing them with N - 1 newly generated ones, the 2N - 2 individuals are pooled together and the best N - 1 are chosen from them. These best N - 1 individuals, together with the best individual of the previous generation, form the new population. In this way the population converges faster, but this also carries a risk of losing diversity.
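The elitist replacement just described can be sketched as follows; `fitness` and `sample_from_model` are placeholders for the chapter's evaluation function and the sampling of the induced probabilistic model:

```python
def next_population(population, fitness, sample_from_model, model):
    """Elitist update: keep the best individual, generate N - 1 offspring,
    and fill the rest of the population with the best N - 1 of the pooled
    2N - 2 old and new individuals."""
    N = len(population)
    ranked = sorted(population, key=fitness, reverse=True)
    best, rest = ranked[0], ranked[1:]
    offspring = [sample_from_model(model) for _ in range(N - 1)]
    pool = sorted(rest + offspring, key=fitness, reverse=True)
    return [best] + pool[:N - 1]

# Toy usage: bit-string individuals, fitness = number of ones, and a "model"
# that always samples the optimum.
pop = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(next_population(pop, sum, lambda m: (1, 1), None))
# [(1, 1), (1, 1), (1, 1), (1, 1)]
```

The pooled selection is what makes the update converge quickly, at the cost of diversity, as noted above.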

4. Empirical comparison

We test the power of our EDA-inspired rule inducers on two real and one artificial dataset. Table 15.1 gives the principal characteristics of these datasets. Both real datasets come from the UCI repository (Murphy, 1995) and have been frequently used in the Machine Learning literature. The 2-Attractors domain has 12 continuous features in the range [3,6]. Its target concept is whether an instance is closer (using Euclidean distance) to (0,0,...,0) or to (9,9,...,9). Only the first 6 features participate in the distance calculation; the remaining variables are irrelevant.

For the three EDAs considered, the Xi, #, ≠Xi approach is applied. For all datasets, the probabilities of the possible values for each bit in the first EDA generation are biased in the following form: 0.6 of the probability is equally distributed among all possible Xi values and the remaining 0.4 is distributed among # and all the ≠Xi values.

Table 15.2 Estimated accuracy of the three EDA approaches using disjunctions of 2 simple rules. The average accuracy and standard deviation of 5 runs of a 10-fold cross-validation procedure is reported.

Domain         UMDA           TREE           EBNA
Heart          77.33 ± 0.42   77.11 ± 1.46   79.70 ± 1.91
Cleveland      77.35 ± 0.82   78.41 ± 0.89   77.22 ± 0.96
2-Attractors   77.34 ± 0.25   75.10 ± 0.62   76.96 ± 0.40

Table 15.3 Estimated accuracy of the three EDA approaches using disjunctions of 4 simple rules. The average accuracy and standard deviation of 5 runs of a 10-fold cross-validation procedure is reported.

Domain         UMDA           TREE           EBNA
Heart          80.22 ± 2.08   77.92 ± 2.36   80.07 ± 1.23
Cleveland      78.01 ± 1.70   78.48 ± 1.58   78.08 ± 1.50
2-Attractors   80.85 ± 0.37   78.27 ± 0.62   80.15 ± 0.33

Example 15.5 For an X1 variable with three possible values (0, 1, 2), the probabilities for these values in the first EDA generation are calculated as follows: P(X1 = 0) = 0.2; P(X1 = 1) = 0.2; P(X1 = 2) = 0.2; P(X1 = #) = 0.1; P(X1 ≠ 0) = 0.1; P(X1 ≠ 1) = 0.1; P(X1 ≠ 2) = 0.1.
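The biased initialization of Example 15.5 generalizes to any cardinality r_i. A small sketch (the dictionary representation is our own):

```python
def initial_probabilities(ri):
    """0.6 spread uniformly over the r_i literal values; 0.4 spread uniformly
    over '#' and the r_i negated values."""
    p = {f"= {v}": 0.6 / ri for v in range(ri)}
    p["#"] = 0.4 / (ri + 1)
    p.update({f"!= {v}": 0.4 / (ri + 1) for v in range(ri)})
    return p

# For ri = 3 this reproduces the probabilities of Example 15.5.
print(initial_probabilities(3))
```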

Experiments are performed using disjunctions formed by 2 and 4 simple rules. Tables 15.2 and 15.3 show the results using 2 and 4 simple rules, respectively. These tables show the estimated accuracy on the datasets described. The accuracy of the three EDA approaches is estimated by executing a 10-fold cross-validation procedure 5 times, and the tables show the average and standard deviation of these 5 runs.

CN2 (Clark and Niblett, 1989) and RIPPER (Cohen, 1995) are well-known rule inducers which are included here for comparison. Both algorithms are widely used in the rule induction literature. As CN2 and RIPPER are deterministic algorithms, a single 10-fold cross-validation is performed to estimate their accuracy (the standard deviation of the cross-validation is also shown in the tables). CN2 and RIPPER results appear in Table 15.4.

A two-sided test for the difference of two proportions (Dietterich, 1998) is applied to calculate the statistical significance of the obtained accuracy differences among the algorithms being compared. Differences are considered significant when the α = 0.05 significance level is surpassed.

Table 15.4 CN2 and RIPPER results. The estimated accuracy and standard deviation of a single 10-fold cross-validation procedure is reported.

Domain         CN2            RIPPER
Heart          78.17 ± 2.61   77.04 ± 1.44
Cleveland      79.84 ± 3.18   78.88 ± 3.30
2-Attractors   87.05 ± 0.85   81.15 ± 0.88

All EDA approaches have accuracies similar to those of CN2 and RIPPER on the Heart and Cleveland datasets. The results obtained for both real datasets do not show statistically significant differences between the five algorithms compared. Another interesting aspect is that, on both real datasets, the three EDA approaches obtain no clear advantage from the use of 4 simple rules relative to the use of 2.

On the other hand, there are statistically significant accuracy differences between the three EDA approaches and CN2 in the 2-Attractors domain. Here, the use of 4 simple rules instead of 2 provides us with a clear advantage for the three EDA approaches, achieving statistically significant differences in each probabilistic approach.

Among the three EDA approaches considered, statistically significant differences only appear between the UMDA and TREE algorithms on the 2-Attractors dataset when 4 simple rules are used. We do not see statistically significant differences between the three EDA approaches in any of the other tasks. It seems that the use of probabilistic models that capture dependencies among the problem variables (TREE and EBNA) provides no advantage relative to an approach that assumes no dependencies exist among the problem variables (UMDA). Taking into account the preliminary nature of this work, these trends should be interpreted with caution and may be limited to the tested datasets.

5. Conclusions and future work

A preliminary approach for the application of the EDA paradigm to rule induction has been presented. The system is formed by an IF-THEN rule expressed as the disjunction of a finite set of simple rules. Three probabilistic models of different complexities (UMDA, TREE and EBNA) are compared with two classical rule inducers (CN2 and RIPPER) on two real datasets and one artificial dataset. For the EDA approaches, the antecedent part of the IF-THEN rule is allowed to be formed by 2 and 4 simple rules.


Despite their greater capability, the TREE and EBNA approaches never obtain better accuracy results than the simpler probabilistic approach, UMDA. Encouraging results are achieved by the EDA approaches on both real tasks relative to CN2 and RIPPER accuracies. The enlargement of the antecedent part from 2 to 4 simple rules improves the results on certain tasks.

As future work we envision the enlargement of the proposed alphabet for the rule codification. In the case of ordinal variables, we could include the ≤ and ≥ symbols, allowing rules of the form: "IF (X1 ≤ 2 and X2 ≥ 2) THEN C = 2". As with the CN2 and RIPPER algorithms, instead of inducing a unique rule, we could allow the production of a set of IF-THEN rules. An interesting line of research could therefore be the induction of a specific IF-THEN rule for each class of the dataset.

References

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467.

Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4):261-283.

Cohen, W. (1995). Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123.

Dietterich, T. (1998). Approximate statistical tests for comparing supervised learning algorithms. Neural Computation, 10(7):1895-1924.

Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the Twelfth International Conference on Machine Learning, pages 194-202.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence. CIMAF99. Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Fidelis, M., Lopes, H., and Freitas, A. (2000). Discovering comprehensible classification rules with a genetic algorithm. In Proceedings of the Congress on Evolutionary Computation, pages 805-810.

Holland, J. (1986). A mathematical framework for studying learning in classifier systems. Physica D, 22.

De Jong, K. and Spears, W. (1991). Learning concept classification rules using genetic algorithms. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 651-656.

Kittler, J. (1978). Feature set search algorithms. In Chen, C., editor, Pattern Recognition and Signal Processing, pages 41-60. Sijthoff and Noordhoff.

Lanzi, P., Stolzmann, W., and Wilson, S. (2000). Learning Classifier Systems. From Foundations to Applications. Springer Verlag.

Liu, J. and Kwok, J. (2000). An extended genetic rule induction algorithm. In Proceedings of the 2000 Congress on Evolutionary Computation, pages 458-463.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Murphy, P. (1995). UCI Repository of machine learning databases. University of California, Department of Information and Computer Science.

Muruzábal, J. (1999). Mining the space of generality with uncertainty-concerned, cooperative classifiers. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 449-457.

Nonas, E. and Poulovassilis, A. (1998). Optimization of active rule agents using a genetic algorithm approach. In DEXA, pages 332-341.

Rivera, J. and Santana, R. (1999). Improving the discovery component of classifier systems by the application of Estimation of Distribution Algorithms. In Proceedings of the Student Sessions ACAI'99: Machine Learning and Applications, pages 43-44.

Wilson, S. (1998). Generalization in the XCS classifier system. In Proceedings of the Third Genetic Programming Conference.


P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms. © Springer Science+Business Media New York 2002

Chapter 16

Partial Abductive Inference in Bayesian Networks: An Empirical Comparison Between GAs and EDAs

L.M. de Campos Department of Computer Science and Artificial Intelligence

University of Granada

[email protected]

J.A. Gamez Department of Computer Science

University of Castilla-La Mancha

[email protected]

P. Larrañaga Department of Computer Science and Artificial Intelligence

University of the Basque Country [email protected]

S. Moral Department of Computer Science and Artificial Intelligence

University of Granada

[email protected]

T. Romero Department of Computer Science and Artificial Intelligence

University of the Basque Country

[email protected]


Abstract Partial abductive inference in Bayesian networks is the process of generating the K most probable configurations of a distinguished subset of the network variables (the explanation set), given some observations (evidence). This problem, also known as the Maximum a Posteriori problem, is known to be NP-hard, so exact computation is not always possible. As partial abductive inference in Bayesian networks can be viewed as a combinatorial optimization problem, Genetic Algorithms have been successfully applied to give an approximate algorithm for it (de Campos et al., 1999). In this work we approach the problem by means of Estimation of Distribution Algorithms, and an empirical comparison between the results obtained by Genetic Algorithms and Estimation of Distribution Algorithms is carried out.

Keywords: Abductive inference, most probable explanation, maximum a posteriori problem, probabilistic reasoning, Bayesian networks, Evolutionary Computation, Genetic Algorithms, Estimation of Distribution Algorithms

1. Introduction

As stated in Chapter 2, Bayesian networks (BNs) (Pearl, 1988; Jensen, 1996) exploit independence properties of probability distributions to give a compact and natural representation. BNs also allow us to calculate probabilities by means of local computation, i.e. probabilistic computations are carried out over the initial pieces of information rather than using a global distribution.

Most work in probabilistic reasoning has been devoted to evidence propagation and total abductive inference. However, in this chapter we focus on partial abductive inference, a type of diagnostic reasoning that can be viewed as a generalization of total abductive reasoning. Although this problem seems to be more useful in practical applications than total abductive inference, it has received much less attention from the BNs community.

The chapter is organized as follows: In Section 2 the basic query types in probabilistic expert systems are introduced. In Section 3 we briefly survey how these queries are solved. In Section 4 we describe how this problem has been approached using Genetic Algorithms (GAs), and in Section 5 we present how to approach the problem using Estimation of Distribution Algorithms (EDAs). Section 6 is devoted to experimental evaluation, and finally, in Section 7, we give our conclusions.

2. Query types in probabilistic expert systems

Assume that we have an n-dimensional variable X = {X_1, ..., X_n} whose probability distribution can be obtained by the factorization provided by a BN. Reasoning in a BN is performed by updating the various probabilities in the light of specific knowledge (evidence or observations). The basic types of queries in BNs are:

• Evidence propagation. In this type of query, given a set of observations (X_O = x_O), our task is to compute

p(x_i | X_O = x_O)    (16.1)

for all non-observed variables X_i in the network.

• Total Abductive Inference, also known as the Most Probable Explanation (MPE) problem (Pearl, 1987). Here, our task is to find the most probable state of the network given a set of observations (X_O = x_O). More formally, if X_U = X \ X_O is the set of unobserved variables, then we aim to obtain the configuration x*_U of X_U such that

x*_U = arg max_{x_U} p(x_U | X_O = x_O).    (16.2)

• Partial Abductive Inference, also known as the Maximum a Posteriori problem (MAP). In this problem, our goal is to obtain the most probable configuration only for a subset of the network variables known as the explanation set (Neapolitan, 1990). More formally, if X_E ⊆ X_U is the explanation set, then we aim to obtain the configuration x*_E of X_E such that

x*_E = arg max_{x_E} p(x_E | X_O = x_O) = arg max_{x_E} Σ_{x_R} p(x_E, x_R | x_O)    (16.3)

where X_R = X_U \ X_E. It is important to note that, in general, x*_E is not equal to the configuration obtained from x*_U by removing the literals not in X_E, so we have to obtain x*_E directly from Equation 16.3. Example 16.1 illustrates this situation.

Example 16.1 Consider the network specified in Figure 16.1, where D1, D2 and S are propositional variables with two possible states each (Ω_D1 = {d1, ¬d1}, Ω_D2 = {d2, ¬d2}, Ω_S = {s, ¬s}).

If we observe that S is present, that is, S = s, then the most probable explanation is (D1 = ¬d1, D2 = d2). However, if variable D2 is selected as the explanation set, then partial abductive inference produces (D2 = ¬d2) as the most probable explanation, which is different from the configuration obtained by removing the literal corresponding to D1 from the configuration obtained by total abductive inference.

p(d1) = 0.1    p(s|d1,d2) = 1.0     p(s|¬d1,d2) = 0.75
p(d2) = 0.4    p(s|d1,¬d2) = 0.8    p(s|¬d1,¬d2) = 0.5

Figure 16.1 A small Bayesian network.
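Figure 16.1 is small enough to verify Example 16.1 by brute force. A sketch using the numbers from the figure (the boolean encoding, with True for the positive literal, is our own; note that under these numbers (¬d1, d2) and (¬d1, ¬d2) are exactly tied given s, and the maximization keeps the first maximum found):

```python
from itertools import product

p_d1, p_d2 = 0.1, 0.4                      # priors from Figure 16.1
p_s = {(True, True): 1.0, (True, False): 0.8,
       (False, True): 0.75, (False, False): 0.5}

def joint(d1, d2, s):
    """p(D1, D2, S) from the factorization of Figure 16.1."""
    pd1 = p_d1 if d1 else 1 - p_d1
    pd2 = p_d2 if d2 else 1 - p_d2
    ps = p_s[(d1, d2)] if s else 1 - p_s[(d1, d2)]
    return pd1 * pd2 * ps

# Total abduction (MPE) given S = s: maximize over (D1, D2) jointly.
mpe = max(product([True, False], repeat=2), key=lambda c: joint(*c, True))

# Partial abduction (MAP) for X_E = {D2}: sum D1 out, then maximize.
map_d2 = max([True, False],
             key=lambda d2: sum(joint(d1, d2, True) for d1 in [True, False]))

print(mpe, map_d2)
# (False, True) False: the MPE contains D2 = d2, yet the MAP for D2 is ¬d2
```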

In (total and partial) abductive inference, in general, we are interested in obtaining the K most probable configurations and not just the best one.

3. Solving queries

In principle, we can answer all the queries presented in the previous section by "simply" generating the joint distribution, and then taking it as our starting point, summing out (in the case of evidence propagation), searching for the configuration with maximum probability (in the case of total abduction), or applying both of the previous operations (in the case of partial abduction). However, this approach is intractable even for networks with a small number of variables.

In recent years many algorithms have been proposed to solve the problem of evidence propagation by taking advantage of the conditional (in)dependencies among the variables given by the graphical structure. Nowadays, the most practical inference methods for Bayesian networks are those based on the clique tree algorithm (Jensen et al., 1990; Lauritzen and Spiegelhalter, 1988; Shenoy and Shafer, 1990). This class of propagation algorithms is based on the transformation (see Figure 16.2) of the Bayesian network into a secondary structure called a clique tree (or join/junction tree), in which the calculations are carried out. This method is based on the use of two operations, marginalization (addition) and combination (multiplication), and is divided into two phases: collectEvidence (messages are passed from the leaves to the root) and distributeEvidence (messages are passed from the root to the leaves). See Jensen (1996) and Shafer (1996) for details.

Although the propagation problem is NP-hard (Cooper, 1990) in the worst case, the clique tree algorithms work efficiently for moderately sized networks, with their efficiency being strongly related to the size of the clique tree obtained from the Bayesian network. For example, the same algorithm will perform better with the clique tree depicted in Figure 16.2(c) than with the clique tree depicted in Figure 16.2(b).



Figure 16.2 Two possible clique trees for the same network.

Dawid (1992) has shown that the MPE can be found using evidence propagation methods, but replacing summation with maximum in the marginalization operator (due to the distributive property of maximum with respect to multiplication). Therefore, the process of searching for the most probable explanation has the same complexity as probability propagation. However, searching for the K MPEs is a more complex problem, and to obtain the K MPEs more intricate methods have to be used (Nilsson, 1998; Seroussi and Golmard, 1994).

In partial abductive inference, the process of finding the configuration x*_E is more complex than that of finding x*_U, because not all clique trees obtained from the original BN are valid. In fact, because summation and maximum have to be used simultaneously and these operations do not commute with each other, the variables of X_E must form a sub-tree of the complete tree. The problem of finding a valid clique tree for a given explanation set X_E is studied in de Campos et al. (2000). From that study it can be concluded that the size of the clique tree obtained for partial abductive inference grows (in general) exponentially with the number of variables included in the explanation set, with the worst case being when X_E contains around half of the variables in the network. Beyond that point the size decreases, the size of the tree when X_E contains all the variables being the same as when X_E contains a single variable. Therefore, the computer resources (time and memory) needed can be so high that the problem becomes unsolvable by exact computation, even for medium-size networks.

4. Tackling the problem with Genetic Algorithms

As we have seen in Chapter 1, Genetic Algorithms (GAs) are now a popular technique for approaching difficult combinatorial problems. GAs have been previously used for solving NP-hard problems related to Bayesian networks, including: triangulation of graphs (Larrañaga et al., 1997), imprecise probabilities propagation (Cano and Moral, 1996), estimation of a causal ordering for the variables (de Campos and Huete, 2000; Larrañaga et al., 1996a), and learning (Larrañaga et al., 1996b). Given the success of these applications, the NP-hardness of the abductive inference problem, and the fact that abductive inference in BNs can be viewed as a combinatorial optimization problem, several authors have used GAs to solve (in an approximate way) both types of abductive inference problems: total (Gelsema, 1995; Rojas-Guzman and Kramer, 1996) and partial (de Campos et al., 1999).

In Rojas-Guzman and Kramer (1996) a chromosome is represented as a copy of the graph included in the BN, but in which every variable has been instantiated to one of its possible states. This representation makes it possible to implement the crossover operator as the interchange of a subgraph centered on a variable X_i, with X_i selected randomly for each crossover. In Gelsema's algorithm, a chromosome is a configuration of the unobserved variables (X_U = X \ X_O), i.e. a string of integers. In this case, crossover is implemented as the classical one-point crossover.

The common feature of both GAs for total abductive inference is the way in which the fitness of an individual is calculated. Although we want to maximize p(x_U | x_O), this expression is proportional to p(x_U, x_O), so we can use the latter instead as the fitness of the chromosome x_U. As (x_U, x_O) represents a complete instantiation of all the variables in the network, we can use the factorization of the joint distribution as expressed in Eq. (16.4) to calculate p(x_U, x_O). Therefore, the evaluation of a chromosome requires n multiplications:

p(X = x) = p(X_U = x_U, X_O = x_O) = ∏_{i=1}^{n} p(x_i | pa_i).    (16.4)
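Eq. (16.4) makes the evaluation of a complete configuration a product of n table look-ups. A sketch on a made-up two-node network A → B (the network and the CPT layout are our own illustration):

```python
from math import prod

# Each node: (name, parents, CPT); a CPT maps (parent_values, value) -> prob.
network = [
    ("A", (), {((), 0): 0.3, ((), 1): 0.7}),
    ("B", ("A",), {((0,), 0): 0.9, ((0,), 1): 0.1,
                   ((1,), 0): 0.2, ((1,), 1): 0.8}),
]

def fitness(assignment):
    """p(X = x) = prod_i p(x_i | pa_i), i.e. Eq. (16.4): n multiplications."""
    return prod(cpt[(tuple(assignment[p] for p in parents), assignment[name])]
                for name, parents, cpt in network)

print(fitness({"A": 1, "B": 0}))   # 0.7 * 0.2
```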

Dealing with partial abductive inference using GAs appears easier than dealing with total abductive inference, because the size of the search space in the partial case is considerably smaller than in the total case. However, this is not so, because of the increased complexity of the evaluation function. In fact, in the partial case, (x_E, x_O) does not represent a configuration of all the variables in the network, so Eq. (16.4) cannot be applied directly. The variables that are neither observed nor in the explanation set, X_R = X \ (X_E ∪ X_O), have to be removed by addition. Therefore, to evaluate an individual x_E using Eq. (16.4), we have to apply

p(x_E | x_O) ∝ p(x_E, x_O) = Σ_{x_R} p(x_E, x_O, x_R),    (16.5)

that is, Eq. (16.4) has to be applied |Ω_{X_R}| times, where Ω_{X_R} is the set of possible configurations of X_R. For example, if we have a network with 50 propositional binary variables, |X_E| = 15, |X_R| = 30 and |X_O| = 5, then Eq. (16.4) has to be applied 2^30 times. Clearly, this is computationally intractable given the large number of individuals evaluated during the execution of a GA.

For this reason, in de Campos et al. (1999) the fitness p(x_E, x_O) of a chromosome x_E is computed by using probability propagation over a clique tree. Below, we describe this evaluation function and some other details of a slightly modified version of the GA described in de Campos et al. (1999), which will be used for the experiments in this chapter.

4.1 A Genetic Algorithm for partial abductive inference

We briefly describe the representation, the evaluation function and the structure of the GA used in de Campos et al. (1999).

• Representation. In our algorithm, a chromosome will be a configuration of the variables in the explanation set, that is, a string of integers of length |X_E|.

• The evaluation function. The fitness of a chromosome x_E is computed by the process described below, where T = {C_1, ..., C_t} is a rooted clique tree, with root C_1.

1. Enter the evidence x_O in T.

2. Enter (as evidence) the configuration x_E in T.

3. Perform CollectEvidence from the root C_1 (i.e., an upward propagation).

4. p(x_E, x_O) is equal to the sum of the potential stored in the root C_1.

Therefore, to evaluate a configuration an exact propagation is carried out, or more precisely half a propagation, because only the upward phase is performed and not the downward one (see Jensen (1996) for details of clique tree propagation). Furthermore, for this propagation we can use a clique tree obtained without constraints, whose size is therefore much smaller than that of the clique tree used for exact partial abductive inference (de Campos et al., 2000). In addition, in de Campos et al. (1999) it is shown how the tree can be pruned (for a concrete explanation set) in order to avoid the repetition of unnecessary computations when a new chromosome is evaluated.

• Structure of the GA. The GA used in de Campos et al. (1999) is based on the modified GA (modGA) proposed by Michalewicz (1996). This GA falls into the category of preservative, generational and elitist selection, and enjoys theoretical properties similar to those of the classical GA. The main modification with respect to the classical GA is that in modGA we do not perform the classical selection step, but instead independently select r distinct chromosomes (usually those that fit best) from P(t) to be copied to P(t + 1). In de Campos et al. (1999) the parameters used are the following:

- Select the best 50% of the chromosomes from P(t) and copy them to P(t + 1). In this way we ensure population diversity and the premature convergence problem is avoided.

- 35% of the new population is obtained by crossover. One parent is selected from P(t) with a probability proportional to its rank, and the other in a random way. The crossover operator used is the classical two-point crossover, and the two children obtained are copied to P(t + 1).

- 15% of the new population is obtained by mutation. Mutation is carried out by selecting a chromosome from P(t) and modifying one of its components, then copying the resulting chromosome to P(t + 1). Thus, we apply genetic operators to whole individuals as opposed to individual bits (classical mutation). As Michalewicz (1996) points out, this provides a uniform treatment of all operators used in the GA. The parents for mutation are selected from P(t) with a probability proportional to their rank, except for the best chromosome, which is always selected as a parent (thus, the neighbourhood of the best chromosome is explored).

The percentages 50, 35 and 15 have been selected by experimentation. Notice that in P(t + 1) only half of the population is new, so only those chromosomes are candidates for evaluation in each generation. This fact is important in our problem because of the complexity of the evaluation function. In addition, a hash table is used to avoid the reevaluation of previously seen individuals. When a new chromosome is evaluated, it is tested for whether it must be included in Kbest, an array which contains the K best individuals obtained so far.
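One modGA generation with the 50/35/15 split can be sketched as follows. This is a loose illustration: the crossover and mutation helpers and the rank weighting below are simplified stand-ins, not the exact operators of de Campos et al. (1999):

```python
import random

def two_point_crossover(a, b):
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate_one_gene(chrom, n_states):
    k = random.randrange(len(chrom))
    return chrom[:k] + (random.randrange(n_states),) + chrom[k + 1:]

def modga_generation(pop, fitness, n_states):
    N = len(pop)
    ranked = sorted(pop, key=fitness, reverse=True)
    weights = list(range(N, 0, -1))              # rank-proportional weights
    new_pop = ranked[: N // 2]                   # copy the best 50%
    n_mutants = max(1, int(0.15 * N))
    while len(new_pop) < N - n_mutants:          # ~35% by crossover
        p1 = random.choices(ranked, weights)[0]  # rank-biased parent
        p2 = random.choice(ranked)               # random parent
        new_pop.extend(two_point_crossover(p1, p2))
    while len(new_pop) < N:                      # ~15% by mutation
        parent = random.choices(ranked, weights)[0]
        new_pop.append(mutate_one_gene(parent, n_states))
    return new_pop[:N]

random.seed(1)
pop = [tuple(random.randrange(3) for _ in range(5)) for _ in range(8)]
print(len(modga_generation(pop, sum, 3)))   # 8: population size is preserved
```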

5. Tackling the problem with Estimation of Distribution Algorithms

As described in Chapter 3, EDAs constitute a new approach for Evolutionary Computation, where the crossover and mutation operators have been replaced in each generation by the estimation of a probability distribution and its subsequent simulation.


As far as we know, this is the first time that partial abductive inference in Bayesian networks has been tackled by means of EDAs. The characteristics of the proposed approach are as follows:

• Representation. An individual in the EDAs is identical to a chromosome of the GA: a configuration of the explanation variables, i.e. the Bayesian network used to generate populations will have the set X_E as its variables.

• Evaluation function. This has been described in Section 4.1 of this chapter.

• Adaptation of EDAs in order to search for the K MAPs. Although we are aware that an ad hoc approach to the K MAPs problem would imply individuals of length K × |X_E|, in this case we have decided to adapt EDAs in a more general way, related to their meta-heuristic character. Thus, we impose in the simulation phase of the EDAs the constraint that the generated individuals must be different from the individuals previously simulated, both in the same generation and in previous generations. When this condition is not verified after 50 attempts, the repeated individual is added to the population. Once the simulation phase is finished, we select the N best individuals from the combined pool of the individuals generated in this generation and the individuals used to induce the probabilistic model in the previous generation. From the individuals selected in this manner, a new probabilistic model will be induced (see Section 2 in Chapter 3 for details of the general scheme of EDAs).

• Types of EDAs used. For the experiments we have selected three types of EDAs which present an increasing complexity in the factorization of the probability distribution of the selected individuals:

UMDA (Mühlenbein, 1998), without dependencies,

MIMIC (De Bonet et al., 1997), bivariate dependencies, and

EBNA (Etxeberria and Larrañaga, 1999), multiple dependencies.

More information about these algorithms can be found in Section 3 in Chapter 3 of this book.

6. Experimental evaluation

In order to perform the empirical comparison among the proposed algorithms, we have carried out five experiments. Three of these experiments have been carried out on the well-known Alarm network (Beinlich et al., 1989), and the other two on artificially generated Bayesian networks: random100 and random100e. The networks random100 and random100e were generated using the same procedure for their structures, but different procedures for their probability tables. In random100, the probabilities were generated using uniform random numbers. In random100e the process is more complex: two uniform random numbers, x and y, were generated, and the probabilities of the two values of a variable (marginals for root nodes and conditionals for the rest) are determined by normalizing x^5 and y^5, which gives rise to extreme probabilities. Table 16.1 gives some information about these networks, where min, max and mean refer to the size of the probability table attached to each node.
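The probability-table construction for random100e can be sketched for a two-state variable as follows (our own reconstruction of the procedure described above; the function name is an assumption):

```python
import random

def extreme_binary_probabilities(exponent=5):
    """Draw two uniform random numbers x and y and normalize x^5 and y^5,
    as described for random100e.  Raising to the 5th power before
    normalizing pushes the resulting pair towards extreme probabilities."""
    x, y = random.random(), random.random()
    x5, y5 = x ** exponent, y ** exponent
    return x5 / (x5 + y5), y5 / (x5 + y5)
```

With `exponent=1` the same routine would produce the uniform-style tables of random100.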

Table 16.1 Some characteristics of the networks used in the experiments.

Network      nodes   arcs   states    min   max   mean
Alarm          37     46    {2,3,4}    2    108   20.3
random100     100    122    2          2     32   5.88
random100e    100    128    2          2     64   6.54

Table 16.2 shows a brief description of each experiment. Column |X_E| gives the number of variables included in the explanation set, while column X_E shows the way in which these variables were selected for inclusion in the explanation set. In all the experiments, the variables to be included in the explanation set were selected in a pseudo-random way: several sets containing |X_E| variables were randomly generated, and the one most difficult to solve by exact computation was chosen. The difficulty of a problem was measured as a function of the time and space needed to solve it exactly. To solve the problems exactly we used software implemented in Java, running on an Intel Pentium III (600 MHz) with 384 MB of RAM, a Linux operating system, and the JDK 1.2 virtual machine. The time needed to solve experiments 1, 2 and 3 exactly was between one and one and a half hours, while solving a total abductive inference problem with this software takes less than 0.5 seconds. For experiments 4 and 5, we were not able to solve the problem exactly because of memory requirements, i.e. an "out of memory" error was obtained in both cases. This error is due to the enormous size of the clique trees obtained from these networks by means of a compilation constrained by the selected explanation sets. In these networks, total abductive inference takes less than 9 seconds.

In all the experiments five variables have been selected as evidence, and have been instantiated to their "a priori" least probable state. In the five experiments we have taken K = 50, that is, we look for the 50 MAPs.


Table 16.2 Description of the experiments.

#exp.   |X_E|   network       X_E              |Ω_{X_E}|
1        18     Alarm         pseudo-random      143,327,232
2        19     Alarm         pseudo-random      214,990,848
3        20     Alarm         pseudo-random      382,205,952
4        30     random100     pseudo-random    1,073,741,824
5        30     random100e    pseudo-random    1,073,741,824

The data we have collected during execution of the algorithms relates to the probability mass of the K MAPs found. Thus, mass1, mass10, mass25 and mass50 represent the probability mass of the first 1, 10, 25 and 50 MAPs found by the exact algorithm, while mass1', mass10', mass25' and mass50' represent the probability mass of the first 1, 10, 25 and 50 MAPs found by the proposed algorithms. For experiments 1, 2 and 3, we present the percentage of probability mass obtained with respect to the exact algorithm (%massX' = (massX'/massX) · 100). For experiments 4 and 5, because of the absence of exact results, we present massX' directly. In order to test the anytime behaviour of the algorithms, results are presented for (approximately) every 500 different evaluated individuals. Finally, all the algorithms have been run 50 times, so all the results are averages.

The experimentation consisted of applying the four algorithms presented in this chapter (UMDA, MIMIC, EBNA and GA) to the five examples described. In each experiment we considered 8 different population sizes (50, 100, 150, 200, 250, 300, 400 and 500), but due to space limitations we present for each algorithm only the results for one selected size (that for which the best results, on average, were obtained). In all cases, the initial population was generated randomly, and the algorithm stops when the number of different evaluated individuals is greater than 5000. Note that the algorithm cannot stop when exactly 5000 individuals have been evaluated, because the current generation has to be finished. The same applies to the intermediate points selected to study the anytime behaviour (500, 1000, ...). Tables 16.3 to 16.7 show the results obtained for (%)mass1', (%)mass10', (%)mass25' and (%)mass50' in each experiment, where the entries are interpreted as average ± standard deviation. Figures 16.3 to 16.7 show the anytime behaviour of the four algorithms with respect to (%)mass1'.


[Figure: %mass1' vs. number of different evaluated individuals, for UMDA, MIMIC, EBNA and GA.]
Figure 16.3 A plot of %mass1' for experiment 1.

[Figure: %mass1' vs. number of different evaluated individuals, for UMDA, MIMIC, EBNA and GA.]
Figure 16.4 A plot of %mass1' for experiment 2.


[Figure: %mass1' vs. number of different evaluated individuals, for UMDA, MIMIC, EBNA and GA.]
Figure 16.5 A plot of %mass1' for experiment 3.

[Figure: mass1' vs. number of different evaluated individuals, for UMDA, MIMIC, EBNA and GA.]
Figure 16.6 A plot of mass1' for experiment 4.


Table 16.3 Results for experiment 1. Population size was 300 for UMDA, 500 for MIMIC, 250 for EBNA and 100 for GA.

         %mass1'          %mass10'         %mass25'         %mass50'
UMDA     80.77 ± 10.32    75.16 ± 12.25    71.07 ± 12.52    68.25 ± 12.43
MIMIC    85.21 ± 12.20    80.43 ± 15.22    76.66 ± 17.07    74.19 ± 17.91
EBNA     95.56 ± 9.57     93.62 ± 12.74    92.44 ± 14.95    91.75 ± 16.21
GA       97.04 ± 0.81     89.63 ± 1.38     85.16 ± 1.88     83.19 ± 2.09

Table 16.4 Results for experiment 2. Population size was 400 for UMDA, 400 for MIMIC, 200 for EBNA and 200 for GA.

         %mass1'          %mass10'         %mass25'         %mass50'
UMDA     100.00 ± 0.00    100.00 ± 0.00    100.00 ± 0.00    98.95 ± 0.25
MIMIC    100.00 ± 0.00    100.00 ± 0.00    100.00 ± 0.00    98.99 ± 0.28
EBNA     100.00 ± 0.00    100.00 ± 0.00    100.00 ± 0.00    99.62 ± 0.46
GA       100.00 ± 0.00    100.00 ± 0.00    100.00 ± 0.00    99.38 ± 0.05

Table 16.5 Results for experiment 3. Population size was 500 for UMDA, 500 for MIMIC, 500 for EBNA and 300 for GA.

         %mass1'          %mass10'         %mass25'         %mass50'
UMDA     88.07 ± 29.04    81.32 ± 34.75    77.62 ± 35.81    74.54 ± 36.35
MIMIC    87.14 ± 30.28    81.06 ± 34.01    76.67 ± 35.49    73.26 ± 36.02
EBNA     97.76 ± 12.62    96.79 ± 13.12    95.19 ± 14.86    93.80 ± 16.98
GA       99.37 ± 0.21     97.36 ± 0.47     93.00 ± 1.5      87.55 ± 5.33

6.1 Experimental conclusions

As we can see from Tables 16.3 to 16.7, the results obtained by the four algorithms are similar in experiments 2, 4 and 5, while significant differences can be observed in the other two experiments. As an explanation of this fact, we conjecture that the problems considered in experiments 2, 4 and 5 give rise to less complex search spaces than those generated by the problems considered


Table 16.6 Results for experiment 4. Population size was 100 for UMDA, 100 for MIMIC, 100 for EBNA and 100 for GA.

         mass1'          mass10'         mass25'         mass50'
UMDA     0.000010 ± 0    0.000077 ± 0    0.000167 ± 0    0.000295 ± 0
MIMIC    0.000010 ± 0    0.000077 ± 0    0.000167 ± 0    0.000295 ± 0
EBNA     0.000010 ± 0    0.000077 ± 0    0.000167 ± 0    0.000295 ± 0
GA       0.000010 ± 0    0.000076 ± 0    0.000166 ± 0    0.000297 ± 0

Table 16.7 Results for experiment 5. Population size was 500 for UMDA, 500 for MIMIC, 300 for EBNA and 200 for GA.

         mass1'          mass10'         mass25'             mass50'
UMDA     0.014197 ± 0    0.090864 ± 0    0.164966 ± 0.002    0.237114 ± 0.006
MIMIC    0.014197 ± 0    0.090848 ± 0    0.163928 ± 0.002    0.232105 ± 0.007
EBNA     0.014197 ± 0    0.091064 ± 0    0.168794 ± 0.003    0.247711 ± 0.008
GA       0.014197 ± 0    0.091073 ± 0    0.168202 ± 0.003    0.244589 ± 0.008

[Figure: mass1' vs. number of different evaluated individuals, for UMDA, MIMIC, EBNA and GA.]
Figure 16.7 A plot of mass1' for experiment 5.

in experiments 1 and 3. From analysis of Tables 16.3 and 16.5 (experiments 1 and 3) we obtain the following conclusions:


• Among the three EDAs used in the experimental evaluation, EBNA clearly outperforms UMDA and MIMIC. This is not surprising, because EBNA can deal with unconstrained probability distributions, while UMDA and MIMIC have to deal, in general, with approximations.

• The performance of the GA is superior to that of UMDA and MIMIC in all cases.

• The comparison between the GA and EBNA needs to be more detailed. With respect to %mass1', the GA seems to behave better than EBNA, although it is outperformed by EBNA in the search for the K most probable explanations.

• It is interesting to analyze the variability shown by the four algorithms. Thus, from the point of view of the standard deviations we can clearly establish the following pattern: MIMIC> UMDA > EBNA » GA.

• As mentioned before, we have experimented with eight different population sizes, and from the results obtained (not shown here due to space limitations) we observe that the behaviour of the EDA approach with respect to changes in the population size is more robust than that shown by the GA.

With respect to the anytime behaviour of the proposed algorithms (see Figures 16.3 to 16.7), it is clear that the number of evaluations required by EDAs in order to obtain good solutions is smaller than the number required by the GA, which shows a slower convergence.

7. Concluding remarks

In this chapter we have studied the problem of partial abductive inference in Bayesian networks. The problem has been approached using a previously known GA (de Campos et al., 1999) and three different algorithms (UMDA, MIMIC and EBNA) based on the novel approach of EDAs.

From the empirical comparison carried out we can conclude that UMDA and MIMIC are clearly outperformed by the GA and EBNA, while the differences between the GA and EBNA are small and depend on the criterion being considered (searching for the best explanation or for the K best). In any case, given the results obtained, both algorithms (GA and EBNA) constitute a good choice for approaching the problem considered here.

Regarding future work, as it seems that GAs and EDAs outperform each other with respect to different criteria ((%)mass(1', 10', 25', 50'), standard deviation, convergence speed), we plan to experiment with the hybridization of both types of algorithm in order to ascertain whether the joint approach improves on these individual approaches. Furthermore, we plan to perform a deeper study of the adequate population size for each algorithm. Another interesting starting point for future work could be to take advantage of the initially known structure of the Bayesian network in order to constrain the graphical model to be learnt during the search.

Acknowledgments

This work has been supported by the Spanish Comisión Interministerial de Ciencia y Tecnología (CICYT) under Projects TIC97-1135-C04-01 and TIC97-1135-C04-03.

Notes

1. The size of a clique tree is the sum of the sizes associated with each of its cliques. The size of a clique is the product of the number of different states that each variable within the clique can take.

References

Beinlich, I.A., Suermondt, H.J., Chavez, R.M., and Cooper, G.F. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the Second European Conference on Artificial Intelligence in Medicine, pages 247-256. Springer-Verlag.

Cano, A. and Moral, S. (1996). A genetic algorithm to approximate convex sets of probabilities. In Proceedings of the 6th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'96), pages 847-852.

Cooper, G.F. (1990). Probabilistic inference using belief networks is NP-hard. Artificial Intelligence, 42(2-3):393-405.

Dawid, A.P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2:25-36.

De Bonet, J.S., Isbell, C.L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, Vol. 9.

de Campos, L.M., Gámez, J.A., and Moral, S. (1999). Partial abductive inference in Bayesian belief networks using a genetic algorithm. Pattern Recognition Letters, 20(11-13):1211-1217.

de Campos, L.M., Gámez, J.A., and Moral, S. (2000). On the problem of performing exact partial abductive inference in Bayesian belief networks using junction trees. In Proceedings of the 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'00), pages 1270-1277.

de Campos, L.M. and Huete, J.F. (2000). Approximating causal orderings for Bayesian networks using genetic algorithms and simulated annealing. In Proceedings of the 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'00), pages 333-340.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence, CIMAF99, Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Gelsema, E.S. (1995). Abductive reasoning in Bayesian belief networks using a genetic algorithm. Pattern Recognition Letters, 16:865-871.

Jensen, F.V. (1996). An Introduction to Bayesian Networks. UCL Press.

Jensen, F.V., Lauritzen, S.L., and Olesen, K.G. (1990). Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly, 4:269-282.

Larrañaga, P., Kuijpers, C., Murga, R., and Yurramendi, Y. (1996a). Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 26(4):487-493.

Larrañaga, P., Kuijpers, C., Poza, M., and Murga, R. (1997). Decomposing Bayesian networks: triangulation of the moral graph with genetic algorithms. Statistics and Computing, 7:19-34.

Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R., and Kuijpers, C. (1996b). Structure learning of Bayesian networks by genetic algorithms. A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912-926.

Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157-224.

Michalewicz, Z. (1996). Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag.

Mühlenbein, H. (1998). The equation for response to selection and its use for prediction. Evolutionary Computation, 5:303-346.

Neapolitan, R.E. (1990). Probabilistic Reasoning in Expert Systems: Theory and Algorithms. Wiley Interscience.

Nilsson, D. (1998). An efficient algorithm for finding the M most probable configurations in Bayesian networks. Statistics and Computing, 2:159-173.

Pearl, J. (1987). Distributed revision of composite beliefs. Artificial Intelligence, 33:173-215.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Rojas-Guzman, C. and Kramer, M.A. (1996). An evolutionary computing approach to probabilistic reasoning in Bayesian networks. Evolutionary Computation, 4:57-85.


Seroussi, B. and Golmard, J.L. (1994). An algorithm directly finding the K most probable configurations in Bayesian networks. International Journal of Approximate Reasoning, 11:205-233.

Shafer, G.R. (1996). Probabilistic Expert Systems. Society for Industrial and Applied Mathematics (SIAM).

Shenoy, P.P. and Shafer, G.R. (1990). Axioms for probability and belief-function propagation. In Shachter, R., Levitt, T., Kanal, L., and Lemmer, J., editors, Uncertainty in Artificial Intelligence, 4, 169-198. Elsevier Science Publishers B.V. (North-Holland).


Chapter 17

An Empirical Comparison Between K-Means, GAs and EDAs in Partitional Clustering

J. Roure
Department of Computer Science and Management

Escola Universitaria Politecnica de Mataro

[email protected]

P. Larrañaga
Department of Computer Science and Artificial Intelligence

University of the Basque Country

[email protected]

R. Sangüesa
Department of Computer Languages and Systems

Technical University of Catalunya

[email protected]

Abstract  In this chapter we empirically compare the performance of three different approaches to partitional clustering: iterative hill-climbing algorithms, Genetic Algorithms and Estimation of Distribution Algorithms. Emphasis is placed on their ability to avoid local optima and also on the simplicity of setting good parameters for them.

Keywords: partitional clustering, K-Means, Evolutionary Algorithms, Genetic Algorithms, Estimation of Distribution Algorithms, empirical comparison

1. Introduction

Clustering is an important technique in the field of exploratory data analysis.

Huge databases now exist which are completely useless if they are not analysed and transformed into knowledge. Data mining is a growing field that attempts to analyse these huge databases (Fayyad et al., 1996). In data mining, where there is no knowledge of the data's a priori distribution, clustering plays an important role in simplifying the data and in discovering inherent structure present in the databases.

Clustering algorithms can be categorised by the structure that they create into either hierarchical or partitional (Jain and Dubes, 1988). Hierarchical clustering creates a sequence of partitions where each partition is nested inside the next in the sequence. Hierarchical algorithms can be divided into agglomerative and divisive. Agglomerative algorithms first place each object into a cluster of its own and gradually merge these clusters into larger ones until a cluster containing all the objects is obtained. Conversely, divisive algorithms begin with all objects in the same cluster and gradually split clusters into smaller ones. A nice review of hierarchical clustering can be found in Gordon (1987).

Partitional clustering aims to obtain a single partition in order to recover the natural groups present in the data. Objects that are found to be similar are placed in the same group, and dissimilar objects are placed in different groups. When data are described by means of symbolic attributes, clustering is usually called conceptual clustering (Fisher, 1987; Michalski and Stepp, 1983). Fisher et al. (1992a) divide conceptual clustering into two steps: in the first step, clustering creates an object partition; in the second step, a characterisation is obtained for each cluster. There is also an approach to partitional clustering based on probability (Banfield and Raftery, 1993; Duda and Hart, 1973; Hanson et al., 1990). In this approach, it is assumed that the dataset comes from a probability distribution, and the clustering algorithm tries to find the parameters of the probability distribution function.

Partitional algorithms usually need the number of clusters in the partition as a parameter. There are several different approaches to obtaining the optimal number of clusters. These include methods based on entropy and statistical complexity criteria (Celeux and Soromenho, 1996; Bozdogan, 1994), statistical tests (Hardy, 1994; Rasson and Kubushishi, 1993; Gordon, 1987) and Genetic Algorithms (Lozano et al., 1998).

In this chapter, we focus on partitional clustering, with emphasis on the efficient representation and compression of large databases. The chapter is organised as follows. Section 2 formally defines partitional clustering. Section 3 introduces iterative partitional algorithms as a hill-climbing strategy for finding reasonably good partitions. Section 4 explains the Genetic Algorithm approach to clustering and describes different partition encodings and some heuristics for improving their performance. In Section 5, Estimation of Distribution Algorithms are used for the first time on the clustering problem. In Section 6, several experiments are conducted in order to empirically compare the performance of iterative algorithms, Genetic Algorithms (GAs) and Estimation of Distribution Algorithms (EDAs).

2. Partitional clustering

We can formally define partitional cluster analysis as a multivariate statistical procedure that, given a set of objects described by means of attributes, partitions them into k groups such that objects within the same group are more similar to each other than objects in different groups. Specifically, clustering attempts to minimise the function

F(W, C) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} D(v_j, c_i)          (17.1)

where

• V = {v_1, v_2, ..., v_n} is a set of n objects, where each object is a vector of d dimensions.

• C = {c_1, c_2, ..., c_k} is a set of k cluster centroids, where each centroid is represented as a vector of d dimensions.

• W = {u_{ij}} is a k × n association matrix, where \sum_{i=1}^{k} u_{ij} = 1 and u_{ij} \in {0, 1}.

• D(v, c) = \sqrt{\sum_{l=1}^{d} (v(l) - c(l))^2} is a dissimilarity or distance function.
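Under these definitions, the objective of Eq. 17.1 can be computed directly (a small illustrative sketch; the function and argument names are ours, and the `assignment` list, with assignment[j] = i encoding u_ij = 1, stands in for the association matrix W):

```python
import math

def clustering_objective(objects, centroids, assignment):
    """F(W, C) of Eq. 17.1: the sum over all objects of the Euclidean
    distance D(v_j, c_i) to the centroid of the cluster each object is
    assigned to."""
    def D(v, c):
        return math.sqrt(sum((vl - cl) ** 2 for vl, cl in zip(v, c)))
    return sum(D(v, centroids[assignment[j]]) for j, v in enumerate(objects))
```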

Once we have defined partitional clustering as an optimisation problem, we need an efficient algorithm for searching the space of all possible classifications to find one that minimises the function to be optimised. The number of different partitions of n objects into k clusters is given by the Stirling numbers of the second kind, S(n, k), which grow exponentially with n and k:

S(n, k) = \frac{1}{k!} \sum_{j=1}^{k} (-1)^{k-j} \binom{k}{j} j^n.          (17.2)

As an example, if we have 100 objects and we want to partition them into 5 clusters, then there are S(100, 5) ≈ 10^68 different partitions. Hence, searching for an optimal solution by exhaustive enumeration is not feasible. For this reason, algorithms must look for a good solution only among a reduced subset of reasonable partitions. In order to explore only promising parts of the partitional clustering space, some search heuristics are used.
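Eq. 17.2 can be evaluated exactly with arbitrary-precision integers, which confirms the order of magnitude quoted above (an illustrative sketch, assuming Python 3.8+ for `math.comb`):

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind via the explicit sum of
    Eq. 17.2: the number of partitions of n objects into k non-empty
    clusters."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(1, k + 1))
    return total // factorial(k)  # the sum is always divisible by k!
```

Evaluating `stirling2(100, 5)` yields a 68-digit number, in line with the estimate in the text.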

We use a greedy search strategy (iterative algorithms) in Section 3, and Evolutionary Algorithms (GAs and EDAs) in Sections 4 and 5.

3. Iterative algorithms

These algorithms use a hill-climbing search strategy in order to explore a reduced set of reasonably good partitions. From the very beginning they work with one complete and feasible partition, which they try to improve iteratively. Thus, whenever they stop they are able to output a partition of the dataset.

Iterative algorithms are well known and widely used in the literature. Probably the most used is the K-Means algorithm (Forgy, 1965; MacQueen, 1967). Figure 17.1 shows the pseudocode for K-Means. One of the parameters of this algorithm is the number of clusters into which the dataset must be divided. First, it builds the k clusters with the first k objects as centroids. Next, it assigns each of the following objects to the closest cluster. Once an object has been assigned to a cluster, the new cluster centroid is computed. In each iteration, those objects found not to be in the closest cluster are moved. The algorithm stops either when the membership stabilises or when the limit of iterations is reached.

K-Means(k clusters, i iterations)
{
    Take the first k objects as initial cluster centroids
    Repeat
        For each object
            Assign the object to the closest cluster
            Compute the new cluster centroid
    Until membership stabilises or number of iterations > i
}

Figure 17.1 K-Means algorithm.

Despite being used in a wide range of applications, the K-Means algorithm is not free of drawbacks. The most important are listed below:

• The K-Means algorithm is very sensitive to the initial partition (Bezdek et al., 1994; Peña et al., 1999) and also to the order in which the objects are presented to the algorithm (Fisher et al., 1992b; Langley, 1995).

• The number of clusters must be given to the algorithm. This is usually a problem because the number of clusters is not known beforehand.

• The K-Means algorithm tends to get stuck in local optima. This is due to the hill-climbing strategy (Babu and Murty, 1994), the huge search space of the clustering problem (Lucasius et al., 1993) and also to the nature of the objective functions, i.e. they are not convex and they are highly non-linear (Babu and Murty, 1994).

The sensitivity to the initial partition and to the ordering of objects can be summarised into a single problem in our implementation of K-Means. The clusters are initialised with the first k objects. Thus, if the first two objects belong to the same actual cluster, the algorithm begins with an ill-formed structure from which it is very difficult to recover. Some heuristics have been proposed (Fisher et al., 1992b; Roure and Talavera, 1998) in order to reduce the effects of object ordering. Other approaches look for a more appropriate object order before launching the K-Means algorithm (Fisher et al., 1992b; Peña et al., 1999).

In order to overcome the problem of predicting the actual number of clusters, several approaches can be used. Some algorithms make use of split and merge operators in order to explore partitions with a different number of clusters. A well-known algorithm of this type is ISODATA (Ball and Hall, 1967). Other approaches (Lozano et al., 1998) use GAs to search for the actual number of clusters.

4. Genetic Algorithms in partitional clustering

Due to the tendency of hill-climbing strategies to get stuck in local optima, discussed in the previous section, evolutionary algorithms such as Evolutionary Programming (Sarkar et al., 1997), Evolution Strategies (Babu and Murty, 1994) and GAs (Alippi and Cucchiara, 1992; Bezdek et al., 1994; Jones and Beltramo, 1990; Maulik and Bandyopadhyay, 2000; Lozano, 1998) have often been applied to clustering problems. The robustness of these approaches rests on the fact that they do not consider a single candidate solution. Instead, they use a collection of candidate solutions which is modified in each iteration. In this way, the algorithm does not perform a single search but multiple searches in each run, and thus the probability of getting stuck in local optima is greatly reduced. Evolutionary algorithms can be described as a form of incremental beam search (Langley, 1998).

GAs are a family of stochastic search algorithms based loosely upon principles of natural evolution. GAs usually start by building a random population of individuals, each one representing a possible solution to the problem, that is, a partition in the clustering problem. From this point, two steps are iteratively applied to the population in order to make it evolve. The first step consists of selecting a subset of population individuals according to their performance value. During the second step, "genetic" operators are applied to the selected subset of individuals; typically, these are the crossover and mutation operators. In the simplest case, crossover consists of mixing two individuals, split at the same point and then crosswise reassembled. Mutation alters each "gene" of an individual with a small probability. In this way a collection of offspring is generated and a new population is obtained. This process is repeated iteratively until some stopping criterion is reached, e.g. a maximum number of generations or a population fitness criterion.

The main drawback of GAs is the difficulty of choosing good parameters for a given problem. The performance of GAs is very dependent on the encoding selected, on the mutation and crossover operators, and on the probability with which they are applied in each generation. All these parameters form a really huge configuration space. Besides, there is a lack of a methodology for exploring this configuration space (Lucasius et al., 1993). Thus, setting the parameters in a way that guides the GA search to a reasonable solution can be considered more an art than an engineering process.

4.1 Encoding of clusters

Different encodings of clusters can be used. Here we report those most commonly found in the literature. Encodings can greatly influence the efficiency and robustness of the evolutionary algorithms (Jones and Beltramo, 1991; Pelikan and Goldberg, 2000). As we will see, there are encodings where two different individuals can actually represent the same clustering. If this happens too often and if it is too difficult to identify such cases, many different individuals of the population may represent the same actual solution. Therefore, the presumed multi-directional search becomes a single, or few, directional search in the early stages of the execution, when most of the search space is still to be explored. Such a search would correspond to a hill-climbing strategy that could easily fall into local optima.

Encodings may be divided into two classes, namely the binary string representation and the ordinal representation (Bhuyan et al., 1991; Lozano, 1998). In the rest of the section we assume that we are given a partition with n objects and k clusters.

4.1.1 Binary string representation. This kind of representation uses a binary alphabet to encode the solutions. Binary representations are used in the GA approach because they make it possible to directly apply the traditional crossover and mutation operators without further modification (Bhuyan et al., 1991). The main drawback of this sort of encoding is that encoding and decoding a string is usually difficult.

Graphical. This representation uses a graph where each node represents an object. Two nodes are connected in the graph if the objects they represent belong to the same cluster. Thus, an ordered string of length n(n-1)/2 represents a partition. The value of each position represents the presence (1) or absence (0) of the corresponding edge in the graph. The problem with this encoding is that there are so many strings that do not represent legal partitions that verifying the correctness of newly generated strings is rather expensive. On the other hand, note that two different strings always represent different solutions. This representation was proposed in Bhuyan et al. (1991) and Jones and Beltramo (1990).
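The graphical encoding and its legality condition can be made concrete with a short sketch. The function names are our own; the legality check simply tests that co-membership is transitive, which is what distinguishes bit strings that encode genuine partitions from illegal ones.

```python
def graph_encode(membership):
    """Encode a partition (membership[i] = cluster label of object i)
    as the ordered string of n(n-1)/2 bits: the bit for pair (i, j)
    is 1 iff objects i and j share a cluster."""
    n = len(membership)
    return [int(membership[i] == membership[j])
            for i in range(n) for j in range(i + 1, n)]

def is_legal_graph(bits, n):
    """A bit string encodes a legal partition only if co-membership is
    transitive: i~j and j~l must imply i~l."""
    idx, p = {}, 0
    for i in range(n):
        for j in range(i + 1, n):
            idx[(i, j)] = p
            p += 1
    same = lambda i, j: bits[idx[(min(i, j), max(i, j))]] == 1
    return all(not (same(i, j) and same(j, l)) or same(i, l)
               for i in range(n) for j in range(n) for l in range(n)
               if len({i, j, l}) == 3)
```

For n = 6 objects the string has 15 positions; randomly generated strings usually violate transitivity, which illustrates why correctness checking is expensive here.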


Comparing K-Means, GAs and EDAs in Partitional Clustering 349

Cluster filiation. Here, an individual is a string composed of k words of length n. The ith bit of the jth word being set to 1 means that the ith object belongs to the jth cluster. A given individual is correct if every word has at least one bit set to one and if, once the ith bit of a word is set to one, it is set to zero in all other words. Note also that the encoding is redundant: two different strings may represent the same solution, for example by switching two words. This representation was proposed in Alippi and Cucchiara (1992) and Cucchiara (1993).
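The two correctness conditions just stated translate directly into code. This is an illustrative check of our own (the function name is an assumption): a filiation individual is legal iff no word is empty and every column of the k-by-n bit matrix sums to one.

```python
def filiation_legal(words):
    """Check a cluster-filiation individual: `words` is a list of k
    equal-length bit lists.  Legal iff every word has at least one 1
    (no empty cluster) and every object belongs to exactly one cluster
    (each column sums to 1)."""
    n = len(words[0])
    no_empty = all(any(w) for w in words)
    one_cluster_each = all(sum(w[i] for w in words) == 1 for i in range(n))
    return no_empty and one_cluster_each
```

Note that swapping two whole words passes the check while describing the same partition, which is exactly the redundancy mentioned above.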

Object membership. An individual is a string composed of n words of length ⌊log2(k)⌋ + 1. The ith word is the binary code of the cluster the ith object belongs to. A given individual is correct if all binary numbers corresponding to {1, 2, ..., k} are present in the individual at least once. Here again two different strings may represent the same solution, for example by switching the codes of two clusters. This representation was proposed in Alippi and Cucchiara (1992).

4.1.2 Ordinal representation. This sort of encoding does not restrict the alphabet to binary values. Ordinal encodings are much more compact, much simpler and much closer to the problem than binary ones. By using ordinal representations we avoid the conversion from binary to decimal values and vice versa. Usually, however, the traditional genetic operators must be modified in order to work with this sort of representation.

Order 1. An individual is a permutation of the numbers {1, 2, ..., n}, where, for instance, having the values (3, 4, 5) in this order means that object 3 is more similar to 4 than to 5, and therefore objects 3 and 5 will belong to the same cluster only if object 4 does too. As we see, this encoding gives information about the similarity between objects rather than about the partition itself. A dynamic programming algorithm is used to find the best clustering using the object ordering. Note that two different object orders may lead the clustering algorithm to the same partition; hence, here we also find redundancy among individuals. This encoding was proposed in Bhuyan et al. (1991).

Order 2. Following the same idea, the clustering problem can be transformed as follows: instead of looking for a good partition directly, we look for a good object order to be presented to an iterative clustering algorithm like K-Means. Note again that iterative algorithms are very sensitive to object orders, and thus two different orders may lead the algorithm to the same partition. This encoding was proposed in Lozano (1998).


Partition. An individual is a permutation of the numbers {1, 2, ..., n, n+1, ..., n+k-1}, where the first n numbers represent the objects of the dataset while the last k-1 numbers serve as cluster separator symbols. In this representation a permutation where two separator symbols are adjacent is illegal, as it would encode a partition with an empty cluster. Note that different permutations may represent the same partition, for example by switching numbers within the same cluster. This representation was proposed in Jones and Beltramo (1991).
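A short decoder makes the separator convention and the illegality rule explicit. This is an illustrative sketch (the function name and the choice of returning `None` for illegal permutations are our own assumptions):

```python
def decode_partition_perm(perm, n):
    """Decode a permutation of {1, ..., n+k-1} (values greater than n
    act as cluster separators) into a list of clusters.  Returns None
    for illegal individuals, i.e. those encoding an empty cluster."""
    clusters, current = [], []
    for v in perm:
        if v > n:                 # separator symbol
            if not current:
                return None       # adjacent/leading separator: empty cluster
            clusters.append(current)
            current = []
        else:
            current.append(v)
    if not current:
        return None               # trailing separator: empty cluster
    clusters.append(current)
    return clusters
```

Decoding two permutations that differ only in the order of objects within a cluster yields the same partition, illustrating the redundancy noted above.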

Object membership. This representation is very similar to the one proposed in Alippi and Cucchiara (1992). An individual is a string of n natural numbers where the ith position holds the code of the cluster the ith object belongs to. Note that it has the same redundancy problem as the binary object membership encoding. This representation was proposed in Bhuyan et al. (1991) and Jones and Beltramo (1990).

4.2 Heuristics

It is known that GAs usually perform well for a wide range of problems, and several heuristics have been proposed in order to improve their performance further. These heuristics incorporate specific knowledge about the problem, which helps to improve the results obtained and, mainly, to reduce the number of generations the algorithms need to converge. The heuristics can be categorised as follows:

Initial population. This sort of heuristic does not generate the first population at random. Instead, it initialises the population using specific problem knowledge. For instance, the heuristic could generate the initial population by running the K-Means algorithm with different object orderings. In this way, the first population is fitter than one generated at random, and the search starts from points of the solution space closer to good solutions (Bezdek et al., 1994; Bhuyan et al., 1991).

Operators. The mutation and crossover operators are modified in order to incorporate specific problem information into them (Luchian et al., 1994; Raghavan and Birchard, 1979; Sarkar et al., 1997). In doing so, the operators transform the individuals in a way that is known to improve their fitness with respect to the evaluation function. The operators may also be modified so as not to produce illegal individuals.

Hybrid. This heuristic combines the GA's beam search with a hill-climbing search. In each generation a hill-climbing algorithm is applied one or more times to all population individuals. For example, the K-Means algorithm could be executed (Bhuyan et al., 1991; Jones and Beltramo, 1990; Sarkar et al.,


1997). In this way, if an individual survives g generations it goes through at least g iterations of the hill-climbing search.

Early stopping. This heuristic does not allow the algorithm to run until it converges to an optimal solution. Instead, it stops when the convergence rate becomes very slow, and then executes the corresponding algorithm in order to reach the nearby local minimum (Babu and Murty, 1994).

These heuristics reduce the number of generations needed for convergence considerably. However, they may also dramatically bias the search towards local minima.

5. Estimation of Distribution Algorithms in partitional clustering

EDAs (Mühlenbein and Paaß, 1996) are, like GAs, evolutionary algorithms that use a collection of candidate solutions in order to perform a beam search and avoid local minima. However, EDAs use the estimation and simulation of the joint probability distribution as the evolutionary mechanism, instead of directly manipulating the individuals.

EDAs start by building a random population of individuals, where each individual is a candidate partition. Then, three steps are iteratively applied to the population. The first consists of selecting a subset of the best individuals. During the second step, a model of the joint probability distribution is learnt from the previously selected individuals. In the third step, new individuals are generated by simulating the distribution model. In this way, the population performance is iteratively improved from one generation to the next. The algorithm stops either when a given number of generations is reached or when the overall population performance (the sum of all individual performances) does not improve with respect to the previous generation.
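The three steps above can be sketched with the simplest possible model, a univariate marginal per string position (in the spirit of the BSC model used later in the chapter; the actual BSC, MIMIC, TREE and EBNA models differ in how the joint distribution is factorised). Function and parameter names are our own assumptions.

```python
import random

def univariate_eda(fitness, n, k, pop_size=60, generations=40, seed=0):
    """EDA sketch with a univariate model: each position's marginal
    distribution over {1..k} is estimated from the selected half and
    then sampled to create the next population.  `fitness` is minimised."""
    rng = random.Random(seed)
    pop = [[rng.randint(1, k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        selected = pop[:pop_size // 2]          # step 1: selection
        probs = []                              # step 2: learn the model
        for i in range(n):
            counts = [1] * k                    # Laplace smoothing
            for ind in selected:
                counts[ind[i] - 1] += 1
            total = sum(counts)
            probs.append([c / total for c in counts])
        # Step 3: simulate the model to obtain new individuals.
        pop = [[rng.choices(range(1, k + 1), weights=probs[i])[0]
                for i in range(n)] for _ in range(pop_size)]
    return min(pop, key=fitness)
```

Replacing the univariate model with a chain (MIMIC), a tree, or a Bayesian network (EBNA) changes only step 2 and step 3; the surrounding loop is identical.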

In the rest of this section we discuss the EDA approach to partitional clustering. We must say that, as far as we know, this is the first time EDAs have been used in partitional clustering.

In order to use EDAs in partitional clustering, we must first choose a representation. We will use an ordinal object membership representation. As we saw in Section 4.1.2, if we have a partition of n objects into k clusters, we encode it by a string of length n where each position can take one of the values {1, 2, ..., k}. In the string, the ith position represents the cluster to which the ith object belongs. For instance, if we have the objects {A, B, C, D, E, F} partitioned as follows: {A}, {B, E}, {C, D, F}, we could code it as (1, 2, 3, 3, 2, 3).

We chose this representation because encoding a partition is very simple, and thus no additional computation is required. However, we must note that there is some redundancy, that is, different strings could represent the same


partition. In our example, the partition may also be encoded as (2, 1, 3, 3, 1, 3). In fact, for this representation, there are k! different strings encoding the same partition.
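The k!-fold redundancy can be detected by relabelling clusters in order of first appearance, which maps every encoding of a partition to one canonical string. This is an illustrative helper of our own, not something the chapter's algorithms use.

```python
def canonical(partition_string):
    """Relabel clusters by order of first appearance, so that all k!
    encodings of the same partition map to a single canonical string."""
    relabel, out = {}, []
    for c in partition_string:
        if c not in relabel:
            relabel[c] = len(relabel) + 1   # next unused label
        out.append(relabel[c])
    return out
```

Two strings encode the same partition exactly when their canonical forms coincide.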

In order to model the joint probability distribution, we assume that each string position is a random variable. Thus, there is a variable Xi, i ∈ {1, ..., n}, for each object of the partition, where Xi ∈ {1, 2, ..., k}. There may also be illegal individuals, that is, individuals encoding partitions with empty clusters. A string represents a legal partition if each of the k values appears at least once in the string. Thus, when simulating the joint probability distribution we must ensure that at least one variable is set to each of the k values. In order to obtain legal individuals we followed Bengoetxea et al. (2000). Different EDA approaches can be distinguished by the joint probability distribution model they use. In this chapter we use the following models: BSC (Syswerda, 1993), MIMIC (De Bonet et al., 1997), TREE (Chow and Liu, 1968) and EBNABIC (Etxeberria and Larrañaga, 1999).
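One simple way to guarantee legality is to repair sampled strings so that every cluster label appears at least once. Note that this is not the method of Bengoetxea et al. (2000) that the chapter actually follows; it is only a minimal illustrative repair, assuming n ≥ k, with a function name of our own.

```python
import random

def repair(ind, k, rng):
    """Illustrative repair sketch (assumes n >= k): if some cluster
    label in {1..k} is missing, reassign an object drawn from a
    cluster that currently holds more than one object."""
    ind = list(ind)
    missing = [c for c in range(1, k + 1) if c not in ind]
    for c in missing:
        # positions whose current label occurs more than once
        candidates = [i for i, v in enumerate(ind) if ind.count(v) > 1]
        i = rng.choice(candidates)
        ind[i] = c
    return ind
```

Legal individuals pass through unchanged; illegal ones are minimally perturbed into legality before being evaluated.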

6. Experimental results

We carried out experiments in order to compare the results of the K-Means algorithm, GAs and EDAs. Mainly, we are interested in comparing the average value of the evaluation function obtained in different executions. In this way, we compare the ability of these three search heuristics to avoid local minima.

We used five well known datasets from the UCI repository (Murphy and Aha, 1994). We report them in Table 17.1 in decreasing order of complexity. We measure their complexity in terms of the number of clusters, the number of attributes and the number of objects. Another important factor is the separation between clusters. For instance, the Cleveland and Voting datasets have roughly the same number of attributes and objects, but we consider Cleveland to be more complex as it has 4 clusters which are not well separated. On the contrary, the Voting dataset is divided into only two clusters which are very well separated. Even though there are 4 clusters in Soybean and only 3 in Iris, we consider the latter to be more complex than Soybean because it has more objects and its classes are not as well separated as those in Soybean. We consider the Wine dataset to be more complex than the Iris dataset as it has more attributes and objects.

We performed the K-Means experiments as follows. We executed the algorithm 1000 times, limiting the number of iterations to 20. In each execution the objects of the datasets were randomly ordered with the aim of obtaining different initial partitions.

For the experiments performed with GAs and EDAs, we used three different population sizes, namely populations with 200, 500, and 1000 individuals. We


Table 17.1 Dataset descriptions.

            No. Clusters   No. Objects   No. Attributes

Cleveland        4             303             14
Wine             3             178             13
Iris             3             150              5
Soybean          4              47             36
Voting           2             345             17

encoded the partitions with the ordinal object membership representation, that is, with a string of length n where the ith position represents the cluster the ith object belongs to. Experiments were conducted with and without a hybrid heuristic. Before the selection process, the hybrid heuristic runs one iteration of the K-Means algorithm on each individual of each generation. Each experiment was performed five times; the results reported correspond to the mean of the performance obtained. In Tables 17.2 to 17.6 we report the results obtained for each algorithm and each dataset.

In the experiments with GAs, we used a simple mutation operator and a single-point crossover operator. In all the experiments the algorithm was run for 1000 generations.

Experiments with EDAs were conducted with different joint probability distribution models, namely BSC, MIMIC, TREE, and EBNABIC. The algorithm was stopped either when the overall performance of a population did not improve on that of the previous one, or when 400 generations were reached. We used the truncation selection method in order to select the best half of the population individuals. In every new generation we added the best individual of the previous one, so if the population has N individuals we created N - 1 new individuals. In this way, we also ensure that we do not lose the best individual in the following generation.
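The elitist generation step just described can be sketched as follows. `sample_model` stands for any model-learning-and-sampling routine (BSC, MIMIC, etc.) and is an assumed callable, not part of the chapter's code.

```python
def next_generation(population, fitness, sample_model):
    """One elitist generation step: keep the best individual, truncate
    to the best half, and create N-1 new individuals by simulating the
    model learnt from the selected set.  `fitness` is minimised;
    `sample_model(selected)` returns one new individual per call."""
    population = sorted(population, key=fitness)
    best = population[0]                         # elite individual
    selected = population[:len(population) // 2]  # truncation selection
    new = [sample_model(selected) for _ in range(len(population) - 1)]
    return [best] + new
```

Because the elite individual is copied forward unconditionally, the best fitness seen so far can never worsen between generations.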

From the results obtained with EDAs when clustering the Cleveland (Table 17.2) and Wine (Table 17.3) datasets we would like to note the following three points.

Firstly, it can be seen that increasing the population size helps to obtain good evaluation function values. This is especially true for the Cleveland dataset, while for the Wine dataset it is possible to reach values close to the best known with small populations. We also saw that the number of generations needed to converge was not greatly affected by the population size. For instance, the EDA with the BSC joint probability model required around 80 generations to cluster the Cleveland dataset for all population sizes. When clustering the Wine dataset the algorithm needed around 50 generations.


Secondly, results seem to improve as the complexity of the model used to learn the joint probability distribution increases. On the one hand, the BSC model obtains the worst results while EBNABIC always obtains very good results. However, for the Cleveland dataset the TREE model is the worst, while for the Wine dataset it is the best. We would also like to note that EBNABIC is not greatly affected by the initial conditions (i.e. the initial population), as the five executions obtained very similar results.

And thirdly, we would like to note that the hybrid heuristic helps to improve the evaluation function values. However, its most significant advantage is the reduction in the number of generations needed for convergence. For instance, the EDA with the BSC joint probability model required around 45 (instead of 80) generations to cluster the Cleveland dataset. When clustering the Wine dataset the algorithm required around 20 (instead of 50) generations.

The Iris, Soybean small, and Voting datasets are not very difficult to learn. The Iris dataset is structured in three clusters, two of which are well separated. The Soybean small dataset is composed of only 47 tuples. And there are only two clusters in the Voting dataset, which considerably reduces the space of possible partitions.

From the results obtained with the Iris (Table 17.4), Soybean (Table 17.5) and Voting (Table 17.6) datasets we conclude that all the EDA approaches yield good performance values. Executions with small population sizes (200 individuals) perform well. We also noted that, when the hybrid heuristic was used, the number of generations needed to converge was dramatically reduced and the results of the five executions were very similar. For instance, the EDA with the BSC joint probability model required around 40, 30 and 43 generations to cluster the Iris, Soybean and Voting datasets respectively when no heuristic was used, while it only needed 8, 8 and 13 generations when the hybrid heuristic was applied.

Comparing the GA results against the K-Means results, one can see that the former performed much worse than the latter when no additional hybrid heuristic is used. However, this may be caused by the simplicity of the genetic operators we used in our approach to partitional clustering with GAs. In the literature, there are papers where GA results significantly outperform the K-Means ones (Jones and Beltramo, 1991; Maulik and Bandyopadhyay, 2000). Our results may illustrate that it is not easy to obtain good results with GAs when no problem knowledge is added to the algorithm.

If we compare the results obtained with the GA approach with those of EDAs, it is clear that EDAs perform much better. When the hybrid heuristic is used, the results obtained by both approaches get closer. This does not hold for the Cleveland dataset, the most complex and difficult to learn.


Table 17.2 Average results for Cleveland (K-Means performance: 10048.9).

Size   Heuristic       GA        BSC      MIMIC      TREE    EBNABIC

200    None        14514.26   9994.26   9755.84  10442.28   9709.22
200    Hybrid      14502.60   9841.03   9739.99  10523.82   9705.01
500    None        14636.64   9763.35   9728.73   9956.10   9704.75
500    Hybrid      14907.86   9717.97   9702.96   9836.90   9703.75
1000   None        14668.88   9734.70   9719.06   9797.03   9738.78
1000   Hybrid      15021.24   9718.44   9698.57   9692.30   9711.06

Table 17.3 Average results for Wine (K-Means performance: 1190.05).

Size   Heuristic      GA       BSC     MIMIC     TREE   EBNABIC

200    None        1971.76  1194.12  1171.52  1168.64  1169.13
200    Hybrid      1168.52  1169.83  1168.52  1168.52  1168.54
500    None        1959.65  1171.38  1172.30  1168.58  1168.52
500    Hybrid      1168.52  1175.25  1168.52  1168.52  1168.86
1000   None        1958.74  1171.67  1171.76  1168.52  1171.08
1000   Hybrid      1168.52  1173.44  1168.52  1168.52  1170.01

Table 17.4 Average results for Iris (K-Means performance: 108.21).

Size   Heuristic      GA      BSC    MIMIC    TREE   EBNABIC

200    None        261.90  107.46  100.50  103.19  100.08
200    Hybrid      100.08  100.08  100.08  100.08  100.08
500    None        267.92  100.08  100.08  100.08  100.08
500    Hybrid      100.08  100.08  100.08  100.08  100.08
1000   None        262.45  100.08  100.08  100.08  100.08
1000   Hybrid      100.08  100.08  100.08  100.08  100.08

7. Conclusions

As far as we know, this is the first time that EDAs have been used for the partitional clustering problem. It seems clear that the EDA optimisation approach to clustering performs well, as the results show. The search is able to escape from local minima and obtain near-optimal partitions.

The GA approach is not able to converge to good results without the hybrid heuristic, while the EDA approach is. From this we could conclude that, in the case of GAs, the ability to find good solutions comes from the iterative nature of the heuristic rather than from the genetic operators. Even though we must admit we used simple operators, we claim that it is much simpler to work with the EDA approach. EDAs overcome the complexity of building special genetic operators


Table 17.5 Average results for Soybean small (K-Means performance: 111.78).

Size   Heuristic      GA      BSC    MIMIC    TREE   EBNABIC

200    None        141.28  107.37  103.43  103.43  105.87
200    Hybrid      103.43  103.43  103.43  103.43  103.43
500    None        140.01  107.15  107.11  107.32  105.08
500    Hybrid      103.43  103.43  103.43  103.43  103.43
1000   None        140.10  105.27  105.27  105.49  105.27
1000   Hybrid      103.43  103.43  103.43  103.43  103.43

Table 17.6 Average results for Voting (K-Means performance: 718.18).

Size   Heuristic      GA      BSC    MIMIC    TREE   EBNABIC

200    None        923.96  718.06  717.85  717.90  717.85
200    Hybrid      717.85  717.85  717.85  717.85  717.85
500    None        931.73  717.85  717.85  717.85  717.85
500    Hybrid      717.85  717.90  717.85  717.85  717.85
1000   None        938.68  717.85  717.85  718.81  717.85
1000   Hybrid      717.85  717.85  717.85  717.85  717.85

for each problem. Thus, users can easily obtain good results without knowing the evolutionary mechanism.

Comparing the results obtained with EDAs and those obtained with the K-Means algorithm, it seems clear that EDAs perform better, especially with difficult datasets like Cleveland or Wine. However, K-Means yields very good results when it is fed with good object orderings, and it is computationally very cheap. In our opinion, when the dataset is known to be simple it is worth executing K-Means a thousand times with different random object orderings and, at the end, returning the best solution.

We also performed experiments where we initialised the first population with some individuals generated with the K-Means algorithm. The results were disappointing, as the search was strongly biased towards the best individuals of the first populations, leading the search to local minima. We observed that most of the time the best individual of the first generation survived through all generations until the last.

Acknowledgments

The authors wish to thank E. Bengoetxea, who provided us with the source code used in our experiments. We also wish to thank J.A. Lozano for his support in writing this chapter. The work of the first and third authors was partially supported by grant UPC PR99-09 from the Universitat Politècnica de Catalunya.


References

Alippi, C. and Cucchiara, R. (1992). Cluster partitioning in image analysis classification: a genetic algorithm approach. In Proc. CompEuro 92, pages 139-144. IEEE Computer Society Press.

Babu, G. P. and Murty, M. N. (1994). Clustering with evolution strategies. Pattern Recognition, 27(2):321-329.

Ball, G. H. and Hall, D. J. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12:153-155.

Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821.

Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A., and Boeres, C. (2000). Inexact graph matching using learning and simulation of Bayesian networks. An empirical comparison between different approaches with synthetic data. In Workshop Notes of CaNew2000: Workshop on Bayesian and Causal Networks: From Inference to Data Mining. Fourteenth European Conference on Artificial Intelligence, ECAI 2000, Berlin.

Bezdek, J. C., Boggavarapu, S., Hall, L. O., and Bensaid, A. (1994). Genetic algorithm guided clustering. In Fogel, D. B., editor, Proceedings of The First IEEE Conference on Evolutionary Computation, volume I, pages 34-40. IEEE Computer Society Press.

Bhuyan, J. N., Raghavan, V. V., and Elayavalli, V. K. (1991). Genetic algorithms with an ordered representation. In Belew, R. and Booker, L. B., editors, Proc. of the Fourth International Conference on Genetic Algorithms, pages 408-415. Morgan Kaufmann.

Bozdogan, H. (1994). Choosing the number of clusters, subset selection of variables, and outlier detection in the standard mixture-model cluster analysis. In Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., and Burtschy, B., editors, New Approaches in Classification and Data Analysis, pages 169-177. Springer-Verlag.

Celeux, G. and Soromenho, G. (1996). An entropy criterion for assessing the number of clusters in a mixture model. Journal of Classification, 13(2):195-212.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467.

Cucchiara, R. (1993). Analysis and comparison of different genetic models for the clustering problem in image analysis. In Albrecht, R. F., Reeves, C. R., and Steele, N. C., editors, Artificial Neural Networks and Genetic Algorithms, pages 423-427. Springer-Verlag.

De Bonet, J. S., Isbell, C. L., and Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, Vol. 9.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley.

Etxeberria, R. and Larrañaga, P. (1999). Global optimization with Bayesian networks. In II Symposium on Artificial Intelligence. CIMAF99. Special Session on Distributions and Evolutionary Optimization, pages 332-339.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). Knowledge discovery and data mining: towards a unifying framework. In Second International Conference on Knowledge Discovery and Data Mining, Portland, OR. AAAI Press.

Fisher, D., Pazzani, M., and Langley, P. (1992a). Concept Formation: Knowledge and Expertise on Unsupervised Learning. Morgan Kaufmann Publishers, Inc.

Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172.

Fisher, D. H., Xu, L., and Zard, N. (1992b). Ordering effects in clustering. In Ninth International Conference on Machine Learning, pages 163-168.

Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications (abstract). Biometrics, 21:768-769.

Gordon, A. D. (1987). A review of hierarchical classification. Journal of the Royal Statistical Society, Series A, 150(2):119-137.

Hanson, R., Stutz, J., and Cheeseman, P. (1990). Bayesian classification theory. Technical Report FIA-90-12-7-01, NASA, Ames Research Center.

Hardy, A. (1994). An examination of procedures for determining the number of clusters in a data set. In Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., and Burtschy, B., editors, New Approaches in Classification and Data Analysis, pages 178-185. Springer-Verlag.

Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall.

Jones, D. R. and Beltramo, M. A. (1990). Clustering with genetic algorithms. Technical Report GMR-7156, Operating Sciences Department, General Motors Research Laboratories.

Jones, D. R. and Beltramo, M. A. (1991). Solving partitioning problems with genetic algorithms. In Belew, R. and Booker, L. B., editors, Proc. of the Fourth International Conference on Genetic Algorithms, pages 442-449. Morgan Kaufmann.

Langley, P. (1995). Order effects in incremental learning. In Reimann, P. and Spada, H., editors, Learning in Humans and Machines: Towards an Interdisciplinary Learning Science. Pergamon.

Langley, P. (1998). Elements of Machine Learning. Series in Machine Learning. Morgan Kaufmann Publishers, Inc., San Francisco, California.

Lozano, J. A. (1998). Genetic Algorithms Applied to Unsupervised Classification. PhD thesis, University of the Basque Country (in Spanish).

Lozano, J. A., Larrañaga, P., and Graña, M. (1998). Partitional cluster analysis with genetic algorithms: searching for the number of clusters. In Hayashi, C., Ohsumi, N., Yajima, K., Tanaka, Y., Bock, H., and Baba, Y., editors, Data Science, Classification and Related Methods, pages 117-125. Springer.

Lucasius, C. B., Dane, A. D., and Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: background, feasibility and comparison. Analytica Chimica Acta, 282:647-669.

Luchian, S., Luchian, H., and Petriuc, M. (1994). Evolutionary automated classification. In Fogel, D. B., editor, Proceedings of The First IEEE Conference on Evolutionary Computation, volume I, pages 585-589. IEEE Computer Society Press.

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium, volume 2, pages 281-297.

Maulik, U. and Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33:1455-1465.

Michalski, R. S. and Stepp, R. E. (1983). Learning from observation: Conceptual clustering. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M., editors, Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann, Los Altos, CA.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature - PPSN IV, pages 178-187.

Murphy, P. M. and Aha, D. W. (1994). UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.

Pelikan, M. and Goldberg, D. E. (2000). Genetic algorithms, clustering, and the breaking of symmetry. IlliGAL Report No. 2000013, University of Illinois at Urbana-Champaign, Illinois.

Peña, J. M., Lozano, J. A., and Larrañaga, P. (1999). An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognition Letters, 20:1027-1040.

Raghavan, V. V. and Birchard, K. (1979). A clustering strategy based on a formalism of the reproductive process in natural systems. SIGIR Forum, 14:10-22.

Rasson, J. P. and Kubushishi, T. (1993). The gap test: an optimal method for determining the number of natural classes in cluster analysis. In Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., and Burtschy, B., editors, New Approaches in Classification and Data Analysis, pages 186-193. Springer-Verlag.

Roure, J. and Talavera, L. (1998). Robust incremental clustering with bad instance orderings: a new strategy. In Coelho, H., editor, Progress in Artificial Intelligence-IBERAMIA 98, Sixth Ibero-American Conference on AI, pages 136-147. Springer.

Sarkar, M., Yegnanarayana, B., and Khemani, D. (1997). A clustering algorithm using evolutionary programming-based approach. Pattern Recognition Letters, 18:975-986.

Syswerda, G. (1993). Simulated crossover in genetic algorithms. Foundations of Genetic Algorithms 2, pages 239-255.


Chapter 18

Adjusting Weights in Artificial Neural Networks using Evolutionary Algorithms

C. Cotta and E. Alba
Department of Computer Science

University of Málaga

{ccottap, eat}@lcc.uma.es

R. Sagarna and P. Larrañaga
Department of Computer Science and Artificial Intelligence

University of the Basque Country

{ccbsaalr, ccplamup}@si.ehu.es

Abstract: Training artificial neural networks is a complex task of great practical importance. Besides classical ad-hoc algorithms such as backpropagation, this task can be approached by using Evolutionary Computation, a highly configurable and effective optimization paradigm. This chapter provides a brief overview of these techniques, and shows how they can be readily applied to the resolution of this problem. Three popular variants of Evolutionary Algorithms (Genetic Algorithms, Evolution Strategies and Estimation of Distribution Algorithms) are described and compared. This comparison is done on the basis of a benchmark comprising several standard classification problems of interest for neural networks. The experimental results confirm the general appropriateness of Evolutionary Computation for this problem. Evolution Strategies seem particularly proficient techniques in this optimization domain, and Estimation of Distribution Algorithms are also a competitive approach.

Keywords: Evolutionary Algorithms, Artificial Neural Networks, Supervised Training, Hybridization

P. Larrañaga et al. (eds.), Estimation of Distribution Algorithms

© Springer Science+Business Media New York 2002


362 Estimation of Distribution Algorithms

1. Introduction

Artificial Neural Networks (ANNs) are computational models based on parallel processing (McClelland and Rumelhart, 1986). Essentially, an ANN can be defined as a pool of simple processing units which communicate among themselves by sending analog signals. These signals travel through weighted connections between the units. Each of these processing units accumulates the inputs it receives, and produces an output that depends on an internal activation function. This output can serve as an input for other units, or can be a part of the network output. The attractiveness of ANNs resides in the very appealing properties they exhibit, such as adaptivity, learning capability, and their ability to generalize. Nowadays, ANNs have a wide spectrum of applications ranging from classification to robot control and vision (Alander, 1994).

The rough description of ANNs given in the previous paragraph provides some clues on the design tasks involved in the application of ANNs to a particular problem. As a first step, the architecture of the network has to be decided. Basically, two main variants can be considered: feed-forward networks and recurrent networks. The former model comprises networks in which the connections are strictly feed-forward, i.e. no unit receives input from a unit to which it has sent its own output. The latter model comprises networks in which feedback connections are allowed, thus making the dynamical properties of the network important. In this work we will concentrate on the first and simpler model, feed-forward networks. To be precise, we will consider the so-called multilayer perceptron (Rosenblatt, 1959), in which units are structured into ordered layers, with connections allowed only between adjacent layers.

Once the architecture of the ANN is restricted to that of a multilayer perceptron, some parameters such as the number of layers, and the number of units per layer, must be defined. After doing this, the final step is adjusting the weights of the network, so that it produces the desired output when confronted with a particular input. This process is known as training the ANN or learning the network weights¹. We will focus on the learning situation known as supervised training, in which a set of current-input/desired-output patterns is available. Thus, the ANN has to be trained to produce the desired output according to these examples.

The most classic approach to supervised training is a domain-dependent technique known as Backpropagation (BP) (Rumelhart et al., 1986). This algorithm is based on measuring the total error in the input/output behaviour of the network, calculating the gradient of this error, and adjusting the weights in the descending gradient direction. Hence, BP is a gradient-descent local search procedure. This implies that BP is subject to some well-known problems such as the existence of local minima in the error surface, and the non-differentiability of the weight space. Different solutions have been proposed for this problem,


Adjusting Weights in Artificial Neural Networks using Evolutionary Algorithms 363

resulting in several algorithmic variants, e.g. those in Silva and Almeida (1990). A completely different alternative is the use of Evolutionary Algorithms for this training task.

Evolutionary Algorithms (EAs) are heuristic search techniques loosely based on the principles of natural evolution, namely adaptation and survival of the fittest. These techniques have been shown to be very effective in solving hard optimization tasks with similar properties to the training of ANNs, i.e. problems in which gradient-descent techniques get trapped in local minima, or are fooled by the complexity and/or non-differentiability of the search space. This chapter provides a gentle introduction to the use of these techniques for the supervised training of ANNs. To be precise, this task will be tackled by means of three different EA models, namely Genetic Algorithms (GAs), Evolution Strategies (ESs), and Estimation of Distribution Algorithms (EDAs).

The remainder of the chapter is organized as follows. Section 2 addresses the application of these techniques to the training of an ANN. This section gives a brief overview of the classical BP algorithm, in order to clarify the difference and distinctiveness of the EA approach, subsequently described. Some basic differences and similarities in the application of the suggested variants of EAs to the problem at hand are illustrated in this section too. Next, an experimental comparison of these techniques is provided in Section 3. Finally, some conclusions and directions for further development are outlined in Section 4.

2. An evolutionary approach to ANN training

As mentioned in Section 1, this section provides an overview of an evolutionary approach to weight adjusting in ANNs. This is done in Subsections 2.2 and 2.3, but a classical technique for this task, the BP algorithm, is described first. This description is needed for a further combination of both evolutionary and classical approaches.

2.1 The BP algorithm

It has already been mentioned that the BP algorithm is based on determining the descending gradient direction of the error function of the network, and adjusting the weights accordingly. It is therefore necessary to define the error function in the first place. This function is the summed squared error E defined as follows:

E = \frac{1}{2} \sum_{1 \le p \le m} E^p = \frac{1}{2} \sum_{1 \le p \le m} \sum_{1 \le o \le n_o} \left( d_o^p - y_o^p \right)^2    (18.1)

where m is the number of patterns, n_o the number of outputs of the network, d_o^p is the desired value of the o-th output in the p-th pattern, and y_o^p is the actual


value of this output. This value is computed as a function of the total input s_o^p received by the unit, i.e.

y_o^p = F(s_o^p) = F\left( \sum_r w_{ro} \, y_r^p \right)    (18.2)

where F is the activation function of the unit, and r ranges across the units from which unit o receives input.

The gradient of this error function E with respect to individual weights is

\frac{\partial E}{\partial w_{ij}} = \sum_{1 \le p \le m} \frac{\partial E^p}{\partial w_{ij}} = \sum_{1 \le p \le m} \frac{\partial E^p}{\partial s_j^p} \frac{\partial s_j^p}{\partial w_{ij}} = \sum_{1 \le p \le m} \frac{\partial E^p}{\partial s_j^p} \, y_i^p .    (18.3)

By defining \delta_j^p = - \frac{\partial E^p}{\partial s_j^p}, the weight change is

\Delta w_{ij} = \sum_{1 \le p \le m} \Delta_p w_{ij} = \sum_{1 \le p \le m} \gamma \, \delta_j^p \, y_i^p    (18.4)

where \gamma is a parameter known as the learning rate. In order to calculate the \delta_j^p terms, two situations must be considered: the j-th unit being an output unit or being an internal unit. In the former case,

\delta_j^p = (d_j^p - y_j^p) \, F'(s_j^p) .    (18.5)

In the latter case, the error is backpropagated as follows:

\delta_j^p = - \frac{\partial E^p}{\partial s_j^p} = - \frac{\partial E^p}{\partial y_j^p} \frac{\partial y_j^p}{\partial s_j^p} = - \frac{\partial E^p}{\partial y_j^p} \, F'(s_j^p) .    (18.6)

The term \frac{\partial E^p}{\partial y_j^p} can be developed as

\frac{\partial E^p}{\partial y_j^p} = \sum_{u_j \to u_r} \frac{\partial E^p}{\partial s_r^p} \frac{\partial s_r^p}{\partial y_j^p} = \sum_{u_j \to u_r} \frac{\partial E^p}{\partial s_r^p} \, w_{jr} = - \sum_{u_j \to u_r} \delta_r^p \, w_{jr}    (18.7)

where r ranges across the units receiving input from unit j. Thus,

\delta_j^p = F'(s_j^p) \sum_{u_j \to u_r} \delta_r^p \, w_{jr} .    (18.8)

One of the problems of following this update rule is the fact that some oscillation can occur when \gamma is large. For this reason, a momentum term \alpha is added, so that

\Delta w_{ij}(t+1) = \sum_{1 \le p \le m} \gamma \, \delta_j^p \, y_i^p + \alpha \, \Delta w_{ij}(t) .    (18.9)

Despite this modification, the BP algorithm is still sensitive to the ruggedness of the error surface, and is often trapped in local optima. Hence the need for alternative search techniques.
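As a concrete illustration, the update rules above can be written in batch-matrix form for a single-hidden-layer perceptron with sigmoid units (for which F'(s) = y(1 - y)). This is a minimal sketch of Eqs. (18.1)-(18.9), not the implementation used in the chapter; the function name bp_epoch and the bias-free toy setup are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_epoch(X, D, W1, W2, V1, V2, gamma=0.1, alpha=0.5):
    """One batch BP step with momentum.
    X: (m, n_in) inputs; D: (m, n_out) desired outputs;
    W1, W2: weight matrices; V1, V2: previous weight changes (Eq. 18.9).
    Returns updated weights/changes and the pre-update error E (Eq. 18.1)."""
    H = sigmoid(X @ W1)                      # hidden activations y_j^p
    Y = sigmoid(H @ W2)                      # output activations y_o^p
    delta_out = (D - Y) * Y * (1.0 - Y)      # Eq. 18.5, with F'(s) = y(1 - y)
    delta_hid = (delta_out @ W2.T) * H * (1.0 - H)   # Eq. 18.8 (backpropagated)
    V2 = gamma * (H.T @ delta_out) + alpha * V2      # Eq. 18.9, output layer
    V1 = gamma * (X.T @ delta_hid) + alpha * V1      # Eq. 18.9, hidden layer
    W1 = W1 + V1
    W2 = W2 + V2
    E = 0.5 * np.sum((D - Y) ** 2)           # summed squared error, Eq. 18.1
    return W1, W2, V1, V2, E
```

Iterating bp_epoch over a small training set (e.g. the Boolean OR patterns) drives the summed squared error down, illustrating the gradient-descent behaviour discussed above.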


2.2 The basic evolutionary approach

EAs can be used for adjusting the weights of an ANN. This approach is relatively popular, dating back to the late 1980s, for example in Caudell and Dolan (1989), Montana and Davis (1989), Whitley and Hanson (1989), Fogel et al. (1990) and Whitley et al. (1990), and nowadays constituting a state-of-the-art tool for supervised learning. The underlying idea is to have individuals represent the weights of the ANN, using the network error function as a cost function to be minimized (alternatively, an accuracy function such as the number of correctly classified patterns could be used as a fitness function to be maximized, but this approach is rarely used). Some general issues must be taken into account when using an evolutionary approach to ANN training. These are commented on below.

The first issue that has to be addressed is the representation of solutions. In this case, it is clear that the phenotype space is \Omega^k, where \Omega \subset \mathbb{R} is a closed interval [min, max], and k is the number of weights of the ANN being trained, i.e. solutions are k-dimensional vectors of real numbers in the range [min, max]. This phenotype space must be appropriately translated to a genotype space, which will depend on the particulars of the EA used. In this work, we use a linear encoding of these weights: each weight is conveniently encoded in an algorithm-dependent way, and the genotype is then constructed by concatenating the encoding of each weight into a linear string.
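For a real-coded EA, this linear encoding amounts to flattening the per-layer weight matrices into one vector and restoring them for evaluation; a minimal sketch (function names are ours, not the chapter's):

```python
import numpy as np

def flatten_weights(layers):
    """Concatenate the weight matrices of an MLP into one linear genotype."""
    return np.concatenate([W.ravel() for W in layers])

def restore_weights(genotype, shapes):
    """Rebuild the per-layer weight matrices from a flat genotype vector."""
    layers, pos = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        layers.append(genotype[pos:pos + size].reshape(shape))
        pos += size
    return layers
```

Grouping the input weights of each unit contiguously, as advised below, corresponds to the row/column ordering chosen when calling ravel().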

This linear encoding of weights raises a second issue, the distribution of weights within the string. This distribution is important because of the particular recombination operator used. If this operator breaks the strings into large blocks and uses them as units for exchange (e.g. one-point crossover), then this distribution may be relevant. Alternatively, using a recombination operator that breaks the string into very small blocks (e.g. uniform crossover) makes this distribution irrelevant. A good piece of advice is to group together the input weights for each unit. This way, the probability of transmitting them as a complete block is increased, if an operator such as one-point crossover is used. Obviously, recombination is not used in some EAs, e.g. in EDAs, so this issue is irrelevant there.

2.3 Specific EA details

The basic idea outlined in the previous subsection can be implemented in a variety of ways depending upon the particular EA used. We will now discuss these implementation details for the EA models mentioned, namely GAs, ESs and EDAs.


Figure 18.1 The weights of an ANN are encoded into a linear binary string in GAs, or into a 2k-dimensional real vector in ESs (k weights plus k stepsizes). The EDA encoding is similar to that of the ES, excluding the stepsizes, i.e. a k-dimensional real vector.

2.3.1 Genetic Algorithms. GAs are popular members of the evolutionary-computing paradigm. Initially conceived by Holland (1975), these techniques today constitute the most widespread type of EA. In traditional GAs, solutions are encoded as binary strings. Specifically, m bits are used to represent each single weight and the k m-bit segments are concatenated into an ℓ-bit binary string, ℓ = k · m. This process is illustrated in Fig. 18.1.

This encoding of the network weights raises a number of issues. The first one is the choice of m (the length of each segment encoding a weight). It is intuitively clear that a low value of m would give a very coarse discretization of the allowed range for weights, thus introducing oscillations and slowing down convergence during the learning process. Alternatively, too large a value for m would result in very long strings, whose evolution is known to be very slow. Hence, intermediate values for m seem appropriate. Unfortunately, such intermediate values seem to be problem dependent, sometimes requiring a costly trial-and-error process. Alternatively, advanced encoding techniques such as delta coding (Whitley et al., 1991) could be used, although it has to be taken into account that this introduces an additional level of complexity in the algorithm.
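The discretization trade-off can be made concrete: with m bits, a weight in [min, max] is mapped onto 2^m levels. A plain-binary sketch (a Gray-coded variant would differ; helper names are hypothetical):

```python
def encode_weight(w, m=16, lo=-10.0, hi=10.0):
    """Discretize a weight in [lo, hi] onto 2^m levels, pure binary code."""
    levels = (1 << m) - 1
    code = round((w - lo) / (hi - lo) * levels)
    return format(code, "0{}b".format(m))

def decode_weight(bits, lo=-10.0, hi=10.0):
    """Map an m-bit segment back to a real weight in [lo, hi]."""
    levels = (1 << len(bits)) - 1
    return lo + int(bits, 2) / levels * (hi - lo)
```

With m = 16 over [-10, 10] (the setting used later in the experiments), the quantization step is 20/(2^16 - 1), i.e. roughly 3·10^-4; halving m to 8 coarsens this to about 0.08.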

A related issue is the encoding mechanism for individual weights, i.e. a choice of pure binary, Gray-coded numbers, magnitude-sign, etc. Some authors have advocated the use of Gray-coded numbers (Whitley, 1999) on the basis of theoretical studies regarding the preservation of some topological properties in the resulting fitness landscape (Jones, 1995). However, the suitability of such analysis to this problem is barely understood. Furthermore, the disruption caused by classical recombination operators, as well as the effects of multiple mutations per segment being performed (a usual scenario), will most probably


reduce the advantages (if any) of this particular encoding scheme. Hence, no preferred encoding technique can be distinguished in principle.

2.3.2 Evolution Strategies. The ES (Rechenberg, 1973; Schwefel, 1977) approach is somewhat different from the GA approach presented in the previous subsection. Most noticeably, the relative intricacy of deciding the representation of the ANN weights in a Genetic Algorithm contrasts with the simplicity of the ES approach. In this case, each solution is represented as it is: an n-dimensional vector of real numbers in the interval [min, max] (see Fig. 18.1)².

Associated with each weight w_i is a stepsize parameter σ_i for performing Gaussian mutation on that weight³. These stepsizes are evolved together with the parameters that constitute the solution, thus allowing the algorithm to adapt the way its search is performed.

Note also that the use of recombination operators (let alone positional recombination operators) is often neglected in ESs, thus making the distribution of weights inside the vector irrelevant.

Work on using ESs in the context of ANN training includes Wienholt (1993), Berlanga et al. (1999a) and Berlanga et al. (1999b).
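One generation of such a strategy might be sketched as follows, under the parameterization used later in the experiments ((1,10) comma selection, self-adaptive non-correlated stepsizes, weights clipped to [-10, 10]); this is illustrative code, not the authors' implementation:

```python
import numpy as np

def es_step(parent_w, parent_s, fitness, lam=10, rng=None):
    """One generation of a (1,lambda)-ES with non-correlated self-adaptive
    stepsizes; comma selection keeps the best child only (fitness minimized)."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(parent_w)
    tau = 1.0 / np.sqrt(2.0 * n)                   # global learning rate
    tau_prime = 1.0 / np.sqrt(2.0 * np.sqrt(n))    # local learning rate
    best_w, best_s, best_f = None, None, np.inf
    for _ in range(lam):
        g = rng.normal()                           # one global draw per child
        s = parent_s * np.exp(tau * g + tau_prime * rng.normal(size=n))
        w = np.clip(parent_w + s * rng.normal(size=n), -10.0, 10.0)
        f = fitness(w)
        if f < best_f:                             # comma selection
            best_w, best_s, best_f = w, s, f
    return best_w, best_s, best_f
```

In practice fitness would be the network RMSE over the training set; any smooth cost function works for illustration.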

2.3.3 Estimation of Distribution Algorithms. EDAs, introduced by Mühlenbein and Paaß (1996), are a new tool for evolutionary computation, in which the usual crossover and mutation operators are replaced by the estimation of the joint density function of the individuals selected at each generation, and the subsequent simulation of this probability distribution, in order to obtain a new population of individuals. Details of different EDA approaches are given in Chapter 3 of this book.

The weight learning problem for ANNs can be viewed as an optimization problem, so both discrete and continuous EDAs may constitute effective approaches to solving it.

If discrete EDAs are used to tackle the problem, then the representation of the individuals would be similar to the one previously shown for GAs. If continuous EDAs are used, then the representation would be analogous to the one used by ESs. In the latter case, the representation is even simpler than for ESs as no mutation parameter is required.

Work where EDA approaches have been applied to evolve weights in artificial neural networks includes Baluja (1995), Galic and Höhfeld (1996), Maxwell and Anderson (1999), Gallagher (2000) and Zhang and Cho (2000).
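The continuous case with univariate marginals (UMDAc, used later in the experiments) reduces to a short loop: truncation selection of the best half, a Gaussian fit per variable, and resampling. A minimal sketch under these assumptions (defaults echo the experimental setting, but the code is ours):

```python
import numpy as np

def umda_c(fitness, k, pop_size=250, generations=50, rng=None):
    """Sketch of continuous UMDA: fit one Gaussian per variable to the best
    half of the population, then sample the next population from the product
    of these univariate marginals. Fitness is minimized."""
    if rng is None:
        rng = np.random.default_rng()
    pop = rng.uniform(-10.0, 10.0, size=(pop_size, k))   # weights in [-10, 10]
    best, best_f = None, np.inf
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)
        if scores[order[0]] < best_f:                    # track best-so-far
            best, best_f = pop[order[0]].copy(), float(scores[order[0]])
        elite = pop[order[: pop_size // 2]]              # truncation selection
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6                 # avoid zero variance
        pop = rng.normal(mu, sigma, size=(pop_size, k))  # simulate new individuals
    return best, best_f
```

MIMICc would differ only in the density model (a chain of order-two statistics instead of independent marginals).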

2.3.4 Memetic Algorithms. Besides the standard operators used in each of the EA models discussed above, it is possible to consider additional operators adapted for the particular problem at hand. It is well-known, and


supported both by theoretical (Wolpert and Macready, 1997) and empirical (Davis, 1991) results, that the appropriate utilization of problem-dependent knowledge within the EA results in highly effective algorithms. Here, the addition of problem-dependent knowledge can be done by means of a local search procedure specifically designed for ANN training: the BP algorithm. The resulting combination of an EA and BP is known as a hybrid or memetic (Moscato, 1999) algorithm.

The BP algorithm can be used in combination with an EA in a variety of ways. For example, an EA has been used in Gruau and Whitley (1993) to find the initial weights which are then used in the BP algorithm for further training. Another approach is to use BP as a mutation operator, that is, as a procedure for modifying a solution (Davis, 1991). Because BP is a gradient-descent algorithm, this mutation is guaranteed to be monotonic in the sense that the mutated solution will be no worse than the original solution. However, care has to be taken with respect to the amount of computation given to the BP operator. Although BP can produce better solutions when executed for longer, it can fall into a local optimum, making subsequent computational effort useless; moreover, even when BP progresses steadily, the amount of improvement may be negligible relative to the additional overhead introduced. For these reasons, it is preferable to keep the BP utilization at a low level, where the exact meaning of "low level" is again a function of the specific problem being tackled, so no general guideline can be given.
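The BP-as-mutation idea can be sketched generically as a short gradient-descent run applied to an individual's weight vector; grad_fn is a hypothetical callback returning the error gradient for a weight vector, and the "low level" of BP utilization corresponds to the small steps count:

```python
import numpy as np

def bp_mutation(w, grad_fn, steps=10, gamma=0.1):
    """Local-search mutation: a few gradient-descent steps on the error,
    mimicking a short BP run applied to one individual."""
    w = np.asarray(w, dtype=float).copy()
    for _ in range(steps):
        w = w - gamma * grad_fn(w)   # descend the error gradient
    return w
```

In a memetic loop, this operator would be applied to (some) offspring before evaluation, with the evaluation budget shared between the EA and the BP steps.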

3. Experimental results

This section provides an empirical comparison of different evolutionary approaches for training ANNs. A description of the benchmark used is given in Section 3.1, and the details of these approaches in Section 3.2. The results of the experimental evaluation of these techniques are presented and analyzed in Section 3.3.

3.1 ANNs and databases

The algorithms described in the previous section have been used for the supervised training of three different ANNs. Each of these ANNs has a different architecture, and uses a different database. These databases are:

• KILN: This database corresponds to the fault detection and diagnosis of an industrial lime kiln (Ribeiro et al., 1995). There are 70 patterns in this database, where each pattern has 8 descriptive attributes, and belongs to one of 8 different classes. The ANN architecture used in this case is 8-4-8.

• ECOLI: This database corresponds to the prediction of protein localization sites in eukaryotic cells (Nakai and Kanehisa, 1992). There are 336 patterns in this database, where each pattern has 8 descriptive attributes,


and belongs to one of 8 different classes. The ANN architecture used in this case is 8-4-2-8.

• BC: This database corresponds to the diagnosis of breast cancer (Mangasarian and Wolberg, 1990). There are 683 patterns in this database, where each pattern has 9 descriptive attributes, and one Boolean predictive attribute ("malignant" or "benign"). The ANN architecture used in this case is 9-4-3-1.

The weight range for each of the ANNs trained is [-10, 10]. The sigmoid function F(x) = (1 + e^{-x})^{-1} has been used as the activation function of all units.

3.2 The algorithms

Parameters of the GA for these problems are as follows: populationSize = 100, selection = Roulette-Wheel, replacement = Steady-state, crossoverOp = Uniform-Crossover (p_c = 1.0, 80% bias to the best parent), mutationOp = Bit-Flip (p_m = 1/ℓ), m = 16 bits per weight.
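The two GA operators just parameterized might be sketched as follows (illustrative code, not the authors'; the 0.8 default mirrors the 80% bias above):

```python
import random

def biased_uniform_crossover(best, other, bias=0.8, rnd=random.random):
    """Uniform crossover biased towards the better parent: each bit is
    copied from `best` with probability `bias`, else from `other`."""
    return [b if rnd() < bias else o for b, o in zip(best, other)]

def bit_flip_mutation(bits, pm, rnd=random.random):
    """Independent bit-flip mutation with per-bit probability pm (= 1/l here)."""
    return [1 - b if rnd() < pm else b for b in bits]
```

For the ℓ-bit strings of this chapter, pm = 1/ℓ flips one bit per individual on average.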

For the ES, the parameters are even simpler: a standard (1,10)-ES without recombination, and using non-correlated mutations, has been used. The stepsizes are mutated following the guidelines given in Bäck (1996), i.e. a global learning rate τ = 1/√(2n), and a local learning rate τ' = 1/√(2√n).

Two instances of the EDA paradigm have been used in the experiments. The difference between them corresponds to the way in which the factorization of the joint density function of selected individuals has been done. Where factorization is done as a product of univariate marginal densities, we obtain UMDAc. Where the joint density is factorized as a chain that considers statistics of order two, we refer to the algorithm as MIMICc. For more information about these algorithms see Larrañaga (2001). In the EDAs used in the experiments the number of simulated individuals at each generation was 250. The best half of the population was selected for the learning of the joint probability density function.

For each of the three basic algorithms (GAs, ESs, and EDAs), a maximum number of 50,000 RMSE (root mean square error)⁴ evaluations across the whole training set is allowed. These algorithms have also been hybridized with the BP algorithm. This is done by training each network for 10 epochs, using the parameters γ = 0.1 and α = 0.5.

3.3 Analysis of results

The experiments have been carried out with two different scenarios. In the first one, all patterns within each database have been used for training purposes, and the RMSE has been used as the performance measure. In the second scenario, 5-fold cross-validation has been used, and the performance measures are the average RMSE for test patterns, and the percentage of correctly classified test patterns. To determine whether a pattern has been correctly classified, the Boolean nature of the desired output is exploited. Specifically, the actual activation values for each output unit are saturated to the closest Boolean value, and then compared with the desired output. If all saturated actual outputs match the desired output, the pattern is considered correctly classified.

Table 18.1 Results obtained with the BC database.

Algorithm      error-training   error-test-5CV   per-test-5CV
BP             0.4550±0.0324    0.2244±0.0074    63.2650±2.9311
GA             0.1879±0.0117    0.1125±0.0062    90.8676±1.1248
ES             0.1104±0.0017    0.0776±0.0039    95.8565±0.4529
UMDAc          0.1184±0.0081    0.0746±0.0035    95.2353±0.4609
MIMICc         0.1181±0.0091    0.0753±0.0042    95.0735±0.5892
GA + BP        0.3648±0.0246    0.1817±0.0059    71.3824±3.0779
ES + BP        0.1777±0.0266    0.0952±0.0098    93.7189±1.2528
UMDAc + BP     0.3081±0.0259    0.2747±0.0100    51.3529±3.4916
MIMICc + BP    0.3106±0.0018    0.2659±0.0206    54.2206±7.2556
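The saturation-based classification criterion described above can be sketched directly (a small illustration; the function name is ours):

```python
import numpy as np

def per_test(outputs, desired):
    """Percentage of correctly classified patterns: saturate each actual
    output to the closest Boolean value; a pattern counts as correct only
    if every saturated output matches the corresponding desired output."""
    saturated = (np.asarray(outputs, dtype=float) >= 0.5).astype(int)
    correct = np.all(saturated == np.asarray(desired, dtype=int), axis=1)
    return 100.0 * float(np.mean(correct))
```

Note that this all-outputs-must-match rule is stricter for multi-output networks (ECOLI, KILN) than for the single-output BC network, which is relevant for the hardness ranking discussed below.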

Tables 18.1 to 18.3 summarize the experimental results obtained. A general inspection of the column showing the percentage of correctly classified test patterns reveals an evident hardness ranking: the easiest database is BC, and the hardest one is KILN. This particular ranking could be due to several factors. On one hand, it is clear that the saturation criterion used to determine whether a pattern has been correctly classified might be advantageous for BC, since just one output per pattern exists. On the other hand, the network architecture is more complex (and hence the ANN is more adaptable) in BC and ECOLI than in KILN. Finally, KILN has the lowest number of patterns, a drawback a priori for learning to generalize. Actually, this hardness ranking coincides with the ordering of databases according to their size (the smallest is the hardest).

Focusing on the error-training column, it can be seen that both ESs and EDAs have the best results in quality and stability, with the former being slightly better. It is not surprising that these two models are precisely the ones using a real-coded representation of weights. Unlike the binary representation, this representation is less prone to abrupt changes in weight values⁵. This allows better exploitation of any gradient information that might be present. Note that the population-based search performed by these techniques makes


Table 18.2 Results obtained with the ECOLI database.

Algorithm      error-training   error-test-5CV   per-test-5CV
BP             0.2584±0.0051    0.1289±0.0017     8.3333±6.3909
GA             0.1968±0.0165    0.1001±0.0038    47.8308±7.8949
ES             0.1667±0.0085    0.0891±0.0027    65.8929±2.5301
UMDAc          0.1830±0.0067    0.0808±0.0022    58.5970±5.9286
MIMICc         0.1778±0.0134    0.0802±0.0018    58.5075±4.8153
GA + BP        0.3004±0.0126    0.1522±0.0040     8.0398±5.9927
ES + BP        0.1925±0.0202    0.0939±0.0019    53.5417±4.2519
UMDAc + BP     0.2569±0.0069    0.1593±0.0011     9.8209±6.9430
MIMICc + BP    0.2587±0.0064    0.1585±0.0010    10.4179±7.3287

Table 18.3 Results obtained with the KILN database.

Algorithm      error-training   error-test-5CV   per-test-5CV
BP             0.3334±0.0011    0.1664±0.0003     0±0
GA             0.2379±0.0112    0.1229±0.0040    10.7619±5.3680
ES             0.2361±0.0043    0.1243±0.0023    19.7143±5.2511
UMDAc          0.2398±0.0025    0.1132±0.0002     6.1429±3.1623
MIMICc         0.2378±0.0077    0.1132±0.0002     8.4286±3.6546
GA + BP        0.3202±0.0392    0.1686±0.0076     4.5714±4.0301
ES + BP        0.2367±0.0074    0.1241±0.0039     8.2857±5.1199
UMDAc + BP     0.2760±0.0059    0.1437±0.0058     2.2857±2.4467
MIMICc + BP    0.2751±0.0086    0.1420±0.0044     3.1429±3.3537


getting trapped in local optima much less likely (this is especially true in the non-elitist ES model used), and allows a better diversification of the search.

Moving to the 5CV columns, the results are fairly similar: again ESs and EDAs yield similar results, which are generally better than those of GAs. An interesting fact worth mentioning is the superiority of EDAs over ESs in test error, and the superiority of the latter in the percentage of correctly classified patterns. This may indicate a difference in the progress of the underlying search, but more extensive results would be required in order to draw convincing conclusions.

Note also that the hybrid models of EAs and BP perform worse than the non-hybridized EAs. This could be for several reasons. First, it was mentioned before that the balance of computation between BP and EAs is a very important factor. The parameterization chosen in this work may have been inadequate in this sense. Also, it cannot be ruled out that different results would be obtained if the BP parameters γ and α were given different values.

Deeper analysis of the results was done by testing the null hypothesis that the results achieved by some groups of algorithms followed the same density distribution. For this task the non-parametric Kruskal-Wallis and Mann-Whitney tests were used. This analysis was carried out with the statistical package S.P.S.S. release 10.0.6. The results were as follows:

• Between non-memetic algorithms. We use the Kruskal-Wallis test on the null hypothesis that the results obtained by GA, ES, UMDAc and MIMICc follow the same distribution. For the three databases and the three measures (error-training, error-test-5CV and per-test-5CV), the differences were statistically significant (p < 0.05) except for the error-training measure in the KILN database (p = 0.5743).

• Between memetic algorithms. Using the Kruskal-Wallis test on the distributions of the results of GA+BP, ES+BP, UMDAc+BP and MIMICc+BP, we discovered that there were statistically significant differences (p < 0.05) in the three databases and for the three measures.

• Between one non-memetic algorithm and its corresponding memetic algorithm. We also compared differences in the behavior of the non-memetic algorithms and their corresponding memetic ones, that is, GA vs GA+BP, ES vs ES+BP, UMDAc vs UMDAc+BP and MIMICc vs MIMICc+BP. Using the Mann-Whitney test, we obtained that for the comparisons GA vs GA+BP, UMDAc vs UMDAc+BP and MIMICc vs MIMICc+BP, the differences were statistically significant (p < 0.05). When comparing ES vs ES+BP we found that the differences were not statistically significant for the error-training (p = 0.6305) and error-test-5CV (p = 0.9118) measures in the KILN database, but the significance of the differences (p < 0.05) was maintained in the rest of the databases and measures.
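For reference, the statistic behind the Mann-Whitney test can be computed by direct pair counting (ties contributing one half); this toy computation only illustrates the statistic, whereas the analysis above was carried out with S.P.S.S.:

```python
def mann_whitney_u(a, b):
    """U statistic for samples a and b: count the pairs (x in a, y in b)
    with x > y, counting each tie as 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

A useful sanity check is that, for two samples of sizes n1 and n2, mann_whitney_u(a, b) + mann_whitney_u(b, a) = n1 · n2.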


Figure 18.2 Convergence plots of the different EAs on the KILN database: pure algorithms (UMDAc, MIMICc, ES, GA) on the left, and their hybrids with BP on the right; the horizontal axis measures RMSE evaluations (epochs).

Given the above remarks about parameterization, it is also interesting to consider the situation in which a larger number of RMSE calculations is allowed. Specifically, the convergence properties of these algorithms are a cause for concern. A final experiment has been done to shed some light on this: convergence for a long (2·10⁵ RMSE calculations) run of the different algorithms considered has been compared. The results are shown in Fig. 18.2. Focusing first on the leftmost plot (corresponding to pure evolutionary approaches), the superiority of ESs in the short term (≤ 10⁴ RMSE calculations) is evident. In the medium term (~5·10⁴ RMSE calculations), UMDAc emerges as a competitive approach. In the long term (~10⁵ RMSE calculations), UMDAc yields the best results, with the remaining techniques fairly similar in performance. From that point on, there is not much progress, except in the GA case, in which an abrupt advance takes place around 1.5·10⁵ RMSE calculations. Due to this abruptness, additional tests would be necessary to determine the likelihood of such an event.

The scenario is different in the case of the hybridized algorithms. These techniques seem to suffer from premature convergence to some extent (to a high degree in the case of the GA, somewhat lower in the case of the EDAs, and not so severely in the case of the ES). As a consequence, only ESs and MIMICc can advance beyond the 10⁴-RMSE-calculation point. In any case, and as mentioned before, more tests are necessary in order to obtain conclusive results.

4. Conclusions

This work has surveyed the use of EAs for supervised training in ANNs. It is remarkable that EAs remain a competitive technique for this problem, despite


their apparent simplicity. There obviously exist very specialized algorithms for training ANNs that can outperform these evolutionary approaches but, equally, it is foreseeable that more sophisticated versions of these techniques could again constitute highly competitive approaches. As a matter of fact, the study of specialized models of EAs for this domain is a hot topic, continuously yielding encouraging new results, as seen in e.g. Castillo et al. (1999) and Yang et al. (1999).

Future research should be directed at the study of these sophisticated models. A number of questions remain open. For example, the real usefulness of recombination within this application domain is still under debate. Furthermore, even granting its usefulness, the design of appropriate recombination operators for this problem is an area in which much work remains to be done. Finally, the lack of theoretical support for some of these approaches (a situation that could alternatively be formulated as their excessive experimental bias) is a problem to whose solution much effort has to be directed.

Acknowledgments

C. Cotta and E. Alba are partially supported by the Spanish Comisión Interministerial de Ciencia y Tecnología (CICYT) under grant TIC99-0754-C03-03.

Notes

1. Network weights comprise both the previously mentioned connection weights, as well as bias terms for each unit. The latter can be viewed as the weight for a constant saturated input that the corresponding unit always receives.

2. Although it is possible to use real-number encodings in GAs, such models still lack the strong theoretical corpus available for ESs (Beyer, 1993; Beyer, 1995; Beyer, 1996). Furthermore, crossover is the main reproductive operator in GAs, so it is necessary to define sophisticated crossover operators for this representation (Herrera et al., 1996). Again, ESs offer a much simpler approach.

3. Some advanced ES models also include covariance values σij to make all perturbations be correlated. We did not consider this possibility here because we intended to keep the ES approach simple. Note also that the number of these covariance values is O(n²), where n is the number of variables being optimized. Thus, very long vectors would have been required in the context of ANN training.
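Omitting the covariances leaves the uncorrelated self-adaptive mutation scheme, which needs only O(n) strategy parameters (one step size per variable). A minimal sketch follows; the function name and the choice of the standard learning rates τ and τ' are our assumptions, not details taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(weights, sigmas):
    """Uncorrelated self-adaptive ES mutation: one step size per variable,
    with no covariance terms (the O(n^2) correlation parameters are omitted,
    as in the simple ES described in the text)."""
    n = len(weights)
    tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))        # per-variable learning rate
    tau_prime = 1.0 / np.sqrt(2.0 * n)           # global learning rate
    # Log-normal update of the strategy parameters (step sizes stay positive).
    new_sigmas = sigmas * np.exp(tau_prime * rng.standard_normal()
                                 + tau * rng.standard_normal(n))
    # Gaussian perturbation of the object variables, one sigma per variable.
    new_weights = weights + new_sigmas * rng.standard_normal(n)
    return new_weights, new_sigmas
```

With covariances included, the perturbation would instead be drawn from a full multivariate Gaussian, at the cost of O(n²) extra strategy parameters per individual.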

4. RMSE ≡ √(2E/(m·n_o)), where E is the accumulated squared error, m the number of training patterns, and n_o the number of output units.
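Assuming the conventional root-mean-square definition (the chapter's exact normalization over the m patterns and n_o output units is not reproduced verbatim here), the error measure can be computed as:

```python
import numpy as np

def rmse(targets, outputs):
    # Root mean square error: average the squared output errors over all
    # patterns and output units, then take the square root.
    t = np.asarray(targets, dtype=float)
    o = np.asarray(outputs, dtype=float)
    return float(np.sqrt(np.mean((t - o) ** 2)))

# Two patterns, two output units, every output off by 0.1:
print(rmse([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]))  # ≈ 0.1
```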

5. Of course, this also depends on the particular operators used in the algorithm. Recombination is a potentially disruptive operator in this sense. No recombination has been considered in these two models, though.

References

Alander, J. T. (1994). Indexed bibliography of genetic algorithms and neural networks. Technical Report 94-1-NN, University of Vaasa, Department of Information Technology and Production Economics.


Adjusting Weights in Artificial Neural Networks using Evolutionary Algorithms

Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice. Oxford University Press, New York.

Baluja, S. (1995). An empirical comparison of seven iterative and evolutionary function optimization heuristics. Technical Report CMU-CS-95-193, Carnegie Mellon University.

Berlanga, A., Isasi, P., Sanchis, A., and Molina, J. M. (1999a). Neural networks robot controller trained with evolution strategies. In Proceedings of the 1999 Congress on Evolutionary Computation, pages 413-419, Washington D. C. IEEE Press.

Berlanga, A., Molina, J. M., Sanchis, A., and Isasi, P. (1999b). Applying evolution strategies to neural networks robot controllers. In Mira, J. and Sánchez-Andrés, J. V., editors, Engineering Applications of Bio-Inspired Artificial Neural Networks, volume 1607 of Lecture Notes in Computer Science, pages 516-525. Springer-Verlag, Berlin.

Beyer, H.-G. (1993). Toward a theory of evolution strategies: Some asymptotical results from the (1,+λ)-theory. Evolutionary Computation, 1(2):165-188.

Beyer, H.-G. (1995). Toward a theory of evolution strategies: The (μ, λ)-theory. Evolutionary Computation, 3(1):81-111.

Beyer, H.-G. (1996). Toward a theory of evolution strategies: Self-adaptation. Evolutionary Computation, 3(3):311-347.

Castillo, P. A., Gonzalez, J., Merelo, J. J., Prieto, A., Rivas, V., and Romero, G. (1999). GA-Prop-II: Global optimization of multilayer perceptrons using GAs. In Proceedings of the 1999 Congress on Evolutionary Computation, pages 2022-2027, Washington D. C. IEEE Press.

Caudell, T. P. and Dolan, C. P. (1989). Parametric connectivity: training of constrained networks using genetic algorithms. In Schaffer, J. D., editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 370-374, San Mateo, CA. Morgan Kaufmann.

Davis, L. (1991). Handbook of Genetic Algorithms. Van Nostrand Reinhold Computer Library, New York.

Fogel, D. B., Fogel, L. J., and Porto, V. W. (1990). Evolving neural networks. Biological Cybernetics, 63:487-493.

Galic, E. and Höhfeld, M. (1996). Improving the generalization performance of multi-layer perceptrons with population-based incremental learning. In Parallel Problem Solving from Nature IV, volume 1141 of Lecture Notes in Computer Science, pages 740-750. Springer-Verlag, Berlin.

Gallagher, M. R. (2000). Multi-layer Perceptron Error Surfaces: Visualization, Structure and Modelling. PhD thesis, Department of Computer Science and Electrical Engineering, University of Queensland.

Gruau, F. and Whitley, D. (1993). Adding learning to the cellular development of neural networks: Evolution and the Baldwin effect. Evolutionary Computation, 1:213-233.


Herrera, F., Lozano, M., and Verdegay, J. L. (1996). Dynamic and heuristic fuzzy connectives-based crossover operators for controlling the diversity and convergence of real coded genetic algorithms. International Journal of Intelligent Systems, 11:1013-1041.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

Jones, T. C. (1995). Evolutionary Algorithms, Fitness Landscapes and Search. PhD thesis, University of New Mexico.

Larrañaga, P. (2001). A review on Estimation of Distribution Algorithms. In Larrañaga, P. and Lozano, J. A., editors, Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers.

Mangasarian, O. L. and Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5):1-18.

Maxwell, B. and Anderson, S. (1999). Training hidden Markov models using population-based learning. In Banzhaf, W. et al., editors, Proceedings of the 1999 Genetic and Evolutionary Computation Conference, page 944, Orlando FL. Morgan Kaufmann.

McClelland, J. L. and Rumelhart, D. E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. The MIT Press.

Montana, D. and Davis, L. (1989). Training feedforward neural networks using genetic algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 762-767, San Mateo, CA. Morgan Kaufmann.

Moscato, P. (1999). Memetic algorithms: A short introduction. In Corne, D., Dorigo, M., and Glover, F., editors, New Ideas in Optimization, pages 219-234. McGraw-Hill.

Mühlenbein, H. and Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Voigt, H. M. et al., editors, Parallel Problem Solving from Nature IV, volume 1141 of Lecture Notes in Computer Science, pages 178-187. Springer-Verlag, Berlin.

Nakai, K. and Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14:897-911.

Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog Verlag, Stuttgart.

Ribeiro, B., Costa, E., and Dourado, A. (1995). Lime kiln fault detection and diagnosis by neural networks. In Pearson, D. W., Steele, N. C., and Albrecht, R. F., editors, Artificial Neural Nets and Genetic Algorithms 2, pages 112-115, Wien New York. Springer-Verlag.

Rosenblatt, F. (1959). Principles of Neurodynamics. Spartan Books, New York.


Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533-536.

Schwefel, H.-P. (1977). Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie, volume 26 of Interdisciplinary Systems Research. Birkhäuser, Basel.

Silva, F. M. and Almeida, L. B. (1990). Speeding up backpropagation. In Eckmiller, R., editor, Advanced Neural Computers. North Holland.

Whitley, D. (1999). A free lunch proof for gray versus binary encoding. In Banzhaf, W. et al., editors, Proceedings of the 1999 Genetic and Evolutionary Computation Conference, pages 726-733, Orlando FL. Morgan Kaufmann.

Whitley, D. and Hanson, T. (1989). Optimizing neural networks using faster, more accurate genetic search. In Schaffer, J. D., editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 391-396, San Mateo, CA. Morgan Kaufmann.

Whitley, D., Mathias, K., and Fitzhorn, P. (1991). Delta coding: An iterative search strategy for genetic algorithms. In Belew, R. K. and Booker, L. B., editors, Proceedings of the Fourth International Conference on Genetic Algorithms, pages 77-84, San Mateo CA. Morgan Kaufmann.

Whitley, D., Starkweather, T., and Bogart, B. (1990). Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing, 14:347-361.

Wienholt, W. (1993). Minimizing the system error in feedforward neural networks with evolution strategy. In Gielen, S. and Kappen, B., editors, Proceedings of the International Conference on Artificial Neural Networks, pages 490-493, London. Springer-Verlag.

Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82.

Yang, J.-M., Horng, J.-T., and Kao, C.-Y. (1999). Incorporating family competition into Gaussian and Cauchy mutations to training neural networks using an evolutionary algorithm. In Proceedings of the 1999 Congress on Evolutionary Computation, pages 1994-2001, Washington D. C. IEEE Press.

Zhang, B.-T. and Cho, D.-Y. (2000). Evolving neural trees for time series prediction using Bayesian evolutionary algorithms. In Proceedings of the First IEEE Workshop on Combinations of Evolutionary Computation and Neural Networks (ECNN-2000).


Index

Abductive inference
  partial, 320-321, 323
  total, 320-321, 323
AIC, 37
Artificial neural networks, 358
  training, 358
Automatic learning, 32
B algorithm, 35, 74, 111, 269, 295
Backpropagation, 358
Basin of attraction, 13
Bayesian Structural EM algorithm, 106
Bayesian model averaging, 37
Bayesian model selection, 38
Bayesian network, 25, 30, 100, 125, 198, 292, 301, 304, 320
  equivalence class, 35
  simulation, 41
BDe, 76, 149
Beam search, 294, 343
BEDA, 156
Bernoulli distribution, 56, 65
Best-first search, 268, 293
BGe, 48, 111, 295
BIC, 37, 111, 127, 269, 295
BMDA, 68
BOA, 76, 126, 149
Branch and bound, 193
Breadth-first, 268
BSC, 164, 229, 267, 348
Building block, 13, 56
Cholesky decomposition, 49
Chromosome, 6, 324
cGA, 65
Classification rule, 266
Classifier system, 310
Classifier, 265
Clique tree, 322
Clustering, 100, 103, 339
  conceptual, 340
  hierarchical, 340
  partitional, 340
CN2, 315
Codes, 14, 344
  binary, 192-193, 344
  delta, 362
  gray, 362
  integer, 11, 242
  ordinal, 344
  permutation, 11, 192, 243
  real, 11
Combinatorial field, 4
Combined complexity, 72
COMIT, 68, 228
Compressed population complexity, 72
Conditional (in)dependence, 26-27, 49, 105
  detecting, 32
Continuous domain, 117, 177
Convergence, 13, 143, 145
  completely, 18
  in mean, 18, 156
  rates, 13, 19
  reliability, 164
  velocity, 164
Covariance matrix, 15, 43
Cross-validation, 271, 299, 313
Crossover, 7, 55, 211, 343
  four-point, 10
  one-point, 7, 10, 273, 324, 349
  probability, 7, 11, 55, 300
  uniform, 10, 273
Data mining, 265, 339
Density function, 26
  conditional probability, 104
  conditional, 26
  joint, 26, 104
Dependencies
  bivariate, 62
  multivariate, 62
  without, 62
Dependency graph, 68


Depth-first, 268
Directed acyclic graph, 25, 27, 105, 128
Dissimilarity, 292
Dynamic programming, 193
Dynamical system, 13, 143, 151
  discrete, 154
  gradient, 154
EBNA, 74, 111, 126, 148, 164, 214, 229, 252, 269, 295, 314, 327, 348
EcGA, 70
Edge exclusion, 45
EGNA, 85, 111, 179, 215, 229, 253, 295
EM algorithm, 78, 88, 108, 240
EMDA, 102, 121
EMNA, 215
EMNAa, 83, 180
EMNAglobal, 82, 179
EMNAi, 84
Elitism, 9, 165, 233, 314
Error surface, 358
Evidence propagation, 320-321
Evolution Strategies, 5, 14, 177, 211, 229
Evolutionary Algorithms, 3
Evolutionary Programming, 5, 19, 35, 211, 343
Expert system, 25
FDA, 73
FDA-BC, 77
Feature Weighting, 292
Feature subset selection, 266
Feature
  irrelevant, 266, 271
  redundant, 266, 271
Feed-forward network, 358
Filter, 268, 294
Finite state machine, 20
Fitness, 4
Fixed point, 13, 154
  stable, 154
Floating selection methods, 268
Forgy algorithm, 102
Function
  Ackley, 179, 186
  EqualProducts, 135
  Griewangk, 178, 185
  OneMax, 135, 151, 166
  Rosenbrock, 178, 186
  Summation Cancellation, 178, 184
  additively decomposed, 73
  corridor, 19
  deceptive, 167
  error, 359
  multimodal, 77, 99-100
  parabolic ridge, 19
  similarity, 244
  sphere, 19, 178, 185
  strongly convex, 19
  symmetrical discrete, 101
  unimodal, 165
Gaussian distribution, 118, 252, 271
  conditional, 105
  multivariate, 14, 42, 78, 87, 252
  univariate, 15, 252
Gaussian kernel density estimator, 88
Gaussian network, 25, 42, 125, 196, 292, 304
  conditional, 88, 100
Gene, 6
Generation
  gap, 9
Genetic Algorithms, 4, 35, 191, 211, 227-228, 241, 266, 293, 310, 323, 343
  modified, 325
  steady-state, 9, 217, 254
Genetic Programming, 6
Genetic drift, 99
GENITOR, 254
Giffler and Thompson algorithm, 228
Global optimum, 18
Gradient-descent local search, 358
Graph isomorphism, 240
Graph matching, 240
  inexact, 240
Greedy search, 72, 108, 128, 193
Hebbian rule, 79
Heuristic search, 35
Hidden variable, 103
Hill climbing, 155, 293, 340, 344
Histogram distribution, 87
Human brain structures, 242
IDEA, 87
Incomplete data, 104
Individual, 6, 14
  representation, 191, 241
Job shop scheduling, 228
Junction tree, 30, 76, 322
K-means, 342
K2 algorithm, 38
Knapsack problem, 191
Kullback-Leibler cross-entropy, 40, 67, 87
Laplace correction, 148, 271
Learning rate, 360
Leave-one-out cross-validation error, 293
LFDA, 76, 126, 150
Likelihood equivalence, 35
Likelihood ratio statistic, 45
Linear-regression model, 43
Linkage information, 56, 151
Local search, 36
Log marginal likelihood, 38, 108
Logarithmic score, 40
Machine Learning, 265, 291
Magnetic Resonance Images, 253
Markov chain, 13, 18, 143
Maximized log likelihood, 36


Maximum Weight Spanning Tree, 40
Maximum a posteriori, 321
Maximum likelihood estimate, 36, 78, 82
MDL, 72, 126
Memetic Algorithms, 212, 364
Michigan approach, 311
MIMD, 130
MIMIC, 66, 148, 164, 214, 229, 251, 267, 279, 327, 348
MIMICc, 179, 215, 229, 253
MIMICf, 81
Missing data, 104
Mixture component, 77
Mixture models, 62, 78, 87
  adaptive, 87, 110
Mixtures of DAG, 110
MLC++, 274
Model complexity, 36, 72
Moralize, 28, 30
Multilayer perceptron, 358
Multinomial distribution, 105
Multiply connected, 26, 29
Mutation, 7-8, 14, 55, 211, 343, 349
  parameter, 17
  probability, 8, 11, 14, 55, 274, 300
Naive-Bayes, 271
Nearest Neighbor, 291, 309
Normal kernel distribution, 87
Normal-Wishart distribution, 48
NP-complete, 208
NP-hard, 30, 125, 192, 228, 240, 322-323
Number of clusters, 340
Numerical field, 4
Numerical optimization, 5
Objective function, 4
Observed variables, 103
Offspring, 14
Operator, 7
  random, 6
Optimization, 3-4
Overfitting, 271, 293
PADA, 74
Parametric learning, 32, 66
Pattern Recognition, 267, 291
PBIL, 64, 150, 154, 164, 228-229, 267, 279
PBILc, 80
PC algorithm, 33
Pearson's χ² statistic, 70
Penalized maximum likelihood, 36, 46-47, 127
Pittsburgh approach, 311
Polytree, 26, 29
Population, 6, 14
  finite model, 14
  infinite model, 14
  initial, 60, 191
  replacement method, 112
  size, 11, 14, 55
Precision matrix, 43, 45
Predictive accuracy, 292
Probabilistic Logic Sampling, 42, 251
Probabilistic graphical model, 25, 27, 125
  induction, 26
  model induction, 32
  simulation, 26
Probability distribution
  generalized conditional, 26, 104
  generalized, 26
  joint generalized, 26, 104
  local generalized, 27
  marginal, 71
  univariate marginal, 56, 63
Probability mass, 26
  conditional, 26, 104
  joint, 26, 104
Proper score, 40
Random variable, 26
  n-dimensional, 26
Recombination, 14
  discrete, 17
  dual, 17
  global, 17
  intermediary, 17
Recurrent network, 358
Reinforcement learning, 154
Relative frequency, 58
Residuals, 73
RIPPER, 315
Rotation angles, 14
Rotation matrix, 16
Rule antecedent, 312
Rule consequent, 312
Rule induction, 309
Running intersection property, 73
Sample
  partial correlation, 45
  covariance, 47
  mean, 47
  variance, 47
Scalability, 164-165, 172
Schema, 12
SHCLVND, 79
Score+search, 32-33, 36, 295
Score
  decomposable, 128
  equivalence, 48
Scoring rules, 40
Search space, 4
  cardinality, 242, 248, 295
Second-order statistics, 66, 68
Selection, 6-7, 14, 112
  Boltzmann tournament, 9
  Boltzmann, 156
  (μ + λ), 18


  (μ, λ), 18
  proportional, 7, 274, 311
  ranking, 9
  tournament, 9, 72, 311
  truncation, 10, 112, 165, 233
Selective model averaging, 38
Self-adaptation, 16
Separation criterion, 26, 28
Separators, 73
Sequential Backward Elimination, 268, 273
Sequential Forward Selection, 268, 273
Shannon entropy, 67
Simulated Annealing, 35, 210
Spin-flip symmetry, 102
Standard deviations, 15
Stochastic optimization, 3
Stochastic transition rule, 152
Stopping criterion, 60, 112, 233, 269
Strategy
  multimembered, 18
  parameter, 14, 20
Structure learning, 32, 66, 125-126
  parallel, 126
Supermartingale, 18
Supervised classification, 291
Supervised learning, 265
Tabu Search, 35, 211, 228
Takeover time, 10
Test
  Kruskal-Wallis, 181, 256, 368
  Mann-Whitney, 181, 256, 368
  paired t, 272, 296
Text learning, 267
Training instances, 291
Traveling salesman problem, 207
TREE, 214, 267, 279, 314, 348
Tree, 26, 28
  dependency, 68
Tree Augmented Naive Bayes, 112
Tree augmented network, 88
Triangulation, 30
TSP, 4, 243
UMDA, 63, 102, 111, 147, 164, 214, 229, 251, 314, 327
UMDAc, 78, 111, 179, 215, 229, 253
UMDAf, 79
Undirected graph, 28
Uniform distribution, 7, 60, 195, 197
Vehicle Routing Problem, 243
Wrapper, 269, 276, 304