PUGACE, A Cellular Evolutionary Algorithm framework on GPUs

Nicolás Soca, José Luis Blengio, Martín Pedemonte and Pablo Ezzatti

Abstract— Metaheuristics are used for solving optimization problems since they are able to compute near optimal solutions in reasonable times. However, solving large instances may pose a challenge even for these techniques. For this reason, metaheuristic parallelization is an interesting alternative for decreasing the execution time and providing a different search pattern. In recent years, GPUs have evolved at a breathtaking pace. Originally, they were specific-purpose devices, but in a few years they became general-purpose shared memory multiprocessors. Nowadays, these devices are a powerful low cost platform for implementing parallel algorithms. In this paper, we present a preliminary version of PUGACE, a cellular Evolutionary Algorithm framework implemented on GPU. PUGACE was designed with the goal of providing a tool for easily developing this kind of algorithm. The experimental results obtained when solving the Quadratic Assignment Problem are presented to show the potential of the proposed framework.

I. INTRODUCTION

Exact algorithms can be useless for solving large optimization problems in reasonable times due to their high computational complexity. In this context, when the use of exact algorithms is impractical, metaheuristic techniques have emerged as flexible and robust methods for solving optimization problems. Metaheuristic methods do not guarantee an optimal solution, but they find near optimal solutions efficiently. Evolutionary Algorithms (EAs) are a metaheuristic technique that has become popular in the last twenty years.

Although metaheuristics are efficient techniques, when applied to large problem instances they can still incur a significant increase in execution time. For this reason, metaheuristic parallelization is an interesting alternative for decreasing the execution time, exploiting the increasing possibilities offered by modern hardware architectures. Parallel metaheuristics can also benefit from using a different search pattern, even improving over traditional sequential methods. Cellular Evolutionary Algorithms (cEAs) are a class of inherently parallel evolutionary algorithms in which the population is structured in neighborhoods and individuals can only interact with their neighbors.

In recent years, Graphics Processing Units (GPUs) have been progressively and rapidly advancing from specialized fixed-function graphics devices to highly programmable and parallel computing devices. With the introduction of the Compute Unified Device Architecture (CUDA), GPUs are no longer exclusively programmed through graphics APIs. Now, GPUs are exposed to the programmer as a set of general-purpose, shared-memory, Single Instruction Multiple Data (SIMD) multi-core processors. The number of threads that can be executed in parallel on such devices is currently on the order of hundreds, and it is expected to multiply soon.

Nicolás Soca, José Luis Blengio, Martín Pedemonte and Pablo Ezzatti are with the Centro de Cálculo - Instituto de Computación, Universidad de la República, J. H. Reissig 575, Montevideo, Uruguay (phone: +598-2 711 42 44; email: {nsoca,jblengio,mpedemon,pezzatti}@fing.edu.uy).

This paper presents a preliminary version of PUGACE, a Cellular Evolutionary Algorithm framework implemented on a GPU.

The paper is organized as follows. The next section gives background on graphics hardware. Section III briefly describes EAs and cEAs, and reviews related work on evolutionary techniques on GPUs. Then, the proposed framework is described in Section IV. Section V presents the experimental analysis, reporting quality and performance results for a test problem. The paper ends with the conclusions of the research and suggestions for future work in Section VI.

II. GPUS

In recent years, there has been a rise of hardware accelerators like graphics processors. This growth has been mainly based on two key issues: the GPU architecture is intrinsically parallel, in contrast to the serial-based architecture of CPUs; and the game industry has pushed graphics processors to increase their capabilities in order to make faster and more realistic games.

Old graphics cards used to have a fixed graphics pipeline, that is to say, the operations and the order in which they are applied over the data were preconfigured. In the last ten years, GPUs have changed impressively. The major key changes are described next.

Concerning programming capabilities, at first GPUs only provided their own transformation and lighting operations (vertex shaders) to be performed on vertices, and their own pixel shaders to determine the final pixel colors. Now, in contrast, GPUs have many multiprocessors that can be programmed for general-purpose computing.

With respect to accuracy, GPUs originally worked with 8-bit numbers and then switched to the 16-bit floating point format. Subsequently, they supported the single precision floating point system (32 bits); currently, GPUs fully support standard IEEE double precision arithmetic (64 bits).

Regarding GPU programming tools, shaders originally had to be written in assembly language. With the constantly rising functionality provided by GPUs, higher level programming languages were developed, such as the High-Level Shading Language (HLSL) and nVidia's Cg [1]. Another alternative was to directly use computer graphics tools, such as OpenGL [2] or DirectX [3]. Later, other high level languages emerged based on turning the GPU into a stream processor, such as Brook [4], Sh [5], PyGPU [6], the Accelerator languages [7], and CTM and ATI Stream Technology [8]. Finally, CUDA [9] and OpenCL [10] were developed.

The Compute Unified Device Architecture (CUDA) is the API from nVidia for exposing the processing features of the G80 GPU. CUDA is a C language API that provides services ranging from common GPU operations in the CUDA library to traditional C memory management semantics in the CUDA runtime and device driver layers. Additionally, nVidia provides a specialized C compiler to build programs developed for the GPU.

As mentioned before, current GPUs can be considered shared memory multi-core processors. The processing is based on hundreds of threads that are grouped into blocks. The memory hierarchy is an important attribute of modern GPUs. Nowadays, GPUs have four levels of memory: registers, shared block memory, local memory and global memory. Registers are the fastest memory on the multiprocessor and are only accessible by each thread. Shared memory is almost as fast as registers and can be accessed by any thread of the block. The local and global memories are the slowest memories on the multiprocessor (more than 100 times slower than shared memory); while local memory is private to each thread, global memory is accessible to all threads on the GPU.
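As a concrete illustration of these memory levels, the following minimal CUDA sketch (ours, not from the paper) touches global, shared and register storage in a single kernel; it assumes blocks of at most 128 threads.

    /* Illustrative only: the three explicitly managed memory levels.
       (Local memory is used implicitly when registers spill.) */
    __global__ void memory_levels(const float *in, float *out)
    {
        __shared__ float tile[128];      /* shared block memory: fast, per block */
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float r = in[tid];               /* `r` lives in a register (fastest);
                                            the read of in[] is from slow global memory */
        tile[threadIdx.x] = 2.0f * r;    /* stage the value in shared memory     */
        __syncthreads();                 /* make the tile visible block-wide     */
        out[tid] = tile[threadIdx.x];    /* write back to global memory          */
    }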

Information about general-purpose computation on GPUs can be found in Owens et al. [11]. The continuous development of the area can be followed on the GPGPU organization website [12].

III. CELLULAR EVOLUTIONARY ALGORITHMS

This section presents a description of EAs and cEAs. After that, it summarizes previous work on the implementation of evolutionary techniques on GPUs.

A. Evolutionary Algorithms

Evolutionary algorithms are stochastic search methods inspired by the natural process of evolution of species. EAs iteratively evolve a population of individuals representing candidate solutions to the optimization problem. The evolution process is guided by a survival-of-the-fittest principle applied to the candidate solutions, and it involves the probabilistic application of operators to find better solutions.

The initial population is randomly generated. Each iteration is divided into four stages. In the first stage, every individual in the population is associated with a fitness value that measures the quality of the candidate solution. Afterward, solutions are selected from the population based on their fitness values, usually giving higher priority to higher quality solutions. In the third stage, new solutions are constructed by applying evolutionary operators to the selected solutions. Typically, the evolutionary operators used are crossover (recombination of parts of two individuals) and mutation (random changes in a single individual). Finally, in the fourth stage, the new population is created by replacing the worst adapted individuals with the solutions generated in the iteration.
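The four stages can be made concrete with a small sketch. The following toy C program (our illustration, not PUGACE code) runs one generation on a bitstring problem; binary tournament selection stands in for a fitness-based policy, and replacement is fully generational rather than worst-individual.

    #include <stdlib.h>
    #include <string.h>

    #define POP 64   /* population size   */
    #define LEN 32   /* chromosome length */

    /* Stage 1: evaluation -- fitness is the number of ones. */
    static int fitness(const char *ind) {
        int f = 0;
        for (int i = 0; i < LEN; i++) f += ind[i];
        return f;
    }

    /* Stage 2: selection -- binary tournament favors higher fitness. */
    static const char *tournament(char pop[POP][LEN]) {
        const char *a = pop[rand() % POP], *b = pop[rand() % POP];
        return fitness(a) > fitness(b) ? a : b;
    }

    /* Stages 3 and 4: build the offspring and replace the population. */
    void one_generation(char pop[POP][LEN]) {
        char next[POP][LEN];
        for (int i = 0; i < POP; i++) {
            const char *p1 = tournament(pop), *p2 = tournament(pop);
            int cut = rand() % LEN;                 /* one-point crossover */
            memcpy(next[i], p1, cut);
            memcpy(next[i] + cut, p2 + cut, LEN - cut);
            if (rand() % 100 < 5)                   /* mutation, p = 0.05  */
                next[i][rand() % LEN] ^= 1;         /* flip a random bit   */
        }
        memcpy(pop, next, sizeof(next));            /* synchronous replacement */
    }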

B. Cellular Evolutionary Algorithms

There are different proposals for structuring the EA population. Single population or panmixia works with a single population in which there is no group structure and, therefore, any individual can be mated for reproduction with any other in the population. On the other hand, in the island model or distributed EAs, the population is split into several subpopulations called islands. Each subpopulation works independently, and some solutions are exchanged among them with a given frequency.

Another alternative is known as cEAs [13]. cEAs work with a single population structured in many small overlapped neighborhoods. Each individual is placed in a cell on a toroidal n-dimensional grid. Each individual belongs to several neighborhoods, but it has its own neighborhood for reproduction. A given individual can only be mated for reproduction with individuals of its neighborhood. As a consequence of the neighborhood overlapping, the effect of finding high-quality solutions gradually spreads to other neighborhoods along the grid following a diffusion model. Figure 1 presents an example of a cEA population on a 2D grid.

Fig. 1. cEA population on a 2D grid.
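For illustration, the toroidal wrap-around of such a grid reduces to simple index arithmetic. The following C helpers (ours, not from the paper) compute the Von Neumann neighbors of a cell on a W x H torus:

    /* Our illustration: index arithmetic for a toroidal W x H grid. Each
       cell interacts only with its Von Neumann neighborhood (N, S, E, W). */
    static int cell(int x, int y, int W, int H) {
        return ((y % H + H) % H) * W + ((x % W + W) % W);  /* wrap both axes */
    }

    void von_neumann(int x, int y, int W, int H, int out[4]) {
        out[0] = cell(x - 1, y, W, H);   /* west  */
        out[1] = cell(x + 1, y, W, H);   /* east  */
        out[2] = cell(x, y - 1, W, H);   /* north */
        out[3] = cell(x, y + 1, W, H);   /* south */
    }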

C. Related work: evolutionary computation on GPUs

This subsection reviews some of the preceding efforts in the design of parallel implementations of EAs on GPUs. This is a relatively new approach and there are few studies on the issue. Therefore, the survey covers not only works that have proposed parallel EAs on GPUs, but also other evolutionary computation (EC) techniques. In recent times, several authors have presented interesting contributions on EC on GPUs; most of the work in this area can be found at the GPGPGPU website [14].

One of the pioneering works in this area is the proposal of Wong et al. [15], where the authors studied an Evolutionary Programming (EP) algorithm for solving five simple test functions. In their proposal, denominated Fast Evolutionary Programming (FEP), the selection process takes place on the CPU while the fitness function evaluation and the mutation are performed on the GPU, following a master-slave parallel model. The authors noticed that the population should be stored in the device memory to obtain high performance, due to the low data transfer rate between the CPU and the device memory. For this reason, FEP uses textures to store the population and one additional texture to store the fitness values of all the population individuals. The experimental evaluation was performed on a Pentium IV at 2.4 GHz and a GeForce 6800 Ultra card, comparing the parallel FEP and a sequential implementation. The results showed speedup values ranging between 0.62x and 5.02x, depending on the population size and the problem complexity. Later [16], the authors proposed a parallel Hybrid Genetic Algorithm (HGA) on GPU, where the whole evolutionary process is executed on the GPU while the generation of random numbers runs on the CPU. HGA is a Genetic Algorithm that incorporates a Cauchy mutation operator. The authors compared their proposal against a full CPU implementation and the FEP method on an AMD Athlon 64 3000+ with 1 GB of RAM and a GeForce 6800 Ultra graphics card with 256 MB. The results showed that HGA reaches speedup values between 1.14x and 4.24x for population sizes between 400 and 6400 individuals. Finally, in a later work [17] they extended the experimental evaluation considering more difficult instances, reaching speedup values of 5.50x.

Harding and Banzhaf [18] implemented a Genetic Programming (GP) algorithm for evolving computer programs represented as tree structures, following a master-slave model on GPU. In their proposal, the GPU is only used to evaluate the fitness function. Several problems were considered, involving different types of fitness functions such as floating point, boolean expressions and a real world test. The purpose of the conducted tests was to compare the time required for evaluating a tree of a given size for a fixed number of fitness cases on the CPU and on the GPU. The experimental platform was an Intel Centrino T2400 and an nVidia GeForce 7300 GO graphics card. The results showed significant reductions in the time required for evaluating the fitness function on the GPU, with speedup values in the order of tens for the real world tests, in the order of hundreds for boolean expressions and in the order of thousands for the floating point type, when compared against the sequential algorithm implemented on a CPU. Following a similar idea, the authors [19] studied a GP variant to generate shader programs to obtain image filters for noise reduction, but that work focuses on the specific details of solving the problem.

Also following a master-slave model, Maitre et al. [20] extended EASEA (a metalanguage to implement EAs) to automatically exploit GPU capabilities. The proposal was validated by solving two cases: the Weierstrass-Mandelbrot problem and a real world chemistry problem of structure determination. The experimental evaluation was aimed at studying the scalability of the proposal, considering only the execution time of the fitness function evaluation. The execution platform was a standard 3.6 GHz PC. Two different graphics cards were considered in the experiments, an nVidia 8800 GTX and an nVidia GTX 260. In the Weierstrass-Mandelbrot problem, when using a population of 4096 individuals, the speedup of the evaluation stage reached values of 33x and 100x on the 8800 GTX and on the GTX 260, respectively. In the real world chemistry problem, when considering a population size of 20,000 individuals and using an initial codification, the fitness function evaluation stage took 23 s in the CPU implementation and 80 s in the GPU implementation. However, the authors were able to reduce the execution time to 7.66 s for the CPU implementation and 0.33 s for the GTX 260 implementation by changing the data structures used.

The first work implementing a cellular EA on GPUs was by Yu et al. [21]. The authors studied the Colville minimization problem, placing the individuals in a toroidal 2D grid and using the classical Von Neumann neighborhood structure with five cells. The individuals were represented using a set of two-dimensional texture maps. Each chromosome was divided into several segments, which were assigned to different textures in the same position. Each segment contained four genes packed into one pixel as the RGBA color space. An additional texture was used to store the fitness value of each individual. The experiments, performed on an AMD Athlon 2500 computer and an nVidia GeForce 6800 GT card, showed speedup values of up to 20x for populations of 512x512 individuals.

In a similar line of work, Li et al. [22] proposed a cEA completely implemented on GPU for solving the Schwefel, Shaffer and Camel functions. The execution platform was a PC with a Pentium IV 2.66 GHz with 256 MB of RAM and an nVidia GeForce 6800 card. This proposal obtained speedup values ranging between 1.4x and 5.4x for population sizes between 200 and 800, speedup values of 9.6x to 16.9x for populations with 3,200 individuals, and speedup values ranging between 21x and 73x for a population with 10,000 individuals.

The proposal by Li et al. [23] follows a different line within fine-grained parallelism on GPU. The authors studied an immune strategy algorithm applied to a GA for solving the Travelling Salesman Problem. The execution platform was an AMD 3000+ with a 9600 GT card, and the speedup values obtained ranged between 2.42x and 11.5x.

The island model is the parallelization strategy most recently implemented on GPU. Tsutsui and Fujimoto [24] presented a parallel island model implementation of a GA for solving the Quadratic Assignment Problem. The execution platform was a computer with an Intel i7 965 processor and an nVidia GTX 285 card. The authors obtained speedup values between 3x and 12x when using populations of sizes between 3,840 and 19,200 for solving instances with up to 40 locations.

Another proposal of a parallel island model was made by Lewis et al. [25], who studied the technique known as genetic cyclic programming. In this work, the authors studied a full GPU implementation, a hybrid GPU-CPU implementation, and a hybrid GPU-CPU implementation with two CPUs and two GPUs. The reported results are really impressive, reaching speedup values in the order of hundreds. The paper also presents an interesting discussion on how to evaluate performance when comparing GPU and CPU implementations.

Finally, to our knowledge, Wong [26] presented the only proposal of Multiobjective Evolutionary Algorithms (MOEAs) on GPUs. The author proposed an algorithm inspired by NSGA-II, implemented with CUDA. Two different strategies to implement the non-dominance checking on GPU were discussed. This is an important operation in MOEAs, since it is used to quantify the quality of the solutions. The experimental evaluation was performed on a computer with an Intel Pentium Dual E2220 processor with 2 GB RAM and a GeForce 9600 GT graphics card with 512 MB of RAM. The author reported speedup values between 5.65x and 10.75x when using population sizes between 4,096 and 16,384 for solving the standard multiobjective test problems ZDT and DTLZ.

The survey of related works allows us to conclude that all standard parallelization strategies for EAs have already been implemented on GPU, including the master-slave model ([15], [16], [17], [18], [20]), the cellular model ([21], [22], [23]) and the island model ([24], [25]). The EC technique that has been used most in parallel implementations on GPU is EP. So far, there is only one work that proposed a generic framework or an automatic code generator ([20]) to simplify the implementation of EAs on GPUs.

The reviewed articles show that EA implementations on GPUs systematically obtained high speedup values, thereby drastically reducing the execution time. In particular, cEA implementations obtained speedup values of up to 11.5x ([23]), 20x ([21]) and 73x ([22]), respectively, making the study of this model attractive. On the other hand, there have been no proposals of a generic framework for EAs implemented on GPUs, and therefore our proposal is a novel approach.

IV. PUGACE

PUGACE is a generic framework for implementing cEAs to solve optimization problems on GPUs. The generic approach is in line with MALLBA (a library of skeletons for sequential and parallel optimization algorithms) [27] and JCell (a cellular genetic algorithm framework) [13]. PUGACE is implemented in C and uses CUDA (version 2.1) to manage the GPU.

PUGACE supports working with different problem encodings, selection policies, and crossover and mutation operators, as well as setting different parameters such as population size, number of generations, and mutation and crossover probabilities. The framework includes the implementation of several evolutionary operators, and it can be extended to incorporate additional operators. PUGACE also supports using a local search mechanism to improve the solutions.

The main features of the proposed framework are presented in the following subsections.

A. PUGACE architecture

The design of PUGACE focuses on implementing an extensible and easy to use framework. To achieve the first objective, new evolutionary operators and neighborhood structures can be easily incorporated into the framework. To achieve the second objective, the framework implementation is separated into several modules that encapsulate different functionalities. However, a complete separation was not entirely possible due to some CUDA (version 2.1) limitations, such as the absence of dynamic memory allocation on the GPU and the impossibility of defining a device function in a different module than the one that invokes it.

Figure 2 presents a diagram of the PUGACE modules.

Fig. 2. Diagram of the PUGACE modules.

The main program is main.cu, which reads the configuration file, calls the user-defined functions, creates the data structures for the population and the fitness values in the device memory, and calls the GPU procedures. Both the cEA and framework parameters are set in the file config.txt.

All the functions and procedures that execute on the GPU are implemented in kernels.cu. The methods that are problem-dependent, such as the fitness function and the local search procedure, as well as the methods that are problem-independent, such as new evolutionary operators incorporated into the framework, are implemented in this file. Ideally, these functions and procedures would be implemented in separate files, but that is not possible due to the CUDA limitation already mentioned.

Some additional features, like a procedure to generate the initial population in a specific way, can be implemented in the file user_functions.cu.

Some required parameters of the framework are set in the file params.h, due to the CUDA limitations already mentioned. The procedures that generate the sequences of random numbers are implemented in the file rannum.cu.

To instantiate a new problem, at least a fitness function must be implemented in kernels.cu, and both the cEA and the framework parameters must be configured in config.txt.
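The paper does not document the layout of config.txt, but based on the parameters it lists, a configuration file along these lines could be expected (hypothetical example; the actual key names and syntax are our assumption):

    # Hypothetical config.txt -- key names are illustrative only.
    population_size     = 2048
    generations         = 5000
    crossover_prob      = 0.9
    mutation_prob       = 0.1
    neighborhood_length = 4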

The PUGACE design follows these guidelines:

• The population of chromosomes always resides in the device memory; it is transferred from the CPU to the GPU at the beginning of the algorithm, and vice versa at the end.

• Each individual uses a different execution thread, and threads are organized in blocks of varying size.

• To improve performance, the framework exploits the different GPU memory levels. In particular, the problem information is preloaded into constant memory.

• The crossover and mutation operators are applied probabilistically, so some threads may apply them while others do not, or crossover may occur at different points, which can cause divergent threads within a block. To avoid thread divergence, the application of the crossover and mutation operators is decided at the block level; only one decision per operator is made for each block. The crossover points are also selected at the block level, that is to say, the same crossover points are used within a block. On the other hand, the mutation points and the new mutation values are selected independently for each thread. A sketch of this block-level decision is given below.
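The following minimal CUDA kernel is our illustration of that block-level decision, not actual PUGACE source; it assumes a pool rnd[] of pre-generated uniform numbers in [0,1), two per block.

    __global__ void crossover_kernel(int *pop, const float *rnd,
                                     float pc, int chrom_len)
    {
        __shared__ int do_cross;   /* one decision per block        */
        __shared__ int cut;        /* one crossover point per block */
        if (threadIdx.x == 0) {
            do_cross = rnd[2 * blockIdx.x] < pc ? 1 : 0;
            cut = (int)(rnd[2 * blockIdx.x + 1] * chrom_len);
        }
        __syncthreads();           /* every thread sees the same decision */
        if (!do_cross) return;     /* the whole block skips together      */
        /* ... each thread would now recombine its pair of selected
           parents in pop[] at the shared point `cut`, so no thread
           in the block diverges (recombination body elided) ... */
    }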

Designing a highly flexible and general framework usually conflicts with obtaining implementations that are as efficient as possible. Since this is a first approach to developing the PUGACE framework, the generality of the design is favored over efficiency. For this reason, some GPU aspects were not taken into account when designing PUGACE, such as maximizing the utilization of shared block memory and coalescing the accesses to global memory. Incorporating such aspects into the PUGACE design would improve the efficiency of the implementation, so it is planned to study how to take these aspects into account in a future version of the framework.

B. Population and neighborhood structure

The population is arranged in a circular 1-dimensional structure, according to ideas proposed by Baluja [28]. The individuals from both ends of the population are copied to the opposite end to simplify the application of the operators. The number of copied individuals at each end varies depending on the length of the neighborhood structure used. Figure 3 shows an example of a population with four copied individuals (chromosomes 1, 2, N-1, and N). Copying individuals helps avoid thread divergence, since all neighbors are close. However, it requires an additional step in each generation in which the copies acquire the values of the original individuals.

Fig. 3. Circular 1-dimensional population structure.

The population can be randomly initialized or specifically generated. The second option is useful for incorporating a heuristic specific to the optimization problem. After the population is initialized, the individuals are copied into the device global memory, where they reside until the evolutionary process ends.
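The per-generation refresh of the copied border individuals can be sketched as a simple CUDA kernel. This is our illustration under an assumed flat layout (R copies, then POP real individuals, then R copies, each of LEN genes), not actual PUGACE code.

    /* Launch with at least R*LEN threads: one thread per gene of the
       copied regions. */
    __global__ void refresh_borders(int *pop, int POP, int R, int LEN)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= R * LEN) return;
        pop[t] = pop[POP * LEN + t];                  /* left copies mirror the
                                                         last R real individuals  */
        pop[(R + POP) * LEN + t] = pop[R * LEN + t];  /* right copies mirror the
                                                         first R real individuals */
    }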

The neighborhood structure consists of a configurable number of individuals to the left and to the right. The neighborhood structure length is a parameter of the framework that also determines the number of individuals that must be copied on both ends.

C. Fitness function

The values of the fitness function are stored in an auxiliary vector in the same order as the chromosome population. Like the population, the fitness values vector is stored in the device memory.

The fitness function evaluation uses an independent thread for each chromosome of the population, except for the copies. Each thread computes the fitness function and stores the value at the corresponding chromosome position of the fitness vector. The fitness function is problem-dependent and must be implemented for each problem.
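A minimal sketch of this one-thread-per-individual scheme is shown below (our illustration; the placeholder eval() stands in for the problem-dependent fitness function the user must supply):

    /* Placeholder fitness: sum of genes. A real problem supplies its own. */
    __device__ float eval(const int *chrom, int len)
    {
        float f = 0.0f;
        for (int g = 0; g < len; g++) f += chrom[g];
        return f;
    }

    /* pop points at the first real (non-copied) individual; pop_size
       counts real individuals only, so copies are never evaluated. */
    __global__ void evaluate(const int *pop, float *fit, int pop_size, int len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per individual */
        if (i < pop_size)
            fit[i] = eval(pop + i * len, len);          /* same order as population  */
    }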

D. Evolutionary operators

Different alternatives for the selection, crossover and mutation operators were included. PUGACE implements a minimal number of different operators, but the framework can be easily extended with new implementations.

The selection operator is applied at each population position to choose two individuals for reproduction. Three selection operators are provided by PUGACE. All of them choose the individual at the considered position as one of the mating individuals. Depending on the operator employed, the other individual can be chosen randomly, proportionally to fitness, or as the best one from the neighborhood.

The crossover operator is applied to recombine the individuals selected for reproduction, obtaining two new solutions. PUGACE provides implementations of the classic one-point and two-point crossovers, as well as the Partially Matched Crossover (PMX) [29] for permutation encodings.

The mutation operator is applied after reproduction to both offspring. PUGACE provides two different mutation operators. The first one replaces one of the chromosome values with a random value. The second one exchanges two randomly selected values of the chromosome.
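The two operators can be sketched as device functions along the following lines (our illustration, not PUGACE source; r0 and r1 are assumed to be pre-drawn uniform random numbers in [0,1), and genes are assumed to be integers in [0, max_val)):

    /* Operator 1: replace a random gene with a random value. */
    __device__ void mutate_value(int *chrom, int len, int max_val,
                                 float r0, float r1)
    {
        chrom[(int)(r0 * len)] = (int)(r1 * max_val);
    }

    /* Operator 2: exchange two random genes. This variant preserves
       permutation feasibility, which matters for encodings like QAP. */
    __device__ void mutate_swap(int *chrom, int len, float r0, float r1)
    {
        int i = (int)(r0 * len), j = (int)(r1 * len);
        int tmp = chrom[i];
        chrom[i] = chrom[j];
        chrom[j] = tmp;
    }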

PUGACE uses a generational replacement strategy. Each parent is replaced by the best of its children. The replacement is synchronous, that is to say, all children replace their parents at the same time.

The framework allows incorporating a local search operator to improve the solutions during the evaluation. Obviously, as such operators are problem-dependent, they have to be implemented for each problem.

E. Random number generation

As the GPU does not provide a pseudo-random number generator, two different alternatives are implemented in PUGACE to tackle this problem.

The first alternative is that pseudo-random numbers are generated on the CPU and transferred to the GPU in each generation, taking advantage of CPU idle times.

The second option is to generate pseudo-random numbers directly on the GPU with a specific algorithm based on a linear congruential method.
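A per-thread linear congruential generator is straightforward on a GPU. The following sketch is our illustration of the idea; the paper does not give the constants, so the classic Numerical Recipes ones are used here as an assumption. Each thread would keep its own state, seeded for instance from its global thread id.

    __device__ unsigned int lcg_next(unsigned int *state)
    {
        *state = 1664525u * (*state) + 1013904223u;  /* x' = a*x + c (mod 2^32) */
        return *state;
    }

    __device__ float lcg_uniform01(unsigned int *state)
    {
        /* Map the 32-bit state to a float in [0, 1). */
        return lcg_next(state) * (1.0f / 4294967296.0f);
    }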

V. EXPERIMENTAL RESULTS

This section presents the experimental results. It introduces the optimization problem chosen to evaluate the proposed framework, the classical Quadratic Assignment Problem. It also presents details on the cEA parametrization for solving this problem. Finally, the quality and performance results are discussed.

A. Quadratic Assignment Problem

The Quadratic Assignment Problem (QAP) [30] models many applications in diverse areas such as operations research, parallel and distributed computing, and combinatorial data analysis. QAP is an NP-hard problem.

The problem is to assign N facilities to N locations, with costs proportional to the flow between facilities multiplied by the distances between locations. The goal is to assign each facility to a different location such that the total cost is minimized.

This assignment problem can be modeled by two N x N matrices:

• A = (a_{ik}), where a_{ik} is the flow from facility i to facility k;

• B = (b_{jl}), where b_{jl} is the distance from location j to location l.

The objective function is defined by Equation 1, where S_n is the set of all permutations of the integers 1, 2, ..., n. Each product a_{ik} b_{\phi(i)\phi(k)} is the transportation cost associated with placing facility i at location \phi(i) and facility k at location \phi(k). Each term \sum_{k=1}^{n} a_{ik} b_{\phi(i)\phi(k)} is the total cost for installing facility i.

    \min_{\phi \in S_n} \sum_{i=1}^{n} \sum_{k=1}^{n} a_{ik} b_{\phi(i)\phi(k)}    (1)

Instances of the QAP with input matrices A and B are generally denoted by QAP(A, B). The QAPLIB [31] library contains several test sets with instances generated according to different criteria: symmetric matrices, asymmetric matrices, randomly generated, rectangular distances, real world data, etc. For the PUGACE evaluation, we selected several instances with up to 50 facilities and locations from five different test sets.

B. Problem encoding

A permutation representation was used for encoding QAP solutions. A feasible solution is represented as an integer array; each integer value indicates the location where the facility given by the array index is placed.
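Under this encoding, evaluating Equation 1 for one solution is a double loop over the flow and distance matrices. The following C sketch is our illustration, with flattened row-major matrices; phi[i] is the location assigned to facility i.

    long long qap_cost(int n, const int *A, const int *B, const int *phi)
    {
        long long cost = 0;
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)          /* a_ik * b_phi(i)phi(k) */
                cost += (long long)A[i * n + k] * B[phi[i] * n + phi[k]];
        return cost;
    }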

The population was initialized by generating random feasible permutations. The evolutionary operators used were selection proportional to the fitness of the neighbors, the PMX crossover operator, and the mutation operator that exchanges two randomly selected values.

A local search method was implemented to improve the solutions. The local search randomly selects a chromosome position and evaluates all possible exchanges between the selected position and the rest. The exchange that produces the largest improvement in the solution fitness is applied. The chosen position is the same for the whole population, but it varies across generations. For efficiency reasons, the local search method is applied every ten generations.
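A sketch of this local search step, reusing the qap_cost() helper above, is shown below (our illustration, assuming minimization of the QAP cost):

    /* Position p is fixed for the whole population in a given generation. */
    void local_search(int n, const int *A, const int *B, int *phi, int p)
    {
        long long best = qap_cost(n, A, B, phi);
        int best_j = -1;
        for (int j = 0; j < n; j++) {
            if (j == p) continue;
            int t = phi[p]; phi[p] = phi[j]; phi[j] = t;   /* try the exchange   */
            long long c = qap_cost(n, A, B, phi);
            if (c < best) { best = c; best_j = j; }        /* lower cost is better */
            t = phi[p]; phi[p] = phi[j]; phi[j] = t;       /* undo the exchange  */
        }
        if (best_j >= 0) {                                 /* apply the best move */
            int t = phi[p]; phi[p] = phi[best_j]; phi[best_j] = t;
        }
    }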

C. Parameter settings

A configuration analysis was carried out to determine the best parameter values for solving the QAP. The parameters considered in the analysis were: population size (512, 1024 and 2048), crossover probability (0.7, 0.8 and 0.9), mutation probability (0.05, 0.1 and 0.2) and neighborhood length (2 and 4).

The configuration analysis considered two instances from two different test sets of QAPLIB, Chr25a and Kra30. The parameter values were combined following the orthogonality principle. Ten independent executions were run for each configuration, using a stop criterion of reaching 500 generations. From the comparison of the best solution found and the average solution cost for each configuration, the best parameter values were determined to be: population size = 2048, crossover probability = 0.9, mutation probability = 0.1 and neighborhood length = 4. No specific experiments were conducted to determine the optimal values for the number of threads per block and the number of thread blocks.

The results also showed that there is a direct relationship between the mutation probability value and the runtime of the algorithm: with a mutation probability of 0.1, the execution time is 5% larger than with 0.05, while with a probability of 0.2 it is 10% larger than with 0.1.

The stopping criterion adopted was reaching a specified number of generations. A high limit for the number of generations (5000) was used, in order to reach high quality solutions.

Table I summarizes the parameter values used in this work.

D. Execution platform

The execution platform used was a Pentium dual-core at 2.5 GHz with 2 GB of RAM and an nVidia GeForce 9800 GTX+ graphics card.


TABLE I
CEA PARAMETERS USED FOR SOLVING THE QAP.

population size          2048
generations              5000
thread blocks            32
threads per block        64
crossover probability    0.9
mutation probability     0.1
neighborhood length      4

E. Solution quality

This subsection presents and discusses the results obtained by the cEA implemented on PUGACE for solving fourteen QAP instances.

Table II presents the results obtained by the cEA in ten independent runs for each QAP instance studied. The table includes the number of facilities and locations of the instance (N), the best known solution, the number of executions in which the best known solution was found (# Hits), and the gap with respect to the best known solution, defined by Equation 2.

    \mathrm{Gap} = \frac{\mathrm{Solution} - \mathrm{Best\ Known\ Solution}}{\mathrm{Best\ Known\ Solution}}    (2)

TABLE II
RESULTS OF CEA FOR SOLVING QAP.

Instance   N    Best known solution   # Hits   Gap (%)
Bur26a     26   5426670               7        0.00
Chr12a     12   9552                  8        0.00
Chr15a     15   9896                  7        0.00
Chr18a     18   11098                 10       0.00
Chr20a     20   2192                  8        0.00
Esc16f     16   0                     6        0.00
Esc32a     32   130                   4        0.00
Had12      12   1652                  8        0.00
Had14      14   2724                  4        0.00
Had16      16   3720                  6        0.00
Lipa30a    30   13178                 2        0.00
Lipa30b    30   151426                1        0.00
Lipa40a    40   31538                 0        1.05
Lipa50b    50   1210244               1        0.00

The implemented algorithm reached the best known solution in 13 out of the 14 problem instances considered. Only in one instance was the best known solution not found, and even there the gap was small (1.05%). The results are acceptable, but they start to degrade for larger instances, possibly because the approach followed for solving the QAP was too simple. On the other hand, the population size is relatively smaller than those used by other authors for GPU implementations of EAs. Increasing the population size could help improve the quality of the solutions by increasing the diversity of the population.

The proposed framework showed its potential for easily implementing a cEA that obtains good quality solutions for a classical combinatorial optimization problem.

F. Performance results

This subsection presents and compares the computational performance of the cEA implemented on GPU and a sequential cEA implemented on CPU. The tests were conducted to evaluate the reductions in runtime that can be obtained by implementing a cEA on PUGACE rather than on a CPU.

Table III presents the execution time of the cEA implemented on GPU, the execution time of the sequential cEA implemented on CPU, and the algorithm speedup (sequential execution time / parallel execution time), evaluated over ten independent runs for the QAP instances studied. The speedup values reported consider the total runtime of the algorithm, including the transfer time.

TABLE III
PERFORMANCE RESULTS OF CEA IMPLEMENTED ON GPU AND ON CPU.

Case      GPU time (s)   CPU time (s)   Speedup
Bur26a    16.961         270.220        15.93
Chr12a    2.559          45.943         17.95
Chr15b    4.802          74.152         15.44
Chr18a    6.092          112.142        18.41
Chr20a    7.853          143.285        18.25
Esc16f    5.905          88.983         15.07
Esc32a    28.060         455.564        16.24
Had12     2.532          46.081         18.20
Had14     3.444          63.820         18.53
Had20     7.963          143.281        17.99
Lipa30a   21.599         385.926        17.87
Lipa30b   21.044         386.632        18.37
Lipa40a   46.852         808.803        17.26
Lipa50b   91.429         1460.298       15.97

The speedup values obtained for the different QAP instances ranged between 15 and 19, showing the potential of the proposed framework for using GPU devices to significantly reduce the runtimes of cEAs. Although the speedup values are high, there are several areas for improvement, such as maximizing the utilization of shared block memory and coalescing the accesses to global memory.

Another important result concerning PUGACE's performance was found during the calibration stage. In the GPU implementation, an increase in the population size produces a sublinear increase in the execution time. For example, doubling the number of individuals in the population (from 512 to 1024, and from 1024 to 2048) implies runtime increments of less than 10%. Therefore, the effect of increasing the population size on the solution quality and the algorithm performance should be studied thoroughly.

VI. CONCLUSION AND FUTURE WORK

This work presented PUGACE, a general framework for cEAs implemented on GPU. The aim of our proposal is to significantly reduce cEA runtimes by exploiting the potential of current multi-core GPUs. This work also includes a study of the algorithmic behavior and performance when implementing a cEA on PUGACE for solving the QAP, to show the potential of the proposed framework.


Analyzing the quality and performance results obtained for the QAP, some conclusions can be drawn on the applicability of the proposed framework. PUGACE is a useful tool for implementing cEAs for solving optimization problems, and a viable option for easily exploiting GPUs. In addition, high speedup values were obtained, showing that cEAs are highly suitable for implementation on GPUs, with large reductions in the execution time.

Three main areas deserve further work: the evaluation of the applicability of PUGACE, the extension of the proposed framework, and experimentation with other hardware configurations.

In the first place, in order to evaluate the applicability and validate the generality of PUGACE, the framework must be applied to solve other optimization problems. The problems should be carefully chosen, ensuring that the fitness functions are complex enough to exploit the GPU's potential.

A second issue that deserves further work is the extension of the proposed framework. Currently, the framework only supports linear neighborhood structures. New neighborhood structures should be implemented, and new evolutionary operators should be incorporated into the framework, to make PUGACE a more powerful tool. Also, some aspects of CUDA behavior that were not considered in this work should be taken into account, such as maximizing the utilization of shared block memory and coalescing the accesses to global memory.

Finally, further work has to be done to extend the experiments to other hardware configurations. The framework was designed to encapsulate the GPU particularities, but the experiments performed in this work used only one GPU model. New experiments should be run on devices with diverse features to validate the implementation of the framework. Complementarily, it is planned to investigate how to extend PUGACE to support different platforms involving GPU usage, such as one CPU with multiple graphics cards (e.g. Tesla) and a cluster of PCs with graphics cards.

ACKNOWLEDGMENT

The authors would like to thank Ph.D. Sergio Nesmachnow for his help. We also thank the anonymous reviewers for their valuable suggestions, which improved this work.

REFERENCES

[1] "Cg website," http://developer.nvidia.com/page/cg_main.html. Accessed May 2010.
[2] "OpenGL website," http://www.opengl.org/. Accessed May 2010.
[3] "DirectX website," http://developer.nvidia.com/page/directx.html. Accessed May 2010.
[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream computing on graphics hardware," ACM Transactions on Graphics, vol. 23, pp. 777–786, 2004.
[5] "Sh website," http://libsh.org/. Accessed May 2010.
[6] "PyGPU website," http://www.cs.lth.se/home/Calle_Lejdfors/pygpu/. Accessed May 2010.
[7] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: using data parallelism to program GPUs for general-purpose uses," in ASPLOS-XII: Proceedings of the 12th Int. Conf. on Architectural Support for Programming Languages and Operating Systems. ACM, 2006, pp. 325–335.
[8] "ATI Stream Technology website," www.amd.com/stream. Accessed May 2010.
[9] "CUDA website," http://www.nvidia.com/object/cuda_home_new.html. Accessed May 2010.
[10] "OpenCL website," http://www.nvidia.com/object/cuda_opencl_new.html. Accessed May 2010.
[11] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum, vol. 26, no. 1, pp. 80–113, 2007.
[12] "GPGPU website," http://gpgpu.org. Accessed May 2010.
[13] E. Alba and B. Dorronsoro, Cellular Genetic Algorithms. Operations Research/Computer Science series, Springer, 2008.
[14] "GPGPGPU website," http://www.gpgpgpu.com. Accessed May 2010.
[15] M. Wong, T. Wong, and K. Fok, "Parallel evolutionary algorithms on graphics processing unit," in The 2005 IEEE Congress on Evolutionary Computation, vol. 3, 2005, pp. 2286–2293.
[16] M. Wong and T. Wong, "Parallel hybrid genetic algorithms on consumer-level graphics hardware," in The 2006 IEEE Congress on Evolutionary Computation, 2006, pp. 2973–2980.
[17] ——, "Implementation of parallel genetic algorithms on graphics processing units," in Intelligent and Evolutionary Systems, 2009, pp. 197–216.
[18] S. Harding and W. Banzhaf, "Fast genetic programming and artificial developmental systems on GPUs," in 21st International Symposium on High Performance Computing Systems and Applications (HPCS'07). IEEE Computer Society, 2007, p. 2.
[19] ——, "Genetic programming on GPUs for image processing," International Journal of High Performance Systems Architecture, vol. 1, no. 4, pp. 231–240, 2008.
[20] O. Maitre, L. A. Baumes, N. Lachiche, A. Corma, and P. Collet, "Coarse grain parallelization of evolutionary algorithms on GPGPU cards with EASEA," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. ACM, 2009, pp. 1403–1410.
[21] Q. Yu, C. Chen, and Z. Pan, "Parallel genetic algorithms on programmable graphics hardware," in Advances in Natural Computation, 2005, pp. 1051–1059.
[22] J. Li, X. Wang, R. He, and Z. Chi, "An efficient fine-grained parallel genetic algorithm based on GPU-accelerated," in IFIP International Conference on Network and Parallel Computing Workshops, 2007, pp. 855–862.
[23] J. Li, L. Zhang, and L. Liu, "A parallel immune algorithm based on fine-grained model with GPU-acceleration," in Proceedings of the 2009 Fourth International Conference on Innovative Computing, Information and Control. IEEE Computer Society, 2009, pp. 683–686.
[24] S. Tsutsui and N. Fujimoto, "Solving quadratic assignment problems by genetic algorithms with GPU computation: a case study," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. ACM, 2009, pp. 2523–2530.
[25] T. Lewis and G. Magoulas, "Strategies to minimise the total run time of cyclic graph based genetic programming with GPUs," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. ACM, 2009, pp. 1379–1386.
[26] M. Wong, "Parallel multi-objective evolutionary algorithms on graphics processing units," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. ACM, 2009, pp. 2515–2522.
[27] E. Alba, G. Luque, J. García, G. Ordóñez, and G. Leguizamón, "MALLBA: a software library to design efficient optimisation algorithms," International Journal of Innovative Computing and Applications, vol. 1, no. 1, pp. 74–85, 2007.
[28] S. Baluja, "A massively distributed parallel genetic algorithm (mdpga)," Carnegie Mellon University, Tech. Rep. CMU-CS-92-196R, 1992.
[29] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
[30] E. Loiola, N. de Abreu, P. Boaventura-Netto, P. Hahn, and T. Querido, "A survey for the quadratic assignment problem," European Journal of Operational Research, vol. 176, no. 2, pp. 657–690, 2007.
[31] "QAPLIB - a quadratic assignment problem library," http://www.seas.upenn.edu/qaplib/. Accessed May 2010.
