research article openaccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene...

16
De Beukelaer et al. BMC Genetics (2015) 16:2 DOI 10.1186/s12863-014-0154-z RESEARCH ARTICLE Open Access Heuristic exploitation of genetic structure in marker-assisted gene pyramiding problems Herman De Beukelaer 1* , Geert De Meyer 2 and Veerle Fack 1 Abstract Background: Over the last decade genetic marker-based plant breeding strategies have gained increasing attention because genotyping technologies are no longer limiting. Now the challenge is to optimally use genetic markers in practical breeding schemes. For simple traits such as some disease resistances it is possible to target a fixed multi-locus allele configuration at a small number of causal or linked loci. Efficiently obtaining this genetic ideotype from a given set of parental genotypes is known as the marker-assisted gene pyramiding problem. Previous methods either imposed strong restrictions or used black box integer programming solutions, while this paper explores the power of an explicit heuristic approach that exploits the underlying genetic structure to prune the search space. Results: Gene Stacker is introduced as a novel approach to marker-assisted gene pyramiding, combining an explicit directed acyclic graph model with a pruned generation algorithm inspired by a simple exhaustive search. Both exact and heuristic pruning criteria are applied to reduce the number of generated schedules. It is shown that this approach can effectively be used to obtain good solutions for stacking problems of varying complexity. For more complex problems, the heuristics allow to obtain valuable approximations. For smaller problems, fewer heuristics can be applied, resulting in an interesting quality-runtime tradeoff. Gene Stacker is competitive with previous methods and often finds better and/or additional solutions within reasonable time, because of the powerful heuristics. Conclusions: The proposed approach was confirmed to be feasible in combination with heuristics to cope with realistic, complex stacking problems. The inherent flexibility of this approach allows to easily address important breeding constraints so that the obtained schedules can be widely used in practice without major modifications. In addition, the ideas applied for Gene Stacker can be incorporated in and extended for a plant breeding context that e.g. also addresses complex quantitative traits or conservation of genetic background. Gene Stacker is freely available as open source software at http://genestacker.ugent.be. The website also provides documentation and examples of how to use Gene Stacker. Keywords: Plant breeding, Marker-assisted gene pyramiding, Multi-objective optimization, Heuristics Background Over the last decade several genetic marker-based plant breeding strategies [1] have been established and are increasingly used to develop better lines and hybrids. The approach taken depends on trait architecture. For simple traits such as some disease or pest resistances it is possible to tag a small number of causal or linked loci with genetic markers and exploit these by marker-assisted selection [2]. More complex traits such as yield are better managed by *Correspondence: [email protected] 1 Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Krijgslaan 281 - S9, 9000 Gent, Belgium Full list of author information is available at the end of the article thousands of genome-wide markers focusing on predic- tion [3] rather than on causality of individual markers. At present, genotyping technologies are no longer limit- ing and the major challenge is to optimally use genetic markers in practical breeding schemes. Exploitation of genetic markers through crossing and selection is a combinatorial optimization problem in a genetic context. This problem has two distinct levels of objectives. For foreground markers that address simple traits, a fixed multi-locus allele configuration or genetic ideotype is targeted. Complex trait objectives managed by background markers require a more general optimiza- tion in a constrained space. At the same time, there is © 2015 De Beukelaer et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Upload: others

Post on 15-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 DOI 10.1186/s12863-014-0154-z

RESEARCH ARTICLE Open Access

Heuristic exploitation of genetic structure inmarker-assisted gene pyramiding problemsHerman De Beukelaer1*, Geert De Meyer2 and Veerle Fack1

Abstract

Background: Over the last decade genetic marker-based plant breeding strategies have gained increasing attentionbecause genotyping technologies are no longer limiting. Now the challenge is to optimally use genetic markers inpractical breeding schemes. For simple traits such as some disease resistances it is possible to target a fixedmulti-locus allele configuration at a small number of causal or linked loci. Efficiently obtaining this genetic ideotypefrom a given set of parental genotypes is known as the marker-assisted gene pyramiding problem. Previous methodseither imposed strong restrictions or used black box integer programming solutions, while this paper explores thepower of an explicit heuristic approach that exploits the underlying genetic structure to prune the search space.

Results: Gene Stacker is introduced as a novel approach to marker-assisted gene pyramiding, combining an explicitdirected acyclic graph model with a pruned generation algorithm inspired by a simple exhaustive search. Both exactand heuristic pruning criteria are applied to reduce the number of generated schedules. It is shown that this approachcan effectively be used to obtain good solutions for stacking problems of varying complexity. For more complexproblems, the heuristics allow to obtain valuable approximations. For smaller problems, fewer heuristics can beapplied, resulting in an interesting quality-runtime tradeoff. Gene Stacker is competitive with previous methods andoften finds better and/or additional solutions within reasonable time, because of the powerful heuristics.

Conclusions: The proposed approach was confirmed to be feasible in combination with heuristics to cope withrealistic, complex stacking problems. The inherent flexibility of this approach allows to easily address importantbreeding constraints so that the obtained schedules can be widely used in practice without major modifications. Inaddition, the ideas applied for Gene Stacker can be incorporated in and extended for a plant breeding context thate.g. also addresses complex quantitative traits or conservation of genetic background. Gene Stacker is freely availableas open source software at http://genestacker.ugent.be. The website also provides documentation and examples ofhow to use Gene Stacker.

Keywords: Plant breeding, Marker-assisted gene pyramiding, Multi-objective optimization, Heuristics

BackgroundOver the last decade several genetic marker-based plantbreeding strategies [1] have been established and areincreasingly used to develop better lines and hybrids. Theapproach taken depends on trait architecture. For simpletraits such as some disease or pest resistances it is possibleto tag a small number of causal or linked loci with geneticmarkers and exploit these bymarker-assisted selection [2].More complex traits such as yield are better managed by

*Correspondence: [email protected] of Applied Mathematics, Computer Science and Statistics,Ghent University, Krijgslaan 281 - S9, 9000 Gent, BelgiumFull list of author information is available at the end of the article

thousands of genome-wide markers focusing on predic-tion [3] rather than on causality of individual markers.At present, genotyping technologies are no longer limit-ing and the major challenge is to optimally use geneticmarkers in practical breeding schemes.Exploitation of genetic markers through crossing and

selection is a combinatorial optimization problem in agenetic context. This problem has two distinct levels ofobjectives. For foreground markers that address simpletraits, a fixed multi-locus allele configuration or geneticideotype is targeted. Complex trait objectives managedby background markers require a more general optimiza-tion in a constrained space. At the same time, there is

© 2015 De Beukelaer et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the CreativeCommons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedicationwaiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwisestated.

Page 2: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 2 of 16

the need to deal with crop specific as well as practicalconstraints such as the expected amount of seeds obtainedfrom a crossing, the number of generations, and the num-ber of plants grown per generation. In all, the number ofobjectives and their diverse types make this a hard andcomplex problem that demands an explicit and modularoptimization strategy.A logical first step is to develop an explicit framework

to deal with the foreground markers. The objective is todesign a crossing schedule that efficiently stacks a smallnumber of favorable trait alleles (causal or tightly linked)present in a set of parental genotypes. This is known as themarker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants are grown and screened to identifydesired genotypes or targets. These targets are selectedfor crossings, generating offspring to be grown, genotypedand selected for use in the next generation, until the ideo-type is obtained. An example with 3 parental genotypesis given in Figure 1. The number of possible crossingschedules grows exponentially with the number of lociand parental genotypes which makes the task of designinggood schedules very challenging. With n loci, every sin-gle crossing may produce a vast amount of up to O(4n)possible offspring which are all candidates to be fixed astarget genotypes in the next generation. There are twomain aspects that define a crossing schedule: the targetgenotypes aimed for in each generation (selection prob-lem) and the crossings to be performedwith these selectedtargets (scheduling problem). Important properties of acrossing schedule are the number of generations (time)and the number of plants (cost) required to obtain the tar-get genotypes in each generation. The latter is inverselyproportional to the probability of observing these targetsamong the offspring.Previous research on this topic has mainly focused on

providing general guidelines for plant breeders [4,5] while

only few papers offer a systematic algorithmic approach.An initial method [6] considered restricted parental geno-types and represented crossing schedules as binary trees.For each crossing, the progeny that inherits all favorablealleles from both parents is selected, i.e. the selectionproblem is not addressed. An exhaustive algorithm isapplied to generate all possible crossing schedules by iter-atively combining smaller schedules through additionalcrossings. Later, integer programming approaches weredeveloped that optimize using general purpose solverslike CPLEXa. The first implementation [7] performs amulti-objective optimization to fix desirable alleles whilemaintaining genetic variability at some remaining lociwhen possible. Only the selection problem is considered:each target allele in the ideotype is assigned an originatingparental genotype and arbitrary minimum-depth binarytrees are used to stack the genes according to this assign-ment. In a more powerful mixed integer programming(MIP) implementation [8] crossing schedules are mod-eled as directed acyclic graphs (DAGs) that allow reuse ofmaterial. Both the selection and scheduling problem areconsidered and themodel incorporates a constraint on thenumber of offspring generated from one crossing.

Our contribution: We introduce Gene Stacker as a novelapproach to marker-assisted gene pyramiding, combin-ing an explicit DAG model [8] with a pruned generationalgorithm inspired by a simple exhaustive search [6]. Wedemonstrate that this works for small problems whilemore complex problems require supplementary heuristicpruning criteria that exploit the genetic structure to skipadditional, well-chosen parts of the search space. The pro-posed heuristics provide an interesting quality-runtimetradeoff. This makes Gene stacker not only a flexible andperformant tool for many practical problems but also acore module that can be extended to optimize for e.g.quantitative traits.

[0 0 0 1][0 0 0 1]

[0 0 1 0][0 1 0 0]

[0 0 0 1][0 1 1 0]

[1 0 0 1][1 0 1 0]

[0 1 1 1][1 0 1 1]

x

x

Grow and cross parental genotypes A & B

Screen offspring fortarget genotype D,cross with parentalgenotype C

Screen offspring for ideotype I

A B

D C

I

Figure 1 General crossing schedule layout (example). In each generation, a number of plants are grown and screened for the desired targetgenotype(s). Crossings are then performed to provide new offspring to be grown, genotyped and selected for use in the next generation. All targetsand crossings are fixed in advance. This example has 3 parental genotypes A, B and C. Initially, A and B are crossed to produce an intermediatetarget genotype D. In the next generation, D is crossed with parental genotype C to create the ideotype I.

Page 3: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 3 of 16

MethodsFirst, several definitions and formulas are stated and anextended DAG model is introduced. Then, the appliedoptimization strategy is described for which some exactand heuristic pruning criteria are proposed.

Encoding of genotypesA diploid phase-known genotype G = (G1, . . . ,Gk) con-sists of an ordered sequence of k ≥ 1 chromosomes,each represented by a 2 × ni matrix Gi of alleles. Therows Gi,1 and Gi,2 of matrix Gi are called haplotypes andeach correspond to one of both homologous chromo-somes. Note that interchanging the haplotypes (rows) ofa chromosome Gi ∈ G yields the same genotype. Thecolumns Gi(j), j = 1, . . . , ni, correspond to the consideredloci in chromosome Gi and binary digits (0/1) indicatethe absence or presence of specific alleles. At every locus0 ≤ j ≤ ni − 1 of chromosome Gi there are two allelesGi,1(j) and Gi,2(j); this jth locus is homozygous if Gi,1(j) =Gi,2(j), else it is heterozygous. A genotype is said to behomozygousb if all considered loci in each chromosomeare homozygous.

Recombination ratesWhen crossing two diploid genotypes P and Q, eachparent produces a haploid gamete and fusion of thesegametes yields the diploid genotype of the child. A gameteH = (H1, . . . ,Hk) produced by genotype P = (P1, . . . ,Pk)consists of a series of haploid chromosomesHi which eachcomprise a single haplotype and which are each (inde-pendently) obtained from the respective diploid chro-mosome Pi. A diploid chromosome can yield a numberof different haplotypes due to recombination of alleles(crossover events). The probability with which each pos-sible haplotype is produced can be computed using thegenetic map, from which the distance between any pair ofloci on the same chromosome can be inferred. These dis-tances are converted to crossover rates ri,p,q correspond-ing to the probability that a crossover will occur betweenloci p and q on chromosome i, e.g. using Haldane’s map-ping function [9]. Then, the probability Pr[Pi,Qi →Gi] that chromosomes Pi and Qi will yield haplotypeswhich together form chromosome Gi is computed usingformulas introduced in [8] (Additional file 1: Section 1).As Gene Stacker explicitly models multiple chromosomes,the final probability Pr[P,Q → G] of producing theentire phase-known genotype G when crossing parentsP and Q is computed by multiplying the chromosomeprobabilities:

Pr[P,Q → G]=k∏

i=1Pr[Pi,Qi → Gi] .

Population sizeEach genotype among the possible outcome of a crossingis a candidate to be selected in the next generation. How-ever, such target genotype can only be selected if it actuallyoccurs among the offspring. Thus, a sufficient amountof offspring should be generated so that the targets areexpected to be produced. Consider a crossing of geno-types P and Q and a target genotype G that is producedwith probability ρ = Pr[P,Q → G]. Given a desired suc-cess rate γ ′, the corresponding population size N(ρ, γ ′)indicates the number of offspring that has to be generatedso that the probability of obtaining at least one occurrenceof G is at least γ ′ [6,8]:

N(ρ, γ ′) =

⎧⎪⎨⎪⎩

⌈log (1 − γ ′)log (1 − ρ)

⌉if ρ < 1

1 otherwise(1)

Gene Stacker ensures a global success rate γ (e.g. 95%)by setting a success rate γ ′ = n√γ for each individual tar-get, where n is the total number of targets obtained fromcrossings that can produce more than one possible child(i.e. crossings with uncertainty about the outcome). Thetotal population size of a crossing schedule is equal tothe sum of the population sizes required to obtain all tar-get genotypes aimed for through the schedule and reflectsthe cost of the schedule. When several different geno-types or multiple occurrences of a specific genotype aretargeted among offspring grown from a shared seed lot,it is possible to compute a (lower) joint population sizeexpressing the number of offspring that has to be gener-ated to simultaneously obtain all targets (Additional file 1:Section 2).

Extended DAGmodelGene Stacker models crossing schedules as directedacyclic graphs (DAGs) with three types of nodes:

• Seed lot nodes: represent seeds obtained from acrossing, modelling the probability distribution of allphase-known genotypes that may be produced. Thesource nodes of the graph are seed lot nodes fromwhich the parental genotypes are grown. These initialseed lot nodes are assumed to be genetically uniform,i.e. they contain only one fixed phase-knowngenotype, and never to be depleted. Every internalseed lot node has a single crossing node as its parent.Edges leaving from a seed lot node are directedtowards one or more plant nodes in any subsequentgeneration.

• Plant nodes: represent target genotypes aimed foramong offspring grown from a specific seed lot. Aplant node is labeled with its phase-known genotypeand required population size (groups of plant nodesthat are simultaneously obtained from the same seed

Page 4: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 4 of 16

lot are labeled with the required joint population sizeinstead). If more than one occurrence of therespective genotype is targeted, the desired number ofduplicates is indicated. Every plant node has a singleseed lot node as its parent. Edges leaving from plantnodes lead to crossing nodes in the same generation.

• Crossing nodes: represent crossings with plants fromthe same generation, resulting in a seed lot availableas from the next generation. A crossing node islabeled with the number of times that the crossingwill be performed (if more than once). Every crossingnode has two (not necessarily distinct) plant nodes asits parents. A single edge leaves from every crossingnode to a seed lot node in the next generation.

Figure 2 shows a crossing schedule with 3 generationsand a total population size of 1197. It is assumed herethat every crossing provides about 250 seeds and thateach plant can be crossed twice. Circular nodes repre-sent seed lot nodes, rectangular nodes are plant nodes anddiamonds are crossings. Nodes which are aligned at thesame vertical level are part of the same generation. Thesource nodes cover the 0th generation, and each subse-quent level of seed lot nodes starts the next generation.This model allows reuse of plants (within a generation)as well as remaining seeds (across generations) and is anextension of the original DAG model from [8] which usesa single node type corresponding to Gene Stacker’s plantnodes.

Linkage phase ambiguityGene Stacker is entirely based on phase-known genotypesas this allows to infer the distribution of possible offspringfrom a crossing. However, in practice, the linkage phaseof a genotype can not always be directly observed [10].Therefore, it is important to monitor the linkage phaseambiguity (LPA) which expresses the probability thata genotype will have an undesired linkage phase. Theobserved allelic frequencies of a genotype G are referredto as G̃. When crossing genotypes P andQ, the probabilityPr[P,Q → G̃] of obtaining any genotype with the sameallelic frequencies as G is computed as follows:

Pr[P,Q → G̃]=∑

G′,G̃′=G̃

Pr[P,Q → G′] .

Then, the linkage phase ambiguity of G is equal to

LPA[P,Q → G]= 1 − Pr[P,Q → G]Pr[P,Q → G̃]

.

For target genotypes with non-zero linkage phase ambi-guity, the inferred LPA is included in the label of thecorresponding plant node. The overall LPA of a crossing

schedule is defined as the probability that at least one tar-get genotype aimed for through the schedule will have anundesired linkage phase, and can easily be computed fromthe individual ambiguities.

Approximated Pareto frontierGene Stacker approximates the Pareto frontier of cross-ing schedules with minimum number of generations, totalpopulation size and overall linkage phase ambiguity, pos-sibly subject to a number of crop specific and practicalconstraints. Upper limits can be set for

(a) the number of generations (required);(b) the overall linkage phase ambiguity;(c) the total number of crossings;(d) the population size per generation; and(e) the number of crossings with each plant.

Also, the expected number of seeds obtained from acrossing can be specified. The Pareto frontier contains allsolutions within the constraints that are not dominatedby any other valid solution, where C′ dominates C if it isat least as good for every objective and better for at leastone objective. All non-dominated schedules are optimalin some sense as they provide tradeoffs with respect tothe different objectives. The approximated Pareto frontierobtained by Gene Stacker contains all completed sched-ules for which no dominating other solution has beenconstructed.

AlgorithmGene Stacker applies a (heuristically) pruned genera-tion algorithm inspired by the exhaustive search strategyfrom [6]. The search space is traversed as a tree by start-ing with the smallest possible schedules, i.e. those whichsimply grow one of the parental genotypes, and iterativelyextending schedules through additional crossings. Thereare two types of extensions: (a) selfing the final plant of aschedule (i.e. crossing this plant with itself ); or (b) com-bining two schedules through a crossing of the final plantsof both schedules. Every phase-known genotype amongthe offspring is then considered to be fixed as the next tar-get, which results in a (possibly large) number of extendedschedules.When combining two schedules, their generations can

be aligned or interleaved in different ways (Additionalfile 1: Section 3). Plant or seed lot nodes occurring in bothschedules which are aligned in the same generation of thecombined schedule are dynamically reused. Gene Stackergreedily discards non Pareto optimal alignments; there-fore, the main algorithm is not exact. However, the impactof this greedy approach on the solution quality is expectedto be very small; it mainly prevents the introduction ofmost likely redundant generations and favors alignments

Page 5: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 5 of 16

1 1

4523

278 144

318

S1

[0][0 0][0][0][0][1 1][0][0 0][1][1][1][1 1]

x3

S2

[0][0 0][0][0][0][0 0][1][0 0][0][0][0][0 0]

S3

[0][0 1][0][0][0][0 0][0][1 0][0][0][0][0 0]

2

S4

[0][0 0][0][0][0][0 0][1][1 1][0][0][0][0 0]

2

S5S6

[0][0 0][0][0][0][0 0][1][1 1][1][1][1][1 1]

[0][0 0][1][1][1][1 1][0][0 0][1][1][1][1 1]

2

S7

[0][0 0][1][1][1][1 1][1][1 1][1][1][1][1 1]

BA

C D

E F

I

Figure 2 Example crossing schedule. An example crossing schedule according to Gene Stacker’s DAG model, with 3 generations and a totalpopulation size of 1197 (sum of population sizes required to obtain all target genotypes, as indicated at the corresponding plant nodes). It isassumed that every crossing yields about 250 seeds and that each plant can be crossed twice (or selfed once). First, parental genotypes A and B arecrossed. This crossing is performed twice to provide a sufficient amount of seeds to obtain the target genotype D among the offspring.Subsequently, D is crossed with the third parental genotype C and the latter is also crossed with itself (twice). To be able to perform each of thesecrossings, 3 duplicates of C are grown. Finally, E and F are crossed (twice) to produce the ideotype I.

with the highest amount of reuse which leads to a reducedcost.If an extension yields a new schedule in which the ideo-

type is obtained, the Pareto frontier is updated accord-ingly. Else, the schedule is queued for further extensionunless it is predicted that every completed extension willeither be dominated by an already obtained solution orviolate the constraints. Such pruning reduces the number

of constructed schedules and therefore the runtime andmemory footprint of the algorithm. Gene Stacker includesa number of heuristics that further reduce the searchspace by exploiting the underlying genetic structure toskip non promising branches of the search tree. Well-designed heuristics may result in large speedups withonly a slightly higher probability of obtaining suboptimalsolutions, which are often close to the optimum.

Page 6: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 6 of 16

The search terminates when there are no more sched-ules to be further extended. Termination is guaranteedbecause of a required constraint on the number of gen-erations. A more detailed description of the algorithm isprovided in the Additional file 1 (Section 4).

Exact pruning criteriaBecause the number of generations, total population sizeand overall linkage phase ambiguity are monotonicallyincreasing, any partial schedule which is dominated by apreviously obtained solution or which already violates thecorresponding constraints may be discarded. In addition,some basic bounds are applied; for example, when com-bining two partial schedules, it is predicted whether thismay yield a valid improvement over the current Paretofrontier approximation by inferring the minimum com-bined population size and linkage phase ambiguity fromthe set of non overlapping plant nodes and seed lot nodesoccurring in both schedules. Also, the minimum increasein population size and ambiguity caused by targeting anygenotype among the offspring of the performed crossing istaken into account. Although these are local bounds thatpredict the impact of a single extension, they often causesignificant speedups as creating all extensions is a timeconsuming process.Constructed seed lots are filtered based on the con-

straints. Genotypes with higher linkage phase ambigu-ity than the maximum allowed overall ambiguity areremoved. Also, if at most m plants per generation areallowed, a genotype G obtained from crossing P and Q isdiscarded if

Pr[P,Q → G]< 1 − (1 − γ ′)1m .

Given that at most g generations are allowed, GeneStacker prunes a significant number of branches whencreating schedules with g − 1 or g generations. At genera-tion g − 1 only genotypes from which the ideotype can beobtained through a single crossing are considered as pos-sible targets, i.e. genotypes that can produce one of bothdesired haplotypes for every chromosome of the ideotype.Furthermore, in this penultimate generation, only thosecrossings which can produce the complete ideotype areperformed. Obviously, in the final generation g, only theideotype itself is considered as a target. These pruning cri-teria are very effective and yield huge speedups in the finallevels of the search tree.

HeuristicsSome heuristics are proposed that exploit the underly-ing genetic structure to further reduce the search space.Several of these heuristics are based on improvementof phase-known genotypes as compared to the ideotype.Improvement is expressed within a chromosome and agenotype is considered to be an improvement if at least

one chromosome has improved. Gene Stacker uses twodifferent improvement criteria: weak and strong improve-ment. First, the definitions of desired alleles and stretchesare introduced.

Definition 1 (desired allele). Given a chromosome C withk loci, take any of both haplotypes H of C; then allele H(l),0 ≤ l ≤ k − 1, is desired if the respective chromosome Tfrom the ideotype contains a haplotype H ′ with the sameallele at locus l, i.e. H(l) = H ′(l).

Definition 2 (desired stretch). Given a chromosome Cwith k loci, take any of both haplotypes H of C; then thestretch SHi,j , with 0 ≤ i, j ≤ k−1, i ≤ j, is defined as the partof H comprising the consecutive alleles at loci i, i+ 1, . . . , j.The length of the stretch is denoted as |SHi,j | = j − i + 1.Stretch SHi,j is desired if the respective chromosome T fromthe ideotype contains a haplotype H ′ for which ∀l, i ≤ l ≤j,H(l) = H ′(l).

The definition of weak improvement then follows:

Definition 3 (weak improvement). Given two chromo-somes C,C′ and the ideotype I , C is a weak improvementover C′, denoted as C I

w C′, if either (a) one of both hap-lotypes H of C contains a desired stretch SHi,j which is notpresent in any of both haplotypes of C′; or (b) C homozy-gously contains a desired allele which does not occur in C′in homozygous state.

The first case favors the introduction of new orextended desired stretches and the second case rewardsstabilization of desired alleles to prevent them from beinglost during subsequent crossings. An alternative defini-tion, of strong improvement, is stated below:

Definition 4 (strong improvement). Given any chromo-some C, compute the set M containing all desired stretchesSHi,j occurring in any haplotype H that can be producedfrom C with at most 1 crossover. Then derive the tuple(lC , pC) defined by

lC = max{|SHi,j |; SHi,j ∈ M}and

pC = max{Pr[C → SHi,j] ; SHi,j ∈ M & |SHi,j | = lC}

where Pr[C → SHi,j] is the probability that C will pro-duce any haplotype containing stretch SHi,j . Now, given twochromosomes C,C′ and the ideotype I ; then C is a strongimprovement over C′, denoted as C I

s C′, if

(lC > lC′) ∨ (lC = lC′ ∧ pC > pC′).

Page 7: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 7 of 16

To detect strong improvement chromosomes are firstcompared based on the length of the longest desiredstretch that may be produced with at most 1 crossover,an idea which has been previously proposed in [11]. Incase of equal lengths, the highest probability with whichany such maximal desired stretch will be produced byeach chromosome is compared. Gene Stacker includesthree heuristics which are based on improvement of geno-types. The first heuristic (H0) is applied once to filter theparental genotypes G.

Heuristic H0 (parental genotype filter). Discard anyparental genotype G ∈ G for which ∃G′ ∈ G,G′ = G, withG′ I

w G ∧ ¬(G Iw G′).

The other heuristics are repeatedly applied to prune nonpromising branches of the search tree.

Heuristic H1 (improvement over ancestors). Each geno-type G is required to be an improvement over all ancestors,i.e. G I

... A for each genotype A occurring on any pathfrom a source node to G. It is also allowed that G = A ifG has a smaller linkage phase ambiguity or higher prob-ability than A, considering the seed lots from which bothgenotypes are obtained. The applied improvement crite-rion I

... can be either weak (H1a) or strong improvement(H1b).

Heuristic H2 (seed lot filter). When crossing genotypes Pand Q, discard any genotype G from the obtained seed lotS for which ∃G′ ∈ S , G′ = G, with

G′ I... G ∧ ¬(G I

... G′)

and both

Pr[P,Q → G′] ≥ Pr[P,Q → G]LPA[P,Q → G′] ≤ LPA[P,Q → G] .

Again, the applied improvement criterion I... can be

either weak (H2a) or strong improvement (H2b) .

Heuristic H2 removes genotypes from S if a strictly bet-ter genotype is also available which requires equal or lesseffort to be obtained from S , in terms of population size(probability) and linkage phase ambiguity. The followingheuristic (H3) assumes that an optimal schedule consistsof optimal subschedules.

Heuristic H3 (optimal subschedules). A distinct Paretofrontier F(G) is maintained for each genotype G, consist-ing of schedules with final genotype G. Such schedule C isonly queued for further extension if it is not dominated bya previous schedule C′ ∈ F(G). Moreover, extensions areonly constructed if C is still contained in F(G) when it is

dequeued. As an exception, selfing a homozygous genotypeis always allowed.

The exception allows efficient reuse of homozygousgenotypes across generations with only a small increasein the number of explored branches of the search tree.Experiments showed that applying heuristic H3 gener-ally results in very large speedups, but regularly alsoyields worse Pareto frontier approximations because theassumption that optimal schedules consist of optimal sub-schedules does not hold when reusingmaterial. Therefore,two dual run strategies have been designed where H3is enabled in the first run only. The second run thenbenefits from the availability of an initial Pareto fron-tier approximation, e.g. allowing earlier pruning. HeuristicH3s1 follows this basic dual run strategy. Heuristic H3s2also applies an additional seed lot filter in the second runthat restricts the possible haplotypes for each chromo-some to those occurring in a solution found in the firstrun. The overhead of the first run is usually much smallerthan the speedup obtained in the second run.The next heuristic (H4) requires that a genotype is

obtained from a Pareto optimal seed lot in terms of thecorresponding probability and ambiguity.

Heuristic H4 (Pareto optimal seed lot). Each genotypeG is required to be obtained from a Pareto optimal seedlot S in terms of probability and linkage phase ambi-guity, among all seed lots available up to the respectivegeneration.

The number of possible offspring from a crossing growsexponentially with the number of (heterozygous) loci inthe parents; therefore, it can take a significant amountof time and memory to construct the entire seed lot.Although Gene Stacker includes several seed lot filters,this filtering may also be time consuming. Therefore,heuristics are provided that reduce the number of hap-lotypes produced from the crossed genotypes’ chromo-somes by not considering all crossovers. These heuristics(H5 and H5c) assume that a crossover is difficult to obtainand should therefore result in an obvious improvement.

Heuristic H5 (heuristic seed lot construction). Take achromosome C with k loci of which l ≤ k are het-erozygous with ordered indices s = (ν1, . . . , νl). Also,take a haplotype H that is produced from C throughm < l crossovers between consecutive heterozygous loci(νi1−1, νi1), . . . , (νim−1, νim). Split H into a series of m + 1corresponding stretches

H = (SH0,(νi1 )−1, SHνi1 ,(νi2 )−1, . . . , S

Hνim ,k−1)

Page 8: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 8 of 16

where each stretch SHi,j ∈ H originates from one of bothhaplotypes of C. For every stretch SHi,j originating from thetop haplotype C1, i.e. SHi,j = SC1

i,j , the bottom haplotype C2

contains an alternative stretch SC2i,j = SHi,j and vice versa.

Produce only those haplotypes from C for which everystretch inH contains at least one desired allele which is notpresent in the alternative stretch.

Heuristic H5c (consistent heuristic seed lots). Thisheuristic is a stronger version of H5 that requires con-sistent improvement within all stretches towards a fixedhaplotype of the corresponding ideotype chromosome.

For a homozygous ideotype, H5c degenerates to H5. Tocompute linkage phase ambiguities a heuristically con-structed seed lot S is further extended to include allphase-known genotypes with the same allelic frequenciesas any genotype already contained in S . Heuristics H5and H5c also provide an option to limit the number ofsimultaneous crossovers per chromosome.Finally, heuristic H6 computes an approximate lower

bound on the population size of any completed exten-sion of a given partial schedule, based on the probabilitiesof those crossovers that are necessarily still required toobtain the ideotype.

HeuristicH6 (approximate population size bound). Fromevery chromosome T of the ideotype I , with nT loci, the setof desired stretches of size 2 is derived:

DT = {SHi,i+1;H = T1 ∨ T2 & 0 ≤ i < nT − 1

}Only those stretches from DT that do not occur in the

respective chromosome of any parental genotype G ∈ Gare retained. For each such stretch SHi,i+1 a crossover is nec-essarily required between loci i and i + 1 to obtain theideotype. Now, given a partial schedule, it is checked (forall chromosomes) which of the crucial stretches are not yetpresent in any genotype occurring in this schedule. The sumof the minimum population sizes required to obtain eachof the corresponding crossovers is used as a lower boundfor the increase in total population size of any completedextension of this schedule.

It might seem that heuristic H6 implements an exactbound but this is not guaranteed as Gene Stacker com-putes a joint population size when targeting multiplegenotypes among the offspring of a shared seed lot(Additional file 1: Section 2). It is therefore possible thatmultiple crucial stretches are simultaneously obtainedwith a lower total cost. However, it is expected that thiswill rarely occur.Several well-chosen combinations of heuristics provide

tradeoffs between solution quality and execution time.

Presets are named best, better, default, faster and fastest;ordered by the amount and restrictiveness of the appliedheuristics. Full descriptions of the presets are included inthe Additional file 1 (Section 5).

Implementation and hardwareGene Stacker is implemented in Java 7 and experimentshave been performed on the UGent HPC infrastructure,using computing nodes with a 2.4 GHz quad-socket octa-core AMD Magny-Cours processor having a total of 32cores and 64 GB RAM. Gene Stacker is freely availableat http://genestacker.ugent.be; version 1.6 was used for allexperiments. The website also contains user documenta-tion and examples.

Results and discussionThis section presents results of applying Gene Stacker toboth generated and real stacking problems. First, someadvantages of the extended DAG model are discussed.Then, the power of the applied optimization strategy incombination with the proposed heuristics is assessed. Thesection is concluded by providing some practical guide-lines for users of Gene Stacker. Results are compared tothose obtained by the method from [8], referred to asCANZARc. This method minimizes the total populationsize, number of generations and total number of crossings.As minimizing the number of crossings is not explicitlyconsidered as an objective in Gene Stacker, only scheduleswith the lowest total population size among those withthe same number of generations, produced by CANZAR,were selected for comparison with Gene Stacker.

Advantages of the extendedmodelSome advantages of the extended model are discussedhere based on two constructed examples and a complexreal stacking problem from cotton.

Constructed examplesConsider an example with two heterozygous parentalgenotypes G1,G2 and a heterozygous ideotype I :

G1 =[01

] [0 0 00 0 1

],

G2 =[00

] [0 1 01 0 1

],

I =[11

] [1 0 11 1 1

].

The distance between the loci on the second chromo-some is 31 and 42 cM, respectively. Five solutions werereported when running Gene Stacker in default mode,setting an overall success rate of γ = 0.95 and a limitof 4 generations and 10% overall linkage phase ambiguity(Additional file 1: Figure S4).

Page 9: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 9 of 16

Figure 3 (left) shows the best non-ambiguous threegeneration schedule obtained by Gene Stacker, with atotal population size of 275, as well as (right) the respec-tive best three generation solution found by CANZAR,which has a higher total population size of 363. Theleftmost target aimed for in the penultimate generationof the latter schedule has a linkage phase ambiguityof 23.1% while Gene Stacker’s solution is guaranteed tobe non-ambiguous. Gene Stacker provides a way to avoidsuch high ambiguities by carefully monitoring them andconsidering ambiguity as an additional objective to beminimized.

This example also shows how computing joint popula-tion sizes when simultaneously targeting multiple geno-types among the offspring grown from a shared seed lotmay significantly reduce the total population size (seedlot S3 in Figure 3). This approach enabled Gene Stackerto find an alternative schedule with a reduction of morethan 24% in the total population size as compared to theschedule constructed by CANZAR.Another advantage of representing plants and seed lots

with distinct nodes is that (re)use of plants and seeds isdifferentiated. Gene Stacker only allows crossings withplants from the same generation, which is justified by

Figure 3 Solutions for first constructed example. (left) Best non-ambiguous three generation schedule obtained for the first constructedexample when running Gene Stacker in defaultmode; (right) respective best three generation solution reported by CANZAR.

Page 10: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 10 of 16

the fact that almost all field crops flower only once, fora short time. Also for crops that flower multiple timesor for a longer period (such as tomato), crossings withplants from distinct generations are usually not consid-ered because of the high logistic impact. To repeatedlycross over multiple generations, the respective genotypehas to be reproduced, for example by regrowing it fromremaining seeds. In such case, the corresponding cost isaccounted for. Note that this does not limit the flexibilityof Gene Stacker’s model but ensures that the computedcost of the constructed schedules closely reflects plantbreeding practice.The next example has specifically been constructed to

show the advantage of modelling multiple chromosomes.It consists of the following two parental genotypes andideotype:

G1 =[00

] [01

] [01

] [01

] [01

] [01

],

G2 =[01

] [01

] [01

] [01

] [01

] [00

],

I =[01

] [01

] [01

] [01

] [01

] [01

].

A genetic map is not required as each chromosomecontains one locus only. Running Gene Stacker with anypreset and γ = 0.95 resulted in the schedule fromFigure 4 (left) which performs a single crossing. It is possi-ble to immediately obtain the ideotype from this crossingbecause the order of haplotypes within a chromosomeis arbitrary. This is taken into account when comput-ing the probability of observing a genotype among theoffspring (Additional file 1: Section 1). In contrast, previ-ous methods used a single chromosome and specified arecombination rate of 0.5 between loci that actually resideon different chromosomes. This requires to arbitrarilyfix an order of haplotypes in each actual chromosomeand artificially increases the complexity of the problem.Figure 4 (right) shows Gene Stacker’s solution for thesame example when combining all loci on a single arti-ficial chromosome. This schedule is significantly worse:it has an additional generation and a much higher totalpopulation size. Although this example was specificallyconstructed and is somewhat extreme in the sense thatit has six loci on six different chromosomes, it clearlyshows the general benefits of explicitly modelling multiplechromosomes.

Dealing with tight constraintsTight constraintsmight apply for specific crops. For exam-ple, cotton plants can be used for two crossings only(or one selfing) and each crossing yields a small amountof about 250 seeds. This makes it more difficult to findgood crossing schedules within the constraints. With the

extended model such important constraints can easilybe taken into account. Crossings are performed multi-ple times if necessary to provide a sufficient amount ofseeds, where sometimes several duplicates of the samegenotype are needed to be able to make all crossings.Population sizes are computed in such way that at leastthe required number of duplicates of each targeted geno-type is expected among the offspring (Additional file 1:Section 2).An example from cotton is considered with 6 parental

genotypes, 11 loci spread across 5 chromosomes anda heterozygous ideotype (full description in Additionalfile 1: Section 7). An overall success rate of γ = 0.95 wasused and the number of generations, number of plants pergeneration and overall linkage phase ambiguity were lim-ited to 5, 5000 and 10%, respectively. A time limit of 24hours was applied. The number of crossings per plant andseeds obtained per crossing were set to 2 and 250, respec-tively, to precisely reflect the tight constraints of cottonbreeding.Running Gene Stacker with preset fastest took 2 hours

and 15 minutes to complete and reported 4 solutions with3–5 generations, a total population size of 7256–1077 andan overall linkage phase ambiguity of 0–3.14% (Additionalfile 1: Figures S5–S8). All other presets ran out of mem-ory (64 GB). When restricting the number of generationsto 4 instead of 5, preset faster reported a different solutionwith 4 generations that has a lower total population size(1400) than the respective schedule found by preset fastest(1534) before being interrupted when the time limit of 24hours had been exceeded (Additional file 1: Figure S9). Allsolutions contain at least one crossing which is performedmultiple times and/or a genotype of which multiple dupli-cates are selected. It was not possible to obtain solutionswithin the constraints using CANZAR as this methoddoes not provide a way to accurately impose and workaround these constraints.

Optimization power and heuristicsWe first explore the limits of the optimization strategyand the power gained by applying additional heuristics,based on experiments with a large number of randomlygenerated problem instances. Then, the obtained quality-runtime tradeoff is assessed for various complex, realstacking problems.

Limits of the optimization strategyExperiments have been performed with a variety of240 randomly generated stacking problems; 120 with ahomozygous ideotype and 120 with a heterozygous ideo-type. All instances have 4–14 loci, taking steps of two,and 20 instances were created for every number of lociand for both types of ideotype. Each instance has beenindependently generated by

Page 11: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 11 of 16

Overall LPA: 0%# Plants: 193

1 1

191

S1

[0][0][0][0][0][0][0][1][1][1][1][1]

S2

[0][0][0][0][0][0][1][1][1][1][1][0]

S3

[0][0][0][0][0][0][1][1][1][1][1][1]

Overall LPA: 0%# Plants: 4191

11

4174

15

S1

[0 0 0 0 0 0][0 1 1 1 1 1]

S2

[0 0 0 0 0 0][1 1 1 1 1 0]

S3

[0 1 1 1 1 1][1 1 1 1 1 0]

[0 0 0 0 0 0][0 0 0 0 0 0]

S4

[0 0 0 0 0 0][1 1 1 1 1 1]

Figure 4 Solutions for second constructed example. (left) Best solution obtained for the second constructed example when explicitly modellingmultiple chromosomes; (right) best solution found when combining all loci on one artificial chromosome, where a crossover rate of 0.5 is specifiedbetween pairs of consecutive loci that actually reside on different chromosomes (in this example, all loci).

(i) picking a random number of 1–8 chromosomes,limited by the number of loci;

(ii) randomly assigning each locus to one of the availablechromosomes, with a minimum of 1 locus perchromosome;

(iii) setting a random distance of 1–50 cM between pairsof consecutive loci on the same chromosome;

(iv) randomly creating 2–8 parental genotypes, whereeach allele is set to 1 or 0 with equal probability; and

(v) generating a random ideotype.

The haplotypes of the ideotype’s chromosomes werecreated by copying alleles from one of both haplotypes ofthe respective chromosome of a randomly picked parentalgenotype (independently for every locus). To obtain ahomozygous ideotype, one haplotype is created for eachchromosome and included twice. For heterozygous ideo-types, two independent haplotypes are created and com-bined for every chromosome.Figure 5 shows results of running each preset of Gene

Stacker on the 120 instances with a homozygous ideotype.All experiments have been repeated with a maximum of4, 5 and 6 generations, and a runtime limit of 24 hours

has been applied, together with an overall success rate ofγ = 0.95 and a maximum of 10000 plants per generation,4 crossings per plant, 5000 obtained seeds per crossingand 20% overall linkage phase ambiguity. For every combi-nation of the maximum number of generations (rows), thenumber of loci (columns) and the applied preset (bars) itis reported for howmany out of 20 instances Gene Stackercompleted within the time limit of 24 hours.Without applying any heuristics (preset best), Gene

Stacker solves only 42.5%, 35% and 28.34% of all instanceswhen limiting the number of generations to 4, 5 and6, respectively. Interestingly, solutions are obtained forabout 95% of all instances when applying all heuristics(preset fastest) regardless of the limit on the numberof generations. As expected (and desired), the power ofthe other presets (better, default, faster) lies somewherein between. The problem complexity obviously increaseswith the number of loci as well as the maximum num-ber of generations. Without any heuristics, Gene Stackersolved almost no problems with more than 8 loci: solu-tions were obtained for less than half of the instanceswhen the number of loci exceeded 8, 6 and 4 with a limitof 4, 5 and 6 generations, respectively. Yet, Gene Stacker

Page 12: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 12 of 16

Figure 5 Results for random instances with a homozygous ideotype. This figure indicates the number of randomly generated instances with ahomozygous ideotype for which the different presets of Gene Stacker completed within the applied time limit of 24 hours. Experiments wererepeated with a maximum of 4–6 generations. Instances have 4–14 loci spread across 1–8 chromosomes and 2–8 parental genotypes. In total, 20instances were generated for each number of loci.

can cope with many more complex problems with up toat least 14 loci using the proposed heuristics. Of course,these heuristics may yield worse Pareto frontier approxi-mations, so it is preferred only to enable them if necessaryto find solutions within reasonable time. In this way, theheuristics offer a convenient quality-runtime tradeoff andallow to obtain (approximate) solutions for more complexproblems.

Figure 6 shows similar results for the 120 instances witha heterozygous ideotype. It is clear that these are gener-ally more complex as significantly fewer instances weresolved within the time limit compared to the results fromFigure 5. This may be explained from the fact that eachheterozygous chromosome in the ideotype contains twodifferent target haplotypes, i.e. two competing goals, thathave to be obtained simultaneously. Also, the heuristics

Page 13: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 13 of 16

Figure 6 Results for random instances with a heterozygous ideotype. This figure indicates the number of randomly generated instances with aheterozygous ideotype for which the different presets of Gene Stacker completed within the applied time limit of 24 hours. Experiments wererepeated with a maximum of 4–6 generations. Instances have 4–14 loci spread across 1–8 chromosomes and 2–8 parental genotypes. In total, 20instances were generated for each number of loci.

are less effective for heterozygous ideotypes. For exam-ple, improvement towards any of both haplotypes of aheterozygous ideotype chromosome is rewarded; there-fore, heuristics based on such improvement are less pow-erful in case of two distinct target haplotypes in a singlechromosome.Without applying any heuristics, Gene Stacker now

solves 22.5–37.5% of all instances for a varying limit onthe number of generations. Less than half of the instances

were solved when the number of loci exceeded 4–6.Whenall heuristics are enabled, solutions are obtained for 65–72.5% of the instances (for less than half of the instanceswhen exceeding 10 loci). Although the currently proposedheuristics are clearly less powerful when aiming for aheterozygous ideotype, they allowed to find solutions formany complex problems with up to 10 loci. Nevertheless,the challenge remains to develop better heuristics in thisrespect.

Page 14: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 14 of 16

We conclude that the applied optimization strategy caneffectively be used to find solutions for a wide range ofstacking problems.Without extra heuristics, some smallerproblems with 4–8 or 4–6 loci in case of a homozygous orheterozygous ideotype, respectively, can already be tack-led (depending on the maximum number of generations).To deal with more complex problems, additional heuris-tics are required. The proposed heuristics allow to obtain(approximate) solutions for problems with up to at least10–14 loci.

Quality-runtime tradeoffThe quality-runtime tradeoff obtained by applying differ-ent combinations of heuristics is assessed here for realstacking problems from tomato and rice (full specifica-tion in Additional file 1: Section 7). For all experiments, anoverall success rate of γ = 0.95 was set and the numberof generations and plants per generation were restrictedto 5 and 5000, respectively. The amount of seeds pro-duced per crossing and maximum number of crossingsper plant were set to reflect the specific properties of eachcrop. Approximated Pareto frontiers in terms of the totalpopulation size and number of generations are reported(only schedules with zero linkage phase ambiguity wereselected).First, experiments were performed with two stacking

problems from tomato. Both consist of the same 4 parentalgenotypes with 8 loci spread across 6 chromosomes. Thefirst example (Tomato-1) has a homozygous ideotypewhile the second example (Tomato-2) has a heterozygousideotype. Tomatoes can easily be crossed several dozensof times and every crossing yields a large number of seeds:the maximum number of crossings per plant and theamount of seeds obtained from one crossing were set to24 and 20000, respectively. A time limit of 12 hours wasimposed, after which the algorithms were interrupted andthe solutions found until then were inspected.Figure 7 (top left) shows the Pareto frontier approxi-

mations obtained by applying Gene Stacker with presetsdefault, faster and fastest as well as CANZAR to Tomato-1. Gene Stacker and CANZAR obtained exactly the sameschedule with 4 generations. The small difference in thereported population size is explained by the fact that bothmethods follow a slightly different approach to derive asuccess rate per targeted genotype (γ ′) from the desiredoverall success rate (γ ). Solutions with 5 generations werealso found. Those reported by Gene Stacker have a lowerpopulation size compared to the one obtained by CAN-ZAR, even when applying preset fastest which completesafter only 28 seconds. Presets default and faster reportedexactly the same solutions and the 5 generation sched-ule found here improves over the respective scheduleobtained by preset fastest. Yet, these two presets took sig-nificantly more time (ca. 6–8 hours). This shows how the

proposed heuristics provide tradeoffs between solutionquality and execution time and that they are capable offinding good solutions for a complex, realistic problemwithin reasonable time. CANZAR was interrupted whenexceeding the time limit of 12 hours.Similar results for Tomato-2 are presented in Figure 7

(top right) where only preset fastest has been appliedsince the other presets ran out of memory (64 GB). GeneStacker completed in about 5 hours while CANZAR wasinterrupted when the time limit had expired. Three solu-tions were reported by Gene Stacker with 3–5 generationsand CANZAR obtained 2 solutions with 4–5 generations.The 4 generation schedules reported by both methodsslightly differ but have approximately the same total pop-ulation size. Conversely, Gene Stacker found a somewhatbetter schedule with 5 generations and an additional solu-tion with only 3 generations. The difference in runtime, ascompared to Tomato-1, and the fact that all other presetsran out of memory again confirm that with the currentheuristics it is more difficult to solve stacking problemswith a heterozygous ideotype. Yet, the heuristics made itpossible to find 3 good solutions within a few hours, usinga transparent optimization strategy.Results were also obtained for two examples from rice.

Both consist of the same 8 parental genotypes with 10 locispread across 6 chromosomes. The first example (Rice-1) has a homozygous ideotype while the second example(Rice-2) has a heterozygous ideotype. About 300 seedsare obtained from each crossing and rice plants can becrossed no more than 5 times. For these examples, a timelimit of 24 hours was set.Figure 7 (bottom left) shows the Pareto frontier approx-

imations obtained by applying Gene Stacker with presetsbetter, default, faster and fastest as well as CANZARto Rice-1. Preset fastest completed after only 4 secondsand reported three solutions with 3–5 generations. Pre-sets default and faster terminated after about 30 sec-onds and found a better schedule that dominates boththe 4 and 5 generation schedules obtained by presetfastest. Preset better completed after about 12 minutesand found an additional 5 generation schedule with aslightly lower total population size. This again shows howthe heuristics offer a convenient quality-runtime trade-off. CANZAR did not complete within the time limitof 24 hours but was able to obtain a single schedulewith 4 generations that dominates all 4 and 5 genera-tion schedules obtained by Gene Stacker. It is inevitablethat the heuristics sometimes make wrong decisions inwhich case valuable parts of the search space may nothave been explored. In this specific example, heuristicH0 (included in all presets except preset best) removed aparental genotype that is needed to find the better sched-ule obtained by CANZAR. Still, results are quite closeto those of CANZAR, especially when applying presets

Page 15: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 15 of 16

4 5

1000

1025

1050

1075

1100

1125

1150

1175

1200

1225

1250

1275

1300

1325

Pareto frontier approximations (Tomato-1)

Number of generations

Pop

ulat

ion

size

Fastest (28s)Default (7h 28m), Faster (6h 12m)CANZAR (>12h, interrupted)

3 4 5

750

1000

1250

1500

1750

2000

2250

2500

2750

3000

3250

Pareto frontier approximations (Tomato-2)

Number of generations

Pop

ulat

ion

size

Fastest (5h 8m)CANZAR (>12h, interrupted)

3 4 5

275

300

325

350

375

400

425

450

475

500

525

550

575

600

Pareto frontier approximations (Rice-1)

Number of generations

Pop

ulat

ion

size

Fastest (4s)Default (33s), Faster (28s)Better (12m 17s)CANZAR (>24h, interrupted)

3 4 5

450

500

550

600

650

700

750

800

850

900

950

Pareto frontier approximations (Rice-2)

Number of generations

Pop

ulat

ion

size

Fastest (5m 36s)CANZAR (>24h, interrupted)

Figure 7 Pareto frontier approximations of real stacking problems from tomato and rice. (top left) First example from tomato (Tomato-1),homozygous ideotype; (top right) second example from tomato (Tomato-2), heterozygous ideotype; (bottom left) first example from rice (Rice-1),homozygous ideotype; (bottom right) second example from rice (Rice-2), heterozygous ideotype. Full descriptions of the examples are provided inthe Additional file 1: Section 7.

faster, default or better, a significant speedup is obtainedand an additional solution with only 3 generations isfound.Similar results for Rice-2 are shown in Figure 7 (bottom

right) where only preset fastest has been applied as theother presets either ran out of memory or did not findany solutions within the time limit. Gene Stacker com-pleted after 5–6 minutes while CANZAR was interruptedafter exceeding the time limit of 24 hours. Three solutionswere reported by Gene Stacker, with 3–5 generations.CANZAR found a single solution with 4 generations anda higher population size than the respective scheduleobtained by Gene Stacker. Again, the runtime and mem-ory footprint of Gene Stacker is significantly higher forthis problem with a heterozygous ideotype as comparedto Rice-1 which has a homozygous ideotype. Yet, presetfastest outperforms CANZAR and is able to provide avaluable approximation of the Pareto frontier within a fewminutes.

Practical guidelinesBased on our findings we propose the following prac-tical guidelines for using Gene Stacker. Best is to firsttry the default settings, specifying the required parame-ters (maximum number of generations and overall successrate) and those constraints that are important for the spe-cific application (such as the number of seeds producedfrom a crossing and maximum number of crossings perplant) with a reasonable runtime limit (e.g. 24 hours). IfGene Stacker is too slow or requires too much mem-ory, consider setting additional or tighter constraints (e.g.maximum plants per generation, maximum overall link-age phase ambiguity, ...) and/or using preset faster orfastest. The latter may yield worse solutions which shouldbe avoided when possible. In case the default setting ismore than fast enough consider running presets betterand best as well to check whether this produces betterschedules, as the heuristics might have missed some-thing. Usually, differences between the latter presets and

Page 16: RESEARCH ARTICLE OpenAccess … · 2015. 3. 20. · marker-assisted gene pyramiding or gene stacking prob-lem. A crossing schedule consists of a number of genera-tions in which plants

De Beukelaer et al. BMC Genetics (2015) 16:2 Page 16 of 16

the default setting are very small (if any) except for theruntime which is significantly increased.In case QTL (quantitative trait locus) intervals need to

be stacked one can use flanking markers to delimit thetarget locus. The Tomato-1 problem (Additional file 1:Section 7) is a case in point. On the sixth chromosome, asmall region of 10 cM has been identified in which a tar-get gene is located. In this setting it is advised to makesure that the required haplotype is present in at least oneof the parents, and to verify it is maintained throughoutthe crossing scheme. There always remains a small risk ofa double cross-over within the interval in a single genera-tion which one can either ignore or monitor by saturatingthe interval with additional markers. More details andpractical examples are given at http://genestacker.ugent.be.

ConclusionsThe proposed transparent, flexible and easily extensibleapproach to marker-assisted gene pyramiding was con-firmed to be feasible in combination with heuristics toaddress realistic, complex stacking problems with up toat least 10–14 loci, while taking into account importantbreeding constraints. Carefully designed heuristics evenallow to find better or additional solutions within reason-able time compared to previous methods. The proposedheuristics are certainly not perfect nor complete. Forexample, they are less effective for problems with a het-erozygous ideotype. Future work may include the designof additional or improved heuristics as well as extensionof the ideas applied in Gene Stacker for a more generalplant breeding context that also addresses complex traitsand conservation of genetic background.

Availability of supporting dataThe data set(s) supporting the results of this article is(are)included within the article (and its additional file(s)).

EndnotesaSee http://cplex.com.bThe term ‘homozygous genotype’ is used to indicate

that all considered loci are homozygous; this does not sayanything about the remaining loci in the full DNA andshould for example not be confused with homozygousinbred lines.

cExperiments with CANZARwere run on the SURFsaraLisa computing system (https://www.surfsara.nl/systems/lisa/description) by the authors of this method.

Additional file

Additional file 1: Supplementary material. PDF file with supplementaryinformation such as formulas, algorithm details and additional results.

AbbreviationsMIP: Mixed integer programming; DAG: Directed acyclic graph; LPA: Linkagephase ambiguity; QTL: Quantitative trait locus.

Competing interestsHDB and VF declare that they have an ongoing scientific collaboration withBayer CropScience where GDM is employed. VF has also been involved inconsultancy for this company.

Authors’ contributionsHDB proposed the Gene Stacker algorithm, implemented it and performed allexperiments under the supervision of VF. GDM provided data, general adviceon the genetical context and assistance for the development of heuristics.HDB wrote the initial manuscript with all authors contributing to the finalversion. All authors read and approved the final manuscript.

AcknowledgementsThis work was carried out using the Stevin Supercomputer Infrastructure atGhent University. We thank Mohammed El-Kebir and Stefan Canzar for theirkind cooperation by running their algorithm on our test cases. This allowed foran interesting comparison between our methods and made a majorcontribution to the discussion. Herman De Beukelaer is supported by a Ph.D.grant from the Research Foundation of Flanders (FWO).

Author details1Department of Applied Mathematics, Computer Science and Statistics, GhentUniversity, Krijgslaan 281 - S9, 9000 Gent, Belgium. 2Bayer CropScience NV,Innovation Center, Technologiepark 38, 9052 Zwijnaarde, Belgium.

Received: 19 September 2014 Accepted: 16 December 2014

References1. Tester M, Langridge P: Breeding technologies to increase crop

production in a changing world. Science 2010, 327(5967):818–22.2. Moose SP, Mumm RH:Molecular plant breeding as the foundation for

21st century crop improvement. Plant Physiol 2008, 147(3):969–77.3. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MP:

Whole-genome regression and prediction methods applied to plantand animal breeding. Genetics 2013, 193(2):327–45.

4. Ishii T, Yonezawa K: Optimization of the marker-based procedures forpyramiding genes frommultiple donor lines: I. schedule of crossingbetween the donor lines. Crop Sci 2007, 47(2):537–546.

5. Ye G, Smith KF:Marker-assisted gene pyramiding for inbred linedevelopment: Basic principles and practical guidelines. Int J PlantBreed 2008, 2(1):1–10.

6. Servin B, Martin OC, Mézard M, Hospital F: Toward a theory ofmarker-assisted gene pyramiding. Genetics 2004, 168:513–23.

7. Xu P, Wang L, Beavis WD: An optimization approach to gene stacking.Eur J Oper Res 2011, 214:168–78.

8. Canzar S, El-Kebir M: Amathematical programming approach tomarker-assisted gene pyramiding. In Algorithms in Bioinformatics, WABI2011, LNBI 6833. Edited by Przytycka TM, Sagot M.-F. Berlin, Germany:Springer; 2011:26–38.

9. Haldane J: The combination of linkage values and the calculation ofdistances between the loci of linked factors. J Genet 1919,8(29):299–309.

10. Browning SR, Browning BL: Haplotype phasing: existing methods andnew developments. Nat Rev Genet 2011, 12(10):703–14.

11. El-Kebir M, de Berg M, Buntjer J: Crossing schedule optimization.[Master’s thesis]. Technische Universiteit Eindhoven; 2009.