Coevolving Solutions to the Shortest Common Superstring Problem
Assaf Zaritsky & Moshe Sipper, Ben-Gurion University, Israel (www.cs.bgu.ac.il/~assafza)
1
Coevolving Solutions to the Shortest Common Superstring Problem
Assaf Zaritsky & Moshe Sipper
Ben-Gurion University, Israel
www.cs.bgu.ac.il/~assafza
2
Outline
The “Shortest Common Superstring” problem. (current section)
DNA sequencing and the input domain.
Standard and cooperative coevolutionary genetic algorithm (GA).
The Puzzle approach.
Conclusions and future work.
Messy Puzzle.
3
The Shortest Common Superstring Problem (SCS)
Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ.
A superstring of S is a string x such that each si in S is a substring of x.
Problem: Find a shortest (common) superstring.
NP-Complete. MAX-SNP hard.
Motivation: DNA sequencing, data compression.
4
SCS: Example
S = {ate, half, lethal, alpha, alfalfa}
A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks).
A shortest common superstring is “lethalphalfalfate” of length 17.
Note that a “compressed” permutation of the blocks is actually a superstring.
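The example can be checked with a minimal sketch. The helper names `overlap` and `compress` are hypothetical, and "compression" is assumed here to mean merging each block with the string built so far at their maximal overlap.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks: list[str]) -> str:
    """Merge blocks left to right, overlapping each with the string built so far."""
    merged = ""
    for block in blocks:
        merged += block[overlap(merged, block):]
    return merged


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(S)                      # plain concatenation
short = compress(["lethal", "alpha", "half", "alfalfa", "ate"])
print(len(trivial), short, len(short))    # 25 lethalphalfalfate 17
```

Any ordering of the blocks yields a valid superstring under this scheme; only the resulting length varies.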
5
Approximation Algorithms
Several linear approximations for SCS have been proposed, most of which rely on greedy approaches.
GREEDY: The most widely used heuristic in DNA sequencing.
Conjecture [Blum 1994, Sweedyk 1999]: The superstring produced by GREEDY is of length at most two times the optimal.
We are not aware of any previous evolutionary approach to the SCS problem.
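For reference, GREEDY is usually described as repeatedly merging the two strings with the largest overlap until one string remains. The sketch below follows that description; the function names and the tie-breaking rule (first pair found) are assumptions, not details taken from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_scs(blocks: list[str]) -> str:
    """Greedy superstring: merge the max-overlap pair until one string is left."""
    unique = sorted(set(blocks))
    # Blocks contained in another block are covered automatically.
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0] if strings else ""


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_scs(S)
print(result, len(result))
```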
6
Outline
The “Shortest Common Superstring” problem.
DNA sequencing and the input domain. (current section)
Standard and cooperative coevolutionary genetic algorithm (GA).
The Puzzle approach.
Conclusions and future work.
Messy Puzzle.
7
DNA Sequencing
The most common usage of the SCS problem.
8
DNA Sequencing (cont’d)
The problem: “read” a string of DNA.
Short DNA strands can be read in the laboratory.
To sequence a long DNA strand (the DNA sequence appears in many copies):
1. Cut the DNA into short fragments using restriction enzymes.
2. Sequence each of the resulting fragments.
3. Order those fragments using an SCS algorithm.
9
The Input Domain
The input strings used in the experiments were inspired by DNA sequencing:
10
Input Generation Setup: Parameters
Size of random string: 250 bits (~50 blocks), 400 bits (~80 blocks)
Minimal block size: 20 bits
Maximal block size: 30 bits
Number of duplicates created from a random string: 5
NB: Increasing the number of blocks results in exponential growth of the problem’s complexity.
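The exact generation procedure is not spelled out in the transcript. One plausible reading of the parameters, sketched below, is that each duplicate of the random string is cut into consecutive blocks of 20 to 30 bits. The function name, the seed, and the handling of a short final block are assumptions.

```python
import random


def generate_blocks(length=250, copies=5, min_size=20, max_size=30, seed=0):
    """Cut `copies` duplicates of one random bit string into random-size blocks."""
    rng = random.Random(seed)
    target = "".join(rng.choice("01") for _ in range(length))
    blocks = []
    for _ in range(copies):
        pos = 0
        while pos < length:
            size = rng.randint(min_size, max_size)
            # Assumption: the last block of a copy may be shorter than min_size.
            blocks.append(target[pos:pos + size])
            pos += size
    return target, blocks


target, blocks = generate_blocks()
print(len(target), len(blocks))
```

With these settings each copy yields between 9 and 13 blocks, which is in line with the "~50 blocks" quoted for a 250-bit string.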
11
Outline
The “Shortest Common Superstring” problem.
DNA sequencing and the input domain.
Standard and cooperative coevolutionary genetic algorithm (GA). (current section)
The Puzzle approach.
Conclusions and future work.
Messy Puzzle.
12
Simple Genetic Algorithm
produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate fitness of modified individuals
    generate a new population
end while
13
Simple GA for the SCS Problem
Given a set of strings as input, generate an initial population of random candidate solutions.
The fitness of each individual depends on its length and accuracy.
The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.
These steps are repeated a predefined number of times or until the solution is deemed satisfactory.
14
Simple GA for the SCS Problem (cont’d)
Blocks of the input set are atomic components.
Representation: An individual’s genome is represented as a sequence of blocks.
An individual may have missing blocks or contain duplicate copies of the same block.
Permutation Representation: Good or Bad?
15
Simple GA for the SCS Problem (cont’d)
Evaluation: the fitness of an individual is the length of its compressed genome plus the total length of the blocks that are not covered by the individual.
Genetic operators:
Fitness-proportionate selection.
Two-point recombination. Allows growth and reduction in the genome’s length.
Block-change mutation.
16
Simple GA for the SCS Problem (example)
S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.
Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9.
Fitness(<s4,s2,s1,s4>) = |11100111| = 8.
Recombination:
p1 = <s1,||s2,s3||,s4>
p2 = <s4,||s1,s3,s2||>
p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>
p4 = recombine2(p1,p2) = <s4,s2,s3>
Mutation:
mutate(<s1,s2,s2>) = <s1,s4,s2>
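The numbers in this example can be reproduced with a short sketch. All function names are hypothetical, and compression is assumed to merge each block with the string built so far at their maximal overlap. Lower fitness is better here, since it measures length plus uncovered material.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome):
    """Merge the genome's blocks left to right with maximal overlaps."""
    merged = ""
    for block in genome:
        merged += block[overlap(merged, block):]
    return merged


def fitness(genome, blocks):
    """Compressed length plus total length of input blocks left uncovered."""
    text = compress(genome)
    return len(text) + sum(len(b) for b in blocks if b not in text)


def two_point(p1, cut1, p2, cut2):
    """Swap the segments p1[a:b] and p2[c:d]; children may grow or shrink."""
    (a, b), (c, d) = cut1, cut2
    return p1[:a] + p2[c:d] + p1[b:], p2[:c] + p1[a:b] + p2[d:]


def block_change(genome, position, new_block):
    """Block-change mutation: replace one block by another input block."""
    return genome[:position] + [new_block] + genome[position + 1:]


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]
print(fitness([s2, s1], S))            # 9
print(fitness([s4, s2, s1, s4], S))    # 8
p3, p4 = two_point([s1, s2, s3, s4], (1, 3), [s4, s1, s3, s2], (1, 4))
```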
17
Coevolution
Simultaneous evolution of two or more species with coupled fitness.
Coevolving species either compete or cooperate.
Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).
18
Cooperative Coevolution
19
Cooperative Coevolution (cont’d)
Cooperative Coevolution involves a number of independently evolving species.
Interaction between species occurs via fitness function only.
The fitness of an individual depends on its ability to collaborate with individuals from other species.
20
Cooperative Coevolution (cont’d)
Source: Potter & De Jong (1997)
21
Cooperative Coevolutionary Algorithm for the SCS Problem
Two species evolve simultaneously.
The first species contains prefixes of candidate solutions to the SCS problem at hand.
The second species contains candidate suffixes.
The fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.
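A minimal sketch of this evaluation, assuming that "merging" means concatenating the prefix and suffix block sequences and scoring the result with the simple GA's fitness. How representatives are chosen is not stated here, so the representative is simply passed in; all names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def fitness(genome, blocks):
    """Compressed length plus total length of uncovered input blocks."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return len(text) + sum(len(b) for b in blocks if b not in text)


def evaluate(individual, representative, blocks, is_prefix):
    """Score an individual by the global solution it forms with a representative."""
    if is_prefix:
        genome = individual + representative   # prefix + suffix representative
    else:
        genome = representative + individual   # prefix representative + suffix
    return fitness(genome, blocks)


S = ["0011", "1100", "1001", "111"]
print(evaluate(["1100"], ["0011"], S, is_prefix=True))           # 9
print(evaluate(["0011", "111"], ["1100"], S, is_prefix=False))   # 7
```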
22
Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)
[Figure: prefixes population and suffixes population; an individual is merged with a representative of the other population.]
23
Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)
[Figure labels: Prefixes population, Suffixes population, Evaluate, Fitness.]
24
Experiments
Compare: GREEDY, Standard GA, Cooperative Coevolution
25
Experimental Setup
Population size: 500
Number of generations: 5000
Recombination rate: 0.8
Mutation rate: 0.03
Problem instances per experiment: 50
Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.
26
Results: Experiment I (~50 blocks)
27
Results: Experiment II (~80 blocks)
28
Results: Summary
Average of the best superstring lengths (distance from optimum in parentheses)
Problem size   GREEDY      Genetic     Cooperative
50 blocks      381 (131)   280 (30)    275 (25)
80 blocks      596 (196)   685 (285)   547 (147)
29
Conclusion:
The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each of which is solved using a standard GA.
30
Outline
The “Shortest Common Superstring” problem.
DNA sequencing and the input domain.
Standard and cooperative coevolutionary genetic algorithm (GA).
The Puzzle approach. (current section)
Conclusions and future work.
Messy Puzzle.
31
The Puzzle Algorithm
32
The Schema Theorem
“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.”
Holland (1975)
33
Building Blocks Hypothesis
“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”
34
Our Interpretation
“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”
35
The Main Assumption
Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.
36
A Problem
Recombination may destroy quality building blocks found by the GA.
37
Example
Brain   Appearance
0010101010101010101000011110100010000
38
Example (cont’d)
Brain   Appearance
0010101010101010101000011110100010000
1. Smart (presumably)
2. Blond
But not very beautiful…
39
The Preservation of Favoured Building Blocks in the Struggle for Fitness: The Puzzle Algorithm
40
Puzzle Algorithm: The Idea
Improve the recombination operator.
Preserve good building blocks discovered by the GA by selecting recombination loci that do not destroy good building blocks.
Result: Assembly of good building blocks to construct better solutions (as in a puzzle).
41
Puzzle Algorithm (cont’d)
Two populations:
1. Candidate solutions: As in the simple GA.
2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.
[Figure: building blocks population and candidate solutions population.]
42
Puzzle Algorithm (cont’d)
Interaction between candidate solutions and building blocks is through the fitness function.
Interaction between building blocks and candidate solutions is through constraints on recombination points.
[Figure: building blocks population and candidate solutions population, linked by “Fitness evaluation” and “Crossover location”.]
43
Puzzle Algorithm: Zoom In
[Figure: building blocks population and candidate solutions population, linked by “Fitness evaluation” and “Crossover location”.]
Each individual is a sequence of blocks.
44
Puzzle Algorithm: Zoom In
[Figure: building blocks population and candidate solutions population, linked by “Fitness evaluation” and “Crossover location”; overlapping building blocks are shown.]
Each building block is contained in at least one individual in the solutions population.
45
The Candidate Solutions Population
Representation, fitness evaluation, selection, and mutation are identical to the simple GA.
A recombination-aid vector aids in selecting the recombination loci.
The recombination-aid vector is updated by the building-block individuals.
46
The Building Blocks Population
An individual is represented as a sequence of blocks, contained in at least one candidate solution.
Fitness of an individual is the average of the fitness of candidate solutions containing it.
Fitness-proportionate selection.
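The averaging rule can be sketched as follows. "Containing" is assumed to mean that the building block occurs as a contiguous run of blocks inside the candidate solution; the function names and the example fitness values are hypothetical.

```python
def contains(solution, bb):
    """True if `bb` occurs as a contiguous run of blocks inside `solution`."""
    n = len(bb)
    return any(solution[i:i + n] == bb for i in range(len(solution) - n + 1))


def building_block_fitness(bb, solutions, solution_fitness):
    """Average fitness of the candidate solutions that contain `bb`."""
    hosts = [f for s, f in zip(solutions, solution_fitness) if contains(s, bb)]
    return sum(hosts) / len(hosts) if hosts else None


solutions = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
solution_fitness = [10, 20, 40]   # made-up values for illustration
print(building_block_fitness(["b", "c"], solutions, solution_fitness))   # 15.0
```

Returning `None` for a block with no host is a design choice of this sketch; the slides require every building block to have at least one host.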
47
The Building Blocks Population (cont’d)
“Unisex” individuals.
Two modification operators:
Expansion: Increase its genome by one block. Occurs with high probability.
Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.
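A minimal sketch of the two operators. Several details are assumptions: expansion is taken to extend the building block by a neighbouring block in some candidate solution that contains it, exploration restarts from a random adjacent pair of blocks in a random candidate solution, and the two operators are treated as mutually exclusive. The default rates 0.8 and 0.1 are placeholders for "high" and "low" probability.

```python
import random


def expand(bb, solutions, rng):
    """Grow `bb` by one block, using a neighbour from a solution containing it."""
    n = len(bb)
    options = []
    for sol in solutions:
        for i in range(len(sol) - n + 1):
            if sol[i:i + n] == bb:
                if i > 0:
                    options.append(sol[i - 1:i + n])      # extend to the left
                if i + n < len(sol):
                    options.append(sol[i:i + n + 1])      # extend to the right
    return rng.choice(options) if options else bb


def explore(solutions, rng):
    """Start over as a new 2-block individual taken from a random solution."""
    sol = rng.choice([s for s in solutions if len(s) >= 2])
    i = rng.randrange(len(sol) - 1)
    return sol[i:i + 2]


def modify(bb, solutions, rng, p_expand=0.8, p_explore=0.1):
    """Apply exploration, expansion, or no change."""
    r = rng.random()
    if r < p_explore:
        return explore(solutions, rng)
    if r < p_explore + p_expand:
        return expand(bb, solutions, rng)
    return bb


rng = random.Random(1)
solutions = [["s1", "s2", "s3", "s4"], ["s4", "s1", "s3", "s2"]]
grown = expand(["s2", "s3"], solutions, rng)
fresh = explore(solutions, rng)
```

Both operators keep the invariant that a building block is contained in at least one candidate solution.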
48
Building Blocks – Candidate Solutions
[Figure: fitness evaluation between the building blocks population and the candidate solutions population; fitness values f1, f2, f3, f4.]
49
Building Blocks – Candidate Solutions
[Figure: fitness evaluation between the building blocks population and the candidate solutions population; fitness values f1, f2, f3, f4 are used to update the “recombination-aid” vector.]
50
Update Recombination-aid vector
Solution’s genome
Recombination-aid vector: 0 0 0 0 0 0 0
building block #1 fitness = 0.3
building block #2 fitness = 0.4
building block #3 fitness = 0.6
51
Update Recombination-aid vector
Solution’s genome
Recombination-aid vector: 0 0.6 0.4 0 0.3 0.3 0
building block #1 fitness = 0.3
building block #2 fitness = 0.4
building block #3 fitness = 0.6
52
Update Recombination-aid vector
Solution’s genome
Recombination-aid vector: 0.6 0.6 0.4 0 0.3 0.3 0.3
building block #1 fitness = 0.3
building block #2 fitness = 0.4
building block #3 fitness = 0.6
53
Recombination-loci selection
Solution’s genome
Recombination-aid vector: 0.6 0.6 0.4 0 0.3 0.3 0.3
* Ties are broken arbitrarily
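The update and selection steps might be sketched as below. The figures are missing from the transcript, so the rules used here are assumptions: each locus keeps the highest fitness among the building blocks that cover it, and recombination takes place at a locus with the lowest value. The placements in the usage example are hypothetical.

```python
def update_aid_vector(n_loci, placements):
    """Build the recombination-aid vector for one candidate solution.

    `placements` holds (start, end, fitness) triples: a building block with
    the given fitness covers loci start..end-1 of the solution's genome.
    """
    aid = [0.0] * n_loci
    for start, end, fit in placements:
        for locus in range(start, end):
            aid[locus] = max(aid[locus], fit)   # keep the best block per locus
    return aid


def pick_locus(aid):
    """Pick a locus with the lowest value (first one found on ties)."""
    return min(range(len(aid)), key=aid.__getitem__)


# Hypothetical placements of three building blocks on a genome with 7 loci.
aid = update_aid_vector(7, [(4, 7, 0.3), (2, 3, 0.4), (0, 2, 0.6)])
print(aid)               # [0.6, 0.6, 0.4, 0.0, 0.3, 0.3, 0.3]
print(pick_locus(aid))   # 3
```

Cutting at the lowest-valued locus leaves every building block in this example intact.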
54
Experiments
Compare: GREEDY, Standard GA, Puzzle
55
Building Blocks - Experimental Setup
Population size: 1000
Expansion rate: 0.8
Exploration rate: 0.1
56
Results: Experiment III (~50 blocks)
Cooperative
57
Results: Experiment IV (~80 blocks)
Cooperative
Did we lose to cooperative?
NO!
58
Results: Summary
Average of the best superstring lengths (distance from optimum in parentheses)
Problem size   GREEDY      Genetic     Puzzle
50 blocks      381 (131)   280 (30)    253 (3)
80 blocks      596 (196)   685 (285)   571 (171)
59
Relations Between The Algorithms
[Diagram: GA, Puzzle, Cooperative, and Co-Puzzle, connected by arrows labelled “puzzle” and “cooperation”.]
60
The Co-Puzzle Algorithm
[Diagram: a candidate prefixes population and a candidate suffixes population, each paired with its own possible building blocks population through “Fitness eval” and “Crossover location”; a further “Fitness evaluation” link is shown.]
61
Experiments
Compare: GREEDY, Cooperative Coevolution, Co-Puzzle
62
Results: Experiment V (~80 blocks)
63
Results: Experiment VI (~50 blocks)
Puzzle
????
64
Results: Summary
Size of shortest common superstring (distance from optimum in parentheses)
Problem size   GREEDY      Cooperative   Co-puzzle
50 blocks      381 (131)   275 (25)      268 (18)
80 blocks      596 (196)   547 (147)     482 (82)
42% improvement over cooperative
65
Outline
The “Shortest Common Superstring” problem.
DNA sequencing and the input domain.
Standard and cooperative coevolutionary genetic algorithm (GA).
The Puzzle approach.
Conclusions and future work. (current section)
Messy Puzzle.
66
Results: Summary
Size of shortest common superstring (distance from optimum in parentheses)
Problem size   GREEDY      Cooperative   Puzzle      Co-puzzle
50 blocks      381 (131)   275 (25)      253 (3)     268 (18)
80 blocks      596 (196)   547 (147)     571 (171)   482 (82)
90 blocks      677 (227)   673 (223)     683 (233)   617 (167)
100 blocks     768 (268)   768 (268)     813 (313)   732 (232)
20 problem instances per experiment
Annotations: 25% better, 13% better, 83% better, 42% better
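The "better" annotations appear to be relative reductions in distance from optimum between the best algorithm and the runner-up. The sketch below checks that reading; the pairing of each percentage with a row is an inference from the numbers, not something the slide states.

```python
def improvement(baseline: int, new: int) -> int:
    """Relative reduction in distance from optimum, in whole percent."""
    return round(100 * (baseline - new) / baseline)


# Distances from optimum as read from the summary slide.
print(improvement(223, 167))   # 90 blocks, Cooperative -> Co-puzzle: 25
print(improvement(268, 232))   # 100 blocks, Cooperative -> Co-puzzle: 13
print(improvement(18, 3))      # 50 blocks, Co-puzzle -> Puzzle: 83
print(improvement(147, 82))    # 80 blocks, Cooperative -> Co-puzzle: 44
```

Three of the four annotations are reproduced exactly. For 80 blocks the rounded averages give 44 rather than the 42% on the slide, which may reflect rounding of the averages or a different basis for that figure.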
67
Larger Problems - Using More Species
Size of shortest common superstring (distance from optimum in parentheses)
Problem size   GREEDY      Co-puzzle   3-Co-puzzle
110 blocks     836 (286)   867 (317)   ? (?)
120 blocks     906 (306)   992 (392)   906 (306)
68
Conclusions
Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?).
When a suitable number of species is used, cooperative coevolution improves performance by decomposing the problem into several easier subproblems.
69
Conclusions (cont’d)
Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA.
Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.
70
Future Work
Test the (Co-) Puzzle approach on other problem domains.
A hybrid GA.
Tackle larger problems.
Comparison to greedy-stochastically based local-search algorithms.
71
Outline
The “Shortest Common Superstring” problem.
DNA sequencing and the input domain.
Standard and cooperative coevolutionary genetic algorithm (GA).
The Puzzle approach.
Conclusions and future work.
Messy Puzzle. (current section)
72
The Messy Puzzle Algorithm
73
Static Detection of Building Blocks for Addressing the Linkage Problem
Hillel Maoz
Ben-Gurion University, Israel
74
The Linkage Problem
A binary genome of size n = 14.
Genes a and b together encode important information.
Random crossover is applied.
Survival probability = the chance to appear in the offspring:
Left genome: 4/15
Right genome: 14/15
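The two probabilities are consistent with a simple model, sketched below under stated assumptions: one-point crossover whose cut point is uniform over the n + 1 = 15 positions of the genome, with the gene pair surviving whenever the cut does not fall between the two genes. The gene positions are hypothetical, since the figure is missing; only their separation matters.

```python
from fractions import Fraction


def survival(n: int, pos_a: int, pos_b: int) -> Fraction:
    """Chance that one-point crossover does not separate genes a and b."""
    separation = abs(pos_a - pos_b)          # number of cut points between them
    return Fraction(n + 1 - separation, n + 1)


print(survival(14, 1, 12))   # genes far apart: 4/15
print(survival(14, 6, 7))    # adjacent genes: 14/15
```

The same pair of genes is therefore far more likely to be inherited together when the representation places them next to each other.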
75
The Linkage Problem (cont’d)
In many cases it is hard to know the optimal representation.
76
The MaxCut Problem
Input: undirected weighted graph G=(V, E, W).
Output: a partition of V into two disjoint sets (S,V\S).
Goal: maximal sum of edge weights between the sets.
NP-complete.
77
MaxCut - Example
[Figure: example cuts with Cut = 34 and Cut = 47.]
78
Simple GA for MaxCut
Population of candidate solutions
• Assign each node a number
• Assign ‘0’ or ‘1’ to indicate which set the node belongs to
Iteration step
• Select any two parents
• Recombine and create an offspring
• Repeat until a new population is generated
Fitness – The weight of the cut
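The encoding and fitness can be sketched as follows. The edge-list format, the function names, and the use of one-point crossover are assumptions for illustration; the slide only says that parents are recombined.

```python
def cut_weight(assignment, edges):
    """Fitness: total weight of edges whose endpoints lie in different sets."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def one_point_crossover(parent1, parent2, point):
    """Offspring takes parent1's bits before `point` and parent2's after."""
    return parent1[:point] + parent2[point:]


# A small weighted triangle; nodes are numbered 0..2, edges are (u, v, weight).
edges = [(0, 1, 3), (1, 2, 4), (0, 2, 5)]
genome = [0, 1, 1]                  # node 0 in one set, nodes 1 and 2 in the other
print(cut_weight(genome, edges))    # 8
child = one_point_crossover([0, 0, 0], [1, 1, 1], 1)
print(child, cut_weight(child, edges))   # [0, 1, 1] 8
```

Because the genome is indexed by node number, the numbering decides which nodes sit next to each other and are likely to be inherited together.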
79
The Representation Problem
“How to define the order of the vertices within the genome?”
80
Messy Genes
The main difficulty: identifying the related vertices.
A messy gene is an ordered pair <allele-locus,allele-value>.
Possible solution:
Use some sort of messy genes to detect related genes.
Use the Puzzle approach to keep them together.
81
The Messy Puzzle Algorithm
A building block’s genome is represented as a sequence of messy genes.
82
Messy Puzzle Algorithm
Two-population setup as in the puzzle algorithm.
Enhanced recombination operator.
Evolved building blocks structure (similar to puzzle).
Example building block: <0,0> <2,0> <1,1> <5,0> <6,1>
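A messy building block names its own loci, so its genes need not be adjacent or ordered. The sketch below shows one way such a block might be overlaid on a binary genome; the function name and the overlay rule are assumptions, and only the gene pairs come from the slide.

```python
def apply_block(genome, block):
    """Overlay a messy building block on a copy of the genome.

    `block` is a list of (locus, value) messy genes.
    """
    child = list(genome)
    for locus, value in block:
        child[locus] = value
    return child


block = [(0, 0), (2, 0), (1, 1), (5, 0), (6, 1)]   # messy genes shown on the slide
genome = [1, 1, 1, 1, 1, 1, 1, 1]                  # hypothetical 8-node solution
print(apply_block(genome, block))   # [0, 1, 0, 1, 1, 0, 1, 1]
```

Loci that the block does not mention (3, 4, and 7 here) keep the values of the host genome.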
83
Enhanced Recombination
I) Add the 1st BB - success
II) Add the 2nd BB - failure
III) Add the 3rd BB - success
IV) Simple crossover
[Figure: two genomes with loci 1 to 8 and building blocks with fitness 0.8, 0.7, and 0.6.]
84
Static Detection of Building Blocks
Building blocks do not truly evolve.
No Expansion and Exploration operators.
Building blocks’ fitness is based on a number of generations.
Purpose: to check and understand the core of the messy puzzle algorithm.
85
Results
Randomly generated graphs. 1000 generations. 10 separate experiments per problem instance.
[Chart: “Max Cut Size - Puzzle VS. GA”; x-axis: graph number (1 to 10); y-axis: cut size difference (0 to 80).]
Graphs: 1 graph_200_0.01_1, 2 graph_200_0.05_1, 3 graph_200_0.1_1, 4 graph_200_0.3_1, 5 graph_200_0.5_1, 6 graph_300_0.01_1, 7 graph_300_0.05_1, 8 graph_300_0.1_1, 9 graph_300_0.3_1, 10 graph_300_0.5_1
[Chart: “Avg Cut Size - different number of BB (graph_300_0.1_1)”; x-axis: number of BB (10 to 60); y-axis: distance from GA (0 to 50).]
[Chart: “Avg Cut Size - different number of BB (graph_300_0.5_1)”; x-axis: number of BB (10 to 60); y-axis: distance from GA (-20 to 80).]
[Chart: “Max Cut Size - Bi-partite graphs”; x-axis: graph number (1 to 6); y-axis: cut size difference (-200 to 1600); series: distance to optimum, Puzzle addition.]
86
Conclusions and Future Work
Do messy work to solve the linkage problem.
Even a small population of building blocks improves the GA performance.
Messy puzzle is better when inner structures exist.
Applying evolution to the building blocks population.
Comparing to different representation-search techniques.