Coevolving Solutions to the Shortest Common Superstring Problem

Post on 09-Feb-2016


Coevolving Solutions to the Shortest Common Superstring Problem. Assaf Zaritsky & Moshe Sipper, Ben-Gurion University, Israel. www.cs.bgu.ac.il/~assafza


1

Coevolving Solutions to the Shortest Common Superstring Problem

Assaf Zaritsky & Moshe Sipper

Ben-Gurion University, Israel

www.cs.bgu.ac.il/~assafza

2

Outline

The “Shortest Common Superstring” problem. (current section)

DNA sequencing and the input domain.

Standard and cooperative coevolutionary genetic algorithm (GA).

The Puzzle approach.

Conclusions and future work.

Messy Puzzle.

3

The Shortest Common Superstring Problem (SCS)

Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x.

Problem: Find the shortest (common) superstring.

NP-hard (the decision version is NP-complete). MAX-SNP hard.

Motivation: DNA sequencing, data compression.

4

SCS: Example

S = {ate, half, lethal, alpha, alfalfa}

A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks).

A shortest common superstring is “lethalphalfalfate” of length 17.

Note that a “compressed” permutation of the blocks is actually a superstring.
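The example can be checked with a short sketch. The helper names `overlap`, `compress` and `is_superstring` are illustrative and not taken from the slides; "compressing" is assumed here to mean merging adjacent blocks on their longest suffix/prefix overlap.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks: list[str]) -> str:
    """Merge blocks left to right, overlapping each block with the text so far."""
    result = ""
    for block in blocks:
        result += block[overlap(result, block):]
    return result


def is_superstring(x: str, blocks: list[str]) -> bool:
    """True if every block occurs in x as a substring."""
    return all(block in x for block in blocks)


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(S)                                            # plain concatenation
best = compress(["lethal", "alpha", "half", "alfalfa", "ate"])  # one good permutation
print(len(trivial), trivial)
print(len(best), best)
```

Because each block is appended on top of its own overlap, the compressed string always ends with the block just added, which is why any compressed permutation is a superstring.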

5

Approximation Algorithms

Several linear approximations for SCS have been proposed, most of which rely on greedy approaches.

GREEDY: the most widely used heuristic in DNA sequencing.

Conjecture [Blum 1994, Sweedyk 1999]: the superstring produced by GREEDY is at most twice as long as the optimal one.

We are not aware of any previous evolutionary approach to the SCS problem.
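The slides do not spell GREEDY out. A minimal sketch of the usual formulation (repeatedly merge the two strings with the largest overlap) might look as follows; the function names and the tie-breaking rule (first pair found) are assumptions.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap."""
    unique = list(dict.fromkeys(blocks))  # drop duplicates, keep input order
    # Blocks contained in another block are covered automatically.
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties: keep the first pair found
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)
    return strings[0] if strings else ""


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(S)
print(len(result), result)
```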

6

Outline (current section: DNA sequencing and the input domain)

7

DNA Sequencing

The most common use of the SCS problem.

8

DNA Sequencing (cont’d)

The problem: “read” a string of DNA. Short DNA strands can be read in the laboratory. To sequence a long DNA strand (the DNA sequence appears in many copies):

1. Cut the DNA into short fragments using restriction enzymes.

2. Sequence each of the resulting fragments.

3. Order those fragments using an SCS algorithm.

9

The Input Domain

The input strings used in the experiments were inspired by DNA sequencing.

10

Input Generation Setup: Parameters

Size of random string: 250 bits (~50 blocks) or 400 bits (~80 blocks)

Minimal block size: 20 bits

Maximal block size: 30 bits

Number of duplicates created from a random string: 5

NB: increasing the number of blocks results in exponential growth of the problem’s complexity.
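The slides list only the parameters, not the generation procedure. One plausible reading, sketched below, is to copy a random bit string five times and cut each copy into consecutive fragments of 20 to 30 bits. The function name, the seed, and the treatment of the last (possibly shorter) fragment are assumptions.

```python
import random


def generate_blocks(length=250, copies=5, min_size=20, max_size=30, seed=0):
    """Cut `copies` duplicates of one random bit string into random-sized blocks."""
    rng = random.Random(seed)
    source = "".join(rng.choice("01") for _ in range(length))
    blocks = []
    for _ in range(copies):
        pos = 0
        while pos < length:
            size = rng.randint(min_size, max_size)
            # Assumption: the final fragment of a copy may be shorter than min_size.
            blocks.append(source[pos:pos + size])
            pos += size
    return source, blocks


source, blocks = generate_blocks()
print(len(source), len(blocks))
```

With 250 bits, 5 copies and an average block of about 25 bits, this yields roughly 50 blocks, in line with the "~50 blocks" figure in the table.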

11

Outline (current section: Standard and cooperative coevolutionary genetic algorithm (GA))

12

Simple Genetic Algorithm

produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate fitness of modified individuals
    generate a new population
end while
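The loop above can be sketched generically in Python. Everything here is illustrative: `run_ga` and its callbacks are hypothetical names, fitness is assumed non-negative with higher values better, and the toy problem (counting 1-bits) only exercises the loop; it is not the SCS setup.

```python
import random


def run_ga(init, fitness, recombine, mutate, pop_size=30, generations=40, seed=0):
    """Generic GA loop; returns the best (score, individual) pair seen."""
    rng = random.Random(seed)
    # produce an initial population and evaluate it
    population = [init(rng) for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    best = max(zip(scores, population), key=lambda pair: pair[0])
    for _ in range(generations):  # termination: fixed number of generations
        weights = [s + 1e-9 for s in scores]  # avoid an all-zero weight vector
        next_population = []
        while len(next_population) < pop_size:
            # select fitter individuals (fitness-proportionate), recombine, mutate
            a, b = rng.choices(population, weights=weights, k=2)
            next_population.append(mutate(recombine(a, b, rng), rng))
        population = next_population  # generate a new population
        scores = [fitness(ind) for ind in population]
        best = max([best, *zip(scores, population)], key=lambda pair: pair[0])
    return best


def init(rng):
    return [rng.randint(0, 1) for _ in range(16)]


def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip(ind, rng):
    return [1 - bit if rng.random() < 0.03 else bit for bit in ind]


best_score, best_ind = run_ga(init, sum, one_point, flip)
print(best_score, best_ind)
```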

13

Simple GA for the SCS Problem

Given a set of strings as input, generate an initial population of random candidate solutions.

The fitness of each individual depends on its length and accuracy.

The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.

These steps are repeated a predefined number of times or until the solution is deemed satisfactory.

14

Simple GA for the SCS Problem (cont’d)

Blocks of the input set are atomic components.

Representation: An individual’s genome is represented as a sequence of blocks.

An individual may have missing blocks or contain duplicate copies of the same block.

Permutation Representation: Good or Bad?

15

Simple GA for the SCS Problem (cont’d)

Evaluation: the fitness of an individual is the length of its compressed genome plus the total length of the blocks that are not covered by the individual.

Genetic operators:

Fitness-proportionate selection.

Two-point recombination. Allows growth and reduction in the genome’s length.

Block-change mutation.
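A minimal sketch of this evaluation, assuming that "compressed genome" means merging adjacent blocks on their longest overlap and that a block is "covered" when it occurs as a substring of the compressed genome. The function names are illustrative; the block set and the two genomes are the ones used in the worked example on the next slide.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    """Merge the genome's blocks left to right on their overlaps."""
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus the total length of uncovered blocks."""
    compressed = compress(genome)
    uncovered = sum(len(b) for b in blocks if b not in compressed)
    return len(compressed) + uncovered


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]
print(fitness([s2, s1], S))          # 9: |110011| + |111|
print(fitness([s4, s2, s1, s4], S))  # 8: |11100111|
```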

16

Simple GA for the SCS Problem (example)

S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.

Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9.

Fitness(<s4,s2,s1,s4>) = |11100111| = 8.

Recombination:

p1 = <s1,||s2,s3||,s4>

p2 = <s4,||s1,s3,s2||>

p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>

p4 = recombine2(p1,p2) = <s4,s2,s3>

Mutation:

mutate(<s1,s2,s2>) = <s1,s4,s2>
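The recombination and mutation examples can be reproduced with a small sketch. The function names are illustrative. In the GA itself the cut points and the mutated position would be drawn at random; here they are passed in explicitly so that the slide's numbers come out.

```python
def two_point_recombine(p1, cuts1, p2, cuts2):
    """Swap the segment p1[a:b] with the segment p2[c:d]; return both offspring.

    The two segments may differ in length, so offspring can grow or shrink.
    """
    (a, b), (c, d) = cuts1, cuts2
    child1 = p1[:a] + p2[c:d] + p1[b:]
    child2 = p2[:c] + p1[a:b] + p2[d:]
    return child1, child2


def block_change_mutation(genome, index, new_block):
    """Replace the block at `index` with another block of the input set."""
    mutated = list(genome)
    mutated[index] = new_block
    return mutated


p1 = ["s1", "s2", "s3", "s4"]  # <s1,||s2,s3||,s4>
p2 = ["s4", "s1", "s3", "s2"]  # <s4,||s1,s3,s2||>
p3, p4 = two_point_recombine(p1, (1, 3), p2, (1, 4))
print(p3)  # ['s1', 's1', 's3', 's2', 's4']
print(p4)  # ['s4', 's2', 's3']
print(block_change_mutation(["s1", "s2", "s2"], 1, "s4"))  # ['s1', 's4', 's2']
```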

17

Coevolution

Simultaneous evolution of two or more species with coupled fitness.

Coevolving species either compete or cooperate.

Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).

18

Cooperative Coevolution

19

Cooperative Coevolution (cont’d)

Cooperative Coevolution involves a number of independently evolving species.

Interaction between species occurs via fitness function only.

The fitness of an individual depends on its ability to collaborate with individuals from other species.

20

Cooperative Coevolution (cont’d)

Source: Potter & DeJong (1997)

21

Cooperative Coevolutionary Algorithm for the SCS Problem

Two species evolve simultaneously.

The first species contains prefixes of candidate solutions to the SCS problem at hand.

The second species contains candidate suffixes.

The fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.

22

Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: Prefixes population and Suffixes population; labels: Individual, Representative, Suffix, Merge.]

23

Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: Prefixes population and Suffixes population; labels: Evaluate, Fitness.]
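A minimal sketch of the evaluation process, under stated assumptions: genomes are sequences of blocks, "merge" concatenates a prefix genome with a suffix genome, and the merged genome is scored with the simple GA's fitness. How the representative is chosen is not stated on the slides, so it is simply passed in. All function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome):
    """Merge the genome's blocks left to right on their overlaps."""
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome, blocks):
    """Compressed length plus the total length of uncovered blocks."""
    compressed = compress(genome)
    return len(compressed) + sum(len(b) for b in blocks if b not in compressed)


def evaluate_species(population, representative, blocks, is_prefix):
    """Score each individual by merging it with the other species' representative."""
    scores = []
    for individual in population:
        merged = individual + representative if is_prefix else representative + individual
        scores.append(fitness(merged, blocks))
    return scores


S = ["0011", "1100", "1001", "111"]
prefixes = [["1100"], ["111"]]
suffix_representative = ["0011"]
prefix_scores = evaluate_species(prefixes, suffix_representative, S, is_prefix=True)
print(prefix_scores)  # [9, 7]
```

The suffixes population would be scored the same way with `is_prefix=False` and a representative taken from the prefixes population.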

24

Experiments

Compare: GREEDY, Standard GA, Cooperative Coevolution

25

Experimental Setup

Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.

Population size: 500

Number of generations: 5000

Recombination rate: 0.8

Mutation rate: 0.03

Problem instances per experiment: 50

26

Results: Experiment I (~50 blocks)

27

Results: Experiment II (~80 blocks)

28

Results: Summary

Average of the best superstring lengths (distance from optimum in parentheses):

Problem size   GREEDY      Genetic     Cooperative
50 blocks      381 (131)   280 (30)    275 (25)
80 blocks      596 (196)   685 (285)   547 (147)

29

Conclusion:

The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each of which is solved using a standard GA.

30

Outline (current section: The Puzzle approach)

31

The Puzzle Algorithm

32

The Schema Theorem

“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.”

Holland (1975)

33

Building Blocks Hypothesis

“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”

34

Our Interpretation

“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”

35

The Main Assumption

Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.

36

A Problem

Recombination may destroy quality building blocks found by the GA.

37

Example

Brain Appearance

00101010101010101010000111101000100000010101010101010101000011110100010000

38

Example (cont’d)

Brain Appearance

00101010101010101010000111101000100000010101010101010101000011110100010000

1. Smart (presumably)

2. Blond

But not very beautiful…

39

The Preservation of Favoured Building Blocks in the Struggle for Fitness: The Puzzle Algorithm

40

Puzzle Algorithm: The Idea

Improve the recombination operator.

Preserve good building blocks discovered by the GA by selecting recombination loci that do not destroy good building blocks.

Result: Assembly of good building blocks to construct better solutions (as in a puzzle).

41

Puzzle Algorithm (cont’d)

Two populations:

1. Candidate solutions: As in the simple GA.

2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.

[Figure: Building blocks population; Candidate solutions population.]

42

Puzzle Algorithm (cont’d)

Interaction between candidate solutions and building blocks is through the fitness function.

Interaction between building blocks and candidate solutions is through constraints on recombination points.

[Figure: Building blocks population and Candidate solutions population, with the labels “Fitness evaluation” and “Crossover location”.]

43

Puzzle Algorithm: Zoom In

[Figure: Building blocks population and Candidate solutions population, with the labels “Fitness evaluation” and “Crossover location”.]

Each individual is a sequence of blocks.

44

Puzzle Algorithm: Zoom In

[Figure: Building blocks population and Candidate solutions population, with the labels “Fitness evaluation” and “Crossover location”.]

Each building block is contained in at least one individual in the solutions population.

Overlapping building blocks.

45

The Candidate Solutions Population

Representation, fitness evaluation, selection, and mutation are identical to the simple GA.

Recombination-aid vector aids in selecting the recombination loci.

Recombination-aid vector is updated by building blocks individuals.


46

The Building Blocks Population

An individual is represented as a sequence of blocks, contained in at least one candidate solution.

The fitness of an individual is the average of the fitness of the candidate solutions containing it.

Fitness-proportionate selection.
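The fitness rule for building blocks can be sketched as below. "Containing" is assumed to mean that the building block occurs as a contiguous run of blocks in the candidate solution; the function names and the sample fitness values are made up for illustration.

```python
def contains(solution, building_block):
    """True if building_block occurs as a contiguous run of blocks in solution."""
    n, m = len(solution), len(building_block)
    return any(solution[i:i + m] == building_block for i in range(n - m + 1))


def building_block_fitness(building_block, solutions, solution_fitness):
    """Average fitness of the candidate solutions that contain the building block."""
    scores = [f for s, f in zip(solutions, solution_fitness)
              if contains(s, building_block)]
    # None signals that no candidate solution contains the building block.
    return sum(scores) / len(scores) if scores else None


solutions = [["s1", "s2", "s3"], ["s2", "s3", "s4"], ["s4", "s1"]]
solution_fitness = [10, 14, 9]  # hypothetical fitness values
print(building_block_fitness(["s2", "s3"], solutions, solution_fitness))  # 12.0
```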

47

The Building Blocks Population (cont’d)

“Unisex” individuals.

Two modification operators:

Expansion: Increase its genome by one block. Occurs with high probability.

Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.
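The slides give only the intent of the two operators, so the following is a sketch under several assumptions: a building block grows by taking a neighbouring block from a candidate solution that contains it, a new 2-block individual is cut from a random candidate solution, and the two rates (0.8 and 0.1, from the experimental setup slide) are applied as mutually exclusive events. All names are hypothetical.

```python
import random


def occurrences(solution, bb):
    """Start indices at which bb occurs contiguously in solution."""
    m = len(bb)
    return [i for i in range(len(solution) - m + 1) if solution[i:i + m] == bb]


def explore(solutions, rng):
    """'Die' and start over as two adjacent blocks of a random candidate solution."""
    solution = rng.choice([s for s in solutions if len(s) >= 2])
    i = rng.randrange(len(solution) - 1)
    return solution[i:i + 2]


def expand(bb, solutions, rng):
    """Grow by one neighbouring block, staying inside some candidate solution."""
    options = []
    for s in solutions:
        for i in occurrences(s, bb):
            if i > 0:
                options.append(s[i - 1:i + len(bb)])  # extend to the left
            if i + len(bb) < len(s):
                options.append(s[i:i + len(bb) + 1])  # extend to the right
    return rng.choice(options) if options else bb


def modify(bb, solutions, rng, expansion_rate=0.8, exploration_rate=0.1):
    """Apply exploration, expansion, or nothing (assumed to be exclusive events)."""
    r = rng.random()
    if r < exploration_rate:
        return explore(solutions, rng)
    if r < exploration_rate + expansion_rate:
        return expand(bb, solutions, rng)
    return bb


rng = random.Random(1)
solutions = [["s1", "s2", "s3", "s4"], ["s3", "s4", "s1"]]
print(modify(["s2", "s3"], solutions, rng))
```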

48

Building Blocks – Candidate Solutions

[Figure: Fitness evaluation. Building blocks population and Candidate solutions population; the candidate solutions carry fitness values f1, f2, f3, f4.]

49

Building Blocks – Candidate Solutions

[Figure: Fitness evaluation and update of the “recombination-aid” vector. Building blocks population and Candidate solutions population, labelled with the fitness values f1, f2, f3, f4.]

50

Update Recombination-aid vector

[Figure: solution’s genome with its recombination-aid vector.]

Recombination-aid vector: 0 0 0 0 0 0 0 0 0 0 0 0 0 0

building block #1 fitness = 0.3

building block #2 fitness = 0.4

building block #3 fitness = 0.6

51

Update Recombination-aid vector

[Figure: solution’s genome with its recombination-aid vector.]

Recombination-aid vector: 0 0 0.6 0.6 0.4 0.4 0 0 0.3 0.3 0.3 0.3 0 0

building block #1 fitness = 0.3

building block #2 fitness = 0.4

building block #3 fitness = 0.6

52

Update Recombination-aid vector

[Figure: solution’s genome with its recombination-aid vector.]

Recombination-aid vector: 0.6 0.6 0.6 0.6 0.4 0.4 0 0 0.3 0.3 0.3 0.3 0.3 0.3

building block #1 fitness = 0.3

building block #2 fitness = 0.4

building block #3 fitness = 0.6

53

Recombination-loci selection

[Figure: solution’s genome with its recombination-aid vector.]

Recombination-aid vector: 0.6 0.6 0.6 0.6 0.4 0.4 0 0 0.3 0.3 0.3 0.3 0.3 0.3

* Ties are broken arbitrarily
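The update and selection rules are shown only pictorially, so this sketch rests on assumptions: the vector has one entry per genome position, each building block found in the genome writes its fitness to the positions it covers (keeping the maximum where building blocks overlap), and the recombination locus is a position with the lowest value, with ties broken at random. The function names and the building-block positions are made up; they are chosen so that the result matches the last vector shown above.

```python
import random


def update_aid_vector(genome, building_blocks):
    """building_blocks: list of (sequence_of_blocks, fitness) pairs."""
    vector = [0.0] * len(genome)
    for bb, fit in building_blocks:
        m = len(bb)
        for i in range(len(genome) - m + 1):
            if genome[i:i + m] == bb:  # the building block occurs at position i
                for pos in range(i, i + m):
                    vector[pos] = max(vector[pos], fit)
    return vector


def choose_locus(vector, rng):
    """Pick a position with the lowest value, breaking ties arbitrarily."""
    lowest = min(vector)
    return rng.choice([i for i, v in enumerate(vector) if v == lowest])


genome = [f"b{i}" for i in range(14)]  # 14 distinct blocks, as in the slides' vector
building_blocks = [
    (genome[8:14], 0.3),  # building block #1
    (genome[2:6], 0.4),   # building block #2, overlapping #3
    (genome[0:4], 0.6),   # building block #3
]
vector = update_aid_vector(genome, building_blocks)
print(vector)
print(choose_locus(vector, random.Random(0)))
```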

54

Experiments

Compare: GREEDY, Standard GA, Puzzle

55

Building Blocks - Experimental Setup

Population size: 1000

Expansion rate: 0.8

Exploration rate: 0.1

56

Results: Experiment III (~50 blocks)

[Figure annotation: Cooperative]

57

Results: Experiment IV (~80 blocks)

[Figure annotation: Cooperative]

Did we lose to cooperative?

NO!

58

Results: Summary

Average of the best superstring lengths (distance from optimum in parentheses):

Problem size   GREEDY      Genetic     Puzzle
50 blocks      381 (131)   280 (30)    253 (3)
80 blocks      596 (196)   685 (285)   571 (171)

59

Relations Between The Algorithms

[Figure: diagram relating GA, Puzzle, Cooperative and Co-Puzzle; the connections are labelled “puzzle” and “cooperation”.]

60

The Co-Puzzle Algorithm

[Figure: a Candidate prefixes population and a Candidate suffixes population, each paired with its own Possible building blocks population (labels: Fitness eval, Crossover location); the two candidate populations are connected by Fitness evaluation.]

61

Experiments

Compare: GREEDY, Cooperative Coevolution, Co-Puzzle

62

Results: Experiment V (~80 blocks)

63

Results: Experiment VI (~50 blocks)

[Figure annotation: Puzzle ????]

64

Results: Summary

Size of shortest common superstring (distance from optimum in parentheses):

Problem size   GREEDY      Cooperative   Co-puzzle
50 blocks      381 (131)   275 (25)      268 (18)
80 blocks      596 (196)   547 (147)     482 (82)

42% improvement over cooperative

65

Outline (current section: Conclusions and future work)

66

Results: Summary

Size of shortest common superstring (distance from optimum in parentheses):

Problem size   GREEDY      Cooperative   Puzzle      Co-puzzle
50 blocks      381 (131)   275 (25)      253 (3)     268 (18)
80 blocks      596 (196)   547 (147)     571 (171)   482 (82)
90 blocks      677 (227)   673 (223)     683 (233)   617 (167)
100 blocks     768 (268)   768 (268)     813 (313)   732 (232)

20 problem instances per experiment

Slide annotations: 25% better, 13% better, 83% better, 42% better

67

Larger Problems - Using More Species

Size of shortest common superstring (distance from optimum in parentheses):

Problem size   GREEDY      Co-puzzle   3-Co-puzzle
110 blocks     836 (286)   867 (317)   ? (?)
120 blocks     906 (306)   992 (392)   906 (306)

68

Conclusions

Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?).

When a suitable number of species are used, cooperative coevolution improves performance by decomposing the problem into several easier subproblems.

69

Conclusions (cont’d)

Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA.

Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.

70

Future Work

Test the (Co-)Puzzle approach on other problem domains.

A hybrid GA.

Tackle larger problems.

Comparison to greedy-stochastically based local-search algorithms.

71

Outline (current section: Messy Puzzle)

72

The Messy Puzzle Algorithm

73

Static Detection of Building Blocks for Addressing the Linkage Problem

Hillel Maoz

Ben-Gurion University, Israel

74

The Linkage Problem

A binary genome of size n = 14.

Genes a and b together encode important information.

Random crossover is applied.

Survival probability = the chance to appear in the offspring.

Left genome: 4/15

Right genome: 14/15
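The two probabilities are consistent with a simple model, sketched here as an assumption rather than as the authors' derivation: one-point crossover with n + 1 = 15 equally likely cut points, where the pair (a, b) survives unless the cut falls between the two genes. The gene positions below are hypothetical, since the original figure is not part of the transcript.

```python
from fractions import Fraction


def survival_probability(n, pos_a, pos_b):
    """Chance that one-point crossover does not cut between genes a and b,
    assuming n + 1 equally likely cut points."""
    separating = abs(pos_a - pos_b)  # cut points lying between the two genes
    return Fraction(n + 1 - separating, n + 1)


# Hypothetical positions: far apart on the left genome, adjacent on the right one.
print(survival_probability(14, 1, 12))  # 4/15
print(survival_probability(14, 7, 8))   # 14/15
```

Under this model the same pair of genes is far more likely to survive crossover when the representation places them next to each other.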

75

The Linkage Problem (cont’d)

In many cases it is hard to know the optimal representation.

76

The MaxCut Problem

Input: undirected weighted graph G=(V, E, W).

Output: a partition of V into two disjoint sets (S,V\S).

Goal: maximal sum of edge weights between the sets.

NP-complete.

77

MaxCut - Example

[Figure: example cuts of a weighted graph, with Cut = 34 and Cut = 47.]

78

Simple GA for MaxCut

Population of candidate solutions

• Give each node a number

• Assign ‘0’ or ‘1’ to indicate which set the node belongs to

Iteration step

• Select any two parents

• Recombine and create an offspring

• Repeat until a new population is generated

Fitness: the weight of the cut
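The encoding and fitness above can be sketched in a few lines. The edge-list format, the function name and the small example graph are made up for illustration.

```python
def cut_weight(edges, genome):
    """edges: list of (u, v, weight); genome[i] in {0, 1} is the set of node i.

    An edge contributes to the cut when its endpoints lie in different sets.
    """
    return sum(w for u, v, w in edges if genome[u] != genome[v])


# A hypothetical weighted graph on four nodes.
edges = [(0, 1, 3), (1, 2, 5), (2, 3, 2), (3, 0, 4), (0, 2, 1)]
print(cut_weight(edges, [0, 1, 0, 1]))  # 14
print(cut_weight(edges, [0, 0, 1, 1]))  # 10
```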

79

The Representation Problem

“How to define the order of the vertices within the genome?”

80

Messy Genes

The main difficulty: identifying the related vertices.

A messy gene is an ordered pair <allele-locus, allele-value>.

Possible solution:

Use some sort of messy genes to detect related genes.

Use the Puzzle approach to keep them together.

81

The Messy Puzzle Algorithm

A building block’s genome is represented as a sequence of messy genes.

82

Messy Puzzle Algorithm

Two-population setup as in the Puzzle algorithm.

Enhanced recombination operator.

Evolved building blocks structure (similar to Puzzle).

Example building block: <0,0> <2,0> <1,1> <5,0> <6,1>
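A building block made of messy genes can be represented as a list of (locus, value) pairs. The sketch below uses the building block shown above; the helper names `matches` and `overlay` are illustrative, and how the Messy Puzzle algorithm actually uses such building blocks during recombination is not specified in the transcript.

```python
def matches(genome, building_block):
    """True if every messy gene (locus, value) agrees with the genome."""
    return all(genome[locus] == value for locus, value in building_block)


def overlay(genome, building_block):
    """Return a copy of the genome with the building block's values written in."""
    result = list(genome)
    for locus, value in building_block:
        result[locus] = value
    return result


building_block = [(0, 0), (2, 0), (1, 1), (5, 0), (6, 1)]
genome = [1, 1, 1, 1, 1, 1, 1, 1]  # for MaxCut: the set assigned to each vertex
patched = overlay(genome, building_block)
print(patched)  # [0, 1, 0, 1, 1, 0, 1, 1]
print(matches(genome, building_block), matches(patched, building_block))
```

Because each gene carries its own locus, the genes of a building block need not be adjacent in the genome, which is what lets related vertices be kept together regardless of their numbering.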

83

Enhanced Recombination

[Figure: four steps on genomes with loci 1 to 8; building blocks with fitness 0.8, 0.7 and 0.6.]

I) Add the 1st BB: success

II) Add the 2nd BB: failure

III) Add the 3rd BB: success

IV) Simple crossover

84

Static Detection of Building Blocks

Building blocks do not truly evolve.

No Expansion and Exploration operators.

Building blocks’ fitness is based on a number of generations.

Purpose: to check and understand the core of the Messy Puzzle algorithm.

85

Results

[Figure: “Max Cut Size - Puzzle vs. GA”. x-axis: graph number (1 to 10); y-axis: cut size difference (0 to 80).]

Graphs: 1 = graph_200_0.01_1, 2 = graph_200_0.05_1, 3 = graph_200_0.1_1, 4 = graph_200_0.3_1, 5 = graph_200_0.5_1, 6 = graph_300_0.01_1, 7 = graph_300_0.05_1, 8 = graph_300_0.1_1, 9 = graph_300_0.3_1, 10 = graph_300_0.5_1.

Randomly generated graphs. 1000 generations. 10 separate experiments per problem instance.

[Figure: “Avg Cut Size - different number of BB (graph_300_0.1_1)”. x-axis: number of BB (10 to 60); y-axis: distance from GA.]

[Figure: “Avg Cut Size - different number of BB (graph_300_0.5_1)”. x-axis: number of BB (10 to 60); y-axis: distance from GA.]

[Figure: “Max Cut Size - Bi-partite graphs”. x-axis: graph number (1 to 6); y-axis: cut size difference. Series: Distance to optimum; Puzzle addition.]

86

Conclusions and Future Work

Do messy work to solve the linkage problem.

Even a small population of building blocks improves the GA performance.

Messy Puzzle is better when inner structures exist.

Applying evolution to the building blocks population.

Comparing to different representation-search techniques.
