Download - DNA Codes Design - SUPSI
IMPORTANT: Applications of metaheuristics to the Sequential Ordering Problem will be covered first. Applications of metaheuristics to DNA codes design will be treated only if time permits. Roberto
1
Roberto Montemanni
Dalle Molle Institute for Artificial Intelligence
University of Applied Science of Southern Switzerland
Email: [email protected] Tel: +41 58 666 666 7
DNA Codes Design
2
Outline
• Introduction
• The DNA Codes Design problem
• Approaches in the literature
• Construction heuristics
• Simple local searches
• Metaheuristics
  – Intro to Stochastic Local Search
  – Applications to the DNA codes design problem
  – Intro to Variable Neighbourhood Search
  – Applications to the DNA codes design problem
• Bibliography
3
Contributions to slides
• Dan C. Tulpan, NRC Institute for Information Technology, Canada (Introduction, Applications, Stochastic Local Searches)
• Marco Chiarandini, University of Southern Denmark, Denmark (Introduction to Stochastic Local Search)
• Thomas Stuetzle, Darmstadt University of Technology, Germany (Introduction to Variable Neighbourhood Search)
4
DNA – The Blueprint of Life
[Figure: nine clip-art pictures (chimp, cow, dinosaur, bird, fish, worm, bacteria, human, DNA); pictures taken from ClipArt]
Background: DNA
6
What is DNA?
• All organisms on this planet are made of the same type of genetic blueprint.
7
Real Applications
• DNA computing => using DNA for massively parallel computations.
• DNA Chemical libraries => for the development and test of new drugs
• DNA Microarrays => for profiling genes and tracing genes within long DNA strands
• DNA Nanotechnologies => for the development of new materials/devices
http://en.wikipedia.org/wiki/DNA_computing
8
What is DNA?
• genetic material
• four letter alphabet (nucleotides, bases):
  – A (adenine)
  – C (cytosine)
  – G (guanine)
  – T (thymine)
• complementary base pairs CG, AT
• hybridization via base pairing
[Figure: DNA double helix (DNA, Wikimedia Commons)]
[Figure: perfect hybridization (strand AACGT paired with strand TTGCA) and imperfect hybridization (strand ATGGT paired with strand TTGCA); the 3′ and 5′ ends of each strand are marked]
10
Modeling
Design Goals
• Uniform Stability [Figure: strand AACGT paired with strand TTGCA]
• Non-interaction [Figure: strands AACGT and CACCC]
Desired properties
• Desired properties coming from real applications
• Notice that properties are not the same for all applications
11
DNA Codes Design Problem description
Input data:
• The alphabet {A, C, G, T}
• A fixed length n for the codewords
• A required distance d among codewords (used by constraints in Z)
• A set Z of constraints (explained in the next slides)
Optimization objective:
• Find the largest possible set of codewords (= code) of length n on alphabet {A, C, G, T}, feasible with respect to constraints Z (based on d)
Why maximize the size of the code? To have more flexibility in the applications seen before!
12
Example Code (solution)
Word length n = 8; each line is one codeword:
AATTCCGG
ACCTGATT
ATTCCCAG
ACCTTTTT
TATATATA
CATTCACC
GCTTATTC
GATTCAAT
TCACCATG
CCGTTACA
GCGCGCGC
CTATTCAC
TTGGCCAA
GGCTTTTA
CTACTACG
The solution respects a given constraint set Z (we do not know Z at this stage!)
DNA Codes Design Problem description
13
Requirements of a DNA Code
• Success in specific hybridization between a DNA codeword and its complement.
• No hybridization between DNA codewords from the same DNA code or between a DNA codeword and the complement of another codeword.
How do these requirements translate into our constraints set Z?
DNA Codes Design Problem description
14
Constraints considered (set Z):
• Requirement: the distance between two codewords must be large (no hybridization).
• Answer: HD (Hamming Distance)
- Given two codewords w1 and w2
- H(w1, w2) = number of positions i in which the ith letter of w1 differs from the ith letter of w2
- example: w1 = GCTA, w2 = ATTA, H(w1, w2) = 2
- Constraint: H(w1, w2) ≥ d
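The Hamming distance above can be sketched in a few lines of Python. The function name `hamming` is an illustrative choice, not something taken from the slides.

```python
def hamming(w1: str, w2: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    if len(w1) != len(w2):
        raise ValueError("codewords must have the same length")
    return sum(a != b for a, b in zip(w1, w2))


# Example from the slide: GCTA and ATTA differ in the first two positions.
print(hamming("GCTA", "ATTA"))
```

With d = 3, the pair GCTA, ATTA would violate the HD constraint, since the two words differ in only two positions.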
DNA Codes Design Problem description
15
Constraints considered (set Z):
• Requirement: the number of G or C of each codeword must be the same (uniform stability) [=> self-hybridization is likely]
• Answer: GC (GC-content constraint)
- A fixed number of the letters of each word has to be either G or C: floor(n/2) in our case
- example: ATA is not feasible, AGA is feasible
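A minimal sketch of the GC-content check, assuming (as stated above) that the required number of G or C letters is floor(n/2). The names `gc_content` and `satisfies_gc` are hypothetical.

```python
def gc_content(word: str) -> int:
    """Number of letters of the word that are G or C."""
    return sum(ch in "GC" for ch in word)


def satisfies_gc(word: str) -> bool:
    """True if exactly floor(n/2) letters are G or C (the choice used in these slides)."""
    return gc_content(word) == len(word) // 2


print(satisfies_gc("ATA"), satisfies_gc("AGA"))
```

For n = 3 the required GC-content is 1, so ATA (no G or C) fails the check and AGA (one G) passes it.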
DNA Codes Design Problem description
16
• Requirement: the distance between a codeword and the complement of another codeword must be large.
Watson-Crick complement of a DNA codeword
wcc(w) = Watson-Crick complement of a DNA codeword w, obtained by reversing w and then replacing each A by T (and vice versa) and each C by G (and vice versa)
- example: wcc(ATGC) = GCAT
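The Watson-Crick complement can be sketched as a reversal followed by a letter substitution. The helper name `wcc` follows the slide; the translation table `COMPLEMENT` is an illustrative detail.

```python
# Letter-wise complement: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def wcc(word: str) -> str:
    """Watson-Crick complement: reverse the word, then complement each letter."""
    return word[::-1].translate(COMPLEMENT)


print(wcc("ATGC"))
```

Applying `wcc` twice returns the original word, which is why the distance H(w1, wcc(w2)) equals H(wcc(w1), w2).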
DNA Codes Design Problem description
17
Constraints considered (set Z):
• Requirement: the distance between a codeword and the complement of another codeword must be large.
• Answer: RC (Reverse Complement Hamming distance)
- Given two codewords w1 and w2
- example: GCTA, ATGC
H(GCTA, wcc(ATGC)) = H(GCTA,GCAT) = 2
- Constraint: H(w1, wcc(w2)) ≥ d
DNA Codes Design Problem description
18
Example of a problem and its solution
• Input data: n = 4, d = 3.
• Constraints considered: HD, GC, RC
• Solution:
the largest possible code with the characteristics above contains 6 codewords.
Optimal code with respect to the constraints considered (not unique!):
CTTC GGTT GTCA
AGGA ACTG TTGG
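The three constraints can be combined into one feasibility check, sketched below. It assumes that the RC constraint also applies to a codeword and its own complement (w1 = w2); this reading is an assumption, although it agrees with the worked examples in the following slides. All function names are illustrative.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def is_feasible(code, d):
    """Check HD, GC and RC for a list of codewords of equal length."""
    if not code:
        return True
    n = len(code[0])
    for w in code:
        # GC constraint: exactly floor(n/2) letters are G or C.
        if len(w) != n or sum(ch in "GC" for ch in w) != n // 2:
            return False
        # RC constraint of a word against its own complement (assumption).
        if hamming(w, wcc(w)) < d:
            return False
    for w1, w2 in combinations(code, 2):
        # HD and RC constraints on every pair of distinct codewords.
        if hamming(w1, w2) < d or hamming(w1, wcc(w2)) < d:
            return False
    return True


print(is_feasible(["CTTC", "GGTT", "GTCA", "AGGA", "ACTG", "TTGG"], 3))
```

The six-codeword code of the slide passes the check for d = 3, while a self-complementary word such as CATG is rejected on its own.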
19
Problem description
• Other kinds of constraints are possible.
• They depend on the real-world application considered
• In this mini-course we limit ourselves to the constraints on the previous slides
Important observation
20
TEMPLATE-MAP DESIGN
• Find the largest possible set of 8-mers with
  – 50% GC content in each word
  – at least four mismatches between each word and the complement of each distinct word (reverse-complement constraint)
  – at least four mismatches between each pair of words (direct Hamming constraint)
  – based on template-map design
Approaches from the literature
Kobayashi, S., Konto, T., Arita, M. On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214 (2003).
Arita, M., Kobayashi, S. DNA sequence design using templates. New Generation Computing, 20, 263-277 (2002).
Frutos A.G., Liu, Q., Thiel A.J., Sanner A.M.W., Condon A.E., Smith L.M., Corn R.M. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Res. 25, 4748-4757 (1997)
Koul, N. Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI (2010).
22
TEMPLATE-MAP DESIGN
Approaches from the literature
• The selection of maps and templates is based on reasoning and theoretical results
• Difficult to apply results to different problems: not a general approach
23
MATHEMATICAL CONSTRUCTIONS
• Approaches adapted from classic Coding Theory
• Theoretical results, based on the characteristics of the desired code, are used to produce mathematical constructions leading to (very regular) codes
• Example: Theorem. If C0 is a code that is fixed by reverse permutation R, then the subcode C1 of C0 consisting of the codewords that are unchanged by R is obtained as the intersection of C0 and the code R(C0).
Approaches from the literature
Gaborit P., King O. D. Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113 (2005).
• Not a general method. Results typically hold for the problem under investigation only
• The codes obtained are very regular. For many applications this is not desirable
King, O. D. Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33 (2003).
Neelakandan, I. New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI (2010).
24
HEURISTIC ALGORITHMS
• Many of the classic heuristic algorithms have been adapted, implemented and tested
• We will see some of them in detail…
Approaches from the literature
25
Construction Heuristics
Construction Heuristic (CH)
All possible codewords with the required GC-content are examined in a given order.
Codewords are incrementally accepted if feasible with respect to the already accepted ones.
Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).
Smith, D.H., Hughes L.A., Perkins S. A new table of constant weight binary codes of length greater than 28. Electron. J. of Combinatorics, 13(1), #A2 (2006).
27
Construction Heuristics
Example: n = 4, d = 3.
Constraints: HD, GC, RC
Lexicographic order:
AACC AACG AAGC AAGG ACAC ACAG ACCA ACCT ACGA ACGT ACTC ACTG AGAC AGAG AGCA AGCT AGGA AGGT AGTC AGTG ATCC ATCG ATGC ATGG CAAC CAAG CACA CACT CAGA CAGT CATC CATG CCAA CCAT CCTA CCTT CGAA CGAT CGTA CGTT CTAC CTAG CTCA CTCT CTGA CTGT CTTC CTTG GAAC GAAG GACA GACT GAGA GAGT GATC GATG GCAA GCAT GCTA GCTT GGAA GGAT GGTA GGTT GTAC GTAG GTCA GTCT GTGA GTGT GTTC GTTG TACC TACG TAGC TAGG TCAC TCAG TCCA TCCT TCGA TCGT TCTC TCTG TGAC TGAG TGCA TGCT TGGA TGGT TGTC TGTG TTCC TTCG TTGC TTGG
Solution: AACC ACAG AGGA CCTA GTCA
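The example above can be reproduced with a short sketch of the construction heuristic. It assumes that the RC check also covers a word against its own complement, which is the reading under which the lexicographic run yields the five codewords of the slide. Function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def gc_words(n):
    """All words of length n with GC-content floor(n/2), in lexicographic order."""
    for letters in product("ACGT", repeat=n):
        if sum(ch in "GC" for ch in letters) == n // 2:
            yield "".join(letters)


def fits(word, code, d):
    """True if word is compatible (HD and RC) with itself and with every word in code."""
    if hamming(word, wcc(word)) < d:
        return False
    return all(hamming(word, c) >= d and hamming(word, wcc(c)) >= d for c in code)


def construction_heuristic(n, d, order=None):
    """Scan the candidates in the given order and keep each one that fits."""
    candidates = list(gc_words(n)) if order is None else order
    code = []
    for word in candidates:
        if fits(word, code, d):
            code.append(word)
    return code


print(construction_heuristic(4, 3))  # ['AACC', 'ACAG', 'AGGA', 'CCTA', 'GTCA']
```

Passing a shuffled list as `order` gives the random-order variant; the accepted code then depends on the permutation.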
28
Construction Heuristics
• The method works with any possible order of the codewords (lexicographic, reverse lexicographic, random) => each order gives in fact a different algorithm
• Computational experiments suggest that random orders tend to give better results on DNA code design problems
• Slow for large problems (all possible codewords have to be examined!)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).
29
Seed Building local search
Seed Building (SB)
Iterative approach
A set of seed codewords is considered.
The set of seed codewords is dynamically adapted through iterations.
During each iteration:
• All possible codewords with the required GC-content are examined in a given order.
• Codewords are incrementally accepted if feasible with those already accepted in the current iteration and with the seed codewords.
Statistics are used to expand or contract the set of seed codewords every ItrSeed iterations, based on the quality of the solutions built.
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).
Brouwer A.E., Shearer J.B., Sloane N.J.A., Smith W.D. A new table of constant weight codes. IEEE Trans. Inf. Theory 36, 1334-1380 (1990).
31
Seed Building local search
Seed codewords management
32
Seed Building local search
Example: n = 4, d = 3.
Constraints: HD, GC, RC
Seed codewords: AACC ACAG
Random order:
CTTC CTTG CTCA CTCT CTGA CTGT CTAC CTAG CATC CATG CACA CACT CAGA CAGT CAAC CAAG CCTA CCTT CCAA CCAT CGTA CGTT CGAA CGAT GTTC GTTG GTCA GTCT GTGA GTGT GTAC GTAG GATC GATG GACA GACT GAGA GAGT GAAC GAAG GCTA GCTT GCAA GCAT GGTA GGTT GGAA GGAT TTCC TTCG TTGC TTGG TACC TACG TAGC TAGG TCTC TCTG TCCA TCCT TCGA TCGT TCAC TCAG TGTC TGTG TGCA TGCT TGGA TGGT TGAC TGAG ATCC ATCG ATGC ATGG AACC AACG AAGC AAGG ACTC ACTG ACCA ACCT ACGA ACGT ACAC ACAG AGTC AGTG AGCA AGCT AGGA AGGT AGAC AGAG
Solution: AACC ACAG CCTA GTCA TCCT
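One iteration of Seed Building can be sketched as the construction heuristic started from the seed codewords. The sketch below covers only this building step; the management of the seed set (expansion and contraction every ItrSeed iterations) is not modelled. The function names and the self-complement check in `fits` are assumptions.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def fits(word, code, d):
    """HD and RC compatibility of word with itself and with every word in code."""
    if hamming(word, wcc(word)) < d:
        return False
    return all(hamming(word, c) >= d and hamming(word, wcc(c)) >= d for c in code)


def build_from_seeds(seeds, candidates, d):
    """Keep the seeds, then accept each candidate that fits everything accepted so far."""
    code = list(seeds)
    for word in candidates:
        if fits(word, code, d):
            code.append(word)
    return code


# Candidates: all 4-letter words with GC-content 2, here in lexicographic order.
candidates = ["".join(t) for t in product("ACGT", repeat=4)
              if sum(ch in "GC" for ch in t) == 2]
print(build_from_seeds(["AACC", "ACAG"], candidates, 3))
```

With the lexicographic order used here, the seeds AACC and ACAG are completed to AACC ACAG AGGA CCTA GTCA; the order shown on the slide leads to a different completion of the same size.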
33
Seed Building local search
• The method works with any possible order of the codewords (lexicographic, reverse lexicographic, random).
• Experiments clearly show that a random order has to be preferred for DNA codes design problems.
• The process of identifying a good set of seed codewords is intrinsically difficult => codes produced are sometimes very good and sometimes very poor => not a very robust method
• Slow for large problems (all possible codewords are examined at each iteration!)
34
• Clique: Given an undirected graph G, a clique is a set of vertices in which every vertex is connected to every other vertex of the set
• Maximum clique problem: Given an undirected graph G, identify the largest clique of G (largest in number of vertices)
• Complexity: Classic NP-hard problem
Clique Search local search
• {0, 3, 4} is a clique
• {2, 3, 4, 5} is a maximal clique
35
Clique Search local search
Clique Search (CS)
Iterative approach
A partial code can be completed by solving a subproblem (which is a maximum clique problem) to optimality.
During each iteration:
• All possible codewords with the required GC-content are examined in a random order.
• Codewords are accepted for the second phase if feasible with those of the partial code.
• A maximum clique problem is solved on the set of accepted codewords to complete the partial code.
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).
Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)
36
Clique Search local search
37
Clique Search local search
Example: n = 4, d = 3. Constraints: HD, GC, RC
Partial code: CTTC CGAA TGGT GTGA
Maximum clique problem on feasible extensions of the partial solution:
CACT AGTG
AAGC GCTT
Solution: CTTC CGAA TGGT GTGA CACT GCTT
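The second phase of the example can be sketched as a maximum clique search on the compatibility graph of the four feasible extensions. The brute-force enumeration below is only meant for tiny instances like this one; it is a stand-in chosen for brevity, not the clique heuristic used in the cited paper. Names are illustrative.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def compatible(u, v, d):
    """Edge of the compatibility graph: u and v may appear in the same code."""
    return hamming(u, v) >= d and hamming(u, wcc(v)) >= d


def max_clique(vertices, d):
    """Exhaustive search, largest subsets first (exponential time)."""
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(compatible(u, v, d) for u, v in combinations(subset, 2)):
                return list(subset)
    return []


extensions = ["CACT", "AGTG", "AAGC", "GCTT"]
print(max_clique(extensions, 3))  # a clique of size 2
```

AGTG is the Watson-Crick complement of CACT and AAGC is the complement of GCTT, so no clique can hold both members of either pair; the maximum clique therefore has two vertices. Several cliques of that size exist, and the pair CACT, GCTT of the slide is one of them.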
39
Clique Search local search
• Solving a maximum clique problem (sub-procedure) is an NP-hard problem itself!
• Heuristics have to be used for the maximum clique problem
=> no optimality is guaranteed for the sub-problem solutions
• The choice of the number of codewords to eliminate is crucial
  – too many codewords eliminated => very large maximum clique problem => high probability of suboptimality
  – not enough codewords eliminated => very likely to find a code with the same number of codewords as the original
  – this aspect deserves a deeper study in order to tackle large problems!
40
Hybrid Search local search Hybrid Search (HS)
Iterative approach
Merges the concepts of the two methods analyzed before.
A set of seed codewords is managed exactly as in Seed Building.
Seed codewords represent the partial code in the context of the Clique Search.
A relaxed distance d' < d is introduced.
A candidate codeword has to be at distance at least d from the seeds, and at least d' from the other candidate codewords (this keeps the maximum clique problem to a reasonable size!)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
41
Hybrid Search local search
Seed Building
Clique Search
42
Hybrid Search local search
Example: n = 4, d = 3. Constraints: HD, GC, RC
Partial code (seed codewords): CAAC AGAG
Maximum clique problem on feasible extensions of the partial solution (heuristic distance d'=1 to reduce the codewords considered):
TGGT
TCTC TGTC
TTGC TAGG
TACG ATGC
ACTC
Solution: CAAC AGAG TCTC TGGT TACG ATGC
44
Hybrid Search local search
• Sums the advantages of Seed Building to those of Clique Search
but…
• There is the risk of summing up drawbacks instead!
• The method deserves a further detailed study for larger problems
45
Experimental comparison of some of the heuristic algorithms
Experimental settings
Methods coded in ANSI C
Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines
Maximum computation times: 10'000 seconds (2.8 hours)
Statistics over 5 runs for each combination problem/method
A_4^Cstrs(5,3,2) identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
46
Experimental comparison of some of the heuristic algorithms
• SB = Seed Building
• CS = Clique Search
• HS = Hybrid Search
47
Experimental comparison of some of the heuristic algorithms
Comments • No clear ranking is possible among the methods considered: Seed Building, Clique Search, and Hybrid Search
• Methods are therefore likely to represent different neighbourhoods
49
Idea
• All the methods seen until now work on the search space of feasible solutions (we never have constraints violated…)
• What if we move into the search space of infeasible solutions? => we will have to minimize (i.e. bring down to zero!) a measure of infeasibility!
• This makes it possible to develop a completely different kind of local search!
• It is likely that the search space is visited in a different way by such a family of algorithms…
50
Iterated Greedy Search local search
Iterated Greedy Search (IGS)
Iterative approach
Working on an infeasible code W, trying to make it feasible.
Measure of the infeasibility of W:
where w = floor(n/2)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
51
Iterated Greedy Search local search
Iterated Greedy Search (IGS)
An infeasible solution is obtained by adding a random codeword to a perturbed feasible solution
During each iteration:
• A codeword σ is selected at random and the optimal (according to Inf(W)) change of one letter of σ is carried out.
• If Inf(W)=0, we are done, and we can add a random codeword
52
Iterated Greedy Search local search
[Figure: the search alternates between perturbation of the solution and optimization of the solution]
53
Iterated Greedy Search local search
Example: n = 4, d = 3. Constraints: HD, GC, RC
W                                      Inf(W)
...
TGGT GACC CGAA TCAC CCTT               1
TGGT GACT CGAA TCAC CCTT               0
TGGT GGCA CGAA TCAC CCTT TTTG          8
TGGT GGCA CGTA TCAC CCTT TTTG          8
TGGT GGCA CGTA TCAC GCTT TTTG          7
TGGT GGCA CGTC TCAC GCTT TTTG          7
…
TGGT AGTG CGTC TCAC GCTT TTTG          4
TGGT AGTG CGTC TCAC GCTT TTCG          3
TGGT AGTG CTTC TCAC GCTT TTCG          0
TGGT AGTG GTAG TCAC GGTT TTCG AACT     9
TGGT AGTG GTAG TCTC GGTT TTCG AACT     9
...
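The optimization step of IGS can be sketched once an infeasibility measure is fixed. The measure below is an assumed stand-in, not the formula of the original slide: it adds the GC-content deviations |GC(w) - floor(n/2)|, the distance shortfalls max(0, d - H) over all pairs, and the same shortfalls against Watson-Crick complements (each word against its own complement included). It evaluates to 0 and 8 on the second and third codes listed above; whether it agrees with the original measure in every case is not established.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def inf(code, d):
    """Assumed infeasibility measure: 0 exactly when HD, GC and RC all hold."""
    n = len(code[0])
    total = sum(abs(sum(ch in "GC" for ch in w) - n // 2) for w in code)
    total += sum(max(0, d - hamming(a, b)) for a, b in combinations(code, 2))
    total += sum(max(0, d - hamming(a, wcc(b))) for a, b in combinations(code, 2))
    total += sum(max(0, d - hamming(w, wcc(w))) for w in code)
    return total


def best_single_letter_change(code, index, d):
    """Try every one-letter change of code[index]; keep the best strictly improving one."""
    best, best_value = code, inf(code, d)
    word = code[index]
    for pos in range(len(word)):
        for letter in "ACGT":
            if letter == word[pos]:
                continue
            trial = code[:index] + [word[:pos] + letter + word[pos + 1:]] + code[index + 1:]
            value = inf(trial, d)
            if value < best_value:
                best, best_value = trial, value
    return best


six = ["TGGT", "GGCA", "CGAA", "TCAC", "CCTT", "TTTG"]
print(inf(six, 3), inf(best_single_letter_change(six, 1, 3), 3))
```

This sketch accepts only strict improvements, a design choice made here for simplicity; moves that leave the measure unchanged, which also appear in the table above, would need a non-strict comparison.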
54
Iterated Greedy Search local search
• We change exactly one letter of a random codeword at each iteration: more complex neighbourhoods could be considered…
• We never accept changes that make the solution worse: accepting them might be an idea to escape from local minima
• A further investigation is deserved…
55
Experimental comparison of some of the heuristic algorithms
• SB = Seed Building
• CS = Clique Search
• HS = Hybrid Search
• IGS = Iterated Greedy Search
57
Experimental comparison of some of the heuristic algorithms
Comments
• No clear ranking is possible among the methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search
• Methods are likely to represent different neighbourhoods
59
Goal: Effectively escape from local minima of given evaluation function.
General approach: For fixed neighbourhood, use step function that permits worsening search steps.
Specific methods:
• Randomized Iterative Improvement
• Simulated Annealing
• Attribute Based Hill Climber
• Dynamic Local Search
• Iterated Local Search
• Tabu Search
Stochastic Local Search: Simple SLS methods
61
Key idea: In each search step, with a fixed probability perform an uninformed random walk step instead of an iterative improvement step.
Randomized Iterative Improvement (RII):
determine initial candidate solution s
while termination condition is not satisfied do
    With probability p: choose a neighbor s' of s uniformly at random
    Otherwise: choose a neighbor s' of s such that g(s') < g(s) or, if no such s' exists, choose s' such that g(s') is minimal
    s := s'
where g(s) is the objective function value (fitness) of solution s
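The RII scheme above can be written as a short generic routine. The sketch below is problem independent; the parameter names (`neighbours`, `g`, `max_steps`, `rng`) and the choice of a step limit as termination condition are assumptions made for illustration.

```python
import random


def rii(initial, neighbours, g, p, max_steps, rng):
    """Randomized Iterative Improvement; returns the best solution visited."""
    s = initial
    best = s
    for _ in range(max_steps):
        candidates = neighbours(s)
        if not candidates:
            break
        if rng.random() < p:
            # Uninformed random walk step.
            s = rng.choice(candidates)
        else:
            # Improvement step; if no neighbour improves, take a minimal one.
            improving = [t for t in candidates if g(t) < g(s)]
            s = rng.choice(improving) if improving else min(candidates, key=g)
        if g(s) < g(best):
            best = s
    return best


# Toy usage: minimise (x - 7)^2 over the integers, neighbours are x - 1 and x + 1.
def cost(x):
    return (x - 7) ** 2


def step(x):
    return [x - 1, x + 1]


print(rii(0, step, cost, p=0.0, max_steps=50, rng=random.Random(0)))
```

Because the walk may leave a good solution again, the routine keeps track of the best solution seen so far and returns it rather than the last one visited.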
Stochastic Local Search: Randomized Iterative Improvement
62
Observations: • No need to terminate search when local minimum is encountered.
Instead: Impose limit on number of search steps or CPU time, from beginning of search or after last improvement.
• Probabilistic mechanism permits arbitrarily long sequences of random walk steps
Therefore: When run sufficiently long, RII is guaranteed to find (optimal) solution to any problem instance with arbitrarily high probability.
• Generally, RII is often outperformed by more complex LS methods.
Stochastic Local Search: Randomized Iterative Improvement
63
Target: a code with k codewords
1. Start with k random codewords
2. Mark unsatisfied constraints (conflicts)
3. If there are no unsatisfied constraints, go to step 8
4. Pick 2 codewords involved in a conflict
5. With probability p select a better word minimizing the number of conflicts
6. Otherwise select a random codeword
7. Go to step 3
8. Display all k codewords
Stochastic Local Search for the DNA codes design problem
It is a Randomized Iterative Improvement! Tulpan, D.C., Hoos, H.H., Condon, A.E. Stochastic local search algorithms for DNA word design. Lectures Notes in Computer Science, Springer, Berlin, 2568, 229-241 (2002).
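The eight steps above can be sketched as follows for the HD constraint only. Several details are assumptions made to obtain a runnable example: the replacement word is drawn from the one-letter variants of the two conflicting words, the search stops after `max_steps` iterations, and the function returns None when no feasible code was found within that limit.

```python
import random
from itertools import combinations

ALPHABET = "ACGT"


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def conflicts(words, d):
    """Index pairs of words that violate the HD constraint."""
    return [(i, j) for i, j in combinations(range(len(words)), 2)
            if hamming(words[i], words[j]) < d]


def one_letter_variants(word):
    return [word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET if c != word[i]]


def sls_word_design(k, n, d, p, max_steps, rng):
    words = ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(k)]  # step 1
    for _ in range(max_steps):
        found = conflicts(words, d)                                              # step 2
        if not found:                                                            # step 3
            return words                                                         # step 8
        i, j = rng.choice(found)                                                 # step 4
        moves = [(idx, w) for idx in (i, j) for w in one_letter_variants(words[idx])]
        if rng.random() < p:                                                     # step 5
            idx, w = min(moves, key=lambda m: len(
                conflicts(words[:m[0]] + [m[1]] + words[m[0] + 1:], d)))
        else:                                                                    # step 6
            idx, w = rng.choice(moves)
        words[idx] = w                                                           # step 7
    return None


print(sls_word_design(k=6, n=6, d=3, p=0.8, max_steps=2000, rng=random.Random(1)))
```

Adding the GC and RC constraints would only change `conflicts` and the set of allowed replacement words; the control flow stays the same.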
65
[Figure: state diagram with the states SI, SF, SBI (Best Improvement) and SRW (Random Walk); the transitions carry the probabilities p and 1 - p]
Stochastic Local Search for the DNA codes design problem
66
[Figure: flowchart with the steps Initialization, Evaluate, Pick Conflict, Best Improvement (probability p), Random Walk (probability 1-p) and Return Result, and a Yes/No decision]
Stochastic Local Search for the DNA codes design problem
67
[Figure: diagram with the blocks Select Conflicts, Neighbourhood, Random Walk, Iterative / Best Improvement]
Stochastic Local Search for the DNA codes design problem
68
Given: a fixed set of constraints C, strand length n=8, set size k=14.
Current Set:
 1. ACCTGATT
 2. ATTCTCAG
 3. ACCTTTTT
 4. TATATATA
 5. CATTCACC
 6. ATTCTCAA
 7. GATTCAAT
 8. TCACCATG
 9. CCGTTACA
10. GCGCGCGC
11. CTATTCAC
12. TTGGCCAA
13. GGCTTTTA
14. CTACTACG
Conflicts: (1,3) (1,5) (2,7) (2,9) (12,14)
Pick Conflict
Neighbors: TTTCTCAG, AATCTCAG, …
Random Walk / Best Improvement (probabilities p and 1 - p)
Selected neighbor: TTTCTCAG
Conflicts: (1,3) (1,5) (2,11) (12,14)
Stochastic Local Search for the DNA codes design problem
69
Simple SLS without Random Replacement
k = {100, 120, 140}, n = 8, d = 4
HD constraint only
1000 successful runs
Stochastic Local Search for the DNA codes design problem - results
Comments:
• The number of iterations required increases with k
• The increase is more dramatic when k is high => risk of stagnation
Distribution of the number of iterations required to have a feasible solution for different values of k (target number of codewords)
70
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
SLS with Random Replacement vs Simple SLS
Stochastic Local Search for the DNA codes design problem - results
Distribution of the number of iterations required to have a feasible solution for k = 70 (target number of codewords)
Comments:
• Random Replacement helps!
• Stagnation reduced
• Better robustness
71
Scaling of SLS with Random Replacement (n = 8, d = 4)
[Figure: number of search iterations (logarithmic axis, 100 to 100000) versus DNA set size (10 to 160), one curve for each constraint set: HD, HD+GC, HD+GC+RC]
Stochastic Local Search for the DNA codes design problem
Comments:
• SLS scales up better when fewer constraints are considered
• Why? Because fewer constraints => easier problem, intuitively
72
New bounds on the size of DNA codes
Note: HD, GC constraints.
n     d     Previous best    SLS
6     3     56               85
10    5     132              256
14    7     240              500
18    9     380              1200
20    10    1520             2193
Stochastic Local Search for the DNA codes design problem - results
73
Comments: • There are improvements over previous best.
• The method is still extremely simple and intuitive [good quality in general but...]
• Is it possible to improve it with some refinement?
• Where should we work to refine the method?
Stochastic Local Search for the DNA codes design problem - results
IDEA: trying different neighbourhoods!
74
• Combinatorial problem Π: DNA Word Design
• Problem instance π : DNA/quaternary code design [ particular (n,d) combinations ]
• Search space S(π): set of (code word) sets s
• Neighborhood relation N(π): k-exchange + random based neighborhoods
• Initialization function init(π): random choosing or predefined
• Step function step(π): chooses with probability p between best improvement and random walk
• Terminate predicate terminate(π): a function depending on the number of iterations
performed or solution found
Improved Stochastic Local Search for the DNA codes design problem
Tulpan, D.C. Hoos, H.H. Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lectures Notes in Computer Science, Springer, Berlin, 2671, 418-433 (2003).
Instead of simple 1-exchange!
75
Neighbourhoods
Simple neighbourhoods
• k-exchange / k-point mutation neighbourhoods
• rotation-based neighbourhoods
• random neighbourhoods
Complex neighbourhoods
• 1-exchange / 1-point mutation + rotation neighbourhoods
• k-exchange / k-point mutation + random words neighbourhoods
• 1-exchange / 1-point mutation + rotations + random words neighbourhoods
Improved Stochastic Local Search for the DNA codes design problem
76
Simple neighbourhoods
v-exchange / v-point mutation neighbourhoods
Example:
some of the codewords in the 2-exchange neighbourhood of CTA are:
ACA GTT TTG TCA
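A v-exchange neighbourhood can be enumerated by choosing v positions and substituting a different letter at each of them. The sketch below generates the words that differ from the input in exactly v positions and does not filter on GC-content; both choices, like the function name, are assumptions made for illustration.

```python
from itertools import combinations, product


def v_exchange(word, v, alphabet="ACGT"):
    """All words that differ from word in exactly v positions."""
    neighbours = []
    for positions in combinations(range(len(word)), v):
        # For each chosen position, every letter except the current one.
        choices = [[c for c in alphabet if c != word[i]] for i in positions]
        for letters in product(*choices):
            chars = list(word)
            for i, c in zip(positions, letters):
                chars[i] = c
            neighbours.append("".join(chars))
    return neighbours


hood = v_exchange("CTA", 2)
print(len(hood), all(w in hood for w in ["ACA", "GTT", "TTG", "TCA"]))
```

For a word of length 3 there are 3 ways to choose 2 positions and 3 × 3 substitutions for each, giving 27 neighbours, among them the four words of the example.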
Improved Stochastic Local Search for the DNA codes design problem
77
Simple neighbourhoods
rotation-based neighbourhoods
Applying the neighbourhood to a given codeword, we get the codewords obtained from the input codeword by "shifting right" the codeword by 1 to n-1 positions.
Example:
CTA => TAC, ACT
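The rotation-based neighbourhood is a cyclic right shift by 1 to n-1 positions, which a slice expression covers directly. The function name `rotations` is illustrative.

```python
def rotations(word):
    """Cyclic right shifts of word by 1, 2, ..., n-1 positions."""
    n = len(word)
    return [word[-shift:] + word[:-shift] for shift in range(1, n)]


print(rotations("CTA"))  # ['ACT', 'TAC']
```

A rotation permutes the letters of a word, so it never changes its GC-content.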
Improved Stochastic Local Search for the DNA codes design problem
78
Simple neighbourhoods
random neighbourhoods
Example: some of the codewords in the random neighbourhood of CTA
are: CAA CTT TTC TCA
Improved Stochastic Local Search for the DNA codes design problem
79
Complex neighbourhoods
1-exchange + rotation neighbourhoods
v-exchange + random words neighbourhoods
1-exchange + rotations + random words neighbourhoods
• These neighbourhoods are obtained by applying all the neighbourhoods involved sequentially (repeated codewords have to be avoided)
• When rotation is involved, it is applied to all the codewords obtained by the neighbourhoods previously applied
Improved Stochastic Local Search for the DNA codes design problem
80
Improved Stochastic Local Search for the DNA codes design problem
The difference is here!
81
k-exchange Neighbourhoods k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
{1, 2, 3}-exchange neighbourhoods
Improved Stochastic Local Search for the DNA codes design problem - results
Distribution of the number of iterations required to have a feasible solution for different v-exchange methods
Comments:
• Using a larger neighbourhood seems to help but…
• The difference between 2-exchange and 3-exchange is not dramatic
• Larger neighbourhood means more time at each iteration…
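Why 16? Two words are involved in the conflict and each has n = 8 positions; since the GC content has to be respected, each position admits a single substitution (A↔T or C↔G), which gives 2 × 8 = 16 neighbours.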
82
k-exchange Neighbourhoods: time for 1 iteration
Neighbourhood    CPU Time
1-exchange       .0017
2-exchange       .0088
3-exchange       .0314
Improved Stochastic Local Search for the DNA codes design problem - results
Distribution of the CPU time required to have a feasible solution for different v-exchange methods
Comments:
• 1-exchange is still the best in terms of run times => not what we hoped!
83
Hybrid Randomized Neighbourhoods
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
random, hybrid neighbourhoods
Improved Stochastic Local Search for the DNA codes design problem - results
Distribution of the number of iterations required to have a feasible solution for different hybrid neighbourhoods
Comments:
• Pure random performs surprisingly well
• 1-exchange + random is however the best method
84
All combinations of neighbourhoods together (usual benchmark)
Improved Stochastic Local Search for the DNA codes design problem
Distribution of the number of iterations required to have a feasible solution for different neighbourhoods
Comments:
• 1-exchange + rotation + random is the most promising combination in terms of number of iterations
• Methods including the random neighbourhood are definitely better
85
Approximate CPU Cost per Iteration for all the combinations of neighbourhoods considered
Neighbourhood Type                  Neighbourhood size    CPU Time [sec]
1-exchange                          16                    .002184
2-exchange                          72                    .008830
3-exchange                          184                   .031493
1-exchange + rotations              128                   .017294
random                              128                   .015100
1-exchange + random                 16 + 112              .022889
2-exchange + random                 72 + 112              .029167
3-exchange + random                 184 + 112             .040833
1-exchange + rotations + random     128 + 100             .043333
Improved Stochastic Local Search for the DNA codes design problem
Comment:
• 1-exchange + random is a good compromise between speed and quality of the solutions
• Let’s now see what happens if we consider both the time spent on each iteration and the number of iterations required to converge… [next slide]
86
All combinations of neighbourhood together (usual benchmark)
Improved Stochastic Local Search for the DNA codes design problem
Distribution of the CPU time required to have a feasible solution for different neighbourhoods
Comments:
• Rotation is time consuming => methods with rotation are not so convenient anymore
• 1-exchange + random neighbourhood is by far the most promising combination in terms of CPU time
87
Improved Stochastic Local Search for the DNA codes design problem
Is this randomized
step still interesting?
88
Improved Stochastic Local Search for the DNA codes design problem
Number of iterations to have a feasible solution for different values of the randomizing parameter
Comments:
• The randomized step is useless when the hybrid randomized neighbourhood is used!
• This happens because the neighbourhood already does the “random work”
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
random, hybrid neighbourhoods
89
Improved Stochastic Local Search for the DNA codes design problem
90
n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
1-exchange, random, hybrid
neighbourhoods
Scaling of the Improved SLS
Improved Stochastic Local Search for the DNA codes design problem - results
Comments:
• It is surprising how well the pure random neighbourhood scales up
• However, 1-exchange + random neighbourhood is the best
91
SLS Results and Analysis
• New bounds for DNA set sizes
• Improved SLS using various neighborhoods
Length (n)   Hamming dist. (d)   Existing Bounds (k)   Simple SLS (k)   Improved SLS (k)
4            3                   -                     5                6
8            4                   108                   112*             128
10           5                   -                     127              158
12           6                   -                     210              240
Combinatorial constraints: HD, RC, GC
[Tulpan et al., 2002] [Tulpan et al., 2003] [Frutos et al., 1997]
Thesis Contributions: C1 Development of novel optimization algorithms
Improved Stochastic Local Search for the DNA codes design problem
92
Conclusions
• Random neighbourhoods => increased SLS performance
• 1-exchange + random neighbourhood is the best combination
• Larger DNA codes have been obtained
Improved Stochastic Local Search for the DNA codes design problem
93
Another Stochastic Local Search for the DNA codes design problem
Chee, Y.M., Ling, S. Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394 (2008).
• A different SLS algorithm has been presented in the literature.
• It can be seen as a Simulated Annealing algorithm without a cooling schedule (constant temperature).
• The current code L is always feasible
• At each iteration a new (feasible) codeword s is added, and all the codewords of L that are not compatible with s are removed, leading to a new code L’
• Code L’ is accepted with a certain probability depending on |L’| - |L| (difference in the cardinalities of the two sets)
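The add-and-repair step described in the bullets above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the `temperature` parameter are hypothetical, and the Metropolis-style acceptance rule is an assumption (the slides only state that the probability depends on |L'| - |L|). The new codeword s is assumed to already satisfy the GC-content requirement and to be far enough from its own reverse complement.

```python
import math
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x, y):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def reverse_complement(x):
    """Reverse the word and swap A<->T and C<->G."""
    return "".join(COMPLEMENT[c] for c in reversed(x))


def compatible(s, t, d):
    """HD and RC constraints between two codewords s and t."""
    return hamming(s, t) >= d and hamming(s, reverse_complement(t)) >= d


def sls_step(code, s, d, temperature, rng):
    """Add s, drop the codewords incompatible with s, then accept or reject."""
    candidate = [t for t in code if compatible(s, t, d)] + [s]
    delta = len(candidate) - len(code)  # |L'| - |L|
    # ASSUMPTION: Metropolis-style rule at constant temperature; the
    # source does not give the exact acceptance probability.
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return candidate
    return code


# AGGA is at Hamming distance 1 from AGCA, so AGCA is removed; the size
# of the code is unchanged (delta = 0) and the new code is accepted.
code = ["AGCA", "CTAC"]
print(sls_step(code, "AGGA", 3, 1.0, random.Random(0)))  # ['CTAC', 'AGGA']
```

Keeping the current code feasible at every step means that only the new codeword has to be checked against the others, rather than every pair.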
94
Another Stochastic Local Search for the DNA codes design problem
Target number of codewords (k before)
Max number of iterations
Code
Set of incompatible codes
Acceptance probability of the new code:
95
Another Stochastic Local Search for the DNA codes design problem
Improvements over previous bests in the literature (theoretical methods, other SLSs and a few more)
HD, GC and RC constraints
96
Stochastic Local Searches for the DNA codes design problem
• Different methods based on a similar idea lead to very different codes
• No method dominates the others
• The methods seem to explore the search space in different ways
• Is it possible to combine the good properties of (some of) the different approaches into a single method?
97
Outline • Introduction • The DNA Codes Design problem • Approaches in the literature • Construction heuristics • Simple local searches • Metaheuristics
– Intro to Stochastic Local Search – Applications to the DNA codes design problem – Intro to Variable Neighbourhood Search – Applications to the DNA codes design problem
• Bibliography
98
VNS
99
VNS
100
VNS
101
VNS
102
VNS
103
Outline • Introduction • The DNA Codes Design problem • Approaches in the literature • Construction heuristics • Simple local searches • Metaheuristics
– Intro to Stochastic Local Search – Applications to the DNA codes design problem – Intro to Variable Neighbourhood Search – Applications to the DNA codes design problem
• Future research • Acknowledgment • Bibliography
104
A VNS algorithm for DNA codes design
A primitive Variable Neighbourhood Search (VNS) algorithm is introduced.
It runs the local search algorithms seen before (the basic ingredients) in turn, iteratively.
The reference solution for local searches is always the best solution retrieved so far.
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
This is a Variable Neighbourhood Descent!
Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)
Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear)
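The loop described on this slide (local searches run in turn, always starting from the best solution found so far) could look roughly like the Python sketch below. The function name, the `quality` callback, the `max_rounds` limit and the stopping rule are all assumptions made for illustration; the published algorithm uses its own local searches and a time limit.

```python
def variable_neighbourhood_descent(initial, local_searches, quality, max_rounds=100):
    """Run the local searches in turn, always restarting from the best solution."""
    best = initial
    for _ in range(max_rounds):
        improved = False
        for search in local_searches:
            # Every local search starts from the best solution retrieved so far.
            candidate = search(best)
            if quality(candidate) > quality(best):
                best = candidate
                improved = True
        # ASSUMPTION: stop when a full round brings no improvement.
        if not improved:
            break
    return best


# Toy stand-ins: "solutions" are integers and quality is the value itself.
searches = [lambda x: x + 1 if x < 5 else x, lambda x: x * 2 if x < 20 else x]
print(variable_neighbourhood_descent(1, searches, quality=lambda x: x))  # 20
```

In the toy run the first search stalls at 5, but the second one moves the solution further, which is the kind of mutual help between local searches that the VNS framework relies on.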
105
A VNS algorithm for DNA codes design
Methods involved in our implementation
106
A VNS algorithm for DNA codes design
• We hope to take advantage of the different philosophies behind the local search methods listed before
• From previous experiments we know that the basic local searches visit the search space in different ways
• We hope basic local searches will help each other to exit from local minima within a VNS framework
107
Experimental comparison of some of the heuristic algorithms
Experimental settings
Methods coded in ANSI C
Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines
Maximum computation times: 10'000 seconds (2.8 hours)
Statistics over 5 runs for each combination problem/method
$A_4^{\mathrm{Cstrs}}(5,3,2)$ identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
108
Experimental comparison of some of the heuristic algorithms
• SB = Seed Building
• CS = Clique Search
• HS = Hybrid Search
• IGS = Iterated Greedy Search
• VNS = Variable Neighbourhood Search
109
Experimental comparison of some of the heuristic algorithms
• SB = Seed Building
• CS = Clique Search
• HS = Hybrid Search
• IGS = Iterated Greedy Search
• VNS = Variable Neighbourhood Search
110
Experimental comparison of some of the heuristic algorithms
Comments
• No clear ranking is possible among the basic methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search (as seen before…)
⇒ Methods are likely to represent different neighbourhoods
• Variable Neighbourhood Search clearly dominates the other methods
⇒ VNS takes advantage of the different neighbourhoods
⇒ VNS is likely to be competitive against all the other methods!
111 Reference algorithm
Experimental results of VNS
The VNS algorithm discussed in:
• Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326.
is compared with the methods discussed in the following 6 papers [which provide all the best known codes]:
• Li, M., Lee, H. J., Condon, A. E., and Corn, R. M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.
• Tulpan, D. C., Hoos, H. H., and Condon, A. E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, 2568, 229-241.
• Tulpan, D. C. and Hoos, H. H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, 2671, 418-433.
• King, O. D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33.
• Gaborit, P. and King, O. D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113.
• Chee, Y. M. and Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394.
Theor. Constructions Heuristic Algorithms
112
Experimental results of VNS
Experimental settings
• Methods coded in ANSI C
• Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines
• Maximum computation times: 100'000 seconds (27.8 hours)
=> Comparable with that of other heuristic algorithms
• Best over 5 runs for each combination problem/method
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
113
• We will consider 254 problems with
- 4 ≤ n ≤ 20
- 3 ≤ d ≤ n
- Case 1: HD and GC constraints
- Case 2: HD, RC and GC constraints
• These settings match those of the state-of-the-art tables maintained at http://llama.med.harvard.edu/~king/dnacodes.html by O.D. King (last checked November 2009)
• We left out problems corresponding to very large codes (the current VNS algorithm cannot tackle them)
Experimental results of VNS
114
• over 254 problems considered:
• in 128 cases the best known result is matched
• in 52 cases a new best result is found
Experimental results of VNS
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
115
Detailed results of VNS
116
Detailed results of VNS
117
Detailed results of VNS
118
Detailed results of VNS
119
• After the publication of the paper we have been improving the VNS algorithms in many ways (work still in progress!)
• over 254 problems considered:
• in 132 cases (previously 128) the best known result is matched
• in 87 cases (previously 52) a new best result is found
• We miss the best known solution in only 13.8% of the cases!
• We feel there is room for further improvements…
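As a quick check, the 13.8% figure follows from the updated counts, reading the slide as 132 matched and 87 improved results out of 254 problems:

```latex
\frac{254 - 132 - 87}{254} = \frac{35}{254} \approx 0.138 = 13.8\%
```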
Experimental results of VNS
Montemanni, R., Smith D.H. Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference (2009)
Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear)
120
Detailed results of VNS
Comments
• VNS works (slightly) better on problems with RC constraints
• This result is also confirmed by our latest improved implementations
• Is this because the other methods are more competitive without RC constraints?
YES => we may not have much chance to improve on problems without RC constraints
NO => we probably have a chance to improve on problems without RC constraints
=> Worth investigating!
121
Outline • Introduction • Real applications • The DNA Codes Design problem • Approaches in the literature • Construction heuristics • Simple local searches • Metaheuristics
– Intro to Stochastic Local Search – Applications to the DNA codes design problem – Intro to Variable Neighbourhood Search – Applications to the DNA codes design problem
• Future research • Acknowledgment • Bibliography
122
Essential bibliography (1/4)
[HEUR] => Heuristics related publication.
Brenner, S., Lerner, R.A. (1992). Encoded combinatorial chemistry. Proceedings of the National Academy of Science USA, 89, 5381-5383.
Adleman, L. (1994) Molecular computation of solutions to combinatorial problems. Science, 266, 1021-1024.
Frutos, A.G., Liu, Q., Thiel, A.J., Sanner, A.M.W., Condon, A.E., Smith, L.M., Corn, R.M. (1997). Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research, 25, 4748-4757.
Hansen, P., Mladenovic, N. (2001). Variable neighbourhood search: principles and applications. European Journal of Operational Research, 130, 449-467. [HEUR]
Marathe, A., Condon, A.E., Corn, R.M. (2001). On combinatorial DNA word design. Journal of Computational Biology, 8, 201-219.
Arita, M., Kobayashi, S. (2002). DNA sequence design using templates. New Generation Computing, 20, 263-277.
123
Essential bibliography (2/4)
Li, M., Lee, H.J., Condon, A.E., Corn, R.M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.
Tulpan, D.C., Hoos, H.H., Condon, A.E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241. [HEUR]
Tulpan, D.C., Hoos, H.H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433. [HEUR]
King, O.D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33. [HEUR]
Kobayashi, S., Kondo, T., Arita, M. (2003). On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214.
Hoos, H.H., Stuetzle, T. (2004). Stochastic Local Search: foundations and applications. Morgan Kaufmann/Elsevier. [HEUR]
124
Essential bibliography (3/4)
Gaborit, P., King, O.D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113. [HEUR]
Tulpan, D.C. (2006). Effective heuristic methods for DNA strand design. PhD thesis, University of British Columbia. [HEUR]
King, O.D. (2006). Tables of lower bounds for DNA codes with constant GC-content. http://llama.med.harvard.edu/~king/dnacodes.html, last checked: November 2009. [HEUR]
Chee, Y.M., Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394. [HEUR]
Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326. [HEUR]
Montemanni, R., Smith, D.H. (2009). Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656. [HEUR]
Montemanni, R., Smith D.H. (2009). Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference. [HEUR]
125
Essential bibliography (4/4)
Montemanni, R., Smith D.H., Koul, N. (2010). Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer. [HEUR]
Tulpan, D., Montemanni, R., Ghiggi, A. (2010). Computational Sequence Design Techniques for DNA Microarray Technologies. Submitted for publication. [HEUR]
Ghiggi, A. (2010). DNA strand design with thermodynamic constraints. Master thesis, USI. [HEUR]
Koul, N. (2010). Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI. [HEUR]
Neelakandan, I. (2010). New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI. [HEUR]
126
Exercises 1
1. We have the following code with n=4:
CGTA GGAA AATG TAGA
a. Does it respect the GC-content constraint?
b. Does it respect the Hamming distance constraint for a DNA codes design problem with d=2?
c. Does it respect the Reverse Complement Hamming distance constraint for a DNA codes design problem with d=2?
2. Given the settings n=4, d=3 and constraints HD, GC, RC, consider the following code:
AACC CAGT GAAG TCCT TGAC
a. Is it feasible?
b. Can it be extended?
127
Exercises 2
1. Given the settings n=3, d=2 and constraints HD, GC, show an execution of the Construction Heuristic working on top of the inverse lexicographic order.
2. Given the settings n=2, d=1 and constraints HD, GC, RC, show an execution of the Construction Heuristic working on top of the lexicographic order.
3. Given the settings n=3, d=2, constraints HD, GC, RC, and the following partial code:
CTT CAA TGT GTA
show an iteration of the Clique Search algorithm.
4. Given the settings n=4, d=2, constraints HD, GC, RC, and the following code:
TGGT GACC CGAA TCTC CGTT
calculate its measure of infeasibility Inf(W) according to the definition given in slide 75 (Iterated Greedy Search).
128
Exercises 3
1. Write the rotation neighbourhood of codeword CATGA.
2. Write 5 of the codewords of the 3-exchange neighbourhood of codeword CATGA.
3. Write 5 of the codewords of the random neighbourhood of codeword CATGA.
4. Write 5 of the codewords of the 2-exchange + random neighbourhood of codeword CATGA.
5. Consider the SLS method described from slide 119 on, with input parameters n=4, d=3, and constraints HD, GC, RC.
At a given iteration we have the following code L
CTTC GGTT GTCA AGGA ACTG TTGG
and the selected random codeword is TTGC.
Write down code L’ (we do not care if it will be accepted or not)