DNA Codes Design - SUPSI

IMPORTANT: Applications of metaheuristics to the Sequential Ordering Problem will be covered first. Applications of metaheuristics to DNA codes design will be treated only if time permits. Roberto

1

Roberto Montemanni

Dalle Molle Institute for Artificial Intelligence

University of Applied Science of Southern Switzerland

Email: [email protected] Tel: +41 58 666 666 7

DNA Codes Design

2

Outline

•  Introduction
•  The DNA Codes Design problem
•  Approaches in the literature
•  Construction heuristics
•  Simple local searches
•  Metaheuristics
   –  Intro to Stochastic Local Search
   –  Applications to the DNA codes design problem
   –  Intro to Variable Neighbourhood Search
   –  Applications to the DNA codes design problem
•  Bibliography

3

Contributions to slides

•  Dan C. Tulpan, NRC Institute for Information Technology, Canada (Introduction, Applications, Stochastic Local Searches)

•  Marco Chiarandini, University of Southern Denmark, Denmark (Introduction to Stochastic Local Search)

•  Thomas Stuetzle, Darmstadt University of Technology, Germany (Introduction to Variable Neighbourhood Search)

4




5

DNA – The Blueprint of Life

[Figure: DNA shown as the common blueprint of chimp, cow, dinosaur, bird, fish, worm, bacteria and human. 9 pictures taken from ClipArt]

Background: DNA

6

What is DNA?

•  All organisms on this planet are made of the same type of genetic blueprint.

7

Real Applications

•  DNA computing => using DNA for massively parallel computations.

•  DNA Chemical libraries => for the development and test of new drugs

•  DNA Microarrays => for profiling genes and tracing genes within long DNA strands

•  DNA Nanotechnologies => for the development of new materials/devices

http://en.wikipedia.org/wiki/DNA_computing

8




9

What is DNA?

•  genetic material
•  four letter alphabet (nucleotides, bases):
   –  A (adenine)
   –  C (cytosine)
   –  G (guanine)
   –  T (thymine)
•  complementary base pairs CG, AT
•  hybridization via base pairing

[Figure: perfect hybridization (AACGT paired with TTGCA) and imperfect hybridization (ATGGT paired with TTGCA), with the 3' and 5' strand ends marked. Image: DNA, Wikimedia Commons]

Background: DNA

10

Modeling

Design Goals

[Figure: Uniform Stability (AACGT paired with TTGCA) and Non-interaction (AACGT next to CACCC), with the 3' and 5' strand ends marked]

Desired properties

•  Desired properties come from real applications

•  Notice that properties are not the same for all applications

11

DNA Codes Design Problem description

Input data:

•  The alphabet {A, C, G, T}

•  A fixed length n for the codewords

•  A required distance d among codewords (used by constraints in Z)

• A set Z of constraints (explained in the next slides)

Optimization objective:

•  Find the largest possible set of codewords (= code) of length n on alphabet {A, C, G, T}, feasible with respect to constraints Z (based on d)

Why maximize the size of the code? To have more flexibility in the applications seen before!

12

Example Code (solution)

AATTCCGG  ACCTGATT  ATTCCCAG  ACCTTTTT  TATATATA
CATTCACC  GCTTATTC  GATTCAAT  TCACCATG  CCGTTACA
GCGCGCGC  CTATTCAC  TTGGCCAA  GGCTTTTA  CTACTACG

Each entry is a codeword; the word length is n = 8.

The solution respects a given constraint set Z (we do not know Z at this stage!)


13

Requirements of a DNA Code

•  Success in specific hybridization between a DNA codeword and its complement.

•  No hybridization between DNA codewords from the same DNA code, or between a DNA codeword and the complement of another codeword.

How do these requirements translate into our constraints set Z?


14

Constraints considered (set Z):

•  Requirement: the distance between two codewords must be large (no hybridization).

•  Answer: HD (Hamming Distance)

-  Given two codewords w1 and w2

-  H(w1, w2) = number of positions i in which the ith letter of w1 differs from the ith letter of w2

-  example: w1 = GCTA, w2 = ATTA, H(w1, w2) = 2

-  Constraint: H(w1, w2) ≥ d
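The definition above translates directly into a few lines of code. This is a minimal sketch; the function name `hamming` is our own choice and does not come from the slides.

```python
def hamming(w1: str, w2: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(w1) != len(w2):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(w1, w2))


# The example from the slide: GCTA and ATTA differ in the first two letters.
print(hamming("GCTA", "ATTA"))
```

The HD constraint for a pair of codewords is then simply `hamming(w1, w2) >= d`.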


15


•  Requirement: the number of G or C of each codeword must be the same (uniform stability) [=> self-hybridization is likely]

•  Answer: GC (GC-content constraint)

-  A fixed number of the letters of each word has to be either G or C: floor(n/2) in our case

-  example: ATA is not feasible, AGA is feasible
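A possible check for the GC-content constraint, assuming (as stated above) that exactly floor(n/2) letters must be G or C. The helper names `gc_count` and `satisfies_gc` are illustrative only.

```python
def gc_count(word: str) -> int:
    """Number of letters of the word that are G or C."""
    return sum(c in "GC" for c in word)


def satisfies_gc(word: str) -> bool:
    """GC constraint with the required content fixed to floor(n/2)."""
    return gc_count(word) == len(word) // 2


# The example from the slide, with n = 3 and floor(3/2) = 1.
print(satisfies_gc("ATA"), satisfies_gc("AGA"))
```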


16

•  Requirement: the distance between a codeword and the complement of another codeword must be large.

Watson-Crick complement of a DNA codeword

wcc(w) = Watson-Crick complement of a DNA codeword w, obtained by reversing w and then replacing each A by T (and vice-versa) and each C by G (and vice-versa)

-  example: wcc(ATGC) = GCAT
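The two steps of the definition (reverse, then complement each letter) can be sketched as follows. The dictionary name `COMPLEMENT` is an assumption made for this illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def wcc(word: str) -> str:
    """Watson-Crick complement: reverse the word, then complement each base."""
    return "".join(COMPLEMENT[c] for c in reversed(word))


# The example from the slide: ATGC reversed is CGTA, complemented is GCAT.
print(wcc("ATGC"))
```

Applying `wcc` twice gives back the original word, which is a handy sanity check.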


17


• Requirement: the distance between a codeword and the complement of another codeword must be large.

•  Answer: RC (Reverse Complement Hamming distance)

-  Given two codewords w1 and w2

-  example: GCTA, ATGC

H(GCTA, wcc(ATGC)) = H(GCTA,GCAT) = 2

-  Constraint: H(w1, wcc(w2)) ≥ d
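Combining the two previous definitions gives a check for the RC constraint. This is only a sketch: `rc_ok` is a hypothetical helper name, and the Hamming distance and complement are redefined here so that the block runs on its own.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1: str, w2: str) -> int:
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word: str) -> str:
    return "".join(COMPLEMENT[c] for c in reversed(word))


def rc_ok(w1: str, w2: str, d: int) -> bool:
    """RC constraint: w1 must be at distance at least d from wcc(w2)."""
    return hamming(w1, wcc(w2)) >= d


# The example from the slide: H(GCTA, wcc(ATGC)) = H(GCTA, GCAT) = 2.
print(hamming("GCTA", wcc("ATGC")))
```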


18

Example of a problem and its solution

•  Input data: n = 4, d = 3.

•  Constraints considered: HD, GC, RC

•  Solution:

the largest possible code with the characteristics above contains 6 codewords.

Optimal code with respect to the constraints considered (not unique!):

CTTC GGTT GTCA

AGGA ACTG TTGG
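The feasibility of the code above can be checked mechanically. The sketch below assumes that the RC constraint is also applied between a codeword and its own complement; this is our reading and is not stated explicitly on the slides. The function name `is_feasible` is illustrative.

```python
from itertools import combinations

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return "".join(COMPLEMENT[c] for c in reversed(word))


def is_feasible(code, d):
    """Check the GC, HD and RC constraints on a whole code."""
    n = len(code[0])
    # GC: every word has length n and exactly floor(n/2) letters in {G, C}.
    if any(len(w) != n or sum(c in "GC" for c in w) != n // 2 for w in code):
        return False
    # RC of a word against its own complement (assumption, see lead-in).
    if any(hamming(w, wcc(w)) < d for w in code):
        return False
    # HD and RC on every pair of distinct codewords.
    return all(hamming(a, b) >= d and hamming(a, wcc(b)) >= d
               for a, b in combinations(code, 2))


example = ["CTTC", "GGTT", "GTCA", "AGGA", "ACTG", "TTGG"]
print(is_feasible(example, 3))
```

Note that the check only confirms feasibility; proving that 6 is the largest possible size requires an exhaustive search or a bound.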

19

Problem description

Important observation

•  Other kinds of constraints are possible.

•  They depend on the real-world application considered

•  In this mini-course we limit ourselves to the constraints on the previous slides

20




21

Approaches from the literature

TEMPLATE-MAP DESIGN

•  Find the largest possible set of 8-mers with
   –  50% GC content in each word
   –  at least four mismatches between each word and the complement of each distinct word (reverse-complement constraint)
   –  at least four mismatches between each pair of words (direct Hamming constraint)
   –  based on template-map design

Kobayashi, S., Kondo, T., Arita, M. On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214 (2003).

Arita, M., Kobayashi, S. DNA sequence design using templates. New Generation Computing, 20, 263-277 (2002).

Frutos A.G., Liu, Q., Thiel A.J., Sanner A.M.W., Condon A.E., Smith L.M., Corn R.M. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Res. 25, 4748-4757 (1997)

Koul, N. Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI (2010).

22

TEMPLATE-MAP DESIGN


•  The selection of maps and templates is based on reasoning and theoretical results

•  Difficult to apply results to different problems: not a general approach

23

MATHEMATICAL CONSTRUCTIONS

•  Approaches adapted from classic Coding Theory

•  Theoretical results, based on the characteristics of the desired code, are used to produce mathematical constructions leading to (very regular) codes

•  Example: Theorem. If C0 is a code that is fixed by reverse permutation R, then the subcode C1 of C0 consisting of the codewords that are unchanged by R is obtained as the intersection of C0 and the code R(C0).


Gaborit P., King O. D. Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113 (2005).

•  Not a general method. Results typically hold for the problem under investigation only

•  The codes obtained are very regular. For many applications this is not desirable

King, O. D. Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33 (2003).

Neelakandan, I. New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI (2010).

24

HEURISTIC ALGORITHMS

•  Many of the classic heuristic algorithms have been adapted, implemented and tested

•  We will see some of them in detail…!


25




26

Construction Heuristics

Construction Heuristic (CH)

All possible codewords with the required GC-content are examined in a given order.

Codewords are incrementally accepted if feasible with respect to the already accepted ones.

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).

Smith, D.H., Hughes L.A., Perkins S. A new table of constant weight binary codes of length greater than 28. Electron. J. of Combinatorics, 13(1), #A2 (2006).

27


Example: n = 4, d = 3.

Constraints: HD, GC, RC

Lexicographic order:

AACC AACG AAGC AAGG ACAC ACAG ACCA ACCT ACGA ACGT ACTC ACTG AGAC AGAG AGCA AGCT AGGA AGGT AGTC AGTG ATCC ATCG ATGC ATGG CAAC CAAG CACA CACT CAGA CAGT CATC CATG CCAA CCAT CCTA CCTT CGAA CGAT CGTA CGTT CTAC CTAG CTCA CTCT CTGA CTGT CTTC CTTG GAAC GAAG GACA GACT GAGA GAGT GATC GATG GCAA GCAT GCTA GCTT GGAA GGAT GGTA GGTT GTAC GTAG GTCA GTCT GTGA GTGT GTTC GTTG TACC TACG TAGC TAGG TCAC TCAG TCCA TCCT TCGA TCGT TCTC TCTG TGAC TGAG TGCA TGCT TGGA TGGT TGTC TGTG TTCC TTCG TTGC TTGG

Solution: AACC ACAG AGGA CCTA GTCA
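The example above can be reproduced with a short sketch of the construction heuristic. Two assumptions are ours and not stated on the slides: candidates are generated in the order A < C < G < T, and the RC constraint is also applied between a word and its own complement (so self-complementary words such as CATG are rejected). The names `fits` and `construction_heuristic` are illustrative.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return "".join(COMPLEMENT[c] for c in reversed(word))


def fits(word, code, d):
    """True if word can be added to code under the HD and RC constraints."""
    if hamming(word, wcc(word)) < d:  # RC against itself (assumption)
        return False
    return all(hamming(word, c) >= d and hamming(word, wcc(c)) >= d
               for c in code)


def construction_heuristic(n, d, order=None):
    """Scan the candidates once, accepting each word that stays feasible."""
    if order is None:  # default: lexicographic order over A, C, G, T
        order = ("".join(t) for t in product("ACGT", repeat=n))
    code = []
    for word in order:
        if sum(ch in "GC" for ch in word) != n // 2:
            continue  # GC-content constraint
        if fits(word, code, d):
            code.append(word)
    return code


print(construction_heuristic(4, 3))
```

With n = 4 and d = 3 this sketch returns AACC, ACAG, AGGA, CCTA, GTCA, the same five codewords as the solution on the slide. Passing a different `order` gives the reverse lexicographic or random variants.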

28


•  The method works over any possible order of the nodes (lexicographic, reverse lexicographic, random) => different algorithms in fact…

•  Computational experiments suggest that random orders tend to give better results on DNA code design problems

•  Slow for large problems (all possible codewords have to be examined!)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).

29




30

Seed Building local search

Seed Building (SB)

Iterative approach

A set of seed codewords is considered

The set of seed codewords is dynamically adapted through iterations

During each iteration:

•  All possible codewords with the required GC-content are examined in a given order.

•  Codewords are incrementally accepted if feasible with those already accepted in the current iteration and with the seed codewords.

Statistics are used to expand or contract the set of seed codewords every ItrSeed iterations, based on the quality of the solutions built.

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).

Brouwer A.E., Shearer J.B., Sloane N.J.A., Smith W.D. A new table of constant weight codes. IEEE Trans. Inf. Theory 36, 1334-1380 (1990).

31

Seed Building local search

Seed codewords management

32


Example: n = 4, d = 3.

Constraints: HD, GC, RC

Seed codewords: AACC ACAG

Random order:

CTTC CTTG CTCA CTCT CTGA CTGT CTAC CTAG CATC CATG CACA CACT CAGA CAGT CAAC CAAG CCTA CCTT CCAA CCAT CGTA CGTT CGAA CGAT GTTC GTTG GTCA GTCT GTGA GTGT GTAC GTAG GATC GATG GACA GACT GAGA GAGT GAAC GAAG GCTA GCTT GCAA GCAT GGTA GGTT GGAA GGAT TTCC TTCG TTGC TTGG TACC TACG TAGC TAGG TCTC TCTG TCCA TCCT TCGA TCGT TCAC TCAG TGTC TGTG TGCA TGCT TGGA TGGT TGAC TGAG ATCC ATCG ATGC ATGG AACC AACG AAGC AAGG ACTC ACTG ACCA ACCT ACGA ACGT ACAC ACAG AGTC AGTG AGCA AGCT AGGA AGGT AGAC AGAG

Solution: AACC ACAG CCTA GTCA TCCT
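A single Seed Building iteration can be sketched as a construction pass that starts from the seed codewords instead of an empty code. This is a simplified illustration only: the statistics-based management of the seed set (every ItrSeed iterations) is not reproduced, the random order is drawn with Python's `random` module rather than taken from the slide, and `build_from_seeds` is a hypothetical name.

```python
import random
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return "".join(COMPLEMENT[c] for c in reversed(word))


def fits(word, code, d):
    """HD and RC feasibility of word against code (and against itself)."""
    if hamming(word, wcc(word)) < d:
        return False
    return all(hamming(word, c) >= d and hamming(word, wcc(c)) >= d
               for c in code)


def build_from_seeds(seeds, n, d, rng):
    """One iteration: extend the seeds by scanning candidates in random order."""
    candidates = ["".join(t) for t in product("ACGT", repeat=n)]
    candidates = [w for w in candidates
                  if sum(ch in "GC" for ch in w) == n // 2]
    rng.shuffle(candidates)
    code = list(seeds)
    for word in candidates:
        if fits(word, code, d):  # seeds themselves fail here (distance 0)
            code.append(word)
    return code


print(build_from_seeds(["AACC", "ACAG"], 4, 3, random.Random(0)))
```

Different random orders lead to different codes, which is what the seed statistics are meant to exploit.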

33


•  The method works over any possible order of the nodes (lexicographic, reverse lexicographic, random).

•  Experiments clearly show that a random order has to be preferred for DNA codes design problems.

•  The process of identifying a good set of codewords is intrinsically difficult => codes produced are sometimes very good and sometimes very poor => not a very robust method

•  Slow for large problems (all possible codewords are examined at each iteration!)

34

Clique Search local search

•  Clique: given an undirected graph G, a clique is a set of vertices in which every vertex is connected to every other vertex of the set

•  Maximum clique problem: given an undirected graph G, identify the largest clique of G (in number of vertices)

•  Complexity: classic NP-hard problem

Example (on the graph shown in the slide):

•  {0, 3, 4} is a clique

•  {2, 3, 4, 5} is a maximum clique

35

Clique Search local search

Clique Search (CS)

Iterative approach

A partial code can be completed by solving a subproblem (which is a maximum clique problem) to optimality

During each iteration:

•  All possible codewords with the required GC-content are examined in a random order.

•  Codewords are accepted for the second phase if feasible with those of the partial code.

•  A maximum clique problem is solved on the set of accepted codewords to complete the partial code

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).


36


37


Example: n = 4, d = 3. Constraints: HD, GC, RC

Partial code: CTTC CGAA TGGT GTGA

Maximum clique problem on feasible extensions of the partial solution:

CACT AGTG

AAGC GCTT

38




Solution: CTTC CGAA TGGT GTGA CACT GCTT
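The second phase of the example can be sketched as a maximum clique search on the compatibility graph of the four feasible extensions, where two codewords are linked when they satisfy HD and RC with each other. For illustration only, the clique is found here by brute-force enumeration, which is viable only for tiny instances; `compatible` and `max_clique` are hypothetical names.

```python
from itertools import combinations

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return "".join(COMPLEMENT[c] for c in reversed(word))


def compatible(a, b, d):
    """Edge of the compatibility graph: a and b can coexist in a code."""
    return hamming(a, b) >= d and hamming(a, wcc(b)) >= d


def max_clique(candidates, d):
    """Largest set of pairwise compatible candidates (exhaustive search)."""
    for size in range(len(candidates), 0, -1):
        for subset in combinations(candidates, size):
            if all(compatible(a, b, d) for a, b in combinations(subset, 2)):
                return list(subset)
    return []


extensions = ["CACT", "AGTG", "AAGC", "GCTT"]
print(max_clique(extensions, 3))
```

On these four candidates the largest clique has two codewords, because AGTG is the Watson-Crick complement of CACT and AAGC is the complement of GCTT. Several cliques of size two exist; the slide completes the partial code with CACT and GCTT.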

39


•  Solving a maximum clique problem (sub-procedure) is an NP-hard problem itself!

•  Heuristics have to be used for the maximum clique problem

=> no optimality is guaranteed for the sub-problem solutions

•  The choice of the number of codewords to eliminate is crucial

-  too many codewords eliminated => very large maximum clique problem => high probability of suboptimality

-  not enough codewords eliminated => very likely to find a code with the same number of codewords as the original

-  This aspect deserves a deeper study to tackle large problems!

40

Hybrid Search local search

Hybrid Search (HS)

Iterative approach

Merges the concepts of the two methods analyzed before.

A set of seed codewords is managed exactly as in Seed Building.

Seed codewords represent the partial code in the context of the Clique Search.

A relaxed distance d' < d is introduced.

A candidate codeword has to be at least at distance d from the seeds, and d' from the other candidate codewords (this to keep the maximum clique problem to a reasonable size!)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

41

Hybrid Search local search

Seed Building

Clique Search

42



Partial code (seed codewords): CAAC AGAG

Maximum clique problem on feasible extensions of the partial solution (heuristic distance d'=1 to reduce the codewords considered):

TGGT

TCTC TGTC

TTGC TAGG

TACG ATGC

ACTC

43




Solution: CAAC AGAG TCTC TGGT TACG ATGC

44


•  Sums the advantages of Seed Building to those of Clique Search

but…

•  There is the risk of summing up drawbacks instead!

•  The method deserves a further detailed study for larger problems

45

Experimental comparison of some of the heuristic algorithms

Experimental settings Methods coded in ANSI C

Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

Maximum computation times: 10'000 seconds (2.8 hours)

Statistics over 5 runs for each combination problem/method

A_4^{Cstrs}(5,3,2) identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]


46


•  SB = Seed Building

•  CS = Clique Search

•  HS = Hybrid Search

47





48


Comments

•  No clear ranking is possible among the methods considered: Seed Building, Clique Search, and Hybrid Search

•  Methods are therefore likely to represent different neighbourhoods

49

Idea

•  All the methods seen until now work on the search space of feasible solutions (we never have constraints violated…)

•  What if we move into the search space of infeasible solutions? => we will have to minimize (i.e. bring down to zero!) a measure of infeasibility!

•  This makes it possible to develop a completely different kind of local search!

•  It is likely that the search space is visited in a different way by such a family of algorithms…

50

Iterated Greedy Search local search

Iterated Greedy Search (IGS)

Iterative approach

Working on an infeasible code W, trying to make it feasible.

Measure of the infeasibility of W:

where w = floor(n/2)


51

Iterated Greedy Search local search

Iterated Greedy Search (IGS)

An infeasible solution is obtained by adding a random codeword to a perturbed feasible solution

During each iteration:

•  A codeword σ is selected at random and the optimal (according to Inf(W)) change of one bit of σ is carried out.

•  If Inf(W)=0, we are done, and we can add a random codeword

52

Iterated Greedy Search local search

[Figure: the search alternates between "Perturbation of the solution" and "Optimization of the solution"]

53



W                                      Inf(W)
...
TGGT GACC CGAA TCAC CCTT               1
TGGT GACT CGAA TCAC CCTT               0

TGGT GGCA CGAA TCAC CCTT TTTG          8
TGGT GGCA CGTA TCAC CCTT TTTG          8
TGGT GGCA CGTA TCAC GCTT TTTG          7
TGGT GGCA CGTC TCAC GCTT TTTG          7
…
TGGT AGTG CGTC TCAC GCTT TTTG          4
TGGT AGTG CGTC TCAC GCTT TTCG          3
TGGT AGTG CTTC TCAC GCTT TTCG          0

TGGT AGTG GTAG TCAC GGTT TTCG AACT     9
TGGT AGTG GTAG TCTC GGTT TTCG AACT     9
...

54


•  We change exactly one bit of a random codeword at each iteration: more complex neighbourhoods could be considered…

•  We never accept changes that make the solution worse: accepting them might be a way to escape from local minima

•  Further investigation is warranted…

55









56





•  IGS = Iterated Greedy Search

57






58


Comments

•  No clear ranking is possible among the methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search

•  Methods are likely to represent different neighbourhoods

59




60

Stochastic Local Search: Simple SLS methods

Goal: Effectively escape from local minima of given evaluation function.

General approach: For fixed neighbourhood, use step function that permits worsening search steps.

Specific methods:

•  Randomized Iterative Improvement
•  Simulated Annealing
•  Attribute Based Hill Climber
•  Dynamic Local Search
•  Iterated Local Search
•  Tabu Search

61

Stochastic Local Search: Randomized Iterative Improvement

Key idea: In each search step, with a fixed probability perform an uninformed random walk step instead of an iterative improvement step.

Randomized Iterative Improvement (RII):

    determine initial candidate solution s
    while termination condition is not satisfied do
        With probability p: choose a neighbor s' of s uniformly at random
        Otherwise: choose a neighbor s' of s such that g(s') < g(s) or, if no such s' exists, choose s' such that g(s') is minimal
        s := s'

Where g(s) is the objective function value (fitness) of solution s
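The RII scheme can be written as a small generic routine. This is a sketch under our own assumptions: the termination condition is a fixed number of steps, an improving neighbour is chosen uniformly at random among the improving ones, and the best solution seen is tracked and returned (the pseudocode itself only updates the current solution). The names `rii`, `neighbours` and `g` are illustrative.

```python
import random


def rii(initial, neighbours, g, p, max_steps, rng=None):
    """Randomized Iterative Improvement, minimising g."""
    rng = rng or random.Random()
    s = initial
    best = s
    for _ in range(max_steps):
        nbrs = neighbours(s)
        if not nbrs:
            break
        if rng.random() < p:
            s = rng.choice(nbrs)  # uninformed random walk step
        else:
            improving = [t for t in nbrs if g(t) < g(s)]
            # improvement step, or least-worsening step in a local minimum
            s = rng.choice(improving) if improving else min(nbrs, key=g)
        if g(s) < g(best):
            best = s  # remember the best solution seen (our addition)
    return best


# Toy usage: minimise (x - 7)^2 over the integers, moving by +1 or -1.
result = rii(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2,
             p=0.2, max_steps=200, rng=random.Random(0))
print(result)
```

Because the search never stops in a local minimum, keeping track of the best solution separately is what makes the returned value meaningful.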

62

Observations: •  No need to terminate search when local minimum is encountered.

Instead: Impose limit on number of search steps or CPU time, from beginning of search or after last improvement.

•  Probabilistic mechanism permits arbitrarily long sequences of random walk steps

Therefore: When run sufficiently long, RII is guaranteed to find (optimal) solution to any problem instance with arbitrarily high probability.

•  Generally, RII is often outperformed by more complex LS methods.

Stochastic Local Search: Randomized Iterative Improvement

63




64

Stochastic Local Search for the DNA codes design problem

Target: a code with k codewords

1.  Start with k random codewords
2.  Mark unsatisfied constraints (conflicts)
3.  If no unsatisfied constraints go to 8
4.  Pick 2 codewords involved in a conflict
5.  With probability p select a better word minimizing the number of conflicts
6.  Otherwise select a random codeword
7.  Go to step 3.
8.  Display all k codewords

It is a Randomized Iterative Improvement!

Tulpan, D.C., Hoos, H.H., Condon, A.E. Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241 (2002).
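The steps listed above can be sketched as follows. Several details are our assumptions rather than the published algorithm: only the HD constraint is handled, candidate replacements are the 1-exchange neighbours of the two conflicting words, and the random step picks one of those neighbours uniformly. The names `conflicts`, `one_exchange` and `sls` are illustrative.

```python
import random

ALPHABET = "ACGT"


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def conflicts(code, d):
    """All index pairs (i, j) whose words violate the HD constraint."""
    return [(i, j) for i in range(len(code)) for j in range(i + 1, len(code))
            if hamming(code[i], code[j]) < d]


def one_exchange(word):
    """All words that differ from word in exactly one position."""
    return [word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET if c != word[i]]


def sls(k, n, d, p, max_iters, rng):
    """Return a feasible code with k words, or None if the budget runs out."""
    code = ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(k)]
    for _ in range(max_iters):
        conf = conflicts(code, d)
        if not conf:
            return code                      # step 8: no conflicts left
        pair = rng.choice(conf)              # step 4: pick a conflict
        moves = [(i, w) for i in pair for w in one_exchange(code[i])]
        if rng.random() < p:                 # step 5: best replacement
            def cost(move):
                i, w = move
                return len(conflicts(code[:i] + [w] + code[i + 1:], d))
            i, w = min(moves, key=cost)
        else:                                # step 6: random replacement
            i, w = rng.choice(moves)
        code[i] = w
    return None


print(sls(k=6, n=6, d=3, p=0.8, max_iters=2000, rng=random.Random(1)))
```

Recomputing all conflicts for every candidate move keeps the sketch short; an efficient implementation would update the conflict counts incrementally.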

65

[Figure: state diagram of the algorithm with states SI, SBI (Best Improvement), SRW (Random Walk) and SF, and transition probabilities p and 1 - p]


66

[Figure: flowchart of the algorithm with blocks Initialization, Evaluate, Pick Conflict, Best Improvement (probability p), Random Walk (probability 1-p) and Return Result, and a Yes/No test]


67

Select Conflicts Neighbourhood

Random Walk

Iterative / Best Improvement


68

Given: a fixed set of constraints C, strand length n=8, set size k=14.

Current Set:

 1.  ACCTGATT
 2.  ATTCTCAG
 3.  ACCTTTTT
 4.  TATATATA
 5.  CATTCACC
 6.  ATTCTCAA
 7.  GATTCAAT
 8.  TCACCATG
 9.  CCGTTACA
10.  GCGCGCGC
11.  CTATTCAC
12.  TTGGCCAA
13.  GGCTTTTA
14.  CTACTACG

Conflicts before the step: (1,3) (1,5) (2,7) (2,9) (12,14)

Pick Conflict, then choose between Random Walk (p) and Best Improvement (1 - p)

Neighbors of codeword 2: TTTCTCAG, AATCTCAG, …

Codeword 2 is replaced by TTTCTCAG

Conflicts after the step: (1,3) (1,5) (2,11) (12,14)


69

Simple SLS without Random Replacement

k = {100, 120, 140}, n = 8, d = 4

HD constraint only

1000 successful runs

Stochastic Local Search for the DNA codes design problem - results

Comments:

•  The number of iterations required increases with k

•  The increase is more dramatic when k is high => risk of stagnation

Distribution of the number of iterations required to have a feasible solution for different values of k (target number of codewords)

70

k = 70, n = 8, d = 4

HD, RC, GC constraints


SLS with Random Replacement vs Simple SLS


Distribution of the number of iterations required to have a feasible solution for k = 70 (target number of codewords)

Comments:

•  Random Replacement helps!

•  Stagnation reduced

•  Better robustness

71

Scaling of SLS with Random Replacement

[Figure: number of search iterations (logarithmic axis, 10 to 100000) versus DNA set size (20 to 160) for n = 8, d = 4, with one curve each for HD, HD+GC and HD+GC+RC]

Comments:

•  SLS scales up better when fewer constraints are considered

•  Why? Because fewer constraints => easier problem, intuitively

72

New bounds on the size of DNA codes

Note: HD, GC constraints.

n     d     Previous best   SLS
6     3     56              85
10    5     132             256
14    7     240             500
18    9     380             1200
20    10    1520            2193


73

Comments: •  There are improvements over previous best.

•  The method is still extremely simple and intuitive [good quality in general but...]

•  Is it possible to improve it with some refinement?

•  Where should we work to refine the method?


IDEA: trying different neighbourhoods!

74

Improved Stochastic Local Search for the DNA codes design problem

•  Combinatorial problem Π: DNA Word Design

•  Problem instance π: DNA/quaternary code design [particular (n,d) combinations]

•  Search space S(π): set of (code word) sets s

•  Neighborhood relation N(π): k-exchange + random based neighborhoods (instead of simple 1-exchange!)

•  Initialization function init(π): random choosing or predefined

•  Step function step(π): chooses with probability p between best improvement and random walk

•  Terminate predicate terminate(π): a function depending on the number of iterations performed or solution found

Tulpan, D.C., Hoos, H.H. Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433 (2003).

75

Neighbourhoods

Simple neighbourhoods

•  k-exchange / k-point mutation neighbourhoods

•  rotation-based neighbourhoods

•  random neighbourhoods

Complex neighbourhoods

•  1-exchange / 1-point mutation + rotation neighbourhoods

•  k-exchange / k-point mutation + random words neighbourhoods

•  1-exchange / 1-point mutation + rotations + random words neighbourhoods


76


v-exchange / v-point mutation neighbourhoods

Example:

some of the codewords in the 2-exchange neighbourhood of CTA are:

ACA GTT TTG TCA
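A v-exchange neighbourhood can be enumerated by choosing v positions and replacing each of them with a different letter. The sketch below follows the example above and changes exactly v positions, without filtering on GC content; both choices are assumptions made for this illustration, and `v_exchange` is a hypothetical name.

```python
from itertools import combinations, product


def v_exchange(word, v):
    """All words obtained by changing exactly v positions of word."""
    result = set()
    for positions in combinations(range(len(word)), v):
        # For each chosen position, the three letters that differ from it.
        choices = [[c for c in "ACGT" if c != word[i]] for i in positions]
        for letters in product(*choices):
            new_word = list(word)
            for i, c in zip(positions, letters):
                new_word[i] = c
            result.add("".join(new_word))
    return result


neighbourhood = v_exchange("CTA", 2)
print(len(neighbourhood), sorted(neighbourhood)[:5])
```

For CTA and v = 2 there are 3 pairs of positions and 3 × 3 letter choices per pair, hence 27 neighbours, among them the four words listed on the slide.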


77


rotation-based neighbourhoods

Applying the neighbourhood to a given codeword, we get the codewords obtained from the input codeword by "shifting right" the codeword by 1 to n-1 positions.

Example:

CTA => TAC, ACT
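The rotation-based neighbourhood amounts to a cyclic right shift by 1 to n-1 positions. A minimal sketch, with `rotations` as an illustrative name:

```python
def rotations(word):
    """Cyclic right shifts of word by 1, 2, ..., n-1 positions."""
    n = len(word)
    return [word[-s:] + word[:-s] for s in range(1, n)]


# The example from the slide: the rotations of CTA are ACT and TAC.
print(rotations("CTA"))
```

A rotation keeps the multiset of letters unchanged, so the GC content of the codeword is preserved automatically.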


78


random neighbourhoods

Example: some of the codewords in the random neighbourhood of CTA

are: CAA CTT TTC TCA


79

Complex neighbourhoods

1-exchange + rotation neighbourhoods

v-exchange + random words neighbourhoods

1-exchange + rotations + random words neighbourhoods

•  These neighbourhoods are obtained by applying all the neighbourhoods involved sequentially (repeated codewords have to be avoided)

•  When rotation is involved, it is applied to all the codewords obtained by the neighbourhoods previously applied


80


The difference is here!

81

k-exchange Neighbourhoods k = 70, n = 8, d = 4



{1, 2, 3}-exchange neighbourhoods

Improved Stochastic Local Search for the DNA codes design problem - results

Distribution of the number of iterations required to have a feasible solution for different v-exchange methods

Comments:

•  Using a larger neighbourhood seems to help but…

•  The difference between 2-exchange and 3-exchange is not dramatic

•  Larger neighbourhood means more time at each iteration…

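Why 16? The conflict involves 2 words, each with n = 8 positions, and the GC content has to be respected, so each position admits a single alternative letter: 2 × 8 × 1 = 16 neighbours.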

82

k-exchange Neighbourhoods

Time for 1 iteration:

Neighbourhood    CPU Time
1-exchange       .0017
2-exchange       .0088
3-exchange       .0314

Distribution of the CPU time required to have a feasible solution for different v-exchange methods

Comments:

•  1-exchange is still the best in terms of run times => not what we hoped!

83

Hybrid Randomized Neighbourhoods

k = 70, n = 8, d = 4



random, hybrid neighbourhoods


Distribution of the number of iterations required to have a feasible solution for different hybrid neighbourhoods

Comments:

•  Pure random performs surprisingly well

•  1-exchange + random is however the best method

84

All combinations of neighbourhoods together (usual benchmark)


Distribution of the number of iterations required to have a feasible solution for different neighbourhoods

Comments:

•  1-exchange + rotation + random is the most promising combination in terms of number of iterations

•  Methods including the random neighbourhood are definitely better

85

Approximate CPU Cost per Iteration for all the combinations of neighbourhoods considered

Neighbourhood Type                   Neighbourhood size   CPU Time [sec]
1-exchange                           16                   .002184
2-exchange                           72                   .008830
3-exchange                           184                  .031493
1-exchange + rotations               128                  .017294
random                               128                  .015100
1-exchange + random                  16 + 112             .022889
2-exchange + random                  72 + 112             .029167
3-exchange + random                  184 + 112            .040833
1-exchange + rotations + random      128 + 100            .043333


Comment:

•  1-exchange + random is a good compromise between speed and quality of the solutions

•  Let’s see now what happens if we consider both the time spent on each iteration, and the number of iterations required to converge… [next slide]

86

All combinations of neighbourhood together (usual benchmark)


Distribution of the CPU time required to have a feasible solution for different neighbourhoods

Comments:

•  Rotation is time consuming => methods with rotation are not so convenient anymore

•  1-exchange + random neighbourhood is by far the most promising combination in terms of CPU time

87


Is this randomized step still interesting?

88


Number of iterations to have a feasible solution for different values of the randomizing parameter

Comments:

•  The randomized step is useless when the hybrid randomized neighbourhood is used!

•  This happens because the neighbourhood already does the “random work”

k = 70, n = 8, d = 4



random, hybrid neighbourhoods

89


90

n = 8, d = 4



1-exchange, random, hybrid neighbourhoods

Scaling of the Improved SLS


Comments:

•  It is surprising how well the pure random neighbourhood scales up

•  However, 1-exchange + random neighbourhood is the best

91

SLS Results and Analysis

•  New bounds for DNA set sizes

•  Improved SLS using various neighborhoods

Length (n)   Hamming dist. (d)   Existing Bounds (k)   Simple SLS (k)   Improved SLS (k)
4            3                   -                     5                6
8            4                   108                   112*             128
10           5                   -                     127              158
12           6                   -                     210              240

Combinatorial constraints: HD, RC, GC

[Tulpan et al., 2002] [Tulpan et al., 2003] [Frutos et al., 1997]


92

Conclusions

•  Random neighbourhoods => increased SLS performance

•  1-exchange + random neighbourhood is the best combination

•  Larger DNA codes have been obtained


93

Another Stochastic Local Search for the DNA codes design problem

Chee, Y. M, Ling, S. Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394 (2008).

•  A different SLS algorithm has been presented in the literature.

•  It can be seen as a Simulated Annealing algorithm without a cooling schedule (constant temperature).

•  The current code L is always feasible

•  At each iteration a new (feasible) codeword s is added, and all the codewords of L that are not compatible with s are removed, leading to a new code L’

•  Code L’ is accepted with a certain probability depending on |L’| - |L| (difference in the cardinalities of the two sets)

94


Target number of codewords (k before)

Max number of iterations

Code

Set of incompatible codes

Acceptance probability of the new code:

95


Improvements over previous bests in the literature (theoretical methods, other SLSs and a few more)

HD, GC and RC constraints

96

Stochastic Local Searches for the DNA codes design problem

•  Different methods based on a similar idea lead to very different codes

•  No single method dominates the others

•  The methods seem to explore the search space in different ways

•  Is it possible to combine the good properties of (some of) the different approaches into a single method?

97




98

VNS


103




104

A VNS algorithm for DNA codes design A primitive Variable Neighbourhood Search (VNS) algorithm is introduced.

It runs the local search algorithms seen before (the basic ingredients) in turn, iteratively.

The reference solution for local searches is always the best solution retrieved so far.


This is a Variable Neighbourhood Descent!
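The idea can be sketched as a generic Variable Neighbourhood Descent loop that maximises the size of the code. This is an illustration under our own assumptions, not the published implementation: the search returns to the first local search after each improvement and stops when no local search improves the best code, whereas the experiments reported here use a time limit. The names `vnd`, `add_first` and `add_second` are hypothetical, and the two toy "local searches" stand in for SB, CS, HS and IGS.

```python
def vnd(initial, local_searches, size=len):
    """Apply the local searches in turn, always starting from the best code."""
    best = initial
    k = 0
    while k < len(local_searches):
        candidate = local_searches[k](best)
        if size(candidate) > size(best):
            best, k = candidate, 0   # improvement: restart from the first one
        else:
            k += 1                   # no improvement: try the next one
    return best


# Toy local searches on sets of codewords.
def add_first(code):
    return code | {"AACC"}


def add_second(code):
    # Can only improve once the first search has done its work.
    return code | {"ACAG"} if "AACC" in code else code


print(sorted(vnd(frozenset(), [add_second, add_first])))
```

The toy example shows the intended effect: a search that is stuck on its own becomes useful again after another search has changed the reference solution.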


Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear)

105

A VNS algorithm for DNA codes design

Methods involved in our implementation

106

A VNS algorithm for DNA codes design

•  We hope to take advantage of the different philosophies behind the local search methods listed before

•  From previous experiments we know that the basic local searches visit the search space in different ways

•  We hope basic local searches will help each other to exit from local minima within a VNS framework

107









108






•  VNS = Variable Neighbourhood Search

109






•  VNS = Variable Neighbourhood Search

110


Comments

•  No clear ranking is possible among the basic methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search (as seen before…)

⇒  Methods are likely to represent different neighbourhoods

•  Variable Neighbourhood Search clearly dominates the other methods

⇒  VNS takes advantage of the different neighbourhoods

⇒  VNS is likely to be competitive against all the other methods!

111 Reference algorithm

Experimental results of VNS

The VNS algorithm discussed in:

•  Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326.

is compared with the methods discussed in the following 6 papers [which provide all the best known codes]:

•  Li, M., Lee, H. J., Condon, A. E., and Corn, R. M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.

•  Tulpan, D. C., Hoos, H. H., and Condon, A. E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, 2568, 229-241.

•  Tulpan, D. C. and Hoos, H. H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, 2671, 418-433.

•  King, O. D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33.

•  Gaborit, P. and King, O. D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113.

•  Chee, Y. M. and Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394.

Theor. Constructions Heuristic Algorithms

112

Experimental results of VNS

Experimental settings

•  Methods coded in ANSI C

•  Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

•  Maximum computation times: 100'000 seconds (27.8 hours)

=> Comparable with that of other heuristic algorithms

•  Best result over 5 runs for each problem/method combination


113

•  We will consider 254 problems with

-  4 ≤ n ≤ 20

-  3 ≤ d ≤ n ≤ 20

-  Case 1: HD and GC constraints

-  Case 2: HD, RC and GC constraints

•  These settings match those of the state-of-the-art tables maintained at http://llama.med.harvard.edu/~king/dnacodes.html by O.D. King (last checked November 2009)

•  We left out problems corresponding to very large codes (the current VNS algorithm cannot tackle them)
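The two cases above can be made concrete with a small feasibility checker. This is a sketch under the usual definitions from the DNA codes literature, which are assumptions here: HD requires Hamming distance at least d between every pair of distinct codewords, RC requires distance at least d between every codeword and the reverse complement of every codeword (itself included), and GC requires exactly `w` bases per codeword to be G or C. All function names are illustrative.

```python
from itertools import combinations

# Watson-Crick complement: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x: str, y: str) -> int:
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def reverse_complement(x: str) -> str:
    """Reverse the word, then complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(x))


def gc_count(x: str) -> int:
    """Number of G or C bases in the word."""
    return sum(b in "GC" for b in x)


def satisfies_hd(code: list, d: int) -> bool:
    # Every pair of distinct codewords must be at distance >= d.
    return all(hamming(x, y) >= d for x, y in combinations(code, 2))


def satisfies_rc(code: list, d: int) -> bool:
    # Assumed convention: the pair (x, x) is checked as well.
    return all(hamming(x, reverse_complement(y)) >= d for x in code for y in code)


def satisfies_gc(code: list, w: int) -> bool:
    # Assumed convention: exactly w bases per word are G or C.
    return all(gc_count(x) == w for x in code)


def is_feasible(code: list, d: int, w: int, use_rc: bool) -> bool:
    """Case 1 (HD and GC) when use_rc is False, Case 2 (HD, RC and GC) otherwise."""
    ok = satisfies_hd(code, d) and satisfies_gc(code, w)
    return ok and (not use_rc or satisfies_rc(code, d))
```

For instance, `is_feasible(["AACG", "CGTT"], 3, 2, use_rc=False)` holds, while the same call with `use_rc=True` fails because CGTT is the reverse complement of AACG.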


114

•  Over the 254 problems considered:

•  in 128 cases the best known result is matched

•  in 52 cases a new best result is found



115

Detailed results of VNS

116


117


118


119

•  Since the publication of the paper we have been improving the VNS algorithms in many ways (work still in progress!)

•  Over the 254 problems considered:

•  in 132 cases (previously 128) the best known result is matched

•  in 87 cases (previously 52) a new best result is found

•  We miss the best known solution in only 13.8% of the cases!

•  We feel there is room for further improvements…
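The 13.8% figure can be recovered from the counts on this slide, assuming a "missed" problem is one where the best known result is neither matched nor improved:

```python
total = 254      # problems considered
matched = 132    # best known result matched
improved = 87    # new best result found

# Assumption: every remaining problem counts as a miss.
missed = total - matched - improved
share = 100 * missed / total
print(f"{missed} of {total} problems missed: {share:.1f}%")  # 35 of 254 problems missed: 13.8%
```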


Montemanni, R., Smith, D.H. Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference (2009)

Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear)

120


Comments

•  VNS works (slightly) better on problems with RC constraints

•  This result is also confirmed by our latest improved implementations

•  Is this because the other methods are more competitive without RC constraints?

YES => we may have little chance to improve on problems without RC constraints

NO => we probably have a chance to improve on problems without RC constraints

=> Worth investigating!

121

Outline •  Introduction •  Real applications •  The DNA Codes Design problem •  Approaches in the literature •  Construction heuristics •  Simple local searches •  Metaheuristics


•  Future research •  Acknowledgment •  Bibliography

122

Essential bibliography (1/4)

[HEUR] => Heuristics-related publication.

Brenner, S., Lerner, R.A. (1992). Encoded combinatorial chemistry. Proceedings of the National Academy of Sciences USA, 89, 5381-5383.

Adleman, L. (1994) Molecular computation of solutions to combinatorial problems. Science, 266, 1021-1024.

Frutos, A.G., Liu, Q., Thiel, A.J., Sanner, A.M.W., Condon, A.E., Smith, L.M., Corn, R.M. (1997). Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research, 25, 4748-4757.

Hansen, P., Mladenovic, N. (2001). Variable neighbourhood search: principles and applications. European Journal of Operational Research, 130, 449-467. [HEUR]

Marathe, A., Condon, A.E., Corn, R.M. (2001). On combinatorial DNA word design. Journal of Computational Biology, 8, 201-219.

Arita, M., Kobayashi, S. (2002). DNA sequence design using templates. New Generation Computing, 20, 263-277.

123

Essential bibliography (2/4)

Li, M., Lee, H.J., Condon, A.E., Corn, R.M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.

Tulpan, D.C., Hoos, H.H., Condon, A.E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241. [HEUR]

Tulpan, D.C., Hoos, H.H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433. [HEUR]

King, O.D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33. [HEUR]

Kobayashi, S., Konto, T., Arita, M. (2003). On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214.

Hoos, H.H., Stuetzle, T. (2004). Stochastic Local Search: foundations and applications. Morgan Kaufmann/Elsevier. [HEUR]

124

Essential bibliography (3/4)

Gaborit, P., King, O.D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113. [HEUR]

Tulpan, D.C. (2006). Effective heuristic methods for DNA strand design. PhD thesis, University of British Columbia. [HEUR]

King, O.D. (2006). Tables of lower bounds for DNA codes with constant GC-content. http://llama.med.harvard.edu/~king/dnacodes.html, last checked: November 2009. [HEUR]

Chee, Y.M., Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394. [HEUR]

Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326. [HEUR]

Montemanni, R., Smith, D.H. (2009). Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656. [HEUR]

Montemanni, R., Smith, D.H. (2009). Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference. [HEUR]

125

Essential bibliography (4/4)

Montemanni, R., Smith, D.H., Koul, N. (2010). Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer. [HEUR]

Tulpan, D., Montemanni, R., Ghiggi, A. (2010). Computational Sequence Design Techniques for DNA Microarray Technologies. Submitted for publication. [HEUR]

Ghiggi, A. (2010). DNA strand design with thermodynamic constraints. Master thesis, USI. [HEUR]

Koul, N. (2010). Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI. [HEUR]

Neelakandan, I. (2010). New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI. [HEUR]

126

Exercises 1

1.  We have the following code with n=4:

CGTA GGAA AATG TAGA

a.  Does it respect the GC-content constraint?

b.  Does it respect the Hamming distance constraint for a DNA codes design problem with d=2?

c.  Does it respect the Reverse Complement Hamming distance constraint for a DNA codes design problem with d=2?

2.  Given the settings n=4, d=3 and constraints HD, GC, RC, consider the following code:

AACC CAGT GAAG TCCT TGAC

a.  Is it feasible?

b.  Can it be extended?

127

Exercises 2

1.  Given the settings n=3, d=2 and constraints HD, GC, show an execution of the Construction Heuristic working on top of the inverse lexicographic order.

2.  Given the settings n=2, d=1 and constraints HD, GC, RC, show an execution of the Construction Heuristic working on top of the lexicographic order.

3.  Given the settings n=3, d=2, constraints HD, GC, RC, and the following partial code:

CTT CAA TGT GTA

show an iteration of the Clique Search algorithm.

4.  Given the settings n=4, d=2, constraints HD, GC, RC, and the following code:

TGGT GACC CGAA TCTC CGTT

calculate its measure of infeasibility Inf(W) according to the definition given in slide 75 (Iterative Greedy Search)
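To check hand-worked answers to the construction exercises (items 1 and 2 above), a greedy pass over all words in a fixed order can be sketched as follows. This assumes that the Construction Heuristic scans candidate words in the given order and keeps each word that is compatible with those already kept; if the heuristic presented earlier differs, the sketch does not apply. The alphabet order A < C < G < T, the GC rule (exactly `w` bases are G or C) and the function name `greedy_code` are likewise assumptions.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def reverse_complement(x: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(x))


def greedy_code(n: int, d: int, w: int, use_rc: bool = False,
                alphabet: str = "ACGT") -> list:
    """Scan all words of length n in the order induced by `alphabet`
    and keep each word that is compatible with the words kept so far."""
    code = []
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        # GC constraint (assumed form): exactly w bases are G or C.
        if sum(b in "GC" for b in word) != w:
            continue
        # HD constraint against every codeword already kept.
        if any(hamming(word, c) < d for c in code):
            continue
        if use_rc:
            rc = reverse_complement(word)
            # RC constraint against the word itself and every kept codeword.
            # hamming(c, rc(word)) equals hamming(word, rc(c)), so one check suffices.
            if hamming(word, rc) < d or any(hamming(c, rc) < d for c in code):
                continue
        code.append(word)
    return code


example = greedy_code(n=2, d=2, w=1)
print(example)  # ['AC', 'CA', 'GT', 'TG']
```

Passing `alphabet="TGCA"` scans the words in inverse lexicographic order, since `itertools.product` then enumerates them from TT… down to AA….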

128

Exercises 3

1.  Write the rotation neighbourhood of codeword CATGA.

2.  Write 5 of the codewords of the 3-exchange neighbourhood of codeword CATGA.

3.  Write 5 of the codewords of the random neighbourhood of codeword CATGA.

4.  Write 5 of the codewords of the 2-exchange + random neighbourhood of codeword CATGA.

5.  Consider the SLS method described from slide 119 on, with input parameters n=4, d=3, and constraints HD, GC, RC.

At a given iteration we have the following code L

CTTC GGTT GTCA AGGA ACTG TTGG

and the selected random codeword is TTGC.

Write down code L’ (we do not care if it will be accepted or not)
