1
FireμSat: An Algorithm to Detect Tandem Repeats in DNA
Corne de Ridder (a), Derrick G. Kourie (b), Bruce W. Watson (b)
(a) School of Computing, University of South Africa, Pretoria 0003, South Africa
(b) Fastar Research Group, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa
2
Introduction
• What are tandem repeats in DNA?
• How are we going to detect tandem repeats in DNA?
• Why would anybody want to detect tandem repeats in DNA?
3
Genetic sequences
• DNA consists of four different nucleotides, namely:
Adenine (A) Guanine (G)
Cytosine (C) Thymine (T)
• Genetic databanks, e.g. GenBank, EMBOSS and Entrez, store DNA sequences as concatenated single-letter codes in FASTA format.
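To make "concatenated single-letter codes in FASTA format" concrete, here is a minimal sketch of reading FASTA text in Python. The function name `parse_fasta` and the sample record are illustrative assumptions and do not come from the slides.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    A header line starts with '>'; the sequence lines that follow it
    are concatenated into one upper-case string.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record, made up for illustration only.
sample = ">example_sequence\nACGACGACG\nACGACG\n"
sequences = parse_fasta(sample)
```

The resulting string per record plays the role of the genetic sequence that is later searched for tandem repeats.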
4
Tandem Repeats (TR’s) in genome sequences
• DNA molecules are subject to numerous mutational events. One of the consequences of these events that can be detected by computationally analyzing genome sequences is tandem duplication.
• A TR or TR-zone is a string of nucleotides characterized by a certain motif that introduces the string, contiguously followed by a number of ‘copies’ of the motif, e.g. ACGACGACGACGACG.
5
Tandem Repeats
• A perfect tandem repeat (PTR) has exact copies, e.g. ACGACGACGACGACG, which is five copies of the motif ACG.
• An approximate tandem repeat (ATR) includes non-exact copies of the motif, so mutational events have most likely occurred, e.g. ACGACACGAGGACGAG.
• In the absence of further qualification, reference to a tandem repeat should be construed as a reference to either a PTR or an ATR.
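The PTR definition can be illustrated with a small Python sketch that counts exact, contiguous copies of a known motif at the start of a string. This is a naive check for illustration only, not the FireμSat algorithm; the function name `count_perfect_copies` is an assumption.

```python
def count_perfect_copies(sequence, motif):
    """Count the exact, contiguous copies of motif at the start of sequence."""
    if not motif:
        raise ValueError("motif must be non-empty")
    copies = 0
    step = len(motif)
    # Advance one motif length at a time while an exact copy is found.
    while sequence.startswith(motif, copies * step):
        copies += 1
    return copies


ptr_copies = count_perfect_copies("ACGACGACGACGACG", "ACG")
atr_copies = count_perfect_copies("ACGACACGAGGACGAG", "ACG")
```

On the PTR example the count reaches five, while on the ATR example the exact matching stops after the first copy, which is why approximate matching needs extra machinery.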
6
Tandem Repeat Elements
• A PTR element (PTRE) is a TR element that matches the motif. If the motif is, for example, ACG then the PTRE will also be ACG.
• An ATR element (ATRE) is a TR element similar to the motif but not an exact copy thereof. If the motif is ACG then an ATRE may, for example, be AC.
7
Microsatellites
• The length of PTRE’s may vary, giving satellites, minisatellites and microsatellites
• Microsatellites are a subset of TR’s (conforming to Benson, Delgrange, Rivals & Abajian):
  2 ≤ |motif| ≤ 5
8
Formal problem statement
A PTR whose motif ρ is repeated p times, where p ≥ 1, is denoted by ρ^p. An ATR u derived from this PTR ρ^p must always have the motif ρ as its prefix. It therefore has the form ρ u_2 … u_p, where each ATRE u_k (k = 2 … p) is the result of at most ε mutations on ρ. Here ε is the so-called motif error.
Besides the restrictions on the motif error, threshold values are introduced that constrain the attributes of the detected TR.
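The form ρ u_2 … u_p can be checked with a short Python sketch, assuming the TR has already been split into its elements (finding that split is the hard part and is not shown). "At most ε mutations" is interpreted here as Levenshtein edit distance, which is an assumption; the names `edit_distance` and `is_valid_atr` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of deletions, insertions and mismatches."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or mismatch
        previous = current
    return previous[-1]


def is_valid_atr(elements, motif, epsilon):
    """True if the first element is the motif and every later one is within epsilon edits."""
    if not elements or elements[0] != motif:
        return False
    return all(edit_distance(motif, u) <= epsilon for u in elements[1:])


# One possible segmentation of ACGACACGAGGACGAG (an assumption for illustration).
example_ok = is_valid_atr(["ACG", "AC", "ACG", "AGG", "ACG", "AG"], "ACG", 1)
```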
9
Tolerated error types
• Errors regarding the motif or PTRE (motif errors):
  • deletions
  • mismatches
  • insertions
• Errors related to the detected TR (TR errors):
  • in terms of the ratio between PTRE’s and ATRE’s
  • the minimum number of TRE’s to be reported
  • the maximum number of consecutive ATRE’s
10
Motif errors
Maximum of 50% error tolerance
• If |ρ| = 2 or |ρ| = 3 then ε = 0 or ε = 1 (default = 1)
• If |ρ| = 4 or |ρ| = 5 then ε = 0, ε = 1 or ε = 2 (default = 2)
Consider the motif ACGTT: ACT is then an ATRE in which two deletions have occurred.
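The permitted motif-error values can be sketched in Python as follows. Rounding 50% of the motif length down is inferred from the values listed above rather than stated explicitly, and the function name `allowed_motif_errors` is an assumption.

```python
def allowed_motif_errors(motif_length):
    """Return the permitted motif-error values for a motif of length 2 to 5."""
    if not 2 <= motif_length <= 5:
        raise ValueError("motif length is assumed to lie between 2 and 5")
    # Assumption: at most 50% of the motif may be in error, rounded down.
    max_error = motif_length // 2
    return list(range(max_error + 1))


# The default in each case is the largest permitted value.
defaults = {length: allowed_motif_errors(length)[-1] for length in range(2, 6)}
```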
11
Motif errors: Types of Mutations
• Deletion: refers to the absence of a base pair in the motif.
• Insertion: an ATRE with up to ε base pairs inserted into any position of the PTRE.
• Mismatch: refers to the replacement of a base pair in the motif by another.
12
Detected TR errors: the substring error
• The substring error must not exceed the maximum substring error allowed, where
  substring error = (n_d × p_d) + (n_i × p_i) + (n_m × p_m) − n_ptre
  and
  n_d: number of deletions
  n_i: number of insertions
  n_m: number of mismatches
  p_d: penalty allocated to deletions
  p_i: penalty allocated to insertions
  p_m: penalty allocated to mismatches
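The substring-error formula translates directly into Python. In this sketch n_ptre is assumed to be the number of PTRE’s in the substring, the penalty defaults of 1 are placeholders rather than values from the slides, and the function name is hypothetical.

```python
def substring_error(n_d, n_i, n_m, n_ptre, p_d=1, p_i=1, p_m=1):
    """Penalised count of mutations, reduced by the number of PTRE's.

    The default penalties of 1 are assumptions chosen for illustration.
    """
    return (n_d * p_d) + (n_i * p_i) + (n_m * p_m) - n_ptre


# Reading ACGACACGAGGACGAG as ACG AC ACG AGG ACG AG (an assumed segmentation):
# two deletions (AC, AG), one mismatch (AGG), three PTRE's (ACG).
example_error = substring_error(n_d=2, n_i=0, n_m=1, n_ptre=3)
```

A detected TR would then be kept only if this value does not exceed the maximum substring error allowed.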
13
Detected TR errors: the minimum number of TRE’s
• tn_tre = tn_ptre + tn_atre
• tn_tre must reach the minimum number of TRE’s before a TR is reported
• the default value for this minimum is 2
• this threshold prevents the output of unwanted data
14
Detected TR errors: the maximum number of consecutive ATRE’s
• tn_atreC must not exceed the maximum number of consecutive ATRE’s allowed
• tn_atreC is incremented for every ATRE read
• tn_atreC is set to zero whenever a PTRE is read
• the default of tn_atreC is 0
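The two counting thresholds (minimum number of TRE’s and maximum number of consecutive ATRE’s) can be sketched together in Python. The input is assumed to be an already classified run of elements, 'P' for a PTRE and 'A' for an ATRE; the function name and this encoding are illustrative assumptions.

```python
def check_tr_thresholds(element_kinds, max_consecutive_atre, min_tre=2):
    """Check a run of elements ('P' = PTRE, 'A' = ATRE) against both thresholds."""
    tn_ptre = 0
    tn_atre = 0
    tn_atre_c = 0  # consecutive ATRE counter
    for kind in element_kinds:
        if kind == "P":
            tn_ptre += 1
            tn_atre_c = 0  # reset whenever a PTRE is read
        else:
            tn_atre += 1
            tn_atre_c += 1  # incremented for every ATRE read
            if tn_atre_c > max_consecutive_atre:
                return False
    tn_tre = tn_ptre + tn_atre
    return tn_tre >= min_tre


alternating_ok = check_tr_thresholds("PAPAPA", max_consecutive_atre=1)
```

How the real algorithm reacts when a threshold is violated (for example, by ending the current TR) is not specified here; this sketch only reports whether the run satisfies both constraints.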
15
Deletion
Refers to the absence of a base pair in the motif.
FAD(ACG,1)
16
Mismatch
Refers to the replacement of a base pair in the motif by another.
FAm(ACG,1)
17
High-level Description of FireμSat
• generateWords(ρ,ε) generates the set of all words of length ρLength over the alphabet Σ = {A,C,G,T}.
• createFATR(ρ,ε) returns FATR(ρ,ε) as discussed.
• findIndices(gSeq, FATR, τ, α, β, p_m, p_d, p_i) returns a set of index pairs in gSeq of an identified TR.
• The TR is such that it complies with the constraints specified by τ, α, β. Various counters have to be updated to ensure correct output.
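The word-generation step can be sketched in Python with `itertools.product`. This only illustrates "all words of a given length over Σ"; the signature differs from the slide's generateWords(ρ,ε), and the name `generate_words` is an assumption.

```python
from itertools import product

SIGMA = "ACGT"


def generate_words(length):
    """Return the set of all words of the given length over SIGMA."""
    return {"".join(letters) for letters in product(SIGMA, repeat=length)}


words_of_length_3 = generate_words(3)
```

The set grows as 4 to the power of the word length, which stays small for the short motif lengths considered for microsatellites.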
18
Why does anybody want to detect TR’s in DNA?
• The cause of several human diseases can be traced to having too many copies of a certain nucleotide triplet.
• TR’s play a role in the development of immune system cells.
• TR’s serve as genetic markers in plant and animal species.
• Tandem repeats play a role in gene regulation and contribute to the breeding of disease resistant cultivars.
19
Conclusion
A new theoretical approach to detect TR’s in DNA has been introduced. The time complexity of FireµSat is linear in |gSeq|.
The practical implementation of FireµSat is in progress. The following matters constitute a future research agenda:
• the performance of FireµSat
• the possibility of reducing FATR
• whether the results, if successful, suggest ways of adapting FireµSat to detect minisatellites and satellites as well.