Detection & analysis of DNA sequences
By: Mohammad Adil
Roll Number 137920
Under the guidance of: Dr. Raju Bhukya
Assistant Professor, NIT Warangal
Contents
● Introduction
● Repeats
● Problem definition
● Detection of DNA sequences
● Analysis of DNA sequences
● Results
● Conclusion
● References
Introduction
● DNA resides in the nucleus of the cells of an organism.
● Analysis of DNA can describe the relationship between species.
● DNA is made up of a long double-stranded helix.
● The four bases are Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
Repeats
● A pattern of nucleotides that occurs more than once in a sequence.
● Repeats are important genetic markers for disease diagnosis.
● Tandem repeats: repeats that occur at contiguous locations.
Repeats....
➢ Perfect repeat:
➔ Perfect microsatellites contain only pure repeats, i.e. 100% identical copies.
➔ Ex: (AC)6 or ACACACACACAC,
➔ (ATG)4 or ATGATGATGATG,
➔ (CTTA)4 or CTTACTTACTTACTTA
➢ Imperfect repeat:
➔ Some copies contain a mismatched or inserted nucleotide, so the copies are not 100% identical.
➔ Ex: ACACACATACAC, ATGCATGCTATGCATGCATGC
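The compact notation above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function names `expand_repeat` and `is_perfect_repeat` are hypothetical helpers introduced only for illustration.

```python
def expand_repeat(unit, copies):
    """Expand compact notation such as (AC)6 into the full sequence."""
    return unit * copies


def is_perfect_repeat(seq, unit):
    """Return True if seq consists only of whole, identical copies of unit."""
    if not unit or len(seq) % len(unit) != 0:
        return False
    return seq == unit * (len(seq) // len(unit))


print(expand_repeat("AC", 6))                    # ACACACACACAC
print(is_perfect_repeat("ACACACACACAC", "AC"))   # True
print(is_perfect_repeat("ACACACATACAC", "AC"))   # False (one mismatch)
```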
Problem definition
● Use the Tandem Repeats Finder (TRF) to identify large tandem repeats.
● Report the starting and ending positions of each tandem repeat.
● Report the percentage of each base (A%, C%, G%, T%) in the tandem repeat.
Tandem repeat finder outline:
The tandem repeat finder program has 2 main components: detection and analysis.
● Detection: finds candidate tandem repeats.
● Analysis: produces an alignment for each candidate and statistics about the alignment.
Detection component
● The algorithm looks for matching nucleotides separated by a common distance d, which is not specified in advance. For reasons of efficiency it looks for runs of k matches, which we call k-tuple matches.
● Matching k-tuples are two windows with identical contents; if aligned in the Bernoulli model, they would produce a run of k heads.
● Because we limit ourselves to k-tuple matches, we will not detect all matching characters.
● Ex: if k = 6 and two windows contain TCATGT and TCTTGT, we will not know that there are 5 matching characters because the window contents are not identical.
● Put in terms of the Bernoulli model, the aligned windows would be represented by the sequence HHTHHH, which is not a run of 6 heads.
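The k-tuple idea can be sketched in a few lines. This is an illustrative simplification rather than the program's actual data structures: it records, for every distance d, the positions where a k-tuple matches an earlier identical k-tuple. The names `heads_tails` and `ktuple_match_distances` are hypothetical.

```python
from collections import defaultdict


def heads_tails(window_a, window_b):
    """Encode an aligned pair of windows as H (match) / T (mismatch)."""
    return "".join("H" if a == b else "T" for a, b in zip(window_a, window_b))


def ktuple_match_distances(seq, k):
    """Map each distance d to the positions i whose k-tuple equals the one at i - d."""
    seen = defaultdict(list)        # k-tuple -> earlier start positions
    distances = defaultdict(list)   # distance d -> matching positions i
    for i in range(len(seq) - k + 1):
        tup = seq[i:i + k]
        for j in seen[tup]:
            distances[i - j].append(i)
        seen[tup].append(i)
    return dict(distances)


print(heads_tails("TCATGT", "TCTTGT"))            # HHTHHH
print(ktuple_match_distances("ACGTACGTACGT", 3))  # matches pile up at d = 4 and d = 8
```

Distances that collect many matches are the candidates for a repeat period.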
Detection component....
Statistical criteria
● The statistical criteria are based on runs of heads in Bernoulli sequences.
● The criteria are based on four distributions, which depend upon:
– The matching probability
– The indel probability
– The tuple size
– The pattern length
Sum of head distribution
● This distribution indicates how many matches are required.
● Let the random variable $R_{d,k,p_m}$ = the total number of heads in head runs of length k or longer in an iid Bernoulli sequence of length d with success probability $p_m$.
● The distribution of $R_{d,k,p_m}$ is well approximated by the normal distribution.
● For the sum of heads criterion, we use the normal distribution to determine the largest number x such that 95% of the time $R_{d,k,p_m} \geq x$.
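One way to see how such a threshold could be obtained is by simulation. The sketch below is an assumption-laden illustration, not the program's method: it estimates the mean and standard deviation of the sum of heads by Monte Carlo sampling and then applies the one-sided normal cutoff (mean minus 1.645 standard deviations). The function names, trial count, and seed are all hypothetical.

```python
import math
import random
import statistics


def heads_in_runs(flips, k):
    """Total number of heads that lie in runs of at least k consecutive heads."""
    total = run = 0
    for flip in flips:
        if flip:
            run += 1
        else:
            if run >= k:
                total += run
            run = 0
    if run >= k:            # close a run that reaches the end of the sequence
        total += run
    return total


def sum_of_heads_threshold(d, k, p_m, trials=2000, seed=0):
    """Estimate x such that roughly 95% of samples have at least x heads in k-runs."""
    rng = random.Random(seed)
    samples = [
        heads_in_runs([rng.random() < p_m for _ in range(d)], k)
        for _ in range(trials)
    ]
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    # One-sided 5% point of a normal distribution: mean - 1.645 * sd.
    return max(0, math.floor(mean - 1.645 * sd))


print(sum_of_heads_threshold(d=100, k=5, p_m=0.8))
```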
Random walk distribution
● This distribution describes how distances between matches may vary due to indels, because indels change the distance between matching k-tuples.
Random walk distribution...
● In order to test the sum of heads criterion, we count the matches in $D_{d \pm \Delta d}$ for $\Delta d = 0, 1, \ldots, \Delta d_{max}$.
● Indels are single-nucleotide events occurring with probability $p_i$.
● Let the random variable $W_{d,p_i}$ = the maximum displacement from the origin of a one-dimensional random walk with expected number of steps equal to $p_i \cdot d$.
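The random walk can be simulated to get a feel for how large the distance window needs to be. This is a rough sketch under stated assumptions: each of the d positions triggers a step of +1 or -1 with probability p_i (so the expected number of steps is p_i times d), and the 95% coverage level is chosen to mirror the other criteria. The function names are hypothetical.

```python
import random


def max_displacement(d, p_i, rng):
    """Maximum |displacement| of a walk in which each of d positions steps with probability p_i."""
    position = best = 0
    for _ in range(d):
        if rng.random() < p_i:              # an indel occurs at this position
            position += rng.choice((-1, 1))
            best = max(best, abs(position))
    return best


def estimate_delta_d_max(d, p_i, trials=2000, seed=0, coverage=0.95):
    """Empirical displacement bound that covers the requested fraction of simulated walks."""
    rng = random.Random(seed)
    samples = sorted(max_displacement(d, p_i, rng) for _ in range(trials))
    return samples[min(trials - 1, int(coverage * trials))]


print(estimate_delta_d_max(d=100, p_i=0.1))
```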
Apparent size distribution
● This distribution is used to distinguish between tandem repeats and non-tandem direct repeats.
Apparent size distribution...
● $S_{d,k,p_m}$ = the distance between the first and last run of k heads in an iid Bernoulli sequence of length d with success probability $p_m$.
● $S_{d,k,p_m}$ is the apparent size of the repeat when using k-tuples to find the matches.
● From the distribution we determine the maximum number y such that 95% of the time $S_{d,k,p_m} > y$ (we use y as our apparent size criterion).
Waiting time distribution
● It is used to pick tuple sizes.
● Increasing the tuple size causes an exponential decrease in the expected number of tuple matches.
● If nucleotides occur with equal frequency, increasing the tuple size by $\Delta k$ increases the average distance between randomly matching tuples by a factor of $4^{\Delta k}$.
● If k=5, the average distance between random matches is ~1 kb.
● If k=7, the average distance is ~16 kb.
● Increasing the tuple size decreases the chance of noticing approximate copies.
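The arithmetic behind these figures is short enough to check directly. The sketch below assumes four equally frequent nucleotides, so a random k-tuple matches another with probability 1/4^k and the average distance between random matches is about 4^k; the function name is hypothetical.

```python
def expected_random_match_distance(k, alphabet_size=4):
    """Average distance between random k-tuple matches, assuming equal base frequencies."""
    return alphabet_size ** k


for k in range(3, 8):
    print(k, expected_random_match_distance(k))
# k = 5 gives 1024 (about 1 kb); k = 7 gives 16384 (about 16 kb)
```

Going from k=5 to k=7 is an increase of 2, which multiplies the average distance by 4^2 = 16.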
Analysis component
● If the information in the distance list passes the criteria tests, a candidate pattern is drawn from the sequence and aligned with the surrounding sequence.
➢ Analysis components:
➔ Multiple reporting of repeats at different pattern sizes
➔ Narrow band alignment
➔ Consensus pattern and period size
Analysis component.......
Multiple reporting of repeats at different pattern sizes:
● When a single tandem repeat contains many copies, several pattern sizes are possible.
● Ex: if the basic pattern size is 26, then the repeat may be reported at sizes 26, 52, 78.
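The reason several sizes are possible is that a sequence with period 26 also matches itself when shifted by 52 or 78. The sketch below illustrates this with a made-up 26-base unit (not taken from the slides) and a hypothetical helper `match_fraction_at_distance`.

```python
def match_fraction_at_distance(seq, d):
    """Fraction of positions i where seq[i] equals seq[i + d]."""
    pairs = len(seq) - d
    if d <= 0 or pairs <= 0:
        raise ValueError("d must be positive and shorter than the sequence")
    matches = sum(seq[i] == seq[i + d] for i in range(pairs))
    return matches / pairs


unit = "ACGTTGCAAGCTTAGGCATCGATACG"   # arbitrary 26-base unit, for illustration only
repeat = unit * 5

for d in (26, 52, 78, 27):
    print(d, round(match_fraction_at_distance(repeat, d), 2))
```

Multiples of the basic size all score a perfect match fraction, while an unrelated distance such as 27 does not.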
Narrow band alignment:
● To decrease running time, we limit WDP (wraparound dynamic programming) to a narrow diagonal band in the alignment matrix for patterns larger than 20 characters.
● The band is periodically recentred around a run of matches in the current best alignment.
Consensus pattern and period size
● An initial candidate pattern P is drawn from the sequence, but this is usually not the best pattern to align with the tandem repeat.
● A consensus pattern is derived from the alignment of the copies with P.
● The consensus is used to realign the sequence, and this final alignment is reported in the output.
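A simple way to picture a consensus is a column-by-column majority vote over the aligned copies. The sketch below assumes ungapped copies of equal length and a plain majority rule; the slides do not spell out how the consensus is computed, so treat this as an illustration only. The function name is hypothetical.

```python
from collections import Counter


def majority_consensus(copies):
    """Column-wise majority vote over equal-length, ungapped copies."""
    if not copies:
        raise ValueError("at least one copy is required")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("all copies must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*copies)
    )


print(majority_consensus(["ATGC", "ATGC", "ATTC"]))  # ATGC
```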
Result
● Input to the program consists of a sequence file and the following parameters:
➢ alignment weights for match, mismatch and indels;
➢ $p_m$ and $p_i$ (probabilities of match and indel);
➢ a minimum size for patterns to report;
➢ a minimum alignment score to report.
Result....
● We used it to analyze four sequences:
➢ The human frataxin gene sequence
➢ The human β T cell receptor locus sequence
➢ Two yeast chromosomes (I and VIII)
➔ The frataxin gene sequence and the human β T cell receptor sequence were obtained from GenBank.
➔ The yeast chromosome sequences were obtained via ftp from ftp.ebi.ac.uk, directory pub/databases/yeast, in files chri_230209.ascii and chrviii_562638.ascii.
Result....
Human frataxin gene (Friedreich ataxia), intron 1:
● Friedreich ataxia is one of the triplet repeat diseases.
● It is caused by copy number expansion of the triplet GAA in the first intron of the frataxin gene.
● The table lists the repeats found in the sequence.
➔ Besides the triplet repeat, our program found two others which were apparently unknown, a 44 bp pattern and a 14 bp pattern.
Result....
LOCUS: HSFRDA1
DEFINITION: Human frataxin (FRDA) gene, promoter region and exon 1
ACCESSION: U43748
Period size: 44  Copy number: 2.0
1787 GGATCCCTTCCGAGTGGCT
  25 GGATCCCTTCAGAGTGGCT
1806 GGTACGCCGCCTGTANTATGGGAGAGGATCCCTTCAGAGTGGCT
   0 GGTACGCCGCATGTATTAGGGGAGAGGATCCCTTCAGAGTGGCT
1850 GGTACGCCGCATGTATTAGGGGAGA
   0 GGTACGCCGCATGTATTAGGGGAGA
Summary:
Matches: 40, Mismatches: 4, Indels: 0
91% 9% 0%
Matches are distributed among these distances: 44 40 1.00
ACGT count: A: 0.18 C: 0.23 G: 0.35 T: 0.23
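The match and mismatch counts in the summary can be reproduced from the sequence rows of the alignment. The sketch below concatenates the three sequence rows shown above (positions 1787 to 1874) and compares each base with the base 44 positions later. It assumes the ambiguous base N counts as a mismatch; the function name is hypothetical.

```python
# Sequence rows copied from the alignment (positions 1787-1874).
ROWS = (
    "GGATCCCTTCCGAGTGGCT",
    "GGTACGCCGCCTGTANTATGGGAGAGGATCCCTTCAGAGTGGCT",
    "GGTACGCCGCATGTATTAGGGGAGA",
)


def count_matches_at_distance(seq, d):
    """Count matches and mismatches between seq[i] and seq[i + d]."""
    pairs = len(seq) - d
    matches = sum(seq[i] == seq[i + d] for i in range(pairs))
    return matches, pairs - matches


sequence = "".join(ROWS)
matches, mismatches = count_matches_at_distance(sequence, 44)
print(len(sequence), matches, mismatches)           # 88 40 4
print(round(100 * matches / (matches + mismatches)))  # 91
```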
Result....
Human Frataxin Gene (Friedreich Ataxia, Intron I)
Indices    Period size  Copy no  Consensus size  %matches  %indel  Score  %A  %C  %G  %T  Entropy
822-854    14           2.4      14              89        0       57     6   48  42  0   1.28
1787-1874  44           2.0      44              90        0       140    18  22  35  22  1.95
2183-2211  3            9.7      3               100       0       158    68  0   31  0   0.89
Tandem repeats detected in the human frataxin gene intron
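The base percentages and the entropy column can be illustrated with a short calculation. This sketch assumes the entropy is the Shannon entropy, in bits, of the base composition (between 0 and 2 for four bases); the slides do not state the exact formula, and the function names are hypothetical.

```python
import math
from collections import Counter


def base_composition(seq):
    """Fraction of A, C, G and T among the unambiguous bases of seq."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def shannon_entropy(fractions):
    """Shannon entropy in bits; zero fractions contribute nothing."""
    return -sum(p * math.log2(p) for p in fractions if p > 0)


composition = base_composition("GAAGAAGAA")   # three copies of the GAA triplet
print({base: round(value, 2) for base, value in composition.items()})
print(round(shannon_entropy(composition.values()), 2))
```

A repeat built from a single base has entropy 0, and one with all four bases in equal proportion has entropy 2.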
Conclusion
● We have presented a new algorithm for finding tandem repeats in DNA sequences without the need to specify either the pattern or the pattern size.
● The algorithm is based on the detection of k-tuple matches.
● It uses a probabilistic model of tandem repeats and a collection of statistical criteria based on that model.
● We have demonstrated the speed and utility of the algorithm by analyzing four sequences ranging in size up to 700 kb.
References
● Huntington’s Disease Collaborative Research Group. (1993) Cell, 72, 971–983.
● Fu,Y.-H., Pizzuti,A., Fenwick,J., King,R.G.Jr. and Rajnarayan,S., Dunne,P.W., Dubel,J., Nasser,G.A., Ashizawa,T., DeJong,P., Wieringa,B., Korneluk,R., Perryman,M.B., Epstein,H.F. and Caskey,C.T. (1992) Science, 255, 1256–1258.
● La Spada,A., Wilson,E., Lubahn,D., Harding,A. and Fischbeck,K. (1991) Nature, 352, 77–79.
● Campuzano,V., Montermini,L., Molto,M.D., Pianese,L. and Cossee,M. (1996) Science, 271, 1423–1427.
● Wells,R. (1996) J. Biol. Chem., 271, 2875–2878.
● Weitzmann,M., Woodford,K. and Usdin,K. (1997) J. Biol. Chem., 272, 9517–9523.
● Hamada,H., Seidman,M., Howard,B. and Gorman,C. (1984) Mol. Cell. Biol., 4, 2622–2630.
Thanks