137920

Detection & analysis of DNA sequences

By: Mohammad Adil

Roll Number 137920

Under the guidance of: Dr. Raju Bhukya

Assistant Professor, NIT Warangal

Contents

● Introduction

● Repeats

● Problem definition

● Detection of DNA sequences

● Analysis of DNA sequences

● Results

● Conclusion

● References

Introduction

● DNA resides in the nucleus of the cells of an organism.

● Analysis of DNA can describe the relationship between species.

● DNA is a long double-stranded helix built from four nucleotide bases.

● Adenine: A, Cytosine: C, Guanine: G, Thymine: T

Repeats

● A repeat is a pattern of nucleotides that occurs more than once in a sequence.

● Repeats are important genetic markers for disease diagnosis.

● Tandem repeats: repeats that occur at contiguous locations.

Repeats....

➢ Perfect repeat:

➔ Perfect microsatellites contain only pure repeats, i.e. 100% identical copies.

➔ Ex: (AC)6 or ACACACACACAC,

➔ (ATG)4 or ATGATGATGATG,

➔ (CTTA)4 or CTTACTTACTTACTTA

➢ Imperfect repeat:

➔ The copies are not all identical; some nucleotides are mismatched.

➔ Ex: ACACACATACAC, ATGCATGCTATGCATGCATGC
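The notation and the two repeat classes above can be illustrated with a small Python sketch. The function names (`expand`, `is_perfect_repeat`, `mismatch_positions`) are hypothetical helpers introduced only for this illustration, and the sketch handles substitutions only, not insertions or deletions.

```python
import re


def expand(notation):
    """Expand a string such as '(AC)6' into 'ACACACACACAC'."""
    m = re.fullmatch(r"\(([ACGT]+)\)(\d+)", notation)
    if m is None:
        raise ValueError(f"not in (UNIT)n form: {notation!r}")
    unit, copies = m.group(1), int(m.group(2))
    return unit * copies


def is_perfect_repeat(seq, unit):
    """True if seq consists of whole, identical copies of unit."""
    copies, remainder = divmod(len(seq), len(unit))
    return remainder == 0 and seq == unit * copies


def mismatch_positions(seq, unit):
    """0-based positions where seq differs from a pure repeat of unit."""
    return [i for i, base in enumerate(seq) if base != unit[i % len(unit)]]


if __name__ == "__main__":
    print(expand("(AC)6"))
    print(is_perfect_repeat("ATGATGATGATG", "ATG"))
    print(mismatch_positions("ACACACATACAC", "AC"))
```

For the imperfect example ACACACATACAC, the sketch reports a single substitution relative to a pure (AC)6 repeat.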

Problem definition

● Use TRF (Tandem Repeats Finder) to identify large tandem repeats.

● Report the starting and ending positions of each tandem repeat.

● Report the percentage of each base (A%, C%, G%, T%) in the tandem repeat.
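The base percentages mentioned above could be computed as in the following sketch. The function name `base_percentages` is an assumption made for this example and is not part of TRF itself; characters other than A, C, G and T are ignored here.

```python
from collections import Counter


def base_percentages(repeat):
    """Return the percentage of A, C, G and T in a repeat region."""
    counts = Counter(repeat.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: 100.0 * counts[b] / total for b in "ACGT"}


if __name__ == "__main__":
    # (CTTA)4 from the perfect-repeat examples
    print(base_percentages("CTTACTTACTTACTTA"))
```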

Tandem Repeats Finder outline:

The Tandem Repeats Finder program has 2 main components: detection and analysis.

Detection - finds candidate tandem repeats.

Analysis - produces an alignment for each candidate and statistics about the alignment.

Detection component

● The algorithm looks for matching nucleotides separated by a common distance d, which is not specified in advance. For reasons of efficiency it looks for runs of k matches, which we call k-tuple matches.

● Matching k-tuples are two windows with identical contents; if aligned in the Bernoulli model (head = match, tail = mismatch), they would produce a run of k heads.

● Because we limit ourselves to k-tuple matches, we will not detect all matching characters.

Ex: if k = 6 and two windows contain TCATGT and TCTTGT we will not know that there are 5 matching characters because the window contents are not identical.

Put in terms of the Bernoulli model, the aligned windows would be represented by the sequence HHTHHH, which is not a run of 6 heads.
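A minimal Python sketch of the two ideas above: the head/tail view of two aligned windows, and the search for matching k-tuples together with the distance d between them. This is a simplification for illustration: each k-tuple is compared only with its most recent earlier occurrence, and the names `heads_tails` and `ktuple_distances` are hypothetical.

```python
def heads_tails(window_a, window_b):
    """Encode two aligned windows as H (match) / T (mismatch)."""
    return "".join("H" if x == y else "T" for x, y in zip(window_a, window_b))


def ktuple_distances(seq, k):
    """Return (position, distance) for every k-tuple that exactly matches
    its most recent earlier occurrence in seq."""
    last_seen = {}  # k-tuple -> start position of its latest occurrence
    hits = []
    for i in range(len(seq) - k + 1):
        tup = seq[i:i + k]
        if tup in last_seen:
            hits.append((i, i - last_seen[tup]))
        last_seen[tup] = i
    return hits


if __name__ == "__main__":
    # The k = 6 example: 5 matching characters, but no 6-tuple match.
    print(heads_tails("TCATGT", "TCTTGT"))
    # A pure repeat of ACG gives 3-tuple matches at distance d = 3.
    print(ktuple_distances("ACGACGACG", 3))
```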

Detection component....

Statistical criteria

● The statistical criteria are based on runs of heads in Bernoulli sequences.

● The criteria are based on four distributions, which depend upon:

– The matching probability

– The indel probability

– The tuple size

– The pattern length

Sum of heads distribution

● This distribution indicates how many matches are required.

● Let the random variable $R_{d,k,p_m}$ = the total number of heads in head runs of length k or longer in an iid Bernoulli sequence of length d with success probability $p_m$.

● The distribution of $R_{d,k,p_m}$ is well approximated by the normal distribution.

● For the sum of heads criterion, we use the normal distribution to determine the largest number x such that 95% of the time $R_{d,k,p_m} \ge x$.
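The sum of heads criterion can be explored numerically. The sketch below estimates the threshold x by simulation instead of the normal approximation named above, so it is an illustration of the definition, not the method used by the program. The function names, the number of trials and the seed are assumptions.

```python
import random


def heads_in_runs(flips, k):
    """Total number of heads that lie in head runs of length >= k."""
    total = run = 0
    for is_head in flips:
        if is_head:
            run += 1
        else:
            if run >= k:
                total += run
            run = 0
    if run >= k:  # a run may reach the end of the sequence
        total += run
    return total


def sum_of_heads_threshold(d, k, p_m, trials=2000, seed=0):
    """Empirical x such that about 95% of simulated R values are >= x."""
    rng = random.Random(seed)
    samples = sorted(
        heads_in_runs([rng.random() < p_m for _ in range(d)], k)
        for _ in range(trials)
    )
    return samples[int(0.05 * trials)]


if __name__ == "__main__":
    print(sum_of_heads_threshold(d=100, k=5, p_m=0.8))
```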

Random walk distribution

● This distribution describes how distances between matches may vary due to indels, because indels change the distance between matching k-tuples.

Random walk distribution...

● In order to test the sum of heads criterion, we count the matches in the distance lists $D_{d \pm \Delta d}$ for $\Delta d = 0, 1, \ldots, \Delta d_{max}$.

● Indels are treated as single-nucleotide events occurring with probability $p_i$.

● Let the random variable $W_{d,p_i}$ = the maximum displacement from the origin of a one-dimensional random walk with expected number of steps equal to $p_i \cdot d$.
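The random walk above can be simulated to get a feel for how far distances may drift. In this sketch each of the d positions carries an indel with probability p_i, and each indel moves the walk one step left or right with equal chance. Taking the 95th percentile of the maximum displacement as a value for ∆dmax is an assumption made here by analogy with the other criteria; the names, trial count and seed are likewise hypothetical.

```python
import random


def max_displacement(steps):
    """Largest absolute distance from the origin reached by a walk."""
    position = farthest = 0
    for step in steps:
        position += step
        farthest = max(farthest, abs(position))
    return farthest


def delta_d_max(d, p_i, trials=2000, seed=0):
    """Empirical 95th percentile of the maximum displacement W."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # One +1/-1 step for each position that carries an indel.
        steps = [rng.choice((-1, 1)) for _ in range(d) if rng.random() < p_i]
        samples.append(max_displacement(steps))
    samples.sort()
    return samples[int(0.95 * trials) - 1]


if __name__ == "__main__":
    print(delta_d_max(d=100, p_i=0.1))
```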

Apparent size distribution

● This distribution is used to distinguish between tandem repeats and non-tandem direct repeats.

Apparent size distribution...

● $S_{d,k,p_m}$ = the distance between the first and last run of k heads in an iid Bernoulli sequence of length d with success probability $p_m$.

● $S_{d,k,p_m}$ is the apparent size of a repeat when k-tuples are used to find the matches.

● From the distribution we determine the maximum number y such that 95% of the time $S_{d,k,p_m} > y$ (we use y as our apparent size criterion).
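A small sketch of how the apparent size of one head/tail sequence might be measured. The slides do not say exactly how the distance is counted, so this sketch assumes it is the span from the first head of the first qualifying run to the last head of the last qualifying run, inclusive, and returns 0 when there is no run of k heads. The function name is hypothetical.

```python
def apparent_size(flips, k):
    """Span covered by head runs of length >= k (0 if there is none)."""
    first_start = last_end = None
    run_start = None
    # A trailing False acts as a sentinel that closes a final run.
    for i, is_head in enumerate(list(flips) + [False]):
        if is_head:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= k:
                if first_start is None:
                    first_start = run_start
                last_end = i - 1
            run_start = None
    if first_start is None:
        return 0
    return last_end - first_start + 1


if __name__ == "__main__":
    print(apparent_size([0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0], 3))
```

A short apparent size relative to d suggests that the matches are clustered in one place, which is what separates a non-tandem direct repeat from a tandem one under this criterion.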

Waiting time distribution

● It is used to pick tuple sizes.

● Increasing the tuple size causes an exponential decrease in the expected number of tuple matches.

● If nucleotides occur with equal frequency, increasing the tuple size by ∆k increases the average distance between randomly matching tuples by a factor of $4^{\Delta k}$.

● If k = 5, the average distance between random matches is ~1 kb.

● If k = 7, the average distance is ~16 kb.

● Increasing the tuple size also decreases the chance of noticing approximate copies.
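The figures above follow from simple arithmetic: with four equally frequent nucleotides, a given k-tuple matches a random position with probability 1/4^k, so random matches are about 4^k positions apart on average. A short sketch (the function name is an assumption):

```python
def expected_random_match_distance(k, alphabet_size=4):
    """Average spacing between random k-tuple matches, assuming
    all letters of the alphabet are equally frequent."""
    return alphabet_size ** k


if __name__ == "__main__":
    for k in (5, 7):
        print(k, expected_random_match_distance(k))
```

This gives 1024 (about 1 kb) for k = 5 and 16384 (about 16 kb) for k = 7, a factor of 4^2 = 16 for ∆k = 2.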

Analysis component

● If the information in the distance list passes the criteria tests, the candidate repeat is passed on to the analysis component.

➢ Analysis components:

➔ Multiple reporting of a repeat at different pattern sizes

➔ Narrow band alignment

➔ Consensus pattern and period size

Analysis component.......

Multiple reporting of repeat at different pattern size:

● When a single tandem repeat contains many copies, several pattern sizes are possible.

● Ex: if the basic pattern size is 26, then the repeat may be reported at sizes 26, 52, 78.

Narrow band alignment:

● To decrease running time, we limit WDP (wraparound dynamic programming) to a narrow diagonal band in the alignment matrix for patterns larger than 20 characters.

● The band is periodically recentred around a run of matches in the current best alignment.
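The effect of a diagonal band can be sketched with an ordinary global alignment score. This is not wraparound dynamic programming and the band here is fixed, not recentred; it only shows how restricting the matrix to cells with |i - j| <= band reduces the work. The function name and the weights (match +1, mismatch -1, indel -1) are assumptions chosen for the example.

```python
def banded_score(a, b, band, match=1, mismatch=-1, indel=-1):
    """Global alignment score of a and b using only cells within
    `band` of the main diagonal. Returns None if the end cell lies
    outside the band."""
    n, m = len(a), len(b)
    score = {(0, 0): 0}
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0 and (i - 1, j - 1) in score:
                step = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append(score[(i - 1, j - 1)] + step)
            if i > 0 and (i - 1, j) in score:   # gap in b
                candidates.append(score[(i - 1, j)] + indel)
            if j > 0 and (i, j - 1) in score:   # gap in a
                candidates.append(score[(i, j - 1)] + indel)
            score[(i, j)] = max(candidates)
    return score.get((n, m))


if __name__ == "__main__":
    print(banded_score("ACGT", "ACGT", band=1))
    print(banded_score("ACGT", "AGT", band=1))
```

Only about (2 * band + 1) cells per row are filled, so the cost grows with the band width instead of the full pattern length.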

Consensus pattern size

● An initial candidate pattern P is drawn from the sequence, but this is usually not the best pattern to align with the tandem repeat.

● A consensus pattern is therefore derived from the alignment of the copies with P.

● The consensus is used to realign the sequence, and this final alignment is reported in the output.
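One simple way to derive a consensus is a majority vote per column. The sketch below assumes the copies have already been aligned to equal length without gaps, which is a simplification of the alignment-based procedure described above; the function name is hypothetical and ties are broken arbitrarily.

```python
from collections import Counter


def majority_consensus(copies):
    """Column-wise majority vote over equal-length aligned copies."""
    if not copies or len({len(c) for c in copies}) != 1:
        raise ValueError("need at least one copy, all of the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*copies)
    )


if __name__ == "__main__":
    print(majority_consensus(["GGATCC", "GGATCA", "GGTTCC"]))
```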

Result

● Input to the program consists of a sequence file and the following parameters:

➢ alignment weights for match, mismatch and indels;

➢ $p_m$ and $p_i$ (matching probability and indel probability);

➢ a minimum size for patterns to report;

➢ a minimum alignment score to report.

Result....

● We used it to analyze four sequences:

➢ The human frataxin gene sequence

➢ The human β T cell receptor locus sequence

➢ Two yeast chromosomes (I and VIII)

➔ The frataxin gene sequence and the human β T cell receptor sequence were obtained from GenBank.

➔ The yeast chromosome sequences were obtained via ftp from ftp.ebi.ac.uk, directory pub/databases/yeast, in files chri_230209.ascii and chrviii_562638.ascii.

Result....

Human frataxin gene (Friedreich's ataxia), intron 1:

● Friedreich's ataxia is one of the triplet repeat diseases.

● It is caused by copy number expansion of the triplet GAA in the first intron of the frataxin gene.

● The table lists the repeats found in the sequence.

➔ Besides the triplet repeat, our program found two others which were apparently unknown, a 44 bp pattern and a 14 bp pattern.

Result....

LOCUS HSFRDA1

DEFINITION Human frataxin (FRDA) gene, promoter region and exon 1

ACCESSION U43748

Period size: 44 copynumber: 2.0

1787 GGATCCCTTCCGAGTGGCT

25 GGATCCCTTCAGAGTGGCT

1806 GGTACGCCGCCTGTANTATGGGAGAGGATCCCTTCAGAGTGGCT

0 GGTACGCCGCATGTA TTAGGGGAGAGGATCCCTTCAGAGTGGCT

1850 GGTACGCCGCATGTATTAGGGGAGA

0 GGTACGCCGCATGTATTAGGGGAGA

Summary:

Matches: 40, Mismatches: 4, Indels: 0

91% 9% 0%

Matches are distributed among these distances: 44 40 1.00

ACGT count A: 0.18 C: 0.23 G: 0.35 T: 0.23

Result....

Human frataxin gene (Friedreich's ataxia, intron 1)

Indices     Period size  Copy no  Consensus size  %matches  %indel  Score  A   C   G   T   Entropy
822-854     14           2.4      14              89        0       57     6   48  42  0   1.28
1787-1874   44           2.0      44              90        0       140    18  22  35  22  1.95
2183-2211   3            9.7      3               100       0       158    68  0   31  0   0.89

Tandem repeats detected in the human frataxin gene intron
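The Entropy column is not defined on the slides. The values are consistent with the Shannon entropy, in bits, of the base composition, so the sketch below assumes that definition; the function name is hypothetical. The input is the A, C, G, T percentages of one table row, renormalised because the rounded percentages do not always sum to 100.

```python
import math


def composition_entropy(percentages):
    """Shannon entropy in bits of a base composition (A, C, G, T)."""
    total = sum(percentages)
    probs = [p / total for p in percentages if p > 0]
    return -sum(p * math.log2(p) for p in probs)


if __name__ == "__main__":
    # 44 bp repeat at 1787-1874: A 18, C 22, G 35, T 22
    print(round(composition_entropy([18, 22, 35, 22]), 2))
```

Under this assumption the 44 bp row gives about 1.95, and a repeat made of only two bases, such as the GAA triplet, stays below 1 bit.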


Conclusion

● We have presented a new algorithm for finding tandem repeats in DNA sequences without the need to specify either the pattern or the pattern size.

● The algorithm is based on the detection of k-tuple matches.

● It uses a probabilistic model of tandem repeats and a collection of statistical criteria based on that model.

● We have demonstrated the speed and utility of the algorithm by analyzing four sequences ranging in size up to 700 kb.

References

● Huntington’s Disease Collaborative Research Group (1993) Cell, 72, 971–983.

● Fu,Y.-H., Pizzuti,A., Fenwick,J., King,R.G.Jr., Rajnarayan,S., Dunne,P.W., Dubel,J., Nasser,G.A., Ashizawa,T., DeJong,P., Wieringa,B., Korneluk,R., Perryman,M.B., Epstein,H.F. and Caskey,C.T. (1992) Science, 255, 1256–1258.

● La Spada,A., Wilson,E., Lubahn,D., Harding,A. and Fischbeck,K. (1991) Nature, 352, 77–79.

● Campuzano,V., Montermini,L., Molto,M.D., Pianese,L. and Cossee,M. (1996) Science, 271, 1423–1427.

● Wells,R. (1996) J. Biol. Chem., 271, 2875–2878.

● Weitzmann,M., Woodford,K. and Usdin,K. (1997) J. Biol. Chem., 272, 9517–9523.

● Hamada,H., Seidman,M., Howard,B. and Gorman,C. (1984) Mol. Cell. Biol., 4, 2622–2630.


Thanks
