dna scanning and statistical analysis of the human...

Probability and Statistics, Dec 23, 2013

Ivan Ivanov

Vanier College

George Kolampas

DNA Scanning and Statistical Analysis of the Human Cytomegalovirus

With the amount of DNA sequence data available today, the r-scan statistic is extremely useful in determining regions of importance in a strand of DNA. In this paper the r-scan is used to find the replication site of Human Cytomegalovirus at positions 91490 to 92643. The r-scan's relation to the scan statistic is touched upon and the compound Poisson approximation is also examined.

Introduction and Background

DNA has been understood as the molecule that stores and passes on genetic information since Watson and Crick described its structure in 1953. DNA stores the information for controlling everything from its own replication to the transcription of RNA, which is in turn translated into the proteins used in all organisms.

DNA, also known as deoxyribonucleic acid, is a polymer of nucleotides, each made up of the pentose sugar deoxyribose, a phosphate group and one of four possible bases: guanine, adenine, thymine, and cytosine. DNA has a double helix structure that can split into two strands, and each strand can be seen as a long sequence of the four bases previously listed.

Figure 1: DNA double helix

http://en.wikipedia.org/wiki/Guanine

http://en.wikipedia.org/wiki/Adenine

http://en.wikipedia.org/wiki/Thymine

http://en.wikipedia.org/wiki/Cytosine

The coding parts of DNA are read in codons, which are 3 bases long; with 4 bases to choose from there are $4^3 = 64$ different possible codons. A codon codes for one of the 20 amino acids; each codon's amino acid is unambiguous, but many codons can code for the same amino acid. Sequences of bases that code for a product are called genes, and they control all processes that go on in a cell and subsequently in an organism. DNA strands can be from 20 bases to a few million bases long, and because a huge percentage of DNA is made up of non-coding regions such as introns, which code for nothing, distinguishing the coding parts from the non-coding parts is a big job.
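The codon count can be checked by direct enumeration. The following is a minimal sketch; the names `BASES` and `codons` are chosen here for illustration only.

```python
from itertools import product

BASES = "ACGT"  # adenine, cytosine, guanine, thymine

# Every ordered triple of bases is one possible codon.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

print(len(codons))  # number of possible codons, 4**3
print(codons[:4])   # the first few codons in alphabetical order
```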

Figure 2: Introns and their removal during mRNA processing

PCR, electrophoresis and many other new techniques have been introduced to analyse DNA. These advances in biochemical techniques have led to an exponential increase in the amount of sequence data: over a million sequences containing more than a billion nucleotide bases have been recorded. As the genome databases expand, mathematical methods play an increasingly important role in obtaining, organizing, archiving, analyzing, and interpreting the rapidly accumulating DNA data, especially since DNA is composed mostly of non-coding parts. When searching for insights into the organization of a genome, one problem that arises is how to identify anomalies in the spacings of patterns in a long sequence of nucleotides. Here the patterns refer to any short sequence segments that recur. Spacing anomalies include clumping (too many neighboring short spacings), overdispersion (too many long gaps between patterns), and excessive regularity (too few short spacings and/or too few long gaps). This is where statistical analysis is needed for the scanning of DNA.

Statistics is the study of the collection, organization, analysis, interpretation and

presentation of data. In this study we are using statistics in order to identify anomalies

and patterns in sequences of DNA that may be of importance with a certain degree of

certainty.

The next section describes the r-scan and its close relationship to the traditional scan statistic. The application of r-scans will be illustrated by an example that identifies unusual palindrome clusters in the genome of a member of the herpesvirus family. The genes of a virus are carried on a compact genome which, in some viruses, forms a closed loop much like a bacterial plasmid. In order to replicate, a virus needs a host cell into which it inserts its genetic information; once replication has occurred, the new virus particles burst out of the host cell and the cycle begins again.

Figure 3: Virus life cycle

Viruses are not wasteful, in the sense that they do not code for proteins or processes that are not immediately useful. For this reason we expect that the anomalies we are searching for with the r-scan method mark the operon which triggers replication in the virus. To see how operons work, consider the lac operon, an inducible system whose proteins allow bacteria to metabolize lactose. When lactose is absent, a repressor protein binds tightly to the operator. The repressor prevents RNA polymerase from binding to the promoter, turning transcription off. When lactose is present, the repressor detaches from the operator and transcription of the genes begins. By analogy, we take the same thing to be happening in the virus: when the virus begins to replicate, it is because an outside factor has removed the repressor.

We have to rely on an approximation when assessing the statistical significance of the clusters, because the exact probability distribution of r-scans is not available. In Section 12.3 of Leung and Yamashita, the accuracy of three Poisson-type approximate distributions is compared by contrasting the calculated approximate probabilities with simulation results.

r-Scans and DNA Sequence Analysis:

The r-scan is defined as the cumulative length of r consecutive distances between the order statistics $X_{(1)}, \ldots, X_{(N)}$ of the sample points.

For a set of points $X_1, \ldots, X_N$ distributed independently and uniformly over the unit interval (0,1), we let $D_i$ denote the distance between the ordered ith and (i + 1)th points, so that

$$D_i = X_{(i+1)} - X_{(i)}, \quad i = 1, \ldots, N - 1.$$

For any fixed integer r between 1 and N - 1, the r-scan at the point $X_{(i)}$ is

$$A_{r,i} = \sum_{j=i}^{i+r-1} D_j, \quad i = 1, \ldots, N - r.$$

The order statistics of these r-scans are denoted by $A_{r,(i)}$. The r-scan most frequently used in DNA analysis is the minimal r-scan,

$$A_{r,(1)} = \min\{A_{r,i},\ i = 1, \ldots, N - r\}.$$

We will use $A_{r,(1)}$ throughout this paper and denote it simply by $A_r$.
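The definitions above translate directly into a few lines of code. The following is a minimal sketch, not the procedure used to produce the results in this paper; the function names and the six example positions are hypothetical and chosen only for illustration.

```python
def spacings(points):
    """Distances D_i between consecutive ordered points."""
    xs = sorted(points)
    return [b - a for a, b in zip(xs, xs[1:])]


def r_scans(points, r):
    """r-scans A_{r,i}: sum of r consecutive spacings starting at the i-th ordered point."""
    d = spacings(points)
    if not 1 <= r <= len(d):
        raise ValueError("r must be between 1 and N - 1")
    # There are N - 1 spacings, hence N - r sums of r consecutive spacings.
    return [sum(d[i:i + r]) for i in range(len(d) - r + 1)]


def minimal_r_scan(points, r):
    """Minimal r-scan A_r: the smallest of the A_{r,i}."""
    return min(r_scans(points, r))


# Hypothetical positions (not taken from the HCMV data).
points = [100, 150, 400, 420, 450, 900]
print(spacings(points))           # [50, 250, 20, 30, 450]
print(r_scans(points, 2))         # [300, 270, 50, 480]
print(minimal_r_scan(points, 2))  # 50
```

Integer positions are used in the example so that the sums are exact; the same functions work for points scaled to the unit interval.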

$A_r$ is closely related to the traditional scan statistic

$$S_w = \max_{0 \le t \le 1 - w} Y_t(w),$$

where 0 < w < 1 is a prescribed window length and $Y_t(w)$ is the number of points in the interval [t, t + w].

Consider the event

$$\{Y_{X_{(i)}}(w) \ge r + 1\}$$

for i = 1, ..., N - r. It states that there are at least r + 1 points contained in the window $[X_{(i)}, X_{(i)} + w]$. This is equivalent to the event

$$\{A_{r,i} \le w\},$$

which states that the r adjoining spacings starting at $X_{(i)}$ have a cumulative length no greater than w. Because this is true for all i, it must also hold for the particular window holding the maximal number of points. Therefore we have the relation

$$P(S_w \ge r + 1) = P(A_r \le w)$$

for fixed values of $w \in (0,1)$ and r = 1, ..., N - 1.
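The equivalence between the two events can be checked numerically on a small example. The sketch below assumes that the maximal window can always be taken to start at one of the points; the function names and the positions are hypothetical.

```python
from bisect import bisect_right


def minimal_r_scan(xs, r):
    """A_r for sorted points xs: smallest span covering r + 1 consecutive points."""
    return min(xs[i + r] - xs[i] for i in range(len(xs) - r))


def scan_statistic(xs, w):
    """S_w for sorted points xs: largest number of points in any window [t, t + w]."""
    # Sliding a window right until its left edge hits a point never loses a point,
    # so it is enough to look at windows that start at a point.
    return max(bisect_right(xs, x + w) - i for i, x in enumerate(xs))


# Hypothetical positions (not taken from the HCMV data).
points = sorted([100, 150, 400, 420, 450, 900])
w = 50

for r in range(1, len(points)):
    left = scan_statistic(points, w) >= r + 1
    right = minimal_r_scan(points, r) <= w
    print(r, left, right)  # the two columns agree for every r
```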

Because of this relation, we automatically know the distribution of $A_r$ if the distribution of $S_w$ is available, so the minimal r-scan statistic and the traditional scan statistic can be used interchangeably. In DNA sequence analysis, the r-scan is usually preferred. The exact probability distribution of $A_r$ is unknown, so an approximation is used, as shown in Leung and Yamashita.

where:

and where:

In this case N=296 the number of palindromes, j is the palindrome count but because

we are using the Compound Poisson approximation we use r 2 and w is the window

length.
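As a rough cross-check on any approximate value of $P(A_r \le w)$, the probability can also be estimated by simulation. The following sketch is a plain Monte Carlo estimate under the uniform model, not the compound Poisson approximation itself; the function names, the number of trials and the seed are arbitrary choices made for illustration.

```python
import random


def minimal_r_scans(xs, r_values):
    """Minimal r-scan A_r of the sorted points xs, for each r in r_values."""
    return {r: min(xs[i + r] - xs[i] for i in range(len(xs) - r)) for r in r_values}


def simulate_p(n_points, w, r_values, trials=500, seed=0):
    """Estimate P(A_r <= w) for n_points independent uniform points on (0, 1)."""
    rng = random.Random(seed)
    r_values = list(r_values)
    hits = {r: 0 for r in r_values}
    for _ in range(trials):
        xs = sorted(rng.random() for _ in range(n_points))
        a = minimal_r_scans(xs, r_values)
        for r in r_values:
            hits[r] += a[r] <= w  # count the trials in which A_r <= w
    return {r: hits[r] / trials for r in r_values}


# N = 296 points and the scaled window 500/229354 are taken from the text.
estimates = simulate_p(296, 500 / 229354, range(2, 11))
for r, p in estimates.items():
    print(r, p)
```

Because the same simulated samples are reused for every r, the estimates cannot increase as r grows. With only 500 trials the small tail probabilities are estimated very coarsely, so many more trials would be needed to compare them with the tabulated values.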

Palindrome Placement and the Data at Hand

As previously explained, we will use the $A_r$ r-scan statistic to find dense palindrome clusters on the DNA that can be considered statistically significant at a significance level of α = .05, that is, clusters which are not due to random scatter. The following table is the data provided and is what will be tested. Each number in the table is a position in the DNA at which a palindrome of length ten or above occurs. As documented, there are two hundred and ninety-six palindromes that fit the criteria in this table. Using this table and the compound Poisson approximation method mentioned above, we will be able to identify the clusters of palindromes of length 10 or above that are unlikely to be due to random chance.

Statistically Significant Palindromes With a Window Length of 500

For this window length of 500 and a DNA strand 229354 bases long, our w = 500/229354 = 0.00218. Using Excel we obtain a table for r values from 2 to 10. This table tells us that at a significance level of α = .05 with a window length of 500, we only consider clusters of 6 or above to be statistically significant. The following graph shows where clusters are found on the DNA strand.

r     P(A_r ≤ w)
2     1.00000
3     0.99815
4     0.023846
5     0.118076
6     0.013357
7     0.001225
8     9.73E-5
9     6.82E-6
10    4.20E-7

The graph shows two points above 6, which means that those points can be considered anomalies not due to random chance. The one around 90000 is consistent with the paper by Masse et al. (1992), which carried out detailed experimental assays around this part of the genome and characterized the segment between 92210 bp and 93715 bp as the lytic origin of replication for HCMV. We choose to ignore the other point that surpasses six because the paper speaks only about being interested in palindromes of length 10 or higher.

Graph for Sliding Window of Length 500 (x-axis: position with respect to bases, 0 to 250000; y-axis: palindrome cluster density, 0 to 14)
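A sliding-window count of this kind can be sketched as follows. This is only an illustration under stated assumptions: the window is taken to start at each palindrome position, the cluster density is taken to be the number of palindromes inside the window, and the positions below are hypothetical, not the HCMV data.

```python
from bisect import bisect_right


def window_counts(positions, window):
    """For each palindrome, count the palindromes in [position, position + window]."""
    xs = sorted(positions)
    return [(x, bisect_right(xs, x + window) - i) for i, x in enumerate(xs)]


def dense_windows(positions, window, min_count):
    """Windows whose palindrome count reaches the chosen threshold."""
    return [(x, c) for x, c in window_counts(positions, window) if c >= min_count]


# Hypothetical palindrome positions (not the HCMV data).
positions = [1000, 5200, 9000, 9050, 9120, 9200, 9310, 9400, 9480, 20000]

# Threshold of 6 as read from the table for window length 500.
print(dense_windows(positions, 500, 6))  # [(9000, 7), (9050, 6)]
```

Plotting the counts returned by `window_counts` against position would give a graph of the same kind as the one described above.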

Statistically Significant Palindromes With a Window Length of 1000

The same calculations will be used as those done with a window length of 500.

w=1000/229354=0.00436

r     P(A_r ≤ w)
2     1.00000
3     1.00000
4     0.999722
5     0.862716
6     0.34455
7     0.074703
8     0.012356
9     0.001758
10    0.000223

For a significance level of α = .05, only palindrome clusters of 8 or above are acceptable.
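The scaling of the window length and the reading of the threshold from each table can be reproduced with a short script. This is a sketch of the bookkeeping only: the probabilities are copied from the two tables as printed, and the function names are hypothetical.

```python
GENOME_LENGTH = 229354  # bases, as stated in the text

# P(A_r <= w) copied from the two tables as printed, keyed by r.
P_500 = {2: 1.00000, 3: 0.99815, 4: 0.023846, 5: 0.118076, 6: 0.013357,
         7: 0.001225, 8: 9.73e-5, 9: 6.82e-6, 10: 4.20e-7}
P_1000 = {2: 1.00000, 3: 1.00000, 4: 0.999722, 5: 0.862716, 6: 0.34455,
          7: 0.074703, 8: 0.012356, 9: 0.001758, 10: 0.000223}


def scaled_window(window, genome_length=GENOME_LENGTH):
    """Window length expressed as a fraction of the genome, i.e. w in (0, 1)."""
    return window / genome_length


def smallest_significant_r(table, alpha=0.05):
    """Smallest r such that every r' >= r in the table has probability <= alpha."""
    threshold = None
    for r in sorted(table, reverse=True):  # walk down from the largest r
        if table[r] > alpha:
            break
        threshold = r
    return threshold


print(round(scaled_window(500), 5), smallest_significant_r(P_500))    # 0.00218 6
print(round(scaled_window(1000), 5), smallest_significant_r(P_1000))  # 0.00436 8
```

Walking down from the largest r means the threshold is the point from which the probabilities stay below α, so a single small entry lower in a table (such as the printed value for r = 4 at window length 500) does not lower it.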

Graph for Sliding Window of Length 1000 (x-axis: position with respect to bases, 0 to 250000; y-axis: palindrome cluster density, 0 to 14)

Again the same two regions are acceptable, but once again only one of the two surpasses the 10 base pairs written in the paper.

Conclusion

To conclude, r-scan statistical analysis is a very effective tool for finding coding genes in DNA, thanks to the Bernoulli nature of DNA sequences. In this case we found that the replication site for HCMV was located at the cluster from 91490 to 92643, consistent with the analysis of Masse et al. (1992). Either of the two graphs is useful, but those with larger window lengths are a bit more effective when accepting only large clusters. In either case, by reading the r-scan table values and comparing them to a graph of clusters, one can determine where important biological processes are being coded for, with significant certainty that the result is not due to random chance. With all the sequence banks now available, this is a big breakthrough in DNA research.

Bibliography

Leung, M.-Y., & Yamashita, T. E. (n.d.). Applications of the Scan Statistic in DNA Sequence Analysis. 27-271, 280-281.

Sadava, Hillis, Heller, & Berenbaum. (2011). Life: The Science of Biology. Sunderland: The Courier Companies Inc.
