tmrca estimates


Upload: davejphys

Post on 26-Mar-2015

DESCRIPTION

Just some notes I put together on calculating time to most recent common ancestor. I suspect many of these results have been derived before by others but I am not aware of any.

TRANSCRIPT

Page 1: TMRCA Estimates

Notes on Estimating the Time to Most Recent Common Ancestor using Y-DNA Haplotypes

David E. Johnston ∗

June 5, 2011

1 The TMRCA Problem

The DNA sequences of the male Y chromosome can be used to determine the genetic closeness of two individuals. Short Tandem Repeats (STRs) are sequences (alleles) of repeating genetic base-sequences, and it is known that the number of repeats mutates on a fairly short time-scale, which makes them useful for genetic genealogy. Various companies offer services where individuals can be tested and receive a string of numbers (markers) collectively called a haplotype.

The statistical problem we will concern ourselves with is determining the Time to Most Recent Common Ancestor (TMRCA) given the mutation rates. This problem has been studied before. Walsh (2001), for example, uses an approximate method called the Infinite Alleles Model (IAM), which makes the approximation of ignoring multiple mutations in the same allele, including back mutations, where a mutation occurs and is followed by its reverse mutation so that none is observed.

In these notes, I show that this problem has an exact solution, including the generalization to multiple branching (mutating by more than one in one generation at a given allele), and is computationally benign.

∗Contact Author at [email protected]


2 A Bayesian Method

We wish to write down the posterior probability of the TMRCA, which we will call T, given the mutation rates. We will start with one allele at a time. Since we will assume that mutations in different alleles are independent, the probability for all alleles is just the product of the individual probabilities.

2.1 Mutation Rates

Let the multiple-branch, per-generation mutation rates for a single allele be given by a vector $\vec{\mu}$, for example $\vec{\mu} = [0.0030, 0.0010, 0.0005]$. This means that the probability of mutating by ±1 is 0.0030, the probability of mutating by ±2 is 0.0010 and the probability of mutating by ±3 is 0.0005. We will assume symmetric mutation rates, so that the probability of +1 is 0.0030/2 = 0.0015 and the same for −1, etc. The probability of mutating to any branch is the sum of all mutation rates, which we will call $\mu$ (here $\mu = 0.0045$), and so the probability of remaining the same is $1 - \mu$.
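As a small illustration of this bookkeeping, the single-generation distribution can be laid out as an array indexed by the size of the jump. This is only a sketch: the function name `step_distribution` and the array layout (index `B + d` holds the probability of a change `d`) are choices made here, not part of the notes.

```python
def step_distribution(rates):
    """Single-generation distribution of the change D in repeat count.

    rates[b-1] is the total probability of a jump of size b (either sign).
    Returns a list p of length 2B+1 with p[B + d] = P(D = d).
    """
    B = len(rates)
    p = [0.0] * (2 * B + 1)
    for b, rate in enumerate(rates, start=1):
        p[B + b] = rate / 2.0   # jump of +b
        p[B - b] = rate / 2.0   # jump of -b (symmetric rates assumed)
    p[B] = 1.0 - sum(rates)     # no mutation: 1 - mu
    return p


mu_vec = [0.0030, 0.0010, 0.0005]   # the example vector from the text
p = step_distribution(mu_vec)
print(p)
```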

Chandler (2006) has studied mutation rates using Y-DNA haplotype data. The mutation rates vary from allele to allele. It is also known that mutation rates for multiple jumps (or branches) are non-zero but smaller. Furthermore, there is at least anecdotal evidence that mutation rates vary between different paternal lines. We will side-step this issue for the purpose of these notes and assume the symmetric, multiple-branch mutation rates are known.

2.2 The Likelihood

Assume we start with some marker, $M_0$, for the ancestor. The child of this ancestor will have a marker $M_1 = M_0 + D_1$, where the difference $D_1$ is distributed as $P(D_1|1, \vec{\mu})$. $D_1$ has a Bernoulli distribution, or rather a multiple-Bernoulli (categorical) distribution, with the mutation rates as parameters. The 1 just denotes that this is the first generation. For the T-th generation, we can write $M_T = M_0 + \sum_{i=1}^{T} D_i$. This sequence of random outcomes is a random walk (a specific kind of Markov chain), since the probability of jumping by a given amount is independent of which value you are at. That is, the $D_i$ are identically distributed.

The probability distribution of a sum of independent random variables drawn from the same distribution P is just the convolution of the probability distributions of each: $P(Z \equiv X + Y) = \sum_i P(X_i)\,P(Z - X_i)$.


Likewise, the probability distribution of a difference is given by the correlation of the two distributions: $P(W \equiv X - Y) = \sum_i P(X_i)\,P(W + X_i)$. For distributions, like ours, that are symmetric about zero, it is easy to show that convolution and correlation are the same.

Using this information we can immediately write down the probability distribution of markers, $M = M_0 + D$, after T generations. It is given by

$$P(D|T, \vec{\mu}) = P * P * P \cdots \ (T\ \text{times}) \equiv P^{*T} \tag{1}$$

where $*$ denotes convolution (or correlation). This is easy enough to compute, one generation at a time. The probability distribution for each generation is given recursively from the one before. Generation N will have 2BN + 1 values with non-zero probability if B is the number of branches (e.g. B = 1 for only ±1 mutations), but only BN + 1 of these need to be stored since all distributions are symmetric about zero. It is easy to show that the number of calculations, as well as the number of values needing to be stored, is of order $N^2$, which is easy for modern computers up to N of a few thousand. This can be made even easier computationally by skipping ahead and computing only every 5 or 10 generations, which is easy to do because convolution is associative.
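A minimal sketch of the T-fold convolution, assuming a single-branch model with an illustrative rate of 0.003. The helper names `convolve` and `conv_power` are invented here. The second function uses associativity to skip ahead by repeated squaring rather than stepping one generation at a time.

```python
def convolve(a, b):
    """Full discrete convolution of two probability lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def conv_power(p, T):
    """T-fold convolution of p with itself, by repeated squaring."""
    result = [1.0]          # delta at zero: the identity for convolution
    base = p
    while T > 0:
        if T & 1:
            result = convolve(result, base)
        T >>= 1
        if T:
            base = convolve(base, base)
    return result


mu = 0.003                               # illustrative single-branch rate
p = [mu / 2, 1 - mu, mu / 2]             # P(D = -1), P(D = 0), P(D = +1)
dist = conv_power(p, 100)                # distribution after T = 100 generations
center = (len(dist) - 1) // 2            # index of D = 0
print(len(dist), dist[center], dist[center + 1])
```

The result has 2BT + 1 = 201 entries, matching the count given above for B = 1.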

The likelihood above is the distribution of differences between the original haplotype marker and the final one. If you are comparing two final descendants, you just take the distribution above and correlate/convolve it with itself. Note that this is just the same as continuing the process to 2T generations, again due to the associativity of convolution.

Figure 1 shows the original probability distribution of mutation rates (black) and how the distribution spreads out with each generation.
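To make the two-descendant case concrete, the sketch below evaluates the likelihood of a set of observed marker differences as a function of T, using the 2T-generation distribution and multiplying across independent markers. Several things here are assumptions for illustration only: the observed differences are made up, every marker is given the same single-branch rate, and the final normalisation corresponds to a flat prior on T over 1..T_max.

```python
def convolve(a, b):
    """Full discrete convolution of two probability lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def pair_likelihood_by_T(diffs, mu, T_max):
    """Likelihood of the marker differences between two descendants
    for each T = 1..T_max, using the distribution after 2T generations."""
    step = [mu / 2, 1 - mu, mu / 2]
    two_steps = convolve(step, step)      # one more generation on each lineage
    dist = [1.0]
    like = []
    for T in range(1, T_max + 1):
        dist = convolve(dist, two_steps)  # now the 2T-fold convolution
        c = 2 * T                         # index of zero difference
        L = 1.0
        for d in diffs:                   # markers assumed independent
            L *= dist[c + d] if abs(d) <= c else 0.0
        like.append(L)
    return like


diffs = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0]    # made-up differences at 10 markers
like = pair_likelihood_by_T(diffs, mu=0.003, T_max=300)
norm = sum(like)
posterior = [v / norm for v in like]      # flat prior on T (an assumption)
mode_T = 1 + max(range(len(posterior)), key=posterior.__getitem__)
print(mode_T)
```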

2.3 Computing the Likelihood with the Discrete Fourier Transform

The Discrete Fourier Transform (DFT), and in particular the Fast Fourier Transform (FFT) algorithm, can be used to speed up these calculations. Since the distributions are symmetric (and real), the discrete cosine transform is the one to use.

Fourier transforms have the useful property that the Fourier transform of a convolution of two distributions is just the product of the two Fourier transforms. And so, if P is the probability distribution of mutations,

$$P(D|T, \vec{\mu}) = \mathrm{DFT}^{-1}\left(\mathrm{DFT}(P)^{T}\right) \tag{2}$$

and for calculating the difference between the haplotypes of two descendants, you would just replace T in the above equation with 2T.

Generally the mutations are symmetric and so the DFT becomes the DCT, the discrete cosine transform, which we define as

$$F_k = \mathrm{DCT}[f](k) = \sum_j f_j \cos(2\pi jk/N) \tag{3}$$

and the inverse transform

$$f_j = \mathrm{DCT}^{-1}[F](j) = \frac{1}{N}\sum_k F_k \cos(2\pi jk/N) \tag{4}$$

The DCT is defined for any value of N. But if we wish to use the circular convolution theorem without worrying about wrap-around effects, we want N larger than the effective width of the resultant distribution. In practice something of the order N = 32 should be sufficient.
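The transform route can be sketched with NumPy's real FFT in place of a hand-written cosine transform (a substitution made here for convenience; for a real, symmetric input the two agree). The rate, T and N below are illustrative values, and the function names are invented for this example. Index `d` of the circular array holds the probability of a change `+d`, and index `N - d` holds `-d`.

```python
import numpy as np


def distribution_via_fft(mu, T, N):
    """P(D | T) on a circular grid of N points for the single-branch model."""
    f = np.zeros(N)
    f[0] = 1.0 - mu
    f[1] = mu / 2.0
    f[-1] = mu / 2.0                 # index N-1 represents a change of -1
    F = np.fft.rfft(f)               # transform of the one-generation distribution
    return np.fft.irfft(F ** T, n=N) # transform of a T-fold convolution is F**T


def distribution_direct(mu, T):
    """Same distribution by explicit repeated convolution (index T + d)."""
    step = np.array([mu / 2.0, 1.0 - mu, mu / 2.0])
    dist = np.array([1.0])
    for _ in range(T):
        dist = np.convolve(dist, step)
    return dist


mu, T, N = 0.003, 20, 64             # N > 2T + 1, so nothing wraps around
fft_dist = distribution_via_fft(mu, T, N)
direct = distribution_direct(mu, T)
max_err = max(abs(fft_dist[d] - direct[T + d]) for d in range(T + 1))
print(max_err)
```

For larger T the support 2T + 1 can exceed N, but the wrapped-around tail is negligible as long as N is well beyond the effective width of the distribution.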

For the simplest mutation model, where there is a single symmetric branching, we have $f_0 = (1 - \mu)$ and $f_1 = f_{N-1} = \mu/2$, and all else are zero, and so

$$F_k = (1 - \mu) + \mu \cos(2\pi k/N) \tag{5}$$

$$= (1 - \mu)\left(1 + \frac{\mu}{1 - \mu}\cos(2\pi k/N)\right) \tag{6}$$

$$\approx (1 - \mu)\left(1 + \mu \cos(2\pi k/N)\right) \tag{7}$$

Note we have dropped the $1 - \mu$ denominator in the second term, since $\mu$ is small and we never know it that accurately anyway.

Using this result and the one above we can write down the likelihood

$$P(x|T) = \mathrm{DFT}^{-1}\left(\mathrm{DFT}(P)^{T}\right) \tag{8}$$

$$= \frac{1}{N}(1 - \mu)^{T}\sum_{k=0}^{N-1} \cos(2\pi xk/N)\left(1 + \mu\cos(2\pi k/N)\right)^{T} \tag{9}$$

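This is true for any N large enough to avoid wrap-around, but we can also take N arbitrarily large, and this becomes an integral

$$P(x|T) = (1 - \mu)^{T}\,\frac{1}{2\pi}\int_0^{2\pi} d\theta\, \cos(x\theta)\left(1 + \mu\cos\theta\right)^{T} \tag{10}$$

$$= (1 - \mu)^{T}\,\frac{1}{\pi}\int_0^{\pi} d\theta\, \cos(x\theta)\left(1 + \mu\cos\theta\right)^{T} \tag{11}$$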


This integral can be evaluated in various ways. For small $\mu$ it can be written

$$P(x|T) \approx \exp(-\mu T)\,\frac{1}{\pi}\int_0^{\pi} d\theta\, \cos(x\theta)\exp(\mu T\cos\theta) \tag{12}$$

$$= \exp(-\mu T)\, I_x(\mu T) \tag{13}$$

where $I_x(\mu T)$ is the modified Bessel function of the first kind of order x evaluated at $\mu T$. This is accurate when T is much less than $1/\mu^2$.
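A quick numerical check of the Bessel form against the exact random walk, for illustrative values $\mu = 0.003$ and T = 100 (so T is far below $1/\mu^2$). To stay dependency-free, the sketch evaluates $\exp(-\mu T) I_x(\mu T)$ by midpoint quadrature of its integral representation; SciPy's `scipy.special.ive(x, z)` returns the same exponentially scaled quantity and could be used instead. Function names are invented here.

```python
import math


def bessel_approx(x, mu, T, n=2000):
    """exp(-mu T) * I_x(mu T) by midpoint quadrature of the integral form."""
    z = mu * T
    h = math.pi / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        total += math.cos(x * theta) * math.exp(z * (math.cos(theta) - 1.0))
    return total * h / math.pi


def exact_walk(x, mu, T):
    """Exact P(x | T) for the single-branch model by repeated convolution."""
    dist = [1.0]
    for _ in range(T):
        new = [0.0] * (len(dist) + 2)
        for i, v in enumerate(dist):
            new[i] += v * mu / 2        # step of -1
            new[i + 1] += v * (1 - mu)  # no mutation
            new[i + 2] += v * mu / 2    # step of +1
        dist = new
    return dist[T + x]                  # index T is zero net change


mu, T = 0.003, 100
approx = [bessel_approx(x, mu, T) for x in range(4)]
exact = [exact_walk(x, mu, T) for x in range(4)]
for x in range(4):
    print(x, approx[x], exact[x])
```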

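The integral can also be computed exactly by noticing that, for integer x, $\cos(x\theta) = T_x(\cos\theta)$, where $T_x$ are the Chebyshev polynomials of order x (not to be confused with the number of generations T):

$$T_x(\cos\theta) = \sum_{i=0}^{x} a_{x,i}\cos^{i}\theta \tag{14}$$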

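where $a_{x,i}$ are the Chebyshev polynomial coefficients. Using this and the binomial theorem, we have

$$P(x|T) = (1 - \mu)^{T}\,\frac{1}{\pi}\int_0^{\pi} d\theta\, T_x(\cos\theta)\left(1 + \mu\cos\theta\right)^{T} \tag{15}$$

$$= \frac{1}{\pi}(1 - \mu)^{T}\sum_{i=0}^{x} a_{x,i}\int_0^{\pi} d\theta\, \cos^{i}\theta \sum_{j=0}^{T}\binom{T}{j}\mu^{j}\cos^{j}\theta \tag{16}$$

$$= \frac{1}{\pi}(1 - \mu)^{T}\sum_{i=0}^{x}\sum_{j=0}^{T} a_{x,i}\binom{T}{j}\mu^{j}\int_0^{\pi} d\theta\, \cos^{i+j}\theta \tag{17}$$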

This integral has an exact expression,

$$\int_0^{\pi} d\theta\, \cos^{m}\theta = \pi\, 2^{-m}\binom{m}{m/2} E(m) \tag{18}$$

where $E(m) = 1$ when m is even and 0 when m is odd.

Finally, we arrive at an expression as a double sum, which is useful when $\mu$ is not very small or T is comparable to $1/\mu^2$:

$$P(x|T) = (1 - \mu)^{T}\sum_{i=0}^{x}\sum_{j=0}^{T} a_{x,i}\binom{T}{j}\mu^{j}\,2^{-(i+j)}\binom{i+j}{(i+j)/2} E(i+j) \tag{19}$$

Generally, the x values are small, especially for small T where this is useful, so you only need to tabulate the first few Chebyshev polynomial coefficients.
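The double sum can be checked numerically against direct quadrature of the integral it was derived from, $(1-\mu)^T \frac{1}{\pi}\int_0^\pi d\theta \cos(x\theta)(1+\mu\cos\theta)^T$. The sketch below generates the Chebyshev coefficients from the standard recurrence $T_{n+1}(c) = 2c\,T_n(c) - T_{n-1}(c)$. The function names and the test values $\mu = 0.05$, T = 50 are illustrative, and the plain floating-point binomials limit this version to moderate T.

```python
import math


def chebyshev_coeffs(x):
    """Coefficients a[i] with T_x(c) = sum_i a[i] * c**i."""
    prev, cur = [1], [0, 1]                 # T_0 and T_1
    if x == 0:
        return prev
    for _ in range(x - 1):
        nxt = [0] + [2 * a for a in cur]    # 2 c T_n
        for i, a in enumerate(prev):
            nxt[i] -= a                     # minus T_{n-1}
        prev, cur = cur, nxt
    return cur


def double_sum(x, mu, T):
    """Double-sum expression for P(x | T); only even i + j contribute."""
    a = chebyshev_coeffs(abs(x))
    total = 0.0
    for i, a_i in enumerate(a):
        if a_i == 0:
            continue
        for j in range(T + 1):
            m = i + j
            if m % 2:
                continue
            total += a_i * math.comb(T, j) * mu ** j * math.comb(m, m // 2) / 2.0 ** m
    return (1.0 - mu) ** T * total


def by_quadrature(x, mu, T, n=2000):
    """Midpoint-rule evaluation of the same integral, for comparison."""
    h = math.pi / n
    s = sum(math.cos(x * (i + 0.5) * h) * (1.0 + mu * math.cos((i + 0.5) * h)) ** T
            for i in range(n))
    return (1.0 - mu) ** T * s * h / math.pi


mu, T = 0.05, 50
sums = [double_sum(x, mu, T) for x in range(4)]
quads = [by_quadrature(x, mu, T) for x in range(4)]
print(sums)
```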


When $\mu$ is very poorly known, one might wish to integrate over $\mu$ and write

$$P(x|T) = \int_0^{\infty} d\mu\, P(\mu)\exp(-\mu T)\, I_x(\mu T) \tag{20}$$

This can easily be done numerically for any distribution $P(\mu)$. Generally it has very little effect unless the uncertainty in $\mu$ is quite large. The integral can be computed analytically (albeit with some difficulty) when $P(\mu)$ is a Gamma distribution, a not unreasonable choice: $P(\mu) = \mu^{k-1}\left(\Gamma(k)\,\theta^{k}\right)^{-1}\exp(-\mu/\theta)$. This has mean $k\theta$ and variance $k\theta^2$, so that k is given by the square of the mean over the variance and $\theta$ is given by the variance over the mean. Using this we have

$$P(x|T) = \left(\Gamma(k)\,\theta^{k}\right)^{-1}\int_0^{\infty} d\mu\, \mu^{k-1}\exp(-\mu/\theta)\exp(-\mu T)\, I_x(\mu T) \tag{21}$$

$$= \left(\Gamma(k)\,\theta^{k}\right)^{-1} T^{-k}\int_0^{\infty} d\tau\, \tau^{k-1}\exp(-\tau s)\, I_x(\tau) \tag{22}$$

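with $\tau = \mu T$ and $s = 1 + (T\theta)^{-1}$. This integral is the Laplace transform, or rather (for integer k) the $(k-1)$-th derivative of the Laplace transform, of the modified Bessel function, which becomes

$$P(x|T) = \left(\Gamma(k)\,\theta^{k}\right)^{-1} T^{-k}\,(-1)^{k-1}\left(\frac{d}{ds}\right)^{k-1}\frac{\left(s + \sqrt{s^2 - 1}\right)^{-|x|}}{\sqrt{s^2 - 1}} \tag{24}$$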

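This becomes especially simple for k = 1 (the exponential distribution). Redefining $\tau \equiv \theta T$, so that $s = 1 + 1/\tau$ and $\sqrt{s^2 - 1} = \sqrt{1 + 2\tau}/\tau$, it becomes

$$P(x|T) = \frac{\left[\tau^{-1}\left(1 + \tau + \sqrt{1 + 2\tau}\right)\right]^{-|x|}}{\sqrt{1 + 2\tau}} \tag{25}$$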
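As a sanity check on the k = 1 result, the sketch below evaluates it in the form $\left[\tau/(1+\tau+\sqrt{1+2\tau})\right]^{|x|}/\sqrt{1+2\tau}$ with $\tau = \theta T$, which is what the Laplace transform gives with $s = 1 + 1/(T\theta)$. It then confirms that the probabilities sum to one over x and that the variance equals $\tau$, the prior mean of $\mu T$. The function name and the numerical values are illustrative only.

```python
import math


def marginal_exponential_prior(x, T, mean_mu):
    """P(x | T) after integrating mu over an exponential prior with the given mean."""
    tau = mean_mu * T                      # theta * T; for k = 1 the mean equals theta
    root = math.sqrt(1.0 + 2.0 * tau)
    ratio = tau / (1.0 + tau + root)       # always between 0 and 1
    return ratio ** abs(x) / root


T, mean_mu = 1000, 0.004                   # illustrative values, tau = 4
xs = range(-200, 201)
probs = [marginal_exponential_prior(x, T, mean_mu) for x in xs]
total = sum(probs)
var = sum(x * x * p for x, p in zip(xs, probs))
print(total, var)
```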

2.4 Variances

Since our distributions are all symmetric about zero, the mean of the distribution is always zero. The next moment of interest is the variance. Another relevant fact is that the variance of a convolution of two distributions is the sum of the variances of each of them. This fact allows us to write down the variance of the distribution $P(D|T, \vec{\mu})$ as

$$\mathrm{Var}[P(D|T, \vec{\mu})] = T\,\mathrm{Var}[P] \tag{26}$$


where $\mathrm{Var}[P]$ is the variance of the original mutation distribution. For single branching, $\mathrm{Var}[P] = \mu$ and so $\mathrm{Var}[P(D|T, \vec{\mu})] = T\mu$. As usual, we replace T with 2T when talking about the variance between two descendants rather than the variance between the descendant and the ancestor.
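The additivity of variances is easy to confirm numerically. The sketch below uses the three-branch example rates [0.0030, 0.0010, 0.0005], for which the one-generation variance is 0.0030·1 + 0.0010·4 + 0.0005·9 = 0.0115, and T = 50 as an arbitrary illustrative choice. Helper names are invented here.

```python
def convolve(a, b):
    """Full discrete convolution of two probability lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def variance(p):
    """Variance of a distribution stored symmetrically about its middle index
    (the mean is zero by symmetry)."""
    c = (len(p) - 1) // 2
    return sum((i - c) ** 2 * v for i, v in enumerate(p))


rates = [0.0030, 0.0010, 0.0005]
# Layout: changes -3, -2, -1, 0, +1, +2, +3
step = [r / 2 for r in reversed(rates)] + [1 - sum(rates)] + [r / 2 for r in rates]

T = 50
dist = [1.0]
for _ in range(T):
    dist = convolve(dist, step)

var_step = variance(step)
var_T = variance(dist)
print(var_step, var_T, T * var_step)
```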

2.5 Using variances for full clade TMRCA estimates

Any group of haplotypes will have a TMRCA as well. This is the time when all of the lineages have coalesced. We will refer to any group as a "clade", though some might prefer to reserve that word for people sharing a given SNP. We will use it more loosely to mean any chosen group.

Variances of marker values (combined across different alleles in some way) are commonly used to estimate the TMRCA of the clade as well as between pairs. However, we will show that this is not an unbiased estimator of the clade TMRCA but rather an unbiased estimator of the average of the pairwise TMRCAs.

Let us define $D_{ij}$ as the STR marker data, where i indicates the allele and j indicates the person, with N people in the group. We can define the sample mean for each allele as $m_i = N^{-1}\sum_j D_{ij}$. We can also define the sample variance in the usual way, $s_i^2 = N^{-1}\sum_j (D_{ij} - m_i)^2$. It is shown in any standard statistics textbook that the sample mean is unbiased for the true mean but that the sample variance is not unbiased for the true variance. The expectation value of $s^2$ is $\frac{N-1}{N}\sigma^2$. But this bias is harmless, because the corrected statistic $\frac{N}{N-1}s^2$ is therefore an unbiased estimator of $\sigma^2$.

These well-known results, however, make the assumption that the data are independent. If the data are not independent but are, rather, correlated, then this result for the sample variance is changed as follows. For clarity, we will drop the i subscript for the moment and reintroduce it later when needed. So we are just discussing the data in one allele.

$$s^2 \equiv \frac{1}{N}\sum_j (D_j - m)^2 \tag{27}$$

$$= \frac{1}{N}\sum_j D_j^2 - 2\,\frac{1}{N}\sum_j D_j\, m + m^2 \tag{28}$$


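Plugging in $m = N^{-1}\sum_k D_k$, this becomes

$$s^2 = \frac{1}{N}\sum_j D_j^2 - 2\,\frac{1}{N^2}\sum_{jk} D_j D_k + \frac{1}{N^2}\sum_{jk} D_j D_k \tag{30}$$

$$= \frac{1}{N}\sum_j D_j^2 - \frac{1}{N^2}\sum_{jk} D_j D_k \tag{31}$$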

Let $\mu$ be the population mean (just for now; we will use $\mu$ for mutation rates again later). The data values are $D_j = \mu + \epsilon_j$, where $\epsilon_j$ are the random deviations from the mean due to random mutations. The expectation values are $\langle\epsilon_j\rangle = 0$ (required if $\mu$ is to be the mean), and the expectation value of $\epsilon_j\epsilon_k$ defines the covariance matrix, $C_{jk} = \langle\epsilon_j\epsilon_k\rangle$. So now, we can write

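$$\frac{1}{N}\sum_j D_j^2 = \frac{1}{N}\sum_j (\mu + \epsilon_j)^2 = \frac{1}{N}\sum_j \left(\mu^2 + 2\mu\epsilon_j + \epsilon_j^2\right) \tag{32}$$

The expectation value of this is $\mu^2 + \frac{1}{N}\sum_j C_{jj} = \mu^2 + N^{-1}\mathrm{Tr}(C)$, where $\mathrm{Tr}(C)$ is the trace (sum of diagonals) of the covariance matrix. Similarly, the expectation value of the second term is $\langle N^{-2}\sum_{jk} D_j D_k\rangle = \mu^2 + N^{-2}\sum_{jk} C_{jk}$. The $\mu^2$ terms cancel in the difference, and so we can finally write down the expectation value of the sample variance for correlated data,

$$\langle s^2\rangle = \frac{1}{N}\mathrm{Tr}(C) - \frac{1}{N^2}\sum_{jk} C_{jk} \tag{34}$$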

For the special case of uncorrelated data with equal variances, $C_{jk} = \sigma^2\delta_{jk}$, we recover the usual result $\langle s^2\rangle = \sigma^2 - N^{-1}\sigma^2 = \frac{N-1}{N}\sigma^2$.
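The expression for the expected sample variance, $\frac{1}{N}\mathrm{Tr}(C) - \frac{1}{N^2}\sum_{jk} C_{jk}$, is simple enough to evaluate directly for any covariance matrix. The sketch below (function name and numbers are illustrative) checks the uncorrelated special case and the opposite extreme of perfectly correlated data, where the sample variance must vanish because everyone carries the same value.

```python
def expected_sample_variance(C):
    """<s^2> = Tr(C)/N - sum_jk C_jk / N^2 for an N x N covariance matrix C."""
    N = len(C)
    trace = sum(C[j][j] for j in range(N))
    total = sum(sum(row) for row in C)
    return trace / N - total / N ** 2


N, sigma2 = 5, 2.0

# Uncorrelated, equal variances: C = sigma^2 * identity
C_iid = [[sigma2 if j == k else 0.0 for k in range(N)] for j in range(N)]
iid_value = expected_sample_variance(C_iid)

# Perfectly correlated: every entry equals sigma^2
C_full = [[sigma2] * N for _ in range(N)]
full_value = expected_sample_variance(C_full)

print(iid_value, (N - 1) / N * sigma2, full_value)
```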

Now, let us apply this to the STR data for a clade. We only need to know the covariance matrix of $D_j$. We are still just working with one allele, so we will suppress the i subscript. From now on, $\mu$ will refer to mutation rates, not the mean. We already know that the variances are $\mu T$. But what about the off-diagonal values? Here, we need to remember that the mutations are assumed to be independent events. If two people have a pairwise TMRCA of $T_{jk}$, it means that those people shared the exact same mutation events before that time and, after that time, experienced independent (uncorrelated) mutations. So it is clear that the off-diagonal covariances are given by $\mu(T - T_{jk})$. So now, we can write down the expectation value of the sample variance for STR marker data:

$$\langle s^2\rangle = \frac{1}{N}\mathrm{Tr}(C) - \frac{1}{N^2}\sum_{jk} C_{jk} \tag{35}$$

$$= \mu T - \mu\,\frac{1}{N^2}\sum_{jk} (T - T_{jk}) \tag{36}$$

$$= \mu\,\frac{1}{N^2}\sum_{jk} T_{jk} \tag{37}$$

Note that the diagonals of $T_{jk}$ are zero and there are $N(N-1)$ off-diagonal terms, so we can write this

$$\langle s^2\rangle = \mu\,\frac{N-1}{N}\,T_P \tag{38}$$

where $T_P$ is the mean pairwise TMRCA,

$$T_P = \frac{1}{N(N-1)}\sum_{jk} T_{jk} \tag{39}$$

So, at last, we have shown that the corrected sample variance $\frac{N}{N-1}s^2$ (divided by $\mu$) is not in fact an unbiased estimator of T but is an unbiased estimator of this $T_P$, the mean pairwise TMRCA. This $T_P$ is of course never greater than T, and equals T only when every pair coalesces at the root. The ratio $T/T_P$ will depend on the structure of the particular tree and the mutation times but will usually be in the range 1 to 3.
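A small Monte Carlo experiment illustrates the bias. Everything in the setup is a made-up example: four people, a root T = 200 generations ago, two sub-branches that each split 50 generations ago, a single-branch rate of 0.01, and 4000 independent markers. For this tree the mean pairwise TMRCA is $T_P = (4\cdot 50 + 8\cdot 200)/12 = 150$, so the corrected sample variance should average near $\mu T_P = 1.5$ rather than $\mu T = 2.0$.

```python
import random


def drift(generations, mu, rng):
    """Net change in repeat count along one lineage segment (single-branch model)."""
    d = 0
    for _ in range(generations):
        r = rng.random()
        if r < mu / 2:
            d += 1
        elif r < mu:
            d -= 1
    return d


def simulate_clade(mu, T, T_recent, n_markers, seed=1):
    """Average corrected sample variance over markers for a hypothetical
    4-person tree: root T generations ago, two pairs splitting T_recent ago."""
    rng = random.Random(seed)
    N = 4
    total = 0.0
    for _ in range(n_markers):
        a = drift(T - T_recent, mu, rng)     # shared history of persons 0 and 1
        b = drift(T - T_recent, mu, rng)     # shared history of persons 2 and 3
        D = [a + drift(T_recent, mu, rng), a + drift(T_recent, mu, rng),
             b + drift(T_recent, mu, rng), b + drift(T_recent, mu, rng)]
        m = sum(D) / N
        total += sum((d - m) ** 2 for d in D) / (N - 1)   # N/(N-1) * s^2
    return total / n_markers


mu, T, T_recent = 0.01, 200, 50
estimate = simulate_clade(mu, T, T_recent, n_markers=4000)
T_P = (4 * T_recent + 8 * T) / 12            # mean over the 12 ordered pairs
print(estimate, mu * T_P, mu * T)
```

Dividing the simulated variance by the rate gives an estimate close to 150 generations, not the 200 generations to the root.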

3 References

Walsh, B. 2001, The Genetics Society of America, http://www.genetics.org/cgi/reprint/158/2/897

http://en.wikipedia.org/wiki/Gamma_distribution

http://mathworld.wolfram.com/SampleVarianceDistribution.html

Chandler, J. 2006, Journal of Genetic Genealogy 2:27-33
