allele frequencies as stochastic processes: mathematical & statistical approaches

Allele Frequencies as Stochastic Processes: Mathematical and Statistical Approaches. Gota Morota, Nov 30, 2010

Upload: gota-morota

Post on 04-Jul-2015


DESCRIPTION

Presented at Animal Breeding & Genomics Seminar. University of Wisconsin-Madison.

TRANSCRIPT

Page 1: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Allele Frequencies as Stochastic Processes: Mathematical and Statistical Approaches

Gota Morota

Nov 30, 2010

Page 2: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Outline

Change of Allele Frequencies as Stochastic Processes

Steady State Distributions of Allele Frequencies

Time Series Analysis


Page 5: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Various factors affecting allele frequencies

• Selection, mutation and migration (cross breeding) ⇒ systematic pressures (Wright 1949)

• Random fluctuations
  1. Random sampling of gametes (genetic drift)
  2. Random fluctuation in systematic pressures

Allele frequencies are functions of the systematic forces and the random components.

Page 6: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Random walk ⇒ Brownian Motion

Figure 1: Time = [1:10]

Figure 2: Time = [1:100]

Figure 3: Time = [1:1000]

Figure 4: Time = [1:10000]
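The passage from a random walk to Brownian motion can be illustrated with a short simulation. This is only a sketch under assumed settings: the function names, the ±1 step distribution, the seed and the sample sizes are illustrative choices, not taken from the slides. Each step has size sqrt(dt), so the variance accumulated up to time t is t, matching the N(0, t) law of standard Brownian motion.

```python
import math
import random


def scaled_random_walk(n_steps, t_end=1.0, rng=None):
    """Path of a +/-1 random walk rescaled so that Var[X(t_end)] = t_end."""
    rng = rng or random.Random()
    dt = t_end / n_steps      # time elapsed per step
    step = math.sqrt(dt)      # step size sqrt(dt) adds variance dt per step
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + step * rng.choice((-1.0, 1.0)))
    return path


def sample_mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var


rng = random.Random(1)  # fixed seed, only for reproducibility
endpoints = [scaled_random_walk(200, 1.0, rng)[-1] for _ in range(2000)]
end_mean, end_var = sample_mean_and_variance(endpoints)
print(f"mean = {end_mean:.3f}, variance = {end_var:.3f}")
```

With t_end = 1 the endpoint mean should be close to 0 and the variance close to 1; increasing n_steps refines the path without changing that variance.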

Page 7: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Brownian Motion ⇒ Diffusion Model

Figure 5: Time = [1:10000]

+ conditional on systematic forces

• treat change of allele frequencies as a stochastic process

Diffusion Model

Page 8: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Diffusion Model

It frames the infinite number of paths that allele frequencies could take over time under certain systematic pressures.

[Figure: three panels; x-axis: Time (0 to 10000), y-axis: Allele Frequency]

• pick a single time point t (say 5000 in the plots above)

• try to find the PDF at point t

• need to solve a partial differential equation (PDE)

• Fokker-Planck Equation!

Page 9: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Fokker-Planck Equation

• Derived from a continuous time stochastic process (X)
• Partial differential equation

$$
\frac{\partial \phi(p, x; t)}{\partial t}
= \frac{1}{2}\frac{\partial^2}{\partial x^2}\left\{V_{\delta x}\,\phi(p, x; t)\right\}
- \frac{\partial}{\partial x}\left\{M_{\delta x}\,\phi(p, x; t)\right\} \tag{1}
$$

where
• p: initial allele frequency (fixed)
• x: allele frequency (random variable)
• t: time (continuous variable)
• φ(p, x; t): PDF
• V_δx: variance of δx (amount of change in allele frequency per unit time)
• M_δx: mean of δx (amount of change in allele frequency per unit time)
• V_δx and M_δx: both may depend on x and t

Page 10: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Fokker-Planck Equation for Brownian Motion

A standard Brownian motion can be constructed from a random walk whose errors have mean 0 and variance 1, under the right scaling. Its PDF at time t is N(0, t).

• when t = 1.0, N(0, 1)
• when t = 1.5, N(0, 1.5)

Fokker-Planck equation (M_δx = 0 and V_δx = 1 in equation (1)):

$$
\frac{\partial \phi(p, x; t)}{\partial t}
= \frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(p, x; t) \tag{2}
$$

= Heat equation (3)

Solution:

$$
\phi(p, x; t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right) \tag{4}
$$
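As a sanity check on equations (2) and (4), one can verify numerically that the N(0, t) density satisfies the heat equation. The sketch below uses central finite differences; the function names, the step size h and the test points are assumptions made for illustration only.

```python
import math


def heat_kernel(x, t):
    """Density of N(0, t) at x, i.e. the solution in equation (4)."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def fpe_residual(x, t, h=1e-3):
    """Finite-difference estimate of d(phi)/dt - 0.5 * d2(phi)/dx2."""
    dphi_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
    d2phi_dx2 = (
        heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t) + heat_kernel(x - h, t)
    ) / (h * h)
    return dphi_dt - 0.5 * d2phi_dx2


for x, t in [(0.5, 1.0), (-1.2, 1.5)]:
    print(f"x = {x}, t = {t}: residual = {fpe_residual(x, t):.2e}")
```

The residual is not exactly zero because of discretization error, but it should shrink roughly like h squared as h decreases.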

Page 11: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Solution of the Heat Equation (the Heat Kernel)

[Figure: the heat kernel plotted against x from −2 to 2 for t = 0.00001, 0.01, 0.1, 1 and 10]

Page 12: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Under Random Genetic Drift

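$$
M_{\delta x} = 0, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}
$$

Fokker-Planck equation for random genetic drift:

$$
\frac{\partial \phi(p, x; t)}{\partial t}
= \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} \tag{5}
$$

Solutions are obtained as infinite series by...

• Kimura (1955): hypergeometric function

• Korn and Korn (1968): Gegenbauer polynomial

$$
\phi = 6p(1 - p)\exp\left(-\frac{t}{2N_e}\right)
+ 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots
$$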
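The two leading terms of the series can be evaluated directly. The following is a rough sketch, not Kimura's full solution: it truncates the series after two terms, so it is only reasonable for large t, and the function names and parameter values are illustrative assumptions. Integrating over x gives the approximate probability that the locus is still segregating; the second term integrates to zero, leaving 6p(1 − p)exp(−t/(2Ne)).

```python
import math


def drift_density_two_terms(x, p, t, ne):
    """First two terms of the series for phi(p, x; t) under pure drift."""
    q = 1.0 - p
    first = 6.0 * p * q * math.exp(-t / (2.0 * ne))
    second = (
        30.0 * p * q * (1.0 - 2.0 * p) * (1.0 - 2.0 * x)
        * math.exp(-3.0 * t / (2.0 * ne))
    )
    return first + second


def prob_segregating(p, t, ne, n=1000):
    """Midpoint-rule integral of the truncated density over 0 < x < 1."""
    h = 1.0 / n
    return h * sum(drift_density_two_terms((i + 0.5) * h, p, t, ne) for i in range(n))


p0, generations, ne = 0.3, 400, 50  # hypothetical values with t = 8 * Ne
print(prob_segregating(p0, generations, ne))
print(6.0 * p0 * (1.0 - p0) * math.exp(-generations / (2.0 * ne)))
```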

Page 13: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Solution of FPE (Kimura 1955)

Excerpt from GENETICS: MOTOO KIMURA, Vol. 41, 1955, p. 149:

FIGS. 1-2. The processes of the change in the probability distribution of heterallelic classes, due to random sampling of gametes in reproduction. It is assumed that the population starts from the gene frequency 0.5 in Fig. 1 (left) and 0.1 in Fig. 2 (right). t = time in generation; N = effective size of the population; abscissa is gene frequency; ordinate is probability density.

The probability of heterozygosis is calculated by equation (15):

$$
H_t = \int_0^1 2x(1 - x)\,\phi(x, t)\,dx ,
$$

where each term i of the series contains an integral of the form

$$
\int_{-1}^{1} (1 - z^2)\,T^1_{i-1}(z)\,e^{-[i(i+1)/4N]t}\,dz .
$$

By virtue of equation (14) (put m = 0), the last integral is 0 except for i = 1. Hence

$$
H_t = 2pq\,e^{-(1/2N)t} = H_0\,e^{-(1/2N)t}, \tag{18}
$$

showing that the heterozygosis decreases exactly at the rate of 1/(2N) per generation. This is readily confirmed by a simple calculation: Let p be the frequency of A in the population, where the frequency of the heterozygotes is 2p(1 − p). The amount of heterozygosis to be expected after one generation of random sampling of the gametes is

$$
E\{2(p + \delta p)(1 - p - \delta p)\} = 2p(1 - p) - 2E(\delta p)^2
= 2p(1 - p) - 2\,\frac{p(1 - p)}{2N} = \left(1 - \frac{1}{2N}\right)2p(1 - p),
$$

as was to be shown.
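The one-generation calculation quoted above can be checked exactly by summing over the binomial distribution of gamete counts. This is a minimal sketch; the function name and the values of p and N are hypothetical, and the binomial sampling of 2N gametes is the standard Wright-Fisher assumption rather than something stated on the slide.

```python
import math


def expected_heterozygosity_next_gen(p, n_diploid):
    """Exact E[2x'(1 - x')] after binomial sampling of 2N gametes at frequency p."""
    two_n = 2 * n_diploid
    total = 0.0
    for k in range(two_n + 1):
        prob = math.comb(two_n, k) * p**k * (1.0 - p) ** (two_n - k)
        x_next = k / two_n  # allele frequency in the next generation
        total += prob * 2.0 * x_next * (1.0 - x_next)
    return total


p, n_diploid = 0.3, 50  # hypothetical values
exact = expected_heterozygosity_next_gen(p, n_diploid)
predicted = (1.0 - 1.0 / (2 * n_diploid)) * 2.0 * p * (1.0 - p)
print(exact, predicted)
```

Both numbers should agree up to floating-point rounding, which is the discrete-generation counterpart of the exponential decay in equation (18).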

Page 14: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Under Selection and Random Genetic Drift

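$$
M_{\delta x} = sx(1 - x), \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}
$$

$$
\frac{\partial \phi(p, x; t)}{\partial t}
= \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\}
- s\frac{\partial}{\partial x}\left\{x(1 - x)\,\phi(p, x; t)\right\} \tag{6}
$$

Solutions are obtained as an infinite series via the oblate spheroidal equation, after transforming the allele frequency (z = 1 − 2x)
• Kimura (1955)
• Kimura and Crow (1956)

$$
\phi(p, x, t) = \sum_{k=0}^{\infty} C_k \exp(-\lambda_k t + 2cx)\,V^{(1)}_{1k}(z) \tag{7}
$$

where

$$
V^{(1)}_{1k}(z) = \sum_{n=0,1} f^{k}_{n}\,T^{1}_{n}(z)
$$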

Page 15: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Kolmogorov Backward Equation

• Derived from a continuous time stochastic process (P)
• Partial differential equation

$$
\frac{\partial \phi(p, x; t)}{\partial t}
= \frac{1}{2}V_{\delta p}\frac{\partial^2}{\partial p^2}\phi(p, x; t)
+ M_{\delta p}\frac{\partial}{\partial p}\phi(p, x; t) \tag{8}
$$

where
• p: initial allele frequency (random variable)
• x: allele frequency (a random variable in general, but x at time t is held fixed)
• t: time (continuous variable)
• φ(p, x; t): PDF
• V_δp: variance of δp (amount of change in allele frequency)
• M_δp: mean of δp (amount of change in allele frequency)
• V_δp and M_δp: both may depend on p but not on t (time homogeneous)

Page 16: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Steady State Distribution of Allele Frequencies

Equilibrium
• single point (balance between various forces that keep allele frequencies near equilibrium)
• PDF

PDF of stable equilibrium instead of single point

Steady state allele frequency distribution
• Fisher (1922), (1930)
• Wright (1931), (1937), (1938)

φ(p, x; t) = solution of a Fokker-Planck equation (9)

$$
\lim_{t \to \infty} \phi(p, x; t) = \phi(x) \tag{10}
$$

$$
\phi(x) = \frac{C}{V_{\delta x}}\exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right) \tag{11}
$$
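Formula (11) can be evaluated numerically for any choice of M_δx and V_δx. The sketch below assumes a mutation-only example (M_δx = −ux + v(1 − x), V_δx = x(1 − x)/(2Ne)) with made-up parameter values, and compares the result with the shape x^(4Ne·v − 1)(1 − x)^(4Ne·u − 1) that the formula implies for that case. The function names, the reference point x0 and the midpoint rule are illustrative choices; ratios of densities are compared because the constant C is left unspecified.

```python
import math


def wright_density_unnormalized(x, mean_fn, var_fn, x0=0.5, n=2000):
    """(1 / V(x)) * exp(2 * integral from x0 to x of M/V), midpoint rule."""
    h = (x - x0) / n
    integral = h * sum(
        mean_fn(x0 + (i + 0.5) * h) / var_fn(x0 + (i + 0.5) * h) for i in range(n)
    )
    return math.exp(2.0 * integral) / var_fn(x)


NE, U, V_MUT = 100, 0.01, 0.01  # hypothetical effective size and mutation rates


def mutation_mean(x):
    return -U * x + V_MUT * (1.0 - x)


def drift_variance(x):
    return x * (1.0 - x) / (2.0 * NE)


def beta_shape(x):
    return x ** (4 * NE * V_MUT - 1) * (1.0 - x) ** (4 * NE * U - 1)


numeric_ratio = (
    wright_density_unnormalized(0.3, mutation_mean, drift_variance)
    / wright_density_unnormalized(0.6, mutation_mean, drift_variance)
)
closed_ratio = beta_shape(0.3) / beta_shape(0.6)
print(numeric_ratio, closed_ratio)
```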

Page 17: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Steady State Distribution – Random Genetic Drift

For a large value of t, only the first few terms have an impact on the actual form of the PDF.

$$
\phi = 6p(1 - p)\exp\left(-\frac{t}{2N_e}\right)
+ 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots
$$

Asymptotic formula:

$$
\lim_{t \to \infty} \phi = C \cdot \exp\left(-\frac{t}{2N_e}\right)
$$

Page 18: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Graphical Representation (Wright 1931)

Excerpt from SEWALL WRIGHT (1931), p. 114:

Before finally accepting this solution, however, it will be well to examine the terminal conditions. The amount of fixation at the extremes if N is large can be found directly from the Poisson series according to which the chance of drawing 0 where m is the mean number in a sample is e^{−m}. The contribution to the 0 class will thus be

$$
(e^{-1} + e^{-2} + e^{-3} \cdots) f = \frac{e^{-1}}{1 - e^{-1}} f = 0.582 f.
$$

[Figure: x-axis: Factor Frequency (25%, 50%, 75%)]

FIGURE 3. Distribution of gene frequencies in an isolated population in which fixation and loss of genes each is proceeding at the rate 1/4N in the absence of appreciable selection or mutation. y = L_0 e^{−T/2N}.

This is a little larger than the ½f deduced above and indicates a small amount of distortion near the ends due to the element of approximation involved in substituting integration for summation. The nature and amount of this distortion are indicated by the exact distributions obtained in the extreme cases of only 2 and 3 monoecious individuals.

Letting L_0 be the initial number of unfixed loci (pairs of allelomorphs) and T the number of generations we have approximately

Page 19: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Steady State Distribution – Selection and Mutation

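$$
M_{\delta x} = -ux + v(1 - x) + \frac{x(1 - x)}{2}\frac{d\bar{a}}{dx}, \qquad
V_{\delta x} = \frac{x(1 - x)}{2N_e}
$$

$$
\phi(x) = C \cdot \exp(2N_e\bar{a})\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \tag{12}
$$

When A has selective advantage s over a:

$$
\bar{a} = 2s \cdot x^2 + s \cdot 2x(1 - x) + 0 \cdot (1 - x)^2 = 2sx
$$

$$
\phi(x) = C \cdot \exp(4N_e s x)\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \tag{13}
$$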
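For readers who want the intermediate step, equation (12) follows from the steady state formula (11) by direct integration. The derivation below is a sketch using only the M_δx and V_δx given on this slide; the factor 2Ne that comes from 1/V_δx is absorbed into the constant C.

$$
\begin{aligned}
2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx
&= 4N_e \int \left[\frac{v}{x} - \frac{u}{1 - x}\right] dx
 + 2N_e \int \frac{d\bar{a}}{dx}\,dx \\
&= 4N_e v \ln x + 4N_e u \ln(1 - x) + 2N_e \bar{a}, \\
\phi(x) &= \frac{C}{V_{\delta x}}\exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right)
 = C' \exp(2N_e\bar{a})\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1}.
\end{aligned}
$$

Substituting ā = 2sx then gives the exponent 4Ne·s·x of equation (13).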

Page 20: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Graphical Representation (Wright 1937)

[Figures 1, 2, 4, 5, 6 and 8 from GENETICS: S. WRIGHT, PROC. N. A. S., p. 308]

Page 21: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Time Series Analysis

When a variable is measured sequentially in time, the resulting data form a time series.

• Diffusion Model – Continuous time stochastic process

• Time Series – Discrete time stochastic process

Page 22: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Basic Models

Observations close together in time tend to be correlated

• Autoregressive Model: AR(p)

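$$
X_t = c + \sum_{i=1}^{p} \psi_i X_{t-i} + \varepsilon_t \tag{14}
$$

• Moving Average Model: MA(q)

$$
X_t = c + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t \tag{15}
$$

• Autoregressive Moving Average Model: ARMA(p, q)

$$
X_t = \mathrm{AR}(p) + \mathrm{MA}(q) \tag{16}
$$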
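The statement that nearby observations are correlated can be made concrete with the simplest case of equation (14), an AR(1) process. The sketch below is illustrative only: the coefficient 0.6, the Gaussian noise, the burn-in length and the seed are assumptions. For a stationary AR(1) the theoretical autocorrelation at lag k is ψ^k, which the sample estimate should approach.

```python
import random


def simulate_ar1(psi, n, c=0.0, sigma=1.0, rng=None, burn_in=200):
    """Simulate X_t = c + psi * X_{t-1} + eps_t with Gaussian noise (|psi| < 1)."""
    rng = rng or random.Random()
    x = c / (1.0 - psi)  # start at the stationary mean
    series = []
    for step in range(n + burn_in):
        x = c + psi * x + rng.gauss(0.0, sigma)
        if step >= burn_in:  # discard the burn-in period
            series.append(x)
    return series


def sample_autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom


rng = random.Random(3)  # fixed seed, only for reproducibility
series = simulate_ar1(0.6, 5000, rng=rng)
r1 = sample_autocorrelation(series, 1)
r2 = sample_autocorrelation(series, 2)
print(f"lag 1: {r1:.3f} (theory 0.6), lag 2: {r2:.3f} (theory 0.36)")
```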

Page 23: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Time Series as a Polynomial Equation

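$$
B^k X_t = X_{t-k} \quad \text{(back shift operator)}
$$

• AR(p)

$$
\begin{aligned}
X_t &= \psi_1 X_{t-1} + \cdots + \psi_p X_{t-p} + \varepsilon_t \\
X_t &= (\psi_1 B + \cdots + \psi_p B^p) X_t + \varepsilon_t \\
(1 - \psi_1 B - \cdots - \psi_p B^p) X_t &= \varepsilon_t
\end{aligned}
$$

• ARMA(p, q)

$$
\begin{aligned}
X_t - \psi_1 X_{t-1} - \cdots - \psi_p X_{t-p} &= \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \\
(1 - \psi_1 B - \cdots - \psi_p B^p) X_t &= (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t
\end{aligned}
$$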
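One practical use of the polynomial form is a stationarity check: an AR process is stationary when every root of 1 − ψ1·z − ... − ψp·z^p = 0 lies outside the unit circle. The sketch below covers only the AR(2) case, where the roots come from the quadratic formula; the function names and the example coefficients are illustrative assumptions.

```python
import cmath


def ar2_characteristic_roots(psi1, psi2):
    """Roots z of the AR polynomial 1 - psi1*z - psi2*z**2 = 0."""
    if psi2 == 0.0:
        # Degenerates to AR(1) (one root) or white noise (no roots).
        return [] if psi1 == 0.0 else [complex(1.0 / psi1)]
    disc = cmath.sqrt(psi1 * psi1 + 4.0 * psi2)
    return [(-psi1 + disc) / (2.0 * psi2), (-psi1 - disc) / (2.0 * psi2)]


def is_stationary_ar2(psi1, psi2):
    """True when every root of the AR polynomial lies outside the unit circle."""
    return all(abs(z) > 1.0 for z in ar2_characteristic_roots(psi1, psi2))


print(is_stationary_ar2(0.5, 0.3))  # roots near 1.17 and -2.84
print(is_stationary_ar2(1.0, 0.0))  # random walk: root exactly on the unit circle
print(is_stationary_ar2(0.5, 0.6))  # one root inside the unit circle
```

The random walk case (ψ1 = 1) has its root exactly at z = 1, which is why it needs differencing before an ARMA model is fitted.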

Page 24: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Stationary Process

The mean and variance do not change over time. No trend.

Not stationary

Figure 6: Random Walk

Looks stationary

Figure 7: Detrended

Detrending:

• linear regression

• take a difference

• Autoregressive Integrated Moving Average: ARIMA(p,d,q)
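The two detrending options listed above can be sketched in a few lines. The helper names are hypothetical; `difference` applies the d-th order differencing that the "I" in ARIMA(p,d,q) refers to, and `linear_detrend` returns the residuals of an ordinary least squares line fitted against the time index.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: X_t - X_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return list(series)


def linear_detrend(series):
    """Residuals after fitting a least squares line against t = 0, 1, 2, ..."""
    n = len(series)  # needs at least two points
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


print(difference([1, 3, 6, 10]))
print(difference([1, 3, 6, 10], d=2))
print(linear_detrend([2.0, 2.5, 3.0, 3.5]))
```

Differencing shortens the series by one observation per pass, while regression detrending keeps its length.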

Page 25: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Application on Allele Frequencies

• Influential SNPs – indicative of deterministic trends

• Uninfluential SNPs – random fluctuation?

• Diffusion Model – assumed Markovian process

• Time Series – which model describes the process of change of allele frequencies?

Application

• Objective: model the process of change of allele frequencies

• Data: SNP genotypes of 4,798 Holstein bulls with 38,416 markers and milk yield

• Genotype imputation: FastPhase 1.4

• Estimation of marker effects: BayesCπ

Page 26: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

BayesCπ

Analysis of human mini-exome sequencing data using a Bayesian hierarchical mixture model: Genetic Analysis Workshop 17

Bueno Filho JS (1,2), Morota G (1), Tran QT (3), Maenner MJ (4), Vera-Cala LM (4,5), Engelman CD (4), and Meyers KJ (4)

1 Department of Dairy Science, University of Wisconsin-Madison, USA
2 Departamento de Ciencias Exatas, Universidade Federal de Lavras, Brasil
3 Department of Statistics, University of Wisconsin-Madison, USA
4 Department of Population Health Sciences, University of Wisconsin-Madison, USA
5 Departamento de Salud Publica, Universidad Industrial de Santander, Colombia

Bueno Filho JS and Morota G contributed equally to this work. Corresponding authors: Engelman CD and Meyers KJ.

Figure 8: GAW17

Page 27: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Allele Frequency of the Top Marker

[Figure: two panels, "Original" and "Detrended"; x-axis: Time (0 to 30), y-axis: Allele Frequency]

Figure 9: Time plots of allele frequencies. Top: Original series. Bottom: Smoothed by taking the first order difference.

Page 28: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Autocorrelation and Partial Autocorrelation

ARIMA(1,1,1)?

[Figure: four panels showing ACF and partial ACF against lag (0 to 14) for the original series and for the first order difference series]

Figure 10: ACF and PACF

Page 29: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Model Selection

Table 1: Comparison of several competitive models

Model           AIC       Model           AIC
ARIMA(1,0,0)   -51.56     ARIMA(1,1,0)   -52.47
ARIMA(0,1,0)   -49.38     ARIMA(1,0,1)   -51.13
ARIMA(0,0,1)   -46.41     ARIMA(1,1,1)   -51.02

ARIMA(1,1,0)

$$
X_t = 0.635\,X_{t-1} + \varepsilon_t
$$
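The selection rule behind Table 1 is to prefer the model with the smallest AIC, where AIC = 2k − 2 ln L for a model with k estimated parameters and maximized likelihood L. The sketch below simply re-reads the AIC values from the table; the dictionary and function names are illustrative, and no model is refitted here.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# AIC values copied from Table 1.
aic_table = {
    "ARIMA(1,0,0)": -51.56,
    "ARIMA(0,1,0)": -49.38,
    "ARIMA(0,0,1)": -46.41,
    "ARIMA(1,1,0)": -52.47,
    "ARIMA(1,0,1)": -51.13,
    "ARIMA(1,1,1)": -51.02,
}

best_model = min(aic_table, key=aic_table.get)
# Difference from the best model; 0 marks the selected one.
delta_aic = {m: round(a - aic_table[best_model], 2) for m, a in aic_table.items()}

print(best_model)
for model, delta in sorted(delta_aic.items(), key=lambda item: item[1]):
    print(f"{model}: delta AIC = {delta:.2f}")
```

The gaps between the leading candidates are small (under 1.5 AIC units for four of the six models), so the ranking should be read with some caution.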

Page 30: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Advanced Models

Time dependent variance

• ARCH (Autoregressive Conditional Heteroskedasticity)

• GARCH (Generalized Autoregressive Conditional Heteroskedasticity)

Multivariate

• VARMA (Vector Autoregression Moving Average)

• BVARMA (Bayesian Vector Autoregression Moving Average)
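To show what "time dependent variance" means in practice, here is a minimal GARCH(1,1) simulation. It assumes the usual recursion sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1} with Gaussian innovations; the parameter values, seed and function name are illustrative and are not taken from the slides. When alpha + beta < 1, the long-run variance is omega / (1 − alpha − beta).

```python
import math
import random


def simulate_garch11(n, omega, alpha, beta, rng=None, burn_in=500):
    """Simulate GARCH(1,1) shocks and their conditional variances."""
    rng = rng or random.Random()
    sigma2 = omega / (1.0 - alpha - beta)  # long-run variance (alpha + beta < 1)
    eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
    shocks, variances = [], []
    for step in range(n + burn_in):
        # Today's variance depends on yesterday's squared shock and variance.
        sigma2 = omega + alpha * eps * eps + beta * sigma2
        eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            shocks.append(eps)
            variances.append(sigma2)
    return shocks, variances


rng = random.Random(7)  # fixed seed, only for reproducibility
shocks, variances = simulate_garch11(20000, omega=0.1, alpha=0.1, beta=0.8, rng=rng)
sample_var = sum(e * e for e in shocks) / len(shocks)
print(f"sample variance = {sample_var:.3f}, long-run variance = 1.000")
```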

Page 31: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Intersection of Mathematics and Statistics

Under certain conditions

GARCH(1,1) ≈ Diffusion Model!

Page 32: Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches

Thank you!
