PART II: Recombination and selection

Summary of assumptions so far

Until now, the course has concentrated on heavily simplified models. We have covered the roles of chance (parent choice), demography (population size) and mutation in shaping genetic diversity.

• Neutral Wright-Fisher models
– Discrete generations
– Population size N
– Parents chosen at random
– Mutation probability m
– Sample history can be constructed

1. Parents are always chosen at random. This means no new mutations affect the reproductive success of parents, i.e. there is no “selection” for positive or negative changes. This is also called neutrality.

2. A gene (i.e. a segment of DNA) is always inherited as a complete unit from the chosen parent. This assumes no “recombination”. Recombination is a process allowing different positions along a segment of DNA to be inherited from distinct parents.

Why relax these assumptions?

In fact, we are (obviously!) evolved to adapt to our environment.

This process occurs through natural selection. Some new mutations are favoured, because those carrying them have more children on average. To quantitatively study evolution, we need models incorporating this idea.

Why relax these assumptions?

Our genome has essential functions. Many new mutations would disrupt this function (far more than confer useful new advantages), so they must be prevented from becoming common in the population.

This process also occurs through natural selection. Some new mutations are “deleterious”, because those carrying them have fewer children on average. Selection can act in both directions.

Disease               Population frequency
Sickle cell anemia    1 in 625 (African Americans)
Cystic fibrosis       1 in 2,000 (Europeans)
Tay-Sachs disease     1 in 3,000 (US Jewish population)
Haemophilia           1 in 10,000
Galactosemia          1 in 57,000

Example: Human data

Data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005)

[Figure: pairwise association between sites, measured by D’]

According to the assumptions so far, a region has a history given by a tree

We should not see any obvious decay of association between sites with distance

What’s going on?

1.1 Why relax these assumptions? Recombination

In humans, and many other species, a process of recombination occurs:

This can mean different positions on a chromosome are inherited from different chromosomes in the parental generation.

So they have different histories. Our models for genetic data need to allow for this. We will begin by thinking about recombination (without selection initially).

We have chromosome pairs, one inherited from each parent

Only one of the two maternal (or paternal) copies is passed down

Almost always, rather than choosing one or other, a mosaic is constructed

[Figure: chromosome pairs labelled Father, Mother and Child]

PART II: Recombination and selection

• We will extend our theory to cover the other two main biological forces driving genetic variation, evolution, and e.g. disease risk:

• Recombination
– The effect of recombination on ancestry
– Detecting historical recombination
– Incorporating recombination into the coalescent framework
– Properties of the “ARG”
– Real inference of recombination rate

• Natural selection
– The fate of individual mutations
– Modelling selection
– Properties of selected alleles

1.2 Recombination model

• Suppose we are thinking about a segment D of DNA in a single chromosome
– Sites $\{1, 2, \ldots, S\}$
– If S is large, it is reasonable to think of this as a continuous segment D=[0,1]
– In a single generation, at most one recombination can occur in D:
  · Probability 1-r: single parent chromosome
  · Probability r: two parents chosen
– When recombination occurs, we pick the (left) breakpoint B from a density function f on D.
– We will normally assume (wlog): $B \sim U(D)$
– In humans, the per site per generation recombination rate averages ~$1 \times 10^{-8}$, versus a mutation rate of $1.3 \times 10^{-8}$.

We begin by considering a general population, including recombination
• Generations shown as discrete only for simplicity
• How do we represent histories with recombination?

Later we will add additional modelling assumptions (random mating, etc).

Each chromosome chooses its parent(s) in the previous generation:
• Single parent, probability 1-r
• Two parents, probability r. Denote by double arrow, left parent single line, right parent double line. The breakpoint has probability density function f








We can trace ancestral histories in the new setting

At a recombination event, choose the appropriate ancestor

Consider site 1 (position 0 in [0,1]).






Now consider site S (position 1 in [0,1])


Always choose the right hand ancestor at a recombination event






Site S/2 (position 0.5 in [0,1])


Ancestor choice depends on position of recombination event
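The tracing rule above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `sample_parents` and `ancestor_at` are hypothetical, the breakpoint is taken to be uniform on [0,1], and a site lying exactly at the breakpoint is (arbitrarily) assigned to the right-hand parent.

```python
import random

def sample_parents(num_chromosomes, r, rng):
    """Choose parents for one generation (hypothetical helper).

    Returns a list with one (left_parent, right_parent, breakpoint) entry
    per chromosome. With probability 1 - r a single parent is chosen
    (breakpoint is None); with probability r two parents are chosen and
    the breakpoint is drawn uniformly on [0, 1].
    """
    parents = []
    for _ in range(num_chromosomes):
        left = rng.randrange(num_chromosomes)
        if rng.random() < r:
            right = rng.randrange(num_chromosomes)
            parents.append((left, right, rng.random()))
        else:
            parents.append((left, left, None))
    return parents

def ancestor_at(position, child, generations):
    """Trace `child` backward at `position` in [0, 1].

    `generations` lists the parent tables, most recent generation first.
    Material to the left of a breakpoint comes from the left parent,
    material to the right from the right parent.
    """
    current = child
    for parents in generations:
        left, right, breakpoint = parents[current]
        if breakpoint is None or position < breakpoint:
            current = left
        else:
            current = right
    return current

# Usage: ten chromosomes, five generations back, recombination probability 0.3.
rng = random.Random(1)
history = [sample_parents(10, 0.3, rng) for _ in range(5)]
ancestors = [ancestor_at(x, 0, history) for x in (0.0, 0.5, 1.0)]
```

Tracing the same chromosome at positions 0, 0.5 and 1 can return different ancestors, which is exactly why different sites can have different histories.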

1.3 Marginal trees

• With recombination, we can still draw a genealogical tree at each site. At a position x in [0,1], we define the marginal tree T(x) to be the genealogical tree at x.

• In general, T(x) depends on x.

• The TMRCA can also change along the sequence.

• Tree change points are a subset of the recombination positions

Time

T(0) T(0.5) T(1)

In humans, genealogical trees are typically hundreds of thousands of years deep (tens of thousands of generations)

For a recombination event at x, T(x-) and T(x+) can be, but are not always, different (see problem sheet)

Question: Is this the best way to summarise information about the history of the sample?

1.4 The ancestral recombination graph

• Individual trees for each site are cumbersome

• They are not sufficient in general to reconstruct all historical recombination events
– This is problematic if recombination is the focus of interest

• The ancestral recombination graph (ARG) solves this problem (Griffiths 1991, Griffiths and Marjoram 1997, Hudson 1983)

• Provides an efficient way to record the history of a sample with recombination, without losing information

This is a directed, acyclic graph of degree three. Nodes correspond to ancestors of the sample

1.5 The ARG

Every ancestral recombination graph corresponds uniquely to an ancestral history of the sample

Join edges when ancestors coalesce

Each tip corresponds to an individual chromosomal segment

Split edges at recombination events. Left branch contributes material to left of break

Eventually a most recent common ancestor (MRCA) will be reached

[Figure: example ARG, vertical axis Time, with recombination breakpoints labelled 0.2, 0.6, 0.7 and 0.9]

1.6 Example ARGs

• Recombination events can change the tree “topology” (a)
• Can leave the tree “topology” unchanged but alter the times in the tree (b)
• Can leave the tree completely unchanged (c)
• Sample size n=4, single recombination event

[Figure: panels (a), (b), (c)]

1.7 (Embedded) Marginal trees

Marginal trees are recovered from the graph by taking the appropriate branch at each recombination event

[Figure: the example ARG (breakpoints 0.2, 0.6, 0.7, 0.9) and its embedded marginal trees T(0), T(0.5), T(1); vertical axis Time]

1.8 Embedded subgraphs

• To obtain the ARG for a subregion, say [a,b], we take the ARG for [0,1] and remove recombination events in [0,a) or (b,1], together with, respectively, the left and right edges ancestral to these recombination events.
– These events occur outside [a,b]
– They therefore cannot affect the history of this subregion, so must be outside the subregion ARG
– Essentially we “drop” irrelevant edges

[Figure: the example ARG (breakpoints 0.2, 0.6, 0.7, 0.9) reduced to the subregion [0.4,0.8]]

2.1 Mutations in the ARG

• Suppose a mutation occurs in some sample ancestor
• Add a mark to the ARG, at the appropriate position in [0,1], at the place corresponding to that ancestor
• The entire mutational history can be placed on the graph.

[Figure: the example ARG (breakpoints 0.2, 0.6, 0.7, 0.9) with mutations marked at positions 0.05, 0.3, 0.35, 0.4, 0.5, 0.65, 0.72, 0.75, 0.8 and 0.9]

Sequence  0.05  0.3  0.35  0.4  0.5  0.65  0.72  0.75  0.8  0.9
1         1     0    1     0    1    0     1     0     0    1
2         1     1    1     0    0    1     1     0     0    1
3         1     1    1     0    0    1     1     0     0    1
4         1     1    1     1    0    1     1     0     0    1
5         0     1    1     0    0    1     1     0     0    1
6         0     0    0     0    0    1     0     1     0    1
7         0     0    0     0    0    1     0     1     1    1
8         0     0    0     0    0    1     0     1     0    1

2.2 The effect of recombination on data

• Suppose we are interested in performing inference on how much recombination there has been

• We cannot directly observe the ARG
• Instead, we need to indirectly infer recombination using mutation patterns in data
• Later we will investigate in depth stochastic models of the effect of recombination
• These can be used to obtain parametric estimates of recombination rate parameters

• An alternative approach is to not impose a particular model, but simply try to count how many recombination events occurred in a sample history

Advantages:
– Simple, easy to interpret in terms of counts, robust, requires few assumptions
– Provides insight into the relationship between data and recombination history

Disadvantages:
– Hard to interpret results in terms of underlying recombination parameters
– Misses many recombination events
– Difficult to quantify uncertainty about how many events occurred

2.3 Reminder of infinite sites model

• If we assume the infinite sites model, then you have seen the following result, the “4-gamete test”:

Definition 2.3.1: Infinitely-many-sites model

Mutations occur at positions on the DNA sequences never before mutant. Every mutation occurring in the coalescent tree on an edge occurs in all genes subtended below the edge.

Proposition 2.3.2: Compatibility of mutations with the point mutations assumption

An n × s 0-1 matrix is compatible with a gene tree if and only if the pattern

0 0
0 1
1 0
1 1

does not occur in any two columns and four rows. If the ancestral type is known and always denoted by 0, the first row of the pattern can be removed from the condition.

• Question: Is this result respected if recombination occurs?

Example: recombination causes violation of the 4-gamete test

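[Figure: ARG with a single recombination event at breakpoint 0.2 and mutations at positions 0.15 and 0.75]

Sequence  0.15  0.75
1         1     0
2         1     1
3         0     1
4         0     0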

Note: only mutations on these two branches can violate the 4-gamete test, and that this occurs if and only if the blue mutation occurs to the left of 0.2, and the black mutation to the right of 0.2

2.4 Detecting recombination events (Hudson and Kaplan, 1985)

Lemma 2.4: The 4-gamete test

Suppose we have variation data for n individuals at s sites, represented as an n × s 0-1 matrix. Under the infinite sites model, if the pattern

0 0
0 1
1 0
1 1

occurs in two columns corresponding to positions x and y, then at least one recombination event must have occurred in the sample history, in the interval (x,y). If the ancestral type is known and denoted by 0 at x and y, then the first row of the pattern can be removed from the condition.

Proof

We prove the contrapositive.

Suppose there are no recombination events between x and y. Then the ancestral recombination graph for the interval [x,y] is simply a coalescent tree. Hence, by Proposition 2.3.2, the above pattern cannot occur in the data.
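As a concrete illustration of the lemma, the check for a pair of columns can be written as a short Python function. This is a minimal sketch, not a reference implementation: the name `shows_four_gametes` and the list-of-rows data layout are assumptions made here for illustration.

```python
def shows_four_gametes(data, x, y, ancestral_known=False):
    """Return True if columns x and y of a 0-1 matrix show all four gametes.

    `data` is a list of rows (one per sequence) and x, y are 0-based
    column indices. If the ancestral type is known and coded as 0, the
    pattern (0, 0) is treated as present even if it is not sampled.
    """
    observed = {(row[x], row[y]) for row in data}
    if ancestral_known:
        observed.add((0, 0))
    return len(observed) == 4

# The four-sequence example with mutations at positions 0.15 and 0.75.
example = [
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
]
recombination_required = shows_four_gametes(example, 0, 1)
```

For this example the function reports that a recombination event is required between the two sites, in line with the lemma.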

2.5 Hudson’s RM (Hudson and Kaplan 1985)

• Suppose we have sites 1,2,..,10 and the dataset:

Sequence  0.05  0.3  0.35  0.4  0.5  0.65  0.72  0.75  0.8  0.9
1         1     0    1     0    1    0     1     0     0    1
2         1     1    1     0    0    1     1     0     0    1
3         1     1    1     0    0    1     1     0     0    1
4         1     1    1     1    0    1     1     0     0    1
5         0     1    1     0    0    1     1     0     0    1
6         0     0    0     0    0    1     0     1     0    1
7         0     0    0     0    0    1     0     1     1    1
8         0     0    0     0    0    1     0     1     0    1

How many recombination events?


Proposition 2.5: Hudson’s RM

Under the infinite sites model with recombination, suppose we have data for n sequences at s (ordered) segregating sites 1,2,..,s. Then the following recursive procedure gives a minimum number of recombination events in the history of the sample, based on the results of the four gamete test.

Step 1: For all pairs (i,j), construct a matrix R where Rij=1 if sites i and j show all 4 gametes, and 0 otherwise.

Step 2: Set i=1,l=2 and RM=0

Step 3: If max{Rkl: k=i,..,l-1}=1 then increment RM by 1 and reset i’=l. Otherwise, set i’=i

Step 4: If l=s, terminate. Otherwise, set i=i’, l=l+1 and return to step 3.

Remark The idea here is to go from left to right, placing a recombination event whenever one is required by the 4-gamete test. Each new event must lie to the right of the rightmost event placed so far.

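Application of the algorithm

For a dataset with 16 sites, suppose the only non-zero entries of R are

R12, R16, R36, R67, R49, R5,10, R6,10, R7,11, R8,11, R6,15, R12,16

All other Rij=0. The values of i and RM at the start of each round l are:

l   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  -
i   1   2   2   2   2   6   7   7   7   7   11  11  11  11  11  -
RM  0   1   1   1   1   2   3   3   3   3   4   4   4   4   4   5

The last column gives the value at termination, RM=5.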
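The recursive procedure of Proposition 2.5 can be sketched in Python as follows. This is an illustrative sketch: the function name `hudson_rm` and the representation of R as a set of 1-based site pairs (k, l) with Rkl=1 are assumptions chosen for brevity, not part of the original algorithm statement.

```python
def hudson_rm(num_sites, incompatible_pairs):
    """Hudson's RM from the four-gamete results.

    `incompatible_pairs` is a set of 1-based pairs (k, l), k < l, for
    which sites k and l show all four gametes (R_kl = 1).
    """
    rm = 0
    i = 1  # leftmost site not yet separated from l by a placed event
    for l in range(2, num_sites + 1):
        # Step 3: does any site k in i..l-1 conflict with site l?
        if any((k, l) in incompatible_pairs for k in range(i, l)):
            rm += 1
            i = l
    return rm

# The non-zero entries of R listed in the application above.
pairs = {(1, 2), (1, 6), (3, 6), (6, 7), (4, 9), (5, 10),
         (6, 10), (7, 11), (8, 11), (6, 15), (12, 16)}
rm_example = hudson_rm(16, pairs)
```

Run on the sixteen-site example, the sketch increments RM at l = 2, 6, 7, 11 and 16, giving RM=5.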

Proof of proposition 2.5

The result is trivial if $R_{ij}=0$ for all i and j. Otherwise, let the true minimum number of events based on the 4-gamete test be W.

Suppose wlog that RM is incremented by 1 at $R_M$ steps, corresponding to values $l_1, l_2, \ldots, l_{R_M}$ of l. Setting $l_0=1$, at these steps i therefore takes values $l_0, l_1, \ldots, l_{R_M-1}$ respectively, because i is reassigned the current value of l at each increase in RM. We prove first that $W \ge R_M$, then that $W \le R_M$.

For each j, by construction

$$\max\{R_{k l_j} : l_{j-1} \le k < l_j\} = 1.$$

Then there must be a recombination in the interval $(l_{j-1}, l_j)$ for each j and, as there are $R_M$ such non-overlapping intervals,

$$W \ge R_M.$$

To prove $W \le R_M$, suppose we place $R_M$ recombination events along the sequence, one in each interval $(l_{j-1}, l_j)$, j=1,2,...,$R_M$, each immediately to the left of site $l_j$. Supposing for a contradiction that this did not provide a solution, there must exist p, q such that $R_{pq}=1$ but no event is placed inside the interval (p,q). In the qth round of the algorithm, l=q, and since $R_{pq}$ did not produce an increase in RM, this bound cannot have been considered: writing $i_q$ for the value of i in that round, $q > i_q > p$. This implies $i_q>1$ and hence $i_q=l_m$ for some m>0. Thus there is a recombination placed in the interval $(l_{m-1}, l_m)$, immediately to the left of $l_m$. However $p < l_m < q$, so this event lies within (p,q), a contradiction.

Example: Drosophila data

• Chromosome 4 in three Drosophila species
• Is there evidence for recombination?
• None seen in thousands of “crosses”
• Arguello et al. MBE 2009
• Sequenced 80 genes

• There is definitely recombination, at a low rate
• More recombination in D. simulans than in other species
• Suggests deeper ARGs (larger population size) for this species

2.6 Properties of RM

• RM provides a simple, constructible measure of the influence of recombination on a sample of sequences

• This has led to its use in large real datasets by researchers

• RM relies on mutations in suitable places to detect recombination events, so if the mutation rate is not very high, it typically drastically underestimates the number of recombination events (Hudson and Kaplan, 1985).

• Under a coalescent model, the expectation of RM grows extremely slowly with sample size n: no faster than log(log(n))

• In general, recombination events are much more challenging to detect directly, and study, than mutation events

• Better bounds are also available, which extend the ideas used to construct RM (Myers and Griffiths 2003, Hein 1990, Song and Hein 2004, 2005, Bafna and Bansal 2006, Lyngsø, Song et al. 2008, Liu and Fu 2008, and more)

2.7 Haplotype patterns and recombination events

• Consider the following toy dataset. How many recombination events are required?

0 0 0
0 1 0
1 0 0
1 1 0
0 0 1
1 0 1

• RM=1
• It is clear that under the infinite sites model, the first event back in time must be a recombination event.
• No matter which of the sequences we decide to recombine, after this event there will still be 5 unchanged sequences
• (Exercise) No matter what choice we made, these 5 sequences still indicate recombination (4-gamete test)
• So we need at least one more recombination event in the history of these sequences, and RM could be improved to 2.

How can we do better?

One approach is to use haplotype information

2.8 The haplotype bound

Proposition 2.8: The haplotype bound

Under the infinite sites model with recombination, suppose we have data for n sequences at s segregating sites 1,2,..,s. Suppose that the n x s data matrix for these sequences has H unique rows, or haplotypes. Then a lower bound on the number of recombination events in the history of the sample is H-s-1.

Proof Consider the ancestral recombination graph representing the history of the sample. Beginning with the ancestral sequence at the TMRCA, we can view our sample haplotypes as being created forward in time.

Since there are H haplotypes, only one of which can be the ancestral type, there must be at least H-1 further events in the history creating novel types. Each mutation or recombination event can create at most one novel type. Coalescence events simply duplicate existing types (forward in time). By the infinite sites assumption there are s mutation events, so if R is the number of recombination events we must have R+s>=H-1.

Remark From the proof, if the ancestral type is known, we can add it to our collection of haplotypes. Note also that the four gamete test is just the special case s=2.
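The bound of Proposition 2.8 is easy to compute directly. The following Python sketch is illustrative only: the name `haplotype_bound` and the list-of-rows layout are assumptions, and the function counts only columns that actually vary in the sample, since s refers to segregating sites.

```python
def haplotype_bound(data):
    """Lower bound H - s - 1 on the number of recombination events.

    `data` is a list of 0-1 rows, one per sequence. Only segregating
    columns are counted in s, and the bound is floored at zero.
    """
    num_columns = len(data[0])
    segregating = [c for c in range(num_columns)
                   if len({row[c] for row in data}) > 1]
    haplotypes = {tuple(row[c] for c in segregating) for row in data}
    return max(0, len(haplotypes) - len(segregating) - 1)

# Toy dataset with six distinct haplotypes at three sites.
toy = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
toy_bound = haplotype_bound(toy)
```

For the toy dataset H=6 and s=3, so the sketch returns 2; for two sites showing all four gametes it returns 1, matching the four-gamete test.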

Example (toy) dataset revisited

• Consider the following toy dataset. How many recombination events are required?

0 0 0
0 1 0
1 0 0
1 1 0
0 0 1
1 0 1

• H=6, s=3, giving R>=6-3-1=2
• This is the right answer here: a history with 2 events is possible (hint: recombine sequences 4 and 6 first)

• Note that given a dataset with s sites, we can
– Apply Proposition 2.8 to any subset of t of the s sites
– Obtain a bound on the number of events between the first and last members of the subset
– This will result in a lower bound matrix Rij with non-negative integer entries

• We want to be able to combine bounds once again, to produce an overall bound for a region

Combining bounds

0 0 0 0 0 0 0
0 1 0 1 1 1 1
1 0 0 1 0 1 0
1 1 0 1 0 1 1
0 0 1 1 1 0 0
1 0 1 0 1 1 0

[Figure: local haplotype bounds marked on intervals of the dataset, with values 1, 1, 2, 2, 2, 1]

For this set of bounds, HM=5

We could keep searching site subsets

Typically performance can be good if we use e.g. only subsets up to size 5, up to some maximal distance apart.

Clearly, software is needed to calculate the bound!

2.9 HM

Proposition 2.9: HM

Under the infinite sites model with recombination, suppose the haplotype bound gives a local bound matrix R, where $R_{ij}$ is the best haplotype bound between sites i and j. Define $H^M_{ij}$ to be the minimum number of recombination events between sites i and j satisfying this set of bounds. Then the following is true:

$$H^M_{1j} = \max\{H^M_{1k} + R_{kj} : k = 1, \ldots, j-1\}.$$

This recursive system can be used to obtain $H^M_{1j}$ given $H^M_{12}, H^M_{13}, \ldots, H^M_{1(j-1)}$, and hence provides an efficient means of obtaining $H^M_{1s}$, the overall lower bound on recombination events.

Proof

Let the true minimum be W. Note that the above construction means that $H^M_{1s}$ is a sum of $R_{ij}$ terms corresponding to non-overlapping intervals. Thus, obviously

$$W \ge H^M_{1s}.$$

To prove the reverse inequality, we construct a minimal placement of recombination events as follows.


Proof continued:

Define a vector of recombination counts in the s-1 mutation intervals with $r_j$, the number of events between mutant sites j-1 and j, given by $r_j = H^M_{1j} - H^M_{1(j-1)}$. (Take $H^M_{11}=0$.) Supposing for a contradiction that this does not satisfy the full bound set $R=\{R_{ij}\}$, we may pick j to be the minimal such that fewer than $R_{ij}$ events are placed within (i,j) for some i. By the recursive formula in the construction:

$$H^M_{1j} \ge H^M_{1i} + R_{ij}, \quad\text{i.e.}\quad H^M_{1j} - H^M_{1i} \ge R_{ij}.$$

But then

$$\sum_{k=i+1}^{j} r_k = \sum_{k=i+1}^{j} \left(H^M_{1k} - H^M_{1(k-1)}\right) = H^M_{1j} - H^M_{1i} \ge R_{ij},$$

contradicting the fact that fewer than $R_{ij}$ events are placed within (i,j).

Note that the proof provides an explicit possible solution for where recombination events are placed. This is usually non-unique: this solution corresponds to putting events as far “right” as possible.
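The recursion of Proposition 2.9, with $H^M_{11}=0$, is a simple dynamic programme. The sketch below is illustrative: the name `composite_bound`, the 0-based indexing and the small bound matrix are assumptions made for the example, not data from the lecture.

```python
def composite_bound(local_bounds):
    """Combine local bounds R_ij into overall bounds along the region.

    `local_bounds[i][j]` (0-based, i < j) is a lower bound on the number
    of recombination events between sites i and j. Returns a list h with
    h[j] = max over k < j of h[k] + local_bounds[k][j], and h[0] = 0.
    """
    num_sites = len(local_bounds)
    h = [0] * num_sites
    for j in range(1, num_sites):
        h[j] = max(h[k] + local_bounds[k][j] for k in range(j))
    return h

# Hypothetical local bounds for four sites (upper triangle only).
bounds = [
    [0, 1, 0, 2],
    [0, 0, 0, 2],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
h = composite_bound(bounds)
events_per_interval = [h[j] - h[j - 1] for j in range(1, len(h))]
```

The last entry of `h` is the overall bound (3 here, from combining the bound of 1 on the first interval with the bound of 2 between the second and fourth sites), and the successive differences give the placement of events used in the proof, pushed as far right as possible.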

The benefits of using more information

The following charts show that the expectation of the haplotype bound (solid lines) can greatly exceed that of RM (dotted lines), especially as the sample size becomes large. These expectations were calculated using the coalescent with recombination, which we will come to soon.

Myers and Griffiths (2003)

Example: the haplotype bound in humans

The following is based on real human mutation data for 10,000 bases around the LPL gene. We can plot the recombination density between pairs of sites as an x, y colour plot:

Question: Is there a “hotspot” for recombination here?

Caveat: Apparent clustering of recombination might be due to stochastic variation in histories. We need to model this explicitly.

[Figure: colour plots for pairs of sites of the haplotype bound Rh and the D’ association measure]

Example: Humans versus chimps

These are similar plots, for aligned regions of the human and chimpanzee genomes (Winckler et al. 2005).

Further (model based) analyses confirm that recombination rates are very different between humans and chimpanzees genome-wide (Winckler et al. 2005, Ptak et al. 2005, Myers et al. 2009)

98.6% similar at aligned genomic bases

Example: Malaria

Malaria parasites appear to have a similarly uneven distribution of recombination sites along their genomes (Mu et al., Nature Genetics 2010)

[Figure: panels for Chromosome 1 and Chromosome 7, each shown for Asia and Africa]

2.10 Conclusions on recombination detection

• Direct detection of recombination events offers a very useful approach to:
– Understanding the influence of recombination on data
– Discovering the distribution of events along sequences

• More sophisticated approaches still have been developed in recent years (Song and Hein 2004, 2005, Bafna and Bansal 2006, Lyngsø, Song et al. 2008, Liu and Fu 2008, and more)
– Improvements over HM, though these are modest.

• All strict minima miss the large majority of recombination events

• In organisms with repeat mutation, approaches need to be adapted (Liu and Fu, 2008) and the problem is even tougher

• A model for populations with recombination is vital to
– Recover more of the information from data
– Perform inference on underlying recombination parameters
– Estimate uncertainty, make statements about rate variation, make statements about particular sample histories, allow for demographic histories, selection,...

3.0 The Wright-Fisher model revisited

We incorporate recombination in the Wright-Fisher model:
• Constant size population of size 2N
• Generations are discrete, with the next generation formed from the previous:
– With probability 1-r, individuals choose a single parent uniformly at random
– With probability r, individuals are recombinant: they choose two parents at random and a recombination breakpoint
– Individuals can also mutate, with probability m, and choose a site to mutate.

[Figure: each chromosome randomly chooses its parent in the previous generation; single parent with probability 1-r]

3.1 The history of a sample

Consider a sample of size n from the population

We will define

$$\rho = 4Nr, \qquad \theta = 4Nm.$$

Consider the limit as $N \to \infty$ while ρ, θ remain constant.

At some time back, suppose there are j ancestors of the sample remaining and consider the events in the previous generation

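Consider the probabilities of different possible events while j>1 ancestors remain:

$$\begin{aligned}
P(\text{No event}) &= (1-r)^j (1-m)^j \left(1-\frac{1}{2N}\right)\left(1-\frac{2}{2N}\right)\cdots\left(1-\frac{j-1}{2N}\right) \\
&= \left(1-\frac{\rho}{4N}\right)^j \left(1-\frac{\theta}{4N}\right)^j \left(1-\frac{1}{2N}\sum_{i=1}^{j-1} i + O(N^{-2})\right) \\
&= \left(1-\frac{j\rho}{4N} + O(N^{-2})\right)\left(1-\frac{j\theta}{4N} + O(N^{-2})\right)\left(1-\frac{1}{2N}\binom{j}{2} + O(N^{-2})\right) \\
&= 1 - \frac{1}{2N}\left[\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}\right] + O(N^{-2}) \\[6pt]
P(\text{Only 1 mut.}) &= j\,m\,(1-m)^{j-1}(1-r)^j \left(1-\frac{1}{2N}\right)\left(1-\frac{2}{2N}\right)\cdots\left(1-\frac{j-1}{2N}\right) = \frac{1}{2N}\cdot\frac{j\theta}{2} + O(N^{-2}) \\[6pt]
P(\text{Only 1 rec.}) &= \frac{1}{2N}\cdot\frac{j\rho}{2} + O(N^{-2}) \\[6pt]
P(\text{One coal.}) &= (1-r)^j (1-m)^j \binom{j}{2}\frac{1}{2N}\left(1-\frac{1}{2N}\right)\cdots\left(1-\frac{j-2}{2N}\right) = \frac{1}{2N}\binom{j}{2} + O(N^{-2}) \\[6pt]
P(\ge 2 \text{ events}) &= O(N^{-2})
\end{aligned}$$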

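Now, as for the coalescent without recombination, we measure time in units of 2N generations, define t=T/2N, and consider event probabilities as $N \to \infty$ while t remains fixed. Let $T_j$ be the waiting time back until some event occurs, while there are j ancestors:

$$P(T_j > t) = \left(1 - \frac{1}{2N}\left[\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}\right] + O(N^{-2})\right)^{2Nt} \to \exp\left(-\left[\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}\right] t\right) \quad\text{as } N \to \infty.$$

Thus, $T_j$ is exponentially distributed. When an event occurs:

$$\begin{aligned}
P(\text{One rec.}) &= \frac{j\rho/(4N) + O(N^{-2})}{\left[\binom{j}{2} + j\rho/2 + j\theta/2\right]/(2N) + O(N^{-2})} \to \frac{\rho}{j-1+\rho+\theta} \quad\text{as } N \to \infty, \\
P(\text{One mut.}) &\to \frac{\theta}{j-1+\rho+\theta} \quad\text{as } N \to \infty, \\
P(\text{One pair coalesce}) &\to \frac{j-1}{j-1+\rho+\theta} \quad\text{as } N \to \infty, \\
P(\text{Two or more events}) &\to 0 \quad\text{as } N \to \infty.
\end{aligned}$$

In the limit, this fully defines the ancestry process. By obvious symmetry, at coalescences a random pair coalesce, and a random sequence recombines or mutates at these respective events. This defines the coalescent with recombination: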

3.2 The coalescent with recombination

Definition 3.2: The coalescent with recombination (Hudson 1983, Griffiths 1991, Griffiths and Marjoram 1997)

The coalescent with recombination is a Markov process describing the history, backward in time, of a sample of n genes drawn from a population. While j ancestors remain, j>1, the time to the next event has an exponential distribution with rate parameter

$$\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}.$$

After sampling the next event time, an event is chosen:

• With probability $\frac{j-1}{j-1+\rho+\theta}$, two sequences are chosen at random and coalesce.
• With probability $\frac{\rho}{j-1+\rho+\theta}$, one sequence is chosen at random to recombine.
• With probability $\frac{\theta}{j-1+\rho+\theta}$, one sequence is chosen at random to mutate.

• At recombination events, the breakpoint is chosen using pdf f.

• In drawing the graph, coalescence events are represented as edge joins backward in time, recombination events as splits, and mutation events marked as points on the edges.

• Given a particular mutation model (specified forward in time) we first choose the ancestor type, and then choose a new mutant according to the model at each mutation point, based in general on the type of the edge immediately above the mutation event.

• If we are not interested in recording mutations, or are investigating the genealogical relationships alone, we can simply set θ=0.

• We usually terminate the process the first time j=1. The first ancestor of the sample where j=1 is the grand most recent common ancestor of the sample.

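[Figure: the example ARG for 8 sequences (breakpoints 0.2, 0.6, 0.7, 0.9; mutations at 0.05, 0.3, 0.35, 0.4, 0.5, 0.65, 0.72, 0.75, 0.8, 0.9), annotated with the waiting times $W_1, W_2, W_3, \ldots, W_m$ between successive events. Example annotations: while 8 ancestors remain, the waiting time contributes $\exp\left(-\left[\binom{8}{2}+4\rho+4\theta\right]W\right)$ and a given pair coalesces with probability $1/\left[\binom{8}{2}+4\rho+4\theta\right]$; while 3 ancestors remain, a given sequence recombines with probability $(\rho/2)/\left[\binom{3}{2}+3\rho/2+3\theta/2\right]$.]

$$\begin{aligned}
P(\mathrm{ARG}) ={} & \left[\binom{8}{2}+4\rho+4\theta\right] \exp\left(-\left[\binom{8}{2}+4\rho+4\theta\right]W_1\right) \frac{1}{\binom{8}{2}+4\rho+4\theta} \\
& \times \left[\binom{7}{2}+\frac{7\rho}{2}+\frac{7\theta}{2}\right] \exp\left(-\left[\binom{7}{2}+\frac{7\rho}{2}+\frac{7\theta}{2}\right]W_2\right) \frac{\theta/2}{\binom{7}{2}+7\rho/2+7\theta/2} \\
& \times \cdots \\
& \times \left[\binom{2}{2}+\frac{2\rho}{2}+\frac{2\theta}{2}\right] \exp\left(-\left[\binom{2}{2}+\frac{2\rho}{2}+\frac{2\theta}{2}\right]W_m\right) \frac{1}{\binom{2}{2}+2\rho/2+2\theta/2}
\end{aligned}$$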

3.3 Properties of the coalescent with recombination

• We have shown that the coalescent with recombination is the limit process (as N becomes large) describing the history of a sample drawn from a constant size Wright-Fisher model.

• It also arises as a limit process in many other models, with continuous or discrete generations

• ρ=0 corresponds to the standard coalescent

• The number of ancestors j can be thought of as a random walk.

• The coalescence rate grows quadratically with j while the recombination rate grows only linearly with j. Thus eventually the random walk will hit j=1 with probability 1 (exercise sheet)

• The expected number of recombination events $E_j$ before this happens, starting from j ancestors, satisfies the recursion (Exercise; Ethier and Griffiths 1990)

$$E_j = \frac{\rho}{\rho+j-1}\left(1 + E_{j+1}\right) + \frac{j-1}{\rho+j-1}E_{j-1}$$

with solution

$$E_n = \rho \int_0^1 \frac{1-(1-x)^{n-1}}{x}\, e^{\rho x}\, dx.$$
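A quick numerical check can make the recursion for the expected number of recombination events concrete. The sketch below assumes the recursion reads $E_j = \frac{\rho}{\rho+j-1}(1+E_{j+1}) + \frac{j-1}{\rho+j-1}E_{j-1}$ with $E_1=0$, and that its solution is $E_n = \rho\int_0^1 \frac{1-(1-x)^{n-1}}{x} e^{\rho x}\,dx$; the function names and the midpoint-rule integration are choices made here for illustration.

```python
import math

def expected_recombinations(n, rho, steps=20000):
    """Midpoint-rule approximation of
    rho * integral_0^1 (1 - (1 - x)^(n - 1)) / x * exp(rho * x) dx.
    The midpoint rule never evaluates the integrand at x = 0."""
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * width
        total += (1.0 - (1.0 - x) ** (n - 1)) / x * math.exp(rho * x)
    return rho * total * width

def recursion_residual(j, rho):
    """Difference between E_j and the right-hand side of the recursion."""
    lhs = expected_recombinations(j, rho)
    rhs = (rho * (1.0 + expected_recombinations(j + 1, rho))
           + (j - 1) * expected_recombinations(j - 1, rho)) / (rho + j - 1)
    return lhs - rhs

# For n = 2 the integral can be done by hand and equals exp(rho) - 1.
e2 = expected_recombinations(2, 1.0)
residual = recursion_residual(3, 2.0)
```

The residual should be zero up to integration error. Note how quickly the expectation grows with ρ: already for n=2 it is exp(ρ)-1, which is one reason the full graph can be very large.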

3.4 Description in terms of Poisson processes

• We can think of the coalescent with recombination in terms of independent Poisson processes on edges and pairs of edges

• This construction is helpful in theoretical calculations and obtaining subgraphs

• For this course, we only need to restate two general properties of homogeneous Poisson processes on the real line (these facts were also used in the earlier part of the course). Here N(t) is the number of events before time t.

3.4.1. If $N(t), t \ge 0$, is a homogeneous Poisson process of rate $\lambda$, the waiting time between events is $\exp(\lambda)$. In particular, the waiting time until the first event follows this distribution.

3.4.2. If $N_1(t), N_2(t), \ldots, N_n(t), t \ge 0$, are independent homogeneous Poisson processes of rates $\lambda_1, \lambda_2, \ldots, \lambda_n$, then $N(t) = \sum_{i=1}^{n} N_i(t)$ is a homogeneous Poisson process of rate $\sum_{i=1}^{n} \lambda_i$. Independently for each event in the summed process, the probability of occurrence in process j is $\lambda_j / \sum_{i=1}^{n} \lambda_i$.

• Exactly as without recombination, we can fully construct the ancestral recombination graph using independent Poisson processes in reverse time:
– Each of the j(j-1)/2 pairs of edges independently coalesces as a Poisson process with rate 1
– Each of the j edges mutates at rate θ/2.
– Each of the j edges recombines at rate ρ/2.
– Events in the Poisson processes are “racing” each other
• To prove this gives the correct graph, we simply need to show it yields the correct rates
• By fact 3.4.2, while j ancestors remain, events occur as a Poisson process with total rate

$$\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}.$$

• The time to the first event has the correct exponential distribution, by fact 3.4.1. When the event occurs, fact 3.4.2 implies it is e.g. a coalescence (between a random pair of edges) with probability

$$\frac{\binom{j}{2}}{\binom{j}{2} + j\rho/2 + j\theta/2} = \frac{j-1}{j-1+\rho+\theta}.$$

[Figure: the example ARG with waiting times $W_1, W_2, W_3, \ldots, W_m$, annotated with per-pair coalescence rate 1, per-edge recombination rate ρ/2 and mutation rate θ/2, and total rates $\binom{8}{2}+4\rho+4\theta$ while 8 ancestors remain and $\binom{6}{2}+3\rho+3\theta$ while 6 remain]
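The "racing" Poisson processes can be simulated directly. The following Python sketch tracks only the number of ancestors and the counts of each event type, not the graph itself; the function name `simulate_event_counts` is hypothetical, and time is measured in units of 2N generations as in the text.

```python
import random

def simulate_event_counts(n, rho, theta, rng):
    """Simulate the number of ancestors back to j = 1.

    While j ancestors remain, coalescences occur at total rate
    j(j-1)/2, recombinations at j*rho/2 and mutations at j*theta/2.
    Returns the event counts and the total elapsed time.
    """
    j = n
    counts = {"coalescence": 0, "recombination": 0, "mutation": 0}
    elapsed = 0.0
    while j > 1:
        coal_rate = j * (j - 1) / 2.0
        rec_rate = j * rho / 2.0
        mut_rate = j * theta / 2.0
        total = coal_rate + rec_rate + mut_rate
        # Fact 3.4.1: waiting time to the first event of the summed process.
        elapsed += rng.expovariate(total)
        # Fact 3.4.2: event type chosen in proportion to the rates.
        u = rng.random() * total
        if u < coal_rate:
            counts["coalescence"] += 1
            j -= 1
        elif u < coal_rate + rec_rate:
            counts["recombination"] += 1
            j += 1
        else:
            counts["mutation"] += 1
    return counts, elapsed

counts, elapsed = simulate_event_counts(5, 1.0, 1.0, random.Random(0))
```

Because each recombination adds one ancestor and each coalescence removes one, every run ends with exactly n-1 more coalescences than recombinations, whatever the random seed.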

3.5 Subgraphs

• In 1.8, we saw that we can construct the ARG for a subregion [a,b] by ignoring all recombination (and mutation) events outside [a,b]. If recombination and mutation are uniform, we construct a graph by starting with n sequences, and backward in time introducing
– Recombination events at rate ρ(b-a)/2 per edge
– Mutation events at rate θ(b-a)/2 per edge
– Coalescence at rate 1 per pair of edges

Thus the ARG for a subregion is (of course) distributed according to the coalescent with recombination for the smaller region.

• “Small ARG”: In certain settings, we can gain efficiency by only following the history of specific branches contributing to genetic variation, building a coalescent using the Poisson process rates. Edges carrying no genetic material passed on to the sample (and the recombinations producing them), and edges carrying only material that has reached a MRCA, need not be followed. Similarly, mutations outside ancestral material need not be simulated.
– This graph can be produced directly (Hudson 1983)
– Can be much smaller than the “big ARG”
– Preferred for simulation for this reason

Remark: small ARG in the coalescent

Simulation of the small graph is efficient (Hudson 1991). Avoid considering ancestors sharing no material with the sample.

Simulate directly by having different rates on different lineages in the past. We can measure the coalescence, mutation and recombination rates on each lineage: rates 1, θ/2, ρ/2 on one lineage in the figure, and rates 1, 0.7 θ/2, 0.7 ρ/2 on another.

[Figure: example ARG with breakpoints 0.2, 0.6, 0.7, 0.8, 0.85 and 0.9, marking a recombination that the small ARG does not include]

3.6 Marginal trees revisited

Marginal trees are recovered from the graph by taking the appropriate branch at each recombination event

Note the marginal tree at x is the limit as δ tends to 0 of the subgraph on [x,x+δ].

In this subgraph, line pairs coalesce at rate 1, so while j ancestors remain the total coalescence rate is j(j-1)/2.

Lines recombine at rate ρδ/2 per edge, so in the limit there is no recombination and the marginal tree at x is described by the usual coalescent.

(Actually this is obvious, because we could make the tree at x based on the large size limit of a finite Wright-Fisher population directly, in which case recombination would not occur.)

[Figure: marginal trees T(0), T(0.5), T(1); vertical axis Time]

3.7 Theoretical results for the coalescent with recombination

• The coalescent with recombination is much harder to derive exact results for than the coalescent

• These are mainly restricted to samples of size 2, or the “big ARG”, which contains some ancestors unrelated to the sample

• In other settings, we rely on
– Numerical recursions to solve
– Lower and upper bounding of solutions
– Analytic approximation of solutions

• We will see examples of these settings and approaches
• For additional analytical results, see Durrett, and Wakeley, and references therein (important papers include Hudson (1983), Hudson and Kaplan (1985), Ethier and Griffiths (1990), Griffiths and Marjoram (1997), Wiuf and Hein (1999) and others)

3.8 Mean and variance of the number of segregating sites

Assume the infinite sites model and a uniform mutation rate along [0,1].

Let us define Sn to be the number of mutation events in a sample of size n that occur in ancestral material and prior to the MRCA at their position.

Suppose the region consists of m discrete sites, each of which mutates at rate θ/(2m), with recombination occurring at rate ρ/(2(m-1)) between each adjacent pair. The continuous model is the limit as m→∞.

Define Ti to be the total tree length at site i. Then conditional on T1,T2,..,Tm, the total number of mutations is a sum of independent Poisson random variables, so is Poisson with mean

$$
W_m = \frac{\theta}{2m}\sum_{i=1}^{m} T_i .
$$

Thus if Tij is the time while j ancestors remain in tree i:

$$
E(S_n) = E(W_m) = \frac{\theta}{2m}\sum_{i=1}^{m} E(T_i)
= \frac{\theta}{2m}\sum_{i=1}^{m}\sum_{j=2}^{n} j\,E(T_{ij})
= \frac{\theta}{2m}\sum_{i=1}^{m}\sum_{j=2}^{n} \frac{2}{j-1}
= \theta\sum_{j=1}^{n-1}\frac{1}{j},
$$

so the mean number of sites is unchanged relative to the no recombination case.
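As a quick numerical check of the mean, the sketch below computes E(Sn) two ways: from the expected total tree length, using E(Tij) = 2/(j(j-1)), and from the closed form θ(1 + 1/2 + ... + 1/(n-1)). The function names are illustrative, and exact fractions are used only to make the comparison exact.

```python
from fractions import Fraction

def expected_tree_length(n):
    """E(T_i) = sum over j of j * E(T_ij), with E(T_ij) = 2 / (j (j - 1))."""
    return sum(j * Fraction(2, j * (j - 1)) for j in range(2, n + 1))

def expected_segregating_sites(n, theta):
    """Closed form: theta * (1 + 1/2 + ... + 1/(n-1))."""
    return Fraction(theta) * sum(Fraction(1, j) for j in range(1, n))

def expected_segregating_sites_from_trees(n, theta, m):
    """Mean of W_m = (theta / 2m) * sum of m tree lengths with equal means."""
    return Fraction(theta) / (2 * m) * m * expected_tree_length(n)
```

For example, `expected_segregating_sites(4, 6)` and `expected_segregating_sites_from_trees(4, 6, m=50)` both evaluate to 11. The number of sites m cancels, and the recombination rate never enters.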

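For the variance, note

$$
\begin{aligned}
\mathrm{var}(S_n) &= E(S_n^2) - E(S_n)^2 = E(W_m + W_m^2) - E(W_m)^2 = E(W_m) + \mathrm{var}(W_m) \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{4m^2}\,\mathrm{var}\Big(\sum_{i=1}^{m} T_i\Big) \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{4m^2}\Big[\sum_{i=1}^{m}\mathrm{var}(T_i) + 2\sum_{i<j}\mathrm{covar}(T_i,T_j)\Big] \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{4m^2}\Big[m\,\mathrm{var}(T_1) + 2\sum_{k=1}^{m-1}(m-k)\,\mathrm{covar}(T_1,T_{1+k})\Big] \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{2m}\sum_{k=1}^{m-1}\Big(1-\frac{k}{m}\Big)f_n\Big(\frac{k\rho}{m-1}\Big) + O\Big(\frac{1}{m}\Big) \\
&\to \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{2}\int_0^1 (1-z)\,f_n(\rho z)\,dz \quad \text{as } m\to\infty,
\end{aligned}
$$

where fn(z) is the covariance in tree times between sites a distance z/2 recombination units apart.

We have

$$
\mathrm{var}(S_n) = \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{2}\int_0^1 (1-z)\,f_n(\rho z)\,dz .
$$

It is clear that we expect fn to decrease with ρ, and further

$$
f_n(0) = \mathrm{var}(T_1) = 4\sum_{i=1}^{n-1}\frac{1}{i^2}, \quad \text{while } f_n(z)\to 0 \text{ as } z\to\infty,
$$

so as

$$
\rho\to 0,\ \ \mathrm{var}(S_n)\to \theta\sum_{i=1}^{n-1}\frac{1}{i} + \theta^2\sum_{i=1}^{n-1}\frac{1}{i^2},
\quad\text{and as } \rho\to\infty,\ \ \mathrm{var}(S_n)\to \theta\sum_{i=1}^{n-1}\frac{1}{i} .
$$

The variance is reduced relative to the no recombination case. (Hudson 1983, Griffiths and Marjoram 1997)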
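The variance formula var(Sn) = θ Σ 1/i + (θ²/2) ∫ (1-z) fn(ρz) dz can be evaluated numerically once a covariance function fn is supplied. The sketch below uses a midpoint rule and takes fn as a caller-provided function (an assumption, since fn has no simple closed form for n>2). It also returns the two limiting values, ρ=0 and ρ→∞, for comparison.

```python
def harmonic_sums(n):
    """Return (sum of 1/i, sum of 1/i^2) for i = 1, ..., n-1."""
    h1 = sum(1 / i for i in range(1, n))
    h2 = sum(1 / i ** 2 for i in range(1, n))
    return h1, h2

def var_segregating_sites(n, theta, rho, f_n, steps=2000):
    """theta * h1 + (theta^2 / 2) * integral over [0,1] of (1 - z) f_n(rho z)."""
    h1, _ = harmonic_sums(n)
    dz = 1.0 / steps
    integral = 0.0
    for k in range(steps):
        z = (k + 0.5) * dz               # midpoint of the k-th subinterval
        integral += (1 - z) * f_n(rho * z) * dz
    return theta * h1 + theta ** 2 / 2 * integral

def var_limits(n, theta):
    """Return (variance at rho = 0, limiting variance as rho -> infinity)."""
    h1, h2 = harmonic_sums(n)
    return theta * h1 + theta ** 2 * h2, theta * h1
```

Passing the constant function fn(z) = fn(0) = 4 Σ 1/i² reproduces the no recombination variance, and passing fn(z) = 0 reproduces the large-ρ limit, so any decreasing fn in between gives an intermediate value.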

3.9 Mean and variance of the number of recombination events

Let Rn be the number of recombination events in a sample of size n that occur in ancestral material, and prior to the MRCA at their position. It was similarly shown (Hudson and Kaplan, 1985) that

$$
E(R_n) = \rho\sum_{i=1}^{n-1}\frac{1}{i}, \qquad
\mathrm{var}(R_n) = \rho\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\rho^2}{2}\int_0^1 (1-z)\,f_n(\rho z)\,dz .
$$

Note that this expectation is different from the expected number of events, En, in the big ARG:

$$
E(R_n) = \rho\int_0^1 \frac{1-x^{n-1}}{1-x}\,dx, \qquad
E_n = \rho\int_0^1 \frac{1-x^{n-1}}{1-x}\,e^{\rho(1-x)}\,dx .
$$

This is because events in the big ARG can happen outside ancestral material. The difference is, though, bounded as n→∞ (problem sheet).

How can we calculate fn(z)? This is actually only reasonable analytically for n=2.
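To see how far apart the two expectations are, the sketch below evaluates E(Rn) = ρ ∫ (1-x^(n-1))/(1-x) dx and the big ARG count En = ρ ∫ (1-x^(n-1))/(1-x) e^(ρ(1-x)) dx with a simple midpoint rule. The integration scheme and step count are illustrative choices. The midpoint rule never evaluates the integrand at x=1, where it has a removable singularity.

```python
import math

def _midpoint_integral(g, steps=20000):
    """Midpoint-rule approximation to the integral of g over [0, 1]."""
    h = 1.0 / steps
    return h * sum(g((k + 0.5) * h) for k in range(steps))

def expected_recombinations_sum(n, rho):
    """E(R_n) written as rho * (1 + 1/2 + ... + 1/(n-1))."""
    return rho * sum(1 / i for i in range(1, n))

def expected_recombinations_integral(n, rho, steps=20000):
    """E(R_n) written as rho * integral of (1 - x^(n-1)) / (1 - x)."""
    return rho * _midpoint_integral(
        lambda x: (1 - x ** (n - 1)) / (1 - x), steps)

def expected_events_big_arg(n, rho, steps=20000):
    """E_n: the same integrand weighted by exp(rho * (1 - x))."""
    return rho * _midpoint_integral(
        lambda x: (1 - x ** (n - 1)) / (1 - x) * math.exp(rho * (1 - x)), steps)
```

For n=2 the integrand is 1, so E(R2) = ρ while E2 = e^ρ - 1: the two agree to first order in ρ, but the big ARG count grows much faster as ρ increases.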

3.10 Covariance in ancestry times

f2(z) is defined as the covariance in total marginal tree lengths for two sites a distance z in recombination units apart. We can focus on the small ARG subgraph for a region [0,1] with overall ρ=z. Let the coalescence times at 0 and 1 be T1, T2. The tree lengths are then 2T1, 2T2 so:

$$
f_2(\rho) = \mathrm{cov}(2T_1, 2T_2) = E(4T_1T_2) - E(2T_1)E(2T_2) = 4E(T_1T_2) - 4,
$$

and we “simply” need E(T1T2) for sites a distance ρ apart.

We sketch in the supplement how this quantity is obtained, to illustrate the important approach of constructing equation systems.

Idea: ignoring mutations, condition on the first event back in time that occurs in the ARG for these two sequences. This is a recombination or coalescence. Repeat this.

[Figure: marginal trees T(0) and T(z) for a sample of size 2, with coalescence times T1 and T2.]

3.10 Supplement I: ancestry time covariance

The time T to the first event is exp(1+ρ). If T1', T2' are the coalescence times measured from (above) the first event:

$$
\begin{aligned}
E(T_1T_2) &= E\big((T+T_1')(T+T_2')\big) \\
&= E(T^2) + 2E(T)E(T_1') + E(T_1'T_2') && \text{as } T \text{ is independent of } T_1', T_2' \\
&= 2E(T)^2 + 2E(T)\big(E(T_1)-E(T)\big) + E(T_1'T_2') && \text{as } T \text{ is exponential, so } E(T^2) = 2E(T)^2 \\
&= 2E(T)E(T_1) + E(T_1'T_2') \\
&= \frac{2}{1+\rho} + E(T_1'T_2') && \text{as } E(T_1) = 1 .
\end{aligned}
$$

Now

$$
E(T_1'T_2') = E(T_1'T_2' \mid \text{recom.})\,P(\text{rec.}) + E(T_1'T_2' \mid \text{coal.})\,P(\text{coal.}),
$$

and

$$
P(\text{rec.}) = \frac{\rho}{1+\rho}, \qquad P(\text{coal.}) = \frac{1}{1+\rho}, \qquad E(T_1'T_2' \mid \text{coal.}) = 0,
$$

so

$$
E(T_1T_2) = \frac{2}{1+\rho} + \frac{\rho}{1+\rho}\,E(T_1'T_2' \mid \text{recom.}) .
$$

Note that the conditional expectation term corresponds to the expectation for a new state, immediately following a recombination event. By the Markov property of ARGs, this is the expected product if we started in this state (looking back in time).

Label the original state 1, and the new state following a recombination event 2.

Define E1 to be the expectation we seek, E2 the corresponding expectation for the new state:

$$
E_1 = \frac{2}{1+\rho} + \frac{\rho}{1+\rho}\,E_2 .
$$

We need to consider additional potential states to form a complete system of equations. For any such state s, we can write the following, using the argument above. If λs is the total event rate for state s, the time T to the next event is exp(λs). If T1', T2' are the coalescence times measured from the first event:

$$
E_s = E(T_1T_2) = 2E(T)E(T_1) + E(T_1'T_2') = \frac{2}{\lambda_s} + \sum_{s'} P(s\to s')\,E_{s'} .
$$

[Figure: graph of states 1., 2. and 3. with transition rates marked on the edges.]

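3.10 Supplement continued

We can build a graph with vertices corresponding to particular states, and rates between states. Colour positions red if an MRCA is reached. Such states have E(T1T2)=0.

[Figure: transition graph on states 1. to 6., with edges labelled by rates (ρ, ρ/2, 4 and 1). States in which a position has reached its MRCA are coloured red.]

This allows us to construct a system of equations:

$$
\begin{aligned}
E_1 &= \frac{2}{1+\rho} + \frac{\rho}{1+\rho}\,E_2, \\
E_2 &= \frac{4}{6+\rho} + \frac{2}{6+\rho}\,E_1 + \frac{\rho}{6+\rho}\,E_3, \\
E_3 &= \frac{2}{6} + \frac{4}{6}\,E_2,
\end{aligned}
$$

where the constant terms are 2/λs for total event rates λ1 = 1+ρ, λ2 = 3+ρ/2 and λ3 = 6, and after a little algebra

$$
E_1 = \frac{\rho^2 + 14\rho + 36}{\rho^2 + 13\rho + 18},
$$

so we find

$$
f_2(z) = 4E_1 - 4 = \frac{4(z+18)}{z^2 + 13z + 18} .
$$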
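The “little algebra” can be checked by solving the three-state linear system exactly. The sketch below assumes the system E1 = 2/(1+ρ) + ρ/(1+ρ) E2, E2 = 4/(6+ρ) + 2/(6+ρ) E1 + ρ/(6+ρ) E3, E3 = 2/6 + 4/6 E2 (my reading of the slide's equations, to be checked against the state graph), and uses a small hand-written Gauss-Jordan routine over exact fractions. All function names are illustrative.

```python
from fractions import Fraction

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact Fractions."""
    n = len(b)
    m = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]       # normalise pivot row
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def expected_product(rho):
    """E1 = E(T1 T2) for two sites with recombination parameter rho."""
    rho = Fraction(rho)
    # Unknowns (E1, E2, E3); each row encodes
    # E_s - sum over s' of P(s -> s') E_s' = 2 / lambda_s.
    a = [
        [1, -rho / (1 + rho), 0],
        [-2 / (6 + rho), 1, -rho / (6 + rho)],
        [0, Fraction(-2, 3), 1],
    ]
    b = [2 / (1 + rho), 4 / (6 + rho), Fraction(1, 3)]
    return solve_linear_system(a, b)[0]

def f2(z):
    """Covariance of the two tree lengths: 4 E(T1 T2) - 4."""
    return 4 * expected_product(z) - 4
```

For instance, `f2(0)` gives 4, the variance of a single tree length 2T with T exponential with mean 1, and `f2(1)` gives 19/8, matching 4(z+18)/(z²+13z+18) at z=1.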

Note that the covariance decreases as the recombination rate increases.

A similar system of recursions can be calculated for n>2. In practice, the solution is extremely messy. Simulation is another approach to directly estimate the covariance in tree times (Hudson, 1983).

Often, when n>2 we rely on bounding quantities of interest.

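3.11 Number of distinct MRCAs

As we saw in the previous example, the time to the most recent common ancestor of individual marginal trees can vary along a sequence with recombination. How many different MRCAs do we expect along a sequence? The answer is: surprisingly few.

[Figure: marginal trees T(0) and T(z), with coalescence times T1 and T2.]

Consider a small interval [x,x+δ]. With high probability there is at most one recombination event on the graph for this region:

$$
\begin{aligned}
P(\text{no rec.}) &= 1 - \rho\delta\sum_{i=1}^{n-1}\frac{1}{i} + o(\delta), \\
P(\text{1 rec., while } j \text{ anc.}) &= \frac{\rho\delta}{j-1} + o(\delta), \\
P(\text{more than 1 rec.}) &= o(\delta).
\end{aligned}
$$

For a recombination occurring while j ancestors remain, what is the probability it changes the MRCA?

One or other of the (coloured) recombinant edges must not coalesce with the other edges while >2 edges remain. The probability of this is combinatoric:

$$
P(\text{escapes}) = 2\prod_{i=3}^{j+1}\Big(1-\frac{2}{i}\Big) = \frac{4}{j(j+1)} .
$$

Thus the expected number of TMRCA changes in [x,x+δ] is

$$
\begin{aligned}
E(\text{changes}) &= \sum_{j=2}^{n}\frac{\rho\delta}{j-1}\cdot\frac{4}{j(j+1)} + o(\delta) \\
&= 4\rho\delta\sum_{j=2}^{n}\Big[\frac{1}{2(j-1)j} - \frac{1}{2j(j+1)}\Big] + o(\delta) \\
&= 4\rho\delta\Big[\frac{1}{4} - \frac{1}{2n(n+1)}\Big] + o(\delta) \\
&= \rho\delta\Big[1 - \frac{2}{n(n+1)}\Big] + o(\delta),
\end{aligned}
$$

and the expected number in [0,1], letting δ=1/m→0, is

$$
E(\text{changes}) = \lim_{m\to\infty}\sum_{i=1}^{m}\Big[\frac{\rho}{m}\Big(1-\frac{2}{n(n+1)}\Big) + o\Big(\frac{1}{m}\Big)\Big] = \rho\Big(1-\frac{2}{n(n+1)}\Big),
$$

$$
E(\#\ \text{distinct TMRCAs}) \le 1 + \rho\Big(1-\frac{2}{n(n+1)}\Big) < 1 + \rho \quad\text{for all } n .
$$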
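The telescoping sum for the expected number of TMRCA changes is easy to verify numerically. The sketch below recomputes the escape probability from its product form, 2 ∏ (1 - 2/i) over i = 3, ..., j+1, weights it by the per-level recombination probability ρ/(j-1) (per unit length), and compares the total with ρ(1 - 2/(n(n+1))). Function names are illustrative.

```python
from fractions import Fraction

def escape_probability(j):
    """P(one of the two recombinant edges avoids coalescing while > 2 edges remain)."""
    p = Fraction(2)                       # either of the two recombinant edges
    for i in range(3, j + 2):             # i edges remain, from j+1 down to 3
        p *= 1 - Fraction(2, i)           # a given edge is missed by the next coalescence
    return p

def expected_tmrca_changes(n, rho):
    """Sum over levels j of (recombination weight) * (escape probability)."""
    rho = Fraction(rho)
    return sum(rho / (j - 1) * escape_probability(j) for j in range(2, n + 1))

def expected_tmrca_changes_closed_form(n, rho):
    """rho * (1 - 2 / (n (n + 1)))."""
    return Fraction(rho) * (1 - Fraction(2, n * (n + 1)))
```

For n=2 both functions give 2ρ/3, and for every n the value stays below ρ, which is why the number of distinct TMRCAs is at most of order 1+ρ regardless of sample size.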

4.0 Supplement II: Inference about recombination rate

• Given variation data from a population, we seek to perform inference on the processes that produced the data

• One of the most important parameters in human biology is the recombination rate
– Reflects the real biological process of recombination
– Recombination is required for meiosis to take place
– Recombination can cause disease when it goes wrong (by deleting, duplicating or inverting segments of the genome)
– Recombination keeps populations healthy, by allowing elimination of deleterious mutations
– Despite this, there is much we don’t know!

• The recombination rate
– Can vary hugely along a sequence
– Determines association between loci in the population
– Is hard to measure directly, because recombination occurs on average only ~1 in 100,000,000 meioses between any pair of successive nucleotides in the genome
– Can be measured indirectly, by parametric analysis of variation data
– Researchers in Oxford, and elsewhere, have developed such parametric approaches (Li and Stephens 2003; Ptak et al. 2005; Hudson 2001; McVean 2002; McVean et al. 2004)
– One method uses the “composite likelihood”, which approximates the likelihood of the data given a (variable) recombination rate, then estimates this rate using the likelihood

4.5 Findings using the “composite likelihood” (I)

[Figure: recombination estimates for all of chromosome 12.] The inferred patterns of recombination are extremely uneven (>80% of recombination in 10-20% of sequence). Over 30,000 hotspots identified genome-wide, via the composite likelihood (Myers et al. 2005).

One of the challenges in human genetics is that there is a very high volume of data

For example, the following is based on data for over 4 million binary mutations, typed in 270 humans from four populations

There is tremendous power in the data, but analysis methods must be sufficiently fast, requiring approximation

4.5 Findings using the “composite likelihood” (II)

• Downstream, one can use the places where recombination clusters – termed “hotspots” - to ask if there are features of DNA sequence that specify hotspot locations

– None previously identified in any mammal, but this is powerful data

• ~30,000 hotspots were used genome-wide, and their DNA sequence was compared to the DNA sequence of “cold” regions where there is little or no recombination

• It turns out there is a difference. A particular “word” in the DNA codes for there being a hotspot at a location (Myers et al. 2005, 2008) (the code is fuzzy):

...CTTCCGCCATGATTGTGAGGCCTCCCTAGCCACGTGGAACTGTGAGT...

• Since then, researchers have been able to find a new part of the cellular machinery (a “protein”, PRDM9) that recognises this word, and turns on recombination in hotspots (Myers et al. 2009, Baudat et al. 2009, Parvanov et al. 2009)

• PRDM9 is different in chimps, explaining their different hotspots, and has remarkable properties

• So: there is a close relationship between underlying biology, and variation patterns in data

4.6 Recombination summary

• Recombination is a powerful, fundamental force that has shaped both our current patterns of genetic variation, and our genomes themselves

• The coalescent with recombination is the key model enabling us to understand the relationship between recombination and genealogical histories, and patterns in variation data

• Inference under this model is challenging, but creative approaches have yielded workable solutions to this problem

• Non-parametric and parametric approaches both have something to offer and often largely agree in findings
