PART II: Recombination and selection
Summary of assumptions so far
We have covered the role of chance (parent choice), demography (population size) and mutation in shaping genetic diversity. Until now, the course has concentrated on heavily simplified models:
• Neutral Wright-Fisher models
1. Parents are always chosen at random. This means no new mutations affect the reproductive success of parents, i.e. there is no “selection” for positive or negative changes. This is also called neutrality.
2. A gene, i.e. a segment of DNA, is always inherited as a complete unit from the chosen parent. This assumes no “recombination”. Recombination is a process allowing different positions along a segment of DNA to be inherited from distinct parents.
Model summary (figure labels):
• Discrete generations
• Population size N
• Parents chosen at random
• Mutation probability m
• Sample history can be constructed
Why relax these assumptions?
In fact, we have (obviously!) evolved to adapt to our environment.
This process occurs through natural selection. Some new mutations are favoured, because those carrying them have more children on average. To study evolution quantitatively, we need models incorporating this idea.
Why relax these assumptions?
Our genome has essential functions. Many new mutations would disrupt these functions (far more than confer useful new advantages), so they must be prevented from becoming common in the population.
This process also occurs through natural selection. Some new mutations are “deleterious”, because those carrying them have fewer children on average. Selection can act in both directions.
Disease               Population frequency
Sickle cell anemia    1 in 625 (African Americans)
Cystic fibrosis       1 in 2,000 (Europeans)
Tay-Sachs disease     1 in 3,000 (US Jewish population)
Haemophilia           1 in 10,000
Galactosemia          1 in 57,000
Example: Human data
[Figure: pairwise D’ (association measure) between sites. Data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005).]
According to the assumptions so far, a region has a history given by a tree.
We should not see any obvious decay of association between sites with distance.
What’s going on?
1.1 Why relax these assumptions? Recombination
In humans, and many other species, a process of recombination occurs:
• We have chromosome pairs, one inherited from each parent
• Only one of the two maternal (or paternal) copies is passed down
• Almost always, rather than choosing one or the other, a mosaic is constructed
[Figure: chromosomes of Father, Mother and Child]
This can mean different positions on a chromosome are inherited from different chromosomes in the parental generation, so they have different histories.
Our models for genetic data need to allow for this. We will begin by thinking about recombination (without selection initially).
PART II: Recombination and selection
• We will extend our theory to cover the other two main biological forces driving genetic variation, evolution, and e.g. disease risk:
• Recombination
– The effect of recombination on ancestry
– Detecting historical recombination
– Incorporating recombination into the coalescent framework
– Properties of the “ARG”
– Real inference of recombination rate
• Natural selection
– The fate of individual mutations
– Modelling selection
– Properties of selected alleles
1.2 Recombination model
• Suppose we are thinking about a segment D of DNA in a single chromosome
– Sites $\{1, 2, \ldots, S\}$
– If S is large, it is reasonable to think of this as a continuous segment D=[0,1]
– In a single generation, at most one recombination can occur in D:
   Probability 1-r: single parent chromosome
   Probability r: two parents chosen
– When recombination occurs, we pick the (left) breakpoint B from a density function f on D.
– We will normally assume (wlog): $B \sim U(D)$
– In humans, the per site per generation recombination rate averages ~1×10^-8 versus a mutation rate of 1.3×10^-8.
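As a rough illustration of the single-generation model above, the following Python sketch draws the parent(s) of one chromosome and reads off which parent a given position was inherited from. The function names `choose_parents` and `parent_at`, the returned tuple layout, and the uniform breakpoint are assumptions made for this example only; they are not part of the course material.

```python
import random

def choose_parents(rng, n_chromosomes, r):
    """Pick the parent(s) of one chromosome for the segment D = [0, 1].

    Returns (left_parent, right_parent, breakpoint). With probability 1 - r
    there is a single parent, reported as (p, p, None). With probability r
    two parents are drawn at random (they may coincide) and the breakpoint
    B ~ U(0, 1): material to the left of B comes from left_parent, the rest
    from right_parent.
    """
    if rng.random() < r:
        left = rng.randrange(n_chromosomes)
        right = rng.randrange(n_chromosomes)
        return left, right, rng.random()
    parent = rng.randrange(n_chromosomes)
    return parent, parent, None

def parent_at(position, left, right, breakpoint):
    """Parent from which the given position in [0, 1] was inherited."""
    if breakpoint is None or position < breakpoint:
        return left
    return right

rng = random.Random(1)
left, right, b = choose_parents(rng, n_chromosomes=10, r=0.5)
print(left, right, b, parent_at(0.0, left, right, b), parent_at(1.0, left, right, b))
```

The recombination probability r=0.5 here is deliberately unrealistic so that recombinant draws are easy to see; realistic values are far smaller.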
We begin by considering a general population, including recombination
• Generations shown as discrete only for simplicity
• How do we represent histories with recombination?
Later we will add additional modelling assumptions (random mating, etc).
Figure labels:
• Each chromosome chooses parent in previous generation
• Single parent, probability 1-r
• Two parents, probability r
• Denote by double arrow: left parent single line, right parent double line
• Probability density function f
We can trace ancestral histories in the new setting
At a recombination event, choose the appropriate ancestor
Consider site 1 (position 0 in [0,1]).
Now consider site S (position 1 in [0,1])
At a recombination event, choose the appropriate ancestor
Always choose the right hand ancestor at a recombination event
Site S/2 (position 0.5 in [0,1])
At a recombination event, choose the appropriate ancestor
Ancestor choice depends on position of recombination event
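The three cases above (site 1, site S and site S/2) can be sketched in Python as follows. The list-of-dictionaries layout for the history, the name `trace_ancestor`, and the toy breakpoints are all hypothetical, chosen only to illustrate how the ancestor followed depends on the position being traced.

```python
def trace_ancestor(history, start, position):
    """Follow one position in [0, 1] back through the generations.

    history[g][c] = (left_parent, right_parent, breakpoint) gives the
    parents, one generation further back, of chromosome c living g
    generations before the present (breakpoint is None when there is a
    single parent). Material to the left of the breakpoint is taken from
    left_parent, the rest from right_parent.
    """
    current = start
    path = [current]
    for generation in history:
        left, right, breakpoint = generation[current]
        if breakpoint is None or position < breakpoint:
            current = left
        else:
            current = right
        path.append(current)
    return path

# Toy history: two generations of a population of three chromosomes.
history = [
    {0: (1, 2, 0.4), 1: (1, 1, None), 2: (0, 0, None)},
    {0: (0, 0, None), 1: (2, 0, 0.7), 2: (1, 1, None)},
]
for x in (0.0, 0.5, 1.0):
    print(x, trace_ancestor(history, start=0, position=x))
```

In this toy history, position 0 follows the left parent at the first recombination, while positions 0.5 and 1 follow the right parent, so the sites have different ancestral paths.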
1.3 Marginal trees
• With recombination, we can still draw a genealogical tree at each site. At a position x in [0,1], we define the marginal tree T(x) to be the genealogical tree at x.
• In general, T(x) depends on x.
• The TMRCA can also change along the sequence.
• Tree change points are a subset of the recombination positions
[Figure: marginal trees T(0), T(0.5) and T(1); vertical axis: time]
In humans, genealogical trees are typically hundreds of thousands of years deep (tens of thousands of generations).
For a recombination event at x, T(x-) and T(x+) can be, but are not always, different (see problem sheet).
Question: Is this the best way to summarise information about the history of the sample?
1.4 The ancestral recombination graph
• Individual trees for each site are cumbersome
• They are not sufficient in general to reconstruct all historical recombination events
– This is problematic if recombination is the focus of interest
• The ancestral recombination graph (ARG) solves this problem (Griffiths 1991, Griffiths and Marjoram 1997, Hudson 1983)
• Provides an efficient way to record the history of a sample with recombination, without losing information
This is a directed, acyclic graph of degree three. Nodes correspond to ancestors of the sample
1.5 The ARG
Every ancestral recombination graph corresponds uniquely to an ancestral history of the sample
• Each tip corresponds to an individual chromosomal segment
• Join edges when ancestors coalesce
• Split edges at recombination events. The left branch contributes material to the left of the break
• Eventually a most recent common ancestor (MRCA) will be reached
[Figure: example ARG with recombination breakpoints 0.2, 0.6, 0.7 and 0.9; vertical axis: time]
1.6 Example ARGs
• Recombination events can change the tree “topology” (a)
• Can leave the tree “topology” unchanged but alter the times in the tree (b)
• Can leave the tree completely unchanged (c)
• Sample size n=4, single recombination event
[Figure: panels (a), (b) and (c)]
1.7 (Embedded) Marginal trees
Marginal trees are recovered from the graph by taking the appropriate branch at each recombination event
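[Figure: ARG with recombination breakpoints 0.2, 0.6, 0.7 and 0.9, and the embedded marginal trees T(0), T(0.5) and T(1); vertical axis: time]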
1.8 Embedded subgraphs
• To obtain the ARG for a subregion, say [a,b], we take the ARG for [0,1] and remove recombination events in [0,a) or (b,1], and respectively the left and right edges ancestral to these recombination events.
– These events occur outside [a,b]
– They therefore cannot affect the history of this subregion, so must be outside the subregion ARG
– Essentially we “drop” irrelevant edges
[Figure: ARG with recombination breakpoints 0.2, 0.6, 0.7 and 0.9, and the embedded subgraph for the subregion [0.4,0.8]]
2.1 Mutations in the ARG
• Suppose a mutation occurs in some sample ancestor
• Add a mark to the ARG, at the appropriate position in [0,1], to the place corresponding to that ancestor
• The entire mutational history can be placed on the graph.
[Figure: ARG with recombination breakpoints 0.2, 0.6, 0.7 and 0.9, and mutations marked at positions 0.05, 0.3, 0.35, 0.4, 0.5, 0.65, 0.72, 0.75, 0.8 and 0.9]
Sequence  0.05  0.3  0.35  0.4  0.5  0.65  0.72  0.75  0.8  0.9
1         1     0    1     0    1    0     1     0     0    1
2         1     1    1     0    0    1     1     0     0    1
3         1     1    1     0    0    1     1     0     0    1
4         1     1    1     1    0    1     1     0     0    1
5         0     1    1     0    0    1     1     0     0    1
6         0     0    0     0    0    1     0     1     0    1
7         0     0    0     0    0    1     0     1     1    1
8         0     0    0     0    0    1     0     1     0    1
2.2 The effect of recombination on data
• Suppose we are interested in performing inference on how much recombination there has been
• We cannot directly observe the ARG
• Instead, we need to indirectly infer recombination using mutation patterns in data
• Later we will investigate in depth stochastic models of the effect of recombination
• These can be used to obtain parametric estimates of recombination rate parameters
• An alternative approach is to not impose a particular model, but simply try to count how many recombination events occurred in a sample history
Advantages:
– Simple, easy to interpret in terms of counts, robust, requires few assumptions
– Provides insight into relationship between data and recombination history
Disadvantages:
– Hard to interpret results in terms of underlying recombination parameters
– Misses many recombination events
– Difficult to quantify uncertainty about how many events occurred
2.3 Reminder of infinite sites model
• If we assume the infinite sites model, then you have seen the following: the “4-gamete test”
Definition 2.3.1: Infinitely-many-sites model
Mutations occur at positions on the DNA sequences never before mutant. Every mutation occurring in the coalescent tree on an edge occurs in all genes subtended below the edge.
Proposition 2.3.2: Compatibility of mutations with the point mutations assumption
An n × s 0-1 matrix is compatible with a gene tree if and only if no pattern
0 0
0 1
1 0
1 1
occurs in any two columns and four rows. If the ancestral type is known and always denoted by 0, the first row of the pattern can be removed from the condition.
• Question: Is this result respected if recombination occurs?
Example: recombination causes violation of the 4-gamete test
[Figure: ARG for four sequences with one recombination event at breakpoint 0.2, and two mutations at positions 0.15 and 0.75]
Sequence  0.15  0.75
1         1     0
2         1     1
3         0     1
4         0     0
Note: only mutations on these two branches can violate the 4-gamete test, and this occurs if and only if the blue mutation occurs to the left of 0.2, and the black mutation to the right of 0.2.
2.4 Detecting recombination events (Hudson and Kaplan, 1985)
Lemma 2.4: The 4-gamete test
Suppose we have variation data for n individuals at s sites, represented as an n × s 0-1 matrix. Under the infinite sites model, if the pattern
0 0
0 1
1 0
1 1
occurs in two columns corresponding to positions x and y, then at least one recombination event must have occurred in the sample history, in the interval (x,y). If the ancestral type is known and denoted by 0 at x and y, then the first row of the pattern can be removed from the condition.
Proof
We prove the contrapositive.
Suppose there are no recombination events between x and y. Then the ancestral recombination graph for the interval [x,y] is simply a coalescent tree. Hence, by Proposition 2.3.2 the above pattern cannot occur in the data.
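The condition in the lemma is straightforward to check for a pair of columns. A minimal Python sketch follows; the function name `four_gamete_incompatible` and the `ancestral_known` flag are illustrative choices, not notation from the course.

```python
def four_gamete_incompatible(column_x, column_y, ancestral_known=False):
    """Return True if two 0-1 columns show all four gametes 00, 01, 10, 11.

    If ancestral_known is True, the ancestral type is taken to be 0 at both
    sites, so observing 01, 10 and 11 is already enough.
    """
    observed = set(zip(column_x, column_y))
    required = {(0, 1), (1, 0), (1, 1)}
    if not ancestral_known:
        required.add((0, 0))
    return required <= observed

# Columns at positions 0.15 and 0.75 from the four-sequence example.
site_a = [1, 1, 0, 0]
site_b = [0, 1, 1, 0]
print(four_gamete_incompatible(site_a, site_b))
```

For the four-sequence example all four gametes are present, so the pair of sites is flagged as requiring at least one recombination event between them.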
2.5 Hudson’s RM (Hudson and Kaplan 1985)
• Suppose we have sites 1,2,..,10 and the dataset:
Sequence  0.05  0.3  0.35  0.4  0.5  0.65  0.72  0.75  0.8  0.9
1         1     0    1     0    1    0     1     0     0    1
2         1     1    1     0    0    1     1     0     0    1
3         1     1    1     0    0    1     1     0     0    1
4         1     1    1     1    0    1     1     0     0    1
5         0     1    1     0    0    1     1     0     0    1
6         0     0    0     0    0    1     0     1     0    1
7         0     0    0     0    0    1     0     1     1    1
8         0     0    0     0    0    1     0     1     0    1
How many recombination events?
Proposition 2.5: Hudson’s RM
Under the infinite sites model with recombination, suppose we have data for n sequences at s (ordered) segregating sites 1,2,..,s. Then the following recursive procedure gives a minimum number of recombination events in the history of the sample, based on the results of the four gamete test.
Step 1: For all pairs (i,j), construct a matrix R where Rij=1 if sites i and j show all 4 gametes, and 0 otherwise.
Step 2: Set i=1,l=2 and RM=0
Step 3: If max{Rkl: k=i,..,l-1}=1 then increment RM by 1 and reset i’=l. Otherwise, set i’=i
Step 4: If l=s, terminate. Otherwise, set i=i’, l=l+1 and return to step 3.
Remark The idea here is to go from left to right, putting in a recombination whenever one is required by the 4 gamete test, and that recombination must have happened to the right of the furthest right recombination placed so far.
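Steps 1 to 4 can be sketched in Python as below. This is an illustrative implementation under the assumption that the ancestral type is unknown (all four gametes are required); the names `four_gametes` and `hudson_rm`, and passing the matrix R as a function, are choices made for this example.

```python
def four_gametes(data, i, j):
    """True if columns i and j (0-based) of the 0-1 matrix show 00, 01, 10 and 11."""
    return len({(row[i], row[j]) for row in data}) == 4

def hudson_rm(incompatible, s):
    """Hudson and Kaplan's R_M for ordered sites 1..s.

    incompatible(k, l) must return True when sites k < l (1-based) fail the
    four-gamete test, i.e. R_kl = 1.
    """
    rm = 0
    i = 1
    for l in range(2, s + 1):
        # Step 3: is any pair (k, l) with i <= k < l incompatible?
        if any(incompatible(k, l) for k in range(i, l)):
            rm += 1
            i = l  # the event is placed immediately to the left of site l
    return rm

# The ten-site dataset (rows are sequences 1..8).
data = [
    [1, 0, 1, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 1],
]
s = len(data[0])
print(hudson_rm(lambda k, l: four_gametes(data, k - 1, l - 1), s))
```

For the ten-site dataset, only sites 1 and 2 (positions 0.05 and 0.3) show all four gametes under this version of the test, so the sketch returns 1.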
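Application of the algorithm
Example with sites 1,2,..,16 in which the following entries equal 1: R12, R16, R36, R67, R49, R5,10, R6,10, R7,11, R8,11, R6,15, R12,16. All other Rij=0.
Values of i and RM at the start of step 3 for each l (the final row gives RM on termination):
l     i     RM
2     1     0
3     2     1
4     2     1
5     2     1
6     2     1
7     6     2
8     7     3
9     7     3
10    7     3
11    7     3
12    11    4
13    11    4
14    11    4
15    11    4
16    11    4
-     -     5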
Proof of proposition 2.5
The result is trivial if Rij=0 for all i and j. Otherwise, let the true minimum number of events based on the 4-gamete test be W.
Suppose wlog that RM is incremented by 1 at RM steps corresponding to values $l_1, l_2, \ldots, l_{R_M}$ of l. Setting $l_0=1$, at these steps i therefore takes values $l_0, l_1, \ldots, l_{R_M-1}$ respectively, because i is reassigned the current value of l at each increase in RM. We prove first that $W \ge R_M$, then that $W \le R_M$.
For each j, by construction
$$\max\{R_{k l_j} : l_{j-1} \le k < l_j\} = 1.$$
Then there must be recombination in the interval $(l_{j-1}, l_j)$ for each j and, as there are RM such non-overlapping intervals,
$$W \ge R_M.$$
To prove $W \le R_M$, suppose we place RM recombination events along the sequence by placing one event in each interval $(l_{j-1}, l_j)$, j=1,2,...,RM, as far to the right as possible, i.e. immediately to the left of site $l_j$. Supposing for a contradiction that this did not provide a solution, there must exist p, q such that Rpq=1 but no event is placed inside the interval (p,q). In the qth round of the algorithm, l=q; write $i_q$ for the value of i in that round. Since no event lies inside (p,q), Rpq cannot have produced an increase in RM in this round (an event would then have been placed immediately to the left of q), so this bound was not considered: $q > i_q > p$. This implies $i_q > 1$ and hence $i_q = l_m$ for some m>0. Thus there is a recombination placed in the interval $(l_{m-1}, l_m)$, immediately to the left of site $l_m$. However $p < l_m < q$, so this event lies within (p,q), a contradiction.
Example: Drosophila data
• Chromosome 4 in three Drosophila species
• Is there evidence for recombination?
• None seen in thousands of “crosses”
• Arguello et al. MBE 2009
• Sequenced 80 genes
• Definitively recombination, at a low rate
• More recombination in D. simulans than in other species
• Suggests deeper ARGs (larger population size) for this species
2.6 Properties of RM
• RM provides a simple, constructible measure of the influence of recombination on a sample of sequences
• This has led to its use in large real datasets by researchers
• RM relies on mutations in suitable places to detect recombination events, so unless the mutation rate is very high it typically drastically underestimates the number of recombination events (Hudson and Kaplan, 1985).
• Under a coalescent model, expectation of RM grows extremely slowly with sample size n – no faster than log(log(n))
• In general, recombination events are much more challenging to detect directly, and study, than mutation events
• Better bounds are also available, which extend the ideas used to construct RM (Myers and Griffiths 2003, Hein 1990, Song and Hein 2004, 2005, Bafna and Bansal 2006, Lyngsø, Song et al. 2008, Liu and Fu 2008, and more)
2.7 Haplotype patterns and recombination events
• Consider the following toy dataset. How many recombination events are required?
0 0 0
0 1 0
1 0 0
1 1 0
0 0 1
1 0 1
• RM=1
• It is clear that under the infinite sites model, the first event back in time must be a recombination event.
• No matter which of the sequences we decide to recombine, after this event there will still be 5 unchanged sequences
• (Exercise) no matter what choice we made, these 5 sequences still indicate recombination (4-gamete test)
• So we need at least one more recombination event in the history of these sequences, and RM could be improved to 2.
How can we do better?
One approach is to use haplotype information
2.8 The haplotype bound
Proposition 2.8: The haplotype bound
Under the infinite sites model with recombination, suppose we have data for n sequences at s segregating sites 1,2,..,s. Suppose that the n x s data matrix for these sequences has H unique rows, or haplotypes. Then a lower bound on the number of recombination events in the history of the sample is H-s-1.
Proof Consider the ancestral recombination graph representing the history of the sample. Beginning with the ancestral sequence at the TMRCA, we can view our sample haplotypes as being created forward in time.
Since there are H haplotypes, only one of which can be the ancestral type, there must be at least H-1 further events in the history creating novel types. Each mutation or recombination event can create at most one novel type. Coalescence events simply duplicate existing types (forward in time). By the infinite sites assumption there are s mutation events, so if R is the number of recombination events we must have R+s>=H-1.
Remark From the proof, if the ancestral type is known, we can add it to our collection of haplotypes. Note also that the four gamete test is just the special case s=2.
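The bound is simple to compute from a data matrix. The sketch below is illustrative only: the name `haplotype_bound`, the optional `sites` argument for restricting to a subset of columns, and the clamping of the result at 0 are assumptions of this example, and every selected column is assumed to be a segregating site.

```python
def haplotype_bound(data, sites=None):
    """Haplotype lower bound H - s - 1 (reported as at least 0) on recombination events.

    data is a list of 0-1 rows; sites is an optional list of 0-based column
    indices restricting the bound to a subset of the segregating sites.
    """
    if sites is None:
        sites = range(len(data[0]))
    sites = list(sites)
    haplotypes = {tuple(row[c] for c in sites) for row in data}
    return max(0, len(haplotypes) - len(sites) - 1)

# The six-sequence, three-site toy dataset.
toy = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
print(haplotype_bound(toy))          # all three sites: H = 6, s = 3
print(haplotype_bound(toy, [0, 1]))  # sites 1 and 2 only: the four-gamete case
```

Restricting to two sites reproduces the four-gamete test, as noted in the remark: four distinct haplotypes at two sites give a bound of 1.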
Example (toy) dataset revisited
• Consider the following toy dataset. How many recombination events are required?
0 0 0
0 1 0
1 0 0
1 1 0
0 0 1
1 0 1
• H=6, s=3 giving R>=6-3-1=2
• This is the right answer here: a history with 2 events is possible (hint: recombine sequences 4 and 6 first)
• Note that given a dataset with s sites, we can
– Apply proposition 2.8 to any subset of t of the s sites
– Obtain a bound on the number of events between the first and last members of the subset
– This will result in a lower bound matrix Rij with non-negative integer entries
• We want to be able to combine bounds once again, to produce an overall bound for a region
Combining bounds
0 0 0 0 0 0 0
0 1 0 1 1 1 1
1 0 0 1 0 1 0
1 1 0 1 0 1 1
0 0 1 1 1 0 0
1 0 1 0 1 1 0
[Figure: local bounds 1, 1, 2, 2, 2, 1 marked on intervals of the seven sites]
For this set of bounds, HM=5
We could keep searching site subsets
Typically performance can be good if we use e.g. only subsets up to size 5, up to some maximal distance apart.
Clearly, software is needed to calculate the bound!
2.9 HM
Proposition 2.9: HM
Under the infinite sites model with recombination, suppose the haplotype bound gives a local bound matrix R, where $R_{ij}$ is the best haplotype bound between sites i and j. Define $H^M_{ij}$ to be the minimum number of recombination events between sites i and j satisfying this set of bounds. Then the following is true:
$$H^M_{1j} = \max\{H^M_{1k} + R_{kj} : k = 1, \ldots, j-1\}.$$
This recursive system can be used to obtain $H^M_{1j}$ given $H^M_{12}, H^M_{13}, \ldots, H^M_{1(j-1)}$ (taking $H^M_{11}=0$), and hence provides an efficient means of obtaining $H^M_{1s}$, the overall lower bound on recombination events.
Proof
Let the true minimum be W. Note that the above construction means that $H^M_{1s}$ is a sum of $R_{ij}$ terms corresponding to non-overlapping intervals. Thus, obviously
$$W \ge H^M_{1s}.$$
To prove the reverse inequality, we construct a minimal placement of recombination events as follows.
Define a vector of recombination counts in the s-1 mutation intervals with $r_j$, the number of events between mutant sites j-1 and j, given by $r_j = H^M_{1j} - H^M_{1(j-1)}$ (take $H^M_{11}=0$). Supposing for a contradiction this does not satisfy the full bound set R={Rij}, we may pick j to be minimal such that, for some i, fewer than $R_{ij}$ events are placed within (i,j). By the recursive formula in the construction:
$$H^M_{1j} \ge H^M_{1i} + R_{ij}.$$
But then
$$\sum_{k=i+1}^{j} r_k = \sum_{k=i+1}^{j} \left(H^M_{1k} - H^M_{1(k-1)}\right) = H^M_{1j} - H^M_{1i} \ge R_{ij},$$
contradicting the fact that fewer than $R_{ij}$ events are placed within (i,j). Hence this placement, which uses $H^M_{1s}$ events in total, satisfies every bound, so $W \le H^M_{1s}$.
Note that the proof provides an explicit possible solution for where recombination events are placed. This is usually non-unique: this solution corresponds to putting events as far “right” as possible.
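The recursion of Proposition 2.9 is a short dynamic programme. A minimal Python sketch, assuming the local bounds are supplied as a function of the 1-based site pair; the name `combine_local_bounds` and the list layout of the result are choices made for this example.

```python
def combine_local_bounds(local_bound, s):
    """Combine local bounds R_ij (sites 1 <= i < j <= s) into H^M_{1j}.

    local_bound(i, j) returns the local lower bound R_ij. The result is a
    list h with h[j] = H^M_{1j}; h[0] is unused and h[1] = 0.
    """
    h = [0] * (s + 1)
    for j in range(2, s + 1):
        h[j] = max(h[k] + local_bound(k, j) for k in range(1, j))
    return h

# Local haplotype bounds for the three-site toy dataset:
# R_12 = 1, R_13 = 2 and R_23 = 0.
toy_bounds = {(1, 2): 1, (1, 3): 2, (2, 3): 0}
h = combine_local_bounds(lambda i, j: toy_bounds[(i, j)], 3)
print(h[3])
# Events placed as far right as possible: r_j = H^M_{1j} - H^M_{1(j-1)}.
print([h[j] - h[j - 1] for j in range(2, 4)])
```

If the local bounds are just the 0/1 outcomes of the four-gamete test, the same recursion gives the value produced by the RM procedure of Proposition 2.5.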
The benefits of using more information
The following charts show that the expectation of the haplotype bound (solid lines) can greatly exceed that of RM (dotted lines), especially as sample size becomes large. These expectations were calculated using the coalescent with recombination, which we will come to soon.
Myers and Griffiths (2003)
Example: the haplotype bound in humans
The following is based on real human mutation data for 10,000 bases around the LPL gene. We can plot the recombination density between pairs of sites as an x, y colour plot:
Question: Is there a “hotspot” for recombination here?
Caveat: Apparent clustering of recombination might be due to stochastic variation in histories. We need to model this explicitly.
[Figure: panels Rh and D’ (association measure). Data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005).]
Example: Humans versus chimps
These are similar plots, for aligned regions of the human and chimpanzee genomes (Winckler et al. 2005).
Further (model based) analyses confirm that recombination rates are very different between humans and chimpanzees genome-wide (Winckler et al. 2005, Ptak et al. 2005, Myers et al. 2009)
98.6% similar at aligned genomic bases
Example: Malaria
Malaria parasites appear to have a similarly uneven distribution of recombination sites along their genomes (Mu et al., Nature Genetics 2010)
[Figure: panels for Chromosome 1 and Chromosome 7, each showing Asia and Africa]
2.10 Conclusions on recombination detection
• Direct detection of recombination events offers a very useful approach to:
– Understanding the influence of recombination on data
– Discovering the distribution of events along sequences
• Still more sophisticated approaches have been developed in recent years (Song and Hein 2004, 2005, Bafna and Bansal 2006, Lyngsø, Song et al. 2008, Liu and Fu 2008, and more)
– Improvements over HM, though these are modest.
• All strict minima miss the large majority of recombination events
• In organisms with repeat mutation, approaches need to be adapted (Liu and Fu, 2008) and the problem is even tougher
• A model for populations with recombination is vital to
– Recover more of the information from data
– Perform inference on underlying recombination parameters
– Estimate uncertainty, make statements about rate variation, make statements about particular sample histories, allow for demographic histories, selection,...
3.0 The Wright-Fisher model revisited
We incorporate recombination in the Wright-Fisher model:
• Constant size population of size 2N
• Generations are discrete, with the next generation formed from the previous:
• Individuals choose a single parent uniformly, with probability 1-r
• Are recombinant, choosing two parents at random and a recombination breakpoint, with probability r
• Can also mutate, with probability m, and choose a site to mutate.
Figure labels:
• Chromosome randomly chooses parent in previous generation
• Single parent, probability 1-r
• Two parents, probability r
• Denote by double arrow: left parent single line, right parent double line
• Probability density function f
3.1 The history of a sample
Consider a sample of size n from the population
We will define
$$\rho = 4Nr, \qquad \theta = 4Nm.$$
Consider the limit as $N \to \infty$ while ρ, θ remain constant.
At some time back, suppose there are j ancestors of the sample remaining and consider the events in the previous generation
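Consider the probabilities of different possible events while j>1 ancestors remain (here $\rho = 4Nr$ and $\theta = 4Nm$):
$$P(\text{No event}) = (1-r)^j (1-m)^j \prod_{i=1}^{j-1}\left(1 - \frac{i}{2N}\right) = \left(1-\frac{\rho}{4N}\right)^j \left(1-\frac{\theta}{4N}\right)^j \left(1 - \frac{1}{2N}\binom{j}{2} + O(N^{-2})\right)$$
$$= \left(1 - \frac{j\rho}{4N} + O(N^{-2})\right)\left(1 - \frac{j\theta}{4N} + O(N^{-2})\right)\left(1 - \frac{1}{2N}\binom{j}{2} + O(N^{-2})\right)$$
$$= 1 - \frac{1}{2N}\left[\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}\right] + O(N^{-2})$$
$$P(\text{Only 1 mut.}) = j\,m\,(1-m)^{j-1}(1-r)^j \prod_{i=1}^{j-1}\left(1 - \frac{i}{2N}\right) = \frac{1}{2N}\cdot\frac{j\theta}{2} + O(N^{-2})$$
$$P(\text{Only 1 rec.}) = j\,r\,(1-r)^{j-1}(1-m)^j \prod_{i=1}^{j-1}\left(1 - \frac{i}{2N}\right) = \frac{1}{2N}\cdot\frac{j\rho}{2} + O(N^{-2})$$
$$P(\text{One coal.}) = (1-r)^j (1-m)^j \binom{j}{2}\frac{1}{2N}\prod_{i=1}^{j-2}\left(1 - \frac{i}{2N}\right) = \frac{1}{2N}\binom{j}{2} + O(N^{-2})$$
$$P(\ge 2 \text{ events}) = O(N^{-2})$$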
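Now, as for the coalescent without recombination, we measure time in units of 2N generations, define t=T/2N, and consider event probabilities as $N \to \infty$ and t remains fixed. Let $T_j$ be the waiting time back (in generations) until some event occurs, while there are j ancestors. Then
$$P(T_j > 2Nt) = \left(1 - \frac{1}{2N}\left[\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}\right] + O(N^{-2})\right)^{2Nt} \to \exp\left(-\left[\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}\right] t\right) \quad \text{as } N \to \infty.$$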
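Thus, the waiting time $T_j$, measured in units of 2N generations, is exponentially distributed. When an event occurs:
$$P(\text{One rec.}) = \frac{j\rho/4N + O(N^{-2})}{\binom{j}{2}/2N + j\rho/4N + j\theta/4N + O(N^{-2})} \to \frac{\rho}{j-1+\rho+\theta} \quad \text{as } N \to \infty,$$
$$P(\text{One mut.}) \to \frac{\theta}{j-1+\rho+\theta} \quad \text{as } N \to \infty,$$
$$P(\text{One pair coalesce}) \to \frac{j-1}{j-1+\rho+\theta} \quad \text{as } N \to \infty,$$
$$P(\text{Two or more events}) \to 0 \quad \text{as } N \to \infty.$$
In the limit, this fully defines the ancestry process. By obvious symmetry, at coalescences a random pair coalesces, and a random sequence recombines or mutates at these respective events. This defines the coalescent with recombination: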
3.2 The coalescent with recombination
Definition 3.2: The coalescent with recombination (Hudson 1983, Griffiths 1991, Griffiths and Marjoram 1997)
The coalescent with recombination is a Markov process describing the history, backward in time, of a sample of n genes drawn from a population. While j ancestors remain, j>1, the time to the next event has an exponential distribution with rate parameter
$$\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}.$$
After sampling the next event time, an event is chosen:
• With probability $\frac{j-1}{j-1+\rho+\theta}$, two sequences are chosen at random and coalesce.
• With probability $\frac{\rho}{j-1+\rho+\theta}$, one sequence is chosen at random to recombine.
• With probability $\frac{\theta}{j-1+\rho+\theta}$, one sequence is chosen at random to mutate.
Definition 3.2 (continued)
• At recombination events, the breakpoint is chosen using pdf f.
• In drawing the graph, coalescence events are represented as edge joins backward in time, recombination events as splits, and mutation events marked as points on the edges.
• Given a particular mutation model (specified forward in time) we first choose the ancestor type, and then choose a new mutant according to the model at each mutation point, based in general on the type of the edge immediately above the mutation event.
• If we are not interested in recording mutations, or are investigating the genealogical relationships alone, we can simply set θ=0.
• We usually terminate the process the first time j=1. The first ancestor of the sample where j=1 is the grand most recent common ancestor of the sample.
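The definition translates directly into a simulation of the event sequence. The Python sketch below tracks only the number of ancestors j, not the graph, the breakpoints or which sequences are involved; the function name `simulate_events` and the tuple layout of its output are assumptions made for this illustration.

```python
import random

def simulate_events(n, rho, theta, rng):
    """Simulate the sequence of events of the coalescent with recombination.

    Only the number of ancestors j is tracked (not the graph itself).
    Returns a list of (time, event, j_after) tuples, ending when j = 1.
    Time is measured in the coalescent units of the definition.
    """
    j = n
    t = 0.0
    events = []
    while j > 1:
        total_rate = j * (j - 1) / 2 + j * rho / 2 + j * theta / 2
        t += rng.expovariate(total_rate)
        # Choose the event type with weights (j - 1) : rho : theta.
        u = rng.random() * (j - 1 + rho + theta)
        if u < j - 1:
            event, j = "coalescence", j - 1
        elif u < j - 1 + rho:
            event, j = "recombination", j + 1
        else:
            event = "mutation"
        events.append((t, event, j))
    return events

rng = random.Random(2024)
for time, event, j in simulate_events(n=4, rho=1.0, theta=0.5, rng=rng):
    print(f"{time:.3f} {event:13s} j={j}")
```

Setting rho=0 and theta=0 in this sketch leaves only coalescences, so exactly n-1 events occur, as in the standard coalescent.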
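[Figure: the example ARG (recombination breakpoints 0.2, 0.6, 0.7 and 0.9, with mutations marked at their positions) annotated with the waiting times W1, W2, W3, ..., Wm between successive events. While 8 ancestors remain, the waiting time is exponential with rate $\binom{8}{2} + 4\rho + 4\theta$ and a given pair coalesces with probability $1/\left(\binom{8}{2} + 4\rho + 4\theta\right)$; the corresponding terms are shown for other numbers of ancestors. P(ARG) is written as a product of such waiting-time and event-probability terms over the intervals W1, ..., Wm.]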
3.3 Properties of the coalescent with recombination
• We have shown that the coalescent with recombination is the limit process (as N becomes large) describing the history of a sample drawn from a constant size Wright-Fisher model.
• It also arises as a limit process in many other models, with continuous or discrete generations
• ρ=0 corresponds to the standard coalescent
• The number of ancestors j can be thought of as a random walk.
• The coalescence rate grows quadratically with j while the recombination rate grows only linearly with j. Thus eventually the random walk will hit j=1 with probability 1 (exercise sheet)
• The expected number of recombination events before this happens, $E_j$ when j ancestors remain, satisfies the recursion (Exercise; Ethier and Griffiths 1990)
$$E_j = \frac{\rho}{\rho + j - 1}\left(1 + E_{j+1}\right) + \frac{j-1}{\rho + j - 1}\,E_{j-1}, \qquad E_1 = 0,$$
with solution
$$E_n = \rho \int_0^1 \frac{1 - x^{n-1}}{1 - x}\, e^{\rho(1-x)}\, dx.$$
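As a numerical sanity check, the integral form $E_n = \rho \int_0^1 \frac{1 - x^{n-1}}{1-x} e^{\rho(1-x)} dx$ can be evaluated with a simple midpoint rule and compared with the recursion $E_j = \frac{\rho}{\rho+j-1}(1+E_{j+1}) + \frac{j-1}{\rho+j-1}E_{j-1}$. The sketch below is only illustrative: the function name `expected_recombinations` and the number of integration steps are arbitrary choices, and mutation events are ignored (they do not change j).

```python
import math

def expected_recombinations(n, rho, steps=10000):
    """Midpoint-rule value of rho * int_0^1 (1 - x^(n-1)) / (1 - x) * exp(rho (1 - x)) dx."""
    h = 1.0 / steps
    total = 0.0
    for m in range(steps):
        x = (m + 0.5) * h
        # (1 - x^(n-1)) / (1 - x) written as 1 + x + ... + x^(n-2)
        geometric = sum(x ** k for k in range(n - 1))
        total += geometric * math.exp(rho * (1.0 - x))
    return rho * total * h

rho = 1.0
e1, e2, e3 = (expected_recombinations(n, rho) for n in (1, 2, 3))
print(e2, math.exp(rho) - 1.0)
# Recursion at j = 2: (rho + 1) E_2 = rho (1 + E_3) + E_1
print((rho + 1.0) * e2, rho * (1.0 + e3) + e1)
```

For n=2 the integral reduces to $e^{\rho} - 1$, which grows very quickly with ρ because this count includes every recombination event in the graph back to the point where j first reaches 1.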
3.4 Description in terms of Poisson processes
• We can think of the coalescent with recombination in terms of independent Poisson processes on edges and pairs of edges
• This construction is helpful in theoretical calculations and obtaining subgraphs
• For this course, we only need to restate two general properties of homogeneous Poisson processes on the real line (these facts were also used in the earlier part of the course). Here N(t) is the number of events before time t.
3.4.1. If $N(t), t \ge 0$ is a homogeneous Poisson process of rate λ, the waiting time between events is exp(λ). In particular, the waiting time until the first event follows this distribution.
3.4.2. If $N_1(t), N_2(t), \ldots, N_n(t), t \ge 0$ are independent homogeneous Poisson processes of rates $\lambda_1, \lambda_2, \ldots, \lambda_n$, then $N(t) = \sum_{i=1}^{n} N_i(t)$ is a homogeneous Poisson process of rate $\sum_{i=1}^{n} \lambda_i$. Independently for each event in the summed process, the probability of occurrence in process j is $\lambda_j / \sum_{i=1}^{n} \lambda_i$.
• Exactly as without recombination, we can fully construct the ancestral recombination graph using independent Poisson processes in reverse time:
– Each of the j(j-1)/2 pairs of edges independently coalesces as a Poisson process with rate 1
– Each of the j edges mutates at rate θ/2.
– Each of the j edges recombines at rate ρ/2.
– Events in the Poisson processes are “racing” each other
• To prove this gives the correct graph, we simply need to show it yields the correct rates
• By fact 3.4.2, while j ancestors remain, events occur as a Poisson process with total rate
$$\binom{j}{2} + \frac{j\rho}{2} + \frac{j\theta}{2}.$$
• The time to the first event has the correct exponential distribution, by fact 3.4.1. When the event occurs, fact 3.4.2 implies it is e.g. a coalescence (between a random pair of edges) with probability
$$\frac{\binom{j}{2}}{\binom{j}{2} + j\rho/2 + j\theta/2} = \frac{j-1}{j-1+\rho+\theta}.$$
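The rate bookkeeping in this argument can be checked with a few lines of Python. The function name `race_probabilities` and the example values of j, ρ and θ are illustrative assumptions.

```python
def race_probabilities(j, rho, theta):
    """Probabilities that the next event is a coalescence, recombination or mutation.

    Uses the per-process rates: 1 per pair of edges, rho/2 and theta/2 per edge.
    """
    coalescence_rate = j * (j - 1) / 2
    recombination_rate = j * rho / 2
    mutation_rate = j * theta / 2
    total = coalescence_rate + recombination_rate + mutation_rate
    return (coalescence_rate / total, recombination_rate / total, mutation_rate / total)

j, rho, theta = 8, 2.0, 1.0
p_coal, p_rec, p_mut = race_probabilities(j, rho, theta)
# The first value should agree with (j - 1) / (j - 1 + rho + theta).
print(p_coal, (j - 1) / (j - 1 + rho + theta))
```

With j=8, ρ=2 and θ=1 the total rate is 28+8+4=40, so the three probabilities are 0.7, 0.2 and 0.1.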
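[Figure: the example ARG annotated with the waiting times W1, W2, W3, ..., Wm and the Poisson process rates: rate 1 per pair of edges for coalescence, rate ρ/2 per edge for recombination and rate θ/2 per edge for mutation, giving total rate $\binom{8}{2} + 4\rho + 4\theta$ while 8 ancestors remain and $\binom{6}{2} + 3\rho + 3\theta$ while 6 ancestors remain.]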
3.5 Subgraphs
• In 1.8, we saw that we can construct the ARG for a subregion [a,b] by ignoring all recombination (and mutation) events outside [a,b]. If recombination and mutation are uniform, we construct a graph by starting with n sequences, and backward in time introducing
– Recombination events at rate ρ(b-a)/2 per edge
– Mutation events at rate θ(b-a)/2 per edge
– Coalescence at rate 1 per pair of edges
Thus the ARG for a subregion is (of course) distributed according to the coalescent with recombination for the smaller region.
• “Small ARG”: In certain settings, we can gain efficiency by only following the history of specific branches contributing to genetic variation, building a coalescent using the Poisson process rates. Edges (or recombinations producing edges) carrying no genetic material passed on to the sample, and edges carrying only material that has reached an MRCA, need not be followed. Similarly, mutations outside ancestral material need not be simulated.
– This graph can be produced directly (Hudson 1983)
– Can be much smaller than the “big ARG”
– Preferred for simulation for this reason
Remark: small ARG in the coalescent
Simulation of the small graph is efficient (Hudson 1991).
Avoid considering ancestors sharing no material with the sample.
Simulate directly by having different rates on different lineages in the past. We can measure the coalescence, mutation, recombination rates:
[Figure: example small ARG with positions marked along the sequence. One lineage is labelled “Rates 1, θ/2, ρ/2” and another “Rates 1, 0.7θ/2, 0.7ρ/2”. One recombination event carries the caption: The small ARG does not include this recombination.]
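One way to picture "different rates on different lineages" is a helper that turns a lineage's ancestral material into its mutation and recombination rates. The sketch below is an assumption-laden illustration, not the algorithm from the slides: the name `lineage_rates` is hypothetical, ancestral material is represented as non-overlapping intervals in [0,1] that have not yet reached an MRCA, and the recombination rate is taken proportional to the span from the leftmost to the rightmost ancestral position (a common convention in Hudson-style simulators).

```python
def lineage_rates(intervals, theta, rho):
    """Return (mutation_rate, recombination_rate) for one lineage.

    intervals: list of (start, end) pairs of ancestral material in [0, 1],
               assumed non-overlapping and not yet at an MRCA.
    Assumption: mutations matter only inside ancestral material, while
    recombinations matter anywhere between its leftmost and rightmost ends.
    """
    if not intervals:
        return 0.0, 0.0  # a lineage with no ancestral material is not followed
    material = sum(end - start for start, end in intervals)
    span = max(end for _, end in intervals) - min(start for start, _ in intervals)
    return theta * material / 2.0, rho * span / 2.0


if __name__ == "__main__":
    # A lineage carrying the contiguous segment [0, 0.7]: rates 0.7*theta/2 and 0.7*rho/2.
    print(lineage_rates([(0.0, 0.7)], theta=2.0, rho=4.0))
```

Under these assumptions a lineage carrying the whole region has rates θ/2 and ρ/2, and one carrying a contiguous 70% of it has rates 0.7θ/2 and 0.7ρ/2. Coalescence stays at rate 1 per pair in both cases.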
3.6 Marginal trees revisited
Marginal trees are recovered from the graph by taking the appropriate branch at each recombination event
Note the marginal tree at x is the limit as δ tends to 0 of the subgraph on [x,x+δ].
In this subgraph, line pairs coalesce at rate 1, so while j ancestors remain the total coalescence rate is j(j-1)/2.
Lines recombine at rate ρδ/2 per edge, so in the limit there is no recombination and the marginal tree at x is described by the usual coalescent.
(Actually this is obvious, because we could make the tree at x based on the large size limit of a finite Wright-Fisher population directly, in which case recombination would not occur.)
[Figure: marginal trees T(0), T(0.5) and T(1) recovered from the graph, drawn against a time axis.]
3.7 Theoretical results for the coalescent with recombination (?)
• The coalescent with recombination is much harder to derive exact results for than the coalescent
• These are mainly restricted to samples of size 2, or the “big ARG”, which contains some ancestors unrelated to the sample
• In other settings, we rely on
– Numerical recursions to solve
– Lower and upper bounding of solutions
– Analytic approximation of solutions
• We will see examples of these settings and approaches
• For additional analytical results, see Durrett, and Wakeley, and references therein (important papers include Hudson (1983), Hudson and Kaplan (1985), Ethier and Griffiths (1990), Griffiths and Marjoram (1997), Wiuf and Hein (1999) and others)
3.8 Mean and variance of the number of segregating sites
Assume the infinite sites model and a uniform mutation rate along [0,1].
Let us define $S_n$ to be the number of mutation events in a sample of size n that occur in ancestral material and prior to the MRCA at their position.
Suppose the region consists of m discrete sites where each mutates at rate θ/(2m), and between each adjacent pair of which recombination occurs at rate ρ/(2(m-1)). The continuous model is the limit as m→∞.
Define $T_i$ to be the total tree length at site i. Then conditional on $T_1, T_2, \ldots, T_m$, the total number of mutations is a sum of independent Poisson random variables, so is Poisson with mean
$$W_m = \frac{\theta}{2m}\sum_{i=1}^{m} T_i.$$
Thus if $T_{ij}$ is the time while j ancestors remain in tree i:
$$E(S_n) = E(W_m) = \frac{\theta}{2m}\sum_{i=1}^{m} E(T_i) = \frac{\theta}{2m}\sum_{i=1}^{m}\sum_{j=2}^{n} j\,E(T_{ij}) = \frac{\theta}{2m}\sum_{i=1}^{m}\sum_{j=2}^{n}\frac{2}{j-1} = \theta\sum_{j=1}^{n-1}\frac{1}{j},$$
so the mean number of sites is unchanged relative to the no recombination case.
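For the variance, note
$$\begin{aligned}
\mathrm{var}(S_n) &= E(S_n^2) - E(S_n)^2 = E(W_m + W_m^2) - E(W_m)^2 = E(W_m) + \mathrm{var}(W_m) \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{4m^2}\Big[\sum_{i=1}^{m}\mathrm{var}(T_i) + 2\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\mathrm{covar}(T_i,T_j)\Big] \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{4m^2}\Big[\sum_{i=1}^{m}\mathrm{var}(T_i) + 2\sum_{i=1}^{m-1}\sum_{k=1}^{m-i}\mathrm{covar}(T_i,T_{i+k})\Big] \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{4m^2}\Big[m\,\mathrm{var}(T_1) + 2\sum_{k=1}^{m-1}(m-k)\,f_n\Big(\frac{\rho k}{m-1}\Big)\Big] \\
&= \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{2}\sum_{k=1}^{m-1}\frac{1}{m}\Big(1-\frac{k}{m}\Big)f_n\Big(\frac{\rho k}{m-1}\Big) + O\Big(\frac{1}{m}\Big) \\
&\to \theta\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\theta^2}{2}\int_0^1 (1-z)\,f_n(\rho z)\,dz \quad\text{as } m\to\infty,
\end{aligned}$$
where $f_n(z)$ is the covariance in tree times between sites a distance z/2 recombination units apart.
It is clear that we expect $f_n$ to decrease with ρ, and further
$$f_n(0) = \mathrm{var}(T_1) = 4\sum_{i=1}^{n-1}\frac{1}{i^2}, \quad\text{while } f_n(z)\to 0 \text{ as } z\to\infty,$$
so at ρ = 0
$$\mathrm{var}(S_n) = \theta\sum_{i=1}^{n-1}\frac{1}{i} + \theta^2\sum_{i=1}^{n-1}\frac{1}{i^2},$$
and as ρ → ∞
$$\mathrm{var}(S_n) \to \theta\sum_{i=1}^{n-1}\frac{1}{i}.$$
The variance is reduced relative to the no recombination case. (Hudson 1983, Griffiths and Marjoram 1997)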
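The mean and the two limiting variances are easy to evaluate numerically. The Python sketch below assumes the formulas E(S_n) = θ Σ 1/i, var(S_n) = θ Σ 1/i + θ² Σ 1/i² at ρ = 0, and var(S_n) → θ Σ 1/i as ρ → ∞ (sums over i = 1, ..., n-1); the function names are illustrative only.

```python
def harmonic(n, power=1):
    """Sum of 1/i**power for i = 1, ..., n-1."""
    return sum(1.0 / i ** power for i in range(1, n))


def mean_segregating_sites(n, theta):
    """E(S_n) = theta * sum_{i=1}^{n-1} 1/i, whatever the recombination rate."""
    return theta * harmonic(n)


def variance_limits(n, theta):
    """Return (variance as rho -> infinity, variance at rho = 0) for S_n."""
    free_recombination = theta * harmonic(n)
    no_recombination = free_recombination + theta ** 2 * harmonic(n, power=2)
    return free_recombination, no_recombination


if __name__ == "__main__":
    n, theta = 10, 5.0
    print("mean:", mean_segregating_sites(n, theta))
    print("variance limits (rho -> infinity, rho = 0):", variance_limits(n, theta))
```

For any finite ρ the variance lies between these two values, because the integral term is non-negative and shrinks as ρ grows.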
3.9 Mean and variance of the number of recombination events
Let $R_n$ be the number of recombination events in a sample of size n that occur in ancestral material, and prior to the MRCA at their position. It was similarly shown (Hudson and Kaplan, 1985) that
$$E(R_n) = \rho\sum_{i=1}^{n-1}\frac{1}{i}, \qquad \mathrm{var}(R_n) = \rho\sum_{i=1}^{n-1}\frac{1}{i} + \frac{\rho^2}{2}\int_0^1 (1-z)\,f_n(\rho z)\,dz.$$
Note that this expectation is different from the expected number of events, $E_n$, in the big ARG:
$$E_n = \rho\int_0^1 \frac{1-(1-x)^{n-1}}{x}\,e^{\rho x}\,dx, \qquad E(R_n) = \rho\int_0^1 \frac{1-(1-x)^{n-1}}{x}\,dx.$$
This is because events in the big ARG can happen outside ancestral material. The difference is, though, bounded as n→∞ (problem sheet).
How can we calculate $f_n(z)$? This is actually only reasonable analytically for n=2.
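The two expectations can be compared numerically. The sketch below assumes the integral forms E(R_n) = ρ ∫₀¹ (1-(1-x)^(n-1))/x dx = ρ Σ 1/i and E_n = ρ ∫₀¹ (1-(1-x)^(n-1))/x · e^(ρx) dx, and uses a plain midpoint rule (a choice made here for self-containment, not something from the course). Function names are illustrative.

```python
import math


def integrate_midpoint(f, a, b, steps=20000):
    """Midpoint rule; it never evaluates f at the endpoints, so x = 0 is avoided."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def expected_events_small_arg(n, rho):
    """E(R_n) = rho * sum_{i=1}^{n-1} 1/i."""
    return rho * sum(1.0 / i for i in range(1, n))


def expected_events_big_arg(n, rho):
    """E_n = rho * integral over [0,1] of (1 - (1-x)^(n-1)) / x * exp(rho * x)."""
    def integrand(x):
        return (1.0 - (1.0 - x) ** (n - 1)) / x * math.exp(rho * x)
    return rho * integrate_midpoint(integrand, 0.0, 1.0)


if __name__ == "__main__":
    for n in (2, 10, 100):
        small = expected_events_small_arg(n, rho=2.0)
        big = expected_events_big_arg(n, rho=2.0)
        print(n, small, big, big - small)
```

For n = 2 the big-ARG integral reduces to e^ρ − 1, which gives a quick sanity check on the quadrature. The printed differences illustrate that the gap between the two expectations stays bounded as n grows.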
3.10 Covariance in ancestry times
$f_2(z)$ is defined as the covariance in total marginal tree lengths for two sites a distance z in recombination units apart. We can focus on the small ARG subgraph for a region [0,1] with overall ρ=z. Let the coalescence times at 0 and 1 be $T_1$, $T_2$. The tree lengths are then $2T_1$, $2T_2$ so:
$$f_2(\rho) = \mathrm{cov}(2T_1, 2T_2) = E(4T_1T_2) - E(2T_1)E(2T_2) = 4E(T_1T_2) - 4,$$
and we “simply” need $E(T_1T_2)$ for sites a distance ρ apart.
We sketch in the supplement how this quantity is obtained, to illustrate the important approach of constructing equation systems.
Idea: ignoring mutations, condition on the first event back in time that occurs in the ARG for these two sequences. This is a recombination or coalescence. Repeat this.
[Figure: marginal trees T(0) and T(z) for a sample of size 2, with coalescence times T1 and T2, labelled f2(ρ).]
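3.10 Supplement I: ancestry time covariance
The time T to the first event is exp(1+ρ). If $T_1'$, $T_2'$ are the remaining tree heights above the first event:
$$\begin{aligned}
E(T_1T_2) &= E\big((T+T_1')(T+T_2')\big) \\
&= E(T^2) + 2E(T)E(T_1') + E(T_1'T_2') && \text{as } T \text{ is independent of } T_1', T_2' \\
&= 2E(T)^2 + 2E(T)E(T_1') + E(T_1'T_2') && \text{as } T \text{ is exponential, so } E(T^2) = 2E(T)^2 \\
&= 2E(T) + E(T_1'T_2') && \text{as } E(T) + E(T_1') = E(T_1) = 1 \\
&= \frac{2}{1+\rho} + E(T_1'T_2').
\end{aligned}$$
Now
$$E(T_1'T_2') = E(T_1'T_2' \mid \text{recom.})\,P(\text{rec.}) + E(T_1'T_2' \mid \text{coal.})\,P(\text{coal.}),$$
$$P(\text{rec.}) = \frac{\rho}{1+\rho}, \qquad P(\text{coal.}) = \frac{1}{1+\rho}, \qquad E(T_1'T_2' \mid \text{coal.}) = 0,$$
and so
$$E(T_1T_2) = \frac{2}{1+\rho} + \frac{\rho}{1+\rho}\,E(T_1'T_2' \mid \text{recom.}).$$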
Note that the conditional expectation term corresponds to the expectation for a new state, immediately following a recombination event. By the Markov property of ARGs, this is the expected product if we started in this state (looking back in time).
Label the original state 1, and the new state following a recombination event 2.
Define $E_1$ to be the expectation we seek, $E_2$ the corresponding expectation for the new state:
$$E_1 = \frac{2}{1+\rho} + \frac{\rho}{1+\rho}E_2.$$
We need to consider additional potential states to form a complete system of equations. For any such state s, we can write the following, using the argument on the previous slide. If $\lambda_s$ is the total event rate for state s, the time T to the next event is exp($\lambda_s$). If $T_1'$, $T_2'$ are the remaining tree heights above the first event:
$$E_s = E(T_1T_2) = 2E(T) + E(T_1'T_2') = \frac{2}{\lambda_s} + \sum_{s'} P(s \to s')\,E_{s'}.$$
[Figure: transition diagram between states 1, 2 and 3, with transition rates ρ and 1 marked.]
3.10 Supplement continued
We can build a graph with vertices corresponding to particular states, and rates between states. Colour positions red if an MRCA is reached. Such states have $E(T_1T_2)=0$.
[Figure: graph of states 1 to 6, with transition rates 1, ρ, ρ/2 and 4 marked on the edges.]
This allows us to construct a system of equations:
$$\begin{aligned}
E_1 &= \frac{2}{1+\rho} + \frac{\rho}{1+\rho}E_2, \\
E_2 &= \frac{4}{6+\rho} + \frac{2}{6+\rho}E_1 + \frac{\rho}{6+\rho}E_3, \\
E_3 &= \frac{2}{6} + \frac{4}{6}E_2,
\end{aligned}$$
and after a little algebra:
$$E_1 = \frac{\rho^2 + 14\rho + 36}{\rho^2 + 13\rho + 18},$$
so we find
$$f_2(z) = 4E_1 - 4 = \frac{4(z+18)}{z^2 + 13z + 18}.$$
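The "little algebra" can be checked mechanically. The Python sketch below solves the three-state system E1 = 2/(1+ρ) + ρ/(1+ρ)·E2, E2 = 4/(6+ρ) + 2/(6+ρ)·E1 + ρ/(6+ρ)·E3, E3 = 2/6 + (4/6)·E2 by substitution in exact rational arithmetic, and compares the result with the closed form 4(z+18)/(z²+13z+18). The function names are illustrative.

```python
from fractions import Fraction


def expected_product(rho):
    """Solve the three-state system exactly for E1 = E(T1 * T2)."""
    rho = Fraction(rho)
    # Substitute E3 = 1/3 + (2/3) E2 into E2 (6 + rho) = 4 + 2 E1 + rho E3,
    # giving E2 = a + b * E1.
    denom = 6 + rho - rho * Fraction(2, 3)
    a = (4 + rho * Fraction(1, 3)) / denom
    b = Fraction(2) / denom
    # Substitute into E1 (1 + rho) = 2 + rho E2 and solve for E1.
    return (2 + rho * a) / (1 + rho - rho * b)


def f2_from_system(z):
    """Covariance of the two tree lengths: f2 = 4 E(T1 T2) - 4."""
    return 4 * expected_product(z) - 4


def f2_closed_form(z):
    z = Fraction(z)
    return 4 * (z + 18) / (z * z + 13 * z + 18)


if __name__ == "__main__":
    for z in (0, 1, 5, 50):
        print(z, f2_from_system(z), f2_closed_form(z))
```

At z = 0 both expressions give 4, which is the variance of the tree length 2T for a sample of size 2, and the values fall towards 0 as z increases.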
Note that the covariance decreases as the recombination rate increases.
A similar system of recursions can be calculated for n>2. In practice, the solution is extremely messy. Simulation is another approach to directly estimate the covariance in tree times (Hudson, 1983).
Often, when n>2 we rely on bounding quantities of interest.
3.11 Number of distinct MRCAs
As we saw in the previous example, the time to the most recent common ancestor of individual marginal trees can vary along a sequence with recombination. How many different MRCAs do we expect along a sequence? The answer is: surprisingly few.
Consider a small interval [x,x+δ]. With high probability there is at most one recombination event on the graph for this region:
$$\begin{aligned}
P(\text{no rec.}) &= \prod_{i=2}^{n}\frac{i-1}{i-1+\rho\delta} = 1 - \rho\delta\sum_{i=1}^{n-1}\frac{1}{i} + o(\delta), \\
P(\text{1 rec., while } j \text{ anc.}) &= \frac{\rho\delta}{j-1} + o(\delta), \\
P(\text{more than 1 rec.}) &= o(\delta).
\end{aligned}$$
For a recombination while j ancestors remain, what is the probability it changes the MRCA?
[Figure: marginal trees T(0) and T(z), with coalescence times T1 and T2.]
One or other of the (coloured) recombinant edges must not coalesce with the other edges while >2 edges remain. The probability of this is combinatoric:
$$P_j(\text{escapes}) = 2\prod_{i=3}^{j+1}\Big(1 - \frac{2}{i}\Big) = \frac{4}{j(j+1)}.$$
Thus the expected number of TMRCA changes in [x,x+δ] is
$$\begin{aligned}
E(\text{changes}) &= \sum_{j=2}^{n}\frac{\rho\delta}{j-1}\cdot\frac{4}{j(j+1)} + o(\delta) \\
&= 2\rho\delta\sum_{j=2}^{n}\Big[\frac{1}{(j-1)j} - \frac{1}{j(j+1)}\Big] + o(\delta) \\
&= \rho\delta\Big[1 - \frac{2}{n(n+1)}\Big] + o(\delta),
\end{aligned}$$
and the expected number in [0,1], letting δ=1/m→0, is
$$E(\text{changes}) = \lim_{m\to\infty}\sum_{i=1}^{m}\Big[\frac{\rho}{m}\Big(1 - \frac{2}{n(n+1)}\Big) + o\Big(\frac{1}{m}\Big)\Big] = \rho\Big(1 - \frac{2}{n(n+1)}\Big),$$
so that
$$E(\#\,\text{distinct TMRCAs}) \le 1 + \rho\Big(1 - \frac{2}{n(n+1)}\Big) \le 1 + \rho \quad\text{for all } n.$$
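The combinatorial step and the telescoping sum can be verified exactly. The Python sketch below assumes the escape probability 2·Π(1 − 2/i) over i = 3, ..., j+1, and the per-unit-ρ expected number of TMRCA changes Σ 4/((j−1)j(j+1)) over j = 2, ..., n; it checks them against 4/(j(j+1)) and 1 − 2/(n(n+1)). Function names are illustrative.

```python
from fractions import Fraction


def escape_probability(j):
    """P(one of the two recombinant edges avoids coalescing while > 2 edges remain).

    After a recombination with j ancestors there are j + 1 edges; with i edges
    left, a given edge avoids the next coalescence with probability 1 - 2/i.
    """
    prob = Fraction(2)  # either of the two recombinant edges may escape
    for i in range(3, j + 2):
        prob *= 1 - Fraction(2, i)
    return prob


def expected_changes_per_rho(n):
    """Sum over j of P(rec. while j anc.) / (rho * delta) times the escape probability."""
    return sum(Fraction(1, j - 1) * escape_probability(j) for j in range(2, n + 1))


def closed_form(n):
    return 1 - Fraction(2, n * (n + 1))


if __name__ == "__main__":
    for n in (2, 5, 50):
        print(n, expected_changes_per_rho(n), closed_form(n))
```

The sum approaches 1 from below as n grows, which is why the expected number of TMRCA changes along [0,1] never exceeds ρ.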
4.0 Supplement II: Inference about recombination rate
• Given variation data from a population, we seek to perform inference on the processes producing the data
• One of the most important parameters in human biology is the recombination rate
– Reflects the real biological process of recombination
– Recombination is required for meiosis to take place
– Recombination can cause disease when it goes wrong (by deleting, duplicating or inverting segments of the genome)
– Recombination keeps populations healthy, by allowing elimination of deleterious mutations
– Despite this, there is much we don’t know!
• The recombination rate
– Can vary hugely along a sequence
– Determines association between loci in the population
– Is hard to measure directly, because recombination occurs on average in only ~1 in 100,000,000 meioses between any pair of successive nucleotides in the genome
– Can be measured indirectly, by parametric analysis of variation data
– Researchers in Oxford, and elsewhere, have developed such parametric approaches (Li and Stephens, 2003; Ptak et al. 2005; Hudson 2001, McVean 2002, McVean et al. 2004)
– One method uses the “composite likelihood” which approximates the likelihood of the data given a (variable) recombination rate, then estimates this rate using the likelihood
[Figure: data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005).]
4.5 Findings using the “composite likelihood” (I)
Recombination estimates for all of chromosome 12. The inferred patterns of recombination are extremely uneven (>80% of recombination in 10-20% of sequence). Over 30,000 hotspots identified genome-wide, via the composite likelihood (Myers et al. 2005).
One of the challenges in human genetics is that there is a very high volume of data.
For example, the following is based on data for over 4 million binary mutations, typed in 270 humans from four populations.
There is tremendous power in the data, but analysis methods must be sufficiently fast, requiring approximation.
4.5 Findings using the “composite likelihood” (II)
• Downstream, one can use the places where recombination clusters – termed “hotspots” – to ask if there are features of DNA sequence that specify hotspot locations
– None previously identified in any mammal, but this is powerful data
• ~30,000 hotspots used genome-wide, and DNA sequence compared to DNA sequence of “cold” regions where there is little or no recombination
• It turns out there is a difference. A particular “word” in the DNA codes for there being a hotspot at a location (Myers et al. 2005, 2008) (the code is fuzzy):
...CTTCCGCCATGATTGTGAGGCCTCCCTAGCCACGTGGAACTGTGAGT...
• Since then, researchers have been able to find a new part of the cellular machinery (a “protein”, PRDM9) that recognises this word, and turns on recombination in hotspots (Myers et al. 2009, Baudat et al. 2009, Parvanov et al. 2009)
• PRDM9 is different in chimps, explaining their different hotspots, and has remarkable properties
• So: there is a close relationship between underlying biology, and variation patterns in data
4.6 Recombination summary
• Recombination is a powerful, fundamental force that has shaped both our current patterns of genetic variation, and our genomes themselves
• The coalescent with recombination is the key model enabling us to understand the relationship between recombination and genealogical histories, and patterns in variation data
• Inference under this model is challenging, but creative approaches have yielded workable solutions to this problem
• Non-parametric and parametric approaches both have something to offer and often largely agree in findings