the impact of sampling schemes on the site frequency ... · 1 the impact of sampling schemes on the...
TRANSCRIPT
1
The impact of sampling schemes on the site frequency spectrum
in non-equilibrium subdivided populations
Thomas Städler,* Bernhard Haubold,† Carlos Merino,‡
Wolfgang Stephan‡ and Peter Pfaffelhuber§
* ETH Zurich, Institute of Integrative Biology, Plant Ecological Genetics,
CH-8092 Zurich, Switzerland
† Max-Planck-Institute for Evolutionary Biology, Department of Evolutionary Genetics,
D-24306 Plön, Germany
‡ University of Munich (LMU), Department of Biology II, Section of Evolutionary Biology,
D-82152 Planegg-Martinsried, Germany
§ University of Freiburg, Faculty of Mathematics and Physics,
D-79104 Freiburg, Germany
Genetics: Published Articles Ahead of Print, published on February 23, 2009 as 10.1534/genetics.108.094904
2
Running head: Genealogies and Population Subdivision
Key words: coalescent, demography, population subdivision, range expansion, site frequency
spectrum
Corresponding author: Dr. Thomas Städler
ETH Zurich
Institute of Integrative Biology
Plant Ecological Genetics
Universitätstrasse 16
CH-8092 Zurich, Switzerland
Phone: +41-44-6327429
Fax: +41-44-6321215
E-mail: [email protected]
3
ABSTRACT
Using coalescent simulations, we study the impact of three different sampling schemes on
patterns of neutral diversity in structured populations. Specifically, we are interested in two
summary statistics based on the site frequency spectrum as a function of migration rate,
demographic history of the entire substructured population (including timing and magnitude of
species-wide expansions), and the sampling scheme. Using simulations implementing both
finite-island and two-dimensional stepping-stone spatial structure, we demonstrate strong effects
of the sampling scheme on Tajima’s D (DT) and Fu and Li’s D (DFL) statistics, particularly under
species-wide (range) expansions. Pooled samples yield average DT and DFL values that are
generally intermediate between those of local and scattered samples. Local samples (and to a
lesser extent, pooled samples) are influenced by local, rapid coalescence events in the underlying
coalescent process. These processes result in lower proportions of external branch lengths and
hence lower proportions of singletons, explaining our finding that the sampling scheme impacts
DFL more than it does DT. Under species-wide expansion scenarios, these effects of spatial
sampling may persist up to very high levels of gene flow (Nm > 25), implying that local samples
cannot be regarded as being drawn from a panmictic population. Importantly, many data sets on
humans, Drosophila, and plants contain signatures of species-wide expansions and effects of
sampling scheme that are predicted by our simulation results. This suggests that validating the
assumption of panmixia is crucial if robust demographic inferences are to be made from local or
pooled samples. However, future studies should consider to adopt a framework that explicitly
accounts for the genealogical effects of population subdivision and empirical sampling schemes.
4
Although most (if not all) species are characterized by various degrees of population subdivision,
population geneticists continue to employ models of panmictic populations of constant size (the
neutral equilibrium [NE] model) when testing for selection and/or demographic changes through
time. Motivated by empirical data sets that were found to be incompatible with expectations
under the NE model, recent large-scale studies have employed coalescent simulations of
demographic changes such as bottlenecks and/or population expansions in attempts to estimate
parameters of such (arguably) biologically more realistic scenarios (e.g., MARTH et al. 2004;
NORDBORG et al. 2005; OMETTO et al. 2005; SCHMID et al. 2005; LI and STEPHAN 2006;
HEUERTZ et al. 2006; PYHÄJÄRVI et al. 2007; ROSS-IBARRA et al. 2008). However, such
simulations still assume unstructured populations, while the empirical data may derive from
subdivided species and a variety of sampling schemes. The data typically collected in such
projects consist of multilocus DNA sequences from which genealogical information is extracted,
e.g., on the number of segregating sites (S), the average pairwise sequence diversity (π), and
either the full site frequency spectrum or summary statistics based on the site frequency
spectrum, such as Tajima’s D (TAJIMA 1989), Fu and Li’s D (FU and LI 1993), and Fay and
Wu’s H (FAY and WU 2000).
Here, we are concerned with estimates of species-wide demographic history from such
summary statistics, and particularly with the joint impact of population subdivision, population
expansion, and the sampling scheme on statistics summarizing the site frequency spectrum. Our
motivation stems largely from striking patterns in empirical data sets that, we believe, need to be
evaluated for their general implications. For example, in empirical studies that include several
genuine population samples, it becomes possible to contrast the site frequency spectra of local
population samples with those obtained for the pooled (combined) samples. Although this
5
possibility has not been exploited in all such studies, the common observation has been that site
frequency spectra are more skewed toward low-frequency polymorphisms in pooled samples,
compared to the means of genuine population samples; we refer to this phenomenon as the
‘pooling effect’.
Such genealogical signals have been obtained in a variety of organisms such as humans
(ALONSO and ARMOUR 2001; PTAK and PRZEWORSKI 2002; HAMMER et al. 2003), Drosophila
(POOL and AQUADRO 2006), and several plant species (INGVARSSON 2005; ARUNYAWAT et al.
2007; MOELLER et al. 2007; PYHÄJÄRVI et al. 2007). At least partly due to the putatively
confounding effects of population subdivision and population expansion first identified by PTAK
and PRZEWORSKI (2002), several studies have explicitly favored the analysis and interpretation of
data obtained from large samples of single populations or geographic regions (e.g., ADAMS and
HUDSON 2004; MARTH et al. 2004; GARRIGAN et al. 2007). Similarly, a recent multilocus study
of several populations of teosinte (the wild ancestor of maize) reiterated the notion of
genealogical signals for population expansion (e.g., negative values of Tajima’s D) being
confounded with effects of population structure (MOELLER et al. 2007).
In our recent study of patterns of polymorphism and population subdivision in two closely
related species of wild tomatoes, however, we arrived at quite different conclusions
(ARUNYAWAT et al. 2007). Inspired by the results of two complementary simulation studies of
subdivided populations (RAY et al. 2003; DE and DURRETT 2007; see below), we interpreted this
apparently widespread empirical pattern (see references cited above) as being a consequence of
the effect of the sampling scheme on sample genealogies. Specifically, ARUNYAWAT et al.
(2007) envisioned the impact of population subdivision as ‘masking’ the true genealogical signal
of the species-wide demography via loss of low-frequency polymorphisms in local samples,
6
rather than as ‘creating’ an excess of low-frequency variants in pooled samples. The difference
in interpretation compared to the teosinte study by MOELLER et al. (2007) is not due to the
underlying data, as the site frequency spectra of pooled population samples (four populations per
species in our study on wild tomatoes) also showed a marked excess of singletons compared to
the average spectra for individual populations, resulting in significantly negative values of
Tajima’s D and/or Fu and Li’s D (ARUNYAWAT et al. 2007; STÄDLER et al. 2008).
Using parameter values designed to reflect patterns of variation in human mitochondrial
DNA, RAY et al. (2003) simulated a range expansion under stepping-stone structure and sampled
30 sequences taken from a single deme. Under fairly low levels of gene flow between demes,
sample genealogies were shaped by a mixture of local coalescent events (leading to near-zero
branch lengths) and ancestral lineages coalescing in the founding deme at the onset (looking
forward in time) of the species-wide range expansion. The overall shape of such genealogies
yielded values of Tajima’s D that were not significant under NE expectations for low to
moderate migration rates. Only under very high levels of gene flow did the single-deme samples
exhibit genealogies expected for strong population expansions (RAY et al. 2003), reflecting the
low number of coalescence events within the sampled deme (WAKELEY 1999, 2001, 2004;
WAKELEY and ALIACAR 2001). Analogous phenomena were recently uncovered by DE and
DURRETT (2007) who simulated an equilibrium stepping-stone model of 100 demes, using scaled
mutation parameters appropriate for nuclear loci in humans. Compared to expectations for a
panmictic equilibrium population, these authors found lower average S, more extended linkage
disequilibrium, and average frequency spectra skewed toward intermediate-frequency
polymorphisms (i.e., positive Tajima’s D values) in simulated samples from single demes.
7
A notable limitation of these studies is their focus on samples obtained from single demes
(RAY et al. 2003), although DE and DURRETT (2007) also considered random samples and
showed that their genealogical properties approach those of samples from random-mating
populations. ARUNYAWAT et al. (2007) qualitatively interpreted the pooling effect identified in
their wild tomato data (and observed for several other empirical studies cited above) as resulting
from the changed genealogical structure of combined samples, analogous to increasing the
proportion of migrants in single-deme samples. In both circumstances, scattering-phase
coalescence events ought to decrease in favor of higher numbers of ancestral lines remaining at
the start of the collecting phase (looking backward in time) of the coalescent process, thus
yielding higher proportions of external branches of the genealogies.
In this study, we set out to test the interpretations of ARUNYAWAT et al. (2007), in
particular their qualitative prediction that the genealogies of pooled samples (comprised of
several local population samples treated as one entity), as reflected in the site frequency
spectrum, should be intermediate between those of local and scattered samples. The latter
comprise single sequences from each of many demes and are known to possess the genealogical
structure of samples from unstructured populations (assuming a large number of demes;
WAKELEY 1999, 2001, 2004; WAKELEY and ALIACAR 2001). More generally, we assess the
genealogical properties of these three types of samples under both the symmetric finite-island
model and the stepping-stone model of population structure. Given that ARUNYAWAT et al.
(2007) interpreted the frequent observation of negative Tajima’s D values in pooled samples of
diverse taxa (e.g., PTAK and PRZEWORSKI 2002; HAMMER et al. 2003; POOL and AQUADRO 2006;
MOELLER et al. 2007) as signatures of species-wide (range) expansions, our emphasis is on
models of structured populations undergoing an overall expansion in population size. To this
8
end, our coalescent simulations encompass a wide range of migration rates, magnitude, and
timing of expansions. We find a strong effect of the sampling scheme on two commonly used
summary statistics based on the site frequency spectrum, where the pooling effect gradually
decreases with increasing levels of gene flow. Our results imply that sequence data for most
species should not be evaluated under a ‘panmixia’ framework and that attention to sampling is
paramount even for weakly subdivided taxa. Moreover, disentangling the effects of demography
from those of positive selection appears to be an even more challenging task than currently
assumed.
MODELS AND METHODS
Coalescent simulations under two models of population structure: All patterns of
sequence diversity were generated using the coalescent simulation software ms (HUDSON 2002)
to model the following evolutionary scenario. At the time of sampling, the population consists of
I demes, each containing N0 diploid individuals. For the first set of simulations, the subdivided
population is at equilibrium with constant population size. Along every line mutations
accumulate at rate µ, and θ = 4N0µ. We chose to simulate mainly with a fixed value of θ = 1 (for
simulations implementing 100 demes) rather than with a fixed number of segregating sites S, or
choosing different θs for different simulation scenarios to obtain realistic values of S and/or π.
Simulating with a fixed S has been shown to yield inaccurate results under non-equilibrium
demography (RAMOS-ONSINS et al. 2007), and the critical values for Tajima’s D and Fu and Li’s
D depend not too strongly on θ. Moreover, the effects we describe in this article are orders of
magnitude larger than what could be generated by choosing a different θ or fixing S.
9
The demes exchange (haploid) migrants either under an island model or a two-dimensional
stepping-stone model at rate m, and we consider a broad range of gene flow: 0.1 ≤ 4N0m ≤ 100.
Under the island model an ancestral line in deme j switches its location to deme i at rate m/(I –
1). Under the stepping-stone model, we assume that I = a2, i.e., the population is arranged in a
square lattice of a × a demes and we assume periodic boundary conditions. This means that the
migration rate is m/4 if i = (i1,i2) and j = (j1,j2) are neighboring demes, and 0 otherwise. Here, i
and j are neighbors if |(i1 – j1)mod a| + |(i2 – j2)mod a| = 1. In other words, an individual at
location (a,i2) can migrate to (1,i2) and one at (i1,a) can migrate to (i1,1) and vice versa. We
modified ms in order to be able to efficiently sample sequences from randomly chosen demes
rather than fixed demes. Specifically, in each iteration the modified version of ms shuffles the
entries in the sample configuration array inconfig at the beginning of the function
segtre_mig (HUDSON 2002). The C code of this program is available from
http://guanine.evolbio.mpg.de/sampling.
Implementing range expansions under population subdivision: For an additional set of
coalescent simulations, we assume that the structure in the population was created some time τ in
the past (τ is measured in units of 4N0 generations), i.e., before time τ the population was
panmictic and of size NA. This scheme ought to be plausible under range expansions, e.g., as
exemplified by migration out of Africa by both humans and Drosophila melanogaster and
subsequent colonization of expansive areas, or temperate-zone populations expanding from
glacial refugia, or following a speciation event such as that inferred for the two wild tomato
species Solanum peruvianum and S. chilense (STÄDLER et al. 2005, 2008). Moreover, this is
essentially a generalized “isolation with migration” (IM) model of divergence with a large
number of extant demes (NIELSEN and WAKELEY 2001; HEY and NIELSEN 2004; WILKINSON-
10
HERBOTS 2008). Looking forward in time, at time τ the ancestral population splits into I demes
of equal size and in equal proportions. The ‘expansion factor’ for the total population at time τ is
thus given by
β = I × N0/NA. (1)
Note that a value of β = 1 implies constant population size in the sense that a panmictic
population at time τ in the past split into I demes each of size N0 (= NA/I), without changing the
total census size of the entire, now subdivided population. This particular scenario can be seen as
a form of ‘range fragmentation’, albeit without decline in total population size.
Sampling schemes and descriptors of diversity and differentiation: Simulated samples
of total size n (20 in our numerical examples) from the structured population were implemented
as either local, pooled, or scattered. Local samples contain n sequences from a single island, i.e.,
only one arbitrarily chosen deme is sampled. Pooled samples contain several lines each from
several demes (we take five lines from each of four demes in our simulations). Scattered samples
encompass single sequences from each of n different demes, i.e., only one sequence per sampled
deme.
A commonly used statistic to quantify population structure from patterns of diversity
within and among local populations is FST (e.g., HUDSON et al. 1992). In particular, if the number
of demes is large and the population is in equilibrium,
E[FST] = 1 / (1 + 4N0m), (2)
(e.g., WRIGHT 1951) and thus migration rates can, in principle, be estimated from observed
values of FST (but see WHITLOCK and MCCAULEY [1999] for numerous caveats, and Jost [2008]
for a more fundamental critique of FST-based estimates of differentiation and gene flow). In our
simulations, FST can only be computed under the ‘pooled’ sampling scheme. We used the
11
formula FST = 1 – πw/πb, where πb is the ‘average’ number of differences for pairs of sequences
taken from different demes, and πw is the ‘average’ number of differences for pairs sampled
within demes. The ‘average’ here means only the average over all pairs of sequences, and not
over all simulation runs. Thus, for every simulation run, we recorded exactly one FST value. For
the stepping-stone model, we used the same computations, i.e., FST does not take isolation by
distance into account.
To describe sequence diversity patterns, we focus on the site frequency spectrum, as
summarized by statistics such as the widely used Tajima’s D (TAJIMA 1989) and Fu and Li’s D
(FU and LI 1993). For clarity, we denote these distinct D statistics as DT and DFL, respectively.
Using these particular summary statistics enables us to perform power analyses to reject the
standard NE model. Moreover, we chose to include DFL because the singleton class appeared to
be the major reason for the lower/more negative DT values in pooled versus local samples of
wild tomatoes (ARUNYAWAT et al. 2007). The statistical package R was used to drive our
modified version of ms and to compute these statistics from its output; the corresponding R
scripts are available at http://guanine.evolbio.mpg.de/sampling.
Demographic signatures in multilocus data from wild tomatoes: Our recent studies on
population subdivision and speciation in two wild tomato species provided the motivation for our
current simulation study (ROSELIUS et al. 2005; STÄDLER et al. 2005, 2008; ARUNYAWAT et al.
2007; see INTRODUCTION). We extracted the biallelic, synonymous and noncoding Single
Nucleotide Polymorphisms (SNPs) contained in the data of ARUNYAWAT et al. (2007) to confine
analyses of the site frequency spectrum to putatively neutral polymorphisms and those fitting the
infinite-sites assumption implemented in HUDSON’s (2002) ms program. Tajima’s DT and Fu and
Li’s DFL values of these pruned data sets were computed in the SITES program
12
(http://lifesci.rutgers.edu/~heylab). Specifically, we computed the average DT and DFL values
across the four single-population samples per species, as well as the corresponding statistics
obtained by treating all sequences within species as a single sample; we refer to these latter
entities as the ‘pooled’ samples (see ARUNYAWAT et al. 2007).
While devising estimators of the demographic parameters β and τ is beyond the scope of
the present article (and may require the availability of multilocus data obtained from scattered
samples), we implemented coalescent simulations reflecting the actual sampling scheme used by
ARUNYAWAT et al. (2007) and fixed other parameters according to aspects of the empirical data
and perceived biological context. Both stepping-stone and island-model simulations were
performed with 400 local demes, θ = 0.25, and sampling 11 sequences from each of four local
demes (recall that demes were randomly chosen for each of the 1000 iterations). The migration
rate was set to 4N0m = 5, reflecting the average FST estimates found for both species
(ARUNYAWAT et al. 2007; and see Figure 5 below). Simulations explored the joint effects of
varying the time (τ) and magnitude (β) of demographic expansions concomitant with
establishment of population subdivision. From these simulations, we recorded estimates of DT
and DFL for two sampling schemes, local (means of the four local samples) and pooled (all 44
sequences treated as one sample). Average estimates of DT and DFL obtained in this manner for
many simulated combinations of β and τ are presented as contour plots.
RESULTS
Our coalescent simulations yield results for levels of nucleotide diversity, summary
statistics based on the site frequency spectrum, and population differentiation under both
equilibrium and non-equilibrium demographic history, all obtained for three different sampling
13
schemes: local, pooled, and scattered. All our findings are consistent between the island model
and the stepping-stone model, and in this article we focus on quantitative results obtained under
stepping-stone spatial structure; results for the island model are presented as Supplemental
Figures online.
The site frequency spectrum under population subdivision: The simplest demographic
scenario we analyzed is an equilibrium population subdivided into 100 demes, i.e., we first focus
on the effects of population structure per se without any past changes of population size. As
expected, characteristics of the site frequency spectra depend strongly on the sampling scheme.
For the stepping-stone model of population structure, Figure 1 shows the simulation results under
various levels of gene flow. For a migration rate of 4N0m = 10, local samples produce values of
Tajima’s DT (Fu and Li’s DFL) that are significantly different from values expected under the NE
model (two-tailed test, p < 0.05) in 16% (39%) of all cases, while scattered samples give
significant results in only 6% (7%) of all simulations. In particular, we see that local samples
generate values for both statistics that are higher than expected for samples from panmictic
populations, reflecting a site frequency distribution skewed toward intermediate-frequency
mutations; this result mirrors the recent work of DE and DURRETT (2007).
For migration rates (in units of 4N0m) between ~2 and ~50, pooled samples exhibit site
frequency spectra that are broadly intermediate between those of local and scattered samples.
The differences in sample genealogies (as reflected in estimates of DT and DFL) gradually
diminish with higher levels of gene flow, but some differences among sampling schemes are still
apparent at fairly high migration rates (e.g., 4N0m >50 for DT and 4N0m ~100 for DFL; Figure 1).
Importantly, pooling data from several subpopulations does not generate negative values of DT
or DFL without an expansion of the total population (see DISCUSSION). These observations also
14
hold for the island model, albeit with smaller discrepancies between the summary statistics for
scattered samples and those for the two other sampling schemes (i.e., a less pronounced skew
toward intermediate-frequency mutations for both pooled and local samples; Supplemental
Figure 1). For lower levels of gene flow (approx. 4 N0m < 2), the site frequency spectra of local
samples gradually shift toward NE expectations, while those of pooled samples yield
increasingly positive values of both DT and DFL under decreasing levels of migration. We
checked our simulations of this equilibrium model against analytical results showing that local
samples ought to be invariant for the level of nucleotide diversity, π, irrespective of the level of
symmetrical interdeme migration in an island model (SLATKIN 1987; STROBECK 1987). For both
models of population structure, we found approximately invariant mean π values for local
samples over the entire range of simulated migration rates (results not shown).
The impact of population/range expansions on the site frequency spectrum: Next, we
considered scenarios of (range) expansions under concomitant establishment of subdivision of
the total species range, as described in the MODELS AND METHODS section. In particular, we
simulated a single ancestral population that at time τ before the present experienced a
fragmentation into I demes, where the total number of individuals across the subdivided
population could vary from NA (the number of individuals in the single ancestral population, in
which case the expansion factor β = 1) to 100 × NA (equivalent to β = 100). Under a stepping-
stone model with a 10-fold population expansion 8N0 generations in the past, local samples still
exhibit values of DT and DFL that would be expected under NE conditions, as long as gene flow
is fairly low (Figure 2); these findings are consistent with simulation results by RAY et al.
(2003). The corresponding simulation results for the island model are shown in Supplemental
Figure 2.
15
Scattered samples, however, contain a clear signal of the species-wide expansion at any
level of gene flow. The reason is that the coalescent of scattered samples almost behaves like a
neutral one with a population size proportional to the number of demes. Hence, this coalescent
picks up the signal of expansion as in the panmictic case. The same should hold for any sampling
scheme under sufficiently high migration, but Figure 2 shows that even under high levels of gene
flow (4N0m = 100), the site frequency spectrum of local samples is still different from that of
pooled and scattered samples. Moreover, the simulation results plotted in Figure 2 were obtained
under a 10-fold expansion, and we may expect discrepancies between local and scattered
samples to extend to even higher migration rates with higher expansion factors (see Figure 3).
Again, DT and DFL values obtained for pooled samples are intermediate between those of local
and scattered samples except for very low migration rates (approx. 4N0m < 0.5), similar to the
case of equilibrium subdivided populations (Figures 1, 2). These simulation results imply that
both local and pooled samples may be expected to underestimate the extent of any species-wide
(range) expansion to various degrees, depending on levels of gene flow connecting the demes
and the age and magnitude of the expansion. The latter aspect, however, is influenced by our
choice of simulating an instantaneous expansion followed by a period of constant population size
until the time of sampling; an exponential expansion scheme until the present would yield higher
detectability of expansion from local and pooled samples.
We point out that quantitative details of our simulation results for the three sampling
schemes depend to some extent on our choice of sampling 20 sequences distributed over one
(‘local’), four (‘pooled’), and 20 demes (‘scattered’), respectively. Generally speaking,
decreasing the number of sequences sampled per deme and/or increasing the number of demes
sequences are sampled from shifts the site frequency spectrum more toward that of scattered
16
samples, reflecting the diminished impact of the scattering phase of the coalescent process on
such samples. The exact numerical composition of local and pooled samples also affects the
migration rate at which the expected DT and DFL values for pooled samples drop below those for
local samples (see Figures 1, 2).
Power of DT and DFL to detect expansion under different sampling schemes and
expansion times: Illustrated by an intermediate level of interdeme migration (4N0m = 10), we
assessed the power of the test statistics DT and DFL under a range of expansion factors and times
of expansion. For the three sampling schemes, Figure 3 summarizes the power of DT and DFL
assuming an expansion time of τ = 8N0 generations ago. The low power of local samples to
detect significant departures from the NE model regardless of the magnitude of the species-wide
expansion is striking; qualitatively, this is consistent with results by RAY et al. (2003). Even if
the species-wide expansion was 100-fold, local samples deviate from NE expectations in only
8% (for DT) and 12% (for DFL) of all cases under these conditions. In sharp contrast, scattered
samples deviate from NE expectations in 98% (99%) of all cases for β = 100. The corresponding
simulation results for the island model are presented in Supplemental Figure 3.
Next, we illustrate the effect of the timing of the expansion for a fixed 10-fold expansion
and a migration rate of 4N0m = 10. Under these conditions, expansion times in the approximate
range 1 < τ < 15 can be detected in principle, but again with striking differences in power
exhibited by local versus scattered samples (Figure 4). The shape of the curves in Figure 4 can be
explained intuitively: if τ is very small (i.e., establishment of population structure was very
recent), samples appear to be drawn from a panmictic population of constant size. On the other
hand, if τ is very large, samples appear to be drawn from an equilibrium subdivided population
and hence the expansion cannot be detected. Under the island model, all results are qualitatively
17
the same but with even smaller power to reject stable population size for local samples (at most
10% less power; Supplemental Figure 4). As these power assessments are based on simulated
sample genealogies of single loci, the actual power available with empirical multilocus data
would be correspondingly higher.
All simulation results appear to depend only weakly on the number of demes, as long as I
>> n. However, an increase in the number of demes carries important connotations, as the
genealogical signatures of past expansions remain detectable for longer time periods than with
fewer demes. For example, simulating with a constant ratio τ/I = 0.02 (i.e., equivalent to τ = 2 for
I = 100, τ = 10 for I = 500, etc.) under otherwise equal demographic conditions, we obtained
fairly constant but sampling-specific estimates of DT and DFL, as illustrated for the island model
in Supplemental Figure 5. One interpretation is that the effective population size is proportional
to I but depends on the sampling scheme. These observations imply approximately equal
detectability of expansions (for a given sampling scheme) over a large range of I, and thus the
dependency of ‘relevant’ τ values on I. This latter effect is a direct consequence of lengthening
the collecting phase of the coalescent process with increasing numbers of demes in the total
population.
Estimates of FST under non-equilibrium conditions: We studied the behavior of FST in
expanding populations (cf. MODELS AND METHODS) and consider both the stepping-stone and
island model of population structure. Given such scenarios, estimates of FST depend strongly on
the migration rate 4N0m but only weakly on the expansion time τ and the expansion factor β, as
long as 4N0m ≥ 2 (Figure 5 and Supplemental Figure 6). Under moderate and high levels of gene
flow, equation (2) gives a reasonable approximation of the simulated results, although the level
of population structure is mostly higher than expected under equilibrium conditions, and thus
18
actual levels of gene flow would be mostly underestimated using FST calculated from non-
equilibrium populations. Under low levels of gene flow (4N0m < 2), however, strong and/or
recent expansions yield less population differentiation than expected under the assumption of
equilibrium conditions, and thus levels of gene flow would be mostly overestimated (Figure 5;
see also EXCOFFIER 2004).
Reanalyzing the pooling effect in two species of wild tomatoes: Our reanalysis of the
multilocus wild tomato data, limited to silent sites and biallelic SNPs, confirms the patterns
originally identified by ARUNYAWAT et al. (2007) based on all SNPs; with the exception of one
locus in S. peruvianum (CT066, DFL) and one locus in S. chilense (CT268, DT), the site
frequency spectra at individual loci yield lower/more negative values of DT and DFL in the
pooled samples compared to the means of four samples representing local populations (Table 1).
Considering the pooled samples, multilocus average DT (DFL) values are –1.14 (–1.38) in S.
peruvianum and –0.30 (–0.76) in S. chilense; the magnitude of the ‘drop’ from single-population
means in DT (DFL) is 0.89 (1.04) in S. peruvianum and 0.78 (1.31) in S. chilense (Table 1). Given
our simulation results under both equilibrium and expansion scenarios, the observed summary
statistics for pooled samples of both species are incompatible with equilibrium assumptions, but
compatible with species-wide expansions (Figures 1–3). Based on the observed levels of
population subdivision (FST approx. 0.20–0.23; ARUNYAWAT et al. 2007), a rough estimate of
average migration rates is 4N0m ~5 (Figure 5).
Our simulation results presented above convey how summary statistics such as DT and DFL
obtained from local or pooled samples ought to be affected jointly by migration rates and both
the magnitude and time of expansion. Figure 6 presents results of coalescent simulations tailored
to fit the empirical wild tomato data (i.e., sampling 11 sequences from each of four local demes
19
and fixing the migration rate to 4N0m = 5). Even when restricting the analysis to two parameters
that were allowed to vary widely, it is clear that given observed values of DT and DFL are not
sufficient to permit robust estimation of the demographic parameters τ and β. The simulated
average DT values for pooled samples shown in the contour plot (Figure 6B) suggest that the
observed pooled DT values are compatible with ≥ 25-fold expansion in S. peruvianum and with ≥
4-fold expansion in S. chilense.
When viewed in isolation, the observed local DT values are consistent with a very high
expansion factor for S. peruvianum (β > 200) and with β ~1 for S. chilense. However,
considering the observed differences local–pooled, > 40-fold and < 4-fold expansions appear to
be inconsistent with the data for both species (Figure 6; Table 1). Similarly, the observed DFL
values would seem to be compatible with > 150-fold expansion in S. peruvianum and with > 3-
fold expansion in S. chilense, whereas based on the observed differences local–pooled, only >
120-fold expansions appear to be compatible with the data for both species (Table 1;
Supplemental Figure 7). Assuming that the expansion time of both species mirrors their
speciation time, we note that these β values are larger than those assuming the model of
WAKELEY and HEY (1997) as obtained in STÄDLER et al. (2008). This apparent discrepancy can
be explained at least partly by our findings, since the Wakeley–Hey model treats any empirical
sample as one from a panmictic population and hence picks up signals of expansion only partly,
leading to smaller estimates of the expansion factor.
We emphasize that changing some of the simulated parameter values, such as the number
of demes and the migration rate among demes, would suggest other β-values as being
‘consistent’ with the empirical data. Similarly, assuming an island-model rather than stepping-
stone structure would result in a site frequency spectrum more biased toward low-frequency
20
variants, hence implying signatures of more moderate expansions than suggested by some
features of the data mentioned above (Supplemental Figures 8, 9). Moreover, these simulations
were performed with a uniform migration rate across the entire subdivided population, whereas
the empirical data may reflect (perhaps dramatic) differences in immigration rates among local
samples. Irrespective of the inherent difficulties of inferring credible demographic parameters
under population subdivision and non-stationarity, the generally lower/more negative DT and DFL
values for both local and pooled S. peruvianum samples strongly suggest a more pronounced
expansion compared to S. chilense (Table 1; ARUNYAWAT et al. 2007; STÄDLER et al. 2008).
DISCUSSION
Although models of subdivided populations have been studied extensively, the lack of
analytical results for local samples of size > 2 may have restricted their utility for genealogical
inference. WILKINSON-HERBOTS (2008) recently obtained expressions for the expected mean and
variance of pairwise sequence divergence (samples of size two drawn either from the same
subpopulation or from different subpopulations in a finite island model) in a generalized IM
model (see also EXCOFFIER 2004). As our simulations were motivated mainly by patterns in
empirical data, we have focused on larger sample sizes and aspects other than pairwise sequence
diversity, but the underlying model of population subdivision with variable timing and
magnitude of population expansion is identical to that studied by WILKINSON-HERBOTS (2008).
For the most part, our simulations implemented 100 demes and a sample size of 20, and thus
should approach the many-demes limit utilized in Wakeley’s studies of the island model (e.g.,
WAKELEY 1999, 2001, 2004; WAKELEY and ALIACAR 2001), and more recently shown by DE
and DURRETT (2007) for the stepping-stone model. Simulating local samples under both island
21
and stepping-stone spatial structure, DE and DURRETT (2007) found lower numbers of
segregating sites, higher linkage disequilibrium (LD), and median site frequency spectra that
were skewed toward intermediate-frequency polymorphisms (i.e., yielding positive DT values) in
comparison with samples from randomly mating populations. However, they did not investigate
the behavior of samples pooled across several demes. WAKELEY and LESSARD (2003) have
shown that levels of LD can be significantly elevated in local samples, depending on migration
rates and the total number of demes. Their analyses highlighted the potential impact of sampling
scheme and demography on the validity of evolutionary inferences drawn under unrealistic
assumptions, a topic to which we shall return below.
Site frequency spectra under different sampling schemes: Using coalescent simulations,
we evaluated three sampling schemes of sequences drawn from subdivided populations under
different levels of migration among local demes; all three sampling schemes have corresponding
examples in empirical studies of DNA sequence diversity designed to infer aspects of
demographic history and natural selection from the site frequency spectrum and/or patterns of
LD. Our simulations of subdivided populations of constant total size (Figure 1) have identified a
wide range of migration rates where the pooling effect, as interpreted by ARUNYAWAT et al.
(2007), appears to hold: local samples are characterized by higher values of Tajima’s DT and Fu
and Li’s DFL than pooled samples for Nm (immigrants per generation) > 0.5, and the underlying
genealogies of such samples converge in shape only at very high levels of gene flow (Nm > 25).
With low migration levels (Nm < 0.5), the pattern is reversed in that pooled samples show an
excess of highly positive DT and DFL values. At these low gene flow levels, local samples are
expected to undergo a series of coalescent events within the sampled demes before entering the
collecting phase with a single ancestral line per deme. The resulting long internal branches of
22
genealogies encompassing lineages from several local demes (i.e., pooled samples) typically
exhibit multiple fixed mutations between different demes, leading to intermediate-frequency
SNPs in pooled samples; PANNELL (2003; his Table 3) has addressed these effects of local vs.
pooled samples in the context of metapopulation dynamics.
That these genealogical effects are mostly in the opposite direction, i.e., that pooling
several local samples is expected to increase the proportion of low-frequency polymorphisms
under intermediate and high levels of migration, has until recently been obscure, even though
simulation results by PANNELL (2003; his Table 3) indicated such an effect. Analyzing
previously published human data sets encompassing a variety of sampling schemes, PTAK and
PRZEWORSKI (2002) identified a pattern of increasing skew toward low-frequency mutations (i.e.,
negative DT) with more geographically heterogeneous sampling (equivalent to pooling across
demes), and concluded that signatures of population expansion may be confounded with effects
of population subdivision. Our coalescent simulations of equilibrium island/stepping-stone
models under a reasonable range of migration rates show that pooling does not lead to negative
DT values, and thus do not support this particular interpretation of a ‘confounding’ effect of
subdivision.
On the contrary, pooled samples (albeit to a lesser extent than local samples) still exhibit
the signatures of local (scattering-phase) coalescent events (i.e., lower proportion of external
branch length of the sample genealogies), compared to scattered samples whose genealogies
should be roughly equivalent in structure to those from panmictic populations (but see below).
This ‘intermediate’ frequency spectrum of pooled samples (between the ‘extremes’ of local and
scattered samples, respectively) is what ARUNYAWAT et al. (2007) predicted based on their
genealogical interpretation of the pooling effect. For the case of non-equilibrium populations,
23
one caveat that ought to be mentioned is that signatures of expansion will be seen in scattered
samples under low levels of gene flow even without any increase in the total number of
individuals (i.e., β ≤ 1). The reason is that at time τ, the establishment of subdivision increases
the effective population size by a factor of 1 + (1/4N0m) (WAKELEY 2000). Consequently, an
apparent ‘expansion’ may be seen under strong subdivision (e.g., 4N0m < 2); the increasingly
negative values for the scattered samples shown in Figure 2 with decreasing migration rate
appear to reflect this phenomenon.
Site frequency spectra and sampling in non-equilibrium subdivided populations: We
found effects of the sampling scheme on sample genealogies up to very high migration rates,
especially under strong species-wide expansions (Figures 2, 3). In other words, even under high
migration rates (e.g., Nm = 25), local samples may not adequately reflect the species-wide
demography. The relevance of our simulation findings is essentially an empirical issue that
should be judged by its predictive power and relevant data. Our simulations predict a pooling
effect over a wide range of migration rates, implying one way to test the null hypothesis of
panmixia, i.e., that the genealogies of local samples are indistinguishable from those of scattered
samples. Very few published studies have performed separate analyses of local population
samples and the pooled sample consisting of several local samples, but many studies can, in
principle, be used to test our predictions because several local samples have been included
(Drosophila: BAUDRY et al. 2004, 2006; NOLTE and SCHLÖTTERER 2008; humans: MARTH et al.
2004; VOIGHT et al. 2005; KEINAN et al. 2007; GARRIGAN et al. 2007; plants: HEUERTZ et al.
2006; PYHÄJÄRVI et al. 2007).
POOL and AQUADRO (2006) demonstrated the pooling effect in sub-Saharan Drosophila
melanogaster, as have several studies in plants (INGVARSSON 2005; ARUNYAWAT et al. 2007;
24
MOELLER et al. 2007) and humans (e.g., PTAK and PRZEWORSKI 2002; HAMMER et al. 2003). In
all these cases, the site frequency spectra of pooled samples were characterized by an excess of
low-frequency variants, resulting in (more) negative DT values. Some of these studies explicitly
considered the possible contribution of purifying selection to these patterns, but concluded that
demographic (or range) expansions were at least partly responsible for the skew in the site
frequency spectra (POOL and AQUADRO 2006; MOELLER et al. 2007; see also NORDBORG et al.
2005). As far as neutral polymorphisms are concerned, our simulation results fully concur with
these interpretations, as only global expansions and not the effects of population subdivision per
se can generate substantially negative DT values (Figures 1–3). As many other researchers before
us, we have investigated the properties of general models of population structure in the context
of sampling schemes and their effects on sample genealogies. In contrast, HAMMER et al. (2003)
presented coalescent simulations of a non-standard ‘continuous-splitting’ model of human
population structure (where demes do not exchange migrants after being isolated from other
demes), and concluded that site frequency spectra of human data sets may be explained without
invoking global expansion.
Levels of gene flow and within-species panmixia: It may seem surprising to observe the
pooling effect even in highly mobile species such as Drosophila melanogaster (POOL and
AQUADRO 2006), thus providing clear-cut evidence for deviations from panmixia at seemingly
low levels of population subdivision (e.g., as quantified by FST). The impression of low levels of
population substructure may, at least in part, be a consequence of the FST estimator most widely
used for DNA sequence data, FST = 1 – πw/πb, which treats each SNP as a separate locus
(HUDSON et al. 1992). In empirical as well as simulated data sets, many segregating sites that
appear as singletons in local population samples remain singletons in pooled samples. These sites
25
do not contribute much to nucleotide diversity in either local or total samples, compared to SNPs
with intermediate frequencies, but they fuel a more pronounced increase in the number of
segregating sites in pooled samples (i.e., WATTERSON’s [1975] estimator θW increases much
more than π; see also RAY et al. 2003). These underlying features explain the effect of pooling
under apparently low levels of population subdivision. Moreover, they also highlight the inherent
dangers of equating low FST estimates with the (near) absence of population subdivision.
In the framework of the infinite-alleles model, JOST (2008) recently argued that FST (and
GST) are not at all appropriate measures of differentiation and pinpointed mathematical
misconceptions underlying the standard approach. In particular, he identified the classical
additive partitioning of total heterozygosity (HT, the heterozygosity of the pooled
subpopulations) into mean within-subpopulation heterozygosity (HS) and a between-
subpopulation component (HT – HS = DST; NEI 1973) as erroneous, because HS and DST are not
truly independent but rather related through an incomplete partitioning (JOST 2008). Because
classical results concerning the absolute number of migrants per generation required to prevent
significant differentiation among local demes in the finite-island and stepping-stone models (e.g.,
Nm > 1, equation (2) above; e.g., WRIGHT 1951; MARUYAMA 1971) are based on the
interpretation of FST as a measure of differentiation, JOST (2008) concluded that such ‘rules of
thumb’ must be considered invalid. Although JOST (2008) did not formally evaluate the infinite-
sites model of sequence evolution, this conclusion goes hand in hand with our interpretations in
the preceding paragraph. Importantly, the marked effects of sampling scheme on the site
frequency spectrum in our simulations despite high levels of gene flow are not restricted to
expanding populations but were found also for subdivided populations at demographic
equilibrium (e.g., Nm > 10; Figures 1, 2). Similar results were previously obtained by DE and
26
DURRETT (2007) for equilibrium populations and, implicitly, by RAY et al. (2003) for expanding
populations.
Empirical considerations and implications for statistical inference: Large-scale
population-genetic data sets are commonly evaluated in the framework of panmictic populations
undergoing temporal changes in population size (humans: e.g., MARTH et al. 2004; VOIGHT et al.
2005; GARRIGAN et al. 2007; Drosophila: e.g., OMETTO et al. 2005; LI and STEPHAN 2006;
THORNTON and ANDOLFATTO 2006; plants: HEUERTZ et al. 2006; PYHÄJÄRVI et al. 2007). In
conjunction with the empirical data discussed above, our results suggest that a more appropriate
approach would acknowledge the subdivided nature of these species explicitly, which would
require paying particular attention to sampling schemes and their genealogical consequences.
Especially if empirical sampling is from a local population (as has often been the case), it is
inappropriate to compare patterns of diversity with expectations under the classical NE
conditions embodied in commonly used tests of ‘neutrality’. Generally, local population samples
should not be modeled as coming from a panmictic (e.g., continental) population but rather as
being drawn from one of many demes comprising the total metapopulation.
Put another way, the traditional almost exclusive focus on the temporal trajectory of
population-size changes that best explains properties of local/regional samples ought to be
questioned, especially in systems with moderate-to-high levels of gene flow where sample
genealogies are not regionally monophyletic, but rather are deeply embedded in species-wide
sample genealogies (WAKELEY 2001; WAKELEY and ALIACAR 2001). In such cases, the
properties of local samples (e.g., the number of segregating sites, nucleotide diversity, the site
frequency spectrum, and the extent and decay of LD) arguably reflect the level of connectivity
with (degree of isolation from) other demes during the scattering phase of the coalescent process
27
(WAKELEY 2001; RAY et al. 2003, WAKELEY and LESSARD 2003; DE and DURRETT 2007). Our
simulation scheme has been simplistic in assuming equal migration rates and deme sizes through
time and across the entire metapopulation, but it is straightforward to explain empirical patterns,
such as those found for various human population samples, in terms of differences in local
population size and immigration rates during the recent past (WAKELEY 2001; RAY et al. 2003;
EXCOFFIER 2004), arguably mediated by environmental heterogeneity (WEGMANN et al. 2006).
Guided by notions of the special importance of population structure and the potential for
local adaptation in plants, several recent studies have emphasized the need to include genuine
local population samples in population-genetic studies, rather than relying exclusively on
scattered samples (WRIGHT and GAUT 2005; MOELLER et al. 2007; ROSS-IBARRA et al. 2008).
While we agree that sampling locally has particular merit for studying signatures of local
adaptation (see ARUNYAWAT et al. 2007), our present results caution against analyzing such local
samples naively, i.e., in the conventional framework of panmictic populations. In principle,
sampling several local demes offers the opportunity to empirically contrast the properties of local
samples (e.g., site frequency spectrum, decay of LD) with those of the pooled sample, analogous
to two of our three simulation sampling schemes. At the very least, this exercise has the potential
to infer general features of the species-wide demographic history despite the biases inherent in
local samples. It would also (at least partially) mitigate against inferring spurious signatures of
natural selection from levels and patterns of nucleotide polymorphism, or from the extent of
haplotype structure (for examples, see WAKELEY and LESSARD 2003; DE and DURRETT 2007).
However, only the genealogical structure of scattered samples is comparable to that of an elusive
panmictic population that has experienced the same temporal demographic history (e.g.,
expansion), whereas such samples preclude finding molecular evidence of local adaptation.
28
We thank Tanja Pyhäjärvi for information on her multideme data set on Pinus sylvestris
and Aurelien Tellier for valuable discussions on metapopulation dynamics and the coalescent in
subdivided populations. Two anonymous referees provided constructive criticism that helped to
improve the final version of this paper. Furthermore, we thank Dick Hudson for allowing us to
redistribute a modified version of his coalescent simulation program ms (HUDSON 2002). This
work was funded by the Deutsche Forschungsgemeinschaft through its Priority Program
“Radiations—Origins of Biological Diversity” (SPP-1127), grant Ste 325/5-3 to W.S., and
Research Unit “Natural Selection in Structured Populations” (FOR-1078), grant Ste 325/12-1 to
W.S. and grant Pf 672/1-1 to P.P.; P.P. acknowledges additional support by the German Federal
Ministry of Education and Research (BMBF) through the Freiburg Initiative for Systems
Biology, grant 0313921.
29
LITERATURE CITED
ADAMS, A. M., and R. R. HUDSON, 2004 Maximum-likelihood estimation of demographic
parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms.
Genetics 168: 1699–1712.
ALONSO, S., and J. A. L. ARMOUR, 2001 A highly variable segment of human subterminal 16p
reveals a history of population growth for modern humans outside Africa. Proc. Natl.
Acad. Sci. USA 98: 864–869.
ARUNYAWAT, U., W. STEPHAN and T. STÄDLER, 2007 Using multilocus sequence data to assess
population structure, natural selection, and linkage disequilibrium in wild tomatoes. Mol.
Biol. Evol. 24: 2310–2322.
BAUDRY, E., B. VIGINIER and M. VEUILLE, 2004 Non-African populations of Drosophila
melanogaster have a unique origin. Mol. Biol. Evol. 21: 1482–1491.
BAUDRY, E., N. DEROME, M. HUET and M. VEUILLE, 2006 Contrasted polymorphism patterns in
a large sample of populations from the evolutionary genetics model Drosophila simulans.
Genetics 173: 759–767.
DE, A., and R. DURRETT, 2007 Stepping-stone spatial structure causes slow decay of linkage
disequilibrium and shifts the site frequency spectrum. Genetics 176: 969–981.
EXCOFFIER, L., 2004 Patterns of DNA sequence diversity and genetic structure after a range
expansion: lessons from the infinite-island model. Mol. Ecol. 13: 853–864.
FAY, J. C., and C.-I WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155:
1405–1413.
FU, Y.-X., and W.-H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–
709.
30
GARRIGAN, D., S. B. KINGAN, M. M. PILKINGTON, J. A. WILDER, M. P. COX et al., 2007
Inferring human population sizes, divergence times and rates of gene flow from
mitochondrial, X and Y chromosome resequencing data. Genetics 177: 2195–2207.
HAMMER, M. F., F. BLACKMER, D. GARRIGAN, M. W. NACHMAN and J. A. WILDER, 2003
Human population structure and its effects on sampling Y chromosome sequence variation.
Genetics 164: 1495–1509.
HEUERTZ, M., E. DE PAOLI, T. KÄLLMAN, H. LARSSON, I. JURMAN et al., 2006 Multilocus
patterns of nucleotide diversity, linkage disequilibrium and demographic history of
Norway Spruce [Picea abies (L.) Karst]. Genetics 174: 2095–2105.
HEY, J., and R. NIELSEN, 2004 Multilocus methods for estimating population sizes, migration
rates and divergence time, with applications to the divergence of Drosophila
pseudoobscura and D. persimilis. Genetics 167: 747–760.
HUDSON, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic
variation. Bioinformatics 18: 337–338.
HUDSON, R. R., M. SLATKIN and W. P. MADDISON, 1992 Estimation of levels of gene flow from
DNA sequence data. Genetics 132: 583–589.
INGVARSSON, P. K., 2005 Nucleotide polymorphism and linkage disequilibrium within and
among natural populations of European aspen (Populus tremula L., Salicaceae). Genetics
169: 945–953.
JOST, L., 2008 GST and its relatives do not measure differentiation. Mol. Ecol. 17: 4015–4026.
KEINAN, A., J. C. MULLIKIN, N. PATTERSON and D. REICH, 2007 Measurement of the human
allele frequency spectrum demonstrates greater genetic drift in East Asians than in
Europeans. Nat. Genet. 39: 1251–1255.
31
LI, H., and W. STEPHAN, 2006 Inferring the demographic history and rate of adaptive
substitution in Drosophila. PLoS Genet. 2: e166.
MARTH, G. T., E. CZABARKA, J. MURVAI and S. T. SHERRY, 2004 The allele frequency
spectrum in genome-wide human variation data reveals signals of differential demographic
history in three large world populations. Genetics 166: 351–372.
MARUYAMA, T., 1971 Analysis of population structure II. Two-dimensional stepping stone
models of finite length and other geographically structured populations. Ann. Hum. Genet.
Lond. 35: 179–196.
MOELLER, D. A., M. I. TENAILLON and P. TIFFIN, 2007 Population structure and its effects on
patterns of nucleotide polymorphism in teosinte (Zea mays ssp. parviglumis). Genetics
176: 1799–1809.
NEI, M., 1973 Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA
70: 3321–3323.
NIELSEN, R., and J. WAKELEY, 2001 Distinguishing migration from isolation: a Markov chain
Monte Carlo approach. Genetics 158: 885–896.
NOLTE, V., and C. SCHLÖTTERER, 2008 African Drosophila melanogaster and D. simulans
populations have similar levels of sequence variability, suggesting comparable effective
population sizes. Genetics 178: 405–412.
NORDBORG, M., T. T. HU, Y. ISHINO, J. JHAVERI, C. TOOMAJIAN et al., 2005 The pattern of
polymorphism in Arabidopsis thaliana. PLoS Biol. 3: e196.
OMETTO, L., S. GLINKA, D. DE LORENZO and W. STEPHAN, 2005 Inferring the effects of
demography and selection on Drosophila melanogaster populations from a chromosome-
wide scan of DNA variation. Mol. Biol. Evol. 22: 2119–2130.
32
PANNELL, J. R., 2003 Coalescence in a metapopulation with recurrent local extinction and
recolonization. Evolution 57: 949–961.
POOL, J. E., and C. F. AQUADRO, 2006 History and structure of sub-Saharan populations of
Drosophila melanogaster. Genetics 174: 915–929.
PTAK, S. E., and M. PRZEWORSKI, 2002 Evidence for population growth in humans is
confounded by fine-scale population structure. Trends Genet. 18: 559–563.
PYHÄJÄRVI, T., M. R. GARCÍA-GIL, T. KNÜRR, M. MIKKONEN, W. WACHOWIAK and O.
SAVOLAINEN, 2007 Demographic history has influenced nucleotide diversity in European
Pinus sylvestris populations. Genetics 177: 1713–1724.
RAMOS-ONSINS, S. E., S. MOUSSET, T. MITCHELL-OLDS and W. STEPHAN, 2007 Population
genetic inference using a fixed number of segregating sites: a reassessment. Genet. Res.
89: 231–244.
RAY, N., M. CURRAT and L. EXCOFFIER, 2003 Intra-deme molecular diversity in spatially
expanding populations. Mol. Biol. Evol. 20: 76–86.
ROSELIUS, K., W. STEPHAN and T. STÄDLER, 2005 The relationship of nucleotide
polymorphism, recombination rate and selection in wild tomato species. Genetics 171:
753–763.
ROSS-IBARRA, J., S. I. WRIGHT, J. P. FOXE, A. KAWABE, L. DEROSE-WILSON et al., 2008
Patterns of polymorphism and demographic history in natural populations of Arabidopsis
lyrata. PLoS One 3: e2411.
SCHMID, K. J., S. E. RAMOS-ONSINS, H. RINGYS-BECKSTEIN, B. WEISSHAAR and T. MITCHELL-
OLDS, 2005 A multilocus sequence survey in Arabidopsis thaliana reveals a genome-wide
33
departure from a neutral model of DNA sequence polymorphism. Genetics 169: 1601–
1615.
SLATKIN, M., 1987 The average number of sites separating DNA sequences drawn from a
subdivided population. Theor. Popul. Biol. 32: 42–49.
STÄDLER, T., K. ROSELIUS and W. STEPHAN, 2005 Genealogical footprints of speciation
processes in wild tomatoes: demography and evidence for historical gene flow. Evolution
59: 1268–1279.
STÄDLER, T., U. ARUNYAWAT and W. STEPHAN, 2008 Population genetics of speciation in two
closely related wild tomatoes (Solanum section Lycopersicon). Genetics 178: 339–350.
STROBECK, C., 1987 Average number of nucleotide differences in a sample from a single
subpopulation: a test for population subdivision. Genetics 117: 149–153.
TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA
polymorphism. Genetics 123: 585–595.
THORNTON, K., and P. ANDOLFATTO, 2006 Approximate Bayesian inference reveals evidence
for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster.
Genetics 172: 1607–1619.
VOIGHT, B. F., A. M. ADAMS, L. A. FRISSE, Y. QIAN, R. R. HUDSON and A. DI RIENZO, 2005
Interrogating multiple aspects of variation in a full resequencing data set to infer human
population size changes. Proc. Natl. Acad. Sci. USA 102: 18508–18513.
WAKELEY, J., 1999 Nonequilibrium migration in human history. Genetics 153: 1863–1871.
WAKELEY, J., 2000 The effects of subdivision on the genetic divergence of populations and
species. Evolution 54: 1092–1101.
WAKELEY, J., 2001 The coalescent in an island model of population subdivision with variation
34
among demes. Theor. Popul. Biol. 59: 133–144.
WAKELEY, J., 2004 Metapopulations and coalescent theory, pp. 175–198 in Ecology, Genetics,
and Evolution of Metapopulations, edited by I. HANSKI and O. GAGGIOTTI. Elsevier,
Oxford, UK.
WAKELEY, J., and N. ALIACAR, 2001 Gene genealogies in a metapopulation. Genetics 159: 893–
905.
WAKELEY, J., and J. HEY, 1997 Estimating ancestral population parameters. Genetics 145: 847–
855.
WAKELEY, J., and S. LESSARD, 2003 Theory of the effects of population structure and sampling
on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 164:
1043–1053.
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without
recombination. Theor. Popul. Biol. 7: 256–276.
WEGMANN, D., M. CURRAT and L. EXCOFFIER, 2006 Molecular diversity after a range
expansion in heterogeneous environments. Genetics 174: 2009–2020.
WHITLOCK, M. C., and D. E. MCCAULEY, 1999 Indirect measures of gene flow and migration:
FST doesn’t equal 1/(4Nm + 1). Heredity 82: 117–125.
WILKINSON-HERBOTS, H. M., 2008 The distribution of the coalescent time and the number of
pairwise nucleotide differences in the “isolation with migration” model. Theor. Popul.
Biol. 73: 277–288.
WRIGHT, S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323–354.
WRIGHT, S. I., and B. S. GAUT, 2005 Molelcular population genetics and the search for adaptive
evolution in plants. Mol. Biol. Evol. 22: 506–519.
35
TABLE 1
Tajima’s D and Fu and Li’s D values in pooled versus population-specific tomato samples
S. peruvianum S. chilense
Locus Mean of 4 pops. Pooled Mean of 4 pops. Pooled
CT066 –0.61
–0.48
–0.68
0.05
1.07
0.64
–0.29
–1.55
CT093 –0.19
–0.34
–1.79
–3.25
0.03
0.00
–0.82
–1.85
CT166 –0.63
–0.85
–1.85
–2.71
0.12
0.36
0.04
–0.27
CT179 –0.01
–0.09
–1.07
–1.40
0.64
0.49
–0.69
–1.53
CT198 –0.14
–0.01
–0.95
–0.26
0.20
0.38
–1.10
–0.05
CT208 –0.60
–1.30
–1.19
–1.64
—
—
—
—
CT251 0.00
0.02
–0.89
–1.01
0.71
0.82
0.04
–0.51
CT268 0.21
0.37
–0.66
–0.81
0.61
1.14
0.70
0.47
36
Average –0.25
–0.34
–1.14
–1.38
0.48
0.55
–0.30
–0.76
For each locus, the first row shows values of Tajima’s DT, while the second row shows
values of Fu and Li’s DFL. Data were taken from ARUNYAWAT et al. (2007), but here all
nonsynonymous SNPs and those segregating more than two nucleotides (i.e., showing evidence
of recurrent mutations) were eliminated before estimating DT and DFL. The ‘pooled’ columns
show values based on the combined (total) sample within each species, treated as a single entity.
Note that almost all loci individually exhibit lower/more negative DT and DFL values in the
pooled samples. Locus CT208 was not included for S. chilense because patterns of
polymorphism were suggestive of natural selection (ARUNYAWAT et al. 2007).
37
FIGURE CAPTIONS
FIGURE 1.—Averages of Tajima’s DT (A) and Fu and Li’s DFL (B) under three sampling
schemes as a function of levels of gene flow. We simulated an equilibrium stepping-stone model
with I = 100 islands; the simulations were carried out without recombination. Every plotted point
is based on 1000 independently generated data sets, and standard errors of the means are
indicated by vertical lines.
FIGURE 2.—Averages of Tajima’s DT (A) and Fu and Li’s DFL (B) under species-wide
expansion (β = 10, τ = 2) as a function of migration rates between demes. We simulated a
stepping-stone model with 100 demes without recombination. Every plotted point is based on
1000 independently generated data sets; error bars represent standard errors.
FIGURE 3.—Averages of Tajima’s DT (A) and Fu and Li’s DFL (B) as functions of the
expansion factor β (with fixed migration rate and time of expansion), with the power of these
statistics evaluated in the top part of each plot (right y-axis; power was assessed at a level of P =
0.05). The same simulation scheme was used as for Figure 2.
FIGURE 4.—The power of Tajima’s DT (A) and Fu and Li’s DFL (B) as a function of the
expansion time τ (with fixed migration rate and expansion factor). The same simulation scheme
was used as for Figure 2; power was assessed at a level of P = 0.05.
38
FIGURE 5.—FST as a function of 4N0m. (A) τ = 8N0 generations is fixed and β is varying;
(B) β = 10 is fixed and τ is varying. The same simulations as in Figure 2 were used (i.e.,
stepping-stone model with 100 demes).
FIGURE 6.—Contour plots showing average values of DT obtained in simulations designed
to mimic the empirical sampling of ARUNYAWAT et al. (2007). (A) Single-deme samples (n = 11
sequences); (B) pooled samples (n = 44 sequences). For each of the 1000 simulations for a given
combination of expansion factor (β) and time since expansion (τ), 11 sequences were drawn
from each of four randomly chosen demes, and the 44 sequences were also evaluated as a pooled
sample (B). We simulated 400 demes under stepping-stone structure (θ [per deme] = 0.25, 4N0m
= 5). The empirical average DT values for S. peruvianum are –0.25 (local) and –1.14 (pooled),
while they are 0.48 (local) and –0.30 (pooled) for S. chilense; see Table 1.
39
Figure 1: A = left, B = right
40
Figure 2: A = left; B = right
41
Figure 3: A = left; B = right
42
Figure 4: A = left; B = right
43
Figure 5: A = left; B = right
44
Figure 6: A = left; B = right