the impact of sampling schemes on the site frequency ... impact of sampling schemes on the site...

44
1 The impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations Thomas Städler, * Bernhard Haubold, Carlos Merino, Wolfgang Stephan and Peter Pfaffelhuber § * ETH Zurich, Institute of Integrative Biology, Plant Ecological Genetics, CH-8092 Zurich, Switzerland Max-Planck-Institute for Evolutionary Biology, Department of Evolutionary Genetics, D-24306 Plön, Germany University of Munich (LMU), Department of Biology II, Section of Evolutionary Biology, D-82152 Planegg-Martinsried, Germany § University of Freiburg, Faculty of Mathematics and Physics, D-79104 Freiburg, Germany Genetics: Published Articles Ahead of Print, published on February 23, 2009 as 10.1534/genetics.108.094904

Upload: nguyendung

Post on 13-May-2018

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

1

The impact of sampling schemes on the site frequency spectrum

in non-equilibrium subdivided populations

Thomas Städler,* Bernhard Haubold,† Carlos Merino,‡

Wolfgang Stephan‡ and Peter Pfaffelhuber§

* ETH Zurich, Institute of Integrative Biology, Plant Ecological Genetics,

CH-8092 Zurich, Switzerland

† Max-Planck-Institute for Evolutionary Biology, Department of Evolutionary Genetics,

D-24306 Plön, Germany

‡ University of Munich (LMU), Department of Biology II, Section of Evolutionary Biology,

D-82152 Planegg-Martinsried, Germany

§ University of Freiburg, Faculty of Mathematics and Physics,

D-79104 Freiburg, Germany

Genetics: Published Articles Ahead of Print, published on February 23, 2009 as 10.1534/genetics.108.094904

Page 2: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

2

Running head: Genealogies and Population Subdivision

Key words: coalescent, demography, population subdivision, range expansion, site frequency

spectrum

Corresponding author: Dr. Thomas Städler

ETH Zurich

Institute of Integrative Biology

Plant Ecological Genetics

Universitätstrasse 16

CH-8092 Zurich, Switzerland

Phone: +41-44-6327429

Fax: +41-44-6321215

E-mail: [email protected]

Page 3: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

3

ABSTRACT

Using coalescent simulations, we study the impact of three different sampling schemes on

patterns of neutral diversity in structured populations. Specifically, we are interested in two

summary statistics based on the site frequency spectrum as a function of migration rate,

demographic history of the entire substructured population (including timing and magnitude of

species-wide expansions), and the sampling scheme. Using simulations implementing both

finite-island and two-dimensional stepping-stone spatial structure, we demonstrate strong effects

of the sampling scheme on Tajima’s D (DT) and Fu and Li’s D (DFL) statistics, particularly under

species-wide (range) expansions. Pooled samples yield average DT and DFL values that are

generally intermediate between those of local and scattered samples. Local samples (and to a

lesser extent, pooled samples) are influenced by local, rapid coalescence events in the underlying

coalescent process. These processes result in lower proportions of external branch lengths and

hence lower proportions of singletons, explaining our finding that the sampling scheme impacts

DFL more than it does DT. Under species-wide expansion scenarios, these effects of spatial

sampling may persist up to very high levels of gene flow (Nm > 25), implying that local samples

cannot be regarded as being drawn from a panmictic population. Importantly, many data sets on

humans, Drosophila, and plants contain signatures of species-wide expansions and effects of

sampling scheme that are predicted by our simulation results. This suggests that validating the

assumption of panmixia is crucial if robust demographic inferences are to be made from local or

pooled samples. However, future studies should consider to adopt a framework that explicitly

accounts for the genealogical effects of population subdivision and empirical sampling schemes.

Page 4: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

4

Although most (if not all) species are characterized by various degrees of population subdivision,

population geneticists continue to employ models of panmictic populations of constant size (the

neutral equilibrium [NE] model) when testing for selection and/or demographic changes through

time. Motivated by empirical data sets that were found to be incompatible with expectations

under the NE model, recent large-scale studies have employed coalescent simulations of

demographic changes such as bottlenecks and/or population expansions in attempts to estimate

parameters of such (arguably) biologically more realistic scenarios (e.g., MARTH et al. 2004;

NORDBORG et al. 2005; OMETTO et al. 2005; SCHMID et al. 2005; LI and STEPHAN 2006;

HEUERTZ et al. 2006; PYHÄJÄRVI et al. 2007; ROSS-IBARRA et al. 2008). However, such

simulations still assume unstructured populations, while the empirical data may derive from

subdivided species and a variety of sampling schemes. The data typically collected in such

projects consist of multilocus DNA sequences from which genealogical information is extracted,

e.g., on the number of segregating sites (S), the average pairwise sequence diversity (π), and

either the full site frequency spectrum or summary statistics based on the site frequency

spectrum, such as Tajima’s D (TAJIMA 1989), Fu and Li’s D (FU and LI 1993), and Fay and

Wu’s H (FAY and WU 2000).

Here, we are concerned with estimates of species-wide demographic history from such

summary statistics, and particularly with the joint impact of population subdivision, population

expansion, and the sampling scheme on statistics summarizing the site frequency spectrum. Our

motivation stems largely from striking patterns in empirical data sets that, we believe, need to be

evaluated for their general implications. For example, in empirical studies that include several

genuine population samples, it becomes possible to contrast the site frequency spectra of local

population samples with those obtained for the pooled (combined) samples. Although this

Page 5: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

5

possibility has not been exploited in all such studies, the common observation has been that site

frequency spectra are more skewed toward low-frequency polymorphisms in pooled samples,

compared to the means of genuine population samples; we refer to this phenomenon as the

‘pooling effect’.

Such genealogical signals have been obtained in a variety of organisms such as humans

(ALONSO and ARMOUR 2001; PTAK and PRZEWORSKI 2002; HAMMER et al. 2003), Drosophila

(POOL and AQUADRO 2006), and several plant species (INGVARSSON 2005; ARUNYAWAT et al.

2007; MOELLER et al. 2007; PYHÄJÄRVI et al. 2007). At least partly due to the putatively

confounding effects of population subdivision and population expansion first identified by PTAK

and PRZEWORSKI (2002), several studies have explicitly favored the analysis and interpretation of

data obtained from large samples of single populations or geographic regions (e.g., ADAMS and

HUDSON 2004; MARTH et al. 2004; GARRIGAN et al. 2007). Similarly, a recent multilocus study

of several populations of teosinte (the wild ancestor of maize) reiterated the notion of

genealogical signals for population expansion (e.g., negative values of Tajima’s D) being

confounded with effects of population structure (MOELLER et al. 2007).

In our recent study of patterns of polymorphism and population subdivision in two closely

related species of wild tomatoes, however, we arrived at quite different conclusions

(ARUNYAWAT et al. 2007). Inspired by the results of two complementary simulation studies of

subdivided populations (RAY et al. 2003; DE and DURRETT 2007; see below), we interpreted this

apparently widespread empirical pattern (see references cited above) as being a consequence of

the effect of the sampling scheme on sample genealogies. Specifically, ARUNYAWAT et al.

(2007) envisioned the impact of population subdivision as ‘masking’ the true genealogical signal

of the species-wide demography via loss of low-frequency polymorphisms in local samples,

Page 6: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

6

rather than as ‘creating’ an excess of low-frequency variants in pooled samples. The difference

in interpretation compared to the teosinte study by MOELLER et al. (2007) is not due to the

underlying data, as the site frequency spectra of pooled population samples (four populations per

species in our study on wild tomatoes) also showed a marked excess of singletons compared to

the average spectra for individual populations, resulting in significantly negative values of

Tajima’s D and/or Fu and Li’s D (ARUNYAWAT et al. 2007; STÄDLER et al. 2008).

Using parameter values designed to reflect patterns of variation in human mitochondrial

DNA, RAY et al. (2003) simulated a range expansion under stepping-stone structure and sampled

30 sequences taken from a single deme. Under fairly low levels of gene flow between demes,

sample genealogies were shaped by a mixture of local coalescent events (leading to near-zero

branch lengths) and ancestral lineages coalescing in the founding deme at the onset (looking

forward in time) of the species-wide range expansion. The overall shape of such genealogies

yielded values of Tajima’s D that were not significant under NE expectations for low to

moderate migration rates. Only under very high levels of gene flow did the single-deme samples

exhibit genealogies expected for strong population expansions (RAY et al. 2003), reflecting the

low number of coalescence events within the sampled deme (WAKELEY 1999, 2001, 2004;

WAKELEY and ALIACAR 2001). Analogous phenomena were recently uncovered by DE and

DURRETT (2007) who simulated an equilibrium stepping-stone model of 100 demes, using scaled

mutation parameters appropriate for nuclear loci in humans. Compared to expectations for a

panmictic equilibrium population, these authors found lower average S, more extended linkage

disequilibrium, and average frequency spectra skewed toward intermediate-frequency

polymorphisms (i.e., positive Tajima’s D values) in simulated samples from single demes.

Page 7: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

7

A notable limitation of these studies is their focus on samples obtained from single demes

(RAY et al. 2003), although DE and DURRETT (2007) also considered random samples and

showed that their genealogical properties approach those of samples from random-mating

populations. ARUNYAWAT et al. (2007) qualitatively interpreted the pooling effect identified in

their wild tomato data (and observed for several other empirical studies cited above) as resulting

from the changed genealogical structure of combined samples, analogous to increasing the

proportion of migrants in single-deme samples. In both circumstances, scattering-phase

coalescence events ought to decrease in favor of higher numbers of ancestral lines remaining at

the start of the collecting phase (looking backward in time) of the coalescent process, thus

yielding higher proportions of external branches of the genealogies.

In this study, we set out to test the interpretations of ARUNYAWAT et al. (2007), in

particular their qualitative prediction that the genealogies of pooled samples (comprised of

several local population samples treated as one entity), as reflected in the site frequency

spectrum, should be intermediate between those of local and scattered samples. The latter

comprise single sequences from each of many demes and are known to possess the genealogical

structure of samples from unstructured populations (assuming a large number of demes;

WAKELEY 1999, 2001, 2004; WAKELEY and ALIACAR 2001). More generally, we assess the

genealogical properties of these three types of samples under both the symmetric finite-island

model and the stepping-stone model of population structure. Given that ARUNYAWAT et al.

(2007) interpreted the frequent observation of negative Tajima’s D values in pooled samples of

diverse taxa (e.g., PTAK and PRZEWORSKI 2002; HAMMER et al. 2003; POOL and AQUADRO 2006;

MOELLER et al. 2007) as signatures of species-wide (range) expansions, our emphasis is on

models of structured populations undergoing an overall expansion in population size. To this

Page 8: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

8

end, our coalescent simulations encompass a wide range of migration rates, magnitude, and

timing of expansions. We find a strong effect of the sampling scheme on two commonly used

summary statistics based on the site frequency spectrum, where the pooling effect gradually

decreases with increasing levels of gene flow. Our results imply that sequence data for most

species should not be evaluated under a ‘panmixia’ framework and that attention to sampling is

paramount even for weakly subdivided taxa. Moreover, disentangling the effects of demography

from those of positive selection appears to be an even more challenging task than currently

assumed.

MODELS AND METHODS

Coalescent simulations under two models of population structure: All patterns of

sequence diversity were generated using the coalescent simulation software ms (HUDSON 2002)

to model the following evolutionary scenario. At the time of sampling, the population consists of

I demes, each containing N0 diploid individuals. For the first set of simulations, the subdivided

population is at equilibrium with constant population size. Along every line mutations

accumulate at rate µ, and θ = 4N0µ. We chose to simulate mainly with a fixed value of θ = 1 (for

simulations implementing 100 demes) rather than with a fixed number of segregating sites S, or

choosing different θs for different simulation scenarios to obtain realistic values of S and/or π.

Simulating with a fixed S has been shown to yield inaccurate results under non-equilibrium

demography (RAMOS-ONSINS et al. 2007), and the critical values for Tajima’s D and Fu and Li’s

D depend not too strongly on θ. Moreover, the effects we describe in this article are orders of

magnitude larger than what could be generated by choosing a different θ or fixing S.

Page 9: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

9

The demes exchange (haploid) migrants either under an island model or a two-dimensional

stepping-stone model at rate m, and we consider a broad range of gene flow: 0.1 ≤ 4N0m ≤ 100.

Under the island model an ancestral line in deme j switches its location to deme i at rate m/(I –

1). Under the stepping-stone model, we assume that I = a2, i.e., the population is arranged in a

square lattice of a × a demes and we assume periodic boundary conditions. This means that the

migration rate is m/4 if i = (i1,i2) and j = (j1,j2) are neighboring demes, and 0 otherwise. Here, i

and j are neighbors if |(i1 – j1)mod a| + |(i2 – j2)mod a| = 1. In other words, an individual at

location (a,i2) can migrate to (1,i2) and one at (i1,a) can migrate to (i1,1) and vice versa. We

modified ms in order to be able to efficiently sample sequences from randomly chosen demes

rather than fixed demes. Specifically, in each iteration the modified version of ms shuffles the

entries in the sample configuration array inconfig at the beginning of the function

segtre_mig (HUDSON 2002). The C code of this program is available from

http://guanine.evolbio.mpg.de/sampling.

Implementing range expansions under population subdivision: For an additional set of

coalescent simulations, we assume that the structure in the population was created some time τ in

the past (τ is measured in units of 4N0 generations), i.e., before time τ the population was

panmictic and of size NA. This scheme ought to be plausible under range expansions, e.g., as

exemplified by migration out of Africa by both humans and Drosophila melanogaster and

subsequent colonization of expansive areas, or temperate-zone populations expanding from

glacial refugia, or following a speciation event such as that inferred for the two wild tomato

species Solanum peruvianum and S. chilense (STÄDLER et al. 2005, 2008). Moreover, this is

essentially a generalized “isolation with migration” (IM) model of divergence with a large

number of extant demes (NIELSEN and WAKELEY 2001; HEY and NIELSEN 2004; WILKINSON-

Page 10: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

10

HERBOTS 2008). Looking forward in time, at time τ the ancestral population splits into I demes

of equal size and in equal proportions. The ‘expansion factor’ for the total population at time τ is

thus given by

β = I × N0/NA. (1)

Note that a value of β = 1 implies constant population size in the sense that a panmictic

population at time τ in the past split into I demes each of size N0 (= NA/I), without changing the

total census size of the entire, now subdivided population. This particular scenario can be seen as

a form of ‘range fragmentation’, albeit without decline in total population size.

Sampling schemes and descriptors of diversity and differentiation: Simulated samples

of total size n (20 in our numerical examples) from the structured population were implemented

as either local, pooled, or scattered. Local samples contain n sequences from a single island, i.e.,

only one arbitrarily chosen deme is sampled. Pooled samples contain several lines each from

several demes (we take five lines from each of four demes in our simulations). Scattered samples

encompass single sequences from each of n different demes, i.e., only one sequence per sampled

deme.

A commonly used statistic to quantify population structure from patterns of diversity

within and among local populations is FST (e.g., HUDSON et al. 1992). In particular, if the number

of demes is large and the population is in equilibrium,

E[FST] = 1 / (1 + 4N0m), (2)

(e.g., WRIGHT 1951) and thus migration rates can, in principle, be estimated from observed

values of FST (but see WHITLOCK and MCCAULEY [1999] for numerous caveats, and Jost [2008]

for a more fundamental critique of FST-based estimates of differentiation and gene flow). In our

simulations, FST can only be computed under the ‘pooled’ sampling scheme. We used the

Page 11: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

11

formula FST = 1 – πw/πb, where πb is the ‘average’ number of differences for pairs of sequences

taken from different demes, and πw is the ‘average’ number of differences for pairs sampled

within demes. The ‘average’ here means only the average over all pairs of sequences, and not

over all simulation runs. Thus, for every simulation run, we recorded exactly one FST value. For

the stepping-stone model, we used the same computations, i.e., FST does not take isolation by

distance into account.

To describe sequence diversity patterns, we focus on the site frequency spectrum, as

summarized by statistics such as the widely used Tajima’s D (TAJIMA 1989) and Fu and Li’s D

(FU and LI 1993). For clarity, we denote these distinct D statistics as DT and DFL, respectively.

Using these particular summary statistics enables us to perform power analyses to reject the

standard NE model. Moreover, we chose to include DFL because the singleton class appeared to

be the major reason for the lower/more negative DT values in pooled versus local samples of

wild tomatoes (ARUNYAWAT et al. 2007). The statistical package R was used to drive our

modified version of ms and to compute these statistics from its output; the corresponding R

scripts are available at http://guanine.evolbio.mpg.de/sampling.

Demographic signatures in multilocus data from wild tomatoes: Our recent studies on

population subdivision and speciation in two wild tomato species provided the motivation for our

current simulation study (ROSELIUS et al. 2005; STÄDLER et al. 2005, 2008; ARUNYAWAT et al.

2007; see INTRODUCTION). We extracted the biallelic, synonymous and noncoding Single

Nucleotide Polymorphisms (SNPs) contained in the data of ARUNYAWAT et al. (2007) to confine

analyses of the site frequency spectrum to putatively neutral polymorphisms and those fitting the

infinite-sites assumption implemented in HUDSON’s (2002) ms program. Tajima’s DT and Fu and

Li’s DFL values of these pruned data sets were computed in the SITES program

Page 12: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

12

(http://lifesci.rutgers.edu/~heylab). Specifically, we computed the average DT and DFL values

across the four single-population samples per species, as well as the corresponding statistics

obtained by treating all sequences within species as a single sample; we refer to these latter

entities as the ‘pooled’ samples (see ARUNYAWAT et al. 2007).

While devising estimators of the demographic parameters β and τ is beyond the scope of

the present article (and may require the availability of multilocus data obtained from scattered

samples), we implemented coalescent simulations reflecting the actual sampling scheme used by

ARUNYAWAT et al. (2007) and fixed other parameters according to aspects of the empirical data

and perceived biological context. Both stepping-stone and island-model simulations were

performed with 400 local demes, θ = 0.25, and sampling 11 sequences from each of four local

demes (recall that demes were randomly chosen for each of the 1000 iterations). The migration

rate was set to 4N0m = 5, reflecting the average FST estimates found for both species

(ARUNYAWAT et al. 2007; and see Figure 5 below). Simulations explored the joint effects of

varying the time (τ) and magnitude (β) of demographic expansions concomitant with

establishment of population subdivision. From these simulations, we recorded estimates of DT

and DFL for two sampling schemes, local (means of the four local samples) and pooled (all 44

sequences treated as one sample). Average estimates of DT and DFL obtained in this manner for

many simulated combinations of β and τ are presented as contour plots.

RESULTS

Our coalescent simulations yield results for levels of nucleotide diversity, summary

statistics based on the site frequency spectrum, and population differentiation under both

equilibrium and non-equilibrium demographic history, all obtained for three different sampling

Page 13: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

13

schemes: local, pooled, and scattered. All our findings are consistent between the island model

and the stepping-stone model, and in this article we focus on quantitative results obtained under

stepping-stone spatial structure; results for the island model are presented as Supplemental

Figures online.

The site frequency spectrum under population subdivision: The simplest demographic

scenario we analyzed is an equilibrium population subdivided into 100 demes, i.e., we first focus

on the effects of population structure per se without any past changes of population size. As

expected, characteristics of the site frequency spectra depend strongly on the sampling scheme.

For the stepping-stone model of population structure, Figure 1 shows the simulation results under

various levels of gene flow. For a migration rate of 4N0m = 10, local samples produce values of

Tajima’s DT (Fu and Li’s DFL) that are significantly different from values expected under the NE

model (two-tailed test, p < 0.05) in 16% (39%) of all cases, while scattered samples give

significant results in only 6% (7%) of all simulations. In particular, we see that local samples

generate values for both statistics that are higher than expected for samples from panmictic

populations, reflecting a site frequency distribution skewed toward intermediate-frequency

mutations; this result mirrors the recent work of DE and DURRETT (2007).

For migration rates (in units of 4N0m) between ~2 and ~50, pooled samples exhibit site

frequency spectra that are broadly intermediate between those of local and scattered samples.

The differences in sample genealogies (as reflected in estimates of DT and DFL) gradually

diminish with higher levels of gene flow, but some differences among sampling schemes are still

apparent at fairly high migration rates (e.g., 4N0m >50 for DT and 4N0m ~100 for DFL; Figure 1).

Importantly, pooling data from several subpopulations does not generate negative values of DT

or DFL without an expansion of the total population (see DISCUSSION). These observations also

Page 14: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

14

hold for the island model, albeit with smaller discrepancies between the summary statistics for

scattered samples and those for the two other sampling schemes (i.e., a less pronounced skew

toward intermediate-frequency mutations for both pooled and local samples; Supplemental

Figure 1). For lower levels of gene flow (approx. 4 N0m < 2), the site frequency spectra of local

samples gradually shift toward NE expectations, while those of pooled samples yield

increasingly positive values of both DT and DFL under decreasing levels of migration. We

checked our simulations of this equilibrium model against analytical results showing that local

samples ought to be invariant for the level of nucleotide diversity, π, irrespective of the level of

symmetrical interdeme migration in an island model (SLATKIN 1987; STROBECK 1987). For both

models of population structure, we found approximately invariant mean π values for local

samples over the entire range of simulated migration rates (results not shown).

The impact of population/range expansions on the site frequency spectrum: Next, we

considered scenarios of (range) expansions under concomitant establishment of subdivision of

the total species range, as described in the MODELS AND METHODS section. In particular, we

simulated a single ancestral population that at time τ before the present experienced a

fragmentation into I demes, where the total number of individuals across the subdivided

population could vary from NA (the number of individuals in the single ancestral population, in

which case the expansion factor β = 1) to 100 × NA (equivalent to β = 100). Under a stepping-

stone model with a 10-fold population expansion 8N0 generations in the past, local samples still

exhibit values of DT and DFL that would be expected under NE conditions, as long as gene flow

is fairly low (Figure 2); these findings are consistent with simulation results by RAY et al.

(2003). The corresponding simulation results for the island model are shown in Supplemental

Figure 2.

Page 15: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

15

Scattered samples, however, contain a clear signal of the species-wide expansion at any

level of gene flow. The reason is that the coalescent of scattered samples almost behaves like a

neutral one with a population size proportional to the number of demes. Hence, this coalescent

picks up the signal of expansion as in the panmictic case. The same should hold for any sampling

scheme under sufficiently high migration, but Figure 2 shows that even under high levels of gene

flow (4N0m = 100), the site frequency spectrum of local samples is still different from that of

pooled and scattered samples. Moreover, the simulation results plotted in Figure 2 were obtained

under a 10-fold expansion, and we may expect discrepancies between local and scattered

samples to extend to even higher migration rates with higher expansion factors (see Figure 3).

Again, DT and DFL values obtained for pooled samples are intermediate between those of local

and scattered samples except for very low migration rates (approx. 4N0m < 0.5), similar to the

case of equilibrium subdivided populations (Figures 1, 2). These simulation results imply that

both local and pooled samples may be expected to underestimate the extent of any species-wide

(range) expansion to various degrees, depending on levels of gene flow connecting the demes

and the age and magnitude of the expansion. The latter aspect, however, is influenced by our

choice of simulating an instantaneous expansion followed by a period of constant population size

until the time of sampling; an exponential expansion scheme until the present would yield higher

detectability of expansion from local and pooled samples.

We point out that quantitative details of our simulation results for the three sampling

schemes depend to some extent on our choice of sampling 20 sequences distributed over one

(‘local’), four (‘pooled’), and 20 demes (‘scattered’), respectively. Generally speaking,

decreasing the number of sequences sampled per deme and/or increasing the number of demes

sequences are sampled from shifts the site frequency spectrum more toward that of scattered

Page 16: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

16

samples, reflecting the diminished impact of the scattering phase of the coalescent process on

such samples. The exact numerical composition of local and pooled samples also affects the

migration rate at which the expected DT and DFL values for pooled samples drop below those for

local samples (see Figures 1, 2).

Power of DT and DFL to detect expansion under different sampling schemes and

expansion times: Illustrated by an intermediate level of interdeme migration (4N0m = 10), we

assessed the power of the test statistics DT and DFL under a range of expansion factors and times

of expansion. For the three sampling schemes, Figure 3 summarizes the power of DT and DFL

assuming an expansion time of τ = 8N0 generations ago. The low power of local samples to

detect significant departures from the NE model regardless of the magnitude of the species-wide

expansion is striking; qualitatively, this is consistent with results by RAY et al. (2003). Even if

the species-wide expansion was 100-fold, local samples deviate from NE expectations in only

8% (for DT) and 12% (for DFL) of all cases under these conditions. In sharp contrast, scattered

samples deviate from NE expectations in 98% (99%) of all cases for β = 100. The corresponding

simulation results for the island model are presented in Supplemental Figure 3.

Next, we illustrate the effect of the timing of the expansion for a fixed 10-fold expansion

and a migration rate of 4N0m = 10. Under these conditions, expansion times in the approximate

range 1 < τ < 15 can be detected in principle, but again with striking differences in power

exhibited by local versus scattered samples (Figure 4). The shape of the curves in Figure 4 can be

explained intuitively: if τ is very small (i.e., establishment of population structure was very

recent), samples appear to be drawn from a panmictic population of constant size. On the other

hand, if τ is very large, samples appear to be drawn from an equilibrium subdivided population

and hence the expansion cannot be detected. Under the island model, all results are qualitatively

Page 17: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

17

the same but with even smaller power to reject stable population size for local samples (at most

10% less power; Supplemental Figure 4). As these power assessments are based on simulated

sample genealogies of single loci, the actual power available with empirical multilocus data

would be correspondingly higher.

All simulation results appear to depend only weakly on the number of demes, as long as I

>> n. However, an increase in the number of demes carries important connotations, as the

genealogical signatures of past expansions remain detectable for longer time periods than with

fewer demes. For example, simulating with a constant ratio τ/I = 0.02 (i.e., equivalent to τ = 2 for

I = 100, τ = 10 for I = 500, etc.) under otherwise equal demographic conditions, we obtained

fairly constant but sampling-specific estimates of DT and DFL, as illustrated for the island model

in Supplemental Figure 5. One interpretation is that the effective population size is proportional

to I but depends on the sampling scheme. These observations imply approximately equal

detectability of expansions (for a given sampling scheme) over a large range of I, and thus the

dependency of ‘relevant’ τ values on I. This latter effect is a direct consequence of lengthening

the collecting phase of the coalescent process with increasing numbers of demes in the total

population.

Estimates of FST under non-equilibrium conditions: We studied the behavior of FST in

expanding populations (cf. MODELS AND METHODS) and consider both the stepping-stone and

island model of population structure. Given such scenarios, estimates of FST depend strongly on

the migration rate 4N0m but only weakly on the expansion time τ and the expansion factor β, as

long as 4N0m ≥ 2 (Figure 5 and Supplemental Figure 6). Under moderate and high levels of gene

flow, equation (2) gives a reasonable approximation of the simulated results, although the level

of population structure is mostly higher than expected under equilibrium conditions, and thus

Page 18: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

18

actual levels of gene flow would be mostly underestimated using FST calculated from non-

equilibrium populations. Under low levels of gene flow (4N0m < 2), however, strong and/or

recent expansions yield less population differentiation than expected under the assumption of

equilibrium conditions, and thus levels of gene flow would be mostly overestimated (Figure 5;

see also EXCOFFIER 2004).

Reanalyzing the pooling effect in two species of wild tomatoes: Our reanalysis of the

multilocus wild tomato data, limited to silent sites and biallelic SNPs, confirms the patterns

originally identified by ARUNYAWAT et al. (2007) based on all SNPs; with the exception of one

locus in S. peruvianum (CT066, DFL) and one locus in S. chilense (CT268, DT), the site

frequency spectra at individual loci yield lower/more negative values of DT and DFL in the

pooled samples compared to the means of four samples representing local populations (Table 1).

Considering the pooled samples, multilocus average DT (DFL) values are –1.14 (–1.38) in S.

peruvianum and –0.30 (–0.76) in S. chilense; the magnitude of the ‘drop’ from single-population

means in DT (DFL) is 0.89 (1.04) in S. peruvianum and 0.78 (1.31) in S. chilense (Table 1). Given

our simulation results under both equilibrium and expansion scenarios, the observed summary

statistics for pooled samples of both species are incompatible with equilibrium assumptions, but

compatible with species-wide expansions (Figures 1–3). Based on the observed levels of

population subdivision (FST approx. 0.20–0.23; ARUNYAWAT et al. 2007), a rough estimate of

average migration rates is 4N0m ~5 (Figure 5).

Our simulation results presented above convey how summary statistics such as DT and DFL

obtained from local or pooled samples ought to be affected jointly by migration rates and both

the magnitude and time of expansion. Figure 6 presents results of coalescent simulations tailored

to fit the empirical wild tomato data (i.e., sampling 11 sequences from each of four local demes

Page 19: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

19

and fixing the migration rate to 4N0m = 5). Even when restricting the analysis to two parameters

that were allowed to vary widely, it is clear that given observed values of DT and DFL are not

sufficient to permit robust estimation of the demographic parameters τ and β. The simulated

average DT values for pooled samples shown in the contour plot (Figure 6B) suggest that the

observed pooled DT values are compatible with ≥ 25-fold expansion in S. peruvianum and with ≥

4-fold expansion in S. chilense.

When viewed in isolation, the observed local DT values are consistent with a very high

expansion factor for S. peruvianum (β > 200) and with β ~1 for S. chilense. However,

considering the observed differences local–pooled, > 40-fold and < 4-fold expansions appear to

be inconsistent with the data for both species (Figure 6; Table 1). Similarly, the observed DFL

values would seem to be compatible with > 150-fold expansion in S. peruvianum and with > 3-

fold expansion in S. chilense, whereas based on the observed differences local–pooled, only >

120-fold expansions appear to be compatible with the data for both species (Table 1;

Supplemental Figure 7). Assuming that the expansion time of both species mirrors their

speciation time, we note that these β values are larger than those assuming the model of

WAKELEY and HEY (1997) as obtained in STÄDLER et al. (2008). This apparent discrepancy can

be explained at least partly by our findings, since the Wakeley–Hey model treats any empirical

sample as one from a panmictic population and hence picks up signals of expansion only partly,

leading to smaller estimates of the expansion factor.

We emphasize that changing some of the simulated parameter values, such as the number

of demes and the migration rate among demes, would suggest other β-values as being

‘consistent’ with the empirical data. Similarly, assuming an island-model rather than stepping-

stone structure would result in a site frequency spectrum more biased toward low-frequency

Page 20: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

20

variants, hence implying signatures of more moderate expansions than suggested by some

features of the data mentioned above (Supplemental Figures 8, 9). Moreover, these simulations

were performed with a uniform migration rate across the entire subdivided population, whereas

the empirical data may reflect (perhaps dramatic) differences in immigration rates among local

samples. Irrespective of the inherent difficulties of inferring credible demographic parameters

under population subdivision and non-stationarity, the generally lower/more negative DT and DFL

values for both local and pooled S. peruvianum samples strongly suggest a more pronounced

expansion compared to S. chilense (Table 1; ARUNYAWAT et al. 2007; STÄDLER et al. 2008).

DISCUSSION

Although models of subdivided populations have been studied extensively, the lack of

analytical results for local samples of size > 2 may have restricted their utility for genealogical

inference. WILKINSON-HERBOTS (2008) recently obtained expressions for the expected mean and

variance of pairwise sequence divergence (samples of size two drawn either from the same

subpopulation or from different subpopulations in a finite island model) in a generalized IM

model (see also EXCOFFIER 2004). As our simulations were motivated mainly by patterns in

empirical data, we have focused on larger sample sizes and aspects other than pairwise sequence

diversity, but the underlying model of population subdivision with variable timing and

magnitude of population expansion is identical to that studied by WILKINSON-HERBOTS (2008).

For the most part, our simulations implemented 100 demes and a sample size of 20, and thus

should approach the many-demes limit utilized in Wakeley’s studies of the island model (e.g.,

WAKELEY 1999, 2001, 2004; WAKELEY and ALIACAR 2001), and more recently shown by DE

and DURRETT (2007) for the stepping-stone model. Simulating local samples under both island

Page 21: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

21

and stepping-stone spatial structure, DE and DURRETT (2007) found lower numbers of

segregating sites, higher linkage disequilibrium (LD), and median site frequency spectra that

were skewed toward intermediate-frequency polymorphisms (i.e., yielding positive DT values) in

comparison with samples from randomly mating populations. However, they did not investigate

the behavior of samples pooled across several demes. WAKELEY and LESSARD (2003) have

shown that levels of LD can be significantly elevated in local samples, depending on migration

rates and the total number of demes. Their analyses highlighted the potential impact of sampling

scheme and demography on the validity of evolutionary inferences drawn under unrealistic

assumptions, a topic to which we shall return below.

Site frequency spectra under different sampling schemes: Using coalescent simulations,

we evaluated three sampling schemes of sequences drawn from subdivided populations under

different levels of migration among local demes; all three sampling schemes have corresponding

examples in empirical studies of DNA sequence diversity designed to infer aspects of

demographic history and natural selection from the site frequency spectrum and/or patterns of

LD. Our simulations of subdivided populations of constant total size (Figure 1) have identified a

wide range of migration rates where the pooling effect, as interpreted by ARUNYAWAT et al.

(2007), appears to hold: local samples are characterized by higher values of Tajima’s DT and Fu

and Li’s DFL than pooled samples for Nm (immigrants per generation) > 0.5, and the underlying

genealogies of such samples converge in shape only at very high levels of gene flow (Nm > 25).

With low migration levels (Nm < 0.5), the pattern is reversed in that pooled samples show an

excess of highly positive DT and DFL values. At these low gene flow levels, local samples are

expected to undergo a series of coalescent events within the sampled demes before entering the

collecting phase with a single ancestral line per deme. The resulting long internal branches of

Page 22: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

22

genealogies encompassing lineages from several local demes (i.e., pooled samples) typically

exhibit multiple fixed mutations between different demes, leading to intermediate-frequency

SNPs in pooled samples; PANNELL (2003; his Table 3) has addressed these effects of local vs.

pooled samples in the context of metapopulation dynamics.

That these genealogical effects are mostly in the opposite direction, i.e., that pooling

several local samples is expected to increase the proportion of low-frequency polymorphisms

under intermediate and high levels of migration, has until recently been obscure, even though

simulation results by PANNELL (2003; his Table 3) indicated such an effect. Analyzing

previously published human data sets encompassing a variety of sampling schemes, PTAK and

PRZEWORSKI (2002) identified a pattern of increasing skew toward low-frequency mutations (i.e.,

negative DT) with more geographically heterogeneous sampling (equivalent to pooling across

demes), and concluded that signatures of population expansion may be confounded with effects

of population subdivision. Our coalescent simulations of equilibrium island/stepping-stone

models under a reasonable range of migration rates show that pooling does not lead to negative

DT values, and thus do not support this particular interpretation of a ‘confounding’ effect of

subdivision.

On the contrary, pooled samples (albeit to a lesser extent than local samples) still exhibit

the signatures of local (scattering-phase) coalescent events (i.e., lower proportion of external

branch length of the sample genealogies), compared to scattered samples whose genealogies

should be roughly equivalent in structure to those from panmictic populations (but see below).

This ‘intermediate’ frequency spectrum of pooled samples (between the ‘extremes’ of local and

scattered samples, respectively) is what ARUNYAWAT et al. (2007) predicted based on their

genealogical interpretation of the pooling effect. For the case of non-equilibrium populations,

Page 23: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

23

one caveat that ought to be mentioned is that signatures of expansion will be seen in scattered

samples under low levels of gene flow even without any increase in the total number of

individuals (i.e., β ≤ 1). The reason is that at time τ, the establishment of subdivision increases

the effective population size by a factor of 1 + (1/4N0m) (WAKELEY 2000). Consequently, an

apparent ‘expansion’ may be seen under strong subdivision (e.g., 4N0m < 2); the increasingly

negative values for the scattered samples shown in Figure 2 with decreasing migration rate

appear to reflect this phenomenon.

Site frequency spectra and sampling in non-equilibrium subdivided populations: We

found effects of the sampling scheme on sample genealogies up to very high migration rates,

especially under strong species-wide expansions (Figures 2, 3). In other words, even under high

migration rates (e.g., Nm = 25), local samples may not adequately reflect the species-wide

demography. The relevance of our simulation findings is essentially an empirical issue that

should be judged by its predictive power and relevant data. Our simulations predict a pooling

effect over a wide range of migration rates, implying one way to test the null hypothesis of

panmixia, i.e., that the genealogies of local samples are indistinguishable from those of scattered

samples. Very few published studies have performed separate analyses of local population

samples and the pooled sample consisting of several local samples, but many studies can, in

principle, be used to test our predictions because several local samples have been included

(Drosophila: BAUDRY et al. 2004, 2006; NOLTE and SCHLÖTTERER 2008; humans: MARTH et al.

2004; VOIGHT et al. 2005; KEINAN et al. 2007; GARRIGAN et al. 2007; plants: HEUERTZ et al.

2006; PYHÄJÄRVI et al. 2007).

POOL and AQUADRO (2006) demonstrated the pooling effect in sub-Saharan Drosophila

melanogaster, as have several studies in plants (INGVARSSON 2005; ARUNYAWAT et al. 2007;

Page 24: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

24

MOELLER et al. 2007) and humans (e.g., PTAK and PRZEWORSKI 2002; HAMMER et al. 2003). In

all these cases, the site frequency spectra of pooled samples were characterized by an excess of

low-frequency variants, resulting in (more) negative DT values. Some of these studies explicitly

considered the possible contribution of purifying selection to these patterns, but concluded that

demographic (or range) expansions were at least partly responsible for the skew in the site

frequency spectra (POOL and AQUADRO 2006; MOELLER et al. 2007; see also NORDBORG et al.

2005). As far as neutral polymorphisms are concerned, our simulation results fully concur with

these interpretations, as only global expansions and not the effects of population subdivision per

se can generate substantially negative DT values (Figures 1–3). As many other researchers before

us, we have investigated the properties of general models of population structure in the context

of sampling schemes and their effects on sample genealogies. In contrast, HAMMER et al. (2003)

presented coalescent simulations of a non-standard ‘continuous-splitting’ model of human

population structure (where demes do not exchange migrants after being isolated from other

demes), and concluded that site frequency spectra of human data sets may be explained without

invoking global expansion.

Levels of gene flow and within-species panmixia: It may seem surprising to observe the

pooling effect even in highly mobile species such as Drosophila melanogaster (POOL and

AQUADRO 2006), thus providing clear-cut evidence for deviations from panmixia at seemingly

low levels of population subdivision (e.g., as quantified by FST). The impression of low levels of

population substructure may, at least in part, be a consequence of the FST estimator most widely

used for DNA sequence data, FST = 1 – πw/πb, which treats each SNP as a separate locus

(HUDSON et al. 1992). In empirical as well as simulated data sets, many segregating sites that

appear as singletons in local population samples remain singletons in pooled samples. These sites

Page 25: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

25

do not contribute much to nucleotide diversity in either local or total samples, compared to SNPs

with intermediate frequencies, but they fuel a more pronounced increase in the number of

segregating sites in pooled samples (i.e., WATTERSON’s [1975] estimator θW increases much

more than π; see also RAY et al. 2003). These underlying features explain the effect of pooling

under apparently low levels of population subdivision. Moreover, they also highlight the inherent

dangers of equating low FST estimates with the (near) absence of population subdivision.

In the framework of the infinite-alleles model, JOST (2008) recently argued that FST (and

GST) are not at all appropriate measures of differentiation and pinpointed mathematical

misconceptions underlying the standard approach. In particular, he identified the classical

additive partitioning of total heterozygosity (HT, the heterozygosity of the pooled

subpopulations) into mean within-subpopulation heterozygosity (HS) and a between-

subpopulation component (HT – HS = DST; NEI 1973) as erroneous, because HS and DST are not

truly independent but rather related through an incomplete partitioning (JOST 2008). Because

classical results concerning the absolute number of migrants per generation required to prevent

significant differentiation among local demes in the finite-island and stepping-stone models (e.g.,

Nm > 1, equation (2) above; e.g., WRIGHT 1951; MARUYAMA 1971) are based on the

interpretation of FST as a measure of differentiation, JOST (2008) concluded that such ‘rules of

thumb’ must be considered invalid. Although JOST (2008) did not formally evaluate the infinite-

sites model of sequence evolution, this conclusion goes hand in hand with our interpretations in

the preceding paragraph. Importantly, the marked effects of sampling scheme on the site

frequency spectrum in our simulations despite high levels of gene flow are not restricted to

expanding populations but were found also for subdivided populations at demographic

equilibrium (e.g., Nm > 10; Figures 1, 2). Similar results were previously obtained by DE and

Page 26: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

26

DURRETT (2007) for equilibrium populations and, implicitly, by RAY et al. (2003) for expanding

populations.

Empirical considerations and implications for statistical inference: Large-scale

population-genetic data sets are commonly evaluated in the framework of panmictic populations

undergoing temporal changes in population size (humans: e.g., MARTH et al. 2004; VOIGHT et al.

2005; GARRIGAN et al. 2007; Drosophila: e.g., OMETTO et al. 2005; LI and STEPHAN 2006;

THORNTON and ANDOLFATTO 2006; plants: HEUERTZ et al. 2006; PYHÄJÄRVI et al. 2007). In

conjunction with the empirical data discussed above, our results suggest that a more appropriate

approach would acknowledge the subdivided nature of these species explicitly, which would

require paying particular attention to sampling schemes and their genealogical consequences.

Especially if empirical sampling is from a local population (as has often been the case), it is

inappropriate to compare patterns of diversity with expectations under the classical NE

conditions embodied in commonly used tests of ‘neutrality’. Generally, local population samples

should not be modeled as coming from a panmictic (e.g., continental) population but rather as

being drawn from one of many demes comprising the total metapopulation.

Put another way, the traditional almost exclusive focus on the temporal trajectory of

population-size changes that best explains properties of local/regional samples ought to be

questioned, especially in systems with moderate-to-high levels of gene flow where sample

genealogies are not regionally monophyletic, but rather are deeply embedded in species-wide

sample genealogies (WAKELEY 2001; WAKELEY and ALIACAR 2001). In such cases, the

properties of local samples (e.g., the number of segregating sites, nucleotide diversity, the site

frequency spectrum, and the extent and decay of LD) arguably reflect the level of connectivity

with (degree of isolation from) other demes during the scattering phase of the coalescent process

Page 27: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

27

(WAKELEY 2001; RAY et al. 2003, WAKELEY and LESSARD 2003; DE and DURRETT 2007). Our

simulation scheme has been simplistic in assuming equal migration rates and deme sizes through

time and across the entire metapopulation, but it is straightforward to explain empirical patterns,

such as those found for various human population samples, in terms of differences in local

population size and immigration rates during the recent past (WAKELEY 2001; RAY et al. 2003;

EXCOFFIER 2004), arguably mediated by environmental heterogeneity (WEGMANN et al. 2006).

Guided by notions of the special importance of population structure and the potential for

local adaptation in plants, several recent studies have emphasized the need to include genuine

local population samples in population-genetic studies, rather than relying exclusively on

scattered samples (WRIGHT and GAUT 2005; MOELLER et al. 2007; ROSS-IBARRA et al. 2008).

While we agree that sampling locally has particular merit for studying signatures of local

adaptation (see ARUNYAWAT et al. 2007), our present results caution against analyzing such local

samples naively, i.e., in the conventional framework of panmictic populations. In principle,

sampling several local demes offers the opportunity to empirically contrast the properties of local

samples (e.g., site frequency spectrum, decay of LD) with those of the pooled sample, analogous

to two of our three simulation sampling schemes. At the very least, this exercise has the potential

to infer general features of the species-wide demographic history despite the biases inherent in

local samples. It would also (at least partially) mitigate against inferring spurious signatures of

natural selection from levels and patterns of nucleotide polymorphism, or from the extent of

haplotype structure (for examples, see WAKELEY and LESSARD 2003; DE and DURRETT 2007).

However, only the genealogical structure of scattered samples is comparable to that of an elusive

panmictic population that has experienced the same temporal demographic history (e.g.,

expansion), whereas such samples preclude finding molecular evidence of local adaptation.

Page 28: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

28

We thank Tanja Pyhäjärvi for information on her multideme data set on Pinus sylvestris

and Aurelien Tellier for valuable discussions on metapopulation dynamics and the coalescent in

subdivided populations. Two anonymous referees provided constructive criticism that helped to

improve the final version of this paper. Furthermore, we thank Dick Hudson for allowing us to

redistribute a modified version of his coalescent simulation program ms (HUDSON 2002). This

work was funded by the Deutsche Forschungsgemeinschaft through its Priority Program

“Radiations—Origins of Biological Diversity” (SPP-1127), grant Ste 325/5-3 to W.S., and

Research Unit “Natural Selection in Structured Populations” (FOR-1078), grant Ste 325/12-1 to

W.S. and grant Pf 672/1-1 to P.P.; P.P. acknowledges additional support by the German Federal

Ministry of Education and Research (BMBF) through the Freiburg Initiative for Systems

Biology, grant 0313921.

Page 29: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

29

LITERATURE CITED

ADAMS, A. M., and R. R. HUDSON, 2004 Maximum-likelihood estimation of demographic

parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms.

Genetics 168: 1699–1712.

ALONSO, S., and J. A. L. ARMOUR, 2001 A highly variable segment of human subterminal 16p

reveals a history of population growth for modern humans outside Africa. Proc. Natl.

Acad. Sci. USA 98: 864–869.

ARUNYAWAT, U., W. STEPHAN and T. STÄDLER, 2007 Using multilocus sequence data to assess

population structure, natural selection, and linkage disequilibrium in wild tomatoes. Mol.

Biol. Evol. 24: 2310–2322.

BAUDRY, E., B. VIGINIER and M. VEUILLE, 2004 Non-African populations of Drosophila

melanogaster have a unique origin. Mol. Biol. Evol. 21: 1482–1491.

BAUDRY, E., N. DEROME, M. HUET and M. VEUILLE, 2006 Contrasted polymorphism patterns in

a large sample of populations from the evolutionary genetics model Drosophila simulans.

Genetics 173: 759–767.

DE, A., and R. DURRETT, 2007 Stepping-stone spatial structure causes slow decay of linkage

disequilibrium and shifts the site frequency spectrum. Genetics 176: 969–981.

EXCOFFIER, L., 2004 Patterns of DNA sequence diversity and genetic structure after a range

expansion: lessons from the infinite-island model. Mol. Ecol. 13: 853–864.

FAY, J. C., and C.-I WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155:

1405–1413.

FU, Y.-X., and W.-H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–

709.

Page 30: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

30

GARRIGAN, D., S. B. KINGAN, M. M. PILKINGTON, J. A. WILDER, M. P. COX et al., 2007

Inferring human population sizes, divergence times and rates of gene flow from

mitochondrial, X and Y chromosome resequencing data. Genetics 177: 2195–2207.

HAMMER, M. F., F. BLACKMER, D. GARRIGAN, M. W. NACHMAN and J. A. WILDER, 2003

Human population structure and its effects on sampling Y chromosome sequence variation.

Genetics 164: 1495–1509.

HEUERTZ, M., E. DE PAOLI, T. KÄLLMAN, H. LARSSON, I. JURMAN et al., 2006 Multilocus

patterns of nucleotide diversity, linkage disequilibrium and demographic history of

Norway Spruce [Picea abies (L.) Karst]. Genetics 174: 2095–2105.

HEY, J., and R. NIELSEN, 2004 Multilocus methods for estimating population sizes, migration

rates and divergence time, with applications to the divergence of Drosophila

pseudoobscura and D. persimilis. Genetics 167: 747–760.

HUDSON, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic

variation. Bioinformatics 18: 337–338.

HUDSON, R. R., M. SLATKIN and W. P. MADDISON, 1992 Estimation of levels of gene flow from

DNA sequence data. Genetics 132: 583–589.

INGVARSSON, P. K., 2005 Nucleotide polymorphism and linkage disequilibrium within and

among natural populations of European aspen (Populus tremula L., Salicaceae). Genetics

169: 945–953.

JOST, L., 2008 GST and its relatives do not measure differentiation. Mol. Ecol. 17: 4015–4026.

KEINAN, A., J. C. MULLIKIN, N. PATTERSON and D. REICH, 2007 Measurement of the human

allele frequency spectrum demonstrates greater genetic drift in East Asians than in

Europeans. Nat. Genet. 39: 1251–1255.

Page 31: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

31

LI, H., and W. STEPHAN, 2006 Inferring the demographic history and rate of adaptive

substitution in Drosophila. PLoS Genet. 2: e166.

MARTH, G. T., E. CZABARKA, J. MURVAI and S. T. SHERRY, 2004 The allele frequency

spectrum in genome-wide human variation data reveals signals of differential demographic

history in three large world populations. Genetics 166: 351–372.

MARUYAMA, T., 1971 Analysis of population structure II. Two-dimensional stepping stone

models of finite length and other geographically structured populations. Ann. Hum. Genet.

Lond. 35: 179–196.

MOELLER, D. A., M. I. TENAILLON and P. TIFFIN, 2007 Population structure and its effects on

patterns of nucleotide polymorphism in teosinte (Zea mays ssp. parviglumis). Genetics

176: 1799–1809.

NEI, M., 1973 Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA

70: 3321–3323.

NIELSEN, R., and J. WAKELEY, 2001 Distinguishing migration from isolation: a Markov chain

Monte Carlo approach. Genetics 158: 885–896.

NOLTE, V., and C. SCHLÖTTERER, 2008 African Drosophila melanogaster and D. simulans

populations have similar levels of sequence variability, suggesting comparable effective

population sizes. Genetics 178: 405–412.

NORDBORG, M., T. T. HU, Y. ISHINO, J. JHAVERI, C. TOOMAJIAN et al., 2005 The pattern of

polymorphism in Arabidopsis thaliana. PLoS Biol. 3: e196.

OMETTO, L., S. GLINKA, D. DE LORENZO and W. STEPHAN, 2005 Inferring the effects of

demography and selection on Drosophila melanogaster populations from a chromosome-

wide scan of DNA variation. Mol. Biol. Evol. 22: 2119–2130.

Page 32: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

32

PANNELL, J. R., 2003 Coalescence in a metapopulation with recurrent local extinction and

recolonization. Evolution 57: 949–961.

POOL, J. E., and C. F. AQUADRO, 2006 History and structure of sub-Saharan populations of

Drosophila melanogaster. Genetics 174: 915–929.

PTAK, S. E., and M. PRZEWORSKI, 2002 Evidence for population growth in humans is

confounded by fine-scale population structure. Trends Genet. 18: 559–563.

PYHÄJÄRVI, T., M. R. GARCÍA-GIL, T. KNÜRR, M. MIKKONEN, W. WACHOWIAK and O.

SAVOLAINEN, 2007 Demographic history has influenced nucleotide diversity in European

Pinus sylvestris populations. Genetics 177: 1713–1724.

RAMOS-ONSINS, S. E., S. MOUSSET, T. MITCHELL-OLDS and W. STEPHAN, 2007 Population

genetic inference using a fixed number of segregating sites: a reassessment. Genet. Res.

89: 231–244.

RAY, N., M. CURRAT and L. EXCOFFIER, 2003 Intra-deme molecular diversity in spatially

expanding populations. Mol. Biol. Evol. 20: 76–86.

ROSELIUS, K., W. STEPHAN and T. STÄDLER, 2005 The relationship of nucleotide

polymorphism, recombination rate and selection in wild tomato species. Genetics 171:

753–763.

ROSS-IBARRA, J., S. I. WRIGHT, J. P. FOXE, A. KAWABE, L. DEROSE-WILSON et al., 2008

Patterns of polymorphism and demographic history in natural populations of Arabidopsis

lyrata. PLoS One 3: e2411.

SCHMID, K. J., S. E. RAMOS-ONSINS, H. RINGYS-BECKSTEIN, B. WEISSHAAR and T. MITCHELL-

OLDS, 2005 A multilocus sequence survey in Arabidopsis thaliana reveals a genome-wide

Page 33: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

33

departure from a neutral model of DNA sequence polymorphism. Genetics 169: 1601–

1615.

SLATKIN, M., 1987 The average number of sites separating DNA sequences drawn from a

subdivided population. Theor. Popul. Biol. 32: 42–49.

STÄDLER, T., K. ROSELIUS and W. STEPHAN, 2005 Genealogical footprints of speciation

processes in wild tomatoes: demography and evidence for historical gene flow. Evolution

59: 1268–1279.

STÄDLER, T., U. ARUNYAWAT and W. STEPHAN, 2008 Population genetics of speciation in two

closely related wild tomatoes (Solanum section Lycopersicon). Genetics 178: 339–350.

STROBECK, C., 1987 Average number of nucleotide differences in a sample from a single

subpopulation: a test for population subdivision. Genetics 117: 149–153.

TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA

polymorphism. Genetics 123: 585–595.

THORNTON, K., and P. ANDOLFATTO, 2006 Approximate Bayesian inference reveals evidence

for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster.

Genetics 172: 1607–1619.

VOIGHT, B. F., A. M. ADAMS, L. A. FRISSE, Y. QIAN, R. R. HUDSON and A. DI RIENZO, 2005

Interrogating multiple aspects of variation in a full resequencing data set to infer human

population size changes. Proc. Natl. Acad. Sci. USA 102: 18508–18513.

WAKELEY, J., 1999 Nonequilibrium migration in human history. Genetics 153: 1863–1871.

WAKELEY, J., 2000 The effects of subdivision on the genetic divergence of populations and

species. Evolution 54: 1092–1101.

WAKELEY, J., 2001 The coalescent in an island model of population subdivision with variation

Page 34: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

34

among demes. Theor. Popul. Biol. 59: 133–144.

WAKELEY, J., 2004 Metapopulations and coalescent theory, pp. 175–198 in Ecology, Genetics,

and Evolution of Metapopulations, edited by I. HANSKI and O. GAGGIOTTI. Elsevier,

Oxford, UK.

WAKELEY, J., and N. ALIACAR, 2001 Gene genealogies in a metapopulation. Genetics 159: 893–

905.

WAKELEY, J., and J. HEY, 1997 Estimating ancestral population parameters. Genetics 145: 847–

855.

WAKELEY, J., and S. LESSARD, 2003 Theory of the effects of population structure and sampling

on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 164:

1043–1053.

WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without

recombination. Theor. Popul. Biol. 7: 256–276.

WEGMANN, D., M. CURRAT and L. EXCOFFIER, 2006 Molecular diversity after a range

expansion in heterogeneous environments. Genetics 174: 2009–2020.

WHITLOCK, M. C., and D. E. MCCAULEY, 1999 Indirect measures of gene flow and migration:

FST doesn’t equal 1/(4Nm + 1). Heredity 82: 117–125.

WILKINSON-HERBOTS, H. M., 2008 The distribution of the coalescent time and the number of

pairwise nucleotide differences in the “isolation with migration” model. Theor. Popul.

Biol. 73: 277–288.

WRIGHT, S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323–354.

WRIGHT, S. I., and B. S. GAUT, 2005 Molelcular population genetics and the search for adaptive

evolution in plants. Mol. Biol. Evol. 22: 506–519.

Page 35: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

35

TABLE 1

Tajima’s D and Fu and Li’s D values in pooled versus population-specific tomato samples

S. peruvianum S. chilense

Locus Mean of 4 pops. Pooled Mean of 4 pops. Pooled

CT066 –0.61

–0.48

–0.68

0.05

1.07

0.64

–0.29

–1.55

CT093 –0.19

–0.34

–1.79

–3.25

0.03

0.00

–0.82

–1.85

CT166 –0.63

–0.85

–1.85

–2.71

0.12

0.36

0.04

–0.27

CT179 –0.01

–0.09

–1.07

–1.40

0.64

0.49

–0.69

–1.53

CT198 –0.14

–0.01

–0.95

–0.26

0.20

0.38

–1.10

–0.05

CT208 –0.60

–1.30

–1.19

–1.64

CT251 0.00

0.02

–0.89

–1.01

0.71

0.82

0.04

–0.51

CT268 0.21

0.37

–0.66

–0.81

0.61

1.14

0.70

0.47

Page 36: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

36

Average –0.25

–0.34

–1.14

–1.38

0.48

0.55

–0.30

–0.76

For each locus, the first row shows values of Tajima’s DT, while the second row shows

values of Fu and Li’s DFL. Data were taken from ARUNYAWAT et al. (2007), but here all

nonsynonymous SNPs and those segregating more than two nucleotides (i.e., showing evidence

of recurrent mutations) were eliminated before estimating DT and DFL. The ‘pooled’ columns

show values based on the combined (total) sample within each species, treated as a single entity.

Note that almost all loci individually exhibit lower/more negative DT and DFL values in the

pooled samples. Locus CT208 was not included for S. chilense because patterns of

polymorphism were suggestive of natural selection (ARUNYAWAT et al. 2007).

Page 37: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

37

FIGURE CAPTIONS

FIGURE 1.—Averages of Tajima’s DT (A) and Fu and Li’s DFL (B) under three sampling

schemes as a function of levels of gene flow. We simulated an equilibrium stepping-stone model

with I = 100 islands; the simulations were carried out without recombination. Every plotted point

is based on 1000 independently generated data sets, and standard errors of the means are

indicated by vertical lines.

FIGURE 2.—Averages of Tajima’s DT (A) and Fu and Li’s DFL (B) under species-wide

expansion (β = 10, τ = 2) as a function of migration rates between demes. We simulated a

stepping-stone model with 100 demes without recombination. Every plotted point is based on

1000 independently generated data sets; error bars represent standard errors.

FIGURE 3.—Averages of Tajima’s DT (A) and Fu and Li’s DFL (B) as functions of the

expansion factor β (with fixed migration rate and time of expansion), with the power of these

statistics evaluated in the top part of each plot (right y-axis; power was assessed at a level of P =

0.05). The same simulation scheme was used as for Figure 2.

FIGURE 4.—The power of Tajima’s DT (A) and Fu and Li’s DFL (B) as a function of the

expansion time τ (with fixed migration rate and expansion factor). The same simulation scheme

was used as for Figure 2; power was assessed at a level of P = 0.05.

Page 38: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

38

FIGURE 5.—FST as a function of 4N0m. (A) τ = 8N0 generations is fixed and β is varying;

(B) β = 10 is fixed and τ is varying. The same simulations as in Figure 2 were used (i.e.,

stepping-stone model with 100 demes).

FIGURE 6.—Contour plots showing average values of DT obtained in simulations designed

to mimic the empirical sampling of ARUNYAWAT et al. (2007). (A) Single-deme samples (n = 11

sequences); (B) pooled samples (n = 44 sequences). For each of the 1000 simulations for a given

combination of expansion factor (β) and time since expansion (τ), 11 sequences were drawn

from each of four randomly chosen demes, and the 44 sequences were also evaluated as a pooled

sample (B). We simulated 400 demes under stepping-stone structure (θ [per deme] = 0.25, 4N0m

= 5). The empirical average DT values for S. peruvianum are –0.25 (local) and –1.14 (pooled),

while they are 0.48 (local) and –0.30 (pooled) for S. chilense; see Table 1.

Page 39: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

39

Figure 1: A = left, B = right

Page 40: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

40

Figure 2: A = left; B = right

Page 41: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

41

Figure 3: A = left; B = right

Page 42: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

42

Figure 4: A = left; B = right

Page 43: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

43

Figure 5: A = left; B = right

Page 44: The impact of sampling schemes on the site frequency ... impact of sampling schemes on the site frequency spectrum in non-equilibrium subdivided populations ... pooled samples. However,

44

Figure 6: A = left; B = right