Evidence of selection on genomic GC content in bacteria
Falk Hildebrand, Adam Eyre-Walker
Explanations
• Mutation bias
  • Sueoka (1961) & Freese (1962)
  • Intrinsic and/or extrinsic
• Selection
  • Many authors
• Biased gene conversion
  • Anonymous referees
Correlates
• Genome size
  • positive correlation
• Lifestyle
  • higher GC in free living
• Aerobiosis
  • higher GC in aerobic
• Nitrogen utilization
  • higher GC amongst N fixers
• Temperature
  • higher GC amongst thermophiles?
Evidence of selection I
• Escherichia coli
  • Mutation pattern
  • 273 GC→AT versus 131 AT→GC
• Predicted GC content = 0.32
• Observed GC content = 0.50
• Observed GC at neutral sites = 0.58
Lynch (2007) The Origins of Genome Architecture
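The predicted value can be recovered from the two mutation counts if they are taken as proportional to the per-site mutation rates. That proportionality is an assumption here; it is reasonable when GC and AT sites are about equally common, as the observed GC content of 0.50 suggests. A worked version of the arithmetic:

```latex
\mathrm{GC}_{pred}
  = \frac{\#(\mathrm{AT}\to\mathrm{GC})}{\#(\mathrm{AT}\to\mathrm{GC}) + \#(\mathrm{GC}\to\mathrm{AT})}
  = \frac{131}{131 + 273}
  = \frac{131}{404}
  \approx 0.32
```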
Evidence of selection II
• Phylogenetic analyses
  • Mycobacterium leprae (Lynch 2007)
  • Escherichia coli (Balbi et al. 2009)
  • 5 pathogenic bacteria (Hershberg and Petrov 2010)
• Excess of GC→AT
Test of mutation bias
• If GC content is
  • Due to mutation bias alone
  • Stationary
  • And the infinite sites assumption holds
• Then
  • # GC→AT mutations = # AT→GC mutations
Why?
• If GC stationary
• # GC→AT substitutions = # AT→GC substitutions
• All neutral mutations have same chance of fixation
• # GC→AT mutations = # AT→GC mutations
Identifying mutations
Strain 1  ACT GCT TTG GCT TTA TGG
Strain 2  ACT GCT TTG GCT TTA TGA
Strain 3  ACT GCT TTG GCT TTA TGG
Strain 4  ACT GCT TTC GCT TTA TGA
Strain 5  ACC GCT TTC GCT TTA TGG
Strain 6  ACT GCT TTG GCT TTA TGG
Variable sites: T/C (site 3), C/G (site 9), G/A (site 18)
Orienting mutations
Outgroup  ACT GCT TTC GCT TTA TGG
Strain 1  ACT GCT TTG GCT TTA TGG
Strain 2  ACT GCT TTG GCT TTA TGA
Strain 3  ACT GCT TTG GCT TTA TGG
Strain 4  ACT GCT TTC GCT TTA TGA
Strain 5  ACC GCT TTC GCT TTA TGG
Strain 6  ACT GCT TTG GCT TTA TGG
Oriented changes: T→C (site 3), C→G (site 9), G→A (site 18)
GC→AT = 1
AT→GC = 1
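The orientation step with an outgroup can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the outgroup base is taken as ancestral, sites where the outgroup base is absent from the sample are skipped, and the function names `classify` and `orient_with_outgroup` are invented for this example.

```python
GC_BASES = set("GC")
AT_BASES = set("AT")


def classify(ancestral, derived):
    """Label a change by its effect on GC content."""
    if ancestral in GC_BASES and derived in AT_BASES:
        return "GC->AT"
    if ancestral in AT_BASES and derived in GC_BASES:
        return "AT->GC"
    return "other"  # e.g. C->G or A->T, which leave GC content unchanged


def orient_with_outgroup(outgroup, strains):
    """Count oriented changes at polymorphic sites, outgroup base = ancestral."""
    counts = {"GC->AT": 0, "AT->GC": 0, "other": 0}
    for site, ancestral in enumerate(outgroup):
        alleles = {strain[site] for strain in strains}
        if len(alleles) < 2 or ancestral not in alleles:
            continue  # monomorphic, or cannot be oriented (assumed rule)
        for derived in alleles - {ancestral}:
            counts[classify(ancestral, derived)] += 1
    return counts


# Sequences from the slide, with the spaces between codons removed.
OUTGROUP = "ACTGCTTTCGCTTTATGG"
STRAINS = [
    "ACTGCTTTGGCTTTATGG",
    "ACTGCTTTGGCTTTATGA",
    "ACTGCTTTGGCTTTATGG",
    "ACTGCTTTCGCTTTATGA",
    "ACCGCTTTCGCTTTATGG",
    "ACTGCTTTGGCTTTATGG",
]

if __name__ == "__main__":
    print(orient_with_outgroup(OUTGROUP, STRAINS))
```

On the slide's alignment this yields one GC→AT change, one AT→GC change, and one change (C→G) that does not alter GC content.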
Orienting mutations
Strain 1  ACT GCT TTG GCT TTA TGG
Strain 2  ACT GCT TTG GCT TTA TGA
Strain 3  ACT GCT TTG GCT TTA TGG
Strain 4  ACT GCT TTC GCT TTA TGA
Strain 5  ACC GCT TTC GCT TTA TGG
Strain 6  ACT GCT TTG GCT TTA TGG
Oriented changes: T→C (site 3), G→C (site 9), G→A (site 18)
GC→AT = 1
AT→GC = 1
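Without an outgroup, the changes shown on this slide are consistent with treating the more common allele at each site as ancestral. That rule is an interpretation of the slide rather than something it states. A minimal Python sketch under that assumption (the names `classify` and `orient_by_frequency` are hypothetical; sites with more than two alleles or a tied frequency are skipped):

```python
from collections import Counter

GC_BASES = set("GC")
AT_BASES = set("AT")


def classify(ancestral, derived):
    """Label a change by its effect on GC content."""
    if ancestral in GC_BASES and derived in AT_BASES:
        return "GC->AT"
    if ancestral in AT_BASES and derived in GC_BASES:
        return "AT->GC"
    return "other"


def orient_by_frequency(strains):
    """Count changes at biallelic sites, taking the major allele as ancestral."""
    counts = Counter()
    for column in zip(*strains):
        freq = Counter(column).most_common()
        if len(freq) != 2:
            continue  # monomorphic or more than two alleles
        (major, n_major), (minor, n_minor) = freq
        if n_major == n_minor:
            continue  # tie: no basis for orientation under this rule
        counts[classify(major, minor)] += 1
    return counts


# Sequences from the slide, with the spaces between codons removed.
STRAINS = [
    "ACTGCTTTGGCTTTATGG",
    "ACTGCTTTGGCTTTATGA",
    "ACTGCTTTGGCTTTATGG",
    "ACTGCTTTCGCTTTATGA",
    "ACCGCTTTCGCTTTATGG",
    "ACTGCTTTGGCTTTATGG",
]

if __name__ == "__main__":
    print(dict(orient_by_frequency(STRAINS)))
```

Site 9 is now read as G→C rather than C→G, but the GC→AT and AT→GC tallies are unchanged at one each.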
Test of mutation bias
• If GC content is
  • Due to mutation bias alone
  • Stationary
  • And the infinite sites assumption holds
• Then
  • # GC→AT = # AT→GC
Data
• Popset
  • Keyword “bacteria”
  • 8 or more sequences from same species
• 149 bacterial species
  • 8 phyla, 15 classes and 77 genera
  • 1 or more genes
  • 10 or more synonymous polymorphisms
  • 4-fold diversity < 0.1
Phylogenetic distribution
Phylum          Class                 No. of species  GC4 range    Mean Z (GC4<0.34)  Mean Z (GC4>0.34)
Actinobacteria  Actinobacteria        3               0.64-0.93    no species         0.64
Bacteroidetes   Bacteroidetes         3               0.12-0.46    0.43               0.36
Chlamydiae+     Chlamydiae            2               0.21-0.30    0.45               no species
Cyanobacteria   Chroococcales         2               0.38-0.51    no species         0.53
Cyanobacteria   Nostocales            3               0.26-0.31    0.45               no species
Cyanobacteria   Oscillatoriales       2               0.41         no species         0.38
Cyanobacteria   Stigonemales          1               0.40         no species         0.59
Firmicutes      Bacilli               27              0.085-0.68   0.44               0.58
Firmicutes      Clostridia            5               0.050-0.28   0.34               no species
Proteobacteria  Alphaproteobacteria   16              0.099-0.94   0.43               0.65
Proteobacteria  Betaproteobacteria    6               0.66-0.96    no species         0.67
Proteobacteria  delta/epsilon         6               0.15-0.99    0.49               0.78
Proteobacteria  Gammaproteobacteria   62              0.095-0.95   0.50               0.66
Spirochaetes    Spirochaetes          7               0.12-0.60    0.45               0.54
Tenericutes     Mollicutes            4               0.023-0.24   0.33               no species
Infinite sites assumption
• If GC content stationary
• # GC→AT substitutions = # AT→GC substitutions
• All neutral mutations have same chance of fixation
• # GC→AT mutations = # AT→GC mutations
Finite sites assumption
• If GC content stationary
• # GC→AT substitutions = # AT→GC substitutions
• All neutral mutations have same chance of fixation
• # GC→AT mutations = # AT→GC mutations
• But some mutations are not evident as polymorphisms
Finite sites
• GC rich sequence
• Implies
  • rate of AT→GC > rate of GC→AT
• Mutation rate low
  • # AT→GC polymorphisms = # GC→AT polymorphisms
• Mutation rate high
  • # AT→GC polymorphisms < # GC→AT polymorphisms
Finite sites theory
GC ⇄ AT (GC→AT at rate uμ, AT→GC at rate vμ)

$$H(x) = \frac{J(x)}{\int_0^1 J(x)\,dx}$$

$$J(x) = x^{V-1}(1-x)^{U-1}$$

$$U = 2N_e\mu(1-f), \qquad V = 2N_e\mu f, \qquad f = \frac{v}{u+v}$$

Assume: stationary population, stationary GC
Finite sites theory
$$H(x) = \frac{J(x)}{\int_0^1 J(x)\,dx}$$

$$G(n,i) = \int_0^1 H(x)\,Q(n,i,x)\,dx$$

$$Q(n,i,x) = \frac{n!}{i!\,(n-i)!}\,x^{i}(1-x)^{n-i}$$
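Because J(x) has the form of an unnormalised beta density and Q(n,i,x) is a binomial probability, the integral for G(n,i) has a closed form (the beta-binomial distribution). The step is sketched below under the neutral J(x) given earlier; B denotes the beta function, a symbol that does not appear on the slides.

```latex
\int_0^1 J(x)\,dx = B(V, U) = \frac{\Gamma(V)\,\Gamma(U)}{\Gamma(U+V)}

G(n,i) = \binom{n}{i}\,\frac{1}{B(V,U)} \int_0^1 x^{\,i+V-1}(1-x)^{\,n-i+U-1}\,dx
       = \binom{n}{i}\,\frac{B(i+V,\; n-i+U)}{B(V,U)}
```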
Predicting Z
• Assume
  • finite sites
  • neutrality
• Use GC4 to get f
• Use observed diversity to estimate μ
• Predict Z
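The prediction step can be sketched in Python using the beta-binomial form of G(n,i). Several things here are assumptions, not statements from the slides: x is read as the frequency of the GC allele, Z is taken to be the share of polymorphic sites at which GC is the majority allele (sites that would be scored as GC→AT if the common allele is treated as ancestral), sites with a 50:50 split are ignored, and `theta` stands for 2Neμ supplied directly, without modelling how it is estimated from diversity.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def site_frequency_spectrum(n, theta, f):
    """G(n, i) for i = 0..n GC alleles in a sample of n, under neutrality.

    theta stands for 2*Ne*mu (an input here, not estimated from data);
    f is the equilibrium GC content v / (u + v).
    """
    V = theta * f          # mutation pressure towards GC
    U = theta * (1.0 - f)  # mutation pressure towards AT
    spectrum = []
    for i in range(n + 1):
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        log_g = log_choose + log_beta(i + V, n - i + U) - log_beta(V, U)
        spectrum.append(exp(log_g))
    return spectrum


def predicted_z(n, theta, f):
    """Assumed definition of Z: fraction of polymorphic sites with GC in the majority."""
    g = site_frequency_spectrum(n, theta, f)
    gc_major = sum(g[i] for i in range(1, n) if 2 * i > n)
    at_major = sum(g[i] for i in range(1, n) if 2 * i < n)
    return gc_major / (gc_major + at_major)


if __name__ == "__main__":
    # Illustrative parameter values only, for a GC-rich sequence (f = 0.8).
    for theta in (0.001, 0.1, 1.0):
        print(theta, predicted_z(10, theta, 0.8))
```

With a very small `theta` the predicted value stays close to 0.5, and it rises above 0.5 as `theta` grows in a GC-rich sequence, which is the qualitative behaviour described for low and high mutation rates under finite sites.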
Mutation rate variation
$$H(x) = \frac{J(x)}{\int_0^1 J(x)\,dx}$$

$$G(n,i) = \int_0^1 H(x)\,Q(n,i,x)\,dx$$

$$Q(n,i,x) = \frac{n!}{i!\,(n-i)!}\,x^{i}(1-x)^{n-i}$$
Z-Zpred (exponential rates)
         No. of species  Z-Zpred > 0  P-value
GC-rich  82              56           0.0012
GC-poor  67              46           0.003
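The slides do not name the test behind these P-values. One natural candidate is a two-sided sign test, which asks how often a count at least this lopsided would arise if Z-Zpred were equally likely to be positive or negative in each species. The sketch below is written under that assumption; the function name `sign_test_two_sided` is invented for this example.

```python
from math import comb


def sign_test_two_sided(successes, n):
    """Exact two-sided binomial test against p = 0.5.

    successes: number of species with Z - Zpred > 0
    n:         total number of species
    """
    k = max(successes, n - successes)  # size of the larger group
    upper_tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


if __name__ == "__main__":
    print("GC-rich:", sign_test_two_sided(56, 82))
    print("GC-poor:", sign_test_two_sided(46, 67))
```

The printed values can be compared against the table; agreement would support, but not prove, that a sign test was used.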
Explanations
• Non-stationary base composition
• Selection for translational efficiency
• Biased gene conversion
• Selection upon base composition
Non-stationary GC content
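$$\mathrm{GC}_{pred} = \frac{\dfrac{\mathrm{AT}\to\mathrm{GC}}{(1-\mathrm{GC})}}{\dfrac{\mathrm{AT}\to\mathrm{GC}}{(1-\mathrm{GC})} + \dfrac{\mathrm{GC}\to\mathrm{AT}}{\mathrm{GC}}}$$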
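The formula divides each mutation count by the proportion of sites at which that kind of mutation can occur, then takes the AT→GC share of the resulting per-site rates. A minimal Python sketch follows; the function name `predicted_gc` and the counts in the usage example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predicted_gc(at_to_gc, gc_to_at, gc):
    """Equilibrium GC content implied by oriented mutation counts.

    at_to_gc: number of AT->GC changes observed
    gc_to_at: number of GC->AT changes observed
    gc:       current GC content of the sites surveyed (between 0 and 1)
    """
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must lie strictly between 0 and 1")
    rate_at_to_gc = at_to_gc / (1.0 - gc)  # per AT site
    rate_gc_to_at = gc_to_at / gc          # per GC site
    return rate_at_to_gc / (rate_at_to_gc + rate_gc_to_at)


if __name__ == "__main__":
    # Hypothetical counts: three times as many GC->AT as AT->GC changes,
    # in sequences that are 75% GC.
    print(predicted_gc(at_to_gc=100, gc_to_at=300, gc=0.75))  # prints 0.5
```

In this made-up case the raw counts are unequal, yet the predicted GC content is 0.5 and not the current 0.75, so the sequence would be inferred to be losing GC. Equal raw counts are what stationarity predicts.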
Selection on codon usage
Amino acid     Codon  High usage  Low usage
Phenylalanine  UUU    0.22        0.71
               UUC    0.78        0.29
Valine         GUU    0.46        0.36
               GUC    0.09        0.19
               GUA    0.24        0.23
               GUG    0.21        0.23
Biased gene conversion
         No. of species  Z > 0.5  P-value
GC-rich  28              19       0.087

             GC→AT  AT→GC  P-value
No. of SNPs  1079   844    <0.0001
Biased gene conversion
         Correlation with r/m  P-value
GC4      -0.076                0.67
Z        0.003                 0.99
Z-Zpred  0.026                 0.88
GC4pred  -0.115                0.52
34 species with an estimate of r/m
Vos & Didelot (2009) ISME J.
Biased gene conversion
         Correlation with θ·r/m  P-value
GC4      0.039                   0.83
Z        0.11                    0.55
Z-Zpred  0.18                    0.30
GC4pred  -0.031                  0.86

$$\pi_s \frac{r}{m} = 2N_e u \, \frac{r}{m} = 2N_e r k$$
Selection on GC content
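GC (+s) ⇄ AT (-s) (GC→AT at rate uμ, AT→GC at rate vμ)

$$H(x) = \frac{J(x)}{\int_0^1 J(x)\,dx}$$

$$J(x) = e^{Sx}\,x^{V-1}(1-x)^{U-1}$$

$$S = 2N_e s, \qquad U = 2N_e\mu(1-f), \qquad V = 2N_e\mu f, \qquad f = \frac{v}{u+v}$$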
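To see how the selection term changes the expected skew, the integrals involving J(x) can be evaluated by expanding e^{Sx} as a power series, which turns each one into a sum of beta functions. The sketch below rests on several assumptions that are not on the slides: x is the frequency of the GC allele, positive S favours GC, Z is the share of polymorphic sites at which GC is the majority allele, and `theta` stands for 2Neμ. The function names are invented for this example.

```python
from math import comb, exp, lgamma


def beta_fn(a, b):
    """Beta function B(a, b)."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))


def tilted_beta_integral(a, b, S, tol=1e-14, max_terms=1000):
    """Integral over (0, 1) of exp(S*x) * x**(a-1) * (1-x)**(b-1).

    Term k of the series is S**k / k! * B(a + k, b); successive terms obey
    term[k+1] = term[k] * S / (k + 1) * (a + k) / (a + b + k).
    """
    term = beta_fn(a, b)
    total = term
    for k in range(max_terms):
        term *= S / (k + 1) * (a + k) / (a + b + k)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def predicted_z(n, theta, f, S=0.0):
    """Assumed definition of Z: fraction of polymorphic sites with GC in the majority."""
    V = theta * f
    U = theta * (1.0 - f)
    # The normalising constant of H(x) cancels in the ratio below.
    weight = [comb(n, i) * tilted_beta_integral(i + V, n - i + U, S)
              for i in range(n + 1)]
    gc_major = sum(weight[i] for i in range(1, n) if 2 * i > n)
    at_major = sum(weight[i] for i in range(1, n) if 2 * i < n)
    return gc_major / (gc_major + at_major)


if __name__ == "__main__":
    # Illustrative values only: no mutation bias (f = 0.5), varying selection.
    for S in (-2.0, 0.0, 2.0):
        print(S, predicted_z(10, 0.1, 0.5, S))
```

With no mutation bias and S = 0 the prediction is 0.5 by symmetry; positive S pushes it above 0.5, the direction of the excess reported in the talk.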
Summary
• Large excess of GC→AT mutations at 4-fold sites
  • Particularly in GC-rich species
• Not due to
  • Failure of the infinite sites assumption
  • Sequencing error
  • Translational selection
  • Biased gene conversion
• Therefore
  • Selection on GC4
Correlates
• Genome size
  • positive correlation
• Lifestyle
  • higher GC in free living
• Aerobiosis
  • higher GC in aerobic
• Nitrogen utilization
  • higher GC amongst N fixers
• Temperature
  • higher GC amongst thermophiles?
Further reading
• Hildebrand et al. (2010) PLoS Genetics
• Hershberg and Petrov (2010) PLoS Genetics
• Rocha and Feil (2010) PLoS Genetics