Post on 17-Jul-2015
Dramatic variation in phage genome structures revealed by whole genome comparisons
Welkin Pope1, Charles Bowman1, SEA-PHAGES2, PHIRE3, K-RITH MGC4, Deborah Jacobs-Sera1, Daniel A. Russell1, Steven Cresawn5, William R. Jacobs Jr.6, Jeffrey G. Lawrence1,
Roger W. Hendrix1, and Graham F. Hatfull1*.
1Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260
2Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science
3Phage Hunters Integrating Research and Education
4KwaZulu-Natal Institute for TB and HIV research Mycobacterial Genetics Course
5Department of Biology, James Madison University, Harrisonburg, VA
6Department of Microbiology and Immunology, Albert Einstein College of Medicine, NY
Bacteriophages are the dark matter of the biological universe1, forming a vast, dynamic,
old, and genetically diverse population2. Horizontal exchange generates pervasive
genome mosaicism, with different genome segments having distinct evolutionary
histories3. Phages of phylogenetically distant hosts typically share low nucleic acid
sequence similarity, and few share genes with amino acid sequence similarity2. Phages
of a single common host can also span considerable sequence diversity even though
they are in direct genetic contact1. Comparative genomics of a large collection of phages
isolated on Mycobacterium smegmatis provides insights into the size and diversity of
groups of related phages and the extent to which the groups are discrete and genetically
isolated from other phages. We show that both the diversity and genetic isolation of
phage groups vary enormously. Some are discrete and share few genes with other
phages, whereas others are genetically connected to many other phages. The phage
population thus spans a continuum of relationships, but with phages of different types
varying enormously in prevalence. The reticulate relationships resulting from pervasively
mosaic architectures confound hierarchical taxonomic phage classification or
application of simple numerical values to distinguish among phage genomic types.
Bacteriophages are the most abundant organisms in the biosphere, and the ~10^31 tailed phage
particles participate in ~10^23 infections per second on a global scale, with the entire population
turning over every few days4. Virion structures suggest the population is also extremely old5 and
thus the great genetic diversity of phages is not surprising2. Phages likely evolved with common
ancestry and access to a large common gene pool3, although rates of horizontal exchange are
heterogeneous, being influenced by host range, varying phage migration rates across the
microbial landscape, and lifestyle (temperate or virulent)6. Multiple processes determine this,
including local host diversity and mutation rates, as well as resistance mechanisms such as
receptor availability, restriction, CRISPRs, and abortive infection systems6,7. Constraints on
gene acquisition may also be imposed by synteny, particularly among virion structural genes,
and by size limits of DNA packaging2,8.
Genomic comparison of phages infecting a common host provides insights into evolutionary
mechanisms and the structure of their genetic diversity9. Relatively small numbers of phage
genomes have been sequenced for hosts such as Escherichia coli, Salmonella,
Staphylococcus, Pseudomonas, and Propionibacterium10-13 revealing varying degrees of genetic
diversity. Mycobacteriophages isolated from environmental samples using Mycobacterium
smegmatis mc²155 as a host are architecturally mosaic1 and span considerable diversity, but
can be grouped into clusters of related phages that share little or no nucleotide sequence
similarity with other phages1,14-18. Some clusters are heterogeneous and can be readily divided
into subclusters by their nucleotide similarities. Recent analysis of phages adsorbed to
Synechococcus revealed 26 discrete populations, although they were obtained from a single
sample and are predominantly morphologically myoviral (T4-like)9. However, these populations
likely represent only a small portion of Synechococcus phages because the genomes of 17 fully
sequenced phages infecting Synechococcus or closely related hosts fail to associate with these
populations9. These populations may thus reflect sampling bias of the single environment
examined, and extensive genomic mosaicism found in phages of Synechococcus and other
hosts1,3,19 warrants caution in extrapolation of the concept of discrete phage populations in the
absence of complete genome sequences.
The Howard Hughes Medical Institute (HHMI) Science Education Alliance Phage Hunters
Advancing Genomics and Evolutionary Science (SEA-PHAGES) program has facilitated
expansion of the number of sequenced mycobacteriophage genomes to 627 (Table S1) by
engaging large numbers of undergraduates in phage discovery and genomics20. The size of this
collection now provides sufficient resolution to offer insights into the diversity and genetic
isolation of phage genome types. Here we address the question of whether the groups of
related phages represent primarily discrete populations or genetically intermixed groups.
Although the collection excludes viruses that do not form plaques under laboratory conditions, the
phages were isolated from widely dispersed geographical locations, including nine countries
and 36 of the continental United States (Fig. S1), over a dozen or more years. All are dsDNA
tailed phages (Caudovirales), and are morphologically siphoviral, except cluster C myoviruses.
Most have isometric heads except for singleton MooMoo and the Cluster I and O phages, which
have prolate heads21.
Using previously reported parameters15, the 627 genomes were assembled into 20 clusters (A-T)
and 8 singletons (with no close relatives) with large variations in cluster sizes (Table 1, Fig.
S2); 11 clusters can be subdivided into 2 to 11 subclusters (Table 1). Clustered phages typically
share genome architectures; for example, Cluster A phages are similar in size and transcriptional
organization, and share an unusual immunity system16,22. A different set of clustering
parameters would generate different profiles, but not alter the core observation that there are
large variations among the different phage types. Cluster designation is simple for some phage
types because of extensive nucleotide similarity (e.g. Cluster C; Fig. S2), and if all clusters
resembled Cluster C, our data would be congruent with the Synechococcus populations9. But
many do not, revealing more complex relationships.
To compare mycobacteriophage gene contents we grouped related genes into phamilies using
Phamerator23, modified to use kclust24. The 69,633 genes assembled into 5,205 phams of which
1,613 (31%) are orphams14 (single-gene phamilies), and the gene content relationships are
represented as a network phylogeny in Fig. 1. In general, branch lengths provide strong support
for cluster and subcluster designations (Table 1, Fig. S2); the proportions of orphams per
genome provide additional support and, as expected, are highest for singletons and single-genome
subclusters (Fig. S3). Determination of the proportions of shared genes by pairwise
comparisons reveals the complexity of the genetic relationships (Fig. 2), and three major
features are apparent.
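The bookkeeping behind orphams and pairwise gene sharing can be illustrated with a minimal Python sketch. It is not the Phamerator or kclust implementation: the phage names and pham numbers are invented toy data, and normalizing shared phams by the mean pham count of the two genomes is an assumption made only for illustration.

```python
from collections import Counter

# Hypothetical toy data: each genome is the list of pham IDs assigned to its genes.
genomes = {
    "PhageA": [1, 2, 3, 4],
    "PhageB": [1, 2, 3, 5],
    "PhageC": [6, 7, 8],
}


def orphams(genomes):
    """Return the phams that contain exactly one gene across all genomes."""
    gene_counts = Counter(p for phams in genomes.values() for p in phams)
    return {p for p, n in gene_counts.items() if n == 1}


def shared_fraction(a, b):
    """Shared phams divided by the mean pham count of the two genomes.

    The normalization is an assumption for illustration only.
    """
    sa, sb = set(a), set(b)
    return len(sa & sb) / ((len(sa) + len(sb)) / 2)


all_phams = {p for phams in genomes.values() for p in phams}
print(len(orphams(genomes)) / len(all_phams))                  # 0.625
print(shared_fraction(genomes["PhageA"], genomes["PhageB"]))   # 0.75
```

In this toy set, five of the eight phams are orphams, and PhageC shares no phams with the other two genomes, which is the pattern expected of a singleton.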
First, the overall phage relationships closely mirror the cluster and subcluster designations
derived by DNA similarities (Fig. S2). Secondly, the intra-cluster and intra-subcluster diversity
varies enormously, and this is quantified as the Cluster Cohesion Index (CCI, average number
of genes/genome divided by the total number of phamilies in the cluster; Table 1, Fig. 3). Thus
in clusters such as Cluster A (CCI, 0.08), the total number of phamilies is vastly greater than the
average number of genes per genome, indicating high diversity. The diversity of the A
subclusters is also highly varied with CCI values ranging from 0.22 to 0.91 (Table S1). In
contrast, Clusters G and O have low diversity (high CCI values) and closely related genomes
(Table 1; Fig. 3).
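As a worked illustration of the CCI definition given above (average number of genes per genome divided by the total number of phamilies in the cluster), the following Python sketch computes the index for two invented toy clusters. The function name and data are hypothetical, and genomes are again represented as lists of pham IDs, one per gene.

```python
def cluster_cohesion_index(cluster_genomes):
    """Mean genes per genome divided by the number of distinct phams in the cluster."""
    gene_counts = [len(phams) for phams in cluster_genomes.values()]
    mean_genes = sum(gene_counts) / len(gene_counts)
    total_phams = len(set().union(*cluster_genomes.values()))
    return mean_genes / total_phams


# Hypothetical toy clusters (pham IDs are invented).
cohesive = {"P1": [1, 2, 3, 4], "P2": [1, 2, 3, 4]}   # identical gene content
diverse = {"P3": [1, 2, 3, 4], "P4": [5, 6, 7, 8]}    # no shared phams

print(cluster_cohesion_index(cohesive))  # 1.0
print(cluster_cohesion_index(diverse))   # 0.5
```

A cluster of genomes with identical gene content scores 1.0, and the value falls as more genomes contribute phams not found in the others, consistent with low CCI values indicating high diversity.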
Thirdly, the degree to which clusters are genetically connected to other phages varies greatly,
and is quantified as the Cluster Isolation Index (CII, the percentage of phamilies not present in
genomes outside of the cluster; Table 1, Fig. 3). Some clusters such as Clusters A, B, C, and Q
share relatively few genes (60% of their genes with other phages (Table 1),
reflecting the DNA relationships (Fig. S4). There are therefore no universally applicable values
of either diversity or isolation for different phage groups, and the most striking picture emerging
is one of great diversity with unequal representation of different types (Fig. 3). This is in marked
contrast to the discrete populations reported for Synechococcus phages9.
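The CII definition given above (the percentage of phamilies not present in genomes outside of the cluster) can be sketched in the same way. This is an illustrative Python example only: the function name, phage names, and pham IDs are hypothetical and do not come from the dataset.

```python
def cluster_isolation_index(cluster_genomes, other_genomes):
    """Percentage of the cluster's phams that are absent from all genomes outside it."""
    inside = set().union(*cluster_genomes.values())
    outside = set().union(*other_genomes.values())
    return 100.0 * len(inside - outside) / len(inside)


# Hypothetical toy data (pham IDs are invented).
cluster = {"P1": [1, 2, 3, 4], "P2": [1, 2, 3, 5]}
others = {"P3": [4, 5, 9]}

print(cluster_isolation_index(cluster, others))  # 60.0
```

Here three of the five phams in the cluster occur nowhere else, giving a CII of 60%. A fully isolated cluster would score 100%, and a cluster whose phams all occur in other phages would score 0%.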
These comparisons reveal additional complexities arising from highly mosaic genomes (Figs.
S5-S8). For example, Dori is clearly related to Cluster B phages (Fig. 1) with which it shares 20-
26% of its genes and limited DNA similarity (Fig. S5), but also has nucleotide similarity and
shares genes with Cluste