phages manuscript hhmi (1)


Post on 17-Jul-2015






  • Dramatic variation in phage genome structures revealed by whole genome comparisons

    Welkin Pope1, Charles Bowman1, SEA-PHAGES2, PHIRE3, K-RITH MGC4, Deborah Jacobs-Sera1, Daniel A. Russell1, Steven Cresawn5, William R. Jacobs Jr.6, Jeffrey G. Lawrence1,

    Roger W. Hendrix1, and Graham F. Hatfull1*.

    1Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260

    2Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science

    3Phage Hunters Integrating Research and Education

    4KwaZulu-Natal Institute for TB and HIV research Mycobacterial Genetics Course

    5Department of Biology, James Madison University, Harrisonburg, VA

    6Department of Microbiology and Immunology, Albert Einstein College of Medicine, NY

    *Corresponding Author


    Bacteriophages are the dark matter of the biological universe1, forming a vast, dynamic,

    old, and genetically diverse population2. Horizontal exchange generates pervasive

    genome mosaicism, with different genome segments having distinct evolutionary

    histories3. Phages of phylogenetically distant hosts typically share low nucleic acid

    sequence similarity, and few share genes with amino acid sequence similarity2. Phages

    of a single common host can also span considerable sequence diversity even though

    they are in direct genetic contact1. Comparative genomics of a large collection of phages

    isolated on Mycobacterium smegmatis provides insights into the size and diversity of

    groups of related phages and the extent to which the groups are discrete and genetically

    isolated from other phages. We show that both the diversity and genetic isolation of

    phage groups vary enormously. Some are discrete and share few genes with other

    phages, whereas others are genetically connected to many other phages. The phage

    population thus spans a continuum of relationships, but with phages of different types

    varying enormously in prevalence. The reticulate relationships resulting from pervasively

    mosaic architectures confound hierarchical taxonomic phage classification or

    application of simple numerical values to distinguish among phage genomic types.

    Bacteriophages are the most abundant organisms in the biosphere, and the ~10^31 tailed phage

    particles participate in ~10^23 infections per second on a global scale, with the entire population

    turning over every few days4. Virion structures suggest the population is also extremely old5 and

    thus the great genetic diversity of phages is not surprising2. Phages likely evolved with common

    ancestry and access to a large common gene pool3, although rates of horizontal exchange are

    heterogeneous, being influenced by host range, varying phage migration rates across the

    microbial landscape, and lifestyle (temperate or virulent)6. Multiple processes determine this,

    including local host diversity and mutation rates, as well as resistance mechanisms such as

    receptor availability, restriction, CRISPRs, and abortive infection systems6,7. Constraints on


    gene acquisition may also be imposed by synteny, particularly among virion structural genes,

    and by size limits of DNA packaging2,8.

    Genomic comparison of phages infecting a common host provides insights into evolutionary

    mechanisms and the structure of their genetic diversity9. Relatively small numbers of phage

    genomes have been sequenced for hosts such as Escherichia coli, Salmonella,

    Staphylococcus, Pseudomonas, and Propionibacterium10-13, revealing varying degrees of genetic

    diversity. Mycobacteriophages isolated from environmental samples using Mycobacterium

    smegmatis mc²155 as a host are architecturally mosaic1 and span considerable diversity, but

    can be grouped into clusters of related phages that share little or no nucleotide sequence

    similarity with other phages1,14-18. Some clusters are heterogeneous and can be readily divided

    into subclusters by their nucleotide similarities. Recent analysis of phages adsorbed to

    Synechococcus revealed 26 discrete populations, although they were obtained from a single

    sample and are predominantly morphologically myoviral (T4-like)9. However, these populations

    likely represent only a small portion of Synechococcus phages because the genomes of 17 fully

    sequenced phages infecting Synechococcus or closely related hosts fail to associate with these

    populations9. These populations may thus reflect sampling bias of the single environment

    examined, and extensive genomic mosaicism found in phages of Synechococcus and other

    hosts1,3,19 warrants caution in extrapolation of the concept of discrete phage populations in the

    absence of complete genome sequences.

    The Howard Hughes Medical Institute (HHMI) Science Education Alliance Phage Hunters

    Advancing Genomics and Evolutionary Science (SEA-PHAGES) program has facilitated

    expansion of the number of sequenced mycobacteriophage genomes to 627 (Table S1) by

    engaging large numbers of undergraduates in phage discovery and genomics20. The size of this

    collection now provides sufficient resolution to offer insights into the diversity and genetic


    isolation of phage genome types. Here we address the question of whether the groups of

    related phages represent primarily discrete populations or genetically intermixed groups.

    Although the collection excludes viruses that do not form plaques under laboratory conditions, the

    phages were isolated from widely dispersed geographical locations, including nine countries

    and 36 of the continental United States (Fig. S1), over a dozen or more years. All are dsDNA

    tailed phages (Caudovirales), and are morphologically siphoviral, except for the Cluster C myoviruses.

    Most have isometric heads except for singleton MooMoo and the Cluster I and O phages, which

    have prolate heads21.

    Using previously reported parameters15, the 627 genomes were assembled into 20 clusters (A-T)

    and 8 singletons (with no close relatives), with large variations in cluster sizes (Table 1, Fig.

    S2); 11 clusters can be subdivided into 2 to 11 subclusters (Table 1). Clustered phages typically

    share genome architectures; for example, Cluster A phages are similar in size and transcriptional

    organization, and share an unusual immunity system16,22. A different set of clustering

    parameters would generate different profiles, but not alter the core observation that there are

    large variations among the different phage types. Cluster designation is simple for some phage

    types because of extensive nucleotide similarity (e.g. Cluster C; Fig. S2), and if all clusters

    resembled Cluster C, our data would be congruent with the Synechococcus populations9. But

    many do not, revealing more complex relationships.

    To compare mycobacteriophage gene contents we grouped related genes into phamilies using

    Phamerator23, modified to use kclust24. The 69,633 genes assembled into 5,205 phams of which

    1,613 (31%) are orphams14 (single-gene phamilies), and the gene content relationships are

    represented as a network phylogeny in Fig. 1. In general, branch lengths provide strong support

    for cluster and subcluster designations (Table 1, Fig. S2); the proportions of orphams per

    genome provide additional support, which, as expected, are highest for singletons and

    single-genome subclusters (Fig. S3). Determination of the proportions of shared genes by pairwise

    comparisons reveals the complexity of the genetic relationships (Fig. 2), and three major

    features are apparent.
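
As a rough illustration of the gene content comparison described above, the following Python sketch identifies orphams and computes a pairwise shared-gene proportion from pham assignments. The data layout (a dict mapping genome names to lists of pham identifiers, one per gene), the function names, and the toy genomes are assumptions made for illustration. This is not Phamerator's interface, and the symmetric averaging used for the pairwise value is one plausible convention, not necessarily the one used for Fig. 2.

```python
from collections import Counter
from itertools import combinations


def find_orphams(genome_phams):
    """Return phams that contain exactly one gene across all genomes."""
    gene_counts = Counter(
        pham for phams in genome_phams.values() for pham in phams
    )
    return {pham for pham, n_genes in gene_counts.items() if n_genes == 1}


def shared_gene_fraction(phams_a, phams_b):
    """Average, over both genomes, of the fraction of phams found in the other.

    Assumption: symmetric averaging of the two per-genome fractions.
    """
    set_a, set_b = set(phams_a), set(phams_b)
    if not set_a or not set_b:
        raise ValueError("each genome needs at least one pham")
    shared = len(set_a & set_b)
    return 0.5 * (shared / len(set_a) + shared / len(set_b))


# Hypothetical toy genomes: one pham identifier per gene.
toy_genomes = {
    "phage1": ["p1", "p2", "p3", "p4"],
    "phage2": ["p1", "p2", "p5"],
    "phage3": ["p6", "p7"],
}

orphams = find_orphams(toy_genomes)
pairwise = {
    (a, b): shared_gene_fraction(toy_genomes[a], toy_genomes[b])
    for a, b in combinations(sorted(toy_genomes), 2)
}
```

In this toy set, phage1 and phage2 share two phams, while phage3 shares none with either and so consists entirely of orphams, as would a singleton with no close relatives.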

    First, the overall phage relationships closely mirror the cluster and subcluster designations

    derived by DNA similarities (Fig. S2). Secondly, the intra-cluster and intra-subcluster diversity

    varies enormously, and this is quantified as the Cluster Cohesion Index (CCI, average number

    of genes/genome divided by the total number of phamilies in the cluster; Table 1, Fig. 3). Thus

    in clusters such as Cluster A (CCI, 0.08), the total number of phamilies is vastly greater than the

    average number of genes per genome, indicating high diversity. The diversity of the A

    subclusters is also highly varied with CCI values ranging from 0.22 to 0.91 (Table S1). In

    contrast, Clusters G and O have low diversity (high CCI values) and closely related genomes

    (Table 1; Fig. 3).
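
The CCI definition given above (average number of genes per genome divided by the total number of phamilies in the cluster) can be sketched in Python as follows. The input layout, the function name, and the example clusters are hypothetical and chosen only to show how the ratio behaves at its extremes.

```python
def cluster_cohesion_index(cluster_genomes):
    """Average genes per genome divided by distinct phams in the cluster.

    cluster_genomes: dict mapping genome name -> list of pham identifiers,
    one entry per gene (an assumed layout, for illustration only).
    """
    if not cluster_genomes:
        raise ValueError("cluster has no genomes")
    mean_genes = (
        sum(len(phams) for phams in cluster_genomes.values())
        / len(cluster_genomes)
    )
    total_phams = len(set().union(*cluster_genomes.values()))
    if total_phams == 0:
        raise ValueError("cluster has no phams")
    return mean_genes / total_phams


# Hypothetical clusters: identical genomes versus genomes sharing no phams.
identical = {"g1": ["p1", "p2", "p3"], "g2": ["p1", "p2", "p3"]}
disjoint = {"g1": ["p1", "p2", "p3"], "g2": ["p4", "p5", "p6"]}

cci_identical = cluster_cohesion_index(identical)
cci_disjoint = cluster_cohesion_index(disjoint)
```

Under these assumptions, a cluster of identical genomes gives a CCI of 1, and the value falls as the members contribute more phamilies that the others lack, which matches the reading of low CCI as high diversity.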

    Thirdly, the degree to which clusters are genetically connected to other phages varies greatly,

    and is quantified as the Cluster Isolation Index (CII, the percentage of phamilies not present in

    genomes outside of the cluster; Table 1, Fig. 3). Some clusters such as Clusters A, B, C, and Q

    share relatively few genes (60% of their genes with other phages (Table 1),

    reflecting the DNA relationships (Fig. S4). There are therefore no universally applicable values

    of either diversity or isolation for different phage groups, and the most striking picture emerging

    is one of great diversity with unequal representation of different types (Fig. 3). This is in marked

    contrast to the discrete populations reported for Synechococcus phages9.
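
The CII definition given above (the percentage of a cluster's phamilies not present in genomes outside the cluster) can be sketched in the same way. The function name, data layout, and toy genomes are assumptions for illustration, not the authors' implementation.

```python
def cluster_isolation_index(genome_phams, cluster_members):
    """Percentage of a cluster's phams that occur in no genome outside it.

    genome_phams: dict mapping genome name -> iterable of pham identifiers.
    cluster_members: names of the genomes assigned to the cluster.
    (Both layouts are assumed here for illustration.)
    """
    members = set(cluster_members)
    inside, outside = set(), set()
    for name, phams in genome_phams.items():
        (inside if name in members else outside).update(phams)
    if not inside:
        raise ValueError("cluster has no phams")
    return 100.0 * len(inside - outside) / len(inside)


# Hypothetical toy data: cluster {g1, g2} shares pham p4 with outsider g3.
toy_genomes = {
    "g1": ["p1", "p2", "p3"],
    "g2": ["p1", "p4"],
    "g3": ["p4", "p5"],
}

cii = cluster_isolation_index(toy_genomes, ["g1", "g2"])
```

Here three of the cluster's four phamilies are absent from the outside genome, so the toy cluster has a CII of 75%.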

    These comparisons reveal additional complexities arising from highly mosaic genomes (Figs.

    S5-S8). For example, Dori is clearly related to Cluster B phages (Fig. 1), with which it shares

    20-26% of its genes and limited DNA similarity (Fig. S5), but also has nucleotide similarity and

    shares genes with Cluste