Inter-species interactions in microbial communities
Citation: Hsu, Tiffany Yeong-Ting. 2018. Inter-species interactions in microbial communities. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Permanent link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42015251
Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Inter-species interactions in microbial communities
A dissertation presented
by
Tiffany Yeong-Ting Hsu
to
The Division of Medical Sciences
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Biological and Biomedical Sciences
Harvard University
Cambridge, Massachusetts
October 2017
© 2017 Tiffany Yeong-Ting Hsu
All rights reserved.
Dissertation Advisor: Professor Curtis Huttenhower
Tiffany Yeong-Ting Hsu
Inter-species interactions in microbial communities
Abstract
Microorganisms are omnipresent and exist as communities within and around the
human body. These communities, regardless of location, may cause disease: dysbioses within
the gut microbiota are associated with obesity and inflammatory bowel disease, while
differences in immune development and environmental exposures are linked to atopy and
diabetes. It is thus crucial to characterize microbial communities and their interactions to better
understand how they are formed, maintained, and manipulated. To better understand the
ecology of communities on and around the human body, my work has explored lateral gene
transfer (LGT) within human-associated microbial communities and the transfer of microbes
between the human body and environmental surfaces.
I developed WAAFLE (a Workflow to Annotate Assemblies and Find LGT Events), the first
method for detecting de novo LGT events from metagenomes. I applied
WAAFLE to the Human Microbiome Project: LGT frequencies were highest in the gut and oral
sites, and lowest in the vaginal and skin microbiomes. High frequency pairs corresponded with
increased taxon abundances and close phylogenetic distances. Taxa found in multiple LGT pairs
had strong partner preferences, and several had biases in transfer directionality. Enriched
functions in LGT contigs included transposases, phage, and TonB membrane receptors. Taxa in
high frequency LGT pairs may preferentially use LGT as a tool to maintain or change their
community status.
I examined cross-talk between human-associated and built-environment microbial
communities in heavily trafficked environments, specifically the Boston subway. These areas
may facilitate microbial transmission and are ripe for public health interventions such as
sanitation or architecture. We used 16S rRNA gene and metagenomic shotgun sequencing to
profile microbes on multiple surface types in trains along the red, green, and orange lines, as
well as ticketing machines at four train stations. Community structure was dictated by surface
type, rather than train line. Common taxa included human skin and oral commensals such as
Propionibacterium, Corynebacterium, Staphylococcus, and Streptococcus. Enriched functions were
often from Propionibacterium acnes pathways, and few antibiotic resistance genes were observed.
Overall, microbial communities on the Boston subway are likely derived from the rider
population and influenced by rider interactions and environmental biochemistry.
Table of Contents
Abstract
Table of Contents
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
Chapter 1: Introduction
  Copyright Disclosure
  Overview
  The significance of lateral gene transfer
    Mechanisms and discovery of lateral gene transfer
    Problems with the prokaryotic “species concept”
    Methods for identifying species and LGT
    LGT in microbial communities
    Transferred functions and their associated costs
    Evolutionary legacy of LGT
  Surveying microbial communities in the built-environment
    Microbial composition of the built-environment
    Applications for the built-environment
    Technical considerations for sampling the built-environment
  The role of DNA sequencing for microbial profiling
    Amplicon Sequencing
    WMS Sequencing
    Contig Assembly
  Summary
Chapter 2: Lateral Gene Transfer in the Human Microbiome
  Attributions
  Introduction
  Results
    Identifying recent LGT events from metagenomic shotgun sequencing
    WAAFLE performance on synthetic data
    Rates of novel LGT events across the human microbiome
    LGT frequency and pair formation are shaped by abundance and phylogeny
    Genera have preferred transfer partners that are shared across similar sites
    Mobile elements and TonB receptors are enriched in LGT contigs
  Discussion
  Methods
Chapter 3: Urban transit system microbial communities differ by surface type and interaction with humans and environment
  Copyright Disclosure
  Attributions
  Abstract
  Importance
  Introduction
  Results
    Sampling microbial communities on the Boston transit system
    Microbial communities are specific to surface types and immediate environment
    Subway microbial communities are largely derived from human skin and oral commensal microbes
    Propionibacterium phages and the yeast Malassezia globosa dominate the non-bacterial microbial community
    All surface types are dominated by skin microbes, with smaller proportions of oral, gut, and environmental taxa across seats and touchscreens
    Metagenomes reflect dominance of Propionibacterium acnes across subway surfaces
    Minimal pathogenic and antibiotic resistance presence on the Boston transit system
  Discussion
  Materials and Methods
  Acknowledgements
Chapter 4: Conclusions
Appendix I: Supplemental Materials for Chapter 2
Appendix II: Supplemental Materials for Chapter 3
References
Acknowledgements
I came to Harvard determined to learn “computational biology”. Considering that my
laboratory experience far exceeded my programming experience (5 years versus 10 weeks), I
must first thank Dr. Curtis Huttenhower for taking a chance on me. In my first email to him, I
wrote:
“…I am interested in learning how to analyze large datasets and make some sense out of them. I
feel that it is no longer sufficient to look at just a few key genes - especially when there are now ways to
profile entire genomics, transcriptomes, and proteomes - though all the associations found will still have
to be validated molecularly. Still, I think it's exciting that there is a chance to look at the entire network
and see how it works.
I was wondering if you took rotation students - or knew of anyone who might train a student to
do dry work - since I have a wet lab background. I was also wondering what your opinion was on how
much of an "omics" understanding a scientist might need.”
What I have learned during my time in the lab has completely exceeded those expectations. The
Huttenhower Lab is a rare place that does not distinguish their bioinformaticians from their
experimentalists. Every member is free to learn both, and they often do, through the process of
helping each other out. Curtis was also willing to help me take on projects I was initially
unqualified for, such as WAAFLE, which was born out of my qualifying exams.
Second, I must thank both past and present members of the Huttenhower Lab. Curtis
has assembled a wonderful team of people. To each of you, I would like to say: “You have
qualities that I strive to emulate, and skills and knowledge that I still hope to learn some day.” I
specifically want to thank two people, Dr. Eric Franzosa and Dr. Regina Joice.
Eric was my mentor throughout my PhD; without him I would not have graduated. The
beginning of my PhD was difficult, because the way computational biologists thought and the
terms they used were alien to me. It was not always clear what analyses were being suggested
or why, and how to carry them out. Eric always took the time to explain these analyses, by
breaking down the underlying assumptions and hypotheses. When I had trouble turning those
analyses into code, he would show me his code and introduce me to new syntax. Eric was often
the first to review my grants and paper drafts: I learned a lot about writing from his revisions.
Towards the end of my PhD, when I had trouble mentoring and tutoring students, it was again
Eric that I turned to for advice. I hope I will become an equally skilled and kind scientist as I
move through my career.
Regina was my mentor throughout the MBTA project. Since she had a wet lab
background, she could anticipate my confusion and would help me if she knew the answer, or
help me rephrase the question so someone else could. When I got lost in the computational
aspects of my work, she would always steer me back to the biological question we were asking.
She also freely shared advice when I asked for it: I still remember sidling up and saying,
“Regina, I have a science/graduate school/life question, would you have time to talk later?”
Third, I want to thank my scientific colleagues outside the lab, including Dr. Morgan
Langille and Dr. Robert Beiko, the WAAFLE co-authors; Dr. Georgina Hold, for involving me in
her comparative genomics project; Dr. Wendy Garrett, who gave me access to her laboratory
when we didn’t have the right equipment; and Dr. Eric Rubin, Dr. Michael Springer, and Dr.
Colleen Cavanaugh, my dissertation advisory committee; and Dr. Ting-Ting Wu, my
undergraduate research mentor. To Morgan, Rob, and my advisory committee, I have always
enjoyed and appreciated your feedback on my projects. I have heard horror stories about
collaborators and committees: all five of you were truly a pleasure to work with, and even took
time to meet with me one-on-one, whether it was for advice, beer, or while driving me to see
Bonnie Bassler. To Ting, despite all your cautionary advice, I still went to graduate school!
Without you, I would never have experienced scientific research, and I hope we stay
friends and colleagues for the years to come.
Fourth, I must thank the administrative staff, including Nicole Levesque, the
Biostatistics Department program coordinator, who magically scheduled me into Curtis’s
schedule over the past five years; as well as Kate Hodgins, Anne O’Shea, Danny Gonzalez, and
Maria Bollinger, the present and former BBS program administrators, who have always swiftly
responded to questions about Harvard and graduate school.
Lastly, I want to thank my family and friends. Both my mother, Lichuan Hsu, and
brother, Eric Hsu, have always been there to support me. They have heard more than their fair
share of gripes and complaints along the way. My father, Che-Chang Hsu, is no longer here, but
I believe he would be proud of my work. As an electrical engineer, he was extremely excited
when I told him I was going to learn Python and described Curtis’s work. I am glad he was able
to see me start my bioinformatics journey. My partner, Wesley Hong, always gives me new
perspectives to consider, and is there to remind me that graduate school is not everything, but a
small step towards our aspirations. To my friends, I will remember the late night problem sets,
races and shopping trips, and surprise birthday parties: it is you who have made my time here
in Boston/Cambridge all the merrier.
List of Figures
Figure 2-1. WAAFLE pipeline overview.
Figure 2-2. WAAFLE parameter evaluation.
Figure 2-3. LGT rates are highest for oral and stool sites.
Figure 2-4. Both abundance and phylogeny affects LGT rates.
Figure 2-5. Taxa degree and differential edges.
Figure 2-6. Enriched functions show taxon and structural similarities across sites.
Figure 3-1. Collection of samples from MBTA trains and stations.
Figure 3-2. Taxonomic composition of subway microbial communities.
Figure 3-3. Putative MBTA microbial community sources.
Figure 3-4. Trans-domain taxonomic profiles from subway shotgun metagenomes.
Figure 3-5. Enrichment of microbial taxa with respect to metadata using multivariate analyses.
Figure 3-6. Enrichment of KEGG Orthology (KOs) across MBTA surfaces before and after P. acnes removal.
Figure 3-7. Quantification of antibiotic resistance marker and virulence factor abundances on subway surfaces.
Figure I-1. Filtering potential misassemblies.
Figure I-2. Determining which contig types contain misassemblies.
Figure I-3. Gene call evaluation.
Figure I-4. LGT evaluation with or without missing BLAST hits.
Figure I-5. Selection of k1 and k2.
Figure I-6. Comparison of LGT measures.
Figure I-7. Jaccard and Bray-Curtis distances between inter-individual, intra-individual, and technical samples.
Figure I-8. Phylogenetic distances computed from random taxa pairs within body sites.
Figure II-1. Biomass and alpha diversity for train and station samples.
Figure II-2. Ordination of surface data subsets.
Figure II-3. Comparison of antibiotic resistance markers from the ARDB database.
Figure II-4. Letter from the MBTA.
List of Tables
Table I-1. WAAFLE Parameters.
Table II-1. Sample collection and metadata.
Table II-2. 16S and shotgun OTU tables along with taxa present across sequencing plate.
Table II-3. LEfSe and MaAsLin analysis for 16S sequencing.
Table II-4. MaAsLin analysis for shotgun data.
Table II-5. Antibiotic resistance gene and virulence factor markers.
List of Abbreviations
antibiotic resistance (ABR).
antibiotic resistance genes (ARG).
base pair (bp).
biological species concept (BSC).
coding sequence(s) (CDS).
ecological species concept (ESC).
false positive rate (FPR).
gene transfer agent (GTA).
Human Microbiome Project (HMP).
Human Microbiome Project Phase 1-II (HMP 1-II).
interpolated variable order motifs (IVOM).
kilobase (kb).
last universal common ancestor (LUCA).
positive predictive value (PPV).
single nucleotide polymorphisms (SNP).
true positive rate (TPR).
WAAFLE (Workflow to Annotate Assemblies and Find LGT Events).
whole metagenome shotgun (WMS).
Chapter 1:
Introduction
Copyright Disclosure
Portions of this Introduction appear in or are adapted from the following publications:
Franzosa, E.A., T. Hsu, A. Sirota-Madi, A. Shafquat, G. Abu-Ali, X.C. Morgan, C. Huttenhower,
Sequencing and beyond: integrating molecular ‘omics’ for microbial community profiling. Nature
Reviews Microbiology, 2015. 13(6):p. 360-72.
Overview
There are approximately 3.8 × 10^13 bacterial cells in the average 70 kg man, which is
roughly equal to the number of human cells in the body [1]. These bacterial cells are found as
microbial communities [2], and may interface with the immediate environment outside the host
[3]. Within a microbial community, individual taxa may have different phenotypes as compared
to the overall community: some have proposed that an individual microbe may be viewed as a
component cell of a multicellular organism, in which components communicate to coordinate
growth, movement, and biochemical activities in order to efficiently proliferate, access new
resources, and defend against antagonists [4]. It is therefore necessary to study microbial
interactions at the individual and community scale. Furthermore, microbial communities may
influence or be influenced by the surrounding environment. Humans emit a detectable
microbial cloud into the surrounding air [5], and skin microorganisms are influenced by
temperature, moisture, and ultraviolet radiation [6]. Thus, it is important to characterize
microbial interactions within a community, as well as microbial interactions with the
surrounding environment in order to understand community formation, maintenance, and
function.
Microbial profiling began with Anton van Leeuwenhoek, who observed microorganisms
using a self-built microscope and classified them based on morphology [7]. Louis Pasteur and
Robert Koch later popularized the use of what is now considered traditional culture methods to
isolate microbes and observe their phenotypes [8]. However, “The Great Plate Count
Anomaly” showed that the majority of bacteria were not being cultured: Razumov observed
that viable plate counts were much lower than microscopic counts [9-11]. The advent of 16S and
metagenomic shotgun sequencing partially solved this problem by allowing scientists to
identify and classify not-yet-culturable microbes. Coupled with other ‘omics’ data (including
transcriptomics, proteomics, and metabolomics) and appropriate study design, researchers can
begin to better understand microbial interactions at both the individual and community scale,
and across different environments.
In this Introduction, I will first explore lateral gene transfer (LGT), one type of
interaction within microbial communities. Specifically, I will discuss its mechanisms, history,
and roles in the human microbiota. Next, I will delve into the interactions between human-
associated microbial communities and the built-environment, where humans spend the
majority of their time. Finally, I will outline the potential and limitations of DNA sequencing
approaches for profiling microbial communities.
The significance of lateral gene transfer
Mechanisms and discovery of lateral gene transfer
One of the most important types of interactions within microbial communities has
proven to be LGT. LGT occurs when genetic information (or DNA) is passed from a single cell
to a neighboring cell (lateral transmission), rather than from parent to offspring (vertical
transmission). LGT is primarily known to occur through three mechanisms (transformation,
transduction, and conjugation) and via two more recently discovered mechanisms: gene transfer
agents (GTA) and cell fusion [12]. Transformation occurs when a bacterium takes up naked
DNA from the environment and incorporates it into its own genome. Transduction occurs when
a bacteriophage accidentally packages part of the host genome with its own genome, which is
then injected and integrated into the next infected bacterium. Conjugation requires physical
contact between two bacteria, and involves DNA transfer from one bacterium to the other via a
multiprotein apparatus. The different mechanisms of LGT limit both the potential participants
and amount of DNA transferred. For example, transduction restricts LGT partners to those with
the same phage host range, and phage can only package a small quantity of DNA. Lastly, GTA
are DNA elements evolved from prophages; they package small pieces of bacterial DNA in
capsids and transfer them to nearby hosts [13]. Cell fusion is similar to sexual reproduction in
eukaryotes in that microbial cells physically join and may bi-directionally transfer DNA [14].
LGT was initially considered a curiosity, but is now recognized as a potentially strong
evolutionary force in prokaryotes. Assuming one LGT event for every 10^10 vertical replications,
no gene in any modern genome can be linked to the last universal common ancestor (LUCA)
through vertical descent [15]. LGT was first observed in 1928 as transformation: “R” (“rough”,
avirulent) Pneumococcus strains alone could not cause disease in mice, but would kill mice if
mixed with heat-killed “S” (“smooth”, virulent) Pneumococcus strains [16]. In 1944, Avery,
MacLeod, and McCarty determined the agent of this particular phenomenon (conversion of R
strains to S strains) to be DNA [17]. In the 1960s, Japanese researchers found that multi-drug
resistant Escherichia coli could transfer resistance to drug-sensitive Shigella through conjugation
[18-20], elevating LGT to a cause for concern. Finally, in 1999, researchers found that 20-25% of
Aquifex aeolicus and Thermotoga maritima genes were more similar to Archaea than Bacteria [21,
22], indicating that LGT can cross domains in the tree of life.
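The LUCA argument earlier in this section rests on simple arithmetic, which can be sketched as follows. This is only an illustration under stated assumptions: it treats LGT events as independent and occurring at a constant rate of one per 10^10 replications (a Poisson approximation), which may differ from the model used in [15]. The function name `prob_no_lgt` and the replication counts are hypothetical and chosen only to show the trend.

```python
import math


def prob_no_lgt(replications: float, lgt_per_replication: float = 1e-10) -> float:
    """Chance that a gene lineage sees zero LGT events over `replications` copies.

    Assumes independent events at a constant rate (Poisson approximation).
    This is an illustrative simplification, not the model used in [15].
    """
    return math.exp(-lgt_per_replication * replications)


if __name__ == "__main__":
    # Replication counts below are hypothetical, chosen only to show the trend.
    for n in (1e9, 1e10, 1e11, 1e12):
        print(f"{n:.0e} replications: P(no LGT) = {prob_no_lgt(n):.3g}")
```

Under these assumptions, the probability of purely vertical descent falls off exponentially once the number of replications along a lineage exceeds roughly 10^10, which is the intuition behind the claim that no modern gene can be traced to LUCA by vertical descent alone.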
Problems with the prokaryotic “species concept”
Identifying LGT between different species is of particular interest, since these events
may increase the fitness of individual microbes, which in turn may alter microbial communities.
In both macro- and microbiology, species are defined as clusters of similar organisms, though
what drives the separation of these clusters is unclear for microorganisms. Historically, macro-
organisms were delineated based on morphology, while microorganisms were classified based
on metabolic characteristics [23]. The introduction of the “biological species concept” (BSC) by
Ernst Mayr in 1942 attempted to unify existing systematics and the theory of evolution, and
stated that “species are groups of actually or potentially interbreeding natural populations,
which are reproductively isolated from other such groups” [24, 25]. This definition formalized
“species” as a unit of ecology and evolution, and identified “reproductive isolation” as the
driver for species formation.
The BSC did not work well for microorganisms or plants, due to LGT and ability to form
hybrids, respectively. Still, several scientists attempted to apply the BSC to bacteria. Ravin
searched for similarity between “genospecies”, defined as groups of bacteria that could
exchange genes, and “phenospecies”, defined as groups of bacteria that shared metabolic
phenotypes. Unfortunately, the two groups did not correlate well, indicating that genetic
exchange ability does not necessarily correspond to phenotype [26]. Dykhuizen and Green
proposed defining bacterial species as strains that could undergo recombination with each other
but not with other strains [27], which proved to be impractical given the frequency of LGT and
large size of a species’ pan-genome [28].
In 2002, Frederick Cohan argued that ecology was the driver of species clusters in
bacteria (as opposed to reproductive isolation). He proposed defining bacterial species as
“ecotypes”, which are “…set(s) of strains using the same or similar ecological resources, such
that an adaptive mutant from within the ecotype out-competes to extinction all other strains of
the same ecotype; an adaptive mutant does not, however drive to extinction strains from other
ecotypes” [29]. This definition has also been referred to as the “ecological species concept”
(ESC). Cohan’s first model was termed the “stable ecotype model,” which assumed that 1)
microorganisms exist as large populations (10^10 cells) and that 2) population genetic diversity is
largely controlled by periodic selection, in which a single species consistently sweeps the
population [30], rather than genetic drift. He pointed out that the latter was supported by long
term culture experiments, which often gave rise to strains with different phenotypes [31-33].
With this, ecotypes could be detected as sequence clusters due to genome-wide sweeps in
microbial populations.
Recent work has observed that gene-specific sweeps, rather than genome sweeps, occur
in microbial populations [34-36]. However, previous work has shown that the recombination
rate is usually lower than the mutation rate, and thus a gene should not undergo a different rate
of selection as compared to its genome [37]. To reconcile these observations, Cohan proposed
the “Adapt Globally, Act Locally” model, in which multiple ecotypes adopt the same gene
through lateral transfer, but maintain separate evolutionary trajectories [37, 38]. In 2012, Shapiro
et al. expanded upon this theory by characterizing two populations of Vibrio cyclitrophicus, in
which they found that i) SNPs associated with a specific population were constrained to specific
genome regions, and ii) recent recombination was more common within a population than
between them [39]. From this, they proposed that microbes undergo gene transfer, leading to
gene sweeps. Since transferred genes are habitat-specific, gene sweeps prompt populations to
specialize, which in turn decreases gene flow between different populations and leads to the
formation of distinct genomic clusters. Their observations imply that gene-specific sweeps can
lead to the formation of new species. More recent work has focused on characterizing
conditions under which gene-specific sweeps may occur [40], as well as how gene transfer and
genetic drift work together towards speciation [41].
Methods for identifying species and LGT
The BSC and ESC disagree on the force (i.e., reproductive isolation versus ecological
specialization) that drives speciation, but both agree that DNA sequence clusters will
correspond with species. Compositional biases between species were observed as early as
1959, when the buoyant densities of nine different bacterial DNAs were found to be highly correlated with
the molar fraction of guanine and cytosine [42]. Microbial species were originally distinguished
via DNA-DNA hybridization, in which a single-stranded reference DNA and a single-stranded
query DNA are mixed, and the degree of binding between the two molecules is measured [43].
If molecules from the query organism showed ≥70% re-association with the reference DNA
molecules, the query and reference organisms were classified as the same species [44]. With
DNA sequencing, scientists began sequencing cultured isolates. In 1995, the first bacterial
genome, that of Haemophilus influenzae, was sequenced [45]. The reference genome database grew
exponentially: by 2000, 27 microbial genome sequences had been published [46], and by 2005,
220 microbial genomes were sequenced with another 650 in progress [47]. One study utilized
this growing set of reference genomes and showed that 50 kilobase (kb) segments of a
prokaryotic genome are more similar to each other than to other genomes, and reflect species-
specific properties for DNA modification, replication, and repair [48]. Biases in nucleotide
composition between species have since been used for genome and metagenome assembly, as
well as for LGT detection.
The earliest LGT studies observed the transfer of phenotypes (e.g., “R” Pneumococcus
strains becoming virulent, or Shigella acquiring antibiotic resistance), but the majority of new
studies utilize computational methods to detect LGT in sequenced genomes. Computational
methods usually fall into two bins: tree-based and non-tree-based methods. Tree-based
methods involve comparing gene trees to a species tree, in which the species tree is often
constructed from a slow evolving, essential gene such as the 16S rRNA gene or a combination of
housekeeping genes [49, 50]. Each phylogenetic tree reflects the evolutionary history of the
gene(s) used to construct it. Thus, if the evolutionary history of a gene deviates significantly
from that of the species tree, it may be explained by LGT, duplication, gene loss, incomplete
lineage sorting, or homologous recombination [51]. Tree-based methods further enable
inference of directionality and time of transfer. Directionality may be based off the “out-of-
Africa” principle, which assumes that the taxonomic group with the largest representation of
the transferred gene is the donor [52, 53].
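The comparison of a gene tree to a species tree can be illustrated with a minimal sketch. It assumes rooted trees written as nested tuples of leaf names and counts the clades found in one tree but not the other (a rooted variant of the Robinson-Foulds distance). The function names and the toy trees are hypothetical, and a nonzero distance only flags discordance; as noted above, it does not by itself distinguish LGT from duplication, loss, incomplete lineage sorting, or recombination.

```python
def clades(tree):
    """Return (leaves, clades) for a rooted tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        # A single leaf defines no informative clade on its own.
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found


def clade_distance(gene_tree, species_tree):
    """Number of clades present in exactly one of the two trees."""
    _, gene_clades = clades(gene_tree)
    _, species_clades = clades(species_tree)
    return len(gene_clades ^ species_clades)


# Toy example (hypothetical taxa A-D): in the gene tree, A groups with D instead of B.
species_tree = ((("A", "B"), "C"), "D")
concordant_gene_tree = (("C", ("B", "A")), "D")
discordant_gene_tree = ((("A", "D"), "C"), "B")
```

In this toy case the concordant gene tree has distance 0 from the species tree, while the discordant one differs in four clades and would be a candidate for closer inspection.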
Tree-based methods are considered the gold standard, but are more computationally
intensive than non-tree-based methods. Methods that do not require trees can be subdivided
into compositional and gene-based methods. Compositional methods search for changes in GC
content, oligonucleotide frequencies, or even structural features, such as interaction energies
between base pairs or chromatin structure, any of which may have arisen through LGT. In
contrast, gene-based methods look for discrepancies between gene distances and phylogenetic
distances. Approaches for this include i) searching for similar genes between distantly related
species, ii) calculating evolutionary rates for homologous genes and identifying those (potential
xenologs) with different evolutionary rates, and iii) identifying strain-specific genes shared with
other species but not within species [51]. Both compositional and gene-based methods are
limited to detection of relatively recent LGT events, since transferred sequences may ameliorate,
or become more similar to the host sequence over time [54].
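As a rough illustration of a compositional method, the sketch below scans a genome in sliding windows and flags windows whose GC content deviates strongly from the genome-wide window mean. The window size, step, and z-score cutoff are arbitrary assumptions chosen for the example, not values taken from the cited studies, and real tools typically combine several signals (e.g., oligonucleotide frequencies) rather than GC content alone.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, gc, z) for windows whose GC content is an outlier.

    The z-score is computed against the mean and standard deviation of
    all windows in the same genome (an assumed, simplistic null model).
    """
    starts = list(range(0, max(len(genome) - window + 1, 1), step))
    gcs = [gc_fraction(genome[s:s + window]) for s in starts]
    mu, sigma = mean(gcs), pstdev(gcs)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing to flag
    flagged = []
    for start, gc in zip(starts, gcs):
        z = (gc - mu) / sigma
        if abs(z) >= z_cutoff:
            flagged.append((start, gc, z))
    return flagged


# Toy genome: 50% GC background with a 1 kb GC-rich island at position 4000.
toy_genome = "ATGC" * 1000 + "GC" * 500 + "ATGC" * 1250
```

On the toy genome, only the window that lies entirely within the GC-rich island exceeds the cutoff. An ameliorated island, whose composition has drifted toward the host's, would not be flagged, which is the limitation described above.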
LGT in microbial communities
Estimates of LGT frequency were first calculated per taxon, and then per gene family.
Compositional methods predicted that 11% [55] to 17% [54] of the Escherichia coli chromosome
was acquired through LGT. Later studies compared LGT percentages between taxa: one study
found that LGT ranged from 0% of protein-coding genes in Mycoplasma genitalium to 16.6% of
protein-coding genes in Synechocystis PCC6803. This study further identified E. coli, Helicobacter
pylori, and Archaeoglobus fulgidus to have large proportions of transferred genes associated with
plasmid-, phage-, or transposon-sequences [56]. Symbionts and parasites such as Wigglesworthia
brevipalpis, Chlamydia, Mycoplasma, Rickettsia and Borrelia burgdorferi, were found to have lower
proportions of laterally transferred coding sequences (CDS) [53, 57]. Estimates for LGT
percentages across gene families have also been highly variable. Explicit phylogenetic methods
have since estimated that anywhere from 2% [58] to 60% [59] of genes are affected by LGT [60].
Forces that drive LGT within communities may include phylogeny, geography, and
ecology. Phylogeny is expected to play a strong role: closely related partners in a group will
preferentially exchange genes, since they will have shared genomic structure, machinery, and
phage host range [61]. One study inferred Bayesian phylogenetic trees for 5282 sets of proteins,
and found that Escherichia coli and Shigella have higher rates of gene transfer within
phylogenetic groups as compared to between phylogenetic groups [62]. Another study found
that integrons in Vibrio cholerae are associated with geography [63]. Lastly, taxa with similar
ecological needs may be found in close proximity, which fosters conjugation, cell fusion, or
GTAs; in addition, increased LGT via plasmids has been observed in biofilms [64]. One study
inferred LGT events between pairs of genomes if they shared 500 bp blocks with 99% similarity:
they found that genome pairs from the same environment had the most LGT events, followed
by genome pairs with small phylogenetic distances [65].
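The block-sharing criterion described above (500 bp blocks at 99% similarity) can be sketched as follows. This is a simplified, ungapped seed-and-compare illustration and not the cited study's actual pipeline: the seed length, the use of non-overlapping query blocks, and the function names are all assumptions made for the example.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shared_blocks(genome_a, genome_b, block=500, min_identity=0.99, seed=20):
    """Find ungapped blocks of genome_a that occur in genome_b at high identity.

    genome_a is cut into non-overlapping blocks (so a shared region must span
    a full grid-aligned block to be found); exact seed matches in genome_b
    nominate candidate positions, which are then scored by ungapped identity.
    Returns a list of (pos_a, pos_b, identity).
    """
    seed_index = {}
    for j in range(len(genome_b) - seed + 1):
        seed_index.setdefault(genome_b[j:j + seed], []).append(j)

    hits = []
    for i in range(0, len(genome_a) - block + 1, block):
        query = genome_a[i:i + block]
        tried = set()
        for offset in range(0, block - seed + 1, seed):
            for j in seed_index.get(query[offset:offset + seed], []):
                start_b = j - offset
                if start_b < 0 or start_b + block > len(genome_b) or start_b in tried:
                    continue
                tried.add(start_b)
                score = identity(query, genome_b[start_b:start_b + block])
                if score >= min_identity:
                    hits.append((i, start_b, score))
    return hits
```

Counting such hits per genome pair, and then comparing counts across pairs grouped by environment or phylogenetic distance, would mirror the kind of comparison described in the text; in practice an alignment tool that handles gaps would be used instead of this ungapped comparison.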
The human microbiota is likely to have high frequencies of LGT. More LGT was found
between human-associated microbial genomes, as compared to between human- and non-
human-associated microbial genomes, with most transfers occurring in the oral and gut sites
[65, 66]. Still, these studies focused on available reference genomes, which represent microbial
snapshots in time. Future work utilizing metagenomic contigs and shotgun metagenomic reads
across time may better capture de novo LGT events. For example, one study identified mobile
gene pools in Fijian and North American microbiomes from single-cell genome sequencing,
mapped shotgun metagenomic reads to the genes, and found that mobile gene abundances
were associated with diet and Fijian villages [67]. With this, they determined that LGT
frequencies are not only determined by microbial characteristics (i.e., phylogeny, geography,
and ecology), but may also be driven by host lifestyle and geography.
Transferred functions and their associated costs
There are two leading hypotheses for the types of genes transferred through LGT. The
first hypothesis assumes that genes can be divided into two classes, i) “informational” genes,
which are utilized for replication, transcription, and translation, and ii) “operational” genes,
which are used in metabolism [68]. This hypothesis predicts that the latter gene type is more
likely to be transferred, since the former gene type is responsible for cell division, the most
fundamental process for life. The second hypothesis is termed “the complexity hypothesis”, and
states that genes integrated into large, complex systems (i.e., part of large signaling pathways)
are less likely to be transferred than genes that are part of smaller pathways [69]. These two hypotheses
are not mutually exclusive: indeed, some have found that “informational” genes are more likely
to be part of complex systems [70]. Accordingly, predicted transferred functions have included
“plasmid, phage, and transposon functions”, “cell surface structures”, “surface
polysaccharides”, “DNA transformation”, “pathogenesis”, and “toxin production and
resistance” [57]. Another study utilizing phylogenetic trees found that “energy metabolism”
and “mobile and extrachromosomal element functions” were enriched in discordant
phylogenetic trees, whereas “DNA metabolism,” “protein synthesis,” “protein fate,” and
“regulatory functions” were depleted [71].
Transferred genes may not be retained even if they are beneficial, since they may incur
high costs. Costs include disruption of neighboring genomic features via insertion, utilizing
limited resources through transcription and translation, and disrupting interactions within the
cellular network [72]. Furthermore, if transferred genes contain different codon usage, they may
lead to improper expression and/or protein mis-folding [73], which may incur cytotoxicity.
Different microbial taxa may have a variety of mechanisms to handle such costs: for example,
some taxa harbor H-NS proteins, which bind regions of high AT content and silence expression
[74]. Also, most successfully transferred genes eventually ameliorate [54]. The former operates
immediately, while the latter takes time, indicating that different mechanisms may operate on
different timescales to facilitate and select for gene integration.
Evolutionary legacy of LGT
The significance of LGT on evolution is still being debated today. Scientists found that
phylogenetic trees constructed from other “universal” genes such as heat shock protein HSP70
and glutamate dehydrogenase do not agree with the rRNA-based universal phylogenetic tree
[54]. Furthermore, informational genes such as aminoacyl-tRNA synthetases (aaRSs), which
attach amino acids to the corresponding tRNAs, have evidence of transfer [75]. These
discrepancies have led to two hypotheses. The first is the “early massive horizontal transfer
hypothesis”, in which LGT occurred early in prokaryotic evolution and created modern cells,
after which vertical gene transfer became the dominant evolutionary force (as compared to
LGT). The second is the “continual horizontal transfer process”, in which LGT has been a
continuous force from early evolution that continues today [68, 70].
Woese has argued in favor of the “early massive horizontal transfer hypothesis”
[76]. He argues that the rRNA gene represents cellular information processing
systems such as replication, transcription, and translation, which are fundamental to cells and
differ between bacteria, archaea, and eukaryotes. This implies that multiple progenitor cells,
each with their own information processing systems, must have existed before the division of
the three domains. These progenitor cells were not well-developed, which allowed for extensive
LGT that may have eventually given rise to the efficient, modular cells seen today. In contrast,
Lake has argued for the “continual horizontal transfer process” hypothesis [70]. To test both
hypotheses, he classified genes as “informational” or “operational” [68], and then constructed
phylogenetic trees for each gene type. He assumed that informational genes were not subject to
transfer or were transferred infrequently (which is debatable [69]). If the phylogenetic trees for
the two gene types were similar, it would indicate that most LGT had occurred before
formation of the three domains, thus supporting the “early massive horizontal transfer
hypothesis”. Instead, he found that phylogenetic trees for the two gene types were significantly
different, which indicates that LGT is still an ongoing force today, thereby supporting the
“continual horizontal transfer process”. Others have argued that i) the observed variation in
nucleotide composition across whole genomes and ii) Occam’s Razor support the “continual
horizontal transfer process” [60].
Surveying microbial communities in the built-environment
Another set of microbial interactions is between microbial communities and their
environment. In 1934, the Dutch microbiologist Lourens G. M. Baas Becking articulated that
“everything (microorganisms) is everywhere: but the environment selects” [77, 78]. This
statement put forth a hypothesis that has shaped current microbial ecology: microbial
distributions were believed to be primarily shaped by dispersal and environment, as opposed to
earth history and geography [79]. This hypothesis is demonstrated in the human microbiome, in
which microbial communities and their associated functions are often site-specific [80, 81]. In
contrast, the built-environment seems to be primarily shaped by dispersal, especially from
human-associated communities [82]. To better understand how microbial communities outside
the human body affect human health, researchers must first understand these dispersal patterns
and then determine how these microorganisms interact with their new environment.
Microbial composition of the built-environment
The built-environment is the ecological habitat of humans, consisting of the physical
parts of where we live and work (such as homes, offices, streets) [83]. Humans
spend most of their time in the built-environment: one study showed that Americans (across
states) spend ~87% of their time indoors and ~6% of their time in an enclosed vehicle
(consistently over the past few decades) [84]. As of 2015, buildings were estimated to cover 1.3%
to 6% of global ice-free land [85, 86], and are expanding rapidly [87, 88]. Although building
temperatures and humidity vary across the world, each unit is enclosed and consistently
maintains these variables throughout the day and across seasons [88]. They may also contain a
variety of materials and chemicals not found in the natural environment [89]. It is therefore
important to identify i) which microbes are in the built-environment, and ii) how they adapt
to these environments. Furthermore, distinguishing how microbes, microbial compounds, and
man-made chemicals affect human health can result in actionable changes in hygiene and
building construction.
Currently, most studies have focused on building surfaces such as homes, restrooms,
hospitals, and classrooms. These studies have shown that the majority of microbes in the built-
environment are derived from human skin, with some influence from human interaction and
the surrounding environment [90]. This is unsurprising, given that humans shed between 2 ×
10^8 and 10 × 10^8 skin cells/day [91]. Colonization and de-colonization happen rapidly: the Home
Microbiome Study monitored seven families in their homes for six weeks, in which three
families had samples taken pre- and post- move into their new homes. For these three families,
the differences in microbial community structure between their previous and new homes were
insignificant, indicating quick colonization of the new home. Researchers also quantified how
much each individual contributed to the microbial signal of the house, and found that an
absence of three days led to smaller contribution [92], indicating quick de-colonization. The
effect of human interaction can be observed via microbial community patterns on different
surfaces and room types. For example, a study of public restrooms showed that the microbial
communities of bathroom floors were likely derived from soil taxa, while communities on toilet
seats, handles, and the inside of the stall were derived from gut bacteria and urine [93]. Lastly,
the surrounding environment may introduce new members to built-environment communities:
one study found that phylogenetic diversity was correlated with ventilation air, airflow rates,
and humidity and temperature [94].
These findings indicate that the human microbiome is rarely colonized or altered by
built-environment microbial communities. Instead, a person may be primarily exposed to
his/her own microbiome, which could then self-perpetuate or perpetuate to other occupants
within the building, either to their benefit or detriment [95]. One example is the effect of pets on
their owners: some studies found that infants in homes with dog or cat exposure have
decreased risk of atopy [96], though other studies identified pets as sources of endotoxins [97,
98]. More work is needed to determine what constitutes a healthy indoor microbiome [3],
especially since adverse health effects have been tied to microbial and non-microbial sources.
Microbial threats include single pathogens such as Legionella, which may be transferred through
water systems and inhalation (if aerosolized), as well as microbial components such as
endotoxin, which has been paradoxically linked to promotion of and protection against asthma
[99]. Non-microbial threats include damp indoor environments, which have been associated
with respiratory diseases, and may further be linked to growth of mold and fungal species
[100].
Applications for the built-environment
Since built-environment microbial communities are largely derived from human skin,
they may also resemble their occupants, giving rise to forensic applications. The Home
Microbiome Study could predict which family belonged to which home using microbial
community profiles [92]. Many built-environment studies have also found that occupants of the
same space have significantly more similar microbiomes. For example, families not only share
microbes with one another, but also with their dogs [101]. Cohabiting couples could be
matched based on their skin microbiome samples ~86% of the time [102]. Lastly, one study
collected shoe and phone samples from individuals at three different conferences: random
forest models could predict which conference each sample was taken from, and distinguish
between two individuals’ shoe samples at a single conference [103]. These studies indicate that
individuals may be linked to highly-trafficked buildings, as well as to colleagues within the
same space [104].
Another potential application is improved protocols for hygiene, especially with the
development of the hygiene hypothesis. The hygiene hypothesis was conceived as early as 1989:
David Strachan found that high prevalence of hay fever (at ages 23 and 11) and eczema (in the
first year of life) was linked to smaller family sizes. He hypothesized that fewer infections early
in life (due to reduced disease transmission in smaller families) lead to a greater risk of
allergic disease later in life [105]. His hypothesis was replaced by the “Old Friends” hypothesis in
2004: Rook et al. stated that the increase in certain diseases (such as allergy) in developed parts of the
world was due to lack of exposure to “old friends”, which are defined as microbes that co-
evolved with humans. These “old friends” facilitate regulatory T cell development, thereby
preventing inappropriate immune responses [106]. The “Old Friends” hypothesis has led to the
general consensus that increased microbial diversity is favorable, though others argue it is
simply a community property [107]. With this, hygiene should be redefined as an effort to select
for beneficial bacteria, rather than an attempt at complete sterilization [108, 109]. Suggested
interventions have been to build with materials that select for specific microbes, as well as
increasing building ventilation and outdoor green space to boost microbial diversity [3, 110].
Technical considerations for sampling the built-environment
The majority of built-environment samples have been sequenced using 16S rRNA
sequencing due to low biomass, which makes them particularly susceptible to batch effects and
contamination from sequencing kits and reagents. The former was demonstrated in a study that
monitored office buildings in Flagstaff, AZ; San Diego, CA; and Toronto, ON for one year.
Samples were grouped and sequenced by season, with eight technical replicates included in
each sequencing run. Unfortunately, sequencing run was confounded with seasonality: even
technical replicates varied widely across runs. Researchers attempted to eliminate the batch
effect by removing highly variable, low-abundance taxa, which worked poorly [111]. Other
studies have been affected by contaminants found in sequencing reagents, extraction kits, and
PCR reagents [112-114]. For example, one study identified age as the driver of observed trends
in the nasopharyngeal microbiome among children in a refugee camp, but another study
showed that the driver was kit contaminants [115]. The use of technical replicates, extraction
and negative controls, and microbial spike-ins have all been proposed to address the problem
of batch effects and contaminants. These technical challenges may be further complicated by the
rise of citizen microbiology projects, in which aseptic technique, sample collection logistics, and
privacy concerns must be considered [116].
Unfortunately, built-environment studies that rely on DNA sequencing for community
profiling cannot distinguish between DNA in live cells and extracellular DNA, which can
survive on surfaces for weeks to years [117]. Currently, it is unclear whether most microbes on
these surfaces are active, dormant, or dead. Some have described the built-environment as a
microbial wasteland, where most microbes are likely dormant or dead [82, 111]. One study
found that 40% of prokaryotic and fungal DNA in soil was extracellular or from cells that were
not intact [118]. Several methods have been developed that may assist in assessing viability,
which primarily function by examining membrane integrity, measuring transcriptional or
translational activity, or measuring cellular respiration (through ATP). Still, the majority of
these methods are for bacteria, and may not work on viruses or spores [119].
The role of DNA sequencing for microbial profiling
High-throughput DNA sequencing has proven invaluable for investigating diverse
environmental and host-associated microbial communities. Sequence-based taxonomic profiling
of a microbiome can be carried out using either amplicon (typically the 16S rRNA gene) or
whole metagenome shotgun (WMS) sequencing (reviewed in [120-122]). The resulting DNA
sequence data are then used to assess the community in at least two ways: taxonomic profiling,
which answers, “who is present in the community?” and functional profiling, which answers,
“what could they be doing?” Still, there are several limitations to DNA-based approaches. First,
the most common taxonomic profilers provide at best species-level taxonomic resolution,
whereas many important phenomena occur at the strain level. Second, DNA sequencing cannot
directly measure the functional activity of a community under a given set of conditions. While
the former has been addressed through sequencing and bioinformatics techniques, the latter
may require multi’omic data sets, which include community RNA (transcriptomics), protein
(proteomics), and metabolite abundances (metabolomics), preferably in an integrated
framework.
Amplicon Sequencing
One common method for profiling a microbial community involves sequencing specific
microbial amplicons (predominantly the bacterial 16S rRNA gene). Although amplicon-based
sequencing considers only one or a few microbial genes, it may be used for taxonomic,
phylogenetic, and even functional profiling. It may also be used to profile low biomass samples,
as compared to WMS sequencing. To identify which taxa are present, amplicon sequences are
either directly binned to reference taxa [123, 124] by classification or phylogenetic placement, or
more commonly they are first clustered into operational taxonomic units (OTUs) sharing a fixed
level of sequence identity (often 97%) [125, 126], and then binned as a whole (often by
classification of a reference sequence). Functional profiles can be approximated for marker-
based samples by associating 16S rRNA or marker genes with annotated reference genomes,
aggregating coding sequences (from the reference genomes) into gene families, and then
inferring gene family abundances through taxonomic abundances [127].
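The inference of gene family abundances from taxonomic abundances can be sketched as a weighted sum over taxa. The example below is a minimal illustration under stated assumptions: the dictionaries, taxon names, and family names are hypothetical, and the optional division by marker gene copy number is an added normalization step that is not described in the text above.

```python
from collections import defaultdict


def infer_gene_families(taxon_abundance, genome_content, marker_copies=None):
    """Predict gene family abundances from marker-based taxon abundances.

    taxon_abundance: {taxon: marker read count or relative abundance}
    genome_content:  {taxon: {gene_family: copies in its reference genome}}
    marker_copies:   optional {taxon: marker gene copies per genome};
                     taxa missing from this mapping are assumed to have 1.
    """
    marker_copies = marker_copies or {}
    profile = defaultdict(float)
    for taxon, abundance in taxon_abundance.items():
        # Convert marker abundance to an approximate genome (cell) abundance.
        genomes = abundance / marker_copies.get(taxon, 1)
        for family, copies in genome_content.get(taxon, {}).items():
            profile[family] += genomes * copies
    return dict(profile)


# Hypothetical two-taxon community.
example_profile = infer_gene_families(
    taxon_abundance={"taxonA": 60, "taxonB": 40},
    genome_content={"taxonA": {"famX": 1, "famY": 2}, "taxonB": {"famX": 3}},
    marker_copies={"taxonA": 2},
)
```

Taxa without an annotated reference genome contribute nothing in this sketch, which reflects the dependence of such predictions on reference genome coverage.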
Unfortunately, the singular use of the 16S rRNA marker gene has several problems.
First, some species have multiple copies of the 16S rRNA gene, which in turn have different
sequences [125]. Second, the 16S rRNA gene has difficulty resolving species due to its slow
evolution: strains with less than 97% 16S rRNA sequence identity are likely to be different
species, but strains with more than 97% 16S rRNA sequence identity are not necessarily the same species [49, 128].
The use of the 97% cutoff is also somewhat arbitrary and based off concordance with DNA-
DNA hybridizations [129]. In order to improve taxonomic resolution from 16S rRNA
sequencing, two techniques have been developed. One recent technique, termed “oligotyping”,
uses a sequence entropy-based approach to identify maximally informative sites within the 16S
rRNA gene to improve OTU resolution [130]. Oligotyping is advantageous for distinguishing
closely related taxa (such as those that differ by a single 16S rRNA nucleotide) and has been
applied to study subspecies-level population structure in the vaginal microbiome [131] and to
link sewage samples to specific fecal pollution sources [132]. In addition, a new, low-error
approach to 16S rRNA gene sequencing, termed LEA-Seq, has been proposed and used to
profile stable carriage of host-specific strains in the human gut microbiome [133].
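The entropy-based idea behind oligotyping can be illustrated with a short sketch: compute the Shannon entropy of each column in a set of aligned reads, keep the most variable columns, and group reads by the bases they carry at those positions. This is a simplified, assumed version of the approach (a fixed number of positions, no noise filtering, hypothetical function names), not the published implementation.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the bases observed in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def oligotypes(aligned_reads, n_positions=2):
    """Group aligned reads by their bases at the highest-entropy columns.

    Returns (positions, counts), where positions are the selected column
    indices and counts maps each concatenated base string to its frequency.
    """
    length = len(aligned_reads[0])
    entropies = [column_entropy([read[i] for read in aligned_reads])
                 for i in range(length)]
    # Rank columns by decreasing entropy; ties keep their left-to-right order.
    ranked = sorted(range(length), key=lambda i: -entropies[i])
    positions = sorted(ranked[:n_positions])
    counts = Counter("".join(read[i] for i in positions) for read in aligned_reads)
    return positions, counts


# Toy alignment: columns 2 and 4 vary, the rest are invariant.
toy_reads = ["ACGTA", "ACGTA", "ACCTA", "ACCTT"]
```

For the toy alignment, columns 2 and 4 are selected, and the reads fall into three groups, which shows how a single variable nucleotide can separate sequences that a 97% identity threshold would merge.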
WMS Sequencing
WMS sequencing involves sequencing “random” DNA fragments from microbial
communities. Taxonomic profiling of metagenomes instead uses some or all shotgun reads to
determine membership in a community. This can be done in a number of ways, including
metagenomic assembly followed by phylogenetic binning or placement of contigs [134]. More
commonly, short reads are profiled directly by comparison to a reference catalogue of microbial
genes or genomes. Alternatively, reads can be mapped to a (pre-computed) catalog of clade-
specific marker sequences (with [135] or without [136] pre-clustering). Finally, reads may be
assigned to species based on agreement with models of genome composition [137] or by exact k-
mer matching [138], thus enabling placement of reads or assembled contigs when
corresponding reference genomes are not available (which is common for poorly characterized
communities).
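Read assignment by exact k-mer matching can be sketched as an index from k-mers to the reference taxa that contain them, followed by a vote over the k-mers of each read. The version below is a deliberately simplified assumption: it counts only k-mers unique to one taxon and returns the taxon with the most votes, whereas production classifiers resolve shared k-mers with a taxonomy (for example, by lowest common ancestor). The taxon names and sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of reference taxa whose genome contains it."""
    index = {}
    for taxon, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon with the most taxon-specific k-mer hits.

    Returns None when no k-mer of the read is unique to a single taxon.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer)
        if taxa and len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical references; real genomes and k (often around 31) are far larger.
toy_index = build_index({"sp1": "ACGTACGTTTGACC", "sp2": "GGGCATCATCAGGA"}, k=5)
```

Here `classify("CGTTTGAC", toy_index, 5)` returns `"sp1"`, and a read with no indexed k-mers is left unassigned.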
WMS sequencing is the preferred method for strain-level profiling due to its ability to
identify variation throughout microbial genomes. Strains may differ in sequence through loss or
gain of genomic regions or through single nucleotide polymorphisms (SNPs), both of which can
be identified by mapping shotgun reads to reference genomes. For example, mapping WMS
reads from tongue samples to genomes of Streptococcus mitis highlighted the presence and
absence of genomic islands in isolates of that species from individuals enrolled in the Human
Microbiome Project (HMP) [2]. Genomic islands were shown to contain multiple, functionally
coherent genes (e.g., subunits of the V-type H+ ATPase) that were gained and lost together,
suggesting a mechanism for individual- and body site-specific functional specialization.
Detection of SNP differences requires greater sequencing depth. Existing WMS data from
human stool samples have been used to identify reference genomes with high sequencing
coverage which were then scanned for SNPs [135]. This analysis revealed that subject-specific
SNP variation tended to remain stable for up to a year and was comparatively more conserved
than overall species abundance.
Functional profiling of metagenomic samples typically begins by associating new
sequence data with known gene families. This can be accomplished by directly mapping DNA
or RNA reads to databases of gene sequences that have been clustered at the family level; such
databases include KEGG Orthology [139], COG [140], NOG [141], Pfam [142], and UniRef [143].
Naturally, the number of reads that can be mapped in this manner depends on the
completeness of the underlying reference database. Alternatively, reads can be assembled into
contigs to determine putative protein-coding sequences, and then the CDSs are assigned to gene
families following the same or similar methods used for annotating isolate microbial genomes.
Both strategies yield profiles of the presence and absence of a gene family as well as the relative
abundance of each family within a sample. Functional profiles at the gene family-level may
contain many thousands of features. Downstream analyses can be made more tractable by
further performing per-organism or whole-community pathway reconstruction based on these
genes. Although not specifically designed for microbial community analysis, species-specific
pathway databases such as KEGG [139], MetaCyc [144], and SEED [145] can be useful for this
purpose. Integrated bioinformatics pipelines such as IMG/M [146], MG-RAST [145],
MetaPathways [147], and HUMAnN [148] have been developed to streamline the conversion of
raw meta’omic sequencing data into more easily-interpreted profiles of microbial community
function.
Contig Assembly
Deeper WMS sequencing can facilitate the de novo assembly of contigs and even
microbial genomes. Assemblies are generated by connecting overlapping sequencing reads to
form longer sequences, which may be represented as an assembly graph in which nodes
embody sequence information (such as k-mers) and edges connect adjacent or overlapping
sequences. Metagenomic samples come with special challenges that may lead to errors in the
assembly graph. These samples contain multiple taxa with differential abundances, leading to
uneven coverage and the presence of conserved sequences across taxa, and making it difficult to
determine where edges should be drawn. Several tools have been built to address these
problems: MetaVelvet-SL generates a single assembly graph, and then uses k-mer coverage to
identify sub-graphs that are assumed to be single species; in contrast, both MetaSPAdes and
IDBA-UD use multiple k-mer sizes to iteratively improve the assembly graph [149]. General
challenges in assembly arise from technical variables, such as sequencing errors, chimeric reads,
and read lengths that are shorter than genomic repeats [150], as well as the size of the dataset,
which increases computational intensity. Lastly, there is no gold standard to determine if a
given assembly is correct. As a result, earlier metagenomic studies that utilized assembly
limited analyses to cataloguing genes and functions [151, 152], though one study went further
and identified plasmids and scaffold synteny in samples collected from the Sargasso Sea [153].
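The assembly graph described above can be illustrated with a minimal de Bruijn graph sketch, in which nodes are (k-1)-mers and each k-mer observed in a read adds an edge between its prefix and suffix. This is one common graph formulation chosen here as an assumption; it ignores sequencing errors, coverage, and reverse complements, and the contig walk checks only out-degree, so it is a toy rather than a usable assembler.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk from `start` while each node has exactly one successor.

    The walk stops at dead ends, at branches (e.g., sequence shared between
    taxa or repeats), and when it would revisit a node (a cycle).
    """
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in visited:
            break
        contig += successor[-1]
        visited.add(successor)
        node = successor
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCAAT.
toy_graph = de_bruijn(["ATGGCGT", "GGCGTGC", "GTGCAAT"], k=4)
```

Walking from the node `ATG` reconstructs `ATGGCGTGCAAT`. If a second taxon shared part of this sequence and then diverged, the shared node would have two successors and the walk would stop there, which is the ambiguity that metagenome-specific assemblers attempt to resolve using coverage or multiple k-mer sizes.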
Assembly is crucial to studying microbial communities, and may be used to identify
novel sequence elements, generate reference genomes for microorganisms that are uncultivated
or poorly represented in reference databases, and characterize the synteny of microbial
genes. Improvements in metagenomic contig assembly have led to the recovery of whole
microbial genomes from communities [92, 154-156], which was previously only possible in low-
complexity communities [157]. One study was able to assemble 31 bacterial genomes after
binning assemblies by differential read coverage [158]. Increasing the number of reference
genomes across the tree of life may help with discovery of novel gene functions and pathways
[159]. Furthermore, assemblies can reveal novel genomic rearrangements and LGT events not in
previous reference genomes. For example, one study found that the genomic architecture of
mobile genes in human gut samples was specific to individuals, even though individual mobile
genes were found universally across U.S. and Fijian cohorts [67].
Summary
The advent of DNA sequencing has made it quicker and easier to profile microbial
communities, while further development of tools for analyzing and interpreting sequencing
data may reveal community trends and interactions between individual
microbes. In Chapter 2, we describe the tool WAAFLE, a workflow that annotates assemblies
and finds LGT events from assembled metagenomics contigs. We then apply WAAFLE to the
Human Microbiome Project, and find that properties such as phylogenetic relatedness and
abundance affect LGT frequency, and that transferred functions are enriched for mobile
elements and outer membrane receptors. In Chapter 3, we survey the microbial communities on
the Boston subway using 16S rRNA and WMS sequencing. We observe that the microbial
community is composed mostly of skin microbes, and that overall pathogenic potential is low.
Chapter 2:
Lateral Gene Transfer in the Human Microbiome
Attributions
The contributors to this work include Tiffany Hsu, Eric A. Franzosa, Chengwei Luo,
Dennis Wong, Morgan Langille, Robert G. Beiko, and Curtis Huttenhower, in no particular
order. T.H. and E.A.F. developed the software implementation, evaluated the method, and
applied the tool to the Human Microbiome Project. All authors helped design the method and
interpret the data. T.H. wrote the text with feedback from E.A.F., M.L., R.G.B., and C.H.
Introduction
Lateral gene transfer (LGT) is the movement of genetic material between organisms
without sexual or asexual reproduction [160]. Its role in microbial communities is not well
understood, due to the difficulty in identifying LGT events. First, evolutionarily significant
events are difficult to ascertain. These events include ancient LGTs, which have likely
ameliorated to the host genome, as well as LGT of homologs, which are conserved across
species and difficult to distinguish from orthologs (homologs that arose through speciation) and
paralogs (homologs that were duplicated and have a separate evolution trajectory) [53]. Second,
transient events, in which LGT occurs but the organism does not accept or maintain the
transferred sequence, are difficult to measure [161]. Still, LGT has proven to be an important
evolutionary force [15], especially with the rise of antibiotic resistance in human-associated
microbial communities. LGT events may change the fitness of individual microbes, which may
in turn affect microbial community composition and function. These events may eventually
give rise to new species, impacting both evolutionary history and phylogeny [70, 76].
Several studies have characterized the quantity of and forces shaping LGT in human-
associated microbial communities. For human-associated microbial genomes, most transfers
occur in the oral and gut sites [65, 66]. LGT may be shaped by host factors, such as lifestyle and
geography, as well as microbial traits, such as phylogeny and ecology. One study found that
cultural practices affected LGT rates: mobile gene pool abundances in Fijian and North
American microbiomes were associated with diet and Fijian villages [67]. Another study found
increased LGT between human-associated isolates as compared to between human-associated
and non-human-associated isolates. Isolates with shorter phylogenetic distances and from
similar sources (between human and/or non-human) had increased transfer, though the latter
had the stronger effect [65]. Accordingly, some have proposed that LGT is a mechanism used
between niche-sharing microbes to adapt to changing conditions [162], while others have
suggested it as a mechanism to enforce cooperation or competition [163, 164]. This is further
supported by the observation that transferred genes are enriched for functions in cell surface,
DNA-binding, and pathogenicity, which may be necessary for survival in different
environments [57].
Microbial community sequencing has generated 16S rRNA and metagenomic shotgun
datasets, yet most software tools available for LGT detection are designed for whole and/or
draft genomes [165, 166]. Methodologies for detecting LGT fall roughly into three categories:
composition-based, alignment-based, and phylogeny-based approaches. Composition-based
methods assume that laterally transferred genes will have distinct nucleotide compositions as
compared to the host genome: software such as Alien_Hunter [167] uses interpolated variable
order motifs (IVOMs) to find genomic regions with significant shifts in composition.
Alignment-based methods look for discrepancies between gene distances and phylogenetic
distances: for example, Darkhorse [168] aligns protein sequences (from a single genome) to a
reference database and infers LGT using bitscore and phylogeny. In contrast, IslandPick uses
genome alignments and comparative genomics to identify LGT in closely related genomes [169].
Phylogeny-based implementations such as rSPR [170], PhylTr [171], and MaxTiC [172] search
for incongruence between gene trees and species trees. Only the software Daisy [173] utilizes
shotgun reads, but still requires prior knowledge of donor and recipient genomes.
Here, we present WAAFLE, a Workflow for Annotating Assemblies and Finding LGT
Events, which uses alignment-based methods to detect LGT events in contigs assembled from
metagenomic shotgun sequencing sets. A tool that can utilize shotgun sequencing data has
several advantages. First, we can potentially find new LGT events that are not yet reflected in
reference genomes. Second, since each metagenomic sample represents a snapshot in time,
users will have the ability to compare LGT rates between individuals, conditions, and across
time. Third, although WAAFLE is limited to fairly recent events, the use of reference databases
allows us to identify gene functions and perform taxonomic assignment with higher accuracy,
especially in human-associated datasets. In this study, we apply WAAFLE to the Human
Microbiome Project Phase 1-II (HMP 1-II) [174] assembled contigs. We quantify LGT
frequencies for taxon pairs at the genus level across six major body sites, where frequency
reflects the number of unique, novel, and fixed LGT events per sample (each sample being a
single body site in an individual). We then i) determine how abundance and phylogeny
influence LGT frequencies, ii) characterize taxon pair formation and partner preference, and iii)
identify functions enriched in LGT contigs.
Results
Identifying recent LGT events from metagenomic shotgun sequencing
In order to detect LGT events from metagenomic shotgun sequencing, we developed
WAAFLE, a Workflow to Annotate Assemblies and Find LGT Events (Fig. 2-1B). WAAFLE has
one required input, i) assembled metagenomic contigs in FASTA format, and two optional
inputs, ii) gene calls for each contig and iii) a nucleotide reference database of genes with
taxonomic and functional annotations (down to the species level, and for UniRef50 and
UniRef90 terms, respectively). A default reference database of pangenomes, MetaRef [175], is
provided. WAAFLE conducts a four-step process to output a single file in which each contig is
classified as containing LGT or not, with each gene annotated with a taxon and function. First,
contigs are searched against the nucleotide reference database via BLASTN. Second, contigs are
annotated with genes, either by connecting overlapping BLAST alignments or using supplied
gene calls. Third, contig genes are assigned UniRef50/90 annotations and taxon scores; the latter
represents how well a given taxon characterizes a gene. To do this, we bin BLAST hits by gene,
and then group BLAST hits within bins by taxonomic annotation. From the BLAST hit bins, we
assign the most common UniRef50/90 term to each gene. From the BLAST hit groups, we
calculate a single score per taxon per gene using the percent identity and subject coverage.
Fourth, contigs are classified as having LGT or not. Using the taxon scores, we determine
whether genes across a contig are best explained by two taxa or one (Fig. 2-1B).
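As a rough illustration of the third step, the sketch below bins hits by gene and by taxon and derives one score per taxon per gene. The scoring rule shown (the best product of fractional identity and subject coverage among a taxon's hits) is an assumption made for illustration; the text states only that percent identity and subject coverage are used. The `Hit` record, the `annotate_genes` function, and the UniRef labels are hypothetical names, not part of WAAFLE.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit already binned to a gene (hypothetical record layout)."""
    gene: str
    taxon: str
    uniref: str
    pct_identity: float      # 0-100
    subject_coverage: float  # 0-1


def annotate_genes(hits):
    """Return ({gene: most common UniRef term}, {gene: {taxon: score}})."""
    terms = defaultdict(Counter)
    scores = defaultdict(dict)
    for h in hits:
        terms[h.gene][h.uniref] += 1
        # Assumed scoring rule: identity fraction times subject coverage,
        # keeping the best-scoring hit per taxon per gene.
        s = (h.pct_identity / 100.0) * h.subject_coverage
        scores[h.gene][h.taxon] = max(s, scores[h.gene].get(h.taxon, 0.0))
    functions = {g: c.most_common(1)[0][0] for g, c in terms.items()}
    return functions, dict(scores)


# Toy hits (invented values) for a single gene.
hits = [
    Hit("gene1", "Bacteroides", "UniRef90_A", 98.0, 1.0),
    Hit("gene1", "Bacteroides", "UniRef90_A", 90.0, 0.5),
    Hit("gene1", "Alistipes", "UniRef90_B", 80.0, 0.5),
]
functions, scores = annotate_genes(hits)
```

In this toy case the gene receives the function shared by most of its hits, and each taxon keeps only its strongest hit as its score.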
Figure 2-1. WAAFLE pipeline overview. A) Within microbial populations, genes can be
transferred vertically or laterally, which may confer adaptive traits to individual microbes and
affect the community composition and function. To understand the impact of LGT, we built the
tool B) WAAFLE, which identifies LGT events within metagenomic contigs using a four-step
process. First, WAAFLE searches contigs against a reference species pangenome database,
which is generated by downloading NCBI isolate genomes, binning isolate genes by species,
and then clustering binned species genes at 97% nucleotide identity. Second, WAAFLE calls
genes (if not supplied) by connecting overlapping BLAST hits. Third, WAAFLE assigns each
gene a function and taxon scores. To do this, alignments are first binned by genes: the most
common UniRef50/90 annotation across hits per gene is assigned as the gene function. Binned
alignments are then further grouped by taxa, and taxon scores are calculated using percent
identity and subject coverage. Fourth, we classify the contig as having LGT or not. If a single
taxon has taxon scores above k1 (blue threshold) across all contig genes, the contig is predicted
to not have LGT. Otherwise, if two taxa have taxon scores above k2 (red threshold) across all
contig genes, the contig is predicted to have LGT. C) To evaluate WAAFLE and its parameters,
we generated synthetic contigs by selecting random donor and recipient genomes at
different taxonomic levels. We chose a three gene region from the recipient genome, and
replaced the center gene with a gene from the donor genome. We then truncated the newly
formed contig at both ends.
How accurately WAAFLE detects LGT depends both on the contig assembly quality and
WAAFLE parameter settings. False positive LGT calls may arise from contig misassemblies,
which we hypothesized would have steep drops in read coverage. To identify misassemblies,
we mapped shotgun reads to metagenomic contigs and examined gene junctions, the region
between two contig genes. Contigs that i) had low read coverage at junctions relative to
flanking genes and ii) lacked paired or single read support for the junction were removed from
analysis, regardless of LGT status (Fig. I-1, Fig. I-2). WAAFLE may also call different amounts
of LGT depending on its five parameters, which include subject coverage (s), overlap
percentage (o), gene length (g), one-taxon threshold (k1), and two-taxon threshold (k2) (Table I-1).
The first three parameters are utilized to minimize false positive gene calls (step 2), which may
lead to increased LGT calls. The last two parameters are employed in LGT classification (step 4).
Specifically, WAAFLE identifies a contig as not having LGT if one taxon has taxon scores
greater than k1 across all genes. If no single taxon scores above k1, WAAFLE searches for two
taxa that collectively have taxon scores greater than k2 across all genes. If the contig contains
such a pair, it is classified as LGT; otherwise, it is classified as “ambiguous”. Lowering k1 and
raising k2 thresholds make it more difficult to call LGT.
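The classification rule described above can be sketched as follows. This is a minimal illustration, not the WAAFLE implementation: the function name is hypothetical, and "collectively" is read here as meaning that, for every gene, at least one member of the taxon pair scores above k2. The default values of 0.5 and 0.8 are those selected in the synthetic evaluation in this chapter.

```python
from itertools import combinations


def classify_contig(gene_scores, k1=0.5, k2=0.8):
    """Classify a contig from per-gene taxon scores.

    gene_scores: list with one {taxon: score} dict per gene on the contig.
    """
    taxa = sorted({t for scores in gene_scores for t in scores})
    # One-taxon test: a single taxon scores above k1 on every gene.
    for t in taxa:
        if all(scores.get(t, 0.0) > k1 for scores in gene_scores):
            return "no LGT"
    # Two-taxon test (assumed reading): every gene is explained above k2
    # by at least one member of the pair.
    for a, b in combinations(taxa, 2):
        if all(max(s.get(a, 0.0), s.get(b, 0.0)) > k2 for s in gene_scores):
            return "LGT"
    return "ambiguous"


# Toy contigs with invented scores.
lgt_contig = [{"A": 0.95, "B": 0.2}, {"A": 0.1, "B": 0.9}, {"A": 0.97}]
single_taxon_contig = [{"A": 0.9, "B": 0.6}, {"A": 0.7}]
unclear_contig = [{"A": 0.9}, {"B": 0.6}]
```

Because the one-taxon test runs first, lowering k1 lets more contigs exit as "no LGT", while raising k2 lets fewer pairs pass; both changes reduce the number of LGT calls.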
WAAFLE performance on synthetic data
To set the default WAAFLE parameters, we generated a synthetic dataset from the NCBI
isolate genomes. This dataset consisted of 1000 contigs spanning 8 taxonomic levels, with 25
donor-recipient pairs at each level. Each contig was created by i) selecting a donor-recipient pair
with some taxonomic level difference, ii) choosing a recipient genome fragment containing
three genes, iii) replacing the center gene (of the fragment) with a random gene from the donor
taxon, and iv) truncating the contig ends (Fig. 2-1C). It should be noted that the NCBI isolate
genomes used to generate the synthetic dataset are the same genomes used to create WAAFLE’s
species pangenome database. As a result, the species pangenome database contains all the
species and genes present in the synthetic contigs. In reality, the reference database will be
missing species and genes potentially present in biological data. To simulate missing
information in the pangenome database, we removed 20% of the BLAST alignments (to the
synthetic contigs) generated from the first step in WAAFLE.
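The construction of a synthetic contig and the removal of alignments can be sketched as below. Genes are represented only by labels, contig-end truncation is not modeled, and uniform random removal of hits is an assumption (the text does not state how the 20% were chosen). Both function names are hypothetical.

```python
import random


def make_synthetic_contig(recipient_genes, donor_genes, rng):
    """Replace the center gene of a three-gene recipient fragment with a donor gene."""
    start = rng.randrange(len(recipient_genes) - 2)
    left, _, right = recipient_genes[start:start + 3]
    return [left, rng.choice(donor_genes), right]


def drop_hits(hits, fraction, rng):
    """Randomly discard `fraction` of the alignments to mimic database gaps."""
    n_keep = len(hits) - round(len(hits) * fraction)
    return rng.sample(hits, n_keep)


rng = random.Random(0)
contig = make_synthetic_contig(["r1", "r2", "r3", "r4"], ["d1", "d2"], rng)
kept = drop_hits(list(range(100)), 0.2, rng)
```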
We first evaluated WAAFLE’s ability to call genes by varying three parameters, subject
coverage, overlap, and gene length, during step 2 of the WAAFLE pipeline. We compared each
set of WAAFLE gene calls to the NCBI gene annotations. True positives were defined as the
number of NCBI genes with corresponding WAAFLE genes, while false positives were defined
as the number of single NCBI genes with multiple corresponding WAAFLE genes (to one NCBI
gene) and the number of WAAFLE genes with no corresponding NCBI genes. The true positive
rate (TPR) ranged from 0.691 to 0.841 while the positive predictive value (PPV) ranged from
0.955 to 0.994 (Fig. I-3). Overall, we found that lower overlap, increased subject coverage, and
increased gene length increased the PPV, with subject coverage and gene length having the
greatest effect. Since increasing the number of genes increases the potential of calling LGT, we
conservatively set gene calling parameters at 0.75 for subject coverage, 0.1 for overlap, and base
pairs (bp) for minimum gene length.
To evaluate LGT classification, we supplied WAAFLE with the WAAFLE-called genes
generated from the default parameters (mentioned above) and filtered for contigs containing at
least two genes. We then varied the one-taxon and two-taxon thresholds for step 4 of the
WAAFLE pipeline. Synthetic contigs with inter-species or above LGT events were considered
true positives when WAAFLE called LGT, and false negatives otherwise. Synthetic contigs with
inter-strain LGT events were considered true negatives if WAAFLE classified the contigs as
having no LGT, and false positives otherwise. The TPR ranged from 0.513 to 1 and false positive
rate (FPR) ranged from 0 to 0.111, where most false positives arose as a consequence of BLAST
hit removal (Fig. I-4). Higher one-taxon thresholds increased both TPR and FPR, while higher
two-taxon thresholds decreased both TPR and FPR (Fig. 2-2A, Fig. I-5). As the one-taxon
threshold increases, it becomes difficult to classify contigs as not having LGT, which leads to
more LGT calls and increases the number of true and false positives. In contrast, increases in the
two-taxon threshold make it difficult to classify contigs as having LGT, resulting in fewer true
and false positives. As such, we decided to set the one-taxon threshold at 0.5 and the two-taxon
threshold at 0.8. To evaluate organism calls, we examined the subset of correctly called LGT
contigs (true positives), and identified the taxonomic levels (kingdom through species) at which
WAAFLE correctly matches the reference taxa. WAAFLE often correctly annotated taxa down
to the family level, but did not always identify the correct genus or species (Fig. 2-2B).
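The definitions of true and false positives used in this evaluation can be expressed as a short sketch. The function name, the level labels, and the toy calls are hypothetical; only the labeling rule (inter-strain transfers as negatives, inter-species or above as positives) comes from the text.

```python
def evaluate_calls(contigs):
    """Compute (TPR, FPR) from (donor-recipient level, WAAFLE call) pairs.

    Inter-strain contigs are negatives: "no LGT" is a true negative and any
    other call is a false positive. All other contigs are positives: "LGT"
    is a true positive and any other call is a false negative.
    """
    tp = fp = tn = fn = 0
    for level, call in contigs:
        if level == "strain":
            if call == "no LGT":
                tn += 1
            else:
                fp += 1
        elif call == "LGT":
            tp += 1
        else:
            fn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return tpr, fpr


# Toy calls (invented) covering each outcome.
calls = [
    ("species", "LGT"), ("genus", "ambiguous"), ("family", "LGT"),
    ("strain", "no LGT"), ("strain", "ambiguous"), ("strain", "no LGT"),
]
tpr, fpr = evaluate_calls(calls)
```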
Figure 2-2. WAAFLE parameter evaluation. Using the WAAFLE-called genes (using the default
gene calling parameters), we examined how the one-taxon (k1) and two-taxon (k2) thresholds
would affect A) LGT classification and B) taxonomic assignment. For the left half of the figure,
we set k2 at 0.8 while varying k1 from 0.1 through 0.9. For the right half of the figure, we set k1 at
0.5 while varying k2 from 0.1 through 0.9. A) Colors indicate k2, and the x-axis indicates the
taxonomic level difference between the donor and recipient genomes. For example, we observe
lower TPR and FPR for inter-species LGT. B) Colors indicate k1, and the taxonomic level at
which WAAFLE correctly identified an organism. For example, lower percentages are observed
for correctly calling a taxon at the species level.
Rates of novel LGT events across the human microbiome
We used WAAFLE to interrogate the expanded Human Microbiome Project (HMP1-II)
[174]: a dataset that includes 2,341 shotgun metagenomes sampled from 265 individuals at
diverse body sites at up to three time points (http://hmpdacc.org). For quality control, we
removed samples with poor assembly (with less than 1,000 gene calls across contigs) and
inconsistent taxonomic profiles (appeared as outliers in ordination analyses), and then filtered
out contigs that resembled mis-assemblies (described earlier, see Methods). We first set out to
develop a measure for LGT frequency, which would allow us to quantify LGT for taxon pairs
across body sites. Within a metagenomic assembly, each LGT event detected by WAAFLE is
likely to be i) unique; ii) novel, since the use of a reference database should exclude previously
detected LGT events; and iii) fixed in the population, since erratic events are likely not
assembled. Thus, an increase in LGT frequency (per sample) as measured by WAAFLE
represents an increase in unique LGT events, which can further be stratified by taxon pairs.
We generated two measures to quantify LGT frequency, which included i) gene
percentages (the number of genes in LGT contigs normalized by the total number of sample
genes) and ii) events per gene (the number of LGT contigs normalized by the total number of
sample genes). The two measures may not correspond due to differences in assembly: samples
with multiple short contigs may have low gene percentages and high events per gene, while
samples with LGT in a few long contigs may have high gene percentages and low events per
gene. Still, we found that both measures were highly correlated across body sites (Fig. I-6), thus
increases in either measure generally indicate higher LGT frequencies. We then used gene
percentages to determine if WAAFLE is reproducible across technical replicates. As expected,
LGT pairs were most similar between technical replicates, followed by intra-individual and
then inter-individual samples based on Jaccard and Bray-Curtis distances (Fig. I-7). Distances
for LGT pairs were much higher than those for single taxon gene percentages, indicating that
similar taxonomic gene profiles still lead to highly variable LGT profiles. For the remainder of
the analyses, we used only assemblies unique to an individual, body site, and time point,
leaving 1,128 assemblies with 237 from stool, 208 from tongue dorsum, 191 from supragingival
plaque, 182 from buccal mucosa, 94 from anterior nares, and 89 from posterior fornix.
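The two frequency measures defined above (gene percentages and events per gene) can be sketched for a single sample as follows. The function name and the toy assemblies are hypothetical; they are chosen only to show how the two measures can diverge when assemblies differ in contig length.

```python
def lgt_frequencies(contigs):
    """Return (gene percentage, events per 1000 genes) for one sample.

    contigs: list of (number of genes, contig has LGT) tuples.
    """
    total_genes = sum(n for n, _ in contigs)
    lgt_genes = sum(n for n, has_lgt in contigs if has_lgt)
    lgt_contigs = sum(1 for _, has_lgt in contigs if has_lgt)
    gene_pct = 100.0 * lgt_genes / total_genes
    events_per_1000 = 1000.0 * lgt_contigs / total_genes
    return gene_pct, events_per_1000


# Many short contigs: five 2-gene LGT contigs among 1000 genes.
short_sample = [(2, True)] * 5 + [(2, False)] * 495
# One long LGT contig of 50 genes among 1000 genes.
long_sample = [(50, True)] + [(10, False)] * 95
```

With these toy inputs the short-contig sample has the lower gene percentage but the higher events per gene, which is the situation described in the text.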
LGT is an adaptive mechanism that may facilitate microbial survival and maintenance at
the individual or community level. Cataloguing high frequency LGT pairs identifies the
partners and genes each taxon has access to, and furthers understanding of their interactions.
We thus characterized high frequency LGT pairs across six body sites, and found that they
generally fell into three categories, those with high phylogenetic relatedness [61], large joint
abundances, and similar functions or niches (Fig. 2-3A). Pairs with closely related taxa included
Bacteroides with Parabacteroides (0.746% genes, average phylogenetic distance PD=1.02),
Odoribacter (0.0947%, PD=1.77), or Alistipes (0.260%, PD=1.95), all of which were found in the
stool and considered inter-family transfers, despite relatively short phylogenetic distances. In
contrast, some taxa with high abundances transferred regardless of phylogenetic distance,
including Lactobacillus and Gardnerella (0.137%, PD=8.34) in the posterior fornix, and
Corynebacterium and Propionibacterium (0.0522%, PD=3.14) in the anterior nares. Lastly, some
taxa pairs have overlapping functions or niches. Eubacterium and Roseburia (0.0637%, PD range
0.81 to 6.44) in stool are both butyrate producers that decrease in abundance with lower intake
of carbohydrates [176, 177]. Oral taxa have close physical proximity through biofilms; one
example includes a corncob structure found in supragingival plaque consisting of
Corynebacterium and Streptococcus, with an outer ring of Haemophilus and Aggregatibacter [178],
which may explain high frequency transfers for each pair, but not across the two pairs in both
buccal mucosa and supragingival plaque.
Figure 2-3. LGT rates are highest for oral and stool sites. A) For each body site, we display
LGT between the ten genera with the highest gene percentages via heatmaps. Each row and
column represent a single genus and off-diagonal cells represent LGT gene percentages. Colors
indicate the number of genes for the row taxa in LGT contigs involving both row and column
taxa divided by the total genes per sample, averaged across body site, resulting in an
asymmetrical matrix. The histogram above each heatmap shows the average number of genes
per sample across body sites. B) Each point represents one sample in the body site. LGT
frequencies on the y-axis are measured as the number of LGT contigs divided by the total
number of sample genes, plotted on a log2 scale per 1000 genes.
Different environments may also facilitate or hinder LGT. This is evident in the patterns
we see for the six different body sites: taxa seem to transfer indiscriminately in the stool and
oral sites, but appear more selective in the anterior nares and posterior fornix (Fig. 2-3A). We
therefore investigated whether differences in overall LGT frequency are attributable to body
site. To do this, we calculated the overall extent of LGT in each body site using events per gene.
LGT frequencies were highest in the stool (median m=2.898 events per 1000 genes), followed
by multiple oral sites, including the supragingival plaque (m=2.134), tongue dorsum (m=2.129),
and keratinized gingiva (m=1.799). Frequencies were lowest in the vaginal and skin sites (Fig. 2-
3B). To further understand how technical and biological effects might affect LGT rates, we
performed a linear regression using events per gene as the dependent variable, and technical
and biological effects as the explanatory variables. Technical effects included the number of
contigs per sample and contig size (genes/contigs), while biological effects included genus
richness, genus evenness, and body site. Significant predictors of LGT frequency included
body site (p<2e-16), average contig size (p=2e-16, positive coefficient), and species evenness
(p=2e-16, positive coefficient). These observations indicate that sites with high LGT rates are i)
mucosal and ii) have higher alpha diversity, in which evenness plays a larger role than richness.
LGT frequency and pair formation are shaped by abundance and phylogeny
We next set out to characterize the overall effect of phylogeny and taxon abundance on
LGT frequencies. To this end, we calculated phylogenetic distances and joint abundances for
each LGT taxon pair, and estimated how well each of these variables predicted LGT gene
percentages using a nonparametric generalized additive model smoother. Phylogenetic distance
was calculated by measuring branch length between two taxa in the PhyloPhlAn phylogenetic
tree [50], which represents the average number of nucleotide substitutions between two taxa.
Joint abundance was calculated by multiplying one taxon’s abundance by the other: taxon
abundance was quantified as the total number of genes for a single taxon (across all contigs
regardless of LGT status) divided by the total number of genes per sample, averaged across a
body site. We observed an increase in LGT gene percentages at low phylogenetic distances (Fig.
2-4A), and an increase in LGT gene percentages as joint abundances increase (Fig. 2-4B). The
former suggests that species level LGT events fix in the population more often than higher level
LGT events. Phylogeny is known to affect LGT: closely related partners have shared DNA
composition and transcriptional/translational machinery, allowing them to successfully
integrate and express transferred genes [61]. The latter suggests that taxonomic abundance
leads to increased transfer opportunities and thus higher rates irrespective of phylogenetic
distance.
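The joint abundance calculation described above can be sketched as below, assuming per-sample gene counts per taxon are available. Following the text, each taxon's abundance is averaged across a body site's samples before the two abundances are multiplied. The function names and toy counts are hypothetical.

```python
from statistics import mean


def taxon_abundance(samples, taxon):
    """Mean fraction of genes assigned to `taxon` across a body site's samples.

    samples: list of {taxon: gene count} dicts, one per sample.
    """
    return mean(s.get(taxon, 0) / sum(s.values()) for s in samples)


def joint_abundance(samples, taxon_a, taxon_b):
    """Product of the two taxa's site-averaged abundances."""
    return taxon_abundance(samples, taxon_a) * taxon_abundance(samples, taxon_b)


# Two toy samples from one body site (invented gene counts).
site = [{"A": 50, "B": 25, "C": 25}, {"A": 30, "B": 10, "C": 60}]
```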
Figure 2-4. Both abundance and phylogeny affect LGT rates. For both plots, each point
represents a taxa pair, and smoothing functions are fit by a generalized additive model using
cubic splines. Only taxa pairs annotated to at least the genus level are included, and taxa pairs
found in a single sample (across body sites) are colored in gray. All other pairs are colored by
inter-taxon LGT level (e.g., inter-species LGT pairs are red). In A), the x-axis displays the
phylogenetic distance between the two taxa, while the y-axis shows the LGT gene percentages,
or the average number of LGT genes in a taxa pair divided by total number of genes in a
sample. In B), the x-axis shows the joint abundance, and the y-axis is the same as A). Joint
abundances are calculated by multiplying one taxon’s gene percentage against another taxon’s
gene percentage. Colors are the same as A).
We further examined how phylogeny and taxon abundances influence LGT pair
formation, regardless of LGT rate. For phylogenetic distances, we observed that the HMP LGT
pairs form bi-modal distributions across body sites. This distribution may indicate selective pair
formation at specific distances, or reflect taxonomic bias in NCBI reference genomes. To
distinguish between these two hypotheses, we compared the phylogenetic distance distribution
from HMP LGT pairs to the distribution from randomly generated LGT pairs. We observed that
the phylogenetic distance distributions are significantly different via the Kolmogorov-Smirnov
test: randomly generated pairs have on average larger phylogenetic distances than HMP
LGT pairs (Fig. I-8A), indicating that LGT preferentially occurs between closely related species.
We repeated this analysis for LGT joint abundances, and found that randomly generated taxon
pairs had higher joint abundances than HMP LGT pairs (Fig. I-8B). This suggests that
LGT pair formation occurs more often than expected between rare taxa, which may be
supported by the physical structure and community organization of microbial communities.
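The comparison between observed and randomly generated pairs relies on the two-sample Kolmogorov-Smirnov test. As a minimal sketch, the statistic itself (the largest gap between the two empirical cumulative distributions) can be computed as below; the p-value calculation is omitted, and the function name and toy distances are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # fraction of xs <= v
        fy = bisect_right(ys, v) / len(ys)  # fraction of ys <= v
        d = max(d, abs(fx - fy))
    return d


# Toy phylogenetic distances (invented): observed pairs skew shorter.
observed = [1, 2, 3, 4]
randomized = [3, 4, 5, 6]
```

A statistic near 0 indicates similar distributions, while a value near 1 indicates almost no overlap.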
Genera have preferred transfer partners that are shared across similar sites
Individual taxa may vary in partner choice: some may be promiscuous, while others are
more selective. We can identify these preferences by representing LGT pairs as a network, in
which nodes are genera and edges are unique LGT events. We generated networks for each of
the six body sites, and then calculated degree for every node (genus) along with the percentage
of genes found in LGT events involving that node (Fig. 2-5A). As expected, genera with higher
frequencies of LGT also have large numbers of partners. Interestingly, the majority of LGT
events for these genera was accounted for by a small number of partners: for example, 90% of
genes for LGT events involving Streptococcus are transferred with 11 (out of 57), 22 (out of 91),
and 19 (out of 88) genera in the buccal mucosa, supragingival plaque, and tongue dorsum,
respectively. Still, we attempted to identify taxa that i) had a large number of partners and a
relatively large number of preferred partners, and ii) had a large number of partners and a
relatively small number of preferred partners. The former represent more promiscuous taxa,
while the latter may be more selective. The former category included Streptococcus (Fig. 2-5B),
Actinomyces, Veillonella, and Haemophilus in the oral sites, as well as Clostridium and
Faecalibacterium in stool. The latter category included Aggregatibacter in the oral sites (Fig. 2-5B),
Bacteroides in the stool, and Corynebacterium and Propionibacterium in the anterior nares and
supragingival plaque. LGT may not be as advantageous for these latter taxa, which might lead
to limited transfer abilities.
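The two quantities compared in this analysis, a genus's degree and the number of partners needed to account for 90% of its LGT genes, can be sketched as follows. The function name and the toy gene counts are hypothetical.

```python
def partner_summary(partner_genes, fraction=0.9):
    """Return (degree, fewest partners whose genes reach `fraction` of the total).

    partner_genes: {partner genus: number of LGT genes shared with the focal genus}.
    """
    total = sum(partner_genes.values())
    running, needed = 0, 0
    # Add partners from largest to smallest contribution until the target is met.
    for count in sorted(partner_genes.values(), reverse=True):
        running += count
        needed += 1
        if running >= fraction * total:
            break
    return len(partner_genes), needed


# A selective genus: most transferred genes involve one or two partners.
selective = {"a": 80, "b": 15, "c": 3, "d": 2}
# A promiscuous genus: transferred genes are spread evenly across partners.
promiscuous = {"a": 25, "b": 25, "c": 25, "d": 25}
```

Both toy genera have the same degree, but the selective one needs far fewer partners to explain most of its transfers.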
Figure 2-5. Taxa degree and differential edges. A) Across the six body sites, we compared the
total number of LGT partners for a given genus against the number of partners needed to
explain 90% of genes in LGT transfers. Points are colored by the number of genes in LGT events
involving the given genus normalized by total sample genes, which are averaged across
samples and log2 normalized. Taxa in the upper right corner are more promiscuous: these
genera have many partners and need more partners to explain transfer; while taxa in the lower
right corner are more selective: they have the ability to transfer with multiple taxa but mostly
transfer with a few. Several genera are designated by letters as shown in B). B) We show an
example of a promiscuous taxon, Streptococcus, along with a selective taxon, Aggregatibacter. The
x-axis displays body site, while the y-axis is the gene percentage for LGT pairs, proportionally
scaled to the square root of the total sum gene percentage. C) Arc diagrams display directional
transfers in the buccal mucosa, supragingival plaque, and tongue dorsum. Solid black circles
represent genera, and size indicates average taxon gene percentages for the corresponding site.
Arcs indicate directional transfer between two circles in a counterclockwise fashion: arcs above
two circles indicate donation of genes from the right node to the left node, and vice-versa for
arcs under the two nodes. Arc width indicates the average number of LGT contigs with that
direction normalized by total number of genes per sample. Arcs colored in blue are found in all
three oral sites, while arcs in red are found in two oral sites.
We next investigated which LGT events were shared across multiple body sites. The
networks for the anterior nares (nodes=61, edges=166), posterior fornix (n=85,
e=342), and buccal mucosa (n=130, e=890) had fewer nodes and edges than those for stool (n=174,
e=2698), supragingival plaque (n=242, e=2812), and tongue dorsum (n=188, e=2898). Across all
six sites, only 3 edges were shared, including Bacteroides and Parabacteroides, Bacteroides and
Capnocytophaga, as well as Peptoniphilus and Streptococcus, while 2212 edges were unique to one
site. This is not surprising: the six sites have distinct taxonomic compositions, along with
different environments and selective pressures, which leads to different LGT pairs. Accordingly,
we focused on the intersection of the three oral networks, which shared 308 pairs, of which 232
pairs were not found in non-oral sites. Some oral pairs were found at differential frequencies
across sites: for example, Streptococcus (degree=49) had higher percentage of transfers with
Gemella, Capnocytophaga, and Prevotella in the buccal mucosa, supragingival plaque, and tongue
dorsum, respectively (Fig. 2-5B). Despite consistent partners in the oral sites, some of these
genera had completely different partners in non-oral sites. For example, Streptococcus paired
mostly with Lactobacillus in the posterior fornix and Dolosigranulum in the anterior nares.
Continuing our focus on the oral sites, we looked to see if oral taxon pairs might have
preferences in transfer directionality, or if one taxon consistently donates or receives genes from
its partner. We assigned directionality to LGT contigs with outer genes annotated as one taxon
(designated as the recipient), and inner genes annotated as a different taxon (designated as the
donor). We quantified events per gene for directional LGT pairs, and filtered for pairs found in
at least 10% of samples across each site. For each pair, we took the maximum directional LGT
frequency across oral sites, and then selected for pairs in the 75th percentile or above. We then
plotted all edges associated with the 21 genera in the selected pairs (Fig. 2-5C). Across all three
oral sites, Streptococcus, Veillonella and Pasteurella preferentially donated to Haemophilus, Rothia
and Aggregatibacter preferentially donated to Neisseria, and Simonsiella preferentially donated to
Eikenella. Other transfers have no donor or recipient preference: these include Gemella or
Granulicatella with Streptococcus, as well as Neisseria and Haemophilus. We hypothesized that
recipients may be the more abundant taxon (as compared to donors) within the community.
Although recipients often made up a larger portion of the contig in which they are found, they
were not consistently the more abundant taxon. Furthermore, some directional transfers were
site-specific, indicating that environment may also facilitate donor/recipient dynamics.
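The directionality rule described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function name `assign_direction` and the representation of a contig as an ordered list of per-gene taxon labels are assumptions made for the example. A contig is treated as directional only when a run of genes from one taxon is flanked on both sides by genes from a second taxon.

```python
def assign_direction(gene_taxa):
    """Return (donor, recipient) for a contig whose genes follow the pattern
    recipient ... donor ... recipient, otherwise None.

    gene_taxa: taxon label of each gene, in contig order (hypothetical input).
    """
    if len(gene_taxa) < 3:
        return None
    # Collapse consecutive runs of the same taxon: A A B A -> A B A
    runs = [gene_taxa[0]]
    for taxon in gene_taxa[1:]:
        if taxon != runs[-1]:
            runs.append(taxon)
    # Exactly one inner run flanked by the same outer taxon on both sides
    if len(runs) == 3 and runs[0] == runs[2]:
        return runs[1], runs[0]  # inner taxon = donor, outer taxon = recipient
    return None
```

Contigs in which the second taxon sits at a contig end (for example A, A, B) return None, because the outer/inner relationship cannot be established from gene order alone.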
Mobile elements and TonB receptors are enriched in LGT contigs
Laterally transferred gene functions have been shown to be i) for adaptation, rather than
for information storage [68], and ii) the outer component of an interaction network (such as a
signaling or metabolic pathway), as opposed to a central component [69, 70]. We aimed to
determine if such trends persist for novel LGT events in the HMP1-II metagenomes. To do this,
we searched for gene functions enriched in LGT contigs. We quantified the number of UniRef90
terms from LGT and non-LGT contigs, aggregated them into Pfam clans [179], and performed
Fisher’s Exact Test to identify Pfam clans associated with LGT contigs (as compared to all
contigs). Enriched and depleted Pfam clans could be divided into five groups: i) DNA-binding
proteins such as transposases, ribonucleases, and exo/endonucleases; ii) mobile elements including
phage, plasmids and toxin/antitoxin systems; iii) specific enzymes such as GMP synthase and
the FMN-binding split barrel superfamily, the latter consisting mostly of
pyridoxine/pyridoxamine 5'-phosphate oxidase; iv) transport systems including ABC
transporters and TonB dependent receptors, and v) antibiotic resistance genes (ARGs) (Fig. 2-
6A). As expected, groups i) and ii) were enriched across most body sites, with the exception of
plasmid toxin-antitoxin systems, which were enriched in the oral sites, as well as the NUMOD4
motif, which is part of an endonuclease found in Bacteroides [180], and was enriched only in
stool. Groups iv) and v) contained mixed results: inner membrane transport proteins, such as
ABC transporter permeases, were depleted, while outer membrane beta-barrel proteins and
TonB-dependent receptors were enriched.
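The enrichment test for a single Pfam clan can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis code: the 2x2 table layout (clan versus other genes, in LGT versus non-LGT contigs) and the 0.5 pseudocount in the odds ratio are assumptions for the example, and the "log2 normalized odds ratio" shown in the figure may be computed differently. The pure-Python test is only practical for small counts.

```python
from math import comb, log2

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: a = clan genes in LGT contigs, b = other genes in LGT contigs,
    c = clan genes in non-LGT contigs, d = other genes in non-LGT contigs.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

def log2_odds_ratio(a, b, c, d, pseudocount=0.5):
    """log2 odds ratio; the pseudocount (an assumption) avoids division by zero."""
    a, b, c, d = (x + pseudocount for x in (a, b, c, d))
    return log2((a * d) / (b * c))
```

A positive value from `log2_odds_ratio` corresponds to enrichment of the clan in LGT contigs, and a negative value to depletion.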
Figure 2-6. Enriched functions show taxon and structural similarities across sites. A) We
searched for Pfam clans enriched and depleted in LGT contigs by aggregating UniRef90 terms
for Fisher’s Exact Test. Each cell within the heatmap is colored by the log2 normalized odds
ratio, in which a positive value indicates enrichment of the Pfam clan in LGT contigs, whereas a
negative value indicates depletion of the Pfam clan in LGT contigs. B) We counted the number
of genes for UniRef90 annotations in enriched Pfam clans, specifically plasmid-related genes,
transcriptional regulators, TonB receptors, and ISNme transposases (from left to right). The x-
axis is labeled using two color bars: the first bar indicates the UniRef90 annotation, while the
second bar indicates the body site; colors for the latter correspond to A). The y-axis displays the
number of genes found in LGT genes stratified by genus, and is proportionally scaled to the
square root of the total number of LGT genes. C) We show a single contig containing a Neisseria
and Haemophilus LGT event in the buccal mucosa. From top to bottom, we first show a graph in
which the x-axis is the length of the contig and the y-axis is the taxon score. Arrows represent
aggregated BLAST hits; those in red are for genus Neisseria, while those in blue are for
Haemophilus. Below, we display the called genes and their assigned UniRef90 functions. Lastly,
we examined other oral sites and searched for contigs with Neisseria-Haemophilus LGT events
with the UniRef90 term E3D293. These contigs are colored by UniRef90 function and labeled
with sample number; many share synteny with the example contig.
We next examined groups with potential adaptive functions, including group iii) with
GMP synthase and pyridoxine/pyridoxamine 5'-phosphate oxidase, and group v) ARGs. GMP
synthase and pyridoxine/pyridoxamine 5'-phosphate oxidase are likely LGT markers rather
than transferred functions: GMP synthase is hypothesized to be part of integration sites [181],
and has been found at the 3’ end of integrative and conjugative elements in Staphylococcus
aureus, Listeria monocytogenes, Clostridium perfringens, and Enterococcus faecalis, which are four
Gram-positive bacteria with low GC content [182]. Pyridoxine/pyridoxamine 5'-phosphate
oxidase may also be an LGT marker: manual examination of LGT contigs with this function
revealed that it is frequently found in conserved regions near transferred genes. Surprisingly,
antibiotic resistance (ABR) was depleted outside of the VOC superfamily, which may contain
glyoxalase and bleomycin resistance genes. Depletion may be due to the lack of selection for
ABR in healthy human subjects, as well as WAAFLE’s inability to detect ABR LGT events
already present in the reference database. Still, many enriched genes within the helix-turn-helix
binding proteins (CL0123) are from TetR, AraC, and MerR family transcriptional regulators, of
which the first and last may regulate tetracycline and mercury resistance, respectively.
AraC was associated with LGT for iron acquisition regions in cheese (Fig. 2-6B) [183].
Lastly, we searched for functions that were specific to taxa. To do this, we extracted
UniRef90 terms from significant Pfam clans, and determined which taxa they were derived
from. Examples include the ISNme transposase (CL0219), which was found almost exclusively
in Neisseria across oral sites; the plasmid recombination enzyme (CL0169), which was mostly in
Streptococcus and Prevotella in oral sites, but spread across Bacteroides, Parabacteroides, Alistipes,
and Clostridium in stool; and the TonB-dependent receptor plug and TonB-linked outer
membrane protein SusC/RagA family (PF07715, CL0193, CL0287), which was found mostly in
Capnocytophaga and Prevotella across oral sites, Prevotella in anterior nares and posterior fornix,
and Bacteroides, Parabacteroides, and Alistipes in stool (Fig. 2-6B). We looked specifically at LGT
contigs containing ISNme transposons, which occurred almost exclusively between Neisseria
and Haemophilus. These contigs contained a conserved structure across oral sites in multiple
samples (Fig. 2-6C).
Discussion
LGT is a strong evolutionary force: assuming one LGT event for every 10^10 vertical
replications, no gene in any modern genome can be linked to the last universal common
ancestor (LUCA) through vertical descent [15]. Most studies and computational tools have
focused on whole genomes, which makes characterization of LGT within microbial
communities particularly challenging. First, the use of reference genomes removes the microbial
community context (i.e. the genome is obtained from culture rather than the community).
Second, the assembly of complete genomes from microbial communities is experimentally and
computationally challenging, requiring either low diversity communities [157] or single cell
genomics [67]. We addressed both limitations by developing WAAFLE, which detects novel
LGT events directly from partially assembled metagenomes. With this, we can begin to ask i)
whether novel LGT events consistently occur in microbial communities, ii) which biological
factors affect LGT frequency, and iii) which taxa and functions are exchanged. In our validation
with synthetically generated LGT events, WAAFLE performed well, with high true positive
rates for LGT detection and taxonomic assignment. We then applied WAAFLE to the Human
Microbiome Project 1 Phase II and quantified LGT frequencies across multiple body sites.
Increased LGT frequencies were associated with overall community trends such as greater
community evenness and body sites (stool and oral), as well as individual taxon pairs with
higher community abundances and small phylogenetic distances. We also observed that mobile
genetic elements and outer membrane proteins were enriched in LGT contigs. Overall, this
demonstrates that WAAFLE can generate biological insights using existing metagenomic
data.
It is important to consider the biological interpretation for LGT frequency, which
depends on i) the data from which LGT is detected and ii) the quantification method. In
WAAFLE specifically, the use of metagenomes means that each detected LGT event is unique to
a sample and fixed in the population, while the use of a reference database means that each
detected event should not have been previously characterized in reference genomes. Strikingly,
our study detected multiple LGT events in six major body sites, demonstrating that LGT is an
ongoing process in which events continuously fix in microbial populations. We next quantified
LGT frequencies as the i) number of LGT contigs per gene and ii) number of genes in LGT
contigs per gene. We hypothesized that higher LGT frequencies as detected by WAAFLE were
likely caused by an increased number of unique taxon pair combinations and/or increased
fixation rates. Indeed, across all six body sites, we found that higher community evenness,
along with larger taxonomic abundances and smaller phylogenetic distances between taxon
pairs, led to increased LGT frequency. With this, we propose that LGT occurs universally
between taxa, in which greater community evenness increases the number of unique taxon pair
combinations, and higher joint taxonomic abundance increases the probability of exchange.
Fixation of events is then limited by factors such as phylogenetic distance.
WAAFLE has several limitations that should be taken into account. First, WAAFLE is
ultimately affected by the quality of the metagenome assembly, which is in turn influenced by
biological factors such as community evenness and richness. Consequently, LGT frequencies were
difficult to compare across sites: the posterior fornix had fewer contigs, and had close to the
highest or lowest frequencies across body sites depending on the measure used. Samples with
longer contig lengths (gene to contig ratio) also tended to have increased LGT frequency,
especially those in gut and oral sites, though vaginal sites were not affected due to low
community diversity. Second, WAAFLE’s parameters are tuned to be conservative with LGT
calls (minimizing false positives). As such, WAAFLE underestimates LGT events, especially for
inter-genus and inter-species LGT events, where LGT is most likely to occur. WAAFLE is
also unable to detect strain-level LGT events, as the default reference database is annotated to
the species level. Third, WAAFLE lacks the ability to infer donor and recipient for most events.
This study briefly identified donor and recipient taxa across oral sites based on taxon gene
order within contigs, but did not find consistent relationships between genera. A more focused
characterization of donor and recipient taxa using phylogenetic trees may reveal whether
specific taxa are prone to donating or receiving genes, and distinguish between transferred and
non-transferred gene functions (as opposed to genes on contigs with or without LGT).
Despite these limitations, WAAFLE allowed us to identify patterns for specific taxa and
LGT-enriched functions. We found that most taxa across sites were relatively selective about
their partners, even if they had the ability to transfer with multiple other taxa. For example,
promiscuous taxa such as Streptococcus transfer with many genera, while taxa such as
Aggregatibacter primarily transfer with Haemophilus. We also found that metagenomically-
enriched LGT functions included mobile genetic elements such as transposons, phage, and
plasmids as well as outer membrane proteins, suggesting that 1) LGT events involving mobile
elements are ongoing and relatively frequent as compared to transfer of other genes, and 2)
mobile elements are pangenome-specific and do not ameliorate. These two points are illustrated
in an example showing a Neisseria and Haemophilus transfer, in which the majority of the contig
consists of Haemophilus genes with a single gene matching a Neisseria-specific ISNme
transposase. This event is consistently detected across samples from the buccal mucosa,
supragingival plaque, and tongue dorsum, showing that certain LGT events may be prevalent
across individuals and taxon-pair specific.
We anticipate that future work will include generation of new measures for LGT
frequency, improved detection of donor and recipient taxa, and further investigation of specific
functions and taxa. As is, WAAFLE detects novel (not in reference genomes), recent (events
without amelioration), and fixed LGT events. Our ability to find these events enables us to i)
determine the timescale at which these events occur, through the use of time-series data; ii)
quantify the proportion of the microbial or human population that contains specific events, in which
LGT sweeps might correspond with strain sweeps [184]; and iii) identify environmental factors
that might influence LGT frequencies and transferred functions, through the use of case-control
studies. Improved classification of donor and recipient taxa may facilitate discovery of
transferred metabolic functions, which may be taxon-pair specific and were not detectable
across body sites. Unlike findings based on reference genomes [185], which can infer donor and
recipient, we did not see enrichment for antibiotic-resistance genes among LGT events. This may
be due to the use of a healthy cohort, rather than one taking antibiotics. More work is needed to
quantify the frequency and characteristics of novel LGT events in microbial communities, as
well as the variation in transferred functions in different cohorts and in response to selective
pressures. WAAFLE represents a step forward in characterizing LGT directly from microbial
communities, which will ultimately enable us to understand the roles of LGT for adaptation or
speciation in microbial communities.
Methods
Datasets
Metagenomic datasets used in this study were produced through the Human
Microbiome Project Phase 1-II [174]. The HMP data are publicly available through the HMP’s
public data repository (http://www.hmpdacc.org/). Contigs were assembled via IDBA-UD [186].
The pangenome reference database was generated by downloading NCBI isolate genomes,
binning isolate genes by species, and then clustering binned species genes at 97% nucleotide
identity [187].
Detecting LGT events from metagenomic shotgun sequencing datasets
WAAFLE takes one required input, i) contigs assembled from metagenomic data in
FASTA format, and two optional inputs, ii) gene calls for each contig in General Feature Format version 3
(GFF3), and iii) a nucleotide reference database of genes with taxonomic and functional
annotations. WAAFLE uses four steps to classify each contig as having LGT or not:
1. Contigs are searched against the ChocoPhlAn pangenome reference database
(https://bitbucket.org/biobakery/humann2/wiki/Home) using BLASTN with default
parameters.
2. If gene calls were not supplied, contigs are annotated with genes using overlapping
BLASTN alignments.
3. Within a contig, each gene is assigned multiple taxon scores. BLASTN hits are grouped
by taxonomic annotation and gene overlap. We then calculate a score from each group
using BLASTN hit percent identity and subject coverage.
4. Each contig is classified as “No LGT”, “LGT”, or “ambiguous” by examining whether all
genes across a contig are best explained by one taxon, two, or multiple, respectively (Fig.
2-1).
These steps can be tuned using 5 parameters: subject coverage (s), overlap percentage
(o), gene length (l), one taxon score (k1), and two taxon score (k2) (Table I-1). We describe each
step in detail below.
Step 2: Calling genes.
If gene calls are not supplied, we combine overlapping BLAST hits to call genes. BLAST
hits are first filtered by a subject coverage cutoff (s, default 0.75); subject coverage is defined as the
percentage of the reference gene (subject sequence) that aligned to the contig (query sequence).
For hits that aligned to contig ends, it is not possible for the full gene to align to the contig. We
thus calculated subject coverage by dividing the alignment length by the subject gene length
that can potentially align to the contig. Specifically, we subtracted the length of the subject gene
that ran off the contig from the total subject gene length.
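The contig-end correction can be illustrated with a short sketch. This is a sketch under stated assumptions, not WAAFLE's implementation: the function name and argument names are hypothetical, coordinates are taken to be 1-based and inclusive, and the hit is assumed to lie on the forward strand of the contig.

```python
def subject_coverage(slen, sstart, send, qstart, qend, contig_len):
    """Fraction of the alignable part of the reference gene covered by one hit.

    slen: length of the reference (subject) gene
    sstart, send: aligned interval on the subject
    qstart, qend: aligned interval on the contig (query)
    contig_len: length of the contig
    """
    aligned = send - sstart + 1
    # Subject bases upstream of the alignment that do not fit before the contig start
    left_overhang = max(0, (sstart - 1) - (qstart - 1))
    # Subject bases downstream of the alignment that do not fit before the contig end
    right_overhang = max(0, (slen - send) - (contig_len - qend))
    # Length of the subject gene that could potentially align to the contig
    alignable = slen - left_overhang - right_overhang
    return aligned / alignable
```

For example, a 1,000 bp gene whose last 600 bp align to the first 600 bp of a contig has 400 bp running off the contig, so its coverage is 600 / (1000 - 400) = 1.0 rather than 0.6.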
The filtered BLAST hits are then sorted by length and sequentially assigned to groups
based on overlap percentage. Hits and groups may be considered nucleotide fragments: overlap
percentage is calculated between two nucleotide fragments by dividing the length of the
overlap between the two fragments by the length of the shorter fragment. Specifically, each
BLAST hit is added to a group if it overlaps any existing group by at least the overlap percentage (o, default 0.1);
otherwise, a new group is created. After all BLAST hits have been
considered, each group is considered a gene, and the start and end sites are calculated as the
minimum start and maximum end of all BLAST hits encompassed (in the group). The resulting
genes are further filtered by length (l, default 200 bp).
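The grouping procedure can be sketched as below. This is a simplified illustration, not WAAFLE's code: the function names are hypothetical, hits are assumed to be sorted from longest to shortest, a hit joins the first group it sufficiently overlaps, and group boundaries are extended as hits are added rather than computed at the end (which yields the same minimum start and maximum end).

```python
def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter interval.
    Intervals are (start, end), 1-based and inclusive."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return max(0, overlap) / shorter

def call_genes(hits, min_overlap=0.1, min_length=200):
    """Merge overlapping hit intervals into gene calls.

    hits: (start, end) contig intervals that already passed the coverage filter.
    """
    groups = []
    # Consider the longest hits first (assumed sort order)
    for hit in sorted(hits, key=lambda h: h[1] - h[0], reverse=True):
        for group in groups:
            if overlap_fraction(hit, group) >= min_overlap:
                # Extend the group to the minimum start and maximum end
                group[0] = min(group[0], hit[0])
                group[1] = max(group[1], hit[1])
                break
        else:
            groups.append(list(hit))
    # Each group is a gene; drop genes shorter than the length cutoff
    return sorted(tuple(g) for g in groups if g[1] - g[0] + 1 >= min_length)
```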
Step 3: Assign taxon scores to genes.
To assign taxon scores, WAAFLE combines the BLASTN results from step 1 and gene
annotations called from step 2 or supplied by the user. First, WAAFLE bins all BLAST hits (s,
default 0) to genes if they have overlap greater than o (o, default 0.1); it is possible to assign a
single hit to multiple genes. The top UniRef term across all BLAST hits assigned to a gene is
then annotated as the gene function. Second, for each gene in a contig, WAAFLE further groups
the binned BLASTN hits based on taxonomic annotation, which can be performed at different
taxonomic levels (such as kingdom, phylum, class, etc.). Each BLAST hit within the group is
scored by multiplying its percent identity by its subject coverage. For each nucleotide position
within the gene, we allot the maximum score across grouped BLAST hits, or if there were no
BLAST hits at that position, allot a score of 0. This results in a vector of scores per taxon per
gene, which we average for a single taxon score. Once each taxon has been scored at each gene,
each contig can then be represented by a table, S, with N rows (representing taxa) and M
columns (representing genes).
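The per-gene taxon score can be sketched as follows. This is a minimal sketch, not the WAAFLE implementation: the function name and tuple layout are hypothetical, and percent identity and subject coverage are assumed to be expressed as fractions between 0 and 1.

```python
def taxon_score(gene, hits):
    """Average per-position score of a single taxon over a single gene.

    gene: (start, end) on the contig, 1-based and inclusive.
    hits: (start, end, identity, coverage) tuples for one taxon.
    """
    start, end = gene
    # Positions with no BLAST hit keep a score of 0
    scores = [0.0] * (end - start + 1)
    for h_start, h_end, identity, coverage in hits:
        value = identity * coverage
        # Keep the maximum score across hits at every covered position
        for pos in range(max(start, h_start), min(end, h_end) + 1):
            scores[pos - start] = max(scores[pos - start], value)
    return sum(scores) / len(scores)
```

Calling this function for every taxon (row) and every gene (column) fills the table S used in Step 4.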
Step 4: LGT classification and taxonomic annotation of contigs.
Only contigs with more than 1 gene and more than 1 taxon are considered for LGT. To
search for LGT, we loop through seven taxonomic levels, starting at the species level and
ending at the kingdom level. The loop is terminated if the contig is i) classified as containing
LGT or not containing LGT, and ii) assigned a single taxon pair or taxon, respectively. At each
taxonomic level, we perform step 3, in which each contig is represented as table S, where each
entry S_ij contains taxon i's score for the jth gene.
Using this table, we define O(i) = min_j(S_ij) as taxon i's worst single-gene score, and C(i, i')
= min_j(max(S_ij, S_i'j)) as the worst single-gene score for the combination of taxa i and i'. If
max_i(O(i)) is larger than the one taxon score threshold (k1, default 0.5), then one taxon explains
the entire contig. If max_i(O(i)) < k1 and max_i,i'(C(i, i')) is larger than the two taxon score threshold
(k2, default 0.8), then i and i' jointly explain the contig, indicating LGT between taxa i and i'. If
neither threshold is met, the contig is annotated as “ambiguous”. If the contig is annotated as
“ambiguous”, the loop continues to a higher taxonomic level. If the contig is determined to
contain no LGT or LGT, WAAFLE performs taxonomic assignment.
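The classification rule at a single taxonomic level can be sketched as follows. This is an illustrative sketch, not WAAFLE's code: the function name is hypothetical, and the score table is represented as a dictionary mapping each taxon to its list of per-gene scores.

```python
from itertools import combinations

def classify_contig(S, k1=0.5, k2=0.8):
    """Classify one contig at one taxonomic level.

    S: dict mapping taxon -> list of per-gene scores (one row of the table).
    """
    # O(i): the worst single-gene score of each taxon
    one = {taxon: min(row) for taxon, row in S.items()}
    if max(one.values()) > k1:
        return "No LGT"
    # C(i, i'): the worst single-gene score when two taxa are combined
    two = [
        min(max(x, y) for x, y in zip(S[a], S[b]))
        for a, b in combinations(S, 2)
    ]
    if two and max(two) > k2:
        return "LGT"
    return "ambiguous"
```

An "ambiguous" result would trigger a repeat of this check at the next higher taxonomic level.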
Taxonomic assignment is performed as follows: if the contig is determined to contain no
LGT, the contig is annotated with the taxon i that attains max_i(O(i)). If multiple taxa have scores
equal to max_i(O(i)), we annotate the contig with the term “multiple” (rather than any taxon). If
the contig is determined to contain LGT, the contig is annotated with the taxa i and i' that attain
max_i,i'(C(i, i')). If multiple taxon pairs have scores equal to max_i,i'(C(i, i')), WAAFLE
determines whether the multiple pairs share one taxon, indicating that one taxon is known
while the other is uncertain. If so, WAAFLE names the uncertain taxon by
identifying the last common ancestor shared between all uncertain taxa, and assigns the contig
the taxon pair consisting of the universally shared taxon and that last common ancestor.
If all pairs are different, we annotate the contig with the term “multiple”. If contigs
are assigned the term “multiple”, the determined LGT status is rejected and the loop continues
to a higher taxonomic level. Otherwise, we complete the search and annotate the contig with its
LGT status and corresponding taxa.
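The handling of tied taxon pairs can be sketched as follows. This is a sketch under stated assumptions, not WAAFLE's implementation: the function names are hypothetical, and each taxon's lineage is assumed to be available as a tuple of names ordered from kingdom downward, so that the last common ancestor is the longest shared prefix.

```python
def last_common_ancestor(lineages):
    """Longest shared prefix of lineage tuples ordered from kingdom downward."""
    shared = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return tuple(shared)

def resolve_tied_pairs(pairs, lineage):
    """Resolve two or more taxon pairs tied for the best two-taxon score.

    pairs: list of (taxon, taxon) tuples; lineage: dict taxon -> lineage tuple.
    Returns (known_taxon, lineage_of_LCA) or the string "multiple".
    """
    # Taxa present in every tied pair
    shared = set(pairs[0]).intersection(*map(set, pairs[1:]))
    if len(shared) != 1:
        return "multiple"
    known = shared.pop()
    uncertain = [t for pair in pairs for t in pair if t != known]
    return known, last_common_ancestor([lineage[t] for t in uncertain])
```

A return value of "multiple" corresponds to rejecting the LGT call at the current level and continuing the loop at a higher taxonomic level.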
Other Options: Dealing with Unknown Taxa
First, in the case where there are no BLAST alignments to a gene (due to a user
supplying their own gene calls), WAAFLE by default assigns the gene a taxon score of 1 for the
taxon “Unknown”. This will result in WAAFLE either i) identifying the contig as an
inter-kingdom LGT between one taxon and the “Unknown”, or ii) identifying the contig as
“ambiguous” if no two taxa can explain the full contig. Second, users may choose to “spike”
an “Unknown” taxon into table S during Step 4, in which the “Unknown” score is equal to 1 -
max_i(O(i)) across all genes. Simulation with this flag has shown that WAAFLE will then call
LGT between one taxon and an “Unknown” for contigs containing multiple genes with low
taxon scores, so caution is advised if using this function.
Tuning parameters through grid search
WAAFLE has five parameters: subject coverage (s), overlap percentage (o), gene length (l),
one taxon score threshold (k1), and two taxon score threshold (k2). We constructed a set of 1000
synthetic contigs to set these parameters. Contigs were generated through a three-step process.
First, we randomly selected donor and recipient genomes that differed across 8 taxonomic
levels (kingdom, phylum, class, order, family, genus, species, and strain/no difference). Second,
we chose a three gene region within the recipient genome, and swapped out the center gene
with a random donor gene. At each taxonomic level, contigs contained 25 unique donor-
recipient pairs with 5 contigs each (for a total of 190 unique donors and 183 unique recipient
strains). Third, we truncated the contigs on both ends. After truncation, some contigs were left
with only one gene; these were removed, resulting in a different distribution across
taxonomic levels.
Gene Calling
We first assessed WAAFLE’s ability to call genes while varying three parameters:
i) subject coverage (0, 0.25, 0.5, 0.75, and 0.9), ii) overlap from 0.1 to 0.5 in
0.1 increments, and iii) gene length (0, 25, 50, 75, 100, and 200 bp). We then compared each NCBI
reference gene to WAAFLE-called genes, and vice-versa, to identify true positives, false
positives, and false negatives:
1. True positive: A WAAFLE-called gene overlaps the NCBI annotated gene by at least 80%.
2. False positive: A WAAFLE-called gene does not match any NCBI annotated gene, or two
or more WAAFLE-called genes match one NCBI annotated gene.
3. False negative: The reference gene does not match any WAAFLE-called gene.
Note that true negatives cannot be assessed meaningfully: these would be regions where
NCBI had no annotation, and WAAFLE did not call a gene. With this, we compared TPR
against PPV for each set of conditions (Fig. I-3).
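The comparison of called genes against reference genes can be sketched as follows. This is an illustrative sketch, not the evaluation code used here: the function names are hypothetical, a match is assumed to mean that the called gene covers at least 80% of the reference gene, and when several calls match one reference gene each of them is assumed to count as a false positive.

```python
def overlap_len(a, b):
    """Number of shared positions between two (start, end) inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def evaluate_gene_calls(reference, called, min_overlap=0.8):
    """Return (TPR, PPV) for called genes versus reference genes."""
    tp = fp = fn = 0
    matched_calls = set()
    for ref in reference:
        ref_len = ref[1] - ref[0] + 1
        hits = [i for i, call in enumerate(called)
                if overlap_len(ref, call) / ref_len >= min_overlap]
        matched_calls.update(hits)
        if len(hits) == 1:
            tp += 1            # exactly one call matches the reference gene
        elif hits:
            fp += len(hits)    # several calls match one reference gene
        else:
            fn += 1            # the reference gene was missed
    # Calls that match no reference gene are also false positives
    fp += len(called) - len(matched_calls)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return tpr, ppv
```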
LGT Classification
In order to set parameters k1 and k2, we performed a second grid search to characterize
WAAFLE’s ability to call LGT. We only included contigs with at least 2 genes. We varied four
parameters: i) subject coverage (0, 0.25, 0.5, 0.75, and 0.9), ii) overlap from 0.1 to 0.5 in
0.1 increments, iii) k1 from 0.1 to 0.9 in 0.1 increments, and iv) k2 from 0.1 to 0.9 in 0.1
increments. We then assessed positives and negatives as follows:
1. True positive: WAAFLE calls “LGT” for a synthetic contig with an inter-species LGT or
above.
2. True negative: WAAFLE calls “No LGT” or “ambiguous” for a synthetic contig with an
inter-strain LGT.
3. False positive: WAAFLE calls “LGT” for a synthetic contig with an inter-strain LGT.
4. False negative: WAAFLE calls “No LGT” or “ambiguous” for a synthetic contig with an
inter-species LGT or above.
It should be noted that WAAFLE does not have to call LGT at the correct taxonomic
level; thus, this assessment looks specifically at whether WAAFLE can detect LGT, not whether
it called the correct taxa.
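The four outcome definitions above can be expressed compactly. This is a minimal sketch with hypothetical function names; it assumes each synthetic contig is labeled with the taxonomic level at which donor and recipient differ, with "strain" denoting the inter-strain (negative) case.

```python
from collections import Counter

def score_call(call, true_level):
    """Map a WAAFLE call to TP, FP, TN, or FN.

    call: "LGT", "No LGT", or "ambiguous".
    true_level: level at which donor and recipient differ; "strain" is a negative.
    """
    has_lgt = true_level != "strain"
    if call == "LGT":
        return "TP" if has_lgt else "FP"
    return "FN" if has_lgt else "TN"

def rates(results):
    """results: iterable of (call, true_level); returns (TPR, FPR)."""
    counts = Counter(score_call(call, level) for call, level in results)
    tpr = counts["TP"] / max(1, counts["TP"] + counts["FN"])
    fpr = counts["FP"] / max(1, counts["FP"] + counts["TN"])
    return tpr, fpr
```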
Taxonomic Annotation
For correctly classified contigs, we assessed whether WAAFLE annotated contigs with
the correct taxa at each taxonomic level. To compare one taxon call against another, we looked
to see whether they had identical names at each phylogenetic level (i.e., same name at kingdom,
phylum, class, etc.). At best, two taxa may match across all seven levels; in the worst case,
two taxa may not match at all. For a contig without LGT, we compared the WAAFLE
taxon to the reference taxon. For a contig with LGT, we compared each WAAFLE taxon to each
reference taxon, and selected the combination of pairs with the highest number of matches. We
then calculated what percentage of the reference taxa had a correct match at each taxonomic
level.
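The comparison for LGT contigs can be sketched as follows. This is an illustrative sketch with hypothetical function names; it assumes each taxon is represented by a lineage tuple ordered from kingdom downward, and that both the call and the reference consist of exactly two taxa.

```python
def matching_levels(a, b):
    """Per-level True/False matches between two lineage tuples."""
    return [x == y for x, y in zip(a, b)]

def best_pairing(called, reference):
    """Pair two called taxa with two reference taxa to maximize matching levels.

    called, reference: lists of two lineage tuples each.
    Returns the per-level matches for the better of the two possible pairings.
    """
    straight = [matching_levels(called[0], reference[0]),
                matching_levels(called[1], reference[1])]
    crossed = [matching_levels(called[0], reference[1]),
               matching_levels(called[1], reference[0])]

    def total(pairing):
        return sum(sum(matches) for matches in pairing)

    return max(straight, crossed, key=total)
```

Averaging the returned booleans per level across contigs gives the percentage of reference taxa with a correct match at that level.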
Quality control for the Human Microbiome Project (HMP) assemblies
Samples were filtered out if they 1) were outliers in ordination analyses using
MetaPhlAn [186] community profiles or 2) had fewer than 1,000 genes across contigs
(definitively annotated as LGT or not). Contigs were then filtered from these samples if they
resembled misassemblies, defined here as the erroneous combination of genomic material from
two species into a single contig, which will match WAAFLE’s internal model for a biological
LGT event and result in false positive LGT calls. To identify and quantify misassembly in
contigs from the HMP1-II dataset, we examined recruitment of reads to gene junctions. Contigs
that met the two conditions below were removed:
1. The average coverage (reads per nucleotide) of the gene-gene junction is less than half of
the average coverage of the flanking genes.
2. There are no single reads or read pairs that support the junction. Single reads may
support the junction if they overlap both the junction and flanking genes (single); paired
reads may support the junction if i) each read is in a flanking gene (perfect-double) or ii)
one read is in one flanking gene, and the other overlaps the other flanking gene and the
junction (partial-double) (Fig. I-1).
Both conditions are necessary to remove a contig because contig coverage is highly variable,
and read support decreases as junction lengths increase.
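The two-condition filter can be sketched as follows. This is a minimal sketch, not the pipeline code: the function name and arguments are hypothetical, and the average coverage of the flanking genes is assumed to be the mean of the two flanking-gene coverages.

```python
def is_misassembly(junction_cov, left_cov, right_cov, supporting_reads):
    """Flag a gene-gene junction as a likely misassembly.

    junction_cov: average coverage (reads per nucleotide) of the junction
    left_cov, right_cov: average coverage of the two flanking genes
    supporting_reads: number of single reads or read pairs spanning the junction
    """
    flank_mean = (left_cov + right_cov) / 2
    low_coverage = junction_cov < 0.5 * flank_mean
    no_support = supporting_reads == 0
    # Both conditions must hold for the contig to be removed
    return low_coverage and no_support
```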
Linear regression for LGT frequency
We performed linear regression with LGT events per gene as the outcome, and number
of contigs, gene to contig ratio, alpha diversity, richness, and body site as the regressors. Alpha
diversity was calculated using the Gini-Simpson Index [188], which is equal to 1 minus the sum
of the square of each genus’s gene percentages. Richness was counted as the total number of
genera per sample.
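Both diversity measures can be computed directly from per-genus gene counts. The sketch below uses hypothetical function names and assumes the input is a dictionary mapping each genus to its number of genes in a sample.

```python
def gini_simpson(gene_counts):
    """Gini-Simpson index: 1 minus the sum of squared genus proportions."""
    total = sum(gene_counts.values())
    return 1 - sum((n / total) ** 2 for n in gene_counts.values())

def richness(gene_counts):
    """Number of genera with at least one gene in the sample."""
    return sum(1 for n in gene_counts.values() if n > 0)
```

A sample dominated by one genus gives an index near 0, while a sample split evenly between two genera gives 0.5.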
Determining phylogenetic distance between pairs
We calculated phylogenetic distances between pairs using the PhyloPhlAn tree [50]. If
both taxa were annotated to the species level (tree tips or terminal nodes), distances were
calculated between terminal nodes. If a taxon was not annotated to the species level, the internal
node for the last common ancestor (LCA) was determined after searching the tree for all species
that matched the last known level by regular expression. Distances were then calculated
between nodes, and adjusted by adding the average distance from the LCA to the terminal
nodes.
Functional Analyses
Identifying enriched and depleted Pfam clans
Fisher’s Exact Test was performed both per sample and per body site. For each sample,
we counted the total number of UniRef90 genes in contigs with at least 2 genes and WAAFLE
classification of “LGT” or “No LGT”. For the body site, we summed the total number of
UniRef90 genes in contigs with at least 2 genes and WAAFLE classification of “LGT” or “No
LGT”. We then aggregated UniRef90 terms to Pfam clans, and identified Pfam clans that were
positively or negatively associated with LGT contigs. A Pfam clan was considered significant if:
1. The site-wide q-value is < 0.01.
2. The difference between the percentage of sample odds ratios (OR) that agreed with the
site-wide odds ratio and the percentage of sample odds ratios that disagreed with the
site-wide odds ratio is greater than 0.2:
OR_support + OR_against + OR_nan = total_samples
(OR_support - OR_against) / total_samples > 0.2
For the latter condition, 0.2 was chosen because it requires at least 20% of the samples to
have an odds ratio, and the worst case scenario involves OR_support / total_samples = 0.6 and
OR_against / total_samples < 0.4.
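The two significance conditions can be sketched as follows. This is an illustrative sketch, not the analysis code: the function name is hypothetical, undefined sample odds ratios are assumed to be stored as NaN, and a sample is assumed to "agree" with the site when its odds ratio falls on the same side of 1 as the site-wide odds ratio.

```python
import math

def clan_is_significant(site_q, site_or, sample_ors, q_cutoff=0.01, margin=0.2):
    """Apply the site-wide q-value and the sample-agreement conditions.

    site_q: site-wide q-value for the Pfam clan
    site_or: site-wide odds ratio
    sample_ors: per-sample odds ratios, NaN where undefined
    """
    support = against = 0
    for value in sample_ors:
        if math.isnan(value):
            continue  # counted in total_samples but neither for nor against
        if (value > 1) == (site_or > 1):
            support += 1
        else:
            against += 1
    agreement = (support - against) / len(sample_ors)
    return site_q < q_cutoff and agreement > margin
```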
Searching for genes within Pfam clans
WAAFLE annotates each gene with a UniRef90 term and taxon, which enables us to
examine in more detail which genes and taxa are within enriched Pfam clans. To do this, we
quantified the UniRef90 terms from specific Pfam clans and stratified them by taxonomic
annotation and LGT status (within an LGT contig or not). UniRef90 terms with similar
annotations were collapsed for plotting purposes.
Chapter 3:
Urban transit system microbial communities differ by surface type and interaction with
humans and environment
Copyright Disclosure
This Chapter is a reproduction of a published manuscript, in which the * indicates equal
contribution:
Hsu T.*, Joice R.J.*, J. Vallarino, G. Abu-Ali, E.M. Hartmann, A. Shafquat, C. DuLong, C.
Baranowski, D. Gevers, J.L. Green, X.C. Morgan, J.D. Spengler, C. Huttenhower. Urban Transit
System Microbial Communities Differ by Surface Type and Interaction with Humans and the
Environment. mSystems, 2016. 1(3): e00018-16.
Attributions
R.J., J.S., and C.H. designed the study. C.B. optimized the sampling protocol. R.J., T.H.,
and J.V. collected transit samples, and R.J. and T.H. extracted DNA for 16S and shotgun
sequencing at the Broad Institute. R.J. and A.S. performed 16S computational analyses; T.H., A.S.,
and G.A. performed shotgun computational analyses; E.M.H. and J.L.G. helped interpret
taxonomic composition and functional profiling results. R.J., T.H., A.S., C.D., and X.C.M. made
figures; X.C.M., D.G., J.D.S., and C.H. provided support throughout the sequencing and
analysis process. R.J., T.H., X.C.M. wrote the manuscript.
Abstract
Public transit systems are ideal for studying the urban microbiome and inter-individual
community transfer. In this study, we used 16S amplicon and shotgun metagenomic sequencing
to profile microbial communities on multiple transit surfaces across train lines and stations in
the Boston metropolitan transit system. The greatest determinant of microbial community
structure was the transit surface type. In contrast, little variation was observed between
geographically distinct train lines and stations serving different demographics. All surfaces
were dominated by human skin and oral commensals such as Propionibacterium,
Corynebacterium, Staphylococcus, and Streptococcus. Non-human associated taxa detected
included generalists from Alphaproteobacteria, which were especially abundant on outdoor
touchscreens. Shotgun metagenomics further identified viral and eukaryotic microbes including
Propionibacterium phage and Malassezia globosa. Functional profiling showed that P. acnes
pathways such as propionate production and porphyrin synthesis were enriched on train holds,
while electron transport chain components for aerobic respiration were enriched on touchscreens
and seats. Lastly, the transit environment was not found to be a reservoir of antimicrobial
resistance and virulence genes. Our results suggest that microbial communities on transit
surfaces are maintained from a metapopulation of human skin commensals and environmental
generalists, with enrichments corresponding to local interactions with the human body and
environmental exposures.
Importance
Mass transit systems, specifically urban subways, are distinct microbial environments with high
occupant densities, diversities, and turnovers, and they are thus especially relevant to public
health. Despite this, only three culture-independent subway studies have been performed, all
since 2013 and with widely varying designs and differing conclusions. In this study, we profiled
the Boston subway system, which is operated by the Massachusetts Bay Transportation
Authority (MBTA) and provides 238 million trips per year. This yielded the first high-precision microbial survey of a variety of
surfaces, ridership environments, and microbiological functions (including tests for potential
pathogenicity) in a mass transit environment. Characterizing microbial profiles for multiple
transit systems will be increasingly important for biosurveillance of antibiotic resistance genes or
pathogens, which can be early indicators of outbreaks or sanitation problems. Understanding how human
contact, materials, and the environment affect microbial profiles may eventually allow us to
rationally design public spaces to sustain our microbial health.
Introduction
Mass transit systems host large volumes of passengers and facilitate a constant stream of
human/human and human/built environment microbial transmission. The largest urban mass
transit system in the United States facilitates an average of 11 million trips per weekday (New
York). The next four largest systems transport just over 1 million trips per weekday
(Washington DC, Chicago, Boston, San Francisco) [189][180][182][181], yet little is known about
the mass transit system microbial reservoir. Understanding the associated microbial
transmission dynamics between humans and the built environment, and microbial occupation
and persistence on different surfaces, can inform decisions regarding public health and safety.
Microbial DNA sequencing-based studies have revealed that microbial communities of
the built environment are greatly influenced by their human occupants. Communities within
homes showed higher similarity to those of their inhabitants [92], and specific surfaces
frequently contacted by human skin, such as keyboards or mobile phones, had microbial
communities that reflect those of skin [190, 191]. In restrooms and classrooms, variation in
microbial community composition across surface types was associated with variation in human
contact with those surfaces: desks contained human skin and oral microbes, while chairs
contained intestinal and urogenital-derived microbes [93, 192]. However, a limitation of most
built environment microbiome research is that human contact, surface type, and material
composition are frequently confounded. For example, in the classroom study described above,
different forms of human contact were associated with distinct microbial community profiles;
however, the desks and chairs were also constructed from different materials.
Previously observed subway microbial communities comprise both human and
environmentally derived microbes. Air samples from within the New York and Hong Kong
subway systems included microbes originating from soil and environmental water in addition
to human skin [193, 194]. The recent metagenomic study of New York subway stations [195] has
been widely criticized [196] and leaves many detailed analysis questions regarding the transit
microbiome unanswered, but it has provided an initial reference dataset for further analysis of
subway microbiome diversity. In addition, while this study collected surface type information,
it did not standardize their characterization or, as a result, investigate surface-specific
enrichments for microbial taxa. Understanding the separate influences of human contact,
surface type, and surface material would help identify mechanisms through which microbial
communities form and persist on surfaces within built environments.
In the present study, we provide the first comprehensive metagenomic profile of
microbial communities across multiple surface types and materials in a high-volume public
transportation system. Samples were collected from seats, seat backs, walls, vertical and
horizontal poles, and hanging grips inside train cars from three subway lines, as well as
touchscreens and walls of ticketing machines inside five subway stations. Using a combination
of 16S amplicon and shotgun metagenomic sequencing, we characterized the microbial
community composition, functional capacity, and pathogenic potential of the Boston mass
transit system. In agreement with previous studies, we observed a combination of human-, soil-,
and air-derived microbial communities across the system. Taxonomic differences were most
strongly associated with surface type, as compared to geographic, train-line, and material
differences in a multivariate analysis. The distribution of metabolic functions was dominated by
P. acnes, which made up a majority of the community. Minimal antibiotic resistance genes and
virulence factors were detected across transit system surfaces. In addition to identifying the
most important factors determining microbial colonization, our results may serve as a baseline
description of microbes on public transportation surfaces, which will be relevant toward future
design of transit environments encouraging microbial health.
Results
Sampling microbial communities on the Boston transit system
We collected samples from train cars and stations (n=73) from the Boston transit system.
This system is maintained by the Massachusetts Bay Transportation Authority (MBTA), which
operates bus, subway, commuter rail, and ferry routes in the greater Boston area. Our study
focused on the subway system, which consists of five lines (red, orange, blue, green, and silver)
that extend from downtown Boston into the surrounding suburbs (Fig. 3-1A). Train car samples
were collected from the red, orange, and green lines, and comprised 6 surface types, including
grips, horizontal and vertical poles, seats, seat backs, and walls (Fig. 3-1B). Station samples were
collected from the touchscreens and the sides of fare ticketing machines (Fig. 3-1C). Biomass
yields were highest for hanging grips (141.83±92.68 ng/µL), followed by seats (128.1429±49.955
ng/µL) and touchscreens (120.47±73.68 ng/µL), though these differences were not statistically
significant (Fig. II-S1A).
Figure 3-1. Collection of samples from MBTA trains and stations. (A) Microbial community
samples were collected from the Massachusetts Bay Transportation Authority (MBTA) system in the Boston, Massachusetts
metropolitan area. Train samples were collected from 6 train car surfaces across 3 locations
along 3 train routes; station samples were collected from 5 stations. (B, C) Diagram of the
surfaces sampled within train cars (B) and stations (C). Sampled surfaces specifically included
seats and seat backs, horizontal and vertical poles, hanging grips, and walls within train cars, as
well as the screens and walls of touchscreen machines within stations.
For each sample, we collected metadata describing built environment type, surface type,
material composition, as well as collection date (Table II-1). For train car samples, we also
recorded the train line, within-train location, and location along the subway route at time of
sample collection (nearest subway stop). For station samples, we recorded the station, ticketing
machine location, and which side of the touchscreen was swabbed. 16S rDNA amplicon
sequence data was generated from most samples (n=72), and a subset (n=24) was subjected to
shotgun metagenomic sequencing.
Microbial communities are specific to surface types and immediate environment
The surface type from which microbes were collected proved to be the major
determinant of community diversity and structure. Alpha diversity of touchscreen samples was
significantly higher than that of all other surface types (p<0.0001, ANOVA comparison of 7
surfaces with Bonferroni correction, Fig. II-1B), and did not correlate with biomass (Spearman’s
rho=0.0057, Fig. II-1A). The largest axes of beta diversity separated train holds (horizontal and
vertical poles, hanging grips), chairs (seat and seat backs), touchscreens, and walls (Fig. 3-2A).
Train line remained only a minor driver of community structure (Fig. 3-2B), and did not dictate
overall community composition for either holds (Fig. II-S2A) or seats, once the material of the
latter was taken into account (Fig. II-S2B, II-S2C). In particular, the green line seats were
upholstered with vinyl, while seats on the orange and red lines were upholstered with
polyester.
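The rank correlation between alpha diversity and biomass reported above can be illustrated with a short sketch. The Python below computes Spearman's rho as the Pearson correlation of tie-averaged ranks. The function names and input values are hypothetical and are not taken from the study, and the text does not state which software was used for this statistic.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values, invented for illustration only.
biomass = [12.0, 150.0, 88.0, 40.0, 95.0]
diversity = [3.1, 2.4, 3.0, 2.2, 3.3]
rho = spearman_rho(biomass, diversity)
```

A rho near zero, as reported in the text, means that the ordering of samples by biomass carries essentially no information about their ordering by diversity.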
Figure 3-2. Taxonomic composition of subway microbial communities. All ordinations are
principal coordinate analyses using Bray-Curtis distance among filtered OTUs (see Methods),
colored by metadata. (A) Subway data by surface, (B) train car data by train line, and (C)
touchscreen data by location of machine. (D) Relative abundances of bacterial families across
samples from train cars (see Table II-2 for complete data). (E) Relative abundance of bacterial
families within stations (complete data as above). Asterisks indicate that the sample was
collected on a separate day during the same month as the remaining samples. For station
samples, “W” indicates a sample from a ticketing machine wall; all other samples are from the
ticketing machine touchscreens.
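For readers unfamiliar with the distance underlying these ordinations, the following is a minimal sketch of the Bray-Curtis dissimilarity between samples. It assumes non-negative abundance vectors over a shared set of OTUs. The sample names and counts are invented, the helper names are hypothetical, and the ordination step itself (principal coordinate analysis) is not shown.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


def distance_matrix(samples):
    """Pairwise dissimilarities keyed by (sample, sample) name pairs."""
    names = list(samples)
    return {
        (p, q): bray_curtis(samples[p], samples[q])
        for p in names
        for q in names
    }


# Hypothetical OTU counts for three surfaces (illustration only).
samples = {
    "grip": [8, 2, 0, 0],
    "seat": [4, 4, 2, 0],
    "touchscreen": [1, 1, 4, 4],
}
dm = distance_matrix(samples)
```

The value ranges from 0 (identical composition) to 1 (no shared OTUs), so in this toy example the grip is closer to the seat than to the touchscreen.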
The location of ticketing machines (e.g. outdoor, indoor, underground) was a primary
source of variation between microbial communities on touchscreens (Fig. 3-2C). Univariate
analyses using Linear Discriminant Analysis Effect Size (LEfSe) [197] revealed that indoor
touchscreens were characterized by genus Acinetobacter, while underground touchscreens had
increased levels of genus Corynebacterium, and family Tissierellaceae, specifically genus
Finegoldia and genus Anaerococcus. Those with outdoor exposures were enriched for class
Alphaproteobacteria, including family Acetobacteraceae and genera Methylobacterium,
Sphingomonas, and Blastococcus (Table II-3). These results imply that surface type is a major
driver of community composition on transit surfaces, and that indoor versus outdoor exposure
detectably influences the resident microbial composition of touchscreen surfaces.
Subway microbial communities are largely derived from human skin and oral commensal
microbes
Subway microbial clades were generally those found in typical human skin communities
[2, 81] (Fig. 3-3Ai) and were dominated by the phyla Firmicutes, Proteobacteria, and
Actinobacteria, each of which comprised over 20% of the microbial community, based on 16S
data. The Bacteroidetes were much less abundant with an average community abundance of 6%
(Table II-S2). The families with the highest mean relative abundances were Staphylococcaceae
and Corynebacteriaceae (Fig. 3-2D-E), also typical of skin commensals. Propionibacterium was
not observed due to known primer bias [198] but was confirmed later with shotgun
metagenomics. The next most abundant taxa were Micrococcaceae, which included genus
Micrococcus (found in hair and skin) and genus Rothia (found in the oral cavity [2, 199]), as well
as Streptococcaceae (found in the oral cavity) and Pseudomonadaceae. We also observed low
proportions of gut and oral commensals such as Lachnospiraceae, Veillonella, and Prevotella.
Figure 3-3. Putative MBTA microbial community sources. (A) i. Ordination of subway surface
data jointly with human skin (anterior nares), oral (mixed sites from within oral cavity) and gut
(stool) microbiome data from the Human Microbiome Project (HMP). Principal coordinate
analysis was performed with weighted UniFrac distance and calculated using OTU relative
abundances. ii-iv. Correlations between subway samples and human body sites [200]: ii. skin,
iii. oral, and iv. gut, as well as environmental sites: v. air [201] and vi. soil [202]. The x- and y-
axes represent mean relative abundance across each data set with standard error bars. For each
plot, subway samples (MBTA) are on the x-axis and potential source community on the y-axis.
(B, C) Microbial SourceTracker [203] was used to identify possible human and environmental
sources of subway (B) train and (C) station communities. Relative estimated contribution
of each source is plotted per subway sample.
Highly abundant non human-associated taxa encompassed the order Burkholderiales
(3.25%) as well as class Alphaproteobacteria (9.15%), which contains genera Sphingomonas
(1.48%) and Methylobacterium (1.14%) and families Rhodobacteraceae (1.48%) and
Methylocystaceae (0.447%). These Alphaproteobacteria are widespread environmental bacteria
with flexible metabolic regimes; Sphingomonads in particular, including the genera
Sphingomonas and Sphingobium, are found in soils and sediments and are most well studied for
their ability to degrade polyaromatic hydrocarbons [204]. Methylobacterium, primarily M.
extorquens, is a genus of plant- and soil-associated facultative methylotrophs; these bacteria are
highly prevalent on the surfaces of plants, and their diverse metabolic capabilities make them
likely to survive in other environments [205]. Enhydrobacter aerosaccus, which is currently
classified as belonging to Moraxellaceae but may more aptly be classified as an
Alphaproteobacterium [206], was also prevalent in the subway samples.
To determine the microbial clades driving these patterns, we correlated the abundance
of subway microbial genera with their abundance in three human body sites [200] as well as air
and soil [201, 202] (Fig. 3-3Aii-vi). As expected, the human skin genera Staphylococcus and
Corynebacterium (Fig. 3-3Aii), human oral cavity taxon Streptococcus, and human gut-resident
genera Bacteroides and Prevotella were abundant on both the subway and their respective body
sites (Fig. 3-3Aii-iv). In addition to human-associated taxa, several genera previously observed in
indoor air [201] were also abundant on subway surfaces: Sphingomonas, Methylobacterium,
Acinetobacter, Streptococcus, Staphylococcus and Corynebacterium (Fig. 3-3Av). In contrast, typical
soil genera were rare on subway surfaces (Fig. 3-3Avi). Microbial SourceTracker [203] confirmed
these origins based on overall community composition as compared to a variety of reference
environments [207] (Fig. 3-3B-C). Only a subset of touchscreen samples included a substantial
proportion of environmental microbes (e.g. air and soil), most notably from the Riverside
above-ground outdoor ticketing station (Fig. 3-3C).
Propionibacterium phages and the yeast Malassezia globosa dominate the non-bacterial microbial
community
Shotgun metagenomic sequencing, which allowed us to profile viral and eukaryotic
microbes that cannot be identified by 16S sequencing as well as bacterial taxa that are poorly
amplified by the 16S V4 region primers [198], was performed for 24 mass transit samples
including 15 train car samples and 9 station samples. In agreement with previous studies of skin
ribotypes [81, 208], the most abundant species across all samples was the facultative anaerobe
Propionibacterium acnes (mean 47%, max 81%); its average abundance was 29.8% for chairs,
71.6% for grips and poles, and 43.4% for touchscreen surfaces (Fig. 3-4). Other metagenomically
assessed bacterial abundances agreed with 16S data, including high levels of family
Micrococcaceae (mean 5.3%), Staphylococcaceae (mean 5.28%), Corynebacteriaceae (mean
4.95%), and Streptococcaceae (mean 3.73%), along with non human-associated taxa including
soil taxa Geodermatophilaceae (mean 1.22%) and Acinetobacter (mean 0.70%) (Table II-2).
Figure 3-4. Trans-domain taxonomic profiles from subway shotgun metagenomes. Relative
abundances of the twenty microbial species with highest mean across 24 metagenomes from
train cars and stations. Among colored metadata annotations, train line (green, orange, or red)
is indicated for car surface samples and location (indoor or outdoor) for touchscreens. P. acnes
is not amplified by the 16S primers used in this study but readily detectable by shotgun
sequencing, as are non-bacteria such as Propionibacterium phage.
Eleven non-bacterial species were present at an abundance of ≥0.1% in at least two
samples. The most abundant and prevalent viruses included Propionibacterium bacteriophages
and oncovirus Merkel cell polyomavirus (a common respiratory infection [198]). The relative
abundances of Propionibacterium bacteriophages P100D and P101A followed patterns similar to
those of P. acnes, with lower average abundance on chairs (3.2%) and higher abundances on
holds (5.4%) and touchscreens (7.9%), suggesting that phage/host relationships are detectable
directly from metagenomics. Remaining viruses were found sporadically (in only 2 samples) or
had mean relative abundances less than 0.0006% (Table II-2). Many of these viruses were phage
that corresponded to abundant bacterial species, including Pseudomonas phage, Lactobacillus
phage, Lactococcus phage, Staphylococcus phage 3A, Staphylococcus phage 80 alpha, and
Staphylococcus phage phi2958PVL.
The yeast Malassezia globosa [209] also occurred with abundance patterns similar to those
of P. acnes, with lower abundance on chairs (0.03%) and higher abundances on holds (0.25%)
and touchscreens (0.1%). Both M. globosa and P. acnes show niche-specific adaptation to
metabolism of lipid-rich sebum [209, 210] and are commonly found on sebaceous skin sites,
which include the chest, back, and face [208]. This may indicate that sebaceous skin taxa
more easily transfer or adhere to built environment surfaces.
All surface types are dominated by skin microbes, with smaller proportions of oral, gut, and
environmental taxa across seats and touchscreens
To identify differentially abundant taxa across metadata categories, we performed a
multivariate analysis using MaAsLin [211], which controls for multiple covariates using a
generalized linear model (Table II-4). For 16S data, we accounted for built environment type,
surface type, material composition, and sample location. For human-associated taxa, seats were
particularly enriched in skin taxon Corynebacterium and vaginal taxon Gardnerella, though all
contacted surface types had higher relative abundances of Corynebacterium as compared to train
walls (Fig. 3-5A). The skin taxon Staphylococcus was also enriched across all surface types except
for touchscreens and train walls, and Corynebacterium was negatively associated with vinyl seats
relative to polyester seats. Grips were enriched for oral taxa such as Rothia and Veillonella. For
non human-associated taxa, all grips and vertical poles were depleted in class
Alphaproteobacteria, as contrasted to their enrichment on outdoor surfaces at the Riverside
station (western suburb). These clades included Methylobacteriaceae (grips and vertical poles)
and Methylocystaceae (all holds), as well as family Sphingomonadaceae (grips and vertical
poles) and genus Amaricoccus (all holds). Because many of these organisms are likely associated
with soil particles, it is reasonable that they should be less abundant on surfaces where soil is
unlikely to settle.
Figure 3-5. Enrichment of microbial taxa with respect to metadata using multivariate
analyses. Each ring represents significant associations of one metadatum with microbial clades
using MaAsLin [211] (FDR q<0.25). (A) 16S data. For location, surface category, surface type,
and surface material (inner rings to outer rings), the direction of association between taxa and
metadata is indicated in red (positive) or green (negative), relative to Alewife, touchscreens,
seat backs, and polyester, respectively. (B) Shotgun metagenomic data; only a simplified surface
type was represented by sufficiently many samples for analysis. Horizontal poles, vertical poles,
and grips were grouped into “holds”, and seats and seat backs were grouped into “chairs”.
The direction of association is again indicated by color. Only taxa with at least one association
are shown in each cladogram.
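The q-value threshold in the caption above refers to false discovery rate control across many taxon-by-metadata tests. A minimal sketch follows, assuming a Benjamini-Hochberg adjustment (the text here does not name the correction method). The p-values are invented and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical p-values for five taxon/metadata associations.
p = [0.001, 0.04, 0.03, 0.30, 0.60]
q = benjamini_hochberg(p)

# Indices of associations passing the q < 0.25 threshold used for Fig. 3-5.
significant = [i for i, qi in enumerate(q) if qi < 0.25]
```

With a lenient threshold such as q < 0.25, up to a quarter of reported associations are expected to be false positives, which suits exploratory analyses with few samples.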
For shotgun data, we again used MaAsLin [211] to identify associations between
microbial taxa and a single covariate, surface type (Fig. 3-5B, Table II-4). Due to the small
number of samples, surface type metadata were grouped into chairs (seat and seat backs), holds
(hanging grips, horizontal and vertical poles), and touchscreens. For human-associated taxa,
chairs and touchscreens were enriched in multiple species of Corynebacterium (including C.
aurimucosum, genitalium, jeikeium, massiliense, pseudogenitalium, tuberculostearicum, urealyticum)
and Staphylococcus (S. caprae/capitis, epidermidis, haemolyticus, hominis, pettenkoferi); vaginal taxa
Gardnerella vaginalis and Lactobacillus (L. crispatus and L. iners); and gut taxa Ruminococcus bromii,
Faecalibacterium prausnitzii, and Eubacterium rectale. Touchscreens were particularly enriched in
oral species such as Streptococcus (S. cristatus, gordonii, infantis, mitis/oralis/pneumoniae,
parasanguinis, sanguinis, thermophilus, tigurinus), Prevotella (P. copri, melaninogenica), and Rothia
aeria (also enriched in holds). For non-human associated taxa, we saw similar patterns as in the
16S data. Touchscreens were enriched in Methylobacteriaceae, Burkholderiales,
Sphingomonadales, and Rhodobacteraceae (also enriched in chairs). Many of these non-human
associated taxa that we identified on surfaces are hardy generalists that survive under harsh
conditions [212].
Most Corynebacterium species enriched in both chairs and touchscreens have higher (but
not statistically significant) abundances in chairs, with the exception of C. kroppenstedtii and C.
matruchotii. The lack of oral species on holds may be due to the newfound detection of P. acnes,
which is enriched in holds and may affect the relative abundances of rarer taxa. Generally, skin
taxa dominate all surfaces, with P. acnes enriched on holds and Corynebacterium and
Staphylococcus on chairs and touchscreens. Oral taxa are present on both holds and
touchscreens. Non-human associated taxa remain enriched on touchscreens, which present
more exposed surface areas not enclosed within trains.
Metagenomes reflect dominance of Propionibacterium acnes across subway surfaces
Functional genomic profiling using HUMAnN2 quantified 3,975,869 UniRef50 [143]
protein families, which were collapsed into 12,074 KEGG Orthology (KO) [213] families. For
hypothesis testing, we focused on 604 KOs with mean abundances greater than the overall
median abundance and variance across samples in the 90th percentile. MaAsLin identified 590
KOs significantly associated with surface type (q < 0.05): 360 enriched in holds, 204 depleted in
holds, 12 enriched in chairs, 4 depleted in chairs, 5 enriched in touchscreens, and 4 depleted in
touchscreens, relative to all other surface types (Table II-4).
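The abundance and variance filter described above can be sketched as follows. This is one possible reading, not the exact procedure: it assumes that "overall median abundance" means the median over all table entries and that the variance criterion keeps KOs at or above the 90th percentile (nearest-rank definition) of per-KO variances. The table contents and all function names are hypothetical.

```python
import math
from statistics import mean, median, pvariance


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def filter_kos(table, variance_pct=90):
    """Keep KOs with mean above the overall median and high across-sample variance.

    table maps a KO identifier to its list of per-sample abundances.
    """
    means = {ko: mean(vals) for ko, vals in table.items()}
    variances = {ko: pvariance(vals) for ko, vals in table.items()}
    overall_median = median(x for vals in table.values() for x in vals)
    cutoff = nearest_rank_percentile(list(variances.values()), variance_pct)
    return sorted(
        ko for ko in table
        if means[ko] > overall_median and variances[ko] >= cutoff
    )


# Hypothetical KO abundances across three samples (illustration only).
table = {
    "K1": [10, 50, 90],
    "K2": [5, 5, 5],
    "K3": [1, 2, 3],
    "K4": [20, 22, 24],
    "K5": [0, 0, 1],
}
kept = filter_kos(table)
```

Filtering before hypothesis testing reduces the number of tests, which in turn reduces the multiple-testing burden on features that are too rare or too invariant to show an association.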
Many of the KOs enriched in holds were genes found in the P. acnes genome [214]. These
included systems for anaerobic respiration, lipases and esterases for degrading lipids within
sebaceous sites, hyaluronate lyase for digesting the extracellular matrix of skin, and fermentation of
pyruvate to propionate (Fig. 3-6A). Production of propionate is catalyzed by methylmalonyl-
CoA carboxyltransferase, which is enriched in the holds. Porphyrin synthesis is a major
function of several Propionibacterium species [215], contributing to a range of physiological activities
(e.g. potential keratinocyte damage from free radical release [214, 216]) and industrial uses (e.g.
synthesis of vitamin B12 [217]). Here, the pathway was represented by several genes from the
hem and cbi/cob gene clusters [217, 218]. To verify that the KOs detected above were indeed
specific to P. acnes, we removed its contributions to the overall abundance of each UniRef50
family, renormalized, and again identified KOs enriched on different surface types (see
Methods). KOs specific to P. acnes metabolism were no longer enriched on holds, with a few
exceptions including iron transport (Fig. 3-6A, Table II-4).
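The removal-and-renormalization step can be illustrated with a small sketch. It assumes that each gene family's abundance is available stratified by contributing species; the nested-dictionary layout, family and species labels, and function name are hypothetical, and the actual procedure is the one described in Methods.

```python
def remove_species(stratified, species):
    """Drop one species' contribution per family, then rescale totals to sum to 1.

    stratified maps a gene family to a dict of species -> abundance.
    """
    totals = {
        family: sum(ab for sp, ab in by_species.items() if sp != species)
        for family, by_species in stratified.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no abundance remains after removing the species")
    return {family: value / grand_total for family, value in totals.items()}


# Hypothetical stratified abundances for three gene families.
stratified = {
    "FAM_A": {"P_acnes": 60.0, "S_epidermidis": 10.0},
    "FAM_B": {"P_acnes": 20.0, "S_epidermidis": 30.0},
    "FAM_C": {"M_luteus": 10.0},
}
without_p_acnes = remove_species(stratified, "P_acnes")
```

In this toy example FAM_A dominates the original community total but falls to a minority share once the P. acnes contribution is removed, which mirrors how an enrichment driven by a single abundant species can disappear after renormalization.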
Figure 3-6. Enrichment of KEGG Orthology (KOs) across MBTA surfaces before and after P.
acnes removal. For all heatmaps, rows represent significantly enriched KOs detected through
linear regression with MaAsLin, columns represent samples, and cells are colored by sum-
normalized reads per kilobase (RPKs) on a log scale. Further metadata is shown as colored bars
below the heatmaps. The first colored bar explains the collapsed surface types (second bar), in
which chairs include seats (light blue) and seat backs (dark blue), holds include horizontal poles
(red), vertical poles (orange), and grips (yellow), and touchscreens are from Riverside (green),
Alewife (red), Forest Hills (orange), and South Station (light blue). KOs annotated with yellow
circles are found before and after P. acnes removal. (A) Selected KOs enriched in holds only are
specific to and colored by P. acnes metabolic function. (B) Selected KOs specific to oxidative
phosphorylation and photosynthesis are shown before (above) and after (below) P. acnes
removal. Direction of association between KO abundances and surface types, relative to holds,
are shown as green ‘+’ (positive) or red ‘-’ (negative) to the left of the heatmap. Columns are
colored by metadata as in Fig. 3-2.
Many KOs associated with oxidative phosphorylation and
photosynthesis were enriched in chairs and touchscreens relative to holds before removal of P.
acnes. These included NADH dehydrogenase I subunits (EC:1.6.5.3), ferredoxin-NADP+
reductase (involved in photosystem I, EC:1.18.1.2), ATPase subunits (EC:3.6.3.14), and
cytochrome c oxidases (EC:1.9.3.1). After depletion of P. acnes-derived processes, ferredoxin-
NADP+ reductase and F-type H+-transporting ATPase subunits were enriched only on chairs,
while cytochrome c oxidase subunits and NADH dehydrogenase subunit types and Fe-S
proteins were enriched only on touchscreens (Fig. 3-6B). Increased numbers of electron
transport chain components may indicate more aerobic respiration, or the presence of
eukaryotic DNA (as detected by chloroplasts or mitochondria). Notably, high levels are found
across all KOs for the horizontal pole from the Red Line and the outdoor touchscreen from
Riverside station, although it is unlikely that these trends were completely eukaryotic. Riverside
station touchscreen 16S profiles included only 4.04% chloroplast classified sequences, and
overall holds included for shotgun sequencing had the highest average proportions of
chloroplast, followed by chairs and touchscreens. Thus, presence of more electron transport
chain components may also reflect a metabolic strategy enriched among persisters in the built
environment, especially relevant to the touchscreens’ Alphaproteobacteria.
Minimal pathogenic and antibiotic resistance presence on the Boston transit system
To detect antibiotic resistance factors in MBTA metagenomes, we used ShortBRED [219]
to create high-precision sequence markers from the Comprehensive Antibiotic Resistance
Database (CARD) [220]. This resulted in 2,657 antibiotic resistance gene (ARG) markers for 792
ARGs in CARD, but only 46 ARG markers were detected with RPKMs greater than 0 in at least
two samples. This is notable because the average read depth of our samples was 9.8×10^6 reads
(0.989 Gnt), but the average RPKM per sample for these markers was only 1.172, ranging from 0
to 46.67. Similarly, a low abundance of ARGs (<0.3% of total reads mapped to the Antibiotic
Resistance Database) was found in the Home Microbiome Project [92]. Our hits included several
resistance mechanisms, including efflux pumps, antibiotic target modification or replacement,
antibiotic inactivation, and changes in nucleic acid machinery (rpoB or par genes) (Fig. 3-7A).
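To give a sense of scale for the RPKM values quoted above, the sketch below applies the standard reads-per-kilobase-per-million definition. The total read count is the average depth stated in the text; the mapped read count and marker length are invented, the function name is hypothetical, and ShortBRED's own normalization may differ in detail.

```python
def rpkm(mapped_reads, marker_length_nt, total_reads):
    """Reads per kilobase of marker per million reads in the sample."""
    if marker_length_nt <= 0 or total_reads <= 0:
        raise ValueError("marker length and total reads must be positive")
    kilobases = marker_length_nt / 1_000
    millions = total_reads / 1_000_000
    return mapped_reads / (kilobases * millions)


# Average depth reported in the text; the other two values are hypothetical.
total_reads = 9_800_000
example = rpkm(mapped_reads=5, marker_length_nt=300, total_reads=total_reads)
```

Under these assumptions, an RPKM between 1 and 2 for a 300 nt marker corresponds to only a handful of reads out of nearly ten million, which is why the detected markers are described as minimal.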
Figure 3-7. Quantification of antibiotic resistance marker and virulence factor abundances on
subway surfaces. (A) Antimicrobial resistance markers (rows) quantified in metagenomes by
ShortBRED [219] and annotated by antibiotic target through the Antibiotic Resistance Ontology
in the CARD database. (B) Virulence factors (rows) likewise quantified and manually annotated
by virulence function through keywords on the VFDB web site. For both heatmaps, columns
(samples) are arranged as in Fig. 3-6.
To contextualize ARG enrichment (or rather depletion) in this environment, we further
compared the Boston subway to ARGs in the air microbiome from several other built
environments [221] as well as from 552 stool samples from individuals in the United States,
China, Malawi, and Venezuela [2, 222, 223]. For consistency with previous surveys, we used
ShortBRED to generate 4,132 ARG markers for 849 ARGs in the Antibiotic Resistance
Database (ARDB). Both the air microbiome and Boston subway samples had noticeably lower
levels of RPKMs than those of typical human stool (Fig. II-3). The gut microbiome has repeatedly
been observed [224] to be enriched for tetracycline resistance, beta-lactamases, and MFS/RND
efflux pumps, whereas none of these were substantially present in the MBTA and only low
levels of tetracycline and beta-lactamase resistance were found in indoor air [221].
To similarly assess virulence factors in the MBTA, we created sequence markers from
the Virulence Factor Database (VFDB) [225], resulting in 7,869 markers for 2,089 factors. Of these, 54
markers were detected with RPKMs greater than 0 in at least two samples. The average RPKM
per sample was 0.240, ranging from 0 to 23.74. All of the putative virulence factors, with the
exception of the alpha and beta-hemolysin proteins found in S. aureus, are opportunistic factors
typical of normal microbial life. For example, many proteins were classified as part of
pathogenicity islands; however, most of these proteins are transposases, integrases, and
repetitive regions (Fig. 3-7B). Other hits were annotated with functions in adherence,
antiphagocytosis, and secretion systems, but consisted of cell surface proteins such as
lipopolysaccharides, capsule polysaccharide proteins, and flagellar proteins. This indicates that
the real pathogenic potential detected in the Boston subway is very low. Overall, the Boston
subway has minimal antibiotic resistance and virulence factor presence.
Discussion
Here, we report on the microbial profile of the Boston metropolitan transit system.
Previous studies have characterized the Hong Kong and New York subway aerosol
communities [193, 194], as well as surfaces in the New York subway [195], but we believe this to
be the first to determine how space utilization by passengers, surface type, and material
composition individually affect microbial ecology. We further describe the microbial
community metabolic potential across surface types and metagenomically assess the absence of
pathogenic potential. The former primarily reflected P. acnes pathways on holds and aerobic
respiration on seats and touchscreens; resistance and virulence factors among the latter were
depleted relative to environments such as the human microbiome.
Surface type was the major driver of variation in composition, lending support to three
potential hypotheses: differences may be driven by 1) human body interactions [192], 2)
material composition of these surfaces, which may enhance microbial adherence and growth, or
3) a combination of the two factors. Our data support the third hypothesis. First, we observed a
significant enrichment of oral microbes on horizontal poles and grips, which may be higher up
and closer to riders’ faces or reflect transfer through skin-mediated contact (Fig. 3-1B). Second,
both 16S and shotgun data showed enrichment of vaginal commensals in seat surfaces, which
may be transmitted through clothing. Third, we found that seats were enriched in vaginal and
oral taxa relative to seat backs, and outdoor touchscreens were enriched in Alphaproteobacteria
relative to indoor touchscreens. If surface material were the only driver of microbial
composition, seats vs. seat backs and indoor vs. outdoor touchscreens should have similar
taxonomic profiles. Surface material certainly plays at least a partial role, however, as we
observed decreased Corynebacterium in vinyl seats as compared to polyester seats. Overall, our
observations indicate that both human body interactions and surface material shape community
composition, with the former as the stronger driver.
Previous studies of the transit microbiome, particularly those of New York [195] and
Hong Kong [194], have also observed environmental exposure to be an additional driver of its
microbial community composition. Afshinnekoo et al, for example, found that samples’ human
DNA reflected census demographics for the surrounding region, although we saw no
differentiation at the microbial level among Boston train lines serving suburbs with different
ethno-demographics. We primarily observed the impact of environmental exposure on outdoor
touchscreens, in agreement with Leung et al’s higher alpha diversities for outdoor stations in
Hong Kong. The surfaces we investigated are near-uniformly exposed to high volume and
diversity of rider interaction. This frequent human contact could homogenize many potential
influences on microbial populations, such as demographics or weather. Since the body sites
used for contact, indoor/outdoor location, and material composition remain consistent, these
exposures would thus shape the taxonomic differences we observed across the Boston subway.
There are few non-opportunistic pathogens in the built environment outside of hospitals
[226]. None were reported for restrooms [93], classrooms [192], or Hong Kong subway aerosols
[194], possibly due to lack of phylogenetic resolution with 16S sequencing. During partial
genome assembly from home [92] and restroom [227] surface metagenomes, shotgun
sequencing facilitated identification of opportunists with pathogenic potential, but even with
this increased resolution, outright virulence factors were rare. Robertson et al detected no
human pathogens using Sanger and pyrosequencing in New York subway aerosols [193].
Furthermore, although Afshinnekoo et al report 12% of taxa represented known pathogens in
the National Select Agent Registry and PATRIC database, this database uses an extremely
broad definition of “pathogen,” and these results were later refuted [196]. Our study assessed
whether typical subway microbial communities were unusual in their carriage or transfer of
antibiotic resistance genes and virulence factors. We detected low numbers of these genes, and
they were present at drastically lower amounts than observed in the human gut.
One goal of studying the microbiology of the built environment is to establish a baseline
against which deviations can be used to detect potential public health threats. As with the
human microbiome, however, inter-subject variability appears to be quite high in built
environments (e.g. buildings) and in transit systems, and both greater cross-sectional breadth
and longitudinal depth are still necessary. All subway microbiome papers to date have detected
a high level of skin-associated genera. In addition to this work, Leung et al (Hong Kong subway
aerosols) included Micrococcus (4.9%), Enhydrobacter (3.1%), Propionibacterium (2.9%),
Staphylococcus, and Corynebacterium (1.5%), while Robertson et al detected high levels of families
Staphylococcaceae, Moraxellaceae, Micrococcaceae, Enterobacteriaceae, and
Corynebacteriaceae. Afshinnekoo et al in the New York subway is the only major exception,
with the most abundant organisms instead including Pseudomonas stutzeri, Acinetobacter, and
Stenotrophomonas. If microbes shed from skin (or still resident on shed skin cells) do dominate
mass transit environments, it must be determined whether these microbes are deposited,
dormant, or actively growing, or whether they can be stably transferred from one individual to
another.
As in other built environments, however, human-associated microbes are by no means
the only apparently functional community residents even when abundant. Notably, our wall
samples, which are not consistently touched but are in the presence of high human density, have
lower biomass and different microbial compositions from other train surfaces. Establishing a
"typical" microbial baseline for mass transit environments will require thoughtful sample
design that controls for local space properties, short- and long-term temporal variation (e.g.
time of day and season), and cross-sectional differences within and between cities. It may also
prove useful to monitor for a combination of normal versus undesirable organisms and
metabolic or functional profiles, as the latter has been observed to be more stable than
taxonomy in the human microbiome [2]. In some cases specific pathogens may be easier to
detect; in others (e.g. when individual pathogens may be extremely low density), structural,
functional, or metabolic shifts may be better indicators of changing transit profiles and,
consequently, health hazards. In all such cases, future studies should incorporate expertise from
architecture, engineering, public health, microbiology, and ecology, thus allowing both
confident and interdisciplinary analyses as well as institutional changes in response to scientific
findings.
In conjunction with other published investigations, this work helps to characterize the
“urban microbiome” and, in doing so, adds to our understanding of how these microbial
communities are formed, maintained, and transferred. Such studies fall in a critical space
between environmental and human-associated microbial ecology, and as such must address the
challenges of both. These include study designs with rich metadata, including architectural
features, human contact, environmental exposure, surface type, and surface material;
accounting for a wide range of potential biochemical environments, contaminants, and biomass
levels; and the involvement of institutional review boards, city officials, and engineers as
appropriate. Future work will help to determine which urban microbes are viable and resident
(as opposed to transient), as well as identifying the mechanisms utilized to persist in the built
environment. It will also be important to identify microbes that can be transferred between
people via specific fomites, since this especially has the potential to inform public health and
policy (by monitoring organisms, gene content, or both). A greater understanding of these
processes may thus eventually lead to construction of built environments that enhance and
maintain human health.
Materials and Methods
Study permissions
The Massachusetts Bay Transportation Authority (MBTA) approved all aspects of transit
system sampling and gave permission to the Harvard T.H. Chan School of Public Health to
conduct this study (Fig. II-4). Additional support was provided by the MBTA Police, who
accompanied the study team during sample collection. A written description of the protocols
and study goals was distributed to interested MBTA passengers during sampling.
Sample collection
Samples were collected in 2013 on May 16, May 23, and October 22 from the public
transit system serving metropolitan Boston during normal workday hours. Train car sampling
began at the outmost termini of train routes (Alewife Station on the Red Line, Riverside Station
on the Green Line, and Forest Hills Station on the Orange Line). Trains were sampled as they
proceeded inbound towards the city center. Station samples were collected by swabbing the
touchscreens and sides of ticket machines at five stations (Fig. 3-1).
For all samples, we recorded the sampling date, outdoor air temperature and relative air
humidity, location, surface type (seat, seat back, horizontal pole, vertical pole, hanging grip,
wall, or touchscreen), and material composition (polyester and vinyl (seats and seat backs),
stainless steel (poles), PVC (grips), combination of wood, engineered wood, extruded
thermoplastic, fiber reinforced plastic, aluminum honeycomb panel, melamine-finished
aluminum panels reinforced with Kevlar (walls), or coated glass (touchscreens)). For train car
samples, we recorded the within-train location of sample collection (end or middle of car), as
well as the train line and location along the route when sample was collected. For station
samples, we recorded the location of each ticketing machine (indoor, outdoor, underground)
and the side of the touchscreen swabbed (right, left, both).
All metadata are described in Table II-1 and where possible, metadata terms from the
Minimum Information Standards for the Built Environment (MIxS-BE) were used [228].
Weather information was compiled from weather archives from the National Oceanic &
Atmospheric Administration [229] and Weather Underground (KBOS [230]).
Swab collection and processing
DNA-free cotton swabs (Puritan, Maine, USA) were used for collection in this study.
Each swab was dipped into a swabbing solution prepared from 0.15 M NaCl and 0.1% Tween
20, as used in previous studies [81, 192, 201, 231]. All surfaces were swabbed for approximately
15 seconds, and each surface was sampled 2 or 3 times with separate swabs over non-
overlapping regions. Swabs were stored together in 15 mL Falcon tubes on ice for no more than
one hour before being taken to a central location and stored on dry ice. All samples were
transported directly from dry ice to a -80°C freezer for storage.
DNA extraction, 16S amplicon sequencing, and operational taxonomic unit (OTU) calling
Samples were processed using the MoBio PowerLyzer PowerSoil DNA extraction kit
(MO BIO Laboratories, Inc.). For each sample, 2 or 3 swabs from the same sample were pooled
for optimal biomass recovery. Amplification and sequencing by Illumina MiSeq were
performed as described previously by Caporaso et al [232]. OTU tables were constructed with
Quantitative Insights into Microbial Ecology (QIIME) software [233] version 1.8 using a closed
reference (pick_closed_reference_otus.py) with Greengenes reference version 13.5 at the 97%
identity level. We filtered low-abundance OTUs (minimum abundance threshold 0.001 in at
least one of 72 samples). Because the primers used in the study were designed to amplify
bacterial 16S genes, we filtered out OTUs that corresponded to chloroplasts, mitochondria, and
archaea. This reduced the dataset to 2,134 unique OTUs representing 501 unique genera. OTU
frequencies in samples were then sum-normalized to proportional data (Table II-2). Further
details can be found in the Supplemental Information.
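The filtering and normalization steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the QIIME workflow used in the study: the function names, the dictionary layout, and the toy counts are all hypothetical, and we assume the abundance threshold is applied to per-sample relative abundances before the final sum-normalization.

```python
def to_relative(counts):
    """Sum-normalize one sample's OTU counts so that they add up to 1."""
    total = sum(counts.values())
    if total == 0:
        return {otu: 0.0 for otu in counts}
    return {otu: n / total for otu, n in counts.items()}


def filter_and_normalize(samples, min_abundance=0.001, min_samples=1):
    """Keep OTUs reaching min_abundance in at least min_samples samples,
    then re-normalize each sample over the surviving OTUs."""
    relative = {name: to_relative(counts) for name, counts in samples.items()}
    all_otus = {otu for rel in relative.values() for otu in rel}
    kept = {
        otu
        for otu in all_otus
        if sum(rel.get(otu, 0.0) >= min_abundance for rel in relative.values())
        >= min_samples
    }
    filtered = {
        name: {otu: n for otu, n in counts.items() if otu in kept}
        for name, counts in samples.items()
    }
    return {name: to_relative(counts) for name, counts in filtered.items()}


# Toy counts (hypothetical, not study data).
toy = {
    "seat": {"otu_a": 900, "otu_b": 100, "otu_rare": 0},
    "pole": {"otu_a": 5000, "otu_b": 4999, "otu_rare": 1},
}
profiles = filter_and_normalize(toy)
```

Under the same assumptions, the stricter filter used for the univariate and multivariate tests would correspond to calling the function with min_samples=7.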
Analysis methods
Alpha diversity was calculated using the Inverse Simpson diversity index in the R
package ‘vegan’ [234]. Ordinations were calculated by principal coordinate analysis (PCoA)
using Bray-Curtis dissimilarity, unless otherwise noted, using the relative abundance table
generated above. For univariate and multivariate tests, we further filtered OTUs (minimum
abundance threshold 0.001 in at least seven of 72 samples). Univariate test for taxa differentially
abundant with respect to touchscreen location was performed using LEfSe [197]. For this
analysis, each metadata category was tested using alpha values of 0.05 for both the Kruskal-
Wallis and Wilcoxon tests with one-against-all comparison and an LDA effect size cutoff of 2.0.
Significant taxa-metadata univariate associations are listed in Table II-3. Multivariate
association tests for taxa that were differentially abundant with respect to metadata were
performed using MaAsLin [211]. For this analysis, we used four metadata categories: these
included locale (train or station), surface type (e.g. seat, seat back, etc), surface material (e.g.
polyvinyl chloride, carpet, etc), and location (e.g. Forest Hills Station, Orange Line train, etc).
Microbial source prediction was performed using SourceTracker [203] with data
from human and environmental sites in Hewitt et al [207]. GraPhlAn [235] was used for
visualization of associations and phylogenetic relationships.
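For readers unfamiliar with the two measures named above, the following Python sketch writes out their usual textbook definitions: the inverse Simpson index, 1 / sum(p_i^2), and the Bray-Curtis dissimilarity, sum|u_i - v_i| / sum(u_i + v_i). The study itself used the R package ‘vegan’; the function names and toy profiles here are hypothetical, and the sketch has not been checked against the outputs reported in this chapter.

```python
def inverse_simpson(abundances):
    """Inverse Simpson diversity: 1 / sum(p_i^2) over proportions p_i."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample has no counts")
    return 1.0 / sum((a / total) ** 2 for a in abundances)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors that list
    the same taxa in the same order."""
    if len(u) != len(v):
        raise ValueError("profiles must cover the same taxa")
    denominator = sum(u) + sum(v)
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


# Toy relative-abundance profiles (hypothetical, not study data).
even_sample = [0.25, 0.25, 0.25, 0.25]
skewed_sample = [1.0, 0.0, 0.0, 0.0]
```

A perfectly even community of four taxa has an inverse Simpson index of 4, while a community dominated by a single taxon has an index of 1; a matrix of pairwise Bray-Curtis values is the input that a principal coordinate analysis would then ordinate.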
Shotgun library sequencing and quality control
DNA was extracted using the MoBio PowerLyzer PowerSoil DNA extraction kit (MO
BIO Laboratories, Inc.) as described for 16S sequencing libraries. Only samples with at least 80
ng/µL were selected and sent to the Broad Institute for shotgun library construction. Libraries
were constructed using the Illumina Nextera XT method and sequenced on the Illumina HiSeq
2000 platform with 100 bp paired-end (PE) reads. The sequencing depth was 16.7 × 10^6 PE reads
per sample. The KneadDATA v0.3 pipeline (http://huttenhower.sph.harvard.edu/kneaddata)
was used to remove low quality reads and human host sequences. Further details can be found
in the Supplemental Information.
Taxonomic and functional profiling of metagenomes
Pan-microbial (bacterial, archaeal, viral, and eukaryotic) taxonomy was determined
using MetaPhlAn2 [136] (http://huttenhower.sph.harvard.edu/metaphlan2). 1,340 microbial
clades comprising 499 species were identified (Table II-2), and filtered for relative abundance ≥
0.1% in at least two samples for downstream multivariate analysis with MaAsLin [211]. For all
MaAsLin analysis involving shotgun taxonomic and functional profiles, we used one metadata
category: collapsed surface types, which included chairs (seat and seat backs), holds (grips,
horizontal and vertical poles), and touchscreens.
Functional genomic profiles were generated with HUMAnN2 version 0.3.0 [148]
(http://huttenhower.sph.harvard.edu/humann2), which leverages the UniRef [143] orthologous
gene family catalog, along with the MetaCyc [144], UniPathway [236], and KEGG [139]
databases. HUMAnN2 gives three outputs: 1) UniRef proteins and their abundances in
reads per kilobase (RPK), 2) MetaCyc pathways and their abundances in RPK, and 3) MetaCyc
pathways and their coverage, ranging from 0 to 1. HUMAnN2 further calculates the RPK and
coverage for each microbial taxon observed in MetaPhlAn2 for each UniRef protein and MetaCyc
pathway.
To look at the functional profile, we collapsed 3,975,869 UniRef50 protein families into
12,074 KEGG Orthology (KO) numbers. UniRef50 proteins that did not belong to any KOs were
not analyzed further. We sum-normalized KO RPKs and focused on KOs with mean abundance
greater than the overall median abundance and variances in the 90th percentile. We identified
KOs that were significantly enriched in chairs, holds, and touchscreens using MaAsLin [211]
with a false discovery rate (FDR) < 0.05. KO differences between surface types were heavily
influenced by the presence of Propionibacterium acnes. To remove this influence, we removed P.
acnes’ RPK contribution to each UniRef50 protein and then re-summed the overall UniRef50
RPK from the remaining taxa. UniRef proteins were again collapsed into KOs and subjected to
the analysis described above. We then compared KOs that were significantly enriched in chairs,
holds, and touchscreens before and after P. acnes removal. Tables with KO RPKs are at
http://huttenhower.sph.harvard.edu/MBTA2015.
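The P. acnes removal step can be illustrated with a short Python sketch. It assumes taxon-stratified RPK values are available as nested dictionaries (gene family, then taxon), which is a hypothetical layout chosen for illustration rather than the HUMAnN2 file format; the family names, KO labels, and RPK values are likewise placeholders. The variance and median filters described above are not reproduced here.

```python
def remove_taxon(stratified_rpk, taxon):
    """Re-sum each gene family's RPK after dropping one taxon's contribution."""
    return {
        family: sum(rpk for name, rpk in by_taxon.items() if name != taxon)
        for family, by_taxon in stratified_rpk.items()
    }


def collapse_to_ko(family_rpk, family_to_ko):
    """Sum gene-family RPKs into KOs; families without a KO are skipped."""
    ko_rpk = {}
    for family, rpk in family_rpk.items():
        ko = family_to_ko.get(family)
        if ko is None:
            continue
        ko_rpk[ko] = ko_rpk.get(ko, 0.0) + rpk
    return ko_rpk


def sum_normalize(values):
    """Scale values so that they add up to 1 (all zeros if the total is 0)."""
    total = sum(values.values())
    return {key: (v / total if total else 0.0) for key, v in values.items()}


# Toy stratified table (hypothetical, not study data).
stratified = {
    "family_x": {"Propionibacterium_acnes": 8.0, "Staphylococcus_epidermidis": 2.0},
    "family_y": {"Propionibacterium_acnes": 1.0, "Micrococcus_luteus": 3.0},
    "family_z": {"Micrococcus_luteus": 5.0},  # no KO assigned below
}
family_to_ko = {"family_x": "KO_1", "family_y": "KO_2"}

without_pacnes = remove_taxon(stratified, "Propionibacterium_acnes")
ko_profile = sum_normalize(collapse_to_ko(without_pacnes, family_to_ko))
```

Running the same collapse on the table before and after removal gives the two KO profiles whose enrichment results are compared.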
Identification and quantification of antibiotic resistance and virulence factor gene markers
Antibiotic resistance gene markers were generated with ShortBRED (Short, Better Representative
Extract Dataset) [219] from the Comprehensive Antibiotic Resistance Database (CARD) [220]
using UniRef90 [237] as a reference. ShortBRED virulence factor markers were generated from
the Virulence Factor DataBase (VFDB) [225] using UniRef50 [237] as a reference (due to the
availability of a previous version of these markers). ShortBRED maps the shotgun reads against
the markers, and returns normalized marker abundances as reads per kilobase per million reads
(RPKM). We aggregated and annotated antibiotic resistance gene markers using the antibiotic
resistance ontology (ARO) numbers in CARD.
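As a reminder of what the RPKM unit expresses, the sketch below writes out the generic definition: read count divided by marker length in kilobases and by sequencing depth in millions of reads. The function name and the numbers are hypothetical, and ShortBRED's own normalization may differ in detail from this generic form.

```python
def rpkm(read_count, marker_length_bp, total_reads):
    """Reads per kilobase of marker per million reads in the sample."""
    if marker_length_bp <= 0 or total_reads <= 0:
        raise ValueError("marker length and total reads must be positive")
    kilobases = marker_length_bp / 1_000
    millions = total_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical example: 50 reads hit a 500 bp marker in a sample of 10 million reads.
example = rpkm(50, 500, 10_000_000)
```

Dividing by marker length keeps long markers from appearing more abundant than short ones, and dividing by depth makes samples sequenced to different depths comparable.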
To facilitate cross-dataset comparison, we also generated 121 bp markers with
ShortBRED from the Antibiotic Resistance Database (ARDB) [238] using UniRef50 [237] as a
reference and aggregated these markers at the ARDB family level. We compared the
distribution of antibiotic resistance gene markers in our dataset to four previously profiled
shotgun datasets describing the gut microbiomes of 552 individuals from the United States [2,
223], China [222], Malawi [223], and Venezuela [223], as well as one shotgun dataset profiling
air microbiomes in a home, hospital (indoors and outdoors), pier, and offices (indoors and
outdoors) [221]. Virulence factors were annotated using VFDB ontologies available on
http://www.mgc.ac.cn/VFs/main.htm. ShortBRED results can be found in Table II-5.
Accession numbers
Raw sequence files were deposited into Sequence Read Archive (SRA) under the
National Center for Biotechnology Information (NCBI) with accession number PRJNA301589.
Acknowledgements
We thank the MBTA Transit Police Department, specifically Chief Paul MacMillan and
Detective Matthew Haney, for their support of this project. We are also grateful to MBTA police
officers Tommy O’Connor and Lieutenant David F. Albanese for their assistance during sample
collection. We also thank Sydney Lavoie and Gerrod Voit for additional laboratory and
computational assistance, and Boyu Ren and Koji Yasuda for helpful feedback and discussion.
Jessica L. Green would like to disclose her affiliation as CTO of Phylagen, Inc., which
does not conflict with the study. The authors declare no conflict of interest.
Chapter 4:
Conclusions
To understand a given microbial community, there are two major questions to be
answered: “Who is there?”, followed by “What are they doing?” DNA sequencing has proven
to be a powerful tool for answering these questions. It has the capability of surveying thousands
of organisms and millions of genes relatively quickly, but is limited in its ability to track
microbial activity. In addition, the size of the resulting datasets restricts most analyses to
identification of associations between microbial abundances and metadata, or a search for
biomarkers or keystone species. Understanding the complexity underlying these trends must
begin with i) characterizing the stability of the observed trend and ii) determining its activity
and its effect within and outside the microbial community. The former may be established via
comprehensive time-series sampling, while the latter may be achieved through the combination
of DNA sequencing with other ‘omics’, such as transcriptomics, proteomics, and metabolomics,
or through wet laboratory experiments.
In Chapter 2, we introduced WAAFLE, which is the first method for detection of de novo
LGT events from metagenomes. A tool that can utilize WMS sequencing data is important, since
the majority of tools for LGT detection are optimized for full genomes. With such tools, identifying
novel LGT events will require constant sequencing of whole genomes, which is achievable for
clinical isolates but difficult for single organisms within a complex community. The direct use
of metagenomes allows for LGT profiling of older datasets in the context of a community (as
opposed to cultured isolates), which may affect LGT activity. We next demonstrated proof of
concept by applying WAAFLE to the Human Microbiome Project Phase 1-II. Indeed, there are
limits to what we can detect: first, potential misassemblies based on read coverage
disproportionately affect contigs classified as LGT, and second, short contig lengths limit
detection of plasmids, unless there are novel rearrangements within them. Still, we were able to
identify high frequency LGT pairs across six major body sites, which increased in frequency
with shorter phylogenetic distances and higher taxonomic abundances. Most pairs were also
specific to environment (body site), though the buccal mucosa, supragingival plaque, and
tongue dorsum shared pairs with differential abundances. As expected, enriched functions in
LGT contigs included mobile elements such as transposons and phage, along with GMP
synthases and TonB outer membrane receptors.
Immediate next steps include characterizing LGT stability over time, as well as
determining how LGT frequency varies with disease and environment. Both approaches require
datasets with specific study designs: the former requires time-series data while the latter
requires case/control cohorts, or samples collected from the built-environment or environmental
sources. Applying WAAFLE to these datasets will help quantify LGT rates, which may occur at
the scale of minutes, days, or months; as well as determine how LGT rates change with disease,
or how they might be associated with cohort metadata (such as dietary intake or drug
administration). Analyses of whole genomes have shown that LGT rates are likely higher in
human-associated versus non-human associated environments: further work may identify the
taxa and functions responsible for increased LGT. Still, computational detection of events at the
DNA level does not indicate active use of transferred genes. To quantify LGT activity, WAAFLE
results should be combined with other ‘omics data in order to find actively transcribed or
translated LGT products. Results may also be combined with wet laboratory procedures such as
qPCR or transformation, to validate the presence or activity of transferred genomic segments,
respectively. Furthermore, attempts to induce LGT within culture may help identify conditions
such as abiotic/biological stress, specific spatial structure, or proximity of select taxon partners
that might favor LGT.
In Chapter 3, we described microbial communities on the Boston subway, which were
mostly derived from human skin and oral sites. Samples were collected from trains on the Red,
Orange, and Green Lines, as well as ticketing machines from Alewife, Park Street, South Station,
Forest Hills, and Riverside. The original intent of the study design was to see if microbial
communities might vary based on the demographic served. Instead, microbial communities on
trains mostly varied by surface type, likely due to rider interactions such as sitting on seats or
touching the ticketing machines, while microbial communities on touchscreens varied mostly
by indoor or outdoor location. Functional profiles were dominated by systems for anaerobic
respiration and porphyrin synthesis, which reflected the high abundance of Propionibacterium
acnes. Overall, the number of antibiotic resistance genes was lower than that found in the
human gut.
Future directions include identifying the stability of these high-traffic spaces as well as
determining the proportion of live, dormant, and dead microbes. The former will require
sampling the subway at regular intervals over a longer period of time. This sampling strategy
will enable us to determine if there is a consistent built-environment microbiome: if so,
fluctuations may be useful indicators of disease outbreak, or simply indicators of changing
seasons, or both. For the latter, microbial viability may be measured using a variety of methods,
including sample treatment with propidium monoazide or cell sorting to distinguish between
DNA from intact versus dead cells, isolation of RNA rather than DNA for transcriptomic
activity, identification of protein synthesis through fluorescent click-chemistry (such as
BONCAT), or measurement of cellular activity through ATP assays. Multiple methods will
need to be tested, as contamination and low biomass are common problems for built-
environment samples. Furthermore, if the majority of built-environment samples are dead, then
profiling should shift from looking at microbial taxa to looking at metabolites, or microbial
components such as pathogen-associated molecular patterns (PAMPs), which may stimulate the
human immune system.
Long term goals include understanding how LGT affects microbial evolution and how
the built-environment influences human health, especially immune development. It is unclear
what role LGT plays in speciation and whether that role differs today versus the evolutionary
past. Still, it is clear that LGT has a clinical impact, especially in the rise in antibiotic resistance.
If we can identify the conditions under which LGT occurs, as well as the specific gene segments
and taxa participating in transfer, it may be possible to use LGT to alter microbial community
structure or processes, or predict short-term microbial evolution (especially for pathogens).
Some work has also suggested that LGT helps maintain bacterial species: thus, a better
understanding may help refine the “species concept” for bacteria, leading to better taxonomic
assignment and calculation of phylogenetic distance.
In contrast, to characterize the effects of the built environment on human immune
development, studies should move beyond single built-environment types, and begin i)
comparing same-purpose built-environment structures that lead to differential health
outcomes, or ii) comparing different built-environment structures to identify their similarities
and differences. An example of the former involves surveying nursing homes with varying
survival rates, while an example of the latter would include examining a rural home versus
urban home. Study of the former could identify aspects of building design that might facilitate
better health outcomes through microbial community modulation, such as increased ventilation
or changes in hygiene. Study of the latter can help establish a baseline built-environment, likely
to be skin and oral microbes, and determine which microbes or PAMPs could potentially be
introduced. This is especially important if constant exposure to skin and oral-derived microbes
leads to adverse health outcomes. Multiple diseases have been associated with the microbiome,
of which a subset are linked to Western lifestyle and diet. This has led to an extensive search for
therapeutics to modulate the human microbiome. A better understanding of LGT and the built-
environment microbiome may help spur therapeutics, and highlight adaptive mechanisms used
by the microbiota and host to adjust to the “new normal.”
Appendix I:
Supplemental Materials for Chapter 2
Supplemental Figures
Figure I-1. Filtering potential misassemblies. To search for misassemblies, shotgun reads were
mapped to contigs using Bowtie2. We then examined both read coverage (Step 1) and read
support (Step 2) for gene junctions, or the regions between two genes on a contig. Genes
containing any single junction that fails both steps 1 and 2 are removed from analysis.
Figure I-2. Determining which contig types contain misassemblies. In A), we show the
percentage of contigs filtered out via read mapping, stratified by whether WAAFLE classified
them as LGT or not. We find that more LGT contigs are filtered out, as expected. In B), we
examine the gene junction type and determine what percentage have read support. Here “AA”
junctions are defined as gene junctions between two genes annotated to the same taxa, while
“AB” junctions are defined as gene junctions between two genes annotated to different taxa. As
expected, junctions between genes annotated to different taxa have less read support.
Figure I-3. Gene call evaluation. To assess how well WAAFLE calls genes, we varied the
subject coverage threshold (for including a BLAST hit), gene length threshold (above which to
include the gene), and minimum overlap (above which to merge a BLAST hit into a gene
group).
Figure I-4. LGT evaluation with or without missing BLAST hits. We show the TPR against the
FPR for the LGT evaluation with 20% of BLAST hits removed (on left) versus the evaluation
with all BLAST hits (on right).
Figure I-5. Selection of k1 and k2. As in Fig. 2-2, we show the LGT evaluation for WAAFLE with
20% of BLAST hits removed. On the left, we hold k2 at 0.8 while we vary k1 from 0.1 to 0.9 (blue
line represents default k1 chosen). On the right, we hold k1 at 0.5 while we vary k2 from 0.1 to 0.9
(red line indicates default k2 chosen). In A), colors indicate the inter-taxon level for LGT, for
example, “species” in red shows the TPR and FPR for inter-species LGT across different k1 and
k2 values. In B), colors indicate the taxonomic level at which WAAFLE is evaluated for
taxonomic assignment. For example, “species” in red indicates the percentage of correct species
calls in LGT contigs.
Figure I-6. Comparison of LGT measures. We attempted to quantify LGT frequencies per
sample using 2 methods: 1) the number of LGT contigs divided by the total number of genes, 2)
the number of genes in LGT contigs divided by the total number of genes. Initially, we were
concerned that the former might overestimate LGT in samples with many short contigs, while
the latter might overestimate LGT in samples with many long contigs. In this plot, each point
represents a LGT taxon pair in a body site. The x-axis is the first measure, while the y-axis is the
second measure. We found that the two measures were highly correlated within body sites,
indicating that higher values in either measure usually point to higher frequencies of LGT.
However, when comparing body sites, we observe a different y-axis scale for the posterior
fornix: the longer contigs in the posterior fornix may lead to larger LGT frequencies if gene
percentages (measure 2) are utilized rather than events per gene.
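The two measures compared in Figure I-6 can be written out as a short Python sketch. It assumes each contig is summarized as a (number of genes, LGT flag) pair, a hypothetical representation chosen for illustration and not WAAFLE's output format; the toy values are not study data.

```python
def lgt_measures(contigs):
    """Return (LGT contigs per gene, fraction of genes in LGT contigs).

    contigs: iterable of (n_genes, is_lgt) pairs for one sample.
    """
    contigs = list(contigs)
    total_genes = sum(n_genes for n_genes, _ in contigs)
    if total_genes == 0:
        raise ValueError("sample has no genes")
    lgt_contigs = sum(1 for _, is_lgt in contigs if is_lgt)
    lgt_genes = sum(n_genes for n_genes, is_lgt in contigs if is_lgt)
    return lgt_contigs / total_genes, lgt_genes / total_genes


# Toy sample (hypothetical): one short and one long LGT contig, two non-LGT contigs.
toy_sample = [(2, True), (10, True), (4, False), (4, False)]
events_per_gene, gene_fraction = lgt_measures(toy_sample)
```

In this toy sample the long LGT contig raises the second measure (12 of 20 genes) far more than the first (2 events over 20 genes), which is the contig-length effect described for the posterior fornix.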
Figure I-7. Jaccard and Bray-Curtis distances between inter-individual, intra-individual, and
technical samples. We looked to see if LGT detection via WAAFLE is reproducible across
technical replicates and stable in individuals. We focused on contigs with taxonomy resolved to
the genus level and inter-genus LGT events. For each body site, we subsampled half the
samples while including all technical replicates. In each sample, gene percentages were
quantified for inter-genus taxon pairs (number of genes in LGT pair divided by number of
sample genes) and single taxa (number of genes for taxa in sample divided by number of
sample genes). We then calculated Jaccard and Bray-Curtis distances between samples from
different individuals, the same individual but different time points, and technical replicates.
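As a rough illustration of the presence/absence comparison, the sketch below computes a Jaccard distance between the sets of inter-genus LGT pairs detected in two samples. This is the set-based form of the measure, which is an assumption on our part; the analysis in the figure may have used a different implementation, and the genus pairs shown are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of detected items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical LGT taxon pairs detected at two visits of the same individual.
visit_1 = {("Streptococcus", "Veillonella"), ("Prevotella", "Veillonella")}
visit_2 = {("Streptococcus", "Veillonella"), ("Haemophilus", "Neisseria")}
distance = jaccard_distance(visit_1, visit_2)
```

Smaller distances between technical replicates than between individuals would indicate that detection is reproducible.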
Figure I-8. Phylogenetic distances computed from random taxa pairs within body sites. For
each body site, we (i) randomly chose WAAFLE-called pairs (waafle) or (ii) generated taxon
pairs by randomly choosing two taxa weighted by gene percentages (simulated). For the
former, up to 1,000 pairs or the total number of taxon pairs were chosen, whichever number is
smaller. For the latter, 1,000 unique pairs were generated. We then plotted the A) phylogenetic
distance distribution and B) LGT joint abundance distribution. Joint abundances were
calculated by multiplying the gene percentage of one taxon (number of genes for a single taxon
in a sample divided by total sample genes, averaged across sites) against the other.
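The simulated pairs and joint abundances described in this caption can be sketched as follows. The sampling scheme (two weighted draws, rejecting identical taxa and duplicate pairs) is our reading of the caption rather than the exact procedure used, and the taxon names, gene percentages, and seed are hypothetical.

```python
import random


def simulate_pairs(gene_pct, n_pairs, seed=0):
    """Draw unique unordered taxon pairs, choosing each taxon with
    probability proportional to its gene percentage."""
    rng = random.Random(seed)
    taxa = sorted(t for t, pct in gene_pct.items() if pct > 0)
    weights = [gene_pct[t] for t in taxa]
    max_pairs = len(taxa) * (len(taxa) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough taxa for the requested number of pairs")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.choices(taxa, weights=weights, k=2)
        if a != b:
            pairs.add(tuple(sorted((a, b))))
    return pairs


def joint_abundance(gene_pct, pair):
    """Product of the two taxa's gene percentages."""
    a, b = pair
    return gene_pct[a] * gene_pct[b]


# Toy gene percentages (hypothetical, not study data).
toy_pct = {"taxon_a": 0.5, "taxon_b": 0.3, "taxon_c": 0.2}
simulated = simulate_pairs(toy_pct, n_pairs=3)
```

Comparing the distributions for such simulated pairs against the WAAFLE-called pairs shows whether observed LGT partners are closer or more abundant than expected from community composition alone.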
Supplemental Tables
Table I-1. WAAFLE Parameters. This table describes the 5 parameters used to tune the
WAAFLE pipeline.

Parameter | Definition | WAAFLE steps involved (and default values) | Effect
Subject coverage (s) | Percentage of a reference gene (subject sequence) that aligned to the contig (query sequence) | Step 2: s = 0.75; Step 3: s = 0 | Increasing subject coverage filters out low quality BLAST hits when calling genes (Step 2) and scoring taxa (Step 3). Including higher quality BLAST hits in Step 2 led to more accurate gene calling. Including more BLAST hits in Step 3 led to higher taxon scores.
Overlap percentage (o) | Length of the region by which two nucleotide fragments overlap, divided by the length of the shorter fragment | Steps 2 & 3: o = 0.1 | Lowering overlap percentage allows more hits to be merged into groups, leading to fewer gene calls (Step 2). Inclusion of more BLAST hits per gene for taxon scoring (Step 3) can lead to higher scores.
Gene length (g) | Length of a gene called by or supplied to WAAFLE | Step 2: g = 200 bp | A higher gene length cutoff prevents LGTs from being called due to spurious gene calls, and leads to lower numbers of genes per contig.
One taxon score (k1) | A single taxon’s minimum score across all genes in a contig | Step 4: k1 = 0.5 | A lower threshold for the one taxon score makes it easier for WAAFLE to annotate a contig as “No LGT”.
Two taxon score (k2) | The minimum score for two taxa after maximizing scores between them across all genes in a contig | Step 4: k2 = 0.8 | A lower threshold for the two taxon score makes it easier for a contig to be called “LGT”.
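The k1 and k2 definitions in Table I-1 can be illustrated with a short sketch. It assumes each taxon has one score per gene on a contig, and it assumes that the one taxon threshold is checked before the two taxon threshold; the function names and the "Unclassified" label are hypothetical, and this is not the WAAFLE implementation itself.

```python
from itertools import combinations
from typing import Dict, List


def one_taxon_score(scores: List[float]) -> float:
    """A single taxon's minimum score across all genes in the contig."""
    return min(scores)


def two_taxon_score(scores_a: List[float], scores_b: List[float]) -> float:
    """Per gene, keep the better of the two taxa; then take the minimum over genes."""
    return min(max(a, b) for a, b in zip(scores_a, scores_b))


def classify_contig(gene_scores: Dict[str, List[float]],
                    k1: float = 0.5, k2: float = 0.8) -> str:
    """gene_scores maps taxon -> per-gene scores (same gene order for every taxon)."""
    best_one = max(one_taxon_score(s) for s in gene_scores.values())
    if best_one >= k1:
        return "No LGT"  # one taxon explains every gene well enough
    best_two = max(
        (two_taxon_score(gene_scores[a], gene_scores[b])
         for a, b in combinations(sorted(gene_scores), 2)),
        default=0.0,
    )
    return "LGT" if best_two >= k2 else "Unclassified"
```

Under these assumptions, lowering k1 sends more contigs to "No LGT" and lowering k2 sends more of the remainder to "LGT", matching the effects described in the table.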
Appendix II:
Supplemental Materials for Chapter 3
Supplemental Figures
Figure II-1. Biomass and alpha diversity for train and station samples. (A) Biomass from
samples collected across the subway system. Each data point represents a pooled sampling
strategy in which two or three swabs from the same site were pooled and jointly extracted.
DNA yield is plotted in ng/mL. (B) Alpha diversity by surface type, as measured by the inverse
Simpson diversity index. In both (A) and (B), colors represent the line from which the
sample was derived (red, orange, or green for a train or station on that line, or black for a
sample from within a downtown station).
Figure II-2. Ordination of surface data subsets. (A) Train hold surfaces by train line, (B) train
chair surfaces by train line, (C) train chairs by material, and (D) touchscreen surfaces by date.
All ordinations are principal coordinates analyses using Bray-Curtis distance, colored by
metadata category, calculated using filtered OTU relative abundance table subsets of the
relevant samples.
Figure II-3. Comparison of antibiotic resistance markers from the ARDB database. RPKMs of
antibiotic resistance gene markers from air microbiomes in New York City (office) and San
Diego (hospital, home, pier), the Boston MBTA, and gut microbiomes from 552 individuals from
the United States, China, Malawi, and Venezuela.
Figure II-4. Letter from the MBTA. We received MBTA approval, by way of the MBTA Transit
Police, to carry out the study prior to grant submission and confirmed detailed sampling plans
with the MBTA prior to any public work. Their assistance and input were invaluable both for
study design and for safe execution of sample collection, and this letter includes the initial
approval of the work from Chief MacMillan.
Supplemental Tables
Supplemental Tables are too large to display and are available online at
http://huttenhower.sph.harvard.edu/MBTA2015. Captions are included below for reference.
Table II-1. Sample collection and metadata. Includes metadata for all collected samples that
were sequenced via 16S amplicon or shotgun sequencing. Abbreviations are defined at the
bottom.
Table II-2. 16S and shotgun OTU tables along with taxa present across a sequencing plate. The
first tab contains the 16S OTU counts after quality control, stitching, length filtering, removal of
chloroplast, mitochondria, and archaea, and filtering for at least 0.1% in 1 sample. The second
tab contains the unfiltered MetaPhlAn OTU table with percentages (100 = 100%). Note that
additional filtering was performed before LEfSe and MaAsLin runs for both 16S (at least 0.1% in
7 samples) and shotgun (at least 0.1% in 2 samples). The third tab contains our analyses to
identify contaminant taxa. As a negative control, we examined all samples present on a
sequencing plate containing a subset of MBTA samples, which included touchscreens (n=21),
trains (n=6), 30 saliva cultures, 13 skin samples, and 2 macaque tissue samples. Listed taxa
were present in 80% of samples with at least 0.00001 abundance, and are shown with their
average abundance across all samples. This provides a quality control test for potential
contaminant taxa, none of which were nontrivially abundant or significant during our MBTA
analyses.
Table II-3. LEfSe and MaAsLin analysis for 16S sequencing. The first tab contains LEfSe
results when searching for differentially abundant taxa between touchscreen locations:
outdoors (out), underground (under), and indoors near an exit facing an outside environment
(inout). Significant results report both logarithmic LDA scores and p-values. The second tab
contains results for MaAsLin run with four covariates, including surface category, surface type,
surface material, and surface location. Only organisms with q<0.25 are reported.
Table II-4. MaAsLin analysis for shotgun data. MaAsLin analysis was performed to identify
differentially abundant taxa (first and second tab) and KOs (third and fourth tab) with respect
to surface type. For both, surface type was split into chairs (seat backs and seats), holds
(horizontal/vertical poles, grips), and touchscreens. For identifying differentially abundant taxa,
we performed MaAsLin with full taxonomies at all levels (first tab) as well as with species only
(second tab). All results are reported: we considered organisms with q<0.25 to be significant. For
identifying differentially abundant KOs, we performed MaAsLin on KO abundances calculated
using all shotgun reads (third tab) and after P. acnes-associated reads were removed (fourth
tab). Only significant results are reported; these are KOs with q<0.05.
Table II-5. Antibiotic resistance gene and virulence factor markers. RPKM values for CARD
(first tab), VFDB (second tab), and ARDB (third tab). The RPKM values for CARD and VFDB are
only for MBTA data; the ARDB data contain values from multiple shotgun datasets.
Supplemental Information
The BioProject number, protocols, raw data tables, and supplemental tables can be
downloaded at http://huttenhower.sph.harvard.edu/MBTA2015.
Methods and Materials
DNA extraction, 16S amplification and sequencing. Samples were processed using the
MoBio PowerLyzer PowerSoil DNA extraction kit (MO BIO Laboratories, Inc.) with bead-beating
homogenization. For each sample, 2 or 3 swabs from the same site were pooled for
optimal biomass recovery. Each swab was individually homogenized in a bead-beating tube at
6.0 m/s for 40 seconds on the MP Biomedical FastPrep 24, but subsequent cleanup was pooled
over one column. DNA extracts were quantified using a Qubit fluorimeter and sent to
the Broad Institute for sequencing. Amplification and sequencing by Illumina MiSeq were
performed as described previously [232]. In brief, genomic DNA was subjected to 16S
amplification using primers designed to incorporate the Illumina adapters and a sample barcode
sequence, allowing directional sequencing covering variable region V4 (primers:
515F [GTGCCAGCMGCCGCGGTAA] and 806R [GGACTACHVGGGTWTCTAAT]). PCR was
performed in triplicate with 1 μl of template (1:50), 10 μl of HotMasterMix with the HotMaster
Taq DNA Polymerase (5 Prime), and 1 μl of primer mix (for a final concentration of 10 μM). The
cycling conditions consisted of an initial denaturation at 94°C for 3 min, followed by 24 cycles of
denaturation at 94°C for 45 sec, annealing at 50°C for 60 sec, and extension at 72°C for 5 min, with a
final extension at 72°C for 10 min. Amplicons were quantified on the Caliper LabChipGX
(PerkinElmer, Waltham, MA), pooled in equimolar concentrations, size selected (375-425 bp) on
the Pippin Prep (Sage Sciences, Beverly, MA) to reduce non-specific amplification products
from host DNA, and final library sizing and quantification were performed on an Agilent
Bioanalyzer 2100 DNA 1000 chip (Agilent Technologies, Santa Clara, CA). Sequencing was
performed on the Illumina MiSeq platform (version 2) according to the manufacturer’s
specifications with addition of 5% PhiX, and yielded paired-end reads of 150 bp in length in
each direction. Total read depth was at least 5,000 reads (up to over 100,000 reads) per sample.
OTU calling. Quantitative Insights into Microbial Ecology (QIIME) software [233]
version 1.8 was used for data processing. Paired-end reads (with approximately 97 bp overlap)
were stitched and size selected (225 – 275 bp) to reduce nonspecific amplification products.
Operational taxonomic units (OTUs) were called with a closed reference
(pick_closed_reference_otus.py) using the Greengenes reference version 13.5 at the 97% identity
level based on the PICRUSt [127] protocol. Using these parameters, we observed 17,954 unique
OTUs. We filtered low-abundance OTUs (minimum abundance threshold 0.001 in at least one of
72 samples); this reduced the dataset to 2,134 unique OTUs representing 501 unique genera.
Since the primers used in the study were designed to amplify bacterial 16S genes, we filtered
out OTUs that corresponded to chloroplasts, mitochondria, and archaea. OTU frequencies in
samples were then sum-normalized to proportional data. The filtered OTU tables can be found
in Table II-2.
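The filtering and normalization steps described above can be sketched as follows. This is an illustration under stated assumptions, not the QIIME workflow itself: the OTU table is assumed to be a dictionary of per-sample counts, taxonomy strings are assumed to contain the words "Chloroplast", "mitochondria", or "Archaea" where relevant, and the function name is hypothetical.

```python
from typing import Dict, List

EXCLUDED_TERMS = ("chloroplast", "mitochondria", "archaea")


def filter_and_normalize(otu_counts: Dict[str, List[float]],
                         taxonomy: Dict[str, str],
                         min_abundance: float = 0.001) -> Dict[str, List[float]]:
    """Drop rare and non-bacterial OTUs, then sum-normalize each sample."""
    if not otu_counts:
        return {}
    n_samples = len(next(iter(otu_counts.values())))
    totals = [sum(c[j] for c in otu_counts.values()) for j in range(n_samples)]

    kept: Dict[str, List[float]] = {}
    for otu, counts in otu_counts.items():
        rel = [c / t if t > 0 else 0.0 for c, t in zip(counts, totals)]
        if max(rel) < min_abundance:
            continue  # never reaches the threshold in any sample
        label = taxonomy.get(otu, "").lower()
        if any(term in label for term in EXCLUDED_TERMS):
            continue  # chloroplast, mitochondrial, or archaeal OTU
        kept[otu] = counts

    # Renormalize so that the remaining OTUs sum to 1 within each sample.
    new_totals = [sum(c[j] for c in kept.values()) for j in range(n_samples)]
    return {otu: [c / t if t > 0 else 0.0 for c, t in zip(counts, new_totals)]
            for otu, counts in kept.items()}
```

Normalizing after the exclusions means the reported proportions describe only the retained bacterial OTUs.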
KneadData. KneadData incorporates Trimmomatic [239] and bowtie2 [240] for filtering
and human sequence removal, respectively. Reads were scanned with a four-base wide sliding
window and trimmed when the average base Phred score dropped below 20. Trimmed reads
shorter than 70 nt were discarded. UCSC Human genome assembly version hg38 was used as
reference for removal of human sequences. The average sequencing depth after quality control
was 9.8 × 10^6 reads per sample.
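The sliding-window trimming rule can be illustrated with a simplified sketch. It assumes the read is cut at the start of the first four-base window whose mean Phred score is below 20, which is a simplification of what Trimmomatic does internally; the function names are hypothetical.

```python
from typing import List, Tuple


def sliding_window_trim(seq: str, quals: List[int], window: int = 4,
                        min_avg: float = 20.0) -> Tuple[str, List[int]]:
    """Cut the read at the first window whose mean quality falls below min_avg."""
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals  # no failing window (or read shorter than the window)


def keep_read(seq: str, min_len: int = 70) -> bool:
    """Trimmed reads shorter than min_len nucleotides are discarded."""
    return len(seq) >= min_len
```

For a 100 nt read with 80 bases at Phred 30 followed by 20 bases at Phred 10, the first failing window begins at position 79, so the read is kept at 79 nt.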
Negative control analyses. Unfortunately, our study did not include negative controls
beyond those internal to the sequencing platform. Instead, we took several measures during
analysis to test for contamination in the 16S datasets. First, we looked at relative abundances
across multiple sets of samples on the same sequencing plate, since taxa present across all
samples may indicate contamination (especially since the batch included many non-transit
samples). This was possible mainly for the touchscreen samples (n=21) and a few train samples
(n=6), which were pooled with 30 saliva cultures, 13 skin samples, and 2 macaque adipose
samples. At the species level, we found 42 taxa (of 1647 total) that were present in 80% of
samples, with average abundance ranging from 0.018% (Pseudomonas unknown) to 11.1%
(Actinomyces unknown). Many of these are skin-associated, including Pseudomonas,
Staphylococcus, and Corynebacterium (in increasing abundance), or associated with the oral cavity,
including Fusobacterium, Veillonella, Peptostreptococcus, Streptococcus, Prevotella, and
Porphyromonas (in increasing abundance) (Table II-2). It is unclear whether the latter arise from the
large number of saliva samples in this dataset or represent true contaminants. None of the taxa with
lower average abundance are key to our findings.
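The plate-wide prevalence screen can be sketched as below. The 80% prevalence and 0.00001 abundance thresholds follow the Table II-2 caption; the data layout (taxon mapped to per-sample relative abundances) and the function name are assumptions for illustration.

```python
from typing import Dict, List


def prevalent_taxa(abundances: Dict[str, List[float]],
                   min_prevalence: float = 0.8,
                   min_abundance: float = 1e-5) -> Dict[str, float]:
    """Return taxa seen in enough samples on the plate, with their mean abundance."""
    flagged: Dict[str, float] = {}
    for taxon, values in abundances.items():
        if not values:
            continue
        n_present = sum(1 for v in values if v >= min_abundance)
        if n_present / len(values) >= min_prevalence:
            # Mean is taken across all samples, including those lacking the taxon.
            flagged[taxon] = sum(values) / len(values)
    return flagged
```

Taxa returned by such a screen are only candidates: as noted above, a taxon can be prevalent across a plate because of the sample types on it rather than because of contamination.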
Chloroplast and mitochondrial sequences were considered to be a type of
contaminant in our study, inasmuch as they essentially represent reads derived from plant and
human material. They were found across all touchscreen and surface samples, but at very low
levels in adipose tissue (primate, not human-derived) and saliva. Others have claimed that
chloroplast DNA may be an artifact of cotton swabs rather than environmental exposures; our
skin samples were processed with Copan swabs and yielded 1-2 orders of magnitude fewer
chloroplast sequences (<1% maximum). Our standard primer pairs are known to amplify
chloroplast and mitochondrial sequences: this is a well-known problem for those who study
plant-associated microbial communities [241, 242]. Chloroplast DNA percentages varied from
1.32%-6.98%, and mitochondrial DNA percentages from 0.054-1.03%, in the touchscreens. They varied even more in the train data (not
pooled with the touchscreens): chloroplast DNA ranged from 0.9% to 62.39%, with especially
high levels on the Red line, while mitochondrial DNA varied from 0-8.27% on the trains (data
available via website). This led to our analysis strategy of treating both sequence types like
typical contaminants, discounting their sequence abundance, renormalizing, and analyzing
primarily the resulting quality-controlled datasets.
Physical negative controls should be part of future study design, as recommended by
Adams et al [114] and Salter et al [115]. Their use, we note, must still be context dependent, as
no one blanket analysis is likely to apply to different sample and contaminant types. Some
studies have utilized the approach outlined in Flores et al, where OTUs constituting greater
than 1% of the total negative control sequences were removed from all samples prior to
rarefaction and analyses [243]. Another approach developed by Meadow et al involves
searching for taxa with high abundance in negative controls relative to samples: this is done by
plotting the relative abundances of taxa in negative controls against the relative abundances of
taxa in samples and applying a cutoff [190]. Adams et al performed a meta-analysis of built
environment studies, and reported phylum Tenericutes as significantly enriched in kit
microbiomes, and Cyanobacteria (or chloroplast) as highly abundant in dust but not in kits.
They also mention that skin taxa are often found as contaminants, but removing them could
remove true signal. Typical kit contaminant taxa were also not significant in our study.
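As an illustration of the first approach described above (removing OTUs that make up more than 1% of negative control sequences), a minimal sketch follows. It assumes counts are stored as dictionaries keyed by OTU identifier; the function names are hypothetical, and this filter was not applied in our study, which lacked physical negative controls.

```python
from typing import Dict, Set


def control_contaminants(control_counts: Dict[str, int],
                         max_fraction: float = 0.01) -> Set[str]:
    """OTUs exceeding max_fraction of all sequences in the negative controls."""
    total = sum(control_counts.values())
    if total == 0:
        return set()
    return {otu for otu, n in control_counts.items() if n / total > max_fraction}


def remove_otus(sample_counts: Dict[str, int],
                contaminants: Set[str]) -> Dict[str, int]:
    """Drop flagged OTUs from one sample before rarefaction and analysis."""
    return {otu: n for otu, n in sample_counts.items() if otu not in contaminants}
```

A fixed cutoff of this kind would also remove skin taxa that happen to appear in the controls, which is the risk to true signal noted above.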
Comparison to the NY subway study
To expand our comparison with the previous NYC subway study, we downloaded their
MetaPhlAn2 tables (provided at the time of the NYC publication by Nicola Segata in
collaboration with our group) from their supplementary data. We applied a simple quality
control filter by retaining taxa with at least 0.1% abundance in at least 1% of samples (14
samples), and then focused specifically on the samples most similar to ours, i.e. from subway
stations or trains.
In the NYC study, the most abundant taxa in the resulting 1,416 samples included
Pseudomonas stutzeri (27.01%), Pseudomonas unclassified (8.66%), Enterobacter cloacae (7.66%),
Stenotrophomonas maltophilia (7.10%), and Acinetobacter pittii/calcoaceticus/nosocomialis (3.39%).
Neither Yersinia nor Bacillus anthracis was present in any samples. These results are strikingly
different from our top species from similarly analyzed metagenomic data, which included
Propionibacterium acnes (47.44%), Propionibacterium phage (total ~6%), Micrococcus luteus (2.40%),
and Staphylococcus epidermidis (1.98%). This may be due to a combination of factors, most likely
the different types of surfaces sampled, but also including the swab protocol development and
biomass validation prior to sequencing carried out for our study (see Methods). Most of our
samples represent heavily utilized, nonporous, non-sanitized surfaces within train cars or, less
often, stations; in contrast, NYC study surfaces include benches (n=326), rails/poles (condensed
from other categories, n=468), garbage cans (n=142), kiosks (n=161), turnstiles (n=151), and doors
(n=77), with all other surfaces sampled <24 times.
In support of this hypothesis, the NYC microbiomes at least in part do resemble those of
other built environment surfaces and dust. Adams et al [244], for example, collected dust in
vacuums or passively (through settlement). The former, which was considered homogenized,
had significantly higher levels of Pseudomonadales, Enterobacteriales, and Streptophyta as
compared to the latter. Overall, Gammaproteobacteria dominated most samples (ave. 76.8%),
still primarily from Pseudomonadales and Enterobacteriales, and overshadowed the Bacilli
(6.68%), Betaproteobacteria (5.02%), Alphaproteobacteria (4.80%), and Actinobacteria
(2.48%). The NYC subway had high levels of Enterobacteriales (17.90%) and Pseudomonadales
(49.61%), but no Streptophyta (0%, suggesting a possible sampling or extraction bias).
However, it is difficult to compare NYC swabbed samples (or our own) to vacuumed or settled
dust, given the extreme heterogeneity seen in the latter for distinct space types or time
integration periods. The dust studied by Adams et al, for example, was in turn quite distinct from dust in the
International Space Station [245], a mixed-use academic classroom building [201],
or house dust [92], none of which significantly resembled our skin-dominated MBTA surfaces.
Taking these unusual features of the NYC subway data as given, however, we sought to
determine whether surface material was at least a major determinant of their microbial
community composition, as it proved to be for ours. We grouped their sample metadata into
four categories: type of object (bench, rail/pole, garbage can, kiosk, turnstiles, etc.), surface
material (wood, metal, plastic, etc.), object category (station, train, etc.), and borough (Queens,
Brooklyn, Manhattan, etc.). Applying the MaAsLin multivariate linear model to these variables
jointly, we found 71 differentially abundant clades at FDR<0.25.
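To make the FDR threshold concrete, the sketch below converts p-values to q-values and counts those below 0.25. It assumes a Benjamini-Hochberg correction, which the text does not state explicitly, and the function name is hypothetical; it is not the MaAsLin code.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def count_significant(pvalues: List[float], fdr: float = 0.25) -> int:
    """Number of tests with q-value below the chosen FDR threshold."""
    return sum(1 for q in benjamini_hochberg(pvalues) if q < fdr)
```

At a threshold of 0.25, up to a quarter of the reported associations may be false positives, which is one reason low effect size differences are interpreted cautiously here.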
Surprisingly, none of these associations were with surface material type; most instead
segregated with object type, which may at least be concordant with the much greater diversity
of objects sampled in the NYC study. Rails and poles had lower levels of Pseudomonas and
Acinetobacter lwoffii as compared to benches, for example, while garbage cans had higher levels
of Enterococcus italicus and Leuconostoc. Clostridia and Klebsiella (not marine taxa) were found in
the abandoned South Ferry and Penn Station timecourse samples, as well as in trains as
compared to all other stations. Lastly, and also surprisingly, some taxa were associated with
borough: this includes higher levels of Acinetobacter and Moraxellaceae in Manhattan as
compared to the Bronx. Without more detail on the study’s exact sampling protocol - which
parts of these diverse objects were swabbed, for example, and for how long over what surface
area - it is difficult to interpret statistically significant but low effect size differences. It may be
useful for future studies to sample fewer, more controlled environments with greater
specificity, and of course to assess the results with more careful and targeted metagenomic
analyses.
References
1. Sender, R., S. Fuchs, and R. Milo, Revised Estimates for the Number of Human and Bacteria
Cells in the Body. PLoS Biol, 2016. 14(8): p. e1002533.
2. Consortium, T.H.M.P., Structure, function and diversity of the healthy human microbiome.
Nature, 2012. 486(7402): p. 207-14.
3. Engineering, N.A.o., E. National Academies of Sciences, and Medicine, Microbiomes of the
Built Environment: A Research Agenda for Indoor Microbiology, Human Health, and Buildings.
2017, Washington, DC: The National Academies Press. 253.
4. Shapiro, J.A., Thinking about bacterial populations as multicellular organisms. Annu Rev
Microbiol, 1998. 52: p. 81-104.
5. Meadow, J.F., et al., Humans differ in their personal microbial cloud. PeerJ, 2015. 3: p. e1258.
6. Rosenthal, M., et al., Skin microbiota: microbial community structure and its potential
association with health and disease. Infect Genet Evol, 2011. 11(5): p. 839-48.
7. Leewenhoeck, A.v., Observations, Communicated to the Publisher by Mr. Antony van
Leewenhoeck, in a Dutch Letter of the 9th of Octob. 1676. Here English'd: concerning Little
Animals by Him Observed in Rain-Well-Sea. and Snow Water; as Also in Water Wherein Pepper
Had Lain Infused. Philosophical Transactions Royal Society, 1677. 12: p. 821-831.
8. Adler, A. and E. Ducker, When Pasteurian Science Went to Sea: The Birth of Marine
Microbiology. J Hist Biol, 2017.
9. Razumov, A., The direct method of calculation of bacteria in water: comparison with the Koch
method. Mikrobiologija, 1932. 1: p. 131-146.
10. Staley, J.T. and A. Konopka, Measurement of in situ activities of nonphotosynthetic
microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol, 1985. 39: p. 321-46.
11. Stewart, E.J., Growing unculturable bacteria. J Bacteriol, 2012. 194(16): p. 4151-60.
12. Soucy, S.M., J. Huang, and J.P. Gogarten, Horizontal gene transfer: building the web of life.
Nat Rev Genet, 2015. 16(8): p. 472-82.
13. Lang, A.S., O. Zhaxybayeva, and J.T. Beatty, Gene transfer agents: phage-like elements of
genetic exchange. Nat Rev Microbiol, 2012. 10(7): p. 472-82.
14. Naor, A., et al., Low species barriers in halophilic archaea and the formation of recombinant
hybrids. Curr Biol, 2012. 22(15): p. 1444-8.
15. Zhaxybayeva, O. and W.F. Doolittle, Lateral gene transfer. Curr Biol, 2011. 21(7): p. R242-6.
16. Griffith, F., The Significance of Pneumococcal Types. J Hyg (Lond), 1928. 27(2): p. 113-59.
17. Avery, O.T., C.M. Macleod, and M. McCarty, Studies on the Chemical Nature of the Substance
Inducing Transformation of Pneumococcal Types : Induction of Transformation by a
Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type Iii. J Exp Med, 1944. 79(2):
p. 137-58.
18. Ochiai, K., et al., Studies on inheritance of drug resistance between Shigella strains and
Escherichia coli strains. Nihon Iji Shimpo, 1959. 1861: p. 34-46.
19. Akiba, T.K.T.I.Y., S. Kimura, and T. Fukushima, Studies on the mechanism of development of
multiple drug-resistant Shigella strains. Nihon Iji Shimpo, 1960. 1866: p. 45-50.
20. Anderson, E.S., The ecology of transferable drug resistance in the enterobacteria. Annu Rev
Microbiol, 1968. 22: p. 131-80.
21. Aravind, L., et al., Evidence for massive gene exchange between archaeal and bacterial
hyperthermophiles. Trends Genet, 1998. 14(11): p. 442-4.
22. Nelson, K.E., et al., Evidence for lateral gene transfer between Archaea and bacteria from genome
sequence of Thermotoga maritima. Nature, 1999. 399(6734): p. 323-9.
23. Sokal, R.R. and T.J. Crovello, The Biological Species Concept: A Critical Evaluation. The
American Naturalist, 1970. 104(936): p. 127-153.
24. Mayr, E., Systematics and the origin of species, from the viewpoint of a zoologist. 1942: Harvard
University Press.
25. de Queiroz, K., Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A, 2005.
102 Suppl 1: p. 6600-7.
26. Ravin, A.W., Experimental Approaches to the Study of Bacterial Phylogeny. The American
Naturalist, 1963. 97(896): p. 307-318.
27. Dykhuizen, D.E. and L. Green, Recombination in Escherichia coli and the definition of biological
species. J Bacteriol, 1991. 173(22): p. 7257-68.
28. Tettelin, H., et al., Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae:
implications for the microbial "pan-genome". Proc Natl Acad Sci U S A, 2005. 102(39): p. 13950-
5.
29. Cohan, F.M., What are bacterial species? Annu Rev Microbiol, 2002. 56: p. 457-87.
30. Atwood, K.C., L.K. Schneider, and F.J. Ryan, Periodic selection in Escherichia coli. Proc Natl
Acad Sci U S A, 1951. 37(3): p. 146-55.
31. Treves, D.S., S. Manning, and J. Adams, Repeated evolution of an acetate-crossfeeding
polymorphism in long-term populations of Escherichia coli. Mol Biol Evol, 1998. 15(7): p. 789-
97.
32. Imhof, M. and C. Schlotterer, Fitness effects of advantageous mutations in evolving Escherichia
coli populations. Proc Natl Acad Sci U S A, 2001. 98(3): p. 1113-7.
33. Rozen, D.E. and R.E. Lenski, Long-Term Experimental Evolution in Escherichia coli. VIII.
Dynamics of a Balanced Polymorphism. Am Nat, 2000. 155(1): p. 24-35.
34. Guttman, D.S. and D.E. Dykhuizen, Detecting selective sweeps in naturally occurring
Escherichia coli. Genetics, 1994. 138(4): p. 993-1003.
35. Coleman, M.L. and S.W. Chisholm, Ecosystem-specific selection pressures revealed through
comparative population genomics. Proc Natl Acad Sci U S A, 2010. 107(43): p. 18634-9.
36. Papke, R.T., et al., Searching for species in haloarchaea. Proc Natl Acad Sci U S A, 2007.
104(35): p. 14092-7.
37. Cohan, F.M. and E.B. Perry, A systematics for discovering the fundamental units of bacterial
diversity. Curr Biol, 2007. 17(10): p. R373-86.
38. Majewski, J. and F.M. Cohan, Adapt globally, act locally: the effect of selective sweeps on
bacterial sequence diversity. Genetics, 1999. 152(4): p. 1459-74.
39. Shapiro, B.J., et al., Population genomics of early events in the ecological differentiation of
bacteria. Science, 2012. 336(6077): p. 48-51.
40. Takeuchi, N., et al., Gene-specific selective sweeps in bacteria and archaea caused by negative
frequency-dependent selection. BMC Biol, 2015. 13: p. 20.
41. Dixit, P.D., T.Y. Pang, and S. Maslov, Recombination-Driven Genome Evolution and Stability
of Bacterial Species. Genetics, 2017. 207(1): p. 281-295.
42. Rolfe, R. and M. Meselson, The Relative Homogeneity of Microbial DNA. Proc Natl Acad Sci
U S A, 1959. 45(7): p. 1039-43.
43. De Ley, J., H. Cattoir, and A. Reynaerts, The quantitative measurement of DNA hybridization
from renaturation rates. Eur J Biochem, 1970. 12(1): p. 133-42.
44. Wayne, L.G., International Committee on Systematic Bacteriology: announcement of the report
of the ad hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Zentralbl
Bakteriol Mikrobiol Hyg A, 1988. 268(4): p. 433-4.
45. Fleischmann, R.D., et al., Whole-genome random sequencing and assembly of Haemophilus
influenzae Rd. Science, 1995. 269(5223): p. 496-512.
46. Fraser, C.M., J.A. Eisen, and S.L. Salzberg, Microbial genome sequencing. Nature, 2000.
406(6797): p. 799-803.
47. Ravel, J. and C.M. Fraser, Genome sequencing of microbial species, in Encyclopedia of Genetics,
Genomics, Proteomics and Bioinformatics. 2004, John Wiley & Sons, Ltd.
48. Karlin, S., Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin
Microbiol, 1998. 1(5): p. 598-610.
49. Hanage, W.P., C. Fraser, and B.G. Spratt, Sequences, sequence clusters and bacterial species.
Philos Trans R Soc Lond B Biol Sci, 2006. 361(1475): p. 1917-27.
50. Segata, N., et al., PhyloPhlAn is a new method for improved phylogenetic and taxonomic
placement of microbes. Nat Commun, 2013. 4: p. 2304.
51. Ravenhall, M., et al., Inferring horizontal gene transfer. PLoS Comput Biol, 2015. 11(5): p.
e1004095.
52. Cavalli-Sforza, L.L., The DNA revolution in population genetics. Trends Genet, 1998. 14(2): p.
60-5.
53. Koonin, E.V., K.S. Makarova, and L. Aravind, Horizontal gene transfer in prokaryotes:
quantification and classification. Annu Rev Microbiol, 2001. 55: p. 709-42.
54. Lawrence, J.G. and H. Ochman, Amelioration of bacterial genomes: rates of change and
exchange. J Mol Evol, 1997. 44(4): p. 383-97.
55. Medigue, C., et al., Evidence for horizontal gene transfer in Escherichia coli speciation. J Mol
Biol, 1991. 222(4): p. 851-6.
56. Ochman, H., J.G. Lawrence, and E.A. Groisman, Lateral gene transfer and the nature of
bacterial innovation. Nature, 2000. 405(6784): p. 299-304.
57. Nakamura, Y., et al., Biased biological functions of horizontally transferred genes in prokaryotic
genomes. Nat Genet, 2004. 36(7): p. 760-6.
58. Ge, F., L.S. Wang, and J. Kim, The cobweb of life revealed by genome-scale estimates of horizontal
gene transfer. PLoS Biol, 2005. 3(10): p. e316.
59. Lerat, E., et al., Evolutionary origins of genomic repertoires in bacteria. PLoS Biol, 2005. 3(5): p.
e130.
60. Dagan, T. and W. Martin, Ancestral genome sizes specify the minimum rate of lateral gene
transfer during prokaryote evolution. Proc Natl Acad Sci U S A, 2007. 104(3): p. 870-5.
61. Andam, C.P. and J.P. Gogarten, Biased gene transfer in microbial evolution. Nat Rev
Microbiol, 2011. 9(7): p. 543-55.
62. Skippington, E. and M.A. Ragan, Phylogeny rather than ecology or lifestyle biases the
construction of Escherichia coli-Shigella genetic exchange communities. Open Biol, 2012. 2(9): p.
120112.
63. Boucher, Y., et al., Local mobile gene pools rapidly cross species boundaries to create endemicity
within global Vibrio cholerae populations. MBio, 2011. 2(2).
64. Madsen, J.S., et al., The interconnection between biofilm formation and horizontal gene transfer.
FEMS Immunol Med Microbiol, 2012. 65(2): p. 183-95.
65. Smillie, C.S., et al., Ecology drives a global network of gene exchange connecting the human
microbiome. Nature, 2011. 480(7376): p. 241-4.
66. Liu, L., et al., The human microbiome: a hot spot of microbial horizontal gene transfer. Genomics,
2012. 100(5): p. 265-70.
67. Brito, I.L., et al., Mobile genes in the human microbiome are structured from global to individual
scales. Nature, 2016. 535(7612): p. 435-439.
68. Rivera, M.C., et al., Genomic evidence for two functionally distinct gene classes. Proc Natl Acad
Sci U S A, 1998. 95(11): p. 6239-44.
69. Cohen, O., U. Gophna, and T. Pupko, The complexity hypothesis revisited: connectivity rather
than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol, 2011. 28(4): p.
1481-9.
70. Jain, R., M.C. Rivera, and J.A. Lake, Horizontal gene transfer among genomes: the complexity
hypothesis. Proc Natl Acad Sci U S A, 1999. 96(7): p. 3801-6.
71. Beiko, R.G., T.J. Harlow, and M.A. Ragan, Highways of gene sharing in prokaryotes. Proc Natl
Acad Sci U S A, 2005. 102(40): p. 14332-7.
72. Baltrus, D.A., Exploring the costs of horizontal gene transfer. Trends Ecol Evol, 2013. 28(8): p.
489-95.
73. Drummond, D.A. and C.O. Wilke, The evolutionary consequences of erroneous protein
synthesis. Nat Rev Genet, 2009. 10(10): p. 715-24.
74. Banos, R.C., et al., Differential regulation of horizontally acquired and core genome genes by the
bacterial modulator H-NS. PLoS Genet, 2009. 5(6): p. e1000513.
75. Wolf, Y.I., et al., Evolution of aminoacyl-tRNA synthetases--analysis of unique domain
architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events.
Genome Res, 1999. 9(8): p. 689-710.
76. Woese, C.R., Interpreting the universal phylogenetic tree. Proc Natl Acad Sci U S A, 2000.
97(15): p. 8392-6.
77. Baas Becking, L.G.M., Geobiologie of inleiding tot de milieukunde. 1934, The Hague, the
Netherlands: W.P. Van Stockum & Zoon.
78. de Wit, R. and T. Bouvier, 'Everything is everywhere, but, the environment selects'; what did
Baas Becking and Beijerinck really say? Environ Microbiol, 2006. 8(4): p. 755-8.
79. O'Malley, M.A., The nineteenth century roots of 'everything is everywhere'. Nat Rev Microbiol,
2007. 5(8): p. 647-51.
80. Yasuda, K., et al., Biogeography of the intestinal mucosal and lumenal microbiome in the rhesus
macaque. Cell Host Microbe, 2015. 17(3): p. 385-91.
81. Grice, E.A., et al., Topographical and temporal diversity of the human skin microbiome. Science,
2009. 324(5931): p. 1190-2.
82. Gibbons, S.M., The Built Environment Is a Microbial Wasteland. mSystems, 2016. 1(2).
83. Impact of the Built Environment on Health. 2011 [cited 09/15/2017]; Available from:
https://www.cdc.gov/nceh/publications/factsheets/impactofthebuiltenvironmentonhealth.pdf.
84. Klepeis, N.E., et al., The National Human Activity Pattern Survey (NHAPS): a resource for
assessing exposure to environmental pollutants. J Expo Anal Environ Epidemiol, 2001. 11(3):
p. 231-52.
85. Kitzes, J., A. Peller, S. Goldfinger, and M. Wackernagel, Current Methods for Calculating
National Ecological Footprint Accounts. Science for Environment & Sustainable Society,
2007. 4(1): p. 1-9.
86. Hooke, R.L., J.F. Martín-Duque, and J. Pedraza, Land transformation by humans: A review.
GSA Today, 2012. 22(12): p. 4-10.
87. Division, U.N.D.o.E.a.S.A.P., World urbanization prospects: the 2011 revision. Vol.
ST/ESA/SER.A/322. 2012: United Nations Publications.
88. Environment, N.E.W.G.o.t.E.B.o.t.B., et al., Evolution of the indoor biome. Trends Ecol Evol,
2015. 30(4): p. 223-32.
89. Dai, D., et al., Factors Shaping the Human Exposome in the Built Environment: Opportunities
for Engineering Control. Environ Sci Technol, 2017. 51(14): p. 7759-7774.
90. Kelley, S.T. and J.A. Gilbert, Studying the microbiology of the indoor environment. Genome
Biol, 2013. 14(2): p. 202.
91. Milstone, L.M., Epidermal desquamation. J Dermatol Sci, 2004. 36(3): p. 131-40.
92. Lax, S., et al., Longitudinal analysis of microbial interaction between humans and the indoor
environment. Science, 2014. 345(6200): p. 1048-52.
93. Flores, G.E., et al., Microbial biogeography of public restroom surfaces. PLoS One, 2011. 6(11):
p. e28132.
94. Kembel, S.W., et al., Architectural design influences the diversity and structure of the built
environment microbiome. ISME J, 2012. 6(8): p. 1469-79.
95. Lax, S., C.R. Nagler, and J.A. Gilbert, Our interface with the built environment: immunity and
the indoor microbiota. Trends Immunol, 2015. 36(3): p. 121-3.
96. Ownby, D.R., C.C. Johnson, and E.L. Peterson, Exposure to dogs and cats in the first year of
life and risk of allergic sensitization at 6 to 7 years of age. JAMA, 2002. 288(8): p. 963-72.
97. Park, J.H., et al., Predictors of airborne endotoxin in the home. Environ Health Perspect, 2001.
109(8): p. 859-64.
98. Thorne, P.S., et al., Endotoxin Exposure: Predictors and Prevalence of Associated Asthma
Outcomes in the United States. Am J Respir Crit Care Med, 2015. 192(11): p. 1287-97.
99. Liu, A.H., Endotoxin exposure in allergy and asthma: reconciling a paradox. J Allergy Clin
Immunol, 2002. 109(3): p. 379-92.
100. Sharpe, R.A., et al., Indoor fungal diversity and asthma: a meta-analysis and systematic review
of risk factors. J Allergy Clin Immunol, 2015. 135(1): p. 110-22.
101. Song, S.J., et al., Cohabiting family members share microbiota with one another and with their
dogs. Elife, 2013. 2: p. e00458.
102. Ross, A.A., A.C. Doxey, and J.D. Neufeld, The Skin Microbiome of Cohabiting Couples.
mSystems, 2017. 2(4).
103. Lax, S., et al., Forensic analysis of the microbiome of phones and shoes. Microbiome, 2015. 3: p.
21.
104. Lax, S. and J. Gilbert, 13. Forensic microbiology in built environments, in Forensic Microbiology, D.O.
Carter, J.K. Tomberlin, M.E. Benbow, and J.L. Metcalf, Editors. 2017, John Wiley & Sons, Ltd: Chichester,
UK.
105. Strachan, D.P., Hay fever, hygiene, and household size. BMJ, 1989. 299(6710): p. 1259-60.
106. Rook, G.A., et al., Mycobacteria and other environmental organisms as immunomodulators for
immunoregulatory disorders. Springer Semin Immunopathol, 2004. 25(3-4): p. 237-55.
107. Shade, A., Diversity is the question, not the answer. ISME J, 2017. 11(1): p. 1-6.
108. Vandegrift, R., et al., Cleanliness in context: reconciling hygiene with a modern microbial
perspective. Microbiome, 2017. 5(1): p. 76.
109. Bloomfield, S.F., et al., Time to abandon the hygiene hypothesis: new perspectives on allergic
disease, the human microbiome, infectious disease prevention and the role of targeted hygiene.
Perspect Public Health, 2016. 136(4): p. 213-24.
110. Rook, G.A., Regulation of the immune system by biodiversity from the natural environment: an
ecosystem service essential to health. Proc Natl Acad Sci U S A, 2013. 110(46): p. 18360-7.
111. Chase, J., et al., Geography and Location Are the Primary Drivers of Office Microbiome
Composition. mSystems, 2016. 1(2).
112. Mohammadi, T., et al., Removal of contaminating DNA from commercial nucleic acid extraction
kit reagents. J Microbiol Methods, 2005. 61(2): p. 285-8.
113. Tanner, M.A., et al., Specific ribosomal DNA sequences from diverse environmental settings
correlate with experimental contaminants. Appl Environ Microbiol, 1998. 64(8): p. 3110-3.
114. Adams, R.I., et al., Microbiota of the indoor environment: a meta-analysis. Microbiome, 2015.
3: p. 49.
115. Salter, S.J., et al., Reagent and laboratory contamination can critically impact sequence-based
microbiome analyses. BMC Biol, 2014. 12: p. 87.
116. Coil, D., Citizen Microbiology: A Case Study in Space, in The Rightful Place of Science: Citizen
Science, D. Cavalier and E.B. Kennedy, Editors. 2016, Consortium for Science, Policy & Outcomes:
Tempe, AZ.
117. Nielsen, K.M., et al., Release and persistence of extracellular DNA in the environment. Environ
Biosafety Res, 2007. 6(1-2): p. 37-53.
118. Carini, P., et al., Relic DNA is abundant in soil and obscures estimates of soil microbial diversity.
Nat Microbiol, 2016. 2: p. 16242.
119. Emerson, J.B., et al., Schrödinger's microbes: Tools for distinguishing the living from the dead in
microbial ecosystems. Microbiome, 2017. 5(1): p. 86.
120. Riesenfeld, C.S., P.D. Schloss, and J. Handelsman, Metagenomics: genomic analysis of
microbial communities. Annu Rev Genet, 2004. 38: p. 525-52.
121. Hamady, M. and R. Knight, Microbial community profiling for human microbiome projects:
Tools, techniques, and challenges. Genome Res, 2009. 19(7): p. 1141-52.
122. Segata, N., et al., Computational meta'omics for microbial community studies. Mol Syst Biol,
2013. 9: p. 666.
123. McDonald, D., et al., An improved Greengenes taxonomy with explicit ranks for ecological and
evolutionary analyses of bacteria and archaea. ISME J, 2012. 6(3): p. 610-8.
124. Yilmaz, P., et al., The SILVA and "All-species Living Tree Project (LTP)" taxonomic frameworks.
Nucleic Acids Res, 2014. 42(Database issue): p. D643-8.
125. Huse, S.M., et al., Exploring microbial diversity and taxonomy using SSU rRNA hypervariable
tag sequencing. PLoS Genet, 2008. 4(11): p. e1000255.
126. Knights, D., et al., Human-associated microbial signatures: examining their predictive value. Cell
Host Microbe, 2011. 10(4): p. 292-6.
127. Langille, M.G., et al., Predictive functional profiling of microbial communities using 16S rRNA
marker gene sequences. Nat Biotechnol, 2013. 31(9): p. 814-21.
128. Vandamme, P., et al., Polyphasic taxonomy, a consensus approach to bacterial systematics.
Microbiol Rev, 1996. 60(2): p. 407-38.
129. Stackebrandt, E. and B.M. Goebel, Taxonomic Note: A Place for DNA-DNA Reassociation and 16S
rRNA Sequence Analysis in the Present Species Definition in Bacteriology. International Journal
of Systematic and Evolutionary Microbiology, 1994. 44(4): p. 846-849.
130. Eren, A.M., et al., Oligotyping: Differentiating between closely related microbial taxa using 16S
rRNA gene data. Methods Ecol Evol, 2013. 4(12).
131. Eren, A.M., et al., Exploring the diversity of Gardnerella vaginalis in the genitourinary tract
microbiota of monogamous couples through subtle nucleotide variation. PLoS One, 2011. 6(10):
p. e26732.
132. McLellan, S.L., et al., Sewage reflects the distribution of human faecal Lachnospiraceae. Environ
Microbiol, 2013. 15(8): p. 2213-27.
133. Faith, J.J., et al., The long-term stability of the human gut microbiota. Science, 2013. 341(6141):
p. 1237439.
134. McHardy, A.C., et al., Accurate phylogenetic classification of variable-length DNA fragments.
Nat Methods, 2007. 4(1): p. 63-72.
135. Schloissnig, S., et al., Genomic variation landscape of the human gut microbiome. Nature, 2013.
493(7430): p. 45-50.
136. Segata, N., et al., Metagenomic microbial community profiling using unique clade-specific marker
genes. Nat Methods, 2012. 9(8): p. 811-4.
137. Brady, A. and S. Salzberg, PhymmBL expanded: confidence scores, custom databases,
parallelization and more. Nat Methods, 2011. 8(5): p. 367.
138. Wood, D.E. and S.L. Salzberg, Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome Biol, 2014. 15(3): p. R46.
139. Kanehisa, M., et al., Data, information, knowledge and principle: back to metabolism in KEGG.
Nucleic Acids Res, 2014. 42(Database issue): p. D199-205.
140. Tatusov, R.L., E.V. Koonin, and D.J. Lipman, A genomic perspective on protein families.
Science, 1997. 278(5338): p. 631-7.
141. Powell, S., et al., eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different
taxonomic ranges. Nucleic Acids Res, 2012. 40(Database issue): p. D284-9.
142. Punta, M., et al., The Pfam protein families database. Nucleic Acids Res, 2012. 40(Database
issue): p. D290-301.
143. Suzek, B.E., et al., UniRef: comprehensive and non-redundant UniProt reference clusters.
Bioinformatics, 2007. 23(10): p. 1282-8.
144. Caspi, R., et al., The MetaCyc database of metabolic pathways and enzymes and the BioCyc
collection of Pathway/Genome Databases. Nucleic Acids Res, 2014. 42(Database issue): p.
D459-71.
145. Overbeek, R., et al., The subsystems approach to genome annotation and its use in the project to
annotate 1000 genomes. Nucleic Acids Res, 2005. 33(17): p. 5691-702.
146. Markowitz, V.M., et al., IMG/M: the integrated metagenome data management and comparative
analysis system. Nucleic Acids Res, 2012. 40(Database issue): p. D123-9.
147. Konwar, K.M., et al., MetaPathways: a modular pipeline for constructing pathway/genome
databases from environmental sequence information. BMC Bioinformatics, 2013. 14: p. 202.
148. Abubucker, S., et al., Metabolic reconstruction for metagenomic data and its application to the
human microbiome. PLoS Comput Biol, 2012. 8(6): p. e1002358.
149. Vollmers, J., S. Wiegand, and A.K. Kaster, Comparing and Evaluating Metagenome Assembly
Tools from a Microbiologist's Perspective - Not Only Size Matters! PLoS One, 2017. 12(1): p.
e0169662.
150. Nagarajan, N. and M. Pop, Sequence assembly demystified. Nat Rev Genet, 2013. 14(3): p.
157-67.
151. Gill, S.R., et al., Metagenomic analysis of the human distal gut microbiome. Science, 2006.
312(5778): p. 1355-9.
152. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing.
Nature, 2010. 464(7285): p. 59-65.
153. Venter, J.C., et al., Environmental genome shotgun sequencing of the Sargasso Sea. Science,
2004. 304(5667): p. 66-74.
154. Wrighton, K.C., et al., Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated
bacterial phyla. Science, 2012. 337(6102): p. 1661-5.
155. Castelle, C.J., et al., Extraordinary phylogenetic diversity and metabolic versatility in aquifer
sediment. Nat Commun, 2013. 4: p. 2120.
156. Di Rienzi, S.C., et al., The human gut and groundwater harbor non-photosynthetic bacteria
belonging to a new candidate phylum sibling to Cyanobacteria. Elife, 2013. 2: p. e01102.
157. Tyson, G.W., et al., Community structure and metabolism through reconstruction of microbial
genomes from the environment. Nature, 2004. 428(6978): p. 37-43.
158. Albertsen, M., et al., Genome sequences of rare, uncultured bacteria obtained by differential
coverage binning of multiple metagenomes. Nat Biotechnol, 2013. 31(6): p. 533-8.
159. Mukherjee, S., et al., 1,003 reference genomes of bacterial and archaeal isolates expand coverage
of the tree of life. Nat Biotechnol, 2017. 35(7): p. 676-683.
160. Eisen, J.A., Horizontal gene transfer among microbial genomes: new insights from complete
genome analysis. Curr Opin Genet Dev, 2000. 10(6): p. 606-11.
161. Hao, W. and G.B. Golding, The fate of laterally transferred genes: life in the fast lane to
adaptation or death. Genome Res, 2006. 16(5): p. 636-43.
162. Polz, M.F., E.J. Alm, and W.P. Hanage, Horizontal gene transfer and the evolution of bacterial
and archaeal population structure. Trends Genet, 2013. 29(3): p. 170-5.
163. Mitri, S. and K.R. Foster, The genotypic view of social interactions in microbial communities.
Annu Rev Genet, 2013. 47: p. 247-73.
164. Smith, J., The social evolution of bacterial pathogenesis. Proc Biol Sci, 2001. 268(1462): p. 61-9.
165. de Carvalho, M.O. and E.L. Loreto, Methods for detection of horizontal transfer of transposable
elements in complete genomes. Genet Mol Biol, 2012. 35(4 (suppl)): p. 1078-84.
166. Ragan, M.A., On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett,
2001. 201(2): p. 187-91.
167. Vernikos, G.S. and J. Parkhill, Interpolated variable order motifs for identification of horizontally
acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics, 2006. 22(18): p.
2196-203.
168. Podell, S. and T. Gaasterland, DarkHorse: a method for genome-wide prediction of horizontal
gene transfer. Genome Biol, 2007. 8(2): p. R16.
169. Langille, M.G., W.W. Hsiao, and F.S. Brinkman, Evaluation of genomic island predictors using
a comparative genomics approach. BMC Bioinformatics, 2008. 9: p. 329.
170. Whidden, C., N. Zeh, and R.G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft
Distance. Syst Biol, 2014. 63(4): p. 566-81.
171. Tofigh, A., M. Hallett, and J. Lagergren, Simultaneous identification of duplications and lateral
gene transfers. IEEE/ACM Trans Comput Biol Bioinform, 2011. 8(2): p. 517-35.
172. Chauve, C., et al., MaxTiC: Fast Ranking Of A Phylogenetic Tree By Maximum Time
Consistency With Lateral Gene Transfers. bioRxiv, 2017.
173. Trappe, K., T. Marschall, and B.Y. Renard, Detecting horizontal gene transfer by mapping
sequencing reads across species boundaries. Bioinformatics, 2016. 32(17): p. i595-i604.
174. Lloyd-Price, J., A. Mahurkar, et al., Strains, functions and dynamics in the expanded Human
Microbiome Project. Nature, in press.
175. Huang, K., et al., MetaRef: a pan-genomic database for comparative and community microbial
genomics. Nucleic Acids Res, 2014. 42(Database issue): p. D617-24.
176. Louis, P., G.L. Hold, and H.J. Flint, The gut microbiota, bacterial metabolites and colorectal
cancer. Nat Rev Microbiol, 2014. 12(10): p. 661-72.
177. Flint, H.J., et al., Interactions and competition within the microbial community of the human
colon: links between diet and health. Environ Microbiol, 2007. 9(5): p. 1101-11.
178. Mark Welch, J.L., et al., Biogeography of a human oral microbiome at the micron scale. Proc Natl
Acad Sci U S A, 2016. 113(6): p. E791-800.
179. Finn, R.D., et al., Pfam: clans, web tools and services. Nucleic Acids Res, 2006. 34(Database
issue): p. D247-51.
180. Sitbon, E. and S. Pietrokovski, New types of conserved sequence domains in DNA-binding
regions of homing endonucleases. Trends Biochem Sci, 2003. 28(9): p. 473-7.
181. Burrus, V., et al., The ICESt1 element of Streptococcus thermophilus belongs to a large family of
integrative and conjugative elements that exchange modules and change their specificity of
integration. Plasmid, 2002. 48(2): p. 77-97.
182. Burrus, V., et al., Conjugative transposons: the tip of the iceberg. Mol Microbiol, 2002. 46(3): p.
601-10.
183. Bonham, K.S., B.E. Wolfe, and R.J. Dutton, Extensive horizontal gene transfer in
cheese-associated bacteria. Elife, 2017. 6.
184. Truong, D.T., et al., Microbial strain-level population structure and genetic diversity from
metagenomes. Genome Res, 2017. 27(4): p. 626-638.
185. Stokes, H.W. and M.R. Gillings, Gene flow, mobile genetic elements and the recruitment of
antibiotic resistance genes into Gram-negative pathogens. FEMS Microbiol Rev, 2011. 35(5): p.
790-819.
186. Peng, Y., et al., IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data
with highly uneven depth. Bioinformatics, 2012. 28(11): p. 1420-8.
187. Konstantinidis, K.T. and J.M. Tiedje, Genomic insights that advance the species definition for
prokaryotes. Proc Natl Acad Sci U S A, 2005. 102(7): p. 2567-72.
188. Jost, L., Entropy and diversity. Oikos, 2006. 113(2): p. 363-375.
189. National Transit Database. Monthly Module Raw Data Release. 2015; Available from:
http://www.ntdprogram.gov/ntdprogram/data.htm.
190. Meadow, J.F., A.E. Altrichter, and J.L. Green, Mobile phones carry the personal microbiome of
their owners. PeerJ, 2014. 2: p. e447.
191. Fierer, N., et al., Forensic identification using skin bacterial communities. Proc Natl Acad Sci
U S A, 2010. 107(14): p. 6477-81.
192. Meadow, J.F., et al., Bacterial communities on classroom surfaces vary with human contact.
Microbiome, 2014. 2(1): p. 7.
193. Robertson, C.E., et al., Culture-independent analysis of aerosol microbiology in a metropolitan
subway system. Appl Environ Microbiol, 2013. 79(11): p. 3485-93.
194. Leung, M.H., et al., Indoor-air microbiome in an urban subway network: diversity and dynamics.
Appl Environ Microbiol, 2014. 80(21): p. 6760-70.
195. Afshinnekoo, E., et al., Geospatial Resolution of Human and Bacterial Diversity with City-Scale
Metagenomics. Cell Systems, 2015. 1(1): p. 72-87.
196. Ackelsberg, J., et al., Lack of Evidence for Plague or Anthrax on the New York City Subway. Cell
Systems, 2015. 1(1): p. 4-5.
197. Segata, N., et al., Metagenomic biomarker discovery and explanation. Genome Biol, 2011. 12(6):
p. R60.
198. Nelson, M.C., et al., Analysis, optimization and verification of Illumina-generated 16S rRNA
gene amplicon surveys. PLoS One, 2014. 9(4): p. e94249.
199. Segata, N., et al., Composition of the adult digestive tract bacterial microbiome based on seven
mouth surfaces, tonsils, throat and stool samples. Genome Biol, 2012. 13(6): p. R42.
200. Costello, E.K., et al., Bacterial community variation in human body habitats across space and
time. Science, 2009. 326(5960): p. 1694-7.
201. Kembel, S.W., et al., Architectural design drives the biogeography of indoor bacterial
communities. PLoS One, 2014. 9(1): p. e87093.
202. Lauber, C.L., et al., Pyrosequencing-based assessment of soil pH as a predictor of soil bacterial
community structure at the continental scale. Appl Environ Microbiol, 2009. 75(15): p. 5111-20.
203. Knights, D., et al., Bayesian community-wide culture-independent microbial source tracking.
Nat Methods, 2011. 8(9): p. 761-3.
204. Stolz, A., Molecular characteristics of xenobiotic-degrading sphingomonads. Appl Microbiol
Biotechnol, 2009. 81(5): p. 793-811.
205. Peyraud, R., et al., Genome-scale reconstruction and system level investigation of the metabolic
network of Methylobacterium extorquens AM1. BMC Syst Biol, 2011. 5: p. 189.
206. Kawamura, Y., et al., Genus Enhydrobacter Staley et al. 1987 should be recognized as a member
of the family Rhodospirillaceae within the class Alphaproteobacteria. Microbiol Immunol, 2012.
56(1): p. 21-6.
207. Hewitt, K.M., et al., Bacterial diversity in two Neonatal Intensive Care Units (NICUs). PLoS
One, 2013. 8(1): p. e54703.
208. Grice, E.A., et al., A diversity profile of the human skin microbiota. Genome Res, 2008. 18(7):
p. 1043-50.
209. Dawson, T.L., Jr., Malassezia globosa and restricta: breakthrough understanding of the etiology
and treatment of dandruff and seborrheic dermatitis through whole-genome analysis. J Investig
Dermatol Symp Proc, 2007. 12(2): p. 15-9.
210. Zouboulis, C.C., Propionibacterium acnes and sebaceous lipogenesis: a love-hate relationship? J
Invest Dermatol, 2009. 129(9): p. 2093-6.
211. Morgan, X.C., et al., Dysfunction of the intestinal microbiome in inflammatory bowel disease and
treatment. Genome Biol, 2012. 13(9): p. R79.
212. Barberan, A., et al., Using network analysis to explore co-occurrence patterns in soil microbial
communities. ISME J, 2012. 6(2): p. 343-51.
213. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids
Res, 2000. 28(1): p. 27-30.
214. Bruggemann, H., et al., The complete genome sequence of Propionibacterium acnes, a commensal
of human skin. Science, 2004. 305(5684): p. 671-3.
215. Lee, W.L., A.R. Shalita, and M.B. Poh-Fitzpatrick, Comparative studies of porphyrin
production in Propionibacterium acnes and Propionibacterium granulosum. J Bacteriol, 1978.
133(2): p. 811-5.
216. Holland, K.T., et al., Propionibacterium acnes and acne. Dermatology, 1998. 196(1): p. 67-8.
217. Roessner, C.A., et al., Isolation and characterization of 14 additional genes specifying the
anaerobic biosynthesis of cobalamin (vitamin B12) in Propionibacterium freudenreichii (P.
shermanii). Microbiology, 2002. 148(Pt 6): p. 1845-53.
218. Hashimoto, Y., M. Yamashita, and Y. Murooka, The Propionibacterium freudenreichii
hemYHBXRL gene cluster, which encodes enzymes and a regulator involved in the biosynthetic
pathway from glutamate to protoheme. Appl Microbiol Biotechnol, 1997. 47(4): p. 385-92.
219. Kaminski, J., et al., High-specificity targeted functional profiling in microbial communities with
ShortBRED. PLoS Comput Biol, in press.
220. McArthur, A.G., et al., The comprehensive antibiotic resistance database. Antimicrob Agents
Chemother, 2013. 57(7): p. 3348-57.
221. Yooseph, S., et al., A metagenomic framework for the study of airborne microbial communities.
PLoS One, 2013. 8(12): p. e81862.
222. Qin, J., et al., A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature,
2012. 490(7418): p. 55-60.
223. Yatsunenko, T., et al., Human gut microbiome viewed across age and geography. Nature, 2012.
486(7402): p. 222-7.
224. Hu, Y., et al., Metagenome-wide analysis of antibiotic resistance genes in a large cohort of human
gut microbiota. Nat Commun, 2013. 4: p. 2151.
225. Chen, L., et al., VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res,
2005. 33(Database issue): p. D325-8.
226. Li, Y., et al., Role of ventilation in airborne transmission of infectious agents in the built
environment - a multidisciplinary systematic review. Indoor Air, 2007. 17(1): p. 2-18.
227. Gibbons, S.M., et al., Ecological succession and viability of human-associated microbiota on
restroom surfaces. Appl Environ Microbiol, 2015. 81(2): p. 765-73.
228. Glass, E.M., et al., MIxS-BE: a MIxS extension defining a minimum information standard for
sequence data from the built environment. ISME J, 2014. 8(1): p. 1-3.
229. National Centers for Environmental Information & National Oceanic and Atmospheric
Administration. Record of Climatological Observations. 8/29/2015; Station: Boston Logan
International Airport, MA, US. Available from: http://www.ncdc.noaa.gov/cdo-web/.
230. Weather Underground. Weather History for KBOS [8/29/2015]; Available from:
http://www.wunderground.com/history/.
231. Paulino, L.C., et al., Molecular analysis of fungal microbiota in samples from healthy human skin
and psoriatic lesions. J Clin Microbiol, 2006. 44(8): p. 2933-41.
232. Caporaso, J.G., et al., Global patterns of 16S rRNA diversity at a depth of millions of sequences
per sample. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4516-22.
233. Caporaso, J.G., et al., QIIME allows analysis of high-throughput community sequencing data.
Nat Methods, 2010. 7(5): p. 335-6.
234. Oksanen, J., F. Blanchet, R. Kindt, P. Legendre, P. Minchin, R. O'Hara, G. Simpson, P. Solymos,
H. Stevens, and H. Wagner, vegan: Community Ecology Package. 2015.
235. Asnicar, F., et al., Compact graphical representation of phylogenetic data and metadata with
GraPhlAn. PeerJ, 2015. 3: p. e1029.
236. Morgat, A., et al., UniPathway: a resource for the exploration and annotation of metabolic
pathways. Nucleic Acids Res, 2012. 40(Database issue): p. D761-9.
237. Suzek, B.E., et al., UniRef clusters: a comprehensive and scalable alternative for improving
sequence similarity searches. Bioinformatics, 2015. 31(6): p. 926-32.
238. Liu, B. and M. Pop, ARDB--Antibiotic Resistance Genes Database. Nucleic Acids Res, 2009.
37(Database issue): p. D443-7.
239. Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics, 2014. 30(15): p. 2114-20.
240. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods,
2012. 9(4): p. 357-9.
241. Rastogi, G., et al., A PCR-based toolbox for the culture-independent quantification of total
bacterial abundances in plant environments. J Microbiol Methods, 2010. 83(2): p. 127-32.
242. Lane, D., 16S/23S rRNA sequencing, in Nucleic acid techniques in bacterial systematics, E.
Stackebrandt and M. Goodfellow, Editors. 1991, John Wiley and Sons: Chichester, United Kingdom. p. 115-175.
243. Flores, G.E., J.B. Henley, and N. Fierer, A direct PCR approach to accelerate analyses of
human-associated microbial communities. PLoS One, 2012. 7(9): p. e44563.
244. Adams, R.I., et al., Passive dust collectors for assessing airborne microbial material. Microbiome,
2015. 3: p. 46.
245. Checinska, A., et al., Microbiomes of the dust particles collected from the International Space
Station and Spacecraft Assembly Facilities. Microbiome, 2015. 3: p. 50.