Inter-species interactions in microbial communities

Citation: Hsu, Tiffany Yeong-Ting. 2018. Inter-species interactions in microbial communities. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Permanent link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42015251

Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

A dissertation presented

by

Tiffany Yeong-Ting Hsu

to

The Division of Medical Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Biological and Biomedical Sciences

Harvard University

Cambridge, Massachusetts

October 2017

Dissertation Advisor: Professor Curtis Huttenhower

Author: Tiffany Yeong-Ting Hsu


Abstract

Microorganisms are omnipresent and exist as communities within and around the

human body. These communities, regardless of location, may cause disease: dysbioses within

the gut microbiota are associated with obesity and inflammatory bowel disease, while

differences in immune development and environmental exposures are linked to atopy and

diabetes. It is thus crucial to characterize microbial communities and their interactions to better

understand how they are formed, maintained, and manipulated. To better understand the

ecology of communities on and around the human body, my work has explored lateral gene

transfer (LGT) within human-associated microbial communities and the transfer of microbes

between the human body and environmental surfaces.

I developed the first method for detection of de novo LGT events from metagenomes,

termed WAAFLE (a Workflow to Annotate Assemblies and Find LGT Events). I applied

WAAFLE to the Human Microbiome Project: LGT frequencies were highest in the gut and oral

sites, and lowest in the vaginal and skin microbiomes. High-frequency pairs corresponded with

increased taxon abundances and close phylogenetic distances. Taxa found in multiple LGT pairs

had strong partner preferences, and several had biases in transfer directionality. Enriched

functions in LGT contigs included transposases, phage, and TonB membrane receptors. Taxa in

high-frequency LGT pairs may preferentially use LGT as a tool to maintain or change their

community status.

I examined cross-talk between human-associated and built-environment microbial

communities in heavily trafficked environments, specifically the Boston subway. These areas

may facilitate microbial transmission and are ripe for public health interventions such as

sanitation or architectural design. We used 16S rRNA gene and shotgun metagenomic sequencing to

profile microbes on multiple surface types in trains along the red, green, and orange lines, as

well as ticketing machines at four train stations. Community structure was dictated by surface

type, rather than train line. Common taxa included human skin and oral commensals such as

Propionibacterium, Corynebacterium, Staphylococcus, and Streptococcus. Enriched functions were

often from Propionibacterium acnes pathways, and few antibiotic resistance genes were observed.

Overall, microbial communities on the Boston subway are likely derived from the rider

population and influenced by rider interactions and environmental biochemistry.

Table of Contents

Abstract
Table of Contents
Acknowledgements
List of Figures
List of Tables
List of Abbreviations

Chapter 1: Introduction
    Copyright Disclosure
    Overview
    The significance of lateral gene transfer
        Mechanisms and discovery of lateral gene transfer
        Problems with the prokaryotic “species concept”
        Methods for identifying species and LGT
        LGT in microbial communities
        Transferred functions and their associated costs
        Evolutionary legacy of LGT
    Surveying microbial communities in the built-environment
        Microbial composition of the built-environment
        Applications for the built-environment
        Technical considerations for sampling the built-environment
    The role of DNA sequencing for microbial profiling
        Amplicon Sequencing
        WMS Sequencing
        Contig Assembly
    Summary

Chapter 2: Lateral Gene Transfer in the Human Microbiome
    Attributions
    Introduction
    Results
        Identifying recent LGT events from metagenomic shotgun sequencing
        WAAFLE performance on synthetic data
        Rates of novel LGT events across the human microbiome
        LGT frequency and pair formation are shaped by abundance and phylogeny
        Genera have preferred transfer partners that are shared across similar sites
        Mobile elements and TonB receptors are enriched in LGT contigs
    Discussion
    Methods

Chapter 3: Urban transit system microbial communities differ by surface type and interaction with humans and environment
    Copyright Disclosure
    Attributions
    Abstract
    Importance
    Introduction
    Results
        Sampling microbial communities on the Boston transit system
        Microbial communities are specific to surface types and immediate environment
        Subway microbial communities are largely derived from human skin and oral commensal microbes
        Propionibacterium phages and the yeast Malassezia globosa dominate the non-bacterial microbial community
        All surface types are dominated by skin microbes, with smaller proportions of oral, gut, and environmental taxa across seats and touchscreens
        Metagenomes reflect dominance of Propionibacterium acnes across subway surfaces
        Minimal pathogenic and antibiotic resistance presence on the Boston transit system
    Discussion
    Materials and Methods
    Acknowledgements

Chapter 4: Conclusions

Appendix I: Supplemental Materials for Chapter 2

Appendix II: Supplemental Materials for Chapter 3

References

Acknowledgements

I came to Harvard determined to learn “computational biology”. Considering that my

laboratory experience far exceeded my programming experience (5 years versus 10 weeks), I

must first thank Dr. Curtis Huttenhower for taking a chance on me. In my first email to him, I

wrote:

“…I am interested in learning how to analyze large datasets and make some sense out of them. I

feel that it is no longer sufficient to look at just a few key genes - especially when there are now ways to

profile entire genomics, transcriptomes, and proteomes - though all the associations found will still have

to be validated molecularly. Still, I think it's exciting that there is a chance to look at the entire network

and see how it works.

I was wondering if you took rotation students - or knew of anyone who might train a student to

do dry work - since I have a wet lab background. I was also wondering what your opinion was on how

much of an "omics" understanding a scientist might need.”

What I have learned during my time in the lab has completely exceeded those expectations. The

Huttenhower Lab is a rare place that does not distinguish its bioinformaticians from its

experimentalists. Every member is free to learn both, and they often do, through the process of

helping each other out. Curtis was also willing to help me take on projects I was initially

unqualified for, such as WAAFLE, which was born out of my qualifying exams.

Second, I must thank both past and present members of the Huttenhower Lab. Curtis

has assembled a wonderful team of people. To each of you, I would like to say: “You have

qualities that I strive to emulate, and skills and knowledge that I still hope to learn some day.” I

specifically want to thank two people, Dr. Eric Franzosa and Dr. Regina Joice.

Eric was my mentor throughout my PhD; without him I would not have graduated. The

beginning of my PhD was difficult, because the way computational biologists thought and the

terms they used were alien to me. It was not always clear what analyses were being suggested

or why, and how to carry them out. Eric always took the time to explain these analyses, by

breaking down the underlying assumptions and hypotheses. When I had trouble turning those

analyses into code, he would show me his code and introduce me to new syntax. Eric was often

the first to review my grants and paper drafts: I learned a lot about writing from his revisions.

Towards the end of my PhD, when I had trouble mentoring and tutoring students, it was again

Eric that I turned to for advice. I hope I will become an equally skilled and kind scientist as I

move through my career.

Regina was my mentor throughout the MBTA project. Since she had a wet lab

background, she could anticipate my confusion and would help me if she knew the answer, or

help me rephrase the question so someone else could. When I got lost in the computational

aspects of my work, she would always steer me back to the biological question we were asking.

She also freely shared advice when I asked for it: I still remember sidling up and saying,

“Regina, I have a science/graduate school/life question, would you have time to talk later?”

Third, I want to thank my scientific colleagues outside the lab, including Dr. Morgan

Langille and Dr. Robert Beiko, the WAAFLE co-authors; Dr. Georgina Hold, for involving me in

her comparative genomics project; Dr. Wendy Garrett, who gave me access to her laboratory

when we didn’t have the right equipment; and Dr. Eric Rubin, Dr. Michael Springer, and Dr.

Colleen Cavanaugh, my dissertation advisory committee; and Dr. Ting-Ting Wu, my

undergraduate research mentor. To Morgan, Rob, and my advisory committee, I have always

enjoyed and appreciated your feedback on my projects. I have heard horror stories about

collaborators and committees: all five of you were truly a pleasure to work with, and even took

time to meet with me one-on-one, whether it was for advice, beer, or while driving me to see

Bonnie Bassler. To Ting, despite all your cautionary advice, I still went to graduate school!

Without you, I would never have experienced scientific research, and I hope we stay

friends and colleagues for the years to come.

Fourth, I must thank the administrative staff, including Nicole Levesque, the

Biostatistics Department program coordinator, who magically scheduled me into Curtis’s

schedule over the past five years; as well as Kate Hodgins, Anne O’Shea, Danny Gonzalez, and

Maria Bollinger, the present and former BBS program administrators, who have always swiftly

responded to questions about Harvard and graduate school.

Lastly, I want to thank my family and friends. Both my mother, Lichuan Hsu, and

brother, Eric Hsu, have always been there to support me. They have heard more than their fair

share of gripes and complaints along the way. My father, Che-Chang Hsu, is no longer here, but

I believe he would be proud of my work. As an electrical engineer, he was extremely excited

when I told him I was going to learn Python and described Curtis’s work. I am glad he was able

to see me start my bioinformatics journey. My partner, Wesley Hong, always gives me new

perspectives to consider, and is there to remind me that graduate school is not everything, but a

small step towards our aspirations. To my friends, I will remember the late night problem sets,

races and shopping trips, and surprise birthday parties: it is you who have made my time here

in Boston/Cambridge all the merrier.

List of Figures

Figure 2-1. WAAFLE pipeline overview.
Figure 2-2. WAAFLE parameter evaluation.
Figure 2-3. LGT rates are highest for oral and stool sites.
Figure 2-4. Both abundance and phylogeny affect LGT rates.
Figure 2-5. Taxa degree and differential edges.
Figure 2-6. Enriched functions show taxon and structural similarities across sites.
Figure 3-1. Collection of samples from MBTA trains and stations.
Figure 3-2. Taxonomic composition of subway microbial communities.
Figure 3-3. Putative MBTA microbial community sources.
Figure 3-4. Trans-domain taxonomic profiles from subway shotgun metagenomes.
Figure 3-5. Enrichment of microbial taxa with respect to metadata using multivariate analyses.
Figure 3-6. Enrichment of KEGG Orthology (KOs) across MBTA surfaces before and after P. acnes removal.
Figure 3-7. Quantification of antibiotic resistance marker and virulence factor abundances on subway surfaces.
Figure I-1. Filtering potential misassemblies.
Figure I-2. Determining which contig types contain misassemblies.
Figure I-3. Gene call evaluation.
Figure I-4. LGT evaluation with or without missing BLAST hits.
Figure I-5. Selection of k1 and k2.
Figure I-6. Comparison of LGT measures.
Figure I-7. Jaccard and Bray-Curtis distances between inter-individual, intra-individual, and technical samples.
Figure I-8. Phylogenetic distances computed from random taxa pairs within body sites.
Figure II-1. Biomass and alpha diversity for train and station samples.
Figure II-2. Ordination of surface data subsets.
Figure II-3. Comparison of antibiotic resistance markers from the ARDB database.
Figure II-4. Letter from the MBTA.

List of Tables

Table I-1. WAAFLE Parameters.
Table II-1. Sample collection and metadata.
Table II-2. 16S and shotgun OTU tables along with taxa present across sequencing plate.
Table II-3. LEfSe and MaAsLin analysis for 16S sequencing.
Table II-4. MaAsLin analysis for shotgun data.
Table II-5. Antibiotic resistance gene and virulence factor markers.

List of Abbreviations

antibiotic resistance (ABR).

antibiotic resistance genes (ARG).

base pair (bp).

biological species concept (BSC).

coding sequence (CDS).

ecological species concept (ESC).

false positive rate (FPR).

gene transfer agent (GTA).

Human Microbiome Project (HMP).

Human Microbiome Project Phase 1-II (HMP 1-II).

interpolated variable order motifs (IVOM).

kilobase (kb).

last universal common ancestor (LUCA).

positive predictive value (PPV).

single nucleotide polymorphisms (SNP).

true positive rate (TPR).

WAAFLE (Workflow to Annotate Assemblies and Find LGT Events).

whole metagenome shotgun (WMS).

Chapter 1:

Introduction

Copyright Disclosure

Portions of this Introduction appear in or are adapted from the following publications:

Franzosa, E.A., T. Hsu, A. Sirota-Madi, A. Shafquat, G. Abu-Ali, X.C. Morgan, C. Huttenhower,

Sequencing and beyond: integrating molecular ‘omics’ for microbial community profiling. Nature

Reviews Microbiology, 2015. 13(6):p. 360-72.

Overview

There are approximately 3.8 × 10^13 bacterial cells in the average 70 kg man, which is

roughly equal to the number of human cells in the body [1]. These bacterial cells are found as

microbial communities [2], and may interface with the immediate environment outside the host

[3]. Within a microbial community, individual taxa may have different phenotypes as compared

to the overall community: some have proposed that an individual microbe may be viewed as a

component cell of a multicellular organism, in which components communicate to coordinate

growth, movement, and biochemical activities in order to efficiently proliferate, access new

resources, and defend against antagonists [4]. It is therefore necessary to study microbial

interactions at the individual and community scale. Furthermore, microbial communities may

influence or be influenced by the surrounding environment. Humans emit a detectable

microbial cloud into the surrounding air [5], and skin microorganisms are influenced by

temperature, moisture, and ultraviolet radiation [6]. Thus, it is important to characterize

microbial interactions within a community, as well as microbial interactions with the

surrounding environment in order to understand community formation, maintenance, and

function.

Microbial profiling began with Antonie van Leeuwenhoek, who observed microorganisms

using a self-built microscope and classified them based on morphology [7]. Louis Pasteur and

Robert Koch later popularized the use of what is now considered traditional culture methods to

isolate microbes and observe their phenotypes [8]. However, “The Great Plate Count

Anomaly” showed that the majority of bacteria were not being cultured: Razumov observed

that viable plate counts were much lower than microscopic counts [9-11]. The advent of 16S rRNA gene and

shotgun metagenomic sequencing partially solved this problem by allowing scientists to

identify and classify not-yet-culturable microbes. Coupled with other ‘omics’ data (including

transcriptomics, proteomics, and metabolomics) and appropriate study design, researchers can

begin to better understand microbial interactions at both the individual and community scale,

and across different environments.

In this Introduction, I will first explore lateral gene transfer (LGT), one type of

interaction within microbial communities. Specifically, I will discuss its mechanisms, history,

and roles in the human microbiota. Next, I will delve into the interactions between human-

associated microbial communities and the built-environment, where humans spend the

majority of their time. Finally, I will outline the potential and limitations of DNA sequencing

approaches for profiling microbial communities.

The significance of lateral gene transfer

Mechanisms and discovery of lateral gene transfer

One of the most important types of interactions within microbial communities has

proven to be LGT. LGT occurs when genetic information (or DNA) is passed from a single cell

to a neighboring cell (lateral transmission), rather than from parent to offspring (vertical

transmission). LGT is primarily known to occur through three mechanisms (transformation,

transduction, and conjugation) and via two more recently discovered mechanisms: gene transfer

agents (GTA) and cell fusion [12]. Transformation occurs when a bacterium takes up naked

DNA from the environment and incorporates it into its own genome. Transduction occurs when

a bacteriophage accidentally packages part of the host genome with its own genome, which is

then injected and integrated into the next infected bacterium. Conjugation requires physical

contact between two bacteria, and involves DNA transfer from one bacterium to the other via a

multiprotein apparatus. The different mechanisms of LGT limit both the potential participants

and amount of DNA transferred. For example, transduction restricts LGT partners to those with

the same phage host range, and phage can only package a small quantity of DNA. Lastly, GTA

are DNA elements evolved from prophages; they package small pieces of bacterial DNA in

capsids and transfer them to nearby hosts [13]. Cell fusion is similar to sexual reproduction in

eukaryotes in that microbial cells physically join and may bi-directionally transfer DNA [14].

LGT was initially considered a curiosity, but is now recognized as a potentially strong

evolutionary force in prokaryotes. Assuming one LGT event for every 10^10 vertical replications,

no gene in any modern genome can be linked to the last universal common ancestor (LUCA)

through vertical descent [15]. LGT was first observed in 1928 as transformation: “R” (“rough”,

avirulent) Pneumococcus strains alone could not cause disease in mice, but would kill mice if

mixed with heat-killed “S” (“smooth”, virulent) Pneumococcus strains [16]. In 1944, Avery,

MacLeod, and McCarty determined the agent of this particular phenomenon (conversion of R

strains to S strains) to be DNA [17]. In the 1960s, Japanese researchers found that multi-drug

resistant Escherichia coli could transfer resistance to drug-sensitive Shigella through conjugation

[18-20], elevating LGT to a cause for concern. Finally, in 1999, researchers found that 20-25% of

Aquifex aeolicus and Thermotoga maritima genes were more similar to Archaea than Bacteria [21,

22], indicating that LGT can cross domains in the tree of life.

Problems with the prokaryotic “species concept”

Identifying LGT between different species is of particular interest, since these events

may increase the fitness of individual microbes, which in turn may alter microbial communities.

In both macro- and microbiology, species are defined as clusters of similar organisms, though

what drives the separation of these clusters is unclear for microorganisms. Historically, macro-

organisms were delineated based on morphology, while microorganisms were classified based

on metabolic characteristics [23]. The introduction of the “biological species concept” (BSC) by

Ernst Mayr in 1942 attempted to unify existing systematics and the theory of evolution, and

stated that “species are groups of actually or potentially interbreeding natural populations,

which are reproductively isolated from other such groups” [24, 25]. This definition formalized

“species” as a unit of ecology and evolution, and identified “reproductive isolation” as the

driver for species formation.

The BSC did not work well for microorganisms or plants, due to LGT and the ability to form

hybrids, respectively. Still, several scientists attempted to apply the BSC to bacteria. Ravin

searched for similarity between “genospecies”, defined as groups of bacteria that could

exchange genes, and “phenospecies”, defined as groups of bacteria that shared metabolic

phenotypes. Unfortunately, the two groups did not correlate well, indicating that genetic

exchange ability does not necessarily correspond to phenotype [26]. Dykhuizen and Green

proposed defining bacterial species as strains that could undergo recombination with each other

but not with other strains [27], which proved to be impractical given the frequency of LGT and

large size of a species’ pan-genome [28].

In 2002, Frederick Cohan argued that ecology was the driver of species clusters in

bacteria (as opposed to reproductive isolation). He proposed defining bacterial species as

“ecotypes”, which are “…set(s) of strains using the same or similar ecological resources, such

that an adaptive mutant from within the ecotype out-competes to extinction all other strains of

the same ecotype; an adaptive mutant does not, however drive to extinction strains from other

ecotypes” [29]. This definition has also been referred to as the “ecological species concept”

(ESC). Cohan’s first model was termed the “stable ecotype model,” which assumed that 1)

microorganisms exist as large populations (10^10 cells) and that 2) population genetic diversity is

largely controlled by periodic selection, in which a single species consistently sweeps the

population [30], rather than genetic drift. He pointed out that the latter was supported by long

term culture experiments, which often gave rise to strains with different phenotypes [31-33].

With this, ecotypes could be detected as sequence clusters due to genome-wide sweeps in

microbial populations.

Recent work has observed that gene-specific sweeps, rather than genome sweeps, occur

in microbial populations [34-36]. However, previous work has shown that the recombination

rate is usually lower than the mutation rate, and thus a gene should not undergo a different rate

of selection as compared to its genome [37]. To reconcile these observations, Cohan proposed

the “Adapt Globally, Act Locally” model, in which multiple ecotypes adopt the same gene

through lateral transfer, but maintain separate evolutionary trajectories [37, 38]. In 2012, Shapiro

et al. expanded upon this theory by characterizing two populations of Vibrio cyclitrophicus, in

which they found that i) SNPs associated with a specific population were constrained to specific

genome regions, and ii) recent recombination was more common within a population than

between them [39]. From this, they proposed that microbes undergo gene transfer, leading to

gene sweeps. Since transferred genes are habitat-specific, gene sweeps prompt populations to

specialize, which in turn decreases gene flow between different populations and leads to the

formation of distinct genomic clusters. Their observations imply that gene-specific sweeps can

lead to the formation of new species. More recent work has focused on characterizing

conditions under which gene specific sweeps may occur [40], as well as how gene transfer and

genetic drift work together towards speciation [41].

Methods for identifying species and LGT

The BSC and ESC disagree on the force (i.e., reproductive isolation versus ecological

specialization) that drives speciation, but both agree that DNA sequence clusters will

correspond with species. Compositional biases between species were observed as early as

1959, when the buoyant densities of nine different bacterial DNAs were shown to be highly correlated with

the molar fraction of guanine and cytosine [42]. Microbial species were originally distinguished

via DNA-DNA hybridization, in which a single-stranded reference DNA and a single-stranded

query DNA are mixed, and the degree of binding between the two molecules is measured [43].

If molecules from the query organism showed ≥70% re-association with the reference DNA

molecules, the query and reference organisms were classified as the same species [44]. With

DNA sequencing, scientists began sequencing cultured isolates. In 1995, the first bacterial

genome, that of Haemophilus influenzae, was sequenced [45]. The reference genome database grew

exponentially: by 2000, 27 microbial genome sequences had been published [46], and by 2005,

220 microbial genomes were sequenced with another 650 in progress [47]. One study utilized

this growing set of reference genomes and showed that 50 kilobase (kb) segments of a

prokaryotic genome are more similar to each other than to other genomes, and reflect species-

specific properties for DNA modification, replication, and repair [48]. Biases in nucleotide

composition between species have since been used for genome and metagenome assembly, as

well as for LGT detection.

The earliest LGT studies observed the transfer of phenotypes (e.g., “R” Pneumococcus

strains becoming virulent, or Shigella acquiring antibiotic resistance), but the majority of new

studies utilize computational methods to detect LGT in sequenced genomes. Computational

methods usually fall into two bins: tree-based and non-tree-based methods. Tree-based

methods involve comparing gene trees to a species tree, in which the species tree is often

constructed from a slow evolving, essential gene such as the 16S rRNA gene or a combination of

housekeeping genes [49, 50]. Each phylogenetic tree reflects the evolutionary history of the

gene(s) used to construct it. Thus, if the evolutionary history of a gene deviates significantly

from that of the species tree, it may be explained by LGT, duplication, gene loss, incomplete

lineage sorting, or homologous recombination [51]. Tree-based methods further enable

inference of directionality and time of transfer. Directionality may be based on the “out-of-

Africa” principle, which assumes that the taxonomic group with the largest representation of

the transferred gene is the donor [52, 53].

Tree-based methods are considered the gold standard, but are more computationally

intensive than non-tree based methods. Methods that do not require trees can be subdivided

into compositional and gene-based methods. Compositional methods search for changes in GC

content, oligonucleotide frequencies, or even structural features, such as interaction energies

between base pairs or chromatin structure, any of which may have arisen through LGT. In

contrast, gene-based methods look for discrepancies between gene distances and phylogenetic

distances. Approaches for this include i) searching for similar genes between distantly related

species, ii) calculating evolutionary rates for homologous genes and identifying those (potential

xenologs) with different evolutionary rates, and iii) identifying strain-specific genes shared with

other species but not within species [51]. Both compositional and gene-based methods are

limited to detection of relatively recent LGT events, since transferred sequences may ameliorate,

or become more similar to the host sequence over time [54].
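
As an illustration of the compositional approach, the following Python sketch flags genome windows whose GC content deviates from the genome-wide average. It is a minimal sketch only: the function names, window size, step, and z-score cutoff are assumptions chosen for illustration and are not taken from any of the cited methods, which typically rely on richer signals such as oligonucleotide frequencies.

```python
from statistics import mean, pstdev

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def atypical_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, gc, z) for windows whose GC content is unusual for this genome.

    The window, step, and cutoff values are illustrative assumptions.
    """
    starts = range(0, len(genome) - window + 1, step)
    gcs = [gc_fraction(genome[s:s + window]) for s in starts]
    if len(gcs) < 2:
        return []
    mu, sigma = mean(gcs), pstdev(gcs)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing stands out
    flagged = []
    for s, gc in zip(starts, gcs):
        z = (gc - mu) / sigma
        if abs(z) >= z_cutoff:
            flagged.append((s, gc, z))
    return flagged

# Toy genome: an AT-only background carrying a 200 bp GC-only insert at position 2000.
toy = "AT" * 1000 + "GC" * 100 + "AT" * 1000
candidates = atypical_windows(toy, window=200, step=100, z_cutoff=3.0)
```

Because transferred sequences ameliorate over time, a scan of this kind would be expected to miss older transfers, consistent with the limitation noted above.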

LGT in microbial communities

Estimates of LGT frequency were first calculated per taxon, and then per gene family.

Compositional methods predicted that 11% [55] to 17% [54] of the Escherichia coli chromosome

was acquired through LGT. Later studies compared LGT percentages between taxa: one study

found that LGT ranged from 0% of protein-coding genes in Mycoplasma genitalium to 16.6% of

protein-coding genes in Synechocystis PCC6803. This study further identified E. coli, Helicobacter

pylori, and Archaeoglobus fulgidus to have large proportions of transferred genes associated with

plasmid-, phage-, or transposon-sequences [56]. Symbionts and parasites such as Wigglesworthia

brevipalpis, Chlamydia, Mycoplasma, Rickettsia and Borrelia burgdorferi, were found to have lower

proportions of laterally transferred coding sequences (CDS) [53, 57]. Estimates for LGT

percentages across gene families have also been highly variable. Explicit phylogenetic methods

have since estimated that anywhere from 2% [58] to 60% [59] of genes are affected by LGT [60].

Forces that drive LGT within communities may include phylogeny, geography, and

ecology. Phylogeny is expected to play a strong role: closely related partners in a group will

preferentially exchange genes, since they will have shared genomic structure, machinery, and

phage host range [61]. One study inferred Bayesian phylogenetic trees for 5282 sets of proteins,

and found that Escherichia coli and Shigella have higher rates of gene transfer within

phylogenetic groups as compared to between phylogenetic groups [62]. Another study found

that integrons in Vibrio cholerae are associated with geography [63]. Lastly, taxa with similar

ecological needs may be found in close proximity, which fosters conjugation, cell fusion, or

GTAs; in addition, increased LGT via plasmids has been observed in biofilms [64]. One study

inferred LGT events between pairs of genomes if they shared 500 bp blocks with 99% similarity:

they found that genome pairs from the same environment had the most LGT events, followed

by genome pairs with small phylogenetic distances [65].
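
The block-sharing criterion described above can be sketched as follows. This is a simplified illustration that assumes the two sequences have already been aligned to equal length; the cited study may not have proceeded this way, and in practice such blocks are usually located with a local alignment search. The function and parameter names are hypothetical.

```python
def shared_blocks(aligned_a, aligned_b, block=500, min_identity=0.99):
    """Return start columns of windows of `block` aligned positions meeting the identity cutoff.

    Assumes both inputs are rows of the same pairwise alignment ("-" marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    n = len(aligned_a)
    if n < block:
        return []
    # 1 where the two rows carry the same base, 0 for mismatches and gaps.
    match = [int(x == y and x != "-") for x, y in zip(aligned_a, aligned_b)]
    hits = []
    running = sum(match[:block])
    for start in range(n - block + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            running += match[start + block - 1] - match[start - 1]
        if running / block >= min_identity:
            hits.append(start)
    return hits

# Toy example: ten mismatches at the start, then identical sequence.
a = "A" * 600
b = "C" * 10 + "A" * 590
hits = shared_blocks(a, b)
```

A genome pair would then be scored as carrying a putative LGT event if at least one such window is found.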

The human microbiota is likely to have high frequencies of LGT. More LGT was found

between human-associated microbial genomes, as compared to between human- and non-

human-associated microbial genomes, with most transfers occurring in the oral and gut sites

[65, 66]. Still, these studies focused on available reference genomes, which represent microbial

snapshots in time. Future work utilizing metagenomic contigs and shotgun metagenomic reads

across time may better capture de novo LGT events. For example, one study identified mobile

gene pools in Fijian and North American microbiomes from single-cell genome sequencing,

mapped shotgun metagenomic reads to the genes, and found that mobile gene abundances

were associated with diet and Fijian villages [67]. With this, they determined that LGT

frequencies are not only determined by microbial characteristics (i.e., phylogeny, geography,

and ecology), but may also be driven by host lifestyle and geography.

Transferred functions and their associated costs

There are two leading hypotheses for the types of genes transferred through LGT. The

first hypothesis assumes that genes can be divided into two classes, i) “informational” genes,

which are utilized for replication, transcription, and translation, and ii) “operational” genes,

which are used in metabolism [68]. This hypothesis predicts that the latter gene type is more

likely to be transferred, since the former gene type is responsible for cell division, the most

fundamental process for life. The second hypothesis is termed “the complexity hypothesis”, and

states that genes integrated into large, complex systems (i.e., part of large signaling pathways)

are less likely to be transferred than genes that are part of smaller pathways [69]. These two hypotheses

are not mutually exclusive: indeed, some have found that “informational” genes are more likely

to be part of complex systems [70]. Accordingly, predicted transferred functions have included

“plasmid, phage, and transposon functions”, “cell surface structures”, “surface

polysaccharides”, “DNA transformation”, “pathogenesis”, and “toxin production and

resistance” [57]. Another study utilizing phylogenetic trees found that “energy metabolism”

and “mobile and extrachromosomal element functions” were enriched in discordant

phylogenetic trees, whereas “DNA metabolism,” “protein synthesis,” “protein fate,” and

“regulatory functions” biosynthesis were depleted [71].

Transferred genes may not be retained even if they are beneficial, since they may incur

high costs. Costs include disruption of neighboring genomic features via insertion, utilizing

limited resources through transcription and translation, and disrupting interactions within the

cellular network [72]. Furthermore, if transferred genes contain different codon usage, they may

lead to improper expression and/or protein mis-folding [73], which may incur cytotoxicity.

Different microbial taxa may have a variety of mechanisms to handle such costs: for example,

some taxa harbor H-NS proteins, which bind regions of high AT content and silence expression

[74]. Also, most successfully transferred genes eventually ameliorate [54]. The former operates

immediately, while the latter takes time, indicating that different mechanisms may operate on

different timescales to facilitate and select for gene integration.

Evolutionary legacy of LGT

The significance of LGT on evolution is still being debated today. Scientists found that

phylogenetic trees constructed from other “universal” genes such as heat shock protein HSP70

and glutamate dehydrogenase do not agree with the rRNA-based universal phylogenetic tree

[54]. Furthermore, informational genes such as aminoacyl-tRNA synthetases (aaRSs), which

attach amino acids to the corresponding tRNAs, have evidence of transfer [75]. These

discrepancies have led to two hypotheses. The first is the “early massive horizontal transfer

hypothesis”, in which LGT occurred early in prokaryotic evolution and created modern cells,

after which vertical gene transfer became the dominant evolutionary force (as compared to

LGT). The second is the “continual horizontal transfer process”, in which LGT has been a

continuous force from early evolution that continues today [68, 70].

Woese has argued in favor of the “early massive horizontal transfer hypothesis”

[76]. He argues that the rRNA gene represents cellular information processing

systems such as replication, transcription, and translation, which are fundamental to cells and

differ between bacteria, archaea, and eukaryotes. This implies that multiple progenitor cells,

each with their own information processing systems, must have existed before the division of

the three domains. These progenitor cells were not well-developed, which allowed for extensive

LGT that may have eventually given rise to the efficient, modular cells seen today. In contrast,

Lake has argued for the “continual horizontal transfer process” hypothesis [70]. To test both

hypotheses, he classified genes as “informational” or “operational” [68], and then constructed

phylogenetic trees for each gene type. He assumed that informational genes were not subject to

transfer or were transferred infrequently (which is debatable [69]). If the phylogenetic trees for

the two gene types were similar, it would indicate that most LGT had occurred before

formation of the three domains, thus supporting the “early massive horizontal transfer

hypothesis”. Instead, he found that phylogenetic trees for the two gene types were significantly

different, which indicates that LGT is still an ongoing force today, thereby supporting the

“continual horizontal transfer process”. Others have argued that i) the observed variation in

nucleotide composition across whole genomes and ii) Occam’s Razor support the “continual

horizontal transfer process” [60].

Surveying microbial communities in the built-environment

Another set of microbial interactions is between microbial communities and their

environment. In 1934, the Dutch microbiologist Lourens G. M. Baas Becking articulated that

“everything (microorganisms) is everywhere: but the environment selects” [77, 78]. This

statement put forth a hypothesis that has shaped current microbial ecology: microbial

distributions were believed to be primarily shaped by dispersal and environment, as opposed to

earth history and geography [79]. This hypothesis is demonstrated in the human microbiome, in

which microbial communities and their associated functions are often site-specific [80, 81]. In

contrast, the built-environment seems to be primarily shaped by dispersal, especially from

human-associated communities [82]. To better understand how microbial communities outside

the human body affect human health, researchers must first understand these dispersal patterns

and then determine how these microorganisms interact with their new environment.

Microbial composition of the built-environment

The built-environment is the ecological habitat of humans, consisting of the physical

parts of where we live and work (such as homes, offices, streets) [83]. Humans

spend most of their time in the built-environment: one study showed that Americans (across

states) spend ~87% of their time indoors and ~6% of their time in an enclosed vehicle

(consistently over the past few decades) [84]. As of 2015, buildings were estimated to cover 1.3%

to 6% of global ice-free land [85, 86], and are expanding rapidly [87, 88]. Although building

temperatures and humidity vary across the world, each unit is enclosed and consistently

maintains these variables throughout the day and across seasons [88]. They may also contain a

variety of materials and chemicals not found in the natural environment [89]. It is therefore

important to identify i) which microbes are in the built-environment, and ii) how they adapt

to these environments. Furthermore, distinguishing how microbes, microbial compounds, and

man-made chemicals affect human health can result in actionable changes in hygiene and

building construction.

Currently, most studies have focused on building surfaces such as homes, restrooms,

hospitals, and classrooms. These studies have shown that the majority of microbes in the built-

environment are derived from human skin, with some influence from human interaction and

the surrounding environment [90]. This is unsurprising, given that humans shed between 2 ×

10^8 and 10 × 10^8 skin cells/day [91]. Colonization and de-colonization happen rapidly: the Home

Microbiome Study monitored seven families in their homes for six weeks, in which three

families had samples taken pre- and post- move into their new homes. For these three families,

the differences in microbial community structure between their previous and new homes were

insignificant, indicating quick colonization of the new home. Researchers also quantified how

much each individual contributed to the microbial signal of the house, and found that an

absence of three days led to smaller contribution [92], indicating quick de-colonization. The

effect of human interaction can be observed via microbial community patterns on different

surfaces and room types. For example, a study of public restrooms showed that the microbial

communities of bathroom floors were likely derived from soil taxa, while communities on toilet

seats, handles, and the inside of the stall were derived from gut bacteria and urine [93]. Lastly,

the surrounding environment may introduce new members to built-environment communities:

one study found that phylogenetic diversity was correlated with ventilation air, airflow rates,

and humidity and temperature [94].

These findings indicate that the human microbiome is rarely colonized or altered by

built-environment microbial communities. Instead, a person may be primarily exposed to

his/her own microbiome, which could then self-perpetuate or perpetuate to other occupants

within the building, either to their benefit or detriment [95]. One example is the effect of pets on

their owners: some studies found that infants in homes with dog or cat exposure have

decreased risk of atopy [96], though other studies identified pets as sources of endotoxins [97,

98]. More work is needed to determine what constitutes a healthy indoor microbiome [3],

especially since adverse health effects have been tied to microbial and non-microbial sources.

Microbial threats include single pathogens such as Legionella, which may be transferred through

water systems and inhalation (if aerosolized), as well as microbial components such as

endotoxin, which has been paradoxically linked to promotion of and protection against asthma

[99]. Non-microbial threats include damp indoor environments, which have been associated

with respiratory diseases, and may further be linked to growth of mold and fungal species

[100].

Applications for the built-environment

Since built-environment microbial communities are largely derived from human skin,

they may also resemble their occupants, giving rise to forensic applications. The Home

Microbiome Study could predict which family belonged to which home using microbial

community profiles [92]. Many built-environment studies have also found that occupants of the

same space have significantly more similar microbiomes. For example, families not only share

microbes with one another, but also with their dogs [101]. Cohabiting couples could be

matched based on their skin microbiome samples ~86% of the time [102]. Lastly, one study

collected shoe and phone samples from individuals at three different conferences: random

forest models could predict which conference each sample was taken from, and distinguish

between two individuals’ shoe samples at a single conference [103]. These studies indicate that

individuals may be linked to highly-trafficked buildings, as well as to colleagues within the

same space [104].

Another potential application is improved protocols for hygiene, especially with the

development of the hygiene hypothesis. The hygiene hypothesis was conceived as early as 1989:

David Strachan found that high prevalence of hay fever (at ages 23 and 11) and eczema (in the

first year of life) was linked to smaller family sizes. He hypothesized that fewer infections early

in life (due to lack of disease transmission in smaller families) lead to a greater risk of

allergic disease later in life [105]. His hypothesis was replaced by the “Old Friends” hypothesis in

2004: Rook et al. stated that increased disease types (such as allergy) in developed parts of the

world was due to lack of exposure to “old friends”, which are defined as microbes that co-

evolved with humans. These “old friends” facilitate regulatory T cell development, thereby

preventing inappropriate immune responses [106]. The “Old Friends” hypothesis has led to the

general consensus that increased microbial diversity is favorable, though others argue it is

simply a community property [107]. With this, hygiene should be redefined as an effort to select

for beneficial bacteria, rather than an attempt at complete sterilization [108, 109]. Suggested

interventions have been to build with materials that select for specific microbes, as well as

increasing building ventilation and outdoor green space to boost microbial diversity [3, 110].

Technical considerations for sampling the built-environment

The majority of built-environment samples have been sequenced using 16S rRNA

sequencing due to low biomass, which makes them particularly susceptible to batch effects and

contamination from sequencing kits and reagents. The former was demonstrated in a study that

monitored office buildings in Flagstaff, AZ; San Diego, CA; and Toronto, ON for one year.

Samples were grouped and sequenced by season, with eight technical replicates included in

each sequencing run. Unfortunately, sequencing run was conflated with seasonality: even

technical replicates varied widely across run. Researchers attempted to eliminate the batch

effect by removing highly variable, low-abundance taxa, which worked poorly [111]. Other

studies have been affected by contaminants found in sequencing reagents, extraction kits, and

PCR reagents [112-114]. For example, one study found that age was the driver of observed trends

in the nasopharyngeal microbiome among children in a refugee camp, but another study

showed that the driver was kit contaminants [115]. Technical replicates, extraction

and negative controls, and microbial spike-ins have all been proposed to address the problem

of batch effects and contaminants. These technical challenges may be further complicated by the

rise of citizen microbiology projects, in which aseptic technique, sample collection logistics, and

privacy concerns must be considered [116].

Unfortunately, built-environment studies that rely on DNA sequencing for community

profiling cannot distinguish between DNA in live cells and extracellular DNA, which can

survive on surfaces for weeks to years [117]. Currently, it is unclear whether most microbes on

these surfaces are active, dormant, or dead. Some have described the built-environment as a

microbial wasteland, where most microbes are likely dormant or dead [82, 111]. One study

found that 40% of prokaryotic and fungal DNA in soil was extracellular or from cells that were

not intact [118]. Several methods have been developed that may assist in assessing viability,

which primarily function by examining membrane integrity, measuring transcription or

translational activity, or measuring cellular respiration (through ATP). Still, the majority of

these methods are for bacteria, and may not work on viruses or spores [119].

The role of DNA sequencing for microbial profiling

High-throughput DNA sequencing has proven invaluable for investigating diverse

environmental and host-associated microbial communities. Sequence-based taxonomic profiling

of a microbiome can be carried out using either amplicon (typically the 16S rRNA gene) or

whole metagenome shotgun (WMS) sequencing (reviewed in [120-122]). The resulting DNA

sequence data are then used to assess the community in at least two ways: taxonomic profiling,

which answers, “who is present in the community?” and functional profiling, which answers,

“what could they be doing?” Still, there are several limitations to DNA-based approaches. First,

the most common taxonomic profilers provide at best species-level taxonomic resolution,

whereas many important phenomena occur at the strain level. Second, DNA sequencing cannot

directly measure the functional activity of a community under a given set of conditions. While

the former has been addressed through sequencing and bioinformatics techniques, the latter

may require multi’omic data sets, which include community RNA (transcriptomics), protein

(proteomics), and metabolite abundances (metabolomics), preferably in an integrated

framework.

Amplicon Sequencing

One common method for profiling a microbial community involves sequencing specific

microbial amplicons (predominantly the bacterial 16S rRNA gene). Although amplicon-based

sequencing considers only one or a few microbial genes, it may be used for taxonomic,

phylogenetic, and even functional profiling. It may also be used to profile low biomass samples,

as compared to WMS sequencing. To identify which taxa are present, amplicon sequences are

either directly binned to reference taxa [123, 124] by classification or phylogenetic placement, or

more commonly they are first clustered into operational taxonomic units (OTUs) sharing a fixed

level of sequence identity (often 97%) [125, 126], and then binned as a whole (often by

classification of a reference sequence). Functional profiles can be approximated for marker-

based samples by associating 16S rRNA or marker genes with annotated reference genomes,

aggregating coding sequences (from the reference genomes) into gene families, and then

inferring gene family abundances through taxonomic abundances [127].
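
The OTU clustering step described above can be illustrated with a greedy, centroid-based sketch. It assumes that reads are already aligned and of equal length, so that identity is simply the fraction of matching positions; real OTU pickers compute identity from pairwise alignments and usually order reads by abundance first. The function names are hypothetical, and the 97% default mirrors the threshold mentioned in the text.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first centroid within `threshold` identity, else open a new OTU.

    Reads are processed in input order; the first member of each cluster is its centroid.
    """
    centroids = []
    clusters = []
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[i].append(read)
                break
        else:
            # No existing centroid is close enough: this read seeds a new OTU.
            centroids.append(read)
            clusters.append([read])
    return clusters

# Toy 100 bp reads: r2 differs from r1 at one position, r3 at four positions.
r1 = "ACGT" * 25
r2 = "T" + r1[1:]
r3 = "TTTTT" + r1[5:]
otus = greedy_otus([r1, r2, r3])
```

In this toy example the second read joins the first OTU (99% identity), whereas the third (96% identity) seeds a new one, which illustrates how a fixed cutoff partitions closely related sequences.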

Unfortunately, the singular use of the 16S rRNA marker gene has several problems.

First, some species have multiple copies of the 16S rRNA gene, which in turn have different

sequences [125]. Second, the 16S rRNA gene has difficulty resolving species due to its slow

evolution: strains with less than 97% 16S rRNA sequence identity are likely to be different

species, but strains with more than 97% 16S rRNA sequence identity are not necessarily the same species [49, 128].

The use of the 97% cutoff is also somewhat arbitrary and based on concordance with DNA-

DNA hybridizations [129]. In order to improve taxonomic resolution from 16S rRNA

sequencing, two techniques have been developed. One recent technique, termed “oligotyping”,

uses a sequence entropy-based approach to identify maximally informative sites within the 16S

rRNA gene to improve OTU resolution [130]. Oligotyping is advantageous for distinguishing

closely related taxa (such as those that differ by a single 16S rRNA nucleotide) and has been

applied to study subspecies-level population structure in the vaginal microbiome [131] and to

link sewage samples to specific fecal pollution sources [132]. In addition, a new, low-error

approach to 16S rRNA gene sequencing, termed LEA-Seq, has been proposed and used to

profile stable carriage of host-specific strains in the human gut microbiome [133].
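
The entropy-based idea behind oligotyping can be sketched in a few lines of Python. This is a single-pass simplification under stated assumptions (reads are pre-aligned and of equal length; the number of positions to keep is chosen by the user), not the published procedure, and all function names are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the bases observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def most_informative_positions(aligned_reads, top=2):
    """Rank alignment columns by entropy, highest first, and drop invariant columns."""
    columns = zip(*aligned_reads)  # assumes all reads have the same length
    scored = [(column_entropy(col), pos) for pos, col in enumerate(columns)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pos for ent, pos in scored[:top] if ent > 0]

def oligotypes(aligned_reads, positions):
    """Count reads by the bases they carry at the chosen high-entropy positions."""
    return Counter("".join(read[p] for p in positions) for read in aligned_reads)

# Toy alignment: the reads differ only at the third column.
reads = ["ACGT", "ACGT", "ACTT", "ACTT"]
positions = most_informative_positions(reads)
groups = oligotypes(reads, positions)
```

Here the four reads would collapse into a single OTU at 97% identity only for much longer sequences, but the principle is the same: a single variable column is enough to separate two groups that a fixed identity cutoff would merge.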

WMS Sequencing

WMS sequencing involves sequencing “random” DNA fragments from microbial

communities. In contrast to amplicon-based profiling, taxonomic profiling of metagenomes uses some or all shotgun reads to

determine membership in a community. This can be done in a number of ways, including

metagenomic assembly followed by phylogenetic binning or placement of contigs [134]. More

commonly, short reads are profiled directly by comparison to a reference catalogue of microbial

genes or genomes. Alternatively, reads can be mapped to a (pre-computed) catalog of clade-

specific marker sequences (with [135] or without [136] pre-clustering). Finally, reads may be

assigned to species based on agreement with models of genome composition [137] or by exact k-

mer matching [138], thus enabling placement of reads or assembled contigs when

corresponding reference genomes are not available (which is common for poorly characterized

communities).
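
Read assignment by exact k-mer matching can be illustrated with the following minimal sketch. It is not the algorithm of any cited tool: published classifiers generally resolve k-mers shared between references using a taxonomy, whereas this sketch simply ignores shared k-mers. The reference names, sequences, and value of k are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index

def classify_read(read, index, k):
    """Return the reference supported by the most reference-specific k-mers, or None."""
    votes = Counter()
    for km in kmers(read, k):
        owners = index.get(km, ())
        if len(owners) == 1:  # only k-mers unique to one reference vote
            votes[next(iter(owners))] += 1
    if not votes:
        return None
    (best, best_n), *rest = votes.most_common(2)
    if rest and rest[0][1] == best_n:
        return None  # tie between two references: leave the read unassigned
    return best

# Hypothetical toy references and reads.
refs = {"species_A": "ACGTACGTTTGACC", "species_B": "GGGCCCTTTAAAGG"}
index = build_index(refs, k=5)
call_a = classify_read("GTACGTTTG", index, k=5)
call_b = classify_read("CCCTTTAAA", index, k=5)
call_none = classify_read("AAAAAAA", index, k=5)
```

Reads with no matching k-mers remain unassigned, which mirrors the dependence of reference-based profiling on database completeness.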

WMS sequencing is the preferred method for strain-level profiling due to its ability to

identify variation throughout microbial genomes. Strains may differ in sequence through loss or

gain of genomic regions or through single nucleotide polymorphisms (SNPs), both of which can

be identified by mapping shotgun reads to reference genomes. For example, mapping WMS

reads from tongue samples to genomes of Streptococcus mitis highlighted the presence and

absence of genomic islands in isolates of that species from individuals enrolled in the Human

Microbiome Project (HMP) [2]. Genomic islands were shown to contain multiple, functionally

coherent genes (e.g. subunits of the V-type H+ ATPase) that were gained and lost together,

suggesting a mechanism for individual- and body site-specific functional specialization.

Detection of SNP differences requires greater sequencing depth. Existing WMS data from

human stool samples have been used to identify reference genomes with high sequencing

coverage which were then scanned for SNPs [135]. This analysis revealed that subject-specific

SNP variation tended to remain stable for up to a year and was comparatively more conserved

than overall species abundance.

Functional profiling of metagenomic samples typically begins by associating new

sequence data with known gene families. This can be accomplished by directly mapping DNA

or RNA reads to databases of gene sequences that have been clustered at the family level; such

databases include KEGG Orthology [139], COG [140], NOG [141], Pfam [142], and UniRef [143].

Naturally, the number of reads that can be mapped in this manner depends on the

completeness of the underlying reference database. Alternatively, reads can be assembled into

contigs to determine putative protein-coding sequences, and then the CDSs are assigned to gene

families following the same or similar methods used for annotating isolate microbial genomes.

Both strategies yield profiles of the presence and absence of a gene family as well as the relative

abundance of each family within a sample. Functional profiles at the gene family-level may

contain many thousands of features. Downstream analyses can be made more tractable by

further performing per-organism or whole-community pathway reconstruction based on these

genes. Although not specifically designed for microbial community analysis, species-specific

pathway databases such as KEGG [139], MetaCyc [144], and SEED [145] can be useful for this

purpose. Integrated bioinformatics pipelines such as IMG/M [146], MG-RAST [145],

MetaPathways [147], and HUMAnN [148] have been developed to streamline the conversion of

raw meta’omic sequencing data into more easily-interpreted profiles of microbial community

function.

Contig Assembly

Deeper WMS sequencing can facilitate the de novo assembly of contigs and even

microbial genomes. Assemblies are generated by connecting overlapping sequencing reads to

form longer sequences, which may be represented as an assembly graph in which nodes

embody sequence information (such as k-mers) and edges connect adjacent or overlapping

sequences. Metagenomic samples come with special challenges that may lead to errors in the

assembly graph. These samples contain multiple taxa with differential abundances, leading to

uneven coverage and the presence of conserved sequences across taxa, and making it difficult to

determine where edges should be drawn. Several tools have been built to address these

problems: MetaVelvet-SL generates a single assembly graph, and then uses k-mer coverage to

identify sub-graphs that are assumed to be single species; in contrast, both MetaSPAdes and

IDBA-UD use multiple k-mer sizes to iteratively improve the assembly graph [149]. General

challenges in assembly arise from technical variables, such as sequencing errors, chimeric reads,

and read lengths that are shorter than genomic repeats [150], as well as the size of the dataset,

which increases computational intensity. Lastly, there is no gold standard to determine if a

given assembly is correct. As a result, earlier metagenomic studies that utilized assembly

limited analyses to cataloguing genes and functions [151, 152], though one study went further

and identified plasmids and scaffold synteny in samples collected from the Sargasso Sea [153].
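
The assembly graph described above can be illustrated with a minimal sketch. It assumes a de Bruijn-style construction in which nodes are (k-1)-mers and each k-mer in a read contributes one edge; the reads and the function name `de_bruijn_graph` are hypothetical, and real metagenomic assemblers add error correction, coverage tracking, and graph simplification that are omitted here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads from a single short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
```

In this toy case every node has a single outgoing edge, so the path through the graph is unambiguous. Sequences conserved across taxa would instead give one node several outgoing edges, which is the ambiguity discussed above.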

Assembly is crucial to studying microbial communities, and may be used to identify

novel sequence elements, generate reference genomes for microorganisms that are uncultivated or

poorly represented in reference databases, and characterize the synteny of microbial

genes. Improvements in metagenomic contig assembly have led to the recovery of whole

microbial genomes from communities [92, 154-156], which was previously only possible in

low-complexity communities [157]. One study was able to assemble 31 bacterial genomes after

binning assemblies by differential read coverage [158]. Increasing the number of reference

genomes across the tree of life may help with discovery of novel gene functions and pathways

[159]. Furthermore, assemblies can reveal novel genomic rearrangements and LGT events not in

previous reference genomes. For example, one study found that the genomic architecture of

mobile genes in human gut samples was specific to individuals, even though individual mobile

genes were found universally across U.S. and Fijian cohorts [67].

Summary

The advent of DNA sequencing has made it quicker and easier to profile microbial

communities, while further development of tools for analyzing and interpreting sequencing

data may reveal community trends and interactions between individual

microbes. In Chapter 2, we describe the tool WAAFLE, a workflow that annotates assemblies

and finds LGT events from assembled metagenomics contigs. We then apply WAAFLE to the

Human Microbiome Project, and find that properties such as phylogenetic relatedness and

abundance affect LGT frequency, and that transferred functions are enriched for mobile

elements and outer membrane receptors. In Chapter 3, we survey the microbial communities on

the Boston subway using 16S rRNA and WMS sequencing. We observe that the microbial

community consists mostly of skin microbes, and that overall pathogenic potential is low.

Chapter 2:

Lateral Gene Transfer in the Human Microbiome

Attributions

The contributors to this work include Tiffany Hsu, Eric A. Franzosa, Chengwei Luo,

Dennis Wong, Morgan Langille, Robert G. Beiko, and Curtis Huttenhower, in no particular

order. T.H. and E.A.F. developed the software implementation, evaluated the method, and

applied the tool to the Human Microbiome Project. All authors helped design the method and

interpret the data. T.H. wrote the text with feedback from E.A.F., M.L., R.G.B., and C.H.

Introduction

Lateral gene transfer (LGT) is the movement of genetic material between organisms

without sexual or asexual reproduction [160]. Its role in microbial communities is not well

understood, due to the difficulty in identifying LGT events. First, evolutionarily significant

events are difficult to ascertain. These events include ancient LGTs, which have likely

ameliorated to the host genome, as well as LGT of homologs, which are conserved across

species and difficult to distinguish from orthologs (homologs that arose through speciation) and

paralogs (homologs that were duplicated and have a separate evolutionary trajectory) [53]. Second,

transient events, in which LGT occurs but the organism does not accept or maintain the

transferred sequence, are difficult to measure [161]. Still, LGT has proven to be an important

evolutionary force [15], especially with the rise of antibiotic resistance in human-associated

microbial communities. LGT events may change the fitness of individual microbes, which may

in turn affect microbial community composition and function. These events may eventually

give rise to new species, impacting both evolutionary history and phylogeny [70, 76].

Several studies have characterized the quantity of and forces shaping LGT in human-

associated microbial communities. For human-associated microbial genomes, most transfers

occur in the oral and gut sites [65, 66]. LGT may be shaped by host factors, such as lifestyle and

geography, as well as microbial traits, such as phylogeny and ecology. One study found that

cultural practices affected LGT rates: mobile gene pool abundances in Fijian and North

American microbiomes were associated with diet and Fijian villages [67]. Another study found

increased LGT between human-associated isolates as compared to between human-associated

and non-human-associated isolates. Isolates with shorter phylogenetic distances and from

similar sources (between human and/or non-human) had increased transfer, though the latter

had the stronger effect [65]. Accordingly, some have proposed that LGT is a mechanism used

between niche-sharing microbes to adapt to changing conditions [162], while others have

suggested it as a mechanism to enforce cooperation or competition [163, 164]. This is further

supported by the observation that transferred genes are enriched for functions in cell surface,

DNA-binding, and pathogenicity, which may be necessary for survival in different

environments [57].

Microbial community sequencing has generated 16S rRNA and metagenomic shotgun

datasets, yet most software tools available for LGT detection are designed for whole and/or

draft genomes [165, 166]. Methodologies for detecting LGT fall roughly into three categories:

composition-based, alignment-based, and phylogeny-based approaches. Composition-based

methods assume that laterally transferred genes will have distinct nucleotide compositions as

compared to the host genome: software such as Alien_Hunter [167] uses interpolated variable

order motifs (IVOMs) to find genomic regions with significant shifts in composition.

Alignment-based methods look for discrepancies between gene distances and phylogenetic

distances: for example, Darkhorse [168] aligns protein sequences (from a single genome) to a

reference database and infers LGT using bitscore and phylogeny. In contrast, IslandPick uses

genome alignments and comparative genomics to identify LGT in closely related genomes [169].

Phylogeny-based implementations such as rSPR [170], PhylTr [171], and MaxTiC [172] search

for incongruence between gene trees and species trees. Only the software Daisy [173] utilizes

shotgun reads, but still requires prior knowledge of donor and recipient genomes.

Here, we present WAAFLE, a Workflow for Annotating Assemblies and Finding LGT

Events, which uses alignment-based methods to detect LGT events in contigs assembled from

metagenomic shotgun sequencing sets. A tool that can utilize shotgun sequencing data has

several advantages. First, we can potentially find new LGT events that are not yet reflected in

reference genomes. Second, since each metagenomic sample represents a snapshot in time,

users will have the ability to compare LGT rates between individuals, conditions, and across

time. Third, although WAAFLE is limited to fairly recent events, the use of reference databases

allows us to identify gene functions and perform taxonomic assignment with higher accuracy,

especially in human-associated datasets. In this study, we apply WAAFLE to the Human

Microbiome Project Phase 1-II (HMP 1-II) [174] assembled contigs. We quantify LGT

frequencies for taxon pairs at the genus level across six major body sites, where each frequency

represents the number of unique, novel, and fixed LGT events per sample (each sample being a

single body site in an individual). We then i) determine how abundance and phylogeny

influence LGT frequencies, ii) characterize taxon pair formation and partner preference, and iii)

identify functions enriched in LGT contigs.

Results

Identifying recent LGT events from metagenomic shotgun sequencing

In order to detect LGT events from metagenomic shotgun sequencing, we developed

WAAFLE, a Workflow to Annotate Assemblies and Find LGT Events (Fig. 2-1B). WAAFLE has

one required input, i) assembled metagenomic contigs in FASTA format, and two optional

inputs, ii) gene calls for each contig and iii) a nucleotide reference database of genes with

taxonomic and functional annotations (down to the species level, and for UniRef50 and

UniRef90 terms, respectively). A default reference database of pangenomes, MetaRef [175], is

provided. WAAFLE conducts a four-step process to output a single file in which each contig is

classified as containing LGT or not, with each gene annotated with a taxon and function. First,

contigs are searched against the nucleotide reference database via BLASTN. Second, contigs are

annotated with genes, either by connecting overlapping BLAST alignments or using supplied

gene calls. Third, contig genes are assigned UniRef50/90 annotations and taxon scores; the latter

represents how well a given taxon characterizes a gene. To do this, we bin BLAST hits by gene,

and then group BLAST hits within bins by taxonomic annotation. From the BLAST hit bins, we

designate the most common UniRef50/90 term to each gene. From the BLAST hit groups, we

calculate a single score per taxon per gene using the percent identity and subject coverage.

Fourth, contigs are classified as having LGT or not. Using the taxon scores, we determine

whether genes across a contig are best explained by two taxa or one (Fig. 2-1B).
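
Step three can be sketched as follows. The text states only that the taxon score is calculated from percent identity and subject coverage; the sketch assumes, purely for illustration, that the score is their product and that the best hit per taxon is kept. The hit records, the UniRef labels, and the function name `annotate_genes` are hypothetical and are not WAAFLE's actual interface.

```python
from collections import Counter, defaultdict


def annotate_genes(hits):
    """Bin hits by gene, then assign a function and per-taxon scores.

    Each hit is a dict with keys: gene, taxon, uniref, identity, coverage
    (identity and coverage given as fractions between 0 and 1).
    """
    by_gene = defaultdict(list)
    for hit in hits:
        by_gene[hit["gene"]].append(hit)

    functions, scores = {}, {}
    for gene, gene_hits in by_gene.items():
        # Gene function: most common UniRef term across the gene's hits.
        terms = Counter(h["uniref"] for h in gene_hits)
        functions[gene] = terms.most_common(1)[0][0]
        # Taxon score (assumed form): best identity * coverage per taxon.
        taxon_scores = defaultdict(float)
        for h in gene_hits:
            score = h["identity"] * h["coverage"]
            taxon_scores[h["taxon"]] = max(taxon_scores[h["taxon"]], score)
        scores[gene] = dict(taxon_scores)
    return functions, scores


# Hypothetical BLAST hits for a single gene.
hits = [
    {"gene": "g1", "taxon": "Bacteroides", "uniref": "UniRef90_A",
     "identity": 0.98, "coverage": 1.0},
    {"gene": "g1", "taxon": "Bacteroides", "uniref": "UniRef90_A",
     "identity": 0.90, "coverage": 0.5},
    {"gene": "g1", "taxon": "Alistipes", "uniref": "UniRef90_B",
     "identity": 0.80, "coverage": 0.5},
]
functions, scores = annotate_genes(hits)
```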

Figure 2-1. WAAFLE pipeline overview. A) Within microbial populations, genes can be

transferred vertically or laterally, which may confer adaptive traits to individual microbes and

affect the community composition and function. To understand the impact of LGT, we built the

tool B) WAAFLE, which identifies LGT events within metagenomic contigs using a four-step

process. First, WAAFLE searches contigs against a reference species pangenome database,

which is generated by downloading NCBI isolate genomes, binning isolate genes by species,

and then clustering binned species genes at 97% nucleotide identity. Second, WAAFLE calls

genes (if not supplied) by connecting overlapping BLAST hits. Third, WAAFLE assigns each

gene a function and taxon scores. To do this, alignments are first binned by genes: the most

common UniRef50/90 annotation across hits per gene is assigned as the gene function. Binned

alignments are then further grouped by taxa, and taxon scores are calculated using percent

identity and subject coverage. Fourth, we classify the contig as having LGT or not. If a single

taxon has taxon scores above k1 (blue threshold) across all contig genes, the contig is predicted

to not have LGT. Otherwise, if two taxa have taxon scores above k2 (red threshold) across all

contig genes, the contig is predicted to have LGT. C) To evaluate WAAFLE and its parameters,

we generated synthetic contigs by selecting random donor and recipient genomes at varying

taxonomic levels. We chose a three-gene region from the recipient genome, and

replaced the center gene with a gene from the donor genome. We then truncated the newly

formed contig at both ends.

How accurately WAAFLE detects LGT depends both on the contig assembly quality and

WAAFLE parameter settings. False positive LGT calls may arise from contig misassemblies,

which we hypothesized would have steep drops in read coverage. To identify misassemblies,

we mapped shotgun reads to metagenomic contigs and examined gene junctions, the region

between two contig genes. Contigs that i) had low coverage at gene junctions relative to

flanking genes and ii) lacked paired or single read support for the junction were removed from

analysis, regardless of LGT status (Fig. I-1, Fig. I-2). WAAFLE may also call different amounts

of LGT depending on its five parameters, which include subject coverage (s), overlap

percentage (o), gene length (g), one-taxon threshold (k1), and two-taxon threshold (k2) (Table I-1).

The first three parameters are utilized to minimize false positive gene calls (step 2), which may

lead to increased LGT calls. The last two parameters are employed in LGT classification (step 4).

Specifically, WAAFLE identifies a contig as not having LGT if one taxon has taxon scores

greater than k1 across all genes. If no single taxon scores above k1, WAAFLE searches for two

taxa that collectively have taxon scores greater than k2 across all genes. If the contig contains

such a pair, it is classified as LGT; otherwise, it is classified as “ambiguous”. Lowering k1 and

raising k2 thresholds make it more difficult to call LGT.
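
The classification rule in step 4 can be written as a short sketch. It assumes that two taxa "collectively" exceed k2 when, for every gene, the better of the two taxon scores is above k2; this reading, the function name `classify_contig`, and the example scores are assumptions for illustration. The default thresholds are the values chosen later in this chapter (k1 = 0.5, k2 = 0.8).

```python
from itertools import combinations


def classify_contig(gene_scores, k1=0.5, k2=0.8):
    """Classify a contig from per-gene taxon scores.

    gene_scores: list with one {taxon: score} dict per gene on the contig.
    Returns (label, taxa), where label is 'no_lgt', 'lgt', or 'ambiguous'.
    """
    taxa = sorted({t for scores in gene_scores for t in scores})

    # One taxon scoring above k1 on every gene explains the contig: no LGT.
    for taxon in taxa:
        if all(scores.get(taxon, 0.0) > k1 for scores in gene_scores):
            return "no_lgt", (taxon,)

    # Otherwise look for a pair that jointly covers every gene above k2.
    for a, b in combinations(taxa, 2):
        if all(max(scores.get(a, 0.0), scores.get(b, 0.0)) > k2
               for scores in gene_scores):
            return "lgt", (a, b)

    return "ambiguous", ()


# Hypothetical three-gene contig whose center gene matches a second taxon.
contig = [{"A": 0.95, "B": 0.10}, {"A": 0.20, "B": 0.90}, {"A": 0.97}]
label, pair = classify_contig(contig)
```

Under this sketch, lowering k1 lets the first loop succeed more often, and raising k2 makes the second loop fail more often. Both changes reduce the number of LGT calls, consistent with the statement above.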

WAAFLE performance on synthetic data

To set the default WAAFLE parameters, we generated a synthetic dataset from the NCBI

isolate genomes. This dataset consisted of 1000 contigs spanning 8 taxonomic levels, with 25

donor-recipient pairs at each level. Each contig was created by i) selecting a donor-recipient pair

with some taxonomic level difference, ii) choosing a recipient genome fragment containing

three genes, iii) replacing the center gene (of the fragment) with a random gene from the donor

taxon, and iv) truncating the contig ends (Fig. 2-1C). It should be noted that the NCBI isolate

genomes used to generate the synthetic dataset are the same genomes used to create WAAFLE’s

species pangenome database. Consequently, the species pangenome database contains all the

species and genes present in the synthetic contigs. In reality, the reference database will be

missing species and genes potentially present in biological data. To simulate missing

information in the pangenome database, we removed 20% of the BLAST alignments (to the

synthetic contigs) generated from the first step in WAAFLE.
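
The construction of a synthetic contig and the removal of alignments can be sketched as below. Genes are represented as plain sequence strings with no intergenic regions, and the truncation length (`max_trim`) and helper names are hypothetical simplifications. The sketch illustrates the procedure described above and is not the code used to build the dataset.

```python
import random


def make_synthetic_contig(recipient_genes, donor_genes, rng, max_trim=50):
    """Replace the center gene of a three-gene recipient region with a donor gene."""
    # Choose three consecutive recipient genes.
    start = rng.randrange(len(recipient_genes) - 2)
    left, _, right = recipient_genes[start:start + 3]
    # Pick a random donor gene to stand in for the center gene.
    donor = rng.choice(donor_genes)
    # Truncate both contig ends, always keeping at least one base per flank.
    left = left[rng.randint(0, min(max_trim, len(left) - 1)):]
    right = right[:len(right) - rng.randint(0, min(max_trim, len(right) - 1))]
    return left + donor + right


def drop_hits(hits, fraction, rng):
    """Randomly remove a fraction of alignments to mimic database gaps."""
    n_keep = len(hits) - round(len(hits) * fraction)
    return rng.sample(hits, n_keep)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
contig = make_synthetic_contig(["AAAA", "CCCC", "GGGG"], ["TTTT"], rng)
kept = drop_hits(list(range(10)), 0.2, rng)
```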

We first evaluated WAAFLE’s ability to call genes by varying three parameters, subject

coverage, overlap, and gene length, during step 2 of the WAAFLE pipeline. We compared each

set of WAAFLE gene calls to the NCBI gene annotations. True positives were defined as the

number of NCBI genes with corresponding WAAFLE genes, while false positives were defined

as the number of NCBI genes that each corresponded to multiple WAAFLE genes

plus the number of WAAFLE genes with no corresponding NCBI genes. The true positive

rate (TPR) ranged from 0.691 to 0.841 while the positive predictive value (PPV) ranged from

0.955 to 0.994 (Fig. I-3). Overall, we found that lower overlap, increased subject coverage, and

increased gene length increased the PPV, with subject coverage and gene length having the

greatest effect. Since increasing the number of genes increases the potential of calling LGT, we

conservatively set gene calling parameters at 0.75 for subject coverage, 0.1 for overlap, and base

pairs (bp) for minimum gene length.
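
The gene-calling metrics can be computed as in the sketch below. It assumes that an NCBI gene split across several WAAFLE genes counts once as a true positive and once as a false positive, which is one possible reading of the definitions above. The function name and the example mapping are hypothetical.

```python
def gene_call_metrics(ncbi_to_waafle, unmatched_waafle):
    """Return (TPR, PPV) for a set of gene calls.

    ncbi_to_waafle: {ncbi_gene: [WAAFLE genes corresponding to it]}
    unmatched_waafle: number of WAAFLE genes with no corresponding NCBI gene
    """
    calls = list(ncbi_to_waafle.values())
    tp = sum(1 for c in calls if len(c) >= 1)   # NCBI genes recovered
    fn = sum(1 for c in calls if len(c) == 0)   # NCBI genes missed
    split = sum(1 for c in calls if len(c) > 1)  # NCBI genes called as several
    fp = split + unmatched_waafle
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return tpr, ppv


# Hypothetical mapping: one gene split in two, one gene missed, one spurious call.
mapping = {"n1": ["w1"], "n2": ["w2", "w3"], "n3": [], "n4": ["w4"]}
tpr, ppv = gene_call_metrics(mapping, unmatched_waafle=1)
```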

To evaluate LGT classification, we supplied WAAFLE with the WAAFLE-called genes

generated from the default parameters (mentioned above) and filtered for contigs containing at

least two genes. We then varied the one-taxon and two-taxon thresholds for step 4 of the

WAAFLE pipeline. Synthetic contigs with inter-species or above LGT events were considered

true positives when WAAFLE called LGT, and false negatives otherwise. Synthetic contigs with

inter-strain LGT events were considered true negatives if WAAFLE classified the contigs as

having no LGT, and false positives otherwise. The TPR ranged from 0.513 to 1 and false positive

rate (FPR) ranged from 0 to 0.111, where most false positives arose as a consequence of BLAST

hit removal (Fig. I-4). Higher one-taxon thresholds increased both TPR and FPR, while higher

two-taxon thresholds decreased both TPR and FPR (Fig. 2-2A, Fig. I-5). As the one-taxon

threshold increases, it becomes difficult to classify contigs as not having LGT, which leads to

more LGT calls and increases the number of true and false positives. In contrast, increases in the

two-taxon threshold make it difficult to classify contigs as having LGT, resulting in fewer true

and false positives. As such, we decided to set the one-taxon threshold at 0.5 and the two-taxon

threshold at 0.8. To evaluate organism calls, we examined the subset of correctly called LGT

contigs (true positives), and identified the taxonomic levels (kingdom through species) at which

WAAFLE correctly matches the reference taxa. WAAFLE often correctly annotated taxa down

to the family level, but did not always identify the correct genus or species (Fig. 2-2B).

Figure 2-2. WAAFLE parameter evaluation. Using the WAAFLE-called genes (using the default

gene calling parameters), we examined how the one-taxon (k1) and two-taxon (k2) thresholds

would affect A) LGT classification and B) taxonomic assignment. For the left half of the figure,

we set k2 at 0.8 while varying k1 from 0.1 through 0.9. For the right half of the figure, we set k1 at

0.5 while varying k2 from 0.1 through 0.9. A) Colors indicate k2, and the x-axis indicates the

taxonomic level difference between the donor and recipient genomes. For example, we observe

lower TPR and FPR for inter-species LGT. B) Colors indicate k1, and the taxonomic level at

which WAAFLE correctly identified an organism. For example, lower percentages are observed

for correctly calling a taxon at the species level.

Rates of novel LGT events across the human microbiome

We used WAAFLE to interrogate the expanded Human Microbiome Project (HMP1-II)

[174]: a dataset that includes 2,341 shotgun metagenomes sampled from 265 individuals at

diverse body sites at up to three time points (http://hmpdacc.org). For quality control, we

removed samples with poor assembly (with fewer than 1,000 gene calls across contigs) and

inconsistent taxonomic profiles (appeared as outliers in ordination analyses), and then filtered

out contigs that resembled mis-assemblies (described earlier, see Methods). We first set out to

develop a measure for LGT frequency, which would allow us to quantify LGT for taxon pairs

across body sites. Within a metagenomic assembly, each LGT event detected by WAAFLE is

likely to be i) unique; ii) novel, since the use of a reference database should exclude previously

detected LGT events; and iii) fixed in the population, since erratic events are likely not

assembled. Thus, an increase in LGT frequency (per sample) as measured by WAAFLE

represents an increase in unique LGT events, which can further be stratified by taxon pairs.

We generated two measures to quantify LGT frequency, which included i) gene

percentages (the number of genes in LGT contigs normalized by the total number of sample

genes) and ii) events per gene (the number of LGT contigs normalized by the total number of

sample genes). The two measures may not correspond due to differences in assembly: samples

with multiple short contigs may have low gene percentages and high events per gene, while

samples with LGT in a few long contigs may have high gene percentages and low events per

gene. Still, we found that both measures were highly correlated across body sites (Fig. I-6), thus

increases in either measure generally indicate higher LGT frequencies. We then used gene

percentages to determine if WAAFLE is reproducible across technical replicates. As expected,

LGT pairs were most similar between technical replicates, followed by intra-individual and

then inter-individual samples based on Jaccard and Bray-Curtis distances (Fig. I-7). Distances

for LGT pairs were much higher than those for single-taxon gene percentages, indicating that

similar taxonomic gene profiles still lead to highly variable LGT profiles. For the remainder of

the analyses, we used only assemblies unique to an individual, body site, and time point,

leaving 1,128 assemblies with 237 from stool, 208 from tongue dorsum, 191 from supragingival

plaque, 182 from buccal mucosa, 94 from anterior nares, and 89 from posterior fornix.
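
The two LGT frequency measures defined above can be computed per sample as in the following sketch. The contig tuples and the function name are hypothetical; the two toy samples are constructed to show how the measures can diverge when assemblies differ in contig length.

```python
def lgt_frequencies(contigs):
    """Compute both LGT frequency measures for one sample.

    contigs: list of (n_genes, has_lgt) tuples, one per contig.
    """
    total_genes = sum(n for n, _ in contigs)
    lgt_genes = sum(n for n, has_lgt in contigs if has_lgt)
    lgt_events = sum(1 for _, has_lgt in contigs if has_lgt)
    return {
        # Genes in LGT contigs, as a percentage of all sample genes.
        "gene_percentage": 100 * lgt_genes / total_genes,
        # LGT contigs per 1,000 sample genes.
        "events_per_1000_genes": 1000 * lgt_events / total_genes,
    }


# Sample with LGT in several short contigs versus one long contig.
short_contigs = [(2, True)] * 4 + [(2, False)] * 96
long_contig = [(40, True)] + [(2, False)] * 80
short_result = lgt_frequencies(short_contigs)
long_result = lgt_frequencies(long_contig)
```

Both toy samples contain 200 genes. The first yields a gene percentage of 4% with 20 events per 1,000 genes, while the second yields 20% with 5 events per 1,000 genes.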

LGT is an adaptive mechanism that may facilitate microbial survival and maintenance at

the individual or community level. Cataloguing high frequency LGT pairs identifies the

partners and genes each taxon has access to, and furthers understanding of their interactions.

We thus characterized high frequency LGT pairs across six body sites, and found that they

generally fell into three categories: those with high phylogenetic relatedness [61], large joint

abundances, and similar functions or niches (Fig. 2-3A). Pairs with closely related taxa included

Bacteroides with Parabacteroides (0.746% genes, average phylogenetic distance PD=1.02),

Odoribacter (0.0947%, PD=1.77), or Alistipes (0.260%, PD=1.95), all of which were found in the

stool and considered inter-family transfers, despite relatively short phylogenetic distances. In

contrast, some taxa with high abundances transferred regardless of phylogenetic distance,

including Lactobacillus and Gardnerella (0.137%, PD=8.34) in the posterior fornix, and

Corynebacterium and Propionibacterium (0.0522%, PD=3.14) in the anterior nares. Lastly, some

taxa pairs have overlapping functions or niches. Eubacterium and Roseburia (0.0637%, PD range

0.81 to 6.44) in stool are both butyrate producers that decrease in abundance with lower intake

of carbohydrates [176, 177]. Oral taxa have close physical proximity through biofilms; one

example includes a corncob structure found in supragingival plaque consisting of

Corynebacterium and Streptococcus, with an outer ring of Haemophilus and Aggregatibacter [178],

which may explain high frequency transfers for each pair, but not across the two pairs in both

buccal mucosa and supragingival plaque.

Figure 2-3. LGT rates are highest for oral and stool sites. A) For each body site, we display

LGT between the ten genera with the highest gene percentages via heatmaps. Each row and

column represents a single genus and off-diagonal cells represent LGT gene percentages. Colors

indicate the number of genes for the row taxa in LGT contigs involving both row and column

taxa divided by the total genes per sample, averaged across body site, resulting in an

asymmetrical matrix. The histogram above each heatmap shows the average number of genes

per sample across body sites. B) Each point represents one sample in the body site. LGT

frequencies on the y-axis are measured as the number of LGT contigs divided by the total

number of sample genes, plotted on a log2 scale per 1000 genes.

Different environments may also facilitate or hinder LGT. This is evident in the patterns

we see for the six different body sites: taxa seem to transfer indiscriminately in the stool and

oral sites, but appear more selective in the anterior nares and posterior fornix (Fig. 2-3A). We

therefore investigated whether differences in overall LGT frequency are attributable to body

site. To do this, we calculated the overall extent of LGT in each body site using events per gene.

LGT frequencies were highest in the stool (median m=2.898 events per 1,000 genes), followed

by multiple oral sites, including the supragingival plaque (m=2.134), tongue dorsum (m=2.129),

and keratinized gingiva (m=1.799). Frequencies were lowest in the vaginal and skin sites (Fig. 2-3B).

To further understand how technical and biological effects might affect LGT rates, we

performed a linear regression using events per gene as the dependent variable, and technical

and biological effects as the explanatory variables. Technical effects included the number of

contigs per sample and contig size (genes/contigs), while biological effects included genus

richness, genus evenness, and body site. Significant predictors of LGT frequency included

body site (p<2e-16), average contig size (p=2e-16, positive coefficient), and species evenness

(p=2e-16, positive coefficient). These observations indicate that sites with high LGT rates i) are

mucosal and ii) have higher alpha diversity, in which evenness plays a larger role than richness.

LGT frequency and pair formation are shaped by abundance and phylogeny

We next set out to characterize the overall effect of phylogeny and taxon abundance on

LGT frequencies. To this end, we calculated phylogenetic distances and joint abundances for

each LGT taxon pair, and estimated how well each of these variables predicted LGT gene

percentages using a nonparametric generalized additive model smoother. Phylogenetic distance

was calculated by measuring branch length between two taxa in the PhyloPhlAn phylogenetic

tree [50], which represents the average number of nucleotide substitutions between two taxa.

Joint abundance was calculated by multiplying one taxon’s abundance by the other: taxon

abundance was quantified as the total number of genes for a single taxon (across all contigs

regardless of LGT status) divided by the total number of genes per sample, averaged across a

body site. We observed an increase in LGT gene percentages at low phylogenetic distances (Fig.

2-4A), and an increase in LGT gene percentages as joint abundances increase (Fig. 2-4B). The

former suggests that species level LGT events fix in the population more often than higher level

LGT events. Phylogeny is known to affect LGT: closely related partners have shared DNA

composition and transcriptional/translational machinery, allowing them to successfully

integrate and express transferred genes [61]. The latter suggests that taxonomic abundance

leads to increased transfer opportunities and thus higher rates irrespective of phylogenetic

distance.
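
The joint abundance calculation described above can be sketched as follows, assuming per-sample gene counts are available for each taxon. The sample dictionaries and function names are hypothetical.

```python
def taxon_abundance(samples, taxon):
    """Average fraction of a sample's genes assigned to a taxon.

    samples: list of {taxon: gene_count} dicts, one per sample in a body site,
    counting all contigs regardless of LGT status.
    """
    fractions = [s.get(taxon, 0) / sum(s.values()) for s in samples]
    return sum(fractions) / len(fractions)


def joint_abundance(samples, taxon_a, taxon_b):
    """Product of the two taxa's average abundances."""
    return taxon_abundance(samples, taxon_a) * taxon_abundance(samples, taxon_b)


# Hypothetical gene counts for two samples from one body site.
samples = [
    {"Bacteroides": 50, "Alistipes": 10, "other": 40},
    {"Bacteroides": 30, "Alistipes": 30, "other": 40},
]
joint = joint_abundance(samples, "Bacteroides", "Alistipes")
```

Here the average abundances are 0.4 and 0.2, giving a joint abundance of 0.08.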

Figure 2-4. Both abundance and phylogeny affect LGT rates. For both plots, each point

represents a taxa pair, and smoothing functions are fit by a generalized additive model using

cubic splines. Only taxa pairs annotated to at least the genus level are included, and taxa pairs

found in a single sample (across body sites) are colored in gray. All other pairs are colored by

inter-taxon LGT level (e.g., inter-species LGT pairs are red). In A), the x-axis displays the

phylogenetic distance between the two taxa, while the y-axis shows the LGT gene percentages,

or the average number of LGT genes in a taxa pair divided by total number of genes in a

sample. In B), the x-axis shows the joint abundance, and the y-axis is the same as A). Joint

abundances are calculated by multiplying one taxon’s gene percentage against another taxon’s

gene percentage. Colors are the same as A).

We further examined how phylogeny and taxon abundances influence LGT pair

formation, regardless of LGT rate. For phylogenetic distances, we observed that the HMP LGT

pairs form bi-modal distributions across body sites. This distribution may indicate selective pair

formation at specific distances, or reflect taxonomic bias in NCBI reference genomes. To

distinguish between these two hypotheses, we compared the phylogenetic distance distribution

from HMP LGT pairs to the distribution from randomly generated LGT pairs. We observed that

the phylogenetic distance distributions are significantly different via the Kolmogorov-Smirnov

test: randomly generated pairs have on average larger phylogenetic distances than HMP

LGT pairs (Fig. I-8A), indicating that LGT preferentially occurs between closely related species.

We repeated this analysis for LGT joint abundances, and found that randomly generated taxon

pairs had higher joint abundances than HMP LGT pairs (Fig. I-8B). This suggests that

LGT pair formation occurs more often than expected between rare taxa, which may be

supported by the physical structure and community organization of microbial communities.
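
The comparison between observed and randomly generated pairs relies on the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. A minimal sketch of the statistic follows. The distance values are hypothetical and no p-value is computed; in practice a library routine such as `scipy.stats.ks_2samp` could be used instead.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The maximum is attained at one of the observed values.
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Hypothetical phylogenetic distances for observed and random taxon pairs.
observed = [1.0, 2.0, 3.0, 4.0]
random_pairs = [3.0, 4.0, 5.0, 6.0]
d = ks_statistic(observed, random_pairs)
```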

Genera have preferred transfer partners that are shared across similar sites

Individual taxa may vary in partner choice: some may be promiscuous, while others are

more selective. We can identify these preferences by representing LGT pairs as a network, in

which nodes are genera and edges are unique LGT events. We generated networks for each of

the six body sites, and then calculated degree for every node (genus) along with the percentage

of genes found in LGT events involving that node (Fig. 2-5A). As expected, genera with higher

frequencies of LGT also have large numbers of partners. Interestingly, the majority of LGT

events for these genera was accounted for by a small number of partners: for example, 90% of

genes for LGT events involving Streptococcus are transferred with 11 (out of 57), 22 (out of 91),

and 19 (out of 88) genera in the buccal mucosa, supragingival plaque, and tongue dorsum,

respectively. Still, we attempted to identify taxa that i) had a larger number of partners and

relatively large number of preferred partners, and ii) had a larger number of partners and

relatively small number of preferred partners. The former represent more promiscuous taxa,

while the latter may be more selective. The former category included Streptococcus (Fig. 2-5B),

Actinomyces, Veillonella, and Haemophilus in the oral sites, as well as Clostridium and

Faecalibacterium in stool. The latter category included Aggregatibacter in the oral sites (Fig. 2-5B),

Bacteroides in the stool, and Corynebacterium and Propionibacterium in the anterior nares and

supragingival plaque. LGT may not be as advantageous for these latter taxa, which might lead

to limited transfer abilities.
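
The distinction between promiscuous and selective genera can be made concrete with a small sketch that computes, for one genus, its total number of partners (node degree) and the number of partners needed to account for 90% of its LGT genes. The gene counts and the function name are hypothetical.

```python
def partner_summary(partner_genes, fraction=0.9):
    """Return (degree, partners needed to explain `fraction` of LGT genes).

    partner_genes: {partner_genus: genes in LGT events with that partner}
    """
    total = sum(partner_genes.values())
    covered = 0
    needed = 0
    # Add partners from largest to smallest contribution until the
    # requested share of LGT genes is covered.
    for count in sorted(partner_genes.values(), reverse=True):
        covered += count
        needed += 1
        if covered >= fraction * total:
            break
    return len(partner_genes), needed


# Hypothetical genera: one spreads transfers widely, one concentrates them.
promiscuous = {"Gemella": 60, "Veillonella": 25, "Prevotella": 10, "Rothia": 5}
selective = {"Haemophilus": 95, "Neisseria": 3, "Rothia": 2}
promiscuous_summary = partner_summary(promiscuous)
selective_summary = partner_summary(selective)
```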

Figure 2-5. Taxa degree and differential edges. A) Across the six body sites, we compared the

total number of LGT partners for a given genus against the number of partners needed to

explain 90% of genes in LGT transfers. Points are colored by the number of genes in LGT events

involving the given genus normalized by total sample genes, which are averaged across

samples and log2 normalized. Taxa in the upper right corner are more promiscuous: these

genera have many partners and need more partners to explain transfer; while taxa in the lower

right corner are more selective: they have the ability to transfer with multiple taxa but mostly

transfer with a few. Several genera are designated by letters as shown in B). B) We show an

example of a promiscuous taxon, Streptococcus, along with a selective taxon, Aggregatibacter. The

x-axis displays body site, while the y-axis is the gene percentage for LGT pairs, proportionally

scaled to the square root of the total sum gene percentage. C) Arc diagrams display directional

transfers in the buccal mucosa, supragingival plaque, and tongue dorsum. Solid black circles

represent genera, and size indicates average taxon gene percentages for the corresponding site.

Arcs indicate directional transfer between two circles in a counterclockwise fashion: arcs above

two circles indicate donation of genes from the right node to the left node, and vice-versa for

arcs under the two nodes. Arc width indicates the average number of LGT contigs with that

direction normalized by total number of genes per sample. Arcs colored in blue are found in all

three oral sites, while arcs in red are found in two oral sites.

We next investigated which LGT events were shared across multiple body sites. The

networks for each site consisted of anterior nares (nodes=61, edges=166), posterior fornix (n=85,

e=342), and buccal mucosa (n=130, e=890), which had fewer nodes and edges than stool (n=174,

e=2698), supragingival plaque (n=242, e=2812), and tongue dorsum (n=188, e=2898). Across all

six sites, only 3 edges were shared, including Bacteroides and Parabacteroides, Bacteroides and

Capnocytophaga, as well as Peptoniphilus and Streptococcus, while 2212 edges were unique to one

site. This is not surprising: the six sites have distinct taxonomic compositions, along with

different environments and selective pressures, which lead to different LGT pairs. We therefore

focused on the intersection of the three oral networks, which shared 308 pairs, of which 232

pairs were not found in non-oral sites. Some oral pairs were found at differential frequencies

across sites: for example, Streptococcus (degree=49) had a higher percentage of transfers with

Gemella, Capnocytophaga, and Prevotella in the buccal mucosa, supragingival plaque, and tongue

dorsum, respectively (Fig. 2-5B). Despite consistent partners in the oral sites, some of these

genera had completely different partners in non-oral sites. For example, Streptococcus paired

mostly with Lactobacillus in the posterior fornix and Dolosigranulum in the anterior nares.

Continuing our focus on the oral sites, we looked to see if oral taxon pairs might have

preferences in transfer directionality, or if one taxon consistently donates or receives genes from

its partner. We assigned directionality to LGT contigs with outer genes annotated as one taxon

(designated as the recipient), and inner genes annotated as a different taxon (designated as the

donor). We quantified events per gene for directional LGT pairs, and filtered for pairs found in

at least 10% of samples across each site. For each pair, we took the maximum directional LGT

frequency across oral sites, and then selected for pairs in the 75th percentile or above. We then


plotted all edges associated with the 21 genera in the selected pairs (Fig. 2-5C). Across all three

oral sites, Streptococcus, Veillonella and Pasteurella preferentially donated to Haemophilus, Rothia

and Aggregatibacter preferentially donated to Neisseria, and Simonsiella preferentially donated to

Eikenella. Other transfers have no donor or recipient preference: these include Gemella or

Granulicatella with Streptococcus, as well as Neisseria and Haemophilus. We hypothesized that

recipients may be the more abundant taxon (as compared to donors) within the community.

Although recipients often made up a larger portion of the contig in which they are found, they

were not consistently the more abundant taxon. Furthermore, some directional transfers were

site-specific, indicating that environment may also facilitate donor/recipient dynamics.
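
The directionality rule described above can be sketched in Python. This is a minimal illustration and not WAAFLE's implementation: the function name `infer_direction` and the input (a list of per-gene taxon labels in contig order) are assumptions made for this sketch, as is the requirement that the donor genes form a single inner block flanked on both sides by recipient genes.

```python
from itertools import groupby


def infer_direction(gene_taxa):
    """Return (donor, recipient) for a contig, or None if no direction is assigned.

    gene_taxa lists the taxon assigned to each gene, in contig order.
    """
    # Collapse consecutive genes with the same taxon into runs.
    runs = [taxon for taxon, _ in groupby(gene_taxa)]
    # Assumed rule: one inner block of a second taxon, flanked on both
    # sides by the same outer taxon (recipient, donor, recipient).
    if len(runs) == 3 and runs[0] == runs[2]:
        return runs[1], runs[0]
    return None


print(infer_direction(["Haemophilus", "Neisseria", "Haemophilus"]))
```

Contigs with only two genes, or with the second taxon at a contig end, return `None` under this rule, which is consistent with direction being assignable for only a subset of LGT contigs.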

Mobile elements and TonB receptors are enriched in LGT contigs

Laterally transferred gene functions have been shown to be i) for adaptation, rather than

for information storage [68], and ii) the outer component of an interaction network (such as a

signaling or metabolic pathway), as opposed to a central component [69, 70]. We aimed to

determine if such trends persist for novel LGT events in the HMP1-II metagenomes. To do this,

we searched for gene functions enriched in LGT contigs. We quantified the number of UniRef90

terms from LGT and non-LGT contigs, aggregated them into Pfam clans [179], and performed

Fisher’s Exact Test to identify Pfam clans associated with LGT contigs (as compared to all

contigs). Enriched and depleted Pfam clans could be divided into 5 groups: i) DNA-binding

proteins such as transposases, ribonucleases, exo/endonucleases; ii) mobile elements including

phage, plasmids and toxin/antitoxin systems; iii) specific enzymes such as GMP synthase and

the FMN-binding split barrel superfamily, the latter mostly consisting of

pyridoxine/pyridoxamine 5'-phosphate oxidase; iv) transport systems including ABC


transporters and TonB-dependent receptors; and v) antibiotic resistance genes (ARGs)

(Fig. 2-6A). As expected, groups i) and ii) were enriched across most body sites, with the exception of

plasmid toxin-antitoxin systems, which were enriched in the oral sites, as well as the NUMOD4

motif, which is part of an endonuclease found in Bacteroides [180], and was enriched only in

stool. Groups iv) and v) contained mixed results: inner membrane transport proteins, such as

ABC transporter permeases, were depleted, while outer membrane beta-barrel proteins and

TonB-dependent receptors were enriched.
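
For readers unfamiliar with the test, the enrichment calculation for a single Pfam clan can be sketched as below. This is an illustration under stated assumptions rather than the analysis code used here: the 2x2 table layout (clan genes versus all other genes, in LGT versus non-LGT contigs), the function names, and the example counts are hypothetical, and the normalization applied to the odds ratios in Fig. 2-6A is not reproduced. Only the Python standard library is used.

```python
from math import comb, log2


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    tail = sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-7))
    return min(1.0, tail)


def log2_odds_ratio(a, b, c, d):
    """log2 odds ratio; assumes no zero cells. Positive means enriched in LGT contigs."""
    return log2((a * d) / (b * c))


# Hypothetical counts: a = clan genes in LGT contigs, b = clan genes in non-LGT
# contigs, c = other genes in LGT contigs, d = other genes in non-LGT contigs.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
effect = log2_odds_ratio(8, 2, 1, 5)
```

With metagenome-scale counts the exhaustive sum is slow, so a statistics library would normally be used in practice.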


Figure 2-6. Enriched functions show taxon and structural similarities across sites. A) We

searched for Pfam clans enriched and depleted in LGT contigs by aggregating UniRef90 terms

for Fisher’s Exact Test. Each cell within the heatmap is colored by the log2 normalized odds

ratio, in which a positive value indicates enrichment of the Pfam clan in LGT contigs, whereas a

negative value indicates depletion of the Pfam clan in LGT contigs. B) We counted the number

of genes for UniRef90 annotations in enriched Pfam clans, specifically plasmid-related genes,

transcriptional regulators, TonB receptors, and ISNme transposases (from left to right). The x-

axis is labeled using two color bars: the first bar indicates the UniRef90 annotation, while the

second bar indicates the body site; colors for the latter correspond to A). The y-axis displays the

number of genes found in LGT genes stratified by genus, and is proportionally scaled to the

square root of the total number of LGT genes. C) We show a single contig containing a Neisseria

and Haemophilus LGT event in the buccal mucosa. From top to bottom, we first show a graph in

which the x-axis is the length of the contig and the y-axis is the taxon score. Arrows represent

aggregated BLAST hits; those in red are for genus Neisseria, while those in blue are for

Haemophilus. Below, we display the called genes and their assigned UniRef90 functions. Lastly,

we examined other oral sites and searched for contigs with Neisseria-Haemophilus LGT transfers

with the UniRef90 term E3D293. These contigs are colored by UniRef90 function and labeled

with sample number; many share synteny with the example contig.


We next examined groups with potential adaptive functions, including group iii) with

GMP synthase and pyridoxine/pyridoxamine 5'-phosphate oxidase, and group v) ARGs. GMP

synthase and pyridoxine/pyridoxamine 5'-phosphate oxidase are likely LGT markers rather

than transferred functions: GMP synthase is hypothesized to be part of integration sites [181],

and has been found at the 3’ end of integrative and conjugative elements in Staphylococcus

aureus, Listeria monocytogenes, Clostridium perfringens, and Enterococcus faecalis, which are four

Gram-positive bacteria with low GC content [182]. Pyridoxine/pyridoxamine 5'-phosphate

oxidase may also be a LGT marker: manual examination of LGT contigs with this function

revealed that it is frequently found in conserved regions nearby transferred genes. Surprisingly,

antibiotic resistance (ABR) was depleted outside of the VOC superfamily, which may contain

glyoxalase and bleomycin resistance genes. Depletion may be due to the lack of selection for

ABR in healthy human subjects, as well as WAAFLE’s inability to detect ABR LGT events

already present in the reference database. Still, many enriched genes within the helix-turn-helix

binding proteins (CL0123) are from TetR, AraC, and MerR family transcriptional regulators, of

which the first and last may regulate tetracycline and mercury resistance, respectively.

AraC was associated with LGT for iron acquisition regions in cheese (Fig. 2-6B) [183].

Lastly, we searched for functions that were specific to taxa. To do this, we extracted

UniRef90 terms from significant Pfam clans, and determined which taxa they were derived

from. Examples include the ISNme transposase (CL0219), which was found almost exclusively

in Neisseria across oral sites; the plasmid recombination enzyme (CL0169), which was mostly in

Streptococcus and Prevotella in oral sites, but spread across Bacteroides, Parabacteroides, Alistipes,

and Clostridium in stool; and the TonB-dependent receptor plug and TonB-linked outer


membrane protein SusC/RagA family (PF07715, CL0193, CL0287), which was found mostly in

Capnocytophaga and Prevotella across oral sites, Prevotella in anterior nares and posterior fornix,

and Bacteroides, Parabacteroides, and Alistipes in stool (Fig. 2-6B). We looked specifically at LGT

contigs containing ISNme transposons, which occurred almost exclusively between Neisseria

and Haemophilus. These contigs contained a conserved structure across oral sites in multiple

samples (Fig. 2-6C).

Discussion

LGT is a strong evolutionary force: assuming one LGT event for every 10^10 vertical

replications, no gene in any modern genome can be linked to the last universal common

ancestor (LUCA) through vertical descent [15]. Most studies and computational tools have

focused on whole genomes, which makes characterization of LGT within microbial

communities particularly challenging. First, the use of reference genomes removes the microbial

community context (i.e. the genome is obtained from culture rather than the community).

Second, the assembly of complete genomes from microbial communities is experimentally and

computationally challenging, requiring either low diversity communities [157] or single cell

genomics [67]. We addressed both limitations by developing WAAFLE, which detects novel

LGT events directly from partially assembled metagenomes. With this, we can begin to ask i)

whether novel LGT events consistently occur in microbial communities, ii) which biological

factors affect LGT frequency, and iii) which taxa and functions are exchanged. In our validation

with synthetically generated LGT events, WAAFLE performed solidly with high true positive

rates for LGT detection and taxonomic assignment. We then applied WAAFLE to the Human

Microbiome Project 1 Phase II and quantified LGT frequencies across multiple body sites.


Increased LGT frequencies were associated with overall community trends such as greater

community evenness and body sites (stool and oral), as well as individual taxon pairs with

higher community abundances and small phylogenetic distances. We also observed that mobile

genetic elements and outer membrane proteins were enriched in LGT contigs. Overall, this

demonstrates that WAAFLE can generate biological insights using existing metagenomic

data.

It is important to consider the biological interpretation for LGT frequency, which

depends on i) the data from which LGT is detected and ii) the quantification method. In

WAAFLE specifically, the use of metagenomes means that each detected LGT event is unique to

a sample and fixed in the population, while the use of a reference database means that each

detected event should not have been previously characterized in reference genomes. Strikingly,

our study detected multiple LGT events in six major body sites, demonstrating that LGT is an

ongoing process in which events continuously fix in microbial populations. We next quantified

LGT frequencies as the i) number of LGT contigs per gene and ii) number of genes in LGT

contigs per gene. We hypothesized that higher LGT frequencies as detected by WAAFLE were

likely caused by an increased number of unique taxon pair combinations and/or increased

fixation rates. Indeed, across all six body sites, we found that higher community evenness,

along with larger taxonomic abundances and smaller phylogenetic distances between taxon

pairs, led to increased LGT frequency. With this, we propose that LGT occurs universally

between taxa, in which greater community evenness increases the number of unique taxon pair

combinations, and higher joint taxonomic abundance increases the probability of exchange.

Fixation of events is then limited by factors such as phylogenetic distance.


WAAFLE has several limitations that should be taken into account. First, WAAFLE is

ultimately affected by the quality of the metagenome assembly, which is in turn influenced by

biological factors such as community evenness and richness. Consequently, LGT frequencies were

difficult to compare across sites: the posterior fornix had fewer contigs, and had close to the

highest or lowest frequencies across body sites depending on the measure used. Samples with

longer contig lengths (gene to contig ratio) also tended to have increased LGT frequency,

especially those in gut and oral sites, though vaginal sites were not affected due to low

community diversity. Second, WAAFLE’s parameters are tuned to be conservative with LGT

calls (minimizing false positives). As such, WAAFLE underestimates LGT events, especially for

inter-genus and inter-species LGT events, where LGT is most likely to occur. WAAFLE is

also unable to detect strain-level LGT events, as the default reference database is annotated to

the species level. Third, WAAFLE lacks the ability to infer donor and recipient for most events.

This study briefly identified donor and recipient taxa across oral sites based on taxon gene

order within contigs, but did not find consistent relationships between genera. A more focused

characterization of donor and recipient taxa using phylogenetic trees may reveal whether

specific taxa are prone to donating or receiving genes, and distinguish between transferred and

non-transferred gene functions (as opposed to genes on contigs with or without LGT).

Despite these limitations, WAAFLE allowed us to identify patterns for specific taxa and

LGT-enriched functions. We found that most taxa across sites were relatively selective about

their partners, even if they had the ability to transfer with multiple other taxa. For example,

promiscuous taxa such as Streptococcus transfer with many genera, while taxa such as

Aggregatibacter primarily transfer with Haemophilus. We also found that metagenomically-


enriched LGT functions included mobile genetic elements such as transposons, phage, and

plasmids as well as outer membrane proteins, suggesting that 1) LGT events involving mobile

elements are ongoing and relatively frequent as compared to transfer of other genes, and 2)

mobile elements are pangenome-specific and do not ameliorate. These two points are illustrated

in an example showing a Neisseria and Haemophilus transfer, in which the majority of the contig

consists of Haemophilus genes with a single gene matching a Neisseria-specific ISNme

transposase. This event is consistently detected across samples from the buccal mucosa,

supragingival plaque, and tongue dorsum, showing that certain LGT events may be prevalent

across individuals and taxon-pair specific.

We anticipate that future work will include generation of new measures for LGT

frequency, improved detection of donor and recipient taxa, and further investigation of specific

functions and taxa. WAAFLE, as is, detects novel (not in reference genomes), recent (events

without amelioration), and fixed LGT events. Our ability to find these events enables us to i)

determine the timescale at which these events occur, through the use of time-series data; ii)

quantify the proportion of the microbial or human population that contains specific events, in which

LGT sweeps might correspond with strain sweeps [184]; and iii) identify environmental factors

that might influence LGT frequencies and transferred functions, through the use of case-control

studies. Improved classification of donor and recipient taxa may facilitate discovery of

transferred metabolic functions, which may be taxon-pair specific and were not detectable

across body sites. Unlike findings based on reference genomes [185], which can infer donor and

recipient, we did not see enrichment for antibiotic-resistance genes in LGT events. This may

be due to the use of a healthy cohort, rather than one taking antibiotics. More work is needed to


quantify the frequency and characteristics of novel LGT events in microbial communities, as

well as the variation in transferred functions in different cohorts and in response to selective

pressures. WAAFLE represents a step forward in characterizing LGT directly from microbial

communities, which will ultimately enable us to understand the roles of LGT for adaptation or

speciation in microbial communities.

Methods

Datasets

Metagenomic datasets used in this study were produced through the Human

Microbiome Project Phase 1-II [174]. The HMP data are publicly available through the HMP’s

public data repository (http://www.hmpdacc.org/). Contigs were assembled via IDBA-UD [186].

The pangenome reference database was generated by downloading NCBI isolate genomes,

binning isolate genes by species, and then clustering binned species genes at 97% nucleotide

identity [187].

Detecting LGT events from metagenomic shotgun sequencing datasets

WAAFLE takes one required input, i) contigs assembled from metagenomic data in

FASTA format, and two optional inputs, ii) gene calls for each contig in Generic Feature Format version 3

(GFF3), and iii) a nucleotide reference database of genes with taxonomic and functional

annotations. WAAFLE uses four steps to classify each contig as having LGT or not:

1. Contigs are searched against the ChocoPhlAn pangenome reference database

(https://bitbucket.org/biobakery/humann2/wiki/Home) using BLASTN default

parameters.


2. If gene calls were not supplied, contigs are annotated with genes using overlapping

BLASTN alignments.

3. Within a contig, each gene is assigned multiple taxon scores. BLASTN hits are grouped

by taxonomic annotation and gene overlap. We then calculate a score from each group

using BLASTN hit percent identity and subject coverage.

4. Each contig is classified as “No LGT”, “LGT”, or “ambiguous” by examining whether all

genes across a contig are best explained by one taxon, two, or multiple, respectively (Fig.

2-1).

These steps can be tuned using 5 parameters: subject coverage (s), overlap percentage

(o), gene length (l), one taxon score (k1), and two taxon score (k2) (Table I-1). We describe each

step in detail below.

Step 2: Calling genes.

If gene calls are not supplied, we combine overlapping BLAST hits to call genes. BLAST

hits are first filtered by a subject coverage cutoff (s, default 0.75), where subject coverage is defined as the

percentage of the reference gene (subject sequence) that aligned to the contig (query sequence).

For hits that aligned to contig ends, it is not possible for the full gene to align to the contig. We

thus calculated subject coverage by dividing the alignment length by the subject gene length

that can potentially align to the contig. Specifically, we subtracted the length of the subject gene

that ran off the contig from the total subject gene length.
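
The contig-end adjustment can be sketched as follows. This is a minimal illustration, not WAAFLE's code: the function name, the 1-based inclusive coordinate convention, the restriction to forward-strand hits, and the use of the aligned subject span as the alignment length (ignoring gaps) are all assumptions of the sketch.

```python
def adjusted_subject_coverage(s_start, s_end, s_len, q_start, q_end, contig_len):
    """Subject coverage of a hit, discounting gene sequence that runs off the contig.

    s_* describe the reference gene (subject); q_* describe the contig (query).
    """
    # Subject bases left unaligned on each side of the hit.
    unaligned_left = s_start - 1
    unaligned_right = s_len - s_end
    # Contig bases available on each side of the hit.
    room_left = q_start - 1
    room_right = contig_len - q_end
    # Subject bases that cannot align because the contig ends first.
    overhang = max(0, unaligned_left - room_left) + max(0, unaligned_right - room_right)
    alignable = s_len - overhang
    return (s_end - s_start + 1) / alignable


# A 1000 bp gene whose last 600 bp align to the start of a 600 bp contig:
# uncorrected coverage would be 0.6, corrected coverage is 1.0.
print(adjusted_subject_coverage(401, 1000, 1000, 1, 600, 600))
```

Without the correction, genes truncated by contig boundaries would be penalized by the subject coverage filter even when every alignable base matched.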

The filtered BLAST hits are then sorted by length and sequentially assigned to groups

based on overlap percentage. Hits and groups may be considered nucleotide fragments: overlap


percentage is calculated between two nucleotide fragments by dividing the length of the

overlap between the two fragments by the length of the shorter fragment. Specifically, each

BLAST hit is added to a group if its overlap percentage with any existing group is at least o (default 0.1);

otherwise, a new group is created. After all BLAST hits have been

considered, each group is considered a gene, and the start and end sites are calculated as the

minimum start and maximum end of all BLAST hits encompassed (in the group). The resulting

genes are further filtered by length (l, default 200 bp).
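
The grouping procedure above can be sketched as a greedy interval merge. The sketch makes several assumptions that the text does not specify: hits are processed from longest to shortest, a hit is compared against each group's current span (rather than against individual member hits), and a hit joins the first group it overlaps sufficiently. The function names and the `(start, end)` tuple representation are hypothetical.

```python
def overlap_fraction(a, b):
    """Overlap length between two fragments divided by the length of the shorter one."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return max(0, shared) / shorter


def call_genes(hits, min_overlap=0.1, min_length=200):
    """Group filtered BLAST hits, given as (start, end) on the contig, into genes."""
    groups = []
    # Assumed order: longest hits are placed first.
    for hit in sorted(hits, key=lambda h: h[1] - h[0], reverse=True):
        for group in groups:
            if overlap_fraction(hit, group) >= min_overlap:
                # Extend the group to the minimum start and maximum end.
                group[0] = min(group[0], hit[0])
                group[1] = max(group[1], hit[1])
                break
        else:
            groups.append(list(hit))
    # Each group is a gene; drop genes shorter than the length cutoff.
    return sorted((s, e) for s, e in groups if e - s + 1 >= min_length)


hits = [(1, 1000), (50, 900), (950, 1100), (2000, 2500), (3000, 3100)]
print(call_genes(hits))
```

In this example the first three hits merge into one gene spanning positions 1 to 1100, and the 101 bp hit is discarded by the length filter.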

Step 3: Assign taxon scores to genes.

To assign taxon scores, WAAFLE combines the BLASTN results from step 1 and gene

annotations called from step 2 or supplied by the user. First, WAAFLE bins all BLAST hits (s,

default 0) to genes if they have overlap greater than o (o, default 0.1); it is possible to assign a

single hit to multiple genes. The top UniRef term across all BLAST hits assigned to a gene is

then annotated as the gene function. Second, for each gene in a contig, WAAFLE further groups

the binned BLASTN hits based on taxonomic annotation, which can be performed at different

taxonomic levels (such as kingdom, phylum, class, etc). Each BLAST hit within the group is

scored by multiplying its percent identity by its subject coverage. For each nucleotide position

within the gene, we allot the maximum score across grouped BLAST hits, or if there were no

BLAST hits at that position, allot a score of 0. This results in a vector of scores per taxon per

gene, which we average for a single taxon score. Once each taxon has been scored at each gene,

each contig can then be represented by a table, S, with N rows (representing taxa) and M

columns (representing genes).
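
The per-gene score for a single taxon can be sketched as below. This is an illustrative sketch only: the function name and tuple layout are hypothetical, and percent identity and subject coverage are assumed to be expressed as fractions between 0 and 1.

```python
def taxon_score(gene, hits):
    """Average per-position score of one taxon's BLAST hits across one gene.

    gene: (start, end) on the contig, 1-based inclusive.
    hits: iterable of (start, end, identity, subject_coverage) for a single taxon.
    """
    g_start, g_end = gene
    # Positions without any hit keep a score of 0.
    per_base = [0.0] * (g_end - g_start + 1)
    for h_start, h_end, identity, coverage in hits:
        score = identity * coverage
        # Only the part of the hit inside the gene contributes.
        for pos in range(max(g_start, h_start), min(g_end, h_end) + 1):
            idx = pos - g_start
            per_base[idx] = max(per_base[idx], score)
    return sum(per_base) / len(per_base)


# Two overlapping hits: positions 1-50 score 1.0 and positions 51-100 score 0.9.
print(taxon_score((1, 100), [(1, 50, 1.0, 1.0), (26, 100, 0.9, 1.0)]))
```

Repeating this for every taxon and every gene on a contig fills the table S used in Step 4.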


Step 4: LGT classification and taxonomic annotation of contigs.

Only contigs with more than 1 gene and more than 1 taxon are considered for LGT. To

search for LGT, we loop through seven taxonomic levels, starting at the species level and

ending at the kingdom level. The loop is terminated if the contig is i) classified as containing

LGT or not containing LGT, and ii) assigned a single taxon pair or taxon, respectively. At each

taxonomic level, we perform step 3, in which each contig is represented as table S, where each

entry $S_{ij}$ contains taxon $i$'s score for the $j$th gene.

Using this table, we define $O(i) = \min_j S_{ij}$ as taxon $i$'s worst single-gene score, and

$C(i, i') = \min_j \max(S_{ij}, S_{i'j})$ as the worst single-gene score for the combination of taxa $i$ and $i'$. If

$\max_i O(i)$ is larger than the one taxon score threshold (k1, default 0.5), then one taxon explains

the entire contig. If $\max_i O(i)$ < k1 and $\max_{i,i'} C(i, i')$ is larger than the two taxon score threshold

(k2, default 0.8), then $i$ and $i'$ jointly explain the contig, indicating LGT between taxa $i$ and $i'$. If

neither the k1 nor the k2 threshold is met, the contig is annotated as “ambiguous”. If the contig is annotated as

“ambiguous”, the loop continues to a higher taxonomic level. If the contig is determined to

contain no LGT or LGT, WAAFLE performs taxonomic assignment.
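
The decision rule at a single taxonomic level can be sketched as follows. This is a minimal illustration and not WAAFLE's implementation: the function name and the dictionary representation of table S are assumptions, ties between taxa or pairs are not handled, and a score exactly equal to k1 (a case the text leaves unspecified) falls through to the two-taxon check.

```python
from itertools import combinations


def classify_contig(S, k1=0.5, k2=0.8):
    """Classify one contig from its score table S (taxon -> list of per-gene scores)."""
    # O(i): the worst single-gene score of each taxon.
    one_taxon = {taxon: min(scores) for taxon, scores in S.items()}
    best_taxon = max(one_taxon, key=one_taxon.get)
    if one_taxon[best_taxon] > k1:
        return "No LGT", (best_taxon,)

    # C(i, i'): the worst single-gene score when each gene takes the better of two taxa.
    two_taxon = {
        (t, u): min(max(a, b) for a, b in zip(S[t], S[u]))
        for t, u in combinations(S, 2)
    }
    if two_taxon:
        best_pair = max(two_taxon, key=two_taxon.get)
        if two_taxon[best_pair] > k2:
            return "LGT", best_pair
    return "ambiguous", ()


# Hypothetical three-gene contig: the middle gene matches a different genus.
S = {"Haemophilus": [0.95, 0.10, 0.90], "Neisseria": [0.05, 0.98, 0.00]}
print(classify_contig(S))
```

In the example neither genus explains all three genes alone (the best worst-gene score is 0.10), but the pair explains every gene with a score of at least 0.90, so the contig is called as LGT.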

Taxonomic assignment is performed as follows: if the contig is determined to contain no

LGT, the contig is annotated with the taxon $i$ that attains $\max_i O(i)$. If multiple taxa have scores

equal to $\max_i O(i)$, we annotate the contig with the term “multiple” (rather than any taxon). If

the contig is determined to contain LGT, the contig is annotated with the taxa $i$ and $i'$ that attain

$\max_{i,i'} C(i, i')$. If multiple taxon pairs have scores equal to $\max_{i,i'} C(i, i')$, WAAFLE

determines whether the multiple pairs share one taxon, indicating that one taxon is known


while the other is uncertain. If so, WAAFLE determines the name of the uncertain taxon by

identifying the last common ancestor shared between all uncertain taxa, and assigns the contig

the taxon pair consisting of the universally shared taxon and the last common ancestor of the

uncertain taxa. If all pairs are different, we annotate the contig with the term “multiple”. If contigs

are assigned the term “multiple”, the determined LGT status is rejected and the loop continues

to a higher taxonomic level. Otherwise, we complete the search and annotate the contig with its

LGT status and corresponding taxa.
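
The handling of tied taxon pairs can be sketched as below. This is an illustration under assumptions, not WAAFLE's code: the function names, the lineage lookup table, and the toy taxon names are hypothetical, and returning “multiple” when the uncertain taxa share no ancestor at all is a choice made for the sketch.

```python
def common_ancestor(lineages):
    """Deepest shared prefix of several lineages (each ordered from kingdom downward)."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def resolve_tied_pairs(pairs, lineage_of):
    """Resolve equally scoring taxon pairs to (known taxon, ancestor) or "multiple"."""
    # A taxon present in every tied pair is treated as the known partner.
    shared = set(pairs[0]).intersection(*map(set, pairs[1:]))
    if len(shared) != 1:
        return "multiple"
    known = shared.pop()
    uncertain = [taxon for pair in pairs for taxon in pair if taxon != known]
    ancestor = common_ancestor([lineage_of[taxon] for taxon in uncertain])
    if not ancestor:
        return "multiple"
    return known, ancestor[-1]


# Toy lineages, for illustration only.
lineage_of = {
    "species_x": ["kingdom_1", "phylum_1", "genus_1", "species_x"],
    "species_y": ["kingdom_1", "phylum_1", "genus_2", "species_y"],
    "species_z": ["kingdom_1", "phylum_1", "genus_2", "species_z"],
}
print(resolve_tied_pairs([("species_x", "species_y"), ("species_x", "species_z")], lineage_of))
```

Here the two tied pairs share `species_x`, and the uncertain partners resolve to their common genus, `genus_2`.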

Other Options: Dealing with Unknown Taxa

First, in the case where there are no BLAST alignments to a gene (due to a user

supplying their own gene calls), WAAFLE by default assigns the gene a taxon score of 1 for the

taxon “Unknown”. This will result in WAAFLE either i) identifying the contig as an inter-kingdom

LGT between one taxon and the “Unknown”, or ii) identifying the contig as

“ambiguous” if no two taxa can explain the full contig. Second, users may choose to “spike” in

an “Unknown” taxon into table S during Step 4, in which the “Unknown” is equal to

$1 - \max_i O(i)$ across all genes. Simulation with this flag has shown that WAAFLE will then call

LGT between one taxon and an “Unknown” for contigs containing multiple genes with low

taxon scores, so caution is advised if using this function.

Tuning parameters through grid search

WAAFLE has 5 parameters: subject coverage (s), overlap percentage (o), gene length (l),

one taxon score threshold (k1), and two taxon score threshold (k2). We constructed a set of 1000

synthetic contigs to set these parameters. Contigs were generated through a three step process.


First, we randomly selected donor and recipient genomes that differed across 8 taxonomic

levels (kingdom, phylum, class, order, family, genus, species, and strain/no difference). Second,

we chose a three gene region within the recipient genome, and swapped out the center gene

with a random donor gene. At each taxonomic level, contigs contained 25 unique donor-

recipient pairs with 5 contigs each (for a total of 190 unique donors and 183 unique recipient

strains). Third, we truncated the contigs on both ends. After truncation, some contigs were left

with only one gene, which were removed and resulted in a different distribution across

taxonomic levels.

Gene Calling

We first assessed WAAFLE’s ability to call genes while varying three parameters,

which included i) subject coverage from 0, 0.25, 0.5, 0.75, and 0.9, ii) overlap from 0.1 to 0.5 in

0.1 increments, iii) gene length from 0, 25, 50, 75, 100, and 200 bp. We then compared each NCBI

reference gene to WAAFLE-called genes, and vice-versa, to identify true positives, false

positives, and false negatives:

1. True positive: A WAAFLE-called gene overlaps the NCBI annotated gene by at least 80%.

2. False positive: A WAAFLE-called gene does not match any NCBI annotated gene, or two

or more WAAFLE-called genes match one NCBI annotated gene.

3. False negative: The reference gene does not match any WAAFLE-called gene.

Note that true negatives cannot be assessed meaningfully: these would be regions where

NCBI had no annotation, and WAAFLE did not call a gene. With this, we compared TPR

against PPV for each set of conditions (Fig. I-3).
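
The counting rules above can be sketched for a single contig as follows. Several details are assumptions of this sketch rather than statements from the text: overlap is measured as a fraction of the reference gene's length, every called gene in a group of two or more that match the same reference gene counts as a false positive, and such a reference gene contributes neither a true positive nor a false negative. The function names are hypothetical.

```python
def evaluate_gene_calls(reference, called, min_overlap=0.8):
    """Return (tp, fp, fn) for called genes versus reference genes, as (start, end) tuples."""
    matches = {ref: [] for ref in reference}
    for ref in reference:
        ref_len = ref[1] - ref[0] + 1
        for call in called:
            shared = min(ref[1], call[1]) - max(ref[0], call[0]) + 1
            if shared / ref_len >= min_overlap:
                matches[ref].append(call)
    matched_calls = {call for found in matches.values() for call in found}
    tp = sum(1 for found in matches.values() if len(found) == 1)
    fn = sum(1 for found in matches.values() if not found)
    # False positives: calls matching nothing, plus calls that split one reference gene.
    fp = sum(1 for call in called if call not in matched_calls)
    fp += sum(len(found) for found in matches.values() if len(found) > 1)
    return tp, fp, fn


def tpr_ppv(tp, fp, fn):
    """True positive rate (sensitivity) and positive predictive value (precision)."""
    return tp / (tp + fn), tp / (tp + fp)


reference = [(1, 1000), (1200, 2000), (3000, 4000)]
called = [(10, 1000), (1200, 1500), (5000, 5600)]
print(tpr_ppv(*evaluate_gene_calls(reference, called)))
```

Because true negatives are undefined here, the pair of TPR and PPV is used in place of a conventional ROC analysis.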


LGT Classification

In order to set parameters k1 and k2, we performed a second grid search to characterize

WAAFLE’s ability to call LGT. We only included contigs with at least 2 genes. We varied four

parameters, including i) subject coverage from 0, 0.25, 0.5, 0.75, 0.9, ii) overlap from 0.1 to 0.5 in

0.1 increments, iii) k1 from 0.1 to 0.9 in 0.1 increments, and iv) k2 from 0.1 to 0.9 in 0.1

increments. We then assessed positives and negatives as such:

1. True positive: WAAFLE calls “LGT” for a synthetic contig with an inter-species LGT or

above.

2. True negative: WAAFLE calls “No LGT” or “ambiguous” for a synthetic contig with an

inter-strain LGT.

3. False positive: WAAFLE calls “LGT” for a synthetic contig with an inter-strain LGT.

4. False negative: WAAFLE calls “No LGT” or “ambiguous” for a synthetic contig with an

inter-species LGT or above.

It should be noted that WAAFLE does not have to call LGT at the correct taxonomic

level; thus, this assessment looks specifically at whether WAAFLE can detect LGT, not whether

it called the correct taxa.

Taxonomic Annotation

For correctly classified contigs, we assessed whether WAAFLE annotated contigs with

the correct taxa at each taxonomic level. To compare one taxon call against another, we looked

to see whether they had identical names at each phylogenetic level (i.e., same name at kingdom,


phylum, class, etc.). At best, two taxa may match across all seven levels; in the worst-case

scenario, two taxa may not match at all. For a contig without LGT, we compared the WAAFLE

taxon to the reference taxon. For a contig with LGT, we compared each WAAFLE taxon to each

reference taxon, and selected the combination of pairs with the highest number of matches. We

then calculated what percentage of the reference taxa had a correct match at each taxonomic

level.

Quality control for the Human Microbiome Project (HMP) assemblies

Samples were filtered out if they 1) were outliers in ordination analyses using

MetaPhlAn [186] community profiles or 2) had fewer than 1,000 genes across contigs

(definitively annotated as LGT or not). Contigs were then filtered from these samples if they

resembled misassemblies, defined here as the erroneous combination of genomic material from

two species into a single contig, which will match WAAFLE’s internal model for a biological

LGT event and result in false positive LGT calls. To identify and quantify misassembly in

contigs from the HMP1-II dataset, we examined recruitment of reads to gene junctions. Contigs

that met the two conditions below were removed:

1. The average coverage (reads per nucleotide) of the gene-gene junction is less than half of

the average coverage of the flanking genes.

2. There are no single reads or read pairs that support the junction. Single reads may

support the junction if they overlap both the junction and flanking genes (single), paired

reads may support the junction if i) each read is in a flanking gene (perfect-double) or ii)


one read is in one flanking gene, and the other overlaps the other flanking gene and the

junction (partial-double) (Fig. I-1).

Both conditions are necessary to remove a contig because contig coverage is highly variable,

and read support decreases as junction lengths increase.
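
The two-condition filter can be sketched as below. This is an illustration only: the function and argument names are hypothetical, the flanking coverage is assumed to be the mean of the two flanking genes' average coverages, and the count of supporting reads (single, perfect-double, or partial-double) is assumed to have been computed beforehand.

```python
def is_probable_misassembly(junction_cov, left_gene_cov, right_gene_cov, n_supporting):
    """Flag a gene-gene junction as a probable misassembly.

    Coverages are average reads per nucleotide; n_supporting counts the single
    reads and read pairs that support the junction.
    """
    flank_cov = (left_gene_cov + right_gene_cov) / 2
    low_coverage = junction_cov < 0.5 * flank_cov
    unsupported = n_supporting == 0
    # Both conditions must hold for the contig to be removed.
    return low_coverage and unsupported


print(is_probable_misassembly(2.0, 10.0, 12.0, 0))
```

Requiring both conditions keeps contigs whose junctions are merely under-covered, or whose junctions are too long for any read to span, from being discarded.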

Linear regression for LGT frequency

We performed linear regression with LGT events per gene as the outcome, and number

of contigs, gene to contig ratio, alpha diversity, richness, and body site as the regressors. Alpha

diversity was calculated using the Gini-Simpson Index [188], which is equal to 1 minus the sum

of the square of each genus’s gene percentages. Richness was counted as the total number of

genera per sample.
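
The two diversity covariates can be sketched as follows. The function names and the dictionary of per-genus gene counts are assumptions of the sketch, and gene percentages are treated as fractions that sum to 1.

```python
def gini_simpson(gene_counts):
    """Gini-Simpson index: 1 minus the sum of squared per-genus gene fractions."""
    total = sum(gene_counts.values())
    return 1 - sum((count / total) ** 2 for count in gene_counts.values())


def richness(gene_counts):
    """Number of genera with at least one gene in the sample."""
    return sum(1 for count in gene_counts.values() if count > 0)


# Hypothetical sample with four equally represented genera.
sample = {"genus_a": 25, "genus_b": 25, "genus_c": 25, "genus_d": 25}
print(gini_simpson(sample), richness(sample))
```

The index is 0 for a sample containing a single genus and approaches 1 as genera become more numerous and more evenly represented.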

Determining phylogenetic distance between pairs

We calculated phylogenetic distances between pairs using the PhyloPhlAn tree [50]. If

both taxa were annotated to the species level (tree tips or terminal nodes), distances were

calculated between terminal nodes. If a taxon was not annotated to the species level, the internal

node for the last common ancestor (LCA) was determined after searching the tree for all species

that matched the last known level by regular expression. Distances were then calculated

between nodes, and adjusted by adding the average distance from the LCA to the terminal

nodes.

Functional Analyses

Identifying enriched and depleted Pfam clans


Fisher’s Exact Test was performed both per sample and per body site. For each sample,

we counted the total number of UniRef90 genes in contigs with at least 2 genes and WAAFLE

classification of “LGT” or “No LGT”. For the body site, we summed the total number of

UniRef90 genes in contigs with at least 2 genes and WAAFLE classification of “LGT” or “No

LGT”. We then aggregated UniRef90 terms to Pfam clans, and identified Pfam clans that were

positively or negatively associated with LGT contigs. A Pfam clan was considered significant if:

1. The site-wide q-value is < 0.01.

2. The difference between the percentage of sample odds ratios (OR) that agreed with the

site-wide odds ratio and the percentage of sample odds ratios that disagreed with the

site-wide odds ratio is greater than 0.2:

ORsupport + ORagainst + ORnan = total_samples

(ORsupport - ORagainst) / total_samples > 0.2

For the latter condition, 0.2 was chosen because it requires more than 20% of the samples to

have an odds ratio; in the worst-case scenario, in which every sample has an odds ratio, it

corresponds to ORsupport / total_samples > 0.6 and ORagainst / total_samples < 0.4.
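
The two significance criteria can be sketched as below. This is a minimal illustration, not the analysis code: the function name is hypothetical, undefined per-sample odds ratios are assumed to be stored as NaN, agreement is assumed to mean lying on the same side of 1 as the site-wide odds ratio, and a per-sample odds ratio of exactly 1 is counted as disagreeing.

```python
from math import isnan


def clan_is_significant(site_q, site_or, sample_ors, q_cutoff=0.01, min_margin=0.2):
    """Apply the site-wide q-value cutoff and the per-sample agreement margin."""
    defined = [o for o in sample_ors if not isnan(o)]
    # A sample supports the site-wide result if its odds ratio lies on the same side of 1.
    support = sum(1 for o in defined if o != 1 and (o > 1) == (site_or > 1))
    against = len(defined) - support
    # Samples with an undefined odds ratio still count toward the total.
    margin = (support - against) / len(sample_ors)
    return site_q < q_cutoff and margin > min_margin


nan = float("nan")
sample_ors = [2, 4, 1.5, 0.5, nan, 2, 3, 0.8, nan, 5]
print(clan_is_significant(0.001, 3.0, sample_ors))
```

In the example, 6 samples support enrichment, 2 disagree, and 2 are undefined, giving a margin of (6 - 2) / 10 = 0.4, which exceeds 0.2.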

Searching for genes within Pfam clans

WAAFLE annotates each gene with a UniRef90 term and taxon, which enables us to

examine in more detail which genes and taxa are within enriched Pfam clans. To do this, we

quantified the UniRef90 terms from specific Pfam clans and stratified them by taxonomic

annotation and LGT status (within an LGT contig or not). UniRef90 terms with similar

annotations were collapsed for plotting purposes.

Chapter 3:

Urban transit system microbial communities differ by surface type and interaction with

humans and environment


Copyright Disclosure

This Chapter is a reproduction of a published manuscript, in which the * indicates equal

contribution:

Hsu T.*, Joice R.J.*, J. Vallarino, G. Abu-Ali, E.M. Hartmann, A. Shafquat, C. DuLong, C.

Baranowski, D. Gevers, J.L. Green, X.C. Morgan, J.D. Spengler, C. Huttenhower. Urban Transit

System Microbial Communities Differ by Surface Type and Interaction with Humans and the

Environment. mSystems, 2016. 1(3): e00018-16.

Attributions

R.J., J.S., and C.H. designed the study. C.B. optimized the sampling protocol. R.J., T.H.,

and J.V. collected transit samples, and R.J. and T.H. extracted DNA for 16S and shotgun

sequencing at the Broad Institute. R.J. and A.S. performed 16S computational analyses; T.H., A.S.,

and G.A. performed shotgun computational analyses; E.M.H. and J.L.G. helped interpret

taxonomic composition and functional profiling results. R.J., T.H., A.S., C.D., and X.C.M. made

figures; X.C.M., D.G., J.D.S., and C.H. provided support throughout the sequencing and

analysis process. R.J., T.H., X.C.M. wrote the manuscript.

Abstract

Public transit systems are ideal for studying the urban microbiome and inter-individual

community transfer. In this study, we used 16S amplicon and shotgun metagenomic sequencing

to profile microbial communities on multiple transit surfaces across train lines and stations in

the Boston metropolitan transit system. The greatest determinant of microbial community

structure was the transit surface type. In contrast, little variation was observed between

geographically distinct train lines and stations serving different demographics. All surfaces

were dominated by human skin and oral commensals such as Propionibacterium,

Corynebacterium, Staphylococcus, and Streptococcus. Non-human associated taxa detected

included generalists from Alphaproteobacteria, which were especially abundant on outdoor

touchscreens. Shotgun metagenomics further identified viral and eukaryotic microbes including

Propionibacterium phage and Malassezia globosa. Functional profiling showed that P. acnes

pathways such as propionate production and porphyrin synthesis were enriched on train holds,

while electron transport chain components for aerobic respiration were enriched on touchscreens

and seats. Lastly, the transit environment was not found to be a reservoir of antimicrobial

resistance and virulence genes. Our results suggest that microbial communities on transit

surfaces are maintained from a metapopulation of human skin commensals and environmental

generalists, with enrichments corresponding to local interactions with the human body and

environmental exposures.

Importance

Mass transit environments, specifically urban subways, are distinct microbial environments with high

occupant densities, diversities, and turnovers, and they are thus especially relevant to public

health. Despite this, only three culture-independent subway studies have been performed, all

since 2013 and with widely varying designs and differing conclusions. In this study, we profiled

the Boston subway system, which is operated by the Massachusetts Bay Transportation

Authority (MBTA) and provides 238 million trips per year. This yielded the first high-precision microbial survey of a variety of

surfaces, ridership environments, and microbiological functions (including tests for potential

pathogenicity) in a mass transit environment. Characterizing microbial profiles for multiple

transit systems will be increasingly important for biosurveillance of antibiotic resistance genes or

pathogens, which can be early indicators for outbreak or sanitation. Understanding how human

contact, materials, and the environment affect microbial profiles may eventually allow us to

rationally design public spaces to sustain our microbial health.

Introduction

Mass transit systems host large volumes of passengers and facilitate a constant stream of

human/human and human/built environment microbial transmission. The largest urban mass

transit system in the United States facilitates an average of 11 million trips per weekday (New

York). The next four largest systems transport just over 1 million trips per weekday

(Washington DC, Chicago, Boston, San Francisco) [189][180][182][181], yet little is known about

the mass transit system microbial reservoir. Understanding the associated microbial

transmission dynamics between humans and the built environment, and microbial occupation

and persistence on different surfaces, can inform decisions regarding public health and safety.

Microbial DNA sequencing-based studies have revealed that microbial communities of

the built environment are greatly influenced by their human occupants. Communities within

homes showed higher similarity to those of their inhabitants [92], and specific surfaces

frequently contacted by human skin, such as keyboards or mobile phones, had microbial

communities that reflect those of skin [190, 191]. In restrooms and classrooms, variation in

microbial community composition across surface types was associated with variation in human

contact with those surfaces: desks contained human skin and oral microbes, while chairs

contained intestinal and urogenital-derived microbes [93, 192]. However, a limitation of most

built environment microbiome research is that human contact, surface type, and material

composition are frequently confounded. For example, in the classroom study described above,

different forms of human contact were associated with distinct microbial community profiles;

however, the desks and chairs were also constructed from different materials.

Previously observed subway microbial communities comprise both human and

environmentally derived microbes. Air samples from within the New York and Hong Kong

subway systems included microbes originating from soil and environmental water in addition

to human skin [193, 194]. The recent metagenomic study of New York subway stations [195] has

been widely criticized [196] and leaves many detailed analysis questions regarding the transit

microbiome unanswered, but it has provided an initial reference dataset for further analysis of

subway microbiome diversity. In addition, while this study collected surface type information,

it did not standardize their characterization or, as a result, investigate surface-specific

enrichments for microbial taxa. Understanding the separate influences of human contact,

surface type, and surface material would help identify mechanisms through which microbial

communities form and persist on surfaces within built environments.

In the present study, we provide the first comprehensive metagenomic profile of

microbial communities across multiple surface types and materials in a high-volume public

transportation system. Samples were collected from seats, seat backs, walls, vertical and

horizontal poles, and hanging grips inside train cars from three subway lines, as well as

touchscreens and walls of ticketing machines inside five subway stations. Using a combination

of 16S amplicon and shotgun metagenomic sequencing, we characterized the microbial

community composition, functional capacity, and pathogenic potential of the Boston mass

transit system. In agreement with previous studies, we observed a combination of human-, soil-,

and air-derived microbial communities across the system. Taxonomic differences were most

strongly associated with surface type, as compared to geographic, train-line, and material

differences in a multivariate analysis. The distribution of metabolic functions was dominated by

P. acnes, which made up a majority of the community. Minimal antibiotic resistance genes and

virulence factors were detected across transit system surfaces. In addition to identifying the

most important factors determining microbial colonization, our results may serve as a baseline

description of microbes on public transportation surfaces, which will be relevant toward future

design of transit environments encouraging microbial health.

Results

Sampling microbial communities on the Boston transit system

We collected samples from train cars and stations (n=73) from the Boston transit system.

This system is maintained by the Massachusetts Bay Transportation Authority (MBTA), which

operates bus, subway, commuter rail, and ferry routes in the greater Boston area. Our study

focused on the subway system, which consists of five lines (red, orange, blue, green, and silver)

that extend from downtown Boston into the surrounding suburbs (Fig. 3-1A). Train car samples

were collected from the red, orange, and green lines, and comprised 6 surface types, including

grips, horizontal and vertical poles, seats, seat backs, and walls (Fig. 3-1B). Station samples were

collected from the touchscreens and the sides of fare ticketing machines (Fig. 3-1C). Biomass

yields were highest for hanging grips (141.83±92.68 ng/µL), followed by seats (128.1429±49.955

ng/µL) and touchscreens (120.47±73.68 ng/µL), though these differences were not statistically

significant (Fig. II-S1A).

Figure 3-1. Collection of samples from MBTA trains and stations. (A) Microbial community

samples were collected from the Massachusetts Bay Transit system in the Boston, Massachusetts

metropolitan area. Train samples were collected from 6 train car surfaces across 3 locations

along 3 train routes; station samples were collected from 5 stations. (B, C) Diagram of the

surfaces sampled within train cars (B) and stations (C). Sampled surfaces specifically included

seats and seat backs, horizontal and vertical poles, hanging grips, and walls within train cars, as

well as the screens and walls of touchscreen machines within stations.

For each sample, we collected metadata describing built environment type, surface type,

material composition, as well as collection date (Table II-1). For train car samples, we also

recorded the train line, within-train location, and location along the subway route at time of

sample collection (nearest subway stop). For station samples, we recorded the station, ticketing

machine location, and which side of the touchscreen was swabbed. 16S rDNA amplicon

sequence data was generated from most samples (n=72), and a subset (n=24) was subjected to

shotgun metagenomic sequencing.

Microbial communities are specific to surface types and immediate environment

The surface type from which microbes were collected proved to be the major

determinant of community diversity and structure. Alpha diversity of touchscreen samples was

significantly higher than that of all other surface types (p<0.0001, ANOVA comparison of 7

surfaces with Bonferroni correction, Fig. II-1B), and did not correlate with biomass (Spearman’s

rho=0.0057, Fig. II-1A). The largest axes of beta diversity separated train holds (horizontal and

vertical poles, hanging grips), chairs (seat and seat backs), touchscreens, and walls (Fig. 3-2A).

Train line remained only a minor driver of community structure (Fig. 3-2B), and did not dictate

overall community composition for either holds (Fig. II-S2A) or seats, once the material of the

latter was taken into account (Fig. II-S2B, II-S2C). In particular, the green line seats were

upholstered with vinyl, while seats on the orange and red lines were upholstered with

polyester.

Figure 3-2. Taxonomic composition of subway microbial communities. All ordinations are

principal coordinate analyses using Bray-Curtis distance among filtered OTUs (see Methods),

colored by metadata. (A) Subway data by surface, (B) train car data by train line, and (C)

touchscreen data by location of machine. (D) Relative abundances of bacterial families across

samples from train cars (see Table II-2 for complete data). (E) Relative abundance of bacterial

families within stations (complete data as above). Asterisks indicate that the sample was

collected on a separate day during the same month as the remaining samples. For station

samples, “W” indicates a sample from a ticketing machine wall; all other samples are from the

ticketing machine touchscreens.
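
For reference, the Bray-Curtis dissimilarity underlying these ordinations can be computed as in the following minimal Python sketch. It is not the pipeline used to produce the figure; the dictionary-based sample representation and the function name are illustrative assumptions.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each sample maps a taxon (e.g. an OTU identifier) to its abundance.
    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(sample_a) | set(sample_b)
    difference = sum(abs(sample_a.get(t, 0.0) - sample_b.get(t, 0.0)) for t in taxa)
    total = sum(sample_a.get(t, 0.0) + sample_b.get(t, 0.0) for t in taxa)
    return difference / total if total else 0.0
```

Computing this value for every pair of samples gives the distance matrix on which a principal coordinate analysis is then performed.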

The location of ticketing machines (e.g. outdoor, indoor, underground) was a primary

source of variation between microbial communities on touchscreens (Fig. 3-2C). Univariate

analyses using Linear Discriminant Analysis Effect Size (LEfSe) [197] revealed that indoor



touchscreens were characterized by genus Acinetobacter, while underground touchscreens had

increased levels of genus Corynebacterium, and family Tissierellaceae, specifically genus

Finegoldia and genus Anaerococcus. Those with outdoor exposures were enriched for class

Alphaproteobacteria, including family Acetobacteraceae and genus Methylobacterium,

Sphingomonas, and Blastococcus (Table II-3). These results imply that surface type is a major

driver of community composition on transit surfaces, and that indoor versus outdoor exposure

detectably influences the resident microbial composition of touchscreen surfaces.

Subway microbial communities are largely derived from human skin and oral commensal

microbes

Subway microbial clades were generally those found in typical human skin communities

[2, 81] (Fig. 3-3Ai) and were dominated by the phyla Firmicutes, Proteobacteria, and

Actinobacteria, each of which comprised over 20% of the microbial community, based on 16S

data. The Bacteroidetes were much less abundant with an average community abundance of 6%

(Table II-S2). The families with the highest mean relative abundances were Staphylococcaceae

and Corynebacteriaceae (Fig. 3-2D-E), also typical of skin commensals. Propionibacterium was

not observed due to known primer bias [198] but was confirmed later with shotgun

metagenomics. The next most abundant taxa were Micrococcaceae, which included genus

Micrococcus (found in hair and skin) and genus Rothia (found in the oral cavity [2, 199]), as well

as Streptococcaceae (found in the oral cavity) and Pseudomonadaceae. We also observed low

proportions of gut and oral commensals such as Lachnospiraceae, Veillonella, and Prevotella.

Figure 3-3. Putative MBTA microbial community sources. (A) i. Ordination of subway surface

data jointly with human skin (anterior nares), oral (mixed sites from within oral cavity) and gut

(stool) microbiome data from the Human Microbiome Project (HMP). Principal coordinate

analysis was performed with weighted UniFrac distance and calculated using OTU relative

abundances. ii-iv. Correlations between subway samples and human body sites [200]: ii. skin,

iii. oral, and iv. gut, as well as environmental sites: v. air [201] and vi. soil [202]. The x- and y-

axes represent mean relative abundance across each data set with standard error bars. For each

plot, subway samples (MBTA) are on the x-axis and potential source community on the y-axis.

(B, C) Microbial SourceTracker [203] was used to identify possible human and environmental

sources of subway (B) train and (C) station communities. Relative estimated contribution

of each source is plotted per subway sample.

Highly abundant non human-associated taxa encompassed the order Burkholderiales

(3.25%); as well as class Alphaproteobacteria (9.15%), which contains genera Sphingomonas

(1.48%) and Methylobacterium (1.14%) and families Rhodobacteraceae (1.48%) and

Methylocystaceae (0.447%). These Alphaproteobacteria are widespread environmental bacteria

with flexible metabolic regimes; Sphingomonads in particular, including the genera

Sphingomonas and Sphingobium, are found in soils and sediments and are most well studied for

their ability to degrade polyaromatic hydrocarbons [204]. Methylobacterium, primarily M.

extorquens, is a genus of plant- and soil-associated facultative methylotrophs; these bacteria are

highly prevalent on the surfaces of plants, and their diverse metabolic capabilities make them

likely to survive in other environments [205]. Enhydrobacter aerosaccus, which is currently

classified as belonging to Moraxellaceae but may more aptly be classified as an

Alphaproteobacterium [206], was also prevalent in the subway samples.

To determine the microbial clades driving these patterns, we correlated the abundance

of subway microbial genera with their abundance in three human body sites [200] as well as air

and soil [201, 202] (Fig. 3-3Aii-vi). As expected, the human skin genera Staphylococcus and

Corynebacterium (Fig. 3-3Aii), human oral cavity taxon Streptococcus, and human gut-resident

genera Bacteroides and Prevotella were abundant on both the subway and their respective body

sites (Fig. 3-3Aii-iv). In addition to human-associated taxa, several genera previously observed in

indoor air [201] were also abundant on subway surfaces: Sphingomonas, Methylobacterium,

Acinetobacter, Streptococcus, Staphylococcus, and Corynebacterium (Fig. 3-3Av). In contrast, typical

soil genera were rare on subway surfaces (Fig. 3-3Avi). Microbial SourceTracker [203] confirmed

these origins based on overall community composition as compared to a variety of reference

environments [207] (Fig. 3-3B-C). Only a subset of touchscreen samples included a substantial

proportion of environmental microbes (e.g. air and soil), most notably from the Riverside

above-ground outdoor ticketing station (Fig. 3-3C).
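
The genus-level comparison between subway samples and a candidate source community rests on per-genus mean relative abundances with standard errors. A minimal Python sketch of that summary follows; the data layout, the function names, and the 1% abundance threshold are illustrative assumptions rather than the analysis actually performed.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(samples, genus):
    """Mean relative abundance and standard error of one genus.

    samples: list of dicts mapping genus name to relative abundance;
             a genus absent from a sample counts as 0.
    """
    values = [sample.get(genus, 0.0) for sample in samples]
    sem = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return mean(values), sem


def shared_abundant_genera(subway, source, threshold=0.01):
    """Genera whose mean abundance reaches the threshold in both data sets."""
    genera = set().union(*subway, *source)
    return sorted(
        g for g in genera
        if mean_and_sem(subway, g)[0] >= threshold
        and mean_and_sem(source, g)[0] >= threshold
    )
```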

Propionibacterium phages and the yeast Malassezia globosa dominate the non-bacterial microbial

community

Shotgun metagenomic sequencing, which allowed us to profile viral and eukaryotic

microbes that cannot be identified by 16S sequencing as well as bacterial taxa that are poorly

amplified by the 16S V4 region primers [198], was performed for 24 mass transit samples

including 15 train car samples and 9 station samples. In agreement with previous studies of skin

ribotypes [81, 208], the most abundant species across all samples was the facultative anaerobe

Propionibacterium acnes (mean 47%, max 81%); its average abundance was 29.8% for chairs,

71.6% for grips and poles, and 43.4% for touchscreen surfaces (Fig. 3-4). Other metagenomically

assessed bacterial abundances agreed with 16S data, including high levels of family

Micrococcaceae (mean 5.3%), Staphylococcaceae (mean 5.28%), Corynebacteriaceae (mean

4.95%), and Streptococcaceae (mean 3.73%), along with non human-associated taxa including the

soil taxa Geodermatophilaceae (mean 1.22%) and Acinetobacter (mean 0.70%) (Table II-2).





Figure 3-4. Trans-domain taxonomic profiles from subway shotgun metagenomes. Relative

abundances of the twenty microbial species with highest mean across 24 metagenomes from

train cars and stations. Among colored metadata annotations, train line (green, orange, or red)

is indicated for car surface samples and location (indoor or outdoor) for touchscreens. P. acnes

is not amplified by the 16S primers used in this study but readily detectable by shotgun

sequencing, as are non-bacteria such as Propionibacterium phage.

Eleven non-bacterial species were present at an abundance of ≥0.1% in at least two

samples. The most abundant and prevalent viruses included Propionibacterium bacteriophages

and oncovirus Merkel cell polyomavirus (a common respiratory infection [198]). The relative

abundances of Propionibacterium bacteriophages P100D and P101A show patterns similar to

those of P. acnes, with lower average abundance on chairs (3.2%), and higher abundances on

holds (5.4%) and touchscreens (7.9%), suggesting that phage/host relationships are detectable

directly from metagenomics. Remaining viruses were found sporadically (in only 2 samples) or

had mean relative abundances less than 0.0006% (Table II-2). Many of these viruses were phage

that corresponded to abundant bacterial species, including Pseudomonas phage, Lactobacillus

phage, Lactococcus phage, Staphylococcus phage 3A, Staphylococcus phage 80 alpha, and

Staphylococcus phage phi2958PVL.

The yeast Malassezia globosa [209] also occurred with abundance patterns similar to those

of P. acnes, with lower abundance on chairs (0.03%) and higher abundances on holds (0.25%)

and touchscreens (0.1%). Both M. globosa and P. acnes show niche-specific adaptation to

metabolism of lipid-rich sebum [209, 210] and are commonly found on sebaceous skin sites,

which comprise the chest, back, and face [208]. This may indicate that sebaceous skin taxa

more easily transfer or adhere to built environment surfaces.

All surface types are dominated by skin microbes, with smaller proportions of oral, gut, and

environmental taxa across seats and touchscreens

To identify differentially abundant taxa across metadata categories, we performed a

multivariate analysis using MaAsLin [211], which controls for multiple covariates using a

generalized linear model (Table II-4). For 16S data, we accounted for built environment type,

surface type, material composition, and sample location. For human-associated taxa, seats were

particularly enriched in skin taxon Corynebacterium and vaginal taxon Gardnerella, though all

contacted surface types had higher relative abundances of Corynebacterium as compared to train

walls (Fig. 3-5A). The skin taxon Staphylococcus was also enriched across all surface types except

for touchscreens and train walls, and Corynebacterium was negatively associated with vinyl seats

relative to polyester seats. Grips were enriched for oral taxa such as Rothia and Veillonella. For

non human-associated taxa, all grips and vertical poles were depleted in class

Alphaproteobacteria, as contrasted to their enrichment on outdoor surfaces at the Riverside

station (western suburb). These clades included Methylobacteriaceae (grips and vertical poles)

and Methylocystaceae (all holds), as well as family Sphingomonadaceae (grips and vertical

poles) and genus Amaricoccus (all holds). Because many of these organisms are likely associated

with soil particles, it is reasonable that they should be less abundant on surfaces where soil is

unlikely to settle.

Figure 3-5. Enrichment of microbial taxa with respect to metadata using multivariate

analyses. Each ring represents significant associations of one metadatum with microbial clades

using MaAsLin [211] (FDR q<0.25). (A) 16S data. For location, surface category, surface type,

and surface material (inner rings to outer rings), the direction of association between taxa and

metadata is indicated in red (positive) or green (negative) and is relative to Alewife, touchscreens,

seat backs, and polyester, respectively. (B) Shotgun metagenomic data; only a simplified surface

type was represented by sufficiently many samples for analysis. Horizontal poles, vertical poles,

and grips were grouped into “holds”, and seats and seat backs were grouped into “chairs”.

The direction of association is again indicated by color. Only taxa with at least one association

are shown in each cladogram.
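
The FDR q-value threshold used for these associations can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. The text does not state which FDR procedure was applied, so Benjamini-Hochberg is an assumption here, as is the function name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues
```

An association would then be reported when its q-value falls below the chosen cutoff (0.25 for the cladograms above).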

For shotgun data, we again used MaAsLin [211] to identify associations between

microbial taxa and a single covariate, surface type (Fig. 3-5B, Table II-4). Due to the small

number of samples, surface type metadata were grouped into chairs (seat and seat backs), holds

(hanging grips, horizontal and vertical poles), and touchscreens. For human-associated taxa,

chairs and touchscreens were enriched in multiple species of Corynebacterium (including C.

aurimucosum, genitalium, jeikeium, massiliense, pseudogenitalium, tuberculostearicum, urealyticum)

and Staphylococcus (S. caprae capitis, epidermidis, haemolyticus, hominis, pettenkoferi); vaginal taxa

Gardnerella vaginalis and Lactobacillus (L. crispatus and L. iners); and gut taxa Ruminococcus bromii,

Faecalibacterium prausnitzii, and Eubacterium rectale. Touchscreens were particularly enriched in

oral species such as Streptococcus (S. cristatus, gordonii, infantis, mitis/oralis/pneumoniae,

parasanguinis, sanguinis, thermophilus, tigurinus), Prevotella (P. copri, melaninogenica), and Rothia

aeria (also enriched in holds). For non-human associated taxa, we saw similar patterns as in the

16S data. Touchscreens were enriched in Methylobacteriaceae, Burkholderiales,

Sphingomonadales, and Rhodobacteraceae (also enriched in chairs). Many of these non-human

associated taxa that we identified on surfaces are hardy generalists that survive under harsh

conditions [212].

Most Corynebacterium species enriched in both chairs and touchscreens have higher (but

not statistically significant) abundances in chairs, with the exception of C. kroppenstedtii and C.

matruchotii. The lack of oral species on holds may be due to the newfound detection of P. acnes,

which is enriched in holds and may affect the relative abundances of rarer taxa. Generally, skin

taxa dominate all surfaces, with P. acnes enriched on holds and Corynebacterium and

Staphylococcus on chairs and touchscreens. Oral taxa are present on both holds and

touchscreens. Non-human associated taxa remain enriched on touchscreens, which present

more exposed surface areas not enclosed within trains.

Metagenomes reflect dominance of Propionibacterium acnes across subway surfaces

Functional genomic profiling using HUMAnN2 quantified 3,975,869 UniRef50 [143]

protein families, which were collapsed into 12,074 KEGG Orthology (KO) [213] families. For

hypothesis testing, we focused on 604 KOs with mean abundances greater than the overall

median abundance and variance across samples in the 90th percentile. MaAsLin identified 590

KOs significantly associated with surface type (q < 0.05): 360 enriched in holds, 204 depleted in

holds, 12 enriched in chairs, 4 depleted in chairs, 5 enriched in touchscreens, and 4 depleted in

touchscreens, relative to all other surface types (Table II-4).
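
The abundance and variance filter applied before hypothesis testing can be sketched in Python as follows. The exact definitions are assumptions: here the "overall median abundance" is taken over all KO-by-sample values, the variance cutoff is a nearest-rank 90th percentile of per-KO variances, and the table layout and function names are hypothetical.

```python
from statistics import mean, median, pvariance


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def select_kos(table, pct=90):
    """Keep KOs with a high mean abundance and a high variance across samples.

    table: dict mapping a KO identifier to its list of per-sample abundances.
    """
    overall_median = median(x for values in table.values() for x in values)
    variances = {ko: pvariance(values) for ko, values in table.items()}
    cutoff = nearest_rank_percentile(variances.values(), pct)
    return sorted(
        ko for ko, values in table.items()
        if mean(values) > overall_median and variances[ko] >= cutoff
    )
```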

Many of the KOs enriched in holds were genes found in the P. acnes genome [214]. These

included systems for anaerobic respiration, lipases and esterases for degrading lipids within

sebaceous sites, hyaluronate lyase for digesting the extracellular matrix of skin, and fermentation of

pyruvate to propionate (Fig. 3-6A). Production of propionate is catalyzed by methylmalonyl-

CoA carboxyltransferase, which is enriched in the holds. Porphyrin synthesis is a major

function of several Propionibacterium [215], contributing to a range of physiological activities

(e.g. potential keratinocyte damage from free radical release [214, 216]) and industrial uses (e.g.

synthesis of vitamin B12 [217]). Here, the pathway was represented by several genes from the

hem and cbi/cob gene clusters [217, 218]. To verify that the KOs detected above were indeed

specific to P. acnes, we removed its contributions to the overall abundance of each UniRef50

family, renormalized, and again identified KOs enriched on different surface types (see

Methods). KOs specific to P. acnes metabolism were no longer enriched on holds, with a few

exceptions including iron transport (Fig. 3-6A, Table II-4).
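
The removal and renormalization step can be illustrated with a minimal Python sketch. It assumes species-stratified abundances are available for each protein family; the nested-dictionary layout, the species labels, and the function name are hypothetical and are not the HUMAnN2 file format.

```python
def remove_species(stratified, species):
    """Drop one species' contributions and renormalize the remainder to sum to 1.

    stratified: dict mapping a gene family to a dict of per-species abundances.
    species:    label of the species whose contributions are removed.
    """
    residual = {
        family: sum(a for name, a in by_species.items() if name != species)
        for family, by_species in stratified.items()
    }
    total = sum(residual.values())
    if total == 0:
        return {family: 0.0 for family in residual}
    return {family: abundance / total for family, abundance in residual.items()}
```

Families contributed only by the removed species drop to zero, so any enrichment that persists after this step must come from other community members.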

Figure 3-6. Enrichment of KEGG Orthology (KOs) across MBTA surfaces before and after P.

acnes removal. For all heatmaps, rows represent significantly enriched KOs detected through

linear regression with MaAsLin, columns represent samples, and cells are colored by sum-

normalized reads per kilobase (RPKs) on a log scale. Further metadata is shown as colored bars

below the heatmaps. The first colored bar explains the collapsed surface types (second bar), in

which chairs include seats (light blue) and seat backs (dark blue), holds include horizontal poles

(red), vertical poles (orange), and grips (yellow), and touchscreens are from Riverside (green),

Alewife (red), Forest Hills (orange), and South Station (light blue). KOs annotated with yellow

circles are found before and after P. acnes removal. (A) Selected KOs enriched in holds only are

specific to and colored by P. acnes metabolic function. (B) Selected KOs specific to oxidative

phosphorylation and photosynthesis are shown before (above) and after (below) P. acnes

removal. Direction of association between KO abundances and surface types, relative to holds,

are shown as green ‘+’ (positive) or red ‘-’ (negative) to the left of the heatmap. Columns are

colored by metadata as in Fig. 3-2.

Many KOs associated with oxidative phosphorylation and

photosynthesis were enriched in chairs and touchscreens relative to holds before removal of P.

acnes. These included NADH dehydrogenase I subunits (EC:1.6.5.3), ferredoxin-NADP+

reductase (involved in photosystem I, EC:1.18.1.2), ATPase subunits (EC:3.6.3.14), and

cytochrome c oxidases (EC:1.9.3.1). After depletion of P. acnes-derived processes, ferredoxin-

NADP+ reductase and F-type H+-transporting ATPase subunits were enriched only on chairs,

while cytochrome c oxidase subunits and NADH dehydrogenase subunit types and Fe-S

proteins were enriched only on touchscreens (Fig. 3-6B). Increased numbers of electron

transport chain components may indicate more aerobic respiration, or the presence of

eukaryotic DNA (as detected by chloroplasts or mitochondria). Notably, high levels are found

across all KOs for the horizontal pole from the Red Line and the outdoor touchscreen from

Riverside station, although it is unlikely that these trends were completely eukaryotic. Riverside

station touchscreen 16S profiles included only 4.04% chloroplast classified sequences, and

overall holds included for shotgun sequencing had the highest average proportions of

chloroplast, followed by chairs and touchscreens. Thus, presence of more electron transport

chain components may also reflect a metabolic strategy enriched among persisters in the built

environment, especially relevant to the touchscreens’ Alphaproteobacteria.

Minimal pathogenic and antibiotic resistance presence on the Boston transit system

To detect antibiotic resistance factors in MBTA metagenomes, we used ShortBRED [219]

to create high-precision sequence markers from the Comprehensive Antibiotic Resistance

Database (CARD) [220]. This resulted in 2,657 antibiotic resistance gene (ARG) markers for 792

ARGs in CARD, but only 46 ARG markers were detected with RPKMs greater than 0 in at least

two samples. This is notable because the average read depth of our samples was 9.8×10^6 reads

(0.989 Gnt), but the average RPKM per sample for these markers was only 1.172, ranging from 0

to 46.67. Similarly, a low abundance of ARGs (<0.3% of total reads mapped to the Antibiotic

Resistance Database) was found in the Home Microbiome Project [92]. Our hits included several

resistance mechanisms, including efflux pumps, antibiotic target modification or replacement,

antibiotic inactivation, and changes in nucleic acid machinery (rpoB or par genes) (Fig. 3-7A).
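
To make the reported quantities concrete, the following Python sketch shows a generic RPKM calculation and the detection rule of a nonzero RPKM in at least two samples. The formula is the standard reads-per-kilobase-per-million definition and may differ in detail from ShortBRED's own normalization; the function and argument names are assumptions.

```python
def rpkm(read_count, marker_length_nt, total_reads):
    """Reads per kilobase of marker per million sequenced reads."""
    return read_count * 1e9 / (marker_length_nt * total_reads)


def detected_markers(rpkm_table, min_samples=2):
    """Markers with RPKM greater than 0 in at least `min_samples` samples.

    rpkm_table: dict mapping a marker identifier to its per-sample RPKM values.
    """
    return sorted(
        marker for marker, values in rpkm_table.items()
        if sum(value > 0 for value in values) >= min_samples
    )
```

Under this definition, 10 reads mapping to a 500-nt marker in a sample of 2 million reads correspond to an RPKM of 10.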

Figure 3-7. Quantification of antibiotic resistance marker and virulence factor abundances on

subway surfaces. (A) Antimicrobial resistance markers (rows) quantified in metagenomes by

ShortBRED [219] and annotated by antibiotic target through the Antibiotic Resistance Ontology

in the CARD database. (B) Virulence factors (rows) likewise quantified and manually annotated

by virulence function through keywords on the VFDB web site. For both heatmaps, columns

(samples) are arranged as in Fig. 3-6.

To contextualize ARG enrichment (or rather depletion) in this environment, we further

compared the Boston subway to ARGs in the air microbiome from several other built

environments [221] as well as from 552 stool samples from individuals in the United States,

China, Malawi, and Venezuela [2, 222, 223]. For consistency with previous surveys, we used

ShortBRED to generate 4,132 ARG markers for 849 ARGs in the Antibiotic Resistance

Database (ARDB). Both the air microbiome and Boston subway samples had noticeably lower

levels of RPKMs than those of typical human stool (Fig. II-3). The gut microbiome has repeatedly

been observed [224] to be enriched for tetracycline resistance, beta-lactamases, and MFS/RND

efflux pumps, whereas none of these were substantially present in the MBTA and only low

levels of tetracycline and beta-lactamase resistance were present in indoor air [221].

To similarly assess virulence factors in the MBTA, we created sequence markers from

the Virulence Factor Database (VFDB) [225], resulting in 7,869 markers for 2,089 factors. Of

these, 54 markers were detected with RPKMs greater than 0 in at least two samples. The average RPKM

per sample was 0.240, ranging from 0 to 23.74. All of the putative virulence factors, with the

exception of the alpha and beta-hemolysin proteins found in S. aureus, are opportunistic factors

typical of normal microbial life. For example, many proteins were classified as part of

pathogenicity islands; however, most of these proteins are transposases, integrases, and

repetitive regions (Fig. 3-7B). Other hits were annotated with functions in adherence,

antiphagocytosis, and secretion systems, but consisted of cell surface proteins such as

lipopolysaccharides, capsule polysaccharide proteins, and flagellar proteins. This indicates that

the real pathogenic potential detected in the Boston subway is very low. Overall, the Boston

subway has minimal antibiotic resistance and virulence factor presence.

Discussion

Here, we report on the microbial profile of the Boston metropolitan transit system.

Previous studies have characterized the Hong Kong and New York subway aerosol

communities [193, 194], as well as surfaces in the New York subway [195], but we believe this to

be the first to determine how space utilization by passengers, surface type, and material


composition individually affect microbial ecology. We further describe the microbial

community metabolic potential across surface types and metagenomically assess the absence of

pathogenic potential. The former primarily reflected P. acnes pathways on holds and aerobic

respiration on seats and touchscreens; resistance and virulence factors among the latter were

depleted relative to environments such as the human microbiome.

Surface type was the major driver of variation in composition, lending support to three

potential hypotheses: differences may be driven by 1) human body interactions [192], 2)

material composition of these surfaces, which may enhance microbial adherence and growth, or

3) a combination of the two factors. Our data support the third hypothesis. First, we observed a

significant enrichment of oral microbes on horizontal poles and grips, which may be higher up

and closer to riders’ faces or reflect transfer through skin-mediated contact (Fig. 3-1C). Second,

both 16S and shotgun data showed enrichment of vaginal commensals in seat surfaces, which

may be transmitted through clothing. Third, we found that seats were enriched in vaginal and

oral taxa relative to seat backs, and outdoor touchscreens were enriched in Alphaproteobacteria

relative to indoor touchscreens. If surface material were the only driver of microbial

composition, seats vs. seat backs and indoor vs. outdoor touchscreens should have similar

taxonomic profiles. Surface material certainly plays at least a partial role, however, as we

observed decreased Corynebacterium in vinyl seats as compared to polyester seats. Overall, our

observations indicate that both human body interactions and surface material shape community

composition, with the former as the stronger driver.

Previous studies of the transit microbiome, particularly those of New York [195] and

Hong Kong [194], have also observed environmental exposure to be an additional driver of its

microbial community composition. Afshinnekoo et al, for example, found that samples’ human

DNA reflected census demographics for the surrounding region, although we saw no

differentiation at the microbial level among Boston train lines serving suburbs with different

ethno-demographics. We primarily observed the impact of environmental exposure on outdoor

touchscreens, in agreement with Leung et al’s higher alpha diversities for outdoor stations in

Hong Kong. The surfaces we investigated are near-uniformly exposed to high volume and

diversity of rider interaction. This frequent human contact could homogenize many potential

influences on microbial populations, such as demographics or weather. Since the body sites

used for contact, indoor/outdoor location, and material composition remain consistent, these

exposures would thus shape the taxonomic differences we observed across the Boston subway.

There are few non-opportunistic pathogens in the built environment outside of hospitals

[226]. None were reported for restrooms [93], classrooms [192], or Hong Kong subway aerosols

[194], possibly due to lack of phylogenetic resolution with 16S sequencing. During partial

genome assembly from home [92] and restroom [227] surface metagenomes, shotgun

sequencing facilitated identification of opportunists with pathogenic potential, but even with

this increased resolution, outright virulence factors were rare. Robertson et al detected no

human pathogens using Sanger and pyrosequencing in New York subway aerosols [193].

Furthermore, although Afshinnekoo et al report 12% of taxa represented known pathogens in

the National Select Agent Registry and PATRIC database, this database uses an extremely

broad definition of “pathogen,” and these results were later refuted [196]. Our study assessed

whether typical subway microbial communities were unusual in their carriage or transfer of

antibiotic resistance genes and virulence factors. We detected low numbers of these genes, and

they were present at drastically lower amounts than observed in the human gut.

One goal of studying the microbiology of the built environment is to establish a baseline

against which deviations can be used to detect potential public health threats. As with the

human microbiome, however, inter-subject variability appears to be quite high in built

environments (e.g. buildings) and in transit systems, and both greater cross-sectional breadth

and longitudinal depth are still necessary. All subway microbiome papers to date have detected

a high level of skin-associated genera. In addition to this work, Leung et al (Hong Kong subway

aerosols) included Micrococcus (4.9%), Enhydrobacter (3.1%), Propionibacterium (2.9%),

Staphylococcus, and Corynebacterium (1.5%), while Robertson et al detected high levels of families

Staphylococcaceae, Moraxellaceae, Micrococcaceae, Enterobacteriaceae, and

Corynebacteriaceae. Afshinnekoo et al in the New York subway is the only major exception,

with the most abundant organisms instead including Pseudomonas stutzeri, Acinetobacter, and

Stenotrophomonas. If microbes shed from skin (or still resident on shed skin cells) do dominate

mass transit environments, it must be determined whether these microbes are deposited,

dormant, or actively growing, or whether they can be stably transferred from one individual to

another.

As in other built environments, however, human-associated microbes are by no means

the only apparently functional community residents even when abundant. Notably, our wall

samples, which are not consistently touched but are in the presence of high human density, have

lower biomass and different microbial compositions from other train surfaces. Establishing a

"typical" microbial baseline for mass transit environments will require thoughtful sample

design that controls for local space properties, short- and long-term temporal variation (e.g.

time of day and season), and cross-sectional differences within and between cities. It may also

prove useful to monitor for a combination of normal versus undesirable organisms and

metabolic or functional profiles, as the latter has been observed to be more stable than

taxonomy in the human microbiome [2]. In some cases specific pathogens may be easier to

detect; in others (e.g. when individual pathogens may be extremely low density), structural,

functional, or metabolic shifts may be better indicators of changing transit profiles and,

consequently, health hazards. In all such cases, future studies should incorporate expertise from

architecture, engineering, public health, microbiology, and ecology, thus allowing both

confident and interdisciplinary analyses as well as institutional changes in response to scientific

findings.

In conjunction with other published investigations, this work helps to characterize the

“urban microbiome” and, in doing so, adds to our understanding of how these microbial

communities are formed, maintained, and transferred. Such studies fall in a critical space

between environmental and human-associated microbial ecology, and as such must address the

challenges of both. These include study designs with rich metadata, including architectural

features, human contact, environmental exposure, surface type, and surface material;

accounting for a wide range of potential biochemical environments, contaminants, and biomass

levels; and the involvement of institutional review boards, city officials, and engineers as

appropriate. Future work will help to determine which urban microbes are viable and resident

(as opposed to transient), as well as identifying the mechanisms utilized to persist in the built

environment. It will also be important to identify microbes that can be transferred between

people via specific fomites, since this especially has the potential to inform public health and

policy (by monitoring organisms, gene content, or both). A greater understanding of these

processes may thus eventually lead to construction of built environments that enhance and

maintain human health.

Materials and Methods

Study permissions

The Massachusetts Bay Transportation Authority (MBTA) approved all aspects of transit

system sampling and gave permission to the Harvard T.H. Chan School of Public Health to

conduct this study (Fig. II-4). Additional support was provided by the MBTA Police, who

accompanied the study team during sample collection. A written description of the protocols

and study goals was distributed to interested MBTA passengers during sampling.

Sample collection

Samples were collected in 2013 on May 16, May 23, and October 22 from the public

transit system serving metropolitan Boston during normal workday hours. Train car sampling

began at the outmost termini of train routes (Alewife Station on the Red Line, Riverside Station

on the Green Line, and Forest Hills Station on the Orange Line). Trains were sampled as they

proceeded inbound towards the city center. Station samples were collected by swabbing the

touchscreens and sides of ticket machines at five stations (Fig. 3-1).


For all samples, we recorded the sampling date, outdoor air temperature and relative air

humidity, location, surface type (seat, seat back, horizontal pole, vertical pole, hanging grip,

wall, or touchscreen), and material composition (polyester and vinyl (seats and seat backs),

stainless steel (poles), PVC (grips), combination of wood, engineered wood, extruded

thermoplastic, fiber reinforced plastic, aluminum honeycomb panel, melamine-finished

aluminum panels reinforced with Kevlar (walls), or coated glass (touchscreens)). For train car

samples, we recorded the within-train location of sample collection (end or middle of car), as

well as the train line and location along the route when the sample was collected. For station

samples, we recorded the location of each ticketing machine (indoor, outdoor, underground)

and the side of the touchscreen swabbed (right, left, both).

All metadata are described in Table II-1 and where possible, metadata terms from the

Minimum Information Standards for the Built Environment (MIxS-BE) were used [228].

Weather information was compiled from weather archives from the National Oceanic &

Atmospheric Administration [229] and Weather Underground (KBOS [230]).

Swab collection and processing

DNA-free cotton swabs (Puritan, Maine, USA) were used for collection in this study.

Each swab was dipped into a swabbing solution prepared from 0.15 M NaCl and 0.1% Tween

20, as used in previous studies [81, 192, 201, 231]. All surfaces were swabbed for approximately

15 seconds, and each surface was sampled 2 or 3 times with separate swabs over non-

overlapping regions. Swabs were stored together in 15 mL Falcon tubes on ice for no more than

one hour before being taken to a central location and stored on dry ice. All samples were

transported directly from dry ice to a -80°C freezer for storage.

DNA extraction, 16S amplicon sequencing, and operational taxonomic unit (OTU) calling

Samples were processed using the MoBio PowerLyzer PowerSoil DNA extraction kit

(MO BIO Laboratories, Inc.). For each sample, 2 or 3 swabs from the same sample were pooled

for optimal biomass recovery. Amplification and sequencing by Illumina MiSeq were

performed as described previously by Caporaso et al [232]. OTU tables were constructed with

Quantitative Insights into Microbial Ecology (QIIME) software [233] version 1.8 using

closed-reference OTU picking (pick_closed_reference_otus.py) with Greengenes reference version 13.5 at the 97%

identity level. We filtered low-abundance OTUs (minimum abundance threshold 0.001 in at

least one of 72 samples). Because the primers used in the study were designed to amplify

bacterial 16S genes, we filtered out OTUs that corresponded to chloroplasts, mitochondria, and

archaea. This reduced the dataset to 2,134 unique OTUs representing 501 unique genera. OTU

frequencies in samples were then sum-normalized to proportional data (Table II-2). Further

details can be found in the Supplemental Information.
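
The filtering and normalization steps described above can be illustrated with a minimal Python sketch. This is not the QIIME workflow used in the study. It assumes that the abundance threshold applies to relative abundances, that the comparison is "greater than or equal to", and that proportions are recomputed after filtering. The function names and toy counts are hypothetical.

```python
def sum_normalize(table):
    """Rescale each sample's values so that they sum to 1."""
    normalized = {}
    for sample, otus in table.items():
        total = sum(otus.values())
        normalized[sample] = {
            otu: (value / total if total else 0.0) for otu, value in otus.items()
        }
    return normalized


def filter_otus(rel_abund, min_abundance=0.001, min_samples=1):
    """Keep OTUs that reach min_abundance in at least min_samples samples."""
    prevalence = {}
    for otus in rel_abund.values():
        for otu, value in otus.items():
            if value >= min_abundance:
                prevalence[otu] = prevalence.get(otu, 0) + 1
    keep = {otu for otu, n in prevalence.items() if n >= min_samples}
    return {
        sample: {otu: v for otu, v in otus.items() if otu in keep}
        for sample, otus in rel_abund.items()
    }


# Hypothetical read counts for two samples (not data from the study).
counts = {
    "seat_1": {"otu_a": 9000, "otu_b": 999, "otu_rare": 1},
    "pole_1": {"otu_a": 500, "otu_b": 500, "otu_rare": 0},
}
relative = sum_normalize(counts)
filtered = sum_normalize(filter_otus(relative))  # renormalize after filtering
```

Raising `min_samples` to 7 would correspond to the stricter filter applied before the univariate and multivariate tests.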

Analysis methods

Alpha diversity was calculated using the Inverse Simpson diversity index in the R

package ‘vegan’ [234]. Ordinations were calculated by principal coordinate analysis (PCoA)

using Bray-Curtis dissimilarity, unless otherwise noted, using the relative abundance table

generated above. For univariate and multivariate tests, we further filtered OTUs (minimum

abundance threshold 0.001 in at least seven of 72 samples). A univariate test for taxa differentially

abundant with respect to touchscreen location was performed using LEfSe [197]. For this

analysis, each metadata category was tested using alpha values of 0.05 for both the Kruskal-

Wallis and Wilcoxon tests with one-against-all comparison and an LDA effect size cutoff of 2.0.

Significant taxa-metadata univariate associations are listed in Table II-3. Multivariate

association tests for taxa that were differentially abundant with respect to metadata were

performed using MaAsLin [211]. For this analysis, we used four metadata categories: these

included locale (train or station), surface type (e.g. seat, seat back, etc), surface material (e.g.

polyvinyl chloride, carpet, etc), and location (e.g. Forest Hills Station, Orange Line train, etc).

Microbial source prediction was performed using Microbial Sourcetracker [203] and using data

from human and environmental sites in Hewitt et al [207]. GraPhlAn [235] was used for

visualization of associations and phylogenetic relationships.
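
For reference, the two summary statistics named above can be written out directly. The sketch below is plain Python rather than the 'vegan' implementation. It assumes the standard definitions: the inverse Simpson index as one over the sum of squared proportions, and Bray-Curtis dissimilarity as the sum of absolute differences divided by the sum of all abundances. The example profiles are hypothetical.

```python
def inverse_simpson(abundances):
    """Inverse Simpson index: 1 / sum(p_i^2) over nonzero proportions."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return 1.0 / sum(p * p for p in proportions)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


# Hypothetical relative-abundance profiles over the same four taxa.
even_profile = [0.25, 0.25, 0.25, 0.25]
skewed_profile = [0.97, 0.01, 0.01, 0.01]

alpha_even = inverse_simpson(even_profile)      # four equally abundant taxa
alpha_skewed = inverse_simpson(skewed_profile)  # dominated by one taxon
distance = bray_curtis(even_profile, skewed_profile)
```

An even community of four taxa has an inverse Simpson index of 4, while a community dominated by one taxon approaches 1. A matrix of pairwise Bray-Curtis values is the usual input to PCoA.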

Shotgun library sequencing and quality control

DNA was extracted using the MoBio PowerLyzer PowerSoil DNA extraction kit (MO

BIO Laboratories, Inc.) as described for 16S sequencing libraries. Only samples with at least 80

ng/µL were selected and sent to the Broad Institute for shotgun library construction. Libraries

were constructed using the Illumina Nextera XT method and sequenced on the Illumina HiSeq

2000 platform with 100 bp paired-end (PE) reads. The sequencing depth was 16.7×10⁶ PE reads

per sample. The KneadDATA v0.3 pipeline (http://huttenhower.sph.harvard.edu/kneaddata)

was used to remove low quality reads and human host sequences. Further details can be found

in the Supplemental Information.

Taxonomic and functional profiling of metagenomes

Pan-microbial (bacterial, archaeal, viral, and eukaryotic) taxonomy was determined

using MetaPhlAn2 [136] (http://huttenhower.sph.harvard.edu/metaphlan2). 1,340 microbial

clades comprising 499 species were identified (Table II-2), and filtered for relative abundance ≥

0.1% in at least two samples for downstream multivariate analysis with MaAsLin [211]. For all

MaAsLin analysis involving shotgun taxonomic and functional profiles, we used one metadata

category: collapsed surface types, which included chairs (seat and seat backs), holds (grips,

horizontal and vertical poles), and touchscreens.

Functional genomic profiles were generated with HUMAnN2 version 0.3.0 [148]

(http://huttenhower.sph.harvard.edu/humann2), which leverages the UniRef [143] orthologous

gene family catalog, along with the MetaCyc [144], UniPathway [236], and KEGG [139]

databases. HUMAnN2 gives three outputs: 1) UniRef proteins and their abundances in

reads per kilobase (RPK), 2) MetaCyc pathways and their abundances in RPK, and 3) MetaCyc

pathways and their coverage, ranging from 0 to 1. HUMAnN2 further calculates the RPK and

coverage for each microbial taxon observed in MetaPhlAn2 for each UniRef protein and MetaCyc

pathway.

To look at the functional profile, we collapsed 3,975,869 UniRef50 protein families into

12,074 KEGG Orthology (KO) numbers. UniRef50 proteins that did not belong to any KOs were

not analyzed further. We sum-normalized KO RPKs and focused on KOs with mean abundance

greater than the overall median abundance and variances in the 90th percentile. We identified

KOs that were significantly enriched in chairs, holds, and touchscreens using MaAsLin [211]

with a false discovery rate (FDR) < 0.05. KO differences between surface types were heavily

influenced by the presence of Propionibacterium acnes. To remove this influence, we removed P.

acnes’ RPK contribution to each UniRef50 protein and then re-summed the overall UniRef50

RPK from the remaining taxa. UniRef proteins were again collapsed into KOs and subjected to

the analysis described above. We then compared KOs that were significantly enriched in seats,

holds, and touchscreens before and after P. acnes removal. Tables with KO RPKs are at

http://huttenhower.sph.harvard.edu/MBTA2015.
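
The KO-level steps above (removing one taxon's contribution, collapsing gene families into KOs, and selecting abundant, variable KOs) can be sketched as follows. This is an illustrative Python sketch, not the code used in the study. It assumes that the "overall median" is taken over all KO-by-sample values, that the variance cutoff is a nearest-rank 90th percentile of per-KO variances, and that the KO table is already sum-normalized. All identifiers, mappings, and values are hypothetical.

```python
import math
from statistics import mean, median, pvariance


def remove_taxon(stratified_rpk, taxon):
    """Re-sum each gene family's RPK after dropping one taxon's contribution."""
    return {
        family: sum(rpk for name, rpk in by_taxon.items() if name != taxon)
        for family, by_taxon in stratified_rpk.items()
    }


def collapse_to_ko(family_rpk, family_to_ko):
    """Sum gene-family RPKs into KOs; families without a KO are dropped."""
    ko_rpk = {}
    for family, rpk in family_rpk.items():
        ko = family_to_ko.get(family)
        if ko is not None:
            ko_rpk[ko] = ko_rpk.get(ko, 0.0) + rpk
    return ko_rpk


def select_kos(ko_table, variance_quantile=0.9):
    """Keep KOs with mean above the overall median and variance at or above
    the chosen quantile of per-KO variances (nearest-rank; an assumption)."""
    overall_median = median(x for values in ko_table.values() for x in values)
    variances = {ko: pvariance(values) for ko, values in ko_table.items()}
    ranked = sorted(variances.values())
    cutoff = ranked[max(math.ceil(variance_quantile * len(ranked)) - 1, 0)]
    return sorted(
        ko for ko, values in ko_table.items()
        if mean(values) > overall_median and variances[ko] >= cutoff
    )


# One hypothetical sample: RPK per gene family, stratified by taxon.
stratified = {
    "UniRef50_A": {"Propionibacterium_acnes": 80.0, "Staphylococcus_epidermidis": 20.0},
    "UniRef50_B": {"Propionibacterium_acnes": 10.0, "Micrococcus_luteus": 5.0},
    "UniRef50_C": {"Micrococcus_luteus": 7.0},  # no KO assigned below
}
family_to_ko = {"UniRef50_A": "K00001", "UniRef50_B": "K00001"}
without_pacnes = collapse_to_ko(
    remove_taxon(stratified, "Propionibacterium_acnes"), family_to_ko
)

# Toy KO table (KO -> values across three samples); real tables hold many more KOs.
ko_table = {
    "KO_x": [0.5, 0.1, 0.9],
    "KO_y": [0.3, 0.3, 0.3],
    "KO_z": [0.01, 0.02, 0.01],
}
selected = select_kos(ko_table)
```

Running the same selection on tables built with and without the removed taxon gives the before-and-after comparison described in the text.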

Identification and quantification of antibiotic resistance and virulence factor gene markers

Antibiotic resistance gene markers were generated with ShortBRED (Short Better Read

Extract Dataset) [219] from the Comprehensive Antibiotic Resistance Database (CARD) [220]

using UniRef90 [237] as a reference. ShortBRED virulence factor markers were generated from

the Virulence Factor DataBase (VFDB) [225] using UniRef50 [237] as a reference (due to the

availability of a previous version of these markers). ShortBRED maps the shotgun reads against

the markers, and returns normalized marker abundances as reads per kilobase per million reads

(RPKM). We aggregated and annotated antibiotic resistance gene markers using the antibiotic

resistance ontology (ARO) numbers in CARD.
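
As a worked illustration of the RPKM unit, the sketch below applies the generic definition: reads mapped to a marker, divided by marker length in kilobases and by sequencing depth in millions of reads. It does not reproduce ShortBRED's internal normalization, which may treat marker and read lengths differently. The helper for counting markers detected in at least two samples follows the criterion used in the Results, and all names and values are hypothetical.

```python
def rpkm(mapped_reads, marker_length_bp, total_reads):
    """Reads per kilobase of marker per million sequenced reads."""
    kilobases = marker_length_bp / 1_000
    millions = total_reads / 1_000_000
    return mapped_reads / (kilobases * millions)


def detected_markers(rpkm_table, min_samples=2):
    """Markers with RPKM > 0 in at least min_samples samples."""
    return sorted(
        marker for marker, values in rpkm_table.items()
        if sum(1 for v in values if v > 0) >= min_samples
    )


# Hypothetical example: 12 reads on a 120 bp marker in a sample of 9.8 million reads.
example = rpkm(mapped_reads=12, marker_length_bp=120, total_reads=9_800_000)

# Hypothetical RPKM values for three markers across three samples.
table = {
    "marker_1": [0.0, 0.5, 1.2],
    "marker_2": [0.0, 0.0, 3.0],
    "marker_3": [0.0, 0.0, 0.0],
}
detected = detected_markers(table)
```

Because RPKM divides by both marker length and sequencing depth, values are comparable between short and long markers and between shallowly and deeply sequenced samples.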

To facilitate cross-dataset comparison, we also generated 121 bp markers with

ShortBRED from the Antibiotic Resistance Database (ARDB) [238] using UniRef50 [237] as a

reference and aggregated these markers at the ARDB family level. We compared the

distribution of antibiotic resistance gene markers in our dataset to four previously profiled

shotgun datasets describing the gut microbiomes of 552 individuals from the United States [2,

223], China [222], Malawi [223], and Venezuela [223], as well as one shotgun dataset profiling

air microbiomes in a home, hospital (indoors and outdoors), pier, and offices (indoors and

outdoors) [221]. Virulence factors were annotated using VFDB ontologies available on

http://www.mgc.ac.cn/VFs/main.htm. ShortBRED results can be found in Table II-5.

Accession numbers

Raw sequence files were deposited into Sequence Read Archive (SRA) under the

National Center for Biotechnology Information (NCBI) with accession number PRJNA301589.

Acknowledgements

We thank the MBTA Transit Police Department, specifically Chief Paul MacMillan and

Detective Matthew Haney, for their support of this project. We are also grateful to MBTA police

officers Tommy O’Connor and Lieutenant David F. Albanese for their assistance during sample

collection. We also thank Sydney Lavoie and Gerrod Voit for additional laboratory and

computational assistance, and Boyu Ren and Koji Yasuda for helpful feedback and discussion.

Jessica L. Green would like to disclose her affiliation as CTO of Phylagen, Inc. which

does not conflict with the study. The authors declare no conflict of interest.

Chapter 4:

Conclusions

To understand a given microbial community, there are two major questions to be

answered: “Who is there?”, followed by “What are they doing?” DNA sequencing has proven

to be a powerful tool for answering these questions. It has the capability of surveying thousands

of organisms and millions of genes relatively quickly, but is limited in its ability to track

microbial activity. In addition, the size of the resulting datasets restricts most analyses to

identification of associations between microbial abundances and metadata, or a search for

biomarkers or keystone species. Understanding the complexity underlying these trends must

begin with i) characterizing the stability of the observed trend and ii) determining its activity

and its effect within and outside the microbial community. The former may be established via

comprehensive time-series sampling, while the latter may be achieved through the combination

of DNA sequencing with other ‘omics’, such as transcriptomics, proteomics, and metabolomics,

or through wet laboratory experiments.

In Chapter 2, we introduced WAAFLE, which is the first method for detection of de novo

LGT events from metagenomes. A tool that can utilize WMS sequencing data is important, since

the majority of tools for LGT detection are optimized for full genomes. With such tools, identifying

novel LGT events will require constant sequencing of whole genomes, which is achievable for

clinical isolates but difficult for single organisms within a complex community. The direct use

of metagenomes allows for LGT profiling of older datasets in the context of a community (as

opposed to cultured isolates), which may affect LGT activity. We next demonstrated proof of

concept by applying WAAFLE to the Human Microbiome Project Phase 1-II. Indeed, there are

limits to what we can detect: first, potential misassemblies based on read coverage

disproportionately affect contigs classified as LGT, and second, short contig lengths limit

detection of plasmids, unless there are novel rearrangements within them. Still, we were able to

identify high frequency LGT pairs across six major body sites, which increased in frequency

with shorter phylogenetic distances and higher taxonomic abundances. Most pairs were also

specific to environment (body site), though the buccal mucosa, supragingival plaque, and

tongue dorsum shared pairs with differential abundances. As expected, enriched functions in

LGT contigs included mobile elements such as transposons and phage, along with GMP

synthases and TonB outer membrane receptors.

Immediate next steps include characterizing LGT stability over time, as well as

determining how LGT frequency varies with disease and environment. Both approaches require

datasets with specific study designs: the former requires time-series data while the latter

requires case/control cohorts, or samples collected from the built-environment or environmental

sources. Applying WAAFLE to these datasets will help quantify LGT rates, which may occur at

the scale of minutes, days, or months; as well as determine how LGT rates change with disease,

or how they might be associated with cohort metadata (such as dietary intake or drug

administration). Analyses of whole genomes have shown that LGT rates are likely higher in

human-associated versus non-human associated environments: further work may identify the

taxa and functions responsible for increased LGT. Still, computational detection of events at the

DNA level does not indicate active use of transferred genes. To quantify LGT activity, WAAFLE

results should be combined with other ‘omics data in order to find actively transcribed or

translated LGT products. Results may also be combined with wet laboratory procedures such as

qPCR or transformation, to validate the presence or activity of transferred genomic segments,

respectively. Furthermore, attempts to induce LGT within culture may help identify conditions

such as abiotic/biological stress, specific spatial structure, or proximity of select taxon partners

that might favor LGT.

In Chapter 3, we described microbial communities on the Boston subway, which were

mostly derived from human skin and oral sites. Samples were collected from trains on the Red,

Orange, and Green Lines, as well as ticketing machines from Alewife, Park Street, South Station,

Forest Hills, and Riverside. The original intent of the study design was to see if microbial

communities might vary based on the demographic served. Instead, microbial communities on

trains mostly varied by surface type, likely due to rider interactions such as sitting on seats or

touching the ticketing machines, while microbial communities on touchscreens varied mostly

by indoor or outdoor location. Functional profiles were dominated by systems for anaerobic

respiration and porphyrin synthesis, which reflected the high abundance of Propionibacterium

acnes. Overall, the number of antibiotic resistance genes was lower than that found in the

human gut.

Future directions include identifying the stability of these high-traffic spaces as well as

determining the proportion of live, dormant, and dead microbes. The former will require

sampling the subway at regular intervals over a longer period of time. This sampling strategy

will enable us to determine if there is a consistent built-environment microbiome: if so,

fluctuations may be useful indicators of disease outbreak, or simply indicators of changing

seasons, or both. For the latter, microbial viability may be measured using a variety of methods,

including sample treatment with propidium monoazide or cell sorting to distinguish between

DNA from intact versus dead cells, isolation of RNA rather than DNA for transcriptomic

activity, identification of protein synthesis through fluorescent click-chemistry (such as

BONCAT), or measurement of cellular activity through ATP assays. Multiple methods will

need to be tested, as contamination and low biomass are common problems for built-environment

samples. Furthermore, if the majority of microbes in built-environment samples are dead, then

profiling should shift from looking at microbial taxa to looking at metabolites, or microbial

components such as pathogen-associated molecular patterns (PAMPs), which may stimulate the

human immune system.

Long term goals include understanding how LGT affects microbial evolution and how

the built-environment influences human health, especially immune development. It is unclear

what role LGT plays in speciation and whether that role differs today versus the evolutionary

past. Still, it is clear that LGT has a clinical impact, especially in the rise in antibiotic resistance.

If we can identify the conditions under which LGT occurs, as well as the specific gene segments

and taxa participating in transfer, it may be possible to use LGT to alter microbial community

structure or processes, or predict short-term microbial evolution (especially for pathogens).

Some work has also suggested that LGT helps maintain bacterial species: thus, a better

understanding may help refine the “species concept” for bacteria, leading to better taxonomic

assignment and calculation of phylogenetic distance.

In contrast, to characterize the effects of the built environment on human immune

development, studies should move beyond single built-environment types, and begin i)

comparing same-purpose built-environment structures that lead to differential health

outcomes, or ii) comparing different built-environment structures to identify their similarities

and differences. An example of the former involves surveying nursing homes with varying

survival rates, while an example of the latter would include examining a rural home versus

urban home. Study of the former could identify aspects of building design that might facilitate

better health outcomes through microbial community modulation, such as increased ventilation

or changes in hygiene. Study of the latter can help establish a baseline built-environment, likely

to be skin and oral microbes, and determine which microbes or PAMPs could potentially be

introduced. This is especially important if constant exposure to skin and oral-derived microbes

leads to adverse health outcomes. Multiple diseases have been associated with the microbiome,

of which a subset are linked to Western lifestyle and diet. This has led to an extensive search for

therapeutics to modulate the human microbiome. A better understanding of LGT and the built-

environment microbiome may help spur therapeutics, and highlight adaptive mechanisms used

by the microbiota and host to adjust to the “new normal.”

Appendix I:

Supplemental Materials for Chapter 2

Supplemental Figures

Figure I-1. Filtering potential misassemblies. To search for misassemblies, shotgun reads were

mapped to contigs using Bowtie2. We then examined both read coverage (Step 1) and read

support (Step 2) for gene junctions, or the regions between two genes on a contig. Genes

containing any single junction that fails both steps 1 and 2 are removed from analysis.

Figure I-2. Determining which contig types contain misassemblies. In A), we show the

percentage of contigs filtered out via read mapping, stratified by whether WAAFLE classified

them as LGT or not. We find that more LGT contigs are filtered out, as expected. In B), we

examine the gene junction type and determine what percentage have read support. Here “AA”

junctions are defined as gene junctions between two genes annotated to the same taxon, while

“AB” junctions are defined as gene junctions between two genes annotated to different taxa. As

expected, junctions between genes annotated to different taxa have less read support.

Figure I-3. Gene call evaluation. To assess how well WAAFLE calls genes, we varied the

subject coverage threshold (for including a BLAST hit), gene length threshold (above which to

include the gene), and minimum overlap (above which to merge a BLAST hit into a gene

group).

Figure I-4. LGT evaluation with or without missing BLAST hits. We show the TPR against the

FPR for the LGT evaluation with 20% of BLAST hits removed (on left) versus the evaluation

with all BLAST hits (on right).

Figure I-5. Selection of k1 and k2. As in Fig. 2-2, we show the LGT evaluation for WAAFLE with

20% of BLAST hits removed. On the left, we hold k2 at 0.8 while we vary k1 from 0.1 to 0.9 (blue

line represents default k1 chosen). On the right, we hold k1 at 0.5 while we vary k2 from 0.1 to 0.9

(red line indicates default k2 chosen). In A), colors indicate the inter-taxon level for LGT, for

example, “species” in red shows the TPR and FPR for inter-species LGT across different k1 and

k2 values. In B), colors indicate the taxonomic level at which WAAFLE is evaluated for

taxonomic assignment. For example, “species” in red indicates the percentage of correct species

calls in LGT contigs.


Figure I-6. Comparison of LGT measures. We attempted to quantify LGT frequencies per

sample using 2 methods: 1) the number of LGT contigs divided by the total number of genes, 2)

the number of genes in LGT contigs divided by the total number of genes. Initially, we were

concerned that the former might overestimate LGT in samples with many short contigs, while

the latter might overestimate LGT in samples with many long contigs. In this plot, each point

represents an LGT taxon pair in a body site. The x-axis is the first measure, while the y-axis is the

second measure. We found that the two measures were highly correlated within body sites,

indicating that higher values in either measure usually point to higher frequencies of LGT.

However, when comparing body sites, we observe a different y-axis scale for the posterior

fornix: the longer contigs in the posterior fornix may lead to larger LGT frequencies if gene

percentages (measure 2) are utilized rather than events per gene.
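The two measures compared in Figure I-6 can be written down in a few lines. The following is a minimal sketch, assuming each contig is summarized as a hypothetical (is_lgt, n_genes) pair; the function name and input layout are assumptions made for illustration and are not part of WAAFLE.

```python
def lgt_measures(contigs):
    """Return (events_per_gene, lgt_gene_fraction) for one sample.

    `contigs` is a list of (is_lgt, n_genes) tuples; this layout is a
    hypothetical one chosen for illustration.
    """
    total_genes = sum(n for _, n in contigs)
    if total_genes == 0:
        raise ValueError("sample has no genes")
    lgt_contigs = sum(1 for is_lgt, _ in contigs if is_lgt)
    lgt_genes = sum(n for is_lgt, n in contigs if is_lgt)
    # Measure 1: LGT contigs (events) divided by total genes.
    # Measure 2: genes on LGT contigs divided by total genes.
    return lgt_contigs / total_genes, lgt_genes / total_genes


# Two toy samples with one LGT contig each and ten genes in total;
# the second sample carries its LGT on a longer contig.
short_contigs = [(True, 2), (False, 2), (False, 2), (False, 2), (False, 2)]
long_contigs = [(True, 6), (False, 4)]
print(lgt_measures(short_contigs))
print(lgt_measures(long_contigs))
```

In this toy case the first measure is identical for both samples while the second is larger for the sample with the longer LGT contig, which is the kind of scale difference noted for the posterior fornix.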

Figure I-7. Jaccard and Bray-Curtis distances between inter-individual, intra-individual, and

technical samples. We looked to see if LGT detection via WAAFLE is reproducible across

technical replicates and stable in individuals. We focused on contigs with taxonomy resolved to

the genus level and inter-genus LGT events. For each body site, we subsampled half the


samples while including all technical replicates. In each sample, gene percentages were

quantified for inter-genus taxon pairs (number of genes in LGT pair divided by number of

sample genes) and single taxa (number of genes for taxa in sample divided by number of

sample genes). We then calculated Jaccard and Bray-Curtis distances between samples from

different individuals, the same individual but different time points, and technical replicates.
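For reference, the two distances can be sketched as follows. This is an illustrative implementation only, assuming each sample is a dictionary mapping a feature (a single taxon or an inter-genus taxon pair) to its gene percentage, and assuming Jaccard distance is computed on presence/absence; the feature names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the features present (value > 0)."""
    present_a = {k for k, v in a.items() if v > 0}
    present_b = {k for k, v in b.items() if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def bray_curtis_distance(a, b):
    """sum(|a_i - b_i|) / sum(a_i + b_i) over all features."""
    keys = set(a) | set(b)
    numerator = sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)
    denominator = sum(a.get(k, 0.0) + b.get(k, 0.0) for k in keys)
    return numerator / denominator if denominator else 0.0


# Hypothetical gene percentages for two samples.
sample_1 = {"genus_A": 0.5, "genus_B": 0.25, "genus_A|genus_B": 0.25}
sample_2 = {"genus_A": 0.5, "genus_C": 0.5}
print(jaccard_distance(sample_1, sample_2))
print(bray_curtis_distance(sample_1, sample_2))
```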

Figure I-8. Phylogenetic distances computed from random taxa pairs within body sites. For

each body site, we (i) randomly chose WAAFLE-called pairs (waafle) or (ii) generated taxon

pairs by randomly choosing two taxa weighted by gene percentages (simulated). For the

former, up to 1,000 pairs or the total number of taxon pairs were chosen, whichever number is

smaller. For the latter, 1,000 unique pairs were generated. We then plotted the A) phylogenetic

distance distribution and B) LGT joint abundance distribution. Joint abundances were

calculated by multiplying the gene percentage of one taxon (number of genes for a single taxon

in a sample divided by total sample genes, averaged across sites) against the other.
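The simulated pairs and joint abundances can be illustrated with a short sketch. It assumes a dictionary of per-taxon gene percentages (already averaged across sites) and uses Python's standard random module; the function names, the seed, and the toy values are assumptions for this example rather than the code used to produce the figure.

```python
import random


def joint_abundance(gene_pct, taxon_a, taxon_b):
    """Product of the two taxa's gene percentages."""
    return gene_pct[taxon_a] * gene_pct[taxon_b]


def simulate_pairs(gene_pct, n_pairs, seed=0):
    """Draw unique unordered taxon pairs, weighting taxa by gene percentage."""
    rng = random.Random(seed)
    # Only taxa with positive weight can ever be drawn.
    taxa = sorted(t for t, w in gene_pct.items() if w > 0)
    weights = [gene_pct[t] for t in taxa]
    max_pairs = len(taxa) * (len(taxa) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("requested more unique pairs than exist")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.choices(taxa, weights=weights, k=2)
        if a != b:  # a pair needs two different taxa
            pairs.add(tuple(sorted((a, b))))
    return pairs


# Hypothetical gene percentages for three taxa.
gene_pct = {"taxon_A": 0.5, "taxon_B": 0.25, "taxon_C": 0.25}
pairs = simulate_pairs(gene_pct, 3)
for a, b in sorted(pairs):
    print(a, b, joint_abundance(gene_pct, a, b))
```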


Supplemental Tables

Table I-1. WAAFLE Parameters. This table describes the 5 parameters used to tune the

WAAFLE pipeline.

| Parameter | Definition | WAAFLE steps involved (and default values) | Effect |
|---|---|---|---|
| Subject coverage (s) | Percentage of a reference gene (subject sequence) that aligned to the contig (query sequence) | Step 2: s = 0.75; Step 3: s = 0 | Increasing subject coverage filters out low quality BLAST hits when calling genes (Step 2) and scoring taxa (Step 3). Including higher quality BLAST hits in Step 2 led to more accurate gene calling. Including more BLAST hits in Step 3 led to higher taxon scores. |
| Overlap percentage (o) | Length of the region by which two nucleotide fragments overlap, divided by the length of the shorter fragment | Steps 2 & 3: o = 0.1 | Lowering overlap percentage allows more hits to be merged into groups, leading to fewer gene calls (Step 2). Inclusion of more BLAST hits per gene for taxon scoring (Step 3) can lead to higher scores. |
| Gene length (g) | Length of gene called or supplied to WAAFLE | Step 2: g = 200 bp | A higher gene length cutoff prevents LGTs from being called due to spurious gene calls, and leads to lower numbers of genes per contig. |
| One taxon score (k1) | A single taxon’s minimum score across all genes in a contig | Step 4: k1 = 0.5 | A lower threshold for the one taxon score makes it easier for WAAFLE to annotate a contig as “No LGT”. |
| Two taxon score (k2) | The minimum score for two taxa after maximizing scores between them across all genes in a contig | Step 4: k2 = 0.8 | A lower threshold for the two taxon score makes it easier for a contig to be called “LGT”. |
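As an illustration of the overlap percentage (o) defined in Table I-1, the sketch below computes the overlap between two aligned fragments on a contig. It assumes 1-based inclusive coordinates, and the function name is hypothetical; it is not taken from the WAAFLE source.

```python
def overlap_fraction(start_a, end_a, start_b, end_b):
    """Overlap length divided by the length of the shorter fragment.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    overlap = min(end_a, end_b) - max(start_a, start_b) + 1
    if overlap <= 0:
        return 0.0  # the fragments do not overlap
    shorter = min(end_a - start_a + 1, end_b - start_b + 1)
    return overlap / shorter


# A 100 bp hit and a 200 bp hit sharing 10 bp sit exactly at the default o = 0.1.
print(overlap_fraction(1, 100, 91, 290))
# A hit fully contained in another has an overlap fraction of 1.
print(overlap_fraction(1, 100, 51, 100))
```

Dividing by the shorter fragment means a short hit nested inside a long one counts as fully overlapping, regardless of how long the longer hit is.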


Appendix II:

Supplemental Materials for Chapter 3


Supplemental Figures

Figure II-1. Biomass and alpha diversity for train and station samples. (A) Biomass from

samples collected across the subway system. Each data point represents a pooled sampling

strategy in which two or three swabs from the same site were pooled and jointly extracted.

DNA yield is plotted in ng/mL. (B) Alpha diversity by surface type, as measured by the inverse

Simpson diversity index. In both (A) and (B), colors represent the train line from which each

sample was derived (red, orange, or green for trains or stations on that line; black for samples from

within a downtown station).
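The inverse Simpson index used in panel (B) is the reciprocal of the sum of squared relative abundances. A minimal sketch, assuming a plain list of per-OTU counts for one sample (the function name is illustrative):

```python
def inverse_simpson(counts):
    """Inverse Simpson diversity: 1 / sum(p_i ** 2)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no counts")
    return 1.0 / sum((c / total) ** 2 for c in counts)


# Four equally abundant OTUs give an index of 4; a single OTU gives 1.
print(inverse_simpson([25, 25, 25, 25]))
print(inverse_simpson([10, 0, 0]))
```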

Figure II-2. Ordination of surface data subsets. (A) Train hold surfaces by train line, (B) train

chair surfaces by train line, (C) train chairs by material, and (D) touchscreen surfaces by date.

All ordinations are principal coordinates analyses using Bray-Curtis distance, colored by

metadata category, calculated using filtered OTU relative abundance table subsets of the

relevant samples.


Figure II-3. Comparison of antibiotic resistance markers from the ARDB database. RPKMs of

antibiotic resistance gene markers from air microbiomes in New York City (office) and San Diego

(hospital, home, pier), the Boston MBTA, and gut microbiomes from 552 individuals from the

United States, China, Malawi, and Venezuela.


Figure II-4. Letter from the MBTA. We received MBTA approval, by way of the MBTA Transit

Police, to carry out the study prior to grant submission and confirmed detailed sampling plans

with the MBTA prior to any public work. Their assistance and input were invaluable both for

study design and for safe execution of sample collection, and this letter includes the initial

approval of the work from Chief MacMillan.


Supplemental Tables

Supplemental Tables are too large to display and are available online at

http://huttenhower.sph.harvard.edu/MBTA2015. Captions are included below for reference.

Table II-1. Sample collection and metadata. Includes metadata for all collected samples that

were sequenced via 16S amplicon or shotgun sequencing. Abbreviations are defined at the

bottom.

Table II-2. 16S and shotgun OTU tables along with taxa present across sequencing plate. The

first tab contains the 16S OTU counts after quality control, stitching, length filtering, removal of

chloroplast, mitochondria, and archaea, and filtering for at least 0.1% in 1 sample. The second

tab contains the unfiltered MetaPhlAn OTU table with percentages (100 = 100%). Note that

additional filtering was performed before LEfSe and MaAsLin runs for both 16S (at least 0.1% in

7 samples) and shotgun (at least 0.1% in 2 samples). The third tab contains our analyses to

identify contaminant taxa. As a negative control, we examined all samples present on a

sequencing plate containing a subset of MBTA samples, which included touchscreens (n=21),

trains (n=6), 30 saliva cultures, 13 skin samples, and 2 macaque tissue samples. The listed taxa

were present in 80% of samples with at least 0.00001 abundance, and are shown with their

average abundance across all samples. This provides a quality control test for potential

contaminant taxa, none of which were nontrivially abundant or significant during our MBTA

analyses.

Table II-3. LEfSe and MaAsLin analysis for 16S sequencing. The first tab contains LEfSe

results when searching for differentially abundant taxa between touchscreen locations

(outdoors (out), underground (under), and indoors near an exit facing an outside environment

(inout)). Significant results report both logarithmic LDA scores and p-values. The second tab

contains results for MaAsLin run with four covariates, including surface category, surface type,

surface material, and surface location. Only organisms with q<0.25 are reported.

Table II-4. MaAsLin analysis for shotgun data. MaAsLin analysis was performed to identify

differentially abundant taxa (first and second tab) and KOs (third and fourth tab) with respect

to surface type. For both, surface type was split into chairs (seat backs and seats), holds

(horizontal/vertical poles, grips), and touchscreens. For identifying differentially abundant taxa,

we performed MaAsLin with full taxonomies at all levels (first tab) as well as with species only

(second tab). All results are reported: we considered organisms with q<0.25 to be significant. For

identifying differentially abundant KOs, we performed MaAsLin on KO abundances calculated

using all shotgun reads (third tab) and after P. acnes-associated reads were removed (fourth

tab). Only significant results are reported; these are KOs with q<0.05.

Table II-5. Antibiotic resistance gene and virulence factor markers. RPKM values for CARD

(first tab), VFDB (second tab), and ARDB (third tab). The RPKM values for CARD and VFDB are

only for MBTA data; the ARDB data contains values from multiple shotgun datasets.
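For readers unfamiliar with the unit, RPKM (reads per kilobase per million reads) can be sketched as below. This uses the standard definition; whether the denominator counted all reads or only mapped reads is not stated here, so the `total_reads` argument is an assumption.

```python
def rpkm(mapped_reads, marker_length_bp, total_reads):
    """Reads per kilobase of marker per million reads in the sample."""
    if marker_length_bp <= 0 or total_reads <= 0:
        raise ValueError("length and read total must be positive")
    return mapped_reads * 1e9 / (marker_length_bp * total_reads)


# 10 reads on a 1 kb marker in a sample of one million reads.
print(rpkm(10, 1000, 1_000_000))
```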



Supplemental Information

The BioProject number, protocols, raw data tables, and supplemental tables can be

downloaded at http://huttenhower.sph.harvard.edu/MBTA2015.

Methods and Materials

DNA extraction, 16S amplification and sequencing. Samples were processed using the

MoBio PowerLyzer PowerSoil DNA extraction kit (MO BIO Laboratories, Inc.) using bead-

beating homogenization. For each sample, 2 or 3 swabs from the same site were pooled for

optimal biomass recovery. Each swab was individually homogenized in a bead-beating tube at

6.0 m/s for 40 seconds on the MP Biomedicals FastPrep 24, but subsequent cleanup was pooled

over one column. DNA extracts were quantified using a Qubit fluorimeter and sent to

the Broad Institute for sequencing. Amplification and sequencing by Illumina MiSeq were

performed as described previously [232]. In brief, genomic DNA was subjected to 16S

amplification using primers designed incorporating the Illumina adapters and a sample barcode

sequence, allowing directional sequencing covering variable region V4 (Primers:

515F[GTGCCAGCMGCCGCGGTAA] and 806R [GGACTACHVGGGTWTCTAAT]). PCR was

performed in triplicate with 1 μl of template (1:50), 10 μl of HotMasterMix with the HotMaster

Taq DNA Polymerase (5 Prime), and 1 μl of primer mix (for final concentration of 10 μM). The

cycling conditions consisted of an initial denaturation of 94°C for 3 min, followed by 24 cycles of

denaturation at 94°C for 45 sec, annealing at 50 °C for 60 sec, extension at 72°C for 5 min, and a

final extension at 72°C for 10 min. Amplicons were quantified on the Caliper LabChipGX

(PerkinElmer, Waltham, MA), pooled in equimolar concentrations, size selected (375-425 bp) on



the Pippin Prep (Sage Science, Beverly, MA) to reduce non-specific amplification products

from host DNA; final library sizing and quantification were performed on an Agilent

Bioanalyzer 2100 DNA 1000 chip (Agilent Technologies, Santa Clara, CA). Sequencing was

performed on the Illumina MiSeq platform (version 2) according to the manufacturer’s

specifications with addition of 5% PhiX, and yielded paired-end reads of 150 bp in length in

each direction. Total read depth was at least 5,000 reads (up to over 100,000 reads) per sample.

OTU calling. Quantitative Insights into Microbial Ecology (QIIME) software [233]

version 1.8 was used for data processing. Paired-end reads (with approximately 97 bp overlap)

were stitched and size selected (225 – 275 bp) to reduce nonspecific amplification products.

Operational taxonomic units (OTUs) were called with a closed reference

(pick_closed_reference_otus.py) using the Greengenes reference version 13.5 at the 97% identity

level based on the PICRUSt [127] protocol. Using these parameters, we observed 17,954 unique

OTUs. We filtered low-abundance OTUs (minimum abundance threshold 0.001 in at least one of

72 samples); this reduced the dataset to 2,134 unique OTUs representing 501 unique genera.

Since the primers used in the study were designed to amplify bacterial 16S genes, we filtered

out OTUs that corresponded to chloroplasts, mitochondria, and archaea. OTU frequencies in

samples were then sum-normalized to proportional data. The filtered OTU tables can be found

in Table II-2.
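The abundance filter and sum-normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the QIIME commands we ran: the table is represented as a hypothetical dictionary of per-sample counts, and the removal of chloroplast, mitochondrial, and archaeal OTUs is omitted.

```python
def relative(row, totals):
    """Divide each count by its sample total (0.0 when the sample is empty)."""
    return [c / t if t else 0.0 for c, t in zip(row, totals)]


def column_totals(table, n_samples):
    """Sum counts per sample across all OTUs in the table."""
    return [sum(row[j] for row in table.values()) for j in range(n_samples)]


def filter_and_normalize(counts, min_abundance=0.001, min_samples=1):
    """Drop low-abundance OTUs, then sum-normalize each sample.

    counts: {otu_id: [count in sample 0, count in sample 1, ...]}
    (a hypothetical layout used only for this sketch).
    """
    n_samples = len(next(iter(counts.values())))
    totals = column_totals(counts, n_samples)
    kept = {
        otu: row
        for otu, row in counts.items()
        if sum(r >= min_abundance for r in relative(row, totals)) >= min_samples
    }
    kept_totals = column_totals(kept, n_samples)
    return {otu: relative(row, kept_totals) for otu, row in kept.items()}


# Toy table: "otu_rare" never reaches 0.001 in any sample and is removed.
counts = {
    "otu_a": [9995, 5000],
    "otu_b": [0, 4995],
    "otu_rare": [5, 5],
}
table = filter_and_normalize(counts)
print(sorted(table))
```

Raising `min_samples` gives the stricter variants of the same filter, such as requiring the minimum abundance in several samples rather than one.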

KneadData. KneadData incorporates Trimmomatic [239] and bowtie2 [240] for filtering

and human sequence removal, respectively. Reads were scanned with a four-base wide sliding

window and trimmed when the average base Phred score dropped below 20. Trimmed reads


shorter than 70 nt were discarded. UCSC Human genome assembly version hg38 was used as

reference for removal of human sequences. The average sequencing depth after quality control

was 9.8 × 10^6 reads per sample.
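The quality-trimming rule can be illustrated with a simplified sliding-window sketch. This is not Trimmomatic's implementation and may differ from it in edge cases; it only shows the idea of cutting a read at the first four-base window whose mean Phred score falls below 20 and discarding reads that end up shorter than 70 nt. Human read removal against hg38 is not covered, and the function names are hypothetical.

```python
def sliding_window_cut(quals, window=4, min_avg=20):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean
    quality drops below `min_avg`.
    """
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def keep_after_trimming(quals, min_len=70):
    """True if the trimmed read is at least `min_len` bases long."""
    return sliding_window_cut(quals) >= min_len


# Toy reads: 80 (or 60) high-quality bases followed by a low-quality tail.
good_then_bad = [30] * 80 + [10] * 20
short_good = [30] * 60 + [2] * 40
print(sliding_window_cut(good_then_bad), keep_after_trimming(good_then_bad))
print(sliding_window_cut(short_good), keep_after_trimming(short_good))
```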

Negative control analyses. Unfortunately, our study did not include negative controls

beyond those internal to the sequencing platform. Instead, we took several measures during

analysis to test for contamination in the 16S datasets. First, we looked at relative abundances

across multiple sets of samples on the same sequencing plate, since taxa present across all

samples may indicate contamination (especially since the batch included many non-transit

samples). This was possible mainly for the touchscreen samples (n=21) and a few train samples

(n=6), which were pooled with 30 saliva cultures, 13 skin samples, and 2 macaque adipose

samples. At the species level, we found 42 taxa (of 1647 total) that were present in 80% of

samples, with average abundance ranging from 0.018% (Pseudomonas unknown) to 11.1%

(Actinomyces unknown). Many of these are skin-associated, including Pseudomonas,

Staphylococcus, and Corynebacterium (in order of increasing abundance), or associated with the

oral cavity, including Fusobacterium, Veillonella, Peptostreptococcus, Streptococcus, Prevotella, and

Porphyromonas (in order of increasing abundance) (Table II-2). It is unclear whether the latter group

reflects the large number of saliva samples in this dataset or true contamination. None of the taxa with

lower average abundance are key to our findings.
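The plate-wide screen described above amounts to a prevalence filter. A minimal sketch, assuming relative abundances are stored per taxon as a list across all samples on the plate, and using the thresholds given in the Table II-2 caption (present in 80% of samples at an abundance of at least 0.00001); the taxon names and values are placeholders, not our data.

```python
def prevalent_taxa(rel_abund, min_abundance=1e-5, min_prevalence=0.8):
    """Return {taxon: mean abundance} for taxa seen in most samples.

    rel_abund: {taxon: [relative abundance in each sample on the plate]}
    (a hypothetical layout used only for this sketch).
    """
    flagged = {}
    for taxon, values in rel_abund.items():
        n_present = sum(v >= min_abundance for v in values)
        if n_present / len(values) >= min_prevalence:
            flagged[taxon] = sum(values) / len(values)
    return flagged


# Placeholder plate with five samples.
plate = {
    "taxon_widespread": [0.5, 0.25, 0.25, 0.5, 0.0],
    "taxon_sporadic": [0.5, 0.0, 0.0, 0.0, 0.0],
}
flagged = prevalent_taxa(plate)
print(flagged)
```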

Chloroplast and mitochondrial sequences were considered to be a type of

contaminant in our study, inasmuch as they essentially represent plant- and human-material

derived reads. They were found across all touchscreen and surface samples, but at very low


levels in adipose tissue (primate, not human-derived) and saliva. Others have claimed that

chloroplast DNA may be an artifact of cotton swabs rather than environmental exposures; our

skin samples were processed with Copan swabs and yielded 1-2 orders of magnitude fewer

chloroplast sequences (<1% maximum). Our standard primer pairs are known to amplify

chloroplast and mitochondrial sequences: this is a well-known problem for those that study

plant-associated microbial communities [241, 242]. Chloroplast DNA percentages varied from

1.32%-6.98%, and mitochondrial DNA percentages from 0.054-1.03%, in the touchscreens. They varied even more in the train data (not

pooled with the touchscreens): chloroplast DNA ranged from 0.9% to 62.39%, with especially

high levels on the Red line, while mitochondrial DNA varied from 0-8.27% on the trains (data

available via website). This led to our analysis strategy of treating both sequence types like

typical contaminants, discounting their sequence abundance, renormalizing, and analyzing

primarily the resulting quality-controlled datasets.

Physical negative controls should be part of future study design, as recommended by

Adams et al [114] and Salter et al [115]. Their use, we note, must still be context dependent, as

no one blanket analysis is likely to apply to different sample and contaminant types. Some

studies have utilized the approach outlined in Flores et al, where OTUs constituting greater

than 1% of the total negative control sequences were removed from all samples prior to

rarefaction and analyses [243]. Another approach developed by Meadow et al involves

searching for taxa with high abundance in negative controls relative to samples: this is done by

plotting the relative abundances of taxa in negative controls against the relative abundances of

taxa in samples and applying a cutoff [190]. Adams et al performed a meta-analysis of built

environment studies, and reported phylum Tenericutes as significantly enriched in kit


microbiomes, and Cyanobacteria (or chloroplast) as highly abundant in dust but not in kits.

They also mention that skin taxa are often found as contaminants, but removing them could

remove true signal. Typical kit contaminant taxa were also not significant in our study.
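For future studies that do include physical negative controls, the first approach summarized above (removing OTUs that make up more than 1% of negative control sequences) could be sketched as follows. This follows only the description given here rather than any published code, and the data layout and function names are assumptions.

```python
def control_contaminants(control_counts, max_fraction=0.01):
    """OTUs making up more than `max_fraction` of negative control reads."""
    total = sum(control_counts.values())
    if total == 0:
        return set()
    return {otu for otu, c in control_counts.items() if c / total > max_fraction}


def remove_otus(sample_counts, contaminants):
    """Drop flagged OTUs from one sample's counts."""
    return {otu: c for otu, c in sample_counts.items() if otu not in contaminants}


# Placeholder counts: "otu_a" dominates the negative control.
control = {"otu_a": 98, "otu_b": 1, "otu_c": 1}
sample = {"otu_a": 40, "otu_b": 300, "otu_d": 60}
flagged_otus = control_contaminants(control)
cleaned = remove_otus(sample, flagged_otus)
print(flagged_otus, cleaned)
```

As the text notes, a fixed cutoff like this can remove true signal when the flagged taxa are skin-associated, so the threshold should be chosen with the sample type in mind.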

Comparison to the NY subway study

To expand our comparison with the previous NYC subway study, we downloaded their

MetaPhlAn2 tables (provided at the time of the NYC publication by Nicola Segata in

collaboration with our group) from their supplementary data. We applied a simple quality

control filter by retaining taxa with at least 0.1% abundance in at least 1% of samples (14

samples), and then focused specifically on the samples most similar to ours, i.e. from subway

stations or trains.

In the NYC study, the most abundant taxa in the resulting 1,416 samples included

Pseudomonas stutzeri (27.01%), Pseudomonas unclassified (8.66%), Enterobacter cloacae (7.66%),

Stenotrophomonas maltophilia (7.10%), and Acinetobacter pittii/calcoaceticus/nosocomialis (3.39%).

Neither Yersinia nor Bacillus anthracis was present in any samples. These results are strikingly

different from our top species from similarly analyzed metagenomic data, which included

Propionibacterium acnes (47.44%), Propionibacterium phage (total ~6%), Micrococcus luteus (2.40%),

and Staphylococcus epidermidis (1.98%). This may be due to a combination of factors, most likely

the different types of surfaces sampled, but also including the swab protocol development and

biomass validation prior to sequencing carried out for our study (see Methods). Most of our

samples represent heavily utilized, nonporous, non-sanitized surfaces within train cars or, less

often, stations; in contrast, NYC study surfaces include benches (n=326), rails/poles (condensed


from other categories, n=468), garbage cans (n=142), kiosks (n=161), turnstiles (n=151), and doors

(n=77), with all other surfaces sampled <24 times.

In support of this hypothesis, the NYC microbiomes at least in part do resemble those of

other built environment surfaces and dust. Adams et al, for example [244], collected dust in

vacuums or passively (through settlement). The former, which was considered homogenized,

had significantly higher levels of Pseudomonadales, Enterobacteriales, and Streptophyta as

compared to the latter. Overall, Gammaproteobacteria dominated most samples (average 76.8%),

still primarily from Pseudomonadales and Enterobacteriales, and overshadowed the Bacilli

(6.68%), Betaproteobacteria (5.02%), Alphaproteobacteria (4.80%), and Actinobacteria

(2.48%). The NYC subway had high levels of Enterobacteriales (17.90%) and Pseudomonadales

(49.61%), but no Streptophyta (0%, suggesting a possible sampling or extraction bias).

However, it is difficult to compare NYC swabbed samples (or our own) to vacuumed or settled

dust, given the extreme heterogeneity seen in the latter for distinct space types or time

integration periods. The dust profiled by Adams et al, for example, was in turn quite distinct from

dust in the International Space Station [245], a mixed-use academic classroom building [201],

or house dust [92], none of which significantly resembled our skin-dominated MBTA surfaces.

Taking these unusual features of the NYC subway data as given, however, we sought to

determine whether surface material was at least a major determinant of their microbial

community composition, as it proved to be for ours. We grouped their sample metadata into

four categories: type of object (bench, rail/pole, garbage can, kiosk, turnstiles, etc.), surface

material (wood, metal, plastic, etc.), object category (station, train, etc.), and borough (Queens,


Brooklyn, Manhattan, etc.). Applying the MaAsLin multivariate linear model to these variables

jointly, we found 71 differentially abundant clades at FDR<0.25.

Surprisingly, none of these associations were with surface material type; most instead

segregated with object type, which may at least be concordant with the much greater diversity

of objects sampled in the NYC study. Rails and poles had lower levels of Pseudomonas and

Acinetobacter lwoffii as compared to benches, for example, while garbage cans had higher levels

of Enterococcus italicus and Leuconostoc. Clostridia and Klebsiella (not marine taxa) were found in

the abandoned South Ferry and Penn Station timecourse samples, as well as in trains as

compared to all other stations. Lastly, and also surprising, some taxa were associated with

borough: this includes higher levels of Acinetobacter and Moraxellaceae in Manhattan as

compared to the Bronx. Without more detail on the study’s exact sampling protocol - which

parts of these diverse objects were swabbed, for example, and for how long over what surface

area - it is difficult to interpret statistically significant but low effect size differences. It may be

useful for future studies to sample fewer, more controlled environments with greater

specificity, and of course to assess the results with more careful and targeted metagenomic

analyses.


References

1. Sender, R., S. Fuchs, and R. Milo, Revised Estimates for the Number of Human and Bacteria

Cells in the Body. PLoS Biol, 2016. 14(8): p. e1002533.

2. Consortium, T.H.M.P., Structure, function and diversity of the healthy human microbiome.

Nature, 2012. 486(7402): p. 207-14.

3. Engineering, N.A.o., E. National Academies of Sciences, and Medicine, Microbiomes of the

Built Environment: A Research Agenda for Indoor Microbiology, Human Health, and Buildings.

2017, Washington, DC: The National Academies Press. 253.

4. Shapiro, J.A., Thinking about bacterial populations as multicellular organisms. Annu Rev

Microbiol, 1998. 52: p. 81-104.

5. Meadow, J.F., et al., Humans differ in their personal microbial cloud. PeerJ, 2015. 3: p. e1258.

6. Rosenthal, M., et al., Skin microbiota: microbial community structure and its potential

association with health and disease. Infect Genet Evol, 2011. 11(5): p. 839-48.

7. Leewenhoeck, A.v., Observations, Communicated to the Publisher by Mr. Antony van

Leewenhoeck, in a Dutch Letter of the 9th of Octob. 1676. Here English'd: concerning Little

Animals by Him Observed in Rain-Well-Sea. and Snow Water; as Also in Water Wherein Pepper

Had Lain Infused. Philosophical Transactions Royal Society, 1677. 12: p. 821-831.

8. Adler, A. and E. Ducker, When Pasteurian Science Went to Sea: The Birth of Marine

Microbiology. J Hist Biol, 2017.

9. Razumov, A., The direct method of calculation of bacteria in water: comparison with the Koch

method. Mikrobiologija, 1932. 1: p. 131-146.

10. Staley, J.T. and A. Konopka, Measurement of in situ activities of nonphotosynthetic

microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol, 1985. 39: p. 321-46.

11. Stewart, E.J., Growing unculturable bacteria. J Bacteriol, 2012. 194(16): p. 4151-60.


12. Soucy, S.M., J. Huang, and J.P. Gogarten, Horizontal gene transfer: building the web of life.

Nat Rev Genet, 2015. 16(8): p. 472-82.

13. Lang, A.S., O. Zhaxybayeva, and J.T. Beatty, Gene transfer agents: phage-like elements of

genetic exchange. Nat Rev Microbiol, 2012. 10(7): p. 472-82.

14. Naor, A., et al., Low species barriers in halophilic archaea and the formation of recombinant

hybrids. Curr Biol, 2012. 22(15): p. 1444-8.

15. Zhaxybayeva, O. and W.F. Doolittle, Lateral gene transfer. Curr Biol, 2011. 21(7): p. R242-6.

16. Griffith, F., The Significance of Pneumococcal Types. J Hyg (Lond), 1928. 27(2): p. 113-59.

17. Avery, O.T., C.M. Macleod, and M. McCarty, Studies on the Chemical Nature of the Substance

Inducing Transformation of Pneumococcal Types : Induction of Transformation by a

Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type III. J Exp Med, 1944. 79(2):

p. 137-58.

18. Ochiai, K., et al., Studies on inheritance of drug resistance between Shigella strains and

Escherichia coli strains. Nihon Iji Shimpo, 1959. 1861: p. 34-46.

19. Akiba, T.K.T.I.Y., S. Kimura, and T. Fukushima, Studies on the mechanism of development of

multiple drug-resistant Shigella strains. Nihon Iji Shimpo, 1960. 1866: p. 45-50.

20. Anderson, E.S., The ecology of transferable drug resistance in the enterobacteria. Annu Rev

Microbiol, 1968. 22: p. 131-80.

21. Aravind, L., et al., Evidence for massive gene exchange between archaeal and bacterial

hyperthermophiles. Trends Genet, 1998. 14(11): p. 442-4.

22. Nelson, K.E., et al., Evidence for lateral gene transfer between Archaea and bacteria from genome

sequence of Thermotoga maritima. Nature, 1999. 399(6734): p. 323-9.

23. Sokal, R.R. and T.J. Crovello, The Biological Species Concept: A Critical Evaluation. The

American Naturalist, 1970. 104(936): p. 127-153.


24. Mayr, E., Systematics and the origin of species, from the viewpoint of a zoologist. 1942: Harvard

University Press.

25. de Queiroz, K., Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A, 2005.

102 Suppl 1: p. 6600-7.

26. Ravin, A.W., Experimental Approaches to the Study of Bacterial Phylogeny. The American

Naturalist, 1963. 97(896): p. 307-318.

27. Dykhuizen, D.E. and L. Green, Recombination in Escherichia coli and the definition of biological

species. J Bacteriol, 1991. 173(22): p. 7257-68.

28. Tettelin, H., et al., Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae:

implications for the microbial "pan-genome". Proc Natl Acad Sci U S A, 2005. 102(39): p. 13950-5.

29. Cohan, F.M., What are bacterial species? Annu Rev Microbiol, 2002. 56: p. 457-87.

30. Atwood, K.C., L.K. Schneider, and F.J. Ryan, Periodic selection in Escherichia coli. Proc Natl

Acad Sci U S A, 1951. 37(3): p. 146-55.

31. Treves, D.S., S. Manning, and J. Adams, Repeated evolution of an acetate-crossfeeding

polymorphism in long-term populations of Escherichia coli. Mol Biol Evol, 1998. 15(7): p. 789-97.

32. Imhof, M. and C. Schlotterer, Fitness effects of advantageous mutations in evolving Escherichia

coli populations. Proc Natl Acad Sci U S A, 2001. 98(3): p. 1113-7.

33. Rozen, D.E. and R.E. Lenski, Long-Term Experimental Evolution in Escherichia coli. VIII.

Dynamics of a Balanced Polymorphism. Am Nat, 2000. 155(1): p. 24-35.

34. Guttman, D.S. and D.E. Dykhuizen, Detecting selective sweeps in naturally occurring

Escherichia coli. Genetics, 1994. 138(4): p. 993-1003.

35. Coleman, M.L. and S.W. Chisholm, Ecosystem-specific selection pressures revealed through

comparative population genomics. Proc Natl Acad Sci U S A, 2010. 107(43): p. 18634-9.


36. Papke, R.T., et al., Searching for species in haloarchaea. Proc Natl Acad Sci U S A, 2007.

104(35): p. 14092-7.

37. Cohan, F.M. and E.B. Perry, A systematics for discovering the fundamental units of bacterial

diversity. Curr Biol, 2007. 17(10): p. R373-86.

38. Majewski, J. and F.M. Cohan, Adapt globally, act locally: the effect of selective sweeps on

bacterial sequence diversity. Genetics, 1999. 152(4): p. 1459-74.

39. Shapiro, B.J., et al., Population genomics of early events in the ecological differentiation of

bacteria. Science, 2012. 336(6077): p. 48-51.

40. Takeuchi, N., et al., Gene-specific selective sweeps in bacteria and archaea caused by negative

frequency-dependent selection. BMC Biol, 2015. 13: p. 20.

41. Dixit, P.D., T.Y. Pang, and S. Maslov, Recombination-Driven Genome Evolution and Stability

of Bacterial Species. Genetics, 2017. 207(1): p. 281-295.

42. Rolfe, R. and M. Meselson, The Relative Homogeneity of Microbial DNA. Proc Natl Acad Sci

U S A, 1959. 45(7): p. 1039-43.

43. De Ley, J., H. Cattoir, and A. Reynaerts, The quantitative measurement of DNA hybridization

from renaturation rates. Eur J Biochem, 1970. 12(1): p. 133-42.

44. Wayne, L.G., International Committee on Systematic Bacteriology: announcement of the report

of the ad hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Zentralbl

Bakteriol Mikrobiol Hyg A, 1988. 268(4): p. 433-4.

45. Fleischmann, R.D., et al., Whole-genome random sequencing and assembly of Haemophilus

influenzae Rd. Science, 1995. 269(5223): p. 496-512.

46. Fraser, C.M., J.A. Eisen, and S.L. Salzberg, Microbial genome sequencing. Nature, 2000.

406(6797): p. 799-803.

47. Ravel, J. and C.M. Fraser, Genome sequencing of microbial species, in Encyclopedia of Genetics,

Genomics, Proteomics and Bioinformatics. 2004, John Wiley & Sons, Ltd.


48. Karlin, S., Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin

Microbiol, 1998. 1(5): p. 598-610.

49. Hanage, W.P., C. Fraser, and B.G. Spratt, Sequences, sequence clusters and bacterial species.

Philos Trans R Soc Lond B Biol Sci, 2006. 361(1475): p. 1917-27.

50. Segata, N., et al., PhyloPhlAn is a new method for improved phylogenetic and taxonomic

placement of microbes. Nat Commun, 2013. 4: p. 2304.

51. Ravenhall, M., et al., Inferring horizontal gene transfer. PLoS Comput Biol, 2015. 11(5): p.

e1004095.

52. Cavalli-Sforza, L.L., The DNA revolution in population genetics. Trends Genet, 1998. 14(2): p.

60-5.

53. Koonin, E.V., K.S. Makarova, and L. Aravind, Horizontal gene transfer in prokaryotes:

quantification and classification. Annu Rev Microbiol, 2001. 55: p. 709-42.

54. Lawrence, J.G. and H. Ochman, Amelioration of bacterial genomes: rates of change and

exchange. J Mol Evol, 1997. 44(4): p. 383-97.

55. Medigue, C., et al., Evidence for horizontal gene transfer in Escherichia coli speciation. J Mol

Biol, 1991. 222(4): p. 851-6.

56. Ochman, H., J.G. Lawrence, and E.A. Groisman, Lateral gene transfer and the nature of

bacterial innovation. Nature, 2000. 405(6784): p. 299-304.

57. Nakamura, Y., et al., Biased biological functions of horizontally transferred genes in prokaryotic

genomes. Nat Genet, 2004. 36(7): p. 760-6.

58. Ge, F., L.S. Wang, and J. Kim, The cobweb of life revealed by genome-scale estimates of horizontal

gene transfer. PLoS Biol, 2005. 3(10): p. e316.

59. Lerat, E., et al., Evolutionary origins of genomic repertoires in bacteria. PLoS Biol, 2005. 3(5): p.

e130.


60. Dagan, T. and W. Martin, Ancestral genome sizes specify the minimum rate of lateral gene

transfer during prokaryote evolution. Proc Natl Acad Sci U S A, 2007. 104(3): p. 870-5.

61. Andam, C.P. and J.P. Gogarten, Biased gene transfer in microbial evolution. Nat Rev

Microbiol, 2011. 9(7): p. 543-55.

62. Skippington, E. and M.A. Ragan, Phylogeny rather than ecology or lifestyle biases the

construction of Escherichia coli-Shigella genetic exchange communities. Open Biol, 2012. 2(9): p.

120112.

63. Boucher, Y., et al., Local mobile gene pools rapidly cross species boundaries to create endemicity

within global Vibrio cholerae populations. MBio, 2011. 2(2).

64. Madsen, J.S., et al., The interconnection between biofilm formation and horizontal gene transfer.

FEMS Immunol Med Microbiol, 2012. 65(2): p. 183-95.

65. Smillie, C.S., et al., Ecology drives a global network of gene exchange connecting the human

microbiome. Nature, 2011. 480(7376): p. 241-4.

66. Liu, L., et al., The human microbiome: a hot spot of microbial horizontal gene transfer. Genomics,

2012. 100(5): p. 265-70.

67. Brito, I.L., et al., Mobile genes in the human microbiome are structured from global to individual

scales. Nature, 2016. 535(7612): p. 435-439.

68. Rivera, M.C., et al., Genomic evidence for two functionally distinct gene classes. Proc Natl Acad

Sci U S A, 1998. 95(11): p. 6239-44.

69. Cohen, O., U. Gophna, and T. Pupko, The complexity hypothesis revisited: connectivity rather

than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol, 2011. 28(4): p.

1481-9.

70. Jain, R., M.C. Rivera, and J.A. Lake, Horizontal gene transfer among genomes: the complexity

hypothesis. Proc Natl Acad Sci U S A, 1999. 96(7): p. 3801-6.

71. Beiko, R.G., T.J. Harlow, and M.A. Ragan, Highways of gene sharing in prokaryotes. Proc Natl

Acad Sci U S A, 2005. 102(40): p. 14332-7.

72. Baltrus, D.A., Exploring the costs of horizontal gene transfer. Trends Ecol Evol, 2013. 28(8): p.

489-95.

73. Drummond, D.A. and C.O. Wilke, The evolutionary consequences of erroneous protein

synthesis. Nat Rev Genet, 2009. 10(10): p. 715-24.

74. Banos, R.C., et al., Differential regulation of horizontally acquired and core genome genes by the

bacterial modulator H-NS. PLoS Genet, 2009. 5(6): p. e1000513.

75. Wolf, Y.I., et al., Evolution of aminoacyl-tRNA synthetases--analysis of unique domain

architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events.

Genome Res, 1999. 9(8): p. 689-710.

76. Woese, C.R., Interpreting the universal phylogenetic tree. Proc Natl Acad Sci U S A, 2000.

97(15): p. 8392-6.

77. Baas Becking, L.G.M., Geobiologie of inleiding tot de milieukunde. 1934, The Hague, the

Netherlands: W.P. Van Stockum & Zoon.

78. de Wit, R. and T. Bouvier, 'Everything is everywhere, but, the environment selects'; what did

Baas Becking and Beijerinck really say? Environ Microbiol, 2006. 8(4): p. 755-8.

79. O'Malley, M.A., The nineteenth century roots of 'everything is everywhere'. Nat Rev Microbiol,

2007. 5(8): p. 647-51.

80. Yasuda, K., et al., Biogeography of the intestinal mucosal and lumenal microbiome in the rhesus

macaque. Cell Host Microbe, 2015. 17(3): p. 385-91.

81. Grice, E.A., et al., Topographical and temporal diversity of the human skin microbiome. Science,

2009. 324(5931): p. 1190-2.

82. Gibbons, S.M., The Built Environment Is a Microbial Wasteland. mSystems, 2016. 1(2).

83. Impact of the Built Environment on Health. 2011 [cited 09/15/2017]; Available from:

https://www.cdc.gov/nceh/publications/factsheets/impactofthebuiltenvironmentonhealth.pdf.

84. Klepeis, N.E., et al., The National Human Activity Pattern Survey (NHAPS): a resource for

assessing exposure to environmental pollutants. J Expo Anal Environ Epidemiol, 2001. 11(3):

p. 231-52.

85. Kitzes, J., A. Peller, S. Goldfinger, and M. Wackernagel, Current Methods for Calculating

National Ecological Footprint Accounts. Science for Environment & Sustainable Society,

2007. 4(1): p. 1-9.

86. Hooke, R.L., J.F. Martín-Duque, and J. Pedraza, Land transformation by humans: A review.

GSA Today, 2012. 22(12): p. 4-10.

87. United Nations Department of Economic and Social Affairs, Population Division, World urbanization prospects: the 2011 revision. Vol.

ST/ESA/SER.A/322. 2012: United Nations Publications.

88. NESCent Working Group on the Evolutionary Biology of the Built Environment, et al., Evolution of the indoor biome. Trends Ecol Evol,

2015. 30(4): p. 223-32.

89. Dai, D., et al., Factors Shaping the Human Exposome in the Built Environment: Opportunities

for Engineering Control. Environ Sci Technol, 2017. 51(14): p. 7759-7774.

90. Kelley, S.T. and J.A. Gilbert, Studying the microbiology of the indoor environment. Genome

Biol, 2013. 14(2): p. 202.

91. Milstone, L.M., Epidermal desquamation. J Dermatol Sci, 2004. 36(3): p. 131-40.

92. Lax, S., et al., Longitudinal analysis of microbial interaction between humans and the indoor

environment. Science, 2014. 345(6200): p. 1048-52.

93. Flores, G.E., et al., Microbial biogeography of public restroom surfaces. PLoS One, 2011. 6(11):

p. e28132.

94. Kembel, S.W., et al., Architectural design influences the diversity and structure of the built

environment microbiome. ISME J, 2012. 6(8): p. 1469-79.

95. Lax, S., C.R. Nagler, and J.A. Gilbert, Our interface with the built environment: immunity and

the indoor microbiota. Trends Immunol, 2015. 36(3): p. 121-3.

96. Ownby, D.R., C.C. Johnson, and E.L. Peterson, Exposure to dogs and cats in the first year of

life and risk of allergic sensitization at 6 to 7 years of age. JAMA, 2002. 288(8): p. 963-72.

97. Park, J.H., et al., Predictors of airborne endotoxin in the home. Environ Health Perspect, 2001.

109(8): p. 859-64.

98. Thorne, P.S., et al., Endotoxin Exposure: Predictors and Prevalence of Associated Asthma

Outcomes in the United States. Am J Respir Crit Care Med, 2015. 192(11): p. 1287-97.

99. Liu, A.H., Endotoxin exposure in allergy and asthma: reconciling a paradox. J Allergy Clin

Immunol, 2002. 109(3): p. 379-92.

100. Sharpe, R.A., et al., Indoor fungal diversity and asthma: a meta-analysis and systematic review

of risk factors. J Allergy Clin Immunol, 2015. 135(1): p. 110-22.

101. Song, S.J., et al., Cohabiting family members share microbiota with one another and with their

dogs. Elife, 2013. 2: p. e00458.

102. Ross, A.A., A.C. Doxey, and J.D. Neufeld, The Skin Microbiome of Cohabiting Couples.

mSystems, 2017. 2(4).

103. Lax, S., et al., Forensic analysis of the microbiome of phones and shoes. Microbiome, 2015. 3: p.

21.

104. Lax, S. and J.A. Gilbert, Forensic microbiology in built environments (Chapter 13), in Forensic Microbiology, D.O.

Carter, J.K. Tomberlin, M.E. Benbow, and J.L. Metcalf, Editors. 2017, John Wiley & Sons, Ltd: Chichester,

UK.

105. Strachan, D.P., Hay fever, hygiene, and household size. BMJ, 1989. 299(6710): p. 1259-60.

106. Rook, G.A., et al., Mycobacteria and other environmental organisms as immunomodulators for

immunoregulatory disorders. Springer Semin Immunopathol, 2004. 25(3-4): p. 237-55.

107. Shade, A., Diversity is the question, not the answer. ISME J, 2017. 11(1): p. 1-6.

108. Vandegrift, R., et al., Cleanliness in context: reconciling hygiene with a modern microbial

perspective. Microbiome, 2017. 5(1): p. 76.

109. Bloomfield, S.F., et al., Time to abandon the hygiene hypothesis: new perspectives on allergic

disease, the human microbiome, infectious disease prevention and the role of targeted hygiene.

Perspect Public Health, 2016. 136(4): p. 213-24.

110. Rook, G.A., Regulation of the immune system by biodiversity from the natural environment: an

ecosystem service essential to health. Proc Natl Acad Sci U S A, 2013. 110(46): p. 18360-7.

111. Chase, J., et al., Geography and Location Are the Primary Drivers of Office Microbiome

Composition. mSystems, 2016. 1(2).

112. Mohammadi, T., et al., Removal of contaminating DNA from commercial nucleic acid extraction

kit reagents. J Microbiol Methods, 2005. 61(2): p. 285-8.

113. Tanner, M.A., et al., Specific ribosomal DNA sequences from diverse environmental settings

correlate with experimental contaminants. Appl Environ Microbiol, 1998. 64(8): p. 3110-3.

114. Adams, R.I., et al., Microbiota of the indoor environment: a meta-analysis. Microbiome, 2015.

3: p. 49.

115. Salter, S.J., et al., Reagent and laboratory contamination can critically impact sequence-based

microbiome analyses. BMC Biol, 2014. 12: p. 87.

116. Coil, D., Citizen Microbiology: A Case Study in Space, in The Rightful Place of Science: Citizen

Science, D. Cavalier and E.B. Kennedy, Editors. 2016, Consortium for Science, Policy & Outcomes:

Tempe, AZ.

117. Nielsen, K.M., et al., Release and persistence of extracellular DNA in the environment. Environ

Biosafety Res, 2007. 6(1-2): p. 37-53.

118. Carini, P., et al., Relic DNA is abundant in soil and obscures estimates of soil microbial diversity.

Nat Microbiol, 2016. 2: p. 16242.

119. Emerson, J.B., et al., Schrodinger's microbes: Tools for distinguishing the living from the dead in

microbial ecosystems. Microbiome, 2017. 5(1): p. 86.

120. Riesenfeld, C.S., P.D. Schloss, and J. Handelsman, Metagenomics: genomic analysis of

microbial communities. Annu Rev Genet, 2004. 38: p. 525-52.

121. Hamady, M. and R. Knight, Microbial community profiling for human microbiome projects:

Tools, techniques, and challenges. Genome Res, 2009. 19(7): p. 1141-52.

122. Segata, N., et al., Computational meta'omics for microbial community studies. Mol Syst Biol,

2013. 9: p. 666.

123. McDonald, D., et al., An improved Greengenes taxonomy with explicit ranks for ecological and

evolutionary analyses of bacteria and archaea. ISME J, 2012. 6(3): p. 610-8.

124. Yilmaz, P., et al., The SILVA and "All-species Living Tree Project (LTP)" taxonomic frameworks.

Nucleic Acids Res, 2014. 42(Database issue): p. D643-8.

125. Huse, S.M., et al., Exploring microbial diversity and taxonomy using SSU rRNA hypervariable

tag sequencing. PLoS Genet, 2008. 4(11): p. e1000255.

126. Knights, D., et al., Human-associated microbial signatures: examining their predictive value. Cell

Host Microbe, 2011. 10(4): p. 292-6.

127. Langille, M.G., et al., Predictive functional profiling of microbial communities using 16S rRNA

marker gene sequences. Nat Biotechnol, 2013. 31(9): p. 814-21.

128. Vandamme, P., et al., Polyphasic taxonomy, a consensus approach to bacterial systematics.

Microbiol Rev, 1996. 60(2): p. 407-38.

129. Stackebrandt, E. and B.M. Goebel, Taxonomic Note: A Place for DNA-DNA Reassociation and 16S

rRNA Sequence Analysis in the Present Species Definition in Bacteriology. International Journal

of Systematic and Evolutionary Microbiology, 1994. 44(4): p. 846-849.

130. Eren, A.M., et al., Oligotyping: Differentiating between closely related microbial taxa using 16S

rRNA gene data. Methods Ecol Evol, 2013. 4(12).

131. Eren, A.M., et al., Exploring the diversity of Gardnerella vaginalis in the genitourinary tract

microbiota of monogamous couples through subtle nucleotide variation. PLoS One, 2011. 6(10):

p. e26732.

132. McLellan, S.L., et al., Sewage reflects the distribution of human faecal Lachnospiraceae. Environ

Microbiol, 2013. 15(8): p. 2213-27.

133. Faith, J.J., et al., The long-term stability of the human gut microbiota. Science, 2013. 341(6141):

p. 1237439.

134. McHardy, A.C., et al., Accurate phylogenetic classification of variable-length DNA fragments.

Nat Methods, 2007. 4(1): p. 63-72.

135. Schloissnig, S., et al., Genomic variation landscape of the human gut microbiome. Nature, 2013.

493(7430): p. 45-50.

136. Segata, N., et al., Metagenomic microbial community profiling using unique clade-specific marker

genes. Nat Methods, 2012. 9(8): p. 811-4.

137. Brady, A. and S. Salzberg, PhymmBL expanded: confidence scores, custom databases,

parallelization and more. Nat Methods, 2011. 8(5): p. 367.

138. Wood, D.E. and S.L. Salzberg, Kraken: ultrafast metagenomic sequence classification using exact

alignments. Genome Biol, 2014. 15(3): p. R46.

139. Kanehisa, M., et al., Data, information, knowledge and principle: back to metabolism in KEGG.

Nucleic Acids Res, 2014. 42(Database issue): p. D199-205.

140. Tatusov, R.L., E.V. Koonin, and D.J. Lipman, A genomic perspective on protein families.

Science, 1997. 278(5338): p. 631-7.

141. Powell, S., et al., eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different

taxonomic ranges. Nucleic Acids Res, 2012. 40(Database issue): p. D284-9.

142. Punta, M., et al., The Pfam protein families database. Nucleic Acids Res, 2012. 40(Database

issue): p. D290-301.

143. Suzek, B.E., et al., UniRef: comprehensive and non-redundant UniProt reference clusters.

Bioinformatics, 2007. 23(10): p. 1282-8.

144. Caspi, R., et al., The MetaCyc database of metabolic pathways and enzymes and the BioCyc

collection of Pathway/Genome Databases. Nucleic Acids Res, 2014. 42(Database issue): p.

D459-71.

145. Overbeek, R., et al., The subsystems approach to genome annotation and its use in the project to

annotate 1000 genomes. Nucleic Acids Res, 2005. 33(17): p. 5691-702.

146. Markowitz, V.M., et al., IMG/M: the integrated metagenome data management and comparative

analysis system. Nucleic Acids Res, 2012. 40(Database issue): p. D123-9.

147. Konwar, K.M., et al., MetaPathways: a modular pipeline for constructing pathway/genome

databases from environmental sequence information. BMC Bioinformatics, 2013. 14: p. 202.

148. Abubucker, S., et al., Metabolic reconstruction for metagenomic data and its application to the

human microbiome. PLoS Comput Biol, 2012. 8(6): p. e1002358.

149. Vollmers, J., S. Wiegand, and A.K. Kaster, Comparing and Evaluating Metagenome Assembly

Tools from a Microbiologist's Perspective - Not Only Size Matters! PLoS One, 2017. 12(1): p.

e0169662.

150. Nagarajan, N. and M. Pop, Sequence assembly demystified. Nat Rev Genet, 2013. 14(3): p.

157-67.

151. Gill, S.R., et al., Metagenomic analysis of the human distal gut microbiome. Science, 2006.

312(5778): p. 1355-9.

152. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing.

Nature, 2010. 464(7285): p. 59-65.

153. Venter, J.C., et al., Environmental genome shotgun sequencing of the Sargasso Sea. Science,

2004. 304(5667): p. 66-74.

154. Wrighton, K.C., et al., Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated

bacterial phyla. Science, 2012. 337(6102): p. 1661-5.

155. Castelle, C.J., et al., Extraordinary phylogenetic diversity and metabolic versatility in aquifer

sediment. Nat Commun, 2013. 4: p. 2120.

156. Di Rienzi, S.C., et al., The human gut and groundwater harbor non-photosynthetic bacteria

belonging to a new candidate phylum sibling to Cyanobacteria. Elife, 2013. 2: p. e01102.

157. Tyson, G.W., et al., Community structure and metabolism through reconstruction of microbial

genomes from the environment. Nature, 2004. 428(6978): p. 37-43.

158. Albertsen, M., et al., Genome sequences of rare, uncultured bacteria obtained by differential

coverage binning of multiple metagenomes. Nat Biotechnol, 2013. 31(6): p. 533-8.

159. Mukherjee, S., et al., 1,003 reference genomes of bacterial and archaeal isolates expand coverage

of the tree of life. Nat Biotechnol, 2017. 35(7): p. 676-683.

160. Eisen, J.A., Horizontal gene transfer among microbial genomes: new insights from complete

genome analysis. Curr Opin Genet Dev, 2000. 10(6): p. 606-11.

161. Hao, W. and G.B. Golding, The fate of laterally transferred genes: life in the fast lane to

adaptation or death. Genome Res, 2006. 16(5): p. 636-43.

162. Polz, M.F., E.J. Alm, and W.P. Hanage, Horizontal gene transfer and the evolution of bacterial

and archaeal population structure. Trends Genet, 2013. 29(3): p. 170-5.

163. Mitri, S. and K.R. Foster, The genotypic view of social interactions in microbial communities.

Annu Rev Genet, 2013. 47: p. 247-73.

164. Smith, J., The social evolution of bacterial pathogenesis. Proc Biol Sci, 2001. 268(1462): p. 61-9.

165. de Carvalho, M.O. and E.L. Loreto, Methods for detection of horizontal transfer of transposable

elements in complete genomes. Genet Mol Biol, 2012. 35(4 (suppl)): p. 1078-84.

166. Ragan, M.A., On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett,

2001. 201(2): p. 187-91.

167. Vernikos, G.S. and J. Parkhill, Interpolated variable order motifs for identification of horizontally

acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics, 2006. 22(18): p.

2196-203.

168. Podell, S. and T. Gaasterland, DarkHorse: a method for genome-wide prediction of horizontal

gene transfer. Genome Biol, 2007. 8(2): p. R16.

169. Langille, M.G., W.W. Hsiao, and F.S. Brinkman, Evaluation of genomic island predictors using

a comparative genomics approach. BMC Bioinformatics, 2008. 9: p. 329.

170. Whidden, C., N. Zeh, and R.G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft

Distance. Syst Biol, 2014. 63(4): p. 566-81.

171. Tofigh, A., M. Hallett, and J. Lagergren, Simultaneous identification of duplications and lateral

gene transfers. IEEE/ACM Trans Comput Biol Bioinform, 2011. 8(2): p. 517-35.

172. Chauve, C., et al., MaxTiC: Fast Ranking Of A Phylogenetic Tree By Maximum Time

Consistency With Lateral Gene Transfers. bioRxiv, 2017.

173. Trappe, K., T. Marschall, and B.Y. Renard, Detecting horizontal gene transfer by mapping

sequencing reads across species boundaries. Bioinformatics, 2016. 32(17): p. i595-i604.

174. Lloyd-Price, J., A. Mahurkar, et al., Strains, functions and dynamics in the expanded Human

Microbiome Project. Nature, in press.

175. Huang, K., et al., MetaRef: a pan-genomic database for comparative and community microbial

genomics. Nucleic Acids Res, 2014. 42(Database issue): p. D617-24.

176. Louis, P., G.L. Hold, and H.J. Flint, The gut microbiota, bacterial metabolites and colorectal

cancer. Nat Rev Microbiol, 2014. 12(10): p. 661-72.

177. Flint, H.J., et al., Interactions and competition within the microbial community of the human

colon: links between diet and health. Environ Microbiol, 2007. 9(5): p. 1101-11.

178. Mark Welch, J.L., et al., Biogeography of a human oral microbiome at the micron scale. Proc Natl

Acad Sci U S A, 2016. 113(6): p. E791-800.

179. Finn, R.D., et al., Pfam: clans, web tools and services. Nucleic Acids Res, 2006. 34(Database

issue): p. D247-51.

180. Sitbon, E. and S. Pietrokovski, New types of conserved sequence domains in DNA-binding

regions of homing endonucleases. Trends Biochem Sci, 2003. 28(9): p. 473-7.

181. Burrus, V., et al., The ICESt1 element of Streptococcus thermophilus belongs to a large family of

integrative and conjugative elements that exchange modules and change their specificity of

integration. Plasmid, 2002. 48(2): p. 77-97.

182. Burrus, V., et al., Conjugative transposons: the tip of the iceberg. Mol Microbiol, 2002. 46(3): p.

601-10.

183. Bonham, K.S., B.E. Wolfe, and R.J. Dutton, Extensive horizontal gene transfer in

cheese-associated bacteria. Elife, 2017. 6.

184. Truong, D.T., et al., Microbial strain-level population structure and genetic diversity from

metagenomes. Genome Res, 2017. 27(4): p. 626-638.

185. Stokes, H.W. and M.R. Gillings, Gene flow, mobile genetic elements and the recruitment of

antibiotic resistance genes into Gram-negative pathogens. FEMS Microbiol Rev, 2011. 35(5): p.

790-819.

186. Peng, Y., et al., IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data

with highly uneven depth. Bioinformatics, 2012. 28(11): p. 1420-8.

187. Konstantinidis, K.T. and J.M. Tiedje, Genomic insights that advance the species definition for

prokaryotes. Proc Natl Acad Sci U S A, 2005. 102(7): p. 2567-72.

188. Jost, L., Entropy and diversity. Oikos, 2006. 113(2): p. 363-375.

189. National Transit Database. Monthly Module Raw Data Release. 2015; Available from:

http://www.ntdprogram.gov/ntdprogram/data.htm.

190. Meadow, J.F., A.E. Altrichter, and J.L. Green, Mobile phones carry the personal microbiome of

their owners. PeerJ, 2014. 2: p. e447.

191. Fierer, N., et al., Forensic identification using skin bacterial communities. Proc Natl Acad Sci

U S A, 2010. 107(14): p. 6477-81.

192. Meadow, J.F., et al., Bacterial communities on classroom surfaces vary with human contact.

Microbiome, 2014. 2(1): p. 7.

193. Robertson, C.E., et al., Culture-independent analysis of aerosol microbiology in a metropolitan

subway system. Appl Environ Microbiol, 2013. 79(11): p. 3485-93.

194. Leung, M.H., et al., Indoor-air microbiome in an urban subway network: diversity and dynamics.

Appl Environ Microbiol, 2014. 80(21): p. 6760-70.

195. Afshinnekoo, E., et al., Geospatial Resolution of Human and Bacterial Diversity with City-Scale

Metagenomics. Cell Systems, 2015. 1(1): p. 72-87.

196. Ackelsberg, J., et al., Lack of Evidence for Plague or Anthrax on the New York City Subway. Cell

Systems, 2015. 1(1): p. 4-5.

197. Segata, N., et al., Metagenomic biomarker discovery and explanation. Genome Biol, 2011. 12(6):

p. R60.

198. Nelson, M.C., et al., Analysis, optimization and verification of Illumina-generated 16S rRNA

gene amplicon surveys. PLoS One, 2014. 9(4): p. e94249.

199. Segata, N., et al., Composition of the adult digestive tract bacterial microbiome based on seven

mouth surfaces, tonsils, throat and stool samples. Genome Biol, 2012. 13(6): p. R42.

200. Costello, E.K., et al., Bacterial community variation in human body habitats across space and

time. Science, 2009. 326(5960): p. 1694-7.

201. Kembel, S.W., et al., Architectural design drives the biogeography of indoor bacterial

communities. PLoS One, 2014. 9(1): p. e87093.

202. Lauber, C.L., et al., Pyrosequencing-based assessment of soil pH as a predictor of soil bacterial

community structure at the continental scale. Appl Environ Microbiol, 2009. 75(15): p. 5111-20.

203. Knights, D., et al., Bayesian community-wide culture-independent microbial source tracking.

Nat Methods, 2011. 8(9): p. 761-3.

204. Stolz, A., Molecular characteristics of xenobiotic-degrading sphingomonads. Appl Microbiol

Biotechnol, 2009. 81(5): p. 793-811.

205. Peyraud, R., et al., Genome-scale reconstruction and system level investigation of the metabolic

network of Methylobacterium extorquens AM1. BMC Syst Biol, 2011. 5: p. 189.

206. Kawamura, Y., et al., Genus Enhydrobacter Staley et al. 1987 should be recognized as a member

of the family Rhodospirillaceae within the class Alphaproteobacteria. Microbiol Immunol, 2012.

56(1): p. 21-6.

207. Hewitt, K.M., et al., Bacterial diversity in two Neonatal Intensive Care Units (NICUs). PLoS

One, 2013. 8(1): p. e54703.

208. Grice, E.A., et al., A diversity profile of the human skin microbiota. Genome Res, 2008. 18(7):

p. 1043-50.

209. Dawson, T.L., Jr., Malassezia globosa and restricta: breakthrough understanding of the etiology

and treatment of dandruff and seborrheic dermatitis through whole-genome analysis. J Investig

Dermatol Symp Proc, 2007. 12(2): p. 15-9.

210. Zouboulis, C.C., Propionibacterium acnes and sebaceous lipogenesis: a love-hate relationship? J

Invest Dermatol, 2009. 129(9): p. 2093-6.

211. Morgan, X.C., et al., Dysfunction of the intestinal microbiome in inflammatory bowel disease and

treatment. Genome Biol, 2012. 13(9): p. R79.

212. Barberan, A., et al., Using network analysis to explore co-occurrence patterns in soil microbial

communities. ISME J, 2012. 6(2): p. 343-51.

213. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids

Res, 2000. 28(1): p. 27-30.

214. Bruggemann, H., et al., The complete genome sequence of Propionibacterium acnes, a commensal

of human skin. Science, 2004. 305(5684): p. 671-3.

215. Lee, W.L., A.R. Shalita, and M.B. Poh-Fitzpatrick, Comparative studies of porphyrin

production in Propionibacterium acnes and Propionibacterium granulosum. J Bacteriol, 1978.

133(2): p. 811-5.

216. Holland, K.T., et al., Propionibacterium acnes and acne. Dermatology, 1998. 196(1): p. 67-8.

217. Roessner, C.A., et al., Isolation and characterization of 14 additional genes specifying the

anaerobic biosynthesis of cobalamin (vitamin B12) in Propionibacterium freudenreichii (P.

shermanii). Microbiology, 2002. 148(Pt 6): p. 1845-53.

218. Hashimoto, Y., M. Yamashita, and Y. Murooka, The Propionibacterium freudenreichii

hemYHBXRL gene cluster, which encodes enzymes and a regulator involved in the biosynthetic

pathway from glutamate to protoheme. Appl Microbiol Biotechnol, 1997. 47(4): p. 385-92.

219. Kaminski, J., et al., High-specificity targeted functional profiling in microbial communities with

ShortBRED. PLoS Comput Biol, in press.

220. McArthur, A.G., et al., The comprehensive antibiotic resistance database. Antimicrob Agents

Chemother, 2013. 57(7): p. 3348-57.

221. Yooseph, S., et al., A metagenomic framework for the study of airborne microbial communities.

PLoS One, 2013. 8(12): p. e81862.

222. Qin, J., et al., A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature,

2012. 490(7418): p. 55-60.

223. Yatsunenko, T., et al., Human gut microbiome viewed across age and geography. Nature, 2012.

486(7402): p. 222-7.

224. Hu, Y., et al., Metagenome-wide analysis of antibiotic resistance genes in a large cohort of human

gut microbiota. Nat Commun, 2013. 4: p. 2151.

225. Chen, L., et al., VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res,

2005. 33(Database issue): p. D325-8.

226. Li, Y., et al., Role of ventilation in airborne transmission of infectious agents in the built

environment - a multidisciplinary systematic review. Indoor Air, 2007. 17(1): p. 2-18.

227. Gibbons, S.M., et al., Ecological succession and viability of human-associated microbiota on

restroom surfaces. Appl Environ Microbiol, 2015. 81(2): p. 765-73.

228. Glass, E.M., et al., MIxS-BE: a MIxS extension defining a minimum information standard for

sequence data from the built environment. ISME J, 2014. 8(1): p. 1-3.

229. National Centers for Environmental Information & National Oceanic and Atmospheric

Administration. Record of Climatological Observations. [cited 8/29/2015; Station: Boston Logan

International Airport, MA, US]; Available from: http://www.ncdc.noaa.gov/cdo-web/.

230. Weather Underground. Weather History for KBOS. [cited 8/29/2015]; Available from:

http://www.wunderground.com/history/.

231. Paulino, L.C., et al., Molecular analysis of fungal microbiota in samples from healthy human skin

and psoriatic lesions. J Clin Microbiol, 2006. 44(8): p. 2933-41.

232. Caporaso, J.G., et al., Global patterns of 16S rRNA diversity at a depth of millions of sequences

per sample. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4516-22.

233. Caporaso, J.G., et al., QIIME allows analysis of high-throughput community sequencing data.

Nat Methods, 2010. 7(5): p. 335-6.

234. Oksanen, J., F.G. Blanchet, R. Kindt, P. Legendre, P. Minchin, R. O'Hara, G. Simpson, P. Solymos,

H. Stevens, and H. Wagner, vegan: Community Ecology Package. 2015.

235. Asnicar, F., et al., Compact graphical representation of phylogenetic data and metadata with

GraPhlAn. PeerJ, 2015. 3: p. e1029.

236. Morgat, A., et al., UniPathway: a resource for the exploration and annotation of metabolic

pathways. Nucleic Acids Res, 2012. 40(Database issue): p. D761-9.

237. Suzek, B.E., et al., UniRef clusters: a comprehensive and scalable alternative for improving

sequence similarity searches. Bioinformatics, 2015. 31(6): p. 926-32.

238. Liu, B. and M. Pop, ARDB--Antibiotic Resistance Genes Database. Nucleic Acids Res, 2009.

37(Database issue): p. D443-7.

239. Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence

data. Bioinformatics, 2014. 30(15): p. 2114-20.

240. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods,

2012. 9(4): p. 357-9.

241. Rastogi, G., et al., A PCR-based toolbox for the culture-independent quantification of total

bacterial abundances in plant environments. J Microbiol Methods, 2010. 83(2): p. 127-32.

242. Lane, D., 16S/23S rRNA sequencing, in Nucleic acid techniques in bacterial systematics, E.

Stackebrandt and M. Goodfellow, Editors. 1991, John Wiley and Sons: Chichester, United Kingdom. p. 115-175.

243. Flores, G.E., J.B. Henley, and N. Fierer, A direct PCR approach to accelerate analyses of

human-associated microbial communities. PLoS One, 2012. 7(9): p. e44563.

244. Adams, R.I., et al., Passive dust collectors for assessing airborne microbial material. Microbiome,

2015. 3: p. 46.

245. Checinska, A., et al., Microbiomes of the dust particles collected from the International Space

Station and Spacecraft Assembly Facilities. Microbiome, 2015. 3: p. 50.
