feasibility of collection and analysis of microbiome data

Feasibility of collection and analysis of microbiome data in a longitudinal

randomized trial of community gardening

Running head: RCT community gardening and microbiome

Mireia Gascon (a,b,c), Kylie K. Harrall (d), Alyssa W. Beavers (e), Deborah H. Glueck (f), Maggie A. Stanislawski (d,g), Katherine Alaimo (h), Angel Villalobos (i), James R. Hebert (j), Kelsey Dexter (k), Kaigang Li (l), Jill Litt (a,i)*

Affiliations

(a) ISGlobal, Barcelona, Spain

(b) Universitat Pompeu Fabra (UPF), Barcelona, Spain

(c) CIBER Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain

(d) Lifecourse Epidemiology of Adiposity and Diabetes Center, Colorado School of Public Health, University of Colorado Denver, Aurora, Colorado, United States of America

(e) Department of Food Science and Human Nutrition, Michigan State University, East Lansing, Michigan, United States of America

(f) Department of Pediatrics, University of Colorado School of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado, United States of America

(g) Department of Epidemiology, University of Colorado School of Public Health, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado, United States of America

(h) Department of Food Science and Human Nutrition, Michigan State University, Michigan, United States of America

(i) Environmental Studies, University of Colorado Boulder, Boulder, Colorado, United States of America

(j) Department of Epidemiology and Biostatistics and Cancer Prevention and Control Program, Arnold School of Public Health, University of South Carolina, Columbia, United States of America

(k) Department of Endocrinology, University of Colorado School of Public Health, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado, United States of America

(l) Department of Health & Exercise Science, Colorado State University, Colorado, United States of America

Correspondence to:

*Jill Litt, [email protected]

4001 Discovery Drive, Boulder, Colorado 80303

303-735-4519

Author contributions

M. Gascon contributions: formal analysis, writing, editing; KK. Harrall contributions:

data curation, methodology, formal analysis, writing, editing; AW. Beavers contributions:

data curation, methodology, formal analysis, writing, editing; DH. Glueck contributions:

conceptualization, methodology, formal analysis, writing, editing; MA. Stanislawski

contributions: formal analysis, writing, editing; K. Alaimo contributions:

conceptualization, funding acquisition, methodology, validation, writing, editing; A.

Villalobos contributions: methodology, validation, writing, editing; JR. Hebert

contributions: methodology, validation, writing, editing; K. Dexter contributions:

methodology, validation, writing, editing; K. Li contributions: methodology, validation,

writing, editing; J. Litt contributions: conceptualization, funding acquisition,

investigation, methodology, supervision, writing and editing.

Acknowledgements

We would like to thank Robert Knight and his lab, particularly Daniel McDonald, for the

analysis of the microbiome, and Allyson Masunaga Goto, Jessica Metcalf and Lara

Fahnestock for their advice and contributions to the project.

Funding

This study was funded by the University of Colorado Boulder Population Center (CUPC,

Litt, PI), through the National Institute of Child Health & Human Development of the

National Institutes of Health under Award Number P2CHD066613-06 and the Center for

Microbiome Innovation at the University of California San Diego. We also received

supplemental funding through the Clinical & Translational Research Center (CTRC) to

cover all laboratory costs (Litt, PI). Mireia Gascon received a fellowship from the Societat

Econòmica Barcelonesa d’Amics del País (SEBAP) in 2018, Barcelona (Catalonia), for

her research stay at the University of Colorado to conduct the statistical analysis for this

work. DHG was supported, in part, by R01GM121-81 and R25 GM111901.

Ethical conduct of research

The DGEM study had the ethical approval of the University of Colorado Boulder

Institutional Review Board Office (Protocol #: 16-0644).

Data sharing statement

The authors certify that this manuscript reports original clinical trial data. Individual,

de-identified participant data that underlie the results reported in this article (text, tables,

figures, and appendices) are available from the corresponding author (Jill Litt:

[email protected]) following publication, including the clinical study report and study

protocol.

Word count: 5817

Figure number: 5

Table number: 2


Appendices

The Denver Garden Environment and Microbiome (DGEM) feasibility study: a roadmap for the

analysis of microbiome results for a randomized controlled trial of community gardening.

Mireia Gascon, Kylie K. Harrall, Alyssa W. Beavers, Deborah H. Glueck, Maggie A. Stanislawski,

Katherine Alaimo, Angel Villalobos, James R. Hebert, Kelsey Dexter, Kaigang Li, Jill Litt

Appendix A. Add taxonomic information to the deblur reference-hit sequences.

Qiime2, Assign Taxonomy

Note: Qiime2 commands are case sensitive.

Convert a BIOM file to a text file.

biom convert -i /media/sf_data/rarefied_denovo_FeatureTable.BIOM \
-o /media/sf_data/rarefied_denovo_FeatureTable.txt \
--to-tsv

Import the reference-hit sequences, downloaded from Qiita as an FA file, into Qiime2 as FeatureData[Sequence].

qiime tools import --input-path /media/sf_data/reference-hit.seqs.fa \
--output-path /media/sf_data/reference-hit.seqs.qza \
--type 'FeatureData[Sequence]'

Assign taxonomy to the reference-hit sequences.

Download the greengenes classifier at the link below: https://chmi-sops.github.io/mydoc_qiime2.html

Visualization files are viewable at https://view.qiime2.org.

The option --p-reads-per-batch 10000 allowed the classifier to run on a laptop.

qiime feature-classifier classify-sklearn \
--i-classifier /media/sf_data/gg-13-8-99-515-806-nb-classifier.qza \
--i-reads /media/sf_data/reference-hit.seqs.qza \
--o-classification /media/sf_data/reference-hit.seqs.taxonomy.qza \
--p-reads-per-batch 10000

qiime metadata tabulate \
--m-input-file /media/sf_data/reference-hit.seqs.taxonomy.qza \
--o-visualization /media/sf_data/reference-hit.seqs.taxonomy.qzv
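The taxonomy assigned by the Greengenes classifier is reported as one semicolon-delimited lineage string per sequence, with rank prefixes such as k__ and p__. The following is a minimal Python sketch, not the authors' procedure, of how the phylum level could be pulled out of such a string. The function name extract_phylum and the example strings are hypothetical; the sketch assumes that an empty or absent phylum should be labelled "Other" and that square brackets around a name (as in [Thermi]) should be dropped.

```python
def extract_phylum(taxon):
    """Return the phylum from a Greengenes-style lineage string.

    Assumption: ranks are separated by ';' and the phylum rank starts with 'p__'.
    An empty or missing phylum is labelled 'Other'.
    """
    for rank in taxon.split(";"):
        rank = rank.strip()
        if rank.startswith("p__"):
            # Drop the 'p__' prefix and any square brackets around the name.
            name = rank[3:].strip("[]")
            return name if name else "Other"
    return "Other"


# Hypothetical lineage strings, for illustration only.
examples = [
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__[Thermi]; c__Deinococci",
    "k__Bacteria; p__",
    "Unassigned",
]
phyla = [extract_phylum(t) for t in examples]
print(phyla)
```

Applying a helper of this kind to every row of the taxonomy table yields a single phylum column that can be merged with the frequency table.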


Appendix B. Phylum Index Figures

SAS, Relative proportion phylum index figures

*Import the reference-hit frequency table and reference-hit sequences with taxonomic assignment;

*Frequency table;

PROC IMPORT OUT= WORK.Biomfile
DATAFILE= "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Data\ID15926_fro_trimmed_deblur\AddTaxonomy\Deblur_biom_forSAS.xlsx"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

*Sequences;

*NOTE: taxonomic identifiers include Kingdom, Phylum, Class, Order, Family, Genus, Species in a single variable. Prior to SAS import, Phylum was extracted into its own variable;

PROC IMPORT OUT= WORK.Biomtax
DATAFILE= "~\reference-hit.seqs.wTax_Phylum.xlsx"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

*Merge the count data and the sequences with taxonomic information;

data MB01.Biomtax; /*Reference-hit sequences and taxonomic identification*/
set Biomtax;
run;

data MB01.Biomfile; /*Reference-hit frequency table*/
set Biomfile;
run;

*Align variable names between the sequence and frequency datasets;

data a;
set MB01.Biomtax;
keep Otu_ID Phylum;
Otu_ID = Feature_ID;
run;

proc sort data = a;

by OTU_ID; run;

proc sort data = Biomfile;

by OTU_ID; run;

data deblur;

merge a BiomFile; by OTU_ID;


run; data MB01.deblur;

set deblur;

run;

*How many phyla does this dataset contain?; proc freq data = Mb01.deblur;

tables phylum;

run; /* Remove OTU_ID */

data deblur;

set Mb01.deblur; drop OTU_ID;

run;

proc sort data = deblur;

by Phylum; run;

ods output summary = Sums;

proc means data = deblur sum; by Phylum;

run;

/* Remove the variable and labels that proc means added in as variable names */ data sums2;

set sums;

drop VName:;

run; data sums3;

set sums2;

drop Label:; run;

/* I have a missing phylum name - label as other */

data sums4;

set sums3; if Phylum = " " then Phylum = " Other";

if Phylum = " [Thermi]" then Phylum = " Thermi";

run; /* import the transposed file into SAS */

PROC IMPORT OUT= WORK.SUMS5

DATAFILE= "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Data\ID15926_fro_trimmed_deblur\AddTaxonomy\sums4.xlsx"

DBMS=EXCEL REPLACE;

RANGE="Sheet1$";

GETNAMES=YES; MIXED=NO;

SCANTEXT=YES;

USEDATE=YES; SCANTIME=YES;

RUN;

/* Merge in the metadata info, like group and sample locations */ proc sort data = sums5;

by SampleID;

run;


proc sort data = Mb01.MbAnalysis; by SampleID;

run;

data Mb01.DeblurTax;

merge sums5 Mb01.Mbanalysis; by SampleID;

run;

* Create graphics. We are interested in the following phyla: Firmicutes, Bacteroidetes, Proteobacteria, Verrucomicrobia, Actinobacteria, Tenericutes,

Cyanobacteria, Fusobacteria, Spirochaetes;

* Create a dataset with relative abundance; data RPh;

set Mb01.DeblurTax;

TotCount = (Other +

Acidobacteria + Actinobacteria +

Aquificae +

Armatimonadetes + BHI80_139 +

BRC1 +

Bacteroidetes + Chlamydiae +

Chlorobi +

Chloroflexi +

Crenarchaeota + Cyanobacteria +

Deferribacteres +

Elusimicrobia + Euryarchaeota +

FBP +

Fibrobacteres +

Firmicutes + Fusobacteria +

GN02 +

Gemmatimonadetes + Lentisphaerae +

MVP_21 +

Nitrospirae + OD1 +

OP3 +

OP8 +

OP9 + Planctomycetes +

Proteobacteria +

SR1 + Spirochaetes +

Synergistetes +

TM6 + TM7 +

Tenericutes +

Verrucomicrobia +


WPS_2 + WS2 +

WS3 +

Thermi);

RA_Actinobacteria = (Actinobacteria/TotCount); RA_Bacteroidetes = (Bacteroidetes/TotCount);

RA_Cyanobacteria = (Cyanobacteria/TotCount);

RA_Firmicutes = (Firmicutes/TotCount); RA_Fusobacteria = (Fusobacteria/TotCount);

RA_Proteobacteria = (Proteobacteria/TotCount);

/*RA_Spirochaetes = (Spirochaetes/TotCount);*/ /*RA_Tenericutes = (Tenericutes/TotCount);*/

RA_Verrucomicrobia = (Verrucomicrobia/TotCount);

RA_Other = ((Other +

Acidobacteria + Aquificae +

Armatimonadetes +

BHI80_139 + BRC1 +

Chlamydiae +

Chlorobi + Chloroflexi +

Crenarchaeota +

Deferribacteres +

Elusimicrobia + Euryarchaeota +

FBP +

Fibrobacteres + GN02 +

Gemmatimonadetes +

Lentisphaerae +

MVP_21 + Nitrospirae +

OD1 +

OP3 + OP8 +

OP9 +

Planctomycetes + SR1 +

Spirochaetes +

Synergistetes +

TM6 + TM7 +

Tenericutes +

WPS_2 + WS2 +

WS3 +

Thermi)/TotCount); run;

*Stack the data;

data RPH2;


keep PID

sample_type

sample_type_num

Timepoint Group

RA_Actinobacteria

RA_Bacteroidetes RA_Cyanobacteria

RA_Firmicutes

RA_Fusobacteria RA_Proteobacteria

RA_Verrucomicrobia

RA_Other;

set RPH; run;

proc sort data = RPH2;

by PID Timepoint; run;

* Verrucomicrobia;

data a; keep

PID

sample_type

sample_type_num Timepoint

Group

Phylum RA_Phylum;

set RPH2;

Phylum = "Verrucomicrobia";

RA_Phylum = RA_Verrucomicrobia; run;

* Bacteroidetes;

data b; keep

PID

sample_type sample_type_num

Timepoint

Group

Phylum RA_Phylum;

set RPH2;

Phylum = "Bacteroidetes"; RA_Phylum = RA_Bacteroidetes;

run;

data bi; merge a b;

by Phylum PID Timepoint;

run;


* Cyanobacteria;

data c;
keep
PID
sample_type sample_type_num
Timepoint
Group Phylum
RA_Phylum;
set RPH2; Phylum = "Cyanobacteria";
RA_Phylum = RA_Cyanobacteria;
run;

data ci; merge bi c;
by Phylum PID Timepoint;
run;

* Firmicutes;

data d;

keep PID

sample_type

sample_type_num

Timepoint Group

Phylum

RA_Phylum; set RPH2;

Phylum = "Firmicutes";

RA_Phylum = RA_Firmicutes;

run; data di;

merge ci d;

by Phylum PID Timepoint; run;

* Fusobacteria;

data e; keep
PID
sample_type
sample_type_num Timepoint
Group
Phylum RA_Phylum;

set RPH2;

Phylum = "Fusobacteria"; RA_Phylum = RA_Fusobacteria;

run;

data ei;


merge di e; by Phylum PID Timepoint;

run;

* Proteobacteria;

data f; keep
PID
sample_type sample_type_num
Timepoint
Group Phylum
RA_Phylum;
set RPH2;
Phylum = "Proteobacteria"; RA_Phylum = RA_Proteobacteria;
run;

data fi; merge ei f;
by Phylum PID Timepoint;
run;

* Actinobacteria;

data i;

keep

PID sample_type

sample_type_num

Timepoint Group

Phylum

RA_Phylum;

set RPH2; Phylum = "Actinobacteria";

RA_Phylum = RA_Actinobacteria;

run;

data ii;
merge fi i;
by Phylum PID Timepoint;
run;

* Other;

data j;
keep PID
sample_type
sample_type_num Timepoint
Group
Phylum RA_Phylum;
set RPH2;
Phylum = "Other";
RA_Phylum = RA_Other;
run;

data ji;
merge ii j;
by Phylum PID Timepoint;
run;
data Rph3;

set ji; run;

*Isolate data by gardening groups;

data Rph3_GroupZero; set ji;

where Group = 0;

run;

data Rph3_GroupOne; set ji;

where Group = 1;

run; /*********************************************

Create a stacked bar chart

**********************************************/ *Summed over all 6;

data Rph5;

set Rph3;

RA_Phylum_6 = RA_Phylum/6; if Phylum = "Firmicutes" then PhylumCat = 9;

else if Phylum = "Bacteroidetes" then PhylumCat = 8;

else if Phylum = "Proteobacteria" then PhylumCat = 7; else if Phylum = "Verrucomicrobia" then PhylumCat = 6;

else if Phylum = "Actinobacteria" then PhylumCat = 5;

else if Phylum = "Cyanobacteria" then PhylumCat = 4;

else if Phylum = "Fusobacteria" then PhylumCat = 3; else if Phylum = "Other" then PhylumCat = 2;

run;

proc format; value Phylum 2 = "Other"

3 = "Fusobacteria"

4 = "Cyanobacteria" 5 = "Actinobacteria"

6 = "Verrucomicrobia"

7 = "Proteobacteria"

8 = "Bacteroidetes" 9 = "Firmicutes";

run;

*Graphic for gardeners; *Subset of gardeners;

data Rph5_GroupOne;

set Rph5; where Group = 1;

run;

ODS Graphics / Height = 6in Width = 4in;


proc sgpanel data = Rph5_GroupOne; title "The Relative Proportion of Phyla Represented in Gardeners' Microbiome Samples";

title2 "Summed Over 6 Timepoints";

panelby Sample_type / columns = 1 rows = 4 novarname;

vbar PID / response = RA_Phylum_6 group = PhylumCat

groupdisplay = stack

barwidth = 1; rowaxis label = "Relative Proportion of each Phylum";

colaxis label = "Participant";

keylegend / title = "Phylum"; format PhylumCat Phylum.;

run;

*Graphic for non-gardeners;

*Subset of non-gardeners; data Rph5_GroupZero;

set Rph5;

where Group = 0; run;

proc sgpanel data = Rph5_GroupZero;

title "The Relative Proportion of Phyla Represented in Non-Gardeners' Microbiome Samples"; title2 "Summed Over 6 Timepoints";

panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PID / response = RA_Phylum_6

group = PhylumCat groupdisplay = stack

barwidth = 1;

rowaxis label = "Relative Proportion of each Phylum"; colaxis label = "Participant";

keylegend / title = "Phylum";

format PhylumCat Phylum.;

run; /***************************************

Graphic that sums over participants and not time

****************************************/ proc format;

value Phylum 2 = "Other"

3 = "Fusobacteria" 4 = "Cyanobacteria"

5 = "Actinobacteria"

6 = "Verrucomicrobia"

7 = "Proteobacteria" 8 = "Bacteroidetes"

9 = "Firmicutes";

run;

ODS PDF DPI = 1200
file = "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Text\Deblur_Figures_SampleByTime_01.pdf"
startpage = no;

*Graphic for gardeners; *Subset gardeners;

data Rph6_GroupOne;

set Rph3;

where Group = 1; RA_Phylum_Gp = RA_Phylum/5;

if Phylum = "Firmicutes" then PhylumCat = 9;

else if Phylum = "Bacteroidetes" then PhylumCat = 8; else if Phylum = "Proteobacteria" then PhylumCat = 7;

else if Phylum = "Verrucomicrobia" then PhylumCat = 6;

else if Phylum = "Actinobacteria" then PhylumCat = 5; else if Phylum = "Cyanobacteria" then PhylumCat = 4;

else if Phylum = "Fusobacteria" then PhylumCat = 3;

else if Phylum = "Other" then PhylumCat = 2;

run;

proc sgpanel data = Rph6_GroupOne;
title "The Relative Proportion of Phyla Represented in Gardeners' Microbiome Samples";
title2 "Over Time";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar Timepoint / response = RA_Phylum_Gp
group = PhylumCat
groupdisplay = stack
barwidth = 1
transparency = 0.5;
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Time";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;

*Graphic for non-gardeners;
*Subset non-gardeners;

data Rph6_GroupZero;
set Rph3;
where Group = 0;
RA_Phylum_Gp = RA_Phylum/6;
if Phylum = "Firmicutes" then PhylumCat = 9;
else if Phylum = "Bacteroidetes" then PhylumCat = 8;
else if Phylum = "Proteobacteria" then PhylumCat = 7;
else if Phylum = "Verrucomicrobia" then PhylumCat = 6;
else if Phylum = "Actinobacteria" then PhylumCat = 5;
else if Phylum = "Cyanobacteria" then PhylumCat = 4;
else if Phylum = "Fusobacteria" then PhylumCat = 3;
else if Phylum = "Other" then PhylumCat = 2;
run;

proc sgpanel data = Rph6_GroupZero;
title "The Relative Proportion of Phyla Represented in Non-Gardeners' Microbiome Samples";
title2 "Over Time";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar Timepoint / response = RA_Phylum_Gp
group = PhylumCat
groupdisplay = stack
barwidth = 1
transparency = 0.5;
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Time";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;

ods pdf close;

/***************************************

Graphic that shows all participant data over time

****************************************/ *Create a PID by time variable to indicate groups on x-axis;

data Rph4;
set Rph3;
PIDTime = PID+(Timepoint*0.1);
if Phylum = "Firmicutes" then PhylumCat = 9;
else if Phylum = "Bacteroidetes" then PhylumCat = 8;
else if Phylum = "Proteobacteria" then PhylumCat = 7;
else if Phylum = "Verrucomicrobia" then PhylumCat = 6;
else if Phylum = "Actinobacteria" then PhylumCat = 5;
else if Phylum = "Cyanobacteria" then PhylumCat = 4;
else if Phylum = "Fusobacteria" then PhylumCat = 3;
else if Phylum = "Other" then PhylumCat = 2;
run;

ODS PDF DPI = 1200
file = "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Text\Deblur_Figures_SampleByPIDByTime_01.pdf"
startpage = no;

*graphic for gardeners;
*subset gardeners;

data Rph4_one;
set Rph4;
where Group = 1;
run;

proc sgpanel data = Rph4_One;
title "Relative Proportion of Phyla Present in the Microbiome of Gardeners";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PIDTime / response = RA_Phylum
group = PhylumCat
groupdisplay = stack
barwidth = 1
transparency = 0.5;
refline 5.1 7.1 12.1 13.1 / axis = x discreteoffset = -0.5 lineattrs = (color = Black thickness = 5);
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Participant.Timepoint";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;

*graphic for non-gardeners;

*subset non-gardeners; data Rph4_zero;

set Rph4;

where Group = 0; run;

proc sgpanel data = Rph4_Zero;
title "Relative Proportion of Phyla Present in the Microbiome of Non-Gardeners";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PIDTime / response = RA_Phylum
group = PhylumCat
groupdisplay = stack
barwidth = 1
transparency = 0.5;
refline 3.1 6.1 8.1 9.1 14.1 / axis = x discreteoffset = -0.5 lineattrs = (color = Black thickness = 4);
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Participant.Timepoint";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;

ods pdf close;
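The relative-abundance step in this appendix divides each phylum count by the sample total and pools every phylum outside the seven plotted ones into "Other". A minimal Python sketch of that arithmetic follows, written in Python rather than SAS purely for illustration. The function name relative_abundance and the example counts are hypothetical and are not study data.

```python
# The seven phyla plotted individually in the figures; all others are pooled.
NAMED_PHYLA = (
    "Firmicutes", "Bacteroidetes", "Proteobacteria", "Verrucomicrobia",
    "Actinobacteria", "Cyanobacteria", "Fusobacteria",
)


def relative_abundance(counts):
    """Convert per-phylum read counts for one sample into proportions.

    counts: dict mapping phylum name to read count (hypothetical input shape).
    Returns a dict with one entry per named phylum plus 'Other'.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    ra = {p: counts.get(p, 0) / total for p in NAMED_PHYLA}
    pooled = sum(v for p, v in counts.items() if p not in NAMED_PHYLA)
    ra["Other"] = pooled / total
    return ra


# Hypothetical counts for a single sample.
sample = {"Firmicutes": 50, "Bacteroidetes": 30, "Tenericutes": 15, "TM7": 5}
ra = relative_abundance(sample)
print(ra)
```

Because every phylum is either named or pooled, the proportions for a sample sum to one, which is what makes the stacked bars in the figures comparable across samples.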


Appendix C. Test for differences in the abundance of sOTUs between gardeners and non-gardeners

R, ANCOM

## ANCOM code downloaded from https://sites.google.com/site/siddharthamandal1985/research

# Run the downloaded code for ANCOM, titled ANCOM_updated_code.R

library(exactRankTests)

library(nlme)

library(ggplot2)

# Separate, parallel models were run for each sample type. We present code for the forehead below.

# Restrict samples to forehead

Var_data_forehead <-

var_data[which(var_data$sample_type_num>0.5&var_data$sample_type_num<1.5),]

# Forehead analysis accounting for within-subject variability

longitudinal_comparison_foreheadAdjRand=ANCOM.main(OTUdat=otu_data,

Vardat=Var_data_forehead,

adjusted=T, repeated=T,

main.var="Group",

adj.formula="sexF+age_1",

repeat.var="NULL", longitudinal=TRUE,

random.formula="~1|PID",

multcorr=2, sig=0.05,

prev.cut=0.90)

longitudinal_comparison_foreheadAdjRand$W.taxa


Appendix D – Alpha Diversity. Test for differences in Shannon diversity or Faith’s PD

between gardeners and non-gardeners.

SAS, Linear Mixed Model

Models for Shannon diversity and Faith’s phylogenetic diversity are parallel.

The data were subset by sample location.

proc mixed data = mouth;

title "Mouth Shannon Diversity, Full Model";

class PID;

model Shannon_Deblur = group*timepoint*age_1*BMI_1

group*timepoint*age_1 group*timepoint*BMI_1

group*age_1*BMI_1

timepoint*age_1*BMI_1

group*timepoint

group*age_1

group*BMI_1 timepoint*age_1

timepoint*BMI_1

age_1*BMI_1

group

timepoint

sexF age_1

BMI_1

/ residual outp = mouthresout ddfm = kr; repeated PID;

run;

proc univariate data = mouthresout;

var studentResid;

title "Full Mouth Model";

histogram studentResid / normal;

run;


Appendix E – Beta Diversity. Visualize variability within and between participant samples.

SAS, Variability within and between participants

/********************************************************************* Convert distance matrix into a list of sample pairs and correlations

**********************************************************************/

proc iml; use UniFrac;

read all var "SampleID" into ColNames; /* get names of variables */

read all var (ColNames) into mCorr; /* matrix of correlations */

close UniFrac; numCols = ncol(mCorr); /* number of variables */

numPairs = numCols*(numCols-1) / 2;

length = 2*nleng(ColNames) + 5; /* max length of new ID variable */ Sample1 = j(NumPairs, 1, BlankStr(length));

i = 1;

do row= 2 to numCols; /* construct the pairwise names */

do col = 1 to row-1; Sample1[i] = ColNames[col];

i = i + 1;

end; end;

Sample2 = j(NumPairs, 1, BlankStr(length));

i = 1; do row= 2 to numCols; /* construct the pairwise names */

do col = 1 to row-1;

Sample2[i] = ColNames[row];

i = i + 1; end;

end;

lowerIdx = loc(row(mCorr) > col(mCorr)); /* indices of lower-triangular elements */ Corr = mCorr[ lowerIdx ];

create CorrPairs var {"Sample1" "Sample2" "Corr"};

append; close;

QUIT;

/****************************************************************************

Merge in metadata.

This step introduces sample identification for the members of each correlation pair. This requires two

stages of merging: the first merge identifies the first column of pair members and the second merge identifies the second column of pair members.

****************************************************************************/

/**************************** Round 1 of merging

******************************/

proc sort data = CorrPairs; by Sample1;

run;

*Add "SampleID" to the correlation dataset. This is how the sample identification variable was named in the metadata;

data CorrPairs2; set CorrPairs;

SampleID = Sample1;

SampleID2 = Sample2;

run;

*For SAS variable naming conventions, an x was added to the start of each numeric sample id;

proc sort data = Mb01.MetaSampleX;

by SampleID; run;

data DistMeta;

merge CorrPairs2 Mb01.MetaSampleX; by SampleID;

run;

/***************************************************************************

Round 2 of merging

Reduce Mb01.MetaSampleX so it only includes sample, time point, and group.

****************************************************************************/

data MetaReduce; keep SampleID2 timepointS2 groupS2 PIDS2 sample_type_numS2;

set Mb01.MetaSampleX;

SampleID2 = SampleID; timepointS2 = timepoint;

groupS2 = group;

PIDS2 = PID;

sample_type_numS2 = sample_type_num; run;

proc sort data = DistMeta;

by SampleID2; run;

proc sort data = MetaReduce;

by SampleID2;

run; data DistMeta2;

merge DistMeta MetaReduce;

by SampleID2; run;

*Remove any missing values of UniFrac;

data DistMeta3; set DistMeta2;

where corr NE .;

run;

/****************************************************************************

For each correlation pair, define the following

1. Members of the pair came from the same participant

2. Members of the pair came from the same sample type

3. Members of the pair are from the same intervention group

4. Members of the pair are from the same time point

5. Within variability - members of the pair are from the same participant and same sample type

6. Between variability - members of the pair are from different participants but the same sample type and time point

****************************************************************************/

data interactions;

set DistMeta3;

if PID = PIDS2 then PIDMatch = 1;

else PIDMatch = 0; if sample_type_num = sample_type_numS2 then SampleMatch = 1;

else SampleMatch = 0;

Match = (PIDMatch*SampleMatch); run;

*Code this so the between variation only considers differences between participants from the same day;

data graphicDTD; set interactions;

if timepoint = timepointS2 then TimeMatch = 1;

else TimeMatch = 0;

if PIDMatch = 1 and SampleMatch = 1 then Within = 1; else Within = 0;

if PIDMatch = 0 and SampleMatch = 1 and TimeMatch = 1 then Between = 1;

else Between = 0; if group = groups2 then GroupMatch = 1;

else GroupMatch = 0;

run; proc sort data = graphicDTD;

by Within;

run;

/*************************************************************************** UniFrac graphics showing within and between variability over sample type

****************************************************************************/

proc format; value var 0 = "Variation between participants"

1 = "Variation within participants";

value varDTD 0 = "Variation between participants (day-to-day)"

1 = "Variation within participants"; run;

ods pdf file = "~\UniFrac_unweightedPlots.pdf"; proc sgplot data = graphicDTD;

title "Variability Within and Between Participants";

vbar sample_type / response = corr group = Within groupdisplay = cluster stat = Mean limitstat = stderr limitattrs = (color = black);

format Within varDTD.;

yaxis label = "Unweighted UniFrac Distance";

xaxis label = "Sampling Location"; styleattrs datacolors = (Black White) datacontrastcolors = (Black Black);

keylegend / title = " ";

run; ods pdf close;
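The steps in this appendix (flatten the lower triangle of the distance matrix into sample pairs, attach metadata to both members of each pair, then label each pair) can be sketched compactly. The Python below is an illustrative sketch under stated assumptions, not a translation of the SAS program: the function names, the dictionary layout of the metadata, and the four-sample distance matrix are all hypothetical. The labels follow the definitions of within and between variability listed in the comment block of this appendix.

```python
def pairwise_distances(ids, matrix):
    """List (sample1, sample2, distance) for each cell below the diagonal."""
    pairs = []
    for row in range(1, len(ids)):
        for col in range(row):
            pairs.append((ids[col], ids[row], matrix[row][col]))
    return pairs


def classify(pair, meta):
    """Label a pair 'within', 'between', or None (neither definition applies)."""
    s1, s2, _ = pair
    a, b = meta[s1], meta[s2]
    same_pid = a["PID"] == b["PID"]
    same_type = a["sample_type"] == b["sample_type"]
    same_time = a["Timepoint"] == b["Timepoint"]
    if same_pid and same_type:
        return "within"
    if not same_pid and same_type and same_time:
        return "between"
    return None


# Hypothetical samples and distances, for illustration only.
ids = ["s1", "s2", "s3", "s4"]
matrix = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.6, 0.7, 0.0, 0.5],
    [0.9, 0.8, 0.5, 0.0],
]
meta = {
    "s1": {"PID": 1, "sample_type": "Forehead", "Timepoint": 1},
    "s2": {"PID": 1, "sample_type": "Forehead", "Timepoint": 2},
    "s3": {"PID": 2, "sample_type": "Forehead", "Timepoint": 1},
    "s4": {"PID": 2, "sample_type": "Mouth", "Timepoint": 1},
}
pairs = pairwise_distances(ids, matrix)
labels = [classify(p, meta) for p in pairs]
print(list(zip(pairs, labels)))
```

Pairs labelled None (different sample types, or different participants at different time points) fall under neither definition and would be excluded before averaging distances within each label.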


Appendix F – Beta Diversity. Test for differences in weighted UniFrac between gardeners

and non-gardeners

R, Nested Permutation ANOVAs

# Import datasets

library(readr)

UniFrac_unweighted_denovo_181203 <- read_delim("~/UniFrac_unweighted_denovo.txt",
    "\t", escape_double = FALSE, trim_ws = TRUE)

# Import metadata so we can compare UniFrac by group, time, sex, age, and BMI.

library(haven)

mbanalysis <- read_sas("~/mbanalysis.sas7bdat", NULL)

#### Note: make sure that column and row names match after R imports this matrix. R likes to add an x
#### when a variable name starts with a number. I fixed this problem by adding x's to all column and
#### row names before importing into R.

# Format data for permutation ANOVAs.

# Make the first column into row names

UniFrac <- as.data.frame(UniFrac_unweighted_denovo_181203) # Shorten object name

a <- UniFrac[,1]

rownames(UniFrac) <- a

UniFrac <- UniFrac[,-1]

UniFrac[1:5, 1:5]

# Remove missing samples from the metadata.

meta <- mbanalysis[complete.cases(mbanalysis[ , "Shannon_CR_OTC"]),]

meta2 <- meta[,c("SampleID","Timepoint", "Group", "sample_type")]

# Distance matrix samples and metadata samples must be in the same order.

sampleOrder <- as.numeric(row.names(UniFrac))

sampleOrder2 <- as.data.frame(sampleOrder)

colnames(sampleOrder2) <- "SampleID"

orderMeta <- merge(sampleOrder2,meta2,by.x="SampleID", sort = FALSE)

GroupTime <- orderMeta[,c("Group", "Timepoint")]

GroupTime$Timepoint <- as.factor(GroupTime$Timepoint)

GroupTime$Group <- as.factor(GroupTime$Group)

# Sample ID must be listed in the first column of the data frame

pax <- as.data.frame(rownames(UniFrac))

colnames(pax) <- "SampleID"

Pax2 <- cbind(pax, UniFrac)

###################################################################

# The following code is parallel for all sample types. Thus, we only report forehead.

#####################################################################

## Restrict meta samples to forehead and order same as UniFrac

foreheadMeta <- meta2[meta2$sample_type == "Forehead", ]

foreheadIDs <- foreheadMeta$SampleID

ForeheadUniFrac <- Pax2[rownames(Pax2) %in% foreheadIDs,colnames(Pax2) %in% foreheadIDs]

# Group by PID

ForeheadGroupPID <- meta[meta$SampleID %in% foreheadIDs,c("SampleID", "Group", "PID")]

## Order

ForeheadOrder <- as.numeric(row.names(ForeheadUniFrac))

ForeheadOrder2 <- as.data.frame(ForeheadOrder)

colnames(ForeheadOrder2) <- "SampleID"

# Nested Group by PID

ForeheadGroupPID2 <- merge(ForeheadOrder2,ForeheadGroupPID,by.x="SampleID", sort = FALSE)

### Remove sampleID from the metadata, and define covariates as factors

# Group by PID

ForeheadGroupPID3 <- ForeheadGroupPID2[,2:3]

ForeheadGroupPID3$Group <- as.factor(ForeheadGroupPID3$Group)

ForeheadGroupPID3$PID <- as.factor(ForeheadGroupPID3$PID)

# Forehead Models

library(BiodiversityR) # provides nested.npmanova

nested.npmanova(ForeheadUniFrac~Group+PID, data = ForeheadGroupPID3, permutations=999,
    warnings=FALSE)
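Much of the preparation in this appendix exists to satisfy one requirement: the rows and columns of the distance matrix and the rows of the metadata must refer to the same samples in the same order. The Python sketch below illustrates that subsetting and alignment step only; it does not implement the nested permutation ANOVA itself. The function name subset_and_align, the metadata layout, and the example values are hypothetical.

```python
def subset_and_align(ids, matrix, meta, sample_type):
    """Restrict a square distance matrix to one sample type.

    Returns the kept sample IDs, the matching sub-matrix, and the metadata
    rows, all in the same order as the original matrix. Samples that are
    absent from the metadata are dropped.
    """
    keep = [
        i for i, sid in enumerate(ids)
        if sid in meta and meta[sid]["sample_type"] == sample_type
    ]
    sub_ids = [ids[i] for i in keep]
    sub_matrix = [[matrix[r][c] for c in keep] for r in keep]
    sub_meta = [meta[sid] for sid in sub_ids]
    return sub_ids, sub_matrix, sub_meta


# Hypothetical samples and distances, for illustration only.
ids = ["s1", "s2", "s3", "s4"]
matrix = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.6, 0.7, 0.0, 0.5],
    [0.9, 0.8, 0.5, 0.0],
]
# Metadata is deliberately listed in a different order from the matrix.
meta = {
    "s3": {"PID": 2, "Group": 0, "sample_type": "Forehead"},
    "s4": {"PID": 2, "Group": 0, "sample_type": "Mouth"},
    "s1": {"PID": 1, "Group": 1, "sample_type": "Forehead"},
    "s2": {"PID": 1, "Group": 1, "sample_type": "Forehead"},
}
sub_ids, sub_matrix, sub_meta = subset_and_align(ids, matrix, meta, "Forehead")
print(sub_ids, [m["PID"] for m in sub_meta])
```

Driving the ordering from the matrix rather than from the metadata mirrors the approach in the R code, where the metadata is merged onto the row names of the distance matrix with sorting turned off.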
