Feasibility of collection and analysis of microbiome data in a longitudinal
randomized trial of community gardening
Running head: RCT community gardening and microbiome
Mireia Gascon (a,b,c), Kylie K. Harrall (d), Alyssa W. Beavers (e), Deborah H. Glueck (f),
Maggie A. Stanislawski (d,g), Katherine Alaimo (h), Angel Villalobos (i), James R. Hebert (j),
Kelsey Dexter (k), Kaigang Li (l), Jill Litt (a,i)*
Affiliations
(a) ISGlobal, Barcelona, Spain
(b) Universitat Pompeu Fabra (UPF), Barcelona, Spain
(c) CIBER Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain
(d) Lifecourse Epidemiology of Adiposity and Diabetes Center, Colorado School of Public
Health, University of Colorado Denver, Aurora, Colorado, United States of America
(e) Department of Food Science and Human Nutrition, Michigan State University, East
Lansing, Michigan, United States of America
(f) Department of Pediatrics, University of Colorado School of Medicine, University of
Colorado Denver, Anschutz Medical Campus, Aurora, Colorado, United States of
America
(g) Department of Epidemiology, University of Colorado School of Public Health,
University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado, United
States of America
(h) Department of Food Science and Human Nutrition, Michigan State University,
Michigan, United States of America
(i) Environmental Studies, University of Colorado Boulder, Boulder, Colorado, United
States of America
(j) Department of Epidemiology and Biostatistics and Cancer Prevention and Control
Program, Arnold School of Public Health, University of South Carolina, Columbia,
United States of America
(k) Department of Endocrinology, University of Colorado School of Public Health,
University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado, United
States of America
(l) Department of Health & Exercise Science, Colorado State University, Colorado, United
States of America
Correspondence to:
*Jill Litt, [email protected]
4001 Discovery Drive, Boulder, Colorado 80303
303-735-4519
Author contributions
M. Gascon contributions: formal analysis, writing, editing; KK. Harrall contributions:
data curation, methodology, formal analysis, writing, editing; AW. Beavers contributions:
data curation, methodology, formal analysis, writing, editing; DH. Glueck contributions:
conceptualization, methodology, formal analysis, writing, editing; MA. Stanislawski
contributions: formal analysis, writing, editing; K. Alaimo contributions:
conceptualization, funding acquisition, methodology, validation, writing, editing; A.
Villalobos contributions: methodology, validation, writing, editing; JR. Hebert
contributions: methodology, validation, writing, editing; K. Dexter contributions:
methodology, validation, writing, editing; K. Li contributions: methodology, validation,
writing, editing; J. Litt contributions: conceptualization, funding acquisition,
investigation, methodology, supervision, writing and editing.
Acknowledgements
We thank Robert Knight and his lab, particularly Daniel McDonald, for the analysis of the
microbiome, and Allyson Masunaga Goto, Jessica Metcalf and Lara Fahnestock for their
advice and contributions to the project.
Funding
This study was funded by the University of Colorado Boulder Population Center (CUPC,
Litt, PI), through the National Institute of Child Health & Human Development of the
National Institutes of Health under Award Number P2CHD066613-06 and the Center for
Microbiome Innovation at the University of California San Diego. We also received
supplemental funding through the Clinical & Translational Research Center (CTRC) to
cover all laboratory costs (Litt, PI). Mireia Gascon received a fellowship from the Societat
Econòmica Barcelonesa d’Amics del País (SEBAP) in 2018, Barcelona (Catalonia), for
her research stay at the University of Colorado to conduct the statistical analysis for this
work. DHG was supported, in part, by R01GM121-81 and R25 GM111901.
Ethical conduct of research
The DGEM study was approved by the University of Colorado Boulder
Institutional Review Board Office (Protocol #: 16-0644).
Data sharing statement
The authors certify that this manuscript reports original clinical trial data. Individual, de-
identified participant data that underlie the results reported in this article (text, tables,
figures, and appendices) are available from the corresponding author (Jill Litt:
[email protected]) following publication, including the clinical study report and study
protocol.
Word count: 5817
Figure number: 5
Table number: 2
Appendices
The Denver Garden Environment and Microbiome (DGEM) feasibility study: a roadmap for the
analysis of microbiome results for a randomized controlled trial of community gardening.
Mireia Gascon, Kylie K. Harrall, Alyssa W. Beavers, Deborah H. Glueck, Maggie A. Stanislawski,
Katherine Alaimo, Angel Villalobos, James R. Hebert, Kelsey Dexter, Kaigang Li, Jill Litt
Appendix A. Add taxonomic information to the deblur reference-hit sequences.
Qiime2, Assign Taxonomy
Note: Qiime2 commands are case sensitive.
Convert a BIOM file to a text file.
biom convert -i /media/sf_data/rarefied_denovo_FeatureTable.BIOM \
  -o /media/sf_data/rarefied_denovo_FeatureTable.txt \
  --to-tsv
Import the reference-hit sequences, downloaded from Qiita as an FA file, into Qiime2 as FeatureData[Sequence].
qiime tools import --input-path /media/sf_data/reference-hit.seqs.fa \
  --output-path /media/sf_data/reference-hit.seqs.qza \
  --type 'FeatureData[Sequence]'
Assign taxonomy to the reference-hit sequences.
Download the Greengenes classifier from https://chmi-sops.github.io/mydoc_qiime2.html
Visualization files are viewable at https://view.qiime2.org.
Setting --p-reads-per-batch 10000 allowed this step to run on a laptop.
qiime feature-classifier classify-sklearn \
  --i-classifier /media/sf_data/gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads /media/sf_data/reference-hit.seqs.qza \
  --o-classification /media/sf_data/reference-hit.seqs.taxonomy.qza \
  --p-reads-per-batch 10000
qiime metadata tabulate \
  --m-input-file /media/sf_data/reference-hit.seqs.taxonomy.qza \
  --o-visualization /media/sf_data/reference-hit.seqs.taxonomy.qzv
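Appendix B notes that Phylum was extracted from the combined taxonomy string into its own variable before SAS import, but that step is not shown. The following is a minimal Python sketch of one way to do it, assuming Greengenes-style labels of the form `k__Bacteria; p__Firmicutes; ...`. The function name `extract_phylum` and the `Other` fallback label are illustrative choices, not the authors' code.

```python
def extract_phylum(taxon: str, missing: str = "Other") -> str:
    """Return the phylum from a Greengenes-style taxonomy string.

    Assumes ranks are separated by ';' and the phylum rank is prefixed 'p__'.
    """
    for rank in taxon.split(";"):
        rank = rank.strip()
        if rank.startswith("p__"):
            # Drop the rank prefix and the square brackets Greengenes uses
            # for proposed names, e.g. p__[Thermi] -> Thermi.
            name = rank[3:].strip("[] ")
            return name or missing
    # No phylum rank present (e.g. an unassigned sequence).
    return missing
```

Stripping the brackets and mapping empty labels to `Other` mirrors the relabelling that the SAS code in Appendix B applies after import.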
Appendix B. Phylum Index Figures
SAS, Relative proportion phylum index figures
*Import the reference-hit frequency table and reference-hit sequences with taxonomic assignment;
*Frequency table;
PROC IMPORT OUT= WORK.Biomfile
DATAFILE= "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Data\ID15926_fro_trimmed_deblur\AddTaxonomy\Deblur_biom_forSAS.xlsx"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
*Sequences;
/* NOTE: taxonomic identifiers include Kingdom, Phylum, Class, Order, Family, Genus, Species in a
single variable. Prior to SAS import, Phylum was extracted into its own variable. */
PROC IMPORT OUT= WORK.Biomtax
DATAFILE= "~\reference-hit.seqs.wTax_Phylum.xlsx"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
*Merge the count data and the sequences with taxonomic information;
data MB01.Biomtax; /*Reference-hit sequences and taxonomic identification*/
set Biomtax;
run;
data MB01.Biomfile; /*Reference-hit frequency table*/
set Biomfile;
run;
*Align variable names between the sequence and frequency datasets;
data a;
set MB01.Biomtax;
keep Otu_ID Phylum;
Otu_ID = Feature_ID;
run;
proc sort data = a;
by OTU_ID;
run;
proc sort data = Biomfile;
by OTU_ID;
run;
data deblur;
merge a BiomFile;
by OTU_ID;
run;
data MB01.deblur;
set deblur;
run;
*How many phyla does this dataset contain?;
proc freq data = Mb01.deblur;
tables phylum;
run;
/* Remove OTU_ID */
data deblur;
set Mb01.deblur;
drop OTU_ID;
run;
proc sort data = deblur;
by Phylum;
run;
ods output summary = Sums;
proc means data = deblur sum;
by Phylum;
run;
/* Remove the variables and labels that proc means added in as variable names */
data sums2;
set sums;
drop VName:;
run;
data sums3;
set sums2;
drop Label:;
run;
/* One phylum name is missing - label it as Other */
data sums4;
set sums3;
if Phylum = " " then Phylum = " Other";
if Phylum = " [Thermi]" then Phylum = " Thermi";
run;
/* Import the transposed file into SAS (sums4, exported and transposed so that each phylum is a column and each sample is a row) */
PROC IMPORT OUT= WORK.SUMS5
DATAFILE= "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Data\ID15926_fro_trimmed_deblur\AddTaxonomy\sums4.xlsx"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
/* Merge in the metadata info, like group and sample locations */
proc sort data = sums5;
by SampleID;
run;
proc sort data = Mb01.MbAnalysis;
by SampleID;
run;
data Mb01.DeblurTax;
merge sums5 Mb01.Mbanalysis;
by SampleID;
run;
* Create graphics, we are interested in the following phyla: Firmicutes, Bacteroidetes, Proteobacteria, Verrucomicrobia, Actinobacteria, Tenericutes,
Cyanobacteria, Fusobacteria, Spirochaetes;
* Create a dataset with relative abundance;
data RPh;
set Mb01.DeblurTax;
TotCount = (Other + Acidobacteria + Actinobacteria + Aquificae + Armatimonadetes +
  BHI80_139 + BRC1 + Bacteroidetes + Chlamydiae + Chlorobi + Chloroflexi +
  Crenarchaeota + Cyanobacteria + Deferribacteres + Elusimicrobia + Euryarchaeota +
  FBP + Fibrobacteres + Firmicutes + Fusobacteria + GN02 + Gemmatimonadetes +
  Lentisphaerae + MVP_21 + Nitrospirae + OD1 + OP3 + OP8 + OP9 + Planctomycetes +
  Proteobacteria + SR1 + Spirochaetes + Synergistetes + TM6 + TM7 + Tenericutes +
  Verrucomicrobia + WPS_2 + WS2 + WS3 + Thermi);
RA_Actinobacteria = (Actinobacteria/TotCount);
RA_Bacteroidetes = (Bacteroidetes/TotCount);
RA_Cyanobacteria = (Cyanobacteria/TotCount);
RA_Firmicutes = (Firmicutes/TotCount);
RA_Fusobacteria = (Fusobacteria/TotCount);
RA_Proteobacteria = (Proteobacteria/TotCount);
/*RA_Spirochaetes = (Spirochaetes/TotCount);*/
/*RA_Tenericutes = (Tenericutes/TotCount);*/
RA_Verrucomicrobia = (Verrucomicrobia/TotCount);
RA_Other = ((Other + Acidobacteria + Aquificae + Armatimonadetes + BHI80_139 + BRC1 +
  Chlamydiae + Chlorobi + Chloroflexi + Crenarchaeota + Deferribacteres +
  Elusimicrobia + Euryarchaeota + FBP + Fibrobacteres + GN02 + Gemmatimonadetes +
  Lentisphaerae + MVP_21 + Nitrospirae + OD1 + OP3 + OP8 + OP9 + Planctomycetes +
  SR1 + Spirochaetes + Synergistetes + TM6 + TM7 + Tenericutes + WPS_2 + WS2 +
  WS3 + Thermi)/TotCount);
run;
*Stack the data;
data RPH2;
keep PID sample_type sample_type_num Timepoint Group
  RA_Actinobacteria RA_Bacteroidetes RA_Cyanobacteria RA_Firmicutes
  RA_Fusobacteria RA_Proteobacteria RA_Verrucomicrobia RA_Other;
set RPH;
run;
proc sort data = RPH2;
by PID Timepoint;
run;
* Verrucomicrobia;
data a;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Verrucomicrobia";
RA_Phylum = RA_Verrucomicrobia;
run;
* Bacteroidetes;
data b;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Bacteroidetes";
RA_Phylum = RA_Bacteroidetes;
run;
data bi;
merge a b;
by Phylum PID Timepoint;
run;
* Cyanobacteria;
data c;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Cyanobacteria";
RA_Phylum = RA_Cyanobacteria;
run;
data ci;
merge bi c;
by Phylum PID Timepoint;
run;
* Firmicutes;
data d;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Firmicutes";
RA_Phylum = RA_Firmicutes;
run;
data di;
merge ci d;
by Phylum PID Timepoint;
run;
* Fusobacteria;
data e;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Fusobacteria";
RA_Phylum = RA_Fusobacteria;
run;
data ei;
merge di e;
by Phylum PID Timepoint;
run;
* Proteobacteria;
data f;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Proteobacteria";
RA_Phylum = RA_Proteobacteria;
run;
data fi;
merge ei f;
by Phylum PID Timepoint;
run;
* Actinobacteria;
data i;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Actinobacteria";
RA_Phylum = RA_Actinobacteria;
run;
data ii;
merge fi i;
by Phylum PID Timepoint;
run;
* Other;
data j;
keep PID sample_type sample_type_num Timepoint Group Phylum RA_Phylum;
set RPH2;
Phylum = "Other";
RA_Phylum = RA_Other;
run;
data ji;
merge ii j;
by Phylum PID Timepoint;
run;
data Rph3;
set ji;
run;
*Isolate data by gardening groups;
data Rph3_GroupZero;
set ji;
where Group = 0;
run;
data Rph3_GroupOne;
set ji;
where Group = 1;
run;
/*********************************************
Create a stacked bar chart
**********************************************/
*Summed over all 6;
data Rph5;
set Rph3;
RA_Phylum_6 = RA_Phylum/6;
if Phylum = "Firmicutes" then PhylumCat = 9;
else if Phylum = "Bacteroidetes" then PhylumCat = 8;
else if Phylum = "Proteobacteria" then PhylumCat = 7;
else if Phylum = "Verrucomicrobia" then PhylumCat = 6;
else if Phylum = "Actinobacteria" then PhylumCat = 5;
else if Phylum = "Cyanobacteria" then PhylumCat = 4;
else if Phylum = "Fusobacteria" then PhylumCat = 3;
else if Phylum = "Other" then PhylumCat = 2;
run;
proc format;
value Phylum 2 = "Other"
  3 = "Fusobacteria"
  4 = "Cyanobacteria"
  5 = "Actinobacteria"
  6 = "Verrucomicrobia"
  7 = "Proteobacteria"
  8 = "Bacteroidetes"
  9 = "Firmicutes";
run;
*Graphic for gardeners;
*Subset of gardeners;
data Rph5_GroupOne;
set Rph5;
where Group = 1;
run;
ODS Graphics / Height = 6in Width = 4in;
proc sgpanel data = Rph5_GroupOne;
title "The Relative Proportion of Phyla Represented in Gardeners' Microbiome Samples";
title2 "Summed Over 6 Timepoints";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PID / response = RA_Phylum_6
  group = PhylumCat
  groupdisplay = stack
  barwidth = 1;
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Participant";
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;
*Graphic for non-gardeners;
*Subset of non-gardeners;
data Rph5_GroupZero;
set Rph5;
where Group = 0;
run;
proc sgpanel data = Rph5_GroupZero;
title "The Relative Proportion of Phyla Represented in Non-Gardeners' Microbiome Samples";
title2 "Summed Over 6 Timepoints";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PID / response = RA_Phylum_6
  group = PhylumCat
  groupdisplay = stack
  barwidth = 1;
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Participant";
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;
/***************************************
Graphic that sums over participants and not time
****************************************/
proc format;
value Phylum 2 = "Other"
  3 = "Fusobacteria"
  4 = "Cyanobacteria"
  5 = "Actinobacteria"
  6 = "Verrucomicrobia"
  7 = "Proteobacteria"
  8 = "Bacteroidetes"
  9 = "Firmicutes";
run;
ODS PDF DPI = 1200
  file = "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Text\Deblur_Figures_SampleByTime_01.pdf"
  startpage = no;
ODS Graphics / Height = 6in Width = 4in;
*Graphic for gardeners;
*Subset gardeners;
data Rph6_GroupOne;
set Rph3;
where Group = 1;
RA_Phylum_Gp = RA_Phylum/5;
if Phylum = "Firmicutes" then PhylumCat = 9;
else if Phylum = "Bacteroidetes" then PhylumCat = 8;
else if Phylum = "Proteobacteria" then PhylumCat = 7;
else if Phylum = "Verrucomicrobia" then PhylumCat = 6;
else if Phylum = "Actinobacteria" then PhylumCat = 5;
else if Phylum = "Cyanobacteria" then PhylumCat = 4;
else if Phylum = "Fusobacteria" then PhylumCat = 3;
else if Phylum = "Other" then PhylumCat = 2;
run;
proc sgpanel data = Rph6_GroupOne;
title "The Relative Proportion of Phyla Represented in Gardeners' Microbiome Samples";
title2 "Over Time";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar Timepoint / response = RA_Phylum_Gp
  group = PhylumCat
  groupdisplay = stack
  barwidth = 1
  transparency = 0.5;
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Time";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
  datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;
*Graphic for non-gardeners;
*Subset non-gardeners;
data Rph6_GroupZero;
set Rph3;
where Group = 0;
RA_Phylum_Gp = RA_Phylum/6;
if Phylum = "Firmicutes" then PhylumCat = 9;
else if Phylum = "Bacteroidetes" then PhylumCat = 8;
else if Phylum = "Proteobacteria" then PhylumCat = 7;
else if Phylum = "Verrucomicrobia" then PhylumCat = 6;
else if Phylum = "Actinobacteria" then PhylumCat = 5;
else if Phylum = "Cyanobacteria" then PhylumCat = 4;
else if Phylum = "Fusobacteria" then PhylumCat = 3;
else if Phylum = "Other" then PhylumCat = 2;
run;
proc sgpanel data = Rph6_GroupZero;
title "The Relative Proportion of Phyla Represented in Non-Gardeners' Microbiome Samples";
title2 "Over Time";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar Timepoint / response = RA_Phylum_Gp
  group = PhylumCat
  groupdisplay = stack
  barwidth = 1
  transparency = 0.5;
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Time";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
  datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;
ods pdf close;
/***************************************
Graphic that shows all participant data over time
****************************************/
*Create a PID by time variable to indicate groups on x-axis;
data Rph4;
set Rph3;
PIDTime = PID+(Timepoint*0.1);
if Phylum = "Firmicutes" then PhylumCat = 9;
else if Phylum = "Bacteroidetes" then PhylumCat = 8;
else if Phylum = "Proteobacteria" then PhylumCat = 7;
else if Phylum = "Verrucomicrobia" then PhylumCat = 6;
else if Phylum = "Actinobacteria" then PhylumCat = 5;
else if Phylum = "Cyanobacteria" then PhylumCat = 4;
else if Phylum = "Fusobacteria" then PhylumCat = 3;
else if Phylum = "Other" then PhylumCat = 2;
run;
ODS PDF DPI = 1200
  file = "C:\Users\harrallk\Dropbox (ColoradoTeam)\Microbiome\Text\Deblur_Figures_SampleByPIDByTime_01.pdf"
  startpage = no;
ODS Graphics / Height = 6in Width = 5in;
*graphic for gardeners;
*subset gardeners;
data Rph4_one;
set Rph4;
where Group = 1;
run;
proc sgpanel data = Rph4_One;
title "Relative Proportion of Phyla Present in the Microbiome of Gardeners";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PIDTime / response = RA_Phylum
  group = PhylumCat
  groupdisplay = stack
  barwidth = 1
  transparency = 0.5;
refline 5.1 7.1 12.1 13.1 / axis = x discreteoffset = -0.5 lineattrs = (color = Black thickness = 5);
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Participant.Timepoint";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
  datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;
*graphic for non-gardeners;
*subset non-gardeners;
data Rph4_zero;
set Rph4;
where Group = 0;
run;
proc sgpanel data = Rph4_Zero;
title "Relative Proportion of Phyla Present in the Microbiome of Non-Gardeners";
panelby Sample_type / columns = 1 rows = 4 novarname;
vbar PIDTime / response = RA_Phylum
  group = PhylumCat
  groupdisplay = stack
  barwidth = 1
  transparency = 0.5;
refline 3.1 6.1 8.1 9.1 14.1 / axis = x discreteoffset = -0.5 lineattrs = (color = Black thickness = 4);
rowaxis label = "Relative Proportion of each Phylum";
colaxis label = "Participant.Timepoint";
styleattrs datacolors = (Cyan Magenta Yellow Black Orange Blue Gray Green)
  datacontrastcolors = (Black);
keylegend / title = "Phylum";
format PhylumCat Phylum.;
run;
ods pdf close;
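The relative-abundance step in this appendix divides each phylum count by the sample total and pools every phylum outside the reported set into a single Other category. The following is a minimal Python sketch of that arithmetic for one sample. The function name `relative_abundance` and the dictionary layout are assumptions made for illustration and do not come from the SAS program.

```python
def relative_abundance(counts, keep):
    """Relative abundance per phylum for a single sample.

    counts: dict mapping phylum name -> read count.
    keep:   phyla reported individually; every other phylum (including an
            existing 'Other' entry) is pooled into 'Other'.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    ra = {p: counts.get(p, 0) / total for p in keep}
    ra["Other"] = sum(c for p, c in counts.items() if p not in keep) / total
    return ra


sample = {"Firmicutes": 60, "Bacteroidetes": 30, "TM7": 5, "Other": 5}
ra = relative_abundance(sample, keep=["Firmicutes", "Bacteroidetes"])
```

By construction the proportions returned for one sample sum to 1, which is what allows the stacked bars to be read as shares of the whole sample.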
Appendix C. Test for differences in the abundance of sOTUs between gardeners and non-
gardeners
R, ANCOM
## ANCOM code downloaded from https://sites.google.com/site/siddharthamandal1985/research
# Run the downloaded code for ANCOM, titled ANCOM_updated_code.R
library(exactRankTests)
library(nlme)
library(ggplot2)
# Separate, parallel models were run for each sample type. We present code for the forehead below.
# Restrict samples to forehead
Var_data_forehead <-
  var_data[which(var_data$sample_type_num>0.5&var_data$sample_type_num<1.5),]
# Forehead analysis accounting for within-subject variability
longitudinal_comparison_foreheadAdjRand=ANCOM.main(OTUdat=otu_data,
  Vardat=Var_data_forehead,
  adjusted=T,
  repeated=T,
  main.var="Group",
  adj.formula="sexF+age_1",
  repeat.var="NULL",
  longitudinal=FALSE,
  random.formula="~1|PID",
  multcorr=2,
  sig=0.05,
  prev.cut=0.90)
longitudinal_comparison_foreheadAdjRand$W.taxa
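The W statistic reported by ANCOM counts, for each taxon, how many pairwise log-ratios with other taxa differ between groups. The following Python sketch illustrates that counting idea only. It is a simplified illustration and not the downloaded ANCOM code: it uses an unadjusted rank-sum test with a normal approximation, whereas the run above uses covariate adjustment and a random intercept. It also applies no multiple-comparison correction, and the pseudocount of 1 and the names `rank_sum_p` and `ancom_w` are assumptions.

```python
import math
from statistics import NormalDist


def rank_sum_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, midranks for
    ties, no tie correction of the variance)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


def ancom_w(counts, groups, alpha=0.05, pseudo=1.0):
    """counts: one list of taxon counts per sample; groups: 0/1 per sample.
    Returns W[i] = number of taxa j whose log-ratio with taxon i differs
    between the two groups at level alpha."""
    n_taxa = len(counts[0])
    w = [0] * n_taxa
    for i in range(n_taxa):
        for j in range(n_taxa):
            if i == j:
                continue
            ratios = [math.log((s[i] + pseudo) / (s[j] + pseudo)) for s in counts]
            g0 = [r for r, g in zip(ratios, groups) if g == 0]
            g1 = [r for r, g in zip(ratios, groups) if g == 1]
            if rank_sum_p(g0, g1) < alpha:
                w[i] += 1
    return w


# Toy data: only the first taxon differs between the two groups.
base = [[50, 30], [52, 31], [48, 29], [51, 30], [49, 32], [50, 28]]
low = [10, 12, 11, 9, 13, 10]
high = [100, 120, 110, 90, 130, 105]
counts = [[a] + b for a, b in zip(low, base)] + [[a] + b for a, b in zip(high, base)]
groups = [0] * 6 + [1] * 6
w = ancom_w(counts, groups)
```

In the toy data the first taxon has the largest W because every log-ratio involving it shifts between groups, while the other two taxa differ only in their ratio to the first.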
Appendix D – Alpha Diversity. Test for differences in Shannon diversity or Faith’s PD
between gardeners and non-gardeners.
SAS, Linear Mixed Model
Models for Shannon diversity and Faith’s phylogenetic diversity are parallel.
The data were subset by sample location.
proc mixed data = mouth;
title "Mouth Shannon Diversity, Full Model";
class PID;
model Shannon_Deblur = group*timepoint*age_1*BMI_1
  group*timepoint*age_1
  group*timepoint*BMI_1
  group*age_1*BMI_1
  timepoint*age_1*BMI_1
  group*timepoint
  group*age_1
  group*BMI_1
  timepoint*age_1
  timepoint*BMI_1
  age_1*BMI_1
  group
  timepoint
  sexF
  age_1
  BMI_1
  / residual outp = mouthresout ddfm = kr;
repeated PID;
run;
proc univariate data = mouthresout;
var studentResid;
title "Full Mouth Model";
histogram studentResid / normal;
run;
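The outcome in the model above is a Shannon diversity value computed upstream of SAS. For readers unfamiliar with the index, the following Python sketch shows how it is obtained from a vector of counts, H = -sum(p_i * log p_i). The function name `shannon` is illustrative, and the natural-log default is an assumption: some pipelines report the index in base 2, so the `base` argument should match whatever tool produced the study values.

```python
import math


def shannon(counts, base=math.e):
    """Shannon diversity H = -sum(p * log(p)) over features with count > 0."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, base)
    return h


even = shannon([10, 10])        # two equally abundant features
single = shannon([5, 0, 0])     # one feature only
bits = shannon([1, 1, 1, 1], base=2)
```

Two equally abundant features give ln 2, a single feature gives 0, and four equally abundant features give 2 bits in base 2.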
Appendix E – Beta Diversity. Visualize variability within and between participant samples.
SAS, Variability within and between participants
/*********************************************************************
Convert distance matrix into a list of sample pairs and distances
(the distance matrix is stored as mCorr and the pairwise distance as Corr)
**********************************************************************/
proc iml;
use UniFrac;
read all var "SampleID" into ColNames; /* get names of variables */
read all var (ColNames) into mCorr; /* matrix of pairwise distances */
close UniFrac;
numCols = ncol(mCorr); /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5; /* max length of new ID variable */
Sample1 = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols; /* construct the pairwise names */
  do col = 1 to row-1;
    Sample1[i] = ColNames[col];
    i = i + 1;
  end;
end;
Sample2 = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols; /* construct the pairwise names */
  do col = 1 to row-1;
    Sample2[i] = ColNames[row];
    i = i + 1;
  end;
end;
lowerIdx = loc(row(mCorr) > col(mCorr)); /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
create CorrPairs var {"Sample1" "Sample2" "Corr"};
append;
close;
QUIT;
/****************************************************************************
Merge in metadata.
This step introduces sample identification for the members of each correlation pair. This requires two
stages of merging: the first merge identifies the first column of pair members and the second merge
identifies the second column of pair members.
****************************************************************************/
/****************************
Round 1 of merging
******************************/
proc sort data = CorrPairs;
by Sample1;
run;
*Add "SampleID" to the correlation dataset. This is how the sample identification variable was named in the
metadata;
data CorrPairs2;
set CorrPairs;
SampleID = Sample1;
SampleID2 = Sample2;
run;
*For SAS variable naming conventions, an x was added to the start of each numeric sample id;
proc sort data = Mb01.MetaSampleX;
by SampleID;
run;
data DistMeta;
merge CorrPairs2 Mb01.MetaSampleX;
by SampleID;
run;
/***************************************************************************
Round 2 of merging
Reduce Mb01.MetaSampleX so it only includes sample, time point, and group.
****************************************************************************/
data MetaReduce;
keep SampleID2 timepointS2 groupS2 PIDS2 sample_type_numS2;
set Mb01.MetaSampleX;
SampleID2 = SampleID;
timepointS2 = timepoint;
groupS2 = group;
PIDS2 = PID;
sample_type_numS2 = sample_type_num;
run;
proc sort data = DistMeta;
by SampleID2;
run;
proc sort data = MetaReduce;
by SampleID2;
run;
data DistMeta2;
merge DistMeta MetaReduce;
by SampleID2;
run;
*Remove any missing values of UniFrac;
data DistMeta3;
set DistMeta2;
where corr NE .;
run;
/****************************************************************************
For each correlation pair, define the following
1. Members of the pair came from the same participant
2. Members of the pair came from the same sample type
3. Members of the pair are from the same intervention group
4. Members of the pair are from the same time point
5. Within variability - members of the pair are from the same participant and same sample type
6. Between variability - members of the pair are from different participants but the same sample type and time
point
****************************************************************************/
data interactions;
set DistMeta3;
if PID = PIDS2 then PIDMatch = 1;
else PIDMatch = 0;
if sample_type_num = sample_type_numS2 then SampleMatch = 1;
else SampleMatch = 0;
Match = (PIDMatch*SampleMatch);
run;
*Code this so the between variation only considers differences between participants from the same day;
data graphicDTD;
set interactions;
if timepoint = timepointS2 then TimeMatch = 1;
else TimeMatch = 0;
if PIDMatch = 1 and SampleMatch = 1 then Within = 1;
else Within = 0;
if PIDMatch = 0 and SampleMatch = 1 and TimeMatch = 1 then Between = 1;
else Between = 0;
if group = groups2 then GroupMatch = 1;
else GroupMatch = 0;
run;
proc sort data = graphicDTD;
by Within;
run;
/***************************************************************************
UniFrac graphics showing within and between variability over sample type
****************************************************************************/
proc format;
value var 0 = "Variation between participants"
  1 = "Variation within participants";
value varDTD 0 = "Variation between participants (day-to-day)"
  1 = "Variation within participants";
run;
ods pdf file = "~\UniFrac_unweightedPlots.pdf";
proc sgplot data = graphicDTD;
title "Variability Within and Between Participants";
vbar sample_type / response = corr group = Within groupdisplay = cluster stat = Mean limitstat = stderr limitattrs = (color = black);
format Within varDTD.;
yaxis label = "Unweighted UniFrac Distance";
xaxis label = "Sampling Location";
styleattrs datacolors = (Black White) datacontrastcolors = (Black Black);
keylegend / title = " ";
run;
ods pdf close;
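The within and between definitions used in this appendix can be stated compactly: a pair is "within" when both samples come from the same participant and sample type, and "between" when they come from different participants but the same sample type and time point. The following Python sketch applies those two rules to a small set of pairwise distances. The function name `classify_pairs`, the data layout and the toy distances are assumptions made for illustration.

```python
from itertools import combinations


def classify_pairs(dist, meta):
    """Split pairwise distances into within- and between-participant lists.

    dist: dict mapping frozenset({id1, id2}) -> distance.
    meta: dict mapping sample id -> (participant, sample_type, timepoint).
    """
    within, between = [], []
    for a, b in combinations(sorted(meta), 2):
        d = dist.get(frozenset((a, b)))
        if d is None:  # skip pairs with a missing distance
            continue
        pid_a, type_a, time_a = meta[a]
        pid_b, type_b, time_b = meta[b]
        if type_a != type_b:
            continue  # only compare samples from the same body site
        if pid_a == pid_b:
            within.append(d)
        elif time_a == time_b:
            between.append(d)
    return within, between


meta = {
    "s1": (1, "Forehead", 1),
    "s2": (1, "Forehead", 2),
    "s3": (2, "Forehead", 1),
    "s4": (2, "Mouth", 1),
}
dist = {
    frozenset(("s1", "s2")): 0.2,
    frozenset(("s1", "s3")): 0.6,
    frozenset(("s2", "s3")): 0.7,
    frozenset(("s1", "s4")): 0.9,
    frozenset(("s2", "s4")): 0.9,
    frozenset(("s3", "s4")): 0.8,
}
within, between = classify_pairs(dist, meta)
```

Pairs that satisfy neither rule, such as samples from different participants taken at different time points, are left out of both lists in this sketch.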
Appendix F – Beta Diversity. Test for differences in weighted UniFrac between gardeners
and non-gardeners
R, Nested Permutation ANOVAs
# Import datasets
library(readr)
library(BiodiversityR) # provides nested.npmanova
UniFrac_unweighted_denovo_181203 <- read_delim("~/UniFrac_unweighted_denovo.txt",
  "\t", escape_double = FALSE, trim_ws = TRUE)
# Import metadata so we can compare UniFrac by group, time, sex, age, and BMI.
library(haven)
mbanalysis <- read_sas("~/mbanalysis.sas7bdat", NULL)
#### Note: make sure that column and row names match after R imports this matrix. R adds an x
#### when a variable name starts with a number. This was handled by adding x's to all column
#### and row names before importing into R.
# Format data for permutation ANOVAs.
# Make the first column into row names
UniFrac <- as.data.frame(UniFrac_unweighted_denovo_181203) # Shorten object name
a <- UniFrac[,1]
rownames(UniFrac) <- a
UniFrac <- UniFrac[,-1]
UniFrac[1:5, 1:5]
# Remove missing samples from the metadata.
meta <- mbanalysis[complete.cases(mbanalysis[ , "Shannon_CR_OTC"]),]
meta2 <- meta[,c("SampleID","Timepoint", "Group", "sample_type")]
# Distance matrix samples and metadata samples must be in the same order.
sampleOrder <- as.numeric(row.names(UniFrac))
sampleOrder2 <- as.data.frame(sampleOrder)
colnames(sampleOrder2) <- "SampleID"
orderMeta <- merge(sampleOrder2,meta2,by.x="SampleID", sort = FALSE)
GroupTime <- orderMeta[,c("Group", "Timepoint")]
GroupTime$Timepoint <- as.factor(GroupTime$Timepoint)
GroupTime$Group <- as.factor(GroupTime$Group)
# Sample ID must be listed in the first column of the data frame
pax <- as.data.frame(rownames(UniFrac))
colnames(pax) <- "SampleID"
Pax2 <- cbind(pax, UniFrac)
###################################################################
# The following code is parallel for all sample types. Thus, we only report forehead.
#####################################################################
## Restrict meta samples to forehead and order same as UniFrac
foreheadMeta <- meta2[meta2$sample_type == "Forehead", ]
foreheadIDs <- foreheadMeta$SampleID
ForeheadUniFrac <- Pax2[rownames(Pax2) %in% foreheadIDs,colnames(Pax2) %in% foreheadIDs]
# Group by PID
ForeheadGroupPID <- meta[meta$SampleID %in% foreheadIDs,c("SampleID", "Group", "PID")]
## Order
ForeheadOrder <- as.numeric(row.names(ForeheadUniFrac))
ForeheadOrder2 <- as.data.frame(ForeheadOrder)
colnames(ForeheadOrder2) <- "SampleID"
# Nested Group by PID
ForeheadGroupPID2 <- merge(ForeheadOrder2,ForeheadGroupPID,by.x="SampleID", sort = FALSE)
### Remove sampleID from the metadata, and define covariates as factors
# Group by PID
ForeheadGroupPID3 <- ForeheadGroupPID2[,2:3]
ForeheadGroupPID3$Group <- as.factor(ForeheadGroupPID3$Group)
ForeheadGroupPID3$PID <- as.factor(ForeheadGroupPID3$PID)
# Forehead Models
nested.npmanova(ForeheadUniFrac~Group+PID, data = ForeheadGroupPID3, permutations=999,
  warnings=FALSE)
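A permutation ANOVA on a distance matrix compares the between-group and within-group sums of squared distances and judges the resulting pseudo-F against label permutations. The following Python sketch illustrates that idea with a one-factor pseudo-F and permutations that move whole participants between groups so that repeated samples stay together. It is a simplified illustration under those assumptions and not a reimplementation of `nested.npmanova`, whose nested F-ratio and permutation scheme may differ. The names `pseudo_f` and `block_permutation_p` are hypothetical.

```python
import random


def pseudo_f(dist, groups):
    """One-factor pseudo-F from a square symmetric distance matrix.

    Sums of squares are computed from squared pairwise distances:
    SS_total = sum(d_ij^2) / N and SS_within = sum over groups of
    sum(d_ij^2) / n_group.
    """
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(
            dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]
        ) / len(idx)
    a = len(labels)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))


def block_permutation_p(dist, groups, pids, permutations=999, seed=0):
    """Permutation p-value that shuffles group labels between participants."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    pid_list = sorted(set(pids))
    pid_group = dict(zip(pids, groups))  # group is constant within participant
    labels = [pid_group[p] for p in pid_list]
    hits = 0
    for _ in range(permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(pid_list, shuffled))
        if pseudo_f(dist, [relabel[p] for p in pids]) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: four samples on a line, two per participant.
points = [0, 1, 10, 11]
dist = [[abs(x - y) for y in points] for x in points]
f_obs, p_val = block_permutation_p(dist, ["A", "A", "B", "B"], [1, 1, 2, 2], permutations=99)
```

With only two participants every permutation reproduces the observed statistic, so the toy p-value carries no information. A real analysis needs enough participants per group for the permutation distribution to be meaningful.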