TRANSCRIPT
Antonio Monleón Getino
Section of Statistics, Department of Genetics, Microbiology and Statistics (UB)
METAGENOMICS: Richness and sampling effort in metagenomics
26-10-2017
FIS Project 2018-2020
Hospital Sant Joan de Deu-Universitat de Barcelona
FIS: DIAGNOSIS METAGENOME BASED
DIAGNOSIS TEST
The Human Microbiome and Metagenomics: A SCIENTIFIC REVOLUTION
(*) Christine Rodriguez, Ph.D., Harvard Outreach 2012
Summer 2012 Workshop in Biology and Multimedia for High School Teachers (Harvard, MA, USA)
The possibility of directly sequencing the genomes of microbes, without the need to cultivate them, opens new possibilities that imply a change of course in microbiology.
Metagenomics is a new field that aims to obtain the genome sequences of the different microorganisms (bacteria, in this case) that compose a community by extracting and analyzing their DNA globally.
This is a scientific revolution because of its high throughput and low cost. It gives access to the genome without having to see or cultivate the microorganisms, with all that this implies (DIAGNOSTICS, BIOMONITORING: EARLY WARNING, BIOREMEDIATION, …).
Metagenome: a collection of genes sequenced from the environment that can be analyzed in a way analogous to the study of a single genome (Handelsman et al., 1998).
Microbiota vs Microbiome?
Microbiota: the microorganisms that live in an established environment
Microbiome: the full complement of microbes, their genes, and genomes in a particular environment
Microbes in the Human Microbiome include species from each major domain
Bacteria
“Extremophile” Archaebacteria
Fungi
Bacteria
• Have no nucleus or membrane-bound organelles
• Often sphere (cocci) or rod (bacilli) shaped, but other shapes as well
Eukaryotes
• Have a true nucleus and membrane-bound organelles
• Wide variety of shapes. For this presentation, we will focus on fungi
• Fungi are unique since they have a cell wall and form spores during reproduction
Microbial life
http://nihroadmap.nih.gov/hmp/
Some microbes are native, normally found in the body
Some microbes are introduced, suddenly arriving at a new residence in the body
Microbes are normally found in and on the human body
The following sites are “hotspots” for microbial life
What’s Happening in the Oral Cavity?
http://en.wikipedia.org/wiki/File:Teeth_by_David_Shankbone.jpg
A wide variety of microbes regularly enter the oral cavity
Saliva, pH, temperature and the immune system prevent many species from surviving
Oral antibiotics inhibit growth
Brushing and flossing teeth clears some built-up biofilm
The oral microbes that are able to survive these conditions live in symbiosis and form an elaborate scaffold on the tooth enamel and at the interface with the gums. This scaffold forms a barrier against incoming bacteria.
Fusobacterium, Streptococcus mitis, etc.
http://commons.wikimedia.org/wiki/File:Intestine_and_stomach_-_transparent_-_cut.png
http://microbewiki.kenyon.edu/index.php/File:G_reaction1.jpg
Ruminococcus sp. bacteria are found in high numbers in the gut flora. They break down cellulose in the gut, helping with digestion.
Helicobacter pylori is a helical bacterium that colonizes the stomach and upper G.I. tract. It is known to be a major cause of stomach ulcers, although many people with H. pylori do not get ulcers.
http://commons.wikimedia.org/wiki/File:Helicobacter_pylori_diagram.png
What’s Happening in the Gut?
The microbiota has important functions:
• Prevents colonization by pathogens
• Educates the immune system
• Metabolic role
• Participates in drug metabolism
• …
- Complex community of microbes, estimated to contain 200 trillion cells
- > 1000 diverse microbial species
- 10 x the number of human cells in our body
- The gut microbiome contains about 150 x more genes than the human genome
Gut Microbiota
Factors affecting Gut Microbiome
Is My Gut Microbiome the Same as Yours?
My Gut Species Your Gut Species
The number and amount of the many different microbes can vary greatly from person to person.
Enterotypes (Wadsworth et al., 2017)
“A very important difficulty for a correct statistical analysis”
• Bacteroidetes and Firmicutes make up around 90% of the gut microbiota
• Each individual harbors his/her own distinctive pattern of gut microbial communities
• For a given individual, the fecal microbiota remains remarkably stable over the person’s lifetime
So many new questions to answer about the Human Microbiome…
How does the gut flora modify drugs, and how can we minimize side effects?
Why does my gut flora look different than yours? How does that affect obesity, food allergies, and ability to fight disease?
Are we making germs more resistant to antimicrobials? What happens when the germs are resistant to all of the drugs in our arsenal?
Is it possible to develop a clinical test (based on a distinctive microbiome pattern) to discriminate between diseases using the microbiome?
Is it possible to build an “early warning biomonitoring system” based on metagenomics?
Why metagenomics?
Basically we try to answer two questions:
Statistics
Metagenomics
Metagenomics is the study of microbiomes from the direct isolation of DNA from an environment.
NGS
Statistical methods based on community ecology
16SMT
Sequencing + bioinformatics + statistics
Sequencing technology
Sequencing techniques have dramatically improved their capabilities and decreased in price
16S/18S-based metagenomics
cost-effective
kingdom-limited
lack of functional identification
phylum/genus level
Bacterial and archaeal genomes submitted to GenBank
Sequencing capability
Prices
Metagenomic procedures
Shotgun metagenomics
16S/18S-based metagenomics
• cost-effective
• kingdom-limited
• lack of functional identification
• phylum/genus level
WGS-based metagenomics
• functional identification
• strain level
• unable to know the gene expression abundance
• computationally expensive
Metatranscriptomics
• functional identification of expressed genes
• differential expression analysis
• expensive
• family/genus level
NGS: Next Generation Sequencing
WGS: Whole Genome Sequencing
Why shotgun metagenomics?
Most NGS used in metagenomics
Next Generation Sequencing (NGS) Technologies
Platform | Illumina | Ion Torrent PGM | PacBio | Oxford Nanopore
Instrument | Illumina NextSeq 500 | Ion Torrent PGM, 316 chip | Pacific Biosciences RS II | Oxford Nanopore MinION
Amplification required | Yes | Yes | No | No
Type of amplification | Bridge PCR | emPCR | - | -
Maximum read length (bp) | 300 | 400 | 3000 | 9000
Error rate (%) | 0.1 | 1 | 13 | 4
Run time | 26 h | 4.9 h | 2 h | <= 6 h
Adapted from http://www.molecularecologist.com/next-gen-fieldguide-2016/
There are many technologies; this table summarizes some of them.
Bioinformatic analysis in metagenomics: just one way to do it?
Figure from Siegwald, L., Touzet, H., Lemoine, Y., Hot, D., Audebert, C., & Caboche, S. (2017). Assessment of Common and Emerging Bioinformatics Pipelines for Targeted Metagenomics. PloS One, 12(1), e0169563.
OTU: PSEUDO-SPECIES
PSEUDO-ABUNDANCE
OTU (Operational Taxonomic Unit): the taxonomic level of sampling selected by the user for a study. Typically, a percent sequence similarity threshold is used to classify microbes into the same or different OTUs.
Binning: the process of grouping reads or contigs and assigning them to OTUs
Research project: “Characterizing the structure of the complex microbiological communities using biodiversity and distance-based multivariate methods to obtain biologically relevant patterns and development of non-invasive methods for differential diagnosis”
Statistical analysis: work together with microbiologists, doctors and ecologists to understand the problem
At this step, we try to find the statistical methods to answer these questions
GUIDELINES (2016): Schema for a human microbiome study
• For both 16S rRNA and shotgun metagenomic sequencing, study participants can be compared by alpha diversity (i.e., within-participant diversity) and beta diversity (i.e., between-participant diversity) metrics.
• For alpha diversity, conventional statistical methods are often used. Typically, random sampling for a standardised number of OTUs (i.e., rarefaction) is conducted to minimise bias due to amplification or sampling efficiency, and then analyses include adjustment for taxa abundances in order to minimise the influence of rare taxa.
• For associations with the entire microbial community, beta diversity analyses are based on a matrix of distances between all pairs of specimens, followed by principal coordinate analysis and higher-level statistics.
Schema for a human microbiome study. To conduct a human microbiome study, it is important to develop a well-formulated hypothesis, apply a valid study design, and collect requisite covariate data. Relevant specimens (e.g., oral, faecal, tissue, or other applicable samples) should be collected and promptly frozen or chemically stabilised. Once the DNA is extracted, there are currently two typical methods for sequencing: 16S rRNA gene sequencing (in blue) or shotgun metagenomic sequencing (in red). To date, most epidemiologic-scale studies profile the microbiome by amplifying and sequencing only the prokaryote 16S rRNA gene. Once the 16S rRNA sequencing is completed, the data are often processed using various publicly available tools that cluster the sequences into operational taxonomic units (OTUs) and assign taxonomy using public sequence databases. For shotgun metagenomic sequencing, the DNA is sheared and then all the fragments are sequenced. From this type of data a variety of bioinformatic processing can be conducted, but often the short reads are used to cluster OTUs and assign taxonomy, similar to 16S, but also to determine the functional capabilities of the genes present by mapping the reads to public gene databases.
The metagenomic guidelines suggest the use of diversity statistical analysis
How to measure metagenome diversity?
• Population diversity is a very important concept when dealing with OTUs or other taxonomic “bins” because it is critical for human health.
• A number of disease conditions have been shown to correlate with decreased microbiome diversity, so diversity loss may seriously affect human health.
• Human gut contents appear to be highly personalized when considered in terms of microbial PRESENCE, ABSENCE and ABUNDANCE.
Using the same concepts as in community ecology, but with their particularities…
• α, β-diversity:
– α-diversity measures act like a summary statistic of a single population.
– β-diversity measures act like a similarity score between populations, allowing analysis by clustering or other multivariate analysis.
Biodiversity in metagenomics: complexity in a multidimensional perspective !!
Ecology
Function
Taxonomy
Dynamics
Many concepts in metagenomic diversity
RAREFACTION / NON PARAMETRIC ESTIMATORS OF RICHNESS
SAMPLING EFFORT
It is necessary to accommodate concepts to metagenomics
Measure diversity in metagenomics
See the compilation of methods described in Monleón-Getino, “A statistician knight in King Molecular’s Court” (Introduction to the Species Divergence and the Measurement of Microbial Diversity), 2017
Lozupone CA and Knight R. 2008. Species Divergence and the Measurement of Microbial Diversity. FEMS Microbiol Rev. Jul; 32(4): 557–578
Many methods to measure diversity
Measure diversity in metagenomics: α-diversity
• α-diversity, or “local diversity”, is the diversity within a community at a particular site.
• This diversity is measured at a small, local scale, generally the size of a local area (e.g. for the microbiome: gut area, tooth, etc.).
• A common definition of α-diversity is the number of species (richness) at one site, but some formulations also consider the proportion that each species represents (evenness), in which case a high diversity implies a large number of species with similar abundances.
How to calculate effort curves (sampling curves) in metagenomics?
Measure diversity in metagenomics: α-diversity
N=100 (true)
UNDERSAMPLING
Rarefaction curve (simulation: 500 OTUs, 5 samples, 120 bins per sample)
True, only if the sampling is correct
Kindt's exact method: species accumulation curve (simulation: 500 OTUs, 5 samples, 120 bins per sample)
Only 250 OTUs?
Only 100 OTUs?
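A rarefaction curve of the kind shown in these simulations can be sketched as follows. This is a minimal illustration in Python, not the vegan or BDSbiost3 implementation: it assumes the classical hypergeometric expectation for individual-based rarefaction, and the function names and OTU counts are hypothetical.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of OTUs seen in a random subsample of n reads."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of reads")
    denom = comb(total, n)
    # An OTU with c reads is missed with probability C(total - c, n) / C(total, n).
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


def rarefaction_curve(counts, step=1):
    """(subsample size, expected richness) pairs from 0 up to the full sample."""
    total = sum(counts)
    return [(n, rarefied_richness(counts, n)) for n in range(0, total + 1, step)]


# Hypothetical OTU read counts for one sample (not data from the slides).
otu_counts = [50, 30, 10, 5, 3, 1, 1]
curve = rarefaction_curve(otu_counts, step=20)
for n, s in curve:
    print(f"{n:4d} reads -> {s:.2f} expected OTUs")
```

A curve that is still rising at the full sample size is the usual sign of undersampling; a curve that has flattened suggests that most OTUs present have been observed.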
Measure diversity in metagenomics: α-diversity
α-diversity is often quantified by the Shannon index.
The Shannon index (H’) is the most popular diversity index among microbiologists working with metagenomics. It is also known as Shannon's diversity index, Shannon entropy, the Shannon–Wiener index or the Shannon–Weaver index. The measure was originally proposed by Claude Shannon to quantify the entropy (uncertainty or information content) in strings of text (Shannon, 1948).
H’ quantifies the uncertainty (entropy or degree of surprise) in predicting the OTU assigned to an individual that is taken at random from the dataset. It is most frequently calculated as follows:
$$H' = -\sum_{i=1}^{R} p_i \log_2(p_i)$$
where R is the number of OTUs and $p_i$ is the proportion of individuals belonging to the ith OTU in the dataset of interest.
Another index proposed to estimate α-diversity, which takes into account the phylogenetic divergence between species, is the theta alpha biodiversity (θ), calculated as
$$\theta = \sum_{i=1}^{k} \sum_{j<i} p_i\, p_j\, d_{ij}$$
where k is the number of individuals sampled, $p_i$ is the frequency of the ith sequence, $p_j$ is the frequency of the jth sequence, and $d_{ij}$ is the phylogenetic divergence between the two sequences. θ is widely applied in molecular evolution and population genetics. As an interesting feature, θ is also identical to the taxonomic diversity index (Δ) described by Clarke and Warwick (1995). θ is also equal to ½ the Rao Diversity Coefficient (RaoDIV) (Rao, 1982), which reduces to the Gini-Simpson index (1 − λ) when all pairwise distances between species are equal.
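The two α-diversity indices can be computed directly from their definitions. The sketch below is only an illustration in Python: the function names, the OTU counts and the distance matrix are hypothetical, and the distances are set to 1 simply to show the stated link between θ and the Gini-Simpson index.

```python
from math import log2


def shannon_index(counts):
    """Shannon index H' (base 2) from OTU counts; empty OTUs are ignored."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in props)


def theta(props, dist):
    """Sum over pairs j < i of p_i * p_j * d_ij."""
    k = len(props)
    return sum(props[i] * props[j] * dist[i][j]
               for i in range(k) for j in range(i))


# Hypothetical community with four equally abundant OTUs.
counts = [25, 25, 25, 25]
props = [c / sum(counts) for c in counts]
# Assumption for the example: all pairwise divergences equal to 1.
dist = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

h = shannon_index(counts)
t = theta(props, dist)
gini_simpson = 1 - sum(p * p for p in props)
print(h, t, gini_simpson)
```

With equal pairwise distances, twice θ matches the Gini-Simpson index, which is the special case mentioned in the text.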
Measure diversity in metagenomics: β-diversity
Many approaches: diversity between communities, groups, samples…
Measure diversity in metagenomics: β-diversity
Legendre and Caceres (2013)
See the compilation of methods described in Monleón-Getino, “A statistician knight in King Molecular’s Court” (Introduction to the Species Divergence and the Measurement of Microbial Diversity), 2017
I have adapted a new concept of beta-diversity to metagenomics
Measure diversity in metagenomics: β-diversity
It is also possible to handle beta-diversity using multivariate approaches such as …
QUESTIONS TO SOLVE
• Is the sampling effort enough?
• How many samples are needed to estimate the richness (OTUs)? More? Fewer?
• Are the replications OK?
Richness
Tutorial
Richness: how to measure it?
Tutorial
Bayesian point of view: a new family of algorithms to calculate the diversity
• The most frequently used statistical methods are known as frequentist (or classical) methods.
• These methods assume that unknown parameters are fixed constants, and they define probability by using limiting relative frequencies.
• It follows from these assumptions that probabilities are objective and that you cannot make probabilistic statements about parameters because they are fixed.
• Bayesian methods offer an alternative approach; they treat parameters as random variables and define probability as "degrees of belief" (that is, the probability of an event is the degree to which you believe the event is true).
• It follows from these postulates that probabilities are subjective and that you can make probability statements about parameters.
• The term "Bayesian" comes from the prevalent usage of Bayes’ theorem, which was named after the Reverend Thomas Bayes, an eighteenth-century Presbyterian minister.
• Bayes was interested in solving the question of inverse probability: after observing a collection of events, what is the probability of one event?
Link to Bayesian theory
$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$
Based on: Pullen and Kumaran (2010): Application of Multinomial-Dirichlet Conjugate in MCMC Estimation: A Breast Cancer Study. Int. Journal of Math. Analysis 4
Algorithm
A new Bayesian method to estimate richness and sample size (sampling effort) in metagenomics analysis
Multinomial-Dirichlet conjugate distributions (Bayesian approach)
Prior distribution
Bayes formula:
Evidence (difficult to calculate):
MCMC
Using a projection of the species accumulation curve (Weibull: SATURATIVE, or LOESS + Michaelis-Menten growth model: NON-SATURATIVE)
WinBUGS / JAGS: $P(\theta_k \mid x)$ (BCI)
Species accumulation curve (Sa)
RICHNESS:
Using the inverses of the Weibull or Michaelis-Menten growth model
SAMPLING EFFORT: Criterion: species i is present if $BCI_{lower} > P^{*}(\theta_k \mid x)$
Functions to project richness at S(x)
• Weibull growth function: $W(x) = a - b\,e^{-c x^{m}}$
where W(x) is the richness at sample x, with $a = \alpha$, $b = \alpha - \beta$, $c = \kappa^{\gamma}$ and $m = \gamma$. a, b, c and m are parameters to be estimated and e is the base of the natural logarithms. a is the asymptote, or limiting value, of the response variable ($\lim_{x \to \infty} W(x) = a$), which represents the maximum number of species; b is the biological constant (lower asymptote); c is the parameter governing the rate at which the response variable approaches its potential maximum a; and m is the allometric constant.
• Michaelis-Menten equation: $MM(x) = Vx/(k + x)$
• LOESS regression: LOESS and LOWESS (locally weighted scatterplot smoothing) are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model.
extrapolated richness value
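The two parametric growth functions and their inverses can be written down directly. The sketch below is a Python illustration only, not the BDSbiost3 code: the function names and all parameter values are hypothetical, and the inverse is used here to ask how many samples would be needed to reach a chosen fraction of the asymptotic richness.

```python
from math import exp, log


def weibull_growth(x, a, b, c, m):
    """W(x) = a - b * exp(-c * x**m); a is the asymptotic richness."""
    return a - b * exp(-c * x ** m)


def weibull_inverse(y, a, b, c, m):
    """Sampling effort x at which the Weibull curve reaches richness y."""
    if not (a - b) <= y < a:
        raise ValueError("y must lie in [a - b, a)")
    return (-log((a - y) / b) / c) ** (1.0 / m)


def michaelis_menten(x, V, k):
    """MM(x) = V * x / (k + x); V is the asymptotic richness."""
    return V * x / (k + x)


def michaelis_menten_inverse(y, V, k):
    """Sampling effort x at which the Michaelis-Menten curve reaches y."""
    if not 0 <= y < V:
        raise ValueError("y must lie in [0, V)")
    return k * y / (V - y)


# Hypothetical fitted parameters (assumptions for the example only).
a, b, c, m = 500.0, 480.0, 0.25, 1.3
V, k = 500.0, 2.0

for q in (0.90, 0.95, 0.99):
    xw = weibull_inverse(q * a, a, b, c, m)
    xm = michaelis_menten_inverse(q * V, V, k)
    print(f"{int(q * 100)}% of asymptote: W needs {xw:.1f} samples, MM needs {xm:.1f}")
```

Because the Michaelis-Menten curve approaches its asymptote slowly, its required effort grows quickly as the target fraction approaches 100%, which is consistent with the wide MM ranges reported in the simulations.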
A new Bayesian method to estimate richness and sample size (sampling effort) in metagenomics analysis
Rarefaction curve (simulation: 500 OTUs, 5 samples, 120 bins per sample)
Is the sampling effort enough? How many samples are needed to estimate the richness (OTUs)? More? Fewer?
Kindt's exact method: species accumulation curve (simulation: 500 OTUs, 5 samples, 120 bins per sample)
Only 250 OTUs? Only 100 OTUs?
Vegan for R (Oksanen, 2017)
UNDERSAMPLING
HIGH RICHNESS
LOW ABUNDANCE
Only 300-400 OTUs?
Library for R: BDSbiost3 (Bayesian function for richness calculation and sampling effort)
https://github.com/amonleong/BDSbiost3
Non-parametric asymptotic richness estimation
MM (90, 95, 99%): [5.2, 11, 57.3]
MCMC method (JAGS for R)
on the rebound
W (90, 95, 99%): [2.1, 2.8, 3.7]
Bayesian sampling effort curve
Sampling effort
MM: 570.5
W: 499
Richness
Beta-diversity (between samples): quality control
Different species accumulation model
Simulations using the library
SCENARIO 3: ↑ Ab; ↑ Rs; Correct sampling / Over-Sampling
MetagenSample.size.H1-MCMC
specpool() - Vegan
Abundance: 10000; OTUs: 500; Samples: 15
Sampling effort
W (90, 95, 99%): {2.1, 4.1, 15.0}
MM (90, 95, 99%): [2.2, 4.6, 24.1]
Simulations using the library
SCENARIO 2: ↓ Ab; ↑ Rs; Over-Sampling
Abundance: 75; OTUs: 500; Samples: 15
MetagenSample.size.H1-MCMC
specpool() - Vegan
Sampling effort
W (90, 95, 99%): {2.9, 5.1, 11.7}
MM (90, 95, 99%): [3.5, 7.4, 8.5]
Simulations using the library
SCENARIO 13: ↓ Ab; ↓ Rs; Under-Sampling
Ab: 100; OTUs: 50; Samples: 4
Sampling effort
W (90, 95, 99%): {2.1, 2.6, 3.3}
MM (90, 95, 99%): [5.7, 12.1, 63.01]
A new Bayesian method to estimate richness and sample size (sampling effort) in metagenomics analysis
Library for R: BDSbiost3. Better estimation of richness and of the number of samples (sampling effort), based on MCMC and a Dirichlet-Multinomial distribution
• Species richness in an assemblage is difficult to estimate reliably from sample data because it is very sensitive to the number of individuals and the number of samples collected, which tend to be low in genomic analysis (low abundance, high number of OTUs, variability, …).
• It is very important to know the number of samples needed for a correct estimation of richness (sampling effort)
• Monleon-Getino
• Rodriguez-Casado
• Méndez-Viera
9/2017
https://github.com/amonleong/BDSbiost3
GAIA: Metagenomics analysis from any matrix at strain level
Trex: COSMOCAIXA 27-10-17
• Relive the dinosaurs: Cinna.upc.edu:3838/GRBIOSTAT3/paleo
Trix, the best-preserved tyrannosaur in the world
Antonio Monleón Getino
Section of Statistics, Department of Genetics, Microbiology and Statistics (UB)
THANKS to Maria Pilar Giménez, Cesc Múrries, Javier Méndez, M. Cambra
Plot: beta0, chains 1:3 (iteration axis from 1001 to 10000)
More results on metagenomics
• PhD-stays in the United States of America-Cambridge/Boston (2011, 2012, 2015) at the Forsyth Institute.
• Library for R developed and recently published (2016): MetagenOutLDA.
• 3 related articles recently published (2016, 2017, 2017) + 2 books (2017: WinBUGS + metagenomics diversity).
• 3 congress communications
• A research line opened
Previous results: the basis for new project
Starting from matrix: M
MetagenOutLDA: metagenomics-based diagnosis
Discriminant analysis (classical + machine learning methods)
g1
g2
Previous results: the basis for new project (main results)
g1
g2
Discriminant methods
$\hat{\beta} \pm \mathrm{BCa}$ (95% CI)
$y_i = f(\beta, x_i') + \epsilon_i$
Particular model:
$\epsilon_i \sim N(0, \sigma^2)$
Hopeful results
Previous results: the basis for new project (robust BCa percentile method)
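data $\mathbf{x} = (x_1, x_2, \ldots, x_n) \Rightarrow \hat{\theta} = s(\mathbf{x})$
resample B times: $\mathbf{x}^{*}_{1}, \mathbf{x}^{*}_{2}, \ldots, \mathbf{x}^{*}_{B}$
get resample statistics $\hat{\theta}^{*}_{b} = s(\mathbf{x}^{*}_{b})$ and then sort them: $\hat{\theta}^{*}_{(1)} \le \hat{\theta}^{*}_{(2)} \le \cdots \le \hat{\theta}^{*}_{(B)}$
estimate $z_0$, $a$: estimate $z_0$ by $\Phi^{-1}\left(\frac{1}{B}\sum_{b=1}^{B} 1\{\hat{\theta}^{*}_{b} \le \hat{\theta}\}\right)$ and $a$ by jackknife
with $\Phi^{-1}(\alpha) = -z_{\alpha}$:
$$\beta_1 = 1 - \Phi\left(z_0 + \frac{z_0 - z_{\alpha/2}}{1 - a\,(z_0 - z_{\alpha/2})}\right), \qquad \beta_2 = 1 - \Phi\left(z_0 + \frac{z_0 - z_{1-\alpha/2}}{1 - a\,(z_0 - z_{1-\alpha/2})}\right)$$
$100(1-\alpha)\%$ confidence interval:
$$LCL = \hat{\theta}^{*}_{((B+1)\times(1-\beta_1))}, \qquad UCL = \hat{\theta}^{*}_{((B+1)\times(1-\beta_2))}$$
Flowchart of the BCa percentile method used by the function dmcbiodiv() to estimate the confidence interval of (β0, β1, β2, …)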
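The BCa percentile steps in the flowchart can be sketched as follows. This is a minimal Python illustration and not the dmcbiodiv() code: the function name, the data and the default settings are hypothetical, the statistic is the sample mean, and the adjusted levels are written with lower-tail normal quantiles, which is an equivalent way of expressing the same correction.

```python
import random
from statistics import NormalDist, mean


def bca_interval(data, stat=mean, alpha=0.05, B=2000, seed=1):
    """Bias-corrected and accelerated (BCa) bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    data = list(data)
    n = len(data)
    theta_hat = stat(data)

    # Resample B times and sort the resample statistics.
    boot = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(B))
    nd = NormalDist()

    # Bias correction z0: normal quantile of the share of resamples <= theta_hat.
    prop = sum(t <= theta_hat for t in boot) / B
    prop = min(max(prop, 1 / (B + 1)), B / (B + 1))  # keep the quantile finite
    z0 = nd.inv_cdf(prop)

    # Acceleration a from the jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_level(p):
        z = nd.inv_cdf(p)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def order_statistic(level):
        idx = round((B + 1) * level) - 1
        return boot[min(max(idx, 0), B - 1)]

    return (order_statistic(adjusted_level(alpha / 2)),
            order_statistic(adjusted_level(1 - alpha / 2)))


# Hypothetical measurements (not data from the slides).
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 7.2, 5.1, 4.4, 6.3,
          5.8, 4.7, 5.0, 6.8, 3.9, 5.6, 4.2, 6.1, 5.4, 4.8]
lcl, ucl = bca_interval(sample)
print(f"mean = {mean(sample):.3f}, 95% BCa interval = ({lcl:.3f}, {ucl:.3f})")
```

When z0 and a are both zero, the adjusted levels reduce to α/2 and 1 − α/2, so the interval falls back to the plain percentile bootstrap.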
$\hat{\beta} \pm \mathrm{BCa}$ (95% CI)
Previous results: the basis for new project (discriminant methods)
The function dmcmetagen() provides a leaderboard featuring the results of linear discriminant analysis (LDA), with and without cross-validation, and of other, more complex methods, such as:
• Quadratic discriminant analysis (QDA): closely related to linear discriminant analysis (LDA), in that the measurements from each class are assumed to be normally distributed. Unlike LDA, however, QDA does not assume that the covariance of each of the classes is identical.
• Robust regularized linear discriminant analysis (RRLDA): robust regularized linear discriminant analysis based on the ‘rrlda’ library for R (see Friedman, 1989). RRLDA is a combination of regularization and robust methods, and is a good choice if the data contain outliers, many noisy variables, or both. The penalty parameter λ is chosen according to an adapted BIC criterion (more information at http://www.stat.tugraz.at/Statistiktage2011/PGschwandtner.pdf)
• Multiple discriminant analysis (MDA): a method for compressing a multivariate signal to yield a lower-dimensional signal amenable to classification. MDA is not directly used to perform classification; it merely supports classification by yielding a compressed signal. The method is described in Duda et al. (2001), and the analysis is based on the ‘mda’ library for R.
• Support Vector Machine (SVM): see description below. It is one of the most widely used algorithms today for classifying objects using hyperplanes (Lee and Wahba, 2004).
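$f(x) = \beta_0 + \beta^{T} x$
$\beta$ = weight vector, $\beta_0$ = bias
$\text{distance} = \dfrac{|\beta_0 + \beta^{T} x|}{\lVert \beta \rVert}$
$|\beta_0 + \beta^{T} x| = 1$
$\text{distance}_{\text{support vectors}} = \dfrac{|\beta_0 + \beta^{T} x|}{\lVert \beta \rVert} = \dfrac{1}{\lVert \beta \rVert}$
$M = \dfrac{2}{\lVert \beta \rVert}$
$\min_{\beta, \beta_0} L(\beta) = \dfrac{1}{2}\lVert \beta \rVert^{2}$ subject to: $y_i(\beta^{T} x_i + \beta_0) \ge 1 \;\; \forall i$
Maximize M
Lagrangian optimization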
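The geometric quantities in these formulas can be checked numerically. The sketch below is a Python illustration with a hand-picked hyperplane and two hypothetical points; it does not train an SVM, it only evaluates the distance, the margin M and the constraint for given β and β0.

```python
from math import sqrt


def norm(beta):
    """Euclidean norm of the weight vector."""
    return sqrt(sum(b * b for b in beta))


def decision(beta, beta0, x):
    """f(x) = beta0 + beta^T x."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


def distance(beta, beta0, x):
    """Distance from point x to the hyperplane f(x) = 0."""
    return abs(decision(beta, beta0, x)) / norm(beta)


def margin(beta):
    """Margin M = 2 / ||beta|| of the canonical hyperplane."""
    return 2.0 / norm(beta)


def satisfies_constraints(beta, beta0, points, labels):
    """Check y_i * (beta^T x_i + beta0) >= 1 for every training point."""
    return all(y * decision(beta, beta0, x) >= 1 for x, y in zip(points, labels))


# Hypothetical separating hyperplane and two support vectors (one per group).
beta, beta0 = [0.5, 0.5], -2.0
points = [(1.0, 1.0), (3.0, 3.0)]
labels = [-1, 1]

print(satisfies_constraints(beta, beta0, points, labels))
print(distance(beta, beta0, points[0]), margin(beta))
```

Each support vector lies at distance 1/‖β‖ from the hyperplane, so the margin is twice that distance; minimizing ‖β‖² under the constraints is therefore the same as maximizing M.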
End point 1: Investigate new statistical multivariate methods based on probability models (MN, DM, mixtures) to extract ecologically meaningful information from these data sets:
1.1. Characterize the biodiversity of the microbiome
1.2. Find the number of communities (subpopulations) that interact in the sample
1.3. Observe the behaviour and complexity in situations of change or evolution of the microbial community (e.g. disease status, time, space)
1.4. Propose an algorithm that extracts ecologically meaningful information based on 1.1, 1.2 and 1.3
Study advanced statistical multivariate methods based on the above-mentioned probability distributions to extract ecologically meaningful information from these data sets:
$X_{j\cdot} \sim MN\left(N_{j\cdot};\ \theta_j = (\theta_{1j}, \theta_{2j}, \ldots, \theta_{kj})\right)$
$\theta_j \sim \text{Dirichlet}(\alpha_{1j}, \alpha_{2j}, \ldots, \alpha_{kj})$
$\theta_j \mid x \sim \text{Dirichlet}(x_{1j} + \alpha_{1j},\ x_{2j} + \alpha_{2j},\ \ldots,\ x_{kj} + \alpha_{kj})$
$X_{j\cdot} \sim DM\left(N_{j\cdot};\ \alpha'_{1j}, \ldots, \alpha'_{kj}\right)$
$f(x) = \sum_{i=1}^{m} w_i\, p_i(x); \quad p_i = MN, DM$
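The multinomial-Dirichlet conjugate update is simple enough to show in a few lines. The sketch below is a Python illustration under stated assumptions: the counts and the flat prior are hypothetical, and because the posterior is available in closed form it is sampled directly with gamma draws rather than with MCMC (WinBUGS or JAGS), which is a deliberate simplification.

```python
import random


def dirichlet_posterior(counts, prior):
    """Posterior Dirichlet parameters: observed counts plus prior parameters."""
    return [x + a for x, a in zip(counts, prior)]


def dirichlet_mean(alpha):
    """Posterior mean proportion of each OTU."""
    total = sum(alpha)
    return [a / total for a in alpha]


def sample_dirichlet(alpha, rng):
    """One Dirichlet draw obtained by normalising independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]


def credible_intervals(alpha, level=0.95, draws=5000, seed=1):
    """Equal-tailed credible interval for each proportion, from direct sampling."""
    rng = random.Random(seed)
    samples = [sample_dirichlet(alpha, rng) for _ in range(draws)]
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    intervals = []
    for k in range(len(alpha)):
        column = sorted(s[k] for s in samples)
        intervals.append((column[lo_idx], column[hi_idx]))
    return intervals


# Hypothetical OTU counts in one sample and a flat Dirichlet(1, ..., 1) prior.
counts = [12, 0, 5, 3]
prior = [1, 1, 1, 1]
posterior = dirichlet_posterior(counts, prior)
means = dirichlet_mean(posterior)
intervals = credible_intervals(posterior)
for k, (m_k, (lo, hi)) in enumerate(zip(means, intervals)):
    print(f"OTU {k}: posterior mean {m_k:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

The lower bound of each interval is the kind of quantity a presence criterion based on a credible interval would compare against a threshold; the threshold itself is not specified here.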
Bhattacharyya distance
$$d_{hk} = \arccos \sum_{i=1}^{4} \sqrt{x_{hi}\, x_{ki}}$$
Kullback-Leibler divergence
$$D(\mu \,\|\, \upsilon) = -E_{\mu}\left[\log \frac{d\upsilon}{d\mu}\right]$$
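Both measures can be evaluated on OTU proportion vectors. The sketch below is a Python illustration for discrete distributions: the function names and the two example profiles are hypothetical, and the Kullback-Leibler function assumes that the second profile is positive wherever the first one is.

```python
from math import acos, log, pi, sqrt


def proportions(counts):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def bhattacharyya_distance(p, q):
    """arccos of the sum of sqrt(p_i * q_i); 0 for identical profiles."""
    coefficient = sum(sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to protect acos from floating-point rounding.
    return acos(min(1.0, max(0.0, coefficient)))


def kl_divergence(p, q):
    """Discrete D(p || q) = sum of p_i * log(p_i / q_i); needs q_i > 0 where p_i > 0."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical OTU profiles of two samples.
sample_h = proportions([40, 30, 20, 10])
sample_k = proportions([10, 20, 30, 40])

print(bhattacharyya_distance(sample_h, sample_k))
print(kl_divergence(sample_h, sample_k), kl_divergence(sample_k, sample_h))
```

The Bhattacharyya distance is symmetric and bounded by π/2, which makes it convenient as input to a clustering method such as PAM; the Kullback-Leibler divergence is not symmetric in general.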
PAM
Unsupervised affinity propagation (AP) clustering algorithm
Gaussian graph
Complexity index
Mixtures
Bayesian approach
End point 2: Validate the use of the developed library MetagenOutLDA for R (cross-validation and sensitivity / specificity of the diagnostic test)
SVM:
Accuracy : 0.8947
95% CI : (0.6686, 0.987)
No Information Rate : 0.6316
P-Value [Acc > NIR] : 0.01135
Kappa : 0.7595
Mcnemar's Test P-Value : 0.47950
Sensitivity : 0.7143
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.8571
Prevalence : 0.3684
Detection Rate : 0.2632
Detection Prevalence : 0.2632
Balanced Accuracy : 0.8571
LDA:
Accuracy : 0.7368
95% CI : (0.488, 0.9085)
No Information Rate : 0.5789
P-Value [Acc > NIR] : 0.1214
Kappa : 0.4509
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.6250
Specificity : 0.8182
Pos Pred Value : 0.7143
Neg Pred Value : 0.7500
Prevalence : 0.4211
Detection Rate : 0.2632
Detection Prevalence : 0.3684
Balanced Accuracy : 0.7216
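The diagnostic-test figures reported for SVM and LDA follow from a 2x2 confusion matrix. The sketch below is a Python illustration: the function name is hypothetical, and the cell counts are not given in the slides but are inferred here as the 19-sample tables that reproduce the reported rates, so they should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic-test summaries from a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "prevalence": (tp + fn) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Assumed counts, inferred from the reported rates (19 samples in each table).
svm = diagnostic_metrics(tp=5, fn=2, fp=0, tn=12)
lda = diagnostic_metrics(tp=5, fn=3, fp=2, tn=9)

for name, result in (("SVM", svm), ("LDA", lda)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

With only 19 samples, a single reclassified case moves accuracy by about five percentage points, which is why the reported 95% confidence intervals are so wide.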
Sensitivity / Specificity table
Sensitivity / Specificity table
Cross-validation Validation
The second, INGTNCELIAC, as a part of the validation of the statistical methods developed in the first, is funded by the PRODUO scholarship of 6000 € (Chiesi). It is a pilot medical study on the use of probiotics in the remission of non-celiac gluten intolerance. This project, like the previous one, aims to advance the knowledge of the human microbiome and its relation with diseases.
Duties pending
Research group (consortium) constituted for this research project
Hospital CIMA-Adeslas / Hospital Sant Joan de Deu
SEQUENTIA
Validation of the statistical methods
Statistical methods for a metagenomic pipeline
Clinical diagnostic
Biodiversity information
Discriminant analysis R library
Clinical diagnostic
Sample size
Multivariate methods R library
Scholarship PRODUO
MetagenOutLDA
Forsyth / University of Florida
Coordinator: Toni Monleón Getino
FIS-2018-2020
Samples collection
Biost3
Preliminary results
diagnosis
biologically relevant patterns
Statistical methods
• Study the probability distributions (MN, DM, mixtures)
• Subsetting based on distances and non-hierarchical classification methods
• Index of complexity of a metacommunity
• New approach to alpha diversity (Bayesian approach)
• New approach to beta diversity (Legendre and Caceres, 2013)
• Bayesian approach to compute sampling effort (sample size) and richness
PROBLEMS RAISED:
• Big sparse matrices
• High variability between samples
• Enterotypes (mixtures)
• Complex distributions of probabilities (mixtures)
• Sampling effort & richness uncertainty
𝑴𝑴𝑘𝑘𝑐𝑐𝑘𝑘𝑇𝑇 Sequences Functional results
SOME ALGORITHMS DEVELOPED/ING:
Preliminary results
• Study the probability distributions (MN, DM, mixtures)
• Enterotypes (mixtures): how many groups?
• Beta diversity
Algorithms: Bhattacharyya distance and PAM algorithm
BDSbiost3 for R, version 0.3
GITHUB