metagenomics: richness and sampling effort in metagenomics · metagenomics: α-diversity 29 the...

Antonio Monleón GetinoSection of Statistics. Departament of Genetics,

Microbiology and Statistics, (UB)

METAGENOMICS: Richness and

sampling effort in metagenomics

126-10-2017

Proyecto FIS 2018-2020

2

Hospital Sant Joan de Deu-Universitat de Barcelona

FIS: DIAGNOSIS METAGENOME BASED

3

DIAGNOSIS TEST

The Human Microbiome and Metagenomics: A SCIENTIFIC REVOLUTION

(*) Christine Rodriguez, Ph.D.Harvard Outreach 2012

Summer 2012 Workshop in Biology and Multimedia for High School Teachers (Harvard, MS, USA)

The possibility of directly sequencing the genomes of microbes, without the need to cultivate them opens new possibilities that imply a change of course in Microbiology

Metagenomics is a new field in which you want to obtain sequences of the genome of different microorganisms, bacteria in this case, that compose a community, extracting and analyzing their DNA globally.

This fact is a scientific revolution due to its high performance andlow cost. It allows access to the genome without seeingmicroorganisms or cultivate them, with all the implications that itimplies (DIAGNOSTIC, BIOMONITORING: EARLY WARNING,BIOREMEDIATION, …).

Metagenome: a collection of genes sequencedfrom the environment could be analyzed in away analogous to the study of a single genome(Hadelsman et al, 1998)

Microbiota vs Microbiome?

5Summer 2012 Workshop in Biology and Multimedia for High School Teachers

Microbiota• The microorganisms that live in an established

environment

Microbiome: The full complement of microbes, theirgenes, and genomes in a particular environment

Microbes in the Human Microbiome include species from each major domain

Summer 2012 Workshop in Biology and Multimedia for High School Teachers

Bacteria

“Extremophile”Archaebacteria

Fungi

Bacteria• Have no nucleus or membrane bound

organelles• Often sphere (cocci) or rod (bacillus)

shape, but others as well

Eukaryotes• Have a true nucleus and

membrane bound organelles• Wide variety of shapes. For this

presentation, we will focus on fungi

• Fungi are unique since they have a cell wall and form spores during reproduction

Microbial life

Summer 2012 Workshop in Biology and Multimedia for High School Teachershttp://nihroadmap.nih.gov/hmp/

Some microbes are native, normally found in the body

Some microbes are introduced, suddenly arriving at a new residence in the body

Microbes are normally found in and on the human bodyThe following sites are “hotspots” for microbial life

What’s Happening inthe Oral Cavity?

Summer 2012 Workshop in Biology and Multimedia for High School Teachershttp://en.wikipedia.org/wiki/File:Teeth_by_David_Shankbone.jpg

A wide variety of microbes regularly enter the oral cavity

saliva, pH, temperature, immune system prevent many species from surviving

Oral antibiotics inhibit growth

Brushing and flossing teeth clears some built up biofilm

Symbiosis of the oral microbes that are able to survive these conditions form an elaborate scaffold that lives on the tooth enamel and at the interface with the gums. It forms a barrier for incoming bacteria.

Fusobacterium, Streptococcus mitis, etc

http://commons.wikimedia.org/wiki/File:Intestine_and_stomach_-_transparent_-_cut.png

http://microbewiki.kenyon.edu/index.php/File:G_reaction1.jpg

Ruminococcus sp. bacteria can be found in significantly high numbers in the gut flora. They break down cellulose in the gut, helping with digestion.

Helicobacter pylori bacteria has a helical shape and colonizes the stomach and upper G.I. tract. It is known to be a major cause of stomach ulcers, although many with H. pylori do not get ulcers.

http://commons.wikimedia.org/wiki/File:Helicobacter_pylori_diagram.png

What’s Happeningin the Gut?


The microbiota has important functions: • Prevents colonization by pathogens• Educates the immune system• Metabolic role• Participates in drug metabolism• …

- Complex community of microbes –estimated to contain 200 trillion cells

- > 1000 diverse microbial species

- 10 x the number of human cells in our body

- Gut microbiome is 150 x larger than the human genome

Gut Microbiota


Factors affecting Gut Microbiome

Is My Gut Microbiome the Same as Yours?


My Gut Species Your Gut Species

The number and amount of the many different microbes can vary

greatly from person to person.

Enterotypes (Wadsworth et al.,2017)

“Very important difficulyt for a correct statistical analysis”

• Bacteroidetes and Firmicutes make up around 90% of the gut microbiota

• Each individual harbors his/her own distinctive pattern of gut microbial communities

• For a given individual, the fecal microbiota remains remarkably stable over a person’s lifetime

So many new questions to answer about the Human Microbiome…

How does the gut flora modify drugs, and how can we minimize side effects?

Why does my gut flora look different than yours? How does that affect obesity, food allergies, and ability to fight disease?

Are we making germs more resistant to anitmicrobials? What happens when the germs are resistant to all of the drugs in our arsenal?

Is possible to do a clinical test(distinctive pattern microbiomabased) to discriminate diseasesusing microbioma?

Is posible to do an “early warningbiomonitoring system” based onmetagenomics

Why metagenomics?

14

Basically we try to answer two questions:

Statistics

met

agen

omic

s

Metagenomics is the study of microbiomasfrom the direct isolation of DNA from an environment.

NGS

Statistics methodsbased on

community ecology

16SMT

Sequencing + bioinformatics + statistics

Sequencing technology

16

Sequencing techniques have improved dramatically its capabilities and decrease it’ s price

16S/18S-based metagenomics

cost-effective

kingdom-limited

lack of functional identification

phylum/genus level Bacterial and archaeal genomes submitted to Genbank

Sequencing capability

Prices

Metagenomic procedures

17

Shotgun metagenomics16S/18S-based metagenomics WGS-based

metagenomicsMetatranscriptomics

cost-effective

kingdom-limited

lack of functional identification

phylum/genus level

functional identification of expressing genes

differential expression analysis

expensive

family/genus level

functional identification

strain level

unable to know the gene expression abundance

computationally expensive

NGS: NextGenerationSequencing

WGS: WholeGenomeSequencing

Why shotgun metagenomics?

18

Most NGS used in metagenomics

Next Generation Sequencing (NGS) Technologies

Ilumina Ion TorrentPGM PacBio Oxford

Nanopore

INSTRUMENTIlluminaNextSeq

500

Ion Torrent– PGM 316

chip

Pacific Biosciences

RS II

Oxford Nanopore MinION

AMPLIFICATION REQUIRED YES NO

TYPE OF AMPLIFICATION BridgePCR emPCR -

MAXIMUM READ LENGHT (BP) 300 400 3000 9000

ERROR RATE (%) 0,1 1 13 4

RUN TIME 26 h 4,9 h 2 h <= 6 h

Adapted from http://www.molecularecologist.com/next-gen-fieldguide-2016/

There are manytechnologies, wherethis table summarizessome of them

Bioinformatic anaysisin metagenomics: just one way to do it?

Figure from Siegwald, L., Touzet, H., Lemoine, Y., Hot, D., Audebert, C., & Caboche, S. (2017). Assessment of Common and Emerging Bioinformatics Pipelines for Targeted Metagenomics. PloS One, 12(1), e0169563.

OTU: PSEUDOESPECIE

PSEUDO-ABUNDANCE

OTU (Operational Taxonomic Unit): Taxonomic level of sampling selected by the user to be used in a study. Typically using a percent sequence similarity threshold forclassifying microbes with the same, or different OTUs

Binning: The process of groupping reads or contigs and assignin them to OTUs

Research project: “Characterizing the structure of the complex microbiological communities using biodiversity and distance-based multivariate methods to obtain biologically relevant patterns and development of non-invasive methods for differential diagnosis”

Statistical analysis:work togheter with microbiologists, doctors and ecologysts to understant the problem

21

At this step, we try to find the statistical methods to solve this

questions

GUIDELINES (2016): Schema for a human microbiome study

22

• For both 16S rRNA and shotgun metagenomic sequencing,study participants can be compared by alpha diversity (i.e.,within participant diversity) and beta diversity (i.e., betweenparticipant diversity) metrics.

• For alpha diversity, conventional statistical methods are oftenused. Typically, random sampling for a standardised number ofOTUs (i.e., rarefaction) is conducted to minimise bias due toamplification or sampling efficiency and then analyses includeadjustment for taxa abundances in order to minimise influenceof rare taxa.

• For associations with the entire microbial community, betadiversity analyses are based on a matrix of distances betweenall pairs of specimens, followed by principle coordinate analysisand higher level statistics

Schema for a human microbiome study. To conduct a humanmicrobiome study, it is important to develop a well-formulatedhypothesis, apply a valid study design, and collect requisite covariatedata. Relevant specimens (e.g., oral, faecal, tissue, or other applicablesamples) should be collected and promptly frozen or chemicallystabilised. Once the DNA is extracted, currently there are two typicalmethods for sequencing: 16S rRNA gene sequencing (in blue) or shotgunmetagenomic sequencing (in red). To date, most epidemiologic-scalestudies profile the microbiome by amplifying and sequencing only theprokaryote 16S rRNA gene. Once the 16S rRNA sequencing is completed,the data are often processed using various publically available tools thatare used to cluster the sequences into operational taxonomic units(OTUs) and to assign taxonomy using public sequence databases. Forshotgun metagenomic sequencing, the DNA is sheared and then all thefragments are sequenced. From this type of data a variety ofbioinformatic processing can be conducted, but often the short readsare used to cluster OTUs and assign taxonomy, similar to 16S, but also todetermine the functional capabilities of the genes present by mappingthe reads to public gene databases.

The metagenomic guidelines suggested the use of diversity

statistical analysis

How to measure metagenomediversity?

23

• Population diversity is a very important concept when dealing with OTUsor other taxonomic “bins” because this is critical for human health.

• Since a number of disease conditions have been shown to correlate withdecreased microbiome diversity: it can affect human health seriously.

• Human gut contents appear to be highly personalized when consideredin terms of microbial PRESENCE, ABSENCE and ABUNDANCE.

Using the same concepts as in community ecology but with

their particularities…

• α,β-diversity:– α-diversity measures acts like a summary statistic of a single population.– β-diversity measures acts like a similarity score between populations,

allowing analysis by clustering or other multivariate analysis.

Biodiversity in metagenomics: complexity in a multidimensional perspective !!

24

Ecology

FunctionTaxonomy

Dynamics

Many concepts in metagenomic diversity

25

RAREFACTION / NON PARAMETRIC ESTIMATORS OF RICHNESS

SAMPLING EFFORT

It is necessary to accommodate concepts to metagenomics

Measure diversityin metagenomics

26

See the compilation of methods described at Monleón-Getino,“A statistician knight in King Molecular’ s Court”

(Introduction to the Species Divergence and the Measurement of Microbial Diversity)-2017

Lozupone CA and Knight R. 2008. SpeciesDivergence and the Measurement of MicrobialDiversity. FEMS Microbiol Rev. Jul; 32(4): 557–578

Many methods to measure diversity

Measure diversity in metagenomics: α-diversity

• α-diversity or “local diversity” isdenoted as the diversity within acommunity at a particular site.

• This diversity is at small-local scale,generally of the size of local area (e.g.for microbiome gut area, tooth, etc.).

• A common definition of α-diversity isreferred to a number of species(richness) at one site, but someformulations also consider theproportion that each speciesrepresents (evenness), in which case ahigh diversity implies a large number ofspecies with similar abundances.

27

How to calculate effortcurves (sampling curves) inmetagenomics?

Measure diversityin metagenomics: α-diversity

28

N=100 (true)

UNDERSAMPLING

Rarefaction curve (Simulation: 500 OTU, 5 samples, 120 bins each

sample)

True, only if the sampling is correct

Kindt's exact method: Species accumulationcurve (Simulation: 500 OTU, 5 samples, 120

bins each sample)

Only 250 OTU?

Only 100 OTU?

Measure diversity in metagenomics: α-diversity

29

The Shannon index (H’) is the most popular diversity index for the microbiologist when used metagenomics. Is also known asShannon's diversity index, Shannon entropy, Shannon–Wiener index or Shannon–Weaver index. The measure was originallyproposed by Claude Shannon to quantify the entropy (uncertainty or information content) in strings of text (Shannon, 1948) .

H’ quantifies the uncertainty (entropy or degree of surprise) associated with this prediction. It is most frequently calculated asfollows:

21

' log ( )R

i ii

H p p=

= −∑where pi is the proportion of individuals belonging to the ith OTUs in the dataset of interest. Then the Shannon entropy quantifiesthe uncertainty in predicting the OTUs assigned to an individual that is taken at random from the dataset.

Another index is proposed to estimate α-diversity that takes into account the phylogenetic species divergence is the thetaalpha biodiversity (θ) is calculated as,

where k is the number of individuals sampled, pi is the frequency of the ith sequence, pj is the frequency of the jthsequence, and dij is the phylogenetic divergence between the two sequences. θ is widely applied in molecular evolutionand population genetics. As an interesting feature θ is also identical to the taxonomic diversity index (Δ) described byClark and Warwick (1995) .Also θ is equal to ½ the Rao Diversity Coefficient (RaoDIV) (Rao, 1982), which reduces tothe Gini-Simpson index (1 − λ) when all pairwise distances between species are equal.

1( )

k

i j iji j i

p p dθ= <

=∑∑

Often quantified byShannon index:

Measure diversity in metagenomics: β-diversity

30

Many approaches: Diversity betweencommunities, groups, samples…


31

Legendre and Caceres (2013)

See the compilation of methods described at Monleón-Getino, “A statistician knight in King

Molecular’ s Court(Introduction to the Species Divergence and

the Measurement of Microbial Diversity)-2017

I have adapted a new concept of beta-diversity to

metagenomics


32

Also is possible to deal with beta-diversity using

multivariate approaches like …

QUESTIONS TO SOLVE

• Is the sampling effort enough?

• How many samples to estimatethe richness (OTUs)? More? Less?

• Are the replications OK

33

Richness

34

Tutorial

Richness: how to measure it?

35

Tutorial

Bayesian point of view: a new familyof algorithms to calculate the diversity

• The most frequently used statistical methods are known as frequentist (or classical) methods.• These methods assume that unknown parameters are fixed constants, and they define probability by using limiting

relative frequencies.• It follows from these assumptions that probabilities are objective and that you cannot make probabilistic statements

about parameters because they are fixed.• Bayesian methods offer an alternative approach; they treat parameters as random variables and define probability as

"degrees of belief" (that is, the probability of an event is the degree to which you believe the event is true).• It follows from these postulates that probabilities are subjective and that you can make probability statements about

parameters.• The term "Bayesian" comes from the prevalent usage of Bayes’ theorem, which was named after the Reverend Thomas

Bayes, an eighteenth century Presbyterian minister.• Bayes was interested in solving the question of inverse probability: after observing a collection of events, what is the

probability of one event?

36

Link at bayesian theory

𝑃𝑃 𝜃𝜃 𝑥𝑥 =)𝑃𝑃 𝑥𝑥 𝜃𝜃 𝑃𝑃(𝜃𝜃

)𝑃𝑃(𝑥𝑥Based on: Pullen and Kumaran (2010): Application ofMultinomial-Dirichlet Conjugate in MCMC Estimation : A BreastCancer Study . Int. Journal of Math. Analysis 4

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect002.htm

Algorithm

37

A new bayesian method to estimaterichness and sample size (sampling effort) in metagenomics analysis

38

Multinomial-Dirichlet Conjugate distributions (bayesian approach)

Priori distribution

Bayes formula:

Evidence (difficult to calculate):

MCM

C

Using a projection of the Species accumulation curve (using Weibull: SATURATIVE or LOESS + Michaelis-Menten growth model: NON SATURATIVE)

Winbugs/ jags𝑃𝑃(𝜃𝜃𝑘𝑘|𝑥𝑥)(𝐵𝐵𝐵𝐵𝐵𝐵)

Species acumulation curve (Sa)

RICHNESS:

Using the inverses of Weibull or Michaelis-Menten

growth model)

SAMPLING EFFORT: Criteria: especie i is presentif BCIlower > 𝑃𝑃∗(𝜃𝜃𝑘𝑘|𝑥𝑥)

Function to project richnessat S(x)

• Weibull growth function: W 𝑥𝑥 = 𝑎𝑎 − 𝑏𝑏𝑒𝑒−𝑐𝑐𝑐𝑐𝑚𝑚

• Michaelis and Menten equation: MM 𝑥𝑥 = 𝑉𝑉𝑥𝑥/(𝑘𝑘 + 𝑥𝑥)

• LOESS regression: LOESS and LOWESS (locally weighted scatterplot smoothing) are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model.

39

where w(x) is the richness of sample (x) and now a=α, b=α-β, c=κγ and m= γ. a,b, c and m are parameters to be estimated and e is the base of the naturallogarithms. a is the asymptote of limiting value of the response variable (w),limx→∞wx=a, that represent the maximum number of species. b is the biologicalconstant (lower asymptote), c is the parameter governing the rate at which theresponse variable approaches its potential maximum a and m is the allometricconstant.

extrapolated richness value

http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2656.2006.01048.x/full


40

Rarefaction curve (Simulation: 500 OTU, 5 samples, 120 bins each

sample)

Is the sampling effort enough?How many samples to estimatethe richness (OTUs)? More? Less?

Kindt's exact method: Species accumulationcurve (Simulation: 500

OTU, 5 samples, 120 binseach sample)

Only 250 OTU?Only 100 OTU?

Vegan for R (Oksanen, 2017)

UN

DERS

AMPL

ING

HIG

H RI

CHN

ESS

LOW

ABU

NDA

NCE

Only 300-400 OTU?

Library for R: BDSbiost3 (bayesian function to richness calculation and sampling effort)

https://github.com/amonleong/BDSbiost3

Non parametricasymptotic richness

estimation

MM (90, 95, 99%)[5.2, 11, 57.3]

MCMC method (JAGS for R) onth

ere

boun

d

W (90, 95, 99%):[2.1, 2.8, 3.7]

Bayesian sampling effort curve

Sam

plin

gef

fort

MM : 570.5

W : 499

Rich

ness

Beta-diversity(between samples): quality control

Different species accumulation model


Simulations using the library

41

SCENARIO 3: ↑ Ab; ↑ Rs; Correct sampling /Over-Sampling

Met

agen

Sam

ple.

size.

H1-M

CMC

spec

pool

() -V

egan

Abundance: 10000; OTUs: 500; Samples: 15

Sampling effort

W(90,95,99%:{2.1, 4.1, 15.0}

MM (90, 95, 99%): [2.2, 4.6, 24.1]


42

SCENARIO 2: ↓ Ab; ↑ Rs; Over-Sampling

Abundance: 75; OTUs: 500;

Samples: 15

Met

agen

Sam

ple.

size.

H1-M

CMC

spec

pool

() -V

egan

Sampling effort

W (90, 95, 99%):{2.9, 5.1, 11.7}

MM (90, 95, 99%): [3.5, 7.4, 8.5]


43

SCENARIO 13: ↓Ab; ↓Rs; Under-Sampling

Ab:100, OTUs: 50; Samples: 4

Sampling effort

W(90,95,99%:{2.1, 2.6, 3.3}

MM (90, 95, 99%): [5.7, 12.1, 63.01]


44

Library for R: BDSbiost3 Better to estimate richness and number of samples(effort of sampling) based on MCMC and a Diritchet-Multinomial distribution

• Species richness in an assemblage is difficult to estimatereliably from sample data because it is very sensitive to thenumber of individuals and the number of samples collectedthat tent to be low in genomic analysis (low abundance,high number of OTUs, variability, …).

• Is very important to know the number of samples to acorrect estimation of Richness (sampling effort)

• Monleon-Getino,• Rodriguez-Casado,• Méndez-Viera.

9/2017



GAIA: Metagenomics analysis from any matrix at strain level

45

http://www.metagenomics.cloud./

http://www.metagenomics.cloud./

Trex: COSMOCAIXA 27-10-17

• Reviu els dinosauresCinna.upc.edu:3838/GRBIOSTAT3/paleo

46

Trix, el tiranosaurio más bien conservado del mundo

cinna.upc.edu:3838/GRBIOSTAT3/paleo

Antonio Monleón GetinoSection of Statistics. Departament of Genetics, Microbiology and Statistics, (UB)

THANKS to MariaPilar Giménez, Cesc Múrries,

Javier Méndez, M. Cambra

47

beta0 chains 1:3

iteration1001 2500 5000 7500 10000

-0.5

0.0

0.5

1.0

More results on metagenomics

• PhD-stays in the United States of America-Cambridge/Boston (2011, 2012, 2015) at the Forsyth Institute.

• Library for R developed, recently published 2016: MetagenOutLDA.

• 3 articles related, recently published (2016, 2017, 2017) + 2 books (2017: Winbugs + metagenomics diversity).

• 3 congress communications

• A research line opened

48

Previous results: the basis for new project

49

Starting from matrix: M

MetagenOutLDA Diagnosis metagenomic-based

Discriminant analysis (Classical + Machine learning methods)g1

g2

Previous results: the basis for new project (main results)

50

g1

g2

Disc

rimin

antm

etho

ds

�β ± 𝐵𝐵𝐵𝐵𝑎𝑎(95%𝐵𝐵𝐵𝐵)

𝑦𝑦𝑖𝑖 = 𝑓𝑓(𝛽𝛽, 𝑥𝑥𝑖𝑖′) + 𝜖𝜖𝑖𝑖

Particular model:

𝜖𝜖𝑖𝑖~𝑁𝑁(0, 𝜎𝜎2)

Hope

fulr

esul

ts

Previous results: the basis for new project (robust Bca percentile method)

51

/ 2 0 1 / 2 01 0 2 0

/ 2 0 1 / 2 0

1 ( ), 1 ( )1 ( ) 1 ( )

z z z zz za z z a z zα α

α α

β β −

−

− −= −Φ + = −Φ +

− − − −

100(1 )% confidence intervalα−1 2

* *(( 1) (1 )) (( 1) (1 ))ˆ ˆ,B BLCL UCLβ βθ θ+ × − + × −= =

0estimate , z a

*2x *

Bx

*(2)θ̂

1 2ˆ ( , , ..., ) ( )ndata x x x sθ= ⇒ =x x

* *ˆget resample statistics ( ) and then sort themb bsθ = x

*1x

resample B times

*(1)θ̂ *

( )ˆ

Bθ*(2)θ̂≤ ≤ ≤

{ }1 *0

1

1 ˆ ˆestimate by 1 and by JackknifeB

bb

z aB

θ θ−

=

Φ ≤

∑

1( ) zαα−Φ = −

Flowchart of The BCa Percentile Method used by the function dmcbiodiv() to estimate de interval of confidence of (β0, β1, β2,…)

�𝜷𝜷 ± 𝑩𝑩𝑩𝑩𝑩𝑩(𝟗𝟗𝟗𝟗%𝑩𝑩𝟗𝟗)

Previous results: the basis for new project (discriminant methods)

The function dmcmetagen() provides a leader board featuring the results of linear discriminant analysis (LDA)with and without cross-validation and other more complex, such as:• Quadratic discriminant analysis (QDA): Quadratic discriminant analysis (QDA) is closely related to linear

discriminant analysis (LDA), where it is assumed that the measurements from each class are normallydistributed. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classesis identical

• Robust regularized linear discriminant analysis (RRLDA): methods to perform robust regularized lineardiscriminant analysis based on ‘rrlda’ library for R. (See Friedman, 1989). RRLDA is a combination ofregularization and robust methods. RRLDA is a good choice if data contain either outliers or manynoisy variables or both. Penalty parameter λ is chosen according to an adapted BIC criterion (Moreinformation at http://www.stat.tugraz.at/Statistiktage2011/PGschwandtner.pdf )

• Multiple discriminant analysis (MDA): Multiple Discriminant Analysis (MDA) is a method for compressing amultivariate signal to yield a lower-dimensional signal amenable to classification. MDA is not directly usedto perform classification. It merely supports classification by yielding a compressed signal amenable toclassification. The method described in Duda et al., (2001). Methods to perform MDA discriminantanalysis based on ‘mda’ library for R.

• Support Vector Machine (SVM): see description below. Is one of the most used algorithms nowadays toclassify objects using hyperplanes (Lee and Wahba, 2004).

52

𝑓𝑓 𝑥𝑥 = 𝛽𝛽0 + 𝛽𝛽𝑇𝑇𝑥𝑥

𝛽𝛽 =weight vector,𝛽𝛽0 = bias

𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑎𝑎𝑑𝑑𝑑𝑑𝑒𝑒 =𝛽𝛽0 + 𝛽𝛽𝑇𝑇𝑥𝑥

𝛽𝛽

𝛽𝛽0 + 𝛽𝛽𝑇𝑇𝑥𝑥 = 1

𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑎𝑎𝑑𝑑𝑑𝑑𝑒𝑒𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑣𝑣𝑣𝑣𝑐𝑐𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 =𝛽𝛽0 + 𝛽𝛽𝑇𝑇𝑥𝑥

𝛽𝛽=

1𝛽𝛽

𝑀𝑀 = 2𝛽𝛽

𝑚𝑚𝑑𝑑𝑑𝑑𝛽𝛽, 𝛽𝛽0

𝐿𝐿 𝛽𝛽 = 12𝛽𝛽 2 𝑑𝑑𝑠𝑠𝑏𝑏𝑠𝑠𝑒𝑒𝑑𝑑𝑑𝑑 𝑑𝑑𝑡𝑡: 𝑦𝑦𝑖𝑖 𝛽𝛽𝑇𝑇𝑥𝑥𝑖𝑖 +𝛽𝛽0 ≥ 1∀𝑖𝑖

Maximize MLagrangianoptimization

http://www.stat.tugraz.at/Statistiktage2011/PGschwandtner.pdf

End point 1:Investigate new statistical multivariate methods based on the probability models (MN, DM, mixtures) to extract ecologically meaningful information from these data sets :

53

1.1.Characterize the biodiversity of the

microbiome:

1.2.Find the number of communities

(subpopulations) that interact in the sample:

1.3.Observe the behaviour and complexity in

situations of change or evolution of the microbial community (e.g. disease

status, time, space)

1.4.Propose an algorithm that extract ecologically meaningful information

based on 1.1., 1.2. and 1.3

Study advanced statistical multivariatemethods based on the probability distributionsabove-mentioned to extract ecologicallymeaningful information from these data sets :

𝑋𝑋𝑗𝑗.~𝑀𝑀𝑁𝑁(𝑁𝑁𝑗𝑗.; 𝜃𝜃𝑗𝑗 = (𝜃𝜃1𝑗𝑗, 𝜃𝜃2𝑗𝑗,… , 𝜃𝜃𝑘𝑘𝑗𝑗))

�𝜃𝜃𝑘𝑘~𝐷𝐷𝑑𝑑𝐷𝐷𝑑𝑑𝑑𝑑𝐷𝐷𝐷𝑒𝑒𝑑𝑑(𝛼𝛼1𝑗𝑗, 𝛼𝛼2𝑗𝑗,… , 𝛼𝛼𝑘𝑘𝑗𝑗𝜃𝜃𝑗𝑗|𝑥𝑥~𝐷𝐷𝑑𝑑𝐷𝐷𝑑𝑑𝑑𝑑𝐷𝐷𝐷𝑒𝑒𝑑𝑑 𝑥𝑥1𝑗𝑗 + 𝛼𝛼1𝑗𝑗, 𝑥𝑥2𝑗𝑗 + 𝛼𝛼2𝑗𝑗,… , 𝑥𝑥𝑘𝑘𝑗𝑗 + 𝛼𝛼𝑘𝑘𝑗𝑗𝑋𝑋𝑗𝑗.~𝐷𝐷𝑀𝑀 𝑁𝑁𝑗𝑗., 𝛼𝛼′1𝑗𝑗,… , 𝛼𝛼′𝑘𝑘𝑗𝑗

𝑓𝑓 𝑥𝑥 = ∑𝑖𝑖=1𝑚𝑚 𝑤𝑤𝑖𝑖 𝑝𝑝𝑖𝑖 𝑥𝑥 ;𝑝𝑝𝑖𝑖 = 𝑀𝑀𝑁𝑁, DM

Bhattacharyya distance

Kullback-Leiberdivergence

4

1arccoshk hi ki

id x x

=

= ∑

( || ) [log ]dD Edµυµ υµ

= −

PAMUnsupervised Affinity propagation (AP) clustering algorithm

Gaussian graph

Complexity index

Mixtures

Bayesian approach

End point 2: Validade the use of the developed library MetagenOutLDA for R(cross-validation and Sensitivity / Specificity of the diagnostic test )

SVM:Accuracy : 0.8947 - 95% CI : (0.6686, 0.987)No Information Rate : 0.6316 P-Value [Acc > NIR] : 0.01135 Kappa : 0.7595 Mcnemar's Test P-Value : 0.47950 Sensitivity : 0.7143 Specificity : 1.0000 Pos Pred Value : 1.0000 Neg Pred Value : 0.8571 Prevalence : 0.3684 Detection Rate : 0.2632 Detection Prevalence : 0.2632 Balanced Accuracy : 0.8571

LDA:Accuracy : 0.7368 - 95% CI : (0.488, 0.9085)No Information Rate : 0.5789 P-Value [Acc > NIR] : 0.1214 Kappa : 0.4509 Mcnemar's Test P-Value : 1.0000

Sensitivity : 0.6250 Specificity : 0.8182 Pos Pred Value : 0.7143 Neg Pred Value : 0.7500 Prevalence : 0.4211 Detection Rate : 0.2632 Detection Prevalence : 0.3684 Balanced Accuracy : 0.7216

54

Sensitivity / Specificity table

Sensitivity / Specificity table

Cross-validation Validation

The

seco

ndIN

GTN

CELI

AC,

asa

part

ofth

eva

lidat

ions

ofth

est

atist

ical

met

hods

deve

lope

din

the

first

,is

foun

ded

byth

esc

hola

rshi

pPR

ODU

Oof

6000

€(C

hies

i).It

isa

pilo

tm

edic

alst

udy

onth

eus

eof

prob

iotic

sin

the

rem

issio

nof

non-

celia

cgl

uten

into

lera

nce.

This

proj

ect,

like

the

prev

ious

one,

aim

sto

adva

nce

the

know

ledg

eof

the

hum

anm

icro

biom

ean

dits

rela

tion

with

dise

ases

.

Dutie

spe

ndin

g

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Research group (consortium) constituted for this research project

55

Hospital CIMA-Adeslas

/ Hospital Sant Joan de

Deu

SEQUENTIA

Validation of the statistical

methods

Statitical methods for a metagenomic Pipeline

Clinical diagnostic

Biodiversityinformation Discriminant

analysis R library

Clinicaldiagnostic

Sample sizeMultivariatemethods R library

ScholarshipPRODUO

MetagenOutLDA

Forsyth / University of

Florida

Coordinator: Toni Monleón Getino

FIS-2018-2020Samples collection

Biost3

Preliminary results

diagnosis

biologically relevant patterns

Statistical methods

• Study the probability distributions (MN, DM, mixtures)

• Subsetting based on distances and non-hierarchical classification methods

• Index of Complexity of a metacommunity• New approach to alpha diversity

(bayesian approach)• New approach to beta diversity

(Legendre and Caceres, 2013)• Bayesian approach to compute sampling

effort (sample size) and richness

PROBLEMS RAISED:• Big sparse matrices• High variability between samples• Enterotypes (mixtures)• Complex distributions of probabilities

(mixtures)• Sample effort & richness uncertainty

𝑴𝑴𝑘𝑘𝑐𝑐𝑘𝑘𝑇𝑇 Sequences Functional results

SOME ALGORITHMS DEVELOPED/ING:

Preliminary results

57

• Study the probability distributions (MN, DM, mixtures)• Enterotypes (mixtures): how many groups?• Beta diversity

Algorithms: Bhattacharyya distance and PAM algorithm

58

BDSbiost3 for R, version 0.3GITHUB

metagenomics: richness and sampling effort in metagenomics · metagenomics: α-diversity 29 the...

Documents