Identifying and Predicting Novelty in
Microbiome Studies based on Microbiome
Searching
Xiaoquan SU, Ph.D.
Bioinformatics Group, QIBEBT
Chinese Academy of Sciences
mse.ac.cn
Probiota 2019, Denmark
Microbiome big-data
Thompson, et al., Nature, 2017
Gilbert, et al., Nature, 2016
Blaser et al., mBio., 2016
• Human health
• Environment
• Agriculture
• Industry
• Bioenergy
Microbiome
• The basic functionality unit of microorganisms
• Enormous potential in solving crucial issues
[Figure: the number of papers, y-axis from 0 to 25,000]
The number of microbiome studies is increasing exponentially
Data from NSL-CAS, 2018, May
Microbiome Big-data:
More information, more challenges
PB level data
1 PB = 1024 TB
1 TB = 1024 GB
1 GB = 1024 MB
Microbiome big-data
http://metagenomics.anl.gov/
# of public metagenomes: 36,632
# of sequences: >10,000,000,000
# of studies: 1312
https://qiita.ucsd.edu/
# of public metagenomes: 242,593
# of sequences: >20,000,000,000
# of studies: 1163
https://img.jgi.doe.gov/
# of public metagenomes: 6,595
# of sequences: >2,000,000,000
# of studies: 252
Challenge 1: Single-use of microbiome big-data
PB level data, however:
• Just data depository
• Single use, no further mining
Microbiome big-data
Query words → The Internet → Matched pages
Web search engine: keyword to keywords
Challenge 2: Difficult to search in microbiome big-data
Microbiome big-data
Query microbiome → Microbiome big-data → Matched samples
No solution for “community to communities” search
BLAST: sequence to sequences
Query reads → Reference genomes → Matched sequences
New technique for Microbiome Big-data Science
Microbiome Search Engine (MSE)
?
Microbiome Database
Input: Microbiome structure
Output: Structurally similar microbiome(s) and meta-data
Identification based on microbiome structural similarity
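The “community to communities” idea above can be sketched as a ranking of database samples by structural similarity to a query. This is only an illustration: it uses a brute-force scan and Bray-Curtis similarity as a stand-in measure, not the indexing or scoring that MSE itself uses. The names `Profile`, `bray_curtis_similarity`, and `search`, and the toy abundances, are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a microbiome structure is a map from taxon name to relative abundance.
Profile = Dict[str, float]


def bray_curtis_similarity(a: Profile, b: Profile) -> float:
    """Return 1 - Bray-Curtis dissimilarity of two abundance profiles."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total > 0 else 0.0


def search(query: Profile, database: Dict[str, Profile],
           top_k: int = 10) -> List[Tuple[str, float]]:
    """Rank every database sample by similarity to the query (linear scan)."""
    scored = [(sample_id, bray_curtis_similarity(query, profile))
              for sample_id, profile in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy database with made-up abundances, for illustration only.
database = {
    "gut_1": {"Bacteroides": 0.6, "Faecalibacterium": 0.4},
    "gut_2": {"Bacteroides": 0.5, "Faecalibacterium": 0.3, "Prevotella": 0.2},
    "soil_1": {"Acidobacteria": 0.7, "Streptomyces": 0.3},
}
query = {"Bacteroides": 0.6, "Faecalibacterium": 0.4}
for sample_id, similarity in search(query, database):
    print(sample_id, round(similarity, 3))
```

The output of such a search (matched samples plus their similarities) is the kind of input the scores later in the talk would build on.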
Microbiome Search Engine
New technique for Microbiome Big-data Science
Microbiome Search Engine (MSE)
Online system
http://mse.ac.cn
Standalone package & QIIME 2 plugin
Linux / Mac OS X / Linux subsystem of Windows 10
Microbiome Search Engine
Efficiency evaluation of MSE Search against 1,000,000 samples
MSE enables in-depth data mining among microbiome big-data
• 0.29 s search time against over 1 million samples (340× speedup)
• Constant search speed, insensitive to database size
Quad Intel Xeon E7, 40 cores
Microbiome Search Engine
A “bird's-eye view” of the global microbiome pattern
• Earth Microbiome Project¹: 27,715 samples (Nature, 2017)
• Microbiome Search Engine: 101,983 samples, 301 studies (mBio, 2018)
¹ Thompson et al., Nature, 2017
Data-driven Research
The “Microbiome Data Space”
HMP
EMP
MetaHIT
Another microbiome study
Data-driven Research
A “star” is a microbiome
MSE is a super telescope for discovering similar stars (microbiomes)
Data-driven Research
The total number of known bacteria is N (e.g., 1,000,000)
Theoretically, the number of microbiome structures with m (1 ≤ m ≤ N) bacteria is
$$\sum_{m=1}^{N} C_N^m \times Abd(m) \to \infty$$
The global microbiome data: an “infinite” space?
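A small calculation shows how quickly this space grows even before abundances are considered. The sketch below sets Abd(m) = 1, an assumption made here purely for illustration, so it counts only presence/absence patterns; the function name is hypothetical.

```python
from math import comb


def presence_absence_patterns(n: int) -> int:
    """Sum of C(n, m) for m = 1..n, i.e. the non-empty subsets of n taxa."""
    return sum(comb(n, m) for m in range(1, n + 1))


for n in (10, 20, 50):
    # The sum has the closed form 2**n - 1.
    print(n, presence_absence_patterns(n), 2 ** n - 1)
```

With only 50 taxa the count already exceeds 10^15, so any abundance term Abd(m) greater than 1 only enlarges the space further.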
The phylogeny tree
A microbiome
Microbiome Novelty Score (MNS)¹
MNS measures the uniqueness of a microbiome in a database: higher MNS = higher novelty
¹ Su et al., mBio, 2018
$$MNS = 1 - \frac{\sum_{i=1}^{10} S_i \times (10 - i)}{\sum_{i=1}^{10} (10 - i)}$$
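The MNS formula can be sketched in a few lines. The code assumes that S_i is the similarity (between 0 and 1) of the i-th best match returned by a search, ordered from best to worst; the slide does not spell this out, and the function name is hypothetical.

```python
from typing import Sequence


def microbiome_novelty_score(top_similarities: Sequence[float]) -> float:
    """MNS = 1 - sum(S_i * (10 - i)) / sum(10 - i), for i = 1..10."""
    if len(top_similarities) != 10:
        raise ValueError("expected the similarities of exactly 10 top matches")
    weights = [10 - i for i in range(1, 11)]  # 9, 8, ..., 0
    weighted = sum(s * w for s, w in zip(top_similarities, weights))
    return 1.0 - weighted / sum(weights)


# A sample whose top matches are all identical to it has no novelty.
print(microbiome_novelty_score([1.0] * 10))
# A sample with no similar neighbours has maximal novelty.
print(microbiome_novelty_score([0.0] * 10))
```

Because the weights decrease with rank, the closest matches dominate the score; with the weights as written, the 10th match carries weight 0.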
Historical trend of MNS of 101,983 samples from 2010-2017
Normal Distribution (Pearson r=0.92±0.07)
Therefore, we set the 2010 mean MNS (0.15) as the novelty baseline
Data-driven Research
Human¹ vs. Non-human²: 1:6
Thus, many more novel microbial patterns exist in natural environments
Trends of novel samples in each sub-category
¹ Human: Gut, Skin, Oral, Urogenital, etc.
² Non-human: Animal, Marine, Lake, River, Soil, House, etc.
Total non-human
Total human
Novel non-human
Novel human
Data-driven Research
Human vs Non-human
• Structure of human microbiome is bounded, approaching saturation
• Turning point of human samples is in 2012, due to HMP publication
Trends of novel samples in each sub-category
Data-driven Research
Our known microbiome data space
Human microbiome: with boundary
Environment microbiome: boundary unclear
Data-driven Research
Microbiome Attention Score (MAS)¹
MAS measures the attention of a microbiome in a database: higher MAS = higher attention
¹ Su et al., mBio, 2018
$$MAS = \sum_{i=1,\, i \neq m}^{10} Connectivity(m, i)$$
$$Connectivity(m, i) = \begin{cases} S_i, & \text{if } m \in \text{top 10 matches of } i \text{ and } S_i \geq 0.85 \\ 0, & \text{if } m \notin \text{top 10 matches of } i \end{cases}$$
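The connectivity rule can be sketched as follows. Two things are assumed here for illustration and are not stated on the slide: the sum runs over every other sample i whose search results are available, and S_i is the similarity reported for m in the result list of i. The type alias and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: for each sample id, the ranked (match id, similarity) search results.
TopMatches = Dict[str, List[Tuple[str, float]]]


def microbiome_attention_score(m: str, top_matches: TopMatches,
                               min_similarity: float = 0.85) -> float:
    """Sum Connectivity(m, i) over all samples i other than m."""
    score = 0.0
    for i, matches in top_matches.items():
        if i == m:
            continue
        # m contributes only if it is among the top 10 matches of i
        # with similarity of at least min_similarity.
        for match_id, similarity in matches[:10]:
            if match_id == m and similarity >= min_similarity:
                score += similarity
                break
    return score


# Toy search results with made-up similarities.
results = {
    "a": [("b", 0.90), ("c", 0.50)],
    "b": [("a", 0.95)],
    "c": [("a", 0.80), ("b", 0.90)],
}
for sample_id in results:
    print(sample_id, microbiome_attention_score(sample_id, results))
```

In this toy case "b" collects attention from both "a" and "c", while "a" is matched by "c" below the 0.85 cut-off and so gains nothing from it.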
The “80/20 Rule” principle:
Top 20% most frequently matched samples received “High Attention”
The MAS threshold of 14 is determined based on the top 20% MAS samples
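A cut-off of this kind can be derived from a list of scores by taking the lowest score inside the top 20%. The sketch below shows one way to do that; the function name and the toy scores are hypothetical, and the exact procedure behind the value 14 is not given on the slide.

```python
from typing import Sequence


def top_fraction_threshold(scores: Sequence[float], fraction: float = 0.2) -> float:
    """Return the smallest score that still falls in the top `fraction` of samples."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of samples in the top group
    return ranked[k - 1]


# Ten made-up MAS values: the top 20% are the two highest, 10 and 9.
print(top_fraction_threshold(list(range(1, 11))))  # prints 9
```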
Data-driven Research
Distribution of MAS of 101,983 samples from 2010-2017
Microbiome Focus Index (MFI)= f(MNS, MAS)
Habitat Rate
Lake 22.29%
Animal 22.25%
Marine 15.32%
Soil 14.52%
Human 9.47%
High-MFI (focus) microbiome:
a. Novel when born (MNS ≥ 0.15)
b. Attention afterwards (MAS ≥ 14)
2,298 focus samples from 101,983 samples
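The two focus criteria translate directly into a filter. The thresholds 0.15 and 14 come from the slides; the `Sample` class and function names are hypothetical, and since the slide only states MFI = f(MNS, MAS), the sketch applies the two conditions rather than a specific form of f.

```python
from dataclasses import dataclass
from typing import Iterable, List

MNS_BASELINE = 0.15   # novelty baseline (2010 mean MNS)
MAS_THRESHOLD = 14.0  # attention threshold (top 20% MAS)


@dataclass
class Sample:
    sample_id: str
    mns: float
    mas: float


def is_focus(sample: Sample) -> bool:
    """Focus sample: novel when born and attention afterwards."""
    return sample.mns >= MNS_BASELINE and sample.mas >= MAS_THRESHOLD


def focus_samples(samples: Iterable[Sample]) -> List[Sample]:
    return [s for s in samples if is_focus(s)]


# Share of focus samples reported on the slide.
print(f"{2298 / 101983:.2%}")  # prints 2.25%
```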
Data-driven Research
Prediction of high-MFI samples
Random Forest differentiates the focus samples by 4-year MAS: 98.8% accuracy
MAS development: wake-up in the first 4 years
Focus Microbiome (Beauties):
a. Novel when born: constant
b. Attention afterwards: variable (Sleeping Beauties)
Data-driven Research
Sleeping beauties: Marine & Indoor
Human microbiome: Mother-baby
$$MAS_{max} = \frac{\sum_{i=1}^{Y-2014} MAS_i \times \frac{RF_i}{Reg_i}}{\sum_{i=1}^{Y-2014} RF_i}$$
Prediction of high-MFI samples
A hybrid Regression-Random-Forest
Y is the sample's birth year
MAS_i is the i-th year's MAS
Reg_i is the i-th year's max-MAS-ratio
RF_i is the RF importance of the i-th year
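Read literally, the hybrid estimate is an importance-weighted average of per-year extrapolations MAS_i / Reg_i. A minimal sketch, assuming the three per-year series are already available as equal-length lists (how Reg_i and RF_i are fitted is not shown on the slide; the function name and toy numbers are hypothetical):

```python
from typing import Sequence


def predicted_max_mas(mas: Sequence[float], reg: Sequence[float],
                      rf: Sequence[float]) -> float:
    """MAS_max = sum(MAS_i * RF_i / Reg_i) / sum(RF_i) over the observed years."""
    if not mas or not (len(mas) == len(reg) == len(rf)):
        raise ValueError("mas, reg and rf must be non-empty and of equal length")
    numerator = sum(m * w / r for m, r, w in zip(mas, reg, rf))
    return numerator / sum(rf)


# Made-up values: year 1 reached half of the final MAS, year 2 all of it.
print(predicted_max_mas(mas=[2.0, 4.0], reg=[0.5, 1.0], rf=[1.0, 1.0]))  # prints 4.0
```

Years with higher Random Forest importance pull the estimate towards their own extrapolation.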
Data-driven Research
The treasure map of “Microbiome Data Space” by MSE
http://mse.ac.cn
Acknowledgement
• Bioinfo. Group, Single-Cell Center, QIBEBT-CAS
• D. McDonald, A. Gonzalez, J. Navas, Knight Lab, UCSD
Prof. Jian XU
SSC Director, QIBEBT-CAS
Prof. Rob KNIGHT
Knight Lab PI, UCSD
Gongchao JING
Assist. Prof.
Algorithm developer
Lu LIU
Post Doc.
Data manager
Zheng SUN
Post Doc.
Data analyst
Zengbin WANG
Technician
Web developer
Yufeng ZHANG
Graduate student
Algorithm developer
http://mse.ac.cn
MNS-based detection without marker: Is a microbiome healthy or not?
The MNS of unhealthy samples is significantly higher than that of healthy ones
IBD (inflammatory bowel disease)
HIV (AIDS)
CRC (colorectal cancer)
EDD (diarrheal dysentery)
Search-based diagnosis
• Baseline database: 15,704 healthy fecal samples from 56 studies
• Test dataset: 3,113 fecal samples from 9 studies, 5 statuses
MSE offers precise diagnosis for multiple diseases, AUC = 0.81
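For readers unfamiliar with AUC, one way to quantify how well a single score separates unhealthy from healthy samples is the fraction of (unhealthy, healthy) pairs in which the unhealthy sample scores higher. The sketch below computes that quantity; it is not necessarily how the 0.81 above was obtained, and the function name and scores are made up for illustration.

```python
from typing import Sequence


def auc(healthy_scores: Sequence[float], unhealthy_scores: Sequence[float]) -> float:
    """Probability that a random unhealthy sample outscores a random healthy one."""
    if not healthy_scores or not unhealthy_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for u in unhealthy_scores:
        for h in healthy_scores:
            if u > h:
                wins += 1.0
            elif u == h:
                wins += 0.5  # ties count as half a win
    return wins / (len(healthy_scores) * len(unhealthy_scores))


# Made-up MNS values against a healthy baseline database.
print(auc(healthy_scores=[0.1, 0.3], unhealthy_scores=[0.2, 0.4]))  # prints 0.75
```

An AUC of 0.5 corresponds to no separation and 1.0 to perfect separation.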
IBD (inflammatory bowel disease)
HIV (AIDS)
CRC (colorectal cancer)
EDD (diarrheal dysentery)
Search-based diagnosis
MNS-based detection without marker: Is a microbiome healthy or not?