scalable data mining for functional genomics and metagenomics
DESCRIPTION
Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower 01-06- 10 11. Harvard School of Public Health Department of Biostatistics. What tools enable biological discoveries?. Our job is to create computational microscopes:
![Page 1: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/1.jpg)
Scalable data mining for functional genomics and metagenomics
Curtis Huttenhower
01-06-1011
Harvard School of Public Health
Department of Biostatistics
![Page 2: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/2.jpg)
2
What tools enable biological discoveries?
Our job is to create computational microscopes:
To ask and answer specific biomedical questions using
millions of experimental results
![Page 3: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/3.jpg)
3
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Modeling microbial communities for public health
![Page 4: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/4.jpg)
4
A computational definition of functional genomics
Genomic data; Prior knowledge
Data ↓ Function
Function ↓ Function
Gene ↓ Gene
Gene ↓ Function
![Page 5: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/5.jpg)
5
A framework for functional genomics
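[Figure: data integration framework. Gold-standard gene pairs are marked + (G1-G2, G4-G9, …) or - (G3-G6, G7-G8, …); the pair G2-G5 is unknown (?). Each dataset assigns every pair a score (e.g. 0.9 0.7 … 0.1 0.2 … 0.8 in one dataset, 0.8 0.5 … 0.05 0.1 … 0.6 in another), binned as High/Low Similarity or High/Low Correlation. Frequency histograms are shown over the bins High/Low Correlation, Let./Not let., and Similar/Dissim. Scale: 100Ms of gene pairs × 1Ks of datasets.]
P(G2-G5|Data) = 0.85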
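The per-dataset frequency histograms and the final posterior P(G2-G5|Data) are consistent with a naive Bayes style integration, although the slide does not name the classifier. The following is a minimal sketch under that assumption: the function names, the binary "high"/"low" bins, the add-one smoothing, and the toy gold standard are all hypothetical and are not taken from the talk.

```python
from collections import Counter

def train(dataset_bins, labels):
    """Estimate P(bin | related) and P(bin | unrelated) for one dataset.

    dataset_bins: discretized scores (e.g. "high"/"low") for gold-standard pairs
    labels: True where the gold standard marks the pair as related (+)
    """
    bins = sorted(set(dataset_bins))
    counts = {True: Counter(), False: Counter()}
    for b, related in zip(dataset_bins, labels):
        counts[related][b] += 1
    model = {}
    for related in (True, False):
        # Add-one (Laplace) smoothing so unseen bins keep non-zero probability.
        total = sum(counts[related].values()) + len(bins)
        model[related] = {b: (counts[related][b] + 1) / total for b in bins}
    return model

def posterior(models, observed_bins, prior=0.5):
    """P(related | data), treating datasets as conditionally independent."""
    p_pos, p_neg = prior, 1.0 - prior
    for model, b in zip(models, observed_bins):
        p_pos *= model[True][b]
        p_neg *= model[False][b]
    return p_pos / (p_pos + p_neg)

# Toy gold standard: two related (+) and two unrelated (-) pairs.
labels = [True, True, False, False]
models = [
    train(["high", "high", "low", "low"], labels),  # hypothetical dataset 1
    train(["high", "low", "low", "low"], labels),   # hypothetical dataset 2
]
p_unknown = posterior(models, ["high", "high"])     # an unlabeled pair
```

With these toy inputs the unlabeled pair scores above 0.5 because it falls in the "high" bin of both datasets; the numbers are illustrative only.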
![Page 6: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/6.jpg)
6
Functional network prediction and analysis
Global interaction network
Carbon metabolism network; Extracellular signaling network; Gut community network
Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases
HEFalMp
![Page 7: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/7.jpg)
7
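Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006
Hibbs 2007
(Greek symbols were lost from the transcript; ρ, μ, Δ and ε below are assumed from standard Fisher z-transform and random-effects notation.)
$$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$$
Simple regression: All datasets are equally accurate
$$y_{e,i} = \mu_e + \varepsilon_{e,i}$$
Random effects: Variation within and among datasets and interactions
$$y_{e,i} = \mu_e + \Delta_{e,i} + \varepsilon_{e,i}$$
$$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}, \qquad w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\Delta}^2_e}$$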
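A random-effects combination of z-transformed correlations for a single gene pair can be sketched as follows. This is an illustration under stated assumptions, not the method from the talk: the between-dataset variance is estimated with the DerSimonian-Laird moment estimator, the weights are normalized to sum to one, and all function and variable names are hypothetical.

```python
import math

def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient in (-1, 1)."""
    return 0.5 * math.log((1 + rho) / (1 - rho))

def random_effects(ys, variances):
    """Combine per-dataset effects ys with within-dataset variances.

    Returns (pooled estimate, between-dataset variance estimate).
    """
    k = len(ys)
    w = [1.0 / v for v in variances]                 # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    # Cochran's Q measures heterogeneity among datasets.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))
    c = sw - sum(wi ** 2 for wi in w) / sw
    between = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights: 1 / (within variance + between variance).
    w_star = [1.0 / (v + between) for v in variances]
    pooled = sum(ws * yi for ws, yi in zip(w_star, ys)) / sum(w_star)
    return pooled, between

# Hypothetical correlations for one gene pair in three datasets of
# 20, 50 and 100 conditions; var(z) is approximated by 1 / (n - 3).
zs = [fisher_z(r) for r in (0.8, 0.3, 0.5)]
variances = [1.0 / (n - 3) for n in (20, 50, 100)]
pooled_z, between_var = random_effects(zs, variances)
```

When all datasets agree the between-dataset term shrinks to zero and the estimate reduces to the usual inverse-variance weighted mean.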
![Page 8: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/8.jpg)
8
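Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006
Hibbs 2007
$$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$$
(Fisher z-transform; ρ is assumed for a symbol lost from the transcript.)
+ =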
![Page 9: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/9.jpg)
9
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
![Page 10: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/10.jpg)
10
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
X?
![Page 11: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/11.jpg)
11
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Modeling microbial communities for public health
![Page 12: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/12.jpg)
12
What to do with your metagenome?
(x1010)
Diagnostic or prognostic biomarker for host disease
Public health tool monitoring population health and interactions
Comprehensive snapshot of microbial ecology and evolution
Reservoir of gene and protein functional information
Who’s there?
What are they doing?
What do functional genomic data tell us about microbiomes?
What can our microbiomes tell us about us?*
*Using terabases of sequence and thousands of experimental results
![Page 13: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/13.jpg)
13
The Human Microbiome Project
2007 - ongoing
• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
  • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
  • Skin: ears, inner elbows
  • Nasal cavity
  • Gut: stool
  • Vagina: introitus, mid, fornix
• Reference genomes (~200+800)
All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic-resistant infection…
Hamady, 2009
Kolenbrander, 2010
![Page 14: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/14.jpg)
14
HMP Organisms: Everyone and everywhere is different
[Figure: heatmap of organisms (taxa) by body sites + individuals. Body sites: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue]
Every microbiome is surprisingly different
Most organisms are rare in most places
Even common organisms vary tremendously in abundance
among individuals
Aerobicity, interaction with the immune system, and
extracellular medium appear to be major determinants
There are few, if any, organismal biotypes
in health
![Page 15: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/15.jpg)
15
HMP: Metabolic reconstruction
WGS reads → Genes (KOs) → Pathways/modules (KEGGs)
300 subjects, 1-3 visits/subject, ~6 body sites/visit
10-200M reads/sample, 100bp reads
BLAST ?
Functional seq.: KEGG + MetaCyc; CAZy, TCDB, VFDB, MEROPS…
• BLAST → Genes: [Equation not recoverable from the transcript: gene abundance c(g) computed from read alignments, with terms in (1 − p) and gene length |g|]
• Genes → Pathways: MinPath (Ye 2009)
• Taxonomic limitation: Rem. paths in taxa < ave.
• Smoothing: Witten-Bell
$$c(g) = \begin{cases} c(g)\,\dfrac{N}{N+T} & c(g) > 0 \\[1ex] \dfrac{T}{V-T}\cdot\dfrac{N}{N+T} & \text{otherwise} \end{cases}$$
• Gap filling: c(g) = max( c(g), median )
• Xipe: Distinguish zero/low (Rodriguez-Mueller in review)
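The smoothing and gap-filling steps can be sketched in a few lines. This is an illustration only: it assumes the usual Witten-Bell reading of the symbols (N is the total observed abundance, T the number of gene families with a non-zero count, V the number of candidate gene families), and it assumes the gap-filling median is taken over the genes of one pathway. Neither definition is spelled out on the slide, and the function names and KO identifiers are hypothetical.

```python
from statistics import median

def witten_bell(counts):
    """Witten-Bell smoothing of gene-family counts.

    counts maps every candidate gene family (V entries) to its raw
    abundance; unobserved families must be present with a count of 0.
    """
    n = sum(counts.values())                      # N: total observed abundance
    t = sum(1 for c in counts.values() if c > 0)  # T: families actually seen
    v = len(counts)                               # V: all candidate families
    if n == 0:
        return dict(counts)                       # nothing observed to redistribute
    scale = n / (n + t)
    unseen = v - t
    smoothed = {}
    for gene, c in counts.items():
        if c > 0:
            smoothed[gene] = c * scale            # discount observed families
        else:
            smoothed[gene] = (t / unseen) * scale # share mass among unseen ones
    return smoothed

def gap_fill(pathway_counts):
    """Raise each gene in a pathway to at least the pathway median."""
    med = median(pathway_counts.values())
    return {g: max(c, med) for g, c in pathway_counts.items()}

smoothed = witten_bell({"K1": 6, "K2": 2, "K3": 0, "K4": 0})
filled = gap_fill({"K1": 10, "K2": 8, "K3": 0})
```

The smoothing preserves the total abundance N: the mass removed from observed families is exactly what is distributed among the unobserved ones.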
![Page 16: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/16.jpg)
16
HMP: Metabolic reconstruction
Pathway coverage Pathway abundance
![Page 17: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/17.jpg)
17
HMP: Metabolic reconstruction
Pathway abundance
[Figure: heatmap of pathways by samples]
![Page 18: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/18.jpg)
18
HMP: Metabolic reconstruction
Pathway coverage
[Figure: heatmap of pathways by samples]
Aerobic body sites
Gastrointestinal body sites
All body sites (“core”)
![Page 19: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/19.jpg)
19
Metagenomic biomarker discovery
Gene expression
SNP genotypes
Healthy/IBD
BMI
Diet
Taxa & pathways
Batch effects? Population structure?
Niches & Phylogeny
Test for correlates
Multiple hypothesis correction
Feature selection
p >> n
Confounds/stratification/environment
Cross-validate
Biological story?
Independent sample
Intervention/perturbation
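The multiple hypothesis correction step matters because p >> n: thousands of taxa and pathways are tested against a handful of phenotypes. The slide does not name a procedure, so the sketch below assumes Benjamini-Hochberg false discovery rate control as one common choice; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from testing four features against one phenotype.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in qvalues]
```

Features that survive correction would still need the later steps on the slide, such as cross-validation and an independent sample.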
![Page 20: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/20.jpg)
20
LEfSe: Metagenomic class comparison and explanation
LEfSe
http://huttenhower.sph.harvard.edu/lefse
Nicola Segata
LDA + Effect Size
![Page 21: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/21.jpg)
21
LEfSe: The TRUC murine colitis microbiota
With Wendy Garrett
![Page 22: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/22.jpg)
22
MetaHIT: The gut microbiome and IBD
WGS reads
Pathways/modules
124 subjects: 99 healthy, 21 UC + 4 CD
ReBLASTed against KEGG since published data obfuscates read counts
Taxa
Phymm (Brady 2009)
Genes(KOs)
Pathways(KEGGs)
Qin 2010
With Ramnik Xavier, Joshua Korzenik
![Page 23: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/23.jpg)
23
MetaHIT: Taxonomic CD biomarkers
Firmicutes
Enterobacteriaceae
Up in CD
Down in CD
UC
![Page 24: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/24.jpg)
24
MetaHIT: Functional CD biomarkers
Motility Transporters Sugar metabolism
Down in CD
Up in CD
Subset of enriched modules in CD patients
Subset of enriched pathways in CD patients
Growth/replication
![Page 25: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/25.jpg)
25
MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome
Transporters
Growth/replication
Motility
Sugar metabolism
Down in CD
Up in CD
Inferred metabolites
Enzyme families
![Page 26: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/26.jpg)
26
Outline
1. Data mining: Integrating very large genomic data compendia
• Network framework for scalable data integration
• HEFalMp: human data integration
• Meta-analysis for unsupervised functional network integration
2. Metagenomics: Modeling microbial communities for public health
• HMP: microbiome in health, 18 body sites in 300 subjects
• HUMAnN: metagenomic metabolic and functional pathway reconstruction
• LEfSe: biologically relevant community differences
![Page 27: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/27.jpg)
27
Thanks!
Jacques Izard, Wendy Garrett
Pinaki Sarder, Nicola Segata
Levi Waldron, Larisa Miropolsky
http://huttenhower.sph.harvard.edu
Interested? We’re recruiting students and postdocs!
Human Microbiome Project
HMP Metabolic Reconstruction
George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor
Sahar Abubucker, Yuzhen Ye
Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng
Mathangi Thiagarajan, Brandi Cantarel
Maria Rivera, Barbara Methe
Bill Klimke, Daniel Haft
Ramnik Xavier, Dirk Gevers
Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi
Sarah Fortune
http://huttenhower.sph.harvard.edu/sleipnir
![Page 28: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/28.jpg)
![Page 29: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/29.jpg)
29
Functional network prediction from diverse microbial data
486 bacterial expression
experiments
876 raw datasets
310 postprocessed
datasets
304 normalized coexpression networks
in 27 species
Integrated functional interaction networks
in 15 species
307 bacterial interaction
experiments
154796 raw interactions
114786 postprocessed
interactions
E. coli integration
← Precision ↑, Recall ↓
![Page 30: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/30.jpg)
30
Predicting gene function
Cell cycle genes
Predicted relationships between genes (legend: High Confidence to Low Confidence)
![Page 31: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/31.jpg)
31
Predicting gene function
Predicted relationships between genes (legend: High Confidence to Low Confidence)
Cell cycle genes
![Page 32: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/32.jpg)
32
Predicting gene function
Cell cycle genes
Predicted relationships between genes (legend: High Confidence to Low Confidence)
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
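One way to turn those edges into a per-gene score is to compare a gene's average connection to known process members with its average connection to the rest of the genome. The ratio below is a hypothetical illustration of that idea, not the scoring function used in the talk; the network representation, gene names, and edge confidences are all made up.

```python
def edge(network, a, b):
    """Confidence of the predicted relationship between genes a and b."""
    return network.get(frozenset((a, b)), 0.0)

def process_specificity(network, gene, process_genes, all_genes):
    """Mean edge to the process divided by mean edge to all other genes.

    Values above 1 suggest the gene is connected to the process more
    strongly than to the genomic background.
    """
    members = [g for g in process_genes if g != gene]
    others = [g for g in all_genes if g != gene]
    if not members or not others:
        return 0.0
    to_process = sum(edge(network, gene, g) for g in members) / len(members)
    to_all = sum(edge(network, gene, g) for g in others) / len(others)
    return to_process / to_all if to_all > 0 else 0.0

# Toy network: A and B are known cell cycle genes, C and D are candidates.
network = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.9,
    frozenset(("B", "C")): 0.7,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "D")): 0.1,
    frozenset(("B", "D")): 0.1,
}
genes = ["A", "B", "C", "D"]
score_c = process_specificity(network, "C", ["A", "B"], genes)
score_d = process_specificity(network, "D", ["A", "B"], genes)
```

In this toy example C scores above 1 and D below 1, so C would be the stronger candidate for follow-up.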
![Page 33: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/33.jpg)
33
Comprehensive validation of computational predictions
Genomic data
Prior knowledge
Computational Predictions of Gene Function
MEFIT
SPELL (Hibbs et al 2007)
bioPIXIE (Myers et al 2005)
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory Experiments
Petite frequency
Growth curves
Confocal microscopy
New known functions for correctly predicted genes
Retraining
With David Hess, Amy Caudy
![Page 34: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/34.jpg)
34
Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis
106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
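The total is the sum of the four categories on this slide, and the ">3x" figure is its ratio to the original annotations:

$$106 + 135 + 82 + 17 = 340, \qquad \frac{340}{106} \approx 3.2$$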
![Page 35: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/35.jpg)
35
Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis
106 Original GO Annotations
95 Under-annotations
40 Confirmed Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be
accurate despite incomplete or misleading gold standards, and they
continue to improve as additional data are incorporated.
![Page 36: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/36.jpg)
36
Functional mapping: mining integrated networks
Predicted relationships between genes (legend: High Confidence to Low Confidence)
The strength of these relationships indicates how cohesive a process is.
Chemotaxis
![Page 37: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/37.jpg)
37
Functional mapping: mining integrated networks
Predicted relationships between genes (legend: High Confidence to Low Confidence)
Chemotaxis
![Page 38: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/38.jpg)
38
Functional mapping: mining integrated networks
Flagellar assembly
The strength of these relationships indicates how associated two processes are.
Predicted relationships between genes (legend: High Confidence to Low Confidence)
Chemotaxis
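Cohesiveness within a process and association between two processes can both be read off the same network by averaging edge confidences and comparing them with a genome-wide baseline. The sketch below assumes a simple ratio of means; the talk does not give the exact statistic, and the helper names, gene names, and weights are hypothetical.

```python
from itertools import combinations, product

def make_network(edges):
    """Build an undirected network from (gene_a, gene_b, confidence) triples."""
    return {frozenset((a, b)): w for a, b, w in edges}

def mean_weight(network, pairs):
    """Average confidence over gene pairs; missing edges count as 0."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def cohesiveness(network, process, all_genes):
    """Mean within-process edge relative to the genomic background."""
    baseline = mean_weight(network, combinations(sorted(all_genes), 2))
    within = mean_weight(network, combinations(sorted(process), 2))
    return within / baseline if baseline > 0 else 0.0

def association(network, proc_a, proc_b, all_genes):
    """Mean between-process edge relative to the genomic background."""
    baseline = mean_weight(network, combinations(sorted(all_genes), 2))
    between = mean_weight(
        network,
        ((a, b) for a, b in product(sorted(proc_a), sorted(proc_b)) if a != b),
    )
    return between / baseline if baseline > 0 else 0.0

network = make_network([
    ("c1", "c2", 0.9), ("f1", "f2", 0.8),
    ("c1", "f1", 0.6), ("c1", "f2", 0.5), ("c2", "f1", 0.7), ("c2", "f2", 0.6),
    ("x1", "c1", 0.1), ("x1", "c2", 0.1), ("x1", "f1", 0.1), ("x1", "f2", 0.1),
])
genes = ["c1", "c2", "f1", "f2", "x1"]
chemotaxis, flagellar = ["c1", "c2"], ["f1", "f2"]
chemotaxis_cohesion = cohesiveness(network, chemotaxis, genes)
chemotaxis_flagellar = association(network, chemotaxis, flagellar, genes)
```

A ratio near 1 corresponds to the genomic background; values well above 1 mark a cohesive process or a strongly associated pair of processes.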
![Page 39: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/39.jpg)
39
Functional mapping: Associations among processes
Edges: Associations between processes (Very Strong to Moderately Strong)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
![Page 40: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/40.jpg)
40
Functional mapping: Associations among processes
Edges: Associations between processes (Very Strong to Moderately Strong)
Borders: Data coverage of processes (Well Covered to Sparsely Covered)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
![Page 41: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/41.jpg)
41
Functional mapping: Associations among processes
Edges: Associations between processes (Very Strong to Moderately Strong)
Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
Borders: Data coverage of processes (Well Covered to Sparsely Covered)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
![Page 42: Scalable data mining for functional genomics and metagenomics](https://reader035.vdocuments.site/reader035/viewer/2022062323/56816262550346895dd2c0fc/html5/thumbnails/42.jpg)
42
Functional mapping: Associations among processes
Edges: Associations between processes (Very Strong to Moderately Strong)
Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
Borders: Data coverage of processes (Well Covered to Sparsely Covered)