Scalable data mining for functional genomics and metagenomics
Curtis Huttenhower
12-02-10
Harvard School of Public Health
Department of Biostatistics
2
What tools enable biological discoveries?
Our job is to create computational microscopes:
To ask and answer specific biomedical questions using
millions of experimental results
3
Outline
1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia
4
What’s metagenomics?
Total collection of microorganisms within a community
Also microbial community or microbiota
Total genomic potential of a microbial community
Total biomolecular repertoire of a microbial community
Study of uncultured microorganisms from the environment, which can include
humans or other living hosts
5
What to do with your metagenome?
(x10^10)
Diagnostic or prognostic
biomarker for host disease
Public health tool monitoring
population health and interactions
Comprehensive snapshot of
microbial ecology and evolution
Reservoir of gene and protein functional information
Who’s there?
What are they doing?
What do functional genomic data tell us about microbiomes?
What can our microbiomes tell us about us?*
*Using terabases of sequence and thousands of experimental results
6
The Human Microbiome Project
2007 - ongoing
• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
• Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
• Skin: ears, inner elbows
• Nasal cavity
• Gut: stool
• Vagina: introitus, mid, fornix
• Reference genomes (~200+800)
All healthy subjects; follow-up projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic-resistant infection…
Hamady, 2009
Kolenbrander, 2010
7
Information provided by metagenomic assays
16S reads
WGS reads
Taxa
Orthologous clusters
Pathways/modules
Functional roles
Pathway activity
Genomic data (Reference genomes)
Functional data (Experimental models)
Binning
Clustering
Microbiome data
8
HMP: Data features
16S reads
Orthologous clusters
Pathways/modules
Taxa
Genes (KOs)
Pathways (KEGGs)
9
HMP Organisms: Everyone and everywhere is different
[Figure axes: ← Body sites + individuals →, ← Organisms (taxa) →. Site labels: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue]
Every microbiome is surprisingly different
Most organisms are rare in most places
Even common organisms vary tremendously in abundance
among individuals
Aerobicity, interaction with the immune system, and
extracellular medium appear to be major determinants
There are few, if any, organismal biotypes
in health
10
HMP: Metabolic reconstruction
WGS reads
Pathways/modules
Genes (KOs)
Pathways (KEGGs)
Functional seq.: KEGG + MetaCyc, CAZy, TCDB, VFDB, MEROPS…
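BLAST → Genes
[Equation garbled in extraction; surviving symbols: c(g), |g|, p, a, r.]
Genes → Pathways: MinPath (Ye 2009)
Smoothing: Witten-Bell
$$c(g) = \begin{cases} \dfrac{T}{V-T}\cdot\dfrac{N}{N+T} & \text{if } c(g) = 0 \\[4pt] c(g)\,\dfrac{N}{N+T} & \text{otherwise} \end{cases}$$
Gap filling: c(g) = max( c(g), median )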
300 subjects
1-3 visits/subject
~6 body sites/visit
10-200M reads/sample
100bp reads
BLAST
Taxonomic limitation: Rem. paths in taxa < ave.
Xipe: Distinguish zero/low (Rodriguez-Mueller in review)
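The smoothing and gap-filling steps named on this slide can be sketched in a few lines. This is a minimal illustration, not the HUMAnN implementation: it assumes the common count-rescaling form of Witten-Bell smoothing (N = total count, T = number of genes observed, V = number of genes considered) and assumes the "median" in the gap-filling rule is taken over the genes of one pathway. The function names `witten_bell_smooth` and `gap_fill` are hypothetical.

```python
from statistics import median


def witten_bell_smooth(counts):
    """Witten-Bell smoothing of gene counts (assumed count-rescaling form).

    counts maps gene -> observed count, with 0 for unseen genes.
    Seen genes are scaled by N / (N + T); the freed mass is shared
    equally among the V - T unseen genes, so the total stays N.
    """
    n_total = sum(counts.values())
    t_seen = sum(1 for c in counts.values() if c > 0)
    v_all = len(counts)
    if n_total == 0 or t_seen == v_all:
        return dict(counts)  # nothing observed, or nothing unseen
    scale = n_total / (n_total + t_seen)
    unseen = (t_seen / (v_all - t_seen)) * scale
    return {g: (c * scale if c > 0 else unseen) for g, c in counts.items()}


def gap_fill(pathway_genes, counts):
    """Apply c(g) = max(c(g), median) within one pathway (assumed scope)."""
    med = median(counts[g] for g in pathway_genes)
    return {g: max(counts[g], med) for g in pathway_genes}


# Hypothetical KO counts for illustration only.
example = {"K1": 6, "K2": 2, "K3": 0, "K4": 0}
smoothed = witten_bell_smooth(example)
filled = gap_fill(["K1", "K2", "K3"], smoothed)
```

Smoothing keeps the total count unchanged while giving unseen genes a small nonzero value; gap filling then lifts unusually low genes in a pathway up to that pathway's median.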
11
HMP: Metabolic reconstruction
Pathway coverage Pathway abundance
12
HUMAnN: Evaluation on synthetic metagenomes
High complexity, staggered, ≤90% identity
LC, stg.
13
HMP: Metabolic reconstruction
Pathway abundance
[Figure axes: ← Samples →, ← Pathways →]
14
HMP: Metabolic reconstruction
Pathway coverage
[Figure axes: ← Samples →, ← Pathways →]
Aerobic body sites
Gastrointestinal body sites
All body sites (“core”)
15
HMP: MetaCyc Coverage + Abundance
16
HMP: Metabolism, host-microbiome interactions, and microbial taxa
>3200 gene families differential in the mucosa
>1500 upregulated outside the mucosa and not in any Actinobacterial genome
16S
WGS
17
Outline
1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia
18
~2000
AML/ALL, Survival, Mutation
Gene expression
Batch effects
Functional modules
19
~2005
Healthy/Diabetes, BMI, M/F
SNP genotypes
Population structure
LD
20
2010
Healthy/IBD, Temperature, Location
Taxa & Orthologs
???
Niches & Phylogeny
Test for correlates
Multiple hypothesis correction
Feature selection
p >> n
Confounds/stratification/environment
Cross-validate
Biological story?
Independent sample
Intervention/perturbation
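The "multiple hypothesis correction" step matters because p >> n: with thousands of taxa or orthologs tested per cohort, some will look significant by chance. The slide does not name a procedure; the sketch below uses Benjamini-Hochberg false discovery rate adjustment as one common choice, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-feature p-values, e.g. one per taxon.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [i for i, q in enumerate(qvalues) if q <= 0.05]
```

Features would then be reported only if their adjusted value falls under the chosen false discovery rate, before any cross-validation or follow-up in an independent sample.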
21
LEfSe: Metagenomic class comparison and explanation
LEfSe
Coming soon to a URL near you!
Nicola Segata
LDA + Effect Size
22
LEfSe: Evaluation on synthetic data
23
LEfSe: The TRUC murine colitis microbiota
With Wendy Garrett
24
MetaHIT: The gut microbiome and IBD
WGS reads
Pathways/modules
124 subjects: 99 healthy, 21 UC + 4 CD
ReBLASTed against KEGG since published data
obfuscates read counts
Taxa
Phymm (Brady 2009)
Genes (KOs)
Pathways (KEGGs)
Qin 2010
With Ramnik Xavier, Joshua Korzenik
25
MetaHIT: Taxonomic CD biomarkers
Firmicutes
Enterobacteriaceae
Up in CD
Down in CD
UC
26
MetaHIT: Functional CD biomarkers
Motility Transporters Sugar metabolism
Down in CD
Up in CD
Subset of enriched modules in CD patients
Subset of enriched pathways in CD patients
Growth/replication
27
MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome
Transporters
Growth/replication
Motility
Sugar metabolism
Down in CD
Up in CD
Inferred metabolites
Enzyme families
28
Outline
1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia
29
A computational definition of functional genomics
Genomic data Prior knowledge
Data → Function
Function → Function
Gene → Gene
Gene → Function
30
A framework for functional genomics
Gene pairs: G1 G2 (+), G4 G9 (+), …; G3 G6 (-), G7 G8 (-), …; G2 G5 (?)
Data values per gene pair: 0.9 0.7 … 0.1 0.2 … 0.8
Labels per gene pair: + - … - - … +
Data values per gene pair: 0.8 0.5 … 0.05 0.1 … 0.6
[Figure: frequency histograms over High Correlation/Low Correlation, High Similarity/Low Similarity, Let./Not let., Similar/Dissim.]
P(G2-G5|Data) = 0.85
[Figure axes: 100Ms gene pairs →, ← 1Ks datasets]
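One way to read this slide is as a per-dataset classifier: labeled gene pairs (+/-) give a frequency histogram of data values for each class, and those histograms are combined across datasets into a posterior such as P(G2-G5|Data). The sketch below assumes a naive Bayes combination with a single shared bin edge and add-one smoothing; the slide does not state the exact model, and the names `bin_index`, `learn_bins`, and `posterior` are hypothetical.

```python
def bin_index(value, edges):
    """Index of the bin that value falls into, given sorted bin edges."""
    return sum(1 for e in edges if value >= e)


def learn_bins(values, labels, edges):
    """Estimate P(bin | label) for one dataset with add-one smoothing."""
    nbins = len(edges) + 1
    counts = {True: [1] * nbins, False: [1] * nbins}
    for v, lab in zip(values, labels):
        counts[lab][bin_index(v, edges)] += 1
    return {lab: [c / sum(cs) for c in cs] for lab, cs in counts.items()}


def posterior(pair_values, models, edges, prior=0.5):
    """P(related | data) for one gene pair, naive Bayes over datasets."""
    p_pos, p_neg = prior, 1.0 - prior
    for v, model in zip(pair_values, models):
        if v is None:  # this dataset has no measurement for the pair
            continue
        b = bin_index(v, edges)
        p_pos *= model[True][b]
        p_neg *= model[False][b]
    return p_pos / (p_pos + p_neg)


# Illustrative training pairs: two related (+) and two unrelated (-).
edges = [0.5]
labels = [True, True, False, False]
models = [
    learn_bins([0.9, 0.7, 0.1, 0.2], labels, edges),
    learn_bins([0.8, 0.5, 0.05, 0.1], labels, edges),
]
p_related = posterior([0.8, 0.6], models, edges)
```

Because each dataset contributes only a table lookup per gene pair, this kind of model scales to the hundreds of millions of pairs and thousands of datasets the slide mentions.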
31
Functional network prediction and analysis
Global interaction network
Carbon metabolism network Extracellular signaling network Gut community network
Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases
HEFalMp
32
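Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006, Hibbs 2007
[Equations garbled in extraction. Recoverable pieces: a z-transform of the form z = (1/2) log((1 + x)/(1 - x)); models for per-dataset observations y_{e,i}; a pooled estimate formed as a weighted sum of y_{e,i} with weights w*_{e,i}; and weights of the form 1/(s_{e,i}^2 + a second squared, estimated term).]
Simple regression: All datasets are equally accurate
Random effects: Variation within and among datasets and interactions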
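The contrast between the two models on this slide can be made concrete with a small sketch. It assumes correlations are first stabilized with the Fisher z-transform and that the random-effects estimate uses the DerSimonian-Laird estimator of between-dataset variance; the slide's equations did not survive extraction, so both choices are assumptions, and the function names are hypothetical.

```python
import math


def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient in (-1, 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def random_effects(estimates, variances):
    """Pooled estimate under a random-effects model (DerSimonian-Laird).

    estimates: one effect per dataset (e.g. a z-transformed correlation).
    variances: within-dataset variance of each estimate.
    Returns (pooled estimate, between-dataset variance tau^2).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    # Cochran's Q measures disagreement among datasets.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Each weight now reflects within- plus between-dataset variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    return pooled, tau2


# Hypothetical correlations for one gene pair in three datasets,
# each from n = 23 conditions, so var(z) is about 1 / (n - 3).
zs = [fisher_z(r) for r in (0.8, 0.5, 0.1)]
pooled_z, tau2 = random_effects(zs, [1.0 / 20] * 3)
```

When tau^2 is zero the weights reduce to plain inverse-variance weights, which corresponds to treating every dataset as equally trustworthy apart from its sampling noise.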
33
Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006, Hibbs 2007
[Equation garbled in extraction: a z-transform of the form z = (1/2) log((1 + x)/(1 - x))]
34
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
35
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
X?
36
Sleipnir: Software for scalable functional genomics
Massive datasets require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
• Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
• Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
37
Outline
1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia
• Metagenomics: structure and function of microbial communities
• HMP: microbiome in health, 18 body sites in 300 subjects
• HUMAnN: metagenomic metabolic and functional pathway reconstruction
• Network framework for scalable data integration
• HEFalMp: human data integration
• Meta-analysis for unsupervised functional network integration
• LEfSe: biologically relevant community differences
• Iron and sugar transport as key players in the IBD microbiota
• Sleipnir: software for scalable genomic data mining
38
Thanks!
Jacques Izard, Wendy Garrett
Pinaki Sarder, Nicola Segata
Levi Waldron, Larisa Miropolsky
http://huttenhower.sph.harvard.edu
Interested? We’re recruiting students and postdocs!
Human Microbiome Project
HMP Metabolic Reconstruction
George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor
Sahar Abubucker, Yuzhen Ye
Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng
Mathangi Thiagarajan, Brandi Cantarel
Maria Rivera, Barbara Methe
Bill Klimke, Daniel Haft
Ramnik Xavier, Dirk Gevers
Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi
Sarah Fortune
http://huttenhower.sph.harvard.edu/sleipnir