Computational metagenome analysis
1. Context: Personal views on how the computational infrastructure might evolve
1.1. Soon, most data might NOT be stored in central databases (see physics)
1.2. Soon, PIs' computing resources might become insufficient
1.3. Soon, analysis tools might be too complex for local use
2. Current situation of basic metagenome analyses landscape
3. Potential and challenges
Metagenome analysis team, since Mar 04, aiming at discovery, mostly working on method development though
Peer Bork, EMBL-Heidelberg
Data analysis: the signs before the flood
[Figure, log scale: microbial genomes published per year, 1998 to 2007; animal genomes (>100 Mb, published, >95% coverage)]
Metagenomics (>50 Mb, not focussed, non-16S, published, deposited), timeline 2004 to 2007, in order of appearance: Acid mine drainage; Sargasso Sea; Deep-sea whale bones; Farm soil; N-Pacific subtropical gyre; Mammoth bones (454); Human gut; Soudan mine (454); Sludge; Global Ocean Survey; Mouse gut
Accelerated exponential increase of ORF numbers (100-fold decrease in sequencing costs not visible yet)
MareNostrum supercomputer, Barcelona
Standard analysis: ca. 14 million ORFs, all-against-all BLAST
12,000 days on one CPU (ca. 5 days real time at BSC)
9 TB output; data transfer to EMBL took 2 weeks
Ongoing post-analysis…
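To put the MareNostrum figures in perspective, here is a small back-of-the-envelope calculation. It only uses the numbers quoted on the slide; the conversions (1 TB taken as 10^12 bytes, all-against-all counted as ordered pairs including self-comparisons) are assumptions made for illustration.

```python
# Figures quoted on the slide
n_orfs = 14_000_000      # ca. 14 million ORFs
cpu_days = 12_000        # total compute, expressed as days on one CPU
wall_days = 5            # ca. 5 days real time at BSC
output_tb = 9            # size of the BLAST output
transfer_days = 14       # data transfer to EMBL took 2 weeks

# Assumption: all-against-all means every ordered pair, self-hits included
ordered_pairs = n_orfs ** 2

# Average number of CPUs that must have been busy in parallel
effective_cpus = cpu_days / wall_days

# Assumption: 1 TB = 10**12 bytes; sustained rate over the whole two weeks
transfer_mb_per_s = output_tb * 1e12 / (transfer_days * 86_400) / 1e6

print(f"pairwise comparisons: {ordered_pairs:.2e}")
print(f"effective CPUs:       {effective_cpus:.0f}")
print(f"transfer rate:        {transfer_mb_per_s:.2f} MB/s")
```

Under these assumptions the job kept roughly 2,400 CPUs busy, and the transfer back to EMBL averaged only about 7 MB/s, which is why moving the results took longer than computing them.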
Taxonomic census of diverse environments
[Figure labels: Eukarya, Archaea, Bacteria]
> more quantitative than 16S rRNA profiling
> reveals novelty, e.g.
  - stable habitat preferences
  - water faster than soil
> analysis of 1 marker can take hours on a single CPU
Von Mering et al., Science 315(07) 1126
Great potential… but…
2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
2.2. Few basic analyses options (e.g. web, downloads)
2.3. Analyses driven by individual PI's research focus and expertise
2.4. Missing standards even for basic steps of annotation
2.5. Difficulties for comparative meta-analyses
Pub. Yr. | Environment (Location) | # ORFs (Mbp) | # Novel ORFs | Not assigned (%) | ORF-calling procedure (DB searched) | Functional annotation procedure (DB searched)
2004 | Acid Mine (California) | 46,862 (76) | 34,301 | 73.2% | FGENESB pipeline (nr) | blastp (COG, nr)
2004 | Surface Sea Water (Sargasso Sea, samples 1-4) | 1,001,987 (779) | 649,608 | 64.8% | Evidence-based, using translation start & stop sites (bacterial portion of nraa) | blast (TIGR Role Category)
2005 | Deep-Sea Whalefall (Pacific, Antarctic) | 122,147 (75) | 63,021 | 51.6% | FGENESB pipeline (nraa) | blastp (extCOG v. 6, KEGG)
2005 | Farm Soil (Minnesota) | 183,536 (100) | 114,301 | 62.3% | FGENESB pipeline (nraa) | blastp (extCOG v. 6, KEGG)
2006 | Subtropical Ocean Gyre (North Pacific) | N/A* (64) | - | - | ORF-calling not performed | blastx, blastn (KEGG, SEED, COG, Sargasso dataset)
2006 | Human Gut | 50,164 (78) | 34,504 | 68.8% | Evidence-based, start & stop sites (in-house non-redundant protein repository**) | blast (COG, KEGG, STRING)
2006 | Wastewater Sludge (US, Australia) | 65,328 (176) | 47,032 | 72.0% | FGENESB pipeline (nraa) | IMG/M pipeline (KEGG)
2006 | Mouse Gut | 134,189 (160) | 100,599 | 75.0% | Glimmer software tool v. 3.01 (InterPro) | blastx (nr, extCOG v. 6.3, KEGG)
2007 | Global Ocean Sampling | 6,123,395 (6250) | N/A | N/A | Using special pipeline | blastp (various DBs), HMMs (TIGRFAM)
8 studies
4 assembly pipelines
6 ORF-calling procedures
8 function prediction protocols
8 parameter settings
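As a quick consistency check on the table above, the quoted percentages can be recomputed from the two ORF counts in each row. The sketch below assumes that each percentage is simply the second count divided by the total number of ORFs; the short environment keys are labels chosen here for illustration.

```python
# (environment, total ORFs, count from the second ORF column, quoted percentage)
rows = [
    ("Acid Mine",         46_862,    34_301,  73.2),
    ("Sargasso Sea",      1_001_987, 649_608, 64.8),
    ("Deep-Sea Whalefall", 122_147,  63_021,  51.6),
    ("Farm Soil",         183_536,   114_301, 62.3),
    ("Human Gut",         50_164,    34_504,  68.8),
    ("Wastewater Sludge", 65_328,    47_032,  72.0),
    ("Mouse Gut",         134_189,   100_599, 75.0),
]

# Recompute each percentage to one decimal place
recomputed = {env: round(100 * part / total, 1) for env, total, part, _ in rows}

# Flag rows whose quoted value differs by more than rounding error
mismatches = [
    env for env, total, part, quoted in rows
    if abs(100 * part / total - quoted) > 0.05
]

for env, value in recomputed.items():
    print(f"{env:20s} {value:5.1f}%")
print("mismatches:", mismatches)
```

All seven rows with counts reproduce the quoted values. The percentages are therefore internally consistent, even though the underlying ORF-calling and annotation procedures differ from study to study.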
Increase of functional assignments (via orthologous groups) with coverage
(out of a total of 20,334 OGs) … [OGs in STRING now at 40,000]
[Figure, y-axis: orthologous groups (COGs + NOGs)]
Tringe*, von Mering* … Bork, Hugenholtz, Rubin, Science 308(05) 554
Reason for differences
Technical issues: sampling + preparation; coverage; sequencing method; assembly + annotation
Biological issues: phylogeny; functionality; evenness/richness; genome sizes; evolutionary speed; GC content
…
3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
3.3. High demands for integration (e.g. temporal, spatial, and context information)
As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.
The Unknown – by Donald Rumsfeld
CHARACTERIZED UNCHARACTERIZED UNCHARACTERIZABLE
Feb. 12, 2002, Department of Defense news briefing
Protein function prediction in metagenomics samples
Overall function predictions for >70% of environmental data!
Methods: Blast; Neighborhood (taken from the STRING resource)
[Figure labels: CHARACTERIZED; UNCHARACTERIZED; UNCHARACTERIZABLE; next to CHARACTERIZED; next to UNCHARACTERIZED]
Mining for novelty in environmental data
Homology-based: Novel antibiotics biosynthesis enzyme families (PKS1)
Neighbourhood-based: Coupling of fatty acid biosynthesis and degradation via new transcription regulator
More on the unknown …
Function prediction in gene families of 1.5 million proteins from 4 environments
All-against-all, MCL clustering (60 bits, inflation factor 1.1)
Our functional knowledge: glass half full or half empty?
Our knowledge concentrates in large, well-established families contributing 65% of the ORFs. However, many specialized functions in small gene families are to be discovered
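The family-building step on this slide (all-against-all comparison, then MCL clustering) can be sketched as follows. This is a minimal dense NumPy illustration of the Markov clustering idea, not the MCL program used for the actual analysis. It assumes that "60 bits" denotes a bit-score cutoff below which hits are discarded, and the protein names and scores are hypothetical.

```python
import numpy as np

def mcl(adjacency, inflation=1.1, max_iter=200, tol=1e-9):
    """Minimal Markov clustering on a small, dense, symmetric adjacency matrix."""
    m = np.asarray(adjacency, dtype=float) + np.eye(len(adjacency))  # add self-loops
    m /= m.sum(axis=0, keepdims=True)            # make columns sum to 1
    for _ in range(max_iter):
        prev = m
        m = m @ m                                # expansion: two-step random walks
        m = np.power(m, inflation)               # inflation: strengthen strong flows
        m /= m.sum(axis=0, keepdims=True)        # re-normalise columns
        if np.abs(m - prev).max() < tol:
            break
    # Attractors keep weight on the diagonal; each attractor row lists its cluster
    clusters = set()
    for i in np.flatnonzero(m.diagonal() > 1e-6):
        clusters.add(frozenset(np.flatnonzero(m[i] > 1e-6).tolist()))
    return sorted(sorted(c) for c in clusters)

# Hypothetical all-against-all hits: (protein A, protein B, bit score)
hits = [
    ("p1", "p2", 120.0), ("p2", "p3", 95.0), ("p1", "p3", 80.0),
    ("p4", "p5", 150.0),
    ("p3", "p4", 40.0),   # below the cutoff, so it is dropped
]
BIT_SCORE_CUTOFF = 60.0   # assumed meaning of "60 bits" on the slide

names = sorted({name for a, b, _ in hits for name in (a, b)})
index = {name: i for i, name in enumerate(names)}
adjacency = np.zeros((len(names), len(names)))
for a, b, bits in hits:
    if bits >= BIT_SCORE_CUTOFF:
        adjacency[index[a], index[b]] = adjacency[index[b], index[a]] = 1.0

families = [[names[i] for i in cluster] for cluster in mcl(adjacency, inflation=1.1)]
print(families)
```

A low inflation factor such as 1.1 yields coarse clusters, which fits the goal of grouping distant relatives into large gene families. For millions of proteins a sparse implementation would be needed instead of the dense matrices used here.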
From Phenotypic to Genomic Ecology
Applying ecology concepts to metagenomic data: Functional diversity
Functional indicators (richness, evenness ► functional diversity)
Functional diversity indicators should reveal properties of community networks
Substrate assays: substrate usage rank-abundance curve (turnover vs. substrates); limited to assayed substrates
Metagenome sequencing and analysis: genomic repertoire rank-abundance curve (abundance vs. functions); wide range of functions considered
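The functional indicators named on this slide can be computed from a table of function counts. The sketch below is one possible reading: the slide does not say which indices are meant, so the use of the Shannon index and Pielou's evenness is an assumption, and the function names and counts are hypothetical.

```python
import math

def diversity_indicators(counts):
    """Return richness, Shannon diversity, Pielou evenness and the rank-abundance curve."""
    present = [c for c in counts.values() if c > 0]
    total = sum(present)
    richness = len(present)                                # number of functions observed
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    rank_abundance = sorted((c / total for c in present), reverse=True)
    return richness, shannon, evenness, rank_abundance

# Hypothetical ORF counts per functional category in two samples
even_sample = {"transport": 25, "energy": 25, "repair": 25, "signalling": 25}
skewed_sample = {"transport": 85, "energy": 10, "repair": 4, "signalling": 1}

for name, sample in [("even", even_sample), ("skewed", skewed_sample)]:
    richness, shannon, evenness, curve = diversity_indicators(sample)
    print(f"{name:7s} richness={richness} shannon={shannon:.3f} evenness={evenness:.3f}")
    print("        rank-abundance:", [round(p, 2) for p in curve])
```

Both samples have the same richness, but the skewed one has lower evenness and a steeper rank-abundance curve. This is the kind of difference the genomic repertoire curve is meant to expose.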
Data integration: Reconciling species diversity with functional diversity
[Figure labels: oxic zone; high H2S; low H2S]
Computational metagenome analysis: Conclusions/Some thoughts…
- At least, large projects would benefit from international collaborations and combination of different skill sets
- Encourage development of widely usable analysis tools to increase impact of smaller scale studies
- Establish centres not only for resources (data, CPU...) but also for analysis, which requires interaction with individual projects
- Analysis infrastructure should go beyond human microbiome; there might not be clear-cut borders with environment
- Stimulate interactions with other research communities (Chemical Biology, Ecology etc.) to incorporate novel concepts into analyses
- Encourage meta-analysis to make multiple use of the data
- Establish working group on how to adapt important tools to data flood
METAHIT consortium
Sequencing Pillar
WPS.1 – Shotgun cloning & sequencing of genes of all microorganisms
WPS.2 – Full genome sequencing of the cultured microorganisms
Bioinformatics Pillar
WPB.1 – Resource development and data processing
WPB.2 – Tool development and data analysis
Tool Pillar
WPT.1 – High density array-based profiling tools microorganisms
WPT.2 – High throughput sequencing-based profiling tools
Functional Pillar
WPF.1 – Phenotyping-based identification of host-microbe interaction functions
WPF.2 – Sequence and variability-based identification of host-microbe interaction functions
Variability Pillar
WPV.1 – Correlations between microbiota and inflammatory bowel disease
WPV.2 – Correlations between microbiota and obesity
Outreach Pillar
WPO.1 – International human metagenomics coordination
WPO.2 – Technology transfer
WPM – Project management
Computational metagenomics analysis: Need for wide-spread tools and collaborations
1. Context: Personal views on how the computational infrastructure might evolve
1.1. Soon, most data might NOT be stored in central databases (see physics)
1.2. Soon, PIs' computing resources might become insufficient
1.3. Soon, analysis tools might be too complex for local use
2. Current situation of basic metagenome analyses landscape
2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
2.2. Few basic analyses options (e.g. web, downloads)
2.3. Analyses driven by individual PI's research focus and expertise
2.4. Missing standards even for basic steps of annotation
2.5. Difficulties for comparative meta-analyses
3. Potential, needs, and challenges
3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
3.3. High demands for integration (e.g. temporal, spatial, and context information)
3.4. Spread of tools (e.g. via standards, compatibility etc.) and/or merge skill sets
Summary: Bioinformatics almost ready to go
For comparative analysis, impact and interdependence of several factors still need to be determined, e.g. genome size, GC content, phylogenetic spread, functional richness, evenness and diversity etc.
But this seems doable and thus it should be possible to adapt ecological concepts to (molecular) metagenomics data using computational tools to describe functional diversity with unprecedented resolution
It will require many more parameters to be recorded though, e.g. to cover temporal aspects (unlikely to be steady states everywhere)
Computational analysis will be THE key to integrate with chemical, medical, ecological, geological etc. data