
Upload: bernice-paul

Post on 13-Dec-2015


Computational metagenome analysis

1. Context: Personal views on how the computational infrastructure might evolve
1.1. Soon, most data might NOT be stored in central databases (see physics)
1.2. Soon, PI’s computing resources might become insufficient
1.3. Soon, analysis tools might be too complex for local use
2. Current situation of basic metagenome analyses landscape
3. Potential and challenges

Metagenome analysis team, since Mar 04, aiming at discovery, mostly working on method development though

Peer Bork, EMBL-Heidelberg

Data analysis: the signs before the flood

[Figure, log scale, x-axis ticks 98 and 00 to 07: microbial genomes published per year; animal genomes (>100Mb, published, >95% cov); metagenomics projects (>50Mb, not focussed, non-16S, published, deposited), 2004 to 2007: Acid mine drainage, Sargasso Sea, Deep sea whale bones, Farm soil, N-Pacific subtropical gyre, Mammoth bones (454), Human gut, Soudan mine (454), Sludge, Global Ocean Survey, Mouse gut]

Accelerated exponential increase of ORF numbers (100-fold decrease in sequencing costs not visible yet)


MareNostrum supercomputer, Barcelona

Standard analysis: ca. 14 million ORFs, all-against-all BLAST

12,000 days on one CPU (ca. 5 days real time at BSC)

9 TB output; data transfer to EMBL took 2 weeks

Ongoing post-analysis…
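
The figures on this slide imply a few derived quantities. The following is a back-of-envelope sketch in Python that uses only the numbers quoted above; the decimal terabyte (10^12 bytes), the reading of "2 weeks" as 14 days of continuous transfer, and all function names are assumptions made for illustration.

```python
# Back-of-envelope arithmetic for the all-against-all BLAST run quoted above.
# Assumptions (not from the slide): 1 TB = 10**12 bytes, "2 weeks" = 14 days
# of continuous transfer, and "ca. 14 million ORFs" is taken as exactly 14e6.

N_ORFS = 14_000_000
CPU_DAYS = 12_000
WALL_DAYS = 5
OUTPUT_BYTES = 9 * 10**12
TRANSFER_DAYS = 14


def pairwise_comparisons(n_orfs: int) -> int:
    """Query/subject pairs in an all-against-all search (n squared)."""
    return n_orfs * n_orfs


def effective_cpus(cpu_days: float, wall_days: float) -> float:
    """Average number of CPUs kept busy to finish in the given wall time."""
    return cpu_days / wall_days


def transfer_rate_mbit_per_s(n_bytes: int, days: float) -> float:
    """Sustained network rate needed to move n_bytes in the given days."""
    seconds = days * 24 * 60 * 60
    return n_bytes * 8 / seconds / 1e6


if __name__ == "__main__":
    print(f"pairwise comparisons: {pairwise_comparisons(N_ORFS):.2e}")
    print(f"effective CPUs:       {effective_cpus(CPU_DAYS, WALL_DAYS):.0f}")
    rate = transfer_rate_mbit_per_s(OUTPUT_BYTES, TRANSFER_DAYS)
    print(f"transfer rate:        {rate:.1f} Mbit/s")
```

Under these assumptions the run corresponds to roughly 2,400 CPUs working in parallel, and moving the output took about as long as a sustained link of roughly 60 Mbit/s would need, which is longer than the computation itself.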


Taxonomic census of diverse environments

[Figure labels: Eukarya, Archaea, Bacteria]

> more quantitative than 16S rRNA profiling

> reveals novelty, e.g.
  - stable habitat preferences
  - water faster than soil

> analysis of 1 marker can take hours on a single CPU

Von Mering et al., Science 315 (2007) 1126

Great potential… but…

Computational metagenome analysis

1. Context: Personal views on how the computational infrastructure might evolve
2. Current situation of basic metagenome analyses landscape
2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
2.2. Few basic analysis options (e.g. WEB, downloads)
2.3. Analyses driven by individual PI’s research focus and expertise
2.4. Missing standards even for basic steps of annotation
2.5. Difficulties for comparative meta-analyses
3. Potential and challenges

| Pub. Yr. | Environment (Location) | # ORFs (Mbp) | # Novel ORFs, not assigned (%) | ORF-calling procedure (DB searched) | Functional annotation procedure (DB searched) |
|---|---|---|---|---|---|
| 2004 | Acid Mine (California) | 46,862 (76) | 34,301 (73.2%) | FGENESB pipeline (nr) | blastp (COG, nr) |
| 2004 | Surface Sea Water (Sargasso Sea, samples 1-4) | 1,001,987 (779) | 649,608 (64.8%) | Evidence-based, using translation start & stop sites (bacterial portion of nraa) | blast (TIGR Role Category) |
| 2005 | Deep-Sea Whalefall (Pacific, Antarctic) | 122,147 (75) | 63,021 (51.6%) | FGENESB pipeline (nraa) | blastp (extCOG v. 6, KEGG) |
| 2005 | Farm Soil (Minnesota) | 183,536 (100) | 114,301 (62.3%) | FGENESB pipeline (nraa) | blastp (extCOG v. 6, KEGG) |
| 2006 | Subtropical Ocean Gyre (North Pacific) | N/A* (64) | - | ORF-calling not performed | blastx, blastn (KEGG, SEED, COG, Sargasso dataset) |
| 2006 | Human Gut | 50,164 (78) | 34,504 (68.8%) | Evidence-based, start & stop sites (in-house non-redundant protein repository**) | blast (COG, KEGG, STRING) |
| 2006 | Wastewater Sludge (US, Australia) | 65,328 (176) | 47,032 (72.0%) | FGENESB pipeline (nraa) | IMG/M pipeline (KEGG) |
| 2006 | Mouse Gut | 134,189 (160) | 100,599 (75.0%) | Glimmer software tool v. 3.01 (InterPro) | blastx (nr, extCOG v. 6.3, KEGG) |
| 2007 | Global Ocean Sampling | 6,123,395 (6250) | N/A | Using special pipeline | blastp (various DBs), HMMs (TIGRFAM) |

8 studies

4 assembly pipelines

6 ORF calling procedures

8 function prediction protocols

8 parameter settings
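
The percentage column of the table can be cross-checked against its two count columns. The sketch below, in Python, recomputes each percentage as the second count divided by the total ORF count; the variable and function names are illustrative, and the two studies without counts (Subtropical Ocean Gyre, Global Ocean Sampling) are left out.

```python
# Recompute the percentage column of the study table from its count columns.
# Each tuple: (study, total ORFs, second count, percentage quoted in the table).
# Names here are illustrative; the numbers are copied from the table above.

STUDIES = [
    ("Acid Mine (California)", 46_862, 34_301, 73.2),
    ("Surface Sea Water (Sargasso Sea)", 1_001_987, 649_608, 64.8),
    ("Deep-Sea Whalefall", 122_147, 63_021, 51.6),
    ("Farm Soil (Minnesota)", 183_536, 114_301, 62.3),
    ("Human Gut", 50_164, 34_504, 68.8),
    ("Wastewater Sludge", 65_328, 47_032, 72.0),
    ("Mouse Gut", 134_189, 100_599, 75.0),
]


def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, in percent, rounded to one decimal."""
    return round(100 * part / total, 1)


def mismatches(studies):
    """Studies whose quoted percentage differs from the recomputed one."""
    return [name for name, total, part, quoted in studies
            if percent(part, total) != quoted]


if __name__ == "__main__":
    for name, total, part, quoted in STUDIES:
        print(f"{name:35s} {percent(part, total):5.1f}% (table: {quoted}%)")
    print("mismatches:", mismatches(STUDIES))
```

All seven quoted percentages agree with the counts, so the differences between studies come from the data and the pipelines rather than from arithmetic in the table.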


Increase of functional assignments (via orthologous groups) with coverage

(out of a total of 20334 OGs) … [OGs in STRING now at 40 000]

[Figure: y-axis: Orthologous groups (COGs + NOGs)]

Tringe*, von Mering* … Bork, Hugenholtz, Rubin, Science 308 (2005) 554

Reason for differences (technical issues, biological issues):

- Sampling + preparation
- Coverage
- Sequencing method
- Assembly + annotation
- GC content
- Phylogeny
- Functionality
- Evenness/Richness
- Genome sizes
- Evolutionary speed
- …

Computational metagenome analysis

1. Context: Personal views on how the computational infrastructure might evolve
2. Current situation of basic metagenome analyses landscape
3. Potential and challenges
3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
3.3. High demands for integration (e.g. temporal, spatial, and context information)

As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.

The Unknown – by Donald Rumsfeld

CHARACTERIZED UNCHARACTERIZED UNCHARACTERIZABLE

Feb. 12, 2002, Department of Defense news briefing

Protein function prediction in metagenomics samples

Overall function predictions for >70% of environmental data!

[Figure: function prediction by Blast and by Neighborhood (taken from the STRING resource). Blast categories: CHARACTERIZED, UNCHARACTERIZED, UNCHARACTERIZABLE. Neighborhood categories: next to CHARACTERIZED, next to UNCHARACTERIZED, UNCHARACTERIZABLE]

Mining for novelty in environmental data

Homology-based: Novel antibiotics biosynthesis enzyme families (PKS1)

Neighbourhood-based: Coupling of fatty acid biosynthesis and degradation via new transcription regulator

More on the unknown …

Function prediction in gene families of 1.5 million proteins from 4 environments

All-against-all, MCL clustering (60 bits, inflation factor 1.1)

Our functional knowledge: glass half full or half empty?

Our knowledge concentrates in large, well established families contributing 65% of the ORFs. However, many specialized functions in small gene families are to be discovered
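
To make the clustering step on this slide concrete, here is a toy sketch of Markov clustering (MCL) in pure Python. It is not the pipeline used for the 1.5 million proteins: the hit list and protein names are invented, "60 bits" is read as a minimum bit-score cutoff for keeping an edge (an assumption), and the self-loop weighting is one common convention rather than anything stated on the slide. Only the inflation factor 1.1 is taken from the slide.

```python
# Toy Markov clustering (MCL) of protein similarity hits, pure Python.
# Assumptions: "60 bits" is a minimum bit-score cutoff for an edge; the hits
# and protein names below are invented for illustration only.

BIT_CUTOFF = 60.0
INFLATION = 1.1  # inflation factor quoted on the slide


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(weights, inflation=INFLATION, max_iter=200, tol=1e-9):
    """Cluster a symmetric weight matrix; returns a set of index frozensets."""
    n = len(weights)
    m = [row[:] for row in weights]
    for i in range(n):
        # Self-loop as strong as the node's best edge; 1.0 if it has no edge.
        m[i][i] = max(m[i]) or 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (flow along two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: elementwise power, then renormalise each column.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow defines a cluster of its columns.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


def cluster_families(hits, cutoff=BIT_CUTOFF):
    """Group proteins into families from (protein_a, protein_b, bits) hits."""
    names = sorted({p for a, b, _ in hits for p in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    weights = [[0.0] * len(names) for _ in names]
    for a, b, bits in hits:
        if a != b and bits >= cutoff:
            weights[index[a]][index[b]] = float(bits)
            weights[index[b]][index[a]] = float(bits)
    return sorted(sorted(names[j] for j in c) for c in mcl(weights))


HITS = [
    ("a1", "a2", 120), ("a1", "a3", 120), ("a2", "a3", 120),
    ("b1", "b2", 80), ("b1", "b3", 80), ("b2", "b3", 80),
    ("a3", "b1", 45),  # below the cutoff, so the two families stay apart
]

if __name__ == "__main__":
    print(cluster_families(HITS))
```

A low inflation factor such as 1.1 favours coarse clusters, which fits the goal of collecting distant relatives into the same gene family. The dense matrices used here only suit toy inputs; a data set of this size would need a sparse implementation.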


From Phenotypic to Genomic Ecology

Applying ecology concepts to metagenomic data: Functional diversity

Functional indicators (richness, evenness ► functional diversity)

Functional diversity indicators should reveal properties of community networks

[Figure, two panels.
Substrate assays: substrate usage rank abundance curve (y-axis: Turnover, x-axis: Substrates). Limited to assayed substrates.
Metagenome sequencing and analysis: genomic repertoire rank abundance curve (y-axis: Abundance, x-axis: Functions). Wide range of functions considered.]
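
The indicators named on this slide can be computed directly from per-function gene counts. The following Python sketch assumes Shannon entropy as the diversity measure and Pielou's index as the evenness measure; the slide does not say which indices were used, and the functional categories and counts are invented for illustration.

```python
import math

# Functional richness, evenness and rank-abundance curve from gene counts per
# functional category. Assumptions: Shannon entropy and Pielou's evenness are
# illustrative choices; the categories and counts are hypothetical.


def richness(counts):
    """Number of functional categories observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon entropy (natural log) of the category frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)


def pielou_evenness(counts):
    """Shannon entropy divided by its maximum, ln(richness); 0..1."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def rank_abundance(counts):
    """Relative abundances sorted from most to least abundant function."""
    total = sum(counts.values())
    return sorted((c / total for c in counts.values() if c > 0), reverse=True)


EVEN = {"transport": 40, "energy": 40, "translation": 40, "signalling": 40}
SKEWED = {"transport": 97, "energy": 1, "translation": 1, "signalling": 1}

if __name__ == "__main__":
    for label, sample in (("even", EVEN), ("skewed", SKEWED)):
        print(label, richness(sample),
              round(pielou_evenness(sample), 3), rank_abundance(sample))
```

Both hypothetical samples have the same richness (four functions), but the skewed one has much lower evenness and a steeper rank-abundance curve, which is the kind of contrast the two panels are meant to expose.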


Data integration: Reconciling species diversity with functional diversity

[Figure labels: oxic zone; high H2S; low H2S]

Computational metagenome analysis: Conclusions/Some thoughts…

- At least, large projects would benefit from international collaborations and combination of different skill sets

- Encourage development of widely usable analysis tools to increase impact of smaller scale studies

- Install not only centres for resources (data, cpu...) but also for analysis; requiring interactions with individual projects

- Analysis infrastructure should go beyond human microbiome; there might not be clear-cut borders with environment

- Stimulate interactions with other research communities (Chemical Biology, Ecology etc.) to incorporate novel concepts into analyses

- Encourage meta-analysis to make multiple use of the data

- Establish working group on how to adapt important tools to data flood

METAHIT consortium

Sequencing Pillar
- WPS.1 – Shotgun cloning & sequencing of genes of all microorganisms
- WPS.2 – Full genome sequencing of the cultured microorganisms

Bioinformatics Pillar
- WPB.1 – Resource development and data processing
- WPB.2 – Tool development and data analysis

Tool Pillar
- WPT.1 – High density array-based profiling tools microorganisms
- WPT.2 – High throughput sequencing-based profiling tools

Functional Pillar
- WPF.1 – Phenotyping-based identification of host-microbe interaction functions
- WPF.2 – Sequence and variability-based identification of host-microbe interaction functions

Variability Pillar
- WPV.1 – Correlations between microbiota and inflammatory bowel disease
- WPV.2 – Correlations between microbiota and obesity

Outreach Pillar
- WPO.1 – International human metagenomics coordination
- WPO.2 – Technology transfer

WPM – Project management

[Diagram labels: R1 to R9]

Computational metagenomics analysis: Need for wide-spread tools and collaborations

1. Context: Personal views on how the computational infrastructure might evolve
1.1. Soon, most data might NOT be stored in central databases (see physics)
1.2. Soon, PI’s computing resources might become insufficient
1.3. Soon, analysis tools might be too complex for local use
2. Current situation of basic metagenome analyses landscape
2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
2.2. Few basic analysis options (e.g. WEB, downloads)
2.3. Analyses driven by individual PI’s research focus and expertise
2.4. Missing standards even for basic steps of annotation
2.5. Difficulties for comparative meta-analyses
3. Potential, needs, and challenges
3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
3.3. High demands for integration (e.g. temporal, spatial, and context information)
3.4. Spread of tools (e.g. via standards, compatibility etc.) and/or merge skill sets

Summary: Bioinformatics almost ready to go

For comparative analysis, impact and interdependence of several factors still need to be determined, e.g. genome size, GC content, phylogenetic spread, functional richness, evenness and diversity etc.

But this seems doable and thus it should be possible to adapt ecological concepts to (molecular) metagenomics data using computational tools to describe functional diversity with unprecedented resolution

It will require many more parameters to be recorded though, e.g. to cover temporal aspects (unlikely to be steady states everywhere)

Computational analysis will be THE key to integrate with chemical, medical, ecological, geological etc. data