Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities


Post on 23-Feb-2016


DESCRIPTION

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities. By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005. David Kelley. State of metagenomics. In July 2005, 9 projects had been completed. General challenges were becoming apparent - PowerPoint PPT Presentation

TRANSCRIPT

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities
By Kevin Chen, Lior Pachter
PLoS Computational Biology, 2005
David Kelley

State of metagenomics
In July 2005, 9 projects had been completed.
General challenges were becoming apparent
Paper focuses on computational problems

Assembling communities
Goal
  Retrieval of nearly complete genomes from the environment

Challenges
  Need sufficient read depth: species must be prominent
  Avoid mis-assembling across species while maximizing contig size

Comparative assembly
Align all reads to a closely related reference genome
Infer contigs from read alignments

Rearrangements limit effectiveness
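The "infer contigs from read alignments" step can be sketched as interval merging on reference coordinates. This is a minimal illustration, not the method of Pop et al.: the function name `contigs_from_alignments`, the half-open `(start, end)` intervals, and the `min_overlap` parameter are all assumptions made for the example, and the read coordinates are made up.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference (assumed convention)


def contigs_from_alignments(alignments: List[Interval], min_overlap: int = 1) -> List[Interval]:
    """Merge read alignments that overlap by at least min_overlap bp into contigs."""
    contigs: List[Interval] = []
    for start, end in sorted(alignments):
        # Extend the current contig if this read overlaps it by enough bases.
        if contigs and start <= contigs[-1][1] - min_overlap:
            prev_start, prev_end = contigs[-1]
            contigs[-1] = (prev_start, max(prev_end, end))
        else:
            # No sufficient overlap: a coverage gap starts a new contig.
            contigs.append((start, end))
    return contigs


# Hypothetical read placements on a reference genome.
reads = [(0, 100), (80, 180), (150, 250), (400, 500), (480, 560)]
print(contigs_from_alignments(reads))
```

Because the contigs live in reference coordinates, a rearrangement between the reference and the sequenced genome would place reads in the wrong order, which is one way to read the limitation noted on the slide.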

Pop M. et al. Comparative genome assembly. Briefings in Bioinformatics 2004.

Assisted assembly
De novo assembly
Complement by aligning reads to reference genome(s)

Short overlaps can be trusted
Single mate links can be trusted
Mis-assemblies can be detected
Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.

Metagenomics application
Pros:
  Low coverage species
  If conservative, unlikely to hurt

Cons:
  Exotic microbes may have no good references
  Potential to propagate mis-assemblies

Overlap-layout-consensus
Species-level
  Increased polymorphism
  Reads come from different individuals
  Missed overlaps

System-level
  Homologous sequence
  False overlaps

Polymorphic diploid eukaryotes
Reads sequenced from 2 chromosomes
Single reference sequence expected

Keep duplications separate
Keep polymorphic haplotypes together

Strategy 1
Form contigs aggressively
Detect alignments between contigs and resolve

Avoid merging duplications by respecting mate pair distances
Jones, T. et al. The diploid genome sequence of Candida albicans. PNAS 2004.

Strategy 2
Assemble chromosomes separately
Erase overlaps with splitting rule

Vinson et al. Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi. Genome Research 2005.

Back to metagenomics
Strategy 1
  Assemble aggressively
  Detect mis-assemblies and fix

Strategy 2
  Separate reads or filter overlaps

Binning
Presence of informative genes
  E.g. 16S rRNA
Machine learning
  K-mers
  Codon bias
  Worked well only with big scaffolds
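The k-mer features mentioned for binning can be sketched as a normalized composition vector per scaffold. This is an illustration only: the function names `kmer_frequencies` and `composition_distance`, the choice of k = 4, and the Euclidean distance are assumptions for the example, not the classifiers discussed in the paper.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 4) -> Dict[str, float]:
    """Return the relative frequency of every A/C/G/T k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other characters (e.g. N) are left out of the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def composition_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance between two k-mer frequency vectors."""
    return sum((a[m] - b[m]) ** 2 for m in a) ** 0.5


# Hypothetical toy sequences; real scaffolds would be far longer.
scaffold_1 = "ATATATATTTAATATAATTA" * 10
scaffold_2 = "GCGCGGCCGCGGGCCGCGCC" * 10
print(composition_distance(kmer_frequencies(scaffold_1), kmer_frequencies(scaffold_2)))
```

A short fragment yields only a few k-mer windows, so its frequency vector is noisy. That is consistent with the slide's remark that these methods worked well only with big scaffolds.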

Lots of progress in this area since 2005

Abundances
Depth of read coverage suggests relative abundance of species in sample

Difficult if polymorphism is significant
  Separate individuals: too low
  Merge species: too high
Depends on good classification

How much sequencing
G = genome size (or sum of genomes)
c = global coverage
k = local coverage
n_k = bp w/ coverage k

Poisson model
Interval = [x - l_r, x]
Events = read starts
Rate = coverage

Gene finding
Focus on genes, rather than genomes

Bacterial gene finders are very accurate

Assemble and run on scaffolds
BLAST leftover reads against protein db

Partial genes
Tested GLIMMER on simulated 10 Kb contigs
Many genes crossed borders
GLIMMER often predicted a truncated version

Gene finding models could be adjusted to account for this case

Gene-centric analysis
Cluster genes by orthology
  Orthology refers to genes in different species that derive from a common ancestor

Express sample as vector of abundances

UPGMA on KEGG vectors

PCA on KEGG vectors
  Principal components may correspond to interesting pathways or functions
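Clustering samples by their abundance vectors with UPGMA can be sketched in plain Python. This is a minimal sketch under stated assumptions: the sample names and three-category vectors are invented, Euclidean distance is an arbitrary choice, and `upgma` is a hypothetical helper that returns only the tree topology as nested tuples.

```python
from itertools import combinations
from typing import Dict, List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def euclidean(u: List[float], v: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def upgma(vectors: Dict[str, List[float]]) -> Tree:
    """Cluster named abundance vectors with UPGMA; expects at least one vector."""
    names = list(vectors)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances, keyed by (smaller id, larger id).
    dist = {
        (i, j): euclidean(vectors[names[i]], vectors[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        del dist[(i, j)]
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # UPGMA update: size-weighted average of the two old distances.
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical relative abundances of three functional categories per sample.
samples = {
    "soil_A": [0.50, 0.30, 0.20],
    "soil_B": [0.45, 0.35, 0.20],
    "ocean": [0.10, 0.20, 0.70],
}
print(upgma(samples))
```

In this toy input the two soil samples join first, and the ocean sample attaches last.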

How much sequencing
N = # genes in community
f = fraction found
Coupon collector's problem

Phylogeny
Apply multiple sequence alignment and phylogeny reconstruction to gene sequences

Partial sequences
Bad for common MSA programs

Semi-global alignment is required
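Semi-global alignment differs from global alignment in that gaps at the ends of either sequence are not penalized, so a gene fragment can align inside a longer sequence. The sketch below computes only the score with a simple dynamic program. The function name `semiglobal_score` and the scoring values (match 1, mismatch -1, gap -2) are assumptions chosen for illustration, not parameters from the paper or from any particular alignment program.

```python
def semiglobal_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best pairwise alignment score with free gaps at both ends of both sequences."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    # Row 0 and column 0 stay at 0, which makes leading gaps free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diag, up, left)
    # Trailing gaps are free: take the best cell in the last row or last column.
    return max(max(score[n]), max(row[m] for row in score))


# A fragment contained in a longer sequence aligns without end-gap penalties.
print(semiglobal_score("ACGT", "TTACGTTT"))
```

A global alignment of the same pair would pay for the four unaligned flanking bases, which is the problem partial sequences cause for alignment programs that assume full-length input.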

Supertree methods
Construct tree from multiple subtrees

Split gene into segments?
Construct subtree on sequences that align fully to segment?

Thanks!

How much sequencing (Poisson model):

$$E[n_k] = \frac{G\, c^k e^{-c}}{k!}$$
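The Poisson expression can be evaluated directly, using the slide's symbols G (genome size), c (global coverage) and k (local coverage). This is a small sketch: the function name `expected_bp_at_coverage` and the example values G = 1,000,000 bp and c = 2 are hypothetical.

```python
import math


def expected_bp_at_coverage(G: float, c: float, k: int) -> float:
    """Expected number of bp covered exactly k times: G * c^k * e^(-c) / k!."""
    return G * c ** k * math.exp(-c) / math.factorial(k)


# Hypothetical example: a 1 Mb genome sequenced to 2x global coverage.
G, c = 1_000_000, 2.0
for k in range(5):
    print(k, round(expected_bp_at_coverage(G, c, k)))

# Fraction of the genome expected to be covered at least once: 1 - e^(-c).
print(1 - math.exp(-c))
```

Summing the expression over all k returns G, since the Poisson probabilities add up to one.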

How much sequencing (coupon collector's problem):

$$N \log \frac{1}{1 - f}$$
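The coupon collector expression can be evaluated the same way. The slide gives only the expression, so this sketch makes two assumptions: that it is the approximate number of gene samples needed to find a fraction f of N equally abundant genes, and that the logarithm is natural. The function name `genes_to_sample` and the example values are hypothetical.

```python
import math


def genes_to_sample(N: int, f: float) -> float:
    """Evaluate N * log(1 / (1 - f)) for 0 <= f < 1 (natural log assumed)."""
    if not 0 <= f < 1:
        raise ValueError("f must satisfy 0 <= f < 1")
    return N * math.log(1 / (1 - f))


# Hypothetical community with 1000 genes.
for f in (0.5, 0.9, 0.99):
    print(f, round(genes_to_sample(1000, f)))
```

The value grows without bound as f approaches 1, so finding the last few genes costs much more than finding the first half.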
