Machine learning and metagenome analysis… · Overview of metagenome analysis • What is...
TRANSCRIPT
Overview of analysis workflow
[Figure: workflow diagram with five numbered stages]
1. FASTQ files: FastQC quality control of reads; trimming and filtering of bad-quality reads
2. Mapping of reads to a reference genome, or de novo assembly (reconstruction of a genome)
3. SAM files, BAM files
4. Read depth; variant calling; structural variations; gene/chr CNV
5. VCF files: SNPs, indels; annotation and visualization
Other files shown: FASTA file, GFF file
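The read-filtering part of stage 1 can be sketched in Python. This is only an illustration of the idea, not how FastQC or any trimming tool works internally. The function names, the Phred+33 offset, and the mean-quality cutoff of 20 are assumptions made for the example.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file (or blank line): stop
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header, seq, qual


def mean_quality(qual: str) -> float:
    """Mean Phred score of a non-empty quality string."""
    return sum(ord(c) - PHRED_OFFSET for c in qual) / len(qual)


def filter_reads(
    handle: TextIO, min_mean_q: float = 20.0, min_length: int = 1
) -> Iterator[Tuple[str, str, str]]:
    """Keep reads that are long enough and whose mean quality passes the cutoff."""
    for header, seq, qual in read_fastq(handle):
        if len(seq) >= min_length and mean_quality(qual) >= min_mean_q:
            yield header, seq, qual


# Toy input: read1 has quality 'I' (Phred 40), read2 has '!' (Phred 0).
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = [header for header, _, _ in filter_reads(StringIO(example))]
print(kept)  # ['@read1']
```

Real pipelines usually trim low-quality ends as well as discarding whole reads; that step is left out here to keep the sketch short.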
Overview of metagenome analysis
• What is metagenomics?
  – The study of the collective genomic material from environmental samples, for example
    • Environment: soil, water
    • Medical: fecal, skin, kidney stone
    • Industrial: bioreactors, fermenters, enrichments
    • Pretty much anything
Overview of metagenome analysis
• Why?
  – Characterize a sample that may be of “biological interest”, but…
  – The vast majority of microorganisms cannot be cultured
  – Methods used to culture from environmental samples miss these
• Solution: isolate DNA from samples, sequence it, then break down what is there.
  – Yes, it’s as difficult as it sounds
Overview of metagenome analysis
• Solution: isolate DNA from samples, sequence it, then break down what is there.
  – Taxonomic: what is present?
  – Functional: what can be done metabolically (e.g. metabolic potential)?
    • Note, this cannot be done with 16S directly
Overview of metagenome analysis
• Note: depending on the question, there may be complementary (and similarly difficult) data
  – Metatranscriptome: what is being expressed in environmental samples (RNA)
  – Metabolome: metabolites produced
  – Proteome: proteins present in sample
Overview of metagenome analysis
• Two general approaches
  – Targeted sequencing (e.g. 16S variable regions)
  – Shotgun (whole) metagenome sequencing
Targeted analysis
Morgan XC, Huttenhower C (2012) Chapter 12: Human Microbiome Analysis. PLOS Computational Biology 8(12): e1002808.
OTU: Operational Taxonomic Unit (cluster of similar sequence variants), used to categorize bacteria
Targeted analysis (Morgan and Huttenhower, 2012)
ML Model: k-NN, hierarchical clustering, Bayesian clustering, greedy heuristic clustering
Tools: mothur, USEARCH/UCLUST/UPARSE, CD-HIT
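To make "greedy heuristic clustering" concrete, here is a minimal Python sketch of centroid-based OTU picking. It is not the algorithm of any tool listed above. The function names are hypothetical, identity is computed position by position on sequences assumed to be pre-aligned to equal length, and the default threshold of 0.97 is a conventional choice rather than something stated in the slides.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: List[str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Assign each sequence to the first centroid it matches, else start a new OTU."""
    otus: Dict[str, List[str]] = {}  # centroid sequence -> member sequences
    for seq in seqs:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq)
                break
        else:
            # No existing centroid is similar enough: seq becomes a new centroid.
            otus[seq] = [seq]
    return otus


# Toy reads of length 10, clustered at 90% identity for illustration.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGTAC"]
otus = greedy_otus(reads, threshold=0.9)
for centroid, members in otus.items():
    print(centroid, len(members))
```

The result depends on input order, which is why the approach is called a heuristic: a read joins the first centroid that is close enough, not necessarily the closest one.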
Targeted analysis (Morgan and Huttenhower, 2012)
ML Model: linear model, random forest
Tools: RDP Classifier, 16S Classifier, PhyloSift, PhyloPythia
Shotgun metagenome analysis
• Full sequencing of the genomic content of an environmental sample.
• Two general methods in analysis:
  – Assembly-based: assemble the sequences, then classify the contigs from the assembly into ‘bins’, followed by gene prediction, annotation, and some form of quantifying and normalizing data for comparison across samples
  – Read-based: analyse the unassembled reads directly against a database of interest, then assign taxonomy and function when possible
Shotgun metagenome analysis
Quince, C et al. Shotgun metagenomics, from sampling to analysis. (2017) Nature Biotechnology (35): 833–844
Metagenome analysis - Binning
Sedlar, K et al. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Computational and Structural Biotechnology Journal 15: 48-55. 2017
ML Model: linear regression, interpolated Markov model, PCA, SVD
Lots of clustering! k-means, k-medoids, Gaussian mixture model, greedy heuristic, Bayesian clustering, spectral clustering
Tools: CONCOCT, MetaBAT, MaxBin
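As an illustration of clustering-based binning, the sketch below groups toy contigs by tetranucleotide composition with a small k-means. It is written under simplifying assumptions and does not reproduce CONCOCT, MetaBAT, or MaxBin, which also use coverage information. All function names, the deterministic farthest-point initialisation, and the toy contigs are made up for the example.

```python
from itertools import product
from typing import Dict, List, Sequence


def kmer_profile(seq: str, k: int = 4) -> List[float]:
    """Relative frequency of every DNA k-mer in seq (4**k values)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts: Dict[str, int] = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip k-mers containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in kmers]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], n_clusters: int, n_iter: int = 50) -> List[int]:
    """Plain k-means with farthest-point initialisation (no randomness)."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(n_clusters), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy contigs: two AT-rich and two GC-rich sequences.
contigs = ["AATATT" * 10, "ATATTA" * 10, "GGCGCC" * 10, "GCGCCG" * 10]
labels = kmeans([kmer_profile(c) for c in contigs], n_clusters=2)
print(labels)  # [0, 0, 1, 1]
```

With real assemblies the number of bins is not known in advance, which is one reason the tools above rely on more elaborate models than a fixed-k k-means.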
Shotgun metagenome analysis
• Let’s say you have a metagenome assembly
• Now you have to annotate it to get functional information
Tools: MetaProdigal, MetaGeneMark, FragGeneScan
ML Model: HMM, neural network, interpolated Markov models
Sharpton, T. An introduction to the analysis of shotgun metagenomic data. Front. Plant Sci., 16 June 2014
What next?
• At the end, you normally end up with quantitative information related to:
  – Taxonomic counts
  – Feature counts (genes, protein families)
• These can go into standard downstream packages for analysis (phyloseq, MEGAN, etc.)
  – Normally involves performing some form of ordination (PCoA, MDS, etc.)
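Ordination methods such as PCoA start from a sample-by-sample distance matrix. The Python sketch below shows how taxonomic counts might be turned into one. Bray-Curtis dissimilarity is an assumption here, since the slides do not name a distance measure. The function names, sample names, and counts are all invented for illustration.

```python
from typing import Dict, List


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw counts so that samples of different depth are comparable."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity: 0 = identical composition, 1 = no shared taxa."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    den = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0


# Hypothetical taxonomic counts for three samples.
samples = {
    "sample_A": {"Bacteroides": 50, "Prevotella": 50},
    "sample_B": {"Bacteroides": 500, "Prevotella": 500},
    "sample_C": {"Escherichia": 100},
}
names = list(samples)
profiles = [relative_abundance(samples[n]) for n in names]

# Symmetric distance matrix, the usual input to PCoA or MDS.
matrix: List[List[float]] = [
    [bray_curtis(p, q) for q in profiles] for p in profiles
]
for name, row in zip(names, matrix):
    print(name, row)
```

Samples A and B differ tenfold in sequencing depth but have the same composition, so after normalization their dissimilarity is 0, while sample C shares no taxa with either and sits at distance 1.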