Canadian Bioinformatics Workshops
www.bioinformatics.ca
6/16/16
Module 4: Downstream analyses & integrative tools
David Bujold
Epigenomic Data Analysis, June 20–June 21, 2016
Module 4: Downstream analyses & integrative tools bioinformatics.ca
Learning Objectives
• Explore some downstream analyses that can be done with epigenomic assay data
• Discover sources of publicly available datasets that can be used in anyone’s projects
• Learn about online portals and tools that can ease epigenomics data analysis
Motivation for epigenomic integrative analysis
• Over 98% of the human genome does not encode protein sequences (1)
• 76% of the genome gets transcribed (2)
• Nearly half of the genome is accessible in some way to genetic regulatory proteins such as transcription factors
• Putting the information we can obtain on variants, DNA methylation, histone modifications, transcription to RNA, chromatin accessibility, etc. in context will ease our understanding of the underlying biology
(1) Elgar, G., & Vavouri, T. (2008). Tuning in to the signals: noncoding sequence conservation in vertebrate genomes. Trends in Genetics, 24(7), 344-352.
(2) Pennisi, E. (2012). ENCODE Project Writes Eulogy for Junk DNA. Science, 337(6099), 1159-1161.
Module Outline
1. Downstream functional analysis tools
2. Working with public datasets
3. Quality control for online resources
4. Online visualization and analysis tools
1 - Downstream functional analysis tools
Downstream functional analysis
• Once primary analysis is done for our epigenomic assay, we have:
  – A set of peak calls for ChIP-Seq assays
  – Methylation levels at CpG sites for WGB-Seq assays
• Next, we can use this data to run some functional analyses by comparing:
  – Different regions from the same dataset
  – Multiple samples of the same group
  – Different groups
Differentially methylated sites
Bock C. Analysing and interpreting DNA methylation data. Nat Rev Genet. 2012 Oct;13(10):705-19.
Methylation downstream analysis
• Identifying differentially methylated regions (DMR) across sample groups (cell types, disease status, etc.)
• Identifying regions of the genome with different methylation patterns
Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248
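As a rough illustration of the DMR idea, the sketch below flags CpG sites whose mean methylation differs between two sample groups, then merges neighbouring flagged sites that change in the same direction into candidate regions. The input layout, the 0.25 difference threshold and the 500 bp merge distance are assumptions made for this example; dedicated DMR callers add statistical testing and coverage filters that are not shown here.

```python
from statistics import mean

def candidate_dmrs(sites, min_delta=0.25, max_gap=500):
    """Merge neighbouring differentially methylated CpGs into regions.

    sites: list of (chrom, pos, group_a_levels, group_b_levels), sorted by
    chromosome and position; levels are methylation fractions in [0, 1].
    Returns tuples (chrom, start, end, n_sites, sign), where sign is +1 when
    group A is more methylated than group B and -1 otherwise.
    """
    regions = []
    current = None  # [chrom, start, end, n_sites, sign] of the region being built
    for chrom, pos, levels_a, levels_b in sites:
        delta = mean(levels_a) - mean(levels_b)
        if abs(delta) < min_delta:
            continue  # site is not differential; it does not break a region here
        sign = 1 if delta > 0 else -1
        extends = (
            current is not None
            and current[0] == chrom
            and current[4] == sign
            and pos - current[2] <= max_gap
        )
        if extends:
            current[2] = pos
            current[3] += 1
        else:
            if current is not None:
                regions.append(tuple(current))
            current = [chrom, pos, pos, 1, sign]
    if current is not None:
        regions.append(tuple(current))
    return regions

# Hypothetical toy data: three CpGs lose methylation in group B, one does not.
toy_sites = [
    ("chr1", 1000, [0.90, 0.85], [0.20, 0.30]),
    ("chr1", 1120, [0.80, 0.90], [0.10, 0.25]),
    ("chr1", 1300, [0.50, 0.55], [0.50, 0.45]),
    ("chr1", 9000, [0.95, 0.90], [0.40, 0.35]),
]
print(candidate_dmrs(toy_sites))
```

With the toy data, the first two sites merge into one candidate region and the distant site at position 9000 forms its own.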
What are motifs?
• Short, recurring patterns in DNA that are presumed to have a biological function (1)
• Often indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF)
• In this example, if allowing 1 base mismatch, there are two motifs: TTGACA and GCATC
Example from http://slideplayer.com/slide/8679835/
(1) D'haeseleer, Patrik. "What are DNA sequence motifs?." Nature biotechnology 24.4 (2006): 423-425.
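To make the "1 base mismatch" idea concrete, here is a minimal sketch that scans a sequence for windows within a given Hamming distance of a motif. The input sequence is made up for this example and is not the one from the slide; real motif finders work with position weight matrices rather than a single consensus string.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_motif(sequence, motif, max_mismatches=1):
    """Return (position, matched_text) for every window within the mismatch budget."""
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if hamming(window, motif) <= max_mismatches:
            hits.append((i, window))
    return hits

# Hypothetical sequence, invented for illustration.
seq = "ACGTTGACATTTGCATCAATTGATAGG"
print(find_motif(seq, "TTGACA"))  # one exact match and one match with a mismatch
print(find_motif(seq, "GCATC"))   # one exact match
```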
Exploring motifs in ChIP-seq peaks
• Using regions previously labeled as peaks, we can try to identify motifs
• Identifying transcription factor binding sites (TFBS) helps to understand regulatory networks and transcription mechanisms
HOMER
• Tries to identify regulatory elements enriched in one set of sequences compared to another
• Motif discovery algorithm designed for regulatory element analysis in genomics applications (DNA sequences only)
  – Known motif counting
  – De novo motif identification
  – Attempts to match de novo motifs to known ones
HOMER
• findMotifsGenome.pl attempts to identify motifs in a provided list of genomic regions
• Input:
  – BED file containing the regions (peaks file)
    • Column 1: chromosome
    • Column 2: starting position
    • Column 3: ending position
    • Column 4: unique peak ID
    • Column 5: not used
    • Column 6: strand (+/- or 0/1, where 0="+", 1="-")
  – Reference genome assembly
  – Size: fragment size to use for motif finding
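The inputs listed above can be prepared with a few lines of Python. This sketch writes a six-column BED file and builds a command following HOMER's documented usage pattern, `findMotifsGenome.pl <peak/BED file> <genome> <output directory> -size <#>`. The peak coordinates, the `hg19` genome, the `homer_out` directory and the size of 200 are assumptions for illustration; the command is only printed, since running it requires a local HOMER installation with the matching genome package.

```python
import os
import shlex
import tempfile

def write_homer_bed(peaks, path):
    """Write peaks as the six tab-separated columns described above."""
    with open(path, "w") as handle:
        for i, (chrom, start, end, strand) in enumerate(peaks, start=1):
            # Column 4 is a unique peak ID; column 5 is not used, so a
            # placeholder "." is written there.
            fields = [chrom, str(start), str(end), f"peak_{i}", ".", strand]
            handle.write("\t".join(fields) + "\n")

def homer_command(bed_path, genome, out_dir, size):
    """Build the findMotifsGenome.pl argument list (not executed here)."""
    return ["findMotifsGenome.pl", bed_path, genome, out_dir, "-size", str(size)]

# Hypothetical peaks and settings, for illustration only.
peaks = [("chr1", 10500, 10700, "+"), ("chr2", 48200, 48450, "+")]
bed_path = os.path.join(tempfile.mkdtemp(), "peaks.bed")
write_homer_bed(peaks, bed_path)
cmd = homer_command(bed_path, "hg19", "homer_out", 200)
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` once HOMER is available.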
HOMER - Execution steps
1. Verify peak/BED file
2. Extract sequences from the genome corresponding to the regions in the input file
3. Calculate GC/CpG content of peak sequences
4. Preparse the genomic sequences of the selected size to serve as background sequences
5. Randomly select background regions for motif discovery
6. Autonormalization of sequence bias
7. Check enrichment of known motifs
8. De novo motif finding
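Step 3 can be pictured with a small helper that computes the GC fraction and the CpG dinucleotide fraction of a sequence. The slides do not show how HOMER computes these values internally, so the definitions below are common conventions assumed for illustration.

```python
def gc_and_cpg_content(sequence):
    """Return (GC fraction, CpG dinucleotide fraction) for one sequence."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    # Fraction of bases that are G or C.
    gc = sum(base in "GC" for base in seq) / len(seq)
    # Fraction of adjacent base pairs that read 'CG'.
    cpg = sum(seq[i:i + 2] == "CG" for i in range(len(seq) - 1)) / (len(seq) - 1)
    return gc, cpg

# Hypothetical peak sequence, invented for illustration.
gc, cpg = gc_and_cpg_content("ACGCGTTA")
print(f"GC = {gc:.2f}, CpG = {cpg:.2f}")
```

Matching such values between peak and background sequences is what allows the later enrichment steps to avoid reporting motifs that merely reflect GC-rich regions.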
HOMER
• Among the generated result files, two HTML-formatted reports will be available:
  – homerResults.html: formatted output of de novo motif finding
  – knownResults.html: formatted output of known motif finding
http://homer.salk.edu/homer/ngs/peakMotifs.html
Looking for significant GO enrichment
• We can look at the biological significance of our peaks using Gene Ontology (GO) term annotations of the genome
  – GO: set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (1)
• Popular tool: the Genomic Regions Enrichment of Annotations Tool (GREAT)
http://bejerano.stanford.edu/great/public/html/index.php
(1) Gene Ontology Consortium. "The gene ontology project in 2008." Nucleic acids research 36.suppl 1 (2008): D440-D444.
GREAT: Cis-regulatory region function prediction
• Binding sites are often not located in the proximal region of the gene of interest
• GREAT looks beyond this proximal region
McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010 May;28(5):495-501.
GREAT
• Input: BED file with regions of interest
• Output: Matching GO terms for Molecular Functions, Biological Processes, Phenotypes, Diseases, etc.
• Example with H3K27ac peaks from a bone marrow sample
http://bejerano.stanford.edu/great/public/html/index.php
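The GREAT paper (McLean et al., 2010) describes scoring each term with a binomial test over the input regions: given the fraction of the genome covered by the regulatory domains of genes carrying a term, how surprising is the number of input regions that land there? The sketch below computes that tail probability with the standard library only. All numbers are hypothetical, and this is a simplified view of the statistic, not a reimplementation of GREAT.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers: 200 input regions, 30 of which fall in the regulatory
# domains of genes annotated with a term whose domains cover 8% of the genome.
n_regions = 200
n_hits = 30
genome_fraction = 0.08

p_value = binomial_tail(n_regions, n_hits, genome_fraction)
print(f"expected hits: {n_regions * genome_fraction:.1f}, observed: {n_hits}")
print(f"binomial p-value: {p_value:.3g}")
```

In practice one such test is run per ontology term, so the resulting p-values need multiple-testing correction before interpretation.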
Linking GWAS variants to ChIP-Seq data
Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248
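One simple way to start linking variants to ChIP-Seq data is to ask which variant positions fall inside called peaks. The sketch below does this with a plain scan, assuming 0-based half-open peak intervals (as in BED files) and 0-based variant positions. Variant names and coordinates are invented for this example, and this is not the enrichment analysis performed in the cited paper.

```python
def variants_in_peaks(variants, peaks):
    """Return names of variants located inside at least one peak.

    variants: list of (name, chrom, pos) with 0-based positions.
    peaks: list of (chrom, start, end), 0-based and half-open like BED.
    """
    hits = []
    for name, chrom, pos in variants:
        if any(c == chrom and start <= pos < end for c, start, end in peaks):
            hits.append(name)
    return hits

# Hypothetical peaks and variants, for illustration only.
peaks = [("chr1", 100, 200), ("chr2", 500, 650)]
variants = [
    ("variant_a", "chr1", 150),  # inside the chr1 peak
    ("variant_b", "chr1", 200),  # just past the half-open end
    ("variant_c", "chr2", 500),  # first base of the chr2 peak
    ("variant_d", "chr3", 10),   # chromosome without peaks
]
print(variants_in_peaks(variants, peaks))
```

The scan is quadratic in the worst case; for genome-scale inputs an interval index or a tool such as bedtools would be the usual choice.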
Integrative analysis with Roadmap data
Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248
2 - Working with public datasets
Working with public datasets
• Many large consortia offer datasets for multiple tissues / diseases / conditions
• These are free resources to do more with comparative studies
• Public datasets offer no control over how assays were done, and what information is available
Roadmap Epigenomics Project
http://www.roadmapepigenomics.org/
ENCODE Consortium
https://www.encodeproject.org/
IHEC
• Nowadays, the most recent and complete resource is IHEC, the International Human Epigenome Consortium
• International effort with several funding agencies
http://ihec-epigenomes.org/
What is IHEC
• Goal: Providing standardized reference epigenomes for a variety of normal and disease tissues
  – Member groups take part in committees working on standards (assays, data/metadata distribution, ethics…)
IHEC Data Portal
• Goal: Integrate epigenomic public datasets produced within the International Human Epigenome Consortium
  – Raw data is in controlled access repositories
• As of April 2016:
  – over 7,000 human datasets
  – datasets from 7 consortia, others coming
• Offers tools for dataset discovery, visualization and pre-analysis
Publicly accessible datasets
• Datasets made available in the IHEC Data Portal are publicly accessible for everyone’s own research
• Human data offered by such consortia usually falls in one of two categories:
  – Controlled access data
    • Raw data from sequencers
    • Clinical/sensitive information such as phenotypes
    • Archived at repositories such as EGA and dbGaP
  – Public data
    • Annotation tracks, to use in tools such as UCSC Genome Browser, Ensembl and IGV
    • Some donor, sample and library metadata
    • Freely downloadable
3 - Quality control for online resources
Quality control on epigenomics datasets
• Datasets obtained online are of variable levels of quality
• Quality of downloaded datasets must be assessed
• Examples of quality control tests:
  – Raw data: FastQC
  – Signal:
    • Signal-to-noise ratio
    • ChromImpute
    • Whole-genome signal correlation across tracks
• Some QC tools are available as online resources
  – IHEC Data Portal includes some preliminary quality control tests, such as a Pearson correlation test over whole track signal
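The whole-track correlation check can be sketched as a Pearson correlation between per-bin signal values of two tracks. How the IHEC Data Portal bins and normalizes the signal is not described in these slides, so the binned vectors below are hypothetical and the function is a generic textbook implementation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signal vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of the same length (>= 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / sqrt(var_x * var_y)

# Hypothetical mean signal per genomic bin for three tracks.
replicate_1 = [0.1, 0.4, 3.2, 2.9, 0.2, 0.1]
replicate_2 = [0.2, 0.5, 3.0, 3.1, 0.1, 0.2]
other_mark = [2.8, 2.5, 0.1, 0.3, 2.9, 3.0]

print(f"replicate vs replicate: {pearson(replicate_1, replicate_2):.3f}")
print(f"replicate vs other mark: {pearson(replicate_1, other_mark):.3f}")
```

Tracks from the same assay and tissue are expected to correlate strongly; a track that correlates poorly with its peers is a candidate for closer inspection.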
ChromImpute
• Allows imputing missing signal tracks
• To impute a sample for a mark, uses training data:
  – from other samples with the same mark
  – from the other marks for the given sample
Ernst J, Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nature Biotechnology, 33:364-376, 2015.
4 - Online visualization and analysis tools
Online visualization and analysis tools
• Many additional resources are useful for visualizing and manipulating datasets
• In this section, we will cover a few:
  – IHEC Data Portal
  – UCSC Genome Browser
  – WashU Epigenome Browser
  – Galaxy
IHEC Data Portal - Overview
IHEC Data Portal - Data Grid
IHEC Data Portal - Datasets Correlation
IHEC Data Portal - Download
IHEC Data Portal - Coming Soon
• Comprehensive filtering based on available metadata
• Metadata extraction feature in human-readable and machine-readable formats
• Centralized data serving
• Links to permanent sessions, for easier citing and shareability
Visualizing tracks with the UCSC Genome Browser
UCSC Genome Browser Track Hubs
• Tracks can be aggregated using a text document in the UCSC Genome Browser track hub format
• Advantage: Can be easily distributed to collaborators / users of your resources
• Drawback: Need to generate this text document
• Documentation:
  – https://genome.ucsc.edu/goldenpath/help/hgTrackHubHelp.html
Small track hub example

track McGill_MS000101_monocyte_RNASeq_signal_forward
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_forward.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_forward

track McGill_MS000101_monocyte_RNASeq_signal_reverse
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_reverse.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_reverse

• Minimum properties for a track:
  – track: Symbolic name of the track
  – type: One of the supported formats
    • bigWig, bigBed, bigGenePred, bam, halSnake, vcfTabix
  – bigDataUrl: Web location (URL) of the data file
  – shortLabel: Short track description (max 17 characters)
  – longLabel: Longer track description (displayed over tracks in the browser)
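Since the main drawback of track hubs is having to write this text document, a small generator can help. The sketch below emits stanzas with the five minimum properties listed above and enforces the 17-character shortLabel limit stated on the slide. The function name and its parameters are hypothetical; the example values are taken from the stanzas shown above, and a complete hub also needs the hub and genomes files described in the UCSC documentation.

```python
def track_stanza(track, track_type, big_data_url, short_label, long_label):
    """Format one track stanza with the five minimum properties."""
    if " " in track:
        raise ValueError("track must be a symbolic name without spaces")
    if len(short_label) > 17:
        raise ValueError("shortLabel is limited to 17 characters")
    lines = [
        f"track {track}",
        f"type {track_type}",
        f"bigDataUrl {big_data_url}",
        f"shortLabel {short_label}",
        f"longLabel {long_label}",
    ]
    return "\n".join(lines)

base_url = "http://epigenomesportal.ca/public_data"
stanzas = []
for strand in ("forward", "reverse"):
    stanzas.append(track_stanza(
        track=f"McGill_MS000101_monocyte_RNASeq_signal_{strand}",
        track_type="bigWig",
        big_data_url=f"{base_url}/MS000101.monocyte.RNASeq.signal_{strand}.bigWig",
        short_label="000101mono.rna",
        long_label=f"MS000101 | human | monocyte | RNA-Seq | signal_{strand}",
    ))

# Stanzas are separated by a blank line in the track database file.
track_db = "\n\n".join(stanzas) + "\n"
print(track_db)
```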
WashU Epigenome Browser
• Supports many track types included in the UCSC Browser
  – BigBeds are on the way
  – Can also load UCSC track hub documents
http://epigenomegateway.wustl.edu/browser/
Galaxy
• Web-based framework offering a user-friendly interface mapping to most popular bioinformatics tools
  – "Data intensive biology for everyone."
• Allows for reproducible results
  – Steps/parameters kept in history
Galaxy Interface
• Many tools covered in this workshop are available in Galaxy
Galaxy - Pipeline design
• Ability to design custom pipelines and import others’
  – All through a user-friendly GUI
• Tailored for small/medium scale projects with not too many samples
GenAP
• GenAP is a Canadian computing platform for life science researchers
• Leverages CANARIE high-speed network and Compute Canada (CC) High Performance Computing
• Users can create their own private, fully configured Galaxy and run their analyses on Compute Canada HPCs
• Free for Canadian academia
  – All you need is to get a Compute Canada account
GenAP Pipelines
• Free, open-source software written in Python
• Many pipelines available, such as for epigenomics:
  – RNA-Seq
  – RNA-Seq De novo
  – ChIP-Seq
  – Methylation pipeline coming soon
• All software requirements are pre-installed at many Compute Canada HPCs
GenAP Pipelines
https://bitbucket.org/mugqic/mugqic_pipelines
GenAP - Galaxy
• Private Galaxy instance, sharable with collaborators
• Compute jobs making use of group’s CC allocation
  – Faster than usegalaxy.org
GenAP Portal
• Log in with your Compute Canada account
GenAP Portal
• You’re then ready to connect to the Portal
GenAP Portal - Preparing Galaxy
• Instantiating a Galaxy application within GenAP
Conclusion
• In this unit, we have covered:
  – Some types of downstream analyses with epigenomic data
  – How to obtain publicly accessible datasets for your own analyses
  – Methods to assess the quality of public data
  – How to visualize epigenomic datasets using online tools
  – Some online resources to run additional analyses with a web interface
• The following workshop will provide an introduction to some of the tools presented in these slides
• After the workshop, if you’re in Canadian academia, get that Compute Canada / GenAP account! ☺