RNA-Seq in Galaxy: Tuxedo protocol
Igor Makunin, UQ RCC, QCIF
Acknowledgments
Genomics Virtual Lab: gvl.org.au
Galaxy for tutorials: galaxy-tut.genome.edu.au
Galaxy Australia: galaxy-aust.genome.edu.au
Contributors and participants:
Plan for today
• Galaxy
• Data types used in RNA-Seq analysis
• RNA-Seq practical
• Galaxy workflow
High-throughput sequencing
Big-scale sequencing
• 100,000,000s of sequences, or reads, per experiment
• sequencing of a (random) library
• low cost per nucleotide
Popular technologies:
• Illumina
• Ion/Proton
• PacBio
Emerging technologies
• Oxford Nanopore MinION
Analysis of NGS data
• Big data sets
• Computationally intensive
• Dedicated tools and data types
• Extensive use of public data
Storage
Computational resources
Public data
Knowledge and skills
Tools
Galaxy: what does it look like?
Screenshot labels: Working window, Upload, History menu
Galaxy history system
Screenshot labels: History menu, Refresh
Source: http://galaxyproject.github.io/training-material/topics/introduction/tutorials/galaxy-intro-history/tutorial.html
Public Galaxy servers
Advantages of registration:
• access to histories over a long time
• multiple histories
• ability to use Galaxy from different devices
• bigger quotas (on some servers)
• FTP

• Independent registration on every Galaxy server
• Different tools, different user policies
• Data can be moved between Galaxy servers
Galaxy servers:
• usegalaxy.org
• usegalaxy.eu
• galaxy-tut.genome.edu.au
• galaxy-aust.genome.edu.au
Galaxy Australia
Fewer jobs on weekends
Chart: Jobs per day
galaxy-aust.genome.edu.au
Worker nodes: 16 CPUs, 64 GB RAM
49 TB volume storage (user data)
Designed for genome-scale research
>1,600 registered users
Up to 16 CPUs and 60 GB RAM per job
Up to 12 concurrent jobs per user
Up to 1 TB per user
Tuxedo protocol
GVL Basic RNA-Seq Galaxy tutorial
Trapnell et al. (2012) Nature Protocols
Visualise alignment with IGV
Diagram labels: FASTQ, FASTA, GFF, BAM, Genome browser
FASTQ format
    @SRR3145.19 ILLUMINA-C32_FC:3:1:80:12/1
    TAGCAGCACATCATGGTTTACATCGTATGC
    +
    IIHIDIIIIIIIIIIIIIHIHIIIIIDGIB
• Line 1: name, always starts with @
• Line 2: sequence
• Line 3: always starts with +; may have the name
• Line 4: encoded Phred quality score
Read layouts: single-end reads, paired-end reads
Terminology: a read is a sequence with quality score values produced by a sequencing machine
Common output format: FASTQ compressed with gzip, e.g. SRR3145_1.fq.gz
Multiple reads in a single FASTQ file
Each read is described by four lines
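The four-lines-per-read layout can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated FASTQ library: the `Read` type, the `parse_fastq` function and the two example reads are hypothetical, and the parser assumes each read occupies exactly four lines with no wrapped sequences.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str       # text after the leading '@'
    sequence: str   # nucleotides
    quality: str    # encoded Phred scores, one character per nucleotide


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream, assuming four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield Read(header[1:], sequence, quality)


# Two made-up reads, used only to exercise the parser.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\nIIHI\n"
reads = list(parse_fastq(io.StringIO(example)))
print(len(reads), reads[0].name, reads[1].quality)
```

For a gzip-compressed file such as SRR3145_1.fq.gz, the same function could be given a handle opened with `gzip.open(path, "rt")` from the standard library.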
FASTQ Phred quality score
A Phred quality score is a measure of the quality of the identification for every nucleotide.
Values are encoded by characters: Quality + Offset
39 + 33 = 72
ASCII(72): H
Range: ~0 to ~40
• Phred 10: accuracy 90%
• Phred 20: accuracy 99%
• Phred 30: accuracy 99.9%
• Phred 40: accuracy 99.99%
Advantage: a single character is used instead of a two-digit number
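The encoding arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative only; the accuracy formula is the standard Phred definition, accuracy = 1 - 10^(-Q/10), which reproduces the 90% / 99% / 99.9% / 99.99% figures above.

```python
def phred_to_char(quality: int, offset: int = 33) -> str:
    """Encode a Phred score as a single ASCII character (quality + offset)."""
    return chr(quality + offset)


def char_to_phred(char: str, offset: int = 33) -> int:
    """Decode a single quality character back to a Phred score."""
    return ord(char) - offset


def phred_to_accuracy(quality: int) -> float:
    """Probability that the base call is correct: 1 - 10^(-Q/10)."""
    return 1 - 10 ** (-quality / 10)


print(phred_to_char(39))   # 39 + 33 = 72, the character 'H'
print(char_to_phred("I"))  # 73 - 33 = 40
for q in (10, 20, 30, 40):
    print(q, f"{phred_to_accuracy(q):.2%}")
```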
    @S391 ILLUMINA_FC:3:80:12/1
    TAGCAGCACATCATGGTTTAC
    +
    IIHIDIIIIIIIIIIIIIHIH
ASCII table
Phred quality score encoding
• Offset 33: Sanger
• Offset 64: old Illumina
Source: https://en.wikipedia.org/wiki/FASTQ_format
Qual. = 40, Offset = 33: 40 + 33 = 73, ASCII(73): I
FASTQ quality score in Galaxy
Many old Illumina datasets have a proprietary data encoding (offset 64)
Currently most NGS datasets use the Sanger encoding (offset 33)
Galaxy
By default Galaxy assigns the 'fastq' datatype to uploaded FASTQ files. In this case the offset is not specified, and many tools do not recognize the data.
• fastqillumina: old Illumina quality score encoding (offset 64, Illumina 1.3+)
• fastqsanger: new Illumina 1.8+ / Sanger quality score encoding
Some tools in Galaxy now work only with the fastqsanger datatype
Solution:
- specify the fastqsanger or fastqillumina datatype during upload
- change the format via Attributes > Datatype
- use the NGS: QC and manipulation > FASTQ Groomer tool
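Before choosing a datatype it can help to look at the quality characters themselves. The sketch below is a rough heuristic and not how Galaxy or FASTQ Groomer decide: the function name and the two cut-off values (59 and 74) are assumptions based on the usual character ranges of the two encodings, and data made only of high-quality characters cannot be distinguished.

```python
from typing import Iterable, Optional


def guess_fastq_datatype(quality_strings: Iterable[str]) -> Optional[str]:
    """Guess 'fastqsanger' (offset 33) or 'fastqillumina' (offset 64).

    Returns None when the observed characters fit both encodings.
    """
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    # Characters below ';' (code 59) do not occur in offset-64 data.
    if min(codes) < 59:
        return "fastqsanger"
    # Characters above 'J' (code 74) are beyond the usual offset-33 range.
    if max(codes) > 74:
        return "fastqillumina"
    return None  # ambiguous: inspect more reads


print(guess_fastq_datatype(["!''*IIII"]))  # low characters: offset 33
print(guess_fastq_datatype(["hhhgfhhh"]))  # high characters: offset 64
print(guess_fastq_datatype(["IIHID"]))     # fits both ranges
```

Scanning a few thousand reads rather than one is usually needed before the lowest-quality characters show up.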
Reference genomes
Genome Reference Consortium: … a consensus representation of the genome.
FASTA format
The human reference sequence GRCh37 (hg19) contains the mitochondrial genome, 22 autosomes, chrX, chrY, 9 haplotype chromosomes, 39 unplaced contigs, and 20 unlocalized contigs.
Genomes are big. GRCh38.p10 total non-N bases: 3,080,585,178
Genomes may have many assembly versions (releases, builds): mm9, mm10
Use the same assembly version for the reference sequence and gene annotations.
Order of sequences/contigs might be important for some tools.
"chr1" and "1" are not identical for some tools.
http://hgdownload.cse.ucsc.edu/gbdb/hg19/html/description.html
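A quick consistency check can catch the "chr1" versus "1" problem before a long job is submitted. The sketch below is illustrative only: both function names and the toy FASTA and annotation inputs are made up, and it simply reports annotation sequence names that do not occur in the reference.

```python
from typing import Iterable, List


def fasta_sequence_names(lines: Iterable[str]) -> List[str]:
    """Collect sequence names: the text after '>' up to the first whitespace."""
    return [line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()]


def names_missing_from_reference(reference_names: Iterable[str],
                                 annotation_names: Iterable[str]) -> List[str]:
    """Return annotation sequence names that are absent from the reference."""
    return sorted(set(annotation_names) - set(reference_names))


# Toy inputs: a UCSC-style reference and annotations named without 'chr'.
reference_fasta = [">chr1 hypothetical description", "ACGT", ">chr2", "GGCC"]
annotation_seqids = ["1", "2", "1"]

missing = names_missing_from_reference(
    fasta_sequence_names(reference_fasta), annotation_seqids)
print(missing)  # a non-empty list means the naming schemes differ
```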
Gene annotations
Coordinate-based: linked to a particular genome assembly, e.g., hg19
GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines.
Popular versions: GTF (= GFF2), GFF3
Tab-separated fields
    ##gff-version 3
    ctg123	.	mRNA	1300	9000	.	+	.	ID=mrna0001;Name=sonichedgehog
    ctg123	.	exon	1300	1500	.	+	.	ID=exon00001;Parent=mrna0001
    ctg123	.	exon	1050	1500	.	+	.	ID=exon00002;Parent=mrna0001
    ctg123	.	exon	3000	3902	.	+	.	ID=exon00003;Parent=mrna0001
    ctg123	.	exon	5000	5500	.	+	.	ID=exon00004;Parent=mrna0001
    ctg123	.	exon	7000	9000	.	+	.	ID=exon00005;Parent=mrna0001
http://asia.ensembl.org/info/website/upload/gff3.html
Columns: seqid, source, type, start, end, score, strand, phase ('0', '1' or '2'), attributes
The first line must be a comment that identifies the version
start and end are both 1-based
Intervals
Coordinate-based: linked to a particular genome assembly, e.g., hg19
BED format: up to 12 columns of data (UCSC Table Browser), plus optional track header lines. Tab-separated fields.
GFF3 (start is 1-based):
    ##gff-version 3
    ctg123	.	mRNA	1300	9000	.	+	.	ID=mrna0001;Name=sonichedgehog
BED (chromStart is 0-based, chromEnd is exclusive):
    ctg123	1299	9000	sonichedgehog	.	+
BED columns: chrom, chromStart, chromEnd, name, score, strand
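The coordinate shift between the two formats is easy to get wrong, so here is a minimal Python sketch of converting one GFF3 feature line into a six-column BED line. The function name is hypothetical, and taking the BED name from the `Name` attribute (falling back to `ID`) is an assumption made for this example.

```python
def gff3_to_bed(line: str) -> str:
    """Convert one tab-separated GFF3 feature line to a 6-column BED line."""
    (seqid, _source, _type, start, end,
     score, strand, _phase, attributes) = line.rstrip("\n").split("\t")
    # Attributes are 'key=value' pairs separated by semicolons.
    attrs = dict(item.split("=", 1) for item in attributes.split(";") if item)
    name = attrs.get("Name", attrs.get("ID", "."))
    # GFF3 start is 1-based; BED chromStart is 0-based, so subtract one.
    # The end coordinate is unchanged because BED chromEnd is exclusive.
    bed_start = int(start) - 1
    return "\t".join([seqid, str(bed_start), end, name, score, strand])


gff3_line = "ctg123\t.\tmRNA\t1300\t9000\t.\t+\t.\tID=mrna0001;Name=sonichedgehog"
bed_line = gff3_to_bed(gff3_line)
print(bed_line)
```

The interval length is the same in both representations: 9000 - 1300 + 1 in GFF3 and 9000 - 1299 in BED.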
Aligners
Aligners map reads to a reference sequence.
Aligners use their own, tool-specific index files for mapping, e.g. bwa index hg19.fa (the resulting index works only for BWA and only for hg19).
Galaxy-qld provides indices for several genome assemblies (hg19, hg38, mm9, mm10, etc.)
Galaxy users can also use a custom reference sequence for alignment. In this situation the aligner creates indices in a temporary working directory for every job.
Contact Galaxy-qld admins if you plan to run many alignment jobs with a custom genome. We can add genome indices to the server.
Gapped alignment
Alignments: SAM and BAM
50x coverage of the human genome with read length 100 bp: ~1,500,000,000 reads
Compressed size of such an alignment can be >100 GB.
SAM: Sequence Alignment/Map. Plain text format.
BAM: binary (compressed) version of the alignment format.
SAM coordinates are 1-based
BAM coordinates are 0-based
BAMs are indexed for rapid access. Useful for alignment visualization.
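The read count quoted for 50x coverage follows from simple arithmetic: reads = coverage × genome size / read length. The sketch below uses the GRCh38.p10 non-N base count from the reference genome slide as the genome size, which is an assumption made for this example; the function name is illustrative.

```python
def reads_for_coverage(coverage: int, genome_size: int, read_length: int) -> int:
    """Number of reads needed so that coverage = reads * read_length / genome_size."""
    return coverage * genome_size // read_length


GRCH38_NON_N_BASES = 3_080_585_178  # total non-N bases, GRCh38.p10

reads_needed = reads_for_coverage(coverage=50,
                                  genome_size=GRCH38_NON_N_BASES,
                                  read_length=100)
print(f"{reads_needed:,}")  # roughly 1.5 billion reads
```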
It is always good to have a header!
    @HD VN:1.0 SO:queryname
    @RG ID:igGroup SM:igSmpl LB:igL1 PL:ILLUMINA
    @SQ SN:chr2L LN:23011544
    @PG ID:TopHat VN:2.0.14 CL:/mnt/galaxy/tools/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat --num-threads 5 ….
Read groups (@RG): can handle multiple samples in an alignment
SAM format
11 mandatory columns and optional fields with the TAG:TYPE:VALUE format
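To make the column layout concrete, here is a minimal Python sketch that splits one SAM alignment line into its 11 mandatory fields and its optional TAG:TYPE:VALUE fields. The field names follow the SAM specification; the function name and the example line are made up, and only integer-typed tags (`i`) are converted, everything else is kept as text.

```python
from typing import Any, Dict

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_record(line: str) -> Dict[str, Any]:
    """Parse one tab-separated SAM alignment line (not a header line)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM record needs at least 11 columns")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, columns[:11]))
    for field in INTEGER_FIELDS:
        record[field] = int(record[field])
    tags: Dict[str, Any] = {}
    for optional in columns[11:]:
        # VALUE may itself contain ':' so split at most twice.
        tag, value_type, value = optional.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


# A made-up alignment line with one optional field.
sam_line = "read1\t0\tchr4\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1"
record = parse_sam_record(sam_line)
print(record["RNAME"], record["POS"], record["tags"])
```

For real BAM files a dedicated library is the practical choice; this sketch only illustrates the text layout.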
Visualization of BAMs
Alignment on IGV
Galaxy servers can act as a track hub
It is possible to add multiple tracks: BAMs, gene annotations, known variants…
Genome browsers
Integrative Genomics Viewer, IGV
• Efficient genome viewer developed by the Broad Institute.
• Installable on personal computers.
• Can add a custom genome.
UCSC Genome Browser
• A big server in the US.
• Table Browser for data analysis (intersection)
• Supports data export to Galaxy
• Custom sessions (can save your tracks)
• liftOver tool
• Public track hubs
RNA-Seq with the Cufflinks package
GVL Basic RNA-Seq Galaxy tutorial
Trapnell et al. (2012) Nature Protocols
Visualise alignments
Data manipulation
Dataset:
• D. melanogaster
• Two conditions
• Three replicates
• Data for chr4
Setup for the workshop
GVL website: gvl.org.au
Register on Galaxy-tut: galaxy-tut.genome.edu.au
Basic Galaxy tutorial
Galaxy is a workflow engine
Screenshot labels: Select tool or input dataset; Add name, comments; Toolbox; Noodle; Input
A Galaxy workflow is a series of tools and dataset actions that run in sequence as a batch operation
Email notification
Galaxy workflow
Create a Galaxy workflow
• From history
• From scratch
Exercise
We will create a Galaxy workflow for RNA-Seq analysis without replicates: tophat2 > Cuffdiff > Filter