rna seq data therapy 12jan2018 - university of connecticut · analysis rna fastq fastq sam/bam...
TRANSCRIPT
RNA-Seq Analysis
Experimental Design
Sequencing
Data Quality Control
Read mapping
Reference Genome
Reference Transcriptome
Differential Expression analysis
RNA
fastq
fastq
SAM/BAM
fasta
GFF/GTF
Quality Control checks
• Reproducibility
• Reliability
Introduction
RNA-seq vs Microarray
• Higher sensitivity and dynamic range
• Lower technical variation
• Available for all species
• Novel transcript identification
• Alternate splicing
• Allele-specific expression
• Fusion genes
• Higher informatics cost
Reproducibility, Linearity and Sensitivity
<2% of genome
Mortazavi et al., Nature Methods 5, 621-628 (2008)
RNA isolation
Total RNA
Poly(A) selection
rRNA depletion
Size selection
Clinical samples (tissue biopsies)
Bacterial RNA samples
Small ncRNA
# of reads:
         GeneA  GeneB  GeneC
Sample1  6      1      4
Reads per kb of exon:
         GeneA  GeneB  GeneC
Sample1  3      1      1
Reads per kb of exon:
         GeneA  GeneB  GeneC  Total
Sample1  3      1      1      5
Sample2  6      3      6      15
Reads per kb of exon per million mapped reads (RPKM):
         GeneA  GeneB  GeneC  Total
Sample1  0.6    0.2    0.2    5
Sample2  0.4    0.2    0.4    15
Experimental Design
What are my goals?
• Transcript assembly
• Differential expression analysis
• Identify new/rare transcripts
What are the characteristics of my system?
• Large and complex genome
• Introns and high degree of alternative splicing
• No reference genome or transcriptome
• Biological comparison(s)
• Paired-end vs single-end
• Read depth
• Read length
• Replicates
Simple design: pairwise comparison
Two groups: Control, Experimental treatment
Complex design
Cancer Subtype A: - drug, + drug
Cancer Subtype B: - drug, + drug
Consult a statistician
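A complex design such as subtype × drug is commonly encoded as a design matrix with an interaction column. This is the form a statistician would typically ask for. The sketch below only illustrates that encoding. The sample names, the choice of three replicates per cell, and the column order are assumptions, not something specified in the slides.

```python
from itertools import product

def design_row(subtype, drug):
    """Return [intercept, subtypeB, drug, subtypeB:drug] for one sample."""
    is_b = 1 if subtype == "B" else 0
    treated = 1 if drug == "+" else 0
    return [1, is_b, treated, is_b * treated]

# Hypothetical layout: 2 subtypes x 2 treatments x 3 biological replicates.
samples = [
    (f"subtype{s}_{'drug' if d == '+' else 'ctrl'}_rep{r}", s, d)
    for s, d in product("AB", "-+")
    for r in (1, 2, 3)
]
design = {name: design_row(s, d) for name, s, d in samples}
```

The interaction column is 1 only for drug-treated Subtype B samples, so its coefficient captures whether the drug effect differs between the two subtypes.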
Read depth and read length
Small genome with no alternate splicing (yeast), annotated transcriptome:
10 million reads per sample, 50 bp single-end reads
Mammalian genomes (large transcriptome, alternative splicing, gene duplication):
30 million reads per sample
Transcriptome assembly (100X coverage of transcriptome):
50-200 million reads per sample, 100 bp paired-end reads
Nature of samples
• What is the expected purity of your sample?
• Is contamination or heterogeneity expected?
If yes, high coverage is needed to detect variants present at lower frequency, either because of impurity or because they come from minor (but possibly still interesting) subpopulations of your sample.
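A coverage target such as 100X can be converted into an approximate read count with reads = coverage × transcriptome size / bases per read (or per read pair). The sketch below is a rough estimate only. The 100 Mb transcriptome size is a hypothetical input. Because expression is uneven, average coverage says little about weakly expressed transcripts.

```python
import math

def reads_for_coverage(coverage, transcriptome_bp, read_length_bp, paired=True):
    """Reads (or read pairs, if paired) needed for a target average coverage."""
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(coverage * transcriptome_bp / bases_per_fragment)

# Hypothetical 100 Mb transcriptome, 100 bp paired-end reads, 100X coverage.
pairs_needed = reads_for_coverage(100, 100_000_000, 100, paired=True)
print(pairs_needed)  # 50000000 read pairs
```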
Replicates
Factors determining number of replicates:
• Variability in measurements (technical noise and biological variation)
• Statistical power analysis
Technical replicates
Not needed: high reproducibility at the sequencing step
Error-prone steps: RNA fragmentation, cDNA synthesis, adapter ligation, PCR amplification, bar-coding, lane loading
Spike-ins: quality control and library-size normalisation
Minimize batch effects: randomize samples at library preparation and across sequencing runs
Biological replicates
Not required for transcriptome assembly
Essential for differential expression analysis
Complex designs:
• 3+ for cell lines
• 5+ for inbred lines
• 20+ for human samples
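The advice to randomize samples across library preparation batches and sequencing runs can be sketched as a blocked randomization: shuffle within each condition, then deal samples round-robin so that every batch receives a mix of conditions. The function name, sample labels and fixed seed below are illustrative assumptions.

```python
import random

def randomize_to_batches(samples_by_group, n_batches, seed=0):
    """Shuffle within each group, then deal samples round-robin into batches."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for group in sorted(samples_by_group):
        members = list(samples_by_group[group])
        rng.shuffle(members)
        for sample in members:
            batches[slot % n_batches].append(sample)
            slot += 1
    return batches

groups = {
    "control": ["c1", "c2", "c3", "c4"],
    "treated": ["t1", "t2", "t3", "t4"],
}
batches = randomize_to_batches(groups, n_batches=2)
```

With this layout each of the two batches holds two control and two treated samples, so a batch effect is not confounded with the treatment.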
Scotty: http://scotty.genetics.utah.edu/
Sequencing
Illumina: sequencing by synthesis
SOLiD: "color-space" reads, low error rate
454: pyrosequencing, longer reads, low throughput
Pacific Biosciences (PacBio) / Oxford Nanopore: longer reads (recovery of full-length transcripts)
Data Quality Control
Sequence data format: FASTQ
Quality score: Phred+33
Read ID
Sequence
Machine ID
QC filter flag: Y = bad, N = good
Read pair #1
Read pair #2
Sample1_R1.fastq, Sample1_R2.fastq
R1
R2
Sample ID / Barcode
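The fields labelled above can be pulled out of a FASTQ file with a few lines of code. This is a minimal sketch that assumes strict four-line records and an Illumina-style header of the form `instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`; other header layouts would need different parsing. The example record is made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        # Phred+33: quality score = ASCII code of the character minus 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]

def parse_illumina_header(header):
    """Extract machine ID, read number, QC filter flag and barcode."""
    location, _, description = header.partition(" ")
    instrument = location.split(":")[0]
    read_number, is_filtered, _control, barcode = description.split(":")
    return {
        "machine": instrument,
        "read": int(read_number),        # 1 = R1, 2 = R2
        "failed_qc": is_filtered == "Y",  # Y = bad, N = good
        "barcode": barcode,
    }

# Made-up record, for illustration only.
example = io.StringIO(
    "@M01:12:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG\n"
    "GATTACA\n"
    "+\n"
    "IIIIII#\n"
)
for header, seq, quals in read_fastq(example):
    info = parse_illumina_header(header)
```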
Data Quality Assessment
Evaluate raw read library quality
• Sequence quality
• GC content for biases
• Adapter contamination
• K-mer overrepresentation
• Duplicate reads
• PCR artifacts
Software/Tools
• FastQC (command line)
Illumina read files
• NGSQC
Supports reads from any platform
Supports quality-based read trimming and filtering
• SAMStat (command line)
Also works with BAM alignment files
Sequence quality: quality scores over bases
Good
Bad: trimming required
Software/Tools: FastX-Toolkit, Trimmomatic, Sickle, Cutadapt
Phred 30 = 1 error / 1000 bases
Phred 20 = 1 error / 100 bases
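The Phred values above follow from the relation P(error) = 10^(-Q/10). The sketch below shows that conversion together with a deliberately simple 3' end trimmer. The trimmer only illustrates the idea and is not the algorithm used by Trimmomatic, Sickle or the other listed tools, which use more elaborate schemes. The threshold of 20 is an assumed default.

```python
def error_probability(phred):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)

def trim_3prime(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until one with quality >= min_quality is reached."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]

trimmed_seq, trimmed_quals = trim_3prime("GATTACA", [40, 40, 38, 35, 30, 12, 2])
```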
GC distribution: acceptable levels depend on source of sample
Good
Bad
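A per-read GC distribution of the kind plotted by QC tools can be approximated as follows. This is a rough sketch: it ignores ambiguous bases such as N and uses an assumed bin width of 10 percentage points. It is not how any particular tool computes its plot.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100 * gc / (gc + at) if gc + at else 0.0

def gc_histogram(reads, bin_width=10):
    """Count reads per GC bin; with bin_width=10, bin 40 covers 40% up to 50% GC."""
    histogram = {}
    for read in reads:
        bin_start = int(gc_percent(read) // bin_width) * bin_width
        bin_start = min(bin_start, 100 - bin_width)  # put 100% GC in the top bin
        histogram[bin_start] = histogram.get(bin_start, 0) + 1
    return histogram

histogram = gc_histogram(["GGCC", "AATT", "GCAT"])
```

A single smooth peak near the organism's expected GC content is the usual "good" pattern; extra peaks can indicate contamination or adapter sequence.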
Recommendations:
• Generate quality plots for all read libraries
• Trim and/or filter data if needed
Always trim and filter for de novo transcriptome assembly
• Regenerate quality plots after trimming and filtering to determine effectiveness
• Acceptable duplication, K-mer or GC content levels are experiment and organism specific, but the values should be homogeneous for samples in the same experiment.
• Outliers with >30% disagreement should be discarded.
Read Mapping
(1) With reference genome (with/without transcriptome).
(2) With reference transcriptome.
(3) Reference-free assembly.
RNA-seq: Assembly vs Mapping
RNA-seq reads
Reference-based RNA-seq
Ref: genome or transcriptome
De novo RNA-seq: contig 1, contig 2
Mapping with reference genome
Short reads (fastq)
Mapping to genome
Gapped mapper: TopHat, STAR, HISAT2
Cufflinks, htseq-count, subread
With GFF: transcript identification & counting
Without GFF: transcript discovery & counting
Functional annotation
Blast2GO (homology based)
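The counting step that follows genome mapping can be illustrated with a toy overlap counter. This is a simplified sketch loosely modelled on what annotation-based counters do. It treats each gene as a single interval, ignores strand, exons and spliced alignments, and uses made-up coordinates. The `__no_feature` and `__ambiguous` keys mirror the names htseq-count uses for its special counters.

```python
def count_reads_per_gene(alignments, genes):
    """Count reads whose aligned interval overlaps exactly one gene.

    alignments: iterable of (chrom, start, end), 1-based inclusive.
    genes: dict gene_id -> (chrom, start, end), e.g. taken from a GTF file.
    """
    counts = {gene_id: 0 for gene_id in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, start, end in alignments:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["__ambiguous"] += 1  # read overlaps more than one gene
        else:
            counts["__no_feature"] += 1  # read overlaps no annotated gene
    return counts

# Made-up annotation and alignments.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
alignments = [
    ("chr1", 120, 170),    # geneA only
    ("chr1", 460, 510),    # overlaps geneA and geneB
    ("chr1", 600, 650),    # geneB only
    ("chr1", 1000, 1050),  # intergenic
    ("chr2", 120, 170),    # other chromosome
]
counts = count_reads_per_gene(alignments, genes)
```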
Mapping with reference transcriptome
Short reads (fastq)
Mapping to transcriptome
Ungapped mapper: Bowtie2
RSEM, Kallisto, eXpress
Transcript identification & counting
Mapping without reference
Short reads (fastq)
Assembly into transcripts
De Bruijn graphs: Trinity, Oases, Velvet
Map reads back
Ungapped mapper: Bowtie
htseq-count, RSEM, eXpress
Functional annotation
GTF-based
Counting
Blast2GO (homology based)
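The De Bruijn graph idea behind the assemblers named above can be shown on a toy example: reads are broken into k-mers, each k-mer links its (k-1)-base prefix to its (k-1)-base suffix, and contigs are spelled out by following unambiguous paths. This is an illustrative sketch with made-up, error-free reads. Real assemblers also handle sequencing errors, coverage, branching and isoforms, none of which is modelled here.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Follow unambiguous edges from start and spell out the resulting contig."""
    contig, node, seen = start, start, {start}
    while len(set(graph.get(node, []))) == 1:
        node = graph[node][0]
        if node in seen:  # stop instead of looping forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # overlapping toy reads
graph = de_bruijn_graph(reads, k=4)
print(walk(graph, "ATG"))  # ATGGCGTGCA
```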
Alignment tools
Conesa et al., Genome Biology 17:13 (2016)
Statistical properties of edgeR (exact) as a function of log2(FC) threshold, T, and the number of replicates, n
Differential expression
Common tools for differential expression analysis
Decision tree for software selection
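To make the log2(FC) threshold T concrete, the sketch below computes a naive log2 fold change between two groups from library-size-scaled counts. It is only an illustration of the quantity being thresholded. It performs no statistical test and does not model dispersion, which is what dedicated tools such as edgeR do. The counts, the pseudocount of 1 and the function names are assumptions.

```python
import math

def cpm(counts):
    """Scale one sample's gene counts to counts per million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}

def log2_fold_changes(control_samples, treated_samples, pseudocount=1.0):
    """log2(mean treated CPM / mean control CPM) per gene, with a pseudocount."""
    control = [cpm(sample) for sample in control_samples]
    treated = [cpm(sample) for sample in treated_samples]
    result = {}
    for gene in control[0]:
        mean_c = sum(sample[gene] for sample in control) / len(control)
        mean_t = sum(sample[gene] for sample in treated) / len(treated)
        result[gene] = math.log2((mean_t + pseudocount) / (mean_c + pseudocount))
    return result

def passes_threshold(log2fc, threshold):
    """True if the absolute log2 fold change reaches the threshold T."""
    return abs(log2fc) >= threshold

# Made-up counts: one replicate per group, for illustration only.
lfc = log2_fold_changes([{"g1": 100, "g2": 900}], [{"g1": 400, "g2": 600}])
```

With real data each group would hold several biological replicates, and the fold change would be reported together with a p-value from a proper count model.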
Sources:
Conesa et al., Genome Biology 17:13 (2016)
Schurch et al., https://arxiv.org/abs/1505.02017
Zeng and Mortazavi, Nat Immunol 13(9), 802-7
Mortazavi et al., Nature Methods 5, 621-628 (2008)