Introduction to RNA-Seq
David Wood
Winter School in Mathematics and Computational Biology
July 1, 2013
RNA is...
Central: DNA, RNA, Protein, Epigenetics
Diverse: tRNA, mRNA, rRNA
Dynamic: Time, Abundance
Quantitative, Qualitative
Integrative
Understand the molecular basis of gene function. Classify and transform cellular states.
RNA studies involve...
Biological System
Technology
Available Resources
Questions
~/bin, Project, DB
This talk: focusing on reference-based mammalian RNA-seq analysis
Transcriptional Complexity
[Figure: gene model with multiple TSS, ATG and AAA sites. Legend: TSS = transcription start site; pA = polyadenylation signal; ATG = translation start site; AAA = polyadenylation; protein coding regions; non-coding regions; genomic DNA; microRNAs; spliced intron]
Also shown: PASR, miRNA, tiRNA, Alu
Mutations, Allelic Expression, RNA Editing
RNA-seq
[Figure: RNA-seq reads aligned to the gene model: non-spliced reads, junction reads, strand specific reads; Alu and mutations also marked]
Cloonan et al. Nat Methods 2008; 5:613-619
Advantages of RNA-seq
Discovery: genes, exons, junctions, UTRs, fusions (Present and Future)
[Figure: chart; axis labels and legend garbled in extraction]
Dynamic Range (Mortazavi et al. Nat. Methods 2008; 5:621–628)
Nucleotide Specific
Typical experiment workflow
Design Experiment
Sample Acquisition (Field / Clinic / Lab)
Run Experiment: Obtain RNA, Make Library, Sequencing
Base Calling, Mapping, Library QC
Analysis
Interpretation
Validation, Verification, Sample Acquisition
Publish
Stages: Field / Clinic, Wet Lab, Dry Lab
Analysis levels marked on the diagram: 1°, 2°, 3°
Library Construction
Cellular RNA: rRNA (80%), tRNA (15%), remaining 5%
Target RNA: Deplete rRNA, Enrich polyA RNA, Profile (ribosomes), Capture (tiling arrays)
Fragment
ds-cDNA synthesis
Ligate adaptors + Amplify
Sequencing
RNA-seq Mapping
Challenge #1: Introns
Align to database of junctions or transcriptome (Wood et al. Bioinformatics 2011; 27:580–581)
Split Read Alignments (Trapnell et al. Bioinformatics 2009; 25:1105-11)
Challenge #2: Correctness
Sufficient Overlap, Sufficient Evidence
Align to the transcriptome
Challenge #3: Multi-mappers
Sequence Similarity
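One way to picture the "database of junctions" approach: for each pair of exons joined in a gene model, concatenate the end of the upstream exon with the start of the downstream exon, so a read spanning the splice site can align without a gap. The sketch below is a minimal illustration under stated assumptions. The function name, the `read_length` and `min_overlap` parameters and the sequences are hypothetical, and this is not the implementation used by X-MATE or TopHat.

```python
def junction_sequence(upstream_exon, downstream_exon, read_length, min_overlap):
    """Build a reference sequence for one exon-exon junction.

    Each side keeps at most (read_length - min_overlap) bases, so a read
    aligned entirely within the junction sequence covers at least
    min_overlap bases on both sides of the splice site (provided each
    exon is at least min_overlap long).
    Returns the junction sequence and the offset of the splice site.
    """
    if not 0 < min_overlap <= read_length // 2:
        raise ValueError("min_overlap must be between 1 and read_length // 2")
    flank = read_length - min_overlap
    left = upstream_exon[-flank:]
    right = downstream_exon[:flank]
    return left + right, len(left)


# Hypothetical exon sequences, 6 nt reads, at least 2 nt on each side.
sequence, splice_offset = junction_sequence("AAAACCCCGG", "TTGGGGAAAA", 6, 2)
```

Limiting the flank length is one simple way to enforce the "sufficient overlap" requirement at alignment time rather than by filtering afterwards.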
RNA-seq Mapping
Data QC (clipping)
Align to Filter Set
Align to ‘genome’
Align to ‘junctions’
Split read Alignment
Choose Alignments, Disambiguate; Exclude; Flag and Exclude
Alignment Filtering (BAM output)
Library QC, Analysis
Decisions to make: reference? diploid? gene model? ESTs? Algorithm? rRNA, tRNA?
Tophat: Trapnell et al. Bioinformatics 2009; 25:1105-11
Library Quality Control (QC)
[Figure: library construction steps (Target RNA, Fragment, ds-cDNA synthesis, Ligate adaptors + Amplify, Sequencing), annotated with what each choice affects]
Affects RNA content (expression quantification)
Affects insert size (transcript identification)
Affects strand specificity
Affects library complexity (tag uniqueness)
Affects mapping rate
Paired-end?
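Library complexity (tag uniqueness) can be summarised as the fraction of distinct tags among all aligned reads. The sketch below assumes each aligned read has been reduced to a (chromosome, strand, start) tuple. That representation and the function name are assumptions made for illustration.

```python
def tag_uniqueness(alignments):
    """Fraction of distinct (chrom, strand, start) tags among all aligned reads."""
    alignments = list(alignments)
    if not alignments:
        raise ValueError("no alignments given")
    return len(set(alignments)) / len(alignments)


# Hypothetical alignments: four reads, two of which share the same tag.
reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "-", 250), ("chr2", "+", 40)]
uniqueness = tag_uniqueness(reads)
```

Highly expressed transcripts produce identical tags even in a good library, so a value like this is best compared between libraries of similar depth rather than read as an absolute score.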
Calculate Gene Expression
Gene A: 3500 nt (700 reads)
Gene B: 400 nt (160 reads)
Mortazavi et al. Nat. Methods 2008; 5:621–628
RPKM: Reads Per Kilobase per Million
$$\mathrm{RPKM} = \frac{R \times 10^{3} \times 10^{6}}{L \times N}$$
R = Gene Read Count
L = Length of gene
N = Library Size
Gene A: RPKM = 2.0; Gene B: RPKM = 4.0
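The RPKM calculation for the two example genes can be sketched as below. The library size is not given on the slide. The 100 million reads used here is an assumption, chosen because it reproduces the slide's values of 2.0 and 4.0.

```python
def rpkm(read_count, gene_length_nt, library_size):
    """Reads Per Kilobase per Million: R * 10^3 * 10^6 / (L * N)."""
    if gene_length_nt <= 0 or library_size <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e3 * 1e6 / (gene_length_nt * library_size)


# Assumed library size (not stated in the source).
N = 100_000_000
gene_a = rpkm(700, 3500, N)  # Gene A: 3500 nt, 700 reads
gene_b = rpkm(160, 400, N)   # Gene B: 400 nt, 160 reads
```

Gene B has fewer reads than Gene A but a higher RPKM, because its reads are spread over a much shorter gene.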
Further Normalisation
Normalise to “mappable” gene length (Koehler et al. Bioinformatics 2010)
[Figure: gene model containing a Repeat]
Scale Expression Values by TMM (Robinson et al. Genome Biology 2010; 11:R25)
[Figure: Cellular RNA and RPKM compared between Cond. 1 and Cond. 2]
Normalise to GC content of region (Benjamini et al. NAR; 2012)
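Normalising to mappable length replaces the gene length L in RPKM with the number of positions in the gene where a read can align uniquely. The sketch below assumes per-base mappability is available as a list of 0/1 flags, as might be derived from a mappability resource. The data layout, names and numbers are assumptions for illustration.

```python
def mappable_rpkm(read_count, mappability_flags, library_size):
    """RPKM using only uniquely mappable positions as the gene length."""
    mappable_length = sum(1 for flag in mappability_flags if flag)
    if mappable_length == 0:
        return None  # nothing in this gene can be measured with unique reads
    return read_count * 1e3 * 1e6 / (mappable_length * library_size)


# Hypothetical 1000 nt gene in which a 200 nt repeat is not uniquely mappable.
flags = [1] * 800 + [0] * 200
value = mappable_rpkm(80, flags, 1_000_000)
```

With the full 1000 nt length the same gene would score 80 RPKM, so ignoring the repeat would understate its expression.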
Calculate ‘Feature’ Expression
Features: Exonic Region, Exon Junction, Intronic Region, Exon Boundary, Intergenic Region
Calculate RPKM for any feature
Examples: Extended 3’ UTR, Retained Intron
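Because RPKM needs only a read count, a length and the library size, it applies to any interval-like feature. The sketch below assumes reads are reduced to start positions on a single chromosome and that features are half-open (start, end) intervals. An exon junction would instead need a count of junction-spanning reads. All names and coordinates are hypothetical.

```python
from bisect import bisect_left


def feature_rpkm(read_starts, features, library_size):
    """RPKM per named feature; features maps name -> (start, end), half-open."""
    starts = sorted(read_starts)
    result = {}
    for name, (start, end) in features.items():
        # Number of read starts falling in [start, end).
        count = bisect_left(starts, end) - bisect_left(starts, start)
        result[name] = count * 1e3 * 1e6 / ((end - start) * library_size)
    return result


reads = [105, 110, 150, 420, 990]
features = {
    "exonic_region": (100, 200),
    "intronic_region": (200, 700),
    "intergenic_region": (900, 1000),
}
values = feature_rpkm(reads, features, library_size=1_000_000)
```

A retained intron or an extended 3' UTR would show up here as an intronic or downstream region whose RPKM approaches that of the neighbouring exons.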
Calculate Transcript Expression
[Figure: alternative transcripts of one gene, with a diagnostic feature highlighted]
Approach #1: Expression calculated using diagnostic features
Strengths: Strong evidence; Easy to calculate
Limitations: Excludes transcripts; Sampling variability; Lacks statistical robustness; Dependent on gene model
ALEXA-seq: Griffith et al. Nat. Methods 2010; 7:843–847
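Approach #1 can be sketched as follows: a feature is treated as diagnostic when it belongs to exactly one transcript in the gene model, and that transcript's expression is taken from its diagnostic features alone. Transcripts with no diagnostic feature get no estimate, which is the "excludes transcripts" limitation. The data structures and names are illustrative assumptions, not ALEXA-seq's implementation.

```python
def diagnostic_expression(transcript_features, feature_counts, feature_lengths, library_size):
    """Per-transcript RPKM from features found in exactly one transcript."""
    owners = {}
    for transcript, features in transcript_features.items():
        for feature in features:
            owners.setdefault(feature, set()).add(transcript)

    expression = {}
    for transcript, features in transcript_features.items():
        unique = [f for f in features if len(owners[f]) == 1]
        if not unique:
            expression[transcript] = None  # no diagnostic feature: excluded
            continue
        reads = sum(feature_counts.get(f, 0) for f in unique)
        length = sum(feature_lengths[f] for f in unique)
        expression[transcript] = reads * 1e3 * 1e6 / (length * library_size)
    return expression


# Hypothetical gene model: tx2 shares all of its exons with other transcripts.
model = {"tx1": ["e1", "e2", "e3"], "tx2": ["e1", "e3"], "tx3": ["e1", "e3", "e4"]}
counts = {"e1": 50, "e2": 20, "e3": 40, "e4": 5}
lengths = {"e1": 200, "e2": 100, "e3": 300, "e4": 50}
result = diagnostic_expression(model, counts, lengths, library_size=1_000_000)
```

The estimate for tx3 rests on 5 reads over a 50 nt feature, which illustrates the sampling variability listed among the limitations.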
Approach #2: Expression estimated
Construct bipartite graph, then find minimum path
Strengths: Estimates expression for all transcripts; More statistically robust; Incorporates ambiguous reads
Limitations: Model can fail in complex / highly expressed regions; Error rate largely unknown
Cufflinks: Trapnell et al. Nat. Biotech. 2010, 28:511-515
Expressed or not?
[Figure: frequency histogram of log2 (expression) for Cond. 1, Cond. 2 and Cond. 3, split into not “expressed” and “expressed”]
Need to determine ‘expression’ cut-off value
1. Expressed if > 1 RPKM
Has literature support; Lacks sensitivity; Arbitrary
2. Expressed if above intergenic background (95th percentile)
Cut-off based on empirical evidence; Still somewhat arbitrary
3. Incorporate replicate information
Based on observed reproducibility; Requires replicates
[Figure: np-IDR against −log2 (expression) bins; series: Rep 1 vs Rep 2, Rep 2 vs Rep 1, Mean, Cut-off]
Choose what is reasonable for your experiment, be consistent!
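Options 1 and 2 differ only in where the cut-off comes from. The sketch below assumes RPKM values have already been computed for genes and for a set of intergenic regions. The nearest-rank percentile method, the function names and the numbers are choices made for this example.

```python
import math


def intergenic_cutoff(intergenic_rpkm, percentile=95):
    """Nearest-rank percentile of intergenic background expression."""
    values = sorted(intergenic_rpkm)
    if not values:
        raise ValueError("no intergenic values given")
    rank = max(1, math.ceil(percentile * len(values) / 100))
    return values[rank - 1]


def expressed_genes(gene_rpkm, cutoff=1.0):
    """Names of genes whose RPKM is strictly above the cut-off."""
    return {gene for gene, value in gene_rpkm.items() if value > cutoff}


# Hypothetical values.
background = [0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 2.5]
genes = {"geneA": 0.4, "geneB": 1.2, "geneC": 8.0}
fixed = expressed_genes(genes)                                     # option 1
empirical = expressed_genes(genes, intergenic_cutoff(background))  # option 2
```

In this toy data the two options disagree about geneB, which is why applying one rule consistently across an experiment matters.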
Nucleotide-Resolution Analysis
Imprinting (ICR)
eQTL, sQTL
Complex Traits
SNPs A, B, C: Allelic Fraction
[Figure: density of the fraction of RNA-seq reads matching the reference allele (0.0 to 1.0), showing Expected Mean and Observed Mean]
Reference bias (Degner et al. Bioinformatics 2009)
Map to a diploid genome (AlleleSeq: Rozowsky et al. Mol. Sys. Bio 2011)
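Reference bias appears as a mean reference-allele fraction above the expected 0.5 across heterozygous SNPs. The sketch below assumes per-SNP counts of reads supporting the reference and alternative alleles are already available. The names and counts are hypothetical.

```python
def reference_fraction(ref_reads, alt_reads):
    """Fraction of reads at a SNP that match the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("SNP has no coverage")
    return ref_reads / total


def mean_reference_fraction(snp_counts):
    """Mean reference fraction over SNPs given as (ref_reads, alt_reads) pairs."""
    fractions = [reference_fraction(ref, alt) for ref, alt in snp_counts]
    if not fractions:
        raise ValueError("no SNPs given")
    return sum(fractions) / len(fractions)


# Hypothetical read counts at three heterozygous SNPs.
snps = {"A": (30, 20), "B": (12, 8), "C": (27, 23)}
observed = mean_reference_fraction(snps.values())
expected = 0.5
```

An observed mean well above the expected value, as in this toy data, suggests that reads carrying the alternative allele are being lost at the alignment step.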
The future of RNA-seq (now)
Single Cell: Shalek et al. Nature 2013
Huge Cohort: Genotype-Tissue Expression project (GTEx); 900 donors, 30,000 RNA-seq data sets! Lonsdale et al. Nature Genetics 2013
Summary
1. Choose an alignment approach suitable for your experiment, available resources and tools
2. Assess library quality, specifically rRNA contamination, insert size, strand specificity and library complexity
3. Gene and ‘Feature’ Expression can be calculated using count data, and normalised by length, library size and GC content
4. Transcript expression calculation requires alternative approaches and algorithms, which, although common, are largely unproven
5. RNA-seq can interrogate nucleotide-specific questions, but be careful of alignment biases (diploid mapping can help here)
Questions and References
Cloonan et al. Nat Methods 2008; Stem cell transcriptome profiling via massive-scale mRNA sequencing
Mortazavi et al. Nat. Methods 2008; Mapping and quantifying mammalian transcriptomes by RNA-Seq
Wood et al. Bioinformatics 2011; X-MATE: A flexible system for mapping short read data
Trapnell et al. Bioinformatics 2009; TopHat: discovering splice junctions with RNA-Seq
Koehler et al. Bioinformatics 2010; The Uniqueome: A mappability resource for short-tag sequencing
Robinson et al. Genome Biology 2010; A scaling normalization method for differential expression analysis of RNA-seq data
Benjamini et al. NAR 2012; Summarizing and correcting the GC content bias in high-throughput sequencing
Griffith et al. Nat. Methods 2010; Alternative expression analysis by RNA sequencing
Trapnell et al. Nat. Biotech. 2010; Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform
Degner et al. Bioinformatics 2009; Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing
Rozowsky et al. Mol. Sys. Bio 2011; AlleleSeq: analysis of allele-specific expression and binding in a
Shalek et al. Nature 2013; Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells
Lonsdale et al. Nature Genetics 2013; The Genotype-Tissue Expression (GTEx) project