![Page 1: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/1.jpg)
RNAseq: isoform expression quantification and transcript assembly
Slides courtesy from S. Salzberg, C. Trapnell, L. Pachter and K. Okrah
![Page 2: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/2.jpg)
Corrada Bravo 10/30/09
Second-generation sequencing
mRNA fragments to be sequenced
![Page 3: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/3.jpg)
Second-generation sequencing: paired ends
mRNA fragments to be sequenced
In paired-end sequencing, reads are generated from both ends of a fragment.
![Page 4: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/4.jpg)
Goal: Develop and analyze a statistical model for measuring differential expression of isoforms of the same gene using RNA-Seq.
Source: Computational Genome Analysis
Recall:
![Page 5: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/5.jpg)
[Figure: gene model (5’ UTR, three AS exons, 3’ UTR) and the transcript population of its four isoforms]
Suppose we have a gene with 4 isoforms and 3 alternatively spliced (AS) exons, as shown above.
Isoform 1: true abundance measure θ1
Isoform 2: true abundance measure θ2
Isoform 3: true abundance measure θ3
Isoform 4: true abundance measure θ4
The goal is to estimate the true abundance measure of the 4 isoforms.
![Page 6: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/6.jpg)
Fragmented mRNAs: 54 total reads with 18 unique types.
[Figure: read types s1 to s18 tiled along the four isoforms; reads that cross exon junctions are marked as spliced reads]
![Page 7: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/7.jpg)
TopHat for second generation RNA-Seq: spliced read alignment
• Suitable for
  • short reads (25-50bp)
  • long reads (100+ bp)
  • paired end reads
• New features since 0.8x (Trapnell et al., Bioinformatics 2009)
  • Much faster, almost fully threaded
  • Semi-canonical introns (GC-AG and AT-AC) and some support for microexons
[Figure: pipeline step 1, TopHat: the read set is aligned to the genome]
![Page 8: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/8.jpg)
FPKM
• Expected number of Fragments Per Kilobase (of transcript) per Million fragments sequenced in an RNA-Seq experiment.
• These units are proportional to the $\theta_i$.
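The definition above can be turned into a one-line computation. The following is a minimal sketch, assuming we already have a fragment count for one transcript; the function name `fpkm` and the numbers are hypothetical and only illustrate the unit conversion.

```python
def fpkm(fragments, transcript_length_bp, total_fragments):
    """Fragments per kilobase of transcript per million fragments sequenced."""
    kilobases = transcript_length_bp / 1_000
    millions = total_fragments / 1_000_000
    return fragments / kilobases / millions


# Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
# out of 10 million fragments sequenced in total.
value = fpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```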
![Page 9: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/9.jpg)
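Projective normalization underestimates expression
[Figure: isoform a and isoform b; project all isoforms into genome coordinates]
R reads total, r reads for the gene:
- ra for isoform a
- rb for isoform b

$$\mathrm{FPKM}_g = \frac{1}{R}\left(\frac{r_a}{\mathrm{length}_a}\right) + \frac{1}{R}\left(\frac{r_b}{\mathrm{length}_b}\right)$$

$$\mathrm{FPKM}_{\mathrm{proj}(g)} = \frac{1}{R}\left(\frac{r_a + r_b}{\mathrm{length}_{\mathrm{proj}(g)}}\right)$$

but

$$\frac{r_a}{\mathrm{length}_a} \ge \frac{r_a}{\mathrm{length}_{\mathrm{proj}(g)}}, \qquad \frac{r_b}{\mathrm{length}_b} \ge \frac{r_b}{\mathrm{length}_{\mathrm{proj}(g)}}$$

so

$$\mathrm{FPKM}_g \ge \mathrm{FPKM}_{\mathrm{proj}(g)}$$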
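A small numeric check makes the inequality concrete. This is only a sketch: all read counts and lengths below are hypothetical, and `per_length_rate` is an illustrative helper that computes the quantity (1/R)(reads/length) used on this slide.

```python
def per_length_rate(reads, length, total_reads):
    """(1/R) * reads / length, the per-isoform term on the slide."""
    return reads / length / total_reads


# Hypothetical values, not taken from the slides.
R = 1_000_000            # total reads in the experiment
r_a, r_b = 300, 100      # reads assigned to isoform a and isoform b
length_a, length_b = 1_500, 2_000
length_proj = 2_500      # length of the projection of both isoforms onto the genome

fpkm_g = per_length_rate(r_a, length_a, R) + per_length_rate(r_b, length_b, R)
fpkm_proj = per_length_rate(r_a + r_b, length_proj, R)

print(fpkm_g, fpkm_proj)
print(fpkm_g >= fpkm_proj)
```

The projected length is at least as long as either isoform, so dividing the pooled reads by it can only lower the estimate.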
![Page 10: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/10.jpg)
How should expression levels be estimated?
• A-B are distinguished by the presence of splice junction (a) or (b).
• A-C are distinguished by the presence of splice junction (a) and change in UTR
• B-C are distinguished by the presence of splice junction (b) and change in UTR
[Figure: isoforms A, B and C with splice junctions (a) and (b) marked]
![Page 11: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/11.jpg)
How should expression levels be estimated?
• Longer transcripts contain more reads.
• Reads that could have originated from multiple transcripts are informative.
• Relative abundance estimation requires “discriminatory reads”.
![Page 12: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/12.jpg)
Isoform-level expression quantification
Jiang and Wong. Bioinformatics, 2009.
Salzman, Jiang and Wong. Statistical Science, 2011.
![Page 13: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/13.jpg)
[Figure: gene model (5’ UTR, three AS exons, 3’ UTR) and the transcript population of its four isoforms]
Suppose we have a gene with 4 isoforms and 3 alternatively spliced (AS) exons, as shown above.
Isoform 1: true abundance measure θ1
Isoform 2: true abundance measure θ2
Isoform 3: true abundance measure θ3
Isoform 4: true abundance measure θ4
The goal is to estimate the true abundance measure of the 4 isoforms.
![Page 14: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/14.jpg)
Example: mouse RNAseq data
Excerpt from Salzman, Jiang and Wong, "Statistical Modeling of RNA-Seq Data", p. 75:
FIG. 6. Visualization of RNA-Seq read pairs mapped to the mouse gene Rnpep in the CisGenome Browser (see Jiang et al., 2010). From top to bottom: genomic coordinates, gene structure where exons are magnified for better visualization, read pairs mapped to the gene. Reads are 32 bp at each end. A read that spans a junction between two exons is represented by a wider box.
amino peptidase, meaning that it is used to degrade proteins in the cell. After mapping, 116 read pairs were assigned to this gene, out of which 113 read pairs were used in the computation after outlier removal. Figure 6 presents the positions where the reads are mapped. The gene was picked because it has two alternatively spliced isoforms with a structure that makes distinguishing reads from each isoform challenging, and because the number of reads was small enough to visualize all of them in a simple figure.
5.1 Uniform Sampling Model
Any paired end read experiment can be treated as a single end read experiment by taking each paired end read and treating it as two distinct single end reads, one from each side of the pair. In this, the 113 paired end reads become 226 single end reads (without pairing information).
In the uniform sampling model, for either isoform, the sampling rate vector for each read sj can take at most two values: 2n when the isoform can generate read j and 0 when it cannot. Because there are only two isoforms, one of which (isoform 2) excludes one of the exons of the other (isoform 1), it is evident that in the uniform sampling model, there are only three categories for the two isoforms.
The total length of isoform 1 is 2,300. The total length of isoform 2 is 2,183. Hence, computing ai,j by summing over the sampling rate vectors of the reads in the same category, the three categories can be rep-
![Page 15: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/15.jpg)
Fragmented mRNAs: 54 total reads with 18 unique types.
[Figure: read types s1 to s18 tiled along the four isoforms]
Sampling rate:
The ability for each of the 54 reads to be sequenced depends on:
1. Transcript fragmentation.
2. Size selection.
3. Sequence-specific amplification of selection.
![Page 16: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/16.jpg)
3.3 Likelihood Function
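| | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 | s11 | s12 | s13 | s14 | s15 | s16 | s17 | s18 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| θ1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 |
| θ2 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 13 |
| θ3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 12 |
| θ4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 5 |
| nj | 4 | 4 | 4 | 3 | 3 | 3 | 5 | 5 | 1 | 2 | 4 | 4 | 4 | 2 | 2 | 2 | 1 | 1 | 54 |

For each read type, we only observe nj (the bottom row). We want to estimate the last column (transcript abundance).
Last lecture concentrated on using the sum over the entire table (54) for positions that overlap every transcript.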
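To make the distinction between observed and hidden quantities concrete, the sketch below recomputes the margins of the read-count table for this example. The variable names are illustrative; the counts are copied from the table.

```python
# Rows: isoforms 1..4; columns: read types s1..s18 (counts from the table).
counts = [
    [3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
]

# What the experiment reveals: one total per read type (column sums).
observed = [sum(column) for column in zip(*counts)]

# What we would like to know: reads per isoform (row sums), which stay hidden.
hidden_totals = [sum(row) for row in counts]

print(observed)       # [4, 4, 4, 3, 3, 3, 5, 5, 1, 2, 4, 4, 4, 2, 2, 2, 1, 1]
print(hidden_totals)  # [24, 13, 12, 5]
print(sum(observed))  # 54
```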
![Page 17: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/17.jpg)
3.3 Likelihood Function
[Table: the same read counts per isoform and read type as on the previous slide]
• In reality we only observe $n_j = \sum_{i=1}^{I} n_{ij}$.
• $n_j \sim \mathrm{Poisson}\left(\sum_{i=1}^{I} \theta_i a_{ij} = \theta^T a_j\right)$, where $\theta = (\theta_1, \ldots, \theta_I)^T$ and $a_j = (a_{1j}, \ldots, a_{Ij})^T$.
• Likelihood: $f_\theta(n_1, n_2, \ldots, n_J) = \prod_{j=1}^{J} \frac{(\theta^T a_j)^{n_j} e^{-\theta^T a_j}}{n_j!}$.
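The likelihood is easier to work with on the log scale. The following is a minimal sketch of the log-likelihood for the Poisson model on this slide, assuming `A[i][j]` stores a_ij (rows are isoforms, columns are read types). The toy matrix and counts are hypothetical, not data from the lecture.

```python
import math


def poisson_log_likelihood(theta, A, n):
    """log f_theta(n_1, ..., n_J) with n_j ~ Poisson(theta^T a_j).

    Assumes every mean theta^T a_j is strictly positive.
    """
    total = 0.0
    for j, n_j in enumerate(n):
        mean_j = sum(theta[i] * A[i][j] for i in range(len(theta)))
        # log of mean^n * exp(-mean) / n!, using lgamma(n + 1) = log(n!)
        total += n_j * math.log(mean_j) - mean_j - math.lgamma(n_j + 1)
    return total


# Hypothetical toy problem: 2 isoforms, 3 read types.
A = [[1, 1, 0],
     [0, 1, 1]]
n = [3, 5, 2]

ll_good = poisson_log_likelihood([3.0, 2.0], A, n)  # means (3, 5, 2) match n
ll_poor = poisson_log_likelihood([1.0, 1.0], A, n)  # means (1, 2, 1)
print(ll_good > ll_poor)
```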
![Page 18: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/18.jpg)
Uniform sampling model
• Appropriate for single read data. (transcript length is not considered)
Model for A:
Interpretation of abundance:
This choice of A means that $\theta_i = \frac{c_i}{\sum_{k} l_k c_k}$, where $l_i$ is the length of transcript i and $c_i$ is the number of copies of the ith transcript in the sample.
Remember FPKM!
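As a quick illustration of this interpretation, the sketch below computes the abundance of each transcript as its copy number divided by the length-weighted total, following the formula on this slide. The copy numbers and lengths are hypothetical.

```python
def abundances(copies, lengths):
    """theta_i = c_i / sum_k(l_k * c_k)."""
    denominator = sum(l * c for l, c in zip(lengths, copies))
    return [c / denominator for c in copies]


# Hypothetical copy numbers and transcript lengths (bp).
copies = [10, 30]
lengths = [2_000, 1_000]

theta = abundances(copies, lengths)
print(theta)

# With this normalization the length-weighted abundances sum to one.
print(sum(t * l for t, l in zip(theta, lengths)))
```

Like FPKM, the result is a per-length quantity normalized by the total amount of sequenced material, so it can be compared across transcripts of different lengths.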
![Page 19: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/19.jpg)
How do you fit it?
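• The use of a Poisson model makes things very easy.
• The idea is to use Maximum Likelihood Estimation: find the estimates that maximize the probability of the observed data under the Poisson model.
• Equivalent to a convex optimization problem:
$$\underset{\theta}{\text{maximize}} \;\; n^T \log(A^T\theta) - \mathbf{1}^T A^T\theta \quad \text{s.t. } \theta \ge 0$$
where the jth entry of $A^T\theta$ is $\theta^T a_j$.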
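One simple way to solve this problem without an optimization library is an EM-style multiplicative update, which keeps θ non-negative at every step. This is a swapped-in technique for illustration, not necessarily the solver used by the authors; the function name `fit_theta_em` and the toy data are hypothetical.

```python
def fit_theta_em(A, n, iterations=200):
    """Maximize n^T log(A^T theta) - sum(A^T theta) subject to theta >= 0.

    A[i][j] is a_ij (rows: isoforms, columns: read types), assumed non-negative
    with at least one positive entry per row.
    """
    num_isoforms, num_types = len(A), len(n)
    theta = [1.0] * num_isoforms
    row_sums = [sum(A[i]) for i in range(num_isoforms)]
    for _ in range(iterations):
        # Current Poisson means theta^T a_j for every read type.
        means = [
            sum(theta[i] * A[i][j] for i in range(num_isoforms))
            for j in range(num_types)
        ]
        # Multiplicative update; terms with a zero mean are skipped to avoid
        # dividing by zero once an abundance has collapsed to zero.
        theta = [
            theta[i]
            * sum(
                A[i][j] * n[j] / means[j]
                for j in range(num_types)
                if A[i][j] and means[j] > 0
            )
            / row_sums[i]
            for i in range(num_isoforms)
        ]
    return theta


# Hypothetical toy problem: 2 isoforms, 3 read types.
A = [[1, 1, 0],
     [0, 1, 1]]
n = [3, 5, 2]

theta_hat = fit_theta_em(A, n)
print(theta_hat)  # close to [3.0, 2.0]
```

In this toy case the counts can be matched exactly by θ = (3, 2), so the maximum likelihood estimate reproduces the observed counts.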
![Page 20: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/20.jpg)
RNAseq: transcript assembly and quantification
All slides courtesy of S. Salzberg, C. Trapnell and L. Pachter
Trapnell, et al. Nature Biotechnology, 2010.
![Page 21: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/21.jpg)
Overview of cufflinks
[Figure: loci g1, g2, g3, g4, ..., gm along the genome and the corresponding transcriptome]
• Partition into loci
• Minimal set of explaining transcripts via Dilworth’s theorem
• Statistical estimation of transcript abundances (ρt)
![Page 22: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/22.jpg)
Comparative transcript assembly
• Desirable properties of an assembly: consistency, parsimony and identifiability.
• Dilworth’s theorem and its application to transcript assembly.
• The Cufflinks assembler.
• Promoter discovery and novel isoforms.
• Lessons learned.
![Page 23: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/23.jpg)
Transcriptome assembly with a reference genome
How many transcripts?
We don’t know that two reads came from the same transcript, but we sometimes know that they came from different transcripts.
[Figure: reads aligned to the genome]
![Page 24: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/24.jpg)
A partial order on paired end read alignments
• Alignment x ≺ y when
  • x starts to the left of y in the reference
  • x and y overlap consistently
  • y is not contained in x
• That is, x ≺ y when they could have come from the same transcript
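The three conditions can be sketched in code under strong simplifying assumptions. Here an alignment is a sorted list of half-open exon blocks `(start, end)` in reference coordinates, gaps between blocks are treated as introns, and "overlap consistently" is taken to mean that both alignments cover exactly the same positions inside their common span. This representation and the function names are illustrative only and are not the Cufflinks implementation; in particular, the unsequenced gap between paired-end mates is ignored, and alignments that do not overlap at all are treated as trivially consistent.

```python
def covered(blocks, lo, hi):
    """Reference positions in [lo, hi) covered by the alignment's blocks."""
    return {p for s, e in blocks for p in range(max(s, lo), min(e, hi))}


def precedes(x, y):
    """Simplified test for x ≺ y on block-list alignments."""
    x_start, x_end = x[0][0], x[-1][1]
    y_start, y_end = y[0][0], y[-1][1]

    starts_left = x_start < y_start
    not_contained = y_end > x_end  # y extends past the end of x

    # Common span of the two alignments; empty if they do not overlap.
    lo, hi = max(x_start, y_start), min(x_end, y_end)
    consistent = covered(x, lo, hi) == covered(y, lo, hi)

    return starts_left and not_contained and consistent


# Hypothetical alignments.
x = [(0, 30)]             # unspliced read in an exon
y = [(20, 30), (50, 70)]  # spliced read: intron from 30 to 50
z = [(25, 80)]            # unspliced read running through that intron

print(precedes(x, y))  # True: same coverage on [20, 30)
print(precedes(x, z))  # True: same coverage on [25, 30)
print(precedes(y, z))  # False: y skips [30, 50), z covers it
print(precedes(z, y))  # False: z does not start to the left of y
```

In this toy case y and z are incomparable in both directions, so they form an antichain: they cannot come from the same transcript.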
![Page 25: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/25.jpg)
Dilworth’s theorem applied to the read partial order
• Definition: an antichain in the read partial order is a set of alignments with the property that no two are compatible (i.e. could arise from the same transcript).
• Theorem [R.P. Dilworth, “A decomposition theorem for partially ordered sets”, Annals of Mathematics, 1950]: The size of the largest antichain is equal to the minimum size of a chain partition.
![Page 26: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/26.jpg)
Dilworth’s theorem applied to the read partial order
• Definition: an antichain in the read partial order is a set of alignments with the property that no two are compatible (i.e. could arise from the same transcript).
• Theorem [R.P. Dilworth, “A decomposition theorem for partially ordered sets”, Annals of Mathematics, 1950]: The size of the largest antichain is the minimum number of transcripts needed to explain the alignments.
• There is a constructive proof of the theorem, which reduces the problem to finding a maximum matching in a bipartite graph. The Hopcroft-Karp algorithm solves this problem in time $O(\sqrt{V}\,E)$, where we have V = M, the number of fragments sequenced.
• We rely instead on a maximum weighted matching algorithm; the best running time for weighted maximum matching is $O(V^2 \log V + VE)$.
• This approach builds on ideas from N. Eriksson et al. (PLoS Computational Biology 2008) where a similar parsimony approach is used for viral population estimation.
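The reduction to bipartite matching can be sketched as follows. For brevity this sketch uses a simple augmenting-path matching (Kuhn's algorithm) rather than Hopcroft-Karp or the weighted matching mentioned above, and it ignores weights entirely. The relation `precedes_pairs` is a hypothetical, transitively closed partial order on four toy alignments; the function name is illustrative.

```python
def min_chain_partition(num_elements, precedes_pairs):
    """Split elements 0..num_elements-1 into the fewest chains.

    precedes_pairs is a set of (x, y) pairs meaning x ≺ y and is assumed to be
    transitively closed. Each element appears once on the left and once on the
    right of a bipartite graph; a matched edge (x, y) means y directly follows
    x in a chain, so the number of chains is num_elements minus the matching size.
    """
    successors = {
        x: [y for y in range(num_elements) if (x, y) in precedes_pairs]
        for x in range(num_elements)
    }
    predecessor_of = {}  # right vertex y -> left vertex x matched to it

    def try_augment(x, seen):
        for y in successors[x]:
            if y in seen:
                continue
            seen.add(y)
            if y not in predecessor_of or try_augment(predecessor_of[y], seen):
                predecessor_of[y] = x
                return True
        return False

    for x in range(num_elements):
        try_augment(x, set())

    # Follow matched edges from every element that has no predecessor.
    successor_of = {x: y for y, x in predecessor_of.items()}
    chains = []
    for start in range(num_elements):
        if start in predecessor_of:
            continue
        chain = [start]
        while chain[-1] in successor_of:
            chain.append(successor_of[chain[-1]])
        chains.append(chain)
    return chains


# Hypothetical alignments: 0 lies in a shared first exon, 1 and 2 support two
# different splice junctions (an antichain), 3 lies in a shared last exon.
precedes_pairs = {(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)}

chains = min_chain_partition(4, precedes_pairs)
print(chains)       # [[0, 1, 3], [2]]
print(len(chains))  # 2, the size of the largest antichain {1, 2}
```

Each chain corresponds to one candidate transcript, so two transcripts are enough to explain these four alignments.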
![Page 27: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/27.jpg)
Phasing splicing events using weighted matching
[Figure: example with elements labeled A, B, C, D and E]
![Page 28: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/28.jpg)
Properties of Cufflinks assemblies
• The assemblies are parsimonious: the number of assembled transcripts is guaranteed to be minimal.
• In the case of multiple minimal assemblies, likelihoods are compared in order to pick the best phasing.
• Identifiability of the resulting models is a corollary of Dilworth’s theorem (the maximum antichain is a permutation submatrix of the read-transcript matrix, hence the latter is full rank).
![Page 29: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/29.jpg)
Discovery is necessary for accurate abundance estimates
![Page 30: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/30.jpg)
RNA-Seq time course analysis
• Measuring changes in relative abundances over time.
• Isoform switching and generalizations.
• Inference of transcriptional versus post-transcriptional regulation.
![Page 31: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/31.jpg)
The skeletal myogenesis transcriptome
RNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation
[Figure: differentiation time course (starting at 0 hours) with samples at -24, 60, 120 and 168 hours; stages labeled myocyte, fusion and myotube]
Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832
Statistics for the four samples:
• Reads: 84,369,078; 140,384,062; 82,138,212; 123,575,666
• Alignments: 66,541,668; 103,681,081; 47,431,271; 89,162,512
• Alignments to junctions: 10,754,363; 19,194,697; 9,015,806; 17,449,848
• Transfrags: 58,008; 69,716; 55,241; 63,664
![Page 32: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from](https://reader033.vdocuments.site/reader033/viewer/2022051905/5ff783bd2894d81b457f5183/html5/thumbnails/32.jpg)
Dynamics of Myc expression