RNA SEQUENCING AND DATA ANALYSIS
Download slides and package
http://odin.mdacc.tmc.edu/~rverhaak/package.zip
http://odin.mdacc.tmc.edu/~rverhaak/RNA-seq-lecture.zip
Overview
Introduction into the topic
  RNA species
  Experimental design considerations
  Analytical approaches
Discussion of our analysis pipeline
  Technical details
  Application on TCGA data sets
  Results
Hands on
All RNA is not the same
Types of RNA:
  Messenger RNA
  Micro RNA
  Long non-coding RNA
  Ribosomal RNA
  Other…
Methods for RNA enrichment prior to library construction
Poly(A)-RNA selection: by hybridization to oligo-dT beads
  mature mRNA highly enriched
  efficient for quantification of gene expression levels
  limitation: 3’ bias correlating with RNA degradation
rRNA depletion: by hybridization to bead-bound rRNA probes
  rRNA sequence-dependent and species-specific
  all non-rRNA retained: premature mRNA, long non-coding RNA
Small RNA extraction: specific kits required to retain small RNA
  optional fine size-selection by gel or column
This lecture focuses on mRNA sequencing
Length of mRNA transcripts in the human genome
[Figure: distribution of mRNA transcript lengths in the human genome]
What is the optimal insert and read size for mRNA sequencing?
Alignment versus assembly
Assembly: Trinity, Cufflinks, ABySS
  Particularly useful when no reference genome is available, as in bacterial transcriptomes
Alignment: Bowtie, BWA, MOSAIK
  Maximum sensitivity, fewer false positives
Sequencing parameters
Read type (typically 36/51/76/101 bp):
  Single-end read: for efficient counting of transcript copy number and splicing sites
  Paired-end read: longer cDNA fragments and read lengths help to determine transcript structure, especially within gene families
Applications of RNA sequencing
  Quantification of transcript expression levels
  Detection of splice variation/different isoforms of the same gene
  Allele-specific expression levels
  Detection of fusion transcripts (such as BCR-ABL in CML)
  Detection of sequence variation (limited application)
  Validation of DNA sequence variants
RNA-seq expression levels are linear where microarrays get saturated or are insensitive
Expression is measured as ‘reads per kilobase per million’ (RPKM) to normalize for gene length and library size
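As a minimal sketch of the RPKM normalization named above, the following computes reads per kilobase per million mapped reads from a raw read count. The function name and the example numbers are illustrative assumptions, not values from the lecture.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # Dividing by (length / 1e3) and (library size / 1e6) is the same
    # as multiplying the raw count by 1e9 and dividing by both.
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

Normalizing by length makes long and short transcripts comparable within a sample; normalizing by library size makes samples of different depth comparable.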
Identification of fusion transcripts
Popular methods search for:
  Read pairs that map to two different genes
    Need to correct for gene homology
  Reads that span fusion junctions
    Split reads in half and align the halves separately
    Make a database of all possible fusion junctions and align full reads
Tools: PRADA, MapSplice, TopHat
Variant detection
Approximately 35% of mutations are covered sufficiently to be detected at a validation rate of ~ 80-90%.
All DNA mutations from TCGA renal cell clear cell carcinoma project
Reverse transcriptase step to convert RNA to cDNA complicates detection of RNA edits and mutations
Sequencing parameters
Read Depth
Minimum mapped reads: 10 million for quantitative analysis of mammalian transcriptome
More reads needed for splicing variant discovery and differential comparison among samples
Current output: 120-180 million raw reads / lane
Multiplex level: 4-12 libraries / lane recommended
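A quick arithmetic check of the numbers above, as a sketch: dividing the lane output by the multiplex level gives the raw reads available per library. The function name is an assumption for illustration, and the calculation treats reads as evenly split across libraries, which real runs only approximate.

```python
def reads_per_library(raw_reads_per_lane, libraries_per_lane):
    """Raw reads per library, assuming an even split across the lane."""
    return raw_reads_per_lane / libraries_per_lane


MIN_MAPPED_READS = 10e6  # minimum mapped reads for quantitative analysis

# Extremes of the ranges quoted above.
lowest = reads_per_library(120e6, 12)   # smallest lane output, highest multiplex
highest = reads_per_library(180e6, 4)   # largest lane output, lowest multiplex

print(f"{lowest / 1e6:.0f} to {highest / 1e6:.0f} million raw reads per library")
# At 12 libraries on a 120-million-read lane the raw count only just equals
# the 10 million mapped-read minimum, leaving no margin for unmapped reads.
print(lowest > MIN_MAPPED_READS)
```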
RNA sequencing in The Cancer Genome Atlas
mRNA: poly-A mRNA purified from total RNA using poly-T oligo-attached magnetic beads
miRNA: Total RNA is mixed with oligo(dT) MicroBeads and loaded into MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.
Implementation Results
Samples processed: >400 KIRC, >170 GBM
TFG-GPR128 fusion
Samples detected: 5 KIRC, >5 GBM
Samples processed: 321 normal, 85 tumor (BLCA, BRCA, HNSC, KIRC, KIRP, LIHC, LUAD, LUSC, PRAD, THCA)
INPUTS
  .fastq files [END1 & END2]
  Config.txt [location of scripts and reference files]
Processing Module
  Read Alignment
  Remap alignments
  Combine two ends
  Quality Scores Recalibrated
  Output: preprocessing .bam file [PAIRED END]
Expression & QC Module [YES | NO | ONLY]
  RNA-SeQC
  Output: RPKM & QC metrics
Fusion Module [YES | NO | ONLY]
  Output: Fusion Candidates
GUESS-ft [YES | NO | ONLY]
  -geneA -geneB
  Output: Supervised search evidence
RNA sequencing read alignment in PRADA
Transcripts from same gene
Reads are aligned to all possible transcripts
Reads are also aligned to genome
Final and single placement for each read is determined by re-mapping
PRADA alignments – advantages versus disadvantages
Advantage:
  Alignment to unannotated transcripts
  Alignment across exon-exon junctions
Disadvantage:
  Alignment approaches such as those used by MapSplice and Bowtie/TopHat typically split reads
  More conservative alignment than split-read
PRADA focuses on the analysis of paired-end RNA-sequencing data. Four modules:
1. Processing
2. Expression and Quality Control
3. Gene fusion
4. GUESS-ft: General User dEfined Supervised Search for fusion transcripts
http://sourceforge.net/projects/prada/
Processing Module
Sample reads are mapped to: transcriptome, genome
Widely used tools from the research community: Samtools, BWA, Picard, GATK
Enabled reference versions: hg18 | Ensembl52, hg19 | Ensembl64
Expression & QC Module
RNA-SeQC provides three types of quality control metrics: read counts, coverage, correlation
RPKM values at transcript level, for the longest transcript
RNA-SeQC process (Java)
Fusion Module
Discordant read pair: each end of the read pair maps uniquely to a distinct protein-coding gene.
Fusion spanning read: chimeric read that maps to a putative junction, while the mate read maps to either Gene A or Gene B.
[Figure labels: Gene A, Gene B]
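The two evidence types defined above can be sketched as a small classifier. This is an illustrative model only, not PRADA's code: the `ReadEnd` record, its fields, and `classify_pair` are hypothetical names, and real input would come from alignment records rather than hand-built objects.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class ReadEnd:
    """One end of a read pair (hypothetical, simplified record)."""
    gene: Optional[str] = None                 # gene the end maps to, None if unmapped
    unique: bool = True                        # True if the end maps to one location
    junction: Optional[Tuple[str, str]] = None # (gene A, gene B) if aligned to a putative junction


def classify_pair(end1: ReadEnd, end2: ReadEnd, protein_coding: Set[str]) -> str:
    # Fusion spanning: one end is chimeric across a junction and
    # its mate maps to either partner gene.
    for chimeric, mate in ((end1, end2), (end2, end1)):
        if chimeric.junction is not None and mate.gene in chimeric.junction:
            return "fusion_spanning"
    # Discordant: both ends map uniquely to two distinct protein-coding genes.
    both_mapped = end1.gene is not None and end2.gene is not None
    if (both_mapped and end1.gene != end2.gene
            and end1.unique and end2.unique
            and end1.gene in protein_coding and end2.gene in protein_coding):
        return "discordant"
    return "other"


coding = {"SFPQ", "TFE3"}
print(classify_pair(ReadEnd("SFPQ"), ReadEnd("TFE3"), coding))
print(classify_pair(ReadEnd(junction=("SFPQ", "TFE3")), ReadEnd("TFE3"), coding))
```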
Fusion Module Cont’d
Filters:
  Gene homology using blastn (bitscore 50)
  Ratio of fusion spanning and discordant reads
  Number of gene partners within a sample
    Remove “promiscuous” fusion pairs, i.e. with a large number of partners (e.g. >25)
  Number of distinct junctions
Filtered candidates:
  Up to 1 mismatch
  Unique sequences
  Unique start positions
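Two of the filters above, homology and partner promiscuity, can be sketched as follows. This is a hedged illustration rather than PRADA's implementation: the dictionary keys and the function name are hypothetical, and treating a blastn bitscore of 50 or more as "too homologous" is an assumption about the direction of the threshold.

```python
from collections import defaultdict


def filter_candidates(candidates, max_partners=25, max_bitscore=50.0):
    """Keep candidates from ONE sample that pass both filters.

    candidates: list of dicts with hypothetical keys
    'geneA', 'geneB', 'homology_bitscore'.
    """
    # Count distinct partners per gene within the sample.
    partners = defaultdict(set)
    for c in candidates:
        partners[c["geneA"]].add(c["geneB"])
        partners[c["geneB"]].add(c["geneA"])

    kept = []
    for c in candidates:
        if c["homology_bitscore"] >= max_bitscore:
            continue  # partner genes too similar (assumed threshold direction)
        if max(len(partners[c["geneA"]]), len(partners[c["geneB"]])) > max_partners:
            continue  # "promiscuous" gene with too many partners
        kept.append(c)
    return kept


sample = [
    {"geneA": "SFPQ", "geneB": "TFE3", "homology_bitscore": 26.5},
    {"geneA": "X1", "geneB": "X2", "homology_bitscore": 80.0},  # made-up homologous pair
]
print([(c["geneA"], c["geneB"]) for c in filter_candidates(sample)])
```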
[Figure: 180 bp fragment = 50 bp read + 80 bp + 50 bp read; 49 bp marked on each read]
$t_r = \frac{49 + 49}{180} \approx 0.5$
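The numbers in the figure (a 180 bp fragment, two 50 bp reads, 49 bp per read) suggest the following reading, offered here as an assumption: a junction can sit at 49 internal positions of each 50 bp read, so roughly (49 + 49) / 180 of the fragments covering a junction carry it inside a read. A small sketch with a hypothetical function name:

```python
def junction_in_read_fraction(fragment_len, read_len):
    """Fraction of junction positions along a fragment that fall inside a read.

    Assumes paired-end reads that do not overlap (2 * read_len <= fragment_len).
    """
    if 2 * read_len > fragment_len:
        raise ValueError("reads overlap; this simple model does not apply")
    positions_per_read = read_len - 1  # internal positions of one read
    return 2 * positions_per_read / fragment_len


fraction = junction_in_read_fraction(180, 50)
print(round(fraction, 3))
```

Under this model an observed ratio far from the expected value can flag a candidate whose spanning and discordant evidence do not agree.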
Fusion Module Cont’d
SampleID TCGA-BP-4756-01A-01R-1289-07
GeneA SFPQ
GeneB TFE3
Discordant_Pairs 350
Fusion_Reads 220
Fusion_Junctions 1
HomologyScore 26.5
FusionDiscordant_Ratio 0.628571429
Positions_Consistent PARTIALLY
GeneA_Chr chr1
GeneB_Chr chrX
Fusion_Type Interchromosomal
Breakpoint_Distance 1.00E+46
Breakpoint(s) chr1.i.e7.e6.35427190_chr23.e.2.48785038
Unique reads: gAdiffpos 110
Unique reads: gBdiffpos 119
Unique reads: fusdiffseq 35
gA_withinsamplecount 1
gB_withinsamplecount 1
ExonJunction TAAGACGCATGGAAGAACTTCACAATCAAGAAATGCAGAAACGTAAAGAAATGCAATTGAG|*|CCTGAACTCTTTGCTTCCGGAATCCGGGATTGTTGCTGACATAGAATTAGAAAACGTCCTT
in-frame classification* in-frame
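The FusionDiscordant_Ratio field in the record above can be reproduced from two of the other fields. The sketch below hard-codes the values shown for SFPQ-TFE3; the dictionary layout is an illustrative assumption, not the format of the output file.

```python
# Values copied from the example record above.
record = {
    "GeneA": "SFPQ",
    "GeneB": "TFE3",
    "Discordant_Pairs": 350,
    "Fusion_Reads": 220,
}

# Ratio of fusion spanning reads to discordant read pairs.
fusion_discordant_ratio = record["Fusion_Reads"] / record["Discordant_Pairs"]
print(f"{fusion_discordant_ratio:.9f}")
```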
Outputs
List of all annotated fusions: SampleID.annotated.candidates.txt
List of filtered annotated fusions: SampleID.filtered.candidates.txt
The identification of in-frame fusion transcripts and their predicted protein sequences.
Asmann Y W et al. Nucl. Acids Res. 2011;nar.gkr362 © The Author(s) 2011. Published by Oxford University Press.
Out of all the combinations, we consider only those fusion classifications that are found in primary transcripts.
Fusion Module Cont’d
Image Source: http://upload.wikimedia.org/wikipedia/en/d/d3/Mature_mRNA.png
CDR-CDR
In-frame
Out-of-frame
Non CDR-CDR
5′ UTR to CDR
5′ UTR to 3′ UTR
3′ UTR to 3′ UTR
5′ UTR to 5′ UTR
3′ UTR to 5′ UTR
CDR to 5′ UTR
CDR to 3′ UTR
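The classification listed above can be sketched as a function of the region each breakpoint falls in. For CDR-CDR fusions, a common rule (assumed here, not confirmed as PRADA's exact logic) is that the fusion is in-frame when the number of coding bases kept from the 5′ gene and the number of coding bases skipped in the 3′ gene are equal modulo 3. All names are hypothetical.

```python
def classify_fusion(region_5p, region_3p,
                    coding_bases_5p=0, coding_bases_skipped_3p=0):
    """Classify a fusion by breakpoint regions.

    region_5p / region_3p: "5' UTR", "CDR" or "3' UTR".
    coding_bases_5p: coding bases of the 5' gene kept before the breakpoint.
    coding_bases_skipped_3p: coding bases of the 3' gene lost before the breakpoint.
    """
    if region_5p == "CDR" and region_3p == "CDR":
        # The 3' gene keeps its reading frame only if the codon phase matches.
        same_phase = (coding_bases_5p - coding_bases_skipped_3p) % 3 == 0
        return "CDR-CDR in-frame" if same_phase else "CDR-CDR out-of-frame"
    return f"{region_5p} to {region_3p}"


print(classify_fusion("CDR", "CDR", coding_bases_5p=300, coding_bases_skipped_3p=3))
print(classify_fusion("CDR", "CDR", coding_bases_5p=301, coding_bases_skipped_3p=3))
print(classify_fusion("5' UTR", "CDR"))
```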
Implementation Results
Samples processed: >400 KIRC, >170 GBM
Works well in MDACC HPC* system
PRADA-fusion module validation rate ~85 % (11 out of 13)
KIRC fusion results
We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.
We identified 80 bona-fide fusion transcripts, 57 intrachromosomal and 33 interchromosomal, in 62 individual samples
“Recurrent” fusions: SFPQ-TFE3 (n=5, chr1-chrX), DHX33-NLRP1 (n=2, chr17), TRIP12-SLC16A14 (n=2, chr2), TFG-GPR128 (n=4, chr3)
KIRC fusion results Cont’d
SFPQ-TFE3
TFE3 translocations have been linked to a rare subtype of renal cancer.
The five samples harboring a TFE3 fusion did not contain mutations in the ten most frequently mutated genes in ccRCC (PBRM1, PTEN, VHL, SETD2, BAP1, KDM5C, MTOR, ZNF800, PIK3CA, and TP53), except one (in VHL).
This suggests that SFPQ-TFE3 fusion plays a unique role in the cancer genomics of these patients.
KIRC fusion validation
Sample ID                     5’ Gene   3’ Gene   Discordant Read Pairs  Fusion Span Reads  Fusion Junction(s)  5’ Gene Chr  3’ Gene Chr  Validated?
TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175                    129                1                   chrX         chr1         Yes
TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116                    81                 1                   chr1         chrX         Yes
TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90                     40                 2                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37                     9                  1                   chr6         chr6         Yes
TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17                     4                  1                   chr5         chr3         Yes
TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14                     6                  1                   chr18        chr1         Yes
TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14                     2                  1                   chr2         chr2         Yes
TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11                     3                  1                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9                      8                  1                   chr6         chr6         Yes
TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8                      5                  1                   chr18        chr6         No
TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5                      5                  1                   chr5         chr5         No
PRADA-fusion module validation rate: ~85% (11 out of 13), by RT-PCR and FISH assays
TFE3-SFPQ was validated in three individual samples
KIRC fusion validation: RT-PCR
FAM172A-FHIT
SFPQ-TFE3
TFE3-SFPQ
TFG-GPR128 has been reported in other cancers
TCGA has 1,000s of RNA-seq samples: how can we quickly scan many samples for the presence of this fusion?
Supervised Search Module
GUESS-ft: General User dEfined Supervised Search for fusion transcripts
Workflow: BAM → GUESS-ft → Summary report
  Mapped to A or B → Discordant reads (A-B): the two read ends map to A and B respectively. Use high-quality mapping reads only, check that the read orientation fulfills the fusion schema, allow up to one mismatch.
  Unmapped reads: parse unmapped reads whose other end maps to A or B.
  Junction DB: map the parsed reads to a DB of all possible exon junctions.
  Junction spanning reads: list reads with one end mapping to a junction and the other mapping to A or B.
  Note: time-consuming step
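The "DB of all possible exon junctions" step can be sketched as pairing every exon of gene A with every exon of gene B and joining the flanking sequence around each junction. This is a simplified illustration with hypothetical names and toy sequences; it uses exact string matching, whereas the workflow above allows up to one mismatch and would use an aligner.

```python
from itertools import product


def build_junction_db(exons_a, exons_b, flank=49):
    """Map junction name -> (sequence, junction position in that sequence)."""
    db = {}
    for (i, ea), (j, eb) in product(enumerate(exons_a, 1), enumerate(exons_b, 1)):
        left = ea[-flank:]   # 3' end of the gene A exon
        right = eb[:flank]   # 5' start of the gene B exon
        db[f"A.e{i}_B.e{j}"] = (left + right, len(left))
    return db


def spanning_junctions(read, db):
    """Names of junctions the read crosses (exact match, first hit per entry)."""
    hits = []
    for name, (seq, pos) in db.items():
        start = seq.find(read)
        # The read must cover bases on both sides of the junction.
        if start != -1 and start < pos < start + len(read):
            hits.append(name)
    return hits


# Toy exon sequences, made up for illustration.
exons_a = ["AAAACCCC", "GGGGTTTT"]
exons_b = ["ACGTACGT", "TTGACCAA"]
db = build_junction_db(exons_a, exons_b, flank=4)
print(spanning_junctions("CCCCACGT", db))
```

The database grows with the product of the two exon counts, which is one reason restricting the search to a user-defined gene pair keeps it tractable.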
Tumors with the fusion have higher GPR128 expression levels
RPKM expression pattern seen in KIRC tumors
Fusion sample(s)
Higher expression of GPR128 (activation)
TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal
Identification of TFG-GPR128 fusion
All available normal samples in cghub
Subset of tumor samples selected based on RPKM expression pattern
Table. Samples across cancer types
Cancer Type                                     # of normal samples   # of tumor samples
Bladder Urothelial Carcinoma [BLCA]             11                    4
Breast invasive carcinoma [BRCA]                106                   30
Head and Neck squamous cell carcinoma [HNSC]    27                    12
Kidney renal clear cell carcinoma [KIRC]        66                    416*
Kidney renal papillary cell carcinoma [KIRP]    15                    4
Liver hepatocellular carcinoma [LIHC]           9                     2
Lung adenocarcinoma [LUAD]                      51                    4
Lung squamous cell carcinoma [LUSC]             17                    18
Prostate adenocarcinoma [PRAD]                  7                     7
Thyroid carcinoma [THCA]                        12                    4
* All performed by PRADA fusion module.
Identification of TFG-GPR128 fusion
All available normal samples in cghub
Subset of tumor samples selected based on RPKM expression pattern
Table. Samples across cancer types
Cancer Type                                     # of normal samples   # of tumor samples
Bladder Urothelial Carcinoma [BLCA]             0 (0%)                2 (3.6%)
Breast invasive carcinoma [BRCA]                1 (0.94%)             13 (1.6%)
Head and Neck squamous cell carcinoma [HNSC]    0 (0%)                6 (2.3%)
Kidney renal clear cell carcinoma [KIRC]        1 (1.5%)              5 (1.2%)
Kidney renal papillary cell carcinoma [KIRP]    0 (0%)                1 (5.9%)
Liver hepatocellular carcinoma [LIHC]           0 (0%)                1 (5.9%)
Lung adenocarcinoma [LUAD]                      0 (0%)                1 (0.79%)
Lung squamous cell carcinoma [LUSC]             0 (0%)                9 (4%)
Prostate adenocarcinoma [PRAD]                  1 (14.3%)             2 (1.9%)
Thyroid carcinoma [THCA]                        0 (0%)                2 (0.89%)
* All performed by PRADA fusion module.
GUESS-ft module: TFG-GPR128 fusion Cont’d
“Raw” Copy Number for KIRC
Focal amplification in chr3 (TFG-GPR128)
GUESS-ft module: TFG-GPR128 fusion Cont’d
GWAS
In GBM, the gene EGFR is frequently targeted by intragenic deletions
Figure. GBM Alterations in EGFR
Supervised Search Module
GUESS-ig: GUESS for intragenic rearrangements
Workflow: BAM → GUESS-ig → Summary report
  Mapped to A → Discordant reads (A-A)
  Unmapped reads: parse unmapped reads whose other end maps to A.
  Junction DB: map the parsed reads to a DB of undefined junctions*.
  Junction spanning reads: list reads with one end mapping to an undefined junction and the other mapping to A.
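The slide does not define "undefined junctions"; one plausible reading, assumed here, is joins between non-adjacent exons of the same gene that are absent from the annotation. Under that assumption, candidate junctions for a gene can be enumerated as below. The function name and the `annotated` parameter are hypothetical.

```python
from itertools import combinations


def intragenic_junctions(n_exons, annotated=frozenset()):
    """Exon pairs (i, j), 1-based, that skip at least one exon.

    annotated: known exon-skipping junctions to exclude (assumed input).
    """
    return [
        (i, j)
        for i, j in combinations(range(1, n_exons + 1), 2)
        if j > i + 1 and (i, j) not in annotated
    ]


print(intragenic_junctions(4))
# The well-known EGFRvIII variant deletes exons 2 to 7, i.e. it joins
# exon 1 to exon 8; EGFR has 28 exons.
print((1, 8) in intragenic_junctions(28))
```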
Applying GUESS-ig in GBM identifies intragenic deletion variants
Figure. GBM Alterations in EGFR