RNA SEQUENCING AND DATA ANALYSIS

Download slides and package

http://odin.mdacc.tmc.edu/~rverhaak/package.zip

http://odin.mdacc.tmc.edu/~rverhaak/RNA-seq-lecture.zip

Overview

Introduction to the topic
  RNA species
  Experimental design considerations
  Analytical approaches

Discussion of our analysis pipeline
  Technical details
  Application on TCGA data sets
  Results

Hands on

All RNA is not the same

Types of RNA:
  Messenger RNA
  Micro RNA
  Long non-coding RNA
  Ribosomal RNA
  Other…

Methods for RNA enrichment prior to library construction

Poly(A)-RNA selection: by hybridization to oligo-dT beads
  Mature mRNA is highly enriched
  Efficient for quantification of gene expression levels
  Limitation: 3’ bias correlating with RNA degradation

rRNA depletion: by hybridization to bead-bound rRNA probes
  rRNA sequence-dependent and species-specific
  All non-rRNA is retained: premature mRNA, long non-coding RNA

Small RNA extraction:
  Specific kits required to retain small RNA
  Optional fine size-selection by gel or column

This lecture focuses on mRNA sequencing

Length of mRNA transcripts in the human genome

[Figure: two panels, x-axis 0 to 10,000 and 0 to 800, y-axis 0 to 5,000]

What is the optimal insert and read size for mRNA sequencing?

Alignment versus assembly

Assembly
  Trinity, Cufflinks, ABySS
  Particularly useful when no reference genome is available, as in bacterial transcriptomes

Alignment
  Bowtie, BWA, Mosaik
  Maximum sensitivity, fewer false positives

Sequencing parameters

Read type, typically 36/51/76/101 bp:

Single end read: for efficient counting of transcript copy number and splicing sites

Paired end read: longer cDNA fragment and read length help to determine transcript structure, especially within gene families

RNA sequencing applications

Quantification of transcript expression levels
Detection of splice variation/different isoforms of the same gene
Allele specific expression levels
Detection of fusion transcripts (such as BCR-ABL in CML)
Detection of sequence variation (limited application)
Validation of DNA sequence variants

RNA-seq expression levels are linear where microarrays get saturated or are insensitive

Expression is measured as ‘reads per kilobase of transcript per million mapped reads’ (RPKM) to normalize for gene length and library size
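
The RPKM normalization can be written out as a short calculation. This is a minimal sketch, not code from PRADA or RNA-SeQC; the function name `rpkm` and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    kilobases = gene_length_bp / 1_000          # normalize for gene length
    millions = total_mapped_reads / 1_000_000   # normalize for library size
    return read_count / kilobases / millions


# Hypothetical gene: 500 reads on a 2,000 bp transcript,
# in a library of 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

Doubling either the transcript length or the library size halves the value, which is what makes genes and samples comparable.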

Identification of fusion transcripts

Popular methods search for:

Read pairs that map to two different genes
  Need to correct for gene homology

Reads that span fusion junctions
  Split reads in half and align the halves separately
  Make a database of all possible fusion junctions and align full reads

Tools: PRADA, MapSplice, TopHat
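
The "database of all possible fusion junctions" idea can be sketched as follows. This is an illustration under simplifying assumptions (exact matching, forward strand, toy exon sequences), not the implementation used by any of the tools named above; `build_junction_db` and `find_spanning_junctions` are hypothetical names.

```python
def build_junction_db(exons_a, exons_b, read_length):
    """Join the end of every exon of gene A to the start of every exon of gene B.

    Each side contributes at most read_length - 1 bases, so a full-length
    read found inside a junction sequence must cross the junction.
    """
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    flank = read_length - 1
    db = {}
    for name_a, seq_a in exons_a.items():
        for name_b, seq_b in exons_b.items():
            db[f"{name_a}|{name_b}"] = seq_a[-flank:] + seq_b[:flank]
    return db


def find_spanning_junctions(read, db):
    """Return the junctions that contain the read as an exact substring."""
    return [name for name, seq in db.items() if read in seq]


# Toy exon sequences (hypothetical).
exons_a = {"A.e1": "AAAACCCC", "A.e2": "GGGGTTTT"}
exons_b = {"B.e1": "ACGTACGT", "B.e2": "TTGGCCAA"}
db = build_junction_db(exons_a, exons_b, read_length=6)
print(find_spanning_junctions("TTTACG", db))  # ['A.e2|B.e1']
```

A real aligner tolerates mismatches and handles both strands; the substring test only shows why the flank length is tied to the read length.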

Variant detection

Approximately 35% of mutations are covered sufficiently to be detected at a validation rate of ~ 80-90%.

All DNA mutations from the TCGA kidney renal clear cell carcinoma (KIRC) project

Reverse transcriptase step to convert RNA to cDNA complicates detection of RNA edits and mutations


Read Depth

Minimum mapped reads: 10 million for quantitative analysis of mammalian transcriptome

More reads needed for splicing variant discovery and differential comparison among samples

Current output: 120-180 million raw reads / lane

Multiplex level: 4-12 libraries / lane recommended
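
The read depth figures above can be combined into a quick planning check. This is a rough sketch: the mapping rate of 0.8 is an assumed value for illustration only, and the function names are hypothetical.

```python
def mapped_reads_per_library(raw_reads_per_lane, libraries_per_lane, mapping_rate):
    """Estimate mapped reads per library when a lane is shared evenly."""
    return raw_reads_per_lane / libraries_per_lane * mapping_rate


def meets_minimum(raw_reads_per_lane, libraries_per_lane, mapping_rate,
                  minimum_mapped=10_000_000):
    """Compare against the 10 million mapped read minimum quoted above."""
    estimate = mapped_reads_per_library(
        raw_reads_per_lane, libraries_per_lane, mapping_rate)
    return estimate >= minimum_mapped


# Low end of lane output with 12 libraries, high end with 4 libraries.
# mapping_rate=0.8 is an assumption, not a number from the slides.
for lane_reads, libraries in [(120_000_000, 12), (180_000_000, 4)]:
    estimate = mapped_reads_per_library(lane_reads, libraries, 0.8)
    print(libraries, round(estimate), meets_minimum(lane_reads, libraries, 0.8))
```

Under these assumptions, 12 libraries on a low-yield lane fall below the quantification minimum, while 4 libraries leave headroom for splicing variant discovery.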

RNA sequencing in The Cancer Genome Atlas

mRNA: poly-A mRNA purified from total RNA using poly-T oligo-attached magnetic beads

miRNA: Total RNA is mixed with oligo(dT) MicroBeads and loaded into MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.

Implementation Results

Samples processed: >400 KIRC, >170 GBM

TFG-GPR128 fusion
  Samples detected: 5 KIRC, >5 GBM
  Samples processed: 321 normal, 85 tumor (BLCA, BRCA, HNSC, KIRC, KIRP, LIHC, LUAD, LUSC, PRAD, THCA)

[Figure: PRADA pipeline overview]

INPUTS
  .fastq files [END1 & END2]
  Config.txt [location of scripts and reference files]

Processing Module
  Read alignment
  Remap alignments
  Combine two ends
  Quality scores recalibrated

Expression & QC Module [YES|| NO|| ONLY]
  RNA-SeQC

Fusion Module [YES|| NO|| ONLY]

GUESS-ft [YES|| NO|| ONLY]
  -geneA -geneB

OUTPUTS
  Preprocessing .bam file [PAIRED END]
  RPKM & QC metrics
  Fusion candidates
  Supervised search evidence

RNA sequencing read alignment in PRADA

[Figure: transcripts from the same gene]

Reads are aligned to all possible transcripts

Reads are also aligned to the genome

The final, single placement for each read is determined by re-mapping
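
One ingredient of re-mapping is converting a position on a transcript into a genome coordinate, so that transcriptome and genome alignments can be compared. The sketch below illustrates that conversion only and is not PRADA's remapping code; it assumes a forward-strand transcript, 0-based half-open exon intervals, and a hypothetical function name.

```python
def transcript_to_genome(pos, exons):
    """Convert a 0-based transcript offset to a 0-based genome coordinate.

    exons: list of (genome_start, genome_end) half-open intervals,
    in transcript order, forward strand only (a simplifying assumption).
    """
    if pos < 0:
        raise ValueError("position must be non-negative")
    remaining = pos
    for start, end in exons:
        length = end - start
        if remaining < length:
            return start + remaining
        remaining -= length  # skip this exon and the intron that follows it
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical two-exon transcript: exon 1 covers 100-150, exon 2 covers 300-380.
exons = [(100, 150), (300, 380)]
print(transcript_to_genome(49, exons))  # 149, last base of exon 1
print(transcript_to_genome(60, exons))  # 310, eleventh base of exon 2
```

A read that starts at transcript offset 45 and is 10 bases long would end at offset 54, so it crosses the exon-exon junction even though its genome coordinates are 150 bases apart.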

PRADA alignments – advantages versus disadvantages

Advantages:
  Alignment to unannotated transcripts
  Alignment across exon-exon junctions

Disadvantage:
  Alignment approaches such as those used by MapSplice and Bowtie/TopHat typically split reads
  PRADA's alignment is more conservative than split-read alignment

PRADA focuses on the analysis of paired-end RNA-sequencing data. Four modules:

1. Processing
2. Expression and Quality Control
3. Gene fusion
4. GUESS-ft: General User dEfined Supervised Search for fusion transcripts

http://sourceforge.net/projects/prada/

Processing Module

Sample reads are mapped to:
  Transcriptome
  Genome

Uses tools widely adopted by the research community: Samtools, BWA, Picard, GATK

Enabled reference versions:
  hg18 | Ensembl52
  hg19 | Ensembl64

Expression & QC Module

RNA-SeQC provides three types of quality control metrics:
  Read counts
  Coverage
  Correlation

RPKM values:
  At transcript level
  For longest transcript

RNA-SeQC process (Java)

Fusion Module

Discordant read pair: each end of the read pair maps uniquely to a distinct protein-coding gene.

Fusion spanning read: a chimeric read that maps to a putative junction, while the mate read maps to either GENE A or GENE B.

[Figure: Gene A and Gene B]
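
The discordant read pair definition can be expressed as a small classifier. This is a simplified sketch rather than the Fusion Module's code: each read end is represented by the set of genes it maps to, and the labels and function names are hypothetical.

```python
from collections import Counter


def classify_pair(end1_genes, end2_genes, protein_coding):
    """Classify one read pair from the sets of genes hit by each end."""
    if len(end1_genes) != 1 or len(end2_genes) != 1:
        return "not_unique"      # unmapped or multi-mapped end
    (gene1,) = end1_genes
    (gene2,) = end2_genes
    if gene1 == gene2:
        return "concordant"
    if gene1 in protein_coding and gene2 in protein_coding:
        return "discordant"      # unique hits to two distinct coding genes
    return "other"


def count_discordant(pairs, protein_coding):
    """Count discordant pairs per unordered gene pair."""
    counts = Counter()
    for end1_genes, end2_genes in pairs:
        if classify_pair(end1_genes, end2_genes, protein_coding) == "discordant":
            (gene1,) = end1_genes
            (gene2,) = end2_genes
            counts[tuple(sorted((gene1, gene2)))] += 1
    return counts


coding = {"SFPQ", "TFE3"}
pairs = [
    ({"SFPQ"}, {"TFE3"}),           # discordant
    ({"TFE3"}, {"SFPQ"}),           # discordant, opposite order
    ({"SFPQ"}, {"SFPQ"}),           # concordant
    ({"SFPQ", "GENEX"}, {"TFE3"}),  # end 1 is not unique
    (set(), {"TFE3"}),              # end 1 is unmapped
]
print(count_discordant(pairs, coding))  # Counter({('SFPQ', 'TFE3'): 2})
```

Pairs with an unmapped end are not discarded in practice; they are the candidates in which a fusion spanning read may be found.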

Fusion Module Cont’d

Filters:
  Gene homology using blastn (bitscore 50)
  Ratio of fusion spanning and discordant reads
  Number of gene partners within a sample
    Remove “promiscuous” fusion pairs, i.e. with a large number of partners (e.g. >25)
  Number of distinct junctions

Filtered candidates:
  Up to 1 mismatch
  Unique sequences
  Unique start positions

[Figure: 180 bp fragment with a 50 bp read at each end and 80 bp between them; 49 bp marked on each read]

tr = (49 + 49) / 180 ≈ 0.5
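
Two of the filters above, gene homology and promiscuity, can be sketched on a list of candidate records. The direction of the homology test (discard pairs at or above the bit score threshold) is an assumption, the field names are hypothetical, and the bit scores are supplied as data rather than computed by running blastn.

```python
from collections import defaultdict


def filter_candidates(candidates, max_bitscore=50.0, max_partners=25):
    """Drop homologous gene pairs and genes with too many partners."""
    partners = defaultdict(set)
    for cand in candidates:
        partners[cand["gene_a"]].add(cand["gene_b"])
        partners[cand["gene_b"]].add(cand["gene_a"])

    kept = []
    for cand in candidates:
        if cand["bitscore"] >= max_bitscore:
            continue  # assumed rule: partner genes are too similar
        if (len(partners[cand["gene_a"]]) > max_partners
                or len(partners[cand["gene_b"]]) > max_partners):
            continue  # "promiscuous" gene within this sample
        kept.append(cand)
    return kept


# One plausible candidate, one homologous pair, and a hypothetical gene "HUB"
# that pairs with 26 different partners.
candidates = [{"gene_a": "SFPQ", "gene_b": "TFE3", "bitscore": 26.5},
              {"gene_a": "GENE1", "gene_b": "GENE1L", "bitscore": 120.0}]
candidates += [{"gene_a": "HUB", "gene_b": f"P{i}", "bitscore": 0.0}
               for i in range(26)]
print([(c["gene_a"], c["gene_b"]) for c in filter_candidates(candidates)])
# [('SFPQ', 'TFE3')]
```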


SampleID TCGA-BP-4756-01A-01R-1289-07

GeneA SFPQ

GeneB TFE3

Discordant_Pairs 350

Fusion_Reads 220

Fusion_Junctions 1

HomologyScore 26.5

FusionDiscordant_Ratio 0.628571429

Positions_Consistent PARTIALLY

GeneA_Chr chr1

GeneB_Chr chrX

Fusion_Type Interchromosomal

Breakpoint_Distance 1.00E+46

Breakpoint(s) chr1.i.e7.e6.35427190_chr23.e.2.48785038

Unique reads: gAdiffpos 110

Unique reads: gBdiffpos 119

Unique reads: fusdiffseq 35

gA_withinsamplecount 1

gB_withinsamplecount 1

ExonJunction TAAGACGCATGGAAGAACTTCACAATCAAGAAATGCAGAAACGTAAAGAAATGCAATTGAG|*|CCTGAACTCTTTGCTTCCGGAATCCGGGATTGTTGCTGACATAGAATTAGAAAACGTCCTT

in-frame classification* in-frame
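
The FusionDiscordant_Ratio field in the record above is consistent with dividing Fusion_Reads by Discordant_Pairs. The short check below reproduces it from the two counts; the dictionary is only a convenient stand-in for the record, not a PRADA data structure.

```python
record = {
    "SampleID": "TCGA-BP-4756-01A-01R-1289-07",
    "GeneA": "SFPQ",
    "GeneB": "TFE3",
    "Discordant_Pairs": 350,
    "Fusion_Reads": 220,
}

# Ratio of fusion spanning reads to discordant read pairs.
ratio = record["Fusion_Reads"] / record["Discordant_Pairs"]
print(round(ratio, 9))  # 0.628571429
```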

Outputs

List of all annotated fusions: SampleID.annotated.candidates.txt

List of filtered annotated fusions: SampleID.filtered.candidates.txt

The identification of in-frame fusion transcripts and their predicted protein sequences.

Asmann Y W et al. Nucl. Acids Res. 2011;nar.gkr362 © The Author(s) 2011. Published by Oxford University Press.

Of all the combinations, we consider only those fusion classifications found in primary transcripts.


Image Source: http://upload.wikimedia.org/wikipedia/en/d/d3/Mature_mRNA.png

CDR-CDR
  In-frame
  Out-of-frame

Non CDR-CDR
  5′ UTR to CDR
  5′ UTR to 3′ UTR
  CDR to 5′ UTR
  CDR to 3′ UTR
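
For a CDR-CDR junction, the in-frame versus out-of-frame call comes down to codon phase. The sketch below shows the standard modulo-3 check under stated assumptions; it is not the classification code used by PRADA, and the parameter names are hypothetical.

```python
def classify_cds_fusion(coding_bases_from_a, cds_offset_in_b):
    """Classify a coding-to-coding junction by reading frame.

    coding_bases_from_a: number of coding bases of gene A kept upstream
        of the junction.
    cds_offset_in_b: 0-based offset, within gene B's coding sequence,
        of the first base kept downstream of the junction.
    The fusion is in-frame when gene B resumes at the codon position
    where gene A stopped.
    """
    if coding_bases_from_a < 0 or cds_offset_in_b < 0:
        raise ValueError("offsets must be non-negative")
    same_phase = coding_bases_from_a % 3 == cds_offset_in_b % 3
    return "in-frame" if same_phase else "out-of-frame"


# Gene A contributes 2 complete codons; gene B resumes at a codon start.
print(classify_cds_fusion(6, 9))   # in-frame
# Gene A contributes 2 codons plus 1 base; gene B resumes at a codon start.
print(classify_cds_fusion(7, 9))   # out-of-frame
# Gene A contributes 2 codons plus 1 base; gene B supplies the other 2 bases.
print(classify_cds_fusion(7, 10))  # in-frame
```

Junctions that involve a UTR on either side fall under the Non CDR-CDR classes and are outside this check.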

Implementation Results

Samples processed: >400 KIRC, >170 GBM

Works well in MDACC HPC* system

PRADA-fusion module validation rate ~85% (11 out of 13)

KIRC fusion results

We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.

We identified 80 bona fide fusion transcripts (57 intrachromosomal, 33 interchromosomal) in 62 individual samples.

“Recurrent” fusions:
  SFPQ-TFE3 (n=5, chr1-chrX)
  DHX33-NLRP1 (n=2, chr17)
  TRIP12-SLC16A14 (n=2, chr2)
  TFG-GPR128 (n=4, chr3)

KIRC fusion results Cont’d

SFPQ-TFE3: TFE3 translocations have been linked to a rare subtype of renal cancer.

With one exception (a VHL mutation), the five samples harboring a TFE3 fusion did not contain mutations in the ten most frequently mutated genes in ccRCC (PBRM1, PTEN, VHL, SETD2, BAP1, KDM5C, MTOR, ZNF800, PIK3CA, and TP53).

This suggests that the SFPQ-TFE3 fusion plays a unique role in the cancer genomics of these patients.

KIRC fusion validation

Sample ID                     5’ Gene   3’ Gene   Discordant  Fusion      Fusion       5’ Gene  3’ Gene  Validated?
                                                  Read Pairs  Span Reads  Junction(s)  Chr      Chr
TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175         129         1            chrX     chr1     Yes
TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116         81          1            chr1     chrX     Yes
TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90          40          2            chr6     chr6     Yes
TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37          9           1            chr6     chr6     Yes
TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17          4           1            chr5     chr3     Yes
TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14          6           1            chr18    chr1     Yes
TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14          2           1            chr2     chr2     Yes
TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11          3           1            chr6     chr6     Yes
TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9           8           1            chr6     chr6     Yes
TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8           5           1            chr18    chr6     No
TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5           5           1            chr5     chr5     No

PRADA-fusion module validation rate: ~85% (11 out of 13), using RT-PCR and FISH assays

TFE3-SFPQ was validated in three individual samples

KIRC fusion validation: RT-PCR

FAM172A-FHIT

SFPQ-TFE3

TFE3-SFPQ

KIRC fusion results

Among the “recurrent” fusions, TFG-GPR128 (n=4, chr3) has been reported in other cancers.

TCGA has thousands of RNA-seq samples. How can we quickly scan many samples for the presence of this fusion?

Supervised Search Module

GUESS-ft: General User dEfined Supervised Search for fusion transcripts

Workflow, from BAM input to summary report:

  Mapped to A or B
  A-B discordant reads: the two read ends map to A and B respectively
  Unmapped reads: parse unmapped reads whose other end maps to A or B
  Junction DB: map the parsed reads to a DB of all possible exon junctions
  Junction spanning reads: list reads with one end mapping to a junction and the other mapping to A or B (time consuming step)
  Summary report

Use high quality mapping reads only, check that read orientation fulfills the fusion schema, and allow up to one mismatch.
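
The first stages of the supervised search can be sketched as a partition of read pairs. This is a simplified illustration and not the GUESS-ft source: reads are plain records instead of BAM entries, the mapping quality cutoff of 30 is an assumed value, and the orientation and mismatch checks are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadEnd:
    gene: Optional[str]  # None when the end is unmapped
    mapq: int = 0        # mapping quality


def partition_pairs(pairs, gene_a, gene_b, min_mapq=30):
    """Split read pairs into discordant A-B pairs and junction-search candidates."""
    targets = {gene_a, gene_b}
    discordant, junction_candidates = [], []
    for name, end1, end2 in pairs:
        if ({end1.gene, end2.gene} == targets
                and min(end1.mapq, end2.mapq) >= min_mapq):
            discordant.append(name)  # one end in A, the other in B
            continue
        for mapped, other in ((end1, end2), (end2, end1)):
            if (other.gene is None and mapped.gene in targets
                    and mapped.mapq >= min_mapq):
                # The unmapped end is later aligned to the junction DB.
                junction_candidates.append(name)
                break
    return discordant, junction_candidates


pairs = [
    ("r1", ReadEnd("TFG", 60), ReadEnd("GPR128", 60)),  # discordant
    ("r2", ReadEnd("TFG", 60), ReadEnd(None)),          # junction candidate
    ("r3", ReadEnd("TFG", 60), ReadEnd("TFG", 60)),     # ordinary pair
    ("r4", ReadEnd("TFG", 5), ReadEnd("GPR128", 60)),   # low mapping quality
]
print(partition_pairs(pairs, "TFG", "GPR128"))  # (['r1'], ['r2'])
```

Restricting the search to two named genes is what keeps a scan over many samples tractable, since only the reads touching A or B need to be examined.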

Tumors with the fusion have higher GPR128 expression levels

[Figure: RPKM expression pattern seen in KIRC tumors; fusion sample(s) show higher expression of GPR128 (activation)]

TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal

Identification of TFG-GPR128 fusion

All available normal samples in cghub

Subset of tumor samples selected based on RPKM expression pattern

Table. Samples across cancer types

Cancer Type                                    # of normal samples   # of tumor samples
Bladder Urothelial Carcinoma [BLCA]            11                    4
Breast invasive carcinoma [BRCA]               106                   30
Head and Neck squamous cell carcinoma [HNSC]   27                    12
Kidney renal clear cell carcinoma [KIRC]       66                    416*
Kidney renal papillary cell carcinoma [KIRP]   15                    4
Liver hepatocellular carcinoma [LIHC]          9                     2
Lung adenocarcinoma [LUAD]                     51                    4
Lung squamous cell carcinoma [LUSC]            17                    18
Prostate adenocarcinoma [PRAD]                 7                     7
Thyroid carcinoma [THCA]                       12                    4

* All performed by PRADA fusion module.

Identification of TFG-GPR128 fusion

Table. Samples across cancer types

Cancer Type                                    # of normal samples (%)   # of tumor samples (%)
Bladder Urothelial Carcinoma [BLCA]            0 (0%)                    2 (3.6%)
Breast invasive carcinoma [BRCA]               1 (0.94%)                 13 (1.6%)
Head and Neck squamous cell carcinoma [HNSC]   0 (0%)                    6 (2.3%)
Kidney renal clear cell carcinoma [KIRC]       1 (1.5%)                  5 (1.2%)
Kidney renal papillary cell carcinoma [KIRP]   0 (0%)                    1 (5.9%)
Liver hepatocellular carcinoma [LIHC]          0 (0%)                    1 (5.9%)
Lung adenocarcinoma [LUAD]                     0 (0%)                    1 (0.79%)
Lung squamous cell carcinoma [LUSC]            0 (0%)                    9 (4%)
Prostate adenocarcinoma [PRAD]                 1 (14.3%)                 2 (1.9%)
Thyroid carcinoma [THCA]                       0 (0%)                    2 (0.89%)

* All performed by PRADA fusion module.

GUESS-ft module: TFG-GPR128 fusion Cont’d

“Raw” Copy Number for KIRC

Focal amplification in chr3 (TFG-GPR128)

GUESS-ft module: TFG-GPR128 fusion Cont’d

GWAS

In GBM, the gene EGFR is frequently targeted by intragenic deletions

Figure. GBM Alterations in EGFR

Supervised Search Module

GUESS-ig: GUESS for intragenic rearrangements

Workflow, from BAM input to summary report:

  Mapped to A
  A-A discordant reads
  Unmapped reads: parse unmapped reads whose other end maps to A
  Junction DB: map the parsed reads to a DB of undefined junctions*
  Junction spanning reads: list reads with one end mapping to an undefined junction and the other mapping to A
  Summary report

Applying GUESS-ig in GBM identifies intragenic deletion variants

Figure. GBM Alterations in EGFR

Thanks.

http://sourceforge.net/projects/prada/
