Post on 15-Oct-2020

Introduction to RNA-Seq

David Wood

Winter School in Mathematics and Computational Biology

July 1, 2013

RNA is...

Central

DNA

RNA

Protein

Epigenetics

Diverse

tRNA

mRNA

rRNA

Dynamic

Time

Abundance

Quantitative

Qualitative

Understand the molecular basis of gene function. Classify and transform cellular states

Integrative

RNA studies involve...

Biological System

Technology

Available Resources

Questions

~/bin

Project

DB

This talk: Focusing on reference-based mammalian RNA-seq analysis

Transcriptional Complexity

[Figure: a genomic locus producing many transcripts, with multiple transcription start sites, translation start sites, polyadenylation signals and polyadenylated 3' ends.]

Legend: TSS, transcription start site; ATG, translation start site; pA, polyadenylation signal; AAA, polyadenylation. Also marked: protein coding regions, non-coding regions, genomic DNA, microRNAs, spliced intron.

Further layers added to the figure: PASR, tiRNA, miRNA; Alu; Mutations; Allelic Expression; RNA Editing

RNA-seq

[Figure: RNA-seq reads aligned across the same locus, showing non-spliced reads, junction reads and strand-specific coverage, with PASR, tiRNA, miRNA, Alu and mutations marked.]

Cloonan et al. Nat Methods 2008; 5:613-619

Advantages of RNA-seq

Discovery: genes, exons, junctions, UTRs, fusions (Present and Future)

[Figure: chart accompanying the Discovery panel; axis and series labels are not recoverable from the extracted text.]

Dynamic Range

Mortazavi et al. Nat. Methods 2008; 5:621–628

Nucleotide Specific

Typical experiment workflow

Design Experiment

Sample Acquisition

Field / Clinic / Lab

Validation

Verification

Sample Acquisition

Run Experiment

Obtain RNA

Make Library

Sequencing

Base Calling Mapping

Library QC

Publish

Analysis

Interpretation

1° 2°

Field / Clinic Wet Lab Dry Lab

Library Construction

Cellular RNA: rRNA (80%), tRNA (15%), remaining 5%

Target RNA: Deplete rRNA; Enrich polyA RNA; Profile (ribosomes); Capture (tiling arrays)

Fragment

ds-cDNA synthesis

Ligate adaptors + Amplify

Sequencing

RNA-seq Mapping

Challenge #1: Introns

Align to database of junctions or transcriptome

Wood et al. Bioinformatics 2011; 27:580–581

Split Read Alignments

Trapnell et al. Bioinformatics 2009; 25:1105-11

Challenge #2: Correctness

Sufficient Overlap

Sufficient Evidence

Align to the transcriptome

Challenge #3: Multi-mappers

Sequence Similarity
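One way to picture the "sufficient overlap" requirement when aligning to a database of junctions is to limit how much exon sequence flanks each junction. The following is a minimal sketch under that assumption, not the method of any tool cited on these slides; the function names and the `min_overlap` parameter are hypothetical and introduced only for illustration.

```python
def junction_sequence(upstream_exon, downstream_exon, read_length, min_overlap):
    """Build one junction sequence from two exon sequences.

    Each side contributes at most (read_length - min_overlap) bases, so a
    read of read_length that aligns end-to-end must place at least
    min_overlap bases on both sides of the junction.
    Returns (sequence, boundary); boundary is the offset of the junction.
    """
    if not 0 < min_overlap <= read_length // 2:
        raise ValueError("min_overlap must be between 1 and read_length // 2")
    flank = read_length - min_overlap
    left = upstream_exon[-flank:]    # 3' end of the upstream exon
    right = downstream_exon[:flank]  # 5' end of the downstream exon
    return left + right, len(left)


def anchor_lengths(read_start, read_length, boundary):
    """Number of read bases on the (left, right) side of the junction."""
    left = min(read_length, max(0, boundary - read_start))
    right = min(read_length, max(0, read_start + read_length - boundary))
    return left, right


# Toy usage: 10 nt reads that must overlap each exon by at least 3 nt.
seq, boundary = junction_sequence("A" * 50, "C" * 50, read_length=10, min_overlap=3)
anchors = [anchor_lengths(s, 10, boundary) for s in range(len(seq) - 10 + 1)]
```

Because the flanks are trimmed when the database is built, the overlap check does not have to be repeated for every alignment afterwards.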

RNA-seq Mapping

Data QC (clipping)

Align to Filter Set

Align to ‘genome’

Align to ‘junctions’

Split read Alignment

Choose Alignments, Disambiguate

Exclude

Flag and Exclude

BAM

Alignment Filtering

Analysis

Library QC

Choices to make: reference? diploid? gene model? ESTs? Algorithm? rRNA, tRNA?

Tophat: Trapnell et al. Bioinformatics 2009; 25:1105-11
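The "choose alignments, disambiguate" step can be sketched as a small decision rule. This is an illustrative assumption rather than the rule any cited aligner applies: a read is kept only when a single alignment has the fewest mismatches, and is flagged as a multi-mapper otherwise. The function name and the `(locus, mismatches)` tuple layout are hypothetical.

```python
def choose_alignment(alignments):
    """Pick one alignment for a read, or flag the read.

    alignments: list of (locus, mismatches) tuples for a single read.
    Returns (locus, status); status is "unique", "multi-mapper" or
    "unmapped", and locus is None unless the status is "unique".
    """
    if not alignments:
        return None, "unmapped"
    fewest = min(mismatches for _, mismatches in alignments)
    best = [locus for locus, mismatches in alignments if mismatches == fewest]
    if len(best) == 1:
        return best[0], "unique"
    # Several equally good placements: flag rather than guess.
    return None, "multi-mapper"
```

Whether flagged reads are excluded or redistributed later is an analysis decision; the sketch only records the flag.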

Library Quality Control (QC)

[Figure: the library construction diagram (target RNA selection, fragmentation, ds-cDNA synthesis, adaptor ligation and amplification, sequencing), annotated with the library property each step affects.]

Affects RNA content (Expression quantification)

Affects Insert Size (transcript identification)

Affects Strand Specificity

Affects Library Complexity (Tag uniqueness)

Affects Mapping Rate

Paired-end?
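Two of these properties reduce to simple ratios once reads are mapped. The sketch below assumes alignments are available as plain tuples; the function names and input layouts are hypothetical, and real QC tools may define these metrics differently.

```python
def library_complexity(alignments):
    """Tag uniqueness: fraction of alignments with a distinct position.

    alignments: list of (chromosome, start, strand) tuples, one per read.
    A value near 1.0 suggests few duplicated tags.
    """
    if not alignments:
        raise ValueError("no alignments supplied")
    return len(set(alignments)) / len(alignments)


def strand_specificity(strand_pairs):
    """Fraction of reads on the same strand as the gene they overlap.

    strand_pairs: list of (read_strand, gene_strand) tuples.
    About 0.5 is expected for an unstranded library.
    """
    if not strand_pairs:
        raise ValueError("no reads supplied")
    agree = sum(1 for read, gene in strand_pairs if read == gene)
    return agree / len(strand_pairs)
```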

Calculate Gene Expression

Gene A: 3500nt (700 reads), RPKM = 2.0

Gene B: 400nt (160 reads), RPKM = 4.0

Reads Per Kilobase per Million

$$\mathrm{RPKM} = \frac{R \times 10^{3} \times 10^{6}}{L \times N}$$

R = Gene Read Count

L = Length of gene

N = Library Size

Mortazavi et al. Nat. Methods 2008; 5:621–628
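The RPKM definition can be checked against the two genes on the slide. The library size is not stated there; the sketch below assumes N = 100 million reads, which is the value consistent with both quoted RPKMs. The function name is hypothetical.

```python
def rpkm(read_count, gene_length_nt, library_size):
    """Reads Per Kilobase per Million: R * 10^3 * 10^6 / (L * N)."""
    if gene_length_nt <= 0 or library_size <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e3 * 1e6 / (gene_length_nt * library_size)


LIBRARY_SIZE = 100_000_000  # assumed; not given on the slide

gene_a = rpkm(700, 3500, LIBRARY_SIZE)  # Gene A: 3500 nt, 700 reads
gene_b = rpkm(160, 400, LIBRARY_SIZE)   # Gene B: 400 nt, 160 reads
```

Gene A has more reads, but Gene B has the higher RPKM because its reads come from a much shorter gene.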

Further Normalisation

Normalise to “mappable” gene length

[Figure: gene model containing a repeat.]

Koehler et al. Bioinformatics 2010

Scale Expression Values by TMM

[Figure: Cellular RNA compared with RPKM, for Cond. 1 and Cond. 2.]

Robinson et al. Genome Biology 2010; 11:R25

Normalise to GC content of region

Benjamini et al. NAR; 2012
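The idea behind TMM scaling can be sketched in a few lines. This is a simplified, unweighted version written for illustration: the published method additionally weights each gene by its precision, and the function name, the default trim fractions and the toy counts here are assumptions.

```python
import math


def tmm_factor(sample, reference, m_trim=0.3, a_trim=0.05):
    """Unweighted trimmed mean of M-values for one sample against a reference.

    sample, reference: read counts per gene, in the same gene order.
    sum(sample) * factor can be used as an effective library size.
    """
    n_s, n_r = sum(sample), sum(reference)
    genes = []  # (M, A) for genes with reads in both libraries
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:
            p_s, p_r = y_s / n_s, y_r / n_r
            genes.append((math.log2(p_s / p_r), 0.5 * math.log2(p_s * p_r)))
    if not genes:
        raise ValueError("no gene has reads in both libraries")

    def kept(values, trim):
        """Indices that survive trimming `trim` of the genes from each tail."""
        order = sorted(range(len(values)), key=values.__getitem__)
        cut = int(len(values) * trim)
        return set(order[cut:len(order) - cut])

    keep = kept([m for m, _ in genes], m_trim) & kept([a for _, a in genes], a_trim)
    if not keep:
        raise ValueError("trimming removed every gene")
    mean_m = sum(genes[i][0] for i in keep) / len(keep)
    return 2 ** mean_m


# Toy data: the sample was sequenced twice as deeply as the reference,
# and one gene is massively induced in the sample.
reference = [100] * 20
sample = [200] * 19 + [20000]
effective_size = sum(sample) * tmm_factor(sample, reference)
```

In the toy data the induced gene inflates the raw library size to 23,800 reads, while the effective size comes out at about 4,000, twice the reference, which matches the unchanged genes.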

Calculate ‘Feature’ Expression

Exonic Region

Exon Junction

Intronic Region

Exon Boundary

Intergenic Region

Calculate RPKM for any feature

Extended 3’ UTR

Retained Intron
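Before RPKM can be calculated per feature, each read has to be assigned to a feature type. The sketch below shows one possible assignment rule for a single gene; the definitions (for example, that a read partly inside and partly outside an exon counts as an exon boundary) are assumptions made for illustration, and the function name is hypothetical.

```python
def classify_read(blocks, exons, gene_span):
    """Assign an aligned read to a feature type.

    blocks: aligned segments of the read as half-open (start, end)
            intervals; more than one when the read is split.
    exons: half-open exon intervals of the gene.
    gene_span: (start, end) of the whole gene.
    """
    if len(blocks) > 1:
        exon_ends = {end for _, end in exons}
        exon_starts = {start for start, _ in exons}
        # Every gap in the read must run from an exon end to an exon start.
        spliced = all(
            left[1] in exon_ends and right[0] in exon_starts
            for left, right in zip(blocks, blocks[1:])
        )
        return "exon junction" if spliced else "unclassified"
    start, end = blocks[0]
    if end <= gene_span[0] or start >= gene_span[1]:
        return "intergenic region"
    overlapping = [(s, e) for s, e in exons if start < e and end > s]
    if not overlapping:
        return "intronic region"
    if any(s <= start and end <= e for s, e in overlapping):
        return "exonic region"
    return "exon boundary"


exons = [(100, 200), (300, 400)]
gene_span = (100, 400)
```

Counting reads per label and dividing by the feature length and library size then gives an RPKM for any feature, as the slide states.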

Calculate Transcript Expression

Approach #1: Expression calculated using diagnostic features

Pros: Strong Evidence; Easy to calculate

Cons: Excludes Transcripts; Sampling Variability; Lacks statistical robustness; Dependent on gene model

ALEXA-seq: Griffith et al. Nat. Methods 2010; 7:843–847

Approach #2: Expression estimated

Construct bipartite graph, then find minimum path cover

Pros: Estimates expression for all transcripts; More statistically robust; Incorporates ambiguous reads

Cons: Model can fail in complex / highly expressed regions; Error rate largely unknown

Cufflinks: Trapnell et al. Nat. Biotech. 2010, 28:511-515
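The diagnostic-feature idea from Approach #1 can be sketched directly: a feature is diagnostic when exactly one transcript of the gene contains it. This is an illustration only, not the ALEXA-seq implementation; the function name and the toy gene model are hypothetical.

```python
from collections import Counter


def diagnostic_features(transcripts):
    """Find features that occur in exactly one transcript.

    transcripts: dict mapping transcript name to a set of feature ids
                 (exons, junctions, ...).
    Returns a dict mapping each transcript to its diagnostic features.
    """
    usage = Counter(
        feature for features in transcripts.values() for feature in set(features)
    )
    return {
        name: {feature for feature in features if usage[feature] == 1}
        for name, features in transcripts.items()
    }


# Toy gene model with three transcripts built from four exons.
model = {
    "T1": {"e1", "e2", "e3"},
    "T2": {"e1", "e3"},
    "T3": {"e1", "e3", "e4"},
}
diagnostic = diagnostic_features(model)
```

In this toy model T2 ends up with no diagnostic exon, which illustrates the "Excludes Transcripts" drawback; adding junctions to the feature sets could recover it.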

Expressed or not?

[Figure: frequency distribution of log2 (expression) for Cond. 1, Cond. 2 and Cond. 3, divided into not “expressed” and “expressed”.]

Need to determine ‘expression’ cut-off value

1. Expressed if > 1 RPKM

Has literature support; Arbitrary; Lacks sensitivity

2. Expressed if above intergenic background

[Figure: frequency against log2 Expression, with the 95th percentile marked.]

Cut-off based on empirical evidence; Still somewhat arbitrary

3. Incorporate replicate information

[Figure: np-IDR against log2 (expression) bins, with curves for Rep 1 vs Rep 2, Rep 2 vs Rep 1, Mean and Cut-off.]

Based on observed reproducibility; Requires replicates

Choose what is reasonable for your experiment, be consistent!
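The second option, a cut-off above intergenic background, can be sketched as a percentile lookup. The sketch assumes RPKM values have already been calculated for genes and for intergenic regions, and uses a nearest-rank percentile; the function names and the toy values are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def expressed_genes(gene_rpkm, intergenic_rpkm, pct=95):
    """Call genes expressed when they exceed the intergenic percentile.

    Returns (set of expressed gene names, cut-off used).
    """
    cutoff = percentile(intergenic_rpkm, pct)
    return {gene for gene, value in gene_rpkm.items() if value > cutoff}, cutoff


# Toy data: 20 intergenic regions with RPKM 0.1, 0.2, ..., 2.0.
background = [i / 10 for i in range(1, 21)]
genes = {"A": 2.5, "B": 1.9, "C": 0.4}
called, cutoff = expressed_genes(genes, background)
```

The percentile is still a choice, which is why the slide describes this cut-off as somewhat arbitrary.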

Nucleotide-Resolution Analysis

Imprinting (ICR)

eQTL

sQTL

Complex Traits

Allelic Fraction (SNPs A, B, C)

Reference bias

[Figure: density of the fraction of RNA-seq reads matching the reference allele, comparing the Expected Mean with the Observed Mean.]

Degner et al. Bioinformatics 2009

Map to a diploid genome

AlleleSeq: Rozowsky et al. Mol. Sys. Bio 2011
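Allelic fraction at a heterozygous SNP can be tested against an expected value with an exact binomial test. The sketch below is a minimal illustration using only the standard library, suitable for modest read depths; the function names are hypothetical, and the `expected` parameter is an assumption added so that an empirically observed mean could be used instead of 0.5 when reference bias is present.

```python
from math import comb


def allelic_fraction(ref_reads, alt_reads):
    """Fraction of reads at a SNP that match the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value for the reference read count.

    Sums the probability of every outcome that is no more likely than
    the observed one under the expected reference fraction.
    """
    n = ref_reads + alt_reads
    pmf = [
        comb(n, k) * expected ** k * (1 - expected) ** (n - k)
        for k in range(n + 1)
    ]
    observed = pmf[ref_reads]
    # Small tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))
```

If the expected fraction is left at 0.5 while the mapping favours the reference allele, balanced SNPs can appear imbalanced, which is the concern the reference-bias slide raises.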

The future of RNA-seq (now)

Single Cell

Shalek, et al. Nature 2013

Huge Cohort

Genotype-Tissue Expression project (GTEx): 900 donors, 30,000 RNA-seq data sets!

Lonsdale, et al. Nature Genetics 2013

Summary

1. Choose an alignment approach suitable for your experiment, available resources and tools

2. Assess library quality, specifically rRNA contamination, insert size, strand specificity and library complexity

3. Gene and ‘Feature’ Expression can be calculated using count data, and normalised by length, library size and GC content

4. Transcript expression calculation requires alternative approaches and algorithms, which although common, are largely unproven

5. RNA-seq can interrogate nucleotide-specific questions, but be careful of alignment biases (diploid mapping can help here)

Questions and References

Cloonan et al. Nat Methods 2008; Stem cell transcriptome profiling via massive-scale mRNA sequencing

Mortazavi et al. Nat. Methods 2008; Mapping and quantifying mammalian transcriptomes by RNA-Seq

Wood et al. Bioinformatics 2011; X-MATE: A flexible system for mapping short read data

Trapnell et al. Bioinformatics 2009; TopHat: discovering splice junctions with RNA-Seq

Koehler et al. Bioinformatics 2010; The Uniqueome: A mappability resource for short-tag sequencing

Robinson et al. Genome Biology 2010; A scaling normalization method for differential expression analysis of RNA-seq data.

Benjamini et al. NAR 2012; Summarizing and correcting the GC content bias in high-throughput sequencing

Griffith et al. Nat. Methods 2010; Alternative expression analysis by RNA sequencing.

Trapnell et al. Nat. Biotech. 2010; Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform

Degner et al. Bioinformatics 2009; Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing

Rozowsky et al. Mol. Sys. Bio 2011; AlleleSeq: analysis of allele-specific expression and binding in a

Shalek, et al. Nature 2013; Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells

Lonsdale, et al. Nature Genetics 2013; The Genotype-Tissue Expression (GTEx) project.
