exseq presentation with background

Optimization of mRNA sequencing relative to current microarray platforms May 31, 2011

Upload: judbox

Post on 06-Apr-2018


Page 1: ExSeq Presentation With Background

8/2/2019 ExSeq Presentation With Background

http://slidepdf.com/reader/full/exseq-presentation-with-background 1/40

Optimization of mRNA sequencing relative to current microarray platforms

May 31, 2011


Why mRNAseq?

• There are at least three compelling reasons for performing RNAseq in general, and mRNAseq in particular, rather than using a microarray
 – Specificity (of what is being measured)
 – Reduced bias (in batches, in log ratio (FC) estimates, in general)
 – Sensitivity (on a gene or transcript basis, for both detection and differential expression)
• Other reasons
 – Detection of SNVs or other variations
 – No predetermined transcriptome needs to be available; no probes need to be designed or manufactured
 – Cost (will soon be equivalent to microarrays on a per-assay basis)


Why mRNAseq?

Specificity

[Figure: DDR1 with Affymetrix probe set annotation]

3’ bias of probes creates greater ambiguity in measurement


Why mRNAseq?

Reduced bias over batches, time, machines, related to processing

[Figure: CD19, CD8, CD14, and CD4 cells]


Historical Issues in Genome-wide Expression Studies

• MAQC (Microarray Quality Control) effort began in 2005 to address differences seen in differential expression between microarray platforms
 – The effort was very successful, coming at a time when many questions were being raised about the reliability/reproducibility of microarray results
 – The issues with microarray reproducibility in differential expression came down to three main findings
   • Poor use/interpretation of statistical methods creates an illusion of discordance
   • Relying on annotation rather than probe locations is perilous
   • Each platform has its sweet spot for various performance characteristics
 – All primary microarray platforms can work well and be generally concordant


Similar Issues with mRNA sequencing

• In one sense, next-gen sequencers can be viewed as another platform for RNA (relative) quantitation, but without the bias of predetermined probes

• RNA assay performance measures that are important

 – Detection and signal (including dynamic range)

 – Fold Change (Log Ratio or Log FC) Estimates (biased?)

 – Differential Gene Lists (size, uniqueness)

 – Concordance with reference methods (TaqMan)

 – Repeatability/Reproducibility of these same factors
   • Technical variability vs. biological variability


Exemplar Sequencing (ExSeq) Experiment

• Goal: Compare performance of various mRNA-seq strategies to microarrays in a real-world experimental scenario
• Design
 – 15 breast cancer cell lines: 5 unique lines representing each of 3 breast cancer subtypes
 – Three independent library preparations of each line using the Illumina TruSeq protocol
 – 45 samples randomized into 7 pools/lanes with 7 barcodes per pool
 – The 7 pools were run on 2 flow cells (100 x 100 cycles, i.e., 100 PE)
 – For runs with acceptable quality, reads were randomly collected into sets of 2M, 5M, 10M, 17M, 25M, 33M, and 50M reads
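The randomization and read subsampling steps in the design can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the sample names, the round-robin dealing, the fixed seed, and the function names randomize_into_pools and subsample_reads are all assumptions.

```python
import random

def randomize_into_pools(samples, n_pools=7, barcodes_per_pool=7, seed=0):
    """Shuffle samples and deal them round-robin into pools.

    Returns {sample: (pool_index, barcode_index)}; barcode indices are
    unique within a pool.
    """
    if len(samples) > n_pools * barcodes_per_pool:
        raise ValueError("more samples than available pool/barcode slots")
    rng = random.Random(seed)      # fixed seed so the layout is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    layout = {}
    for i, sample in enumerate(shuffled):
        pool = i % n_pools         # round-robin keeps pool sizes balanced
        barcode = i // n_pools     # next free barcode slot in that pool
        layout[sample] = (pool, barcode)
    return layout

def subsample_reads(read_ids, depth, seed=0):
    """Draw `depth` reads at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(read_ids, depth)

# Hypothetical sample names: 15 cell lines x 3 library preps = 45 samples.
samples = [f"line{line:02d}_prep{prep}"
           for line in range(1, 16) for prep in range(1, 4)]
layout = randomize_into_pools(samples)
pool_sizes = [sum(1 for p, _ in layout.values() if p == k) for k in range(7)]
print(pool_sizes)  # [7, 7, 7, 6, 6, 6, 6]
```

With 45 samples and 7 pools, three pools carry 7 barcodes and four carry 6, so not every barcode slot is used.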


ExSeq

• Why is our ExSeq experiment designed differently than MAQC?
 – MAQC had only two biological conditions that were themselves pools from multiple sources / tissues (UHRR and HBRR)
   • They had technical replicates, not biological replicates
 – We wanted to test a more realistic scenario using different biological conditions and biological replicates within a condition
   • Thus, one specimen, one source
 – Since each specimen is attributable to one source, we can also potentially assess variation (SNV/fusion/etc.) of that source, whereas with MAQC we could not, even if we sequenced the RNA
 – We can still assess assay repeatability by running independent preps in multiple lanes, but with sequencing we can easily combine output from multiple preps (if there is no bias)


Main Presentation

• Sequencing performance relative to sequencing strategy and relative to microarrays using 15 breast cancer cell lines
 – Illumina HiSeq sequencer (HiSeq)
 – HG-U133_Plus_2 microarray (Affymetrix)
 – Human HT-12v4 microarray (Illumina)
• Interpretations/Insights from a sequencing experiment over and above microarrays


Raw Sequencing Output - ExSeq

Typically achieving 100-110M PF clusters / lane*

*Illumina HiSeq specification is 60M PF clusters / lane (v2)


EA Pre-processing & QA Methods Implemented

• Automated determination of
 – Species
 – Molecule - DNA, RNA or miRNA
 – Insert size - length and variation
• Detection of non-uniformity of barcode representation
• Alarms for
 – percent and number of PF clusters
 – unexpected base distribution
 – unusual quality scores by cycle
• Automated detection and cleavage of adaptors
• Automated detection and trimming of cycles with skewed quality scores or high frequency of ‘N’s

• Correction of quality scores based on Phi-X spike-ins (in testing)

• …
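Two of the checks listed above, flagging cycles with a high frequency of ‘N’s and detecting non-uniform barcode representation, might look roughly like this. It is a sketch under assumed thresholds (10% ‘N’ per cycle, 50% relative deviation from uniform); it is not the EA implementation, and all function names are hypothetical.

```python
from collections import Counter

def n_fraction_by_cycle(reads):
    """Fraction of 'N' base calls at each cycle (0-based read position)."""
    length = max(len(r) for r in reads)
    n_counts = [0] * length
    totals = [0] * length
    for read in reads:
        for cycle, base in enumerate(read):
            totals[cycle] += 1
            if base == "N":
                n_counts[cycle] += 1
    return [n / t for n, t in zip(n_counts, totals)]

def flag_cycles(n_fractions, max_n_fraction=0.1):
    """Cycles whose 'N' frequency exceeds the (assumed) threshold."""
    return [c for c, f in enumerate(n_fractions) if f > max_n_fraction]

def barcode_alarms(barcodes, tolerance=0.5):
    """Observed barcodes whose read share deviates from uniform by more
    than `tolerance` (relative). Barcodes never observed are not reported."""
    counts = Counter(barcodes)
    expected = len(barcodes) / len(counts)
    return sorted(b for b, c in counts.items()
                  if abs(c - expected) / expected > tolerance)

reads = ["ACGTN", "ACGTN", "ANGTA", "ACGTC"]
print(flag_cycles(n_fraction_by_cycle(reads)))              # [1, 4]
print(barcode_alarms(["A"] * 10 + ["B"] * 9 + ["C"] * 2))   # ['C']
```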


Computational Processing overview

• Alignment

 – Maximizing unambiguous alignments

 – Alignment of reads that cross exon junctions

 – Ex: Bowtie, BWA, Tophat

• Abundance estimation

 – Gene or transcript

 – Handling alignments that are ambiguous in the transcriptome

 – Ex: Cufflinks, MISO, IsoEM, IsoInfer, Rseq, . . .

• Normalization of read counts

 – Minimizing bias due to variation in number of clusters available

 – Ex: Total count, Upper quartile, quantile, density

• Testing for differential expression

 – Data are not log normal as often observed with microarrays

 – Ex: Cuffdiff, edgeR, DESeq
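As a rough illustration of the normalization step, total count and upper quartile scaling can be sketched in plain Python. The counts are made up, and the choices here (upper quartile taken over nonzero counts, rescaling to the mean upper quartile) are assumptions, not a description of any particular tool.

```python
from statistics import quantiles

def total_count_normalize(counts, per=1_000_000):
    """Scale one sample so its counts sum to `per` (counts per million)."""
    total = sum(counts)
    return [c * per / total for c in counts]

def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4)[2]

def upper_quartile_normalize(samples):
    """Divide each sample by its upper quartile, then rescale all samples
    to the mean upper quartile so values stay on a count-like scale."""
    uqs = [upper_quartile(s) for s in samples]
    target = sum(uqs) / len(uqs)
    return [[c / uq * target for c in s] for s, uq in zip(samples, uqs)]

# Hypothetical counts: sample_b is sample_a sequenced twice as deeply.
sample_a = [0, 10, 20, 30, 40, 1000]
sample_b = [0, 20, 40, 60, 80, 2000]
norm_a, norm_b = upper_quartile_normalize([sample_a, sample_b])
print(norm_a)
print(norm_b)
```

After scaling, the two samples agree, which is the sense in which normalization minimizes bias from the number of clusters available.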


Alignment

• In mRNA-seq and similar sequence counting experiments, the number of unambiguously aligned reads is the driving factor behind most aspects of measurement performance
• There are numerous strategies to improve unambiguous alignment
 – Longer reads
 – Paired end vs single end
 – Alignment strategy - including error tolerance
 – Reduced complexity of the reference sequence


Alignment Approaches

• Default parameters were used for available tools

• Reference database and error tolerance were identical for all methods

• EA = EA-developed SE alignment strategy

• Alignment estimates generated from the Illumina body map data

[Figure legend: unambiguous, ambiguous, unaligned]
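The three categories can be tallied from the number of locations each read aligns to. This is a minimal sketch: the input format (a mapping from read ID to hit count) and the function names are assumptions, not the output of any of the aligners named in this deck.

```python
from collections import Counter

def classify_read(n_hits):
    """Label a read by how many locations it aligned to."""
    if n_hits == 0:
        return "unaligned"
    if n_hits == 1:
        return "unambiguous"
    return "ambiguous"

def alignment_summary(hits_per_read):
    """Fraction of reads in each category."""
    labels = Counter(classify_read(n) for n in hits_per_read.values())
    total = sum(labels.values())
    return {k: labels.get(k, 0) / total
            for k in ("unambiguous", "ambiguous", "unaligned")}

hits = {"r1": 1, "r2": 1, "r3": 3, "r4": 0}   # hypothetical reads
print(alignment_summary(hits))
# {'unambiguous': 0.5, 'ambiguous': 0.25, 'unaligned': 0.25}
```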


Unambiguous Alignment in ExSeq

• The default TopHat algorithm is based on the SE Bowtie algorithm

• EA-TopHat is a hybrid approach which uses the general TopHat algorithm for junction mapping, but is powered by the EA alignment engine

*Alignment estimates generated from the ExSeq data

97%


Detection

• Microarray detection defined by MAS5 call for Affymetrix and detection p-value < 0.05 for Illumina

• Sequencing detection defined as greater than 3 counts assigned at the end of abundance estimation

• Shared content consists of the set of transcripts (or genes) that are common to all platforms

• Unrestricted content allows for any possible detection event under the platform-specific definition


Detection – Shared Transcripts

• Gray – detected in any sample

• Red – detected in >=66% of samples

• Detected is defined as >= 3 reads assigned to a transcript
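The gray and red categories above can be computed from a table of per-sample read counts. A minimal sketch, assuming counts are held as {transcript: [count per sample]}; the transcript names and platform lists are hypothetical. The threshold follows the ">= 3 reads" definition on this slide and is left as a parameter.

```python
def detection_summary(counts_by_transcript, min_reads=3, min_fraction=0.66):
    """Return (detected in any sample, detected in >= min_fraction of samples)."""
    any_sample, most_samples = set(), set()
    for transcript, counts in counts_by_transcript.items():
        n_detected = sum(1 for c in counts if c >= min_reads)
        if n_detected > 0:
            any_sample.add(transcript)
        if n_detected / len(counts) >= min_fraction:
            most_samples.add(transcript)
    return any_sample, most_samples

def shared_content(*platform_transcripts):
    """Transcripts common to all platforms."""
    return set.intersection(*(set(p) for p in platform_transcripts))

counts = {"tx1": [5, 4, 9], "tx2": [0, 3, 1], "tx3": [0, 2, 1]}
gray, red = detection_summary(counts)
print(sorted(gray))  # ['tx1', 'tx2']
print(sorted(red))   # ['tx1']
print(sorted(shared_content(["tx1", "tx2"], ["tx2", "tx3"], ["tx2"])))  # ['tx2']
```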


Detection – All Transcripts

• Gray – detected in any sample

• Red – detected in >=66% of samples

• Detected is defined as >= 3 reads assigned to a transcript


Transcript Abundance Estimation

• There may be few ambiguous alignments to the genome, but a large fraction of ambiguity remains with respect to the transcriptome

• Ignoring the ambiguous fraction greatly reduces the read count and degrades repeatability, fold change estimation, and identification of differential expression

• Definition of the transcriptome plays an important role, and here we use the UCSC KnownGene table – a combination of RefSeq, GenBank, and Uniprot

• Many methods are available to intelligently assign ambiguous reads. Results in the remaining slides are from Cufflinks estimation of the KnownGene transcripts.
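One common way to assign ambiguous reads is an expectation-maximization (EM) loop that splits each read among its compatible transcripts in proportion to the current abundance estimates. The toy version below illustrates the idea only; it ignores transcript length and fragment bias and is not the Cufflinks algorithm. The function name and the read data are hypothetical.

```python
def em_abundance(read_compat, transcripts, n_iter=100):
    """Toy EM: fractionally assign reads to compatible transcripts.

    read_compat: list of sets, one per read, holding the transcripts that
    read is compatible with. Returns expected read counts per transcript.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read in proportion to current abundances.
        for compat in read_compat:
            weight = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / weight
        # M-step: abundances are the normalized expected counts.
        total = sum(counts.values())
        theta = {t: c / total for t, c in counts.items()}
    return counts

# Hypothetical reads: 6 unique to isoform A, 2 unique to B, 4 shared.
reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 4
counts = em_abundance(reads, ["A", "B"])
print({t: round(c, 3) for t, c in counts.items()})  # {'A': 9.0, 'B': 3.0}
```

Discarding the 4 shared reads would leave only 8 of 12 reads counted; the EM assignment keeps all 12 while preserving the 3:1 ratio seen in the unique reads.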


Magnitude of Fold Change (Log Ratio)

Slope estimates > 1 indicate compression of fold change estimates in array platforms (x-axes) relative to HiSeq (y-axes).

FC (log ratio) estimates are increased for 25M reads (right) relative to 10M reads (left).

r2 values are increased for 25M reads (right) relative to 10M reads (left).

[Figure: panel columns 10M PE HiSeq (left) and 25M PE HiSeq (right); panel rows Affy array and Illumina array]
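The slope and r2 in such a comparison come from a least-squares fit of the sequencing log ratios (y) on the array log ratios (x). A minimal sketch with made-up log ratios, assuming ordinary least squares; the slides do not state which regression method was used.

```python
def slope_and_r2(x, y):
    """Ordinary least-squares slope of y on x, and squared correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Hypothetical log ratios: the array reports 2/3 of the sequencing value.
array_lfc = [-2.0, -1.0, 0.0, 1.0, 2.0]
seq_lfc = [-3.0, -1.5, 0.0, 1.5, 3.0]
slope, r2 = slope_and_r2(array_lfc, seq_lfc)
print(slope, r2)  # 1.5 1.0
```

A slope of 1.5 here means the array log ratios are compressed relative to sequencing, even though the two platforms are perfectly correlated.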


Concordance of Fold Change

• r2 (r=correlation) values are observed to be modestly increased in 25M reads versus 10M reads


Comparison of Differential Expression

• Venn diagrams are used by many to assess concordance of differential expression; however, the choice of threshold for significance varies widely

• Concordant/discordant counts were tabulated from 40 combinations of q-value and fold change thresholds (q in 0.01-0.25 and FC in 1.5-8)

• For each threshold, the number discordant by platform was normalized to the number concordant


Comparison of Differential Expression

• With FC > 1.5 and q < 0.01 for a given number of reads

• Summarized values: scale the common set to 1

• Value of 0 for platform A means that every differential expression call made by A was also made by B

• Increasing values indicate an increasing number of significant calls unique to the platform

Illustration
            Platform A   Common   Platform B
Counts      460          1000     1270
Scaled      0.46         1        1.27
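The scaling in the illustration, and its repetition over the 40 threshold combinations, can be sketched as follows. The q and FC grids are the values shown on the slides; the input format ({transcript: (q_value, fold_change)}) and the function names are assumptions.

```python
from itertools import product

Q_THRESHOLDS = [0.01, 0.05, 0.1, 0.15, 0.25]
FC_THRESHOLDS = [1.5, 2, 3, 4, 5, 6, 7, 8]

def de_calls(results, q_max, fc_min):
    """Transcripts called differentially expressed at the given thresholds.

    results: {transcript: (q_value, fold_change)}, fold_change >= 1.
    """
    return {t for t, (q, fc) in results.items() if q < q_max and fc > fc_min}

def scaled_venn(calls_a, calls_b):
    """(unique to A, common, unique to B), scaled so that common = 1.

    Returns None when the platforms share no calls (scaling undefined).
    """
    common = len(calls_a & calls_b)
    if common == 0:
        return None
    return (len(calls_a - calls_b) / common, 1.0,
            len(calls_b - calls_a) / common)

def summarize(results_a, results_b):
    """Scaled Venn counts for every (q, FC) threshold combination."""
    return {(q, fc): scaled_venn(de_calls(results_a, q, fc),
                                 de_calls(results_b, q, fc))
            for q, fc in product(Q_THRESHOLDS, FC_THRESHOLDS)}

# Reproduce the illustration: 460 unique to A, 1000 common, 1270 unique to B.
calls_a = set(range(0, 1460))
calls_b = set(range(460, 2730))
print(scaled_venn(calls_a, calls_b))  # (0.46, 1.0, 1.27)
print(len(list(product(Q_THRESHOLDS, FC_THRESHOLDS))))  # 40
```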


Comparison of Differential Expression

[Figure: grid of 40 panels, one per threshold combination (rows: q = .01, .05, .1, .15, .25; columns: FC = 1.5, 2, 3, 4, 5, 6, 7, 8), each showing the scaled values for Platform A, Common, and Platform B]

For each collection of read depth (10M, 25M, etc.) and strategy (PE, SE)


Summarization of Differential Expression (for common content - microarray vs. sequencing)

At 10M reads, the number of detected DE transcripts unique to each platform is similar in magnitude to the intersection of both.

For 25M reads, sequencing produces a noticeably larger total number of differentially detected transcripts even for common content, and the amount unique to each array type (Affy and Illumina) is much smaller than at 10M.

Error bars indicate variation across the 40 combinations of FC and q.


HiSeq vs Array Comparative Summary

• EA alignment of 10M reads provides performance similar to currently available gene expression microarrays in terms of detection, estimation of fold change, and detection of differential expression for the content assayed by the microarray

• With EA alignment of 25M reads, fold change estimates are 75-100% larger, and 2-3x more transcripts are identified as differentially expressed for the content assayed by the microarray

• PE provides modest benefits in all aspects of quantification relative to SE

• However, there are other RNAs to measure using mRNAseq, as detection is increased 4x relative to microarrays for 25M PF clusters


Effect of Sequencing Parameters

Agreement improves with increasing number of reads (top vs bottom).

25 cycle x 2 is much better than 50 cycle x 1 to recapitulate the results of 50 cycle x 2.

[Figure: panel columns 25b PE vs 50b PE and 50b SE vs 50b PE; panel rows 10M (top) and 25M (bottom)]


Effect of Differential Expression Test

DESeq vs. Cuffdiff

Magnitudes of change are compressed in Cuffdiff relative to DESeq.

DESeq identifies 2-3x more unique transcripts than Cuffdiff across the three comparisons.


Detection of Single Nucleotide Variants

Approximately 1/3 of detected SNVs are not known to dbSNP

At least 1 variant is detected in ~10% of detected transcripts


De novo Transcript Assembly

• Cufflinks was used for de novo assembly of transcripts; no transcriptome definition is used

• PE provides superior performance to SE in this scenario


Novel Transcript Assembly

• Identification of completely novel transcripts – those that are assembled from mRNA, but exist in regions currently annotated as intergenic


Sequencing Parameters and Processing

• Good correlation of differential expression is observed between 50 SE and 50 PE. However, 25 PE is superior to 50 SE when 50 PE is the standard.

• Differential expression statistics are still evolving, and the major disagreement between current methods is in estimation of error

• Single nucleotide variants can be detected, but only for the ~10% most abundant transcripts

• De novo assembly is greatly improved with PE information


Experimental Results

25M clusters of 50 x 2

Principal component analysis easily segregates the cell lines from the three known subtypes.

[Figure legend: Basal, Claudin, Luminal]


Expression of Isoforms

• ESR1 is a well-studied gene due to its association with developmental stage in epithelial cells and its use as a biomarker for treatment in breast cancer

• 11 isoforms of ESR1 are detected, and some may be indicative of isoform-specific differential expression

[Figure: panels Claudin, Basal, Luminal]


Differential Expression of Isoforms

• Differentially expressed transcripts between Claudin and Luminal cell lines were identified as before

• These were filtered for isoforms of the same gene that exhibit opposing direction of change between the groups, and 115 transcripts were identified

[Figure: panels Claudin, Luminal]
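The filter for isoforms of the same gene that change in opposite directions can be expressed as a grouping step. A sketch only, assuming the differentially expressed transcripts arrive as {transcript: (gene, log fold change)}; the identifiers and values below are hypothetical.

```python
from collections import defaultdict

def opposing_isoforms(de_transcripts):
    """Transcripts whose gene has DE isoforms changing in both directions.

    de_transcripts: {transcript_id: (gene, log_fc)} for transcripts that
    were already called differentially expressed.
    """
    by_gene = defaultdict(list)
    for tx, (gene, log_fc) in de_transcripts.items():
        by_gene[gene].append((tx, log_fc))
    selected = []
    for isoforms in by_gene.values():
        has_up = any(lfc > 0 for _, lfc in isoforms)
        has_down = any(lfc < 0 for _, lfc in isoforms)
        if has_up and has_down:          # opposing direction of change
            selected.extend(tx for tx, _ in isoforms)
    return sorted(selected)

de = {
    "geneA.1": ("geneA", 2.1),    # up in one group
    "geneA.2": ("geneA", -1.8),   # down in the same comparison
    "geneB.1": ("geneB", 1.2),
    "geneB.2": ("geneB", 0.9),    # same direction, so geneB is excluded
}
print(opposing_isoforms(de))  # ['geneA.1', 'geneA.2']
```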


Summary

• EA has consistently achieved >100M PF clusters per lane with high quality base calls for 50 x 2 cycle sequencing from TruSeq prepared libraries

• With EA alignment of 10M clusters of 50 x 2 cycles, HiSeq provides similar levels of information to microarrays, when limited to transcripts detected by the microarray

• With EA alignment of 25M clusters of 50 x 2 cycles, HiSeq provides substantially more information than microarrays, even when limited to transcripts detected by the microarray

• At >=10M clusters, HiSeq consistently detects 3-4x more transcripts than microarrays, and at >=25M clusters, HiSeq detects 2x more differentially expressed transcripts


Summary

• PE strategies are similar to or marginally better than SE with respect to
 – Magnitude of detection of known transcripts
 – Magnitude of detection of differential expression
 – Correlation of FC with microarrays
 – Estimating the magnitude of FC
• PE strategies are noticeably better than SE in improving
 – Percentage of unambiguously aligned reads
• PE strategies are much better than SE in improving
 – De novo assembly of transcripts or in detecting novel transcripts
• Detection of novel isoforms and SNVs is improved with increased coverage and read depth
• Alignment, estimation, and testing for differential expression are as important as the sequencing strategy


www.GenomicKnow-How.com