translational genome informatics laboratory, severance biomedical science institute, yonsei...

25
Translational Genome Informatics Laboratory, Severance Biomedical Science Institute, Yonsei University College of Medicine Journal club presentation August 26 2015 In Seok Yang | [email protected]

Upload: moris-sanders

Post on 13-Dec-2015

222 views

Category:

Documents


7 download

TRANSCRIPT

Translational Genome Informatics Laboratory,Severance Biomedical Science Institute,Yonsei University College of Medicine

Journal club presentation

August 26 2015

In Seok Yang | [email protected]

Journal

| 2

| 3

Application of RNA sequencing

1. Expressions of genes or isoforms 2. Alternative splicing

3. Non-coding RNAs 4. Gene fusions

Expression level - Single-end: RPKM - Paired-end: FPKM

| 4

Challenging issues

1. Transcript identification problem- Assembling these short reads into full transcripts is a complex task. programs: Trinity, Oases

2. Expression quantification problem- Exon sharing among transcripts- Ambiguous read mappings due to close paralogs- Low levels of gene expression programs: RSEM eXpress

3. Both problem programs: IsoInfer, Scrpture, Cufflinks, SLIDE, IsoLasso, iReckon, Traph

| 5

Recent study

Steijger, T. et al. Nat. Methods 10, 1177–1184 (2013).

The current transcript reconstruction methods showed that even in cases where these methods identified all constituent exons of a transcript, they often failed to assemble the exons into complete isoforms.

| 6

In this study

The authors developed a transcriptome assembly method named StringTie.

They showed that StringTie correctly identified 36–60% more transcripts than the next best assembler (Cuf-flinks) on multiple real and simulated data sets.

Using simulated data, they showed that the expression levels produced by StringTie also showed higher agreement with the true values.

| 7

Transcript assembly pipelines

StringTie uses a genome-guided transcriptome assembly (using TopHat2; spliced alignment tool) and de novo genome assembly (using MaSuRCA, optional)

TopHat2

MaSuRCASuper-reads

| 8

Algorithm (assembly)

1. Create super-reads using de novo assembly algorithm (MaSuRCA program)

2. Extend every read in both directions as long as the extension is unique

3. Identify pairs of reads that belong to the same super-read (assembled read)

4. Extract the sequence containing the pair plus the sequence between them; that is, the entire sequence of the original DNA fragment

Assembled sequence (=super-read)

A single read (=super-read)

5. Map the super-reads to the reference genome

Unassembled read pairUnassembled read pair

| 9

Algorithm (overall)

Node 1Node 2

Node 3

Node 4Node 5

Edge 1

Edge 2

Edge 3

9

4

3 3

14

911 9

4

5 3 2

3 3

Cufflinks(parsimony principal): 4 transcripts

StringTie(maximum flow algorithm): 5 transcripts

Weight on edge

ASG

| 10

Algorithm (estimate abundance)

Isoform 1

* *

*

Multiplieror

Bias factor

Node u Node v

fuv

w: weight

w(fuv)

Bias factor reflects biased distribution of fragments aligning to given node in ASG

*The number of fragments FPKM

| 11

Program running

StringTie v1.0.4 usage:

stringtie <input.bam> [-G <guide_gff>] [-l <label>] [-o <out_gtf>] [-p <cpus>] [-v] [-a <min_anchor_len>][-m <min_tlen>] [-j <min_anchor_cov>] [-n sens] [-C <coverage_file_name>] [-s <maxcov>][-c <min_bundle_cov>] [-g <bdist>] [-e] [-x <seqid,..>] {-B | -b <dir_path>}

Assemble RNA-Seq alignments into potential transcripts.

Options:-G reference annotation to use for guiding the assembly process (GTF/GFF3)-l name prefix for output transcripts (default: STRG)-f minimum isoform fraction (default: 0.1)-m minimum assembled transcript length (default: 200)-o output path/file name for the assembled transcripts GTF (default: stdout)-a minimum anchor length for junctions (default: 10)-j minimum junction coverage (default: 1)-t disable trimming of predicted transcripts based on coverage (default: coverage

trimming is enabled)-c minimum reads per bp coverage to consider for transcript assembly (default: 2.5)-s coverage saturation threshold; further read alignments will be ignored in a region

where a local coverage depth of <maxcov> is reached (default: 1,000,000);

-v verbose (log bundle processing details)-g gap between read mappings triggering a new bundle (default: 50)-C output file with reference transcripts that are covered by reads-M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)-p number of threads (CPUs) to use (default: 1)-B enable output of Ballgown table files which will be created in the same

directory as the output GTF (requires -G, -o recommended)-b enable output of Ballgown table files but these files will be created under

the directory path given as <dir path>-e only estimates the abundance of given reference transcripts (requires -G)-x do not assemble any transcripts on these reference sequence(s)

pre-built Linux x86_64 binary packagepre-built Apple OS X binary for OS X version 10.7 or above

| 12

Criteria for transcript identification

A transcript is correctly identified

E E E E E

E E E E E

Predicted transcript (P)

True transcript (T)5’ UTR 3’ UTRIntron Intron Intron Intron

Abs(start position of the first exon of P – start position of the first exon of T) < 100 bp

Abs(end position of the first exon of P – end position of the first exon of T) < 100 bp

1.

2.

3.

4.

5.

Multi-exon transcripts are correctly assembled only if their strand was also correctly identified.

When strand-specific RNA-seq data were used, the authors also required that single-exon transcripts were as-signed to the correct strand.

If the read data only partially covered a transcript, the corresponding prediction to be correct if all partially cov-ered parts of the transcript were identified.

| 13

Simulated data sets

Simulated data sets were generated from Flux Simulator (150 million 75-bp paired-end reads, extracted from all RefSeq human transcripts as provided by the UCSC Genome Browser).

These data sets were used to evaluate how well each assembler captures both the structure and the quantity of each generated transcript.

First data set (Sim-I) was generated using the exact parameters specified for a directional human RNA-seq protocol.

Second data set (Sim-II) was generated, in which the empirical fragment size distribution from Sim-I was replaced with a parameterized normal N(250,20) distribution, because some transcriptome assemblers as-sume a normal distribution (including Cufflinks). N(average read length, standard deviation)

| 14

Transcriptome assemblers’ accuracies in detecting transcripts

StringTie

Cufflinks

All transcripts (full and partial) Fully covered transcripts only

| 15

Filtering out of low-abundance transcripts

It is designed to eliminate isoforms expressed at very low levels(i.e., incompletely spliced precursors or other artifacts).

StringTie and Cufflinks = 10% of the most abundant isoform

Figures in previous slide show StringTie’s accuracy when this filtering threshold was set to 10%, the same level as used by Cufflinks.

Interestingly, lowering the threshold to 5% for StringTie still yielded better sensitivity and precision than Cufflinks yielded at the 10% threshold.

| 16

Real data sets

Three human strand-specific RNA-seq data sets from the ENCODE project28- whole B cells in blood (GEO accession GSM981256) : 90 million 76-bp paired-end reads - the cytosol of fetal lung fibroblasts (GSM981244) : 45 million 101-bp paired-end reads - CD14-positive monocytes (GSM984609) : 120 million 76-bp paired-end reads

One unstranded RNA-seq data set (nuclear RNA from a human kidney cell line) : ~180 million 100-bp paired-end reads

TopHat2 and MaSuRCA were used for read mapping and assembly, respectively.Aligned paired reads outputted by TopHat2 were provided to StringTie, Cufflinks, Scrip-ture, IsoLasso and Traph.

| 17

Quantification (expression level) of transcript

Quantification is measured by the number of pairs of reads (“fragments”). fragments per kilobase of transcript per million fragments (FPKM)

Predicted Tran-script (P)

True transcript (T)

Predicted Tran-script (P)

True transcript (T)

Predicted Tran-script (P)

True transcript (T)

P is matched with a transcript that had an expression level of zero.

T is matched with a prediction that had an expression level of zero.

Predicted Tran-script (P1)

True transcript (T)

Predicted Tran-script (P2)

Predicted Tran-script (P3)

multiple predicted transcripts (transfrags)

The expression level of the true transcript was calculated by summing all the reads assigned to the predicted transfrags.

1.

2.

3.

4.

| 18

Transcriptome assemblers’ performancesR

eal d

ata

Results on simulated data show the Spearman correlation coefficient between the real and predicted number of reads.For the four real data sets, shown are the number of genes and transcripts exactly matching known annotation.

Real data provide a better test of each program’s performance because they have properties that are not accurately captured by simulations. Repetitive regions in the human genome, wide variance in GC content, isoform length and alternative splicing complexity are all factors that influence performance

all include all true and

predicted transcripts.

predicted include all pre-

dicted transcripts but ex-clude true transcripts that were not predicted.

| 19

Accuracy of transcript assemblers at assembling known genes

StringTie

Cufflinks

Because many overlapping splice variants are present at most loci, the number of transcripts covered by reads is less than the number actually present in the sample.

| 20

Comparison between StringTie and Cufflinks

StringTie was superior to Cufflinks because it did a better job at identifying the dominant transcript for a gene locus. Alternatively, it might have had an advantage on genes with larger numbers of exons or genes with more isoforms.

1. The number of transcripts

| 21

Comparison between StringTie and Cufflinks

StringTie was superior to Cufflinks because it did a better job at identifying the dominant transcript for a gene locus. Alternatively, it might have had an advantage on genes with larger numbers of exons or genes with more isoforms.

Kedney Blood Lung Mnocytes

the

num

ber

of e

xons

in tr

ansc

ript

s

the

num

ber

of e

xons

in tr

ansc

ript

s

the

num

ber

of e

xons

in tr

ansc

ript

s

the

num

ber

of e

xons

in tr

ansc

ript

s

StringTie Cufflinks StringTie Cufflinks StringTie Cufflinks StringTie Cufflinks

2. The distribution of the number of exons in transcripts identified by StringTie and not Cuff-links (left side of each plot), or by Cufflinks and not StringTie (right side of each plot).

| 22

Comparison between StringTie and Cufflinks

StringTie was superior to Cufflinks because it did a better job at identifying the dominant transcript for a gene locus. Alternatively, it might have had an advantage on genes with larger numbers of exons or genes with more isoforms.

StringTie StringTie StringTie StringTie

Cufflinks Cufflinks Cufflinks Cufflinks

FPKM FPKM FPKM FPKM

Kedney Blood Lung Mnocytes

Den

sity

3. The FPKM values of transcripts identified only by StringTie or only by Cuff-links

| 23

Comparison between StringTie and Cufflinks

StringTie was superior to Cufflinks because it did a better job at identifying the dominant transcript for a gene locus. Alternatively, it might have had an advantage on genes with larger numbers of exons or genes with more isoforms.

3. Distribution of genes containing transcripts correctly identified only by StringTie or only by Cufflinks

StringTie StringTie StringTie StringTie

Cufflinks Cufflinks Cufflinks Cufflinks

Kedney Blood Lung Mnocytes

Per

cent

age

of g

enes

con

tain

ing

uniq

uely

iden

tifi

ed tr

ansc

ript

s

No. of trascript No. of trascript No. of trascript

*** *** *** ***

1 2 3 4 5+ 1 2 3 4 5+ 1 2 3 4 5+ 1 2 3 4 5+

No. of trascript

| 24

Conclusion

1. Compared with other leading methods on both simulated and real data, StringTie was substantially more accurate at both assembly and quantitation of gene transcripts, recovering more expressed transcripts.

2. The main reason for StringTie’s accuracy is the network flow algorithm.3. StringTie can also use aligned de novo assembled fragments that combine pairs of reads into a longer se-

quence.4. StringTie assembles transcripts and estimates their expression levels simultaneously.

By comparison of two programs (StringTie and Cufflinks), StringTie was better at reconstructing at least three types of genes: (i) those at low abundance; (ii) those with more exons and (iii) those with multiple isoforms.

The authors envisage that users could replace Cufflinks with StringTie in most RNA-seq analysis pipelines and expect substantial improvements in transcript assembly coupled with faster performance.

Thank You