the iso-seq method...2015/08/24 · the term “iso-seq method” can refer to any transcriptome...

FIND MEANING IN COMPLEXITY

© Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.

Elizabeth Tseng, Ph.D.

Senior Staff Scientist

The Iso-Seq™ Method: Transcriptome Sequencing Using Long Reads

Transcription Variation Proteomic/Gene Complexity

slide from G. Sheynkman, ASMS talk 2014

A Single Gene Locus → Many Transcripts


slide from G. Sheynkman, ASMS talk 2014

Short reads cannot accurately assemble complex transcripts

Steijger et al. (2013) Assessment of transcript reconstruction methods for RNA-Seq. Nature Methods. doi:10.1038/nmeth.2714.

…the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination…

…assembly of complete isoform structures poses a major challenge even when all constituent elements are identified…

…Ultimately, the evolution of RNA-seq will move toward single-pass determination of intact transcripts….

http://www.ncbi.nlm.nih.gov/pubmed/24185837









Iso-Seq™ Method: PacBio Transcriptome Sequencing

• Single-molecule observation

– one read = one transcript

• Sequence transcript in full length

– 0 – 15 kb full-length transcripts

– no assembly required

The term “Iso-Seq method” can refer to any transcriptome (cDNA) sequencing using the PacBio System, including those that do not follow recommended library preparation or the Iso-Seq bioinformatics pipeline (ICE + Quiver, later slides).

Iso-Seq Library Workflow

• Total RNA (Optional Poly-A Selection yields polyA+ RNA)

• Reverse Transcription (SMARTScribe RT): Full Length 1st Strand cDNA

• PCR Optimization

• Large Scale Amplification (Phusion DNA Polymerase): Amplified cDNA

• Size Selection (gel / BluePippin / SageELF): 1-2 kb, 2-3 kb, 3-6 kb, 5-10 kb

• Re-Amplification (Phusion DNA Polymerase): 1-2 kb, 2-3 kb, 3-6 kb, 5-10 kb

• SMRTbell Template Preparation: 1-2 kb, 2-3 kb, 3-6 kb, 5-10 kb

• Optional Size Selection (BluePippin): 3-6 kb, 5-10 kb

• SMRT Sequencing

Size cuts can be arbitrary

Current max FL transcript seen: 15 kb

Full-Length (FL) read identification

Full-Length = 5’ primer seen, polyA tail seen, 3’ primer seen

• Identify and remove primers and polyA/T tail

• Identify transcript stranded-ness
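
The full-length rule above can be illustrated with a minimal sketch. It assumes exact primer matches at the read ends, whereas a real classifier would need to tolerate sequencing errors. The primer sequences, the minimum polyA length, and the function names are hypothetical placeholders, not the primers or code used by the Iso-Seq pipeline.

```python
from typing import NamedTuple, Optional

# Hypothetical placeholder primers (NOT the real library-prep primers).
FIVE_PRIME = "GGCATCAG"   # expected at the start of a sense-strand read
THREE_PRIME = "CTGACTGA"  # expected at the end of a sense-strand read
MIN_POLYA_LEN = 8         # assumed minimum tail length to call a polyA

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


class Classified(NamedTuple):
    full_length: bool
    strand: Optional[str]   # "+" if read is sense, "-" if antisense
    insert: Optional[str]   # transcript sequence with primers and polyA removed


def _trim_sense(read: str) -> Optional[str]:
    """Return the insert if read is 5' primer + insert + polyA + 3' primer."""
    if not (read.startswith(FIVE_PRIME) and read.endswith(THREE_PRIME)):
        return None
    body = read[len(FIVE_PRIME):len(read) - len(THREE_PRIME)]
    insert = body.rstrip("A")  # also strips genuine trailing A's of the transcript
    if len(body) - len(insert) < MIN_POLYA_LEN:
        return None
    return insert


def classify_read(read: str) -> Classified:
    """Call a read full-length if all three signals are seen in either orientation."""
    read = read.upper()
    insert = _trim_sense(read)
    if insert is not None:
        return Classified(True, "+", insert)
    insert = _trim_sense(reverse_complement(read))
    if insert is not None:
        return Classified(True, "-", insert)
    return Classified(False, None, None)


sense = FIVE_PRIME + "ATGGCCTTG" + "A" * 12 + THREE_PRIME
print(classify_read(sense))
print(classify_read(reverse_complement(sense)))
print(classify_read(FIVE_PRIME + "ATGGCCTTG"))
```

Checking both orientations is what recovers strandedness: a read that only matches after reverse-complementing is reported on the "-" strand, and its insert is returned in sense orientation.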

Bioinformatics Challenge

SAMPLE INPUT

ATTTAAGGCC ATTTAAGGCC ATTTAAGGCC

GCCATG GCCATG

TATAGGCAAGTAACGTT TATAGGCAAGTAACGTT

SEQUENCING OUTPUT

ATTCAAGGCC AATTAGGGC TTTAGGCC AAT GGCCATTG

GCCATG

TATAGGCAAGTACGTT TATAGGGGCAAGTAACGTT

Need to recover the original sequence → Error Correction


POST-ERROR CORRECTION

ATTTAAGGCC

GCCATG

TATAGGCAAGTAACGTT


POST-ERROR CORRECTION

ATTTAAGGCC: 3

GCCATG: 2

TATAGGCAAGTAACGTT: 2
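
As a toy illustration of how error-containing reads can be collapsed back to per-transcript counts, the sketch below assigns each noisy read to the closest candidate sequence by Levenshtein (edit) distance and tallies the assignments. This is an assumption-laden simplification: it presumes the candidate sequences are already known, which a de novo method does not. The function names are hypothetical and the reads are adapted from the example sequences on the slides (the short "AAT" fragment is left out).

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match / substitution
        prev = curr
    return prev[-1]


def count_by_nearest(reads, candidates):
    """Assign each read to its closest candidate and count the assignments."""
    counts = Counter({c: 0 for c in candidates})
    for read in reads:
        nearest = min(candidates, key=lambda c: edit_distance(read, c))
        counts[nearest] += 1
    return dict(counts)


candidates = ["ATTTAAGGCC", "GCCATG", "TATAGGCAAGTAACGTT"]
noisy_reads = [
    "ATTCAAGGCC", "AATTAGGGC", "TTTAGGCC",      # errors in ATTTAAGGCC
    "GGCCATTG", "GCCATG",                       # errors in GCCATG
    "TATAGGCAAGTACGTT", "TATAGGGGCAAGTAACGTT",  # errors in TATAGGCAAGTAACGTT
]
print(count_by_nearest(noisy_reads, candidates))
```

On this input the tallies reproduce the 3 / 2 / 2 counts shown above, even though only one of the seven reads is error-free.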

Error Correction: Three Approaches

Tool              Author                Genome-Guided   Hybrid (long + short reads)   Abundance Inference
ToFU (RS_IsoSeq)  Liz T.                N               N                             (not really)
CONVEX            Meisam R. (David T.)  N               N                             Y
LSC + IDP         Kin Fai A.            Y               Y                             Y


ToFU: The ICE + Quiver error correction pipeline

Transcript isOforms: Full-length and Unassembled

ToFU is available through SMRT Analysis (RS_IsoSeq) and GitHub (ToFU)

Methods are available in the paper's supplement

• de novo (no ref genome required)

• no assembly

• can handle any read length

• works for mixed accuracy

• post-Quiver: 99-100% accuracy

ToFU pipeline: classify → cluster (ICE) → Quiver polishing

• Per-molecule reads (ReadsOfInsert aka CCS reads)

• classify: Full-length (FL) reads and Non-FL reads

• ICE: Isoform-level clusters

• Clusters of transcript alignments using FL + nFL reads

• Quiver: Final transcript consensus (Transcript 1, Transcript 2, Transcript 3)
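
To show why a consensus over a cluster can be more accurate than any single read, here is a deliberately simplified stand-in for the polishing step: a column-wise majority vote over reads that are assumed to be already aligned to equal length, with "-" marking a gap. This is not Quiver, which is a more sophisticated consensus algorithm; the function name and the example cluster are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over pre-aligned, equal-length reads.

    '-' marks a gap; columns whose majority symbol is a gap are dropped.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Hypothetical cluster: three reads carry one error each, at different positions.
cluster = [
    "ATTTAAGGCC",
    "ATTCAAGGCC",  # substitution
    "ATTTAAGGGC",  # substitution
    "A-TTAAGGCC",  # deletion, shown as a gap
    "ATTTAAGGCC",
]
print(majority_consensus(cluster))
```

Because the errors fall in different columns, each is outvoted by the other reads in the cluster, and the consensus is error-free.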

ToFU reveals transcriptional complexity in P. crispa

Gray: single-gene transcripts

Green: polycistronic transcripts that span 2+ genes

Top: Short read mapping

Bottom: PacBio transcripts

Gordon & Tseng, 2015

From Novel Transcripts to Novel Proteins

Sheynkman, ASMS talk 2014

PacBio public MCF-7 dataset

• ~90% of predicted ORFs matched a mass spec peptide

• 251 novel ORFs found unique to MCF-7

For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
