the iso-seq method...2015/08/24 · the term “iso-seq method” can refer to any transcriptome...
FIND MEANING IN COMPLEXITY
© Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
Elizabeth Tseng, Ph.D.
Senior Staff Scientist
The Iso-Seq™ Method: Transcriptome Sequencing Using Long Reads
Transcription Variation Proteomic/Gene Complexity
Slide from G. Sheynkman, ASMS talk 2014
A Single Gene Locus Many Transcripts
Slide from G. Sheynkman, ASMS talk 2014
Short reads cannot accurately assemble complex transcripts
Steijger et al. (2013) Assessment of transcript reconstruction methods for RNA-Seq. Nature Methods. doi:10.1038/nmeth.2714
“…the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination…”
“…assembly of complete isoform structures poses a major challenge even when all constituent elements are identified…”
“…Ultimately, the evolution of RNA-seq will move toward single-pass determination of intact transcripts…”
Iso-Seq™ Method: PacBio Transcriptome Sequencing
• Single-molecule observation
– one read = one transcript
• Sequence transcript in full length
– 0 – 15 kb full-length transcripts
– no assembly required
The term “Iso-Seq method” can refer to any transcriptome (cDNA) sequencing
using the PacBio System, including those that do not follow recommended library
preparation or the Iso-Seq bioinformatics pipeline (ICE + Quiver, later slides)
Iso-Seq Library Workflow
• Total RNA
• Optional Poly-A Selection: polyA+ RNA
• Reverse Transcription (SMARTScribe RT): Full Length 1st Strand cDNA
• PCR Optimization
• Large Scale Amplification (Phusion DNA Polymerase): Amplified cDNA
• Size Selection (gel / BluePippin / SageELF): 1-2 kb, 2-3 kb, 3-6 kb, 5-10 kb
• Re-Amplification (Phusion DNA Polymerase): 1-2 kb, 2-3 kb, 3-6 kb, 5-10 kb
• SMRTbell Template Preparation: 1-2 kb, 2-3 kb, 3-6 kb, 5-10 kb
• Optional Size Selection (BluePippin): 3-6 kb, 5-10 kb
• SMRT Sequencing
Size cuts can be arbitrary
Current max FL transcript seen: 15 kb
Full-Length (FL) read identification
Full-Length = 5’ primer seen, polyA tail seen, 3’ primer seen
• Identify and remove primers and polyA/T tail
• Identify transcript stranded-ness
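The full-length rule above can be sketched in a few lines of Python. This is a minimal illustration, not the classify step shipped with the Iso-Seq pipeline: the primer sequences, the `MIN_POLYA` cutoff, and the function names are all hypothetical, and exact string matching stands in for the mismatch-tolerant primer detection a real tool would need.

```python
import re

# Hypothetical primers, written as they would appear in a sense-strand read.
FIVE_PRIMER = "AAGCAGTGG"
THREE_PRIMER = "GTACTCTGC"
MIN_POLYA = 8  # assumed minimum polyA tail length

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _trim_sense(read):
    """Return the insert if read = 5' primer + insert + polyA + 3' primer, else None."""
    if not (read.startswith(FIVE_PRIMER) and read.endswith(THREE_PRIMER)):
        return None
    body = read[len(FIVE_PRIMER):len(read) - len(THREE_PRIMER)]
    tail = re.search(r"A{%d,}$" % MIN_POLYA, body)
    if tail is None:
        return None
    return body[:tail.start()]  # primers and polyA tail removed


def classify_read(read):
    """Return (is_full_length, strand, insert); strand is '+' or '-'."""
    for strand, candidate in (("+", read), ("-", revcomp(read))):
        insert = _trim_sense(candidate)
        if insert is not None:
            return True, strand, insert
    return False, None, None


example = FIVE_PRIMER + "ATGGCCATTG" + "A" * 12 + THREE_PRIMER
print(classify_read(example))
print(classify_read(revcomp(example)))
```

A read seen in antisense orientation starts with the reverse complement of the 3' primer followed by a polyT stretch, so trying both orientations recovers the strand and the trimmed insert in one pass.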
Bioinformatics Challenge
SAMPLE INPUT
ATTTAAGGCC ATTTAAGGCC ATTTAAGGCC
GCCATG GCCATG
TATAGGCAAGTAACGTT TATAGGCAAGTAACGTT
SEQUENCING OUTPUT
ATTCAAGGCC AATTAGGGC TTTAGGCC AAT GGCCATTG
GCCATG
TATAGGCAAGTACGTT TATAGGGGCAAGTAACGTT
Need to recover the original sequence: Error Correction
POST-ERROR CORRECTION
ATTTAAGGCC: 3
GCCATG: 2
TATAGGCAAGTAACGTT: 2
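To make the idea of grouping noisy reads and counting them concrete, here is a toy sketch. It is not ICE or Quiver: it uses plain greedy clustering on normalized edit distance, the threshold `MAX_NORM_DIST` is an assumption tuned to this tiny example, and the reads are a subset of the sequencing output shown on the slide. It only groups and counts reads; it does not produce a corrected consensus.

```python
# Noisy reads taken from the slide's sequencing-output panel.
READS = [
    "ATTCAAGGCC", "AATTAGGGC", "TTTAGGCC",
    "GGCCATTG", "GCCATG",
    "TATAGGCAAGTACGTT", "TATAGGGGCAAGTAACGTT",
]

MAX_NORM_DIST = 0.45  # assumed cutoff, chosen for this toy data only


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def greedy_cluster(reads, max_norm_dist=MAX_NORM_DIST):
    """Assign each read to the first cluster whose representative is close enough."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            rep = cluster[0]  # the first read of a cluster acts as its representative
            if edit_distance(read, rep) / max(len(read), len(rep)) <= max_norm_dist:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


for cluster in greedy_cluster(READS):
    print(cluster[0], len(cluster))
```

The cluster sizes play the role of the per-transcript counts on the slide; recovering the exact original sequence additionally requires a consensus step over each cluster.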
Error Correction: Three Approaches
Tool             | Author                | Genome-Guided | Hybrid (long + short reads) | Abundance Inference
ToFU (RS_IsoSeq) | Liz T.                | N             | N                           | (not really)
CONVEX           | Meisam R. (David T.)  | N             | N                           | Y
LSC + IDP        | Kin Fai A.            | Y             | Y                           | Y
ToFU: The ICE + Quiver error correction pipeline
Transcript isOforms: Full-length and Unassembled
ToFU is available through SMRT Analysis (RS_IsoSeq) and GitHub (ToFU)
Methods are available in the paper's supplement
• de novo (no ref genome required)
• no assembly
• can handle any read length
• works with reads of mixed accuracy
• post-Quiver: 99-100% accuracy
ToFU pipeline: classify → cluster (ICE) → Quiver polishing
• Per-molecule reads (ReadsOfInsert, aka CCS reads)
• Classify: Full-length (FL) reads and Non-FL reads
• ICE: Isoform-level clusters (Transcript 1, Transcript 2, Transcript 3)
• Clusters of transcript alignments using FL + nFL reads
• Quiver: Final transcript consensus (Transcript 1, Transcript 2, Transcript 3)
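The final consensus step can be pictured with a column-wise majority vote over reads that have already been aligned to each other. This is only a stand-in for illustration and is not how Quiver works internally; the function name, the gap character `-`, and the pre-aligned example reads are assumptions.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Call the most common symbol in each alignment column, dropping gap calls."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a majority gap means no base is emitted at this column
            consensus.append(base)
    return "".join(consensus)


# Hypothetical alignment of three noisy reads from one isoform cluster.
aligned = [
    "ATTTAAGGCC",
    "ATTCAAGGCC",  # substitution at column 4
    "ATTTA-GGCC",  # deletion at column 6
]
print(majority_consensus(aligned))
```

Because each error here appears in only one of the three reads, the vote recovers the underlying sequence; with more reads per cluster the same idea tolerates a higher per-read error rate.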
ToFU reveals transcriptional complexity in P. crispa
Gray are single gene transcripts
Green are polycistronic transcripts
that span 2+ genes
Top: Short read mapping
Bottom: PacBio transcripts
Gordon & Tseng, 2015
From Novel Transcripts to Novel Proteins
Sheynkman, ASMS talk 2014
PacBio public MCF-7 dataset
• ~90% of predicted ORFs matched a mass spec peptide
• 251 novel ORFs found that are unique to MCF-7
For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq
are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.