Overview and Applications of Next-Generation Sequencing Technologies

Stéphane Deschamps
Analytical & Genomic Technologies, DuPont Agriculture & Nutrition, Pioneer Hi-Bred International


Post on 15-Jan-2016


Page 1: Overview and Applications of Next-Generation Sequencing Technologies

Overview and Applications of Next-Generation Sequencing Technologies

Stéphane Deschamps

Analytical & Genomic Technologies
DuPont Agriculture & Nutrition
Pioneer Hi-Bred International

Page 2: Overview and Applications of Next-Generation Sequencing Technologies

Outline

1. Next-Generation Sequencing Platforms
   1. 454 FLX technology
   2. Solexa/Illumina technology

2. Applications of Next-Generation Sequencing Technologies
   1. Overview
   2. Variant detection with Illumina platform
   3. Open-source tools for bioinformatics

3. Third-Generation Sequencing technologies: what’s next?

Page 3: Overview and Applications of Next-Generation Sequencing Technologies

Sanger sequencing

Successive improvements now allow 96 reads of 800-900 bases each to be sequenced in less than 2 h

Page 4: Overview and Applications of Next-Generation Sequencing Technologies

Sanger sequencing

Sanger sequencing has been, and still is, very useful...

...but it remains slow and expensive

Page 5: Overview and Applications of Next-Generation Sequencing Technologies

Sequencing Platform Comparisons

                             ABI 3730xl    454 FLX Titanium    Illumina GA II
Read length                  ~750 bp       ~450 bp             18-75 bp
Number of reads/run          96            500K                100MM
Max yield/run                ~70 Kbp       ~1 Gbp              ~10 Gbp
Cost/1 Gbp                   $3.5MM        $7,000              $1,000
Run time/machine to 1 Gbp    8 years       1 day               <1 day

Page 6: Overview and Applications of Next-Generation Sequencing Technologies

Next-Generation Sequencing

Third-generation platforms:
• Complete Genomics
• BioNanomatrix
• VisiGen
• Pacific Biosciences
• Intelligent Bio-Systems
• ZS Genetics
• Reveo
• LightSpeed Genomics
• NABsys
• Oxford Nanopore Technologies

Second-generation platforms:
• 454/Roche
• Solexa/Illumina
• SOLiD/ABI
• Helicos BioSciences
• Dover Systems

Page 7: Overview and Applications of Next-Generation Sequencing Technologies

454 FLX Titanium

• First next-generation sequencing platform launched (October 2005)

• Titanium chemistry for the 454 FLX launched in September 2008

• Sequencing By Synthesis

– Pyrosequencing

– Chemiluminescent signal

• Long read technology (~450 nucleotides)

• Possibility of sequencing both ends of DNA fragments (FLX platform)

• Generates up to 0.5Gbps per run

• Max cost is ~$10,000/run

Page 8: Overview and Applications of Next-Generation Sequencing Technologies

454 FLX Titanium

• DNA Library Construction

• Emulsion PCR

• Sequencing

Page 9: Overview and Applications of Next-Generation Sequencing Technologies

DNA Library Construction

• DNA fragmentation via nebulization

• Size-selection

• Ligation of adapters A & B

• Selection of A/B fragments via biotin selection

• Denaturation to select single-stranded A/B fragments

• No cloning!

[Figure: library construction schematic. Labels: end repair; adapter-ligated fragments (A/A), (B/B) and (A/B); streptavidin; denaturation; A/B ss DNA; emulsion PCR]

Page 10: Overview and Applications of Next-Generation Sequencing Technologies

Emulsion PCR

• Add DNA to capture beads (needs titration)

• Add PCR reagents to DNA and capture beads

• Transfer sample to oil tube or cup

• Emulsify DNA capture beads in PCR reagents to form water-in-oil “microreactors”

– Emulsion with Qiagen TissueLyser (high-speed shaker)

• Clonal amplification in microreactors

– Careful not to break the emulsion!

– ~10MM copies per capture bead

• Break emulsion and enrich for DNA-positive beads

– Use biotinylated oligo to capture enriched beads, then denature

www.roche-applied-science.com

Page 11: Overview and Applications of Next-Generation Sequencing Technologies

Bead deposition into plates

• Deposition of enriched beads into PicoTiter plate

• Well diameter = 29 µm, allowing for a single bead (20 µm diameter) per well

• Chambers are filled with enzyme beads, DNA beads and packing beads.

www.roche-applied-science.com

Page 12: Overview and Applications of Next-Generation Sequencing Technologies

Pyrosequencing

1. Polymerase adds a nucleotide (sequential flow of dNTPs)

2. PPi is released

3. Sulfurylase creates ATP from PPi

4. Luciferase uses the ATP to oxidize luciferin, producing light

www.roche-applied-science.com

Page 13: Overview and Applications of Next-Generation Sequencing Technologies

Image and signal processing

1. Raw data are a series of images (one image per base per cycle).

2. Data are extracted, quantified and normalized.

3. Read data are converted into “flowgrams”.
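
The idea of a flowgram can be illustrated with a minimal sketch. It assumes normalized signals where a value near n means a homopolymer of n bases was incorporated during that flow, and it assumes a repeating flow order of T, A, C, G; both the function name and the rounding rule are illustrative assumptions, not the vendor's basecaller.

```python
from typing import Sequence


def flowgram_to_bases(signals: Sequence[float], flow_order: str = "TACG") -> str:
    """Call bases from normalized flow signals (illustrative only).

    Each flow presents one nucleotide. A signal near n is read as a run of
    n identical bases; a signal near 0 means nothing was incorporated.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]   # assumed cyclic flow order
        run_length = max(0, int(signal + 0.5))         # round to nearest whole base
        bases.append(nucleotide * run_length)
    return "".join(bases)


example_flowgram = [1.02, 0.05, 2.10, 0.97, 0.03, 1.10, 0.00, 2.95]
print(flowgram_to_bases(example_flowgram))  # TCCGAGGG
```

Because the homopolymer length is inferred from signal intensity, long runs of the same base are where this kind of call becomes least certain.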

Page 14: Overview and Applications of Next-Generation Sequencing Technologies

Post-processing

1. Output = flowgrams, basecalls, Phred-equivalent scores

2. Basecalls and flowgrams can be used in the following applications:

   1. De novo assembler – consensus sequences assembled into contigs with quality scores and ACE file (works best with genomic DNA).

   2. Reference mapper – contigs mapped to reference sequence + list of high-confidence mutations

   3. Amplicon variant analyzer – identification of sequence variants in amplicon libraries

Page 15: Overview and Applications of Next-Generation Sequencing Technologies

Illumina Genome Analyzer

• Successor to MPSS (Massively Parallel Signature Sequencing)

• Single molecule array (“flow cell”) with millions of amplified clusters

• Sequencing By Synthesis

– Removable fluorescence

– Reversible terminators

• Short read technology (16 - 75 nucleotides)

• Possibility of sequencing both ends of DNA fragments

• Generates up to 20Gbps per run

• Max cost is ~$10,000/run

= $500/Gbp!
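
The $500/Gbp figure follows directly from the two numbers on this slide. A small sketch of the arithmetic (the function name is only illustrative):

```python
def cost_per_gbp(cost_per_run_usd: float, yield_gbp_per_run: float) -> float:
    """Return sequencing cost in USD per gigabase pair for one run."""
    if yield_gbp_per_run <= 0:
        raise ValueError("yield must be positive")
    return cost_per_run_usd / yield_gbp_per_run


# Slide values: ~$10,000 maximum cost per run, up to 20 Gbp per run.
print(cost_per_gbp(10_000, 20))  # 500.0
```

Since the slide quotes a maximum cost and a maximum yield, the result is best read as an order-of-magnitude estimate.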

Page 16: Overview and Applications of Next-Generation Sequencing Technologies

Illumina Genome Analyzer

[Workflow figure: Sample Prep (prepare DNA fragments, ligate adapters) → Cluster Synthesis (Cluster Station) → Sequencing (Genome Analyzer) → Analysis Pipeline]

Page 17: Overview and Applications of Next-Generation Sequencing Technologies

Cluster Station

Page 18: Overview and Applications of Next-Generation Sequencing Technologies

Genome Analyzer

[Figure labels: Fluidics and Electronics; Flow Cell & Detection; Laser; Optics]

Page 19: Overview and Applications of Next-Generation Sequencing Technologies

Cluster Generation

[Figure: cluster generation, annealing step]

Page 20: Overview and Applications of Next-Generation Sequencing Technologies

Cluster Generation

DNA Clusters
• ~1,000 copies of DNA in each cluster
• 1-2 microns in diameter

[Figure: cluster generation, extension step]

Page 21: Overview and Applications of Next-Generation Sequencing Technologies

Reversible Terminator Chemistry

Page 22: Overview and Applications of Next-Generation Sequencing Technologies

Sequencing by Synthesis (SBS)

Cycle 1: Add sequencing reagents
• First base incorporated
• Remove unincorporated bases
• Detect signal
• Deblock (removal of fluorescent dye and protecting group)

Cycle 2 - n: Add sequencing reagents and repeat

[Figure: template strand (5’ to 3’) with labeled nucleotides G, T, C, A being incorporated one base per cycle]

Page 23: Overview and Applications of Next-Generation Sequencing Technologies

Sequencing by Synthesis (SBS)

Page 24: Overview and Applications of Next-Generation Sequencing Technologies

Data Analysis Workflow - Illumina

Illumina Analysis Pipeline

1. Image Analysis (module: Firecrest)
   • Input: images (.tif)
   • Output: cluster intensities, cluster noise

2. Base calling (module: Bustard)
   • Output: cluster sequence, cluster probabilities (scores), corrected cluster intensities
   • Corrections: cross-talk correction, phasing correction

3. Sequence Analysis (module: Gerald)
   • Alignment (ELAND), filtering (chastity)

Downstream: data transfer and storage; alignments, assemblies, normalization, annotations & post-processing evaluations

Image data volume:
1 image per dye × 4 dyes/cycle × 75 cycles × 50 tiles/column × 2 columns/lane × 8 lanes/flowcell = 240,000 images per flowcell
× 8 MB per image = 1.92 TB of image data
× 2 for PE run = 3.8 TB of image data
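
The image-volume figures on this slide (240,000 images, 1.92 TB, 3.8 TB for a paired-end run) can be reproduced with a few multiplications. The sketch below uses the slide's values and assumes decimal units (1 TB = 1,000,000 MB); the variable names are illustrative.

```python
# Values taken from the slide.
DYES_PER_CYCLE = 4        # one image per dye
CYCLES = 75
TILES_PER_COLUMN = 50
COLUMNS_PER_LANE = 2
LANES_PER_FLOWCELL = 8
MB_PER_IMAGE = 8

images_per_flowcell = (
    DYES_PER_CYCLE * CYCLES * TILES_PER_COLUMN * COLUMNS_PER_LANE * LANES_PER_FLOWCELL
)
# Assumption: decimal units, 1 TB = 1,000,000 MB.
single_read_tb = images_per_flowcell * MB_PER_IMAGE / 1_000_000
paired_end_tb = 2 * single_read_tb

print(images_per_flowcell)  # 240000
print(single_read_tb)       # 1.92
print(paired_end_tb)        # 3.84
```

Volumes of this size are why the raw images are usually the first thing considered for deletion once intensities and base calls have been extracted.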

Page 25: Overview and Applications of Next-Generation Sequencing Technologies

Other platforms

Platform            Sequencing Chemistry       Run Time   Read Length (bp)   Reads per Run (million)   Throughput per Run (Gbp)
Roche 454 FLX       Pyrosequencing             10h        400-500            ~1                        0.4-0.5
Illumina GAIIx      Sequencing by Synthesis    9.5 days   100                250                       25
ABI SOLiD           Sequencing by Ligation     8 days     50                 400                       20
Helicos HeliScope   Sequencing by Synthesis    8 days     25-55              600-800                   15-45
Polonator           Sequencing by Ligation     80h        28                 300-400                   10

Page 26: Overview and Applications of Next-Generation Sequencing Technologies

Data Storage & Quality

Images?

Phred score 20 = 1% error rate

Quality vs. Read Length? Trimming?

Lower sequence quality than Sanger sequencing, but offset by deeper coverage

~Phred 20
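
The statement "Phred score 20 = 1% error rate" comes from the standard Phred definition, Q = -10 · log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score corresponds to a tenfold change in the expected error rate, so Q30 means one expected error in 1,000 bases.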

Page 27: Overview and Applications of Next-Generation Sequencing Technologies

Single short read uniqueness

~4MM reads

Illumina 35 base reads aligned to A. thaliana genome
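
One rough way to think about short-read uniqueness is to ask what fraction of all k-length substrings of a reference occur exactly once. The sketch below is an illustration under simplifying assumptions: it works on a single in-memory string, ignores the reverse strand, and allows no mismatches; it is not the analysis behind the figure.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of k-mer positions in `genome` whose k-mer occurs exactly once.

    Simplifications: forward strand only, exact matches only.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and len(genome)")
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique_positions = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique_positions / len(kmers)


toy_genome = "ACGTACGTTT"
print(unique_kmer_fraction(toy_genome, 4))
print(unique_kmer_fraction(toy_genome, 5))
```

On the toy sequence, moving from k = 4 to k = 5 removes the repeated k-mer, which mirrors the general point that longer reads can be placed uniquely more often.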

Page 28: Overview and Applications of Next-Generation Sequencing Technologies

Applications of Next-Generation Sequencing

Page 29: Overview and Applications of Next-Generation Sequencing Technologies

Gene Expression Profiling

– Tag count & Alignments

– Digital Gene Expression Tag Profiling
  • Short cDNA fragments mapping to 3’ ends of transcripts
  • SAGE-like approach (1 short tag/transcript)
  • 20 base tag output (RE site + 16 bases) aligned to a reference genome
  • Identify, quantify and annotate expressed genes

– Transcriptome Profiling (RNA-Seq)
  • cDNA fragments generated via random priming
  • 36-75 base output aligned to a reference genome
  • Assemble entire transcript sequence
  • Identify, quantify and annotate expressed genes
  • Identify SNPs, alleles and alternative splice variants
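
The "RE site + 16 bases" tag can be sketched in silico. The example assumes the NlaIII recognition site CATG, takes the 3’-most site that still leaves room for a full 20-base tag, and counts identical tags across transcripts. The function names and the rule for skipping sites too close to the 3’ end are assumptions made for illustration, not the vendor's protocol or software.

```python
from collections import Counter
from typing import Iterable, Optional


def extract_tag(cdna: str, site: str = "CATG", extension: int = 16) -> Optional[str]:
    """Return the tag (site + `extension` bases) at the 3'-most usable site."""
    cdna = cdna.upper()
    tag_length = len(site) + extension
    pos = cdna.rfind(site)
    while pos != -1:
        tag = cdna[pos:pos + tag_length]
        if len(tag) == tag_length:
            return tag
        # Site too close to the 3' end: look for the next site upstream.
        pos = cdna.rfind(site, 0, pos + len(site) - 1)
    return None


def count_tags(transcripts: Iterable[str]) -> Counter:
    """Count tags across transcripts, skipping transcripts without a usable site."""
    tags = (extract_tag(t) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)


print(extract_tag("GGGCATGAAAACCCCGGGGTTTTACGT"))  # CATGAAAACCCCGGGGTTTT
```

Counting one tag per transcript is what makes the approach "digital": expression is estimated from how often each tag is observed.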

Page 30: Overview and Applications of Next-Generation Sequencing Technologies

Tag Profiling – Sample Prep (Illumina)

[Workflow figure: Total RNA (5 µg) → mRNA isolation → 1st and 2nd strand cDNA synthesis → restriction enzyme digestion (DpnII or NlaIII) → GEX Adaptor 1 ligation (contains MmeI site) → MmeI digestion → GEX Adaptor 2 ligation → PCR amplification (PCR Primer 1, PCR Primer 2) → cluster generation (sequencing primer)]

Page 31: Overview and Applications of Next-Generation Sequencing Technologies

Transcriptome Profiling – Sample Prep (Illumina)

[Workflow figure: Tissue → total RNA isolation (10 µg) → mRNA isolation → fragmentation (random) → 1st and 2nd strand cDNA synthesis (N6 primer) → adaptor ligations → PCR amplification (PCR Primer 1, PCR Primer 2) → cluster generation (sequencing primer 1, sequencing primer 2)]

Page 32: Overview and Applications of Next-Generation Sequencing Technologies

Novel Transcript Discovery

– Small RNA Identification and Profiling
  • Small RNA size is well suited to discovery by next-generation sequencing

– Deep assessment of alternative splicing isoforms
  • Deep coverage allows discovery of rare isoforms

Mortazavi et al. (2008), Nat. Methods

Page 33: Overview and Applications of Next-Generation Sequencing Technologies

De novo Sequencing

– Whole Genome Sequencing
  • Small genomes that are not too complex (microbial)
  • The longer the reads, the better – 454 chemistry most suitable
  • Paired-End sequencing

– Whole Transcriptome Sequencing

– Targeted Sequencing
  • Pooled PCR products
    – Raindance Technologies (~4,000 amplicons in one tube)
    – Padlock probes
  • Pooled BAC clones
  • Sequence Capture (Solid phase, Liquid phase)
    – Agilent, Febit & Nimblegen

– Metagenomics & Microbial diversity

Page 34: Overview and Applications of Next-Generation Sequencing Technologies

Gene Regulation

– ChIP-Seq (chromatin immunoprecipitation sequencing)
  • Capture regions of the genome bound by proteins (transcription factors, histones)
  • Sequences need to be aligned to a reference sequence
  • Requires complex algorithm to determine differential levels of coverage throughout the genome

– Methyl-Seq (methylation status) – Bisulfite Sequencing
  • Sequences aligned and compared to reference sequence

– DNAseI Hypersensitivity Site Sequencing

Mikkelsen et al. (2007), Nature

Page 35: Overview and Applications of Next-Generation Sequencing Technologies

Variant & Structural Variation

– Coverage & Alignment

– Paired-End Sequencing

– Whole Genome Resequencing
  • Small genomes that are not too complex (repeats, duplications...)
  • The longer the reads, the better

– Targeted Resequencing
  • Complex genomes (crops)
    – Reduced representation libraries (methyl-sensitive enzymes)
    – Transcriptome
  • Sequence Capture (Microarrays)
    » Agilent, Febit & Nimblegen
    » CGH arrays

Page 36: Overview and Applications of Next-Generation Sequencing Technologies

Challenges in variant discovery

1. Base quality & filtering (scoring threshold)

2. Sequencing errors vs. SNPs
   1. To differentiate true polymorphisms from sequencing errors
   2. Coverage of a given SNP region and redundancy of reads (coverage vs. number of samples)

3. Availability of a reference sequence (genome)
   1. To separate unique vs. duplicated sequences
   2. Duplication in one line but not another
   3. Polymorphism rate in one line vs. another = need to set conditions for alignment

4. Paired-end sequencing can help unique read placement

5. Complex genomes = need to reduce complexity prior to sequencing
   1. High repeat content (e.g. ~80% in maize, ~70% in soy, 90% in sunflower…)
   2. Gene duplications and genome plasticity (polyploidy, partial or whole genome duplications...)

Page 37: Overview and Applications of Next-Generation Sequencing Technologies

Reduced-representation libraries

[Figure: genomic region with transposons and PstI sites (P); PstI digestion, then recovery of the digested fraction (gel, column)]

1. DNA methylation in plants occurs as 5-methylcytosine within CpG dinucleotides and CpNpG trinucleotides

2. Transposons and other repeats comprise the largest fraction of methylated DNA. Studies in Arabidopsis have shown that CG sites in the 3’ end of the transcribed regions of more than one third of all genes are also methylated (Zhang, Science, 320, 489, 2008).

3. Methylation is critically important in silencing transposons and regulating plant development (methylation in promoters appears to reduce transcription)
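
The effect of a methylation-sensitive digest can be sketched in silico. The example below cuts a sequence at PstI sites (CTGCAG, cut after the A on the top strand), skips any site flagged as methylated, and keeps fragments inside a size window. The `methylated_sites` argument and the size-selection helper are hypothetical simplifications introduced only for illustration; real methylation data and fragment recovery are more involved.

```python
from typing import Iterable, List, Set


def pst1_digest(seq: str, methylated_sites: Iterable[int] = ()) -> List[str]:
    """Cut `seq` at unmethylated PstI sites (CTGCA^G) and return the fragments.

    `methylated_sites` holds 0-based start positions of sites assumed to be
    protected from cutting (hypothetical input for this sketch).
    """
    site, cut_offset = "CTGCAG", 5
    seq = seq.upper()
    blocked: Set[int] = set(methylated_sites)
    cuts = []
    start = seq.find(site)
    while start != -1:
        if start not in blocked:
            cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments: Iterable[str], min_len: int, max_len: int) -> List[str]:
    """Keep fragments whose length falls inside [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


toy = "AAACTGCAGTTTCTGCAGCC"
print(pst1_digest(toy))                         # ['AAACTGCA', 'GTTTCTGCA', 'GCC']
print(pst1_digest(toy, methylated_sites={12}))  # ['AAACTGCA', 'GTTTCTGCAGCC']
```

Because methylated repeats are cut less often, they tend to stay in large fragments, so selecting the smaller digested fraction enriches for unmethylated, gene-rich sequence.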

Page 38: Overview and Applications of Next-Generation Sequencing Technologies

Library Construction

Genomic DNA

1. Digestion with one methyl-sensitive restriction enzyme (RE) and fractionation

2. Ligation of biotinylated RE-specific adapters 1

3. Digestion with 4-bp cutter (DpnII; recognition site GATC)

4. Ligation of DpnII-specific adapter

5. Binding to streptavidin column and digestion with RE

6. Ligation of RE-specific adapters 2

7. PCR enrichment, gel purification, size selection (150-500 bp fragments), cluster synthesis and sequencing (36 cycles)

[Figure: fragments carrying biotin (B) and GATC/CTAG ends at each step]

Deschamps et al. The Plant Genome (in press)

Page 39: Overview and Applications of Next-Generation Sequencing Technologies

SNP detection flowchart

Filtering and Condensing

1. Basecalling, cropping last 4 bases & initial base-quality filter (for individual tags)

2. Condensing & optional consensus base-quality filter (for unitag sequences)

3. Creating HQ unitag datasets (removing singlets)

Comparing two genotypes

4. Comparing HQ unitag datasets from genotype “A” and genotype “B” using Vmatch

5. Filtering, to accept clusters with only two members (A, B) with exactly one mismatch

6. Recovering matched HQ unitag sequences and SNP sites from Vmatch alignments

Mapping to genome

7. Mapping SNP-containing HQ unitags to the reference sequence (genome), using a k-mer table (k = length of trimmed tags), to find copy numbers and locations.

8. Capturing single-copy HQ unitags with up to a single-base mismatch to the reference sequence at the exact location of the putative SNP site for one or both genotypes.
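
The condensing and two-genotype comparison steps can be sketched on toy data. This is a simplified stand-in, not the authors' pipeline: base-quality filtering and the mapping step are omitted, and the Vmatch comparison is replaced by a plain "mask one position" index that finds tag pairs differing at exactly one base. All function names are illustrative.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Set, Tuple


def hq_unitags(reads: Iterable[str], crop: int = 4, min_count: int = 2) -> Set[str]:
    """Crop the last `crop` bases, condense identical reads, drop singlets."""
    counts = Counter(read[:-crop] if crop else read for read in reads)
    return {tag for tag, n in counts.items() if n >= min_count}


def candidate_snps(tags_a: Set[str], tags_b: Set[str]) -> List[Tuple[str, str, int]]:
    """Return (tag_a, tag_b, position) for A/B pairs with exactly one mismatch.

    Only two-member clusters are kept: each tag has a single partner in the
    other genotype and is not itself present in the other genotype.
    """
    masked = defaultdict(lambda: ([], []))
    for side, tags in enumerate((tags_a, tags_b)):
        for tag in tags:
            for pos in range(len(tag)):
                # Tags sharing this key are identical except possibly at `pos`.
                masked[(pos, tag[:pos] + tag[pos + 1:])][side].append(tag)

    pairs = []
    partners_a, partners_b = Counter(), Counter()
    for (pos, _), (from_a, from_b) in masked.items():
        for a in from_a:
            for b in from_b:
                if a != b:
                    pairs.append((a, b, pos))
                    partners_a[a] += 1
                    partners_b[b] += 1

    return sorted(
        (a, b, pos) for a, b, pos in pairs
        if partners_a[a] == 1 and partners_b[b] == 1
        and a not in tags_b and b not in tags_a
    )


reads_a = ["ACGTACGTAAAA"] * 3 + ["TTTTGGGGCCCC"] * 2 + ["GGGGGGGGGGGG"]
reads_b = ["ACGAACGTCCCC"] * 2 + ["TTTTGGGGAAAA"] * 2
print(candidate_snps(hq_unitags(reads_a), hq_unitags(reads_b)))
# [('ACGTACGT', 'ACGAACGT', 3)]
```

Requiring at least two identical reads per unitag is what separates a reproducible difference between genotypes from a one-off sequencing error.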

Page 40: Overview and Applications of Next-Generation Sequencing Technologies

Example: one flow cell in soybean (Williams82 vs. Pintado)

Run Metrics                                                      Williams82     Pintado
Number of total reads generated (after initial basecalling)      37,666,279     38,000,474
Number of filtered total reads †                                 24,519,484     23,101,973
Number of unitags (generated from filtered total reads)          965,610        885,429
Number of high quality (HQ) unitags ‡                            255,918        246,102
Alignment of HQ unitags against the reference sequence:
  Zero mismatch §                                                208,923        197,015
  One mismatch §                                                 27,770         27,699
  Two or more mismatches §                                       19,225         21,388
HQ unitags aligning uniquely to the reference sequence
  with zero mismatch                                             152,185        144,559

† Filtered total reads defined as having a quality value for individual base greater than or equal to 15

‡ HQ unitag reads defined as having a quality value for each base greater than or equal to 15, and with an individual read count greater than or equal to 2.

§ Best match to reference sequence of HQ unitag reads aligning uniquely or multiple times to the reference sequence

[Figure: Frequency vs. Depth, logarithmic axes]

Page 41: Overview and Applications of Next-Generation Sequencing Technologies

Results & Validation

Experiments                        Putative SNPs   Confirmed   Not Confirmed   Validation rate
Q Score threshold: 15
  Soy: Williams82 vs. Pintado      1,682           163         5               97.0%
  Rice: Kasalath vs. Taichung65    2,618           162         6               96.4%
Q Score threshold: 25
  Soy: Williams82 vs. Pintado      702             168         2               98.8%
  Rice: Kasalath vs. Taichung65    2,148           174         1               99.4%

*SNPs confirmed/not confirmed via Sanger sequencing of PCR products for both genotypes
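
The validation rates reported on this slide are consistent with confirmed / (confirmed + not confirmed), that is, the fraction of Sanger-tested SNPs that were confirmed. A short check using the slide's counts (the function name is illustrative):

```python
def validation_rate(confirmed: int, not_confirmed: int) -> float:
    """Percentage of tested putative SNPs confirmed by Sanger sequencing."""
    tested = confirmed + not_confirmed
    if tested == 0:
        raise ValueError("no SNPs were tested")
    return 100 * confirmed / tested


results = {
    "Q15 soy": (163, 5),
    "Q15 rice": (162, 6),
    "Q25 soy": (168, 2),
    "Q25 rice": (174, 1),
}
for label, (confirmed, not_confirmed) in results.items():
    print(label, round(validation_rate(confirmed, not_confirmed), 1))
```

Note that the rate is computed over the subset of putative SNPs that were actually tested, not over all putative SNPs.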

Page 42: Overview and Applications of Next-Generation Sequencing Technologies

Distribution of HQ unitags & SNPs related to annotated gene density (soybean)

Gene Density (excluding TEs) in 500Kb window

Coverage by HQ unitags in 70Kb window

SNP Density in 70Kb window

Page 43: Overview and Applications of Next-Generation Sequencing Technologies

Distribution of HQ unitags & SNPs related to distance to annotated genes (excluding TEs) in soybean

Intron, CDS and UTR coordinates determined from GFF annotation files

Page 44: Overview and Applications of Next-Generation Sequencing Technologies

Bioinformatic tools

Alignment and Polymorphism Detection

1. SOAP – Short Oligonucleotide Alignment Program

• Ruiqiang Li, Beijing Genomics Institute

• http://soap.genomics.org.cn

2. MAQ – Mapping and Assembly with Quality

• Heng Li, Sanger Centre

• http://maq.sourceforge.net/maq-man.shtml

3. Bowtie - An ultrafast memory-efficient short read aligner

• Ben Langmead and Cole Trapnell, University of Maryland

• http://bowtie-bio.sourceforge.net/

4. ssahaSNP – Tool to detect homozygous SNPs and indels

• Adam Spargo and Zemin Ning, Sanger Centre

• http://www.sanger.ac.uk/Software/analysis/ssahaSNP

Page 45: Overview and Applications of Next-Generation Sequencing Technologies

Bioinformatic tools

Genomic Assembly

1. Velvet – De novo assembly of short reads

• Daniel Zerbino and Ewan Birney, EMBL-EBI

• http://www.ebi.ac.uk/~zerbino/velvet/

2. SSAKE – Assembly of short reads

• Rene Warren, et al, British Columbia Cancer Agency

• http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500

3. Euler – Genomic Assembly

• Pavel Pevzner and Mark Chaisson, University of California, San Diego

• http://nbcr.sdsc.edu/euler/

www.illumina.com

Page 46: Overview and Applications of Next-Generation Sequencing Technologies

Bioinformatic tools

ChIP Sequencing

1. ChIP-Seq Peak Finder

• Barbara Wold, Caltech and Rick Myers, Stanford University

• http://woldlab.caltech.edu/html/software/

Digital Gene Expression

1. Comparative Count Display

• Alex Lash, NIH

• ftp://ftp.ncbi.nlm.nih.gov/pub/sage/obsolete/bin/ccd/

2. SAGE DGED Tool

• Cancer Genome Anatomy Project

• http://cgap.nci.nih.gov/SAGE/SDGED_Wizard?METHOD=SS10,LS10&ORG=Hs

www.illumina.com

Page 47: Overview and Applications of Next-Generation Sequencing Technologies

Bioinformatic tools - Illumina

Overview

1. Obtain Bustard reads and align them against the genome with ELAND

2. Aggregate data and call SNPs with CASAVA

3. Import data with the GenomeStudio™ wizard

4. Examine coverage and quality in stacked alignment graphs for a selected region/chromosome

5. Export table of SNPs and consensus sequence

Page 48: Overview and Applications of Next-Generation Sequencing Technologies

Bioinformatic tools - Illumina

Page 49: Overview and Applications of Next-Generation Sequencing Technologies

Third-Generation Sequencing technologies: what’s next?

Page 50: Overview and Applications of Next-Generation Sequencing Technologies

Next-Generation Sequencing

Third-generation platforms:
• Complete Genomics
• BioNanomatrix
• VisiGen
• Pacific Biosciences
• Intelligent Bio-Systems
• ZS Genetics
• Reveo
• LightSpeed Genomics
• NABsys
• Oxford Nanopore Technologies

Second-generation platforms:
• 454/Roche
• Solexa/Illumina
• SOLiD/ABI
• Helicos BioSciences
• Dover Systems

Page 51: Overview and Applications of Next-Generation Sequencing Technologies

Pacific Biosciences

• SMRT™ Technology (to be commercially launched Fall 2010)

• A single DNA polymerase, attached to the bottom surface of a nanometer-scale hole, incorporates fluorescently labeled nucleotides into the elongating DNA strand in real time

• Elongated strand can be several thousand nucleotides in length

www.pacificbiosciences.com

Page 52: Overview and Applications of Next-Generation Sequencing Technologies

Pacific Biosciences

1. The small size of the hole favors rapid in-and-out diffusion of nucleotides, and of the dye following its cleavage. Meanwhile, an incorporated nucleotide is held within the detection volume for tens of milliseconds, orders of magnitude longer than the time it takes for nucleotides to diffuse in and out of the hole, which decreases background noise.

2. The fluorescent dye is attached to the phosphate chain, rather than the base, and is cleaved when the nucleotide is incorporated into the DNA strand.

=> Decreased background noise and the use of phospholinked nucleotides circumvent the need for successive cycles of incorporation, washing, scanning and removal of the label, therefore optimizing processivity of the enzyme and allowing longer read lengths

=> No need for washing decreases the consumption of reagents

Page 53: Overview and Applications of Next-Generation Sequencing Technologies

Nanopore Sequencing = the real $100 genome?

1. Sequencing-by-Synthesis requires lots of preparation, lots of reagents (polymerase, nucleotides, fluorescent labels...) and expensive detection systems.

2. Nanopore sequencing does not rely on amplification or labeling, and provides a direct electrical signal for base calling. It is based on the simple idea of “passing” DNA fragments through a nanometer-scale pore and detecting, in real time, the signal caused by the DNA blocking the electrical current that runs through the pore.

3. Oxford Nanopore: Protein nanopore
   1. Long read lengths (1,000s)
   2. High read accuracy
   3. Current technical issues:
      • Attachment of the exonuclease to the pore
      • Parallelization (1,000s of pores per chip)

[Figure labels: Exonuclease; Alpha-hemolysin; Cyclodextrin (encapsulates nucleotide)]

www.nanoporetech.com

Page 54: Overview and Applications of Next-Generation Sequencing Technologies

Questions?