introduction to next generation sequencing

Introduction to NGS http://ueb.ir.vhebron.net/NGS Introduction to Next Generation Sequencing. Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona. Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca. Alex Sánchez

Upload: ueb

Post on 10-May-2015


DESCRIPTION

More information at: http://ueb.ir.vhebron.net/NGS

TRANSCRIPT

Page 1: Introduction to next generation sequencing

Introduction to NGS http://ueb.ir.vhebron.net/NGS

Introduction to Next Generation Sequencing

Statistics and Bioinformatics Research Group
Statistics Department, Universitat de Barcelona

Statistics and Bioinformatics Unit
Vall d’Hebron Institut de Recerca

Alex Sánchez

Page 2: Introduction to next generation sequencing

Outline

• Introduction, Presentation, Goals.
• Next generation sequencing technologies.
– Evolution, Description, Comparison.
• Applications of NGS.
• Bioinformatics challenges.
• Some aspects of NGS data analysis.
– NGS data, and data preprocessing (QC)
– Types of analyses, workflows, tools
• Conclusions and perspectives

Page 3: Introduction to next generation sequencing


Who, where, what?

Page 4: Introduction to next generation sequencing


Introduction

Page 5: Introduction to next generation sequencing

Why is NGS revolutionary?

• NGS has brought high speed not only to genome sequencing and personal medicine

• it has also changed the way we do genome research

Got a question on genome organization?

SEQUENCE IT !!!

Ana Conesa, bioinformatics researcher at Principe Felipe Research Center

Page 6: Introduction to next generation sequencing

Sequencing: from DNA to Genomes

Sanger chain termination (1977)

Hierarchical and shotgun sequencing (1996)

Page 7: Introduction to next generation sequencing


The human genome project

Page 8: Introduction to next generation sequencing


Next generation sequencing

The future is here, now

Page 9: Introduction to next generation sequencing

Next generation Sequencing

• By the middle of the decade, new technologies had consolidated, allowing the massive production of tens of millions of short sequencing fragments.
• These techniques could be used to
– Deal with problems similar to those addressed with microarrays,
– But also with many others.
• “Again” they raised the promise of personalized medicine.

Page 10: Introduction to next generation sequencing


NGS technologies

Page 11: Introduction to next generation sequencing

Next-generation DNA sequencing

Sanger sequencing vs. cyclic-array (next-generation) sequencing

Advantages of NGS:

- Construction of a sequencing library, then clonal amplification to generate sequencing features
  No in vivo cloning, transformation, colony picking...
- Array-based sequencing
  Higher degree of parallelism than capillary-based sequencing

Page 16: Introduction to next generation sequencing

NGS means high sequencing capacity

GS FLX 454 (Roche)

HiSeq 2000 (Illumina)

5500xl SOLiD (ABI)

Ion Torrent

GS Junior

Page 17: Introduction to next generation sequencing

NGS Platforms Performance

454 GS Junior: 35MB

Page 18: Introduction to next generation sequencing


Next-generation DNA sequencing

454

SOLiD

SOLEXA

Workflow?

Page 19: Introduction to next generation sequencing


454 Sequencing

Page 20: Introduction to next generation sequencing


ABI SOLID Sequencing

Page 21: Introduction to next generation sequencing


Solexa sequencing

Page 22: Introduction to next generation sequencing


Comparison of 2nd-generation sequencing platforms

Page 23: Introduction to next generation sequencing


Some numbers

Page 24: Introduction to next generation sequencing

The sequencing process, in detail

1. Library preparation
– DNA fragmentation and in vitro adaptor ligation
2. Clonal amplification
– emulsion PCR
– bridge PCR
3. Cyclic array sequencing
– 454 sequencing: Pyrosequencing
– SOLiD platform: Sequencing-by-ligation
– Solexa technology: Sequencing-by-synthesis

Page 30: Introduction to next generation sequencing

Next next generation sequencing

• Pacific Biosciences
– Real time DNA synthesis
– Up to 12000 nt (?)
– 50 bases/second (?)
• Promises delivery of human genome in minutes?
– Company on track for 2013

Page 31: Introduction to next generation sequencing


NGS Applications

Page 32: Introduction to next generation sequencing


Bioinformatics challenges of NGS


Page 34: Introduction to next generation sequencing

NGS pushes bioinformatics needs up

• Need for large amount of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
– Another level of software complexity and challenges to interoperability
• VERY large text files (~10 million lines long)
– Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
– Impossible memory usage and execution time
– Impossible to browse for problems
• Need sequence quality filtering
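
The memory problem with very large text files usually comes from loading the whole file at once. A minimal sketch of the alternative, reading one record at a time so that memory use stays roughly constant regardless of file size. It assumes plain four-line FASTQ records; the helper names `iter_fastq` and `count_reads_and_bases` are hypothetical and chosen only for illustration.

```python
import gzip
from itertools import islice


def iter_fastq(path):
    """Yield (read_id, sequence, quality) one record at a time.

    Assumption: every record occupies exactly four lines
    (header, sequence, '+' line, quality string).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))  # read the next four lines only
            if not record:
                break
            if len(record) < 4:
                raise ValueError("truncated FASTQ record")
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            yield header[1:], seq, qual  # drop the leading '@'


def count_reads_and_bases(path):
    """Stream the file once and return (number of reads, total bases)."""
    n_reads = 0
    n_bases = 0
    for _read_id, seq, _qual in iter_fastq(path):
        n_reads += 1
        n_bases += len(seq)
    return n_reads, n_bases
```

Because `iter_fastq` is a generator, only one record is held in memory at any moment; execution time still grows with file size, which is where the cluster resources mentioned above come in.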

Page 35: Introduction to next generation sequencing

Data management issues

• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
– 20 million reads (50bp) ~1Gb
• More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4GB RAM
• Certain studies are much more data intensive than others
– Whole genome sequencing
  • A 30X coverage genome pair (tumor/normal) ~500 GB
  • 50 genome pairs ~25 TB
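
The figures on this slide follow from simple arithmetic. A small sketch that reproduces them, assuming that "~1Gb" means about one gigabase of sequence and that 1 TB is counted as 1000 GB; the function names are hypothetical.

```python
def total_bases(n_reads, read_length):
    """Number of sequenced bases in a run of fixed-length reads."""
    return n_reads * read_length


def project_storage_tb(n_pairs, gb_per_pair=500):
    """Storage in TB for tumor/normal genome pairs.

    gb_per_pair defaults to the slide's ~500 GB per 30X pair;
    1 TB = 1000 GB is an assumption made here.
    """
    return n_pairs * gb_per_pair / 1000


print(total_bases(20_000_000, 50))   # 20 million reads of 50 bp
print(project_storage_tb(50))        # 50 genome pairs
```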

Page 36: Introduction to next generation sequencing

So what?

• In NGS we have to process very large amounts of data, which is not trivial in computing terms.

• Big NGS projects require supercomputing infrastructures

• Or put another way: not everyone can study everything.
– Small facilities must carefully choose projects that are scaled to their computing capabilities.

Page 37: Introduction to next generation sequencing

Computational infrastructure for NGS

• There is great variety, but a good starting point is:
– Computing cluster
  • Multiple nodes (servers), each with multiple cores
  • High performance storage (TB, PB level)
  • Fast networks (10Gb ethernet, infiniband)
– Enough space and suitable conditions for the equipment ("server room")
– Skilled people (sysadmin, developers)
  • CNAG, in Barcelona: 30 people, more than 50% of them informaticians

Page 38: Introduction to next generation sequencing

Big computing infrastructure

• Distributed memory cluster
– Starting at 20 computing nodes
– 160 to 240 cores
– amd64 (x86_64) is the most used CPU architecture
– At least 48GB RAM per node
• Fast networks
– 10Gbit
– Infiniband
• Batch queue system (sge, condor, pbs, slurm)
• Optional MPI and GPU environment depending on project requirements

Page 39: Introduction to next generation sequencing

Big infrastructure is expensive

• Starting at 200,000 €
– 200,000 € is just the hardware
– Plus data center (computer room)
– Plus informaticians' salaries
• Not every partner knows about supercomputing.
– SGI
– Bull
– IBM
– HP

Page 40: Introduction to next generation sequencing

Middle size infrastructure

• "Small" distributed filesystem (around 50TB).

• "Small" cluster (around 10 nodes, 80 to 120 cores).

• At least gigabit ethernet network.

• Price range: 50,000 – 100,000 € (just hardware)
– plus data center and informaticians' salaries

Page 41: Introduction to next generation sequencing

Small infrastructure

• Recommended: at least 2 machines
– 8 or 12 cores per machine.
– 48GB RAM minimum per machine.
– BIG local disk. At least 4TB per machine

• As many local disks as we can afford

• Price range: starting at 8,000 – 10,000 € (2 machines)

Page 42: Introduction to next generation sequencing

Alternatives (1): Cloud Computing

• Pros
– Flexibility.
– You pay for what you use.
– No need to maintain a data center.
• Cons
– Transferring big datasets over the internet is slow.
– You pay for consumed bandwidth. That is a problem with big datasets.
– Lower performance, especially in disk read/write.
– Privacy/security concerns.
– More expensive for big and long-term projects.
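
To get a feeling for why moving large datasets over the internet is slow, a back-of-the-envelope sketch follows. The link speeds used are illustrative assumptions, not figures from the slides; the 500 GB size is the tumor/normal genome pair mentioned earlier, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_gb, link_mbit_per_s, efficiency=1.0):
    """Hours needed to move size_gb gigabytes over a link.

    link_mbit_per_s is the nominal link speed in megabits per second;
    efficiency (0 to 1) is an assumed fraction of that speed actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_gb * 1e9 * 8                      # gigabytes to bits
    seconds = bits / (link_mbit_per_s * 1e6 * efficiency)
    return seconds / 3600


# One ~500 GB genome pair over two assumed link speeds
for speed in (100, 1000):
    print(speed, "Mbit/s:", round(transfer_hours(500, speed), 1), "hours")
```

Real transfers rarely reach the nominal link speed, so lowering `efficiency` gives a more cautious estimate.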

Page 43: Introduction to next generation sequencing

Alternatives (2): Grid Computing

• Pros
– Cheaper.
– More resources available.
• Cons
– Heterogeneous environment.
– Slow connectivity (especially in Spain).
– Much time required to find good resources in the grid.

Page 44: Introduction to next generation sequencing


NGS data analysis

Page 45: Introduction to next generation sequencing


NGS data analysis stages

Page 46: Introduction to next generation sequencing


A typical workflow (Seq-to-variant wf)

Page 47: Introduction to next generation sequencing


Whole Genome Sequencing

Resequencing

Transcriptome Analysis

Gene Regulation

Epigenetic Changes

Metagenomics

Paleogenomics

NGS Applications are sequencing applications

Page 48: Introduction to next generation sequencing


Metagenomics and other community-based “omics”

Zoetendal E G et al. Gut 2008;57:1605-1615

Page 49: Introduction to next generation sequencing


De novo sequencing

Page 50: Introduction to next generation sequencing

Transcriptomics by NGS: RNASeq

• Digital Signal
– Harder to achieve & interpret
– Read counts: discrete values
– Weak background or no noise

• Analog Signal
– Easy to convey the signal’s information
– Continuous strength
– Signal loss and distortion

Page 51: Introduction to next generation sequencing

Quality control and preprocessing of NGS data

Page 52: Introduction to next generation sequencing


Preprocessing sequences improves results

Page 53: Introduction to next generation sequencing

Why QC and preprocessing

• Sequencer output:
– Reads + quality
• Natural questions
– Is the quality of my sequenced data OK?
– If something is wrong, can I fix it?
• Problem: HUGE files... What do they look like?
• Files are flat files and are big... tens of GB (hard even to browse)

Page 54: Introduction to next generation sequencing

How is quality measured?

• Assign a quality score to each peak
• The frequently used Phred scores provide log10-transformed error probability values:
– score = 20 corresponds to a 1% error rate
– score = 30 corresponds to a 0.1% error rate
– score = 40 corresponds to a 0.01% error rate
• The base calling (A, T, G or C) is performed based on Phred scores.
• Ambiguous positions with Phred scores <= 20 are labeled with N.
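
The three score/error pairs listed on this slide follow from the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion in both directions; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


for score in (20, 30, 40):
    # Prints the error rate as a percentage, to compare with the slide
    print(score, f"{phred_to_error_prob(score) * 100:g}%")
```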

Page 55: Introduction to next generation sequencing

Sequence formats

• FastA format (everybody knows about it)
– Header line starts with “>” followed by a sequence ID
– Sequence (string of nt).
• FastQ format (http://maq.sourceforge.net/fastq.shtml)
– First a header line with the sequence ID (like FastA but starting with “@”), then the sequence
– Then “+” and sequence ID (optional), and in the following line the QVs encoded as single byte ASCII codes
– Different quality encoding variants
• Nearly all downstream analyses take FastQ as input
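
The single-byte ASCII encoding of quality values can be decoded by subtracting an offset from each character code. A minimal sketch, assuming the two most common variants: offset 33 (Sanger and recent Illumina output) and offset 64 (older Illumina pipelines). The helper name `decode_qualities` is hypothetical, and which offset applies to a given file has to be checked, for example in the sequencing facility's documentation.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores.

    offset=33 is assumed by default; pass offset=64 for the older variant.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    if any(score < 0 for score in scores):
        # A negative score means the wrong offset was assumed
        raise ValueError("negative score: check the quality encoding offset")
    return scores


# The fourth line of a FASTQ record holds one character per base
print(decode_qualities("II5!"))
print(decode_qualities("hh", offset=64))
```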

Page 56: Introduction to next generation sequencing

Some tools to deal with QC

• Use FastQC to see your starting state.

• Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!

• Hints:
– Trimming, clipping and filtering may improve quality
– But beware of removing too many sequences…

Go to the tutorial and try the exercises...
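
To make the idea of trimming and filtering concrete, here is a pure-Python sketch. It is not the algorithm used by Fastx-toolkit or any other tool; the thresholds (`min_q=20`, `min_len=30`) and the function names are illustrative assumptions. Reads are given as (id, sequence, list of Phred scores).

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove trailing bases whose quality score is below min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def filter_reads(reads, min_q=20, min_len=30):
    """Trim each read at its 3' end and keep it only if it is still long enough."""
    kept = []
    for read_id, seq, quals in reads:
        seq, quals = trim_3prime(seq, quals, min_q)
        if len(seq) >= min_len:
            kept.append((read_id, seq, quals))
    return kept


reads = [
    ("a", "ACGTAC", [30, 30, 30, 30, 10, 5]),  # low-quality tail gets trimmed
    ("b", "AC", [5, 5]),                       # trimmed to nothing, then dropped
]
print(filter_reads(reads, min_len=4))
```

Comparing the number of reads before and after filtering is a quick way to notice when the thresholds remove too many sequences.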

Page 57: Introduction to next generation sequencing

Acknowledgements

Statistics and Bioinformatics Research Group, Department of Statistics, Universitat de Barcelona.

Xavier de Pedro and Ferran Briansó (but also Jose Luis Mosquera and Israel Ortega) of the Statistics and Bioinformatics Unit (Unitat d’Estadística i Bioinformàtica) at VHIR (Vall d’Hebron Institut de Recerca)

Scientific-Technical Services Unit (Unitat de Serveis Científico Tècnics, UCTS) at VHIR (Vall d’Hebron Institut de Recerca)

People whose materials have been borrowed: Manel Comabella, Rosa Prieto, Paqui Gallego, Javier Santoyo, Ana Conesa, Pablo Escobar, Thomas Girke…

Page 58: Introduction to next generation sequencing

Thank you for your attention and patience