The Genome Access Course November 2014 Next Generation DNA Sequencing Illumina HiSeq X 1.8 Tbp (3 billion reads) in ~3 days (as of 11/6/2014)


Page 1

Next Generation DNA Sequencing

Illumina HiSeq X: 1.8 Tbp

(3 billion reads) in ~3 days

(as of 11/6/2014)

Page 2

Whole Genome Shotgun Sequencing

1. Genomic DNA

2. Randomly Fragment

3. Sequence Fragments

...ATCCGTAAATGGGCTGATACTACTAATGC
            TGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG...

4. Genome Assembly

5. Contiguous Sequence (Contig)

...ATCCGTAAATGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG...
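The assembly step on this slide can be sketched in a few lines: two fragments are joined wherever the end of one matches the start of the other. This is only a toy illustration using the two fragments shown on the slide; `merge_by_overlap` and its `min_overlap` parameter are hypothetical names, and real assemblers handle sequencing errors, repeats, and millions of reads.

```python
def merge_by_overlap(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two fragments on the longest suffix of `left` that is a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    raise ValueError("fragments do not overlap enough to merge")


# The two sequence fragments from the slide (without the "..." ellipses).
fragment_1 = "ATCCGTAAATGGGCTGATACTACTAATGC"
fragment_2 = "TGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG"

contig = merge_by_overlap(fragment_1, fragment_2)
print(contig)
```

The printed contig matches the contiguous sequence shown on the slide.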

Page 3

RNA Sequencing (RNA-Seq)

1. Characterize all RNA in sample

2. Gene expression level proportional to number of reads

3. Detect alternatively spliced transcripts

Figure labels: cDNA made from RNA, cDNA, Sequence Fragments

Garber et al., Nat Methods (2011)
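Point 2 above says expression level is proportional to read count. Raw counts also scale with how deeply a sample was sequenced, so counts are usually normalized before samples are compared. A minimal sketch of one common scaling, counts per million (CPM), follows; the gene names and counts are made up for illustration, and CPM corrects only for sequencing depth, not transcript length.

```python
def counts_per_million(read_counts: dict[str, int]) -> dict[str, float]:
    """Scale raw per-gene read counts to counts per million mapped reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


# Hypothetical read counts for three genes in one sample.
sample_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
cpm = counts_per_million(sample_counts)
for gene, value in cpm.items():
    print(gene, value)
```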

Page 4

Typical Next Gen Experiments

• Genome sequencing
  – Novel genomes
  – Resequencing

• Transcriptome sequencing (RNA-seq)
  – Characterize transcripts with or without reference genome
    • Typical length
    • Short (microRNAs, …)
  – Find differentially expressed transcripts

• Other
  – Methyl-seq
  – ChIP-seq

Page 5

Page 6

Illumina Sequencing

1. DNA Sample

2. Construct Library

3. Cluster Generation in Flow Cell

4. Sequencing by Synthesis

200+ million reads per lane (>100 bp reads)
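The per-lane figure above can be turned into a rough yield and coverage estimate. The sketch below uses the slide's lower bounds (200 million reads, 100 bp each); the human genome size of about 3.1 Gbp is an assumption added here for illustration, and `expected_coverage` is a hypothetical helper name.

```python
def expected_coverage(num_reads: int, read_length_bp: int, genome_size_bp: float) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


lane_reads = 200_000_000      # "200+ million reads per lane" (lower bound from the slide)
read_length = 100             # ">100 bp reads" (lower bound from the slide)
human_genome_bp = 3.1e9       # assumption: approximate human genome size

lane_yield_gbp = lane_reads * read_length / 1e9
coverage = expected_coverage(lane_reads, read_length, human_genome_bp)

print(f"Yield per lane: about {lane_yield_gbp:.0f} Gbp")
print(f"Human genome coverage from one lane: about {coverage:.1f}x")
```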

Page 7

Types of Sequencing Libraries

Single-End Reads - 5’ or 3’ (random)

Paired-End Reads - 5’ and 3’ (insert size 200-500 bp)

Mate-Pair Reads - 5’ and 3’ (insert size 2-5 kbp)

Page 8

Taken from GIGA Newsletter 13 – Université de Liège

Page 9

What Does the Data Look Like?

FASTQ File Format

Sequence

Quality (ASCII character for each base)

> 200 million reads in one lane

Files are so big that they are broken up into 40 million reads per file
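To make the "ASCII character for each base" idea concrete, here is a minimal sketch of reading four-line FASTQ records and decoding the quality string into numeric scores. It assumes the Phred+33 encoding (older Illumina pipelines used an offset of 64); the record shown is invented, and `read_fastq` is a hypothetical helper, not part of any particular toolkit.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                      # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()                   # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Each quality character encodes one Phred score as ASCII code minus offset.
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


# Invented single-read example.
example = io.StringIO("@read1\nGATTACA\n+\nIIII##5\n")
records = list(read_fastq(example))
print(records[0])
```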

Page 10

Example Analysis Workflow

Paired-End FASTQ Files

Input: FASTQ (_R1.txt) and FASTQ (_R2.txt)

1. FastQC (Diagnostics), run on each file

2. Trim Reads (if needed), run on each file

3. Align Reads to Genome

4. SAM File

5. BAM File

Page 11

Sequence Composition Diagnostics

Unbiased Reads

Biased Reads

First Position Nearly Always “T”
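The kind of per-position composition check behind these plots can be sketched as follows. This is an illustrative calculation, not FastQC's implementation; the four reads are made up so that the first position is always "T", mimicking the biased case.

```python
from collections import Counter


def base_composition(reads: list[str]) -> list[dict[str, float]]:
    """Fraction of A, C, G and T observed at each read position."""
    longest = max(len(read) for read in reads)
    composition = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[position] for read in reads if len(read) > position)
        total = sum(counts.values())
        composition.append({base: counts[base] / total for base in "ACGT"})
    return composition


# Made-up reads: position 1 is always "T", later positions are mixed.
reads = ["TACG", "TCGA", "TGAC", "TTCA"]
profile = base_composition(reads)
for position, fractions in enumerate(profile, start=1):
    print(position, fractions)
```

In an unbiased library each base fraction stays roughly flat across positions; a position where one base dominates stands out immediately.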

Page 12

GC Bias in First ~15 bp Due to Random Hexamer Priming

Page 13

Trim Sequences Prior To Analysis

• Make sure sequencing adapters are removed

• Trim ends of sequence based on quality scores
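Both trimming steps can be illustrated with a simplified sketch. It is not the algorithm used by any specific trimming tool: `trim_3prime` just drops trailing bases below a threshold, `strip_adapter` only finds exact adapter matches, and the adapter string, threshold, and reads are placeholders chosen for illustration.

```python
def trim_3prime(sequence: str, qualities: list[int], min_quality: int = 20) -> tuple[str, list[int]]:
    """Remove bases from the 3' end until one meets the quality threshold."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def strip_adapter(sequence: str, adapter: str) -> str:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = sequence.find(adapter)
    return sequence if index == -1 else sequence[:index]


# Invented read whose quality drops off toward the 3' end.
read = "ACGTACGTAC"
quals = [38, 38, 37, 36, 35, 30, 25, 12, 8, 2]
trimmed_read, trimmed_quals = trim_3prime(read, quals)

# Placeholder adapter sequence, used here only for illustration.
cleaned = strip_adapter("ACGTTTAGATCGGAAG", "AGATCGG")

print(trimmed_read, cleaned)
```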

Page 14

Trimmomatic

FastX Toolkit – Hannon Lab at CSHL

Page 15

Example Analysis Workflow

Paired-End FASTQ Files

Input: FASTQ (_R1.txt) and FASTQ (_R2.txt)

1. FastQC (Diagnostics), run on each file

2. Trim Reads (if needed), run on each file

3. Align Reads to Genome

4. SAM File

5. BAM File

Page 16

Sequence Alignment/Map (SAM) Format

Common file format to store:
 - Reads
 - Quality of each base
 - How reads align to a reference sequence

Generated by most next gen analysis software

samtools software package
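As a rough illustration of what a SAM file stores, the sketch below splits one alignment line into some of its tab-separated mandatory fields and decodes two FLAG bits. The alignment line is invented, and `parse_sam_line` is a hypothetical helper; in practice samtools or a library would do this.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int          # 1-based leftmost mapping position
    mapping_quality: int
    cigar: str
    sequence: str
    base_qualities: str

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse_strand(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    # Columns 7-9 (mate reference, mate position, template length) are skipped here.
    return SamAlignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                        int(fields[4]), fields[5], fields[9], fields[10])


# Invented example: a 7-base read aligned to the reverse strand of chr1.
example_line = "read1\t16\tchr1\t272\t60\t7M\t*\t0\t0\tGATTACA\tIIII##5"
alignment = parse_sam_line(example_line)
print(alignment.reference, alignment.position, alignment.is_reverse_strand)
```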

Page 17

samtools Used to Manipulate SAM Files

SAM File -> BAM File -> Pileup File (conversions done with samtools) -> Call Variants …

Pileup output file

chr1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
chr1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
chr1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
chr1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
chr1 276 G 22 TTTTTTTTTTTTTTTTTTTTTTT 33;+<<7=7<<7<&<<1;<<6<
chr1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
chr1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
chr1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
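The read-bases column of the pileup output can be tallied with a short sketch. In that column "." and "," mean a match to the reference on the forward and reverse strand, a letter is a mismatch, "^" plus one character marks the start of a read, and "$" marks the end of a read. `summarize_pileup_bases` is a hypothetical helper written for illustration, not a samtools function, and this is only a tally, not a variant caller.

```python
from collections import Counter


def summarize_pileup_bases(ref_base: str, bases: str) -> Counter:
    """Count the bases observed at one position of a pileup line."""
    counts: Counter = Counter()
    i = 0
    while i < len(bases):
        symbol = bases[i]
        if symbol == "^":                 # read start; next char is a mapping quality
            i += 2
            continue
        if symbol in "+-":                # indel: sign, length digits, then the indel bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if symbol in ".,":                # match to the reference base
            counts[ref_base.upper()] += 1
        elif symbol.upper() in "ACGTN":   # mismatch on either strand
            counts[symbol.upper()] += 1
        # "$" (read end) and "*" (deleted base) carry no base call and are skipped
        i += 1
    return counts


# Read-bases columns copied from positions 277 and 272 of the pileup shown above.
site_277 = summarize_pileup_bases("T", "....,,.,.,.C.,,,.,..G.")
site_272 = summarize_pileup_bases("T", ",.$.....,,.,.,...,,,.,..^+.")
print(site_277, site_272)
```

At position 277 most reads match the reference with two isolated mismatches, which looks more like sequencing error than a variant; a position where nearly every read shows the same non-reference base is a much stronger candidate.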

Page 18

Binary Alignment (BAM) Files

• Common file format to store reads and their alignment to a reference sequence
  – Generated by most next gen analysis software

• samtools software package

• UCSC Genome Browser and Ensembl can display them as a custom track
  – IGV from Broad very useful

Page 19

UCSC Genome Browser with 1,000 Genomes Project Data

Page 20

Integrative Genomics Viewer (IGV)

Page 21

LookSeq at Sanger Mouse Genomes Project

Page 22

Glo1 CNV Present in Mouse Genomes Data for A/J

Proximal Flank, Chr17: 30.5Mb (50kb): Max ~50x coverage

Glo1 Locus, Chr17: 30.7Mb (50kb): Max >100x coverage

Distal Flank, Chr17: 31.2Mb (50kb): Max ~50x coverage

Page 23

Glo1 CNV Not Present in Mouse Genomes Data for NZO

Proximal Flank, Chr17: 30.5Mb (50kb): Max ~25x coverage

Glo1 Locus, Chr17: 30.7Mb (50kb): Max ~25x coverage

Distal Flank, Chr17: 31.2Mb (50kb): Max ~25x coverage
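The reasoning behind the two Glo1 slides can be written as a simple ratio: coverage at the locus divided by the average coverage of its flanks. The sketch below uses the approximate maxima read off the slides (with 100x standing in as a lower bound for ">100x"); `coverage_ratio` is a hypothetical helper, and a real CNV caller would work from full coverage profiles rather than three numbers.

```python
def coverage_ratio(locus_coverage: float, flank_coverages: list[float]) -> float:
    """Locus coverage relative to the mean coverage of the flanking regions."""
    return locus_coverage / (sum(flank_coverages) / len(flank_coverages))


# Approximate maximum coverage values taken from the A/J and NZO slides.
strains = {
    "A/J": {"locus": 100.0, "flanks": [50.0, 50.0]},  # slide says ">100x": 100 is a lower bound
    "NZO": {"locus": 25.0, "flanks": [25.0, 25.0]},
}

ratios = {
    strain: coverage_ratio(values["locus"], values["flanks"])
    for strain, values in strains.items()
}
for strain, ratio in ratios.items():
    print(strain, ratio)
```

A ratio near 1 is consistent with no copy number change, while a ratio of 2 or more, as for A/J, is consistent with extra copies of the region.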

Page 24

Galaxy (http://main.g2.bx.psu.edu)

Page 25

Public Data Repositories

EBI: FASTQ Files -> Automatically Forward FASTQ Files to Galaxy

NCBI: SRA Formatted Files -> SRA ToolKit -> FASTQ Files

Page 26

NCBI BioProject

Page 27

NCBI Gene Expression Omnibus

Page 28

Overall Analysis Workflow

FASTQ Files

Secondary Analysis

1. Read Preprocessing & Diagnostics

2. Align Reads to Reference

3. Analysis of Aligned Reads
   e.g., Read counts per gene from RNA-Seq

Tertiary Analysis

1. Analysis of Read Counts
   e.g., Differentially expressed genes

2. Analysis of Gene Lists
   1. Enrichment
   2. Pathway and networks

3. Analysis of Expression Patterns
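The "read counts per gene" example from the secondary analysis can be sketched as an interval lookup: each aligned read is assigned to the gene whose coordinates contain its mapped position. Gene names, coordinates, and read positions below are all invented, and `count_reads_per_gene` is a hypothetical helper; dedicated counting tools also handle strand, exons, and reads that overlap several genes.

```python
from collections import Counter


def count_reads_per_gene(
    read_positions: list[tuple[str, int]],
    genes: dict[str, tuple[str, int, int]],
) -> Counter:
    """Count reads whose mapped position falls inside each gene interval."""
    counts = Counter({name: 0 for name in genes})
    for chrom, position in read_positions:
        for name, (gene_chrom, start, end) in genes.items():
            if chrom == gene_chrom and start <= position <= end:
                counts[name] += 1
    return counts


# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200)}

# Hypothetical aligned reads as (chromosome, leftmost position).
reads = [("chr1", 150), ("chr1", 450), ("chr1", 900), ("chr2", 150), ("chr1", 600)]

gene_counts = count_reads_per_gene(reads, genes)
print(gene_counts)
```

A table of counts like this, one column per sample, is the usual starting point for the tertiary analysis of differentially expressed genes.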

Page 29

Push-Button Bioinformatics … Be Careful

Page 30

Third Generation Sequencing