Analysis of Next Generation Sequence Data


Upload: asha

Post on 30-Jan-2016


TRANSCRIPT

Page 1: Analysis of Next Generation Sequence Data

The Genome Access Course

November 2011

Analysis of Next Generation Sequence Data

Illumina HiSeq2000: 600 Gbp (6 billion reads) in ~11 days

Page 2: Analysis of Next Generation Sequence Data

Typical Next Gen Experiments

• Genome sequencing
  – Novel genomes
  – Resequencing

• Transcriptome sequencing (RNA-seq)
  – Characterize transcripts with or without reference genome
    • Typical length
    • Short (microRNAs, …)
  – Find differentially expressed transcripts

• Other
  – Methyl-seq
  – ChIP-seq
  – RIP-seq
  – …

Page 3: Analysis of Next Generation Sequence Data

Types of Sequencing Libraries

Single-End Reads - 5’ or 3’ (random)

Paired-End Reads - 5’ and 3’ (200-500 bp fragments)

Mate-Pair Reads - 5’ and 3’ (2-5 kbp fragments)

Page 4: Analysis of Next Generation Sequence Data

What Does the Data Look Like? FASTQ File Format

Sequence

Quality (ASCII character for each base)
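A FASTQ record is four lines: an `@` header, the sequence, a `+` separator, and a quality string with one ASCII character per base. The following is a minimal sketch of reading records and decoding qualities. It assumes the Phred+33 offset (older Illumina pipelines used Phred+64), and the read name and bases are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual

def phred_scores(quality_string, offset=33):
    """Convert ASCII quality characters to integer Phred scores.

    offset=33 is an assumption; check which encoding your data uses.
    """
    return [ord(ch) - offset for ch in quality_string]

# Hypothetical single-record input
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
scores = phred_scores(records[0][2])
```

With tens of millions of reads per lane, a generator like this keeps only one record in memory at a time.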

> 80 million reads in one lane

Page 5: Analysis of Next Generation Sequence Data

Quality Control Analysis of Reads

Page 6: Analysis of Next Generation Sequence Data

Trim Sequences Prior To Analysis

• Make sure sequencing adapters are removed
• Trim ends of sequence based on quality scores
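The two trimming steps can be sketched as follows. This is a simplified illustration, not what any particular trimming tool does: it removes an adapter only on an exact match, trims only the 3’ end, and assumes Phred+33 qualities with a cutoff of 20. The adapter string in the usage lines is a placeholder; use the sequence that matches your library kit.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]

def trim_low_quality_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Placeholder adapter and reads for illustration
adapter = "AGATCGGAAGAGC"
clipped = trim_adapter("ACGTACGT" + adapter, "I" * 21, adapter)
trimmed = trim_low_quality_3prime("ACGTAC", "IIII##")
```

Real trimmers also tolerate mismatches and catch partial adapters at the read end, which this sketch does not.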

Page 7: Analysis of Next Generation Sequence Data

Sequence Composition Diagnostics

Unbiased Reads

Biased Reads

First Position Nearly Always “T”
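A composition diagnostic of this kind tabulates the fraction of each base at every read position. In the minimal sketch below, the toy reads are invented so that the first position is always "T", mimicking the biased case.

```python
from collections import Counter

def base_composition(reads):
    """Return a list (one entry per read position) of base -> fraction dicts."""
    max_len = max(len(r) for r in reads)
    per_position = []
    for i in range(max_len):
        # Only reads long enough to have a base at position i contribute
        counts = Counter(r[i] for r in reads if i < len(r))
        total = sum(counts.values())
        per_position.append({b: counts[b] / total for b in "ACGTN"})
    return per_position

# Invented reads: position 0 is always T, position 1 is balanced
toy_reads = ["TACG", "TCGA", "TGAC", "TTTT"]
composition = base_composition(toy_reads)
```

For unbiased reads, the fractions at each position should sit close to the overall base frequencies of the sample.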

Page 8: Analysis of Next Generation Sequence Data

Genome Sequencing

Page 9: Analysis of Next Generation Sequence Data

Workflows for Genome Sequencing

Novel Genome Sequencing

• de novo assembly
  – Generate contigs and scaffolds using overlapping reads

• If applicable, align reads from a sample back to consensus to examine variation
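The idea of building contigs from overlapping reads can be illustrated with a toy suffix-prefix overlap. This is a sketch under strong assumptions (error-free reads, exact matches, one pair at a time). Production assemblers use graph-based methods and tolerate sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Join two reads into one contig if they overlap; otherwise return None."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Invented reads sharing the 4-base overlap "TGCA"
contig = merge("ACGTTGCA", "TGCAGGT")
```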

Resequencing

• Align reads from a sample to a reference genome assembly to examine variation
  – BWA mapping software

Page 10: Analysis of Next Generation Sequence Data

Sequence Alignment/Map (SAM) Format

Common file format to store reads and their alignment to a reference sequence

Generated by most next gen analysis software

samtools software package
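Each SAM alignment line has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), followed by optional tags. A minimal parsing sketch is shown here. The example line is invented, and a real pipeline would normally use samtools or an established library instead of hand-parsing.

```python
from typing import NamedTuple, Tuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]

def parse_sam_line(line):
    """Parse one alignment line (not a '@' header line) into a SamRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("SAM alignment lines need at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))

def is_unmapped(rec):
    return bool(rec.flag & 0x4)   # FLAG bit 0x4: segment unmapped

def is_reverse(rec):
    return bool(rec.flag & 0x10)  # FLAG bit 0x10: reverse strand

# Invented alignment line for illustration
rec = parse_sam_line("read1\t16\tchr17\t30700000\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
```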

Page 11: Analysis of Next Generation Sequence Data

Binary Alignment/Map (BAM) Files

• SAM (text file) → BAM (binary file)
  – Not human-readable
  – Smaller file sizes

• BAM is widely used:
  – Often deposited to Gene Expression Omnibus (GEO) at NCBI
  – UCSC Genome Browser can display alignments as a track
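Because BAM is compressed binary, a quick way to tell it apart from SAM text is to look at its first bytes: a BAM file is gzip-compatible (BGZF) and its decompressed content starts with the magic bytes `BAM\1`. The sketch below only checks that signature. The files it writes are stand-ins created for the demonstration, not valid alignment files.

```python
import gzip
import os
import tempfile

def looks_like_bam(path):
    """True if the file decompresses as gzip and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compressed (e.g. plain-text SAM) or truncated
        return False

with tempfile.TemporaryDirectory() as tmp:
    bam_like = os.path.join(tmp, "example.bam")
    with open(bam_like, "wb") as out:
        # Stand-in payload: magic bytes plus padding, NOT a real BAM
        out.write(gzip.compress(b"BAM\x01" + b"\x00" * 8))
    sam_like = os.path.join(tmp, "example.sam")
    with open(sam_like, "w") as out:
        out.write("@HD\tVN:1.0\n")
    bam_result = looks_like_bam(bam_like)
    sam_result = looks_like_bam(sam_like)
```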

Page 12: Analysis of Next Generation Sequence Data

UCSC Genome Browser with 1,000 Genomes Project Data

Page 13: Analysis of Next Generation Sequence Data

LookSeq at Sanger Mouse Genomes Project

Page 14: Analysis of Next Generation Sequence Data

Glo1 CNV Present in Mouse Genomes Data for A/J

Proximal Flank (Chr17: 30.5Mb, 50kb): Max ~50x coverage

Glo1 Locus (Chr17: 30.7Mb, 50kb): Max >100x coverage

Distal Flank (Chr17: 31.2Mb, 50kb): Max ~50x coverage

Page 15: Analysis of Next Generation Sequence Data

Glo1 CNV Not Present in Mouse Genomes Data for NZO

Proximal Flank (Chr17: 30.5Mb, 50kb): Max ~25x coverage

Glo1 Locus (Chr17: 30.7Mb, 50kb): Max ~25x coverage

Distal Flank (Chr17: 31.2Mb, 50kb): Max ~25x coverage
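The coverage figures for A/J and NZO suggest a simple comparison: divide coverage at the locus by the average coverage of the flanks. A ratio near 1 is consistent with no copy number change, and a ratio near 2 or more with a duplication. The sketch below uses the approximate maxima quoted on the two slides (taking 100 as a lower bound for ">100x"). The 1.5 cutoff is an arbitrary assumption for illustration, not a validated threshold.

```python
def coverage_ratio(locus_cov, flank_covs):
    """Locus coverage divided by the mean coverage of the flanking regions."""
    baseline = sum(flank_covs) / len(flank_covs)
    return locus_cov / baseline

# Approximate maximum coverages read from the slides
strains = {
    "A/J": (100, [50, 50]),  # locus >100x, flanks ~50x
    "NZO": (25, [25, 25]),   # locus ~25x, flanks ~25x
}

ratios = {name: coverage_ratio(locus, flanks)
          for name, (locus, flanks) in strains.items()}

# Hypothetical cutoff, chosen only for this example
possible_gain = {name: ratio >= 1.5 for name, ratio in ratios.items()}
```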

Page 16: Analysis of Next Generation Sequence Data

RNA-seq Data Analysis

Page 17: Analysis of Next Generation Sequence Data

RNA-Seq

Reads are randomly sampled fragments from RNA sample

Proportion of reads for a transcript reflects expression level of transcript

Lots of reads needed to construct models for every alternatively spliced transcript

Garber et al, Nat Methods (2011)

Page 18: Analysis of Next Generation Sequence Data

Experimental Design

Auer & Doerge Genetics (2010) 185: 405-416

Page 19: Analysis of Next Generation Sequence Data

Marioni et al, Genome Res (2008) 18(9):1509-17

Page 20: Analysis of Next Generation Sequence Data

Comparison of Affy and RNA-seq

Marioni et al, Genome Res (2008) 18(9):1509-17

Page 21: Analysis of Next Generation Sequence Data

Comparison of Affy and RNA-seq

Marioni et al, Genome Res (2008) 18(9):1509-17

Page 22: Analysis of Next Generation Sequence Data

Marioni et al, Genome Res (2008) 18(9):1509-17

Page 23: Analysis of Next Generation Sequence Data

Shendure, Nat Methods (2008) 5(7): 585-7

Page 24: Analysis of Next Generation Sequence Data

Workflows for RNA-seq

Novel Transcriptome Sequencing

• de novo assembly
• Align reads from each sample/group to assembly
  – Statistics for each transcript contig

Transcriptome Sequencing with Reference Genome

• Align reads from each sample/group to genome
  – Statistics for each transcript model
  – Examine isoforms

QC Reads (both workflows)

Analyze Counts (both workflows)

Page 25: Analysis of Next Generation Sequence Data

de novo Transcriptome Assembly

Rarefaction Plot

How much sequencing is enough?
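A rarefaction curve addresses this question by subsampling the reads at increasing depths and counting how many distinct transcripts are detected at each depth. When the curve flattens, extra sequencing finds little that is new. The sketch below assumes each read has already been assigned a transcript ID. The IDs and their abundances are invented.

```python
import random

def rarefaction_curve(read_assignments, depths, seed=0):
    """Return (depth, distinct transcripts seen) pairs for nested subsamples."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    shuffled = list(read_assignments)
    rng.shuffle(shuffled)
    # Taking prefixes of one shuffle gives nested random subsamples
    return [(depth, len(set(shuffled[:depth]))) for depth in depths]

# Invented data: 100 reads spread unevenly over four transcripts
reads = ["t1"] * 50 + ["t2"] * 30 + ["t3"] * 15 + ["t4"] * 5
curve = rarefaction_curve(reads, depths=[10, 50, 100])
```

Rare transcripts such as `t4` here are typically the last to appear, which is why low-abundance isoforms drive the sequencing depth requirement.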

Page 26: Analysis of Next Generation Sequence Data

Mapping Reads

Align reads to a reference
• Genome assembly
• Transcriptome assembly

Commonly used aligners:
• bwa
• bowtie

Page 27: Analysis of Next Generation Sequence Data

RNAseq Workflow With Reference Genome

Langmead et al. Genome Biology (2010), 11:R83

Page 28: Analysis of Next Generation Sequence Data

Map Reads & Obtain Read Counts Per Gene

Both utilize a reference genome
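Once reads are mapped to the reference genome, counting reads per gene amounts to checking which annotated interval each alignment position falls in. The sketch below is deliberately naive: it uses only the leftmost mapped position, scans every gene for every read, and ignores strand, spliced reads and overlapping genes. The gene names and coordinates are hypothetical.

```python
def count_reads_per_gene(alignments, genes):
    """Count alignments whose position falls inside each gene interval.

    alignments: iterable of (chrom, pos), 1-based positions
    genes: dict of name -> (chrom, start, end), 1-based inclusive
    """
    counts = {name: 0 for name in genes}
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts

# Hypothetical annotation and mapped read positions
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200)}
alignments = [("chr1", 150), ("chr1", 499), ("chr1", 900),
              ("chr1", 600), ("chr2", 150)]
gene_counts = count_reads_per_gene(alignments, genes)
```

The resulting count table is the input to the differential expression statistics.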

Page 29: Analysis of Next Generation Sequence Data

Bowtie/TopHat

Trapnell, Pachter, Salzberg. Bioinformatics (2009) 25(9):1105-1111

Bowtie uses Burrows-Wheeler indexing for rapid mapping

TopHat uses Initially Un-Mapped (IUM) reads to find novel splice sites
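To give a feel for Burrows-Wheeler indexing, the sketch below builds the Burrows-Wheeler transform of a short sequence by sorting its rotations, then counts exact occurrences of a pattern with backward search. It is a textbook illustration only: Bowtie's actual index is far more compact and also handles mismatches, and the naive rotation sort here would not scale to a genome.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    counts = Counter(bwt_str)
    first = {}   # first[c]: number of characters in the text smaller than c
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_str)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Naive rank query; a real index precomputes these counts
        lo = first[ch] + bwt_str[:lo].count(ch)
        hi = first[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("ACAACG")
hits = count_occurrences(transformed, "AC")
```

Each pattern character narrows the range in one step regardless of genome size, which is the source of the speed.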

Page 30: Analysis of Next Generation Sequence Data

Cufflinks

FPKM = Fragments Per Kilobase of transcript per Million fragments mapped

Trapnell et al. Nature Biotech (2010) 28(5):511-515
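The FPKM definition translates directly into arithmetic: divide the fragment count by the transcript length in kilobases and by the total mapped fragments in millions. The sketch below shows only that unit conversion. Cufflinks itself estimates per-isoform fragment counts statistically before normalizing, and the numbers in the example are invented.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million fragments mapped."""
    # 1e9 = 1e3 (bp -> kb) * 1e6 (fragments -> millions)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Invented example: 100 fragments on a 2 kb transcript, 10 million mapped
value = fpkm(100, 2000, 10_000_000)
```

Normalizing by length and depth makes values comparable across transcripts of different sizes and across libraries sequenced to different depths.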

Page 31: Analysis of Next Generation Sequence Data

Galaxy

Can be used to upload FASTQ files and then run a number of QC tools and many other tools:

• bwa
• bowtie
• tophat
• cufflinks
• …

Page 32: Analysis of Next Generation Sequence Data

Third Generation Sequencing