The Genome Access Course
November 2011
Analysis of Next Generation Sequence Data
Illumina HiSeq2000
600 Gbp (6 billion reads) in ~11 days
Typical Next Gen Experiments
• Genome sequencing
  – Novel genomes
  – Resequencing
• Transcriptome sequencing (RNA-seq)
  – Characterize transcripts with or without reference genome
    • Typical length
    • Short (microRNAs, …)
  – Find differentially expressed transcripts
• Other
  – Methyl-seq
  – ChIP-seq
  – RIP-seq
  – …
Types of Sequencing Libraries
Single-End Reads - 5’ or 3’ (random)
Paired-End Reads - 5’ and 3’ (200-500 bp fragments)
Mate-Pair Reads - 5’ and 3’ (2-5 kbp fragments)
What Does the Data Look Like?
FASTQ File Format
Sequence
Quality (ASCII character for each base)
> 80 million reads in one lane
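A minimal sketch of reading FASTQ records and decoding the per-base quality characters. It assumes the common four-lines-per-record layout and a Phred+33 ASCII offset; older Illumina pipelines used an offset of 64, so the offset is a parameter. The record shown is made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Convert quality characters to Phred scores (offset 33 is an assumption)."""
    return [ord(c) - offset for c in qual]

# Made-up record, for illustration only.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, seq, scores)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base call is wrong.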
Quality Control Analysis of Reads
Trim Sequences Prior To Analysis
• Make sure sequencing adapters are removed
• Trim ends of sequence based on quality scores
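The two trimming steps can be sketched as follows. This is a simplified illustration, not the behaviour of any particular trimming tool: adapter matching is exact only, the adapter string is a placeholder to be replaced with the sequence for the library kit in use, and the quality cutoff of 20 and Phred+33 offset are assumptions.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

ADAPTER = "AGATCGGAAG"  # placeholder adapter prefix, used only for this example

# Step 1: remove adapter read-through.
seq, qual = trim_adapter("ACGTACGTAGATCGGAAGTT", "I" * 20, ADAPTER)

# Step 2: trim a low-quality 3' tail ('#' and '!' encode Phred 2 and 0).
seq2, qual2 = trim_low_quality_tail("ACGTACGT", "IIIII5#!")
print(seq, seq2)
```

Real trimmers also tolerate mismatches in the adapter and partial adapters at the read end, which this sketch does not.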
Sequence Composition Diagnostics
Unbiased Reads
Biased Reads
First Position Nearly Always “T”
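A sketch of the kind of per-position base composition summary behind this diagnostic. The reads are invented so that the first position is always "T", mimicking the biased case; the function name is hypothetical.

```python
from collections import Counter

def base_composition(reads):
    """Return, for each read position, the fraction of each base observed there."""
    longest = max(len(r) for r in reads)
    counts = [Counter() for _ in range(longest)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    return [
        {base: n / sum(c.values()) for base, n in c.items()}
        for c in counts
    ]

# Invented reads: position 1 is always T, later positions are mixed.
reads = ["TACG", "TCGA", "TGAC", "TTCA"]
profile = base_composition(reads)
for position, fractions in enumerate(profile, start=1):
    print(position, sorted(fractions.items()))
```

In an unbiased library each position would show roughly the same base fractions; a position dominated by one base stands out immediately.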
Genome Sequencing
Workflows for Genome Sequencing
Novel Genome Sequencing
• de novo assembly
  – Generate contigs and scaffolds using overlapping reads
• If applicable, align reads from a sample back to consensus to examine variation
Resequencing
• Align reads from a sample to a reference genome assembly to examine variation
  – BWA mapping software
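To illustrate how overlapping reads can be joined into a contig, here is a toy greedy merge. It is a teaching sketch only: production assemblers use overlap or de Bruijn graphs and handle sequencing errors, reverse complements, and repeats, none of which are covered here. The reads and the minimum overlap of 3 are invented.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if < min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # invented overlapping reads
contigs = greedy_assemble(reads)
print(contigs)
```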
Sequence Alignment/Map (SAM) Format
Common file format to store reads and their alignment to a reference sequence
Generated by most next gen analysis software
samtools software package
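A sketch of what one SAM alignment line contains. Each line has 11 mandatory tab-separated fields followed by optional tags; the example line is invented, and the `SamRecord` type and helper names are hypothetical. In practice a library or samtools itself would normally be used rather than hand parsing.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment operations
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII)

def parse_sam_line(line):
    """Parse the 11 mandatory SAM fields; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])

def is_unmapped(rec):
    return bool(rec.flag & 0x4)   # flag bit 0x4: read unmapped

def is_reverse(rec):
    return bool(rec.flag & 0x10)  # flag bit 0x10: read on reverse strand

# Invented alignment line, for illustration only.
line = "read1\t16\tchr17\t30700000\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, is_reverse(rec))
```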
Binary Alignment/Map (BAM) Files
• SAM (text file) → BAM (binary file)
  – Not human-readable
  – Smaller file sizes
• BAM is widely used:
  – Often deposited to Gene Expression Omnibus (GEO) at NCBI
  – UCSC Genome Browser can display alignments as a track
UCSC Genome Browser with 1,000 Genomes Project Data
LookSeq at Sanger Mouse Genomes Project
Glo1 CNV Present in Mouse Genomes Data for A/J
Proximal Flank (Chr17: 30.5Mb): max ~50x coverage
Glo1 Locus (Chr17: 30.7Mb): max >100x coverage
Distal Flank (Chr17: 31.2Mb): max ~50x coverage
Scale label on each panel: 50kb
Glo1 CNV Not Present in Mouse Genomes Data for NZO
Proximal Flank (Chr17: 30.5Mb): max ~25x coverage
Glo1 Locus (Chr17: 30.7Mb): max ~25x coverage
Distal Flank (Chr17: 31.2Mb): max ~25x coverage
Scale label on each panel: 50kb
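The comparison between the two strains comes down to simple arithmetic: coverage at the Glo1 locus relative to its flanks. The sketch below uses the approximate maximum coverage values from the slides (the A/J locus is ">100x", so 100 is a lower bound). The 1.5 cutoff is a hypothetical threshold chosen for illustration, and a real analysis would use normalized mean depth rather than maxima.

```python
def copy_ratio(locus_cov, proximal_cov, distal_cov):
    """Coverage at the locus divided by the mean coverage of the two flanks."""
    return locus_cov / ((proximal_cov + distal_cov) / 2)

# Approximate maximum coverage read from the slides.
coverage = {
    "A/J": {"proximal": 50, "locus": 100, "distal": 50},  # locus is a lower bound
    "NZO": {"proximal": 25, "locus": 25, "distal": 25},
}

THRESHOLD = 1.5  # hypothetical cutoff for flagging a copy number gain

ratios = {
    strain: copy_ratio(c["locus"], c["proximal"], c["distal"])
    for strain, c in coverage.items()
}
for strain, ratio in ratios.items():
    status = "possible gain" if ratio >= THRESHOLD else "no evidence of gain"
    print(f"{strain}: locus/flank ratio {ratio:.1f} ({status})")
```

A ratio of about 2 in A/J is what a duplication would produce, while NZO shows the same depth at the locus as in the flanks.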
RNA-seq Data Analysis
RNA-Seq
Reads are randomly sampled fragments from RNA sample
Proportion of reads for a transcript ∝ Expression level of transcript
Lots of reads needed to construct models for every alternatively spliced transcript
Garber et al, Nat Methods (2011)
Experimental Design
Auer & Doerge Genetics (2010) 185: 405-416
Marioni et al, Genome Res (2008) 18(9):1509-17
Comparison of Affy and RNA-seq
Marioni et al, Genome Res (2008) 18(9):1509-17
Comparison of Affy and RNA-seq
Marioni et al, Genome Res (2008) 18(9):1509-17
Marioni et al, Genome Res (2008) 18(9):1509-17
Shendure, Nat Methods (2008) 5(7): 585-7
Workflows for RNA-seq
Novel Transcriptome Sequencing
• de novo assembly
• Align reads from each sample/group to assembly
  – Statistics for each transcript contig
Transcriptome Sequencing with Reference Genome
• Align reads from each sample/group to genome
  – Statistics for each transcript model
  – Examine isoforms
Steps shared by both workflows: QC Reads, Analyze Counts
de novo Transcriptome Assembly
Rarefaction Plot
How much sequencing is enough?
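A rarefaction curve can be sketched by subsampling reads at increasing depths and counting how many distinct transcripts have been seen. The example below assumes each read has already been assigned to a transcript; the transcript labels and abundances are invented, and the function name is hypothetical.

```python
import random

def rarefaction_curve(read_assignments, depths, seed=0):
    """Shuffle reads once, then count distinct transcripts in each prefix of given depth."""
    rng = random.Random(seed)
    shuffled = list(read_assignments)
    rng.shuffle(shuffled)
    return [(depth, len(set(shuffled[:depth]))) for depth in depths]

# Invented data: one abundant transcript and three progressively rarer ones.
reads = ["t1"] * 900 + ["t2"] * 90 + ["t3"] * 9 + ["t4"]
curve = rarefaction_curve(reads, depths=[10, 100, 500, 1000])
for depth, detected in curve:
    print(depth, detected)
```

When the curve flattens, additional sequencing mostly resamples transcripts that were already detected; a curve that is still rising suggests rare transcripts are being missed.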
Mapping Reads
Align reads to a reference
• Genome assembly
• Transcriptome assembly
Commonly used aligners:
• bwa
• bowtie
RNAseq Workflow With Reference Genome
Langmead et al. Genome Biology (2010), 11:R83
Map Reads & Obtain Counts of Reads Per Gene
Both utilize a reference genome
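Once reads are mapped to a reference genome, counting reads per gene is an interval lookup. The sketch below assigns a read to a gene when its start position falls inside the gene's coordinates. The gene names, coordinates, and alignments are all hypothetical, and real counting tools also deal with strand, spliced reads, and reads overlapping several genes.

```python
def count_reads_per_gene(alignments, genes):
    """Count reads whose start position falls inside each gene interval (inclusive)."""
    counts = {name: 0 for name in genes}
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts

# Hypothetical gene annotation: name -> (chromosome, start, end).
genes = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 900, 1000),
}
# Hypothetical mapped reads: (chromosome, 1-based start position).
alignments = [("chr1", 150), ("chr1", 180), ("chr1", 950), ("chr2", 150)]

counts = count_reads_per_gene(alignments, genes)
print(counts)
```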
Bowtie/TopHat
Trapnell, Pachter, Salzberg. Bioinformatics (2009) 25(9):1105-1111
Bowtie uses Burrows-Wheeler indexing for rapid mapping
TopHat uses Initially Un-Mapped (IUM) reads to find novel splice sites
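To give a feel for Burrows-Wheeler indexing, here is a toy index with exact-match backward search. It is a sketch of the general technique, not Bowtie's implementation: Bowtie uses a compressed, sampled index and tolerates mismatches, whereas this version stores everything in plain Python lists and finds exact matches only. The reference string is invented.

```python
def bwt_index(text):
    """Build a toy index: suffix array, C table, and occurrence counts over the BWT."""
    text += "$"  # sentinel; sorts before A, C, G, T
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffixes)  # character preceding each suffix
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return suffixes, c_table, occ

def find(pattern, index):
    """Return sorted 0-based start positions of exact matches, via backward search."""
    suffixes, c_table, occ = index
    lo, hi = 0, len(suffixes)  # half-open range of matching suffix array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffixes[lo:hi])

reference = "ACGTACGTTACG"  # invented reference sequence
index = bwt_index(reference)
print(find("ACG", index), find("TTA", index), find("GGG", index))
```

The search cost depends on the read length rather than the genome length, which is why this family of indexes allows rapid mapping of many short reads.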
Cufflinks
FPKM = Fragments Per Kilobase of transcript per Million fragments mapped
Trapnell et al. Nature Biotech (2010) 28(5):511-515
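The FPKM definition translates directly into arithmetic. The sketch below applies the definition to raw counts with invented numbers; it does not reproduce Cufflinks, which estimates how fragments are distributed among isoforms before computing abundances.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions_mapped

# Invented example: 500 fragments on a 2,000 bp transcript,
# in a library with 20 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2_000,
             total_mapped_fragments=20_000_000)
print(f"FPKM = {value:.1f}")
```

Dividing by transcript length and library size makes values comparable across transcripts of different lengths and across samples sequenced to different depths.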
Galaxy
Can be used to upload FASTQ files and then run a number of QC tools and many other tools:
• bwa
• bowtie
• tophat
• cufflinks
• …
Third Generation Sequencing