Introduction To Next Generation Sequencing (NGS) Data Analysis

Jenny Wu

UCI Genomics High Throughput Facility

Outline

• Goals: Practical guide to NGS data processing
• Bioinformatics in NGS data analysis
– Basics: terminology, data file formats, general workflow
– Data Analysis Pipeline
   • Sequence QC and preprocessing
   • Obtaining and preparing reference
   • Sequence mapping
   • Downstream analysis workflow and software
• Example: RNA-Seq analysis with Tuxedo protocol
• Summary and future plan

Why Next Generation Sequencing

One can sequence hundreds of millions of short reads (35-120 bp) in a single run, in a short period of time and at a low per-base cost.

• Illumina/Solexa GA II / HiSeq 2000, 2500
• Life Technologies/Applied Biosystems SOLiD
• Roche/454 FLX, Titanium

Reviews:
Michael Metzker (2010) Nature Reviews Genetics 11:31
Quail et al. (2012) BMC Genomics Jul 24;13:341.

Why Bioinformatics

[Figure: Informatics (source: wall.hms.harvard.edu)]

Bioinformatics Challenges in NGS Data Analysis

• VERY large text files (tens of millions of lines long)
– Can't do 'business as usual' with familiar tools
– Prohibitive memory usage and execution time
– Manage, analyze, store, transfer and archive huge files

• Need for powerful computers and expertise
– Informatics groups must manage compute clusters
– New algorithms and software are required; they are often open source and Unix/Linux based
– Collaboration of IT, bioinformaticians and biologists

Basic NGS Workflow

NGS Data Analysis Overview

Olson et al.


Terminology

• Coverage (depth): The number of read nucleotides mapped to a given reference position.

• Quality Score: Each called base comes with a quality score, which encodes the estimated probability that the base call is wrong.

• Mapping: Aligning reads to a reference to identify their origin.

• Assembly: Merging fragments of DNA in order to reconstruct the original sequence.

• Duplicate reads: Reads that are identical.

• Multi-reads: Reads that can be mapped to multiple locations equally well.
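To make the coverage definition concrete, here is a minimal Python sketch that counts how many reads cover each reference position. The input format (a list of 0-based `(start, length)` pairs) and the three toy reads are assumptions made for illustration; real pipelines would compute depth from a SAM/BAM file with a dedicated tool.

```python
from collections import Counter

def per_base_coverage(alignments):
    """Count how many aligned reads cover each reference position.

    `alignments` is an iterable of (start, length) pairs in 0-based
    reference coordinates (a toy input format assumed for this sketch).
    """
    depth = Counter()
    for start, length in alignments:
        for pos in range(start, start + length):
            depth[pos] += 1
    return depth

# Three hypothetical 5 bp reads mapped at positions 0, 2 and 4.
toy_alignments = [(0, 5), (2, 5), (4, 5)]
depth = per_base_coverage(toy_alignments)

# Mean depth over the positions that are covered at least once.
mean_depth = sum(depth.values()) / len(depth)

for pos in sorted(depth):
    print(pos, depth[pos])
print("mean depth:", round(mean_depth, 2))
```

Position 4 is the only position covered by all three toy reads, so it has the highest depth.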

What does the data look like?

Common NGS Data Formats

FASTA Format (Reference Seq)

FASTQ Format (reads)

FASTQ Format (Illumina Example)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Read Record
• Header
• Read Bases
• Separator (with optional repeated header)
• Read Quality Scores

Header fields
• Flow Cell ID
• Lane
• Tile
• Tile Coordinates
• Barcode

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.

(Passarelli, 2012)
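The record layout above can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established FASTQ parser: it assumes exactly four lines per record, the colon-separated Illumina header layout shown in the example, and Phred+33 quality encoding. The function and dictionary key names are made up for illustration, and the example read is truncated to its first ten bases for brevity.

```python
import io

def read_fastq(handle):
    """Yield (header, bases, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of input
        bases = handle.readline().strip()
        separator = handle.readline().strip()
        qualities = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header}")
        if len(bases) != len(qualities):
            raise ValueError(f"bases/qualities length mismatch in: {header}")
        yield header[1:], bases, qualities

def parse_illumina_header(header):
    """Split an Illumina-style header into named fields (layout assumed)."""
    left, right = header.split(None, 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read_number, is_filtered, control, barcode = right.split(":")
    return {
        "instrument": instrument, "run": int(run), "flowcell": flowcell,
        "lane": int(lane), "tile": int(tile), "x": int(x), "y": int(y),
        "read": int(read_number), "filtered": is_filtered == "Y",
        "barcode": barcode,
    }

def phred33_scores(qualities):
    """Decode a quality string into integer Phred scores (offset 33 assumed)."""
    return [ord(char) - 33 for char in qualities]

def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# First record of the example, truncated to ten bases for brevity.
example = io.StringIO(
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n"
    "CAGGAGTCTT\n"
    "+\n"
    "BCCFFFDFHH\n"
)
header, bases, qualities = next(read_fastq(example))
fields = parse_illumina_header(header)
scores = phred33_scores(qualities)
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
print(scores)
```

For paired-end runs the same reader could be applied to both files in parallel, since the headers correspond one-to-one.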

Data Analysis Pipeline

(Flowchart legend: Data, Task, File Format, Software)

1. Raw reads (file format: FASTQ)
2. Read QC and preprocessing (software: FastQC, FASTX-toolkit, PRINSEQ)
3. Analysis-ready reads (file format: FASTQ)
4. Read Mapping (software: Bowtie, BWA, MAQ)
– Input: collected reference sequences and annotation (file formats: FASTA, GTF/GFF)
5. Mapped reads (file format: SAM/BAM)
– Visualization (IGV, UCSC GB)
– Local realignment, base quality recalibration
6. Downstream analysis
– Whole Genome Sequencing: Variant calling, annotation
– RNA-Seq: Transcript assembly, quantification
– ChIP-Seq: Peak calling
– Methyl-Seq: Methylation calling
– ……

Why QC?

Sequencing runs cost money
• Consequences of not assessing the data
• Sequencing a poor library on multiple runs – throwing money away!

Data analysis costs money and time
• Cost of analyzing data, CPU time $$
• Cost of storing raw sequence data $$$
• Hours of analysis could be wasted $$$$
• Downstream analysis can be incorrect.

How to QC?

FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)

Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y

$ fastqc s_1_1.fastq
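As a rough illustration of the kind of check a read QC step performs, the Python sketch below computes the mean quality at each read position and trims low-quality bases from the 3' end of a read. It is not how FastQC or any trimming tool is implemented; the function names, the Phred+33 offset, the threshold of 20, and the toy reads are all assumptions for this example.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across all reads."""
    totals, counts = [], []
    for qualities in quality_strings:
        for i, char in enumerate(qualities):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

def trim_low_quality_tail(bases, qualities, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below `min_q`."""
    end = len(qualities)
    while end > 0 and ord(qualities[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], qualities[:end]

# Two hypothetical 4 bp reads: 'I' = Q40, '5' = Q20, '#' = Q2 under Phred+33.
toy_qualities = ["IIII", "II5#"]
position_means = mean_quality_per_position(toy_qualities)
trimmed_bases, trimmed_qualities = trim_low_quality_tail("ACGT", "II5#")

print(position_means)
print(trimmed_bases, trimmed_qualities)
```

A drop in the per-position mean toward the end of the reads, as in the toy data, is a common reason to trim before mapping.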


The UCSC Genome Browser Homepage

Get genome annotation here!

General information

Specific information—new features, current status, etc.

Get reference sequences here!

Getting reference sequences

Getting Reference Annotation

Sequence Mapping Challenges

• Alignment (mapping) is the first step once read sequences are obtained.

• The task: to align sequencing reads against a known reference.

• Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.
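The difficulties listed above, in particular ambiguity from repeats and sequencing errors, can be seen in a toy seed-and-extend mapper. This Python sketch indexes every k-mer of a reference, looks up the first k bases of a read, and accepts candidate positions with at most one mismatch. It is a deliberately naive illustration, not the algorithm used by Bowtie, BWA or MAQ; the reference, reads and parameter values are invented for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for each acceptable placement of the read."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

# Toy reference containing the repeat GATTACA at positions 0 and 9.
reference = "GATTACAGGGATTACATTC"
k = 4
index = build_kmer_index(reference, k)

repeat_hits = map_read("GATTACA", reference, index, k)    # falls in the repeat
unique_hits = map_read("ATTACATT", reference, index, k)   # extends past the repeat
error_hits = map_read("ATTACCTT", reference, index, k)    # one simulated sequencing error

print(repeat_hits, unique_hits, error_hits)
```

The read that lies entirely inside the repeat maps equally well to two positions (a multi-read), while the longer read that reaches unique sequence maps to one. This is one reason read length affects mappability.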

Short Read Alignment

Olson et al.

Short Read Alignment Software

Short Reads Mapping Software

How to choose an aligner?

• There are many aligners and they vary a lot in performance (accuracy, memory usage, speed, etc).

• Factors to consider : application, platform, read length, downstream analysis, etc.

• There is a constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie).

• Guaranteed high accuracy will take longer.

NGS Applications and Analysis Strategy

Name | Nucleic acid population | Brief analysis strategy
RNA-Seq | RNA (may be poly-A mRNA or total RNA) | Alignment of reads to “genes”; variations for detecting splice junctions and quantifying abundance
Small RNA sequencing | Small RNA (often miRNA) | Alignment of reads to small RNA references (e.g. miRBase), then to the genome; quantify abundance
ChIP-Seq | DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation) | Align reads to reference genome, identify peaks & motifs
RIP-Seq | RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation) | Align reads to reference genome and/or “genes”, identify peaks and motifs
Methylation Analysis | Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms | Align reads to reference and either identify peaks or regions of methylation
SNP calling/discovery | All or some genomic DNA or RNA | Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
Structural Variation Analysis | Genomic DNA, with two reads (mate-pair reads) per DNA template | Align mate-pairs to reference sequence and interpret structural variants
de novo Sequencing | Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.) | Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
Metagenomics | Entire RNA or DNA from a (usually microbial) community | Phylogenetic analysis of sequences

(Hunicke-Smith et al, 2010)

Application Specific Software

Starting point: Mapped reads

• Whole Genome Sequencing, Exome Sequencing
– Task: Variant Calling: SNPs, InDels
– Software: ssahaSNP, Samtools, PyroBayes

• RNA-Seq: Transcriptome analysis
– Tasks: 1. Transcriptome assembly 2. Abundance quantification 3. Differential expression and regulation
– Software: Tophat, STAR, Cufflinks, edgeR

• ChIP-Seq: Protein DNA binding site
– Task: Peak Identification
– Software: MACS, AREM, PeakSeq

• Methyl-Seq: Methylation pattern analysis
– Task: Methylation calling
– Software: Bismark, BS Seeker

• ……

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

RNA-seq (Tuxedo Protocol)

1. Read mapping (output: SAM/BAM)

2. Transcript assembly and quantification (output: GTF/GFF)

3. Merge assembled transcripts from multiple samples

4. Differential Expression analysis


1. Spliced Alignment: Tophat

Tophat: a spliced short read aligner for RNA-seq.

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq




2. Transcript assembly and abundance quantification: Cufflinks

Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam




3. Final Transcriptome assembly: Cuffmerge

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ more assemblies.txt

./C1_R1_clout/transcripts.gtf
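The assemblies.txt file passed to cuffmerge is a plain list of Cufflinks output GTF paths, one per line. A small Python sketch for generating it is shown below; it assumes each sample's Cufflinks output directory is named `<sample>_clout`, as in the command above, and the four sample names are taken from the C1/C2 replicate naming used in this protocol. The helper name is hypothetical.

```python
import tempfile
from pathlib import Path

def write_assembly_manifest(sample_names, manifest_path):
    """Write one Cufflinks transcripts.gtf path per line for cuffmerge."""
    lines = [f"./{name}_clout/transcripts.gtf" for name in sample_names]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines

# Sample names assumed from the C1/C2 replicate naming in this protocol.
samples = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]

# Written to a temporary directory here so the sketch has no side effects.
with tempfile.TemporaryDirectory() as workdir:
    manifest = Path(workdir) / "assemblies.txt"
    write_assembly_manifest(samples, manifest)
    manifest_text = manifest.read_text()

print(manifest_text, end="")
```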




4. Differential Expression: Cuffdiff

Cuffdiff: a program that compares transcript abundance between samples.

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
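Cuffdiff expects one argument per condition, with the replicate BAM files of that condition joined by commas and the conditions separated by spaces. The Python sketch below assembles such a command line from a condition-to-replicates mapping. It uses only the options that appear in the command above; the helper name and the `<sample>_thout` directory layout are assumptions, and the command is built but not executed.

```python
def build_cuffdiff_command(conditions, merged_gtf, genome_fasta,
                           out_dir="diff_out", threads=8):
    """Build a cuffdiff argument list: one comma-joined BAM group per condition."""
    labels = ",".join(conditions)
    bam_groups = [
        ",".join(f"./{replicate}_thout/accepted_hits.bam" for replicate in replicates)
        for replicates in conditions.values()
    ]
    return [
        "cuffdiff", "-o", out_dir, "-b", genome_fasta, "-p", str(threads),
        "-L", labels, "-u", merged_gtf, *bam_groups,
    ]

# Two conditions with two replicates each (names assumed from this protocol).
conditions = {
    "C1": ["C1_R1", "C1_R2"],
    "C2": ["C2_R1", "C2_R2"],
}
command = build_cuffdiff_command(conditions, "merged_asm/merged.gtf", "genome.fa")

# The list could be handed to subprocess.run(command, check=True) on a
# machine where cuffdiff is installed; here it is only printed.
print(" ".join(command))
```

Keeping the arguments as a list rather than one string avoids shell quoting problems if sample names ever contain spaces.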

Integrative Genomics Viewer (IGV)

http://www.broadinstitute.org/igv

http://www.broadinstitute.org/igv/UserGuide

Nielsen, C.B., et al. Visualizing genomes: techniques and challenges. Nature Methods 7:S5-S15 (2010)

Visualizing RNA-seq mapping with IGV

• Specify range or term in search box

• Click on ruler

• Click and drag

• Use scroll bar

• Use keyboard: Arrow keys, Page up, Page down, Home, End


Summary

• NGS technologies are transforming molecular biology.

• Bioinformatics analysis is a crucial part of NGS applications
– Data formats, terminology, general workflow
– Analysis pipeline
– Software for various NGS applications

• RNA-seq with Tuxedo suite

Thank you!
