Galaxy RNA-Seq Analysis: Tuxedo Protocol

Upload: hong-changbum

Post on 10-May-2015

TRANSCRIPT

Page 1: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Galaxy RNA-Seq Analysis: Tuxedo Protocol

ChangBum Hong, KT Bioinformatics, GenomeCloud SCIC genome-cloud.com

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 New Zealand License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Page 2: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Introduction

• RNA-Seq

• Transcriptome assembly

• Qualitative identification of expressed sequence

• Differential expression analysis

• Quantitative measurement of transcript expression

• RNA-Seq Applications

• Annotation: Identify novel genes, transcripts, exons, splicing events, ncRNAs

• Detecting RNA editing and SNPs

• Measurements: RNA quantification and differential gene expression

Page 3: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Experimental design

• What are my goals?

• Transcriptome assembly?

• Differential expression analysis?

• Identify rare transcripts?

• What are the characteristics of my system?

• Large, complex genome?

• Introns and high degree of alternative splicing?

• No reference genome or transcriptome?

Page 4: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Splicing

Assembly

Expression

Differentially expressed

Experimental Outputs

Page 5: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Sequencing

• Platforms

• Library preparation

• Multiplexing

• Sequence reads

• File names

• FASTQ format (formats vary)

• 4 lines per read

Illumina Read ID
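
The four-lines-per-read layout mentioned above can be sketched as a small reader. This is a minimal illustration, not part of the tutorial: the `FastqRecord` type, the `read_fastq` function, and the example read (including its Illumina-style ID) are made up for demonstration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @id, sequence, +, quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or an empty line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical read; the ID only imitates the Illumina naming style.
example = io.StringIO(
    "@HWI-ST100:1:1101:1000:2000 1:N:0:ATCACG\n"
    "GATTACA\n"
    "+\n"
    "IIIIHHG\n"
)
records = list(read_fastq(example))
print(records[0].read_id, len(records[0].sequence))
```

The reader assumes unwrapped records (exactly four lines each), which is the common case for Illumina output.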

Page 6: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Data Quality Control

• Data Quality Assessment

• Identify poor/bad samples

• Identify contaminants

• Trimming: remove bad bases from read

• Filtering: remove bad reads from library
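
The difference between trimming and filtering can be illustrated with a short sketch. It assumes Phred+33 quality strings; the function names and the thresholds (quality 20, minimum length 5) are illustrative choices, not values from the tutorial, and dedicated trimming tools use more refined algorithms.

```python
from typing import List, Tuple


def phred33_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 20) -> Tuple[str, str]:
    """Trimming: drop trailing bases whose quality is below min_quality."""
    scores = phred33_scores(quality)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence: str, min_length: int = 5) -> bool:
    """Filtering: keep a read only if enough bases survive trimming."""
    return len(sequence) >= min_length


# Hypothetical read whose last two bases have quality 2 ("#").
trimmed_seq, trimmed_qual = trim_3prime("GATTACAGG", "IIIIIII##")
print(trimmed_seq, keep_read(trimmed_seq))
```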

Page 7: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Read Mapping

• The alignment algorithm must be

• fast

• able to handle SNPs, indels, and sequencing errors

• able to allow for introns when aligning to a reference genome

• Input

• fastq read library

• reference genome index

• insert size mean and stddev (for paired-end libraries)

• Output

• SAM (text) / BAM (binary) alignment files
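
In SAM output, a read that spans an intron carries an `N` (skipped region) operation in its CIGAR string. The sketch below shows how such gaps could be located in one alignment line. The SAM line, read name, and coordinates are made up, and the helper names are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '30M200N45M' into (length, op) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def intron_spans(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive reference intervals skipped by 'N' operations."""
    spans = []
    ref_pos = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            spans.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return spans


# Hypothetical spliced alignment: 30 bases, a 200-base gap, then 45 bases.
sam_line = "\t".join([
    "read_0001", "0", "chr2R", "1000", "50", "30M200N45M",
    "*", "0", "0", "A" * 75, "I" * 75,
])
fields = sam_line.split("\t")
position, cigar = int(fields[3]), fields[5]
print(intron_spans(position, cigar))
```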

Page 8: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Differential Expression

• Cuffdiff (Cufflinks package)

• Pairwise comparisons

• Differential gene, transcript, and primary transcript expression

• Easy to use, well documented

• Input: transcriptome, SAM/BAM read alignments

Page 9: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Transcriptome Assembly

• RNA-Seq

• Reference genome

• Reference transcriptome

• RNA-Seq

• Reference genome

• No reference transcriptome

• RNA-Seq

• No reference genome

• No reference transcriptome

Page 10: Galaxy RNA-Seq Analysis: Tuxedo Protocol

RNA

FASTQ

Reads

Experimental Design

Sequencing

Reference Genome

FASTA

Reference Transcriptome

GFF/GTF

Data Quality Control

FASTQ

Tuxedo protocol

Page 11: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Combining tools in a pipeline

• Linux Command-line Tools

• Shell script, Makefile

• GUI Based pipeline

• DNAnexus

• Seven Bridges Genomics

• Galaxy

• Open Source

• Wrapper for command-line utilities

• Workflows

• Save all steps you did in your analysis

• Rerun the entire analysis on a new dataset

• Share your workflow with other people

Page 12: Galaxy RNA-Seq Analysis: Tuxedo Protocol

How to use Galaxy?

Comparison criteria: no wait times, no storage quotas, no job submission limits, no data transfer bottlenecks, no IT experience required, no required infrastructure, cost

Cost of each Galaxy option:

• Galaxy Main: free

• Local Galaxy: free (?)

• Cloud Galaxy (Amazon): about twice the KT price for equivalent specifications

• Slipstream Galaxy: $19,995 (22 million KRW)

• KT GenomeCloud Galaxy: from 740 KRW per hour

GALAXY MAIN: User disk quotas 250GB for registered users, maximum concurrent jobs: 8

Page 13: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Outline of tutorial

• Starting Galaxy

• Mapping with Tophat

• Workflows

• Visualizing alignment with IGV

• Computing differential expression with cuffdiff

• Cuffdiff visualization with CummeRbund

Page 14: Galaxy RNA-Seq Analysis: Tuxedo Protocol
Page 15: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Starting Galaxy

• Tutorial Dataset

• Accessing Galaxy

• Import files for one sample into current history

• Set file attributes

• Run FastQC

Page 16: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Tutorial Dataset

• FASTQ files (fastq): Sequence Reads

• Reference (fasta): Genome Sequence (galaxy default)

• Geneset (GTF / GFF3): Reference Geneset

• Bowtie2 index: Reference index files for Bowtie2 (galaxy default)

Page 17: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Tutorial Dataset Reference & Gene sets

• Ensembl

• http://www.ensembl.org/info/data/ftp/index.html

Page 18: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Tutorial Dataset Reference & Gene sets

• Illumina iGenomes

• The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.

• http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn

Page 19: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Tutorial Dataset Sequencing data

• Sequencing data (Drosophila melanogaster)

• Gene Expression Omnibus at accession GSE32038

• http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32038

Page 20: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Biological replicates vs. technical replicates

Biological Replicates

Technical Replicates

Page 21: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Accessing Galaxy

• Open a web browser and navigate to the Galaxy website, usegalaxy.org or www.genome-cloud.com

• Log in with username and password

GenomeCloud (genome-cloud.com)

select galaxy service

Page 22: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Tools pane

Center pane

History pane

When your Galaxy is ready, you will receive an e-mail

Access Galaxy via the public IP address

You can register via User menu > Register

Page 23: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Import files

• Open a web browser and navigate to the Galaxy website, usegalaxy.org or www.genome-cloud.com

• Log in with username and password

Example FASTQ and GTF files are located in Shared Data > RNA-Seq with Drosophila melanogaster

Import the data into your history panel (ready for analysis)

Page 24: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Set file attributes

• In the history pane, click on the pencil icon

• Enter “fastqsanger” (this will take some time)

Quality score encodings and the matching Galaxy datatypes:

• Sanger: Phred+33, fastqsanger (CASAVA 1.8 and later)

• Illumina 1.3: Phred+64, fastqillumina (earlier than CASAVA 1.8)

• Solexa: Solexa+64, fastqsolexa

Tophat options

• --solexa-quals: Use the Solexa scale for quality values in FASTQ files

• --solexa1.3-quals: Phred64/Illumina 1.3~1.5

BWA options

• -I: The input is in the Illumina 1.3+ read format (quality equals ASCII-64)

GenomeCloud (g-Analysis)
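
When the encoding of a FASTQ file is unknown, the lowest quality character seen in the file gives a useful hint. The following is a rough heuristic sketch, not a Galaxy feature: the function name is hypothetical, the cutoffs follow the ASCII offsets listed above (33 for Sanger, 64 for Illumina 1.3, with Solexa scores reaching down to character 59), and a file containing only high-quality bases cannot be classified reliably this way.

```python
from typing import Iterable


def guess_galaxy_datatype(quality_strings: Iterable[str]) -> str:
    """Guess a Galaxy FASTQ datatype from the lowest quality character."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        return "fastqsanger"    # characters below ';' only occur in Phred+33
    if lowest < 64:
        return "fastqsolexa"    # Solexa+64 can encode negative scores
    return "fastqillumina"      # consistent with Phred+64 (or ambiguous)


# Hypothetical quality strings, one per encoding.
print(guess_galaxy_datatype(["IIIIHHG#"]))
print(guess_galaxy_datatype(["hhhhgg;;"]))
print(guess_galaxy_datatype(["hhhhgg@@"]))
```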

Page 25: Galaxy RNA-Seq Analysis: Tuxedo Protocol

CASAVA 1.8.2 Quality Score (or Q-score)

Error probability

Quality Score Encoding
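
The relationship between quality score and error probability is the Phred definition Q = -10 · log10(P), so Q20 corresponds to 1 error in 100 base calls and Q30 to 1 in 1,000. A minimal sketch of the conversion and of decoding a single quality character follows; the function names are illustrative, and the default offset of 33 assumes Sanger / CASAVA 1.8+ encoding.

```python
import math


def error_probability(quality_score: float) -> float:
    """Convert a Phred quality score Q into an error probability P."""
    return 10 ** (-quality_score / 10)


def quality_score(error_prob: float) -> float:
    """Convert an error probability P into a Phred quality score Q."""
    return -10 * math.log10(error_prob)


def decode_quality_char(char: str, offset: int = 33) -> int:
    """Decode one FASTQ quality character (offset 33 or 64)."""
    return ord(char) - offset


# "I" is ASCII 73, so it decodes to Q40 under Phred+33.
q = decode_quality_char("I")
print(q, error_probability(q))
```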

Page 26: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Run FastQC

• Load the FastQC tool from the tool pane

• Set the input file (repeat this step for all C1 and C2 paired files)

Page 27: Galaxy RNA-Seq Analysis: Tuxedo Protocol

When FastQC has finished running, click on the eye icon of the FastQC output file to display the report

Galaxy job status: waiting, running, done, error

Page 28: Galaxy RNA-Seq Analysis: Tuxedo Protocol

[Figure: FastQC plots of per base sequence quality, per sequence quality score, and per base sequence content, each shown for four datasets: Illumina (in-house data), IonTorrent (in-house data), Illumina (good dataset from the FastQC homepage), and Illumina (bad dataset from the FastQC homepage)]

Page 29: Galaxy RNA-Seq Analysis: Tuxedo Protocol

[Figure: FastQC plots of per base GC content, per sequence GC content, and per base N content, each shown for four datasets: Illumina (in-house data), IonTorrent (in-house data), Illumina (good dataset from the FastQC homepage), and Illumina (bad dataset from the FastQC homepage)]

Page 30: Galaxy RNA-Seq Analysis: Tuxedo Protocol

[Figure: FastQC plots of sequence length distribution and sequence duplication levels, each shown for four datasets: Illumina (in-house data), IonTorrent (in-house data), Illumina (good dataset from the FastQC homepage), and Illumina (bad dataset from the FastQC homepage)]

Page 31: Galaxy RNA-Seq Analysis: Tuxedo Protocol
Page 32: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Mapping with Tophat

• Initial Tophat run

• Determine insert size

• Rerun Tophat with correct insert size

• Review mapping statistics

Page 33: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Initial Tophat run

• Use Full Tophat parameters

• Paired-end FASTQ files, Select reference genome, Use Own Junctions (Yes), Use Gene Annotation Model (Yes)

• Gene Model Annotations (use GFF file)

Page 34: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Determine insert size

• Load the insert size tool “NGS: Picard (beta) -> Insertion size metrics”

Page 35: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Determine insert size

• Click the “eye” icon

• Identify the MIN_INSERT_SIZE (198)
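
To show where such numbers come from, the sketch below summarizes insert sizes from the template length (TLEN, column 9) values of a SAM file. It is a simplified illustration rather than a reimplementation of the Picard tool: the function name is hypothetical, the TLEN values are made up, and only positive values are used so that each read pair is counted once.

```python
import statistics
from typing import Iterable, Tuple


def insert_size_summary(template_lengths: Iterable[int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of positive TLEN values."""
    sizes = [tlen for tlen in template_lengths if tlen > 0]
    if len(sizes) < 2:
        raise ValueError("need at least two properly paired fragments")
    return statistics.mean(sizes), statistics.stdev(sizes)


# Hypothetical TLEN values: each pair contributes +size and -size,
# and unpaired or unmapped reads report 0.
tlen_values = [180, -180, 200, -200, 220, -220, 0]
mean_size, std_dev = insert_size_summary(tlen_values)
print(f"mean={mean_size:.1f} stddev={std_dev:.1f}")
```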

Page 36: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Rerun Tophat

• Click any one of the Tophat2 output files in the history pane

• Click on the circular blue arrow icon

• Change the “Mean Inner Distance between Mate Pairs” (198)
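
For readers working outside Galaxy, a roughly equivalent command line can be assembled as below. This is a sketch under assumptions: the Galaxy wrapper builds its own command, and the index prefix, file names, standard deviation, and thread count here are placeholders. Only the `-r` value (198) comes from the tutorial. The command is built and printed, not executed.

```python
import shlex
from typing import List


def tophat_command(index_prefix: str, reads_1: str, reads_2: str, gtf: str,
                   mate_inner_dist: int, mate_std_dev: int,
                   output_dir: str, threads: int = 1) -> List[str]:
    """Build a paired-end tophat2 command as an argument list."""
    return [
        "tophat2",
        "-p", str(threads),                   # number of threads
        "-r", str(mate_inner_dist),           # mean inner distance between mates
        "--mate-std-dev", str(mate_std_dev),  # std. dev. of that distance
        "-G", gtf,                            # gene model annotations
        "-o", output_dir,                     # output directory
        index_prefix, reads_1, reads_2,
    ]


# Placeholder file names; only the inner distance is taken from the slides.
command = tophat_command("dm3_index", "C1_R1_1.fq", "C1_R1_2.fq", "genes.gtf",
                         mate_inner_dist=198, mate_std_dev=20,
                         output_dir="C1_R1_tophat", threads=4)
print(shlex.join(command))
```

Keeping the command as a list makes it safe to pass to `subprocess.run` later without shell quoting issues.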

Page 37: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Tophat Output

• unmapped.bam (BAM): reads that could not be aligned

• accepted_hits.bam (BAM): a list of read alignments in BAM/SAM format

• junctions.bed (BED): a BED track of junctions reported by Tophat; each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction

• deletions.bed (BED): gives the first genomic base of the deletion

• insertions.bed (BED): gives the last genomic base before the insertion
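
Because each junction is stored as two connected BED blocks, the intron coordinates have to be derived from the block sizes and block starts. A minimal sketch of that calculation follows, assuming the standard 12-column BED layout; the example line, junction name, and coordinates are made up.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    intron_start: int  # 0-based, first intronic base
    intron_end: int    # 0-based, exclusive (first base of the second block)
    strand: str
    depth: int         # BED score column


def parse_junction(bed_line: str) -> Junction:
    """Derive the intron interval from a 12-column junctions.bed line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    score, strand = int(fields[4]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    intron_start = chrom_start + block_sizes[0]  # end of the left overhang
    intron_end = chrom_start + block_starts[1]   # start of the right overhang
    return Junction(chrom, intron_start, intron_end, strand, score)


# Hypothetical junction: 30-base left overhang, 45-base right overhang.
line = "chr2R\t970\t1274\tJUNC00000001\t12\t+\t970\t1274\t255,0,0\t2\t30,45\t0,259"
junction = parse_junction(line)
print(junction, junction.intron_end - junction.intron_start)
```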

Page 38: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Load files into IGV

• Click on the “accepted hits” file in the history pane

• Click on “display with IGV web current”

• A file named “igv.jnlp” will be downloaded by your browser

• Open it with a text editor and copy the BAM file location

Page 39: Galaxy RNA-Seq Analysis: Tuxedo Protocol

http://www.sabiosciences.com/rt_pcr_product/HTML/PADM-000Z.html

IGV with Housekeeping gene

Page 40: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Load files into IGV

• Enter “Act42A” in the search box to view the aligned reads

• Right-click on the coverage track and select “Set Data Range” (max value to 4372)

Housekeeping gene: Act42A

Set max value

Page 41: Galaxy RNA-Seq Analysis: Tuxedo Protocol

IGV with Differential Expression

Page 42: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Keyword: regucalcin (calcium-binding protein)

this gene has four isoforms

Page 43: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Load files into Trackster

• Click on the “accepted hits” file in the history pane

• Click on the graph icon and select “Trackster”

• Select bam files

Page 44: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Create a new group with “Add group” and drag the tracks into the new group

Move to the regucalcin gene

Page 45: Galaxy RNA-Seq Analysis: Tuxedo Protocol

set max value

Page 46: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Run cuffdiff

• Load the Cuffdiff tool: “NGS:RNA Analysis->Cuffdiff”

• Perform replicate analysis (Yes)

• Add new Group / Add new Replicate
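
The groups and replicates entered in the Galaxy form correspond to the way the cuffdiff command line takes its inputs: one comma-separated list of alignment files per condition, with matching labels. The sketch below assembles such a command without running it. The file names, output directory, and number of replicates are placeholders, and the Galaxy wrapper may add further options.

```python
import shlex
from typing import Dict, List


def cuffdiff_command(gtf: str, replicates: Dict[str, List[str]],
                     output_dir: str, threads: int = 1) -> List[str]:
    """Build a cuffdiff command with one BAM list per condition."""
    labels = list(replicates)  # dicts keep insertion order
    command = [
        "cuffdiff",
        "-p", str(threads),       # number of threads
        "-o", output_dir,         # output directory
        "-L", ",".join(labels),   # condition labels, in input order
        gtf,
    ]
    for label in labels:
        command.append(",".join(replicates[label]))
    return command


# Placeholder replicate files for the two conditions, C1 and C2.
groups = {
    "C1": ["C1_R1.bam", "C1_R2.bam"],
    "C2": ["C2_R1.bam", "C2_R2.bam"],
}
command = cuffdiff_command("genes.gtf", groups, "diff_out", threads=4)
print(shlex.join(command))
```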

Page 47: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Cuffdiff output

• Genes: gene differential FPKM

• Isoforms: Transcript differential FPKM

• CDS: Coding sequence differential FPKM
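
FPKM stands for fragments per kilobase of transcript per million mapped fragments. The basic normalization can be sketched as follows. This is the textbook formula only: Cufflinks and Cuffdiff estimate abundances with a statistical model, so their reported values will generally differ from this naive calculation, and the example numbers are made up.

```python
def fpkm(fragments: int, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript
# in a library of 10 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2000,
             total_mapped_fragments=10_000_000)
print(value)
```

The factor of 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.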

Page 48: Galaxy RNA-Seq Analysis: Tuxedo Protocol

View and filter cuffdiff output

• Differential Gene Expression (DGE)

• Keep only genes with a significant change in expression and an absolute log2 fold change greater than 1 by entering “c14 == 'yes' and abs(c10) > 1” in the “With following condition” text box
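
The same filter can be expressed outside Galaxy. The sketch below assumes the usual 14-column layout of the Cuffdiff gene differential expression table, in which column 10 is `log2(fold_change)` and column 14 is `significant`. The three rows are made-up examples, and the function name is hypothetical.

```python
import csv
import io
from typing import Dict, List

HEADER = ("test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\t"
          "value_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\t"
          "q_value\tsignificant\n")

# Made-up rows: only geneA is both significant and changed more than 2-fold.
ROWS = (
    "XLOC_000001\tXLOC_000001\tgeneA\tchr2R:100-900\tC1\tC2\tOK\t"
    "10.0\t40.0\t2.0\t5.1\t0.0001\t0.002\tyes\n"
    "XLOC_000002\tXLOC_000002\tgeneB\tchr2R:2000-2900\tC1\tC2\tOK\t"
    "10.0\t12.0\t0.263\t0.8\t0.4\t0.6\tno\n"
    "XLOC_000003\tXLOC_000003\tgeneC\tchr3L:500-1500\tC1\tC2\tOK\t"
    "10.0\t15.0\t0.585\t3.0\t0.001\t0.01\tyes\n"
)


def significant_genes(table: str, min_abs_log2fc: float = 1.0) -> List[Dict[str, str]]:
    """Keep rows where significant == 'yes' and |log2 fold change| > cutoff."""
    reader = csv.DictReader(io.StringIO(table), delimiter="\t")
    return [
        row for row in reader
        if row["significant"] == "yes"
        and abs(float(row["log2(fold_change)"])) > min_abs_log2fc
    ]


hits = significant_genes(HEADER + ROWS)
print([row["gene"] for row in hits])
```

Fold changes reported as `inf` or `-inf` (when one condition has zero expression) are parsed by `float` and pass the absolute-value test.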

Page 49: Galaxy RNA-Seq Analysis: Tuxedo Protocol
Page 50: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Cuffdiff visualization with CummeRbund

• Load the CummeRbund tool: NGS:RNA Analysis->cummerbund

• Plot type: Density, check the “Replicates” box

Page 51: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Samples have similar density distributions (density plot)

Samples cluster by expression condition (MDS / PCA plot)

Samples cluster by experimental condition (Dendrogram)

Page 52: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Volcano plot

Differential analysis results for regucalcin: the expression plot shows clear differences in the expression of regucalcin across conditions C1 and C2 (four alternative isoforms)

Scatter plots highlight general similarities and specific outliers between conditions C1 and C2

Page 53: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Extract workflow from current history

• Click on the small gear icon and select “Extract Workflow”

Page 54: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Edit workflow

• Click on “Workflow” at the top of the Galaxy window

• Move the elements of the workflow

Page 55: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Run workflow

• Load a workflow by clicking on “Workflow” at the top of the screen

• Click on “Run”

• Select the input datasets

Page 56: Galaxy RNA-Seq Analysis: Tuxedo Protocol
Page 57: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Useful Galaxy sites

• Public main Galaxy site (user disk quota of 250 GB for registered users, maximum concurrent jobs: 8)

• https://usegalaxy.org/

• Test Galaxy site (beta site for the Galaxy main instance)

• https://test.galaxyproject.org/

• Galaxy screencasts and tutorials

• https://wiki.galaxyproject.org/Learn

• NGS analysis using Galaxy (Korean)

• http://hongiiv.tistory.com/701

• SNP analysis using Galaxy (Korean)

• http://hongiiv.tistory.com/652

• Bushman genome analysis using Galaxy (Korean)

• http://hongiiv.tistory.com/655

Page 58: Galaxy RNA-Seq Analysis: Tuxedo Protocol

Thank you

Acknowledgements: YoungGi Kim, HanKyu Choi, WanPyo Hong, KangJung Kim