
Page 1: Next Generation Sequencing Analysis Web Toolkit2016/03/16  · Next Generation Sequencing Analysis Web Toolkit 1 Updated for 2016-03-16 Genomic, transcriptomic sequencing now commonplace

Next Generation Sequencing Analysis Web Toolkit

Updated for 2016-03-16

Next Generation Sequencing

Genomic and transcriptomic sequencing is now commonplace in projects. Now very cheap!

Large NGS projects in departments throughout the University.

Now a routine tool for labs with little previous genomic / transcriptomic experience.

Most common experiment across the University: use RNA-Seq to identify gene expression changes in response to a stimulus or caused by a disease.

Let’s focus on this today, but you can do other things on our systems!

Typical RNA Seq Workflow

Prepare/obtain samples for different conditions (A and B)

Extract RNA and prepare library for sequencing

Run library on Illumina sequencer

Obtain short-read sequences

Typical RNA Seq Workflow – Data Analysis

Check quality and/or filter reads

Align to the genome or transcriptome

Quantify transcript abundance across conditions

Identify significant differences in expression between conditions
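The four steps above can be lined up against the tools this deck names elsewhere (FastQC, TopHat, Cuffquant, Cuffdiff). The sketch below only prints one command per step; it does not run anything. The sample names, index path, annotation path, and output directory names are placeholders, not values from the slides.

```shell
# Dry-run sketch: print (do not run) one command per analysis step.
# Sample names, index path, GTF path and directory names are placeholders.
rnaseq_plan() {
    local sample_a="$1" sample_b="$2" index="$3" gtf="$4"
    local s
    for s in "$sample_a" "$sample_b"; do
        # 1. Check quality of the reads
        echo "fastqc -o qc/ ${s}.fastq"
        # 2. Align to the genome
        echo "tophat -o tophat_${s}/ -p 8 ${index} ${s}.fastq"
        # 3. Quantify transcript abundance
        echo "cuffquant -o quant_${s}/ -p 8 ${gtf} tophat_${s}/accepted_hits.bam"
    done
    # 4. Identify differences in expression between the two conditions
    echo "cuffdiff -o diff/ -p 8 -L ${sample_a},${sample_b} ${gtf} quant_${sample_a}/abundances.cxb quant_${sample_b}/abundances.cxb"
}

rnaseq_plan brain adrenal /path/to/index/hg19 /path/to/genes.gtf
```

Printing the plan first makes it easy to review paths and thread counts before committing cluster time to a real run.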

Big Data, Big Problems

* Data from the NHGRI Genome Sequencing Program (GSP): http://www.genome.gov/sequencingcosts/

With sequencing throughput rapidly outpacing Moore’s Law for compute power, we find ourselves facing a major CPU and storage problem.

e.g. running Cuffquant on a BioHPC compute node (32 cores, 128 GB):

Sample A: 3 MB, 94,428 reads, ~5 mins

Sample B: 4 GB, 157,530,392 reads, ~2 hours 32 mins
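To put the two Cuffquant timings side by side, here is a small worked calculation using only the read counts and run times quoted on the slide. It uses shell integer arithmetic, so results are truncated; the function name is illustrative.

```shell
# Figures copied from the slide; integer arithmetic truncates the results.
reads_a=94428
minutes_a=5
reads_b=157530392
minutes_b=$((2 * 60 + 32))   # 2 hours 32 mins

# reads_per_minute READS MINUTES
reads_per_minute() {
    echo $(( $1 / $2 ))
}

echo "Sample A: $(reads_per_minute "$reads_a" "$minutes_a") reads/min"
echo "Sample B: $(reads_per_minute "$reads_b" "$minutes_b") reads/min"
echo "B has $(( reads_b / reads_a ))x the reads and took $(( minutes_b / minutes_a ))x as long"
```

With these figures, Sample A works out to about 18,885 reads per minute and Sample B to about 1,036,384. Roughly 1,668 times the reads took about 30 times as long, so runtime did not grow in proportion to read count in this pair of runs.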

The BioHPC Solution: Easily Accessible Cloud Storage

Large amounts of high-performance storage, easily accessible

Total storage space: 3.7 PB

[Diagram labels: 50 GB; BioHPC File Exchange; BioHPC Cluster; Compute nodes; Lamella Cloud Storage/Gateway; UTSW private cloud; 1 way access; 2 way access]

https://portal.biohpc.swmed.edu/content/guides/storage-cheat-cheet/

https://portal.biohpc.swmed.edu/content/guides/biohpc-cloud-storage/

The BioHPC Solution: Powerful Computational Resource

112 nodes: 128 GB, 256 GB, 384 GB, GPU

CPU cores: 4700

GPU cores: 19968

Memory: 25 TB

Powerful compute cluster – run multiple tasks each faster than on a workstation

Nucleus Computer Cluster

The BioHPC Solution: Various BioHPC Tools to Help You

4 approaches for NGS analysis on BioHPC

Batch Scripts & Command Line Tools

Various NGS tools available as modules on the cluster, for expert users.

Galaxy

Flexible environment with many tools and a workflow designer, for advanced users.

NGS Web Toolkit

Simple workflows built from modules. Step-by-step with customizable parameters.

Workflow Platform

Run a standard workflow/pipeline from the web, for beginners.

The four approaches run from more flexible (Batch Scripts & Command Line Tools) to easier to use (Workflow Platform).

Option 1 - NGS Analysis with Traditional Linux Command Line Tools

Transfer and manage your sequence data

Retrieving and storing sequencing files (scp, ftp, removable hard disk)

Understanding the file system (file permissions etc.)

Understand and use command line tools

bowtie2-build genome.fa hg19

fastqc -o OutputDirectory/ inputFile.fastq

tophat -o TophatOutput/ -p 8 /programs/indexes/hg19 Experiment1.fastq

* Summarized from: http://crazyhottommy.blogspot.com/2013/06/a-very-good-introduction-of-ngs.html

Option 1 - NGS Analysis with Traditional Linux Command Line Tools

Software/tools: module avail

Genome Database: /project/apps_database/iGenomes

Common NGS tools and Illumina iGenome databases are available on the cluster.

Experts can write their own pipelines using cluster sbatch jobs.
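As a rough illustration of what an sbatch job for one pipeline step might look like, the helper below prints a Slurm batch script that wraps the TopHat command shown earlier in the deck. The partition placeholder, time limit, and module names are assumptions; check `module avail` and the cluster documentation for the real values.

```shell
# Print a Slurm batch script for one sample.
# PARTITION_NAME, the time limit and the module names are assumptions.
make_tophat_job() {
    local sample="$1"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=tophat_${sample}
#SBATCH --partition=PARTITION_NAME
#SBATCH --nodes=1
#SBATCH --time=0-04:00:00
#SBATCH --output=tophat_${sample}.%j.out

module load tophat bowtie2
tophat -o TophatOutput_${sample}/ -p 8 /programs/indexes/hg19 ${sample}.fastq
EOF
}

# Print the script; redirect it to a file and submit that file with sbatch.
make_tophat_job Experiment1
```

Generating the script per sample keeps the job name and log file distinct for each run, which helps when several samples are queued at once.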

Option 2 - BioHPC Galaxy Service

BioHPC Portal -> Cloud Services -> Galaxy (galaxy.biohpc.swmed.edu)

Reproducible workflows, with many available tools, via the web. Widely used by many institutions.

Separate Training Session: Galaxy at BioHPC (08/17/2016)

Option 3 - BioHPC NGS Pipeline

BioHPC Portal -> Cloud Services -> NGS Web Toolkit (ngs.biohpc.swmed.edu)

Provides an easy-to-use web interface to these command line-driven tools and allows users to run multiple sequencing samples simultaneously.


Option 4 - BioHPC Workflow Platform

Under Construction

Separate Training Session: Introduction to the BioHPC Workflow Platform (05/18/2016)

Web-based workflow platform, which allows easy access to run standard workflows on the BioHPC compute cluster, via the web.

NGS Web Toolkit

• Background

o alpha release (April 2015)

o beta release (March 2016)

• Prepare and Upload Data

• Demo: Follow a simple RNA-Seq differential expression analysis

o Training Notes

• Access results

• Future Development

o Application improvements (new features & security enhancement)

o User requests (new parameters, software, modules, etc.)

Background: Modules

[Flowchart of modules: Raw Reads, Check Quality, Trim Reads, Map Reads, Processing Mapped Reads, Assemble Reads, Cuffmerge Transcripts, HtSeq-Count, DESeq2-Norm, DESeq2-Diff, Cuffquant, Cuffdiff, Cuffnorm. Legend: Existing Module, Optional Module, New Module.]

* Users are encouraged to propose new modules and software.

DESeq2 can be used when each sample in a group has matched sample(s) in other group(s). For example, in a control vs. treatment experiment, a subject before and after treatment can be viewed as a pair.

Background: Genome Databases

Available Genome Databases

Human reference genomes: GRCh37, hg19

Mouse reference genomes: GRCm38, mm9

If a reference genome is not available in our system, you may choose to Build Your Own.

Example 1: Brain vs Adrenal

A ‘toy’ example we can show you in real time (hopefully!)

75,000 reads from chr19, extracted from a larger study

2 conditions: brain tissue vs adrenal tissue

What’s the difference in expression for the limited number of transcripts we can see in this data?

What’s unique to the brain tissue?

Courtesy Galaxy Project, Illumina Body Map: https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise


NGS Web Toolkit Demo

See Handouts

Access results

• Download small files (~20 MB) directly from the web

• Create a symbolic link to the project folder

• Access data from the BioHPC cluster/Lamella Gateway (Output Path)

* All results are read-only, unless you choose to delete the whole module from the web
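The slide does not say whether the toolkit creates the symbolic link for you. If you wanted to do the equivalent by hand from a cluster session, a minimal sketch might look like the following; the function name and the example paths are hypothetical.

```shell
# link_results OUTPUT_PATH LINK_NAME
# Create a symbolic link to a results folder so it can be reached
# from a working directory without copying any data.
link_results() {
    local output_path="$1" link_name="$2"
    if [ ! -d "$output_path" ]; then
        echo "no such directory: $output_path" >&2
        return 1
    fi
    ln -s "$output_path" "$link_name"
}

# Example (hypothetical paths):
#   link_results /path/to/toolkit/output/my_project "$HOME/my_project_results"
```

A link leaves the read-only results where they are, so nothing is duplicated on the file system.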

Results

gene_exp.diff – Summary of differentially expressed genes

CELF5 and TUBB4A transcripts are present in brain tissue, not in adrenal tissue

Called as significant – but remember this is a toy example (no replicates etc.)
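To pull the genes called significant out of gene_exp.diff on the command line, one option is a short awk filter. This sketch assumes the standard tab-separated Cuffdiff layout, with the gene name in column 3, the two sample labels in columns 5 and 6, log2 fold change in column 10, and the yes/no significance flag in column 14. The function name and example path are illustrative.

```shell
# significant_genes FILE
# Print gene, sample_1, sample_2 and log2(fold_change) for every row
# whose "significant" column (column 14) is "yes". Skips the header row.
significant_genes() {
    awk -F'\t' 'NR > 1 && $14 == "yes" { print $3 "\t" $5 "\t" $6 "\t" $10 }' "$1"
}

# Example (hypothetical path):
#   significant_genes diff/gene_exp.diff
```

As the slide notes, a significance call on a toy dataset without replicates should not be read as a biological result.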

Results

http://www.proteinatlas.org/ENSG00000161082-CELF5/tissue

Yes – Antibody staining data for CELF5 agrees with our findings here.

CELF5 is an RNA-binding protein expressed in the brain, implicated in the regulation of pre-mRNA alternative splicing.

Acknowledgements

We would like to thank Dr. Zhiyu (Sylvia) Zhao and Dr. Liang Shi from the Children's Research Institute for their development of, and assistance with, this RNA-Seq pipeline.

Future Development and Acknowledgements

Application Improvements

o New features (e.g. linking data for customized data)

o Security enhancement (HIPAA)

* Note: Contact us if you want to upload any identifiable/confidential data

Upon user requests

o Add new software and parameters

o Design new modules

o Develop other web-based applications

* Send Email to: [email protected]