Next Generation Sequencing Analysis Web Toolkit
Updated for 2016-03-16
Next Generation Sequencing
Genomic, transcriptomic sequencing now commonplace in projects. Now very cheap!
Large NGS projects in departments throughout the University.
Now a routine tool for labs with little previous genomic / transcriptomic experience.
Most common experiment across the University: use RNA-Seq to identify gene expression changes in response to a stimulus / caused by a disease.
Let’s focus on this today, but you can do other things on our systems!
Typical RNA Seq Workflow
Prepare / obtain samples for different conditions (A, B)
Extract RNA and prepare library for sequencing
Run library on Illumina sequencer
Obtain short-read sequences
Typical RNA Seq Workflow – Data Analysis
Check quality and/or filter reads
Align to the genome or transcriptome
Quantify transcript abundance across conditions
Identify significant differences in expression between conditions
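Before the quality check, it can help to confirm how many reads a FASTQ file actually holds. A minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per read; `count_reads` is a hypothetical helper name, not part of any BioHPC tool:

```shell
# Count reads in an uncompressed FASTQ file.
# Assumes the usual 4-lines-per-record layout (header, sequence, '+', qualities).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Usage (file name is illustrative):
#   count_reads Experiment1.fastq
```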
Big Data, Big Problems
With sequencing throughput rapidly outpacing Moore’s Law for compute power, we find ourselves facing a major CPU and storage problem.
* Data from the NHGRI Genome Sequencing Program (GSP): http://www.genome.gov/sequencingcosts/
e.g. running Cuffquant on a BioHPC compute node (32 cores, 128 GB):
Sample A: 3 MB, 94,428 reads, ~5 mins
Sample B: 4 GB, 157,530,392 reads, ~2 hours 32 mins
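To compare the two Cuffquant runs above, the quoted read counts and run times can be converted to reads per second. A small sketch using only the figures on this slide; `reads_per_second` is a hypothetical helper:

```shell
# Reads processed per second, given a read count and a run time in seconds.
reads_per_second() {
    awk -v reads="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", reads / secs }'
}

# Sample A: 94,428 reads in ~5 mins (300 s)
reads_per_second 94428 300
# Sample B: 157,530,392 reads in ~2 hours 32 mins (9,120 s)
reads_per_second 157530392 9120
```

Sample B has about 1,670 times as many reads as Sample A and took about 30 times as long. At hours per sample, a project with many samples can quickly outgrow a single workstation.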
The BioHPC Solution: Easily Accessible Cloud Storage
Large amounts of high-performance storage, easily accessible
Total storage space: 3.7 PB
[Diagram labels: 50 GB; BioHPC File Exchange; BioHPC Cluster compute nodes; Lamella Cloud Storage/Gateway; UTSW private cloud; 1-way access; 2-way access]
https://portal.biohpc.swmed.edu/content/guides/storage-cheat-cheet/
https://portal.biohpc.swmed.edu/content/guides/biohpc-cloud-storage/
The BioHPC Solution: Powerful Computational Resource
Nucleus Computer Cluster
112 nodes: 128 GB, 256 GB, 384 GB, GPU
CPU cores: 4,700
GPU cores: 19,968
Memory: 25 TB
Powerful compute cluster: run multiple tasks, each faster than on a workstation
The BioHPC Solution: Various BioHPC Tools to Help You
4 approaches for NGS analysis on BioHPC
Batch Scripts & Command Line Tools
Various NGS tools available as modules on the cluster, for expert users.
Galaxy
Flexible environment with many tools, workflow designer, for advanced users.
NGS Web Toolkit
Simple workflows built from modules. Step-by-step with customizable parameters.
Workflow Platform
Run standard workflow/pipeline from web, for beginners.
The four approaches are listed from more flexible (top) to easier to use (bottom).
Option 1 - NGS Analysis with Traditional Linux Command Line Tools
Transfer and manage your sequence data
Retrieving and storing sequencing files (scp, ftp, removable hard disk)
Understanding the file system (file permissions etc.)
Understand and use command line tools
bowtie2-build genome.fa hg19
fastqc -o OutputDirectory/ inputFile.fastq
tophat -o TophatOutput/ -p 8 /programs/indexes/hg19 Experiment1.fastq
…
* Summarized from: http://crazyhottommy.blogspot.com/2013/06/a-very-good-introduction-of-ngs.html
Option 1 - NGS Analysis with Traditional Linux Command Line Tools
Software/tools: module avail
Genome database: /project/apps_database/iGenomes
Common NGS tools and Illumina iGenome databases are available on the cluster.
Experts can write their own pipelines using cluster sbatch jobs.
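One way to build such a pipeline is to generate one sbatch script per sample and submit each to the scheduler. The sketch below is an illustration, not BioHPC's own pipeline. Everything in it beyond the `fastqc` and `tophat` command lines shown earlier is an assumption: the helper name `write_tophat_job`, the partition name `super`, the module names, the 4-hour time limit, and the `<sample>.fastq` naming. Check `module avail` and your cluster's partitions before using it.

```shell
# Write a SLURM batch script that runs FastQC and TopHat on one sample.
# Hypothetical values, adjust for your site: partition "super",
# module names "fastqc" and "tophat", and the 4-hour time limit.
write_tophat_job() {
    sample="$1"   # sample name, e.g. Experiment1 (expects Experiment1.fastq)
    index="$2"    # Bowtie2 index prefix, e.g. /programs/indexes/hg19
    cat > "${sample}.sbatch" <<EOF
#!/bin/bash
#SBATCH --job-name=${sample}
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --output=${sample}.%j.out

module load fastqc tophat

# FastQC needs its output directory to exist already
mkdir -p ${sample}_fastqc
fastqc -o ${sample}_fastqc/ ${sample}.fastq
tophat -o ${sample}_tophat/ -p 8 ${index} ${sample}.fastq
EOF
}

# Usage (not run here): one script per sample, each submitted as its own job.
#   for s in SampleA SampleB; do
#       write_tophat_job "$s" /programs/indexes/hg19
#       sbatch "$s.sbatch"
#   done
```

Submitting each sample as a separate job lets the scheduler run them on different nodes at the same time.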
Option 2 - BioHPC Galaxy Service
BioHPC Portal -> Cloud Services -> Galaxy (galaxy.biohpc.swmed.edu)
Reproducible workflows, with many available tools, via the web. Widely used by many institutions.
Separate Training Session: Galaxy at BioHPC (08/17/2016)
Option 3 - BioHPC NGS Pipeline
BioHPC Portal -> Cloud Services -> NGS Web Toolkit (ngs.biohpc.swmed.edu)
Provides an easy-to-use web interface to these command line-driven tools and allows users to run multiple sequencing samples simultaneously.
Option 4 - BioHPC Workflow Platform
Under construction
A web-based platform that gives easy access to run standard workflows on the BioHPC compute cluster.
Separate Training Session: Introduction to the BioHPC Workflow Platform (05/18/2016)
NGS Web Toolkit
• Background
o alpha release (April, 2015)
o beta release (March, 2016)
• Prepare and Upload Data
• Demo: Follow a simple RNA-Seq differential expression analysis
o Training Notes
• Access results
• Future Development
o Application Improvements (new features & security enhancement)
o User requests (new parameters, software, modules, etc.)
Background: Modules
[Diagram of the module workflow. Modules shown: Raw Reads, Check Quality (appears at three points), Trim Reads, Map Reads, Processing Mapped Reads, Assemble Reads, Cuffmerge Transcripts, Cuffquant, Cuffdiff, Cuffnorm, HtSeq-Count, DESeq2-Norm, DESeq2-Diff. Legend: Existing Module, Optional Module, New Module.]
* Users are encouraged to propose new modules and software.
DESeq2 can be used when each sample in a group has matched sample(s) in other group(s). For example, in a control vs. treatment experiment, a subject before and after treatment can be viewed as a pair.
Background: Genome Databases
Available Genome Databases
Human reference genomes: GRCh37, hg19
Mouse reference genomes: GRCm38, mm9
If a reference genome is not available in our system, you may choose to Build Your Own.
Example 1 – Brain vs Adrenal
A ‘toy’ example we can show you in real time (hopefully!)
75,000 reads from chr19, extracted from a larger study
2 conditions: brain tissue vs adrenal tissue
What’s the difference in expression for the limited number of transcripts we can see in this data?
What’s unique to the brain tissue?
Courtesy Galaxy Project, Illumina Body Map: https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
NGS Web Toolkit Demo
See Handouts
Access results
• Download small files (~20 MB) directly from the web
• Create a symbolic link to the project folder
• Access data from the BioHPC cluster / Lamella Gateway (Output Path)
* All results are read-only unless you choose to delete the whole module from the web
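The symbolic link can be created from a shell on the cluster. A minimal sketch, assuming the toolkit's Output Path is visible on the file system you are logged in to; `link_results` is a hypothetical helper and both paths in the usage line are placeholders:

```shell
# Create a symbolic link to a results folder so it is easy to reach
# from your own directory. The results stay in place; only a link is made.
link_results() {
    output_path="$1"   # the Output Path reported by the toolkit
    link_name="$2"     # where the link should appear
    if [ ! -d "$output_path" ]; then
        echo "No such directory: $output_path" >&2
        return 1
    fi
    ln -s "$output_path" "$link_name"
}

# Usage (placeholder paths):
#   link_results /path/to/output_path "$HOME/my_rnaseq_results"
```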
Results
gene_exp.diff – Summary of differentially expressed genes
CELF5 & TUBB4A transcripts are present in brain tissue, not in adrenal tissue
Called as significant – but remember this is a toy example (no replicates etc.)
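To pull the significant rows out of gene_exp.diff on the command line, a short awk filter is enough. This sketch assumes the standard tab-separated Cuffdiff layout (gene name in column 3, the two expression values in columns 8 and 9, and the yes/no `significant` flag in column 14); `significant_genes` is a hypothetical helper name:

```shell
# List genes flagged as significant in a Cuffdiff gene_exp.diff file.
# Prints: gene name, expression value in sample 1, expression value in sample 2.
# Skips the header line and keeps rows whose last column is "yes".
significant_genes() {
    awk -F '\t' 'NR > 1 && $14 == "yes" { print $3 "\t" $8 "\t" $9 }' "$1"
}

# Usage:
#   significant_genes gene_exp.diff
```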
Results
http://www.proteinatlas.org/ENSG00000161082-CELF5/tissue
Yes – Antibody staining data for CELF5 agrees with our findings here.
CELF5 is an RNA-binding protein expressed in the brain, implicated in the regulation of pre-mRNA alternative splicing.
Acknowledgements
We would like to thank Dr. Zhiyu (Sylvia) Zhao and Dr. Liang Shi from the Children's Research Institute for their development of, and assistance with, this RNA-Seq pipeline.
Future Development and Acknowledgements
Application Improvements
o New features (e.g. linking data for customized data)
o Security enhancement (HIPAA)
* Note: Contact us if you want to upload any identifiable/confidential data
Upon user request
o Add new software and parameters
o Design new modules
o Develop other web-based applications
* Send Email to: [email protected]