Tackling Analytical Challenges in Cancer Proteogenomics Using the Galaxy Framework
December 11, 2018
Pratik Jagtap, Galaxy-P Team
University of Minnesota
Slides for the Talk: z.umn.edu/mumbaislides
• Introduction to proteogenomics and multi-omic studies
• RNASeq Data Processing: Data Analysis using Galaxy platform
• Proteomics data analysis using Galaxy
• Identification of novel proteoforms and visualization
WORKSHOP STRUCTURE
Raw RNA-seq data
RNASeq data processing. Generation of protein sequence database.
Raw MS/MS proteomics data
Sequence database searching and peptide / protein identification
Results visualization and interpretation
MULTI-OMICS
MULTI-OMICS TECHNOLOGIES
Ruggles et al. Mol Cell Proteomics 2017;16:959-981 © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
• Next-Gen Sequencing
• RNASeq
• Mass Spectrometry
• Proteogenomics
• Proteo-transcriptomics
• Metaproteomics
• Meta-transcriptomics
• Metabolomics
LOOKING BEYOND THE KNOWN PROTEOME
Mass spectrum
Reference Protein Database from genomic annotation
Cancer / Disease related Databases such as COSMIC, IARC p53, OMIM…
Deep genome sequencing data from ICGC, TCGA and CPTAC
RNASeq data
(Customized OR Combined)
6-frame DNA sequences.
3-frame cDNA sequences.
Identification of peptides corresponding to novel proteoforms.
https://doi.org/10.1007/978-1-4939-7717-8_7
Multiomics / trans-omics
GALAXY
Galaxy Instance for proteogenomics workshop: z.umn.edu/galaxypinmumbai
Users will need to register and log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).
Documentation for Galaxy instance usage: z.umn.edu/mumbaidocs
GALAXY
Galaxy Instance for proteogenomics workshop: z.umn.edu/proteogenomicsgateway
Users will need to register and log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).
Documentation for Galaxy instance usage: z.umn.edu/pginnov18
REGISTER
IMPORT HISTORY
INPUT DATA
DATASET FOR MULTI-OMICS ANALYSIS
Heydarian et al J Proteomics Bioinform. (2014) 17:7. pii: 1000302.
• Mouse cell culture.
• RNA-seq analysis
RNA-seq libraries were sequenced on a HiSeq 2000 (Illumina SY-401–1001) to a read depth of ~90,000,000 single-end 97 bp reads per sample.
• iTRAQ labeling and Mass Spectrometry
Reversed-phase liquid chromatography using an Easy-nLC system (Thermo Scientific), analyzed on an LTQ-Orbitrap Elite mass spectrometer (Thermo Scientific).
Select History 1
Import history
Start using this history
Select Workflow 1
Import workflow
Using the workflow
Run Workflow 1
INPUT
WORKFLOW
GALAXY
OUTPUT
GALAXY INTERFACE
Left (Tool) Pane
Main Viewing Pane
History Pane
WORKSHOP WORKFLOWS
Workflow #1: RNA-Seq to Variant FASTA database
Workflow #2: Database Searching Using MS/MS data
Workflow #3: Identifying Novel Variants and Visualization
Genomic coordinate information
OBJECTIVE OF WORKFLOW 1
Create custom variant database
FASTA Sequences
Genome Mapping Information
INPUT DATA
• RNA-Seq FASTQ file : Reads in FASTQ format
• GTF file: Gene Transfer Format, a tabular file that describes genes and related features
• Known protein and contaminant protein sequence FASTA file
• Mass-spectrometry (MGF) file
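To make two of the input formats above concrete, here is a minimal Python sketch that reads FASTQ records and parses one GTF line. The function names and the example records are made up for illustration; they are not workshop data, and real files (for example FASTQ with multi-line sequences) may need a dedicated parser.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a FASTQ stream.

    Assumes the common layout of exactly four lines per record.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def parse_gtf_line(line):
    """Split one tab-separated GTF line into its main columns and attributes."""
    cols = line.rstrip("\n").split("\t")
    attributes = {}
    for field in cols[8].split(";"):
        field = field.strip()
        if field:
            key, _, value = field.partition(" ")
            attributes[key] = value.strip('"')
    return {
        "seqname": cols[0],
        "feature": cols[2],
        "start": int(cols[3]),  # GTF coordinates are 1-based, inclusive
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": attributes,
    }


# Made-up example records, for illustration only.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n"
EXAMPLE_GTF = 'chr5\texample\texon\t100\t250\t.\t-\t.\tgene_id "G1"; transcript_id "T1";'

if __name__ == "__main__":
    for read in read_fastq(StringIO(EXAMPLE_FASTQ)):
        print(read)
    print(parse_gtf_line(EXAMPLE_GTF))
```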
INPUT DATA
Select ‘MousePG_Input_History’
Import history
Start using this history
Select ‘MousePG_Workflow1_RNAseq_Dbcreation’
Import workflow
Using the workflow
Run Workflow 1
INPUT
WORKFLOW
GALAXY
OUTPUT
IMPORT WORKFLOW
RUNNING A WORKFLOW
SELECTING INPUT FILES TO RUN A WORKFLOW
JOB STATUS (HISTORY PANE)
• Job in queue
• Job running
• Job successful
• Job failed
WORKFLOW #1: RNA-SEQ TO VARIANT PROTEIN
SAV / In-Del Variants
Assembly Workflow
POTENTIAL NOVEL PEPTIDE IDENTIFICATIONS
[Figure: gene model drawn 5’ to 3’ (Exon 1 to Exon 8, with UTR and CDS regions and Start / Stop marked) comparing known peptides with the classes of potential novel peptides: expressed 5’ UTR, alternate start, alternate frame, novel exon, novel spliceform, exon extension, expressed 3’ UTR / alternate stop, intergenic / novel gene, single amino acid variant, and InDels.]
RNA-SEQ TO FASTA DATABASE CREATION
Inputs: RNA-Seq FASTQ, Genome, GTF
HISAT: Alignment tool
SAV / In-Del Variants
• FREEBAYES: Variant Calling
• CustomPro DB: Variant annotation, Genome mapping
Assembly Workflow
• STRINGTIE: RNA-Seq to transcripts
• GFF COMPARE: Evaluates the assembly with annotated transcripts
• Translate transcripts
Outputs: Sequence FASTA, Mapping Files
ALIGNMENT
RNASeq reads
Mapping to gene/genome
Reference gene/genome
HISAT2: Outputs BAM file (Dataset #9)
Kim D., Langmead B. and Salzberg S.L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods (2015)
VARIANT CALLING
RNASeq reads
Mapping to gene/genome
Reference gene/genome
FreeBayes: Outputs VCF file (Dataset #14)
Garrison E., Marth G. Haplotype-based variant detection from short-read sequencing. (arXiv:1207.3907)
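As a rough illustration of what the records in a VCF file look like, the sketch below reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and labels each alternate allele by comparing allele lengths. The function name and the example lines are made up, and the length-based labels are a simplification: variant callers can also report complex alleles that need proper normalization.

```python
def classify_vcf_record(line):
    """Return (chrom, pos, ref, alts, kinds) for one VCF data line.

    Alleles are labelled with a simple length heuristic (an assumption made
    for this sketch, not how any particular tool classifies variants).
    """
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    alts = alt.split(",")
    kinds = []
    for allele in alts:
        if len(ref) == 1 and len(allele) == 1:
            kinds.append("SNV")
        elif len(ref) == len(allele):
            kinds.append("MNV")
        elif len(allele) > len(ref):
            kinds.append("insertion")
        else:
            kinds.append("deletion")
    return chrom, int(pos), ref, alts, kinds


# Made-up example lines, for illustration only.
EXAMPLE_LINES = [
    "chr5\t1000\t.\tA\tG\t50\t.\t.",
    "chr17\t2000\t.\tAT\tA\t40\t.\t.",
    "chr17\t3000\t.\tC\tCGG,T\t30\t.\t.",
]

if __name__ == "__main__":
    for example in EXAMPLE_LINES:
        print(classify_vcf_record(example))
```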
VIEWING SNP VARIANT IN IGV
CustomProDB
Wang X., Zhang B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics (2013)
Reference gene/genome
Original sequence: Translate to Original Protein (FASTA Sequence)
Variant sequence: Translate to Variant Protein (Variant FASTA Sequence)
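The idea of translating both the original and the variant sequence can be sketched in a few lines of Python. This is not the customProDB implementation (customProDB is an R package); it is a simplified illustration that assumes a made-up coding sequence, a single-nucleotide substitution at a 0-based position, and the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_snv(cds, index, ref, alt):
    """Return the sequence with one base substituted at a 0-based index."""
    if cds[index] != ref:
        raise ValueError("reference base does not match the sequence")
    return cds[:index] + alt + cds[index + 1:]


# Made-up coding sequence, for illustration only.
EXAMPLE_CDS = "ATGGCTGAAAAGTAA"

if __name__ == "__main__":
    print(translate(EXAMPLE_CDS))                          # original protein
    print(translate(apply_snv(EXAMPLE_CDS, 7, "A", "T")))  # variant protein
```

In this example the substitution changes the codon GAA to GTA, so the variant protein differs from the original by a single amino acid.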
ALIGNMENT
Mapping to gene/genome
Reference gene/genome
TRANSCRIPT ASSEMBLY
Mapping to gene/genome
Reference gene/genome
Assembled Transcript
Splicing
3-Frame Translation
FASTA Sequence
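A three-frame translation of an assembled transcript can be sketched as follows. This is an illustrative Python snippet, not the tool used in the workshop workflow; the transcript is made up, the standard genetic code is assumed, and stop codons are kept as `*` so that downstream steps could split the frames into candidate peptides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def three_frame_translation(transcript):
    """Translate a transcript in forward frames 1, 2 and 3.

    Incomplete trailing codons are dropped; unknown codons become 'X'.
    """
    frames = {}
    for offset in range(3):
        codons = (
            transcript[i:i + 3] for i in range(offset, len(transcript) - 2, 3)
        )
        frames[offset + 1] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


# Made-up transcript, for illustration only.
EXAMPLE_TRANSCRIPT = "ATGGCTGAAAAGTAA"

if __name__ == "__main__":
    for frame, protein in three_frame_translation(EXAMPLE_TRANSCRIPT).items():
        print(frame, protein)
```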
OUTPUTS
FASTA Sequence File
>generic|ENSMUSP00000107433|Erp29|ER protein 29
MAAAAGVSGAASLSPLLSVLLGLLLLFAPHGGSGLHTKGALPLDTVTFYKSRLLLGP
>generic|ENSMUSP00000120715|Rps2|ribosomal protein S2
MADDAGAAGGPGGPGGPGLGGRGGFRGGFGSGLRGRGRGRGRGRGRGRGARGGKAEDKEWIPVTKLGRLVKDMKIKSLEEIY
LFSLPIKESEIIDFFLGASLKDEVLKIMPVQKQTRAGQR
Genomic Mapping File
ENSMUSP00000107433 chr5 121452190 121452340 - 0 150
ENSMUSP00000107433 chr5 121449139 121449163 - 150 174
ENSMUSP00000120715 chr17 24720275 24720452 + 0 177
ENSMUSP00000120715 chr17 24720533 24720731 + 177 375
ENSMUSP00000120715 chr17 24720968 24721302 + 375 709
ENSMUSP00000120715 chr17 24721622 24721727 + 709 814
ENSMUSP00000120715 chr17 24721802 24721897 + 814 909
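To show how a genomic mapping file of this kind could be used, the sketch below converts a residue position in a protein to a genomic coordinate. It rests on an assumed reading of the columns (protein ID, chromosome, genomic start, genomic end, strand, coding start offset, coding end offset, with 0-based half-open intervals and the strand written as ASCII + or -); that reading agrees with the interval lengths in the rows shown, but it is an assumption, and the function names are hypothetical.

```python
# Four of the mapping rows from the slide, with the strand as ASCII + or -.
MAPPING_ROWS = """\
ENSMUSP00000107433 chr5 121452190 121452340 - 0 150
ENSMUSP00000107433 chr5 121449139 121449163 - 150 174
ENSMUSP00000120715 chr17 24720275 24720452 + 0 177
ENSMUSP00000120715 chr17 24720533 24720731 + 177 375
"""


def parse_mapping(text):
    """Parse whitespace-separated mapping rows into tuples with integer coordinates."""
    rows = []
    for line in text.splitlines():
        name, chrom, start, end, strand, cds_start, cds_end = line.split()
        rows.append(
            (name, chrom, int(start), int(end), strand, int(cds_start), int(cds_end))
        )
    return rows


def residue_to_genome(rows, protein, aa_index):
    """Map a 0-based residue index to the 0-based genomic position of the
    first base of its codon. Returns (chrom, position, strand) or None."""
    offset = aa_index * 3
    for name, chrom, start, end, strand, cds_start, cds_end in rows:
        if name == protein and cds_start <= offset < cds_end:
            within = offset - cds_start
            # On the minus strand the coding sequence runs from high to low coordinates.
            position = start + within if strand == "+" else end - 1 - within
            return chrom, position, strand
    return None


if __name__ == "__main__":
    rows = parse_mapping(MAPPING_ROWS)
    print(residue_to_genome(rows, "ENSMUSP00000120715", 59))
    print(residue_to_genome(rows, "ENSMUSP00000107433", 0))
```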
SNAPSHOT OF WHAT HISTORY LOOKS LIKE AT THIS STAGE
PROTEOMICS DATA ANALYSIS USING GALAXY
Protein FASTA: reference proteins + potential variants
Peaklist of MS/MS data
Multiple algorithms for matching MS/MS to peptides
Organization and scoring of peptide spectral matches (PSMs)
Generation of an SQLite database for downstream data visualization and filtering
Putative variant peptide sequences for further verification and analysis
Proteomics. 11:996-9; Nat Biotechnol. 33:22-4
Mass Spectrometry and Proteomics
Vaudel et al. Nature Biotechnol. 2015, 33:22–24.
Vaudel et al. J Proteome Res. 2018, doi: 10.1021/acs.jproteome.8b00175.
• Bundles multiple freely available algorithms for matching MS/MS to peptide sequences
• Infers proteins from peptide sequence matches
• Assigns confidence scores to peptide sequence matches and inferred proteins
• Provides outputs in standard formats (e.g. mzIdentML) for further processing
YOUR CURRENT HISTORY
To access the input for this part of the workshop, click on “Shared Data” → “Histories” → “MousePG_History2”, then click on Import History.
IF NOT…
Select ‘MousePG_Workflow3_Novel_peptide_analysis’
Import workflow
Start using this workflow
Run Workflow
[Screenshots: ACTIVE HISTORY, FROM EARLIER, WORKFLOW, WORKFLOW FOR THIS SECTION; labels shown: Workflow 1, Workflow 2]
Workshop Documentation: z.umn.edu/galaxypinmumbai
5.2 BlastP analysis (page 32)
5.3 Novel proteoform analysis (page 33)
5.4 Using Multi-omics Visualization Platform for visualizing novel proteoforms (page 35)
SELECT DISTINCT PSM.*
FROM PSM JOIN BLAST ON PSM.SEQUENCE = BLAST.QSEQID
WHERE BLAST.PIDENT < 100 OR BLAST.GAPOPEN >= 1 OR BLAST.LENGTH < BLAST.QLEN
ORDER BY PSM.SEQUENCE, PSM.ID
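The query above can be tried outside Galaxy with Python's built-in sqlite3 module. The sketch below is only an illustration: the two table layouts are reduced to the columns the query uses (the real PSM and BLAST tables in the workshop history have more columns), and the rows are made up.

```python
import sqlite3

QUERY = """
SELECT DISTINCT PSM.*
FROM PSM JOIN BLAST ON PSM.SEQUENCE = BLAST.QSEQID
WHERE BLAST.PIDENT < 100 OR BLAST.GAPOPEN >= 1 OR BLAST.LENGTH < BLAST.QLEN
ORDER BY PSM.SEQUENCE, PSM.ID
"""


def find_candidate_novel_psms(psm_rows, blast_rows):
    """Load toy PSM and BLAST tables into an in-memory database and run the query."""
    con = sqlite3.connect(":memory:")
    # Simplified schemas: only the columns referenced by the query (an assumption).
    con.execute("CREATE TABLE PSM (ID INTEGER, SEQUENCE TEXT)")
    con.execute(
        "CREATE TABLE BLAST (QSEQID TEXT, PIDENT REAL, GAPOPEN INTEGER, "
        "LENGTH INTEGER, QLEN INTEGER)"
    )
    con.executemany("INSERT INTO PSM VALUES (?, ?)", psm_rows)
    con.executemany("INSERT INTO BLAST VALUES (?, ?, ?, ?, ?)", blast_rows)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


# Made-up rows, for illustration only.
EXAMPLE_PSMS = [(1, "PEPTIDEK"), (2, "NOVELPEPK"), (3, "NOVELPEPK")]
EXAMPLE_BLAST = [
    ("PEPTIDEK", 100.0, 0, 8, 8),   # perfect, full-length hit: filtered out
    ("NOVELPEPK", 88.9, 0, 9, 9),   # imperfect identity: kept
]

if __name__ == "__main__":
    for row in find_candidate_novel_psms(EXAMPLE_PSMS, EXAMPLE_BLAST):
        print(row)
```

The query keeps a PSM when any of its BLAST hit rows is imperfect (identity below 100%, at least one gap opening, or an alignment shorter than the query), so a peptide that has both a perfect and an imperfect hit row would still be returned.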
BLASTP ANALYSIS
MULTI-OMICS VISUALIZATION PLATFORM FOR VISUALIZING NOVEL PROTEOFORMS
SPECTRAL QUALITY VISUALIZATION (Lorikeet Viewer)
GENOMIC LOCALIZATION (Integrative Genomics Viewer)
ESSREALVEPTSESPRPALAR
NOVEL PROTEOFORM ANALYSIS
UCSC GENOME BROWSER
CDART BLAST SEARCH
PROJECT OVERVIEW
GO AND TRY IT OUT!
GALAXY INSTANCE ONE
· Galaxy Instance for proteogenomics workshop: z.umn.edu/galaxypinmumbai
Users will need to register and log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).
· Documentation for Galaxy instance usage: z.umn.edu/mumbaidocs
GALAXY INSTANCE TWO (backup if GALAXY INSTANCE ONE gets busy)
· Proteogenomics Gateway: z.umn.edu/proteogenomicsgateway
Users will need to register and log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).
· Documentation for Galaxy instance usage: z.umn.edu/pginnov18
WORKSHOP INSTRUCTORS AND ACKNOWLEDGEMENTS
• Instructors
• Pratik Jagtap
• Support
• Praveen Kumar
• Prof. Timothy Griffin, Galaxy-P team (University of Minnesota)
• Subina Mehta
• James Johnson and Thomas McGowan (University of Minnesota)
• Matthew Chambers
• Jetstream Cloud at Indiana University
• Funding
ACKNOWLEDGMENTS
Minnesota Supercomputing Institute: James Johnson, Thomas McGowan, Lee Parsons, Michael Milligan
Ira Cooke, Melbourne, Australia
University of Minnesota: Timothy Griffin, Pratik Jagtap, Praveen Kumar, Candace Guerrero, Subina Mehta, Adrian Hegeman (Co-I), Art Eschenlauer, Shane Hubler, Ray Sajulga, Caleb Easterly, Andrew Rajczewski
Biologists / collaborators: Laurie Parker, Joel Rudney, Maneesh Bhargava, Amy Skubitz, Chris Wendt, Brian Crooker, Steven Friedenberg, Kevin Viken, Kristin Boylan, Marnie Peterson, Somiah Afiuni, Brian Sandri, Alexa Pragman, Wanda Weber, Amy Treeful
Harald Barsnes, Marc Vaudel, University of Bergen, Norway
Bjoern Gruening, Bérénice Batut, University of Freiburg, Freiburg, Germany
Lennart Martens (Co-I), Bart Mesuere, Robbert G Singh, VIB, UGhent, Belgium
Judson Hervey, Naval Research Institute, Washington, D.C.
Matt Chambers, Nashville, TN
Alessandro Tanca, Porto Conte Ricerche, Italy
Carolin Kolmeder, University of Helsinki, Finland
Thilo Muth, Bernhard Renard, Robert Koch Institut
Thomas Doak, Jeremy Fisher, Indiana University
Josh Elias, Stanford University
Brook Nunn, U of Washington
Lloyd Smith (Co-I), Michael Shortreed, UW-Madison
Karen Reddy, Mo Heydarian, Johns Hopkins University
Anamika Krishanpal, Priyabrata Panigrahi, Persistent Systems Limited
Stephan Kang, Intero Life Sciences
Funding
galaxyp.org
QUESTIONS?
Follow us on twitter.com/usegalaxyp
Workshop Documentation: z.umn.edu/galaxypinmumbai
Slides for the Talk: z.umn.edu/mumbaislides
Visit: http://galaxyp.org
Feedback: https://z.umn.edu/fbindia