
Tackling Analytical Challenges in Cancer Proteogenomics Using the Galaxy Framework

December 11, 2018

Pratik Jagtap, Galaxy-P Team

University of Minnesota

Slides for the Talk: z.umn.edu/mumbaislides


• Introduction to proteogenomics and multi-omic studies

• RNASeq Data Processing: Data Analysis using Galaxy platform

• Proteomics data analysis using Galaxy

• Identification of novel proteoforms and visualization

RNASeq data processing. Generation of protein sequence database.

Sequence database searching and peptide / protein identification

Results visualization and interpretation

Raw RNA-seq data

Raw MS/MS proteomics data

WORKSHOP STRUCTURE

MULTI-OMICS

MULTI-OMICS TECHNOLOGIES

Ruggles et al. Mol Cell Proteomics 2017;16:959-981 © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.

• Next-Gen Sequencing

• RNASeq

• Mass Spectrometry

• Proteogenomics

• Proteo-transcriptomics

• Metaproteomics

• Meta-transcriptomics

• Metabolomics

LOOKING BEYOND THE KNOWN PROTEOME

Mass spectrum searched against:

• Reference protein database from genomic annotation

• Cancer / disease-related databases such as COSMIC, IARC p53, OMIM…

• Deep genome sequencing data from ICGC, TCGA and CPTAC

• RNASeq data (customized OR combined)

• 6-frame DNA sequences

• 3-frame cDNA sequences

Identification of peptides corresponding to novel proteoforms.

https://doi.org/10.1007/978-1-4939-7717-8_7

Multiomics / trans-omics


GALAXY

Galaxy instance for the proteogenomics workshop: http://z.umn.edu/galaxypinmumbai

Users need to register and then log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).

Documentation for Galaxy instance usage: https://z.umn.edu/mumbaidocs

GALAXY

Galaxy instance for the proteogenomics workshop: z.umn.edu/proteogenomicsgateway

Documentation for Galaxy instance usage: z.umn.edu/pginnov18

REGISTER

IMPORT HISTORY

INPUT DATA

DATASET FOR MULTI-OMICS ANALYSIS

Heydarian et al J Proteomics Bioinform. (2014) 17:7. pii: 1000302.

• Mouse cell culture.

• RNA-seq analysis: RNA-seq libraries were sequenced on a HiSeq 2000 (Illumina SY-401–1001) to a read depth of ~90,000,000 single-end 97 bp reads per sample.

• iTRAQ labeling and mass spectrometry: Reversed-phase liquid chromatography using an Easy-nLC system (Thermo Scientific), with analysis on an LTQ-Orbitrap Elite mass spectrometer (Thermo Scientific).

https://www.ncbi.nlm.nih.gov/pubmed/25544807

Select History 1

Import history

Start using this history

Select Workflow 1

Import workflow

Using the workflow

Run Workflow 1

INPUT

WORKFLOW

GALAXY

OUTPUT

GALAXY INTERFACE

Left (Tool) Pane

Main Viewing Pane

History Pane

WORKSHOP WORKFLOWS

Workflow #1: RNA-Seq to Variant FASTA database

Workflow #2: Database Searching Using MS/MS data

Workflow #3: Identifying Novel Variants And Visualization

Genomic coordinate information

OBJECTIVE OF WORKFLOW 1

Create custom variant database


FASTA Sequences, Genome Mapping Information


INPUT DATA

• RNA-Seq FASTQ file : Reads in FASTQ format

• GTF file: Gene Transfer Format, a tabular file to describe genes and related features

• Known protein and contaminant protein sequence FASTA file

• Mass-spectrometry (MGF) file
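As an illustration of the tabular GTF input listed above, here is a minimal Python sketch that splits one GTF line into its nine tab-separated columns and its attribute pairs. The example line, gene and transcript names are hypothetical, and the parser assumes simple `key "value";` attributes without embedded semicolons; it is not the parser used by any Galaxy tool.

```python
def parse_gtf_line(line: str) -> dict:
    """Parse one GTF feature line into a dictionary (simplified sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF feature line has 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    attributes = {}
    # Attributes look like: gene_id "X"; transcript_id "Y";
    for item in attrs.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based, inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Hypothetical example line (names and coordinates are illustrative only).
EXAMPLE_GTF_LINE = (
    'chr17\texample\texon\t100\t250\t.\t+\t.\t'
    'gene_id "GENE1"; transcript_id "TX1";'
)
record = parse_gtf_line(EXAMPLE_GTF_LINE)
print(record["feature"], record["attributes"]["gene_id"])
```

The same nine-column layout is what downstream assembly and comparison tools read when they relate assembled transcripts to annotated genes.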

INPUT DATA

Select ‘MousePG_Input_History’

Import history

Start using this history

Select ‘MousePG_Workflow1_RNAseq_Dbcreation’

Import workflow

Using the workflow

Run Workflow 1

INPUT

WORKFLOW

GALAXY

OUTPUT

IMPORT WORKFLOW

RUNNING A WORKFLOW

SELECTING INPUT FILES TO RUN A WORKFLOW

JOB STATUS (HISTORY PANE)

Job in queue, Job running, Job successful, Job failed


WORKFLOW #1: RNA-SEQ TO VARIANT PROTEIN

SAV / In-Del Variants

Assembly Workflow

POTENTIAL NOVEL PEPTIDE IDENTIFICATIONS

[Figure: gene model drawn 5’ to 3’ (Exons 1 to 8, with UTR and CDS regions, Start and Stop, reading frames +1, +2, +3) showing known peptides alongside the classes of potential novel peptides: expressed 5’ UTR, alternate start, alternate frame, novel exon, novel spliceform, exon extension, expressed 3’ UTR / alternate stop, intergenic / novel gene, single amino acid variant, and InDels.]

RNA-SEQ TO FASTA DATABASE CREATION

Inputs: RNA-Seq FASTQ, Genome, GTF

• HISAT: Alignment tool

• FREEBAYES: Variant calling

• CustomProDB: Variant annotation, genome mapping

• STRINGTIE: RNA-Seq to transcripts

• GFF COMPARE: Evaluates the assembly with annotated transcripts

• Translate transcripts

Outputs: Sequence FASTA, Mapping Files




ALIGNMENT

Mapping to gene/genome

Reference gene/genome

HISAT2: Outputs BAM file (Dataset #9)

Kim D., Langmead B. and Salzberg S.L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods (2015)

RNASeq reads

VARIANT CALLING



FreeBayes: Outputs VCF file (Dataset #14)

Garrison E., Marth G. Haplotype-based variant detection from short-read sequencing. (arXiv:1207.3907)

RNASeq reads

VIEWING SNP VARIANT IN IGV




CustomProDB

Wang X., Zhang B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics (2013)


Original Protein → Translate → FASTA Sequence

Variant Protein → Translate → Variant FASTA Sequence
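To make the original-versus-variant idea concrete, the following Python sketch applies a single amino acid variant to a reference protein and writes both entries in FASTA form. It is only an illustration, not how CustomProDB is implemented: the variant (position 2, A to T), the truncated example sequence and the header naming are all hypothetical.

```python
def apply_sav(protein: str, position: int, ref: str, alt: str) -> str:
    """Return the protein with the residue at a 1-based position changed."""
    if not 1 <= position <= len(protein):
        raise ValueError("position is outside the protein sequence")
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return protein[:position - 1] + alt + protein[position:]


def fasta_entry(header: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record, wrapping the sequence at a fixed width."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines)


# Hypothetical reference fragment and variant, for illustration only.
reference = "MAAAAGVSGAASLSPLL"
variant = apply_sav(reference, position=2, ref="A", alt="T")

print(fasta_entry("example_protein", reference))
print(fasta_entry("example_protein_A2T", variant))
```

Keeping the reference and variant entries side by side in one database lets the search engine match a spectrum to either form.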



Assembly Workflow

ALIGNMENT



TRANSCRIPT ASSEMBLY



Assembled Transcript

Splicing

3-Frames Translation FASTA Sequence
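The 3-frame translation step can be sketched in Python as below. This is a minimal illustration that assumes the standard genetic code and a transcript already in its sense orientation; it is not the Galaxy tool's implementation, and it does not split the output at stop codons or filter by length as a production tool might.

```python
# Standard genetic code, with codons ordered by base T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate a nucleotide sequence codon by codon ('*' marks a stop)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Codons with ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq: str) -> list:
    """Translate the sequence in forward frames +1, +2 and +3."""
    return [translate(seq[frame:]) for frame in range(3)]


frames = three_frame_translation("ATGGCCTAA")
print(frames)
```

Each frame would then be written as its own FASTA entry so that peptides from any of the three reading frames can be matched.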



OUTPUTS

FASTA Sequence File

>generic|ENSMUSP00000107433|Erp29|ER protein 29
MAAAAGVSGAASLSPLLSVLLGLLLLFAPHGGSGLHTKGALPLDTVTFYKSRLLLGP
>generic|ENSMUSP00000120715|Rps2|ribosomal protein S2
MADDAGAAGGPGGPGGPGLGGRGGFRGGFGSGLRGRGRGRGRGRGRGRGARGGKAEDKEWIPVTKLGRLVKDMKIKSLEEIY
LFSLPIKESEIIDFFLGASLKDEVLKIMPVQKQTRAGQR

Genomic Mapping File

ENSMUSP00000107433 chr5 121452190 121452340 - 0 150
ENSMUSP00000107433 chr5 121449139 121449163 - 150 174
ENSMUSP00000120715 chr17 24720275 24720452 + 0 177
ENSMUSP00000120715 chr17 24720533 24720731 + 177 375
ENSMUSP00000120715 chr17 24720968 24721302 + 375 709
ENSMUSP00000120715 chr17 24721622 24721727 + 709 814
ENSMUSP00000120715 chr17 24721802 24721897 + 814 909
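The genomic mapping rows above can be used to place a coding-sequence position on the genome. The Python sketch below is illustrative and rests on assumptions not stated in the slides: the columns are read as protein ID, chromosome, genomic start, genomic end, strand, CDS start and CDS end, and both coordinate pairs are treated as 0-based, half-open intervals (which is consistent with each segment's genomic length equalling its CDS length in these rows).

```python
from typing import NamedTuple


class Segment(NamedTuple):
    protein_id: str
    chrom: str
    start: int
    end: int
    strand: str
    cds_start: int
    cds_end: int


MAPPING_TEXT = """\
ENSMUSP00000107433 chr5 121452190 121452340 - 0 150
ENSMUSP00000107433 chr5 121449139 121449163 - 150 174
ENSMUSP00000120715 chr17 24720275 24720452 + 0 177
ENSMUSP00000120715 chr17 24720533 24720731 + 177 375
"""


def parse_mapping(text: str) -> list:
    """Parse whitespace-separated mapping rows into Segment records."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        pid, chrom, start, end, strand, cds_start, cds_end = fields
        # Accept an en dash as a minus strand (common in copied slides).
        strand = "-" if strand in ("-", "\u2013") else "+"
        segments.append(Segment(pid, chrom, int(start), int(end), strand,
                                int(cds_start), int(cds_end)))
    return segments


def cds_to_genome(segments: list, protein_id: str, offset: int) -> tuple:
    """Map a 0-based CDS nucleotide offset to (chromosome, 0-based position)."""
    for seg in segments:
        if seg.protein_id == protein_id and seg.cds_start <= offset < seg.cds_end:
            delta = offset - seg.cds_start
            if seg.strand == "+":
                return seg.chrom, seg.start + delta
            return seg.chrom, seg.end - 1 - delta  # minus strand runs backwards
    raise ValueError("offset is not covered by any segment")


segments = parse_mapping(MAPPING_TEXT)
# Amino acid index 59 (0-based) starts at CDS nucleotide offset 59 * 3 = 177.
print(cds_to_genome(segments, "ENSMUSP00000120715", 59 * 3))
```

Under these assumptions, a peptide's first and last residues can be mapped separately, which also reveals when a peptide spans an exon junction.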


SNAPSHOT OF WHAT HISTORY LOOKS LIKE AT THIS STAGE

PROTEOMICS DATA ANALYSIS USING GALAXY

Protein FASTA: reference proteins + potential variants

Peaklist of MS/MS data

Multiple algorithms for matching MS/MS to peptides

Organization and scoring of peptide spectral matches (PSMs)

Generation of an SQLite database for downstream data visualization and filtering

Putative variant peptide sequences for further verification and analysis

Proteomics. 11:996-9; Nat Biotechnol. 33:22-4

Mass Spectrometry and Proteomics

Vaudel et al. Nature Biotechnol. 2015, 33:22–24.

Vaudel et al. J Proteome Res. 2018, doi: 10.1021/acs.jproteome.8b00175.

• Bundles multiple freely available algorithms for matching MS/MS to peptide sequences

• Infers proteins from peptide sequence matches

• Assigns confidence scores to peptide sequence matches and inferred proteins

• Provides outputs in standard formats (e.g. mzIdentML) for further processing


YOUR CURRENT HISTORY

To access the input for this part of the workshop, click “Shared Data” → “Histories” → “MousePG_History2”, then click Import History.

IF NOT…

Select ‘MousePG_Workflow3_Novel_peptide_analysis’

Import workflow

Start using this workflow

Run Workflow

ACTIVE HISTORY

FROM EARLIER

WORKFLOW

WORKFLOW

WORKFLOW FOR THIS SECTION

Workflow 2

Workflow 2

Workflow 1

WORKFLOW FOR THIS SECTION

Workshop Documentation: z.umn.edu/galaxypinmumbai

5.2 BlastP analysis (page 32)

5.3 Novel proteoform analysis (page 33)

5.4 Using Multi-omics Visualization Platform for visualizing novel proteoforms (page 35)


BLASTP ANALYSIS

SELECT DISTINCT PSM.*
FROM PSM JOIN BLAST ON PSM.SEQUENCE = BLAST.QSEQID
WHERE BLAST.PIDENT < 100 OR BLAST.GAPOPEN >= 1 OR BLAST.LENGTH < BLAST.QLEN
ORDER BY PSM.SEQUENCE, PSM.ID
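The BlastP filtering query can be tried outside Galaxy with Python's built-in sqlite3 module. In this sketch the table and column names come from the query itself, but the table definitions are reduced to the columns the query needs and every row is invented toy data: one peptide with a perfect full-length hit, one with a mismatch, and one whose alignment is shorter than the query.

```python
import sqlite3

QUERY = """
SELECT DISTINCT PSM.*
FROM PSM JOIN BLAST ON PSM.SEQUENCE = BLAST.QSEQID
WHERE BLAST.PIDENT < 100 OR BLAST.GAPOPEN >= 1 OR BLAST.LENGTH < BLAST.QLEN
ORDER BY PSM.SEQUENCE, PSM.ID
"""

# Toy rows (hypothetical): (ID, SEQUENCE)
PSM_ROWS = [(1, "PEPTIDEK"), (2, "NOVELPEPK"), (3, "SHORTHITK")]
# Toy rows (hypothetical): (QSEQID, PIDENT, GAPOPEN, LENGTH, QLEN)
BLAST_ROWS = [
    ("PEPTIDEK", 100.0, 0, 8, 8),   # perfect full-length match: filtered out
    ("NOVELPEPK", 88.9, 0, 9, 9),   # mismatch: kept
    ("SHORTHITK", 100.0, 0, 7, 9),  # alignment shorter than query: kept
]


def novel_psms(psm_rows, blast_rows):
    """Return PSMs whose peptide has an imperfect BLAST hit."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE PSM (ID INTEGER, SEQUENCE TEXT)")
        con.execute(
            "CREATE TABLE BLAST (QSEQID TEXT, PIDENT REAL, "
            "GAPOPEN INTEGER, LENGTH INTEGER, QLEN INTEGER)"
        )
        con.executemany("INSERT INTO PSM VALUES (?, ?)", psm_rows)
        con.executemany("INSERT INTO BLAST VALUES (?, ?, ?, ?, ?)", blast_rows)
        return con.execute(QUERY).fetchall()
    finally:
        con.close()


result = novel_psms(PSM_ROWS, BLAST_ROWS)
print(result)
```

Note that, as written, the query keeps a peptide if any one of its BLAST rows is imperfect, so a peptide with both a perfect and an imperfect hit would still be returned; whether that matters depends on how many hits per peptide are reported upstream.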

MULTI-OMICS VISUALIZATION PLATFORM FOR VISUALIZING NOVEL PROTEOFORMS


SPECTRAL QUALITY VISUALIZATION (Lorikeet Viewer)

GENOMIC LOCALIZATION (Integrative Genomics Viewer)

ESSREALVEPTSESPRPALAR

NOVEL PROTEOFORM ANALYSIS

UCSC GENOME BROWSER

CDART BLAST SEARCH

PROJECT OVERVIEW




GO AND TRY IT OUT!

GALAXY INSTANCE ONE

· Galaxy instance for the proteogenomics workshop: http://z.umn.edu/galaxypinmumbai

Users need to register and then log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).

· Documentation for Galaxy instance usage: https://z.umn.edu/mumbaidocs

GALAXY INSTANCE TWO (backup if GALAXY INSTANCE ONE gets busy)

· Proteogenomics Gateway: http://z.umn.edu/proteogenomicsgateway

Users need to register and then log in to the site with a password. Step-by-step instructions for the workshop are provided in the document below (registration instructions start on page 5).

· Documentation for Galaxy instance usage: https://z.umn.edu/pginnov18

WORKSHOP INSTRUCTORS AND ACKNOWLEDGEMENTS

• Instructors
  • Pratik Jagtap

• Support
  • Praveen Kumar
  • Prof. Timothy Griffin, Galaxy-P team (University of Minnesota)
  • Subina Mehta
  • James Johnson and Thomas McGowan (University of Minnesota)
  • Matthew Chambers
  • Jetstream Cloud at Indiana University

• Funding

ACKNOWLEDGMENTS

Minnesota Supercomputing Institute: James Johnson, Thomas McGowan, Lee Parsons, Michael Milligan

Ira Cooke, Melbourne, Australia

University of Minnesota: Timothy Griffin, Pratik Jagtap, Praveen Kumar, Candace Guerrero, Subina Mehta, Adrian Hegeman (Co-I), Art Eschenlauer, Shane Hubler, Ray Sajulga, Caleb Easterly, Andrew Rajczewski

Biologists / collaborators: Laurie Parker, Joel Rudney, Maneesh Bhargava, Amy Skubitz, Chris Wendt, Brian Crooker, Steven Friedenberg, Kevin Viken, Kristin Boylan, Marnie Peterson, Somiah Afiuni, Brian Sandri, Alexa Pragman, Wanda Weber, Amy Treeful

Harald Barsnes, Marc Vaudel, University of Bergen, Norway

University of Freiburg, Freiburg, Germany

VIB, UGhent, Belgium

Judson Hervey, Naval Research Institute, Washington, D.C.

Matt Chambers, Nashville, TN

Alessandro Tanca, Porto Conte Ricerche, Italy

Carolin Kolmeder, University of Helsinki, Finland

Thilo Muth, Bernhard Renard, Robert Koch Institut

Thomas Doak, Jeremy Fisher, Indiana University

Josh Elias, Stanford University

Brook Nunn, U of Washington

Lennart Martens (Co-I), Bart Mesuere, Robbert G Singh

Bjoern Gruening, Bérénice Batut

Lloyd Smith (Co-I), Michael Shortreed, UW-Madison

Karen Reddy, Mo Heydarian, Johns Hopkins University

Anamika Krishanpal, Priyabrata Panigrahi, Persistent Systems Limited

Stephan Kang, Intero Life Sciences

Funding

galaxyp.org

QUESTIONS?

Follow us on twitter.com/usegalaxyp

Workshop Documentation: z.umn.edu/galaxypinmumbai

Slides for the Talk: z.umn.edu/mumbaislides

Visit: http://galaxyp.org

Feedback: https://z.umn.edu/fbindia


http://usegalaxyp.org/

