Novel Peptide Identification using ESTs and Genomic Sequence

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Novel Peptides

• Absent from traditional protein sequence databases
  • IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB

• Due to
  • Deliberate “redundancy” elimination
  • “Dark-side” genes
  • Bias towards high-quality, high-confidence full-length protein sequence

What is missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins

• Proteins have clinical implications
  • Biomarker discovery

• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • No hard evidence for translation start site

Novel Protein

HEQASNVLSDISEFR

Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas A8_IP (Resing et al.);

Novel Splice Isoform

LQGSATAAEAQVGHQTAR

Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene

Novel Frame

TAGSPLCLPTPGAAPGSAGSCSHR

Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene, downstream from LQGSA...

“Novel” Microexon

LQTASDESYKDPTNIQLSK

Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• SPTAN1 gene

Novel Mutation

KADDTWEPFASGK

Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion

Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.);
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.

Known Coding SNP

DTEEEDFHVDQ[V|A]TTVK

Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed

Details:
• HUPO PPP 40 (Wang; Omenn et al.);
• SERPINA1 gene

Known Coding SNP

LQHL[E|V]NELTHDIITK

Evidence:
• log10(E-value) = -6.7 / -10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed

Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.);
• SERPINA1 gene

IPI Common Variant Elimination

YYGGGYGSTQATFMVFQALAQYQK

Evidence:
• log10(E-value) = -5.9
• 100’s ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site.

Details:
• HUPO PPP 29 (Qian/He; Omenn et al.);
• C3 gene

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

Why don’t we see more novel peptides?

• Traditional protein sequence databases
  • High-quality, full-length proteins only
  • Many interesting peptides are omitted
  • Exclusive – peptide identifications are lost.

• ESTs, genomic & mRNA sequence
  • Used as evidence for full-length protein sequences
  • Inclusive – may need to filter results

Significant False Positives

• E-values are not enough!
  • Random guessers are easy to beat.

• Post-translational modifications vs. amino-acid substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14
  • D → E, G → A, V → I/L, N → Q, S → T: +14

• Peptide extension z=+2 → z=+3
  • Nonsense AA masses sum to precursor

• Need to ensure:
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated
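The +14 ambiguity between methylation and amino-acid substitution can be checked with simple arithmetic. The sketch below is not from the slides: it assumes standard monoisotopic residue masses, and the names `RESIDUE_MASS`, `METHYL_SHIFT`, and `isobaric_with_methylation` are hypothetical, introduced only for illustration. It suggests why precursor mass alone cannot separate the two explanations: each listed substitution shifts the mass by the same amount as adding one CH2 group.

```python
# Monoisotopic residue masses in Da (standard values, assumed here;
# not taken from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}

# Mass added by one methylation (net gain of CH2).
METHYL_SHIFT = 12.0 + 2 * 1.007825

# Substitutions listed on the slide as +14.
SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"),
                 ("N", "Q"), ("S", "T")]


def mass_delta(old, new):
    """Mass change in Da when residue `old` is replaced by `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


def isobaric_with_methylation(tolerance=0.01):
    """Return the substitutions whose mass shift matches methylation."""
    return [(a, b) for a, b in SUBSTITUTIONS
            if abs(mass_delta(a, b) - METHYL_SHIFT) <= tolerance]


if __name__ == "__main__":
    for old, new in isobaric_with_methylation():
        print(f"{old} -> {new}: {mass_delta(old, new):+.5f} Da")
```

Because the shifts agree far below typical instrument tolerance, the fragment ions around the changed residue, together with sequence evidence, have to carry the distinction.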

• DFLAGGLAAAISK 2.2 × 10^-8
  • 2 ESTs
• DFLAGGIAAAISK 2.2 × 10^-8
  • IPI (2), RefSeq, mRNA, ~1400 ESTs
• DFLAGGVAAAISK 3.7 × 10^-8
  • IPI, RefSeq, mRNA, ~700 ESTs
• DFLAGGVAAAISKMAVVPI 3.5 × 10^-5
  • Genscan exon
• AISFAKDFLAGGIAAAISK 3.3 × 10^-4
  • Genscan exon

How do we know they are novel?

• How do we know they are real?
  • Good spectra
  • Good E-value
  • Good ion ladders
  • Good sequence evidence

• Lack of other explanations...

Peptide Sequence Evidence

• C3 Compression:
  • Amino-acid 30-mers
  • Complete, Correct(, Compact)
  • Present at least twice (ESTs only)
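The "present at least twice" filter on amino-acid 30-mers can be pictured as follows. This is a minimal sketch under stated assumptions, not the C3 implementation: it assumes the input is a list of already-translated EST reading frames with `*` marking stop codons, and that "twice" means support from at least two distinct input sequences. The function names `kmers` and `supported_kmers` are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def supported_kmers(translations, k=30, min_support=2):
    """Keep k-mers that occur in at least `min_support` distinct translations.

    Assumption: each translation is one reading frame of one EST, with '*'
    marking stop codons; k-mers never span a stop codon.
    """
    support = Counter()
    for translation in translations:
        seen = set()
        for fragment in translation.split("*"):
            seen |= kmers(fragment, k)
        support.update(seen)  # count each k-mer once per translation
    return {km for km, n in support.items() if n >= min_support}


if __name__ == "__main__":
    # Toy input with a small k so the effect is visible.
    toy = ["MKTAYIAKQR*GG", "TAYIAKQRQIS", "MKTAYLAKQR"]
    print(sorted(supported_kmers(toy, k=5)))
```

Requiring support from more than one sequence is one way to drop k-mers created by isolated sequencing errors, at the cost of losing genuinely rare variants.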

Gb of Sequence                 Naïve   Enumeration     C3
Self Corrected ESTs (1)         7.60       2.90       0.18
Genome Corrected ESTs (2)       5.40       1.30       0.12
Corrected ESTs (1+2)           13.00       4.20       0.20
Genscan Exons (3)               0.10       0.06       0.05
Genscan Exon Pairs (4)          2.30       1.60       0.55
Combo (1+2+3+4)                28.40      10.06       0.78
Genomic ORFs (5)                6.20       1.90       1.50

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI

Compressed-SBH-graph

ACDEFGI
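The two graph slides can be illustrated with a small sketch. It assumes the SBH-graph is the usual de Bruijn-style construction from sequencing by hybridization (each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix) and that the compressed form merges unbranched runs of edges into a single labelled edge. The slides do not give k or the construction details, so `k = 4` and the function names `sbh_edges` and `compress_paths` are assumptions made for illustration.

```python
def sbh_edges(seqs, k):
    """Distinct k-mers of the input; each is an edge prefix -> suffix."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def compress_paths(kmers):
    """Merge unbranched runs of edges into single string-labelled edges."""
    out_edges, in_degree = {}, {}
    for km in sorted(kmers):
        out_edges.setdefault(km[:-1], []).append(km)
        in_degree[km[1:]] = in_degree.get(km[1:], 0) + 1

    def is_inner(node):
        # Exactly one edge in and one edge out: safe to merge through.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    def extend(km, used):
        label, node = km, km[1:]
        used.add(km)
        while is_inner(node) and out_edges[node][0] not in used:
            nxt = out_edges[node][0]
            used.add(nxt)
            label += nxt[-1]
            node = nxt[1:]
        return label

    used, paths = set(), []
    for node in sorted(out_edges):
        if not is_inner(node):
            for km in out_edges[node]:
                paths.append(extend(km, used))
    for km in sorted(kmers):  # leftover edges belong to isolated cycles
        if km not in used:
            paths.append(extend(km, used))
    return paths


if __name__ == "__main__":
    seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]  # sequences from the slide
    edges = sbh_edges(seqs, 4)
    print(len(edges), "distinct 4-mers")
    print(compress_paths(edges))
```

Every k-mer of the input appears exactly once across the compressed edge labels, which is the sense in which such a representation stays complete while removing repetition.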

Peptide Sequence Databases

• MS/MS search engine input only
  • Protein context is lost
  • Inclusive, rather than exclusive
  • Download from http://www.umiacs.umd.edu/~nedwards

• Exact string search for gene/protein context
  • Recover peptide sequence evidence

• Relational database to reassemble...
  ...with respect to genes & genome

• Grid Computing + Web Services + Viewer
  • Work in progress
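Recovering protein context by exact string search can be sketched in a few lines. This is an illustration only: the function name `find_peptide`, the dictionary layout, and the toy sequences are hypothetical, and a search over real EST or genomic collections would need an indexed approach rather than a linear scan.

```python
def find_peptide(peptide, sequences):
    """Return (name, start, end) for every exact occurrence of peptide.

    `sequences` maps a sequence name to its amino-acid string.
    Positions are 1-based and inclusive.
    """
    hits = []
    for name, seq in sequences.items():
        start = seq.find(peptide)
        while start != -1:
            hits.append((name, start + 1, start + len(peptide)))
            start = seq.find(peptide, start + 1)  # allow overlapping hits
    return hits


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    toy = {"toy1": "MKADDTWEPFASGKTS", "toy2": "AAAKADDAAA"}
    print(find_peptide("KADDTWEPFASGK", toy))
```

Each hit ties an identified peptide back to a named sequence and position, which is the kind of evidence a relational store could then group by gene and genomic location.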

Peptide Identification Navigator

Conclusions

• Peptides identify more than proteins

• Search EST sequences (at least)

• Compressed peptide sequence databases make this feasible
