Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Post on 18-Jan-2016

Novel Peptide Identification using ESTs and Genomic Sequence

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Novel Peptides

• Absent from traditional protein sequence databases
  • IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
• Due to
  • Deliberate “redundancy” elimination
  • “Dark-side” genes
  • Bias towards high-quality, high-confidence full-length protein sequence

What is missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
• Proteins have clinical implications
  • Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • No hard evidence for translation start site

Novel Protein

HEQASNVLSDISEFR

Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas A8_IP (Resing et al.);

Novel Splice Isoform

LQGSATAAEAQVGHQTAR

Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene

Novel Frame

TAGSPLCLPTPGAAPGSAGSCSHR

Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene, downstream from LQGSA...

“Novel” Microexon

LQTASDESYKDPTNIQLSK

Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• SPTAN1 gene

Novel Mutation

KADDTWEPFASGK

Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion

Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.);
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.

Known Coding SNP

DTEEEDFHVDQ[V|A]TTVK

Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed

Details:
• HUPO PPP 40 (Wang; Omenn et al.);
• SERPINA1 gene

Wildtype

Known Coding SNP

LQHL[E|V]NELTHDIITK

Evidence:
• log10(E-value) = -6.7 / -10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed

Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.);
• SERPINA1 gene

IPI Common Variant Elimination

YYGGGYGSTQATFMVFQALAQYQK

Evidence:
• log10(E-value) = -5.9
• 100’s ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site.

Details:
• HUPO PPP 29 (Qian/He; Omenn et al.);
• C3 gene

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

Why don’t we see more novel peptides?

• Traditional protein sequence databases
  • High-quality, full-length proteins only
  • Many interesting peptides are omitted
  • Exclusive: peptide identifications are lost.
• ESTs, genomic & mRNA sequence
  • Used as evidence for full-length protein sequences
  • Inclusive: may need to filter results
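Before an MS/MS search engine can use EST or genomic sequence, the nucleotides have to be turned into candidate amino-acid sequence. The following is a minimal sketch, assuming plain six-frame translation with the standard genetic code; the slides do not say how the translations were actually produced, and the names `translate` and `six_frames` are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here;
# the talk does not name a translation table).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become X."""
    return "".join(CODON.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frames("ATGGCCTAAGGC"):
        print(frame)
```

Every frame is kept, stop codons included, which is what makes this kind of database inclusive: peptides in an alternative frame remain visible to the search engine, at the cost of more candidates to filter afterwards.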

Significant False Positives

• E-values are not enough!
  • Random guessers are easy to beat.
• Post-translational modifications vs. amino-acid substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14
  • D → E, G → A, V → I/L, N → Q, S → T: +14
• Peptide extension z=+2 → z=+3
  • Nonsense AA masses sum to precursor
• Need to ensure:
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated
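The +14 ambiguity can be checked directly from residue masses. The sketch below uses commonly tabulated monoisotopic residue masses (rounded to five decimals) and lists every single-residue substitution whose mass shift matches a methyl addition (CH2, about 14.01565 Da) within a tolerance. The function name and the tolerance value are assumptions made for illustration.

```python
# Monoisotopic residue masses in Da (standard tabulated values, rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
METHYL = 14.01565  # mass added by methylation (CH2)


def methyl_like_substitutions(tol=0.001):
    """Pairs (a, b) where replacing a with b adds a methyl-sized mass."""
    return sorted((a, b)
                  for a in MONO for b in MONO
                  if a != b and abs(MONO[b] - MONO[a] - METHYL) <= tol)


if __name__ == "__main__":
    for a, b in methyl_like_substitutions():
        print(f"{a} -> {b}: +{MONO[b] - MONO[a]:.5f}")
```

With a tight tolerance this recovers the substitutions listed on the slide. On precursor mass alone, a methylated known peptide and a "novel" substituted peptide look the same, so the fragment ions have to settle which explanation holds.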

Significant False Positives

• DFLAGGLAAAISK 2.2x10^-8
  • 2 ESTs
• DFLAGGIAAAISK 2.2x10^-8
  • IPI (2), RefSeq, mRNA, ~1400 ESTs
• DFLAGGVAAAISK 3.7x10^-8
  • IPI, RefSeq, mRNA, ~700 ESTs
• DFLAGGVAAAISKMAVVPI 3.5x10^-5
  • Genscan exon
• AISFAKDFLAGGIAAAISK 3.3x10^-4
  • Genscan exon
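The five candidates above compete for the same spectrum, and a precursor-mass calculation shows why. The sketch below computes neutral monoisotopic masses and m/z values from standard residue masses. The helper names are illustrative, and the charge states assigned to each peptide are an assumption based on the "z=+2 → z=+3" remark on the previous slide.

```python
# Monoisotopic residue masses in Da (standard tabulated values, rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def neutral_mass(peptide):
    """Monoisotopic mass of the unmodified, uncharged peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (neutral_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    candidates = [  # (peptide, assumed charge state)
        ("DFLAGGLAAAISK", 2), ("DFLAGGIAAAISK", 2), ("DFLAGGVAAAISK", 2),
        ("DFLAGGVAAAISKMAVVPI", 3), ("AISFAKDFLAGGIAAAISK", 3),
    ]
    for peptide, z in candidates:
        print(f"{peptide:22s} z={z}  M={neutral_mass(peptide):10.4f}  "
              f"m/z={mz(peptide, z):9.4f}")
```

Under these assumptions the Leu and Ile forms are exactly isobaric, the Val form is lighter by one methyl-sized step, and each Genscan-exon extension at z=+3 lands within a fraction of an m/z unit of its shorter counterpart at z=+2. All of them are therefore plausible precursors for the same spectrum when the mass tolerance is loose.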

How do we know they are novel?

• How do we know they are real?
  • Good spectra
  • Good E-value
  • Good ion ladders
  • Good sequence evidence
• Lack of other explanations...
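A "good ion ladder" matters because only some fragment ions can tell two candidate sequences apart. The sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and reports which ladder positions differ between two equal-length candidates. Function names and the tolerance are illustrative assumptions, not part of the talk.

```python
# Monoisotopic residue masses in Da (standard tabulated values, rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def ladders(peptide):
    """Return (b, y): b[i] is b_(i+1); y[i] is the complementary y ion."""
    masses = [MONO[aa] for aa in peptide]
    total = sum(masses)
    b, y, running = [], [], 0.0
    for m in masses[:-1]:
        running += m
        b.append(running + PROTON)                   # N-terminal fragment
        y.append(total - running + WATER + PROTON)   # C-terminal remainder
    return b, y


def discriminating(pep1, pep2, tol=0.01):
    """Cleavage sites (1-based) whose b or y ion differs between peptides."""
    (b1, y1), (b2, y2) = ladders(pep1), ladders(pep2)
    b_sites = [i + 1 for i, (p, q) in enumerate(zip(b1, b2)) if abs(p - q) > tol]
    y_sites = [i + 1 for i, (p, q) in enumerate(zip(y1, y2)) if abs(p - q) > tol]
    return b_sites, y_sites


if __name__ == "__main__":
    print(discriminating("DFLAGGIAAAISK", "DFLAGGVAAAISK"))
    print(discriminating("DFLAGGIAAAISK", "DFLAGGLAAAISK"))
```

For the Ile/Val pair only the b ions from the substituted residue onward and the y ions that still contain it can support one sequence over the other. For the Ile/Leu pair no b or y ion differs, so sequence evidence rather than the spectrum has to decide.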

Peptide Sequence Evidence

• C3 Compression:
  • Amino-acid 30-mers
  • Complete, Correct(, Compact)
  • Present at least twice (ESTs only)

Gb of                        Sequence   Naïve Enumeration     C3
Self Corrected ESTs (1)          7.60                2.90   0.18
Genome Corrected ESTs (2)        5.40                1.30   0.12
Corrected ESTs (1+2)            13.00                4.20   0.20
Genscan Exons (3)                0.10                0.06   0.05
Genscan Exon Pairs (4)           2.30                1.60   0.55
Combo (1+2+3+4)                 28.40               10.06   0.78
Genomic ORFs (5)                 6.20                1.90   1.50
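The "present at least twice" rule can be illustrated with a simple k-mer count. This is a sketch under one stated assumption: "twice" is read as "in at least two distinct translated EST sequences", which the slide does not spell out. The function name and the toy sequences are made up for the example and do not come from C3 itself.

```python
from collections import Counter


def supported_kmers(peptide_sequences, k=30, min_support=2):
    """Amino-acid k-mers seen in at least `min_support` distinct sequences."""
    support = Counter()
    for seq in peptide_sequences:
        # A set, so repeats within one sequence count only once.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer for kmer, n in support.items() if n >= min_support}


if __name__ == "__main__":
    # Toy translated ESTs (made-up strings), with k shortened for display.
    toy = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    print(sorted(supported_kmers(toy, k=4)))
```

Requiring independent support is a cheap way to drop k-mers that arise from a sequencing error in a single EST, which is one reason the EST-derived sets shrink so much before compression.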

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
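An SBH-graph (sequencing-by-hybridization graph, closely related to a de Bruijn graph) has one node per (k-1)-mer and one edge per k-mer. A minimal sketch for the three example sequences follows. The value k=3 is an assumption, since the slide's figure did not survive extraction, and `sbh_graph` is an illustrative name.

```python
from collections import Counter


def sbh_graph(sequences, k):
    """Map each edge (prefix, suffix) to how often its k-mer occurs."""
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def is_walk(edges, seq, k):
    """True if every k-mer of `seq` is an edge, i.e. seq is spelled by a walk."""
    return all((seq[i:i + k - 1], seq[i + 1:i + k]) in edges
               for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=3)
    for (src, dst), count in sorted(graph.items()):
        print(f"{src} -> {dst}  x{count}")
```

Every input sequence is a walk in the graph, so no peptide of length k or more is lost. The graph also admits walks that spell sequences absent from the input, which is why the edge counts are worth keeping.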

Compressed-SBH-graph

[Figure: compressed SBH-graph for the example sequences; surviving labels: ACDEFGI and edge counts 2, 2, 1, 2, 1]
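One common way to compress such a graph is to merge every non-branching chain of edges into a single edge labelled with the sequence it spells. The sketch below does this for the example sequences with an assumed k=3. It is not claimed to be the C3 algorithm: it ignores edge counts and isolated cycles, so its output need not match the slide's figure.

```python
from collections import Counter, defaultdict


def compress_sbh(sequences, k):
    """Merge non-branching paths of the SBH-graph (k >= 2) into strings."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    out_edges = defaultdict(list)   # (k-1)-mer node -> outgoing k-mers
    in_degree = Counter()           # (k-1)-mer node -> number of incoming edges
    for kmer in sorted(kmers):
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def is_simple(node):
        # Exactly one way in and one way out: safe to merge through.
        return in_degree[node] == 1 and len(out_edges.get(node, [])) == 1

    paths = []
    for node in sorted(out_edges):
        if is_simple(node):
            continue                # paths start only at branching nodes
        for kmer in out_edges[node]:
            path = kmer
            while is_simple(path[-(k - 1):]):
                path += out_edges[path[-(k - 1):]][0][-1]
            paths.append(path)
    return paths


if __name__ == "__main__":
    print(compress_sbh(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=3))
```

Each k-mer of the input appears in exactly one merged string, so the compressed edges can be handed to a search engine as a sequence database that is complete for peptides up to length k while storing shared k-mers only once.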

Peptide Sequence Databases

• MS/MS search engine input only
  • Protein context is lost
  • Inclusive, rather than exclusive
  • Download from http://www.umiacs.umd.edu/~nedwards
• Exact string search for gene/protein context
  • Recover peptide sequence evidence
• Relational database to reassemble...
  ...with respect to genes & genome
• Grid Computing + Web Services + Viewer
  • Work in progress
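Because the peptide sequence database drops protein context, an identified peptide has to be mapped back to its source sequences afterwards. A minimal sketch of that exact string search follows, using plain substring search over an in-memory dictionary. The function name, the flank parameter, and the toy protein entries are all hypothetical; a real pipeline would likely use an index over much larger sequence collections.

```python
def find_context(peptide, sequences, flank=1):
    """Locate every exact occurrence of `peptide` in named sequences.

    Returns (name, 1-based start, residues before, residues after) tuples.
    """
    hits = []
    for name, seq in sequences.items():
        start = seq.find(peptide)
        while start != -1:
            end = start + len(peptide)
            hits.append((name, start + 1,
                         seq[max(0, start - flank):start],
                         seq[end:end + flank]))
            start = seq.find(peptide, start + 1)  # allow overlapping hits
    return hits


if __name__ == "__main__":
    # Made-up sequences for illustration only, not real database entries.
    toy = {
        "toy_isoform_1": "MKR" + "LQGSATAAEAQVGHQTAR" + "GG",
        "toy_isoform_2": "MKLQGSAR",
    }
    print(find_context("LQGSATAAEAQVGHQTAR", toy))
```

Reporting the flanking residues makes it easy to check that a hit is consistent with the digest (for trypsin, a K or R before the peptide), and the list of hit names is the sequence evidence that later gets reassembled by gene and genome position.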

Peptide Identification Navigator

Conclusions

• Peptides identify more than proteins

• Search EST sequences (at least)

• Compressed peptide sequence databases make this feasible