Improving the Sensitivity of Peptide Identification for Genome Annotation


Page 1: Improving the Sensitivity of Peptide Identification for Genome Annotation

Improving the Sensitivity of Peptide Identification for Genome Annotation

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Page 2: Improving the Sensitivity of Peptide Identification for Genome Annotation

2

Why Tandem Mass Spectrometry?

MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.

Key concepts:
• Spectrum acquisition is unbiased
• Direct observation of amino-acid sequence
• Sensitive to small sequence variations

Page 3: Improving the Sensitivity of Peptide Identification for Genome Annotation

3

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth
• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

Page 4: Improving the Sensitivity of Peptide Identification for Genome Annotation

4

Mass Spectrometer

Sample → Ionizer → Mass Analyzer → Detector

Ionizer:
• MALDI
• Electro-Spray Ionization (ESI)

Mass Analyzer:
• Time-Of-Flight (TOF)
• Quadrupole
• Ion-Trap

Detector:
• Electron Multiplier (EM)

Page 5: Improving the Sensitivity of Peptide Identification for Genome Annotation

5

Mass Spectrum

Page 6: Improving the Sensitivity of Peptide Identification for Genome Annotation

6

Mass is fundamental

Page 7: Improving the Sensitivity of Peptide Identification for Genome Annotation

7

Mass Spectrometry for Proteomics

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

Page 8: Improving the Sensitivity of Peptide Identification for Genome Annotation

8

Mass Spectrometry for Proteomics

Mass spectrometry has been around since the turn of the twentieth century...
...so why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray
• Protein chemistry & automation
  • Chromatography, Gels, Computers
• Protein sequence databases
  • A reference for comparison

Page 9: Improving the Sensitivity of Peptide Identification for Genome Annotation

9

Sample Preparation for MS/MS

Enzymatic Digest and Fractionation

Page 10: Improving the Sensitivity of Peptide Identification for Genome Annotation

10

Single Stage MS

MS

Page 11: Improving the Sensitivity of Peptide Identification for Genome Annotation

11

Tandem Mass Spectrometry (MS/MS)

Precursor selection

Page 12: Improving the Sensitivity of Peptide Identification for Genome Annotation

12

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision induced dissociation (CID)

MS/MS

Page 13: Improving the Sensitivity of Peptide Identification for Genome Annotation

13

Peptide Fragmentation

Peptide: S-G-F-L-E-E-D-E-L-K

MW     ion   fragments                ion   MW
88     b1    S         GFLEEDELK      y9    1080
145    b2    SG        FLEEDELK       y8    1022
292    b3    SGF       LEEDELK        y7    875
405    b4    SGFL      EEDELK         y6    762
534    b5    SGFLE     EDELK          y5    633
663    b6    SGFLEE    DELK           y4    504
778    b7    SGFLEED   ELK            y3    389
907    b8    SGFLEEDE  LK             y2    260
1020   b9    SGFLEEDEL K              y1    147
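The b- and y-ion masses on this slide can be approximated with a short calculation. The sketch below is an illustration rather than the presenter's code. It assumes monoisotopic residue masses and singly charged fragments, and the names `RESIDUE_MASS` and `fragment_ions` are hypothetical. Under these assumptions the results should land within about one mass unit of the integer values shown on the slide; small differences come from rounding and the choice of mass convention.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
# Assumption: the slide's integer values are rounded from masses like these.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056   # added to y-ions (C-terminal fragments keep the OH and an H)
PROTON = 1.00728   # one charge-carrying proton per singly charged fragment


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for b1.. and y1.."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i: the first i residues plus a proton.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i: the last i residues plus water plus a proton.
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:8.2f}   y{i} {y:8.2f}")
```

Each cleavage site yields one complementary pair: b_i pairs with y_(n-i), which is why b4 (SGFL) sits on the same row as y6 (EEDELK) in the table.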

Page 14: Improving the Sensitivity of Peptide Identification for Genome Annotation

14

Unannotated Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications

Page 15: Improving the Sensitivity of Peptide Identification for Genome Annotation

15

Unannotated Splice Isoform

Page 16: Improving the Sensitivity of Peptide Identification for Genome Annotation

16

Unannotated Splice Isoform

Page 17: Improving the Sensitivity of Peptide Identification for Genome Annotation

17

Translation start-site correction

• Halobacterium sp. NRC-1
  • Extremely halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
• Multiple significant peptide identifications
• Observed start is consistent with Glimmer 3.0 prediction(s)

Page 18: Improving the Sensitivity of Peptide Identification for Genome Annotation

18

Halobacterium sp. NRC-1 ORF: GdhA1

• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: peptide identifications along the GdhA1 ORF; horizontal axis ticks from 0 to 440]

Page 19: Improving the Sensitivity of Peptide Identification for Genome Annotation

19

Translation start-site correction

Page 20: Improving the Sensitivity of Peptide Identification for Genome Annotation

20

Phyloproteomics

Tandem mass-spectra of proteins (top-down)

High-accuracy instrument (Orbitrap, UMD Core)

Proteins from unsequenced bacteria matching identical proteins in related organisms

Demonstration using Y. rohdei.

Page 21: Improving the Sensitivity of Peptide Identification for Genome Annotation

21

[Figure: LC-MS/MS data for Y. rohdei (data file yr_inclusion, dated 3/11/2009). Top panels: total ion chromatograms, relative abundance vs. time (min), RT 19.04 - 25.39; TIC MS (NL 1.66E8) and TIC of the FTMS + p ESI d Full ms2 scans (NL 1.01E7). Bottom panel: averaged MS/MS spectrum, relative abundance vs. m/z (195.00-2000.00), scans #1937-2437, RT 19.45-24.36, AV 21, NL 4.80E4.]

Protein Fragmentation Spectrum

m/z 756.70, charge +8, MW 6044.11

A-V-Q-Q-N-K-P-T-R-S-K-R-G-M-R-R-S-H-D-A-L-T-T-A-T-L-S-V-D-K-T-S-G-E-T-H-L-R-H-H-I-T-A-D-G-F-Y-R-G-R-K-V-I-G

Match to Y. pestis 50S RP L32

Page 22: Improving the Sensitivity of Peptide Identification for Genome Annotation

22

Phyloproteomics

Page 23: Improving the Sensitivity of Peptide Identification for Genome Annotation

23

Phyloproteomics

phylogeny.fr – "One-Click"

Protein Sequence 16S-rRNA Sequence

Page 24: Improving the Sensitivity of Peptide Identification for Genome Annotation

24

Shared "Biomarker" Proteins

Page 25: Improving the Sensitivity of Peptide Identification for Genome Annotation

25

Phyloproteomics

• Recent extension to highly homologous proteins in related organisms
  • Merely require N- and/or C-terminus in common
  • Broadens applicability considerably
• Phyloproteomic trees for E. herbicola and Enterobacter cloacae, neither sequenced.
• New paradigm for phylogenetic analysis?

Page 26: Improving the Sensitivity of Peptide Identification for Genome Annotation

26

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

Page 27: Improving the Sensitivity of Peptide Identification for Genome Annotation

27

Searching under the street-light…

Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

Page 28: Improving the Sensitivity of Peptide Identification for Genome Annotation

28

Peptide Sequence Databases

• All amino-acid 30-mers, no redundancy
  • From ESTs, Proteins, mRNAs
• 30-40 fold reduction in size and search time
• Formatted as a FASTA sequence database
• One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490
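The idea of "all 30-mers, no redundancy" can be illustrated with a toy calculation. The sketch below only counts distinct k-mers across overlapping input sequences to show where the redundancy comes from. It is not the compression method used to build these databases, and `distinct_kmers` and `redundancy_ratio` are hypothetical names introduced for the example.

```python
def distinct_kmers(sequences, k=30):
    """Collect every distinct amino-acid k-mer found in the input sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def redundancy_ratio(sequences, k=30):
    """Ratio of total k-mer occurrences to distinct k-mers (1.0 = no redundancy)."""
    total = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    distinct = len(distinct_kmers(sequences, k))
    return total / distinct if distinct else 0.0


# Toy input: k is shortened to 5 so the overlap is visible by eye.
toy = ["ABCDEFG", "ABCDEFG", "CDEFGHI"]
print(sorted(distinct_kmers(toy, k=5)))
print(redundancy_ratio(toy, k=5))
```

ESTs, mRNAs, and protein entries for the same gene repeat most of their k-mers, so storing each distinct k-mer once is what makes a large reduction in database size plausible.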

Page 29: Improving the Sensitivity of Peptide Identification for Genome Annotation

29

We can observe evidence for…

• Known coding SNPs
• Unannotated coding mutations
• Alternate splicing isoforms
• Alternate/Incorrect translation start-sites
• Microexons
• Alternate/Incorrect translation frames

…though it must be treated thoughtfully.

Page 30: Improving the Sensitivity of Peptide Identification for Genome Annotation

30

PeptideMapper Web Service

I’m Feeling Lucky

Page 31: Improving the Sensitivity of Peptide Identification for Genome Annotation

31

PeptideMapper Web Service

I’m Feeling Lucky

Page 32: Improving the Sensitivity of Peptide Identification for Genome Annotation

32

PeptideMapper Web Service

I’m Feeling Lucky

Page 33: Improving the Sensitivity of Peptide Identification for Genome Annotation

33

PeptideMapper Web Service

• Suffix-tree index on peptide sequence database
  • Fast peptide to gene/cluster mapping
  • “Compression” makes this feasible
• Peptide alignment with cluster evidence
  • Amino-acid or nucleotide; exact & near-exact
• Genomic-loci mapping via
  • UCSC “known-gene” transcripts, and
  • Predetermined, embedded genomic coordinates
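Peptide to gene/cluster lookup through a suffix index can be sketched as follows. This is a minimal stand-in, not the PeptideMapper implementation: it uses a naively built suffix array instead of a suffix tree, handles exact matches only, and the names `build_index` and `find_clusters` are hypothetical.

```python
def build_index(entries):
    """Build a naive suffix array over all entries.

    entries: dict mapping a gene/cluster id to its amino-acid sequence.
    Each sequence is terminated with '$' so matches cannot span two entries.
    """
    chars, owners = [], []
    for cluster_id, seq in entries.items():
        for ch in seq + "$":
            chars.append(ch)
            owners.append(cluster_id)
    text = "".join(chars)
    # Naive construction: sort suffix start positions by the suffix text.
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    return text, owners, suffixes


def find_clusters(index, peptide):
    """Return the set of cluster ids whose sequence contains the peptide."""
    text, owners, suffixes = index
    m = len(peptide)
    lo, hi = 0, len(suffixes)
    # Binary search for the first suffix whose m-character prefix >= peptide.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffixes[mid]:suffixes[mid] + m] < peptide:
            lo = mid + 1
        else:
            hi = mid
    hits = set()
    # All matching suffixes are adjacent in the sorted order.
    while lo < len(suffixes) and text[suffixes[lo]:suffixes[lo] + m] == peptide:
        hits.add(owners[suffixes[lo]])
        lo += 1
    return hits


index = build_index({"geneA": "MKTAYIAKQR", "geneB": "GGTAYIAKLL"})
print(find_clusters(index, "TAYIAK"))
```

The naive construction sorts whole suffixes and would be too slow and memory-hungry for a real database. It is used here only because it keeps the lookup logic easy to follow.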

Page 34: Improving the Sensitivity of Peptide Identification for Genome Annotation

34

Comparison of search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

Searle et al. JPR 7(1), 2008

[Figure: Venn diagram of peptide identifications from X! Tandem, SEQUEST, and Mascot; region labels 38%, 14%, 28%, 14%, 3%, 2%, 1%]

Page 35: Improving the Sensitivity of Peptide Identification for Genome Annotation

35

Combining search engine results – harder than it looks!

• Consensus boosts confidence, but...
  • How to assess statistical significance?
  • Gain specificity, but lose sensitivity!
  • Incorrect identifications are correlated too!
• How to handle weak identifications?
  • Consensus vs disagreement vs abstention
  • Threshold at some significance?
• We apply unsupervised machine-learning...
  • Lots of related work unified in a single framework.
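The consensus, disagreement, and abstention cases can be made concrete with a simple tally. The sketch below is a plain voting illustration, not the machine-learning combiner described in the talk. The function name `classify`, the extra "single" label, and the example peptides are hypothetical.

```python
from collections import Counter


def classify(assignments):
    """Classify one spectrum from per-engine top peptide assignments.

    assignments: dict mapping engine name to a peptide string,
    or to None when that engine abstains (no confident assignment).
    """
    votes = Counter(p for p in assignments.values() if p is not None)
    if not votes:
        return "abstention", None      # no engine proposed a peptide
    peptide, count = votes.most_common(1)[0]
    if len(votes) == 1 and count >= 2:
        return "consensus", peptide    # two or more engines, all agreeing
    if len(votes) == 1:
        return "single", peptide       # only one engine proposed a peptide
    return "disagreement", None        # engines proposed different peptides


print(classify({"Mascot": "SGFLEEDELK", "X!Tandem": "SGFLEEDELK", "OMSSA": None}))
print(classify({"Mascot": "SGFLEEDELK", "X!Tandem": "SGFIEEDELK", "OMSSA": None}))
print(classify({"Mascot": None, "X!Tandem": None, "OMSSA": None}))
```

A tally like this shows why simple voting gains specificity but loses sensitivity: every spectrum that ends up in the "single" or "disagreement" bucket is discarded unless something more informative than agreement is used.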

Page 36: Improving the Sensitivity of Peptide Identification for Genome Annotation

36

Supervised Learning

Page 37: Improving the Sensitivity of Peptide Identification for Genome Annotation

37

Unsupervised Learning

Page 38: Improving the Sensitivity of Peptide Identification for Genome Annotation

38

Peptide Atlas A8_IP LTQ Dataset

Page 39: Improving the Sensitivity of Peptide Identification for Genome Annotation

39

Running many search engines

Search engine configuration can be difficult:
• Correct spectral format
• Search parameter files and command-line
• Pre-processed sequence databases.
• Tracking spectrum identifiers
• Extracting peptide identifications, especially modifications and protein identifiers

Page 40: Improving the Sensitivity of Peptide Identification for Genome Annotation

40

Peptide Identification Meta-Search

• Simple unified search interface for:
  • Mascot, X!Tandem, K-Score, OMSSA, MyriMatch, S-Score, InsPecT, KM-Score
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, Multi-Processor, Cluster, Grid
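One common use of decoy searches is to estimate the false discovery rate (FDR) at a score threshold. The sketch below shows that estimate under stated assumptions: decoys are made by reversing target sequences, higher scores are better, and the FDR is taken as decoy hits divided by target hits above the threshold. It is not the meta-search tool's own procedure, and `make_decoy` and `fdr_at_threshold` are hypothetical names.

```python
def make_decoy(sequence):
    """Build a decoy protein by reversing the target sequence (one common choice)."""
    return sequence[::-1]


def fdr_at_threshold(psms, threshold):
    """Estimate FDR among target hits scoring at or above the threshold.

    psms: list of (score, is_decoy) pairs, one per peptide-spectrum match,
    where higher scores are assumed to be better.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    # Decoy hits above the threshold approximate the number of false target hits.
    return decoys / targets if targets else 0.0


# Made-up scores, purely for illustration.
psms = [(9.1, False), (8.7, False), (7.9, True), (7.5, False), (6.0, True)]
print(make_decoy("SGFLEEDELK"))
print(fdr_at_threshold(psms, 7.0))
print(fdr_at_threshold(psms, 8.0))
```

Lowering the threshold admits more target identifications but also more decoy hits, so a working threshold is usually chosen as the most permissive score that keeps the estimated FDR under a chosen limit.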

Page 41: Improving the Sensitivity of Peptide Identification for Genome Annotation

41

PepArML Meta-Search Engine

[Figure: architecture diagram]
• Compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (Scheduler & 48+ CPUs)
• Search engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); and X!Tandem, KScore, OMSSA, MyriMatch
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches

Page 42: Improving the Sensitivity of Peptide Identification for Genome Annotation

42

PepArML Meta-Search Engine

[Figure: architecture diagram]
• Compute resources: NSF TeraGrid (1000+ CPUs), Edwards Lab (Scheduler & 80+ CPUs)
• Search engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); and X!Tandem, KScore, OMSSA, MyriMatch
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches

Page 43: Improving the Sensitivity of Peptide Identification for Genome Annotation

43

PepArML Meta-Search Engine

[Figure: architecture diagram]
• Compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (Scheduler & 48+ CPUs)
• Secure communication
• Heterogeneous compute resources
• Simple search request

Page 44: Improving the Sensitivity of Peptide Identification for Genome Annotation

44

PepArML Meta-Search Engine

[Figure: architecture diagram]
• Compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (Scheduler & 48+ CPUs)
• Secure communication
• Heterogeneous compute resources
• Simple search request

Page 45: Improving the Sensitivity of Peptide Identification for Genome Annotation

45

Peptide Identification Grid-Enabled Meta-Search

• Access to high-performance computing resources for the proteomics community
  • NSF TeraGrid Community Portal
  • University/Institute HPC clusters
  • Individual lab compute resources
  • Contribute cycles to the community and get access to others’ cycles in return.
• Centralized scheduler
  • Compute capacity can still be exclusive, or prioritized.
  • Compute client plays well with HPC grid schedulers.

Page 46: Improving the Sensitivity of Peptide Identification for Genome Annotation

46

Conclusions

Improve the scope and sensitivity of peptide identification for genome annotation, using
• Exhaustive peptide sequence databases
• Machine-learning for combining
• Meta-search tools to maximize consensus
• Grid-computing for thorough search

http://edwardslab.bmcb.georgetown.edu

Page 47: Improving the Sensitivity of Peptide Identification for Genome Annotation

47

Acknowledgements

• Dr. Catherine Fenselau & students
  • University of Maryland Biochemistry
• Dr. Yan Wang
  • University of Maryland Proteomics Core
• Dr. Art Delcher
  • University of Maryland CBCB
• Dr. Chau-Wen Tseng & Dr. Xue Wu
  • University of Maryland Computer Science
• Funding: NIH/NCI