Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

Upload: audra

Post on 05-Jan-2016


Page 1: Proteomics Informatics –

Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

Page 2: Proteomics Informatics –


Peptide Mapping - Mass Accuracy

Page 3: Proteomics Informatics –

Peptide Mapping: Database Size

C. elegans

S. cerevisiae

Human

Page 4: Proteomics Informatics –

Peptide Mapping: Cys-Containing Peptides

C. elegans

S. cerevisiae

Human

Page 5: Proteomics Informatics –

Identification – Peptide Mass Fingerprinting

[Workflow diagram]

Measurement: MS

Search: Sequence DB → Pick Protein → Digestion → All Peptide Masses → Compare, Score, Test Significance

Repeat for each protein.

Result: Identified Proteins
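The search loop of this workflow can be sketched in Python. This is a minimal illustration under stated assumptions, not the scoring used by ProFound or any other search engine: it assumes a simple trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and uses the raw count of matching masses as the score. The function names, the two-protein database, and the measured masses are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER = 18.01056  # H2O added to the residue sum of a free peptide


def tryptic_peptides(protein):
    """Cleave after K or R unless followed by P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, protein, tol=0.5):
    """Number of measured masses within tol (Da) of any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in measured)


def rank_proteins(measured, database, tol=0.5):
    """Repeat the comparison for each protein and sort by score, best first."""
    scored = [(count_matches(measured, seq, tol), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical database and measured neutral peptide masses.
database = {"protA": "SGFLEEDELKAAGR", "protB": "MKWVTFISLLR"}
measured = [373.21, 1165.55]
ranking = rank_proteins(measured, database)
print(ranking)
```

A match count alone says nothing about significance; that requires the distribution of scores for false matches, which the later slides address.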

Page 6: Proteomics Informatics –

ProFound Results

Page 7: Proteomics Informatics –

Database size

Page 8: Proteomics Informatics –

Mixtures

Page 9: Proteomics Informatics –

Peptide Fragmentation

[Instrument diagram: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector]

Fragment ion types: b, y

Page 10: Proteomics Informatics –

Identification – Tandem MS

Page 11: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

[Figure: fragment mass spectrum; x-axis m/z (ticks at 250, 500, 750, 1000), y-axis % relative abundance (0 to 100).]

Page 12: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

b ions: K 1166, L 1020, E 907, D 778, E 663, E 534, L 405, F 292, G 145, S 88

[Figure: fragment mass spectrum; x-axis m/z (ticks at 250, 500, 750, 1000), y-axis % relative abundance (0 to 100).]

Page 13: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

Residue   y ions   b ions
K            147     1166
L            260     1020
E            389      907
D            504      778
E            633      663
E            762      534
L            875      405
F           1022      292
G           1080      145
S           1166       88

[Figure: fragment mass spectrum; x-axis m/z (ticks at 250, 500, 750, 1000), y-axis % relative abundance (0 to 100).]
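The b and y ion masses listed for this peptide can be reproduced with a short calculation. The sketch below assumes singly charged fragments and monoisotopic residue masses, and reads the peptide from N- to C-terminus as SGFLEEDELK: the slide draws it as KLEDEELFGS, and the listed masses are consistent with S at the N-terminus (b1 = 88) and K at the C-terminus (y1 = 147). The function names are illustrative.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
PROTON = 1.00728  # charge carrier for a singly protonated ion
WATER = 18.01056  # H2O retained by C-terminal (y) fragments


def b_ions(peptide):
    """Singly charged b ions b1..b(n-1): N-terminal residue sums plus a proton."""
    total, ions = PROTON, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(total)
    return ions


def y_ions(peptide):
    """Singly charged y ions y1..y(n-1): C-terminal residue sums plus water and a proton."""
    total, ions = WATER + PROTON, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ions.append(total)
    return ions


peptide = "SGFLEEDELK"  # N- to C-terminus (assumed reading direction)
b = b_ions(peptide)
y = y_ions(peptide)
mh = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON  # [M+H]+
print([round(m, 2) for m in b])
print([round(m, 2) for m in y])
print(round(mh, 2))
```

The computed values agree with the integer masses on the slide to within 1 Da; the slide appears to list nominal masses, so small differences are expected.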

Page 14: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

[Figure: fragment mass spectrum with labeled peaks; x-axis m/z (ticks at 250, 500, 750, 1000), y-axis % relative abundance (0 to 100).]

Labeled y ion peaks: 260, 389, 504, 633, 762, 875, 1022, 1080

Labeled b ion peaks: 292, 405, 534, 663, 778, 907, 1020

Also labeled: [M+2H]2+


Page 16: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

[Figure: the same annotated spectrum and b/y ion ladder, with a mass difference of 113 marked on the spectrum and on the sequence.]

Page 17: Proteomics Informatics –

Tandem MS – Sequence Confirmation

KLEDEELFGS

[Figure: the same annotated spectrum and b/y ion ladder, with a mass difference of 129 marked on the spectrum and on the sequence.]


Page 21: Proteomics Informatics –

Tandem MS – de novo Sequencing

[Figure: fragment mass spectrum with labeled peaks at 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, and 1080, plus [M+2H]2+; x-axis m/z (ticks at 250, 500, 750, 1000), y-axis % relative abundance (0 to 100).]

Mass Differences

Amino acid masses

1-letter code  3-letter code  Chemical formula  Monoisotopic (Da)  Average (Da)
A              Ala            C3H5ON            71.0371            71.0788
R              Arg            C6H12ON4          156.101            156.188
N              Asn            C4H6O2N2          114.043            114.104
D              Asp            C4H5O3N           115.027            115.089
C              Cys            C3H5ONS           103.009            103.139
E              Glu            C5H7O3N           129.043            129.116
Q              Gln            C5H8O2N2          128.059            128.131
G              Gly            C2H3ON            57.0215            57.0519
H              His            C6H7ON3           137.059            137.141
I              Ile            C6H11ON           113.084            113.159
L              Leu            C6H11ON           113.084            113.159
K              Lys            C6H12ON2          128.095            128.174
M              Met            C5H9ONS           131.04             131.193
F              Phe            C9H9ON            147.068            147.177
P              Pro            C5H7ON            97.0528            97.1167
S              Ser            C3H5O2N           87.032             87.0782
T              Thr            C4H7O2N           101.048            101.105
W              Trp            C11H10ON2         186.079            186.213
Y              Tyr            C9H9O2N           163.063            163.176
V              Val            C5H9ON            99.0684            99.1326

Sequences consistent with spectrum

Page 22: Proteomics Informatics –

Tandem MS – de novo Sequencing

Mass differences between all pairs of peaks (column peak minus row peak):

      260  292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
 260        32  129  145  244  274  373  403  502  518  615  647  760  762  819
 292             97  113  212  242  341  371  470  486  583  615  728  730  787
 389                  16  115  145  244  274  373  389  486  518  631  633  690
 405                       99  129  228  258  357  373  470  502  615  617  674
 504                            30  129  159  258  274  371  403  516  518  575
 534                                 99  129  228  244  341  373  486  488  545
 633                                      30  129  145  242  274  387  389  446
 663                                           99  115  212  244  357  359  416
 762                                                16  113  145  258  260  317
 778                                                     97  129  242  244  301
 875                                                          32  145  147  204
 907                                                              113  115  172
1020                                                                     2   59
1022                                                                         57


Page 24: Proteomics Informatics –

Tandem MS – de novo Sequencing

Mass differences that match an amino acid residue mass are replaced by the one-letter code:

      260  292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
 260        32    E  145  244  274  373  403  502  518  615  647  760  762  819
 292              P  I/L  212  242  341  371  470  486  583  615  728  730  787
 389                  16    D  145  244  274  373  389  486  518  631  633  690
 405                        V    E  228  258  357  373  470  502  615  617  674
 504                            30    E  159  258  274  371  403  516  518  575
 534                                  V    E  228  244  341  373  486  488  545
 633                                      30    E  145  242  274  387  389  446
 663                                            V    D  212  244  357  359  416
 762                                                16  I/L  145  258  260  317
 778                                                      P    E  242  244  301
 875                                                          32  145    F  204
 907                                                              I/L    D  172
1020                                                                     2   59
1022                                                                          G

[Six X marks on the slide, apparently crossing out candidate assignments.]

Sequence read from the chain of matching differences, in either direction:

…GF(I/L)EEDE(I/L)…

…(I/L)EDEE(I/L)FG…

Peptide M+H = 1166

1166 - 1079 = 87 => S

SGF(I/L)EEDE(I/L)…

1166 - 1020 - 18 = 128 => K or Q

SGF(I/L)EEDE(I/L)(K/Q)
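The lookup behind this table can be sketched in a few lines of Python: take every pair of peaks, compute the difference, and report the residues whose monoisotopic mass lies within a tolerance. The 0.5 Da tolerance and the function names are assumptions chosen for illustration; the peak list is the one from the slide.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}

# Peak list from the slide (integer masses).
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
         875, 907, 1020, 1022, 1079]


def residues_for(delta, tol=0.5):
    """One-letter codes of residues whose mass is within tol (Da) of delta."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)


def annotate_differences(peaks, tol=0.5):
    """Map each peak pair (low, high) to the residues matching their difference."""
    hits = {}
    for lo, hi in combinations(sorted(peaks), 2):
        match = residues_for(hi - lo, tol)
        if match:
            hits[(lo, hi)] = match
    return hits


hits = annotate_differences(PEAKS)
for pair, residues in hits.items():
    print(pair, "/".join(residues))

# Terminal residues from the precursor mass, as on the slide.
print(residues_for(1166 - 1079))       # remaining mass at one end
print(residues_for(1166 - 1020 - 18))  # remaining mass at the other end, minus water
```

The lookup reports every pair whose difference matches a residue, including pairs that mix a b ion with a y ion (for example 292 and 389, which gives P). Such matches have to be discarded when the differences are chained into a sequence, which is one reason background peaks and incomplete ladders make de novo sequencing hard.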

Page 25: Proteomics Informatics –

Tandem MS – de novo Sequencing

Challenges in de novo sequencing

Neutral loss (-H2O, -NH3)

Modifications

Background peaks

Incomplete information


Page 26: Proteomics Informatics –

Tandem MS – Database Search

[Workflow diagram]

Sample: Lysis → Fractionation → Digestion → LC-MS → MS/MS

Search: Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses → Compare, Score, Test Significance

Repeat for all peptides. Repeat for all proteins.
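The two nested loops of this workflow can be sketched as follows. This is a simplified illustration, not the algorithm of any particular search engine: it assumes tryptic cleavage without missed cleavages, singly charged b and y ions, and a score equal to the number of spectrum peaks explained. The database, its sequences, and the function names are hypothetical; the peak list is the example spectrum used earlier in the lecture.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.01056, 1.00728


def digest(protein):
    """Cleave after K or R unless followed by P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mh(peptide):
    """[M+H]+ of the intact peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def fragment_mz(peptide):
    """Singly charged b and y ions for every backbone cleavage site."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    prefix, ions = 0.0, []
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        ions.append(prefix + PROTON)                  # b ion
        ions.append(total - prefix + WATER + PROTON)  # complementary y ion
    return ions


def search(peaks, precursor_mh, database, prec_tol=1.0, frag_tol=0.5):
    """Score every candidate peptide whose mass fits the precursor; best first."""
    results = []
    for name, protein in database.items():   # repeat for all proteins
        for peptide in digest(protein):      # repeat for all peptides
            if abs(peptide_mh(peptide) - precursor_mh) > prec_tol:
                continue
            theoretical = fragment_mz(peptide)
            score = sum(any(abs(p - t) <= frag_tol for t in theoretical)
                        for p in peaks)
            results.append((score, peptide, name))
    return sorted(results, reverse=True)


# Hypothetical database; protB contains a peptide with the same mass as the target.
database = {"protA": "MKSGFLEEDELKAAGR", "protB": "MKSGFIEDEELKWVR"}
peaks = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
         875, 907, 1020, 1022, 1080]
results = search(peaks, 1166, database)
for score, peptide, protein in results:
    print(score, peptide, protein)
```

Both candidates pass the precursor filter because they have the same composition, so only the fragment masses separate them. Whether the best score is significant is a separate question, taken up in the following slides.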

Page 27: Proteomics Informatics –

Search Results

Page 28: Proteomics Informatics –

Significance Testing

False protein identifications are caused by random matching.

An objective criterion for testing the significance of protein identification results is necessary.

The significance of protein identifications can be tested once the distribution of scores for false results is known.

Page 29: Proteomics Informatics –

Significance Testing - Expectation Values

The majority of sequences in a collection will give a score due to random matching.

Page 30: Proteomics Informatics –

Significance Testing - Expectation Values

[Workflow diagram]

M/Z → Database Search → List of Candidates

Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values

Result: List of Candidates With Expectation Values
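One common way to carry out the "extrapolate and calculate expectation values" step is to fit the high-score tail of the score distribution and extend the fit to the score of the best candidate. The sketch below assumes that the number of random candidates scoring at least s falls off exponentially with s, so that its logarithm is a straight line. That assumption, the least-squares fit over all non-top candidates, and the function names are illustrative choices, not necessarily the procedure used in the lecture.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expectation_value(top_score, other_scores):
    """Expected number of random candidates scoring at least top_score.

    other_scores are the scores of all candidates except the top one; they
    are treated as the distribution of scores for random matching.
    """
    ranked = sorted(other_scores, reverse=True)
    # Survival counts: k candidates score at least as high as the k-th best.
    log_counts = [math.log(k) for k in range(1, len(ranked) + 1)]
    slope, intercept = fit_line(ranked, log_counts)
    # Extrapolate the fitted line to the top score.
    return math.exp(intercept + slope * top_score)


# Synthetic scores whose survival function is exactly 1000 * exp(-s).
random_scores = [math.log(1000 / k) for k in range(1, 1001)]
e_value = expectation_value(12.0, random_scores)
print(e_value)
```

An expectation value far below 1 means that a score this high is unlikely to arise from random matching in a collection of this size; a value near or above 1 means the match is not distinguishable from chance.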

Page 31: Proteomics Informatics –

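Rho-diagrams: Overall Quality of a Data Set

Expectation values as a function of score for random matching:

\[ e(s) = \beta \exp(-\alpha s) \]

[Only e(s), exp, and s are legible in the transcript; the constants \alpha and \beta and the sign are assumed.]

Definition: E_i (i = 0, -1, -2, …) is the number of spectra that have been assigned an expectation value between exp(i) and exp(i-1). For random matching, with N the total number of spectra:

\[ E_i = N \int_{\exp(i-1)}^{\exp(i)} de = N\{\exp(i) - \exp(i-1)\} = N \exp(i)\{1 - \exp(-1)\} \]

\[ E_0 = N\{1 - \exp(-1)\} \]

\[ \log(E_i) = \log\bigl(N \exp(i)\{1 - \exp(-1)\}\bigr) = i + \log(E_0) \]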
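The definition of E_i translates directly into a binning step. The sketch below assumes that the rho-diagram plots log(E_i / E_0) against i, so that purely random matching, where expectation values are uniformly distributed and log(E_i) = i + log(E_0), falls on the diagonal. That reading of the diagram and the function name are assumptions.

```python
import math
from collections import Counter


def rho_diagram(expectation_values):
    """Return {i: log(E_i / E_0)} for bins exp(i-1) < e <= exp(i), i = 0, -1, -2, ...

    Expectation values above 1 are ignored.
    """
    counts = Counter(
        math.ceil(math.log(e)) for e in expectation_values if 0 < e <= 1
    )
    e0 = counts.get(0, 0)
    if e0 == 0:
        raise ValueError("no expectation values in the bin exp(-1) < e <= 1")
    return {i: math.log(n / e0) for i, n in sorted(counts.items(), reverse=True)}


# Evenly spaced values on (0, 1) stand in for purely random matching.
m = 100_000
uniform = [(k + 0.5) / m for k in range(m)]
rho = rho_diagram(uniform)
for i in range(0, -5, -1):
    print(i, round(rho[i], 3))
```

For a real data set, points lying above the diagonal at very negative i indicate more low expectation values than random matching would produce, which is what a data set with many true identifications should show.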

Page 32: Proteomics Informatics –

Rho-diagram: Random Matching

[Figure: rho-diagram for random matching; both axes run from -6 to 0; surviving axis label: log(e).]

Page 33: Proteomics Informatics –

Rho-diagram: Data Quality

[Figure: rho-diagram; both axes run from -10 to 0; surviving axis label: log(e).]

Page 34: Proteomics Informatics –

Rho-diagram: Parameters

Page 35: Proteomics Informatics –

How many fragments are sufficient?

To identify an unmodified peptide?

To identify a modified peptide?

To localize a modification on a peptide?

Page 36: Proteomics Informatics –

How many fragments are sufficient?

How does it depend on different parameters?

• Precursor mass
• Precursor mass error
• Fragment mass error
• Background peaks

Page 37: Proteomics Informatics –

Simulations using synthetic spectra

Example peptide: LSDPGVSPAVLSLEMLTDR (from Seq. DB)

• Select a peptide sequence
• Calculate possible fragment ion masses
• Choose number of fragment ions to select
• Randomly select fragment ions
• Search and store result
• Average over peptides


Page 44: Proteomics Informatics –

Simulations using synthetic spectra

• Select a peptide sequence: LSDPGVSPAVLSLEMLTDR (Prot. seq. from Seq. DB)
• Calculate possible fragment ion masses:
  b ions: 1825.92 1710.89 1609.84 1496.76 1365.72 1236.68 1123.59 1036.56 923.48 824.41 753.37 656.32 569.29 470.22 413.20 316.15 201.12 114.09
  y ions: 175.12 290.15 391.19 504.28 635.32 764.36 877.44 964.48 1077.56 1176.63 1247.67 1344.72 1431.75 1530.82 1587.84 1684.89 1799.92 1886.95
• Choose number of fragment ions to select: 5, 6, 7, 8, 9 (here 8)
• Randomly select fragment ions: 201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89
• Search and store result: Search engine with Seq. DB → Identification: LSDPGVSPAVLSLEMLTDR
  Is it significant?
  Is the identified sequence identical to the one used to generate the synthetic data?
• Average over peptides
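The first four steps of the simulation (select a peptide, calculate fragment masses, choose a number, select at random) can be sketched in Python. The sketch assumes singly charged b and y ions with monoisotopic masses, which reproduces the fragment masses listed on the slide to within about 0.01 Da. The search step is left out because it depends on the search engine used; the function names and the random seed are illustrative.

```python
import random

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_masses(peptide):
    """All singly charged b and y ion masses of a peptide (b1..b(n-1), y1..y(n-1))."""
    b, total = [], PROTON
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        b.append(total)
    y, total = [], WATER + PROTON
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        y.append(total)
    return b + y


def synthetic_spectrum(peptide, n_fragments, rng):
    """Randomly select n_fragments of the possible fragment ion masses."""
    return sorted(rng.sample(fragment_masses(peptide), n_fragments))


peptide = "LSDPGVSPAVLSLEMLTDR"
masses = fragment_masses(peptide)
rng = random.Random(0)  # fixed seed so the example is repeatable
spectrum = synthetic_spectrum(peptide, 8, rng)
print([round(m, 2) for m in spectrum])
```

Repeating the selection many times for each number of fragments, searching each synthetic spectrum, and recording whether the original sequence is returned with a significant score gives the probability-of-identification curves shown in the following slides.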

Page 45: Proteomics Informatics –

Simulations using synthetic spectra

Each point is an average of searches with 20 randomly generated synthetic fragment mass spectra.

Threshold

Each point is an average of 50 peptides.

Average over peptides

Page 46: Proteomics Informatics –

Critical number of fragment masses

Page 47: Proteomics Informatics –

Small peptides are slightly more difficult to identify

[Figure: probability of identification (axis 0 to 1.2) versus number of fragment ions (axis 0 to 20); one curve per precursor mass m_precursor: 1000 Da, 1500 Da, 2000 Da, 2500 Da.]

Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification

Page 48: Proteomics Informatics –

A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides

[Figure: probability of identification (axis 0 to 1.2) versus number of fragment ions (axis 0 to 20); one curve per precursor mass error: 0.01 Da, 1 Da, 10 Da.]

m_precursor = 2000 Da, Δm_fragment = 0.5 Da, no modification

Page 49: Proteomics Informatics –

The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides

[Figure: probability of identification (axis 0 to 1.2) versus number of fragment ions (axis 0 to 20); one curve per fragment mass error Δm_fragment: 0.01 Da, 0.5 Da, 1 Da, 2 Da.]

m_precursor = 2000 Da, Δm_precursor = 1 Da, no modification

Page 50: Proteomics Informatics –

A moderate number of background peaks can be tolerated when identifying unmodified peptides

[Figure: probability of identification (axis 0 to 1.2) versus number of fragment ions (axis 0 to 20); one curve per background level: 0%, 50%, 80%.]

m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification

Page 51: Proteomics Informatics –

A large number of background peaks can be tolerated if the fragment mass is accurate

[Figure: probability of identification (axis 0 to 1.2) versus number of fragment ions (axis 0 to 20); one curve per background level: 0%, 50%, 80%.]

m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.01 Da, no modification

Page 52: Proteomics Informatics –

Identification of phosphopeptides is only slightly more difficult

[Figure: probability of identification (axis 0 to 1.2) versus number of fragment ions (axis 0 to 20); curves for phosphorylated and unmodified peptides.]

m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da
