Proteomics Informatics Workshop
Part II: Protein Characterization
David Fenyö
February 18, 2011
• Top-down/bottom-up proteomics
• Post-translational modifications
• Protein complexes
• Cross-linking
• The Global Proteome Machine Database
Proteomics Informatics
[Figure: workflow diagram. Elements: Biological System, Experimental Design, Samples, Sample Preparation, MS and MS/MS Measurements, Data Analysis, Information about each sample (What does the sample contain? How much?), Information Integration, Information about the biological system.]

Sample Preparation
[Figure: sample preparation detail. Elements: Enrichment, Separation etc.; Digestion; Top down (Proteins) and Bottom up (Peptides); Fragmentation; Fragments.]
Top down / bottom up
[Figure: schematic mass spectra (intensity vs. mass/charge) for top down and bottom up.]

Top down / bottom up: Charge distribution
[Figure: intensity vs. mass/charge for top down and bottom up. Charge states labeled 1+, 2+, 3+, 4+ and 27+ to 31+.]
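The relation between neutral mass, charge state, and the mass/charge axis on the charge distribution slide can be sketched as follows. This is a minimal illustration that assumes positive-ion mode with one proton added per charge; the function names and the 30,000 Da protein mass are hypothetical and are not from the slides.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (assumed charge carrier)


def mz(neutral_mass, charge):
    """Return m/z for a molecule of the given neutral mass carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def charge_envelope(neutral_mass, charges):
    """Map each charge state to its m/z value."""
    return {z: mz(neutral_mass, z) for z in charges}


if __name__ == "__main__":
    # A peptide-sized mass (1878 Da appears on the isotope distribution slide).
    for z, value in charge_envelope(1878.0, range(1, 5)).items():
        print(f"peptide  {z}+  m/z = {value:.3f}")
    # A hypothetical 30,000 Da intact protein at the higher charge states.
    for z, value in charge_envelope(30000.0, range(27, 32)).items():
        print(f"protein {z}+  m/z = {value:.3f}")
```

Under these assumptions a large protein at 27+ to 31+ lands in roughly the same m/z region as a small peptide at 1+ to 4+, which is why the charge state has to be known before a mass can be read off the spectrum.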
Top down / bottom up: Isotope distribution
[Figure: isotope distributions (intensity vs. mass/charge) for m = 1035 Da, m = 1878 Da, and m = 2234 Da.]

Top down / bottom up: Fragmentation

Top down / bottom up: Correlations between modifications

Top down / bottom up: Alternative Splicing
[Figure: exons 1, 2, and 3.]

Top down
[Figure: protein mass spectra and fragment mass spectra.]
Kellie et al., Molecular BioSystems 2010
Non-Covalent Protein Complexes
Schreiber et al., Nature 2011
Dynamic Range in Proteomics
Large discrepancy between the experimental dynamic range and the range of amounts of different proteins in a proteome
[Figure: distribution of protein amounts (number of proteins vs. log(protein amount)), with the experimental dynamic range marked.]
The goal is to identify and characterize all components of a proteome
Desired Dynamic Range
Experimental Designs
[Figure: simulated experiment. Elements: Sample, Protein Separation, Digestion, Peptide Separation (number of peptides per "retention time" bin), Mass Spectrometry (peaks m1 to m6 in each fraction, measured within the MS dynamic range), Protein Abundance.]
Parameters in Simulation
● Distribution of protein amounts in sample
● Loss of peptides before binding to the column
● Loss of peptides after elution off the column
● Distribution of mass spectrometric response for different peptides present at the same amount
● Total amount of peptides that are loaded on column (limited by column loading capacity)
● # of peptide fractions
● # of Proteins in each fraction
● Dynamic range of mass spectrometer
● Detection limit of mass spectrometer
Simulation Results for 1D-LC-MS
[Figure: workflow elements: Complex Mixtures of Proteins, Digestion, RPC, MS Analysis. Four panels show number of proteins vs. log(protein amount) for Tissue (log amount 0 to 6) and Body Fluid (log amount 0 to 10), each with no protein separation and with protein separation into 10 fractions.]
Success Rate of a Proteomics Experiment
DEFINITION: The success rate of a proteomics experiment is defined as the number of proteins detected divided by the total number of proteins in the proteome.
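The definition can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions that are not from the slides: protein amounts are drawn from a log-normal distribution, and a protein counts as detected if its log amount is above a detection limit and within the MS dynamic range of the most abundant protein. All parameter names and default values are hypothetical.

```python
import random


def success_rate(n_detected, n_total):
    """Number of proteins detected divided by the total number of proteins."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_detected / n_total


def simulate_success_rate(n_proteins=10000, mean_log_amount=3.0,
                          sd_log_amount=1.0, ms_dynamic_range_log=2.0,
                          detection_limit_log=1.0, seed=0):
    """Toy model: detect proteins within the dynamic range of the top protein."""
    rng = random.Random(seed)
    # log10(amount) of each protein, drawn from a normal distribution (assumption).
    log_amounts = [rng.gauss(mean_log_amount, sd_log_amount)
                   for _ in range(n_proteins)]
    # Lowest detectable log amount: limited by both the detector and the
    # dynamic range measured down from the most abundant protein.
    threshold = max(detection_limit_log, max(log_amounts) - ms_dynamic_range_log)
    n_detected = sum(1 for a in log_amounts if a >= threshold)
    return success_rate(n_detected, n_proteins)


if __name__ == "__main__":
    for dr in (2.0, 3.0, 4.0):
        rate = simulate_success_rate(ms_dynamic_range_log=dr)
        print(f"dynamic range 10^{dr:g}: success rate = {rate:.3f}")
```

The toy model ignores separation, peptide losses, and peptide-specific MS response, all of which are listed as parameters of the simulation in the slides.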
[Figure: distribution of protein amounts (number of proteins vs. log(protein amount)), with the proteins detected highlighted.]
Relative Dynamic Range of a Proteomics Experiment
DEFINITION: RELATIVE DYNAMIC RANGE, RDRx, where x is e.g. 10%, 50%, or 90%
[Figure: fraction of proteins detected vs. log(protein amount), with RDR90, RDR50, and RDR10 marked, shown together with the distribution of protein amounts and the proteins detected.]
Repeat Analysis
[Figure: results for 1 to 8 analyses.]

Repeat Analysis: Comparison of Simulations and Experiments
[Figure: success rate vs. number of repeats (0 to 10) and RDR10 vs. number of repeats (0 to 10), each comparing experiment and simulation; a third panel shows number of proteins vs. log(protein amount).]
Number of Proteins in Mixture
[Figure: success rate and relative dynamic range (RDR50) vs. number of proteins in mixture (1 to 100,000), for Tissue and Body Fluid. The corresponding distributions of protein amounts (number of proteins vs. log(protein amount)) are shown with numbered points 1 and 2.]
Amount loaded and peptide separation
[Figure series: relative dynamic range vs. success rate (both 0 to 1.0) for Tissue, with the distribution of protein amounts (number of proteins vs. log(protein amount)) shown at each numbered step. Two orders of improvement are compared.]

Order:
1. Protein separation
2. Amount loaded
3. Peptide separation

Order:
1. Protein separation
2. Peptide separation
3. Amount loaded

Ranges:
Protein separation: 30000 – 3000 proteins in each fraction
Amount loaded: 0.1 µg – 10 µg
Peptide separation: 100 – 1000 fractions
Phosphopeptide identification
[Figure: probability vs. number of fragment ions (0 to 25).]
Parameters: m(precursor) = 2000 Da, Δm(precursor) = 1 Da, Δm(fragment) = 0.5 Da, phosphorylation

Localization of modifications
[Figure series: probability of localization vs. number of fragment ions (0 to 25). Successive slides add curves to the identification (ID) curve: localization with dmin = 3, dmin = 2, dmin = 1, and d = 1*. Same parameters as above.]
dmin >= 3 for 47% of human tryptic peptides
dmin = 2 for 33% of human tryptic peptides
dmin = 1 for 20% of human tryptic peptides
Localization of modifications
Peptide with two possible modification sites
[Figure: MS/MS spectrum (intensity vs. m/z) matched against the two candidate assignments.]
Which assignment does the data support?
1, 1 or 2, or 1 and 2?
Localization of modifications
Visualization of evidence for localization
[Figure series: evidence for localization shown for the peptide AAYYQK.]
Estimation of global false localization rate using decoy sites
By counting how many times the phosphorylation is localized to amino acids that cannot be phosphorylated, we can estimate the false localization rate as a function of amino acid frequency.
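The counting step can be sketched as below. This is a minimal illustration, assuming that only S, T, and Y are treated as true phosphorylation sites and that every other residue acts as a decoy; the function name and the input format (one residue letter per localization assignment) are hypothetical.

```python
from collections import Counter

# Residues treated as phosphorylatable (assumption: S, T, Y only).
PHOSPHO_ACCEPTORS = frozenset("STY")


def false_localization_frequency(localized_residues):
    """Fraction of all localization assignments that fall on each decoy residue.

    `localized_residues` is an iterable of one-letter amino acid codes, one per
    localization assignment reported by the search.
    """
    counts = Counter(localized_residues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total
            for aa, n in counts.items()
            if aa not in PHOSPHO_ACCEPTORS}


if __name__ == "__main__":
    # Hypothetical assignments: mostly S/T/Y, with a few decoy hits.
    assignments = list("SSSSTTYSSTAGSSTL")
    for aa, freq in sorted(false_localization_frequency(assignments).items()):
        print(f"{aa}: {freq:.3f}")
```

Plotting these per-residue frequencies against how often each amino acid occurs in the peptides gives the kind of relationship the slide describes.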
[Figure: false localization frequency (0 to 0.02) vs. amino acid frequency (0 to 0.15), two panels; Y is labeled.]
[Figure: score distribution with S21 and Sm1 marked.]
How much can we trust a single localization assignment?
If we can generate the distribution of scores for assignment 1 when 2 is the correct assignment, it is possible to estimate the probability of obtaining a certain score by chance for a given peptide sequence and MS/MS spectrum assignment.
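[Equation garbled in extraction. It defines a probability p from integrals of the score distribution F(S21) dS21, with the measured scores Sm1 and Sm2 appearing in the expression; two cases (1. and 2.) are listed.]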
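One simple way to turn a distribution of scores into a chance probability is an empirical p-value. The sketch below is not the formula from the slide, which did not survive extraction; it assumes that higher scores indicate a better match and that a sample of scores for the incorrect assignment is available. The function name is hypothetical.

```python
def empirical_p_value(null_scores, observed_score):
    """Fraction of null scores that are at least as high as the observed score.

    `null_scores` are scores obtained for an assignment known to be incorrect;
    `observed_score` is the measured score being evaluated.
    """
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    n_at_least = sum(1 for s in null_scores if s >= observed_score)
    return n_at_least / len(null_scores)


if __name__ == "__main__":
    # Hypothetical score sample for the incorrect assignment.
    null = [0.5, 1.0, 1.2, 1.9, 2.3, 2.8, 3.1, 3.4, 4.0, 5.2]
    print(empirical_p_value(null, 3.4))
```

A small value means the observed score is unlikely to arise by chance from the incorrect assignment.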
Is it a mixture or not?
If we can generate the distribution of scores for assignment 2 when 1 is the correct assignment, it is possible to estimate the probability of obtaining a certain score by chance for a given peptide sequence and MS/MS spectrum assignment.
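[Figure: score distribution with S12 and Sm2 marked.]
[Equation garbled in extraction. It defines a second probability p from integrals of the score distribution F(S12) dS12, with the measured scores Sm1 and Sm2 appearing in the expression.]
[Decision rules garbled in extraction. Comparing p1 and p2 against a threshold pth leads to one of the outcomes "1 and 2", "1", "2", "1 or 2", or Ø.]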
Protein Complexes
[Figure: protein complexes (components A, B, C, D) analyzed by digestion and mass spectrometry.]
Tackett et al. JPR 2005
Protein Complexes – specific/non-specific binding
Sowa et al., Cell 2009
Protein Complexes – specific/non-specific binding
Protein Complexes – specific/non-specific binding
Choi et al., Nature Methods 2010
Analysis of Non-Covalent Protein Complexes
Taverner et al., Acc Chem Res 2008
Determining the architectures of macromolecular assemblies
Alber et al., Nature 2007
Interaction Partners by Chemical Cross-Linking
[Figure: workflow. Elements: Protein Complex, Chemical Cross-Linking, Cross-Linked Protein Complex, Isolation, Enzymatic Digestion, Proteolytic Peptides, MS (Peptides vs. M/Z), Fragmentation, MS/MS (Fragments vs. M/Z).]

Interaction Sites by Chemical Cross-Linking
[Figure: same workflow elements as on the previous slide.]
Cross-linking
A protein with n peptides with reactive groups has (n-1)n/2 potential ways to cross-link peptides pairwise, plus many additional uninformative forms.
Protein A + IgG heavy chain: 990 possible peptide pairs
Yeast NPC: ~10^6 possible peptide pairs
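The pair count grows quadratically with the number of reactive peptides, which a short calculation makes concrete. The peptide counts below are inferred from the pair counts on the slide, not stated on it, and the function names are hypothetical.

```python
def n_peptide_pairs(n_peptides):
    """Number of ways to pick two different peptides out of n: n(n-1)/2."""
    if n_peptides < 0:
        raise ValueError("n_peptides must be non-negative")
    return n_peptides * (n_peptides - 1) // 2


def peptides_needed_for(n_pairs):
    """Smallest number of peptides that gives at least `n_pairs` pairs."""
    n = 0
    while n_peptide_pairs(n) < n_pairs:
        n += 1
    return n


if __name__ == "__main__":
    # 990 pairs corresponds to 45 reactive peptides.
    print(n_peptide_pairs(45))
    # About 10^6 pairs corresponds to roughly 1400 reactive peptides.
    print(peptides_needed_for(10**6))
```

Each additional reactive peptide adds one new possible pair for every peptide already present, so large assemblies spread the cross-linked signal over very many species.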
Cross-linking
Mass spectrometers have a limited dynamic range, and it is therefore important to limit the number of possible reactions so as not to dilute the cross-linked peptides.
For identification of a cross-linked peptide pair, both peptides have to be sufficiently long and have to give informative fragmentation.
High mass accuracy MS/MS is recommended because the spectrum will be a mixture of fragment ions from two peptides.
Because the cross-linked peptides are often large, CAD is not ideal; ETD is recommended instead.
Search Results
GPMDB
[Figure: sequence-spectrum assignments in GPMDB. Assigned spectra (0 to 250,000,000) vs. year (2005 to 2011, as of Jan 1st).]
Human Genes Observed in GPMDB
[Figure: % genes observed (0 to 100) by cellular component: chromatin, cytoskeleton, E.R., Golgi, lysosome, mitochondrion, nuclear membrane, plasma membrane, ribosome.]
Proteotypic peptide relative composition
[Figure: composition difference (percent, -40 to 40) for each amino acid: N G P D E A V I S T L Y M F H Q K C R W.]
Comparison with GPMDB
Most proteins show very reproducible peptide patterns
Comparison with GPMDB
Global frequency of observing a peptide

Peptide Sequence          Observations
FSTVAGESGSADTVR           2633
FNTANDDNVTQVR             2432
AFYVNVLNEEQR              1722
LVNANGEAVYCK              1701
GPLLVQDVVFTDEMAHFDR       1637
LSQEDPDYGIR               1560
LFAYPDTHR                 1499
NLSVEDAAR                 1400
FYTEDGNWDLVGNNTPIFFIR     1386
ADVLTTGAGNPVGDK           1338

If the number of times a peptide sequence (i) has been observed is n_i, then for a particular protein:
$$N_{\text{total}} = \sum_i n_i$$
Global frequency of observing a peptide
Define a normalized global frequency of observation for a particular peptide sequence from a particular protein as:
$$\omega_i = \frac{n_i}{N_{\text{total}}}$$
Global frequency of observing a peptide (ω)

Peptide Sequence          ω
FSTVAGESGSADTVR           0.08
FNTANDDNVTQVR             0.07
AFYVNVLNEEQR              0.05
LVNANGEAVYCK              0.05
GPLLVQDVVFTDEMAHFDR       0.05
LSQEDPDYGIR               0.04
LFAYPDTHR                 0.04
NLSVEDAAR                 0.04
FYTEDGNWDLVGNNTPIFFIR     0.04
ADVLTTGAGNPVGDK           0.04
Global frequency of observation (ω), catalase
[Figure: ω (0.00 to 0.08) for peptide sequences 1 to 20.]
For any set of peptides observed in an experiment assigned to a particular protein (1 to j):
$$\Omega(\text{protein}) = \sum_j \omega_j$$
$$\Omega(\text{protein}) \le 1$$
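The two quantities can be computed directly from observation counts. The sketch below uses the ten catalase peptide counts listed on the earlier slide; because N_total here covers only those ten peptides rather than every catalase peptide in GPMDB, the resulting ω values are larger than the ones on the slide. The function names are hypothetical.

```python
def omega_values(observations):
    """Normalized global frequency of observation: omega_i = n_i / N_total."""
    n_total = sum(observations.values())
    if n_total == 0:
        raise ValueError("no observations")
    return {peptide: n / n_total for peptide, n in observations.items()}


def protein_omega(omegas, observed_peptides):
    """Omega for a protein: sum of omega over the distinct peptides observed."""
    return sum(omegas[p] for p in set(observed_peptides) if p in omegas)


if __name__ == "__main__":
    # Observation counts for catalase peptides as listed on the slide.
    catalase = {
        "FSTVAGESGSADTVR": 2633,
        "FNTANDDNVTQVR": 2432,
        "AFYVNVLNEEQR": 1722,
        "LVNANGEAVYCK": 1701,
        "GPLLVQDVVFTDEMAHFDR": 1637,
        "LSQEDPDYGIR": 1560,
        "LFAYPDTHR": 1499,
        "NLSVEDAAR": 1400,
        "FYTEDGNWDLVGNNTPIFFIR": 1386,
        "ADVLTTGAGNPVGDK": 1338,
    }
    omegas = omega_values(catalase)
    # A hypothetical experiment that observed three of the peptides.
    seen = ["FSTVAGESGSADTVR", "AFYVNVLNEEQR", "LFAYPDTHR"]
    print(f"Omega = {protein_omega(omegas, seen):.3f}")
```

Since the ω values of a protein sum to 1 by construction, Ω reaches 1 only when every known peptide of the protein is observed.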
Omega (Ω) value for a protein identification

Protein Ω’s for a set of identifications

Protein ID    Ω (z=2)    Ω (z=3)
SERPINB1      0.88       0.82
SNRPD1        0.88       0.59
CFL1          0.81       0.87
SNRPE         0.8        0.81
PPIA          0.79       0.64
CSTA          0.79       0.36
PFN1          0.76       0.61
CAT           0.71       0.78
GLRX          0.66       0.8
CALM1         0.62       0.76
FABP5         0.57       0.17
Proteomics Consultation
Part of Best Practices Integrative Informatics Consultation Service (BPIC) at the NYU Center for Health Informatics and Bioinformatics (CHIBI)
Walk-in Clinic: Wednesday, February 23, 3-5 pm
227 E 30th Street, 7th Floor, Room #739
Proteomics Informatics Workshop
Part III: Protein Quantitation
February 25, 2011
• Metabolic labeling – SILAC
• Chemical labeling
• Label-free quantitation
• Spectrum counting
• Stoichiometry
• Protein processing and degradation
• Biomarker discovery and verification
Proteomics Informatics Workshop
Part I: Protein Identification, February 4, 2011
Part II: Protein Characterization, February 18, 2011
Part III: Protein Quantitation, February 25, 2011