Part II.
Tandem MS
Mass Analyzer (2) – Quadrupole
• Mass filter; a complete spectrum is obtained by scanning the whole range
• Ions are lost
• Mass range 10-4,000 Da
Hybrid Quadrupole/Time-of-Flight (Q-TOF) MS
[Figure: Q1 (selection), Q2 (collision), pusher, TOF with reflectron, detector]
Electrospray MS and MS/MS of Proteins
[Figure: Sample preparation: tissue, fraction, gel; add trypsin to obtain peptides (MPSER…, GTDIMRPAK, …); HPLC; to MS/MS. Tandem mass spectrometer (ESI Q-TOF): the quadrupole mass analyzer selects parent ions (e.g. PAK+), collision produces fragment ions (P+, PA+, K+, AK+, …), which pass through the TOF mass analyzer to the ion detector; the fragment spectrum is used for peptide sequencing.]
How Does a Peptide Fragment?
$m(y_1) = 19 + m(A_4)$
$m(y_2) = 19 + m(A_4) + m(A_3)$
$m(y_3) = 19 + m(A_4) + m(A_3) + m(A_2)$
$m(b_1) = 1 + m(A_1)$
$m(b_2) = 1 + m(A_1) + m(A_2)$
$m(b_3) = 1 + m(A_1) + m(A_2) + m(A_3)$
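The b- and y-ion mass formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `b_ions` and `y_ions` are made up here, the residue masses are standard monoisotopic values for the four residues of the LGER example, and the offsets 1 and 19 are the rounded constants used on the slide.

```python
# Monoisotopic residue masses (Da) for the residues of the LGER example.
RESIDUE_MASS = {"L": 113.08406, "G": 57.02146, "E": 129.04259, "R": 156.10111}

def b_ions(peptide, masses=RESIDUE_MASS):
    """m(b_i) = 1 + m(A_1) + ... + m(A_i), for i = 1 .. n-1 (prefix fragments)."""
    total, out = 1.0, []
    for aa in peptide[:-1]:
        total += masses[aa]
        out.append(total)
    return out

def y_ions(peptide, masses=RESIDUE_MASS):
    """m(y_i) = 19 + mass of the last i residues, for i = 1 .. n-1 (suffix fragments)."""
    total, out = 19.0, []
    for aa in reversed(peptide[1:]):
        total += masses[aa]
        out.append(total)
    return out

print("b ions:", b_ions("LGER"))
print("y ions:", y_ions("LGER"))
```

For LGER the first b-ion is 1 + m(L) and the first y-ion is 19 + m(R), matching the N-terminal and C-terminal ladders on the slides.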
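How MS/MS corresponds to peptide
[Figure: spectrum (m/z axis) for peptide LGER. The b-ions b1, b2, b3 read the sequence from the N-term (L, G, E, R); the y-ions y1, y2, y3 read it from the C-term (R, E, G, L).]
Put both together
[Figure: the b-ion ladder (L G E R) and the y-ion ladder (R E G L) overlaid on one m/z axis]
• In practice, there are many more peaks other than b and y peaks.
• Many b and y peaks may disappear.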
Matching Sequence with Spectrum
[Figure: the peptide sequence LGSSEVEQVQLVVDGVK produces an MS/MS spectrum by tandem mass spectrometry; the sequence is recovered from the spectrum either by de novo sequencing or by searching a database.]
Database Search Methods
• Mascot
– Matrix Science
– General software
• Sequest
– John Yates et al.
– Distributed by Thermo Finnigan.
– Works for Thermo’s LTQ.
• PEAKS
– Bin Ma et al.
– Distributed by Bioinformatics Solutions Inc.
– General software
Mascot
PEAKS
De Novo Sequencing
• De Novo Sequencing (Dancik et al., JCB 6:327-342.)
– Given a spectrum and a mass value M, compute a sequence P such that m(P)=M and the matching score is maximized.
• We take the matching score of P to be the sum of the scores of the matched peaks.
Spectrum Graph Approach
• Convert the peak list to a graph. A peptide sequence corresponds to a path in the graph.
• Bartels (1990), Biomed. Environ. Mass Spectrom 19:363-368.
• Taylor and Johnson (1997). Rapid Comm. Mass Spec. 11:1067-1075. (Lutefisk)
• Dancik et al. (1999), JCB 6:327-342.
• Chen et al. (2001), JCB 8:325-337.
• ……
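A minimal sketch of the spectrum graph idea, under strong simplifying assumptions: the peaks have already been converted to integer prefix masses, the tolerance is zero, and only a small table of nominal residue masses is used. The names `build_spectrum_graph` and `paths` are hypothetical and do not come from any of the cited tools.

```python
# Nominal (integer) residue masses for a subset of amino acids; I and Q are
# left out so that each mass maps to exactly one letter in this toy example.
NOMINAL_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "L": 113,
                "N": 114, "D": 115, "K": 128, "E": 129, "M": 131, "F": 147, "R": 156}

def build_spectrum_graph(node_masses, residue_mass=NOMINAL_MASS):
    """Connect two nodes when their mass difference equals one residue mass."""
    nodes = sorted(set(node_masses))
    by_mass = {mass: aa for aa, mass in residue_mass.items()}
    edges = {u: [] for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            aa = by_mass.get(v - u)
            if aa is not None:
                edges[u].append((v, aa))
    return edges

def paths(edges, start, end):
    """Enumerate the residue strings spelled by all paths from start to end."""
    if start == end:
        return [""]
    out = []
    for v, aa in edges[start]:
        out.extend(aa + rest for rest in paths(edges, v, end))
    return out

# Prefix masses of PAK (0, 97, 168, 296) plus one noise node (150).
graph = build_spectrum_graph([0, 97, 150, 168, 296])
print(paths(graph, 0, 296))
```

Removing the node 97 or 168 from the input breaks the only path, which is the "missing ion" difficulty in miniature.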
Difficulties
• The spectrum graph approach has difficulty handling errors:
– Missing ions break a path.
– Too many peaks within a small error tolerance give too many edges connecting to the same peak (this reduces efficiency).
– Error accumulation.
– A peak may be used as both a y-ion and a b-ion.
• It is still possible to solve these problems under the spectrum graph schema.
– E.g. the y-b overlap problem has been addressed by Dancik et al. (1999) and Chen et al. (2001).
– But things get complicated.
– Reliable signal preprocessing is required.
PEAKS’ approach
• It is more natural and easier to handle errors and noise.
– Less dependent on signal preprocessing.
– Solves the missing-ion and y-b overlap problems naturally.
– Showed great success on real-life lab data.
– Has been licensed by tens of research labs in the public and private sectors.
A simplified case – Counting Only Y-ions
The Score of a Suffix
• Let Q be a suffix of the peptide. It determines some y-ions (y1, y2, y3, …).
• score(Q) is the sum of the scores of those y-ions of Q.
• $DP(m) = \max_{m(Q)=m} score(Q)$
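Recursive Computation of DP(m)
• Suppose Q is such that DP(m) = score(Q), and write Q = aQ’, where a is the first residue of Q.
• $score(Q) = h(m+19) + score(Q')$, where h(m+19) scores the y-ion peak at mass m+19.
• score(Q’) = DP(m(Q’)), so $DP(m) = h(m+19) + DP(m - m(a))$.
• Do not know a? Try every amino acid: $DP(m) = h(m+19) + \max_a DP(m - m(a))$.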
Dynamic Programming
1. for m from 0 to M: $DP(m) = h(m+19) + \max_a DP(m - m(a))$
2. backtracking to decide the optimal peptide.
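The y-ion-only dynamic program can be sketched as follows. This is a toy version under stated assumptions, not the PEAKS implementation: masses are integers, the peak score h is 1 for a listed peak and 0 otherwise, DP(0) is taken to be 0, and the alphabet is restricted to the three residues of the PAK example. The function name `de_novo_y_only` is hypothetical.

```python
import math

def de_novo_y_only(peaks, total_mass, residue_mass):
    """DP(m) = h(m + 19) + max_a DP(m - m(a)), followed by backtracking."""
    def h(x):
        # Toy peak score: 1 if a peak is present at mass x, else 0.
        return 1.0 if x in peaks else 0.0

    dp = [-math.inf] * (total_mass + 1)   # -inf marks unreachable suffix masses
    first = [None] * (total_mass + 1)     # first residue of the best suffix of mass m
    dp[0] = 0.0                           # empty suffix (assumed base case)
    for m in range(1, total_mass + 1):
        best, best_a = -math.inf, None
        for a, ma in residue_mass.items():
            if ma <= m and dp[m - ma] > best:
                best, best_a = dp[m - ma], a
        if best_a is not None:
            dp[m] = h(m + 19) + best
            first[m] = best_a
    if first[total_mass] is None:
        return None, -math.inf
    # Backtracking: peel off the first residue of each optimal suffix.
    seq, m = [], total_mass
    while m > 0:
        seq.append(first[m])
        m -= residue_mass[first[m]]
    return "".join(seq), dp[total_mass]

# Nominal masses for P, A, K; y-ion peaks of PAK at 19 + 128, 19 + 199, 19 + 296.
toy_alphabet = {"P": 97, "A": 71, "K": 128}
print(de_novo_y_only({147, 218, 315}, 296, toy_alphabet))
```

With a full 20-letter alphabet several sequences can tie on this toy score (for example, G+A has the same nominal mass as K), which is one reason real scoring functions use more than y-ion presence.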
PEAKS – The Software
Comparison
• LCQ data (ion trap instrument):
– Generously provided by Dr. Richard Johnson. 144 spectra.
• Micromass Q-Tof data:
– Measured in UWO’s Protein ID lab. 61 spectra.
• Sciex Q-Star data:
– Provided by U. Victoria’s Genome BC Proteomics Centre. 13 good/okay spectra.
PEAKS vs. Lutefisk
• completely correct sequences:
– 38/144 vs. 15/144
• correct amino acids:
– 1067/1702 vs. 767/1702
• partially correct sequences with 5 or more contiguous correct amino acids:
– 94/144 vs. 64/144
PEAKS vs. Micromass PLGS
• completely correct sequences:
– 23/61 vs. 7/61
• correct amino acids:
– 559/764 vs. 232/764
• partially correct sequences with 5 or more contiguous correct amino acids:
– 50/61 vs. 24/61
PEAKS vs. Sciex BioAnalyst
• completely correct sequences:
– 7/13 vs. 1/13
• correct amino acids:
– 115/150 vs. 86/150
• partially correct sequences with 5 or more contiguous correct amino acids:
– 12/13 vs. 7/13
Post Translational Modification (PTM)
PTM
• PTMs are important to the functions of proteins.
• There are more than 500 types of PTMs included in the unimod PTM database.
• For example: reversible phosphorylation of proteins is an important regulatory mechanism. Many enzymes are switched "on" or "off" by phosphorylation and dephosphorylation, through the structural change caused by the PTM.
Phosphorylation
[Figure: structures of S, T and Y and of their phosphorylated forms pS, pT and pY]
Monoisotopic mass change: PO3H = 79.966
PTM increases complexity
• Most protein databases do not have the PTM information. Therefore, when PTMs are present, one has to try different PTM possibilities to match a peptide with a spectrum.
• For peptide LGSSEVTMVYLK, if only phosphorylation is considered, there are 16 possibilities (4 candidate sites S, S, T, Y, each modified or not).
– What if there are 10 possible PTM sites?
• PTMs of this type are called variable PTMs.
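The count of 16 for LGSSEVTMVYLK comes from four candidate sites (S, S, T, Y), each either phosphorylated or not. A small sketch that enumerates them is shown here; the function name `phospho_variants` and the "pS"-style labels are illustrative choices, and the mass shift per site is the 79.966 quoted on the phosphorylation slide.

```python
from itertools import product

PHOSPHO_SITES = {"S", "T", "Y"}
PHOSPHO_MASS = 79.966  # monoisotopic mass change per phosphorylation

def phospho_variants(peptide):
    """Return (labelled sequence, total mass shift) for every on/off site combination."""
    site_idx = [i for i, aa in enumerate(peptide) if aa in PHOSPHO_SITES]
    variants = []
    for choice in product([False, True], repeat=len(site_idx)):
        modified = {i for i, on in zip(site_idx, choice) if on}
        label = "".join("p" + aa if i in modified else aa
                        for i, aa in enumerate(peptide))
        variants.append((label, len(modified) * PHOSPHO_MASS))
    return variants

variants = phospho_variants("LGSSEVTMVYLK")
print(len(variants))   # number of phosphorylation patterns
print(variants[-1])    # the fully phosphorylated form
```

The number of variants doubles with each additional site, so 10 possible sites would already give 2^10 = 1024 candidates per peptide.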
Fixed PTM
• Some PTMs are known to be present all the time.
– These are called fixed PTMs.
• Oxidation of M. Mass +16.
– It happens automatically in the air, so people often make sure that all of the M residues are oxidized.
• Carboxyamidomethyl cysteine (CamC). Mass +57.02.
– This modification is added intentionally to break the disulphide bonds.
• Fixed PTMs are easier.
Variable PTM in DB Search and DeNovo
• For DB search, have to try different combinations.
• For De Novo, each variable PTM is like adding a new amino acid.
– For example, if pS, pT, pY are variable, then instead of having 20 characters in the alphabet, we have 23.
– But too many variable PTMs will reduce the accuracy of the de novo sequencing.
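For de novo sequencing, treating a variable PTM as an extra letter amounts to extending the residue mass table. The sketch below assumes standard monoisotopic residue masses and the 79.966 phosphorylation shift; the helper name `extend_with_phospho` and the "pS"/"pT"/"pY" keys are illustrative.

```python
# Standard monoisotopic residue masses (Da) of the 20 amino acids.
STANDARD_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PHOSPHO_MASS = 79.966

def extend_with_phospho(masses, sites=("S", "T", "Y")):
    """Add one new 'letter' per phosphorylatable residue (pS, pT, pY)."""
    extended = dict(masses)
    for aa in sites:
        extended["p" + aa] = masses[aa] + PHOSPHO_MASS
    return extended

alphabet = extend_with_phospho(STANDARD_MASS)
print(len(STANDARD_MASS), "->", len(alphabet))
```

Every added letter enlarges the search space of the dynamic program, which is the accuracy cost mentioned in the last bullet.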
Peptide Identification vs. Protein Identification
Common procedure for protein ID
[Figure: proteins are digested into peptides; MS/MS and peptide sequencing identify the peptides (PAK, MSER, LSIMQDK, HIHGCDK, EIIGNEVR, …); the peptides are matched against a protein database (>protein A PAKGTIRHIHGCDKRGAPWPAS…, >protein B MSERNHLREIIGNEVR…, >protein C LSIMQDKDYSASFIS…) to give the protein ID.]
Problems
• A peptide appears in several proteins.
• A protein family may share many peptides.
– Usually only one of them is true.
• A protein may have only one peptide or two weak peptides: is it a true or a false positive?
– The “one hit wonder”.
Estimate False Positives
• Suppose you have a score for each identified protein. You want to choose a score threshold T.
– Score > T: positive (keep)
– Score <= T: negative (discard)
• It is important to estimate the false positive rate for each given result.
• False Positive Rate
– In statistics, FPR = #false positives / #real negatives = FP/(FP+TN).
– We care more about FPR = #false positives / #results reported as positives = FP/(FP+TP), often called the false discovery rate.
                     Positive (prediction)   Negative (prediction)
Positive (reality)   TP                      FN
Negative (reality)   FP                      TN
The two definitions are different!
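To see how far apart the two definitions can be, here is a small sketch. The function names are illustrative, and the confusion-matrix counts in the usage lines are invented purely for the example; they are not data from the lecture.

```python
def rate_vs_real_negatives(fp, tn):
    """Statistics-style false positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def rate_vs_reported_positives(fp, tp):
    """Fraction of reported positives that are false: FP / (FP + TP)."""
    return fp / (fp + tp)

# Hypothetical counts, chosen only to illustrate the difference.
tp, fp, tn = 90, 10, 900
print(rate_vs_real_negatives(fp, tn))      # small, because true negatives dominate
print(rate_vs_reported_positives(fp, tp))  # the share of the reported list that is wrong
```

With many true negatives the first number stays small even when a noticeable share of the reported identifications is false, which is why the second definition matters more for a reported protein list.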
Decoy Database Method
• Choose a decoy database:
– for example, reverse the database.
• Anything from this database is false.
• Search in a real database and a decoy database separately.
– For the same T, if there are x proteins in the decoy database with score > T, then perhaps there are x false proteins in the real database with score > T.
• Threshold T:
– real db has 497 proteins > T,
– decoy db has 7 proteins > T,
– False positive rate is 7/497 = 1.4%.
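The decoy estimate is a ratio of two counts at the same threshold. A minimal sketch follows; the function name `decoy_false_positive_rate` and the score lists are hypothetical, constructed only so that the counts reproduce the 497 and 7 of the slide.

```python
def decoy_false_positive_rate(real_scores, decoy_scores, threshold):
    """Estimate (#decoy hits > T) / (#real hits > T) at a common threshold T."""
    real_hits = sum(1 for s in real_scores if s > threshold)
    decoy_hits = sum(1 for s in decoy_scores if s > threshold)
    if real_hits == 0:
        raise ValueError("no real-database hits above the threshold")
    return decoy_hits / real_hits

# Synthetic scores reproducing the slide's counts: 497 real and 7 decoy hits above T.
real_scores = [10.0] * 497 + [1.0] * 100
decoy_scores = [10.0] * 7 + [1.0] * 590
rate = decoy_false_positive_rate(real_scores, decoy_scores, threshold=5.0)
print(f"{rate:.1%}")
```

Scanning the threshold and recomputing this ratio is the usual way to pick T for a target error level, assuming the decoy hits behave like the false hits in the real database.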
Problems
• Only works for large datasets.
– Not statistically significant when the dataset is small.
• Does not care how many proteins are actually kept.
– Keeping only the true results is not our only goal; we also want to keep as many true results as possible.
• A decoy database is only good for validation and cannot substitute for a good scoring method.
SPIDER – listen to both parties!
The solution when there is no protein database and no perfect MS/MS.
Chinese proverb: “Listen to both sides and you will be enlightened; heed only one side and you will be benighted.”
[Figure labels: de novo sequencing (EISGNEVR); database search against a protein DB (ESIGNEVR); homology search against a protein DB (ESIGSEVR)]
PEAKS: Ma et al., Rapid Comm. Mass Spec. 2003
PatternHunter: Ma, Tromp and Li, Bioinformatics 2002
SPIDER: Han, Ma and Zhang, JBCB 2005
Two purposes of our research
1. Given a de novo sequence with errors, find a homolog of the real sequence. (searching)
2. Using the de novo sequence and the homolog as input, compute the real sequence. (sequencing)
[Figure: de novo LSCFAV; real EACFAV; homolog DACFKAV]
Homology mutations
• Sequence alignment
EACF-AVQR
DACFKAV-R
• Also called edit distance
$cost = dist(E,D) + 2 \cdot indel$
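The alignment cost can be computed with the standard edit-distance dynamic program. The sketch below uses unit costs as an assumption (indel = 1, substitution = 1 for a mismatch); the function name `edit_distance` and the `dist` default are illustrative, and a substitution matrix could be plugged in through `dist` instead.

```python
def edit_distance(x, z, indel=1.0, dist=lambda a, b: 0.0 if a == b else 1.0):
    """Minimum total cost of substitutions (dist) and insertions/deletions (indel)."""
    n, k = len(x), len(z)
    D = [[0.0] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, k + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            D[i][j] = min(
                D[i - 1][j] + indel,                          # x[i] against a gap
                D[i][j - 1] + indel,                          # z[j] against a gap
                D[i - 1][j - 1] + dist(x[i - 1], z[j - 1]),   # x[i] against z[j]
            )
    return D[n][k]

# The aligned pair from the slide, with gaps removed.
print(edit_distance("EACFAVQR", "DACFKAVR"))
```

Under unit costs the result corresponds to one substitution (E/D) plus two indels (K and Q), the same structure as the cost written on the slide.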
Common de novo sequencing errors
• Same-mass replacement: AN? NA? GAG?
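Same-mass replacements are easy to check numerically. The sketch below uses standard monoisotopic residue masses; the helper names `mass` and `same_mass` and the tolerance values are assumptions for illustration.

```python
# Monoisotopic residue masses (Da) for the residues used in the examples.
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
        "L": 113.08406, "N": 114.04293, "E": 129.04259}

def mass(seq):
    """Sum of residue masses of a short sequence block."""
    return sum(MONO[aa] for aa in seq)

def same_mass(a, b, tol=0.05):
    """True if two blocks are indistinguishable within a mass tolerance (Da)."""
    return abs(mass(a) - mass(b)) <= tol

print(mass("AN"), mass("NA"), mass("GAG"))   # three blocks with the same total mass
print(mass("LS"), mass("EA"))                # close, but not identical
print(same_mass("LS", "EA", tol=0.05), same_mass("LS", "EA", tol=0.01))
```

AN, NA and GAG agree to within rounding, while LS and EA differ by a few hundredths of a dalton, so whether they are confusable depends on the instrument's mass accuracy.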
Two exercises
Exercise 1 (m(LS) = m(EA) = 200.1 mu):
(denovo)  X: LSCFV
(real)    Y: EACFV
(homolog) Z: DACFV
Exercise 2:
(denovo)  X: LSCFAV
(real)    Y: SLCFAV
(homolog) Z: SLCF-V
BLOSUM62
More formally
[Figure: X –seqError– Y –editDist– Z]
• Let
– $d_s(X,Y)$ = cost of sequencing errors between X and Y
– $d_e(Y,Z)$ = edit distance between Y and Z
• Sequencing: Given de novo sequence X and homolog Z, find Y such that $d_s(X,Y) + d_e(Y,Z)$ is minimized.
• Let $d(X,Z) = \min_Y \left( d_s(X,Y) + d_e(Y,Z) \right)$
• Searching: search a database for Z such that d(X,Z) is minimized.
How to compute ds(X,Y)
• Easily align X and Y together (according to mass).
(denovo) X: LSCFAV
(real)   Y: EACFAV
• For each erroneous mass block with mass $m_i$, define the cost to be $f(m_i)$.
• Define $d_s(X,Y) = \sum_i f(m_i)$
How to compute d(X,Z)
• A multiple alignment can be built from alignments (X,Y) and (Y,Z).
(denovo)  X: LSCF-AV
(real)    Y: EACF-AV
(homolog) Z: DACFKAV
• Lemma: $d(X,Z) = \sum_i d(X_i, Z_i)$
• Dynamic Programming!
• Let $D(i,j) = d(X[1..i], Z[1..j])$
Four cases of the last Block
• (A), (B), (C): no sequencing error
– $D(i,j) = D(i,j-1) + indel$
– $D(i,j) = D(i-1,j) + indel$
– $D(i,j) = D(i-1,j-1) + dist(X[i], Z[j])$
• Last block with a sequencing error:
– $D(i,j) = D(i'-1,j'-1) + \alpha(X[i'..i], Z[j'..j])$
D(i,j) is the minimum of the four cases.
How to compute $\alpha(A,C)$
• With $m = m(A)$: $\alpha(A,C) = f(m) + \min_{B:\, m(B)=m} d_e(B,C) = f(m) + \delta(m,C)$
• Here $\delta(m,C) = \min_{B:\, m(B)=m} d_e(B,C)$
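Three cases of the alignment
$\delta(m,C)$ is the minimum of three cases, where n = |C| and b is the last residue of B:
(1) C[n] is aligned to a gap: $\delta(m, C[1..n-1]) + indel$
(2) b is aligned to a gap: $\min_b \delta(m - m(b), C) + indel$
(3) b is aligned to C[n]: $\min_b \delta(m - m(b), C[1..n-1]) + dist(b, C[n])$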
The algorithm for computing $\delta(m, Z[i..j])$
1. for m from 0 to m(X) step Δ
   for i from 0 to |Z|
   for j from i to |Z|
   $\delta(m, Z[i..j]) = \min \begin{cases} \delta(m, Z[i..(j-1)]) + indel \\ \min_y \delta(m - m(y), Z[i..j]) + indel \\ \min_y \delta(m - m(y), Z[i..(j-1)]) + dist(y, Z[j]) \end{cases}$
Time complexity: $O(Mn^2)$
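The quantity computed on this slide, the smallest edit distance between a homolog segment C and any sequence B of a given mass m, can be sketched with a memoized recursion over the three alignment cases. This is a toy version under stated assumptions, not the SPIDER code: masses are integers, costs are unit costs, the alphabet is a three-letter table, and the function name `delta` is chosen here for illustration.

```python
import math
from functools import lru_cache

def delta(m, C, residue_mass, indel=1.0, dist=lambda a, b: 0.0 if a == b else 1.0):
    """min over sequences B with m(B) = m of the edit distance between B and C."""
    @lru_cache(maxsize=None)
    def go(mass, n):
        # n is the length of the prefix C[0:n] still to be aligned.
        if mass == 0 and n == 0:
            return 0.0
        best = math.inf
        if n > 0:
            best = min(best, go(mass, n - 1) + indel)            # C[n] against a gap
        for b, mb in residue_mass.items():
            if mb <= mass:
                best = min(best, go(mass - mb, n) + indel)        # b against a gap
                if n > 0:                                         # b against C[n]
                    best = min(best, go(mass - mb, n - 1) + dist(b, C[n - 1]))
        return best
    return go(m, len(C))

toy_alphabet = {"P": 97, "A": 71, "K": 128}   # nominal masses, illustration only
print(delta(296, "PAK", toy_alphabet))   # B = PAK itself has the right mass
print(delta(199, "PAK", toy_alphabet))   # best B has one residue fewer than C
print(delta(5, "K", toy_alphabet))       # no sequence has mass 5
```

This version only handles prefixes of one string C; covering all substrings Z[i..j], as the slide's triple loop does, means repeating the same recursion for each start position i.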
The algorithm for computing d(X,Z) and Y
1. for i from 1 to |X|
   for j from 1 to |Z|
   $D(i,j) = \min \begin{cases} D(i-1,j) + indel \\ D(i,j-1) + indel \\ D(i-1,j-1) + dist(X[i], Z[j]) \\ \min_{i',j'} D(i'-1,j'-1) + \delta(m(X[i'..i]), Z[j'..j]) + f(m(X[i'..i])) \end{cases}$
2. output D(|X|,|Z|) as d(X,Z).
3. backtracking to get the best middle sequence Y.
Time complexity: $O(n^4)$
Total time complexity: $O(n^4 + Mn^2)$
Experiment
• 28 spectra from ALBU_BOVIN.
• PEAKS de novo sequencing gives 13 correct and 15 partially correct sequences.
• SPIDER found good peptide homologues in the human protein DB for all.
• 24 constructed correct peptide sequences.
[Figure: example with PEAKS de novo result EAEGNEVR, human DB homolog ESIGSEVR, and ALBU_BOVIN peptide ESIGNEVR. Counts shown: 28 spectra; PEAKS 13+15; homologs found 28; SPIDER 24+4.]
Two exemplary results
(denovo)  X: CCQ[W ]DAEAC[AF]<NN><PG>K
(real)    Y: CCK AD DAEAC FA VE GP K
(homolog) Z: CCK[AD]DKETC[FA]<EE><GK>K

(denovo)  X: FVE<RDG>LVTD[TL]K
(real)    Y: FVE VTK LVTD LT K
(homolog) Z: FAE<VSK>LVTD[LT]K
homology mutations
sequencing errors
Four modes in SPIDER
• Homology mode
• Non-gapped homology mode
– Assume sequencing errors and homology mutations do not overlap.
• Segment match mode
– Assume no homology mutations.
• Exact match mode
– Assume no sequencing errors or homology mutations.
Experiment
• 144 ion trap MS/MS spectra, lower quality spectra.
• The proteins are all in Swissprot but not in the human database.
• PEAKS 2.0 was used for de novo sequencing.
• SPIDER searches the Swissprot and human databases, respectively.
People like SPIDER
• Best Paper Award at CSB2004
• Some random emails we received
– “I'm a big SPIDER fan!” Shinichi Iwamoto, Shimadzu Corporation
– “The results I've been getting have been consistently very good. Thank you for this great piece of software!” Jason W. H. Wong, University of Oxford
– “Your software is by far the fastest and more user-friendly I have found.” Juan Luis, University of Georgia
– ……
– “I plan to teach SPIDER in my Advanced Bioinformatics class. I wonder if your powerpoint slides are available?” Pavel Pevzner, Ronald R. Taylor Professor of Computer Science, UCSD
• Included in PEAKS as both a separate tool and an intermediate step in protein candidate generation.
• The best is yet to come
– People have just started using the de novo + homology approach.