RIBRA: An Error-Tolerant Algorithm for the NMR Backbone Assignment Problem
Jia-Ming Chang 0509 Graph Algorithms and Their Applications to Bioinformatics
1/40
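Determine Protein Structure
X-ray
• Wavelength of about 1 Å, close to the distance between atoms.
• Used to study the behavior of molecules in the crystalline state and to determine their crystal structures, including protein structures.
• X-ray and structural biology: X-ray diffraction is used to determine the spatial position of every group and atom in a highly purified, crystallized protein.
Nuclear magnetic resonance (NMR)
• NMR is a process that involves absorption by atomic nuclei. Certain nuclei have spin and a magnetic moment, so when they are exposed to a strong magnetic field they absorb electromagnetic radiation; this results from the energy-level splitting induced by the field.
• Scientists also found that the molecular environment affects how nuclei in a magnetic field absorb radio waves, and this property is used to analyze molecular structure.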
AVANCE 800 AV IBMS, Sinica 2/40
NMR – Nuclear Spin (1/5)
3/40
NMR – Nuclear Spin (2/5)
4/40
NMR - Magnetic Field (3/5)
5/40
NMR – Resonance (4/5)
6/40
NMR – Chemical Shift (5/5)
7/40
Chemical Shift Assignment (1/2)
Find the chemical shift of each atom.
• Backbone: Ca, Cb, C', N, NH
• HSQC, CBCANH, CBCA(CO)NH
[Figure: the atoms of one amino acid]
8/40
Chemical Shift Assignment (2/2)
[Figure: peptide backbone (-N-C-C-N-C-C-N-C-C-N-C-C-) with side chains, annotated with carbon chemical shift ranges in ppm: 18-23, 19-24, 16-20, 17-23, 31-34, 55-60; CH3 30-35]
9/40
HSQC Spectra
HSQC peaks (1 chemical shift for one amino acid)
H      N       Intensity
8.109  118.60  65920032
10/40
CBCA(CO)NH Spectra
CBCA(CO)NH peaks (2 chemical shifts for one amino acid)
H      N       C      Intensity
8.116  118.25  16.37  79238811
8.109  118.60  36.52  65920032
11/40
CBCANH Spectra
CBCANH peaks (4 chemical shifts for one amino acid)
Ca (+), Cb (-)
H      N       C      Intensity
8.116  118.25  16.37  79238811
8.109  118.60  36.52  -65920032
8.117  118.90  61.58  -51223894
8.119  117.25  57.42  109928374
12/40
A Dataset Example
[Figure: peaks of the HSQC, HNCACB, and CBCA(CO)NH spectra plotted on N and H axes]
13/40
A Perfect Spin System Group
CBCA(CO)NH
N        H      C       Intensity      Atom
113.293  7.897  56.294  1.64325e+008   Ca (i-1)
113.293  7.897  27.853  1.08099e+008   Cb (i-1)

CBCANH
N        H      C       Intensity      Atom
113.293  7.92   62.544  8.52851e+007   Ca (i)
113.293  7.92   56.294  4.71331e+007   Ca (i-1)
113.293  7.92   68.483  -8.54121e+007  Cb (i)
113.293  7.92   28.165  -3.49346e+007  Cb (i-1)

Resulting spin system
Ca(i-1)  Cb(i-1)  Ca(i)   Cb(i)
56.294   28.165   62.544  68.483
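The grouping on this slide can be sketched in a few lines of Python. This is only an illustration, not the RIBRA implementation: the names `Peak`, `SpinSystem`, and `build_spin_system`, and the 0.5 ppm tolerance, are assumptions made for the example. It uses the sign rule from the CBCANH slide (Ca positive, Cb negative) and treats a CBCANH peak as belonging to residue i-1 when its carbon shift also appears in CBCA(CO)NH.

```python
from typing import NamedTuple, Optional


class Peak(NamedTuple):
    n: float
    h: float
    c: float
    intensity: float


class SpinSystem(NamedTuple):
    ca_prev: Optional[float]
    cb_prev: Optional[float]
    ca: Optional[float]
    cb: Optional[float]


def build_spin_system(cbcaconh, cbcanh, tol=0.5):
    """Combine one group of CBCA(CO)NH and CBCANH peaks into a spin system.

    tol (ppm) is a hypothetical matching tolerance, not a value from the slides.
    """
    prev_shifts = [p.c for p in cbcaconh]  # CBCA(CO)NH only sees residue i-1
    slots = {"ca_prev": None, "cb_prev": None, "ca": None, "cb": None}
    for p in cbcanh:
        atom = "ca" if p.intensity > 0 else "cb"  # sign rule: Ca (+), Cb (-)
        is_prev = any(abs(p.c - s) <= tol for s in prev_shifts)
        key = atom + "_prev" if is_prev else atom
        if slots[key] is not None:
            raise ValueError("ambiguous group: two peaks compete for " + key)
        slots[key] = p.c
    return SpinSystem(**slots)


# Peak values taken from the slide above.
cbcaconh = [Peak(113.293, 7.897, 56.294, 1.64325e8),
            Peak(113.293, 7.897, 27.853, 1.08099e8)]
cbcanh = [Peak(113.293, 7.92, 62.544, 8.52851e7),
          Peak(113.293, 7.92, 56.294, 4.71331e7),
          Peak(113.293, 7.92, 68.483, -8.54121e7),
          Peak(113.293, 7.92, 28.165, -3.49346e7)]
group = build_spin_system(cbcaconh, cbcanh)
print(group)
```

A group with missing or extra peaks either leaves a slot empty or raises an error here; those are the false negative and ambiguous cases that the later slides deal with.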
14/40
Coding
Translate the target protein sequence and spin systems into coding sequences based on the following table.
Atreya, H.S., K.V.R. Chary, and G. Govil, Automated NMR assignments of proteins for high throughput structure determination: TATAPRO II. Current Science, 2002. 83(11): p. 1372-1376.
15/40
Backbone Assignment
Goal
• Assign chemical shifts to N, NH, Ca (and Cb) along the protein backbone.
General approaches
• Generate spin systems
○ A spin system: an amino acid with known chemical shifts on its N, NH, Ca (and Cb).
• Link spin systems
16/40
17 /40
Backbone Assignment
DGRIGEIKGRKTLATPAVRRLAMENNIKLS
18 /40
Blind Men's Elephant
• We cannot directly "see" the positions of these atoms (the 3D structure).
• But we can measure a set of parameters (with constraints) on these atoms, which can help us infer their coordinates.
• Each experiment can only determine a subset of the parameters (with noise).
• To combine the parameters of different experiments, we need to stitch them together.
A Peculiar Parking Lot (valet parking)
• Information you have: the make of your car and of the car parked in front of you (approximately).
• Together with others, try to identify as many cars as possible (maximizing the overall satisfaction).
19 /43
Ambiguities
All 4 point experiments are mixed together
All 2 point experiments are mixed together
Each spin system can be mapped to several amino acids in the protein sequence
False positives, false negatives
20/40
Multiple Candidates
One spin system may be assigned to many places in a protein sequence.
Protein sequence: AKFERQHMDSSTSRNLTKDR
Spin system (SS)
N      H     Ca(i-1)  Cb(i-1)  Ca(i)  Cb(i)
119.7  8.84  58.4     32.7     56.3   40.8
[Figure: SS marked at four possible places along the protein sequence]
21/40
False Positives and False Negatives
• False positives
○ Noise with high intensity
○ Produce fake spin systems
• False negatives
○ Peaks with low intensity
○ Missing peaks
• In real wet-lab data, nearly 50% is noise (false positives).
22/40
False Positive & False Negative
[Figure: HSQC, HNCACB, and CBCA(CO)NH peaks on N and H axes for three cases: perfect, false negative, and false positive]
23/40
Ambiguous Spin System
N       H     C      Intensity
106.9   8.87  54.92  423879
106.9   8.87  40.35  524522

N       H     C      Intensity
106.91  8.85  59.7   235673
106.92  8.86  54.93  346234
106.91  8.86  61.5   432432
106.91  8.85  40.31  -335759
106.92  8.86  30.5   -483759

Two possible spin systems
N      H     Ca(i-1)  Cb(i-1)  Ca(i)  Cb(i)
106.1  8.85  54.93    40.31    59.7   30.5
106.1  8.85  61.5     40.31    59.7   30.5
24/40
Spin System Group
Nearest neighboring (TATAPRO, RIBRA, GASA)
[Figure: peaks of the HSQC, HNCACB, and CBCA(CO)NH spectra on N and H axes, grouped by nearest neighbor]
25/40
Spin System Linking
Goal
• Link spin systems into chains that are as long as possible.
Constraints
• Each spin system is uniquely assigned to a position of the target protein sequence.
• Two spin systems are linked only if the chemical shift differences of their intra- and inter-residues are less than the predefined thresholds.
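The linking constraint can be written as a small predicate. The sketch below is an illustration under assumptions: the `SS` tuple layout, the function name `can_link`, the 0.4 ppm thresholds, and the example shift values are all hypothetical, since the slides do not give the actual thresholds. Spin system `a` may precede `b` when the intra-residue shifts of `a` (Ca(i), Cb(i)) agree with the inter-residue shifts of `b` (Ca(i-1), Cb(i-1)).

```python
from typing import NamedTuple, Optional


class SS(NamedTuple):
    ca_prev: Optional[float]
    cb_prev: Optional[float]
    ca: Optional[float]
    cb: Optional[float]


def _close(x, y, tol):
    # Two missing shifts (e.g. glycine has no Cb) count as agreeing;
    # one missing shift counts as a mismatch in this strict sketch.
    if x is None or y is None:
        return x is None and y is None
    return abs(x - y) <= tol


def can_link(a, b, ca_tol=0.4, cb_tol=0.4):
    """True if spin system a can be placed immediately before b.

    ca_tol and cb_tol (ppm) are hypothetical thresholds.
    """
    return _close(a.ca, b.ca_prev, ca_tol) and _close(a.cb, b.cb_prev, cb_tol)


# Made-up shifts for illustration only.
a = SS(ca_prev=58.4, cb_prev=32.7, ca=56.3, cb=40.8)
b = SS(ca_prev=56.2, cb_prev=40.9, ca=62.5, cb=68.5)
forward = can_link(a, b)   # a's own shifts match b's i-1 shifts
backward = can_link(b, a)  # b's own shifts do not match a's i-1 shifts
print(forward, backward)
```

Treating a single missing shift as a mismatch is the strict choice; tolerating it instead is one way to admit spin systems with false negatives.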
26/40
Previous Approaches
Constrained bipartite matching problem*
• Cannot deal with ambiguous links
[Figure: a legal matching and an illegal matching under constraints]
*Xu Y, Xu D, Kim D, Olman V, Razumovskaya J, Jiang T. Automated assignment of backbone NMR peaks using constrained bipartite matching. Computing in Science & Engineering 2002;4(1):50-62.
27/40
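Natural Language Processing: Noise or Ambiguity?
Speech recognition: homophone selection
Spoken sentence: 台北市一位小孩走失了 ("A child has gone missing in Taipei City")
Homophone candidates: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), 移位 (shift position)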
28/40
An Error-Tolerant Algorithm
29/40
Phrase, Sentence Combination
30/40
Spin System Positioning
We assign spin system groups to a protein sequence according to their codes.
Sequence codes: D 50, G 10, R 40, I 50|51
Spin systems
Ca(i-1)  Cb(i-1)  Ca(i)   Cb(i)      Codes
55.266   38.675   44.555  0       => 50 10
44.417   0        55.043  30.04   => 10 40
44.417   0        30.665  28.72   => 10 40
55.356   29.782   60.044  37.541  => 40 50
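Code-based positioning can be illustrated as follows. This is a sketch under assumptions, not the authors' code: the function name `candidate_positions` is hypothetical, and "I 50|51" is read as "isoleucine may be coded 50 or 51". A spin system coded (previous, current) is a candidate for position i whenever residue i-1 accepts the first code and residue i accepts the second.

```python
def candidate_positions(sequence, residue_codes, ss_code):
    """Return 0-based positions i where a spin system coded
    (code of residue i-1, code of residue i) fits the sequence."""
    prev_code, cur_code = ss_code
    hits = []
    for i in range(1, len(sequence)):
        if (prev_code in residue_codes[sequence[i - 1]]
                and cur_code in residue_codes[sequence[i]]):
            hits.append(i)
    return hits


# Codes from the slide; a set models the alternative "50|51" for I (assumption).
residue_codes = {"D": {50}, "G": {10}, "R": {40}, "I": {50, 51}}
sequence = "DGRI"
spin_codes = [(50, 10), (10, 40), (10, 40), (40, 50)]

placements = [candidate_positions(sequence, residue_codes, c) for c in spin_codes]
print(placements)
```

The second and third spin systems both land on R, which is the kind of competition that linking and conflict resolution have to settle.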
31/40
Link Spin System groups
[Figure: the four spin system groups placed under residues D G R I and linked into Segment 1, Segment 2, and Segment 3]
32/40
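Iterative Concatenation
[Figure: spin systems (1, 2, …, 56) are concatenated step by step along the sequence DGRI….FKJJREKL. Step 1 starts from the individual spin systems; Step 2 yields Segment 1, Segment 2, Segment 3, …; Step n-1 yields Segment 78, Segment 79, …; Step n yields Segment 99.]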
33/40
Conflict Segments
[Figure: Segments 71, 78, 79, 97, 98, and 99 placed along the sequence DGRIGEIKGRKTLATPAVRRLAMENNIKLS]
Two kinds of conflict segments
• Overlap (e.g. segment 71 and segment 99)
• Use of the same spin system (e.g. both segment 78 and segment 79 contain spin system 1)
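Both kinds of conflict can be checked with one short function. The representation below is an assumption made for illustration: a segment is modeled as a start position on the sequence plus the ordered ids of its spin systems, one per residue. The example segments are made up and are not the ones in the figure.

```python
from typing import NamedTuple, Tuple


class Segment(NamedTuple):
    start: int               # 0-based position of the first residue covered
    spins: Tuple[int, ...]   # spin system ids, one per covered residue


def in_conflict(a, b):
    """True if two segments overlap on the sequence or share a spin system."""
    a_end = a.start + len(a.spins)
    b_end = b.start + len(b.spins)
    overlap = a.start < b_end and b.start < a_end
    shared = bool(set(a.spins) & set(b.spins))
    return overlap or shared


# Hypothetical segments.
s1 = Segment(start=0, spins=(1, 2, 3))   # covers positions 0-2
s2 = Segment(start=2, spins=(4, 5))      # covers positions 2-3, overlaps s1
s3 = Segment(start=10, spins=(1, 6))     # reuses spin system 1 from s1

print(in_conflict(s1, s2), in_conflict(s1, s3), in_conflict(s2, s3))
```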
34/40
Independent Set
Subset S of vertices such that no two vertices in S are connected
www.cs.rochester.edu/~stefanko/Teaching/06CS282/06-CSC282-17.ppt 35/40
A Graph Model for Spin System Linking
G(V, E)
• V: a set of nodes (segments).
• E: the edges (u, v) with u, v ∈ V such that u and v are in conflict.
Goal
• Assign as many non-conflicting segments as possible => find the maximum independent set of G.
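For a very small conflict graph, the maximum independent set can be found by exhaustive search, which makes the definition concrete. This brute-force sketch is only an illustration (the problem is NP-hard, so it does not scale, and it is not the method used in the talk); the vertex names and edges are hypothetical.

```python
from itertools import combinations


def is_independent(nodes, edges):
    """True if no edge has both endpoints inside `nodes`."""
    return not any(u in nodes and v in nodes for u, v in edges)


def maximum_independent_set(vertices, edges):
    """Try subsets from largest to smallest; exponential time, tiny graphs only."""
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if is_independent(set(subset), edges):
                return set(subset)
    return set()


# Hypothetical conflict graph: a path s1 - s2 - s3 - s4.
vertices = ["s1", "s2", "s3", "s4"]
edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]
best = maximum_independent_set(vertices, edges)
print(best)
```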
37/40
An Example of G
Seq.: GEIKGRKTLATPAVRRLAMENNIKLSE
Segment1: SP12->SP13->SP14
Segment2: SP9->SP13->SP20->SP4
Segment3: SP8->SP15->SP21
Segment4: SP7->SP1->SP15->SP3
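[Figure: Seg1 to Seg4 placed along the sequence, and the conflict graph G on the nodes Seg1, Seg2, Seg3, Seg4. Two edges are labeled with a shared spin system (SP13, shared by Seg1 and Seg2; SP15, shared by Seg3 and Seg4) and two edges are labeled Overlap.]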
38/40
Segment weight
The longer a segment is, the higher its weight.
The less frequent a segment is, the higher its weight.
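The slide gives only the two monotonic rules, not a formula. One hypothetical weight that satisfies both is length divided by frequency, sketched below. The function name `segment_weight` and the reading of "frequency" as the number of places where a segment can occur are assumptions, not the definition used in RIBRA.

```python
def segment_weight(length, frequency):
    """Hypothetical weight: grows with length, shrinks with frequency.

    length: number of spin systems in the segment.
    frequency: how many times the segment occurs (assumed meaning), at least 1.
    """
    if length < 1 or frequency < 1:
        raise ValueError("length and frequency must be positive")
    return length / frequency


w_long = segment_weight(4, 1)    # long, unique segment
w_short = segment_weight(2, 1)   # shorter segment
w_common = segment_weight(4, 2)  # same length, but occurs twice
print(w_long, w_short, w_common)
```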
39/40
Find Maximum Weight Independent Set of G (1/2)
Boppana, R. and M.M. Halldórsson, Approximating Maximum Independent Sets by Excluding Subgraphs. BIT, 1992. 32(2).
[Figure: the vertex set V, split around a vertex v into the parts labeled N(v) and Head_N(v)]
40/40
Find Maximum Weight Independent Set of G (2/2)
Boppana, R. and M.M. Halldórsson, Approximating Maximum Independent Sets by Excluding Subgraphs. BIT, 1992. 32(2).
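To make the weighted problem concrete, here is a simple greedy heuristic: repeatedly take the heaviest remaining segment and discard everything in conflict with it. This is not the Boppana and Halldórsson algorithm cited on the slide, and it has no optimality guarantee; it is a baseline sketch with hypothetical names, weights, and edges.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic for maximum weight independent set.

    weights: dict mapping vertex -> weight.
    edges: iterable of (u, v) conflict pairs.
    Returns the chosen vertices in the order they were picked.
    """
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    chosen = []
    blocked = set()
    # Heaviest first; ties are broken by name so the result is deterministic.
    for v in sorted(weights, key=lambda x: (-weights[x], x)):
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= adj[v]  # neighbors of a chosen vertex can no longer be used
    return chosen


# Hypothetical weights and conflicts.
weights = {"seg1": 3, "seg2": 4, "seg3": 3, "seg4": 4}
edges = [("seg1", "seg2"), ("seg2", "seg3"), ("seg3", "seg4")]
picked = greedy_mwis(weights, edges)
print(picked, sum(weights[v] for v in picked))
```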
41/40
An Iterative Approach
We perform spin system generation and linking iteratively.
Three stages:
• Perfect spin systems
• Weak false negative spin systems
• Severe false negative spin systems
42/40
Segment Extension
[Figure: segments (including 71, 77, 78, 97, and 99, and the extended segments 97' and 99') placed along the sequence DGRGEKGRKTLATPAVRRLAMENNIKLS; the maximum independent set (MaxIndSet) shown contains 77, 99', and 97']
43/40