Carnegie MellonSchool of Computer Science
1
Conditional Graphical Models for Protein Structure Prediction
Yan Liu
Language Technologies InstituteSchool of Computer ScienceCarnegie Mellon University
Oct 24, 2006
Carnegie MellonSchool of Computer Science
2
Snapshot of Cell BiologyNobelprize.org
+
Protein function
DSCTFTTAAAAKAGKAKAG
Protein sequence
Protein structure
Carnegie MellonSchool of Computer Science
3
Protein Structures and Functions
Courtesy of
Nobelprize.org
Adenovirus Fibre Shaft Virus Capsid
Example: triple beta-spiral fold
Carnegie MellonSchool of Computer Science
4
Protein Structure Determination
• Lab experiments: time and labor- consuming
X-ray crystallography Nobel Prize, Kendrew & Perutz, 1962
NMR spectroscopyNobel Prize, Kurt Wuthrich, 2002
• The gap between sequence and structure necessitates computational methods of protein structure determination 3,023,461 sequences v.s. 36,247 resolved structures
(1.2%)
1MBN
1BUS
Carnegie MellonSchool of Computer Science
5
Protein Structure HierarchyWe focus on predicting the topology of the structures from sequences
APAFSVSPASGACGPECA
Carnegie MellonSchool of Computer Science
6
Major Challenges
• Protein structures are non-linear Long-range dependencies
• Structural similarity often does not indicate sequence similarity Sequence alignment reaches
twilight zone (under 25% similarity)Ubiquitin (blue) Ubx-Faf1 (gold)
β-α-β motif
Carnegie MellonSchool of Computer Science
7
Previous Work• Sequence similarity perspective
Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
Profile HMM, .e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
Window-based methods, e.g. PSI_pred [Jones, 2001]
• Physical forces perspective Homology modeling or threading, e.g. Threader [Jones, 1998]
• Structural biology perspective Methods of careful design for specific structures, e.g. αα- and ββ- hairpins, β-
turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley at al, 2001]Generative models based on physical free-energy
Hard to generalize due to the various informative features
Fail to capture the structure properties
Carnegie MellonSchool of Computer Science
8
Structured Prediction
• Many prediction tasks involve outputs with correlations or constraints
John ate the cat .
SEQUENCEXS…WGIKQLQAR
TreeSequence GridStructure
Input
Output
• Fundamental importance in many areas• Potential for significant theoretical and practical advances
HHHCCCEEE…EECCCCEEE
Carnegie MellonSchool of Computer Science
9
Graphical Models
• A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999]
Node: random variables Edges: dependency relations
• Directed graphical model (Bayesian networks)
• Undirected graphical model (Markov random fields)
11..
( ,.., ) ( | parents( ))n i ii n
P x x P x x
1
1 1( ,.., ) ( ) exp( ( ))n c c c
c C c C
P x x x H xZ Z
Carnegie MellonSchool of Computer Science
10
Conditional Random Fields• Hidden Markov model (HMM) [Rabiner, 1989]
• Conditional random fields (CRFs) [Lafferty et al, 2001]
Model conditional probability directly Allow arbitrary dependencies in observation Adaptive to different loss functions and
regularizers Promising results in multiple applications
11
( ) ( | ) ( | )N
i i i ii
P P x y P y y
x, y
11 10
1( ) exp( ( , , , ))
N K
k k i ii k
P f i y yZ
y | x x
1 1( , , , ) ' ( , ) ( , ')k i i k i if i y y f i I y s y s x x
Carnegie MellonSchool of Computer Science
11
Protein Structure Prediction
• Dependency between residues (single observation)
• Dependency between components (subsequences of observations)
Carnegie MellonSchool of Computer Science
12
Outline• Brief introduction to protein structures• Graphical models for structured-prediction
• Conditional graphical models for protein structure prediction General framework Specific models
• Experiment results• Conclusion and discussion
Carnegie MellonSchool of Computer Science
13
• Outputs Y = {M, {Wi} }, where Wi = {pi, qi, si}
• Feature definition Node feature
Local interaction feature
Long-range interaction feature
Our Solution: Conditional Graphical Models
1 1 1( , , ) ( , ', 1)k i i i i i if w w x I s s s s p q
( , ) '( , , ) ( ', 1 ')k i k i i i i if w x f x p q I s s q p d
Long-range dependencyLocal dependency
1( , , ) '( , , , , ) ( , ')k i j k i i j j i if w w x g x p q p q I s s s s
Carnegie MellonSchool of Computer Science
14
• Conditional probability given observed sequences x is defined as
• Prediction:
• Training phase : learn the model parameters λ Minimizing regularized negative log loss
Iterative search algorithms by seeking the direction whose empirical values agree with the expectation
Conditional Graphical Models (II)
1
1( | ) exp( ( , ))
G
K
k k cc C k
fP Y YZ
x x
( | )( ( , ) [ ( , )]) ( ) 0G
k c p k cc Ck
Lf E f
y xx y x y
21
( , ) log ( )G
K
k k cc C k
L f Z
x y
1
* arg max ( , )G
K
k k cc C k
y f Y
x
Carnegie MellonSchool of Computer Science
15
Major Components
• Graph topology Secondary structure prediction: CRF, kernel CRF Tertiary fold recognition: Segmentation CRF, Chain graph
model Quaternary fold recognition: Linked segmentation CRF
• Efficient inference Prefer exact inference with O(nd) complexity Resort to approximate inference
• Features Allows flexible and rich feature definition
Carnegie MellonSchool of Computer Science
16
Protein Secondary Structure Prediction
• Given a protein sequence, predict its secondary structure assignments Three classes: helix (H), sheets (E) and coil (C)
Input: APAFSVSPASGACGPECA
Output: CCEEEEECCCCCHHHCCC
Carnegie MellonSchool of Computer Science
17
CRF on Secondary Structure Prediction [Liu et al, Bioinformatics 2004]
• Node semantics – secondary structure assignment• Graphical model - conditional random fields (CRFs) or kernel CRF
• Inference algorithm - efficient inferences exists, such as forward-backward or Viterbi algorithm
C C E E …. ... C
11 1
1( | ) exp( ( , , , ))
N K
k k i ii k
P f i y yZ
y x x
Carnegie MellonSchool of Computer Science
18
Protein Fold Recognition and Alignment
• Protein fold: identifiable regular arrangement of secondary structural elements
• Different from previous simple fold classification
• Provide important information and novel biological insights
Input: ..APAFSVSPASGACGPECA..
Output 1: Does the target fold exist?
Output 2: ..NNEEEEECCCCCHHHCCC..
Yes
Testing PhaseTraining Phase
Carnegie MellonSchool of Computer Science
19
Conditional Graphical Model for Fixed Template Fold
[Liu et al, RECOMB 2005]
• Node semantics - secondary structure elements of variable lengths
• Graphical model - segmentation conditional random fields (SCRFs)
• Inference - forward-backward and Viterbi-like algorithm can be derived given some assumptions
β-α-β motif
1
1( | ) exp( ( , ))
G
K
k k cc C k
P y f yZ
x x
Carnegie MellonSchool of Computer Science
20
Conditional Graphical Model for Repetitive Fold Recognition
[Liu et al, ICML 2005]
• Node semantics - two layer segmentation Y = {M, {Ξi}, T} Level 1: envelop, or one repeat, level 2: components of one repeat
• Graphical model - Chain graph model A graph consisting of directed and undirected graphs
• Inference - forward-backward algorithm and Viterbi-like algorithm
1 11
( | ) ( ,{ }, | ) ( ) ( | , ) ( | , , )M
i i i i i ii
P Y P M T P M P T x P T
x x x( )
,1 ,
1 1
1( | , , ) exp( ( , , )
j
i j
M K
i i i k k i jj ki
P T f W WZ
x x
Carnegie MellonSchool of Computer Science
21
Conditional Graphical Model for for Quaternary Fold Recognition
[Liu et al, IJCAI 2007]
• Node semantics – secondary structure elements and/or simple fold
• Graphical model - linked segmentation CRF (L-SCRF) Fix template and/or repetitive subunits Inter-chain and intra-chain interactions
, , ,
1 1 , , ,,
1( ,..., | ,..., ) exp( ( , )) exp( ( , , , ))
i j G i j a b G
R R k k i i j l k i a i j a bV k lE
P f g yZ
y y y
y y x x x y x x y
Carnegie MellonSchool of Computer Science
22
Approximate Inference
• Varying dimensionality requires reversible jump MCMC sampling [Greens, 1995, Schmidler et al, 2001]
• Four types of Metropolis proposals State switching Position switching Segment split Segment merge
• Simulated annealing reversible jump MCMC [Andireu et al, 2000]
Replace the sample with RJ MCMC Theoretically converge on the global optimum
Carnegie MellonSchool of Computer Science
23
Conditional Graphical Models for Protein Structure Prediction
Carnegie MellonSchool of Computer Science
24
Model Roadmap
Conditional random fields
Segmentation CRFs
Chain graph model
Linked segmentation CRFs
Segment Correlations
Inter-chain Segment Correlations
Local and Global Tradeoff
Generalized as conditional graphical models
Kernel CRFsKernelization
Carnegie MellonSchool of Computer Science
25
Outline• Brief introduction to protein structures• Graphical models for structured prediction
• Conditional graphical models for protein structure prediction
• Experiment results Fold recognition Fold alignment prediction Discovery of potential membership proteins
• Conclusion and discussion
Carnegie MellonSchool of Computer Science
26
Experiments: Target Fold• Right-handed β-helix fold [Yoder et al, 1993]
Bacterial infection of plants, binding the O-antigen and so on
• Leucine-rich repeats (LLR) [Kobe & Deisenhofer, 1994]
Structural framework for protein-protein interaction
Carnegie MellonSchool of Computer Science
27
Experiments: Target Quaternary Fold
• Triple beta-spirals [van Raaij et al. Nature 1999]
Virus fibers in adenovirus, reovirus and PRD1
• Double barrel trimer [Benson et al, 2004]
Coat protein of adenovirus, PRD1, STIV, PBCV
Carnegie MellonSchool of Computer Science
28
Tertiary Fold Recognition: β-Helix fold
• Histogram and ranks for known β-helices against PDB-minus dataset
5
Chain graph model reduces the real running time of SCRFs model by around 50 times
Carnegie MellonSchool of Computer Science
29
Quaternary Fold Recognition: Triple β-Spirals
• Histogram and ranks for known triple β-spirals against PDB-minus dataset
Carnegie MellonSchool of Computer Science
30
Quaternary Fold Recognition: Double Barrel-Trimer
• Histogram and ranks for known double barrel-trimer against PDB-minus dataset
Carnegie MellonSchool of Computer Science
31
Fold Alignment Prediction: β-Helix• Predicted alignment for known β -helices on cross-family
validation
Carnegie MellonSchool of Computer Science
32
Fold Alignment Prediction:LLR and Triple β-Spirals
• Predicted alignment for known LLRs using chain graph model (left) and triple β-spirals using L-SCRFs
Carnegie MellonSchool of Computer Science
33
Discovery of Potential β-helices
• Hypothesize potential β-helices from Uniprot reference databases Full list can be accessed at
www.cs.cmu.edu/~yanliu/SCRF.html
• Verification on proteins with later resolved structures from different organisms 1YP2: potato tuber ADP-glucose pyrophosphorylase
1PXZ: major allergen from Cedar Pollen
GP14 of Shigella bacteriophage as a β-helix protein
Carnegie MellonSchool of Computer Science
34
Conclusion• Thesis Statement
Conditional graphical models are effective for protein structure prediction
• Strong claims Effective representation for protein structural properties Flexibility to incorporate different kinds of informative features Efficient inference algorithms for large-scale applications
• Weak claims Ability to handle long-range interactions Best performance bounded by prior knowledge
Carnegie MellonSchool of Computer Science
35
Contribution and Limitation
• Contribution to machine learning Enrichment of graphical models Formulation to incorporate domain knowledge
• Contribution to computational biology Effective for protein structure prediction and fold
recognition Solutions for the long-range interactions (inter-chain and
intra-chain)
• Limitation Manual feature extraction Difficulty in verification High complexity
Carnegie MellonSchool of Computer Science
36
Future Work• Computational biology
• Machine Learning
Protein structure prediction
+
Protein function and protein-protein interaction prediction
Drug target design
Graph topology learningGraph-based
semi-supervised learning
Active learning for structured data
Carnegie MellonSchool of Computer Science
37
Acknowledgement
• Jaime Carbonell, Eric Xing, John Lafferty, Vanathi Gopalakrishnan
• Chris Langmead, Yiming Yang, Roni Rosenfeld, Peter Weigele , Jonathan King, Judith Klein-Seetharaman, , Ivet Bahar, James Conway and many more
• And fellow graduate students …
Carnegie MellonSchool of Computer Science
38
Carnegie MellonSchool of Computer Science
39
Features for Tertiary Fold Recognition
• Node features Regular expression template, HMM profiles Secondary structure prediction scores Segment length
• Inter-node features β-strand Side-chain alignment scores Preferences for parallel alignment scores Distance between adjacent B23 segments
• Features are general and easy to extend
Carnegie MellonSchool of Computer Science
40
Features for Protein Fold Recognition
Carnegie MellonSchool of Computer Science
41
Discovery of Potential Double Barrel-Trimer
• Potential proteins suggested in [Benson, 2005]
Carnegie MellonSchool of Computer Science
42
Inference Algorithm for SCRF
• Backward-forward algorithm*
• Viterbi algorithm*
)),,(exp(),1(),(),(
1,
,,,,
ssxfypyqyrK
kkkryq
qppylryl ll
_
)),,(exp(),1(),(max),(1
,,,,
,
ssxfypyqyrK
kkkryqyl
qppryl ll
_
p(state yr ends at r |xl+1 xl+2… xr-1xr and state yl ends at l) =
Carnegie MellonSchool of Computer Science
43
Contrastive Divergence
Carnegie MellonSchool of Computer Science
44
Reversible jump MCMC Algorithm
• Three types of proposals Position switching: randomly select a segment j and
a new position assignment dj(i+1) ~U(dj-1
(i),dj+1(i))
Segment split: randomly select a segment j and split it into two segments where (dj
(i+1) , dj+1(i+1) ) =
G(dj-1(i) ,u(i) ) where u(i) ~ U
Segment merge: randomly select a segment j and merge segment j and j+1
• Simulated annealing reversible jump MCMC for computing y = argmax P(y|x) [Andireu et al, 2000]
Carnegie MellonSchool of Computer Science
45
Simulated annealing reversible jump MCMC
Carnegie MellonSchool of Computer Science
46
Protein Structural Graph for Beta-helix
Carnegie MellonSchool of Computer Science
47
Protein Structure Determination
• Lab experiments: time and labor- consuming X-ray crystallography NMR spectroscopy Electron microscopy and many more
• Computational methods: Homology modeling: ≥ 30% sequence similarity Fold recognition: < 30% sequence similarity Ab inito methods: no template structure needed
• Active research area in multiple scientific fields
Carnegie MellonSchool of Computer Science
48
Evaluation Measure
• Q3 (accuracy)
• Precision, Recall
• Segment Overlap quantity (SOV)
• Matthew’s Correlation coefficients
)(
)1()2;1(
)2;1()2;1(1)2,1(
iS
SLENSSMAXOV
SSDELTASSMINOV
NSSSOV
))(()()( iiiiiiii
iiiii
onunopup
ounpC
P + P-
T + P u
T - o nuonp
npQ
3
op
pQ pre
up
pQ
Carnegie MellonSchool of Computer Science
49
Outline
• Brief introduction to protein structures
• Discriminative graphical models
• Generalized discriminative graphical models for protein fold recognition
• Experiment results
• Conclusion and discussion
Carnegie MellonSchool of Computer Science
50
Graphical Models for Structured Prediction
• Conditional Random Fields Model conditional probability directly, not joint probability Allow arbitrary dependencies in observation (e.g. long
range, overlapping) Adaptive to different loss functions and regularizers Promising results in multiple applications
• Recent developments Alternative estimation algorithms (Collins, 2002, Dietterich et al,
2004) Alternative loss functions, use of kernels (Taskar et al., 2003,
Altun et al, 2003, Tsochantaridis et al, 2004) Baysian formulation (Qi and Minka, 2005) and semi-markov
version (Sarawagi and cohen, 2004)
Carnegie MellonSchool of Computer Science
51
Local Information
• PSI-blast profile Position-specific scoring matrices (PSSM) Linear transformation[Kim & Park, 2003]
• SVM classifier with RBF kernel • Feature#1 (Si): Prediction score for each
residue Ri
)5(0.1
)55(1.05.0
)5(0
)(
xif
xifx
xif
xf
Carnegie MellonSchool of Computer Science
52
Structured Data Prediction
• Data in many applications have inherent structures
John ate the cat .
SEQUENCEXS…WGIKQLQAR
Text sequence Parsing tree
Protein sequenceProtein structures
3-D imageSegmented objectsExampl
e
Input
Structures
• Fundamental importance in many areas• Potential for significant theoretical and practical advances
Carnegie MellonSchool of Computer Science
53
Tertiary Fold Recognition
• Structural motif Identifiable arrangement of secondary structural elements Super-secondary structure, or protein fold
• Structural motif recognition Given a structural motif, predict its presence in a testing
sequence and produce the sequence to structure alignment, based on sequences only
β-α-β motif
Carnegie MellonSchool of Computer Science
54
Training and Testing
• Training phase : learn the model parameters λ
Minimizing regularized log loss
Iterative search algorithms by seeking the direction whose empirical values agree with the expectation
• Testing phase: search the segmentation that maximizes P(y|x)
21
( , ) log ( )G
K
k k cc C k
L f Z
x y
( | )( ( , ) [ ( , )]) ( ) 0G
k c p k cc Ck
Lf E f
y xx y x y
Carnegie MellonSchool of Computer Science
55
Future Work
+
Protein function
DSCTFTTAAAAKAGKAKAG
Protein sequence
Protein structure
Drug Design
Carnegie MellonSchool of Computer Science
56
Experiment Setup• Cross-family validation
Focus on identification of non-homologous proteins
• Negative sets: PDB-minus set Non-homologous proteins in Protein Data Bank (less than
25%) – Positive sets
• Features Node features: regular expression template, predicted
secondary structure prediction scores, segment length, hydrophobicity and so on
Inter-node features: β-strand side-chain alignment scores, distance between nodes and so on
Features are general and easy to extend
Carnegie MellonSchool of Computer Science
57
Major Challenges
• Protein structures are non-linear Long-range dependencies
• Structural similarity often does not indicate sequence similarity Sequence alignment reaches
twilight zone (under 25% similarity)Ubiquitin (blue) Ubx-Faf1 (gold)
β-α-β motif
Carnegie MellonSchool of Computer Science
58
Conditional Random Fields [Lafferty et al, 2001]
• Model p (label y|observation x), not joint distribution
• Global normalization over undirected graphical models
• Allow arbitrary dependencies in observation (e.g. long range, overlapping)
• Adaptive to different loss functions and regularizers
Carnegie MellonSchool of Computer Science
59
Quaternary Fold
• Quaternary structures Multiple sequences with tertiary
structures associated together to form a stable structure
Similar structure stabilization mechanism as tertiary structures
• Very limited research work to date Complex structures Few positive training data
Triple beta-spirals
Tumor necrosis factor (TNF)
Carnegie MellonSchool of Computer Science
60
Protein Quaternary Fold Recognition
Seq 1: APA FSVSPA … SGACGP ECAESG
Seq 2 : DSCTFT…TAAAAKAGKAKCSTITL
Inter-chain and intra-chain dependencies
Training
Input: ..APAFSVSPASGACGPECA..
Output 1: Does the target fold exist?
Output 2:..
Yes
Testing
Carnegie MellonSchool of Computer Science
61
Conditional Random Fields
• Promising results in multiple applications Tagging (Collins, 2002) and parsing (Sha and Pereira, 2003)
Information extraction (Pinto et al., 2003)
Image processing (Kumar and Hebert, 2004)
DNA sequence analysis (Bockhorst & Craven, 2005)