THE UNIVERSITY OF CHICAGO
NEW APPROACHES TO PROTEIN STRUCTURE
PREDICTION AND DESIGN
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF BIOLOGICAL SCIENCES
AND THE PRITZKER SCHOOL OF MEDICINE
IN CANDIDACY FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
DEPARTMENT OF BIOCHEMISTRY AND MOLECULAR BIOLOGY
BY
JOSEPH DEBARTOLO
CHICAGO, ILLINOIS
JUNE 2010
UMI Number: 3408517
All rights reserved
INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.
UMI 3408517
Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against
unauthorized copying under Title 17, United States Code.
ProQuest LLC 789 East Eisenhower Parkway
P.O. Box 1346 Ann Arbor, MI 48106-1346
Table of Contents
List of figures
List of tables
List of abbreviations
Acknowledgements
Chapter 1 Protein structure prediction methods
1.1 Introduction
1.2 Protein secondary and tertiary structure
1.3 The protein folding problem and structure prediction
1.4 Secondary structure prediction
1.5 Tertiary structure prediction
1.6 Incorporating folding pathways into structure prediction
1.7 Design of globular proteins
Chapter 2 Homology-free structure prediction and folding pathways
2.1 Introduction
2.2 Integration of 2° and 3° structure
2.3 Iterative fixing and trimer selection
2.4 Retaining lost side chain information
2.5 Structure prediction results
2.6 Folding pathways
2.7 Conclusions
2.8 Methods
Chapter 3 Using evolutionary diversity to enhance structure prediction
3.1 Introduction
3.2 Overview of the SPEED methods
3.3 SPEED-enhanced Ramachandran distributions
3.4 ItFix 2° structure
3.5 Energy functions
3.6 Improvement in 3° structure
3.7 Averaging the energy function across the MSA
3.8 Clustering
3.9 Confidence assessed from reproducibility
3.10 Performance in CASP8
3.11 Conclusions
3.12 Methods
Chapter 4 New methods for protein design
4.1 Introduction
4.2 Choice of ubiquitin as a design target
4.3 Design protocol
4.4 Negative design
4.5 Conclusions
Chapter 5 Structure prediction: Future directions and conclusions
5.1 Introduction
5.2 Enhancement of the conformational search
5.3 Algorithm-free smooth ItFix
5.4 New energy functions for folding and refinement
5.5 Conclusions
References
List of Figures
Figure Title
1.2.1 Hierarchy of protein structure
1.2.2 Backbone torsional preference and 2° structure
1.5.1 Nearest neighbor effects on backbone geometry
2.2.1 Inter-related themes of protein folding
2.2.2 The ItFix 2° and 3° structure prediction protocol
2.4.1 Orientation-dependence of statistical potential
2.4.2 Statistical potential energy profiles illustrating orientation dependence
2.5.1 2° and 3° structure prediction for low-homology targets
2.5.2 2° and 3° structure prediction for high-homology targets
2.5.3 ItFix algorithm mimics the experimentally-determined ubiquitin folding pathway
2.6.1 Progression of fixing structure for 1af7, 1b72A, 1di2, and 1r69
3.1.1 Structure prediction protocol
3.3.1 SPEED-enhanced Rama sampling distribution
3.3.2 Position-based comparison of homology-free and SPEED distributions
3.5.1 Radial protein structure terms
3.5.2 Effect of energy filtering on model accuracy
3.5.3 Reproducibility of the final model ensembles
3.6.1 Improvement in 3° structure prediction using SPEED
3.8.1 Comparison of contacts for the top clusters of several targets
3.9.1 Assessing global accuracy from reproducibility of the top cluster
3.9.2 Assessing local accuracy from reproducibility of top cluster
3.10.1 ItFix-SPEED blind predictions in CASP
4.1.1 Structures of distant ubiquitin family members
4.4.1 Designed Rama propensity
5.2.1 ItFix -structure prediction
5.3.1 The Smooth ItFix protocol
5.4.1 DOPE-PW-SPEED encodes higher specificity
List of Tables
Table Title
1 Comparison of stable and structured designed sequences
2 Component of DOPE-PW statistical potential
3 Homology-free ItFix performance on low-homology target set
4 Homology-free ItFix performance on high-homology target set
5 SPEED 2° structure prediction comparison
6 SPEED 3° structure prediction comparison
7 Radial terms for protein structure
List of Abbreviations
1° primary
2° secondary
3° tertiary
MD Molecular dynamics
ItFix IterativeFixing
SPEED Structure Prediction Enhanced with Evolutionary Diversity
MCSA Monte Carlo Simulated Annealing
Rama Ramachandran
PDB Protein Data Bank
MSA Multiple sequence alignment
Acknowledgements
Tobin Sosnick has been an incredible advisor and I cannot thank him enough for his
support, motivation and outstanding scientific insights that cross multiple disciplines. Karl Freed
has been an essential contributor to the development of my computational skills and I thank him
enormously for his patience in the early days. Andres Colubri, Abhishek Jha, James Fitzgerald
and Glen Hocky are all crazy geniuses who pointed me in the right direction at the right times
and I thank them immensely for never taking themselves too seriously. I thank Marc Parisien and
Esmael Haddadian for great conversations about computer science and protein folding. Chloe
Antoniou was instrumental towards getting my work started in the wetlab and I thank her for her
patience and for her truly genuine willingness to help other people succeed. Nothing in my life
would be possible without the devotion of my family, so I thank my parents, grandparents and
sisters with all of my heart. Finally, I thank my wife Jessica who carried me through the hardest
parts of this journey and to whom I will be forever grateful.
Chapter 1
Protein structure prediction methods
1.1 Introduction
The structure and function of proteins are interrelated concepts that are critical to
understanding disease pathways and the development of therapeutics. This connection has
motivated numerous theoretical and experimental attempts to determine how amino acid
sequences encode functionally active three-dimensional structures. Although many of the general
determinants of protein stability and specificity have been elucidated, our understanding of the
subject is not developed to the point that the high-resolution 3° structure of a protein can be
reliably predicted from amino acid sequence alone [1]. Nonetheless, accurate protein structure
prediction is a necessary objective due to the large and growing number of new sequences [2]
and the slow rate at which structures are being determined experimentally [3, 4].
The challenge of structure prediction, given the significant progress that had been
achieved in elucidating the details of the protein folding process [5, 6], has been to incorporate
this knowledge into a set of computational tools that can accurately and automatically predict
structure from sequence [7-9]. To achieve this goal, prediction methods are challenged to
incorporate sufficiently accurate force fields, efficient conformational search strategies, and the
appropriate level of representation of the chain. All of these factors are influenced by the desired
level of prediction resolution, which can range from the prediction of 2° structure content [10] to
high-accuracy atomistic models [11]. Indeed, some methods may trade increased accuracy for
physical fidelity in order to gain new insights into the physical mechanism of protein folding [8,
12-15].
In parallel with the structure prediction effort, a large body of bioinformatics research has
been focused on clustering the various genomes into families of sequences which, when properly
aligned, can be represented by one structure [16-19]. This effort not only organizes the problem
of structure prediction into a reduced number of genomic targets [20], but also provides
evolutionary information that, when properly incorporated, can provide significant enhancements
to prediction accuracy [10, 21-23]. The objective of harnessing as much statistical
information as possible to achieve a high level of prediction accuracy and confidence
demonstrates an orthogonal approach to physics-based structure prediction.
The next challenge is to invert a successful structure prediction method into a protein
design algorithm and determine which protein sequences are most likely to fold into a given
structure [24-26]. The principal motivation of protein design is to circumvent the functional
repertoire of natural proteins and create new classes of enzymes for therapeutic and industrial
purposes. Protein design in practice generally consists of either the de novo redesign of the
amino acid sequence of an experimentally determined structure [27-35] or the de novo design of
novel structures and folds [36-39]. Both approaches incorporate the principal features of
structure prediction, such as hydrophobic burial [40-43], electrostatics [44-47], and local
conformational propensity [48-53], and methods may also use negative design to destabilize non-
native or unsought conformations [37, 38, 54] or to prevent the formation of amyloids [34, 55].
This chapter will start with a discussion of protein structure and the protein folding
problem and will continue with a summary of structure prediction methodologies. It will
conclude with a description of past achievements and the current status of protein design.
1.2 Protein secondary and tertiary structure
In a globular protein domain, structure is divided into three hierarchical levels (Fig.
1.2.1). The base level is primary (1°) structure, which consists of the one-dimensional sequence
of amino acids that traverses from the N-terminal to C-terminal end of the chain. The next level,
secondary (2°) structure, encompasses a description of the regular hydrogen bond networks
formed by the backbone of the polypeptide [56-58], and consists of three main types: α-helix,
β-strand and turn. Within each 2° structure type there is also an observed preference for specific
backbone torsion angles, exemplified in the Ramachandran (Rama) map [59] of each type (Fig.
1.2.2). For this reason, 2° structure can be defined using a combination of local and long-distance
hydrogen bonding and local backbone conformation [58]. The integration of these two
components is also discussed in relation to tertiary (3°) structure, the next level of protein
structure.
Protein 3° structure is characterized in multiple ways (Fig. 1.2.1). One of the most
common is a diagram of the topological organization of the units of 2° structure [60]. A higher
resolution depiction of 3° structure is the two-dimensional contact matrix between atoms in the
chain, where contacts between residues that are widely separated in sequence are the furthest from
the diagonal (Fig. 1.2.1). Topology diagrams and contact matrices are useful ways of visualizing
3° structure, but the most detailed representation of 3° structure is an atomic resolution model
(Fig. 1.2.1), which provides the packing arrangements between all atoms.
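As an illustration of the contact matrix representation, the following minimal sketch builds a residue-level contact matrix from a list of coordinates. It is not taken from this work: the function name `contact_matrix`, the use of one representative atom per residue, and the 8 Å cutoff are all assumptions chosen for the example.

```python
import math

def contact_matrix(coords, cutoff=8.0, min_separation=1):
    """Return a symmetric 0/1 matrix marking residue pairs closer than `cutoff`.

    coords: list of (x, y, z) tuples in angstroms, one representative atom per
            residue (assumed here; a real analysis might use all heavy atoms).
    cutoff: distance threshold for a contact (8.0 is an assumed value).
    min_separation: pairs closer than this in sequence are skipped.
    """
    n = len(coords)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                matrix[i][j] = matrix[j][i] = 1
    return matrix

# Toy chain (made-up coordinates): four residues on a line, 3.8 A apart,
# plus a fifth residue folded back next to the first one.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
       (11.4, 0.0, 0.0), (0.0, 5.0, 0.0)]
cm = contact_matrix(toy, cutoff=8.0, min_separation=2)
for row in cm:
    print(row)
```

In the toy chain, the contact between the first and last residues sits in the corner of the matrix, far from the diagonal, which is how sequence-distant contacts appear in such a plot.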
Protein 2° and 3° structure are overlapping concepts from many perspectives. For
example, the hydrogen bonding network that defines a β-strand can occur over a large separation
in sequence, making the β-strand 2° structure largely defined by 3° contacts [58]. On the other
hand, the structure of α-helices is defined by a sequence-local network of hydrogen bonds [58].
Even with a structural definition devoid of 3° contacts, however, the α-helical 2° structure
displays an amphipathic polarization of hydrophobic and hydrophilic residues such that the
former are packed in the globular core and the latter are solvent accessible. The analogous
amphipathic characteristic of β-strands is an alternating placement of residues on opposing sides
of the chain, which allows both hydrophobic and hydrophilic faces. The amphipathic nature of
both α-helices and β-strands is accommodated by the backbone torsion angle propensity of each
2° structure type (Fig. 1.2.2). This coupling is particularly evident in β-strands, where the extended
Rama distribution reflects the alternating polarization of side chains.
Figure 1.2.1 Hierarchy of protein structure. The three levels of structure in a protein domain consist of 1°, 2° and 3° structure. The 1° structure of a protein is its amino acid sequence. The three main types of 2° structure are α-helix, β-strand and coil. There are multiple ways to describe 3° structure, including topology (left), contact matrices (center), and atomistic models (right).
Figure 1.2.2 Backbone torsional preference and 2° structure. The φ (N to Cα rotation axis) and ψ (Cα to C rotation axis) backbone torsion angles of residues in a non-redundant version of the PDB are plotted in Ramachandran maps. The three main 2° structure types describe unique local conformational geometries, and therefore exhibit unique Ramachandran maps.
The preceding examples confirm that 2° and 3° structure are integrated and codependent
concepts and must be treated as such when included in structure prediction.
1.3 The protein folding problem and structure prediction
The known attributes that characterize native protein structures are summarized by the
forces that contribute to protein folding [61]. The most predominant of these forces include the
exclusion of the side chains of hydrophobic residues from solvent [40-43], electrostatic
interaction between the side chains of charged amino acids [44-47] and hydrogen bonding [62,
63]. There is also a distinct local backbone conformational preference for the protein chain
given the local amino acid identity [48-53]. This can be observed in the Rama distribution of the
amino acid leucine, which strongly prefers the α-helical conformation when neighbored by
alanines, but prefers an extended β-strand conformation when neighbored by the β-branched side
chain of valine (Figure 1.3). Additionally, it has been suggested that some part of a natural amino
acid sequence may be negatively designed to destabilize non-native or unsought conformations
[37, 38, 54] or to prevent the formation of amyloids [34, 55].
The long list of phenomena involved in protein folding suggests that a complex interplay
between multiple forces causes a protein to adopt a unique conformation. Understanding this, the
protein folding problem seeks to quantify the relative importance of each force and the physical
connections between them throughout the pathway from unfolded to folded protein. The
challenge of structure prediction is to encode all of these attributes into a single algorithm that
predicts protein structure from amino acid sequence. Given the marginal stability of protein
structures and the cooperativity of protein folding, it would seem that every component involved
in folding is interrelated and important [12, 64], so the challenge of structure prediction is to
incorporate each component with an appropriate weighting. The manner in which structure prediction
methods approach this challenge, as described below, is strikingly varied.
1.4 Secondary structure prediction
The discovery that the different amino acids have varying probabilities for the three 2°
structure types allowed Chou and Fasman to use a very early PDB to predict 2° structure within
the categories of helix, strand and coil at around 60% accuracy [48, 65, 66]. The essence of their
algorithm is a search for a contiguous region of the sequence that contains a high probability of a
2° structure type followed by the expansion of the region until the ends are detected by a drop
below a probability threshold.
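The nucleate-and-extend idea described above can be sketched in a few lines. This is a simplified illustration rather than the published Chou-Fasman procedure: the propensity values are rounded placeholders, and the window lengths and thresholds (`seed_len`, `seed_min`, `extend_len`, `extend_min`) are assumed for the example.

```python
# Placeholder helix propensities for illustration only; not the published table.
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.2, "Q": 1.1,
                    "G": 0.6, "P": 0.6, "S": 0.8, "N": 0.7, "T": 0.8, "Y": 0.7}

def mean_propensity(seq, start, end, table, default=1.0):
    """Average propensity over seq[start:end]; unknown residues score `default`."""
    window = seq[start:end]
    return sum(table.get(aa, default) for aa in window) / len(window)

def predict_segments(seq, table, seed_len=6, seed_min=1.2,
                     extend_len=4, extend_min=1.0):
    """Return half-open (start, end) segments found by seeding and extending."""
    segments = []
    i = 0
    while i + seed_len <= len(seq):
        if mean_propensity(seq, i, i + seed_len, table) >= seed_min:
            start, end = i, i + seed_len
            floor = segments[-1][1] if segments else 0
            # Grow leftwards while the window starting at the new residue stays high.
            while start > floor and mean_propensity(
                    seq, start - 1, start - 1 + extend_len, table) >= extend_min:
                start -= 1
            # Grow rightwards while the window ending at the new residue stays high.
            while end < len(seq) and mean_propensity(
                    seq, end + 1 - extend_len, end + 1, table) >= extend_min:
                end += 1
            segments.append((start, end))
            i = end
        else:
            i += 1
    return segments

segs = predict_segments("GPGSAELLKKLAEELGPGS", HELIX_PROPENSITY)
print(segs)
```

Because the extension step averages over a short window, a segment can absorb one low-propensity residue at each end before the average drops below the threshold, which is one reason such methods blur segment boundaries.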
Subsequent methods greatly enhanced 2° structure prediction accuracy by incorporating
sequence profiles and machine learning techniques [10, 67, 68]. Sequence profiles use a multiple
sequence alignment to generate a distribution of amino acid types at each position as opposed to
a singular amino acid identity [69]. This enhancement in information consequently increases the
number of available parameters, hence requiring machine learning to optimize the parameters of
algorithms that use sequence profiling [70]. The combination of these methods has proven
hugely successful when incorporated into Chou-Fasman style algorithms, allowing methods to
achieve accuracies close to 80-90% [10, 67, 68].
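To make the notion of a sequence profile concrete, the sketch below converts a small multiple sequence alignment into per-position amino acid frequencies. The function name `sequence_profile`, the treatment of gaps, and the optional pseudocount are assumptions for illustration; production tools typically also apply sequence weighting.

```python
from collections import Counter

def sequence_profile(msa, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=0.0):
    """Return one {amino acid: frequency} dict per alignment column.

    msa: list of equal-length aligned sequences; characters outside `alphabet`
         (for example '-') are treated as gaps and ignored.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in msa if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        if total == 0:
            # Column contains only gaps and no pseudocount was requested.
            profile.append({aa: 0.0 for aa in alphabet})
        else:
            profile.append({aa: (counts[aa] + pseudocount) / total
                            for aa in alphabet})
    return profile

# Made-up four-sequence alignment used only to exercise the function.
toy_msa = ["MKLV", "MRLV", "MKIV", "MK-A"]
profile = sequence_profile(toy_msa)
print({aa: p for aa, p in profile[1].items() if p > 0})
```

Each column now carries twenty numbers instead of a single residue identity, which illustrates the growth in the number of parameters noted above.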
The remaining deficit in the accuracy of 2° structure prediction [71] can be attributed to a
low propensity for the native 2° structure at some positions due to functional and, most
importantly, 3° structure [72-74]. As a result, any complete 2° structure prediction method must
recognize that the formation of 2° and 3° structure is a coupled process [75] and must
appropriately incorporate 3° structure context [8].
1.5 Tertiary structure prediction
Protein 3° structure prediction methods can be divided into template-based homology-
modeling methods and template-free de novo methods. Template-based methods [22, 76-80]
search for very closely related sequences in the Protein Data Bank (PDB) and use those
experimental models as templates to align to the target sequence. In cases where a target
sequence contains no sequence homologues in the PDB, template-based methods are
unsuccessful and template-free methods which assemble the protein chain de novo are necessary
[8, 9, 15, 80-83]. Template-free methods, which are the subject of this thesis, are limited by the
enormous conformational space accessible to each protein chain [84], and such methods seek to
narrow the conformational search by amounts ranging from sampling large PDB fragments [21,
80] to simulating the folding process using all-atom molecular dynamics (MD) simulations [9, 15].
Structure prediction methods must be capable of recognizing good conformations, which
necessitates accurate protein energy functions. Many notable protein energy functions are force
fields applied in MD simulations, which seek to capture the most physically realistic spatial and
temporal relationships between each atom in the molecule [85-87]. Such energy functions calculate
some or all known protein force field components, which include those associated with bond
lengths and angles, backbone and side chain torsions, electrostatic interactions, van der Waals
interactions and dipolar interactions related to hydrogen bonding. The performance of these force
fields is difficult to determine and depends on the implementation within the application of
interest.
The most useful energy functions for direct de novo prediction of protein structure
include terms that are derived from the statistics of the PDB (i.e. the observed distances between
atoms in experimentally determined structures) [53, 88-91]. One motivation for the use of
statistical potentials is that the forces behind protein folding may be difficult to explicitly
calculate, but may be encoded implicitly in the empirical statistics of the PDB. While it is
difficult to directly compare the performance of statistical potentials and traditional molecular
mechanics force fields, the latter are rarely used for structure prediction, and at least one study
suggests that statistical potentials are superior at the refinement of near-native protein models
[91].
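One common way to turn observed distance statistics into an energy is inverse Boltzmann weighting, E(r) = -kT ln(P_obs(r) / P_ref(r)). The sketch below applies that generic formula to binned counts. It is not a reproduction of any specific potential cited here: the function name, the pseudocount smoothing, the choice of reference counts, and the toy histogram are all assumptions.

```python
import math

def statistical_potential(observed_counts, reference_counts, kT=1.0, pseudocount=1.0):
    """Per-bin energy E = -kT * ln(P_obs / P_ref) with additive smoothing.

    observed_counts: counts of an atom-pair distance falling in each bin.
    reference_counts: counts expected in each bin for a reference state
                      (how the reference is defined is left open here).
    """
    if len(observed_counts) != len(reference_counts):
        raise ValueError("histograms must have the same number of bins")
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# Made-up histogram: the second distance bin is over-represented relative to
# the reference, so it receives the most favorable (most negative) energy.
observed = [2, 40, 25, 13]
reference = [20, 20, 20, 20]
energies = statistical_potential(observed, reference)
print([round(e, 3) for e in energies])
```

Bins that occur more often than the reference predicts get negative energies and rarely observed bins get positive ones, so the forces need not be modeled explicitly.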
While an accurate protein energy function is important, an efficient search strategy is
crucial due to the enormity of the conformational space [84]. As such, any 3° structure prediction
should incorporate features that reduce the search space in a way that appropriately reflects the
real behavior of proteins in solution, which is governed by the excluded volume of side chains
and different regions of the chain [92, 93]. In essence, a protein in solution will exhibit local
conformational preferences that depend on the local amino acid sequence [52, 94] and long
distance chain interactions [95]. The twenty amino acid types by themselves have distinct
conformational preferences [48], which become even more specific when the amino acid
identities of nearest-neighbors are taken into account [52, 95] (Fig. 1.5.1). Thus, the 1° sequence
provides the first level of reduction of the conformational search, and it is the implementation of
this reduction within a structure prediction algorithm that varies among methods.
Fragment-based assembly methods impart the native crystal structure local conformation
to any sampled region [21, 96, 97] and are successful when matching or homologous PDB
fragments are available for every position in the chain. But they are at a disadvantage when no
such fragments are available. Ab initio methods use an all atom representation of the chain to
explicitly encode the local conformational bias, which enjoys the advantage of physical realism
but suffers due to computational cost [9, 15]. Other methods choose to directly constrain the
conformational search based on predicted 2° structure, either through iterative homology-free
folding [8, 98] or through homology-based biases [9].
The large number of energy functions and conformational search strategies are combined in a
large and growing number of structure prediction methods that are too numerous to list in detail.
One of the most prominent examples of a de novo fragment assembly method is Rosetta [80],
which has been successfully applied in the Critical Assessment of Structure Prediction
experiment CASP [1]. The Rosetta method finds large (up to nine residues) homologous
fragments of structure in the PDB and assembles them using a Monte Carlo algorithm and an
energy function that consists of a Lennard-Jones potential, a Lazaridis-Karplus implicit solvent
model [99], and an orientation-dependent statistical hydrogen bond potential [100]. Large-scale
sampling of sequence homologues and all-atom refinement has allowed the Rosetta method to
achieve high-accuracy predictions on some small globular proteins [21]. Rosetta can be limited,
however, by a dearth of large PDB fragments available for sampling [21, 101], which highlights
drawbacks of fragment-based assembly methods.
Figure 1.5.1 Nearest neighbor effects on backbone geometry. The nearest N-terminal and C-terminal neighbors have dramatic effects on the Ramachandran maps of residues. Leucine has a strong preference for an α-helical geometry when neighbored by alanines, yet prefers a β-strand geometry when neighbored by the β-branched sidechains of valine.
Other de novo prediction methods utilize alternative combinations of conformational
search strategies and energy functions. CRFolder samples conformations constructed with
machine learning techniques, and uses Monte Carlo to minimize a simple statistical potential
[83]. Another example samples large PDB fragments with a genetic algorithm and a statistical
potential trained extensively on large decoy sets [96]. Another method uses Replica Exchange
Monte Carlo methods to minimize a coarse-grained potential and includes sequence profile
information to constrain an empirical hydrogen bonding term [9].
Prediction methods also vary by the level of representation of the polypeptide chain and
water. MD simulations with explicit water, which are the most detailed representations, can take
prohibitively long for all but the smallest proteins and fragments, and often start with a partially
folded chain [102]. Simulation time decreases with an implicit representation of water, even if all
atoms are retained [15, 103], and further “coarse-graining” is obtained by MD methods that
feature implicit water and a simplified lattice model of the chain [104, 105]. Methods that
sample the chain with Monte Carlo methods can use representations that range from all-atom
with implicit solvent [21], to a complete backbone and side chain Cβ model with no solvent [8,
53], or a reduced-atom model of the backbone that includes no side chain or water information
[83].
1.6 Incorporating folding pathways into structure prediction
The enormous conformational search space available to the polypeptide chain [84]
suggests that proteins fold along a pathway [106-110]. Additionally, native-state hydrogen
exchange experiments show that subunits of 3° structure, called foldons, form cooperatively
[111] and may sequentially assemble with each other along a folding pathway [6, 112-114].
Considering the evidence that proteins fold along pathways, structure prediction methods may take
advantage by incorporating pathways. Conversely, it is also possible that pathway information
can be extracted from folding simulations in order to better understand the physical mechanism
of protein folding.
Several de novo methods take exactly that kind of physics-based mechanistic approach by
not sampling PDB fragments and excluding any homology-based information [15, 98]. Pande
and coworkers run all-atom molecular dynamics simulations with explicit solvent
on a beta-hairpin of protein G [102]. Given the time constraints of this approach, the
investigators start from the unfolding transition state of the structure, which is significantly
structured and diminishes the contributions from earlier parts of the folding pathway. Other
groups attempt to assemble pathways using Markov models on short MD simulations that do not
reach the folded state [115, 116]. MD folding of short proteins beyond their sub-microsecond
folding timescales has produced non-native conformations [117], which has been attributed to
force field bias [118].
Dill and coworkers present a model where folding proceeds through the “zipping” of
local regions of the chain followed by the assembly of these local structures into the global fold
[15]. To support this model the authors apply all-atom molecular dynamics simulations, which
are the most physical tools available, but simulate only small segments of the chain which are
subsequently assembled into a full length structure. It is unclear how exclusion of the context of
the entire chain during the formation of local structure reduces the effects of interactions that are
separated by long distances in sequence. For proteins with complex topologies and an extensive
number of sequence non-local contacts (e.g. with contacting N- and C-termini), this approach is
likely to run into problems. As a result, the elements of 2° structure that are most influenced by
3° structure context, such as the unpredicted C-terminal strand of 1ubq, cannot be factored into the
prediction using this method.
Other pathway-based prediction methods feature full 2° and 3° structure
integration and represent the entire chain starting from an unstructured conformation [9, 98, 119,
120], but sample large fragments or include homology-based information, which compromises
their ability to extract pathway information. Thus, what is required for pathway-based prediction
is a homology-free method that does not sample fragments and starts from a completely random
chain. My approach to this challenge is discussed in Chapter 2 [8].
1.7 Design of globular proteins
The design of globular proteins has made substantial progress over the last few decades
[24]. In one of the earliest design efforts, DeGrado and coworkers synthesized and crystallized
alpha-helical peptides that assembled into a tetrameric four-helix bundle as isolated fragments
[121]. This work was followed by the de novo design of a stable and structured globular four-
helix bundle [122].
A decade later Mayo and coworkers experimentally confirmed the first de novo redesign
of a previously determined protein structure [35]. In that study, the investigators used a flexible
backbone design technique to generate a new sequence for a simple zinc finger motif, with the
design having low sequence identity (21%) to the wild-type sequence. The significance of this
result is underscored by the fact that many designed sequences only involve a small number of
deviations from the wild-type sequence [27-31].
The Baker group subsequently used the tools of their Rosetta structure prediction package
[80] to redesign a diverse set of crystal structures, which produced several stable and structured
proteins [32]. The redesigned procarboxypeptidase from that study was highly stabilized relative
to the wild-type sequence, and the crystal structure of the design was found to be nearly identical
to the structure of the wild-type sequence [123]. This group also generated the first protein fold
yet to be seen in nature with the design of Top7 [39].
A comparison of designed sequences (globular, stable and structured) and the wild-type
sequences of their respective experimentally determined structures shows that the sequence
identity (i.e. exact sequence matches) is consistently greater than 25% (Table 1). The sequence
similarity, defined as matches according to a BLOSUM62 matrix [124], is on average greater
than 50% (Table 1). This high similarity may be attributed to the possibility that natural
sequences are highly optimized for their native structures and effective design algorithms are
forced to retain much of the natural sequence [125]. Another possibility is that design methods
can focus too narrowly on native structural constraints, even though sequence substitutions may
require backbone relaxation [33, 126]. Chapter 4 of this thesis details my effort to generate a
designed sequence with much lower similarity to any naturally occurring sequence.
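The two quantities compared above can be sketched in a few lines. The example below computes percent identity (exact matches) and percent similarity (exact matches plus positions that score positively under a substitution matrix) for two pre-aligned, equal-length sequences. The function names are hypothetical, the "score greater than zero" rule is an assumed reading of a substitution matrix match, and the toy scoring groups stand in for a real BLOSUM62 lookup, which the caller would supply.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions with identical residues."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)

def percent_similarity(seq_a, seq_b, score):
    """Percentage of positions that are identical or score positively.

    `score(a, b)` is a caller-supplied substitution score, for example a
    BLOSUM62 lookup; treating score > 0 as a match is an assumption.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b or score(a, b) > 0)
    return 100.0 * matches / len(seq_a)

# Hypothetical toy scoring groups, NOT the BLOSUM62 matrix.
TOY_GROUPS = [set("ILVM"), set("KR"), set("DE")]

def toy_score(a, b):
    """Return +1 if both residues share a toy group, otherwise -1."""
    return 1 if any(a in g and b in g for g in TOY_GROUPS) else -1

identity = percent_identity("MKVLE", "MRILD")
similarity = percent_similarity("MKVLE", "MRILD", toy_score)
```

For this toy pair, two of the five positions are identical and the remaining three fall within the same toy group, so the identity is 40% and the similarity is 100%.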
Table 1 Comparison of stable and structured designed sequences

| design | length | fold target PDB ID | wt % id¹ (wt % sim²) | top % id³ (top % sim⁴) | top-wt % id⁵ (top-wt % sim⁶) |
|---|---|---|---|---|---|
| protein L1 [32] | 62 | 1hz5 | 35 (61) | 50 (62) | 73 (86) |
| protein L2 [32] | 62 | 1hz5 | 45 (60) | 45 (60) | 73 (86) |
| ACP [32] | 98 | 2acy | 41 (54) | 39 (57) | 67 (69) |
| PCP [32] | 70 | 1aye | 31 (56) | 33 (56) | 73 (84) |
| S6 [32] | 94 | 1ris | 26 (43) | 32 (46) | 33 (52) |
| U1A [32] | 96 | 1urn | 32 (57) | 33 (57) | 97 (100) |
| FKB [32] | 107 | 1fkb | 42 (59) | 44 (62) | 96 (96) |
| zinc-finger [35] | 28 | 1zaa | 21 (38) | N/A | N/A |
| tenascin [127] | 89 | 1ten | 42 (64) | 42 (64) | 100 (100) |
| ubiquitin⁷ | 72 | 1ubq | 8 (38) | 25 (47) | 14 (32) |

¹ Exact amino acid matches of design to wild-type sequence
² Exact amino acid matches and BLOSUM62 substitution matrix matches of design to wild-type sequence
³ Exact amino acid matches of design to highest ranking natural sequence of PSI-BLAST of design
⁴ Exact amino acid matches and BLOSUM62 substitution matrix matches of design to highest ranking natural sequence of PSI-BLAST of design
⁵ Exact amino acid matches of wild-type sequence to highest ranking natural sequence of PSI-BLAST of design
⁶ Exact amino acid matches and BLOSUM62 substitution matrix matches of wild-type to highest ranking natural sequence of PSI-BLAST of design
⁷ Computational redesign described in Chapter 4
Chapter 2
Homology-free structure prediction and folding pathways
Parts of this chapter are published in DeBartolo et al. Mimicking the folding pathway to
improve homology-free protein structure prediction. Proc. Natl. Acad. Sci. USA (2009) vol. 106
(10) pp. 3734-9 and in the accompanying supplementary materials. Additional sections have
been added and the text has been updated accordingly. I acknowledge and thank Andres Colubri,
Abhishek Jha, and James Fitzgerald for advice on the coding of simulations and energy
functions. I also thank G. Rose, D. Shortle, J. Xu, G. Hockey, H. Gong, and members of the
Sosnick and Freed labs for helpful discussions.
Since the demonstration that a protein’s sequence encodes its structure, the prediction of
structure from sequence has remained an outstanding problem that impacts numerous scientific
disciplines including many genome projects. By iteratively fixing secondary structure
assignments of residues during Monte Carlo simulations of folding, a coarse grained model
without information concerning homology or explicit side chains can outperform current
homology-based secondary structure prediction methods. The computationally rapid algorithm
using only single (φ,ψ) dihedral angle moves also generates tertiary structures of comparable
accuracy to existing all-atom methods for many small proteins, particularly ones with low
homology. Hence, given appropriate search strategies and scoring functions, reduced
representations can be used for accurately predicting secondary structure as well as providing
three-dimensional structures, thereby increasing the size of proteins approachable by homology-
free methods and the accuracy of template methods whose accuracy depends on the quality of
the input secondary structure.
2.1 Introduction
The protein folding process is integral to multiple cellular processes, and errors can result
in amyloidogenic diseases. A protein’s structure affords a window on its function, and the huge
growth in the number of sequenced genomes provides codes for an enormous number of new
proteins with unknown functions, [2] a number far exceeding experimental capabilities and
requiring fast throughput theoretical methods for deducing protein structure from sequence. To
this end, great progress in predicting structure has emerged using homology-based methods [76].
However, predicting structure and pathways beginning only from the
sequence remains an elusive goal. Furthermore, methods for 2° and 3° structure prediction, while
often quite accurate, can fail for lack of sufficient sequences that are homologous to the target
sequence. Even if a multiple sequence alignment (MSA) exists, e.g., as generated using PSI-
BLAST [128], the alignment may diminish any structural propensity that is specific to the target
sequence (in its 3° context) in favor of the consensus of the alignment. This disadvantage can
adversely affect 3° structure prediction because the homology-based 2° structure prediction and
MSA generally serve as crucial inputs.
The reliance on homology also precludes identifying the underlying physicochemical
principles that govern protein folding, including determining the minimal information and model
of protein structure that are required for accurate structure prediction. This inadequacy arises
from the failure of many 2° structure prediction methods [10, 68] to explicitly incorporate 3°
context. Context dependence can overrule local biases [72-74], and its neglect has limited 2°
structure accuracy to about 80% for decades [71]. Previous attempts to improve 2° structure
predictions by including 3° structure predictions achieve limited success [75], perhaps because of
a reliance on sequence homology.
We present a homology-free strategy using a Cβ-level representation in which 2° and 3°
structure predictions emerge as an integral component of the folding process.
Consequently, our strategy may share some benefits that authentic proteins gain by
folding along a robust and efficient pathway. While others have integrated 2° and 3° structure
determination [9, 119] with an iterative fixing (ItFix) of 2° structure [15, 98, 120], our approach
differs by i) not using any exogenous 2° structure prediction or homology, ii) removing side
chain degrees of freedom from the model, which greatly reduces computation time, and iii)
allowing the whole chain to interact throughout the entire folding process. Furthermore, our
moves involve changes only in a single pair of (φ,ψ) dihedral angles that is obtained from the
PDB and that includes the influence of the neighboring residues’ identity and 2° structure. Our
results demonstrate that models lacking explicit side chains or information from homology can
be as accurate as existing all-atom methods while requiring orders of magnitude less computing
time. In addition, information about folding pathways can be extracted from the simulations.
2.2 Integration of 2° and 3° structure.
Our ItFix algorithm focuses on three fundamental protein properties, the sequence
dependent backbone torsional angle preferences, the backbone hydrogen bonding requirements,
and the different chemical properties and packing preferences of the twenty amino acid side
chains (Fig. 2.2.1). Since each factor strongly influences the other two, a major challenge lies in
simultaneously including all three factors with appropriate weights into a folding algorithm.
Because the model retains the backbone heavy atoms and the side chain Cβ atoms [53, 89,
129, 130], the 2N backbone (φ,ψ) dihedral angles are the major degrees of freedom for a chain of
N residues in our treatment (cis conformers are occasionally allowed, see Methods).
The neglect of side groups raises questions of how the model describes packing and
individual residue preferences. As demonstrated below, our Monte Carlo Simulated Annealing
algorithm (MCSA), using a statistical potential (StatPot) [53, 89, 129, 130] and an increasingly
restrictive PDB-based move set (Fig. 2.2.2), recaptures the requisite side chain information and
performs remarkably well in the prediction of 2° and 3° structure without invoking homology
information.
2.3 Iterative fixing and trimer selection.
A critical aspect of the algorithm is the selection of a single (φ,ψ) dihedral angle pair from an
increasingly refined library of amino acid trimers, similar in spirit to earlier studies [15, 98, 120].
During the initial round of the simulations, trimer selection is conditional only on the amino acid
identity of the three residues (Fig. 2.2.2). Trimer selection in subsequent rounds depends on the
2° structure type at each position that is identified from the previous round by the prescriptions
described in Methods. The specification of 2° structure is enabled because each trimer in the
trimer library is labeled by the 2° structure assignments for each of the three residues in the
original PDB structure in which they originate using the Dictionary of Protein Secondary
Structure (DSSP) definition [58]. The frequencies of occurrence for each originating 2° structure
type, H(elix), E(xtended), or C(oil), are calculated from the last inserted trimer at each position
in the 200-300 final structures emerging from each round of folding.
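The per-position frequency calculation described above can be sketched as follows. The sketch assumes that each final structure has already been summarized as a string holding the H, E or C label of the last inserted trimer at every position; that input format and the function name are assumptions for illustration rather than a description of the actual implementation.

```python
from collections import Counter

def ss_frequencies(final_labels):
    """Per-position H/E/C frequencies over an ensemble of final structures.

    final_labels: list of equal-length strings, one per final structure,
    giving the label of the last inserted trimer at each position
    (a hypothetical input format).
    Returns one dict per position mapping 'H', 'E', 'C' to a fraction.
    """
    if not final_labels:
        raise ValueError("need at least one final structure")
    length = len(final_labels[0])
    if any(len(labels) != length for labels in final_labels):
        raise ValueError("all label strings must have the same length")
    n_structures = len(final_labels)
    frequencies = []
    for position in range(length):
        counts = Counter(labels[position] for labels in final_labels)
        frequencies.append({ss: counts.get(ss, 0) / n_structures for ss in "HEC"})
    return frequencies

# Toy ensemble of four three-residue "structures" (hypothetical data).
toy_freqs = ss_frequencies(["HHC", "HEC", "HEC", "CEC"])
```

In an actual round the ensemble would hold the 200-300 final structures rather than four toy strings.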
Figure 2.2.1 Inter-related themes of protein folding. Protein backbone motions, 2° structure and hydrogen bonding, and side chain packing are the necessary components of any folding model. 2° and 3° structure formation are coupled processes whereby the formation of each type of structure influences the formation of the other type.
Figure 2.2.2 The ItFix 2° and 3° structure prediction protocol. At the end of each round, the 2° structure frequencies are used to eliminate H, E or C when they fall below specified thresholds.
Following Sherlock Holmes’ deductive strategy to “Eliminate all other factors, and the one
which remains must be the truth.” [131], if the frequency of occurrence for a particular 2°
structure type falls below a ~5-10% threshold at a given position or across a contiguous stretch
of sequence (see Methods), any trimer inconsistent with that 2° structure is removed from the
trimer library used in subsequent folding rounds. The process continues until no additional
positions can be further restricted. After the last round, the lowest energy and best 3° structures
are identified, while the 2° structure prediction is obtained from the frequencies of appearance of
H, E and C in all final structures (see Methods).
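The elimination step can be sketched as a simple per-position filter that is repeated until nothing changes. The sketch below covers only the single-position rule with one fixed threshold; the contiguous-stretch rule described in Methods is omitted, and the function names, the `run_round` callback, and the safeguard against emptying a position are all assumptions made for illustration.

```python
def restrict_options(allowed, freqs, threshold=0.05):
    """Drop 2° structure types whose frequency falls below the threshold.

    allowed: list of non-empty sets of still-permitted types per position.
    freqs:   list of dicts mapping type to its observed frequency.
    A position is never left empty (an assumed safeguard): if every type
    is below the threshold, the most frequent one is kept.
    """
    updated = []
    for options, freq in zip(allowed, freqs):
        kept = {ss for ss in options if freq.get(ss, 0.0) >= threshold}
        if not kept:
            kept = {max(options, key=lambda ss: freq.get(ss, 0.0))}
        updated.append(kept)
    return updated

def iterate_until_fixed(allowed, run_round, threshold=0.05, max_rounds=20):
    """Repeat folding rounds until no position can be restricted further.

    run_round(allowed) is a caller-supplied function (hypothetical) that
    performs one round of folding and returns per-position frequencies.
    Returns the final option sets and the number of rounds that were run.
    """
    for round_index in range(max_rounds):
        freqs = run_round(allowed)
        updated = restrict_options(allowed, freqs, threshold)
        if updated == allowed:
            return allowed, round_index + 1
        allowed = updated
    return allowed, max_rounds

# Toy stand-in for a folding round that always reports the same frequencies.
TOY_FREQS = [{"H": 0.90, "E": 0.02, "C": 0.08},
             {"H": 0.00, "E": 0.50, "C": 0.50}]
final_options, rounds_run = iterate_until_fixed(
    [{"H", "E", "C"}, {"H", "E", "C"}], lambda allowed: TOY_FREQS)
```

With the toy frequencies, strand is eliminated at the first position and helix at the second during the first round, and the second round confirms that no further restriction is possible.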
Our MCSA algorithm is designed to resemble a true folding pathway. Each round in the
ItFix process begins from a configuration devoid of any 3° structure, rather than a collapsed
structure generated from a previous round. Consequently, the chains execute a new global search
each round. The backbone geometry is simulated by replacing only one of the three pairs of
dihedral angles at a randomly chosen position with those from the equivalent position in a
trimer selected from the trimer library. In principle, all-atom simulations for tripeptides could be
used, but the accuracy of current methods makes this approach less reliable [94]. The starting
chain is built using angles from trimers specified solely by the amino acid sequence. The trimer
library becomes increasingly conditional on 2° structure type as the rounds proceed. Each round
of ItFix consists of 200-300 individual folding trajectories. Each trajectory involves a global
search guided by insertion moves, a Metropolis acceptance criterion, and a StatPot as the
scoring function. The trajectory ends when the collapsed structure cannot undergo additional
moves. The end result of the iterative rounds is a folding-enhanced 2° structure prediction that
emerges simultaneously with an ensemble of 3° structures.
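A single trajectory of this kind can be sketched as a generic simulated annealing loop with a Metropolis acceptance test. Everything specific in the sketch is an assumption made for illustration: the geometric cooling schedule, the fixed number of steps (in place of the stopping rule described above), the three-entry angle library, and the toy energy that simply counts non-helical positions in place of the StatPot.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def anneal(angles, energy, propose, t_start=2.0, t_end=0.05, steps=2000, seed=0):
    """Simulated annealing over a list of (phi, psi) pairs.

    Each step replaces the angle pair at one randomly chosen position with
    a proposed pair and applies the Metropolis criterion.
    """
    rng = random.Random(seed)
    current = list(angles)
    e_current = energy(current)
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        fraction = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction
        position = rng.randrange(len(current))
        trial = list(current)
        trial[position] = propose(position, rng)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_current, temperature, rng):
            current, e_current = trial, e_trial
    return current, e_current

# Hypothetical stand-ins for the trimer library and the scoring function.
TOY_LIBRARY = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]

def toy_propose(position, rng):
    """Draw a (phi, psi) pair from the toy library, ignoring sequence."""
    return rng.choice(TOY_LIBRARY)

def toy_energy(chain):
    """Count positions that are not in the helical (-60, -45) state."""
    return float(sum(pair != (-60.0, -45.0) for pair in chain))

start_chain = [(-120.0, 130.0)] * 6
final_chain, final_energy = anneal(start_chain, toy_energy, toy_propose)
```

In a sequence-dependent version, `propose` would draw from the trimers that match the local amino acid identities and the currently allowed 2° structure types at the chosen position.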
2.4 Retaining lost side chain information.
The retention of the side chain information lost by the use of the Cβ-level representation
poses a serious challenge. Central to this goal is our dihedral angle sampling procedure that
is conditional on both the chemical identity and the increasingly refined 2° structure specificity
for each position and its neighboring residues. The backbone dihedral angles are strongly
correlated with the side chain rotamer angles and both the neighboring residues’ side chain
identities and conformations [52, 53, 95]. Hence, even without explicitly depicting the side chain
atoms, much of their influence is retained by choosing dihedral angle values using our conditional
trimer selection strategy.
In addition to retaining the interplay of side chain and backbone interactions, our
algorithm focuses on optimizing 3° interactions. The 3° interaction energies are obtained from the
StatPot “DOPE-Cβ” [53, 89] derived from an all-atom pairwise additive StatPot [88, 129] that
uses a novel reference state and distinguishes the backbone atoms according to amino acid type.
Our version removes all contributions involving hydrogen and side chain atoms beyond the Cβ
atom. To eliminate bias towards specific 2° structure types, the attractive potential is removed
between atoms in contiguous stretches of 2° structure, while the repulsive portion is retained to
prevent steric overlap. In addition, interactions are conditional on backbone geometry and the
relative orientation of the Cα-Cβ bonds of the two interacting side chains (Figs. 2.4.1, 2.4.2), a
feature particularly helpful in setting up the overall chain topology so that collapse generates
native-like structures. Beyond the prescription used to eliminate a 2° structure option in the
trimer library, the only adjustable parameters are the four linear weight factors in the StatPot
(Table 2).
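One way to picture the removal of the attractive portion is as a weight applied only to the negative part of a distance-binned energy profile. The sketch below is an interpretation under stated assumptions: it treats negative bin energies as "attractive" and positive ones as "repulsive", uses a hypothetical toy profile, and takes the bin size, cutoff and attractive weight as plain arguments in the spirit of Table 2. It is not the DOPE-based implementation itself.

```python
def weighted_pair_energy(distance, profile, bin_size, max_dist, attractive_weight):
    """Look up a binned pair energy and scale only its attractive part.

    profile: list of per-bin energies starting at distance 0 (toy data here).
    Negative energies are treated as attractive and multiplied by
    attractive_weight; positive (repulsive) energies are returned unchanged.
    Distances at or beyond max_dist contribute nothing.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance >= max_dist:
        return 0.0
    index = min(int(distance // bin_size), len(profile) - 1)
    raw = profile[index]
    if raw < 0:
        return attractive_weight * raw
    return raw

# Hypothetical four-bin profile with 1.0 Å bins and a 4.0 Å cutoff.
TOY_PROFILE = [5.0, 1.0, -2.0, -0.5]
inside_helix = weighted_pair_energy(2.5, TOY_PROFILE, 1.0, 4.0, attractive_weight=0.0)
between_strands = weighted_pair_energy(2.5, TOY_PROFILE, 1.0, 4.0, attractive_weight=5.0)
clash = weighted_pair_energy(0.2, TOY_PROFILE, 1.0, 4.0, attractive_weight=0.0)
```

Setting the weight to zero removes the attraction for atom pairs within a contiguous stretch of 2° structure while leaving the short-range repulsion intact, and a weight above one strengthens the attraction for the corresponding class of pairs.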
2.5 Structure prediction results
Improvement in 2° structure prediction arising from folding.
The first set of targets (Table 3, Fig. 2.5.1) originates from a previous study that
integrates 2° and 3° structure prediction [75]. The set contains proteins with eleven diverse folds
and relatively low sequence homology. The second set of targets (Table 4, Fig. 2.5.2) originates
from a study focusing on improving 3° structure prediction using high sequence homology and
extensive side chain refinement.
Our accuracy of predicting the three major 2° structure types, H(elix), E(xtended) and C(oil)
(termed “Q3 level”), significantly improves as a result of the ItFix folding algorithm compared to
the intrinsic, locally-determined biases. The baseline is given by the initial trimer
library, which is contingent only on the sequence. This Round 0, or “R0”, accuracy of 58 ± 10%
improves to 82 ± 11% over the 6-9 rounds of the ItFix process for the various proteins (Tables 3
and 4). The process of fixing 2° structure by eliminating options is well illustrated by the
evolution of 2° structure frequencies at each position in 1Ubq (Fig. 2.5.3). The R0 frequencies
display some bias to the native 2° structure but provide only 60% accuracy. Only as 2° structure
options are eliminated does the native 2° structure pattern emerge with 92% accuracy, based on
the average 2° structure obtained over the course of the nine rounds. A notable
example is the carboxy-terminal region where the high intrinsic helicity is over-ridden by 3°
context and the region becomes a native-like strand.
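For reference, a Q3 accuracy of the kind quoted above is the percentage of residues whose predicted three-state label matches the native one. The sketch below shows that calculation together with a reduction of DSSP eight-state labels to three states in which every label other than H and E is treated as coil, following the grouping of coil types used in this chapter. The function names are hypothetical.

```python
def to_three_state(dssp_labels):
    """Reduce DSSP labels to H, E or C; anything other than H or E becomes C."""
    return "".join(label if label in ("H", "E") else "C" for label in dssp_labels)

def q3_accuracy(predicted, native):
    """Percentage of positions where predicted and native labels agree."""
    if len(predicted) != len(native) or not native:
        raise ValueError("label strings must be equal-length and non-empty")
    correct = sum(p == n for p, n in zip(predicted, native))
    return 100.0 * correct / len(native)

# Toy example: three of four positions agree.
toy_q3 = q3_accuracy("HHHC", "HHEC")
reduced = to_three_state("HGIEBTS-")
```

Here `toy_q3` evaluates to 75.0 and `reduced` to `"HCCECCCC"`.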
Figure 2.4.1 Orientation-dependence of statistical potential. Each interacting residue pair has two angles. One angle, 1-2, is the angle between the Cα-Cβ vector of Residue 1 and the Cβ-Cβ vector from Residue 1 to Residue 2, and the other angle, 2-1, is the angle between the Cα-Cβ vector of Residue 2 and the Cβ-Cβ vector going from Residue 2 to Residue 1. A) The relative orientation of the side chains is quantified as . B) Two residues have angles 1-2 and 2-1 close to 90°, yielding a small value. C) A residue pair with a large value has angles 1-2 and 2-1 that are far from 90°. D) Hypothetical protein illustrating possible residue pair orientations having small (1-2, 2-3, 1-3, 4-5) and large (1-4, 2-4, 3-4, 1-5, 2-5, 3-5) values.
Figure 2.4.2 Statistical potential energy profiles illustrating orientation dependence. This dependence reflects the basic protein structural principles of hydrophobic burial, hydrophilic exposure and 2° structure conformation. A) The inter-atomic potential for two Cβ atoms in three different orientations with a high value. In such cases, hydrophobic amino acids are favored to be at shorter distances, corresponding to buried residues pointing at each other in the core of the protein. The opposite applies for hydrophilic amino acids, which prefer larger distances corresponding to surface exposed residues on opposite sides of the protein. B) The potential for two Cβ atoms in two different orientations for two residues on strands of β-sheets with a small value. Shorter Cβ-Cβ distances are preferred for residues on the same side of the sheet, and larger for those on opposite sides of the sheet.
Table 2 Components of DOPE-PW statistical potential

| energy term | contin. helix | contin. strand | contin. coil | anti-parallel β-sheet (small ) | anti-parallel β-sheet (small ) | non β-sheet (small ) | non β-sheet (med. ) | non β-sheet (large ) |
|---|---|---|---|---|---|---|---|---|
| min/max dist. (Å) | 0.0/15.0 | 0.0/15.0 | 0.0/15.0 | 0.0/15.0 | 0.0/15.0 | 0.0/30.0 | 0.0/30.0 | 0.0/30.0 |
| bin size (Å) | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 |
| attractive weight | 0.0 | 0.0 | 0.0 | 5.0 | 10.0 | 1.0 | 1.0 | 1.0 |
Figure 2.5.1 2° and 3° structure prediction for low-homology targets. ItFix predicts 2° structure at the Q8 level (H, E, CG, CN, CI, CS, CB, or CT) and 3° structure for the low-homology targets in Table 3. Alignments of the ItFix lowest observed RMSD 3° structure (red) with the native structure (blue) were made using PyMOL visualization software. Cα RMSD values between the ItFix model and the native structure are listed next to each target name. These are referred to as low-homology targets due to the paucity of sequence homologues in the sequence database. This implies that methods such as PSIPRED and SSPro, which take advantage of homology in the form of sequence profiles, gain minimal information beyond the local 2° structure propensity of each sequence.
Figure 2.5.2 2° and 3° structure prediction for high-homology targets. ItFix predicts 2° structure at the Q8 level (H, E, CG, CN, CI, CS, CB, or CT) and 3° structure for the high-homology targets in Table 4. Alignments of the ItFix lowest observed RMSD 3° structure (red) with the native structure (blue) using PyMol visualization software. The Cα RMSD between the ItFix model and the native structure is listed next to each target name. These are referred to as high-homology targets due to the abundance of sequence homologues in the sequence database. This implies that methods such as PSIPRED and SSPro that take advantage of homology in the form of sequence profiles are conferred maximal information in addition to the local 2° structure propensity of each sequence.
Table 3 Homology-free ItFix performance on low-homology target set

Columns R0 through Meiler & Baker Q3 report 2° structure % accuracy; the last two columns report 3° structure RMSD (Å).

| PDB ID | description | Length | Fold | R0² Q3 | ItFix Q3 (Q8³) | SSPRO⁴ Q3 (Q8) | PSIPRED⁵ Q3 | Meiler & Baker⁶ Q3 | ItFix (best⁷) | Meiler & Baker (best⁸) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1ail | Protein fragment | 70 | | 46 | 76 (73) | 70 (74) | 73 | 64 | 5.4 | 6.0 |
| 1aoy | Single domain repressor | 78 | | 54 | 82 (72) | 81 (65) | 87 | 89 | 5.7 | 5.7 |
| 1c8cA | DNA-binding | 64 | | 56 | 86 (70) | 72 (59) | 59 | 67 | 3.7 | 5.0 |
| 1cc5 | Heme-binding | 76 | | 70 | 92 (68) | 74 (75) | 88 | 86 | 6.5 | 6.2 |
| 1dtdB | Disulfide bonds | 61 | | 57 | 71 (57) | 64 (57) | 75 | 69 | 6.5 | 5.7 |
| 1hz6A | Protein L | 67 | | 57 | 80 (72) | 80 (75) | 83 | 87 | 3.8 | 3.4 |
| 1fwp | CheY-binding domain | 69 | | 45 | 70 (55) | 48 (30) | 61 | 68 | 8.1 | 7.3 |
| 1isuA | Iron-binding | 62 | | 65 | 82 (44) | 66 (39) | 81 | 89 | 6.5 | 6.9 |
| 1sap | Hyper-thermophile | 66 | | 65 | 85 (67) | 76 (67) | 65 | 65 | 4.6 | 6.6 |
| 1wapA | oligomer in crystal structure | 68 | | 43 | 80 (68) | 73 (64) | 81 | 68 | 8.0 | 7.7 |
| 2ezk | DNA-binding | 93 | | 58 | 80 (75) | 71 (64) | 91 | 85 | 5.5 | 6.6 |

¹Target sequences were taken from a previous study by Meiler and Baker [75].
²Round 0 accuracy of the initial, sequence-dependent trimer library before any 2° structure restrictions are made. This reflects local 2° propensity.
³ItFix predicts 2° structure at the Q8 level (H, E, and the 6 types of coil, including turn (CT), bend (CS), 3-10 helix (CG), pi helix (CI), beta-bridge (CB), and other (CN)).
⁴SSPro predictions taken from the SSPro online server [132].
⁵PSIPRED predictions taken from the PSIPRED online server [133].
⁶Values taken from column 6, Table 2 of Meiler et al. [75].
⁷Lowest RMSD obtained.
Table 4 Homology-free ItFix performance on high-homology target set

Columns R0 through PSIPRED Q3 report 2° structure % accuracy; the last two columns report 3° structure RMSD (Å).

| PDB ID | Length | Fold | R0² Q3 | ItFix Q3 (Q8) | SSPRO³ Q3 (Q8) | PSIPRED⁴ Q3 | ItFix⁵ lowest energy (best) | Bradley et al.⁶ |
|---|---|---|---|---|---|---|---|---|
| 1af7 | 69 | | 70 | 97 (86) | 86 (81) | 90 | 2.9 (2.5) | 10.4 |
| 1b72A | 50 | | 62 | 88 (84) | 68 (72) | 84 | 3.5 (1.6) | 1.1 |
| 1csp | 67 | | 49 | 79 (67) | 75 (67) | 88 | 10.5 (6.0) | 4.7 |
| 1di2 | 68 | | 68 | 88 (79) | 74 (75) | 97 | 6.1 (4.6) | 2.6 |
| 1dcj | 72 | | 38 | 45 (29) | 65 (56) | 89 | 13.3 (7.6) | 2.5 |
| 1mky | 77 | | 66 | 86 (70) | 87 (71) | 90 | 6.9 (6.1) | 6.3 |
| 1o2Fb | 77 | | 65 | 78 (69) | 79 (66) | 75 | 11.2 (5.8) | 10.1 |
| 1r69 | 61 | | 79 | 93 (89) | 84 (72) | 92 | 4.2 (2.4) | 1.2 |
| 1shfA | 59 | | 53 | 76 (56) | 85 (69) | 80 | 12.2 (6.7) | 10.8 |
| 1tif | 57 | | 47 | 89 (79) | 86 (70) | 93 | 11.3 (4.2) | 4.1 |
| 1tig | 86 | | 53 | 83 (70) | 69 (67) | 83 | 6.4 (5.3) | 3.5 |
| 1ubq | 73 | | 60 | 92 (69) | 88 (67) | 90 | 5.3 (3.1) | 1.0 |

¹Target sequences are the same as a previous study [21].
²Round 0 accuracy of the initial, sequence-dependent trimer library. This reflects local 2° propensity.
³SSPro predictions taken from the SSPro online server [134].
⁴PSIPRED predictions taken from the PSIPRED online server [133].
⁵ItFix lowest energy and lowest observed Cα RMSD (in parentheses) structures.
⁶Values shown are taken from ref. [21], Table 1, Column 6 “Lowest all-atom energy”.
Within the framework of the ItFix algorithm and our energy function, the importance of
3° context in determining 2° structure is demonstrated for the five best performing targets (1af7,
1b72A, 1r69, 1di2, 1ubq). The ItFix process is repeated but without attractive terms between
amino acids separated by more than six residues (|i-j| > 6). When all the repulsive terms are
retained, the chain adopts extended geometries to avoid steric overlap. The resulting 2° structure
prediction accuracy decreases sharply even compared to the initial R0 2° structure accuracy. When
long-range chain overlap is allowed, the 2° structure prediction accuracy also decreases relative to R0
because the only favorable interaction term remaining is between two residues on strands when
|i-j| > 4. By itself, this term is insufficient to drive stable, native-like sheet formation. A slight
improvement over R0 occurs simply from the 2° fixing protocol without any simulated
annealing, but the improvement is marginal, 0-2%, compared to the 13-30% obtained when the
long-range interactions are included.
Hence, accurate 2° structure prediction requires 3° context. Context serves to stabilize or
buttress weak local biases and 2° structural elements. For example, the amino-terminal hairpin in 1ubq
emerges when the formation of a weak turn brings two potential strands together. Similarly, an
unstable amphipathic helix and a hairpin with a hydrophobic face can stabilize each other.
Such 3° contacts may not always be completely native-like, as significant increases in 2°
structure accuracy can arise even when the global 3° fold is inaccurate (e.g., RMSD > 6 Å).
Comparison with existing 2° and 3° structure prediction methods.
The ItFix accuracy generally surpasses that of the 2° structure prediction servers
SSPRO [68] and PSIPRED [10] and of the previous study [75], with some exceptions. The high
homology of these sequences allows the prediction accuracy to meet or exceed 80%
for both the SSPRO [68] and PSIPRED [10] servers. Nevertheless, our ItFix protocol achieves
comparable accuracy without invoking any homology information (Tables 3 and 4). The average
ItFix accuracy is only slightly smaller for the low homology targets, 80% versus 83%. But the
lack of homology significantly degrades the performance of PSIPRED and SSPRO, 77% versus 88%
and 70% versus 79%, respectively. Furthermore, the ItFix method is able to predict all eight
types of 2° structure, where coil is subdivided into six DSSP-defined subtypes (CG, CN,
CI, CS, CB, CT), termed “Q8 level”. This ability is also available using SSPRO, but it is slightly
less accurate for most targets. As illustrated below, ItFix provides much better predictions for the
location of turns and the ends of helices and strands, features that are crucial in 3° structure
prediction.
Figure 2.5.3 ItFix algorithm mimics the experimentally-determined ubiquitin folding pathway. The position dependence of the 2° structure frequencies at the end of each round, E (blue), H (red) and C (green). A single color bar represents a residue assigned to a single 2° structure type (native 2° structure shown at top, along with long-range contacts). The major steps in the proposed folding pathway [113, 114] are similar to the order of structure fixing over the multiple rounds: The hairpin forms, followed by the helix and β3 strand, and then β4. The final two events are the folding of the 3-10 helix and β5. Their formation appears in some trajectories but not at a high enough frequency to be fixed.
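
The averages quoted above can be rechecked from the Q3 columns of Tables 3 and 4. The sketch below simply transcribes those columns in row order; the dictionary layout and variable names are illustrative.

```python
# Q3 accuracies (%) transcribed from Table 3 (low homology, 11 targets)
# and Table 4 (high homology, 12 targets), in row order.
low_homology = {
    "ItFix": [76, 82, 86, 92, 71, 80, 70, 82, 85, 80, 80],
    "PSIPRED": [73, 87, 59, 88, 75, 83, 61, 81, 65, 81, 91],
    "SSPRO": [70, 81, 72, 74, 64, 80, 48, 66, 76, 73, 71],
}
high_homology = {
    "ItFix": [97, 88, 79, 88, 45, 86, 78, 93, 76, 89, 83, 92],
    "PSIPRED": [90, 84, 88, 97, 89, 90, 75, 92, 80, 93, 83, 90],
    "SSPRO": [86, 68, 75, 74, 65, 87, 79, 84, 85, 86, 69, 88],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


for method in ("ItFix", "PSIPRED", "SSPRO"):
    low = mean(low_homology[method])
    high = mean(high_homology[method])
    print(f"{method}: low {low:.1f}%, high {high:.1f}%, drop {high - low:.1f}")
```

Rounded to whole percentages, the means reproduce the figures in the text: 80 versus 83 for ItFix, 77 versus 88 for PSIPRED, and 70 versus 79 for SSPRO.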
The ItFix algorithm describes α, β, and α/β proteins within each set with comparable
accuracy, although 3° predictions for the more challenging low homology set are generally
poorer (Tables 3 and 4, Figs. 2.5.1 and 2.5.2) because we have difficulty predicting metal- and
heme-binding proteins and disulfide-bonded proteins. The high homology set lacks these challenging protein
types. The accuracy of ItFix’s 3° structure predictions is comparable to that of the highly
successful Rosetta fragment-based insertion algorithm, as implemented in the papers from which
the test sets are obtained [21, 75]. Our structures are closer in quality to the Rosetta structures for
the low homology set than for the high homology set. The high homology targets were chosen by Baker and
coworkers because improved predictions are obtained using data from the folding of an extensive
number of homologs. In addition, the Rosetta algorithm requires extensive side chain refinement
and thus orders of magnitude more computation time [21] than our algorithm, which omits side
chain degrees of freedom. Hence, it is not surprising that this implementation of Rosetta
performs better for 9 of the 12 targets in this set.
Table 3 demonstrates that the ItFix 2° structure prediction method can meet or exceed the
percentage prediction accuracy of the programs PSIPRED and SSPro. The information conveyed
by the % accuracy, however, is compromised by disagreements between methods for the
assignment of 2° structure. For example, the DSSP method, which we use to assign 2° structure,
differs from DeepView in specifying 2° structure for 1tif (Table 4). DeepView tends to be more
liberal when assigning strands and designates residues 3-5 and 9 as strand, whereas DSSP
assigns this region as mostly coil. Our method similarly favors assigning strand over coil,
implying that ItFix would achieve higher 2° structure accuracy for 1tif if compared to the
native DeepView assignment rather than the DSSP assignment. Nevertheless, we compare our
prediction to DSSP assignments because DSSP is used to calculate the 2° structure of the
simulation models.
Another issue relating to 2° structure prediction accuracy is the varying classification of 2°
structures by different prediction methods. For example, some approaches consider an α-helix
and a 3-10 helix to belong to the same category, whereas we treat the 3-10 helix as a subtype of
coil because the α-helical hydrogen bonding pattern requires at least 4 residues whereas the 3-10
helix requires only three. Notably, when Q3 level methods such as PSIPRED and SSPro predict
a 3-10 helix as an ‘H’, we count the prediction as correct, and as incorrect when they predict a
3-10 helix as coil. Because our 2° structure sampling depends on DSSP, we adhere to its convention and
consider 3-10 helix to be a class of coil (CG).
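
The scoring convention described above can be sketched as follows. This is an illustration under stated assumptions, not the scoring script used for the tables: the function names and the `g_counts_as` parameter are hypothetical, and the input is assumed to be a string of standard DSSP one-letter codes (H, G, I, E, B, T, S, with any other character treated as coil).

```python
def to_q3(dssp, g_counts_as="C"):
    """Collapse a DSSP string to three states (H, E, C).

    Only 'H' maps to helix and only 'E' maps to strand. The 3-10 helix ('G')
    maps to coil by default, following the convention in the text, or to 'H'
    when scoring a Q3 method that predicted it as helix.
    """
    mapping = {"H": "H", "E": "E", "G": g_counts_as}
    return "".join(mapping.get(code, "C") for code in dssp)


def q3_accuracy(predicted, native_dssp, g_counts_as="C"):
    """Percentage of positions where the Q3 prediction matches the native."""
    if len(predicted) != len(native_dssp):
        raise ValueError("prediction and native assignment differ in length")
    native = to_q3(native_dssp, g_counts_as)
    matches = sum(p == n for p, n in zip(predicted, native))
    return 100.0 * matches / len(native)


native = "HHGGEE"
print(q3_accuracy("HHCCEE", native))                   # 3-10 helix scored as coil
print(q3_accuracy("HHHHEE", native, g_counts_as="H"))  # lenient scoring for a Q3 method
```

Both calls score the prediction as fully correct, which shows how the same native assignment can support two different "correct" answers depending on the convention.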
Feedback between 2° and 3° structure prediction.
While the average accuracy of a 2° structure prediction is a useful metric, it underreports
the importance of the feedback between 2° and 3° structure, as illustrated by two of the many
examples. The ItFix 2° structure accuracy for 1c8c is only modestly superior to those of
PSIPRED and SSPro, but crucially, ItFix correctly predicts as strand a region that both
other methods incorrectly assign as helix (Fig. 2.5.1). Similarly, the SSPro Q8 level prediction
incorrectly assigns positions 9 and 10 as turn in 1ubq, whereas ItFix correctly assigns the turn
residues to positions 8 and 9 (Fig. 2.5.2). Only through successive rounds of folding does the
proper 3° context override the local propensities to correctly determine the location of the turn.
Although seemingly insignificant, this difference is crucial because the alignment of the hairpin,
and therefore the quality of the overall structure, depends on properly identifying the turn
location. Thus, extensive sequence homology information and intrinsic propensities can be
insufficient for 2° structures that depend strongly on 3° context.
Our main limitation in predicting 2° structure is the occasional deficiency of our starting
trimer library. For example, when we predict 2° structure for target 1dcj from the initial trimer
library contingent only on the sequence (R0), the accuracy is below 40%, implying that very poor
local 2° structure context exists for this target (Table 4). Our 46% accuracy for this target
suggests that the 3° context of folding is insufficient to compensate for poor local propensity. In
fact, we assign the second helix of 1dcj as coil because a proline-glycine pair in the center of that
helix has a very high preference for coil. PSIPRED performs well on this target, presumably
because of the influence of sequence homology. SSPRO underperforms for this protein and some
others, perhaps because it weights local preferences more heavily relative to the contribution of the
sequence alignment than PSIPRED does.
3° structure predictions.
Even though our StatPot can routinely distinguish a native structure from a set of folding
decoys, the folding simulations cannot always generate native-like models. This limitation is
often due to the vast size of the conformational search space for some sequences. We reduce the
search space by specifying the sequence, and then iteratively identifying the 2° structure. Often,
however, local propensities are so strong that even enormous amounts of sampling and 3°
context cannot overcome the bias. For example, the turn of the second β-hairpin of 1di2 contains
residues whose turn propensities are very low. Even after many rounds of ItFix folding, the
turn probability of that region never becomes high enough to fix, which severely limits the
quality of the 3D models generated. Other prediction methods circumvent this problem and
accurately predict this structure by using the degeneracy of sequence homology to properly
predict the turns and by sampling larger structure fragments which may contain long range
information that specifies the turn [21], suggesting that employing sequence homology can
smooth over any incorrect local biases. Although ItFix uses no homology information and
samples one position at a time, it still correctly predicts the structure of 1di2 and its difficult turn
by including the crucial 3° structure context of folding.
The lack of homology-based information is actually beneficial to predictions for some
sequences, specifically when the MSA incorrectly biases the 2° structure. ItFix fares
exceptionally well for the 2° and 3° structure of 1sap (Table 3) because the 3° context drives the
central region of the protein to be β-sheet rather than the helix preferred by the sequence
homology-based methods (Fig. 2.5.2). The very high confidence of PSIPRED in this region
suggests that the MSA strongly biases the 2° structure towards helix, resulting in less accurate 3°
structure.
2.6 Folding pathways.
Many aspects of the ItFix algorithm replicate the folding behavior of authentic proteins.
During the ItFix process, subunits of structure, or “foldons”, are fixed cooperatively, just as
observed by hydrogen exchange experiments [111]. The foldons add to existing structure in a
process of sequential stabilization [112] that may resemble the pathway taken by authentic
proteins. In contrast to methods using pre-formed fragments or exogenous 2° structure
predictions, where the connection to the authentic pathway is murky at best, the ItFix protocol
begins with an initial unstructured chain, and the buildup of structure evolves out of the folding
process. Hence, the order of fixing of structural elements may recapitulate major features of the
authentic pathway followed as the real chain progresses along the free energy surface (Fig.
2.6.1).
For the protein ubiquitin, the order in which structural elements are fixed (Fig. 2.5.3) and their
interactions are in remarkable accord with the experimental pathway [113]. A notable feature is
the formation of the parallel β-strand interaction between the amino and the carboxy termini.
This long-range contact occurs prior to the 2° structure assignment of thirty intervening residues
and is possible with our method because the simulation includes the entire chain at all times.
Further, this parallel interaction overrides the initial R0 trimer propensities that favored
helix for the carboxy-terminal strand, as previously noted. Irrespective of whether the ItFix
algorithm replicates experiment, the pathway nature of the algorithm and the interplay of 2° and
3° structure formation contribute to the success, just as a pathway helps real proteins fold
reproducibly and expediently.
2.7 Conclusions
The ItFix algorithm predicts 2° structure without resorting to homology and yet delivers an
accuracy and specificity that matches or exceeds current methods which rely heavily on
homology. The success is due to the integration of 3° structure context during the folding
simulations and the recursive refinement of the 2° structure assignments. Concurrently, accurate
3° structures are often generated. Although the model lacks explicit side chains, our PDB-based
backbone sampling protocol and scoring functions largely recapture the lost information. Hence,
we avoid the computationally expensive search along the rugged side chain rotamer energy
surface that is frequently involved in other successful prediction methods. In addition to
highlighting the basic principles required for ab initio structure prediction, our work extends the
size of proteins that can be predicted using homology-free methods. Furthermore, the ItFix 2°
structure predictions provide improved prediction of turns and ends of helices and strands,
features that are important in describing 3° structure. Thus, the ItFix predictions can be used as
inputs to increase the accuracy of template-based predictions, which previously had inherent
restrictions imposed by requiring sequence homology. Moreover, now that the basic principles
have been established, the performance of ItFix can be improved further using homology.
Figure 2.6.1 Progression of fixing structure for 1af7, 1b72A, 1di2, and 1r69. The position dependence of the 2° structure frequencies at the end of each round, E (blue), H (red) and C (green). A single color bar represents a residue assigned to a single 2° structure type. The native 2° structure is shown with red boxes (helices) and blue arrows (strands) at the top and bottom. The order of fixing of structural elements may recapitulate major features of the authentic pathway. Round 0 frequencies are the average 2° structure obtained from the initial trimer library that is contingent only on the sequence. As the rounds progress, the probabilities of non-native 2° structures diminish.
2.8 Methods
Fixing Protocol
The protocol for eliminating a 2° structure option at a position is determined using the 2°
structure frequencies in the trimer library at the beginning of the round, $P^{X}_{\mathrm{Init}}$ (X = E, H or C), the
frequencies calculated using DSSP for the 200 - 300 final structures, $P^{X}_{\mathrm{Fin\_1}}$, and the frequencies
of the trimers’ original 2° structure, $P^{X}_{\mathrm{Fin\_0}}$, according to the following main criteria (see Suppl.
material). For i consecutive positions (in order of precedence):
(i>6): [HEC] $\rightarrow$ [EC] if $P^{H}_{\mathrm{Fin\_1}} < 0.03$
(i>10): [HEC] $\rightarrow$ [EC] if $P^{H}_{\mathrm{Fin\_1}} < 0.05$
(i>2): [HEC] $\rightarrow$ [EC] if $P^{E}_{\mathrm{Fin\_0}} > 0.50$ and $P^{E}_{\mathrm{Fin\_0}} > P^{E}_{\mathrm{Init}}$ and $P^{E}_{\mathrm{Fin\_1}} > 0.00$
(all positions in protein): [HEC] $\rightarrow$ [HC] if $P^{E}_{\mathrm{Fin\_1}} < 0.01$
(i>3): [HEC] $\rightarrow$ [HC] if $P^{H}_{\mathrm{Fin\_0}} > 0.50$ and $P^{H}_{\mathrm{Fin\_0}} > P^{H}_{\mathrm{Init}}$
(i>4): [HEC] $\rightarrow$ [HC] if $P^{H}_{\mathrm{Fin\_1}} > 0.40$
(i>0): [H or C] $\rightarrow$ [H only] if $P^{C}_{\mathrm{Fin\_0}} < 0.10$, or ($P^{H}_{\mathrm{Fin\_0}} > 0.50$ for i-1, i-2, i+1, i+2)
(i>0): [H or C] $\rightarrow$ [C only] if $P^{H}_{\mathrm{Fin\_1}} < 0.10$
(i>0): [E or C] $\rightarrow$ [C only] if $P^{E}_{\mathrm{Fin\_0}} < 0.05$
(i>0): [E or C] $\rightarrow$ [E only] if $P^{C}_{\mathrm{Fin\_0}} < 0.10$
(i>0): [E or C] $\rightarrow$ [E only] if $P^{E}_{\mathrm{Fin\_0}} > 0.50$ and ($P^{E}_{\mathrm{Fin\_0}} > 0.50$ for i-1, i+1)
(i>0): [E or C] $\rightarrow$ [E only] if $P^{E}_{\mathrm{Fin\_0}} > 0.50$ and $P^{E}_{\mathrm{Fin\_1}} > 0.00$ and total positions fixed > 80% of sequence length
The thresholds were selected as an empirical compromise between
prediction accuracy and the speed of specifying 2° structure. Some accuracy may be
compromised to allow the largest number of positions to be fixed within a reasonable number of
rounds.
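
Two of the criteria listed above can be sketched in code to show how a threshold rule narrows the allowed states at each position. This is an illustration only: it reads "(i>6)" as a run of more than six consecutive positions that all satisfy the condition, which is an interpretation rather than something the text states explicitly, and the function names and data layout are hypothetical.

```python
def runs_below(values, threshold, min_len):
    """Yield (start, end) for each maximal run longer than `min_len`
    in which every value is below `threshold` (end is exclusive)."""
    start = None
    # A sentinel equal to the threshold closes a run that reaches the end.
    for idx, value in enumerate(list(values) + [threshold]):
        if value < threshold:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start > min_len:
                yield (start, idx)
            start = None


def apply_rules(allowed, p_h_fin1, p_e_fin1):
    """Apply two elimination rules to per-position sets of allowed states.

    Rule 1: remove H over runs of more than 6 positions with P(H, Fin_1) < 0.03.
    Rule 2: remove E at any position with P(E, Fin_1) < 0.01.
    """
    allowed = [set(states) for states in allowed]
    for start, end in runs_below(p_h_fin1, 0.03, 6):
        for pos in range(start, end):
            allowed[pos].discard("H")
    for pos, prob in enumerate(p_e_fin1):
        if prob < 0.01:
            allowed[pos].discard("E")
    return allowed


allowed = [set("HEC") for _ in range(8)]
p_h = [0.0] * 7 + [0.5]   # helix almost never seen at positions 0-6
p_e = [0.5] * 7 + [0.0]   # strand never seen at position 7
print(apply_rules(allowed, p_h, p_e))
```

Coil is never removed by either rule, so every position keeps at least one allowed state.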
If the turn (CT) probability is greater than 50% in a region that has been fixed to have at
least one 2° structure type removed, we fix that region to coil. Also, regardless of the
library restrictions, if a large stretch of positions contains no strand, then strand is removed
from the library at those positions if the overall 2° structure fixing is at an advanced state (>90%
positions fixed). If a position is fixed as strand, at least 3 adjacent positions will be fixed as
strand when those positions have a strand probability > 50%. If the direction in which strands
should be fixed is ambiguous, fixing proceeds away from the nearest segment of coil. This correction is
added to make sure the maximum amount of secondary structure is fixed for a given target.
To decrease the number of rounds of folding required for convergence, we use additional
operations. If strand has been removed from the library at two positions that are separated by
three or fewer residues where there are no library restrictions, we remove strand from the
intervening positions. There are operations on all types of library restrictions to refine any small
spaces between fixed regions, e.g. C-C and H-H are replaced by CCC and HHH, respectively.
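
The two convergence operations above can be sketched as simple list and string operations. The representation is an assumption made for illustration: fixed positions are written as H, E or C with '-' for unfixed positions, and strand removal and library restrictions are tracked as boolean lists. The function names are hypothetical.

```python
def fill_single_gaps(fixed, types="CH"):
    """Replace a single unfixed position flanked by the same fixed type,
    e.g. C-C becomes CCC and H-H becomes HHH."""
    out = list(fixed)
    for i in range(1, len(fixed) - 1):
        # Decisions are based on the original string, so C-C-C fills both gaps.
        if fixed[i] == "-" and fixed[i - 1] == fixed[i + 1] and fixed[i - 1] in types:
            out[i] = fixed[i - 1]
    return "".join(out)


def spread_strand_removal(no_strand, restricted, max_gap=3):
    """Remove strand from up to `max_gap` unrestricted positions lying
    between two positions where strand has already been removed."""
    out = list(no_strand)
    last = None
    for i, removed in enumerate(no_strand):
        if not removed:
            continue
        if last is not None:
            gap = range(last + 1, i)
            if 0 < len(gap) <= max_gap and not any(restricted[j] for j in gap):
                for j in gap:
                    out[j] = True
        last = i
    return out


print(fill_single_gaps("HH-HHC-CE-E"))
flags = [True, False, False, False, True, False, False, False, False, True]
print(spread_strand_removal(flags, restricted=flags))
```

In the second example the three-residue gap is filled, while the four-residue gap is left untouched.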
The set of proteins studied typically requires 5-12 rounds. Convergence is slow for two
proteins (1bm8, 1vqh), and those simulations were stopped after 12 rounds. After the final round
for all proteins, the remaining unfixed positions have the 2° structure type determined by
plurality to obtain the final predictions. The DSSP 2° structure of each final structure in every
round is calculated directly or from the origins in the trimer library. Sheet and turn probabilities
are taken from the origins in the library, whereas all other 2° structures are determined directly
using DSSP. In a small minority of cases, 2° structure assignments disagree with those
determined by DSSP. Incorrect assignments usually occur around the border between a helix and
coil or beta-sheet and coil, and in most cases tend to be at positions where 2° structure
determination methods disagree. The most notable examples are 1sap, where ItFix assigns the
fifth strand as coil, 1fwp, where the second helix is assigned as strand, and 1dcj, where the second
helix and third strand are incorrectly fixed. However, 1sap can fold accurately, implying that
some errors do not affect the quality of the structure prediction.
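
The plurality step for unfixed positions can be sketched as a per-position vote over the final structures. The string representation ('-' for unfixed positions) and the function name are assumptions made for illustration, and the handling of ties is not specified in the text.

```python
from collections import Counter


def plurality_assignment(final_structures, fixed):
    """Fill each unfixed position ('-') with the most frequent 2° structure
    type observed at that position across the final structures."""
    result = []
    for pos, state in enumerate(fixed):
        if state != "-":
            result.append(state)
            continue
        counts = Counter(structure[pos] for structure in final_structures)
        # Ties are broken by first occurrence, an arbitrary choice here.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


structures = ["HHEC", "HHCC", "HEEC"]
print(plurality_assignment(structures, fixed="H--C"))
```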
Energy function
The reduced Cβ model includes only the backbone heavy atoms and the side chain Cβ.
The energy function is a pairwise additive statistical potential based on the Discrete Optimized
Protein Energy function (DOPE) [88]. We further divide interaction types as contingent on 2°
structure type and continuity, sequence separation, and orientation (Figs. 2.4.1 and 2.4.2). The 2°
structure types are defined as per DSSP. An atom pair is defined as being in a continuous segment of 2°
structure if each residue in the pair and all intervening residues in sequence have the same 2°
structure classification. The orientation-dependence is determined by the angle between the side
chain Cα-Cβ vector and the Cβ-Cβ vector connecting the interacting pair.
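
The two geometric definitions above lend themselves to a short sketch. It assumes the orientation angle is measured between the Cα-to-Cβ vector of one residue and the vector from its Cβ to the partner's Cβ; the atom labels are reconstructed from context, and the function names and coordinate tuples are illustrative.

```python
import math


def subtract(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def orientation_angle(ca_i, cb_i, cb_j):
    """Angle between the side chain vector of residue i and the vector
    pointing from its C-beta to the C-beta of residue j."""
    return angle_deg(subtract(cb_i, ca_i), subtract(cb_j, cb_i))


def in_continuous_segment(ss, i, j):
    """True if residues i, j and all residues between them share one
    2° structure classification."""
    lo, hi = sorted((i, j))
    return len(set(ss[lo:hi + 1])) == 1


print(orientation_angle((0, 0, 0), (1, 0, 0), (5, 0, 0)))   # side chain points at partner
print(orientation_angle((0, 0, 0), (1, 0, 0), (-3, 0, 0)))  # side chain points away
print(in_continuous_segment("HHHHCC", 0, 3), in_continuous_segment("HHHHCC", 0, 4))
```

A small angle corresponds to a side chain pointing toward its partner, the geometry associated with buried hydrophobic pairs in Fig. 2.4.2.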
MCSA simulations
Our MCSA energy minimization and sampling methods have been described previously
in detail [53]. The φ, ψ angles are sampled from a PDB-derived library (resolution < 2.5 Å, homology
below 90%). To test whether the 90% homology level provided a native-like bias, five of the best
performing targets (1af7, 1b72A, 1r69, 1di2, 1ubq) were refolded using a library with only a
25% homology threshold. The average accuracy of the 2° structure prediction changed from 91.6
to 90.2%, while the average of the best 3D structures changed from 2.84 to 3.30 Å.
These slight differences are most likely due to a 1.5-2-fold decrease in trimer diversity rather
than the use of the 90% homology level, which is at most a minimal factor in the success of the
algorithm.
The annealing simulations only consider the heavy atoms of the main chain and the
β carbons (Cβ) of the side chains. The backbone planar angles and bond lengths are fixed at their
ideal values, except ω, which is chosen as cis at a frequency of 5% and 0.1% for prolines and all
other residues, respectively. For the cis proline predictions for Table II, we obtained 2 true
positives, 4 false positives, 41 true negatives, and 2 false negatives based on an increase above
the 5% baseline. All non-proline residues are correctly predicted to be trans.
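
The cis proline counts above can be turned into standard classification metrics. This is simple arithmetic on the four reported numbers; the choice of metrics is ours and is not part of the original analysis.

```python
# Counts reported in the text for cis proline prediction.
tp, fp, tn, fn = 2, 4, 41, 2

total = tp + fp + tn + fn              # 49 prolines in all
accuracy = (tp + tn) / total           # fraction of prolines classified correctly
precision = tp / (tp + fp)             # fraction of predicted cis that are truly cis
recall = tp / (tp + fn)                # fraction of true cis prolines that are found

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
```

Overall accuracy is dominated by the 41 true negatives; only half of the four true cis prolines are recovered, and two of the six cis predictions are correct.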
Sampling library
We obtain our trimer library from a PDB culled using PISCES [135, 136] with a
resolution cutoff of 2.5 Å. As 2° structure is restricted, the total number of trimers available for a
given sequence becomes smaller and less diverse. We increase diversity by allowing trimers with
amino acid substitutions within these four groups of structurally correlated amino acids, (FVI),
(LM), (KRQH), and (WYF) (e.g. the three trimers XFY, XVY, XIY are considered equivalent).
We add 5° noise to each angle pulled from the library. Bond lengths and angles are all set to
ideal values.
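
The substitution groups and angle noise described above can be sketched as follows. Several details are assumptions made for illustration: substitutions are applied only at the central residue of the trimer (following the XFY, XVY, XIY example), a residue that appears in two groups (F) is treated as equivalent to the members of both, and the noise is drawn uniformly within ±5°. The function names are hypothetical.

```python
import random

# Groups of structurally correlated amino acids, as listed in the text.
GROUPS = ["FVI", "LM", "KRQH", "WYF"]


def equivalent_residues(aa):
    """Residues treated as interchangeable with `aa` (always includes `aa`)."""
    equivalents = {aa}
    for group in GROUPS:
        if aa in group:
            equivalents.update(group)
    return equivalents


def equivalent_trimers(trimer):
    """Trimers considered equivalent under substitution of the central residue."""
    left, mid, right = trimer
    return sorted(left + m + right for m in equivalent_residues(mid))


def add_noise(angle, rng, width=5.0):
    """Perturb a dihedral angle (degrees) and wrap it into [-180, 180)."""
    noisy = angle + rng.uniform(-width, width)
    return (noisy + 180.0) % 360.0 - 180.0


rng = random.Random(0)
print(equivalent_trimers("AVG"))
print(add_noise(-60.0, rng), add_noise(179.0, rng))
```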
Chapter 3
Using evolutionary diversity to enhance structure prediction
Parts of this chapter are published in DeBartolo et al. Protein structure prediction
enhanced with evolutionary diversity: SPEED. Protein Sci. (2010) vol. 19 (3) pp. 520-34 and in
the accompanying supplementary materials. Additional sections have been added and the text
has been updated accordingly. I acknowledge and thank Glen Hocky, Mike Wilde, and Jinbo Xu
for helpful discussions.
For naturally occurring proteins, similar sequence implies similar structure.
Consequently, multiple sequence alignments often are used in template-based modeling of
protein structure and have been incorporated into fragment-based assembly methods. Our
previous homology-free structure prediction study introduced an algorithm that mimics the
folding pathway by coupling the formation of secondary and tertiary structure. Moves in the
Monte Carlo procedure involve only a change in a single pair of backbone dihedral angles
that are obtained from a PDB-based distribution appropriate for each amino acid, conditional on
the type and conformation of the flanking residues. We improve this method by utilizing
multiple sequence alignments to enrich the sampling distribution, but in a manner that does not
require structural knowledge of any protein sequence (i.e., not fragment insertion). In
combination with other tools, including clustering and refinement, the accuracies of the predicted
secondary and tertiary structures are substantially improved and a global and position-resolved
measure of confidence is introduced for the accuracy of the predictions. Performance of the
method in the Critical Assessment of Structure Prediction (CASP8) is discussed.
3.1 Introduction
Given the expansion of the sequence database, an imperative of the field of structural
biology is to cluster related sequences into families and determine a representative structure for
each family [16-20]. The already large number of families is rapidly expanding and the cost of
determining representative protein structures is high. Computational structure prediction may
provide the most effective means of mapping the protein universe. Structure prediction, however,
is inherently challenging because of the enormous conformational space accessible to each
amino acid sequence. For this reason, the most successful prediction methods seek to narrow the
conformational search, for example by using large PDB fragments [80] rather than simulating the
protein ab initio [9, 15].
We have recently developed a Cβ-level, homology-free structure prediction algorithm,
termed ItFix,[8] in which the conformational search space is restricted by iteratively fixing
secondary (2°) structure assignments of certain portions of the sequence after incorporating the
influence of tertiary (3°) context. Moreover, the iterative feature enables regions of lower
confidence to be predicted after the fixing of more confident regions. The coupling and mutual
stabilization of 2° and 3° structure formation mimics the pathway character exhibited by real
proteins [113, 114].
The computationally rapid, homology-free algorithm uses moves involving only the
change in a single pair of dihedral angles (pivot moves). Hence, its performance is
independent of the existence of appropriate fragments from the PDB. Nevertheless, our
algorithm can outperform current homology-based 2° structure prediction methods for many
proteins. ItFix also generates 3° structures of comparable accuracy to existing methods for many
small proteins, including ones with few sequence homologues.
Our earlier study revealed that a large impediment to more accurate structure prediction
arises from the intrinsically low propensity of some residues to adopt the backbone dihedral
angles found in their native structures. In the protein 1dcj, for example, the middle of a helix
contains a proline followed by a glycine, two residues that are very unlikely to be found together
in helices. Even though ItFix uses more confidently assigned regions to identify native structure
in otherwise weakly determined regions, the additional contextual information occasionally is
insufficient to override very strong local biases. Unfortunately, issues of this severity occur often
in many proteins, and the associated errors can detrimentally affect the accuracy of the 2° and 3°
structure prediction.
Here, we employ multiple sequence alignments (MSAs) to mitigate the influence of the
non-native local biases. MSAs are incorporated into many popular 2° structure [10, 68] and both
template-based [137-139] and template-free [21, 83] 3° structure prediction methods. In our
distribution of sampled angles, the non-native biases are manifested as a low probability of
native-like angles. This PDB-based distribution is now enriched using the sequence diversity
found in an MSA, but does so without requiring structural information from any constituent
sequence. We denote this procedure SPEED: Structure Prediction Enhanced by Evolutionary
Diversity (Fig. 3.1.1). The combination of ItFix and SPEED significantly increases the accuracy
of 2° and 3° structure predictions, and more so in combination with novel energy functions and
clustering methods. We also provide global and local measures of the confidence of our
predictions, thereby providing an essential tool for assessing the accuracy of the predicted
structures of unsolved sequence families.
3.2 Overview of the SPEED method
Figure 3.1.1a provides an overview of both the homology-free and SPEED structure
prediction methods utilizing the ItFix 2° structure fixing procedure. The fundamental difference
between our original homology-free protocol and the new SPEED protocol relates to the
Ramachandran (Rama) sampling distribution. In the homology-free protocol, the distribution
is generated only from the target sequence, whereas in the new protocol, the distribution is
constructed from an MSA of the target sequence. At the beginning of the ItFix procedure, no 2°
structure is fixed, and the distribution at each position reflects all 2° structure types, although
the distribution is contingent on the amino acid identities of the neighboring positions (Fig.
3.1.1b). Through rounds of folding (Monte Carlo simulated annealing, MCSA) using an energy
function that promotes hydrophobic burial and that penalizes polar burial (Methods), the 2°
structure options helix, strand or coil are progressively eliminated when their occurrence in the
final collapsed structures falls below a ~0-10% threshold.[8] Angles originating from the
eliminated 2° structure option are excluded in the calculation of the Rama distribution for the
subsequent round. The folding and elimination process proceeds until no further 2° structure
options can be eliminated (Fig. 3.1.1b middle and bottom). The final result is a more restricted
Rama distribution across the entire sequence which greatly reduces the search space.
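The elimination loop described above can be sketched as follows. This is a minimal illustration rather than the ItFix implementation: `run_folding_round` is a hypothetical stand-in for one round of MCSA folding that returns, for each position, the observed frequency of each 2° structure option, and the 10% threshold is just one value from the ~0-10% range quoted in the text.

```python
from typing import Callable, Dict, List, Set

OPTIONS = ("helix", "strand", "coil")

def itfix_eliminate(
    n_positions: int,
    run_folding_round: Callable[[List[Set[str]]], List[Dict[str, float]]],
    threshold: float = 0.10,  # assumed value within the quoted ~0-10% range
) -> List[Set[str]]:
    """Iteratively remove 2° structure options that are rarely observed."""
    allowed = [set(OPTIONS) for _ in range(n_positions)]
    while True:
        # Hypothetical folding round, restricted to the options still allowed.
        freqs = run_folding_round(allowed)
        changed = False
        for pos, freq in enumerate(freqs):
            for opt in list(allowed[pos]):
                # Never remove the last remaining option at a position.
                if len(allowed[pos]) > 1 and freq.get(opt, 0.0) < threshold:
                    allowed[pos].discard(opt)
                    changed = True
        if not changed:  # stop when a round eliminates nothing
            return allowed
```

In this sketch the loop terminates as soon as one full round leaves every position unchanged, which mirrors the stopping condition stated in the text.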
The final Rama distribution is used to generate a large (10,000) ensemble of 3° structure
models. These models are clustered into groups of similar structure, and the models from the
largest cluster are selected for refinement and prediction, using our DOPE-PW statistical
potential.
Figure 3.1.1 Structure prediction protocol. a) The 2° and 3° structure prediction protocol for homology-free modeling uses the target sequence to generate a Rama sampling distribution, whereas SPEED uses a distribution that is averaged over a Multiple Sequence Alignment (MSA). The ItFix algorithm iteratively defines the 2° structure, and clustering and refinement are used to predict 3° structure. b) The Rama distribution for position 4 of the sequence of 1tif is shown for representative rounds of ItFix for homology-free and SPEED sampling. The native φ,ψ angles are denoted by a red circle.
3.3 SPEED enhanced Ramachandran distributions
At the beginning of the ItFix rounds, the Rama distribution at each position is conditional
only on the amino acid identities of the position and its two neighbors. Our homology-free
implementation obtains this distribution solely using the target sequence. For example, N4 of 1tif
is flanked by I3 and E5 (denoted INE), with the resulting INE having a homology-free Rama
distribution displayed in the left panel of Figure 3.1.1b. The SPEED-enhanced Rama distribution
is the sum (with equal weights) of the distributions of all possible three-residue combinations
generated from the amino acid substitutions identified by the MSA. For example, the SPEED
distribution for INE is the sum of multiple Rama distributions derived from the MSA, such as
IND, IGD, and VGN. At the beginning of the algorithm when no 2° structure option is eliminated,
the native Rama region (Fig. 3.1.1b, red circle) has a small sampling probability in the
homology-free distribution (P=0.01), and the predominant Rama region is right-handed helix
(P=0.6). By contrast, the native Rama region has a ~20-fold larger probability in the equivalent
SPEED Rama distribution. Also, at the end of the ItFix rounds, the SPEED probability of the
native Rama region has nearly doubled compared to the homology-free probability (P=0.37
versus 0.21). The native Rama probability enhancement due to ItFix is thus significantly
improved by the MSA-based procedure.
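The equal-weight sum over MSA-derived triplets can be sketched as below. Several things here are assumptions made for illustration: "all possible three-residue combinations" is read as the Cartesian product of the residues observed in the three adjacent alignment columns, each Rama distribution is reduced to a dictionary over coarse region labels, the summed distribution is renormalized so it can be sampled, and the lookup table `TOY_RAMA` contains invented numbers rather than PDB statistics.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List

Distribution = Dict[str, float]  # Rama region label -> probability

def msa_triplets(left: Iterable[str], center: Iterable[str],
                 right: Iterable[str]) -> List[str]:
    """All three-residue combinations of residues seen in three adjacent columns."""
    columns = (sorted(set(left)), sorted(set(center)), sorted(set(right)))
    return ["".join(t) for t in product(*columns)]

def speed_distribution(triplets: Iterable[str],
                       triplet_rama: Dict[str, Distribution]) -> Distribution:
    """Equal-weight sum of per-triplet distributions, renormalized to 1."""
    total: Dict[str, float] = defaultdict(float)
    for triplet in triplets:
        for region, p in triplet_rama[triplet].items():
            total[region] += p
    norm = sum(total.values())
    return {region: p / norm for region, p in total.items()}

# Toy lookup table; the probabilities are invented for illustration only.
TOY_RAMA = {
    "INE": {"helix": 0.8, "other": 0.2},
    "VNE": {"helix": 0.2, "other": 0.8},
}
triplets = msa_triplets("IV", "N", "E")
mixed = speed_distribution(triplets, TOY_RAMA)
```

With these toy inputs the helix-dominated target triplet is diluted by the substituted triplet, which is the qualitative effect the text describes for position 4 of 1tif.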
To illustrate the benefit of using SPEED, we quantify the enhancement across all
positions in the folding targets by comparing the native Rama probability of the homology-free
distribution to that of the SPEED-derived distribution (Fig. 3.3.1). This analysis proceeds by
partitioning the Rama map into four broad regions (Fig. 3.3.1a). More refined divisions of the
Rama map exist, but this division into four regions may be the most refined definition with clear
borders. The quality of SPEED-derived distribution is quantified as the percentage of positions
with low probability of the native Rama region (P<0.25). This percentage is a useful metric
because any position with such a low native Rama probability is an obvious candidate for
improvement. Compared to the homology-free Rama distributions, the new procedure decreases
the percentage of residues having a non-native Rama propensity for 10 out of the 12 targets
studied (Fig. 3.3.1b). The two exceptions remain unchanged because their homology-free
distributions already are very good. The two targets with the largest improvement in Rama
distribution are 1csp (78% → 86%) and 1dcj (84% → 94%). In particular, the homology-free
Rama distribution for 1dcj contains serious flaws due to the aforementioned proline-glycine pair
in the second α-helix and for residues in the turn separating the second helix and third strand
(Fig. 3.3.2). SPEED overrides the non-native propensity of G46 in the second helix (P=0.21 →
P=0.62) and also enhances the E52 turn position’s native propensity (P=0.01 → P=0.32).
In addition to the moderation of outliers, SPEED enhances the native Rama propensity when it is
already high, as is the case for 1b72. Here, the native Rama probability at only one of the ten coil
positions (E31) falls below the 0.25 threshold (Fig. 3.3.1). Its native-like probability is only
P=0.03 in the homology-free distribution but is enhanced to P=0.23 in the enhanced distribution.
Additionally, the native Rama probability in the SPEED-derived distribution is two-fold higher
than in the homology-free distribution at 7 out of 10 coil positions. Similar improvements can be
seen for other targets (Fig. 3.3.2). The exceptions to this trend generally emerge for positions
which already have a very strong native-like propensity in the target sequence. An illustration of
this effect is the left-handed turn position G10 in 1ubq. Because glycine favors the native left-
handed turn basin more than all other residues, any substitution lowers the native Rama
probability (Fig. 3.3.2). Nevertheless, the decrease in native probability due to the use of SPEED
is on average much smaller than the benefit at other positions.
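The quality metric used in this section, the percentage of positions whose native Rama region has a sampling probability below 0.25, is simple to compute once the per-position native probabilities are known. The sketch below assumes those probabilities have already been extracted as a list; the function name and the example values are hypothetical.

```python
from typing import Sequence

def percent_low_native(native_probs: Sequence[float],
                       threshold: float = 0.25) -> float:
    """Percentage of positions whose native Rama region has P < threshold."""
    if not native_probs:
        raise ValueError("need at least one position")
    low = sum(1 for p in native_probs if p < threshold)
    return 100.0 * low / len(native_probs)

# Invented per-position probabilities for a four-residue toy example.
toy_percent = percent_low_native([0.03, 0.50, 0.60, 0.30])
```

The complement, 100 minus this value, corresponds to the percentage of positions exceeding the threshold that is plotted per target.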
Figure 3.3.1 SPEED-enhanced Rama sampling distribution. a) Rama space is divided into four coarse regions for analysis. b) The percentage of residues with probability exceeding 0.25 for the native Rama region is increased for SPEED for all targets, particularly 1csp and 1dcj. c) For 1b72, the probability of the native Rama region is greatly enhanced using SPEED.
Figure 3.3.2 Position-based comparison of homology-free and SPEED distributions. The analysis of Figure 3.3.1c is shown for additional targets.
3.4 ItFix 2° structure
Improvements to ItFix
The 2° structures of the final models are identified using the DSSP program for 2°
structure determination [58]. Since DSSP-identified β-strands must be involved in β-sheet
networks with optimized hydrogen bonds, the strand-fixing threshold is lower than in our previous
study, with no noticeable decrease in fidelity. In many cases, the fidelity for specifying 2°
structure is higher. This increase is particularly evident for the all-α targets, where the β-strand
option is eliminated at every position within the first two rounds as a result of the β-strand
probability vanishing (P < 0.005) at every position (in the first round for 1af7, 1b72; in the
second round for 1r69). The same accuracy is found for the helical regions of the targets.
Improvement in 2° prediction accuracy
The 2° structure prediction accuracy using SPEED compares very favorably with the
popular 2° structure prediction methods SSPro [68] and PSIPRED [10] (Table 5). When
predicting 2° structure at the level of helix, extended or coil (three options, termed Q3), ItFix-SPEED
is more accurate than its homology-free ItFix counterpart (average accuracy 84% →
88%). Most of this improvement is due to 1csp (79% → 87%) and 1dcj (45% → 83%), the two
targets with the largest improvements in Rama distribution due to SPEED (Fig. 3.3.1b). The 2°
structures for the all-α targets already are predicted to high accuracy using the homology-free
ItFix, so the average improvement due to SPEED is small (93% → 96%), with the exception of
1b72 where the improvement is more substantial (88% → 96%). The one exception is 1di2,
whose failure is discussed in the 3° structure prediction section below.
More impressive is the increase in accuracy for the prediction of 2° structure at the more
refined Q8 level where coil is subdivided into six DSSP-identified subtypes (this level of
prediction is unavailable with PSIPRED). For 1b72, the overall Q8 accuracy increases (84% →
96%) using SPEED with a >0.95 probability assigned to the native Q8 value at every position in
the second coil region. Two other targets that have substantial improvements in Q8 accuracy are
1dcj (29% → 65%) and 1ubq (69% → 82%). Most of the Q8 improvements for 1dcj arise from
the same helix and strand improvements found for the Q3 values, whereas the Q8 improvements
for 1ubq are due almost exclusively to better turn predictions within the coil subtype.
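For readers unfamiliar with the Q3 and Q8 scores, the sketch below shows one way to compute them from per-residue DSSP strings. It assumes, following the description of coil being subdivided into six DSSP subtypes, that helix corresponds to DSSP code H, strand to code E, and every other code to coil; other Q3 reductions are in common use, and the example strings are invented.

```python
def to_q3(dssp: str) -> str:
    """Collapse DSSP codes to helix (H), strand (E), or coil (C)."""
    return "".join(c if c in "HE" else "C" for c in dssp)

def accuracy(predicted: str, native: str) -> float:
    """Percentage of positions where the predicted state matches the native."""
    if len(predicted) != len(native) or not native:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == n for p, n in zip(predicted, native))
    return 100.0 * matches / len(native)

# Invented 11-residue example: one turn (T) is mispredicted as bend (S).
native_q8 = "HHHHTTEEEE-"
predicted_q8 = "HHHHSTEEEE-"
q8 = accuracy(predicted_q8, native_q8)
q3 = accuracy(to_q3(predicted_q8), to_q3(native_q8))
```

The toy case illustrates why Q8 is the stricter score: an error between two coil subtypes lowers Q8 but leaves Q3 unchanged.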
3.5 Energy Functions
We continue to use a reduced Cβ model that includes the backbone heavy atoms,
backbone amide hydrogen, and the side chain Cβ, and a slightly modified version of the DOPE-PW
energy function [8]. This energy function is a pairwise additive statistical potential based on
the observed distance distributions between each atom in the model. In addition to distinguishing
each type of atom, the energy function classifies each interaction according to residue type, 2°
structure assignment, and side-chain orientation.
In the prior ItFix treatment, the 2° structure assignment at a position is the same
assignment as in the original PDB structure from which the last φ,ψ pair is selected (for this
position). Here, the 2° structure is specified using a geometric definition of 2° structure that is
applied in each energy calculation (i.e., in the application of the strand-strand terms, helix-helix
terms, etc.). A residue is considered to lie in a helix if it is situated in a block of more than four
residues in a row satisfying the following criteria:
Table 5 SPEED 2° structure prediction comparison¹

| PDB ID | size | fold | NEFF⁴ | Rama enrichment² Hfree (angles/residue) | Rama enrichment² SPEED (angles/residue) | ItFix Q3 (Q8)³ | ItFix-SPEED Q3 (Q8)³ | SSPro Q3 (Q8)³ | PSIPRED Q3³ |
|---|---|---|---|---|---|---|---|---|---|
| 1af7 | 69 | | 7.3 | 1426 | 5599 | 97 (86) | 96 (88) | 86 (81) | 90 |
| 1b72 | 50 | | 5.7 | 1384 | 4229 | 88 (84) | 96 (96) | 68 (72) | 84 |
| 1csp | 67 | | 6.0 | 1069 | 2365 | 79 (67) | 87 (70) | 75 (67) | 88 |
| 1di2 | 68 | | 6.8 | 1230 | 4964 | 88 (79) | 66 (54) | 74 (75) | 97 |
| 1dcj | 72 | | 7.0 | 1059 | 4381 | 45 (29) | 83 (65) | 65 (56) | 89 |
| 1mky | 77 | | 5.0 | 1572 | 3947 | 86 (70) | 83 (65) | 87 (71) | 90 |
| 1o2f | 77 | | 5.5 | 1059 | 4506 | 78 (69) | 84 (73) | 79 (66) | 75 |
| 1r69 | 61 | | 7.5 | 1036 | 5058 | 93 (89) | 97 (89) | 74 (72) | 92 |
| 1shf | 59 | | 7.1 | 774 | 3213 | 76 (56) | 71 (51) | 85 (69) | 80 |
| 1tif | 57 | | 4.4 | 1349 | 3233 | 89 (79) | 91 (81) | 76 (70) | 93 |
| 1tig⁵ | 86 | | 5.4 | 1194 | 3323 | 83 (70) | N/A | 69 (67) | 83 |
| 1ubq | 73 | | 7.7 | 1152 | 3405 | 92 (69) | 94 (82) | 88 (67) | 90 |

¹ Target sequences are from our previous homology-free ItFix study,[8] which have been selected from a previous Rosetta prediction study.[21]
² Rama enrichment is the positional average of the number of PDB angles used to generate the Rama distribution for each method. The Q3 and Q8 (in parentheses) 2° structure prediction accuracies are reported for the previous homology-free study, an updated homology-free version, and SPEED sampling.
³ SSpro and PSIPRED 2° structure predictions are obtained from their respective servers.[132, 133]
⁴ NEFF[140] is a Shannon entropy measure on a scale of 1-20 of the amino acid diversity of the sequence alignment (1 = single amino acid, 20 = all amino acids are equally likely).
⁵ Folding of 1tig could not converge in a reasonable amount of time because radial terms could not be satisfied in a small number of MCSA steps.
The minimum distance between the hydrogen bond donors and acceptors is described by the
distance criterion from the hydrogen bond potential of Kortemme et al.,[100]
[1.7 < dist(COi, NHj) < 2.6] or [1.7 < dist(NHi, COj) < 2.6]
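The distance criterion above can be checked with a few lines of code. The sketch below assumes that the distance is measured between the carbonyl oxygen and the amide hydrogen, that the bounds are in Å, and that each residue is supplied as a dictionary with hypothetical keys `"O"` and `"H"` holding Cartesian coordinates; none of these representation details come from the text.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float, float]
D_MIN, D_MAX = 1.7, 2.6  # bounds quoted in the text (units assumed to be Å)

def backbone_hbonded(res_i: Dict[str, Point], res_j: Dict[str, Point]) -> bool:
    """True if either CO(i)...NH(j) or NH(i)...CO(j) satisfies the distance window."""
    d_co_i_nh_j = math.dist(res_i["O"], res_j["H"])
    d_nh_i_co_j = math.dist(res_i["H"], res_j["O"])
    return (D_MIN < d_co_i_nh_j < D_MAX) or (D_MIN < d_nh_i_co_j < D_MAX)

# Invented coordinates: residue i's oxygen sits 2.0 from residue j's hydrogen.
res_i = {"O": (0.0, 0.0, 0.0), "H": (10.0, 0.0, 0.0)}
res_j = {"H": (2.0, 0.0, 0.0), "O": (20.0, 0.0, 0.0)}
res_k = {"H": (3.0, 0.0, 0.0), "O": (30.0, 0.0, 0.0)}
```

Only the distance test is shown here; the orientation dependence discussed next is not included in this sketch.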
In addition to this distance constraint, the hydrogen bond energy function also considers
the influence of hydrogen bond orientation. The following term is used to describe the
orientation between two covalent bonds, an example being the backbone carbonyl (C=O) bond
and amide bond (N-H) orientation:
,
In this equation, 12 represents the angle between the and vectors and 21 represents the
angle between the and vectors. We impose a 90° minimum on to maintain a planar
sheet network for both parallel and anti-parallel sheet networks.
Our previous study [8] finds that the statistical potential alone often is incapable of
generating a large proportion of well-collapsed models for the targets that contain β-sheets.
These simulation models commonly contain attributes that are uncharacteristic of experimental
models, such as buried polar residues, unpaired buried strands, and a high radius of gyration of
C atoms (Rg). Buried polar residues and buried unpaired beta strands are symptomatic of an
energetic benefit allotted for the close pairing of non-polar C atoms and the lack of penalty for
the close pairing of polar and non-polar C atoms. Thus, the prior treatment allows a strand to be
buried in the hydrophobic core of a model so long as it contains a sufficient number of non-polar
residues. High-Rg models can be low in energy due to highly optimized sub-structures, such as
hairpins, which are formed at the expense of integrating the entire chain into a properly-
collapsed model.
Adding a penalty for the burial of polar residues impedes the generation of low-Rg
models, and forcing a lower Rg on the chain can worsen the burial of polar groups and beta-
strands. For this reason, in addition to Rg, two radial terms are included to encourage the proper
global collapse of the entire chain (3.5.1). Radial uniformity (Ru) is the standard deviation of the
distances of C atoms from the C center of mass (cm),
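$R_u = \sqrt{\langle (d_i - \langle d \rangle)^2 \rangle}$, where $d_i = |\mathbf{r}_i - \mathbf{r}_{cm}|$ and $\langle \cdot \rangle$ denotes the average over atoms $i$.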
The Ru term is necessary because small globular single-domain proteins rarely have a
completely buried segment of chain, but instead have an amphipathic alternation between
exposed and buried side chains. Enforcing a small value of Ru prevents any portion of the chain
from being too close to the center of mass and therefore diminishes the propensity for the burial
of entire 2° structure units in the core of the model.
Rg and Ru are minimized to create a collapsed chain with no completely buried chain
segments. A third radial term, the ratio of the Rg of the non-polar C atoms to the Rg of the polar
C atoms, is called burial ratio (Br):
$B_r = R_g^{\text{non-polar}} / R_g^{\text{polar}}$
Most small proteins have the non-polar C atoms closer to the center of the protein,
whereas the polar C atoms are more likely to be on the exterior, so Br is less than unity to
capture the global hydrophobic burial of globular proteins. The global burial induced by the Br
term contrasts to the local optimization of statistical potentials, which can optimize local subsets
of hydrophobic atom pairs at the expense of global burial.
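The three radial quantities can be computed directly from one representative atom per residue. The sketch below makes several assumptions that the text does not settle: all atoms have equal weight when computing the center of mass, the standard deviation is the population form, and the polar and non-polar radii of gyration are both measured about the center of mass of the whole chain. The function names and the toy coordinates are hypothetical.

```python
import math
from statistics import pstdev
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

def centroid(coords: Sequence[Point]) -> Point:
    """Equal-weight center of mass (assumption: all atoms weigh the same)."""
    n = len(coords)
    return (sum(c[0] for c in coords) / n,
            sum(c[1] for c in coords) / n,
            sum(c[2] for c in coords) / n)

def radius_of_gyration(coords: Sequence[Point], center: Point) -> float:
    """Root-mean-square distance of the atoms from the given center."""
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / len(coords))

def radial_terms(coords: Sequence[Point],
                 nonpolar: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (Rg, Ru, Br) for one model."""
    cm = centroid(coords)
    rg = radius_of_gyration(coords, cm)
    ru = pstdev([math.dist(c, cm) for c in coords])  # spread of radial distances
    rg_np = radius_of_gyration([c for c, f in zip(coords, nonpolar) if f], cm)
    rg_p = radius_of_gyration([c for c, f in zip(coords, nonpolar) if not f], cm)
    return rg, ru, rg_np / rg_p

# Toy model: two non-polar atoms at radius 1 and two polar atoms at radius 2.
toy_coords = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, -2.0, 0.0)]
toy_rg, toy_ru, toy_br = radial_terms(toy_coords, [True, True, False, False])
```

In the toy model the non-polar atoms sit inside the polar ones, so the burial ratio comes out below unity, as expected for a globular arrangement.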
We add the three radial terms to obtain the overall scoring function, where EDOPE-repulsive is
the sum of the positive (repulsive) DOPE terms,
Figure 3.5.1 Radial protein structure terms. A single domain protein structure is treated as a sphere with a C radius of gyration, an inner hydrophobic C radius of gyration, and an outer hydrophilic C radius of gyration. The burial ratio (Br) is the ratio of the latter two terms. The Radial uniformity (Ru) is the standard deviation of the C center of mass (CM) to C distances.
Each MCSA simulation is repeated using Eradial until the Br is less than 0.80. We cap the
minimum value of Ru at 2.5 Å, since it is very easy for the chain to fold into a ring structure with
Ru close to 0. The multiplied radial terms have a coefficient of 100, so that their combined
magnitude is significant relative to the repulsive part of DOPE.
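The scoring equation itself is not reproduced in this text, so the following is only a sketch under a stated assumption: that the three radial terms enter as a product with a coefficient of 100, added to the repulsive DOPE energy, with Ru floored at 2.5 Å. The phrase "multiplied radial terms" suggests this form but does not confirm it. `run_mcsa` is a hypothetical callable returning a model with a `"Br"` entry, and the `max_attempts` guard is an addition for safety, not part of the described protocol.

```python
from typing import Callable, Dict, Optional

def radial_energy(e_dope_repulsive: float, rg: float, ru: float, br: float,
                  coeff: float = 100.0, ru_floor: float = 2.5) -> float:
    """Assumed form: E = E_repulsive + coeff * Rg * max(Ru, floor) * Br."""
    return e_dope_repulsive + coeff * rg * max(ru, ru_floor) * br

def fold_until_buried(run_mcsa: Callable[[], Dict[str, float]],
                      br_cutoff: float = 0.80,
                      max_attempts: int = 100) -> Optional[Dict[str, float]]:
    """Repeat the (hypothetical) MCSA run until the burial ratio is below the cutoff."""
    for _ in range(max_attempts):
        model = run_mcsa()
        if model["Br"] < br_cutoff:
            return model
    return None  # no sufficiently buried model found within the attempt budget
```

Flooring Ru means that a ring-like chain with Ru near zero gains nothing further from this term, which is the motivation the text gives for the 2.5 Å cap.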
The combined radial energy and filtering has a significant effect on model quality (Fig.
3.5.2); the lowest energy models from the final ensemble are more similar to the native.
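The energy filtering referred to here, as described in the caption of Figure 3.5.2, removes models whose energy exceeds the ensemble mean by some multiple of the standard deviation. A minimal sketch follows; the use of the population standard deviation and the function name are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def filter_by_energy(energies: Sequence[float], n_sd: float) -> List[int]:
    """Indices of models kept after removing energies above mean + n_sd * s."""
    cutoff = mean(energies) + n_sd * pstdev(energies)
    return [k for k, e in enumerate(energies) if e <= cutoff]

# Invented energies; n_sd takes the values 0, ±1, ±2 used for the traces.
toy_energies = [1.0, 2.0, 3.0, 4.0, 5.0]
kept_at_mean = filter_by_energy(toy_energies, 0)
kept_strict = filter_by_energy(toy_energies, -1)
```

More negative values of `n_sd` keep only the lowest-energy models, which is the regime where the ensemble is reported to become more native-like.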
The radial terms are used throughout the ItFix algorithm until the 2° structure is
determined. For the final round of folding, if the 2° structure is all-α the DOPE-PW energy
function is used, otherwise the Eradial energy function is used. In either case, the size of the final
ensemble is 10,000 models, which is more than sufficient for reproducible average accuracy
(Fig. 3.5.3), but perhaps minimally sufficient for the purposes of clustering or reproducing the
absolute best model. The final model refinement process uses the DOPE-PW energy function
for all targets.
3.6 Improvement in 3° structure
SPEED significantly improves the quality of 3° models compared to the homology-free
treatment (Table 6). The Cα-RMSD of the best model (the model with the lowest Cα-RMSD) is lower for
SPEED in every case except 1di2. Because the best model is not always a very reproducible metric of
overall performance, we consider instead the fraction of final structures below 5 Å Cα-RMSD to the
native structure (Fig. 3.6.1). This fraction is on average several times greater for SPEED than
from the homology-free approach when all other folding parameters (2° structure assignment,
energy weighting coefficients, etc.) are identical (Table 6, last column). The SPEED folding
ensemble for 1ubq is the most enhanced, containing six times more native-like models than the
homology-free ensemble. For four out of the twelve targets, the homology-free distribution
produces no models below 5 Å, and hence the SPEED enhancement factor is effectively infinite.
Even so, improvement also is evident across all ranges of Cα-RMSD. For 1b72, the addition of
SPEED improves the 3° structure ensemble such that 83% of the models are less than 5 Å Cα-RMSD
to the native structure (Fig. 3.6.1b), which compares favorably to 76% of the homology-free
models falling below that threshold. It should be noted that, for the purposes of direct
comparison of the homology-free and SPEED Rama distributions, the SPEED 2° structure
definition was used in the homology-free Rama distribution. Since the SPEED 2° structure is
typically more accurate, in reality the 3° structure accuracy enhancement due to SPEED is likely
much larger. Unfortunately, we did not have the computational resources to generate the
homology-free ItFix 2° structure definitions.
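The ensemble comparison used in this section reduces to two small calculations: the fraction of models below a Cα-RMSD cutoff, and the ratio of that fraction between the SPEED and homology-free ensembles. The sketch below uses hypothetical function names and invented RMSD values, and returns infinity when the homology-free ensemble has no model below the cutoff, matching the "effectively infinite" enhancement described in the text.

```python
import math
from typing import Sequence

def fraction_below(rmsds: Sequence[float], cutoff: float = 5.0) -> float:
    """Fraction of models whose RMSD to the native is below the cutoff."""
    return sum(r < cutoff for r in rmsds) / len(rmsds)

def enhancement(speed_rmsds: Sequence[float], hfree_rmsds: Sequence[float],
                cutoff: float = 5.0) -> float:
    """Ratio of SPEED to homology-free fractions below the cutoff."""
    f_speed = fraction_below(speed_rmsds, cutoff)
    f_hfree = fraction_below(hfree_rmsds, cutoff)
    if f_hfree == 0.0:
        # Undefined (0/0) when neither ensemble has a native-like model.
        return math.inf if f_speed > 0.0 else math.nan
    return f_speed / f_hfree

# Invented RMSD values for two four-model ensembles.
toy_ratio = enhancement([2.0, 3.0, 6.0, 7.0], [4.0, 6.0, 7.0, 8.0])
```

Because the ratio uses fractions rather than counts, ensembles of different sizes can be compared directly.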
Compared to the α/β and β targets, the three α targets have the most native-like ensembles
for both homology-free and SPEED methods, and, hence, this class yields the smallest
enhancement factor. Conversely, the α/β and β targets produce a very small fraction of native-like
models for both SPEED and homology-free methods, but have the largest increase in native-like
models due to the use of SPEED (Table 6). Neither the SPEED nor homology-free methods
generate native-like models for 1di2, most likely because it is considerably more prolate in shape
than the rest of the proteins, and the radial energy terms (See 3.5) enforce a spherical bias (Table
7).
An obvious question is whether the increase in the accuracy of 3° structure prediction
found with SPEED emerges from the improvement of a few residues with low homology-free
native (φ,ψ) probability or from small improvements across the entire sequence.
Figure 3.5.2 Effect of energy filtering on model accuracy. The accuracy of the folding ensemble increases as higher energy models are removed. Shown is the fraction of models below varying Cα-RMSD cutoffs. Traces represent the results after removal of models with energies greater than <Energy>+X, where X = 0, ±s, ±2s, and s is the standard deviation in energy for all models.
Figure 3.5.3 Reproducibility of the final model ensembles. The final folding ensembles (10,000 models before refinement) are divided into five random sets of 2000 models for the targets 1dcj, 1tif, and 1r69. The lack of diversity illustrates that the accuracy distribution is reproducible.
Table 6 SPEED 3° structure prediction comparison¹

| PDB ID | size | fold | NEFF | Previous ItFix¹ | ItFix-hfree² | ItFix-SPEED³ | Cα-5.0X⁴ |
|---|---|---|---|---|---|---|---|
| 1af7 | 69 | | 7.3 | 2.9 (2.5) | 2.5 (2.5) | 2.6 (1.6) | 1.2 |
| 1b72 | 50 | | 5.7 | 3.5 (1.6) | 3.6 (1.7) | 3.5 (1.6) | 1.1 |
| 1csp | 67 | | 6.0 | 10.5 (6.0) | NC (4.6) | 5.2 (4.1) | 4.2 |
| 1di2 | 68 | | 6.8 | 6.1 (4.6) | NC (6.8) | NC (6.6) | N/A |
| 1dcj | 72 | | 7.0 | 13.3 (7.6) | NC (5.9) | 5.3 (4.6) | |
| 1mky | 77 | | 5.0 | 6.9 (6.1) | NC (4.4) | 5.2 (4.2) | |
| 1o2f | 77 | | 5.5 | 11.2 (5.8) | NC (6.7) | NC (4.2) | |
| 1r69 | 61 | | 7.5 | 4.2 (2.4) | 3.7 (2.1) | 3.5 (1.6) | 1.8 |
| 1shf | 59 | | 7.1 | 12.2 (6.7) | NC (6.2) | NC (3.8) | |
| 1tif | 57 | | 4.4 | 11.3 (4.2) | 5.7 (3.7) | 5.4 (3.2) | 4.3 |
| 1tig⁵ | 86 | | 5.4 | 6.4 (5.3) | N/A | N/A | N/A |
| 1ubq | 73 | | 7.7 | 5.3 (3.1) | 4.4 (3.6) | 2.6 (1.9) | 6.0 |

Columns 5-7 report 3° structure accuracy.
¹ The Cα-RMSD to the native of prediction based on energy and best model (in parentheses) from our previous homology-free ItFix study.[8]
² Folding with the homology-free Rama distribution and with the final SPEED 2° structure (2000 trajectories), cluster and refinement prediction and best model (in parentheses).
³ Folding with the SPEED Rama distribution with final SPEED 2° structure (10,000 trajectories), cluster and refinement prediction and best model (in parentheses).
⁴ Ratio of the percentage of models below 5.0 Å Cα-RMSD to native of SPEED (column 7) to homology-free (column 6).
⁵ Folding of 1tig could not converge in a reasonable amount of time because radial terms could not be satisfied in a small number of MCSA steps.
Figure 3.6.1 Improvement in 3° structure prediction using SPEED. The percentage of models with a Cα-RMSD to the native below a cutoff level (x-axis) provides a comparison of the overall accuracy of the folding ensembles. The top cluster (solid line) from SPEED is much better than the entire SPEED ensemble (dashed line), which is better than the ensemble generated using the homology-free ItFix Rama distribution with the SPEED-generated 2° structure assignments (dotted line).
Although it is impractical to test the effects of SPEED one residue at a time, the general
behavior is illustrated for 1dcj, the protein for which the use of SPEED introduces the largest
improvement in the accuracy of both 2° and 3° structure predictions. Without SPEED, we fail to
predict the second helix, which contains the PG combination and has low intrinsic helicity. Even
with the 2° structure of this helix correctly fixed, the 3° accuracy still is inferior without SPEED (Table 6),
presumably due to the extremely low homology-free turn probability at position 52 compared to
the SPEED-based probability (P_hfree = 0.02; P_SPEED = 0.32). Hence, we believe that the larger
improvements due to SPEED probably can be localized to a few critical positions. However, the
improvement of near native structures (e.g., RMSD below 3-5 Å) likely arises from the
cumulative effect of enhancement at many positions.
3.7 Averaging the energy function across the MSA.
Analogous to the SPEED-improved Rama distribution, we have also tested an energy
function that is averaged over the MSA in order to incorporate additional sequence information,
specifically via sequence correlations in the long-range interactions. The analysis of correlated
mutations in sequence alignments has been used previously in other prediction and design
methods [141-144]. The new energy function uses the original statistical potential and the same
pairwise distances, D_i,j, between the pairs of amino acids. However, the new energy for each (i,j)
residue pair now is the average energy calculated using the distance D_i,j and statistical potential
appropriate for the amino acid pair found in each sequence in the MSA. This procedure includes
extra long-range information while maintaining the pairwise amino acid correlations inherent in
each aligned sequence.
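The averaging described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `PairPotential` signature, the gap handling, and the `toy_potential` stand-in are all assumptions introduced here, since the text does not specify the interface of the statistical potential or how gapped positions are treated.

```python
from typing import Callable, Sequence

# Hypothetical signature for the statistical potential: energy of an amino acid
# pair (aa_i, aa_j) at separation distance d (in Å).
PairPotential = Callable[[str, str, float], float]


def msa_averaged_pair_energy(
    msa: Sequence[str],
    i: int,
    j: int,
    distance: float,
    potential: PairPotential,
) -> float:
    """Average the (i, j) pair energy over all aligned sequences.

    Every sequence contributes the energy of its own amino acid pair at the
    same distance D_i,j. Sequences with a gap at either position are skipped
    (an assumption; the text does not say how gaps are handled).
    """
    energies = [
        potential(seq[i], seq[j], distance)
        for seq in msa
        if seq[i] != "-" and seq[j] != "-"
    ]
    if not energies:
        raise ValueError("no aligned sequence has residues at both positions")
    return sum(energies) / len(energies)


def toy_potential(aa_i: str, aa_j: str, d: float) -> float:
    """Toy stand-in: hydrophobic pairs are favourable when closer than 8 Å."""
    hydrophobic = set("AVILMFWC")
    if aa_i in hydrophobic and aa_j in hydrophobic and d < 8.0:
        return -1.0
    return 0.0
```

Because each sequence contributes its own (i,j) pair, the pairwise correlations within a sequence are preserved, which a simple per-position consensus would lose.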
Table 7 Radial terms for protein structure
PDB id   Length   Native Br¹   Consensus Br²   Native Rg (Å)   Ru (Å)³
1af7     69       0.83         0.69            10.9            2.5
1b72     50       0.81         0.52            10.0            2.8
1csp     67       0.73         0.57            10.6            3.2
1di2     68       0.71         0.62            12.3            4.9
1dcj     72       0.77         0.69            10.4            2.8
1mky     77       0.76         0.66            11.7            3.2
1o2f     77       0.79         0.72            10.7            2.5
1shf     59       0.75         0.73            10.0            2.7
1tif     57       0.73         0.69            9.5             2.5
1tig     86       0.84         0.81            12.1            3.8
1ubq     72       0.76         0.71            10.7            2.4
1r69     61       0.79         0.69            9.9             2.0
mean     68       0.77         0.68            10.7            2.9
¹ Burial ratio of the target sequence.
² Burial ratio of the consensus sequence of the multiple sequence alignment of the target sequence.
³ Standard deviation of the C distance from the C center of mass.
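As an illustration of two of the radial terms in Table 7, the sketch below computes Rg and Ru from a list of coordinates. It assumes equal atomic masses, the standard root-mean-square definition of the radius of gyration, and a population (not sample) standard deviation for Ru; the function name `radial_terms` is hypothetical, and the burial ratio is not covered because its definition is not given in this section.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def radial_terms(coords: Sequence[Point]) -> Tuple[float, float]:
    """Return (Rg, Ru) in Å for a set of atom coordinates.

    Rg: root-mean-square distance of the atoms from their centroid.
    Ru: population standard deviation of those distances.
    """
    n = len(coords)
    if n == 0:
        raise ValueError("at least one coordinate is required")
    centroid = tuple(sum(p[axis] for p in coords) / n for axis in range(3))
    dists = [math.dist(p, centroid) for p in coords]
    rg = math.sqrt(sum(d * d for d in dists) / n)
    mean_d = sum(dists) / n
    ru = math.sqrt(sum((d - mean_d) ** 2 for d in dists) / n)
    return rg, ru
```

A compact, roughly spherical shell of atoms gives a small Ru relative to Rg, whereas an elongated or irregular chain gives a larger spread of distances.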
Although this method is intellectually appealing, the results are variable. We suspect that
the optimal (lowest energy) separation distance for each contact varies too
much among the different combinations of residues found in the sequences of the MSA.
Consequently, the energy surface averaged across the sequences in the MSA has a shallower
minimum compared to the energy function calculated using only the target sequence.
Cursory tests using a single consensus sequence with the standard energy function also fail to
produce uniformly superior results. Nevertheless, we maintain that a careful and clever
implementation or extension of these ideas could yield strong improvements.
3.8 Clustering
The enhancement of the fraction of native-like models obtained using SPEED has
additional implications for 3° structure prediction. In our previous homology-free study, the
predicted structure is the lowest energy model from the final folding ensemble. However, that
structure is native-like (< 5 Å) for only about half of the targets, failing mostly when few or no
accurate models are generated. Although the use of SPEED increases the proportion of accurate
models, energy alone is insufficient for reliably choosing the best model. This situation is
common in structure prediction. As a result, clustering methods are frequently employed
because repeatedly occurring low energy conformations are typically more accurate than
structurally isolated low-energy models [145].
The lowest energy model from the top cluster for the homology-free and SPEED-based
Rama distributions are presented when a cluster exists (Table 6). A larger fraction (8/12) of the
SPEED-based ensembles contains identifiable clusters compared to the homology-free
ensembles (6/12), and their size often is larger as well (Fig. 3.6.1). The largest cluster may be the
most accurate in terms of C-RMSD to the native, but it may share a similar average contact
profile to other less accurate clusters (Fig. 3.8.1). Most noticeable are the contact profiles of the
largest two clusters of 1b72, which display almost identical contacts, but decidedly different
values for the average C-RMSD to the native (Cluster 1, < 4 Å, Cluster 2, > 10 Å). This result is
due to the simplicity of the 1b72 fold (3-helix bundle), which permits a low energy fold that is a
pseudo-mirror image fold of the native and therefore has similar contacts and similar average
energy. Given this energetic similarity, the Rama distribution determines the favorability of the
native conformation, with the SPEED protocol succeeding to a greater extent than the homology-
free protocol.
3.9 Confidence assessed from reproducibility
While numerous methods exist for structure prediction [9, 15, 21, 80, 97, 120], the
quantification of the accuracy and confidence of a prediction is a crucial, but often
elusive, component. Template-based methods typically infer confidence from the quality of the
available information used to generate an alignment and a consensus of aligned models [146-
148]. When predicting remote templates, this technique can suffer from a dearth of PDB
templates that independently align to the target sequence with high confidence. This situation
precludes any meaningful clustering analysis and therefore imparts a large uncertainty to model
quality.
Template-free prediction methods have an advantage of generating a large number of
models that can be clustered. One noticeable feature of our method is the high correlation (R² =
0.85) between the average C-RMSD between models in the predicted cluster and the average
accuracy (C-RMSD to the native) of the models within the cluster (Fig. 3.9.1).
Figure 3.8.1 Comparison of contacts for the top clusters of several targets. Each map is a C-C contact matrix with a 10.0 Å distance cutoff for the α targets (1af7, 1b72) and an 8.0 Å distance cutoff for the α/β and β targets (1mky and 1csp). Contacts of the native model are presented on the lower right of each map. The largest cluster for 1af7 has the most native contacts and has an average C-RMSD to the native below 4 Å. The next largest 1af7 cluster, which has an average greater than 10 Å C-RMSD to the native, has many native and non-native contacts. The largest 1b72 cluster is the most native in terms of C-RMSD (< 3 Å average), but contains identical contacts to the next largest cluster (> 10 Å C-RMSD to native average) that is the mirror-image fold of the native. The contact matrices of the top clusters of 1mky and 1csp are both very native-like.
Figure 3.9.1 Assessing global accuracy from reproducibility of the top cluster. The mean C-RMSD to native of the top cluster is strongly correlated with the mean C-RMSD between the models in that cluster, indicating that the latter metric can be used as a measure of a predicted model’s accuracy.
This trend suggests that template-free models that are reproduced with a high degree of structural
similarity tend to be proportionately more accurate than models that are structurally further
removed from their closest neighbors. Noticeably, the average C-RMSD between models in a
cluster is typically one to two angstroms lower than the average C-RMSD to the native of the
cluster, suggesting that the top cluster has converged upon a stable but slightly non-native energy
minimum. Nonetheless, this difference can be factored in when quantifying the predicted
accuracy and may be diminished by improvements in the energy function and sampling
distributions.
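The reproducibility metric can be computed without knowledge of the native structure. The sketch below is illustrative only: it assumes a precomputed, symmetric pairwise RMSD matrix (`rmsd`) and a list of model indices for the cluster, and the helper names are hypothetical. `r_squared` shows how a correlation such as the one in Fig. 3.9.1 could be evaluated across targets once both quantities are available.

```python
from itertools import combinations
from typing import Sequence


def mean_pairwise_rmsd(rmsd: Sequence[Sequence[float]], members: Sequence[int]) -> float:
    """Average RMSD over all distinct pairs of models in a cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        raise ValueError("a cluster needs at least two models")
    return sum(rmsd[a][b] for a, b in pairs) / len(pairs)


def mean_rmsd_to_native(rmsd_to_native: Sequence[float], members: Sequence[int]) -> float:
    """Average RMSD to the native over the models in a cluster."""
    return sum(rmsd_to_native[m] for m in members) / len(members)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

In a blind prediction only `mean_pairwise_rmsd` is available; the systematic one to two angstrom offset noted above would have to be added when converting it into an expected accuracy.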
In addition to global accuracy, the residue level RMSD at each position is calculated to
quantify the confidence of the prediction for each amino acid in the protein (Fig. 3.9.2).
Specifically, the average and standard deviation of the distance at each position between
the aligned models in the cluster are highly correlated to the respective average distance and
standard deviation at each position between the aligned cluster models and the native model,
suggesting that the accuracy and uncertainty at each position can be predicted.
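A position-resolved version of the same idea is sketched below. It assumes the models (and, for benchmarking, the native) have already been superimposed and are given as equal-length lists of coordinates; the superposition step itself is outside this sketch, and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Model = Sequence[Point]


def per_position_spread(models: Sequence[Model]) -> List[Tuple[float, float]]:
    """(mean, std) of the distance between all model pairs at each position."""
    result = []
    for k in range(len(models[0])):
        d = [math.dist(a[k], b[k]) for a, b in combinations(models, 2)]
        result.append((mean(d), pstdev(d)))
    return result


def per_position_error(models: Sequence[Model], native: Model) -> List[Tuple[float, float]]:
    """(mean, std) of the distance between each model and the native at each position."""
    result = []
    for k in range(len(native)):
        d = [math.dist(m[k], native[k]) for m in models]
        result.append((mean(d), pstdev(d)))
    return result
```

The first function needs only the cluster; the second is available only when the native is known, and is what the first is compared against in Fig. 3.9.2.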
This finding has implications for other template-free methods, which may suffer method-specific
difficulties when trying to quantify the confidence of model predictions. Most template-free
methods rely on large fragments from PDB models [21, 80, 137], but the number of these
fragments is limited, which may introduce some bias due to the highly restricted nature of the
conformational search. In other words, independently converging on very similar models may
not be as meaningful when the likelihood of sampling the same conformation is very high. Since
the conformational changes in ItFix feature the rotation of only a single pair of φ,ψ angles, a
resulting ensemble consisting of a cluster of very similar models can be treated with higher
confidence given that the accessible conformation space is much larger than in fragment-based
methods. Similarly, the bias likely is even weaker for all-atom physics-based simulations [15]
and ab initio folding simulations [9], which have the least restricted conformational search.
ItFix-SPEED may combine the best of both a restricted and an unbiased conformational search
with regard to assessing accuracy from the structural diversity of the largest cluster.
Figure 3.9.2 Assessing local accuracy from reproducibility of top cluster. Position-resolved model accuracy and confidence. The average aligned distance between all models in the predicted cluster and the standard deviation of that distance is determined for each position. These values are highly correlated with the respective average aligned distance and standard deviation at each position between each model in the cluster and the native structure. The standard deviation for each of these values also is highly correlated, suggesting the ability to use clustering to determine confidence for each position in a predicted model.
3.10 Performance in CASP8
We have applied an early version of the ItFix-SPEED protocol in the 2008 Critical
Assessment of Structure Prediction (CASP8) for the human/server targets when a suitable
template from the PDB could not be identified by the threading program RAPTOR [83, 149],
one of the top performing entries in the server category. Of these targets, the 120 residue T0482
is the only small, globular, single-domain free-modeling target with no confident templates,
making it a prime candidate for the ItFix-SPEED methodology. This target has been subjected to
multiple rounds of ItFix-SPEED, and our final three submitted models are very similar with
highly accurate 2° and 3° structures (Fig. 3.10.1a). Our predicted 2° structure is slightly
improved over the PSIPRED[10] prediction. Due to time constraints, we initially assigned
PSIPRED’s high confidence (>90%) predictions at ~ 10% of the positions (total wall clock time
for prediction was under 12 hours from start of prediction to submission). When the central 100
residues (ignoring the solvent exposed ends of the NMR structure) of these models are aligned to
the now published structure, the C-RMSD to native is 4.8 Å. Hence, our algorithm is able to
confidently predict the correct structure without any false positive submissions. In addition, our
top model has the lowest C-RMSD among all submitted #1 models. We have performed
commendably for other challenging template-free modeling targets, such as the D1 subdomain of
protein T0405 (Fig. 3.10.1b). These results constitute strong evidence of the predictive
capabilities of the ItFix-SPEED algorithm. Our participation in CASP8 also includes predictions
for sequences that have only poor templates and are considered template-free modeling targets.
For target T0429, RAPTOR chooses multiple homology-based templates, but it is uncertain as to
which template is correct for the C-terminal domain. ItFix-SPEED folding simulations for this
domain have been used to compare the average contact matrix of our folding simulations to the
contacts of each possible template (Fig. 3.10.1d). This process has enabled us to choose a better
template (T0429-2ckk) than RAPTOR’s top scoring template.
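One way to compare an average contact matrix from folding simulations with the contacts of candidate templates is sketched below. The scoring rule (mean predicted contact frequency over the template's contacts), the sequence-separation filter, and all function names are assumptions made for illustration; the text does not state which comparison measure was used.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Model = Sequence[Point]


def contact_map(coords: Model, cutoff: float) -> List[List[float]]:
    """Binary contact matrix: 1.0 where two residues lie within the cutoff."""
    n = len(coords)
    return [
        [1.0 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def average_contact_map(models: Sequence[Model], cutoff: float) -> List[List[float]]:
    """Contact frequency of every residue pair across the ensemble."""
    maps = [contact_map(m, cutoff) for m in models]
    n = len(maps[0])
    return [[sum(mp[i][j] for mp in maps) / len(maps) for j in range(n)]
            for i in range(n)]


def template_agreement(avg_map, template: Model, cutoff: float,
                       min_separation: int = 3) -> float:
    """Mean predicted contact frequency over the template's own contacts."""
    tmap = contact_map(template, cutoff)
    n = len(tmap)
    pairs = [(i, j) for i in range(n) for j in range(i + min_separation, n)
             if tmap[i][j]]
    if not pairs:
        return 0.0
    return sum(avg_map[i][j] for i, j in pairs) / len(pairs)


def pick_template(models: Sequence[Model], templates: Dict[str, Model],
                  cutoff: float = 8.0) -> str:
    """Name of the template whose contacts best match the ensemble."""
    avg = average_contact_map(models, cutoff)
    return max(templates, key=lambda name: template_agreement(avg, templates[name], cutoff))
```

The template coordinates are assumed to be already mapped onto the target numbering, so that index k refers to the same residue in every structure.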
The SPEED-based sampling protocol also has been used to determine the structure of the
insertions of unknown structure that are present in RAPTOR-generated models. These situations
have been treated by breaking the chain at one end of the insertion and then folding this free end
in the context of the entire protein. The most successful outcome is for a 24 residue insertion for
target T0464, where our prediction ranks as one of the top submissions (Fig. 3.10.1c).
3.11 Discussion and Conclusions
Our computationally rapid algorithm using only single (φ,ψ) dihedral angle moves can
generate very accurate predictions of both 2° and 3° structures without relying on any known
structures, templates, or fragments. For the test set, we typically predict 2° structure with ~90%
accuracy, while the best 3° structures for 4 of the 12 targets have a C-RMSD below 2 Å. Hence,
given intelligent search strategies and scoring functions, C representations can be used to
accurately predict 2° and 3° structures.
Structure prediction is beyond current capabilities for the vast majority of the families
identified by large-scale sequencing efforts [2, 20]. The number of sequences with minimal
sequence similarity to known structures is increasing at a rate that outpaces our ability to identify
new families [20]. Currently, only about one third of the single domain architectures have known
folds [20].
Figure 3.10.1 ItFix-SPEED blind predictions in CASP8. a) 2° and 3° structure prediction of target T0482. The Global Distance Test (GDT) value is the % of the residues within a cut-off distance of the native structure. This cut-off distance is the y-value on the plot (e.g., for the ItFix prediction, 83% and 100% of the residues are predicted to within 4.7 and 7.8 Å of the native structure, respectively). The GDT trace for the ItFix prediction (blue line) is the rightmost of all the Model 1 predictions. In addition, the C-RMSD to native is the lowest of all the Model 1 predictions. The ItFix-SPEED prediction for b) the entire Domain 1 of target T0405, and c) the 24 residue insertion in RAPTOR’s predicted template for T0464. d) ItFix-SPEED selection of the best template identified by RAPTOR based on average predicted tertiary contacts. Contact map, upper left: ItFix average contacts for the final structures; lower right: contacts of one of RAPTOR’s lower ranked templates, which is closer to the native structure than its top ranked template. Values in parentheses are the C-RMSD between predictions and the native structure. GDT plots are taken from the CASP8 website (www.predictioncenter.org/casp8/index.cgi).
The ItFix-SPEED procedure is well suited to contribute to mapping the protein universe,
particularly for low homology sequences. Because our procedure utilizes only multiple sequence
alignments, it can take advantage of the 10⁷ known sequences, and not be limited by the ~10⁴
unique structures in the PDB. For CASP8 target T0482, no member of its family had a known
structure, although its fold is not new. The ItFix-SPEED procedure accurately predicted its
structure using only 50 non-redundant sequence homologues and no structural information.
Furthermore, the ItFix-SPEED procedure is able to quantify the global and local accuracy of its
prediction from the reproducibility of the trajectories, a highly desirable feature from the
perspective of users of any sequence database annotation.
3.12 Methods
Generation of Sequence Alignments
Sequence alignments are generated by PSI-BLAST [69] using the executables from
NCBI on the non-redundant database. An inter-sequence similarity cutoff of 65% is imposed
with CD-HIT [150]. PSI-BLAST searches are performed in three passes with an E-value cutoff
of 1.0. We choose only sequences that cover over 90% of the target sequence length and have
gaps that span at most one position. These constraints are chosen such that sequences are very
likely to approximate the same structure as the target. As a result of these constraints, the average
E-value of each sequence in an alignment is orders of magnitude lower than 1.0.
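The coverage and gap constraints can be sketched as a filter on one pairwise alignment row at a time. This is a minimal illustration under stated assumptions: each hit is given as a pair of equal-length aligned strings with `-` for gaps, unaligned termini of the hit count against coverage but not against the gap-length rule, and the function names are hypothetical. The PSI-BLAST and CD-HIT steps themselves are not reproduced.

```python
import re


def longest_gap_run(row: str) -> int:
    """Length of the longest run of '-' characters in an alignment row."""
    runs = re.findall(r"-+", row)
    return max((len(r) for r in runs), default=0)


def passes_filters(target_row: str, hit_row: str,
                   min_coverage: float = 0.90, max_gap: int = 1) -> bool:
    """Keep a hit that covers more than min_coverage of the target and has
    no gap longer than max_gap positions in either row."""
    if len(target_row) != len(hit_row):
        raise ValueError("alignment rows must have equal length")
    target_len = sum(c != "-" for c in target_row)
    covered = sum(t != "-" and h != "-" for t, h in zip(target_row, hit_row))
    if covered <= min_coverage * target_len:
        return False
    # Ignore the unaligned termini of the hit when measuring gap lengths.
    start = len(hit_row) - len(hit_row.lstrip("-"))
    end = len(hit_row.rstrip("-"))
    inner_target, inner_hit = target_row[start:end], hit_row[start:end]
    return (longest_gap_run(inner_target) <= max_gap
            and longest_gap_run(inner_hit) <= max_gap)
```

Restricting gaps to single positions keeps the retained sequences in register with the target, which is what allows per-position statistics to be pooled across the alignment.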
SPEED sampling
The MSA is used to generate an amino acid substitution matrix at each position in the
target sequence. Any amino acid that occurs in more than 10% of the aligned sequences is included at
that position. If a position has only one amino acid in its substitution matrix, the amino acid
occurrence threshold is decremented by 1% until there is more than 1 substitution, with the
exception of proline, which is kept as the sole amino acid at a position down to 5% probability as
long as there are no neighboring positions with prolines that occur at a greater probability. If
proline is the sole amino acid in the MSA-generated substitution matrix, we mutate the target
sequence at that position to proline. In all other cases the sequence used during folding remains
the same as the target sequence.
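The basic thresholding rule can be sketched as below. This covers only the general case (10% threshold, lowered in 1% steps until more than one amino acid qualifies); the proline exception is deliberately left out, and the function name and the representation of an alignment column as a string are assumptions of this sketch.

```python
from collections import Counter
from typing import Set


def allowed_residues(column: str, start_pct: int = 10) -> Set[str]:
    """Amino acids admitted at one alignment position.

    An amino acid is admitted if it occurs in more than `pct` percent of the
    non-gap entries of the column. `pct` starts at `start_pct` and is lowered
    by one percentage point at a time until more than one amino acid is
    admitted; a fully conserved column therefore returns a single residue.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    total = len(residues)
    allowed: Set[str] = set()
    for pct in range(start_pct, -1, -1):
        # Integer arithmetic avoids floating-point drift in the threshold.
        allowed = {aa for aa, c in counts.items() if 100 * c > pct * total}
        if len(allowed) > 1:
            break
    return allowed
```

For example, a column with 95% Ala, 3% Gly and 2% Ser yields {A, G}: the threshold has to drop to 2% before a second amino acid exceeds it.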
We initially tried calculating the SPEED distribution of a position by adding the Rama
distributions at that position for each sequence in the alignment. The SPEED distributions
created from this method, however, are more similar to the homology-free distribution because
the target sequence amino acid often has the highest-probability in the alignment and would be
weighted proportionately in the SPEED distribution. Using a substitution matrix, on the other
hand, weights all amino acids above a threshold equally, thereby rendering the resulting Rama
distribution less similar to the homology-free distribution.
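The difference between the two weighting schemes can be made concrete with a toy mixture. In this sketch each amino acid has a small probability histogram over Rama bins, and `mix_distributions` (a hypothetical helper) combines them with arbitrary weights; passing alignment frequencies reproduces the first scheme, and passing equal weights reproduces the substitution-matrix scheme.

```python
from typing import Dict, List


def mix_distributions(per_residue: Dict[str, List[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Weighted mixture of per-amino-acid bin probabilities."""
    total_weight = sum(weights.values())
    n_bins = len(next(iter(per_residue.values())))
    mixed = [0.0] * n_bins
    for aa, w in weights.items():
        for b, p in enumerate(per_residue[aa]):
            mixed[b] += (w / total_weight) * p
    return mixed


# Toy two-bin histograms: "A" prefers bin 0, "G" prefers bin 1.
toy = {"A": [1.0, 0.0], "G": [0.0, 1.0]}
by_frequency = mix_distributions(toy, {"A": 0.9, "G": 0.1})   # dominated by A
equal_weight = mix_distributions(toy, {"A": 1.0, "G": 1.0})   # A and G count equally
```

With frequency weighting the result stays close to the distribution of the dominant (usually target) amino acid, whereas equal weighting lets the minority residue contribute as much as the majority one.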
Since the statistics for the distributions constructed from an MSA permit many different
combinations of amino acids, the area of the Rama map with vanishing probability tends to be
much smaller for the SPEED distribution than for the previously used distribution because of the
added MSA-identified combinations. In fact, the average number of angles per position used to generate a
SPEED distribution is three to five-fold larger than the number of angles used to generate a
homology-free distribution (Table 5). As seen in Figure 3.1.1b and the subsequent predictions,
this added diversity does not dilute the specificity of the conformational search; indeed the
distributions are more native-like.
Ramachandran sampling
Our prior treatment employs a sampling of specific φ,ψ angle pairs from a library
generated from high resolution crystal structures, conditional on the 2° structure and nearest
neighbor amino acid identities. The present study likewise employs a distribution of φ,ψ angles
with the same dependencies, but instead of sampling from a large list of angles extracted from
PDB models, the φ,ψ angles are chosen from a Rama distribution that is generated for each
position based on the amino acid identity and the 2° structure specification of that position and of
its nearest neighbors. Thus, Rama distributions are calculated for the central residue in each of
the 8000 distinct combinations of three contiguous amino acids, conditional upon the amino acid
identity and on the 2° structure of all three residues. Because the ItFix simulations consider six
possible categories of 2° structure for the construction of the sampling distributions (H: helix, E:
strand, C: coil, A: everything, O: not helix, Q: Not strand), 1,728,000 possible Rama
distributions are constructed to describe the possible 8000 amino acid triplets. Each Rama
distribution has 72² (5184) bins of 5°×5°, and each bin is assigned a probability that is determined by the
frequency of occurrence of these backbone dihedral angles in the PDB for the specific conditions
of amino acid identities and 2° structure. A Rama distribution accommodates the increase in
PDB-derived angles introduced by SPEED without increasing the system memory, as occurs
when each angle is explicitly stored in memory.
The sampling of φ,ψ angles begins by selecting a bin in Rama space according to the
probability assigned to that bin (e.g., a bin that contains 1.5% of the angle counts for the
distribution at that position has a 0.015 probability of being selected). This bin selection is
followed by the selection of a random angle uniformly from within the 5°x5° window of that bin.
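The two-step sampling can be sketched as follows. The flat, row-major layout of the bin probabilities (index = φ bin × 72 + ψ bin), the angle range of [-180°, 180°), and the function name are assumptions of this sketch; only the bin size and the two-step procedure come from the text.

```python
import random
from typing import Sequence, Tuple

BIN_DEG = 5
BINS_PER_AXIS = 360 // BIN_DEG        # 72 bins along phi and along psi
N_BINS = BINS_PER_AXIS ** 2           # 72^2 = 5184 bins per Rama distribution
N_DISTRIBUTIONS = 20 ** 3 * 6 ** 3    # 8000 triplets x 216 2° structure triplets


def sample_phi_psi(bin_probs: Sequence[float], rng=random) -> Tuple[float, float]:
    """Draw one (phi, psi) pair in degrees from a binned Rama distribution.

    Step 1: pick a bin with probability proportional to its weight.
    Step 2: pick a point uniformly inside that 5° x 5° bin.
    """
    if len(bin_probs) != N_BINS:
        raise ValueError(f"expected {N_BINS} bin probabilities")
    index = rng.choices(range(N_BINS), weights=bin_probs, k=1)[0]
    phi_bin, psi_bin = divmod(index, BINS_PER_AXIS)
    phi = -180.0 + BIN_DEG * (phi_bin + rng.random())
    psi = -180.0 + BIN_DEG * (psi_bin + rng.random())
    return phi, psi
```

Storing 5184 probabilities per distribution keeps memory use fixed regardless of how many PDB-derived angles contributed to the histogram, which is the advantage over an explicit angle list noted above.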
The Rama distribution of the central residue of the triplet INE (position 4 in 1tif) with all
allowed 2° structures is an example of one such sampling distribution (Fig 3.1.1b, top). If the
subsequent round of ItFix eliminates a 2° structure option at a position, the Rama distribution at
that position is changed accordingly (Fig. 3.1.1b, middle, bottom).
Clustering algorithm
After the ItFix protocol generates a predicted 2° structure, a further 10,000 folding
simulations are run to maximize the exploration of conformational space. The pairwise C-
RMSD matrix of the resulting 10,000 models is used to cluster the ensemble into groups of
models that all align to each other below a C-RMSD cutoff, an approach that is similar to the
SPICKER algorithm [145]. Other methods [151] cluster according to the C-C distance instead
of the pairwise C-RMSD, but we find that the C-C distances in some cases are highly
correlated even though the C-RMSD between the models are quite different (Fig. 3.8.1b).
When identifying clusters, the upper limit of the cutoff distance of the inter-model C-RMSD is
increased in increments of 1 Å starting at 1 Å until at least five clusters are found, or a 7 Å limit
is reached. Every model in the cluster must have a C-RMSD to every other model in the cluster
that is less than the cutoff distance. Targets with predicted all-α 2° structures have a minimum
cluster size of 5%, whereas the minimum size for targets with other predicted 2° structure types
can be as low as 0.04%. A cluster is eliminated if it contains a model present in a larger cluster.
The largest cluster is selected as the predicted model, unless it has an above-average energy and
there is another cluster with an energy that is more than one standard deviation below the average.
For α/β and β targets, the predicted cluster cannot consist of a fold that contains a predicted
β-strand that is not part of a β-sheet.
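The cutoff schedule and the all-pairs membership rule can be sketched as below. Finding the largest groups in which every pair lies within the cutoff is expensive in general, so this sketch substitutes a simple greedy seed-and-grow procedure; that substitution, the use of an absolute `min_size` instead of a percentage, and the function names are assumptions. The energy-based and β-sheet selection rules are not included.

```python
from typing import List, Sequence, Tuple


def greedy_clusters(rmsd: Sequence[Sequence[float]], cutoff: float,
                    min_size: int) -> List[List[int]]:
    """Greedy clusters in which every pair of members is closer than the cutoff."""
    n = len(rmsd)
    candidates = set()
    for seed in range(n):
        members = [seed]
        for m in range(n):
            # A model joins only if it is within the cutoff of all members.
            if m != seed and all(rmsd[m][k] < cutoff for k in members):
                members.append(m)
        if len(members) >= min_size:
            candidates.add(frozenset(members))
    kept: List[List[int]] = []
    used = set()
    # Largest first; drop any cluster sharing a model with a larger kept one.
    for cluster in sorted(candidates, key=lambda c: (-len(c), min(c))):
        if used.isdisjoint(cluster):
            kept.append(sorted(cluster))
            used |= cluster
    return kept


def cluster_ensemble(rmsd: Sequence[Sequence[float]], min_size: int,
                     min_clusters: int = 5, start: float = 1.0,
                     step: float = 1.0, limit: float = 7.0
                     ) -> Tuple[float, List[List[int]]]:
    """Raise the cutoff in 1 Å steps until enough clusters exist or 7 Å is reached."""
    cutoff = start
    while True:
        clusters = greedy_clusters(rmsd, cutoff, min_size)
        if len(clusters) >= min_clusters or cutoff >= limit:
            return cutoff, clusters
        cutoff += step
```

For a 10,000-model ensemble the pairwise matrix has 10⁸ entries, so a production implementation would need a more memory-conscious representation than the nested lists used here.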
Model refinement
One of the most important challenges of structure prediction is an effective exploration of
conformational space. Ideally an exhaustive refinement is performed for every model generated
by folding, but we take a computationally thrifty approach and refine only the models in the
largest cluster of each target. Refinement uses the same move set and energy function as
folding, with the additional constraint that we reject moves that increase the Rg, Br, or Ru of the
starting model. Each model in the cluster is refined 100 times, and the model with the lowest
average energy among all the refined models is chosen as the prediction listed in Table I.
Parallel scripting with Swift
The ItFix-SPEED algorithm has been implemented, tested and evaluated [152] using an
innovative parallel scripting language called Swift [153]. The Swift runtime system automates
parallelization, data management, and error recovery, and supports execution on a wide variety
of parallel computer systems. This allows the composition of flexible structure prediction scripts
to address new energy functions and explore algorithm enhancements, and to compare the
behavior of the algorithm under a wide range of conditions and parameter settings.
Chapter 4
New methods for protein design
This Chapter contains unpublished and ongoing research. I would like to thank members
of the Sosnick and Freed labs for advice and conversations, and in particular Chloe Antoniou for
consultations concerning experiments.
The need for protein-based medicines and industrial tools has motivated numerous
attempts to design novel protein sequences. This effort has been labeled the “inverse protein
folding problem,” and has led to the creation of novel sequences that encode novel protein
topologies as well as enzymes with new functions. Crucial to all design efforts is an
understanding of which amino acid sequences best encode a specific three-dimensional structure.
Current redesigns of known protein structures have produced sequences that are highly similar to
the naturally occurring sequence, suggesting that current design methods are limited in the ability
to fully explore the sequence space of a given fold. We present a novel design method that uses
the physical principles that govern protein folding to produce the most non-natural sequence
designs yet accomplished.
4.1 Introduction
Chapter 1.7 summarizes notable past and current protein design methods. The significant
protein sequence redesigns that have been determined through experiments to be structured and
stable (Table 1) are all notably similar to the natural sequences of their design template
structures. The purpose of this chapter is to design a protein that lacks the high similarity
observed in previously designed sequences.
One possibility is that the design template structures are highly optimized for the natural
sequence and the exact backbone geometry constrains the sequence to a near native identity. If
so, any sequence change should involve relaxation of the backbone to accommodate the new
residue. In some cases relaxed-backbone design methods are superior to methods that use a rigid
backbone [126, 156], but these studies only involve changes in a small number of residues and
may only require small backbone adjustments. The design methods of Table 1 all incorporate
backbone relaxation, but it is unclear how effectively any design algorithm can simulate the
relaxation required for a residue change that may involve concerted motions throughout the
chain in addition to sidechain rotamer sampling. Given the massive sequence space available,
thorough relaxation after sequence change would be computationally prohibitive.
Another explanation for the observed similarity is that natural sequences are highly
optimized and are not substantially varied within a fold family. For this reason, any properly
constructed design algorithm will rediscover many of the natural residues. Rosetta design
trajectories that include backbone relaxation support this suggestion by consistently generating
designs that are similar to the natural sequence, with core positions being the most similar
(> 50%) [125]. The authors of this study also observe that the designed sequences are more
similar to the natural sequence when the natural sequence family is more conserved, further suggesting that
there is limited sequence variability for a given structure. Anecdotal evidence suggests that this
may not be true; ubiquitin family members that share only 11% amino acid identity
have PDB structures that deviate by less than 1.5 Å RMSD when aligned (Fig. 4.1.1), showing
that distantly related sequences can share nearly identical structures. Thus any design strategy
that can implement a thorough and unbiased sequence search with a realistically flexible
backbone should be able to escape near-native sequence space.
Figure 4.1.1 Structures of distant ubiquitin family members. Human (1ubq, cyan) and yeast (1l2n, green) ubiquitin structures align to 1.5 Å RMSD. The amino acid sequences of these structures differ at more than 89% of aligned positions.
In almost all of the verified designs, however, the database sequence most similar to the design
is very highly related to the natural sequence of the template structure (> 50%) and sometimes
is the natural sequence itself (Table 1). This occurs even though there are
numerous sequence family members that are close enough to the natural sequence to be
structurally very similar, yet far enough away to be distinct from wild-type.
Thus it appears that current backbone relaxation methods are not able to overcome the
initial template structure constraints to reach a truly unique design. Here, we introduce a method
that is designed to force itself away from the natural sequence of the template in order to
generate truly novel protein sequences.
4.2 Choice of ubiquitin as a design target
The choice of ubiquitin as a design target offers unique advantages. Primarily, ubiquitin
has a very large number of non-redundant sequence homologues (tens of thousands), compared
to the targets in Table 1 (typically hundreds). This implies that the uniqueness of any ubiquitin
design as a non-natural sequence can be tested against a very large number of sequences.
A potential challenge with ubiquitin is that previous attempts to redesign its hydrophobic core
have resulted in destabilized proteins [157]. This suggests that the sidechain packing
requirements may be very stringent and that any loss of stability due to core
substitution would require compensation from stabilizing surface substitutions.
4.3 Design protocol
The design method seeks to reduce the size of the sequence search space, which is
prohibitively large at $20^{72}$ possible sequences for ubiquitin if every amino acid is allowed at
every position. The nineteen positions that are buried in the template model can be limited to
seven hydrophobic amino acids (ILE, VAL, LEU, PHE, TRP, TYR, ALA) and exposed residues
can be limited to all other residues and alanine, reducing the search space by a factor of $10^{17}$ to
$14^{53} \times 7^{19}$. In addition, prolines are not sampled at positions that contain backbone amide
hydrogens involved in hydrogen bonds according to the dope-PW energy function (Chapters 2
and 3).
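The size of this reduction can be checked with a few lines of arithmetic. The sketch below works in log10 units; the function name and the (positions, allowed residues) layout are illustrative choices, not part of the design software.

```python
import math

def log10_sequences(position_classes):
    """Return log10 of the number of sequences.

    position_classes is a list of (n_positions, n_allowed_residues) pairs.
    """
    return sum(n * math.log10(k) for n, k in position_classes)

# Unrestricted search: 72 positions, 20 amino acids each.
full = log10_sequences([(72, 20)])

# Restricted search: 19 buried positions with 7 hydrophobic residues,
# 53 exposed positions with the 13 remaining residues plus alanine (14).
restricted = log10_sequences([(19, 7), (53, 14)])

print(f"full space:       10^{full:.1f}")
print(f"restricted space: 10^{restricted:.1f}")
print(f"reduction factor: 10^{full - restricted:.1f}")
```

The reduction comes out just under 17 orders of magnitude, consistent with the factor quoted above.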
The next step in the procedure is to optimize the local φ/ψ dihedral propensity by
selecting the most probable sets of sequences in contiguous six-residue segments. The
conditional probability of a segment is the cumulative product of the probabilities of the five
dihedral pairs that constitute a six-residue segment. The remaining step is a Monte Carlo
minimization of the dope-PW energy function with sampling of the sequence library generated in
the previous step, using the same protocol used in Chapters 2 and 3, with the converged sequence
constituting the final designed sequence.
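The segment scoring can be sketched as follows. This is a minimal illustration that assumes the "five dihedral pairs" are the five adjacent residue pairs of a six-residue segment. The callable `pair_prob` is a hypothetical stand-in for a lookup of the conditional probability of the native φ/ψ angles for one adjacent pair; it is not the thesis implementation.

```python
import math

def segment_log_prob(segment, pair_prob):
    """Log-probability of a six-residue segment.

    pair_prob(i, left, right) is a hypothetical lookup returning the
    conditional probability of the native dihedral angles for the adjacent
    residue pair starting at offset i. Summing logs is equivalent to taking
    the cumulative product of the five pair probabilities.
    """
    if len(segment) != 6:
        raise ValueError("segment must contain exactly six residues")
    return sum(math.log(pair_prob(i, segment[i], segment[i + 1]))
               for i in range(5))

def rank_segments(candidates, pair_prob):
    """Order candidate segment sequences from most to least probable."""
    return sorted(candidates,
                  key=lambda seg: segment_log_prob(seg, pair_prob),
                  reverse=True)
```

Working in log space avoids underflow when many small probabilities are multiplied, and the ranked list plays the role of the sequence library passed to the Monte Carlo stage.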
The methods in Table 1 incorporate amino acid compositional terms into their energy
functions in order to maintain a natural-like composition, which has been suggested to be important
for maintaining protein solubility [31, 32, 35, 158]. There is no explicit composition term
employed in this method, other than an arbitrarily high penalty for having more than ten percent
of the residues identical to the wild-type sequence.
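One way to express the identity penalty is shown below. This is a sketch under stated assumptions: the penalty magnitude, the function names, and the `energy` callable (standing in for the dope-PW score of a candidate sequence) are all hypothetical.

```python
def fraction_identical(design, wild_type):
    """Fraction of aligned positions where the design matches wild-type."""
    if len(design) != len(wild_type):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(design, wild_type))
    return matches / len(wild_type)

def penalized_energy(design, wild_type, energy,
                     max_identity=0.10, penalty=1.0e6):
    """Energy of a design plus a large penalty above the identity cutoff.

    energy is a hypothetical callable scoring a sequence; penalty is an
    arbitrary large constant chosen here only for illustration.
    """
    score = energy(design)
    if fraction_identical(design, wild_type) > max_identity:
        score += penalty
    return score
```

Because the penalty is far larger than any plausible energy difference, a Monte Carlo search using `penalized_energy` would effectively reject any move that raises the identity above ten percent.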
Backbone relaxation is not performed explicitly here; instead, the method relies on the flexibility of
the scoring functions. For example, in the Rama optimization stage the φ/ψ dihedral propensity is
calculated using 20° by 20° bins, which allows a large window of backbone torsional flexibility
around the native angles. Similarly, the dope-PW statistical potential has a bin size of 0.5 Å for
hydrogen-bonded β-strand interactions and 1.0 Å for long distance pair interactions. Each of
these bin sizes is within the distance variation of the respective interaction types among the
aligned structures of Figure 4.1.1, which suggests sufficient flexibility to compensate for
substitutions.
4.4 Negative design
Negative design is a challenging objective since it is difficult to anticipate which non-native
conformations might exist and should therefore be targeted for destabilization. As such, there is no
explicit negative design incorporated in this method, but since the objective of negative design is
to provide specificity for the desired conformation, the scoring function of this method contributes
implicitly by providing specificity towards the design template. At the local level, this is
accomplished by finding the sequence that is most likely to adopt the native backbone at the
expense of all other backbone conformations. For example, rather than look for the sequence that
has the highest probability for the native Rama angles relative to other sequences, this method
seeks the sequence that has the highest probability for the native Rama angles relative to other
Rama angles for that same sequence. This method thus selects the amino acid sequence that is
most specific to the backbone of the native fold. In practice the Rama propensity of the ubiquitin
redesign is higher than that of the natural sequence at most positions (Fig. 4.4.1).
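The distinction between the two selection criteria can be made concrete with a toy count table. The counts and residue labels below are invented purely for illustration and are not real backbone statistics; the function names are likewise hypothetical.

```python
# Toy observation counts: residue label -> Rama region -> count.
# These numbers are made up to illustrate the two criteria.
counts = {
    "AA1": {"native": 80, "other": 120},
    "AA2": {"native": 30, "other": 10},
}

def p_region_given_residue(counts, residue, region):
    """Probability of a Rama region relative to other regions, same residue."""
    return counts[residue][region] / sum(counts[residue].values())

def p_residue_given_region(counts, residue, region):
    """Probability of a residue relative to other residues, same region."""
    total = sum(per_region[region] for per_region in counts.values())
    return counts[residue][region] / total

# Criterion used by this method: most specific to the native region.
most_specific = max(counts, key=lambda r: p_region_given_residue(counts, r, "native"))

# Alternative criterion: most frequent in the native region across residues.
most_frequent = max(counts, key=lambda r: p_residue_given_region(counts, r, "native"))

print(most_specific, most_frequent)
```

In this toy table the second residue is rarer in the native region overall but adopts it far more exclusively, so the specificity criterion selects it while the across-residue criterion does not.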
Figure 4.4.1 Designed Rama propensity. Propensity of native Rama region (Fig. 3.3.1a) at each position in ubiquitin natural (1ubq) and design sequences.
For long distance interactions, similar specificity may be incorporated with the dope-PW
statistical potential. In essence, it has been suggested that polar and electrostatic residues provide
the specificity of folding [154] [155]. For example, a salt bridge might be only marginally
stabilizing in a native conformation, but it might be destabilizing in a non-native conformation,
serving to enforce specificity. Polar and electrostatic interactions are indeed specified by the
dope-PW energy function (Fig. 2.4.1), thus contributing to implicit negative design. The dope-PW
energy of the designed ubiquitin sequence in the template 1ubq structure is indeed lower than that
of the natural sequence (ΔG = ~−120). The significance of this is underscored by the sometimes
competing interests of local Rama propensity and long-distance interactions, since what may
stabilize one could destabilize the other. Having both simultaneously optimized is thus a crucial
test of the design method.
4.5 Conclusions
The final objective is to experimentally validate the design, first using low-resolution
spectroscopic techniques such as far-ultraviolet circular dichroism to detect native secondary
structure content. In addition, the peak dispersion in two-dimensional nuclear magnetic resonance
spectroscopy can be used to determine the extent to which the designed protein is structured.
Provided these techniques indicate success, the ultimate verification is to obtain a high-resolution
crystallographic model of the design. If these experimental validations are successful, they will
confirm the most unique non-natural redesign yet achieved.
Chapter 5
Structure prediction: Future directions and conclusions
5.1 Introduction
The focus of structure prediction, and hence of Chapters 2 and 3 of this thesis, has been
the conformational search and protein scoring functions. Any future improvements in the
methods described here will focus on these two subjects. The codependence of searching and
scoring has been emphasized in previous chapters, and will again feature strongly in this chapter,
but now in the context of new directions in structure prediction.
5.2 Enhancement of the conformational search
Compared to our homology-free structure prediction method (Chapter 2), the percentage
of native-like models was enhanced using evolutionarily-diversified sampling (Chapter 3). This
suggests that the conformational search is the factor that limits the sampling of native-like
conformations, which has been previously proposed [21, 120]. Therefore one of the principal
tasks towards achieving higher accuracy in structure prediction is to restrict sampling as much as
possible while ensuring that the native conformation is never eliminated.
We have developed a method that restricts sampling at the local level [8, 23, 152], but
what remains elusive is a reliable way of fixing more non-local contacts. For α-helices, this is
already accomplished by fixing local 2° structure, which imposes long distance chain interaction
constraints due to the rigidity of the helix and perhaps explains why all-α targets tend to be
predicted with more accuracy by ItFix [8]. Fixing β-strands at the local level, however, has less
impact on the conformational search due to the larger Ramachandran space available to that 2°
structure type (Fig. 1.2.2). Therefore, the next logical step towards restricting the conformational
search entails fixing β-sheet structures in three-dimensional space.
In contrast to the straightforward fixing of α-helices, locking down β-sheet structures
would be practically more complex if there is a large sequence separation between the
constituent strands (Fig. 5.2.1a). This results from the limited number of backbone torsion
changes that can be inserted into the intervening region without breaking the sheet, which
prevents the effective sampling of structured regions between the strands. Two-residue β-turn
motifs that separate the strands in β-hairpins (Fig. 5.2.1b), however, are fully structured when the
hydrogen bond network is complete, which suggests that these units of structure may be used as
rigid sampling units. This supposition is reinforced by the folding of the N-terminal ubiquitin
β-hairpin, which achieves a highly native structural model when folded and predicted with
DOPE-PW (Fig. 5.2.1c).
The strategy of fixing substructures may be necessary due to size limitations of current
de novo structure prediction methods, where the sampling time required for larger sequences can
be prohibitive [23]. Any manner of reducing the complexity of the search by restricting which
parts of the chain may be formed at a given time could simplify structure prediction in a manner
that resembles the folding of real proteins [106-110].
Figure 5.2.1 ItFix β-structure prediction. (a) Long structured loops prevent the fixing of flanking β-structures. (b) Short β-turns allow prediction of strands in tight hairpin structures, which are predicted by ItFix with high accuracy.
5.3 Algorithm-free Smooth ItFix
One drawback of the ItFix protocols that were described in Chapters 1 and 2 and the
associated publications [8, 23, 152] is the use of probability threshold parameters in the 2°
structure fixing algorithm. It is possible to eliminate the use of these parameters simply by using
the models generated from ItFix rather than applying an external 2° structure calculation. This
process, called Smooth ItFix, is demonstrated in Figure 5.3.1. As before, the Rama distribution
for a position in the first round of folding is generated from the amino acid identities of that
position and its nearest neighbors. The structures generated from the first round of folding are
then used to generate the Rama distribution for the second round, with this process iterating as
long as is practically feasible. Figure 5.3.1 demonstrates that a natively extended position in
ubiquitin initially prefers a helical Rama distribution, but through four rounds of Smooth ItFix
comes to prefer the native backbone geometry.
What remains to be determined for Smooth ItFix is whether to use the entire ensemble
of structures or a subset of structures filtered based on energy. The latter might introduce more
accurate Rama distributions due to the more native-like distribution of low energy models (Fig.
3.5.2), but it would also reduce the number of models available to calculate the Rama
distribution. This is crucial because having a large search space that encompasses the native
Rama angles is better than having a small space that excludes them. Therefore any successful
implementation of Smooth ItFix might require a larger number of simulation models from each
round.
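The iteration described in this section can be sketched as a short loop. This is a schematic illustration only: `fold_round` is a hypothetical callable standing in for one round of folding simulations, the 20° bin width is an assumption, and whether the ensemble is filtered by energy before the distributions are rebuilt is left to the caller.

```python
from collections import Counter

def rama_bin(phi, psi, width=20.0):
    """Map (phi, psi) in degrees to a 2D bin index, wrapping at +/-180."""
    return (int(((phi + 180.0) % 360.0) // width),
            int(((psi + 180.0) % 360.0) // width))

def distributions_from_ensemble(ensemble, width=20.0):
    """Build one Rama distribution per residue from an ensemble.

    ensemble is a list of structures; each structure is a list of
    (phi, psi) tuples, one per residue.
    """
    n_residues = len(ensemble[0])
    distributions = []
    for i in range(n_residues):
        counts = Counter(rama_bin(*structure[i], width) for structure in ensemble)
        total = sum(counts.values())
        distributions.append({b: c / total for b, c in counts.items()})
    return distributions

def smooth_itfix(initial_distributions, fold_round, n_rounds):
    """Iterate folding rounds, feeding each round's structures forward.

    fold_round(distributions) is a hypothetical function that runs folding
    simulations sampled from the given distributions and returns an ensemble.
    """
    distributions = initial_distributions
    for _ in range(n_rounds):
        ensemble = fold_round(distributions)
        distributions = distributions_from_ensemble(ensemble)
    return distributions
```

No probability threshold appears anywhere in the loop, which is the point of the approach. The trade-off discussed above shows up in how `fold_round` selects its output: returning only low-energy models sharpens the distributions but leaves fewer structures per bin.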
Figure 5.3.1 The Smooth ItFix protocol. With no adjustable parameters, Smooth ItFix is an algorithm-free method for structure prediction. An initial sequence-based Rama distribution is the input to iterative rounds of folding where the Rama distribution of a subsequent round is taken directly from the structures of the previous round.
5.4 New energy functions for folding and refinement
Improvements in the conformational search should coincide with improvements in energy
functions for the identification of native-like conformations. The DOPE-PW statistical potential
introduced in Chapter 2 is the only orientation-dependent Cβ pairwise statistical potential to be
successfully incorporated into de novo structure prediction. Chapter 3 described how more globally
oriented multi-body radial terms were found to be ideal for generating properly collapsed models
for structures containing β-sheets. On the other hand, very detailed energy functions like DOPE-PW
are ideally suited for refinement of complex structures that are near the native state and are
less efficient earlier in the search. These results may not be divorced from the physical reality of
protein folding. In summary, it has been suggested that hydrophobic residues may provide the
lion’s share of the energy required for folding, whereas polar and electrostatic residues provide
the specificity of folding [154] [155]. Thus it is not surprising that our prediction methods
achieve the most success when the collapse of the chain is guided cooperatively by hydrophobic
desolvation, and the refinement of those collapsed structures is guided by highly conditional
specific interactions.
Given these principles, two obvious outstanding goals are the following: (1) maximize
the proportion of properly collapsed models; (2) increase specificity during the refinement of the
collapsed models. Chapter 3 included the outline of a sequence profiling approach that shows much
promise for achieving each of these goals. To briefly summarize, some target sequence amino
acid identities do not reflect the optimal solvent accessibility or pairwise compatibility expected
from the location of those residues in the native structure. For example, hydrophobic residues
can be solvent exposed (Fig. 5.4.1a) and polar and apolar residues can be on the same side of a
β-sheet (Fig. 5.4.1b). As described in Chapter 3, these scenarios are analogous to residues being in
2° structures for which they have minimal propensity. In those cases, averaging across a
sequence profile gave those positions more native-like 2° structure propensity. Thus, the next
logical task is to integrate the SPEED sequence profiling method into our energy functions.
Figure 5.4.1 DOPE-PW-SPEED encodes higher specificity. Non-native interactions favored by DOPE-PW in 1ubq can be corrected by averaging the pair interactions across a multiple sequence alignment in DOPE-PW-SPEED. (a) Exposed hydrophobic residues no longer prefer to be buried together and (b) polar and apolar residues switch to the improbable native interaction on the same side of a β-sheet.
Integration of SPEED into the radial terms is not as simple as averaging the amino acid
identities across the sequence profile because the radial terms identify a residue as either
hydrophobic or hydrophilic. Therefore, one possible approach is to take the consensus amino
acid identities from the multiple sequence alignment and use the polarity designation of the
consensus identities in the burial ratio term. Results of this calculation are shown in Table 7 for
the targets found in Tables 3-6. In every example, the consensus burial ratio is smaller than that
of the target amino acid sequence. This is almost exclusively a result of surface-exposed
hydrophobic residues from the target sequence being mutated to hydrophilic residues in the
consensus. This could enforce more specific restrictions on hydrophobic collapse and reduce
non-native conformations.
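The consensus step can be illustrated with a small sketch. The hydrophobic residue set below is an assumption made for the example (the classification used by the radial terms is not restated here), the alignment is a toy, and the burial ratio itself is not computed; only the polarity designations that would feed into it are shown.

```python
from collections import Counter

# Assumed hydrophobic set (one-letter codes), for illustration only.
HYDROPHOBIC = frozenset("AILMFVWY")

def consensus_sequence(alignment):
    """Most common non-gap residue in each column of an alignment."""
    consensus = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        consensus.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(consensus)

def polarity(sequence):
    """Label each residue H (hydrophobic) or P (hydrophilic); gaps stay '-'."""
    return "".join("-" if r == "-" else ("H" if r in HYDROPHOBIC else "P")
                   for r in sequence)

# Toy alignment; the first row plays the role of the target sequence.
alignment = ["MAIK", "MKIK", "MKLK"]
target_polarity = polarity(alignment[0])
consensus_polarity = polarity(consensus_sequence(alignment))
print(target_polarity, consensus_polarity)
```

In the toy alignment the second position is hydrophobic in the target but hydrophilic in the consensus, which mirrors the surface-exposed hydrophobic residues described above.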
As for integration of SPEED into DOPE-PW for the purposes of refinement, the potential
for increased specificity is clear. Instead of using singular amino acid identities from the target
sequence to calculate the interaction energy of two residues, the multiple sequence alignment
offers an averaged pairwise interaction across two positions. As an example, two hydrophobic
residues (ALA and ILE) in ubiquitin are solvent exposed (Fig. 5.4.1a), but this information is not
known during folding. If a sampled conformation puts the Cβ atoms within the non-native
optimal interaction distance of those two amino acids, there will be an energetic benefit. With the
pair interactions for those two positions averaged across the alignment (DOPE-PW-SPEED),
however, there is no energetic benefit to that interaction. DOPE-PW-SPEED also normalizes the
alignment of β-sheets in a similar manner (Fig. 5.4.1b).
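The averaging can be sketched as below. This is a simplified illustration: `pair_energy` is a hypothetical distance-only stand-in for the pair potential (the orientation dependence of DOPE-PW is omitted), and the toy alignment, contact energy, and cutoff are invented for the example.

```python
def averaged_pair_energy(alignment, i, j, distance, pair_energy):
    """Average a pair energy over the residue pairs seen at columns i and j.

    alignment is a list of equal-length aligned sequences; rows with a gap
    at either column are skipped. pair_energy(a, b, distance) is a
    hypothetical pair-potential lookup.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment
             if seq[i] != "-" and seq[j] != "-"]
    if not pairs:
        return 0.0
    return sum(pair_energy(a, b, distance) for a, b in pairs) / len(pairs)

def toy_pair_energy(a, b, distance):
    """Toy potential: close hydrophobic pairs are favorable, all else neutral."""
    hydrophobic = set("AILMFVWY")  # assumed classification, for illustration
    return -1.0 if a in hydrophobic and b in hydrophobic and distance < 7.0 else 0.0

# Two columns of a toy alignment; the first row is the target (ALA, ILE).
alignment = ["AI", "KE", "KT", "AT"]
target_only = toy_pair_energy("A", "I", 6.0)
averaged = averaged_pair_energy(alignment, 0, 1, 6.0, toy_pair_energy)
print(target_only, averaged)
```

With the target sequence alone the close contact is rewarded in full, whereas averaging over the alignment dilutes the reward because most homologues do not carry a hydrophobic pair at those positions.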
These results suggest that increasing the specificity of an energy function aids the
identification of native interactions. There is, however, a disadvantage to increased specificity.
When the native conformation is more narrowly defined it is more difficult to locate in the
conformational search, which has been referred to as the “needle in a haystack” problem [21].
For this reason we have developed a structure prediction paradigm where the initial chain
collapse is a global multi-body coarse search and the refinement of those collapsed models is
defined by highly specific pair interactions.
5.5 Conclusions
The principal application of these developments is clear. The explosion in new protein
sequences over the last few decades [2] and the relatively slow pace of experimental structure
determination [3, 4] necessitate rapid and accurate structure prediction. We have developed
methods for the de novo prediction of protein structure that can directly measure the global and
local confidence in our predictions and we have integrated these methods with template-based
modeling methods in order to select only target sequences that have no significant structures in
the PDB [23]. We have also tested our methods on the computing resources necessary to expand
our prediction endeavors on a genomic scale [152]. As such, we are prepared to take the rapidly
expanding pool of new sequence families [20] and predict representative models.
References
1. Kryshtafovych, A., K. Fidelis, and J. Moult, CASP8 results in context of previous experiments. Proteins, 2009. 77 Suppl 9: p. 217-28.
2. Yooseph, S., et al., The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol, 2007. 5(3): p. e16.
3. Service, R.F., Structural biology. Protein structure initiative: phase 3 or phase out. Science, 2008. 319(5870): p. 1610-3.
4. Lattman, E., The state of the Protein Structure Initiative. Proteins, 2004. 54(4): p. 611-5.
5. Service, R.F., Problem solved* (*sort of). Science, 2008. 321(5890): p. 784-6.
6. Sosnick, T.R., Kinetic barriers and the role of topology in protein and RNA folding. Prot. Sci., 2008. 17: p. 1308–1318.
7. Kryshtafovych, A., et al., Progress over the first decade of CASP experiments. Proteins, 2005. 61 Suppl 7: p. 225-36.
8. DeBartolo, J., et al., Mimicking the folding pathway to improve homology-free protein structure prediction. Proc Natl Acad Sci U S A, 2009. 106(10): p. 3734-9.
9. Yang, J.S., et al., All-atom ab initio folding of a diverse set of proteins. Structure, 2007. 15(1): p. 53-63.
10. Jones, D.T., Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 1999. 292(2): p. 195-202.
11. Qian, B., et al., High-resolution structure prediction and the crystallographic phase problem. Nature, 2007. 450(7167): p. 259-64.
12. Chan, H.S., S. Bromberg, and K.A. Dill, Models of cooperativity in protein folding. Philosophical Transactions of the Royal Society of London - Series B: Biological Sciences, 1995. 348(1323): p. 61-70.
13. Yue, K., et al., A test of lattice protein folding algorithms. Proc. Natl. Acad. Sci. USA, 1995. 92(1): p. 325-9.
14. Hegler, J.A., et al., Restriction versus guidance in protein structure prediction. Proc Natl Acad Sci U S A, 2009. 106(36): p. 15302-7.
15. Ozkan, S.B., et al., Protein folding by zipping and assembly. Proc. Natl. Acad. Sci. U S A, 2007. 104(29): p. 11987–11992.
16. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
17. Li, W., L. Jaroszewski, and A. Godzik, Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001. 17(3): p. 282-3.
18. Bateman, A., et al., The Pfam protein families database. Nucleic Acids Res, 2002. 30(1): p. 276-80.
19. Fitch, W.M., Distinguishing homologous from analogous proteins. Syst Zool, 1970. 19(2): p. 99-113.
20. Levitt, M., Nature of the protein universe. Proc Natl Acad Sci U S A, 2009. 106(27): p. 11079-84.
21. Bradley, P., K.M. Misura, and D. Baker, Toward high-resolution de novo structure prediction for small proteins. Science, 2005. 309(5742): p. 1868-71.
22. Xu, J., J. Peng, and F. Zhao, Template-based and free modeling by RAPTOR++ in CASP8. Proteins, 2009. 77 Suppl 9: p. 133-7.
23. DeBartolo, J., et al., Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci, 2010. 19(3): p. 520-34.
24. DeGrado, W.F., et al., De novo design and structural characterization of proteins and metalloproteins. Annu Rev Biochem, 1999. 68: p. 779-819.
25. Pabo, C., Molecular technology. Designing proteins and peptides. Nature, 1983. 301(5897): p. 200.
26. Butterfoss, G.L. and B. Kuhlman, Computer-based design of novel protein structures. Annu Rev Biophys Biomol Struct, 2006. 35: p. 49-65.
27. Malakauskas, S.M. and S.L. Mayo, Design, structure and stability of a hyperthermophilic protein variant. Nat Struct Biol, 1998. 5(6): p. 470-5.
28. Pan, Y., et al., Computational redesign of human butyrylcholinesterase for anticocaine medication. Proc Natl Acad Sci U S A, 2005. 102(46): p. 16656-61.
29. Korkegian, A., et al., Computational thermostabilization of an enzyme. Science, 2005. 308(5723): p. 857-60.
30. Gribenko, A.V., et al., Rational stabilization of enzymes by computational redesign of surface charge-charge interactions. Proc Natl Acad Sci U S A, 2009. 106(8): p. 2601-6.
31. Filikov, A.V., et al., Computational stabilization of human growth hormone. Protein Sci, 2002. 11(6): p. 1452-61.
32. Dantas, G., et al., A large scale test of computational protein design: folding and stability of nine completely redesigned globular proteins. J Mol Biol, 2003. 332(2): p. 449-60.
33. Harbury, P.B., et al., High-resolution protein design with backbone freedom. Science, 1998. 282(5393): p. 1462-7.
34. Villegas, V., et al., Protein engineering as a strategy to avoid formation of amyloid fibrils. Protein Sci, 2000. 9(9): p. 1700-8.
35. Dahiyat, B.I. and S.L. Mayo, De novo protein design: fully automated sequence selection. Science, 1997. 278(5335): p. 82-7.
36. Handel, T.M., S.A. Williams, and W.F. DeGrado, Metal ion-dependent modulation of the dynamics of a designed protein. Science, 1993. 261(5123): p. 879-85.
37. Betz, S.F., P.A. Liebman, and W.F. DeGrado, De novo design of native proteins: characterization of proteins intended to fold into antiparallel, rop-like, four-helix bundles. Biochemistry, 1997. 36(9): p. 2450-8.
38. Betz, S.F. and W.F. DeGrado, Controlling topology and native-like behavior of de novo-designed peptides: design and characterization of antiparallel four-stranded coiled coils. Biochemistry, 1996. 35(21): p. 6955-62.
39. Kuhlman, B., et al., Design of a novel globular protein fold with atomic-level accuracy. Science, 2003. 302(5649): p. 1364-8.
40. Kellis, J.T., Jr., et al., Contribution of hydrophobic interactions to protein stability. Nature, 1988. 333(6175): p. 784-6.
41. Matthews, B.W., Studies on protein stability with T4 lysozyme. Adv Protein Chem, 1995. 46: p. 249-78.
42. Gassner, N.C., W.A. Baase, and B.W. Matthews, A test of the "jigsaw puzzle" model for protein folding by multiple methionine substitutions within the core of T4 lysozyme. Proc Natl Acad Sci U S A, 1996. 93(22): p. 12155-8.
43. Axe, D.D., N.W. Foster, and A.R. Fersht, Active barnase variants with completely random hydrophobic cores. Proc Natl Acad Sci U S A, 1996. 93(11): p. 5590-4.
44. Krylov, D., I. Mikhailenko, and C. Vinson, A thermodynamic scale for leucine zipper stability and dimerization specificity: e and g interhelical interactions. Embo J, 1994. 13(12): p. 2849-61.
45. Lumb, K.J. and P.S. Kim, Measurement of interhelical electrostatic interactions in the GCN4 leucine zipper. Science, 1995. 268(5209): p. 436-9.
46. Kenar, K.T., B. Garcia-Moreno, and E. Freire, A calorimetric characterization of the salt dependence of the stability of the GCN4 leucine zipper. Protein Sci, 1995. 4(9): p. 1934-8.
47. Barlow, D.J. and J.M. Thornton, Ion-pairs in proteins. J Mol Biol, 1983. 168(4): p. 867-85.
48. Chou, P.Y. and G.D. Fasman, Prediction of protein conformation. Biochemistry, 1974. 13(2): p. 222-45.
49. McGregor, M.J., S.A. Islam, and M.J. Sternberg, Analysis of the relationship between side-chain conformation and secondary structure in globular proteins. J Mol Biol, 1987. 198(2): p. 295-310.
50. Munoz, V. and L. Serrano, Intrinsic secondary structure propensities of the amino acids, using statistical phi-psi matrices: comparison with experimental scales. Proteins, 1994. 20(4): p. 301-11.
51. Dunbrack, R.L., Jr. and M. Karplus, Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nat Struct Biol, 1994. 1(5): p. 334-40.
52. Jha, A.K., et al., Helix, Sheet, and Polyproline II Frequencies and Strong Nearest Neighbor Effects in a Restricted Coil Library. Biochemistry, 2005. 44(28): p. 9691-702.
53. Colubri, A., et al., Minimalist Representations and the Importance of Nearest Neighbor Effects in Protein Folding Simulations. J. Mol. Biol., 2006. 363: p. 835-857.
54. Richardson, J.S. and D.C. Richardson, The de novo design of protein structures. Trends Biochem Sci, 1989. 14(7): p. 304-9.
55. Booth, D.R., et al., Instability, unfolding and aggregation of human lysozyme variants underlying amyloid fibrillogenesis. Nature, 1997. 385(6619): p. 787-93.
56. Pauling, L., R.B. Corey, and H.R. Branson, The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. USA, 1951. 37: p. 235-240.
57. Pauling, L. and R.B. Corey, Configurations of polypeptide chains with favored conformations around single bonds: Two new pleated sheets. Proc. Natl. Acad. Sci. USA, 1951. 37: p. 729-740.
58. Kabsch, W. and C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-637.
59. Ramachandran, G.N., C. Ramakrishnan, and V. Sasisekharan, Stereochemistry of Polypeptide Chain Configurations. J. Mol. Biol., 1963. 7(1): p. 95-&.
60. Rose, G.D., Hierarchic organization of domains in globular proteins. Journal of Molecular Biology, 1979. 134(3): p. 447-70.
61. Dill, K.A., Dominant forces in protein folding. Biochemistry, 1990. 29(31): p. 7133-7155.
62. Stickle, D.F., et al., Hydrogen bonding in globular proteins. Journal of Molecular Biology, 1992. 226(4): p. 1143-59.
63. Baker, E.N. and R.E. Hubbard, Hydrogen bonding in globular proteins. Prog Biophys Mol Biol, 1984. 44(2): p. 97-179.
64. Dill, K.A., K.M. Fiebig, and H.S. Chan, Cooperativity in protein-folding kinetics. Proc. Natl. Acad. Sci. USA, 1993. 90: p. 1942-1946.
65. Chou, P.Y. and G.D. Fasman, Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry, 1974. 13(2): p. 211-22.
66. Chou, P.Y. and G.D. Fasman, Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol, 1978. 47: p. 45-148.
67. Rost, B. and C. Sander, Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol, 1993. 232(2): p. 584-99.
68. Pollastri, G., et al., Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 2002. 47(2): p. 228-35.
69. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 1997. 25(17): p. 3389-402.
70. Cuff, J.A. and G.J. Barton, Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 2000. 40(3): p. 502-11.
71. Dor, O. and Y. Zhou, Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins, 2007. 66(4): p. 838-45.
72. Kihara, D., The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci., 2005. 14(8): p. 1955-63.
73. Minor, D.L., Jr. and P.S. Kim, Context-dependent secondary structure formation of a designed protein sequence. Nature, 1996. 380(6576): p. 730-4.
74. Alexander, P.A., et al., The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc. Natl. Acad. Sci. U S A, 2007.
75. Meiler, J. and D. Baker, Coupled prediction of protein secondary and tertiary structure. Proc. Natl. Acad. Sci. U S A, 2003. 100(21): p. 12105-10.
76. Service, R.F., Structural biology. Researchers hone their homology tools. Science, 2008. 319(5870): p. 1612.
77. Zhang, Y., I-TASSER: fully automated protein structure prediction in CASP8. Proteins, 2009. 77 Suppl 9: p. 100-13.
78. Krieger, E., et al., Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins, 2009. 77 Suppl 9: p. 114-22.
79. Hildebrand, A., et al., Fast and accurate automatic structure prediction with HHpred. Proteins, 2009. 77 Suppl 9: p. 128-32.
80. Raman, S., et al., Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins, 2009. 20: p. 20.
81. Ben-David, M., et al., Assessment of CASP8 structure predictions for template free targets. Proteins, 2009. 77 Suppl 9: p. 50-65.
82. Fang, Q. and D. Shortle, Protein refolding in silico with atom-based statistical potentials and conformational search using a simple genetic algorithm. J Mol Biol, 2006. 359(5): p. 1456-67.
83. Zhao, F., et al., Discriminative learning for protein conformation sampling. Proteins, 2008. 73(1): p. 228-40.
84. Levinthal, C., Are there pathways for protein folding? J. Chim. Phys., 1968. 65: p. 44-45.
85. Jorgensen, W.L. and J. Tirado-Rives, The OPLS [optimized potentials for liquid simulations] potential functions for proteins, energy minimizations for crystals of cyclic peptides and crambin. J. Am. Chem. Soc., 1988: p. 1657-66.
86. Brooks, B.R., et al., CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem., 1983: p. 187-217.
87. Pearlman, D.A., D.A. Case, J.W. Caldwell, W.S. Ross, T.E. Cheatham III, D.M. Ferguson, U.C. Singh, P. Weiner, and P. Kollman, Amber 4.1. 1995.
88. Shen, M.Y. and A. Sali, Statistical potential for assessment and prediction of protein structures. Protein Sci., 2006. 15(11): p. 2507-24.
89. Fitzgerald, J.E., et al., Reduced Cbeta statistical potentials can outperform all-atom potentials in decoy identification. Protein Sci., 2007. 16(10): p. 2123-39.
90. Fang, Q. and D. Shortle, A consistent set of statistical potentials for quantifying local side-chain and backbone interactions. Proteins, 2005. 60(1): p. 90-6.
91. Summa, C.M. and M. Levitt, Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci U S A, 2007. 104(9): p. 3177-82.
92. Flory, P.J., Statistical Mechanics of Chain Molecules. 1953, Ithaca, NY: Cornell University Press. 464.
93. Flory, P.J., Statistical Mechanics of Chain Molecules. 1969, New York: Wiley.
94. Zaman, M.H., et al., Investigations into sequence and conformational dependence of backbone entropy, inter-basin dynamics and the Flory isolated-pair hypothesis for peptides. J. Mol. Biol., 2003. 331(3): p. 693-711.
95. Jha, A.K., et al., Statistical coil model of the unfolded state: Resolving the reconciliation problem. Proc. Natl. Acad. Sci. U S A, 2005. 102(37): p. 13099-104.
96. Fang, Q. and D. Shortle, Protein refolding in silico with atom-based statistical potentials and conformational search using a simple genetic algorithm. J. Mol. Biol., 2006. 359(5): p. 1456-67.
97. Skolnick, J., et al., Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins, 2001. Suppl 5: p. 149-56.
98. Srinivasan, R. and G.D. Rose, LINUS: a hierarchic procedure to predict the fold of a protein. Proteins, 1995. 22(2): p. 81-99.
99. Lazaridis, T. and M. Karplus, Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol, 1999. 288(3): p. 477-87.
100. Kortemme, T., A.V. Morozov, and D. Baker, An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol., 2003. 326(4): p. 1239-59.
101. Rohl, C.A., et al., Protein structure prediction using Rosetta. Methods Enzymol, 2004. 383: p. 66-93.
102. Pande, V.S. and D.S. Rokhsar, Molecular dynamics simulations of unfolding and refolding of a beta-hairpin fragment of protein G. Proc Natl Acad Sci U S A, 1999. 96(16): p. 9062-7.
103. Shen, M.Y. and K.F. Freed, All-atom fast protein folding simulations: the villin headpiece. Proteins, 2002. 49(4): p. 439-45.
104. Mirny, L.A., V. Abkevich, and E.I. Shakhnovich, Universality and diversity of folding scenarios: a comprehensive analysis with the aid of a lattice model. Folding & Design, 1996. 1: p. 103-116.
105. Maisuradze, G.G., et al., Investigation of Protein Folding by Coarse-Grained Molecular Dynamics with the UNRES Force Field. J Phys Chem A. 2010: p. 18.
106. Abkevich, V.I., A.M. Gutin, and E.I. Shakhnovich, Free energy landscape for protein folding kinetics: Intermediates, traps, and multiple pathways in theory and lattice model simulations. J. Chem. Phys., 1994. 101: p. 6052-6062.
107. Baldwin, R.L., The pathway of protein folding. TIBS, 1978: p. 66-67.
108. Baldwin, R.L., Kinetic intermediates and the pathway of folding of ribonucleases A and S. Biomolecular structure, conformation, function and evolution., 1980: p. 87-95.
109. Baldwin, R.L. and T.E. Creighton, Recent experimental work on the pathway and mechanism of protein folding. Protein Folding, 1980: p. 217-259.
110. Baldwin, R.L., The nature of protein folding pathways: The classical versus the new view. J. Biomol. NMR, 1995. 5: p. 103-109.
111. Bai, Y. and S.W. Englander, Future directions in folding: the multi-state nature of protein structure. Proteins, 1996. 24(2): p. 145-51.
112. Maity, H., et al., Protein folding: The stepwise assembly of foldon units. Proc. Natl. Acad. Sci. U S A, 2005. 102(13): p. 4741-6.
113. Krantz, B.A., R.S. Dothager, and T.R. Sosnick, Discerning the structure and energy of multiple transition states in protein folding using psi-analysis. J. Mol. Biol., 2004. 337(2): p. 463-75.
114. Sosnick, T.R., et al., Characterizing the Protein Folding Transition State Using psi Analysis. Chem. Rev., 2006. 106(5): p. 1862-76.
115. Noe, F., et al., Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc Natl Acad Sci U S A, 2009. 106(45): p. 19011-6.
116. Voelz, V.A., et al., Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc, 2010. 132(5): p. 1526-8.
117. Freddolino, P.L., et al., Ten-microsecond molecular dynamics simulation of a fast-folding WW domain. Biophys J, 2008. 94(10): p. L75-7.
118. Freddolino, P.L., et al., Force field bias in protein folding simulations. Biophys J, 2009. 96(9): p. 3772-80.
119. Liwo, A., M. Khalili, and H.A. Scheraga, Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains. Proc. Natl. Acad. Sci. U S A, 2005. 102(7): p. 2362-7.
120. Srinivasan, R., P.J. Fleming, and G.D. Rose, Ab initio protein folding using LINUS. Methods Enzymol, 2004. 383: p. 48-66.
121. Eisenberg, D., et al., The design, synthesis, and crystallization of an alpha-helical peptide. Proteins, 1986. 1(1): p. 16-22.
122. Regan, L. and W.F. DeGrado, Characterization of a helical protein designed from first principles. Science, 1988. 241(4868): p. 976-8.
123. Dantas, G., et al., High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. J Mol Biol, 2007. 366(4): p. 1209-21.
124. Henikoff, S. and J.G. Henikoff, Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 1992. 89(22): p. 10915-9.
125. Kuhlman, B. and D. Baker, Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A, 2000. 97(19): p. 10383-8.
126. Harbury, P.B., B. Tidor, and P.S. Kim, Repacking protein cores with backbone freedom: structure prediction for coiled coils. Proceedings of the National Academy of Sciences of the United States of America, 1995. 92(18): p. 8408-12.
127. Hu, X., et al., Computer-based redesign of a beta sandwich protein suggests that extensive negative design is not required for de novo beta sheet design. Structure, 2008. 16(12): p. 1799-805.
128. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nuc. Acids Res., 1997. 25: p. 3389-3402.
129. Eramian, D., et al., A composite score for predicting errors in protein structure models. Protein Sci., 2006. 15(7): p. 1653-66.
130. Shen, M.Y. and A. Sali, Statistical potential for assessment and prediction of protein structures. Protein Sci, 2006. 15(11): p. 2507-24.
131. Doyle, A.C., The Sign of the Four. 1890.
132. Cheng, J., et al., SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W72-6.
133. Bryson, K., et al., Protein structure prediction servers at University College London. Nucleic Acids Res, 2005. 33(Web Server issue): p. W36-8.
134. Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G. et al., Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science, 2005. 308: p. 1149-1154.
135. Wang, G. and R.L. Dunbrack, Jr., PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W94-8.
136. Wang, G. and R.L. Dunbrack, Jr., PISCES: a protein sequence culling server. Bioinformatics, 2003. 19(12): p. 1589-91.
137. Zhou, H. and J. Skolnick, Protein structure prediction by pro-Sp3-TASSER. Biophys J, 2009. 96(6): p. 2119-27.
138. Baker, D. and A. Sali, Protein structure prediction and structural genomics. Science, 2001. 294(5540): p. 93-6.
139. Skolnick, J., J.S. Fetrow, and A. Kolinski, Structural genomics and its importance for gene function analysis. Nat Biotechnol, 2000. 18(3): p. 283-7.
140. Soding, J., Protein homology detection by HMM-HMM comparison. Bioinformatics, 2005. 21(7): p. 951-60.
141. Lockless, S.W. and R. Ranganathan, Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 1999. 286(5438): p. 295-9.
142. Suel, G.M., et al., Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol, 2003. 10(1): p. 59-69.
143. Russ, W.P. and R. Ranganathan, Knowledge-based potential functions in protein design. Curr Opin Struct Biol, 2002. 12(4): p. 447-52.
144. Lise, S., A. Walker-Taylor, and D.T. Jones, Docking protein domains in contact space. BMC Bioinformatics, 2006. 7(310): p. 310.
145. Zhang, Y. and J. Skolnick, SPICKER: a clustering approach to identify near-native protein folds. J Comput Chem, 2004. 25(6): p. 865-71.
146. McGuffin, L.J., Benchmarking consensus model quality assessment for protein fold recognition. BMC Bioinformatics, 2007. 8(345): p. 345.
147. Randall, A. and P. Baldi, SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs. BMC Struct Biol, 2008. 8(52): p. 52.
148. Zhou, H. and J. Skolnick, Protein model quality assessment prediction by combining fragment comparisons and a consensus C(alpha) contact potential. Proteins, 2008. 71(3): p. 1211-8.
149. Xu, J., et al., RAPTOR: optimal protein threading by linear programming. J. Bioinform. Comp. Biol., 2003. 1(1): p. 95-117.
150. Li, W. and A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006. 22(13): p. 1658-9.
151. Gong, H., P.J. Fleming, and G.D. Rose, Building native protein conformation from highly approximate backbone torsion angles. Proc Natl Acad Sci U S A, 2005. 102(45): p. 16227-32.
152. Hocky, G., Wilde, M., DeBartolo, J., Hategan, M., Foster, I., Sosnick, T.R., and Freed, K.F., Homology-free protein structure prediction through parallel scripting. Argonne Technical Report Preprint ANL/MCS-P1645-0609, 2009.
153. Wilde, M., I. Foster, K. Iskra, P. Beckman, Z. Zhang, A. Espinosa, M. Hategan, B. Clifford, and I. Raicu, Parallel scripting for applications at the petascale and beyond. IEEE Computer, 2009.
154. Cordes, M.H., A.R. Davidson, and R.T. Sauer, Sequence space, folding and protein design. Curr Opin Struct Biol, 1996. 6(1): p. 3-10.
155. Yue, K. and K.A. Dill, Inverse protein folding problem: designing polymer sequences. Proc Natl Acad Sci U S A, 1992. 89(9): p. 4163-7.
156. Desjarlais, J.R. and T.M. Handel, De novo design of the hydrophobic cores of proteins. Protein Sci, 1995. 4(10): p. 2006-18.
157. Lazar, G.A., J.R. Desjarlais, and T.M. Handel, De novo design of the hydrophobic core of ubiquitin. Protein Sci, 1997. 6(6): p. 1167-78.
158. Dahiyat, B.I., C.A. Sarisky, and S.L. Mayo, De novo protein design: towards fully automated sequence selection. J Mol Biol, 1997. 273(4): p. 789-96.