THE UNIVERSITY OF CHICAGO

NEW APPROACHES TO PROTEIN STRUCTURE

PREDICTION AND DESIGN

A DISSERTATION SUBMITTED TO

THE FACULTY OF THE DIVISION OF BIOLOGICAL SCIENCES

AND THE PRITZKER SCHOOL OF MEDICINE

IN CANDIDACY FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

DEPARTMENT OF BIOCHEMISTRY AND MOLECULAR BIOLOGY

BY

JOSEPH DEBARTOLO

CHICAGO, ILLINOIS

JUNE 2010

UMI Number: 3408517

Table of Contents

List of figures

List of tables

List of abbreviations

Acknowledgements

Chapter 1 Protein structure prediction methods

1.1 Introduction

1.2 Protein secondary and tertiary structure

1.3 The protein folding problem and structure prediction

1.4 Secondary structure prediction

1.5 Tertiary structure prediction

1.6 Incorporating folding pathways into structure prediction

1.7 Design of globular proteins

Chapter 2 Homology-free structure prediction and folding pathways

2.1 Introduction

2.2 Integration of 2° and 3° structure

2.3 Iterative fixing and trimer selection

2.4 Retaining lost side chain information

2.5 Structure prediction results

2.6 Folding pathways

2.7 Conclusions

2.8 Methods

Chapter 3 Using evolutionary diversity to enhance structure prediction

3.1 Introduction

3.2 Overview of the SPEED methods

3.3 SPEED-enhanced Ramachandran distributions

3.4 ItFix 2° structure

3.5 Energy functions

3.6 Improvement in 3° structure

3.7 Averaging the energy function across the MSA

3.8 Clustering

3.9 Confidence assessed from reproducibility

3.10 Performance in CASP8

3.11 Conclusions

3.12 Methods

Chapter 4 New methods for protein design

4.1 Introduction

4.2 Choice of ubiquitin as a design target

4.3 Design protocol

4.4 Negative design

4.5 Conclusions

Chapter 5 Structure prediction: Future directions and conclusions

5.1 Introduction

5.2 Enhancement of the conformational search

5.3 Algorithm-free smooth ItFix

5.4 New energy functions for folding and refinement

5.5 Conclusions

References

List of Figures

Figure Title

1.2.1 Hierarchy of protein structure

1.2.2 Backbone torsional preference and 2° structure

1.5.1 Nearest neighbor effects on backbone geometry

2.2.1 Inter-related themes of protein folding

2.2.2 The ItFix 2° and 3° structure prediction protocol

2.4.1 Orientation-dependence of statistical potential

2.4.2 Statistical potential energy profiles illustrating orientation dependence

2.5.1 2° and 3° structure prediction for low-homology targets

2.5.2 2° and 3° structure prediction for high-homology targets

2.5.3 ItFix algorithm mimics the experimentally-determined ubiquitin folding pathway

2.6.1 Progression of fixing structure for 1af7, 1b72A, 1di2, and 1r69

3.1.1 Structure prediction protocol

3.3.1 SPEED-enhanced Rama sampling distribution

3.3.2 Position-based comparison of homology-free and SPEED distributions

3.5.1 Radial protein structure terms

3.5.2 Effect of energy filtering on model accuracy

3.5.3 Reproducibility of the final model ensembles

3.6.1 Improvement in 3° structure prediction using SPEED

3.8.1 Comparison of contacts for the top clusters of several targets

3.9.1 Assessing global accuracy from reproducibility of the top cluster

3.9.2 Assessing local accuracy from reproducibility of top cluster

3.10.1 ItFix-SPEED blind predictions in CASP

4.1.1 Structures of distant ubiquitin family members

4.4.1 Designed Rama propensity

5.2.1 ItFix -structure prediction

5.3.1 The Smooth ItFix protocol

5.4.1 DOPE-PW-SPEED encodes higher specificity

List of Tables

Table Title

1 Comparison of stable and structured designed sequences

2 Component of DOPE-PW statistical potential

3 Homology-free ItFix performance on low-homology target set

4 Homology-free ItFix performance on high-homology target set

5 SPEED 2° structure prediction comparison

6 SPEED 3° structure prediction comparison

7 Radial terms for protein structure

List of Abbreviations

1° primary

2° secondary

3° tertiary

MD Molecular dynamics

ItFix IterativeFixing

SPEED Structure Prediction Enhanced with Evolutionary Diversity

MCSA Monte Carlo Simulated Annealing

Rama Ramachandran

PDB Protein Data Bank

MSA Multiple sequence alignment

Acknowledgements

Tobin Sosnick has been an incredible advisor and I cannot thank him enough for his

support, motivation and outstanding scientific insights that cross multiple disciplines. Karl Freed

has been an essential contributor to the development of my computational skills and I thank him

enormously for his patience in the early days. Andres Colubri, Abhishek Jha, James Fitzgerald

and Glen Hocky are all crazy geniuses who pointed me in the right direction at the right times

and I thank them immensely for never taking themselves too seriously. I thank Marc Parisien and

Esmael Haddadian for great conversations about computer science and protein folding. Chloe

Antoniou was instrumental towards getting my work started in the wetlab and I thank her for her

patience and for her truly genuine willingness to help other people succeed. Nothing in my life

would be possible without the devotion of my family, so I thank my parents, grandparents and

sisters with all of my heart. Finally, I thank my wife Jessica who carried me through the hardest

parts of this journey and to whom I will be forever grateful.

Chapter 1

Protein structure prediction methods

1.1 Introduction

The structure and function of proteins are interrelated concepts that are critical to

understanding disease pathways and the development of therapeutics. This connection has

motivated numerous theoretical and experimental attempts to determine how amino acid

sequences encode functionally active three-dimensional structures. Although many of the general

determinants of protein stability and specificity have been elucidated, our understanding of the

subject is not developed to the point that the high-resolution 3° structure of a protein can be

reliably predicted from amino acid sequence alone [1]. Nonetheless, accurate protein structure

prediction is a necessary objective due to the large and growing number of new sequences [2]

and the slow rate at which structures are being determined experimentally [3, 4].

The challenge of structure prediction, given the significant progress that has been

achieved in elucidating the details of the protein folding process [5, 6], has been to incorporate

this knowledge into a set of computational tools that can accurately and automatically predict

structure from sequence [7-9]. To achieve this goal, prediction methods are challenged to

incorporate sufficiently accurate force fields, efficient conformational search strategies, and the

appropriate level of representation of the chain. All of these factors are influenced by the desired

level of prediction resolution, which can range from the prediction of 2° structure content [10] to

high-accuracy atomistic models [11]. Indeed, some methods may forgo gains in accuracy in favor of

physical fidelity in order to gain new insights into the physical mechanism of protein folding [8,

12-15].

In parallel with the structure prediction effort, a large body of bioinformatics research has

been focused on clustering the various genomes into families of sequences which, when properly

aligned, can be represented by one structure [16-19]. This effort not only organizes the problem

of structure prediction into a reduced number of genomic targets [20], but also provides

evolutionary information that, when properly incorporated, can provide significant enhancements

to prediction accuracy [10, 21-23]. The objective of harnessing as much statistical

information as possible to achieve a high level of prediction accuracy and confidence

represents an approach orthogonal to physics-based structure prediction.

The next challenge is to invert a successful structure prediction method into a protein

design algorithm and determine which protein sequences are most likely to fold into a given

structure [24-26]. The principal motivation of protein design is to expand beyond the functional

repertoire of natural proteins and create new classes of enzymes for therapeutic and industrial

purposes. Protein design in practice generally consists of either the de novo redesign of the

amino acid sequence of an experimentally determined structure [27-35] or the de novo design of

novel structures and folds [36-39]. Both approaches incorporate the principal features of

structure prediction, such as hydrophobic burial [40-43], electrostatics [44-47], and local

conformational propensity [48-53], and methods may also use negative design to destabilize non-

native or unsought conformations [37, 38, 54] or to prevent the formation of amyloids [34, 55].

This chapter will start with a discussion of protein structure and the protein folding

problem and will continue with a summary of structure prediction methodologies. It will

conclude with a description of past achievements and the current status of protein design.

1.2 Protein secondary and tertiary structure

In a globular protein domain, structure is divided into three hierarchical levels (Fig.

1.2.1). The base level is primary (1°) structure, which consists of the one-dimensional sequence

of amino acids that traverses from the N-terminal to C-terminal end of the chain. The next level,

secondary (2°) structure, encompasses a description of the regular hydrogen bond networks

formed by the backbone of the polypeptide [56-58], and consists of three main types: α-helix, β-strand

and turn. Within each 2° structure type there is also an observed preference for specific

backbone torsion angles, exemplified in the Ramachandran (Rama) map [59] of each type (Fig.

1.2.2). For this reason, 2° structure can be defined using a combination of local and long-distance

hydrogen bonding and local backbone conformation [58]. The integration of these two

components is also discussed in relation to tertiary (3°) structure, the next level of protein

structure.
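The backbone torsion angles plotted in a Rama map can be computed directly from atomic coordinates. The following is a minimal sketch, assuming coordinates are supplied as (x, y, z) tuples in any consistent length unit; the function names `dihedral` and `phi_psi` are illustrative and do not come from the methods described in this thesis. The sign convention is the standard one in which a cis arrangement gives 0° and a trans arrangement gives 180°.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane of p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """Backbone torsions of one residue from five backbone atoms.

    phi rotates about the N-CA bond:  C(i-1), N(i), CA(i), C(i)
    psi rotates about the CA-C bond:  N(i), CA(i), C(i), N(i+1)
    """
    return dihedral(prev_c, n, ca, c), dihedral(n, ca, c, next_n)
```

Applying `phi_psi` to every residue of a set of structures and binning the resulting pairs by 2° structure type would give maps of the kind shown in Fig. 1.2.2.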

Protein 3° structure is characterized in multiple ways (Fig. 1.2.1). One of the most

common is a diagram of the topological organization of the units of 2° structure [60]. A higher

resolution depiction of 3° structure is the two-dimensional contact matrix between atoms in the

chain, where contacts between residues that are widely separated in sequence are the furthest from

the diagonal (Fig. 1.2.1). Topology diagrams and contact matrices are useful ways of visualizing

3° structure, but the most detailed representation of 3° structure is an atomic resolution model

(Fig. 1.2.1), which provides the packing arrangements between all atoms.
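A contact matrix of the kind described above can be sketched in a few lines. This example assumes one representative atom per residue and a distance cutoff of 8 Å; both choices, along with the function names, are illustrative assumptions rather than the definitions used for Fig. 1.2.1.

```python
import math


def contact_matrix(coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within `cutoff` of each other.

    `coords` holds one (x, y, z) tuple per residue, in chain order.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
            for i in range(n)]


def long_range_contacts(matrix, min_separation):
    """Contacts (i, j), i < j, at least `min_separation` apart in sequence.

    These are the entries that sit furthest from the matrix diagonal.
    """
    n = len(matrix)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if matrix[i][j]]


# Four hypothetical residue positions (in angstroms): the chain turns back
# so that the last residue is close to the first.
example_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 6.0, 0.0)]
example_matrix = contact_matrix(example_coords)
example_long_range = long_range_contacts(example_matrix, min_separation=3)
```

The matrix is symmetric with a True diagonal, and the sequence separation of a contact is simply its distance from that diagonal.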

Protein 2° and 3° structure are overlapping concepts from many perspectives. For

example, the hydrogen bonding network that defines a β-strand can occur over a large separation

in sequence, making the β-strand 2° structure largely defined by 3° contacts [58]. On the other

hand, the structure of α-helices is defined by a sequence-local network of hydrogen bonds [58].

Even with a structural definition devoid of 3° contacts, however, the α-helical 2° structure

displays an amphipathic polarization of hydrophobic and hydrophilic residues such that the

former are packed in the globular core and the latter are solvent accessible. The analogous

amphipathic characteristic of β-strands is an alternating placement of residues on opposing sides

of the chain, which allows both hydrophobic and hydrophilic faces. The amphipathic nature of

both α-helices and β-strands is accommodated by the backbone torsion angle propensity of each

2° structure type (Fig. 1.2.2). This coupling is particularly evident in β-strands, where the extended

Rama distribution reflects the alternating polarization of side chains.

Figure 1.2.1 Hierarchy of protein structure. The three levels of structure in a protein domain consist of 1°, 2° and 3° structure. The 1° structure of a protein is its amino acid sequence. The three main types of 2° structure are α-helix, β-strand and coil. There are multiple ways to describe 3° structure, including topology (left), contact matrices (center), and atomistic models (right).

Figure 1.2.2 Backbone torsional preference and 2° structure. The φ (N to Cα rotation axis) and ψ (Cα to C rotation axis) backbone torsion angles of residues in a non-redundant version of the PDB are plotted in Ramachandran maps. The three main 2° structure types describe unique local conformational geometries, and therefore exhibit unique Ramachandran maps.

The preceding examples confirm that 2° and 3° structure are integrated and codependent

concepts and must be treated as such when included in structure prediction.

1.3 The protein folding problem and structure prediction

The known attributes that characterize native protein structures are summarized by the

forces that contribute to protein folding [61]. The most predominant of these forces include the

exclusion of the side chains of hydrophobic residues from solvent [40-43], electrostatic

interaction between the side chains of charged amino acids [44-47] and hydrogen bonding [62,

63]. There is also a distinct local backbone conformational preference for the protein chain

given the local amino acid identity [48-53]. This can be observed in the Rama distribution of the

amino acid leucine, which strongly prefers the α-helical conformation when neighbored by

alanines, but prefers an extended β-strand conformation when neighbored by the β-branched side

chain of valine (Fig. 1.5.1). Additionally, it has been suggested that some part of a natural amino

acid sequence may be negatively designed to destabilize non-native or unsought conformations

[37, 38, 54] or to prevent the formation of amyloids [34, 55].

The long list of phenomena involved in protein folding suggests that a complex interplay

between multiple forces causes a protein to adopt a unique conformation. Understanding this, the

protein folding problem seeks to quantify the relative importance of each force and the physical

connections between them throughout the pathway from unfolded to folded protein. The

challenge of structure prediction is to encode all of these attributes into a single algorithm that

predicts protein structure from amino acid sequence. Given the marginal stability of protein

structures and the cooperativity of protein folding, it would seem that every component involved

in folding is interrelated and important [12, 64], so the challenge of structure prediction is to

incorporate each component with an appropriate weighting. The manner in which structure prediction

methods approach this challenge, as described below, is strikingly varied.

1.4 Secondary structure prediction

The discovery that the different amino acids have varying probabilities for the three 2°

structure types allowed Chou and Fasman to use a very early PDB to predict 2° structure within

the categories of helix, strand and coil at around 60% accuracy [48, 65, 66]. The essence of their

algorithm is a search for a contiguous region of the sequence that contains a high probability of a

2° structure type followed by the expansion of the region until the ends are detected by a drop

below a probability threshold.
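The nucleate-and-extend idea described above can be sketched as follows. This is a simplified illustration and not the published Chou-Fasman procedure: the propensity values are hypothetical placeholders rather than the original parameters, and the window size and both thresholds are assumptions chosen for the example.

```python
# Hypothetical helix propensities, for illustration only. Values above 1.0
# favor the helix; values below 1.0 disfavor it.
HELIX_PROPENSITY = {
    "A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.1,
    "V": 1.0, "S": 0.8, "N": 0.7, "G": 0.6, "P": 0.6,
}


def _window_mean(scores, start, end):
    return sum(scores[start:end]) / (end - start)


def predict_segments(sequence, propensity, window=4, nucleate=1.2, extend=1.0):
    """Return half-open (start, end) index ranges predicted for one 2° type.

    A segment is nucleated where the mean propensity of a window reaches
    `nucleate`. It is then grown in both directions until the mean of the
    terminal window drops below `extend`.
    """
    scores = [propensity.get(aa, 1.0) for aa in sequence]  # unknown residues are neutral
    segments = []
    i = 0
    while i + window <= len(scores):
        if _window_mean(scores, i, i + window) < nucleate:
            i += 1
            continue
        start, end = i, i + window
        floor = segments[-1][1] if segments else 0  # never grow into an earlier segment
        while start > floor and _window_mean(scores, start - 1, start - 1 + window) >= extend:
            start -= 1
        while end < len(scores) and _window_mean(scores, end - window + 1, end + 1) >= extend:
            end += 1
        segments.append((start, end))
        i = end
    return segments


helix_segments = predict_segments("GPGSAELMAEKLGPGS", HELIX_PROPENSITY)
no_segments = predict_segments("GGGGGGGG", HELIX_PROPENSITY)
```

A full predictor would run the same search for each 2° structure type with its own propensity table and then resolve overlapping assignments, a step omitted here.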

Subsequent methods greatly enhanced 2° structure prediction accuracy by incorporating

sequence profiles and machine learning techniques [10, 67, 68]. Sequence profiles use a multiple

sequence alignment to generate a distribution of amino acid types at each position as opposed to

a singular amino acid identity [69]. This enhancement in information consequently increases the

number of available parameters, hence requiring machine learning to optimize the parameters of

algorithms that use sequence profiling [70]. The combination of these methods has proven

hugely successful when incorporated into Chou-Fasman style algorithms, allowing methods to

achieve accuracies close to 80-90% [10, 67, 68].
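To make the notion of a sequence profile concrete, the sketch below converts a small multiple sequence alignment into per-position amino acid frequencies. The gap handling, the absence of sequence weighting, and the function name are simplifying assumptions; production profile methods typically also apply pseudocounts and weight sequences by redundancy.

```python
from collections import Counter


def sequence_profile(msa, gap="-"):
    """Per-column amino acid frequencies for a list of aligned sequences.

    Gap characters are ignored, so each column's frequencies sum to 1
    over the residues actually observed at that position.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa if seq[col] != gap)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# A hypothetical four-sequence alignment of three columns.
example_profile = sequence_profile(["ACD", "ACE", "A-E", "GCE"])
```

Each position is thus described by a distribution over amino acid types instead of a single identity, which is the additional information that profile-based predictors exploit.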

The remaining deficit in the accuracy of 2° structure prediction [71] can be attributed to a

low propensity for the native 2° structure at some positions due to functional constraints and, most

importantly, 3° structure context [72-74]. As a result, any complete 2° structure prediction method must

recognize that the formation of 2° and 3° structure is a coupled process [75] and must

appropriately incorporate 3° structure context [8].

1.5 Tertiary structure prediction

Protein 3° structure prediction methods can be divided into template-based homology-

modeling methods and template-free de novo methods. Template-based methods [22, 76-80]

search for very closely related sequences in the Protein Data Bank (PDB) and use those

experimental models as templates to align to the target sequence. In cases where a target

sequence contains no sequence homologues in the PDB, template-based methods are

unsuccessful and template-free methods which assemble the protein chain de novo are necessary

[8, 9, 15, 80-83]. Template-free methods, which are the subject of this thesis, are limited by the

enormous conformational space accessible to each protein chain [84], and such methods seek to

narrow the conformational search with approaches ranging from sampling large PDB fragments [21,

80] to simulating the folding process using all-atom molecular dynamics (MD) simulations [9, 15].

Structure prediction methods must be capable of recognizing good conformations, which

necessitates accurate protein energy functions. Many notable protein energy functions are force

fields applied in MD simulations, which seek to capture the most physically realistic spatial and

temporal relationships between each atom in the molecule [85-87]. Such energy functions calculate

some or all known protein force field components, which include those associated with bond

lengths and angles, backbone and side chain torsions, electrostatic interactions, van der Waals

interactions and dipolar interactions related to hydrogen bonding. The performance of these force

fields is difficult to determine and depends on the implementation within the application of

interest.

The most useful energy functions for direct de novo prediction of protein structure

include terms that are derived from the statistics of the PDB (i.e. the observed distances between

atoms in experimentally determined structures) [53, 88-91]. One motivation for the use of

statistical potentials is that the forces behind protein folding may be difficult to explicitly

calculate, but may be encoded implicitly in the empirical statistics of the PDB. While it is

difficult to directly compare the performance of statistical potentials and traditional molecular

mechanics force fields, the latter are rarely used for structure prediction, and at least one study

suggests that statistical potentials are superior at the refinement of near-native protein models

[91].
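One common way to turn observed distance statistics into an energy is the inverse Boltzmann relation, E(r) = -kT ln[P_obs(r) / P_ref(r)]. The sketch below applies it to binned distance counts. It is a generic illustration and not the specific potentials cited above; the choice of reference counts, the pseudocount, and the function name are all assumptions.

```python
import math


def statistical_potential(observed_counts, reference_counts, kT=1.0, pseudocount=1.0):
    """Per-bin energies from the inverse Boltzmann relation.

    observed_counts:  how often an atom pair is seen in each distance bin
    reference_counts: expected counts for the same bins under a reference state
    A pseudocount keeps empty bins from producing log(0).
    """
    n_bins = len(observed_counts)
    if len(reference_counts) != n_bins:
        raise ValueError("observed and reference histograms must share bins")
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Hypothetical histograms: the first bin is depleted relative to the
# reference state and the second bin is enriched.
example_energies = statistical_potential([10, 40, 50], [30, 30, 40])
flat_energies = statistical_potential([5, 5], [10, 10])
```

Distances seen more often than the reference state predicts receive negative (favorable) energies, and depleted distances receive positive ones, which is how the forces behind folding can be encoded implicitly in the empirical statistics.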

While an accurate protein energy function is important, an efficient search strategy is

crucial due to the enormity of the conformational space [84]. As such, any 3° structure prediction method

should incorporate features that reduce the search space in a way that appropriately reflects the

real behavior of proteins in solution, which is governed by the excluded volume of side chains

and different regions of the chain [92, 93]. In essence, a protein in solution will exhibit local

conformational preferences that depend on the local amino acid sequence [52, 94] and long

distance chain interactions [95]. The twenty amino acid types by themselves have distinct

conformational preferences [48], which become even more specific when the amino acid

identities of nearest-neighbors are taken into account [52, 95] (Fig. 1.5.1). Thus, the 1° sequence

provides the first level of reduction of the conformational search, and the implementation of

this reduction within a structure prediction algorithm varies among methods.

Fragment-based assembly methods impart the native crystal structure local conformation

to any sampled region [21, 96, 97] and are successful when matching or homologous PDB

fragments are available for every position in the chain. But they are at a disadvantage when no

such fragments are available. Ab initio methods use an all-atom representation of the chain to

explicitly encode the local conformational bias, which enjoys the advantage of physical realism

but suffers due to computational cost [9, 15]. Other methods choose to directly constrain the

conformational search based on predicted 2° structure, either through iterative homology-free

folding [8, 98] or through homology-based biases [9].

Energy functions and conformational search strategies have been combined in a

large and growing number of structure prediction methods, which are too numerous to list in detail.

One of the most prominent examples of a de novo fragment assembly method is Rosetta [80],

which has been successfully applied in the Critical Assessment of Structure Prediction

experiment CASP [1]. The Rosetta method finds large (up to nine residues) homologous

fragments of structure in the PDB and assembles them using a Monte Carlo algorithm and an

energy function that consists of a Lennard-Jones potential, a Lazaridis-Karplus implicit solvent

model [99], and an orientation-dependent statistical hydrogen bond potential [100]. Large-scale

sampling of sequence homologues and all-atom refinement has allowed the Rosetta method to

achieve high-accuracy predictions on some small globular proteins [21]. Rosetta can be limited,

however, by a dearth of large PDB fragments available for sampling [21, 101], which highlights

drawbacks of fragment-based assembly methods.
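The Monte Carlo assembly step mentioned above rests on the Metropolis acceptance rule, which can be sketched on a toy problem. This is not Rosetta's implementation: the energy function, the move set, the temperature, and all names below are hypothetical stand-ins chosen so that the example is self-contained.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill moves with Boltzmann probability."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def monte_carlo_search(initial, energy, propose, steps=2000, temperature=1.0, seed=0):
    """Generic Metropolis search that remembers the lowest-energy state visited."""
    rng = random.Random(seed)
    current, e_current = initial, energy(initial)
    best, e_best = current, e_current
    for _ in range(steps):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


# Toy stand-in for a conformation: a list of torsion angles in degrees.
TARGET_ANGLES = [-60.0, -45.0, -60.0, -45.0]


def toy_energy(angles):
    # Hypothetical energy: squared deviation from an arbitrary target geometry.
    return sum((a - t) ** 2 for a, t in zip(angles, TARGET_ANGLES)) / 100.0


def perturb_one_angle(angles, rng):
    # Hypothetical move: nudge one randomly chosen angle.
    new_angles = list(angles)
    index = rng.randrange(len(new_angles))
    new_angles[index] += rng.uniform(-15.0, 15.0)
    return new_angles


start_angles = [180.0, 180.0, 180.0, 180.0]
best_angles, best_energy = monte_carlo_search(start_angles, toy_energy, perturb_one_angle)
```

In a fragment assembly method the move would instead replace the torsion angles of a short stretch of the chain with those of a PDB fragment, and simulated annealing would lower `temperature` over the course of the run.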

Figure 1.5.1 Nearest neighbor effects on backbone geometry. The nearest N-terminal and C-terminal neighbors have dramatic effects on the Ramachandran maps of residues. Leucine has a strong preference for an α-helical geometry when neighbored by alanines, yet prefers a β-strand geometry when neighbored by the β-branched side chains of valine.

Other de novo prediction methods utilize alternative combinations of conformational

search strategies and energy functions. CRFolder samples conformations constructed with

machine learning techniques, and uses Monte Carlo to minimize a simple statistical potential

[83]. Another example samples large PDB fragments with a genetic algorithm and a statistical

potential trained extensively on large decoy sets [96]. Another method uses Replica Exchange

Monte Carlo methods to minimize a coarse-grained potential and includes sequence profile

information to constrain an empirical hydrogen bonding term [9].

Prediction methods also vary by the level of representation of the polypeptide chain and

water. MD simulations with explicit water, which are the most detailed representations, can take

prohibitively long for all but the smallest proteins and fragments, and often start with a partially

folded chain [102]. Simulation time decreases with an implicit representation of water, even if all

atoms are retained [15, 103], and further “coarse-graining” is obtained by MD methods that

feature implicit water and a simplified lattice model of the chain [104, 105]. Methods that

sample the chain with Monte Carlo methods can use representations that range from all-atom

with implicit solvent [21], to a complete backbone and side chain Cβ model with no solvent [8,

53], to a reduced-atom model of the backbone with no side chain or water information

[83].

1.6 Incorporating folding pathways into structure prediction

The enormous conformational search space available to the polypeptide chain [84]

suggests that proteins fold along a pathway [106-110]. Additionally, native-state hydrogen

exchange experiments show that subunits of 3° structure, called foldons, form cooperatively

[111] and may sequentially assemble with each other along a folding pathway [6, 112-114].

Considering the evidence that proteins fold along pathways, structure prediction methods may take

advantage of this by incorporating pathways. Conversely, it is also possible that pathway information

can be extracted from folding simulations in order to better understand the physical mechanism

of protein folding.

Several de novo methods take exactly that kind of physics-based mechanistic approach by

not sampling PDB fragments and excluding any homology-based information [15, 98]. Pande

and coworkers ran all-atom molecular dynamics simulations with explicit solvent

on a β-hairpin of protein G [102]. Given the time constraints of this approach, the

investigators started from the unfolding transition state of the structure, which is significantly

structured, and this choice diminishes the contributions from earlier parts of the folding pathway. Other

groups attempt to assemble pathways using Markov models on short MD simulations that do not

reach the folded state [115, 116]. MD folding of short proteins beyond their sub-microsecond

folding timescales has produced non-native conformations [117], which has been attributed to

force field bias [118].

Dill and coworkers present a model where folding proceeds through the “zipping” of

local regions of the chain followed by the assembly of these local structures into the global fold

[15]. To support this model the authors apply all-atom molecular dynamics simulations, which

are the most physical tools available, but simulate only small segments of the chain which are

subsequently assembled into a full length structure. It is unclear how exclusion of the context of

the entire chain during the formation of local structure reduces the effects of interactions that are

separated by long distances in sequence. For proteins with complex topologies and an extensive

number of sequence non-local contacts (e.g. with contacting N- and C-termini), this approach is

likely to run into problems. As a result, the elements of 2° structure that are most influenced by

3° structure context, such as the unpredicted C-terminal strand of 1ubq, cannot be factored into the

prediction using this method.

Other pathway-based prediction methods feature full 2° and 3° structure

integration and represent the entire chain starting from an unstructured conformation [9, 98, 119,

120], but sample large fragments or include homology-based information, which compromises

their ability to extract pathway information. Thus, what is required for pathway-based prediction

is a homology-free method that does not sample fragments and starts from a completely random

chain. My approach to this challenge is discussed in Chapter 2 [8].

1.7 Design of globular proteins

The design of globular proteins has made substantial progress over the last few decades

[24]. In one of the earliest design efforts, DeGrado and coworkers synthesized and crystallized

alpha-helical peptides that assembled into a tetrameric four-helix bundle as isolated fragments

[121]. This work was followed by the de novo design of a stable and structured globular four-

helix bundle [122].

A decade later Mayo and coworkers experimentally confirmed the first de novo redesign

of a previously determined protein structure [35]. In that study, the investigators used a flexible

backbone design technique to generate a new sequence for a simple zinc finger motif, with the

design having low sequence identity (21%) to the wild-type sequence. The significance of this

result is underscored by the fact that many designed sequences only involve a small number of

deviations from the wild-type sequence [27-31].

The Baker group subsequently used the tools of their Rosetta structure prediction package

[80] to redesign a diverse set of crystal structures, which produced several stable and structured

proteins [32]. The redesigned procarboxypeptidase from that study was highly stabilized relative

to the wild-type sequence, and the crystal structure of the design was found to be nearly identical

to the structure of the wild-type sequence [123]. This group also designed Top7, the first protein

with a fold not yet observed in nature [39].

A comparison of designed sequences (globular, stable and structured) and the wild-type

sequences of their respective experimentally determined structures shows that the sequence

identity (i.e. exact sequence matches) is consistently greater than 25% (Table 1). The sequence

similarity, defined as matches according to a BLOSUM62 matrix [124], is on average greater

than 50% (Table 1). This high similarity may be attributed to the possibility that natural

sequences are highly optimized for their native structures and effective design algorithms are

forced to retain much of the natural sequence [125]. Another possibility is that design methods

can focus too narrowly on native structural constraints, even though sequence substitutions may

require backbone relaxation [33, 126]. Chapter 4 of this thesis details my effort to generate a

designed sequence with much lower similarity to any naturally occurring sequence.
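
The identity and similarity percentages compared in this section can be illustrated with a short sketch. It assumes that the two sequences are already aligned to equal length, and the `is_similar` callback is a hypothetical stand-in for a BLOSUM62 lookup (for instance, a positive substitution score); neither detail is specified in the text above.

```python
from typing import Callable


def _check_aligned(seq_a: str, seq_b: str) -> None:
    """Require non-empty, equal-length (pre-aligned) sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and pre-aligned")


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions with exactly matching residues."""
    _check_aligned(seq_a, seq_b)
    # Gap characters ('-') are never counted as matches (an assumption).
    matches = sum(a == b and a != "-" for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def percent_similarity(
    seq_a: str, seq_b: str, is_similar: Callable[[str, str], bool]
) -> float:
    """Percentage of positions that are identical or judged similar.

    `is_similar` is a hypothetical callback; in practice it could wrap a
    substitution-matrix lookup.
    """
    _check_aligned(seq_a, seq_b)
    hits = sum(
        a != "-" and b != "-" and (a == b or is_similar(a, b))
        for a, b in zip(seq_a, seq_b)
    )
    return 100.0 * hits / len(seq_a)
```

By construction the similarity percentage is never lower than the identity percentage, which matches the pattern of the paired values in Table 1.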

Table 1 Comparison of stable and structured designed sequences

design             length  target PDB ID  wt % id¹ (wt % sim²)  top % id³ (top % sim⁴)  top-wt % id⁵ (top-wt % sim⁶)
protein L1 [32]    62      1hz5           35 (61)               50 (62)                 73 (86)
protein L2 [32]    62      1hz5           45 (60)               45 (60)                 73 (86)
ACP [32]           98      2acy           41 (54)               39 (57)                 67 (69)
PCP [32]           70      1aye           31 (56)               33 (56)                 73 (84)
S6 [32]            94      1ris           26 (43)               32 (46)                 33 (52)
U1A [32]           96      1urn           32 (57)               33 (57)                 97 (100)
FKB [32]           107     1fkb           42 (59)               44 (62)                 96 (96)
zinc-finger [35]   28      1zaa           21 (38)               N/A                     N/A
tenascin [127]     89      1ten           42 (64)               42 (64)                 100 (100)
ubiquitin⁷         72      1ubq           8 (38)                25 (47)                 14 (32)

¹ Exact amino acid matches of design to wild-type sequence.
² Exact amino acid matches and BLOSUM62 substitution matrix matches of design to wild-type sequence.
³ Exact amino acid matches of design to highest ranking natural sequence of PSIBLAST of design.
⁴ Exact amino acid matches and BLOSUM62 substitution matrix matches of design to highest ranking natural sequence of PSIBLAST of design.
⁵ Exact amino acid matches of wild-type sequence to highest ranking natural sequence of PSIBLAST of design.
⁶ Exact amino acid matches and BLOSUM62 substitution matrix matches of wild-type to highest ranking natural sequence of PSIBLAST of design.
⁷ Computational redesign described in Chapter 4.

Chapter 2

Homology-free structure prediction and folding pathways

Parts of this chapter are published in DeBartolo et al. Mimicking the folding pathway to

improve homology-free protein structure prediction. Proc. Natl. Acad. Sci. USA (2009) vol. 106

(10) pp. 3734-9 and in the accompanying supplementary materials. Additional sections have

been added and the text has been updated accordingly. I acknowledge and thank Andres Colubri,

Abhishek Jha, and James Fitzgerald for advice on the coding of simulations and energy

functions. I also thank G. Rose, D. Shortle, J. Xu, G. Hockey, H. Gong, and members of the

Sosnick and Freed labs for helpful discussions.

Since the demonstration that a protein’s sequence encodes its structure, the prediction of

structure from sequence has remained an outstanding problem that impacts numerous scientific

disciplines including many genome projects. By iteratively fixing secondary structure

assignments of residues during Monte Carlo simulations of folding, a coarse grained model

without information concerning homology or explicit side chains can outperform current

homology-based secondary structure prediction methods. The computationally rapid algorithm

using only single (φ, ψ) dihedral angle moves also generates tertiary structures of comparable

accuracy to existing all-atom methods for many small proteins, particularly ones with low

homology. Hence, given appropriate search strategies and scoring functions, reduced

representations can be used for accurately predicting secondary structure as well as providing

three-dimensional structures, thereby increasing the size of proteins approachable by homology-

free methods and the accuracy of template methods whose accuracy depends on the quality of

the input secondary structure.

2.1 Introduction

The protein folding process is integral to multiple cellular processes, and errors can result

in amyloidogenic diseases. A protein’s structure affords a window on its function, and the huge

growth in the number of sequenced genomes provides codes for an enormous number of new

proteins with unknown functions [2], a number far exceeding experimental capabilities and

requiring fast throughput theoretical methods for deducing protein structure from sequence. To

this end, great progress in predicting structure has emerged using homology-based methods [76].

However, predicting structure and pathways beginning only from the

sequence remains an elusive goal. Furthermore, methods for 2° and 3° structure prediction, while

often quite accurate, can fail for lack of sufficient sequences that are homologous to the target

sequence. Even if a multiple sequence alignment (MSA) exists, e.g., as generated using PSI-

BLAST [128], the alignment may diminish any structural propensity that is specific to the target

sequence (in its 3° context) in favor of the consensus of the alignment. This disadvantage can

adversely affect 3° structure prediction because the homology-based 2° structure prediction and

MSA generally serve as crucial inputs.

The reliance on homology also precludes identifying the underlying physicochemical

principles that govern protein folding, including determining the minimal information and model

of protein structure that are required for accurate structure prediction. This inadequacy arises

from the failure of many 2° structure prediction methods [10, 68] to explicitly incorporate 3°

context. Context dependence can overrule local biases [72-74], and its neglect has limited 2°

structure accuracy to about 80% for decades [71]. Previous attempts to improve 2° structure

predictions by including 3° structure predictions achieve limited success [75], perhaps because of

a reliance on sequence homology.

We present a homology-free strategy using a Cβ-level representation in which 2° and 3°

structure predictions emerge as an integral component of the folding process.

Consequently, our strategy may share some benefits that authentic proteins gain by

folding along a robust and efficient pathway. While others have integrated 2° and 3° structure

determination [9, 119] with an iterative fixing (ItFix) of 2° structure [15, 98, 120], our approach

differs by i) not using any exogenous 2° structure prediction or homology, ii) by removing side

chain degrees of freedom from the model, which greatly reduces computation time, and iii) by

allowing the whole chain to interact throughout the entire folding process. Furthermore, our

moves involve changes only in a single pair of (φ, ψ) dihedral angles that is obtained from the

PDB and that includes the influence of the neighboring residues’ identity and 2° structure. Our

results demonstrate that models lacking explicit side chains or information from homology can

be as accurate while requiring orders of magnitude less computing time. In addition, information

about folding pathways can be extracted from the simulations.

2.2 Integration of 2° and 3° structure.

Our ItFix algorithm focuses on three fundamental protein properties: the sequence-dependent

backbone torsional angle preferences, the backbone hydrogen bonding requirements,

and the different chemical properties and packing preferences of the twenty amino acid side

chains (Fig. 2.2.1). Since each factor strongly influences the other two, a major challenge lies in

simultaneously including all three factors with appropriate weights into a folding algorithm.

Because the model retains the backbone heavy atoms and the side chain Cβ atoms [53, 89,

129, 130], the 2N backbone (φ, ψ) dihedral angles are the major degrees of freedom for a chain of

N residues in our treatment (cis conformers are occasionally allowed, see Methods).

The neglect of side groups raises questions of how the model describes packing and

individual residue preferences. As demonstrated below, our Monte Carlo Simulated Annealing

algorithm (MCSA), using a statistical potential (StatPot) [53, 89, 129, 130] and an increasingly

restrictive PDB-based move set (Fig. 2.2.2), recaptures the requisite side chain information and

performs remarkably well in the prediction of 2° and 3° structure without invoking homology

information.

2.3 Iterative fixing and trimer selection.

A critical aspect of the algorithm is the selection of a single (φ, ψ) dihedral angle pair from an

increasingly refined library of amino acid trimers, similar in spirit to earlier studies [15, 98, 120].

During the initial round of the simulations, trimer selection is conditional only on the amino acid

identity of the three residues (Fig. 2.2.2). Trimer selection in subsequent rounds depends on the

2° structure type at each position that is identified from the previous round by the prescriptions

described in Methods. The specification of 2° structure is enabled because each trimer in the

trimer library is labeled by the 2° structure assignments for each of the three residues in the

original PDB structure in which they originate using the Dictionary of Protein Secondary

Structure (DSSP) definition [58]. The frequencies of occurrence for each originating 2° structure

type, H(elix), E(xtended), or C(oil), are calculated from the last inserted trimer at each position

in the 200-300 final structures emerging from each round of folding.

Figure 2.2.1 Inter-related themes of protein folding. Protein backbone motions, 2° structure and hydrogen bonding, and side chain packing are the necessary components of any folding model. 2° and 3° structure formation are coupled processes whereby the formation of each type of structure influences the formation of the other type.

Figure 2.2.2 The ItFix 2° and 3° structure prediction protocol. At the end of each round, the 2° structure frequencies are used to eliminate H, E or C when they fall below specified thresholds.

Following Sherlock Holmes’ deductive strategy to “Eliminate all other factors, and the one

which remains must be the truth.” [131], if the frequency of occurrence for a particular 2°

structure type falls below a ~5-10% threshold at a given position or across a contiguous stretch

of sequence (see Methods), any trimer inconsistent with that 2° structure is removed from the

trimer library used in subsequent folding rounds. The process continues until no additional

positions can be further restricted. After the last round, the lowest energy and best 3° structures

are identified, while the 2° structure prediction is obtained from the frequencies of appearance of

H, E and C in all final structures (see Methods).
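
The elimination step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it applies only the per-position rule (the contiguous-stretch rule referred to in Methods is not reproduced), and the names `ss_frequencies` and `restrict_allowed` as well as the default 5% threshold are assumptions made for the example.

```python
from collections import Counter

SS_TYPES = ("H", "E", "C")


def ss_frequencies(final_assignments):
    """Per-position H/E/C frequencies over the final structures of a round.

    `final_assignments` holds one equal-length string over H/E/C per
    final structure (hypothetical input format).
    """
    n_structures = len(final_assignments)
    length = len(final_assignments[0])
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in final_assignments)
        freqs.append({t: counts.get(t, 0) / n_structures for t in SS_TYPES})
    return freqs


def restrict_allowed(allowed, freqs, threshold=0.05):
    """Drop any type whose frequency falls below `threshold` at a position."""
    restricted = []
    for options, f in zip(allowed, freqs):
        kept = {t for t in options if f[t] >= threshold}
        if not kept:
            # Safety net (an assumption): never leave a position with no
            # option; keep the most frequent remaining type.
            kept = {max(options, key=lambda t: f[t])}
        restricted.append(kept)
    return restricted
```

Repeating these two calls after each round, and stopping when `restrict_allowed` returns its input unchanged, mirrors the stopping condition stated above.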

Our MCSA algorithm is designed to resemble a true folding pathway. Each round in the

ItFix process begins from a configuration devoid of any 3° structure, rather than a collapsed

structure generated from a previous round. Consequently, the chains execute a new global search

each round. The backbone geometry is simulated by replacing only one of the three (φ, ψ) pairs of

dihedral angles at a randomly chosen position with those from the equivalent position in a

trimer selected from the trimer library. In principle, all-atom simulations for tripeptides could be

used, but the accuracy of current methods makes this approach less reliable [94]. The starting

chain is built using angles from trimers specified solely by the amino acid sequence. The trimer

library becomes increasingly conditional on 2° structure type as the rounds proceed. Each round

of ItFix consists of 200-300 individual folding trajectories. Each trajectory involves a global

search guided by (φ, ψ) insertion moves, a Metropolis acceptance criterion, and a StatPot as the

scoring function. The trajectory ends when the collapsed structure cannot undergo additional

moves. The end result of the iterative rounds is a folding-enhanced 2° structure prediction that

emerges simultaneously with an ensemble of 3o structures.
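
A single move of such a trajectory can be sketched as below. The sketch makes several simplifying assumptions that are not taken from the text: the chain is a list of (φ, ψ) tuples, `library` is a hypothetical mapping from position to candidate (φ, ψ) pairs (standing in for the trimer library), `energy` is any scoring callable (standing in for the StatPot), and the temperature is supplied by the caller because the annealing schedule is not given here.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Standard Metropolis criterion: always accept downhill moves,
    accept uphill moves with probability exp(-delta_e / temperature)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def mc_step(angles, library, energy, temperature, rng=random):
    """Propose replacing one (phi, psi) pair and accept or reject it.

    Returns the new chain if accepted, otherwise the unchanged input.
    """
    position = rng.randrange(len(angles))
    trial = list(angles)  # copy so the input chain is not mutated
    trial[position] = rng.choice(library[position])
    delta_e = energy(trial) - energy(angles)
    return trial if metropolis_accept(delta_e, temperature, rng) else angles
```

Calling `mc_step` repeatedly while lowering `temperature` gives a simulated annealing trajectory in the generic sense; passing a seeded `random.Random` instance as `rng` makes a run reproducible.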

2.4 Retaining lost side chain information.

The retention of the side chain information lost by the use of the Cβ-level representation

poses a serious challenge. Central to this goal is our (φ, ψ) dihedral angle sampling procedure that

is conditional on both the chemical identity and the increasingly refined 2° structure specificity

for each position and its neighboring residues. The backbone (φ, ψ) dihedral angles are strongly

correlated with the side chain rotamer angles and both the neighboring residues’ side chain

identities and conformations [52, 53, 95]. Hence, even without explicitly depicting the side chain

atoms, much of their influence is retained by choosing (φ, ψ) values using our conditional trimer

selection strategy.

In addition to retaining the interplay of side chain and backbone interactions, our

algorithm focuses on optimizing 3° interactions. The 3° interaction energies are obtained from the

StatPot “DOPE-Cβ” [53, 89] derived from an all-atom pairwise additive StatPot [88, 129] that

uses a novel reference state and distinguishes the backbone atoms according to amino acid type.

Our version removes all contributions involving hydrogen and side chain atoms beyond the Cβ

atom. To eliminate bias towards specific 2° structure types, the attractive potential is removed

between atoms in contiguous stretches of 2° structure, while the repulsive portion is retained to

prevent steric overlap. In addition, interactions are conditional on backbone geometry and the

relative orientation of the Cα-Cβ bonds of the two interacting side chains (Figs. 2.4.1, 2.4.2), a

feature particularly helpful in setting up the overall chain topology so that collapse generates

native-like structures. Beyond the prescription used to eliminate a 2° structure option in the

trimer library, the only adjustable parameters are the four linear weight factors in the StatPot

(Table 2).

2.5 Structure prediction results

Improvement in 2° structure prediction arising from folding.

The first set of targets (Table 3, Fig. 2.5.1) originates from a previous study that

integrates 2° and 3° structure prediction [75]. The set contains proteins with eleven diverse folds

and relatively low sequence homology. The second set of targets (Table 4, Fig. 2.5.2) originates

from a study focusing on improving 3° structure prediction using high sequence homology and

extensive side chain refinement.

Our accuracy of predicting the three major 2° structure types, H(elix), E(xtended) and C(oil)

(termed “Q3 level”) significantly improves as a result of the ItFix folding algorithm compared to

the intrinsic, locally-determined biases. This improvement can be seen by comparison with the initial trimer

library that is contingent only on the sequence. This Round 0, or “R0”, accuracy of 58 ± 10%

improves to 82 ± 11% over the 6-9 rounds of the ItFix process for the various proteins (Tables 3

and 4). The process of fixing 2° structure by eliminating options is well illustrated by the

evolution of 2° structure frequencies at each position in 1ubq (Fig. 2.5.3). The R0 frequencies

display some bias to the native 2° structure but provide only 60% accuracy. Only as 2° structure

options are eliminated does the native 2° structure pattern emerge with 92% accuracy based on

the average 2° structure obtained over the course of the nine rounds. A notable

example is the carboxy-terminal region where the high intrinsic helicity is over-ridden by 3°

context and the region becomes a native-like strand.
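
For reference, the Q3-level accuracy quoted above is the percentage of residues whose predicted state (H, E or C) matches the native assignment. The following minimal sketch assumes that both assignments are given as equal-length strings over H/E/C; the function name `q3_accuracy` is chosen for illustration only.

```python
def q3_accuracy(predicted: str, native: str) -> float:
    """Percentage of positions where the predicted three-state
    assignment equals the native three-state assignment."""
    if not native or len(predicted) != len(native):
        raise ValueError("assignments must be non-empty and of equal length")
    correct = sum(p == n for p, n in zip(predicted, native))
    return 100.0 * correct / len(native)
```

For example, a five-residue prediction with one wrong state scores 80%.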

Figure 2.4.1 Orientation-dependence of statistical potential. Each interacting residue pair has two angles. One angle, 1-2, is the angle between the Cα-Cβ vector of Residue 1 and the Cβ-Cβ vector from Residue 1 to Residue 2, and the other angle, 2-1, is the angle between the Cα-Cβ vector of Residue 2 and the Cβ-Cβ vector going from Residue 2 to Residue 1. A) The relative orientation of the side chains is quantified by a single orientation value. B) Two residues have angles 1-2 and 2-1 close to 90°, yielding a small value. C) A residue pair with a large value has angles 1-2 and 2-1 that are far from 90°. D) Hypothetical protein illustrating possible residue pair orientations having small (1-2, 2-3, 1-3, 4-5) and large (1-4, 2-4, 3-4, 1-5, 2-5, 3-5) values.

Figure 2.4.2 Statistical potential energy profiles illustrating orientation dependence. This dependence reflects the basic protein structural principles of hydrophobic burial, hydrophilic exposure and 2° structure conformation. A) The inter-atomic potential for two Cβ atoms in three different orientations with a large orientation value. In such cases, hydrophobic amino acids are favored to be at shorter distances, corresponding to buried residues pointing at each other in the core of the protein. The opposite applies for hydrophilic amino acids, which prefer larger distances corresponding to surface exposed residues on opposite sides of the protein. B) The potential for two Cβ atoms in two different orientations for two residues on strands of β-sheets with a small orientation value. Shorter Cβ-Cβ distances are preferred for residues on the same side of the sheet, and larger for those on opposite sides of the sheet.

Table 2 Components of DOPE-PW statistical potential

energy term                      min/max dist. (Å)  bin size (Å)  attractive weight
contin. helix                    0.0/15.0           0.5           0.0
contin. strand                   0.0/15.0           0.5           0.0
contin. coil                     0.0/15.0           0.5           0.0
anti-parallel β-sheet (small)    0.0/15.0           0.5           5.0
anti-parallel β-sheet (small)    0.0/15.0           0.5           10.0
non β-sheet (small)              0.0/30.0           0.5           1.0
non β-sheet (med.)               0.0/30.0           1.0           1.0
non β-sheet (large)              0.0/30.0           1.0           1.0

Figure 2.5.1 2° and 3° structure prediction for low-homology targets. ItFix predicts 2° structure at the Q8 level (H, E, CG, CN, CI, CS, CB, or CT) and 3° structure for the low-homology targets in Table 3. Alignments of the ItFix lowest observed RMSD 3° structure (red) with the native structure (blue) using PyMOL visualization software. Cα RMSD values between the ItFix model and the native structure are listed next to each target name. These are referred to as low-homology targets due to the paucity of sequence homologues in the sequence database. This implies that methods such as PSIPRED and SSPro that take advantage of homology in the form of sequence profiles are conferred minimal information other than the local 2° structure propensity of each sequence.

Figure 2.5.2 2° and 3° structure prediction for high-homology targets. ItFix predicts 2° structure at the Q8 level (H, E, CG, CN, CI, CS, CB, or CT) and 3° structure for the high-homology targets in Table 4. Alignments of the ItFix lowest observed RMSD 3° structure (red) with the native structure (blue) using PyMOL visualization software. Cα RMSD values between the ItFix model and the native structure are listed next to each target name. These are referred to as high-homology targets due to the abundance of sequence homologues in the sequence database. This implies that methods such as PSIPRED and SSPro that take advantage of homology in the form of sequence profiles are conferred maximal information in addition to the local 2° structure propensity of each sequence.

Table 3 Homology-free ItFix performance on low-homology target set

Columns 4-8 report 2° structure % accuracy; columns 9-10 report 3° structure Cα RMSD (Å).

PDB ID  description                    length  R0² Q3  ItFix Q3 (Q8³)  SSPRO⁴ Q3 (Q8)  PSIPRED⁵ Q3  Meiler & Baker⁶ Q3  ItFix (best⁷)  Meiler & Baker (best⁸)
1ail    Protein fragment               70      46      76 (73)         70 (74)         73           64                  5.4            6.0
1aoy    Single domain repressor        78      54      82 (72)         81 (65)         87           89                  5.7            5.7
1c8cA   DNA-binding                    64      56      86 (70)         72 (59)         59           67                  3.7            5.0
1cc5    Heme-binding                   76      70      92 (68)         74 (75)         88           86                  6.5            6.2
1dtdB   Disulfide bonds                61      57      71 (57)         64 (57)         75           69                  6.5            5.7
1hz6A   Protein L                      67      57      80 (72)         80 (75)         83           87                  3.8            3.4
1fwp    CheY-binding domain            69      45      70 (55)         48 (30)         61           68                  8.1            7.3
1isuA   Iron-binding                   62      65      82 (44)         66 (39)         81           89                  6.5            6.9
1sap    Hyper-thermophile              66      65      85 (67)         76 (67)         65           65                  4.6            6.6
1wapA   oligomer in crystal structure  68      43      80 (68)         73 (64)         81           68                  8.0            7.7
2ezk    DNA-binding                    93      58      80 (75)         71 (64)         91           85                  5.5            6.6

¹ Target sequences were taken from previous study by Meiler and Baker [75].
² Round 0 accuracy of the initial, sequence-dependent trimer library before any 2° structure restrictions are made. This reflects local 2° propensity.
³ ItFix predicts 2° structure at the Q8 level (H, E, and the 6 types of coil, including turn (CT), bend (CS), 3-10 helix (CG), pi helix (CI), beta-bridge (CB), and other (CN)).
⁴ SSPro predictions taken from the SSPro online server [132].
⁵ PSIPRED predictions taken from the PSIPRED online server [133].
⁶ Values taken from column 6, Table 2 of Meiler et al. [75].
⁷ Lowest RMSD obtained.

Table 4 Homology-free ItFix performance on high-homology target set

Columns 3-6 report 2° structure % accuracy; columns 7-8 report 3° structure Cα RMSD (Å).

PDB ID  length  R0² Q3  ItFix Q3 (Q8)  SSPRO³ Q3 (Q8)  PSIPRED⁴ Q3  ItFix⁵ lowest energy (best)  Bradley et al.⁶
1af7    69      70      97 (86)        86 (81)         90           2.9 (2.5)                    10.4
1b72A   50      62      88 (84)        68 (72)         84           3.5 (1.6)                    1.1
1csp    67      49      79 (67)        75 (67)         88           10.5 (6.0)                   4.7
1di2    68      68      88 (79)        74 (75)         97           6.1 (4.6)                    2.6
1dcj    72      38      45 (29)        65 (56)         89           13.3 (7.6)                   2.5
1mky    77      66      86 (70)        87 (71)         90           6.9 (6.1)                    6.3
1o2Fb   77      65      78 (69)        79 (66)         75           11.2 (5.8)                   10.1
1r69    61      79      93 (89)        84 (72)         92           4.2 (2.4)                    1.2
1shfA   59      53      76 (56)        85 (69)         80           12.2 (6.7)                   10.8
1tif    57      47      89 (79)        86 (70)         93           11.3 (4.2)                   4.1
1tig    86      53      83 (70)        69 (67)         83           6.4 (5.3)                    3.5
1ubq    73      60      92 (69)        88 (67)         90           5.3 (3.1)                    1.0

¹ Target sequences are the same as a previous study [21].
² Round 0 accuracy of the initial, sequence-dependent trimer library. This reflects local 2° propensity.
³ SSPro predictions taken from the SSPro online server [134].
⁴ PSIPRED predictions taken from the PSIPRED online server [133].
⁵ ItFix lowest energy and lowest observed Cα RMSD (in parentheses) structures.
⁶ Values shown are taken from ref. [21], Table 1, Column 6 “Lowest all-atom energy”.

Within the framework of the ItFix algorithm and our energy function, the importance of

3° context in determining 2° structure is demonstrated for the five best performing targets (1af7,

1b72A, 1r69, 1di2, 1ubq). The ItFix process is repeated but without attractive terms between

amino acids separated by more than six residues in sequence (|i-j| > 6). When all the repulsive terms are retained, the

chain adopts extended geometries to avoid steric overlap. The resulting 2° structure prediction

accuracy decreases sharply even compared to the initial R0 2° structure accuracy. When long-

range chain overlap is allowed, the 2° structure prediction accuracy also decreases relative to R0

because the only favorable interaction term remaining is between two residues on strands when

|i-j| > 4. By itself, this term is insufficient to drive stable, native-like sheet formation. A slight

improvement over R0 occurs simply from the 2° fixing protocol without any simulated

annealing. However, the improvement is marginal, 0-2%, compared to 13-30% obtained when the

long-range interactions are included.

Hence, accurate 2° structure prediction requires 3° context. Context serves to stabilize or

buttress weak local biases and 2° structural elements. For example, the amino-terminal hairpin in 1ubq

emerges when the formation of a weak turn brings two potential strands together. Similarly, an

unstable amphipathic helix can be mutually stabilized by a hairpin with a hydrophobic face.

Such 3° contacts may not always be completely native-like as significant increases in 2°

structure accuracy can arise even when the global 3° fold is inaccurate (e.g., RMSD > 6 Å).

Figure 2.5.3 ItFix algorithm mimics the experimentally-determined ubiquitin folding pathway. The position dependence of the 2° structure frequencies at the end of each round, E (blue), H (red) and C (green). A single color bar represents a residue assigned to a single 2° structure type (native 2° structure shown at top, along with long-range contacts). The major steps in the proposed folding pathway [113, 114] are similar to the order of structure fixing over the multiple rounds: The hairpin forms, followed by the helix and β3 strand, and then β4. The final two events are the folding of the 3-10 helix and β5. Their formation appears in some trajectories but not at a high enough frequency to be fixed.

Comparison with existing 2° and 3° structure prediction methods.

The ItFix accuracy generally surpasses that from the 2° structure prediction servers

SSPRO [68] and PSIPRED [10] and the previous study [75], with some exceptions. The high

homology of these sequences is responsible for the prediction accuracy meeting or exceeding 80%

for both the SSPRO [68] and PSIPRED [10] servers. Nevertheless, our ItFix protocol achieves

comparable accuracy without invoking any homology information (Tables 3 and 4). The average

ItFix accuracy is only slightly smaller for the low homology targets, 80% versus 83%. But the

lack of homology significantly degrades PSIPRED and SSPRO’s performances, 77% versus 88%

and 70% versus 79%, respectively. Furthermore, the ItFix method is able to predict all eight

types of 2° structure where coil is subdivided into six DSSP-defined subtypes (CG, CN,

CI, CS, CB, CT), termed “Q8 level”. This ability also is available using SSPRO, but it is slightly

less accurate for most targets. As illustrated below, ItFix provides much better predictions for the

location of turns and the ends of helices and strands, features that are crucial in 3° structure

prediction.

The ItFix algorithm describes α, β, and α/β proteins within each set with comparable

accuracy, although 3° predictions for the more challenging low homology set are generally

poorer (Tables 3 and 4, Figs. 2.5.1 and 2.5.2) because we have difficulty predicting metal- and heme-binding

proteins and disulfide-bonded proteins. The high homology set lacks these challenging protein

types. The accuracy of ItFix’s 3° structure predictions is comparable to that of the highly

successful Rosetta fragment-based insertion algorithm, as implemented in the papers from which

the test sets are obtained [21, 75]. Our structures are more similar in quality for the low

homology set than the high homology set. The high homology targets were chosen by Baker and

coworkers because improved predictions are obtained using data from the folding of an extensive

number of homologs. In addition, the Rosetta algorithm requires extensive side chain refinement

and thus orders of magnitude more computation time [21] than our algorithm which omits side

chain degrees of freedom. Hence, it is not surprising that this implementation of Rosetta

performs better for 9 of the 12 targets in this set.

Table 3 demonstrates that the ItFix 2° structure prediction method can meet or exceed the

percentage prediction accuracy of the programs PSIPRED and SSPro. The information conveyed

by the % accuracy, however, is compromised because of disagreements between methods for the

assignment of 2° structure. For example, the DSSP method, which we use to assign 2° structure,

differs from DeepView in specifying 2° structure for 1tif (Table 4). DeepView tends to be more

liberal when assigning strands and designates residues 3-5 and 9 as strand, whereas DSSP

assigns this region as mostly coil. Our method similarly favors assigning strand over coil,

implying that ItFix should achieve higher 2° structure accuracy for 1tif when compared to the

native DeepView assignment rather than the DSSP assignment. Nevertheless, we compare our

prediction to DSSP assignments because DSSP is used to calculate the 2° structure of the

simulation models.

Another issue relating to 2° structure prediction accuracy is the varying assignment of 2°

structures by different prediction methods. For example, some approaches consider an α-helix

and a 3-10 helix to belong to the same category, whereas we treat the 3-10 helix as a subtype of

coil because the α-helical hydrogen bonding pattern requires at least four residues, whereas a 3-10

helix requires only three. Notably, when Q3 level methods such as PSIPRED and SSPro predict

a 3-10 helix as an ‘H’, we consider them correct, and incorrect when predicting a 3-10 helix as

coil. Because our 2° structure sampling depends on DSSP, we adhere to its convention and

consider 3-10 helix to be a class of coil (CG).
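
The two conventions discussed above can be made concrete with a short sketch that reduces a string of DSSP one-letter codes (H, B, E, G, I, T, S, and '-' or a blank for other) to three states. The function name `reduce_to_q3` and the `helix_310_as_coil` switch are illustrative assumptions: the default follows the convention adopted here (3-10 helix, DSSP code G, counted as coil), while setting the switch to `False` reproduces the alternative convention in which G is grouped with H.

```python
def reduce_to_q3(dssp: str, helix_310_as_coil: bool = True) -> str:
    """Map DSSP one-letter codes to the three states H, E and C.

    H (alpha helix) -> H, E (extended strand) -> E.
    G (3-10 helix) -> C by default, or H if helix_310_as_coil is False.
    Every other code (B, I, T, S, '-', ' ') -> C.
    """
    states = []
    for code in dssp:
        if code == "H":
            states.append("H")
        elif code == "E":
            states.append("E")
        elif code == "G" and not helix_310_as_coil:
            states.append("H")
        else:
            states.append("C")
    return "".join(states)
```

Comparing one prediction against both reductions of the same native DSSP string shows how much of a reported accuracy difference stems from the assignment convention rather than from the prediction itself.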

Feedback between 2° and 3° structure prediction.

While the average accuracy of a 2° structure prediction is a useful metric, it underreports

the importance of the feedback between 2° and 3° structure as illustrated for two of the many

examples. The ItFix 2° structure accuracy for 1c8c is only modestly superior to those of

PSIPRED and SSPro, but crucially, ItFix correctly predicts as strand a region that both the

other methods incorrectly assign as helix (Fig. 2.5.1). Similarly, the SSPro Q8 level prediction

incorrectly assigns positions 9 and 10 as turn in 1ubq, whereas ItFix correctly assigns the turn

residues to positions 8 and 9 (Fig. 2.5.2). Only through successive rounds of folding does the

proper 3o context override the local propensities to correctly determine the location of the turn.

Although seemingly insignificant, this difference is crucial because the alignment of the hairpin,

and therefore the quality of the overall structure, depends on properly identifying the turn

location. Thus, extensive sequence homology information and intrinsic propensities can be

insufficient for 2° structures that depend strongly on 3° context.

Our main limitation in predicting 2° structure is the occasional deficiency of our starting

trimer library. For example, when we predict 2° structure for target 1dcj from the initial trimer

library contingent only on the sequence (R0), the accuracy is below 40%, implying that very poor

local 2° structure context exists for this target (Table 4). Our 46% accuracy for this target

suggests that the 3° context of folding is insufficient to compensate for poor local propensity. In

fact, we assign the second helix of 1dcj as coil because a proline-glycine pair in the center of that

helix has a very high preference for coil. PSIPRED performs well on this target, presumably

because of the influence of sequence homology. SSPro underperforms for this protein and some

others, perhaps because local preferences are weighted more heavily than the contribution of the

sequence alignment as compared to PSIPRED.

3° structure predictions.

Even though our StatPot can routinely distinguish a native structure from a set of folding

decoys, the folding simulations cannot always generate native-like models. This limitation is

often due to the vast size of the conformational search space for some sequences. We reduce the

search space by specifying the sequence, and then iteratively identifying the 2° structure. Often,

however, local propensities are so strong that even enormous amounts of sampling and 3°

context cannot overcome the bias. For example, the turn of the second β-hairpin of 1di2 contains

residues whose turn propensities are very low. Even through many rounds of ItFix folding, the

turn probability of that region never becomes high enough to fix, which severely limits the

quality of the 3D models generated. Other prediction methods circumvent this problem and

accurately predict this structure by using the degeneracy of sequence homology to properly

predict the turns and by sampling larger structure fragments which may contain long range

information that specifies the turn [21], suggesting that employing sequence homology can

smooth over any incorrect local biases. Although ItFix uses no homology information and

samples one position at a time, it still correctly predicts the structure of 1di2 and its difficult turn

by including the crucial 3° structure context of folding.

The lack of homology-based information is actually beneficial to predictions for some

sequences, specifically when the MSA incorrectly biases the 2° structure. ItFix fares

exceptionally well for the 2° and 3° structure of 1sap (Table 3) because the 3° context drives the

central region of the protein to be β-sheet rather than the helix preferred by the sequence

homology-based methods (Fig. 2.5.2). The very high confidence of PSIPRED in this region

suggests that the MSA strongly biases the 2° structure towards helix, resulting in less accurate 3°

structure.

2.6 Folding pathways.

Many aspects of the ItFix algorithm replicate the folding behavior of authentic proteins.

During the ItFix process, subunits of structure, or “foldons”, are fixed cooperatively, just as

observed by hydrogen exchange experiments [111]. The foldons add to existing structure in a

process of sequential stabilization [112] that may resemble the pathway taken by authentic

proteins. In contrast to methods using pre-formed fragments or exogenous 2° structure

predictions where the connection to the authentic pathway is murky at best, the ItFix protocol

begins with an initial unstructured chain, and the buildup of structure evolves out of the folding

process. Hence, the order of fixing of structural elements may recapitulate major features of the

authentic pathway followed as the real chain progresses along the free energy surface (Fig.

2.6.1).

For the protein ubiquitin, the order of fixing structural elements (Fig. 2.5.3) and their

interactions are in remarkable accord with the experimental pathway [113]. A notable feature is

the formation of the parallel β-strand interaction between the amino and the carboxy termini.

This long-range contact occurs prior to the 2° structure assignment of thirty intervening residues

and is possible with our method because the simulation includes the entire chain at all times.

Further, this parallel interaction overrides the initial R0 trimer propensities that favored

helix for the carboxy-terminal strand, as previously noted. Irrespective of whether the ItFix

algorithm replicates experiment, the pathway nature of the algorithm and the interplay of 2° and

3° structure formation contribute to the success, just as a pathway helps real proteins fold

reproducibly and expediently.

2.7 Conclusions

The ItFix algorithm predicts 2° structure without resorting to homology and yet delivers an

accuracy and specificity that matches or exceeds current methods which rely heavily on

homology. The success is due to the integration of 3° structure context during the folding

Figure 2.6.1 Progression of fixing structure for 1af7, 1b72A, 1di2, and 1r69. The position dependence of the 2° structure frequencies at the end of each round, E (blue), H (red) and C (green). A single color bar represents a residue assigned to a single 2° structure type. The native 2° structure is shown with red boxes (helices) and blue arrows (strands) at the top and bottom. The order of fixing of structural elements may recapitulate major features of the authentic pathway. Round 0 frequencies are the average 2° structure obtained from the initial trimer library that is contingent only on the sequence. As the rounds progress, the probabilities of non-native 2° structures diminish.

simulations and the recursive refinement of the 2° structure assignments. Concurrently, accurate

3° structures are often generated. Although the model lacks explicit side chains, our PDB-based

backbone sampling protocol and scoring functions largely recapture the lost information. Hence,

we avoid the computationally expensive search along the rugged side chain rotamer energy

surface that is frequently involved in other successful prediction methods. In addition to

highlighting the basic principles required for ab initio structure prediction, our work extends the

size of proteins that can be predicted using homology-free methods. Furthermore, the ItFix 2°

structure predictions provide improved prediction of turns and ends of helices and strands,

features that are important in describing 3° structure. Thus, the ItFix predictions can be used as

inputs to increase the accuracy of template-based predictions that otherwise have inherent

restrictions imposed by requiring sequence homology. Moreover, now that the basic principles

have been established, the performance of ItFix can be improved further using homology.

2.8 Methods

Fixing Protocol

The protocol for eliminating a 2° structure option at a position is determined using the 2°

structure frequencies in the trimer library at the beginning of the round, P_X^{Init} (X = E, H or C), the

frequencies calculated using DSSP for the 200-300 final structures, P_X^{Fin_1}, and the frequencies

of the trimers’ original 2° structure, P_X^{Fin_0}, according to the following main criteria (see Suppl.

material). For i consecutive positions (in order of precedence):

(i>6): [HEC] → [EC] if P_H^{Fin_1} < 0.03

(i>10): [HEC] → [EC] if P_H^{Fin_1} < 0.05

(i>2): [HEC] → [EC] if P_E^{Fin_0} > 0.50 and P_E^{Fin_0} > P_E^{Init} and P_E^{Fin_1} > 0.00

(all positions in protein): [HEC] → [HC] if P_E^{Fin_1} < 0.01

(i>3): [HEC] → [HC] if P_H^{Fin_0} > 0.50 and P_H^{Fin_0} > P_H^{Init}

(i>4): [HEC] → [HC] if P_H^{Fin_1} > 0.40

(i>0): [H or C] → [H only] if P_C^{Fin_0} < 0.10, or (P_H^{Fin_0} > 0.50 for i-1, i-2, i+1, i+2)

(i>0): [H or C] → [C only] if P_H^{Fin_1} < 0.10

(i>0): [E or C] → [C only] if P_E^{Fin_0} < 0.05

(i>0): [E or C] → [E only] if P_C^{Fin_0} < 0.10

(i>0): [E or C] → [E only] if P_E^{Fin_0} > 0.50 and (P_E^{Fin_0} > 0.50 for i-1, i+1)

(i>0): [E or C] → [E only] if P_E^{Fin_0} > 0.50 and P_E^{Fin_1} > 0.00 and total positions fixed > 80% of sequence length

The selection of the thresholds has been made as an empirical compromise between

prediction accuracy and the speed of specifying 2° structure. Some accuracy may be

compromised to allow the largest number of positions to be fixed within a reasonable number of

rounds.
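
As an illustration of how criteria of this kind can be applied, the following minimal sketch implements only the first two rules, which remove the helix option over long stretches where helix is rarely observed in the final structures. It assumes that "(i>6)" denotes a run of more than six consecutive positions, that each position carries a set of allowed classes, and that the per-position helix frequencies from the final structures are supplied as a list; the function names are hypothetical.

```python
def runs_longer_than(flags, min_len):
    """Indices inside runs of consecutive True values longer than min_len."""
    selected, start = [], None
    # A trailing False acts as a sentinel that closes a run ending at the last position.
    for idx, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start > min_len:
                selected.extend(range(start, idx))
            start = None
    return selected


def eliminate_helix(options, p_h_fin1):
    """Drop 'H' where the helix frequency in the final structures stays low.

    options  : list of sets of allowed classes per position, e.g. {"H", "E", "C"}
    p_h_fin1 : list of helix frequencies per position from the final structures
    """
    updated = [set(opts) for opts in options]  # leave the input untouched
    # (run length, cutoff) pairs taken from the first two criteria in the text.
    for min_len, cutoff in ((6, 0.03), (10, 0.05)):
        flags = [p < cutoff for p in p_h_fin1]
        for idx in runs_longer_than(flags, min_len):
            updated[idx].discard("H")
    return updated
```

The remaining criteria follow the same pattern with different frequencies, cutoffs and run lengths, so a fuller version could store all of them as data rather than as separate functions.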

If the turn (CT) probability is greater than 50% in a region that has been fixed to have at

least one 2° structure type removed, we fix that region to coil. Also, regardless of the

library restrictions, if a large stretch of positions contains no strand, then strand is removed

from the library at those positions if the overall 2° structure fixing is at an advanced state (>90%

positions fixed). If a position is fixed as strand, at least 3 adjacent positions will be fixed as

strand when those positions have a strand probability > 50%. If the direction in which to fix

strands is ambiguous, fixing proceeds away from the nearest segment of coil. This correction is

added to make sure the maximum amount of secondary structure is fixed for a given target.

To decrease the number of rounds of folding required for convergence, we use additional

operations. If strand has been removed from the library at two positions that are separated by

three or fewer residues where there are no library restrictions, we remove strand from the

intervening positions. There are operations on all types of library restrictions to refine any small

spaces between fixed regions, e.g. C-C and H-H are replaced by CCC and HHH, respectively.
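
The two convergence operations described above can be sketched as follows. The sketch assumes that each position carries a set of allowed classes (for the strand-bridging step) or a single character with "-" marking an unfixed position (for the gap-filling step); both representations and the function names are illustrative choices, and only the C-C and H-H cases named in the text are handled.

```python
FULL = {"H", "E", "C"}


def bridge_strand_removal(options, max_gap=3):
    """Remove 'E' from short unrestricted gaps between strand-free positions."""
    updated = [set(opts) for opts in options]
    last = None  # most recent position where strand has already been removed
    for idx, opts in enumerate(options):
        if "E" in opts:
            continue
        if last is not None:
            gap = range(last + 1, idx)
            # Only bridge gaps of at most max_gap positions with no restrictions.
            if 0 < len(gap) <= max_gap and all(options[k] == FULL for k in gap):
                for k in gap:
                    updated[k].discard("E")
        last = idx
    return updated


def fill_single_gaps(fixed, unfixed="-"):
    """Replace C-C by CCC and H-H by HHH in a string of fixed assignments."""
    chars = list(fixed)
    for idx in range(1, len(fixed) - 1):
        left, right = fixed[idx - 1], fixed[idx + 1]
        if fixed[idx] == unfixed and left == right and left in ("H", "C"):
            chars[idx] = left
    return "".join(chars)
```

Both helpers compare against the input rather than the partially updated output, so a single call applies each operation once per round instead of cascading.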

The set of proteins studied typically requires 5-12 rounds. Convergence is slow for two

proteins (1bm8, 1vqh), and those simulations were stopped after 12 rounds. After the final round

for all proteins, the remaining unfixed positions have the 2° structure type determined by

plurality to obtain the final predictions. The DSSP 2° structure of each final structure in every

round is calculated directly or from the origins in the trimer library. β-sheet and turn probabilities

are taken from the origins in the library, whereas all other 2° structures are determined directly

using DSSP. In a small minority of cases, 2° structure assignments disagree with those

determined by DSSP. Incorrect assignments usually occur around the border between a helix and

coil or β-sheet and coil, and in most cases tend to be at positions where 2° structure

determination methods disagree. The most notable examples are 1sap where ItFix assigns the

fifth strand as coil, 1fwp where the second helix is assigned as strand, and 1dcj, where the second

helix and third strand are incorrectly fixed. However, 1sap can fold accurately, implying that

some errors do not affect the quality of the structure prediction.
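
The plurality step for positions that remain unfixed after the final round can be sketched as below. The sketch assumes a string of fixed assignments with "-" marking unfixed positions and a per-position dictionary of class frequencies; breaking ties in the order H, E, C is an arbitrary choice made here, since the text does not specify one.

```python
def assign_by_plurality(frequencies, order="HEC"):
    """Return the class with the highest frequency (ties broken by `order`)."""
    return max(order, key=lambda cls: frequencies.get(cls, 0.0))


def finalize_prediction(fixed, per_position_freqs, unfixed="-"):
    """Keep fixed assignments and fill unfixed positions by plurality."""
    return "".join(
        cls if cls != unfixed else assign_by_plurality(freqs)
        for cls, freqs in zip(fixed, per_position_freqs)
    )
```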

Energy function

The reduced Cβ model includes only the backbone heavy atoms and the side chain Cβ.

The energy function is a pairwise additive statistical potential based on the Discrete Optimized

Protein Energy function (DOPE) [88]. We further divide interaction types as contingent on 2°

structure type and continuity, sequence separation, and orientation (Figs. 2.4.1 and 2.4.2). The 2°

structure types are defined as per DSSP. An atom pair is defined as in a continuous segment of 2°

structure if each residue in the pair and all intervening residues in sequence have the same 2°

structure classification. The orientation-dependence is determined by the angle between the side

chain Cα-Cβ vector and the Cβ-Cβ vector connecting the interacting pair.

MCSA simulations

Our MCSA energy minimization and sampling methods have been described previously

in detail [53]. The φ, ψ angles are sampled from a PDB-derived library (resolution < 2.5 Å, homology

below 90%). To test whether the 90% homology level provided a native-like bias, five of the best

performing targets (1af7, 1b72A, 1r69, 1di2, 1ubq) were refolded but using a library with only a

25% homology threshold. The average accuracy of the 2° structure prediction changed from 91.6

to 90.2% while the average of best 3D structures changed from 2.84 to 3.30 Å, respectively.

These slight differences are most likely due to a 1.5-2-fold decrease in trimer diversity rather

than the use of the 90% homology level, which is at most a minimal factor in the success of the

algorithm.

The annealing simulations only consider the heavy atoms of the main chain and the

β-carbons (Cβ) of the side chains. The backbone planar angles and bond lengths are fixed at their

ideal values, except ω, which is chosen as cis at a frequency of 5% and 0.1% for prolines and all

other residues, respectively. For the cis-proline predictions for Table II, we obtained 2 true

positives, 4 false positives, 41 true negatives, and 2 false negatives based on an increase above

the 5% baseline. All non-proline residues are correctly predicted to be trans.
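
For reference, the cis-proline counts quoted above can be converted into standard classification metrics. The following short sketch uses only the four counts stated in the text; the function name and the choice of metrics are illustrative.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard summary metrics for a two-class prediction."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true cis prolines found
        "specificity": tn / (tn + fp),  # fraction of trans prolines kept as trans
        "precision": tp / (tp + fp),    # fraction of cis calls that are correct
        "accuracy": (tp + tn) / total,
    }


# Counts reported in the text for the cis-proline predictions.
cis_proline = binary_metrics(tp=2, fp=4, tn=41, fn=2)
```

With these counts the sensitivity is 0.50, the precision about 0.33, the specificity about 0.91 and the overall accuracy about 0.88 over the 49 prolines.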

Sampling library

We obtain our trimer library from a PDB culled using PISCES [135, 136] with a

resolution cutoff of 2.5 Å. As 2° structure is restricted, the total number of trimers available for a

given sequence becomes smaller and less diverse. We increase diversity by allowing trimers with

amino acid substitutions within these four groups of structurally correlated amino acids, (FVI),

(LM), (KRQH), and (WYF) (e.g. the three trimers XFY, XVY, XIY, are considered equivalent).

We add 5° noise to each angle pulled from the library. Bond lengths and angles are all set to

ideal values.
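
A minimal sketch of the two library operations described above is given below. It assumes that substitutions are applied to the central residue of a trimer, as in the XFY, XVY, XIY example, and that the 5° noise is drawn uniformly from [-5°, +5°]; neither detail is stated in the text. Because F appears in two of the groups and the text does not say how that overlap is handled, the sketch treats one group at a time. All names are hypothetical.

```python
import random

# Groups of structurally correlated amino acids listed in the text.
GROUPS = ("FVI", "LM", "KRQH", "WYF")


def equivalent_trimers(trimer, group):
    """Trimers obtained by substituting the central residue within one group."""
    left, centre, right = trimer
    if centre not in group:
        return [trimer]
    return sorted(left + sub + right for sub in group)


def add_noise(angle, width=5.0, rng=random):
    """Perturb an angle in degrees and wrap the result into [-180, 180)."""
    noisy = angle + rng.uniform(-width, width)
    return (noisy + 180.0) % 360.0 - 180.0
```

Pooling the angles of all equivalent trimers enlarges the sample available at a position once 2° structure options have been removed, which is the stated purpose of the substitution groups.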

Chapter 3

Using evolutionary diversity to enhance structure prediction

Parts of this chapter are published in DeBartolo et al. Protein structure prediction

enhanced with evolutionary diversity: SPEED. Protein Sci. (2010) vol. 19 (3) pp. 520-34 and in

the accompanying supplementary materials. Additional sections have been added and the text

has been updated accordingly. I acknowledge and thank Glen Hocky, Mike Wilde and Jinbo Xu

for helpful discussions.

For naturally occurring proteins, similar sequence implies similar structure.

Consequently, multiple sequence alignments often are used in template-based modeling of

protein structure and have been incorporated into fragment-based assembly methods. Our

previous homology-free structure prediction study introduced an algorithm that mimics the

folding pathway by coupling the formation of secondary and tertiary structure. Moves in the

Monte Carlo procedure involve only a change in a single pair of φ, ψ backbone dihedral angles

that are obtained from a PDB-based distribution appropriate for each amino acid, conditional on

the type and conformation of the flanking residues. We improve this method by utilizing

multiple sequence alignments to enrich the sampling distribution, but in a manner that does not

require structural knowledge of any protein sequence (i.e., not fragment insertion). In

combination with other tools, including clustering and refinement, the accuracies of the predicted

secondary and tertiary structures are substantially improved and a global and position-resolved

measure of confidence is introduced for the accuracy of the predictions. Performance of the

method in the Critical Assessment of Structure Prediction (CASP8) is discussed.

3.1 Introduction

Given the expansion of the sequence database, an imperative of the field of structural

biology is to cluster related sequences into families and determine a representative structure for

each family [16-20]. The already large number of families is rapidly expanding and the cost of

determining representative protein structures is high. Computational structure prediction may

provide the most effective means of mapping the protein universe. Structure prediction, however,

is inherently challenging because of the enormous conformational space accessible to each

amino acid sequence. For this reason, the most successful prediction methods seek to narrow the

conformational search, for example by using large PDB fragments [80] rather than simulating the

protein ab initio [9, 15].

We have recently developed a Cβ-level, homology-free structure prediction algorithm,

termed ItFix,[8] in which the conformational search space is restricted by iteratively fixing

secondary (2°) structure assignments of certain portions of the sequence after incorporating the

influence of tertiary (3°) context. Moreover, the iterative feature enables regions of lower

confidence to be predicted after the fixing of more confident regions. The coupling and mutual

stabilization of 2° and 3° structure formation mimics the pathway character exhibited by real

proteins [113, 114].

The computationally rapid, homology-free algorithm uses moves involving only the

change in a single pair of φ, ψ dihedral angles (pivot moves). Hence, its performance is

independent of the existence of appropriate fragments from the PDB. Nevertheless, our

algorithm can outperform current homology-based 2° structure prediction methods for many

proteins. ItFix also generates 3° structures of comparable accuracy to existing methods for many

small proteins, including ones with few sequence homologues.

Our earlier study revealed that a large impediment to more accurate structure prediction

arises from the intrinsically low propensity of some residues to adopt the backbone dihedral

angles found in their native structures. In the protein 1dcj, for example, the middle of a helix

contains a proline followed by a glycine, two residues that are very unlikely to be found together

in helices. Even though ItFix uses more confidently assigned regions to identify native structure

in otherwise weakly determined regions, the additional contextual information occasionally is

insufficient to override very strong local biases. Unfortunately, issues of this severity occur

in many proteins, and the associated errors can detrimentally affect the accuracy of the 2° and 3°

structure prediction.

Here, we employ multiple sequence alignments (MSAs) to mitigate the influence of the

non-native local biases. MSAs are incorporated into many popular 2° structure [10, 68] and both

template-based [137-139] and template-free [21, 83] 3° structure prediction methods. In our

distribution of sampled angles, the non-native biases are manifested as a low probability of

native-like angles. This PDB-based distribution is now enriched using the sequence diversity

found in an MSA, but does so without requiring structural information from any constituent

sequence. We denote this procedure SPEED: Structure Prediction Enhanced by Evolutionary

Diversity (Fig. 3.1.1). The combination of ItFix and SPEED significantly increases the accuracy

of 2° and 3° structure predictions, and more so in combination with novel energy functions and

clustering methods. We also provide global and local measures of the confidence of our

predictions, thereby providing an essential tool for assessing the accuracy of the predicted

structures of unsolved sequence families.

3.2 Overview of the SPEED method

Figure 3.1.1a provides an overview of both the homology-free and SPEED structure

prediction methods utilizing the ItFix 2° structure fixing procedure. The fundamental difference

between our original homology-free protocol and the new SPEED protocol relates to the

Ramachandran (Rama) sampling distribution. In the homology-free protocol, the φ, ψ distribution

is generated only from the target sequence, whereas in the new protocol, the distribution is

constructed from an MSA of the target sequence. At the beginning of the ItFix procedure, no 2°

structure is fixed, and the φ, ψ distribution at each position reflects all 2° structure types, although

the distribution is contingent on the amino acid identities of the neighboring positions (Fig.

3.1.1b). Through rounds of folding (Monte Carlo simulated annealing, MCSA) using an energy

function that promotes hydrophobic burial and that penalizes polar burial (Methods), the 2°

structure options helix, strand or coil are progressively eliminated when their occurrence in the

final collapsed structures falls below a ~0-10% threshold.[8] Angles originating from the

eliminated 2° structure option are excluded in the calculation of the Rama distribution for the

subsequent round. The folding and elimination process proceeds until no further 2° structure

options can be eliminated (Fig. 3.1.1b middle and bottom). The final result is a more restricted

Rama distribution across the entire sequence which greatly reduces the search space.

The final Rama distribution is used to generate a large (10,000) ensemble of 3° structure

models. These models are clustered into groups of similar structure, and the models from the

largest cluster are selected for refinement and prediction, using our DOPE-PW statistical

potential.

Figure 3.1.1 Structure prediction protocol. a) The 2° and 3° structure prediction protocol for homology-free modeling uses the target sequence to generate a Rama sampling distribution, whereas SPEED uses a distribution that is averaged over a Multiple Sequence Alignment (MSA). The ItFix algorithm iteratively defines the 2° structure, and clustering and refinement are used to predict 3° structure. b) The Rama distribution for position 4 of the sequence of 1tif is shown for representative rounds of ItFix for homology-free and SPEED sampling. The native φ, ψ angles are denoted as a red circle.

3.3 SPEED enhanced Ramachandran distributions

At the beginning of the ItFix rounds, the Rama distribution at each position is conditional

only on the amino acid identities of the position and its two neighbors. Our homology-free

implementation obtains this distribution solely using the target sequence. For example, N4 of 1tif

is flanked by I3 and E5 (denoted INE), with the resulting INE having a homology-free Rama

distribution displayed in the left panel of Figure 3.1.1b. The SPEED-enhanced Rama distribution

is the sum (with equal weights) of the distributions of all possible three-residue combinations

generated from the amino acid substitutions identified by the MSA. For example, the SPEED

distribution for INE is the sum of multiple Rama distributions derived from the MSA, such as

IND, IGD, and VGN. At the beginning of the algorithm when no 2° structure option is eliminated,

the native Rama region (Fig. 3.1.1b, red circle) has a small sampling probability in the

homology-free distribution (P=0.01), and the predominant Rama region is right-handed helix

(P=0.6). By contrast, the native Rama region has a ~20-fold larger probability in the equivalent

SPEED Rama distribution. Also, at the end of the ItFix rounds, the SPEED probability of the

native Rama region has nearly doubled compared to the homology-free probability (P=0.37

versus 0.21). The native Rama probability enhancement due to ItFix is thus significantly

improved by the MSA-based procedure.
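
The construction of the SPEED-enhanced distribution at one position can be sketched as follows. The sketch assumes that each library trimer has a Rama distribution stored as a dictionary from a region label to a probability, that the MSA supplies the residues observed in the three alignment columns around the position, and that the equally weighted sum is renormalized by the number of trimers found in the library. The data structures, region labels and function name are hypothetical.

```python
from itertools import product


def speed_distribution(columns, trimer_distributions):
    """Equal-weight average of the Rama distributions of MSA-derived trimers.

    columns              : three iterables of residues seen at i-1, i and i+1
    trimer_distributions : dict mapping a trimer to {region: probability}
    """
    total, used = {}, 0
    # Every three-residue combination allowed by the alignment columns.
    for combo in product(*columns):
        dist = trimer_distributions.get("".join(combo))
        if dist is None:
            continue  # this trimer has no entry in the library
        used += 1
        for region, prob in dist.items():
            total[region] = total.get(region, 0.0) + prob
    if used == 0:
        raise ValueError("no trimer from the alignment is present in the library")
    return {region: value / used for region, value in total.items()}
```

When a single sequence is supplied, the product contains only the target trimer, so the same function reproduces the homology-free distribution.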

To illustrate the benefit of using SPEED, we quantify the enhancement across all

positions in the folding targets by comparing the native Rama probability of the homology-free

distribution to that of the SPEED-derived distribution (Fig. 3.3.1). This analysis proceeds by

partitioning the Rama map into four broad regions (Fig. 3.3.1a). More refined divisions of the

Rama map exist, but this division into four regions may be the most refined definition with clear

borders. The quality of SPEED-derived distribution is quantified as the percentage of positions

with low probability of the native Rama region (P<0.25). This percentage is a useful metric

because any position with such a low native Rama probability is an obvious candidate for

improvement. Compared to the homology-free Rama distributions, the new procedure decreases

the percentage of residues having a non-native Rama propensity for 10 out of the 12 targets

studied (Fig. 3.3.1b). The two exceptions remain unchanged because their homology-free

distributions already are very good. The two targets with the largest improvement in Rama

distribution are 1csp (78% → 86%) and 1dcj (84% → 94%). In particular, the homology-free

Rama distribution for 1dcj contains serious flaws due to the aforementioned proline-glycine pair

in the second α-helix and for residues in the turn separating the second helix and third strand

(Fig. 3.3.2). SPEED overrides the non-native propensity of G46 in the second helix (P=0.21 →

P=0.62) and also enhances the E52 turn position’s native propensity (P=0.01 → P=0.32).

In addition to the moderation of outliers, SPEED enhances the native Rama propensity when it is

already high, as is the case for 1b72. Here, the native Rama probability at only one of the ten coil

positions (E31) falls below the 0.25 threshold (Fig. 3.3.1). Its native-like probability is only

P=0.03 in the homology-free distribution but is enhanced to P=0.23 in the enhanced distribution.

Additionally, the native Rama probability in the SPEED-derived distribution is two-fold higher

than the homology-free distribution in 7 out of 10 coil positions. Similar improvements can be

seen for other targets (Fig. 3.3.2). The exceptions to this trend generally emerge for positions

which already have a very strong native-like propensity in the target sequence. An illustration of

this effect is the left-handed turn position G10 in 1ubq. Because glycine favors the native left-

handed turn basin more than all other residues, any substitution lowers the native Rama

probability (Fig. 3.3.2). Nevertheless, the decrease in native probability due to the use of SPEED

is on average much smaller than the benefit at other positions.
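
The per-target summary used in this comparison, the percentage of positions whose native Rama region has a sampling probability above 0.25, reduces to a short calculation. In the sketch below the input lists are toy values chosen for illustration and are not data from this work; the function name is hypothetical.

```python
def percent_above(native_probs, cutoff=0.25):
    """Percentage of positions whose native-region probability exceeds cutoff."""
    if not native_probs:
        raise ValueError("at least one position is required")
    return 100.0 * sum(p > cutoff for p in native_probs) / len(native_probs)


# Toy per-position probabilities of the native Rama region (illustrative only).
homology_free = [0.03, 0.40, 0.60, 0.20]
speed = [0.23, 0.50, 0.70, 0.30]
improvement = percent_above(speed) - percent_above(homology_free)
```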

Figure 3.3.1 SPEED-enhanced Rama sampling distribution. a) Rama space is divided into four coarse regions for analysis. b) The percentage of residues with probability exceeding 0.25 for the native Rama region is increased for SPEED for all targets, particularly 1csp and 1dcj. c) For 1b72, the probability of the native Rama region is greatly enhanced using SPEED.

Figure 3.3.2 Position-based comparison of homology-free and SPEED distributions. The analysis of Figure 3.3.1c is shown for additional targets.

3.4 ItFix 2° structure

Improvements to ItFix

The 2° structures of the final models are identified using the DSSP program for 2°

structure determination [58]. Since DSSP-identified β-strands must be involved in β-sheet

networks with optimized hydrogen bonds, the strand-fixing threshold is lower than in our previous

study, with no noticeable decrease in fidelity. In many cases, the fidelity for specifying 2°

structure is higher. This increase is particularly evident for the all-α targets, where the β-strand

option is eliminated at every position within the first two rounds as a result of the β-strand

probability vanishing (P < 0.005) at every position (in the first round for 1af7, 1b72; in the

second round for 1r69). The same accuracy is found for the helical regions of the targets.

Improvement in 2° prediction accuracy

The 2° structure prediction accuracy using SPEED compares very favorably with the

popular 2° structure prediction methods SSPro [68] and PSIPRED [10] (Table 5). When

predicting 2° structure at the level of helix, extended or coil (three options, termed Q3), ItFix-

SPEED is more accurate than its homology-free ItFix counterpart (average accuracy 84% →

88%). Most of this improvement is due to 1csp (79% → 87%) and 1dcj (45% → 83%), the two

targets with the largest improvements in Rama distribution due to SPEED (Fig. 3.3.1b). The 2°

structures for the all-α targets already are predicted to high accuracy using the homology-free

ItFix, so the average improvement due to SPEED is small (93% → 96%), with the exception of

1b72 where the improvement is more substantial (88% → 96%). The one exception is 1di2,

whose failure is discussed in the 3° structure prediction section below.

More impressive is the increase in accuracy for the prediction of 2° structure at the more

refined Q8 level where coil is subdivided into six DSSP-identified subtypes (this level of

prediction is unavailable with PSIPRED). For 1b72, the overall Q8 accuracy increases (84% →

96%) using SPEED with a >0.95 probability assigned to the native Q8 value at every position in

the second coil region. Two other targets that have substantial improvements in Q8 accuracy are

1dcj (29% → 65%) and 1ubq (69% → 82%). Most of the Q8 improvements for 1dcj arise from

the same helix and strand improvements found for the Q3 values, whereas the Q8 improvements

for 1ubq are due almost exclusively to better turn predictions within the coil subtype.

3.5 Energy Functions

We continue to use a reduced Cβ model that includes the backbone heavy atoms,

backbone amide hydrogen, and the side chain Cβ, and a slightly modified version of the DOPE-PW

energy function [8]. This energy function is a pairwise additive statistical potential based on

the observed distance distributions between each atom in the model. In addition to distinguishing

each type of atom, the energy function classifies each interaction according to residue type, 2°

structure assignment, and side-chain orientation.

In the prior ItFix treatment, the 2° structure assignment at a position is the same

assignment as in the original PDB structure from which the last φ, ψ pair is selected (for this

position). Here, the 2° structure is specified using a geometric definition of 2° structure that is

applied in each energy calculation (i.e., in the application of the strand-strand terms, helix-helix

terms, etc.). A residue is considered to lie in a helix if it is situated in a block of more than four

residues in a row satisfying the following criteria:

Table 5 SPEED 2° structure prediction comparison¹

| PDB ID | Size | Fold | NEFF⁴ | Rama enrichment², Hfree (angles / residue) | Rama enrichment², SPEED (angles / residue) | ItFix Q3 (Q8)³ | ItFix-SPEED Q3 (Q8)³ | SSPro Q3 (Q8)³ | PSIPRED Q3³ |
|---|---|---|---|---|---|---|---|---|---|
| 1af7 | 69 | | 7.3 | 1426 | 5599 | 97 (86) | 96 (88) | 86 (81) | 90 |
| 1b72 | 50 | | 5.7 | 1384 | 4229 | 88 (84) | 96 (96) | 68 (72) | 84 |
| 1csp | 67 | | 6.0 | 1069 | 2365 | 79 (67) | 87 (70) | 75 (67) | 88 |
| 1di2 | 68 | | 6.8 | 1230 | 4964 | 88 (79) | 66 (54) | 74 (75) | 97 |
| 1dcj | 72 | | 7.0 | 1059 | 4381 | 45 (29) | 83 (65) | 65 (56) | 89 |
| 1mky | 77 | | 5.0 | 1572 | 3947 | 86 (70) | 83 (65) | 87 (71) | 90 |
| 1o2f | 77 | | 5.5 | 1059 | 4506 | 78 (69) | 84 (73) | 79 (66) | 75 |
| 1r69 | 61 | | 7.5 | 1036 | 5058 | 93 (89) | 97 (89) | 74 (72) | 92 |
| 1shf | 59 | | 7.1 | 774 | 3213 | 76 (56) | 71 (51) | 85 (69) | 80 |
| 1tif | 57 | | 4.4 | 1349 | 3233 | 89 (79) | 91 (81) | 76 (70) | 93 |
| 1tig⁵ | 86 | | 5.4 | 1194 | 3323 | 83 (70) | N/A | 69 (67) | 83 |
| 1ubq | 73 | | 7.7 | 1152 | 3405 | 92 (69) | 94 (82) | 88 (67) | 90 |

¹ Target sequences are from our previous homology-free ItFix study,[8] which have been selected from a previous Rosetta prediction study.[21]

² Rama enrichment is the positional average of the number of PDB angles used to generate the Rama distribution for each method. The Q3 and Q8 (in parentheses) 2° structure prediction accuracies are reported for the previous homology-free study, an updated homology-free version, and SPEED sampling.

³ SSPro and PSIPRED 2° structure predictions are obtained from their respective servers.[132, 133]

⁴ NEFF[140] is a Shannon entropy measure on a scale of 1-20 of the amino acid diversity of the sequence alignment (1 = single amino acid, 20 = all amino acids are equally likely).

⁵ Folding of 1tig could not converge in a reasonable amount of time because radial terms could not be satisfied in a small number of MCSA steps.
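
The NEFF footnote describes a diversity measure that runs from 1 (a single amino acid) to 20 (all amino acids equally likely). One common reading that has exactly this range is the exponential of the Shannon entropy of each alignment column, averaged over columns. The sketch below follows that reading as an assumption; the definition in reference [140] is not reproduced in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_neff(column, gap="-"):
    """Exponential of the Shannon entropy of one alignment column."""
    residues = [aa for aa in column if aa != gap]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)


def alignment_neff(sequences):
    """Mean column NEFF over an alignment given as equal-length strings."""
    values = [column_neff(col) for col in zip(*sequences)]
    return sum(values) / len(values)
```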

The minimum distance between the hydrogen bond donors and acceptors is described by the

distance criterion from the hydrogen bond potential of Kortemme et al.,[100]

$$[\,1.7 < \mathrm{dist}(\mathrm{CO}_i, \mathrm{NH}_j) < 2.6\,] \ \text{or} \ [\,1.7 < \mathrm{dist}(\mathrm{NH}_i, \mathrm{CO}_j) < 2.6\,]$$

In addition to this distance constraint, the hydrogen bond energy function also considers

the influence of hydrogen bond orientation. The following term is used to describe the

orientation between two covalent bonds, an example being the backbone carbonyl (C=O) bond

and amide bond (N-H) orientation:

,

In this equation, 12 represents the angle between the and vectors and 21 represents the

angle between the and vectors. We impose a 90° minimum on to maintain a planar

sheet network for both parallel and anti-parallel sheet networks.

Our previous study [8] finds that the statistical potential alone is often incapable of

generating a large proportion of well-collapsed models for the targets that contain β-sheets.

These simulation models commonly contain attributes that are uncharacteristic of experimental

models, such as buried polar residues, unpaired buried strands, and a high radius of gyration of

C atoms (Rg). Buried polar residues and buried unpaired beta strands are symptomatic of an

energetic benefit allotted for the close pairing of non-polar C atoms and the lack of penalty for

the close pairing of polar and non-polar C atoms. Thus, the prior treatment allows a strand to be

buried in the hydrophobic core of a model so long as it contains a sufficient number of non-polar

residues. High-Rg models can be low in energy due to highly optimized sub-structures, such as

hairpins, which are formed at the expense of integrating the entire chain into a properly-

collapsed model.

Adding a penalty for the burial of polar residues impedes the generation of low-Rg

models, and forcing a lower Rg on the chain can worsen the burial of polar groups and beta-

strands. For this reason, in addition to Rg, two radial terms are included to encourage the proper

global collapse of the entire chain (Fig. 3.5.1). Radial uniformity (Ru) is the standard deviation of the

distances of C atoms from the C center of mass (cm),

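$$R_u = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i - \bar{d}\right)^2}, \quad \text{where } d_i = \lvert \mathbf{r}_i - \mathbf{r}_{cm} \rvert \text{ and } \bar{d} = \frac{1}{N}\sum_{i=1}^{N} d_i$$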

The Ru term is necessary because small globular single-domain proteins rarely have a

completely buried segment of chain, but instead have an amphipathic alternation between

exposed and buried side chains. Enforcing a small value of Ru prevents any portion of the chain

from being too close to the center of mass and therefore diminishes the propensity for the burial

of entire 2° structure units in the core of the model.

Rg and Ru are minimized to create a collapsed chain with no completely buried chain

segments. A third radial term, the ratio of the Rg of the non-polar C atoms to the Rg of the polar

C atoms, is called burial ratio (Br):

$$B_r = R_g^{\text{non-polar}} / R_g^{\text{polar}}$$

Most small proteins have the non-polar C atoms closer to the center of the protein,

whereas the polar C atoms are more likely to be on the exterior, so Br is less than unity to

capture the global hydrophobic burial of globular proteins. The global burial induced by the Br

term contrasts with the local optimization of statistical potentials, which can optimize local subsets

of hydrophobic atom pairs at the expense of global burial.
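The three radial terms can be illustrated with a short sketch. It assumes one representative atom per residue given as (x, y, z) coordinates, an illustrative set of non-polar residues, a population (1/N) standard deviation for Ru, and that both partial radii of gyration in Br are measured about the center of mass of all atoms. None of these choices is specified in the text, and the function names are hypothetical.

```python
import math

# Illustrative non-polar residue set (an assumption, not taken from the text).
NON_POLAR = set("AVLIMFWC")

def center_of_mass(points):
    """Unweighted centroid of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def rms_distance(points, center):
    """Root-mean-square distance of the points from a given center."""
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / len(points))

def radial_terms(points, sequence):
    """Return (Rg, Ru, Br) for one model."""
    cm = center_of_mass(points)
    dists = [math.dist(p, cm) for p in points]
    mean_d = sum(dists) / len(dists)
    rg = rms_distance(points, cm)
    # Ru: standard deviation of the distances from the center of mass.
    ru = math.sqrt(sum((d - mean_d) ** 2 for d in dists) / len(dists))
    non_polar = [p for p, aa in zip(points, sequence) if aa in NON_POLAR]
    polar = [p for p, aa in zip(points, sequence) if aa not in NON_POLAR]
    # Br: non-polar over polar radius of gyration (both about cm, by assumption).
    br = rms_distance(non_polar, cm) / rms_distance(polar, cm)
    return rg, ru, br
```

In this toy setting, a model whose non-polar atoms sit closer to the center than its polar atoms yields Br below unity, which is the behavior the Br term is meant to reward.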

We add the three radial terms to obtain the overall scoring function, where $E_{\text{DOPE-repulsive}}$ is

the sum of the positive (repulsive) DOPE terms,

Figure 3.5.1 Radial protein structure terms. A single domain protein structure is treated as a sphere with a C radius of gyration, an inner hydrophobic C radius of gyration, and an outer hydrophilic C radius of gyration. The burial ratio (Br) is the ratio of the latter two terms. The Radial uniformity (Ru) is the standard deviation of the C center of mass (CM) to C distances.

Each MCSA simulation is repeated using Eradial until the Br is less than 0.80. We cap the

minimum value of Ru at 2.5 Å, since it is very easy for the chain to fold into a ring structure with

Ru close to 0. The multiplied radial terms have a coefficient of 100, so that their combined

magnitude is significant relative to the repulsive part of DOPE.

The combined radial energy and filtering have a significant effect on model quality (Fig.

3.5.2); the lowest energy models from the final ensemble are more similar to the native.
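The energy filtering step can be sketched as follows. This is a minimal illustration, assuming that models are discarded when their energy exceeds the ensemble mean plus a multiple of the standard deviation, and that the population standard deviation is used; the function and argument names are hypothetical.

```python
from statistics import mean, pstdev

def fraction_below(energies, rmsds, rmsd_cutoff, x_in_sd=0.0):
    """Fraction of retained models with RMSD below rmsd_cutoff.

    Models with energy above mean + x_in_sd * s are discarded first,
    where s is the standard deviation of the energy over all models.
    """
    threshold = mean(energies) + x_in_sd * pstdev(energies)
    kept = [r for e, r in zip(energies, rmsds) if e <= threshold]
    if not kept:
        return 0.0
    return sum(r < rmsd_cutoff for r in kept) / len(kept)
```

Evaluating the function over a range of RMSD cutoffs for several values of `x_in_sd` gives one trace per filtering level.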

The radial terms are used throughout the ItFix algorithm until the 2° structure is

determined. For the final round of folding, if the 2° structure is all-α, the DOPE-PW energy

function is used, otherwise the energy function is used. In either case, the size of the final

ensemble is 10,000 models, which is more than sufficient for reproducible average accuracy

(Fig. 3.5.3), but perhaps minimally sufficient for the purposes of clustering or reproducing the

absolute best model. The final model refinement process uses the DOPE-PW energy function

for all targets.

3.6 Improvement in 3° structure

SPEED significantly improves the quality of 3° models compared to the homology-free

treatment (Table 6). The lowest Cα-RMSD among the models (best model) is lower for SPEED in

every case except 1di2. Because the best model is not always a very reproducible metric of overall

performance, we consider instead the fraction of final structures below 5 Å Cα-RMSD to the

native structure (Fig. 3.6.1). This fraction is on average several times greater for SPEED than

for the homology-free approach when all other folding parameters (2° structure assignment,

energy weighting coefficients, etc.) are identical (Table 6, last column). The SPEED folding

ensemble for 1ubq is the most enhanced, containing six times more native-like models than the

homology-free ensemble. For four out of the twelve targets, the homology-free distribution

produces no models below 5 Å, and hence the SPEED enhancement factor is effectively infinite.

Even so, improvement is also evident across all ranges of Cα-RMSD. For 1b72, the addition of SPEED improves the 3° structure ensemble such that 83% of the models are within 5 Å Cα-RMSD of the native structure (Fig. 3.6.1b), which compares favorably to 76% of the homology-free models falling below that threshold. It should be noted that, for the purposes of direct comparison of the homology-free and SPEED Rama distributions, the SPEED 2° structure definition was used in the homology-free Rama distribution. Since the SPEED 2° structure is typically more accurate, in reality the 3° structure accuracy enhancement due to SPEED is likely much larger. Unfortunately, we did not have the computational resources to generate the homology-free ItFix 2° structure definitions.
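The enhancement factor in the last column of Table 6 is the ratio of the two native-like fractions. A minimal sketch, with a hypothetical function name and with the zero-denominator case treated as infinite, as described for the targets with no homology-free models below 5 Å:

```python
import math

def enhancement_factor(frac_speed, frac_hfree):
    """Ratio of native-like fractions (SPEED over homology-free)."""
    if frac_hfree == 0:
        # No native-like homology-free models: the factor is unbounded.
        return math.inf if frac_speed > 0 else float("nan")
    return frac_speed / frac_hfree
```

For 1b72, the quoted fractions of 83% and 76% give 0.83 / 0.76 ≈ 1.09, in line with the value of 1.1 listed in Table 6.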

Compared to the α/β and β targets, the three α targets have the most native-like ensembles

for both homology-free and SPEED methods, and, hence, this class yields the smallest

enhancement factor. Conversely, the α/β and β targets produce a very small fraction of native-like

models for both SPEED and homology-free methods, but have the largest increase in native-like

models due to the use of SPEED (Table 6). Neither the SPEED nor homology-free methods

generate native-like models for 1di2, most likely because it is considerably more prolate in shape

than the rest of the proteins, and the radial energy terms (see Section 3.5) enforce a spherical bias (Table

7).

An obvious question is whether the increase in the accuracy of 3° structure prediction

found with SPEED emerges from the improvement of a few residues with low homology-free

native (φ, ψ) probability or from small improvements across the entire sequence.

Figure 3.5.2 Effect of energy filtering on model accuracy. The accuracy of the folding ensemble increases as higher energy models are removed. Shown is the fraction of models below varying Cα-RMSD cutoffs. Traces represent the results after removal of models with energies greater than <Energy> + X, where X = 0, ±s, ±2s, and s is the standard deviation in energy for all models.

Figure 3.5.3 Reproducibility of the final model ensembles. The final folding ensembles (10,000 models before refinement) are divided into five random sets of 2000 models for the targets 1dcj, 1tif, and 1r69. The lack of diversity illustrates that the accuracy distribution is reproducible.

Table 6 SPEED 3° structure prediction comparison¹

| PDB ID | Size | Fold | NEFF | 3° structure accuracy, Previous ItFix¹ | 3° structure accuracy, ItFix-hfree² | 3° structure accuracy, ItFix-SPEED³ | Cα-5.0X⁴ |
|---|---|---|---|---|---|---|---|
| 1af7 | 69 | | 7.3 | 2.9 (2.5) | 2.5 (2.5) | 2.6 (1.6) | 1.2 |
| 1b72 | 50 | | 5.7 | 3.5 (1.6) | 3.6 (1.7) | 3.5 (1.6) | 1.1 |
| 1csp | 67 | | 6.0 | 10.5 (6.0) | NC (4.6) | 5.2 (4.1) | 4.2 |
| 1di2 | 68 | | 6.8 | 6.1 (4.6) | NC (6.8) | NC (6.6) | N/A |
| 1dcj | 72 | | 7.0 | 13.3 (7.6) | NC (5.9) | 5.3 (4.6) | ∞ |
| 1mky | 77 | | 5.0 | 6.9 (6.1) | NC (4.4) | 5.2 (4.2) | ∞ |
| 1o2f | 77 | | 5.5 | 11.2 (5.8) | NC (6.7) | NC (4.2) | ∞ |
| 1r69 | 61 | | 7.5 | 4.2 (2.4) | 3.7 (2.1) | 3.5 (1.6) | 1.8 |
| 1shf | 59 | | 7.1 | 12.2 (6.7) | NC (6.2) | NC (3.8) | ∞ |
| 1tif | 57 | | 4.4 | 11.3 (4.2) | 5.7 (3.7) | 5.4 (3.2) | 4.3 |
| 1tig⁵ | 86 | | 5.4 | 6.4 (5.3) | N/A | N/A | N/A |
| 1ubq | 73 | | 7.7 | 5.3 (3.1) | 4.4 (3.6) | 2.6 (1.9) | 6.0 |

¹The Cα-RMSD to the native of the prediction based on energy and of the best model (in parentheses) from our previous homology-free ItFix study.[8]
²Folding with the homology-free Rama distribution and with the final SPEED 2° structure (2000 trajectories); cluster and refinement prediction and best model (in parentheses).
³Folding with the SPEED Rama distribution with the final SPEED 2° structure (10,000 trajectories); cluster and refinement prediction and best model (in parentheses).
⁴Ratio of the percentage of models below 5.0 Å Cα-RMSD to native for SPEED (column 7) to that for homology-free (column 6).
⁵Folding of 1tig could not converge in a reasonable amount of time because the radial terms could not be satisfied in a small number of MCSA steps.

Figure 3.6.1 Improvement in 3° structure prediction using SPEED. The percentage of models with a Cα-RMSD to the native below a cutoff level (x-axis) provides a comparison of the overall accuracy of the folding ensembles. The top cluster (solid line) from SPEED is much better than the entire SPEED ensemble (dashed line), which is better than the ensemble generated using the homology-free ItFix Rama distribution with the SPEED-generated 2° structure assignments (dotted line).

Although it is impractical to test the effects of SPEED one residue at a time, the general

behavior is illustrated for 1dcj, the protein for which the use of SPEED introduces the largest

improvement in the accuracy of both 2° and 3° structure predictions. Without SPEED, we fail to

predict the second helix, which contains the PG combination and has low intrinsic helicity. Even

with the 2° structure of this helix correctly fixed, the 3° accuracy is still inferior without SPEED (Table 6),

presumably due to the extremely low homology-free turn probability at position 52 compared to

the SPEED-based probability (P_hfree = 0.02; P_SPEED = 0.32). Hence, we believe that the larger

improvements due to SPEED probably can be localized to a few critical positions. However, the

improvement of near native structures (e.g., RMSD below 3-5 Å) likely arises from the

cumulative effect of enhancement at many positions.

3.7 Averaging the energy function across the MSA

Analogous to the SPEED-improved Rama distribution, we have also tested an energy

function that is averaged over the MSA in order to incorporate additional sequence information,

specifically via sequence correlations in the long-range interactions. The analysis of correlated

mutations in sequence alignments has been used previously in other prediction and design

methods [141-144]. The new energy function uses the original statistical potential and the same

pairwise distances, $D_{i,j}$, between the pairs of amino acids. However, the new energy for each (i,j)

residue pair is now the average energy calculated using the distance $D_{i,j}$ and the statistical potential

appropriate for the amino acid pair found in each sequence in the MSA. This procedure includes

extra long-range information while maintaining the pairwise amino acid correlations inherent in

each aligned sequence.
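The MSA-averaged energy can be sketched as follows. The sketch assumes gap-free aligned sequences of equal length and takes the statistical potential as a caller-supplied function; `toy_potential` is a hypothetical stand-in used only to make the example runnable and is not the potential used in this work.

```python
def msa_averaged_energy(msa, distances, pair_potential):
    """Average the pairwise energy over all aligned sequences.

    msa            -- list of equal-length, gap-free aligned sequences (assumption)
    distances      -- {(i, j): D_ij} taken from the current model of the target
    pair_potential -- callable (aa_i, aa_j, distance) -> energy (hypothetical stand-in)
    """
    total = 0.0
    for seq in msa:
        # Same distances for every sequence; only the residue identities change.
        for (i, j), d in distances.items():
            total += pair_potential(seq[i], seq[j], d)
    return total / len(msa)


def toy_potential(aa_i, aa_j, d):
    """Toy stand-in for a statistical potential: reward close hydrophobic pairs."""
    hydrophobic = set("AVLIMFWC")
    return -1.0 if aa_i in hydrophobic and aa_j in hydrophobic and d < 8.0 else 0.0
```

Because the residue identities at positions i and j are read from the same aligned sequence, the pairwise correlations within each sequence are preserved in the average.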

Table 7 Radial terms for protein structure

| PDB ID | Length | Native Br¹ | Consensus native Br² | Rg (Å) | Ru (Å)³ |
|---|---|---|---|---|---|
| 1af7 | 69 | 0.83 | 0.69 | 10.9 | 2.5 |
| 1b72 | 50 | 0.81 | 0.52 | 10.0 | 2.8 |
| 1csp | 67 | 0.73 | 0.57 | 10.6 | 3.2 |
| 1di2 | 68 | 0.71 | 0.62 | 12.3 | 4.9 |
| 1dcj | 72 | 0.77 | 0.69 | 10.4 | 2.8 |
| 1mky | 77 | 0.76 | 0.66 | 11.7 | 3.2 |
| 1o2f | 77 | 0.79 | 0.72 | 10.7 | 2.5 |
| 1shf | 59 | 0.75 | 0.73 | 10.0 | 2.7 |
| 1tif | 57 | 0.73 | 0.69 | 9.5 | 2.5 |
| 1tig | 86 | 0.84 | 0.81 | 12.1 | 3.8 |
| 1ubq | 72 | 0.76 | 0.71 | 10.7 | 2.4 |
| 1r69 | 61 | 0.79 | 0.69 | 9.9 | 2.0 |
| mean | 68 | 0.77 | 0.68 | 10.7 | 2.9 |

¹Burial ratio of the target sequence.
²Burial ratio of the consensus sequence of the multiple sequence alignment of the target sequence.
³Standard deviation of the C distance from the C center of mass.

Although this method is intellectually appealing, the results are variable. We suspect that

for each interaction, the optimal (lowest energy) separation distance for each contact varies too

much for the different combinations of residues found in the sequences in the MSA.

Consequently, the energy surface averaged across the sequences in the MSA has a shallower

minimum compared to the energy function calculated using only the target sequence.

Cursory tests using a single consensus sequence with the standard energy function also fail to

produce uniformly superior results. Nevertheless, we maintain that a careful and clever

implementation or extension of these ideas could yield strong improvements.

3.8 Clustering

The enhancement of the fraction of native-like models obtained using SPEED has

additional implications for 3° structure prediction. In our previous homology-free study, the

predicted structure is the lowest energy model from the final folding ensemble. However, that

structure is native-like (< 5 Å) only for about half of the targets, failing mostly when few or no

accurate models are generated. Although the use of SPEED increases the proportion of accurate

models, energy alone is insufficient for reliably choosing the best model. This situation is

common in structure prediction. As a result, clustering methods are frequently employed

because repeatedly occurring low energy conformations are typically more accurate than

structurally isolated low-energy models [145].

The lowest energy models from the top clusters for the homology-free and SPEED-based

Rama distributions are presented when a cluster exists (Table 6). A larger fraction (8/12) of the

SPEED-based ensembles contains identifiable clusters compared to the homology-free

ensembles (6/12), and their size often is larger as well (Fig. 3.6.1). The largest cluster may be the

most accurate in terms of Cα-RMSD to the native, but it may share a similar average contact

profile with other, less accurate clusters (Fig. 3.8.1). Most noticeable are the contact profiles of the

two largest clusters of 1b72, which display almost identical contacts, but decidedly different

values for the average Cα-RMSD to the native (Cluster 1, < 4 Å; Cluster 2, > 10 Å). This result is

due to the simplicity of the 1b72 fold (3-helix bundle), which permits a low energy fold that is a

pseudo-mirror image fold of the native and therefore has similar contacts and similar average

energy. Given this energetic similarity, the Rama distribution determines the favorability of the

native conformation, with the SPEED protocol succeeding to a greater extent than the homology-

free protocol.

3.9 Confidence assessed from reproducibility

While numerous methods exist for structure prediction [9, 15, 21, 80, 97, 120], the

quantification of the accuracy and confidence of a prediction is a crucial but often

elusive component. Template-based methods typically infer confidence from the quality of the

available information used to generate an alignment and a consensus of aligned models [146-

148]. When predicting remote templates, this technique can suffer from a dearth of PDB

templates that independently align to the target sequence with high confidence. This situation

precludes any meaningful clustering analysis and therefore imparts a large uncertainty to model

quality.

Template-free prediction methods have an advantage of generating a large number of

models that can be clustered. One notable feature of our method is the high correlation (R² =

0.85) between the average pairwise Cα-RMSD among models in the predicted cluster and the average

accuracy (Cα-RMSD to the native) of the models within the cluster (Fig. 3.9.1).

Figure 3.8.1 Comparison of contacts for the top clusters of several targets. Each map is a C-C contact matrix with a 10.0 Å distance cutoff for the α targets (1af7, 1b72) and an 8.0 Å distance cutoff for the α/β and β targets (1mky and 1csp). Contacts of the native model are presented on the lower right of each map. The largest cluster for 1af7 has the most native contacts and has an average Cα-RMSD to the native below 4 Å. The next largest 1af7 cluster, which has an average greater than 10 Å Cα-RMSD to the native, has many native and non-native contacts. The largest 1b72 cluster is the most native in terms of Cα-RMSD (< 3 Å average), but contains identical contacts to the next largest cluster (> 10 Å Cα-RMSD to native average), which is the mirror-image fold of the native. The contact matrices of the top clusters of 1mky and 1csp are both very native-like.

Figure 3.9.1 Assessing global accuracy from reproducibility of the top cluster. The mean Cα-RMSD to native of the top cluster is strongly correlated with the mean Cα-RMSD between the models in that cluster, indicating that the latter metric can be used as a measure of the predicted model's accuracy.

This trend suggests that template-free models that are reproduced with a high degree of structural

similarity tend to be proportionately more accurate than models that are structurally further

removed from their closest neighbors. Notably, the average Cα-RMSD between models in a

cluster is typically one to two angstroms lower than the average Cα-RMSD to the native of the

cluster, suggesting that the top cluster has converged upon a stable but slightly non-native energy

minimum. Nonetheless, this difference can be factored in when quantifying the predicted

accuracy and may be diminished by improvements in the energy function and sampling

distributions.

In addition to global accuracy, the residue level RMSD at each position is calculated to

quantify the confidence of the prediction for each amino acid in the protein (Fig. 3.9.2).

Specifically, the average and standard deviation of the distance at each position between

the aligned models in the cluster are highly correlated with the respective average distance and

standard deviation at each position between the aligned cluster models and the native model,

suggesting that the accuracy and uncertainty at each position can be predicted.
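The position-resolved reproducibility measure can be sketched as follows. The sketch assumes that the cluster models have already been superimposed on a common frame and that each model is a list of (x, y, z) coordinates with one atom per position; the function name is hypothetical, and the population standard deviation is an assumption.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def per_position_spread(models):
    """Per-position (mean, standard deviation) of inter-model distances.

    models -- list of pre-superimposed models, each a list of (x, y, z) points
    """
    n_positions = len(models[0])
    spread = []
    for k in range(n_positions):
        # Distance at position k for every pair of models in the cluster.
        d = [math.dist(a[k], b[k]) for a, b in combinations(models, 2)]
        spread.append((mean(d), pstdev(d)))
    return spread
```

Positions with a small mean and a small standard deviation are those on which the cluster agrees, and would therefore be assigned the highest confidence.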

This finding has implications for other template-free methods, which may suffer method-specific difficulties when trying to quantify the confidence of model predictions. Most template-free methods rely on large fragments from PDB models [21, 80, 137], but the number of these fragments is limited, and their use may introduce some bias due to the highly restricted nature of the conformational search. In other words, independently converging on very similar models may not be as meaningful when the likelihood of sampling the same conformation is very high. Since each conformational change in ItFix is the rotation of only a single pair of φ, ψ angles, a resulting ensemble consisting of a cluster of very similar models can be treated with higher confidence, given that the accessible conformation space is much larger than in fragment-based

Figure 3.9.2 Assessing local accuracy from reproducibility of top cluster. Position-resolved model accuracy and confidence. The average aligned distance between all models in the predicted cluster and the standard deviation of that distance is determined for each position. These values are highly correlated to the respective average aligned distance and standard deviation at each position between each model in the cluster and the native structure. The standard deviation for each of these values also is highly correlated, suggesting the ability to use clustering to determine confidence for each position in a predicted model.

methods. Similarly, the bias likely is even weaker for all-atom physics-based simulations [15]

and ab initio folding simulations [9], which have the least restricted conformational search.

ItFix-SPEED may combine the best of both a restricted and an unbiased conformational search with

regard to assessing accuracy from the structural diversity of the largest cluster.

3.10 Performance in CASP8

We have applied an early version of the ItFix-SPEED protocol in the 2008 Critical

Assessment of Structure Prediction (CASP8) for the human/server targets when a suitable

template from the PDB could not be identified by the threading program RAPTOR [83, 149],

one of the top performing entries in the server category. Of these targets, the 120 residue T0482

is the only small, globular, single-domain free-modeling target with no confident templates,

making it a prime candidate for the ItFix-SPEED methodology. This target has been subjected to

multiple rounds of ItFix-SPEED, and our final three submitted models are very similar with

highly accurate 2° and 3° structures (Fig. 3.10.1a). Our predicted 2° structure is slightly

improved over the PSIPRED[10] prediction. Due to time constraints, we initially assigned

PSIPRED’s high confidence (>90%) predictions at ~ 10% of the positions (total wall clock time

for prediction was under 12 hours from start of prediction to submission). When the central 100

residues (ignoring the solvent exposed ends of the NMR structure) of these models are aligned to

the now-published structure, the Cα-RMSD to native is 4.8 Å. Hence, our algorithm is able to

confidently predict the correct structure without any false positive submissions. In addition, our

top model has the lowest Cα-RMSD among all submitted #1 models. We have performed

commendably for other challenging template-free modeling targets, such as the D1 subdomain of

protein T0405 (Fig. 3.10.1b). These results constitute strong evidence of the predictive

capabilities of the ItFix-SPEED algorithm. Our participation in CASP8 also includes predictions

for sequences that have only poor templates and are considered template-free modeling targets.

For target T0429, RAPTOR chooses multiple homology-based templates, but it is uncertain as to

which template is correct for the C-terminal domain. ItFix-SPEED folding simulations for this

domain have been used to compare the average contact matrix of our folding simulations to the

contacts of each possible template (Fig. 3.10.1d). This process has enabled us to choose a better

template (T0429-2ckk) than RAPTOR’s top scoring template.

The SPEED-based sampling protocol also has been used to determine the structure of the

insertions of unknown structure that are present in RAPTOR-generated models. These situations

have been treated by breaking the chain at one end of the insertion and then folding this free end

in the context of the entire protein. The most successful outcome is for a 24 residue insertion for

target T0464, where our prediction ranks as one of the top submissions (Fig. 3.10.1c).

3.11 Discussion and Conclusions

Our computationally rapid algorithm using only single (φ, ψ) dihedral angle moves can

generate very accurate predictions of both 2° and 3° structures without relying on any known

structures, templates, or fragments. For the test set, we typically predict 2° structure with ~90%

accuracy, while the best 3° structures for 4/12 of the targets have a Cα-RMSD below 2 Å. Hence,

given intelligent search strategies and scoring functions, C representations can be used to

accurately predict 2° and 3° structures.

Structure prediction is beyond current capabilities for the vast majority of the families

identified by large-scale sequencing efforts [2, 20]. The number of sequences with minimal

sequence similarity to known structures is increasing at a rate that outpaces our ability to identify

Figure 3.10.1 ItFix-SPEED blind predictions in CASP8. a) 2° and 3° structure prediction of target T0482. The Global Distance Test (GDT) value is the % of the residues within a cut-off distance of the native structure. This cut-off distance is the y-value on the plot (e.g., for the ItFix prediction, 83% and 100% of the residues are predicted to within 4.7 and 7.8 Å of the native structure, respectively). The GDT trace for the ItFix prediction (blue line) is the rightmost of all the Model 1 predictions. In addition, the Cα-RMSD to native is the lowest of all the Model 1 predictions. The ItFix-SPEED prediction for b) the entire Domain 1 of target T0405, and c) the 24-residue insertion in RAPTOR’s predicted template for T0464. d) ItFix-SPEED selection of the best template identified by RAPTOR based on average predicted tertiary contacts. Contact map, upper left: ItFix average contacts for the final structures; lower right: contacts of one of RAPTOR’s lower ranked templates, which is closer to the native structure than its top ranked template. Values in parentheses are the Cα-RMSD between predictions and the native structure. GDT plots are taken from the CASP8 website (www.predictioncenter.org/casp8/index.cgi).

new families [20]. Currently, only about one third of the single domain architectures have known

folds [20].

The ItFix-SPEED procedure is well suited to contribute to mapping the protein universe,

particularly for low homology sequences. Because our procedure utilizes only multiple sequence

alignments, it can take advantage of the 10⁷ known sequences, and not be limited by the ~10⁴

unique structures in the PDB. For CASP8 target T0482, no member of its family had a known

structure, although its fold is not new. The ItFix-SPEED procedure accurately predicted its

structure using only 50 non-redundant sequence homologues and no structural information.

Furthermore, the ItFix-SPEED procedure is able to quantify the global and local accuracy of its

prediction from the reproducibility of the trajectories, a highly desirable feature from the

perspective of users of any sequence database annotation.

3.12 Methods

Generation of Sequence Alignments

Sequence alignments are generated by PSI-BLAST [69] using the executables from

NCBI on the non-redundant database. An inter-sequence similarity cutoff of 65% is imposed

with CD-HIT [150]. PSI-BLAST searches are performed in three passes with an E-value cutoff

of 1.0. We choose only sequences that cover over 90% of the target sequence length and have

gaps that span at most one position. These constraints are chosen such that sequences are very

likely to approximate the same structure as the target. As a result of these constraints, the average

E-value of each sequence in an alignment is orders of magnitude lower than 1.0.
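The two sequence selection criteria can be sketched as a simple filter. The sketch assumes that each hit is given as a row aligned to the full target (one character per target position, with '-' for gaps), that coverage is the fraction of non-gap positions, and that the gap-length limit applies to internal gaps only; these are interpretations of the criteria, and the function name is hypothetical.

```python
import re

def passes_filter(aligned_row, min_coverage=0.90, max_gap_run=1):
    """Return True if an aligned row covers the target and has only short gaps."""
    coverage = sum(c != "-" for c in aligned_row) / len(aligned_row)
    # Ignore terminal gaps when measuring gap length (assumption).
    internal = aligned_row.strip("-")
    longest_gap = max((len(g) for g in re.findall(r"-+", internal)), default=0)
    return coverage > min_coverage and longest_gap <= max_gap_run
```

Rows that fail either test are dropped before the alignment is used to build the sampling distributions.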

SPEED sampling

The MSA is used to generate an amino acid substitution matrix at each position in the

target sequence. Any amino acid that occurs in more than 10% of the alignments is included at

that position. If a position has only one amino acid in its substitution matrix, the amino acid

occurrence threshold is decremented by 1% until there is more than 1 substitution, with the

exception of proline, which is kept as the sole amino acid at a position down to 5% probability as

long as there are no neighboring positions with prolines that occur at a greater probability. If

proline is the sole amino acid in the MSA-generated substitution matrix, we mutate the target

sequence at that position to proline. In all other cases the sequence used during folding remains

the same as the target sequence.
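The threshold rule for building the per-position substitution set can be sketched as follows. The sketch covers only the general rule (10% threshold, lowered in 1% steps until more than one amino acid qualifies) and omits the proline exception; it also assumes that frequencies are computed over non-gap residues and that the threshold stops at 0%. The function name is hypothetical.

```python
from collections import Counter

def substitution_set(column, start_pct=10):
    """Amino acids allowed at one alignment position.

    column -- string of residues found at this position across the MSA
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    allowed = set()
    # Lower the threshold in 1% steps until more than one amino acid qualifies.
    for pct in range(start_pct, -1, -1):
        allowed = {aa for aa, n in counts.items() if 100 * n > pct * total}
        if len(allowed) > 1:
            break
    return allowed
```

A fully conserved position returns a single amino acid, since no lower threshold can admit a second one.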

We initially tried calculating the SPEED distribution of a position by adding the Rama

distributions at that position for each sequence in the alignment. The SPEED distributions

created from this method, however, are more similar to the homology-free distribution because

the target sequence amino acid often has the highest probability in the alignment and would be

weighted proportionately in the SPEED distribution. Using a substitution matrix, on the other

hand, weights all amino acids above a threshold equally, thereby rendering the resulting Rama

distribution less similar to the homology-free distribution.

Since the statistics for the distributions constructed from an MSA permit many different

combinations of amino acids, the area of the Rama map with vanishing probability tends to be

much lower for the SPEED distribution than previously used because of the added MSA-

identified combinations. In fact, the average number of angles per position used to generate a

SPEED distribution is three to five-fold larger than the number of angles used to generate a

homology-free distribution (Table 5). As seen in Figure 3.1.1b and the subsequent predictions,

this added diversity does not dilute the specificity of the conformational search; indeed the

distributions are more native-like.

Ramachandran sampling

Our prior treatment employs a sampling of specific φ, ψ angle pairs from a library

generated from high-resolution crystal structures, conditional on the 2° structure and nearest

neighbor amino acid identities. The present study likewise employs a distribution of φ, ψ angles

with the same dependencies, but instead of sampling from a large list of angles extracted from

PDB models, the φ, ψ angles are chosen from a Rama distribution that is generated for each

position based on the amino acid identity and the 2° structure specification of that position and of

its nearest neighbors. Thus, Rama distributions are calculated for the central residue in each of

the distinct 8000 combinations of three contiguous amino acids, conditional upon the amino acid

identity and on the 2° structure of all three residues. Because the ItFix simulations consider six

possible categories of 2° structure for the construction of the sampling distributions (H: helix, E:

strand, C: coil, A: everything, O: not helix, Q: not strand), 8000 × 6³ = 1,728,000 possible Rama

distributions are constructed to describe the possible 8000 amino acid triplets. Each Rama

distribution has 72² = 5,184 bins of 5° × 5°, and each bin is assigned a probability that is determined by

the frequency of occurrence of these backbone dihedral angles in the PDB for the specific conditions

of amino acid identities and 2° structure. A Rama distribution accommodates the increase in

PDB-derived angles introduced by SPEED without increasing the system memory, as occurs

when each angle is explicitly stored in memory.

The sampling of φ, ψ angles begins by selecting a bin in Rama space according to the

probability assigned to that bin (e.g., a bin that contains 1.5% of the angle counts for the

distribution at that position has a 0.015 probability of being selected). This bin selection is

followed by the selection of a random angle uniformly from within the 5°x5° window of that bin.
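The two-step sampling (bin selection by probability, then a uniform draw within the bin) can be sketched as follows. The sketch assumes that a Rama distribution is stored as a mapping from bin indices (i, j) to probabilities, with bin i covering φ from -180° + 5i to -180° + 5(i + 1) and likewise for ψ; this storage layout and the function name are hypothetical.

```python
import random

BIN_WIDTH = 5.0   # degrees
N_BINS = 72       # 360 / 5 bins per axis, 72 * 72 = 5184 bins per map

def sample_phi_psi(bin_probs, rng=random):
    """Draw one (phi, psi) pair from a binned Rama distribution.

    bin_probs -- {(i, j): probability} over the occupied 5 x 5 degree bins
    """
    bins = list(bin_probs)
    weights = [bin_probs[b] for b in bins]
    # Step 1: pick a bin with probability proportional to its weight.
    i, j = rng.choices(bins, weights=weights, k=1)[0]
    # Step 2: pick an angle uniformly within the chosen 5 x 5 degree window.
    phi = -180.0 + BIN_WIDTH * (i + rng.random())
    psi = -180.0 + BIN_WIDTH * (j + rng.random())
    return phi, psi
```

Storing only bin probabilities keeps the memory cost fixed at 5,184 values per distribution, regardless of how many PDB angles contributed to it.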

The Rama distribution of the central residue of the triplet INE (position 4 in 1tif) with all

allowed 2° structures is an example of one such sampling distribution (Fig 3.1.1b, top). If the

subsequent round of ItFix eliminates a 2° structure option at a position, the Rama distribution at

that position is changed accordingly (Fig. 3.1.1b, middle, bottom).

Clustering algorithm

After the ItFix protocol generates a predicted 2° structure, a further 10,000 folding

simulations are run to maximize the exploration of conformational space. The pairwise Cα-RMSD

matrix of the resulting 10,000 models is used to cluster the ensemble into groups of

models that all align to each other below a Cα-RMSD cutoff, an approach that is similar to the

SPICKER algorithm [145]. Other methods [151] cluster according to the Cα-Cα distance instead

of the pairwise Cα-RMSD, but we find that the Cα-Cα distances in some cases are highly

correlated even though the Cα-RMSD between the models is quite different (Fig. 3.8.1b).

When identifying clusters, the upper limit of the cutoff distance of the inter-model Cα-RMSD is

increased in increments of 1 Å starting at 1 Å until at least five clusters are found, or a 7 Å limit

is reached. Every model in the cluster must have a Cα-RMSD to every other model in the cluster

that is less than the cutoff distance. Targets with predicted all-α 2° structures have a minimum

cluster size of 5%, whereas the minimum size for targets with other predicted 2° structure types

can be as low as 0.04%. A cluster is eliminated if it contains a model present in a larger cluster.

The largest cluster is selected as the predicted model, unless it has an above-average energy and

there is another cluster with an energy more than one standard deviation below the average.

For α/β and β targets, the predicted cluster cannot consist of a fold that contains a predicted β-strand

that is not part of a β-sheet.
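
The cutoff schedule and the all-against-all membership rule can be sketched as follows. This is a hedged illustration only: the greedy seeding order, the function names, and the use of an absolute minimum cluster size are assumptions of the example, and the energy-based selection and the β-strand filter are omitted. Because the sketch never reuses an assigned model, its clusters are disjoint, so the rule that eliminates overlapping smaller clusters is satisfied trivially.

```python
def greedy_cliques(rmsd, cutoff, min_size):
    """Greedily build groups in which every pair of models is within `cutoff`."""
    n = len(rmsd)
    # Seed with the most connected models first (an assumption of this sketch).
    n_neighbours = [sum(1 for j in range(n) if j != i and rmsd[i][j] < cutoff)
                    for i in range(n)]
    order = sorted(range(n), key=lambda i: -n_neighbours[i])
    assigned = set()
    clusters = []
    for seed in order:
        if seed in assigned:
            continue
        members = [seed]
        for cand in order:
            if cand == seed or cand in assigned:
                continue
            # A candidate joins only if it is close to every current member.
            if all(rmsd[cand][m] < cutoff for m in members):
                members.append(cand)
        if len(members) >= min_size:
            clusters.append(members)
            assigned.update(members)
    return sorted(clusters, key=len, reverse=True)


def cluster_models(rmsd, min_size=2, start=1.0, step=1.0, limit=7.0, wanted=5):
    """Raise the cutoff in 1 A steps until `wanted` clusters exist or 7 A is reached."""
    cutoff = start
    while True:
        clusters = greedy_cliques(rmsd, cutoff, min_size)
        if len(clusters) >= wanted or cutoff >= limit:
            return cutoff, clusters
        cutoff += step
```

Here `rmsd` is a symmetric matrix (list of lists) of pairwise Cα-RMSD values. Finding the largest group of mutually close models exactly is expensive for 10,000 models, which is why the sketch settles for a greedy pass.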

Model refinement

One of the most important challenges of structure prediction is an effective exploration of

conformational space. Ideally an exhaustive refinement is performed for every model generated

by folding, but we take a computationally thrifty approach and refine only the models in the

largest cluster of each target. Refinement consists of the same move set and energy function as

folding, with the added restriction that we reject moves that increase the Rg, Br or Ru of the

starting model. Each model in the cluster is refined 100 times and the model with the lowest

average energy among all the refined models is chosen as the prediction listed in Table I.

Parallel scripting with Swift

The ItFix-SPEED algorithm has been implemented, tested and evaluated [152] using an

innovative parallel scripting language called Swift [153]. The Swift runtime system automates

parallelization, data management, and error recovery, and supports execution on a wide variety

of parallel computer systems. This allows the composition of flexible structure prediction scripts

to address new energy functions and explore algorithm enhancements, and to compare the

behavior of the algorithm under a wide range of conditions and parameter settings.

Chapter 4

New methods for protein design

This Chapter contains unpublished and ongoing research. I would like to thank members

of the Sosnick and Freed labs for advice and conversations, and in particular Chloe Antoniou for

consultations concerning experiments.

The need for protein-based medicines and industrial tools has motivated numerous

attempts to design novel protein sequences. This effort has been labeled the “inverse protein

folding problem,” and has led to the creation of novel sequences that encode novel protein

topologies as well as enzymes with new functions. Crucial to all design efforts is an

understanding of which amino acid sequences best encode a specific three-dimensional structure.

Current redesigns of known protein structures have produced sequences that are highly similar to

the naturally occurring sequence, suggesting that current design methods are limited in the ability

to fully explore the sequence space of a given fold. We present a novel design method that uses

the physical principles that govern protein folding to produce the most non-natural sequence

designs yet accomplished.

4.1 Introduction

Chapter 1.7 summarizes notable past and current protein design methods. The significant

protein sequence redesigns that have been determined through experiments to be structured and

stable (Table 1) are all notably similar to the natural sequences of their design template

structures. The purpose of this chapter is to design a protein that lacks the high similarity

observed in previous designed sequences.

One possibility is that the design template structures are highly optimized for the natural

sequence and the exact backbone geometry constrains the sequence to a near native identity. If

so, any sequence change should involve relaxation of the backbone to accommodate the new

residue. In some cases relaxed-backbone design methods are superior to methods that use a rigid

backbone [126] [156], but these studies only involve changes in a small number of residues and

may only require small backbone adjustments. The design methods of Table 1 all incorporate

backbone relaxation, but it is unclear how effectively any design algorithm can simulate the

relaxation required for a residue change that may involve concerted motions throughout the

chain in addition to sidechain rotamer sampling. Given the massive sequence space available,

thorough relaxation after sequence change would be computationally prohibitive.

Another explanation for the observed similarity is that natural sequences are highly

optimized and are not substantially varied within a fold family. For this reason, any properly

constructed design algorithm will rediscover many of the natural residues. Rosetta design

trajectories that include backbone relaxation support this suggestion by consistently generating

designs that are similar to the natural sequence, with core positions being the most similar (>

50%) [125]. The authors of this study also observe that the more similar to natural are the

designed sequences, the more conserved is the natural sequence family, further suggesting that

there is limited sequence variability for a given structure. Anecdotal evidence suggests that this

may not be true; ubiquitin family members that share only 11% amino acid identity

have PDB structures that deviate by less than 1.5 Å RMSD when aligned (Fig. 4.1.1), showing

that distantly related sequences can share nearly identical structures. Thus any design strategy

that can implement a thorough and unbiased sequence search with a realistically flexible

backbone should be able to escape near-native sequence space.

Figure 4.1.1 Structures of distant ubiquitin family members. Human (1ubq, cyan) and yeast (1l2n, green) ubiquitin structures align to 1.5 Å RMSD. The amino acid sequences of these structures differ at more than 89% of aligned positions.

In almost all of the verified designs, however, the sequence most similar to the design in

the sequence database is very highly related to the natural sequence of the template structure (>

50%) and sometimes is the natural sequence (Table 1). This occurs even though there are

numerous sequence family members that are close enough to the natural sequence to be

structurally very similar, yet far enough away to be distinct from wild-type.

Thus it appears that current backbone relaxation methods are not able to overcome the

initial template structure constraints to reach a truly unique design. Here, we introduce a method

that is designed to force itself away from the natural sequence of the template in order to

generate truly novel protein sequences.

4.2 Choice of ubiquitin as a design target

The choice of ubiquitin as a design target offers unique advantages. Primarily, ubiquitin

has a very large number of non-redundant sequence homologues (tens of thousands), compared

to the targets in Table 1 (typically hundreds). This implies that any ubiquitin design can be tested

against a very large number of sequences in order to test its uniqueness as a non-natural

sequence.

A potential challenge with ubiquitin is that previous attempts to redesign its hydrophobic core

have resulted in destabilized proteins [157]. This suggests that the sidechain packing

requirements may be very stringent, so that any loss of stability due to core

substitution would require compensation from stabilizing surface substitutions.

4.3 Design protocol

The design method seeks to reduce the size of the sequence search space, which is

prohibitively large at 20⁷² possible sequences for ubiquitin if every amino acid is allowed at

every position. The nineteen positions that are buried in the template model can be limited to

seven hydrophobic amino acids (ILE, VAL, LEU, PHE, TRP, TYR, ALA) and exposed residues

can be limited to all other residues and alanine, reducing the search space by a factor of 10¹⁷ to

14⁵³ × 7¹⁹. In addition, prolines are not sampled at positions that contain backbone amide

hydrogens involved in hydrogen bonds according to the dope-PW energy function (Chapters 2

and 3).
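
The size of the reduction can be checked with a few lines of arithmetic. The sketch below assumes 72 designed positions (19 buried plus 53 exposed), with 7 residue types allowed at buried positions and 14 (the 13 remaining types plus alanine) at exposed positions; the function name is introduced only for this example.

```python
import math


def design_search_space(n_buried=19, n_exposed=53):
    """Return (unrestricted, restricted, log10 reduction) for the design alphabet."""
    n_positions = n_buried + n_exposed
    unrestricted = 20 ** n_positions                  # any residue anywhere
    restricted = (14 ** n_exposed) * (7 ** n_buried)  # burial-based alphabets
    reduction = math.log10(unrestricted) - math.log10(restricted)
    return unrestricted, restricted, reduction


unrestricted, restricted, reduction = design_search_space()
print(f"log10 unrestricted = {math.log10(unrestricted):.1f}")
print(f"log10 restricted   = {math.log10(restricted):.1f}")
print(f"orders of magnitude removed = {reduction:.1f}")
```

The restricted space is still far too large to enumerate, which is why the subsequent steps build a segment-based sequence library before the Monte Carlo search.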

The next step in the procedure is to optimize the local φ/ψ dihedral propensity by

selecting the most probable sets of sequences in contiguous six-residue segments. The

conditional probability of a segment is the cumulative product of the probabilities of the five

dihedral pairs that constitute a six-residue segment. The remaining step is a Monte Carlo

minimization of the dope-PW energy function, sampling from the sequence library generated in the

previous step and using the same protocol as in Chapters 2 and 3; the sequence to which the

search converges constitutes the final design.

The methods in Table 1 incorporate amino acid compositional terms into their energy

functions in order to maintain a natural-like composition, which has been suggested to be important

for maintaining protein solubility [31, 32, 35, 158]. There is no explicit composition term

employed in this method, other than an arbitrarily high penalty for having more than ten percent

of the residues identical to the wild-type sequence.

Backbone relaxation is not modeled explicitly here; instead, the method relies on the flexibility of

the scoring functions. For example, in the Rama optimization stage the φ/ψ dihedral propensity is

calculated using 20° by 20° bins, which allows a large window of backbone torsional flexibility

around the native angles. Similarly, the dope-PW statistical potential has a bin size of 0.5 Å for

hydrogen-bonded β-strand interactions and 1.0 Å for long distance pair interactions. Each of

these distances is within the aligned distances of the respective interaction types within the

aligned structures of Figure 4.1.1, which suggests sufficient flexibility to compensate for

substitutions.

4.4 Negative design

Negative design is a challenging objective since it is difficult to anticipate which non-native

conformations might exist and should therefore be targeted for destabilization. As such, there is no

explicit negative design incorporated in this method, but since the objective of negative design is

to provide specificity to the desired conformation, the scoring function of this method contributes

implicitly by providing specificity towards the design template. At the local level, this is

accomplished by finding the sequence that is most likely to adopt the native backbone at the

expense of all other backbone conformations. For example, rather than look for the sequence that

has the highest probability for the native Rama angles relative to other sequences, this method

seeks the sequence that has the highest probability for the native Rama angles relative to other

Rama angles for that same sequence. This method thus selects the amino acid sequence that is

most specific to the backbone of the native fold. In practice the Rama propensity of the ubiquitin

redesign is higher than that of the natural sequence at most positions (Fig. 4.4.1).
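
The distinction between the two normalizations can be made concrete with a toy example. The counts, sequence labels and function names below are hypothetical and serve only to show that the two criteria can prefer different sequences; they are not data from this work.

```python
def native_specificity(counts, native_bin):
    """Share of each sequence's own Rama counts that falls in the native bin."""
    return {seq: bins.get(native_bin, 0) / sum(bins.values())
            for seq, bins in counts.items()}


def native_share_across_sequences(counts, native_bin):
    """Share of all native-bin counts contributed by each sequence."""
    total = sum(bins.get(native_bin, 0) for bins in counts.values())
    return {seq: bins.get(native_bin, 0) / total for seq, bins in counts.items()}


# Hypothetical counts: seqA is common in the native bin but more common elsewhere.
toy_counts = {
    "seqA": {"native": 90, "other": 210},
    "seqB": {"native": 30, "other": 10},
}

specific = native_specificity(toy_counts, "native")
shared = native_share_across_sequences(toy_counts, "native")
print(max(specific, key=specific.get))  # the criterion used in this method
print(max(shared, key=shared.get))      # the alternative criterion
```

In this toy case the first criterion prefers `seqB`, which places most of its own occurrences in the native region, while the second prefers `seqA`, which is simply more abundant there.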

For long distance interactions, similar specificity may be incorporated with the dope-PW

statistical potential. In essence, it has been suggested that polar and electrostatic residues provide

the specificity of folding [154] [155].

Figure 4.4.1 Designed Rama propensity. Propensity of native Rama region (Fig. 3.3.1a) at each position in ubiquitin natural (1ubq) and design sequences.

For example, a salt bridge might be only marginally stabilizing in a native conformation, but it

might be destabilizing in a non-native conformation, serving to enforce specificity. Polar and

electrostatic interactions are indeed specified by the dope-PW energy function (Fig. 2.4.1), thus

contributing to implicit negative design. The dope-PW energy of the designed ubiquitin sequence

in the template 1ubq structure is indeed lower than that of the natural sequence (ΔG ≈ −120).

The significance of this is underscored by the sometimes competing interests of local Rama

propensity and long-distance interactions, since what may stabilize one could destabilize the

other. Having both simultaneously optimized is thus a crucial test of the design method.

4.5 Conclusions

The final objective is to experimentally validate the design using first low-resolution

spectroscopic techniques such as far ultraviolet circular dichroism to detect native secondary

structure content. In addition, the peak dispersal in two-dimensional nuclear magnetic resonance

spectroscopy can be used to determine the extent to which the designed protein is structured.

Provided these techniques are successful, the ultimate verification is to obtain a high-resolution

crystallographic model of the design. If these experimental validations are successful, they will

confirm the most non-natural redesign yet achieved.

Chapter 5

Structure prediction: Future directions and conclusions

5.1 Introduction

The focus of structure prediction, and hence of Chapters 2 and 3 of this thesis, has been

the conformational search and protein scoring functions. Any future improvements in the

methods described here will focus on these two subjects. The codependence of searching and

scoring has been emphasized in previous chapters, and will again feature strongly in this chapter,

but now in the context of new directions in structure prediction.

5.2 Enhancement of the conformational search

Compared to our homology-free structure prediction method (Chapter 2), the percentage

of native-like models was enhanced using evolutionarily-diversified sampling (Chapter 3). This

suggests that the conformational search is the factor that limits the sampling of native-like

conformations, which has been previously proposed [21, 120]. Therefore one of the principal

tasks towards achieving higher accuracy in structure prediction is to restrict sampling as much as

possible while ensuring that the native conformation is never eliminated.

We have developed a method that restricts sampling at the local level [8, 23, 152], but

what remains elusive is a reliable way of fixing more non-local contacts. For α-helices, this is

already accomplished by fixing local 2° structure, which imposes long distance chain interaction

constraints due to the rigidity of the helix and perhaps explains why all-α targets tend to be

predicted with more accuracy by ItFix [8]. Fixing β-strands at the local level, however, has less

impact on the conformational search due to the larger Ramachandran space available to that 2°

structure type (Fig. 1.2.2). Therefore, the next logical step towards restricting the conformational

search entails fixing β-sheet structures in three-dimensional space.

In contrast to the straightforward fixing of α-helices, locking down β-sheet structures

would be practically more complex if there is a large separation in sequence space between the

constituent strands (Fig. 5.2.1a). This results from the limited number of backbone torsion

changes that can be inserted into the intervening region without breaking the sheet, which

prevents the effective sampling of structured regions between the strands. Two-residue β-turn

motifs that separate the strands in β-hairpins (Fig. 5.2.1b), however, are fully structured when the

hydrogen bond network is complete, which suggests that these units of structure may be used as

a rigid sampling unit. This supposition is reinforced by the folding of the N-terminal ubiquitin

β-hairpin, which achieves a highly native structural model when folded and predicted with

DOPE-PW (Fig. 5.2.1c).

The strategy of fixing substructures may be necessary due to size limitations of current

de novo structure prediction methods where the sampling time required for larger sequences can

be prohibitive [23]. Any manner of reducing the complexity of the search by restricting which

parts of the chain may be formed at a given time could simplify structure prediction in a manner

that resembles the folding of real proteins [106-110].

5.3 Algorithm-free Smooth ItFix

One drawback of the ItFix protocols that were described in Chapters 2 and 3 and the

associated publications [8, 23, 152] is the use of probability threshold parameters in the 2°

structure fixing algorithm. It is possible to eliminate the use of these parameters simply by using

the model generated from ItFix rather than applying an external 2° structure calculation. This

process, called Smooth ItFix, is demonstrated in Figure 5.3.1. As before, the Rama distribution

Figure 5.2.1 ItFix β-structure prediction. (a) Long structured loops prevent the fixing of flanking β-structures. (b) Short β-turns allow prediction of strands in tight hairpin structures, which are predicted by ItFix with high accuracy.

for a position in the first round of folding is generated from the amino acid identities of that

position and its nearest neighbors. The structures generated from the first round of folding are

then used to generate the Rama distribution for the second round, with this process iterating as

long as is practically feasible. Figure 5.3.1 demonstrates that a natively extended position in

ubiquitin initially prefers a helical Rama distribution, but through four rounds of Smooth ItFix

prefers the native backbone geometry.
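
The feedback loop of Smooth ItFix can be sketched as follows. This is an illustrative outline under stated assumptions: `fold_round` is a hypothetical callable standing in for a full round of folding simulations, the 5° bin width is borrowed from the sampling distributions of Chapter 3, and angles are assumed to lie in [-180°, 180°).

```python
from collections import Counter

BIN_WIDTH = 5.0  # degrees; assumed to match the sampling distributions


def rama_distribution(angles, bin_width=BIN_WIDTH):
    """Convert a list of (phi, psi) pairs into {bin index pair: probability}."""
    counts = Counter(
        (int((phi + 180.0) // bin_width), int((psi + 180.0) // bin_width))
        for phi, psi in angles
    )
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def smooth_itfix(initial, fold_round, n_rounds=4):
    """Iterate folding, feeding each round's angles back in as the next input.

    initial:    position -> Rama distribution derived from sequence alone.
    fold_round: hypothetical callable taking such a mapping and returning
                position -> list of (phi, psi) observed in the models produced.
    """
    dists = dict(initial)
    history = [dists]
    for _ in range(n_rounds):
        observed = fold_round(dists)
        # No thresholds are applied: the ensemble itself defines the next round.
        dists = {pos: rama_distribution(a) for pos, a in observed.items()}
        history.append(dists)
    return history
```

The only free choice left in such a loop is which models are passed to `rama_distribution`, which is the question taken up in the next paragraph.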

What remains to be determined for Smooth ItFix is whether to use the entire ensemble

of structures or a subset of structures filtered based on energy. The latter might introduce more

accurate Rama distributions due to the more native-like distribution of low energy models (Fig.

3.5.2), but it would also reduce the number of models available to calculate the Rama

distribution. This is crucial because having a large search space that encompasses the native

Rama angles is better than having a small space that excludes them. Therefore any successful

implementation of Smooth ItFix might require a larger number of simulation models from each

round.

5.4 New energy functions for folding and refinement

Improvements in the conformational search should coincide with improvements in energy

functions for the identification of native-like conformations. The DOPE-PW statistical potential

introduced in Chapter 2 is the only orientation-dependent Cβ pairwise statistical potential to be

successfully incorporated into de novo structure prediction. Chapter 3 described how more globally

oriented multi-body radial terms were found to be ideal for generating properly collapsed models

for structures containing β-sheets. On the other hand, very detailed energy functions like DOPE-PW

are ideally suited for refinement of complex structures that are near the native state and are

Figure 5.3.1 The Smooth ItFix protocol. With no adjustable parameters, Smooth ItFix is an algorithm-free method for structure prediction. An initial sequence-based Rama distribution is the input to iterative rounds of folding where the Rama distribution of a subsequent round is taken directly from the structures of the previous round.

less efficient earlier in the search. These results may not be divorced from the physical reality of

protein folding. In summary, it has been suggested that hydrophobic residues may provide the

lion’s share of the energy required for folding, whereas polar and electrostatic residues provide

the specificity of folding [154] [155]. Thus it is not surprising that our prediction methods

achieve the most success when the collapse of the chain is guided cooperatively by hydrophobic

desolvation, and the refinement of those collapsed structures is guided by highly conditional

specific interactions.

Given these principles, two obvious outstanding goals are the following: 1) maximize

the proportion of properly collapsed models; 2) increase specificity during the refinement of the

collapsed models. Chapter 3 included the outline of a sequence profiling approach that shows much

promise for achieving each of these goals. To briefly summarize, some target sequence amino

acid identities do not reflect the optimal solvent accessibility or pairwise compatibility expected

from the location of those residues in the native structure. For example, hydrophobic residues

can be solvent exposed (Fig. 5.4.1a) and polar and apolar residues can be on the same side of a

β-sheet (Fig. 5.4.1b). As described in Chapter 3, these scenarios are analogous to residues being in

2° structures for which they have minimal propensity. In those cases, averaging across a

sequence profile gave those positions more native-like 2° structure propensity. Thus, the next

logical task is to integrate the SPEED sequence profiling method into our energy functions.

Integration of SPEED into the radial terms is not as simple as averaging the amino acid

identities across the sequence profile because the radial terms identify a residue as either

hydrophobic or hydrophilic. Therefore, one possible approach is to take the consensus amino

acid identities from the multiple sequence alignment and use the polarity designation of the

consensus identities in the burial ratio term. Results of this calculation are shown in Table 7 for

Figure 5.4.1 DOPE-PW-SPEED encodes higher specificity. Non-native interactions favored by DOPE-PW in 1ubq can be corrected by averaging the pair interactions across a multiple sequence alignment in DOPE-PW-SPEED. (a) Exposed hydrophobic residues no longer prefer to be buried together and (b) polar and apolar residues switch to the improbable native interaction on the same side of a β-sheet.

the targets found in Tables 3-6. In every example, the consensus burial ratio is smaller than that

of the target amino acid sequence. This is almost exclusively a result of surface exposed

hydrophobic residues from the target sequence being mutated to hydrophilic residues. This could

enforce more specific restrictions on hydrophobic collapse and reduce non-native conformations.
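
The consensus-polarity idea can be sketched on a toy alignment. The sequences below are hypothetical, and the hydrophobic set is borrowed from the seven-residue design alphabet of Chapter 4 as an assumption; the classification actually used by the radial terms is not restated here.

```python
from collections import Counter

HYDROPHOBIC = set("IVLFWYA")  # assumed classification, for illustration only


def consensus_sequence(alignment):
    """Most common non-gap residue in each column of an equal-length alignment."""
    consensus = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        consensus.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(consensus)


def polarity(sequence):
    """Label each residue H (hydrophobic) or P (polar); gaps stay as '-'."""
    return "".join(
        "-" if r == "-" else ("H" if r in HYDROPHOBIC else "P") for r in sequence
    )


# Hypothetical alignment; the first row plays the role of the target sequence.
toy_alignment = ["MIKV", "MEKV", "LEKI", "MERV"]
print(polarity(toy_alignment[0]))                   # target polarity pattern
print(polarity(consensus_sequence(toy_alignment)))  # consensus polarity pattern
```

In this toy case the hydrophobic residue at the second position of the target becomes polar in the consensus, which is the kind of change that lowers the burial ratio.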

As for integration of SPEED into DOPE-PW for the purposes of refinement, the potential

for increased specificity is clear. Instead of using singular amino acid identities from the target

sequence to calculate the interaction energy of two residues, the multiple sequence alignment

offers an averaged pairwise interaction across two positions. As an example, two hydrophobic

residues (ALA and ILE) in ubiquitin are solvent exposed (Fig. 5.4.1a), but this information is not

known during folding. If a sampled conformation puts the Cβ atoms within the non-native

optimal interaction distance of those two amino acids, there will be an energetic benefit. With the

pair interactions for those two positions averaged across the alignment (DOPE-PW-SPEED),

however, there is no energetic benefit to that interaction. DOPE-PW-SPEED also normalizes the

alignment of β-sheets in a similar manner (Fig. 5.4.1b).
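
The averaging step can be sketched as follows. The pair energy used here is a deliberately crude, hypothetical stand-in that ignores the distance and orientation dependence of DOPE-PW, and the alignment is invented for the example; only the averaging across aligned sequences is the point.

```python
HYDROPHOBIC = set("IVLFWYA")  # assumed classification, for illustration only


def toy_pair_energy(a, b):
    """Hypothetical contact energy: favorable only for two hydrophobic residues."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def averaged_pair_energy(alignment, i, j, pair_energy=toy_pair_energy):
    """Average the pair energy at columns i and j over all gap-free aligned rows."""
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    if not pairs:
        return 0.0
    return sum(pair_energy(a, b) for a, b in pairs) / len(pairs)


# Hypothetical two-column alignment; the first row is the target (ALA, ILE).
toy_alignment = ["AI", "KE", "SD", "AE"]
target_only = toy_pair_energy(*toy_alignment[0])
profile_average = averaged_pair_energy(toy_alignment, 0, 1)
print(target_only, profile_average)
```

With the target sequence alone the contact is fully favorable, whereas the profile average is much weaker because most homologues place polar residues at one or both positions.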

These results suggest that increasing the specificity of an energy function aids the

identification of native interactions. There is, however, a disadvantage to increased specificity.

When the native conformation is more narrowly defined it is more difficult to locate in the

conformational search, which has been referred to as the “needle in a haystack” problem [21].

For this reason we have developed a structure prediction paradigm where the initial chain

collapse is a global multi-body coarse search and the refinement of those collapsed models is

defined by highly specific pair interactions.

5.5 Conclusions

The principal application of these developments is clear. The explosion in new protein

sequences over the last few decades [2] and the relatively slow pace of experimental structure

determination [3, 4] necessitate rapid and accurate structure prediction. We have developed

methods for the de novo prediction of protein structure that can directly measure the global and

local confidence in our predictions and we have integrated these methods with template-based

modeling methods in order to select only target sequences that have no significant structures in

the PDB [23]. We have also tested our methods on the computing resources necessary to expand

our prediction endeavors on a genomic scale [152]. As such, we are prepared to take the rapidly

expanding pool of new sequence families [20] and predict representative models.

References

1. Kryshtafovych, A., K. Fidelis, and J. Moult, CASP8 results in context of previous experiments. Proteins, 2009. 77 Suppl 9: p. 217-28.

2. Yooseph, S., et al., The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol, 2007. 5(3): p. e16.

3. Service, R.F., Structural biology. Protein structure initiative: phase 3 or phase out. Science, 2008. 319(5870): p. 1610-3.

4. Lattman, E., The state of the Protein Structure Initiative. Proteins, 2004. 54(4): p. 611-5.

5. Service, R.F., Problem solved* (*sort of). Science, 2008. 321(5890): p. 784-6.

6. Sosnick, T.R., Kinetic barriers and the role of topology in protein and RNA folding. Prot. Sci., 2008. 17: p. 1308–1318.

7. Kryshtafovych, A., et al., Progress over the first decade of CASP experiments. Proteins, 2005. 61 Suppl 7: p. 225-36.

8. DeBartolo, J., et al., Mimicking the folding pathway to improve homology-free protein structure prediction. Proc Natl Acad Sci U S A, 2009. 106(10): p. 3734-9.

9. Yang, J.S., et al., All-atom ab initio folding of a diverse set of proteins. Structure, 2007. 15(1): p. 53-63.

10. Jones, D.T., Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 1999. 292(2): p. 195-202.

11. Qian, B., et al., High-resolution structure prediction and the crystallographic phase problem. Nature, 2007. 450(7167): p. 259-64.

12. Chan, H.S., S. Bromberg, and K.A. Dill, Models of cooperativity in protein folding. Philosophical Transactions of the Royal Society of London - Series B: Biological Sciences, 1995. 348(1323): p. 61-70.

13. Yue, K., et al., A test of lattice protein folding algorithms. Proc. Natl. Acad. Sci. USA, 1995. 92(1): p. 325-9.

14. Hegler, J.A., et al., Restriction versus guidance in protein structure prediction. Proc Natl Acad Sci U S A, 2009. 106(36): p. 15302-7.

15. Ozkan, S.B., et al., Protein folding by zipping and assembly. Proc. Natl. Acad. Sci. U S A, 2007. 104(29): p. 11987–11992.

16. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.

17. Li, W., L. Jaroszewski, and A. Godzik, Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001. 17(3): p. 282-3.

18. Bateman, A., et al., The Pfam protein families database. Nucleic Acids Res, 2002. 30(1): p. 276-80.

19. Fitch, W.M., Distinguishing homologous from analogous proteins. Syst Zool, 1970. 19(2): p. 99-113.

20. Levitt, M., Nature of the protein universe. Proc Natl Acad Sci U S A, 2009. 106(27): p. 11079-84.

21. Bradley, P., K.M. Misura, and D. Baker, Toward high-resolution de novo structure prediction for small proteins. Science, 2005. 309(5742): p. 1868-71.

22. Xu, J., J. Peng, and F. Zhao, Template-based and free modeling by RAPTOR++ in CASP8. Proteins, 2009. 77 Suppl 9: p. 133-7.

23. DeBartolo, J., et al., Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci, 2010. 19(3): p. 520-34.

24. DeGrado, W.F., et al., De novo design and structural characterization of proteins and metalloproteins. Annu Rev Biochem, 1999. 68: p. 779-819.

25. Pabo, C., Molecular technology. Designing proteins and peptides. Nature, 1983. 301(5897): p. 200.

26. Butterfoss, G.L. and B. Kuhlman, Computer-based design of novel protein structures. Annu Rev Biophys Biomol Struct, 2006. 35: p. 49-65.

27. Malakauskas, S.M. and S.L. Mayo, Design, structure and stability of a hyperthermophilic protein variant. Nat Struct Biol, 1998. 5(6): p. 470-5.

28. Pan, Y., et al., Computational redesign of human butyrylcholinesterase for anticocaine medication. Proc Natl Acad Sci U S A, 2005. 102(46): p. 16656-61.

29. Korkegian, A., et al., Computational thermostabilization of an enzyme. Science, 2005. 308(5723): p. 857-60.

30. Gribenko, A.V., et al., Rational stabilization of enzymes by computational redesign of surface charge-charge interactions. Proc Natl Acad Sci U S A, 2009. 106(8): p. 2601-6.

31. Filikov, A.V., et al., Computational stabilization of human growth hormone. Protein Sci, 2002. 11(6): p. 1452-61.

32. Dantas, G., et al., A large scale test of computational protein design: folding and stability of nine completely redesigned globular proteins. J Mol Biol, 2003. 332(2): p. 449-60.

33. Harbury, P.B., et al., High-resolution protein design with backbone freedom. Science, 1998. 282(5393): p. 1462-7.

34. Villegas, V., et al., Protein engineering as a strategy to avoid formation of amyloid fibrils. Protein Sci, 2000. 9(9): p. 1700-8.

35. Dahiyat, B.I. and S.L. Mayo, De novo protein design: fully automated sequence selection. Science, 1997. 278(5335): p. 82-7.

36. Handel, T.M., S.A. Williams, and W.F. DeGrado, Metal ion-dependent modulation of the dynamics of a designed protein. Science, 1993. 261(5123): p. 879-85.

37. Betz, S.F., P.A. Liebman, and W.F. DeGrado, De novo design of native proteins: characterization of proteins intended to fold into antiparallel, rop-like, four-helix bundles. Biochemistry, 1997. 36(9): p. 2450-8.

38. Betz, S.F. and W.F. DeGrado, Controlling topology and native-like behavior of de novo-designed peptides: design and characterization of antiparallel four-stranded coiled coils. Biochemistry, 1996. 35(21): p. 6955-62.

39. Kuhlman, B., et al., Design of a novel globular protein fold with atomic-level accuracy. Science, 2003. 302(5649): p. 1364-8.

40. Kellis, J.T., Jr., et al., Contribution of hydrophobic interactions to protein stability. Nature, 1988. 333(6175): p. 784-6.

41. Matthews, B.W., Studies on protein stability with T4 lysozyme. Adv Protein Chem, 1995. 46: p. 249-78.

42. Gassner, N.C., W.A. Baase, and B.W. Matthews, A test of the "jigsaw puzzle" model for protein folding by multiple methionine substitutions within the core of T4 lysozyme. Proc Natl Acad Sci U S A, 1996. 93(22): p. 12155-8.

43. Axe, D.D., N.W. Foster, and A.R. Fersht, Active barnase variants with completely random hydrophobic cores. Proc Natl Acad Sci U S A, 1996. 93(11): p. 5590-4.

44. Krylov, D., I. Mikhailenko, and C. Vinson, A thermodynamic scale for leucine zipper stability and dimerization specificity: e and g interhelical interactions. Embo J, 1994. 13(12): p. 2849-61.

45. Lumb, K.J. and P.S. Kim, Measurement of interhelical electrostatic interactions in the GCN4 leucine zipper. Science, 1995. 268(5209): p. 436-9.

46. Kenar, K.T., B. Garcia-Moreno, and E. Freire, A calorimetric characterization of the salt dependence of the stability of the GCN4 leucine zipper. Protein Sci, 1995. 4(9): p. 1934-8.

47. Barlow, D.J. and J.M. Thornton, Ion-pairs in proteins. J Mol Biol, 1983. 168(4): p. 867-85.

48. Chou, P.Y. and G.D. Fasman, Prediction of protein conformation. Biochemistry, 1974. 13(2): p. 222-45.

49. McGregor, M.J., S.A. Islam, and M.J. Sternberg, Analysis of the relationship between side-chain conformation and secondary structure in globular proteins. J Mol Biol, 1987. 198(2): p. 295-310.

50. Munoz, V. and L. Serrano, Intrinsic secondary structure propensities of the amino acids, using statistical phi-psi matrices: comparison with experimental scales. Proteins, 1994. 20(4): p. 301-11.

51. Dunbrack, R.L., Jr. and M. Karplus, Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nat Struct Biol, 1994. 1(5): p. 334-40.

52. Jha, A.K., et al., Helix, Sheet, and Polyproline II Frequencies and Strong Nearest Neighbor Effects in a Restricted Coil Library. Biochemistry, 2005. 44(28): p. 9691-702.

53. Colubri, A., et al., Minimalist Representations and the Importance of Nearest Neighbor Effects in Protein Folding Simulations. J. Mol. Biol., 2006. 363: p. 835-857.

54. Richardson, J.S. and D.C. Richardson, The de novo design of protein structures. Trends Biochem Sci, 1989. 14(7): p. 304-9.

55. Booth, D.R., et al., Instability, unfolding and aggregation of human lysozyme variants underlying amyloid fibrillogenesis. Nature, 1997. 385(6619): p. 787-93.

56. Pauling, L., R.B. Corey, and H.R. Branson, The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. USA, 1951. 37: p. 235-240.

57. Pauling, L. and R.B. Corey, Configurations of polypeptide chains with favored conformations around single bonds: Two new pleated sheets. Proc. Natl. Acad. Sci. USA, 1951. 37: p. 729-740.

58. Kabsch, W. and C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-637.

59. Ramachandran, G.N., C. Ramakrishnan, and V. Sasisekharan, Stereochemistry of Polypeptide Chain Configurations. J. Mol. Biol., 1963. 7(1): p. 95-99.

60. Rose, G.D., Hierarchic organization of domains in globular proteins. Journal of Molecular Biology, 1979. 134(3): p. 447-70.

61. Dill, K.A., Dominant forces in protein folding. Biochemistry, 1990. 29(31): p. 7133-7155.

62. Stickle, D.F., et al., Hydrogen bonding in globular proteins. Journal of Molecular Biology, 1992. 226(4): p. 1143-59.

63. Baker, E.N. and R.E. Hubbard, Hydrogen bonding in globular proteins. Prog Biophys Mol Biol, 1984. 44(2): p. 97-179.

64. Dill, K.A., K.M. Fiebig, and H.S. Chan, Cooperativity in protein-folding kinetics. Proc. Natl. Acad. Sci. USA, 1993. 90: p. 1942-1946.

65. Chou, P.Y. and G.D. Fasman, Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry, 1974. 13(2): p. 211-22.

66. Chou, P.Y. and G.D. Fasman, Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol, 1978. 47: p. 45-148.

67. Rost, B. and C. Sander, Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol, 1993. 232(2): p. 584-99.

68. Pollastri, G., et al., Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 2002. 47(2): p. 228-35.

69. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 1997. 25(17): p. 3389-402.

70. Cuff, J.A. and G.J. Barton, Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 2000. 40(3): p. 502-11.

71. Dor, O. and Y. Zhou, Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins, 2007. 66(4): p. 838-45.

72. Kihara, D., The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci., 2005. 14(8): p. 1955-63.

73. Minor, D.L., Jr. and P.S. Kim, Context-dependent secondary structure formation of a designed protein sequence. Nature, 1996. 380(6576): p. 730-4.

74. Alexander, P.A., et al., The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc. Natl. Acad. Sci. U S A, 2007.

75. Meiler, J. and D. Baker, Coupled prediction of protein secondary and tertiary structure. Proc. Natl. Acad. Sci. U S A, 2003. 100(21): p. 12105-10.

76. Service, R.F., Structural biology. Researchers hone their homology tools. Science, 2008. 319(5870): p. 1612.

77. Zhang, Y., I-TASSER: fully automated protein structure prediction in CASP8. Proteins, 2009. 77 Suppl 9: p. 100-13.

78. Krieger, E., et al., Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins, 2009. 77 Suppl 9: p. 114-22.

79. Hildebrand, A., et al., Fast and accurate automatic structure prediction with HHpred. Proteins, 2009. 77 Suppl 9: p. 128-32.

80. Raman, S., et al., Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins, 2009. 77 Suppl 9: p. 89-99.

81. Ben-David, M., et al., Assessment of CASP8 structure predictions for template free targets. Proteins, 2009. 77 Suppl 9: p. 50-65.

82. Fang, Q. and D. Shortle, Protein refolding in silico with atom-based statistical potentials and conformational search using a simple genetic algorithm. J Mol Biol, 2006. 359(5): p. 1456-67.

83. Zhao, F., et al., Discriminative learning for protein conformation sampling. Proteins, 2008. 73(1): p. 228-40.

84. Levinthal, C., Are there pathways for protein folding? J. Chim. Phys., 1968. 65: p. 44-45.

85. Jorgensen, W.L. and J. Tirado-Rives, The OPLS [optimized potentials for liquid simulations] potential functions for proteins, energy minimizations for crystals of cyclic peptides and crambin. J. Am. Chem. Soc., 1988. 110(6): p. 1657-66.

86. Brooks, B.R., et al., CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem., 1983. 4(2): p. 187-217.

87. Pearlman, D., D.A. Case, J.W. Caldwell, W.S. Ross, T.E. Cheatham III, D.M. Ferguson, U.C. Singh, P. Weiner, and P. Kollman, Amber 4.1. 1995.

88. Shen, M.Y. and A. Sali, Statistical potential for assessment and prediction of protein structures. Protein Sci., 2006. 15(11): p. 2507-24.

89. Fitzgerald, J.E., et al., Reduced Cbeta statistical potentials can outperform all-atom potentials in decoy identification. Protein Sci., 2007. 16(10): p. 2123-39.

90. Fang, Q. and D. Shortle, A consistent set of statistical potentials for quantifying local side-chain and backbone interactions. Proteins, 2005. 60(1): p. 90-6.

91. Summa, C.M. and M. Levitt, Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci U S A, 2007. 104(9): p. 3177-82.

92. Flory, P.J., Principles of Polymer Chemistry. 1953, Ithaca, NY: Cornell University Press.

93. Flory, P.J., Statistical Mechanics of Chain Molecules. 1969, New York: Wiley.

94. Zaman, M.H., et al., Investigations into sequence and conformational dependence of backbone entropy, inter-basin dynamics and the Flory isolated-pair hypothesis for peptides. J. Mol. Biol., 2003. 331(3): p. 693-711.

95. Jha, A.K., et al., Statistical coil model of the unfolded state: Resolving the reconciliation problem. Proc. Natl. Acad. Sci. U S A, 2005. 102(37): p. 13099-104.

96. Fang, Q. and D. Shortle, Protein refolding in silico with atom-based statistical potentials and conformational search using a simple genetic algorithm. J. Mol. Biol., 2006. 359(5): p. 1456-67.

97. Skolnick, J., et al., Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins, 2001. Suppl 5: p. 149-56.

98. Srinivasan, R. and G.D. Rose, LINUS: a hierarchic procedure to predict the fold of a protein. Proteins, 1995. 22(2): p. 81-99.

99. Lazaridis, T. and M. Karplus, Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol, 1999. 288(3): p. 477-87.

100. Kortemme, T., A.V. Morozov, and D. Baker, An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol., 2003. 326(4): p. 1239-59.

101. Rohl, C.A., et al., Protein structure prediction using Rosetta. Methods Enzymol, 2004. 383: p. 66-93.

102. Pande, V.S. and D.S. Rokhsar, Molecular dynamics simulations of unfolding and refolding of a beta-hairpin fragment of protein G. Proc Natl Acad Sci U S A, 1999. 96(16): p. 9062-7.

103. Shen, M.Y. and K.F. Freed, All-atom fast protein folding simulations: the villin headpiece. Proteins, 2002. 49(4): p. 439-45.

104. Mirny, L.A., V. Abkevich, and E.I. Shakhnovich, Universality and diversity of folding scenarios: a comprehensive analysis with the aid of a lattice model. Folding & Design, 1996. 1: p. 103-116.

105. Maisuradze, G.G., et al., Investigation of Protein Folding by Coarse-Grained Molecular Dynamics with the UNRES Force Field. J Phys Chem A. 2010: p. 18.

106. Abkevich, V.I., A.M. Gutin, and E.I. Shakhnovich, Free energy landscape for protein folding kinetics: Intermediates, traps, and multiple pathways in theory and lattice model simulations. J. Chem. Phys., 1994. 101: p. 6052-6062.

107. Baldwin, R.L., The pathway of protein folding. TIBS, 1978: p. 66-67.

108. Baldwin, R.L., Kinetic intermediates and the pathway of folding of ribonucleases A and S. Biomolecular structure, conformation, function and evolution, 1980: p. 87-95.

109. Baldwin, R.L. and T.E. Creighton, Recent experimental work on the pathway and mechanism of protein folding. Protein Folding, 1980: p. 217-259.

110. Baldwin, R.L., The nature of protein folding pathways: The classical versus the new view. J. Biomol. NMR, 1995. 5: p. 103-109.

111. Bai, Y. and S.W. Englander, Future directions in folding: the multi-state nature of protein structure. Proteins, 1996. 24(2): p. 145-51.

112. Maity, H., et al., Protein folding: The stepwise assembly of foldon units. Proc. Natl. Acad. Sci. U S A, 2005. 102(13): p. 4741-6.

113. Krantz, B.A., R.S. Dothager, and T.R. Sosnick, Discerning the structure and energy of multiple transition states in protein folding using psi-analysis. J. Mol. Biol., 2004. 337(2): p. 463-75.

114. Sosnick, T.R., et al., Characterizing the Protein Folding Transition State Using psi Analysis. Chem. Rev., 2006. 106(5): p. 1862-76.

115. Noe, F., et al., Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc Natl Acad Sci U S A, 2009. 106(45): p. 19011-6.

116. Voelz, V.A., et al., Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc, 2010. 132(5): p. 1526-8.

117. Freddolino, P.L., et al., Ten-microsecond molecular dynamics simulation of a fast-folding WW domain. Biophys J, 2008. 94(10): p. L75-7.

118. Freddolino, P.L., et al., Force field bias in protein folding simulations. Biophys J, 2009. 96(9): p. 3772-80.

119. Liwo, A., M. Khalili, and H.A. Scheraga, Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains. Proc. Natl. Acad. Sci. U S A, 2005. 102(7): p. 2362-7.

120. Srinivasan, R., P.J. Fleming, and G.D. Rose, Ab initio protein folding using LINUS. Methods Enzymol, 2004. 383: p. 48-66.

121. Eisenberg, D., et al., The design, synthesis, and crystallization of an alpha-helical peptide. Proteins, 1986. 1(1): p. 16-22.

122. Regan, L. and W.F. DeGrado, Characterization of a helical protein designed from first principles. Science, 1988. 241(4868): p. 976-8.

123. Dantas, G., et al., High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. J Mol Biol, 2007. 366(4): p. 1209-21.

124. Henikoff, S. and J.G. Henikoff, Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 1992. 89(22): p. 10915-9.

125. Kuhlman, B. and D. Baker, Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A, 2000. 97(19): p. 10383-8.

126. Harbury, P.B., B. Tidor, and P.S. Kim, Repacking protein cores with backbone freedom: structure prediction for coiled coils. Proceedings of the National Academy of Sciences of the United States of America, 1995. 92(18): p. 8408-12.

127. Hu, X., et al., Computer-based redesign of a beta sandwich protein suggests that extensive negative design is not required for de novo beta sheet design. Structure, 2008. 16(12): p. 1799-805.

128. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 1997. 25(17): p. 3389-402.

129. Eramian, D., et al., A composite score for predicting errors in protein structure models. Protein Sci., 2006. 15(7): p. 1653-66.

130. Shen, M.Y. and A. Sali, Statistical potential for assessment and prediction of protein structures. Protein Sci, 2006. 15(11): p. 2507-24.

131. Doyle, A.C., The Sign of the Four. 1890.

132. Cheng, J., et al., SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W72-6.

133. Bryson, K., et al., Protein structure prediction servers at University College London. Nucleic Acids Res, 2005. 33(Web Server issue): p. W36-8.

134. Cheng, J., et al., Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science, 2005. 308: p. 1149-1154.

135. Wang, G. and R.L. Dunbrack, Jr., PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W94-8.

136. Wang, G. and R.L. Dunbrack, Jr., PISCES: a protein sequence culling server. Bioinformatics, 2003. 19(12): p. 1589-91.

137. Zhou, H. and J. Skolnick, Protein structure prediction by pro-Sp3-TASSER. Biophys J, 2009. 96(6): p. 2119-27.

138. Baker, D. and A. Sali, Protein structure prediction and structural genomics. Science, 2001. 294(5540): p. 93-6.

139. Skolnick, J., J.S. Fetrow, and A. Kolinski, Structural genomics and its importance for gene function analysis. Nat Biotechnol, 2000. 18(3): p. 283-7.

140. Soding, J., Protein homology detection by HMM-HMM comparison. Bioinformatics, 2005. 21(7): p. 951-60.

141. Lockless, S.W. and R. Ranganathan, Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 1999. 286(5438): p. 295-9.

142. Suel, G.M., et al., Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol, 2003. 10(1): p. 59-69.

143. Russ, W.P. and R. Ranganathan, Knowledge-based potential functions in protein design. Curr Opin Struct Biol, 2002. 12(4): p. 447-52.

144. Lise, S., A. Walker-Taylor, and D.T. Jones, Docking protein domains in contact space. BMC Bioinformatics, 2006. 7: p. 310.

145. Zhang, Y. and J. Skolnick, SPICKER: a clustering approach to identify near-native protein folds. J Comput Chem, 2004. 25(6): p. 865-71.

146. McGuffin, L.J., Benchmarking consensus model quality assessment for protein fold recognition. BMC Bioinformatics, 2007. 8: p. 345.

147. Randall, A. and P. Baldi, SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs. BMC Struct Biol, 2008. 8: p. 52.

148. Zhou, H. and J. Skolnick, Protein model quality assessment prediction by combining fragment comparisons and a consensus C(alpha) contact potential. Proteins, 2008. 71(3): p. 1211-8.

149. Xu, J., et al., RAPTOR: optimal protein threading by linear programming. J. Bioinform. Comp. Biol., 2003. 1(1): p. 95-117.

150. Li, W. and A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006. 22(13): p. 1658-9.

151. Gong, H., P.J. Fleming, and G.D. Rose, Building native protein conformation from highly approximate backbone torsion angles. Proc Natl Acad Sci U S A, 2005. 102(45): p. 16227-32.

152. Hocky, G., M. Wilde, J. DeBartolo, M. Hategan, I. Foster, T.R. Sosnick, and K.F. Freed, Homology-free protein structure prediction through parallel scripting. Argonne Technical Report Preprint ANL/MCS-P1645-0609, 2009.

153. Wilde, M., I. Foster, K. Iskra, P. Beckman, Z. Zhang, A. Espinosa, M. Hategan, B. Clifford, and I. Raicu, Parallel scripting for applications at the petascale and beyond. IEEE Computer, 2009.

154. Cordes, M.H., A.R. Davidson, and R.T. Sauer, Sequence space, folding and protein design. Curr Opin Struct Biol, 1996. 6(1): p. 3-10.

155. Yue, K. and K.A. Dill, Inverse protein folding problem: designing polymer sequences. Proc Natl Acad Sci U S A, 1992. 89(9): p. 4163-7.

156. Desjarlais, J.R. and T.M. Handel, De novo design of the hydrophobic cores of proteins. Protein Sci, 1995. 4(10): p. 2006-18.

157. Lazar, G.A., J.R. Desjarlais, and T.M. Handel, De novo design of the hydrophobic core of ubiquitin. Protein Sci, 1997. 6(6): p. 1167-78.

158. Dahiyat, B.I., C.A. Sarisky, and S.L. Mayo, De novo protein design: towards fully automated sequence selection. J Mol Biol, 1997. 273(4): p. 789-96.
