mmtsb tool set: enhanced sampling and multiscale …journal of molecular graphics and modelling 22...

19
Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods for applications in structural biology Michael Feig, John Karanicolas, Charles L. Brooks III Department of Molecular Biology, TPC6, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA Abstract We describe the Multiscale Modeling Tools for Structural Biology (MMTSB) Tool Set (http://mmtsb.scripps.edu/software/mmtsbToolSet. html), which is a novel set of utilities and programming libraries that provide new enhanced sampling and multiscale modeling techniques for the simulation of proteins and nucleic acids. The tool set interfaces with the existing molecular modeling packages CHARMM and Amber for classical all-atom simulations, and with MONSSTER for lattice-based low-resolution conformational sampling. In addition, it adds new functionality for the integration and translation between both levels of detail. The replica exchange method is implemented to allow enhanced sampling of both the all-atom and low-resolution models. The tool set aims at applications in structural biology that involve protein or nucleic acid structure prediction, refinement, and/or extended conformational sampling. With structure prediction applications in mind, the tool set also implements a facility that allows the control and application of modeling tasks on a large set of conformations in what we have termed ensemble computing. Ensemble computing encompasses loosely coupled, parallel computation on high-end parallel computers, clustered computational grids and desktop grid environments. This paper describes the design and implementation of the MMTSB Tool Set and illustrates its utility with three typical examples—scoring of a set of predicted protein conformations in order to identify the most native-like structures, ab initio folding of peptides in implicit solvent with the replica exchange method, and the prediction of a missing fragment in a larger protein structure. © 2004 Elsevier Inc. All rights reserved. Keywords: Protein structure prediction; Replica exchange; Ensemble computing 1. Introduction The success of computational methodologies in chemistry that have been developed over the last four decades is re- flected in a multitude of academic and commercial programs available today. CHARMM [1], Amber [2], and Gaussian [3] are typical examples of this development, and enjoy wide usage in both academia and industry. Most of these programs that have emerged from this period are highly functional, well optimized, and sufficiently integrated within their in- tended range of applications. However, because of a high level of complexity, proprietary command interfaces and in- put/output formats these programs often tend to be inflexible when extensions and/or interoperability with other existing programs are needed. While this is a common problem in the integration of heterogeneous legacy software components [4], such issues have become especially apparent in the im- plementation of new enhanced sampling techniques applied to the conformational sampling of biopolymers. These novel simulation protocols combine existing methods in order to Corresponding author. Tel.: +1-858-784-8035; fax: +1-858-784-8688. E-mail address: [email protected] (C.L. Brooks III). improve conformational sampling efficiency for molecular modeling and dynamics applications. Generalized ensemble sampling techniques, for example, involve parallel simula- tions of a system of interest with different weight factors coupled by a Monte Carlo simulation protocol [5–7]. Vari- ants of this sampling scheme are being used increasingly in the study of long time scale phenomena such as protein folding [8–13]. These methods could be implemented in the form of separate, new programs or by modifying existing simulation packages. In a more practical implementation, however, existing programs could be used to run each of the simulations while an external interface layer is utilized to couple and control the individual simulations and facilitates the enhanced sampling methodology. This approach would allow greater flexibility in using the same enhanced sam- pling method with different simulation programs, and avoid difficulties in modifying existing large software packages directly. Another way to improve conformational sampling is through multiscale modeling techniques. The computational modeling of biological macromolecules commonly revolves around structure representations in atomic or near-atomic detail, with a classical description of physical interactions. 1093-3263/$ – see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jmgm.2003.12.005

Upload: others

Post on 24-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

Journal of Molecular Graphics and Modelling 22 (2004) 377–395

MMTSB Tool Set: enhanced sampling and multiscale modelingmethods for applications in structural biology

Michael Feig, John Karanicolas, Charles L. Brooks III∗Department of Molecular Biology, TPC6, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA

Abstract

We describe the Multiscale Modeling Tools for Structural Biology (MMTSB) Tool Set (http://mmtsb.scripps.edu/software/mmtsbToolSet.html), which is a novel set of utilities and programming libraries that provide new enhanced sampling and multiscale modeling techniquesfor the simulation of proteins and nucleic acids. The tool set interfaces with the existing molecular modeling packages CHARMM andAmber for classical all-atom simulations, and with MONSSTER for lattice-based low-resolution conformational sampling. In addition, itadds new functionality for the integration and translation between both levels of detail. The replica exchange method is implemented toallow enhanced sampling of both the all-atom and low-resolution models. The tool set aims at applications in structural biology that involveprotein or nucleic acid structure prediction, refinement, and/or extended conformational sampling. With structure prediction applicationsin mind, the tool set also implements a facility that allows the control and application of modeling tasks on a large set of conformations inwhat we have termed ensemble computing. Ensemble computing encompasses loosely coupled, parallel computation on high-end parallelcomputers, clustered computational grids and desktop grid environments.

This paper describes the design and implementation of the MMTSB Tool Set and illustrates its utility with three typical examples—scoringof a set of predicted protein conformations in order to identify the most native-like structures, ab initio folding of peptides in implicitsolvent with the replica exchange method, and the prediction of a missing fragment in a larger protein structure.© 2004 Elsevier Inc. All rights reserved.

Keywords:Protein structure prediction; Replica exchange; Ensemble computing

1. Introduction

The success of computational methodologies in chemistrythat have been developed over the last four decades is re-flected in a multitude of academic and commercial programsavailable today. CHARMM[1], Amber [2], and Gaussian[3] are typical examples of this development, and enjoy wideusage in both academia and industry. Most of these programsthat have emerged from this period are highly functional,well optimized, and sufficiently integrated within their in-tended range of applications. However, because of a highlevel of complexity, proprietary command interfaces and in-put/output formats these programs often tend to be inflexiblewhen extensions and/or interoperability with other existingprograms are needed. While this is a common problem in theintegration of heterogeneous legacy software components[4], such issues have become especially apparent in the im-plementation of new enhanced sampling techniques appliedto the conformational sampling of biopolymers. These novelsimulation protocols combine existing methods in order to

∗ Corresponding author. Tel.:+1-858-784-8035; fax:+1-858-784-8688.E-mail address:[email protected] (C.L. Brooks III).

improve conformational sampling efficiency for molecularmodeling and dynamics applications. Generalized ensemblesampling techniques, for example, involve parallel simula-tions of a system of interest with different weight factorscoupled by a Monte Carlo simulation protocol[5–7]. Vari-ants of this sampling scheme are being used increasinglyin the study of long time scale phenomena such as proteinfolding [8–13]. These methods could be implemented in theform of separate, new programs or by modifying existingsimulation packages. In a more practical implementation,however, existing programs could be used to run each of thesimulations while an external interface layer is utilized tocouple and control the individual simulations and facilitatesthe enhanced sampling methodology. This approach wouldallow greater flexibility in using the same enhanced sam-pling method with different simulation programs, and avoiddifficulties in modifying existing large software packagesdirectly.

Another way to improve conformational sampling isthrough multiscale modeling techniques. The computationalmodeling of biological macromolecules commonly revolvesaround structure representations in atomic or near-atomicdetail, with a classical description of physical interactions.

1093-3263/$ – see front matter © 2004 Elsevier Inc. All rights reserved.doi:10.1016/j.jmgm.2003.12.005

Page 2: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

378 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

Such models have been quite successful in complementingexperimental data with structural, dynamic, and energeticinformation, but involve substantial computational resourcesfor larger systems, or when long time scales have to be con-sidered. In particular, studies of protein folding, structureprediction applications, or the formation and interaction ofsupramolecular assemblies become prohibitively expensivewith models at atomic detail. Alternatively, coarser molec-ular representations with few virtual particles, often alsoprojected onto lattices, have yielded meaningful results insuch cases[14]. Unfortunately, the reduced level of detailoften cannot provide the same accuracy as all-atom models.For example, it is quite feasible to generate native topolo-gies from folding simulations using simple lattice models;however, it is much more difficult to actually discern na-tive or near-native conformations from other, non-nativeconformations that are also generated with the same model[15]. In such cases, one may instead reconstruct all-atomstructures from reduced representations[16], and use thesemore detailed models to regain a higher level of accuracywith an all-atom scoring function that can then distinguishnative from non-native conformations[17,18]. This idearepresents the core of more general multiscale modeling ap-proaches; lower resolution models are used to extend sam-pling to longer time scales or larger system sizes, whereashigher-resolution models provide the energetic accuracy.While the structure prediction example above describes asingle pass of low-resolution sampling followed by the useof all-atom models for improved accuracy, multiscale mod-eling can also be done in a continuous fashion, for examplethrough Monte Carlo type simulations that repeatedly movebetween low- and high-resolution models for extendedsampling on an energy landscape that is closely coupled tothe interactions of the high-resolution model. The imple-mentation of multiscale modeling methods faces problemssimilar to these seen in the implementation of enhancedsampling methods, but usually involves the combination ofmultiple programs rather than a single simulation program.All-atom modeling of biological macromolecules is possi-ble with a number of standard molecular modeling packagessuch as CHARMM or Amber, but these programs usuallydo not fully support low-resolution models and especiallylattice-based representations. On the other hand, simulationprograms for low-resolution models, such as the latticesimulation program MONSSTER[19], do not usually allowall-atom modeling. Both types of applications are fairlycomplex, so that the option of simply merging them is notvery attractive. As for the implementation of enhanced sam-pling methods, a better solution would be to wrap simulationprograms for all-atom and low-resolution models througha common interface layer and provide translation routinesbetween both models as the basis for building multiscaleapplications.

In this paper, we describe a new set of utilities andprogramming libraries for the implementation of compu-tationally distributed enhanced and multiscale sampling

methods based on existing simulation programs. Thispackage, called Multiscale Modeling Tools for StructuralBiology (MMTSB) Tool Set (available athttp://mmtsb.scripps.edu/software/mmtsbToolSet.html), is an effortwithin the NIH Research Resource for Multiscale ModelingTools for Structural Biology and follows the implementa-tion strategy outlined above by integrating the existing pro-grams through an interface layer while providing missingfunctionality as necessary. Interpreted scripting languagessuch as Perl or Python are particularly suitable for buildinginterface layers since they combine ease of use and porta-bility with a high level of functionality for addressing thecomplex system-oriented but computationally less intensivetasks[20]. Similar, scripting-language based designs havebeen used successfully in other related applications suchas the molecular modeling tool kit (MMTK)[21] or theBioperl toolkit [22].

The idea of the MMTSB Tool Set is not just to providea set of user programs for certain enhanced and multi-scale sampling modeling tasks, but also a programmingworkbench, which provides the framework for the devel-opment of new applications that require the interplay ofmultiple simulation packages. It focuses on applicationsin the area of protein structure prediction, protein folding,and large-scale model building and refinement of proteinsand nucleic acids for which enhanced and multiscale sam-pling techniques are particularly useful. As a subset ofits functionalities, the tool set also provides a commonuser interface to all-atom modeling via CHARMM1 [1] orAmber1 [2] and reduced-model lattice modeling via MON-SSTER[19]. Furthermore, the tool set incorporates a num-ber of support functions that are motivated by multiscalemodeling applications, but are certainly useful for otherpurposes as well. They include algorithms for translatingquickly and accurately between low- and high-resolutionmodels and methods for the organization, manipulation,and evaluation of large sets of conformations for a givenprotein, in what may be referred to as ensemble comput-ing. Ensemble computing applications greatly benefit fromparallel execution since they are inherently parallel in na-ture and typically require relatively little communication.The tool set provides basic parallel platform support im-plemented on the scripting language level, which makes itlargely platform-independent and does not require specificcommunication libraries.

In the following, we will first describe the architectureand components of the MMTSB Tool Set in more detail. Wewill then continue by providing examples of how the toolset may be used for typical enhanced and multiscale sam-pling applications in protein structure prediction, structureevaluation, and structure refinement examples. We concludeby discussing how this architecture may be extended to newtasks and applications.

1 We note that the tool set is supported with versions of CHARMMbeyond c29b and Amber version 7.

Page 3: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 379

2. Software description

2.1. Architecture

Common modern scripting languages that would be ap-propriate for building complex applications are Perl andPython. We decided to use Perl as the (still) more widelyused scripting language in order to minimize portability is-sues and to facilitate user extensions as much as possible.As depicted inFig. 1, the architecture of the MMTSB ToolSet consists of a collection of object-oriented classes, calledpackages in Perl, that implement all of the core functionali-ties. These packages are used by a number of executable userprograms, which mainly parse command line arguments andcall the appropriate functions to build specific applications.In addition to the simulation programs CHARMM, Amber,and MONSSTER, a few other compiled-language programsare also included as part of the tool set, and wrapped throughPerl, for computationally more demanding tasks that can-not be done efficiently in Perl alone. This program designmaximizes flexibility and reusability. The packages alonecan be used as a programming library for a variety of tasksthat may go well beyond the intended applications of theMMTSB Tool Set. For example, one may use the interfaceto CHARMM to take advantage of Perl’s advanced scriptingcapabilities for building complex modeling applications thatrequire additional functionality and go beyond the capabil-ities of CHARMM’s own scripting language. On the otherhand, the command-line oriented user-level utilities in theMMTSB Tool Set are intended to cover a wide range of ap-plications with a special focus on enhanced and multiscalesampling protocols. Furthermore, since these user-level util-ities represent little more than a user interface to the pack-age routines, they can easily be customized to address new

Amber

minCHARMM.plminAmber.pl latticesim.pl

MONSSTER

SICHO

MONSSTERCHARMM

rebuild.pl

Molecule

Amber CHARMM

User

Sequence

Fig. 1. Representative view of the MMTSB Tool Set architecture. External programs CHARMM, Amber, and MONSSTER are shown in blue, Perlpackages in magenta, and Perl user utilities in green.

Table 1MMTSB Tool Set Perl packages implementing the core functionality andserving as a programming library

Package name Functionality

Molecule All-atom representation of molecule objectsCHARMM Interface to CHARMM molecular modeling

programAmber Interface to Amber molecular modeling

programAnalyze Structural analysis of molecular conformationsCluster Clustering of molecular conformationsSICHO Reduced, side chain based representation of

moleculesSequence Amino acid sequences and secondary

structure informationMONSSTER Interface to MONSSTER lattice simulation

programSimData, Ensemble Ensemble computingJobClient, JobServer Parallel execution for ensemble computing

applicationsReXClient, ReXServer Replica exchange simulationsGenUtil General utility functionsServer, Client General server/client implementation

types of problems either based on the existing package rou-tines or by adding new functionality. The user utilities couldalso be replaced by a different type of user interface or in-tegrated into other types of applications without significantadditional effort.

2.2. Components

In this section, we provide an overview over the variouscomponents of the MMTSB Tool Set. The packages anduser programs are listed inTables 1 and 2, respectively. Thispaper is meant to give an overview of the MMTSB Tool Set

Page 4: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

380 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

Table 2Main MMTSB Tool Set command-line oriented user utilities

Utility Functionality

enerCHARMM.pl, enerAmber.pl All-atom energy evaluation with CHARMM/AmberminCHARMM.pl, minAmber.pl All-atom minimization with CHARMM/AmbermdCHARMM.pl, mdAmber.pl All-atom molecular dynamics with CHARMM/Amberconvpdb.pl Convert and manipulate PDB filescomplete.pl Complete missing atoms in protein structuresmutate.pl Mutate residues in protein structures

rms.pl, lsqfit.pl, contact.pl, rgyr.pl, dihed.pl, qscore.pl Analysis of protein conformationscluster.pl Clustering of a set of conformations

latticesim.pl Lattice-based low-resolution simulations with MONSSTERgenchain.pl Generation of low-resolution representation from all-atom modelsrebuild.pl Reconstruction of all-atom models from low-resolution representations

ensmin.pl, enseval.pl, enslatsim.pl, ensrun.pl, enscluster.pl Ensemble computing applicationscheckin.pl Create ensemble from external sourcesgetprop.pl, setprop.pl Read/set ensemble property valuesshowcluster.pl, bestcluster.pl Cluster-based analysis of ensemble data

aarex.pl, aarexAmber.pl All-atom replica exchange simulations with CHARMM/Amberlatrex.pl Lattice model replica exchange simulations with MONSSTERrexinfo.pl Extract replica exchange data

hlamc.pl, hlamcrex.pl Hybrid multiscale lattice/all-atom Monte Carlo sampling protocol

predominantly from a user’s perspective. The following de-scription will focus on user programs rather than the under-lying packages and support programs; however, some exam-ples on how to use the packages as a library will be given aswell. We will describe the basic functions for all-atom andlow-resolution modeling and then continue with enhancedsampling and multiscale modeling applications.

3. All-atom modeling

The central part of the all-atom modeling componentsrevolves around interfaces to the molecular mechanicspackages CHARMM[1] and Amber[2]. In this respect theMMTSB Tool Set may be viewed as an alternative userinterface to CHARMM and Amber for certain standardmodeling tasks. The tool set utilities are meant to provideaccess to these powerful programs without requiring theuser to go through the learning curve of understanding thespecific command and data input and output protocols ofeach program. The functionalities that are provided throughthe MMTSB Tool Set focus on energy evaluations, min-imization, and molecular dynamics runs with the utilitiesenerCHARMM.pl, minCHARMM.pl, mdCHARMM.pl, en-erAmber.pl, minAmber.pl, and mdAmber.pl, respectively.Input structures are expected to be in standard PDB formatwith necessary name and format conversions done auto-matically and transparently for standard protein structures.A special utility, convpdb.pl, is also available for manualPDB format translations as well as a variety of manipula-tions that may involve changing residue numbering, editingchain identifiers, translating coordinates, and subselectingor merging structure fragments. All of the all-atom model-

ing utilities assume reasonable default values for a numberof parameters, which can be altered by the user throughadditional command line options if necessary. For example,minCHARMM.plwithout any further options will performa short minimization in vacuum on a structure given asinput and write the minimized structure to standard output.Parameters can then be altered to include implicit solvent[23,24], change the cutoff for non-bonded interactions, orthe number of minimization steps, among other options. Asanother example,mdCHARMM.plwith default parameterswill automatically recognize explicit water molecules inthe input file in PDB format and setup and run a standardmolecular dynamics protocol with periodic boundary con-ditions [25] and particle mesh Ewald electrostatics[26].The same default usage will use implicit solvent basedon a generalized Born formalism[27] instead, if explicitsolvent molecules are not found. While many options areavailable to support a number of commonly used featuresin CHARMM and Amber, the MMTSB Tool Set does notaim to provide a complete interface to the full level offunctionality of either one of these very complex molecularmodeling programs. However, for modeling tasks that gobeyond the capabilities of the provided utilities, the toolset may still be used to facilitate the preparation of inputstructures and setup procedures.

While the function of the MMTSB Tool Set as an in-terface to CHARMM and Amber may be very useful initself, it should be emphasized again at this point that thereal strength of the tool set lies in the combination of thesebasic all-atom modeling functions with other simulationtechniques that are not available in CHARMM or Amber.These are in particular enhanced sampling facilities basedon replica exchange methodology, multiscale modeling

Page 5: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 381

applications in a combination with low-resolution sam-pling, and ensemble computing techniques that allow theefficient application of a given modeling task to a large setof structures via distributed parallelism. Further aspects ofthis functionality will be described in more detail below.

4. Low-resolution modeling

Low-resolution modeling within the MMTSB Tool Set isbased on the MONSSTER program[19]. MONSSTER im-plements the SICHO (side CHain only) model where eachamino acid in a polypeptide chain is represented by a singlevirtual particle located at the side chain center of mass andprojected onto a cubic lattice with 1.45 Å grid spacing[28].Such a model is particularly well suited for constant temper-ature or simulated annealing type Monte Carlo simulationsbased on an energy function that is governed by physicaland knowledge-based terms. As with CHARMM and Am-ber for all-atom modeling tasks, the tool set can also be usedas a user interface for running low-resolution simulationswith MONSSTER. The central utility for running either con-stant temperature or simulated annealing lattice simulationsis latticesim.pl. Other supporting utilities are available foraccess to MONSSTER output files as well as the genera-tion of sequence files, lattice chains, and other input filesthat are needed when running the MONSSTER program in amore manual fashion. Through the MMTSB Tool Set, latticesimulations can also benefit from enhanced sampling tech-niques and ensemble computing facilities. The latter is par-ticularly useful for structure prediction applications where avery large number of structures are often generated with thefast lattice sampling protocol.

5. Translation between all-atom and low-resolutionmodels

Both levels of detail, all-atom and low-resolution repre-sentations, are brought together by MMTSB functions thatallow the generation of lattice chains from all-atom struc-tures and the reconstruction of all-atom structures fromlattice chains. Such mapping functions are essential for amultiscale modeling strategy and should preserve initialstructures as much as possible through complete translationcycles. The utility for the generation of low-resolution mod-els from all-atom structures isgenchain.pl. It is primarilyintended for generating lattice models suitable for MON-SSTER, but it can also be used to generate related typesof reduced models with or without additional particles atC� positions, either in continuous space or projected ontocubic lattices with different grid spacings.

The reduction from all-atom models to low-resolution rep-resentations is fairly simple and straightforward. The recon-struction of all-atom models from low-resolution models,on the other hand, is more challenging since lost informa-

tion has to be recreated by other means. Several methods areavailable for the reconstruction of complete all-atom mod-els at moderate levels of accuracy based on C� backbones[29–32]. In this case the backbone needs to be completedand side chains are typically added from a rotamer libraryand then annealed in order to resolve steric clashes. If thelow-resolution model is side chain center based, one can usea slightly different reconstruction algorithm[16]. Becausethe side chain center is known, the reconstructed structuresare generally quite close (<1 Å) to the original structurefrom which a low-resolution model was generated. The re-construction program is part of the MMTSB Tool Set andavailable through therebuild.pl utility. The rebuilding pro-cedure can handle on- as well as off-lattice low-resolutionmodels and is able to take advantage of C� coordinates,if present, to build more accurate peptide backbones thanwould be possible with side chain centers alone.

As a first example how a combination of low-resolutionand all-atom representations can be useful for commonmodeling tasks, one may consider the computational mu-tation of residues in a given protein structure. Off-latticelow-resolution models based on side chain centers and C�

coordinates can be used to preserve the backbone and thecenter of the original side chain while allowing the recon-struction of the mutated amino acid onto the same backboneat the same location. This may be done through a combi-nation of thegenchain.pland rebuild.pl utilities, or moreconveniently withmutate.pl, which is intended specificallyfor such computational mutation tasks.

6. Ensemble computing

Certain applications such as structure prediction, dockingexperiments, or estimates of conformational or interactionenergies often involve relatively large ensembles of differ-ent conformations for a system of interest. Such ensemblesmay be assembled from simulation snapshots, the endpointsof simulated annealing runs as with the low-resolutionlattice model described above, or by other means of confor-mational sampling. In many cases the ensemble structuresare then evaluated and compared in one way or another,typically with the goal of extracting the most favorable en-semble members as the structures with the highest stabilityand, consequently, highest biological relevance. It may alsobe desirable to manipulate all of the ensemble structures inthe same fashion in order to improve the evaluation pro-cess, for example by regularizing all of the conformationsthrough force field based minimization.

The MMTSB Tool Set provides convenient facilities forhandling structural ensembles in this manner. It allows theorganization of ensemble members in the form of a sim-ple database, along with associated properties such as en-ergetic terms or structural quantities, and includes utilitiesfor the application of the same operation on a whole en-semble of structures, in what we call ensemble computing.

Page 6: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

382 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

Such computations are highly amenable to parallel comput-ing environments, and the MMTSB Tool Set can take ad-vantage of common architectures from distributed or sharedmemory parallel clusters to loosely coupled sets of hetero-geneous machines through the use of a standard TCP/IPsocket-based networking protocol[33]. With applicationssuch as protein structure prediction in mind, special empha-sis is put on tools that allow the efficient minimization andevaluation of energies based on CHARMM for an ensembleof structures. These functions are available with the utili-ties ensmin.plandenseval.pl, respectively. A more generalutility, ensrun.pl, allows any command or command scriptto be run on a set of structures in an ensemble either forcalculating a property of interest or generating new sets ofstructures. The ensemble facility in the MMTSB Tool Set isdesigned to maintain multiple conformations for each mem-ber of an ensemble. Such sets of conformations may be de-rived through minimization, short molecular dynamics runsor other means of structure manipulation and they are iden-tified through user-defined tags. This expands the ensembleidea borrowed from statistical mechanics to a collection ofstructures where each member is represented not just by onebut by any number of related conformations. In this organi-zational scheme, multiple ensembles are only needed for en-tirely different sets of conformations or conformations thatbelong to a different system altogether. In these cases, mul-tiple ensembles would be distinguished simply by keepingthe data files in different subdirectories.

The internal organization of ensembles within theMMTSB Tool Set is illustrated inFig. 2. A relatively sim-ple, text file-based database setup was chosen to maintaina level of transparency and openness that allows easy ac-cess to stored conformations and properties with externalprograms. However, such a design comes at the expense ofefficiency for very large structural ensembles. While signif-icant limitations have not become obvious for applications

ensdir

ens.cfg

tag1.prop.dat

tag2.prop.dat

0

1 1

2 2

3

4

tag1.pdb

tag2.pdb

Fig. 2. Directory and file organization used for ensemble computing.

involving up to 50,000 ensemble members, better perfor-mance for larger ensembles could be obtained through amore efficient database design. The use of database engineswould be an option if performance improvements turn outto be necessary in the future.

There are four ways for generating structure ensembleswithin the MMTSB Tool Set. The first option, aimed atstructure prediction applications, generates ensembles fromlow-resolution lattice simulations withenslatsim.pl. Thisutility is an ensemble version oflatticesim.pltaking advan-tage of parallel execution and allowing the automatic recon-struction of all-atom models from the final lattice models. Inthe second option one can generate ensembles from replicaexchange simulations, which will be explained in more de-tail in the following section. For all other purposes, thereis a general utilitycheckin.plfor creating new ensemblesor adding one or more structures to an existing ensemblefrom external sources. Finally, as a fourth option, becauseof the simple database structure one may simply create anensemble directory structure and copy files manually. Thisis not recommended, but it may be more practical in combi-nation with other computational tools if integration withinthe MMTSB Tool Set is not possible or desirable.

Each set of structures in an ensemble has an associatedproperty data file, which is queried most conveniently withthe utility getprop.pl, but could also be easily read withother external programs, if necessary. The properties storedin this file are identified with arbitrary property tags and maybe comprised of energy terms calculated withenseval.pl,structural properties calculated withcalcprop.pl, or otherproperties resulting from external programs that are run withensrun.plover the whole ensemble. It is also possible to entersingle values up to whole data series in a manual fashionwith setprop.pl.

While some of the tools for ensemble computing arespecifically aimed at multiscale modeling and structure pre-diction applications, the more general utilities make this kindof infrastructure accessible for other applications as well. Itwas our intention with this design that the MMTSB ToolSet will become useful for a variety of ensemble comput-ing tasks that involve the organization and manipulation oflarge sets of molecular conformations in new contexts.

7. Replica exchange simulations

An exploration of the potential energy landscape for asystem of interest, usually with the goal of finding low-lyingregions, is the central theme of most molecular modelingapplications. Sampling efficiency with standard simulationtechniques such as molecular dynamics or Monte Carlo at agiven temperature is governed by the distribution and heightof energetic barriers, or ruggedness, and the slope towardsthe energy minimum in the landscape, both of which deter-mine the kinetic behavior of the system. Barrier crossingsare facilitated at higher temperatures, but a single simulation

Page 7: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 383

at an elevated temperature would sample an altered freeenergy surface due to temperature-dependent entropic con-tributions. As a dramatic example, a single simulation ofa protein at a temperature above its folding temperaturewould eventually result in protein unfolding, since unfoldedconformations have a lower free energy than conformationsin the folded, native basin at such temperatures.

Enhanced sampling schemes have been introduced to ad-dress this problem, so that it becomes possible to overcomeenergetic barriers more easily while maintaining samplingon the relevant free energy surface at room temperature. Inone such method, called replica exchange or parallel tem-pering, multiple simulations or replicas of the same systemare run in parallel at different temperatures[12,34]. The in-dividual simulations are then coupled through Monte Carlobased exchanges of simulation temperatures between repli-cas at periodic intervals. In this scheme each simulation vis-its a range from low to high temperatures so that samplingis provided at the temperature of interest, while traversingconformational space more easily at elevated temperatures.

More formally, temperatures are exchanged between tworeplicas,i and j, with temperaturesTi, Tj and energiesEi,Ej according to the canonical Metropolis criterion for theexchange probabilityp:

p ={

1 for∆ ≤ 0

exp(−∆) for∆ > 0

where

∆ =(

1

kTi− 1

kTj

)(Ej − Ei)

Applied to one or more pairs of simulations after shortruns of molecular dynamics or Monte Carlo simulations atconstant temperature, this protocol can improve the samplingefficiency by orders of magnitude depending on the type andsize of the system. Replica exchange simulations have beenused with great success for the ab initio folding of peptidesin explicit solvent from first principles[8–10,35]. In otherapplications, shorter replica exchange runs may be used forimproved local structure refinement or simply for rankinga set of structures according to relative free energy, sincethe most favorable conformations will populate the lowesttemperatures. While replica exchange simulations based onthe exchange of temperatures have been most popular, otherforms of biases can also be used to reweight sampling prob-abilities[6,7]. For example, umbrella type biasing potentialscould be used to restrain the radius of gyration or the frac-tion of native contacts to different values in each replica and,in a more general case, multiple biases can be combinedin two-dimensional or even higher-dimensional replica ex-change simulations[36].

In the MMTSB Tool Set, replica exchange sampling isavailable to achieve enhanced sampling of all-atom mod-els with CHARMM or Amber, as well as enhanced lat-tice based sampling of low-resolution representations with

MONSSTER. In each case, the simulation control and ex-change algorithm are implemented on the scripting languagelevel. In fact, most of the same code is reused in these casesand could be combined easily with other applications in or-der to add replica exchange sampling. Replica exchange sim-ulations are particularly suitable for parallel environmentsdue to their inherent parallelism and low cost of communi-cation, since communication occurs only infrequently at ex-change events. As for the ensemble computing functions, theMMTSB Tool Set supports most parallel architectures andenvironments through its own platform independent com-munication protocol.

The main tools for running replica exchange simulationsin the MMTSB Tool Set arelatrex.plfor lattice-based replicaexchange simulations using MONSSTER, andaarex.plandaarexAmber.plfor all-atom replica exchange simulations us-ing CHARMM and Amber, respectively. During and after areplica exchange run simulation data can be queried in manyways with therexinfo.pl utility. Replica exchange simula-tions run through the MMTSB Tool Set involve a specialdirectory structure for organizing and storing the conforma-tions from each of the individual replicas, but an option isavailable to automatically build an ensemble data structurefrom the lowest temperature conformations for further pro-cessing with the ensemble computing tools.

8. Advanced multiscale sampling methods

The utilities for lattice-based low-resolution sampling,for all-atom sampling, and for the translation betweenlow-resolution and all-atom models can be combined toimplement a basic multiscale modeling protocol. This isprovided with the utilitypredict.pl, which integrates thesesteps into a single pass from low-resolution sampling toall-atom based scoring for structure prediction applications.More complex multiscale modeling protocols, however,

Short lattice simulation (MC)

All atom reconstruction

All atom minimization

All atom energy evaluation

accept/rejectconformation

MonteCarloMove

MetropolisCriterion

Fig. 3. Hybrid lattice/all-atom simulation scheme I.

Page 8: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

384 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

Short lattice simulation (MC)

All atom molecular dynamics

All atom energy evaluation

restraintsMonteCarloMove

MetropolisCriterion

accept/rejectconformation

Fig. 4. Hybrid lattice/all-atom simulation scheme II.

may involve the continuous transition between low andhigh-resolution models to take advantage of efficient sam-pling with the low-resolution model and an accurate energyfunction with all-atom models. While any protocol could besetup as a custom application using the MMTSB Tool Setpackages or programming library, we have implementedtwo advanced modes of multiscale modeling simulationsthat appear to be particularly useful. Both of these samplingprotocols follow the idea of sampling conformations on theall-atom energy landscape using the lattice model, and areimplemented inhlamc.pl. In the first mode, illustrated inFig. 3, a Monte Carlo simulation is run with moves thatconsist of a very short constant temperature lattice simula-tion followed by all-atom reconstruction and short all-atomminimization before the final all-atom energy is used in theMetropolis criteria. The second mode (Fig. 4) couples latticeand all-atom models more tightly by running short all-atommolecular dynamics simulations that follow conformationalmoves from lattice simulation through side chain centerrestraints. Again, the final energy is used in a Monte Carlosimulation to either accept and continue from favorableconformations or reject unfavorable conformations and tryanother move. Instead of a single Monte Carlo run, eithermode can alternatively be coupled with replica exchangesampling at different temperatures withhlamcrex.pl. Othersimilar multiscale sampling algorithms are certainly pos-sible, and the application utilities provided may serve asstarting points for implementing new sampling schemes.While more testing and tuning of these novel methods isneeded, we believe their availability through the MMTSBTool Set will spark further interest.

9. Structure analysis functions

A number of utilities in the MMTSB Tool Set can be usedfor limited structure analysis tasks. They include functionssuch as clustering or the calculation of root mean square de-viations and optimal superposition between two conforma-

tions, calculation of the radius of gyration, the fraction ofnative contacts, or standard peptide chain dihedral anglesφ,ψ, ω, andχ1. In ensemble computing applications, most ofthese structural properties can be calculated in parallel for awhole set of ensemble structures with the utilitycalcprop.plin order to facilitate analysis of ensemble structures.

The clustering functions are particularly helpful for struc-ture prediction tasks. Clustering is done based on pairwisedistances, measured either as coordinate or dihedral angleroot mean square deviations independent of any single ref-erence structure. The results are sets of structures with sim-ilar conformations according to the given criteria. It is thenpossible to compare energy scores between entire clusters asthe average score from all of their respective members andobtain statistically more reliable quantities such as energyscores from a cluster of similar conformations rather thansingle conformations. This type of analysis is facilitated forensembles with theenscluster.plandbestcluster.plutilitiesfor the generation of clusters and cluster-based analysis, re-spectively.

10. Applications

Having provided an overview of the different componentsof the MMTSB Tool Set, we now present a few typical appli-cations that illustrate the use of the tool set—scoring of pre-viously generated protein conformations with the ensemblecomputing facility, folding of peptides via replica exchangesimulations, and the prediction of a missing fragment in thecontext of a known structure.

10.1. Scoring of protein conformations

The energy based scoring of protein conformations is acommon task in structure prediction and docking protocols.In these cases the scoring function is typically applied to alarge number of conformations generated with a given sam-pling method, with the goal of finding the most favorable,

Page 9: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 385

and presumably most native-like, structures. The ensemblecomputing facilities within the MMTSB Tool Set are partic-ularly well suited for such a task. The following example il-lustrates the use of the MMTSB Tool Set for scoring predic-tions for the structure of a fusagenic sperm protein from H.fulgens[37]. Predictions of this protein were submitted dur-ing CASP4 (target id: T0125), the fourth community-wideassessment of structure prediction methods. For the ex-ample shown here, we downloaded all of the predictionsfor the entire length of the sequence submitted as the firstmodel by participating prediction groups from the CASPweb site. This yielded a total of 90 structure predictions.

11. Generation of an ensemble data structure frominput files

As the first step, an ensemble is generated from the set ofpredicted structures by using thecheckin.plutility:

The file names of the predictions submitted to CASP fol-low the format T0125∗.pdb for this target and the structuresare given an identifying tagcasp in the newly created en-semble. Now that the predicted structures are available inensemble format, ensemble computing tools can be used forfurther processing.

12. Preprocessing of input structures

Depending on how the input structures were generated, itis often a good idea to regularize and minimize the structures

before calculating energy scores. Many structure predictionsdo not contain a complete set of atoms. Often, hydrogenatoms are missing and some predictions may consist only ofC� coordinates. Therefore, as the first step we will run thecomplete.plutility in order to generate complete, all-atomstructures for all of the predictions. Since we want to applythis command to all of the structures in the ensemble we useensrun.plas follows:

This command runs thecomplete.pl utility for eachstructure in the ensemble under thecasp tag, the onlyset of structures we have so far, and generates a new set

from the output of complete.plwith the tag caspcom-plete. The complete.plutility automatically uses differentprotocols depending on how much of the structural in-formation is missing. If only C� or backbone coordinatesare present, the SCWRL utility[30,38] is used to addside chains from a rotamer library, while missing hydro-gens are added with the HBUILD facility in CHARMM[1].

The next step is a short minimization run with a distancedependent dielectric function. This can be done very conve-niently with theensmin.plutility which uses CHARMM todo the actual minimization:

This command minimizes all of the completed structuresstored under thecaspcompletetag and creates a new setof structures with themin tag. In this example 100 stepsof minimization are requested with a distance dependentdielectric function andε = 4. Depending on the size of the

system, minimization runs can take some time to completeand it is advantageous to use parallel computing facilitiesto speed up the calculation. In this example four CPUs areused in a shared memory environment.

13. Evaluation of scoring function

Finally, we can evaluate a scoring function for the mini-mized structures. The ensemble computing tool for energyevaluation,enseval.pl, is used as follows:

Here, we are using a scoring function that includes im-plicit solvation based on a generalized Born formalism[27],in this case the GBMV method[39,40], as implemented inCHARMM as the default when GB is requested. Again, fourCPUs are used in parallel to speed up the calculation. In thiscase the total energy of the entire molecular mechanics forcefield, including all bonded and non-bonded interactions aswell as the electrostatic solvation term, are used as the

scoring function and assigned to a new property calledscore.

Page 10: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

386 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

14. Analysis of results

Once theenseval.plrun is complete the scores are avail-able and can be queried withgetprop.pl. A sorted list of allvalues is obtained easily with the following command:

In this command the property namescoreand structuretagminare used to identify the data set. Such a result may besufficient for many applications, but often it is advantageousto form clusters of input structures based on mutual similar-ity and then compare average scores over cluster membersto identify the lowest scoring clusters rather than individualstructures. This requires a few additional steps and will beillustrated in more detail in the loop prediction example.

Since the native conformation of this structure is availablefrom the protein data bank[41] (PDB code: 1GAK) it ispossible to calculate root mean square deviations (RMSD)between all of the predictions and the native structure. Thiscan be done conveniently withcalcprop.pl, which calculatesa number of structural properties, including RMSD:

Both, the energy score and C� RMSD values, can now beextracted with

The results are visualized inFig. 5 as a plot of energyscores versus root mean square deviations from the nativestructure. It can be seen that the energy scores decreaseon average towards more native-like conformations, and thescoring function could be indeed used to identify the mostnative like conformations, with a C� RMSD of about 4 Åin this example. In a recent study, we have applied similarprotocols to all of the predictions submitted to CASP4 andgenerally found good correlation between all-atom energyscores that include a realistic treatment of solvation andproximity of predicted conformations to the experimentallyobtained native structure[17].

14.1. Folding of peptides with replica exchangesimulations

The MMTSB Tool Set adds enhanced sampling capabili-ties to existing simulation programs such as CHARMM or

Amber through the replica exchange simulation methodol-ogy. Replica exchange simulations can speed up samplingin conventional molecular dynamics simulations by ordersof magnitude[8,12]. Such a gain in sampling efficiencyis particularly attractive for the challenging problem offolding peptides and proteins through simulation at atomicdetail. Ab initio folding at atomic detail has been simulateddirectly with constant temperature molecular dynamicssimulations only for very small peptides, where foldingtimes are on the order of hundreds of nanoseconds andconsiderable computational resources were used[42,43].When replica exchange simulations are employed, ab initiofolding of peptides can be achieved for larger systems andon much shorter timescales[9,10,35]. Further reduction of

computational expense is possible if an implicit solventdescription is used instead of explicit solvent molecules[44–47]. It then becomes possible to fold�-helices and�-hairpins in a matter of days with moderate computa-tional resources. As examples we will consider the peptide(AAQAA) 3, which is known experimentally to be predom-inantly �-helical [48], and the designed tryptophan zipperhairpin SWTWENGKWTWK [49], for which an experi-mental structure is available from NMR. A replica exchangesimulation with the MMTSB Tool Set starting from a com-pletely extended conformation for either peptide is run fromthe command line as follows:

Page 11: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 387

-5000

-5500

-6000

-6500

-7000

Cα RMSD in Å

All-

Ato

m E

nerg

y Sc

ore

in K

cal/m

ol

0 2 4 6 8 10 12 14 16 18 20

Fig. 5. All-atom energy scoring function versus root mean square deviation (RMSD) of C� coordinates with respect to native structure (PDB code: 1GAK).The all-atom energy function based on the CHARMM22 force field[56] includes implicit solvent contributions from the GBMV implementation of thegeneralized Born formalism[39,40]. Data points with energies more positive than−5000 kcal/mol and more than 20 Å RMSD are omitted for clarity.

In these simulations we are using the GBMV implemen-tation[39,40]of the generalized Born formalism with mod-ified van der Waals radii based on the set of radii developedby Nina et al.[50] and a hydrophobic term that depends lin-early on the solvent accessible surface area with a scalingfactor of 0.012 kcal/mol/Å2. We also use recently developedmap-basedφ/ψ backbone dihedral cross terms, in order toadjust the balance between�-helical and extended confor-mations from the original force field[51]. Both peptidesare built with blocked termini. The eight temperature win-dows are exponentially spaced from 270 to 550 K resultingin temperature exchange probabilities between replica pairsfrom 16 to 19% for the hairpin and from 16 to 25% for thehelical system. The simulations are each carried out over atotal of 10 ns simulation time for each replica (10,000 cyclesand 500 molecular dynamics steps (step size: 2 fs) betweenexchanges), which takes about 1 week on eight nodes of aPC-based cluster.

The data from replica exchange simulations can be ex-amined with therexinfo.pl utility and conformations maybe analyzed withrms.pl and genseq.pl. Particularly usefulis an option to rank replicas according to the average tem-peratures they have visited over a period of simulation time.This information can be used to identify the most favorableconformations from the replica with the lowest average tem-peratures.Tables 3 and 4summarize the results from simu-lations for (AAQAA)3 and the hairpin system, respectively.The data shows that the most favorable final conformationsat the lowest temperatures agree well with experimental ob-servations. Replica 7 in the simulation of (AAQAA)3, withthe lowest average temperature at the end of the simulation,

is indeed�-helical (see alsoFig. 6). On the other hand, themost favorable replica in the hairpin simulation, replica 1,clearly exhibits a hairpin-like secondary structure and de-viates by only 1.4 Å RMSD from the experimentally de-termined structure. However, in this example we also findother structures at higher temperatures that contain partial�-helices. The excellent agreement of the simulated hairpinfrom replica 1 is also manifest inFig. 6, where the final con-formation is compared with the structural ensemble fromNMR measurements[49].

It is instructive to examine the evolution of RMSD andtemperature for replica 7 in the (AAQAA)3 simulation(Fig. 7a) and replica 1 in the hairpin simulation (Fig. 7b).

Table 3Results from replica exchange simulations of (AAQAA)3

Temperature rank and percentage of time spent at the lowest tempera-ture, 270 K, are averaged over the last 1000 cycles (9000–10,000). Thesecondary structure is obtained from the final conformation after 10,000cycles using the DSSP program[55].

Page 12: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

388 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

Table 4Results from replica exchange simulations of the harpin sequence SWTWENGKWTWK

Temperature rank and percentage of time spent at the lowest temperature, 270 K, are averaged over the last 1000 cycles (9000–10,000). The secondarystructure is obtained from the final conformation after 10,000 cycles using the DSSP program[55]. C� coordinate root mean square deviations arecalculated with respect to the native structure obtained from NMR experiments (PDB code 1LE1, model 1)[49].

In both cases, native-like structures are reached relativelyquickly (after 5 and 2.5 ns, respectively). The variationsin temperature suggest extensive conformational samplingat higher temperatures during the beginning of the simu-lation until a favorable, native-like conformation is foundand then the temperature remains at the lower tempera-tures. For the hairpin, the step-wise reduction of RMSDwith respect to the native structure can be correlated withthe temperature fluctuations. Major transitions at 1.8 and2.5 ns appear to occur during or shortly after brief periodsof elevated temperatures around 500 K when the crossingof conformational barriers is greatly facilitated.

This example is meant to demonstrate how the MMTSBTool Set can be used to take advantage of the replica ex-

Fig. 6. Final conformations from lowest temperature replicas after 10,000 cycles (10 ns) in simulations of (AAQAA)3 (left) and the hairpin sequenceSWTWENGKWTWK (right). The hairpin structure is compared with the first 10 NMR models from PDB entry 1LE1[49].

change enhanced sampling methods in combination withmodeling packages such as CHARMM and Amber. We hopethat it will enable further studies in peptide folding, proteinfolding, and protein structure refinement.

14.2. Prediction of missing fragments in proteins

The large number of solved experimental protein struc-tures provides the basis for finding at least partial templatesin most structure prediction applications based on sequencehomology or fold recognition. This reduces typical structureprediction efforts from entirelyde novopredictions to thestill challenging task of modeling unknown structural frag-ments in the context of a template. In principle, a multiscale

Page 13: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 389

Fig. 7. (a) Time series of simulation temperature and number of�-helical i, i + 4 hydrogen bonds in replica 7 over the course of the simulation of(AAQAA) 3. (b) Time series of simulation temperature and C� RMSD with respect to the native structure (PDB: 1LE1) in replica 1 over the course ofthe simulation of SWTWENGKWTWK.

modeling protocol as outlined above is followed, but withthe restraint of keeping the template structure fixed and thepossibility to limit calculations to the vicinity of a modelingregion and reduce computational expenses. How this can bedone efficiently with the MMTSB Tool Set will be illustratedin the following example, which demonstrates the modelingof residues 48 to 55 in the zinc endopeptidase astacin fromEuropean fresh water crayfish (PDB code: 1IAB). The na-tive structure is a mixed�/�-fold, and the missing 8-residuepiece constitutes a long, solvent-exposed loop between two�-sheet segments.

15. Generation of model conformations from latticesimulations

In the first step, conformations for the missing fragmentare generated using lattice-based low-resolution sampling.As input for this step only a sequence file and the tem-plate structure are needed. The sequence file containsthe entire sequence for the template as well as the miss-ing part. It also provides secondary structure informationthat is trivially obtained for the template and can be pre-dicted for the missing part with good reliability using a

Page 14: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

390 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

variety of different secondary structure prediction meth-ods.

In this example, we will use replica exchange simulationswith the lattice model for enhanced sampling. The commandwould then look like this:

This command will run 1000 cycles of replica exchangeon eight CPUs, which takes on the order of 1 h on modernclusters. With this command, all-atom structures are auto-matically rebuilt from the lowest-temperature replicas andstored as an ensemble for further processing. Since the−loption is given with the residues of the unknown structuralfragment, only these residues are sampled freely, while therest of the structure is restrained harmonically to the posi-tions from the incomplete input structure that is used as atemplate. The structures sampled at the lowest temperatureat each cycle are used to build an ensemble data structurein the directoryens.

16. Selection of protein environment near region ofinterest

The sampled structures could now be minimized, scored,and analyzed as in the example above. With 200 residues, thecomplete protein is fairly large, and both the minimizationand energy evaluation steps are relatively expensive. Sinceonly a small part of the structure has been varied in the sam-pling protocol, one does not necessarily need to consider theentire structure. The part of the structure in the vicinity of

the variable residues can be cut out according to a distancecutoff similar to the range of electrostatic interactions, forexample 12 Å. While this would very likely result in brokenpeptide chains at the edge of the cutout region the structurecan be kept intact if relatively strong restraints are appliedto fix the outer layer residues in place. This is in the spirit ofthe stochastic boundary approach to simulating localized re-gions of biopolymer structures[52]. Strains between highlyrestrained residues and entirely unrestrained parts may berelieved further if a second, intermediate layer of weakly re-strained residues is introduced. This setup then results in asystem where the variable residues that are being modeled

are entirely flexible, surrounded by a first layer of weaklyrestrained residues and a second layer of highly restrainedresidues. Depending on the size of the system and the chosencutoff this may result in significant computational advan-tages if only part of a large system needs to be considered.

The MMTSB Tool Set offers the utilityenscut.plto cut outsuch regions for an ensemble of structures and automaticallysetup the necessary residue restraint lists. It creates the listof residues that are included in the cutout region based onall of the different conformations for the variable residues asfound in the ensemble, so that the same residues are cutoutfor each ensemble structure and energy values calculated ata later point remain comparable.

In our example, we will use a cutoff of 12 Å for includingresidues at all and a cutoff of 9 Å for residues that are weaklyrestrained:

The cutout structures are then available under the taglat-cut in the same ensemble and can be used for further process-ing. In this example the original structure with 200 residuesis reduced to a region of interest of 123 residues, whichtranslates into significant time savings in subsequent steps.

17. Scoring of conformations

Following the example above, the sampled conformationsare first minimized before being scored with an energy func-tion that includes implicit solvation based on the generalizedBorn formalism[39,40]:

In this case, we create a minimized structure under thetagcutmin. Options specifying restraints to keep the cutoutregion intact during the minimization as described aboveare read from an options file generated automatically byenscut.plwhen the structures were reduced.

18. Clustering and analysis

At this point energy scores are available for the sampledconformations and we can proceed to cluster the sampledconformations based on mutual root mean square deviationswith the commandenscluster.pl:

Page 15: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 391

Since we are primarily interested in the conformationof residues 48 to 55, we will cluster only based on theseresidues, disregarding the surrounding template. A quickview of the resulting clusters is available with the commandshowcluster.pl:

In this example, we find 7 clusters with sizes rangingfrom 38 to 435 members.Fig. 8 shows the sampling forthis example with the experimental native structure as thereference for the RMSD calculations.

The utility getprop.plcould now be used with each clusterto obtain average energy scores and rank clusters accord-ingly. However, this can be done more conveniently withthe bestcluster.plutility, which automatically calculates av-erage scores, standard deviations, and statistical errors forall clusters and ranks them accordingly. It can also calculateaverages for only the subset of structures where the scoresfall within two standard deviations of the mean. This is help-ful when outliers occur with very high energies due to stericclashes that could not be resolved in the minimization pro-cedure, and is used in the following example:

We find that cluster t.3 (colored red inFig. 8) has thelowest average score (column 4) and the statistical error of16.6 kcal/mol (column 6) indicates that the difference of ap-proximately 100 kcal/mol to the next best cluster t.5 (coloredblue inFig. 8) is significant. The second column shows thetotal number of conformations in a given cluster. Columnthree indicates how many were actually used to computethe average, while the remaining conformations with muchhigher energy scores were excluded. The results confirm thequalitative picture inFig. 9 of a downward slope of aver-age energy towards more native-like structures. It should bestressed, though, that in a real structure prediction applica-

tion the RMSD values with respect to the native conforma-tion would obviously not be available.

The lowest energy conformation of the best cluster t.3 canthen be found by callinggetprop.plwith the cluster name

as an additional argument. Finally, one may want to mergethe final conformation with the original template in orderto regain a complete protein structure. This can be donewith theconvpdb.plutility and may be followed by a quickminimization run with restrained C� atoms to anneal themerged structure.

In this case, the cluster with the conformations generatedfrom the lattice protocol that are closest to the native struc-ture was easily identified with this multiscale sampling pro-tocol. While the best conformations in this cluster are closeto 2 Å RMSD from the native, the conformation with thelowest energy score is found with an RMSD value of 3.6 Å.This structure, shown inFig. 9, has the correct loop confor-mation for the most part but is shifted somewhat with respect

to the experimental structure. At this point further samplingand refinement could focus only on structures from the bestcluster in order to better distinguish structures closest to thenative conformation.

18.1. Programming interface

The user-level utilities provide a comprehensive set offunctions for enhanced and multiscale sampling applica-tions; however the MMTSB Tool Set can also be used as aprogramming library for new tasks that involve or combineall-atom modeling, low-resolution modeling, and enhanced

Page 16: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

392 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

sampling methods. Such new applications would have tobe written in Perl in order to take full advantage of thePerl packages from the tool set, but it is always possibleto include components from other compiled or scriptinglanguages through wrappers. One may also wrap the Perlscripts if they are to be used in other scripting environmentsalthough this may not always be an efficient solution.

As an example demonstrating the use of the tool setpackages as a programming library for new Perl scripts,let us consider the analysis of secondary structure for

low-resolution lattice chains based on backbone dihedralangles in rebuilt all-atom structures. This is done with thefollowing steps: First, the SICHO chain file in MONSSTERformat containing the input lattice structure is read. Sincethe lattice chain does not contain any sequence information,we need to read a sequence file, also in MONSSTER format,for this example. An all-atom molecule object is then rebuiltfrom the SICHO lattice chain. Theφ andψ dihedral anglescan then be calculated, analyzed, and written out to standardoutput. The corresponding script could be as follows:

Page 17: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 393

Fig. 8. Energy score including implicit solvent vs. RMSD from native conformation for loop residues 48–55 in astacin (PDB code: 1IAB) for structuresgenerated with lattice sampling protocol. The cluster with the lowest average energy score, t.3, is colored in red, the second best cluster, t.5, is coloredin blue.

This script uses four packages from the MMTSB ToolSet and would require significantly more effort if it had tobe written without using the functionality of the tool set.When this example is run, it expects a sequence and chainfile as input and writes out a list of residues with theircorrespondingφ/ψ angles and assigned secondary structuretypes.

Fig. 9. Predicted loop conformation (orange) with lowest score from bestcluster (t.3) compared with experimental structure (blue).

19. Summary

We have introduced the MMTSB Tool Set, a collectionof utilities and programming libraries aimed at enhancedsampling and multiscale modeling applications in struc-tural biology. The tool set interfaces with the standardmolecular modeling packages CHARMM and Amber forall-atom modeling and with MONSSTER for low-resolutionlattice-based simulations. It adds a number of functions, suchas the translation between all atom and low resolution repre-sentations, and implements replica exchange sampling bothfor all-atom and lattice-based simulations. Another featurethe MMTSB Tool Set enables ensemble computing for theapplication of programs and functions to large sets of struc-tures. The MMTSB Tool Set is intended primarily to addressproblems in protein structure prediction, but it also servesas a simplified interface to the complex modeling packagesCHARMM, Amber, and MONSSTER and we certainlyhope that it will become useful for other applications aswell.

We have presented three illustrative examples of how theMMTSB Tool Set may be used. While the examples presentreal cases, they are not intended to validate the methods thatwere being used. While a more careful evaluation of themethodology has been ongoing[16,17,51,53,54]and willbe continued in the future, the purpose of this paper is todemonstrate the capabilities of the tool set.

Future developments of the MMTSB Tool Set may ex-pand the availability of new enhanced sampling methods,implement more advanced multiscale sampling algorithms,and offer an alternative graphics-based user interface.

Page 18: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

394 M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395

Acknowledgements

We thank the first users of the MMTSB Tool Set for test-ing and useful suggestions. Financial support from the NIHsupported resource Multiscale Modeling Tools for StructuralBiology (http://mmtsb.scripps.edu) (grant RR12255 to CLBIII) is acknowledged. Support for the computational infra-structure and personnel (MF) that enabled these calculationscame from the DOD through the grant DAMD17-03-2-0012and is greatly appreciated.

References

[1] B.R. Brooks, et al., CHARMM: a program for macromolecularenergy, minimization, and dynamics calculations, J. Comput. Chem.4 (1983) 187–217.

[2] D.A. Pearlman, et al., AMBER, a package of computer programsfor applying molecular mechanics, normal mode analysis, moleculardynamics and free energy calculations to simulate the structuraland energetic properties of molecules, Comput. Phys. Commun. 91(1995) 1–41.

[3] M.J. Frisch, et al., Gaussian 98. Gaussian, Inc, Pittsburgh, PA, 1998.[4] A.C. Siepel, et al., An integration platform for heterogeneous

bioinformatics software components, IBM Syst. J. 40 (2001) 570–591.

[5] U.H.E. Hansmann, Generalized ensemble techniques and proteinfolding simulations, Comput. Phys. Commun. 147 (2002) 604–607.

[6] U.H.E. Hansmann, New algorithms and the physics of proteins, Phys.A 321 (2003) 152–163.

[7] T. Nagasima, et al., Generalized ensemble simulations of spin systemsand protein systems, Comput. Phys. Commun. 146 (2002) 69–76.

[8] K.Y. Sanbonmatsu, A.E. Garcia, Structure of met-enkephalin inexplicit aqueous solution using replica exchange molecular dynamics,Proteins 46 (2002) 225–234.

[9] R. Zhou, B.J. Berne, R. Germain, The free energy landscape for�

hairpin folding in explicit water, Proc. Nat. Acad. Sci. U.S.A. 98(2001) 14931–14936.

[10] A.E. Garcia, K.Y. Sanbonmatsu, Exploring the energy landscape ofa � hairpin in explicit solvent, Proteins 42 (2001) 345–354.

[11] Y.M. Rhee, V.S. Pande, Multiplexed-replica exchange moleculardynamics method for protein folding simulation, Biophys. J. 84(2003) 775–786.

[12] Y. Sugita, Y. Okamoto, Replica-exchange molecular dynamicsmethod for protein folding, Chem. Phys. Lett. 314 (1999) 141–151.

[13] D. Gront, A. Kolinski, J. Skolnick, A new combination of replicaexchange Monte Carlo and histogram analysis for protein foldingand thermodynamics, J. Chem. Phys. 115 (2001) 1569–1574.

[14] J. Skolnick, A. Koliniski, A unified approach to the prediction ofprotein structure and function, Adv. Chem. Phys. 120 (2002) 131–192.

[15] H. Lu, J. Skolnick, A distance-dependent atomic knowledge-basedpotential for improved protein structure selection, Proteins 44 (2001)223–232.

[16] M. Feig, et al., Accurate reconstruction of all-atom protein represen-tations from side-chain-based low-resolution models, Proteins 41(2000) 86–97.

[17] M. Feig, C.L. Brooks, III, Evaluating CASP4 predictions withphysical energy functions, Proteins 49 (2002) 232–245.

[18] C. Simmerling, et al., Combining MONSSTER and LES/PME topredict protein structure from amino acid sequence: application to thesmall protein CMTI-1, J. Am. Chem. Soc. 122 (2000) 8402–8932.

[19] J. Skolnick, A. Kolinski, A.R. Ortiz, MONSSTER: a method forfolding globular proteins with a small number of distance restraints,J. Mol. Biol. 265 (1997) 217–241.

[20] D.M. Beazley, An extensible compiler for creating scritable scientificsoftware, in: Proceedings of Lecture Notes in Computer Science,Computational Science–ICCS 2002, Part II, vol. 2330, 2002,pp. 824–833.

[21] K. Hinsen, The molecular modeling tool kit: a new approach tomolecular simulations, J. Comput. Chem. 21 (2000) 79–85.

[22] J.E. Stajich, et al., The bioperl toolkit: Perl modules for the lifesciences, Genome Res. 12 (2002) 1611–1618.

[23] B. Roux, T. Simonson, Implicit solvent models, Biophys. Chem. 78(1999) 1–20.

[24] C.J. Cramer, D.G. Truhlar, Implicit solvation models: equilibria,structure, spectra, and dynamics, Chem. Rev. 99 (1999) 2161–2200.

[25] M.P. Allen, D.J. Tildesley, Computer Simulation of Liquids, first ed.,Oxford University Press, New York, 1987.

[26] T.A. Darden, D. York, L.G. Pedersen, Particle mesh Ewald: anNlog(N) method for Ewald sums in large systems, J. Chem. Phys.98 (1993) 10089–10092.

[27] D. Bashford, D.A. Case, Generalized Born models of macromolecularsolvation effects, Ann. Rev. Phys. Chem. 51 (2000) 129–152.

[28] A. Kolinski, J. Skolnick, Assembly of protein structure from sparseexperimental data: an efficient Monte Carlo model, Proteins 32 (1998)475–494.

[29] E.S. Huang, et al., Accuracy of side-chain prediction upon near-nativeprotein backbones generated by ab initio folding methods, Proteins33 (1998) 204–217.

[30] R.L. Dunbrack Jr., M. Karplus, Backbone-dependent Rotamer libraryfor proteins: application to side-chain prediction, J. Mol. Biol. 230(1993) 543–574.

[31] M.P. Jacobson, et al., Force field validation using protein side chainprediction, J. Phys. Chem. B 106 (2002) 11673–11680.

[32] Z.X. Xiang, B. Honig, Extending the accuracy limits of predictionfor side-chain conformations, J. Mol. Biol. 311 (2001) 421–430.

[33] D. Comer, Internetworking with TCP/IP: Principles, Protocols, andArchitecture, vol. 1, fourth ed., Prentice Hall, 2000.

[34] U.H.E. Hansmann, Parallel tempering algorithm for conformationalstudies of biological molecules, Chem. Phys. Lett. 281 (1997) 140–150.

[35] A.E. Garcia, K.Y. Sanbonmatsu,�-Helical stabilization by side chainshielding of backbone hydrogen bonds, Proc. Nat. Acad. Sci. U.S.A.99 (2002) 2782–2787.

[36] Y. Sugita, A. Kitao, Y. Okamoto, Multidimensional replica-exchangemethod for free-energy calculations, J. Chem. Phys. 113 (2000)6042–6051.

[37] N. Kresge, V.D. Vacquier, C.D. Stout, The crystal structure ofa fusagenic sperm protein reveals extreme surface properties,Biochemistry 40 (2001) 5407–5413.

[38] M.J. Bower, F.E. Cohen, R.L. Dunbrack, Prediction of proteinside-chain Rotamers from a backbone-dependent Rotamer library: anew homology modeling tool, J. Mol. Biol. 267 (1997) 1268–1282.

[39] M.S. Lee, F.R. Salsbury, Jr., C.L. Brooks, III., Novel generalizedBorn methods, J. Chem. Phys. 116 (2002) 10606–10614.

[40] M.S. Lee, et al., A new analytical approximation to the standardmolecular volume definition and its application to generalized Borncalculations, J. Comput. Chem. 24 (2003) 1348–1356.

[41] H.M. Berman, et al., The protein data bank, Nucl. Acids Res. 28(2000) 235–242.

[42] B. Zagrovic, E.J. Sorin, V. Pande,�-hairpin folding simulations inatomistic detail using an implicit solvent model, J. Mol. Biol. 313(2001) 151–169.

[43] X. Daura, et al., Reversible peptide folding in solution by moleculardynamics simulation, J. Mol. Biol. 280 (1998) 925–932.

[44] A. Hiltpold, et al., Free energy surface of the helical peptideY(MEARA)6, J. Phys. Chem. B 104 (2000) 10080–10086.

[45] Y. Pak, S. Wang, Folding of a 16-residue helical peptide usingmolecular dynamics simulation with Tsallis effective potential, J.Chem. Phys. 111 (1999) 4359–4361.

Page 19: MMTSB Tool Set: enhanced sampling and multiscale …Journal of Molecular Graphics and Modelling 22 (2004) 377–395 MMTSB Tool Set: enhanced sampling and multiscale modeling methods

M. Feig et al. / Journal of Molecular Graphics and Modelling 22 (2004) 377–395 395

[46] S. Jang, S. Shin, Y. Pak, Molecular dynamics study of peptides inimplicit water: ab inito folding of�-hairpin,�-sheet, and���-motif,J. Am. Chem. Soc. 124 (2002) 4976–4977.

[47] Y. Pak, S. Jang, S. Shin, Prediction of helical peptide folding in animplicit water by a new molecular dynamics scheme with generalizedeffective potential, J. Chem. Phys. 116 (2002) 6831–6835.

[48] W. Shalongo, L. Dugad, E. Stellwagen, Distribution of Helicity withinthe Model Peptide Acetyl(AAQAA)3 amide, J. Am. Chem. Soc. 116(1994) 8288–8293.

[49] A.G. Cochran, N.J. Skelton, M.A. Staovasnik, Tryptophan zippers:stable, monomeric�-hairpins, Proc. Nat. Acad. Sci. U.S.A. 98 (2001)5578–5583.

[50] M. Nina, D. Beglov, B. Roux, Atomic radii for continuumelectrostatics calculations based on molecular dynamics free energysimulations, J. Phys. Chem. 101 (1997) 5239–5248.

[51] M. Feig, A.D. MacKerell, Jr., C.L. Brooks, III, Force fieldinfluence on the observation of�-helical protein structures in

molecular dynamics simulations, J. Phys. Chem. B 107 (2003) 2831–2836.

[52] A.T. Brünger, C.L. Brooks III, M. Karplus, Active-site dynamicsof ribonuclease, Proc. Nat. Aacd. Sci. U.S.A. 82 (1985) 8458–8462.

[53] A. Fiser, et al., Evolution and physics in comparative protein structuremodeling, Accounts Chem. Res. 35 (2002) 413–421.

[54] N. Rathore, J.J. de Pablo, Monte Carlo simulation of proteins througha random walk in energy space, J. Chem. Phys. 116 (2002) 7225–7230.

[55] W. Kabsch, C. Sander, Dictionary of protein secondary structure:pattern recognition of hydrogen-bonded and geometrical features,Biopolymers 22 (1983) 2577–2637.

[56] A.D. MacKerell Jr., All-atom empirical potential for molecularmodeling and dynamics studies of proteins, J. Phys. Chem. B 102(1998) 3586–3616.