gromacs tutorial umbrella sampling

GROMACS Tutorial

Umbrella Sampling

Justin A. Lemkul

Department of Pharmaceutical Sciences, University of Maryland, Baltimore

This tutorial will guide the user through the process of setting up and running pulling simulations necessary to calculate binding energy between two species. The tutorial assumes the user has already successfully completed the Lysozyme tutorial, some other tutorial, or is otherwise well-versed in basic GROMACS simulation methods and topology organization. Special attention will be paid to the methods for properly building the system and settings for the pull code itself.

The binding energy (ΔGbind) is derived from the potential of mean force (PMF), extracted from a series of umbrella sampling simulations. A series of initial configurations is generated, each corresponding to a location wherein the molecule of interest (generally referred to as a "ligand") is harmonically restrained at increasing center-of-mass (COM) distance from a reference molecule using an umbrella biasing potential. This restraint allows the ligand to sample the configurational space in a defined region along a reaction coordinate between it and its reference molecule or binding partner. The windows must allow for slight overlap of the ligand positions for proper reconstruction of the PMF curve.

The steps for such a procedure (and the ones utilized in this tutorial) are as follows:

1. Generate a series of configurations along a single degree of freedom (reaction coordinate)
2. Extract frames from the trajectory in step 1 that correspond to the desired COM spacing
3. Run umbrella sampling simulations on each configuration to restrain it within a window corresponding to the chosen COM distance
4. Use the Weighted Histogram Analysis Method (WHAM) to extract the PMF and calculate ΔGbind

The tutorial assumes that the reader is using GROMACS version 4.5.3 or later. My original work (from which this workflow was derived) was conducted with version 4.0.5, but the workflow can, in principle, be applied to any version in the 4.0.x or 4.5.x series. The pull code was completely re-written after version 3.3.3, such that none of the information contained herein (beyond the basic theory of the technique) is applicable to any GROMACS version prior to 4.0. For the GROMACS 2013 workshop at the University of Virginia, it is assumed that you are using GROMACS version 4.6.3.

Step One: Prepare the Topology

Generating a molecular topology for an umbrella sampling simulation is just like any other simulation. Obtain the coordinate file of the structure of interest, and generate the topology from pdb2gmx. Some systems will require special consideration (e.g., protein-ligand complexes, membrane proteins, etc.). For protein-ligand systems, please consult this tutorial, and for membrane proteins, I recommend my own tutorial on the topic. The principles of umbrella sampling are easily extendable to these systems, though we will consider only protein molecules in this tutorial.

The system we will consider here is the dissociation of a single peptide from the growing end of an Aβ42 protofibril, and is based on simulations we recently published. The structure file of the wild-type Aβ42 protofibril used in those simulations, acetylated at the N-terminus of each chain, can be found here. The original PDB accession code is 2BEG.

Run the structure through pdb2gmx:

pdb2gmx -f input.pdb -ignh -ter -o complex.gro

Choose the GROMOS96 53A6 parameter set, "None" for the N-termini, and "COO-" for the C-termini. Modify topol_Protein_chain_B.itp to include the following lines (at the end of the file):

#ifdef POSRES_B
#include "posre_Protein_chain_B.itp"
#endif

We will be using chain B as an immobile reference later on in the pulling simulations, hence the need to specially position-restrain this chain only, and none of the others.

Step Two: Define the Unit Cell

Defining the unit cell for a pulling simulation is not unlike defining the unit cell for any other simulation. There is, however, one major consideration. One must allow enough space in the pulling direction to allow for a continuous pull without interacting with the periodic images of the system. That is, the minimum image convention must be continually satisfied, and the pull distance must always be less than one-half the length of the box vector along which the pulling is being conducted. Why, you may ask?

GROMACS calculates distances while simultaneously taking periodicity into account. Thus, if you have a 10-nm box, and you pull over a distance greater than 5.0 nm, the periodic distance becomes the reference distance for the pulling, and this distance is actually less than 5.0 nm! This fact will significantly affect results, since the distance you think you are pulling is not what is actually calculated.
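To make the arithmetic concrete, here is a minimal sketch of the minimum-image calculation reduced to a single box vector. The helper name periodic_distance is hypothetical (it is not a GROMACS tool), and it only illustrates the wrap-around effect described above, not how GROMACS implements it internally.

```shell
# periodic_distance D BOX
# Hypothetical helper: shortest separation (nm) between two points that are
# D nm apart along one axis of a periodic box of length BOX nm.
periodic_distance() {
    awk -v d="$1" -v box="$2" 'BEGIN {
        # Shift by whole box lengths so the result lies within half a box.
        d = d - box * int(d / box + (d >= 0 ? 0.5 : -0.5))
        if (d < 0) d = -d
        printf "%.3f\n", d
    }'
}

# Example calls (values in nm):
#   periodic_distance 6.0 10.0    # a 6.0-nm separation in a 10-nm box
#   periodic_distance 4.0 10.0    # a 4.0-nm separation in the same box
```

Both example calls report the same distance, because the 6.0-nm separation is closer to the neighboring periodic image than to the original position.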

We will be pulling a total distance of 5.0 nm in a 12.0-nm box, to avoid the complications described above. The center of mass of the protofibril will be placed at (3.280, 2.181, 2.4775) in a box of dimensions 6.560 x 4.362 x 12. Use editconf to place the protofibril at this location:

editconf -f complex.gro -o newbox.gro -center 3.280 2.181 2.4775 -box 6.560 4.362 12

You can visualize the location of the protofibril within the box using, for example, VMD. Load the structure in VMD and open the Tk console. Type the following (note that > indicates the Tk prompt, not something you actually type):

> pbc box

You should see something like the following in the VMD window:

Step Three: Adding Solvent and Ions

This step is conducted much like any other simulation. Refer to the Lysozyme tutorial for a more detailed description of what is going on here if you are unsure. First, we will add water with genbox:

genbox -cp newbox.gro -cs spc216.gro -o solv.gro -p topol.top

Next, we will add ions using genion, utilizing this .mdp file. We are going to be conducting these simulations in the presence of 100 mM NaCl, on top of neutralizing counterions:

grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr
genion -s ions.tpr -o solv_ions.gro -p topol.top -pname NA -nname CL -neutral -conc 0.1
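As a rough sanity check on the salt concentration, the sketch below estimates how many NaCl pairs correspond to 100 mM in the box defined earlier. It assumes the count is based on the full box volume, which is an assumption about how the number is derived; the neutralizing counterions are added on top of this, and the numbers genion actually reports should be taken as authoritative. The helper name salt_pairs is hypothetical.

```shell
# salt_pairs X Y Z CONC
# Hypothetical helper: estimated number of ion pairs for a box of
# X x Y x Z nm at CONC mol/L, assuming the full box volume is used.
salt_pairs() {
    awk -v x="$1" -v y="$2" -v z="$3" -v conc="$4" 'BEGIN {
        avogadro = 6.02214076e23
        litres = x * y * z * 1e-24      # 1 nm^3 = 1e-24 L
        printf "%.0f\n", conc * litres * avogadro
    }'
}

# Example call for the box used in this tutorial:
#   salt_pairs 6.560 4.362 12 0.1
```

For the 6.560 x 4.362 x 12 box this estimate comes to about 21 ion pairs, before neutralization.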

Step Four: Energy Minimization and Equilibration

The energy minimization and equilibration steps are going to be conducted just like any other protein-in-water system. Here, we will perform steepest descents minimization followed by NPT equilibration. The .mdp file for minimization can be found here, and the one for NPT equilibration can be found here.

Invoke grompp and mdrun, as usual:

grompp -f minim.mdp -c solv_ions.gro -p topol.top -o em.tpr
mdrun -v -deffnm em

grompp -f npt.mdp -c em.gro -p topol.top -o npt.tpr
mdrun -deffnm npt

Because these procedures are time-consuming, they are likely best run in parallel, i.e.:

mdrun -nt X -deffnm npt

In the above command, "X" represents the desired number of threads over which the parallel calculation is conducted.

Step Five: Generating Configurations

To conduct umbrella sampling, one must generate a series of configurations along a reaction coordinate, ζ. Some of these configurations will serve as the starting configurations for the umbrella sampling windows, which are run in independent simulations. The figure below illustrates these principles. The top image illustrates the pulling simulation we will run now, conducted in order to generate a series of configurations along the reaction coordinate. These configurations are extracted after the simulation is complete (dashed arrows in between the top and middle images). The middle image corresponds to the independent simulations conducted within each sampling window, with the center of mass of the free peptide restrained in that window by an umbrella biasing potential. The bottom image shows the ideal result as a histogram of configurations, with neighboring windows overlapping such that a continuous energy function can later be derived from these simulations.

For this example, the reaction coordinate is the z-axis. To generate these configurations, we must pull peptide A away from the protofibril. We will pull over the course of 500 ps of MD, saving snapshots every 1 ps. This setup has been established based on trial-and-error to obtain a reasonable distribution of configurations. In other systems, it may be necessary to save configurations more often, or sufficient to save configurations less often. The idea is to capture enough configurations along the reaction coordinate to obtain regular spacing of the umbrella sampling windows, in terms of center-of-mass distance between peptides A and B, the latter of which is our reference group.

The .mdp file for this pulling can be found here. A brief explanation of the pulling options used is as follows:

; Pull code
pull = umbrella
pull_geometry = distance
pull_dim = N N Y
pull_start = yes ; define initial COM distance > 0
pull_ngroups = 1
pull_group0 = Chain_B
pull_group1 = Chain_A
pull_rate1 = 0.01 ; 0.01 nm per ps = 10 nm per ns
pull_k1 = 1000 ; kJ mol^-1 nm^-2

pull = umbrella: using a harmonic potential to pull. IMPORTANT: This procedure is NOT umbrella sampling. I used a harmonic potential in order to make qualitative observations about the dissociation pathway in this study. The harmonic potential allows the force to vary according to the nature of the interactions of peptide A with peptide B. That is, the force will build up until certain critical interactions are broken. See our paper for details. For the purposes of generating the initial configurations for umbrella sampling, you can actually use any combination of pull settings (pull and pull_geometry), but when it comes time for the actual umbrella sampling (in the next step) you MUST be using pull = umbrella. It is very important that you do not apply extremely fast pulling rates or extremely strong force constants, which can seriously deform elements of your system. Please refer to our paper (particularly the Supporting Information) for how we chose to validate the pull rate used.

pull_geometry = distance: see the note in the .mdp file; you can also use position or direction, but changes will have to be made to other pulling parameters.

pull_dim = N N Y: we are pulling only in the z-dimension. Thus, x and y are set to "no" (N) and z is set to "yes" (Y).

pull_start = yes: the initial COM distance is the reference distance for the first frame. This is useful because if we are attempting to pull 5.0 nm, converting the initial COM distance to zero (i.e., pull_start = no) makes this interpretation difficult.

pull_ngroups = 1: we are only applying a pulling force to one group.

pull_group0 = Chain_B: reference group for pulling.

pull_group1 = Chain_A: group to which pulling force is applied.

pull_rate1 = 0.01: the rate at which the "dummy particle" attached to our pull group is moved. This type of pulling is also called "constant velocity" due to the fact that this rate is fixed.

pull_k1 = 1000: the force constant for pulling.
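As a quick consistency check on these settings, the sketch below multiplies the pull rate by the simulation length and counts the saved snapshots. The helper name pull_summary is hypothetical, and the frame count assumes that the starting frame at time zero is written in addition to one frame per output interval.

```shell
# pull_summary RATE DURATION EVERY
# Hypothetical helper: RATE in nm/ps, DURATION in ps, EVERY = output interval in ps.
pull_summary() {
    awk -v rate="$1" -v duration="$2" -v every="$3" 'BEGIN {
        # Distance moved by the dummy particle, and frames saved including t = 0.
        printf "%.1f nm over %d frames\n", rate * duration, int(duration / every) + 1
    }'
}

# Example call with the values used in this tutorial:
#   pull_summary 0.01 500 1
```

With pull_rate1 = 0.01 nm per ps, 500 ps of MD, and a snapshot every 1 ps, this gives a 5.0-nm displacement of the dummy particle and 501 frames.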

Remember that #ifdef POSRES_B statement we added to topol_Protein_chain_B.itp a while ago? We're going to use it now. By restraining peptide B of the protofibril, we are able to more easily pull peptide A away. Due to the extensive non-covalent interactions between chains A and B, if we did not restrain chain B, we would end up simply towing the whole complex along the simulation box, which wouldn't accomplish much.

We will need to define some custom index groups for this pulling simulation. Use make_ndx:

make_ndx -f npt.gro

(> indicates the make_ndx prompt)

> r 1-27
> name 19 Chain_A
> r 28-54
> name 20 Chain_B
> q

Now, run the continuous pulling simulation:

grompp -f md_pull.mdp -c npt.gro -p topol.top -n index.ndx -t npt.cpt -o pull.tpr

mdrun -s pull.tpr

Again, this procedure will take some time, so run it in parallel if you have the resources available to you. Once this simulation is complete, we will need to extract useful frames for defining the umbrella sampling windows. The easiest way I have found to do this is the following:

1. Define the spacing of the windows (generally 0.1 - 0.2 nm)

2. Extract all the frames from the pulling trajectory that was just produced

3. Measure the COM distance of each of these frames between the reference and pull group

4. Use the selected frames for umbrella sampling input

To extract the frames from your trajectory (traj.xtc), use trjconv:

trjconv -s pull.tpr -f traj.xtc -o conf.gro -sep

A series of coordinate files (conf0.gro, conf1.gro, etc) will be produced, corresponding to each of the frames saved in the continuous pulling simulation. To iteratively call g_dist on all of these (501!) frames that were generated, I have written a Perl script that takes care of this task. It will print a file called "summary_distances.dat" that contains this information. The script can be found here. We will need to make use of the index file again, as well as a text file called "groups.txt," which will be used to select our analysis groups non-interactively. The contents of groups.txt should be:

19
20

The groups.txt file can be created with a plain text editor. Once you have this file, change the .txt file extension of distances.txt (linked above) to .pl and execute the script:

perl distances.pl

Look at the contents of summary_distances.dat to see the progression of COM distance between chain A and chain B over time. Make note of the configurations to be used for umbrella sampling, based on the desired spacing. That is, if you want 0.2-nm spacing, you might find the following lines in summary_distances.dat:

50 0.600
...
100 0.800

You would then use conf50.gro and conf100.gro as the starting configurations of two adjacent umbrella sampling windows. Make note of all the configurations you wish to use before continuing. For the purposes of this tutorial, identifying configurations with 0.2-nm spacing will suffice, although in the original work a different (more detailed) spacing was used.
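Picking frames by eye works, but the selection can also be sketched with awk. The helper below is a hypothetical illustration, not part of the linked Perl script. It assumes summary_distances.dat holds two whitespace-separated columns (frame number, COM distance in nm), as in the lines shown above, and for each evenly spaced target distance it prints the frame whose distance is closest.

```shell
# select_windows FILE SPACING
# Hypothetical helper: FILE has "frame distance" per line; SPACING is in nm.
# Prints one "frame distance" line per target distance, starting at the
# smallest distance in FILE and stepping by SPACING.
select_windows() {
    awk -v spacing="$2" '
        function abs(v) { return v < 0 ? -v : v }
        BEGIN { n = 0 }
        /^[#@]/ || NF < 2 { next }          # skip comments and blank lines
        { frame[n] = $1; dist[n] = $2 + 0; n++ }
        END {
            if (n == 0) exit
            dmin = dist[0]; dmax = dist[0]
            for (i = 1; i < n; i++) {
                if (dist[i] < dmin) dmin = dist[i]
                if (dist[i] > dmax) dmax = dist[i]
            }
            last = -1
            for (target = dmin; target <= dmax + 1e-9; target += spacing) {
                best = 0
                for (i = 1; i < n; i++)
                    if (abs(dist[i] - target) < abs(dist[best] - target)) best = i
                # Avoid printing the same frame twice for neighboring targets.
                if (best != last) printf "%s %.3f\n", frame[best], dist[best]
                last = best
            }
        }' "$1"
}

# Example call:
#   select_windows summary_distances.dat 0.2
```

The frame numbers in the first output column identify the confN.gro files to carry forward. It is still worth checking the printed distances, since a frame close to a target is not guaranteed to exist for every window.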

Step Six: Umbrella Sampling Simulations

After having identified the initial configurations of the sampling windows, we can now conduct actual umbrella sampling simulations. We will need to generate a number of input files in order to conduct each of the necessary simulations. For example, if you have identified 25 configurations along the reaction coordinate, that means you will need 25 different input files for 25 independent simulations. You will simply have to call grompp to process this .mdp file for each of the conf.gro files you identified in the previous step. Many of the pulling parameters are the same as in the previous step, with the notable exception of pull_rate1, which has now been set to zero. We don't want to move the configuration along the reaction coordinate; instead we want to restrain it within a defined window of configurational space. Setting pull_start = yes means that the initial COM distance is the reference distance, and we do not have to define a reference (pull_init1) separately for each configuration.

In this example, we will be sampling COM distances from 0.5 - 5.0 nm along the z-axis using roughly 0.2-nm spacing. The following example commands may or may not be literally correct (the frame numbers may differ), but will serve as an example as to how to run grompp on separate coordinate files to generate all 23 inputs (note as well that 23 is the number of windows required to obtain 0.2-nm spacing over roughly 4.5 nm; in our original work, 31 asymmetric windows were used).

You will also note that I have set gen_vel = no in the .mdp file. I have found that allowing the initial forces to govern the dynamics in each window is sufficient for a large, robust system such as this one. If this is not the case in systems with which you work, you will likely want to set gen_vel = yes and allow some time for equilibration in each sampling window.

grompp -f md_umbrella.mdp -c conf0.gro -p topol.top -n index.ndx -o umbrella0.tpr
...
grompp -f md_umbrella.mdp -c conf450.gro -p topol.top -n index.ndx -o umbrella22.tpr

Now, each input file should be passed to mdrun for the actual data collection simulation. Once all of the simulations are complete, you can proceed to data analysis. One note on proper execution of the simulations: do not use the -deffnm option of mdrun without also specifying -pf and -px filenames. Using -deffnm will cause both the pullf.xvg and pullx.xvg files to be written to the same file (whatever is specified by -deffnm) in this case. Using -pf and -px will override the setting of the -deffnm flag.
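One way to keep the window numbering and output names consistent is a small loop, sketched below. The function names run and run_windows, the DRY_RUN switch, and the pullfN.xvg / pullxN.xvg naming are assumptions made for illustration. By default the sketch only prints the commands it would execute, so the list can be inspected first.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# run_windows FRAME [FRAME ...]
# Each argument is the frame number of a confN.gro chosen in the previous
# step; windows are numbered 0, 1, 2, ... in the order given.
run_windows() {
    window=0
    for frame in "$@"; do
        run grompp -f md_umbrella.mdp -c "conf${frame}.gro" -p topol.top \
            -n index.ndx -o "umbrella${window}.tpr" || return 1
        # Explicit -pf and -px names keep the two pull outputs separate.
        run mdrun -deffnm "umbrella${window}" \
            -pf "pullf${window}.xvg" -px "pullx${window}.xvg" || return 1
        window=$((window + 1))
    done
}

# Example call (frame numbers are placeholders, yours will differ):
#   run_windows 0 50 100
```

Setting DRY_RUN=0 before calling run_windows executes the commands instead of printing them. On a cluster, each window would more likely be submitted as a separate job than run one after another in a single shell.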

Step Seven: Data Analysis

The most common analysis conducted for umbrella sampling simulations is the extraction of the potential of mean force (PMF), which will yield the ΔG for the binding/unbinding process. The value of ΔG is simply the difference between the highest and lowest values of the PMF curve, as long as the values of the PMF converge to a stable value at large COM distance. A common method for extracting PMF is the Weighted Histogram Analysis Method (WHAM), included in GROMACS as the g_wham utility. The input to g_wham consists of two files, one that lists the names of the .tpr files of each window, and the other that lists the names of either the pullf.xvg or pullx.xvg files from each window. For example, a simple tpr-files.dat might consist of:

umbrella0.tpr
umbrella1.tpr
...
umbrella22.tpr

And analogously for the list of pullf.xvg or pullx.xvg files, in either pullf-files.dat or pullx-files.dat. Note that the files must have unique names (e.g., pullf0.xvg, pullf1.xvg, etc) or else g_wham will fail. We then run g_wham:
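Typing the two list files by hand is error-prone for 23 windows, so they can be generated with a short loop. This sketch assumes the windows are numbered from 0 and that the force files are named pullfN.xvg; the helper name write_wham_lists is hypothetical.

```shell
# write_wham_lists N_WINDOWS
# Hypothetical helper: writes tpr-files.dat and pullf-files.dat in the
# current directory, one file name per line, in matching window order.
write_wham_lists() {
    n_windows=$1
    : > tpr-files.dat        # start from empty files
    : > pullf-files.dat
    i=0
    while [ "$i" -lt "$n_windows" ]; do
        echo "umbrella${i}.tpr" >> tpr-files.dat
        echo "pullf${i}.xvg" >> pullf-files.dat
        i=$((i + 1))
    done
}

# Example call for the 23 windows used here:
#   write_wham_lists 23
```

Keeping both lists in the same window order makes it easier to match each .tpr file with its pull output when checking the results.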

g_wham -it tpr-files.dat -if pullf-files.dat -o -hist -unit kCal

The g_wham utility will then open each of the umbrella.tpr and pullf.xvg files sequentially and run the WHAM analysis on them. The -unit kCal option indicates that the output will be in kcal mol^-1, but you can also get results in kJ mol^-1 or kBT. Note that you may have to discard the first several hundred ps of the trajectory as equilibration (using g_wham -b), since we generated our starting configurations from a non-equilibrium simulation. Once the PMF converges, you should know how much time was required to equilibrate your system. You should end up with a profile.xvg file that looks like the following:

Please note that the result you obtain may be different, since the spacing recommended in this tutorial is different from the spacing I actually used to generate this data in the original study. The overall shape of the curve should be similar, and the value of ΔG (calculated as the difference between the plateau region of the PMF curve and the energy minimum of the curve) should be close to -50.5 kcal mol^-1.
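The plateau-minus-minimum calculation can be sketched with awk. The helper below is hypothetical. It assumes profile.xvg has the position in the first column and the PMF in the second, that lines starting with # or @ are headers, and that you supply the distance beyond which you judge the curve to be flat. That cutoff is a judgment call, not something the script can decide for you.

```shell
# pmf_delta_g FILE PLATEAU_START
# Hypothetical helper: prints (PMF minimum) - (mean PMF at positions
# >= PLATEAU_START), in whatever energy unit FILE uses.
pmf_delta_g() {
    awk -v cutoff="$2" '
        /^[#@]/ || NF < 2 { next }          # skip xvg headers and blank lines
        {
            x = $1 + 0; g = $2 + 0
            if (!seen || g < gmin) { gmin = g; seen = 1 }
            if (x >= cutoff) { sum += g; count++ }
        }
        END {
            if (!seen || count == 0) {
                print "no data in the plateau region" > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", gmin - sum / count
        }' "$1"
}

# Example call, treating everything beyond 4.0 nm as plateau (an assumption):
#   pmf_delta_g profile.xvg 4.0
```

The sign convention here makes a favorable binding free energy negative. Varying the plateau cutoff slightly shows how sensitive the result is to that choice.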

The other output from the g_wham command will be a file called histo.xvg, which contains the histograms of the configurations within the umbrella sampling windows. These histograms will determine whether or not there is sufficient overlap between adjacent windows. For the types of simulations conducted as part of this tutorial, you may obtain something like the following:

The above histogram shows reasonable overlap between windows from about 1.2 - 5 nm of COM spacing; the overlap around 1 nm (green and blue curves) indicates that more sampling windows are likely necessary to obtain good results from the WHAM algorithm. As it stands now, there is very little overlap between these two windows.
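Overlap is usually judged by eye, but a rough number can help locate gaps. The sketch below is a heuristic, not a feature of g_wham. It assumes histo.xvg has the position in the first column followed by one count column per window, with adjacent columns corresponding to adjacent windows. For each neighboring pair it reports the shared counts as a fraction of the smaller histogram. The helper name histogram_overlap is hypothetical.

```shell
# histogram_overlap FILE
# Hypothetical helper: prints "windowA windowB fraction" for each pair of
# adjacent histogram columns (windows numbered from 0).
histogram_overlap() {
    awk '
        /^[#@]/ || NF < 3 { next }          # skip xvg headers and blank lines
        {
            if (NF > ncol) ncol = NF
            for (c = 2; c <= NF; c++) {
                a = $c + 0
                total[c] += a
                if (c < NF) {
                    b = $(c + 1) + 0
                    shared[c] += (a < b) ? a : b    # bin-wise minimum
                }
            }
        }
        END {
            for (c = 2; c < ncol; c++) {
                smaller = (total[c] < total[c + 1]) ? total[c] : total[c + 1]
                ratio = (smaller > 0) ? shared[c] / smaller : 0
                printf "%d %d %.2f\n", c - 2, c - 1, ratio
            }
        }' "$1"
}

# Example call:
#   histogram_overlap histo.xvg
```

Pairs with a fraction near zero are candidates for an additional window in between. What counts as enough overlap depends on the system, so treat the number as a pointer back to the plot rather than a pass/fail test.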

Summary

You have now hopefully been successful in conducting umbrella sampling simulations by generating a series of configurations along a reaction coordinate, running biasing simulations, and extracting the PMF. The .mdp files provided here serve as examples only, and should not be considered broadly applicable to all systems. Review the literature and the GROMACS manual for adjustments to these files for efficiency and accuracy purposes.

If you have suggestions for improving this tutorial, if you notice a mistake, or if anything else is unclear, please feel free to email me. Please note: this is not an invitation to email me for GROMACS problems. I do not advertise myself as a private tutor or personal help service. That's what the gmx-users list is for. I may help you there, but only in the context of providing service to the community as a whole, not just the end user.

Happy simulating!
