
  • CS612 - Algorithms in Bioinformatics

    Sampling

    April 23, 2019

  • From a Rigid Ligand to a Flexible Ligand

    Torsional (Dihedral) Degrees of Freedom (DOF)

    Nurit Haspel CS612 - Algorithms in Bioinformatics

  • Kinematics

    Kinematics is a branch of classical mechanics that describes the motion of points, bodies (objects), and systems of bodies (groups of objects) without considering the forces that caused the motion.

    A kinematics problem begins by describing the geometry of a system and the initial conditions of any known values of position, velocity and/or acceleration of points in the system.

    Then, using geometric methods, the position, velocity and acceleration of any unknown parts of the system can be determined.

    Forward kinematics is the use of the kinematic equations of a robot to compute the position of the end-effector from specified values for the joint parameters.

    In protein motion, the problem becomes computing the new locations of the atoms given a set of dihedral rotations.
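    As a minimal sketch of forward kinematics for a chain, the code below rotates every atom downstream of a bond about that bond's axis using Rodrigues' rotation formula. It assumes an unbranched chain whose atoms are stored in order, so "downstream" simply means a larger index; the function names are hypothetical and chosen only for illustration.

```python
import math

def rotate_about_axis(point, a, b, angle):
    """Rotate `point` by `angle` radians about the axis through a and b
    (right-handed about the direction a -> b), via Rodrigues' formula."""
    axis = [b[i] - a[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in axis))
    k = [c / norm for c in axis]
    v = [point[i] - a[i] for i in range(3)]
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(k[i] * v[i] for i in range(3))
    return tuple(a[i] + v[i] * cos_t + k_cross_v[i] * sin_t
                 + k[i] * k_dot_v * (1.0 - cos_t) for i in range(3))

def apply_dihedral_rotation(coords, bond, angle):
    """Rotate all atoms after bond=(i, j) about the i -> j axis.
    Assumption: atoms with index <= j stay fixed, the rest move rigidly."""
    i, j = bond
    a, b = coords[i], coords[j]
    return [p if idx <= j else rotate_about_axis(p, a, b, angle)
            for idx, p in enumerate(coords)]

chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
rotated = apply_dihedral_rotation(chain, (0, 1), math.pi / 2)
```

    Applying several dihedral rotations in sequence gives the new atom positions for a whole set of angles; bond lengths and bond angles are unchanged because each step is a rigid rotation of the downstream atoms.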


  • Robotics-inspired Approach to Protein Flexibility

    Similarity between proteins and robots: exploration of complex high-dimensional space

    Similarity exploited to sample conformations with spatial constraints

    [Figure: articulated manipulator (left) and extended protein backbone (right)]


  • Robotics-inspired Approach to Protein Flexibility

    Exploration of protein conformational space has parallels in robotics

    0/1 collisions for robots versus energy field for proteins

    adapted from J.-C. Latombe, Stanford

    adapted from P. Smith, KSU


  • Robotics-inspired Approach to Protein Flexibility

    Dimensionality of configuration space

      DOFs (rigid-body transformations and DOFs of the ligand)

      Too many DOFs mean that the configuration space of the ligand is high-dimensional and difficult to search

      Similar issue when planning motions for an articulated robotic chain in a cluttered environment

    Geometric complexity of the free space

      Difficult to determine whether a ligand conformation and specific position and orientation result in a good fit

      Similar issue for an articulated robot

    To address both: plan motions in the configuration space, but compute in the workspace (protein surface or cavity)!


  • Probabilistic Roadmap Motion Planning (PRM)

    [Figure: configuration space, divided into forbidden space and free space]


  • Probabilistic Roadmap Motion Planning (PRM)

    Configurations are sampled by picking coordinates at random


  • Probabilistic Roadmap Motion Planning (PRM)

    Sampled configurations are tested for collision (in workspace!)


  • Probabilistic Roadmap Motion Planning (PRM)

    The collision-free configurations are retained as “milestones”


  • Probabilistic Roadmap Motion Planning (PRM)

    Each milestone is linked by straight paths to its k-nearest neighbors


  • Probabilistic Roadmap Motion Planning (PRM)

    The collision-free links are retained to form the PRM


  • Probabilistic Roadmap Motion Planning (PRM)

    Finding paths in the map.
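    The PRM steps above (sample, test for collision, keep milestones, link k nearest neighbors, keep collision-free links, search the map) can be sketched for a toy problem. The setup is an assumption made for illustration: a point robot in the unit square with circular obstacles, straight links checked at a fixed step size, and breadth-first search for the path query. All function names are hypothetical.

```python
import math
import random
from collections import deque

def collision_free(q, obstacles):
    """True if point q lies outside every circular obstacle (center, radius)."""
    return all(math.dist(q, center) > radius for center, radius in obstacles)

def link_free(q1, q2, obstacles, step=0.01):
    """Check a straight link by testing points spaced at most `step` apart."""
    n = max(1, math.ceil(math.dist(q1, q2) / step))
    return all(
        collision_free((q1[0] + (q2[0] - q1[0]) * t / n,
                        q1[1] + (q2[1] - q1[1]) * t / n), obstacles)
        for t in range(n + 1))

def build_prm(n_milestones, k, obstacles, rng):
    # Sample random configurations; keep the collision-free ones as milestones.
    milestones = []
    while len(milestones) < n_milestones:
        q = (rng.random(), rng.random())
        if collision_free(q, obstacles):
            milestones.append(q)
    # Link each milestone to its k nearest neighbors; keep collision-free links.
    roadmap = {i: set() for i in range(n_milestones)}
    for i, q in enumerate(milestones):
        others = sorted((j for j in range(n_milestones) if j != i),
                        key=lambda j: math.dist(q, milestones[j]))
        for j in others[:k]:
            if link_free(q, milestones[j], obstacles):
                roadmap[i].add(j)
                roadmap[j].add(i)
    return milestones, roadmap

def find_path(roadmap, start, goal):
    """Breadth-first search; returns a list of milestone indices or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in roadmap[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

obstacles = [((0.5, 0.5), 0.2)]
milestones, roadmap = build_prm(200, 6, obstacles, random.Random(0))
path = find_path(roadmap, 0, 1)
```

    The brute-force nearest-neighbor search keeps the sketch short; a spatial index would be the usual choice for larger roadmaps.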


  • Application of PRM to Protein-Ligand Docking

    Protein is assumed to be rigid

    A fixed coordinate system P is attached to the protein

    Ligand is a small flexible molecule

    A moving coordinate system L is defined using three bonded atoms in the ligand

    A conformation of the ligand is defined by the position and orientation of L relative to P and the torsional angles of the ligand

    A.P. Singh, J.C. Latombe, and D.L. Brutlag. A Motion Planning Approach to Flexible Ligand Binding. Proc. 7th ISMB, pp. 252-261, 1999


  • Roadmap Construction: Node Generation

    The nodes of the roadmap are generated by sampling conformations of the ligand uniformly at random in the parameter space (around the protein)

    The energy of each sampled conformation is $E = E_{\text{interaction}}$ (electrostatic) $+\, E_{\text{internal}}$ (vdW)

    A sampled conformation is retained with probability:

    \[
    p = \begin{cases}
    0 & \text{if } E > E_{\max} \\
    \dfrac{E_{\max} - E}{E_{\max} - E_{\min}} & \text{if } E_{\min} \le E \le E_{\max} \\
    1 & \text{if } E < E_{\min}
    \end{cases}
    \]

    Results in denser distribution of nodes in low-energy regions of conformational space
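    The retention rule can be written as a short rejection-sampling loop. This is a sketch under stated assumptions: `sample_conformation` and `energy_fn` are hypothetical callables supplied by the caller, and `e_min < e_max` is assumed.

```python
import random

def retain_probability(energy, e_min, e_max):
    """Piecewise-linear retention probability; assumes e_min < e_max."""
    if energy > e_max:
        return 0.0
    if energy < e_min:
        return 1.0
    return (e_max - energy) / (e_max - e_min)

def sample_nodes(sample_conformation, energy_fn, e_min, e_max, n_nodes, rng):
    """Draw conformations until n_nodes of them have been retained."""
    nodes = []
    while len(nodes) < n_nodes:
        q = sample_conformation(rng)
        if rng.random() < retain_probability(energy_fn(q), e_min, e_max):
            nodes.append(q)
    return nodes

# Toy usage: one-dimensional "conformations" with a quadratic energy.
rng = random.Random(1)
nodes = sample_nodes(lambda r: r.uniform(-3.0, 3.0), lambda q: q * q,
                     e_min=0.5, e_max=4.0, n_nodes=50, rng=rng)
```

    In the toy usage every retained sample has energy at most `e_max`, and samples near the energy minimum are kept more often, which is the denser low-energy coverage described above.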


  • Roadmap Construction: Edge Generation

    [Figure: edge from q to q′, discretized into intermediate conformations q_i, q_{i+1}]

    Each node is connected to its closest neighbors by straight edges

    Each edge is discretized so that between q_i and q_{i+1} no atom moves by more than some ε = 1 Å.
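    One way to obtain such a discretization is recursive bisection: split the straight edge in parameter space until no atom moves by more than ε between consecutive conformations. The sketch below assumes a caller-supplied `atom_positions(q)` function (hypothetical) that maps a parameter tuple to atom coordinates, and it interpolates parameters linearly without handling angle wrap-around.

```python
import math

def max_atom_displacement(atoms_a, atoms_b):
    """Largest distance moved by any single atom between two conformations."""
    return max(math.dist(a, b) for a, b in zip(atoms_a, atoms_b))

def discretize_edge(q, q_prime, atom_positions, eps=1.0):
    """Return [q, ..., q_prime] with every consecutive pair moving no atom
    by more than eps. Assumes atom positions vary continuously with q."""
    moved = max_atom_displacement(atom_positions(q), atom_positions(q_prime))
    if moved <= eps:
        return [q, q_prime]
    mid = tuple((a + b) / 2.0 for a, b in zip(q, q_prime))
    left = discretize_edge(q, mid, atom_positions, eps)
    right = discretize_edge(mid, q_prime, atom_positions, eps)
    return left + right[1:]  # drop the duplicated midpoint

# Toy usage: a single atom translated along x by the one parameter in q.
positions = lambda q: [(q[0], 0.0, 0.0)]
steps = discretize_edge((0.0,), (3.5,), positions, eps=1.0)
```

    Each intermediate conformation can then be scored to assign a weight to the edge.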




  • Querying the Roadmap

    For a given goal node q_g (e.g., binding conformation), Dijkstra's single-source shortest-path algorithm computes the lowest-weight paths from q_g to each node (in either direction) in O(N log N) time, where N = number of nodes

    Various quantities can then be easily computed in O(N) time, e.g., average weights of all paths entering q_g and of all paths leaving q_g (binding and dissociation rates K_on and K_off)
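    A minimal sketch of the query step, assuming the roadmap is a directed graph stored as a dictionary of non-negative edge weights. Paths leaving the goal come from Dijkstra's algorithm on the graph itself, and paths entering the goal from the same algorithm on the reversed graph. The toy graph and the helper names are illustrative assumptions, and the plain averages stand in for whatever rate estimate is actually used.

```python
import heapq

def dijkstra(graph, source):
    """Lowest path weight from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def reverse(graph):
    """Graph with every edge direction flipped."""
    rev = {}
    for u, nbrs in graph.items():
        rev.setdefault(u, {})
        for v, w in nbrs.items():
            rev.setdefault(v, {})[u] = w
    return rev

def average_weight(dist, goal):
    """Mean lowest-path weight over all nodes other than the goal."""
    others = [d for node, d in dist.items() if node != goal]
    return sum(others) / len(others)

# Toy directed roadmap; "g" plays the role of the goal node q_g.
graph = {"a": {"b": 1.0, "g": 5.0}, "b": {"g": 1.0}, "g": {"a": 2.0}}
leaving = dijkstra(graph, "g")             # paths from q_g to each node
entering = dijkstra(reverse(graph), "g")   # paths from each node to q_g
avg_leaving = average_weight(leaving, "g")
avg_entering = average_weight(entering, "g")
```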


  • Computing Binding Conformations

    Sample many (several thousand) ligand conformations at random around the protein

    Repeat several times:

      Select lowest-energy conformations that are close to the protein surface

      Re-sample around them

    Retain k (approx. 10) lowest-energy conformations whose centers of mass are at least 5 Å apart
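    The final retention step can be sketched as a greedy pass over the conformations in order of increasing energy. The greedy strategy is an assumption made here for illustration; the slide states only the selection criterion. Each conformation is represented by a hypothetical `(energy, center_of_mass)` pair.

```python
import math

def select_spread_minima(conformations, k=10, min_separation=5.0):
    """Keep up to k lowest-energy conformations whose centers of mass are
    pairwise at least `min_separation` apart (greedy, lowest energy first)."""
    selected = []
    for energy, com in sorted(conformations, key=lambda c: c[0]):
        if all(math.dist(com, other) >= min_separation for _, other in selected):
            selected.append((energy, com))
            if len(selected) == k:
                break
    return selected

candidates = [
    (-10.0, (0.0, 0.0, 0.0)),
    (-9.0, (1.0, 0.0, 0.0)),   # too close to the first one
    (-8.0, (6.0, 0.0, 0.0)),
    (-7.0, (6.0, 4.0, 0.0)),   # too close to the third one
    (-6.0, (0.0, 8.0, 0.0)),
]
kept = select_spread_minima(candidates, k=3, min_separation=5.0)
```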

    [Figure: lactate dehydrogenase with its active site marked]


  • Testing on Three Complexes

    PDB ID: 1ldm. Receptor: Lactate Dehydrogenase (2386 atoms, 309 residues). Ligand: Oxamate (6 atoms, 7 DOFs)

    PDB ID: 4ts1. Receptor: Mutant of tyrosyl-transfer-RNA synthetase (2423 atoms, 319 residues). Ligand: L-leucyl-hydroxylamine (13 atoms, 9 DOFs)

    PDB ID: 1stp. Receptor: Streptavidin (901 atoms, 121 residues). Ligand: Biotin (16 atoms, 11 DOFs)


  • Finding Folding Pathways Using PRM

    Degrees of freedom – number of rotatable backbone dihedral angles (approx. 2N, where N is the number of amino acids)

    Nodes are generated in a similar manner to the docking scheme above.

    Sampling cannot be done uniformly at random due to the high dimensionality – sampling is done from a set of distributions around the native state.

    Edges connect neighboring nodes in a similar manner to the one described above.

    Can be used to discover folding pathways, intermediate structures and other folding events.

    G. Song, N. Amato, RECOMB 2001


  • From Flexible Ligand to Flexible Receptor?

    Modeling full receptor flexibility is very difficult!

    In order for this process to become efficient, we must find a representation for protein flexibility that avoids the direct search of a solution space comprised of thousands of degrees of freedom.

    There are several methods available, and the accuracy of the results is usually directly proportional to the computational complexity of the representation.


  • From Flexible Ligand to Flexible Receptor?

    The dimensionality of the protein conformational space is much larger than that of a small ligand

    PRM-based methods that sample thousands of conformations to get a good view of the ligand conformational space are not sufficient

    Challenge: from 7-10 DOFs to thousands of DOFs

    Goal: Model protein flexibility to capture relevant conformations of the flexible receptor


  • Receptor Flexibility – Soft Receptor

    Soft receptors can be easily generated by relaxing the high VdW energy penalty

    The rationale is that the receptor structure has some inherent flexibility which allows it to adapt to slightly differently shaped ligands.

    If the change in the receptor conformation is small enough, it is assumed that the receptor is capable of such a conformational change.

    It is also assumed that the change in protein conformation does not incur a sufficiently high energetic penalty to offset the improved interaction energy between the ligand and the receptor.

    It is also quite easy to implement (relax the collision component).
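    One simple way to relax the collision component, shown here only as an illustrative sketch, is to cap the repulsive part of a 12-6 Lennard-Jones term so that small overlaps are tolerated. The capping scheme, the function names, and the parameter values (`epsilon`, `sigma`, `cap`) are assumptions for this example and are not taken from any particular docking program.

```python
def lennard_jones(r, epsilon=0.2, sigma=3.5):
    """Standard 12-6 Lennard-Jones energy at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def soft_lennard_jones(r, epsilon=0.2, sigma=3.5, cap=1.0):
    """Same potential with the repulsive wall capped at `cap`."""
    return min(lennard_jones(r, epsilon, sigma), cap)

hard_clash = lennard_jones(2.0)       # steep penalty for an overlap
soft_clash = soft_lennard_jones(2.0)  # penalty limited to the cap
```

    At larger separations the two functions agree, so only close contacts are scored more leniently.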


  • Receptor Flexibility – Selecting Specific DOFs

    It is possible to select only a few degrees of freedom to model explicitly.

    They usually correspond to rotations around single bonds.

    These degrees of freedom are usually considered the natural degrees of freedom in molecules.

    Rotations around bonds lead to deviations from ideal geometry that result in a small energy penalty when compared to deviations from ideality in bond lengths and bond angles.

    Selection of which torsional degrees of freedom to model is usually the most difficult part of this method because it requires a considerable amount of a priori knowledge.

    The torsions chosen are usually rotations of side chains in the binding site of the receptor protein.

    It is also common to further reduce the search space by using rotamer libraries.


  • Receptor Flexibility – Ensemble Docking

    One possible way to represent a flexible receptor for drug design applications is the use of multiple static receptor structures.

    The best description for a protein structure is that of a conformational ensemble of slightly different protein structures coexisting in a low energy region of the potential energy surface.

    The structures can be determined experimentally either from X-ray crystallography or NMR, or generated via computational methods such as Monte Carlo or MD simulations.


  • Modeling Limited Receptor Flexibility

    Selection of specific degrees of freedom, such as those of designated amino acids in the binding site

    Shown here: Acetylcholinesterase: Phe330 flexible – acts as swinging gate


  • Modeling Limited Receptor Flexibility

    Moving a larger number of amino acids (illustration on acetylcholinesterase)


  • Receptor Flexibility – Collective DOF

    Collective DOF allows the representation of full protein flexibility without a dramatic increase in computational cost.

    One method is the calculation of normal modes for the receptor.

    Alternatively, we can use dimensionality reduction methods.

    The most commonly used method for the study of protein motions is principal component analysis (PCA).


  • Inverse Kinematics (IK)

    Inverse kinematics is the problem of finding the right values for the underlying degrees of freedom of a chain.

    In the case of a protein chain, these degrees of freedom are the dihedral angles, and they must be set so that the chain satisfies certain spatial constraints.

    For example, in some applications, it is necessary to find rotations that can steer certain atoms to desired locations in space.

    The applications of inverse kinematics to protein structure include mainly loop modeling and generating ensembles of structures.

    In this case - manipulate the rotational degrees of freedom of a loop region to find possible loop conformations that attach to the rest of the protein.


  • Modeling Loops Using Inverse Kinematics

    Goal: Model the ensemble of conformations of a protein.

    It is known that proteins are not rigid but fluctuate about an ensemble of structures under equilibrium conditions.

    Focus mostly on loop regions, as they are the most flexible ones.


  • Modeling Loops Using Inverse Kinematics

    Inverse kinematics: Manipulate the degrees of freedom of an articulated chain to satisfy some end-constraints.

    In this case - manipulate the rotational degrees of freedom of a loop region to find possible loop conformations that attach to the rest of the protein.

    Cyclic Coordinate Descent (CCD): solve for and rotate one dihedral at a time.

    Canutescu A. A., and Dunbrack R. L. Protein Science 12, 2003


  • CCD for Inverse Kinematics

    Goal: find optimal values to simultaneously steer the three backbone atoms of the end of the fragment to their target positions.

    Current positions before rotation are $M^0$, positions after rotation are $M$, and target positions are $F$.

    $S$ is the sum of squared distances between current positions and target positions.

    Steering these three atoms to their target positions requires minimizing $S$.


  • CCD for Inverse Kinematics

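    $S$ is defined as:

    \[
    S = |\overrightarrow{F_1M_1}|^2 + |\overrightarrow{F_2M_2}|^2 + |\overrightarrow{F_3M_3}|^2
    \]

    where $\overrightarrow{F_1M_1} = \overrightarrow{O_1M_1} - \overrightarrow{O_1F_1}$.

    Notice that it is a 2D rotation in the plane defined by the $\hat{r}$ and $\hat{s}$ local axes.

    The squared norm of the vector $M - F$ (denoted $\overrightarrow{FM}$) has the same form for each of the three atoms, so we can sum the three contributions to $S$.

    We can express the rotation in the plane of $\hat{r}$ and $\hat{s}$ as:

    \[
    \overrightarrow{O_1M_1} = r_1 \cos\theta\,\hat{r}_1 + r_1 \sin\theta\,\hat{s}_1
    \]

    $r_1$ is the length of the vector from $O_1$ to $M_1^0$, which we want to rotate by $\theta$, and $\hat{r}_1$ is the unit vector along it.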


  • CCD for Inverse Kinematics

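    From the previous equations it follows that:

    \[
    \overrightarrow{F_iM_i} = r_i \cos\theta\,\hat{r}_i + r_i \sin\theta\,\hat{s}_i - \vec{f}_i \equiv \vec{d}_i, \quad i = 1, 2, 3
    \]

    where $\vec{f}_i = \overrightarrow{O_iF_i}$.

    Calculating the squared distances between the moving atoms and the fixed target atoms, we obtain:

    \[
    |\vec{d}_i|^2 = r_i^2 + f_i^2 - 2 r_i \cos\theta\,(\vec{f}_i \cdot \hat{r}_i) - 2 r_i \sin\theta\,(\vec{f}_i \cdot \hat{s}_i)
    \]

    Putting it all together, we can express $S$ as the sum of the squared distances above.

    Differentiating with respect to $\theta$ gives us:

    \[
    \frac{dS}{d\theta} = \frac{d|\vec{d}_1|^2}{d\theta} + \frac{d|\vec{d}_2|^2}{d\theta} + \frac{d|\vec{d}_3|^2}{d\theta}
    \]

    where

    \[
    \frac{d|\vec{d}_i|^2}{d\theta} = 2 r_i \sin\theta\,(\vec{f}_i \cdot \hat{r}_i) - 2 r_i \cos\theta\,(\vec{f}_i \cdot \hat{s}_i)
    \]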
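    One way to collect the terms of $S$ into a single cosine is sketched below. The shorthand symbols $a$, $b$, $c$ and the angle $\alpha$ are introduced here for the derivation; their definitions are an assumption chosen to be consistent with the expression for $|\vec{d}_i|^2$ above.

```latex
\begin{align*}
S &= \sum_{i=1}^{3} |\vec{d}_i|^2
   = \underbrace{\sum_{i=1}^{3} \left(r_i^2 + f_i^2\right)}_{a}
   - \cos\theta \underbrace{\sum_{i=1}^{3} 2 r_i\,(\vec{f}_i \cdot \hat{r}_i)}_{b}
   - \sin\theta \underbrace{\sum_{i=1}^{3} 2 r_i\,(\vec{f}_i \cdot \hat{s}_i)}_{c} \\
  &= a - b\cos\theta - c\sin\theta .
\end{align*}
Define $\alpha$ by
\[
\cos\alpha = \frac{b}{\sqrt{b^2 + c^2}}, \qquad
\sin\alpha = \frac{c}{\sqrt{b^2 + c^2}} .
\]
Then $b\cos\theta + c\sin\theta
  = \sqrt{b^2 + c^2}\,(\cos\alpha\cos\theta + \sin\alpha\sin\theta)
  = \sqrt{b^2 + c^2}\,\cos(\theta - \alpha)$, so
\[
S = a - \sqrt{b^2 + c^2}\,\cos(\theta - \alpha),
\]
which is smallest when $\cos(\theta - \alpha) = 1$, that is, when $\theta = \alpha$.
```

    Setting $dS/d\theta = b\sin\theta - c\cos\theta = 0$ gives the same stationary points, $\tan\theta = c/b$; the cosine form identifies which of them is the minimum.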


  • CCD for Inverse Kinematics

    After a little bit of math, $S$ can be written as:

    \[
    S = a - \sqrt{b^2 + c^2}\,\cos(\theta - \alpha)
    \]

    $S$ is minimal when $\theta = \alpha$. Now we have explicit values for the sine and cosine of the optimal angle.

    Notice that the time complexity is linear in the number of DOFs to solve for all dihedrals of a chain.
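    A single CCD step can be sketched as follows. The code assumes that $b$ and $c$ are the summed coefficients of $\cos\theta$ and $\sin\theta$ in $S$, i.e. $b = \sum 2 r_i (\vec{f}_i \cdot \hat{r}_i)$ and $c = \sum 2 r_i (\vec{f}_i \cdot \hat{s}_i)$, that each $O_i$ is the foot of the perpendicular from $M_i^0$ onto the rotation axis, and that $\hat{s}_i$ is the axis direction crossed with $\hat{r}_i$. These conventions and the function name are assumptions made for this illustration.

```python
import math

def _sub(u, v): return tuple(a - b for a, b in zip(u, v))
def _add(u, v): return tuple(a + b for a, b in zip(u, v))
def _dot(u, v): return sum(a * b for a, b in zip(u, v))
def _scale(u, s): return tuple(a * s for a in u)
def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def ccd_optimal_angle(axis_point, axis_dir, moving, targets):
    """Angle about the axis that minimizes the summed squared distance
    between the rotated `moving` atoms and their `targets`."""
    axis = _scale(axis_dir, 1.0 / math.sqrt(_dot(axis_dir, axis_dir)))
    b = c = 0.0
    for m0, target in zip(moving, targets):
        # O: foot of the perpendicular from M0 onto the rotation axis.
        o = _add(axis_point, _scale(axis, _dot(_sub(m0, axis_point), axis)))
        r_vec = _sub(m0, o)
        r = math.sqrt(_dot(r_vec, r_vec))
        if r < 1e-12:
            continue  # an atom on the axis does not move
        r_hat = _scale(r_vec, 1.0 / r)
        s_hat = _cross(axis, r_hat)
        f = _sub(target, o)
        b += 2.0 * r * _dot(f, r_hat)
        c += 2.0 * r * _dot(f, s_hat)
    return math.atan2(c, b)  # alpha, with cos(alpha) ~ b and sin(alpha) ~ c

# Toy check: targets are the moving atoms rotated by 0.7 rad about the z axis.
t = 0.7
moving = [(1.0, 0.0, 0.0), (2.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
targets = [(x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t), z) for x, y, z in moving]
alpha = ccd_optimal_angle((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), moving, targets)
```

    A full CCD pass applies this step to each dihedral of the loop in turn and repeats until $S$ falls below a threshold.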


  • Modeling Loops Using Inverse Kinematics

    Cyclic Coordinate Descent: solve for and rotate one dihedral at a time

    Given: atom at current position $M$, target position $F$

    Goal: Solve for dihedral $\theta$ s.t. $|F - M|^2 = S(\theta) < \varepsilon_{\text{threshold}}$

    Time complexity: linear in the number of DOFs to solve for all dihedrals of a chain


  • Modeling Loops Using Inverse Kinematics

    Since there is redundancy, many solutions are feasible.

    Find rotations to satisfy spatial constraints on atoms

    Combine with energy minimization to obtain physical structures

    Example: Chymotrypsin inhibitor 2


  • Equilibrium Fluctuations

    More DOFs than spatial constraints can be exploited to generate fragment fluctuations

    Example: Chymotrypsin inhibitor 2


  • Equilibrium Fluctuations

    Sample equilibrium fluctuations:

    Spatially constrained through Cyclic Coordinate Descent

    Energetically constrained to be feasible

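    Local Fluctuations in α-Lactalbumin

    Boltzmann ensemble average

    \[
    \mathrm{RMSD}_x = \frac{\sum_{C \in \mathrm{Confs}} \mathrm{RMSD}(C, C_{\mathrm{native}})\, e^{-\beta \Delta E_C}}{Q}
    \]

    \[
    \Delta E_C = E_C - E_{\mathrm{native}}, \qquad Q = \sum_{C \in \mathrm{Confs}} e^{-\beta \Delta E_C}
    \]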
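    The ensemble average can be computed directly from per-conformation RMSDs and energies. In this minimal sketch the function name and arguments are hypothetical, and `beta` is assumed to be in units that match the energies.

```python
import math

def boltzmann_average_rmsd(rmsds, energies, e_native, beta):
    """Boltzmann-weighted average of RMSD(C, C_native) over conformations.
    Weights are exp(-beta * (E_C - E_native)); Q is their sum."""
    weights = [math.exp(-beta * (e - e_native)) for e in energies]
    q = sum(weights)
    return sum(r * w for r, w in zip(rmsds, weights)) / q

# Equal energies give the plain mean; a high-energy conformation barely counts.
equal = boltzmann_average_rmsd([1.0, 3.0], [0.0, 0.0], e_native=0.0, beta=1.0)
skewed = boltzmann_average_rmsd([1.0, 3.0], [0.0, 50.0], e_native=0.0, beta=1.0)
```

    For conformations with energies far below the native energy the exponentials can overflow; subtracting the minimum energy before exponentiating is a common remedy and leaves the average unchanged.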


  • Equilibrium Fluctuations

    α-Lactalbumin (α-Lac)

    123 residues

    Hydrogen exchange protection factors available

    Ubiquitin

    76 residues

    NMR information on fluctuations available
