evolutionary algorithms for the protein folding problem


Upload: lethia

Post on 25-Feb-2016

Page 1: Evolutionary Algorithms for the Protein Folding Problem

22/04/23DMI - Università di Catania 1

Evolutionary Algorithms for the Protein Folding Problem

Giuseppe Nicosia

Department of Mathematics and Computer Science

University of Catania

Page 2: Evolutionary Algorithms for the Protein Folding Problem

Talk Outline

1. An overview of Evolutionary Algorithms

2. The Protein Folding Problem

3. Genetic Algorithms for the ab initio prediction

Page 3: Evolutionary Algorithms for the Protein Folding Problem

An overview of Evolutionary Algorithms

EAs are optimization methods based on an evolutionary metaphor that have proved effective in solving difficult problems.

Page 4: Evolutionary Algorithms for the Protein Folding Problem

Computational Intelligence and EAs

“Evolution is the natural way to program” (Thomas Ray)

Page 5: Evolutionary Algorithms for the Protein Folding Problem

Evolutionary Algorithms

1. Set of candidate solutions (individuals): Population.

2. Generating candidates by:
– Reproduction: copying an individual.
– Crossover: 2 parents → 2 children.
– Mutation: 1 parent → 1 child.

3. Quality measure of individuals: Fitness function.

4. Survival-of-the-fittest principle.

Page 6: Evolutionary Algorithms for the Protein Folding Problem

Main components of EAs

1. Representation of individuals: Coding.

2. Evaluation method for individuals: Fitness.

3. Initialization procedure for the 1st generation.

4. Definition of variation operators (mutation and crossover).

5. Parent (mating) selection mechanism.

6. Survivor (environmental) selection mechanism.

7. Technical parameters (e.g. mutation rates, population size).

Page 7: Evolutionary Algorithms for the Protein Folding Problem

Mutation and Crossover

EAs manipulate partial solutions in their search for the overall optimal solution. These partial solutions or `building blocks' correspond to sub-strings of a trial solution; in our case, local sub-structures within the overall conformation.

Page 8: Evolutionary Algorithms for the Protein Folding Problem

`Optimal' Parameter Tuning:

• Experimental tests.

• Adaptation based on measured quality.

• Self-adaptation based on evolution !

Page 9: Evolutionary Algorithms for the Protein Folding Problem

Constraint handling strategies [Michalewicz, Evolutionary Computation, 4(1), 1996]

1. Repair strategy: whenever an infeasible solution is produced, "repair" it, i.e. find a feasible solution "close" to the infeasible one;

2. Penalize strategy: admit infeasible individuals in the population, but penalize them by adding a suitable term to the energy.

Page 10: Evolutionary Algorithms for the Protein Folding Problem

The evolution Loop

Page 11: Evolutionary Algorithms for the Protein Folding Problem

Algorithm Outline

procedure EA; {
    t = 0;
    initialize population P(t);
    evaluate P(t);
    until (done) {
        t = t + 1;
        parent_selection P(t);
        recombine P(t);
        mutate P(t);
        evaluate P(t);
        survive P(t);
    }
}
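The outline above can be sketched in Python. This is only an illustrative sketch: the function names (`evolve`, `onemax`, `random_bits`, `flip_one`, `one_point`), the uniform parent selection, and the "keep the best of parents plus children" survival rule are assumptions chosen for brevity, not something the slides prescribe.

```python
import random

def evolve(fitness, init, mutate, recombine, pop_size=20, generations=50, seed=0):
    """Generic EA loop: select parents, recombine, mutate, evaluate, survive."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]            # initialize P(0)
    for _ in range(generations):                                 # until (done)
        # parent_selection: uniform random choice (an assumption of this sketch)
        parents = [rng.choice(population) for _ in range(pop_size)]
        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[(i + 1) % pop_size]
            c1, c2 = recombine(a, b, rng)                        # recombine
            children += [mutate(c1, rng), mutate(c2, rng)]       # mutate
        # evaluate + survive: keep the pop_size fittest of parents and children
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)

# Hypothetical toy problem: maximize the number of 1s in a 16-bit string.
def onemax(bits):
    return sum(bits)

def random_bits(rng, n=16):
    return tuple(rng.randint(0, 1) for _ in range(n))

def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

best = evolve(onemax, random_bits, flip_one, one_point)
```

Because survivors are drawn from parents and children together, the best fitness in this sketch never decreases from one generation to the next.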

Page 12: Evolutionary Algorithms for the Protein Folding Problem

Example

Page 13: Evolutionary Algorithms for the Protein Folding Problem

Evolutionary Programming

procedure EP; {
    t = 0;
    initialize population P(t);
    evaluate P(t);
    until (done) {
        t = t + 1;
        parent_selection P(t);
        mutate P(t);
        evaluate P(t);
        survive P(t);
    }
}

• The individuals are real-valued vectors, ordered lists, graphs.

• All N individuals are selected to be parents, and then are mutated, producing N children. These children are evaluated and N survivors are chosen from the 2N individuals, using a probabilistic function based on fitness (individuals with a greater fitness have a higher chance of survival).

• Mutation is based on the representation used, and is often adaptive. For example, when using a real-valued vector, each variable within an individual may have an adaptive mutation rate that is normally distributed.

• No recombination.
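One EP generation as described above might look like the following sketch. The slides only say that survival is "a probabilistic function based on fitness"; the stochastic tournament used here (each individual meets `q` random opponents and is ranked by wins) is one common choice and is an assumption, as are the name `ep_generation` and the fixed, non-adaptive `sigma`.

```python
import random

def ep_generation(population, fitness, rng, sigma=0.1, q=5):
    """One EP step: N parents -> N mutated children -> N survivors out of 2N."""
    # Every individual is a parent and is mutated exactly once (no recombination).
    children = [[x + rng.gauss(0.0, sigma) for x in ind] for ind in population]
    pool = population + children                      # the 2N candidates
    scores = [fitness(ind) for ind in pool]
    # Probabilistic survival: count wins against q randomly drawn opponents.
    wins = []
    for s in scores:
        opponents = [rng.choice(scores) for _ in range(q)]
        wins.append(sum(s >= o for o in opponents))
    ranked = sorted(range(len(pool)), key=lambda i: wins[i], reverse=True)
    return [pool[i] for i in ranked[:len(population)]]

# Hypothetical usage: maximize the negative squared norm of a 3-vector.
def neg_sphere(v):
    return -sum(x * x for x in v)

rng = random.Random(0)
pop = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(6)]
pop = ep_generation(pop, neg_sphere, rng)
```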

Page 14: Evolutionary Algorithms for the Protein Folding Problem

Evolution Strategies

procedure ES; {
    t = 0;
    initialize population P(t);
    evaluate P(t);
    until (done) {
        t = t + 1;
        parent_selection P(t);
        recombine P(t);
        mutate P(t);
        evaluate P(t);
        survive P(t);
    }
}

• ES typically use real-valued vectors.

• Individuals are selected uniformly at random to be parents.

• Pairs of parents produce children via recombination. The number of children created is greater than N.

• Survival is deterministic:
1. One form of ES allows the N best children to survive, and replaces the parents with these children.
2. The other form allows the N best among children and parents to survive.

• Like EP, ES uses adaptive mutation.

• Unlike EP, recombination does play an important role in ES, especially in adapting mutation.
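The two deterministic survival rules above can be contrasted in a short sketch. The function name `es_generation`, the averaging recombination, and the fixed mutation step `sigma` are assumptions made for illustration; real ES implementations usually adapt the step size.

```python
import random

def es_generation(parents, fitness, rng, n_children, sigma=0.1, plus=False):
    """One ES step. plus=False: the N best children replace the parents.
    plus=True: the N best among children and parents survive."""
    n = len(parents)
    children = []
    for _ in range(n_children):                        # n_children should exceed n
        a, b = rng.choice(parents), rng.choice(parents)   # uniform parent selection
        child = [(x + y) / 2.0 for x, y in zip(a, b)]     # recombination by averaging
        children.append([x + rng.gauss(0.0, sigma) for x in child])  # mutation
    pool = parents + children if plus else children
    return sorted(pool, key=fitness, reverse=True)[:n]    # deterministic survival

def neg_sphere(v):
    return -sum(x * x for x in v)

rng = random.Random(0)
start = [[1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
kept = es_generation(start, neg_sphere, rng, n_children=9, plus=True)
```

With `plus=True` the best individual can never be lost; with `plus=False` it can, which is the price paid for forgetting outdated parents.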

Page 15: Evolutionary Algorithms for the Protein Folding Problem

Genetic Algorithms

procedure GA; {
    t = 0;
    initialize population P(t);
    evaluate P(t);
    until (done) {
        t = t + 1;
        parent_selection P(t);
        recombine P(t);
        mutate P(t);
        evaluate P(t);
        survive P(t);
    }
}

• GAs traditionally use a more domain independent representation, namely, bit-strings.

• Parents are selected according to a probabilistic function based on relative fitness.

• N children are created via recombination from the N parents.

• The N children are mutated and survive, replacing the N parents in the population.

• Emphasis on mutation and crossover is opposite to that in EP.

• Mutation flips bits with some small probability (background operator).

• Recombination, on the other hand, is emphasized as the primary search operator.
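A minimal sketch of one traditional GA generation, following the bullets above: fitness-proportional parent selection, one-point recombination, low-probability bit-flip mutation, and full replacement of the parents. The name `ga_generation` and the default `p_mut` are assumptions; the fitness function is assumed to return non-negative values that are not all zero.

```python
import random

def ga_generation(population, fitness, rng, p_mut=0.001):
    """One generational GA step on bit-string individuals (lists of 0/1)."""
    weights = [fitness(ind) for ind in population]     # must be >= 0 and not all 0
    n = len(population)
    children = []
    while len(children) < n:
        # Parents drawn with probability proportional to their fitness.
        mom, dad = rng.choices(population, weights=weights, k=2)
        cut = rng.randrange(1, len(mom))               # one-point recombination
        for child in (mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]):
            # Background mutation: flip each bit with small probability.
            children.append([1 - b if rng.random() < p_mut else b for b in child])
    return children[:n]                                # children replace the parents

rng = random.Random(1)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
new_pop = ga_generation(pop, lambda bits: sum(bits) + 1, rng)
```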

Page 16: Evolutionary Algorithms for the Protein Folding Problem

Genetic Programming 1/2

There are 5 major preparatory steps in using GP for a particular problem.

1) selection of the set of terminals (e.g., the actual variables of the problem, zero-argument functions, and random constants, if any)

2) selection of the set of functions

3) identification of the evaluation function

4) selection of parameters of the system for controlling the run

5) selection of the termination condition.

Each tree (program) is composed of functions and terminals appropriate to the particular problem domain; the set of all functions and terminals is selected a priori in such a way that some of the composed trees yield a solution.

The initial population is composed of such trees. The evaluation function assigns a fitness value which evaluates the performance of a tree. The evaluation is based on a preselected set of test cases (fitness cases); in general, the fitness function returns the sum of distances between the correct and obtained results on all test cases.
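The evaluation step can be illustrated with a tiny sketch. The nested-tuple tree encoding, the terminal `"x"`, the function set, and the names `evaluate` and `total_error` are all assumptions for this example; the target function x² + x is hypothetical.

```python
import operator

# Hypothetical function set; terminals are the variable "x" and numeric constants.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, x):
    """Evaluate a program tree such as ("+", ("*", "x", "x"), "x") at x."""
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))

def total_error(tree, fitness_cases):
    """Sum of distances between correct and obtained results (lower is better)."""
    return sum(abs(evaluate(tree, x) - target) for x, target in fitness_cases)

# Fitness cases sampled from the hypothetical target x*x + x.
cases = [(x, x * x + x) for x in range(-3, 4)]
exact = ("+", ("*", "x", "x"), "x")
partial = ("*", "x", "x")
```

Here `total_error(exact, cases)` is 0, while `partial` misses the linear term and accumulates an error of 12 over the seven cases.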

Page 17: Evolutionary Algorithms for the Protein Folding Problem

Genetic Programming 2/2

procedure GP; {
    t = 0;
    initialize population P(t);   /* randomly create an initial pop of individual computer programs */
    evaluate P(t);                /* execute each program in the pop and assign it a fitness value */
    until (done) {
        t = t + 1;
        parent_selection P(t);    /* select one or two program(s) with a probability based on fitness (with reselection allowed) */
        create P(t);              /* create new programs for the pop by applying the following ops with specified probability */
            reproduction;         /* copy the selected program to the new pop */
            crossover;            /* create new offspring programs for the new pop by recombining randomly chosen parts from 2 selected programs */
            mutation;             /* create one new offspring program for the new pop by randomly mutating a randomly chosen part of one selected program */
            architecture-altering ops;   /* choose an architecture-altering operation from the available repertoire of such ops and create one new offspring program for the new pop by applying it to the one selected program */
    }
}

Page 18: Evolutionary Algorithms for the Protein Folding Problem

Scaling

Suppose one has two search spaces. The first is described with a real-valued fitness function F. The second search space is described by a fitness function G that is equivalent to F^p, where p is some constant. The relative positions of peaks and valleys in the two search spaces correspond exactly. Only the relative heights differ (i.e., the vertical scale is different).

Should our EA search both spaces in the same manner?

Page 19: Evolutionary Algorithms for the Protein Folding Problem

Ranking selection (ES, EP)

If we believe that the EA should search the two spaces in the same manner, then selection should be based only on the relative ordering of fitnesses: only the rank of individuals matters.

ES

• Parent selection is performed uniformly randomly, with no regard to fitness.

• Survival simply saves the N best individuals, which depends only on the relative ordering of fitnesses.

EP

• All individuals are selected to be parents. Each parent is mutated once, producing N children.

• A probabilistic ranking mechanism chooses the N best individuals for survival, from the union of the parents and children.

Page 20: Evolutionary Algorithms for the Protein Folding Problem

Probabilistic selection mechanism (GA)

Many people in the GA community believe that F and G should be searched differently. Fitness proportional selection is the probabilistic selection mechanism of the traditional GA.

• Parent selection is performed based on how fit an individual is with respect to the population average. For example, an individual with fitness twice the population average will tend to have twice as many children as average individuals.

• Survival, though, is not based on fitness, since the parents are automatically replaced by the children.
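The difference between rank-based and fitness-proportional selection on the two spaces F and G = F^p can be checked numerically. The helper names and the three sample fitness values below are illustrative assumptions.

```python
def expected_children(fitnesses):
    """Expected offspring per individual under fitness-proportional selection:
    each individual's fitness divided by the population average."""
    mean = sum(fitnesses) / len(fitnesses)
    return [f / mean for f in fitnesses]

def rank_order(values):
    """Indices of individuals sorted from worst to best."""
    return sorted(range(len(values)), key=values.__getitem__)

F = [1.0, 2.0, 3.0]          # fitnesses in the first search space
G = [f ** 2 for f in F]      # same individuals under G = F^p with p = 2

children_F = expected_children(F)   # [0.5, 1.0, 1.5]
children_G = expected_children(G)   # the best individual now gets about 1.93
```

The ranking is identical in both spaces, so ES- and EP-style selection behaves the same on F and G, whereas proportional selection gives the best individual a larger share of offspring under G.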

Page 21: Evolutionary Algorithms for the Protein Folding Problem

Lacking the killer instinct

One problem with this latter approach is that, as the search continues, more and more individuals receive fitnesses with small relative differences. This lessens the selection pressure, slowing the progress of the search. This effect, often referred to as "lacking the killer instinct", can be compensated somewhat by scaling mechanisms that attempt to magnify relative differences as the search progresses.

Page 22: Evolutionary Algorithms for the Protein Folding Problem

Mutation and Adaptation

• GAs typically use mutation as a simple background operator, to ensure that a particular bit value is not lost forever. Mutation in GAs typically flips bits with a very low probability (e.g., 1 bit out of 1000).

• Mutation is far more important in ESs and EP. Instead of a global mutation rate, mutation probability distributions can be maintained for every variable of every individual.

More importantly, ESs and EP encode the probability distributions as extra information within each individual, and allow this information to evolve as well (self-adaptation of mutation parameters, while the space is being searched).
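Self-adaptation can be sketched as follows: each individual carries one step size per variable, the step sizes are perturbed first, and the new step sizes are then used to mutate the variables. The log-normal update and the learning rate `tau = 1/sqrt(n)` are a commonly used scheme assumed here for illustration; the slides do not specify the update rule, and the name `self_adaptive_mutation` is hypothetical.

```python
import math
import random

def self_adaptive_mutation(x, sigmas, rng, tau=None):
    """Mutate an individual (x, sigmas): the step sizes evolve with the solution."""
    n = len(x)
    if tau is None:
        tau = 1.0 / math.sqrt(n)          # heuristic learning rate (assumption)
    # 1. Perturb each step size multiplicatively so it stays positive.
    new_sigmas = [s * math.exp(tau * rng.gauss(0.0, 1.0)) for s in sigmas]
    # 2. Mutate each variable with its own, freshly mutated step size.
    new_x = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, new_sigmas)]
    return new_x, new_sigmas

rng = random.Random(0)
child, child_sigmas = self_adaptive_mutation([0.0, 1.0, 2.0], [0.5, 0.5, 0.5], rng)
```

Selection then acts on the variables directly and on the step sizes indirectly: individuals whose step sizes produce good children tend to survive.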

Page 23: Evolutionary Algorithms for the Protein Folding Problem

Recombination and Adaptation

• There are a number of recombination methods for ESs, all of which assume that the individuals are composed of real-valued variables. Either the values are exchanged or they are averaged. The ES community has also considered multi-parent versions of these operators.

• The GA community places primary emphasis on crossover.

– One-point recombination inserts a cut-point within the two parents (e.g., between the 3rd and 4th variables, or bits). Then the information before the cut-point is swapped between the two parents.

– Multi-point recombination is a generalization of this idea, introducing a higher number of cut-points. Information is then swapped between pairs of cut-points.

– Uniform crossover, however, does not use cut-points, but simply uses a global parameter to indicate the likelihood that each variable should be exchanged between two parents.
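The three GA crossover operators described above can be written compactly. The function names and the convention of passing cut positions explicitly are assumptions of this sketch; parents are lists of equal length.

```python
import random

def one_point(a, b, cut):
    """Swap the information before the cut-point between the two parents."""
    return b[:cut] + a[cut:], a[:cut] + b[cut:]

def multi_point(a, b, cuts):
    """Swap the segments lying between alternate pairs of cut-points."""
    c1, c2 = list(a), list(b)
    swap, prev = False, 0
    for cut in sorted(cuts) + [len(a)]:
        if swap:
            c1[prev:cut], c2[prev:cut] = b[prev:cut], a[prev:cut]
        swap, prev = not swap, cut
    return c1, c2

def uniform(a, b, rng, p_swap=0.5):
    """Exchange each position independently with probability p_swap."""
    c1, c2 = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < p_swap:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

zeros, ones = [0] * 6, [1] * 6
```

For example, `one_point(zeros, ones, 3)` yields `[1, 1, 1, 0, 0, 0]` and `[0, 0, 0, 1, 1, 1]`, and `multi_point(zeros, ones, [2, 4])` exchanges only positions 2 and 3.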

Page 24: Evolutionary Algorithms for the Protein Folding Problem

Representation

• Traditionally, GAs use bit strings. In theory, this representation makes the GA more problem independent. We can also see this as a more genotypic level of representation, since the individual is in some sense encoded in the bit string. Recently, however, the GA community has investigated more phenotypic representations, including vectors of real values, ordered lists, neural networks.

• The ES and EP communities focus on real-valued vector representations, although the EP community has also used ordered list and finite state automata representations.

• Very little has been done in the way of adaptive representations.

Page 25: Evolutionary Algorithms for the Protein Folding Problem

Strength of the selection and population’s carrying capacity

• Strong selection refers to a selection mechanism that concentrates quickly on the best individuals, while weaker selection mechanisms allow poor individuals to survive (and produce children) for a longer period of time.

• Similarly, the population can be thought of as having a certain carrying capacity, which refers to the amount of information that the population can usefully maintain. A small population has less carrying capacity, which is usually adequate for simple problems. Larger populations, with larger carrying capacities, are often better for more difficult problems.

• Perhaps the evolutionary algorithm can adapt both selection pressure and the population size dynamically, as it solves problems.

Page 26: Evolutionary Algorithms for the Protein Folding Problem

Accumulated payoff

• EP and ES usually have optimization as a goal. In other words, they are typically most interested in finding the best solution as quickly as possible.

• De Jong (1992) reminds us that GAs are not function optimizers per se, although they can be used as such. There is very little theory indicating how well GAs will perform optimization tasks. Instead, theory concentrates on what is referred to as accumulated payoff.

• The difference can be illustrated by considering financial investment planning over a period of time (e.g., you play the stock market). Instead of trying to find the best stock, you are trying to maximize your returns as the various stocks are sampled. Clearly the two goals are somewhat different, and maximizing the return may or may not also be a good heuristic for finding the best stock.

Page 27: Evolutionary Algorithms for the Protein Folding Problem

Fitness correlation

• Fitness correlation appears to be a measure of EA-hardness that places less emphasis on optimality (Manderick et al., 1991). It measures the correlation between the fitness of children and their parents. Manderick et al. found a strong relationship between GA performance and the strength of the correlations.

• Another possibility is problem modality. Those problems that have many suboptimal solutions will, in general, be more difficult to search.

• Finally, this issue is also closely related to a concern of de Garis, which he refers to as evolvability. de Garis notes that often his systems do not evolve at all, namely, that fitness does not increase over time. The reasons for this are not clear and remain an important research topic.

Page 28: Evolutionary Algorithms for the Protein Folding Problem

Distributed EAs

Because of the inherent natural parallelism within an EA, much recent work has concentrated on the implementation of EAs on parallel machines. Typically one processor holds either one individual (in SIMD machines) or a subpopulation (in MIMD machines). Clearly, such implementations hold the promise of reduced execution time.

More interesting are the evolutionary effects that can be naturally illustrated with parallel machines, namely speciation, niching, and punctuated equilibria (Belew and Booker, 1991).

Page 29: Evolutionary Algorithms for the Protein Folding Problem

Summary

• Selection serves to focus search into areas of high fitness.

• Other genetic operators (recombination and mutation) perturb the individuals, providing exploration in nearby areas.

• Recombination and mutation provide different search biases, which may or may not be appropriate for the task.

• The key to more robust EA systems probably lies in the adaptive selection of such genetic operators.

Page 30: Evolutionary Algorithms for the Protein Folding Problem

Hint: Gray Code

Gray code is often used in GAs for mapping between a decimal number and a bit string. Mapping each digit of a decimal number to a string of four bits corresponds to choosing 10 strings from 16 possibilities. In Gray code, two neighbouring decimal digits are always represented by adjacent bit strings that differ by only one bit position.

Decimal  Binary  Gray
0        0000    0000
1        0001    0001
2        0010    0011
3        0011    0010
4        0100    0110
5        0101    0111
6        0110    0101
7        0111    0100
8        1000    1100
9        1001    1101
10       1010    1111
11       1011    1110
12       1100    1010
13       1101    1011
14       1110    1001
15       1111    1000

Procedure binaryToGray {
    g_1 = b_1;
    for k = 2 to m do
        g_k = b_(k-1) XOR b_k;
}
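The binaryToGray procedure translates directly into Python. The list-of-bits representation (most significant bit first) and the helper `int_to_bits` are assumptions added so the table above can be reproduced.

```python
def binary_to_gray(bits):
    """g_1 = b_1; g_k = b_(k-1) XOR b_k for k = 2..m (bits is a list, MSB first)."""
    return bits[:1] + [bits[k - 1] ^ bits[k] for k in range(1, len(bits))]

def int_to_bits(n, width=4):
    """Hypothetical helper: fixed-width binary digits of n, MSB first."""
    return [(n >> (width - 1 - i)) & 1 for i in range(width)]

def hamming(a, b):
    """Number of positions in which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

gray_table = [binary_to_gray(int_to_bits(n)) for n in range(16)]
```

Each pair of consecutive entries in `gray_table` differs in exactly one bit, which is the property that makes small numeric changes reachable by a single bit-flip mutation.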

Page 31: Evolutionary Algorithms for the Protein Folding Problem

Advantages of EAs

• Widely applicable, also in cases where no (good) problem-specific techniques are available:
– Multimodalities, discontinuities, constraints.
– Noisy objective functions.
– Multiple criteria decision making problems.

• No presumptions with respect to the problem space.

• Low development costs, i.e. costs to adapt to new problem spaces.

• The solutions of EAs have straightforward interpretations.

• They can be run interactively (online parameter adjustment).

Page 32: Evolutionary Algorithms for the Protein Folding Problem

Disadvantages of EAs

• No guarantee for finding optimal solutions within a finite amount of time: true for all global optimization methods.

• No complete theoretical basis (yet), but much progress is being made.

• Parameter tuning is largely based on trial and error (genetic algorithms); solution: self-adaptation (evolution strategies).

• Often computationally expensive; solution: parallelism.

Page 33: Evolutionary Algorithms for the Protein Folding Problem

Applications of EAs

• Optimization and Problem Solving;

• NP-Complete Problems;

• Protein Folding;

• Financial Forecasting;

• Automated Synthesis of Analog Electrical Circuits;

• Evolutionary Robotics;

• Evolvable Hardware;

• Modelling.

Page 34: Evolutionary Algorithms for the Protein Folding Problem

Turing's vision

July 20, 2001

Page 35: Evolutionary Algorithms for the Protein Folding Problem

Turing’s three approaches for creating intelligent computer programs

One approach was logic-driven, while a second was knowledge-based.

The third approach that Turing specifically identified in 1948 for achieving machine intelligence is

“... the genetical or evolutionary search by which a combination of genes is looked for, the criterion being the survival value”.

Turing A. M., Intelligent machinery, Machine Intelligence, 1948.

Page 36: Evolutionary Algorithms for the Protein Folding Problem

In 1950 Turing described how evolution and natural selection might be used to create an intelligent program:

“... we cannot expect to find a good child-machine at first attempt. One must experiment with teaching one such machine and see how well it learns. One can then try another and see if it is better or worse”.

Turing A. M., Computing machinery and intelligence, Mind, 1950.

Page 37: Evolutionary Algorithms for the Protein Folding Problem

Turing’s third approach & EAs

There is an obvious connection between this process and evolution, by identifications:

1. Structure of the child machine = Hereditary material;

2. Changes of the child machine = Mutations;

3. Judgment of the experimenter = Natural selection.

The above features of Turing's third approach to machine intelligence are common to the various forms of evolutionary computation developed over the past four decades.

Page 38: Evolutionary Algorithms for the Protein Folding Problem

Genes were easy: The Protein Folding Problem

Genomics → Transcriptomics → Proteomics

Page 39: Evolutionary Algorithms for the Protein Folding Problem

Reductionistic and synthetic approaches

Page 40: Evolutionary Algorithms for the Protein Folding Problem

Basic principles

Page 41: Evolutionary Algorithms for the Protein Folding Problem

Why are computer scientists interested in biology?

Page 42: Evolutionary Algorithms for the Protein Folding Problem

Scientific answer

• Biology is interesting as a domain for AI research (e.g. drug design).

• Biology provides a rich set of metaphors for thinking about intelligence: genetic algorithms, neural networks and Darwinian automata are but a few of the computational approaches to behavior based on biological ideas. There will, no doubt, be many more (e.g. Artificial Immune Systems).

Page 43: Evolutionary Algorithms for the Protein Folding Problem

Pragmatic answer

• “Gene sequencing’s Industrial Revolution” [IEEE Spectrum, November 2000].

• IBM predicts the IT market for biology will grow from $3.5 billion to more than $9 billion by 2003. The volume of life science data doubles every six months. [IEEE Spectrum, January 2001]

• “Golden rice to Bioinformatics” [Scientific American 2001].

• Biotechnology, BioXML, BioPerl, BioJava, Bio-inspired models, Biological data analysis … Bio-all.

Page 44: Evolutionary Algorithms for the Protein Folding Problem

The building blocks: the 20 natural Amino acids

No.  3-letter  1-letter  Name            Class
1    Ala       A         Alanine         H
2    Arg       R         Arginine        C+
3    Asn       N         Asparagine      P
4    Asp       D         Aspartic acid   C-
5    Cys       C         Cysteine        P
6    Gln       Q         Glutamine       P
7    Glu       E         Glutamic acid   C-
8    Gly       G         Glycine         P
9    His       H         Histidine       P, C+
10   Ile       I         Isoleucine      H
11   Leu       L         Leucine         H
12   Lys       K         Lysine          C+
13   Met       M         Methionine      H
14   Phe       F         Phenylalanine   H
15   Pro       P         Proline         H
16   Ser       S         Serine          P
17   Thr       T         Threonine       P
18   Trp       W         Tryptophan      H
19   Tyr       Y         Tyrosine        P
20   Val       V         Valine          H

Columns: 3-letter code, single-letter code, residue name, class (C = charged, P = polar, H = hydrophobic).

Page 45: Evolutionary Algorithms for the Protein Folding Problem

Proteins are necklaces of amino acids

The protein is a linear polymer of the 20 different kinds of amino acids, which are linked by peptide bonds. Protein sequence length: 20 – 4500 aa.

Page 46: Evolutionary Algorithms for the Protein Folding Problem

Hydrophobic & hydrophilic residues

• Hydrophobic residues tend to come together to form a compact core that excludes water. Because the environment inside cells is aqueous (primarily water), these hydrophobic residues will tend to be on the inside of a protein, rather than on its surface.

• Hydrophobicity is one of the key factors that determine how the chain of amino acids will fold up into an active protein (hydrophilic: attracted to water; hydrophobic: repelled by water).

• The polarity of a molecule refers to the degree to which its electrons are distributed asymmetrically. A non-polar molecule has a relatively even distribution of charge.

Page 47: Evolutionary Algorithms for the Protein Folding Problem

Primary structure

• The scaffold is always the same.

• The side-chain R determines the amino acid type.

Page 48: Evolutionary Algorithms for the Protein Folding Problem

Grand Challenge Problems in Bioinformatics [T. Lengauer, Informatics – 10 Years Back, 10 Years Ahead, LNCS 2000]

1. Finding Genes in Genomic Sequences

2. Protein Folding and Protein Structure Prediction

3. Estimating the Free Energy of Biomolecules and their complexes

4. Simulating a Cell

Page 49: Evolutionary Algorithms for the Protein Folding Problem

The famed Protein Folding problem asks how the amino-acid sequence of a protein adopts its native three-dimensional structure under natural conditions (e.g. in aqueous solution, with neutral pH at room temperature).

Page 50: Evolutionary Algorithms for the Protein Folding Problem

Sequence

Structure

Function

While the nature of the fold is determined by the sequence, it is encoded in a very complicated manner. Thus, protein folding can be seen as a connection between the genome (sequence) and what the proteins actually do (their function).

Page 51: Evolutionary Algorithms for the Protein Folding Problem

Protein Folding process

Folding the smallest protein: a single alpha helix.

• Denatured = Unfolded (disruption of equilibrium and of the H-bonds): no biological activity.

• Native = Folded = Unique compact structure: biologically active.

Below the folding transition temperature, the protein seems to exist in a unique conformation.

Folding time: 10^-2 to 1 s

Page 52: Evolutionary Algorithms for the Protein Folding Problem

Protein Folding Principles

• Finding a low-energy shape: proteins tend to twist into shapes that achieve a “low energy” state in which amino acids fit comfortably together.

• Attraction between neighbors: amino acids (aa) will most strongly attract or repel those closest to themselves. Amino acids can also interact with each other through “H-bonds”, weak interactions that, when multiplied throughout the chain, can hold a protein in a regular shape.

Page 53: Evolutionary Algorithms for the Protein Folding Problem

The Ab initio protein structure prediction problem

Protein folding has to be distinguished from protein structure prediction (PSP).

In the PSP we are not interested in the folding process but just in the final structure attained.

Page 54: Evolutionary Algorithms for the Protein Folding Problem

Why do Proteins “Fold”?

In order to carry out their function (e.g. as enzymes or antibodies), they must take on a particular shape, also known as a Fold. Thus, proteins are truly amazing machines: before they do their work, they assemble themselves! This self-assembly is called Folding.

What happens if proteins don’t fold correctly?

Diseases such as Alzheimer's disease, cystic fibrosis, Mad Cow disease, an inherited form of emphysema, and even many cancers are believed to result from protein misfolding. When proteins misfold, they can clump together (aggregate).

Page 55: Evolutionary Algorithms for the Protein Folding Problem

Computational Tools in Protein Folding

• Molecular Dynamics (MD);

• Monte Carlo (MC);

• Simulated Annealing (SA);

• Evolutionary Algorithms (EA);

• Convex Global Underestimator (CGU) [K.A.Dill & H.S.Chan,Nature Struct Biol, 4:10-19,1997];

• Performance Guaranteed Approximation Algorithms [W.E. Hart & S. Istrail,RECOMB, ACM SIGACT 1997];

Search methods

Compute native conformation

Page 56: Evolutionary Algorithms for the Protein Folding Problem

Genetic Algorithms for the ab initio prediction

Using biologically inspired ideas to compute about biological problems.

Page 57: Evolutionary Algorithms for the Protein Folding Problem

Why Genetic Algorithms for Protein Structure Prediction?

1. PSP is analytically difficult to solve.

2. The number of conformations of a protein with N aa grows exponentially as $\gamma^N$, where $\gamma$ is the average number of conformations per residue (typically 10).

3. The PSP problem is NP-hard.

4. The energy landscape of proteins must be “rugged”.
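To give a feel for point 2, the sketch below counts conformations using the slide's figure of about 10 conformations per residue. The function name and the enumeration rate are illustrative assumptions and do not come from the talk.

```python
def conformation_count(n_residues: int, per_residue: int = 10) -> int:
    """Size of the conformational space: per_residue ** n_residues."""
    return per_residue ** n_residues

# Hypothetical enumeration rate, chosen only for illustration.
RATE_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 365 * 24 * 3600

for n in (10, 50, 100):
    count = conformation_count(n)
    years = count / (RATE_PER_SECOND * SECONDS_PER_YEAR)
    print(f"N={n}: {count:.1e} conformations, roughly {years:.1e} years to enumerate")
```

Even at this generous rate, exhaustive enumeration is out of reach for chains of realistic length, which is why heuristic search methods are of interest.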

Page 58: Evolutionary Algorithms for the Protein Folding Problem

The HP protein folding model [K. A. Dill, Biochemistry, 24:1501, 1985].

• HP models abstract the hydrophobic interaction process in protein folding by reducing a protein to a heteropolymer that represents a predetermined pattern of hydrophobicity in the protein.

• Nonpolar amino acids are classified as hydrophobic (H) and polar amino acids are classified as hydrophilic (P). A sequence is $s \in \{H, P\}^+$.

• The HP model restricts the space of conformations to self-avoiding paths on a lattice (square, cubic or triangular) in which vertices are labeled by the amino acids.

Page 59: Evolutionary Algorithms for the Protein Folding Problem

The energy potential in the HP model

• The energy potential reflects the fact that hydrophobic amino acids have a propensity to form a hydrophobic core. To capture this feature of protein structures, the HP model adds a value $\epsilon$ for every pair of hydrophobics that form a topological contact.

• A topological contact is formed by a pair of amino acids that are adjacent on the lattice and not consecutive in the sequence. The value of $\epsilon$ is typically taken to be -1.
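The contact rule above can be sketched in a few lines of Python. This is a minimal illustration on the square lattice, assuming the conformation is given as a list of integer (x, y) sites, one per residue; the function name `hp_energy` and its signature are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]

def hp_energy(sequence: str, coords: List[Coord], contact_value: int = -1) -> int:
    """Add contact_value for every topological HH contact of a 2D lattice conformation."""
    if len(sequence) != len(coords):
        raise ValueError("one lattice site per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # j > i + 1 skips chain neighbours
            if sequence[i] == "H" and sequence[j] == "H":
                dx = abs(coords[i][0] - coords[j][0])
                dy = abs(coords[i][1] - coords[j][1])
                if dx + dy == 1:  # adjacent on the lattice
                    energy += contact_value
    return energy

# Four residues folded into a unit square: residues 0 and 3 form one HH contact.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```

The double loop is quadratic in the chain length, which is adequate for the short sequences used with lattice models.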

Page 60: Evolutionary Algorithms for the Protein Folding Problem

HP sequences embedded

The figure shows a sequence embedded in the square lattice, with hydrophobic-hydrophobic contacts (HH contacts) highlighted with dotted lines. The conformation has an energy of -4.

Page 61: Evolutionary Algorithms for the Protein Folding Problem

Why the HP model for protein structure prediction?

• S. Istrail: “Best Science for Protein Folding” [Lipari School on Computational Biology, 1999].

• The HP model is powerful enough to capture a variety of properties of actual proteins [Dill et al., Protein Science, 4:561, 1995].

• The PSP problem for the HP model has been shown to be NP-complete on the square lattice [Crescenzi et al., J. Comp. Bio., 5(3), 1998] and on the cubic lattice [Berger et al., J. Comp. Bio., 5(1), 1998].

Page 62: Evolutionary Algorithms for the Protein Folding Problem

Internal coordinates with Absolute direction

Individuals are coded as a sequence in $\{U,D,L,R,F,B\}^{n-1}$, whose symbols correspond to up, down, left, right, forward and backward moves on a cubic lattice, for a protein of length n.

Page 63: Evolutionary Algorithms for the Protein Folding Problem

Representation and Encoding

$P_{seq} \in \{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V\}^+$

$P_{HP} \in \{H, P\}^+$

$P_{conf\text{-}abs}$ = RULLURURULU

$P_{conf} \in \{U,D,L,R,F,B\}^{n-1}$

Page 64: Evolutionary Algorithms for the Protein Folding Problem

Bond directions describing lattice conformations

A bond direction corresponds to a change, $\Delta r$, in one of the Cartesian coordinates of the successive monomer, keeping all other coordinates the same as those of the previous monomer.

Direction    $\Delta r$

Up           $r_z \to r_z + 1$

Left         $r_x \to r_x - 1$

Front        $r_y \to r_y + 1$

Back         $r_y \to r_y - 1$

Right        $r_x \to r_x + 1$

Down         $r_z \to r_z - 1$
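The table translates directly into a decoder from a direction string to lattice coordinates. The following is a minimal sketch, assuming the first monomer sits at the origin; `MOVES`, `decode` and `is_self_avoiding` are hypothetical names introduced here.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

# Unit moves from the table: each direction changes one coordinate by +1 or -1.
MOVES = {
    "U": (0, 0, 1), "D": (0, 0, -1),
    "L": (-1, 0, 0), "R": (1, 0, 0),
    "F": (0, 1, 0), "B": (0, -1, 0),
}

def decode(directions: str, start: Coord = (0, 0, 0)) -> List[Coord]:
    """Turn n-1 absolute bond directions into the n monomer coordinates."""
    coords = [start]
    for d in directions:
        dx, dy, dz = MOVES[d]
        x, y, z = coords[-1]
        coords.append((x + dx, y + dy, z + dz))
    return coords

def is_self_avoiding(coords: List[Coord]) -> bool:
    """True when no lattice site is occupied by more than one monomer."""
    return len(set(coords)) == len(coords)

# The example string from the encoding slide: 11 moves describe 12 monomers.
path = decode("RULLURURULU")
print(len(path), is_self_avoiding(path))
```

Absolute directions keep decoding trivial, at the cost that a single mutated symbol shifts every monomer after it.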

Page 65: Evolutionary Algorithms for the Protein Folding Problem

The energy function

(The equation on this slide is not preserved in the transcript.) Its parameters are:

• the strength of the HH attraction (usually taken as 1);

• an energetic penalty parameter for sites containing two monomers;

• an energetic penalty parameter for sites containing three or more monomers.
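Since the equation itself is missing from the transcript, the sketch below shows only one plausible way to combine the three parameters: an attraction term per HH contact plus a penalty per doubly occupied site and a larger penalty per site holding three or more monomers. The functional form, the names `lam2` and `lam3`, and the default values are all assumptions, not the author's formula.

```python
from collections import Counter
from typing import List, Tuple

Coord = Tuple[int, int, int]

def penalised_energy(sequence: str, coords: List[Coord], epsilon: float = 1.0,
                     lam2: float = 2.0, lam3: float = 4.0) -> float:
    """ASSUMED form: -epsilon * (HH contacts) + lam2 * n2 + lam3 * n3.

    n2 counts sites holding exactly two monomers, n3 sites holding three or more.
    lam2 and lam3 are hypothetical names and values.
    """
    contacts = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip chain neighbours
            if sequence[i] == "H" and sequence[j] == "H":
                dist = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
                if dist == 1:  # adjacent lattice sites
                    contacts += 1
    occupancy = Counter(coords)
    n2 = sum(1 for c in occupancy.values() if c == 2)
    n3 = sum(1 for c in occupancy.values() if c >= 3)
    return -epsilon * contacts + lam2 * n2 + lam3 * n3
```

Penalising overlaps instead of forbidding them lets crossover and mutation pass through invalid conformations while still steering the search toward self-avoiding ones.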

Page 66: Evolutionary Algorithms for the Protein Folding Problem

Procedure GA (Protein Sequence) {

    t := 0;

    initialize population P(t);   /* random population of conformations */

    evaluate P(t);                /* compute energy function */

    while not terminate do {      /* terminate when the free energy reaches its equilibrium point */

        parent_selection P(t);

        crossover P(t);           /* 1-point crossover, applied stochastically with fixed probability Pcros */

        mutate P(t);              /* randomly change the value of bond directions along the string with fixed probability Pmut */

        evaluate P(t);

        survive P(t);             /* all individuals are replaced except for the best 10% of the current conformations (elitism strategy) */

    }

    Output (Protein structure);

}
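The procedure can be sketched as a small runnable Python program. This is an illustration under stated assumptions, not the implementation behind the talk: the energy uses -1 per HH contact plus an assumed penalty per overlapping monomer, the selection weights are a simple shift of the energies, the run stops after a fixed number of generations instead of testing the free energy, and all names and parameter values are hypothetical.

```python
import random
from typing import List, Tuple

MOVES = {"U": (0, 0, 1), "D": (0, 0, -1), "L": (-1, 0, 0),
         "R": (1, 0, 0), "F": (0, 1, 0), "B": (0, -1, 0)}
ALPHABET = "UDLRFB"

def decode(conf: str) -> List[Tuple[int, int, int]]:
    """Absolute bond directions -> monomer coordinates, starting at the origin."""
    coords = [(0, 0, 0)]
    for d in conf:
        x, y, z = coords[-1]
        dx, dy, dz = MOVES[d]
        coords.append((x + dx, y + dy, z + dz))
    return coords

def energy(seq: str, conf: str, clash_penalty: int = 2) -> int:
    """-1 per HH topological contact, plus an ASSUMED penalty per overlapping monomer."""
    coords = decode(conf)
    e = clash_penalty * (len(coords) - len(set(coords)))
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):
            if seq[i] == seq[j] == "H":
                if sum(abs(a - b) for a, b in zip(coords[i], coords[j])) == 1:
                    e -= 1
    return e

def run_ga(seq: str, pop_size: int = 40, generations: int = 60, p_cross: float = 0.8,
           p_mut: float = 0.05, elite_frac: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    n_moves = len(seq) - 1
    pop = ["".join(rng.choice(ALPHABET) for _ in range(n_moves)) for _ in range(pop_size)]
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):  # fixed budget replaces the free-energy stop test
        energies = [energy(seq, c) for c in pop]
        worst = max(energies)
        weights = [worst - e + 1 for e in energies]  # positive, larger for lower energy
        ranked = [c for _, c in sorted(zip(energies, pop))]
        next_pop = ranked[:n_elite]  # elitism: the best 10% survive unchanged
        while len(next_pop) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)  # fitness-proportional parents
            if n_moves > 1 and rng.random() < p_cross:  # 1-point crossover
                cut = rng.randrange(1, n_moves)
                a = a[:cut] + b[cut:]
            child = "".join(rng.choice(ALPHABET) if rng.random() < p_mut else d for d in a)
            next_pop.append(child)
        pop = next_pop
    best = min(pop, key=lambda c: energy(seq, c))
    return best, energy(seq, best)

best_conf, best_energy = run_ga("HPHPPHHPHH")
print(best_conf, best_energy)
```

Because the elite fraction is copied unchanged, the best energy in the population never gets worse from one generation to the next.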

Page 67: Evolutionary Algorithms for the Protein Folding Problem

Selection

Selection is linearly proportional to fitness, so that the probability $P_i$ of selecting the i-th conformation, with fitness value $F_i$, to propagate to the next time step is given by:

$$P_i = \frac{F_i}{\sum_j F_j}$$

Page 68: Evolutionary Algorithms for the Protein Folding Problem

Fitness function

Probabilities must be positive, so a linear mapping with a cut-off value is used to convert the energy (E) minimization problem into a fitness (F) maximization problem.
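The exact mapping is not preserved in the transcript, so the following sketch assumes one simple linear form with a cut-off (fitness equals the cut-off minus the energy, clipped at zero) and combines it with selection proportional to fitness. The function names and the choice of cut-off are hypothetical.

```python
from typing import List

def fitness(energy: float, cutoff: float) -> float:
    """ASSUMED linear map with cut-off: F = cutoff - E below the cut-off, else 0."""
    return max(cutoff - energy, 0.0)

def selection_probabilities(energies: List[float], cutoff: float) -> List[float]:
    """Selection probability of each conformer, proportional to its fitness."""
    fits = [fitness(e, cutoff) for e in energies]
    total = sum(fits)
    if total == 0:  # every conformer is at or above the cut-off: fall back to uniform
        return [1.0 / len(fits)] * len(fits)
    return [f / total for f in fits]

# Fitness values are 5, 3, 1 and 0, so the lowest-energy conformer is favoured.
probs = selection_probabilities([-4, -2, 0, 3], cutoff=1)
print(probs)
```

Under this assumed form, conformers at or above the cut-off energy receive zero fitness and are never selected as parents.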

Page 69: Evolutionary Algorithms for the Protein Folding Problem

Population Free Energy

• Since GAs deal with an ensemble of solutions, a quantity analogous to the statistical mechanical free energy is used. The population free energy, F, is calculated from its partition function, Z:

$$Z = \sum_i e^{-E_i}$$

• where the sum is over the total number of conformers in a population and $E_i$ is the energy of the i-th conformer. Hence,

$$F = -\ln(Z)$$
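A minimal sketch of this quantity, assuming the partition function is the sum of Boltzmann weights exp(-E_i) over the population, with energies in units where kT = 1 (an assumption consistent with F = -ln(Z)). The convergence test and its `window` and `tol` parameters are hypothetical, since the talk does not specify how equilibrium is detected.

```python
import math
from typing import List

def population_free_energy(energies: List[float]) -> float:
    """F = -ln(Z), with Z = sum_i exp(-E_i); kT = 1 is an assumption."""
    shift = min(energies)  # factor out the largest weight for numerical stability
    z_scaled = sum(math.exp(-(e - shift)) for e in energies)
    return shift - math.log(z_scaled)

def has_converged(history: List[float], window: int = 10, tol: float = 1e-3) -> bool:
    """Hypothetical stop test: F varied by less than tol over the last `window` generations."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol

print(population_free_energy([-4.0, -3.0, -3.0, 0.0]))
```

Subtracting the minimum energy before exponentiating leaves the result unchanged and avoids overflow when energies are strongly negative.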

Page 70: Evolutionary Algorithms for the Protein Folding Problem

Evolution of energy distributions

Fig. 1 Evolution of a population: plotting the energy distributions at various time steps. t is the percentage of the total run-time elapsed. The population converges within 50% of the run time.

The GA dynamics drive the population of conformations toward an equilibrium distribution, which is characterised by F, as shown on the next slide.

Page 71: Evolutionary Algorithms for the Protein Folding Problem

Mean, Minimum & Free Energy of an Evolving Population

The free energy, which characterises the energy distribution of a population, converges to an equilibrium point.

The mean energy fluctuates around the equilibrium, making it difficult to use as a stop criterion.

Page 72: Evolutionary Algorithms for the Protein Folding Problem

Compact HP conformation found by GA

Example of a compact HP conformation, with a hydrophobic core, found by the GA (number of nearest neighbours = 39). Dark=H=hydrophobic, Light=P=polar.

This method determines the global minima of HP-sequences by constructing conformations with a core of H (hydrophobic) residues that also minimize the surface area of the conformation.

Page 73: Evolutionary Algorithms for the Protein Folding Problem

Results

Studies were carried out using HP sequences taken from Unger & Moult, J. Molecular Biology, 231, 1993.

Sequence    Lowest energy    # Energy evaluations

643d.1      -27              433,533

643d.2      -30              167,017

643d.3      -38              172,192

643d.4      -34              107,143

643d.5      -36              154,168

643d.6      -31              454,727

643d.7      -25              320,396

643d.8      -34              315,036

643d.9      -33              151,705

643d.10     -26              191,019

643d.1: wwbbbbbwwwbbwwwwwbbwwwbwwwwwwbwbwwwbwwbwwbwwwwwbwwwwbbwbbwwbwwbw

Page 74: Evolutionary Algorithms for the Protein Folding Problem

Energy Landscapes are Funnels

Page 75: Evolutionary Algorithms for the Protein Folding Problem

Conformational search strategies

Page 76: Evolutionary Algorithms for the Protein Folding Problem

Conclusions

• The evolutionary algorithm (GA) approach to the protein structure prediction problem offers a promising potential method of solution.

• GAs are fast and efficient at searching the rugged conformational landscapes presented by protein molecules.

Work in Progress

• Work is under way to design an EA to optimize real protein structures (torsion angle space, lattice models).

Page 77: Evolutionary Algorithms for the Protein Folding Problem

References

• D. E. Clark (ed.), Evolutionary Algorithms in Molecular Design, Wiley-VCH, Weinheim, 2000.

• M. Kanehisa, Post-Genome Informatics, Oxford University Press, 2000. (An overview of the types of data and databases used in bioinformatics.)

Page 78: Evolutionary Algorithms for the Protein Folding Problem

Final Remark

“The impact of Bioinformatics research critically depends on an accurate understanding of the biological process under investigation. It is essential to ask the right questions, and often modeling takes priority over optimization. Therefore, we need people that understand and love both computer science and biology to bring the field forward.”

Thomas Lengauer Institute of Computer Science - University of Bonn