crb journal club february 13, 2006 jenny gu. selected for a reason residues selected by evolution...

CRB Journal Club

February 13, 2006

Jenny Gu

Selected for a Reason

• Residues selected by evolution for a reason, but conservation is not distinguished between function or stability.

• To disentangle between functional and structural constraints, predicted sequence profiles generated for structural stability is compared to naturally occurring sequence profiles.

• Incorporates two additional measures, free energy and sequence profile difference, in addition to residue conservation to identify functional residues.

Datasets

• Enzyme Active Site Set

(Suspicious. What about cross fold validation?)

Conservation Score Distribution

Sequence conservation score calculated by SCORECONS with multiple sequence alignment from MUSCLE

A) All Residue Sites

B) Enzyme Active Site Set

Calculating Difference between Profiles

Designed Sequence Profiles

Rosetta design program

Generate 40 protein sequences stable for structure. Align with PSI-blast to generate position specific scoring matrix (PSSM)

Natural Sequence Profiles

Euclidean distance rescaled between :

0 (high similarity) 1 (low similiarity)

PSSM matrix from PSI-blast

Difference between Natural and Designed Sequence Profiles

A) All residues in active sites.

B) Functional residues in active sites

Differences between profiles are rescaled such that:

0 - High Similarity

1 - Low similarity

In other words:

Selection for function vs. stability

0 - Low selection

1 - High selection

Calculating Native/Optimal Residue Energy Difference

1. Use Rosetta G module to calculate free energy changes for each 20 amino acid substitutions at each position.

2. Compare to native G. If functional constraints are imposed, there should be a big gap between G.

Rosetta G

Originally developed to identify binding interface hot spots.

Model based on all-atom rotamer description of side chains with energy function dominated by Lennard Jones interactions, solvation interactions, and hydrogen bonding.

Distribution of Free Energy Difference

Difference between free energy of naturally occurring residue and energetically most favorable residue. (kcal/mol)

A) For all residues in active sites.

B) Functional residues in active sites.

In other words:

Positions with smaller differences have been selected for stability.

Residue Classification

Combine:1. Sequence Conservation

2. Profile Difference (Natural vs. Designed)

3. Residue Free Energy Changes (Natural vs Optimal)

To classify functional vs. nonfunctional residues.

Logistic regression with linear model module used to determine weights for input features.

Classification Performance

Largest improvement observed with free energy measures.

Inclusion of profile difference with free measures resulted in minor improvements.

Combined measures reduces false positives.

Chymosin B

Sequence Conservation Only

Combined Measures

Arginine Kinase

Sequence Conservation Only

Combined Measures

Testing Generality

Dataset 2 includes ligand binding sites

Comparison to another predictor

Sources of Error

• Sensitivity to multiple alignment quality.• Loop regions are difficult to align.

• Functionally important residues can contribute to stability.

• Suggested Improvements:• Better multiple sequence alignments.• Spatial clustering of high scoring residues.• Introducing backbone flexibility into energy calculations.

Other Approaches - Extracting from Sequence Design

1) Design procedure based on Monte Carlo simulation of amino acid substitution process.

2) Fixed substitutions based on scoring function from template structure and multiple alignment of homologs.

Other Approaches - Using Protein Homology Information

1) Identify high degree of conservation between homologous proteins.

2) Use information theory to identify positions where environment-specific substitution tables make poor prediction of overall amino acid substitution pattern.

3) Identify residues with highly conserved positions when homologous family are superposed.

Interest in this Paper

• Distinguishing between functional and structural constraints.

• Designing sequences and subsequent profiles allows us to explore an enlarge sequence space that is not captured by natural sequence.

Questions:

From an evolutionary perspective:1) How does structure limit the exploration of sequence space.

2) How is sequence space expanded with structure change.

3) How do selective pressures for molten globules, flexible regions, and disordered structures impact the sequence space?

Current Domain Coverage of Genome

Current perspective:

Ab initio structure evolution is now difficult now that system of balance and checks is implemented.

Evolution of current protein repertoire largely attributed to recombination of existing folds.

Reaching beyond structural genomics? ….

• With known structures:• Use of Hidden Markov Model (HMM) or profile for domains to identify in

genome.• Evolutionary plasticity greater for loop regions than for core.• Work has been done in this area.

• With unknown structures:• Can we design a structure not currently in PDB and identify it in nature?

• With structures that nature “hasn’t seen before”.• De novo structure designed in 2003.• Maybe it already exists in nature, we just don’t know about it yet.• And if it doesn’t exist, is it just a proof of principle or can we actually do

something with it?