Combining Sequence and Structure Information Topic 17

Upload: jonah-webster

Post on 20-Jan-2016


Page 1: Combining Sequence and Structure Information Topic 17

Combining Sequence and Structure Information

Topic 17

Page 2: Combining Sequence and Structure Information Topic 17

Problem: Identify the most important region(s)

What is a functional site? This is actually a very difficult question to answer robustly.

Of course, “catalytic residues” are functional sites.

Generally, we assume that other sites directly interacting with the substrate, or with other proteins in a complex, are also functional.

However, what about sites far removed from the “active site region”? If a mutation at one of these sites is deleterious, is the site functional?

Page 3: Combining Sequence and Structure Information Topic 17

Problem: Identify the most important region(s)

Catalog of Important Sites (KC and Livesay)

Catalytic Sites: Sites that are identified as catalytic sites in the Catalytic Site Atlas (CSA).

Active Sites: Union of the CSA catalytic residues and all residues that contact the catalytic residues, as determined using HBPLUS.

Ligand-Binding Sites: Sites identified by characterizing all enzyme-ligand interactions using HBPLUS.

What about Allosteric Sites? Structural Sites? Etc?
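The "active site" definition above is a set union, which can be sketched as follows. This is only an illustration: the residue labels and the contact table are made up, and in practice the contacts would come from parsing HBPLUS output, which is not shown here.

```python
def active_site(catalytic, contacts):
    """Union of the catalytic residues and every residue contacting one of them."""
    site = set(catalytic)
    for residue in catalytic:
        # Residues with no recorded contacts simply contribute nothing.
        site |= contacts.get(residue, set())
    return site


# Hypothetical residue labels and contacts, for illustration only.
catalytic = {"HIS12", "ASP45"}
contacts = {
    "HIS12": {"ASN10", "GLU14"},
    "ASP45": {"HIS12", "GLY80"},
}
active = active_site(catalytic, contacts)
```

By construction the catalytic set is always a subset of the active-site set.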

Page 4: Combining Sequence and Structure Information Topic 17

Note that the two things are not the same

Page 5: Combining Sequence and Structure Information Topic 17

The devil is in the details

Page 6: Combining Sequence and Structure Information Topic 17

Methods

Typical approach: Combine sequence and structural information

Alignment Content: Sequence conservation and phylogeny-based. Typically also use structural information.

Machine-Learning Methods: Computational “black boxes,” but give good results.

Structure Features: Graph-theoretic methods, protein surface shape, protein surface physicochemical properties, etc.

Triosephosphate isomerase color-coded by conservation

Page 7: Combining Sequence and Structure Information Topic 17

Catalytic Site Atlas

Page 8: Combining Sequence and Structure Information Topic 17

Catalytic Propensity

Page 9: Combining Sequence and Structure Information Topic 17

Multiple Sequence Alignment

Page 10: Combining Sequence and Structure Information Topic 17

Multiple Sequence Alignment

The Sum of Pairs (SP) score of column m_i is the sum of s(m_k, m_l), the scoring-matrix substitution value for residues m_k and m_l, enumerated over all possible pairs within a single alignment column.
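A minimal sketch of the SP score for one column is given below. The function `toy_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, and gap handling is ignored.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix: +1 match, -1 mismatch."""
    return 1 if a == b else -1


def sum_of_pairs(column, score=toy_score):
    """Sum the substitution score over all unordered residue pairs in a column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


conserved = sum_of_pairs("CCCC")  # 6 pairs, all matches
mixed = sum_of_pairs("CCSS")      # 2 matching pairs, 4 mismatching pairs
```

With this toy scoring, the fully conserved column scores higher than the mixed one, which is the behavior a conservation measure is meant to capture.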

The Shannon entropy (S) score is calculated as S = -Σ p_i log p_i, where p_i is the probability of each residue i in that column. Similarly, the Williamson Property Entropy (WPE) sums over groups of chemically similar residues (k = 9), where the probability within the logarithm is normalized by the average column probability.
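The Shannon entropy of a single column can be sketched as below. Two assumptions are made that the slide does not state: the logarithm is taken base 2, and residue probabilities are estimated as raw frequencies in the column (gaps, if present, would count as just another symbol).

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residue frequencies in one column."""
    counts = Counter(column)
    n = len(column)
    # sum of p * log2(1/p) is the same quantity as -sum of p * log2(p)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


invariant = shannon_entropy("CCCC")  # one residue type
split = shannon_entropy("CCSS")      # two residue types, equal frequency
diverse = shannon_entropy("ACDE")    # four residue types, equal frequency
```

Low entropy indicates a conserved column; the WPE variant would apply the same idea to residue groups rather than individual residue types.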

Rate4Site (R4S) constructs a mathematical description of the underlying phylogeny in order to improve determination of the rate of evolution at each site. The rate of evolution at each site is then estimated using the maximum likelihood principle, which considers both phylogenetic tree branch lengths and the stochastic nature of evolution.

And many others.

Page 11: Combining Sequence and Structure Information Topic 17

Relative predictive power

Method      Catalytic   Active   Ligand-binding
R4S         0.83        0.75     0.74
SP-score    0.77        0.66     0.70
JSD         0.78        0.72     0.67
WPE         0.75        0.70     0.66

Page 12: Combining Sequence and Structure Information Topic 17

ConSurf is a web implementation of R4S

Page 13: Combining Sequence and Structure Information Topic 17

Throw everything at it, including the kitchen sink…

Page 14: Combining Sequence and Structure Information Topic 17

Gutteridge et al., JMB (2003) 330:719-734.

Relative importance of input variables

Page 15: Combining Sequence and Structure Information Topic 17

Gutteridge et al., JMB (2003) 330:719-734.

Three different neural networks (NNs)

Using structural clustering to filter out false positives (FPs)

Unfortunately, the method tends to over-predict catalytic residues

Structural clustering improves results

Page 16: Combining Sequence and Structure Information Topic 17

Going beyond conservation

HKAMMKLQWBBMVRERCUGDYAD
HRAFGSGFFBYTUJGGCADFYDD
EFZHRDADFD-EGHDGCVRRSER
ADZDFDAADFDEHGRRCADDSDD
DFZBBDMJJJ-EDAFDCRRVSHT
ADHADFDEBGJEVEEECADDSDD
NTHLJDJDDGUEKJFJCLDLSEI
OOHMCVDUEGTEDDEDC--DSEI
JDILKJADFFIFEVEECLDKSVV
JBIOUDFFVFCFLKEICKDKSEE

Of course, well conserved positions make very good functional site predictions. But what defines differences between sub-families in the overall phylogeny?

Page 17: Combining Sequence and Structure Information Topic 17

Evolutionary trace (aka tree-determinant) residues

..A........B....C...Y..
..A........B....C...Y..
..Z..D.....E....C...S..
..Z..D.....E....C...S..
..Z..D.....E....C...S..
..H......G.E....C...S..
..H......G.E....C...S..
..H......G.E....C...S..
J.I......F.F....C...S..
J.I......F.F....C...S..

Analyze to detect those residues with a tendency to be conserved within a subfamily of proteins, but which differ between subfamilies (tree-dependent positions), and regard them as a result of the evolutionary scenario in which conservation and specificity are present in a delicate balance.
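The idea of positions that are conserved within subfamilies but differ between them can be sketched as follows. This is a simplified illustration, not the published ET procedure: it uses a single, fixed partition into subfamilies (real ET cuts the phylogenetic tree at successive levels), and the sequences and subfamily names are made up.

```python
def classify_columns(subfamilies):
    """Label each alignment column 'conserved', 'trace', or 'variable'.

    subfamilies maps a subfamily name to its list of aligned sequences.
    """
    all_seqs = [s for seqs in subfamilies.values() for s in seqs]
    length = len(all_seqs[0])
    labels = []
    for i in range(length):
        # Residues observed at column i, collected separately per subfamily.
        per_family = [{s[i] for s in seqs} for seqs in subfamilies.values()]
        if all(len(residues) == 1 for residues in per_family):
            distinct = set().union(*per_family)
            # Invariant everywhere: conserved. Invariant only within
            # each subfamily: trace (tree-determinant) position.
            labels.append("conserved" if len(distinct) == 1 else "trace")
        else:
            labels.append("variable")
    return labels


# Hypothetical two-subfamily alignment, for illustration only.
subfamilies = {
    "I": ["KAQCY", "RAWCY"],
    "II": ["EZHCS", "DZDCS", "FZBCS"],
}
labels = classify_columns(subfamilies)
```

Here columns 2 and 5 come out as trace positions, column 4 as globally conserved, and the rest as variable.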

Page 18: Combining Sequence and Structure Information Topic 17

Evolutionary trace (aka tree-determinant) residues

• The goal is to identify and understand the role of the essential sites that determine the structure and proper functioning of the molecule.

• A thorough evaluation of the importance of all sequence sites involves extremely time-consuming and laborious biochemical experimental methods.

• All methods presented here rely on some sort of co-evolutionary theme. Put another way, nature has allowed some plasticity within (some) functional positions, provided the appropriate conditions are met elsewhere.

• Starting from the groundbreaking Lichtarge et al. paper in 1996, several approaches have been presented that use this intra-family co-evolution principle to predict functional sites. The methods, called evolutionary trace, tree-determinant residues, phylogenetic motifs, ConSurf, and strong motifs, are conceptually similar and provide somewhat consistent results.

Page 19: Combining Sequence and Structure Information Topic 17

Evolutionary trace (aka tree-determinant) residues

The ET process

Page 20: Combining Sequence and Structure Information Topic 17

Structural clusters of ET overlap ligand binding sites

Active site

Trace residues

97% of the time (37 of 38 examples), the largest cluster of trace residues contacts the ligand (Madabushi et al., Journal of Molecular Biology, 2002).

Page 21: Combining Sequence and Structure Information Topic 17

Livesay et al. (2003). Biochemistry 42:3464-73.

What leads to conservation of CuZnSOD surface electrostatics?

Page 22: Combining Sequence and Structure Information Topic 17

Livesay et al. (2003). Biochemistry 42:3464-73.

Sequence and structure variability

Page 23: Combining Sequence and Structure Information Topic 17

Stephen Jay Gould said, “The proof of evolution lies in those adaptations that arise from improbable foundations.”

Livesay et al. (2003). Biochemistry 42:3464-73.

An improbable result

Page 24: Combining Sequence and Structure Information Topic 17

Triosephosphate isomerase

window width = 5

PSZ threshold = -1.5

TIM Prosite definition

La, Sutch, Livesay (2005). Proteins 58:309-320.

Phylogenetic motifs

Notice structural clustering despite little overall sequence proximity

Page 25: Combining Sequence and Structure Information Topic 17

La, Sutch, Livesay (2005). Proteins 58:309-320.

Page 26: Combining Sequence and Structure Information Topic 17

Copper, zinc-superoxide dismutase

TATA-box binding protein

Inorganic pyrophosphatase

Cytochrome P450

Myoglobin

Page 27: Combining Sequence and Structure Information Topic 17

Glutamate dehydrogenase

Enolase

Alcohol dehydrogenase

Glyceraldehyde-3-phosphate dehydrogenase

Page 28: Combining Sequence and Structure Information Topic 17

Trace residues that correspond to PMs are colored red. Trace residues that do not correspond to PMs are colored blue.

PMs identify sequence clusters of ET residues

Page 29: Combining Sequence and Structure Information Topic 17

La, Sutch, Livesay (2005). Proteins 58:309-320.

PMs also correspond to traditional motif definitions

Page 30: Combining Sequence and Structure Information Topic 17

That is, PMs represent a subset of motif space

La, Sutch, Livesay (2005). Proteins 58:309-320.

Page 31: Combining Sequence and Structure Information Topic 17

APSRKFFVGGNWKMNGRKQSLGELIGTLNAAKV

PADTEVVCAPPTAYIDFARQKLDPKIAVAAQNC

YKVTGAFTGEISPGMIKDCGATWVVLGHSERRH

VFGESDELIGQKVAHALAEGLGVIACIGEKLDE

REAGITEVFEQTKVIADNVKDWSKVVLAYEPVW

AIGTGKTATPQQAQEVHEKLRGWLKSNVSDAVA

QSTRIIYGGVTGATCKELASQPDVDGFLVGGAS

LKPEFVDIINAKQ

Page 32: Combining Sequence and Structure Information Topic 17

Livesay, La (2005), Protein Science 14:1158-1170.

Page 33: Combining Sequence and Structure Information Topic 17

Figure caption: Ligand-binding positions of tyrosine aminotransferase of Trypanosoma cruzi. One chain of the crystal structure of tyrosine aminotransferase from Trypanosoma cruzi (PDB code 1BW0). Results of a conservation-based measure (Williamson, in blue) are shown compared to the phylogeny-based SMERFS (in red). Positions predicted by both techniques are shown in green, the PLP cofactor in orange. Protein regions in stick representation and labelled are those important for cofactor binding, as described in the text. Manning et al. BMC Bioinformatics 2008 9:51

The SMERFS algorithm is intermediate in philosophy to those of TreeDet [21] and MINER [18] and compares local to global similarity matrices over windows on an alignment.

The work presented here has shown that SMERFS produces sets of putative functional positions in multiple sequence alignments fundamentally different from those of conservation measures. For this reason conservation measures and phylogeny-aware methods such as SMERFS should be considered as complementary tools. The data suggest that if alignment positions involved in the core function of a protein, for example catalysis, are the target of a study, relatively simple conservation measures remain the most useful tool. If less critical positions, perhaps responsible for defining sequence subfamily specificity, are the target, then methods such as SMERFS may be of use. Finally, SMERFS has been shown to predict many more surface positions than conservation, reducing the possibility of confusing signals from positions of core structural rather than functional significance.

Page 34: Combining Sequence and Structure Information Topic 17

Np: total number of vertices

Lij: shortest path between i and j

Vertex: Cα

Edge: if distance is within 8.5 Å
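A residue contact graph following these definitions can be sketched as below. The coordinates are made up, reading Cα positions from a PDB file is left out, and treating "within 8.5 Å" as an inclusive cutoff is an assumption.

```python
import math


def contact_graph(ca_coords, cutoff=8.5):
    """Adjacency sets: vertices i and j share an edge when their
    C-alpha atoms lie within `cutoff` angstroms of each other."""
    n = len(ca_coords)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Made-up coordinates (angstroms): four points spaced 5 A apart on a line.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
graph = contact_graph(coords)
```

The all-pairs loop is quadratic in the number of residues, which is usually acceptable for single proteins; a spatial grid or k-d tree would be the natural replacement for larger systems.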

Page 35: Combining Sequence and Structure Information Topic 17

Degree (aka connectivity or valency): Simply the integer count of the number of edges a vertex shares.

Closeness: The closeness centrality, CC, for a vertex v is the reciprocal of the sum of geodesic distances to all other vertices in the graph.

Geodesic distance (aka shortest path): The number of edges in the shortest path connecting two vertices.

[Figure: example graph with vertices labeled 1 to 6]

Note: this assumes constant edge weights

Centrality metrics
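The three definitions on this slide can be sketched as follows, using breadth-first search for geodesic distances since all edges have the same weight. The sketch assumes a connected graph stored as adjacency sets; the example graph is hypothetical. Some papers multiply closeness by the number of other vertices, which rescales the values but leaves the ranking unchanged.

```python
from collections import deque


def geodesic_distances(graph, source):
    """Breadth-first search: edge count of the shortest path from
    `source` to every reachable vertex (unit edge weights)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree(graph, v):
    """Number of edges that vertex v shares."""
    return len(graph[v])


def closeness(graph, v):
    """Reciprocal of the summed geodesic distance from v to all other vertices."""
    dist = geodesic_distances(graph, v)
    total = sum(d for u, d in dist.items() if u != v)
    return 1.0 / total if total else 0.0


# Hypothetical path graph 0-1-2-3.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
end_cc = closeness(graph, 0)     # distances 1 + 2 + 3
middle_cc = closeness(graph, 1)  # distances 1 + 1 + 2
```

The interior vertex has the higher closeness, matching the intuition that central vertices are a short walk from everything else.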

Page 36: Combining Sequence and Structure Information Topic 17

• The networks are usually highly clustered with few links connecting any two random vertices.

• A key feature of many complex systems (including protein networks) is robustness, meaning that the system can continue to function despite perturbations.

• On the other hand, robustness is coupled with fragility toward non-trivial rearrangements of the connections between the system’s key internal parts.

• Proteins are no exception: they have evolved toward a robust design; however, they are vulnerable to mutation at certain residues, meaning that some special importance could be placed on central residues.

• Recently, various centrality scores have been used to predict folding nuclei and catalytic sites.

Protein networks

Page 37: Combining Sequence and Structure Information Topic 17

del Sol et al., Mol Sys Biol (2006).

Protein networks conserve “hubs”

Page 38: Combining Sequence and Structure Information Topic 17

However, there is a clear distinction between buried noncatalytic and catalytic sites.

Catalytic residues

One third most buried residues

Middle third

One third least buried residues

Close residues are typically buried…

Chea and Livesay (2007), BMC Bioinformatics, 8:153.

Page 39: Combining Sequence and Structure Information Topic 17

Closeness centrality...

Non-catalytic residues

Catalytic residues

ROC curve for CC predictions

Chea and Livesay (2007), BMC Bioinformatics, 8:153.

Catalytic site prediction power
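One common way to summarize a ROC curve like the one on this slide is the area under the curve (AUC), which equals the probability that a randomly chosen catalytic residue scores higher than a randomly chosen non-catalytic one. The sketch below computes it by direct pair counting; the scores and labels are made up and are not data from Chea and Livesay.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by pair counting: the fraction of
    (positive, negative) pairs ranked correctly, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up closeness scores; label 1 marks a catalytic residue.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.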

Page 40: Combining Sequence and Structure Information Topic 17

Computed p-values on the null hypothesis that CC does not predict catalytic sites better than random.

Simple steps to improve accuracy

Raw predictions

Accessibility filter

Residue identity filter