[ieee 2013 ieee international conference on bioinformatics and biomedicine (bibm) - shanghai, china...
Confirming Biological Significance of Co-occurrence Clusters of Aligned Pattern Clusters
En-Shiun Annie Lee
Systems Design Engineering
University of Waterloo
Waterloo, Canada
Sanderz Fung
Systems Design Engineering
University of Waterloo
Waterloo, Canada
Ho-Yin Sze-To
Computer Science and Engineering
Chinese University of Hong Kong
Shatin, Hong Kong
Andrew K. C. Wong
Systems Design Engineering
University of Waterloo
Waterloo, Canada
Abstract—Advances in bioinformatics have provided researchers with a large influx of novel sequences, making the analysis of these sequences for inherent biological knowledge crucial. By applying pattern discovery and pattern synthesis to protein family sequences, conserved protein segments can be represented by Aligned Pattern Clusters (APCs), which are richer in statistical association knowledge than probabilistic models. This representation enables us to exploit the co-occurrence of APCs on the same protein sequence to identify functional regions. In this paper, we develop an efficient algorithm that identifies frequently co-occurring patterns using only homologous protein sequences as input. We applied our algorithm to triosephosphate isomerase and ubiquitin for a detailed study. By comparing against the corresponding 3D structures, we found that the discovered co-occurring patterns are close in spatial distance in most cases. We also found that the co-occurrence of patterns is biologically significant: residues that play important and co-operative roles in the glycolytic pathway of triosephosphate isomerase, and residues that are responsible for ubiquitination and ubiquitin binding in ubiquitin, are all covered by our co-occurring APCs. These results demonstrate the power of our algorithm to reveal concurrent distant functional and structural relations in protein sequences based on co-occurrence clusters of APCs.
Index Terms—Sequence, Clustering, Ubiquitin, Triosephosphate isomerase, Co-occurrence, Pattern, K-means clustering
I. INTRODUCTION
Identifying the functional regions of proteins is of fundamental
importance. Such knowledge not only enables us to better
understand the underlying biological mechanisms but can also
help in the design of new drugs. While traditional
experiments like alanine scanning mutagenesis and X-ray
crystallography are both laborious and time-consuming, there
are computational methods available to identify the functional
regions by looking for conserved segments among homologous
proteins with similar biological function. The underlying belief
is that amino acids in functional regions are under evolutionary
pressure to maintain their functional integrity and thus undergo
fewer mutations than less functionally important ones [1].
For de novo discovery, Multiple Sequence Alignment
(MSA) [2] is one approach to identify functional groups by
aligning a set of protein sequences to a globally optimum
consensus to come up with conserved regions. However, MSA
is only suitable for globally homologous sequences with a high
level of similarity [2]. Unlike MSA, Motif discovery (multiple
local alignment) [3], [4] attempts to locate and align locally
similar subsequences and builds up a probabilistic model,
which assumes independence between residues, to describe the
conserved region (as a motif). However, such an assumption is
unrealistic in many cases, where correlation of residues along
the sequence is commonly observed [5], [6]. Moreover, no
specific methods are available to indicate which residues in
the consensus are not statistically or functionally significant in
such models. The Aligned Pattern Cluster (APC) was hence
introduced in our previous work [7] to provide a knowledge-rich
representation of functional regions by capturing the
statistically significant associations of residues along the
sequences and the distribution of their occurrences in each
aligned segment.
With this novel representation, we are now able to study
and exploit the pattern co-occurrence to identify binding sites
within a protein, between two interacting proteins [8], [9],
and between protein and DNA [10], [11]. Here, we define
co-occurring patterns as patterns occurring on the same protein
sequence. Related works [12], [13], [14] suggest that
co-occurring (correlated) residues can provide insights into
protein structures. Their hypothesis is that if two residues of
a protein form a contact, an amino acid substitution at one
position is expected to be compensated by a substitution at
the other position. However, the major drawback of
these approaches is that a large number (e.g., on the order of
1,000) of homologous and non-redundant protein sequences
is required to learn the underlying statistical model [12], [13].
Also, studies on protein families using Evolutionary
Tracing (ET) [1] indicate that the presence or absence of
certain clusters of residues on a protein sequence is a main
cause of divergence between globally-specific functions and
family-specific functions [15]. Mutagenesis data are required
for these studies, and their results suggest that the presence
or absence of the co-occurring patterns is likely to be linked
to functional divergence [15].
In this study, we aim to answer the following two questions.
First, how can we efficiently find the frequently co-occurring
patterns, given only multiple homologous protein sequences
as input? Second, what are the biological reasons for their high
co-occurrence, and how can we relate the pattern co-occurrence
findings to the biological causes? Our hypothesis is that the
co-occurring patterns might have formed chemical bonds, or
they need to co-operate on certain biological functions. We
started our study by collecting homologous protein sequences
from protein databases. We developed an efficient algorithm
based on our previous work [16], [7] to identify the frequently
co-occurring patterns using only sequence data as input. We
verified our results by computing the spatial distance between
the co-occurring patterns using the corresponding 3D structures.
We also surveyed the literature to find additional biological
evidence to support the notion of co-occurrence.
In our experiments, we applied our algorithm to triosephosphate
isomerase and ubiquitin for a detailed study. By examining
protein structures, we found that the discovered co-occurring
patterns are close in three-dimensional distance in
most cases and that the co-occurring patterns are biologically
significant. Residues that play important and co-operative roles
in the glycolytic pathway of triosephosphate isomerase and
residues that are responsible for ubiquitination and ubiquitin
binding in ubiquitin are all covered by our co-occurring APCs.
The contribution of this study is threefold. First, we established
a framework to study functional regions of proteins by
exploiting the co-occurrences of patterns to reveal concurrent
distant functions and structural relations. To our knowledge,
this is the first study to identify co-occurrence of patterns
rather than residues using only homologous protein sequences
as input. Second, we developed an algorithm that is statistically
robust, efficient, and visualizable (in domain location,
structural and functional relation, amino acid conservation and
variations) in an integrated process. Compared to existing
algorithms studying correlations (in residues), ours is novel
as it does not require a large number of homologous protein
sequences to identify co-occurrences (of patterns) through
training. Third, the discovered pattern co-occurrences that are
novel to the biological community will provide new insights
for studies of biological functions.
II. METHOD
Our methodology combines three algorithms: the first two
come from our previously published work, and the third is
the main focus of this paper. First, we use a pattern discovery
algorithm [16] to discover and locate significant sequence
patterns from a protein family while pruning the redundant
patterns. Next, we apply an APC Algorithm [7] to obtain a
list of condensed APCs with variations. Finally, we cluster
the discovered APCs into APC clusters using a clustering
algorithm and co-occurrence scores (Fig. 1).
A. Input Data
Let $\Sigma$ be the protein alphabet containing the 20 standard
amino acids $\{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|-1}, \sigma_{|\Sigma|}\}$. A protein sequence
$s = s_1 s_2 \ldots s_{|s|-1} s_{|s|}$ is built from amino acids from the
alphabet $\Sigma$, where each $s_i \in \Sigma$ and $s$ is of length $|s|$. The
protein dataset used for each of our case studies consists of a
set of protein sequences from the same protein family.
B. Pattern Discovery
Sequence patterns that have statistically significant amino
acid associations are first discovered [16]. A pattern is defined as
an interdependent ordered sequence of symbols $p = s_1 s_2 \ldots s_n$
from the alphabet $\Sigma$. The pattern $p$ has length $n$, and the $i$th
symbol that appears in the sequence is $s_i$. The list of patterns
resulting from the pattern discovery algorithm is $\mathcal{P} = \{p^i \mid i = 1, \ldots, |\mathcal{P}|\} = \{p^1, p^2, \ldots, p^{|\mathcal{P}|-1}, p^{|\mathcal{P}|}\}$. This resulting list of
patterns is pruned of redundant patterns.
Fig. 1. The overall process of our methodology is a combination of three algorithms: 1) the pattern discovery algorithm, 2) the APC algorithm, and 3) the APC clustering algorithm.
C. Aligned Pattern Clustering
An APC describes a set of aligned similar sequence patterns
(as defined in [7]). They are sets of patterns where gaps and
wildcards are added to maximize the similarity between the
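patterns. Let a set of APCs be defined as [7]
\[ \mathcal{C} = \{C^l \mid l = 1, \ldots, |\mathcal{C}|\} = \{C^1, C^2, \ldots, C^{|\mathcal{C}|-1}, C^{|\mathcal{C}|}\} \]
and let an APC be defined as
\begin{align}
C^l &= \mathrm{ALIGN}(\mathcal{P}^l) \tag{1} \\
    &= \begin{pmatrix}
         s^1_1 & s^1_2 & \cdots & s^1_n \\
         s^2_1 & s^2_2 & \cdots & s^2_n \\
         \vdots & \vdots & \ddots & \vdots \\
         s^m_1 & s^m_2 & \cdots & s^m_n
       \end{pmatrix}_{m \times n}
     = \begin{pmatrix} p^1 \\ p^2 \\ \vdots \\ p^m \end{pmatrix} \tag{2} \\
    &= \begin{pmatrix} p^1 & p^2 & \cdots & p^m \end{pmatrix}^{\top} \tag{3}
\end{align}
where $s^i_j \in \Sigma \cup \{-\} \cup \{*\}$ is a symbol of pattern $p^i$ with a new column
index $j$. Each of the $|\mathcal{P}^l| = m$ patterns in the rows of $C^l$ is
of length $|C^l| = n$.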
D. APC Clustering
Co-existence of patterns in different locations of the same
protein may indicate joint functionality that is important for the
protein family. In APC clustering, we first apply the k-means
clustering algorithm to cluster APCs, using a co-occurrence
score between APCs as the similarity measure. We then use
four different clustering indicators to arrive at an optimal
cluster configuration. Finally, we confirm the results against
the three-dimensional structure at the locations of the APC
clusters.
1) Co-occurrence Score: First, we compare all possible
APC pairs, using a co-occurrence score as the similarity
measure between them. The co-occurrence score quantifies
how often two APCs appear together on the same sequence.
The Jaccard index is adopted [17]:
\[ J = \frac{|C^1_{\mathrm{seq}} \cap C^2_{\mathrm{seq}}|}{|C^1_{\mathrm{seq}} \cup C^2_{\mathrm{seq}}|} \]
where
$C^1_{\mathrm{seq}}$ = sequences that contain patterns from APC $C^1$
$C^2_{\mathrm{seq}}$ = sequences that contain patterns from APC $C^2$
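As an illustration of the co-occurrence score, the following minimal sketch computes the Jaccard index for every pair of APCs. It assumes each APC is summarized by the set of IDs of the sequences that contain its patterns; the names `jaccard`, `cooccurrence_scores` and the toy data are hypothetical and are not taken from our implementation.

```python
from itertools import combinations


def jaccard(seqs_a: set, seqs_b: set) -> float:
    """Jaccard index of two sets of sequence IDs (0.0 if both are empty)."""
    union = seqs_a | seqs_b
    if not union:
        return 0.0
    return len(seqs_a & seqs_b) / len(union)


def cooccurrence_scores(apc_to_seqs: dict) -> dict:
    """Score every unordered APC pair; keys are (name_a, name_b) tuples."""
    return {
        (a, b): jaccard(apc_to_seqs[a], apc_to_seqs[b])
        for a, b in combinations(sorted(apc_to_seqs), 2)
    }


# Hypothetical toy data: APC name -> IDs of sequences containing its patterns.
toy = {
    "C1": {"s1", "s2", "s3", "s4"},
    "C2": {"s2", "s3", "s4", "s5"},
    "C3": {"s6"},
}
scores = cooccurrence_scores(toy)
```

Here C1 and C2 share three of five distinct sequences, while C3 shares none with either, so the pair (C1, C3) is not connected in the sense used by the clustering step.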
2) K-means Clustering: Next, the APCs are grouped into sets of
closely related APCs, called APC clusters, using co-occurrence
scores as the similarity measure between APCs. The k-means
clustering algorithm [17] is modified in two ways. First, APCs
themselves are used to represent the centroids, since calculating
a centroid from only the co-occurrence scores between APCs is
difficult. The centroids are initialized as the first APC of each
connected component, followed by the APCs with the lowest sum
of co-occurrence scores. During the clustering
process, each centroid is updated by finding the APC that
maximizes the co-occurrence score with all other APCs
in the same cluster. Second, the algorithm is modified to
prevent an APC from being assigned to a centroid to which it
is not connected [18]. For example, APCs are not
connected if they do not co-occur on any sequence.
Algorithm 1 Modified k-means clustering
Input: A set of APCs C, the co-occurrence scores J between all pairs of APCs, and the final number of clusters k
Output: APC clusters K1...Kk
Initialize centroids M1...Mk, where each Mi represents the center of APC cluster Ki
Find the number of components
Select the first APC from each component as a centroid
for i = |components| + 1 to k do
    Identify the APC that forms the lowest co-occurrence score with known centroids
    Assign this APC as a new centroid
end for
repeat
    for all APCs C in the set of APCs do
        Assign C to the closest centroid Mj such that C and Mj are from the same component
    end for
    for all clusters Ki ∈ {K1...Kk} do
        Update centroid Mi by selecting the APC that maximizes co-occurrence with all APCs in Ki
    end for
until convergence
return {K1...Kk}
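The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not our exact implementation: two APCs are treated as connected when their score is greater than zero, an APC scores 1.0 with itself, "lowest co-occurrence score with known centroids" is taken as the lowest sum of scores, and k must lie between the number of components and the number of APCs. The function names and toy scores are hypothetical.

```python
def components(apcs, score):
    """Map each APC to the first APC of its connected component (edge: score > 0)."""
    comp = {}
    for start in apcs:
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for v in apcs:
                if v not in comp and score(u, v) > 0:
                    comp[v] = start
                    stack.append(v)
    return comp


def modified_kmeans(apcs, pair_scores, k, max_iter=100):
    def score(a, b):
        if a == b:
            return 1.0  # assumption: an APC co-occurs perfectly with itself
        return pair_scores.get((a, b), pair_scores.get((b, a), 0.0))

    comp = components(apcs, score)
    # The first APC of each component becomes an initial centroid.
    centroids = list(dict.fromkeys(comp[a] for a in apcs))
    if not len(centroids) <= k <= len(apcs):
        raise ValueError("k must be between #components and #APCs")
    # Remaining centroids: APC with the lowest summed score to known centroids.
    while len(centroids) < k:
        rest = [a for a in apcs if a not in centroids]
        centroids.append(min(rest, key=lambda a: sum(score(a, m) for m in centroids)))

    for _ in range(max_iter):
        clusters = {m: [] for m in centroids}
        for a in apcs:
            if a in clusters:  # a centroid always stays in its own cluster
                clusters[a].append(a)
                continue
            # Only centroids in the same component are eligible.
            same = [m for m in centroids if comp[m] == comp[a]]
            clusters[max(same, key=lambda m: score(a, m))].append(a)
        # New centroid: member with the highest summed score to the other members.
        new_centroids = [
            max(members, key=lambda c: sum(score(c, o) for o in members if o != c))
            for members in clusters.values()
        ]
        if set(new_centroids) == set(centroids):
            break
        centroids = new_centroids
    return list(clusters.values())


# Hypothetical toy scores: components {C1, C2, C3} and {C4, C5}.
apcs = ["C1", "C2", "C3", "C4", "C5"]
pair_scores = {("C1", "C2"): 0.8, ("C2", "C3"): 0.2, ("C1", "C3"): 0.1, ("C4", "C5"): 0.5}
clusters = modified_kmeans(apcs, pair_scores, k=3)
```

Because centroids are always actual APCs, the procedure behaves like a k-medoids variant; restricting assignment to centroids of the same component is what enforces the connectivity constraint.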
3) Clustering Indicators: Finally, to ensure that clustering
provides the best possible results, four clustering indicators
were used to determine the optimal cluster count to be adopted
for the APC clustering process. All four cluster indicators
follow the principle of maximizing the average co-occurrence
score within a cluster while minimizing that between clusters.
Furthermore, additive smoothing was applied to several indicators
to prevent division by zero, which would otherwise make their
values infinite.
The variables and the indicators are defined as follows:
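$k$ = number of clusters
$s(K_i)$ = average co-occurrence score in cluster $i$
$s(K_i, K_j)$ = average co-occurrence score between clusters $i$ and $j$
Average Score:
\[ \frac{\sum_{i=1}^{k} s(K_i)}{k} \]
Intra / Inter:
\[ \frac{k + \sum_{i=1}^{k} s(K_i)}{k + \sum_{x=1}^{k} \sum_{y=x+1}^{k} s(K_x, K_y)} \]
Dunn index [19]:
\[ \frac{2 - \max_{1 \le x, y \le k:\, x \ne y} s(K_x, K_y)}{2 - \min_{1 \le i \le k} s(K_i)} \]
Max Intra / Related Inter:
\[ \frac{s(K_x)}{\sum_{y=1}^{k} s(K_x, K_y)}, \quad \text{where } x = \arg\max_{i} s(K_i) \]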
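The four indicators can be computed from a cluster configuration as in the following sketch. Several choices here are assumptions made only for illustration: a singleton cluster is given an average intra-cluster score of 0, the sum in the last indicator skips the term y = x, that indicator is reported as infinite when its denominator is zero, and at least two clusters are required. The helper names and toy scores are hypothetical.

```python
from itertools import combinations, product


def make_score(pair_scores):
    """Wrap a dict of pairwise scores as a symmetric lookup (missing pairs score 0)."""
    def score(a, b):
        return pair_scores.get((a, b), pair_scores.get((b, a), 0.0))
    return score


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0  # assumption: empty -> 0


def indicators(clusters, score):
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are needed for the inter-cluster terms")
    # s(K_i): average pairwise score inside each cluster.
    intra = [mean(score(a, b) for a, b in combinations(c, 2)) for c in clusters]
    # s(K_x, K_y): average score over all cross-cluster pairs, for x < y.
    inter = {
        (x, y): mean(score(a, b) for a, b in product(clusters[x], clusters[y]))
        for x, y in combinations(range(k), 2)
    }
    best = max(range(k), key=lambda i: intra[i])
    related = sum(v for (x, y), v in inter.items() if best in (x, y))
    return {
        "average_score": sum(intra) / k,
        "intra_inter": (k + sum(intra)) / (k + sum(inter.values())),
        "dunn": (2 - max(inter.values())) / (2 - min(intra)),
        "max_intra_related_inter": intra[best] / related if related else float("inf"),
    }


# Hypothetical toy configuration with two clusters.
score = make_score({("C1", "C2"): 0.8, ("C3", "C4"): 0.6, ("C1", "C3"): 0.2})
result = indicators([["C1", "C2"], ["C3", "C4"]], score)
```

All four values grow when clusters are internally cohesive and mutually dissimilar, which is why each indicator is maximized when choosing the cluster count.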
In the Dunn index, the complement of the co-occurrence score
(1 − s(Ki)) was taken as the distance between two clusters,
as required by the index definition [19]. To find the
optimal cluster count, the cluster count that maximizes each
of the four indicators was computed. Finally, the statistical
mode of these four cluster counts was taken as the final
optimal cluster count. If there
is a tie, the larger cluster count is chosen.
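The selection rule can be illustrated with a short sketch. It assumes each indicator has already been evaluated over a range of candidate cluster counts; the function name and the indicator values below are hypothetical, and ties within a single indicator (not specified in the text) are resolved by taking the first maximum.

```python
from collections import Counter


def optimal_cluster_count(indicator_curves):
    """indicator_curves maps indicator name -> {cluster count k: indicator value}."""
    # Each indicator votes for the cluster count at which it is largest.
    votes = [max(curve, key=curve.get) for curve in indicator_curves.values()]
    counts = Counter(votes)
    top = max(counts.values())
    # Statistical mode of the votes; on a tie, the larger cluster count wins.
    return max(k for k, n in counts.items() if n == top)


# Hypothetical values: two indicators peak at k = 2 and two at k = 4.
curves = {
    "average_score": {2: 0.9, 3: 0.7, 4: 0.8},
    "intra_inter": {2: 1.0, 3: 0.6, 4: 0.9},
    "dunn": {2: 0.5, 3: 0.8, 4: 1.0},
    "max_intra_related_inter": {2: 0.4, 3: 0.7, 4: 1.0},
}
best_k = optimal_cluster_count(curves)
```

With a two-against-two split, the tie-break selects the larger count, which mirrors the situation described for triosephosphate isomerase in Section III.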
4) Verification by Three-Dimensional Structure: After applying
co-occurrence clustering, we manually select the cluster
with the highest average co-occurrence score (s(Ki)) as the
highly connected APC cluster. We relate the result
to its corresponding three-dimensional protein structure from the
Protein Data Bank (PDB) using Chimera [20], highlighting the
regions where the APCs, or parts of the APCs, appear. We
calculate two distances for comparison: the distance between
the APCs and the average pairwise distance. The former is
calculated by finding the centroid (defined as the arithmetic
mean of the amino acid locations from the APC) of each APC
and then calculating the distance between the two centroids. The
average pairwise distance is the average
distance over all possible amino acid pairs in the structure.
By comparing the two types of distances, we determine whether
highly co-occurring APCs are also close in three-dimensional
distance or involved in protein function.
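The two distances can be sketched as follows, assuming that each amino acid is represented by a single three-dimensional coordinate (for example one representative atom per residue, which is an assumption since the choice of atom is not stated above). The function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) residue locations."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def apc_distance(coords_a, coords_b):
    """Distance between the centroids of two APCs."""
    return math.dist(centroid(coords_a), centroid(coords_b))


def average_pairwise_distance(coords):
    """Mean distance over all unordered residue pairs in the structure."""
    pairs = list(combinations(coords, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical coordinates (in Å) for the residues of two APCs and a tiny structure.
apc_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
apc_b = [(1.0, 3.0, 4.0)]
structure = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]

d_apcs = apc_distance(apc_a, apc_b)
d_avg = average_pairwise_distance(structure)
```

An APC pair is regarded as spatially close when its centroid distance falls below the average pairwise distance of the whole structure.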
III. RESULTS AND ANALYSIS
To test our method’s ability to find a set of highly
co-occurring APCs, we used protein sequences of two families
obtained from Pfam: triosephosphate isomerase and ubiquitin.
After finding the APCs, we verified their functional significance
by finding structures in the Protein Data Bank (PDB) [21]
at the corresponding locations of the APCs and observing
structural/functional characteristics shared between the found
APCs, especially the spatial distance between them.
Fig. 2. Normalized scores of the four indicators used for k-means clustering on triosephosphate isomerase.
Fig. 3. C1, C2, C3 and C4 aligned by parts of the traditional Pfam representation of the triosephosphate isomerase family. C1 and C2 represent one APC cluster, and C3 and C4 represent another APC cluster. The two APC clusters would be joined if the APCs were clustered into three APC clusters.
A. Triosephosphate isomerase
First, by applying our method to triosephosphate isomerase
we show that closely related APC clusters could be found.
Six APCs are obtained from pattern discovery and pattern
alignment on triosephosphate isomerase protein sequences.
The k-means clustering algorithm was used to cluster the
APCs, and indicators (Figure 2) were used to obtain the
optimal number of clusters. According to results in Figure 2,
two indicators agree that two is the optimal cluster number
while the other two agree on four. With a tie, the larger
cluster count is chosen. Hence, k-means clustering with four
clusters was applied, yielding one cluster of closely co-occurring
APCs containing C1 and C2, with an average Jaccard index of 0.85,
and another consisting of C3 and C4, with an average
Jaccard index of 0.72. Furthermore, when a cluster
count of three is provided to k-means, the two APC
clusters are joined together, showing the relationship among
C1, C2 and C3, C4. Figure 3 shows the co-occurrence between
the two clusters and their location on the primary sequence.
A three-dimensional structure of triosephosphate isomerase
(PDB ID 4iot) was used to verify the APCs in Figure 3. As
seen in the figure, C1 and C4 overlap, and hence are combined
as shown in Figure 4. The spatial distance between C1 and C2 is
12.24 Å and that between C3 and C4 is 7.52 Å, both less than the
average distance of 22.35 Å. Hence, the observed APC clusters
are interesting as they are close in spatial distance.
Fig. 4. Three-dimensional structure of a triosephosphate isomerase protein found in E. coli (Escherichia coli, PDB ID 4iot). a) The four APCs shown in Figure 3 are highlighted, with C1 and C4 combined because they overlap. b) The four specific amino acids in the APCs: His95, Tyr166, Glu167 and Tyr209.
Given this observation, we next investigate the biological
significance of the co-occurrence of APCs. We aim to establish
that co-occurring APCs, which may be far from each other in the
primary sequence, are close to one another in spatial distance
such that they are able to form chemical bonds or be
involved in joint functionality.
Triosephosphate isomerase (TIM) plays an important role in
glycolysis, which is essential for efficient energy production
[22]. TIM catalyzes the isomerization of a ketose (DHAP) to
an aldose (GAP) [22], which is an essential process in the
glycolytic pathway. (For the following residue positions, we
refer to PDB ID 4iot.) In one important step, Glu167 alone
does not possess the basicity to abstract a proton and requires
His95 to donate a proton to stabilize the negative charge
[22], [23]. When Glu167 and His95 are mapped onto the APCs
of TIM, they are located in C2 and C1 respectively. This
indicates that they need to co-operate in the glycolytic
pathway, which agrees with their high co-occurrence score of
0.85. Moreover, the co-occurrence between C1, which overlaps
with C4, and C3 was also confirmed by biological experiments
in [24]. Tyr166 and Tyr209 are amino acids corresponding
to C1 and C3 respectively. According to [24], these residues
control the opening and closing of loop 6, which facilitates
substrate binding and release in the glycolytic pathway. When
Tyr166 and Tyr209 are substituted by Phe166 and Phe209, the
enzymatic activity of TIM decreases significantly [24], as
these substitutions destabilize the internal structure by
compromising the H-bonds between the residues, which in turn
breaks the structural mechanism. This agrees with their high
co-occurrence score of 0.72.
B. Ubiquitin
Similarly, we applied our method to the ubiquitin family
to demonstrate our method’s ability to find a closely related
APC cluster corresponding to the protein structure. Eight
APCs were obtained from ubiquitin protein sequences from Pfam
version 25.0.
To find the optimal cluster count for k-means clustering,
the plots in Figure 5 were obtained, showing that the optimal
cluster count is two. The APCs are listed in Figure 6, with C2
and C3 forming one cluster, with an average co-occurrence score
of 0.38, and the remaining six APCs forming another cluster,
with an average co-occurrence score of 0.36. Hence {C2, C3} is
the better cluster by our criteria.
Fig. 5. Normalized scores of the four indicators used for k-means clustering on ubiquitin.
Fig. 7. Three-dimensional structure of a ubiquitin protein found in humans (Homo sapiens, PDB ID 1ubq), labeling the Lys (K) amino acids. a) highlights APC cluster {C2, C3}; b) highlights the other APC cluster.
The APCs were displayed on a three-dimensional structure
of ubiquitin (PDB ID 1ubq) in Figure 7. Both the average
distance between C2 and C3 (10.74 Å) and that within the
other cluster (12.67 Å) are less than the average distance of
14.83 Å. According to PDB, both APC clusters are important
to ubiquitin, with C2, C3 appearing in 213/385 (55%) of
ubiquitin structures in PDB and C1, C4, C5, C6, C7, C8
appearing in 194/385 (50%) of PDB ubiquitin structures. This
shows that the cluster with the highest co-occurrence score does
provide us with useful information.
As with TIM, we investigate the biological
significance of the co-occurrence of APCs in ubiquitin. Our
hypothesis is that the co-occurring APCs form chemical
bonds or co-operate on certain biological functions.
Ubiquitin (UBI) plays an important role in a post-translational
protein modification process called ubiquitination,
in which ubiquitin is attached to a substrate protein. The
modification can be either a single ubiquitin protein or multiple
chains of ubiquitin. To form a chain, a ubiquitin connects
to another ubiquitin by binding its C-terminal tail to one of
the seven lysine amino acids of its linking partner. There are
different forms of chains, each named after the one of the seven
lysine amino acids used to link the chain together, and they all
have different functionalities [25]. Ubiquitination is widely used
in regulating cellular signaling [25]. It does this by allowing
the ubiquitin attached to substrate proteins to be bound
by proteins with ubiquitin-binding domains (UBD) to trigger
corresponding events [25].
When the seven lysine amino acids are mapped to our APCs,
they are, surprisingly, all covered (C1: Lys6, Lys11; C3: Lys27,
Lys29; C4: Lys33; C5: Lys48; C6: Lys63). According to the
results of our co-occurrence clustering algorithm (Figure 6),
the optimal number of clusters for the eight APCs is two. The
first cluster includes C1, C4, C5, C6, C7 and C8, and the second
cluster includes C2 and C3. Their biological significance is
discussed as follows.
There are two reasons for the APCs in the first cluster to
co-occur. First, excluding C7 and C8, all the APCs in the
first cluster cover a lysine (K). Also, although the diglycine,
i.e., Gly75 and Gly76, is not covered, C8 corresponds to the
C-terminal tail. As discussed before, lysine and the C-terminal
tail are both important for the formation of multiple ubiquitin
chains. Second, the APCs in the first cluster cover
the ubiquitin-binding residue, i.e., Ile44 in C5, and all the
facilitating residues, i.e., Leu8 in C1, Lys48 in C5, His68 in C7,
and Val70 in C8 [25]. These residues, particularly Ile44 and
His68, are vitally important for the tight binding of ubiquitin
to almost all ubiquitin-binding proteins [25].
For the biological significance of the second cluster, we
observe that both C2 and C3 cover residues of the major
α-helix of ubiquitin, α1, which corresponds to residues 23-34
[26]. The work in [27] discovered that α1 is an unconventional
recognition site for ubiquitin-binding proteins. Experiments in
[27] revealed that even when Ile44 and His68 were mutated,
high-affinity binding between CKS1 and ubiquitin was still
observed. This showed that CKS1 behaves unconventionally
in binding ubiquitin. Further experiments were done in [27]
to reveal the specificity of binding. Smt3p is a protein with
α-helices and has no interactions with CKS1 [27]. Once an
α-helix of Smt3p was replaced by the residues in α1 of ubiquitin,
protein-protein binding was detected [27]. Although individual
contributions have not been evaluated, the co-occurrence of
C2 and C3, which are close to each other in the primary
sequence, represents the unconventional recognition site (α1)
based on the observed specificity. It should be noted that this
unconventional way of binding is not mutually exclusive with
conventional binding [27]. The two binding modes can co-occur,
which introduces more variety to the underlying mechanism.
IV. CONCLUSION
In this paper, we presented our newly developed algorithm,
which efficiently discovers co-occurring aligned pattern clusters
given only multiple homologous protein sequences as
input. By applying the algorithm to two protein families, we
found that frequently co-occurring patterns are close in
three-dimensional distance and that their co-occurrence is able
to reveal concurrent distant functions as well as structural
relations. Our experimental results demonstrated that our
approach is statistically reliable and algorithmically effective,
and that the co-occurrence relations can be visualized in an
integrated process that will benefit the biological community.
Fig. 6. The eight APCs found and their positions relative to the ubiquitin sequence. C2 and C3 form one cluster, and the other six APCs form another cluster. The locations of the Lys (K) amino acids are emphasized.
REFERENCES
[1] O. Lichtarge, H. R. Bourne, and F. E. Cohen, “An evolutionary trace method defines binding surfaces common to protein families,” Journal of Molecular Biology, vol. 257, no. 2, pp. 342–358, 1996.
[2] J. D. Thompson, B. Linard, O. Lecompte, and O. Poch, “A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives,” PLoS ONE, vol. 6, no. 3, p. e18093, 2011.
[3] M. C. Frith, U. Hansen, J. L. Spouge, and Z. Weng, “Finding functional sequence elements by multiple local alignment,” Nucleic Acids Research, vol. 32, no. 1, pp. 189–200, 2004.
[4] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, vol. 21, no. 1-2, pp. 51–80, 1995.
[5] D. Altschuh, A. Lesk, A. Bloomer, and A. Klug, “Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus,” Journal of Molecular Biology, vol. 193, no. 4, pp. 693–707, 1987.
[6] I. Kass and A. Horovitz, “Mapping pathways of allosteric communication in GroEL by analysis of correlated mutations,” Proteins: Structure, Function, and Bioinformatics, vol. 48, no. 4, pp. 611–617, 2002.
[7] E.-S. A. Lee and A. K. C. Wong, “Revealing binding segments in protein families using aligned pattern clusters,” Proteome Science, 2013.
[8] X.-L. Li, S.-H. Tan, C.-S. Foo, S.-K. Ng et al., “Interaction graph mining for protein complexes using local clique merging,” Genome Informatics Series, vol. 16, no. 2, p. 260, 2005.
[9] E. C. Kenley, L. Kirk, and Y.-R. Cho, “Differentiating party and date hubs in protein interaction networks using semantic similarity measures,” in Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2011, pp. 641–645.
[10] K.-S. Leung, K.-C. Wong, T.-M. Chan, M.-H. Wong, K.-H. Lee, C.-K. Lau, and S. K. Tsui, “Discovering protein–DNA binding sequence patterns using association rule mining,” Nucleic Acids Research, vol. 38, no. 19, pp. 6324–6337, 2010.
[11] T.-M. Chan, L.-Y. Lo, H.-Y. Sze-To, K.-S. Leung, X. Xiao, and M.-H. Wong, “Modeling associated protein-DNA pattern discovery with unified scores,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.
[12] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa, “Identification of direct residue contacts in protein–protein interaction by message passing,” Proceedings of the National Academy of Sciences, vol. 106, no. 1, pp. 67–72, 2009.
[13] L. Burger and E. van Nimwegen, “Disentangling direct from indirect co-evolution of residues in protein alignments,” PLoS Computational Biology, vol. 6, no. 1, p. e1000633, 2010.
[14] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, “Direct-coupling analysis of residue coevolution captures native contacts across many protein families,” Proceedings of the National Academy of Sciences, vol. 108, no. 49, pp. E1293–E1301, 2011.
[15] S. Madabushi, A. K. Gross, A. Philippi, E. C. Meng, T. G. Wensel, and O. Lichtarge, “Evolutionary trace of G protein-coupled receptors reveals clusters of residues that determine global and class-specific functions,” Journal of Biological Chemistry, vol. 279, no. 9, pp. 8126–8132, 2004.
[16] A. K. Wong, D. Zhuang, G. C. Li, and E.-S. Lee, “Discovery of delta closed patterns and noninduced patterns from sequences,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 8, pp. 1408–1421, 2012.
[17] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.
[18] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl, “Constrained k-means clustering with background knowledge,” in Proceedings of the International Conference on Machine Learning (ICML), 2001, pp. 577–584.
[19] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” 1973.
[20] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin, “UCSF Chimera - a visualization system for exploratory research and analysis,” Journal of Computational Chemistry, vol. 25, no. 13, pp. 1605–1612, 2004.
[21] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.
[22] R. Wierenga, E. Kapetaniou, and R. Venkatesan, “Triosephosphate isomerase: a highly evolved biocatalyst,” Cellular and Molecular Life Sciences, vol. 67, no. 23, pp. 3961–3982, 2010.
[23] M. M. Malabanan, L. Nitsch-Velasquez, T. L. Amyes, and J. P. Richard, “Magnitude and origin of the enhanced basicity of the catalytic glutamate of triosephosphate isomerase,” Journal of the American Chemical Society, vol. 135, no. 16, pp. 5978–5981, 2013.
[24] F. X. Guix, G. Ill-Raga, R. Bravo, T. Nakaya, G. de Fabritiis, M. Coma, G. P. Miscione, J. Villa-Freixa, T. Suzuki, X. Fernandez-Busquets et al., “Amyloid-dependent triosephosphate isomerase nitrotyrosination induces glycation and tau fibrillation,” Brain, vol. 132, no. 5, pp. 1335–1345, 2009.
[25] I. Dikic, S. Wakatsuki, and K. J. Walters, “Ubiquitin-binding domains - from structures to functions,” Nature Reviews Molecular Cell Biology, vol. 10, no. 10, pp. 659–671, 2009.
[26] K.-Y. Huang, “Ubiquitin conformational dynamics and hydration shell dynamics by solid state NMR,” 2011.
[27] D. Tempe, M. Brengues, P. Mayonove, H. Bensaad, C. Lacrouts, and M. C. Morris, “The alpha helix of ubiquitin interacts with yeast cyclin-dependent kinase subunit Cks1,” Biochemistry, vol. 46, no. 1, pp. 45–54, 2007.