[ieee 2013 ieee international conference on bioinformatics and biomedicine (bibm) - shanghai, china...



Confirming Biological Significance of Co-occurrence Clusters of Aligned Pattern Clusters

En-Shiun Annie Lee

Systems Design Engineering

University of Waterloo

Waterloo, Canada

[email protected]

Sanderz Fung

Systems Design Engineering

University of Waterloo

Waterloo, Canada

[email protected]

Ho-Yin Sze-To

Computer Science and Engineering

Chinese University of Hong Kong

Shatin, Hong Kong

[email protected]

Andrew K. C. Wong

Systems Design Engineering

University of Waterloo

Waterloo, Canada

[email protected]

Abstract—Advances in bioinformatics have provided researchers with a large influx of novel sequences, making the analysis of these sequences for inherent biological knowledge crucial. By applying pattern discovery and pattern synthesis to protein family sequences, conserved protein segments can be represented by Aligned Pattern Clusters (APCs), which are richer in statistical association knowledge than probabilistic models. This representation enables us to exploit the co-occurrence of APCs on the same protein sequence to identify functional regions. In this paper, we develop an efficient algorithm that identifies frequently co-occurring patterns using only homologous protein sequences as input. We applied the algorithm to triosephosphate isomerase and ubiquitin for a detailed study. By comparison with the corresponding 3D structures, we found that the discovered co-occurring patterns are close in spatial distance in most cases. We also found that the co-occurrence of patterns is biologically significant: residues that play important and co-operative roles in the glycolytic pathway of triosephosphate isomerase, and residues that are responsible for ubiquitination and ubiquitin binding in ubiquitin, are all covered by our co-occurring APCs. These results demonstrate the power of our algorithm to reveal concurrent distant functional and structural relations of protein sequences based on co-occurrence clusters of APCs.

Index Terms—Sequence, Clustering, Ubiquitin, Triosephosphate isomerase, Co-occurrence, Pattern, K-means clustering

I. INTRODUCTION

Identifying the functional regions of proteins is of fundamental importance. Such knowledge not only enables a better understanding of the underlying biological mechanisms but can also help in designing new drugs. While traditional experiments such as alanine scanning mutagenesis and X-ray crystallography are both laborious and time-consuming, computational methods can identify functional regions by looking for conserved segments among homologous proteins with similar biological function. The underlying belief is that amino acids in functional regions are under evolutionary pressure to maintain their functional integrity and thus undergo fewer mutations than functionally less important ones [1].

For de novo discovery, Multiple Sequence Alignment (MSA) [2] is one approach that identifies functional groups by aligning a set of protein sequences to a globally optimal consensus in order to find conserved regions. However, MSA is only suitable for globally homologous sequences with a high level of similarity [2]. Unlike MSA, motif discovery (multiple local alignment) [3], [4] attempts to locate and align locally similar subsequences and builds a probabilistic model, which assumes independence between residues, to describe the conserved region (as a motif). However, this assumption is unrealistic in many cases, since correlation of residues along the sequence is commonly observed [5], [6]. Moreover, no specific methods are available to indicate which residues in the consensus are not statistically or functionally significant in such models. The Aligned Pattern Cluster (APC) was hence introduced in our previous work [7] to provide a knowledge-rich representation of functional regions, capturing the statistically significant associations of the residues along the sequences and the distribution of their occurrences in each aligned segment region.

With this novel representation, we are now able to study and exploit pattern co-occurrence to identify binding sites within a protein, between two interacting proteins [8], [9], and between protein and DNA [10], [11]. Here, we define co-occurring patterns as patterns occurring on the same protein sequence. Related works [12], [13], [14] suggest that co-occurring (correlated) residues can provide insights into protein structures. Their hypothesis is that if two residues of a protein form a contact, an amino acid substitution at one position is expected to be compensated by a substitution at the other position. However, the major drawback of these approaches is that a large number (e.g. on the order of 1,000) of homologous and non-redundant protein sequences is required to learn the underlying statistical model [12], [13]. In addition, studies on protein families using Evolutionary Tracing (ET) [1] report that the presence or absence of certain clusters of residues on a protein sequence is a main cause of divergence between globally-specific functions and family-specific functions [15]. These studies require mutagenesis data, and their results suggest that the presence or absence of co-occurring patterns is likely to be linked to functional divergence [15].

In this study, we aim to answer the following two questions: How can we efficiently find the frequently co-occurring patterns, given only multiple homologous protein sequences as input? And what are the biological reasons for their high co-occurrence, and how can we relate the pattern co-occurrence findings to the biological causes? Our hypothesis is that the co-occurring patterns might have formed chemical bonds, or that they need to co-operate in certain biological functions. We started our study by collecting homologous protein sequences from protein databases. We developed an efficient algorithm based on our previous work [16], [7] to identify the frequently co-occurring patterns using only sequence data as input. We verified our results by computing the spatial distance between the co-occurring patterns using the corresponding 3D structures. We also surveyed the literature to find additional biological evidence to support the notion of co-occurrence.

2013 IEEE International Conference on Bioinformatics and Biomedicine

978-1-4799-1310-7/13/$31.00 ©2013 IEEE

In our experiments, we applied our algorithm to triosephosphate isomerase and ubiquitin for a detailed study. By examining protein structures, we found that the discovered co-occurring patterns are close in three-dimensional distance in most cases and that the co-occurring patterns are biologically significant. Residues that play important and co-operative roles in the glycolytic pathway of triosephosphate isomerase, and residues that are responsible for ubiquitination and ubiquitin binding in ubiquitin, are all covered by our co-occurring APCs.

The contribution of this study is threefold. First, we established a framework to study functional regions of proteins by exploiting the co-occurrences of patterns to reveal concurrent distant functions and structural relations. To our knowledge, this is the first study to identify co-occurrence of patterns, rather than residues, using only homologous protein sequences as input. Second, we developed an algorithm that is statistically robust, efficient, and visualizable (in domain location, structural and functional relation, and amino acid conservation and variation) in an integrated process. Compared to existing algorithms that study correlations (in residues), ours is novel in that it does not require a large number of homologous protein sequences for training in order to identify co-occurrences (of patterns). Third, the discovered co-occurrences of patterns that are novel to the biological community will provide new insights for studies of biological functions.

II. METHOD

Our methodology combines three algorithms: the first two come from our existing published work, and the third is the main focus of this paper. First, we use a pattern discovery algorithm [16] to discover and locate significant sequence patterns in a protein family while pruning redundant patterns. Next, we apply the APC algorithm [7] to obtain a list of condensed APCs with variations. Finally, we cluster the discovered APCs into APC clusters using a clustering algorithm and co-occurrence scores (Fig. 1).

A. Input Data

Let $\Sigma$ be the protein alphabet containing the 20 standard amino acids $\{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|-1}, \sigma_{|\Sigma|}\}$. A protein sequence $s = s_1 s_2 \ldots s_{|s|-1} s_{|s|}$ is built from the amino acids of the alphabet $\Sigma$, where each $s_i \in \Sigma$ and $s$ is of length $|s|$. The protein dataset used for each of our case studies consists of a set of protein sequences from the same protein family.

B. Pattern Discovery

Sequence patterns that have statistically significant amino acid associations are first discovered [16]. A pattern is defined as an interdependent ordered sequence of symbols $p = s_1 s_2 \ldots s_n$ from the alphabet $\Sigma$. The pattern $p$ has length $n$, and the $i$th symbol in the sequence is $s_i$. The list of patterns resulting from the pattern discovery algorithm is $\mathcal{P} = \{p_i \mid i = 1, \ldots, |\mathcal{P}|\} = \{p_1, p_2, \ldots, p_{|\mathcal{P}|-1}, p_{|\mathcal{P}|}\}$. This list is pruned of redundant patterns.

Fig. 1. The overall process of our methodology is a combination of three algorithms: 1) the pattern discovery algorithm, 2) the APC algorithm, and 3) the APC clustering algorithm.

C. Aligned Pattern Clustering

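An APC describes a set of aligned similar sequence patterns (as defined in [7]): a set of patterns to which gaps and wildcards are added to maximize the similarity between the patterns. Let a set of APCs be defined as [7]

$$\mathcal{C} = \{C^l \mid l = 1, \ldots, |\mathcal{C}|\} = \{C^1, C^2, \ldots, C^{|\mathcal{C}|-1}, C^{|\mathcal{C}|}\}$$

and let an APC be defined as

$$C^l = \mathrm{ALIGN}(P^l) \tag{1}$$

$$= \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{m1} & s_{m2} & \cdots & s_{mn} \end{pmatrix}_{m \times n} = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix} \tag{2}$$

$$= \begin{pmatrix} p_1 & p_2 & \cdots & p_m \end{pmatrix} \tag{3}$$

where $s_{ij} \in \Sigma \cup \{-\} \cup \{*\}$ is the symbol of pattern $p_i$ at the new column index $j$. Each of the $|P^l| = m$ patterns in the rows of $C^l$ is of length $|C^l| = n$.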
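As an illustration of the definition above, the following minimal Python sketch stores an APC as an m × n table of symbols drawn from the alphabet plus the gap and wildcard characters. The names `make_apc` and `column_counts` and the toy patterns are hypothetical and are not taken from the APC algorithm in [7]; the sketch only checks the shape and symbol constraints stated in the text.

```python
from collections import Counter

ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
GAP, WILDCARD = "-", "*"

def make_apc(aligned_patterns):
    """Build an m x n APC table from already-aligned pattern strings."""
    rows = [list(p) for p in aligned_patterns]
    n = len(rows[0])
    for row in rows:
        if len(row) != n:
            raise ValueError("all aligned patterns must have the same length")
        if not set(row) <= ALPHABET | {GAP, WILDCARD}:
            raise ValueError("unexpected symbol in aligned pattern")
    return rows

def column_counts(apc):
    """Count the symbols observed in each aligned column."""
    return [Counter(col) for col in zip(*apc)]

# Toy aligned patterns (hypothetical, for illustration only).
apc = make_apc(["HE-K", "HEAK", "H*AK"])
print(len(apc), len(apc[0]))  # 3 4
```

Here `len(apc)` plays the role of m and `len(apc[0])` the role of n in Equation (2).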

D. APC Clustering

Co-existence of patterns at different locations of the same protein may indicate joint functionality that is important for the protein family. In APC clustering, we first apply the k-means clustering algorithm to cluster APCs, using a co-occurrence score between APCs as the similarity measure. We also use four different clustering indicators to arrive at an optimal cluster configuration. Finally, we confirm the results with the three-dimensional structure corresponding to the locations of the APC clusters.

1) Co-occurrence Score: First, we compare all possible APC pairs, using a co-occurrence score as the similarity measure between them. The co-occurrence score quantifies how often two APCs appear together on the same sequence. The Jaccard index is adopted [17]:

$$J = \frac{|C^1_{seq} \cap C^2_{seq}|}{|C^1_{seq} \cup C^2_{seq}|}$$

where

$C^1_{seq}$ = sequences that contain patterns from APC $C^1$

$C^2_{seq}$ = sequences that contain patterns from APC $C^2$
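The score above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout (APC name mapped to the set of sequence identifiers containing one of its patterns), and the toy data are assumptions, as is the choice to return 0 when both sets are empty.

```python
from itertools import combinations

def jaccard(seqs_a, seqs_b):
    """Jaccard index between two sets of sequence identifiers."""
    a, b = set(seqs_a), set(seqs_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets score 0
    return len(a & b) / len(union)

def cooccurrence_scores(apc_to_seqs):
    """Score every unordered APC pair; keys are (name_a, name_b) in sorted order."""
    return {
        (x, y): jaccard(apc_to_seqs[x], apc_to_seqs[y])
        for x, y in combinations(sorted(apc_to_seqs), 2)
    }

# Toy data (hypothetical): sequences containing a pattern from each APC.
apc_to_seqs = {
    "C1": {"s1", "s2", "s3", "s4"},
    "C2": {"s2", "s3", "s4", "s5"},
    "C3": {"s6"},
}
scores = cooccurrence_scores(apc_to_seqs)
print(scores[("C1", "C2")])  # 3 shared sequences out of 5 in total: 0.6
```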

2) K-means Clustering: Next, closely related APCs are grouped into APC clusters, using the co-occurrence score as the similarity measure between APCs. The k-means clustering algorithm [17] is modified in two ways. First, APCs themselves are used to represent the centroids, since calculating a centroid from only the co-occurrence scores between APCs is difficult. The centroids are initialized with the first APC of each connected component, followed by the APCs with the lowest sum of co-occurrence scores. During the clustering process, each centroid is updated to the APC that maximizes the co-occurrence score with all other APCs in the same cluster. Second, the algorithm is modified to prevent an APC from being assigned to a centroid to which it is not connected [18]. For example, two APCs may not be connected if they do not co-occur on any sequence.
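The notion of connectivity can be illustrated with a small graph traversal. The sketch below assumes that two APCs are linked whenever their co-occurrence score is greater than zero and that connectivity is transitive along such links; the paper does not spell out the exact criterion, so this is an assumption, and the function name and toy scores are hypothetical.

```python
def connected_components(apcs, scores):
    """Group APCs into components; a link exists when a pair's score is > 0."""
    neighbours = {a: set() for a in apcs}
    for (a, b), s in scores.items():
        if s > 0:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, components = set(), []
    for start in apcs:
        if start in seen:
            continue
        # Depth-first traversal from the first unseen APC.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(comp))
    return components

# Toy scores (hypothetical): C1-C2 and C3-C4 co-occur, nothing else does.
apcs = ["C1", "C2", "C3", "C4"]
scores = {("C1", "C2"): 0.6, ("C1", "C3"): 0.0, ("C2", "C3"): 0.0,
          ("C1", "C4"): 0.0, ("C2", "C4"): 0.0, ("C3", "C4"): 0.3}
print(connected_components(apcs, scores))  # [['C1', 'C2'], ['C3', 'C4']]
```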

Algorithm 1 Modified k-means clustering

Input: A set of APCs C, the co-occurrence scores J between all pairs of APCs, and the final number of clusters k
Output: APC clusters K1 ... Kk

Initialize centroids M1 ... Mk, where each Mi represents the center of APC cluster Ki
Find the number of components
Select the first APC from each component as a centroid
for i = |components| + 1 to k do
    Identify the APC that forms the lowest co-occurrence score with the known centroids
    Assign this APC as a new centroid
end for
repeat
    for all APC C ∈ C do
        Assign C to the closest centroid Mj such that C and Mj are from the same component
    end for
    for all cluster Ki ∈ {K1 ... Kk} do
        Update centroid Mi by selecting the APC that maximizes co-occurrence with all APCs in Ki
    end for
until convergence
return {K1 ... Kk}
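Algorithm 1 can be sketched in Python as follows. This is a minimal reading of the pseudocode under several assumptions that are not stated in the paper: two APCs are linked when their score is positive, "closest" means the highest co-occurrence score, ties are broken by list order, a centroid always stays in its own cluster, and the loop stops when the centroids no longer change or after `max_iter` passes. The function name, the `frozenset`-keyed score dictionary, and the toy data are hypothetical.

```python
def modified_kmeans(apcs, sim, k, max_iter=100):
    """Cluster APC names into k groups; sim maps frozenset({a, b}) to a score."""
    if not 1 <= k <= len(apcs):
        raise ValueError("k must be between 1 and the number of APCs")

    def s(a, b):
        return 1.0 if a == b else sim.get(frozenset((a, b)), 0.0)

    # Label each APC with the first APC of its connected component
    # (assumption: two APCs are linked when their score is positive).
    root = {}
    for a in apcs:
        if a in root:
            continue
        root[a] = a
        stack = [a]
        while stack:
            u = stack.pop()
            for v in apcs:
                if v not in root and s(u, v) > 0:
                    root[v] = a
                    stack.append(v)

    # One centroid per component, then add the APCs with the lowest
    # summed score to the centroids chosen so far.
    centroids = [a for a in apcs if root[a] == a]
    if len(centroids) > k:
        raise ValueError("k is smaller than the number of components")
    while len(centroids) < k:
        rest = [a for a in apcs if a not in centroids]
        centroids.append(min(rest, key=lambda a: sum(s(a, m) for m in centroids)))

    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in centroids}
        for a in apcs:
            if a in clusters:  # a centroid stays in its own cluster
                clusters[a].append(a)
                continue
            allowed = [m for m in centroids if root[m] == root[a]]
            clusters[max(allowed, key=lambda m: s(a, m))].append(a)
        # New centroid: the member with the highest summed score to the others.
        updated = [max(ms, key=lambda c: sum(s(c, o) for o in ms if o != c))
                   for ms in clusters.values()]
        if updated == centroids:
            break
        centroids = updated
    return list(clusters.values())

# Toy scores (hypothetical); E co-occurs with nothing and forms its own component.
sim = {frozenset(("A", "B")): 0.8, frozenset(("C", "D")): 0.7,
       frozenset(("A", "C")): 0.1, frozenset(("B", "D")): 0.1}
result = modified_kmeans(["A", "B", "C", "D", "E"], sim, k=3)
print(result)  # [['A', 'B'], ['E'], ['C', 'D']]
```

Because centroids are always actual APCs, the procedure is closer in spirit to a medoid-based clustering than to textbook k-means, which is consistent with the paper's remark that a centroid is hard to compute from pairwise scores alone.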

3) Clustering Indicators: Finally, to ensure that the clustering provides the best possible results, four clustering indicators were used to determine the optimal cluster count for the APC clustering process. All four indicators follow the principle of maximizing the average co-occurrence score within a cluster while minimizing that between clusters. Furthermore, additive smoothing was applied to several indicators to prevent division by zero, which would make their values infinite. The variables and the indicators are defined as follows:

$k$ = number of clusters

$s(K_i)$ = average co-occurrence score in cluster $i$

$s(K_i, K_j)$ = average co-occurrence score between clusters $i$ and $j$

Average Score:

$$\frac{\sum_{i=1}^{k} s(K_i)}{k}$$

Intra / Inter:

$$\frac{k + \sum_{i=1}^{k} s(K_i)}{k + \sum_{x=1}^{k} \sum_{y=x+1}^{k} s(K_x, K_y)}$$

Dunn index [19]:

$$\frac{2 - \max_{1 \le x, y \le k:\, x \ne y} s(K_x, K_y)}{2 - \min_{1 \le i \le k} s(K_i)}$$

Max Intra / Related Inter:

$$\frac{s(K_x)}{\sum_{y=1}^{k} s(K_x, K_y)}, \quad \text{where } x = \arg\max_{i} s(K_i)$$

In the Dunn index, the complement of the co-occurrence score, $1 - s(K_i)$, was taken as the distance between two clusters, as required by the index definition [19]. To find the optimal cluster count, the cluster count that maximizes each of the four indicators was computed. Finally, the statistical mode of these four cluster counts was taken as the final optimal cluster count. If there is a tie, the larger cluster count is chosen.
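The indicator formulas and the final voting step can be sketched as below. This is an illustrative sketch only: the function names and input layout are hypothetical, the sum in the fourth indicator is assumed to skip y = x (the printed formula does not define s(K_x, K_x)), and returning infinity when that sum is zero is likewise an assumption. Clusters are indexed from 0 here.

```python
from collections import Counter

def indicators(intra, inter):
    """intra[i] = s(K_i); inter[(x, y)] = s(K_x, K_y) for x < y (0-based)."""
    k = len(intra)

    def s2(x, y):
        return inter[(min(x, y), max(x, y))]

    average = sum(intra) / k
    intra_inter = (k + sum(intra)) / (k + sum(inter.values()))
    dunn = (2 - max(inter.values(), default=0.0)) / (2 - min(intra))
    x = max(range(k), key=lambda i: intra[i])  # cluster with the highest intra score
    related = sum(s2(x, y) for y in range(k) if y != x)  # assumption: skip y = x
    max_intra = intra[x] / related if related else float("inf")
    return {"average": average, "intra_inter": intra_inter,
            "dunn": dunn, "max_intra_related_inter": max_intra}

def optimal_cluster_count(best_k_per_indicator):
    """Statistical mode of the per-indicator optima; ties go to the larger count."""
    counts = Counter(best_k_per_indicator)
    top = max(counts.values())
    return max(k for k, c in counts.items() if c == top)

# Toy values (hypothetical) for a two-cluster configuration.
values = indicators(intra=[0.8, 0.6], inter={(0, 1): 0.2})
print(values["max_intra_related_inter"])      # 4.0
print(optimal_cluster_count([2, 2, 4, 4]))    # 4
```

The second call mirrors the tie-breaking rule in the text: when two indicators favour two clusters and two favour four, the larger count is returned.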

4) Verification by Three-Dimensional Structure: After applying co-occurrence clustering, we manually select the cluster with the highest average co-occurrence score $s(K_i)$ as the highly connected APC cluster. We relate the result to its corresponding three-dimensional protein structure from the Protein Data Bank (PDB) using Chimera [20], highlighting the regions where the APCs, or parts of the APCs, appear. We calculate two distances for comparison: the distance between the APCs, and the average pairwise distance. The former is calculated by finding the centroid of each APC (defined as the arithmetic mean of the amino acid locations of the APC) and then computing the distance between the two centroids. The latter is the average distance over all possible amino acid pairs in the structure. By comparing the two types of distances, we determine whether highly co-occurring APCs are also close in three-dimensional distance or involved in protein function.
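The two distances can be sketched as follows, assuming each amino acid is represented by a single 3D coordinate (for example one atom per residue; the paper does not say which atoms were used, so this is an assumption). The function names and toy coordinates are hypothetical, and the code relies on `math.dist` from Python 3.8 or later.

```python
import math
from itertools import combinations

def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) residue coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def apc_distance(coords_a, coords_b):
    """Distance between the centroids of two APCs."""
    return math.dist(centroid(coords_a), centroid(coords_b))

def average_pairwise_distance(coords):
    """Mean distance over all unordered residue pairs in the structure."""
    pairs = list(combinations(coords, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Toy coordinates (hypothetical), in angstroms.
apc_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
apc_b = [(1.0, 3.0, 4.0)]
print(apc_distance(apc_a, apc_b))                 # 5.0
print(average_pairwise_distance(apc_a + apc_b))
```

An APC pair would then count as spatially close when `apc_distance` falls below `average_pairwise_distance` computed over the whole structure, which is the comparison reported in the results.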

III. RESULTS AND ANALYSIS

To test our method’s ability to find a set of highly co-occurring APCs, we used protein sequences of two families obtained from Pfam: triosephosphate isomerase and ubiquitin. After finding the APCs, we verified their functional significance by locating them on structures from the Protein Data Bank (PDB) [21] and observing the structural and functional characteristics shared by the APCs found, especially the spatial distance between them.

Fig. 2. Normalized scores of the four indicators used for k-means clustering on triosephosphate isomerase.

Fig. 3. C1, C2, C3 and C4 aligned with parts of the traditional Pfam representation of the triosephosphate isomerase family. C1 and C2 represent one APC cluster, and C3 and C4 represent another APC cluster. The two APC clusters would be joined if the APCs were clustered into three APC clusters.

A. Triosephosphate isomerase

First, by applying our method to triosephosphate isomerase, we show that closely related APC clusters can be found. Six APCs were obtained from pattern discovery and pattern alignment on triosephosphate isomerase protein sequences. The k-means clustering algorithm was used to cluster the APCs, and the indicators (Figure 2) were used to obtain the optimal number of clusters. According to the results in Figure 2, two indicators agree that two is the optimal cluster count, while the other two agree on four. With a tie, the larger cluster count is chosen. Hence, k-means clustering with four clusters was applied, yielding the most closely co-occurring APC cluster, containing C1 and C2, with an average Jaccard index of 0.85, and another cluster, consisting of C3 and C4, with an average Jaccard index of 0.72. Furthermore, when a cluster count of three is provided to k-means, the two APC clusters are joined together, showing the relationship between {C1, C2} and {C3, C4}. Figure 3 shows the co-occurrence between the two clusters and their locations on the primary sequence.

A three-dimensional structure of triosephosphate isomerase (PDB ID 4iot) was used to verify the APCs in Figure 3. As seen in the figure, C1 and C4 overlap, and hence they are combined as shown in Figure 4. The spatial distance between C1 and C2 is 12.24 Å and that between C3 and C4 is 7.52 Å, both less than the average pairwise distance of 22.35 Å. Hence, the observed APC clusters are interesting, as they are close in spatial distance.

Fig. 4. Three-dimensional structure of a triosephosphate isomerase protein found in E. coli (Escherichia coli, PDB ID 4iot). a) The four APCs shown in Figure 3 are highlighted, with C1 and C4 combined because they overlap. b) The four specific amino acids in the APCs: His95, Tyr166, Glu167 and Tyr209.

Next, prompted by this observation, we investigate the biological significance of the co-occurrence of APCs. We aim to establish that the co-occurring APCs, which may be far from each other in the primary sequence, are close to one another in spatial distance, such that they are able to form chemical bonds or be involved in joint functionality.

Triosephosphate isomerase (TIM) plays an important role in glycolysis, which is essential for efficient energy production [22]. TIM catalyzes the isomerization of a ketose (DHAP) to an aldose (GAP) [22], which is an essential process in the glycolytic pathway. (For the following residue positions, we refer to PDB ID 4iot.) In one important step, Glu167 alone does not possess the basicity to abstract a proton and requires His95 to donate a proton to stabilize the negative charge [22], [23]. When Glu167 and His95 are mapped onto the APCs of TIM, they are located in C2 and C1, respectively. This indicates that they need to co-operate in the glycolytic pathway, which agrees with their high co-occurrence score of 0.85. Moreover, the co-occurrence between C1, which overlaps with C4, and C3 is also confirmed by the biological experiments in [24]. Tyr166 and Tyr209 are amino acids corresponding to C1 and C3, respectively. According to [24], these residues control the opening and closing of loop 6, which facilitates substrate binding and release in the glycolytic pathway. When Tyr166 and Tyr209 are substituted by Phe166 and Phe209, the enzymatic activity of TIM decreases significantly [24], as these substitutions destabilize the internal structure by compromising the H-bonds between them, which in turn breaks the structural mechanism. This agrees with their high co-occurrence score of 0.72.

B. Ubiquitin

Similarly, we applied our method to the ubiquitin family to demonstrate its ability to find a closely related APC cluster corresponding to the protein structure. Eight APCs were obtained from ubiquitin protein sequences from Pfam version 25.0.

To find the optimal cluster count for k-means clustering, the plots in Figure 5 were obtained, showing that the optimal cluster count is two. The APCs are listed in Figure 6, with C2 and C3 forming one cluster, with an average co-occurrence score of 0.38, and the remaining six APCs forming another cluster, with an average co-occurrence score of 0.36. Hence, {C2, C3} is the better cluster by our criteria.

Fig. 5. Normalized scores of the four indicators used for k-means clustering on ubiquitin.

Fig. 7. Three-dimensional structure of a ubiquitin protein found in humans (Homo sapiens, PDB ID 1ubq), labeling the Lys (K) amino acids. a) highlights APC cluster {C2, C3}; b) highlights the other APC cluster.

The APCs were displayed on a three-dimensional structure of ubiquitin (PDB ID 1ubq) in Figure 7. The average distance within the first cluster, between C2 and C3 (10.74 Å), and that within the second cluster (12.67 Å) are both less than the average pairwise distance of 14.83 Å. According to PDB, both APC clusters are important to ubiquitin, with C2 and C3 appearing in 213/385 (55%) of the ubiquitin structures in PDB and C1, C4, C5, C6, C7 and C8 appearing in 194/385 (50%) of the PDB ubiquitin structures. This shows that the cluster with the highest co-occurrence score does provide us with useful information.

As with TIM, we investigate the biological significance of the co-occurrence of APCs in ubiquitin. Our hypothesis is that the co-occurring APCs form chemical bonds or co-operate in certain biological functions.

Ubiquitin (UBI) plays an important role in a post-translational protein modification process called ubiquitination, in which ubiquitin is attached to a substrate protein. The modification can be either a single ubiquitin protein or multiple chains of ubiquitin. To form a chain, a ubiquitin connects to another ubiquitin by binding its C-terminal tail to one of the seven lysine amino acids of its linking partner. There are different forms of chains, each named after the one of the seven lysine amino acids used to link the chain together, and they all have different functionalities [25]. Ubiquitination is widely used in regulating cellular signaling [25]. It does this by allowing the ubiquitin attached to substrate proteins to be bound by proteins with ubiquitin-binding domains (UBD) to trigger the corresponding events [25].

When the seven lysine amino acids are mapped to our APCs, surprisingly, they are all covered (C1: Lys6, Lys11; C3: Lys27, Lys29; C4: Lys33; C5: Lys48; C6: Lys63). According to the results of our co-occurrence clustering algorithm (Figure 6), the optimum number of clusters for the eight APCs is two. The first cluster includes C1, C4, C5, C6, C7 and C8, and the second cluster includes C2 and C3. Their biological significance is discussed as follows.

There are two reasons for the APCs in the first cluster to co-occur. First, excluding C7 and C8, all the APCs in the first cluster cover a lysine (K). Also, although the diglycine, i.e. Gly75 and Gly76, is not covered, C8 corresponds to the C-terminal tail. As discussed above, lysine and the C-terminal tail are both important for the formation of multiple ubiquitin chains. Second, the APCs in the first cluster cover the ubiquitin-binding residue, i.e. Ile44 in C5, and all the facilitating residues, i.e. Leu8 in C1, Lys48 in C5, His68 in C7, and Val70 in C8 [25]. These residues, particularly Ile44 and His68, are vitally important for the tight binding of ubiquitin to almost all ubiquitin-binding proteins [25].

Regarding the biological significance of the second cluster, both C2 and C3 cover residues of the major α-helix of ubiquitin, α1, which corresponds to residues 23-34 [26]. The work in [27] discovered that α1 is an unconventional recognition site for ubiquitin-binding proteins. Experiments in [27] revealed that even when Ile44 and His68 were mutated, high-affinity binding between CKS1 and ubiquitin was still observed. This showed that CKS1 behaves unconventionally in binding ubiquitin. Further experiments in [27] revealed the specificity of the binding. Smt3p is a protein with α-helices that has no interactions with CKS1 [27]. Once an α-helix of Smt3p was replaced by the residues of α1 of ubiquitin, protein-protein binding was detected [27]. Although individual contributions have not been evaluated, the co-occurrence of C2 and C3, which are close to each other in the primary sequence, represents the unconventional recognition site (α1) based on the observed specificity. It should be noted that this unconventional way of binding is not mutually exclusive with conventional binding [27]. The two models can co-occur, which introduces more variety to the underlying mechanism.

IV. CONCLUSION

In this paper, we presented our newly developed algorithm, which efficiently discovers co-occurring aligned pattern clusters given only multiple homologous protein sequences as input. By applying the algorithm to two protein families, we found that frequently co-occurring patterns are close in three-dimensional distance and that their co-occurrence is able to reveal concurrent distant functions as well as structural relations. Our experimental results demonstrated that our approach is statistically reliable and algorithmically effective, and that the co-occurrence relations are visualizable in an integrated process that will benefit the biological community.

Fig. 6. The eight APCs found and their positions relative to the ubiquitin sequence. C2 and C3 form one cluster and the other six APCs form another cluster. The locations of the Lys (K) amino acids are emphasized.

REFERENCES

[1] O. Lichtarge, H. R. Bourne, and F. E. Cohen, “An evolutionary tracemethod defines binding surfaces common to protein families,” Journalof molecular biology, vol. 257, no. 2, pp. 342–358, 1996.

[2] J. D. Thompson, B. Linard, O. Lecompte, and O. Poch, “A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives,” PLoS ONE, vol. 6, no. 3, p. e18093, 2011.

[3] M. C. Frith, U. Hansen, J. L. Spouge, and Z. Weng, “Finding functional sequence elements by multiple local alignment,” Nucleic Acids Research, vol. 32, no. 1, pp. 189–200, 2004.

[4] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, vol. 21, no. 1-2, pp. 51–80, 1995.

[5] D. Altschuh, A. Lesk, A. Bloomer, and A. Klug, “Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus,” Journal of Molecular Biology, vol. 193, no. 4, pp. 693–707, 1987.

[6] I. Kass and A. Horovitz, “Mapping pathways of allosteric communication in GroEL by analysis of correlated mutations,” Proteins: Structure, Function, and Bioinformatics, vol. 48, no. 4, pp. 611–617, 2002.

[7] E.-S. A. Lee and A. K. C. Wong, “Revealing binding segments in protein families using aligned pattern clusters,” Proteome Science, 2013.

[8] X.-L. Li, S.-H. Tan, C.-S. Foo, S.-K. Ng et al., “Interaction graph mining for protein complexes using local clique merging,” Genome Informatics Series, vol. 16, no. 2, p. 260, 2005.

[9] E. C. Kenley, L. Kirk, and Y.-R. Cho, “Differentiating party and date hubs in protein interaction networks using semantic similarity measures,” in Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2011, pp. 641–645.

[10] K.-S. Leung, K.-C. Wong, T.-M. Chan, M.-H. Wong, K.-H. Lee, C.-K. Lau, and S. K. Tsui, “Discovering protein–DNA binding sequence patterns using association rule mining,” Nucleic Acids Research, vol. 38, no. 19, pp. 6324–6337, 2010.

[11] T.-M. Chan, L.-Y. Lo, H.-Y. Sze-To, K.-S. Leung, X. Xiao, and M.-H. Wong, “Modeling associated protein-DNA pattern discovery with unified scores,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.

[12] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa, “Identification of direct residue contacts in protein–protein interaction by message passing,” Proceedings of the National Academy of Sciences, vol. 106, no. 1, pp. 67–72, 2009.

[13] L. Burger and E. van Nimwegen, “Disentangling direct from indirect co-evolution of residues in protein alignments,” PLoS Computational Biology, vol. 6, no. 1, p. e1000633, 2010.

[14] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, “Direct-coupling analysis of residue coevolution captures native contacts across many protein families,” Proceedings of the National Academy of Sciences, vol. 108, no. 49, pp. E1293–E1301, 2011.

[15] S. Madabushi, A. K. Gross, A. Philippi, E. C. Meng, T. G. Wensel, and O. Lichtarge, “Evolutionary trace of G protein-coupled receptors reveals clusters of residues that determine global and class-specific functions,” Journal of Biological Chemistry, vol. 279, no. 9, pp. 8126–8132, 2004.

[16] A. K. Wong, D. Zhuang, G. C. Li, and E.-S. Lee, “Discovery of delta closed patterns and noninduced patterns from sequences,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 8, pp. 1408–1421, 2012.

[17] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.

[18] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl, “Constrained k-means clustering with background knowledge,” in Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001, pp. 577–584.

[19] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” 1973.

[20] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin, “UCSF Chimera - a visualization system for exploratory research and analysis,” Journal of Computational Chemistry, vol. 25, no. 13, pp. 1605–1612, 2004.

[21] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[22] R. Wierenga, E. Kapetaniou, and R. Venkatesan, “Triosephosphate isomerase: a highly evolved biocatalyst,” Cellular and Molecular Life Sciences, vol. 67, no. 23, pp. 3961–3982, 2010.

[23] M. M. Malabanan, L. Nitsch-Velasquez, T. L. Amyes, and J. P. Richard, “Magnitude and origin of the enhanced basicity of the catalytic glutamate of triosephosphate isomerase,” Journal of the American Chemical Society, vol. 135, no. 16, pp. 5978–5981, 2013.

[24] F. X. Guix, G. Ill-Raga, R. Bravo, T. Nakaya, G. de Fabritiis, M. Coma, G. P. Miscione, J. Villa-Freixa, T. Suzuki, X. Fernandez-Busquets et al., “Amyloid-dependent triosephosphate isomerase nitrotyrosination induces glycation and tau fibrillation,” Brain, vol. 132, no. 5, pp. 1335–1345, 2009.

[25] I. Dikic, S. Wakatsuki, and K. J. Walters, “Ubiquitin-binding domains - from structures to functions,” Nature Reviews Molecular Cell Biology, vol. 10, no. 10, pp. 659–671, 2009.

[26] K.-Y. Huang, “Ubiquitin conformational dynamics and hydration shell dynamics by solid state NMR,” 2011.

[27] D. Tempe, M. Brengues, P. Mayonove, H. Bensaad, C. Lacrouts, and M. C. Morris, “The alpha helix of ubiquitin interacts with yeast cyclin-dependent kinase subunit CKS1,” Biochemistry, vol. 46, no. 1, pp. 45–54, 2007.