Current Protein and Peptide Science, 2005, 6, 479-491

1389-2037/05 $50.00+.00 © 2005 Bentham Science Publishers Ltd.

Pattern Recognition Methods for Protein Functional Site Prediction

Zheng Rong Yang1,*, Lipo Wang2, Natasha Young1, Dave Trudgian and Kuo-Chen Chou3

1Department of Computer Science, University of Exeter, UK, 2School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, 3Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, California 92130, USA

Abstract: Protein functional site prediction is closely related to drug design, and hence to public health. To save the cost and time spent on identifying functional sites in sequenced proteins in the biology laboratory, computer programs have been widely used for decades. Many of them are implemented using state-of-the-art pattern recognition algorithms, including decision trees, neural networks and support vector machines. Although the success of this effort has been obvious, new and more advanced algorithms are still under development to address some difficult issues. This review goes through the major stages in developing pattern recognition algorithms for protein functional site prediction and outlines future research directions in this important area.

Keywords: HIV protease, distorted-key theory, sliding window, neural network, support vector machines, bio-basis function classifier.

1. INTRODUCTION

There are two types of protein sequence analysis, namely protein annotation (or homology detection) and functional site recognition. The former annotates a newly sequenced protein, while the latter determines where the functional sites are in the sequence. Protein annotation commonly involves sequence homology detection. For instance, when the severe acute respiratory syndrome (SARS) virus was sequenced, scientists identified the type of the virus through sequence homology alignment [1-5]. This helped the fast recognition of the virus and the effective medical treatment of the disease. A commonly used method for protein annotation is based on homology alignment algorithms, which determine how similar two sequences are [6,7]. If the homology alignment score between the sequence of a novel protein and the sequences of known proteins from an existing databank is large enough, the novel protein is classified as a member of the family of the known proteins. Many computational algorithms and tools have therefore been developed for protein annotation using this principle with success. For instance, the Smith–Waterman dynamic programming algorithm [8] is the most accurate, while heuristic algorithms like BLAST [6] and FASTA [9] trade reduced accuracy for improved efficiency. Later on, methods using aggregate statistics from a family of sequences, such as the profile-based method [10], were used to improve the accuracy of sequence homology detection. More recently, iterative methods such as PSI-BLAST [11] and SAM-T98 [12] were applied to large databases of positive protein sequences for decision making.

Within a sequence, there will be one or more functional sites for chemical activity, such as protease cleavage, signal peptide cleavage or glycoprotein linkage. Fig. 1 shows a diagram of the HIV structure, where there are nine cleaved functional proteins, namely p17, p24, p9, p6, protease, reverse transcriptase, integrase, gp120 and gp41. The aspartyl protease works on the polyproteins “gag” and “pol”, while the cellular protease works on the polyprotein “env”.

*Address correspondence to this author at the Department of Computer Science, University of Exeter, UK; E-mail: [email protected]

The specificity of functional sites in protein sequences is important to drug design. The effective design of an inhibitor (drug) against cleavage activity, and hence the possibility of preventing the maturation of viruses, largely depends on understanding the specificity of the virus cleavage activity. For instance, HIV protease can be inhibited using specially designed inhibitors. Fig. 2 shows how inhibitors block the newly produced protease from leaving the cell to function.

As seen in (Fig. 2), two approaches can be used to block the newly produced protease: one is to inhibit the reverse transcriptase (see, e.g., [13-19]); the other is to inhibit the HIV protease. For the latter, many studies have focused on finding the peptides that can be cleaved by HIV protease (see, e.g., [20-29]), followed by modifying these peptides according to the distorted key theory [22,23] to convert them into potential inhibitors against HIV protease, as illustrated in Fig. 3.

Note that R4-R3-R2-R1-R1'-R2'-R3'-R4' in (Fig. 3) is a sub-sequence obtained from a whole protein sequence. Fig. 4 shows a sliding window of size 2K used for scanning a whole sequence to generate sub-sequences. The window moves along the whole sequence from the N-terminal to the C-terminal, residue by residue. All the residues within one scan constitute one sub-sequence. In (Fig. 4), the residues are denoted as RK … R2-R1-R1'-R2' … RK'. The sub-sequence is functional if there is a cleavable site between R1 and R1'. Sometimes the functional site occurs at only one residue, as with a phosphorylation site [30] or an O-linkage site [106,107]. In this situation, the residues within a window of size 2K+1 are denoted as RK … R2-R1-R0-R1'-R2' … RK', and the functional site is at R0.
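The scanning procedure can be sketched in a few lines of Python. This is a minimal illustration only: the function name `sliding_windows` and the toy sequence are assumptions made here, not part of the original work.

```python
def sliding_windows(sequence, k):
    """Yield (start, sub_sequence) for every window of size 2*k.

    Within each window the candidate cleavage site lies between
    sub_sequence[k - 1] (position R1) and sub_sequence[k] (position R1').
    """
    size = 2 * k
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


# Hypothetical toy sequence, not taken from the paper.
toy = "MKVLAADERT"
windows = list(sliding_windows(toy, k=4))
for start, sub in windows:
    print(start, sub, "R1 =", sub[3], "R1' =", sub[4])
```

For the single-residue case (window size 2K+1), the same loop applies with `size = 2 * k + 1`, and R0 is then `sub_sequence[k]`.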


Fig. (1). The diagram of the HIV structure and the cleavage sites. There are two types of proteases, namely aspartyl and cellular proteases. They cleave two different polyproteins formed by two different genes. Each cleavage produces a functional protein.

Fig. (2). Inhibitor function for HIV protease. The largest ellipse represents the cell and the ellipse within the largest ellipse represents the nucleus.

Functional activity is very substrate-selective, i.e., it depends on the size and shape of the substrate for the chemical reaction. A small region around a functional site is therefore the focus of the study. For instance, the eight residues surrounding the cleavage site are used for analysing the specificity of HIV protease cleavage [22,23,27]. The residues within such a small region constitute a sub-sequence (SS), in contrast to a whole sequence. Function occurs only when certain amino acid compositions, regarded as functional patterns, are present in SSs. Functional sites in protein sequences can be determined experimentally, but this is time-consuming and costly. To automate the process, a computer program can be implemented with the observed functional patterns embedded for efficient detection of the functional sites in proteins. Based on the frequency matrix of the 20 amino acids at the eight sub-positions R4, R3, R2, R1, R1', R2', R3', R4' derived from a set of known SSs cleavable by HIV protease, various prediction algorithms were developed, such as the h-function algorithm [27], the Γ-function algorithm [26], the Θ-function algorithm [21], the Markov-chain model [25], the alternate-subsite-coupled model [29], the vectorized sequence-coupled model [22], and the discriminant function algorithm [24]. Comments on each of these algorithms can be found in a comprehensive review [23].

2. CODING METHOD

Before a machine-learning algorithm can be used to build a model for protein functional site prediction, an important issue is how to represent the amino acids in SSs. Amino acids are non-numerical attributes, so they must be coded as numerical values. One of the most commonly used encoding methods is the distributed encoding [35]. With this method, each amino acid is encoded by a 20-bit binary vector with one bit set to one and the rest to zero. In some cases, a 21-bit binary vector is used to include the “unknown” amino acid. This method has made it much easier to apply most pattern recognition algorithms to the analysis of protein sequences. For instance, the method has been used in the prediction of protease cleavage sites [28], signal peptide cleavage sites [36], linkage sites in glycoproteins [37], enzyme active sites [38], phosphorylation sites [30] and water active sites [39].

However, the distributed encoding method is unable to code the biological content of SSs effectively. The distance (dissimilarity) between two binary vectors encoded from two different amino acids is always a constant (it is 2 if the Hamming distance is used), while the similarity (mutation or substitution probability) between any two amino acids varies [11,40,41].
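The constant-distance property is easy to check numerically. The sketch below assumes the 20 standard residues listed alphabetically by one-letter code; the ordering and the helper names are illustrative choices, not prescribed by the encoding method.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot(residue):
    """Return the 20-bit distributed encoding of a single residue."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec


def encode(sub_sequence):
    """Concatenate the encodings of all residues in a sub-sequence."""
    return [bit for residue in sub_sequence for bit in one_hot(residue)]


def hamming(u, v):
    """Number of positions at which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Every pair of distinct residues is exactly the same distance apart.
distances = {
    hamming(one_hot(a), one_hot(b))
    for a in AMINO_ACIDS
    for b in AMINO_ACIDS
    if a != b
}
print(distances)
print(len(encode("IPRS")))
```

Whatever two distinct residues are compared, the encoding treats them as equally dissimilar, which is the limitation described above.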

The other difficulty of the distributed encoding method is the model size. The number of input variables for a protein sequence is enlarged 20 (or 21) times. Fig. 5 shows an application in which the distributed encoding method is used in a neural network model. This makes the free parameter ratio very small. The free parameter ratio is defined as the ratio of the number of training vectors to the number of parameters (weights) used in a model. The larger the ratio, the more significant the training data is. As the number of input nodes increases dramatically, the number of model parameters increases significantly and the free parameter ratio decreases accordingly.
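A small calculation illustrates the effect. It assumes a three-layer network with one output neuron and one bias per hidden and output neuron; the hidden-layer size of 5 is a hypothetical choice, and 362 is the size of the HIV-1 protease octapeptide data set quoted in Section 5.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Weights of a three-layer MLP, counting one bias per hidden and output neuron."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


def free_parameter_ratio(n_training_vectors, n_weights):
    """Number of training vectors divided by number of model parameters."""
    return n_training_vectors / n_weights


window = 8       # octapeptide
n_train = 362    # HIV-1 protease data set size quoted in the text
n_hidden = 5     # hypothetical, for illustration only

distributed = mlp_weight_count(20 * window, n_hidden)  # 20 inputs per residue
compact = mlp_weight_count(window, n_hidden)           # one input per residue

print(distributed, free_parameter_ratio(n_train, distributed))
print(compact, free_parameter_ratio(n_train, compact))
```

Under these assumptions the distributed encoding leaves fewer than one training vector per weight, whereas a one-input-per-residue representation would leave several.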

To overcome this difficulty, the bio-basis function was proposed in 2003 [42,43]. The basic principle of the bio-basis function is the normalisation of pairwise homology alignment scores. As shown in (Fig. 6), a query sub-sequence (IPRS) is aligned with two template SSs (KPRT and YKAE) to produce two homology alignment scores, a (24+56+56+36=172) and b (28+28+24+32=112), respectively. The values are from the Dayhoff matrix [41]. Because a>b, it is believed that the query sub-sequence shares more functional similarity with the first template SS.
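The arithmetic of this example, together with one possible normalisation, can be sketched as follows. The per-position scores are those quoted above for Fig. 6. The function `bio_basis` uses an exponential normalisation against the template's self-alignment score; this particular form, the parameter `gamma` and the function name are assumptions made for illustration, since this section does not give the formula, and self-alignment scores are not quoted in the text.

```python
import math

# Per-position scores as quoted in the text for Fig. 6.
scores_kprt = [24, 56, 56, 36]  # IPRS aligned with KPRT
scores_ykae = [28, 28, 24, 32]  # IPRS aligned with YKAE

a = sum(scores_kprt)
b = sum(scores_ykae)
print(a, b, a > b)


def bio_basis(h_query_template, h_template_template, gamma=1.0):
    """Assumed normalisation: exp(gamma * (h(s, t) - h(t, t)) / h(t, t)).

    h_query_template is the alignment score between query and template,
    h_template_template the score of the template aligned with itself.
    """
    return math.exp(
        gamma * (h_query_template - h_template_template) / h_template_template
    )
```

With this form the output equals 1 when the query scores as highly as the template does against itself and decays towards 0 as the alignment score drops.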

3. SUPPORT VECTOR MACHINES

Classification analysis is to find a mapping function between an input vector $\mathbf{x}$ and a class membership $t \in \{-1, 1\}$,

$$y = f(\mathbf{x}, \mathbf{w}) \tag{1}$$

where $\mathbf{w}$ is the parameter vector, $f(\mathbf{x}, \mathbf{w})$ the mapping function and $y$ the output. With most classification algorithms other than support vector machines (SVM) [44], the distance (error) between $y$ and $t$ is minimised to optimise $\mathbf{w}$. This can lead to a biased hyper-plane for discrimination. In (Fig. 7), four open circles of class A and four filled circles of class B are distributed in balance. When a shaded circle is included as noise, the hyper-plane (the dashed line) will be biased because the error (distance) between the nine circles and the hyper-plane has to be minimised. When such a hyper-plane is used to test novel data with a distribution similar to that of the training data (the eight small circles in this case), the bias shows. For instance, testing this biased hyper-plane on the four test points denoted by the triangles gives an accuracy of 50%, compared with 100% on the eight small circles. In searching for the best hyper-plane, SVMs find a set of data points which are the most difficult to classify. These data points are referred to as support vectors. In an SVM classifier, the support vectors are closest to the hyper-plane and are located on the boundaries of the margin between the two classes. The advantage of using SVMs is that the hyper-plane is found by maximising this margin. Because of this, the SVM classifier is robust and, in many applications, has shown the best generalisation ability among classifiers. In (Fig. 7), two open circles on the upper boundary and two filled circles on the lower boundary are selected as support vectors. These four circles form the boundaries of the maximum margin between the two classes.

Fig. (3). Schematic illustration to show (a) a cleavable octapeptide is chemically effectively bound to the active site of HIV protease, and (b) although still bound to the active site, the peptide has lost its cleavability after its scissile bond is modified from a hybrid peptide bond to a single bond by some simple routine procedure. Reproduced from K.C. Chou [23] with permission.

Fig. (4). Sliding window [31-34] for scanning a whole sequence to generate sub-sequences, where the amino acid “A” is at position R1 and the amino acid “D” at position R1'. Adapted with permission from [33].

Fig. (5). An example of using the distributed encoding method, where each residue is encoded using 20 inputs.

Fig. (6). An illustration of the development of the bio-basis function. As IPRS is more similar to KPRT than to YKAE, its similarity with KPRT is larger than that with YKAE; see the right figure. Values such as 24, 56, 36, 28, 24 and 32 are obtained from the Dayhoff matrix [41].

The trained SVM classifier carries out a linear combination of the similarities between an input and the support vectors. The similarity between an input vector $\mathbf{x}$ and a support vector is quantified by a kernel function defined as

$$\psi(\mathbf{x}, \mathbf{x}_i) \tag{2}$$

where $\mathbf{x}_i$ is the ith support vector. The decision is made using the following equation

$$y = \mathrm{sign}\left\{ \sum_i \alpha_i t_i \psi(\mathbf{x}, \mathbf{x}_i) \right\} \tag{3}$$

where $t_i$ is the class label of the ith support vector and $\alpha_i$ the positive parameter of the ith support vector determined by the SVM.
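The decision rule of Eq. (3) can be sketched directly. The example below assumes a plain dot-product kernel and, like Eq. (3), has no bias term; the support vectors, labels and $\alpha_i$ values are hypothetical and would normally come from SVM training.

```python
def linear_kernel(x, z):
    """Dot product, used here as a stand-in for the kernel psi of Eq. (2)."""
    return sum(a * b for a, b in zip(x, z))


def svm_decide(x, support_vectors, labels, alphas, kernel=linear_kernel):
    """Eq. (3): sign of the alpha- and label-weighted sum of kernel values."""
    total = sum(
        alpha * t * kernel(x, sv)
        for sv, t, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if total >= 0 else -1


# Hypothetical support vectors and coefficients, for illustration only.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

print(svm_decide((2.0, 0.5), support_vectors, labels, alphas))
print(svm_decide((-1.0, -3.0), support_vectors, labels, alphas))
```

Any other kernel with the same two-argument signature, for example a sequence-similarity measure, can be passed in through the `kernel` argument.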

SVMs have been widely applied to the prediction of functional sites in proteins. For instance, they were used for the prediction of translation initiation sites [45]. Interestingly, that work designed a novel kernel function which simply counted the number of nucleotides that coincide between two sequences. The kernel function was further improved based on the biological knowledge that local correlation information is important for translation initiation sites. SVMs were also used for the classification of proteins with a selective kernel scaling method [46], the prediction of alpha and beta turns [47], the prediction of phosphorylation sites [48], the prediction of T-cell receptors [49], the prediction of protein-protein interaction sites [50], the prediction of protein sub-cellular location [51], the prediction of enzyme active sites [52], the prediction of signal peptides [53], the prediction of HIV protease cleavage sites [54], and the prediction of membrane protein type [55-58]. A detailed review can be found in [59].

4. NEURAL NETWORKS

The neural network known as the multi-layer perceptron (MLP), trained with the error back-propagation algorithm [60], is one of the most popular pattern recognition algorithms used in bioinformatics, covering many areas. The basic principle of the MLP is to treat a mathematical model as a “black box”. This suits many real applications, as the exact mathematical model of a real system is commonly unknown. An MLP model is a fully connected network of three (or more) layers, as shown in (Fig. 5). If a numerical vector $\mathbf{x}_n$ is encoded from the nth input sub-sequence $\mathbf{s}_n$, each hidden neuron's output is defined as

$$z_h = f(\mathbf{w}_h^t \mathbf{x}_n) \tag{4}$$

where $\mathbf{w}_h$ is the weight vector connecting the hth hidden neuron to all the input neurons and $f()$ is a sigmoid function defined as

$$f(x) = \frac{1}{1 + e^{-x}} \tag{5}$$

The output function in terms of the hidden neurons is defined as

$$\hat{y}_n = f(\mathbf{w}_o^t \mathbf{z}_n) \tag{6}$$

where $\mathbf{w}_o$ is the weight vector connecting the output neuron to all the hidden neurons and $\hat{y}_n$ is the model output corresponding to the target value $y_n$. An error function is defined as

$$e = \frac{1}{2} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 \tag{7}$$

where $N$ is the total number of input vectors. The error back-propagation algorithm optimises the model parameters (weights $\mathbf{w}_h$ and $\mathbf{w}_o$) by minimising the error function using a gradient descent method. With the gradient descent method, the weights are updated progressively using the following equation

$$\Delta \mathbf{w} = -\eta \nabla e \tag{8}$$

where $\Delta \mathbf{w} = \mathbf{w}^{new} - \mathbf{w}^{old}$ is the change in the weights, $\eta$ is a learning rate and $\nabla e$ is the partial derivative of the error function with respect to the weights. This update process proceeds until the validation error starts to increase or the number of learning cycles exceeds a limit.

Fig. (7). A biased hyper-plane (dashed line) formed using a conventional classification algorithm and the optimal hyper-plane (thick line) formed using SVM. The thin lines represent the margin ($2\gamma$) boundaries using SVM. The open circles represent class A, the filled circles class B, and the shaded circle noise.

Fig. (8). XOR case for showing the function of hidden neurons used in the MLP. The open circles belong to one class and the filled circles to the other class. X and Y are two input variables while Z1 and Z2 are two hidden neurons (variables).
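One update of Eqs. (4)-(8) can be written compactly with NumPy. This is a sketch under simplifying assumptions: a single output neuron, no bias terms, batch updates over all N vectors, and function names chosen here for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))                     # Eq. (5)


def forward(X, W_h, w_o):
    Z = sigmoid(X @ W_h.T)                              # Eq. (4), shape (N, H)
    y_hat = sigmoid(Z @ w_o)                            # Eq. (6), shape (N,)
    return Z, y_hat


def error(y, y_hat):
    return 0.5 * np.sum((y - y_hat) ** 2)               # Eq. (7)


def gradient_step(X, y, W_h, w_o, eta=0.1):
    """One update w_new = w_old - eta * grad(e), i.e. Eq. (8)."""
    Z, y_hat = forward(X, W_h, w_o)
    delta_o = -(y - y_hat) * y_hat * (1.0 - y_hat)      # de/d(output pre-activation)
    grad_o = Z.T @ delta_o                              # de/dw_o, shape (H,)
    delta_h = np.outer(delta_o, w_o) * Z * (1.0 - Z)    # de/d(hidden pre-activation)
    grad_h = delta_h.T @ X                              # de/dW_h, shape (H, D)
    return W_h - eta * grad_h, w_o - eta * grad_o
```

In practice the step would be repeated while monitoring the error on a separate validation set, stopping when that error starts to rise or a cycle limit is reached, as described above.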

The function of the hidden neurons is to decompose the complicated (nonlinear) input space so that the hidden space (also called the feature space) is simpler. Fig. 8 shows a typical nonlinear case where two points belong to one class and the other two belong to the other class. Obviously, these two classes cannot be separated in the input space by any linear function. With two hidden neurons, the hidden space is as demonstrated on the right side of Fig. 8. It can be seen that the four points become linearly separable.
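The XOR case can be reproduced with two sigmoid hidden neurons whose weights are set by hand. The weights below are an assumption chosen for clarity (they are not the values behind Fig. 8, and nothing is trained here); they only show that a single line in the (Z1, Z2) space separates the classes.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden(x, y):
    """Map an input point (x, y) to the hidden space (z1, z2)."""
    z1 = sigmoid(20 * (x + y - 0.5))  # close to 1 if at least one input is 1
    z2 = sigmoid(20 * (x + y - 1.5))  # close to 1 only if both inputs are 1
    return z1, z2


def classify(x, y):
    """A single linear boundary, z1 - z2 = 0.5, in the hidden space."""
    z1, z2 = hidden(x, y)
    return 1 if z1 - z2 > 0.5 else 0


results = {(x, y): classify(x, y) for x in (0, 1) for y in (0, 1)}
print(results)
```

The points (0, 1) and (1, 0) both map close to (1, 0) in the hidden space, while (0, 0) and (1, 1) map close to (0, 0) and (1, 1), so one straight line is enough.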

The MLP has been applied extensively to the prediction of functional sites in proteins. Examples include the prediction of HIV protease cleavage sites [61,62], the prediction of the specificity of GalNAc-transferase [63,64], the prediction of β-turn types [65] and α-turn types [66], the prediction of protein domain structural class [67], the prediction of Hepatitis C virus protease cleavage sites [68] and the prediction of phosphorylation sites [69].

5. BIO-BASIS FUNCTION CLASSIFIER

The bio-basis function classifier (BBC) is based on the use of the bio-basis function [42,43]. BBC is a linear mathematical model of the bio-basis functions, with the assumption that the space spanned by the bio-basis functions is linearly separable, as shown in (Fig. 9), where there are two bio-bases (BBs) supported by two template SSs.

Fig. (9). The bio-basis function classifier with two bio-basis functions. The dark circles represent one class while the grey circles represent the other class. X1 is the BB supported by the template SS KPRT, while X2 is the BB supported by the template SS YKAE. W0 is the bias, while W1 and W2 are the two parameters associated with the two BBs.

To demonstrate that such an assumption is valid, two examples are given for the HIV-1 protease cleavage data, where there are 362 octapeptides, of which 114 are cleaved and the rest non-cleaved. Fig. 10 shows two contours using two cleaved (positive) octapeptides. On the left side, all the cleaved octapeptides are aligned with these two cleaved ones, showing large values. On the right side, all the non-cleaved (negative) octapeptides are aligned with these two cleaved ones, demonstrating small values. This evidence shows that the bio-basis function, combined with a proper pattern recognition algorithm, is able to carry out discrimination tasks.

BBC is composed of K inputs for K bio-bases. Each bio-basis is supported by a sub-sequence with a known property, i.e., with or without a functional site. The sub-sequence used by a bio-basis is referred to as a template sub-sequence (tSS). We refer to a sub-sequence as $\mathbf{s}_m$ and its target as $y_m \in \{0, 1\}$, where “1” means that $\mathbf{s}_m$ has a functional site and “0” that it has not. A mathematical model of a bio-basis function classifier for predicting whether there is a functional site in $\mathbf{s}_m$ is defined as [42,43]

$$\hat{y}_m = \sum_{k=1}^{K} w_k \phi(\mathbf{s}_m, \mathbf{t}_k) = y_m - e_m \tag{9}$$

where $\hat{y}_m$ is the prediction for $\mathbf{s}_m$, $w_k$ the weight of the kth bio-basis function (its tSS is $\mathbf{t}_k$), $e_m$ the error, and $\phi(\mathbf{s}_m, \mathbf{t}_k)$ the normalised similarity between $\mathbf{s}_m$ and $\mathbf{t}_k$. A feature matrix $\boldsymbol{\phi}$ has M rows for M outputs and K columns for K bio-bases, with $\phi(\mathbf{s}_m, \mathbf{t}_k)$ meaning the value from the kth bio-basis for the mth input. We refer to the target vector as $\mathbf{y} = (y_1, y_2, \ldots, y_M)^t$, the error vector as $\mathbf{e} = (e_1, e_2, \ldots, e_M)^t$ and the parameter vector as $\mathbf{w} = (w_1, w_2, \ldots, w_K)^t$. A vector-matrix notation of BBC is defined as

$$\mathbf{y} = \boldsymbol{\phi}\mathbf{w} + \mathbf{e} \tag{10}$$

Assuming that the error follows a Gaussian distribution, the pseudo-inverse method [70] can be used to solve the system

$$\mathbf{w} = (\boldsymbol{\phi}^t\boldsymbol{\phi})^{-1}\boldsymbol{\phi}^t\mathbf{y} \tag{11}$$

where $\boldsymbol{\phi}^t$ means the transpose of $\boldsymbol{\phi}$ and $(\boldsymbol{\phi}^t\boldsymbol{\phi})^{-1}$ the inverse matrix of $\boldsymbol{\phi}^t\boldsymbol{\phi}$.
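The weight estimate of Eq. (11) can be sketched with NumPy. The feature matrix below is hypothetical (M = 4 sub-sequences, K = 2 bio-bases), the 0.5 decision threshold is an assumption, and `numpy.linalg.lstsq` is used in place of an explicit matrix inverse because it gives the same least-squares solution when the matrix has full column rank while being numerically safer.

```python
import numpy as np


def fit_bbc(Phi, y):
    """Least-squares weights; equals (Phi^t Phi)^-1 Phi^t y for full column rank."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w


def predict_bbc(Phi, w, threshold=0.5):
    """Label a sub-sequence as functional if its linear output reaches the threshold."""
    return (Phi @ w >= threshold).astype(int)


# Hypothetical bio-basis outputs: rows are sub-sequences, columns are bio-bases.
Phi = np.array([[0.9, 0.2],
                [0.8, 0.3],
                [0.2, 0.7],
                [0.1, 0.9]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = fit_bbc(Phi, y)
print(w)
print(predict_bbc(Phi, w))
```

The first bio-basis, which responds strongly to the functional examples, receives a positive weight, and the second a small negative one.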

The BBC has been used for the prediction of trypsin cleavage sites [42], HIV cleavage sites [43], Hepatitis C virus protease cleavage sites [71], disordered proteins [72,73], phosphorylation sites [69], O-linkage sites in glycoproteins [74], caspase cleavage sites [75] and SARS-CoV protease cleavage sites [76]. In all of these cases, the BBC outperformed the other classification algorithms it was compared with, such as decision trees and neural networks with the back-propagation algorithm.

It should be noted that the system described in Eq. (11) is based on the assumption that there is no prior knowledge of the model parameters (weights). A lot of work has been done to deal with parameter priors. For instance, all the weights are assumed to follow one common Gaussian in the Bayesian net [77], while each weight is assumed to follow its own Gaussian in the relevance vector machine (RVM) [78].

However, we found, to our surprise, that the weights in a BBC model followed two approximate Gaussians for a discrimination task (Fig. 11). Further experiments showed that this phenomenon holds generally. This is contrary to our original belief that the weights should follow a single Gaussian or that each weight should follow its own Gaussian. The question is then whether the performance of the model can be improved if the true weight distribution can be recognised through automatic learning.

We then use the Bayesian method to re-construct BBCs. We use $\lambda$ for the hyper-parameters governing the error structure prior and the weight structure prior. The Bayes formula of the posterior probability is as follows

$$L = p(\mathbf{w}, \lambda \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{w}, \lambda)\, p(\mathbf{w}, \lambda)}{p(\mathbf{y})} \tag{12}$$

Fig. (10). The contours showing the discriminating capability of using bio-bases.

Fig. (11). The weight distribution for a discriminant analysis using BBC. Reproduced from Yang and Chou [74] with permission.


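where $p(\mathbf{w}, \lambda \mid \mathbf{y})$ is the posterior, $p(\mathbf{y} \mid \mathbf{w}, \lambda)$ the conditional probability, $p(\mathbf{y})$ the normalisation factor, and

$$p(\mathbf{w}, \lambda) = p(\mathbf{w} \mid \lambda)\, p(\lambda) \tag{13}$$

where $p(\mathbf{w} \mid \lambda)$ is the prior probability of the weights given the hyper-parameters and $p(\lambda)$ the a priori probability of the hyper-parameters. The posterior probability can be re-written as

$$L \propto p(\mathbf{y} \mid \mathbf{w}, \lambda)\, p(\mathbf{w}, \lambda) = p(\mathbf{y} \mid \mathbf{w}, \lambda)\, p(\mathbf{w} \mid \lambda)\, p(\lambda) \tag{14}$$

The method called MAP (maximum a posteriori) can be used for parameter estimation.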

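We use $\alpha$ and $\beta$ to refer to non-functional and functional SSs, respectively. The weights of the bio-bases supported by non-functional SSs follow one Gaussian, $w_{\alpha k} \sim N(u_\alpha, \sigma_\alpha = 1/\gamma_\alpha)$, and the weights of the bio-bases supported by functional SSs the other Gaussian, $w_{\beta k} \sim N(u_\beta, \sigma_\beta = 1/\gamma_\beta)$. It is also assumed that the error follows a Gaussian $N(0, \sigma_e = 1/\gamma_e)$. The hyper-parameters $\gamma_e$, $u_\alpha$, $\gamma_\alpha$, $u_\beta$ and $\gamma_\beta$ are assumed to follow a uniform distribution. As each bio-basis has an associated class label, the feature matrix can be expressed as $\boldsymbol{\phi} = \boldsymbol{\phi}_\alpha \cup \boldsymbol{\phi}_\beta$, where

$$
\boldsymbol{\phi}_\alpha =
\begin{pmatrix}
\phi(\mathbf{s}_1, \mathbf{t}_1) & \phi(\mathbf{s}_1, \mathbf{t}_2) & \cdots & \phi(\mathbf{s}_1, \mathbf{t}_{K_\alpha}) \\
\phi(\mathbf{s}_2, \mathbf{t}_1) & \phi(\mathbf{s}_2, \mathbf{t}_2) & \cdots & \phi(\mathbf{s}_2, \mathbf{t}_{K_\alpha}) \\
\vdots & \vdots & \ddots & \vdots \\
\phi(\mathbf{s}_M, \mathbf{t}_1) & \phi(\mathbf{s}_M, \mathbf{t}_2) & \cdots & \phi(\mathbf{s}_M, \mathbf{t}_{K_\alpha})
\end{pmatrix},
\quad
\boldsymbol{\phi}_\beta =
\begin{pmatrix}
\phi(\mathbf{s}_1, \mathbf{t}_1) & \phi(\mathbf{s}_1, \mathbf{t}_2) & \cdots & \phi(\mathbf{s}_1, \mathbf{t}_{K_\beta}) \\
\phi(\mathbf{s}_2, \mathbf{t}_1) & \phi(\mathbf{s}_2, \mathbf{t}_2) & \cdots & \phi(\mathbf{s}_2, \mathbf{t}_{K_\beta}) \\
\vdots & \vdots & \ddots & \vdots \\
\phi(\mathbf{s}_M, \mathbf{t}_1) & \phi(\mathbf{s}_M, \mathbf{t}_2) & \cdots & \phi(\mathbf{s}_M, \mathbf{t}_{K_\beta})
\end{pmatrix}
\tag{15}
$$

where $K_\alpha$ and $K_\beta$ are the numbers of non-cleaved and cleaved sub-sequences, respectively. Correspondingly, we have $\mathbf{w} = \mathbf{w}_\alpha \cup \mathbf{w}_\beta$, $\mathbf{y} = \mathbf{y}_\alpha \cup \mathbf{y}_\beta$ and $\mathbf{e} = \mathbf{e}_\alpha \cup \mathbf{e}_\beta$. Applying the negative logarithm to $L$ leads to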

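$$
\tilde{L} = -\ln L = \frac{1}{2}\left[ \gamma_e \lVert \mathbf{e} \rVert^2 + \gamma_\alpha \lVert \mathbf{w}_\alpha - \mathbf{u}_\alpha \rVert^2 + \gamma_\beta \lVert \mathbf{w}_\beta - \mathbf{u}_\beta \rVert^2 - M \ln \gamma_e - K_\alpha \ln \gamma_\alpha - K_\beta \ln \gamma_\beta + C \right] \tag{16}
$$

where $C$ is a constant, $\mathbf{u}_\alpha = \mu_\alpha \mathbf{i}_{K_\alpha}$ and $\mathbf{u}_\beta = \mu_\beta \mathbf{i}_{K_\beta}$. Letting the partial derivative of $\tilde{L}$ with respect to $\gamma_e$ be zero leads to

$$\gamma_e = \frac{M}{\lVert \mathbf{e} \rVert^2} \tag{17}$$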

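We use $\tau$ to represent either $\alpha$ or $\beta$. Letting the partial derivative of $\tilde{L}$ with respect to $\gamma_\alpha$ or $\gamma_\beta$ be zero leads to

$$\gamma_\tau = \frac{K_\tau}{\lVert \mathbf{w}_\tau - \mathbf{u}_\tau \rVert^2} \tag{18}$$

Letting the partial derivative of $\tilde{L}$ with respect to $u_\alpha$ or $u_\beta$ be zero leads to

$$u_\tau = \frac{\gamma_\tau\, \mathbf{w}_\tau^t \mathbf{i}_{K_\tau}}{1 + K_\tau \gamma_\tau} \tag{19}$$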

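Letting the partial derivative of $\tilde{L}$ with respect to $\mathbf{w}_\alpha$ or $\mathbf{w}_\beta$ be zero leads to

$$\gamma_e \boldsymbol{\phi}_\tau^t \boldsymbol{\phi}_\tau \mathbf{w}_\tau + \gamma_\tau \mathbf{w}_\tau = \gamma_e \boldsymbol{\phi}_\tau^t \mathbf{y}_\tau + \gamma_\tau \mathbf{u}_\tau \tag{20}$$

or

$$(\gamma_e \boldsymbol{\phi}^t \boldsymbol{\phi} + \gamma_\tau \mathbf{I})\,\mathbf{w} = \gamma_e \boldsymbol{\phi}^t \mathbf{y} + \gamma_\tau \mathbf{I}\,\mathbf{u} \tag{21}$$

where $\gamma_\tau \mathbf{I}$ is defined as follows: the first $K_\alpha$ diagonal elements are assigned the value $\gamma_\alpha$ and the last $K_\beta$ diagonal elements are assigned the value $\gamma_\beta$,

$$
\gamma_\tau \mathbf{I} =
\begin{pmatrix}
\gamma_\alpha & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \gamma_\alpha & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \gamma_\alpha & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & \gamma_\beta & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & \gamma_\beta & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & \gamma_\beta
\end{pmatrix}
\tag{22}
$$

Suppose $\tilde{\boldsymbol{\phi}} = \gamma_e \boldsymbol{\phi}^t \boldsymbol{\phi} + \gamma_\tau \mathbf{I}$ and $\boldsymbol{\upsilon} = \gamma_e \boldsymbol{\phi}^t \mathbf{y} + \gamma_\tau \mathbf{I}\,\mathbf{u}$. Eq. (21) can be re-written as

$$\tilde{\boldsymbol{\phi}}\, \mathbf{w} = \boldsymbol{\upsilon} \tag{23}$$

The estimate of the weights is $\mathbf{w} = \tilde{\boldsymbol{\phi}}^{-1} \boldsymbol{\upsilon}$.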
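The linear system of Eqs. (21)-(23) can be solved as sketched below for given hyper-parameter values. The function name, the argument names and the convention that the first `k_alpha` columns of the feature matrix belong to non-functional templates are assumptions made for this illustration; the hyper-parameters themselves are taken as inputs here rather than learned.

```python
import numpy as np


def map_weights(Phi, y, gamma_e, gamma_alpha, gamma_beta, u_alpha, u_beta, k_alpha):
    """Solve (gamma_e Phi^t Phi + Gamma) w = gamma_e Phi^t y + Gamma u.

    The first k_alpha columns of Phi are assumed to be bio-bases supported by
    non-functional templates, the remaining columns by functional templates.
    """
    k = Phi.shape[1]
    is_alpha = np.arange(k) < k_alpha
    gamma = np.where(is_alpha, gamma_alpha, gamma_beta)  # diagonal of Eq. (22)
    u = np.where(is_alpha, u_alpha, u_beta)              # stacked prior means
    A = gamma_e * Phi.T @ Phi + np.diag(gamma)           # left-hand matrix, Eq. (23)
    v = gamma_e * Phi.T @ y + gamma * u                  # right-hand vector, Eq. (23)
    return np.linalg.solve(A, v)
```

Two limiting cases give a quick sanity check: with very large $\gamma_\alpha$ and $\gamma_\beta$ the weights are pulled onto their prior means, and with both set to zero the solution reduces to the pseudo-inverse estimate of Eq. (11).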

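As the partial derivatives of $L$ with respect to some parameters are not in closed form, learning of the parameters, including the hyper-parameters, can be implemented using the principle of the expectation-maximisation (EM) algorithm. Each parameter is assigned a random value at the beginning. In the tth learning cycle of the E-step, the hyper-parameters are estimated as follows

$$
\begin{aligned}
\gamma_e(t+1) &= \frac{M}{\lVert \mathbf{e}(t) \rVert^2} \\
\gamma_\tau(t+1) &= \frac{K_\tau}{\lVert \mathbf{w}_\tau(t) - \mathbf{u}_\tau(t) \rVert^2} \\
u_\tau(t+1) &= \frac{\gamma_\tau(t+1)\, \mathbf{w}_\tau^t(t)\, \mathbf{i}_{K_\tau}}{1 + K_\tau \gamma_\tau(t+1)}
\end{aligned}
\tag{24}
$$

where $\lambda(t+1)$ is the newly estimated value for $\lambda$ at the tth learning cycle. In the tth cycle of the M-step, the network parameters are estimated as follows

$$
\begin{aligned}
\tilde{\boldsymbol{\phi}}(t+1) &= \gamma_e(t+1)\, \boldsymbol{\phi}^t(t+1)\, \boldsymbol{\phi}(t+1) + \gamma_\tau \mathbf{I}(t+1) \\
\boldsymbol{\upsilon}(t+1) &= \gamma_e(t+1)\, \boldsymbol{\phi}^t(t+1)\, \mathbf{y} + \gamma_\tau \mathbf{I}(t+1)\, \mathbf{u}(t+1) \\
\mathbf{w}(t+1) &= \tilde{\boldsymbol{\phi}}(t+1)^{-1}\, \boldsymbol{\upsilon}(t+1)
\end{aligned}
\tag{25}
$$

The stop criterion can be defined as

$$\lVert \mathbf{e} \rVert^2 < \varepsilon \tag{26}$$

The alternative is to measure whether the system has approached the steady state, where the system parameters do not change too much for a certain period $L$, i.e.,

$$\sum_{l=1}^{L} \left\lVert \mathbf{u}((l+1) \bmod L) - \mathbf{u}(l) \right\rVert < \delta \tag{27}$$

where $\varepsilon$ and $\delta$ are two small positive values.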

6. CASE STUDY: CASPASE CLEAVAGE SITE PREDICTION

Programmed cell death (apoptosis) is a common phenomenon of cell death. Its function is to maintain adult tissues and assist embryonic development. Apoptosis can balance cell proliferation and keep the number of cells in tissues constant [79-83]. It is estimated that about $5 \times 10^{11}$ blood cells are eliminated by programmed cell death daily in humans. The other important role of apoptosis is that it can eliminate potentially dangerous cells which may cause diseases. For instance, the frequent elimination of virus-infected cells is one of the main tasks of apoptosis. Through the programmed death of these cells, the production of new virus particles can be prevented and the spread of the virus through the host organism can be stopped [84]. Another example of apoptosis is found in the mammalian nervous system, where neurons are normally produced in excess and about 50% of newly developed neurons are eliminated by programmed cell death to ensure that the survivors are the best at making the correct connections with their target cells [85].

In the process of apoptosis, chromosomal DNA is first fragmented. The nucleus is then broken into small pieces. Following this, the cell shrinks and is also broken into membrane-enclosed fragments called apoptotic bodies, which are then removed from the tissues [86]. Fig. 12 shows a diagram of the process. Apoptosis is driven by proteases called caspases, so named because they have a cysteine (C) residue at their active sites and cleave after an aspartic acid (D) residue in their substrate proteins [87]. The caspases are synthesized as inactive precursors that are converted to the active form by proteolytic cleavage, catalyzed by other caspases. The first cleavage activity therefore starts a chain of cleavage activities.

The study of caspases is therefore critically important to many disease studies [82,88-93]. When apoptosis is over-active, diseases such as Parkinson’s or Alzheimer’s may occur [94]; on the other hand, when apoptosis is blocked or slowed down, cancer could develop [95,96]. We downloaded 13 sequences from NCBI (http://www.ncbi.nih.gov) for the experiment. These 13 proteins are O00273, Q07817, P11862, P08592, P05067, Q9JJV8, P10415, O43903, Q12772, Q13546, Q08378, O60216, and O95155. In total, there are 18 experimentally determined cleavage sites. Protein-oriented jackknife simulation is used. Protein-oriented jackknife means that, in each run of the simulation, one protein is left out and the remaining proteins are used for constructing a classifier. The constructed classifier is then used to test the left-out protein. Each protein sequence is scanned using a sliding window with a fixed size of 10.
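The protein-oriented jackknife amounts to a leave-one-protein-out split, which can be sketched as follows. The generator name is an assumption; the accession identifiers are the 13 listed above, and the training and testing steps are only indicated by a comment because the classifier is not specified at this point.

```python
def protein_jackknife(proteins):
    """Yield (held_out_id, training_ids), leaving one protein out per run."""
    ids = list(proteins)
    for held_out in ids:
        yield held_out, [p for p in ids if p != held_out]


accessions = [
    "O00273", "Q07817", "P11862", "P08592", "P05067", "Q9JJV8", "P10415",
    "O43903", "Q12772", "Q13546", "Q08378", "O60216", "O95155",
]

runs = list(protein_jackknife(accessions))
for held_out, training in runs:
    # Build a classifier from the sub-sequences of `training`,
    # then test it on the sub-sequences of `held_out`.
    print(held_out, len(training))
```

Splitting by protein rather than by sub-sequence keeps overlapping windows from the same protein out of the training set of the run in which that protein is tested.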

Fig. (12). The apoptosis process.

Decision trees, neural networks and support vector machines are used for comparison with bio-basis function classifiers. Neural networks did not work: whatever the number of hidden neurons used, the prediction was totally biased towards non-cleaved sub-sequences. This coincides with the phenomenon reported in [97], where the prediction accuracy was always biased towards the class with the majority of the inputs, known as “the problem of small disjuncts” in machine learning, e.g., Frayman and Wang, 1999 [98]. In this study, the ratio of the non-cleaved SSs over the cleaved ones is 2025.

In the simulation, the maximum number of learning cycles is 1000, unless the stopping criterion or the steady state of the mean parameters is satisfied first when estimating parameters for BBC. The Blosum62 matrix [99] was used to calculate pairwise homology alignment scores. With SVM, all non-linear kernel functions produced results similar to those of neural networks, while the linear kernel worked best, with C set to 100 for trading off between training error and regularisation ability. The package SVMlight [100] (http://svmlight.joachims.org/) was used.


Fig. 13 shows the performance comparison. The sub-sequence is represented by R5-R4-R3-R2-R1-R1'-R2'-R3'-R4'-R5', where the cleavage occurs between R1 and R1'. BBC0 denotes the original BBC, BBC1 the Bayesian BBC with one Gaussian and BBC2 the Bayesian BBC with two Gaussians. It can be seen that although BBC0 demonstrated higher total accuracy than BBC2, BBC0 demonstrated a very low true positive fraction (TPf). The two best models are BBC2 and SVM: their true negative fractions (TNfs) are 96% and 94%, their TPfs are 92% and 90%, and their precisions are 96% and 93%, respectively. The p-values of the t-test on TNf, TPf and precision between BBC2 and SVM are 0.0001, 0.8181 and 0.0002, respectively. The hypotheses that BBC2 shows similar TNf and precision to SVM are therefore rejected. At the same time, the hypothesis that BBC2 shows similar TPf to SVM cannot be rejected because the p-value is large (0.8181). This means that BBC2 significantly outperformed SVM in reducing the number of falsely predicted cleaved sub-sequences while maintaining similar performance in recognising the cleaved sub-sequences. The experiments showed that the true negative fraction, true positive fraction and total accuracy of BBC2 are 95.86 ± 1.11%, 92.31 ± 2.66% and 95.83 ± 1.10%, respectively.
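The text does not spell out how the three measurements are computed. The sketch below assumes the usual definitions from a confusion matrix; the counts in the example are hypothetical and are not the results of this study.

```python
def fractions(tp, tn, fp, fn):
    """Assumed standard definitions of the measurements used in the comparison."""
    return {
        "TNf": tn / (tn + fp),             # non-cleaved SSs correctly rejected
        "TPf": tp / (tp + fn),             # cleaved SSs correctly recognised
        "precision": tp / (tp + fp),       # predicted cleaved SSs that are truly cleaved
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts, for illustration only.
scores = fractions(tp=9, tn=90, fp=10, fn=1)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With heavily unbalanced classes, as in this example, total accuracy and TNf can both be high while precision remains modest, which is why the comparison reports the measurements separately.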

Fig. 14 shows a comparison among all models using two measurements, TNf and TPf. The best model should be located at the top-right corner and the worst one at the bottom-left corner. Any model located below the diagonal line from the top-left corner to the bottom-right corner is regarded as a failure. On this basis, it can be seen that the BBC2 models outperformed all the other models because they are closest to the top-right corner.

Fig. 15 shows the evolution of the mean values of the two Gaussians for the Bayesian BBC with two Gaussians. It can be seen that they evolve into two distinct Gaussians through learning. Note that the hyper-parameters are unknown in advance and can be learned through a MAP process. The upper lines represent the mean value of the Gaussian for the bio-bases supported by cleaved tSSs, while the lower lines represent the mean value of the Gaussian for the bio-bases supported by non-cleaved tSSs. This is not surprising, as the non-cleaved SSs tend to have no conserved patterns, leading to very divergent values of the bio-bases.

Fig. (14). The comparison among all the models using TNf and TPf. Adapted with permission from [75].

Fig. 16 shows how the error evolves during the learning process. After about 70 iterations, the system approaches a steady state.

7. FUTURE RESEARCH DIRECTION

Although the development and application of pattern recognition algorithms for the prediction of functional sites in proteins have been successful, some hard issues remain unsolved.

The first issue is the use of mutation matrices. It has been reported in [43] that the use of different mutation matrices leads to models with different performance. Fig. 17 shows the performance of BBC using six different mutation matrices (Dayhoff, Doolittle, Fitch, Gonnet, Grantham and Henikoff), where the circles represent the true negative fraction, the crosses the true positive fraction and the triangles the total accuracy. The graph also shows how the performance varies with the number of bio-bases. The optimisation of the mutation matrix for the purpose of functional site prediction therefore remains a difficult issue that has not yet been well answered.

Fig. (13). The performance comparison for window size of 10. The horizontal axis denotes the three measurements while the vertical axis denotes the performance. Adapted with permission from [75].

Fig. (15). The mean values of the two Gaussians evolved through learning.

Fig. (16). The evolved errors through learning.

Fig. (17). The performance of BBC for using different mutation matrices and different number of bio-bases. The triangles represent the mean accuracy while the circles and crosses the upper and lower bound of one standard deviation. Adapted with permission from [43].

The second issue is the ratio of the number of non-functional SSs to the number of functional ones. In this paper, it can be seen that the ratio is 2025. With this kind of ratio, a systematic method is desirable to handle the uncertainty in using the data for modelling. The common method is to randomly select roughly as many non-functional SSs as there are functional SSs. However, this kind of random selection introduces uncertainty. Feature selection methods used in pattern recognition could be one way to approach this issue.
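The common random selection described above, and the uncertainty it introduces, can be sketched as follows. This is only an illustration under stated assumptions: the function name `balanced_negatives`, the placeholder labels standing in for sub-sequences, and the data sizes are all hypothetical.

```python
import random

def balanced_negatives(functional, non_functional, seed):
    """Randomly pick as many non-functional SSs as there are functional SSs."""
    rng = random.Random(seed)  # seeded so that each draw is reproducible
    k = min(len(functional), len(non_functional))
    return rng.sample(non_functional, k)

# Hypothetical placeholders standing in for real sub-sequences.
functional = [f"pos{i}" for i in range(10)]
non_functional = [f"neg{i}" for i in range(200)]

draw_a = balanced_negatives(functional, non_functional, seed=1)
draw_b = balanced_negatives(functional, non_functional, seed=2)

# Different seeds generally give different negative sets, so models
# trained on them may differ; this is the source of the uncertainty.
shared = len(set(draw_a) & set(draw_b))
print(len(draw_a), len(draw_b), shared)
```

Repeating the draw with several seeds and reporting the spread of the resulting performance is one simple way to make this variability visible.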

8. SUMMARY

The applications of pattern recognition algorithms to the prediction of protein functional sites have been briefly reviewed. Along with the algorithms based on the frequency matrix developed in the early 1990s, linear discriminant analysis, quadratic discriminant analysis, neural networks, decision trees, support vector machines and bio-basis function classifiers have played different roles in this area, each contributing towards the final goal of computer-aided drug design and hence benefiting human health.

It should be noted that the use of pattern recognition algorithms for the prediction of protein functional sites has gone through a very interesting period of development. At first, the frequency matrix approaches were inspired by the biochemical observation that the frequencies with which residues occur in sub-sequences can be regarded as a natural law. After that, statistical methods such as linear discriminant analysis and quadratic discriminant analysis, which regard the residues in sub-sequences as variables, were developed. With this development, the work proceeded towards a more mathematical description of the specificity of protein functional sites. With the emergence of neural networks and support vector machines, the focus shifted more towards computing capability. Although the distributed encoding method has made it easier to apply state-of-the-art pattern recognition algorithms to protein functional site prediction, its success in coding the biological content of sequences has been challenged. Recently, the bio-basis function classifiers were developed with the aim of making models for the prediction of protein functional sites more biologically sound. However, the introduction of the bio-basis function classifiers has also posed new difficulties which need further study. Nevertheless, it is encouraging to note that the distorted key theory, originally proposed for finding peptide inhibitors against HIV protease [22,23], has very recently been used effectively to find inhibitors against the SARS (severe acute respiratory syndrome) enzyme [101-105].

REFERENCES

[1] Ksiazek, T.G., Erdman, D.V.M., Cynthia, P.H., Goldsmith, S., Sherif, M.S., et al., (2003) The New England Journal of Medicine, 348, 1947-1957.

[2] Marra, M.A., Jones, S.J., Astell, C.R., Holt, R.A., Brooks-Wilson, A., Butterfield, Y.S., Khattra, J., Asano, J.K., Barber, S.A., Chan, S.Y., et al., (2003) Science, 300, 1399-1404.

[3] Rota, P.A., Oberste, M.S., Monroe, S.S., Nix, W.A., Campagnoli, R., Icenogle, J.P., Penaranda, S., Bankamp, B., Maher, K., Chen, M.H., Tong, S., Tamin, A., Lowe, L., Frace, M., DeRisi, J.L., Chen, Q., Wang, D., Erdman, D.D., Peret, T.C., Burns, C., Ksiazek, T.G., Rollin, P.E., Sanchez, A., Liffick, S., Holloway, B., Limor, J., McCaustland, K., Olsen-Rassmussen, M., Fouchier, R., Gunther, S., Osterhaus, A.D., Drosten, C., Pallansch, M.A., Anderson, L.J. & Bellini, W.J. (2003) Science, 1st May.

[4] Thiel, V., Ivanov, K.A., Putics, A., Hertzig, T., Schelle, B., Bayer, S., Weissbrich, B., Snijder, E.J., Rabenau, H., Doerr, H.W., Gorbalenya, A.E. & Ziebuhr, J. (2003) Journal of General Virology, Online 19 June.

[5] Wang, M., Yao, J.S., Huang, Z.D., Xu, Z.J., Liu, G.P., Zhao, H.Y., Wang, X.Y., Yang, J., Zhu, Y.S. & Chou, K.C. (2005) Medicinal Chemistry, 1, 39-47.

[6] Altschul, S.F., Gish, W., Miller, W., Myers, E. & Lipman, D.J. (1990) J. Mol. Biol., 215, 403-10.

[7] Jones, D.T. & Ward, J.J. (2003) Proteins, S6, 573-578.

[8] Smith, T.F. & Waterman, M.S. (1981) J. Mol. Biol., 147, 195-197.

[9] Pearson, W.R. (1990) Methods Enzymol., 183, 63-98.

[10] Gribskov, M. & Robinson, N.L. (1996) Computers and Chemistry, 20, 25-33.

[11] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucl. Acids Res., 25, 3389-3402.

[12] Karplus, K., Barrett, C. & Hughey, R. (1998) Bioinformatics, 14, 846-856.

[13] Althaus, I.W., Chou, J.J., Gonzales, A.J., Diebel, M.R., Chou, K.C., Kezdy, F.J., Romero, D.L., Aristoff, P.A., Tarpley, W.G. & Reusser, F. (1993) Biochemistry, 32, 6548-6554.

[14] Althaus, I.W., Chou, J.J., Gonzales, A.J., Diebel, M.R., Chou, K.C., Kezdy, F.J., Romero, D.L., Aristoff, P.A., Tarpley, W.G. & Reusser, F. (1993) Journal of Biological Chemistry, 268, 6119-6124.

[15] Althaus, I.W., Gonzales, A.J., Chou, J.J., Diebel, M.R., Chou, K.C., Kezdy, F.J., Romero, D.L., Aristoff, P.A., Tarpley, W.G. & Reusser, F. (1993) Journal of Biological Chemistry, 268, 14875-14880.

[16] Althaus, I.W., Chou, J.J., Gonzales, A.J., Diebel, M.R., Chou, K.C., Kezdy, F.J., Romero, D.L., Aristoff, P.A., Tarpley, W.G. & Reusser, F. (1994) Experientia, 50, 23-28.

[17] Althaus, I.W., Chou, J.J., Gonzales, A.J., Diebel, M.R., Chou, K.C., Kezdy, F.J., Romero, D.L., Thomas, R.C., Aristoff, P.A., Tarpley, W.G. & Reusser, F. (1994) Biochemical Pharmacology, 47, 2017-2028.

[18] Althaus, I.W., Chou, K.C., Franks, K.M., Diebel, M.R., Kezdy, F.J., Romero, D.L., Thomas, R.C., Aristoff, P.A., Tarpley, W.G. & Reusser, F. (1996) Biochemical Pharmacology, 51, 743-750.

[19] Chou, K.C., Kezdy, F.J. & Reusser, F. (1994) Analytical Biochemistry, 221, 217-230.

[20] Chou, J.J. (1993) Biopolymers, 33, 1405-1414.

[21] Chou, J.J. (1993) Journal of Protein Chemistry, 12, 291-302.

[22] Chou, K.C. (1993) Journal of Biological Chemistry, 268, 16938-16948.

[23] Chou, K.C. (1996) Analytical Biochemistry, 233, 1-14.

[24] Chou, K.C., Tomasselli, A.L., Reardon, I.M. & Heinrikson, R.L. (1996) PROTEINS: Structure, Function, and Genetics, 24, 51-72.

[25] Chou, K.C. & Zhang, C.T. (1993) Journal of Protein Chemistry, 12, 709-724.

[26] Chou, K.C., Zhang, C.T. & Kezdy, F.J. (1993) Proteins: Structure, Function, and Genetics, 16, 195-204.

[27] Poorman, R.A., Tomasselli, A.G., Heinrikson, R.L., Kezdy, F.J. (1991) J. Biol. Chem., 22, 14554-14561.

[28] Thompson, T.B., Chou, K.C. & Zhang, C. (1995) Journal of Theoretical Biology, 177, 369-379.

[29] Zhang, C.T. & Chou, K.C. (1993) Protein Engineering, 7, 65-73.

[30] Blom, N., Gammeltoft, S., Brunak, S. (1999) J. Mol. Biol., 294, 1351-1362.

[31] Chou, K.C. (2001) PROTEINS: Structure, Function, and Genetics, 42, 136-139.

[32] Chou, K.C. (2001) Peptides, 22, 1973-1979.

[33] Chou, K.C. (2001) Protein Engineering, 14, 75-79.

[34] Chou, K.C. (2002) Protein and Peptide Science, 3, 615-622.

[35] Qian, N. & Sejnowski, T.J. (1988) J. Mol. Biol., 202, 865-884.

[36] Nielsen, H., Engelbrecht, J., Brunak, S., von Heijne, G. (1997) Protein Engineering, 10, 1-6.


[37] Hansen, J.E., Lund, O., Engelbrecht, J., Bohr, H., Nielsen, J.O. (1995) Biochem J., 30, 801-13.

[38] Gutteridge, A., Bartlett, G.J. & Thornton, J.M. (2003) Journal of Molecular Biology, 330, 719-734.

[39] Ehrlich, L., Reczko, M., Bohr, H. & Wade, R.C. (1998) Protein Eng., 11, 11-19.

[40] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) Atlas of protein sequence and structure, 5, 345-358.

[41] Johnson, M.S. & Overington, J.P. (1993) J. Mol. Biol., 233, 716-738.

[42] Thomson, R., Hodgman, T.C., Yang, Z.R. & Doyle, A.K. (2003) Bioinformatics, 19, 1741-1747.

[43] Yang, Z.R. & Thomson, R. (2005) IEEE Trans. on Neural Networks, 16, 263-274.

[44] Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York.

[45] Zien, A., Ratsch, G., Mika, S., Scholkopf, B., Lengauer, T., Muller, K.R. (2000) Bioinformatics, 16, 799-807.

[46] Zavaljevski, N., Stevens, F.J., Reifman, J. (2002) Bioinformatics, 18, 689-696.

[47] Cai, Y.D., Feng, K.Y., Li, Y.X., Chou, K.C. (2003) Peptides, 24, 629-630.

[48] Kim, J.H., Lee, J., Oh, B., Kimm, K., Koh, I. (2004) Bioinformatics, (in press)

[49] Zhao, Y., Pinilla, C., Valmori, D., Martin, R. & Simon, R. (2003) Bioinformatics, 19, 178-1984.

[50] Koike, A., Takagi, T. (2004) Protein Eng. Des. Sel., 17, 165-73.

[51] Chou, K.C. & Cai, Y.D. (2002) Journal of Biological Chemistry, 277, 45765-45769.

[52] Cai, Y.D., Zhou, G.P., Jen, C.H., Lin, S.L. & Chou, K.C. (2004) Journal of Theoretical Biology, 228, 551-557.

[53] Cai, Y.D., Lin, S. & Chou, K.C. (2003) Peptides, 24, 159-161.

[54] Yang, Z.R. & Chou, K.C. (2004) Bioinformatics, 20, 735-741.

[55] Cai, Y.D., Liu, X.J., Xu, X.B. & Chou, K.C. (2002) Electronic Journal of Molecular Design, 1, 219-226.

[56] Cai, Y.D., Pong-Wong, R., Feng, K., Jen, J.C.H. & Chou, K.C. (2004) Journal of Theoretical Biology, 226, 373-376.

[57] Cai, Y.D., Zhou, G.P. & Chou, K.C. (2003) Biophysical Journal, 84, 3257-3263.

[58] Wang, M., Yang, J., Liu, G.P., Xu, Z.J. & Chou, K.C. (2004) Protein Engineering, Design, and Selection, 17, 509-516.

[59] Yang, Z.R. (2004) Briefings in Bioinformatics, 5, 328-338.

[60] Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986) Learning internal representations by error propagation. In Parallel Distributed Processing, ed. Rumelhart, D.E. & J.L. McClell, Cambridge, MA: MIT Press, 1, 318-362.

[61] Cai, Y.D. & Chou, K.C. (1998) Advances in Engineering Software, 29, 119-128.

[62] Cai, Y.D., Yu, H. & Chou, K.C. (1998) Journal of Protein Chemistry, 17, 607-615.

[63] Cai, Y.D. & Chou, K.C. (1996) Analytical Biochemistry, 243, 284-285.

[64] Cai, Y.D., Yu, H. & Chou, K.C. (1997) Journal of Protein Chemistry, 16, 689-700.

[65] Cai, Y.D., Li, Y.X. & Chou, K.C. (1999) Advances in Engineering Software, 30, 347-352.

[66] Cai, Y.D. & Chou, K.C. (1999) Analytical Biochemistry, 268, 407-409.

[67] Cai, Y.D., Li, Y.X. & Chou, K.C. (2000) BBA, 1476, 1-2.

[68] Narayanan, A., Wu, X.K. & Yang, Z.R. (2002) Bioinformatics, 18, s5-s13.

[69] Berry, E., Dalby, A. & Yang, Z.R. (2004) Computational Biology and Chemistry, 28, 75-85.

[70] Duda, R.O., Hart, P.E. & Stork, D.G. (2002) Pattern Classification and Scene Analysis. New York, Wiley (2nd Edition)

[71] Yang, Z.R. & Berry, E. (2004) J. Computational Biology and Bioinformatics, 2, 511-531.

[72] Thomson, R. & Esnouf, R. (2004) Lecture Notes in Computer Science, 3177, 109-117.

[73] Yang, Z.R., Thomson, R. & Esnouf, R. (2005) Bioinformatics, 21, 1831-1837.

[74] Yang, Z.R. & Chou, K.C. (2004) Bioinformatics, 20, 903-908.

[75] Yang, Z.R. (2005) Bioinformatics, 21, 1831-37.

[76] Yang, Z.R. (2005) Bioinformatics, 21, 2644-2650.

[77] Nabney, I. (2003) Netlab: Algorithms for Pattern Recognition, Springer.

[78] Tipping, M.E. (2000) Advances in Neural Information Processing Systems, 12, 652-658.

[79] Mommers, E.C., van Diest, P.J., Leonhart, A.M., Meijer, C.J. & Baak, J.P. (1999) Breast Cancer Research Treatment, 58, 63-9.

[80] Chou, J.J., Li, H., Salvessen, G.S., Yuan, J. & Wagner, G. (1999) Cell, 96, 615-624.

[81] Chou, J.J., Matsuo, H., Duan, H. & Wagner, G. (1998) Cell, 94, 171-180.

[82] Chou, K.C. (2004) Curr. Med. Chem., 11, 2105-2134.

[83] Chou, K.C., Tomasselli, A.G. & Heinrikson, R.L. (2000) FEBS Letters, 470, 249-256.

[84] Malnati, M.S., Lusso, P., Ciccone, E., Moretta, A., Moretta, L. & Long, E.O. (1993) J. Exp. Med., 178, 961-9.

[85] Deshmukh, M. & Johnson, Jr.E.M. (1997) Molecular Pharmacology, 51, 897-906.

[86] Duke, R.C., Ojcius, D.M. & Young, J.D.E. (1996) Scientific American, December, 80-87.

[87] Chang, H.Y. & Yang, X. (2000) Microbiology and Molecular Biology Reviews, 64, 821-846.

[88] Chou, K.C., Jones, D. & Heinrikson, R.L. (1997) FEBS Letters, 419, 49-54.

[89] Rohn, T.T., Cusack, S.M., Kessinger, S.R. & Oxford, J.T. (2004) Experimental Cell Research, 293, 215-225.

[90] Adams, J.M. & Cory, S. (1998) Science, 281, 1322-1326.

[91] Even, G. & Littlewood, T. (1998) Science, 281, 1317-1322.

[92] Reed, J.C. & Paternostro, G. (1999) PNAS, 96, 7614-7616.

[93] Ethell, D.W., Bossy-Wetzel, E. & Bredesen, D.E. (2001) Caspase 7 can cleave tumor necrosis factor receptor-I (p60) at a non-consensus motif, in vitro. BBA, 1541, 231-238.

[94] Van de Craen, M., de Jonghe, C., van den Brande, I., Declercq, W., van Gassen, G., van Criekinge, Vanderhoeven, I., Fiers, W., van Broeckhoven, C., Hendriks, L. & Vandenabeele, P. (1999) FEBS Letters, 445, 149-154.

[95] West, K.A., Castillo, S.S., Dennis, P.A. (2002) Drug Resist. Update, 5, 234-248.

[96] Jakobi, R. (2004) Drug Resist. Update, 7, 11-17.

[97] Wilson, R.L. & Sharda, R. (1994) Decision Support Systems, 11, 545-557.

[98] Fraynam, Y., Ting, K.M. & Lipo Wang (1999) Proc. of International Joint Conference on Neural Networks, 4, 2490-2493.

[99] Henikoff, S. & Henikoff, J.G. (1992) PNAS, 89, 10915-10919.

[100] Joachims, T. (1999) Advances in Kernel Methods - Support Vector Learning, Schölkopf, B., Burges, C. & Smola, A. (ed.), MIT Press.

[101] Chou, K.C., Wei, D.Q. & Zhong, W.Z. (2003) Biochem Biophys Res Comm, 308, 148-151. (Erratum: ibid, 2003, 310, 675).

[102] Du, Q., Wang, S., Wei, D.Q., Sirois, S. & Chou, K.C. (2005) Analytical Biochemistry, 337, 262-270.

[103] Du, Q.S., Wang, S.Q., Wei, D.Q., Zhu, Y., Guo, H., Sirois, S. & Chou, K.C. (2004) Peptides, 25, 1857-1864.

[104] Sirois, S., Wei, D.Q., Du, Q. & Chou, K.C. (2004) J. Chem. Inf. Comput. Sci., 44, 1111-1122.

[105] Du, Q.S., Wang, S.Q., Jiang, Z.Q., Gao, W.N., Li, Y.D., Wei, D.Q. & Chou, K.C. (2005) Medicinal Chemistry, 1, 209-213.

[106] Chou, K.C., Zhang, C.T., Kezdy, F.J. & Poorman, R.A. (1995) PROTEINS: Structure, Function, and Genetics, 21, 118-126.

[107] Chou, K.C. (1995) Protein Science, 4, 1365-1383.

Received: March 30, 2005 Accepted: July 15, 2005
