the evolution of stochastic regular mo- tifs for protein ...bross/research/sredna_ngc.pdf · the...

The Evolution of Stochastic Regular Motifs for Protein Sequences 1

The Evolution of Stochastic Regular Mo-tifs for Protein Sequences

Brian J. ROSS

Brock University, Dept. of Computer ScienceSt. Catharines, Ontario, Canada L2S [email protected]

Abstract Stochastic regular motifs are evolved for protein sequencesusing genetic programming. The motif language, SRE-DNA, is a stochas-tic regular expression language suitable for denoting biosequences. Threerestricted versions of SRE-DNA are used as target languages for evolvedmotifs. The genetic programming experiments are implemented in DCTG-GP, which is a genetic programming system that uses logic–based at-tribute grammars to define the target language for evolved programs.Earlier preliminary work tested SRE-DNA’s viablility as a representationlanguage for aligned protein sequences. This work establishes that SRE-DNA is also suitable for evolving motifs for unaligned sets of sequences.∗1

Keywords Protein Motifs, Stochastic Regular Expressions, Grammat-ical Genetic Programming, Evolutionary Computation.

§1 IntroductionA motif is a representation modeling some shared characteristic of a

family of proteins.5) There are a number of ways in which motifs can be used.The most common use of a motif, as considered in this paper, is to act as a devicefor characterizing the sequence pattern common to a particular protein family,and therefore distinguishes them from unrelated sequences. In other words, amotif is a signature for a protein family. Once an effective motif is establishedfor a set of protein sequences, it can be used to query a database of proteinsin order to extract the sequences belonging to the family. This permits thediscovery of structural similarities between new sequences added to the database∗1 New Generation Computing, vol.20, n.2, Feb. 2002, pp. 187-213.

2 Brian J. ROSS

and others whose functionalities have been established earlier. Hence, motifs aremeans for determining potential protein functionalities for new sequences, whichis obviously an important practical tool for biologists.

An active research area is automated motif discovery. Given the grow-ing rate at which biosequences are being cultivated and added to databases, themanual construction of useful motifs becomes increasingly impractical, given thesheer volume of data to be analyzed. Furthermore, manual design of motifs iserror prone, since a human being may not recognize the subtleties that deter-mine protein family membership. The use of machine learning techniques forautomatic motif synthesis has been studied by many researchers.3, 6) Given thetechnical complexities of predicting protein functionality from raw biosequences,there does not yet exist a machine learning technology which can determine themost biologically pertinent motif for a given family of sequences. In the mean-time, however, different machine learning algorithms and sequence representa-tions can serve as tools that geneticists can use to study new sequence patterns,and therefore aid them in deriving the most biologically sound representationfor new families of interest.

This paper describes research in the automatic synthesis of biosequencemotifs using evolutionary computation. A contribution of this work is the in-troduction of a new motif language, stochastic regular expressions (SRE-DNA).SRE-DNA has characteristics of other, more established motif representations.Its regular expression bases is similar to commonly used regular expression motiflanguages, such as that used by the PROSITE protein database.12) On the otherhand, its probabilistic nature is similar to stochastic biosequence representationssuch as HMM’s.22) Another goal of this paper is to show that SRE-DNA motifsare readily synthesizable by genetic programming. SRE-DNA’s stochastic mod-eling of biosequences can be exploited directly by the evolution process, bothduring fitness evaluation, and during the evolutionary search itself. It is not agoal of this paper to suggest that this approach to motif discovery is necessarilysuperior to other established techniques, but rather, to illustrate that the au-tomatic discovery of stochastic motifs is feasible using SRE-DNA and geneticprogramming.

Section 2 reviews the problem of motif identification. The SRE-DNAmotif language is outlined in Section 3. Genetic programming is reviewed in Sec-tion 4. Section 5 overviews the design of the experiments. Results are presentedin Section 6, and evaluated in Section 7. Comparisons to related work are given


in Section 8. Conclusions and future directions conclude the paper in Section 9.

§2 Biosequence Identification

2.1 Motif RepresentationsGenetic similarities due to common evolutionary origins amongst differ-

ent species can often be identified by the similar protein functionalities observedat the bio-molecular level. The precise functionality of a transcribed proteinis fundamentally determined by its 3D structure. Although of primary impor-tance, such 3D structures are difficult and impractical to ascertain directly frombiosequences themselves, and thus remains a critically important open problemin bioinformatics. Consequently, more rudimentary characterizations are usedfor protein classification and prediction. Motifs that model protein families viathe shared similarities in their biosequence composition are widely used. Eventhough such motifs are crude, indirect denotations of the real factor of impor-tance (3D structure), often they are currently the most practical means for char-acterizing protein families, due to their parsimony, efficiency of interpretation,and amenability to automatic acquisition from raw sequences.

Many factors are pertinent to biosequence representation, which dif-ferent motif representations may incorporate in varying degrees.5, 8) Of centralimportance is the ability to represent conserved regions of amino acid sequences.These are the sequences common within a protein family, where differences areusually minor, and inserts and deletes are rare. Typically the conserved regionis surrounded by similar sequential patterns unique to an identified protein, butwhich vary in composition, size, and location according to the particular speciesof interest. Naturally, when larger sequence lengths in the vicinity are consid-ered, wider variability to motif patterns are introduced; a point will be reachedwhen the length is too large to be representationally beneficial. Consensus pat-terns can vary with respect to their location within supersequences. They canalso repeat within sequences, and possibly overlap. What can greatly complicatethe use of motifs is the phenomena of convergent evolution, in which two sepa-rate evolutionary paths have evolved completely different sequences, but whichhave similar (converged) protein functionalities. In other words, the 3D struc-ture of these different sequences yield similar functionalities. Such cases prohibitstraight-forward use of motifs, and they will not be considered further in thispaper.

4 Brian J. ROSS

Different motif languages have different linguistic strengths and weak-nesses with respect to biosequence representation. The simplest motif is a con-sensus sequence, which defines the exact sequence of amino acid codons commonto a family of proteins. Since such common sequences may be too small to bepractical, longer consensus patterns are often used. Pattern representationspermit variability, which reflect species diversity due to evolution. Consensuspatterns may be symbolic, for example, regular languages 2, 6, 7) and higher-levelgrammars.27, 28) Symbolic representations are called such because the structureof the motif is directly mappable to the symbolic representation within the mo-tif expression. This lends symbolic representations such as regular expressionstheir greatest advantage: the user-level interface to a family of sequences is analgebraic abstraction of the consensus pattern. Such motifs are akin to querylanguages used in conventional databases. A disadvantage of symbolic represen-tations is that more complex protein families result in correspondingly complexsymbolic expressions. Numeric and probabilistic representations are also com-monly used in motif representations.8, 22, 16) Their main advantages are theirinherent ability to account for structural variation, and their natural encodingof “scores” with which to heuristically judge the relative similarity of candidatesequences to a family profile. An instance of a motif language sharing both sym-bolic and numeric features is the stochastic context-free grammar of Sakakibaraet al.,26) as well as the stochastic regular expression language used in this paper.

The comparative linguistic power of different motif representations canbe better understood when formal language theory is considered.14) An impor-tant insight obtained from such a viewpoint is that motif languages can beformally categorized with respect to their expressive power, according to theirposition in the Chomsky hierarchy. For example, all regular motif languages areexpressively equivalent; regular expressions are no more or less powerful thanregular automata such as HMM’s. Furthermore, context-free grammar motifsare representationally more powerful than regular language motifs. Formal lan-guage theory also lends insight into the complexity of recognizing sequences withrespect to motif representations. For example, strings can be recognized by reg-ular languages in polynomial time. This means that efficient database access ispossible for regular motif languages; such advantages will be lost if higher-leveldenotations such as context-sensitive grammars are used. Finally, a language-theoretic view of motif representation can indicate if the automatic induction ofmotifs is feasible.


Not withstanding the advantages of characterizing motifs as formal lan-guages, this purely formal perspective can be somewhat myopic. At a funda-mental level, biosequences are indeed programs: they are used as instructionsduring protein translation. However, this does not imply that a family of biose-quences is most naturally or effectively denoted by a particular formal languagein the Chomsky hierarchy. Although symbolic or numeric representations canbe useful classification and prediction tools, they are not designed as modelsof the pertinent structural effects that determine protein functionality. For ex-ample, a regular expression motif may adequately segregate a family of proteinsequences from non-family sequences. However, this sequence-level classifica-tion does not account for the deeper structural principles defining the family.A more intelligent motif representation would use 3D stuctural information andother domain-specific knowledge. Such a motif representation would likely betoo complex for user-specified protein classification and database access, at leastcompared to regular motifs. Despite shortcomings, there are often benefits tosimplicity.

2.2 Regular Expression MotifsRegular expressions are widely used as symbolic motif languages, and

are used in protein databases such as PROSITE and its variants.12) Their popu-larity as motif languages arises from their simplicity, which makes them straight-forward to learn; their expressive adequacy for representing relatively small se-quences; and their computational tractability with respect to interpretation andautomatic acquisition. A hierarchy of regular expression motif languages hasbeen proposed.6) This hierarchy identifies the various degrees of expressivenesswith which nondeterministic choice of codons and gap expressions can be ar-ticulated. For example, the most deterministic category (class A) is one thatdenotes simple sequences of required codons (“t-c-t-t-g-a”). Class B extendsclass A with a wildcard “x” character, which substitutes for any codon (“t-c-x-x-g-a”). Further classes introduce additional devices. At the most expressiveextreme, class I languages are those which additionally permit nondeterministicskip expressions, indeterminate gaps, and alternate choices (masks):

d-t-x(2,4)-v-*-a-x-[nq]-g

Here, “x(2,4)” denotes a skip of length 2 to 4, “*” is an indeterminate gap, and“[nq]” denotes a choice of either n or q.

6 Brian J. ROSS

Note that formal language theory states that all regular languages, andhence all the motif classes discussed above, are equivalent with respect to thelanguages denotable. However, the introduction of new language constructs en-ables motif expressions to be more convenient and parsimonious. Higher classesof regular expressions can characterize particular protein patterns more con-cisely than more rudimentary classes. Hence the concept of expressiveness asused in the classification scheme in 6) is one of user-oriented convenience, ratherthan a language-theoretic one. Nevertheless, this classification scheme can haveramifications on the tractability of automatic expression synthesis performed viamachine learning algorithms.

2.3 Automatic Motif AcquisitionMuch work has been done on machine learning techniques for biose-

quences identification.6, 3) The survey in 6) lists 26 different algorithms, publishedbetween the years 1983 and 1996. Most work has focussed on regular languagerepresentations, since they are efficient to learn, and both adequate and prac-tical for the relatively low complexity of sequences being analyzed. Motif dis-covery is an instance of a classical machine learning problem: formal languageinduction.17) A successful motif should accurately identify a sequence belong-ing to a given family, while at the same time, reject sequences that are notmembers. This implies the existence of two sets of examples – a positive setconsisting of protein sequences, and negative examples which are disjoint fromthe positive set. Motif discovery algorithms that use this criteria are classifi-cation algorithms, as opposed to conservation algorithms which only use a setof positive example sequences.6) Successful classification requires a balance be-tween recognizing member sequences and rejecting non-member sequences in thetraining set, while at the same time being general enough to recognize and rejectsequences not seen in the original training sets.

There are other dimensions by which motif learning algorithms can beclassified. The solution space of an algorithm is the type of representation usedto denote hypotheses, and may include regular consensus patterns, weight matri-ces, Bayesian networks, or Hidden Markov Models (HMM). The alphabet usedby a representation is also pertinent, as are the generalizations used by the motiflanguage, for example, wildcard characters that denote sequence gaps. Learningalgorithms are also distinguishable by such factors as whether sequences mustbe prealigned or not, whether they guarantee solutions, their computational


efficiency, and their overall effectiveness in biological applications. The effec-tiveness of a learning algorithm is difficult to define, since different algorithmsexhibit particular advantages and disadvantages in their natural domains. Con-sequently, the various approaches to motif learning have not yet been empiricallycompared with one another.

2.4 Motifs and Genetic ProgrammingGenetic programming (GP) is a machine learning paradigm, and it is re-

viewed in Section 4. GP has been successfully applied towards various problemsin biosequence analysis.19, 10, 11, 20) Most of these applications involve the evolu-tion of programs which identify various properties of biosequences, for example,intracellular or extracellular portions of sequences.

There are a couple of examples of the use of GP to evolve biosequencemotifs. Hu uses GP to evolve motifs for unaligned example sequences, using aregular language equivalent to that used by PROSITE.15) Expressions are evolvedfrom sets of unaligned example sequences. Hu’s evolution algorithm uses a localoptimization step, in which expression terms denoting gaps are refined. A num-ber of protein families were studied, and the evolved solutions were often verysimilar to the source PROSITE motifs used to extract the example sequences.

Koza, Bennett, Andre and Keane use GP with ADF’s (automaticallydefined functions) to evolve a motif for a few unaligned sequence examples.21)

An ADF is a type of module or subroutine. The regular motif language used issimpler than the full PROSITE language as used by Hu. The protein familiesstudied contain sequence repetitions, which makes the use of ADF’s advanta-geous, since the ADF’s modularize the repeated structures. Koza did not specifyany motif parameterization requirements (eg. window size), other than overallexpression depth limits. The motif evolved for one case was found to be moreaccurate than the accepted one on file for that protein family.

§3 Stochastic Regular Expressions

3.1 SREStochastic Regular Expressions (SRE) is a probabilistic regular expres-

sion language, in which regular expressions14) are embellished with probabilityfields.23) A similar language was previously proposed by Garg, Kumar and Mar-cus, who prove a number of mathematical properties of the language.9)

8 Brian J. ROSS

Unrestricted SRE has the following form. Let E range over SRE, α rangeover atomic actions, n range over integers (n ≥ 1), and p range over probabilities(0 < p < 1). SRE syntax is:

E ::= α | E : E | E∗p | E+p | E1(n1) + ...+ Ek(nk)

The terms denote atomic actions, concatenation, iteration (Kleene closure or* iteration, and + iteration), and choice. The + iteration operator, E+p, isequivalent to E :E∗p.

The semantics of the language are briefly outlined as follows. Withchoice, each term Ei(ni) is chosen with a probability equivalent to ni/Σj(nj).For example, a(1) + b(2) + c(3) means that the terms are selected with probabil-ities of 1/6, 1/3 and 1/2 respectively. With the Kleene closure term E∗p, eachiteration of E occurs with a probability p, and the termination of iteration hasa probability 1 − p. Probabilities between terms propagate in an intuitive way.For example, with concatenation, the probability of E : F is the probability ofE multiplied by the probability of F . Therefore, the probability of E+p is thesame as E : E∗p, which is the probability of E multiplied by the probability ofthe Kleene iteration.

The overall effect of this probability scheme is that an SRE expressiondefines a sound, well-formed model of probability: each expression defines aprobability function. The sum of all the probabilities for all s ∈ L(E) is 1.23)

Furthermore, each string s ∈ L(E) has an associated probability, while anys 6∈ L(E) has a probability of 0. This property is important for this research,since biosequences will be associated with probabilities when given to particularSRE-DNA motifs. A task of motif synthesis will therefore be to evolve motifsthat yield high probabilities for candidate strings.

An example SRE expression is:

(a : b∗0.7)(2) + c∗0.1(3)

It recognizes string c with Pr = 0.054 (the term with c can be chosen withPr = 3/(2 + 3) = 0.6; then that term iterates once with Pr = 0.1; finally theiteration terminates with Pr = 1 − 0.1 = 0.9, giving an overall probability of0.6× 0.1× 0.9 = 0.054). The string bb is not recognized; its probability is 0.

An SRE interpreter is implemented and available for GP fitness func-tions. Like the case with conventional regular expressions,14) string recognitionfor SRE expressions is of polynomial time complexity. To test whether a strings is a member of an SRE expression E, the interpreter attempts to consume s


with E. If successful, a probability p > 0 is produced. Unsuccessful matcheswill result in probabilities of 0. The SRE-DNA interpreter only succeeds if anentire SRE-DNA expression is successfully interpreted. For example, in E1 :E2,if E1 consumes part of a string, but E2 does not, then the interpretation failsand yields a probability of 0.

3.2 SRE-DNAThe protein sequences studied here are fairly restricted in form. This is

evident if one examines the source PROSITE motifs used to access them fromthe database. This implies that unrestricted SRE may be too descriptively richfor the motifs required here. This would not normally pose a problem, sincea linguistically richer language may enjoy benefits over weaker ones. Unfor-tunately, unrestricted SRE does indeed pose problems during interpretation ofexpressions, because of the nature of the iteration and choice operators. Al-though regular expression and SRE expression interpretation is of polynomialcomplexity, expression interpretation is of combinatorial complexity with respectto expression size. For example, the expression ((a)∗)∗ can generate the stringa...a of length k a total of 2k different ways, due to the combinatorial numberof ways the two nested iteration operators can interact.

Early work on this project found that inefficient SRE expressions withnested iteration and choice were commonly constructed, and their slow interpre-tation made processing impossible.25) As a consequence, a restricted version ofSRE, SRE-DNA, is more practical. The details of the grammatical constraintsin SRE-DNA are described in Section 5.2, where three variations of SRE-DNAare introduced. All the variants ignore the choice operator, and restrict iterationso that terms with iteration are guaranteed to be constructive.

SRE-DNA also introduces mask terms, which are sets of amino acidcodons. A mask within an SRE-DNA expression means that any of the listedcodons is permissible at that portion of the sequence. For example, the mask inthe expression

a : [b, c, d] : e

denotes a choice of b, c, or d, each with an equal probability of 1/3. Hence thestrings abe, ace, and ade are valid strings of the expression, and each have aprobability of 1/3. SRE’s choice (+) operator is more general than a mask, andcan denote the same probabilistic languages. For example, SRE would denotedthe above as:

10 Brian J. ROSS

a : (b(1) + c(1) + d(1)) : e

Choice expressions also permit the modeling of more informative codon proba-bility distributions that might exist in real familes. For example,

a : (b(120) + c(12) + d(3)) : e

says that codon b has a probability of 120/135 or 88.9%. This expressiveness isnot available in a mask set, which treats all codons with the same probability.

Despite the weaker stochastic expressiveness of masks compared to choiceexpressions, masks are more concise, and consequently lend efficiency to evolu-tion during motif synthesis. Earlier work found that the choice operator isdetrimental to motif evolution for the proteins studied here, because it promotesGP intron material (expression bloat).25) Another reason to adopt masks is thatPROSITE motifs use them, and so it might be possible to evolve SRE expres-sions similar to the non-probabilistic regular motifs used by PROSITE. If a moreprecise denotation of codon distributions is required, choice expressions shouldbe investigated further.

Core regions of conserved motifs are ungapped, in the sense that in-sertions or deletions can change the activity dramatically. Nevertheless, as alinguistic convenience, the use of skip (gap) expressions is convenient in thecontext of regular motif languages such as SRE-DNA and PROSITE. This isbecause the purpose of such regular motif expressions is to concisely denote thecommon amino acids within a protein sequence. Since there will areas of pat-tern variability within family sequences, they are best denoted by gaps. Ideally,the resulting codons resident in a motif will be the commonly shared ones inthe family. The alternative is to represent all possible codons at these positionsexplicitly with masks or choice expressions, as is done in 21). This would resultin large motifs whose compactness and utility is lost.

Skip expressions are implemented via the code x, which is a wildcardthat replaces any codon in a sequence. SRE-DNA will combined this with iter-ation operators (for example, “c∗.10”), which permits variable-length gaps to berepresented (details in Section 5.2).

§4 Genetic ProgrammingGenetic programming 4, 18, 21) is a method of automatic programming in

which programs are evolved using a genetic algorithm.13). Both GP and GA arecharacterized by their use of the following (see Fig. 1): (i) an initial popula-


1. Initialize: Generate initial randomized population.2. Evolution:

GenCount := 0Loop while GenCount < maximum generations

and fitness of best individual not considered a solution {Loop until New population size = max. population size {

Select a genetic operation probabilistically:→ Crossover:

Select two individuals based on fitness.Perform crossover.

→ Mutation:Select one individual based on fitness.Perform mutation.

Add offspring to new population.}

GenCount := GenCount+1}

3. Output: Print best solution obtained.

Fig. 1 Genetic Algorithm

tion of random individuals, which in the case of GP are randomly–constructedprograms; (ii) a finite number of generations, each of which results in a newor replenished population of individuals; (iii) a problem–dependent fitness func-tion, which takes an individual and gives it a numeric score indicative of thatindividual’s ability to solve a problem at hand; (iv) a fitness–proportional se-lection scheme, in which programs are selected for reproduction in proportionto their fitness; (v) reproduction operations, usually the crossover and mutationoperations, which take selected programs and generate offspring for the nextgeneration.

The essential difference between GP and GA is the denotation of in-dividuals in the population. A pure GA uses genotypes that are fixed–lengthbit strings, and which must be decoded into a phenotype for the problem be-ing solved. A GP uses a variable–length tree data structure genotype, whichis directly interpretable as a computer program by some interpreter. The useof programming code as genotype is a powerful and practical representation for

12 Brian J. ROSS

solving a wide variety of problems.The two main reproduction operators used in GP are crossover and

mutation. Crossover permits the genetic combination of program code fromprograms into their offspring, and hence acts as the means for inheritance ofdesirable traits during evolution. Crossover takes two selected programs, findsa random crossover point in each program’s internal representation (normally aparse tree), and swaps the subtrees at those crossover points. Mutation finds arandom node in a selected program, and the subtree at that node is replaced witha new, randomly generated tree. This is the means by which new genetic traitscan be introduced into the population during evolution. Although crossover andmutation preserve the grammatical integrity of programs, the user must ensureclosure – that the resulting programs are always executable. So long as closureis maintained, all programs derivable by the GP system will be executable bythe fitness function, and hence their fitness will be derivable.

The GP implementation used in this paper is DCTG-GP. DCTG-GPis a grammatical genetic programming system.24) It uses logical grammars fordefining the target language for evolved programs. The logic grammar formalismused is definite clause translation grammars (DCTG).1) A DCTG is a logicalversion of a context-free attribute grammar, which allows the complete syntaxand semantics of a language to be defined in a unified framework. DCTG-GP is useful for these experiments because of the ease with which variations ofmotif languages can be designed and implemented (see Section 5.2). The use ofgrammars within GP also helps enhance search efficiency, by pruning the searchspace into more sensible structures.

§5 Experiment

5.1 Protein SequencesAs discussed in Section 2, motif classification algorithms require the dis-

covery of an expression that recognizes positive protein sequence members, whilerejecting negative sequences. The positive examples used here comprise a setof unaligned protein sequences. To generate this set, aligned protein sequenceswere initially obtained from the SWISS-PROT and TrEMBL sequence databasesfor the protein families listed in Table 1.12)∗2. For example, on August 9 2000,seaching the expression “snake toxin” resulted in 176 instances in the SWISS-∗2 Accessed between July and November 2000, at http://expasy.cbr.nrc.ca/sprot/


Table 1 Protein families and related parameters

Protein (accession #) lab m p w s1 s2

Amino acid oxidase (PS00677) AAO 5 1e-12 19 8 −Scorpion toxin (PS001138) ScT 5 1e-12 24 20 −Zinc finger, C2H2 type (PS00028) ZF1 10 1e-13 25 29 678Zinc finger, C3HC4 type (PS00518) ZF2 8 1e-13 10 21 168Snake toxin (PS00272) SnT 5 1e-12 22 18 127Kazal inhibitor (PS00282) KI 5 1e-12 24 24 125

PROT database, and an additional 98 examples from TrEMBL. Each examplein the database is a contribution by biologists, and contains a protein patternfrom various species and genome locations. Note that many sequences are dupli-cated in the databases. In addition, the entire family of snake toxin patterns hasa reference PROSITE motif expression. This pattern is useful for comparisonwith SRE-DNA motifs, as there are enough similarities in the PROSITE regularexpression language and SRE-DNA that similar patterns can often be detectedbetween them in many experiments.

The following is done to create a set of unaligned sequences from theabove aligned data. First, duplicate sequences are removed. Then the alignedsequences are padded evenly on each side with randomly generated sequences ofamino acid codons. The length of unaligned sequences for all experiments is 150codons. When possible, a subset of these unaligned sequences is used as trainingdata, and the remainder are testing data. Otherwise, in cases when there arefew examples, the entire set of sequences is used for training. Using artificialsequences such as these has advantages and disadvantages. A possible disadvan-tage is that, unlike real unaligned data, the synthesized unaligned data’s randompadding is unlikely to share subsequences outside of the functional domain ofthe protein in question. This suggests that the unaligned sequences we are usingmay be easier to process than what might be seen in a more realistic productionenvironment. On the other hand, an advantage in using them is that distract-ing noise is being avoided. This permits the experiments to focus on the basicproblem of SRE-DNA motif synthesis, without becoming encumbered by noisyinstances of shared artifacts that arise in real unaligned sequences. In the future,the problem of evolving SRE-DNA motifs in a noisy, more realistic environment

14 Brian J. ROSS

should be addressed.Negative examples are randomly-generated amino acid sequences, each

having a length approximately the size of the original aligned PROSITE data(the window size). Again, this is not as realistic as is possible. A more chal-lenging negative set would consist of a variety of real sequences, many of whichmight be closely related to the family belonging to the positive training set. Forthe purposes of this research, however, a randomly-generated negative set is ade-quate for ensuring that evolved motif expressions are sufficiently discriminating.

Table 1 summarizes characteristics of the example protein sequencesused for the GP experiments. The amino acid oxidase and scorpion toxins werechosen because the PROSITE source motifs were of intermediate complexity.The zinc fingers, snake toxin, and kazal inhibitors were chosen in order to com-pare the results to that of Hu.15) The protein names and PROSITE accessionnumbers are in column 1. Column 2 (lab) contains a shorthand label used insubsequent figures. The m column gives the maximum mask size permitted inSRE-DNA expressions for that protein family. It is based on the maximum masksize used in the PROSITE motif for that family, and is typically a little largerthan used by PROSITE in order to give motif evolution some extra freedom.The minimum probability value (p) is the minimum computed probability re-quired by an SRE-DNA expression to continue processing, before terminatingexpression interpretation. The w value is the size of the window used for thatprotein set, and is configured to be large enough to cover the largest alignedsequence for the family as defined by its source PROSITE motif. Finally, s1 ands2 are the sizes of the disjoint training and testing sets respectively.

5.2 SRE-DNA DefinitionDCTG-GP was used to derive 3 restricted variations of SRE-DNA.

Grammars for these variants are shown in Figure 2. All the grammars offervarious levels of grammatical constraints on full SRE, while at the same timeuse important SRE-DNA operations such as probabilistic skip and/or iteration.The grammars also encode general characteristics similar to what are found inestablished PROSITE motifs for protein families. In particular, the use of alter-nating skip and mask expressions will be favoured in evolved solutions.

One motivation for using constrained SRE-DNA is the fact that unre-stricted SRE-DNA will generally result in inefficient expressions. For example,previous work25) established that SRE-DNA’s choice operator is distinctly un-


G1 expr ::= guard | guard :expr | expr+p

guard ::= mask | mask :skipskip ::= x+p

G2 expr ::= guard | guard :exprguard ::= mask | mask :skipskip ::= x∗p | x+p

G3 expr ::= gexpr | gexpr∗p | gexpr+p

gexpr ::= guard | guard :expr | expr :guardguard ::= mask | mask :skipskip ::= x∗p | x+p

Fig. 2 SRE-DNA Variations

desirable for denoting the types of sequences being analyzed here. The choiceoperator promotes intron material in expressions, which is program code whichdoes not contribute meaningfully to computations. It was found that this op-erator could be removed without any loss in expressiveness, at least with thesequences studied here (its value as a nondeterministic operator is replaced bymask terms). Likewise, totally unrestricted iteration tends to result in intronmaterial, as well as very inefficient expressions. Restricting iteration is thereforepractical.

Grammar 1 permits nested + iteration and + skip. Iteration is restrictedby the use of guard terms, which forces iteration to be applied to suffixes ofconcatenated expressions. In other words, guards create a bias towards prefixconsumption, which promotes efficient sequence interpretation. Grammar 1 alsopermits directly nested iteration. Nesting should not be common, however,because nested iterative expressions usually have low probabilities, and hencelow fitness values. They therefore become extinct during evolution.

Grammar 2 is the only grammar without the iteration operator, andis the language most similar to PROSITE’s motif language. Expressions ingrammar 2 will take the form of alternating masks and skips. Both * and + skipare used.

Grammar 3 is the least-constrained grammar. Besides using both + and* iteration, it permits a more symmetric distribution of iteration expressions oneither side of expressions (unlike grammar 1’s bias towards iteration on the right-

16 Brian J. ROSS

side of expressions). Like grammar 2, both * and + skip are used. However,unlike grammar 1, nested iteration is not allowed. This is done via the use ofthe gexpr nonterminal.

The entire syntax and semantics of all the languages denoted in Figure 2are implemented together in DCTG-GP. The semantic rules of the DCTG permitvarious interpretation and analytical processing to be programmed within thegrammatical definitions. An advantage of this is that many grammatical devicesare concisely denoted with a DCTG. For example, the actual implementationof grammar 3 does not use the distinct nonterminals expr and gexpr. Rather,the semantics for expr are used to directly test whether expressions are iterativeor not, in order to ensure that iteration is not nested. Using these semanticdevices greatly reduces the complexity of the grammar, which in turn promotesefficiency of the GP implementation.

Note that, although not encountered in the sequences studied here, SRE-DNA’s iteration operators permit the denotation of protein repeats (eg. Hunt-ington’s polyCAG pattern).

5.3 Fitness EvaluationPractically speaking, regular motif languages such as PROSITE’s and

SRE-DNA are most suitable for relatively small sequences. Large sequencesrequire impractically large expressions, which in SRE-DNA’s case, yield verysmall probabilities. The aligned segment of sequences shared by a family is buta fraction of the overall unaligned sequence. Therefore, to make processing moreefficient, the GP system requires that the user supply a window size parameter,which defines the length of the longest window or aligned sequence length ofinterest. In addition, the SRE-DNA interpreter onlys succeeds in evaluating anexpression if the entire SRE-DNA expression is used. For example, in E:F (whereF does not recognize an empty string), if an input string is fully consumed by E,then interpretation of the entire expression fails because F has not been used.Taken together with the window length, the result is that successful SRE-DNAexpressions are ones that are used in their entirety in recognizing a completewindow from the unaligned sequence.

A candidate motif must recognize members of the protein family of in-terest, while at the same time reject non-member sequences. Likewise, a fitnessmeasure must balance the acceptance and rejection characteristics of expressions.Consider the formula,


Fitness = N +NegFit− PosFit

NegTot and PosFit are the negative and positive training scores respectively, andN is the number of positive (and negative) training sequences. A hypotheticalsolution can yield a total PosFit score of N (N positive examples multiplied bya score of 1 each, which is illustrated below). The NegFit score will be 0 for acorrect solution. Therefore, the value of Fitness in the above forula yields 0 fora perfect solution. This is impossible to obtain in practice, because the valuesrepresented by the PosTot term are usually small, due to the way probabilityscores are used within them.

Positive scoring is performed on each positive example sequence. Asequence is processed by extracting every consecutive window from it, and com-puting a fitness Fit on each window. If a sequence ei has length |ei|, and thewindow size is w (where w < |ei|), then there are |ei|−w+1 windows to process.For example, the sequence abcdef has 4 windows of length 3: abc, bcd, cde, anddef. The positive fitness formula is then:

PosFit =∑

ei∈Posmaximum(Fit(winj(ei)))

where ei is a positive example sequence being processed, and winj is one of itsj windows. Here, for each positive sequence ei, each window winj is extractedfrom it, and a positive score is obtained. The maximum of all the window scoresfor a sequence is used for that sequence. Finally, all these maximal scores for allthe positive sequences in set Pos are summed together, giving an overall positivescore.

Positive fitness evaluation of a window incorporates two measurements:the probability of recognizing the window, and the proportional amount of thewindow recognized in terms of its length. Consider the formula,

Fit(win) =12

(Pr(smax) +

|smax||e|

)Here, smax is the longest prefix recognized from some window win, and |smax|is its length. The term Pr(smax) is the probability of recognizing smax. Thesecond term measures the proportion of the window recognized. The fitnesspressure introduced by this formula is to favour expressions that recognize entirewindows with high probabilities. In early generations, the window-length termdominates the score, which forces fitness to favour expressions that recognizelarge portions of windows. The probability term still comes into consideration,

18 Brian J. ROSS

however, especially when the population converges to expressions that recognizewindows from a significant number of sequences. At that time, the probabilityfitness measure favours expressions that yield high probabilities.

Negative fitness scoring is calculated as:

NegTot = maximum(Fit(ni)) ∗N

where ni ∈ Neg (negative examples). The highest obtained fitness value for anyrecognized negative example suffix is used for the score. A discriminating expres-sion will not normally recognize negative examples, however, and so Fit(ni) = 0for most ni.

5.4 Genetic Programming Parameters

Table 2 GP Parameters

Parameter Value

GA type generationalMaximum generations 100Maximum runs/experiment 6Functions SRE-DNA variantsTerminals amino acid codons, probabilitiesPopulation size (initial) 2000Population size (culled) 1000Unique population yesMax. depth initial popn. 12Max. depth offspring 24Tournament size 7Elite migration size 10Retries for reproduction 3Prob. crossover 0.90Prob. mutation 0.10Prob. internal crossover 0.90Prob. terminal mutation 0.75Prob. SRE crossover 0.25Prob. SRE mutation 0.30SRE mutation range 0.1


Table 2 lists genetic programming parameters used for the experiments.Generational evolution is performed, in which distinct populations are evolvedduring each generation. Each population always consists of syntactically uniqueexpressions. The initial population size is 2000 individuals, which is gener-ated using Koza’s ramped half-and-half tree generation scheme.18) Half the treesgenerated are grow trees, in which a terminal or nonterminal can be randomlyselected as the root of each subtree, while the remaining half are full trees, inwhich nonterminals are always selected so long as the tree depth limit is not ex-ceeded. During tree generation, the tree depths are staggered (or ramped) fromdepths 2 through 12. The result is a population of random expressions havinga fair distribution of varied tree shapes. This initial population is culled by se-lecting the 1000 most fit individuals. This oversampling of the initial populationhelps discard the weak expressions which commonly arise during random treegeneration.

Nonterminals and terminals are determined by the SRE-DNA grammarvariant used (Section 5.2). All the grammars use amino acid codons in maskterms, as well as numeric fields for probabilities. Reproduction operations onthese fields work in a number of ways. Crossover and mutation are selectedwith probabilities of 90% and 10% respectively. When one of these reproductiveoperations is selected, either a conventional GP version or a hybrid SRE-DNAversion may be applied. For example, the table entry indicating the probabilityof SRE crossover means that, when crossover is to be performed, there is a 25%chance that SRE-DNA crossover is used. Conventional grammatical crossoverselects a subtree having the same nonterminal or terminal type of the grammarin each parent, and swaps these subtrees, yielding the offspring. SRE-DNAcrossover works on masks: two mask fields are selected in the parents, and thetwo offspring created are identical to each respective parent, but with the selectedmasks a merge of the elements from the parent masks.

Conventional grammatical mutation takes a subtree, and replaces it witha randomly generated subtree of the same nonterminal or terminal type. SRE-DNA mutation can involve a number of special operations: (i) a numeric field isperturbed ±10% of its original value; (ii) a mask has a random element insertedinto it; (iii) a mask has a random element removed from it; and (iv) a maskelement is randomly replaced.

5.5 Solution Refinement

20 Brian J. ROSS

For each mask maski in expression expr:For each element αj ∈ maski:

- Remove αj from maski, and call new expression expr′.- If fitness expr′ > fitness expr

then expr ← expr′.

Fig. 3 Mask refinement

Often, an evolved solution motif can be further improved after the runhas terminated. Its overall performance as measured by its probability score onrecognized sequences can be increased by refining its mask terms. Many experi-ments result in solution expressions that have unnecessarily large masks, whichyield smaller probabilities than optimally-sized ones. Although mask mutationmight occassionally delete extraneous mask codons, this occurs too infrequentlyduring evolution to prevent these bloated masks from appearing in solutions.

Mask refinement is performed on solution expressions by the procedurein Figure 3. The procedure performs hill-climbing transformations on an expres-sion, by deleting the mask elements whose removal improves overall fitness. Thealgorithm is a greedy one, and it is not guaranteed to find the optimal refinementfor an expression. In other words, depending on the SRE-DNA grammar used,some combinations of mask refinements might yield higher probabilities thanothers. The algorithm used is adequate for the majority of expressions evolvedin this paper.

§6 ResultsTable 3 summarizes the training and testing performance results for the

protein families in Table 1. Gi is the grammar used from Figure 2. The bestmeasure of a motif’s performance is to compute the average probability obtainedwhen recognizing positive example sequences. We call this value the probabilityscore, and denote it by pr. Then, “avg pr” is the average probability scores forthe best motifs for all six runs. Similarly, “best pr” is the probability score ofthe overall best motif. Testing was performed on the best evolved motif for everygrammar (the motif used to obtain “best pr”). “%tp” is the percentage of truepositives obtained by the best motif on the testing set. “pr” is the probabilityscore obtained for the recognized testing sequences. This score is normalized forthe true positives recognized. The percentage of false negatives (positive testcases yielding a probability of 0.0) is in column “%fn”, and is merely 100 - %tp.


Table 3 Solution statistics for example proteins

Training Testing (best)Family Gi avg pr best pr %tp pr %fn %fp

AAO G1 1.846e-4 8.565e-4G2 5.306e-5 1.914e-4 − − − −G3 5.414e-5 2.143e-4

ScT G1 1.312e-7 5.170e-7G2 2.349e-8 5.537e-8 − − − −G3 2.451e-5 1.469e-4

ZF1 G1 1.113e-11 2.898e-11 67.4 3.562e-11 32.6 0.0G2 7.330e-11 4.128e-10 81.3 4.645e-10 18.7 0.0G3 6.027e-11 3.374e-10 89.1 4.772e-10 10.9 0.0

ZF2 G1 9.614e-3 2.019e-2 100.0 1.959e-2 0.0 0.0G2 1.278e-2 1.919e-2 100.0 1.953e-2 0.0 0.0G3 1.415e-2 2.542e-2 100.0 2.410e-2 0.0 0.0

SnT G1 1.202e-7 2.398e-7 80.3 1.975e-7 19.7 0.0G2 5.153e-8 1.633e-7 82.7 1.311e-8 17.3 0.0G3 9.786e-8 1.857e-7 64.6 2.967e-7 35.4 0.0

KI G1 2.964e-9 5.461e-9 80.0 5.416e-9 20.0 0.0G2 1.340e-8 3.772e-8 68.8 3.337e-8 31.2 0.0G3 2.354e-8 4.546e-8 70.4 5.424e-8 29.6 0.0

Finally, false positives (erroneously identifying a negative example as a membersequence) is in column “%fp”. The actual testing results for the Kazal inhibitor(KI) experiments using all three grammars can be viewed on the web∗3.

Table 4 presents some additional testing results on the best solutionstested in Table 3. False negative (FN) and false positive (FP) sequences wereobtained from the PROSITE database for four of the protein families. This datarepresents sequences which either were missed by the original PROSITE motif(FN), or were erroneously classified as belonging to the protein family (FP).Not all the families have FN and FP available (the AAO and ScT proteins hadnone, and are omitted). In the #FN column in Table 4, the first number repre-sents the number of false negative sequences identified by the SRE-DNA motif.The number in parentheses is the total number of FN sequences extracted fromPROSITE for that protein. In terms of matching false negative sequences, higher∗3 http://www.cosc.brocku.ca/∼bross/research/kazal.html

22 Brian J. ROSS

Table 4 Additional testing results

Family Gi #FN (tot) pr #FP (tot) pr

ZF1 G1 2 (3) 7.764e-12 12 (12) 3.882e-11G2 1 1.925e-10 6 1.724e-10G3 2 2.163e-11 7 1.829e-10

ZF2 G1 4 (12) 2.398e-2 5 (5) 1.918e-2G2 4 2.398e-2 5 1.918e-2G3 4 1.942e-2 5 1.831e-2

SnT G1 2 (10) 6.662e-8 (0) −G2 2 3.707e-8 −G3 0 0.0 −

KI G1 0 (1) 0.0 0 (2) 0.0G2 0 0.0 1 8.569e-9G3 0 0.0 1 4.371e-9

numbers in the FN column are preferred, as they indicate that the SRE-DNAmotif recognized candidate proteins that were missed by the original PROSITEregular motif expression. The pr value is the normalized average probability ofrecognizing the FN sequences. The next two columns are similar to the FN ones,except that false positive sequences are tested. Here, lower numbers in the FPcolumn are preferred, since they mean that SRE-DNA motifs were more discrim-inatory than the PROSITE motifs, correctly classifying non-member sequencesthat PROSITE erroneously permitted.

Tables 5 and 6 list the best solutions for the different protein familiesand SRE-DNA variations, along with the source PROSITE motif used to obtainthe example sequences. Note that the PROSITE motif language differs fromSRE-DNA (see Section 2.3).

§7 DiscussionFirstly, the quality of results is influenced by the level of precision of

the experimental parameters. For example, due to time constraints, only 6 runswere undertaken per experiment, which is inadequate for statistically meaningfulconclusions to be drawn. Additionally, more precise SRE-DNA interpretationwill occur when lower minimal probabilities are used by the interpreter (seeTable 1). For example, early experiments suggested that the zinc finger C2H2(ZF1) family required a smaller minimum probability in order to get meaningful


Table 5 Best evolved motifs for protein families. The motif beside family label is the source

PROSITE motif.

AAO [ilmv](2) : h : [ahn] : y : g : x : [ags](2) : x : g : x(5) : g : x : aG1 h : x+.1 : y : g : x+.17 : [gs] : x+.17 : g : x+.1 : [iqt] : x+.17 : [ghs] : x+.13

: g : x+.1 : a : x+.19

G2 h : x+.1 : y : x∗.11 : g : x+.19 : [gs] : x+.16 : g : x+.19 : [hst]: x+.19 : g : x+.16 : a : x+.19

G3 h : x+.1 : y : g : x+.19 : [gs] : x+.19 : g : x+.19 : [hst] : x+.19

: g : x+.17 : a : x+.16

ScT c : x(3) : c : x(6, 9) : [ags] : k : c : [imqt] : x(3) : c : x : cG1 c : x+.19 : ((([acg] : x+.12 : [cg] : x+.19 : k : x+.12 : [ct] : x+.11 : c

: x+.19)+.19)+.19)+.19

G2 c : x+.19 : [gkps] : x+.19 : [acgk] : x+.19 : [agn] : x+.19 : c : x+.19 : [gn]: x+.19 : c : x+.1 : c : x+.19

G3 [ck] : x+.17 : c : x+.18 : k : x+.19 : (c : x∗.15 : k : x+.19 : a : x+.18 : m: x+.15 : f : x+.1 : k : c : x+.1)∗.19 : [gn] : x+.18 : k : c : x+.18

ZF1 c : x(2, 4) : c : x(3) : [cfilmvwy] : x(8) : h : x(3, 5) : hG1 [acfkr] : x+.19 : [acfsv] : x+.19 : [eklr] : x+.19 : [fklnst]

: x+.19 : [adklrt] : x+.19 : [hlnrsv] : x+.19 : [kr] : x+.19 : [hl]: x+.19 : [filnrtvw] : x+.19

G2 c : x+.19 : c : x+.19 : [kqr] : x+.18 : [afklr] : x+.19 : [lqrs] : x+.19 : h: x+.18 : [hklrt] : x+.19 : [acdhkt] : x+.19

G3 c : x+.19 : c : x+.19 : [hkr] : x+.19 : [afkqrsty] : x+.19 : [hlnstv] : x+.19

: [ahklnrst] : x+.19 : [hkr] : x+.19 : [hr] : x+.19

results. Using a low probability like this enhances precision, but at the expenseof computation time. If even smaller probability limits were used, both thetraining and testing results may be improved, due to the increased likelihood ofrecognizing some examples.

Other parameters, such as mask size, iteration ranges, window size, andtraining set size, also affect results. Preliminary runs suggested that largermasks, as used in the zinc finger cases, usually result in lower quality solutions.This implies that small masks are more naturally suitable for SRE-DNA motifs,in comparison to PROSITE’s motifs. Similarly, large iteration ranges greaterthan 0.25 often resulted in the evolution of motifs that would tend to skip largeportions of relevant aligned subsequences. The value used in these experiments

24 Brian J. ROSS

Table 6 Best evolved motifs for protein families (cont.).

ZF2 c : x : h : x : [filmvy] : c : x(2) : c : [ailmvy]G1,2 c : x+.1 : h : x+.19 : c : x+.19 : c : x+.1

G3 c : x+.1 : h : x+.19 : (f : x∗.11)∗.19 : c : x+.19 : c : x+.1

SnT g : c : x(1, 3) : c : p : x(8, 10) : c : c : x(2) : [denp]G1 g : c : x+.19 : c : x+.19 : [kps] : x+.19 : [klv] : x+.19 : c : c : x+.19

: [dt] : x+.19

G2 g : x∗.11 : c : x+.19 : c : x+.19 : [gks] : x+.19 : [dklv] : x+.19 : c : c : x+.19

: [dt] : x+.18

G3 ((((g : c : x+.19)+.1 : c)+.12 : p : x+.18 : [gkns] : x+.19 : [dgls] : x+.19

: [iklt] : x+.19)+.1 : c : x+.17)+.1 : [dt] : x+.19

KI c : x(7) : c : x(6) : y : x(3) : c : x(2, 3) : cG1 [cv] : x+.19 : [elrvy] : x+.19 : ([cps] : x+.19 : [cgs] : x+.19 : [cty] : x+.19

: n : x+.11 : [ct] : x+.19 : [ct] : x+.17)+.1

G2 [ct] : x+.19 : [celr] : x+.19 : [elpqr] : x+.19 : [cgk] : x+.19 : [dgk] : x+.19

: [fty] : x+.18 : [dnr] : x+.18 : c : x+.19 : c : x+.18

G3 c : x+.19 : [eir] : x+.19 : [epr] : x+.19 : c : x+.19 : [dgks] : x+.19

: y : x+.19 : [cn] : x+.19 : c : x+.19

(0.19) resulted in the most interesting motifs.With respect to the results in Table 3, the results obtained range from

acceptable (those with testing percentages less than 70%) to excellent. Withrespect to the training results, the best SRE-DNA motif for each family wasproduced twice by grammar 1, once by grammar 2, and three times by gram-mar 3. Often, the difference in probabilities between best motifs by differentgrammars was many orders of magnitude. For example, consider the scorpiontoxin experiment (ScT), in which the best grammar 3 solution had an averageprobability over 280 times larger than the next best grammar 1 motif, and over2600 times larger than the grammar 2 motif. Unfortunately, neither the ScT norAAO families had enough examples with which to perform testing.

The ZF1 runs were hindered by the small probabilities inherent with themotifs. This seems to be a product of the large mask size (10) combined withthe large window size (25). It was observed that many recognized probabilitiesfor examples were in the 1e-12 range. Others may have been smaller than 1e-13,which was beyond the probability limit for those runs, and thus would have beenterminated.


In the ZF2 and KI cases, the observed pr for testing roughly correlatewith the best pr for training. The ZF2 motifs were simple, and runs oftenconverged to the identical motif.

The snake toxin testing for G3’s motif suggests that this particular motifmay be overtrained. Although it recognized all of the training cases, it couldonly recognize 64% of the testing set.

The results in Table 4 indicate that SRE-DNA often performed betterthan the PROSITE equivalents with respect to false negative and false positivesequences. However, this must be considered in balance with the testing resultsin Table 3.

Cyclic subsequences are not common in the protein sequences studiedhere. This might imply that the iteration operator is detrimental for GP, andthat grammar G2 would be the most effective variation of SRE-DNA. The resultsin Table 3, however, suggest otherwise. Iteration can be argued to be beneficialfor evolution, given the number of best solutions found with grammars G1 andG3 (the grammars with iteration operators). One hypothesis for this is thatiteration aids evolution by allowing richer intermediate expressions to evolve,because iteration permits the recognition of greater numbers of positive exam-ple subsequences. This occurs because an SRE-DNA expression with iterationrecognizes more strings than one without it (ignoring the effects of skip terms,which are controlled by the use of negative examples). Upon inspecting themotifs in Tables 5 and 6, however, it is clear that iteration is not an importantfactor with respect to the construction of final motifs. In 7 of the 12 experimentsusing grammars with iteration (G1 and G3), the iteration operator did not arisein the best motifs. In the remaining 5 motifs, iteration is incidental, and takesthe form of intron material to protect useful terms. An obvious example of thisis the G1 solution for ScT in Table 5. Iteration could be removed from theseexpressions, resulting in motifs with higher probability performance.

Considering the structure of SRE-DNA expressions in Tables 5 and 6, itis clear that majority of evolved solutions are much more complex than the sourcePROSITE motifs. One reason for this complexity is the fact that expression sizewas not accounted for in the fitness function. Expression parsimony could beenhanced by incorporating a score for expression complexity, or reducing theGP parameter limiting tree sizes. Furthermore, evolved GP programs are oftenmore complex than manually programmed ones.

26 Brian J. ROSS

§8 Related work

8.1 Other representations: PROSITE, HMMsSRE-DNA is similar to the regular expressions used by PROSITE and

related databases.6, 12) SRE-DNA is essentially a PROSITE-like language en-hanced with a probability model. Not surprisingly, many evolved motifs in Fig-ures 5 and 6 are similar to the PROSITE source motifs. Some of the SRE-DNAmotifs, however, vary substantially from their PROSITE equivalents. This is acombined result of the methodology used during their acquisition with GP, suchas the fitness evaluation strategy and other GP parameters, and nuances in theSRE-DNA grammar used.

SRE-DNA’s main expressive advantage over PROSITE’s language is itsprobabilistic model. A SRE-DNA motif has associated with it a probability dis-tribution, and each member sequence has a corresponding probability computedwith it. Non-member sequences have a zero probability. With PROSITE’s lan-guage, there is no associated probability or scoring scheme; sequences are eithermembers or not members of the motif expression in question. Although the over-all structure of the regular expression underlying an SRE-DNA motif is oftenjust as deterministic as with a PROSITE equivalent, the probability distributionhelps to numerically justify the strength of membership of sequences within afamily. For example, sequences that use longer gaps will have lower probabilitiesthan those with smaller gaps.

SRE-DNA is more expressive at denoting variable-length terms and gapsthan PROSITE motifs. In PROSITE, a numeric range such as “X(2, 4)” indi-cates a variable gap between 2 and 4 codons. In SRE-DNA (as used in thispaper), a variable-length gap is expressed as a term like “x+.10”. Probabilisticiteration can permit gaps of any potential length, but at a probabilistic cost,since long gaps caused by highly-iterated terms have corresponding lower prob-abilities. A motif that requires a large gap should use an iteration with a higheriterative probability. SRE-DNA’s use of probabilistic choice also enhances its ex-pressiveness over PROSITE’s motifs. Although choice was not used here, otherprotein families may benefit by it.

A comparison of SRE-DNA with Hidden Markov Models (HMM’s) isalso useful.22) A HMM is a stochastic finite automata. In the HMM model ofKrogh et al., a target sequence of length k has associated with it an HMMgraph (automata) of approximately 3k nodes and 9k arcs. Nodes are either


match states, insert states, or delete states. Match states arise when sequencesmatch codes of the sequence, delete states cause codes to be skipped, and insertstates introduce codes. Every transition between these states has associatedwith it a probability. Codons themselves have specific probabilities when theyare seen when moving between states. All these probabilities can be determinedautomatically via various training algorithms. Given a candidate sequence, thecodons invoke translations through the HMM, until the sequence terminates,resulting in a probability score for that sequence.

HMM’s have shown good performance in protein classification, and haveperformed well in comparison to PROSITE’s regular motifs.22) HMM’s are notintended as user-level interfaces to protein databases, unlike SRE-DNA andPROSITE motifs. HMM’s are much less rigid than regular pattern motifs, astheir architecture is designed to permit variations of sequence patterns within aset window size. In comparison, PROSITE and SRE-DNA motifs specify fairlyrigid pattern matching criteria (although SRE-DNA motifs are somewhat morerelaxed due to probabilistic iteration). The fact that HMM’s are more forgivingwith sequence variations than regular pattern motifs is a mixed blessing. On onehand, an HMM is able to represent a large assortment of structurally dissimilarsequences, by fitting a probability distribution over them. The HMM archi-tecture is fixed with respect to the maximum sequence size permitted. Shouldthe sequences be very dissimilar, however, the resulting information content en-coded in the HMM’s probability distribution will become increasingly negligible.PROSITE patterns have a similar ability to permit structure variation throughthe use of variable gaps and codon choices. It treats sequence identification asa boolean operation, and there is no probability associated with membership.Although SRE-DNA expressions can denote dissimilar sequences with the choiceoperator, the resulting motif will grow in size and complexity.

Reconciling the expressive differences between HMM’s, SRE-DNA, andPROSITE expressions is not straight-forward. There is a 1-to-1 mapping be-tween regular expressions and finite automata: each can be automatically com-piled to the other.14) Similarly, there is a mapping between stochastic regularexpressions and stochastic finite automata. This implies that the languages de-noted by PROSITE, SRE-DNA and HMM motifs are theoretically equivalent.However, even though these models are inter-translatable, it does not implythat given sets of protein sequences are always more conveniently denoted inone model over another. It must be realized that all these motif models are

28 Brian J. ROSS

simple structural characterizations of the real factor of importance – the 3Dtopology of the protein.

8.2 Motifs and Genetic ProgrammingThe first successful application of genetic programming towards evolv-

ing PROSITE-style motifs for unaligned biosequences is by Hu.15) There aresimilarities and differences between Hu’s GP methodology and ours. Similarto our experiments, Hu pre-specifies the maximum window length, maximumgap length, and gap flexibility (ranges for variable gaps of the form “X(i, j)”).Whereas we use amino acid codons, Hu uses a combination of nucleotide andamino acid codons, as well as predefined codon subset codes. The use of pre-defined subsets is interesting, as it uses domain knowledge to establish usefulcombinations of possible codons for mask terms. Hu seeds the initial popula-tion with substrings from the example set. Although our initial population wasrandomly generated, in hindsight, seeding might prove advantageous.

One major difference between Hu’s approach and ours is his extensiveuse of greedy local optimization to refine expressions during evolution. Firstly,each pattern sub-term is replaced by a wildcard; if the fitness improves, thewildcard is retained. Then all the legal values for each variable gap are iteratedthrough, to find the combination giving the highest fitness. Although Hu doesnot specify the performance cost of this optimization procedure, it certainly mustbe significant. SRE-DNA evolution circumvented the need for gap placementand variable gap refinement, as the skip fields were automatically refined duringevolution. We applied greedy optimization to mask terms, but only on the bestsolution at the end of a run.

Table 7 Comparison of PROSITE, Hu, and SRE-DNA motifs for snake toxin (SnT)

PROSITE: g : c : x(1, 3) : c : p : x(8, 10) : c : c : x(2) : [denp]Hu #1: g : c : x(1, 3) : c : pHu #2: c : c : x(1, 2) : [denp]

SRE-DNA G2: g : x∗.11 : c : x+.19 : c : x+.19 : [gks] : x+.19 : [dklv] : x+.19

: c : c : x+.19 : [dt] : x+.18

A qualitative comparison of Hu’s results with ours is difficult, because ofthe difference in motif languages, as well as Hu’s lack of any numerical analysesof his results (eg. training and testing measurements). Hu investigated the sameZF1, ZF2, SnT, and KI protein families that we did. Hu’s resulting motifs for


ZF1 and KI are impressive, as they are nearly identical to the PROSITE motifexpressions. In other results, Hu’s motifs were simpler than the PROSITE ones.For example, Table 7 compares the PROSITE, Hu, and SRE-DNA G2 motifsfor SnT. The Hu motifs are clearly too short, as they cover sequences between4 and 7 codons long, which is far smaller than the maximum window size of 22codons used in the PROSITE motif. In the G2 motif, the first 4 terms of thePROSITE motif (up to “p”) are nearly identically covered, as is the “c:c” termlater in the sequence. The G2 motif needs to break up the long “x(8,10)” gapwith a series of iterations and masks, because the iteration limit of 0.19 preventsa single skip expression from being applied too often (a skip of 0.19 iterated 10times gives a probability of 5e-8).

The other GP work evolving regular expression motifs for unalignedsequences is by Koza, Bennett, Andre and Keane.21) The 2 protein familiesstudied are the D-E-A-D box and manganese superoxide dismutase families.The target language is a subset of the PROSITE motif language, in which gapexpressions (fixed or variable) are not used. They also use GP with ADF’s, whichenable the modularization of common subexpressions in evolved motifs. This isadvantageous for the protein families studied, which contain repetitive elements.The two protein families investigated have very small window lengths, of length8 and 9 respectively. This is considerably smaller than the lengths used in thispaper, which ranged up to 25 codons. They reported good results with theirwork, as one of their evolved motifs improved upon an established PROSITEsource motif, due to its replacement of a PROSITE gap with an explicit choiceof amino acid codons.

Unlike ours or Hu’s work, Koza et al. did not need to supply presetwindow limits. Although fewer user-specified parameters implies increased au-tomation, which is always an ideal in machine learning, a preset window sizewas used here out of necessity: an indeterminately large window results in longinterpretation times, especially when iteration is used. Given the complexity ofSRE-DNA and the lengths of the sequences studied here, it is unlikely that anunspecified window would have been feasible.

§9 ConclusionThis is a first attempt at evolving SRE-DNA motifs for unaligned se-

quences, and the results are promising. The expressive characteristics of SRE-DNA motifs are highly dependent upon language restrictions used, such as gram-

30 Brian J. ROSS

matical restrictions, mask sizes, and iteration ranges. Some of these constraintsmay be more suitable to particular protein families than others. Further reseachis required to understand in more depth the effect of language constraints onmotifs, and which constraints are most practically applicable to the majority ofprotein families.

One weakness in this paper’s application of SRE-DNA is that motifsdo not ensure fixed-size sequences for conserved regions. Conserved regionsrarely vary in size, since extra or missing amino acids will cause major changesto functionality. GP usually evolves appropriate iteration values that producegood results for the training set. These terms are not fixed in length, however,if iterative gaps are embedded in them. An interesting direction to take in thefuture is to use an enhanced SRE-DNA grammar that explicitly accounts forconserved regions. For example, the grammar could generate motifs of the form,

< var. region > < conserved region > < var. region >

The grammar rules for the variable regions would use iterative probabilities asis done currently, while the conserved region would forgo the use of all iterationoperators.

SRE-DNA is a new motif language for protein sequences. As withPROSITE motifs, much of the characterisation of a family of proteins is inherentin the structure of the SRE-DNA motif. Additionally, like HMM’s, an SRE-DNAmotif defines a probability distribution over the member sequences. In this two-level approach to sequence representation, the more discriminating linguisticlevel is the regular expression, since it implicitly defines which sequences belongto the modeled family. The probability fields refine this representation by as-cribing a probabiity distribution to member sequences. Like PROSITE’s motiflanguage, SRE-DNA motifs tend to grow in size in accordance with the com-plexity and size of the sequences being characterized. GP tends to create largerexpressions as well, which can often be simplified afterwards. SRE-DNA’s motifsize contrasts to HMM motifs, whose structure is fixed for the maximum size ofsequences being represented. Unlike HMM’s, however, SRE-DNA is a relativelyuser-friendly motif language, and can be used for interactive database access.

An empirical comparison of SRE-DNA and other representation modelswould be interesting. Since these different motif representations can vary sub-stantially, however, a direct comparison of them may not be too fruitful. It isworth realizing that the biological functionality of proteins is entirely dependentupon the 3D structure of the protein molecules. Motif representations such as


SRE-DNA, HMM’s, regular expressions, and context-free languages, are crudemeans for modeling characteristic features of proteins. Nevertheless, they arewidely used, because they are relatively concise, computationally efficient, andamenable to automatic acquisition.

There are other directions for future investigations. Similar to the maskrefinement performed on the final solutions, additional refinement of probabilityfields can be undertaken, which would improve the probability performance ofmotifs. Future work will investigate using a fuller variant of SRE-DNA to modelsequences with repetitive elements and weighted choice. More sophisticated GPapproaches, for example, the use of multiple populations, may result in betterevolution performance. Finally, more runs should be undertaken to obtain moremeaningful statistical results. A practical limitation to this is the speed of runs,which can be very slow when large masks and low probability limits are used.Porting DCTG-GP to a faster, compiled language such as C would be useful inthis regard.

Our fitness scheme uses a strictly structural perspective of motif perfor-mance: if a regular language can recognize the positive sequences in a family,and reject sequences outside of it, then it is considered correct. This formal viewof motif performance is practical and adequate for many problems, and thereforeis adopted in most machine learning applications. On the other hand, such anapproach is inherently limited. For example, it cannot account for convergedfunctionality for distantly related proteins. Such real-world phenomena requireadditional information (domain-specific knowledge about protein structure) be-yond what can be modeled by first-order structural principles themselves.

A comparison of different motif learning techniques is worth undertakingin the future. SRE-DNA motifs may be synthesizable by other machine learningalgorithms, many of which may yield superior performance over GP.

Acknowledgement: The author thanks two anonymous referees for their helpfulcomments. This work is supported though NSERC Operating Grant 138467-1998.

References1) H. Abramson and V. Dahl. Logic grammars. Springer-Verlag, 1989.

2) S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, and T. Shino-hara. A Machine Discovery from Amino Acid Sequences by Decision Trees overRegular Patterns. New Generation Computing, 11:361–375, 1993.

32 Brian J. ROSS

3) P. Baldi and S. Brunak. Bioinformatics: the Machine Learning Approach. MITPress, 1998.

4) W. Banzhaf, P. Nordin, R.E. Keller, and F.D. Francone. Genetic Programming:An Introduction. Morgan Kaufmann, 1998.

5) P. Bork and E.V. Koonin. Protein sequence motifs. Current Opinion in Struc-tural Biology, 6:366–376, 1996.

6) A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to theAutomatic Discovery of Patterns in Biosequences. Journal of ComputationalBiology, 5(2):279–305, 1998.

7) A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Pattern Discovery in Biose-quences. pages 255–270. Springer Verlag, 1998. LNAI 1433.

8) P. Bucher and A. Bairoch. A Generalized Profile Syntax for BiomolecularSequence Motifs and its Function in Automatic Sequence Interpretation. InR.Altman et al, editor, Proceedings 2nd International Conference on IntelligentSystems for Molecular Biology, pages 53–61. AAAI Press, 1994.

9) V.K. Garg, R. Kumar, and S.I Marcus. Probabilistic Language Formalism forStochastic Discrete Event Systems. IEEE Trans. Automatic Control, 44:280–293, February 1999.

10) S. Handley. Automated Learning of a Detector for the Cores of α-Helices inProtein Sequences Via Genetic Programming. In Proceedings 1994 IEEE WorldCongress on Computational Intelligence, volume 1, pages 474–479. IEEE Press,1994.

11) S. Handley. Classifying Nucleic Acid Sub-Sequences as Introns or Exons UsingGenetic Programming. In Proceedings 3rd International Conference on Intel-ligent Systems for Molecular Biology (ISMB-95), pages 162–169. AAAI Press,1995.

12) K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch. The PROSITE database,its status in 1999. Nucleic Acids Research, 27(1):215–219, 1999.

13) J.H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, 1992.

14) J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages,and Computation. Addison Wesley, 1979.

15) Y.-J. Hu. Biopattern Discovery by Genetic Programming. In J.R. Koza et al,editor, Proceedings Genetic Programming 1998, pages 152–157. Morgan Kauf-mann, 1998.

16) K. Karplus, K. Sjolander, C. Barrett, M. Cline, D. Haussler, R. Hughey,L. Holm, and C. Sander. Predicting protein structure using hidden Markovmodels. Proteins: Structure, Function, and Genetics, pages 134–139, 1997.supplement 1.

17) M.J. Kearns and U.V. Vazirani. An Introduction to Computational LearningTheory. MIT Press, 1994.

18) J.R. Koza. Genetic Programming: On the Programming of Computers by Meansof Natural Selection. MIT Press, 1992.

19) J.R. Koza. Automated Discovery of Detectors and Iteration-Performing Calcu-lations to Recognize Patterns in Protein Sequences Using Genetic Programming.In Proceedings of the Conference on Computer Vision and Pattern Recognition,pages 684–689. IEEE Press, 1994.


20) J.R. Koza, F.H. Bennett, and D. Andre. Classifying Proteins as ExtracellularUsing Programmatic Motifs and Genetic Programming. In Proceedings 1998IEEE World Congress on Computational Intelligence, pages 212–217. IEEEPress, 1998.

21) J.R. Koza, F.H. Bennett, D. Andre, and M.A. Keane. Genetic ProgrammingIII: Darwinian Invention and Problem Solving. Morgan Kaufmann, 1999.

22) A. Krogh, M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. Hidden MarkovModels in Computational Biology. Journal of Molecular Biology, 235:1501–1531, 1994.

23) B.J. Ross. Probabilistic Pattern Matching and the Evolution of StochasticRegular Expressions. International Journal of Applied Intelligence, 13(3):285–300, November/December 2000.

24) B.J. Ross. Logic-based Genetic Programming with Definite Clause TranslationGrammars. New Generation Computing, 2001. In press.

25) B.J. Ross. The Evaluation of a Stochastic Regular Motif Language for Pro-tein Sequences. In L. Spector et al., editor, Proceedings of the Genetic andEvolutionary Computation Conference (GECCO-2001), pages 120–128. MorganKaufmann, 2001.

26) Y. Sakakibara, M. Brown, R. Hughey, I.S. Mian, K. Sjolander, R.C. Underwood,and D. Haussler. Stochastic Context-Free Grammars for tRNA Modeling. Nu-cleic Acids Research, 22(23):5112–5120, 1994.

27) D.B. Searls. The Computational Linguistics of Biological Sequences. InL. Hunter, editor, Artificial Intelligence and Molecular Biology, pages 47–120.AAAI Press, 1993.

28) D.B. Searls. String Variable Grammar: a Logic Grammar Formalism for theBiological Language of DNA. Journal of Logic Programming, 24(1,2), 1995.

the evolution of stochastic regular mo- tifs for protein ...bross/research/sredna_ngc.pdf · the...

Documents