research article in-maca-mcc: integrated multiple...

Research ArticleIN-MACA-MCC Integrated Multiple Attractor CellularAutomata with Modified Clonal Classifier for Human ProteinCoding and Promoter Prediction

Kiran Sree Pokkuluri1 Ramesh Babu Inampudi2 and S S S N Usha Devi Nedunuri3

1 Department of CSE JNTU Hyderabad 500 085 India2Department of CSE Acharya Nagarjuna University Guntur 522510 India3 Department of CSE University College of Engineering JNTU Kakinada 533003 India

Correspondence should be addressed to Kiran Sree Pokkuluri profkiransreegmailcom

Received 28 January 2014 Revised 7 June 2014 Accepted 14 June 2014 Published 15 July 2014

Academic Editor Bhaskar Dasgupta

Copyright copy 2014 Kiran Sree Pokkuluri et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

Protein coding and promoter region predictions are very important challenges of bioinformatics (Attwood and Teresa 2000) Theidentification of these regions plays a crucial role in understanding the genesManynovel computational andmathematicalmethodsare introduced as well as existing methods that are getting refined for predicting both of the regions separately still there is a scopefor improvementWe propose a classifier that is built withMACA (multiple attractor cellular automata) andMCC (modified clonalclassifier) to predict both regions with a single classifier The proposed classifier is trained and tested with Fickett and Tung (1992)datasets for protein coding region prediction for DNA sequences of lengths 54 108 and 162 This classifier is trained and testedwith MMCRI datasets for protein coding region prediction for DNA sequences of lengths 252 and 354 The proposed classifier istrained and tested with promoter sequences from DBTSS (Yamashita et al 2006) dataset and nonpromoters from EID (Saxonovet al 2000) and UTRdb (Pesole et al 2002) datasets The proposed model can predict both regions with an average accuracy of905 for promoter and 896 for protein coding region predictionsThe specificity and sensitivity values of promoter and proteincoding region predictions are 089 and 092 respectively

1 Introduction

Many of the important problems [1] in bioinformatics canbeaddressed with our computing techniques very easily Sowe have identified two major problems in bioinformaticsand worked on them basically to understand the logicalitiesin these two problems After an extensive literature surveywe have developed the frame work for addressing theseproblems This frame work developed can be useful foraddressing other problems in bioinformatics like splice junc-tion prediction secondary structure prediction of proteinand so forth The proposed (IN-MACA-MCC) classifier canpredict both promoter and protein coding regions very easilywith more accuracy when compared with existing literaturewith less time

DNA is an important component of a cell and geneswill be found in specific portion of DNA which will contain

the information as explicit sequences of bases (A G Cand T) These explicit sequences of nucleotides will haveinstructions to build the proteins But the region whichwill have the instructions which is called protein codingregions occupies very less space in a DNA sequence Theidentification of protein coding regions plays a vital role inunderstanding the genes We can extract lot of informationlike what is the disease causing gene whether it is inheritedfrom father or mother and a promoter can regulate thegrowth of disease slowly and how one cell is going tocontrol another cell Although the entire human genome issequenced identifying the protein coding region as well asfinding the gene is still a complicated process

DNA contains lots of information We need promoter forDNA transcription to from RNA So promoter plays a vitalrole in DNA transcription It is defined as ldquothe sequence in

Hindawi Publishing CorporationAdvances in BioinformaticsVolume 2014 Article ID 261362 7 pageshttpdxdoiorg1011552014261362

2 Advances in Bioinformatics

the region of the upstream of the transcriptional start site(TSS)rdquo Identifying a new promoter in a DNA sequence willlead to finding a new protein If we identify the promoterregion we can extract information regarding gene expressionpatterns cell specificity and development Promoters willregulate a gene expression Some of the genetic diseaseswhich are associatedwith variations in promoters are asthmabeta thalassemia and Rubinstein-Taybi syndrome Promotersequence can be used to control the speed of translation fromDNA into protein It is also used in genetically modifiedfoods

In vertebrates only five percent of the gene is made upof exons Genes mostly will have seven to eight exons with145 bp length at an average Introns have 3365 bp length atan average Promoter comprises a small percentage of entiregenome The features of promoters are different from otherfunctional regions like exons introns and 31015840UTRs Thesefacts make protein coding and promoter region predictionsvery difficult tasks

This paper is organized in the followingmanner Section 2provides the entire literature survey of both protein codingand promoter regions Section 3 provides the design of theproposed system Section 4 presents the MACA-MCC clas-sifier for promoter and protein coding prediction Section 5provides the experimental results with discussion Section 6provides the future extensions and conclusion to the pro-posed classifier

2 Literature Survey

Salzberg has used a decision tree algorithm [2] for locatingprotein coding regions in DNA sequences which is adaptableand can process DNA sequences of lengths 54 bp 108 bp and162 bp Maji and Paul [3] have developed neural network treeclassifier for prediction of splice junction and coding regionsin genomic DNA A decision tree named NNTree (neuralnetwork tree) is constructed by dividing the training set withtheir corresponding labels to recursively generate a tree Xuet al [4] have developed an improved systemGRAIL II whichis a hybrid AI system which can predict the number of exonsin a human DNA sequence and also supports gene modelingThis process combines edge signal like accepter donortranslation start site detection and coding feature analysis

Snyder and Stormo [5] have applied dynamic program-ming and neural networks for predicting protein codingregions from a genomic DNA They have developed a pro-gramGeneParserwhich first scores theDNAsequences basedon exon-intron specific measures like local compositionalcomplexity codon usage length distribution 6-tuple fre-quency and periodic asymmetry Uberbacher and Mural [6]have proposed a method which combines some set of sensoralgorithms and neural network to predict the protein codingregions in eukaryotesThe programs developed will calculatethe values of seven sensors that were considered by theauthors The measures are frame bias matrix Fickett (three-periodicity) dinucleotide fractal dimension coding six tupleword preferences coding six tuples in frame preferencesword commonality and repetitive six tuple word preferences

Pinho et al [7] have proposed a three-state model forprotein coding region prediction Authors have consideredthree-base periodicity property Zhang [8] has used quadraticdiscriminant analysis method named MZEF for identifyingprotein coding regions in genomic human DNA Gish andStates [9] proposed a computer program named BLASTCwhich uses sequence similarity and codon utilization forpredicting the protein coding regions

Method in [3] takes more time to construct a tree forsequences of length 162The height of the trees is also a majorconcern for using this algorithm with DNA sequences ofmore length Method in [4] suffers with less accuracy due tomore error rate at classifier nodes Methods in [5ndash7] dependmore on the statistical information After this literaturesurvey the concern of a new classifier is to achieve goodclassifier accuracy and develop a classifier which can handleDNA sequences of length more than 162 with fewer nodes

Zeng et al [10] have proposed a hierarchical promoterprediction system named SCS where they have used signalstructure and context features Li et al [11] have proposeda method PCA-HPR (principal component analysis-humanpromoter recognition) to predict the promoters and tran-scription sites (TSS) Hannenhalli and Levy [12] tried toenhance the accuracy of promoter prediction by combiningCpG island feature with information of independent signalswhich are biologically motivated and these cover most of theknowledge to predict the promoter in human genome

Wu et al have proposed a method [13] for enhancingthe performance of human promoter region identification byselecting the most important features of DNA sequence foreach different functional region Ohler et al have proposed amodel [14] which integrates physical properties ofDNA into aprobabilistic eukaryotic promoter prediction system Goni etal have proposed a system ProStar [15] which uses structuralparameters for promoter region identification Authors onlyused descriptors derived from physical first principles

Bajic et al [16] have developed new software for iden-tifying promoters in a DNA sequence of vertebrates Thisprogram takes input as DNA sequence and generates a listof predicted TSS (transcription stating site) Zhang [17] hasproposed a new program for predicting a core promoter inhuman gene named as CorePromoter After the literaturesurvey on promoter prediction the main goal of proposedclassifier is to reduce the false prediction rates and improvespecificity and sensitivity values

3 Design of IN-MACA-MCC

IN-MACA-MACC basic processing as shown in Figure 1starts with identification promoter considering features likeTATA CAAT Inr and n-mers unlike AIX-MACA-Y [18] forpredicting both regions IN-MACA-MCC takes a DNA inputand checks whether it belongs to a promoter or not If itbelongs to promoter the exact boundaries are provided If thegiven input is a nonpromoter sequence it checks whether itbelongs to intron or exon or 31015840UTR If it belongs to an exonIN-MACA-MCC reports the boundaries of the first exonThese boundaries will be used by the nextmodule as shown inFigure 2 to trace the protein coding region starting from that

Advances in Bioinformatics 3

Promoter classifier

Promoter versus exons (nonpromoter)

(two-class predictor)


Feature Combiner

DNA

Promoter versus introns (nonpromoter)(two-class predictor)

Output class andsequence

extraction


boundary

Promoter versus 3998400UTR (nona promoter)

Figure 1 IN-MACA-MCC architecturemdashfront

Process three base

Real valued attributes

Input DNA sequence of lengths 54 108 and 162 from Fickett and Tung datasets and 252 and 354 length

DNA sequence from MMCRI database

Output of the class after all basins formed

Encoding the DNA sequence

IN-MACA-MCC classification algorithm

Input processing

Store basins(attractor basins are

Attractor basin

in the terms of three

pairs at each step

associated with class)

Figure 2 IN-MACA-MCC architecturemdashrear

boundary If the input does not belong to exons it will checkwith introns and 31015840UTRs and outputs the class accordingly

The design rear IN-MACA-MACC is indicated inFigure 2 Input to IN-MACA-MACC algorithm and its vari-ations will be DNA sequence and amino acid sequencesInput processing unit will process sequences three at a timeas three neighborhood cellular automata are considered forprocessingDNA sequencesThe rule generatorwill transformthe complemented and noncomplemented rules in the formof matrix so that we can apply the rules to the correspondingsequence positions very easily IN-MACA-MACC basins arecalculated as per the instructions of proposed algorithm

Table 1 Example rules

SNO Rule number General representation1 254 119902

119894minus1+ 119902119894+ 119902119894+1

2 252 119902119894minus1+ 119902119894

3 238 119902119894+ 119902119894+1

4 250 119902119894minus1+ 119902119894+1

5 204 119902119894

6 240 119902119894minus1

7 170 119902119894+1

Cellular automata that use fuzzy logic are an array ofcells arranged in linear fashion evolving with time Everycell of this array assumes a rational value in the interval ofzero and one All these cells change their states accordingto the local evaluation function which is a function of itsstate and its neighboring states The synchronous applicationof the local rules to all the cells of array will depict theglobal evolution Table 1 shows some rules for developing theproposed classifier

Example 1 Consider the rule ⟨170 238 204⟩ and corre-sponding transition matrix is shown below

If119875(0) is the initial state with real values (0 025 050) thesuccessive three steps are defined below

The transitions from one state to another state are definedas

119879 = [

[

0 1 0

0 1 1

0 0 1

]

]

119875 (0) = (0 025 050)

(1)

Step 1 Apply rule 170 for the first cell Rule 170 says that thenext state depends on the right neighbor Consider

119875 (1) = (025 025 050) (2)


04 06 02

04 10 02

06 04 02

02 10 02

10 02 02

08 02 02

08 06 02

08 08 02

08 10 02

02 02 02

02 04 02

06 10 02

10 10 02

02 06 02

06 08 02

Figure 3 Attractor state (10 10 and 02)mdashB formed with rule⟨170 252 204⟩

Apply rule 238 to the second cell Rule 238 says that the nextstate depends on its state and the right neighbor Consider

119875 (1) = (025 075 050) (025 + 050) (3)

Apply rule 204 to the third cell Rule 204 says that the nextstate depends only on its state

After applying the rule for all the cells in the state is (025075 050) that is the resultant state after first iteration

119875 (1) = (025 075 050) (4)

Similarly one has the following

Step 2 Consider

119875 (2) = (075 10 050) (5)

Step 3 Consider

119875 (3) = (10 10 050) (6)

Likewise we can construct IN-MACA-MCC for a sampledataset as shown in Figure 3

4 Modified Clonal Classifier with MACA

41 Simplified Modified Clonal Algorithm

(1) Generate initial antibody population (AIS-MACArules) randomly and call it Ab It consists of twosubsets memory population Ab

119898and reservoir pop-

ulation Ab119903

(2) Construct a set of antigens population and call it Ag(DNA sequence with classinput)

Table 2 Execution time for prediction of both protein andpromoterregions

Size of dataset Prediction time of integratedalgorithm in ms

5000 10646000 138910000 200220000 2545

(3) Select an antigenAg119895fromAg the antigen population

(4) Apply every member of antibody population to theselected antigen Ag

119895 check whether it is predicting

the correct class and calculate affinity of the rule withthe antigen via fitness equation

(5) Select 119898 highest affinity antibodies (AIS-MACArules) from Ab and place them in 119875

119898

(6) Generate clones for each antibody which will beproportional to the affinity as per fitness Place theclones in the new population 119875

119894

(7) Apply mutation to the newly formed population 119875119894

where the degree is inversely proportional to theiraffinity This produces a more mature population 119875lowast

119894

(8) 119877119890calculate the affinity of the rule with the corre-

sponding antigen as we did it in step four Order theantibodies in descending order (high fitness antibodywill be on top)

(9) Compare the antibodies from 119875lowast119894with the antibodies

population from Ab119898 Select the better fitness rules

remove them from 119875lowast119894 and place them in Ab

119898

(10) Randomly generate antibodies for introducing diver-sity Compare the antibodies in Abr the left-outantibodies in 119875lowast

119894 and randomly generate antibodies

Select the better fitness rules among three and placethem in Abr

(11) For every generation compare the antibodies in Ab119898

and Ab119903and place the best in Ab

119898

42 Difference betweenClonal andModifiedClonal AlgorithmThe difference between original clonal algorithm and themodified algorithm proposed by us lies on how efficientlywe are managing the use of generated antibodies Originalclonal algorithm will not take advantage of the antibodiesgenerated by every cloned population Once the comparisonof antibodies in 119875lowast

119894and Ab

119898gets completed the best will

be placed in Ab119898

and the rest of antibodies in 119875lowast119894

areomitted Even the reservoir antibodies are poorlymaintainedSo we try to use the best antibodies in 119875lowast

119894left out after

placing them in Abm For this purpose we are comparingthe antibodies already in Ab

119903with left-out antibodies in

119875lowast

119894and newly generated antibodies which were meant for

introducing diversity After comparing the three sets the bestwill be placed in Abr In the original clonal algorithm step 11will not exit Step 11 will ensure the best fitness rules stay inAb119898which will be solution of the entire problem


Figure 4 Basin calculation

Figure 5 Training interface

Table 3 IN-MACA-MCC protein coding comparison with existingapproaches

Algorithmcoding measure Sensitivity SpecificityOC1 653 664Hexamer 6836 702Position asymmetry 723 745Dicodon usage 813 823CRITICA 825 849IN-MACA-MCC 896 893

Table 4 IN-MACA-MCC promoter comparison with existingapproaches

Method Sensitivity SpecificityPromoter inspector 569 469Dragon promoter finder 623 593Promo predictor 653 669CNN-promoter 763 823SPANN 689 84IMC 76 86IN-MACA-MCC 885 927

5 Experimental Results

The proposed classifier is trained and tested with Fickettand Tung [19] datasets for protein coding region predictionfor DNA sequences of lengths 54 108 and 162 All the 21measures reported in [19] were considered for developing theclassifier This classifier is trained and tested with MMCRI(httpwwwmmchriresin) [20] datasets for protein coding

0102030405060708090

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y

Accuracy of integrated AIS-MACA Algoprotein coding region identification

CodingNoncoding

sequences sequences sequences sequences sequences

Figure 6 Predictive accuracy for protein coding regions

75

80

85

90

95

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y

Accuracy of integrated AIS-MACA Algo promoter region identification

PromoterNonpromoter


Figure 7 Predictive accuracy for promoter regions

region prediction for DNA sequences of lengths 252 and 354The proposed classifier is trained and tested with promotersequences from DBTSS [21] dataset and nonpromoters fromEID [22] and UTRdb [23] datasets Figures 4 and 5 show thedeveloped interfaces Table 2 shows the execution time forpredicting both protein and promoter regions which is verypromising Tables 3 and 4 show the sensitivity and specificityof both predictions All the experiments are performed inSUN with Solaris 58 445MHz clock Figures 6 and 7 showthe accuracy of prediction separately which is the importantoutput of our work

51 Specific Output of 54 Length DNA Sequence with Bound-aries See Box 1 Figures 4 5 6 and 7 and Table 2

52 Human Promoter Output See Box 2 and Tables 3 and 4


DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps

AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT

Sequence Name Program Type of Exon Boundary Strand

kiran12 jntuh AIS MACA Internal 3 25 +







Box 1

DNA Seq Sequence human Kiran promoter 223jntuh

CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC

GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC

ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC

TGCACGGTGTGGAGCTTCAA

Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh

TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG

CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC

CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA

AGCCCGTGCATTTTGCTGCG

Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps

Sequence human Kiran promoter 223jntuh Human Promoter Prediction

Start End Score Promoter Sequence

78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG

201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand


230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA

137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG

Box 2

6 Conclusion

We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000

[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995

[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014

[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994

[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995

[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural


network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991

[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006

[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997

[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993

[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010

[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008

[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001

[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007

[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001

[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007

[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002

[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998

[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014

[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai

and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006

[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000

[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research



Microbiology


the region of the upstream of the transcriptional start site(TSS)rdquo Identifying a new promoter in a DNA sequence willlead to finding a new protein If we identify the promoterregion we can extract information regarding gene expressionpatterns cell specificity and development Promoters willregulate a gene expression Some of the genetic diseaseswhich are associatedwith variations in promoters are asthmabeta thalassemia and Rubinstein-Taybi syndrome Promotersequence can be used to control the speed of translation fromDNA into protein It is also used in genetically modifiedfoods

In vertebrates only five percent of the gene is made upof exons Genes mostly will have seven to eight exons with145 bp length at an average Introns have 3365 bp length atan average Promoter comprises a small percentage of entiregenome The features of promoters are different from otherfunctional regions like exons introns and 31015840UTRs Thesefacts make protein coding and promoter region predictionsvery difficult tasks

This paper is organized in the followingmanner Section 2provides the entire literature survey of both protein codingand promoter regions Section 3 provides the design of theproposed system Section 4 presents the MACA-MCC clas-sifier for promoter and protein coding prediction Section 5provides the experimental results with discussion Section 6provides the future extensions and conclusion to the pro-posed classifier

2 Literature Survey

Salzberg has used a decision tree algorithm [2] for locatingprotein coding regions in DNA sequences which is adaptableand can process DNA sequences of lengths 54 bp 108 bp and162 bp Maji and Paul [3] have developed neural network treeclassifier for prediction of splice junction and coding regionsin genomic DNA A decision tree named NNTree (neuralnetwork tree) is constructed by dividing the training set withtheir corresponding labels to recursively generate a tree Xuet al [4] have developed an improved systemGRAIL II whichis a hybrid AI system which can predict the number of exonsin a human DNA sequence and also supports gene modelingThis process combines edge signal like accepter donortranslation start site detection and coding feature analysis

Snyder and Stormo [5] have applied dynamic program-ming and neural networks for predicting protein codingregions from a genomic DNA They have developed a pro-gramGeneParserwhich first scores theDNAsequences basedon exon-intron specific measures like local compositionalcomplexity codon usage length distribution 6-tuple fre-quency and periodic asymmetry Uberbacher and Mural [6]have proposed a method which combines some set of sensoralgorithms and neural network to predict the protein codingregions in eukaryotesThe programs developed will calculatethe values of seven sensors that were considered by theauthors The measures are frame bias matrix Fickett (three-periodicity) dinucleotide fractal dimension coding six tupleword preferences coding six tuples in frame preferencesword commonality and repetitive six tuple word preferences

Pinho et al [7] have proposed a three-state model forprotein coding region prediction Authors have consideredthree-base periodicity property Zhang [8] has used quadraticdiscriminant analysis method named MZEF for identifyingprotein coding regions in genomic human DNA Gish andStates [9] proposed a computer program named BLASTCwhich uses sequence similarity and codon utilization forpredicting the protein coding regions

Method in [3] takes more time to construct a tree forsequences of length 162The height of the trees is also a majorconcern for using this algorithm with DNA sequences ofmore length Method in [4] suffers with less accuracy due tomore error rate at classifier nodes Methods in [5ndash7] dependmore on the statistical information After this literaturesurvey the concern of a new classifier is to achieve goodclassifier accuracy and develop a classifier which can handleDNA sequences of length more than 162 with fewer nodes

Zeng et al [10] have proposed a hierarchical promoterprediction system named SCS where they have used signalstructure and context features Li et al [11] have proposeda method PCA-HPR (principal component analysis-humanpromoter recognition) to predict the promoters and tran-scription sites (TSS) Hannenhalli and Levy [12] tried toenhance the accuracy of promoter prediction by combiningCpG island feature with information of independent signalswhich are biologically motivated and these cover most of theknowledge to predict the promoter in human genome

Wu et al have proposed a method [13] for enhancingthe performance of human promoter region identification byselecting the most important features of DNA sequence foreach different functional region Ohler et al have proposed amodel [14] which integrates physical properties ofDNA into aprobabilistic eukaryotic promoter prediction system Goni etal have proposed a system ProStar [15] which uses structuralparameters for promoter region identification Authors onlyused descriptors derived from physical first principles

Bajic et al [16] have developed new software for iden-tifying promoters in a DNA sequence of vertebrates Thisprogram takes input as DNA sequence and generates a listof predicted TSS (transcription stating site) Zhang [17] hasproposed a new program for predicting a core promoter inhuman gene named as CorePromoter After the literaturesurvey on promoter prediction the main goal of proposedclassifier is to reduce the false prediction rates and improvespecificity and sensitivity values

3 Design of IN-MACA-MCC

IN-MACA-MACC basic processing as shown in Figure 1starts with identification promoter considering features likeTATA CAAT Inr and n-mers unlike AIX-MACA-Y [18] forpredicting both regions IN-MACA-MCC takes a DNA inputand checks whether it belongs to a promoter or not If itbelongs to promoter the exact boundaries are provided If thegiven input is a nonpromoter sequence it checks whether itbelongs to intron or exon or 31015840UTR If it belongs to an exonIN-MACA-MCC reports the boundaries of the first exonThese boundaries will be used by the nextmodule as shown inFigure 2 to trace the protein coding region starting from that


Promoter classifier




Feature Combiner

DNA



extraction


boundary



Process three base







Input processing


Attractor basin


pairs at each step







119894minus1+ 119902119894+ 119902119894+1

2 252 119902119894minus1+ 119902119894

3 238 119902119894+ 119902119894+1

4 250 119902119894minus1+ 119902119894+1

5 204 119902119894

6 240 119902119894minus1

7 170 119902119894+1





119879 = [

[

0 1 0

0 1 1

0 0 1

]

]

119875 (0) = (0 025 050)

(1)


119875 (1) = (025 025 050) (2)


04 06 02

04 10 02

06 04 02

02 10 02

10 02 02

08 02 02

08 06 02

08 08 02

08 10 02

02 02 02

02 04 02

06 10 02

10 10 02

02 06 02

06 08 02



119875 (1) = (025 075 050) (025 + 050) (3)



119875 (1) = (025 075 050) (4)


Step 2 Consider

119875 (2) = (075 10 050) (5)

Step 3 Consider

119875 (3) = (10 10 050) (6)






ulation Ab119903




5000 10646000 138910000 200220000 2545






119898


119894



119894






119898






119898


119894and Ab








119875lowast












0102030405060708090

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


CodingNoncoding



75

80

85

90

95

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


PromoterNonpromoter

















Box 1



















Box 2

6 Conclusion




References

































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


Promoter classifier




Feature Combiner

DNA



extraction


boundary



Process three base







Input processing


Attractor basin


pairs at each step







119894minus1+ 119902119894+ 119902119894+1

2 252 119902119894minus1+ 119902119894

3 238 119902119894+ 119902119894+1

4 250 119902119894minus1+ 119902119894+1

5 204 119902119894

6 240 119902119894minus1

7 170 119902119894+1





119879 = [

[

0 1 0

0 1 1

0 0 1

]

]

119875 (0) = (0 025 050)

(1)


119875 (1) = (025 025 050) (2)


04 06 02

04 10 02

06 04 02

02 10 02

10 02 02

08 02 02

08 06 02

08 08 02

08 10 02

02 02 02

02 04 02

06 10 02

10 10 02

02 06 02

06 08 02



119875 (1) = (025 075 050) (025 + 050) (3)



119875 (1) = (025 075 050) (4)


Step 2 Consider

119875 (2) = (075 10 050) (5)

Step 3 Consider

119875 (3) = (10 10 050) (6)






ulation Ab119903




5000 10646000 138910000 200220000 2545






119898


119894



119894






119898






119898


119894and Ab








119875lowast












0102030405060708090

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


CodingNoncoding



75

80

85

90

95

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


PromoterNonpromoter

















Box 1



















Box 2

6 Conclusion




References

































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


04 06 02

04 10 02

06 04 02

02 10 02

10 02 02

08 02 02

08 06 02

08 08 02

08 10 02

02 02 02

02 04 02

06 10 02

10 10 02

02 06 02

06 08 02



119875 (1) = (025 075 050) (025 + 050) (3)



119875 (1) = (025 075 050) (4)


Step 2 Consider

119875 (2) = (075 10 050) (5)

Step 3 Consider

119875 (3) = (10 10 050) (6)






ulation Ab119903




5000 10646000 138910000 200220000 2545






119898


119894



119894






119898






119898


119894and Ab








119875lowast












0102030405060708090

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


CodingNoncoding



75

80

85

90

95

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


PromoterNonpromoter

















Box 1



















Box 2

6 Conclusion




References

































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology










0102030405060708090

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


CodingNoncoding



75

80

85

90

95

100

54 108 162 252 354

Pred

ictiv

e acc

urac

y


PromoterNonpromoter

















Box 1



















Box 2

6 Conclusion




References

































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology












Box 1



















Box 2

6 Conclusion




References

































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology



























Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology

research article in-maca-mcc: integrated multiple...

Documents