research article in-maca-mcc: integrated multiple...
TRANSCRIPT
Research ArticleIN-MACA-MCC Integrated Multiple Attractor CellularAutomata with Modified Clonal Classifier for Human ProteinCoding and Promoter Prediction
Kiran Sree Pokkuluri1 Ramesh Babu Inampudi2 and S S S N Usha Devi Nedunuri3
1 Department of CSE JNTU Hyderabad 500 085 India2Department of CSE Acharya Nagarjuna University Guntur 522510 India3 Department of CSE University College of Engineering JNTU Kakinada 533003 India
Correspondence should be addressed to Kiran Sree Pokkuluri profkiransreegmailcom
Received 28 January 2014 Revised 7 June 2014 Accepted 14 June 2014 Published 15 July 2014
Academic Editor Bhaskar Dasgupta
Copyright copy 2014 Kiran Sree Pokkuluri et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited
Protein coding and promoter region predictions are very important challenges of bioinformatics (Attwood and Teresa 2000) Theidentification of these regions plays a crucial role in understanding the genesManynovel computational andmathematicalmethodsare introduced as well as existing methods that are getting refined for predicting both of the regions separately still there is a scopefor improvementWe propose a classifier that is built withMACA (multiple attractor cellular automata) andMCC (modified clonalclassifier) to predict both regions with a single classifier The proposed classifier is trained and tested with Fickett and Tung (1992)datasets for protein coding region prediction for DNA sequences of lengths 54 108 and 162 This classifier is trained and testedwith MMCRI datasets for protein coding region prediction for DNA sequences of lengths 252 and 354 The proposed classifier istrained and tested with promoter sequences from DBTSS (Yamashita et al 2006) dataset and nonpromoters from EID (Saxonovet al 2000) and UTRdb (Pesole et al 2002) datasets The proposed model can predict both regions with an average accuracy of905 for promoter and 896 for protein coding region predictionsThe specificity and sensitivity values of promoter and proteincoding region predictions are 089 and 092 respectively
1 Introduction
Many of the important problems [1] in bioinformatics canbeaddressed with our computing techniques very easily Sowe have identified two major problems in bioinformaticsand worked on them basically to understand the logicalitiesin these two problems After an extensive literature surveywe have developed the frame work for addressing theseproblems This frame work developed can be useful foraddressing other problems in bioinformatics like splice junc-tion prediction secondary structure prediction of proteinand so forth The proposed (IN-MACA-MCC) classifier canpredict both promoter and protein coding regions very easilywith more accuracy when compared with existing literaturewith less time
DNA is an important component of a cell and geneswill be found in specific portion of DNA which will contain
the information as explicit sequences of bases (A G Cand T) These explicit sequences of nucleotides will haveinstructions to build the proteins But the region whichwill have the instructions which is called protein codingregions occupies very less space in a DNA sequence Theidentification of protein coding regions plays a vital role inunderstanding the genes We can extract lot of informationlike what is the disease causing gene whether it is inheritedfrom father or mother and a promoter can regulate thegrowth of disease slowly and how one cell is going tocontrol another cell Although the entire human genome issequenced identifying the protein coding region as well asfinding the gene is still a complicated process
DNA contains lots of information We need promoter forDNA transcription to from RNA So promoter plays a vitalrole in DNA transcription It is defined as ldquothe sequence in
Hindawi Publishing CorporationAdvances in BioinformaticsVolume 2014 Article ID 261362 7 pageshttpdxdoiorg1011552014261362
2 Advances in Bioinformatics
the region of the upstream of the transcriptional start site(TSS)rdquo Identifying a new promoter in a DNA sequence willlead to finding a new protein If we identify the promoterregion we can extract information regarding gene expressionpatterns cell specificity and development Promoters willregulate a gene expression Some of the genetic diseaseswhich are associatedwith variations in promoters are asthmabeta thalassemia and Rubinstein-Taybi syndrome Promotersequence can be used to control the speed of translation fromDNA into protein It is also used in genetically modifiedfoods
In vertebrates only five percent of the gene is made upof exons Genes mostly will have seven to eight exons with145 bp length at an average Introns have 3365 bp length atan average Promoter comprises a small percentage of entiregenome The features of promoters are different from otherfunctional regions like exons introns and 31015840UTRs Thesefacts make protein coding and promoter region predictionsvery difficult tasks
This paper is organized in the followingmanner Section 2provides the entire literature survey of both protein codingand promoter regions Section 3 provides the design of theproposed system Section 4 presents the MACA-MCC clas-sifier for promoter and protein coding prediction Section 5provides the experimental results with discussion Section 6provides the future extensions and conclusion to the pro-posed classifier
2 Literature Survey
Salzberg has used a decision tree algorithm [2] for locatingprotein coding regions in DNA sequences which is adaptableand can process DNA sequences of lengths 54 bp 108 bp and162 bp Maji and Paul [3] have developed neural network treeclassifier for prediction of splice junction and coding regionsin genomic DNA A decision tree named NNTree (neuralnetwork tree) is constructed by dividing the training set withtheir corresponding labels to recursively generate a tree Xuet al [4] have developed an improved systemGRAIL II whichis a hybrid AI system which can predict the number of exonsin a human DNA sequence and also supports gene modelingThis process combines edge signal like accepter donortranslation start site detection and coding feature analysis
Snyder and Stormo [5] have applied dynamic program-ming and neural networks for predicting protein codingregions from a genomic DNA They have developed a pro-gramGeneParserwhich first scores theDNAsequences basedon exon-intron specific measures like local compositionalcomplexity codon usage length distribution 6-tuple fre-quency and periodic asymmetry Uberbacher and Mural [6]have proposed a method which combines some set of sensoralgorithms and neural network to predict the protein codingregions in eukaryotesThe programs developed will calculatethe values of seven sensors that were considered by theauthors The measures are frame bias matrix Fickett (three-periodicity) dinucleotide fractal dimension coding six tupleword preferences coding six tuples in frame preferencesword commonality and repetitive six tuple word preferences
Pinho et al [7] have proposed a three-state model forprotein coding region prediction Authors have consideredthree-base periodicity property Zhang [8] has used quadraticdiscriminant analysis method named MZEF for identifyingprotein coding regions in genomic human DNA Gish andStates [9] proposed a computer program named BLASTCwhich uses sequence similarity and codon utilization forpredicting the protein coding regions
Method in [3] takes more time to construct a tree forsequences of length 162The height of the trees is also a majorconcern for using this algorithm with DNA sequences ofmore length Method in [4] suffers with less accuracy due tomore error rate at classifier nodes Methods in [5ndash7] dependmore on the statistical information After this literaturesurvey the concern of a new classifier is to achieve goodclassifier accuracy and develop a classifier which can handleDNA sequences of length more than 162 with fewer nodes
Zeng et al [10] have proposed a hierarchical promoterprediction system named SCS where they have used signalstructure and context features Li et al [11] have proposeda method PCA-HPR (principal component analysis-humanpromoter recognition) to predict the promoters and tran-scription sites (TSS) Hannenhalli and Levy [12] tried toenhance the accuracy of promoter prediction by combiningCpG island feature with information of independent signalswhich are biologically motivated and these cover most of theknowledge to predict the promoter in human genome
Wu et al have proposed a method [13] for enhancingthe performance of human promoter region identification byselecting the most important features of DNA sequence foreach different functional region Ohler et al have proposed amodel [14] which integrates physical properties ofDNA into aprobabilistic eukaryotic promoter prediction system Goni etal have proposed a system ProStar [15] which uses structuralparameters for promoter region identification Authors onlyused descriptors derived from physical first principles
Bajic et al [16] have developed new software for iden-tifying promoters in a DNA sequence of vertebrates Thisprogram takes input as DNA sequence and generates a listof predicted TSS (transcription stating site) Zhang [17] hasproposed a new program for predicting a core promoter inhuman gene named as CorePromoter After the literaturesurvey on promoter prediction the main goal of proposedclassifier is to reduce the false prediction rates and improvespecificity and sensitivity values
3 Design of IN-MACA-MCC
IN-MACA-MACC basic processing as shown in Figure 1starts with identification promoter considering features likeTATA CAAT Inr and n-mers unlike AIX-MACA-Y [18] forpredicting both regions IN-MACA-MCC takes a DNA inputand checks whether it belongs to a promoter or not If itbelongs to promoter the exact boundaries are provided If thegiven input is a nonpromoter sequence it checks whether itbelongs to intron or exon or 31015840UTR If it belongs to an exonIN-MACA-MCC reports the boundaries of the first exonThese boundaries will be used by the nextmodule as shown inFigure 2 to trace the protein coding region starting from that
Advances in Bioinformatics 3
Promoter classifier
Promoter versus exons (nonpromoter)
(two-class predictor)
(two-class predictor)
Feature Combiner
DNA
Promoter versus introns (nonpromoter)(two-class predictor)
Output class andsequence
extraction
(two-class predictor)
boundary
Promoter versus 3998400UTR (nona promoter)
Figure 1 IN-MACA-MCC architecturemdashfront
Process three base
Real valued attributes
Input DNA sequence of lengths 54 108 and 162 from Fickett and Tung datasets and 252 and 354 length
DNA sequence from MMCRI database
Output of the class after all basins formed
Encoding the DNA sequence
IN-MACA-MCC classification algorithm
Input processing
Store basins(attractor basins are
Attractor basin
in the terms of three
pairs at each step
associated with class)
Figure 2 IN-MACA-MCC architecturemdashrear
boundary If the input does not belong to exons it will checkwith introns and 31015840UTRs and outputs the class accordingly
The design rear IN-MACA-MACC is indicated inFigure 2 Input to IN-MACA-MACC algorithm and its vari-ations will be DNA sequence and amino acid sequencesInput processing unit will process sequences three at a timeas three neighborhood cellular automata are considered forprocessingDNA sequencesThe rule generatorwill transformthe complemented and noncomplemented rules in the formof matrix so that we can apply the rules to the correspondingsequence positions very easily IN-MACA-MACC basins arecalculated as per the instructions of proposed algorithm
Table 1 Example rules
SNO Rule number General representation1 254 119902
119894minus1+ 119902119894+ 119902119894+1
2 252 119902119894minus1+ 119902119894
3 238 119902119894+ 119902119894+1
4 250 119902119894minus1+ 119902119894+1
5 204 119902119894
6 240 119902119894minus1
7 170 119902119894+1
Cellular automata that use fuzzy logic are an array ofcells arranged in linear fashion evolving with time Everycell of this array assumes a rational value in the interval ofzero and one All these cells change their states accordingto the local evaluation function which is a function of itsstate and its neighboring states The synchronous applicationof the local rules to all the cells of array will depict theglobal evolution Table 1 shows some rules for developing theproposed classifier
Example 1 Consider the rule ⟨170 238 204⟩ and corre-sponding transition matrix is shown below
If119875(0) is the initial state with real values (0 025 050) thesuccessive three steps are defined below
The transitions from one state to another state are definedas
119879 = [
[
0 1 0
0 1 1
0 0 1
]
]
119875 (0) = (0 025 050)
(1)
Step 1 Apply rule 170 for the first cell Rule 170 says that thenext state depends on the right neighbor Consider
119875 (1) = (025 025 050) (2)
4 Advances in Bioinformatics
04 06 02
04 10 02
06 04 02
02 10 02
10 02 02
08 02 02
08 06 02
08 08 02
08 10 02
02 02 02
02 04 02
06 10 02
10 10 02
02 06 02
06 08 02
Figure 3 Attractor state (10 10 and 02)mdashB formed with rule⟨170 252 204⟩
Apply rule 238 to the second cell Rule 238 says that the nextstate depends on its state and the right neighbor Consider
119875 (1) = (025 075 050) (025 + 050) (3)
Apply rule 204 to the third cell Rule 204 says that the nextstate depends only on its state
After applying the rule for all the cells in the state is (025075 050) that is the resultant state after first iteration
119875 (1) = (025 075 050) (4)
Similarly one has the following
Step 2 Consider
119875 (2) = (075 10 050) (5)
Step 3 Consider
119875 (3) = (10 10 050) (6)
Likewise we can construct IN-MACA-MCC for a sampledataset as shown in Figure 3
4 Modified Clonal Classifier with MACA
41 Simplified Modified Clonal Algorithm
(1) Generate initial antibody population (AIS-MACArules) randomly and call it Ab It consists of twosubsets memory population Ab
119898and reservoir pop-
ulation Ab119903
(2) Construct a set of antigens population and call it Ag(DNA sequence with classinput)
Table 2 Execution time for prediction of both protein andpromoterregions
Size of dataset Prediction time of integratedalgorithm in ms
5000 10646000 138910000 200220000 2545
(3) Select an antigenAg119895fromAg the antigen population
(4) Apply every member of antibody population to theselected antigen Ag
119895 check whether it is predicting
the correct class and calculate affinity of the rule withthe antigen via fitness equation
(5) Select 119898 highest affinity antibodies (AIS-MACArules) from Ab and place them in 119875
119898
(6) Generate clones for each antibody which will beproportional to the affinity as per fitness Place theclones in the new population 119875
119894
(7) Apply mutation to the newly formed population 119875119894
where the degree is inversely proportional to theiraffinity This produces a more mature population 119875lowast
119894
(8) 119877119890calculate the affinity of the rule with the corre-
sponding antigen as we did it in step four Order theantibodies in descending order (high fitness antibodywill be on top)
(9) Compare the antibodies from 119875lowast119894with the antibodies
population from Ab119898 Select the better fitness rules
remove them from 119875lowast119894 and place them in Ab
119898
(10) Randomly generate antibodies for introducing diver-sity Compare the antibodies in Abr the left-outantibodies in 119875lowast
119894 and randomly generate antibodies
Select the better fitness rules among three and placethem in Abr
(11) For every generation compare the antibodies in Ab119898
and Ab119903and place the best in Ab
119898
42 Difference betweenClonal andModifiedClonal AlgorithmThe difference between original clonal algorithm and themodified algorithm proposed by us lies on how efficientlywe are managing the use of generated antibodies Originalclonal algorithm will not take advantage of the antibodiesgenerated by every cloned population Once the comparisonof antibodies in 119875lowast
119894and Ab
119898gets completed the best will
be placed in Ab119898
and the rest of antibodies in 119875lowast119894
areomitted Even the reservoir antibodies are poorlymaintainedSo we try to use the best antibodies in 119875lowast
119894left out after
placing them in Abm For this purpose we are comparingthe antibodies already in Ab
119903with left-out antibodies in
119875lowast
119894and newly generated antibodies which were meant for
introducing diversity After comparing the three sets the bestwill be placed in Abr In the original clonal algorithm step 11will not exit Step 11 will ensure the best fitness rules stay inAb119898which will be solution of the entire problem
Advances in Bioinformatics 5
Figure 4 Basin calculation
Figure 5 Training interface
Table 3 IN-MACA-MCC protein coding comparison with existingapproaches
Algorithmcoding measure Sensitivity SpecificityOC1 653 664Hexamer 6836 702Position asymmetry 723 745Dicodon usage 813 823CRITICA 825 849IN-MACA-MCC 896 893
Table 4 IN-MACA-MCC promoter comparison with existingapproaches
Method Sensitivity SpecificityPromoter inspector 569 469Dragon promoter finder 623 593Promo predictor 653 669CNN-promoter 763 823SPANN 689 84IMC 76 86IN-MACA-MCC 885 927
5 Experimental Results
The proposed classifier is trained and tested with Fickettand Tung [19] datasets for protein coding region predictionfor DNA sequences of lengths 54 108 and 162 All the 21measures reported in [19] were considered for developing theclassifier This classifier is trained and tested with MMCRI(httpwwwmmchriresin) [20] datasets for protein coding
0102030405060708090
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algoprotein coding region identification
CodingNoncoding
sequences sequences sequences sequences sequences
Figure 6 Predictive accuracy for protein coding regions
75
80
85
90
95
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algo promoter region identification
PromoterNonpromoter
sequences sequences sequences sequences sequences
Figure 7 Predictive accuracy for promoter regions
region prediction for DNA sequences of lengths 252 and 354The proposed classifier is trained and tested with promotersequences from DBTSS [21] dataset and nonpromoters fromEID [22] and UTRdb [23] datasets Figures 4 and 5 show thedeveloped interfaces Table 2 shows the execution time forpredicting both protein and promoter regions which is verypromising Tables 3 and 4 show the sensitivity and specificityof both predictions All the experiments are performed inSUN with Solaris 58 445MHz clock Figures 6 and 7 showthe accuracy of prediction separately which is the importantoutput of our work
51 Specific Output of 54 Length DNA Sequence with Bound-aries See Box 1 Figures 4 5 6 and 7 and Table 2
52 Human Promoter Output See Box 2 and Tables 3 and 4
6 Advances in Bioinformatics
DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps
AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT
Sequence Name Program Type of Exon Boundary Strand
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 9 34 +
Box 1
DNA Seq Sequence human Kiran promoter 223jntuh
CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC
GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC
ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC
TGCACGGTGTGGAGCTTCAA
Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh
TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG
CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC
CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA
AGCCCGTGCATTTTGCTGCG
Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps
Sequence human Kiran promoter 223jntuh Human Promoter Prediction
Start End Score Promoter Sequence
78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG
201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand
Start End Score Promoter Sequence
230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA
137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG
Box 2
6 Conclusion
We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000
[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995
[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014
[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994
[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995
[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
2 Advances in Bioinformatics
the region of the upstream of the transcriptional start site(TSS)rdquo Identifying a new promoter in a DNA sequence willlead to finding a new protein If we identify the promoterregion we can extract information regarding gene expressionpatterns cell specificity and development Promoters willregulate a gene expression Some of the genetic diseaseswhich are associatedwith variations in promoters are asthmabeta thalassemia and Rubinstein-Taybi syndrome Promotersequence can be used to control the speed of translation fromDNA into protein It is also used in genetically modifiedfoods
In vertebrates only five percent of the gene is made upof exons Genes mostly will have seven to eight exons with145 bp length at an average Introns have 3365 bp length atan average Promoter comprises a small percentage of entiregenome The features of promoters are different from otherfunctional regions like exons introns and 31015840UTRs Thesefacts make protein coding and promoter region predictionsvery difficult tasks
This paper is organized in the followingmanner Section 2provides the entire literature survey of both protein codingand promoter regions Section 3 provides the design of theproposed system Section 4 presents the MACA-MCC clas-sifier for promoter and protein coding prediction Section 5provides the experimental results with discussion Section 6provides the future extensions and conclusion to the pro-posed classifier
2 Literature Survey
Salzberg has used a decision tree algorithm [2] for locatingprotein coding regions in DNA sequences which is adaptableand can process DNA sequences of lengths 54 bp 108 bp and162 bp Maji and Paul [3] have developed neural network treeclassifier for prediction of splice junction and coding regionsin genomic DNA A decision tree named NNTree (neuralnetwork tree) is constructed by dividing the training set withtheir corresponding labels to recursively generate a tree Xuet al [4] have developed an improved systemGRAIL II whichis a hybrid AI system which can predict the number of exonsin a human DNA sequence and also supports gene modelingThis process combines edge signal like accepter donortranslation start site detection and coding feature analysis
Snyder and Stormo [5] have applied dynamic program-ming and neural networks for predicting protein codingregions from a genomic DNA They have developed a pro-gramGeneParserwhich first scores theDNAsequences basedon exon-intron specific measures like local compositionalcomplexity codon usage length distribution 6-tuple fre-quency and periodic asymmetry Uberbacher and Mural [6]have proposed a method which combines some set of sensoralgorithms and neural network to predict the protein codingregions in eukaryotesThe programs developed will calculatethe values of seven sensors that were considered by theauthors The measures are frame bias matrix Fickett (three-periodicity) dinucleotide fractal dimension coding six tupleword preferences coding six tuples in frame preferencesword commonality and repetitive six tuple word preferences
Pinho et al [7] have proposed a three-state model forprotein coding region prediction Authors have consideredthree-base periodicity property Zhang [8] has used quadraticdiscriminant analysis method named MZEF for identifyingprotein coding regions in genomic human DNA Gish andStates [9] proposed a computer program named BLASTCwhich uses sequence similarity and codon utilization forpredicting the protein coding regions
Method in [3] takes more time to construct a tree forsequences of length 162The height of the trees is also a majorconcern for using this algorithm with DNA sequences ofmore length Method in [4] suffers with less accuracy due tomore error rate at classifier nodes Methods in [5ndash7] dependmore on the statistical information After this literaturesurvey the concern of a new classifier is to achieve goodclassifier accuracy and develop a classifier which can handleDNA sequences of length more than 162 with fewer nodes
Zeng et al [10] have proposed a hierarchical promoterprediction system named SCS where they have used signalstructure and context features Li et al [11] have proposeda method PCA-HPR (principal component analysis-humanpromoter recognition) to predict the promoters and tran-scription sites (TSS) Hannenhalli and Levy [12] tried toenhance the accuracy of promoter prediction by combiningCpG island feature with information of independent signalswhich are biologically motivated and these cover most of theknowledge to predict the promoter in human genome
Wu et al have proposed a method [13] for enhancingthe performance of human promoter region identification byselecting the most important features of DNA sequence foreach different functional region Ohler et al have proposed amodel [14] which integrates physical properties ofDNA into aprobabilistic eukaryotic promoter prediction system Goni etal have proposed a system ProStar [15] which uses structuralparameters for promoter region identification Authors onlyused descriptors derived from physical first principles
Bajic et al [16] have developed new software for iden-tifying promoters in a DNA sequence of vertebrates Thisprogram takes input as DNA sequence and generates a listof predicted TSS (transcription stating site) Zhang [17] hasproposed a new program for predicting a core promoter inhuman gene named as CorePromoter After the literaturesurvey on promoter prediction the main goal of proposedclassifier is to reduce the false prediction rates and improvespecificity and sensitivity values
3 Design of IN-MACA-MCC
IN-MACA-MACC basic processing as shown in Figure 1starts with identification promoter considering features likeTATA CAAT Inr and n-mers unlike AIX-MACA-Y [18] forpredicting both regions IN-MACA-MCC takes a DNA inputand checks whether it belongs to a promoter or not If itbelongs to promoter the exact boundaries are provided If thegiven input is a nonpromoter sequence it checks whether itbelongs to intron or exon or 31015840UTR If it belongs to an exonIN-MACA-MCC reports the boundaries of the first exonThese boundaries will be used by the nextmodule as shown inFigure 2 to trace the protein coding region starting from that
Advances in Bioinformatics 3
Promoter classifier
Promoter versus exons (nonpromoter)
(two-class predictor)
(two-class predictor)
Feature Combiner
DNA
Promoter versus introns (nonpromoter)(two-class predictor)
Output class andsequence
extraction
(two-class predictor)
boundary
Promoter versus 3998400UTR (nona promoter)
Figure 1 IN-MACA-MCC architecturemdashfront
Process three base
Real valued attributes
Input DNA sequence of lengths 54 108 and 162 from Fickett and Tung datasets and 252 and 354 length
DNA sequence from MMCRI database
Output of the class after all basins formed
Encoding the DNA sequence
IN-MACA-MCC classification algorithm
Input processing
Store basins(attractor basins are
Attractor basin
in the terms of three
pairs at each step
associated with class)
Figure 2 IN-MACA-MCC architecturemdashrear
boundary If the input does not belong to exons it will checkwith introns and 31015840UTRs and outputs the class accordingly
The design rear IN-MACA-MACC is indicated inFigure 2 Input to IN-MACA-MACC algorithm and its vari-ations will be DNA sequence and amino acid sequencesInput processing unit will process sequences three at a timeas three neighborhood cellular automata are considered forprocessingDNA sequencesThe rule generatorwill transformthe complemented and noncomplemented rules in the formof matrix so that we can apply the rules to the correspondingsequence positions very easily IN-MACA-MACC basins arecalculated as per the instructions of proposed algorithm
Table 1 Example rules
SNO Rule number General representation1 254 119902
119894minus1+ 119902119894+ 119902119894+1
2 252 119902119894minus1+ 119902119894
3 238 119902119894+ 119902119894+1
4 250 119902119894minus1+ 119902119894+1
5 204 119902119894
6 240 119902119894minus1
7 170 119902119894+1
Cellular automata that use fuzzy logic are an array ofcells arranged in linear fashion evolving with time Everycell of this array assumes a rational value in the interval ofzero and one All these cells change their states accordingto the local evaluation function which is a function of itsstate and its neighboring states The synchronous applicationof the local rules to all the cells of array will depict theglobal evolution Table 1 shows some rules for developing theproposed classifier
Example 1 Consider the rule ⟨170 238 204⟩ and corre-sponding transition matrix is shown below
If119875(0) is the initial state with real values (0 025 050) thesuccessive three steps are defined below
The transitions from one state to another state are definedas
119879 = [
[
0 1 0
0 1 1
0 0 1
]
]
119875 (0) = (0 025 050)
(1)
Step 1 Apply rule 170 for the first cell Rule 170 says that thenext state depends on the right neighbor Consider
119875 (1) = (025 025 050) (2)
4 Advances in Bioinformatics
04 06 02
04 10 02
06 04 02
02 10 02
10 02 02
08 02 02
08 06 02
08 08 02
08 10 02
02 02 02
02 04 02
06 10 02
10 10 02
02 06 02
06 08 02
Figure 3 Attractor state (10 10 and 02)mdashB formed with rule⟨170 252 204⟩
Apply rule 238 to the second cell Rule 238 says that the nextstate depends on its state and the right neighbor Consider
119875 (1) = (025 075 050) (025 + 050) (3)
Apply rule 204 to the third cell Rule 204 says that the nextstate depends only on its state
After applying the rule for all the cells in the state is (025075 050) that is the resultant state after first iteration
119875 (1) = (025 075 050) (4)
Similarly one has the following
Step 2 Consider
119875 (2) = (075 10 050) (5)
Step 3 Consider
119875 (3) = (10 10 050) (6)
Likewise we can construct IN-MACA-MCC for a sampledataset as shown in Figure 3
4 Modified Clonal Classifier with MACA
41 Simplified Modified Clonal Algorithm
(1) Generate initial antibody population (AIS-MACArules) randomly and call it Ab It consists of twosubsets memory population Ab
119898and reservoir pop-
ulation Ab119903
(2) Construct a set of antigens population and call it Ag(DNA sequence with classinput)
Table 2 Execution time for prediction of both protein andpromoterregions
Size of dataset Prediction time of integratedalgorithm in ms
5000 10646000 138910000 200220000 2545
(3) Select an antigenAg119895fromAg the antigen population
(4) Apply every member of antibody population to theselected antigen Ag
119895 check whether it is predicting
the correct class and calculate affinity of the rule withthe antigen via fitness equation
(5) Select 119898 highest affinity antibodies (AIS-MACArules) from Ab and place them in 119875
119898
(6) Generate clones for each antibody which will beproportional to the affinity as per fitness Place theclones in the new population 119875
119894
(7) Apply mutation to the newly formed population 119875119894
where the degree is inversely proportional to theiraffinity This produces a more mature population 119875lowast
119894
(8) 119877119890calculate the affinity of the rule with the corre-
sponding antigen as we did it in step four Order theantibodies in descending order (high fitness antibodywill be on top)
(9) Compare the antibodies from 119875lowast119894with the antibodies
population from Ab119898 Select the better fitness rules
remove them from 119875lowast119894 and place them in Ab
119898
(10) Randomly generate antibodies for introducing diver-sity Compare the antibodies in Abr the left-outantibodies in 119875lowast
119894 and randomly generate antibodies
Select the better fitness rules among three and placethem in Abr
(11) For every generation compare the antibodies in Ab119898
and Ab119903and place the best in Ab
119898
42 Difference betweenClonal andModifiedClonal AlgorithmThe difference between original clonal algorithm and themodified algorithm proposed by us lies on how efficientlywe are managing the use of generated antibodies Originalclonal algorithm will not take advantage of the antibodiesgenerated by every cloned population Once the comparisonof antibodies in 119875lowast
119894and Ab
119898gets completed the best will
be placed in Ab119898
and the rest of antibodies in 119875lowast119894
areomitted Even the reservoir antibodies are poorlymaintainedSo we try to use the best antibodies in 119875lowast
119894left out after
placing them in Abm For this purpose we are comparingthe antibodies already in Ab
119903with left-out antibodies in
119875lowast
119894and newly generated antibodies which were meant for
introducing diversity After comparing the three sets the bestwill be placed in Abr In the original clonal algorithm step 11will not exit Step 11 will ensure the best fitness rules stay inAb119898which will be solution of the entire problem
Advances in Bioinformatics 5
Figure 4 Basin calculation
Figure 5 Training interface
Table 3 IN-MACA-MCC protein coding comparison with existingapproaches
Algorithmcoding measure Sensitivity SpecificityOC1 653 664Hexamer 6836 702Position asymmetry 723 745Dicodon usage 813 823CRITICA 825 849IN-MACA-MCC 896 893
Table 4 IN-MACA-MCC promoter comparison with existingapproaches
Method Sensitivity SpecificityPromoter inspector 569 469Dragon promoter finder 623 593Promo predictor 653 669CNN-promoter 763 823SPANN 689 84IMC 76 86IN-MACA-MCC 885 927
5 Experimental Results
The proposed classifier is trained and tested with Fickettand Tung [19] datasets for protein coding region predictionfor DNA sequences of lengths 54 108 and 162 All the 21measures reported in [19] were considered for developing theclassifier This classifier is trained and tested with MMCRI(httpwwwmmchriresin) [20] datasets for protein coding
0102030405060708090
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algoprotein coding region identification
CodingNoncoding
sequences sequences sequences sequences sequences
Figure 6 Predictive accuracy for protein coding regions
75
80
85
90
95
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algo promoter region identification
PromoterNonpromoter
sequences sequences sequences sequences sequences
Figure 7 Predictive accuracy for promoter regions
region prediction for DNA sequences of lengths 252 and 354The proposed classifier is trained and tested with promotersequences from DBTSS [21] dataset and nonpromoters fromEID [22] and UTRdb [23] datasets Figures 4 and 5 show thedeveloped interfaces Table 2 shows the execution time forpredicting both protein and promoter regions which is verypromising Tables 3 and 4 show the sensitivity and specificityof both predictions All the experiments are performed inSUN with Solaris 58 445MHz clock Figures 6 and 7 showthe accuracy of prediction separately which is the importantoutput of our work
51 Specific Output of 54 Length DNA Sequence with Bound-aries See Box 1 Figures 4 5 6 and 7 and Table 2
52 Human Promoter Output See Box 2 and Tables 3 and 4
6 Advances in Bioinformatics
DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps
AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT
Sequence Name Program Type of Exon Boundary Strand
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 9 34 +
Box 1
DNA Seq Sequence human Kiran promoter 223jntuh
CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC
GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC
ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC
TGCACGGTGTGGAGCTTCAA
Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh
TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG
CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC
CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA
AGCCCGTGCATTTTGCTGCG
Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps
Sequence human Kiran promoter 223jntuh Human Promoter Prediction
Start End Score Promoter Sequence
78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG
201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand
Start End Score Promoter Sequence
230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA
137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG
Box 2
6 Conclusion
We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000
[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995
[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014
[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994
[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995
[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
Advances in Bioinformatics 3
Promoter classifier
Promoter versus exons (nonpromoter)
(two-class predictor)
(two-class predictor)
Feature Combiner
DNA
Promoter versus introns (nonpromoter)(two-class predictor)
Output class andsequence
extraction
(two-class predictor)
boundary
Promoter versus 3998400UTR (nona promoter)
Figure 1 IN-MACA-MCC architecturemdashfront
Process three base
Real valued attributes
Input DNA sequence of lengths 54 108 and 162 from Fickett and Tung datasets and 252 and 354 length
DNA sequence from MMCRI database
Output of the class after all basins formed
Encoding the DNA sequence
IN-MACA-MCC classification algorithm
Input processing
Store basins(attractor basins are
Attractor basin
in the terms of three
pairs at each step
associated with class)
Figure 2 IN-MACA-MCC architecturemdashrear
boundary If the input does not belong to exons it will checkwith introns and 31015840UTRs and outputs the class accordingly
The design rear IN-MACA-MACC is indicated inFigure 2 Input to IN-MACA-MACC algorithm and its vari-ations will be DNA sequence and amino acid sequencesInput processing unit will process sequences three at a timeas three neighborhood cellular automata are considered forprocessingDNA sequencesThe rule generatorwill transformthe complemented and noncomplemented rules in the formof matrix so that we can apply the rules to the correspondingsequence positions very easily IN-MACA-MACC basins arecalculated as per the instructions of proposed algorithm
Table 1 Example rules
SNO Rule number General representation1 254 119902
119894minus1+ 119902119894+ 119902119894+1
2 252 119902119894minus1+ 119902119894
3 238 119902119894+ 119902119894+1
4 250 119902119894minus1+ 119902119894+1
5 204 119902119894
6 240 119902119894minus1
7 170 119902119894+1
Cellular automata that use fuzzy logic are an array ofcells arranged in linear fashion evolving with time Everycell of this array assumes a rational value in the interval ofzero and one All these cells change their states accordingto the local evaluation function which is a function of itsstate and its neighboring states The synchronous applicationof the local rules to all the cells of array will depict theglobal evolution Table 1 shows some rules for developing theproposed classifier
Example 1 Consider the rule ⟨170 238 204⟩ and corre-sponding transition matrix is shown below
If119875(0) is the initial state with real values (0 025 050) thesuccessive three steps are defined below
The transitions from one state to another state are definedas
119879 = [
[
0 1 0
0 1 1
0 0 1
]
]
119875 (0) = (0 025 050)
(1)
Step 1 Apply rule 170 for the first cell Rule 170 says that thenext state depends on the right neighbor Consider
119875 (1) = (025 025 050) (2)
4 Advances in Bioinformatics
04 06 02
04 10 02
06 04 02
02 10 02
10 02 02
08 02 02
08 06 02
08 08 02
08 10 02
02 02 02
02 04 02
06 10 02
10 10 02
02 06 02
06 08 02
Figure 3 Attractor state (10 10 and 02)mdashB formed with rule⟨170 252 204⟩
Apply rule 238 to the second cell Rule 238 says that the nextstate depends on its state and the right neighbor Consider
119875 (1) = (025 075 050) (025 + 050) (3)
Apply rule 204 to the third cell Rule 204 says that the nextstate depends only on its state
After applying the rule for all the cells in the state is (025075 050) that is the resultant state after first iteration
119875 (1) = (025 075 050) (4)
Similarly one has the following
Step 2 Consider
119875 (2) = (075 10 050) (5)
Step 3 Consider
119875 (3) = (10 10 050) (6)
Likewise we can construct IN-MACA-MCC for a sampledataset as shown in Figure 3
4 Modified Clonal Classifier with MACA
41 Simplified Modified Clonal Algorithm
(1) Generate initial antibody population (AIS-MACArules) randomly and call it Ab It consists of twosubsets memory population Ab
119898and reservoir pop-
ulation Ab119903
(2) Construct a set of antigens population and call it Ag(DNA sequence with classinput)
Table 2 Execution time for prediction of both protein andpromoterregions
Size of dataset Prediction time of integratedalgorithm in ms
5000 10646000 138910000 200220000 2545
(3) Select an antigenAg119895fromAg the antigen population
(4) Apply every member of antibody population to theselected antigen Ag
119895 check whether it is predicting
the correct class and calculate affinity of the rule withthe antigen via fitness equation
(5) Select 119898 highest affinity antibodies (AIS-MACArules) from Ab and place them in 119875
119898
(6) Generate clones for each antibody which will beproportional to the affinity as per fitness Place theclones in the new population 119875
119894
(7) Apply mutation to the newly formed population 119875119894
where the degree is inversely proportional to theiraffinity This produces a more mature population 119875lowast
119894
(8) 119877119890calculate the affinity of the rule with the corre-
sponding antigen as we did it in step four Order theantibodies in descending order (high fitness antibodywill be on top)
(9) Compare the antibodies from 119875lowast119894with the antibodies
population from Ab119898 Select the better fitness rules
remove them from 119875lowast119894 and place them in Ab
119898
(10) Randomly generate antibodies for introducing diver-sity Compare the antibodies in Abr the left-outantibodies in 119875lowast
119894 and randomly generate antibodies
Select the better fitness rules among three and placethem in Abr
(11) For every generation compare the antibodies in Ab119898
and Ab119903and place the best in Ab
119898
42 Difference betweenClonal andModifiedClonal AlgorithmThe difference between original clonal algorithm and themodified algorithm proposed by us lies on how efficientlywe are managing the use of generated antibodies Originalclonal algorithm will not take advantage of the antibodiesgenerated by every cloned population Once the comparisonof antibodies in 119875lowast
119894and Ab
119898gets completed the best will
be placed in Ab119898
and the rest of antibodies in 119875lowast119894
areomitted Even the reservoir antibodies are poorlymaintainedSo we try to use the best antibodies in 119875lowast
119894left out after
placing them in Abm For this purpose we are comparingthe antibodies already in Ab
119903with left-out antibodies in
119875lowast
119894and newly generated antibodies which were meant for
introducing diversity After comparing the three sets the bestwill be placed in Abr In the original clonal algorithm step 11will not exit Step 11 will ensure the best fitness rules stay inAb119898which will be solution of the entire problem
Advances in Bioinformatics 5
Figure 4 Basin calculation
Figure 5 Training interface
Table 3 IN-MACA-MCC protein coding comparison with existingapproaches
Algorithmcoding measure Sensitivity SpecificityOC1 653 664Hexamer 6836 702Position asymmetry 723 745Dicodon usage 813 823CRITICA 825 849IN-MACA-MCC 896 893
Table 4 IN-MACA-MCC promoter comparison with existingapproaches
Method Sensitivity SpecificityPromoter inspector 569 469Dragon promoter finder 623 593Promo predictor 653 669CNN-promoter 763 823SPANN 689 84IMC 76 86IN-MACA-MCC 885 927
5 Experimental Results
The proposed classifier is trained and tested with Fickettand Tung [19] datasets for protein coding region predictionfor DNA sequences of lengths 54 108 and 162 All the 21measures reported in [19] were considered for developing theclassifier This classifier is trained and tested with MMCRI(httpwwwmmchriresin) [20] datasets for protein coding
0102030405060708090
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algoprotein coding region identification
CodingNoncoding
sequences sequences sequences sequences sequences
Figure 6 Predictive accuracy for protein coding regions
75
80
85
90
95
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algo promoter region identification
PromoterNonpromoter
sequences sequences sequences sequences sequences
Figure 7 Predictive accuracy for promoter regions
region prediction for DNA sequences of lengths 252 and 354The proposed classifier is trained and tested with promotersequences from DBTSS [21] dataset and nonpromoters fromEID [22] and UTRdb [23] datasets Figures 4 and 5 show thedeveloped interfaces Table 2 shows the execution time forpredicting both protein and promoter regions which is verypromising Tables 3 and 4 show the sensitivity and specificityof both predictions All the experiments are performed inSUN with Solaris 58 445MHz clock Figures 6 and 7 showthe accuracy of prediction separately which is the importantoutput of our work
51 Specific Output of 54 Length DNA Sequence with Bound-aries See Box 1 Figures 4 5 6 and 7 and Table 2
52 Human Promoter Output See Box 2 and Tables 3 and 4
6 Advances in Bioinformatics
DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps
AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT
Sequence Name Program Type of Exon Boundary Strand
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 9 34 +
Box 1
DNA Seq Sequence human Kiran promoter 223jntuh
CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC
GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC
ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC
TGCACGGTGTGGAGCTTCAA
Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh
TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG
CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC
CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA
AGCCCGTGCATTTTGCTGCG
Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps
Sequence human Kiran promoter 223jntuh Human Promoter Prediction
Start End Score Promoter Sequence
78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG
201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand
Start End Score Promoter Sequence
230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA
137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG
Box 2
6 Conclusion
We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000
[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995
[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014
[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994
[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995
[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
4 Advances in Bioinformatics
04 06 02
04 10 02
06 04 02
02 10 02
10 02 02
08 02 02
08 06 02
08 08 02
08 10 02
02 02 02
02 04 02
06 10 02
10 10 02
02 06 02
06 08 02
Figure 3 Attractor state (10 10 and 02)mdashB formed with rule⟨170 252 204⟩
Apply rule 238 to the second cell Rule 238 says that the nextstate depends on its state and the right neighbor Consider
119875 (1) = (025 075 050) (025 + 050) (3)
Apply rule 204 to the third cell Rule 204 says that the nextstate depends only on its state
After applying the rule for all the cells in the state is (025075 050) that is the resultant state after first iteration
119875 (1) = (025 075 050) (4)
Similarly one has the following
Step 2 Consider
119875 (2) = (075 10 050) (5)
Step 3 Consider
119875 (3) = (10 10 050) (6)
Likewise we can construct IN-MACA-MCC for a sampledataset as shown in Figure 3
4 Modified Clonal Classifier with MACA
41 Simplified Modified Clonal Algorithm
(1) Generate initial antibody population (AIS-MACArules) randomly and call it Ab It consists of twosubsets memory population Ab
119898and reservoir pop-
ulation Ab119903
(2) Construct a set of antigens population and call it Ag(DNA sequence with classinput)
Table 2 Execution time for prediction of both protein andpromoterregions
Size of dataset Prediction time of integratedalgorithm in ms
5000 10646000 138910000 200220000 2545
(3) Select an antigenAg119895fromAg the antigen population
(4) Apply every member of antibody population to theselected antigen Ag
119895 check whether it is predicting
the correct class and calculate affinity of the rule withthe antigen via fitness equation
(5) Select 119898 highest affinity antibodies (AIS-MACArules) from Ab and place them in 119875
119898
(6) Generate clones for each antibody which will beproportional to the affinity as per fitness Place theclones in the new population 119875
119894
(7) Apply mutation to the newly formed population 119875119894
where the degree is inversely proportional to theiraffinity This produces a more mature population 119875lowast
119894
(8) 119877119890calculate the affinity of the rule with the corre-
sponding antigen as we did it in step four Order theantibodies in descending order (high fitness antibodywill be on top)
(9) Compare the antibodies from 119875lowast119894with the antibodies
population from Ab119898 Select the better fitness rules
remove them from 119875lowast119894 and place them in Ab
119898
(10) Randomly generate antibodies for introducing diver-sity Compare the antibodies in Abr the left-outantibodies in 119875lowast
119894 and randomly generate antibodies
Select the better fitness rules among three and placethem in Abr
(11) For every generation compare the antibodies in Ab119898
and Ab119903and place the best in Ab
119898
42 Difference betweenClonal andModifiedClonal AlgorithmThe difference between original clonal algorithm and themodified algorithm proposed by us lies on how efficientlywe are managing the use of generated antibodies Originalclonal algorithm will not take advantage of the antibodiesgenerated by every cloned population Once the comparisonof antibodies in 119875lowast
119894and Ab
119898gets completed the best will
be placed in Ab119898
and the rest of antibodies in 119875lowast119894
areomitted Even the reservoir antibodies are poorlymaintainedSo we try to use the best antibodies in 119875lowast
119894left out after
placing them in Abm For this purpose we are comparingthe antibodies already in Ab
119903with left-out antibodies in
119875lowast
119894and newly generated antibodies which were meant for
introducing diversity After comparing the three sets the bestwill be placed in Abr In the original clonal algorithm step 11will not exit Step 11 will ensure the best fitness rules stay inAb119898which will be solution of the entire problem
Advances in Bioinformatics 5
Figure 4 Basin calculation
Figure 5 Training interface
Table 3 IN-MACA-MCC protein coding comparison with existingapproaches
Algorithmcoding measure Sensitivity SpecificityOC1 653 664Hexamer 6836 702Position asymmetry 723 745Dicodon usage 813 823CRITICA 825 849IN-MACA-MCC 896 893
Table 4 IN-MACA-MCC promoter comparison with existingapproaches
Method Sensitivity SpecificityPromoter inspector 569 469Dragon promoter finder 623 593Promo predictor 653 669CNN-promoter 763 823SPANN 689 84IMC 76 86IN-MACA-MCC 885 927
5 Experimental Results
The proposed classifier is trained and tested with Fickettand Tung [19] datasets for protein coding region predictionfor DNA sequences of lengths 54 108 and 162 All the 21measures reported in [19] were considered for developing theclassifier This classifier is trained and tested with MMCRI(httpwwwmmchriresin) [20] datasets for protein coding
0102030405060708090
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algoprotein coding region identification
CodingNoncoding
sequences sequences sequences sequences sequences
Figure 6 Predictive accuracy for protein coding regions
75
80
85
90
95
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algo promoter region identification
PromoterNonpromoter
sequences sequences sequences sequences sequences
Figure 7 Predictive accuracy for promoter regions
region prediction for DNA sequences of lengths 252 and 354The proposed classifier is trained and tested with promotersequences from DBTSS [21] dataset and nonpromoters fromEID [22] and UTRdb [23] datasets Figures 4 and 5 show thedeveloped interfaces Table 2 shows the execution time forpredicting both protein and promoter regions which is verypromising Tables 3 and 4 show the sensitivity and specificityof both predictions All the experiments are performed inSUN with Solaris 58 445MHz clock Figures 6 and 7 showthe accuracy of prediction separately which is the importantoutput of our work
51 Specific Output of 54 Length DNA Sequence with Bound-aries See Box 1 Figures 4 5 6 and 7 and Table 2
52 Human Promoter Output See Box 2 and Tables 3 and 4
6 Advances in Bioinformatics
DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps
AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT
Sequence Name Program Type of Exon Boundary Strand
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 9 34 +
Box 1
DNA Seq Sequence human Kiran promoter 223jntuh
CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC
GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC
ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC
TGCACGGTGTGGAGCTTCAA
Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh
TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG
CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC
CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA
AGCCCGTGCATTTTGCTGCG
Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps
Sequence human Kiran promoter 223jntuh Human Promoter Prediction
Start End Score Promoter Sequence
78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG
201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand
Start End Score Promoter Sequence
230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA
137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG
Box 2
6 Conclusion
We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000
[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995
[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014
[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994
[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995
[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
Advances in Bioinformatics 5
Figure 4 Basin calculation
Figure 5 Training interface
Table 3 IN-MACA-MCC protein coding comparison with existingapproaches
Algorithmcoding measure Sensitivity SpecificityOC1 653 664Hexamer 6836 702Position asymmetry 723 745Dicodon usage 813 823CRITICA 825 849IN-MACA-MCC 896 893
Table 4 IN-MACA-MCC promoter comparison with existingapproaches
Method Sensitivity SpecificityPromoter inspector 569 469Dragon promoter finder 623 593Promo predictor 653 669CNN-promoter 763 823SPANN 689 84IMC 76 86IN-MACA-MCC 885 927
5 Experimental Results
The proposed classifier is trained and tested with Fickettand Tung [19] datasets for protein coding region predictionfor DNA sequences of lengths 54 108 and 162 All the 21measures reported in [19] were considered for developing theclassifier This classifier is trained and tested with MMCRI(httpwwwmmchriresin) [20] datasets for protein coding
0102030405060708090
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algoprotein coding region identification
CodingNoncoding
sequences sequences sequences sequences sequences
Figure 6 Predictive accuracy for protein coding regions
75
80
85
90
95
100
54 108 162 252 354
Pred
ictiv
e acc
urac
y
Accuracy of integrated AIS-MACA Algo promoter region identification
PromoterNonpromoter
sequences sequences sequences sequences sequences
Figure 7 Predictive accuracy for promoter regions
region prediction for DNA sequences of lengths 252 and 354The proposed classifier is trained and tested with promotersequences from DBTSS [21] dataset and nonpromoters fromEID [22] and UTRdb [23] datasets Figures 4 and 5 show thedeveloped interfaces Table 2 shows the execution time forpredicting both protein and promoter regions which is verypromising Tables 3 and 4 show the sensitivity and specificityof both predictions All the experiments are performed inSUN with Solaris 58 445MHz clock Figures 6 and 7 showthe accuracy of prediction separately which is the importantoutput of our work
51 Specific Output of 54 Length DNA Sequence with Bound-aries See Box 1 Figures 4 5 6 and 7 and Table 2
52 Human Promoter Output See Box 2 and Tables 3 and 4
6 Advances in Bioinformatics
DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps
AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT
Sequence Name Program Type of Exon Boundary Strand
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 9 34 +
Box 1
DNA Seq Sequence human Kiran promoter 223jntuh
CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC
GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC
ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC
TGCACGGTGTGGAGCTTCAA
Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh
TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG
CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC
CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA
AGCCCGTGCATTTTGCTGCG
Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps
Sequence human Kiran promoter 223jntuh Human Promoter Prediction
Start End Score Promoter Sequence
78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG
201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand
Start End Score Promoter Sequence
230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA
137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG
Box 2
6 Conclusion
We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000
[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995
[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014
[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994
[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995
[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
6 Advances in Bioinformatics
DNA Sequence Sequence Kiran12 jntuh - Length = 54 bps
AGGGGCAGCAACCAGAGCAGCAGCAGTGGCAGCAGTAGCAGGCGCCGGCCGCCGReverse ComplementCGGCGGCCGGCGCCTGCTACTGCTGCCACTGCTGCTGCTCTGGTTGCTGCCCCT
Sequence Name Program Type of Exon Boundary Strand
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 25 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 3 34 +
kiran12 jntuh AIS MACA Internal 9 34 +
Box 1
DNA Seq Sequence human Kiran promoter 223jntuh
CGCAGCAAAATGCACGGGCTTCTGCAGCCCACATGACTTTATTCTGAACGGACACAAGTCTGCTCGCTGGGCCGTTC
GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTGGCGACCAGAAAACACAAGTGAAAGAGC
ATTTGGCCAGCCCGGAGAAGCCGAGCTGGGTGGCTTGAGTCTACATGGTTCTCATGTCGCGTTTAAGGCCAGCCCCC
TGCACGGTGTGGAGCTTCAA
Reverse Complement of DNA Seq Sequence human Kiran promoter 223jntuh
TTGAAGCTCCACACCGTGCAGGGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCAGCTCGG
CTTCTCCGGGCTGGCCAAATGCTCTTTCACTTGTGTTTTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGC
CGTGTTTTTGGCCCAAAAGCGAACGGCCCAGCGAGCAGACTTGTGTCCGTTCAGAATAAAGTCATGTGGGCTGCAGA
AGCCCGTGCATTTTGCTGCG
Sequence Sequence human Kiran promoter 223jntuh - Length = 251 bps
Sequence human Kiran promoter 223jntuh Human Promoter Prediction
Start End Score Promoter Sequence
78 128 061 GCTTTTGGGCCAAAAACACGGCTCCGTCGGTGACTTTTGGCCCGATATTG
201 251 046 GGTTCTCATGTCGCGTTTAAGGCCAGCCCCCTGCACGGTGTGGAGCTTCAKiran promoter 223jntuh Promoter Prediction Reverse Strand
Start End Score Promoter Sequence
230 180 059 GGGGCTGGCCTTAAACGCGACATGAGAACCATGTAGACTCAAGCCACCCA
137 87 060 TTCTGGTCGCCAATATCGGGCCAAAAGTCACCGACGGAGCCGTGTTTTTG
Box 2
6 Conclusion
We have successfully developed a classifier which can predictpromoter and protein coding regions with higher accuracythan reported earlier The sensitivity and specificity valuesfor both predictions are also promisingThere is considerableimprovement in the reduction of false prediction rate IN-MACA-MCC attains highest accuracy of 923 for sequencesmore than 108 bp and less than 552 bp for protein codingregion prediction IN-MACA-MCC attains highest accuracyof 936 for sequences of length 251 for promoter regionsWe are trying to apply this classifier for most of the species ineukaryotes in future
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] T K Attwood ldquoThe babel of bioinformaticsrdquo Science vol 290no 5491 pp 471ndash473 2000
[2] S Salzberg ldquoLocating protein coding regions in human DNAusing a decision tree algorithm rdquo Journal of ComputationalBiology vol 2 no 3 pp 473ndash485 1995
[3] P Maji and S Paul ldquoNeural network tree for identificationof splice junction and protein coding region in DNArdquo inScalable Pattern Recognition Algorithms pp 45ndash66 SpringerInternational Berlin Germany 2014
[4] Y Xu R Mural M Shah and E Uberbacher ldquoRecognizingexons in genomic sequence using GRAIL IIrdquoGenetic Engineer-ing vol 16 pp 241ndash253 1994
[5] E E Snyder and G D Stormo ldquoIdentification of protein codingregions in genomicDNArdquo Journal ofMolecular Biology vol 248no 1 pp 1ndash18 1995
[6] E C Uberbacher and R J Mural ldquoLocating protein-codingregions in human DNA sequences by a multiple sensor-neural
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
Advances in Bioinformatics 7
network approachrdquo Proceedings of the National Academy ofSciences of the United States of America vol 88 no 24 pp 11261ndash11265 1991
[7] A J Pinho A J R Neves V Afreixo C A C Bastos and PJ S G Ferreira ldquoA three-state model for DNA protein-codingregionsrdquo IEEE Transactions on Biomedical Engineering vol 53no 11 pp 2148ndash2155 2006
[8] M Q Zhang ldquoIdentification of protein coding regions in thehuman genomeby quadratic discriminant analysisrdquoProceedingsof the National Academy of Sciences of the United States ofAmerica vol 94 no 2 pp 565ndash568 1997
[9] W Gish and D J States ldquoIdentification of protein codingregions by database similarity searchrdquo Nature Genetics vol 3no 3 pp 266ndash272 1993
[10] J Zeng X Zhao X Cao and H Yan ldquoSCS signal context andstructure features for genome-wide human promoter recogni-tionrdquo IEEEACM Transactions on Computational Biology andBioinformatics vol 7 no 3 pp 550ndash562 2010
[11] X Li J Zeng and H Yan ldquoPCA-HPR a principle componentanalysis model for human promoter recognitionrdquo Bioinforma-tion vol 2 no 9 pp 373ndash378 2008
[12] S Hannenhalli and S Levy ldquoPromoter prediction in the humangenomerdquo Bioinformatics vol 17 supplement 1 pp S90ndashS962001
[13] S Wu X Xie A W Liew and H Yan ldquoEukaryotic promoterprediction based on relative entropy and positional informa-tionrdquo Physical Review E vol 75 no 4 Article ID 041908 2007
[14] U Ohler H Niemann G Liao and G M Rubin ldquoJointmodeling of DNA sequence and physical properties to improveeukaryotic promoter recognitionrdquo Bioinformatics vol 17 sup-plement 1 pp S199ndashS206 2001
[15] J R Goni A Perez D Torrents and M Orozco ldquoDeterminingpromoter location based on DNA structure first-principlescalculationsrdquo Genome Biology vol 8 no 12 article R263 2007
[16] V B Bajic S H Seah A Chong G Zhang J L Y Koh andV Brusic ldquoDragon promoter finder recognition of vertebrateRNA polymerase II promotersrdquo Bioinformatics vol 18 no 1 pp198ndash199 2002
[17] M Q Zhang ldquoIdentification of human gene core promoters insilicordquo Genome Research vol 8 no 3 pp 319ndash326 1998
[18] P K Sree and I R Babu ldquoAIX-MACA-Y multiple attractorcellular automata based clonal classifier for promoter andprotein coding region predictionrdquo Journal of Bioinformatics andIntelligent Control vol 3 no 1 pp 23ndash30 2014
[19] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992
[20] httpwwwmmchriresin[21] R Yamashita Y Suzuki H Wakaguri K Tsuritani K Nakai
and S Sugano ldquoDBTSS database of human transcription startsites progress report 2006rdquo Nucleic Acids Research vol 34supplement 1 pp D86ndashD89 2006
[22] S Saxonov I Daizadeh A Fedorov and W Gilbert ldquoEIDThe Exon-Intron Databasemdashan exhaustive database of protein-coding intron-containing genesrdquoNucleic Acids Research vol 28no 1 pp 185ndash190 2000
[23] G Pesole S Liuni G Grillo et al ldquoUTRdb and UTRsitespecialized databases of sequences and functional elements of51015840 and 31015840 untranslated regions of eukaryotic mRNAs Update2002rdquo Nucleic Acids Research vol 30 no 1 pp 335ndash340 2002
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology