Research Article
The Virtual Screening of the Drug Protein with a Few Crystal Structures Based on the Adaboost-SVM
Meng-yu Wang,1 Peng Li,1,2 and Pei-li Qiao1
1School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
2School of Software, Harbin University of Science and Technology, Harbin 150080, China
Correspondence should be addressed to Peng Li; pli@hrbust.edu.cn
Received 27 December 2015; Revised 6 March 2016; Accepted 7 March 2016
Academic Editor: Ezequiel Lopez-Rubio
Copyright © 2016 Meng-yu Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Using machine learning to assist virtual screening (VS) has proven to be an effective approach. However, the quality of the training set may degrade when it is mixed with wrong docking poses, which affects the screening efficiency. To solve this problem, we present a method that uses ensemble learning to improve the support vector machine when processing the generated protein-ligand interaction fingerprints (IFPs). By combining multiple classifiers, ensemble learning avoids the limitations of a single classifier's performance and obtains better generalization. In virtual screening experiments with SRC and Cathepsin K as the targets, the results show that the ensemble learning method can effectively reduce the error caused by low sample quality and improve the effect of the whole virtual screening process.
1. Introduction
Since the beginning of the 21st century, the focus of life science has shifted from experimental analysis and data accumulation to experiments guided by data analysis. Life science is undergoing a transition from reductionist analysis to system integration methods [1]. With the completion of the Human Genome Project (HGP), more and more three-dimensional structures of functionally important biological macromolecules (proteins, nucleic acids, enzymes, etc.) have been solved [2]. As the amount of data has increased exponentially in recent years, the combination of the traditional pharmaceutical field and modern computer technology has become an inevitable step in the development of life science, and virtual screening is the product of this combination. At present, millions of molecules can be screened by virtual screening methods every day. For each specific target structure, we can obtain the active compounds in a short time. The research object is narrowed from millions of compounds to hundreds, which greatly improves the speed and efficiency of compound screening and shortens the cycle of new drug research. However, the increasing amount of data makes ordinary computer algorithms unable to maintain a high level of performance, so machine learning methods have gradually entered the view of scientists due to their reliable and fast performance.
The combination of machine learning and virtual screening has become a hotspot in the field of chemical information and has shown its value in the drug discovery process, for example, in searching for inhibitors [3], finding novel chemotypes [4], and predicting protein structures [5]. The number of crystal structures of complexes available for training is crucial when combining virtual screening and machine learning. Relative to a small training set, a larger and more diverse training set can train a more powerful learning model. However, the crystal structures that can be used for virtual screening always come from X-ray crystal diffraction or NMR [6]. Although these structures are accurate, the high cost and long turnaround limit the speed of structure determination, which cannot meet the needs of virtual screening experiments. So, in order to expand the size of the training set, some docking poses of the known active compounds are added to the training set. Because the docking poses can include incorrect binding modes, large amounts
Hindawi Publishing Corporation, Computational and Mathematical Methods in Medicine, Volume 2016, Article ID 4809831, 9 pages, http://dx.doi.org/10.1155/2016/4809831
of negative samples are introduced. The accumulation of negative samples can produce an imbalanced data set, which is a common phenomenon and of great concern in studies on bioinformatics.
On the prediction of DNA-binding proteins, Song et al. propose an ensemble learning algorithm, imDC, based on an analysis of unbalanced DNA-binding protein data, which outperforms classic classification models such as SVM under the same conditions [7]. Based on the ensemble learning framework, Zou et al. give a new predictor to improve the performance of tRNAscan-SE annotation, and the experimental results show their algorithm can distinguish functional tRNAs from pseudo-tRNAs [8]. Lin et al. propose merging a K-means static selective strategy and ensemble forward sequential selection in an ensemble learning architecture for hierarchical classification of protein folds, with the accuracy reaching 74.21%, the state-of-the-art strategy at present [9]. Zou et al. combine synthetic minority oversampling and K-means clustering undersampling to tackle the negative influence of imbalanced data sets [10].
Obviously, it is common for bioinformatics studies to face imbalanced data sets. The widely used strategies at present include preprocessing the training samples and improving the classifiers. In this paper, we start from the perspective of improving the machine learning algorithm, introducing an ensemble learning method on the basis of a simple SVM classifier and using layered combination and iterative weighting to enhance the performance of the classifier, so as to reduce the impact of sample set quality. Meanwhile, this paper introduces Random Forest as the experimental baseline to examine the effect of ensemble learning on virtual screening.
2. The Quantitative Method
With the rapid development of combinatorial chemistry, bioinformatics, molecular biology, and computer science, computer-aided drug design (CADD) is widely used. Virtual screening, one of the most widely used methods in CADD because it is quick and low cost, has gradually replaced high-throughput screening as the main means of drug screening [11]. In this paper, we use the virtual screening method to screen the drug protein.
2.1. General Process of Virtual Screening. Virtual screening is also known as computer screening; it is a prescreening of compound molecules on the computer to reduce the number of compounds actually screened and to improve the efficiency of lead compound discovery. The workflow of the virtual screening process is shown in Figure 1.
Virtual screening includes four steps: the establishment of the receptor model, the generation of small molecule libraries, the computer screening, and the postprocessing of hit compounds.
Step 1 (the establishment of the receptor model). (1) Obtaining the macromolecular structure: preparation of the protein structure is an important step in virtual screening. The crystal structures to be used in the virtual screening can be
Figure 1: Virtual screening process (ligand database preparation; protein structure preparation; binding site determination and grid construction; docking; scoring; ranking/selection and drug-likeness analysis; hit candidates list).
obtained directly from the PDB or built by homology modeling from the sequence and structure information of homologous proteins.
(2) Binding site description: the choice of an appropriate ligand binding pocket is very important in molecular docking. There are two ways to choose it: (a) take it directly from a ligand-receptor complex structure; (b) if there is no complex structure, choose the binding sites manually according to experimental information on biological functions, such as mutation and binding data.
Step 2 (the generation of small molecule libraries). We can use a conversion program to translate two-dimensional structures into three-dimensional structures. The obtained 3D structures can be used in the docking process after adding hydrogen atoms and charges.
Step 3 (docking and scoring). Docking places every small molecule in the ligand binding site of the receptor protein, optimizes the conformation and location of the ligand, and determines the best combination. The best conformations are scored, all compounds are ranked by score, and the small molecules with the highest scores are picked out of the compound library. The docking algorithm aims to predict the complex conformation formed by the receptor and the ligand. The purpose of the scoring function is to choose a conformation from the candidate set of conformations according to the score; the scoring function assigns a lower (better) score when the docking result is closer to the natural complex.
Figure 2: The calculation process of Pharm-IF: interaction types such as Hbond(D), Hbond(A), and Ionic(Cat) are paired (e.g., Hbond(D)-Hbond(A)) and binned by the distances (Å) between the ligand atoms [14].
However, there is no completely correct scoring function. So far, the scoring functions used in the various existing docking algorithms are only approximations of the correct scoring function.
Step 4 (postprocessing of hit compounds). Relying only on the simple scoring model can lead to large differences in the final results and sometimes to wrong judgments, so the final results must be analyzed from multiple perspectives and postprocessed. The purpose of this analysis and postprocessing is to assess the protein-ligand binding free energies as accurately as possible. The generated candidate set of complexes is classified and the erroneous results are filtered out.
As the accuracy problem of the scoring function in virtual screening has not been properly resolved, in this paper we use the protein-ligand interaction fingerprint (IFP) to describe the interactions between target proteins and ligands. An IFP encodes the observed interactions between ligand and protein into a binary string of fixed length [12]. The IFP method was originally designed for analyzing ligand docking poses against protein kinases. Based on this method, the atom-based IFP concept was put forward and extended. Each kind of IFP has its own characteristics, whether it is residue-based or atom-based. A one-dimensional interaction fingerprint is easier to generate and compare than the 3D structure of a protein-ligand complex, and it is more suitable for computer-aided drug design [13].
2.2. The Concept and Calculation Process of Pharm-IF. In this paper, we use a kind of atom-based fingerprint, Pharm-IF, as an aid to verify the theory in this paper. The concept of Pharm-IF was put forward by Sato et al. [14]. Pharm-IF is calculated from the distances between pairs of ligand pharmacophore features that interact with protein atoms, and it can detect important geometrical patterns of the ligand pharmacophore.

The calculation of Pharm-IF can be divided into the following three steps, as Figure 2 shows.
Step 1. Detect the protein-ligand interactions from complex structures. Interactions can be classified into six types: (1) hydrogen bond with ligand acceptor; (2) hydrogen bond with ligand donor; (3) hydrogen bond in which the roles of ligand and protein atoms could not be determined; (4) ionic interaction with ligand cation; (5) ionic interaction with ligand anion; (6) hydrophobic interaction.
Step 2. Create all possible interaction pairs; each interaction pair is characterized by the pharmacophore features of the ligand atoms and their distance. To build the resulting matrix, each interaction pair is assigned to the corresponding bin. For example, in an interaction pair of two hydrogen bonds whose ligand atoms are a donor and an acceptor 4.3 Å apart, the pair is assigned to the vector corresponding to this hydrogen-bond pair; to describe the distance of 4.3 Å, 0.7 is assigned to the 4 Å bin and 0.3 is assigned to the 5 Å bin.
Step 3 The result matrix is calculated by the summationof the values of all of the interaction pairs The formula ofPharm-IF calculation is as follows
119867119905119896
= sum
119894isin119868119905
119860119896(119894) (1)
119860119896(119894) =
0 if 1003816100381610038161003816119896 minus 119889119894
1003816100381610038161003816 ge 1
1 minus1003816100381610038161003816119896 minus 119889
119894
1003816100381610038161003816 otherwise(2)
In formula (1) 119867 stands for the interaction fingerprintof a protein-ligand complex by Pharm-IF 119905 is the pair of sixtypes of pharmacophore features 119896 = 1 2 3 stands forthe corresponding bins of the distances (A) between ligandatoms 119868
119905represents the fact that the whole set of the inter-
actions are classified as type 119905 and 119894 represents an elementin 119868119905 In formula (2) 119889
119894represents the distances between
ligand atoms of 119894 (A)
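The three steps above can be sketched in a few lines of code. This is a minimal illustration of eqs. (1)-(2), not the authors' implementation: each interaction pair of type t contributes linearly to the two distance bins nearest its ligand-atom distance d_i. The type labels, the bin count, and the example distance are our own assumptions for illustration.

```python
# Minimal sketch of the Pharm-IF binning in eqs. (1)-(2).
# N_BINS and the type labels are illustrative assumptions.
import numpy as np

N_BINS = 16  # distance bins k = 1..N_BINS (Angstroms), an assumed upper bound

def pharm_if(pairs):
    """pairs: list of (type_label, ligand-atom distance in Angstroms)."""
    fp = {}
    for t, d in pairs:
        H = fp.setdefault(t, np.zeros(N_BINS))
        for k in range(1, N_BINS + 1):
            # eq. (2): A_k(i) = 1 - |k - d_i| if |k - d_i| < 1, else 0
            w = 1.0 - abs(k - d)
            if w > 0:
                H[k - 1] += w  # eq. (1): sum over all interactions of type t
    return fp

# the worked example from Step 2: a donor-acceptor hydrogen-bond pair
# 4.3 A apart puts 0.7 into the 4 A bin and 0.3 into the 5 A bin
fp = pharm_if([("Hbond(D)-Hbond(A)", 4.3)])
```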
2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. These two proteins are hotspots among pharmaceutical drug targets, and neither of them has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses will influence the virtual screening efficiency.
Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase protein that in humans is encoded by the SRC gene. The SRC family of kinases is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC exists widely in tissue cells and plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in signal transduction pathways. c-Src is made up of 6 functional regions: the SRC homology 4 (SH4) domain, a unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, intramolecular interactions form between phosphorylated Tyr527 (tyrosine 527) and the SH2 domain; at the same time, the SH3 domain binds the proline-rich SH2-kinase linker domain. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these links break and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and this kind of protein is closely associated with a variety of cancers. Activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. As a drug target, a number of tyrosine kinase inhibitors targeting c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].
Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2; the transcript is 1.7 kb long and consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. In the process of bone resorption, acid dissolves the hydroxyapatite, and the organic ingredients in the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in acid environments, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. The function of Cathepsin K in osteoclasts has been recognized; therefore, many labs treat its inhibitors as drug candidates for the treatment of osteoporosis. At present, the first choice of antiresorption treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, the long-term use of bisphosphonates may produce adverse reactions: esophageal irritation, hypocalcaemia, kidney irritation, and so on. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation, so new replacement therapy drugs are meaningful [16].
3. Classification Algorithm Based on the Adaboost-SVM
Currently, SVM can deal with many problems, such as small sample sizes, nonlinearity, and high dimensionality. Based on statistical learning theory, it has a simple mathematical form, fast training, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. When the quality of the sample set is not very low, we can get good results even without any improvements. The learning mechanism of SVM provides a lot of room to improve the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which provides only global minima and hence avoids being trapped in local minima, so in this paper we use SVM as the base classifier. There is a large literature on SVM, so this paper only gives a simple introduction. The basic process of SVM classification is as follows.
For a given sample set L = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, x_i ∈ R^d, y_i ∈ {1, −1}, i = 1, 2, …, n, y_i stands for the category of sample x_i, d is the sample dimension, and n is the number of training samples. If the input vector set is linearly separable, then it can be separated by a hyperplane. The hyperplane can be expressed as

\[ w \cdot x - b = 0. \tag{3} \]
Here w is the normal vector of the hyperplane and b is the offset. The SVM learning problem is to minimize the objective function

\[ \min \phi(w) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n} \xi_i\right) \tag{4} \]

subject to the condition

\[ y_i \left(w \cdot x_i + b\right) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n. \tag{5} \]

Here (1/2)||w||^2 is the structure complexity, C(∑_{i=1}^{n} ξ_i) stands for the empirical risk, ξ_i is the slack variable, and C is a constant, the punishment factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vector into a high-dimensional feature space and construct an optimal hyperplane in that feature space.
The mapping φ takes x in the space R^n into H:

\[ x \longrightarrow \phi(x) = \left(\phi_1(x), \phi_2(x), \ldots, \phi_i(x), \ldots\right)^{T}. \tag{6} \]

Eventually we obtain the optimal classification function

\[ f(x) = \operatorname{sgn}\left(w \cdot \phi(x) + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i \phi(x_i) \cdot \phi(x) + b\right). \tag{7} \]
In our work, the Radial Basis Function (RBF) is taken as the kernel function of the SVM, and the mathematical description of this kernel is given below:

\[ K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}. \tag{8} \]
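Eq. (8) can be checked directly in a couple of lines. This is a generic sketch of the RBF kernel, not the paper's code; the vectors are arbitrary examples.

```python
# RBF kernel of eq. (8): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

# identical points give K = 1; growing distance drives K toward 0
```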
Relative to other classifiers, SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set will destabilize the classifier, and this instability directly affects the final classification result. In this paper, we introduce the Adaboost mechanism from ensemble learning to divide one classification process into several layers of weak SVM-based classifiers.
The key to combining Adaboost and SVM is finding a suitable Gaussian width σ for each component classifier. If the σ value is relatively large, the component classifier is too weak and the final classification performance decreases. On the other hand, if the σ value is relatively small, the component classifiers become strong but their errors are highly correlated; the differences between them are small, so the ensemble learning is ineffective. Even more importantly, a σ value that is too small will lead to overfitting and greatly reduced generalization. Therefore, in this paper, the standard deviation of the sample set of each component classifier is used as the σ value of that component classifier to control its classification accuracy; thus an SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so we need to explain some key parameters; we list the values of σ, C, and the other parameters in Table 1.
The specific process of the algorithm is as follows
(1) RBFSVM denotes the SVM with the RBF kernel; T denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weight of each sample, w_1(i) = 1/n, i = 1, 2, …, n.

(3) For t = 1, 2, …, T: for each weak classifier h, calculate the weighted error

\[ \varepsilon_t = \sum_{j=1}^{n} D_t(j) \left| h\left(x_j\right) - y_j \right|. \tag{9} \]

Choose the classifier with the lowest weighted error rate ε_t and save its corresponding SVM model. Calculate the selected weak classifier's weight:

\[ a_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right). \tag{10} \]

Update the sample weights according to a_t:

\[ D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-a_t}, & \text{if } h_t(x_i) = y_i, \\ e^{a_t}, & \text{if } h_t(x_i) \ne y_i. \end{cases} \tag{11} \]
Table 1: SVM parameter setting.

SVM type: C-SVM
Class number: 2
Kernel function: RBF
The degree in kernel function: 3
σ in kernel function: 0.001
Cost factor: 5
Cache size: 500 MB
Tolerance in the termination criteria: 0.001
The weight value of the penalty factor for all kinds of samples: 1
Cross validation: 5
Table 2: Adaboost-SVM parameter setting.

Ensemble learning type: Adaboost
Class number: 2
The basic classifier type: C-SVM
Number of classifiers per layer: 100
The max false alarm rate: 0.5
The min hit rate: 0.9
Number of iterations: 5
Weight trim rate: 0.9
Cache size: 500 MB
The weights are normalized so that

\[ \sum_{i=1}^{n} D_t(i) = 1. \tag{12} \]

(4) Apply the strong classifier H, integrated from the SVM weak classifiers, to the training set:

\[ H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} a_t h_t(x)\right]. \tag{13} \]
For the Adaboost-SVM, we set the parameters as in Table 2, and the basic SVM parameters are the same as in Table 1.

Compared to a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent from the others and that the probability of sample misclassification be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, which also reduces the impact of sample set quality on the virtual screening effect.
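The training loop of steps (1)-(4) can be sketched as follows. This is a minimal illustration of the Adaboost-SVM equations (9)-(13), not the authors' (closed-source) program: the function names are ours, the RBF width σ is taken from the sample spread as the text suggests, labels are assumed to be in {−1, +1}, and scikit-learn's SVC stands in for the component classifier.

```python
# Hedged sketch of the Adaboost-SVM loop in eqs. (9)-(13).
import numpy as np
from sklearn.svm import SVC

def adaboost_svm_fit(X, y, T=5, C=5.0):
    n = len(y)
    D = np.full(n, 1.0 / n)                 # w_1(i) = 1/n, step (2)
    models, alphas = [], []
    for _ in range(T):
        sigma = X.std()                     # component sigma from sample spread
        clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
        clf.fit(X, y, sample_weight=D)      # weak classifier trained on weighted data
        pred = clf.predict(X)
        eps = np.sum(D * (pred != y))       # weighted error, eq. (9)
        eps = min(max(eps, 1e-10), 0.5 - 1e-10)  # keep the component "weak but useful"
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # classifier weight, eq. (10)
        D = D * np.exp(-alpha * y * pred)   # weight update, eq. (11)
        D /= D.sum()                        # normalization, eq. (12)
        models.append(clf)
        alphas.append(alpha)
    return models, alphas

def adaboost_svm_predict(models, alphas, X):
    # strong classifier H(x) = sign(sum_t alpha_t h_t(x)), eq. (13)
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```

Misclassified samples gain weight through the `exp(-alpha * y * pred)` factor (y·pred = −1 when wrong), so later components concentrate on the hard cases, which is exactly the noise-resistance mechanism discussed above.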
4. Experiment and Analysis
In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe
database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities in biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.
All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.
4.1. Data Set. To work with machine learning, we construct training and test sets for the two target proteins, which include decoys and known active compounds. For the training set, we selected the experimentally determined complex structures of these two proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world, obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds, respectively, from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB as protein-inhibitor complexes with high inhibitor activity and high-resolution crystal structures. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor, resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491, resolution 2.20 Å) was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. To prepare for docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions, and Pipeline Pilot of SciTegic to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Then the additional ring conformations of the compounds were generated by LigPrep, and the GLIDE score was used to select five poses for each compound in the docking results. In this paper, five docking poses of each active compound were chosen by GLIDE score as the positive examples, because in a preliminary test this procedure generated higher enrichment factors than using one pose of each active compound. Other settings of GLIDE were set
Table 3: Experimental data structure.

Data set: Positive sample / Negative sample / Total
Training set: 100 / 2000 / 2100
Test set: 100 × 5 = 500 / 2000 × 5 = 10000 / 10500
to the default values. In this paper, we use the averages of the EF and the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. We then randomly selected the docking poses of 2000 decoy compounds as the negative samples; each decoy compound was docked to the target proteins in the same way as mentioned above.
After the completion of the training set, we set out to build the test set. First, we chose the active compounds of the target proteins (IC50 ≤ 10 μM) from StARLITe and divided them into 100 clusters. The dividing strategy was hierarchical clustering with the Ward method based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses for each active compound were used as positive samples of the test set. The docking procedure and target protein crystal structures are the same as those mentioned above. Then the negative-sample selection strategy of the training set was used to choose the decoys for the test set.
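The clustering step above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the fingerprints and activity values are random stand-ins for the real StARLITe data, and SciPy's hierarchical clustering replaces whatever tool was actually used; only the strategy (Ward linkage on Euclidean distances, then the most active compound per cluster) follows the text.

```python
# Ward hierarchical clustering of (toy) 2D fingerprints, then picking
# the most active compound per cluster, as in the test-set construction.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
fps = rng.integers(0, 2, size=(30, 64)).astype(float)  # toy binary fingerprints
activity = rng.random(30)                               # toy inhibitory activities

Z = linkage(fps, method="ward")                    # Ward linkage, Euclidean distance
cluster_id = fcluster(Z, t=5, criterion="maxclust")  # cut the tree into 5 clusters

# keep the most active compound from each cluster
representatives = [
    max(np.flatnonzero(cluster_id == c), key=lambda i: activity[i])
    for c in np.unique(cluster_id)
]
```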
After the completion of the data preparation, we use Pharm-IF to quantify the training set, then treat the data as the input of the machine learning algorithms to obtain the corresponding learning models. Each learning model obtained is then applied to the test set.
From Table 3, it can be seen that the proportion of negative samples to positive samples is up to 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.
4.2. The Evaluation Index of the Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indices [17]. The enrichment factor (EF) is one of the most famous measures for evaluating screening efficiency. EF is usually used to evaluate the early recognition properties of a screening method, and it indicates the ratio of the number of active compounds obtained by in silico screening to that obtained by random selection at a predefined sampling percentage. The calculation of EF is as follows:

\[ \mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hits}_t / N_t}. \tag{14} \]

Here Hits_s represents the number of active compounds in the sample, Hits_t is the total number of active compounds tested, N_s is the number of compounds sampled, and N_t is the number of all compounds. In actual drug discovery, only a small part of the compounds are filtered by computer; in general, 0.01-1% of the compounds are selected from a huge compound database (10,000-1,000,000 compounds) in the actual virtual screening process. Since the number of test compounds in this experiment is far from reaching that order of magnitude, we use
Figure 3: Two ROC curves, X and Y (true positive rate versus false positive rate).
10% EF to carry out this test in order to reduce the deviation of the early evaluation. EF reflects the screening efficiency only at a specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire sampling range (0-100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in a ROC curve the true positive rate (TPR) is plotted along the y-axis while the false positive rate (FPR) is displayed on the x-axis. Although the ROC curve can directly reflect the effect of a machine learning model, we also need a numerical measure, the AUC (area under ROC curve) value, to evaluate the effect of the model in practical applications. The AUC value is the area under the ROC curve, and it is more intuitive and accurate. The calculation of the AUC value is as follows:
\[ \mathrm{AUC} = \int_{0}^{1} \frac{\mathrm{TP}}{P} \, d\!\left(\frac{\mathrm{FP}}{N}\right) = \frac{1}{P \cdot N} \int_{0}^{N} \mathrm{TP} \, d\,\mathrm{FP}. \tag{15} \]
Among the variables of formula (15), P stands for the number of positive samples, N represents the number of negative samples, TP (true positives) stands for the active compounds that are classified correctly, and FP (false positives) stands for the decoys misclassified as active compounds.

AUC values are between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the better one has the higher AUC value. ROC curves and AUC are not affected by an imbalanced class distribution or by the distribution of the data. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
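The two metrics above can be computed directly from ranked screening scores. This is a generic sketch of eq. (14) and a rank-based equivalent of eq. (15); the function names and the toy scores are ours, not from the paper.

```python
# Enrichment factor (eq. (14)) at a sampling fraction, and AUC
# computed by ranking (equivalent to the area in eq. (15)).
import numpy as np

def enrichment_factor(scores, labels, fraction=0.10):
    """EF = (Hits_s / N_s) / (Hits_t / N_t) at the top `fraction` of ranks."""
    order = np.argsort(scores)[::-1]        # best-scored compounds first
    n_s = max(1, int(round(fraction * len(scores))))
    hits_s = labels[order[:n_s]].sum()      # actives in the sampled top
    hits_t, n_t = labels.sum(), len(labels) # actives / compounds overall
    return (hits_s / n_s) / (hits_t / n_t)

def auc_by_rank(scores, labels):
    """AUC as the probability a random active outscores a random decoy."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    # count (active, decoy) pairs ranked correctly; ties count half
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

With 10 actives among 100 compounds, a perfect ranking yields EF = 10 at 10% sampling (the maximum possible here) and AUC = 1.0, matching the interpretation of the two metrics given above.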
4.3. The Analysis of Experimental Results. In the experiments, we used three different classification methods
Figure 4: Comparative experiment of SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM).
Figure 5: Comparative experiment of Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM).
(SVM, Adaboost-SVM, and RF) to compare the two target proteins in the virtual screening experiment. As the baseline algorithm, RF is also developed on the machine learning libs of the authors' lab, and the parameters of RF are set in Table 4.
The ROC curves from the comparative experiments on these two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.
Table 5 shows the 10% EF of the three methods together with the calculated AUC values.
Table 4: Random Forest parameter settings.

Parameter name                                    Parameter values
Tree number                                       1000
Node size                                         5
Number of different descriptors tried at each split   50

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm        Target protein    10% EF    AUC
SVM              SRC               4.7       0.734
SVM              Cathepsin K       3.9       0.683
Adaboost-SVM     SRC               5.5       0.821
Adaboost-SVM     Cathepsin K       4.8       0.802
Random Forest    SRC               5.3       0.805
Random Forest    Cathepsin K      4.5       0.783

Aiming at the problem of sample set quality, the Adaboost method initially gives the same weight value to each training sample; the sample weight represents the probability of the data being used as training data by a weak classifier. At each iteration, the Adaboost algorithm modifies the weight values of the samples: if a training sample is correctly classified in this iteration, its weight value is reduced, that is, the probability of it being used as a training sample in the next iteration is reduced. On the contrary, if a training sample is misclassified by the current base classifier, its weight value is increased, and the probability of it being used as a training sample in the next iteration is increased. In this way, the weak classifiers pay more attention to the seriously misclassified part of the training set. The experimental results above show that Adaboost-SVM notably outperforms RF. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening shows its ability of noise resistance when the number of crystal structure samples is limited, and thus better screening results are obtained.
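The reweighting described above can be sketched in a few lines (a minimal illustration of the Adaboost update with an invented helper name and toy weights; the paper's own implementation is not open source):

```python
import math

def update_weights(weights, correct, alpha):
    """One Adaboost round: multiply each sample weight by e^(-alpha)
    if it was classified correctly and by e^(+alpha) otherwise,
    then renormalize so the weights sum to 1."""
    new_w = [w * math.exp(-alpha if ok else alpha)
             for w, ok in zip(weights, correct)]
    z = sum(new_w)
    return [w / z for w in new_w]

# Four samples with equal initial weight; sample index 2 is misclassified.
w = update_weights([0.25, 0.25, 0.25, 0.25],
                   correct=[True, True, False, True], alpha=0.5)
print(w)  # the misclassified sample now carries the largest weight
```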
5. Conclusion
In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method mainly uses Pharm-IF to encode protein-ligand interactions in a binary form and then uses the improved SVM algorithm Adaboost-SVM and Random Forest to classify the data. The idea of ensemble learning in data classification is to obtain a number of weak classifiers that are independent of each other and then use an effective method to combine these independent weak classifiers. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF for the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed from the results that, compared with the naive SVM, Random Forest obtains better performance on both 10% EF and AUC: the 10% EF is improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC is improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning is able to get better results on the imbalanced data set with satisfying
robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method brings a more significant improvement in the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually increasing, the problem of the virtual screening false positive rate is still to be solved. The speed of laboratory determination of protein structures has been unable to catch up with the needs of drug development; therefore, virtual screening will often encounter problems similar to those in this paper, so there are still many possible improvements in the algorithm. For example, the selection of the kernel function will directly affect the performance of the classifier. For this kind of data set, a special kernel function should be set up to fit the characteristics of the data.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).
References
[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.
[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.
[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.
[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.
[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.
[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.
[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.
[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.
[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.
[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.
[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.
[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.
[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.
[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.
[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.
[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.
of negative samples are introduced. The accumulation of negative samples may produce an imbalanced data set, which is a common phenomenon and of great value in studies on bioinformatics.
On the prediction of DNA-binding proteins, Song et al. propose an ensemble learning algorithm, imDC, according to the analysis of unbalanced DNA-binding protein data, which outperforms classic classification models like SVM under the same conditions [7]. Based on the ensemble learning framework, Zou et al. give a new predictor to improve the performance of tRNAscan-SE annotation, and the experimental results show that their algorithm can distinguish functional tRNAs from pseudo-tRNAs [8]. Lin et al. propose merging a K-means static selective strategy and ensemble forward sequential selection on an ensemble learning architecture for the hierarchical classification of protein folds, with the accuracy reaching 74.21%, which is the state-of-the-art strategy at present [9]. Zou et al. combine synthetic minority oversampling and K-means clustering undersampling to tackle the negative influence brought by imbalanced data sets [10].
Obviously, it is common for bioinformatics studies to face imbalanced data sets. The widely utilized strategies at present include preprocessing the training samples and improving the classifiers. In this paper, we start from the perspective of improving the machine learning algorithm, introducing the ensemble learning method on the basis of a simple SVM classifier and using layered combination and iterative weighting to enhance the performance of the classifier, so as to reduce the impact of the quality of the sample set. Meanwhile, this paper introduces Random Forest as the experimental baseline to examine the effect of ensemble learning on virtual screening.
2. The Quantitative Method
With the rapid development of combinatorial chemistry, bioinformatics, molecular biology, and computer science, computer-aided drug design (CADD) is widely used. Virtual screening, one of the most widely used methods in CADD because it is quick and low cost, has gradually replaced high-throughput screening as the main means of drug screening [11]. In this paper, we use the virtual screening method to screen the drug protein.
2.1. General Process of Virtual Screening. Virtual screening is also known as computer screening; it is a prescreening of compound molecules on the computer to reduce the number of actually screened compounds and to improve the efficiency of the discovery of lead compounds. The workflow of the virtual screening process is shown in Figure 1.
Virtual screening includes four steps: the establishment of the receptor model, the generation of small molecule libraries, the computer screening, and the postprocessing of hit compounds.
Step 1 (the establishment of the receptor model). (1) Obtaining the macromolecular structure: preparation of the protein structure is an important step in virtual screening. The crystal structures used in the virtual screening can be
Figure 1: Virtual screening process (ligand database preparation from a 2D molecular database: atom/bond type assignment, 3D structure optimization, protonation/charge assignment, and conformation assembly; protein structure preparation: binding site determination and grid construction; then docking, scoring, ranking/selection, and drug-likeness analysis, yielding a hit candidates list).
directly obtained from the PDB or built by homology modeling from the sequence and structure information of a homologous protein.
(2) Binding site description: the choice of an appropriate ligand binding pocket is very important in molecular docking. There are two ways to choose it: (1) take it directly from the ligand-receptor complex structure; (2) if there is no complex structure, manually choose the binding site according to experimental information on biological functions, such as mutation and binding data.
Step 2 (the generation of small molecule libraries). We can use a conversion program to translate the two-dimensional structures into three-dimensional structures. The obtained 3D structures can be used for the docking process after adding hydrogen atoms and charges.
Step 3 (docking and scoring). The docking operation places every small molecule onto the ligand binding site of the receptor protein, optimizes the conformation and location of the ligand, and determines the best combination. The best conformations are scored, all compounds are ranked according to the scores, and the small molecules with the highest scores are picked out of the compound library. The docking algorithm aims to predict the complex conformation generated by the receptor and the ligand. The purpose of the scoring function is to choose a conformation from the candidate set of conformations according to the score. The scoring function gives a lower (better) score when the docking result is closer to the native complex.
Figure 2: The calculation process of Pharm-IF [14] (interaction pairs such as Hbond(D)-Hbond(D), Hbond(D)-Hbond(A), Hbond(A)-Hbond(A), and Hbond(D)-Ionic(Cat) are assigned to distance bins in Å, with fractional contributions, e.g., 0.77/0.23 and 0.66/0.34, split between neighboring bins).
However, there is no completely correct scoring function; so far, all of the scoring functions used in existing docking algorithms are only approximations of the correct scoring function.
Step 4 (postprocessing of hit compounds). Using only the simple scoring model leads to large differences in the final results and sometimes to wrong judgments, so the final results must be analyzed from multiple perspectives and postprocessed. The purpose of this analysis and postprocessing is to assess the protein-ligand binding free energies as accurately as possible. The generated complex candidate set is classified, and the erroneous results are distinguished.
As the accuracy of the scoring function in virtual screening has not been properly resolved, in this paper we use the protein-ligand interaction fingerprint (IFP) to deal with the interactions between target proteins and ligands. The IFP encodes the observed interactions between ligand and protein into a binary string of fixed length [12]. The IFP method was originally designed for analyzing ligand docking poses on protein kinases. Based on this method, the atom-based IFP concept was put forward and extended. Each kind of IFP has its own characteristics, whether it is residue-based IFP or atom-based IFP. A one-dimensional interaction fingerprint is easier to generate and compare than the 3D structure of a protein-ligand complex, and it is more suitable for computer-aided drug design [13].
2.2. The Concept and Calculation Process of Pharm-IF. In this paper, we use a kind of atom-based fingerprint, Pharm-IF, as an aid to verify the theory in this paper. The concept of Pharm-IF was put forward by Sato et al. [14]. The Pharm-IF is calculated from the distances of pairs of ligand pharmacophore features that interact with protein atoms, and it can detect important geometrical patterns of ligand pharmacophores.
The calculation of Pharm-IF can be divided into the following three steps, as Figure 2 shows.
Step 1. Detect the protein-ligand interactions from complex structures. Interactions can be classified into six types: (1) hydrogen bond with ligand acceptor; (2) hydrogen bond with ligand donor; (3) hydrogen bond in which the roles of ligand and protein atoms could not be determined; (4) ionic interaction with ligand cation; (5) ionic interaction with ligand anion; (6) hydrophobic interaction.
Step 2. Create all possible interaction pairs. Each interaction pair is characterized by the pharmacophore features of the ligand atoms and their distance. To calculate the resulting matrix, each interaction pair is assigned to the corresponding bin. In an interaction pair of two hydrogen bonds, the ligand atoms are assigned to the vector corresponding to this hydrogen bond pair if the ligand atoms are a donor and an acceptor that are 4.3 Å apart from each other. For example, in order to describe the distance of 4.3 Å in the interaction, 0.7 is assigned to the bin of 4 Å and 0.3 is assigned to the bin of 5 Å.
Step 3. The resulting matrix is calculated by the summation of the values of all of the interaction pairs. The formulas of the Pharm-IF calculation are as follows:

\[ H_{t,k} = \sum_{i \in I_t} A_k(i), \tag{1} \]

\[ A_k(i) = \begin{cases} 0, & \text{if } |k - d_i| \ge 1, \\ 1 - |k - d_i|, & \text{otherwise}. \end{cases} \tag{2} \]

In formula (1), H stands for the interaction fingerprint of a protein-ligand complex by Pharm-IF, t is the pair of the six types of pharmacophore features, k = 1, 2, 3, ... stands for the corresponding bins of the distances (Å) between ligand atoms, I_t represents the whole set of the interactions classified as type t, and i represents an element of I_t. In formula (2), d_i represents the distance (Å) between the ligand atoms of interaction pair i.
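The triangular binning of formulas (1)-(2) can be sketched as follows (our own minimal illustration; the function name and the 13-bin width are assumptions, not from the paper):

```python
def pharm_if_bins(distances, n_bins=13):
    """Sum the triangular contributions A_k(i) = max(0, 1 - |k - d_i|)
    of all interaction-pair distances d_i over integer bins k (in Å)."""
    H = [0.0] * n_bins
    for d in distances:
        for k in range(n_bins):
            contrib = 1.0 - abs(k - d)
            if contrib > 0.0:
                H[k] += contrib
    return H

# A donor-acceptor pair 4.3 Å apart contributes 0.7 to the 4 Å bin
# and 0.3 to the 5 Å bin, matching the worked example in the text.
bins = pharm_if_bins([4.3])
print(bins[4], bins[5])
```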
2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. These two proteins are hotspot drug targets in the pharmaceutical field, and neither of them has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses will influence the virtual screening efficiency.
Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase protein that in humans is encoded by the SRC gene. The SRC family of kinases is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC widely exists in tissue cells, and it plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in signal transduction pathways. The c-Src is made up of 6 functional regions: the SRC homology 4 (SH4) domain, the unique region, the SH3 domain, the SH2 domain, the catalytic domain, and the short regulatory tail. When SRC is inactive, there are interactions between the phosphorylated Tyr527 (tyrosine 527) and the SH2 domain; at the same time, the SH3 domain combines with the proline-rich SH2-kinase linker domain. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, the links between these regions break and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and this protein is closely associated with a variety of cancers. Activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. As a drug target, a number of tyrosine kinase inhibitors targeting the c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].
Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2; the length of the transcript is 1.7 kb, and it consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. In the process of bone resorption, acid dissolves the hydroxyapatite, and the organic ingredients in the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acid environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. The function of Cathepsin K in osteoclasts has been recognized; therefore, many labs treat its inhibitors as drug candidates for the treatment of osteoporosis. At present, the first choice of antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, long-term use of bisphosphonates may produce adverse reactions: esophageal stimulus symptoms, hypocalcaemia, kidney irritation, and so on. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation at the same time, so new replacement therapy drugs are meaningful [16].
3. Classification Algorithm Based on the Adaboost-SVM
Currently, SVM can deal with many problems, such as small sample sizes, nonlinearity, and high dimensionality. Based on statistical learning theory, it has a simple mathematical form, a fast training method, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. Under the condition that the quality of the sample set is not very low, even without any improvements we can get a good result. The learning mechanism of SVM provides a lot of room to improve the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which provides only global minima and hence avoids being trapped in local minima; therefore, in this paper, we use SVM as the base classifier. There is already a large body of literature on SVM, so this paper gives only a brief introduction. The basic process of SVM classification is as follows.
For a given sample set L = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, with x_i \in R^d and y_i \in \{1, -1\}, i = 1, 2, ..., n, y_i stands for the category of sample x_i, d is the dimension of the samples, and n is the number of training samples. If the input vector set is linearly separable, then it can be separated by a hyperplane, which can be expressed as

\[ w \cdot x - b = 0. \tag{3} \]
Here w is the normal vector of the hyperplane and b is the offset. The SVM learning problem is minimizing the objective function

\[ \min \phi(w) = \frac{1}{2}\|w\|^{2} + C\left(\sum_{i=1}^{n} \xi_i\right), \tag{4} \]

subject to the condition

\[ y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n. \tag{5} \]
Here (1/2)\|w\|^{2} is the structure complexity, C(\sum_{i=1}^{n} \xi_i) stands for the empirical risk, \xi_i denotes the slack variable, and C is a constant that acts as the punishment factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vector into a high-dimensional feature space and construct an optimal hyperplane in that feature space.
Through the mapping \phi, x in the space R^n is mapped into H:

\[ x \rightarrow \phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_i(x))^{T}. \tag{6} \]
Eventually, we can determine the optimal classification function

\[ f(x) = \operatorname{sgn}(w \cdot \phi(x) + b) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i \phi(x_i) \cdot \phi(x) + b\right). \tag{7} \]
In our work, the Radial Basis Function (RBF) is taken as the kernel function of SVM, and the mathematical description of this kernel is given below:

\[ K(x_i, x_j) = e^{-\|x_i - x_j\|^{2}/2\sigma^{2}}. \tag{8} \]
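As an illustration of formulas (7)-(8) only (not the paper's trained model; the toy support vectors and coefficients below are invented), the RBF-kernel decision function can be sketched as:

```python
import math

def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2)), formula (8)."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b=0.0, sigma=1.0):
    """f(x) = sgn(sum_i a_i * y_i * K(x_i, x) + b), formula (7),
    with the kernel replacing the inner product phi(x_i) . phi(x)."""
    s = sum(a * y * rbf_kernel(sv, x, sigma)
            for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if s >= 0 else -1

# Hand-made toy "model": one support vector per class.
svs = [(0.0, 0.0), (2.0, 2.0)]
print(svm_decision((0.1, 0.1), svs, alphas=[1.0, 1.0], labels=[1, -1]))  # → 1
print(svm_decision((1.9, 1.9), svs, alphas=[1.0, 1.0], labels=[1, -1]))  # → -1
```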
Relative to other classifiers, SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set will destabilize the classifier, and the instability of the classifier will directly affect the final classification result. In this paper, we introduce the Adaboost mechanism of ensemble learning to divide one classification process into several layers of weak classifiers based on SVM.
The key to the combination of Adaboost and SVM is to find a suitable Gauss width σ value for each component classifier. If the σ value is relatively large, the component classifier is too weak and the final classification performance decreases. On the other hand, if the σ value is relatively small, the component classifiers become too strong and their errors are highly correlated; the differences between them are small, so that the ensemble learning is invalid. Even more importantly, a σ value that is too small will lead to overfitting, resulting in greatly reduced generalization. Therefore, in this paper, the standard deviation of the sample set of each component classifier is used as the σ value of that component classifier to control its classification accuracy; thus, an SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so we need to explain some key parameters. We list the values of σ, C, and other parameters in Table 1.
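The paper does not spell out how "the standard deviation of the sample set" is computed; one plausible reading (our assumption: a pooled population standard deviation over all feature values of the component's training samples) is:

```python
import math

def component_sigma(samples):
    """Assumed heuristic: pool all feature values of the component
    classifier's training samples and return their population
    standard deviation, to be used as the RBF width sigma."""
    values = [v for x in samples for v in x]  # flatten feature vectors
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var)

print(component_sigma([[1.0, 3.0], [5.0, 7.0]]))  # std of 1, 3, 5, 7
```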
The specific process of the algorithm is as follows.
(1) RBFSVM denotes the SVM with the RBF kernel; T denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weights of each sample, w_1(i) = 1/n, i = 1, 2, ..., n.
(3) For t = 1, 2, ..., T: for each h(x_i), calculate the weighted error

\[ \epsilon_t = \sum_{j=1}^{n} D_t(j)\,\bigl|h(x_j) - y(j)\bigr|. \tag{9} \]
Choose the feature with the lowest weighted error rate \epsilon_j and save its corresponding SVM model. Calculate the selected weak classifier's weight:

\[ a_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right). \tag{10} \]
Update the sample weights according to a_t:

\[ D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-a_t}, & \text{if } h_t(x_i) = y_i, \\ e^{a_t}, & \text{if } h_t(x_i) \ne y_i. \end{cases} \tag{11} \]
Table 1: SVM parameter settings.

Parameter name                                        Parameter values
SVM type                                              C-SVM
Class number                                          2
Kernel function                                       RBF
Degree in kernel function                             3
σ in kernel function                                  0.001
Cost factor                                           5
Cache size                                            500 MB
Tolerance in the termination criteria                 0.001
Weight of the penalty factor for all sample classes   1
Cross validation                                      5
Table 2: Adaboost-SVM parameter settings.

Parameter name                      Parameter values
Ensemble learning type              Adaboost
Class number                        2
Basic classifier type               C-SVM
Number of classifiers per layer     100
Max false alarm rate                0.5
Min hit rate                        0.9
Number of iterations                5
Weight trim rate                    0.9
Cache size                          500 MB
The weights are normalized so that

\[ \sum_{i=1}^{n} D_t(i) = 1. \tag{12} \]
(4) Apply the strong classifier H, integrated from the SVM weak classifiers, to the training set:

\[ H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} a_t h_t(x)\right]. \tag{13} \]
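The iterative procedure of steps (1)-(4) can be sketched as a generic Adaboost loop. This is our own minimal illustration, with threshold "stumps" standing in for the RBF-SVM base learners of the paper (training a real SVM per round follows the same pattern):

```python
import math

def train_adaboost(X, y, learners, T=5):
    """Generic Adaboost: at each round pick the base learner with the
    lowest weighted error (formula (9)), weight it by formula (10),
    and reweight the samples by formulas (11)-(12)."""
    n = len(X)
    D = [1.0 / n] * n                       # step (2): uniform weights
    ensemble = []                           # list of (a_t, h_t) pairs
    for _ in range(T):
        errors = [sum(d for d, x, t in zip(D, X, y) if h(x) != t)
                  for h in learners]
        eps = min(errors)
        h = learners[errors.index(eps)]
        if eps >= 0.5:                      # base learner must beat chance
            break
        if eps == 0:                        # perfect learner: keep it, stop
            ensemble.append((1.0, h))
            break
        a_t = 0.5 * math.log((1 - eps) / eps)            # formula (10)
        ensemble.append((a_t, h))
        D = [d * math.exp(-a_t if h(x) == t else a_t)    # formula (11)
             for d, x, t in zip(D, X, y)]
        z = sum(D)
        D = [d / z for d in D]              # formula (12): normalize
    def H(x):                               # formula (13): strong classifier
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H

# Toy 1D data and three hypothetical threshold learners.
X, y = [0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (1, 2, 3)]
H = train_adaboost(X, y, stumps)
print([H(x) for x in X])  # → [-1, -1, -1, 1, 1, 1]
```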
For the Adaboost-SVM, we set the parameters in Table 2; the basic SVM parameters are the same as in Table 1.
Compared to a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent of the others and that the probability of sample misclassification be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, and this can also reduce the impact of the quality of the sample set on the virtual screening effect.
4. Experiment and Analysis
In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe
database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities in biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.
All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.
4.1. Data Set. In order to cooperate with the machine learning, in this paper we construct training sets and test sets for these two target proteins, which include decoys and known active compounds. For the training set, we selected the experimentally determined complex structures of these two kinds of proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world, obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly and respectively selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of these two proteins used in the docking experiments were selected from the PDB as protein-inhibitor complexes with high inhibitor activity and high-resolution crystal structures. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor, resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491, resolution 2.20 Å) was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. For the preparation of the docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions. Using Pipeline Pilot of SciTegic, we enumerated the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. The additional ring conformations of the compounds were then generated by LigPrep. Then the GLIDE score was used to select five poses for each compound from the docking results. In this paper, five docking poses of each active compound were chosen as positive examples by the GLIDE score, because this procedure generated higher enrichment factors in a preliminary test than using one pose of each active compound. The other settings of GLIDE were set
Table 3: Experimental data structure.

Data set      | Positive samples   | Negative samples    | Total
Training set  | 100                | 2000                | 2100
Test set      | 100 × 5 = 500      | 2000 × 5 = 10000    | 10500
to the default values. In this paper, we use the averages of the 10% EF and the ROC scores over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. The docking poses of 2000 decoy compounds were then randomly selected as the negative samples; each decoy compound was docked to the target proteins in the same way as described above.
After the completion of the training set, we set out to build the test set. First, the active compounds of the target proteins (IC50 ≤ 10 μM) were chosen from StARLITe and divided into 100 clusters by hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses of each active compound were used as positive samples of the test set. The docking procedure and the target protein crystal structures are the same as those mentioned above. The decoys for the test set were then chosen with the same negative-sample selection strategy as for the training set.
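The clustering step above can be sketched with SciPy; this is an illustrative example rather than the authors' pipeline, and the fingerprint matrix `fps` and the `activity` values are random stand-ins for the real 2D structure fingerprints and inhibitory activities.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in for 2D structure fingerprints of active compounds:
# one binary fingerprint per row (random here, for illustration only).
rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(500, 166)).astype(float)
activity = rng.random(500)  # stand-in activity values (higher = more active)

# Ward hierarchical clustering on Euclidean distances, cut into 100 clusters.
Z = linkage(fps, method="ward", metric="euclidean")
labels = fcluster(Z, t=100, criterion="maxclust")

# Pick the most active compound from each cluster as its representative.
representatives = [
    int(np.flatnonzero(labels == c)[np.argmax(activity[labels == c])])
    for c in np.unique(labels)
]
print(len(representatives))  # one representative per cluster
```

These representatives would then be the compounds passed on to docking.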
After the completion of the data preparation, Pharm-IF is used to quantify the training set. The resulting data are treated as the input of the machine learning algorithms to obtain the corresponding learning models, and each learning model is then applied to the test set.
From Table 3 it can be seen that the ratio of negative to positive samples is as high as 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.
4.2. The Evaluation Indices of Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indices [17]. The enrichment factor (EF) is one of the best-known measures for evaluating screening efficiency. EF is usually used to evaluate the early-recognition properties of a screening method; it indicates the ratio of the number of active compounds obtained by in silico screening to the number expected from random selection at a predefined sampling percentage. EF is calculated as follows:
$$\mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hits}_t / N_t} \tag{14}$$
Here, Hits_s represents the number of active compounds in the sample, Hits_t is the number of active compounds tested, N_s is the number of compounds sampled, and N_t is the number of all compounds. In actual drug discovery, only a small part of the compounds is filtered by computer: in general, 0.01–1% of the compounds will be selected from a huge compound database (10,000–1,000,000 compounds) in an actual virtual screening process. Since the size of the test sets in this experiment is far from reaching this order of magnitude, we use
Computational and Mathematical Methods in Medicine 7
Figure 3: Two ROC curves, X and Y (true positive rate plotted against false positive rate).
the 10% EF to carry out this test, in order to reduce the deviation of the early evaluation. EF reflects the screening efficiency only at a specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire sampling range (0–100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in a ROC curve the true positive rate (TPR) is plotted along the y-axis, while the false positive rate (FPR) is displayed on the x-axis. Although the ROC curve can directly reflect the effect of a machine learning model, we also need a numerical measure, the AUC (area under the ROC curve) value, to evaluate the effect of the model in practical applications. The AUC value indicates the area under the ROC curve, and it is more intuitive and accurate. The AUC value is calculated as follows:
$$\mathrm{AUC} = \int_0^1 \frac{\mathrm{TP}}{P}\, d\!\left(\frac{\mathrm{FP}}{N}\right) = \frac{1}{P \cdot N} \int_0^N \mathrm{TP}\, d\,\mathrm{FP} \tag{15}$$
Among the variables of formula (15), P stands for the number of positive samples, N represents the number of negative samples, TP (true positives) stands for the active compounds that are classified correctly, and FP (false positives) stands for the decoy compounds that are misclassified as active.
AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is no better than random guessing, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and AUC are affected neither by an imbalanced class distribution nor by deviations from a normal distribution of the data. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
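As an illustration of formulas (14) and (15), the 10% EF and the AUC can be computed from ranked screening scores as sketched below; this is a minimal example on made-up scores, not the authors' evaluation code (the AUC uses the Mann-Whitney rank formulation, which equals the integral in formula (15)).

```python
import numpy as np

def enrichment_factor(scores, labels, fraction=0.10):
    """EF = (Hits_s / N_s) / (Hits_t / N_t), formula (14), at a sampling fraction."""
    order = np.argsort(scores)[::-1]               # rank compounds, best score first
    n_s = max(1, int(round(fraction * len(scores))))
    hits_s = labels[order[:n_s]].sum()             # actives found in the top fraction
    hits_t, n_t = labels.sum(), len(labels)        # total actives / total compounds
    return (hits_s / n_s) / (hits_t / n_t)

def auc_score(scores, labels):
    """AUC of formula (15) via the equivalent Mann-Whitney rank statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores)); ranks[order] = np.arange(1, len(scores) + 1)
    p = labels.sum(); n = len(labels) - p
    return (ranks[labels == 1].sum() - p * (p + 1) / 2) / (p * n)

# Toy screen: 5 actives among 100 compounds, actives scored higher on average.
rng = np.random.default_rng(1)
labels = np.zeros(100, dtype=int); labels[:5] = 1
scores = rng.normal(0.0, 1.0, 100) + 2.0 * labels
print(enrichment_factor(scores, labels, 0.10), auc_score(scores, labels))
```

A perfectly ranked screen yields an AUC of 1.0, and the EF approaches its maximum of 1/(active fraction).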
4.3. The Analysis of Experimental Results. In the experiments, we used three different classification methods
Figure 4: Comparative experiment on SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM).
Figure 5: Comparative experiment on Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM).
(SVM, Adaboost-SVM, and RF) to compare the virtual screening performance on the two target proteins. As the baseline algorithm, RF is also implemented on the machine learning libraries of the authors' lab, and the parameters of RF are given in Table 4.
The ROC curves from the comparative experiments on these two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.
Table 5 shows the 10% EF of the three methods together with the corresponding AUC values.
Aiming at the problem of sample set quality, the Adaboost method initially gives the same weight to each training sample; the sample weight represents the probability of the sample being used for training by a weak classifier. At each iteration, the Adaboost algorithm modifies the weight of each sample: if a training sample is correctly classified in this
Table 4: Random Forest parameter settings.

Parameter name                                        | Value
Tree number                                           | 1000
Node size                                             | 5
Number of different descriptors tried at each split   | 50
Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm      | Target protein | 10% EF | AUC
SVM            | SRC            | 4.7    | 0.734
SVM            | Cathepsin K    | 3.9    | 0.683
Adaboost-SVM   | SRC            | 5.5    | 0.821
Adaboost-SVM   | Cathepsin K    | 4.8    | 0.802
Random Forest  | SRC            | 5.3    | 0.805
Random Forest  | Cathepsin K    | 4.5    | 0.783
iteration, the weight of the sample is reduced; that is, its probability of being used as a training sample in the next iteration decreases. On the contrary, if the training sample is misclassified by the current base classifier, the weight of the sample is increased, and its probability of being used as a training sample in the next iteration increases. In this way, the weak classifiers pay more attention to the severely misclassified part of the training set. The experimental results above show that Adaboost-SVM notably outperforms RF. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening shows noise resistance when the number of structure samples is limited, and thus better screening results are obtained.
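The reweighting behavior described above can be illustrated numerically; this is a toy sketch of one Adaboost round (formulas (10)-(12)), not the authors' implementation.

```python
import numpy as np

def adaboost_reweight(weights, predictions, labels):
    """One Adaboost round: raise weights of misclassified samples, lower the rest."""
    eps = weights[predictions != labels].sum()            # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)                 # classifier weight, formula (10)
    new_w = weights * np.exp(np.where(predictions == labels, -alpha, alpha))  # formula (11)
    return new_w / new_w.sum(), alpha                     # normalize, formula (12)

w = np.full(10, 0.1)                                      # uniform initial weights
y = np.ones(10, dtype=int)
pred = y.copy(); pred[0] = -1                             # one sample misclassified
w2, alpha = adaboost_reweight(w, pred, y)
print(w2[0], w2[1])   # the misclassified sample now carries more weight
```

With one of ten samples misclassified, that sample's weight rises to half of the total mass, so the next weak classifier concentrates on it.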
5. Conclusion
In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method mainly uses Pharm-IF to encode protein-ligand interactions in binary form and then uses the improved SVM algorithm Adaboost-SVM, as well as Random Forest, to classify the data. The idea of ensemble learning for data classification is to obtain a number of weak classifiers that are independent of each other and then combine these independent weak classifiers by an effective method. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF of the SRC model increases from 4.7 to 5.5 and the AUC value increases from 0.734 to 0.821; the 10% EF of the Cathepsin K model increases from 3.9 to 4.8 and the AUC value increases from 0.683 to 0.802. It can also be observed from the results that, compared with the naïve SVM, Random Forest obtains better performance on both 10% EF and AUC: the 10% EF is improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC is improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning is able to obtain better results on the imbalanced data set with satisfying
robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method brings a significant improvement in the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computational theory. Although the status of virtual screening is gradually increasing, its false positive rate remains a problem to be solved. The speed of laboratory determination of protein structures cannot catch up with the needs of drug development; therefore, virtual screening will often encounter problems similar to those in this paper, and there are still many possible improvements to the algorithm. For example, the selection of the kernel function will directly affect the performance of the classifier; for data sets of this kind, a special kernel function should be designed to match the characteristics of the data set.
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).
References
[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.
[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.
[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.
[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.
[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.
[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.
[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.
[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.
[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.
[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.
[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.
[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.
[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.
[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.
[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.
[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.
Figure 2: The calculation process of Pharm-IF [14] (interaction pairs such as Hbond(D)-Hbond(A) are assigned to bins by the distance in Å between the ligand atoms).
However, there is no completely correct scoring function: so far, all the scoring functions used in the existing docking algorithms are only approximations of the correct scoring function.
Step 4 (postprocessing of hit compounds). Using only the simple scoring model will lead to huge differences in the final results and sometimes to wrong judgments, so the final results must be analyzed from multiple perspectives and postprocessed. The purpose of this analysis and postprocessing is to assess the protein-ligand binding free energies as accurately as possible: the generated set of candidate complexes is classified, and erroneous results are distinguished.
As the accuracy of the scoring function in virtual screening has not been properly resolved, in this paper we use the protein-ligand interaction fingerprint (IFP) to describe the interactions between target proteins and ligands. The IFP encodes the observed interactions between ligand and protein into a binary string of fixed length [12]. The IFP method was originally designed for analyzing ligand docking poses at protein kinases. Based on this method, the atom-based IFP concept was put forward and extended. Each kind of IFP has its own characteristics, whether it is residue-based or atom-based. A one-dimensional interaction fingerprint is easier to generate and compare than the 3D structure of a protein-ligand complex, and it is more suitable for computer-aided drug design [13].
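As a toy illustration of how an IFP encodes observed interactions into a fixed-length binary string (the residue list and interaction types below are made up for the example, not the paper's actual fingerprint definition):

```python
# Toy residue-based IFP: one bit per (residue, interaction type) slot.
RESIDUES = ["LYS45", "GLU81", "MET84", "ASP132"]          # hypothetical binding-site residues
TYPES = ["hbond_donor", "hbond_acceptor", "hydrophobic"]  # interaction types tracked

def encode_ifp(observed):
    """Encode a set of (residue, type) interactions as a fixed-length bit string."""
    bits = []
    for res in RESIDUES:
        for t in TYPES:
            bits.append(1 if (res, t) in observed else 0)
    return bits

pose = {("LYS45", "hbond_acceptor"), ("MET84", "hydrophobic")}
fp = encode_ifp(pose)
print(fp)  # 12-bit fingerprint, one bit per residue/type slot
```

Two docking poses can then be compared simply by comparing their bit strings, which is what makes such fingerprints convenient inputs for machine learning.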
2.2. The Concept and Calculation Process of Pharm-IF. In this paper, we use an atom-based fingerprint, Pharm-IF, as an aid to verify the theory of this paper. The concept of Pharm-IF was put forward by Sato et al. [14]. Pharm-IF is calculated from the distances of pairs of ligand pharmacophore features that interact with protein atoms, and it can detect important geometrical patterns of ligand pharmacophores.
The calculation of Pharm-IF can be divided into the following three steps, as Figure 2 shows.
Step 1. Detect the protein-ligand interactions from complex structures. Interactions can be classified into six types: (1) hydrogen bond with a ligand acceptor; (2) hydrogen bond with a ligand donor; (3) hydrogen bond in which the roles of the ligand and protein atoms could not be determined; (4) ionic interaction with a ligand cation; (5) ionic interaction with a ligand anion; (6) hydrophobic interaction.
Step 2. Create all possible interaction pairs. Each interaction pair is characterized by the pharmacophore features of the ligand atoms and their distance. To calculate the resulting matrix, each interaction pair is assigned to the corresponding bin. For an interaction pair of two hydrogen bonds, the ligand atoms are assigned to the vector corresponding to this hydrogen-bond pair if the ligand atoms are a donor and an acceptor 4.3 Å apart from each other. For example, to describe the distance of 4.3 Å in the interaction, 0.7 is assigned to the bin of 4 Å and 0.3 is assigned to the bin of 5 Å.
Step 3. The resulting matrix is calculated by summing the values of all the interaction pairs. The formula of the Pharm-IF calculation is as follows:
$$H_{t,k} = \sum_{i \in I_t} A_k(i), \tag{1}$$

$$A_k(i) = \begin{cases} 0, & \text{if } |k - d_i| \ge 1 \\ 1 - |k - d_i|, & \text{otherwise.} \end{cases} \tag{2}$$

In formula (1), H stands for the interaction fingerprint of a protein-ligand complex by Pharm-IF, t is a pair of the six types of pharmacophore features, k = 1, 2, 3, ... stands for the bins of the distances (Å) between ligand atoms, I_t represents the set of all interactions classified as type t, and i represents an element of I_t. In formula (2), d_i represents the distance (Å) between the ligand atoms of interaction pair i.
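Formulas (1) and (2) amount to soft-binning each interaction-pair distance over the two nearest integer-Å bins. A minimal sketch (the pair-type label and the distance are made up for the example):

```python
from collections import defaultdict

def pharm_if(pairs, n_bins=12):
    """Pharm-IF per formulas (1)-(2): for each feature-pair type t, spread every
    distance d_i linearly over the two nearest integer-angstrom bins k."""
    H = defaultdict(lambda: [0.0] * (n_bins + 1))
    for t, d in pairs:                   # t: pair type, d: distance in angstroms
        for k in range(n_bins + 1):
            if abs(k - d) < 1:           # A_k(i) = 1 - |k - d_i| when |k - d_i| < 1
                H[t][k] += 1 - abs(k - d)
    return dict(H)

# A donor-acceptor pair 4.3 A apart contributes 0.7 to bin 4 and 0.3 to bin 5,
# matching the worked example in Step 2.
H = pharm_if([("Hbond(D)-Hbond(A)", 4.3)])
print(H["Hbond(D)-Hbond(A)"][4], H["Hbond(D)-Hbond(A)"][5])
```

Summing the per-pair contributions reproduces the matrix built in Step 3.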
2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. These two proteins are hotspot drug targets in the pharmaceutical field, and neither of them has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses will influence the virtual screening efficiency.
Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase protein that in humans is encoded by the SRC gene. The SRC family of kinases is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC is widely present in tissue cells, and it plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in signal transduction pathways. The c-Src protein is made up of 6 functional regions: the SRC homology 4 (SH4) domain, a unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, phosphorylated Tyr527 (tyrosine 527) interacts with the SH2 domain, while the SH3 domain binds the proline-rich SH2-kinase linker domain. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these links break and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and it is closely associated with a variety of cancers: activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. As a drug target, a number of tyrosine kinase inhibitors targeting the c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].
Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2, the transcript is 1.7 kb long, and it consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and involved in bone resorption. In the process of bone resorption, acid dissolves the hydroxyapatite, and the organic ingredients in the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acid environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. The function of Cathepsin K in osteoclasts has been recognized; therefore, many labs treat its inhibitors as drug candidates for the treatment of osteoporosis. At present, the first choice of antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, the long-term use of bisphosphonates may produce adverse reactions: esophageal irritation, hypocalcaemia, kidney irritation, and so on. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation, so new replacement therapy drugs are meaningful [16].
3. Classification Algorithm Based on the Adaboost-SVM
Currently, SVM can deal with many problems, such as small sample sizes, nonlinearity, or high dimensionality. Based on statistical learning theory, it has a simple mathematical form, fast training methods, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. Even without any improvements, a good result can be obtained as long as the quality of the sample set is not very low, and the learning mechanism of SVM provides a lot of room to improve the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which provides only global minima and hence avoids being trapped in local minima; therefore, in this paper we use SVM as the base classifier. There is a large body of literature on SVM, so this paper gives only a brief introduction. The basic process of SVM classification is as follows.

For a given sample set $L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in R^d$ and $y_i \in \{1, -1\}$, $i = 1, 2, \ldots, n$, $y_i$ stands for the category of sample $x_i$, $d$ is the dimension of the samples, and $n$ is the number of training samples. If the input vector set is linearly separable, then it can be separated by a hyperplane. The hyperplane can be expressed as
$$w \cdot x - b = 0 \tag{3}$$
Here, w is the normal vector of the hyperplane and b is the offset. SVM learning minimizes the objective function
$$\min \phi(w) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n} \xi_i\right) \tag{4}$$
subject to the condition
$$y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n \tag{5}$$
Here, $\frac{1}{2}\|w\|^2$ measures the structural complexity, $C(\sum_{i=1}^{n} \xi_i)$ stands for the empirical risk, $\xi_i$ is a slack variable, and $C$ is a constant, the penalty factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vectors into a high-dimensional feature space and construct an optimal hyperplane in that feature space.
Through the mapping $\phi$, $x$ in the space $R^n$ is mapped into $H$:

$$x \longrightarrow \phi(x) = \left(\phi_1(x), \phi_2(x), \ldots, \phi_i(x)\right)^{\mathrm{T}} \tag{6}$$
Eventually we obtain the optimal classification function:

$$f(x) = \operatorname{sgn}\left(w \cdot \phi(x) + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i \phi(x_i) \cdot \phi(x) + b\right) \tag{7}$$
In our work, the Radial Basis Function (RBF) is taken as the kernel function of SVM, and its mathematical form is

$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2} \tag{8}$$
Relative to other classifiers, SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set can destabilize the classifier, and the instability of the classifier directly affects the final classification result. In this paper, we introduce the Adaboost mechanism of ensemble learning to divide one classification process into several layers of weak classifiers based on SVM.
The key to combining Adaboost and SVM is to find a suitable Gaussian width σ for each component classifier. If the σ value is relatively large, the component classifiers are too weak and the final classification performance decreases. On the other hand, if the σ value is relatively small, the component classifiers become strong but their errors are highly correlated; the differences between them are small, so that the ensemble learning becomes ineffective. Even more importantly, a σ value that is too small leads to overfitting and results in greatly reduced generalization. Therefore, in this paper, the standard deviation of the sample set of each component classifier is used as the σ value of that component classifier to control its classification accuracy; thus, an SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so we need to explain some key parameters: we list the values of σ, C, and other parameters in Table 1.
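This σ heuristic can be sketched with plain NumPy: a minimal illustration of formula (8) combined with the per-classifier σ choice (the data are synthetic stand-ins, not the authors' code).

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """RBF kernel of formula (8): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Synthetic sample set standing in for the Pharm-IF vectors seen by one
# component classifier.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.5, size=(50, 8))

# Paper's heuristic: take sigma as the standard deviation of the sample set,
# so the kernel width tracks the spread of the data each weak SVM sees.
sigma = X.std()
K = rbf_kernel(X, X, sigma)
print(K.shape)  # 50x50 kernel matrix with ones on the diagonal
```

Tying σ to the spread of each component's sample set keeps every weak SVM at a comparable, moderate strength, which is what the ensemble argument above requires.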
The specific process of the algorithm is as follows.

(1) RBFSVM denotes the SVM with the RBF kernel; T denotes the number of iterations of the Adaboost process.

(2) Initialization: initialize the weight of each sample, $w_1(i) = 1/n$, $i = 1, 2, \ldots, n$.
(3) For $t = 1, 2, \ldots, T$: for each candidate weak classifier $h(x_i)$, calculate the weighted error

$$\epsilon_t = \sum_{j=1}^{n} D_t(j)\left|h(X_j) - y(j)\right| \tag{9}$$

Choose the feature with the lowest weighted error rate and save its corresponding SVM model. Calculate the selected weak classifier's weight:

$$a_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \tag{10}$$
Update the sample weights according to $a_t$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-a_t}, & \text{if } h_t(x_i) = y_i \\ e^{a_t}, & \text{if } h_t(x_i) \neq y_i \end{cases} \tag{11}$$
Table 1: SVM parameter settings.

Parameter name                                       | Value
SVM type                                             | C-SVM
Number of classes                                    | 2
Kernel function                                      | RBF
Degree in the kernel function                        | 3
σ in the kernel function                             | 0.001
Cost factor C                                        | 5
Cache size                                           | 500 MB
Tolerance in the termination criterion               | 0.001
Weight of the penalty factor for all sample classes  | 1
Cross validation                                     | 5
Table 2: Adaboost-SVM parameter settings.

Parameter name                   | Value
Ensemble learning type           | Adaboost
Number of classes                | 2
Base classifier type             | C-SVM
Number of classifiers per layer  | 100
Max false alarm rate             | 0.5
Min hit rate                     | 0.9
Number of iterations             | 5
Weight trim rate                 | 0.9
Cache size                       | 500 MB
The weights are normalized so that

$$\sum_{i=1}^{n} D_t(i) = 1 \tag{12}$$
(4) Apply the strong classifier $H$, integrated from the SVM weak classifiers, to the training set:

$$H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} a_t h_t(x)\right] \tag{13}$$
For the Adaboost-SVM, we set the parameters as in Table 2; the basic SVM parameters are the same as in Table 1.
Thus, compared to a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent of the others and that the probability of sample misclassification be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, and this also reduces the impact of the quality of the sample set on the virtual screening effect.
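Steps (1)-(4) can be sketched end to end; this is a simplified stand-in in which threshold stumps replace the paper's RBF-SVM weak learners so the sketch stays self-contained, not the authors' closed-source program.

```python
import numpy as np

def train_adaboost(X, y, T=10):
    """Adaboost per steps (1)-(4); weak learners are threshold stumps standing in
    for the paper's RBF-SVM components. y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                         # step (2): uniform initial weights
    ensemble = []
    for _ in range(T):
        best = None
        for f in range(X.shape[1]):                 # pick the feature/threshold/sign
            for thr in np.unique(X[:, f]):          # with the lowest weighted error (9)
                for s in (1, -1):
                    pred = s * np.where(X[:, f] > thr, 1, -1)
                    eps = D[pred != y].sum()
                    if best is None or eps < best[0]:
                        best = (eps, f, thr, s)
        eps, f, thr, s = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        a = 0.5 * np.log((1 - eps) / eps)           # classifier weight, formula (10)
        pred = s * np.where(X[:, f] > thr, 1, -1)
        D = D * np.exp(np.where(pred == y, -a, a))  # reweight, formula (11)
        D = D / D.sum()                             # normalize, formula (12)
        ensemble.append((a, f, thr, s))
    return ensemble

def predict(ensemble, X):
    """Strong classifier H(x) = sign(sum_t a_t h_t(x)), formula (13)."""
    score = sum(a * s * np.where(X[:, f] > thr, 1, -1) for a, f, thr, s in ensemble)
    return np.sign(score)

# Toy two-class data roughly separated along both features.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
model = train_adaboost(X, y, T=10)
print((predict(model, X) == y).mean())  # training accuracy of the ensemble
```

Swapping the stump search for an RBF-SVM fit on the reweighted sample, with σ set per component as described above, recovers the Adaboost-SVM of this section.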
4. Experiment and Analysis
To verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe
database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities in biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used for data mining.
All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.
4.1. Data Set. To prepare for machine learning, we construct a training set and a test set for each of the two target proteins, both containing decoys and known active compounds. For the training set, we selected the experimentally determined complex structures of the two proteins from the PDB as positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules; its data are submitted by biologists and biochemists around the world and obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds from the known active compounds whose crystal structures with their targets had not been determined, and this process was repeated 10 times. For each selected compound, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB as protein-inhibitor complexes with high inhibitor activity and high-resolution crystal structures. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor, resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491, resolution 2.20 Å) was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds whose crystal structures were not experimentally determined. To prepare for docking, we used the Protein Preparation Wizard to add hydrogen atoms to the protein and optimize their positions, and SciTegic's Pipeline Pilot to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Then the additional ring conformations of the compounds were generated by LigPrep. The GLIDE score was then used to select five poses for each compound from the docking results. In this paper, five docking poses of each active compound were chosen by the GLIDE score as positive examples, because in a preliminary test this procedure generated higher enrichment factors than using one pose per active compound. Other settings of GLIDE were set
Table 3: Experimental data structure.

Data set       Positive samples   Negative samples     Total
Training set   100                2000                 2100
Test set       100 × 5 = 500      2000 × 5 = 10000     10500
to the default values. In this paper, we use the averages of the 10% EF and of the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. The docking poses of 2000 decoy compounds were then randomly selected as the negative samples; each decoy compound was docked to the target proteins in the same way as described above.
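The top-five-poses selection described above can be sketched as follows. This is only an illustrative sketch: the compound IDs, scores, and pose labels are hypothetical stand-ins for real GLIDE output, and GLIDE SP scores are taken as lower-is-better (more negative means stronger predicted binding).

```python
from collections import defaultdict

def top_poses(poses, n=5):
    """Keep the n best-scored docking poses per compound.

    `poses` is a list of (compound_id, glide_score, pose_data) tuples;
    lower GLIDE scores are treated as better.
    """
    by_compound = defaultdict(list)
    for compound_id, score, pose in poses:
        by_compound[compound_id].append((score, pose))
    return {cid: [p for _, p in sorted(scored)[:n]]
            for cid, scored in by_compound.items()}

# Toy example: two compounds with several scored poses each.
poses = [("c1", -9.2, "p1"), ("c1", -7.5, "p2"), ("c1", -8.8, "p3"),
         ("c2", -6.1, "p4"), ("c2", -6.4, "p5")]
selected = top_poses(poses, n=2)   # {"c1": ["p1", "p3"], "c2": ["p5", "p4"]}
```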
After the completion of the training set, we set out to build the test set. First, the active compounds of the target proteins (IC50 ≤ 10 μM) were chosen from StARLITe and divided into 100 clusters. The dividing strategy was hierarchical clustering with the Ward method, based on the Euclidean distance between the compounds' 2D structure fingerprints. The compound with the highest inhibitory activity was then selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses of each active compound were used as the positive samples of the test set. The docking procedure and target protein crystal structures are the same as those mentioned above. The decoys for the test set were then chosen with the same negative-sample selection strategy as for the training set.
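The cluster-then-pick-representative step can be sketched with SciPy's hierarchical clustering. The fingerprints and activity values below are random stand-ins for real 2D structure fingerprints and inhibitory activities, and the cluster count is reduced from the paper's 100 to keep the toy example small.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-ins: random binary fingerprints and activity values.
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(30, 64)).astype(float)
activity = rng.random(30)

# Ward hierarchical clustering on Euclidean distances, cut into k clusters
# (the paper uses k = 100; a small k keeps this example readable).
k = 5
labels = fcluster(linkage(fingerprints, method="ward"), t=k, criterion="maxclust")

# Select the most active compound from each cluster as its representative.
representatives = []
for c in sorted(set(labels.tolist())):
    members = np.where(labels == c)[0]
    representatives.append(int(members[np.argmax(activity[members])]))
```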
After the data preparation is complete, we use Pharm-IF to quantify the training set and then treat the data as the input of the machine learning algorithms to obtain the corresponding learning models. Each learning model obtained is then applied to the test set.
From Table 3 it can be seen that the ratio of negative samples to positive samples is as high as 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.
4.2. The Evaluation Index of Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indexes [17]. The enrichment factor (EF) is one of the best-known measures for evaluating screening efficiency. EF is usually used to evaluate the early-recognition properties of a screening method: it indicates the ratio of the number of active compounds obtained by in silico screening to that expected from random selection at a predefined sampling percentage. The calculation of EF is as follows:
EF = (Hits_s / N_s) / (Hits_t / N_t). (14)
Here Hits_s represents the number of active compounds in the sample, Hits_t is the total number of active compounds tested, N_s is the number of compounds sampled, and N_t is the number of all compounds. In actual drug discovery, only a small part of the compounds is filtered by computer; in general, 0.01–1% of the compounds will be selected from a huge compound database (10,000–1,000,000 compounds) in an actual virtual screening process. Since the size of the test set in this experiment is far from reaching this order of magnitude, we use
Computational and Mathematical Methods in Medicine 7
Figure 3: Two ROC curves, X and Y (true positive rate versus false positive rate).
the 10% EF to carry out this test, in order to reduce the deviation of the early evaluation. EF measures screening efficiency only at a specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire sampling range (0–100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in a ROC curve the true positive rate (TPR) is plotted along the y-axis while the false positive rate (FPR) is displayed on the x-axis. Although the ROC curve can directly reflect the effect of a machine learning model, in practical applications we also need a numerical measure, the AUC (area under the ROC curve) value, to evaluate the model. The AUC value indicates the area under the ROC curve and is more intuitive and accurate. The calculation of the AUC value is as follows:
AUC = ∫_0^1 (TP/P) d(FP/N) = (1 / (P · N)) ∫_0^N TP dFP. (15)
Among the variables of formula (15), P stands for the number of positive samples, N represents the number of negative samples, TP (true positive) stands for the active compounds that are classified correctly, and FP (false positive) stands for the decoy compounds that are misclassified as active.
AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and AUC are not affected by an imbalanced class distribution, nor do they require the data to be normally distributed. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
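Both measures can be computed directly from a ranked score list. The following is a minimal sketch with hypothetical score data; in the AUC function, ties are counted as half-correct pairs, the usual rank-sum convention.

```python
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.10):
    """EF = (Hits_s / N_s) / (Hits_t / N_t): actives found in the top
    `fraction` of the ranked list, relative to random selection."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_total = len(scores)
    n_sample = max(1, int(round(fraction * n_total)))
    top = np.argsort(-scores)[:n_sample]        # best-scored compounds first
    hits_s, hits_t = is_active[top].sum(), is_active.sum()
    return (hits_s / n_sample) / (hits_t / n_total)

def roc_auc(scores, is_active):
    """AUC via the rank-sum identity: the fraction of (active, decoy)
    pairs ranked correctly, with ties counted as half."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    pos, neg = scores[is_active], scores[~is_active]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy ranking: 10 compounds, with the 2 actives scored on top.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actives = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_10 = enrichment_factor(scores, actives, fraction=0.10)  # 5.0 here
auc_val = roc_auc(scores, actives)                         # 1.0 (perfect ranking)
```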
4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods
Figure 4: Comparative experiment on SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM).
Figure 5: Comparative experiment on Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM).
(SVM, Adaboost-SVM, and RF) to compare the virtual screening performance on the two target proteins. As a baseline algorithm, RF was also developed on the machine learning libraries of the authors' lab, and the parameters of RF are set as in Table 4.
The ROC curves from the comparative experiments on these two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.
Table 5 shows the 10% EF of the methods together with the calculated AUC values.
Aiming at the problem of sample-set quality, the Adaboost method initially gives the same weight to each training sample; a sample's weight represents the probability of that sample being used as training data by a weak classifier. At each iteration, the Adaboost algorithm modifies the weights of the samples: if a training sample is correctly classified in this
Table 4: Random Forest parameter settings.

Parameter name                                          Parameter value
Tree number                                             1000
Node size                                               5
Number of different descriptors tried at each split     50
Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm       Target protein   10% EF   AUC
SVM             SRC              4.7      0.734
                Cathepsin K      3.9      0.683
Adaboost-SVM    SRC              5.5      0.821
                Cathepsin K      4.8      0.802
Random Forest   SRC              5.3      0.805
                Cathepsin K      4.5      0.783
iteration, the weight of the sample is reduced, that is, the probability of it being used as a training sample in the next iteration is reduced. On the contrary, if the training sample is misclassified by the current base classifier, the weight of the sample is increased, and so is the probability of it being used as a training sample in the next iteration. In this way, the weak classifiers pay more attention to the seriously misclassified training samples. The experimental results above show that Adaboost-SVM notably outperforms RF. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening shows its ability of noise resistance when the number of structure samples is limited, and thus better screening results are obtained.
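As a numerical illustration of this reweighting (a toy example, not data from the paper): with five samples at uniform weight 1/5 and one misclassification, the weighted error is ε = 0.2 and the classifier weight is α = ½ ln(4) ≈ 0.693; after the update and renormalization, the misclassified sample's weight grows from 0.2 to 0.5 while each correctly classified sample's weight shrinks to 0.125.

```python
import math

# One AdaBoost round on a toy sample of 5 points, one of them misclassified.
weights = [0.2] * 5                  # uniform initial weights, 1/n
correct = [True, True, True, True, False]

eps = sum(w for w, c in zip(weights, correct) if not c)   # weighted error = 0.2
alpha = 0.5 * math.log((1 - eps) / eps)                   # classifier weight

# Misclassified samples are boosted by e^alpha, correct ones shrunk by e^-alpha.
updated = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
z = sum(updated)                                          # normalizer Z_t
updated = [w / z for w in updated]                        # sums to 1 again
```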
5. Conclusion
In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. The method uses Pharm-IF to encode protein-ligand interactions in a binary form and then uses the improved SVM algorithm Adaboost-SVM, as well as Random Forest, to classify the data. The idea of ensemble learning for data classification is to obtain a number of weak classifiers that are independent of each other and then combine these independent weak classifiers by an effective method. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF of the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed that, compared with the naïve SVM, Random Forest obtains better performance on both 10% EF and AUC: the 10% EF is improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC is improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning is able to obtain better results on the imbalanced data set with satisfying robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in future work. Compared with the traditional method, the proposed method significantly improves the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of the virtual screening false positive rate is still to be solved. The speed of laboratory determination of protein structures has been unable to catch up with the needs of drug development; therefore, virtual screening will often encounter problems similar to those in this paper, and there is still much room for improvement in the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier; for data sets of this kind, a special kernel function should be set up to adapt to the characteristics of the data.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).
References
[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.
[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.
[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.
[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.
[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.
[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.
[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.
[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.
[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.
[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.
[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.
[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.
[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.
[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.
[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.
[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.
2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. These two proteins are hotspot drug targets in the pharmaceutical field, and neither of them has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses will influence the virtual screening efficiency.
Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase protein that in humans is encoded by the SRC gene. The SRC family of kinases is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC exists widely in tissue cells and plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in signal transduction pathways. The c-Src protein is made up of 6 functional regions: the SRC homology 4 (SH4) domain, a unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, there are intramolecular interactions between the phosphorylated Tyr527 (tyrosine 527) and the SH2 domain; at the same time, the SH3 domain binds the proline-rich SH2-kinase linker domain. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these links break and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and this protein is closely associated with a variety of cancers: activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. As SRC is a drug target, a number of tyrosine kinase inhibitors targeting the c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].
Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2, the transcript is 1.7 kb long, and it consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. In the process of bone resorption, acid dissolves the hydroxyapatite, and the organic ingredients of the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acid environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. Since the function of Cathepsin K in osteoclasts has been recognized, many labs treat it as a drug target and study its inhibitors for the treatment of osteoporosis. At present, the first choice of antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, the long-term use of bisphosphonates may produce adverse reactions: esophageal irritation symptoms, hypocalcaemia, kidney irritation, and so on. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation at the same time, so new replacement therapy drugs are more meaningful [16].
3. Classification Algorithm Based on the Adaboost-SVM
Currently, SVM can deal with many problems, such as small sample sizes, nonlinearity, or high dimensionality. Based on statistical learning theory, it has a simple mathematical form, fast training methods, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. Under the condition that the quality of the sample set is not very low, a good result can be obtained even without any improvements. The learning mechanism of SVM also provides a lot of room for improving the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which provides only global minima and hence avoids being trapped in local minima; therefore, in this paper we use SVM as the base classifier. There is already a large literature on SVM, so this paper only gives a brief introduction. The basic process of SVM classification is as follows.
For a given sample set L = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, with x_i ∈ R^d and y_i ∈ {1, −1} for i = 1, 2, ..., n, y_i stands for the category of sample x_i, d is the dimension of a sample, and n is the number of training samples. If the input vector set is linearly separable, then it can be separated by a hyperplane, which can be expressed as
w · x + b = 0. (3)
Here w is the normal vector of the hyperplane and b is the offset. The SVM learning problem is to minimize the objective function

min φ(w) = (1/2)‖w‖² + C (Σ_{i=1}^{n} ξ_i) (4)
subject to the condition

y_i (w · x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., n. (5)
Here (1/2)‖w‖² is the structure complexity term, C (Σ_{i=1}^{n} ξ_i) stands for the empirical risk, ξ_i is the slack variable, and C is a constant, the penalty factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vectors into a high-dimensional feature space H and construct an optimal hyperplane in that space.
Through the mapping φ, a vector x in the space R^n is mapped into H:

x → φ(x) = (φ_1(x), φ_2(x), ..., φ_i(x))^T. (6)
Eventually, we can determine the optimal classification function:

f(x) = sgn(w · φ(x) + b) = sgn(Σ_{i=1}^{n} a_i y_i φ(x_i) · φ(x) + b). (7)
In our work, the Radial Basis Function (RBF) is taken as the kernel function of the SVM, and its mathematical description is given below:

K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)). (8)
Relative to other classifiers, the SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set will destabilize the classifier, and the instability of the classifier will directly affect the final classification result. In this paper, we introduce the Adaboost mechanism from ensemble learning to divide one classification process into several layers of SVM-based weak classifiers.
The key to combining Adaboost and SVM is finding a suitable Gaussian width σ for each component classifier. If the σ value is relatively large, the component classifiers are too weak and the final classification performance decreases. On the other hand, if the σ value is relatively small, the component classifiers become strong but their errors are highly correlated; the diversity is small, so the ensemble learning is ineffective. Even more importantly, a σ value that is too small will lead to overfitting and greatly reduced generalization. Therefore, in this paper, the standard deviation of the sample set of each component classifier is used as the σ value of that component classifier to control its classification accuracy; thus an SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so we need to explain some key parameters; we list the values of σ, C, and other parameters in Table 1.
The specific process of the algorithm is as follows.

(1) RBFSVM denotes the SVM with the RBF kernel; T denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weight of each sample, w_1(i) = 1/n, i = 1, 2, ..., n.
(3) For t = 1, 2, ..., T: for each h(x_i), calculate the weighted error

ε_t = Σ_{j=1}^{n} D_t(j) |h(x_j) − y(j)|. (9)
Choose the feature with the lowest weighted error rate ε_j and save its corresponding SVM model. Calculate the selected weak classifier's weight:

a_t = (1/2) ln((1 − ε_t) / ε_t). (10)
Update the sample weights according to a_t:

D_{t+1}(i) = (D_t(i) / Z_t) · e^{−a_t}, if h_t(x_i) = y_i,
D_{t+1}(i) = (D_t(i) / Z_t) · e^{a_t},  if h_t(x_i) ≠ y_i. (11)
Table 1: SVM parameter settings.

Parameter name                                        Parameter value
SVM type                                              C-SVM
Class number                                          2
Kernel function                                       RBF
Degree in kernel function                             3
σ in kernel function                                  0.001
Cost factor                                           5
Cache size                                            500 MB
Tolerance in the termination criteria                 0.001
Weight of the penalty factor for all sample classes   1
Cross validation                                      5
Table 2: Adaboost-SVM parameter settings.

Parameter name                    Parameter value
Ensemble learning type            Adaboost
Class number                      2
Basic classifier type             C-SVM
Number of classifiers per layer   100
Max false alarm rate              0.5
Min hit rate                      0.9
Number of iterations              5
Weight trim rate                  0.9
Cache size                        500 MB
The weights are then normalized so that

Σ_{i=1}^{n} D_{t+1}(i) = 1. (12)
(4) Apply the strong classifier H, formed by combining the SVM weak classifiers, to the training set:

H(x) = sign(Σ_{t=1}^{T} a_t h_t(x)). (13)
For the Adaboost-SVM, we set the parameters as in Table 2; the basic SVM parameters are the same as in Table 1.
Compared with a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent from the others and that the probability of sample misclassification be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, and this also reduces the impact of sample-set quality on the virtual screening effect.
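Since the program used in the paper is not open source, the following is only a sketch of steps (1)-(4): discrete Adaboost with RBF-SVM weak learners, written here with scikit-learn's SVC. The dataset and the σ, C, and T values are illustrative, not the paper's exact settings.

```python
import numpy as np
from sklearn.svm import SVC

def adaboost_svm(X, y, T=5, sigma=1.0, C=5.0):
    """Discrete AdaBoost with RBF-SVM base classifiers, cf. steps (1)-(4).
    y must be in {-1, +1}. Returns (classifiers, alphas)."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # step (2): uniform weights
    classifiers, alphas = [], []
    for _ in range(T):                            # step (3)
        clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C)
        clf.fit(X, y, sample_weight=D)
        pred = clf.predict(X)
        eps = D[pred != y].sum()                  # weighted error, eq. (9)
        eps = min(max(eps, 1e-10), 1 - 1e-10)     # guard against 0/1 error
        alpha = 0.5 * np.log((1 - eps) / eps)     # eq. (10)
        D *= np.exp(-alpha * y * pred)            # eq. (11): boost mistakes
        D /= D.sum()                              # eq. (12): renormalize
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Strong classifier H(x) = sign(sum_t a_t h_t(x)), eq. (13)."""
    votes = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
    return np.sign(votes)

# Toy separable data standing in for the fingerprint vectors.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
clfs, alphas = adaboost_svm(X, y, T=3)
```

In the paper's variant, σ is additionally set per round from the standard deviation of the reweighted sample set; the fixed-σ loop above omits that refinement for brevity.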
4. Experiment and Analysis
In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe
database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.
All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.
False positive rate1009080706050403020100
True
pos
itive
rate
10
09
08
07
06
05
04
03
02
01
00
Adaboost-SVMRandom ForestSVM
Figure 4 Comparative experiment of SRC
True
pos
itive
rate
False positive rate1009080706050403020100
Adaboost-SVMRandom ForestSVM
10
09
08
07
06
05
04
03
02
01
00
Figure 5 Comparative experiment of Cathepsin K
(SVM Adaboost-SVM and RF) to compare the two kinds oftarget proteins for virtual screening experiment As the base-line algorithm RF is also developed based on the machinelearning libs of the authorsrsquo lab and the parameters of RF areset in Table 4
The ROC curves from the comparative experimentson these two kinds of data sets using ensemble learning(compared with SVM) are as those in Figures 4 and 5
Table 5 shows the 10 EF of the two methods andcalculates the AUC value
Aiming at the problemof sample set quality theAdaboostmethod gives the same weight value to each training data thesample weight represents the probability of the data treatedas the training set by a weak classifier At each iterationthe Adaboost algorithm will modify the weight value of thesample if the training sample is correctly classified in this
8 Computational and Mathematical Methods in Medicine
Table 4 Random Forest parameter setting
Parameter name Parameter valuesTree number 1000Node size 5The number of different descriptors tried ateach split 50
Table 5 Experimental comparison of SVM Adaboost-SVM andRandom Forest
Algorithm Target protein 10 EF AUC
SVM SRC 47 0734Cathepsin K 39 0683
Adaboost-SVM SRC 55 0821Cathepsin K 48 0802
Random Forest SRC 53 0805Cathepsin K 45 0783
iteration the weight value of the sample will be reducedthat is the probability of being treated as the training sampleis reduced in the next iteration On the contrary if thetraining sample is misclassified in the current base classifiertheweight value of the samplewill be increased and the prob-ability of being treated as the training samplewill be increasedin the next iteration In this way the weak classifier will paymore attention to the serious misclassification of training setThe experimental results above show that Adaboost-SVMhas notably outperformed RFThis observation indicates thaton both 10 EF and AUC ensemble learning based virtualscreening has shown its ability of noise resistance under thesituation that the amount of structure samples is limited thusthe better screening results are obtained
5 Conclusion
In this paper we use ensemble learning method to solve theproblem caused by the quality of the training setThismethodmainly uses Pharm-IF to encode protein-ligand interactionsas a binary form and then uses the improved SVM algorithmAdaboost-SVM and Random Forest to classify the data Theidea of ensemble learning in dealing with data classificationproblem is to get a number of weak classifiers which areindependent of each other and then use an effective methodto combine these independent weak classifiers By comparingthe experimental results after the Adaboost-SVM is used asthe classifier 10 EF for the SRCmodel increased from 47 to55 and the AUC value increased from 0734 to 0821 10 EFof Cathepsin Kmodel increased from 39 to 48 and the AUCvalue increased from 0683 to 0802 It can be observed fromthe results that comparing with the naıve SVM RandomForest has obtained better performance on both 10 EFand AUC 10 EF is improved to 53 and 45 on SRC andCathepsin K respectively and AUC is improved to 0805 and0783 respectively As a classic ensemble learning algorithmRandom Forest has shown that ensemble learning is ableto get better results on the imbalanced data set with satisfying
robustness Nevertheless the performance of Random Forestis lower than Adaboost-SVM and we will continue investi-gating the reasons in our future work Compared with thetraditional method the proposed method is more significantfor the improvement of the accuracy of the virtual screeningmodel In the future work the problem of improving theaccuracy of virtual screening should be further studiedfrom two aspects virtual screening theory and computertheory Although the status of virtual screening is graduallyincreasing the problem of virtual screening false positive rateis still to be solved The speed of laboratory determination ofprotein structure has been unable to catch up with the needsof drug development therefore in the virtual screening itwill often encounter problems similar to this paper so thereare still many improvements in the algorithm For examplethe selection of kernel function will directly affect the perfor-mance of the classifier In view of the problem of this kind ofdata set we should set up a special kernel function to adapt tothe characteristics of the data set
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This paper is partially supported by Natural Science Founda-tion of Heilongjiang Province (QC2013C060) Science Fundsfor the Young Innovative Talents of HUST (no 201304)China Postdoctoral Science Foundation (2011M500682) andPostdoctoral Science Foundation of Heilongjiang Province(LBH-Z11106)
References
[1] K Tanaka K F Aoki-Kinoshita M Kotera et al ldquoWURCS theWeb3 unique representation of carbohydrate structuresrdquo Jour-nal of Chemical Information and Modeling vol 54 no 6 pp1558ndash1566 2014
[2] S Seuss and A R Boccaccini ldquoElectrophoretic depositionof biological macromolecules drugs and cellsrdquo Biomacro-molecules vol 14 no 10 pp 3355ndash3369 2013
[3] C Y Liew X H Ma X Liu and C W Yap ldquoSVM model forvirtual screening of Lck inhibitorsrdquo Journal of Chemical Infor-mation and Modeling vol 49 no 4 pp 877ndash885 2009
[4] A Kurczyk D Warszycki R Musiol R Kafel A J Bojarskiand J Polanski ldquoLigand-based virtual screening in a search fornovel anti-HIV-1 chemotypesrdquo Journal of Chemical Informationand Modeling vol 55 no 10 pp 2168ndash2177 2015
[5] A E Bilsland A Pugliese Y Liu et al ldquoIdentification of aselective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networksrdquoNeopla-sia vol 17 no 9 pp 704ndash715 2015
[6] JWallentinM Osterhoff R NWilke et al ldquoHard X-ray detec-tion using a single 100 nm diameter nanowirerdquo Nano Lettersvol 14 no 12 pp 7071ndash7076 2014
[7] L Song D Li X Zeng YWu L Guo andQ Zou ldquonDNA-protidentification of DNA-binding proteins based on unbalancedclassificationrdquo BMC Bioinformatics vol 15 no 1 article 298 10pages 2014
Computational and Mathematical Methods in Medicine 9
[8] Q Zou J S Guo Y Ju M H Wu X X Zeng and Z L HongldquoImproving tRNAscan-SE annotation results via ensemble clas-sifiersrdquoMolecular Informatics vol 34 pp 761ndash770 2015
[9] C Lin Y Zou J Qin et al ldquoHierarchical classification of proteinfolds using a novel ensemble classifierrdquo PLoS ONE vol 8 no 2Article ID e56499 2013
[10] Q Zou Z Wang X Guan et al ldquoAn approach for identify-ing cytokines based on a novel ensemble classifierrdquo BioMedResearch International vol 8 no 8 pp 616ndash617 2013
[11] D Takaya T Sato H Yuki et al ldquoPrediction of ligand-inducedstructural polymorphism of receptor interaction sites usingmachine learningrdquo Journal of Chemical Information and Mod-eling vol 53 no 3 pp 704ndash716 2013
[12] V Ramatenki S R Potlapally R K Dumpati R VadijaU Vuruputuri and V Ramatenki ldquoHomology modeling andvirtual screening of ubiquitin conjugation enzyme E2A fordesigning a novel selective antagonist against cancerrdquo Journal ofReceptor and Signal Transduction Research vol 35 no 6 pp536ndash549 2015
[13] R SawadaM Kotera and Y Yamanishi ldquoBenchmarking a widerange of chemical descriptors for drug-target interaction pre-diction using a chemogenomic approachrdquoMolecular Informat-ics vol 33 no 11-12 pp 719ndash731 2014
[14] T Sato T Honma and S Yokoyama ldquoCombining machinelearning and pharmacophore-based interaction fingerprint forin silico screeningrdquo Journal of Chemical Information and Mod-eling vol 50 no 1 pp 170ndash185 2010
[15] M-Y Song C Hong S H Bae I So and K-S Park ldquoDynamicmodulation of the Kv21 channel by Src-dependent tyrosinephosphorylationrdquo Journal of ProteomeResearch vol 11 no 2 pp1018ndash1026 2012
[16] F S Nallaseth F Lecaille Z Li and D Bromme ldquoThe role ofbasic amino acid surface clusters on the collagenase activity ofcathepsin Krdquo Biochemistry vol 52 no 44 pp 7742ndash7752 2013
[17] A Anighoro and G Rastelli ldquoEnrichment factor analyses on G-Protein coupled receptors with known crystal structurerdquo Jour-nal of Chemical Information and Modeling vol 53 no 4 pp739ndash743 2013
Submit your manuscripts athttpwwwhindawicom
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MEDIATORSINFLAMMATION
of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Behavioural Neurology
EndocrinologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Disease Markers
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
OncologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Oxidative Medicine and Cellular Longevity
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
PPAR Research
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
ObesityJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational and Mathematical Methods in Medicine
OphthalmologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Diabetes ResearchJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Research and TreatmentAIDS
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Gastroenterology Research and Practice
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Parkinsonrsquos Disease
Evidence-Based Complementary and Alternative Medicine
Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom
Computational and Mathematical Methods in Medicine 5
In our work, the Radial Basis Function (RBF) is taken as the kernel function of the SVM, and the mathematical description of this kernel is given below:
K(x_i, x_j) = e^{−‖x_i − x_j‖^2 / (2σ^2)}.  (8)
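As a concrete reading of Eq. (8), the kernel maps directly to a few lines of code. This is an illustrative sketch; the function name and signature are ours, not from the paper:

```python
# RBF (Gaussian) kernel of Eq. (8): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2))
import math

def rbf_kernel(x_i, x_j, sigma=1.0):
    """Similarity in (0, 1]; equals 1.0 when the two vectors coincide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

A smaller σ makes the kernel sharper, so similarity decays faster with distance; this is exactly the width tradeoff discussed next.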
Relative to other classifiers, the SVM is more stable and less affected by the quality of the sample set. However, class imbalance in the data set and high data complexity will destabilize the classifier, and this instability directly affects the final classification result. In this paper, we introduce the Adaboost mechanism of ensemble learning to divide one classification process into several layers of weak classifiers based on SVM.
The key to combining Adaboost and SVM is finding a suitable Gauss width σ for each component classifier. If the σ value is relatively large, the component classifier is too weak and the final classification performance decreases. On the other hand, if the σ value is relatively small, the component classifiers become strong but their errors are highly correlated; the differences between them are small, so the ensemble learning is invalid. Even more importantly, a σ value that is too small leads to overfitting, resulting in greatly reduced generalization. Therefore, in this paper, the standard deviation of the sample set of each component classifier is used as the σ value of that component classifier, to control its classification accuracy; thus, the SVM-based Adaboost classifier is obtained. Since the program used in this paper is not open source, we explain the key parameters: the values of σ, C, and other parameters are listed in Table 1.
The specific process of the algorithm is as follows:
(1) RBFSVM denotes the SVM with the RBF kernel; T denotes the number of iterations required in the Adaboost process.
(2) Initialization: initialize the weight of each sample, w_1(i) = 1/n, i = 1, 2, …, n.
(3) For t = 1, 2, …, T: for each candidate classifier h_i, calculate the weighted error

ε_t(i) = Σ_{j=1}^{n} D_t(j) |h_i(x_j) − y(j)|.  (9)
Choose the feature with the lowest weighted error rate ε_j and save its corresponding SVM model. Calculate the selected weak classifier's weight:

a_t = (1/2) ln((1 − ε_t) / ε_t).  (10)
Update the sample weights according to a_t:

D_{t+1}(i) = (D_t(i)/Z_t) · e^{−a_t} if h_t(x_i) = y_i, and D_{t+1}(i) = (D_t(i)/Z_t) · e^{a_t} if h_t(x_i) ≠ y_i.  (11)
Table 1: SVM parameter settings.

Parameter name                                      | Parameter value
SVM type                                            | C-SVM
Class number                                        | 2
Kernel function                                     | RBF
Degree in kernel function                           | 3
σ in kernel function                                | 0.001
Cost factor C                                       | 5
Cache size                                          | 500 MB
Tolerance in the termination criteria               | 0.001
Weight of the penalty factor for all sample classes | 1
Cross validation                                    | 5
Table 2: Adaboost-SVM parameter settings.

Parameter name                  | Parameter value
Ensemble learning type          | Adaboost
Class number                    | 2
Basic classifier type           | C-SVM
Number of classifiers per layer | 100
Max false alarm rate            | 0.5
Min hit rate                    | 0.9
Number of iterations            | 5
Weight trim rate                | 0.9
Cache size                      | 500 MB
The weights are normalized so that

Σ_{i=1}^{n} D_{t+1}(i) = 1.  (12)
(4) Apply the strong classifier H, integrated from the SVM weak classifiers, to the training set:

H(x) = sign[Σ_{t=1}^{T} a_t h_t(x)].  (13)
For the Adaboost-SVM, we set the parameters as in Table 2; the basic SVM parameters are the same as in Table 1.
Compared to a single machine learning algorithm, the ensemble learning method requires each base classifier to be independent of the others, and the probability of sample misclassification for each base classifier should be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, which also reduces the impact of sample-set quality on the virtual screening effect.
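The training loop of steps (1)-(4) can be sketched in Python. This is a minimal sketch, assuming scikit-learn is available; a fixed RBF `gamma` stands in for the paper's per-component σ rule (the authors' program is not open source), and all names are illustrative:

```python
# Minimal Adaboost-SVM training loop following steps (1)-(4) / Eqs. (9)-(13).
# Assumptions: scikit-learn is available; labels are in {-1, +1}.
import numpy as np
from sklearn.svm import SVC

def adaboost_svm(X, y, T=5, gamma=0.5, C=5.0):
    n = len(y)
    D = np.full(n, 1.0 / n)                    # step (2): w_1(i) = 1/n
    classifiers, alphas = [], []
    for _ in range(T):                         # step (3)
        clf = SVC(kernel="rbf", gamma=gamma, C=C)
        clf.fit(X, y, sample_weight=D)
        pred = clf.predict(X)
        eps = float(np.sum(D * (pred != y)))   # weighted error, Eq. (9)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard the log against 0 and 1
        a_t = 0.5 * np.log((1 - eps) / eps)    # classifier weight, Eq. (10)
        D = D * np.exp(-a_t * y * pred)        # Eq. (11): shrink correct, grow wrong
        D = D / D.sum()                        # Eq. (12): renormalize
        classifiers.append(clf)
        alphas.append(a_t)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    # Strong classifier, Eq. (13): sign of the alpha-weighted vote.
    votes = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
    return np.sign(votes)
```

Because each round retrains the SVM on reweighted samples, later rounds concentrate on the poses the earlier rounds misclassified, which is the noise-resistance mechanism discussed above.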
4 Experiment and Analysis
In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe
database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.
All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.
4.1 Data Set. In order to cooperate with machine learning, we construct training sets and test sets for these two target proteins, which include decoys and known active compounds. For the training set, we selected the experimentally determined complex structures of the two proteins from the PDB as positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world, obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly and respectively selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB as protein-inhibitor complexes with high inhibitory activity and high-resolution crystal structures. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor, resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491, resolution 2.20 Å) was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. In preparation for docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions, and Pipeline Pilot of SciTegic to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. The additional ring conformations of the compounds were then generated by LigPrep. The GLIDE score was then used to select five poses for each compound in the docking results. In this paper, five docking poses of each active compound were chosen by the GLIDE score as positive examples, because in a preliminary test this procedure generated higher enrichment factors than using one pose per active compound. Other settings of GLIDE were set
Table 3: Experimental data structure.

Data set     | Positive samples | Negative samples  | Total
Training set | 100              | 2000              | 2100
Test set     | 100 × 5 = 500    | 2000 × 5 = 10,000 | 10,500
to the default values. In this paper, we use the averages of the 10% EF and the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models using each number of docking poses. The docking poses of 2000 decoy compounds were then randomly selected as negative samples; each decoy compound was docked to the target proteins in the same way as mentioned above.
After the completion of the training set, we set out to build the test set. First, the active compounds of the target proteins (IC50 ≤ 10 μM) were chosen from StARLITe and divided into 100 clusters. The dividing strategy is hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses for each active compound were used as positive samples of the test set. The docking procedure and target protein crystal structures are the same as those mentioned above. Then the selection strategy of negative samples for the training set was used to choose the decoys for the test set.
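The clustering step above can be sketched with SciPy's hierarchical clustering. The random binary vectors below are stand-ins for real 2D structure fingerprints, and the cluster count matches the paper's 100:

```python
# Ward-linkage hierarchical clustering on Euclidean distances between
# fingerprint vectors, cut into (at most) 100 clusters.
# The fingerprints are random placeholders, not real compound data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(300, 64)).astype(float)

Z = linkage(fingerprints, method="ward")           # Ward method, Euclidean distance
labels = fcluster(Z, t=100, criterion="maxclust")  # cut the tree into 100 clusters
# in the paper, the most active compound of each cluster is then kept
```

Taking one representative per cluster keeps the test set chemically diverse, which is the point of clustering before selection.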
After the completion of the data preparation, we use Pharm-IF to quantify the training set. The data are then treated as the input of the machine learning algorithms to obtain the corresponding learning models, and each learning model obtained is applied to the test set.
From Table 3, it can be seen that the ratio of negative samples to positive samples is up to 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.
4.2 The Evaluation Index of Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning have their own evaluation indices [17]. The enrichment factor (EF) is one of the most famous measures for evaluating screening efficiency. EF is usually used to evaluate the early recognition properties of a screening method, and it indicates the ratio of the number of active compounds obtained by in silico screening against that obtained by random selection at a predefined sampling percentage. The calculation method of EF is as follows:
EF = (Hits_s / N_s) / (Hits_t / N_t).  (14)
Here, Hits_s represents the number of active compounds in the sample, Hits_t is the total number of active compounds tested, N_s is the number of compounds sampled, and N_t is the number of all compounds. In actual drug discovery, only a small part of the compounds is filtered by computer; in general, 0.01%-1% of the compounds will be selected from a huge compound database (10,000-1,000,000 compounds) in the actual virtual screening process. Since the number of test compounds in this experiment is far from reaching this order of magnitude, we use
Figure 3: Two ROC curves, X and Y (true positive rate versus false positive rate).
the 10% EF to carry out this test, in order to reduce the deviation of the early evaluation. EF reflects the screening efficiency only at a specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire sampling range (0%-100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in the ROC curve the true positive rate (TPR) is plotted along the y-axis, while the false positive rate (FPR) is displayed on the x-axis. Although the ROC curve can directly reflect the effect of a machine learning model, we also need a numerical measure, the AUC (area under the ROC curve) value, to evaluate the effect of the model in practical application. The AUC value indicates the area under the ROC curve, and it is more intuitive and accurate. The calculation method of the AUC value is as follows:
AUC = ∫_0^1 (TP/P) d(FP/N) = (1/(P · N)) ∫_0^N TP dFP.  (15)
Among the variables of formula (15), P stands for the number of positive samples, N represents the number of negative samples, TP (true positives) stands for the active compounds that are classified correctly, and FP (false positives) stands for the decoy compounds that are misclassified as active.
AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and AUC are not affected by an imbalanced class distribution or by the distribution of the data. In addition, the AUC value allows a middle state, and experimental results can be divided into multiple ordered classes.
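Both metrics can be sketched in plain Python. The names are illustrative; the AUC function uses the rank-pair formulation, which gives the same value as the integral in Eq. (15):

```python
# Eq. (14): enrichment factor at a given sampling fraction of a ranked library.
def enrichment_factor(scores, is_active, fraction=0.10):
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    n_s = max(1, int(round(fraction * len(ranked))))    # N_s: compounds sampled
    hits_s = sum(active for _, active in ranked[:n_s])  # Hits_s: actives in the sample
    hits_t = sum(is_active)                             # Hits_t: actives in the library
    return (hits_s / n_s) / (hits_t / len(ranked))      # (Hits_s/N_s) / (Hits_t/N_t)

# Eq. (15): AUC as the probability that a random active compound is
# scored above a random decoy (ties count one half).
def auc(active_scores, decoy_scores):
    wins = sum((a > d) + 0.5 * (a == d)
               for a in active_scores for d in decoy_scores)
    return wins / (len(active_scores) * len(decoy_scores))
```

For example, if both actives of a 10-compound library land in the top 20% of the ranking, the EF at that fraction is (2/2)/(2/10) = 5.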
4.3 The Analysis of Experimental Results. In the experiment, we used three different classification methods
Figure 4: Comparative experiment on SRC (ROC curves of SVM, Random Forest, and Adaboost-SVM).
Figure 5: Comparative experiment on Cathepsin K (ROC curves of SVM, Random Forest, and Adaboost-SVM).
(SVM, Adaboost-SVM, and RF) to compare the two target proteins in the virtual screening experiment. As the baseline algorithm, RF is also developed based on the machine learning libraries of the authors' lab, and the parameters of RF are set in Table 4.
The ROC curves from the comparative experiments on these two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.
Table 5 shows the 10% EF of the methods together with the calculated AUC values.
Aiming at the problem of sample-set quality, the Adaboost method initially gives the same weight value to each training sample; the sample weight represents the probability of the sample being used for training by a weak classifier. At each iteration, the Adaboost algorithm modifies the weight value of each sample: if a training sample is correctly classified in this
Table 4: Random Forest parameter settings.

Parameter name                                      | Parameter value
Tree number                                         | 1000
Node size                                           | 5
Number of different descriptors tried at each split | 50
Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm     | Target protein | 10% EF | AUC
SVM           | SRC            | 4.7    | 0.734
SVM           | Cathepsin K    | 3.9    | 0.683
Adaboost-SVM  | SRC            | 5.5    | 0.821
Adaboost-SVM  | Cathepsin K    | 4.8    | 0.802
Random Forest | SRC            | 5.3    | 0.805
Random Forest | Cathepsin K    | 4.5    | 0.783
iteration, its weight value is reduced; that is, the probability of it being used as a training sample is reduced in the next iteration. On the contrary, if a training sample is misclassified by the current base classifier, its weight value is increased, and the probability of it being used as a training sample increases in the next iteration. In this way, the weak classifiers pay more attention to the seriously misclassified training samples. The experimental results above show that Adaboost-SVM notably outperforms RF. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening shows noise resistance in the situation where the number of structure samples is limited, and thus better screening results are obtained.
5 Conclusion
In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method mainly uses Pharm-IF to encode protein-ligand interactions in a binary form and then uses the improved SVM algorithm Adaboost-SVM and Random Forest to classify the data. The idea of ensemble learning in data classification is to obtain a number of weak classifiers that are independent of each other and then use an effective method to combine them. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF for the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed from the results that, compared with the naïve SVM, Random Forest obtained better performance on both 10% EF and AUC: the 10% EF is improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC is improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest has shown that ensemble learning is able to get better results on the imbalanced data set with satisfying
robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in future work. Compared with the traditional method, the proposed method brings a more significant improvement in the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of its false positive rate is still to be solved. The speed of laboratory determination of protein structures has been unable to catch up with the needs of drug development; therefore, virtual screening will often encounter problems similar to those in this paper, so there are still many possible improvements to the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier; in view of data sets of this kind, a special kernel function should be designed to adapt to their characteristics.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).
References
[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558-1566, 2014.
[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355-3369, 2013.
[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877-885, 2009.
[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168-2177, 2015.
[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704-715, 2015.
[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071-7076, 2014.
[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 2014.
[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761-770, 2015.
[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616-617, 2013.
[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704-716, 2013.
[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, and U. Vuruputuri, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536-549, 2015.
[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719-731, 2014.
[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170-185, 2010.
[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018-1026, 2012.
[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742-7752, 2013.
[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739-743, 2013.
Submit your manuscripts athttpwwwhindawicom
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MEDIATORSINFLAMMATION
of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Behavioural Neurology
EndocrinologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Disease Markers
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
OncologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Oxidative Medicine and Cellular Longevity
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
PPAR Research
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
ObesityJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational and Mathematical Methods in Medicine
OphthalmologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Diabetes ResearchJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Research and TreatmentAIDS
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Gastroenterology Research and Practice
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Parkinsonrsquos Disease
Evidence-Based Complementary and Alternative Medicine
Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom
Computational and Mathematical Methods in Medicine
database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.
All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.
4.1. Data Set. In order to cooperate with machine learning, in this paper we construct training sets and test sets for these two target proteins, which include the decoys and known active compounds. For the training set, we selected the experimentally determined complex structures of these two kinds of proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world, obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly and respectively selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of these two proteins used in the docking experiments were selected from the PDB as protein-inhibitor complexes with high inhibitor activity and high-resolution crystal structures. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor, resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491, resolution 2.20 Å) was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. For the preparation of the docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions, and Pipeline Pilot of SciTegic to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Then the additional ring
conformations of the compounds were generated by LigPrep, and the GLIDE score was used to select five poses for each compound in the docking results. In this paper, five docking poses of each active compound were chosen as the positive examples by the GLIDE score, because this procedure generated higher enrichment factors in a preliminary test than using one pose of each active compound. Other settings of GLIDE were set
Table 3: Experimental data structure.

Data set        Positive sample     Negative sample       Total
Training set    100                 2000                  2100
Test set        100 × 5 = 500       2000 × 5 = 10000      10500
to the default values. In this paper, we use the averages of the 10% EF and of the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models using each number of docking poses. Then the docking poses of 2000 decoy compounds were randomly selected as the negative samples; each decoy compound was docked to the target proteins in the same way as mentioned above.
After the completion of the training set, we set out to build the test set. First, choose the active compounds of the target proteins (IC50 ≤ 10 μM) from StARLITe and divide them into 100 clusters. The dividing strategy is hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses for each active compound were used as positive samples of the test set. The docking procedure and the target protein crystal structures are the same as those mentioned above. Then the negative-sample selection strategy of the training set was used to choose the decoys for the test set.
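The clustering step above can be sketched as follows. This is a naive O(n³) pure-Python Ward implementation via the Lance-Williams recurrence (a real pipeline would use a library routine such as scipy.cluster.hierarchy.linkage with method='ward'); the fingerprint vectors and activity values in the usage example are invented for illustration.

```python
from itertools import combinations

def ward_clustering(points, n_clusters):
    """Naive O(n^3) agglomerative clustering with Ward's criterion,
    using the Lance-Williams update on squared Euclidean distances."""
    n = len(points)
    sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    dist = {(i, j): sq(points[i], points[j]) for i, j in combinations(range(n), 2)}
    members = {i: [i] for i in range(n)}          # cluster id -> point indices
    while len(members) > n_clusters:
        i, j = min(dist, key=dist.get)            # closest pair of clusters
        ni, nj = len(members[i]), len(members[j])
        merged = members.pop(i) + members.pop(j)
        for k in list(members):                   # Ward distance from k to i U j
            nk = len(members[k])
            dki = dist.pop(tuple(sorted((k, i))))
            dkj = dist.pop(tuple(sorted((k, j))))
            dist[tuple(sorted((k, i)))] = (
                (ni + nk) * dki + (nj + nk) * dkj - nk * dist[(i, j)]
            ) / (ni + nj + nk)
        del dist[(i, j)]
        members[i] = merged                       # merged cluster keeps id i
    return list(members.values())

def pick_representatives(fingerprints, activities, n_clusters):
    """From each cluster, select the index of the most active compound."""
    return [max(c, key=lambda idx: activities[idx])
            for c in ward_clustering(fingerprints, n_clusters)]
```

For example, with three well-separated pairs of 2-D points and activities 1..6, `pick_representatives` returns the higher-activity member of each pair.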
After the completion of the data preparation, we use Pharm-IF to quantify the training set. The resulting data are then treated as the input of the machine learning algorithms to obtain the corresponding learning models, and each learning model obtained is applied to the test set, respectively.
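For intuition, an interaction-fingerprint quantification of this kind can be sketched as a fixed-length bit vector indexed by (feature pair, binned distance). Note that this is only a hypothetical illustration, not the actual Pharm-IF encoding of [14]: the feature vocabulary and distance-bin edges below are invented.

```python
# Invented vocabulary and bin edges -- illustration only, not Pharm-IF itself.
FEATURES = ["donor", "acceptor", "aromatic", "hydrophobic"]
BINS = [0.0, 2.0, 4.0, 6.0, 8.0]      # distance bin edges in angstroms

def bit_index(f1, f2, dist):
    """Map an unordered feature pair plus a binned distance to one bit."""
    i, j = sorted((FEATURES.index(f1), FEATURES.index(f2)))
    pair = i * len(FEATURES) + j       # coarse pair index (sparse, fine for a sketch)
    b = sum(1 for edge in BINS if dist >= edge) - 1
    return pair * len(BINS) + b

def encode(interactions, size=len(FEATURES) ** 2 * len(BINS)):
    """interactions: iterable of (feature1, feature2, distance) records.
    Returns a binary fingerprint suitable as classifier input."""
    fp = [0] * size
    for f1, f2, dist in interactions:
        fp[bit_index(f1, f2, dist)] = 1
    return fp
```

The pair index is order-insensitive, so ("donor", "acceptor") and ("acceptor", "donor") set the same bit.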
From Table 3 it can be seen that the proportion of negative samples to positive samples is as high as 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.
4.2. The Evaluation Index of the Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning have their own evaluation indexes [17]. The enrichment factor (EF) is one of the most famous measures for evaluating the screening efficiency. EF is usually used to evaluate the early recognition properties of a screening method, and it indicates the ratio of the number of active compounds obtained by in silico screening to that obtained by random selection at the predefined sampling percentage. The calculation method of EF is as follows:
$$\mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hits}_t / N_t} \tag{14}$$
Here, $\mathrm{Hits}_s$ represents the number of active compounds in the sample, $\mathrm{Hits}_t$ is the total number of active compounds tested, $N_s$ is the number of compounds sampled, and $N_t$ is the number of all compounds. In actual drug discovery, only a small part of the compounds is filtered by computer: in general, 0.01–1% of the compounds will be selected from a huge compound database (10,000–1,000,000 compounds) in the actual virtual screening process. Since the number of compounds in the test sets of this experiment is far from reaching this order of magnitude, we use
Figure 3: Two ROC curves, X and Y (true positive rate versus false positive rate).
the 10% EF in this test, in order to reduce the deviation of the early evaluation. EF reflects the screening efficiency only at a specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire sampling range (0–100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in an ROC curve the true positive rate (TPR) is plotted along the y-axis, while the false positive rate (FPR) is displayed on the x-axis. Although the ROC curve can directly reflect the effect of a machine learning model, in practical applications we also need a numerical measure, the AUC (area under ROC curve) value, to evaluate the effect of the model. The AUC value indicates the area under the ROC curve and is more intuitive and accurate. The calculation method of the AUC value is as follows:
$$\mathrm{AUC} = \int_0^1 \frac{\mathrm{TP}}{P}\, d\!\left(\frac{\mathrm{FP}}{N}\right) = \frac{1}{P \cdot N} \int_0^N \mathrm{TP}\, d\,\mathrm{FP} \tag{15}$$
Among the variables of formula (15), $P$ stands for the number of positive samples, $N$ represents the number of negative samples, TP (true positive) stands for the active compounds that are classified correctly, and FP (false positive) stands for the decoy compounds misclassified as active.
AUC values are between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and AUC are affected neither by an imbalanced distribution of the data classes nor by whether the data are normally distributed. In addition, the AUC value allows a middle state, and experimental results can be divided into multiple ordered classes.
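Both evaluation measures can be computed directly from (14) and (15). The following is a minimal sketch: the AUC is computed as the equivalent Mann-Whitney (rank) statistic, which for a finite sample gives the same value as the integral in (15); the scores and labels in the usage example are toy data.

```python
def enrichment_factor(scores, labels, fraction=0.10):
    """EF at a sampling fraction, per (14): (Hits_s / N_s) / (Hits_t / N_t).
    labels: 1 for active, 0 for decoy; higher score = predicted more active."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_s = max(1, int(round(fraction * len(scores))))   # compounds sampled
    hits_s = sum(labels[i] for i in order[:n_s])       # actives in the sample
    hits_t = sum(labels)                               # actives overall
    return (hits_s / n_s) / (hits_t / len(scores))

def roc_auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the probability that a random
    active is ranked above a random decoy (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With 2 actives ranked at the top of 10 compounds, EF at a 20% sampling fraction is (2/2)/(2/10) = 5, and the AUC of the perfect ranking is 1.0.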
4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) to compare the two kinds of target proteins in the virtual screening experiment. As the baseline algorithm, RF is also developed based on the machine learning libs of the authors' lab, and the parameters of RF are set as in Table 4.

Figure 4: Comparative experiment on SRC (ROC curves of SVM, Adaboost-SVM, and Random Forest).

Figure 5: Comparative experiment on Cathepsin K (ROC curves of SVM, Adaboost-SVM, and Random Forest).
The ROC curves from the comparative experiments on these two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.

Table 5 shows the 10% EF of each method together with the corresponding AUC value.
Table 4: Random Forest parameter settings.

Parameter name                                        Parameter value
Tree number                                           1000
Node size                                             5
Number of different descriptors tried at each split   50

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm       Target protein    10% EF    AUC
SVM             SRC               4.7       0.734
SVM             Cathepsin K       3.9       0.683
Adaboost-SVM    SRC               5.5       0.821
Adaboost-SVM    Cathepsin K       4.8       0.802
Random Forest   SRC               5.3       0.805
Random Forest   Cathepsin K       4.5       0.783

Aiming at the problem of sample set quality, the Adaboost method initially gives each training sample the same weight; the sample weight represents the probability of the data being used for training by a weak classifier. At each iteration, the Adaboost algorithm modifies the sample weights. If a training sample is correctly classified in the current iteration, its weight is reduced, that is, its probability of being selected as a training sample in the next iteration decreases. Conversely, if a training sample is misclassified by the current base classifier, its weight is increased, and its probability of being selected in the next iteration rises. In this way, the weak classifiers pay more attention to the seriously misclassified training samples. The experimental results above show that Adaboost-SVM notably outperforms RF. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening shows noise resistance when the number of structure samples is limited, and thus better screening results are obtained.
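The reweighting loop described above can be sketched as follows. For brevity, a 1-D decision stump stands in for the SVM base classifier used in the paper, and the data in the usage example are toy values; the weight update itself is the standard AdaBoost rule.

```python
import math

def stump_train(xs, ys, w):
    """Best 1-D threshold stump (pred = sign if x < t else -sign)
    under sample weights w; returns (weighted error, threshold, sign)."""
    best = None
    for t in sorted(set(xs)) + [max(xs) + 1]:
        for sign in (1, -1):
            pred = [sign if x < t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=20):
    """AdaBoost with stump base learners; ys are +1/-1 labels."""
    n = len(xs)
    w = [1.0 / n] * n                       # equal initial weights
    ensemble = []
    for _ in range(rounds):
        err, t, sign = stump_train(xs, ys, w)
        err = max(err, 1e-12)               # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # raise weights of misclassified samples, lower the rest
        w = [wi * math.exp(-alpha * y * (sign if x < t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    def predict(x):
        score = sum(a * (sg if x < t else -sg) for a, t, sg in ensemble)
        return 1 if score >= 0 else -1
    return predict
```

On the alternating pattern x = [0, 1, 2, 3], y = [1, -1, 1, -1], the best single stump reaches only 75% training accuracy, while three boosting rounds already fit all four points.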
5. Conclusion
In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method mainly uses Pharm-IF to encode protein-ligand interactions in a binary form and then uses the improved SVM algorithm Adaboost-SVM, as well as Random Forest, to classify the data. The idea of ensemble learning in dealing with a data classification problem is to obtain a number of weak classifiers that are independent of each other and then use an effective method to combine these independent weak classifiers. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF of the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed from the results that, compared with the naïve SVM, Random Forest obtained better performance on both 10% EF and AUC: the 10% EF improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest has shown that ensemble learning is able to obtain better results on the imbalanced data set with satisfying robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method brings a more significant improvement in the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of the virtual screening false positive rate is still to be solved. The speed of laboratory determination of protein structures has been unable to catch up with the needs of drug development; therefore, virtual screening will often encounter problems similar to those in this paper, so there are still many possible improvements in the algorithm. For example, the selection of the kernel function will directly affect the performance of the classifier. In view of this kind of data set, we should set up a special kernel function adapted to the characteristics of the data set.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).
References
[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.
[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.
[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.
[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.
[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.
[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.
[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.
[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.
[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.
[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.
[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.
[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.
[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.
[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.
[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.
[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.
Evidence-Based Complementary and Alternative Medicine
Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom
Computational and Mathematical Methods in Medicine
Table 4: Random Forest parameter settings.

Parameter name                                    Parameter value
Tree number                                       1000
Node size                                         5
Number of descriptors tried at each split         50
Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm        Target protein    10% EF    AUC
SVM              SRC               4.7       0.734
SVM              Cathepsin K       3.9       0.683
Adaboost-SVM     SRC               5.5       0.821
Adaboost-SVM     Cathepsin K       4.8       0.802
Random Forest    SRC               5.3       0.805
Random Forest    Cathepsin K       4.5       0.783
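The 10% EF reported in Table 5 measures how much the top-ranked 10% of a screened library is enriched in active compounds relative to random selection. A minimal sketch of the computation, on toy data rather than the paper's actual screening results:

```python
import numpy as np

def enrichment_factor(scores, labels, fraction=0.10):
    """EF at the given top fraction: hit rate among the top-ranked
    compounds divided by the hit rate in the whole library."""
    order = np.argsort(scores)[::-1]            # rank compounds, best score first
    n_top = max(1, int(round(fraction * len(scores))))
    top_hits = np.sum(labels[order[:n_top]])    # actives in the top fraction
    return (top_hits / n_top) / (np.sum(labels) / len(labels))

# toy library: 100 compounds, 10 actives, all actives ranked at the top
scores = np.arange(100, 0, -1, dtype=float)
labels = np.zeros(100, dtype=int)
labels[:10] = 1
ef10 = enrichment_factor(scores, labels)        # perfect ranking gives EF = 10
```

With a 10% cutoff the maximum possible EF is 10, which is why the values in Table 5 lie in the single digits.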
iteration, the weight value of the sample will be reduced; that is, the probability of it being selected as a training sample in the next iteration is reduced. On the contrary, if a training sample is misclassified by the current base classifier, the weight value of the sample will be increased, and the probability of it being selected as a training sample in the next iteration will be increased. In this way, the weak classifiers pay more attention to the seriously misclassified samples of the training set. The experimental results above show that Adaboost-SVM notably outperformed Random Forest. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening shows its ability to resist noise when the number of structure samples is limited, and thus better screening results are obtained.
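The weight-update rule described above can be sketched in a few lines. This is a generic illustration of discrete AdaBoost with a linear SVM base classifier on synthetic data; the paper's actual Pharm-IF features and SVM parameters are not reproduced here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10,
                           flip_y=0.1, random_state=0)
y = np.where(y == 0, -1, 1)            # AdaBoost works with labels in {-1, +1}

n = len(y)
w = np.full(n, 1.0 / n)                # start from uniform sample weights
classifiers, alphas = [], []

for t in range(10):                    # boosting rounds
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y, sample_weight=w)     # train the base SVM on weighted samples
    pred = clf.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    if err == 0 or err >= 0.5:         # stop if the weak learner is perfect or useless
        break
    alpha = 0.5 * np.log((1 - err) / err)
    # correctly classified samples are down-weighted,
    # misclassified samples are up-weighted
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()                       # renormalize to a distribution
    classifiers.append(clf)
    alphas.append(alpha)

def ensemble_predict(X):
    """Weighted vote of all base classifiers."""
    votes = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(votes)

acc = np.mean(ensemble_predict(X) == y)
```

The two branches of the weight update (`exp(-alpha)` for correct, `exp(+alpha)` for wrong) are exactly the mechanism the text describes: each round re-focuses the next base SVM on the samples the current one got wrong.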
5 Conclusion
In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. The method uses Pharm-IF to encode protein-ligand interactions in binary form and then applies the improved SVM algorithm, Adaboost-SVM, and Random Forest to classify the data. The idea of ensemble learning for data classification is to train a number of weak classifiers that are independent of each other and then combine these weak classifiers with an effective method. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF of the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed that, compared with the naïve SVM, Random Forest obtained better performance on both 10% EF and AUC: the 10% EF improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning can obtain better results on an imbalanced data set with satisfying robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in future work. Compared with the traditional method, the proposed method significantly improves the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be studied further from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of its false positive rate remains to be solved. The speed of laboratory determination of protein structures cannot keep up with the needs of drug development; therefore, virtual screening will often encounter problems similar to those in this paper, and there is still much room for improving the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier; for data sets of this kind, a special kernel function should be designed to adapt to their characteristics.
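As an illustration of the kernel-selection point above, a Tanimoto similarity kernel is a common choice for binary fingerprints such as interaction fingerprints; the sketch below plugs one into scikit-learn's SVC as a callable kernel. The fingerprint data and activity labels are toy assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.svm import SVC

def tanimoto_kernel(X, Y):
    """Tanimoto similarity between two matrices of binary fingerprints:
    shared on-bits divided by the union of on-bits."""
    inter = X @ Y.T                                       # shared on-bits
    norm = X.sum(axis=1)[:, None] + Y.sum(axis=1)[None, :] - inter
    return inter / np.maximum(norm, 1e-12)                # avoid division by zero

rng = np.random.default_rng(0)
X = (rng.random((80, 64)) < 0.3).astype(float)            # toy binary fingerprints
y = (X[:, :8].sum(axis=1) > 2).astype(int)                # toy activity labels

clf = SVC(kernel=tanimoto_kernel).fit(X, y)               # custom kernel via callable
acc = clf.score(X, y)
```

Because Tanimoto similarity respects the sparse, set-like nature of binary fingerprints, it often suits such data better than a generic RBF kernel, which is one concrete way to "adapt the kernel to the characteristics of the data set."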
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).
References
[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.
[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.
[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.
[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.
[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.
[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.
[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.
[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.
[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.
[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.
[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.
[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.
[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.
[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.
[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.
[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.