
Research Article
The Virtual Screening of the Drug Protein with a Few Crystal Structures Based on the Adaboost-SVM

Meng-yu Wang,1 Peng Li,1,2 and Pei-li Qiao1

1 School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
2 School of Software, Harbin University of Science and Technology, Harbin 150080, China

Correspondence should be addressed to Peng Li; pli@hrbust.edu.cn

Received 27 December 2015; Revised 6 March 2016; Accepted 7 March 2016

Academic Editor: Ezequiel López-Rubio

Hindawi Publishing Corporation, Computational and Mathematical Methods in Medicine, Volume 2016, Article ID 4809831, 9 pages, http://dx.doi.org/10.1155/2016/4809831

Copyright © 2016 Meng-yu Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Using machine learning to assist virtual screening (VS) has proved to be an effective approach. However, the quality of the training set may decrease when incorrect docking poses are mixed into it, which affects the screening efficiency. To solve this problem, we present a method that uses ensemble learning to improve the support vector machine (SVM) that processes the generated protein-ligand interaction fingerprints (IFP). By combining multiple classifiers, ensemble learning avoids the limitations of a single classifier's performance and obtains better generalization. In virtual screening experiments with SRC and Cathepsin K as the targets, the results show that the ensemble learning method can effectively reduce the error caused by low sample quality and improve the effect of the whole virtual screening process.

1. Introduction

Since the beginning of the 21st century, the focus of life science has shifted from experimental analysis and data accumulation to experiments guided by data analysis. Life science is undergoing a transition from reductionist analysis to system integration methods [1]. With the completion of the Human Genome Project (HGP), more and more three-dimensional structures of functionally important biological macromolecules (proteins, nucleic acids, enzymes, etc.) have been resolved [2]. As the amount of data has increased exponentially in recent years, the combination of the traditional pharmaceutical field and modern computer technology has become an inevitable result of the development of life science, and virtual screening is the product of this combination. At present, millions of molecules can be screened by virtual screening every day. For each specific target structure, active compounds can be obtained in a short time. The research focus is narrowed from millions of compounds to hundreds, which greatly improves the speed and efficiency of compound screening and shortens the cycle of new drug research. However, with the increasing amount of data, ordinary computer algorithms can no longer maintain a high level of performance, so machine learning methods have gradually attracted the attention of scientists because of their reliable and fast performance.

The combination of machine learning and virtual screening has become a hotspot in chemical informatics and has shown its value in drug discovery, for example, in searching for inhibitors [3], finding novel chemotypes [4], and predicting protein structures [5]. The number of crystal structures of complexes available for training is crucial when virtual screening is combined with machine learning. Compared with a small training set, a larger and more diverse training set can train a more powerful learning model. However, the crystal structures that can be used for virtual screening always come from X-ray crystal diffraction or NMR [6]. Although such structures are accurate, the high cost and long time required limit the speed at which they are resolved, which cannot meet the needs of virtual screening experiments. Therefore, in order to expand the training set, some docking poses of known active compounds are added to it. Because the docking poses are likely to include incorrect binding modes, large amounts of negative samples are introduced. The accumulation of negative samples can produce an imbalanced data set, which is a common phenomenon and is of great value in bioinformatics studies.

For the prediction of DNA-binding proteins, Song et al. proposed an ensemble learning algorithm, imDC, based on an analysis of unbalanced DNA-binding protein data; it outperformed classic classification models such as SVM under the same conditions [7]. Based on the ensemble learning framework, Zou et al. presented a new predictor to improve the performance of tRNAscan-SE annotation, and the experimental results show that their algorithm can distinguish functional tRNAs from pseudo-tRNAs [8]. Lin et al. proposed merging a K-means static selective strategy and ensemble forward sequential selection in an ensemble learning architecture for the hierarchical classification of protein folds, reaching an accuracy of 74.21%, which is the state-of-the-art strategy at present [9]. Zou et al. combined synthetic minority oversampling and K-means clustering undersampling to tackle the negative influence of imbalanced data sets [10].

Clearly, imbalanced data sets are common in bioinformatics studies. The strategies widely used at present include preprocessing the training samples and improving the classifiers. In this paper, we start from the perspective of improving the machine learning algorithm: we introduce the ensemble learning method on top of a simple SVM classifier, using layered combination and iterative weighting to enhance the performance of the classifier and thus reduce the impact of the quality of the sample set. Meanwhile, this paper introduces Random Forest as the experimental baseline to examine the effect of ensemble learning on virtual screening.

2. The Quantitative Method

With the rapid development of combinatorial chemistry, bioinformatics, molecular biology, and computer science, computer aided drug design (CADD) is widely used. Virtual screening is one of the most widely used methods in CADD; because it is quick and inexpensive, it has gradually been replacing high-throughput screening as the main means of drug screening [11]. In this paper, we use the virtual screening method to screen the drug protein.

2.1. General Process of Virtual Screening. Virtual screening, also known as computer screening, is a prescreening of compound molecules on the computer that reduces the number of compounds actually screened and improves the efficiency of lead compound discovery. The workflow of the virtual screening process is shown in Figure 1.

Virtual screening includes four steps: the establishment of the receptor model, the generation of small molecule libraries, the computer screening, and the postprocessing of hit compounds.

Figure 1: Virtual screening process. (The flowchart comprises three stages: ligand database preparation, protein structure preparation, and virtual screening with docking, scoring, and ranking/selection with drug likeness analysis, ending in a hit candidates list.)

Step 1 (the establishment of the receptor model). (1) Obtaining the macromolecular structure: preparation of the protein structure is an important step in virtual screening. The crystal structures used in virtual screening can be obtained directly from the PDB or built by modeling from the sequence and structure information of homologous proteins.

(2) Binding site description: the choice of an appropriate ligand binding pocket is very important in molecular docking. There are two ways to choose it: (a) take it directly from the ligand-receptor complex structure; (b) if there is no complex structure, choose the binding site manually according to experimental information on biological function, such as mutation and binding data.

Step 2 (the generation of small molecule libraries). A conversion program can be used to translate the two-dimensional structures into three-dimensional structures. The resulting 3D structures can be used in the docking process after hydrogen atoms and charges are added.

Step 3 (docking and scoring). The docking operation places every small molecule in the ligand binding site of the receptor protein and optimizes the conformation and location of the ligand to find the best binding mode. The best conformation is then scored, all compounds are ranked according to their scores, and the small molecules with the best scores are picked out of the compound library. The docking algorithm aims to predict the conformation of the complex formed by the receptor and the ligand. The purpose of the scoring function is to choose a conformation from the candidate set of conformations according to the score. The closer the docking result is to the natural complex, the lower the score given by the scoring function.

Figure 2: The calculation process of Pharm-IF [14]. (The example shows pairs of ligand pharmacophore features, Hbond(D), Hbond(A), and Ionic(Cat), at distances of 4.30 Å, 2.34 Å, and 10.23 Å, whose contributions are split between adjacent distance bins (Å) with weights 0.7/0.3, 0.66/0.34, and 0.77/0.23, respectively.)

However, no scoring function is completely correct. So far, the scoring functions used in the various existing docking algorithms are only approximations of the correct scoring function.

Step 4 (postprocessing of hit compounds). Relying on a simple scoring model alone can lead to large deviations in the final results and sometimes to wrong judgments, so the final results must be analyzed from multiple perspectives and postprocessed. The purpose of this analysis and postprocessing is to assess the protein-ligand binding free energies as accurately as possible. The generated candidate set of complexes is classified, and the erroneous results are identified.

As the accuracy of the scoring function in virtual screening has not been properly resolved, in this paper we use the protein-ligand interaction fingerprint (IFP) to handle the interactions between target proteins and ligands. The IFP encodes the observed interactions between ligand and protein into a binary string of fixed length [12]. The IFP method was originally designed for analyzing ligand docking poses in protein kinases. Based on this method, the atom-based IFP concept was put forward and extended. Each kind of IFP has its own characteristics, whether it is residue-based or atom-based. A one-dimensional interaction fingerprint is easier to generate and compare than the 3D structure of a protein-ligand complex, so it is more suitable for computer aided drug design [13].

2.2. The Concept and Calculation Process of Pharm-IF. In this paper, we use an atom-based fingerprint, Pharm-IF, as an aid to verify the proposed approach. The concept of Pharm-IF was put forward by Sato et al. [14]. Pharm-IF is calculated from the distances between pairs of ligand pharmacophore features that interact with protein atoms, and it can detect important geometrical patterns of ligand pharmacophores.

The calculation of Pharm-IF can be divided into the following three steps, as Figure 2 shows.

Step 1. Detect the protein-ligand interactions from the complex structures. Interactions are classified into six types: (1) hydrogen bond with ligand acceptor, (2) hydrogen bond with ligand donor, (3) hydrogen bond in which the roles of the ligand and protein atoms could not be determined, (4) ionic interaction with ligand cation, (5) ionic interaction with ligand anion, and (6) hydrophobic interaction.

Step 2. Create all possible interaction pairs. Each interaction pair is characterized by the pharmacophore features of the two ligand atoms and the distance between them. To calculate the resulting matrix, each interaction pair is assigned to the corresponding distance bins. For example, a pair of two hydrogen bonds in which the ligand atoms are a donor and an acceptor 4.3 Å apart is assigned to the vector corresponding to this hydrogen bond pair type; to describe the distance of 4.3 Å, 0.7 is assigned to the 4 Å bin and 0.3 to the 5 Å bin.

Step 3. The result matrix is calculated by summing the values of all of the interaction pairs. The formula of the Pharm-IF calculation is as follows:

\[ H_{tk} = \sum_{i \in I_t} A_k(i), \tag{1} \]

\[ A_k(i) = \begin{cases} 0 & \text{if } |k - d_i| \ge 1, \\ 1 - |k - d_i| & \text{otherwise.} \end{cases} \tag{2} \]

In formula (1), $H$ stands for the interaction fingerprint of a protein-ligand complex by Pharm-IF; $t$ is a pair of the six types of pharmacophore features; $k = 1, 2, 3, \ldots$ indexes the bins of the distances (Å) between ligand atoms; $I_t$ represents the whole set of interaction pairs classified as type $t$; and $i$ represents an element of $I_t$. In formula (2), $d_i$ represents the distance (Å) between the ligand atoms of $i$.
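The binning in formulas (1) and (2) can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation of Sato et al.: the function names, the input format (a feature label plus 3D coordinates for each interacting ligand atom), the treatment of pair types as unordered, and the acceptance of a bin 0 for distances below 1 Å are all hypothetical choices made here.

```python
import math
from collections import defaultdict
from itertools import combinations


def bin_weights(distance):
    """Return {k: A_k} with A_k = 1 - |k - d| for the integer bins where A_k > 0."""
    lower = math.floor(distance)
    weights = {}
    for k in (lower, lower + 1):
        a = 1.0 - abs(k - distance)
        if a > 0.0:
            weights[k] = a
    return weights


def pharm_if(interactions):
    """Sum the bin weights of all feature pairs, following formula (1).

    interactions: list of (feature_type, (x, y, z)) tuples, one per ligand
    atom that interacts with the protein (hypothetical input format).
    Returns {(pair_type, k): H_tk}.
    """
    fingerprint = defaultdict(float)
    for (type_a, pos_a), (type_b, pos_b) in combinations(interactions, 2):
        pair_type = tuple(sorted((type_a, type_b)))  # assumption: unordered pairs
        distance = math.dist(pos_a, pos_b)
        for k, a in bin_weights(distance).items():
            fingerprint[(pair_type, k)] += a
    return dict(fingerprint)
```

For the donor and acceptor pair 4.3 Å apart used as the example in Step 2, `bin_weights(4.3)` gives approximately 0.7 for bin 4 and 0.3 for bin 5.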


2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. Both proteins are hot drug targets in the pharmaceutical field, and neither has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses influence the virtual screening efficiency.

Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase that in humans is encoded by the SRC gene. The SRC kinase family is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC exists widely in tissue cells and plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in signal transduction pathways. c-Src is made up of 6 functional regions: the SRC homology 4 (SH4) domain, the unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, phosphorylated Tyr527 (tyrosine 527) interacts intramolecularly with the SH2 domain; at the same time, the SH3 domain binds the proline-rich SH2-kinase linker. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these interactions are broken and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and it is closely associated with a variety of cancers. Activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. A number of tyrosine kinase inhibitors that target c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].

Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2, the transcript is 1.7 kb long, and the gene consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. During bone resorption, acid dissolves the hydroxyapatite, and the organic components of the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acidic environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. Because the function of Cathepsin K in osteoclasts has been recognized, many laboratories treat it as a drug target and study its inhibitors for the treatment of osteoporosis. At present, the first choice of antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, long-term use of bisphosphonates may produce adverse reactions, such as esophageal irritation, hypocalcaemia, and kidney irritation. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation, so new replacement therapy drugs are of greater value [16].

3. Classification Algorithm Based on the Adaboost-SVM

SVM can deal with many problems, such as small sample size, nonlinearity, and high dimensionality. Based on statistical learning theory, it has a simple mathematical form, a fast training method, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. Provided that the quality of the sample set is not very low, good results can be obtained even without any improvement. The learning mechanism of SVM also leaves considerable room to improve the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which yields only a global minimum and hence avoids being trapped in local minima. For these reasons, we use SVM as the base classifier in this paper. There is a large body of literature on SVM, so this paper gives only a brief introduction. The basic process of SVM classification is as follows.

For a given sample set $L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in R^d$ and $y_i \in \{1, -1\}$, $i = 1, 2, \ldots, n$, $y_i$ stands for the category of sample $x_i$, $d$ is the dimension of the samples, and $n$ is the number of training samples. If the input vector set is linearly separable, it can be separated by a hyperplane, which can be expressed as

\[ w \cdot x + b = 0, \tag{3} \]

where $w$ is the normal vector of the hyperplane and $b$ is the offset. The SVM learning problem is to minimize the objective function

\[ \min \Phi(w) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n} \xi_i\right) \tag{4} \]

subject to the condition

\[ y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n. \tag{5} \]

Here $(1/2)\|w\|^2$ is the structure complexity, $C(\sum_{i=1}^{n} \xi_i)$ stands for the empirical risk, $\xi_i$ is the slack variable, and $C$ is a constant that serves as the penalty factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vectors into a high dimensional feature space and to construct an optimal hyperplane in that space.

Let $\phi$ denote the transformation that maps $x$ from the input space $R^d$ into the feature space $H$:

\[ x \longrightarrow \phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_i(x))^{T}. \tag{6} \]

Eventually, we obtain the optimal classification function

\[ f(x) = \operatorname{sgn}(w \cdot \phi(x) + b) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i \, \phi(x_i) \cdot \phi(x) + b\right). \tag{7} \]
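The step between the two forms of the classification function can be written out explicitly. The following is the standard dual-form result for SVMs rather than something derived in this paper: the weight vector is a linear combination of the mapped training samples with nonnegative multipliers $a_i$, and the inner products in the feature space can be replaced by a kernel function $K$ (introduced here only as notation).

```latex
w = \sum_{i=1}^{n} a_i y_i \,\phi(x_i)
\quad\Longrightarrow\quad
w \cdot \phi(x) + b
  = \sum_{i=1}^{n} a_i y_i \,\phi(x_i) \cdot \phi(x) + b
  = \sum_{i=1}^{n} a_i y_i \,K(x_i, x) + b,
\qquad K(x_i, x) = \phi(x_i) \cdot \phi(x).
```

Only the samples with $a_i > 0$ (the support vectors) contribute to the sum, so $\phi$ never has to be evaluated explicitly.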


In our work, the Radial Basis Function (RBF) is taken as the kernel function of the SVM; the mathematical description of this kernel is given below:

\[ K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}. \tag{8} \]
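As a small illustration of formula (8) and of how it enters the classification function (7), the following Python sketch evaluates the RBF kernel and a kernelised decision value. It is a minimal sketch, not the program used in the paper: the function names and the assumption that the support vectors, their labels, the multipliers $a_i$, and the offset $b$ are already known from training are hypothetical.

```python
import math


def rbf_kernel(x_i, x_j, sigma):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2)), as in formula (8)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def svm_decision(x, support_vectors, labels, multipliers, bias, sigma):
    """Return sgn(sum_i a_i * y_i * K(x_i, x) + b), the kernel form of formula (7)."""
    total = sum(a * y * rbf_kernel(sv, x, sigma)
                for sv, y, a in zip(support_vectors, labels, multipliers))
    return 1 if total + bias >= 0 else -1
```

A small `sigma` makes the kernel fall off sharply with distance, so each support vector influences only its immediate neighbourhood; a large `sigma` makes the kernel nearly flat.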

Relative to other classifiers, the SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set destabilize the classifier, and this instability directly affects the final classification result. In this paper, we introduce the Adaboost mechanism from ensemble learning to divide one classification process into several layers of weak classifiers based on SVM.

The key to combining Adaboost and SVM is to find a suitable Gaussian width $\sigma$ for each component classifier. If $\sigma$ is relatively large, the component classifiers are too weak and the final classification performance decreases. On the other hand, if $\sigma$ is relatively small, the component classifiers become too strong and their errors are highly correlated, so the diversity among them is small and ensemble learning becomes ineffective. Even more importantly, a $\sigma$ value that is too small leads to overfitting and greatly reduces generalization. Therefore, in this paper, the standard deviation of the sample set of each component classifier is used as its $\sigma$ value to control the classification accuracy of the component classifier, and thus the SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so we need to explain some key parameters. We list the values of $\sigma$, $C$, and other parameters in Table 1.

The specific process of the algorithm is as follows.

(1) RBFSVM denotes the SVM with the RBF kernel, and $T$ denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weight of each sample, $D_1(i) = 1/n$, $i = 1, 2, \ldots, n$.

(3) For $t = 1, 2, \ldots, T$: for each component classifier $h$, calculate the weighted error

\[ \varepsilon_t = \sum_{j=1}^{n} D_t(j) \, |h(x_j) - y_j|. \tag{9} \]

Choose the component classifier $h_t$ with the lowest weighted error rate $\varepsilon_t$ and save its corresponding SVM model. Calculate the weight of the selected weak classifier:

\[ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right). \tag{10} \]

Update the sample weights according to $\alpha_t$:

\[ D_{t+1}(i) = \frac{D_t(i)}{Z_t} F(E_t) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i, \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i, \end{cases} \tag{11} \]

where $Z_t$ is the normalization factor, chosen so that

\[ \sum_{i=1}^{n} D_{t+1}(i) = 1. \tag{12} \]

(4) Apply the strong classifier $H$, which integrates the SVM weak classifiers, to the training set:

\[ H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} \alpha_t h_t(x)\right]. \tag{13} \]

For the Adaboost-SVM, we set the parameters as listed in Table 2; the basic SVM parameters are the same as in Table 1.

Table 1: SVM parameter setting.

Parameter name                                                  Parameter value
SVM type                                                        C-SVM
Class number                                                    2
Kernel function                                                 RBF
The degree in kernel function                                   3
σ in kernel function                                            0.001
Cost factor                                                     5
Cache size                                                      500 MB
Tolerance in the termination criteria                           0.001
The weight value of penalty factor for all kinds of samples     1
Cross validation                                                5

Table 2: Adaboost-SVM parameter setting.

Parameter name                       Parameter value
Ensemble learning type               Adaboost
Class number                         2
The basic classifier type            C-SVM
Number of classifiers per layer      100
The max false alarm rate             0.5
The min hit rate                     0.9
Number of iterations                 5
Weight trim rate                     0.9
Cache size                           500 MB

Compared with a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent of the others and that the probability of sample misclassification of each be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, which can also reduce the impact of the quality of the sample set on the virtual screening effect.
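The weight bookkeeping of the Adaboost procedure described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' program: the weak classifiers are represented only by their predictions (in the paper they are RBF-SVMs), the function names are hypothetical, and the weighted error is computed with a 0/1 misclassification indicator so that it lies in [0, 1] for labels in {1, -1}, which is an assumption about how formula (9) is meant to be read.

```python
import math


def weighted_error(weights, predictions, labels):
    """Sum of the weights of the misclassified samples (0/1 loss, an assumption)."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)


def adaboost_round(weights, predictions, labels):
    """One boosting round: return (alpha_t, normalised weights D_{t+1})."""
    eps = weighted_error(weights, predictions, labels)
    if not 0.0 < eps < 0.5:
        raise ValueError("weak classifier must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - eps) / eps)            # formula (10)
    unnormalised = [w * math.exp(-alpha if p == y else alpha)
                    for w, p, y in zip(weights, predictions, labels)]
    z = sum(unnormalised)                                # normalisation factor Z_t
    return alpha, [w / z for w in unnormalised]          # formulas (11) and (12)


def strong_classify(alphas, weak_predictions):
    """H(x) = sign(sum_t alpha_t * h_t(x)) for a single sample, as in formula (13)."""
    score = sum(a * h for a, h in zip(alphas, weak_predictions))
    return 1 if score >= 0 else -1
```

With four equally weighted samples and one misclassification, the weighted error is 0.25 and the misclassified sample ends the round carrying half of the total weight, which is what forces the next weak classifier to concentrate on it.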

4. Experiment and Analysis

In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe database. The enrichment factor (EF) and the ROC curve are used to evaluate the effect of virtual screening and machine learning and thus to confirm the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.

All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank our laboratory colleagues for providing the StARLITe data.

4.1. Data Set. To support machine learning, we construct training sets and test sets for the two target proteins, which include decoys and known active compounds. In the training set, we selected the experimentally determined complex structures of the two proteins from the PDB as positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world and are obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds, respectively, from the known active compounds for which crystal structures with their targets had not been determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set.

The crystal structures of the two proteins used in the docking experiments were selected from the PDB entries of protein-inhibitor complexes with high inhibitor activity and high-resolution crystal structures. For SRC, entry 2h8h (SRC kinase in complex with a quinazoline inhibitor; resolution 2.30 Å) was selected. For Cathepsin K, entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491; resolution 2.20 Å) was selected. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. To prepare for docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions. Pipeline Pilot of SciTegic was used to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Additional ring conformations of the compounds were then generated by LigPrep, and the GLIDE score was used to select five poses for each compound from the docking results. Five docking poses of each active compound were chosen by GLIDE score as positive examples because, in a preliminary test, this procedure generated higher enrichment factors than using one pose of each active compound. Other settings of GLIDE were set to the default values. We use the averages of the 10% EF and the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses.

The docking poses of 2000 decoy compounds were then selected at random as the negative samples. Each decoy compound was docked to the target proteins in the same way as described above.

Table 3: Experimental data structure.

Data set        Positive samples     Negative samples       Total
Training set    100                  2000                   2100
Test set        100 × 5 = 500        2000 × 5 = 10000       10500
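The pose selection used when building the data set, keeping five poses per compound ranked by docking score, can be sketched as follows. This is a hypothetical helper, not part of GLIDE or of the authors' pipeline: the function name and the `(compound_id, pose_id, score)` input format are assumptions, and a lower score is treated as better, in line with the description of the scoring function in Section 2.1.

```python
from collections import defaultdict


def select_top_poses(poses, n_poses=5):
    """Keep the n_poses best-scoring poses of each compound.

    poses: iterable of (compound_id, pose_id, score) tuples (hypothetical format);
    a lower score is treated as a better pose.
    Returns {compound_id: [pose_id, ...]} ordered from best to worst.
    """
    by_compound = defaultdict(list)
    for compound_id, pose_id, score in poses:
        by_compound[compound_id].append((score, pose_id))
    return {compound_id: [pose_id for _, pose_id in sorted(scored)[:n_poses]]
            for compound_id, scored in by_compound.items()}
```

Compounds with fewer than `n_poses` docked poses simply keep all of them.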

After completing the training set, we built the test set. First, the active compounds of the target proteins (IC50 ≤ 10 μM) were taken from StARLITe and divided into 100 clusters by hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses of each active compound were used as positive samples of the test set. The docking procedure and the target protein crystal structures were the same as those described above. The decoys for the test set were then chosen with the same negative-sample selection strategy used for the training set.
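The selection of one representative per cluster can be illustrated with a minimal sketch. It assumes that the clustering has already been done and that "highest inhibitory activity" means the lowest IC50 value; the function name, compound names, cluster labels, and IC50 values are hypothetical and are not taken from the paper.

```python
def pick_cluster_representatives(compounds):
    """Return the most potent compound (lowest IC50) of each cluster.

    `compounds` is an iterable of (name, cluster_id, ic50_um) tuples;
    the cluster labels are assumed to come from a prior clustering step.
    """
    best = {}
    for name, cluster_id, ic50_um in compounds:
        current = best.get(cluster_id)
        # Keep the compound with the smallest IC50 seen so far in this cluster.
        if current is None or ic50_um < current[1]:
            best[cluster_id] = (name, ic50_um)
    return {cluster_id: name for cluster_id, (name, _) in best.items()}


# Hypothetical toy data: (compound name, cluster label, IC50 in micromolar).
toy_compounds = [
    ("cpd_a", 0, 2.5),
    ("cpd_b", 0, 0.4),
    ("cpd_c", 1, 9.0),
    ("cpd_d", 1, 7.1),
    ("cpd_e", 2, 0.05),
]
representatives = pick_cluster_representatives(toy_compounds)
```

With 100 clusters, the same call would return the 100 compounds that are subsequently docked to the target.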

After data preparation, we use Pharm-IF to quantify the training set. The resulting data serve as the input to the machine learning algorithms, which produce the corresponding learning models. Each learning model is then applied to the test set.

Table 3 shows that the ratio of negative samples to positive samples is as high as 20:1, which leads to a significant imbalanced-data problem; addressing this issue is the motivation of our work.

4.2. The Evaluation Index of the Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indexes [17]. The enrichment factor (EF) is one of the best-known measures of screening efficiency. EF is usually used to evaluate the early-recognition performance of a screening method: it is the ratio of the proportion of active compounds obtained by in silico screening to the proportion expected from random selection, at a predefined sampling percentage. EF is calculated as follows:

$$\mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hit}_t / N_t}. \tag{14}$$

Here $\mathrm{Hits}_s$ is the number of active compounds in the sample, $\mathrm{Hit}_t$ is the number of active compounds among all tested compounds, $N_s$ is the number of compounds sampled, and $N_t$ is the number of all compounds. In actual drug discovery, only a small fraction of the compounds pass the computational filter. In general, 0.01–1% of the compounds are selected from a huge compound database (10,000–1,000,000 compounds) in an actual virtual screening process. Since the size of the test sets in this experiment is far below this order of magnitude, we use

Computational and Mathematical Methods in Medicine 7

Figure 3: Two ROC curves, X and Y (x-axis: false positive rate; y-axis: true positive rate).

the 10% EF to carry out this test in order to reduce the deviation of the early-recognition evaluation. Because the EF reflects screening efficiency only at one specific sampling proportion, this paper also uses the ROC curve and the AUC value to assess the entire sampling range (0–100%). The ROC (receiver operating characteristic) curve is a graphical method for showing the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, the true positive rate (TPR) is plotted along the y-axis and the false positive rate (FPR) along the x-axis. Although the ROC curve directly reflects the performance of a machine learning model, a numerical measure is also needed to evaluate the model in practical applications: the AUC (area under the ROC curve) value. The AUC value is the area under the ROC curve; as a single number, it is more intuitive and easier to compare. The AUC value is calculated as follows:

$$\mathrm{AUC} = \int_{0}^{1} \frac{\mathrm{TP}}{P}\, d\,\frac{\mathrm{FP}}{N} = \frac{1}{P \cdot N} \int_{0}^{N} \mathrm{TP}\, d\,\mathrm{FP}. \tag{15}$$

In formula (15), $P$ is the number of positive samples, $N$ is the number of negative samples, TP (true positives) is the number of active compounds that are classified correctly, and FP (false positives) is the number of inactive compounds that are misclassified as active.

For a model that is no worse than random, the AUC value lies between 0.5 and 1.0: a perfect model has an AUC of 1, and a model that only guesses randomly has an AUC of 0.5. If one model is better than another, its AUC value is higher. ROC curves and AUC are not affected by an imbalanced class distribution and do not require the data to be normally distributed. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
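As an illustration of formulas (14) and (15), the following minimal sketch computes the EF at a given sampling fraction and the AUC from a ranked list of scores. It is not the authors' evaluation code: the function names and the toy scores are hypothetical, labels are assumed to be 1 for actives and 0 for decoys, and the AUC is computed as the fraction of (active, decoy) pairs that are ranked correctly, which equals the normalized area in formula (15).

```python
def enrichment_factor(scores, labels, fraction=0.10):
    """EF = (Hits_s / N_s) / (Hit_t / N_t) for the top `fraction` of the ranking."""
    n_total = len(scores)
    n_sampled = max(1, int(round(fraction * n_total)))
    # Rank compounds from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_sampled = sum(label for _, label in ranked[:n_sampled])
    hits_total = sum(labels)
    return (hits_sampled / n_sampled) / (hits_total / n_total)


def roc_auc(scores, labels):
    """AUC as the share of (active, decoy) pairs ranked correctly; ties count 0.5."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


# Hypothetical toy ranking: 2 actives among 10 compounds.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ef_10 = enrichment_factor(toy_scores, toy_labels, fraction=0.10)  # 5.0
auc = roc_auc(toy_scores, toy_labels)                             # 0.9375
```

In the toy ranking, the top 10% (one compound) contains one of the two actives, so EF = (1/1)/(2/10) = 5; 15 of the 16 active-decoy pairs are ordered correctly, so AUC = 0.9375.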

4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) to compare virtual screening performance on the two target proteins. RF, the baseline algorithm, was also developed on the machine learning libraries of the authors' lab; its parameters are listed in Table 4.

Figure 4: Comparative experiment of SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM; x-axis: false positive rate; y-axis: true positive rate).

Figure 5: Comparative experiment of Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM; x-axis: false positive rate; y-axis: true positive rate).

The ROC curves from the comparative experiments on the two data sets, using the ensemble learning methods and the SVM for comparison, are shown in Figures 4 and 5.

Table 5 lists the 10% EF and the AUC value of each method.

To address the problem of sample set quality, the Adaboost method initially gives the same weight to each training sample; the sample weight represents the probability that the sample is used for training by a weak classifier. At each iteration, the Adaboost algorithm modifies the sample weights. If a training sample is correctly classified in this iteration, its weight is reduced; that is, the probability of it being used as a training sample in the next iteration is reduced. Conversely, if the training sample is misclassified by the current base classifier, its weight is increased, and so is the probability of it being used as a training sample in the next iteration. In this way, the subsequent weak classifiers pay more attention to the training samples that are seriously misclassified. The experimental results above show that Adaboost-SVM outperformed RF on both the 10% EF and the AUC for both targets. This observation indicates that ensemble-learning-based virtual screening is resistant to noise when the number of structure samples is limited, and thus better screening results are obtained.

Table 4: Random Forest parameter setting.

| Parameter name | Parameter value |
|---|---|
| Tree number | 1000 |
| Node size | 5 |
| The number of different descriptors tried at each split | 50 |

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

| Algorithm | Target protein | 10% EF | AUC |
|---|---|---|---|
| SVM | SRC | 4.7 | 0.734 |
| SVM | Cathepsin K | 3.9 | 0.683 |
| Adaboost-SVM | SRC | 5.5 | 0.821 |
| Adaboost-SVM | Cathepsin K | 4.8 | 0.802 |
| Random Forest | SRC | 5.3 | 0.805 |
| Random Forest | Cathepsin K | 4.5 | 0.783 |

5. Conclusion

In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. The method uses Pharm-IF to encode protein-ligand interactions in binary form and then uses the improved SVM algorithm, Adaboost-SVM, and Random Forest to classify the data. The idea of ensemble learning in data classification is to obtain a number of weak classifiers that are independent of each other and then to combine them with an effective method. With Adaboost-SVM as the classifier instead of SVM, the 10% EF of the SRC model increased from 4.7 to 5.5 and the AUC value from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value from 0.683 to 0.802. Compared with the plain SVM, Random Forest also performed better on both measures: the 10% EF improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning can obtain better results on the imbalanced data set with satisfactory robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will investigate the reasons in future work. Compared with the traditional method, the proposed method improves the accuracy of the virtual screening model more markedly. In future work, improving the accuracy of virtual screening should be studied further from two aspects: virtual screening theory and computer theory. Although virtual screening is becoming increasingly important, its false positive rate remains an unsolved problem. The speed at which protein structures are determined in the laboratory cannot keep up with the needs of drug development; therefore, virtual screening will often encounter problems similar to the one addressed in this paper, and the algorithm still has much room for improvement. For example, the choice of kernel function directly affects the performance of the classifier. For data sets of this kind, a dedicated kernel function should be designed to fit the characteristics of the data set.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).

References

[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.

[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.

[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.

[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.

[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.

[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.

[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.

[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.

[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.

[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.

[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.

[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.

[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.

[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.

[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.

[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Brömme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.

[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-Protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.



of negative samples are introduced. The accumulation of negative samples can produce an imbalanced data set, which is a common phenomenon and of great interest in bioinformatics studies.

For the prediction of DNA-binding proteins, Song et al. propose an ensemble learning algorithm, imDC, based on an analysis of unbalanced DNA-binding protein data; it outperforms classic classification models such as SVM under the same conditions [7]. Based on the ensemble learning framework, Zou et al. give a new predictor to improve the performance of tRNAscan-SE annotation, and the experimental results show that their algorithm can distinguish functional tRNAs from pseudo-tRNAs [8]. Lin et al. propose merging a K-means static selective strategy and ensemble forward sequential selection in an ensemble learning architecture for the hierarchical classification of protein folds; the accuracy reaches 74.21%, the state of the art at the time of writing [9]. Zou et al. combine synthetic minority oversampling and K-means clustering undersampling to tackle the negative influence of imbalanced data sets [10].

Clearly, bioinformatics studies commonly face imbalanced data sets. The strategies widely used at present include preprocessing the training samples and improving the classifiers. In this paper, we start from the perspective of improving the machine learning algorithm: we introduce the ensemble learning method on top of a simple SVM classifier and use layered combination and iterative weighting to enhance the performance of the classifier, so as to reduce the impact of the quality of the sample set. Meanwhile, this paper introduces Random Forest as the experimental baseline to examine the effect of ensemble learning on virtual screening.

2. The Quantitative Method

With the rapid development of combinatorial chemistry, bioinformatics, molecular biology, and computer science, computer-aided drug design (CADD) is widely used. Virtual screening, one of the most widely used methods in CADD because it is quick and inexpensive, has been gradually replacing high-throughput screening as the main means of drug screening [11]. In this paper, we use the virtual screening method to screen the drug proteins.

2.1. General Process of Virtual Screening. Virtual screening, also known as computer screening, is a prescreening of compound molecules on the computer to reduce the number of compounds that are actually screened and to improve the efficiency of discovering lead compounds. The workflow of the virtual screening process is shown in Figure 1.

Virtual screening includes four steps: the establishment of the receptor model, the generation of small molecule libraries, the computer screening, and the postprocessing of hit compounds.

Figure 1: Virtual screening process (flowchart: ligand database preparation and protein structure preparation, followed by docking, scoring, ranking/selection with drug-likeness analysis, and the hit candidates list).

Step 1 (the establishment of the receptor model). (1) Obtaining the macromolecular structure: preparation of the protein structure is an important step in virtual screening. The crystal structures to be used in virtual screening can be obtained directly from the PDB or modeled from the sequence and structure information of a homologous protein.

(2) Binding site description: choosing the appropriate ligand binding pocket is very important in molecular docking. There are two ways to choose it. (a) The binding site is taken directly from the ligand-receptor complex structure. (b) If there is no complex structure, the binding site must be chosen manually according to experimental information on biological function, such as mutation and binding data.

Step 2 (the generation of small molecule libraries). A conversion program can be used to convert the two-dimensional structures into three-dimensional structures. The resulting 3D structures can be used for docking after hydrogen atoms and charges are added.

Step 3 (docking and scoring). The docking operation places every small molecule at the ligand binding site of the receptor protein, optimizes the conformation and location of the ligand, and determines the best binding mode. The best conformation is scored, all compounds are ranked according to their scores, and the small molecules with the best scores are picked out from the compound library. The docking algorithm aims to predict the conformation of the complex formed by the receptor and the ligand. The purpose of the scoring function is to choose a conformation from the candidate set of conformations according to the score. The closer the docking result is to the native complex, the lower the score given by the scoring function.

Figure 2: The calculation process of Pharm-IF [14].

However, no scoring function is completely correct. So far, the scoring functions used in the various existing docking algorithms are only approximations to the correct scoring function.

Step 4 (postprocessing of hit compounds). Relying only on a simple scoring model can lead to large differences in the final results and sometimes to wrong judgments, so the final results must be analyzed from multiple perspectives and postprocessed. The purpose of this analysis and postprocessing is to assess protein-ligand binding free energies as accurately as possible. The generated set of candidate complexes is classified, and erroneous results are identified.

Because the accuracy problem of scoring functions in virtual screening has not been properly resolved, in this paper we use the protein-ligand interaction fingerprint (IFP) to handle the interactions between target proteins and ligands. The IFP encodes the observed interactions between ligand and protein into a binary string of fixed length [12]. The IFP method was originally designed for analyzing ligand docking poses in protein kinases. Based on this method, the atom-based IFP concept was put forward and extended. Each kind of IFP, whether residue-based or atom-based, has its own characteristics. A one-dimensional interaction fingerprint is easier to generate and to compare than the 3D structure of the protein-ligand complex, and it is more suitable for computer-aided drug design [13].

2.2. The Concept and Calculation Process of Pharm-IF. In this paper, we use an atom-based fingerprint, Pharm-IF, to verify the proposed approach. The concept of Pharm-IF was put forward by Sato et al. [14]. Pharm-IF is calculated from the distances between pairs of ligand pharmacophore features that interact with protein atoms, and it can detect important geometrical patterns of ligand pharmacophores.

The calculation of Pharm-IF can be divided into the following three steps, as Figure 2 shows.

Step 1. Detect the protein-ligand interactions from the complex structures. Interactions are classified into six types: (1) hydrogen bond with ligand acceptor; (2) hydrogen bond with ligand donor; (3) hydrogen bond in which the roles of the ligand and protein atoms could not be determined; (4) ionic interaction with ligand cation; (5) ionic interaction with ligand anion; (6) hydrophobic interaction.

Step 2. Create all possible interaction pairs. Each interaction pair is characterized by the pharmacophore features of the ligand atoms and the distance between them. To calculate the result matrix, each interaction pair is assigned to the corresponding distance bins. For example, for a pair of two hydrogen bonds in which the ligand atoms are a donor and an acceptor 4.3 Å apart, the pair is assigned to the vector of the donor-acceptor hydrogen bond pair type: to describe the distance of 4.3 Å, 0.7 is assigned to the bin of 4 Å and 0.3 to the bin of 5 Å.

Step 3. The result matrix is calculated by summing the values of all interaction pairs. The formulas of the Pharm-IF calculation are as follows:

$$H_{tk} = \sum_{i \in I_t} A_k(i), \tag{1}$$

$$A_k(i) = \begin{cases} 0 & \text{if } |k - d_i| \ge 1, \\ 1 - |k - d_i| & \text{otherwise.} \end{cases} \tag{2}$$

In formula (1), $H$ is the interaction fingerprint of a protein-ligand complex computed by Pharm-IF; $t$ is a pair type formed from the six types of pharmacophore features; $k = 1, 2, 3, \ldots$ indexes the bins of the distance (Å) between ligand atoms; $I_t$ is the whole set of interaction pairs classified as type $t$; and $i$ is an element of $I_t$. In formula (2), $d_i$ is the distance (Å) between the ligand atoms of $i$.
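Formulas (1) and (2) can be illustrated with a minimal sketch. It is not the implementation of Sato et al. [14]: the function names, the number of bins (12), the use of unordered pair types, and the toy coordinates are assumptions made for illustration only.

```python
import math
from itertools import combinations


def bin_weights(distance, n_bins):
    """Formula (2): A_k = 1 - |k - d| if |k - d| < 1, else 0, for k = 1..n_bins."""
    return [max(0.0, 1.0 - abs(k - distance)) for k in range(1, n_bins + 1)]


def pharm_if(interactions, n_bins=12):
    """Formula (1): accumulate the bin weights of every interaction pair.

    `interactions` is a list of (feature_type, (x, y, z)) entries, one per
    interacting ligand atom. The result maps each unordered pair of feature
    types to a list of n_bins values (index 0 holds the 1 Å bin).
    """
    fingerprint = {}
    for (type_a, xyz_a), (type_b, xyz_b) in combinations(interactions, 2):
        pair_type = tuple(sorted((type_a, type_b)))
        distance = math.dist(xyz_a, xyz_b)  # distance between the ligand atoms, in Å
        row = fingerprint.setdefault(pair_type, [0.0] * n_bins)
        for index, weight in enumerate(bin_weights(distance, n_bins)):
            row[index] += weight
    return fingerprint


# Hypothetical example: a donor and an acceptor hydrogen bond 4.3 Å apart.
toy_interactions = [
    ("Hbond(D)", (0.0, 0.0, 0.0)),
    ("Hbond(A)", (4.3, 0.0, 0.0)),
]
toy_fingerprint = pharm_if(toy_interactions)
donor_acceptor_row = toy_fingerprint[("Hbond(A)", "Hbond(D)")]
```

For this pair, the 4 Å bin receives about 0.7 and the 5 Å bin about 0.3, as in the example of Step 2; all other bins stay at zero.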


2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the screening targets. Both proteins are hot spots among pharmaceutical drug targets, and neither has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses influence the virtual screening efficiency.

Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase that in humans is encoded by the SRC gene. The SRC kinase family is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC is widely present in tissue cells and plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in signal transduction pathways. c-Src is made up of 6 functional regions: the SRC homology 4 (SH4) domain, the unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, the phosphorylated Tyr527 (tyrosine 527) interacts with the SH2 domain within the molecule. At the same time, the SH3 domain binds to the proline-rich SH2-kinase linker. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these interactions are broken and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and it is closely associated with a variety of cancers. Activation of the c-Src pathway has been observed in about 50% of tumors of the colon, liver, lung, breast, and pancreas. A number of tyrosine kinase inhibitors that target c-Src (as well as related tyrosine kinases) have been developed and put into use [15].

Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2, the transcript is 1.7 kb long, and it consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. During bone resorption, acid dissolves the hydroxyapatite, and the organic components of the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acidic environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. The function of Cathepsin K in osteoclasts is well recognized; therefore, many laboratories treat it as a drug target and study its inhibitors for the treatment of osteoporosis. At present, the first choice of antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, long-term use of bisphosphonates may produce adverse reactions such as esophageal irritation, hypocalcaemia, and kidney irritation. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation, so new replacement drugs are of considerable value [16].

3. Classification Algorithm Based on the Adaboost-SVM

SVM can currently deal with many problems, such as small sample sizes, nonlinearity, and high dimensionality. Based on statistical learning theory, it has a simple mathematical form, a fast training method, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. Provided that the quality of the sample set is not very low, a good result can be obtained even without any improvement. The learning mechanism of SVM also leaves considerable room for improving the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which yields only global minima and hence avoids being trapped in local minima; therefore, in this paper we use SVM as the base classifier. There is a large body of literature on SVM, so this paper gives only a brief introduction. The basic process of SVM classification is as follows.

For a given sample set $L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in R^d$ and $y_i \in \{1, -1\}$, $i = 1, 2, \ldots, n$, $y_i$ is the category of sample $x_i$, $d$ is the dimension of the samples, and $n$ is the number of training samples. If the input vector set is linearly separable, it can be separated by a hyperplane. The hyperplane can be expressed as

$$w \cdot x + b = 0, \tag{3}$$

where $w$ is the normal vector of the hyperplane and $b$ is the offset. The SVM learning problem is to minimize the objective function

$$\min \phi(w) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n} \xi_i\right) \tag{4}$$

subject to the constraints

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n. \tag{5}$$

Here $(1/2)\|w\|^2$ is the structural complexity, $C(\sum_{i=1}^{n} \xi_i)$ is the empirical risk, $\xi_i$ is the slack variable, and $C$ is a constant, the penalty factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vectors into a high-dimensional feature space and to construct an optimal hyperplane in that space.

Let $\phi$ be the mapping that takes $x$ from $R^d$ into the feature space $H$:

$$x \rightarrow \phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_i(x))^{T}. \tag{6}$$

Eventually, we obtain the optimal classification function

$$f(x) = \operatorname{sgn}(w \cdot \phi(x) + b) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i\, \phi(x_i) \cdot \phi(x) + b\right). \tag{7}$$

In our work, the radial basis function (RBF) is taken as the kernel function of the SVM; its mathematical form is

$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}. \tag{8}$$
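The kernel in formula (8) can be written directly as a short function. This is a minimal sketch for illustration; the function name and the toy vectors are hypothetical.

```python
import math


def rbf_kernel(x_i, x_j, sigma):
    """Formula (8): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Identical vectors give 1; the value decays as the vectors move apart,
# and it decays faster for a smaller Gaussian width sigma.
same = rbf_kernel([1.0, 0.0], [1.0, 0.0], sigma=1.0)
wide = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
narrow = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.1)
```

The comparison of `wide` and `narrow` shows why the width matters: with a small σ, two samples that are only moderately far apart already look almost unrelated to the kernel.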

Compared with other classifiers, the SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set destabilize the classifier, and this instability directly affects the final classification result. In this paper, we introduce the Adaboost mechanism of ensemble learning to divide one classification process into several layers of SVM-based weak classifiers.

The key to combining Adaboost and SVM is to find a suitable Gaussian width σ for each component classifier. If σ is relatively large, the component classifier is too weak and the final classification performance decreases. On the other hand, if σ is relatively small, the component classifier becomes too strong, the errors of the component classifiers are highly correlated, and the diversity among them is small, so that ensemble learning becomes ineffective. More importantly, a σ value that is too small leads to overfitting and greatly reduced generalization. Therefore, in this paper the standard deviation of the sample set of each component classifier is used as its σ value to control the classification accuracy of the component classifier; in this way the SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so some key parameters need to be explained. The values of σ, C, and other parameters are listed in Table 1.

The specific process of the algorithm is as follows.

(1) RBFSVM denotes the SVM with the RBF kernel; $T$ denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weight of each sample, $D_1(i) = 1/n$, $i = 1, 2, \ldots, n$.

(3) For $t = 1, 2, \ldots, T$: for each weak classifier $h_i$, calculate the weighted error

$$\varepsilon_t^{(i)} = \sum_{j=1}^{n} D_t(j) \left| h_i(x_j) - y_j \right| \tag{9}$$

Choose the classifier with the lowest weighted error rate, denoted $\varepsilon_t$, and save its corresponding SVM model as $h_t$. Calculate the weight of the selected weak classifier:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) \tag{10}$$

Update the sample weights according to $\alpha_t$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} F(E_t) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t}, & \text{if } h_t(x_i) = y_i, \\ e^{\alpha_t}, & \text{if } h_t(x_i) \neq y_i. \end{cases} \tag{11}$$

Table 1: SVM parameter setting.

Parameter name                                                Parameter value
SVM type                                                      C-SVM
Class number                                                  2
Kernel function                                               RBF
The degree in kernel function                                 3
σ in kernel function                                          0.001
Cost factor                                                   5
Cache size                                                    500 MB
Tolerance in the termination criteria                         0.001
The weight value of penalty factor for all kinds of samples   1
Cross validation                                              5

Table 2: Adaboost-SVM parameter setting.

Parameter name                    Parameter value
Ensemble learning type            Adaboost
Class number                      2
The basic classifier type         C-SVM
Number of classifiers per layer   100
The max false alarm rate          0.5
The min hit rate                  0.9
Number of iterations              5
Weight trim rate                  0.9
Cache size                        500 MB

The weights are normalized by $Z_t$ so that they sum to one at every iteration:

$$\sum_{i=1}^{n} D_t(i) = 1 \tag{12}$$

(4) Combine the SVM weak classifiers into the strong classifier $H$ and apply it to the training set:

$$H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} \alpha_t h_t(x)\right] \tag{13}$$
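The boosting loop of steps (2) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple threshold classifier (decision stump) stands in for the RBF-SVM component classifier so that the example needs no external library, the error of equation (9) is computed as the weighted fraction of misclassified samples (the usual Adaboost convention), and the toy data and all function names are hypothetical.

```python
import math

def train_stump(samples, labels, weights):
    """Stand-in weak learner (an assumption, not the paper's RBF-SVM).

    Tries every threshold on every feature and returns the stump with the
    lowest weighted misclassification error as (error, predict_function).
    """
    best_error, best_predict = None, None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for polarity in (1, -1):
                def predict(x, f=feature, th=threshold, p=polarity):
                    return p if x[f] >= th else -p
                error = sum(w for s, y, w in zip(samples, labels, weights)
                            if predict(s) != y)
                if best_error is None or error < best_error:
                    best_error, best_predict = error, predict
    return best_error, best_predict

def adaboost_train(samples, labels, rounds):
    n = len(samples)
    weights = [1.0 / n] * n                               # step (2): uniform start
    ensemble = []                                         # (alpha_t, h_t) pairs
    for _ in range(rounds):                               # step (3)
        error, h = train_stump(samples, labels, weights)
        if error >= 0.5:                                  # no better than chance: stop
            break
        error = max(error, 1e-12)                         # avoid division by zero
        alpha = 0.5 * math.log((1.0 - error) / error)     # equation (10)
        weights = [w * math.exp(-alpha if h(s) == y else alpha)
                   for s, y, w in zip(samples, labels, weights)]  # equation (11)
        z = sum(weights)                                  # normalization factor Z_t
        weights = [w / z for w in weights]                # equation (12)
        ensemble.append((alpha, h))
    return ensemble, weights

def adaboost_predict(ensemble, x):
    """Equation (13): sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical one-dimensional toy data that no single stump can separate.
samples = [(float(v),) for v in range(10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, weights = adaboost_train(samples, labels, rounds=3)
predictions = [adaboost_predict(ensemble, s) for s in samples]
print(predictions == labels, len(ensemble))
```

Swapping `train_stump` for a routine that fits an RBF-SVM on the weighted samples would give the Adaboost-SVM structure described above; the weight update and the final vote stay the same.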

For the Adaboost-SVM, we set the parameters listed in Table 2; the basic SVM parameters are the same as in Table 1.

Thus, compared to a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent of the others and that the probability of sample misclassification be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, and this can also reduce the impact of the quality of the sample set on the virtual screening effect.

4. Experiment and Analysis

In order to verify the validity of the method proposed in this paper, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe database. The enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and of the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.

All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.

4.1. Data Set. In order to cooperate with machine learning, in this paper we construct training sets and test sets for the two target proteins, which include the decoys and known active compounds. In the training set, we selected the experimentally determined complex structures of these two proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world and are obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds, respectively, from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB among protein-inhibitor complexes with high inhibitor activity and high crystal structure resolution. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor; resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491; resolution 2.20 Å) was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. To prepare the docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions. Pipeline Pilot (SciTegic) was used to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Then, the additional ring conformations of the compounds were generated by LigPrep. The GLIDE score was then used to select five poses for each compound from the docking results. In this paper, five docking poses of each active compound were chosen by the GLIDE score as the positive examples, because in a preliminary test this procedure generated higher enrichment factors than using one pose of each active compound. Other settings of GLIDE were set to the default values. We use the averages of the 10% EF and of the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. The docking poses of 2000 decoy compounds were then selected randomly as the negative samples. Each decoy compound was docked to the target proteins in the same way as mentioned above.

Table 3: Experimental data structure.

Data set       Positive samples   Negative samples    Total
Training set   100                2000                2100
Test set       100 × 5 = 500      2000 × 5 = 10000    10500

After the completion of the training set, we set out to build the test set. First, the active compounds of the target proteins (IC50 ≤ 10 μM) were chosen from StARLITe and divided into 100 clusters. The dividing strategy is hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses for each active compound were used as positive samples of the test set. The docking procedure and the target protein crystal structures are the same as those mentioned above. Then, the negative-sample selection strategy of the training set was used to choose the decoys for the test set.

After the completion of the data preparation, we use Pharm-IF to quantify the training set. The data are then treated as the input of the machine learning algorithms to obtain the corresponding learning models. The learning models obtained are then applied to the test sets, respectively.

From Table 3, it can be seen that the ratio of negative samples to positive samples is up to 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.

4.2. The Evaluation Index of the Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indices [17]. The enrichment factor (EF) is one of the most widely used measures for evaluating the screening efficiency. EF is usually used to evaluate the early recognition properties of a screening method; it indicates the ratio of the number of active compounds obtained by in silico screening to that obtained by random selection at a predefined sampling percentage. The calculation method of EF is as follows:

$$\mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hit}_t / N_t} \tag{14}$$

Here, $\mathrm{Hits}_s$ represents the number of active compounds in the sample, $\mathrm{Hit}_t$ is the number of active compounds tested, $N_s$ is the number of compounds sampled, and $N_t$ is the number of all compounds. In actual drug discovery, only a small part of the compounds is selected by computer. In general, 0.01–1% of the compounds will be selected from a huge compound database (10,000–1,000,000 compounds) in the actual virtual screening process. Since the size of the test sets in this experiment is far from reaching this order of magnitude, we use the 10% EF to carry out this test, in order to reduce the deviation of the early evaluation.

Figure 3: Two ROC curves, $X$ and $Y$ (true positive rate plotted against false positive rate).

EF only reflects the screening efficiency at one specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire range of sampling (0–100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in the ROC curve the true positive rate (TPR) is plotted along the $y$-axis, while the false positive rate (FPR) is displayed on the $x$-axis. Although the ROC curve can directly reflect the effect of the machine learning model, we also need a numerical measure, the AUC (area under the ROC curve) value, to evaluate the effect of the model in practical applications. The AUC value indicates the area under the ROC curve, and it is more intuitive and accurate. The calculation method of the AUC value is as follows:

$$\mathrm{AUC} = \int_{0}^{1} \frac{\mathrm{TP}}{P}\, d\!\left(\frac{\mathrm{FP}}{N}\right) = \frac{1}{P \cdot N} \int_{0}^{N} \mathrm{TP}\, d\mathrm{FP} \tag{15}$$

Among the variables of formula (15), $P$ stands for the number of positive samples, $N$ represents the number of negative samples, TP (true positive) stands for the active compounds that are classified correctly, and FP (false positive) stands for the inactive compounds that are misclassified as active.

AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and the AUC are not affected by an imbalanced class distribution or by whether the data are normally distributed. In addition, the AUC value allows intermediate states, and experimental results can be divided into multiple ordered classes.
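Both evaluation indices can be computed directly from a ranked list of compounds. The sketch below is a minimal illustration, assuming that each compound has a classifier score (higher meaning more likely active) and a known activity flag; the AUC is obtained through the equivalent pairwise formulation, that is, the fraction of (active, decoy) pairs in which the active compound is ranked higher. The function names and the toy scores are hypothetical.

```python
def enrichment_factor(scores, is_active, fraction):
    """Equation (14): (Hits_s / N_s) / (Hit_t / N_t) for the top `fraction` of the ranking."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_total = len(ranked)
    n_sampled = max(1, int(round(n_total * fraction)))
    hits_sampled = sum(active for _, active in ranked[:n_sampled])
    hits_total = sum(is_active)
    return (hits_sampled / n_sampled) / (hits_total / n_total)

def auc(scores, is_active):
    """Area under the ROC curve via pairwise comparison (ties count as one half)."""
    positives = [s for s, a in zip(scores, is_active) if a]
    negatives = [s for s, a in zip(scores, is_active) if not a]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores for ten compounds, three of which are active.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_active = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

ef_20 = enrichment_factor(scores, is_active, fraction=0.2)
auc_value = auc(scores, is_active)
print(ef_20, auc_value)
```

In this toy ranking the top 20% (two compounds) are both active while only 30% of all compounds are active, so the enrichment factor exceeds 1; one decoy outranks one active compound, so the AUC falls slightly below 1.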

4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) to carry out the virtual screening experiments on the two target proteins. As the baseline algorithm, RF was also developed based on the machine learning libraries of the authors' lab, and the parameters of RF are listed in Table 4.

The ROC curves from the comparative experiments on the two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.

Figure 4: Comparative experiment of SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM; true positive rate plotted against false positive rate).

Figure 5: Comparative experiment of Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM; true positive rate plotted against false positive rate).

Table 5 shows the 10% EF and the AUC values of the compared methods.

To address the problem of sample set quality, the Adaboost method initially gives the same weight to each training sample; the sample weight represents the probability that the sample is used as training data by a weak classifier. At each iteration, the Adaboost algorithm modifies the weights of the samples: if a training sample is correctly classified in this iteration, its weight is reduced, that is, the probability of it being used as a training sample in the next iteration is reduced. On the contrary, if the training sample is misclassified by the current base classifier, its weight is increased, and so is the probability of it being used as a training sample in the next iteration. In this way, the weak classifiers pay more attention to the seriously misclassified samples of the training set. The experimental results above show that Adaboost-SVM has notably outperformed RF. This observation indicates that, on both 10% EF and AUC, ensemble learning based virtual screening has shown its ability of noise resistance when the number of structure samples is limited; thus, better screening results are obtained.

Table 4: Random Forest parameter setting.

Parameter name                                            Parameter value
Tree number                                               1000
Node size                                                 5
The number of different descriptors tried at each split   50

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm       Target protein   10% EF   AUC
SVM             SRC              4.7      0.734
SVM             Cathepsin K      3.9      0.683
Adaboost-SVM    SRC              5.5      0.821
Adaboost-SVM    Cathepsin K      4.8      0.802
Random Forest   SRC              5.3      0.805
Random Forest   Cathepsin K      4.5      0.783

5. Conclusion

In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method mainly uses Pharm-IF to encode protein-ligand interactions in a binary form and then uses the improved SVM algorithm Adaboost-SVM and Random Forest to classify the data. The idea of ensemble learning in dealing with data classification problems is to obtain a number of weak classifiers that are independent of each other and then use an effective method to combine these independent weak classifiers. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF for the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed from the results that, compared with the naïve SVM, Random Forest obtained better performance on both 10% EF and AUC: the 10% EF is improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC is improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest has shown that ensemble learning is able to get better results on the imbalanced data set with satisfying robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method brings a more significant improvement in the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of the false positive rate of virtual screening is still to be solved. The speed of laboratory determination of protein structures has been unable to catch up with the needs of drug development; therefore, virtual screening will often encounter problems similar to the one addressed in this paper, and there is still much room for improvement in the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier. In view of the problem of this kind of data set, a special kernel function should be designed to adapt to the characteristics of the data set.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).

References

[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.

[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.

[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.

[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.

[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.

[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.

[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.

[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.

[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.

[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.

[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.

[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.

[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.

[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.

[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.

[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Brömme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.

[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.


Figure 2: The calculation process of Pharm-IF [14].

However, there is no completely correct scoring function. So far, all kinds of scoring functions used in the existing docking algorithms are only approximations to the correct scoring function.

Step 4 (postprocessing of hit compounds). Relying only on the scoring model can lead to large differences in the final results and sometimes to wrong judgments, so the final results must be analyzed from multiple perspectives and postprocessed. The purpose of this analysis and postprocessing is to assess the protein-ligand binding free energies as accurately as possible. The generated set of candidate complexes is classified, and the erroneous results are identified.

As the accuracy of the scoring function in virtual screening has not been properly resolved, in this paper we use the protein-ligand interaction fingerprint (IFP) to deal with the interactions between target proteins and ligands. The IFP encodes the observed interactions between ligand and protein into a binary string of fixed length [12]. The IFP method was originally designed for analyzing ligand docking poses in protein kinases. Based on this method, the atom-based IFP concept was put forward and extended. Each kind of IFP has its own characteristics, whether it is a residue-based IFP or an atom-based IFP. A one-dimensional interaction fingerprint is easier to generate and compare than the 3D structure of a protein-ligand complex, and it is more suitable for computer-aided drug design [13].

2.2. The Concept and Calculation Process of Pharm-IF. In this paper, we use a kind of atom-based fingerprint, Pharm-IF, as an aid to verify the theory in this paper. The concept of Pharm-IF was put forward by Sato et al. [14]. The Pharm-IF is calculated from the distances of pairs of ligand pharmacophore features that interact with protein atoms, and it can detect important geometrical patterns of ligand pharmacophores.

The calculation of Pharm-IF can be divided into the following three steps, as Figure 2 shows.

Step 1. Detect the protein-ligand interactions from the complex structures. Interactions can be classified into six types: (1) hydrogen bond with ligand acceptor, (2) hydrogen bond with ligand donor, (3) hydrogen bond in which the roles of ligand and protein atoms could not be determined, (4) ionic interaction with ligand cation, (5) ionic interaction with ligand anion, and (6) hydrophobic interaction.

Step 2. Create all possible interaction pairs. Each interaction pair is characterized by the pharmacophore features of the ligand atoms and their distance. To calculate the resulting matrix, each interaction pair is assigned to the corresponding bins. For example, an interaction pair of two hydrogen bonds whose ligand atoms are a donor and an acceptor that are 4.3 Å apart from each other is assigned to the vector corresponding to this hydrogen bond pair type; to describe the distance of 4.3 Å, 0.7 is assigned to the bin of 4 Å and 0.3 is assigned to the bin of 5 Å.

Step 3. The result matrix is calculated by summing the values of all of the interaction pairs. The formula of the Pharm-IF calculation is as follows:

$$H_{tk} = \sum_{i \in I_t} A_k(i) \tag{1}$$

$$A_k(i) = \begin{cases} 0, & \text{if } |k - d_i| \ge 1, \\ 1 - |k - d_i|, & \text{otherwise.} \end{cases} \tag{2}$$

In formula (1), $H$ stands for the interaction fingerprint of a protein-ligand complex computed by Pharm-IF, $t$ is the pair of the six types of pharmacophore features, $k = 1, 2, 3, \ldots$ stands for the corresponding bins of the distances (Å) between ligand atoms, $I_t$ represents the whole set of interaction pairs classified as type $t$, and $i$ represents an element of $I_t$. In formula (2), $d_i$ represents the distance between the ligand atoms of $i$ (Å).
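To make formulas (1) and (2) concrete, the following Python sketch bins a list of interaction pairs into a Pharm-IF-style matrix. It is an illustration only: the function names, the pair-type label, the data layout (a list of `(pair_type, distance)` tuples), and the choice of 12 distance bins are assumptions made for the example.

```python
from collections import defaultdict

def bin_weight(k, distance):
    """Formula (2): A_k = 0 if |k - d| >= 1, otherwise 1 - |k - d|."""
    gap = abs(k - distance)
    return 0.0 if gap >= 1.0 else 1.0 - gap

def pharm_if(interaction_pairs, n_bins):
    """Formula (1): H[t][k] is the sum of A_k(i) over all pairs i of type t.

    `interaction_pairs` is a list of (pair_type, distance_in_angstrom) tuples.
    The returned dict maps each pair type to a list whose index k - 1 holds
    the value of bin k (k = 1, ..., n_bins).
    """
    fingerprint = defaultdict(lambda: [0.0] * n_bins)
    for pair_type, distance in interaction_pairs:
        for k in range(1, n_bins + 1):
            fingerprint[pair_type][k - 1] += bin_weight(k, distance)
    return dict(fingerprint)

# One donor-acceptor hydrogen bond pair whose ligand atoms are 4.3 angstroms apart.
pairs = [("Hbond(D)-Hbond(A)", 4.3)]
fingerprint = pharm_if(pairs, n_bins=12)
row = fingerprint["Hbond(D)-Hbond(A)"]
print(round(row[3], 2), round(row[4], 2))  # values of the 4 angstrom and 5 angstrom bins
```

Because each distance is shared linearly between its two neighboring bins, every interaction pair contributes a total weight of 1 to its row, which reproduces the 0.7 and 0.3 split described in Step 2.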

2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. These two proteins are hotspots in the field of pharmaceutical drug targets, and neither of them has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses will influence the virtual screening efficiency.

Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase protein that in humans is encoded by the SRC gene. The SRC kinase family is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC exists widely in tissue cells, and it plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in the signal transduction pathways. The c-Src protein is made up of 6 functional regions: the SRC homology 4 (SH4) domain, the unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, the phosphorylated Tyr527 (tyrosine 527) interacts intramolecularly with the SH2 domain. At the same time, the SH3 domain binds to the proline-rich SH2-kinase linker domain. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these interactions are broken and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and this protein is closely associated with a variety of cancers. Activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. As a drug target, a number of tyrosine kinase inhibitors targeting the c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].

Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2, the length of the transcript is 1.7 kb, and it consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. In the process of bone resorption, acid dissolves the hydroxyapatite, and the organic components of the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acidic environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. The function of Cathepsin K in osteoclasts has been recognized; therefore, many labs treat it as a drug target, with its inhibitors intended for the treatment of osteoporosis. At present, the first choice for antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, long-term use of bisphosphonates may produce adverse reactions, such as esophageal irritation symptoms, hypocalcaemia, and kidney irritation. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation at the same time, so new replacement therapy drugs are more meaningful [16].

3. Classification Algorithm Based on the Adaboost-SVM

Currently, SVM can deal with many problems such as small sample size, nonlinearity, and high dimensionality. Based on statistical learning theory, it has a simple mathematical form, a fast training method, and good generalization performance. It has been widely used in data mining problems such as pattern recognition, function estimation, and time series prediction. Provided that the quality of the sample set is not very low, a good result can be obtained even without any improvements. The learning mechanism of SVM also leaves a lot of room to improve the classification model. In addition, one major advantage of SVM is its use of convex quadratic programming, which yields only global minima and hence avoids being trapped in local minima; therefore, in this paper we use SVM as the base classifier. There is a large body of literature about the SVM, so this paper only gives a brief introduction. The basic process of SVM classification is as follows.

For a given sample set $L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in R^d$ and $y_i \in \{1, -1\}$, $i = 1, 2, \ldots, n$, $y_i$ stands for the category of sample $x_i$, $d$ is the dimension of the samples, and $n$ is the number of training samples. If the input vector set is linearly separable, then it can be separated by a hyperplane. The hyperplane can be expressed as

$$w \cdot x + b = 0 \tag{3}$$

where $w$ is the normal vector of the hyperplane and $b$ is the offset. The SVM learning problem is to minimize the objective function

$$\min \phi(w) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{n} \xi_i\right) \tag{4}$$

subject to the condition

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n \tag{5}$$

Here, $\frac{1}{2}\|w\|^2$ is the structure complexity, $C\left(\sum_{i=1}^{n} \xi_i\right)$ stands for the empirical risk, and $\xi_i$ denotes the slack variable. $C$ is a constant, the penalty factor for wrongly classified samples. For the linearly inseparable case, the main idea of SVM is to map the feature vectors into a high-dimensional feature space and construct an optimal hyperplane in that feature space.
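The roles of the slack variables and of the penalty factor $C$ can be illustrated with a small Python sketch that evaluates objective (4) for a fixed hyperplane. It assumes the standard additional requirement $\xi_i \ge 0$ (not stated explicitly above), so that the smallest slack satisfying condition (5) is $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$. The function name and the toy values are hypothetical.

```python
def soft_margin_objective(w, b, samples, labels, C):
    """Value of objective (4) using the smallest slacks allowed by condition (5).

    Assumes xi_i >= 0, so xi_i = max(0, 1 - y_i * (w . x_i + b)).
    Returns (objective_value, slacks).
    """
    slacks = []
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    complexity = 0.5 * sum(wi * wi for wi in w)   # (1/2) * ||w||^2
    return complexity + C * sum(slacks), slacks

# Toy hyperplane x_1 = 0 and three samples; the second one lies inside the margin.
w, b = (1.0, 0.0), 0.0
samples = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1]
objective, slacks = soft_margin_objective(w, b, samples, labels, C=5.0)
print(objective, slacks)
```

Only the sample inside the margin receives a nonzero slack, and a larger $C$ makes that violation more expensive relative to the complexity term.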

The transformation $\phi$ maps $x$ from the space $R^d$ into the feature space $H$:

$$x \longrightarrow \phi(x) = \left(\phi_1(x), \phi_2(x), \ldots, \phi_i(x)\right)^{T} \tag{6}$$

Eventually we can decide optimization classificationfunction

119891 (119909) = sgn (119908 sdot 120601 (119909) + 119887)

= sgn(

119899

sum

119894=1

119886119894119910119894120601 (119909119894) sdot 120601 (119909) + 119887)

(7)

Computational and Mathematical Methods in Medicine 5

In our work, the Radial Basis Function (RBF) is taken as the kernel function of the SVM; the mathematical description of this kernel is given below:

$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)} \qquad (8)$$
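As an illustration only, the RBF kernel (8) and the decision function (7) can be sketched in a few lines of Python. The sketch relies on the standard kernel substitution, in which the inner product $\phi(x_i) \cdot \phi(x)$ in (7) is replaced by $K(x_i, x)$. The function names `rbf_kernel` and `svm_decide`, as well as the toy support vectors, coefficients, and offset, are hypothetical and are not taken from the program used in this paper.

```python
import math


def rbf_kernel(xi, xj, sigma):
    """RBF kernel of (8): exp(-||xi - xj||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svm_decide(x, support_vectors, labels, coeffs, b, sigma):
    """Kernel form of (7): sign of sum_i a_i * y_i * K(x_i, x) + b."""
    score = sum(a * y * rbf_kernel(sv, x, sigma)
                for sv, y, a in zip(support_vectors, labels, coeffs))
    return 1 if score + b >= 0 else -1


# Hypothetical toy model: two support vectors with made-up coefficients.
svs = [(0.0, 0.0), (2.0, 2.0)]
ys = [1, -1]
coeffs = [1.0, 1.0]

near_first = svm_decide((0.1, 0.0), svs, ys, coeffs, b=0.0, sigma=1.0)
near_second = svm_decide((2.0, 1.9), svs, ys, coeffs, b=0.0, sigma=1.0)
print(near_first, near_second)
```

In this toy setting a query point is assigned the label of the support vector it lies closest to, which is the behavior expected from a Gaussian kernel with equal coefficients.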

Relative to other classifiers, the SVM is more stable and less affected by the quality of the sample set. However, class imbalance and high complexity of the data set will destabilize the classifier, and this instability directly affects the final classification result. In this paper, we introduce the Adaboost mechanism from ensemble learning to divide one classification process into several layers of weak classifiers based on SVM.

The key to combining Adaboost and SVM is to find a suitable Gaussian width $\sigma$ for each component classifier. If $\sigma$ is relatively large, the component classifier is too weak and the final classification performance decreases. On the other hand, if $\sigma$ is relatively small, the component classifiers become too strong and their errors are highly correlated; the diversity among them is then small, so the ensemble learning becomes ineffective. Even more importantly, a $\sigma$ that is too small leads to overfitting and greatly reduces generalization. Therefore, in this paper the standard deviation of the sample set of each component classifier is used as the $\sigma$ value of that component classifier to control its classification accuracy; thus the SVM-based Adaboost classifier is obtained. The program used in this paper is not open source, so we need to explain some key parameters. We list the values of $\sigma$, $C$, and other parameters in Table 1.

The specific process of the algorithm is as follows.

(1) RBFSVM denotes the SVM with the RBF kernel; $T$ denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weight of each sample, $D_1(i) = 1/n$, $i = 1, 2, \ldots, n$.

(3) For $t = 1, 2, \ldots, T$: for each candidate component classifier $h$, calculate the weighted error

$$\varepsilon_t = \sum_{j=1}^{n} D_t(j) \, \frac{|h(x_j) - y_j|}{2} \qquad (9)$$

where the factor $1/2$ makes each misclassified sample contribute exactly its weight, since $|h(x_j) - y_j| = 2$ for a misclassified sample when the labels are in $\{1, -1\}$.

Choose the component classifier $h_t$ with the lowest weighted error rate $\varepsilon_t$ and save its corresponding SVM model. Calculate the weight of the selected weak classifier:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) \qquad (10)$$

Update the sample weights according to $\alpha_t$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} F(E_t) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t}, & \text{if } h_t(x_i) = y_i, \\ e^{\alpha_t}, & \text{if } h_t(x_i) \ne y_i. \end{cases} \qquad (11)$$

Table 1: SVM parameter setting.

Parameter name                                                 Parameter value
SVM type                                                       C-SVM
Class number                                                   2
Kernel function                                                RBF
The degree in kernel function                                  3
σ in kernel function                                           0.001
Cost factor                                                    5
Cache size                                                     500 MB
Tolerance in the termination criteria                          0.001
The weight value of penalty factor for all kinds of samples    1
Cross validation                                               5

Table 2: Adaboost-SVM parameter setting.

Parameter name                     Parameter value
Ensemble learning type             Adaboost
Class number                       2
The basic classifier type          C-SVM
Number of classifiers per layer    100
The max false alarm rate           0.5
The min hit rate                   0.9
Number of iterations               5
Weight trim rate                   0.9
Cache size                         500 MB

The weights are normalized by $Z_t$ so that

$$\sum_{i=1}^{n} D_{t+1}(i) = 1 \qquad (12)$$

(4) Apply the strong classifier $H$, which integrates the SVM weak classifiers, to the training set:

$$H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} \alpha_t h_t(x)\right] \qquad (13)$$

For the Adaboost-SVM, we set the parameters as listed in Table 2; the basic SVM parameters are the same as in Table 1.
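The bookkeeping of steps (2) to (4) can be sketched as follows. This is a minimal illustration of the weight formulas (10) to (13) only: it does not train any SVM, the weighted error is taken here as the total weight of the misclassified samples (an assumption about how the error is counted for labels in $\{1, -1\}$), and all function names and the four-sample example are hypothetical.

```python
import math


def weighted_error(weights, predictions, labels):
    """Total weight of the misclassified samples (assumed reading of the error)."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)


def classifier_weight(eps):
    """Weight of a component classifier as in (10); requires 0 < eps < 1."""
    return 0.5 * math.log((1.0 - eps) / eps)


def update_weights(weights, predictions, labels, alpha):
    """Reweight as in (11), then normalize so the weights sum to 1 as in (12)."""
    raw = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(raw)  # normalization factor Z_t
    return [r / z for r in raw]


def strong_classify(alphas, votes):
    """Weighted vote of (13); votes[t] is h_t(x) in {1, -1}."""
    return 1 if sum(a * v for a, v in zip(alphas, votes)) >= 0 else -1


# Hypothetical round: four equally weighted samples, the second one misclassified.
labels = [1, 1, -1, -1]
preds = [1, -1, -1, -1]
d1 = [0.25] * 4
eps = weighted_error(d1, preds, labels)
alpha = classifier_weight(eps)
d2 = update_weights(d1, preds, labels, alpha)
print(eps, alpha, d2)
```

After the update, the single misclassified sample carries half of the total weight, which shows how the next component classifier is pushed toward the samples that the current one got wrong.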

Thus, compared with a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent of the others and that the misclassification probability of each base classifier be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, which also reduces the impact of the quality of the sample set on the virtual screening effect.

4. Experiment and Analysis

In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also combine data from the PubChem database and the StARLITe database. The enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.

All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.

4.1. Data Set. To work with machine learning, we construct training sets and test sets for these two target proteins, which include the decoys and the known active compounds. In the training set, we selected the experimentally determined complex structures of these two proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world and are obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds, respectively, from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB as protein-inhibitor complexes with high inhibitor activity and high resolution. The entry 2h8h, SRC kinase in complex with a quinazoline inhibitor, resolution 2.30 Å, was selected for SRC. The entry 1u9w, crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491, resolution 2.20 Å, was selected for Cathepsin K. We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. To prepare for docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions. Pipeline Pilot of SciTegic was used to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Additional ring conformations of the compounds were then generated by LigPrep, and the GLIDE score was used to select five poses for each compound from the docking results. In this paper, five docking poses of each active compound were chosen by GLIDE score as the positive examples, because in a preliminary test this procedure generated higher enrichment factors than using one pose per active compound. Other settings of GLIDE were set to the default values. We use the averages of the 10% EF and the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. The docking poses of 2000 decoy compounds were then selected randomly as the negative samples. Each decoy compound was docked to the target proteins in the same way as described above.

Table 3: Experimental data structure.

Data set        Positive samples    Negative samples      Total
Training set    100                 2000                  2100
Test set        100 × 5 = 500       2000 × 5 = 10000      10500

After completing the training set, we set out to build the test set. First, the active compounds of the target proteins ($\mathrm{IC}_{50} \le 10\ \mu\mathrm{M}$) were chosen from StARLITe and divided into 100 clusters. The dividing strategy was hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses for each active compound were used as positive samples of the test set. The docking procedure and the target protein crystal structures are the same as those mentioned above. The decoys for the test set were then chosen with the same selection strategy as for the negative samples of the training set.
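The selection of one representative per cluster can be sketched as below. The sketch assumes that the cluster labels have already been produced by the hierarchical clustering step and that a lower $\mathrm{IC}_{50}$ means a higher inhibitory activity; the function name `pick_representatives`, the record layout, and the example compounds are hypothetical.

```python
def pick_representatives(compounds):
    """Return, for each cluster, the ID of the compound with the lowest IC50.

    `compounds` is an iterable of (compound_id, cluster_id, ic50_um) tuples.
    """
    best = {}
    for compound_id, cluster_id, ic50_um in compounds:
        current = best.get(cluster_id)
        # Keep the most potent compound seen so far in this cluster.
        if current is None or ic50_um < current[1]:
            best[cluster_id] = (compound_id, ic50_um)
    return {cluster: cid for cluster, (cid, _) in best.items()}


# Hypothetical records: IDs, cluster labels, and IC50 values are made up.
records = [
    ("cpd-A", 0, 4.2),
    ("cpd-B", 0, 0.8),
    ("cpd-C", 1, 9.5),
    ("cpd-D", 1, 9.9),
]
representatives = pick_representatives(records)
print(representatives)
```

With 100 clusters, the same routine would return the 100 active compounds that are subsequently docked to form the positive samples of the test set.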

After the data preparation is complete, we use Pharm-IF to quantify the training set. The resulting data are then treated as the input of the machine learning algorithms to obtain the corresponding learning models, and each learning model obtained is applied to the test set.

From Table 3, it can be seen that the ratio of negative samples to positive samples is as high as 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.

4.2. The Evaluation Index of the Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indices [17]. The enrichment factor (EF) is one of the best-known measures for evaluating screening efficiency. EF is usually used to evaluate the early recognition properties of a screening method; it indicates the ratio of the number of active compounds obtained by in silico screening to that obtained by random selection at a predefined sampling percentage. EF is calculated as follows:

$$\mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hit}_t / N_t} \qquad (14)$$

Here $\mathrm{Hits}_s$ represents the number of active compounds in the sample, $\mathrm{Hit}_t$ is the number of active compounds tested, $N_s$ is the number of compounds sampled, and $N_t$ is the number of all compounds. In actual drug discovery, only a small part of the compounds is filtered by computer. In general, 0.01–1% of the compounds are selected from a huge compound database (10000–1000000 compounds) in an actual virtual screening process. Since the size of the test sets in this experiment is far from reaching this order of magnitude, we use the 10% EF to carry out this test in order to reduce the deviation of the early evaluation.

EF reflects the screening efficiency only at a specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire range of sampling (0–100%). The ROC curve (receiver operating characteristic curve) is a graphical method to show the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in the ROC curve the true positive rate (TPR) is plotted along the $y$-axis, while the false positive rate (FPR) is displayed on the $x$-axis.

Figure 3: Two ROC curves, $X$ and $Y$ (true positive rate versus false positive rate).

Although the ROC curve can directly reflect the effect of the machine learning model, we also need a numerical measure, the AUC (area under the ROC curve) value, to evaluate the effect of the model in practical applications. The AUC value is the area under the ROC curve, and it is more intuitive and accurate. The AUC value is calculated as follows:

$$\mathrm{AUC} = \int_{0}^{1} \frac{\mathrm{TP}}{P} \, d\frac{\mathrm{FP}}{N} = \frac{1}{P \cdot N} \int_{0}^{N} \mathrm{TP} \, d\mathrm{FP} \qquad (15)$$

Among the variables of formula (15), $P$ stands for the number of positive samples, $N$ represents the number of negative samples, TP (true positive) stands for the active compounds that are classified correctly, and FP (false positive) stands for the compounds that are wrongly classified as active.

AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and AUC are not affected by an imbalanced class distribution or by whether the data are normally distributed. In addition, the AUC value allows intermediate states, so that experimental results can be divided into multiple ordered classes.
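Both evaluation indices of this subsection can be sketched in plain Python. The sketch assumes that every compound has a model score, with higher scores ranked earlier, and a label of 1 for active and 0 for decoy. The AUC is computed through the equivalent pairwise form of (15), that is, the fraction of (active, decoy) pairs in which the active compound is ranked higher, with ties counted as one half. The function names and the ten-compound example are hypothetical.

```python
def enrichment_factor(scores, labels, fraction):
    """EF of (14) at a sampling fraction: (Hits_s / N_s) / (Hit_t / N_t)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_total = len(ranked)
    n_sampled = max(1, int(round(fraction * n_total)))
    hits_sampled = sum(label for _, label in ranked[:n_sampled])
    hits_total = sum(labels)
    return (hits_sampled / n_sampled) / (hits_total / n_total)


def auc(scores, labels):
    """Area under the ROC curve as the share of correctly ordered pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical screen: 10 compounds, 2 of them active.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ef_10 = enrichment_factor(scores, labels, 0.10)
area = auc(scores, labels)
print(ef_10, area)
```

The pairwise form is quadratic in the number of compounds, which is acceptable for a test set of the size in Table 3; a rank-based computation would be preferable for much larger libraries.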

4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) and compared them on the two target proteins in the virtual screening experiment. The baseline algorithm, RF, was also developed on the basis of the machine learning libraries of the authors' lab, and the parameters of RF are given in Table 4.

Figure 4: Comparative experiment of SRC (ROC curves, true positive rate versus false positive rate, for Adaboost-SVM, Random Forest, and SVM).

Figure 5: Comparative experiment of Cathepsin K (ROC curves, true positive rate versus false positive rate, for Adaboost-SVM, Random Forest, and SVM).

The ROC curves from the comparative experiments on these two data sets using ensemble learning (compared with SVM) are shown in Figures 4 and 5.

Table 5 shows the 10% EF and the AUC value of the three methods.

To address the problem of sample set quality, the Adaboost method initially gives the same weight to each training sample; the sample weight represents the probability that the sample is used for training by a weak classifier. At each iteration, the Adaboost algorithm modifies the weights of the samples: if a training sample is correctly classified in this iteration, its weight is reduced, that is, the probability of its being used as a training sample is reduced in the next iteration. On the contrary, if a training sample is misclassified by the current base classifier, its weight is increased, and the probability of its being used as a training sample in the next iteration increases. In this way, the weak classifiers pay more attention to the seriously misclassified training samples. The experimental results above show that Adaboost-SVM outperforms RF on both the 10% EF and the AUC (Table 5). This observation indicates that ensemble-learning-based virtual screening is resistant to noise when the number of structure samples is limited, and thus better screening results are obtained.

Table 4: Random Forest parameter setting.

Parameter name                                             Parameter value
Tree number                                                1000
Node size                                                  5
The number of different descriptors tried at each split    50

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm        Target protein    10% EF    AUC
SVM              SRC               4.7       0.734
SVM              Cathepsin K       3.9       0.683
Adaboost-SVM     SRC               5.5       0.821
Adaboost-SVM     Cathepsin K       4.8       0.802
Random Forest    SRC               5.3       0.805
Random Forest    Cathepsin K       4.5       0.783

5. Conclusion

In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method mainly uses Pharm-IF to encode protein-ligand interactions in binary form and then uses the improved SVM algorithm, Adaboost-SVM, and Random Forest to classify the data. The idea of ensemble learning in dealing with data classification problems is to obtain a number of weak classifiers that are independent of each other and then to combine these independent weak classifiers with an effective method. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF for the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. It can also be observed from the results that, compared with the naive SVM, Random Forest obtained better performance on both 10% EF and AUC: the 10% EF improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest has shown that ensemble learning is able to get better results on the imbalanced data set with satisfactory robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method gives a more significant improvement in the accuracy of the virtual screening model. In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of the false positive rate of virtual screening is still to be solved. The speed of laboratory determination of protein structures has been unable to keep up with the needs of drug development; therefore, virtual screening will often encounter problems similar to the one in this paper, and there is still much room for improvement in the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier. For this kind of data set, a special kernel function should be designed to adapt to the characteristics of the data set.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), Science Funds for the Young Innovative Talents of HUST (no. 201304), China Postdoctoral Science Foundation (2011M500682), and Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).

References

[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera, et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.

[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.

[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.

[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.

[5] A. E. Bilsland, A. Pugliese, Y. Liu, et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.

[6] J. Wallentin, M. Osterhoff, R. N. Wilke, et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.

[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.

[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.

[9] C. Lin, Y. Zou, J. Qin, et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.

[10] Q. Zou, Z. Wang, X. Guan, et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.

[11] D. Takaya, T. Sato, H. Yuki, et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.

[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.

[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.

[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.

[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.

[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Bromme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.

[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-Protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.



2.3. Cathepsin K and SRC. This paper selects Cathepsin K and SRC as the targets for screening. These two proteins are hotspots in the field of pharmaceutical drug targets, and neither of them has enough experimentally determined protein-ligand complex structures for virtual screening. Therefore, it is necessary to add some docking poses to the training set, and these docking poses will influence the virtual screening efficiency.

Protooncogene tyrosine-protein kinase SRC, also known as protooncogene c-Src or simply c-Src, is a nonreceptor tyrosine kinase protein that in humans is encoded by the SRC gene. The SRC kinase family is made up of 9 members: LYN, FYN, LCK, HCK, FGR, BLK, YRK, YES, and c-SRC. SRC exists widely in tissue cells, and it plays an important role in cell metabolism and in the regulation of cell growth, development, and differentiation by interacting with important molecules in the signal transduction pathways. c-Src is made up of 6 functional regions: the SRC homology 4 (SH4) domain, the unique region, the SH3 domain, the SH2 domain, the catalytic domain, and a short regulatory tail. When SRC is inactive, there is an intramolecular interaction between the phosphorylated Tyr527 (tyrosine 527) and the SH2 domain. At the same time, the SH3 domain binds the proline-rich SH2-kinase linker domain. When Tyr527 is dephosphorylated and Tyr416 is phosphorylated, these interactions break and the SRC protein is activated. SRC causes a series of biological effects by participating in many signal transduction pathways through a variety of receptors, and this kind of protein is closely associated with a variety of cancers. Activation of the c-Src pathway has been observed in about 50% of tumors from the colon, liver, lung, breast, and pancreas. As a drug target, a number of tyrosine kinase inhibitors that target c-Src tyrosine kinase (as well as related tyrosine kinases) have been developed and put into use [15].

Cathepsin K is a lysosomal cysteine protease belonging to the papain superfamily, and it was cloned in 1999. The gene is located at 1q21.2, the transcript is 1.7 kb long, and it consists of 8 exons and 7 introns. The protein is expressed in osteoclasts and is involved in bone resorption. In the process of bone resorption, acid dissolves the hydroxyapatite, and the organic components of the bone matrix are separated and degraded by Cathepsin K. Cathepsin K has strong collagenase activity in an acidic environment, and it has been found to play a role in a variety of pathological phenomena, such as rheumatoid arthritis, tumor invasion and metastasis, inflammation, and osteoporosis. The function of Cathepsin K in osteoclasts has been recognized; therefore, many labs treat it as a drug target and develop its inhibitors for the treatment of osteoporosis. At present, the first choice of antiresorptive treatment is bisphosphonates, which can reduce the risk of nonvertebral and vertebral fractures. However, the long-term use of bisphosphonates may produce adverse reactions: esophageal irritation symptoms, hypocalcaemia, kidney irritation, and so on. In addition, bisphosphonates not only prevent bone loss but also inhibit bone formation at the same time, so new replacement therapy drugs are more meaningful [16].

3. Classification Algorithm Based on the Adaboost-SVM

Currently SVM can deal with many problems such as smallsize of samples nonlinearity or high dimensions Based onthe statistical learning theory it has a simple mathematicalform fast training method and good generalization perfor-mance It has been widely used in data mining problems suchas pattern recognition function estimation and time seriesprediction Under the condition that the quality of sample setis not very low even if we do not make any improvementswe can get a good result The learning mechanism of SVMprovides a lot of space to improve the classification modelIn addition one major advantage of SVM is using of convexquadratic programming which provides only global minimaand hence avoids being trapped in local minima so in thispaper we use SVM as the base classifier There have beena large number of literatures about the SVM this paperonly gives a simple introduction The basic process of SVMclassification problems is as follows

For a given sample set 119871 = (1199091 1199101) (1199092 1199102) (119909

119899 119910119899)

and 119909119894isin 119877119889

119910119894isin 1 minus1 119894 = 1 2 119899 119910

119894stands for the

categories of sample 119909119894 119889 is the sample number and 119899 is the

training sample number If the input vector set is linearlyseparable then the input vector set can be separated by ahyperplane The hyperplane can be expressed as

119908 sdot 119909 minus 119887 = 0 (3)

119908 is the normal vector of the hyperplane and 119887 is offsetThe SVM learning problem is minimizing the objectivefunction

min120601 (119908) =1

21199082+ 119862(

119899

sum

119894=1

120585119894) (4)

This meets the condition

119910119894(119908119909119894+ 119887) ge 1 minus 120585

119894 119894 = 1 2 119899 (5)

Here (12)1199082 is structure complexity119862(sum

119899

119894=1120585119894) stands

for empirical risk and 120585119894presents the slack variable 119867 is

a constant which is punishment factor of samples wronglyclassified For the situation of linear inseparable the mainidea of SVM is used to map the feature vector to the highdimensional feature space and constructs an optimal hyper-plane in the feature space

To get the change of 120601 119909 in space of 119877119899 mapped into 119867

119909 997888rarr 120601 (119909) = (1206011(119909) 120601

2(119909) 120601

119894(119909))Τ

(6)

Eventually we can decide optimization classificationfunction

119891 (119909) = sgn (119908 sdot 120601 (119909) + 119887)

= sgn(

119899

sum

119894=1

119886119894119910119894120601 (119909119894) sdot 120601 (119909) + 119887)

(7)

Computational and Mathematical Methods in Medicine 5

In our work Radial Basis Function (RBF) is taken as thekernel function of SVM and the mathematical description ofthis kernel is given below

119870(119909119894 119909119895) = 119890minus119909119894minus119909119895221205902

(8)

Relative to other classifiers the SVM is more stable andless affected by the quality of sample set However the samplecategories imbalance of data set and the high complexity ofthe data set will destabilize the classifier and the instabilityof the classifier will directly affect the final classificationresult In this paper we introduce the Adaboost mechanismin ensemble learning to divide one classification process intoseveral layers of weak classifier based on SVM

The key of the combination of Adaboost and SVM is tofind a suitable Gauss width 120590 value for each component If the120590 value is relatively large the component classifier is tooweakand the final classification performance is decreased On theother hand if the 120590 value is relatively small which makesthe component classifier robust and the error of componentclassifier is highly correlated the difference is small so thatthe ensemble learning is invalid Even more importantly 120590value is too small which will lead to overfitting and resultingin a greatly reduced generalization Therefore in this paperthe standard deviation of the sample set of each componentclassifier is used as the 120590 value of the component classifierto control the classification accuracy of the component clas-sifier thus SVM based Adaboost classifier is obtained Theprogram used in this paper is not an open source so we needto explain some key parameters We list the values of 120590 119862and other parameters in Table 1

Table 1: SVM parameter setting.

Parameter name                                                   Parameter value
SVM type                                                         C-SVM
Class number                                                     2
Kernel function                                                  RBF
The degree in kernel function                                    3
$\sigma$ in kernel function                                      0.001
Cost factor $C$                                                  5
Cache size                                                       500 MB
Tolerance in the termination criteria                            0.001
The weight value of penalty factor for all kinds of samples      1
Cross validation                                                 5

The specific process of the algorithm is as follows.

(1) RBFSVM denotes the SVM with the RBF kernel; $T$ denotes the number of iterations required in the Adaboost process.

(2) Initialization: initialize the weight of each sample, $D_1(i) = 1/n$, $i = 1, 2, \ldots, n$.

(3) For $t = 1, 2, \ldots, T$: for each weak classifier $h$, calculate the weighted error

$$\varepsilon_t = \sum_{j=1}^{n} D_t(j)\,\bigl\lvert h(x_j) - y_j \bigr\rvert. \tag{9}$$

Choose the weak classifier with the lowest weighted error rate $\varepsilon_t$ as $h_t$ and save its corresponding SVM model. Calculate the selected weak classifier's weight:

$$\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right). \tag{10}$$

Update the sample weights according to $\alpha_t$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\,F(E_t) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t}, & \text{if } h_t(x_i) = y_i, \\ e^{\alpha_t}, & \text{if } h_t(x_i) \neq y_i, \end{cases} \tag{11}$$

where $Z_t$ is the normalization factor, chosen so that the sample weights sum to one:

$$\sum_{i=1}^{n} D_t(i) = 1. \tag{12}$$

(4) Combine the SVM weak classifiers into the strong classifier $H$ and apply it to the training set:

$$H(x) = \operatorname{sign}\left[\sum_{t=1}^{T} \alpha_t h_t(x)\right]. \tag{13}$$

For the Adaboost-SVM, we set the parameters as listed in Table 2; the basic SVM parameters are the same as in Table 1.

Table 2: Adaboost-SVM parameter setting.

Parameter name                       Parameter value
Ensemble learning type               Adaboost
Class number                         2
The basic classifier type            C-SVM
Number of classifiers per layer      100
The max false alarm rate             0.5
The min hit rate                     0.9
Number of iterations                 5
Weight trim rate                     0.9
Cache size                           500 MB
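The weight bookkeeping of steps (2) to (4) can be sketched in a few lines of Python. The sketch takes the +1/-1 predictions of an already trained weak classifier as given (training the RBF SVMs is omitted) and uses the misclassification indicator for the weighted error. The names `boosting_round` and `strong_classify` and the toy labels are assumptions for illustration; this is not the authors' closed-source program.

```python
import math


def boosting_round(weights, predictions, labels):
    """One Adaboost round: weighted error, classifier weight, new sample weights.

    `predictions` and `labels` hold +1/-1 values; `weights` sums to 1.
    """
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    if not 0.0 < error < 0.5:
        raise ValueError("weak classifier must satisfy 0 < error < 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Shrink the weights of correct samples, grow those of misclassified ones.
    unnormalized = [
        w * math.exp(-alpha if p == y else alpha)
        for w, p, y in zip(weights, predictions, labels)
    ]
    z = sum(unnormalized)  # normalization factor Z_t
    return error, alpha, [w / z for w in unnormalized]


def strong_classify(alphas, votes):
    """Sign of the alpha-weighted vote of the weak classifiers for one sample."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1


labels = [1, 1, -1, -1, -1]
weights = [1.0 / len(labels)] * len(labels)  # uniform initialization, 1/n
predictions = [1, -1, -1, -1, -1]            # hypothetical weak classifier output

error, alpha, weights = boosting_round(weights, predictions, labels)
print(error, alpha, weights)
```

In this toy round the single misclassified sample ends up carrying half of the total weight, so the next weak classifier is pushed to classify it correctly.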

Thus, compared with a single machine learning algorithm, the ensemble learning method requires that each base classifier be independent of the others and that its probability of misclassifying a sample be less than 0.5. In the ensemble learning method, all classifiers work together to solve one problem, which also reduces the impact of the quality of the sample set on the virtual screening effect.
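The requirement that the misclassification probability stay below 0.5 can be read off the classifier weight used in the Adaboost update. As a short worked check, writing $\varepsilon_t$ for the weighted error and $\alpha_t$ for the classifier weight:

```latex
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t} > 0
\;\iff\; \frac{1-\varepsilon_t}{\varepsilon_t} > 1
\;\iff\; \varepsilon_t < \frac{1}{2}.
```

A weak classifier with $\varepsilon_t = 0.5$ therefore receives zero weight in the final vote, and one with $\varepsilon_t > 0.5$ would receive a negative weight.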

4. Experiment and Analysis

In order to verify the validity of the proposed method, besides the crystal structures from the PDB, we also use data from the PubChem database and the StARLITe database. The enrichment factor (EF) and the ROC curve are used to evaluate the effect of the virtual screening and of the machine learning, so as to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be directly used in data mining.

All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.

4.1. Data Set. To support machine learning, we construct training sets and test sets for the two target proteins; these sets include decoys and known active compounds. In the training set, we selected the experimentally determined complex structures of the two proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world and are obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy.

To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds, respectively, from the known active compounds for which crystal structures with their targets were not determined, and this process was repeated 10 times. For each of the selected compounds, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB among protein-inhibitor complexes with high inhibitor activity and a high-resolution crystal structure. The entry 2h8h (SRC kinase in complex with a quinazoline inhibitor; resolution 2.30 Å) was selected for SRC. The entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491; resolution 2.20 Å) was selected for Cathepsin K.

We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds for which crystal structures were not experimentally determined. To prepare for docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions. Pipeline Pilot of SciTegic was used to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Additional ring conformations of the compounds were then generated by LigPrep. Next, the GLIDE score was used to select five poses for each compound from the docking results. Five docking poses of each active compound were chosen as positive examples by the GLIDE score because, in a preliminary test, this procedure generated higher enrichment factors than using one pose of each active compound. Other settings of GLIDE were set to the default values. We use the averages of the 10% EF and the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. The docking poses of 2000 decoy compounds were then randomly selected as the negative samples. Each decoy compound was docked to the target proteins in the same way as described above.

Table 3: Experimental data structure.

Data set        Positive samples    Negative samples      Total
Training set    100                 2000                  2100
Test set        100 × 5 = 500       2000 × 5 = 10000      10500
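The pose-selection step described in this subsection, keeping the five best-scored docking poses of each compound, can be sketched as follows. The record layout, the compound identifiers, and the scores are hypothetical, and the sketch assumes that a lower (more negative) docking score denotes a better pose.

```python
from collections import defaultdict


def top_poses(poses, per_compound=5):
    """Keep the `per_compound` best-scored poses of every compound.

    `poses` is an iterable of (compound_id, pose_id, score) tuples; lower
    scores are assumed to be better.
    """
    by_compound = defaultdict(list)
    for compound_id, pose_id, score in poses:
        by_compound[compound_id].append((score, pose_id))
    selected = {}
    for compound_id, scored in by_compound.items():
        scored.sort()  # ascending order: best (lowest) score first
        selected[compound_id] = [pose_id for _, pose_id in scored[:per_compound]]
    return selected


# Hypothetical docking output: seven poses for one compound, two for another.
scores_a = [-7.1, -9.3, -6.0, -8.8, -9.0, -5.5, -8.1]
poses = [("cpd_A", pose_id, score) for pose_id, score in enumerate(scores_a)]
poses += [("cpd_B", 0, -6.2), ("cpd_B", 1, -7.4)]

selected = top_poses(poses)
print(selected)
```

A compound with fewer than five poses simply contributes all of them, which keeps the sketch usable when docking returns only a few poses.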

After the training set was completed, we built the test set. First, the active compounds of the target proteins ($\mathrm{IC}_{50} \le 10\,\mu\mathrm{M}$) were chosen from StARLITe and divided into 100 clusters. The clustering strategy is hierarchical clustering with the Ward method, based on the Euclidean distance between the 2D structure fingerprints of the compounds. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses for each active compound were used as positive samples of the test set. The docking procedure and the target protein crystal structures are the same as those mentioned above. The decoys for the test set were then chosen with the same selection strategy as the negative samples of the training set.
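Once the actives have been clustered, choosing one representative per cluster reduces to a grouped minimum over IC50. A minimal sketch, assuming that the cluster labels have already been produced by the hierarchical clustering (not reproduced here) and using made-up compound names and IC50 values in μM:

```python
def most_potent_per_cluster(compounds):
    """Return, for each cluster label, the compound with the lowest IC50.

    `compounds` is an iterable of (name, cluster_label, ic50_um) tuples; a
    lower IC50 means a higher inhibitory activity.
    """
    best = {}
    for name, cluster, ic50 in compounds:
        if cluster not in best or ic50 < best[cluster][1]:
            best[cluster] = (name, ic50)
    return {cluster: name for cluster, (name, _) in best.items()}


# Hypothetical actives, all within the IC50 <= 10 uM cutoff.
compounds = [
    ("act_1", 0, 2.50),
    ("act_2", 0, 0.08),
    ("act_3", 1, 9.10),
    ("act_4", 1, 4.70),
    ("act_5", 2, 0.90),
]

representatives = most_potent_per_cluster(compounds)
print(representatives)
```

Taking one compound per cluster keeps the test set structurally diverse, since near-duplicate actives fall into the same cluster and only the most potent one survives.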

After the data preparation is complete, we use Pharm-IF to quantify the training set. The resulting data are then used as the input of the machine learning algorithms to obtain the corresponding learning models, and each learning model obtained is then applied to the test set.

From Table 3, it can be seen that the ratio of negative samples to positive samples is as high as 20 : 1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.

4.2. The Evaluation Index of the Machine Learning Combined with Virtual Screening. At present, virtual screening and machine learning each have their own evaluation indices [17]. The enrichment factor (EF) is one of the best-known measures for evaluating screening efficiency. EF is usually used to evaluate the early recognition properties of a screening method: it is the ratio of the number of active compounds obtained by in silico screening to the number obtained by random selection at a predefined sampling percentage. EF is calculated as follows:

$$\mathrm{EF} = \frac{\mathrm{Hits}_s / N_s}{\mathrm{Hits}_t / N_t}. \tag{14}$$

Here $\mathrm{Hits}_s$ represents the number of active compounds in the sample, $\mathrm{Hits}_t$ is the number of active compounds tested, $N_s$ is the number of compounds sampled, and $N_t$ is the number of all compounds. In actual drug discovery, only a small part of the compounds is filtered by computer. In general, 0.01–1% of the compounds are selected from a huge compound database (10,000–1,000,000 compounds) in an actual virtual screening process. Since the size of the test sets in this experiment is far from reaching this order of magnitude, we use the 10% EF to carry out this test in order to reduce the deviation of the early evaluation.

Figure 3: Two ROC curves $X$ and $Y$ (x-axis: false positive rate; y-axis: true positive rate).

EF only measures the screening efficiency at one specific sampling proportion; therefore, this paper also introduces the ROC curve and the AUC value to assess the entire range of sampling (0–100%). The ROC curve (receiver operating characteristic curve) is a graphical method that shows the tradeoff between the false positive rate and the true positive rate of a classifier. As shown in Figure 3, in a ROC curve the true positive rate (TPR) is plotted along the $y$-axis, while the false positive rate (FPR) is displayed on the $x$-axis. Although the ROC curve directly reflects the effect of the machine learning model, we also need a numerical measure, the AUC (area under the ROC curve) value, to evaluate the effect of the model in practical applications. The AUC value is the area under the ROC curve, and it is more intuitive and accurate. The AUC value is calculated as follows:

$$\mathrm{AUC} = \int_{0}^{1} \frac{\mathrm{TP}}{P}\, d\,\frac{\mathrm{FP}}{N} = \frac{1}{P \cdot N}\int_{0}^{N} \mathrm{TP}\; d\,\mathrm{FP}. \tag{15}$$

Among the variables of formula (15), $P$ stands for the number of positive samples, $N$ represents the number of negative samples, TP (true positive) stands for the active compounds that are classified correctly, and FP (false positive) stands for the inactive compounds that are misclassified as active.
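Formula (15) can be evaluated numerically by walking down the ranked list and accumulating the area under the TP/FP staircase. The sketch below is one such evaluation in plain Python; the function name, the scores, and the labels are illustrative assumptions, and tied scores receive no special treatment.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve; a higher score means a more likely active.

    `labels` holds 1 for active compounds and 0 for decoys.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one active and one decoy")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1     # step up along the TP axis
        else:
            area += true_pos  # step right: add a column of height TP
    # Dividing by P * N matches the 1 / (P * N) factor of the integral form.
    return area / (positives * negatives)


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(roc_auc(scores, labels))
```

A ranking that places every active above every decoy yields 1, and the fully reversed ranking yields 0.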

For a useful model, AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, the AUC value of the better one is higher. ROC curves and AUC are affected neither by an imbalanced class distribution nor by whether the data are normally distributed. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
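To complement the AUC, the 10% enrichment factor of formula (14) can be computed from the same kind of ranked list. A minimal sketch, with a hypothetical function name and toy data, assuming that a higher score means a more likely active and that the sampled set is the top-scored fraction of the list:

```python
def enrichment_factor(scores, labels, fraction=0.10):
    """EF = (Hits_s / N_s) / (Hits_t / N_t) for the top `fraction` of a ranking.

    `labels` holds 1 for active compounds and 0 for decoys.
    """
    n_total = len(labels)
    hits_total = sum(labels)
    if hits_total == 0:
        raise ValueError("the data set contains no active compound")
    n_sampled = max(1, round(n_total * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits_sampled = sum(label for _, label in ranked[:n_sampled])
    return (hits_sampled / n_sampled) / (hits_total / n_total)


# Toy ranking: 20 compounds, 4 actives, 2 of them ranked in the top 10%.
scores = list(range(20, 0, -1))
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ef_10 = enrichment_factor(scores, labels)
print(ef_10)
```

With a 10% sample the EF cannot exceed 10, because at best every sampled compound is active.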

4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) and compared them in virtual screening experiments on the two target proteins. As the baseline algorithm, RF was also developed on the basis of the machine learning libraries of the authors' lab, and the parameters of RF are listed in Table 4.

Figure 4: Comparative experiment of SRC. ROC curves of Adaboost-SVM, Random Forest, and SVM (x-axis: false positive rate; y-axis: true positive rate).

Figure 5: Comparative experiment of Cathepsin K. ROC curves of Adaboost-SVM, Random Forest, and SVM (x-axis: false positive rate; y-axis: true positive rate).

The ROC curves from the comparative experiments on the data sets of the two targets, using ensemble learning and compared with SVM, are shown in Figures 4 and 5.

Table 5 lists the 10% EF and the AUC values of the three methods.

Table 4: Random Forest parameter setting.

Parameter name                                                Parameter value
Tree number                                                   1000
Node size                                                     5
The number of different descriptors tried at each split       50

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm        Target protein    10% EF    AUC
SVM              SRC               4.7       0.734
SVM              Cathepsin K       3.9       0.683
Adaboost-SVM     SRC               5.5       0.821
Adaboost-SVM     Cathepsin K       4.8       0.802
Random Forest    SRC               5.3       0.805
Random Forest    Cathepsin K       4.5       0.783

To address the problem of sample set quality, the Adaboost method initially gives the same weight to each training sample; the sample weight represents the probability that the sample is used for training by a weak classifier. At each iteration, the Adaboost algorithm modifies the sample weights. If a training sample is correctly classified in this iteration, its weight is reduced; that is, the probability that it is used as a training sample in the next iteration is reduced. On the contrary, if the training sample is misclassified by the current base classifier, its weight is increased, and so is the probability that it is used as a training sample in the next iteration. In this way, the weak classifiers pay more attention to the samples of the training set that are seriously misclassified. The experimental results above show that Adaboost-SVM outperforms RF (for SRC, a 10% EF of 5.5 versus 5.3 and an AUC of 0.821 versus 0.805; for Cathepsin K, 4.8 versus 4.5 and 0.802 versus 0.783). This observation indicates that, on both 10% EF and AUC, virtual screening based on ensemble learning shows noise resistance when the number of structure samples is limited, and thus better screening results are obtained.
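As a worked illustration of this reweighting (the error value is chosen for illustration and is not taken from the experiments), suppose a weak classifier has weighted error $\varepsilon_t = 0.1$. With the classifier weight $\alpha_t = \tfrac{1}{2}\ln\bigl((1-\varepsilon_t)/\varepsilon_t\bigr)$:

```latex
\alpha_t = \tfrac{1}{2}\ln\frac{1 - 0.1}{0.1} = \tfrac{1}{2}\ln 9 = \ln 3 \approx 1.099,
\qquad e^{-\alpha_t} = \tfrac{1}{3},
\qquad e^{\alpha_t} = 3 .
```

Before normalization, each correctly classified sample keeps one third of its weight and each misclassified sample triples its weight, so a misclassified sample becomes nine times as heavy relative to a correctly classified one.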

5. Conclusion

In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. This method uses Pharm-IF to encode protein-ligand interactions in binary form and then uses the improved SVM algorithm Adaboost-SVM and Random Forest to classify the data. The idea of ensemble learning in dealing with data classification problems is to obtain a number of weak classifiers that are independent of each other and then to combine them with an effective method. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF for the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. The results also show that, compared with the naive SVM, Random Forest obtains better performance on both 10% EF and AUC: the 10% EF is improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC is improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning is able to get better results on the imbalanced data set with satisfying robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method improves the accuracy of the virtual screening model more significantly.

In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although the status of virtual screening is gradually rising, the problem of the false positive rate of virtual screening remains to be solved. The speed of laboratory determination of protein structures has been unable to keep up with the needs of drug development; therefore, virtual screening will often encounter problems similar to the one in this paper, and there is still much room for improvement in the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier. For this kind of data set, a special kernel function should be designed to adapt to the characteristics of the data.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).

References

[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., “WURCS: the Web3 unique representation of carbohydrate structures,” Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.

[2] S. Seuss and A. R. Boccaccini, “Electrophoretic deposition of biological macromolecules, drugs, and cells,” Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.

[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, “SVM model for virtual screening of Lck inhibitors,” Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.

[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, “Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes,” Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.

[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., “Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks,” Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.

[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., “Hard X-ray detection using a single 100 nm diameter nanowire,” Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.

[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, “nDNA-prot: identification of DNA-binding proteins based on unbalanced classification,” BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.

[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, “Improving tRNAscan-SE annotation results via ensemble classifiers,” Molecular Informatics, vol. 34, pp. 761–770, 2015.

[9] C. Lin, Y. Zou, J. Qin et al., “Hierarchical classification of protein folds using a novel ensemble classifier,” PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.

[10] Q. Zou, Z. Wang, X. Guan et al., “An approach for identifying cytokines based on a novel ensemble classifier,” BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.

[11] D. Takaya, T. Sato, H. Yuki et al., “Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning,” Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.

[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, “Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer,” Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.

[13] R. Sawada, M. Kotera, and Y. Yamanishi, “Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach,” Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.

[14] T. Sato, T. Honma, and S. Yokoyama, “Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening,” Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.

[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, “Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation,” Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.

[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Brömme, “The role of basic amino acid surface clusters on the collagenase activity of cathepsin K,” Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.

[17] A. Anighoro and G. Rastelli, “Enrichment factor analyses on G-Protein coupled receptors with known crystal structure,” Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.



The key of the combination of Adaboost and SVM is tofind a suitable Gauss width 120590 value for each component If the120590 value is relatively large the component classifier is tooweakand the final classification performance is decreased On theother hand if the 120590 value is relatively small which makesthe component classifier robust and the error of componentclassifier is highly correlated the difference is small so thatthe ensemble learning is invalid Even more importantly 120590value is too small which will lead to overfitting and resultingin a greatly reduced generalization Therefore in this paperthe standard deviation of the sample set of each componentclassifier is used as the 120590 value of the component classifierto control the classification accuracy of the component clas-sifier thus SVM based Adaboost classifier is obtained Theprogram used in this paper is not an open source so we needto explain some key parameters We list the values of 120590 119862and other parameters in Table 1

The specific process of the algorithm is as follows

(1) RBFSVM presents the SVM with the RBF kernel119879 presents the number of iterations required in theAdaboost process

(2) Initialization initialize the weights of each sample1199081(119894) = 1119899 119894 = 1 2 119899

(3) For 119905 = 1 2 119879for each ℎ(119909

119894) calculate the weighted error

120576119905

(119894)=

119899

sum

119895=1

119863119894

10038161003816100381610038161003816ℎ (119883119895) minus 119910 (119895)

10038161003816100381610038161003816 (9)

Choose a feature with the lowest weighted error rate120576119895and save its corresponding SVM model Calculate

the selected weak classifierrsquos weight

119886119905=

1

2ln(

1 minus 120576119905

120576119905

) (10)

Update sample weights according to 119886119905

119863119905+1

(119894) =119863119905(119894)

119885119905

119865 (119864119905)

=119863119905(119894)

119885119905

119890minus120572119905 if ℎ

119905(119909119894) = 119910119894

119890120572119905 if ℎ

119905(119909119894) = 119910119894

(11)

Table 1 SVM parameter setting

Parameter name Parameter valuesSVM type C-SVMClass number 2Kernel function RBFThe degree in kernel function 3120590 in kernel function 0001Coast factor 5Cache size 500MBTolerance in the termination criteria 0001The weight value of penalty factor for all kindsof samples 1

Cross validation 5

Table 2 Adaboost-SVM parameter setting

Parameter name Parameter valuesEnsemble learning type AdaboostClass number 2The basic classifier type C-SVMNumber of classifiers per layer 100The max false alarm rate 05The min hit rate 09Number of iterations 5Weight trim rate 09Cache size 500MB

And the normalized parameters are

119899

sum

119894=1

119863119905(119894) = 1 (12)

(4) Use strong classifier 119867 integrated by SVM weakclassifier to training set

119867(119909) = sign[

119899

sum

119894=1

119886119905ℎ119905(119909)] (13)

For the Adaboost-SVM we set the parameters in Table 2and the basic SVM parameters are the same as Table 1

Thus compared to the single machine learning algorithmensemble learning method requires that each base classifiershould be independent from the others The probabilityof sample misclassification should be less than 05 In theensemble learning method all classifiers will work togetherto solve one problem and this can also reduce the impact ofthe quality of sample set to the virtual screening effect

4 Experiment and Analysis

In order to verify the validity of the proposed method in thispaper besides the crystal structure of PDB we also combinethe data from the PubChem database and the StARLITe

6 Computational and Mathematical Methods in Medicine

database and the enrichment factor (EF) and the ROC curveare used to evaluate the effect of virtual screening and themachine learning to ensure the effectiveness of the methodPubChem is a database of chemical molecules and theiractivities against biological assays StARLITe is a databasecontaining biological activity andor binding affinity databetween various compounds and proteins and it is one of thedatabases that can be directly used in data mining

All the crystal structures used in this experiment are fromPDB database the material is available free of charge via theInternet at httpwwwrcsborg All the training set decoysare from PubChem data set the material is available free ofcharge via the Internet at httppubchemncbinlmnihgovWe thank laboratory colleagues for providing the StARLITedata

41 Data Set In order to cooperate with machine learningin this paper we construct a set of training sets and testsets of these two target proteins for machine learningwhich include the decoys and known active compounds Inthe training set we selected the experimentally determinedcomplex structures of these two kinds of proteins from thePDB as the positive samples The Protein Data Bank (PDB)is a crystallographic database for the three-dimensionalstructural data of large biological molecules The data inthe PDB is submitted by biologists and biochemists aroundthe world by the experimental means such as X-ray crystal-lography NMR spectroscopy or increasingly cryoelectronmicroscopy To expand the training set we randomly andrespectively selected 1 3 5 10 20 40 60 and 80 activecompounds from the known active compounds for whichcrystal structures with their targets were not determined andthis process is repeated 10 times For each of the selected com-pounds we used the GLIDE to generate five docking posesand mixed them into the training set Among them thecrystal structures of these two proteins used in dockingexperiments are selected from thePDBwith protein-inhibitorcompounds with high inhibitor activity and high resolutioncrystal structureThe entry 2h8h SRC kinase in complexwitha quinazoline inhibitor the resolution 230 A was selectedfor SRC The entry 1u9w crystal structure of the cysteineprotease human Cathepsin K in complex with the covalentinhibitor NVP-ABI491 the resolution 220 A was selected forCathepsin K We used the SP mode of GLIDE to generate thedocking poses of the decoys and the active compounds forwhich crystal structureswere not experimentally determinedFor the preparation of the docking we use the ProteinPreparation Wizard to add the hydrogen atoms of theprotein and optimize their positions Using Pipeline Pilotof SciTegic to enumerate the tautomer stereoisomers andprotonationdeprotonation form at pH 74 of the active anddecoy compounds Then the additional ring 
conformationsof the compounds were generated by LigPrep Then theGLIDE score was used to select five poses for each compoundin the docking results In this paper five docking poses of eachactive compound as the positive examples were chosen by theGLIDE score because this procedure would generate higherenrichment factors in a preliminary test than using one poseof each active compound Other settings of GLIDE were set

Table 3 Experimental data structure

Date set Positive sample Negative sample TotalTraining set 100 2000 2100Test set 100 lowast 5 = 500 2000 lowast 5 = 10000 10500

to the default values In this paper we use the averages ofthe 10 EF and the ROC score of the 10 trials to evaluatethe screening efficiencies of the learning models using eachnumber of docking poses Then select the docking poses of2000 decoy compounds from them as the negative samplesrandomly Each decoy compound was docked to the targetproteins by the same way as mentioned above

After the completion of the training set we set out to buildthe test set First choose the active compounds of the targetproteins (IC

50le 10 120583m) from StARLITe and divide them into

100 clusters Dividing strategy is that hierarchical clusteringwith Ward method based on the Euclidean distance betweentheir 2D structure fingerprintsThe compoundwith the high-est inhibitory activity was selected from each cluster Dockthe 100 active compounds obtained for each target to their tar-get protein and five docking poses for each active compoundwere used as positive samples of the test setThe docking wayand target protein crystal structures are same as those men-tioned above Then use selection strategy of negative sampleof the training set to choose the decoys for the test set

After the completion of the date preparation we will usethe Pharm-IF to quantify the training set Then treat thedata as the input of machine learning algorithms to get thecorresponding learning model The learning model obtainedwill be used to test set respectively

From Table 3 it can be seen that the proportion of nega-tive samples and positive samples is up to 20 1 which leadsto significant imbalanced-data problem and the motivationof our work is to address this issue

42 The Evaluation Index of the Machine Learning Combinedwith Virtual Screening At present the virtual screening andmachine learning have their own evaluation index [17] Theenrichment factor (EF) is one of the most famous measuresfor evaluating the screening efficiency EF is usually used toevaluate the early recognition properties of screeningmethodand it can indicate the ratio of the number of obtained activecompounds by in silico screening against that generated byrandom selection at the predefined sampling percentageThecalculation method of EF is as follows

EF =Hits119904119873119904

Hit119905119873119905

(14)

HereHits119904represents the number of active compounds in

the sample Hit119905is the number of active compounds tested119873

119904

is the number of compounds sampled and 119873119905is the number

of all compounds In the actual drug discovery only a smallpart of the compound is filtered by computer In general 001ndash1 of the compounds will be selected from a huge compounddatabase (10000ndash1000000 compounds) in the actual virtualscreening process Since the number of test sets in this exper-iment is far from reaching this order of magnitude we use

Computational and Mathematical Methods in Medicine 7

False positive rate1009080706050403020100

10

09

08

07

06

05

04

03

02

01

00

True

pos

itive

rate

X

Y

Figure 3 Two ROC curves 119883 and 119884

10EF to carry out this test in order to reduce the deviation ofthe early evaluation EF has only specific sampling proportionscreening efficiency therefore this paper also introduced theROC curve and AUC value to assess the entire range ofsampling (0ndash100)TheROC curve (receiver operating char-acteristic curve) is a graphical method to show the tradeoffbetween false positive rate and true positive rate of classifierAs shown in Figure 3 in ROC curve the true positive rate(TPR) is plotted along the 119910-axis while the false positive rate(FPR) is displayed on the 119909-axis Although the ROC curvecan directly reflect the effect of the machine learning modelwe also need a kind of numerical method the AUC (areaunder ROC curve) value to evaluate the effect of the modelin the practical applicationThe AUC value indicates the areaunder the ROC curve and it is more intuitive and accurateThe calculation method of AUC value is as follows

$$
\mathrm{AUC} = \int_{0}^{1} \frac{\mathrm{TP}}{P}\, d\!\left(\frac{\mathrm{FP}}{N}\right) = \frac{1}{P \cdot N} \int_{0}^{N} \mathrm{TP}\, d\mathrm{FP}. \tag{15}
$$

In formula (15), $P$ is the number of positive samples, $N$ is the number of negative samples, TP (true positives) is the number of active compounds that are classified correctly, and FP (false positives) is the number of inactive compounds that are misclassified as active.
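As an illustration of formula (15), the integral can be evaluated as a sum over a ranked list: each time a negative sample is passed, the current TP count is added, and the total is divided by $P \cdot N$. The following is a minimal sketch in plain Python, not the implementation used in this paper; the function name `roc_auc`, the label convention (1 for active, 0 for decoy), and the half-credit treatment of tied scores are assumptions made for the example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve as (1 / (P * N)) * sum of TP over FP steps.

    scores: classifier outputs, higher means "more likely active" (assumed).
    labels: 1 for an active compound, 0 for a decoy (assumed convention).
    """
    p = sum(1 for y in labels if y == 1)   # number of positive samples, P
    n = len(labels) - p                    # number of negative samples, N
    if p == 0 or n == 0:
        raise ValueError("need at least one positive and one negative sample")

    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = 0        # true positives accumulated so far
    area = 0.0    # running value of the integral of TP d(FP)
    i = 0
    while i < len(ranked):
        # Process all samples that share the same score as one group,
        # which corresponds to a diagonal segment of the ROC curve.
        j = i
        tp_tie = fp_tie = 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp_tie += 1
            else:
                fp_tie += 1
            j += 1
        area += fp_tie * (tp + tp_tie / 2.0)
        tp += tp_tie
        i = j
    return area / (p * n)
```

With this definition, a ranking that places every active compound above every decoy gives 1.0, and a classifier that assigns the same score to all samples gives 0.5.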

In practice, AUC values lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, its AUC value is higher. ROC curves and AUC are not affected by an imbalanced class distribution or by whether the data are normally distributed. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
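Returning to formula (14), the EF at a given sampling percentage can be computed from a ranked list in a few lines. This is a minimal sketch under stated assumptions rather than the code used in the experiments: the function name `enrichment_factor`, the 1/0 label convention, and the rounding of the sample size to the nearest integer are choices made for the example.

```python
def enrichment_factor(scores, labels, fraction=0.10):
    """EF = (Hits_s / N_s) / (Hits_t / N_t) at the given sampling fraction.

    scores: classifier outputs, higher means "more likely active" (assumed).
    labels: 1 for an active compound, 0 for a decoy (assumed convention).
    """
    n_total = len(labels)                  # N_t
    hits_total = sum(labels)               # Hits_t
    if hits_total == 0:
        raise ValueError("EF is undefined without any active compound")
    n_sampled = max(1, int(round(fraction * n_total)))   # N_s

    # Keep the top-ranked N_s compounds and count the actives among them.
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits_sampled = sum(y for _, y in ranked[:n_sampled])  # Hits_s

    return (hits_sampled / n_sampled) / (hits_total / n_total)
```

If the EF is computed on a set of 10,500 samples containing 500 actives, as in the test set of Table 3, the 10% sample holds 1050 samples and random selection would be expected to find about 50 actives, so the 10% EF cannot exceed 10.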

4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) and compared them on the two target proteins in virtual screening experiments. RF, which serves as the baseline algorithm, was also developed on the machine learning libraries of the authors' lab, and its parameters are given in Table 4.

Figure 4: Comparative experiment of SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM; x-axis: false positive rate; y-axis: true positive rate).

Figure 5: Comparative experiment of Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM; x-axis: false positive rate; y-axis: true positive rate).

The ROC curves from the comparative experiments on the two data sets, using ensemble learning compared with SVM, are shown in Figures 4 and 5.

Table 5 lists the 10% EF and the AUC value of each method.

Table 4: Random Forest parameter setting.

Parameter name                                          Parameter value
Tree number                                             1000
Node size                                               5
Number of different descriptors tried at each split     50

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

Algorithm        Target protein    10% EF    AUC
SVM              SRC               4.7       0.734
SVM              Cathepsin K       3.9       0.683
Adaboost-SVM     SRC               5.5       0.821
Adaboost-SVM     Cathepsin K       4.8       0.802
Random Forest    SRC               5.3       0.805
Random Forest    Cathepsin K       4.5       0.783

To address the problem of sample set quality, the Adaboost method initially gives the same weight to every training sample; the sample weight represents the probability that the sample is drawn into the training set of a weak classifier. At each iteration, the Adaboost algorithm modifies the sample weights. If a training sample is correctly classified in the current iteration, its weight is reduced; that is, its probability of being used as a training sample in the next iteration is reduced. Conversely, if a training sample is misclassified by the current base classifier, its weight is increased, and so is its probability of being used as a training sample in the next iteration. In this way, subsequent weak classifiers pay more attention to the training samples that are seriously misclassified. The experimental results above show that Adaboost-SVM outperformed RF on both targets. This observation indicates that, on both 10% EF and AUC, ensemble learning based virtual screening is resistant to noise when the number of structure samples is limited, and thus better screening results are obtained.
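The reweighting described above can be sketched as one round of the standard discrete AdaBoost update. This is an illustrative sketch only: the function name `adaboost_reweight` and the labels in {-1, +1} are assumptions for the example, and the textbook update shown here may differ in detail from the Adaboost-SVM formulas of this paper.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One round of the standard discrete AdaBoost weight update.

    weights:     current sample weights (need not be normalized).
    labels:      true classes in {-1, +1} (assumed encoding).
    predictions: outputs of the current base classifier in {-1, +1}.
    Returns (alpha, new_weights), where new_weights sum to 1.
    """
    total = sum(weights)
    # Weighted error of the current base classifier.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h) / total
    if not 0.0 < err < 0.5:
        raise ValueError("base classifier must have weighted error in (0, 0.5)")

    # Voting weight of this base classifier in the final ensemble.
    alpha = 0.5 * math.log((1.0 - err) / err)

    # Correct samples (y * h = +1) shrink; misclassified ones (y * h = -1) grow.
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, labels, predictions)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]
```

For example, with four equally weighted samples of which one is misclassified, the misclassified sample ends up with weight 0.5 and each correctly classified sample with weight 1/6, so the next base classifier is far more likely to train on the difficult sample.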

5. Conclusion

In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. The method uses Pharm-IF to encode protein-ligand interactions in binary form and then uses the improved SVM algorithm, Adaboost-SVM, and Random Forest to classify the data. The idea of ensemble learning for data classification is to obtain a number of weak classifiers that are independent of each other and then to combine them with an effective method. With Adaboost-SVM as the classifier, the 10% EF of the SRC model increased from 4.7 to 5.5 and the AUC value from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value from 0.683 to 0.802. Compared with the naive SVM, Random Forest also performed better on both 10% EF and AUC: the 10% EF improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and the AUC improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning can obtain better results on the imbalanced data set with satisfactory robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue to investigate the reasons in future work. Compared with the traditional method, the proposed method improves the accuracy of the virtual screening model more significantly.

In future work, the problem of improving the accuracy of virtual screening should be studied further from two aspects: virtual screening theory and computer theory. Although virtual screening is becoming increasingly important, its false positive rate remains a problem to be solved. The speed of laboratory determination of protein structures cannot keep up with the needs of drug development; virtual screening will therefore often encounter problems similar to the one in this paper, and there is still much room for improvement in the algorithm. For example, the choice of kernel function directly affects the performance of the classifier. For this kind of data set, a special kernel function should be designed to fit the characteristics of the data.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).

References

[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., "WURCS: the Web3 unique representation of carbohydrate structures," Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.

[2] S. Seuss and A. R. Boccaccini, "Electrophoretic deposition of biological macromolecules, drugs, and cells," Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.

[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, "SVM model for virtual screening of Lck inhibitors," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.

[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, "Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes," Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.

[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., "Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks," Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.

[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., "Hard X-ray detection using a single 100 nm diameter nanowire," Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.

[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.

[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, "Improving tRNAscan-SE annotation results via ensemble classifiers," Molecular Informatics, vol. 34, pp. 761–770, 2015.

[9] C. Lin, Y. Zou, J. Qin et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.

[10] Q. Zou, Z. Wang, X. Guan et al., "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.

[11] D. Takaya, T. Sato, H. Yuki et al., "Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning," Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.

[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, "Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer," Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.

[13] R. Sawada, M. Kotera, and Y. Yamanishi, "Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach," Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.

[14] T. Sato, T. Honma, and S. Yokoyama, "Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening," Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.

[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, "Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation," Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.

[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Brömme, "The role of basic amino acid surface clusters on the collagenase activity of cathepsin K," Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.

[17] A. Anighoro and G. Rastelli, "Enrichment factor analyses on G-Protein coupled receptors with known crystal structure," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.


database, and the enrichment factor (EF) and the ROC curve are used to evaluate the effect of virtual screening and of the machine learning, to ensure the effectiveness of the method. PubChem is a database of chemical molecules and their activities against biological assays. StARLITe is a database containing biological activity and/or binding affinity data between various compounds and proteins, and it is one of the databases that can be used directly in data mining.

All the crystal structures used in this experiment are from the PDB database; the material is available free of charge via the Internet at http://www.rcsb.org. All the training set decoys are from the PubChem data set; the material is available free of charge via the Internet at http://pubchem.ncbi.nlm.nih.gov. We thank laboratory colleagues for providing the StARLITe data.

4.1. Data Set. To apply machine learning, we construct training sets and test sets for the two target proteins; both include decoys and known active compounds. For the training set, we selected the experimentally determined complex structures of the two proteins from the PDB as the positive samples. The Protein Data Bank (PDB) is a crystallographic database of three-dimensional structural data of large biological molecules. The data in the PDB are submitted by biologists and biochemists around the world and are obtained by experimental means such as X-ray crystallography, NMR spectroscopy, or, increasingly, cryoelectron microscopy. To expand the training set, we randomly selected 1, 3, 5, 10, 20, 40, 60, and 80 active compounds, respectively, from the known active compounds whose crystal structures with their targets have not been determined, and this process was repeated 10 times. For each selected compound, we used GLIDE to generate five docking poses and mixed them into the training set. The crystal structures of the two proteins used in the docking experiments were selected from the PDB entries of protein-inhibitor complexes with high inhibitory activity and a high-resolution crystal structure. Entry 2h8h (SRC kinase in complex with a quinazoline inhibitor; resolution 2.30 Å) was selected for SRC. Entry 1u9w (crystal structure of the cysteine protease human Cathepsin K in complex with the covalent inhibitor NVP-ABI491; resolution 2.20 Å) was selected for Cathepsin K.

Table 3: Experimental data structure.

Data set        Positive samples    Negative samples     Total
Training set    100                 2000                 2100
Test set        100 × 5 = 500       2000 × 5 = 10000     10500

We used the SP mode of GLIDE to generate the docking poses of the decoys and of the active compounds whose crystal structures were not experimentally determined. To prepare for docking, we used the Protein Preparation Wizard to add the hydrogen atoms of the protein and optimize their positions. We used Pipeline Pilot (SciTegic) to enumerate the tautomers, stereoisomers, and protonation/deprotonation forms at pH 7.4 of the active and decoy compounds. Additional ring conformations of the compounds were then generated by LigPrep. The GLIDE score was then used to select five poses for each compound from the docking results. Five docking poses of each active compound, chosen by GLIDE score, were used as the positive examples because, in a preliminary test, this procedure gave higher enrichment factors than using one pose per active compound. Other GLIDE settings were left at their default values. We use the averages of the 10% EF and the ROC score over the 10 trials to evaluate the screening efficiencies of the learning models for each number of docking poses. The docking poses of 2000 decoy compounds were then selected at random as the negative samples. Each decoy compound was docked to the target proteins in the same way as described above.
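The step of keeping five poses per compound by docking score can be sketched as a simple group-and-sort. This is a hypothetical illustration, not the GLIDE workflow itself: the function name `top_poses`, the `(compound_id, pose_id, score)` tuple layout, and the assumption that a lower (more negative) docking score indicates a better pose are all choices made for the example.

```python
from collections import defaultdict


def top_poses(poses, k=5):
    """Keep the k best-scoring poses of each compound.

    poses: iterable of (compound_id, pose_id, score) tuples (assumed layout).
    A lower score is treated as better (assumption stated in the lead-in).
    Returns {compound_id: [pose_id, ...]} with poses ordered best first.
    """
    by_compound = defaultdict(list)
    for compound_id, pose_id, score in poses:
        by_compound[compound_id].append((score, pose_id))

    selected = {}
    for compound_id, scored in by_compound.items():
        scored.sort()  # ascending score, so the best pose comes first
        selected[compound_id] = [pose_id for _, pose_id in scored[:k]]
    return selected
```

Each selected pose would then be encoded as one fingerprint, which is consistent with a set of 100 active compounds contributing 100 × 5 = 500 positive samples in Table 3.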

After completing the training set, we built the test set. First, the active compounds of the target proteins (IC50 ≤ 10 μM) were chosen from StARLITe and divided into 100 clusters by hierarchical clustering with the Ward method, based on the Euclidean distance between their 2D structure fingerprints. The compound with the highest inhibitory activity was selected from each cluster. The 100 active compounds obtained for each target were docked to their target protein, and five docking poses of each active compound were used as the positive samples of the test set. The docking procedure and the target protein crystal structures are the same as those mentioned above. The decoys for the test set were then chosen with the same negative-sample selection strategy as for the training set.

After the data preparation, we use Pharm-IF to quantify the training set. The resulting data are then used as the input of the machine learning algorithms to obtain the corresponding learning models, and each learning model obtained is applied to the test set.

From Table 3, it can be seen that the ratio of negative samples to positive samples is as high as 20:1, which leads to a significant imbalanced-data problem; the motivation of our work is to address this issue.

42 The Evaluation Index of the Machine Learning Combinedwith Virtual Screening At present the virtual screening andmachine learning have their own evaluation index [17] Theenrichment factor (EF) is one of the most famous measuresfor evaluating the screening efficiency EF is usually used toevaluate the early recognition properties of screeningmethodand it can indicate the ratio of the number of obtained activecompounds by in silico screening against that generated byrandom selection at the predefined sampling percentageThecalculation method of EF is as follows

EF =Hits119904119873119904

Hit119905119873119905

(14)

HereHits119904represents the number of active compounds in

the sample Hit119905is the number of active compounds tested119873

119904

is the number of compounds sampled and 119873119905is the number

of all compounds In the actual drug discovery only a smallpart of the compound is filtered by computer In general 001ndash1 of the compounds will be selected from a huge compounddatabase (10000ndash1000000 compounds) in the actual virtualscreening process Since the number of test sets in this exper-iment is far from reaching this order of magnitude we use

Computational and Mathematical Methods in Medicine 7

False positive rate1009080706050403020100

10

09

08

07

06

05

04

03

02

01

00

True

pos

itive

rate

X

Y

Figure 3 Two ROC curves 119883 and 119884

10EF to carry out this test in order to reduce the deviation ofthe early evaluation EF has only specific sampling proportionscreening efficiency therefore this paper also introduced theROC curve and AUC value to assess the entire range ofsampling (0ndash100)TheROC curve (receiver operating char-acteristic curve) is a graphical method to show the tradeoffbetween false positive rate and true positive rate of classifierAs shown in Figure 3 in ROC curve the true positive rate(TPR) is plotted along the 119910-axis while the false positive rate(FPR) is displayed on the 119909-axis Although the ROC curvecan directly reflect the effect of the machine learning modelwe also need a kind of numerical method the AUC (areaunder ROC curve) value to evaluate the effect of the modelin the practical applicationThe AUC value indicates the areaunder the ROC curve and it is more intuitive and accurateThe calculation method of AUC value is as follows

AUC = int

1

0

TP119875

119889FP119873

=1

119875 sdot 119873int

119873

0

TP 119889 FP (15)

Among all the variables of formula (15) 119875 stands forthe positive samples 119873 represents the negative samples TP(true positive) stands for the active compounds that areclassified correctly and FP (false positive) stands for themisclassification of active compounds

AUC values are between 05 and 10 if the model is per-fect the AUC value is 1 if the model is only a random guessthe AUC value is 05 If a model is better than another AUCvalue of the better one is higher ROC curves andAUC are notaffected by imbalance distribution of data class and normaldistribution of the data In addition the AUC value allowsa middle state and experimental results can be divided intomultiple ordered classification

43 The Analysis of Experimental Results According to theexperiment we used three different classification methods

False positive rate1009080706050403020100

True

pos

itive

rate

10

09

08

07

06

05

04

03

02

01

00

Adaboost-SVMRandom ForestSVM

Figure 4 Comparative experiment of SRC

True

pos

itive

rate

False positive rate1009080706050403020100

Adaboost-SVMRandom ForestSVM

10

09

08

07

06

05

04

03

02

01

00

Figure 5 Comparative experiment of Cathepsin K

(SVM Adaboost-SVM and RF) to compare the two kinds oftarget proteins for virtual screening experiment As the base-line algorithm RF is also developed based on the machinelearning libs of the authorsrsquo lab and the parameters of RF areset in Table 4

The ROC curves from the comparative experimentson these two kinds of data sets using ensemble learning(compared with SVM) are as those in Figures 4 and 5

Table 5 shows the 10 EF of the two methods andcalculates the AUC value

Aiming at the problemof sample set quality theAdaboostmethod gives the same weight value to each training data thesample weight represents the probability of the data treatedas the training set by a weak classifier At each iterationthe Adaboost algorithm will modify the weight value of thesample if the training sample is correctly classified in this

8 Computational and Mathematical Methods in Medicine

Table 4 Random Forest parameter setting

Parameter name Parameter valuesTree number 1000Node size 5The number of different descriptors tried ateach split 50

Table 5 Experimental comparison of SVM Adaboost-SVM andRandom Forest

Algorithm Target protein 10 EF AUC

SVM SRC 47 0734Cathepsin K 39 0683

Adaboost-SVM SRC 55 0821Cathepsin K 48 0802

Random Forest SRC 53 0805Cathepsin K 45 0783

iteration the weight value of the sample will be reducedthat is the probability of being treated as the training sampleis reduced in the next iteration On the contrary if thetraining sample is misclassified in the current base classifiertheweight value of the samplewill be increased and the prob-ability of being treated as the training samplewill be increasedin the next iteration In this way the weak classifier will paymore attention to the serious misclassification of training setThe experimental results above show that Adaboost-SVMhas notably outperformed RFThis observation indicates thaton both 10 EF and AUC ensemble learning based virtualscreening has shown its ability of noise resistance under thesituation that the amount of structure samples is limited thusthe better screening results are obtained

5 Conclusion

In this paper we use ensemble learning method to solve theproblem caused by the quality of the training setThismethodmainly uses Pharm-IF to encode protein-ligand interactionsas a binary form and then uses the improved SVM algorithmAdaboost-SVM and Random Forest to classify the data Theidea of ensemble learning in dealing with data classificationproblem is to get a number of weak classifiers which areindependent of each other and then use an effective methodto combine these independent weak classifiers By comparingthe experimental results after the Adaboost-SVM is used asthe classifier 10 EF for the SRCmodel increased from 47 to55 and the AUC value increased from 0734 to 0821 10 EFof Cathepsin Kmodel increased from 39 to 48 and the AUCvalue increased from 0683 to 0802 It can be observed fromthe results that comparing with the naıve SVM RandomForest has obtained better performance on both 10 EFand AUC 10 EF is improved to 53 and 45 on SRC andCathepsin K respectively and AUC is improved to 0805 and0783 respectively As a classic ensemble learning algorithmRandom Forest has shown that ensemble learning is ableto get better results on the imbalanced data set with satisfying

robustness Nevertheless the performance of Random Forestis lower than Adaboost-SVM and we will continue investi-gating the reasons in our future work Compared with thetraditional method the proposed method is more significantfor the improvement of the accuracy of the virtual screeningmodel In the future work the problem of improving theaccuracy of virtual screening should be further studiedfrom two aspects virtual screening theory and computertheory Although the status of virtual screening is graduallyincreasing the problem of virtual screening false positive rateis still to be solved The speed of laboratory determination ofprotein structure has been unable to catch up with the needsof drug development therefore in the virtual screening itwill often encounter problems similar to this paper so thereare still many improvements in the algorithm For examplethe selection of kernel function will directly affect the perfor-mance of the classifier In view of the problem of this kind ofdata set we should set up a special kernel function to adapt tothe characteristics of the data set

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partially supported by Natural Science Founda-tion of Heilongjiang Province (QC2013C060) Science Fundsfor the Young Innovative Talents of HUST (no 201304)China Postdoctoral Science Foundation (2011M500682) andPostdoctoral Science Foundation of Heilongjiang Province(LBH-Z11106)

References

[1] K Tanaka K F Aoki-Kinoshita M Kotera et al ldquoWURCS theWeb3 unique representation of carbohydrate structuresrdquo Jour-nal of Chemical Information and Modeling vol 54 no 6 pp1558ndash1566 2014

[2] S Seuss and A R Boccaccini ldquoElectrophoretic depositionof biological macromolecules drugs and cellsrdquo Biomacro-molecules vol 14 no 10 pp 3355ndash3369 2013

[3] C Y Liew X H Ma X Liu and C W Yap ldquoSVM model forvirtual screening of Lck inhibitorsrdquo Journal of Chemical Infor-mation and Modeling vol 49 no 4 pp 877ndash885 2009

[4] A Kurczyk D Warszycki R Musiol R Kafel A J Bojarskiand J Polanski ldquoLigand-based virtual screening in a search fornovel anti-HIV-1 chemotypesrdquo Journal of Chemical Informationand Modeling vol 55 no 10 pp 2168ndash2177 2015

[5] A E Bilsland A Pugliese Y Liu et al ldquoIdentification of aselective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networksrdquoNeopla-sia vol 17 no 9 pp 704ndash715 2015

[6] JWallentinM Osterhoff R NWilke et al ldquoHard X-ray detec-tion using a single 100 nm diameter nanowirerdquo Nano Lettersvol 14 no 12 pp 7071ndash7076 2014

[7] L Song D Li X Zeng YWu L Guo andQ Zou ldquonDNA-protidentification of DNA-binding proteins based on unbalancedclassificationrdquo BMC Bioinformatics vol 15 no 1 article 298 10pages 2014

Computational and Mathematical Methods in Medicine 9

[8] Q Zou J S Guo Y Ju M H Wu X X Zeng and Z L HongldquoImproving tRNAscan-SE annotation results via ensemble clas-sifiersrdquoMolecular Informatics vol 34 pp 761ndash770 2015

[9] C Lin Y Zou J Qin et al ldquoHierarchical classification of proteinfolds using a novel ensemble classifierrdquo PLoS ONE vol 8 no 2Article ID e56499 2013

[10] Q Zou Z Wang X Guan et al ldquoAn approach for identify-ing cytokines based on a novel ensemble classifierrdquo BioMedResearch International vol 8 no 8 pp 616ndash617 2013

[11] D Takaya T Sato H Yuki et al ldquoPrediction of ligand-inducedstructural polymorphism of receptor interaction sites usingmachine learningrdquo Journal of Chemical Information and Mod-eling vol 53 no 3 pp 704ndash716 2013

[12] V Ramatenki S R Potlapally R K Dumpati R VadijaU Vuruputuri and V Ramatenki ldquoHomology modeling andvirtual screening of ubiquitin conjugation enzyme E2A fordesigning a novel selective antagonist against cancerrdquo Journal ofReceptor and Signal Transduction Research vol 35 no 6 pp536ndash549 2015

[13] R SawadaM Kotera and Y Yamanishi ldquoBenchmarking a widerange of chemical descriptors for drug-target interaction pre-diction using a chemogenomic approachrdquoMolecular Informat-ics vol 33 no 11-12 pp 719ndash731 2014

[14] T Sato T Honma and S Yokoyama ldquoCombining machinelearning and pharmacophore-based interaction fingerprint forin silico screeningrdquo Journal of Chemical Information and Mod-eling vol 50 no 1 pp 170ndash185 2010

[15] M-Y Song C Hong S H Bae I So and K-S Park ldquoDynamicmodulation of the Kv21 channel by Src-dependent tyrosinephosphorylationrdquo Journal of ProteomeResearch vol 11 no 2 pp1018ndash1026 2012

[16] F S Nallaseth F Lecaille Z Li and D Bromme ldquoThe role ofbasic amino acid surface clusters on the collagenase activity ofcathepsin Krdquo Biochemistry vol 52 no 44 pp 7742ndash7752 2013

[17] A Anighoro and G Rastelli ldquoEnrichment factor analyses on G-Protein coupled receptors with known crystal structurerdquo Jour-nal of Chemical Information and Modeling vol 53 no 4 pp739ndash743 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Page 7: Research Article The Virtual Screening of the Drug Protein ...downloads.hindawi.com/journals/cmmm/2016/4809831.pdf · Research Article The Virtual Screening of the Drug Protein with

Computational and Mathematical Methods in Medicine 7

False positive rate1009080706050403020100

10

09

08

07

06

05

04

03

02

01

00

True

pos

itive

rate

X

Y

Figure 3 Two ROC curves 119883 and 119884

10EF to carry out this test in order to reduce the deviation ofthe early evaluation EF has only specific sampling proportionscreening efficiency therefore this paper also introduced theROC curve and AUC value to assess the entire range ofsampling (0ndash100)TheROC curve (receiver operating char-acteristic curve) is a graphical method to show the tradeoffbetween false positive rate and true positive rate of classifierAs shown in Figure 3 in ROC curve the true positive rate(TPR) is plotted along the 119910-axis while the false positive rate(FPR) is displayed on the 119909-axis Although the ROC curvecan directly reflect the effect of the machine learning modelwe also need a kind of numerical method the AUC (areaunder ROC curve) value to evaluate the effect of the modelin the practical applicationThe AUC value indicates the areaunder the ROC curve and it is more intuitive and accurateThe calculation method of AUC value is as follows

AUC = int

1

0

TP119875

119889FP119873

=1

119875 sdot 119873int

119873

0

TP 119889 FP (15)

Among all the variables of formula (15) 119875 stands forthe positive samples 119873 represents the negative samples TP(true positive) stands for the active compounds that areclassified correctly and FP (false positive) stands for themisclassification of active compounds

AUC values of useful models lie between 0.5 and 1.0: if the model is perfect, the AUC value is 1; if the model is only a random guess, the AUC value is 0.5. If one model is better than another, its AUC value is higher. ROC curves and AUC are affected neither by an imbalanced class distribution nor by whether the data are normally distributed. In addition, the AUC value allows intermediate states, so experimental results can be divided into multiple ordered classes.
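As a minimal sketch of formula (15), the following Python code sweeps a decision threshold over classifier scores, collects the (FPR, TPR) points of the ROC curve, and integrates them with the trapezoidal rule. The function names `roc_points` and `auc`, as well as the toy scores and labels, are illustrative assumptions and are not taken from the experiments in this paper.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FPR, TPR) points.

    scores: classifier outputs, higher means "more likely active".
    labels: 1 for active (positive) compounds, 0 for inactive (negative) ones.
    """
    p = sum(labels)               # P: number of positive samples
    n = len(labels) - p           # N: number of negative samples
    if p == 0 or n == 0:
        raise ValueError("both classes must be present")

    # Rank compounds from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Compounds with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n, tp / p))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical data): three actives and three inactives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(roc_points(toy_scores, toy_labels))
print(round(toy_auc, 4))  # 0.8889, i.e. 8 of the 9 active/inactive pairs are ranked correctly
```

The trapezoidal rule is used here because the ROC curve of a finite test set is piecewise linear, so the integral in formula (15) reduces to a finite sum.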

4.3. The Analysis of Experimental Results. In the experiment, we used three different classification methods (SVM, Adaboost-SVM, and RF) to compare the virtual screening performance on the two target proteins. As the baseline algorithm, RF was also developed based on the machine learning libraries of the authors' lab, and the parameters of RF are set as in Table 4.

Figure 4: Comparative experiment of SRC (ROC curves of Adaboost-SVM, Random Forest, and SVM; x-axis: false positive rate, 0–100%; y-axis: true positive rate, 0.0–1.0).

Figure 5: Comparative experiment of Cathepsin K (ROC curves of Adaboost-SVM, Random Forest, and SVM; x-axis: false positive rate, 0–100%; y-axis: true positive rate, 0.0–1.0).

The ROC curves from the comparative experiments on these two data sets, using ensemble learning (compared with SVM), are shown in Figures 4 and 5.

Table 5 lists the 10% EF and the AUC values of the three methods.
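For reference, the 10% EF can be computed from a ranked screening list as sketched below. The sketch assumes the commonly used definition of the enrichment factor (the fraction of actives among the top-ranked 10% of compounds divided by the fraction of actives in the whole library); the function name `enrichment_factor` and the toy library are hypothetical and do not reproduce the data sets of this paper.

```python
def enrichment_factor(scores, labels, fraction=0.10):
    """Enrichment factor at the given sampling fraction.

    EF = (actives in top fraction / size of top fraction)
         / (actives in library / size of library)
    labels: 1 for active compounds, 0 for inactive ones.
    """
    n_total = len(labels)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("the library must contain at least one active compound")

    # Number of compounds in the sampled top fraction (at least one).
    n_top = max(1, int(round(n_total * fraction)))

    # Rank by descending score and count the actives in the top fraction.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = sum(label for _, label in ranked[:n_top])

    return (hits / n_top) / (n_actives / n_total)


# Toy library (hypothetical): 20 compounds, 4 actives, already ranked by score.
toy_labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1] + [0] * 10
toy_scores = [20 - i for i in range(20)]
toy_ef = enrichment_factor(toy_scores, toy_labels)
print(toy_ef)  # 5.0: both compounds in the top 10% are active, versus 20% actives overall
```

Under this definition the 10% EF cannot exceed 10, which is reached only when every compound in the top 10% is active.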

Table 4: Random Forest parameter setting.

| Parameter name | Parameter value |
|---|---|
| Tree number | 1000 |
| Node size | 5 |
| Number of different descriptors tried at each split | 50 |

Table 5: Experimental comparison of SVM, Adaboost-SVM, and Random Forest.

| Algorithm | Target protein | 10% EF | AUC |
|---|---|---|---|
| SVM | SRC | 4.7 | 0.734 |
| SVM | Cathepsin K | 3.9 | 0.683 |
| Adaboost-SVM | SRC | 5.5 | 0.821 |
| Adaboost-SVM | Cathepsin K | 4.8 | 0.802 |
| Random Forest | SRC | 5.3 | 0.805 |
| Random Forest | Cathepsin K | 4.5 | 0.783 |

To address the problem of sample set quality, the Adaboost method initially gives the same weight to every training sample; the sample weight represents the probability that the sample is selected into the training set of a weak classifier. At each iteration, the Adaboost algorithm modifies the sample weights. If a training sample is correctly classified in this iteration, its weight is reduced; that is, the probability of it being selected as a training sample in the next iteration is reduced. Conversely, if a training sample is misclassified by the current base classifier, its weight is increased, and so is the probability of it being selected as a training sample in the next iteration. In this way, the weak classifiers pay more attention to the training samples that are seriously misclassified. The experimental results above show that Adaboost-SVM outperformed RF on both targets, and that both ensemble methods outperformed the single SVM. This observation indicates that, on both 10% EF and AUC, ensemble-learning-based virtual screening is resistant to noise when the number of structure samples is limited, and thus obtains better screening results.
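The reweighting scheme described above can be sketched in a few lines of Python. The sketch follows the standard discrete Adaboost update with labels in {-1, +1}; it is a simplified illustration under that assumption and not the Adaboost-SVM implementation used in our experiments. The names `adaboost_reweight`, `resample_indices`, and `weighted_vote` are hypothetical, and the predictions of the base classifier (an SVM in our setting) are simply passed in as a list.

```python
import math
import random


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete Adaboost weight update.

    weights: current sample weights, assumed to sum to 1.
    y_true, y_pred: true labels and base-classifier predictions in {-1, +1}.
    Returns (alpha, new_weights), where alpha is the vote of this classifier.
    """
    # Weighted error of the current base classifier.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    err = min(max(err, 1e-10), 1.0 - 1e-10)   # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1.0 - err) / err)

    # Correct samples (y*h = +1) are scaled down, misclassified ones (y*h = -1) up.
    new = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]


def resample_indices(weights, k, rng):
    """Draw k training indices with probability proportional to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=k)


def weighted_vote(alphas, predictions):
    """Combine base classifiers: sign of the alpha-weighted sum of their votes."""
    n = len(predictions[0])
    combined = []
    for i in range(n):
        s = sum(a * p[i] for a, p in zip(alphas, predictions))
        combined.append(1 if s >= 0 else -1)
    return combined


# Toy round (hypothetical data): five samples, the fourth one is misclassified.
y = [1, 1, -1, -1, 1]
h = [1, 1, -1, 1, 1]
w0 = [0.2] * 5
alpha, w1 = adaboost_reweight(w0, y, h)
print([round(w, 3) for w in w1])  # [0.125, 0.125, 0.125, 0.5, 0.125]
next_training_set = resample_indices(w1, k=5, rng=random.Random(0))
```

After one round, the misclassified sample carries half of the total weight, so it is far more likely to be drawn into the training set of the next base classifier.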

5. Conclusion

In this paper, we use an ensemble learning method to solve the problem caused by the quality of the training set. The method uses Pharm-IF to encode protein-ligand interactions in binary form and then uses the improved SVM algorithm, Adaboost-SVM, and Random Forest to classify the data. The idea of ensemble learning for data classification is to obtain a number of weak classifiers that are independent of each other and then to combine them with an effective method. Comparing the experimental results, after Adaboost-SVM is used as the classifier, the 10% EF for the SRC model increased from 4.7 to 5.5 and the AUC value increased from 0.734 to 0.821; the 10% EF of the Cathepsin K model increased from 3.9 to 4.8 and the AUC value increased from 0.683 to 0.802. The results also show that, compared with the naïve SVM, Random Forest obtained better performance on both 10% EF and AUC: 10% EF improved to 5.3 and 4.5 on SRC and Cathepsin K, respectively, and AUC improved to 0.805 and 0.783, respectively. As a classic ensemble learning algorithm, Random Forest shows that ensemble learning can achieve better results on imbalanced data sets with satisfactory robustness. Nevertheless, the performance of Random Forest is lower than that of Adaboost-SVM, and we will continue investigating the reasons in our future work. Compared with the traditional method, the proposed method improves the accuracy of the virtual screening model more significantly.

In future work, the problem of improving the accuracy of virtual screening should be further studied from two aspects: virtual screening theory and computer theory. Although virtual screening is becoming increasingly important, its false positive rate remains a problem to be solved. The speed of laboratory determination of protein structures has been unable to keep up with the needs of drug development; therefore, virtual screening will often encounter problems similar to the one addressed in this paper, and there is still much room for improvement in the algorithm. For example, the selection of the kernel function directly affects the performance of the classifier. For this kind of data set, a special kernel function should be designed to adapt to the characteristics of the data.
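As one illustration of a kernel adapted to binary fingerprint data, the sketch below implements the Tanimoto (Jaccard) kernel, which is widely used for bit-vector descriptors in cheminformatics. This is only an example of the kind of data-specific kernel meant above; it is an assumption of this sketch and was not evaluated in this paper. The function names `tanimoto_kernel` and `gram_matrix` and the toy fingerprints are hypothetical.

```python
def tanimoto_kernel(a, b):
    """Tanimoto similarity of two equal-length binary fingerprints.

    k(a, b) = |a AND b| / |a OR b|; two all-zero fingerprints are defined
    here to have similarity 1.0.
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if either == 0 else both / either


def gram_matrix(fingerprints):
    """Pairwise kernel matrix, as required by kernel methods such as SVM."""
    return [[tanimoto_kernel(a, b) for b in fingerprints] for a in fingerprints]


# Toy interaction fingerprints (hypothetical bits).
fp1 = [1, 1, 0, 0]
fp2 = [1, 0, 1, 0]
fp3 = [0, 0, 0, 1]
K = gram_matrix([fp1, fp2, fp3])
print(K[0][1])  # one shared bit out of three set bits, i.e. 1/3
```

A precomputed Gram matrix of this form can be supplied to SVM implementations that accept user-defined kernels.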

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of Heilongjiang Province (QC2013C060), the Science Funds for the Young Innovative Talents of HUST (no. 201304), the China Postdoctoral Science Foundation (2011M500682), and the Postdoctoral Science Foundation of Heilongjiang Province (LBH-Z11106).

References

[1] K. Tanaka, K. F. Aoki-Kinoshita, M. Kotera et al., “WURCS: the Web3 unique representation of carbohydrate structures,” Journal of Chemical Information and Modeling, vol. 54, no. 6, pp. 1558–1566, 2014.

[2] S. Seuss and A. R. Boccaccini, “Electrophoretic deposition of biological macromolecules, drugs, and cells,” Biomacromolecules, vol. 14, no. 10, pp. 3355–3369, 2013.

[3] C. Y. Liew, X. H. Ma, X. Liu, and C. W. Yap, “SVM model for virtual screening of Lck inhibitors,” Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 877–885, 2009.

[4] A. Kurczyk, D. Warszycki, R. Musiol, R. Kafel, A. J. Bojarski, and J. Polanski, “Ligand-based virtual screening in a search for novel anti-HIV-1 chemotypes,” Journal of Chemical Information and Modeling, vol. 55, no. 10, pp. 2168–2177, 2015.

[5] A. E. Bilsland, A. Pugliese, Y. Liu et al., “Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks,” Neoplasia, vol. 17, no. 9, pp. 704–715, 2015.

[6] J. Wallentin, M. Osterhoff, R. N. Wilke et al., “Hard X-ray detection using a single 100 nm diameter nanowire,” Nano Letters, vol. 14, no. 12, pp. 7071–7076, 2014.

[7] L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, “nDNA-prot: identification of DNA-binding proteins based on unbalanced classification,” BMC Bioinformatics, vol. 15, no. 1, article 298, 10 pages, 2014.

[8] Q. Zou, J. S. Guo, Y. Ju, M. H. Wu, X. X. Zeng, and Z. L. Hong, “Improving tRNAscan-SE annotation results via ensemble classifiers,” Molecular Informatics, vol. 34, pp. 761–770, 2015.

[9] C. Lin, Y. Zou, J. Qin et al., “Hierarchical classification of protein folds using a novel ensemble classifier,” PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.

[10] Q. Zou, Z. Wang, X. Guan et al., “An approach for identifying cytokines based on a novel ensemble classifier,” BioMed Research International, vol. 8, no. 8, pp. 616–617, 2013.

[11] D. Takaya, T. Sato, H. Yuki et al., “Prediction of ligand-induced structural polymorphism of receptor interaction sites using machine learning,” Journal of Chemical Information and Modeling, vol. 53, no. 3, pp. 704–716, 2013.

[12] V. Ramatenki, S. R. Potlapally, R. K. Dumpati, R. Vadija, U. Vuruputuri, and V. Ramatenki, “Homology modeling and virtual screening of ubiquitin conjugation enzyme E2A for designing a novel selective antagonist against cancer,” Journal of Receptor and Signal Transduction Research, vol. 35, no. 6, pp. 536–549, 2015.

[13] R. Sawada, M. Kotera, and Y. Yamanishi, “Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach,” Molecular Informatics, vol. 33, no. 11-12, pp. 719–731, 2014.

[14] T. Sato, T. Honma, and S. Yokoyama, “Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening,” Journal of Chemical Information and Modeling, vol. 50, no. 1, pp. 170–185, 2010.

[15] M.-Y. Song, C. Hong, S. H. Bae, I. So, and K.-S. Park, “Dynamic modulation of the Kv2.1 channel by Src-dependent tyrosine phosphorylation,” Journal of Proteome Research, vol. 11, no. 2, pp. 1018–1026, 2012.

[16] F. S. Nallaseth, F. Lecaille, Z. Li, and D. Brömme, “The role of basic amino acid surface clusters on the collagenase activity of cathepsin K,” Biochemistry, vol. 52, no. 44, pp. 7742–7752, 2013.

[17] A. Anighoro and G. Rastelli, “Enrichment factor analyses on G-protein coupled receptors with known crystal structure,” Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 739–743, 2013.
