articlenitish kumar mishra1, deepak singla1, sandhya agarwal1, open source drug discovery...

7
Delivered by Publishing Technology to: Guest User IP: 14.139.234.82 On: Sun, 11 May 2014 11:12:34 Copyright: American Scientific Publishers Copyright © 2014 American Scientific Publishers All rights reserved Printed in the United States of America Article Journal of Translational Toxicology Vol. 1, 21–27, 2014 www.aspbs.com/jtt ToxiPred: A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T. Pyriformis Nitish Kumar Mishra 1 , Deepak Singla 1 , Sandhya Agarwal 1 , Open Source Drug Discovery Consortium 2 , and Gajendra P. S. Raghava 1 1 Bioinformatics Centre, Institute of Microbial Technology (CSIR), Chandigarh 160036, India 2 The Open Source Drug Discovery (OSDD) Consortium, Council of Scientific and Industrial Research, Anusandhan Bhavan, 2 Rafi Marg, Delhi 110001, India Background: Toxicity Prediction is one of the crucial issues as various industrial chemicals are linked with acute and chronic human diseases like carcinogenicity, mutagenecity. Thus, there is a growing need to risk assessment of these chemicals. Tetrahymena pyriformis is used as a model organism to accessed the environmental fate of a chemical to address the toxicity potential of organic chemicals. Our study is based on large diverse dataset of 1208 compounds taken from an international open competition ICANN09 was organized for aqueous toxicity prediction of chemical molecules against Tetrahymena pyriformis. Results: This study described the development of Quantitative Structure Toxicity Rela- tionship (QSTR) model for the prediction of aqueous toxicity against T. pyriformis. Firstly, model developed on 1002 V-life calculated molecular descriptors shows a R/R 2 0.874/0.76 with RMSE 0.523. Further, selection of relevant descrip- tors leads to only 9 descriptors, which shows a performance R/R 2 0.846/0.71 with RMSE 0.574 while on blind dataset 0.756/0.570 with RMSE 0.570 respectively. Second, model developed on CDK based 178 descriptors shows correlation (R) 0.876/0.85, R 2 0.77/0.72 with RMSE 0.518/0.556 on training and blind dataset respectively. Next, model developed on selected 6 descriptors from CDK shows nearly equal performance with R 0.866/0.823, R 2 0.75/0.66 with RMSE 0.541/0.609 on training and blind dataset respectively. Finally, a hybrid model based on selected 17 descriptors from both V-life and CDK shows significant improvement in performance on both training and blind dataset with R 0.89/0.85, R 2 0.79/0.72 with RMSE 0.491/0.557 respectively. It was also observed that Molecular mass (M.W.), and XLogP have very high correlation with toxicity of chemical molecules, it suggests that size and solubility of chemical molecules play major role in toxicity. Our results suggest that it is possible to develop web service for computing toxicity of chemicals using non-commercial software. Conclusions: Our present study demonstrates that performance of a QSTR model depends on the quality/quantity of descriptors as well as on used techniques. Based on these observations, we developed a web server ToxiPred (http://crdd.osdd.net/raghava/toxipred), for environmental risk assessment of small chemical compounds. KEYWORDS: Toxicity, QSAR/QSATR, Prediction, Machine Learning Techniques, ToxiPred, SVM. INTRODUCTION Toxicity assessment for a given compound using toxico- logical experiment is a mammoth task due to cost and time. 1 2 A number of biological interactions and differ- ent environment with the living organisms are responsible for accurate determination of toxicity, but data that quite often are not available. 3 A generally accepted strategy for overcoming the shortage of experimental measurements is Author to whom correspondence should be addressed. Email: [email protected] Received: 14 July 2012 Revised/Accepted: 22 September 2012 the analysis based on Quantitative Structure–Activity Rela- tionships (QSAR). 4 5 Computational tools fasten the envi- ronmental assessment process by significantly reduce the cost of experimental. In past several QSAR/QSTR based models have been developed for toxicity prediction. 6–14 In year 2011, also a method for classification of toxic and non-toxic com- pounds was developed on a large diverse dataset. 15 The most critical limitation of existing QSAR studies is there low performance on blind/independent dataset despite its high accuracy on training data set. This is due to over optimization or over training on dataset used for training. Therefore, it is important to use J. Transl. Toxicol. 2014, Vol. 1, No. 1 2332-242X/2014/1/021/007 doi:10.1166/jtt.2014.1005 21

Upload: others

Post on 10-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Copyright copy 2014 American Scientific PublishersAll rights reservedPrinted in the United States of America

ArticleJournal of

TranslationalToxicology

Vol 1 21ndash27 2014wwwaspbscomjtt

ToxiPred A Server for Prediction of Aqueous Toxicity ofSmall Chemical Molecules in T Pyriformis

Nitish Kumar Mishra1 Deepak Singla1 Sandhya Agarwal1 Open Source Drug DiscoveryConsortium2 and Gajendra P S Raghava1lowast1Bioinformatics Centre Institute of Microbial Technology (CSIR) Chandigarh 160036 India2The Open Source Drug Discovery (OSDD) Consortium Council of Scientific and Industrial ResearchAnusandhan Bhavan 2 Rafi Marg Delhi 110001 India

Background Toxicity Prediction is one of the crucial issues as various industrial chemicals are linked with acute andchronic human diseases like carcinogenicity mutagenecity Thus there is a growing need to risk assessment of thesechemicals Tetrahymena pyriformis is used as a model organism to accessed the environmental fate of a chemical toaddress the toxicity potential of organic chemicals Our study is based on large diverse dataset of 1208 compounds takenfrom an international open competition ICANN09 was organized for aqueous toxicity prediction of chemical moleculesagainst Tetrahymena pyriformis Results This study described the development of Quantitative Structure Toxicity Rela-tionship (QSTR) model for the prediction of aqueous toxicity against T pyriformis Firstly model developed on 1002V-life calculated molecular descriptors shows a RR2 0874076 with RMSE 0523 Further selection of relevant descrip-tors leads to only 9 descriptors which shows a performance RR2 0846071 with RMSE 0574 while on blind dataset07560570 with RMSE 0570 respectively Second model developed on CDK based 178 descriptors shows correlation(R) 0876085 R2 077072 with RMSE 05180556 on training and blind dataset respectively Next model developedon selected 6 descriptors from CDK shows nearly equal performance with R 08660823 R2 075066 with RMSE05410609 on training and blind dataset respectively Finally a hybrid model based on selected 17 descriptors from bothV-life and CDK shows significant improvement in performance on both training and blind dataset with R 089085 R2

079072 with RMSE 04910557 respectively It was also observed that Molecular mass (MW) and XLogP have veryhigh correlation with toxicity of chemical molecules it suggests that size and solubility of chemical molecules play majorrole in toxicity Our results suggest that it is possible to develop web service for computing toxicity of chemicals usingnon-commercial software Conclusions Our present study demonstrates that performance of a QSTR model dependson the qualityquantity of descriptors as well as on used techniques Based on these observations we developed a webserver ToxiPred (httpcrddosddnetraghavatoxipred) for environmental risk assessment of small chemical compounds

KEYWORDS Toxicity QSARQSATR Prediction Machine Learning Techniques ToxiPred SVM

INTRODUCTIONToxicity assessment for a given compound using toxico-logical experiment is a mammoth task due to cost andtime12 A number of biological interactions and differ-ent environment with the living organisms are responsiblefor accurate determination of toxicity but data that quiteoften are not available3 A generally accepted strategy forovercoming the shortage of experimental measurements is

lowastAuthor to whom correspondence should be addressedEmail raghavaimtechresinReceived 14 July 2012RevisedAccepted 22 September 2012

the analysis based on Quantitative StructurendashActivity Rela-tionships (QSAR)45 Computational tools fasten the envi-ronmental assessment process by significantly reduce thecost of experimentalIn past several QSARQSTR based models have been

developed for toxicity prediction6ndash14 In year 2011 alsoa method for classification of toxic and non-toxic com-pounds was developed on a large diverse dataset15

The most critical limitation of existing QSAR studiesis there low performance on blindindependent datasetdespite its high accuracy on training data set This isdue to over optimization or over training on datasetused for training Therefore it is important to use

J Transl Toxicol 2014 Vol 1 No 1 2332-242X20141021007 doi101166jtt20141005 21

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

standard cross-validation criteria for testing trainingand independent validation Thus there is a need todevelop fast and robust in-silico model for predictingtoxicity of chemicals In order to address this prob-lem the highly diverse and large dataset taken fromInternational Conference on Artificial Neural Networks(ICANN09) benchmarking the performance of toxic-ity prediction methods (httpwwwcadastereunode65)was usedThe purpose of this study is firstly to develop a highly

robust and accurate QSAR model for the prediction ofaqueous toxicity of small chemical molecules and sec-ondly to develop softwareserver for public use For thispurpose we have used CDK16 and Vlife softwarersquos forcalculating descriptors of chemicals and WekaRapidMinerfor feature selection This study will be useful for experi-mental biologist for predicting the toxicity of a compoundagainst T pyriformis

MATERIAL AND METHODSDatasetIn this study total 1213 molecules have been used fordeveloping QSAR models for aqueous toxicity predictionobtained from httpwwwcadastereunode67 (ICANN09competition web site) MOPAC based optimized 3D struc-ture of these molecules are available in mol2 format alongwith pIGC50 values (logarithm of 50 growth inhibitoryconcentration) There was an error in 5 molecules fordescriptors calculations therefore the overall study is basedon 1208 compounds rather than 1213 compounds Thedataset was divided randomly into training and blind setwith 1108 and 100 molecules respectively Further Inorder to explore the diversity of toxicity dataset their toxi-city value and four molecular descriptors molecular weight(Mol wt) XlogP nHBA (no of hydrogen bond accep-tors) nHBD (no of hydrogen bond donors) calculatedusing CDK libraries for each compounds were analyzedby radar chart (Fig 1)

Molecular DescriptorsIn this study QSTR models were developed using descrip-tors computed from two softwarersquos namely CDK andVlife Vlife allows computing sim1002 type of molec-ular descriptors that can be broadly categorized intostructural thermodynamic electronic molecular surfacebased descriptors We also computed 178 moleculardescriptors using Chemistry Development Kit (CDK)library that includes topological descriptor geometricdescriptor molecular properties and Eigen values basedindices physiochemical and electronic descriptors TheCDK is a Java based open source library for structuralchemo- and bioinformatics projects Our group had ear-lier implemented this library in the form of a standaloneas well as a web server WebCDK which is available athttpcrddosddnet8081WebCDK

Figure 1 Depict the radar chart of four simple moleculardescriptors molecular weight (MW) nHBDon (no of hydro-gen bond donars) nHBAcc (no of hydrogen bond acceptors)XLogP and toxicity (pIGC50) on entire dataset with each colorline represent a compound

Descriptor SelectionIn a QSAR study selection of relevant molecular descrip-tors from descriptor space is most important and trickystep to build an efficient predictable model Most ofmolecular descriptor programs compute a large set ofmolecular descriptors that include highly correlated andirrelevant descriptor There are number of techniqueswhich allow selecting appropriate descriptors Amongthem only a small subset of descriptors are statistically sig-nificant for QSAR based model development To search aset of significant descriptors we adapted different featureselection methods such as CfsubsetEval module with best-fit algorithm F-stepping approach and removal of highlycorrelated descriptors

Cross-Validation TechniquesThe performance of the QSAR model was evaluatedusing five-fold cross-validation technique In five-fold CVthe data set is randomly divided in five partitions ofsimilar size Out of these five sets four sets are usedfor training and the remaining fifth set for testing Themodel was rebuilt five times once for each fold ensur-ing that all compounds were used for testing once Inorder to check the general ability of model an inde-pendent test set is most commonly used Therefore inthis study we have also used an independent datasetof 100 compounds to evaluate the performance of ourmodels

QSAR Model ConstructionQSAR methodology quantitatively correlates the struc-tural molecular properties (descriptors) with functions

22 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

(biological activities) for a set of compounds by meansof linear or non-linear statistical methods17ndash20 In thepresent study we have used both linear (MLR) andnon-linear (SMO) statistical methods for predictionof aqueous toxicity of small chemical molecules inT pyriformis Brief description of these methods is givenbelow

Non-Linear Statistical MethodWEKA-360 Based MethodThe machine-learning package WEKA 36022 is a col-lection of machine-learning algorithms which supportsin several standard data mining tasks data pre-processing clustering classification regression visual-ization and feature selection Here we used SMOreg(Sequential Minimization Optimization)23 algorithmsimplemented in WEKA to predict the end pointtoxicity

Linear Statistical MethodMultiple Linear Regression (MLR)MLR tries to model the relationship between two or moreindependent descriptors and dependent variable such as yby fitting a linear regression equation to observed data withcorresponding parameters (constants) and an error term InMLR every value of the independent variable is associatedwith a value of the dependent variable The multiple linearrelations between y and the xp is defined by followingequation

y = 0+1x1+2x2+ +pxp+x

Where y is a dependent variable x1 x2xp are theindependent variables 12p is the slop (beta coef-ficient) for particular independent variable and x is arandom noise (eg measurement errors) In current studyMLR equation has obtained through R package was usedfor QSAR modeling

Table I Performance of models developed on different set of descriptors calculated using various software packages

Train Blind

Software packages Descriptors Methods R R2 RMSE R R2 RMSE

178 PLS 0876 077 0518 0850 072 055611 SMOPuk 0867 075 0537 0816 065 0620CDK11 MLR 0853 073 0559 0799 064 06336 SMOPuk 0866 075 0541 0823 066 06096 MLR 0851 072 0563 0802 064 0628

1002 PLS 0874 0760 0523 0867 0750 053020 SMOPuk 0849 0720 0570 0781 06 0665V-Life9 SMOPuk 0846 071 0574 0756 057 06939 MLR 0831 069 0596 0785 061 0655

31 SMOPuk 088 077 0516 083 069 0586Hybrid-1 17 SMOPuk 089 079 0491 0851 072 0557

17 MLR 0867 075 0534 0826 068 0594

Evaluation ParameterOnce a regression model was constructed statistical sig-nificance of models was assessed using the following sta-tistical parameters

RMSE =radic

1n

nsumi=1

ToXact minusToXpred2

R = (nsum

TOXactToXpredminussumTOXact

sumToXpred

)

times(radic

nsum

TOXact2minus sum

TOXact2

radicnsum

ToXpred2minus sum

TOXact2)12

R2 = 1minussumn

i=1ToXact minusToXpred2sumn

i=1ToXact minusToX

Where n is the size of test set Toxpred is the predictedpIGC50 and Toxact is the actual pIGC50 is the average ofthe toxicity of test set RMSE is the root mean squarederror between actual and predicted pIGC50 of compoundsR is the Pearsonrsquos correlation coefficient between actualand predicted value R2 (Coefficient of determination) isthe statistical parameter for proportion of variability inmodel

RESULTSDiversity Analysis of Data SetThe diversity in the dataset is very important for robustQSTR model development In order to explore the chemi-cal domain of total dataset a radar chart analysis was per-formed on total 1209 compounds The radar chart showsthat Mol wt Varies from 3203 to 48359 XlogP from241 to 65 no of hydrogen bond donar range from 0to 5 no of hydrogen bond acceptor ranged from 0 to 6and toxicity value from 267 to 334 As shown Figure 1

J Transl Toxicol 1 21ndash27 2014 23

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

the entire dataset highly diverse and cover large chemicalspace24 The radar chart are highly similar to Cheng et al2011 showing the applicability of dataset15

Performance of Linear Statistical Method (MLR)and Non-Linear Statistical Method (SMO)In order to evaluate the performance of different soft-warersquos we have developed QSAR model on differentsets of descriptors calculated by various softwarersquos (open

(a) (b)

(c) (d)

Figure 2 Showing the scatter plots of four different models with actual value on X-axis and predicted value on Y -axis Theblue color circle represents the training set and green color triangle for blind dataset

source as well as commercial) Performance of variousmodels trained and blind dataset is described below

V-Life Descriptors Based ModelIn order to develop model we removed the moleculesfrom V-life for which CDK was unable to calculatethe descriptors for making the data composition uni-form throughout the study In this study a PLS modelwas developed using V-life calculated 1002 descriptors

24 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

Table II Correlation between selected Vlife descriptors calculated using Vlife software

Descriptors MolWt slogp chiV5chain SdSE-index Ipc MMFF_21 MMFF_46 MMFF_51 T_2_N_0 T_N_Br_3 T_N_Br_6 MDEN-23

MolWt 1 063 minus004 001 minus007 minus028 0 003 032 016 004 008slogp 1 001 012 minus006 minus036 002 002 01 005 002 005chiV5chain 1 003 0 minus004 minus001 0 minus003 minus001 minus001 minus001SdSE-index 1 0 minus007 minus001 0 024 minus002 minus001 005Ipc 1 005 0 0 minus001 0 0 0MMFF_21 1 minus002 minus001 minus015 minus005 minus002 minus003MMFF_46 1 0 008 minus001 0 041MMFF_51 1 minus001 0 0 0T_2_N_0 1 008 004 015T_N_Br_3 1 minus001 009T_N_Br_6 1 0MDEN-23 1

in WEKA and achieve nearly equal performance (R2076075 on training and blind dataset receptively(httpcrddosddnetraghavatoxipredsupplephp) A sec-ond model on 20 descriptors selected using CfsubsetEvalmodule of WEKA shows RR2 0849072 and 078106with RMSE 05700665 on training and blind datasetreceptively A third model using SMO with Puk-kernel on9 best-selected descriptors on training and blind datasetshows correlation (R) 08460756 R2 071057 withRMSE 05740693 (Table I Fig 2(A)) respectively TheMLR based model on selected 9 descriptors shows bet-ter performance as compared to non-linear WEKA basedmodel (httpcrddosddnetraghavatoxipredsupplephp)

CDK Descriptors Based ModelThe second model developed on train dataset and testedon blind dataset (randomly created) shows RtrainR

2train

0876077 RblindR2blind 085072 on training and blind

dataset using PLS in WEKA We have developed a QSARmodel on 11 significant descriptors selected using Cfsub-setEval achieved RTrainRBlind 08670816 R2

TrainR2Blind

075065 with RMSE 05370620 on training and blinddataset receptively (Table I) We further number ofdescriptors from 11 to 6 and developed a third modelshows nearly same performance (in term of R2 as shownin (httpcrddosddnetraghavatoxipredsupplephp) andFigure 2(B)

Hybrid ModelThe hybrid model developed on randomly selectedblind dataset was also shows better performance on

Table III Correlation between selected CDK descriptors

ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

ATSc3 1 013 minus011 018 minus005 minus007BCUTc-1l 1 018 minus033 026 03FNSA-1 1 012 045 009RPSA 1 minus011 minus048GRAV-3 1 058XLogP 1

minimum of only 17 relevant descriptors (for detailsee (httpcrddosddnetraghavatoxipredsupplephp))First hybrid model on 31 descriptors shows RR2

088077 on training and 083069 on blind dataset(Fig 2(C)) A second model on 17 descriptors shows abetter performance with R 089085 R2 079072 withRMSE 04910557 (Table I Fig 2(D)) on training andblind dataset respectively

Interpreting Best QSAR ModelAs shown in (httpcrddosddnetraghavatoxipredsupplephp) models developed using all three techniques MLRPLS and SMO performed equally well It is also observedthat hybrid model developed using descriptors obtainedfrom two or more than two software perfom better thanmodel developed using descriptors obtained an individ-ual software It is clear from results that free software ofdescriptor calculation are sufficiently powerful for devel-oping QSTR models The descriptor selection is impor-tant for developing QSTR model and for improving theperformance of model In this study we also used freesoftware even for feature selection This study shows thatlinear model is performing more or less equal to non-linear models From the bunch of thousands of descriptorsonly few statistically relevant descriptors were selectedand used for the model development Our study also sug-gests that among the vast of molecular descriptors molec-ular weight and xlogP plays significant role in toxicityprediction of chemical compounds From the Table IIITable IV Table V it was found that descriptors selectedin this study shows very little correlation with each otherand more with toxicity value The positive correlationwith mol wt and solubility (xLogP) are following thetwo of the four parameter famous lipinski rule of fiveThe XlogP descriptor identified in our study also identi-fied in past by Schuurmann et al 2003 as important inpredicting the toxic molecules The possible reason maybe that with increasing molecular weight of a compoundits solubility decrease and therefore that particular com-pound may persist in the body of T pyriformis and becometoxic

J Transl Toxicol 1 21ndash27 2014 25

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

Table IV Correlation between selected V-life and CDKdescriptors

CDK

Descriptors ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

MolWt minus008 023 047 minus019 086 064chiV5chain 003 009 minus004 001 minus003 minus003SdSE-index 003 017 002 minus012 007 015

Ipc minus002 minus005 minus004 004 minus01 minus006MMFF_21 minus022 minus049 minus022 009 minus037 minus03V-lifeMMFF_46 minus001 002 001 minus004 002 minus001MMFF_51 minus003 minus005 minus001 minus003 005 006T_2_N_0 minus008 029 055 minus024 04 016T_N_Br_3 minus002 01 016 minus004 007 003T_N_Br_6 minus002 01 003 minus004 002 001MDEN-23 minus002 004 005 minus006 009 003

Web ServerOne of the major challenges for researchers working inthe field of toxicology is to predict the toxicity of achemical compound Best of our knowledge two free soft-wares namely TEST (Toxicity Estimation Software Tool)and OpenTox are available for toxicity prediction2526 Inorder to complement existing effort for providing serviceto community we developed a web server ldquoToxiPredrdquo(httpwwwcrddosddnetraghavatoxipred) for predict-ing toxicity of molecules We integrate JME molecu-lar editor27 in ToxiPred that allows user to draw theirmolecules of choice This server is launch using Apacheunder Linux (Red Hat) environment The common gate-way interface (CGI) script of ToxiPred is written usingPERL version 503 This is a user friendly web serverthat allows to predict pIGC50 of a small chemical againstTetrahymena pyriformis

Table V Correlation between descriptors and PIGC50 values

No Descriptor name Correlation (PIGC50) P -value

1 MolWt 07 00112 chiV5chain minus003 0933 SdSE-index 022 172E-134 Ipc minus008 0305 MMFF_21 minus039 147e-096 MMFF_46 008 196E-057 MMFF_51 005 0678 T_2_N_0 038 00089 T_N_Br_3 012 0003

10 T_N_Br_6 006 002011 MDEN-23 01 096912 ATSc3 minus014 878E-1113 BCUTc-1l 032 075514 FNSA-1 039 732E-1215 RPSA minus031 022216 GRAV-3 07 797e-0717 XLogP 075 lt2e-16

DISCUSSIONIn this study we have developed several models for theprediction of pIGC50 of small organic chemical moleculesin Tetrahymena pyriformis Our present study suggeststhat descriptors are keys for any QSAR modeling but itis not always possible that all relevant descriptors wereimplemented in single software Thus it is important touse descriptors obtained from different software for devel-oping a model As shown in our result section our hybridmodel based on selected set of descriptors obtain from dif-ferent software performed better than models developedusing descriptors of individual software After analyzingselected descriptors from different software we found thatmolecular weight and xlogP shows very high correlation(gt= 070) with pIGC50 value In order to provide thefacility to scientific community we developed a webserverldquoToxiPredrdquo to predict toxicity of chemical compoundsWe hope that present model will aid in the area of drugdesigning

Conflicts of InterestThe authors declare that there are no conflicts of interest

Authorsrsquo ContributionsNKM DS and SA perform all the analysis of data anddeveloped QSAR models DS SA and NKM have devel-oped server ToxiPred Manuscript have been written byDS and NKM OSDD members provides suggestions tothis work GPSR conceived and gave overall supervisionto the project helped in interpretation of data and refinedthe drafted manuscript All authors read and approved thefinal manuscript

Acknowledgments The authors are thankful to OpenSource for Drug Discovery (OSDD) foundation and Coun-cil for Scientific and Industrial Research (CSIR) andDepartment of Biotechnology (DBT) New Delhi India forfinancial support The manuscript has Institute of Micro-bial Technology (IMTECH) communication no 0322011

REFERENCES1 D L Hill The Biochemistry and Physiology of Tetrahymena edn

Academic Press New York (1972) pp 2302 J G Hengstler H Foth R Kahl P J Kramer W Lilienblum

T Schulz and H Schweinfurth The reach concept and its impacton toxicological sciences Toxicology 220 232 (2006)

3 P R Duchowicz A G Mercader F M Fernandez and E A CastroPrediction of aqueous toxicity for heterogeneous phenol derivativesby QSAR Chemometr Intell Lab Syst 90 97 (2008)

4 C Hansch and A Leo Substituent Constants for Correla-tion Analysis in Chemistry and Biology Wiley New York(1979)

5 M T D Cronin and J C Dearden Review QSAR in toxicology1 Prediction of Aquatic Toxicity Quant Struct Act Relat 14 1(1995)

6 K Roy and G Ghosh QSTR with extended topochemical atom(ETA) indices 12 QSAR for the toxicity of diverse aromatic

26 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27

Page 2: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

standard cross-validation criteria for testing trainingand independent validation Thus there is a need todevelop fast and robust in-silico model for predictingtoxicity of chemicals In order to address this prob-lem the highly diverse and large dataset taken fromInternational Conference on Artificial Neural Networks(ICANN09) benchmarking the performance of toxic-ity prediction methods (httpwwwcadastereunode65)was usedThe purpose of this study is firstly to develop a highly

robust and accurate QSAR model for the prediction ofaqueous toxicity of small chemical molecules and sec-ondly to develop softwareserver for public use For thispurpose we have used CDK16 and Vlife softwarersquos forcalculating descriptors of chemicals and WekaRapidMinerfor feature selection This study will be useful for experi-mental biologist for predicting the toxicity of a compoundagainst T pyriformis

MATERIAL AND METHODSDatasetIn this study total 1213 molecules have been used fordeveloping QSAR models for aqueous toxicity predictionobtained from httpwwwcadastereunode67 (ICANN09competition web site) MOPAC based optimized 3D struc-ture of these molecules are available in mol2 format alongwith pIGC50 values (logarithm of 50 growth inhibitoryconcentration) There was an error in 5 molecules fordescriptors calculations therefore the overall study is basedon 1208 compounds rather than 1213 compounds Thedataset was divided randomly into training and blind setwith 1108 and 100 molecules respectively Further Inorder to explore the diversity of toxicity dataset their toxi-city value and four molecular descriptors molecular weight(Mol wt) XlogP nHBA (no of hydrogen bond accep-tors) nHBD (no of hydrogen bond donors) calculatedusing CDK libraries for each compounds were analyzedby radar chart (Fig 1)

Molecular DescriptorsIn this study QSTR models were developed using descrip-tors computed from two softwarersquos namely CDK andVlife Vlife allows computing sim1002 type of molec-ular descriptors that can be broadly categorized intostructural thermodynamic electronic molecular surfacebased descriptors We also computed 178 moleculardescriptors using Chemistry Development Kit (CDK)library that includes topological descriptor geometricdescriptor molecular properties and Eigen values basedindices physiochemical and electronic descriptors TheCDK is a Java based open source library for structuralchemo- and bioinformatics projects Our group had ear-lier implemented this library in the form of a standaloneas well as a web server WebCDK which is available athttpcrddosddnet8081WebCDK

Figure 1 Depict the radar chart of four simple moleculardescriptors molecular weight (MW) nHBDon (no of hydro-gen bond donars) nHBAcc (no of hydrogen bond acceptors)XLogP and toxicity (pIGC50) on entire dataset with each colorline represent a compound

Descriptor SelectionIn a QSAR study selection of relevant molecular descrip-tors from descriptor space is most important and trickystep to build an efficient predictable model Most ofmolecular descriptor programs compute a large set ofmolecular descriptors that include highly correlated andirrelevant descriptor There are number of techniqueswhich allow selecting appropriate descriptors Amongthem only a small subset of descriptors are statistically sig-nificant for QSAR based model development To search aset of significant descriptors we adapted different featureselection methods such as CfsubsetEval module with best-fit algorithm F-stepping approach and removal of highlycorrelated descriptors

Cross-Validation TechniquesThe performance of the QSAR model was evaluatedusing five-fold cross-validation technique In five-fold CVthe data set is randomly divided in five partitions ofsimilar size Out of these five sets four sets are usedfor training and the remaining fifth set for testing Themodel was rebuilt five times once for each fold ensur-ing that all compounds were used for testing once Inorder to check the general ability of model an inde-pendent test set is most commonly used Therefore inthis study we have also used an independent datasetof 100 compounds to evaluate the performance of ourmodels

QSAR Model ConstructionQSAR methodology quantitatively correlates the struc-tural molecular properties (descriptors) with functions

22 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

(biological activities) for a set of compounds by meansof linear or non-linear statistical methods17ndash20 In thepresent study we have used both linear (MLR) andnon-linear (SMO) statistical methods for predictionof aqueous toxicity of small chemical molecules inT pyriformis Brief description of these methods is givenbelow

Non-Linear Statistical MethodWEKA-360 Based MethodThe machine-learning package WEKA 36022 is a col-lection of machine-learning algorithms which supportsin several standard data mining tasks data pre-processing clustering classification regression visual-ization and feature selection Here we used SMOreg(Sequential Minimization Optimization)23 algorithmsimplemented in WEKA to predict the end pointtoxicity

Linear Statistical MethodMultiple Linear Regression (MLR)MLR tries to model the relationship between two or moreindependent descriptors and dependent variable such as yby fitting a linear regression equation to observed data withcorresponding parameters (constants) and an error term InMLR every value of the independent variable is associatedwith a value of the dependent variable The multiple linearrelations between y and the xp is defined by followingequation

y = 0+1x1+2x2+ +pxp+x

Where y is a dependent variable x1 x2xp are theindependent variables 12p is the slop (beta coef-ficient) for particular independent variable and x is arandom noise (eg measurement errors) In current studyMLR equation has obtained through R package was usedfor QSAR modeling

Table I Performance of models developed on different set of descriptors calculated using various software packages

Train Blind

Software packages Descriptors Methods R R2 RMSE R R2 RMSE

178 PLS 0876 077 0518 0850 072 055611 SMOPuk 0867 075 0537 0816 065 0620CDK11 MLR 0853 073 0559 0799 064 06336 SMOPuk 0866 075 0541 0823 066 06096 MLR 0851 072 0563 0802 064 0628

1002 PLS 0874 0760 0523 0867 0750 053020 SMOPuk 0849 0720 0570 0781 06 0665V-Life9 SMOPuk 0846 071 0574 0756 057 06939 MLR 0831 069 0596 0785 061 0655

31 SMOPuk 088 077 0516 083 069 0586Hybrid-1 17 SMOPuk 089 079 0491 0851 072 0557

17 MLR 0867 075 0534 0826 068 0594

Evaluation ParameterOnce a regression model was constructed statistical sig-nificance of models was assessed using the following sta-tistical parameters

RMSE =radic

1n

nsumi=1

ToXact minusToXpred2

R = (nsum

TOXactToXpredminussumTOXact

sumToXpred

)

times(radic

nsum

TOXact2minus sum

TOXact2

radicnsum

ToXpred2minus sum

TOXact2)12

R2 = 1minussumn

i=1ToXact minusToXpred2sumn

i=1ToXact minusToX

Where n is the size of test set Toxpred is the predictedpIGC50 and Toxact is the actual pIGC50 is the average ofthe toxicity of test set RMSE is the root mean squarederror between actual and predicted pIGC50 of compoundsR is the Pearsonrsquos correlation coefficient between actualand predicted value R2 (Coefficient of determination) isthe statistical parameter for proportion of variability inmodel

RESULTSDiversity Analysis of Data SetThe diversity in the dataset is very important for robustQSTR model development In order to explore the chemi-cal domain of total dataset a radar chart analysis was per-formed on total 1209 compounds The radar chart showsthat Mol wt Varies from 3203 to 48359 XlogP from241 to 65 no of hydrogen bond donar range from 0to 5 no of hydrogen bond acceptor ranged from 0 to 6and toxicity value from 267 to 334 As shown Figure 1

J Transl Toxicol 1 21ndash27 2014 23

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

the entire dataset highly diverse and cover large chemicalspace24 The radar chart are highly similar to Cheng et al2011 showing the applicability of dataset15

Performance of Linear Statistical Method (MLR)and Non-Linear Statistical Method (SMO)In order to evaluate the performance of different soft-warersquos we have developed QSAR model on differentsets of descriptors calculated by various softwarersquos (open

(a) (b)

(c) (d)

Figure 2 Showing the scatter plots of four different models with actual value on X-axis and predicted value on Y -axis Theblue color circle represents the training set and green color triangle for blind dataset

source as well as commercial) Performance of variousmodels trained and blind dataset is described below

V-Life Descriptors Based ModelIn order to develop model we removed the moleculesfrom V-life for which CDK was unable to calculatethe descriptors for making the data composition uni-form throughout the study In this study a PLS modelwas developed using V-life calculated 1002 descriptors

24 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

Table II Correlation between selected Vlife descriptors calculated using Vlife software

Descriptors MolWt slogp chiV5chain SdSE-index Ipc MMFF_21 MMFF_46 MMFF_51 T_2_N_0 T_N_Br_3 T_N_Br_6 MDEN-23

MolWt 1 063 minus004 001 minus007 minus028 0 003 032 016 004 008slogp 1 001 012 minus006 minus036 002 002 01 005 002 005chiV5chain 1 003 0 minus004 minus001 0 minus003 minus001 minus001 minus001SdSE-index 1 0 minus007 minus001 0 024 minus002 minus001 005Ipc 1 005 0 0 minus001 0 0 0MMFF_21 1 minus002 minus001 minus015 minus005 minus002 minus003MMFF_46 1 0 008 minus001 0 041MMFF_51 1 minus001 0 0 0T_2_N_0 1 008 004 015T_N_Br_3 1 minus001 009T_N_Br_6 1 0MDEN-23 1

in WEKA and achieve nearly equal performance (R2076075 on training and blind dataset receptively(httpcrddosddnetraghavatoxipredsupplephp) A sec-ond model on 20 descriptors selected using CfsubsetEvalmodule of WEKA shows RR2 0849072 and 078106with RMSE 05700665 on training and blind datasetreceptively A third model using SMO with Puk-kernel on9 best-selected descriptors on training and blind datasetshows correlation (R) 08460756 R2 071057 withRMSE 05740693 (Table I Fig 2(A)) respectively TheMLR based model on selected 9 descriptors shows bet-ter performance as compared to non-linear WEKA basedmodel (httpcrddosddnetraghavatoxipredsupplephp)

CDK Descriptors Based ModelThe second model developed on train dataset and testedon blind dataset (randomly created) shows RtrainR

2train

0876077 RblindR2blind 085072 on training and blind

dataset using PLS in WEKA We have developed a QSARmodel on 11 significant descriptors selected using Cfsub-setEval achieved RTrainRBlind 08670816 R2

TrainR2Blind

075065 with RMSE 05370620 on training and blinddataset receptively (Table I) We further number ofdescriptors from 11 to 6 and developed a third modelshows nearly same performance (in term of R2 as shownin (httpcrddosddnetraghavatoxipredsupplephp) andFigure 2(B)

Hybrid ModelThe hybrid model developed on randomly selectedblind dataset was also shows better performance on

Table III Correlation between selected CDK descriptors

ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

ATSc3 1 013 minus011 018 minus005 minus007BCUTc-1l 1 018 minus033 026 03FNSA-1 1 012 045 009RPSA 1 minus011 minus048GRAV-3 1 058XLogP 1

minimum of only 17 relevant descriptors (for detailsee (httpcrddosddnetraghavatoxipredsupplephp))First hybrid model on 31 descriptors shows RR2

088077 on training and 083069 on blind dataset(Fig 2(C)) A second model on 17 descriptors shows abetter performance with R 089085 R2 079072 withRMSE 04910557 (Table I Fig 2(D)) on training andblind dataset respectively

Interpreting Best QSAR ModelAs shown in (httpcrddosddnetraghavatoxipredsupplephp) models developed using all three techniques MLRPLS and SMO performed equally well It is also observedthat hybrid model developed using descriptors obtainedfrom two or more than two software perfom better thanmodel developed using descriptors obtained an individ-ual software It is clear from results that free software ofdescriptor calculation are sufficiently powerful for devel-oping QSTR models The descriptor selection is impor-tant for developing QSTR model and for improving theperformance of model In this study we also used freesoftware even for feature selection This study shows thatlinear model is performing more or less equal to non-linear models From the bunch of thousands of descriptorsonly few statistically relevant descriptors were selectedand used for the model development Our study also sug-gests that among the vast of molecular descriptors molec-ular weight and xlogP plays significant role in toxicityprediction of chemical compounds From the Table IIITable IV Table V it was found that descriptors selectedin this study shows very little correlation with each otherand more with toxicity value The positive correlationwith mol wt and solubility (xLogP) are following thetwo of the four parameter famous lipinski rule of fiveThe XlogP descriptor identified in our study also identi-fied in past by Schuurmann et al 2003 as important inpredicting the toxic molecules The possible reason maybe that with increasing molecular weight of a compoundits solubility decrease and therefore that particular com-pound may persist in the body of T pyriformis and becometoxic

J Transl Toxicol 1 21ndash27 2014 25

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

Table IV Correlation between selected V-life and CDKdescriptors

CDK

Descriptors ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

MolWt minus008 023 047 minus019 086 064chiV5chain 003 009 minus004 001 minus003 minus003SdSE-index 003 017 002 minus012 007 015

Ipc minus002 minus005 minus004 004 minus01 minus006MMFF_21 minus022 minus049 minus022 009 minus037 minus03V-lifeMMFF_46 minus001 002 001 minus004 002 minus001MMFF_51 minus003 minus005 minus001 minus003 005 006T_2_N_0 minus008 029 055 minus024 04 016T_N_Br_3 minus002 01 016 minus004 007 003T_N_Br_6 minus002 01 003 minus004 002 001MDEN-23 minus002 004 005 minus006 009 003

Web ServerOne of the major challenges for researchers working inthe field of toxicology is to predict the toxicity of achemical compound Best of our knowledge two free soft-wares namely TEST (Toxicity Estimation Software Tool)and OpenTox are available for toxicity prediction2526 Inorder to complement existing effort for providing serviceto community we developed a web server ldquoToxiPredrdquo(httpwwwcrddosddnetraghavatoxipred) for predict-ing toxicity of molecules We integrate JME molecu-lar editor27 in ToxiPred that allows user to draw theirmolecules of choice This server is launch using Apacheunder Linux (Red Hat) environment The common gate-way interface (CGI) script of ToxiPred is written usingPERL version 503 This is a user friendly web serverthat allows to predict pIGC50 of a small chemical againstTetrahymena pyriformis

Table V Correlation between descriptors and PIGC50 values

No Descriptor name Correlation (PIGC50) P -value

1 MolWt 07 00112 chiV5chain minus003 0933 SdSE-index 022 172E-134 Ipc minus008 0305 MMFF_21 minus039 147e-096 MMFF_46 008 196E-057 MMFF_51 005 0678 T_2_N_0 038 00089 T_N_Br_3 012 0003

10 T_N_Br_6 006 002011 MDEN-23 01 096912 ATSc3 minus014 878E-1113 BCUTc-1l 032 075514 FNSA-1 039 732E-1215 RPSA minus031 022216 GRAV-3 07 797e-0717 XLogP 075 lt2e-16

DISCUSSIONIn this study we have developed several models for theprediction of pIGC50 of small organic chemical moleculesin Tetrahymena pyriformis Our present study suggeststhat descriptors are keys for any QSAR modeling but itis not always possible that all relevant descriptors wereimplemented in single software Thus it is important touse descriptors obtained from different software for devel-oping a model As shown in our result section our hybridmodel based on selected set of descriptors obtain from dif-ferent software performed better than models developedusing descriptors of individual software After analyzingselected descriptors from different software we found thatmolecular weight and xlogP shows very high correlation(gt= 070) with pIGC50 value In order to provide thefacility to scientific community we developed a webserverldquoToxiPredrdquo to predict toxicity of chemical compoundsWe hope that present model will aid in the area of drugdesigning

Conflicts of InterestThe authors declare that there are no conflicts of interest

Authorsrsquo ContributionsNKM DS and SA perform all the analysis of data anddeveloped QSAR models DS SA and NKM have devel-oped server ToxiPred Manuscript have been written byDS and NKM OSDD members provides suggestions tothis work GPSR conceived and gave overall supervisionto the project helped in interpretation of data and refinedthe drafted manuscript All authors read and approved thefinal manuscript

Acknowledgments The authors are thankful to OpenSource for Drug Discovery (OSDD) foundation and Coun-cil for Scientific and Industrial Research (CSIR) andDepartment of Biotechnology (DBT) New Delhi India forfinancial support The manuscript has Institute of Micro-bial Technology (IMTECH) communication no 0322011

REFERENCES1 D L Hill The Biochemistry and Physiology of Tetrahymena edn

Academic Press New York (1972) pp 2302 J G Hengstler H Foth R Kahl P J Kramer W Lilienblum

T Schulz and H Schweinfurth The reach concept and its impacton toxicological sciences Toxicology 220 232 (2006)

3 P R Duchowicz A G Mercader F M Fernandez and E A CastroPrediction of aqueous toxicity for heterogeneous phenol derivativesby QSAR Chemometr Intell Lab Syst 90 97 (2008)

4 C Hansch and A Leo Substituent Constants for Correla-tion Analysis in Chemistry and Biology Wiley New York(1979)

5 M T D Cronin and J C Dearden Review QSAR in toxicology1 Prediction of Aquatic Toxicity Quant Struct Act Relat 14 1(1995)

6 K Roy and G Ghosh QSTR with extended topochemical atom(ETA) indices 12 QSAR for the toxicity of diverse aromatic

26 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27

Page 3: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

(biological activities) for a set of compounds by meansof linear or non-linear statistical methods17ndash20 In thepresent study we have used both linear (MLR) andnon-linear (SMO) statistical methods for predictionof aqueous toxicity of small chemical molecules inT pyriformis Brief description of these methods is givenbelow

Non-Linear Statistical MethodWEKA-360 Based MethodThe machine-learning package WEKA 36022 is a col-lection of machine-learning algorithms which supportsin several standard data mining tasks data pre-processing clustering classification regression visual-ization and feature selection Here we used SMOreg(Sequential Minimization Optimization)23 algorithmsimplemented in WEKA to predict the end pointtoxicity

Linear Statistical MethodMultiple Linear Regression (MLR)MLR tries to model the relationship between two or moreindependent descriptors and dependent variable such as yby fitting a linear regression equation to observed data withcorresponding parameters (constants) and an error term InMLR every value of the independent variable is associatedwith a value of the dependent variable The multiple linearrelations between y and the xp is defined by followingequation

y = 0+1x1+2x2+ +pxp+x

Where y is a dependent variable x1 x2xp are theindependent variables 12p is the slop (beta coef-ficient) for particular independent variable and x is arandom noise (eg measurement errors) In current studyMLR equation has obtained through R package was usedfor QSAR modeling

Table I Performance of models developed on different set of descriptors calculated using various software packages

Train Blind

Software packages Descriptors Methods R R2 RMSE R R2 RMSE

178 PLS 0876 077 0518 0850 072 055611 SMOPuk 0867 075 0537 0816 065 0620CDK11 MLR 0853 073 0559 0799 064 06336 SMOPuk 0866 075 0541 0823 066 06096 MLR 0851 072 0563 0802 064 0628

1002 PLS 0874 0760 0523 0867 0750 053020 SMOPuk 0849 0720 0570 0781 06 0665V-Life9 SMOPuk 0846 071 0574 0756 057 06939 MLR 0831 069 0596 0785 061 0655

31 SMOPuk 088 077 0516 083 069 0586Hybrid-1 17 SMOPuk 089 079 0491 0851 072 0557

17 MLR 0867 075 0534 0826 068 0594

Evaluation ParameterOnce a regression model was constructed statistical sig-nificance of models was assessed using the following sta-tistical parameters

RMSE =radic

1n

nsumi=1

ToXact minusToXpred2

R = (nsum

TOXactToXpredminussumTOXact

sumToXpred

)

times(radic

nsum

TOXact2minus sum

TOXact2

radicnsum

ToXpred2minus sum

TOXact2)12

R2 = 1minussumn

i=1ToXact minusToXpred2sumn

i=1ToXact minusToX

Where n is the size of test set Toxpred is the predictedpIGC50 and Toxact is the actual pIGC50 is the average ofthe toxicity of test set RMSE is the root mean squarederror between actual and predicted pIGC50 of compoundsR is the Pearsonrsquos correlation coefficient between actualand predicted value R2 (Coefficient of determination) isthe statistical parameter for proportion of variability inmodel

RESULTSDiversity Analysis of Data SetThe diversity in the dataset is very important for robustQSTR model development In order to explore the chemi-cal domain of total dataset a radar chart analysis was per-formed on total 1209 compounds The radar chart showsthat Mol wt Varies from 3203 to 48359 XlogP from241 to 65 no of hydrogen bond donar range from 0to 5 no of hydrogen bond acceptor ranged from 0 to 6and toxicity value from 267 to 334 As shown Figure 1

J Transl Toxicol 1 21ndash27 2014 23

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

the entire dataset highly diverse and cover large chemicalspace24 The radar chart are highly similar to Cheng et al2011 showing the applicability of dataset15

Performance of Linear Statistical Method (MLR)and Non-Linear Statistical Method (SMO)In order to evaluate the performance of different soft-warersquos we have developed QSAR model on differentsets of descriptors calculated by various softwarersquos (open

(a) (b)

(c) (d)

Figure 2 Showing the scatter plots of four different models with actual value on X-axis and predicted value on Y -axis Theblue color circle represents the training set and green color triangle for blind dataset

source as well as commercial) Performance of variousmodels trained and blind dataset is described below

V-Life Descriptors Based ModelIn order to develop model we removed the moleculesfrom V-life for which CDK was unable to calculatethe descriptors for making the data composition uni-form throughout the study In this study a PLS modelwas developed using V-life calculated 1002 descriptors

24 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

Table II Correlation between selected Vlife descriptors calculated using Vlife software

Descriptors MolWt slogp chiV5chain SdSE-index Ipc MMFF_21 MMFF_46 MMFF_51 T_2_N_0 T_N_Br_3 T_N_Br_6 MDEN-23

MolWt 1 063 minus004 001 minus007 minus028 0 003 032 016 004 008slogp 1 001 012 minus006 minus036 002 002 01 005 002 005chiV5chain 1 003 0 minus004 minus001 0 minus003 minus001 minus001 minus001SdSE-index 1 0 minus007 minus001 0 024 minus002 minus001 005Ipc 1 005 0 0 minus001 0 0 0MMFF_21 1 minus002 minus001 minus015 minus005 minus002 minus003MMFF_46 1 0 008 minus001 0 041MMFF_51 1 minus001 0 0 0T_2_N_0 1 008 004 015T_N_Br_3 1 minus001 009T_N_Br_6 1 0MDEN-23 1

in WEKA and achieve nearly equal performance (R2076075 on training and blind dataset receptively(httpcrddosddnetraghavatoxipredsupplephp) A sec-ond model on 20 descriptors selected using CfsubsetEvalmodule of WEKA shows RR2 0849072 and 078106with RMSE 05700665 on training and blind datasetreceptively A third model using SMO with Puk-kernel on9 best-selected descriptors on training and blind datasetshows correlation (R) 08460756 R2 071057 withRMSE 05740693 (Table I Fig 2(A)) respectively TheMLR based model on selected 9 descriptors shows bet-ter performance as compared to non-linear WEKA basedmodel (httpcrddosddnetraghavatoxipredsupplephp)

CDK Descriptors Based ModelThe second model developed on train dataset and testedon blind dataset (randomly created) shows RtrainR

2train

0876077 RblindR2blind 085072 on training and blind

dataset using PLS in WEKA We have developed a QSARmodel on 11 significant descriptors selected using Cfsub-setEval achieved RTrainRBlind 08670816 R2

TrainR2Blind

075065 with RMSE 05370620 on training and blinddataset receptively (Table I) We further number ofdescriptors from 11 to 6 and developed a third modelshows nearly same performance (in term of R2 as shownin (httpcrddosddnetraghavatoxipredsupplephp) andFigure 2(B)

Hybrid ModelThe hybrid model developed on randomly selectedblind dataset was also shows better performance on

Table III Correlation between selected CDK descriptors

ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

ATSc3 1 013 minus011 018 minus005 minus007BCUTc-1l 1 018 minus033 026 03FNSA-1 1 012 045 009RPSA 1 minus011 minus048GRAV-3 1 058XLogP 1

minimum of only 17 relevant descriptors (for detailsee (httpcrddosddnetraghavatoxipredsupplephp))First hybrid model on 31 descriptors shows RR2

088077 on training and 083069 on blind dataset(Fig 2(C)) A second model on 17 descriptors shows abetter performance with R 089085 R2 079072 withRMSE 04910557 (Table I Fig 2(D)) on training andblind dataset respectively

Interpreting Best QSAR ModelAs shown in (httpcrddosddnetraghavatoxipredsupplephp) models developed using all three techniques MLRPLS and SMO performed equally well It is also observedthat hybrid model developed using descriptors obtainedfrom two or more than two software perfom better thanmodel developed using descriptors obtained an individ-ual software It is clear from results that free software ofdescriptor calculation are sufficiently powerful for devel-oping QSTR models The descriptor selection is impor-tant for developing QSTR model and for improving theperformance of model In this study we also used freesoftware even for feature selection This study shows thatlinear model is performing more or less equal to non-linear models From the bunch of thousands of descriptorsonly few statistically relevant descriptors were selectedand used for the model development Our study also sug-gests that among the vast of molecular descriptors molec-ular weight and xlogP plays significant role in toxicityprediction of chemical compounds From the Table IIITable IV Table V it was found that descriptors selectedin this study shows very little correlation with each otherand more with toxicity value The positive correlationwith mol wt and solubility (xLogP) are following thetwo of the four parameter famous lipinski rule of fiveThe XlogP descriptor identified in our study also identi-fied in past by Schuurmann et al 2003 as important inpredicting the toxic molecules The possible reason maybe that with increasing molecular weight of a compoundits solubility decrease and therefore that particular com-pound may persist in the body of T pyriformis and becometoxic

J Transl Toxicol 1 21ndash27 2014 25

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

Table IV Correlation between selected V-life and CDKdescriptors

CDK

Descriptors ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

MolWt minus008 023 047 minus019 086 064chiV5chain 003 009 minus004 001 minus003 minus003SdSE-index 003 017 002 minus012 007 015

Ipc minus002 minus005 minus004 004 minus01 minus006MMFF_21 minus022 minus049 minus022 009 minus037 minus03V-lifeMMFF_46 minus001 002 001 minus004 002 minus001MMFF_51 minus003 minus005 minus001 minus003 005 006T_2_N_0 minus008 029 055 minus024 04 016T_N_Br_3 minus002 01 016 minus004 007 003T_N_Br_6 minus002 01 003 minus004 002 001MDEN-23 minus002 004 005 minus006 009 003

Web ServerOne of the major challenges for researchers working inthe field of toxicology is to predict the toxicity of achemical compound Best of our knowledge two free soft-wares namely TEST (Toxicity Estimation Software Tool)and OpenTox are available for toxicity prediction2526 Inorder to complement existing effort for providing serviceto community we developed a web server ldquoToxiPredrdquo(httpwwwcrddosddnetraghavatoxipred) for predict-ing toxicity of molecules We integrate JME molecu-lar editor27 in ToxiPred that allows user to draw theirmolecules of choice This server is launch using Apacheunder Linux (Red Hat) environment The common gate-way interface (CGI) script of ToxiPred is written usingPERL version 503 This is a user friendly web serverthat allows to predict pIGC50 of a small chemical againstTetrahymena pyriformis

Table V Correlation between descriptors and PIGC50 values

No Descriptor name Correlation (PIGC50) P -value

1 MolWt 07 00112 chiV5chain minus003 0933 SdSE-index 022 172E-134 Ipc minus008 0305 MMFF_21 minus039 147e-096 MMFF_46 008 196E-057 MMFF_51 005 0678 T_2_N_0 038 00089 T_N_Br_3 012 0003

10 T_N_Br_6 006 002011 MDEN-23 01 096912 ATSc3 minus014 878E-1113 BCUTc-1l 032 075514 FNSA-1 039 732E-1215 RPSA minus031 022216 GRAV-3 07 797e-0717 XLogP 075 lt2e-16

DISCUSSIONIn this study we have developed several models for theprediction of pIGC50 of small organic chemical moleculesin Tetrahymena pyriformis Our present study suggeststhat descriptors are keys for any QSAR modeling but itis not always possible that all relevant descriptors wereimplemented in single software Thus it is important touse descriptors obtained from different software for devel-oping a model As shown in our result section our hybridmodel based on selected set of descriptors obtain from dif-ferent software performed better than models developedusing descriptors of individual software After analyzingselected descriptors from different software we found thatmolecular weight and xlogP shows very high correlation(gt= 070) with pIGC50 value In order to provide thefacility to scientific community we developed a webserverldquoToxiPredrdquo to predict toxicity of chemical compoundsWe hope that present model will aid in the area of drugdesigning

Conflicts of InterestThe authors declare that there are no conflicts of interest

Authorsrsquo ContributionsNKM DS and SA perform all the analysis of data anddeveloped QSAR models DS SA and NKM have devel-oped server ToxiPred Manuscript have been written byDS and NKM OSDD members provides suggestions tothis work GPSR conceived and gave overall supervisionto the project helped in interpretation of data and refinedthe drafted manuscript All authors read and approved thefinal manuscript

Acknowledgments The authors are thankful to OpenSource for Drug Discovery (OSDD) foundation and Coun-cil for Scientific and Industrial Research (CSIR) andDepartment of Biotechnology (DBT) New Delhi India forfinancial support The manuscript has Institute of Micro-bial Technology (IMTECH) communication no 0322011

REFERENCES1 D L Hill The Biochemistry and Physiology of Tetrahymena edn

Academic Press New York (1972) pp 2302 J G Hengstler H Foth R Kahl P J Kramer W Lilienblum

T Schulz and H Schweinfurth The reach concept and its impacton toxicological sciences Toxicology 220 232 (2006)

3 P R Duchowicz A G Mercader F M Fernandez and E A CastroPrediction of aqueous toxicity for heterogeneous phenol derivativesby QSAR Chemometr Intell Lab Syst 90 97 (2008)

4 C Hansch and A Leo Substituent Constants for Correla-tion Analysis in Chemistry and Biology Wiley New York(1979)

5 M T D Cronin and J C Dearden Review QSAR in toxicology1 Prediction of Aquatic Toxicity Quant Struct Act Relat 14 1(1995)

6 K Roy and G Ghosh QSTR with extended topochemical atom(ETA) indices 12 QSAR for the toxicity of diverse aromatic

26 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27

Page 4: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

the entire dataset highly diverse and cover large chemicalspace24 The radar chart are highly similar to Cheng et al2011 showing the applicability of dataset15

Performance of Linear Statistical Method (MLR)and Non-Linear Statistical Method (SMO)In order to evaluate the performance of different soft-warersquos we have developed QSAR model on differentsets of descriptors calculated by various softwarersquos (open

(a) (b)

(c) (d)

Figure 2 Showing the scatter plots of four different models with actual value on X-axis and predicted value on Y -axis Theblue color circle represents the training set and green color triangle for blind dataset

source as well as commercial) Performance of variousmodels trained and blind dataset is described below

V-Life Descriptors Based ModelIn order to develop model we removed the moleculesfrom V-life for which CDK was unable to calculatethe descriptors for making the data composition uni-form throughout the study In this study a PLS modelwas developed using V-life calculated 1002 descriptors

24 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

Table II Correlation between selected Vlife descriptors calculated using Vlife software

Descriptors MolWt slogp chiV5chain SdSE-index Ipc MMFF_21 MMFF_46 MMFF_51 T_2_N_0 T_N_Br_3 T_N_Br_6 MDEN-23

MolWt 1 063 minus004 001 minus007 minus028 0 003 032 016 004 008slogp 1 001 012 minus006 minus036 002 002 01 005 002 005chiV5chain 1 003 0 minus004 minus001 0 minus003 minus001 minus001 minus001SdSE-index 1 0 minus007 minus001 0 024 minus002 minus001 005Ipc 1 005 0 0 minus001 0 0 0MMFF_21 1 minus002 minus001 minus015 minus005 minus002 minus003MMFF_46 1 0 008 minus001 0 041MMFF_51 1 minus001 0 0 0T_2_N_0 1 008 004 015T_N_Br_3 1 minus001 009T_N_Br_6 1 0MDEN-23 1

in WEKA and achieve nearly equal performance (R2076075 on training and blind dataset receptively(httpcrddosddnetraghavatoxipredsupplephp) A sec-ond model on 20 descriptors selected using CfsubsetEvalmodule of WEKA shows RR2 0849072 and 078106with RMSE 05700665 on training and blind datasetreceptively A third model using SMO with Puk-kernel on9 best-selected descriptors on training and blind datasetshows correlation (R) 08460756 R2 071057 withRMSE 05740693 (Table I Fig 2(A)) respectively TheMLR based model on selected 9 descriptors shows bet-ter performance as compared to non-linear WEKA basedmodel (httpcrddosddnetraghavatoxipredsupplephp)

CDK Descriptors Based ModelThe second model developed on train dataset and testedon blind dataset (randomly created) shows RtrainR

2train

0876077 RblindR2blind 085072 on training and blind

dataset using PLS in WEKA We have developed a QSARmodel on 11 significant descriptors selected using Cfsub-setEval achieved RTrainRBlind 08670816 R2

TrainR2Blind

075065 with RMSE 05370620 on training and blinddataset receptively (Table I) We further number ofdescriptors from 11 to 6 and developed a third modelshows nearly same performance (in term of R2 as shownin (httpcrddosddnetraghavatoxipredsupplephp) andFigure 2(B)

Hybrid ModelThe hybrid model developed on randomly selectedblind dataset was also shows better performance on

Table III Correlation between selected CDK descriptors

ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

ATSc3 1 013 minus011 018 minus005 minus007BCUTc-1l 1 018 minus033 026 03FNSA-1 1 012 045 009RPSA 1 minus011 minus048GRAV-3 1 058XLogP 1

minimum of only 17 relevant descriptors (for detailsee (httpcrddosddnetraghavatoxipredsupplephp))First hybrid model on 31 descriptors shows RR2

088077 on training and 083069 on blind dataset(Fig 2(C)) A second model on 17 descriptors shows abetter performance with R 089085 R2 079072 withRMSE 04910557 (Table I Fig 2(D)) on training andblind dataset respectively

Interpreting Best QSAR ModelAs shown in (httpcrddosddnetraghavatoxipredsupplephp) models developed using all three techniques MLRPLS and SMO performed equally well It is also observedthat hybrid model developed using descriptors obtainedfrom two or more than two software perfom better thanmodel developed using descriptors obtained an individ-ual software It is clear from results that free software ofdescriptor calculation are sufficiently powerful for devel-oping QSTR models The descriptor selection is impor-tant for developing QSTR model and for improving theperformance of model In this study we also used freesoftware even for feature selection This study shows thatlinear model is performing more or less equal to non-linear models From the bunch of thousands of descriptorsonly few statistically relevant descriptors were selectedand used for the model development Our study also sug-gests that among the vast of molecular descriptors molec-ular weight and xlogP plays significant role in toxicityprediction of chemical compounds From the Table IIITable IV Table V it was found that descriptors selectedin this study shows very little correlation with each otherand more with toxicity value The positive correlationwith mol wt and solubility (xLogP) are following thetwo of the four parameter famous lipinski rule of fiveThe XlogP descriptor identified in our study also identi-fied in past by Schuurmann et al 2003 as important inpredicting the toxic molecules The possible reason maybe that with increasing molecular weight of a compoundits solubility decrease and therefore that particular com-pound may persist in the body of T pyriformis and becometoxic

J Transl Toxicol 1 21ndash27 2014 25

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

Table IV Correlation between selected V-life and CDKdescriptors

CDK

Descriptors ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

MolWt minus008 023 047 minus019 086 064chiV5chain 003 009 minus004 001 minus003 minus003SdSE-index 003 017 002 minus012 007 015

Ipc minus002 minus005 minus004 004 minus01 minus006MMFF_21 minus022 minus049 minus022 009 minus037 minus03V-lifeMMFF_46 minus001 002 001 minus004 002 minus001MMFF_51 minus003 minus005 minus001 minus003 005 006T_2_N_0 minus008 029 055 minus024 04 016T_N_Br_3 minus002 01 016 minus004 007 003T_N_Br_6 minus002 01 003 minus004 002 001MDEN-23 minus002 004 005 minus006 009 003

Web ServerOne of the major challenges for researchers working inthe field of toxicology is to predict the toxicity of achemical compound Best of our knowledge two free soft-wares namely TEST (Toxicity Estimation Software Tool)and OpenTox are available for toxicity prediction2526 Inorder to complement existing effort for providing serviceto community we developed a web server ldquoToxiPredrdquo(httpwwwcrddosddnetraghavatoxipred) for predict-ing toxicity of molecules We integrate JME molecu-lar editor27 in ToxiPred that allows user to draw theirmolecules of choice This server is launch using Apacheunder Linux (Red Hat) environment The common gate-way interface (CGI) script of ToxiPred is written usingPERL version 503 This is a user friendly web serverthat allows to predict pIGC50 of a small chemical againstTetrahymena pyriformis

Table V Correlation between descriptors and PIGC50 values

No Descriptor name Correlation (PIGC50) P -value

1 MolWt 07 00112 chiV5chain minus003 0933 SdSE-index 022 172E-134 Ipc minus008 0305 MMFF_21 minus039 147e-096 MMFF_46 008 196E-057 MMFF_51 005 0678 T_2_N_0 038 00089 T_N_Br_3 012 0003

10 T_N_Br_6 006 002011 MDEN-23 01 096912 ATSc3 minus014 878E-1113 BCUTc-1l 032 075514 FNSA-1 039 732E-1215 RPSA minus031 022216 GRAV-3 07 797e-0717 XLogP 075 lt2e-16

DISCUSSIONIn this study we have developed several models for theprediction of pIGC50 of small organic chemical moleculesin Tetrahymena pyriformis Our present study suggeststhat descriptors are keys for any QSAR modeling but itis not always possible that all relevant descriptors wereimplemented in single software Thus it is important touse descriptors obtained from different software for devel-oping a model As shown in our result section our hybridmodel based on selected set of descriptors obtain from dif-ferent software performed better than models developedusing descriptors of individual software After analyzingselected descriptors from different software we found thatmolecular weight and xlogP shows very high correlation(gt= 070) with pIGC50 value In order to provide thefacility to scientific community we developed a webserverldquoToxiPredrdquo to predict toxicity of chemical compoundsWe hope that present model will aid in the area of drugdesigning

Conflicts of InterestThe authors declare that there are no conflicts of interest

Authorsrsquo ContributionsNKM DS and SA perform all the analysis of data anddeveloped QSAR models DS SA and NKM have devel-oped server ToxiPred Manuscript have been written byDS and NKM OSDD members provides suggestions tothis work GPSR conceived and gave overall supervisionto the project helped in interpretation of data and refinedthe drafted manuscript All authors read and approved thefinal manuscript

Acknowledgments The authors are thankful to OpenSource for Drug Discovery (OSDD) foundation and Coun-cil for Scientific and Industrial Research (CSIR) andDepartment of Biotechnology (DBT) New Delhi India forfinancial support The manuscript has Institute of Micro-bial Technology (IMTECH) communication no 0322011

REFERENCES1 D L Hill The Biochemistry and Physiology of Tetrahymena edn

Academic Press New York (1972) pp 2302 J G Hengstler H Foth R Kahl P J Kramer W Lilienblum

T Schulz and H Schweinfurth The reach concept and its impacton toxicological sciences Toxicology 220 232 (2006)

3 P R Duchowicz A G Mercader F M Fernandez and E A CastroPrediction of aqueous toxicity for heterogeneous phenol derivativesby QSAR Chemometr Intell Lab Syst 90 97 (2008)

4 C Hansch and A Leo Substituent Constants for Correla-tion Analysis in Chemistry and Biology Wiley New York(1979)

5 M T D Cronin and J C Dearden Review QSAR in toxicology1 Prediction of Aquatic Toxicity Quant Struct Act Relat 14 1(1995)

6 K Roy and G Ghosh QSTR with extended topochemical atom(ETA) indices 12 QSAR for the toxicity of diverse aromatic

26 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27

Page 5: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

Table II Correlation between selected Vlife descriptors calculated using Vlife software

Descriptors MolWt slogp chiV5chain SdSE-index Ipc MMFF_21 MMFF_46 MMFF_51 T_2_N_0 T_N_Br_3 T_N_Br_6 MDEN-23

MolWt 1 063 minus004 001 minus007 minus028 0 003 032 016 004 008slogp 1 001 012 minus006 minus036 002 002 01 005 002 005chiV5chain 1 003 0 minus004 minus001 0 minus003 minus001 minus001 minus001SdSE-index 1 0 minus007 minus001 0 024 minus002 minus001 005Ipc 1 005 0 0 minus001 0 0 0MMFF_21 1 minus002 minus001 minus015 minus005 minus002 minus003MMFF_46 1 0 008 minus001 0 041MMFF_51 1 minus001 0 0 0T_2_N_0 1 008 004 015T_N_Br_3 1 minus001 009T_N_Br_6 1 0MDEN-23 1

in WEKA and achieve nearly equal performance (R2076075 on training and blind dataset receptively(httpcrddosddnetraghavatoxipredsupplephp) A sec-ond model on 20 descriptors selected using CfsubsetEvalmodule of WEKA shows RR2 0849072 and 078106with RMSE 05700665 on training and blind datasetreceptively A third model using SMO with Puk-kernel on9 best-selected descriptors on training and blind datasetshows correlation (R) 08460756 R2 071057 withRMSE 05740693 (Table I Fig 2(A)) respectively TheMLR based model on selected 9 descriptors shows bet-ter performance as compared to non-linear WEKA basedmodel (httpcrddosddnetraghavatoxipredsupplephp)

CDK Descriptors Based ModelThe second model developed on train dataset and testedon blind dataset (randomly created) shows RtrainR

2train

0876077 RblindR2blind 085072 on training and blind

dataset using PLS in WEKA We have developed a QSARmodel on 11 significant descriptors selected using Cfsub-setEval achieved RTrainRBlind 08670816 R2

TrainR2Blind

075065 with RMSE 05370620 on training and blinddataset receptively (Table I) We further number ofdescriptors from 11 to 6 and developed a third modelshows nearly same performance (in term of R2 as shownin (httpcrddosddnetraghavatoxipredsupplephp) andFigure 2(B)

Hybrid ModelThe hybrid model developed on randomly selectedblind dataset was also shows better performance on

Table III Correlation between selected CDK descriptors

ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

ATSc3 1 013 minus011 018 minus005 minus007BCUTc-1l 1 018 minus033 026 03FNSA-1 1 012 045 009RPSA 1 minus011 minus048GRAV-3 1 058XLogP 1

minimum of only 17 relevant descriptors (for detailsee (httpcrddosddnetraghavatoxipredsupplephp))First hybrid model on 31 descriptors shows RR2

088077 on training and 083069 on blind dataset(Fig 2(C)) A second model on 17 descriptors shows abetter performance with R 089085 R2 079072 withRMSE 04910557 (Table I Fig 2(D)) on training andblind dataset respectively

Interpreting Best QSAR ModelAs shown in (httpcrddosddnetraghavatoxipredsupplephp) models developed using all three techniques MLRPLS and SMO performed equally well It is also observedthat hybrid model developed using descriptors obtainedfrom two or more than two software perfom better thanmodel developed using descriptors obtained an individ-ual software It is clear from results that free software ofdescriptor calculation are sufficiently powerful for devel-oping QSTR models The descriptor selection is impor-tant for developing QSTR model and for improving theperformance of model In this study we also used freesoftware even for feature selection This study shows thatlinear model is performing more or less equal to non-linear models From the bunch of thousands of descriptorsonly few statistically relevant descriptors were selectedand used for the model development Our study also sug-gests that among the vast of molecular descriptors molec-ular weight and xlogP plays significant role in toxicityprediction of chemical compounds From the Table IIITable IV Table V it was found that descriptors selectedin this study shows very little correlation with each otherand more with toxicity value The positive correlationwith mol wt and solubility (xLogP) are following thetwo of the four parameter famous lipinski rule of fiveThe XlogP descriptor identified in our study also identi-fied in past by Schuurmann et al 2003 as important inpredicting the toxic molecules The possible reason maybe that with increasing molecular weight of a compoundits solubility decrease and therefore that particular com-pound may persist in the body of T pyriformis and becometoxic

J Transl Toxicol 1 21ndash27 2014 25

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

Table IV Correlation between selected V-life and CDKdescriptors

CDK

Descriptors ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

MolWt minus008 023 047 minus019 086 064chiV5chain 003 009 minus004 001 minus003 minus003SdSE-index 003 017 002 minus012 007 015

Ipc minus002 minus005 minus004 004 minus01 minus006MMFF_21 minus022 minus049 minus022 009 minus037 minus03V-lifeMMFF_46 minus001 002 001 minus004 002 minus001MMFF_51 minus003 minus005 minus001 minus003 005 006T_2_N_0 minus008 029 055 minus024 04 016T_N_Br_3 minus002 01 016 minus004 007 003T_N_Br_6 minus002 01 003 minus004 002 001MDEN-23 minus002 004 005 minus006 009 003

Web ServerOne of the major challenges for researchers working inthe field of toxicology is to predict the toxicity of achemical compound Best of our knowledge two free soft-wares namely TEST (Toxicity Estimation Software Tool)and OpenTox are available for toxicity prediction2526 Inorder to complement existing effort for providing serviceto community we developed a web server ldquoToxiPredrdquo(httpwwwcrddosddnetraghavatoxipred) for predict-ing toxicity of molecules We integrate JME molecu-lar editor27 in ToxiPred that allows user to draw theirmolecules of choice This server is launch using Apacheunder Linux (Red Hat) environment The common gate-way interface (CGI) script of ToxiPred is written usingPERL version 503 This is a user friendly web serverthat allows to predict pIGC50 of a small chemical againstTetrahymena pyriformis

Table V Correlation between descriptors and PIGC50 values

No Descriptor name Correlation (PIGC50) P -value

1 MolWt 07 00112 chiV5chain minus003 0933 SdSE-index 022 172E-134 Ipc minus008 0305 MMFF_21 minus039 147e-096 MMFF_46 008 196E-057 MMFF_51 005 0678 T_2_N_0 038 00089 T_N_Br_3 012 0003

10 T_N_Br_6 006 002011 MDEN-23 01 096912 ATSc3 minus014 878E-1113 BCUTc-1l 032 075514 FNSA-1 039 732E-1215 RPSA minus031 022216 GRAV-3 07 797e-0717 XLogP 075 lt2e-16

DISCUSSIONIn this study we have developed several models for theprediction of pIGC50 of small organic chemical moleculesin Tetrahymena pyriformis Our present study suggeststhat descriptors are keys for any QSAR modeling but itis not always possible that all relevant descriptors wereimplemented in single software Thus it is important touse descriptors obtained from different software for devel-oping a model As shown in our result section our hybridmodel based on selected set of descriptors obtain from dif-ferent software performed better than models developedusing descriptors of individual software After analyzingselected descriptors from different software we found thatmolecular weight and xlogP shows very high correlation(gt= 070) with pIGC50 value In order to provide thefacility to scientific community we developed a webserverldquoToxiPredrdquo to predict toxicity of chemical compoundsWe hope that present model will aid in the area of drugdesigning

Conflicts of InterestThe authors declare that there are no conflicts of interest

Authorsrsquo ContributionsNKM DS and SA perform all the analysis of data anddeveloped QSAR models DS SA and NKM have devel-oped server ToxiPred Manuscript have been written byDS and NKM OSDD members provides suggestions tothis work GPSR conceived and gave overall supervisionto the project helped in interpretation of data and refinedthe drafted manuscript All authors read and approved thefinal manuscript

Acknowledgments The authors are thankful to OpenSource for Drug Discovery (OSDD) foundation and Coun-cil for Scientific and Industrial Research (CSIR) andDepartment of Biotechnology (DBT) New Delhi India forfinancial support The manuscript has Institute of Micro-bial Technology (IMTECH) communication no 0322011

REFERENCES1 D L Hill The Biochemistry and Physiology of Tetrahymena edn

Academic Press New York (1972) pp 2302 J G Hengstler H Foth R Kahl P J Kramer W Lilienblum

T Schulz and H Schweinfurth The reach concept and its impacton toxicological sciences Toxicology 220 232 (2006)

3 P R Duchowicz A G Mercader F M Fernandez and E A CastroPrediction of aqueous toxicity for heterogeneous phenol derivativesby QSAR Chemometr Intell Lab Syst 90 97 (2008)

4 C Hansch and A Leo Substituent Constants for Correla-tion Analysis in Chemistry and Biology Wiley New York(1979)

5 M T D Cronin and J C Dearden Review QSAR in toxicology1 Prediction of Aquatic Toxicity Quant Struct Act Relat 14 1(1995)

6 K Roy and G Ghosh QSTR with extended topochemical atom(ETA) indices 12 QSAR for the toxicity of diverse aromatic

26 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27

Page 6: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis Mishra et al

Table IV Correlation between selected V-life and CDKdescriptors

CDK

Descriptors ATSc3 BCUTc-11 FNSA-1 RPSA GRAV-3 XLogP

MolWt minus008 023 047 minus019 086 064chiV5chain 003 009 minus004 001 minus003 minus003SdSE-index 003 017 002 minus012 007 015

Ipc minus002 minus005 minus004 004 minus01 minus006MMFF_21 minus022 minus049 minus022 009 minus037 minus03V-lifeMMFF_46 minus001 002 001 minus004 002 minus001MMFF_51 minus003 minus005 minus001 minus003 005 006T_2_N_0 minus008 029 055 minus024 04 016T_N_Br_3 minus002 01 016 minus004 007 003T_N_Br_6 minus002 01 003 minus004 002 001MDEN-23 minus002 004 005 minus006 009 003

Web ServerOne of the major challenges for researchers working inthe field of toxicology is to predict the toxicity of achemical compound Best of our knowledge two free soft-wares namely TEST (Toxicity Estimation Software Tool)and OpenTox are available for toxicity prediction2526 Inorder to complement existing effort for providing serviceto community we developed a web server ldquoToxiPredrdquo(httpwwwcrddosddnetraghavatoxipred) for predict-ing toxicity of molecules We integrate JME molecu-lar editor27 in ToxiPred that allows user to draw theirmolecules of choice This server is launch using Apacheunder Linux (Red Hat) environment The common gate-way interface (CGI) script of ToxiPred is written usingPERL version 503 This is a user friendly web serverthat allows to predict pIGC50 of a small chemical againstTetrahymena pyriformis

Table V Correlation between descriptors and PIGC50 values

No Descriptor name Correlation (PIGC50) P -value

1 MolWt 07 00112 chiV5chain minus003 0933 SdSE-index 022 172E-134 Ipc minus008 0305 MMFF_21 minus039 147e-096 MMFF_46 008 196E-057 MMFF_51 005 0678 T_2_N_0 038 00089 T_N_Br_3 012 0003

10 T_N_Br_6 006 002011 MDEN-23 01 096912 ATSc3 minus014 878E-1113 BCUTc-1l 032 075514 FNSA-1 039 732E-1215 RPSA minus031 022216 GRAV-3 07 797e-0717 XLogP 075 lt2e-16

DISCUSSIONIn this study we have developed several models for theprediction of pIGC50 of small organic chemical moleculesin Tetrahymena pyriformis Our present study suggeststhat descriptors are keys for any QSAR modeling but itis not always possible that all relevant descriptors wereimplemented in single software Thus it is important touse descriptors obtained from different software for devel-oping a model As shown in our result section our hybridmodel based on selected set of descriptors obtain from dif-ferent software performed better than models developedusing descriptors of individual software After analyzingselected descriptors from different software we found thatmolecular weight and xlogP shows very high correlation(gt= 070) with pIGC50 value In order to provide thefacility to scientific community we developed a webserverldquoToxiPredrdquo to predict toxicity of chemical compoundsWe hope that present model will aid in the area of drugdesigning

Conflicts of InterestThe authors declare that there are no conflicts of interest

Authorsrsquo ContributionsNKM DS and SA perform all the analysis of data anddeveloped QSAR models DS SA and NKM have devel-oped server ToxiPred Manuscript have been written byDS and NKM OSDD members provides suggestions tothis work GPSR conceived and gave overall supervisionto the project helped in interpretation of data and refinedthe drafted manuscript All authors read and approved thefinal manuscript

Acknowledgments The authors are thankful to OpenSource for Drug Discovery (OSDD) foundation and Coun-cil for Scientific and Industrial Research (CSIR) andDepartment of Biotechnology (DBT) New Delhi India forfinancial support The manuscript has Institute of Micro-bial Technology (IMTECH) communication no 0322011

REFERENCES1 D L Hill The Biochemistry and Physiology of Tetrahymena edn

Academic Press New York (1972) pp 2302 J G Hengstler H Foth R Kahl P J Kramer W Lilienblum

T Schulz and H Schweinfurth The reach concept and its impacton toxicological sciences Toxicology 220 232 (2006)

3 P R Duchowicz A G Mercader F M Fernandez and E A CastroPrediction of aqueous toxicity for heterogeneous phenol derivativesby QSAR Chemometr Intell Lab Syst 90 97 (2008)

4 C Hansch and A Leo Substituent Constants for Correla-tion Analysis in Chemistry and Biology Wiley New York(1979)

5 M T D Cronin and J C Dearden Review QSAR in toxicology1 Prediction of Aquatic Toxicity Quant Struct Act Relat 14 1(1995)

6 K Roy and G Ghosh QSTR with extended topochemical atom(ETA) indices 12 QSAR for the toxicity of diverse aromatic

26 J Transl Toxicol 1 21ndash27 2014

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27

Page 7: ArticleNitish Kumar Mishra1, Deepak Singla1, Sandhya Agarwal1, Open Source Drug Discovery Consortium2, and Gajendra P. S. Raghava1 1Bioinformatics Centre, Institute of Microbial Technology

Delivered by Publishing Technology to Guest UserIP 1413923482 On Sun 11 May 2014 111234

Copyright American Scientific Publishers

Mishra et al ToxiPred A Server for Prediction of Aqueous Toxicity of Small Chemical Molecules in T Pyriformis

compounds to Tetrahymena pyriformis using chemometric toolsChemosphere 77 999 (2009)

7 A M Richard Structure-based methods for predicting mutagenicityand carcinogenicity Are we there yet Mutat Res 400 493 (1998)

8 F Cheng J Shen Y Yu W Li G Liu P W Lee andY Tang In-silico prediction of tetrahymena pyriformis toxicity fordiverse industrial chemicals with substructure pattern recognitionand machine learning methods Chemosphere 82 1636 (2011)

9 G Klopman S K Chakravarti H Zhu J M Ivanov and R DSaiakhov ESP A method to predict toxicity and pharmacologicalproperties of chemicals using multiple MCASE databases J ChemInf Comput Sci 44 704 (2004a)

10 G Klopman H Zhu M A Fuller and R D Saiakhov Searchingfor an enhanced predictive tool for mutagenicity SAR QSAR EnvironRes 15 251 (2004b)

11 P Mazzatorta E Benfenati D Neagu and G Gini The importanceof scaling in data mining for toxicity prediction J Chem Inf Com-put Sci 42 1250 (2002)

12 P Mazzatorta E Benfenati C D Neagu and G Gini Tuning neu-ral and fuzzy-neural networks for toxicity modeling J Chem InfComput Sci 43 513 (2003)

13 A M Richard Commercial toxicology prediction systems A regu-latory perspective Toxicol Lett 102 611 (1998a)

14 A M Richard Future of toxicologyndashpredictive toxicology Anexpanded view of chemical toxicity Chem Res Toxicol 19 1257(2006)

15 F Cheng J Shen Y Yu W Li G Liu P W Lee and Y Tang Insilico prediction of tetrahymena pyriformis toxicity for diverse indus-trial chemicals with substructure pattern recognition and machinelearning methods Chemosphere 82 1636 (2011)

16 C Steinbeck C Hoppe S Kuhn M Floris R Guha and E LWillighagen Recent developments of the chemistry development kit(CDK)mdashan open-source java library for chemo- and bioinformaticsCurr Pharm Des 12 2111 (2006)

17 A M Richard C Yang and R S Judson Toxicity data informaticsSupporting a new paradigm for toxicity prediction Toxicol MechMethods 18 103 (2008)

18 A Garg R Tewari and G P Raghava KiDoQ Using dockingbased energy scores to develop ligand based model for predictingantibacterials BMC Bioinformatics 11 125 (2010)

19 N K Mishra S Agarwal and G P Raghava Pre-diction of cytochrome P450 isoform responsible formetabolizing a drug molecule BMC Pharmacol 10 8(2010)

20 N K Mishra and G P S Raghava Prediction of specificity andcross-reactivity of kinase inhibitors Letters in Drug Design andDiscovery 8 223 (2011)

21 D Singla M Anurag D Dash G P S Raghava A web serverfor predicting inhibitors against bacterial target GlmU protein BMCPharmacol 6 11 (2011)

22 M Hall E Frank G Holmes B Pfahringer P Reutemann andH I Witten The WEKA data mining software An update SIGKDDExplorations 11 10 (2009)

23 T Knebel S Hochreiter and K Obermayer An SMO algorithmfor the potential support vector machine Neural Comput 20 271(2008)

24 T I Netzeva A Gallego Saliner and A P Worth Comparisionof applicability of domain of a quantitative structure-activity rela-tionship for estrogenecity with a large chemical inventry EnvironToxicol Chem 25 1223 (2006)

25 T M Martin P Harten R Venkatapathy S Das D MYoung A hierarchical clustering methodology for theestimation of toxicity Toxicol Mech Methods 18 251(2008)

26 B Hardy N Douglas C Helma M Rautenberg N JeliazkovaV Jeliazkov I Nikolova R Benigni O Tcheremenskaia S KramerT Girschick F Buchwald J Wicker A Karwath M GuumltleinA Maunz H Sarimveis G Melagraki A Afantitis P SopasakisD Gallagher V Poroikov D Filimonov A Zakharov A LaguninT Gloriozova S Novikov N Skvortsova D Druzhilovsky SChawla I Ghosh S Ray H Patel and S Escher Collaborativedevelopment of predictive toxicology applications J Cheminform31 2 7 (2010)

27 Java Molecular Editor [httpwwwmolinspirationcomjme]

J Transl Toxicol 1 21ndash27 2014 27