

Interdiscip Sci Comput Life Sci (2013) 5: 45–52

DOI: 10.1007/s12539-013-0156-y

Optimization of Gaussian Kernel Function in Support Vector Machine Aided QSAR Studies of C-Aryl Glucoside SGLT2 Inhibitors

Rebekah K. PRASOONA1,2, A. JYOTI1,2, Yadav MUKESH3∗, Sharma NISHANT3, Nayarisseri S. ANURAJ1,4, Joshi SHOBHA5

1 (Department of Genetics, Osmania University, Hyderabad, India)
2 (Institute of Genetics & Hospital for Genetic Diseases, Osmania University, Begumpet 500016, Hyderabad, India)
3 (Department of Pharmaceutical Chemistry, Softvision College, Indore 452010, Madhya Pradesh, India)
4 (Bioinformatics Research Laboratory, Eminent Biosciences, Indore 452010, Madhya Pradesh, India)
5 (Government Degree College, Depalpur, Indore 453001, Madhya Pradesh, India)

Received 8 March 2012 / Revised 17 April 2012 / Accepted 4 June 2012

Abstract: The present investigation uses the Support Vector Machine (SVM), a recent statistical learning algorithm, to identify a non-linear structure-activity relationship between the IC50 values and the structures of C-aryl glucoside SGLT2 inhibitors. The training dataset consisted of forty molecules and the remaining six molecules were chosen for test set validation. SVM with a Gaussian Kernel Function yielded non-linear QSAR models. A forward selection algorithm was applied after pruning and a redundancy check on the molecular descriptors. Internal validation of the QSAR models was achieved using R2cv (LOO), PRESS, SDEP and Y-scrambling. The SVM aided non-linear models became more efficient when optimization of the Gaussian Kernel Function was introduced. The non-linear QSAR studies further identified atomic van der Waals volumes, atomic masses, the sum of geometrical distances between O..S and the degree of unsaturation as molecular descriptors and crucial structural requirements for modelling the IC50 of C-aryl glucoside derivatives.

Key words: QSAR, Support Vector Machine (SVM), Gaussian Kernel Function, non-linear QSAR, C-aryl glucoside, SGLT2 inhibitors.

1 Introduction

Diabetes, a disease of high concern to human health, affected 285 million people in 2010. Estimates state that its prevalence may rise from the present 6.4% of the world population to 7.8% by 2030 (International Diabetes Federation, 2009). Diabetes has been leading to disease and death at an early age. Of the two forms of diabetes, type I (5-10%) is autoimmune and is treated by an external insulin supply, while type II (90-95%) is a state of glucose accumulation leading to pathophysiology (Dwarakanathan, 2006). In type II diabetes, glucose transporter proteins supply glucose to fuel tissues, membranes, muscles and organs (Brown, 2000; Wright et al., 2007; Kloeckener-Gruissem et al., 2008). This facilitative glucose carriage across the cellular domain is conducted by the GLUT family of transporter proteins. Sodium-glucose transport proteins (SGLTs), encoded by the SLC5 gene family, are responsible for glucose transport in the intestine and kidney.

∗Corresponding author. E-mail: [email protected]

The S1 segment of the proximal convoluted tubule is the primary site of glucose reabsorption (90%) in the mammalian kidney (Berry and Rector, 1991). Among the SGLTs, SGLT2 transporters play the key role at the S1 segment (Rector, 1983; Moe, 2000). SGLT2 translocates sodium, together with the glucose attached to it, across the apical cell membrane. This secondary active transport works on an electrochemical gradient and thus makes the SGLT2 transporter protein an attractive therapeutic target in type II diabetes (Bakris et al., 2009; Gerich et al., 2001).

Administration of phlorizin, a natural product obtained from the root bark of the apple tree in the 19th century, causes glucosuria. This evidence led the medicinal community to conclude that removal of excess glucose from a diabetic patient through the kidney into the urine is achievable (Rossetti et al., 1987; Ehrenkranz et al., 2005). Among the C-aryl glucosides, a molecule named dapagliflozin was identified as a potent SGLT2 inhibitor for the treatment of type II diabetes (Lee et al., 2010). Derivatives of this class have shown enhanced chemical stability and biological responses resulting from the glucosidic bond (Fig. 1).

Fig. 1 Dapagliflozin

Quantitative Structure Activity Relationship (QSAR) is a tool that applies mathematics and statistics to identify and predict the biological response of a class of compounds as a function of their structure. Structural properties are generally expressed numerically as molecular descriptors derived from chemical structures. Over the last three decades QSAR has evolved through the development of methodologies and statistical approaches to regression such as Support Vector Machine (SVM), Neural Network (NN), Partial Least Squares (PLS), regression trees and ensembles (Soto et al., 2008).

Multiple linear regression (MLR) is the simplest and most significant approach used in QSAR so far. The technique helps to identify linear relationships between molecular structures and their biological responses. Often, however, a structure-activity relationship is non-linear and cannot be identified by MLR analysis (Aksyonova et al., 2003). Recent reports describe SVM as an accurate, robust and fast statistical tool (Pavlidis et al., 2004; Norinder, 2003). The efficiency of SVM in identifying non-linear QSAR has been appreciable and acceptable. Furthermore, the development of kernel functions such as the Gaussian and polynomial kernels has made SVM even more applicable as an alternative tool in QSAR studies. SVM, developed by Vapnik for classification, was further optimized and applied to regression for exploring non-linear QSAR models (Cortes and Vapnik, 1995). The present work includes optimization of the Gaussian Kernel Function by regulating the permissible values of epsilon, cost and sigma. The optimization was investigated in order to reveal the impact of varying the default parameters on the statistical fitness of the QSAR models. The computational tool used extensively in the present investigation is Sarchitect Designer 2.5.0 (Strand Life Sciences). Sarchitect, a sophisticated tool with a simple user interface, provides machine learning techniques such as MLR, SVM, ANN and PLS for regression and classification. Its accuracy, robustness and simple interface make it a tool of choice for building QSAR models.

2 Methodology

We carried out SVM optimization for the present dataset (Table 1) because the results obtained from MLR could not be improved in terms of regression parameters. The headings below explain the optimization of the SVM algorithm in a forward selection wrapper.

Table 1 Structural details of C-aryl glucoside SGLT2 inhibitors, with the original IC50 and log IC50 values used in the present investigation. Structures marked with (*) are included in test set validation.

R2 was a structure drawing for most molecules; the R2 column lists only the atom labels that survive from those drawings, and "-" marks a drawing with no surviving label.

Molecule   R1   R2          IC50 (nM)   log IC50 (nM)
1          H    H           6.25        0.7959
(2*)       H    CH3         6910        3.8395
3          H    -           1420        3.1523
4          H    N           1230        3.0899
5          Cl   -           542         2.7340
6          Cl   CH3         103         2.0128
7          Cl   CH3         24.9        1.3962
(8*)       Cl   N           45.3        1.6561
9          Cl   N           53.0        1.7243
10         Cl   Cl          2390        3.3784
11         Cl   CH3         214         2.3304
12         Cl   CH3         379         2.5786
13         Cl   CH3         353         2.5478
(14*)      Cl   CH3         122         2.0864
15         Cl   OCH3        146         2.1644
16         Cl   SCH3        10.9        1.0374
17         Cl   F, F, F     155         2.1903
18         Cl   F, F, F     759         2.8802
19         Cl   -           445         2.6484
20         Cl   -           16.6        1.2201
21         Cl   Cl          126         2.1004
22         Cl   N           30.3        1.4814
(23*)      Cl   NCH3        638         2.8048
24         Cl   CH3, N      97.4        1.9886
25         Cl   CH3, N      36.4        1.5611
26         Cl   N           81.6        1.9117
27         Cl   N           562         2.7497
(28*)      Cl   N, N        4.06        0.6085
29         Cl   N, N        190         2.2788
30         Cl   O           7.03        0.8470
31         Cl   O           17.7        1.2480
32         Cl   O, H3C      106         2.0253
33         Cl   O, CH3      58.3        1.7657
34         Cl   O           10.6        1.0253
35         Cl   O           249         2.3962
(36*)      Cl   ON, CH3     257         2.4099
(37*)      Cl   N, CH3      119         2.0755
38         Cl   CH3, NN     3280        3.5159
(39*)      Cl   S           14.4        1.1584
40         Cl   S           3.51        0.5453
(41*)      Cl   CH3, S      762         2.8820
42         Cl   S, Cl       712         2.8525
(43*)      Cl   Cl, S       18.2        1.2601
44         Cl   S           44.5        1.6484
45         Cl   S, N        21.8        1.3385
46         Cl   S, NN       72.6        1.8609
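
The log IC50 column of Table 1 appears to be the base-10 logarithm of the IC50 value in nM. The short sketch below checks this for three rows of the table; the function name log_ic50 is hypothetical and the rounding to four decimals is an assumption made to match the table's precision.

```python
import math


def log_ic50(ic50_nm: float) -> float:
    """Return log10 of an IC50 value given in nM, rounded to 4 decimals."""
    return round(math.log10(ic50_nm), 4)


# IC50 values (nM) of molecules 1, 2 and 16, copied from Table 1.
for molecule, ic50 in [("1", 6.25), ("2", 6910.0), ("16", 10.9)]:
    print(molecule, ic50, log_ic50(ic50))
```

Working on the logarithmic scale compresses the range of the response (3.51 nM to 6910 nM in Table 1) before regression.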

2.1 Optimization of Gaussian Kernel Function

The main idea of SVM (Smola and Scholkopf, 2004; Scholkopf and Smola, 2002) is to find the "flattest" (i.e. least complex) linear function that approximates the given data with ε precision in a kernel-induced feature space. This is achieved using the ε-insensitive loss function, which penalizes errors greater than ε. The trade-off between the flatness of the estimate and the amount up to which deviations greater than ε are tolerated is determined by the regularization constant C ≥ 0. This setting is transformed into a constrained optimization problem, for which the Wolfe dual is computed, resulting in a convex programming problem. The solution of this problem is sparse: only a subset of the resulting Lagrange multipliers will be nonzero (Kuhn and Tucker, 1951; Pavlidis et al., 2004) and the associated samples will be the support vectors (SV). Only these vectors and Lagrange multipliers from the SVR and LS-SVM models were explained for the first time in terms of molecular structures, descriptors, biological activity and principal components.

2.2 Default parameters

The Gaussian Kernel Function was applied in the forward selection wrapper with an epsilon value of 0.1 (the prediction error acceptable to the user, i.e. the acceptable difference between the target value for regression and the predicted value), a cost value of 100 (increasing this parameter reduces the training error, but at the cost of the generalization achieved by the regression model) and a sigma value of 1.0 (the default). Typically, there is an optimum value of sigma such that going below it decreases both misclassification (training error) and generalization, and going above it increases misclassification. Models obtained using the Gaussian Kernel Function identified a non-linear relationship between activity and descriptors.

2.3 Optimization of parameters

Optimization of Gaussian Kernel Function wasachieved by the following variation when introduced in

deciding parameters epsilon, cost value and sigma val-ues.

Sigma is held at 1.0 by default. Lowering this parameter may lead to misclassification and to the selection of non-significant descriptors, which can produce good statistical models whose structure-activity relationships are nevertheless not relevant in biochemical terms. Despite this risk, we lowered sigma below 1.0, to 0.9, 0.8 and 0.7. The risk was taken because the MLR results were already close to 0.8 in terms of R2. The results discussed below were obtained after lowering sigma from 1.0 to 0.9.
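
To make the role of sigma concrete, the following minimal sketch evaluates a Gaussian kernel between two descriptor vectors for the sigma values tried above. It assumes the common parameterization K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); the exact form used inside Sarchitect is not stated in the text, and the two descriptor vectors are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel, assuming K = exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Two hypothetical, already scaled descriptor vectors.
x = [0.2, 1.0, 0.5]
z = [0.6, 0.4, 0.9]

# A smaller sigma makes the kernel narrower, so the same pair of
# molecules is treated as less similar.
for sigma in (1.0, 0.9, 0.8, 0.7):
    print(sigma, round(gaussian_kernel(x, z, sigma), 4))
```

Under this assumption, a narrower kernel lets the model follow the training data more closely, which is consistent with the overfitting risk described above.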

Epsilon, the prediction error accepted by the user, is 0.1 by default and can be raised above that value. In the present investigation we varied epsilon from above 0.1 up to 0.9. This decision rests on the assumption that a larger standard error limit can be permitted in order to obtain more significant QSAR models. Setting epsilon to 0.5 produced the SVM non-linear models included in this study.

Cost value is the cost or penalty associated with sample error. If the error is less than the epsilon value, the cost incurred is zero; if the error is greater than epsilon, the cost associated with the error is Cost∗(error-epsilon). The default value is 100. Increasing this parameter reduces the training error, but at the cost of the generalization achieved by the regression model. We took the risk of losing generalization by increasing the cost value above 100, i.e. 110, 120, ..., 200. This recorded a fall in the statistical outcomes, so we decided to hold the cost value constant at 100.
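
The penalty rule described above, zero inside the epsilon tube and Cost∗(error-epsilon) outside it, can be sketched as follows. The function name and the example values are hypothetical, and the error is taken as the absolute difference between observed and predicted values, which is an assumption about how "error" is measured.

```python
def epsilon_insensitive_cost(observed, predicted, epsilon=0.1, cost=100.0):
    """Penalty for one sample: 0 inside the epsilon tube, linear outside it."""
    error = abs(observed - predicted)
    return cost * max(0.0, error - epsilon)


# Hypothetical observed and predicted log IC50 values.
pairs = [(2.00, 1.95), (2.00, 1.70), (2.00, 1.20)]

# Compare the default epsilon (0.1) with the value adopted in the study (0.5).
for epsilon in (0.1, 0.5):
    penalties = [epsilon_insensitive_cost(o, p, epsilon) for o, p in pairs]
    print(epsilon, [round(v, 2) for v in penalties])
```

Raising epsilon widens the tube in which errors cost nothing, while raising the cost value steepens the penalty outside it.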

3 Results and discussions

Stepwise multiple regression analysis, facilitated by the forward selection algorithm of Sarchitect Designer 2.5.0 (developed by Strand Life Sciences Pvt. Ltd.), produced QSAR models with acceptable statistical parameters including R2, R2A, S.E. and F-stat values (Table 2).

3.1 SVM aided non-linear QSAR

SVM, a statistically fast, robust and accurate method (Vapnik, 1999; Furey et al., 2000), was also employed for supervised learning on the dataset, as it is generally used for establishing non-linear relationships between variables and activity. The descriptors chosen by forward selection were collected and checked for redundancy. The statistical parameters obtained in the forward selection of the optimized SVM, which contribute to model fitness, are shown in Table 2.

Table 2 Support Vector Machine analysis and their statistical parameters including R2, R2A and standard errors.

Model                    R2       Max Absolute Error   Mean Absolute Error   R2cv
                                  (Train Metric)       (Train Metric)        (N-Fold)
One-descriptor model     0.1877   1.5211               0.5361                0.3337
Two-descriptor model     0.5787   1.2938               0.3534                0.4899
Three-descriptor model   0.6840   0.9673               0.2940                0.5871
Four-descriptor model    0.7718   0.9354               0.2412                0.6413
Five-descriptor model    0.8480   0.8985               0.1914                0.6682

Fig. 2 Graphical correlation of observed (Log IC50) vs. predicted (Log IC50) for the training set (36 molecules) using SVM aided non-linear MODEL-4 (a) and MODEL-5 (b). Fitted lines recovered from the panels: y=0.872x+0.246, R2=0.877; y=0.767x+0.498, R2=0.788.

Fig. 3 Graphical representation of Y-scrambling (R-Square vs. Correlation) as internal validation parameter using SVM aided non-linear MODEL-4 (a) and MODEL-5 (b).

Five models, built from descriptors of various classes, were employed to find the relationship with the IC50 of the dataset compounds. The descriptors are DISPm (geometrical descriptor: d COMMA2 value / weighted by atomic masses) (svm), c5AB, R4v+, G(O..S) (geometrical descriptor: sum of geometrical distances between O..S) and Du (a WHIM descriptor: D total accessibility index / unweighted).

SVM aided non-linear models were found to be more efficient in terms of statistical fitness and prediction.

3.2 Model validation

The non-linear QSAR models (tetravariable and pentavariable) were used to obtain predicted binding affinities, which were then correlated with the observed binding affinities. The correlations of observed and predicted binding affinities are reported in Table 3, with a graphical view in Fig. 2.

N-fold R2cv was introduced as an internal cross-validation parameter for the univariable to hexavariable models. The calculations suggest that R2cv is stable, with no sharp fall in R2cv values relative to the original R2 values. This assessment is shown in Table 2.
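
The cross-validation statistics named in this work (R2cv, PRESS, SDEP) can be sketched with a leave-one-out loop, which is N-fold cross-validation with one molecule per fold. The text does not give the formulas, so the usual definitions are assumed here: PRESS is the sum of squared leave-one-out prediction errors, SDEP = sqrt(PRESS / n) and R2cv = 1 - PRESS / SS_total. A one-descriptor least-squares line stands in for the SVM model, and the data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_statistics(xs, ys):
    """Leave-one-out PRESS, SDEP and R2cv for the stand-in linear model."""
    n = len(ys)
    press = 0.0
    for i in range(n):
        # Refit the model without sample i, then predict sample i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return {"PRESS": press, "SDEP": math.sqrt(press / n), "R2cv": 1.0 - press / ss_total}


# Hypothetical descriptor values and log IC50 responses.
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
log_ic50_values = [0.9, 1.2, 1.4, 2.1, 2.2, 2.9, 3.1, 3.6]
print(loo_statistics(descriptor, log_ic50_values))
```

An R2cv that stays close to the fitted R2, as reported in Table 2, indicates that no single molecule dominates the model.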

Table 3 Correlation of observed (Log IC50) values and predicted (Log IC50) values using SVM aided non-linear QSAR MODEL-4 and MODEL-5. All values are Log IC50 (nM).

Molecule   Observed   Model 4     Model 4    Model 5     Model 5
                      Predicted   Residual   Predicted   Residual
1          0.7960     0.7699      0.0261     0.7958      0.0002
(2*)       3.8390     3.1973      0.6417     3.8526      −0.0136
3          3.1520     3.1554      −0.0034    3.1557      −0.0037
4          3.0900     3.0868      0.0032     3.0867      0.0033
5          2.7340     2.3800      0.3540     2.4117      0.3223
6          2.0130     2.1447      −0.1317    1.8024      0.2106
7          1.3960     2.0586      −0.6626    1.5943      −0.1983
(8*)       1.6560     2.0554      −0.3994    1.9538      −0.2978
9          1.7240     2.0666      −0.3426    1.7239      0.0001
10         3.3780     3.3813      −0.0033    3.3814      −0.0034
11         2.3300     2.4277      −0.0977    2.4762      −0.1462
12         2.5790     2.4722      0.1068     2.5494      0.0296
13         2.5480     2.4045      0.1435     2.5514      −0.0034
(14*)      2.0860     2.8865      −0.8005    3.3005      −1.2145
15         2.1640     1.5840      0.5800     2.1642      −0.0002
16         1.0370     1.0336      0.0034     1.0360      0.0010
17         2.1900     2.1895      0.0005     2.1920      −0.0020
18         2.8800     2.6888      0.1912     2.8771      0.0029
19         2.6480     2.6466      0.0014     2.5931      0.0549
20         1.2200     1.7924      −0.5724    1.7012      −0.4812
21         2.1000     1.8622      0.2378     1.9843      0.1157
22         1.4810     1.7342      −0.2532    1.7393      −0.2583
(23*)      2.8050     2.0583      0.7467     1.8041      1.0009
24         1.9890     1.9752      0.0138     1.9863      0.0027
25         1.5610     2.2015      −0.6405    2.1337      −0.5727
26         1.9120     1.9152      −0.0032    1.7998      0.1122
27         2.7500     1.9264      0.8236     1.8709      0.8791
(28*)      0.6090     1.7052      −1.0962    1.7983      −1.1893
29         2.2790     1.7013      0.5777     1.6782      0.6008
30         0.8470     1.1050      −0.2580    1.1911      −0.3441
31         1.2480     1.2098      0.0382     1.1820      0.0660
32         2.0250     2.0279      −0.0029    2.0255      −0.0005
33         1.7660     1.3825      0.3835     1.3234      0.4426
34         1.0250     1.8643      −0.8393    1.4336      −0.4086
35         2.3960     2.3933      0.0027     2.3929      0.0031
(36*)      2.4100     1.0894      1.3206     1.1967      1.2133
(37*)      2.0760     2.1253      −0.0493    1.9741      0.1019
38         3.5160     3.5155      0.0005     3.5156      0.0004
(39*)      1.1580     0.5541      0.6039     −0.4462     1.6042
40         0.5450     1.1965      −0.6515    0.5477      −0.0027
(41*)      2.8820     1.6503      1.2317     2.0009      0.8811
42         2.8520     2.8548      −0.0028    2.8547      −0.0027
(43*)      1.2600     0.7254      0.5346     0.9510      0.3090
44         1.6480     1.6310      0.0170     1.6446      0.0034
45         1.3380     1.3342      0.0038     1.3347      0.0033
46         1.8610     1.8644      −0.0034    1.8576      0.0034

(*) Compounds are included in external test set.

Fig. 4 Graphical correlation of observed (Log IC50) vs. predicted (Log IC50) for the test set (10 molecules) using SVM aided non-linear MODEL-4 (a) and MODEL-5 (b).

Y-scrambling guards against the possibility of having learned chance models, in which descriptors happen to be correlated with the endpoint for the particular dataset by statistical chance. Models are fitted to randomly reordered activity values and compared with the model obtained for the actual activities. The absence of chance modelling was confirmed with 100 iterations and the correlation of R2 values shown in Fig. 3.
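
The Y-scrambling procedure described above can be sketched as follows: the activity values are shuffled, the model is refitted, and the resulting R2 values are compared with the R2 of the unscrambled model. This is an illustration under stated assumptions, not the Sarchitect implementation: a one-descriptor least-squares fit (whose R2 equals the squared Pearson correlation) stands in for the SVM model, the data are hypothetical, and the seed is fixed only to make the run repeatable.

```python
import random


def r_squared(xs, ys):
    """R2 of a one-descriptor least-squares line (squared Pearson correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def y_scramble(xs, ys, iterations=100, seed=0):
    """Refit the model on shuffled activities and collect the R2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # break the descriptor-activity pairing
        scores.append(r_squared(xs, shuffled))
    return scores


# Hypothetical descriptor values and log IC50 responses.
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
activity = [0.8, 1.1, 1.5, 1.9, 2.2, 2.4, 3.0, 3.3, 3.5, 3.9]

true_r2 = r_squared(descriptor, activity)
scrambled = y_scramble(descriptor, activity, iterations=100)
print("actual R2:", round(true_r2, 4))
print("mean scrambled R2:", round(sum(scrambled) / len(scrambled), 4))
```

If the scrambled models scored as well as the actual one, the original fit would be suspected of arising by chance.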

The set of 10 molecules kept aside for test set validation was then submitted to activity prediction using the SVM aided non-linear models (tetravariable and pentavariable). The results are given in Table 3 and the graphical correlation in Fig. 4. Overall, the SVM models proved efficient in predictive power.

4 Conclusion

The results and discussion show the utility of SVM for identifying non-linear relationships, a capability the QSAR community has long lacked. The superiority of SVM is readily apparent in the present investigation. SVM suggests that a non-linear structure-activity relationship can exist where linear approaches fail to identify it. The Gaussian Kernel Function was optimized through the parameters of the ε-insensitive loss function.

Autocorrelated atomic van der Waals volumes, autocorrelated atomic polarizabilities, aromatic index and rotatable bond fraction were identified as the structural properties that are linearly (MLR) related to the IC50 values of SGLT2 inhibitors. The non-linear QSAR studies identified R maximal autocorrelation of lag 4 / weighted by atomic van der Waals volumes, d COMMA2 value / weighted by atomic masses, number of fragments Cyc5[AB] with label C on atom 1, sum of geometrical distances between O..S and degree of unsaturation as molecular descriptors.

References

[1] Aksyonova, T.I., Volkovich, V.V., Tetko, I.V. 2003. Robust polynomial neural networks in quantative-structure activity relationship studies. Syst Anal Model Simul 43, 1331-1339.

[2] Bakris, G.L., Fonseca, V., Sharma, K., Wright, E. 2009. Renal sodium-glucose transport: Role in diabetes mellitus and potential clinical implications. Kidney Int 75, 1272-1277.

[3] Berry, C.A., Rector, F.C. Jr. 1991. Renal transport of glucose, amino acids, sodium, chloride, and water. In: Brenner, B.M., Rector, F.C. Jr. (Eds.) The Kidney, 4th Edition, W.B. Saunders, Philadelphia, 245-282.

[4] Brown, G.K. 2000. Glucose transporters: Structure, function and consequences of glucose deficiency. J Inherit Metab Dis 23, 237-246.

[5] Cortes, C., Vapnik, V. 1995. Support-vector networks. Mach Learn 20, 273-297.

[6] Dwarakanathan, A. 2006. Diabetes update. J Insur Med 38, 20-30.

[7] Ehrenkranz, J.R., Lewis, N.G., Kahn, C.R., Roth, J. 2005. Phlorizin: A review. Diabetes Metab Res Rev 21, 31-38.

[8] Furey, T., Cristianini, N., Duffy, N., Bednarski, D., Schummer, M., Haussler, D. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906-914.

[9] Gerich, J.E., Woerle, H.J., Meyer, C., Stumvoll, M. 2001. Renal gluconeogenesis. Diabetes Care 24, 382-391.

[10] International Diabetes Federation. 2009. Diabetes Atlas, 4th Edition, Montreal, Canada.

[11] Kloeckener-Gruissem, B., Vandekerckhove, K., Nurnberg, G., Neidhardt, J., Zeitz, C., Nurnberg, P., Schipper, I., Berger, W. 2008. Mutation of solute carrier SLC16A12 associates with a syndrome combining juvenile cataract with microcornea and renal glucosuria. Am J Hum Genet 82, 772-779.

[12] Kuhn, H.W., Tucker, A.W. 1951. Nonlinear programming. In: Neyman, J. (Ed.) Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, University of California Press, Los Angeles, 481-492.

[13] Lee, J., Lee, S.H., Seo, H.J., Son, E.J., Lee, S.H., Jung, M.E., Lee, M., Han, H.K., Kim, J., Kang, J., Lee, J. 2010. Novel C-aryl glucoside SGLT2 inhibitors as potential antidiabetic agents. 1,3,4-Thiadiazolylmethylphenyl glucoside congeners. Bioorg Med Chem 18, 2178-2194.

[14] Norinder, U. 2003. Support vector machine models in drug design: Application to drug transport processes and QSAR using simplex optimizations and variable selection. Neurocomputing 55, 337-346.

[15] Pavlidis, P., Wapinski, I., Noble, W.S. 2004. Support vector machine classification on the web. Bioinformatics 20, 586-587.

[16] Rector, F.C. Jr. 1983. Sodium, bicarbonate, and chloride absorption by the proximal tubule. Am J Physiol 244, F461-F471.

[17] Rossetti, L., Shulman, G.I., Zawalich, W., DeFronzo, R.A. 1987. Effect of chronic hyperglycemia on in vivo insulin secretion in partially pancreatectomized rats. J Clin Invest 80, 1037-1044.

[18] Scholkopf, B., Smola, A.J. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, Cambridge, MA, 185-208.

[19] Smola, A.J., Scholkopf, B. 2004. A tutorial on support vector regression. Stat Comput 14, 199-222.

[20] Soto, A.J., Cecchini, R.L., Vazquez, G.E., Ponzoni, I. 2008. An evolutionary approach for feature selection applied to ADMET prediction. Amer J Artif Intell 37, 55-63.

[21] Vapnik, V. 1999. The Nature of Statistical Learning Theory, Springer-Verlag, New York.

[22] Wright, E.M., Hirayama, B.A., Loo, D.F. 2007. Active sugar transport in health and disease. J Intern Med 261, 32-43.