
Int. J. Miner. Process. 91 (2009) 55–59


Hardgrove grindability index prediction using support vector regression

B. Venkoba Rao a,⁎, S.J. Gopalakrishna b

a Engineering and Industrial Services R and D, Tata Consultancy Services Limited, 54 B, Hadapsar Industrial Estate, Pune 411013, India
b Department of Mineral Processing, Post Graduate Center, Sandur, Karnataka 583119, India

⁎ Corresponding author. Tel.: +91 20 66086104; fax: +91 20 66086199. E-mail address: [email protected] (B. Venkoba Rao).

0301-7516/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.minpro.2008.12.003

a r t i c l e   i n f o

Article history:
Received 4 August 2008
Received in revised form 30 October 2008
Accepted 10 December 2008
Available online 24 December 2008

Keywords:
Coal
Hardgrove grindability index
Support vector regression

a b s t r a c t

The Hardgrove grindability index (HGI) measures the grindability of coal and is a qualitative indicator of coal quality. It is referred to in the mining, beneficiation and utilization of coal. The HGI of a coal depends on its composition, and there is interest in predicting this property from the proximate analysis of coal. In this paper, support vector regression (SVR), a promising machine learning technique, is used to develop a non-linear relationship between the input proximate analysis of coal and the output HGI by training the SVR model with a limited set of measured data and validating it with the remaining unseen data. The results suggest that SVR can be trained with a smaller data set than the artificial neural network (ANN) techniques studied earlier, while still validating the remaining data.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Coal grindability is measured in terms of the Hardgrove grindability index (HGI), in contrast to mineral grindability, which is measured in terms of the Bond grindability index. HGI is widely used as a coal-quality parameter in coal mining, beneficiation and utilization. Handling and economic considerations of coal make HGI an important parameter, especially for pulverized-coal-fired boilers. HGI reflects the characteristics of coal in terms of hardness, tenacity and fracture, and it is influenced by the rank, petrography and mineral composition of the coal (Ural and Akyildiz, 2004). HGI is determined experimentally by a specified standard procedure using the standard Hardgrove equipment. A lower HGI indicates that the coal is harder to grind and thus more energy is required to grind it.

Although the HGI testing device is not costly, it is tedious to determine the grindability index experimentally (Chelgani et al., 2008). There is therefore an interest in predicting HGI values from proximate analysis, either using second-order regression as proposed by Sengupta (2002) or using machine learning methods such as artificial neural network (ANN) techniques (Peisheng et al., 2005; Chelgani et al., 2008; Özbayoğlu et al., 2008), to capture the complex relationship between the input proximate analysis and the output HGI. Proximate analysis characterizes the chemical composition of coal in terms of moisture, volatile matter, ash and fixed carbon. Moreover, as the inherent characteristics of coal depend on its rank and the region it belongs to, no general equation for HGI is valid for all coal samples unless these aspects are considered during model development (Hower, 2006). This paper presents the prediction of HGI from the proximate analysis of published data on Chinese coals using support vector regression.

2. Support vector regression (SVR)

SVR is a powerful machine learning method that is useful for constructing data-driven non-linear empirical process models. It shares many features with ANNs but possesses some additional desirable characteristics and is gaining widespread acceptance in data-driven non-linear modeling applications. SVR offers good generalization of the regression function, robustness of the solution, the ability to perform regression from sparse data, and automatic control of solution complexity. The method identifies the explicit data points among the input variables that are important for defining the regression function. This feature makes SVR interpretable in terms of the training data, in comparison with other black-box models, including ANNs, whose model parameters are difficult to interpret. SVR has found application in soft-sensor development (Desai et al., 2006), non-linear system identification of an autocatalytic reactor (Jemwa and Aldrich, 2003), battery state-of-charge estimation (Hansen and Wang, 2005), image approximation and smoothening (Chow and Lee, 2001), and many others.

Given below is a brief description of SVR. A more detailed description of SVR can be found in Vapnik (2000) and Cristianini and Shawe-Taylor (2003). Given $N$ input sample points $\vec{x}_1, \vec{x}_2, \vec{x}_3, \ldots, \vec{x}_N$, where $\vec{x}_i \in \mathbb{R}^N$, and the respective scalar outputs $y_1, y_2, y_3, \ldots, y_N$, the objective is to learn this input–output mapping with high generalization from the training set of examples and to find a regression function of the form

$$f(\vec{x}) = \sum_{i=1}^{N} \alpha_i K(\vec{x}_i, \vec{x}) + b. \quad (1)$$

Fig. 1. Illustration of SVR showing the regression curve f(x) together with the ε-insensitive 'tube'. Also shown are the slack variables ξ and ξ̂.

Table 1
List of popular kernel functions (Vapnik, 2000; Cristianini and Shawe-Taylor, 2003; Paláncz and Völgyesi, 2004; Nilsson et al., 2006).

1. Simple dot product: $K(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j)$
2. Polynomial kernel: $K(\vec{x}_i, \vec{x}_j) = \left((\vec{x}_i \cdot \vec{x}_j) + 1\right)^d$, $d > 0$
3. Vovk's real polynomial: $K(\vec{x}_i, \vec{x}_j) = \dfrac{1 - (\vec{x}_i \cdot \vec{x}_j)^d}{1 - \vec{x}_i \cdot \vec{x}_j}$
4. Gaussian RBF: $K(\vec{x}_i, \vec{x}_j) = \exp\left(-\beta \|\vec{x}_i - \vec{x}_j\|^2\right)$, $\beta > 0$
5. Exponential RBF: $K(\vec{x}_i, \vec{x}_j) = \exp\left(-\beta \|\vec{x}_i - \vec{x}_j\|\right)$, $\beta > 0$
6. Regularized Fourier kernel: $K(\vec{x}_i, \vec{x}_j) = \prod_{k=1}^{n} \dfrac{1 - q^2}{2\left(1 - 2q\cos(x_{ik} - x_{jk}) + q^2\right)}$, $0 < q < 1$
7. Wavelet kernel: $K(\vec{x}_i, \vec{x}_j) = \prod_{k=1}^{n} \cos\!\left(1.75\,\dfrac{x_{ik} - x_{jk}}{a}\right) \exp\!\left(-\dfrac{(x_{ik} - x_{jk})^2}{2a^2}\right)$

The accuracy to which the function fits the data is determined by the ε-insensitive loss function proposed by Vapnik (2000), given by

$$L_\varepsilon\left(y, f(\vec{x})\right) = \begin{cases} \left|y - f(\vec{x})\right| & \text{for } \left|y - f(\vec{x})\right| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where ε is the prescribed parameter, y is the desired response, f(x⃗) is the estimated output, and x⃗ is an input vector. Fig. 1 shows a tube of radius ε, represented by dotted lines, around the estimated output. The function $L_\varepsilon(y, f(\vec{x}))$ determines the cost or penalty associated with a data point for lying outside the ε-insensitive tube. For deviations smaller than the specified ε-value, no penalty is incurred. This is shown in Fig. 2. SVR uses an admissible kernel, $K(\vec{x}_i, \vec{x})$, which satisfies Mercer's condition (Vapnik, 2000), to map the data in the input space to a high-dimensional feature space where the regression is carried out in linear form. The input vector x⃗ is mapped into the feature space by some nonlinear mapping Φ. In the feature space, a linear regression function is approximated using the following function:

$$f(\vec{x}) = \langle w \cdot \Phi(\vec{x}) \rangle + b \quad (3)$$

where w and b are coefficients and Φ(x⃗) denotes the high-dimensional feature space that is nonlinearly mapped from the input space x⃗.

Fig. 2. Plot showing the ε-insensitive error function, L_ε, in which the error increases linearly with distance for data points beyond the insensitive region. The point represents a support vector.

The value of the kernel function $K(\vec{x}_i, \vec{x}_j)$ equals the inner product of the two vectors x⃗_i and x⃗_j in the feature space, namely Φ(x⃗_i) and Φ(x⃗_j), which means $K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j) = \Phi^T(\vec{x}_i)\,\Phi(\vec{x}_j)$. The use of a kernel function makes it possible to map the data implicitly into a feature space and to train the SVR in such a space without needing to represent the feature vectors explicitly (Cristianini and Shawe-Taylor, 2003). Table 1 gives a list of some of the popular kernels.
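To make the kernel idea concrete, the sketch below implements two of the kernels from Table 1 (the Gaussian RBF and the wavelet kernel) as Gram-matrix functions. This is an illustrative sketch, not code from the paper; the function and parameter names (beta, a) simply follow the notation of Table 1.

```python
import numpy as np

def gaussian_rbf_kernel(X, Y, beta=1.0):
    """Gaussian RBF kernel from Table 1: K(xi, xj) = exp(-beta * ||xi - xj||^2)."""
    # Squared Euclidean distances between every row of X and every row of Y.
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-beta * d2)

def wavelet_kernel(X, Y, a=0.6):
    """Wavelet kernel from Table 1: product over input dimensions of
    cos(1.75 * (xi_k - xj_k) / a) * exp(-(xi_k - xj_k)^2 / (2 a^2))."""
    diff = X[:, None, :] - Y[None, :, :]          # shape (n_X, n_Y, n_features)
    term = np.cos(1.75 * diff / a) * np.exp(-diff**2 / (2.0 * a**2))
    return np.prod(term, axis=2)                  # Gram matrix, shape (n_X, n_Y)

# Example: a small Gram matrix for three 3-dimensional samples.
X = np.array([[0.1, 0.5, 0.2], [0.3, 0.4, 0.1], [0.2, 0.6, 0.3]])
print(wavelet_kernel(X, X, a=0.6))
```

Returning the full Gram matrix, rather than a single kernel value, is convenient because both training (Eq. (5)) and prediction (Eq. (1)) only ever need kernel evaluations between sets of points.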

SVR is formulated as the minimization of the functional (Vapnik, 2000)

$$c \sum_{i=1}^{N} L_\varepsilon\left(y_i, f(\vec{x}_i)\right) + \frac{1}{2}\|w\|^2,$$

mainly to avoid overfitting of the regression model. This is called structural risk minimization. It is represented as the sum of two terms, an empirical risk term and a regularization term. The empirical risk is a measure of the prediction error with respect to the training set, given by the difference between the target output and the predicted output as defined in Eq. (2), while the regularization term keeps the weight coefficients in Eq. (3) as small as possible during minimization.

By considering the constraints, this can be stated as

$$\begin{aligned}
\text{minimize} \quad & c \sum_{i=1}^{N} \left(\xi_i + \hat{\xi}_i\right) + \frac{1}{2}\|w\|^2, \\
\text{subject to} \quad & \left(\langle w \cdot \Phi(\vec{x}_i) \rangle + b\right) - y_i \le \varepsilon + \xi_i, \\
& y_i - \left(\langle w \cdot \Phi(\vec{x}_i) \rangle + b\right) \le \varepsilon + \hat{\xi}_i, \\
& \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, 2, \ldots, N
\end{aligned} \quad (4)$$

where w is the weight vector with $\|w\| = \sqrt{w^T w} = \sqrt{\sum_{i=1}^{N} w_i^2}$ and c is a constant parameter that controls the number of support vectors as well as the tradeoff between the smoothness of the SVR function and the total training error. ξ_i and ξ̂_i denote slack variables that measure the cost of errors on the up-side and down-side of the regression, respectively. The slack variables are shown in Figs. 1 and 2. For the points inside the ε-tube, ξ_i = ξ̂_i = 0.

By using Lagrange multiplier techniques, the minimization problem in Eq. (4) leads to the following quadratic optimization problem (Chow and Lee, 2001; Cristianini and Shawe-Taylor, 2003):

$$\begin{aligned}
\text{maximize} \quad & W(\alpha) = \sum_{i=1}^{N} y_i \alpha_i - \varepsilon \sum_{i=1}^{N} |\alpha_i| - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 0 \quad \text{and} \quad -c \le \alpha_i \le c, \quad i = 1, 2, \ldots, N
\end{aligned} \quad (5)$$

where α_i is the Lagrange multiplier associated with each training example x⃗_i. Appendix A provides the derivation of Eq. (5) from Eq. (4). Eq. (5) indicates that the training process of SVR involves finding an optimal set of Lagrange multipliers (α_i, ∀i ∈ [1, N]) that maximize the SVR energy function, W(α). The convexity of the functional W(α) and of the feasible region ensures that the solution can always be found efficiently. This is often cited as a major advance over the older neural networks, which were plagued by many local optima.

After obtaining the optimum set of Lagrange multipliers for the training data by maximizing the SVR energy function, the bias of the SVR function, b, is calculated as shown below (Cristianini and Shawe-Taylor, 2003):

$$b_j = y_j - \sum_{i=1}^{N} \alpha_i K(\vec{x}_j, \vec{x}_i) - \varepsilon - \alpha_j / c. \quad (6)$$

The value of b is taken as the average of the b_j values. The training vectors with non-zero Lagrange multipliers are called support vectors. These are the points that lie on the boundary of the ε-tube or outside the tube, and they contribute to the predictions given by Eq. (1). The function f(x⃗) is equivalent to the hyperplane in the feature space implicitly defined by the kernel, $K(\vec{x}_i, \vec{x})$, the Lagrange multipliers, α_i, obtained by optimizing Eq. (5), and the bias, b, defined in Eq. (6) (Cristianini and Shawe-Taylor, 2003). There is no need to know the weight vector w or the true mapping, Φ(x⃗). The support vectors form a sparse subset of the training data that essentially defines the regression function, and these are the only terms that have to be evaluated in the predictive regression model.
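As an illustration of how Eqs. (1), (5) and (6) fit together, the sketch below maximizes the dual of Eq. (5) with a general-purpose optimizer and then forms predictions with Eq. (1). It is a minimal sketch, not the authors' implementation: a dedicated QP or SMO solver would normally be used, the |α_i| term makes the objective non-smooth (a production code would split α_i into positive and negative parts to obtain a standard quadratic program), and the Gaussian RBF kernel is used only for compactness.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Y, beta=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-beta * d2)

def train_svr(X, y, c=1000.0, eps=0.001, beta=1.0):
    N = len(y)
    K = rbf_kernel(X, X, beta)

    # Negative of the SVR energy function W(alpha) from Eq. (5).
    def neg_W(a):
        return -(y @ a - eps * np.sum(np.abs(a)) - 0.5 * a @ K @ a)

    cons = {"type": "eq", "fun": lambda a: np.sum(a)}   # sum(alpha_i) = 0
    bounds = [(-c, c)] * N                              # -c <= alpha_i <= c
    res = minimize(neg_W, np.zeros(N), bounds=bounds, constraints=cons)
    alpha = res.x

    # Bias from Eq. (6), averaged over the support vectors (non-zero alphas).
    sv = np.abs(alpha) > 1e-6
    b = np.mean(y[sv] - K[sv] @ alpha - eps - alpha[sv] / c) if sv.any() else 0.0
    return alpha, b

def predict(X_train, alpha, b, X_new, beta=1.0):
    # Eq. (1): f(x) = sum_i alpha_i K(x_i, x) + b
    return rbf_kernel(X_new, X_train, beta) @ alpha + b
```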

For regression by SVR, the user has to select three parameters, namely the insensitivity parameter ε, the penalty parameter c and the shape parameter of the kernel function. The choice of these parameters is vital to good regression. If c is too small, insufficient stress will be placed on fitting the training data. If c is too large, the algorithm will overfit the training data, and overfitting implies poor generalization. The maximum value that c can take is infinity.
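One common way of choosing these parameters, not discussed in the paper, is a cross-validated grid search. The sketch below shows this with scikit-learn's ε-SVR, which names the penalty parameter C and the RBF shape parameter gamma; the data and parameter grids are arbitrary stand-ins used only for illustration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Toy data standing in for normalized proximate-analysis inputs and HGI outputs.
rng = np.random.default_rng(0)
X = rng.random((60, 3))
y = 0.5 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * rng.standard_normal(60)

param_grid = {
    "C": [1, 10, 100, 1000],          # penalty parameter (c in the text)
    "epsilon": [0.001, 0.01, 0.05],   # insensitivity parameter
    "gamma": [0.1, 1.0, 10.0],        # RBF kernel shape parameter
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```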

The performance of SVR depends largely on the choice of the kernel type and the kernel parameters. However, there is no theoretical guidance on how to choose a kernel function; the best choice of kernel for a given problem is still an open research issue. In the absence of any known guidelines, the kernel has to be chosen in a data-dependent way. The default options are Gaussian or polynomial kernels, and if these prove to be ineffective, more elaborate kernels need to be tried. In the following section, the wavelet kernel has been found to be the most appropriate for the analysis of the HGI data.

Fig. 3. SVR model performance with the data set selection used by Peisheng et al. (2005): (a) SVR training for 61 data sets with the wavelet kernel with parameter coefficients a=0.6, ε=0.001 and c=1000. (b) Validation of the SVR model for the remaining 6 data sets.

3. SVR for HGI prediction

Here the published data of Peisheng et al. (2005) are considered to relate proximate analysis with HGI using SVR analysis. The data comprise 67 coal samples spanning a wide rank range. Recently, Chelgani et al. (2008) pointed out that this data set has a problem with regard to the reversal of HGI values in the medium volatile bituminous rank range. Apart from this, they questioned the use of all four parameters, namely moisture, volatile matter, ash and fixed carbon, for the development of a regression model for HGI, as the four parameters form a closed system that adds up to 100%. In view of these comments, the SVR presented in this paper is based on three parameters, namely moisture, volatile matter and ash, as these give better regression results.

Initially, the input and output data (namely moisture, volatile matter and ash, as well as HGI) are preprocessed by normalizing each of these variables by its corresponding maximum value so that all values lie in the range 0 to 1.

The SVR model is trained with the wavelet kernel (refer to Table 1) with parameter coefficients a=0.6, ε=0.001 and c=1000 so as to estimate the Lagrange multipliers and maximize the SVR energy function, W(α). In order to see how SVR performs against the generalized regression neural network (GRNN) model of Peisheng et al. (2005), SVR is first trained with 61 sets of data and validated with data sets #28, #50, #4, #23, #21 and #12, as proposed by Peisheng et al. (2005). Fig. 3 shows the results of this study, with training and validation represented in Fig. 3(a) and (b) respectively. Fig. 3(a) shows that the predicted values match the target values fairly well without overfit. Fig. 3(b) indicates that the SVR model closely tracks the results of the GRNN model of Peisheng et al. (2005), with only three parameters from the proximate analysis used to build the SVR model. The correlation coefficient of the model output with the actual data is also indicated on the graph and is in general agreement with what has been reported in the literature.
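As an illustration of the workflow just described, the sketch below trains scikit-learn's SVR with the wavelet kernel of Table 1 supplied as a custom Gram-matrix callable, using the 61/6 split discussed above. This is a hedged sketch: the arrays X (moisture, volatile matter, ash) and hgi are random stand-ins for the 67 samples of Peisheng et al. (2005), which are not reproduced here, and scikit-learn's ε-SVR formulation is closely related to, but not identical to, Eqs. (4)–(6).

```python
import numpy as np
from sklearn.svm import SVR

def wavelet_kernel(X, Y, a=0.6):
    """Wavelet kernel of Table 1, returned as a Gram matrix for use as a custom kernel."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.prod(np.cos(1.75 * diff / a) * np.exp(-diff**2 / (2 * a**2)), axis=2)

# Random stand-ins for the 67 samples of (moisture, volatile matter, ash) and HGI;
# in practice these would be the tabulated data of Peisheng et al. (2005).
rng = np.random.default_rng(0)
X = rng.uniform([0.5, 10.0, 5.0], [12.0, 40.0, 35.0], size=(67, 3))
hgi = rng.uniform(40.0, 100.0, size=67)

# Normalize each variable by its maximum so that all values lie between 0 and 1.
X_n, y_n = X / X.max(axis=0), hgi / hgi.max()

# Validation samples #28, #50, #4, #23, #21 and #12 (1-based numbering in the paper).
val_idx = np.array([28, 50, 4, 23, 21, 12]) - 1
train_idx = np.setdiff1d(np.arange(67), val_idx)

model = SVR(kernel=lambda A, B: wavelet_kernel(A, B, a=0.6), C=1000, epsilon=0.001)
model.fit(X_n[train_idx], y_n[train_idx])

pred = model.predict(X_n[val_idx])
print("validation correlation:", np.corrcoef(y_n[val_idx], pred)[0, 1])
```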

An SVR model can be built from a small data set whose results hold for the rest of the data. To confirm this postulate, an exercise of training the model with 48 data sets and validating the built model with the remaining 19 sets is considered. The main task is to obtain a multivariate regression between HGI and the proximate analysis parameters from the measured training data that is nevertheless capable of generalizing to a larger validation data set unseen by the model, in the presence of measurement errors. The choice of the data sets for training and validation is arrived at by random trials that give good correlation coefficients for both the training and validation data sets. This random selection of training and validation data was likely necessitated by the reversal of HGI values in the medium volatile bituminous rank range pointed out by Chelgani et al. (2008). In spite of the problems associated with the data, SVR is able to track the proper input–output relation with the smaller training data set and validate the remaining 30% of untrained data. There are many such combinations that give good correlations for both the training and validation data; the existence of such sets is influenced by the small noise associated with the measured variables. For each choice of kernel and kernel parameters, SVR needs to be retrained, so an extensive search must be conducted before the results can be trusted, and this often complicates the task. Fig. 4(a) and (b) show, respectively, the results of training and generalization for one such instance, along with the correlation coefficients indicated on the graph. The results are encouraging and suggest that SVR is a potential technique for data analyses in which multivariate relations are built that essentially capture the trend in the data, especially with limited data.

Fig. 4. SVR model performance with a smaller data set selection: (a) SVR training for 48 data sets with the wavelet kernel with parameter coefficients a=0.8, ε=0.005 and c=1000. (b) Validation of the SVR model for the remaining 19 data sets.
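The random-trial selection of the 48/19 split can be sketched as below. This is an illustration under stated assumptions: the data are synthetic stand-ins for the 67 normalized samples, and the loop simply keeps the split with the best balance of training and validation correlation, which is one plausible reading of the procedure described above.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

def wavelet_kernel(X, Y, a=0.8):
    diff = X[:, None, :] - Y[None, :, :]
    return np.prod(np.cos(1.75 * diff / a) * np.exp(-diff**2 / (2 * a**2)), axis=2)

# Synthetic stand-ins for the 67 normalized (moisture, volatile matter, ash) -> HGI samples.
rng = np.random.default_rng(1)
X_n = rng.random((67, 3))
y_n = 0.4 + 0.3 * X_n[:, 1] - 0.2 * X_n[:, 2] + 0.03 * rng.standard_normal(67)

best_seed, best_score = None, -np.inf
for seed in range(200):
    X_tr, X_va, y_tr, y_va = train_test_split(X_n, y_n, test_size=19, random_state=seed)
    model = SVR(kernel=wavelet_kernel, C=1000, epsilon=0.005).fit(X_tr, y_tr)
    r_tr = np.corrcoef(y_tr, model.predict(X_tr))[0, 1]
    r_va = np.corrcoef(y_va, model.predict(X_va))[0, 1]
    score = min(r_tr, r_va)          # favour splits that do well on both sets
    if score > best_score:
        best_seed, best_score = seed, score

print("best split seed:", best_seed, "min(train, validation) correlation:", round(best_score, 3))
```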

4. Conclusions

This paper presents support vector regression as an alternative method for regressing HGI from the proximate analysis of coal, apart from the existing ANN techniques. It is shown that only three parameters of the proximate analysis, namely moisture, volatile matter and ash, are sufficient to obtain a fairly good correlation coefficient for both the training and validation data. It also shows that an SVR model can be developed by learning from a smaller set of training data whose predictions remain valid for the rest of the data.

Acknowledgements

BVR would like to acknowledge Mr. Shivaram Kamat, Mr. Vivek Diwanji and Dr. Phanibhushan Sistu, Engineering and Industrial Services, Tata Consultancy Services Limited, for supporting this work.

Appendix A

Eq. (4) can be stated as

$$\begin{aligned}
\text{minimize} \quad & c \sum_{i=1}^{N} \left(\xi_i + \hat{\xi}_i\right) + \frac{1}{2} w^T w, \\
\text{subject to} \quad & w^T \Phi(\vec{x}_i) + b - y_i + \varepsilon + \xi_i \ge 0, \\
& y_i - w^T \Phi(\vec{x}_i) - b + \varepsilon + \hat{\xi}_i \ge 0, \\
& \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, 2, \ldots, N.
\end{aligned} \quad (A.1)$$


By introducing the Lagrange multipliers γ_i ≥ 0, γ̂_i ≥ 0, η_i ≥ 0 and η̂_i ≥ 0, a primal-variable Lagrangian, L, can be formed as follows:

$$\begin{aligned}
L = {} & \frac{1}{2} w^T w + c \sum_{i=1}^{N} \left(\xi_i + \hat{\xi}_i\right) - \sum_{i=1}^{N} \left(\eta_i \xi_i + \hat{\eta}_i \hat{\xi}_i\right) \\
& - \sum_{i=1}^{N} \gamma_i \left(w^T \Phi(\vec{x}_i) + b - y_i + \varepsilon + \xi_i\right) \\
& - \sum_{i=1}^{N} \hat{\gamma}_i \left(y_i - w^T \Phi(\vec{x}_i) - b + \varepsilon + \hat{\xi}_i\right).
\end{aligned} \quad (A.2)$$

The primal-variable Lagrangian, L, has to be minimized with respect to the primal variables w, b, ξ_i and ξ̂_i, and maximized with respect to the non-negative Lagrange multipliers γ_i, γ̂_i, η_i and η̂_i. Hence the function has a saddle point at the optimal solution with respect to the primal variables. At the optimal solution, the partial derivatives of L with respect to the primal variables vanish. Therefore,

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \left(\gamma_i - \hat{\gamma}_i\right) \Phi(\vec{x}_i) \quad (A.3)$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \left(\gamma_i - \hat{\gamma}_i\right) = 0 \quad (A.4)$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \gamma_i + \eta_i = c \quad (A.5)$$

$$\frac{\partial L}{\partial \hat{\xi}_i} = 0 \;\Rightarrow\; \hat{\gamma}_i + \hat{\eta}_i = c. \quad (A.6)$$

Using the results from Eqs. (A.3)–(A.6) to eliminate the corresponding variables from the Lagrangian L, we see that the dual problem involves maximizing

$$\begin{aligned}
L(\gamma, \hat{\gamma}) = {} & -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left(\gamma_i - \hat{\gamma}_i\right)\left(\gamma_j - \hat{\gamma}_j\right) K(\vec{x}_i, \vec{x}_j) \\
& - \varepsilon \sum_{i=1}^{N} \left(\gamma_i + \hat{\gamma}_i\right) + \sum_{i=1}^{N} \left(\gamma_i - \hat{\gamma}_i\right) y_i \\
\text{subject to} \quad & \sum_{i=1}^{N} \left(\gamma_i - \hat{\gamma}_i\right) = 0, \\
& 0 \le \gamma_i \le c, \quad i = 1, 2, \ldots, N, \\
& 0 \le \hat{\gamma}_i \le c, \quad i = 1, 2, \ldots, N.
\end{aligned} \quad (A.7)$$

By considering α_i = γ_i − γ̂_i and using γ_i γ̂_i = 0 to write γ_i + γ̂_i = |α_i|, Eq. (5) is obtained.
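Writing this substitution out explicitly, with α_i = γ_i − γ̂_i and |α_i| = γ_i + γ̂_i (since γ_i γ̂_i = 0), Eq. (A.7) becomes

$$W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) - \varepsilon \sum_{i=1}^{N} |\alpha_i| + \sum_{i=1}^{N} \alpha_i y_i,
\qquad \sum_{i=1}^{N} \alpha_i = 0, \quad -c \le \alpha_i \le c,$$

where the box constraint on α_i follows from 0 ≤ γ_i, γ̂_i ≤ c; this is Eq. (5) up to the order of terms.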

References

Chelgani, S.C., Hower, J.C., Jorjani, E., Mesroghli, Sh., Bagherieh, A.H., 2008. Prediction of coal grindability based on petrography, proximate and ultimate analysis using multiple regression and artificial neural network models. Fuel Processing Technology 89, 13–20.

Chow, D.K.T., Lee, T., 2001. Image approximation and smoothening by support vector regression. Proceedings IJCNN'01: International Joint Conference on Neural Networks, vol. 4, pp. 2427–2432.

Cristianini, N., Shawe-Taylor, J., 2003. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Desai, K., Yogesh, B., Sanjeev, S.T., Kulkarni, B.D., 2006. Soft-sensor development for fed-batch bioreactors using support vector regression. Biochemical Engineering Journal 27, 225–239.

Hansen, T., Wang, C.J., 2005. Support vector based battery state of charge estimator. Journal of Power Sources 141 (2), 351–358.

Hower, J.C., 2006. Discussion: prediction of grindability with multivariable regression and neural network in Chinese coal. Fuel 85, 1307–1308.

Jemwa, G.T., Aldrich, C., 2003. Non-linear system identification of an autocatalytic reactor using least square support vector machines. The Journal of the South African Institute of Mining and Metallurgy, 119–125.

Nilsson, R., Björkegren, J., Tegnér, J., 2006. A flexible implementation for support vector machines. The Mathematica Journal 10 (1), 114–127.

Özbayoğlu, G., Özbayoğlu, A.M., Özbayoğlu, M.E., 2008. Estimation of Hardgrove grindability index of Turkish coals by neural networks. International Journal of Mineral Processing 85, 93–100.

Paláncz, B., Völgyesi, L., 2004. Support vector regression via Mathematica. Periodica Polytechnica Civil Engineering 48 (1–2), 15–37.

Peisheng, L., Youhui, X., Dunxi, Y., Xuexin, S., 2005. Prediction of grindability with multivariable regression and neural network in Chinese coal. Fuel 84, 2384–2388.

Sengupta, A.N., 2002. An assessment of grindability index of coal. Fuel Processing Technology 76, 1–10.

Ural, S., Akyildiz, M., 2004. Studies of relationship between mineral matter and grinding properties for low-rank coal. International Journal of Coal Geology 60, 81–84.

Vapnik, V.N., 2000. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.