
Contents lists available at ScienceDirect

Geoderma

journal homepage: www.elsevier.com/locate/geoderma

A systematic study on the application of scatter-corrective and spectral-derivative preprocessing for multivariate prediction of soil organic carbon by Vis-NIR spectra

Andre Carnieletto Dotto a,⁎, Ricardo Simao Diniz Dalmolin a, Alexandre ten Caten b, Sabine Grunwald c

a Department of Soil, Federal University of Santa Maria, Room 3314, CCR 42, Av. Roraima 1000, CEP 97105-900, Santa Maria, RS, Brazil
b Department of Biodiversity, Agriculture and Forestry, University Federal of Santa Catarina, Rod. Ulysses Gaboardi, km 3, Caixa Postal 101, CEP 89520-000, Curitibanos, SC, Brazil
c Soil and Water Sciences Department, University of Florida, 2181 McCarty Hall, PO Box 110290, Gainesville, FL 32611, USA

ARTICLE INFO

Keywords: Visible-near infrared spectroscopy; Modeling; Prediction; Soil property; Pedometrics

ABSTRACT

Soil organic carbon (SOC) plays a crucial role as an ecosystem indicator. Its quantification requires an affordable and less time-consuming method. Visible and near infrared (Vis-NIR) reflectance spectroscopy has demonstrated its applicability to predict SOC over the years. The aims of this study were to i) compare the influence of preprocessing techniques on prediction performance, ii) assess the modeling performance of a wide range of multivariate methods, and iii) evaluate the potential of Vis-NIR spectroscopy to predict SOC. Soil sampling was conducted over an area of approximately 1800 km² in the southern region of Brazil, where a total of 595 soil samples were collected. Oxisols are predominant in the area, followed by Entisols and Inceptisols.

The seven preprocessing techniques that were employed can be divided into two categories based on their SOC prediction performance: scatter-correction and spectral-derivatives. A total of nine different methods were evaluated to predict SOC from Vis-NIR spectra. The models that used scatter-corrective preprocessing exhibited superior prediction compared to the spectral-derivatives group. In the scatter-correction group, continuum removal was the most suitable preprocessing method for SOC prediction. Except for random forest (RF), all the multivariate methods presented robust predictions. The best fit and highest model accuracy for SOC models in validation mode were achieved when applying the weighted average partial least-squares (WAPLS) method and normalization by range (NBR) preprocessing (R² = 0.82, root mean square error = 0.48%, and ratio of the performance to the interquartile range = 3.18). Findings from this systematic methodology study identified the reliability of SOC determinations by examining how preprocessing techniques and multivariate methods affect spectral analyses. The study also guides future work in selecting the most appropriate methods for similar soils.

1. Introduction

Soil organic carbon (SOC) plays a fundamental role as an ecosystem indicator and is a key component of soil quality (Andrews et al., 2004) and soil security (McBratney et al., 2014). This soil property is one of the most important constituents of the soil because of its capacity to affect plant growth as a source of energy and nutrients. Soil management and use must be adapted to the amount of SOC. Digital soil mapping (DSM) has given considerable attention to SOC because of the importance of this soil fraction (Grimm et al., 2008; Grunwald, 2009). DSM requires high accuracy to evaluate the quality of the covariates and of the observations. However, accurate observations are not easy to collect in a complex environment. The quantification of SOC demands an alternative technique that can accommodate a large volume of analyses and is non-intrusive, affordable, and less time consuming (Minasny and McBratney, 2008; Viscarra Rossel et al., 2006). Visible and near infrared (Vis-NIR) reflectance spectroscopy has been frequently applied in soil analysis and has demonstrated its applicability to accurately predict SOC and a variety of other soil properties in recent years (Ben-Dor et al., 2015; Nocita et al., 2015; Viscarra Rossel et al., 2016).

https://doi.org/10.1016/j.geoderma.2017.11.006
Received 31 May 2017; Received in revised form 22 September 2017; Accepted 2 November 2017
Available online 07 December 2017
0016-7061/ © 2017 Elsevier B.V. All rights reserved.
Geoderma 314 (2018) 262–274
⁎ Corresponding author. E-mail address: [email protected] (A.C. Dotto).

Several spectral preprocessing techniques have been introduced to improve the efficiency of SOC prediction with Vis-NIR spectral data. Spectral preprocessing techniques have been used to transform soil spectra, remove noise, emphasize features, and extract useful information for quantitative predictive models. The preprocessing of spectra includes smoothing, normalization, scatter-correction, continuum removal, and derivatives. These preprocessing techniques can be divided into two groups: scatter-corrections and spectral-derivatives (Rinnan et al., 2009). Scatter-corrections are represented by continuum removal, normalization by range, standard normal variates, and multiplicative scatter-correction. Spectral-derivative preprocessing includes Savitzky-Golay and Norris-Williams derivatives. The performances of both preprocessing groups in terms of soil-property prediction can vary according to the study. For instance, Ben-Dor et al. (1997) applied first and second derivatives to investigate the reflectance spectra of organic matter regarding the possible changes that occurred during biological decomposition. These authors assumed that the use of spectral derivation enhanced weak spectral features and extracted hidden information. Vasques et al. (2008) compared thirty preprocessing methods, including Savitzky-Golay and Norris-Williams derivatives, Kubelka-Munk transformation, reflectance-to-absorbance transformation, baseline offset, standardizations, and normalizations in Florida, USA. Overall, these authors found that the results from Savitzky-Golay derivatives consistently improved SOC predictions. A similar outcome was achieved by Peng et al. (2014), who explored the effects of eight spectral preprocessing techniques in 298 heterogeneous soil samples from different provinces in China. Their results indicated that the selection and distribution of the model variables were affected by different preprocessing techniques and that the Savitzky-Golay derivative obtained better results in terms of model development. Muñoz and Kravchenko (2011) included Savitzky-Golay derivatives, standard normal variates and mean centering preprocessing to predict SOC with three sources of auxiliary information under low carbon contents from Alfisols in southeastern Michigan. These authors observed no improvements in calibration accuracy when using preprocessing transformations. Nawar et al. (2016) compared the performances of three regression methods by subjecting spectra to seven preprocessing techniques to assess the organic matter and clay content in salt-affected soils from northern Sinai, Egypt, and the best predictions were obtained with continuum-removed preprocessing. These diverse findings from different studies indicate that there is no single preprocessing technique that is the best across diverse geographic soilscapes. This motivated the current investigation in Brazilian soils to test various preprocessing techniques.

The same issue applies to the identification of appropriate multivariate modeling approaches that use spectral data to estimate soil properties. In fact, multivariate modeling is co-dependent on the selection of a suitable preprocessing technique. Several multivariate methods for SOC prediction have been successfully utilized to develop faster and higher-quality models. Partial least-squares regression (PLSR) (Wold et al., 1984) is the most common multivariate calibration method. PLSR has been applied to SOC prediction in many studies (Conforti et al., 2015; Knox et al., 2015; Kuang et al., 2015; Viscarra Rossel and Behrens, 2010). Moreover, other methods such as principal components regression (PCR) (Kendall, 1957) and multiple linear regression (MLR) have shown significant results in predicting SOC. Nonlinear data mining methods such as support vector machines (SVM) (Cortes and Vapnik, 1995) and the random forest (RF) ensemble learning method (Breiman, 2001) have recently gained ground among the multivariate methods used to predict SOC. In addition to these methods, a new set of machine-learning algorithms has been introduced into pedometric approaches. Bayesian model averaging (BMA) (Raftery, 1995) is a probabilistic model that represents a set of random variables and their conditional independencies and was applied in the study by Leon and Gonzalez (2009) for SOC prediction. Ramirez-Lopez et al. (2013) and Gholizadeh et al. (2016) applied the weighted average partial least-squares (WAPLS) (Ter Braak and Juggins, 1993) method as a memory-based learning multivariate method to predict SOC. WAPLS mimics the human cognitive process, remembering and memorizing previous situations and adapting them to solve a problem by examining the probability. Another machine-learning approach is Gaussian process regression (GPR) (Williams and Barber, 1998), which maps the input data into a high-dimensional feature space that is defined by a kernel function. Artificial neural networks (ANN) (McCulloch and Pitts, 1943) are learning algorithms that are inspired by the structure and functional aspects of biological neural networks. ANNs have been prominently used for soil-property predictions and specifically SOC prediction, for instance in Kuang et al. (2015) and Were et al. (2015). These machine-learning approaches have been underutilized, so more efforts are needed to reveal the potential of these methods in soil applications.

Comparing the performances of preprocessing techniques and multivariate methods can become complicated and disorganized because studies are conducted in dissimilar areas with distinct soil samples, soil types, spectral ranges, spectral data acquisition methods, and different measurement units. Few studies have simultaneously explored many forms of preprocessing and modeling methods in the same database. We assert that the performance of SOC predictions depends on combinations of linear modeling, nonparametric, data mining and learning algorithm approaches and tiered preprocessing methods.

The notion is that a side-by-side comparison of a wide variety of preprocessing techniques and multivariate statistics makes it possible to create a systematic methodology to optimize SOC predictions in the same dataset in a southern region of Brazil. The aims of this study were to i) compare the influence of preprocessing techniques on prediction performance, ii) assess the modeling performance of a wide range of multivariate methods, and iii) evaluate the potential of Vis-NIR spectroscopy to predict SOC.

2. Materials and methods

2.1. Study area

The study area is located in the region of the Plateau of Curitibanos and includes several municipalities in the central region of Santa Catarina, Brazil (Fig. 1). The highest and lowest altitudes are around 1300 and 800 m, respectively. Most of the soils are located in regions of high altitude, where the average annual temperature is 18–20 °C, average annual precipitation is high (1500–1700 mm) and well distributed throughout the year, and evapotranspiration rates are lower (700–800 mm), favoring the formation of acid soils and high values of organic matter (Pandolfo et al., 2002).

The collected soil samples represent the prominent soil types over an area of approximately 1800 km². The soils of this region originate from volcanic rocks, predominantly rhyodacite of the Serra Geral formation. The most weathered Oxisols present predominantly kaolinite and iron oxides (hematite to goethite), whereas the Entisols and Inceptisols present a predominance of kaolinite and 2:1 clay minerals with hydroxyl polymers between layers and goethite. The main soils of this region are the Oxisols, with very deep profiles and low natural fertility. Because they occur mainly in areas of smooth undulating relief, they have little or no rock and are widely used for mechanized agriculture (Dalmolin et al., 2017).

The study area contains similar soils because of the homogeneity of the parent material, which is predominantly basalt from a landscape that is dominated by a smooth-relief plateau. According to the Köppen climate classification, the study area has a humid subtropical climate (Cfa). The original vegetation is native grassland with Araucaria forests that has been gradually replaced by annual crops and planted forests (Pinus taeda). In addition, large and medium-sized rural properties grow beans, corn, soybeans and wheat (Dalmolin et al., 2017).

2.2. Data collection and soil analysis

A total of 595 soil samples were collected using the conditioned Latin hypercube sampling design, wherein 539 followed the depth specifications of 0–5, 5–15, 15–30, 30–60, 60–100, and 100–200 cm from Globalsoilmap.net (Arrouays et al., 2014) and 56 samples were derived from genetic horizons of 11 additional profiles. The soil samples were dried (at 45 °C for 72 h) and then ground and sieved (2-mm mesh). The total organic carbon content was determined by wet combustion with the Mebius method in a digestion block (Yeomans and Bremner, 1988). The soil organic matter was oxidized with a mixture of K2Cr2O7 (0.167 mol L−1) and concentrated H2SO4, and excess dichromate was titrated with ferrous ammonium sulfate. The dichromate reduced during this reaction with the soil corresponds to the organic carbon in the sample.

2.3. Training and validation sets

Seventy percent of the whole dataset was chosen by random sampling and used for the training set (n = 417). The remaining 30% was used for the validation set (n = 178). Independent soil samples were used in each set. The homogeneity of these two sets was assessed by Levene's test to confirm the reliability of splitting these sets. Levene's test was applied to verify that the variances of the training and validation groups could be assumed equal after the random selection. A violin graph shows the density and descriptive statistics of SOC for the training and validation sets (Fig. 2).
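The split and the variance check described above can be sketched as follows. This is a minimal illustration in Python (the study itself was carried out in R); the function names, the random seed, and the choice of median centering (the Brown-Forsythe variant of Levene's test) are assumptions, since the paper does not state them. Only the W statistic is computed; obtaining the p-value would require the F distribution with (k − 1, N − k) degrees of freedom.

```python
import math
import random
import statistics


def split_train_validation(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into disjoint training and validation sets."""
    rng = random.Random(seed)  # the seed is an arbitrary choice, not from the paper
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Rounding up reproduces the reported sizes: 595 samples -> 417 training.
    n_train = math.ceil(train_fraction * n_samples)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def levene_statistic(*groups, center=statistics.median):
    """Levene's W statistic for equality of variances across groups.

    center=statistics.median gives the Brown-Forsythe variant (an assumption);
    pass statistics.fmean for the original mean-centered test.
    """
    centers = [center(g) for g in groups]
    # Absolute deviation of every value from its own group's center.
    deviations = [[abs(x - c) for x in g] for g, c in zip(groups, centers)]
    group_means = [statistics.fmean(z) for z in deviations]
    n_total = sum(len(z) for z in deviations)
    grand_mean = sum(sum(z) for z in deviations) / n_total
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(deviations, group_means) for v in z)
    k = len(groups)
    return (n_total - k) / (k - 1) * between / within


train_idx, validation_idx = split_train_validation(595)
print(len(train_idx), len(validation_idx))
```

A W value close to zero indicates similar spread in the two sets, which is the situation the authors report for their training and validation groups.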

2.4. Spectral reflectance measurements

The spectral reflectance of the soil samples was obtained using a FieldSpec 3 Spectroradiometer (ASD Inc.) with a spectral range of 350–2500 nm. The soil samples were homogeneously distributed in petri dishes to measure the spectra. The spectral sensor, which captured light through a fiber-optic cable, was positioned 8 cm from the sample surface. The sensor scanned an area of approximately 2 cm², and the light source consisted of two external 50-W halogen lamps. These lamps were positioned at a distance of 35 cm from the sample (non-collimated rays and a zenithal angle of 30°) with an angle of 90° between them. A Spectralon® standard white plate was scanned every 20 min during calibration. Two replications (one involving a 180° turn of the petri dish) were obtained for each sample. Each spectrum was averaged from 100 readings over 10 s. The mean values of the two replicates were used for each sample.

2.5. Spectral preprocessing techniques

Spectral preprocessing techniques involve a variety of mathematical procedures for transforming reflectance measurements before using calibration models. Spectral preprocessing can remove physical variability from light scattering and enhance features of interest (Rinnan et al., 2009). Certain preprocessing techniques were selected following the best results from Cambule et al. (2012); Dotto et al. (2017); Knox et al. (2015); McDowell et al. (2012); Nawar et al. (2016); Peng et al. (2014); Stevens et al. (2013); Vasques et al. (2008), including smoothing, averaging, derivatives, normalizations, scatter-corrections, and absorbance transformations. These preprocessing techniques were applied to the soil reflectance curves in the range of 350–2500 nm. Seven forms of spectral preprocessing were used to develop models for SOC prediction. The first was used as a 'control treatment', where the raw reflectance values were only smoothed (SMO) across a moving window of 9 nm. SMO was considered here as a preprocessing method even though no transformation was applied to the spectral data. Then, the following preprocessing techniques were applied to the raw reflectance: the Savitzky-Golay first derivative with a first-order polynomial and a window size of 9 nm (SGD), normalization by range (NBR), standard normal variates (SNV), multiplicative scatter-correction (MSC), continuum removed reflectance (CRR), and the transformation to absorbance followed by a Savitzky-Golay first derivative with a first-order polynomial and a window size of 5 nm (ASG). The SMO, CRR, SGD, ASG, and SNV preprocessing steps were conducted with the prospectr package (Stevens and Ramirez-Lopez, 2013). MSC and NBR were conducted with the pls (Mevik et al., 2016) and clusterSim packages (Walesiak and Dudek, 2016), respectively. Principal component analysis (PCA) (stats package, R Core Team, 2016) was utilized to emphasize the variations and reveal strong patterns among the seven chosen spectral preprocessing techniques. PCA is a technique that is often used to explore and visualize correlated data and to reduce dimensionality. The Scott-Knott test (Scott and Knott, 1974) is a clustering algorithm used as an alternative to multiple comparison procedures. The Scott-Knott test was conducted to verify whether the averaged results of R² and RMSE present significant differences between spectral preprocessing techniques or methods. This test was performed using the ScottKnott R package (Jelihovschi et al., 2014).

Fig. 1. Soil sampling sites and municipalities in the central region of Santa Catarina, Brazil.
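As a rough illustration of three of the per-spectrum transformations named in Section 2.5 (SMO, SNV, and NBR), the sketch below uses plain Python rather than the R packages cited by the authors. The function names are hypothetical, and the exact formulas are assumptions: a simple moving average for smoothing, the sample standard deviation for SNV, and (x − min)/(max − min) for normalization by range. The packages used in the study may differ in details such as edge handling.

```python
import statistics


def moving_average(spectrum, window=9):
    """Smooth a spectrum with a centered moving average; edges are dropped."""
    half = window // 2
    return [statistics.fmean(spectrum[i - half:i + half + 1])
            for i in range(half, len(spectrum) - half)]


def standard_normal_variate(spectrum):
    """Center each spectrum on its own mean and scale by its own SD."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample SD (n - 1), an assumption
    return [(r - mean) / sd for r in spectrum]


def normalize_by_range(spectrum):
    """Rescale each spectrum to [0, 1] using its own minimum and maximum."""
    low, high = min(spectrum), max(spectrum)
    return [(r - low) / (high - low) for r in spectrum]


# Toy reflectance values, not data from the study.
toy_spectrum = [0.10, 0.12, 0.15, 0.19, 0.24, 0.22, 0.21]
print(normalize_by_range(toy_spectrum))
```

All three functions act on one spectrum at a time, so they can be applied row by row to a samples-by-wavelengths matrix.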

2.6. Multivariate methods

Nine multivariate methods were implemented to evaluate the predictive performance of the preprocessing methods. Each method (e.g., PLSR and WAPLS) has specific required parameters that control how the relationship between the input variables and outcomes is defined. These parameters were manually optimized to generate the best possible fit between the variables and outcomes. All the modeling was conducted with the R programming language (R Core Team, 2016). R is an open source software environment for statistical computing and graphics. The multivariate methods, the settings used to fit the models, and the corresponding R packages were as follows: PLSR and PCR (ncomp = 50, validation = "LOO") in the pls package (Mevik et al., 2016), MLR (band interval = 25 nm, stepAIC, direction = "both") in the stats package (R Core Team, 2016), SVM (type = "eps", kernel = "radial", cost = 50) in the e1071 package (Meyer, 2001), RF (ntree = 1000, mtry = 300, type = "regression") in the randomForest package (Liaw and Wiener, 2002), BMA (burn = 1,000,000, g = "hyper", mprior = "random") in the BMS package (Zeugner and Feldkircher, 2015), WAPLS (dissUsage = "predictors", k = 20, method = "wapls1", pls.c = c(1,11)) in the resemble package (Ramirez-Lopez and Stevens, 2016), GPR (type = "regression", kernel = "vanilladot", var = 100, cross = 10) in the kernlab package (Karatzoglou et al., 2004), and ANN (nhid = 40, actfun = "purelin") in the elmNN package (Gosso, 2012). The spectra produced by the seven preprocessing techniques were used as the independent variables for each model that was developed.

A Scopus database search was performed to select the articles that applied these multivariate methods to predict soil properties over the last ten years. The purpose was to illustrate the volume of publications for each method and the trend in the prediction of soil properties by Vis-NIR spectroscopy. In addition, the time consumed (in minutes) to generate each model was used to assess the efficiency of each SOC prediction method. The running time for each model was calculated in R with the system.time command, and the average for each method was then considered. A personal computer with a 3.60 GHz Intel Core i7 processor, 16 GB RAM, and the Windows 10 operating system was used to run these models.
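The timing procedure can be mimicked outside R in a few lines. The sketch below is a Python analogue of wrapping each model fit in a timer and averaging per method; `time_call` and `average_runtime_by_method` are hypothetical helper names, and the example durations are made up for illustration rather than taken from the study.

```python
import time
from statistics import fmean


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock time in minutes)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_minutes = (time.perf_counter() - start) / 60.0
    return result, elapsed_minutes


def average_runtime_by_method(runs):
    """Average the recorded minutes per method.

    runs is an iterable of (method_name, minutes) pairs, e.g. one pair per
    combination of method and preprocessing technique.
    """
    grouped = {}
    for method, minutes in runs:
        grouped.setdefault(method, []).append(minutes)
    return {method: fmean(values) for method, values in grouped.items()}


# Illustrative values only.
print(average_runtime_by_method([("PLSR", 1.0), ("PLSR", 3.0), ("RF", 10.0)]))
```

Wall-clock time depends on the hardware and on other processes running at the same time, so such averages are comparable only across runs on the same machine, as in the study.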

Three statistical measures were used to evaluate the fitted models: the coefficient of determination (R²) (Eq. (1)), root mean square error (RMSE) (Eq. (2)), and ratio of the performance to the interquartile range (RPIQ) (Eq. (3)). R² provides the percent of the variance that is explained by the model and is the most widely used and reported measure of the goodness of fit. The RMSE is commonly used to measure the difference between the predicted and observed values from the fitted model. The RMSE is an easily interpreted statistic because it has the same units as the data. The RPIQ is based on quartiles, which better represent the spread of the population. According to Bellon-Maurel et al. (2010), soil sample sets often show a skewed distribution rather than a normal distribution. Thus, the RPIQ index better accounts for the distribution of the dataset by using the interquartile distance.

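$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{2}$$

$$\mathrm{RPIQ} = \frac{Q3 - Q1}{\mathrm{RMSE}} \tag{3}$$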

where ŷ is the predicted value, ȳ is the mean of the observed values, y is the observed value, n is the number of samples with i equal to 1, 2, …, n, IQ is the difference between the third and first quartiles (Q3 − Q1), Q1 is the value below which 25% of the samples fall, and Q3 is the value below which 75% of the samples fall.
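The three measures in Eqs. (1) to (3) can be computed directly from paired observed and predicted values. The following Python sketch implements them as written above; the function names are hypothetical, and the quartile rule (linear interpolation, `method="inclusive"`) is an assumption because the paper does not say which quantile definition was used.

```python
import math
import statistics


def r_squared(observed, predicted):
    """Eq. (1): explained sum of squares over total sum of squares."""
    mean_obs = statistics.fmean(observed)
    explained = sum((p - mean_obs) ** 2 for p in predicted)
    total = sum((o - mean_obs) ** 2 for o in observed)
    return explained / total


def rmse(observed, predicted):
    """Eq. (2): root mean square error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def rpiq(observed, predicted):
    """Eq. (3): interquartile range of the observations divided by the RMSE."""
    # The quantile definition is an assumption, not taken from the paper.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


# Toy SOC values (%), not data from the study.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(r_squared(observed, predicted), rmse(observed, predicted), rpiq(observed, predicted))
```

Because the RPIQ divides the spread of the observations by the prediction error, larger values indicate better models, whereas for the RMSE smaller is better.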

3. Results and discussion

3.1. Descriptive and inferential statistics

Considering the density of the training and validation sets, more than 50% of the total SOC values fell between 1% and 3% (Fig. 2). The data presented widespread variation, with minimum and maximum SOC values of 0.02% and 6.87%, respectively. The distribution was skewed towards low values. Considering a global spectral library to characterize the world's soil, Viscarra Rossel et al. (2016) showed that the global SOC mean is around 2.16% for a total of 17,928 soil samples. This is only slightly above the values presented in this study. The model prediction was potentially influenced by this high variation in the data; the standard deviation confirmed this tendency. This large variation in the SOC content was expected based on the wide range of depths in this study, which ranged from 0–5 to 100–200 cm. The highest SOC values occurred in soils under forest at the upper depth of 0–5 cm. These soils constantly received organic material, which promoted the accumulation of carbon because the decomposition of organic matter is slow at the high altitude and low temperature of the area. The lowest SOC values were found at depths of 100–200 cm, where the storage of carbon in soils was reduced. Levene's test produced a p-value of 0.205 for the homogeneity of variances between the training and validation datasets. This p-value was higher than the significance level of α = 0.05, so the hypothesis of equal variance could not be rejected, and no significant difference was observed between the variances. The training and validation sets used independent samples. However, the similarity between sets revealed that the randomly split groups were statistically similar and that multivariate analysis is suitable.

Fig. 2. Density of the training and validation sets. The dark area indicates the inter-quartile range and the white dot indicates the median value of the dataset. The p-value of Levene's test = 0.205 (significance level α = 0.05).

3.2. Characteristics of the soil spectral reflectance curves

The spectral reflectance curve of each soil sample was characterized by the variability of its soil properties. The soil samples showed distinct reflectance curves with different shapes and absorption bands. This distinction was mainly caused by the organic matter and iron oxide contents in these soils. Fig. 3a shows predominantly very low reflectance. The majority of soil samples presented high iron oxide content, which is associated with low reflectance. Hunt et al. (1971) reported that iron oxide showed an absorption band near 550 nm. Stoner (1979) noted some absorption features of iron-rich soils around 700 and 900 nm. Rezende (1980) observed a characteristic concave-convex pattern in the absorbance curve around 440 to 460 nm that was related to goethite in some Oxisols. Stevens et al. (2013) found the presence of iron oxides as indicated by well-defined peaks around 540, 640 and 900 nm (goethite at 620 and 920 nm). Demattê et al. (2017) found that goethite is defined at 425 nm, 480 nm, and 600 nm and hematite at 750 nm and 1050 nm, and the absorption feature of crystalline Fe is at 850 nm. A large number of soil samples exhibited low overall reflectance (Fig. 3a). These soils belong to a particular type of spectral curve that is designated as iron-dominated, with high iron content and fine texture (Stoner and Baumgardner, 1981). This trend was found in this study, where the reflectance decreased at wavelengths beyond 750 nm, making the water absorption features centered at 1400 and 1900 nm almost undetectable.

Fig. 3. Preprocessing of the spectral curves for all soil samples. a) SMO: smoothed across a moving window of 9 nm, b) CRR: continuum-removed reflectance, c) NBR: normalization by range, d) SNV: standard normal variate, e) MSC: multiplicative scatter-correction, f) SGD: Savitzky-Golay first derivative with a first-order polynomial and a window size of 9 nm, and g) ASG: transformation to absorbance and then the application of a Savitzky-Golay first derivative with a first-order polynomial and a window size of 5 nm.

Demattê et al. (2017) highlighted that the bands at 1400, 1900 and 2200 nm are due to OH− groups and water molecule vibrations, and that the absorption at 2200 nm indicates the presence of kaolinite and other phyllosilicates. The dominant absorption bands of water around 1400–1900 nm are characteristic of soil spectra, and the wavelengths around 2200 nm are characteristic of clay minerals (Clark, 1999). According to Ben-Dor (2011), there are three major spectral regions active for clay minerals in general and for smectite minerals in particular, around 1300–1400, 1800–1900, and 2200–2500 nm. For kaolinite, a 1:1 mineral, the signals at around 1400 and 2200 nm are relatively strong, whereas the signal at 1900 nm is very weak.

3.3. Influence of the preprocessing techniques on the performances of the SOC models

3.3.1. Two groups of preprocessing techniques

Preprocessing techniques were employed to enhance the spectral features and fit the best relationship with a soil property of interest. In the PCA results, each color represents one of the seven spectral preprocessing techniques in a multidimensional space projected onto the first and second principal components (PC1 and PC2, respectively) (Fig. 4). PCA captured the variation in these preprocessing methods. SGD and ASG were grouped together, while CRR, NBR and SNV were far from the symmetric center. PC1 explained 82.6% of the total variance and certain preprocessing techniques were grouped together, suggesting that SGD, ASG, MSC, SMO, and SNV were correlated. However, SGD and ASG had almost the same position in PC1, different from the other preprocessing methods. NBR and CRR were grouped together and SNV was separate in PC2. These findings support the existence of two different preprocessing groups, which can affect the performance of SOC modeling.

The preprocessing techniques were divided into two categories: scatter-correction and spectral-derivatives. The first group includes CRR, MSC, SNV, and NBR. The spectral-derivative group is represented by SGD and ASG. The performances of the models that were obtained from the scatter-corrective preprocessing techniques were superior to those from the spectral-derivatives group. Scatter-correction preprocessing techniques are designed to reduce physical variability (undesirable scatter effect) and compare individual features of each element from a common baseline (Rinnan et al., 2009). This group represents powerful preprocessing techniques that can isolate and remove complicated effects from physical phenomena, so that soil chemical effects can be more easily modeled.

Spectral-derivative preprocessing can remove both additive and multiplicative effects in spectra. Two different preprocessing methods, namely, SGD and ASG, were used to reduce noise in the spectra by using the Savitzky-Golay derivative. The derivatives were estimated with a moving window, where only a local portion of each spectrum was used at one time to compute the derivative. This local computation is one distinction from scatter-corrective preprocessing methods, which can be performed on the entire spectrum.
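To make the moving-window computation concrete, the following minimal Python sketch estimates a first derivative from a straight-line least-squares fit within each window, which is what a Savitzky-Golay first derivative with a first-order polynomial reduces to. It is illustrative only and is not the implementation used in this study. The function names, the synthetic spectrum, the assumption of evenly spaced 1-nm sampling (so that the 9-nm and 5-nm windows of Fig. 3 correspond to 9 and 5 points), and the use of log10(1/R) as absorbance are all assumptions.

```python
import math


def savgol_first_derivative(values, window=9, step=1.0):
    """First derivative from a first-order least-squares fit in a centered
    moving window. Edge points without a full window are dropped."""
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be an odd integer >= 3")
    half = window // 2
    offsets = range(-half, half + 1)
    # Least-squares slope of a line through equally spaced points:
    # sum(k * y_k) / (sum(k^2) * step)
    norm = sum(k * k for k in offsets) * step
    return [
        sum(k * values[i + k] for k in offsets) / norm
        for i in range(half, len(values) - half)
    ]


def to_absorbance(reflectance):
    """Apparent absorbance log10(1/R); assumes reflectance values in (0, 1]."""
    return [math.log10(1.0 / r) for r in reflectance]


# Synthetic (hypothetical) reflectance rising by 0.002 per nm
reflectance = [0.10 + 0.002 * i for i in range(20)]

# SGD-like: derivative of reflectance with a 9-point window
sgd = savgol_first_derivative(reflectance, window=9)

# ASG-like: convert to absorbance first, then differentiate with a 5-point window
asg = savgol_first_derivative(to_absorbance(reflectance), window=5)
```

Because each output value depends only on the points inside its window, the derivative at one wavelength is unaffected by distant parts of the spectrum, in contrast to the whole-spectrum statistics used by the scatter-corrective methods.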

The scatter-correction preprocessing group yielded significant improvement in the Vis-NIR spectral models. The performance of the scatter-correction preprocessing techniques fluctuated among the models. The fitted R² values varied from 0.54 to 0.82, while the RMSE and RPIQ varied from 0.77% to 0.48% and from 1.99 to 3.18, respectively (Table 1). CRR achieved the lowest RMSE values for three multivariate methods (PLSR, PCR, and RF). SNV exhibited the highest performance for two methods (MLR and GPR), as did NBR (WAPLS and ANN). MSC was ranked the best preprocessing approach for only one method (BMA).
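For readers who wish to reproduce the three performance measures, a minimal Python sketch is given here. It is an illustration under stated assumptions, not the code used in this study: R² is computed as 1 − SSE/SST (the study may instead have used the squared correlation), the interquartile range uses the default quantile method of the Python `statistics` module, and the observed and predicted values are hypothetical.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SSE/SST (an assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def rpiq(observed, predicted):
    """Ratio of performance to interquartile range: IQR(observed) / RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return (q3 - q1) / rmse(observed, predicted)


# Hypothetical SOC values (%), each prediction offset by 0.5
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [o + 0.5 for o in observed]

print(rmse(observed, predicted), r_squared(observed, predicted), rpiq(observed, predicted))
```

Since RPIQ is the interquartile range divided by the RMSE, the product RMSE × RPIQ should be roughly constant for a fixed validation set; the rows of Table 1 give about 1.5% throughout, which is consistent with this definition.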

Among the models that used scatter-correction preprocessing to predict SOC, the best performance was found for CRR. The CRR technique, which was proposed by Clark and Roush (1984), removes the continuous features of spectra and is often used to isolate specific absorption features. The continuum is represented by a mathematical function that is used to separate and highlight specific absorption bands of a reflectance spectrum (Mutanga et al., 2005). Creating a continuum or hull is similar to fitting a rubber band over the original spectrum. The spectrum is normalized by setting the value of the hull to 100% reflectance, so that the first and last values of the continuum-removed spectrum equal 1. The strength of CRR is its ability to enhance absorption depths by correcting apparent shifts from wavelength-dependent scattering.
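The rubber-band analogy can be sketched in a few lines of Python. The example below is a minimal illustration, assuming that the continuum is the upper convex hull of the spectrum with straight-line segments between hull points and that wavelengths are strictly increasing with positive reflectance; the function names and the five-band spectrum are hypothetical, and the study's own implementation may differ.

```python
def upper_hull(points):
    """Upper convex hull of (x, y) points given in order of increasing x."""
    hull = []
    for px, py in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Cross product >= 0 means the last hull point lies on or below
            # the straight line from hull[-2] to the new point: discard it.
            cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def continuum_removed(wavelengths, reflectance):
    """Divide each reflectance value by the hull (continuum) value at that band."""
    hull = upper_hull(list(zip(wavelengths, reflectance)))
    result = []
    seg = 0
    for x, y in zip(wavelengths, reflectance):
        # Advance to the hull segment that contains wavelength x
        while hull[seg + 1][0] < x:
            seg += 1
        (x1, y1), (x2, y2) = hull[seg], hull[seg + 1]
        continuum = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
        result.append(y / continuum)
    return result


# Hypothetical five-band spectrum with an absorption feature at 600 nm
wavelengths = [400, 500, 600, 700, 800]
reflectance = [0.20, 0.30, 0.25, 0.35, 0.30]
crr = continuum_removed(wavelengths, reflectance)
```

In this toy spectrum the hull passes through the bands at 400, 500, 700 and 800 nm, so those bands map to 1 and only the 600-nm band falls below 1, which is how the absorption depth is isolated.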

Another scatter-correction technique is SNV. It achieved the best predictions for two methods, MLR and GPR. This preprocessing method was proposed to remove the multiplicative interference of particle size through a simple rotation and offset correction of spectra (Barnes et al., 1989). As observed in Fig. 3d and e, the similarity between SNV and MSC is obvious. However, they present different reflectance scales. The signal-correction concepts behind SNV are the same as for MSC, except that a common reference signal is not required, which is reflected in the reflectance values. SNV is designed to center the underlying linear slope of each individual sample spectrum (Barnes et al., 1989). Moreover, SNV can be sensitive to noise in the spectrum. Because the mean and standard deviation of each spectrum serve as the correction parameters, each observation is processed on its own, isolated from the remainder of the dataset.
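A minimal Python sketch of the SNV transformation follows. It assumes the usual definition (subtract the spectrum's own mean, divide by its own sample standard deviation); the function name and the example values are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum using only
    its own mean and sample standard deviation."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(value - mean) / sd for value in spectrum]


# A hypothetical spectrum and a copy distorted by an offset and a scale factor
base = [0.10, 0.14, 0.20, 0.18, 0.25]
distorted = [1.5 * value + 0.05 for value in base]

snv_base = snv(base)
snv_distorted = snv(distorted)
```

The two transformed spectra coincide, which illustrates why SNV needs no common reference signal: additive and multiplicative differences between samples are removed spectrum by spectrum.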

Fig. 4. Principal component analysis of seven preprocessing techniques. SMO: smoothed, CRR: continuum-removed reflectance, NBR: normalization by range, SNV: standard normal variate, MSC: multiplicative scatter-correction, SGD: Savitzky-Golay derivative, and ASG: absorbance and then the application of a Savitzky-Golay derivative.

NBR produced the best model result for the WAPLS and ANN methods, both of which are machine-learning algorithms. In NBR, normalization means adjusting values that are measured on different scales to a common scale. The simple normalization of each sample is a common approach to the multiplicative scaling problem. NBR involves the creation of shifted and scaled versions of spectral data, where these normalized values eliminate scattering effects (Rinnan et al., 2009). If the relationship between the variables is the most important aspect of spectral data, then normalization is recommended.
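As a hedged illustration of such shifting and scaling, the sketch below rescales each spectrum to the interval from 0 to 1 using its own minimum and maximum. Treating NBR as per-spectrum min-max scaling is an assumption made for this example; the exact variant used in this study is not specified in this section, and the function name is hypothetical.

```python
def normalize_by_range(spectrum):
    """Shift by the spectrum's minimum and scale by its range (max - min)."""
    lowest, highest = min(spectrum), max(spectrum)
    if highest == lowest:
        raise ValueError("a flat spectrum has zero range and cannot be normalized")
    return [(value - lowest) / (highest - lowest) for value in spectrum]


# Hypothetical reflectance values for one sample
nbr = normalize_by_range([0.10, 0.14, 0.20, 0.18, 0.25])
```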

The final scatter-correction technique is MSC, which achieved the best prediction result only for the BMA method. Nonetheless, all four scatter-correction techniques performed similarly under BMA, with a slightly higher result for MSC. The purpose of MSC is to eliminate scatter errors in order to linearize spectral data and decrease noise variance (Geladi et al., 1985). In MSC, each spectrum is corrected so that all the spectral samples appear to have the same scatter level. MSC and SNV spectral preprocessing are closely related, and differences in prediction ability between these methods seem to be quite small.
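The correction to a common scatter level can be sketched as follows. This minimal Python example assumes the common formulation in which each spectrum is regressed on a reference spectrum (here the mean spectrum of the set) and then corrected with the fitted intercept and slope; the function name and the data are hypothetical.

```python
def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum s is fitted as s = intercept + slope * reference by
    ordinary least squares and returned as (s - intercept) / slope.
    """
    n_bands = len(spectra[0])
    reference = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_bands)]
    ref_mean = sum(reference) / n_bands
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_bands
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


# Three hypothetical spectra that differ only by an offset and a scale factor
base = [0.10, 0.14, 0.20, 0.18, 0.25]
spectra = [
    base,
    [1.2 * v + 0.05 for v in base],
    [0.8 * v - 0.02 for v in base],
]
msc_spectra = msc(spectra)
```

Because the three input spectra differ only in offset and scale, all corrected spectra collapse onto the reference. Unlike SNV, the result depends on the other samples in the set through the reference spectrum.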

The spectral-derivative preprocessing achieved the greatest performance only for SGD with the SVM method. ASG and SMO never produced the best model performance with any method. Additionally, SMO produced the lowest model performance for three methods (SVM, RF, and GPR). Interesting findings occurred for the results of SVM and RF modeling. In both methods, the two spectral-derivative preprocessing techniques (SGD and ASG) ranked among the three best performances (Table 1). The results from these two spectral-derivative preprocessing techniques with SVM and RF match those from Vasques et al. (2008). These authors investigated several multivariate methods, including two supervised machine-learning algorithms (committee trees and regression trees), to assess soil carbon in Florida, USA. Among the thirty spectral preprocessing techniques that were tested, the spectral-derivative techniques exhibited the highest predictive performance for both SVM and RF. Both methods are classified as supervised-learning algorithms, with SVM being a machine-learning algorithm and RF being an ensemble-learning algorithm. In addition, these two algorithms demonstrated efficient modeling with large datasets. The model accuracy was maintained when missing data or outliers were present, and these models did not predict beyond the range of the response values for the training data during regression. These models underestimated the high values and overestimated the low values, which are theoretically and practically difficult to analyze (Breiman, 2001; Ivanciuc, 2007; Mountrakis et al., 2011; Viscarra Rossel and Behrens, 2010).

SMO frequently generated low-accuracy performance regardless of the employed method. SMO always exhibited performance within the bottom three (Table 1). Nawar et al. (2016) obtained similar results when no preprocessing was used for organic matter prediction. Earlier studies showed that calibration models in which spectra were not preprocessed are more sensitive to changes compared to models for which preprocessing was applied (Moros et al., 2009).

3.3.2. Performance of the best preprocessing technique

CRR was considered the most robust spectral preprocessing method based on its predictive performance for SOC. CRR presented the best performance for the PLSR, PCR, and RF methods (Table 1). Considering all the prediction methods, this preprocessing method always appeared among the top four results. This result demonstrates that CRR is suitable for SOC prediction with Vis-NIR spectral data. CRR has also been successfully used in other studies, for instance, to estimate the soil color (Viscarra Rossel et al., 2009), clay content (Dotto et al., 2017; Lagacherie et al., 2008; Nawar et al., 2016; Viscarra Rossel et al., 2009), organic matter (Nawar et al., 2016; Xie et al., 2012), soil organic carbon (Dotto et al., 2017; Nocita et al., 2014), soil heavy metals (Gholizadeh et al., 2015; Vašát et al., 2014; Xie et al., 2012), soil macro- and micronutrients (Vašát et al., 2014), and soil nitrogen (Zhang et al., 2016). CRR preprocessing was also applied to characterize the world's soil in the global spectral library (Viscarra Rossel et al., 2016), estimate tropical pasture quality (Mutanga et al., 2005), and even examine other planets, such as elemental concentration estimation on Mars (with a spectrometer on the robotic rover Curiosity) (Wang et al., 2014). Nocita et al. (2014) applied CRR to predict the SOC content by diffuse reflectance spectroscopy from soil samples throughout the European Union. These authors concluded that the SOC predictions of mineral soils were more accurate when the sand content was added to the soil spectra as a co-variable. Nawar et al. (2016) found similar results for CRR when applied to organic-matter prediction.

Table 1
Performance of SOC predictive models from nine multivariate methods with the corresponding spectral preprocessing techniques.

Validation set (a)

Method (b)   Preprocessing (c)   R²     RMSE (%) (d)   RPIQ
PLSR         CRR                 0.81   0.49           3.12
PLSR         NBR                 0.80   0.52           2.94
PLSR         SNV                 0.79   0.52           2.94
PLSR         MSC                 0.78   0.54           2.84
PLSR         ASG                 0.71   0.62           2.49
PLSR         SMO                 0.70   0.63           2.42
PLSR         SGD                 0.67   0.67           2.30

PCR          CRR                 0.80   0.51           3.00
PCR          NBR                 0.79   0.52           2.95
PCR          SNV                 0.79   0.52           2.92
PCR          MSC                 0.78   0.54           2.86
PCR          SMO                 0.70   0.62           2.47
PCR          ASG                 0.68   0.64           2.39
PCR          SGD                 0.66   0.66           2.31

MLR          SNV                 0.79   0.52           2.93
MLR          CRR                 0.78   0.53           2.88
MLR          MSC                 0.78   0.54           2.84
MLR          NBR                 0.77   0.56           2.75
MLR          SMO                 0.73   0.60           2.56
MLR          ASG                 0.71   0.61           2.50
MLR          SGD                 0.69   0.64           2.41

SVM          SGD                 0.80   0.52           2.94
SVM          ASG                 0.80   0.53           2.90
SVM          CRR                 0.78   0.53           2.87
SVM          NBR                 0.77   0.54           2.82
SVM          MSC                 0.76   0.56           2.73
SVM          SNV                 0.75   0.56           2.72
SVM          SMO                 0.74   0.59           2.59

RF           CRR                 0.77   0.55           2.77
RF           SGD                 0.74   0.60           2.58
RF           ASG                 0.72   0.61           2.51
RF           SNV                 0.67   0.66           2.31
RF           MSC                 0.65   0.67           2.27
RF           NBR                 0.54   0.77           1.99
RF           SMO                 0.47   0.84           1.83

BMA          MSC                 0.80   0.51           3.03
BMA          SNV                 0.79   0.52           2.97
BMA          CRR                 0.79   0.52           2.96
BMA          NBR                 0.78   0.54           2.85
BMA          SMO                 0.72   0.61           2.52
BMA          ASG                 0.71   0.61           2.51
BMA          SGD                 0.68   0.65           2.36

WAPLS        NBR                 0.82   0.48           3.18
WAPLS        CRR                 0.81   0.49           3.10
WAPLS        SNV                 0.80   0.51           2.99
WAPLS        MSC                 0.80   0.51           2.98
WAPLS        SMO                 0.79   0.52           2.96
WAPLS        ASG                 0.71   0.62           2.47
WAPLS        SGD                 0.48   0.74           2.10

GPR          SNV                 0.79   0.52           2.96
GPR          MSC                 0.79   0.52           2.94
GPR          CRR                 0.79   0.53           2.90
GPR          NBR                 0.78   0.53           2.89
GPR          ASG                 0.69   0.66           2.34
GPR          SGD                 0.65   0.69           2.21
GPR          SMO                 0.65   0.69           2.21

ANN          NBR                 0.80   0.51           3.01
ANN          SNV                 0.79   0.52           2.92
ANN          MSC                 0.75   0.56           2.73
ANN          CRR                 0.73   0.59           2.61
ANN          ASG                 0.70   0.63           2.44
ANN          SMO                 0.66   0.66           2.32
ANN          SGD                 0.64   0.69           2.22

(a) R²: coefficient of determination, RMSE: root mean square error, and RPIQ: ratio of the performance to the interquartile range.
(b) PLSR: partial least-squares regression, PCR: principal components regression, MLR: multiple linear regression, SVM: support vector machine, RF: random forest, BMA: Bayesian model averaging, WAPLS: weighted average partial least-square, GPR: Gaussian process regression, and ANN: artificial neural network.
(c) SMO: smoothed, CRR: continuum-removed reflectance, NBR: normalization by range, SNV: standard normal variate, MSC: multiplicative scatter-correction, SGD: Savitzky-Golay derivative, and ASG: absorbance and then the application of a Savitzky-Golay derivative.
(d) The preprocessing column is ordered by decreasing predictive performance in each multivariate method.

The improved performance of CRR can be attributed to effective noise removal and the reduction of the physical variability between samples, providing a more consistent definition of band depth (Clark and Roush, 1984). Continuum removal can also be used to analyze the absorption features and correct the band minimum to the true band center (Clark and Roush, 1984). This technique can be used to normalize absorption features and emphasize the reflectance features of spectrum curves. CRR proved to be an effective preprocessing technique for SOC content prediction regardless of the multivariate method applied.

3.4. Influence of the multivariate methods on the SOC prediction performance

A search was conducted in the scientific citation database Scopus to compare the number of publications per multivariate method over the last ten years that applied spectroscopy to predict soil properties (Fig. 5). The high frequency of publications with the PLSR method confirms its widespread application in predicting soil properties; it appeared in around 65% of all published papers over the last ten years. The PCR method has also appeared in a significant number of publications, especially between 2006 and 2011, and was the second-most-common method over these years. The remaining methods, particularly data-mining algorithms, were utilized in a number of publications. Over the last five years, the usage of these methods has grown and attracted the attention of the pedometric community. A positive aspect that drew attention was the increase in total publications regarding soil-property prediction by spectral data, confirming the growth of pedometric prediction in this recent period.

3.4.1. Partial least-squares regression performance

The prediction accuracy and model performance of PLSR are presented in Table 1. The results revealed why PLSR is the most commonly used method. In fact, the R², RMSE and RPIQ values of the PLSR models ranged from 0.67 to 0.81, 0.67% to 0.49%, and 2.30 to 3.12, respectively. These results are comparable to the prediction accuracy that has been established in the literature. Viscarra Rossel and Behrens (2010) applied the PLSR method, among others, to predict SOC based on Vis-NIR spectra by using a large spectral library with 1104 soil samples. In Viscarra Rossel and Behrens (2010), the PLSR model prediction showed an only slightly higher R² but a larger RMSE (R² = 0.82 and RMSE = 0.96%) compared to our study. Vasques et al. (2008) compared multivariate methods to inferentially model the soil total carbon, and the PLSR models achieved an average R²val value of 0.82 for 30 spectral preprocessing methods. This performance is considered slightly better than our PLSR result because the data for the 554 soil samples in their study were log-transformed before modeling. Araújo et al. (2014) improved the prediction performance of a large tropical Vis-NIR spectroscopic soil library from Brazil and achieved an R²val of 0.60 and RMSEv of 0.55% for organic matter in 7172 soil samples when applying PLSR. Knox et al. (2015) modeled soil carbon fractions with Vis-NIR spectroscopy in a set of 1014 soil samples from the state of Florida, USA. These authors applied 10 different spectral preprocessing techniques, resulting in an average R²val value of 0.80 and RMSEv of 0.48 log g kg−1 for PLSR modeling. Kuang et al. (2015) compared the calibration of Vis-NIR spectroscopy for online measurements of SOC and achieved a similar R² with PLSR in cross-validation but a higher RMSE (R²val of 0.81 and RMSEv of 1.99%).

The range of results that were obtained in this study for SOC measurement with PLSR was consistent with and comparable to the results reported above. PLSR presented suitable outcomes, providing a quantitative model that can handle complicated relationships between predictors and responses and handle complex modeling problems (Wold et al., 2001). PLSR is considered a popular regression method in chemometrics because the emphasis is on predicting responses and not necessarily understanding the underlying relationships between variables (Wold et al., 2001). Additionally, PLSR is a method for constructing predictive models when the factors are numerous and highly collinear (Wold et al., 1984), which is the case for hyperspectral data.
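How PLSR builds a predictive model from collinear predictors can be illustrated with a compact sketch of the single-response NIPALS algorithm (PLS1). This is a minimal teaching example under stated assumptions, not the implementation used in this study: the function names, the tiny data set, and the choice of three components are hypothetical, and no cross-validation is performed to select the number of components.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pls1_fit(X, y, n_components):
    """PLS1 by NIPALS on mean-centered data. Returns means and components."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                         # scores
        tt = dot(t, t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = dot(f, t) / tt                                     # y loading
        # Deflate X and y before extracting the next component
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict one sample by repeating the deflation steps on the new row."""
    x_mean, y_mean, components = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [v - t * pj for v, pj in zip(e, p)]
    return y_hat


# Hypothetical data: y is an exact linear function of three predictors
X = [[1, 2, 0], [2, 1, 1], [3, 5, 0], [4, 3, 2], [5, 6, 1]]
y = [2 * a - b + 0.5 * c + 1 for a, b, c in X]
model = pls1_fit(X, y, n_components=3)
prediction = pls1_predict(model, [2, 2, 2])
```

With as many components as the predictor matrix has independent columns, PLS1 reproduces the ordinary least-squares fit; in practice far fewer components than wavelengths are retained, which is what makes the method usable with highly collinear spectra.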

3.4.2. Principal component regression performance

As previously discussed, PCR is the second-most-common method in the predictions that apply Vis-NIR spectroscopy (Fig. 5). PCR produced results that were equivalent to PLSR, with R² varying from 0.66 to 0.80, RMSE from 0.66% to 0.51%, and RPIQ from 2.31 to 3.00 (Table 1). Chang et al. (2001) achieved a higher R² when applying PCR (R² = 0.87 and RMSE = 0.78%) to predict the total soil carbon, using 802 soil samples from different locations in the USA. Wang et al. (2015) used optical diffuse reflectance spectroscopy to predict organic matter with 155 soil samples from China. These authors adopted different spectral preprocessing techniques with two spectrometers, producing R²val results from 0.79 to 0.86 for organic matter prediction. The PCR method exhibited prominent results in the abovementioned literature because the PCR and PLSR techniques are similar in many ways. They are both methods to model response variables when a large number of predictor variables are present and these predictors are highly correlated (Wold et al., 1984). Moreover, both methods construct new predictor variables, which are linear combinations of the original predictor variables. Wentzell and Vega Montoto (2003) presented an extensive review of the literature to compare these two methods. In their survey, the prediction errors and number of latent variables differed between PCR and PLSR. Hemmateenejad et al. (2007) stated that the success of the PCR and PLSR methods was caused by their ability to overcome problems that are common to spectral data, such as collinearity, and their easy implementation because of the availability of software.

Fig. 5. Publications of multivariate methods from the last ten years that applied spectroscopy to predict soil properties. TP is the total number of publications per year. BMA, WAPLS and GPR were grouped because of the low volume of publications. PLSR: partial least-squares regression, PCR: principal components regression, MLR: multiple linear regression, SVM: support vector machine, RF: random forest, BMA: Bayesian model averaging, WAPLS: weighted average partial least-square, GPR: Gaussian process regression, and ANN: artificial neural network.

3.4.3. Multiple linear regression performance

MLR exhibited fair performance for SOC predictions, with R² ranging from 0.69 to 0.79, RMSE from 0.64% to 0.52%, and RPIQ from 2.41 to 2.93 (Table 1). The best performance was reached with SNV. MLR is considered the most common form of linear regression analysis; thus, various studies have applied this method to soil-property prediction. Bayer et al. (2012) compared regression methods to predict the SOC in a degraded South African ecosystem and achieved an R²val of 0.74 and RMSEv of 0.36% with the MLR model in 164 soil samples. Their R² was slightly lower than that of the best MLR model in our study. Viscarra Rossel and Behrens (2010) compared different data-mining algorithms to model soil Vis-NIR spectra with a dataset of 1104 soil samples from Australia. These authors achieved better results with MLR when predicting the SOC (R²val from 0.81 to 0.84). One factor that may have increased the model performance was the large number of soil samples. Vasques et al. (2008) achieved R²val values from 0.66 to 0.85 for MLR modeling.

These results indicate that MLR is still a beneficial method for SOC prediction when the choice is a statistical method that uses several explanatory variables to predict the outcome of a response variable in a simple linear model. MLR assumes that the relationships between independent variables and dependent variables are linear (Montgomery et al., 2012). Another important assumption is the absence of multicollinearity; that is, the independent variables are not highly correlated (Osbourne and Waters, 2002). This assumption is not met in spectral datasets, which show high multicollinearity. Further suppositions include homoscedasticity and normality. When these linear regression assumptions hold, a robust prediction can be achieved by using a relatively simple algorithm.
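The degree of multicollinearity between neighboring bands can be quantified with a short worked example. The sketch below computes the Pearson correlation between two hypothetical adjacent bands and the corresponding variance inflation factor, which for exactly two predictors equals 1/(1 − r²). The band values and function names are invented for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """Variance inflation factor when a model contains only predictors a and b."""
    r = pearson(a, b)
    return 1.0 / (1.0 - r * r)


# Hypothetical reflectance of five samples at two adjacent wavelengths
band_a = [0.21, 0.25, 0.30, 0.34, 0.40]
band_b = [0.22, 0.26, 0.30, 0.35, 0.41]

r = pearson(band_a, band_b)
vif = vif_two_predictors(band_a, band_b)
```

For these values the correlation is close to 1 and the inflation factor is in the hundreds, far above the values of about 5 to 10 that are often used as rules of thumb for problematic collinearity. This is why MLR on spectra is usually combined with variable selection or replaced by latent-variable methods.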

3.4.4. Support vector machine performance

The prediction results for the SVM method showed R², RMSE, and RPIQ values of 0.74 to 0.80, 0.59% to 0.52%, and 2.59 to 2.94, respectively (Table 1). SGD preprocessing achieved the highest prediction assessment for SVM. This method has been widely implemented to solve complex regression assignments (Ramirez-Lopez et al., 2013; Terra et al., 2015; Viscarra Rossel and Behrens, 2010). Viscarra Rossel and Behrens (2010) reported that SVM produced a result similar to that of PLSR, whereas Stevens et al. (2013) presented higher SOC predictions for SVM (R² from 0.67 to 0.86) when evaluating several data-mining calibration methods on a diverse sample set of soil types in the EU. Terra et al. (2015) compared spectral libraries (Vis-NIR spectroscopy) to quantitatively analyze tropical Brazilian soils, and the predictive results achieved for SOC when applying SVM (R²val = 0.65, RMSEv = 0.16 g kg−1, and RPIQv = 2.49) were inferior to the results obtained in this study. The low SOC prediction found in Terra et al. (2015) can be attributed to different soils, methodologies, and prediction algorithms. Ramirez-Lopez et al. (2013) compared a regional and a global soil spectral library to predict SOC with different modeling approaches. Models with SVM obtained prediction results of R² = 0.54 and 0.57 and RMSE = 0.27% and 0.93% for the regional and global libraries, respectively. Their results showed a slightly higher R² for the global soil spectral library. On the other hand, the prediction error was lower for the regional soil spectral library, which could be attributed to the small SOC variation in the regional spectral library. Araújo et al. (2014) compared the ability of multivariate models to determine organic matter from 7172 samples of seven different soil types from several areas of Brazil. These authors found that SVM (R² = 0.69 and RMSE = 0.48%) outperformed PLSR (R² = 0.60 and RMSE = 0.55%) in terms of organic matter prediction.

The literature results corroborate SVM as a very promising method to estimate the SOC content. The strong performance of SVM can be attributed to its nature as a group of supervised learning methods that extend generalized algorithms to nonlinear models and can train nonlinear classifiers (Ivanciuc, 2007). Also associated with the SVM algorithm is the smaller number of support vectors, which yielded better model performance (Loosli et al., 2007). The high performance of SVM models is related to their efficiency in modeling linear and/or nonlinear relationships and handling large databases.

3.4.5. Random forest performance

The overall predictive ability of RF models for SOC content was considered relatively low. The R², RMSE, and RPIQ values were 0.47 to 0.77, 0.84% to 0.55%, and 1.83 to 2.77, respectively (Table 1). The RF approach exhibited the lowest model prediction compared to the other methods. Viscarra Rossel and Behrens (2010) compared different algorithms for modeling soil Vis-NIR spectra and achieved the lowest results for SOC estimation with RF (R² = 0.71 and RMSE = 1.23%), with the best prediction found for ANN. Knox et al. (2015) evaluated the potential of Vis-NIR-MIR spectroscopy to predict soil carbon fractions in 1014 soil samples that were collected across the state of Florida, USA. RF validation produced R² and RMSE values from 0.63 to 0.88 and from 0.70 to 0.38 log g kg−1, respectively, when using different spectral preprocessing techniques only in the Vis-NIR range. Feng et al. (2014) emphasized the difficulty of interpreting model estimates from log-transformed data. These authors stated that estimating original observations by using the exponent or anti-log of sample log-transformed data can generate inaccurate estimates of the true population of original data. These authors suggested that abandoning the classic approach and switching to modern distribution-free methods would be better for many applications rather than attempting to find an appropriate statistical distribution or transformation to model the observed data.
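The back-transformation issue raised by Feng et al. (2014) can be seen in a small worked example. The sketch below uses hypothetical SOC values; it shows that taking the exponent of the mean of log-transformed data returns the geometric mean, which is smaller than the arithmetic mean of the original data whenever the values are not all equal.

```python
import math

# Hypothetical SOC values (%), skewed toward low values
soc = [0.5, 1.0, 2.0, 8.0]

arithmetic_mean = sum(soc) / len(soc)

# Mean on the log scale, then the anti-log
mean_of_logs = sum(math.log(value) for value in soc) / len(soc)
back_transformed = math.exp(mean_of_logs)  # this is the geometric mean

print(arithmetic_mean, back_transformed)
```

Here the arithmetic mean is 2.875%, whereas the back-transformed value is about 1.68%. Errors reported in log units, such as the log g kg−1 values cited above, are for the same reason not directly comparable to RMSE values in percent.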

According to Hastie et al. (2009), predictive learning is an important aspect of data-mining methods, which are invariant under transformations of the predictors. Thus, scaling or general transformations are not an issue, and these methods are immune to the effects of predictor outliers. RF tends to be versatile and flexible with small or large datasets and has become an effective prediction tool (Breiman, 2001). One issue with RF models is that their performance in calibration mode is usually very good but degrades in validation mode (low robustness). Model interpretability is another issue compared to linear models. RF lacks transparency because the identified predictor variables are sensitive to slight changes in the model (low transparency). RF models are a black-box approach that is very hard to interpret.

3.4.6. Bayesian model averaging performance

The BMA method provided a new approach to SOC prediction via Vis-NIR. The R², RMSE, and RPIQ values were 0.68 to 0.80, 0.65% to 0.51%, and 2.36 to 3.03, respectively (Table 1). BMA has become more common in the soil science community, particularly for soil-property prediction. Leon and Gonzalez (2009) predicted SOC with BMA by considering several predictors, such as the loss on ignition, parent material, drainage status, soil-horizon type, clay content, and pH. Their validation analysis showed that the prediction accuracy for SOC was improved with the BMA approach compared to the ordinary least-squares approach. Malone et al. (2014) applied the BMA approach to combine digital soil-property maps that were derived from disaggregated legacy soil class maps. These authors determined the efficacy of ensemble modeling as a useful combinatorial approach to combine digital soil-property maps from Australia. Poggio et al. (2016) assessed the spatial uncertainty with the Bayesian approach by modeling the soil organic matter content in the Grampian region of Scotland. Similarly, Xiong et al. (2015) applied Bayesian geostatistics to assess the uncertainty that was associated with SOC predictive models in Florida, USA. However, none of these studies used BMA to develop soil models using Vis-NIR spectra.

The BMA approach can extract empirically relevant relationships and calculate a set of ‘models’ by assuming that no single ‘model’ describes the data process, instead keeping all the ‘models’ and assigning a weight to each. BMA refers to the process of averaging estimates according to the probability distributions, where all the ‘models’ can be interpreted as proxies for some unknown underlying model (Brandl, 2008). The BMA approach is a quantitative tool that is adjustable and flexible regarding the efficiency of the input variables to estimate SOC. The distinct advantage of BMA is its ability to express which input variable most influences the ‘models’ via prior specification (probability) (Raftery, 1995). Additionally, the benefit of using BMA for spectral data is the ability to assess the uncertainty of each predictive variable.
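The idea of keeping all candidate ‘models’ and weighting each can be sketched as follows. This minimal Python example assumes equal prior model probabilities and approximates the posterior model weights from BIC values, a common approximation; it is not necessarily the weighting used in this study, and the BIC values, predictions and function names are hypothetical.

```python
import math


def bma_weights(bic_values):
    """Approximate posterior model probabilities from BIC values,
    assuming equal prior probability for every candidate model."""
    best = min(bic_values)
    # Subtracting the smallest BIC avoids numerical underflow in exp()
    raw = [math.exp(-0.5 * (bic - best)) for bic in bic_values]
    total = sum(raw)
    return [value / total for value in raw]


def bma_predict(predictions, weights):
    """Weighted average of the predictions of the individual models."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical BIC values and SOC predictions (%) from three candidate models
bic_values = [100.0, 102.0, 110.0]
model_predictions = [1.8, 2.1, 2.6]

weights = bma_weights(bic_values)
averaged = bma_predict(model_predictions, weights)
```

The model with the lowest BIC receives the largest weight, but the other models still contribute, so the averaged prediction reflects model uncertainty rather than the choice of a single model.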

3.4.7. Weighted average partial least squares performance

Overall, WAPLS produced the highest accuracy prediction model for SOC (R² = 0.82, RMSE = 0.48%, and RPIQ = 3.18) (Table 1). The best WAPLS model returned the lowest RMSE value (0.48%) compared to the remaining algorithms. Ramirez-Lopez et al. (2013) emphasized the great potential of WAPLS to predict soil properties in large and diverse Vis-NIR datasets. These authors introduced the spectrum-based learner (SBL) technique, which is a category of WAPLS, and compared the predictive performance of this technique to those of other approaches, including SVM and PLSR. SBL outperformed the other approaches for both datasets (regional and global soil spectral libraries), producing the lowest RMSE and highest R² (RMSE = 0.25% and 0.80% and R² = 0.59 and 0.68 for the regional and global soil spectral libraries, respectively). The low predictive performance compared to this study was attributed to large spectral variation from the diversity of the soil-formation environments where the samples were collected (Ramirez-Lopez et al., 2013). Gholizadeh et al. (2016) applied the WAPLS approach and other data-mining algorithms (PLSR and SVM) to predict the soil texture with Vis-NIR spectra from the Czech Republic (total of 264 samples). The WAPLS model gave the best prediction accuracies for the three soil fractions. These authors concluded that WAPLS had not yet been commonly used to predict soil properties and that statistical methods with high prediction efficiency have the best adaptability to analyze the structure of soil data. The best performance of WAPLS was related to important characteristics, such as the use of multiple models from multiple partial least squares (PLS) components and the final predicted value being a weighted average of all the predicted values from the multiple PLS models (Ramirez-Lopez and Stevens, 2016).

3.4.8. Gaussian process regression performance

GPR is a machine-learning algorithm that applies a kernel function for training and prediction. The accuracy performance of GPR models produced values of R² from 0.65 to 0.79, RMSE from 0.69% to 0.52%, and RPIQ from 2.21 to 2.96 (Table 1). Few studies have utilized the GPR method for SOC predictions. Numerous applications of kernel-based algorithms have been reported in the context of optical pattern and object recognition, text categorization, time-series prediction, and gene expression profile analysis (Muller et al., 2001). Kernel methods are a class of algorithms for pattern analysis in machine learning. For many algorithms that solve regression problems, the data must be explicitly transformed into feature vector representations. In contrast, kernel methods require only a user-specified kernel. This process, which is called the ‘kernel trick’, replaces the features (predictors) with a kernel function. Several classes of kernels can be used for machine learning, and the selection of a kernel is critical to the success of these algorithms (Karatzoglou et al., 2004).
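To illustrate what a user-specified kernel looks like, the following minimal Python sketch defines a Gaussian (radial basis function) kernel and builds the kernel matrix for a few samples. The choice of this kernel, its `length_scale` parameterization, the function names, and the three short "spectra" are assumptions made for illustration; the kernel used in this study is not stated in this section.

```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * length_scale ** 2))


def kernel_matrix(samples, length_scale=1.0):
    """Pairwise kernel values between all samples (the Gram matrix)."""
    return [[rbf_kernel(a, b, length_scale) for b in samples] for a in samples]


# Three hypothetical preprocessed spectra with four bands each
samples = [
    [0.10, 0.20, 0.15, 0.30],
    [0.12, 0.19, 0.16, 0.31],
    [0.40, 0.55, 0.50, 0.70],
]
K = kernel_matrix(samples, length_scale=0.2)
```

A kernel-based learner such as GPR works with this matrix of pairwise similarities rather than with explicitly constructed feature vectors, which is the substitution that the term ‘kernel trick’ refers to.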

One benefit is that this algorithm is often computationally faster than specific memory-learning methods (Fig. 7). Interestingly, kernels have not yet been sufficiently utilized for regression problems, which represents a research gap for the GPR method. The GPR method is an alternative when working with learning algorithms, and the results in our study demonstrated that GPR should be considered as a prediction method for SOC with Vis-NIR spectral data.

3.4.9. Artificial neural network performance

The final machine-learning approach is ANN. This method produced R2 values from 0.64 to 0.80, RMSE from 0.69% to 0.51%, and RPIQ from 2.22 to 3.01 (Table 1). When we evaluated the prediction accuracies of all the methods, ANN produced reasonable outcomes, with its best model reaching R2 of 0.80, RMSE of 0.51%, and RPIQ of 3.01; hence, ANN produced results comparable to those of the other methods. This statement is corroborated by the good performance of ANN models predicting SOC in several studies. The ANN model of Viscarra Rossel and Behrens (2010) returned the best prediction results for SOC (R2 = 0.89 and RMSE = 0.75%) compared to PLSR, MLR, SVM, and RF, among others. However, this ANN model was implemented with a reduced number of wavelet coefficients. These authors concluded that ANNs could extract more relevant information when adding more features (hidden nodes and fitting coefficients). Were et al. (2015) applied an ANN algorithm to spatially predict the SOC stocks in the eastern Mau Forest Reserve, Kenya. These authors reported the prediction accuracy of the ANN model as R2 of 0.61 and RMSE of 15.46 Mg ha−1. They suggested that machine-learning techniques offer pathways for spatially predicting target soil variables. Kuang et al. (2015) compared ANN and PLSR model performances in cross-validation, laboratory independent validation, online validation and online independent validation for SOC prediction in two farm fields in Viborg, Denmark. Models that were based on an ANN algorithm showed a stronger prediction capability than those that were based on PLSR in both fields: the highest performance was produced by an ANN in the cross-validation model (R2 = 0.90 and RMSE = 1.50%). Mouazen et al. (2010) used an ANN calibration model for SOC prediction with 133 soil samples from Belgium and northern France, which produced a higher R2 but also a higher RMSE (R2 = 0.84 and RMSE = 0.68%) than the model in our study (R2 = 0.80 and RMSE = 0.51%). Daniel et al. (2003) assessed the potential of ANN modeling for soil organic matter within a spectral range from 400 to 1100 nm in 41 soil samples in Thailand. The ANN models exhibited increased performance in the laboratory (R2 = 0.86) compared to field-based assessments (R2 = 0.84). The suitable performances of ANN models might be attributed to the nature of ANNs in solving nonlinear problems (Kuang et al., 2015). In ANNs, the mathematical model assigns weights between elements, and the network structure is adjusted depending on the inputs (McBratney et al., 2003).
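The three statistics quoted for every model in this section (R2, RMSE, and RPIQ) can be computed from observed and predicted values as in the Python sketch below. RPIQ is taken as the interquartile range of the observed values divided by the RMSE (Bellon-Maurel et al., 2010). Two details are assumptions made here and are not stated in this section: R2 is computed as 1 − SSE/SST rather than as a squared correlation, and the quartiles use Python's "inclusive" interpolation rule. The example values are hypothetical.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error, in the units of the observed values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SSE / SST (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def rpiq(observed, predicted):
    """Ratio of performance to interquartile range: (Q3 - Q1) / RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


# Hypothetical SOC values in %, not data from the study.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(r_squared(observed, predicted), rmse(observed, predicted),
      rpiq(observed, predicted))
```

Because RPIQ scales the error by the spread of the observed data, it allows RMSE values from datasets with different SOC ranges to be put on a comparable footing, which matters when results from different studies are set side by side.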

3.5. Computational performance of the models

The best multivariate method is presumably the one that produces the best predictive ability with a robust accuracy result. Nonetheless, it is hard to establish rules for deciding which method or algorithm is better. Because the methods produced similarly prominent results, the time to process each model in R was calculated to complement this assessment. The processing times of the seven preprocessing models were averaged to find which of the nine methods consumed the least time (Fig. 6). This procedure required running all the models on the same computer. The computational performance of the models can be influenced by several factors, such as the computational system, the number of observations, the number of variables in the prediction model, and the method that is used. BMA and MLR consumed the least time (0.20 and 0.33 min, respectively). PLSR and RF consumed more time (2.69 and 3.52 min), and the least efficient methods were PCR and WAPLS, which had exceedingly long computation times (7.38 and 9.37 min). As an alternative, PCR can be replaced by PLSR because these methods showed similar performance in terms of SOC prediction and because PLSR took less time to process the models in R. The evaluation of the time that was consumed revealed that BMA was the most efficient method. WAPLS produced great predictive performance overall (Table 1). However, its computational performance was the least efficient. Instead of using WAPLS, one alternative is to apply GPR because the kernel function accelerates the process without significantly diminishing the performance of the models.

A.C. Dotto et al. Geoderma 314 (2018) 262–274
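The timing comparison described in this section, one mean processing time per method averaged over the preprocessing variants, can be sketched in Python as follows. The study timed its models in R; this harness, its function names, and the two stand-in workloads are hypothetical and serve only to show the bookkeeping.

```python
import time


def time_call_minutes(fit):
    """Wall-clock time in minutes taken by one call of fit()."""
    start = time.perf_counter()
    fit()
    return (time.perf_counter() - start) / 60.0


def mean_time_per_method(fits):
    """Average the fit time of each method over its preprocessing variants.

    fits maps a method name to a dict of
    {preprocessing name: zero-argument callable that fits the model}.
    """
    return {
        method: sum(time_call_minutes(f) for f in by_prep.values()) / len(by_prep)
        for method, by_prep in fits.items()
    }


# Hypothetical stand-ins for model fitting; real fits would train on spectra.
def cheap_fit():
    sum(i * i for i in range(1_000))


def costly_fit():
    sum(i * i for i in range(1_000_000))


fits = {
    "fast_method": {"SNV": cheap_fit, "MSC": cheap_fit},
    "slow_method": {"SNV": costly_fit, "MSC": costly_fit},
}
timings = mean_time_per_method(fits)
fastest = min(timings, key=timings.get)
print(fastest, timings)
```

As the text notes, such timings are only comparable when every model is run on the same machine with the same data, since hardware and dataset size both affect the result.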

3.6. Comparative performance analysis

When comparing the mean values of R2 and RMSE for each preprocessing technique (Fig. 7i), the Scott-Knott test showed a significant difference between two groups. The first group consisted of NBR, MSC, SNV, and CRR, which belong to the scatter-correction category. This result endorses the trend found in Table 1, where the predictive results of scatter-correction preprocessing were grouped together in each method. According to the Scott-Knott test, these four preprocessing techniques presented statistically identical R2 and RMSE results (letter a, Fig. 7i). CRR achieved the best performance, which is an indicator of the great potential of this preprocessing technique in SOC predictions. The second group consisted of SGD and ASG (spectral derivatives), along with SMO (letter b, Fig. 7i). This group presented inferior results for both R2 and RMSE. The poorest result was achieved by SGD, which presented the lowest R2 and highest RMSE (0.67 and 0.65%, respectively).
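Three of the scatter-correction transforms named above can be written compactly. The Python sketch below gives common textbook forms of SNV, NBR, and MSC; it is not the code used in the study. In particular, the min-max form of normalization by range and the use of the mean spectrum as the MSC reference are assumptions, and the example spectra are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own
    mean and sample standard deviation."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(v - mean) / sd for v in spectrum]


def normalize_by_range(spectrum):
    """Normalization by range, assumed here to be min-max scaling to [0, 1]."""
    lo, hi = min(spectrum), max(spectrum)
    return [(v - lo) / (hi - lo) for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction: regress each spectrum on the mean
    spectrum, then remove the fitted offset and slope."""
    n_bands = len(spectra[0])
    reference = [statistics.fmean(s[j] for s in spectra) for j in range(n_bands)]
    ref_mean = statistics.fmean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = statistics.fmean(s)
        slope = sum((r - ref_mean) * (v - s_mean)
                    for r, v in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


# Hypothetical reflectance values: one shape with different offsets and gains.
base = [0.10, 0.20, 0.40, 0.30]
spectra = [base,
           [2.0 * v + 0.05 for v in base],
           [0.5 * v + 0.02 for v in base]]
print(snv(spectra[0]))
print(msc(spectra)[1])
```

In this toy case the three spectra differ only by an additive offset and a multiplicative gain, so SNV and MSC each map them onto a single common curve, which is the kind of scatter effect these transforms are designed to remove.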

The mean values of R2 and RMSE for each method are shown in Fig. 7ii. The results of the Scott-Knott test showed that the methods were divided into two groups. Except for RF (letter b, Fig. 7ii), all the methods were classified into the same group, which means there was no significant difference between these methods (letter a, Fig. 7ii). According to the comparison of means, eight methods presented the same performance in SOC prediction. Comparing the methods, SVM presented a slightly better performance. In Table 1, the highest predictive performance of SVM was achieved with SGD preprocessing, with R2 and RMSE of 0.80 and 0.52%, respectively. This outcome makes it very difficult to decide which method showed the best SOC predictive performance.
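The grouping reported by the Scott-Knott test rests on a partition step: the ordered means are split into the two groups that maximize the between-group sum of squares (Scott and Knott, 1974). The Python sketch below shows only that step. The subsequent likelihood-ratio test against a chi-square distribution is deliberately omitted, the function name is hypothetical, and the nine R2 means are invented for illustration and are not the values in Fig. 7.

```python
def scott_knott_split(means):
    """Return (high_group, low_group, b0) for the two-group partition of the
    ordered means that maximizes the between-group sum of squares b0.

    This is only the partition step of the Scott-Knott procedure; the
    significance test that decides whether to keep the split is not included.
    """
    ordered = sorted(means, reverse=True)
    k = len(ordered)
    if k < 2:
        raise ValueError("need at least two means")
    grand = sum(ordered) / k
    best_b0, best_cut = -1.0, 1
    for cut in range(1, k):
        g1, g2 = ordered[:cut], ordered[cut:]
        m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
        b0 = len(g1) * (m1 - grand) ** 2 + len(g2) * (m2 - grand) ** 2
        if b0 > best_b0:
            best_b0, best_cut = b0, cut
    return ordered[:best_cut], ordered[best_cut:], best_b0


# Hypothetical mean R2 values for nine methods (illustration only).
method_means = [0.78, 0.77, 0.76, 0.78, 0.66, 0.77, 0.79, 0.76, 0.77]
high, low, b0 = scott_knott_split(method_means)
print(high, low, round(b0, 5))
```

With one mean clearly below the rest, the best partition isolates it in its own group, which mirrors the pattern reported above where a single method is separated from the other eight.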

One finding that stands out is that numerous preprocessing techniques and methods performed very similarly. An R2 of 0.78, 0.79, 0.80, 0.81 or 0.82 (the same holds for RMSE and RPIQ) indicates that the methods are not substantially different. Even the best-performing model (WAPLS) was similar to the second- and third-best models. In addition, part of the unexplained variance could be attributed to error in the laboratory measurement of SOC, which may have affected the performance of the models. Soil analyses, like any other type of analysis, are subject to several errors. The recommended methods for routine laboratory analysis have some limitations (O'Rourke and Holden, 2011), such as reagent quality, different equipment brands, an extensive sequence of steps, sample-handling errors, and operational errors.

For future studies, we suggest that the following questions should be taken into account. Are we chasing differences of 1%, or sometimes less, between methods in terms of predicting SOC? Have we reached a plateau in maximizing model performance (based on Vis-NIR data), such that whatever new method we use, about 20% of the variability remains unexplained? What does this mean for future research on soil property prediction with Vis-NIR? The use of reflectance spectroscopy combined with digital elevation data may be a possibility. Perhaps spectroscopic data alone have reached a limit, and the support of other data or techniques (such as environmental covariates) is needed. Another issue is that spectroscopy, processing, and models (in all possible combinations) need to mature to open new possibilities, such as incorporating the genesis and local characteristics of soil formation.

4. Conclusions

This study explored a systematic SOC prediction methodology with Vis-NIR spectroscopic data to support choices of spectral preprocessing and multivariate methods. Among the preprocessing techniques, the scatter-correction group (NBR, MSC, SNV, and CRR) showed improved prediction capability. Overall, continuum-removal preprocessing produced the best predictive results, which confirms the potential of this preprocessing method to predict SOC. However, the spectral-derivative preprocessing group, which included SGD and ASG, showed superior results for the SVM and RF methods, revealing their capability to better handle derivative transformation. Except for RF, all the multivariate methods presented high prediction outcomes. The highest model accuracy for SOC prediction was found when applying the WAPLS method and NBR preprocessing (R2 = 0.82, RMSE = 0.48%, and RPIQ = 3.18).

The systematic methodology that was applied in this study can improve the reliability of SOC determinations by examining how preprocessing techniques and multivariate methods affect spectral analyses. The comparative analyses presented in this paper will guide other studies on similar soils to select appropriate preprocessing techniques and methods to model SOC from Vis-NIR spectra. The quantification of SOC can improve soil-property information and supply digital soil mapping approaches to develop soil-property maps.

Fig. 6. Time to process the models in R. The average of the seven preprocessing models was considered for each method. PLSR: partial least-squares regression, PCR: principal components regression, MLR: multiple linear regression, SVM: support vector machine, RF: random forest, BMA: Bayesian model averaging, WAPLS: weighted average partial least-square, GPR: Gaussian process regression, and ANN: artificial neural network.

Fig. 7. The mean values of R2 and RMSE for each preprocessing technique (i) and method (ii). The letters in parentheses represent the results of the Scott-Knott test (significance level α = 0.05). SMO: smoothed, CRR: continuum-removed reflectance, NBR: normalization by range, SNV: standard normal variate, MSC: multiplicative scatter-correction, SGD: Savitzky-Golay derivative, and ASG: absorbance and then the application of a Savitzky-Golay derivative, PLSR: partial least-squares regression, PCR: principal components regression, MLR: multiple linear regression, SVM: support vector machine, RF: random forest, BMA: Bayesian model averaging, WAPLS: weighted average partial least-square, GPR: Gaussian process regression, and ANN: artificial neural network.

Acknowledgments

This research was funded by the Coordination for the Improvement of Higher Education Personnel (CAPES). The second and third authors thank the National Council for Scientific and Technological Development (CNPq) and the Santa Catarina Research Foundation (FAPESC) from Brazil's Ministry of Education for the financial support. The authors are grateful to the GeoSS Laboratory, Soil Science Department, ESALQ/University of Sao Paulo, Brazil for the Vis-NIR spectral measurements.

References

Andrews, S.S., Karlen, D.L., Cambardella, C.A., 2004. The soil management assessment framework: a quantitative soil quality evaluation method. Soil Sci. Soc. Am. J. 68, 1945–1962.

Araújo, S.R., Wetterlind, J., Demattê, J.A.M., Stenberg, B., 2014. Improving the prediction performance of a large tropical Vis-NIR spectroscopic soil library from Brazil by clustering into smaller subsets or use of data mining calibration techniques. Eur. J. Soil Sci. 65, 718–729. http://dx.doi.org/10.1111/ejss.12165.

Arrouays, D., McKenzie, N., Hempel, J., de Forges, A.C.R., McBratney, A.B. (Eds.), 2014. GlobalSoilMap: Basis of the Global Spatial Soil Information System. CRC Press/Balkema.

Barnes, R.J., Dhanoa, M.S., Lister, S.J., 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 43, 772–777.

Bayer, A., Bachmann, M., Muller, A., Kaufmann, H., 2012. A comparison of feature-based MLR and PLS regression techniques for the prediction of three soil constituents in a degraded South African ecosystem. Appl. Environ. Soil Sci. 2012, e971252. http://dx.doi.org/10.1155/2012/971252.

Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J.-M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends Anal. Chem. 29, 1073–1081. http://dx.doi.org/10.1016/j.trac.2010.05.006.

Ben Dor, E., Ong, C., Lau, I.C., 2015. Reflectance measurements of soils in the laboratory: standards and protocols. Geoderma 245–246, 112–124. http://dx.doi.org/10.1016/j.geoderma.2015.01.002.

Ben-Dor, E., 2011. Characterization of soil properties using reflectance spectroscopy. In: Thenkabail, P.S., Lyon, J.G., Huete, A. (Eds.), Hyperspectral Remote Sensing of Vegetation. CRC Press, Boca Raton, pp. 513–557.

Ben-Dor, E., Inbar, Y., Chen, Y., 1997. The reflectance spectra of organic matter in the visible near-infrared and short wave infrared region (400–2500 nm) during a controlled decomposition process. Remote Sens. Environ. 61, 1–15. http://dx.doi.org/10.1016/S0034-4257(96)00120-4.

Brandl, B., 2008. Bayesian model averaging and model selection: two sides of the same coin when identifying the determinants of trade union density? Cent. Eur. J. Oper. Res. 17, 13–29. http://dx.doi.org/10.1007/s10100-008-0072-0.

Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. http://dx.doi.org/10.1023/A:1010933404324.

Cambule, A.H., Rossiter, D.G., Stoorvogel, J.J., Smaling, E.M.A., 2012. Building a near infrared spectral library for soil organic carbon estimation in the Limpopo National Park, Mozambique. Geoderma 183–184, 41–48. http://dx.doi.org/10.1016/j.geoderma.2012.03.011.

Chang, C.W., Laird, D.A., Mausbach, M.J., Hurburgh, C.R., 2001. Near-infrared reflectance spectroscopy–principal components regression analyses of soil properties. Soil Sci. Soc. Am. J. 65, 480–490. http://dx.doi.org/10.2136/sssaj2001.652480x.

Clark, R.N., 1999. Spectroscopy of rocks and minerals, and principles of spectroscopy. In: Rencz, N. (Ed.), Remote Sensing for the Earth Sciences: Manual of Remote Sensing. John Wiley & Sons, New York, pp. 3–52.

Clark, R.N., Roush, T.L., 1984. Reflectance spectroscopy: quantitative analysis techniques for remote sensing applications. J. Geophys. Res. Solid Earth 89, 6329–6340. http://dx.doi.org/10.1029/JB089iB07p06329.

Conforti, M., Castrignanò, A., Robustelli, G., Scarciglia, F., Stelluti, M., Buttafuoco, G., 2015. Laboratory-based Vis–NIR spectroscopy and partial least square regression with spatially correlated errors for predicting spatial variation of soil organic matter content. Catena 124, 60–67. http://dx.doi.org/10.1016/j.catena.2014.09.004.

Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297. http://dx.doi.org/10.1007/BF00994018.

Dalmolin, R.S.D., Pedron, F. de A., de Almeida, J.A., Ribas, C.G., 2017. Solos do Planalto das Araucárias. In: Curi, N., Ker, J.C., Novais, R.F., Vidal-Torrado, P., Schaefer, C.E.G.R. (Eds.), Pedologia - Solos dos Biomas Brasileiros. Sociedade Brasileira de Ciência do Solo (SBCS), Viçosa, MG (p. 597).

Daniel, K.W., Tripathi, N.K., Honda, K., 2003. Artificial neural network analysis of laboratory and in situ spectra for the estimation of macronutrients in soils of Lop Buri (Thailand). Soil Res. 41, 47–59.

Demattê, J.A.M., Horák-Terra, I., Beirigo, R.M., Terra, F. da S., Marques, K.P.P., Fongaro, C.T., Silva, A.C., Vidal-Torrado, P., 2017. Genesis and properties of wetland soils by VIS-NIR-SWIR as a technique for environmental monitoring. J. Environ. Manag. 197, 50–62. http://dx.doi.org/10.1016/j.jenvman.2017.03.014.

Dotto, A.C., Dalmolin, R.S.D., Grunwald, S., ten Caten, A., Pereira Filho, W., 2017. Two preprocessing techniques to reduce model covariables in soil property predictions by Vis-NIR spectroscopy. Soil Tillage Res. 172, 59–68. http://dx.doi.org/10.1016/j.still.2017.05.008.

Feng, C., Wang, H., Lu, N., Chen, T., He, H., Lu, Y., Tu, X.M., 2014. Log-transformation and its implications for data analysis. Shanghai Arch. Psychiatry 26, 105–109. http://dx.doi.org/10.3969/j.issn.1002-0829.2014.02.009.

Geladi, P., MacDougall, D., Martens, H., 1985. Linearization and scatter-correction for near-infrared reflectance spectra of meat. Appl. Spectrosc. 39, 491–500.

Gholizadeh, A., Borůvka, L., Saberioon, M.M., Kozák, J., Vašát, R., Němeček, K., 2015. Comparing different data preprocessing methods for monitoring soil heavy metals based on soil spectral features. Soil. Water Res. 10, 218–227. http://dx.doi.org/10.17221/113/2015-SWR.

Gholizadeh, A., Borůvka, L., Saberioon, M., Vašát, R., 2016. A memory-based learning approach as compared to other data mining algorithms for the prediction of soil texture using diffuse reflectance spectra. Remote Sens. 8, 341. http://dx.doi.org/10.3390/rs8040341.

Gosso, A., 2012. elmNN: implementation of ELM (Extreme Learning Machine) algorithm for SLFN (Single Hidden Layer Feedforward Neural Networks). In: R Package Version 1.

Grimm, R., Behrens, T., Märker, M., Elsenbeer, H., 2008. Soil organic carbon concentrations and stocks on Barro Colorado Island — digital soil mapping using random forests analysis. Geoderma 146, 102–113. http://dx.doi.org/10.1016/j.geoderma.2008.05.008.

Grunwald, S., 2009. Multi-criteria characterization of recent digital soil mapping and modeling approaches. Geoderma 152, 195–207. http://dx.doi.org/10.1016/j.geoderma.2009.06.003.

Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning. In: Springer Series in Statistics. Springer New York, New York, NY.

Hemmateenejad, B., Akhond, M., Samari, F., 2007. A comparative study between PCR and PLS in simultaneous spectrophotometric determination of diphenylamine, aniline, and phenol: effect of wavelength selection. Spectrochim. Acta A Mol. Biomol. Spectrosc. 67, 958–965. http://dx.doi.org/10.1016/j.saa.2006.09.014.

Hunt, G.R., Salisbury, J.W., Lenhoff, C.J., 1971. Visible and near infrared spectra of minerals and rocks: II. Carbonates. Mod. Geol. 195–205.

Ivanciuc, O., 2007. Applications of Support Vector Machines in Chemistry. In: Lipkowitz, K.B., Cundari, T.R. (Eds.), Reviews in Computational Chemistry. John Wiley & Sons, Inc., pp. 291–400.

Jelihovschi, E.G., Faria, J.C., Allaman, I.B., 2014. ScottKnott: a package for performing the Scott-Knott clustering algorithm in R. Trends. Appl. Comput. Math. 15, 3–17.

Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A., 2004. Kernlab - an S4 package for kernel methods in R. J. Stat. Softw. 11, 1–20. http://dx.doi.org/10.18637/jss.v011.i09.

Kendall, M.G., 1957. A Course in Multivariate Analysis. Griffin, London.

Knox, N.M., Grunwald, S., McDowell, M.L., Bruland, G.L., Myers, D.B., Harris, W.G., 2015. Modelling soil carbon fractions with visible near-infrared (VNIR) and mid-infrared (MIR) spectroscopy. Geoderma 239–240, 229–239. http://dx.doi.org/10.1016/j.geoderma.2014.10.019.

Kuang, B., Tekin, Y., Mouazen, A.M., 2015. Comparison between artificial neural network and partial least squares for on-line visible and near infrared spectroscopy measurement of soil organic carbon, pH and clay content. Soil Tillage Res. 146 (Part B), 243–252. http://dx.doi.org/10.1016/j.still.2014.11.002.

Lagacherie, P., Baret, F., Feret, J.-B., Madeira Netto, J., Robbez-Masson, J.M., 2008. Estimation of soil clay and calcium carbonate using laboratory, field and airborne hyperspectral measurements. Remote Sens. Environ. 112, 825–835. http://dx.doi.org/10.1016/j.rse.2007.06.014.

Leon, A., Gonzalez, R.L., 2009. Predicting soil organic carbon percentage from loss-on-ignition using Bayesian Model Averaging. Aust. J. Soil Res. 47, 763–769. http://dx.doi.org/10.1071/SR08119.

Liaw, A., Wiener, M., 2002. Classification and Regression by randomForest. R News 2, 18–22.

Loosli, G., Canu, S., Bottou, L., 2007. Training invariant support vector machines using selective sampling. In: Bottou, L., Chapelle, O., DeCoste, D., Weston, J. (Eds.), Large Scale Kernel Machines. MIT Press, Cambridge, MA, pp. 301–320.

Malone, B.P., Minasny, B., Odgers, N.P., McBratney, A.B., 2014. Using model averaging to combine soil property rasters from legacy soil maps and from point data. Geoderma 232–234, 34–44. http://dx.doi.org/10.1016/j.geoderma.2014.04.033.

McBratney, A.B., Mendonça Santos, M.L., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3–52. http://dx.doi.org/10.1016/S0016-7061(03)00223-4.

McBratney, A., Field, D.J., Koch, A., 2014. The dimensions of soil security. Geoderma 213, 203–213. http://dx.doi.org/10.1016/j.geoderma.2013.08.013.

McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133. http://dx.doi.org/10.1007/BF02478259.

McDowell, M.L., Bruland, G.L., Deenik, J.L., Grunwald, S., Knox, N.M., 2012. Soil total carbon analysis in Hawaiian soils with visible, near-infrared and mid-infrared diffuse reflectance spectroscopy. Geoderma 189–190, 312–320. http://dx.doi.org/10.1016/j.geoderma.2012.06.009.

Mevik, B.-H., Wehrens, R., Liland, K.H., 2016. pls: partial least squares and principal component regression. In: R Package Version 26-0.

Meyer, D., 2001. Support vector machines the interface to libsvm in package e1071. R News 1.

Minasny, B., McBratney, A.B., 2008. Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemom. Intell. Lab. Syst. 94, 72–79. http://dx.doi.org/10.1016/j.chemolab.2008.06.003.

Montgomery, D.C., Peck, E.A., Vining, G.G., 2012. Introduction to Linear Regression Analysis, 5th edition. John Wiley & Sons, Inc., Hoboken, New Jersey.

Moros, J., de Vallejuelo, S.F.-O., Gredilla, A., de Diego, A., Madariaga, J.M., Garrigues, S., de la Guardia, M., 2009. Use of reflectance infrared spectroscopy for monitoring the metal content of the estuarine sediments of the Nerbioi-Ibaizabal River (Metropolitan Bilbao, Bay of Biscay, Basque Country). Environ. Sci. Technol. 43, 9314–9320. http://dx.doi.org/10.1021/es9005898.

Mouazen, A.M., Kuang, B., De Baerdemaeker, J., Ramon, H., 2010. Comparison among principal component, partial least squares and back propagation neural network analyses for accuracy of measurement of selected soil properties with visible and near infrared spectroscopy. Geoderma, Diffuse reflectance spectroscopy in soil science and land resource assessment 158, 23–31. http://dx.doi.org/10.1016/j.geoderma.2010.03.001.

Mountrakis, G., Im, J., Ogole, C., 2011. Support vector machines in remote sensing: a review. ISPRS J. Photogramm. Remote Sens. 66, 247–259. http://dx.doi.org/10.1016/j.isprsjprs.2010.11.001.

Muller, K.R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B., 2001. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12, 181–201. http://dx.doi.org/10.1109/72.914517.

Muñoz, J.D., Kravchenko, A., 2011. Soil carbon mapping using on-the-go near infrared spectroscopy, topography and aerial photographs. Geoderma 166, 102–110. http://dx.doi.org/10.1016/j.geoderma.2011.07.017.

Mutanga, O.M.C., Skidmore, A.K., Kumar, L., Ferwerda, J., 2005. Estimating tropical pasture quality at canopy level using band depth analysis with continuum removal in the visible domain. Int. J. Remote Sens. 26, 1093–1108. http://dx.doi.org/10.1080/01431160512331326738.

Nawar, S., Buddenbaum, H., Hill, J., Kozak, J., Mouazen, A.M., 2016. Estimating the soil clay content and organic matter by means of different calibration methods of Vis-NIR diffuse reflectance spectroscopy. Soil Tillage Res. 155, 510–522. http://dx.doi.org/10.1016/j.still.2015.07.021.

Nocita, M., Stevens, A., Toth, G., Panagos, P., van Wesemael, B., Montanarella, L., 2014. Prediction of soil organic carbon content by diffuse reflectance spectroscopy using a local partial least square regression approach. Soil Biol. Biochem. 68, 337–347. http://dx.doi.org/10.1016/j.soilbio.2013.10.022.

Nocita, M., Stevens, A., van Wesemael, B., Aitkenhead, M., Bachmann, M., Barthès, B., Ben Dor, E., Brown, D.J., Clairotte, M., Csorba, A., Dardenne, P., Demattê, J.A.M., Genot, V., Guerrero, C., Knadel, M., Montanarella, L., Noon, C., Ramirez-Lopez, L., Robertson, J., Sakai, H., Soriano-Disla, J.M., Shepherd, K.D., Stenberg, B., Towett, E.K., Vargas, R., Wetterlind, J., 2015. Chapter four - soil spectroscopy: an alternative to wet chemistry for soil monitoring. In: Sparks, D.L. (Ed.), Advances in Agronomy. Academic Press, pp. 139–159.

O'Rourke, S.M., Holden, N.M., 2011. Optical sensing and chemometric analysis of soil organic carbon - a cost effective alternative to conventional laboratory methods? Soil Use Manag. 27, 143–155. http://dx.doi.org/10.1111/j.1475-2743.2011.00337.x.

Osbourne, J.W., Waters, E., 2002. Four assumptions of multiple regression that researchers should always test. Pract. Assess. Res. Eval. 8.

Pandolfo, C., Braga, H.J., Silva Júnior, V.P., Massignan, A.M., Pereira, E.S., Thomé, V.M.R., Valci, F.V., 2002. Atlas Climatológico do Estado de Santa Catarina. Epagri, CD-Rom, Florianópolis.

Peng, X., Shi, T., Song, A., Chen, Y., Gao, W., 2014. Estimating soil organic carbon using VIS/NIR spectroscopy with SVMR and SPA methods. Remote Sens. 6, 2699–2717. http://dx.doi.org/10.3390/rs6042699.

Poggio, L., Gimona, A., Spezia, L., Brewer, M.J., 2016. Bayesian spatial modelling of soil properties and their uncertainty: the example of soil organic matter in Scotland using R-INLA. Geoderma 277, 69–82. http://dx.doi.org/10.1016/j.geoderma.2016.04.026.

R Core Team, 2016. R: A Language and Environment for Statistical Computing.

Raftery, A.E., 1995. Bayesian model selection in social research. Sociol. Methodol. 25, 111–164.

Ramirez-Lopez, L., Stevens, A., 2016. Resemble: regression and similarity evaluation for memory-based learning in spectral chemometrics. In: R Package Version 122.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Demattê, J.A.M., Scholten, T., 2013. The spectrum-based learner: a new local approach for modeling soil Vis–NIR spectra of complex datasets. Geoderma 195–196, 268–279. http://dx.doi.org/10.1016/j.geoderma.2012.12.014.

Rezende, S.B., 1980. Geomorphology, Mineralogy and Genesis of Four Soils on Gneiss in Southeastern Brazil. Ann Arbor, Michigan.

Rinnan, Å., van den Berg, F., Engelsen, S.B., 2009. Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends Anal. Chem. 28, 1201–1222. http://dx.doi.org/10.1016/j.trac.2009.07.007.

Scott, A.J., Knott, M., 1974. A cluster analysis method for grouping means in the analysis of variance. Biometrics 30, 507–512. http://dx.doi.org/10.2307/2529204.

Stevens, A., Ramirez-Lopez, L., 2013. An introduction to the prospect package. In: R Package Vignette.

Stevens, A., Nocita, M., Tóth, G., Montanarella, L., van Wesemael, B., 2013. Prediction of soil organic carbon at the European scale by visible and near InfraRed reflectance spectroscopy. PLoS One 8, e66409. http://dx.doi.org/10.1371/journal.pone.0066409.

Stoner, E., 1979. Physiochemical, site, and bidirectional reflectance factor characteristics of uniformly-moist soils. In: PhD Thesis. Ann Arbor, Michigan.

Stoner, E.R., Baumgardner, M.F., 1981. Characteristic variations in reflectance of surface soils. Soil Sci. Soc. Am. J. 45, 1161. http://dx.doi.org/10.2136/sssaj1981.03615995004500060031x.

Ter Braak, C.J.F., Juggins, S., 1993. Weighted averaging partial least squares regression (WA-PLS): an improved method for reconstructing environmental variables from species assemblages. Hydrobiologia 269–270, 485–502. http://dx.doi.org/10.1007/BF00028046.

Terra, F.S., Demattê, J.A.M., Viscarra Rossel, R.A., 2015. Spectral libraries for quantitative analyses of tropical Brazilian soils: comparing Vis–NIR and mid-IR reflectance data. Geoderma 255–256, 81–93. http://dx.doi.org/10.1016/j.geoderma.2015.04.017.

Vašát, R., Kodešová, R., Borůvka, L., Klement, A., Jakšík, O., Gholizadeh, A., 2014. Consideration of peak parameters derived from continuum-removed spectra to predict extractable nutrients in soils with visible and near-infrared diffuse reflectance spectroscopy (VNIR-DRS). Geoderma 232–234, 208–218. http://dx.doi.org/10.1016/j.geoderma.2014.05.012.

Vasques, G.M., Grunwald, S., Sickman, J.O., 2008. Comparison of multivariate methods for inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma 146, 14–25. http://dx.doi.org/10.1016/j.geoderma.2008.04.007.

Viscarra Rossel, R.A., Behrens, T., 2010. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma, Diffuse reflectance spectroscopy in soil science and land resource assessment 158, 46–54. http://dx.doi.org/10.1016/j.geoderma.2009.12.025.

Viscarra Rossel, R.A., Walvoort, D.J.J., McBratney, A.B., Janik, L.J., Skjemstad, J.O., 2006. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 131, 59–75. http://dx.doi.org/10.1016/j.geoderma.2005.03.007.

Viscarra Rossel, R.A., Cattle, S.R., Ortega, A., Fouad, Y., 2009. In situ measurements of soil colour, mineral composition and clay content by Vis–NIR spectroscopy. Geoderma 150, 253–266. http://dx.doi.org/10.1016/j.geoderma.2009.01.025.

Viscarra Rossel, R.A., Behrens, T., Ben-Dor, E., Brown, D.J., Demattê, J.A.M., Shepherd, K.D., Shi, Z., Stenberg, B., Stevens, A., Adamchuk, V., Aïchi, H., Barthès, B.G., Bartholomeus, H.M., Bayer, A.D., Bernoux, M., Böttcher, K., Brodský, L., Du, C.W., Chappell, A., Fouad, Y., Genot, V., Gomez, C., Grunwald, S., Gubler, A., Guerrero, C., Hedley, C.B., Knadel, M., Morrás, H.J.M., Nocita, M., Ramirez-Lopez, L., Roudier, P., Campos, E.M.R., Sanborn, P., Sellitto, V.M., Sudduth, K.A., Rawlins, B.G., Walter, C., Winowiecki, L.A., Hong, S.Y., Ji, W., 2016. A global spectral library to characterize the world's soil. Earth-Sci. Rev. 155, 198–230. http://dx.doi.org/10.1016/j.earscirev.2016.01.012.

Walesiak, M., Dudek, A., 2016. clusterSim: Searching for Optimal Clustering Procedure for a Data Set.

Wang, W., Li, S., Qi, H., Ayhan, B., Kwan, C., Vance, S., 2014. Revisiting the preprocessing procedures for elemental concentration estimation based on CHEMCAM LIBS on MARS Rover. In: ResearchGate. Presented at the 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing.

Wang, Y., Huang, T., Liu, J., Lin, Z., Li, S., Wang, R., Ge, Y., 2015. Soil pH value, organic matter and macronutrients contents prediction using optical diffuse reflectance spectroscopy. Comput. Electron. Agric. 111, 69–77. http://dx.doi.org/10.1016/j.compag.2014.11.019.

Wentzell, P.D., Vega Montoto, L., 2003. Comparison of principal components regression and partial least squares regression through generic simulations of complex mixtures. Chemom. Intell. Lab. Syst. 65, 257–279. http://dx.doi.org/10.1016/S0169-7439(02)00138-7.

Were, K., Bui, D.T., Dick, Ø.B., Singh, B.R., 2015. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 52, 394–403. http://dx.doi.org/10.1016/j.ecolind.2014.12.028.

Williams, C.K.I., Barber, D., 1998. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. 20, 1342–1351. http://dx.doi.org/10.1109/34.735807.

Wold, S., Ruhe, A., Wold, H., Dunn III, W.J., 1984. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5, 735–743. http://dx.doi.org/10.1137/0905052.

Wold, S., Sjöström, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58, 109–130. http://dx.doi.org/10.1016/S0169-7439(01)00155-1.

Xie, X.-L., Pan, X.-Z., Sun, B., 2012. Visible and near-infrared diffuse reflectance spectroscopy for prediction of soil properties near a copper smelter. Pedosphere 22, 351–366. http://dx.doi.org/10.1016/S1002-0160(12)60022-8.

Xiong, X., Grunwald, S., Myers, D.B., Kim, J., Harris, W.G., Bliznyuk, N., 2015. Assessing uncertainty in soil organic carbon modeling across a highly heterogeneous landscape. Geoderma 251–252, 105–116. http://dx.doi.org/10.1016/j.geoderma.2015.03.028.

Yeomans, J.C., Bremner, J.M., 1988. A rapid and precise method for routine determination of organic carbon in soil. Commun. Soil Sci. Plant Anal. 19, 1467–1476. http://dx.doi.org/10.1080/00103628809368027.

Zeugner, S., Feldkircher, M., 2015. Bayesian model averaging employing fixed and flexible priors: the BMS package for R. J. Stat. Softw. 68. http://dx.doi.org/10.18637/jss.v068.i04.

Zhang, Y., Li, M., Zheng, L., Zhao, Y., Pei, X., 2016. Soil nitrogen content forecasting based on real-time NIR spectroscopy. Comput. Electron. Agric. 124, 29–36. http://dx.doi.org/10.1016/j.compag.2016.03.016.
