
Soil texture prediction through stratification of a regional soil spectral library

2022-05-11

Pedosphere, 2022, Issue 2

José Janderson Ferreira COSTA, Élvio GIASSON, Elisângela Benedet DA SILVA, Tales TIECHER, Antonny Francisco Sampaio DE SENA and Ryshardson Geovane Pereira de Oliveira E SILVA

1 Department of Soil Science, Faculty of Agronomy, University of Rio Grande do Sul, Bento Gonçalves Avenue 7712, Porto Alegre 91540-000 (Brazil)

2 Agricultural Research and Rural Extension Corporation of Santa Catarina, Rodovia Admar Gonzaga 1347, Florianópolis, Santa Catarina 88034-901 (Brazil)

ABSTRACT Knowing the spatial distribution of soil texture, which is a physical property, is essential to support agricultural and environmental decision making. Soil texture can be estimated using visible, near infrared, and shortwave infrared (Vis-NIR-SWIR) spectroscopy. However, the performance of spectroscopic models is variable because of soil heterogeneity. Currently, few studies address the effects of soil sample variability on the performance of the models, especially for larger spectral libraries that include soils that are more heterogeneous. Therefore, the objectives of this study were to: i) apply Vis-based color parameters on the stratification of a regional soil spectral library; ii) evaluate the performance of the predictive models generated from the spectral library stratification; iii) compare the performance of stratified models (SMs) and the model without stratification (WSM), and iv) explain possible changes in prediction accuracy based on the SMs. Thus, a regional soil spectral library with 1 535 samples from the State of Santa Catarina, Brazil was used. Soil reflectance data were obtained by Vis-NIR-SWIR spectroscopy in the laboratory using a spectroradiometer covering the 350–2 500 nm spectral range. Sand, silt, and clay fractions were determined using the pipette method. Twenty-two color parameters were derived from the Vis spectrum using the colorimetric models. A cubist regression algorithm was used to assess the accuracy of the applicability of the initial models (SMs and WSM) and of the validation between the clusters. Fractional order derivatives (FODs) at 0.5, 1.5, and 2 intervals were used to explain possible changes in the performance of the SMs. The SMs with higher contents of clay and iron oxides obtained the highest accuracy, and the most important spectral bands were identified, mainly in the 480–550 and 850–900 nm ranges and the 1 400, 1 900, and 2 200 nm bands. Therefore, stratification of soil spectral libraries is a good strategy to improve regional assessments of soil resources, reducing prediction errors in the qualitative determination of soil properties.

Key Words: color parameters, cubist regression, fractional order derivatives, soil spectroscopy, spectroscopic model, stratified model, Vis-NIR-SWIR

INTRODUCTION

Soil texture plays an important role in infiltration capacity and availability of water, nutrient uptake by plants, resistance to root penetration, microbial activity, and susceptibility to erosion and compaction, among others (Phogat et al., 2014; Jaconi et al., 2019; Tümsavaş et al., 2019). Thus, due to the importance of soil texture for agricultural production, fast, accurate and cost-effective quantitative methods are needed to predict this property at different scales (Jaconi et al., 2019) and support decision making in the agricultural and environmental context.

In the last decades, researchers have used reflectance spectroscopy (RS) to estimate important soil properties (Viscarra Rossel and Webster, 2012; Nocita et al., 2014; Jaconi et al., 2017), including texture (Xu et al., 2018; Silva et al., 2019). The spectral ranges commonly used in predictions include the visible (Vis), near infrared (NIR), and shortwave infrared (SWIR) ranges, corresponding to 400–700, 700–1 100, and 1 100–2 500 nm, respectively (Viscarra Rossel and Webster, 2012; Demattê et al., 2019). RS is useful because it measures various soil properties from a single spectral reading, allowing for the analysis of various samples in a fast, cost-effective manner without the need for chemical reagents and eliminating waste production (Minasny and McBratney, 2008; Viscarra Rossel and Behrens, 2010; Ji et al., 2016; Demattê et al., 2019).

To improve the performance of predictive models, many studies have tested different spectral pre-processing techniques (treatments) and multivariate methods (Araújo et al., 2014; Demattê et al., 2019; Jaconi et al., 2019; Silva et al., 2019). Table I lists selected studies on soil texture estimation using RS with different types of spectral pre-processing and multivariate methods. These studies revealed variations in the accuracy indices, namely the coefficient of determination (R²) and root mean square error (RMSE), which can be attributed to the differences between the types of pre-processing and between the multivariate methods applied to spectral data (Araújo et al., 2014; Demattê et al., 2016; Nawar et al., 2016; Dotto et al., 2017; Silva et al., 2019). Pinheiro et al. (2017) used a soil spectral library and the partial least square regression (PLSR) multivariate model to predict soil texture and reported a moderate predictive capacity of sand content (R² = 0.62 and RMSE = 11.5%), a low predictive capacity of silt content (R² = 0.36 and RMSE = 9.5%), and a high prediction performance for clay content (R² = 0.78 and RMSE = 6.2%). Nawar et al. (2016) tested three multivariate models and seven pre-processing techniques in clay prediction and obtained the best validation results using the multivariate adaptive regression splines (MARS) model in combination with continuum removal (CR) pre-processing (R² = 0.79 and RMSE = 7.6%). Silva et al. (2019) achieved better predictive results using the Cubist model when compared with support vector machines (SVM), Gaussian process regression (GPR), random forest (RF), and PLSR. Araújo et al. (2014) showed that SVM model estimates of clay content had higher accuracy compared to PLSR. On the other hand, when Dotto et al. (2017) used the PLSR and SVM methods, the estimates for clay and silt contents had a moderate performance, contrasting with sand content, which had a poor performance.

TABLE I Selected studies that applied different types of pre-processing techniques and multivariate models to estimate soil sand, silt, and clay contents using spectroscopy

The predictive variability of the models can also be attributed to the highly variable characteristics of the soils that make up the spectral libraries, such as clay content, mineralogy, organic matter content, soil moisture, and iron oxides, among others (Dalmolin et al., 2005; Stenberg et al., 2010; Viscarra Rossel et al., 2016). The stratification of spectral libraries into more homogeneous sets of samples could overcome this problem. Araújo et al. (2014) segregated data from a spectral library to improve the accuracy of predicting clay content and obtained a reduction of 21% in the prediction error. Moura-Bueno et al. (2019) found that segregating samples into more homogeneous groups improved the accuracy of the multivariate models in predicting soil carbon when a set of soil samples was stratified based on soil classes, uses, and layers.

Many soil properties influence soil color. For example, organic matter gives soil a dark color; quartz, calcite, and other carbonates give soil a white color; hematite (α-Fe₂O₃) gives soil a red color; and goethite (α-FeOOH) gives soil a yellow color (Aitkenhead et al., 2013a, b). Color can be used as a criterion in the stratification of soil samples and provide prediction models with sets of samples with more homogeneous characteristics due to the well-known relationships between soil color, mineralogy, and physicochemical properties (Ben-Dor et al., 1997; Viscarra Rossel et al., 2009; Aitkenhead et al., 2013a, b). Soil color can be obtained from the spectral curves themselves and be represented by three-dimensional color space models (XYZ coordinates), and XYZ values can be transformed into other color space models, e.g., red-green-blue (RGB), Commission Internationale de l'Eclairage (CIE) Lab, CIE Luv, and CIE xyY, to improve the representation of colors (Viscarra Rossel et al., 2006; Simon et al., 2020).

The present study was motivated by the hypothesis that the stratification of samples in a regional soil spectral library, which typically has high spatial variability, can improve the precision of soil texture prediction. A new strategy is proposed to stratify samples based on soil color. Thus, the objectives of this study were to: i) apply Vis-based color parameters to the stratification of a regional soil spectral library; ii) evaluate the performance of predictive models generated from the stratification of the soil spectral library; iii) compare the performance of stratified models (SMs) and the model without stratification (WSM), and iv) explain possible changes in prediction accuracy based on the SMs.

MATERIALS AND METHODS

Study area and soil spectral library

The study used 1 535 soil samples from the database of the Agricultural Research and Rural Extension Corporation of Santa Catarina, which makes up the Brazilian Soil Spectral Library (BSSL) (Demattê et al., 2019). The soil samples were from 260 municipalities (about 90%) within the state of Santa Catarina (SC), Brazil (Fig. 1). According to the Köppen climate classification system, the climate of Santa Catarina is Cf, humid mesothermal climate, and includes subtypes Cfa and Cfb. The constantly humid subtropical climate Cfa is characterized by hot summers (average temperature of the hottest month ≥ 22 °C) with no dry season. The humid temperate climate Cfb is characterized by cool summers (average temperature of the hottest month < 22 °C) with no defined dry season (Alvares et al., 2013). According to the Brazilian Soil Classification System (Dos Santos et al., 2018), the predominant soil classes are Cambissolos (Inceptisols), Neossolos (Entisols), Nitossolos (Ultisols, Oxisols), Argissolos (Ultisols), Latossolos (Oxisols), and Gleissolos (Aquents) (Embrapa, 2004).

Fig. 1 Soil sampling sites in the state of Santa Catarina, Brazil.

The soil samples were collected at 0.0–0.5 m depth from agricultural lands. Sand, silt, and clay fractions were determined using the pipette method (Donagemma et al., 2017), as described by De Veiga et al. (2012).

Spectral reflectance and pre-processing measurements

To obtain spectral reflectance, a FieldSpec 3 spectroradiometer (Analytical Spectral Devices, Boulder, USA), covering the spectral range of 350–2 500 nm and with a resolution of 1 nm, was used. The geometry of the spectral reading followed the BSSL methodology described in Romero et al. (2018). The sensor was calibrated at the beginning of spectral measurements of every 20 soil samples using a white Spectralon plate. Three readings were obtained for each sample, and the mean spectral curve was used, as described by Silva et al. (2019).

Vis-based color parameters calculation

Soil color can be represented by three-dimensional models of color spaces (XYZ coordinates) and can be transformed to other color models using colorimetry models. The CIE (1986) proposed the use of CIE models to facilitate visualization and standardize color models. The Vis region of the spectrum (400–700 nm) and illuminant C were used to calculate the XYZ tristimulus based on the color-matching functions defined in 1931 by the CIE (1986), where Y represents the brightness and X and Z are the virtual components of the primary spectra. The derived XYZ values were then transformed into nineteen further color parameters belonging to other color space models (e.g., RGB, Munsell HVC, CIE xyY, CIE Lab, CIE Luv, CIE Lch, and CMYK chromaticity coordinates) using the Munsell Conversion software (WallkillColor, 2019). In the CIE xyY system, Y represents luminance, and x and y represent color variations from blue to red and blue to green, respectively. In the CIE Lab and CIE Luv systems, L represents brightness or luminance, and a and b and u and v represent chromaticity coordinates, as opponent red-green (a and u) and blue-yellow scales (b and v). The CIE Lch model represents a transformation of the CIE Lab spherical color space into cylindrical coordinates, resulting in hue (h) and chroma (c) values. The RGB system forms a cube comprising red (R), green (G), and blue (B) orthogonal axes, which can produce other colors by mixing these three primary colors. The CMYK model is an abbreviation of the system formed by the colors cyan, magenta, yellow, and black. The Munsell HVC system used in soil science describes soil color using hue (H), value (V), and chroma (Ch). Table II presents a summary of the color space models used in this study and the abbreviations for all 22 calculated color parameters.
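The transformations from XYZ to the derived color spaces can be illustrated with a minimal sketch. The study itself used the Munsell Conversion software, so the code below is not the authors' procedure; it applies the standard CIE formulas for xyY, Lab, and Lch. The white point for illuminant C (2-degree observer) and the function names are assumptions introduced for illustration.

```python
import math

# Assumed white point: CIE illuminant C, 2-degree observer (X, Y, Z with Y = 100).
WHITE_C = (98.074, 100.0, 118.232)


def xyz_to_xyY(X, Y, Z):
    """Chromaticity coordinates x, y plus luminance Y."""
    total = X + Y + Z
    return X / total, Y / total, Y


def _f(t):
    """Piecewise cube-root function used by the CIE Lab definition."""
    delta = 6.0 / 29.0
    if t > delta ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * delta ** 2) + 4.0 / 29.0


def xyz_to_lab(X, Y, Z, white=WHITE_C):
    """CIE Lab coordinates relative to the given white point."""
    fx, fy, fz = (_f(v / w) for v, w in zip((X, Y, Z), white))
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)   # red-green opponent axis
    b = 200.0 * (fy - fz)   # blue-yellow opponent axis
    return L, a, b


def lab_to_lch(L, a, b):
    """Cylindrical form of CIE Lab: lightness, chroma, hue angle in degrees."""
    chroma = math.hypot(a, b)
    hue = math.degrees(math.atan2(b, a)) % 360.0
    return L, chroma, hue
```

Passing the white point itself to `xyz_to_lab` returns L = 100 with a and b equal to zero, which is a convenient sanity check before converting tristimulus values derived from measured spectra.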

Principal component analysis(PCA)

The PCA was applied to color parameter values to reduce the dimensionality of the data in multivariate space and to visualize distribution patterns and structures for the identification of clusters and outliers (Galvão et al., 1995). Each principal component is represented by an axis orthogonal to the other axes, which are formed by a linear combination of the original variables. According to this technique, the Euclidean distance is preserved between the descriptors and the identified relationships are linear (Borcard et al., 2011). The descriptors used in this study were the color parameters listed in Table II. As these variables have different units of measurement, the values were converted to the same scale to equalize the statistical significance of all variables following Legendre and Legendre (1998), which simplified the mathematical relationships. Next, the variables were linearly transformed through a data translation and expansion process by subtracting a constant (mean) from each value and dividing them by another constant (standard deviation). This transformation is often used in PCA and is called z-score standardization. The scale function of the R base package was used to perform this transformation. The PCA was applied using the princomp function of the stats package on the descriptor values to extract information from the 22 color parameters across the entire sample set, a single 1 535 × 22 matrix.
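The standardization and the extraction of a principal component can be sketched as follows. The study used the R functions scale and princomp; this sketch replaces them with plain Python and a power iteration on the covariance matrix of the standardized data, which recovers only the first component. All function names are hypothetical, and the sample standard deviation (divisor n - 1) is an assumption.

```python
import math
import statistics


def zscore_columns(rows):
    """Standardize each column: subtract its mean, divide by its sample SD."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]  # divisor n - 1 (assumption)
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def first_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    p = len(cov)
    v = [1.0 + 0.1 * i for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return v, eigenvalue
```

For standardized data the trace of the covariance matrix equals the number of variables, so dividing the eigenvalue by that number gives the share of total variation explained by the component.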

TABLE II Vis-based color parameters derived from different color space models and calculated using Munsell Conversion software (WallkillColor, 2019)

Cluster analysis of the samples

For the stratification of the soil spectral library and its representation in well-defined patterns, Fuzzy K-means (FKM) clustering was applied to PCA scores to discriminate different soil samples based on color parameter values that were described in detail by Terra et al. (2018). The first four PCA scores were used to determine the ideal number of clusters needed to represent the sample set. The FKM technique assigns a degree of fuzzy association to each sample based on the distance to the center of the cluster. The degrees of fuzzy associations are continuous and vary from 0 to 1. A high degree of fuzzy association indicates a high degree of similarity between a sample and a cluster, while a low degree of fuzzy association indicates low similarity (Bezdek et al., 1984). The appropriate number of clusters and PCA scores for clustering were established using the following indices: partition coefficient (PC), partition entropy (PE), and modified partition coefficient (MPC) (Wu and Yang, 2005; Ferraro and Giordani, 2015). Two to four PCA scores and 2–8 clusters were tested. The ideal number of clusters was selected based on the best PC and MPC scores. Therefore, the initial group of 1 535 soil samples was divided into four clusters, C1–C4, with C1 containing 400 samples, C2 265 samples, C3 447 samples, and C4 423 samples. All analyses and statistical procedures were performed using the R software (R Development Core Team, 2017). The ppclust package was used for FKM clustering (Cebeci et al., 2020). For a better understanding, stratified clusters were considered as SMs, and the group containing all samples was considered as a WSM.
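The fuzzy association degrees and the validity indices named above can be illustrated with a short sketch. It is not the ppclust implementation used in the study; it computes memberships for fixed cluster centers with the standard fuzzy K-means formula and then evaluates PC, PE, and MPC in their usual textbook forms. The fuzzifier m = 2 and all function names are assumptions.

```python
import math


def fuzzy_memberships(points, centers, m=2.0):
    """Membership of each point in each cluster for fixed centers."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if any(x == 0.0 for x in d):
            # Point coincides with a center: full membership in that cluster.
            row = [1.0 if x == 0.0 else 0.0 for x in d]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((di / dj) ** exponent for dj in d) for di in d]
        memberships.append(row)
    return memberships


def partition_coefficient(U):
    """PC: mean of squared memberships; 1 for a crisp partition."""
    return sum(u * u for row in U for u in row) / len(U)


def partition_entropy(U):
    """PE: mean entropy of the memberships; 0 for a crisp partition."""
    return -sum(u * math.log(u) for row in U for u in row if u > 0.0) / len(U)


def modified_partition_coefficient(U):
    """MPC: PC rescaled to [0, 1] to remove its dependence on cluster count."""
    c = len(U[0])
    return 1.0 - c / (c - 1.0) * (1.0 - partition_coefficient(U))
```

Higher PC and MPC and lower PE indicate a less fuzzy, better separated partition, which is how candidate numbers of clusters can be compared.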

Pre-processing and construction of models

Spectral soil data were slightly smoothed (SMO) using the Savitzky-Golay smoothing filter (with adjustment through a second-order polynomial and an 11-nm sliding window). The SMO performed a slight smoothing in the spectral curves and is thus considered a pre-processing method. Next, three spectral pre-processing techniques frequently used in spectroscopy studies were applied (Dotto et al., 2017; Silva et al., 2019; Ba et al., 2020; Yang et al., 2020) to test their ability to predict soil texture based on SMs: i) multiplicative scatter correction (MSC); ii) normalization by range (NBR), and iii) standard normal variate (SNV). These pre-processing techniques were applied to spectral curves to eliminate noise caused by light scattering and highlight the most interesting characteristics of the spectral signal. All pre-processing techniques were performed using the R software (R Development Core Team, 2017).
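Two of the scatter corrections listed above can be sketched in a few lines. This is a minimal illustration of the textbook definitions of SNV and MSC, not the R routines used in the study; the mean spectrum is assumed as the MSC reference and the function names are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own
    mean and sample standard deviation."""
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(v - mean) / sd for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum is regressed on the reference (spectrum = intercept +
    slope * reference) and corrected as (spectrum - intercept) / slope.
    """
    n_bands = len(spectra[0])
    reference = [statistics.mean(s[j] for s in spectra) for j in range(n_bands)]
    ref_mean = statistics.mean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = statistics.mean(s)
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / sxx
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected
```

Both corrections remove additive and multiplicative offsets between spectra: two curves that differ only by a gain and a baseline shift become identical after either transformation.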

To evaluate the performance of SMs and WSM in predicting soil texture, the Cubist model (CBT) in the Cubist package was used (Kuhn et al., 2020). The CBT is a model that builds regression trees using the classification and regression tree (CART) approach (Kuhn et al., 2020). Based on the rules derived from these trees, soil samples were reallocated according to their spectra and a linear model was applied to predict the target variable. The CBT model requires the configuration of two parameters, which are the committees and the neighbors. The committees constitute a boosting-like scheme that uses the nearest neighboring data (neighbors) in the calibration set and develops a series of trees in sequence with adjusted weights. The committees were set at 0–100 and the neighbors at 0–9.

Two types of models, initial models and calibration and validation models, were generated to test the hypothesis of this study. The effect of input data and heterogeneity on model performance was investigated using SMs and WSM. In the construction of the SMs, 75% of the samples were used to develop the calibration model and 25% were used for validation. Thus, the calibration sets for clusters C1, C2, C3, and C4 contained 300, 199, 335, and 317 samples, respectively (i.e., 75% of the samples in each cluster). The validation sets for clusters C1, C2, C3, and C4 contained 100, 66, 112, and 106 samples, respectively (i.e., 25% of each cluster). A total of 1 151 samples were used to calibrate the models and 384 samples were predicted. In constructing the WSM, the same 1 151 samples of the SMs were used for model calibration and to predict the 384 samples of the SMs. Thus, the stratified and without stratification groups are equivalent and allow a fair comparison of the accuracy indices. To verify the effect of soil texture heterogeneity on the performance of the calibration and validation models, the calibrated models of the clusters that showed lower data variability and better performance in the initial models were validated with the cluster that had the highest data variability and the lowest performance in the initial models. This procedure was repeated in reverse order.
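The calibration and validation set sizes reported above follow directly from the cluster sizes and the 75/25 split, as the short check below shows. The rounding rule (round half up on the calibration share) is an assumption that reproduces the reported counts; the source does not state how the split was rounded or how samples were drawn.

```python
# Cluster sizes as reported for the four stratified clusters.
cluster_sizes = {"C1": 400, "C2": 265, "C3": 447, "C4": 423}


def split_counts(n, calibration_share=0.75):
    """Return (calibration, validation) counts for a cluster of n samples."""
    n_cal = int(n * calibration_share + 0.5)  # round half up (assumption)
    return n_cal, n - n_cal


splits = {name: split_counts(n) for name, n in cluster_sizes.items()}
total_calibration = sum(cal for cal, _ in splits.values())
total_validation = sum(val for _, val in splits.values())
```

The totals, 1 151 calibration and 384 validation samples, are the same sets used for the model without stratification, which is what makes the comparison between the two approaches like for like.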

Fractional order derivatives

Fractional order derivatives (FODs), as well as first and second order derivatives, were used to highlight the spectral features of interest and optimize the extraction of useful spectral information. These derivatives vary in small intervals and are able to detect the slightest changes generated by the functional groups that make up soil properties (Zhang et al., 2016; Hong et al., 2019). Fractional derivatives, which are common in RS studies, were used to explain possible changes in the SMs performance (Wang et al., 2018; Hong et al., 2019). In this study, fractional derivatives were obtained using the prospectr package (Stevens and Ramirez-Lopez, 2014) in the R software (R Development Core Team, 2017), where the mobile window varied from 0.5 to 2 with increments of 1 and 0.5.
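A common way to compute a fractional order derivative of a sampled spectrum is the Grünwald-Letnikov difference scheme, sketched below. The source does not say which formulation was used, so this is an assumption for illustration only; the function names and the 1 nm default step are likewise hypothetical.

```python
def gl_weights(order, n_terms):
    """Grünwald-Letnikov weights (-1)^k * binom(order, k), built recursively."""
    weights = [1.0]
    for k in range(1, n_terms):
        weights.append(weights[-1] * (1.0 - (order + 1.0) / k))
    return weights


def fractional_derivative(values, order, step=1.0):
    """Fractional derivative of a uniformly sampled curve.

    `values` holds reflectance at equally spaced wavelengths, `order` is the
    derivative order (e.g. 0.5, 1.5, or 2), and `step` is the band spacing.
    """
    n = len(values)
    weights = gl_weights(order, n)
    scale = step ** (-order)
    return [
        scale * sum(weights[k] * values[i - k] for k in range(i + 1))
        for i in range(n)
    ]
```

For integer orders the weights collapse to the familiar finite differences (1, -1 for order 1 and 1, -2, 1 for order 2), while non-integer orders such as 0.5 or 1.5 keep a slowly decaying tail of weights, which produces curves intermediate between the original spectrum and its integer derivatives.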

Model accuracy

To evaluate the performance of the predictive models, several model accuracy indices were calculated. These included R², which varies between 0 and 1 and provides the percentage of variation explained by the model, RMSE, which measures the general accuracy of the prediction model, and the ratio of performance to interquartile range (RPIQ). The model generally performs well when the RMSE is low and the R² and RPIQ are high.
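The three indices can be computed as in the sketch below. The exact formulas used in the study are not given, so the definitions here are assumptions: R² as one minus the ratio of residual to total sum of squares, and RPIQ as the interquartile range of the observed values divided by the RMSE. The observed and predicted values in the usage note are invented.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error, in the units of the observed variable."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = statistics.mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpiq(observed, predicted):
    """Ratio of performance to interquartile range: IQR(observed) / RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)
```

As a usage example with invented clay contents, `observed = [10, 20, 30, 40, 50]` and `predicted = [12, 18, 33, 39, 48]` give an RMSE of about 2.1%, an R² of 0.978, and an RPIQ of about 9.5.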

RESULTS AND DISCUSSION

Descriptive analysis

For the entire set of 1 535 samples, the sand fraction varied from 1.0% to 99.0%, indicating a wide range of values, and the clay fraction varied from 0.0% to 77.0% (Table III). The silt content distribution was closer to the average, with a standard deviation of 11.3%. With the stratification of the dataset, the statistical distribution differed between the groups. For sand and clay contents, there were low variabilities in clusters C1, C2, and C3, while the lowest silt variability was observed in clusters C2 and C3. For clay content, the negative asymmetry in cluster C2 (-0.8) indicated a distribution closer to the maximum content of this fraction, whereas in cluster C3, the slightly positive distribution (0.3) indicated values close to the minimum content of this fraction. For both the sand and silt fractions, the asymmetry was positive. The variability of soil texture can be attributed to the large spatial distribution of the samples, different soil formation processes, and the different soil parent materials in the state of Santa Catarina (Embrapa, 2004). Soil color was redder and lighter in clusters C2 and C3. In short, the four clusters provided sample sets with well-defined patterns of an area with wide spatial coverage.

TABLE III Statistics of soil granulometric fractions from clusters C1–C4 acquired by stratification analysis of the initial group of 1 535 soil samples collected from the state of Santa Catarina, Brazil

PCA

The PCA showed that the first principal component (PC1) accounted for 59% of the total variation (Fig. 2a). The color parameters V, X, Y, L, R, G, K, and v contributed the highest percentages on this axis, with each parameter explaining on average 7%. The parameters with the highest positive contribution values were V, X, Y, L, R, G, and v (0.9), whereas the K parameter (-0.9) had the highest negative contribution (inverse direction) (Fig. 2c). The parameter V in the Munsell System, as well as the luminance parameter L and the v of the CIE Luv, refers to the amount of light that a given surface can emit or reflect. Soils with a predominance of quartz minerals (lighter) or organic material (darker) can be detected by their natural behavior of reflecting or absorbing light, respectively (Viscarra Rossel et al., 2006). The second principal component (PC2) explained 31% of the total variation and was associated with a strong influence of the parameters Ch, a, x, u, and M, where each parameter explained, on average, 12% of the variability (Fig. 2b). The parameters with the highest positive contribution values were Ch, a, M, x, and u (0.9), and the parameter h had the highest negative contribution (-0.6) (Fig. 2d). The a value of the CIE Lab model is related to the red color of soils, which may be influenced by iron (Fe) oxides and indicate a predominance of hematite. Thus, Vis-based color parameters can be used to determine the color and mineralogy and assist in the selection of the predominant soil color (Viscarra Rossel et al., 2003; Dominguez Soto et al., 2012). Simon et al. (2020) pointed out that quick prediction of soil color and a better cost-benefit ratio are key to planning soil management practices, as well as to prospecting and opening new areas.

Fig. 2 Contribution percentages (a and b) and values (c and d) of the first (PC1) (a and c) and second principal components (PC2) (b and d) of color parameters that best explained the variation in soil information. The 1 535 soil samples were collected from the state of Santa Catarina, Brazil. The contour lines represent contribution percentages increasing from 0% (center) to 8% at 1% intervals (a) and from 0% (center) to 14% at 2% intervals (b). See Table II for the color parameters.

Analysis of spectral curves

The interpretation of spectral curves requires inspection of the reflectance intensity, the shape of the curve along the spectrum (for example, ascending, descending, or flat), and the absorption characteristics, which are influenced by the physical, chemical, and mineralogical composition of soils. Clusters C1 and C4 presented ascending curves with high reflectance intensity (Fig. 3). This ascending behavior is explained by the higher sand content in these clusters. Quartz is the dominant mineral in the sand fraction and has no absorption features in the Vis-NIR-SWIR region, increasing the reflectance throughout the spectrum (Demattê et al., 2007; Wight et al., 2016). Clusters C2 and C3 had flat curves with average intensities close to 0.2, which generally indicate high contents of iron oxides and clay (Dalmolin et al., 2005). In fact, clusters C2 and C3 had high levels of clay (Table III), confirming this statement. Additionally, concavities at 480, 550, 850, and 900 nm indicated the presence of iron oxides (Dalmolin et al., 2005; Demattê et al., 2014; Moura-Bueno et al., 2019). Therefore, it is possible to infer some soil characteristics from soil spectral curves and select the most appropriate quantitative analysis, reducing costs and time spent in the laboratory.

Fig. 3 Spectral curves of soil clusters C1–C4 acquired by stratification analysis of the initial group of 1 535 soil samples collected from the state of Santa Catarina, Brazil. Kt = kaolinite.

Fractional derivatives of the stratified sets

The reflectance spectra of clusters C1, C2, C3, and C4 with fractional order derivatives at different intervals are shown in Fig. 4. A spectrum is generally slightly different from the original at intervals of 0.5. For clusters C1 and C4, there are three well-defined absorption features at approximately 1 400, 1 900, and 2 200 nm at intervals of 0.5. The first and second features refer to the stretching vibrations of the OH⁻ and H₂O groups of the clay minerals, which are visually identified in these regions (Nocita et al., 2014; Gholizadeh et al., 2016; Dotto et al., 2018). For the third feature, the absorption characteristics are attributed to clay minerals, mainly kaolinite and illite (Camargo et al., 2018; Zhao et al., 2018). When the intervals increased from 0.5 to 1.5, the three aforementioned absorption characteristics became more evident. Also, the absorption characteristics in the Vis region (bands 480 and 550 nm) were well defined due to iron oxides, indicating goethite and hematite (Dalmolin et al., 2005; Moura-Bueno et al., 2019). For clusters C1 and C4, the absorption at 480 nm was more evident, indicating a higher content of goethite; for clusters C2 and C3, the absorption at 550 nm was more evident, indicating a higher content of hematite (as shown in Table III). At 1 400 nm, there were changes in absorption and the spectral feature gradually changed to an undulating shape (positive and negative peaks) (Zhang et al., 2020).

Fig. 4 Spectral curves of soil clusters C1–C4 with fractional order derivatives at intervals of 0.5, 1.5, and 2. The four soil clusters were acquired by stratification analysis of the initial group of 1 535 soil samples collected from the state of Santa Catarina, Brazil. The grey areas represent the standard deviations of the spectral curves.

When the intervals increased from 1.5 to 2 in the Vis region, all clusters showed two positive peaks at 420 and 490 nm (caused by goethite and hematite, respectively). However, clusters C2 and C3 showed another positive peak at 523 nm, confirming that these two clusters have higher contents of hematite (Dotto et al., 2018; Fang et al., 2018).

Initial models(stratified and without stratification)

Based on the color parameters, the spectral library was stratified into different clusters as shown in Table IV. The validation results of the Cubist model showed that the structure of the input data generated differences in the performance of prediction of clay, sand, and silt fractions. Regarding the SMs for clay content, the most accurate results were obtained for clusters C2 (R² = 0.84 and RMSE = 5.7%) and C3 (R² = 0.84 and RMSE = 6.1%). The better performance of clusters C2 and C3 in predicting the clay content is associated with the higher clay content in these clusters. The spectral absorption characteristics of clay, especially in the Vis range (480 and 550 nm), are related to iron oxides (Dalmolin et al., 2005), and in the SWIR range (1 400 and 1 900 nm) they are due to the stretching vibrations of OH⁻ and H₂O in the clay mineral structures (Ben-Dor et al., 2008; Stenberg et al., 2010; Gholizadeh et al., 2016). However, clusters C2 and C3 also showed the least variability in sand content, which may have contributed to the better performance of the models. Regarding the RMSE values, the results for the sand fraction indicated that the C2 (4.9%) and C3 (4.5%) models had better predictive capacities. These clusters showed the lowest variability and the lowest content of sand fraction (Table III). For silt, the results of the C3 model were similar with the four pre-processing techniques (R² = 0.58, RMSE = 5.7%, and RPIQ = 1.96), and the other models showed low performance, both for SMs and WSM, which is similar to the performance obtained in other studies (Dotto et al., 2017; Pinheiro et al., 2017; Zhang et al., 2017; Vasava et al., 2019).

Many studies obtained better performance of predictive models when they used pre-processed spectral data to estimate soil texture (Dotto et al., 2017; Demattê et al., 2019; Jaconi et al., 2019). However, Coblinski et al. (2020) obtained better results without pre-processing spectral data. In the present study, the best performances of the models for the clay fraction were obtained with SMO and MSC pre-processing (Table IV). For sand, SMO, SNV, and MSC pre-processing achieved the best performances.

The results obtained for the SMs (Table IV) showed that soil characteristics had a greater impact on the predictive capacity of the models than the spectral pre-processing method, confirming the potential of the stratification strategy to provide more homogeneous sets in the attempt to get good prediction models. According to the results, the accuracy of SMs (with an average RMSE value of 6.9% for clay and 8.6% for sand) was comparatively higher than in most previous studies, where RMSE values ranged from 1.9% to 18% for clay and 3.8% to 24% for sand (Table I). Zhang et al. (2017) applied the cubist algorithm to a local scale dataset of 257 samples that had clay contents ranging from 3% to 88% and obtained R² of 0.70 and RMSE of 14.7% for validation. Likewise, Dotto et al. (2017) used a local scale dataset of 299 soil samples with clay contents ranging from 21% to 78% and sand ranging from 1% to 35%, and obtained R² of 0.62 and RMSE of 6.8% for clay and R² of 0.25 and RMSE of 6.4% for sand. More recently, Demattê et al. (2019) used 39 284 samples at a national scale in the prediction of soil properties from all Brazilian states. In their study, clay content varied from 0% to 98.7% and sand content varied from 0% to 99%. The authors obtained R² of 0.88 and RMSE of 7.6% for clay and R² of 0.87 and RMSE of 10.3% for sand. However, the prediction results obtained by Coblinski et al. (2020) from 197 samples at a local scale, R² of 0.89 and RMSE of 5.1% for clay and R² of 0.81 and RMSE of 6.5% for sand, were better than the results obtained in the present study. The better performance of the models in that study may be due to the smaller number of samples used and the homogeneity of the database (local spectral library).

TABLE IV Validation results of the stratified (i.e., clusters C1–C4) models and the model without stratification (i.e., all samples) for soil texture using different pre-processing methods for the 1 535 soil samples collected from the state of Santa Catarina, Brazil

Regarding the performance of cluster C4, the RMSE values were higher than those of the other clusters (7.5% to 8.3% for clay and 12.1% to 13.1% for sand) (Table IV). This can be attributed to the amplitude of the data in cluster C4, even after stratification. These results are consistent with published studies where more homogeneous soil sets generally provided better performances for precision models (Araújo et al., 2014; Jiang et al., 2017; Moura-Bueno et al., 2019). Jiang et al. (2017) collected soil samples in surface (0–0.1 m) and subsurface (0.1–0.3 m) layers in central China and found that differences in the variability of carbon levels for each soil depth affected the performance of the prediction models. Araújo et al. (2014) segregated data from a national spectral library in an attempt to improve the prediction accuracy of clay content. The authors concluded that dividing the library into subsets improved soil property estimates with a 21% reduction in the prediction error for clay.

After the soil samples were grouped, the variability of both the spectra and the sample composition decreased, mainly in clusters C2 and C3, where the groups had smaller variation in the soil texture fractions and spectral data. Cluster C1 also showed low variability in clay content, as confirmed by its standard deviation (Table III). However, model performance in this group was lower than in the other groups, which can be explained by the lower clay content in this cluster. Therefore, when models are built from stratified spectral libraries, both the data variability and the contents of the fractions that make up these libraries must be considered in the analysis of model performance.

Validation between groups

To verify the applicability of the initial models, clusters with smaller standard deviations were used to estimate the sand and clay fractions of the clusters with larger standard deviations. The clay and sand fractions were considered in this analysis because their models performed better in the previous step. The models were not recalibrated because the same calibration and validation sets from the previous step were used. Clusters C3 and C4, which had the lowest and highest standard deviations, respectively (Table III), were used for sand. Clusters C1 and C4, which had the lowest and highest standard deviations, respectively, were used for clay. Compared with the validation results of the initial stratified models, the models applied across clusters gave poor predictions of sand and clay contents (Fig. 5). For sand prediction in cluster C3, RMSE increased from 5.4% to 10.7%, R² decreased from 0.80 to 0.26, and RPIQ decreased from 3.04 to 2.01. Furthermore, the prediction of cluster C4 when calibrated with cluster C3 revealed low model performance (RMSE increased from 12.1% to 14.8% and R² decreased from 0.78 to 0.20). For clay content, validation between clusters likewise showed that the models were not very accurate. This was observed mainly in cluster C1, where RMSE increased from 7.0% to 8.4% and R² decreased from 0.70 to 0.57.

Fig. 5 Comparisons of root mean square error (RMSE), coefficient of determination (R²), and ratio of performance to interquartile range (RPIQ) of the initial models in the prediction of sand and clay fractions to validate their applicability between clusters with high and low standard deviations. CC1 = calibration using cluster C1; CC3 = calibration using cluster C3; CC4 = calibration using cluster C4; VC1 = validation using cluster C1; VC3 = validation using cluster C3; VC4 = validation using cluster C4. Soil clusters C1–C4 were acquired by stratification analysis of the initial group of 1 535 soil samples collected from the state of Santa Catarina, Brazil.
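The three statistics compared in Fig. 5 can be computed as in the following minimal Python sketch. It assumes that R² is defined as one minus the ratio of the residual to the total sum of squares (some studies report the squared correlation instead) and that RPIQ is the interquartile range of the observed values divided by RMSE. The function names and the five-sample data are illustrative only and are not taken from the study.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpiq(observed, predicted):
    """Ratio of performance to interquartile range: IQR(observed) / RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


# Illustrative clay contents (%), not data from the study.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 37.0, 52.0]

print(f"RMSE = {rmse(observed, predicted):.2f}%")
print(f"R2   = {r_squared(observed, predicted):.2f}")
print(f"RPIQ = {rpiq(observed, predicted):.2f}")
```

Because RPIQ scales the error by the spread of the observed values, it allows clusters with different ranges of clay or sand content to be compared more fairly than RMSE alone.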

This additional calibration step between the different clusters in the validation process was necessary to confirm that stratification can improve the prediction of soil texture. In addition, the models performed poorly when used with samples that are more heterogeneous or have a higher standard deviation. These results corroborate the findings of Jiang et al. (2017), who showed that cross-validation models had poor estimates of carbon content of subsurface layers when surface layer samples with higher standard deviations were used.
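The logic of validating between clusters can be illustrated with a small Python sketch. It is not the cubist workflow used in the study: an ordinary least-squares line stands in for the model, and the data are synthetic, generated from a hypothetical nonlinear relation between one spectral feature and clay content. All names and values are assumptions made for illustration. The sketch calibrates on a narrow (low standard deviation) cluster, then compares the error on held-out samples from that cluster with the error on a wider, more heterogeneous cluster.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_response(x):
    """Hypothetical nonlinear link between a spectral feature and clay (%)."""
    return 0.02 * x ** 2


# Synthetic clusters: a narrow one (feature 40 to 50) and a wide one (0 to 60).
narrow_x = [40 + 0.5 * i for i in range(21)]
wide_x = [3.0 * i for i in range(21)]

# Split the narrow cluster into calibration and hold-out validation samples.
calibration_x = narrow_x[::2]
holdout_x = narrow_x[1::2]

slope, intercept = fit_line(calibration_x, [true_response(x) for x in calibration_x])


def predict(xs):
    return [slope * x + intercept for x in xs]


# Validation within the same cluster versus transfer to the wide cluster.
rmse_within = rmse([true_response(x) for x in holdout_x], predict(holdout_x))
rmse_transfer = rmse([true_response(x) for x in wide_x], predict(wide_x))

print(f"RMSE within narrow cluster: {rmse_within:.2f}")
print(f"RMSE on wide cluster:       {rmse_transfer:.2f}")
```

In this toy setting the error grows sharply on the wide cluster because the model is asked to extrapolate beyond the range it was calibrated on, which is one plausible reading of the loss of accuracy reported between clusters.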

From an analytical and agronomic point of view, stratification of spectral library samples by soil color can be used in commercial soil analysis laboratories as a complementary method for guiding the analysis and saving time and money.

CONCLUSIONS

The stratification of a large-scale soil spectral library improved the predictive capacity of the models in estimating the contents of soil fractions. This is due to the influence of soil variability typically found in these libraries, which can interfere with spectral behavior. Therefore, the structure of the input data should be considered in the construction of more accurate prediction models.

Regarding prediction based on SMs, two factors had the greatest impacts on the results: soil variability and the contents of clay, silt, and sand fractions. The smaller standard deviation and the higher clay content observed in soil clusters C2 and C3 led to good clay estimates, with an average reduction in prediction error of 5% compared with WSM. The lower variability in soil clusters C2 and C3 also led to a better ability to predict sand, with an average reduction in prediction error of 22%. The SMs for soil clusters with higher contents of clay and iron oxides obtained the highest accuracy, with the most important spectral bands identified mainly in the 480–550 and 850–900 nm ranges and at 1 400, 1 900, and 2 200 nm.
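Percentage changes in prediction error such as those quoted above can be expressed as a relative change in RMSE. The short Python sketch below assumes this definition (the exact formula is not stated here) and uses a hypothetical helper, relative_change. As inputs it takes the RMSE pairs reported in the validation between clusters, so the results are increases in error; a reduction would appear as a negative value.

```python
def relative_change(rmse_before, rmse_after):
    """Relative change in RMSE (%); positive = error increased, negative = reduced."""
    return (rmse_after - rmse_before) / rmse_before * 100.0


# RMSE pairs (%) reported for the validation between clusters.
cases = {
    "sand, cluster C3": (5.4, 10.7),
    "sand, cluster C4": (12.1, 14.8),
    "clay, cluster C1": (7.0, 8.4),
}

changes = {name: relative_change(before, after) for name, (before, after) in cases.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}% change in RMSE")
```

Under the same assumed definition, a stratified model whose RMSE is 22% lower than that of the reference model would give a value of -22.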

Validation between clusters showed poor performance when the initial models were calibrated and validated with datasets with greater standard deviations. This confirms the initial hypothesis and explains the influence of very heterogeneous data on model accuracy. Therefore, stratification of soil samples from large-scale spectral libraries is a good strategy for improving regional assessments of soil resources and reducing prediction errors in qualitative determination of soil properties.

ACKNOWLEDGEMENTS

The authors thank the Coordination for the Improvement of Higher Education Personnel (CAPES) (Finance Code 001) and the National Council for Scientific and Technological Development (CNPq), Brazil for the Ph.D. scholarships, and the Biodiversity Research Program, Atlantic Forest, Santa Catarina (PPBio-MA-SC) and the Agricultural Research and Rural Extension Corporation of Santa Catarina (EPAGRI), Brazil for providing the data that make up the Brazilian Soil Spectral Library (BSSL). The second author also thanks the CNPq for the research productivity grant.