ABOVEGROUND BIOMASS ESTIMATION IN A TROPICAL FOREST WITH SELECTIVE LOGGING USING RANDOM FOREST AND LIDAR DATA

The tropical forest is characterized by expressive biomass and stores high amounts of carbon, which is an important variable for climate monitoring. Thus, studies aiming to analyze suitable methods to predict biomass are crucial, especially in the tropics, where dense vegetation makes modeling difficult. Thus, the objective of the present study was to estimate aboveground biomass (AGB) in a tropical forest area with selective logging in the Amazon forest using the Random Forest (RF) machine learning algorithm and LiDAR data. For this, 85 sample units were used at Fazenda Cauaxi, in the municipality of Paragominas, Pará State. LiDAR data were collected in 2014 and made available by the Sustainable Landscapes Project. The software R was used for data analysis. Among the LiDAR metrics, the average height was used as it had the greatest significance to compose the model. The model presented a pseudo R2 of 0.69 (value obtained by the RF), Spearman's Correlation Coefficient of 0.80, RMSE of 47.05 Mg.ha (19.84%), and Bias of 2.06 Mg.ha (0.87%). With the results, it was possible to infer that the average height metric was enough to estimate AGB in a tropical forest with selective logging, in addition, the RF algorithm the biomass to be estimated, which can be used to assist in monitoring and action management in areas of selective logging and serve as a basis for climate change mitigation policies.


INTRODUCTION
Tropical forests are one of the most biodiverse habitats known and cover 45% of the total forest area of the world (D' ANNUNZIO et al., 2017). These biomes represent two-thirds of the total terrestrial area of the planet and are characterized by expressive biomass and storing high amounts of carbon (PAN et al., 2013). Despite their importance, tropical forests have been destroyed at a rapid pace, with the forest typology having lost the largest area (D'ANNUNZIO et al., 2017). Among tropical forests, the Amazon rainforest has experienced the most deforestation (ISU, 2015), as selective logging in recent years has resulted in the widespread logging of large trees. FLORESTA, Curitiba, PR, v. 50, n. 4, p. 1873-1882, out/dez 2020. Marchesan, J. et.al Selective logging contributes to carbon flux and influences the fragmentation of the remnants (ASNER et al., 2009). In this context, estimating biomass makes it possible to determine carbon emissions caused by deforestation and analyze the impacts of selective logging. Quantifying carbon is fundamental to develop forest monitoring systems and implement climate change mitigation policies, such as the program for Reducing Emissions from Deforestation and Forest Degradation (REDD+) (GARCIA et al., 2017).
Measuring AGB in tropical forests is costly due to extensive and difficult-to-access areas. In these conditions, advances in remote sensing have allowed for the mapping and monitoring of such regions, making it possible to obtain reliable information within short periods (ALVES et al., 2013).
Selective logging is characterized by the cutting of large trees, which, unlike deforestation, causes diffuse thinning and makes monitoring even more difficult (ASNER et al., 2009). Thus, LiDAR (Light Detection and Ranging) technology stands out as it allows detailed information on vegetation to be obtained and accurately estimates biophysical variables (KUMAR; MUTANGA, 2017) since they correlate with the metrics derived from LiDAR data (LI et al., 2017).
Therefore, several studies have been carried out to predict different biophysical variables, including aboveground biomass (AGB), in which research has demonstrated the capacity of LiDAR data to estimate this variable in different ecosystems (D'OLIVEIRA et al., 2012;GARCIA et al., 2017;HANSEN et al., 2015). However, further studies are required to evaluate the methods of estimating AGB in tropical forest areas because, due to dense vegetation, errors in modeling may occur (D'OLIVEIRA et al., 2012). Thus, research aimed at analyzing the different AGB prediction methods is crucial to verify which are suitable for each forest typology.
AGB can be measured through data from remote sensing techniques using parametric and non-parametric methods. However, flexible methods have become more common in recent years, often being non-parametric (SHAO; ZHANG, 2016). Among such methods, the Random Forest (RF) machine learning algorithm stands out and has been used in several studies (HUDAK et al., 2012;LI et al., 2017). Machine learning algorithms are an alternative to parametric methods since they can be used when the data are heterogeneous and do not present normality, as is the case with tropical forests (MONTAÑO et al., 2017).
Given the above, LiDAR data allow for the gathering of AGB estimates, although further studies are required to assess their use to estimate tropical forests. Thus, the objective of this study was to estimate the AGB of a tropical forest area with selective logging located in the Amazon forest by using the RF machine learning algorithm and LiDAR data.

Study area location and characterization
The study area comprises Fazenda Cauaxi, which is located in the municipality of Paragominas (Pará State, Brazil; Figure 1), and located between the UTM coordinates 9583090 and 9587407 mN, and 778074 and 781384 mE west of the Central Meridian (51º), Zone 22 S, totaling 1,216 ha.
The area under study is divided into twelve logging units, ten of which have been registered as reducedimpact logging areas since 2007 (SILVA et al., 2017a). The Tropical Forest Institute (IFT) has been carrying out forestry activities at Fazenda Cauaxi since 1995. The predominant forest typology is the Dense Ombrophilous Forest and is characterized by having perennial and rarely deciduous trees, high biomass, and closed canopy. The climate is humid, with an average annual rainfall of 2,200 mm and an average annual temperature of approximately 25 °C (ALVAREZ et al., 2013).

Field Data
Twenty-two transects (20 x 500 m) were randomly stratified in the study area in 2012, and 88 sample units (50 x 50 m) were spaced at 100 m intervals along the transects in 2014. Subunits (5 x 50 m) were stratified in each sample unit. The coordinates of the sample units were obtained by differential GNSS (Global Navigation Satellite System) (GeoXH6000, Trimble Navigation, Ltd.).
In the sample units, trees with DBH ≥ 35 cm were measured, while all trees with DBH ≥ 10 cm were measured in the subunits, totaling 1,793 trees measured. Three sample units were excluded from the analysis, as they were not included by the LiDAR coverage due to the overflight not encompassing the entire extent measured in the inventory. The AGB of each tree was estimated by using the equation created by Chave et al. (2014). As described by the authors, the pantropical equation was elaborated using 4,004 trees located in tropical regions (Latin America, Afro-tropical areas, and Southeast Asia and Australia), totaling 58 tree collection sites. Thus, the model generated contributed to improving the accuracy of the tropical vegetation biomass. Where: AGB (kg) is the living aboveground biomass in kg; DBH is the diameter at breast height (measured 1.3 m above ground level) in cm; E is a measure of environmental stress (E = -0.103815), and ρ is the basic density of the wood in kg.m -³. The total biomass for each sample unit was obtained by adding the biomass of the sample units and subunits extrapolated to hectares (Mg.ha -1 ).

LiDAR data and processing
LiDAR data were collected at Fazenda Cauaxi in 2014 by using a scanner (Optech ORION M300) with an emission capacity of 37.5 laser pulses per square meter of area. The coverage area of the overflight was 1,216 ha.
The data were acquired by the Sustainable Landscapes Project (https://www.paisagenslidar.cnptia.embrapa.br/webgis/), which was developed in partnership between the Brazilian Agricultural Research Corporation (EMBRAPA) and the United States Forest Service.
LiDAR data were processed using the software Fusion/LDV version 3.6. Pulse density was normalized using ThinData, which randomly reduces the number of pulses according to the desired size. The Catalog tool was used to evaluate the quality of the data, which is one of the initial stages of processing. The GroundFilter tool was used to filter the point cloud located on the terrain surface, with which it was possible to generate the Digital Terrain Model (DTM) through the GridSurfaceCreate tool with a spatial resolution of 1.0 m. Then, the points were normalized using the ClipData tool, DTM switches, and minimum and maximum height in order to ensure that the height corresponded to the one above the ground. With the CanopyModel tool, the Canopy Height Model (CHM) was generated with a resolution of 1.0 m. The sampling units were cut by using the PolyClipData tool and the metrics derived from the data were gathered for each sample unit of interest by using the CloudMetrics tool. FLORESTA, Curitiba, PR, v. 50, n. 4, p. 1873-1882

AGB modeling
To estimate AGB, it was necessary to select the most important LiDAR metrics for modeling. For this purpose, the RF algorithm was first used to demonstrate the importance value of each variable. Afterwards, the Model Improvement Ratio (MIR) method was used as a criterion to select the most relevant metrics, as it calculates the importance value based on standardized variables (MURPHY et al., 2010). This way, the metrics that yielded the highest MIR values were selected to compose the model. When the selected predictor variables showed a high correlation with each other (r > 0.9), as proposed by Hudak et al. (2012), only one of the variables remained, eliminating the others (SILVA et al., 2017b). The software R (version 3.5.0) was used for data analysis.
Due to the non-normal data distribution verified by the Shapiro-Wilk normality test, non-parametric regression was chosen to obtain the AGB estimate in the study area. Thus, the RF machine learning algorithm was used as it allows a high set of sample trees to be integrated based on a deterministic technique. For this, the algorithm randomly selects a set of variables (SHAO; ZHANG, 2016) and divides the nodes of each tree using the best variables from a random sample.
Therefore, the randomForest package, which is available in R, was used for running the algorithm, requiring the definition of two parameters: mtry, which corresponds to the number of variables randomly sampled as candidates for each division, using the standard value for regression of p/3, where p is the number of variables; and ntree, which refers to the total number of trees (500) to be added to the model, as it was the one best suited for the data.
The model fit was evaluated by using Spearman's Correlation coefficient (r) due to the non-normality of the training data, Pseudo Coefficient of Determination (Pseudo R²), given directly by the RF model, Root Mean Square Error (RMSE), and absolute and relative bias.
The results were validated by using 20% of the data and the remaining 80% used to train the algorithm. The validation was done by the paired Wilcoxon test, being possible to evaluate whether the predicted and observed values differed significantly with a 95% significance level. The methodological procedures performed in this study can be viewed in the flowchart (Figure 2).

Variable selection
With the calculation of the AGB for the 85 sample units under study, the lowest value was 61.46 Mg.ha -1 and the highest value was 527.31 Mg.ha -1 , averaging 229.57 Mg.ha -1 . The Shapiro-Wilk test indicated that the data did not show normality (p-value = 0.0058), which is due to the discrepancy between the AGB values obtained in the sample units as a result of the existence of gaps in certain areas combined with sampling units with dense vegetation. A comparison of two different areas located at Fazenda Cauaxi is shown in Figure 3. The first area (Figure 3a) is characterized by the presence of gaps, with sparse trees and some larger ones, while the forest is denser in the second area ( Figure 3b) with a predominance of large trees. By processing the LiDAR point cloud in the Fusion/LDV Software, 87 metrics were obtained for each sample unit, which are statistical parameters that use the intensity and elevation of the points (MCGAUGHEY, 2018). The metrics with the highest correlation with AGB in the study area are described in Table 1. LiDAR metrics derived from the processing of the point cloud in the Fusion/LDV Software for the sample units. The metrics represent the height of the points in the study area; (2) Only the LiDAR metrics with significant Spearman's correlation coefficient; *** p-value < 0.001.
The variable importance plot obtained by the RF algorithm is shown in Figure 4. The importance value is represented on the x-axis for each LiDAR metric (y-axis) with significant Spearman's correlation (p-value <0.001) with AGB. Thus, via the graph, it was possible to infer that the metrics of mean height (Hmean) and height in the moment 1 (HL1) were the ones that yielded the highest importance value, followed by the height quadratic mean (HQmean). By using the MIR value found for each metric, it was possible to select the mean height (Hmean), height in the moment 1 (HL1), and the height quadratic mean (HQmean) as the most important to compose the model. However, the first two were highly correlated (r > 0.9), since HL1 also refers to the mean height. Therefore, Hmean was kept as a predictor variable. HQmean did not influence the improvement of the model performance, therefore, it was excluded.

AGB modeling
The RF model using the Hmean metric yielded a Pseudo R² of 0.69 and a Spearman's correlation coefficient of 0.80 with the AGB obtained in the field. Considering the error found, the RMSE was 47.05 Mg.ha -1 , which is equivalent to 19.84%. The Bias, in turn, was 2.06 Mg.ha -1 , corresponding to 0.87%. A scatter plot was created (Figure 5a) in order to visualize the adjustment between the AGB obtained in the field and the LiDAR point cloud data.
To corroborate the results, the residual plot was generated (Figure 5b). The RF model showed residue of the positive and negative amplitude of approximately 150 Mg.ha -1 , thereby corroborating the dispersion plot, in which it is possible to observe that overestimated values occurred in some sample units and underestimated ones occurred in others.
The Wilcoxon test demonstrated that the estimated and observed values did not differ significantly (pvalue = 0.5171) since it was higher than the acceptance threshold of 0.05. In addition, it is possible to view the scatter plot between the estimated and observed values of the sample validation units ( Figure 6). Visually, through the scatter plot ( Figure 6a) and density plot (Figure 6b), it was noted that the AGB values found in the field (observed AGB) were similar to those obtained using the RF algorithm (estimated AGB), corroborating the result obtained by the Wilcoxon test.

DISCUSSION
The AGB values found in the study area were similar to those found in other areas located in the same biome, such as by Benítez et al. (2016), who obtained average AGB of 195.80 Mg.ha -1 and d'Oliveira et al. (2012) and Andersen et al. (2014), who analyzed regions of the Amazon biome with selective logging and found AGB values that varied between 96.9 and 493.6 Mg.ha -1 , averaging 230.9 Mg.ha -1 .
Among the LiDAR metrics obtained, those related to tree height were the ones that showed the highest correlation with AGB, with the average height being used to compose the model. Hansen et al. (2015) highlighted that of the metrics from the LiDAR point cloud, the average height is one of the most used for predicting AGB in tropical forests. Moreover, d'Oliveira et al. (2012) found one of the best adjustments for the average height, of the metrics tested, to estimate AGB in tropical forest areas. In addition, Silva et al. (2017a) used this metric to estimate the stock and changes in AGB in different DTM scenarios and pulse densities in the same field of study as the present work and found satisfactory results.
The model adjustment yielded by the RF algorithm was similar to other studies carried out in tropical forest areas. While using a multiple regression model with R² of 0.70 and RMSE of 41.50 Mg.ha -1 , Andersen et al. (2014) analyzed selective logging areas in the Amazon with LiDAR data and reported values close to those acquired here. Further corroborating the results found here, Silva et al. (2017a) obtained an R² varying between 0.60 and 0.73 and an RMSE between 18.81 and 22.80% depending on the scenario and pulse density used, although the authors used parametric regression. Clark et al. (2011) also found similar RMSE values (38.30 Mg.ha -1 ) when using two metrics, average and maximum height. Given the above, it was possible to infer that the RF algorithm combined with LiDAR data yielded good estimates of AGB in tropical forest areas with selective logging.
However, by analyzing the scatter plot of the validation samples, a trend in the RF model can be noticed, in which low AGB values are overestimated and high values are underestimated, which was also reported by other authors (OTA et al., 2014;SILVA et al., 2017b). This may have occurred due to the RF model estimating values by using the average of several decision trees, which tends to underestimate when the estimated values are close to the maximum value of the training data and, similarly, tends to overestimate when they are close to the minimum value of the training data (OTA et al., 2014). Another possible cause, in the case of the present work, was the low number of sample units used for testing.
Moreover, Silva et al. (2017b) pointed out that a disadvantage of RF is that it does not allow the prediction to be extrapolated beyond the trained data, increasing errors when using a new data source, thus corroborating the results of this study. Additionally, Verrelst et al. (2012) found that the performance of validation using machine learning algorithms showed instability, with an increase in errors. Thus, the authors highlighted that it is important to analyze the precision of the trained model through validation instead of just analyzing the training performance.
It should be noted that some factors influenced the variance values not explained by the regression model in the study area. Among these, the discrepancy between the minimum and maximum AGB obtained in the sample units due to the presence of gaps in contrast to those with dense vegetation. There may also have been an incorrect registration of the sample units in the field caused by the dense canopy and high diameter and height of the trees, which prevent the accurate acquisition of coordinates by GNSS. Notably, the signal between the satellite and GNSS receiver may often be completely blocked in these areas, as described by d' Oliveira et al. (2012).
Nevertheless, the RF algorithm has the advantage of being a non-parametric method, which does not need to meet the conditions of the regression, and can be used when there is a small number of observations, thus allowing for good estimates (LI et al., 2017). Therefore, the RF is a suitable method for tropical forest areas, in which often the data are not normal and there is a small number of sample units that are allocated to field research due to difficult-to-access areas that hinder making measurements. Montaño et al. (2017) highlighted that machine learning techniques can replace allometric models and are a viable and safe alternative for regression analyses. In this context, machine learning algorithms have several advantages that make them an alternative to parametric models. Penner et al. (2013) described that one of the advantages of machine learning algorithms is that they use cross-validation during model development, thus producing robust models. Additionally, machine learning algorithms are preferable when several independent variables are making up the model.

CONCLUSION
• This study analyzed the use of the RF algorithm to estimate AGB in a tropical forest using LiDAR data. It was possible to infer that the integration of LiDAR data with a machine learning algorithm had the potential to estimate AGB in areas of tropical forest with selective logging, despite the natural condition of the data, which presented a discrepancy between the minimum and maximum AGB of the sample units in the study area. • In this context, the method employed may be used to assist in the monitoring of native forest areas with selective logging and the managing of said areas. In addition, the same may be applied to other selectively explored tropical forest regions in order to assist in the monitoring of biomass and, consequently, serve as a basis for climate change mitigation policies such as REDD+.