14.5 Conclusions
In this chapter we have ordinated the community matrix of the lomas Mt. Mongón with the help of an NMDS (Section 14.3). The first axis, representing the main floristic gradient in the study area, was modeled as a function of environmental predictors, some of which were derived through R-GIS bridges (Section 14.2). The mlr package provided the building blocks to spatially tune the hyperparameters mtry, sample.fraction and min.node.size (Section 14.4.1). The tuned hyperparameters served as input for the final model, which in turn was applied to the environmental predictors for a spatial representation of the floristic gradient (Section 14.4.2). The result demonstrates spatially the astounding biodiversity in the middle of the desert. Since lomas mountains are heavily endangered, the prediction map can serve as a basis for informed decision-making on delineating protection zones, and for making the local population aware of the uniqueness found in their immediate neighborhood.
In terms of methodology, a few additional points could be addressed:
- It would be interesting to also model the second ordination axis, and to subsequently find an innovative way of visualizing jointly the modeled scores of the two axes in one prediction map.
- If we were interested in interpreting the model in an ecologically meaningful way, we should probably use (semi-)parametric models (Muenchow, Bräuning, et al. 2013; Zuur et al. 2009, 2017). However, there are approaches that help to interpret machine learning models such as random forests (see, e.g., https://mlr-org.github.io/interpretable-machine-learning-iml-and-mlr/).
- A sequential model-based optimization (SMBO) might be preferable to the random search for hyperparameter optimization used in this chapter (Probst, Wright, and Boulesteix 2018).
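Regarding the first point, one conceivable way of showing two modeled axes in a single map is to encode each axis in a separate color channel. The following is only a minimal sketch of that idea: the scores are simulated rather than taken from the chapter's data, and the object `scores` and the helper `rescale01()` are hypothetical names introduced here for illustration.

```r
# Hypothetical modeled scores for two ordination axes
# (simulated, not the chapter's data)
set.seed(1)
scores = data.frame(
  axis1 = runif(100, min = -1, max = 1),
  axis2 = runif(100, min = -0.5, max = 0.5)
)

# Rescale a numeric vector to the range [0, 1]
rescale01 = function(x) (x - min(x)) / (max(x) - min(x))

# Map axis 1 to the red channel and axis 2 to the blue channel;
# the green channel is held at zero
cols = rgb(
  red = rescale01(scores$axis1),
  green = 0,
  blue = rescale01(scores$axis2)
)

# Each observation (or raster cell) now has one color encoding both scores
head(cols)
```

Whether such a bivariate color scheme is readable in practice depends on the data, and a two-dimensional legend would be needed to interpret the resulting map.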
Finally, please note that random forest and other machine learning models are frequently used in a setting with lots of observations and many predictors, many more than used in this chapter, and where it is unclear which variables and variable interactions contribute to explaining the response. Additionally, the relationships might be highly non-linear. In our use case, the relationships between response and predictors are pretty clear, there is only a slight amount of non-linearity, and the number of observations and predictors is low. Hence, it might be worth trying a linear model. A linear model is much easier to explain and understand than a random forest model, and is therefore to be preferred (law of parsimony). Additionally, it is computationally less demanding (see Exercises). If the linear model cannot cope with the degree of non-linearity present in the data, one could also try a generalized additive model (GAM). The point here is that the toolbox of a data scientist consists of more than one tool, and it is your responsibility to select the tool best suited for the task or purpose at hand. Here, we wanted to introduce the reader to random forest modeling and how to use the corresponding results for spatial predictions. For this purpose, a well-studied dataset with known relationships between response and predictors is appropriate. However, this does not imply that the random forest model has returned the best result in terms of predictive performance (see Exercises).
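The idea of checking whether a simpler model suffices can be sketched with base R alone. The example below is an illustration under stated assumptions, not the chapter's analysis: the data are simulated, the helper `cv_rmse()` is a hypothetical name, a quadratic term stands in for a more flexible model such as a GAM, and the folds are assigned at random, whereas spatial data such as those used in this chapter call for spatial cross-validation.

```r
# Simulated data with a mostly linear relationship
# (hypothetical, not the chapter's data)
set.seed(42)
n = 100
d = data.frame(x = runif(n, 0, 10))
d$y = 2 + 0.5 * d$x + 0.02 * d$x^2 + rnorm(n, sd = 0.5)

# Assign each observation to one of five folds at random
# (assumption: no spatial structure; use spatial folds for spatial data)
k = 5
folds = sample(rep(seq_len(k), length.out = n))

# Cross-validated RMSE for a model formula; the response is assumed
# to be stored in the column `y`
cv_rmse = function(formula, data, folds) {
  errors = vapply(sort(unique(folds)), function(i) {
    fit = lm(formula, data = data[folds != i, ])
    pred = predict(fit, newdata = data[folds == i, ])
    sqrt(mean((data$y[folds == i] - pred)^2))
  }, numeric(1))
  mean(errors)
}

# Compare a plain linear model with a slightly more flexible alternative
rmse_linear = cv_rmse(y ~ x, d, folds)
rmse_quadratic = cv_rmse(y ~ poly(x, 2), d, folds)
c(linear = rmse_linear, quadratic = rmse_quadratic)
```

If the two error estimates are close, the law of parsimony favors the simpler model. The same comparison could be extended to a GAM or a random forest by swapping the fitting function inside `cv_rmse()`.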