11.1 Introduction
Statistical learning is concerned with the use of statistical and computational models for identifying patterns in data and predicting from these patterns. Due to its origins, statistical learning is one of R’s great strengths (see Section 1.3). Statistical learning combines methods from statistics and machine learning, and its methods can be categorized into supervised and unsupervised techniques. Both are increasingly used in disciplines ranging from physics, biology and ecology to geography and economics (James et al. 2013).
This chapter focuses on supervised techniques in which there is a training dataset, as opposed to unsupervised techniques such as clustering. Response variables can be binary (such as landslide occurrence), categorical (land use), integer (species richness count) or numeric (soil acidity measured in pH). Supervised techniques model the relationship between such responses, which are known for a sample of observations, and one or more predictors.
The primary aim of much machine learning research is to make good predictions, as opposed to statistical/Bayesian inference, which is good at helping to understand underlying mechanisms and uncertainties in the data (see Krainski et al. 2018). Machine learning thrives in the age of ‘big data’ because its methods make few assumptions about input variables and can handle huge datasets. Machine learning is conducive to tasks such as the prediction of future customer behavior, recommendation services (music, movies, what to buy next), face recognition, autonomous driving, text classification and predictive maintenance (infrastructure, industry).
This chapter is based on a case study: the (spatial) prediction of landslides. This application links to the applied nature of geocomputation, defined in Chapter 1, and illustrates how machine learning borrows from the field of statistics when the sole aim is prediction. Therefore, this chapter first introduces modeling and cross-validation concepts with the help of a Generalized Linear Model (GLM; Zuur et al. 2009). Building on this, the chapter implements a more typical machine learning algorithm, namely a Support Vector Machine (SVM). The models’ predictive performance will be assessed using spatial cross-validation (CV), which accounts for the fact that geographic data is special.
CV determines a model’s ability to generalize to new data by (repeatedly) splitting a dataset into training and test sets. It uses the training data to fit the model, and checks its performance when predicting against the test data. CV helps to detect overfitting, since models that fit the training data too closely (including its noise) will tend to perform poorly on the test data.
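The split, fit and predict cycle described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated, the names `dat`, `fold` and `acc` are hypothetical, the folds are assigned at random (not spatially), and accuracy at a 0.5 threshold is used purely for brevity rather than as a recommended performance measure.

```r
# Simulated data (hypothetical): a binary response driven by one predictor
set.seed(42)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * dat$x))

# Assign each observation to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold,
# and record the share of correctly classified test observations
acc <- vapply(seq_len(k), function(i) {
  train <- dat[fold != i, ]
  test <- dat[fold == i, ]
  fit <- glm(y ~ x, family = binomial(), data = train)
  p <- predict(fit, newdata = test, type = "response")
  mean((p > 0.5) == (test$y == 1))
}, numeric(1))

# Cross-validated accuracy, averaged over the k test folds
mean(acc)
```

Because every observation is used for testing exactly once, the averaged score reflects performance on data the model did not see during fitting.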
Randomly splitting spatial data can lead to training points that are spatial neighbors of test points. Due to spatial autocorrelation, test and training datasets would not be independent in this scenario, with the consequence that CV fails to detect possible overfitting. Spatial CV alleviates this problem and is the central theme of this chapter.
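To make the difference between random and spatial partitioning concrete, the following base R sketch compares how close test points lie to their nearest training point under each scheme. It rests on assumptions that are not part of the text above: the coordinates are simulated, and spatial folds are formed by k-means clustering of the coordinates, which is one possible way to obtain spatially disjoint folds. The names `pts`, `random_fold`, `spatial_fold` and `nearest_train_dist()` are hypothetical.

```r
# Simulated point locations (hypothetical coordinates in a unit square)
set.seed(1)
n <- 300
pts <- data.frame(x = runif(n), y = runif(n))
k <- 5

# Random partitioning: fold membership ignores location
random_fold <- sample(rep(seq_len(k), length.out = n))

# Spatial partitioning (assumption): folds are k-means clusters of the coordinates
spatial_fold <- kmeans(pts, centers = k, nstart = 10)$cluster

# Pairwise distances between all points
d <- as.matrix(dist(pts))

# For each point, the distance to the nearest point in a different fold,
# i.e. to the nearest training point when that point's fold is the test set
nearest_train_dist <- function(fold) {
  vapply(seq_len(n), function(i) min(d[i, fold != fold[i]]), numeric(1))
}

mean(nearest_train_dist(random_fold))   # test points sit next to training points
mean(nearest_train_dist(spatial_fold))  # test points are, on average, farther away
```

With random folds, almost every test point has a training point in its immediate neighborhood; with spatially clustered folds the average separation is larger, which reduces the dependence between training and test data.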